05 Sep 2022

Fresh harvest of OCRd text from Trove's digitised periodicals – 9gb of text to explore and analyse!

I’ve updated the GLAM Workbench’s harvest of OCRd text from Trove’s digitised periodicals. This is a completely fresh harvest, so should include any corrections made in recent months. It includes:

1,430 periodicals
OCRd text from 41,645 issues
About 9GB of text

The easiest way to explore the harvest is probably this human-readable list. The list of periodicals with OCRd text is also available as a CSV. You can find more details in the Trove journals section of the GLAM Workbench, and download the complete corpus from CloudStor.

Finding which periodical issues in Trove have OCRd text you can download is not as easy as it should be. The fullTextInd index doesn’t seem to distinguish between digitised works (with OCR) and born-digital publications (like PDFs) without downloadable text. You can use has:correctabletext to find articles with OCR, but you can’t get a full list of the periodicals the articles come from using the title facet. As this notebook explains, you can search for nla.obj, but this returns both digitised works and publications supplied through edeposit. In previous harvests of OCRd text I processed all of the titles returned by the nla.obj search, finding out whether there was any OCRd text by just requesting it and seeing what came back. But the number of non-digitised works on the list of periodicals in digital form has skyrocketed through the edeposit scheme and this approach is no longer practical. It just means you waste a lot of time asking for things that don’t exist.

For the latest harvest I took a different approach. I only processed periodicals in digital form that weren’t identified as coming through edeposit. These are the publications with a fulltext_url_type value of either ‘digitised’ or ‘other’ in my dataset of digital periodicals. Is it possible that there’s some downloadable text in edeposit works that’s now missing from the harvest? Yep, but I think this is a much more sensible, straightforward, and reproducible approach.
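As a rough illustration of that selection step, here is a minimal sketch in Python. Only the fulltext_url_type column and the ‘digitised’ and ‘other’ values come from the description above. The sample rows, the title and trove_id column names, the ‘edeposit’ value, and the select_harvestable function are all made up for the example, and the actual harvest code in the notebook may well look different.

```python
import csv
import io

# Hypothetical sample of the digital periodicals dataset. Only the
# fulltext_url_type column name is taken from the post; the other columns,
# the identifiers, and the 'edeposit' value are invented for illustration.
SAMPLE_CSV = """title,trove_id,fulltext_url_type
Example Journal A,nla.obj-1,digitised
Example Newsletter B,nla.obj-2,edeposit
Example Gazette C,nla.obj-3,other
"""

# The two values the post says were kept for harvesting.
KEEP_TYPES = {"digitised", "other"}


def select_harvestable(rows):
    """Return only the rows whose fulltext_url_type is in KEEP_TYPES."""
    return [row for row in rows if row["fulltext_url_type"] in KEEP_TYPES]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
harvestable = select_harvestable(rows)
print([row["title"] for row in harvestable])
```

With the sample data, the edeposit row is dropped and the other two titles are kept.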

That’s not the only problem. As I noted when creating the list of periodicals in digital form, there are duplicates in the list, so they have to be removed. You then have to find information about the issues available for each title. This is not provided by the Trove API, but there is an internal API used in the web interface that you can access – see this notebook for details. I also noticed that sometimes where there’s a single issue of a title, it’s presented as if each page is an issue. I think I’ve found a workaround for that as well.
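The duplicate removal could be sketched along these lines. This assumes that duplicates share an identifier, which the post doesn't actually spell out. The trove_id field, the sample records, and the drop_duplicates function are hypothetical, so treat this as one possible approach rather than the method used in the harvest.

```python
def drop_duplicates(records, key="trove_id"):
    """Keep the first record seen for each value of `key`, preserving order.

    Assumes duplicates can be recognised by a shared identifier; the field
    name 'trove_id' is a placeholder, not a documented column.
    """
    seen = set()
    unique = []
    for record in records:
        identifier = record[key]
        if identifier not in seen:
            seen.add(identifier)
            unique.append(record)
    return unique


# Invented sample records with one repeated identifier.
periodicals = [
    {"trove_id": "nla.obj-1", "title": "Example Journal A"},
    {"trove_id": "nla.obj-3", "title": "Example Gazette C"},
    {"trove_id": "nla.obj-1", "title": "Example Journal A"},
]

unique_periodicals = drop_duplicates(periodicals)
print(len(periodicals), "records,", len(unique_periodicals), "unique")
```

Keeping the first occurrence and preserving order means the deduplicated list can still be compared line by line with the original.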

All these doubts, inconsistencies and workarounds mean that I’m fairly certain I don’t have everything. But I do think I have most of the OCRd text available from digitised periodicals, and I do have a methodology, documented in this notebook, that at least provides a starting point for further investigation. As I noted in my wishlist for a Trove Researcher Platform, it would be great if more metadata for digitised works, other than newspapers, was made available through the API.

Tim Sherratt
