06 Aug 2021

Updated! Lots and lots of text freshly harvested from Trove periodicals

For a few years now I’ve been harvesting downloadable text from digitised periodicals in Trove and making it easily available for exploration and research. I’ve just completed the latest harvest – here’s the summary:

1,163 digitised periodicals had text available for download
Text was downloaded from 51,928 individual issues
Adding up to a total of around 12 GB of text

If you want to dive straight in, here’s a list of all the harvested periodicals, with links to download a summary of available issues, as well as all the harvested text (there’s one file per issue). You’ll notice that the list includes a large number of parliamentary papers and government reports as well as published journals.

List of Trove periodicals with downloadable text

All of the harvested text is available from a public folder on CloudStor.
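If you've downloaded some of the harvested text, a quick tally is a handy first check. This is just a sketch: it assumes the files are saved in a local folder with one `.txt` file per issue, and it doesn't talk to CloudStor or Trove at all. The function name is made up for the example.

```python
from pathlib import Path


def tally_text_files(folder):
    """Count the .txt files under `folder` and add up their sizes in bytes."""
    # Look in subfolders too, on the assumption that each title has its own folder
    files = [path for path in Path(folder).rglob("*.txt") if path.is_file()]
    total_bytes = sum(path.stat().st_size for path in files)
    return len(files), total_bytes
```

Calling `tally_text_files("trove-text")` (where `trove-text` is whatever you called your local folder) returns the number of text files and their combined size in bytes.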

The harvesting process involves a few different steps:

First I generate a list of periodicals available in digital form from Trove. This includes digitised titles, as well as born-digital titles submitted through e-Legal Deposit. This time the result was a CSV file containing the details of 7,270 titles. See this notebook for details.
Then I work through this list of titles to find out how many issues of each title are available through Trove. This information isn’t accessible through the API, so I have to do some screen scraping.
Next I work through the list of issues and try to download the text contents. Most of the born-digital titles don’t have downloadable text.
Once I’ve downloaded all the text I can from a title, I create a CSV file for it that lists the available issues and notes whether text is available for each. This file is stored with the text on CloudStor.
Once I’ve checked all the titles, I generate another CSV file that lists the details of all the periodicals that have downloadable text.
The code to harvest and document the downloaded text is available in this notebook. #dhhacks
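As a rough illustration of the last two steps, here's a minimal sketch in Python. It isn't the code from the notebook: the record keys (`issue_id`, `label`, `text`), the file names, and the CSV columns are all made up for the example, and the issue records are supplied by hand rather than scraped or downloaded from Trove.

```python
import csv
from pathlib import Path


def document_title(title_id, issues, output_dir):
    """Save the text for one title and write a CSV listing its issues.

    `issues` is a list of dicts with the hypothetical keys `issue_id`,
    `label`, and `text` (None when no text could be downloaded).
    Returns the number of issues that had text.
    """
    title_dir = Path(output_dir) / title_id
    title_dir.mkdir(parents=True, exist_ok=True)
    with_text = 0
    csv_path = title_dir / f"{title_id}-issues.csv"
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["issue_id", "label", "text_file"])
        writer.writeheader()
        for issue in issues:
            text_file = ""
            if issue["text"]:
                # One text file per issue, saved alongside the CSV
                text_file = f"{issue['issue_id']}.txt"
                (title_dir / text_file).write_text(issue["text"], encoding="utf-8")
                with_text += 1
            # An empty text_file value means no text was available for this issue
            writer.writerow(
                {"issue_id": issue["issue_id"], "label": issue["label"], "text_file": text_file}
            )
    return with_text


def document_harvest(titles, output_dir):
    """Write a summary CSV of the titles that have at least one issue with text."""
    rows = []
    for title_id, issues in titles.items():
        with_text = document_title(title_id, issues, output_dir)
        if with_text:
            rows.append(
                {"title_id": title_id, "issues": len(issues), "issues_with_text": with_text}
            )
    summary_path = Path(output_dir) / "periodicals-with-text.csv"
    with open(summary_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["title_id", "issues", "issues_with_text"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Here `titles` is a dictionary that maps a title identifier to its list of issue records. Titles where no issue has text still get their own issues CSV, but they're left out of the summary file.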

Tim Sherratt
