Tim Sherratt

Sharing recent updates and work-in-progress

Aug 2021

Updated! Lots and lots of text freshly harvested from Trove periodicals

For a few years now I’ve been harvesting downloadable text from digitised periodicals in Trove and making it easily available for exploration and research. I’ve just completed the latest harvest – here’s the summary:

  • 1,163 digitised periodicals had text available for download
  • Text was downloaded from 51,928 individual issues
  • Adding up to a total of around 12 GB of text

If you want to dive straight in, here’s a list of all the harvested periodicals, with links to download a summary of available issues, as well as all the harvested text (there’s one file per issue). You’ll notice that the list includes a large number of parliamentary papers and government reports as well as published journals.

List of Trove periodicals with downloadable text

All of the harvested text is available from a public folder on CloudStor.

The harvesting process involves a few different steps:

  • First I generate a list of periodicals available in digital form from Trove. This includes digitised titles, as well as born-digital titles submitted through e-Legal Deposit. This time around, the result was a CSV file containing the details of 7,270 titles. See this notebook for details.
  • Then I work through this list of titles to find out how many issues of each title are available through Trove. This information isn’t accessible through the API, so I have to do some screen scraping.
  • Next I work through the list of issues and try to download the text contents. Most of the born-digital titles don’t have downloadable text.
  • Once I’ve downloaded all the text I can from a title, I create a CSV file for it that lists the available issues and notes whether text is available for each. This file is stored with the text on CloudStor.
  • Once I’ve checked all the titles, I generate another CSV file that lists the details of all the periodicals that have downloadable text.
  • The code to harvest and document the downloaded text is available in this notebook. #dhhacks
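
To give a rough idea of the per-title bookkeeping described above, here's a minimal sketch in Python. It isn't the code from the notebook, and it doesn't touch Trove at all: it assumes the issues have already been found and their text (if any) downloaded. The function name `save_title`, the dictionary keys, the CSV column names and the folder layout are all made up for illustration.

```python
import csv
from pathlib import Path


def save_title(title_id, issues, output_dir):
    """Save one text file per issue, plus a CSV summary, for a single title.

    `issues` is a list of dicts with hypothetical keys 'issue_id', 'date'
    and 'text' ('text' is None or empty when nothing could be downloaded).
    Returns the number of issues that had text.
    """
    title_dir = Path(output_dir) / title_id
    texts_dir = title_dir / "texts"
    texts_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for issue in issues:
        text = issue.get("text")
        text_file = ""
        if text:
            # One file per issue, named after the (assumed) issue identifier
            text_file = f"{issue['issue_id']}.txt"
            (texts_dir / text_file).write_text(text, encoding="utf-8")
        rows.append(
            {
                "issue_id": issue["issue_id"],
                "date": issue.get("date", ""),
                "has_text": bool(text),
                "text_file": text_file,
            }
        )

    # The summary lists every issue, whether or not text was available
    csv_path = title_dir / f"{title_id}-issues.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(
            csv_file, fieldnames=["issue_id", "date", "has_text", "text_file"]
        )
        writer.writeheader()
        writer.writerows(rows)

    return sum(1 for row in rows if row["has_text"])
```

Calling something like `save_title("example-title", issues, "harvest")` for each title, and keeping only the titles where it returns more than zero, would be one way to build the final list of periodicals with downloadable text.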