Tim Sherratt

Sharing recent updates and work-in-progress

May 2024

More tools for harvesting Trove newspaper articles

I’ve just added a couple of new notebooks to the Trove Newspaper & Gazette Harvester section of the GLAM Workbench.

“Using the Trove Harvester as a Python package” provides a basic example of using the trove-newspaper-harvester Python package. While there’s already a simple web app version of the harvester, I wanted a notebook version running in the JupyterLab interface that I could integrate with other tools and notebooks. All you need to do to harvest all the articles in a Trove newspaper search is paste in your Trove API key and the search query URL from the Trove web interface.
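To give a rough sense of what a pasted search URL carries, here is a minimal sketch that splits one into its query parameters using only the Python standard library. It is not the trove-newspaper-harvester package’s own code; the function name `search_url_to_params` and the example URL (including its parameter names) are illustrative assumptions.

```python
from urllib.parse import urlparse, parse_qs


def search_url_to_params(search_url: str) -> dict:
    """Split a search URL into a dictionary of its query parameters.

    Hypothetical helper for illustration only; the harvester package
    does its own, more thorough, conversion of web URLs.
    """
    parsed = urlparse(search_url)
    raw_params = parse_qs(parsed.query)
    # parse_qs returns a list for every key; unwrap single values
    # so that {"keyword": ["wragge"]} becomes {"keyword": "wragge"}.
    return {
        key: values[0] if len(values) == 1 else values
        for key, values in raw_params.items()
    }


# Illustrative URL; the parameter names are assumptions, not a
# statement of what the Trove web interface always produces.
example_url = (
    "https://trove.nla.gov.au/search/category/newspapers"
    "?keyword=wragge&l-decade=190"
)
params = search_url_to_params(example_url)
print(params)
```

The point is just that the URL you copy from your browser already encodes the whole search, which is why the URL and an API key are all the notebook asks for.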

Screenshot from the “Reshaping your newspaper harvest” notebook describing the Harvest Slicer

“Reshaping your newspaper harvest” provides a slice and dice wonder tool for Trove newspaper harvests, enabling you to repackage OCRd text by decade, year, and newspaper title. It saves the results as zip files, concatenated text files, or CSV files with embedded text. These repackaged slices should suit a variety of text analysis tools and questions. I’ve been thinking about doing something like this for a while, and think it should be quite useful.
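The basic idea of slicing can be sketched in a few lines of standard-library Python. This is not the Harvest Slicer’s actual code, just a minimal illustration that assumes each harvested row has `date` (starting with a four-digit year), `newspaper_title`, and `text` columns; the function names and the sample data are made up for the example.

```python
import csv
import io
from collections import defaultdict


def slice_key(row: dict, unit: str) -> str:
    """Return the label of the slice a row belongs to (hypothetical helper)."""
    year = row["date"][:4]
    if unit == "year":
        return year
    if unit == "decade":
        return year[:3] + "0s"
    if unit == "title":
        return row["newspaper_title"]
    raise ValueError(f"Unknown slice unit: {unit}")


def slice_harvest(rows: list, unit: str = "decade") -> dict:
    """Group article text by slice and join each group into one string."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_key(row, unit)].append(row["text"])
    return {key: "\n\n".join(texts) for key, texts in slices.items()}


# Made-up sample data standing in for a harvest's CSV results.
sample_csv = """date,newspaper_title,text
1901-03-14,The Argus,First article text.
1905-07-02,The Age,Second article text.
1912-11-30,The Argus,Third article text.
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))

by_decade = slice_harvest(rows, "decade")
by_title = slice_harvest(rows, "title")
print(sorted(by_decade))
print(sorted(by_title))
```

Each value in the resulting dictionaries is a single concatenated string, which could then be written out as one text file per decade, year, or newspaper title.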

In my usual way, I started off writing a tutorial for the Trove Data Guide on ways of loading digitised newspaper data into text analysis tools and then realised I needed these notebooks to fill in some gaps in the data processing pipeline. So after a day or two of yak shaving I now have to get back to the tutorial.