Tim Sherratt

Sharing recent updates and work-in-progress

Aug 2023

Some important updates for the Trove Newspaper & Gazette Harvester

Version 3 of the Trove API is out, and version 2 is scheduled to be decommissioned in early 2024 – that means I have a lot of code to update! First cab off the rank is the Trove Newspaper & Gazette Harvester, with version 0.7.1 now available.

Screenshot of the Trove Newspaper and Gazette Harvester documentation page.

The Harvester is a Python package that can be used as either a library or a command-line tool. It’s been around in some form for more than 10 years. The latest updates include:

  • support for version 3 of the Trove API
  • automatic creation of a metadata file describing each harvest according to the RO-Crate format
  • automatic creation of a harvester config file, capturing the query parameters sent to Trove as well as the Harvester options
  • the ability to initiate a harvest from an existing config file
  • more memory-friendly generation of CSV result files (no loading everything into Pandas)
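
The last point in that list is a general pattern, and a rough sketch might make it clearer. This is not the Harvester's actual code: the function name, the newline-delimited JSON input, and the field names are all assumptions made up for illustration. The idea is simply to write each record to the CSV as it is read, so memory use stays flat however large the harvest is.

```python
import csv
import json


def ndjson_to_csv(ndjson_path, csv_path, fieldnames):
    """Stream newline-delimited JSON records into a CSV file, one row at a time.

    Hypothetical helper for illustration only; not part of the Harvester's API.
    Returns the number of rows written.
    """
    rows_written = 0
    with open(ndjson_path, encoding="utf-8") as source, open(
        csv_path, "w", newline="", encoding="utf-8"
    ) as target:
        # extrasaction="ignore" drops any keys that are not listed in fieldnames;
        # keys missing from a record are written as empty strings.
        writer = csv.DictWriter(target, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for line in source:
            if not line.strip():
                continue  # skip blank lines
            writer.writerow(json.loads(line))
            rows_written += 1
    return rows_written
```

Only one record is held in memory at a time, which is the difference from reading everything into a Pandas DataFrame before saving.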

The RO-Crate integration was part of my work for the ARDC’s HASS Community Data Lab. The Harvester was already generating a simple metadata file that captured some of the harvest parameters, but now it documents the context of the harvest in much more detail, and saves it in a standard format based on Linked Open Data.

Every harvest now creates an ro-crate-metadata.json file. This file includes details of the datasets created by the Harvester, such as the results.csv file that includes article metadata, and the text directory that contains the OCRd text. It also captures contextual information about the Harvester itself. The Harvester and the datasets are linked through a CreateAction that describes the harvesting process. The harvester_config.json file that saves the query parameters and Harvester options is also linked as an input to this process. In this way, all the components of the harvest are described and linked.
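
Because the RO-Crate file is plain JSON-LD, these links can be followed with nothing more than the standard library. The sketch below assumes the crate follows the usual RO-Crate conventions: entities live in an `@graph` list, and a `CreateAction` points to its inputs via `object`, its outputs via `result`, and the software used via `instrument`. The function name is hypothetical, and the exact properties in a real Harvester crate may differ, so check a file from your own harvest.

```python
import json


def summarise_create_action(crate_path):
    """Return the ids linked from the first CreateAction in an RO-Crate file.

    Hypothetical helper for illustration; assumes standard RO-Crate structure.
    """
    with open(crate_path, encoding="utf-8") as f:
        graph = json.load(f)["@graph"]

    def has_type(entity, type_name):
        # "@type" can be a single string or a list of strings
        types = entity.get("@type", [])
        return type_name in ([types] if isinstance(types, str) else types)

    def ids(value):
        # References are {"@id": ...} objects, either alone or in a list
        if value is None:
            return []
        refs = value if isinstance(value, list) else [value]
        return [ref["@id"] for ref in refs]

    action = next((e for e in graph if has_type(e, "CreateAction")), None)
    if action is None:
        raise ValueError("No CreateAction found in crate")
    return {
        "instrument": ids(action.get("instrument")),
        "inputs": ids(action.get("object")),
        "outputs": ids(action.get("result")),
    }
```

Calling `summarise_create_action("ro-crate-metadata.json")` inside a harvest directory should, under those assumptions, list the config file among the inputs and the harvested datasets among the outputs.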

Here’s an example RO-Crate file.

Trove is changing all the time. By capturing information such as the query, the harvester version, the date, and the number of results, the RO-Crate file will help researchers document, manage, and share their research. And now that you can start a new harvest with an existing config file, it’s easy for researchers to re-run a harvest to see what changes over time.
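
Once you have two harvests of the same query, a simple way to see what has changed is to compare the article identifiers in each results.csv file. Here's a minimal sketch, assuming the identifier column is named `article_id` (check the header row of your own file). The function names and the file paths in the usage note are hypothetical.

```python
import csv


def read_ids(csv_path, id_column="article_id"):
    """Collect the set of identifiers from one column of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[id_column] for row in csv.DictReader(f)}


def compare_harvests(old_csv, new_csv, id_column="article_id"):
    """Report identifiers that appear in only one of two harvest result files."""
    old_ids = read_ids(old_csv, id_column)
    new_ids = read_ids(new_csv, id_column)
    return {
        "added": sorted(new_ids - old_ids),  # in the new harvest only
        "removed": sorted(old_ids - new_ids),  # in the old harvest only
    }
```

For example, `compare_harvests("harvest-old/results.csv", "harvest-new/results.csv")` would return the identifiers of articles that have appeared or disappeared between the two runs.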

As well as updating the Python package, I’ve also updated the Trove Newspaper & Gazette Harvester section of the GLAM Workbench. Here you’ll find examples of the Harvester in action, as well as some ways of exploring the harvested data. If you’d like to take the Harvester for a spin, the easiest way to start is with the web app version – no software to install, no code to navigate! If you’re an Australian university researcher you can spin it up on the new ARDC Binder service in seconds.