Version 3 of the Trove API is out, and version 2 is scheduled to be decommissioned in early 2023 – that means I have a lot of code to update! First cab off the rank is the Trove Newspaper & Gazette Harvester, with version 0.7.1 now available.
The Harvester is a Python package that can be used as either a library or a command-line tool. It’s been around in some form for more than 10 years. The latest updates include:
The RO-Crate integration was part of my work for the ARDC’s HASS Community Data Lab. The Harvester was already generating a simple metadata file that captured some of the harvest parameters, but now it documents the context of the harvest in much more detail, and saves it in a standard format based on Linked Open Data.
Every harvest now creates an ro-crate-metadata.json file. This file includes details of the datasets created by the Harvester, such as the results.csv file that includes article metadata, and the text directory that contains the OCRd text. It also captures contextual information about the Harvester itself. The Harvester and the datasets are linked through a CreateAction that describes the harvesting process. The harvester_config.json file that saves the query parameters and Harvester options is also linked as an input to this process. In this way, all the components of the harvest are described and linked.
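To make those links a bit more concrete, here is a minimal sketch of how you might read a crate and list what each CreateAction connects. It assumes the file follows the usual RO-Crate conventions, where a CreateAction points to its software through instrument, its inputs through object, and its outputs through result; I haven't checked that against every Harvester release, so treat the property names as assumptions. The sample_crate below is a hypothetical, heavily trimmed stand-in, not real Harvester output.

```python
import json
from pathlib import Path


def as_list(value):
    """Normalise a JSON-LD value that may be missing, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def has_type(entity, type_name):
    """Check an entity's @type, which can be a string or a list of strings."""
    return type_name in as_list(entity.get("@type"))


def ref_ids(entity, prop):
    """Collect the @id values that a property points to."""
    return [
        ref["@id"]
        for ref in as_list(entity.get(prop))
        if isinstance(ref, dict) and "@id" in ref
    ]


def summarise_create_actions(crate):
    """Return the software, inputs, and outputs linked by each CreateAction.

    Assumes the standard RO-Crate provenance properties:
    instrument (software), object (inputs), result (outputs).
    """
    return [
        {
            "instrument": ref_ids(entity, "instrument"),
            "inputs": ref_ids(entity, "object"),
            "outputs": ref_ids(entity, "result"),
        }
        for entity in crate.get("@graph", [])
        if has_type(entity, "CreateAction")
    ]


def load_crate(harvest_dir):
    """Load ro-crate-metadata.json from a harvest directory."""
    path = Path(harvest_dir) / "ro-crate-metadata.json"
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical, trimmed-down crate for illustration only.
sample_crate = {
    "@graph": [
        {"@id": "#harvester", "@type": "SoftwareApplication"},
        {"@id": "harvester_config.json", "@type": "File"},
        {"@id": "results.csv", "@type": ["File", "Dataset"]},
        {"@id": "text/", "@type": ["File", "Dataset"]},
        {
            "@id": "#harvest",
            "@type": "CreateAction",
            "instrument": {"@id": "#harvester"},
            "object": {"@id": "harvester_config.json"},
            "result": [{"@id": "results.csv"}, {"@id": "text/"}],
        },
    ]
}

summary = summarise_create_actions(sample_crate)
print(json.dumps(summary, indent=2))
```

To try it on a real harvest, swap sample_crate for load_crate() pointed at the harvest's output directory.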
Here’s an example RO-Crate file.
Trove is changing all the time. By capturing information such as the query, the harvester version, the date, and the number of results, the RO-Crate file will help researchers document, manage, and share their research. And now that you can start a new harvest with an existing config file, it’s easy for researchers to re-run a harvest to see what changes over time.
As well as updating the Python package, I’ve also updated the Trove Newspaper & Gazette Harvester section of the GLAM Workbench. Here you’ll find examples of the Harvester in action, as well as some ways of exploring the harvested data. If you’d like to take the Harvester for a spin, the easiest way to start is with the web app version – no software to install, no code to navigate! If you’re an Australian university researcher, you can spin it up on the new ARDC Binder service in seconds.