Tim Sherratt

Sharing recent updates and work-in-progress

Sep 2022

Do you want your Trove newspaper articles in bulk? Meet the new Trove Newspaper Harvester Python package!

The Trove Newspaper Harvester has been around in different forms for more than a decade. It helps you download all the articles in a Trove newspaper search, opening up new possibilities for large-scale analysis. You can use it as a command-line tool by installing a Python package, or through the Trove Newspaper Harvester section of the GLAM Workbench.

I’ve just overhauled development of the Python package. The new trove-newspaper-harvester replaces the old troveharvester repository. The command-line interface remains the same (with a few new options), so it’s really a drop-in replacement. Read the full documentation of the new package for more details.

Screenshot of the trove-newspaper-harvester documentation describing its use as a Python library.

Here’s a summary of the changes:

  • the package can now be used as a library (that you incorporate into your own code) as well as a standalone command-line tool – this means you can embed the harvester in your own tools or workflows
  • both the library and the CLI now let you set the names of the directories in which your harvests will be saved – this makes it easier to organise your harvests into groups and give them meaningful names
  • the harvesting process now saves results into a newline-delimited JSON file (one JSON object per line) – the library has a save_csv() option to convert this to a CSV file, while the CLI automatically converts the results to CSV to maintain compatibility with previous versions
  • behind the scenes, the package is now developed and maintained using nbdev – this means the code and documentation are all generated from a set of Jupyter notebooks
  • the Jupyter notebooks include a variety of automated tests, which should make maintenance and development much easier in the future
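
To give a sense of what the newline-delimited JSON format means in practice, here’s a minimal sketch of converting it to CSV using only the Python standard library. This is not the package’s own save_csv() code. The ndjson_to_csv() function and the field names in the sample data are made up for illustration, and a real harvest will have different fields.

```python
import csv
import io
import json


def ndjson_to_csv(ndjson_text: str) -> str:
    """Convert newline-delimited JSON (one object per line) to CSV text.

    Hypothetical helper for illustration only; not part of
    trove-newspaper-harvester.
    """
    # Each non-empty line is a complete JSON object.
    records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

    # Collect column names in first-seen order, since records may not
    # all have the same keys.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    # Keys missing from a record are written as empty cells.
    writer.writerows(records)
    return output.getvalue()


# Made-up sample data: two records, the second with an extra field.
sample = (
    '{"article_id": "1", "title": "Floods"}\n'
    '{"article_id": "2", "title": "Drought", "page": "3"}\n'
)
csv_text = ndjson_to_csv(sample)
print(csv_text)
```

Because each line is a self-contained JSON object, results can be appended to the file one record at a time and read back line by line, without parsing the whole file at once.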

I’ve also updated the Trove Newspaper Harvester section of the GLAM Workbench to use the new package. The new core library will make it easier to develop more complex harvesting examples – for example, searching for articles from a specific day across a range of years. If you find any problems, or want to suggest improvements, please raise an issue.
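
As a rough illustration of the “specific day across a range of years” idea, the first step is just generating the list of dates to search. Here’s a small sketch using the standard library. The same_day_across_years() function is hypothetical and not part of the package, and how each date is turned into a Trove search and passed to the harvester is deliberately left out.

```python
from datetime import date


def same_day_across_years(month: int, day: int, start_year: int, end_year: int) -> list[date]:
    """Return the given month/day for every year in the range (inclusive).

    Hypothetical helper for illustration only. Years in which the date
    doesn't exist (such as 29 February in a non-leap year) are skipped.
    """
    dates = []
    for year in range(start_year, end_year + 1):
        try:
            dates.append(date(year, month, day))
        except ValueError:
            # The date doesn't exist in this year, so skip it.
            continue
    return dates


# Example: 25 April in each year from 1916 to 1918.
for d in same_day_across_years(4, 25, 1916, 1918):
    print(d.isoformat())
```

Each date could then drive a separate harvest, with the results saved into its own named directory.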