Tim Sherratt

Sharing recent updates and work-in-progress

Jul 2022

Catching up – some recent GLAM Workbench updates!

There have been lots of small updates to the GLAM Workbench over the last couple of months and I’ve fallen behind in sharing details. So here’s an omnibus list of everything I can remember…

Data

  • Weekly harvests of basic Trove newspaper data continue, and there’s now about three months’ worth. You can view a summary of the harvested data through the brand new Trove Newspaper Data Dashboard. The Dashboard is generated from a Jupyter notebook and is updated whenever there’s a new data harvest.
  • There are also weekly harvests of files digitised by the NAA, with 16 months’ worth of data now available.
  • Updated harvest of Trove public tags (Zenodo) – includes 2,201,090 unique public tags added to 9,370,614 resources in Trove between August 2008 and July 2022.
  • I’ve started moving other pre-harvested datasets out of the GLAM Workbench code repositories, into their own data repositories. This means better versioning and citability. The first example is the list of Trove newspapers with articles post the 1955 copyright cliff of death – here’s the GH repo, and the Zenodo record.
  • To bring together datasets that provide historical data about Trove itself, I’ve created a Trove historical data community on Zenodo. Anyone’s welcome to contribute. There’s much more to come.
Tag cloud showing the frequency of the two hundred most commonly-used tags in Trove.

Tag cloud generated from the latest harvest of Trove Tags
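
If you want to get a feel for how a "most commonly-used tags" list like the one behind the tag cloud can be derived, here's a minimal sketch using only the Python standard library. It isn't the code used to build the image above. The column name `tag`, the tiny sample data, and the choice to fold case and whitespace before counting are all assumptions for illustration, so check the actual harvest file before adapting it.

```python
import csv
import io
from collections import Counter


def top_tags(csv_file, n=200, tag_column="tag"):
    """Return the n most frequent tags as (tag, count) pairs.

    `tag_column` is an assumed column name, not taken from the real dataset.
    """
    reader = csv.DictReader(csv_file)
    # Fold case and whitespace so "Poetry" and "poetry " are counted together
    # (an illustrative choice, not necessarily how Trove treats tags).
    counts = Counter(
        row[tag_column].strip().lower() for row in reader if row[tag_column].strip()
    )
    return counts.most_common(n)


# Tiny made-up sample standing in for the real harvest file.
sample = io.StringIO(
    "tag,record_id\n"
    "poetry,1\n"
    "Poetry,2\n"
    "shipwrecks,3\n"
    "poetry,4\n"
    "shipwrecks,5\n"
    "cricket,6\n"
)

top = top_tags(sample, n=2)
print(top)
```

With a real file you'd pass an open file handle instead of the `io.StringIO` sample, and leave `n` at 200 to match the tag cloud.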

Code

  • Big thanks to Mitchell Harrop who contributed a new Heritage Council of Victoria section to the GLAM Workbench providing examples using the Victorian Heritage Database API.
  • The troveharvester Python package has been updated, mainly to remove annoying Pandas warnings and to make use of the trove-query-parser package.
  • As a result of the above, the Trove Newspaper & Gazette Harvester section of the GLAM Workbench has been updated. No major changes to notebooks, but I’ve implemented basic testing and linting to improve code quality.
  • The Trove newspapers section of the GW has been updated. There were a few bug fixes and minor improvements. In particular there was a problem downloading data and HTML files from QueryPic, and some date queries in QueryPic were returning no results.
  • The tool to download complete, high-res newspaper page images has been updated so that you no longer need to supply an API key. I also fixed a problem with displaying the images in Voila.
  • The recordsearch_data_scraper Python package has been updated. This fixes a bug where agency and series searches with only one result weren’t being captured properly.
  • The RecordSearch section of the GW has been updated. This incorporates the above update, and I also took the opportunity to update all packages and implement basic testing and linting. The Harvest items from a search in RecordSearch notebook has been simplified and reorganised. There are two new notebooks: Exploring harvested series data, 2022 – generates some basic statistics from the harvest of series data in 2022 and compares the results to the previous year; Summary of records digitised in the previous week – run this notebook to analyse the most recent dataset of recently digitised files, summarising the results by series.
  • A new Zotero translator for Libraries Tasmania has been developed.
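
To give a rough idea of what "summarising the results by series" can look like, here's a small standard-library sketch. It is not the code from the notebook itself. The column names `series`, `control_symbol` and `pages`, and the sample rows, are hypothetical, so adjust them to whatever the weekly harvest files actually contain.

```python
import csv
import io
from collections import defaultdict


def summarise_by_series(csv_file):
    """Return {series: {"files": int, "pages": int}}, most files first.

    The `series` and `pages` column names are assumptions for this sketch.
    """
    totals = defaultdict(lambda: {"files": 0, "pages": 0})
    for row in csv.DictReader(csv_file):
        entry = totals[row["series"]]
        entry["files"] += 1
        # Treat a blank page count as zero rather than failing.
        entry["pages"] += int(row["pages"] or 0)
    # Sort so the series with the most digitised files comes first.
    return dict(sorted(totals.items(), key=lambda kv: kv[1]["files"], reverse=True))


# Made-up rows standing in for one week's harvest of digitised files.
sample = io.StringIO(
    "series,control_symbol,pages\n"
    "A1,1901/1,12\n"
    "B2455,SMITH J,40\n"
    "A1,1901/2,8\n"
    "B2455,JONES A,\n"
    "B2455,BROWN T,25\n"
)

summary = summarise_by_series(sample)
for series, stats in summary.items():
    print(series, stats["files"], stats["pages"])
```

Sorting by file count puts the busiest series at the top, which is usually what you want to see first in a weekly summary.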