It’s pretty obvious that access to digitised resources, like Trove’s newspapers, has changed the practice of history in Australia. But how? I’m certain that the historiographical implications of the growth and development of online collections will become a topic of increasing interest to historians, and that exploration of this topic will lead to important insights into the relationship between what we keep, what we value, and what we know. But for this to happen we need to have data documenting changes in online collections. What became available when? How was it delivered to users? How did the search indexing work?
In general, GLAM collection interfaces exist in an eternal present – they’re not good at explaining changes, or communicating their own histories. Australian GLAM organisations also share little statistical information. If you’re lucky, you might get something useful out of annual reports, but that’s about it. Trove, in fact, removed all their online collection statistics in the 2020 interface update. Web archives capture individual pages, but not complete systems. If we don’t document the shape and structure of online collections now, how will future historians understand their impact?
A couple of years ago I gave a short talk entitled ‘Living archaeologies of online collections’ for the Digital Preservation Coalition (video & slides).
In the talk, I described some of my piecemeal and inconsistent attempts to capture this history – starting back in 2011 when I first harvested data from Trove about the number of digitised newspaper articles. I’ve been continuing to create and update datasets, and have been trying to improve the way they’re organised and described – making them more FAIR – but I’ve still got a long way to go. At the moment there’s information spread across the GLAM Workbench, GitHub repositories, and Zenodo records, so I thought it might be useful if I pulled everything together into one big list. Hence this post. I often forget about what I’ve done in the past, so it’ll help me keep track of where the gaps are and what’s left to do. And hopefully it’ll encourage others to think about the significance and possibilities of this data, and perhaps share their own datasets.
There’s also a growing list of datasets in the Trove Historical Data community on Zenodo.
I suppose I could also add the Trove Data Guide to this list. It’s not a dataset, but it is a snapshot of the current state of Trove. I’ll continue to update it, and these updates will themselves be saved as versions in GitHub and Zenodo, allowing future researchers to dig back through the layers.
A lot of what I do is focused on the present – building tools and resources that help researchers make use of GLAM collections right now. Those tools and resources will eventually decay as I shuffle off, as collections evolve, and as technologies change. But I’m hoping that these datasets will grow in value over time. I think it was Jason Scott who coined the phrase ‘metadata is a love letter to the future’. I suppose this is my love letter to future historians.
https://github.com/wragge/trove-zone-totals
This repository contains an automated git scraper that uses the Trove API to save data about the contents of Trove’s zones and categories. It runs every week and updates the following data files:
In the web interface, ‘zones’ were replaced by ‘categories’ in 2020. However, categories were not available through the Trove API until the release of version 3 in June 2023. To try and document the differences between zones and categories, totals from both were captured until version 2 of the API was switched off in September 2024.
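As a rough illustration of this sort of harvesting (a sketch based on my reading of the public v3 API, not the scraper’s actual code — you’d need your own API key), fetching a category total looks something like this:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.trove.nla.gov.au/v3/result"

def total_request(category, api_key):
    """Build a request for the number of records in a Trove category.

    Setting n=0 asks for totals only, not individual records.
    """
    params = {"q": "", "category": category, "n": 0, "encoding": "json"}
    return Request(f"{API_BASE}?{urlencode(params)}",
                   headers={"X-API-KEY": api_key})

def get_total(category, api_key):
    """Fetch the current total for a category from the v3 API."""
    with urlopen(total_request(category, api_key)) as resp:
        data = json.load(resp)
    # The total sits under the single category returned in the response.
    return data["category"][0]["records"]["total"]
```

Under version 2 of the API the same request used a `zone` parameter in place of `category`, which is why the scraper could capture both sets of totals while v2 was still running.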
https://github.com/wragge/trove-newspaper-totals-historical/
The files in this dataset were created at irregular intervals between 2011 and 2022 for use in visualising Trove’s newspaper corpus. Harvests from 2011 were screen scraped from the Trove website. Harvests after 2012 make use of the year and state facets from the Trove API. There are 9 versions:
Each version includes two data files:
https://github.com/wragge/trove-newspaper-totals
This repository contains an automated git scraper that uses the Trove API to save information about the number of digitised newspaper articles currently available through Trove. It runs every week and updates four data files:
By retrieving all versions of these files from the commit history, you can analyse changes in Trove over time.
A weekly summary of the harvested data is presented in the Trove Newspapers Data Dashboard.
https://github.com/GLAM-Workbench/trove-newspaper-titles-web-archives
These datasets were created by harvesting information about newspaper titles in Trove from web archives. The harvesting method is documented by Gathering historical data about the addition of newspaper titles to Trove in the GLAM Workbench.
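The linked notebook documents the actual method; as a much-simplified illustration, captures of a single page can be listed through the Internet Archive’s CDX API (the parsing helper assumes the documented JSON response shape, in which the first row is a header):

```python
from urllib.parse import urlencode

CDX_BASE = "http://web.archive.org/cdx/search/cdx"

def cdx_query(url):
    """Build a CDX API query listing all captures of `url` as JSON."""
    params = {"url": url, "output": "json", "fl": "timestamp,original"}
    return f"{CDX_BASE}?{urlencode(params)}"

def parse_captures(rows):
    """Convert the CDX JSON response (first row is the header) into dicts."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]
```

Each capture’s timestamp tells you when the archived page was crawled, so parsing the titles listed in successive captures lets you date their addition to Trove.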
https://github.com/GLAM-Workbench/trove-newspapers-corrections
OCR errors in Trove’s digitised newspapers can be corrected by users. To help understand patterns in this correction activity, this dataset records the number of articles with corrections. The data was extracted from the Trove API using this notebook.
https://github.com/GLAM-Workbench/trove-newspapers-data-post-54/
Due to copyright restrictions, most of the digitised newspaper articles on Trove were published before 1955. However, some articles published after 1954 have been made available. This repository provides data about digitised newspapers in Trove that have articles available from after 1954 (the ‘copyright cliff of death’). The data was extracted from the Trove API using this notebook.
There are 8 versions of this dataset:
https://github.com/GLAM-Workbench/trove-newspapers-non-english
This dataset contains information about newspapers published in languages other than English that have been digitised and made available through Trove. Data about the languages present in newspapers was generated by harvesting a sample of articles from each newspaper using the Trove API, and then using language detection software on the OCRd text of each article. The method is documented in this notebook in the GLAM Workbench.
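As a simplified sketch of the aggregation step (not the notebook’s actual code), you could tally detected languages per newspaper title with a pluggable detector function — in practice the detector would be a package such as langdetect:

```python
from collections import Counter

def newspaper_languages(articles, detect):
    """Tally detected languages per newspaper title.

    `articles` is an iterable of (title, ocr_text) pairs; `detect` is any
    function mapping a text sample to a language code.
    """
    languages = {}
    for title, text in articles:
        lang = detect(text)
        languages.setdefault(title, Counter())[lang] += 1
    return languages
```

Titles whose samples are dominated by a non-English code are the candidates for the non-English newspapers list; noisy OCR means individual detections are unreliable, which is why a sample of articles is tallied per title rather than trusting any single article.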
There are two data files:
There are two versions of this dataset:
This dataset contains information about the published issues of newspapers digitised and made available through Trove. The data was harvested from the Trove API, using this notebook in the GLAM Workbench.
There are two data files in this dataset:
There are 5 versions of this dataset:
https://github.com/GLAM-Workbench/trove-lists-metadata/
Trove users can create collections of resources using Trove’s ‘lists’. This dataset contains metadata describing all public lists, harvested via the Trove API. To reduce file size, the details of the resources collected by each list are not included, just the total number of resources. The data was extracted using this notebook from the Trove lists and tags section of the GLAM Workbench.
There are 4 versions of this dataset:
This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024.
There are 3 versions of this dataset:
This dataset was derived from the full harvest of Trove public tags. It contains a list of unique tags and the total number of resources in Trove each tag is attached to.
There are 3 versions of this dataset:
https://github.com/wragge/trove-contributor-totals
This repository contains an automated git scraper that uses the Trove API to save details of organisations and projects that contribute metadata to Trove. As well as counts of total resources by contributor, this dataset includes counts of resources from each contributor by format and category. The scraper runs every week and updates the following data files:
These data files were generated using version 2 of the Trove API:
To help people find and use digitised resources other than newspapers in Trove, I’ve been harvesting, sharing, and visualising metadata relating to specific formats, such as books and periodicals. The methods I’ve used have changed over time, and there are some earlier versions that I still need to extract from the Git repositories, but these are the current datasets. I’m planning to set up automatic re-harvests for some or all of these, so there’ll be a better record of change over time.
There’s more information about these datasets in both the GLAM Workbench and the Trove Data Guide.
https://github.com/GLAM-Workbench/trove-books-data
There are 2 versions of this dataset:
https://github.com/GLAM-Workbench/trove-periodicals-data
This dataset was created by checking, correcting, and enriching data about digitised periodicals obtained from the Trove API. Additional metadata describing periodical titles and issues was extracted from the Trove website and used to check the API results. Where titles were wrongly described as issues, and vice versa, the records were corrected. Additional descriptive metadata was also added to the records. Separate CSV formatted data files were created for titles and issues. Finally, the titles and issues data was loaded into an SQLite database for use with Datasette.
There are 4 data files in this repository:
There are two versions of this dataset:
https://github.com/GLAM-Workbench/trove-parliamentary-papers-data
This dataset contains metadata describing Commonwealth Parliamentary Papers that have been digitised and are made available through Trove.
There’s one version of this dataset:
https://github.com/GLAM-Workbench/trove-maps-data
This dataset contains metadata describing digitised maps in Trove, harvested from the Trove API and other sources.
There are 2 data files in this dataset:
There are 2 versions of this dataset:
https://github.com/GLAM-Workbench/trove-oral-histories-data
There are 2 data files in this dataset:
There are 2 versions of this dataset:
https://github.com/GLAM-Workbench/trove-images-rights-data/
This dataset includes information about the application of licences and rights statements to images by Trove contributors.
There are 2 data files in this repository:
There are 3 versions of this dataset:
https://github.com/GLAM-Workbench/nla-finding-aids-data/
This repository contains data about the National Library of Australia’s digitised manuscript finding aids, harvested from Trove.
This dataset contains 2 data files:
There is one version of this dataset:
https://github.com/GLAM-Workbench/trove-web-archives-collections
This dataset contains details of the subject and collection groupings used by Pandora to organise archived web resource titles.
There are two data files in this dataset:
There are 2 versions of this dataset:
https://github.com/GLAM-Workbench/trove-ned-periodicals-data
This dataset contains details of periodical titles and issues submitted to Trove through the NLA’s National edeposit scheme.
There are 3 data files in this dataset:
There is one version of this dataset:
The National Archives of Australia (NAA) datasets are all over the place at present, and I need to do a lot of work to get them standardised and organised. These are the main datasets, but there are others I need to add.
https://github.com/wragge/closed_access
https://github.com/GLAM-Workbench/recordsearch
Versions in Figshare:
Versions in GitHub:
https://github.com/GLAM-Workbench/recordsearch
CSV file containing basic descriptive information about all the series currently registered on RecordSearch as well as the total number of items described, digitised, and in each access category.
https://github.com/GLAM-Workbench/recordsearch
https://github.com/wragge/naa-recently-digitised
This dataset contains weekly harvests of newly digitised files in RecordSearch. The automated scraper is currently scheduled to run each Sunday, saving a list of files that have been digitised in the previous week.
There are 177 data files, created between 28 March 2021 and the present.