Preserving the history of online collections (my love letter to future historians)
It’s pretty obvious that access to digitised resources, like Trove’s newspapers, has changed the practice of history in Australia. But how? I’m certain that the historiographical implications of the growth and development of online collections will become a topic of increasing interest to historians, and that exploration of this topic will lead to important insights into the relationship between what we keep, what we value, and what we know. But for this to happen we need to have data documenting changes in online collections. What became available when? How was it delivered to users? How did the search indexing work?
In general, GLAM collection interfaces exist in an eternal present – they’re not good at explaining changes, or communicating their own histories. Australian GLAM organisations also share little statistical information. If you’re lucky, you might get something useful out of annual reports, but that’s about it. Trove, in fact, removed all their online collection statistics in the 2020 interface update. Web archives capture individual pages, but not complete systems. If we don’t document the shape and structure of online collections now, how will future historians understand their impact?
A couple of years ago I gave a short talk entitled ‘Living archaeologies of online collections’ for the Digital Preservation Coalition (video & slides).
In the talk, I described some of my piecemeal and inconsistent attempts to capture this history – starting back in 2011 when I first harvested data from Trove about the number of digitised newspaper articles. I’ve been continuing to create and update datasets, and have been trying to improve the way they’re organised and described – making them more FAIR – but I’ve still got a long way to go. At the moment there’s information spread across the GLAM Workbench, GitHub repositories, and Zenodo records, so I thought it might be useful if I pulled everything together into one big list. Hence this post. I often forget about what I’ve done in the past, so it’ll help me keep track of where the gaps are and what’s left to do. And hopefully it’ll encourage others to think about the significance and possibilities of this data, and perhaps share their own datasets.
There’s also a growing list of datasets in the Trove Historical Data community on Zenodo.
I suppose I could also add the Trove Data Guide to this list. It’s not a dataset, but it is a snapshot of the current state of Trove. I’ll continue to update it, and these updates will themselves be saved as versions in GitHub and Zenodo, allowing future researchers to dig back through the layers.
A lot of what I do is focused on the present – building tools and resources that help researchers make use of GLAM collections right now. Those tools and resources will eventually decay as I shuffle off, as collections evolve, and as technologies change. But I’m hoping that these datasets will grow in value over time. I think it was Jason Scott who coined the phrase ‘metadata is a love letter to the future’. I suppose this is my love letter to future historians.
1. Trove zones, categories, and formats
trove-zone-totals
https://github.com/wragge/trove-zone-totals
This repository contains an automated git scraper that uses the Trove API to save data about the contents of Trove’s zones and categories. It runs every week and updates the following data files:
- Total number of resources by Trove category, weekly updates, 13 June 2023 to present
- Total number of resources by Trove category and format, weekly updates, 13 June 2023 to present
In the web interface, ‘zones’ were replaced by ‘categories’ in 2020. However, categories were not available through the Trove API until the release of version 3 in June 2023. To try and document the differences between zones and categories, totals from both were captured until version 2 of the API was switched off in September 2024.
- Total number of resources by Trove zone, weekly updates, 9 March 2023 to 1 September 2024
- Total number of resources by Trove zone and format, weekly updates, 9 March 2023 to 1 September 2024
2. Trove newspapers
trove-newspaper-totals-historical
https://github.com/wragge/trove-newspaper-totals-historical/
The files in this dataset were created at irregular intervals between 2011 and 2022 for use in visualising Trove’s newspaper corpus. Harvests from 2011 were screen scraped from the Trove website. Harvests after 2012 make use of the year and state facets from the Trove API. There are 9 versions:
- 12 April 2011
- 4 August 2011
- 12 September 2014
- 29 November 2015
- 14 December 2016
- 28 July 2019
- 10 July 2020
- 27 April 2021
- 21 January 2022
Each version includes two data files:
- Total number of newspaper articles by year
- Total number of newspaper articles by year and state
trove-newspaper-totals
https://github.com/wragge/trove-newspaper-totals
This repository contains an automated git scraper that uses the Trove API to save information about the number of digitised newspaper articles currently available through Trove. It runs every week and updates four data files:
- Total number of newspaper articles by year, weekly updates, 19 April 2022 to present
- Total number of newspaper articles by year and state, weekly updates, 19 April 2022 to present
- Total number of articles by newspaper, weekly updates, 20 April 2022 to present
- Total number of newspaper articles by category, weekly updates, 20 April 2022 to present
By retrieving all versions of these files from the commit history, you can analyse changes in Trove over time.
A weekly summary of the harvested data is presented in the Trove Newspapers Data Dashboard.
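As a rough sketch of what retrieving all versions from the commit history might involve, the example below walks a repository's history with plain `git` commands. It assumes you have a local clone of the repository and that `git` is installed. The file path in the commented example is a hypothetical placeholder, not the name of an actual data file in the repository.

```python
import subprocess


def parse_log(log_output):
    """Split `git log --format='%H %cI'` output into (commit, date) pairs."""
    return [
        tuple(line.split(" ", 1))
        for line in log_output.splitlines()
        if line.strip()
    ]


def run_git(repo_dir, *args):
    """Run a git command inside repo_dir and return the completed process."""
    return subprocess.run(
        ["git", *args], cwd=repo_dir, capture_output=True, text=True
    )


def file_versions(repo_dir, file_path):
    """Yield (commit, date, contents) for every commit that touched file_path.

    file_path must be relative to the root of the repository.
    Versions are yielded newest first, following git log's default order.
    """
    log = run_git(repo_dir, "log", "--format=%H %cI", "--", file_path)
    log.check_returncode()
    for commit, date in parse_log(log.stdout):
        shown = run_git(repo_dir, "show", f"{commit}:{file_path}")
        # Skip commits where the file is absent (for example, a deletion).
        if shown.returncode == 0:
            yield commit, date, shown.stdout


# Example use (hypothetical path; substitute a real data file from the clone):
# for commit, date, contents in file_versions("trove-newspaper-totals", "data/totals.csv"):
#     print(date, commit[:7], len(contents.splitlines()), "lines")
```

Each yielded version could then be parsed as CSV and compared with its neighbours to see how the totals changed from week to week.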
trove-newspaper-titles-web-archives
https://github.com/GLAM-Workbench/trove-newspaper-titles-web-archives
These datasets were created by harvesting information about newspaper titles in Trove from web archives. The harvesting method is documented in ‘Gathering historical data about the addition of newspaper titles to Trove’ in the GLAM Workbench.
- All web archive captures of Trove newspaper titles, 2009 to 2021
- First appearance of each newspaper title in web archive captures, 2009 to 2021
- Alphabetical list of newspaper titles showing approximately when they first appeared in Trove, 2009 to 2021
trove-newspapers-corrections
https://github.com/GLAM-Workbench/trove-newspapers-corrections
OCR errors in Trove’s digitised newspapers can be corrected by users. To help understand patterns in newspaper correction, this dataset has been created to record information about the number of articles with corrections. The data was extracted from the Trove API using this notebook.
- Number of corrections by newspaper, 7 versions:
  - 13 August 2019
  - 10 July 2020
  - 27 April 2021
  - 21 January 2022
  - 24 June 2022
  - 14 September 2024
- Number of corrections by year, 2 versions:
  - 24 June 2024
  - 14 September 2024
- Number of corrections by category, 2 versions:
  - 24 June 2024
  - 14 September 2024
trove-newspapers-data-post-54
https://github.com/GLAM-Workbench/trove-newspapers-data-post-54/
Due to copyright restrictions, most of the digitised newspaper articles on Trove were published before 1955. However, some articles published after 1954 have been made available. This repository provides data about digitised newspapers in Trove that have articles available from after 1954 (the ‘copyright cliff of death’). The data was extracted from the Trove API using this notebook.
There are 8 versions of this dataset:
- 7 June 2019
- 12 August 2019
- 10 July 2020
- 11 November 2020
- 27 April 2021
- 21 January 2022
- 27 June 2024
- 14 September 2024
trove-newspapers-non-english
https://github.com/GLAM-Workbench/trove-newspapers-non-english
This dataset contains information about newspapers published in languages other than English that have been digitised and made available through Trove. Data about the languages present in newspapers was generated by harvesting a sample of articles from each newspaper using the Trove API, and then using language detection software on the OCRd text of each article. The method is documented in this notebook in the GLAM Workbench.
There are two data files:
- CSV data file of the main languages detected for each newspaper with non-English language content
- A markdown formatted list of all newspapers found with non-English language content
There are two versions of this dataset:
- 9 July 2024
- 14 September 2024
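To illustrate how per-article language detections might be rolled up into the ‘main languages’ for each newspaper, here is a minimal sketch. It is not the method used in the notebook: the input format, the `min_share` threshold, and the sample newspaper names are all assumptions made for illustration, and the language detection step itself is left out.

```python
from collections import Counter, defaultdict


def main_languages(detections, min_share=0.1):
    """Summarise detected languages per newspaper as (language, share) pairs.

    detections is an iterable of (newspaper, language_code) pairs, one per
    sampled article. min_share is an assumed cut-off for ignoring languages
    that appear in only a small fraction of the sample.
    """
    by_newspaper = defaultdict(Counter)
    for newspaper, language in detections:
        by_newspaper[newspaper][language] += 1
    summary = {}
    for newspaper, counts in by_newspaper.items():
        total = sum(counts.values())
        summary[newspaper] = [
            (language, count / total)
            for language, count in counts.most_common()
            if count / total >= min_share
        ]
    return summary


# Invented sample data: one detected language per sampled article.
sample = [
    ("Example Zeitung", "de"), ("Example Zeitung", "de"),
    ("Example Zeitung", "de"), ("Example Zeitung", "en"),
    ("Example Herald", "en"),
]
result = main_languages(sample)
print(result)
```

A threshold like this is one way to stop occasional misdetections in noisy OCR text from being reported as a language of the newspaper, though the right value would depend on the sample size.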
trove-newspaper-issues
This dataset contains information about the published issues of newspapers digitised and made available through Trove. The data was harvested from the Trove API, using this notebook in the GLAM Workbench.
There are two data files in this dataset:
- Total number of newspaper issues per year for each digitised newspaper
- A complete list of newspaper issues available from Trove
There are 5 versions of this dataset:
- 18 October 2021
- 20 January 2022
- 3 August 2023
- 26 June 2024
- 13 September 2024
3. Trove lists and tags
trove-lists-metadata
https://github.com/GLAM-Workbench/trove-lists-metadata/
Trove users can create collections of resources using Trove’s ‘lists’. This dataset contains metadata describing all public lists, harvested via the Trove API. To reduce file size, the details of the resources collected by each list are not included, just the total number of resources. The data was extracted using this notebook from the Trove lists and tags section of the GLAM Workbench.
There are 4 versions of this dataset:
- 20 September 2018
- 22 September 2020
- 5 July 2022
- 6 June 2024
Public tags added to resources in Trove, 2008 to 2024
This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024.
There are 3 versions of this dataset:
- 10 July 2021
- 6 July 2022
- 6 June 2024
Tag counts
This dataset was derived from the full harvest of Trove public tags. It contains a list of unique tags and the total number of resources in Trove each tag is attached to.
There are 3 versions of this dataset:
- 10 July 2021
- 6 July 2022
- 6 June 2024
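Deriving tag counts from a full harvest is a simple aggregation. The sketch below shows one way it could be done, assuming the harvest is a CSV with one row per tag and resource pairing. The column names `tag` and `resource_id` and the sample rows are hypothetical, and may not match the actual dataset.

```python
import csv
import io
from collections import Counter


def count_tags(rows):
    """Count the distinct resources each tag is attached to."""
    seen = set()
    counts = Counter()
    for row in rows:
        pair = (row["tag"], row["resource_id"])
        # Ignore repeated tag/resource pairs so each resource counts once.
        if pair not in seen:
            seen.add(pair)
            counts[row["tag"]] += 1
    return counts


# Invented sample standing in for the harvested tags file.
sample = io.StringIO(
    "tag,resource_id\n"
    "ww1,101\n"
    "ww1,102\n"
    "ww1,102\n"
    "shipwrecks,101\n"
)
tag_counts = count_tags(csv.DictReader(sample))
for tag, total in tag_counts.most_common():
    print(tag, total)
```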
4. Trove contributors
https://github.com/wragge/trove-contributor-totals
This repository contains an automated git scraper that uses the Trove API to save details of organisations and projects that contribute metadata to Trove. As well as counts of total resources by contributor, this dataset includes counts of resources from each contributor by format and category. It runs every week and updates the following data files:
- Number of resources by contributor (unmodified JSON from API), weekly updates, 9 March 2023 to present
- Number of resources by contributor (flattened data as CSV), weekly updates, 9 March 2023 to present
- Number of resources by contributor and category, weekly updates, 21 June 2024 to present
- Number of resources by contributor, category, and format, weekly updates, 21 June 2024 to present
These data files were generated using version 2 of the Trove API:
- Number of resources by contributor and zone, weekly updates, 9 March 2023 to 12 May 2024
- Number of resources by contributor, zone, and format, weekly updates, 9 March 2023 to 12 May 2024
5. Trove digitised resources (other than newspapers)
To help people find and use digitised resources other than newspapers in Trove, I’ve been harvesting, sharing, and visualising metadata relating to specific formats, such as books and periodicals. The methods I’ve used have changed over time, and there are some earlier versions that I still need to extract from the Git repositories, but these are the current datasets. I’m planning to set up automatic re-harvests for some or all of these, so there’ll be a better record of change over time.
There’s more information about these datasets in both the GLAM Workbench and the Trove Data Guide.
Books
https://github.com/GLAM-Workbench/trove-books-data
There are 2 versions of this dataset:
- 20 November 2023
- 14 February 2024
Periodicals
https://github.com/GLAM-Workbench/trove-periodicals-data
This dataset was created by checking, correcting, and enriching data about digitised periodicals obtained from the Trove API. Additional metadata describing periodical titles and issues was extracted from the Trove website and used to check the API results. Where titles were wrongly described as issues, and vice versa, the records were corrected. Additional descriptive metadata was also added into the records. Separate CSV formatted data files were created for titles and issues. Finally, the titles and issues data was loaded into an SQLite database for use with Datasette.
There are 4 data files in this repository:
- NDJSON file of titles and issues harvested from Trove API
- CSV file of periodical titles enriched with additional metadata
- CSV file of periodical issues
- SQLite database with linked titles and issues data
There are two versions of this dataset:
- 29 February 2024
- 12 March 2024
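As a hedged illustration of the final step described above, loading titles and issues into a linked SQLite database, here is a small sketch using Python's standard library. The table layouts, column names, and sample rows are assumptions for the example and are not taken from the actual data files.

```python
import csv
import io
import sqlite3

# Invented sample data standing in for the titles and issues CSV files.
titles_csv = io.StringIO("id,title\nt1,Example Gazette\n")
issues_csv = io.StringIO(
    "id,title_id,date\n"
    "i1,t1,1901-01-01\n"
    "i2,t1,1901-02-01\n"
)


def load_periodicals(conn, titles_file, issues_file):
    """Create linked titles and issues tables and fill them from CSV files."""
    conn.execute("CREATE TABLE titles (id TEXT PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE issues ("
        "id TEXT PRIMARY KEY, "
        "title_id TEXT REFERENCES titles(id), "
        "date TEXT)"
    )
    conn.executemany(
        "INSERT INTO titles VALUES (:id, :title)", csv.DictReader(titles_file)
    )
    conn.executemany(
        "INSERT INTO issues VALUES (:id, :title_id, :date)",
        csv.DictReader(issues_file),
    )
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
load_periodicals(conn, titles_csv, issues_csv)
issue_totals = conn.execute(
    "SELECT t.title, COUNT(i.id) FROM titles t "
    "LEFT JOIN issues i ON i.title_id = t.id GROUP BY t.id"
).fetchall()
print(issue_totals)
```

Connecting to a file path instead of `:memory:` would produce a database file that could then be opened with a tool such as Datasette.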
Parliamentary Papers
https://github.com/GLAM-Workbench/trove-parliamentary-papers-data
This dataset contains metadata describing Commonwealth Parliamentary Papers that have been digitised and are made available through Trove.
There’s one version of this dataset:
- 23 February 2024
Maps
https://github.com/GLAM-Workbench/trove-maps-data
This dataset contains metadata describing digitised maps in Trove, harvested from the Trove API and other sources.
There are 2 data files in this dataset.
There are 2 versions of this dataset:
- 1 February 2023
- 8 June 2024
Oral histories
https://github.com/GLAM-Workbench/trove-oral-histories-data
There are 2 data files in this dataset.
There are 2 versions of this dataset:
- 16 November 2023
- 15 December 2024
Images
https://github.com/GLAM-Workbench/trove-images-rights-data/
This dataset includes information about the application of licences and rights statements to images by Trove contributors.
There are 2 data files in this repository:
- Number of images by licence type and contributor
- Number of out-of-copyright images by licence type and contributor
There are 3 versions of this dataset:
- 17 February 2020
- 9 March 2022
- 24 April 2024
Finding aids
https://github.com/GLAM-Workbench/nla-finding-aids-data/
This repository contains data about the National Library of Australia’s digitised manuscript finding aids, harvested from Trove.
This dataset contains 2 data files:
- List of URLs of digitised finding aids in Trove
- Summary information describing each digitised finding aid
There is one version of this dataset:
- 1 March 2023
6. Trove born digital resources
Pandora web archive collections
https://github.com/GLAM-Workbench/trove-web-archives-collections
This dataset contains details of the subject and collection groupings used by Pandora to organise archived web resource titles.
There are two data files in this dataset.
There are 2 versions of this dataset:
- 2 May 2024
- 7 May 2024
NED periodicals
https://github.com/GLAM-Workbench/trove-ned-periodicals-data
This dataset contains details of periodical titles and issues submitted to Trove through the NLA’s National edeposit scheme.
There are 3 data files in this dataset:
- Metadata of periodical titles submitted through NED
- Metadata of periodical issues submitted through NED
- SQLite database combining linked titles and issues data
There is one version of this dataset:
- 10 April 2024
7. National Archives of Australia
The NAA datasets are all over the place at present and I need to do a lot of work to get them standardised and organised. These are the main datasets, but there are others I need to add.
Records with the access status ‘Closed’
https://github.com/wragge/closed_access
https://github.com/GLAM-Workbench/recordsearch
Versions in Figshare:
- Files in the National Archives of Australia currently withheld from public access
- Files in the National Archives of Australia withheld from public access in 2015
- Files in the National Archives of Australia withheld from public access in 2016
- Files in the National Archives of Australia withheld from public access in 2017
Versions in GitHub:
- January 2016
- January 2017
- January 2018
- January 2019
- January 2020
- January 2021
- January 2022
- January 2023 (harvested but not in repo yet)
- January 2024 (harvested but not in repo yet)
Summary data about all series in RecordSearch
https://github.com/GLAM-Workbench/recordsearch
CSV file containing basic descriptive information about all the series currently registered on RecordSearch as well as the total number of items described, digitised, and in each access category.
- Summary data about all series in RecordSearch, May 2021
- Summary data about all series in RecordSearch, April 2022
Recently digitised files
https://github.com/GLAM-Workbench/recordsearch
Recently digitised files – weekly snapshots
https://github.com/wragge/naa-recently-digitised
This dataset contains weekly harvests of newly digitised files in RecordSearch. The automated scraper is currently scheduled to run each Sunday, saving a list of files that have been digitised in the previous week.
There are 177 data files, created between 28 March 2021 and the present.