Preserving the history of online collections (my love letter to future historians)

Friday, September 20, 2024

It’s pretty obvious that access to digitised resources, like Trove’s newspapers, has changed the practice of history in Australia. But how? I’m certain that the historiographical implications of the growth and development of online collections will become a topic of increasing interest to historians, and that exploration of this topic will lead to important insights into the relationship between what we keep, what we value, and what we know. But for this to happen we need to have data documenting changes in online collections. What became available when? How was it delivered to users? How did the search indexing work?

In general, GLAM collection interfaces exist in an eternal present – they’re not good at explaining changes, or communicating their own histories. Australian GLAM organisations also share little statistical information. If you’re lucky, you might get something useful out of annual reports, but that’s about it. Trove, in fact, removed all their online collection statistics in the 2020 interface update. Web archives capture individual pages, but not complete systems. If we don’t document the shape and structure of online collections now, how will future historians understand their impact?

A couple of years ago I gave a short talk entitled ‘Living archaeologies of online collections’ for the Digital Preservation Coalition (video & slides).

In the talk, I described some of my piecemeal and inconsistent attempts to capture this history – starting back in 2011 when I first harvested data from Trove about the number of digitised newspaper articles. I’ve been continuing to create and update datasets, and have been trying to improve the way they’re organised and described – making them more FAIR – but I’ve still got a long way to go. At the moment there’s information spread across the GLAM Workbench, GitHub repositories, and Zenodo records, so I thought it might be useful if I pulled everything together into one big list. Hence this post. I often forget about what I’ve done in the past, so it’ll help me keep track of where the gaps are and what’s left to do. And hopefully it’ll encourage others to think about the significance and possibilities of this data, and perhaps share their own datasets.

There’s also a growing list of datasets in the Trove Historical Data community on Zenodo.

I suppose I could also add the Trove Data Guide to this list. It’s not a dataset, but it is a snapshot of the current state of Trove. I’ll continue to update it, and these updates will themselves be saved as versions in GitHub and Zenodo, allowing future researchers to dig back through the layers.

A lot of what I do is focused on the present – building tools and resources that help researchers make use of GLAM collections right now. Those tools and resources will eventually decay as I shuffle off, as collections evolve, and as technologies change. But I’m hoping that these datasets will grow in value over time. I think it was Jason Scott who coined the phrase ‘metadata is a love letter to the future’. I suppose this is my love letter to future historians.

1. Trove zones, categories, and formats

trove-zone-totals

https://github.com/wragge/trove-zone-totals

This repository contains an automated git scraper that uses the Trove API to save data about the contents of Trove’s zones and categories. It runs every week and updates the following data files:

Total number of resources by Trove category, weekly updates, 13 June 2023 to present
Total number of resources by Trove category and format, weekly updates, 13 June 2023 to present
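
The git scraper pattern comes down to fetching the current totals, writing them to a file in a stable format, and letting Git record a commit only when something has actually changed. The following is a minimal sketch of the 'write only if changed' step, not the repository's actual code. The function name `save_if_changed` and the shape of the `totals` dictionary are assumptions made for illustration, and the call to the Trove API is deliberately left out.

```python
import json
from pathlib import Path


def save_if_changed(path, totals):
    """Write totals as JSON and return True only if the file content changed.

    Keys are sorted so that two harvests containing the same data produce
    byte-identical files, which keeps noise out of the commit history.
    """
    path = Path(path)
    new_text = json.dumps(totals, indent=2, sort_keys=True) + "\n"
    old_text = path.read_text(encoding="utf-8") if path.exists() else None
    if new_text == old_text:
        # Nothing changed since the last harvest, so leave the file alone.
        return False
    path.write_text(new_text, encoding="utf-8")
    return True
```

A scheduled job could call this after each harvest and run `git commit` only when it returns `True`, so every commit in the history corresponds to a real change in the data.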

In the web interface, ‘zones’ were replaced by ‘categories’ in 2020. However, categories were not available through the Trove API until the release of version 3 in June 2023. To try and document the differences between zones and categories, totals from both were captured until version 2 of the API was switched off in September 2024.

Total number of resources by Trove zone, weekly updates, 9 March 2023 to 1 September 2024
Total number of resources by Trove zone and format, weekly updates, 9 March 2023 to 1 September 2024

2. Trove newspapers

trove-newspaper-totals-historical

https://github.com/wragge/trove-newspaper-totals-historical/

The files in this dataset were created at irregular intervals between 2011 and 2022 for use in visualising Trove’s newspaper corpus. Harvests from 2011 were screen scraped from the Trove website. Harvests after 2012 make use of the year and state facets from the Trove API. There are 9 versions:

12 April 2011
4 August 2011
12 September 2014
29 November 2015
14 December 2016
28 July 2019
10 July 2020
27 April 2021
21 January 2022

Each version includes two data files:

Total number of newspaper articles by year
Total number of newspaper articles by year and state

trove-newspaper-totals

https://github.com/wragge/trove-newspaper-totals

This repository contains an automated git scraper that uses the Trove API to save information about the number of digitised newspaper articles currently available through Trove. It runs every week and updates four data files:

Total number of newspaper articles by year, weekly updates, 19 April 2022 to present
Total number of newspaper articles by year and state, weekly updates, 19 April 2022 to present
Total number of articles by newspaper, weekly updates, 20 April 2022 to present
Total number of newspaper articles by category, weekly updates, 20 April 2022 to present

By retrieving all versions of these files from the commit history, you can analyse changes in Trove over time.
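
As a rough sketch of what retrieving versions from the commit history might look like, the code below shells out to `git log` and `git show` from Python. It assumes you have a local clone of the repository, that Git is installed, and that the file is a CSV with a numeric column to sum. The function names and the `column` parameter are hypothetical, and the real file names and column headings should be checked in the repository itself.

```python
import csv
import io
import subprocess


def parse_log(output):
    """Turn 'hash date' lines from git log into (hash, date) tuples."""
    return [tuple(line.split(" ", 1)) for line in output.splitlines() if line.strip()]


def list_versions(repo_dir, file_path):
    """List every commit that changed file_path, oldest first."""
    result = subprocess.run(
        ["git", "log", "--reverse", "--format=%H %cI", "--", file_path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_log(result.stdout)


def read_version(repo_dir, commit, file_path):
    """Load the CSV as it existed at a given commit.

    file_path must be relative to the repository root.
    """
    result = subprocess.run(
        ["git", "show", f"{commit}:{file_path}"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return list(csv.DictReader(io.StringIO(result.stdout)))


def total_over_time(repo_dir, file_path, column):
    """Sum one numeric column in every saved version of the file."""
    return [
        (date, sum(int(row[column]) for row in read_version(repo_dir, commit, file_path)))
        for commit, date in list_versions(repo_dir, file_path)
    ]
```

Calling `total_over_time` with the path to a clone, one of the data files, and the name of its totals column would give a list of (commit date, total) pairs that can be plotted to show growth over time.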

A weekly summary of the harvested data is presented in the Trove Newspapers Data Dashboard.

trove-newspaper-titles-web-archives

https://github.com/GLAM-Workbench/trove-newspaper-titles-web-archives

These datasets were created by harvesting information about newspaper titles in Trove from web archives. The harvesting method is documented in ‘Gathering historical data about the addition of newspaper titles to Trove’ in the GLAM Workbench.

trove-newspapers-corrections

https://github.com/GLAM-Workbench/trove-newspapers-corrections

OCR errors in Trove’s digitised newspapers can be corrected by users. To help understand patterns in newspaper correction, this dataset has been created to record information about the number of articles with corrections. The data was extracted from the Trove API using this notebook.

Number of corrections by newspaper, 6 versions:
- 13 August 2019
- 10 July 2020
- 27 April 2021
- 21 January 2022
- 24 June 2022
- 14 September 2024
Number of corrections by year, 2 versions:
- 24 June 2024
- 14 September 2024
Number of corrections by category, 2 versions:
- 24 June 2024
- 14 September 2024

trove-newspapers-data-post-54

https://github.com/GLAM-Workbench/trove-newspapers-data-post-54/

Due to copyright restrictions, most of the digitised newspaper articles on Trove were published before 1955. However, some articles published after 1954 have been made available. This repository provides data about digitised newspapers in Trove that have articles available from after 1954 (the ‘copyright cliff of death’). The data was extracted from the Trove API using this notebook.

There are 8 versions of this dataset:

7 June 2019
12 August 2019
10 July 2020
11 November 2020
27 April 2021
21 January 2022
27 June 2024
14 September 2024

trove-newspapers-non-english

https://github.com/GLAM-Workbench/trove-newspapers-non-english

This dataset contains information about newspapers published in languages other than English that have been digitised and made available through Trove. Data about the languages present in newspapers was generated by harvesting a sample of articles from each newspaper using the Trove API, and then using language detection software on the OCRd text of each article. The method is documented in this notebook in the GLAM Workbench.
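
The summarising step in that method, turning per-article language guesses into a profile for each newspaper, could look something like the sketch below. This is not the code from the notebook. The `detect` argument stands in for whatever language detection software is used (any function that takes a string and returns a language code), and the `min_words` cut-off is my own assumption, added because detection on very short OCR fragments tends to be unreliable.

```python
from collections import Counter


def summarise_languages(texts, detect, min_words=10):
    """Return {language code: share of sampled articles} for one newspaper.

    texts is the OCR text of each sampled article; detect is any callable
    that maps a string to a language code. Texts shorter than min_words
    are skipped rather than guessed at.
    """
    counts = Counter(
        detect(text) for text in texts if len(text.split()) >= min_words
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders the result from the dominant language down.
    return {lang: count / total for lang, count in counts.most_common()}
```

Passing the detector in as a parameter keeps the tallying independent of any particular library, so the same function works if the detection software is swapped later.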

There are two data files:

There are two versions of this dataset:

9 July 2024
14 September 2024

trove-newspaper-issues

This dataset contains information about the published issues of newspapers digitised and made available through Trove. The data was harvested from the Trove API, using this notebook in the GLAM Workbench.

There are two data files in this dataset:

Total number of newspaper issues per year for each digitised newspaper
A complete list of newspaper issues available from Trove

There are 5 versions of this dataset:

18 October 2021
20 January 2022
3 August 2023
26 June 2024
13 September 2024

3. Trove lists and tags

trove-lists-metadata

https://github.com/GLAM-Workbench/trove-lists-metadata/

Trove users can create collections of resources using Trove’s ‘lists’. This dataset contains metadata describing all public lists, harvested via the Trove API. To reduce file size, the details of the resources collected by each list are not included, just the total number of resources. The data was extracted using this notebook from the Trove lists and tags section of the GLAM Workbench.

There are 4 versions of this dataset:

20 September 2018
22 September 2020
5 July 2022
6 June 2024

Public tags added to resources in Trove, 2008 to 2024

This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024.

There are 3 versions of this dataset:

10 July 2021
6 July 2022
6 June 2024

Tag counts

This dataset was derived from the full harvest of Trove public tags. It contains a list of unique tags and the total number of resources in Trove each tag is attached to.

There are 3 versions of this dataset:

10 July 2021
6 July 2022
6 June 2024

4. Trove contributors

https://github.com/wragge/trove-contributor-totals

This repository contains an automated git scraper that uses the Trove API to save details of organisations and projects that contribute metadata to Trove. As well as counts of total resources by contributor, this dataset includes counts of resources from each contributor by format and category. It runs every week and updates the following data files:

Number of resources by contributor (unmodified JSON from API), weekly updates, 9 March 2023 to present
Number of resources by contributor (flattened data as CSV), weekly updates, 9 March 2023 to present
Number of resources by contributor and category, weekly updates, 21 June 2024 to present
Number of resources by contributor, category, and format, weekly updates, 21 June 2024 to present

These data files were generated using version 2 of the Trove API:

Number of resources by contributor and zone, weekly updates, 9 March 2023 to 12 May 2024
Number of resources by contributor, zone, and format, weekly updates, 9 March 2023 to 12 May 2024

5. Trove digitised resources (other than newspapers)

To help people find and use digitised resources other than newspapers in Trove, I’ve been harvesting, sharing, and visualising metadata relating to specific formats, such as books and periodicals. The methods I’ve used have changed over time, and there are some earlier versions that I still need to extract from the Git repositories, but these are the current datasets. I’m planning to set up automatic re-harvests for some or all of these, so there’ll be a better record of change over time.

There’s more information about these datasets in both the GLAM Workbench and the Trove Data Guide.

Books

https://github.com/GLAM-Workbench/trove-books-data

There are 2 versions of this dataset:

20 November 2023
14 February 2024

Periodicals

https://github.com/GLAM-Workbench/trove-periodicals-data

This dataset was created by checking, correcting, and enriching data about digitised periodicals obtained from the Trove API. Additional metadata describing periodical titles and issues was extracted from the Trove website and used to check the API results. Where titles were wrongly described as issues, and vice versa, the records were corrected. Additional descriptive metadata was also added into the records. Separate CSV formatted data files were created for titles and issues. Finally, the titles and issues data was loaded into an SQLite database for use with Datasette.
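
For anyone wanting to do something similar with their own CSV files, the final step of loading the data into SQLite can be sketched with nothing but the Python standard library. This is an illustration under simplifying assumptions, not the process actually used for this dataset: every column is stored as text, the table and column names are taken as given, and the function name `load_csv` is hypothetical.

```python
import csv
import sqlite3


def load_csv(conn, table, csv_file):
    """Load an open CSV file into a SQLite table and return the row count.

    The table is created from the CSV header if it does not exist yet.
    All columns are stored as TEXT to keep the sketch simple.
    """
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    col_defs = ", ".join(f'"{col}" TEXT' for col in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        ([row[col] for col in columns] for row in reader),
    )
    conn.commit()
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
```

Calling it once for a titles file and once for an issues file, against the same connection, produces a single database file that Datasette can open directly.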

There are 4 data files in this repository:

There are two versions of this dataset:

29 February 2024
12 March 2024

Parliamentary Papers

https://github.com/GLAM-Workbench/trove-parliamentary-papers-data

This dataset contains metadata describing Commonwealth Parliamentary Papers that have been digitised and are made available through Trove.

There’s one version of this dataset:

23 February 2024

Maps

https://github.com/GLAM-Workbench/trove-maps-data

This dataset contains metadata describing digitised maps in Trove, harvested from the Trove API and other sources.

There are 2 data files in this dataset:

There are 2 versions of this dataset:

1 February 2023
8 June 2024

Oral histories

https://github.com/GLAM-Workbench/trove-oral-histories-data

There are 2 data files in this dataset:

There are 2 versions of this dataset:

16 November 2023
15 December 2023

Images

https://github.com/GLAM-Workbench/trove-images-rights-data/

This dataset includes information about the application of licences and rights statements to images by Trove contributors.

There are 2 data files in this repository:

There are 3 versions of this dataset:

17 February 2020
9 March 2022
24 April 2024

Finding aids

https://github.com/GLAM-Workbench/nla-finding-aids-data/

This repository contains data about the National Library of Australia’s digitised manuscript finding aids, harvested from Trove.

This dataset contains 2 data files:

There is one version of this dataset:

1 March 2023

6. Trove born digital resources

Pandora web archive collections

https://github.com/GLAM-Workbench/trove-web-archives-collections

This dataset contains details of the subject and collection groupings used by Pandora to organise archived web resource titles.

There are two data files in this dataset:

There are 2 versions of this dataset:

2 May 2024
7 May 2024

NED periodicals

https://github.com/GLAM-Workbench/trove-ned-periodicals-data

This dataset contains details of periodical titles and issues submitted to Trove through the NLA’s National edeposit scheme.

There are 3 data files in this dataset:

There is one version of this dataset:

10 April 2024

7. National Archives of Australia

The NAA datasets are all over the place at present and I need to do a lot of work to get them standardised and organised. These are the main datasets, but there are others I need to add.

Records with the access status ‘Closed’

https://github.com/wragge/closed_access

https://github.com/GLAM-Workbench/recordsearch

Versions in Figshare:

Versions in GitHub:

January 2016
January 2017
January 2018
January 2019
January 2020
January 2021
January 2022
January 2023 (harvested but not in repo yet)
January 2024 (harvested but not in repo yet)

Summary data about all series in RecordSearch

https://github.com/GLAM-Workbench/recordsearch

CSV file containing basic descriptive information about all the series currently registered on RecordSearch as well as the total number of items described, digitised, and in each access category.

Recently digitised files

https://github.com/GLAM-Workbench/recordsearch

Details of files digitised between 25 February and 26 March 2020

Recently digitised files – weekly snapshots

https://github.com/wragge/naa-recently-digitised

This dataset contains weekly harvests of newly digitised files in RecordSearch. The automated scraper is currently scheduled to run each Sunday, saving a list of files that have been digitised in the previous week.

There are 177 data files, created between 28 March 2021 and the present.
