Tim Sherratt

Sharing recent updates and work-in-progress

May 2022

Working with Trove data – a collection of tools and resources

The ARDC is organising a couple of public forums to help gather researcher requirements for the Trove component of the HASS RDC. One of the roundtables will look at ‘Existing tools that utilise Trove data and APIs’. Last year I wrote a summary of what the GLAM Workbench can contribute to the development of humanities research infrastructure, particularly in regard to Trove. I thought it might be useful to update that list to include recent additions to the GLAM Workbench, as well as a range of other datasets, software, tools, and interfaces that exist outside of the GLAM Workbench.

Since last year’s post I’ve also been working hard to integrate the GLAM Workbench with other eResearch services such as Nectar and CloudStor, and to document and support the ways that individuals and institutions can contribute code and documentation.

Getting and moving data

There’s lots of fabulous data in Trove and other GLAM collections. In fact, there’s so much data that it can be difficult for researchers to find and collect what’s relevant to their interests. There are many tools in the GLAM Workbench to help researchers assemble their own datasets. For example:

  • Get newspaper articles in bulk with the Trove Newspaper and Gazette Harvester – This has been around in some form for more than ten years (it pre-dates the Trove API!). Give it the url of a search in Trove’s newspapers and gazettes and the harvester will save all the metadata in a CSV file, and optionally download the complete articles as OCRd text, images, or PDFs. The amount of data you harvest is really only limited by your patience and disk space. I’ve harvested more than a million articles in the past. The GLAM Workbench includes a web app version of the harvester that runs live in the cloud – just paste in your Trove API key and the search url, and click the button.
  • Harvest information about newspaper issues – When you search Trove’s newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? On what dates were newspapers published? This notebook shows how you can get information about issues from the Trove API.
  • Get Trove newspaper pages as images – If you need a nice, high-resolution version of a newspaper page you can use this web app. If you want to harvest every front page (or some other particular page) here’s an example that gets all the covers of the Australian Women’s Weekly. A pre-harvested collection of the AWW covers is included as a bonus extra.
  • Get Trove newspaper articles as images – The Trove web interface makes it difficult to download complete images of articles, but this tool will do the job. There’s a handy web app to grab individual images, but the code from this tool is reused in other places such as the Trove Newspaper Harvester and the Omeka uploader, and could be built into your own research workflows.
  • Harvest the issues of a newspaper as PDFs – This notebook harvests whole issues of newspapers as PDFs – one PDF per issue.
  • Upload Trove newspaper articles to Omeka – Whether you’re creating an online exhibition or building a research database, Omeka can be very useful. This notebook connects Trove’s newspapers to Omeka for easy upload. Your selected articles can come from a search query, a Trove list, a Zotero library, or just a list of article ids. Metadata records are created in Omeka for each article and newspaper, and an image of each article is attached. My Omeka S Tools software package also includes an example using Trove newspapers.
  • Get OCRd text from digitised periodicals in Trove – They’re often overshadowed by the newspapers, but there’s now lots of digitised journals, magazines, and parliamentary papers in Trove. You can get article-level data from the API, but not issue data. This notebook enables researchers to get metadata and OCRd text from every available issue of a periodical. To make researchers’ lives even easier, I regularly harvest all the available OCRd text from digitised periodicals in Trove. The latest harvest downloaded 51,928 issues from 1,163 periodicals – that’s about 10GB of text. You can browse the list of periodicals with OCRd text, or search this database. All the OCRd text is stored in a public repository on CloudStor.
  • Get page images from digitised periodicals in Trove – There’s more than text in digitised periodicals, and you might want to download images of pages for visual analysis. This notebook shows you how to get cover images, but could be easily modified to get another page, or a PDF. I used a modified version of this to create a collection of 3,471 full page editorial cartoons from The Bulletin, 1886 to 1952 – all available to download from CloudStor.
  • Get OCRd text from digitised books in Trove – Yep, there’s digitised books as well as newspapers and periodicals. You can download OCRd text from an individual book using the Trove web interface, but how do you make a collection of books without all that pointing and clicking? This notebook downloads all the available OCRd text from digitised books in Trove. The latest harvest includes text from 26,762 works. You can explore the results using this database.
  • Harvest parliamentary press releases from Trove – Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. This notebook shows you how to harvest both metadata and fulltext from a search of the parliamentary press releases. For example, here’s a collection of politicians talking about ‘refugees’, and another relating to COVID-19.
  • Harvest details of Radio National programs from Trove – Trove creates records for programs broadcast on ABC Radio National. For the major current affairs programs, these records are at segment level. Even though they don’t provide full transcripts, this data provides a rich, fine-grained record of Australia’s recent political, social, and economic history. This notebook shows you how to download the Radio National data. If you just want to dive straight in, there’s also a pre-harvested collection containing more than 400,000 records, with separate downloads for some of the main programs.
  • Find all the versions of an archived web page in Trove – Many of the tools in the Web Archives section of the GLAM Workbench will work with the Australian Web Archive, which is part of Trove. This notebook shows you how to get data about the number of times a web page has been archived over time.
  • Harvesting collections of text from archived web pages in Trove – If you want to explore how the content of a web page changes over time, you can use this notebook to capture the text content of every archived version of a web page.
  • Convert a Trove list into a CSV file – While Trove provides a data download option for lists, it leaves out a lot of useful data. This notebook downloads full details of newspaper articles and other works in a list and saves them as CSV files. Like the Trove Newspaper Harvester, it lets you download OCRd text and images from newspaper articles.
  • Collecting information about Trove user activity – It’s not just the content of Trove that provides interesting research data, it’s also the way people engage with it. Using the Trove API it’s possible to harvest details of all user created lists and tags. And yes, there’s pre-harvested collections of lists and tags for the impatient.
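
As a rough illustration of the pattern behind several of these harvesting tools, here's a minimal Python sketch that pages through a set of results and saves the metadata to a CSV file. It is not the Trove Newspaper Harvester's actual code: `fetch_page`, the cursor values, and the field names are hypothetical stand-ins, chosen so the loop can run without an API key or a network connection.

```python
import csv
import io


def harvest_to_csv(fetch_page, out_file, fields):
    """Page through results and write one CSV row per record.

    fetch_page is a hypothetical callable: given a cursor (None for the
    first page) it returns (records, next_cursor), where records is a list
    of dicts and next_cursor is None once there are no more pages.
    """
    writer = csv.DictWriter(out_file, fieldnames=fields)
    writer.writeheader()
    cursor = None
    total = 0
    while True:
        records, cursor = fetch_page(cursor)
        for record in records:
            # Missing fields become empty cells rather than raising errors.
            writer.writerow({field: record.get(field, "") for field in fields})
            total += 1
        if cursor is None:
            break
    return total


def demo_fetch_page(cursor):
    """Stand-in for a real API request, returning made-up records."""
    pages = {
        None: ([{"id": "1", "title": "First article", "date": "1922-05-01"}], "page2"),
        "page2": ([{"id": "2", "title": "Second article"}], None),
    }
    return pages[cursor]


buffer = io.StringIO()
count = harvest_to_csv(demo_fetch_page, buffer, ["id", "title", "date"])
csv_text = buffer.getvalue()
```

In a real harvest, `fetch_page` would wrap a request to the Trove API and `out_file` would be a file opened for writing. Passing the fetching function in as an argument keeps the paging logic easy to test on its own.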

While I’m focusing here on Trove, there’s also tools to create datasets from the National Archives of Australia, Digital NZ and Papers Past, the National Museum of Australia and more. And there’s a big list of readily downloadable datasets from Australian GLAM organisations.

Visualisation and analysis

Many of the notebooks listed above include examples that demonstrate ways of exploring and analysing your harvested data. There are also a number of companion notebooks that examine some possibilities in more detail, for example:

But there are also many other notebooks that demonstrate methods for analysing Trove’s content, for example:

  • QueryPic – Another tool that’s been around in different forms for a decade, QueryPic visualises searches in Trove’s newspapers. The latest web app couldn’t be simpler, just paste in your API key and a search url and create charts showing the number of matching articles over time. You can combine queries, change time scales, and download the data and visualisations. Interested to see how other researchers have used it? Here’s a Twitter thread with links to some publications.
  • Visualise Trove newspaper searches over time – This is like a deconstructed version of QueryPic that walks you through the process of using Trove’s facets to assemble a dataset of results over time. It provides a lot of detail on the sorts of data available, and the questions we can ask of it.
  • Visualise the total number of newspaper articles in Trove by year and state – This notebook uses a modified version of the code above to analyse the construction and context of Trove’s newspaper corpus itself. What are you actually searching? Meet the WWI effect and the copyright cliff of death! This is a great place to start if you want to get people thinking critically about how digital resources are constructed.
  • Trove newspapers – number of issues per day, 1803–2020 – A visualisation of the number of newspaper issues in Trove published on each day.
  • Analyse rates of OCR correction – Some more meta-analysis of the Trove corpus itself, this time focusing on patterns of OCR correction by Trove users.
  • Identifying non-English language newspapers in Trove – There are a growing number of non-English language newspapers digitised in Trove. However, if you’re only searching using English keywords, you might never know that they’re there. This notebook analyses a sample of articles from every newspaper in Trove to identify non-English content.
  • Beyond the copyright cliff of death – Most of the newspaper articles on Trove were published before 1955, but there are some from the later period. This notebook helps you find out how many, and which newspapers they were published in.
  • Map Trove newspaper results by state – This notebook uses the Trove state facet to create a choropleth map that visualises the number of search results per state.
  • Map Trove newspaper results by place of publication – This notebook uses the Trove title facet to find the number of results per newspaper, then merges the results with a dataset of geolocated newspapers to map where articles were published.
  • Compare two versions of an archived web page – This notebook demonstrates a number of different ways of comparing versions of archived web pages. Just choose a repository, enter a url, and select two dates to see comparisons based on: page metadata, basic statistics such as file size and number of words, numbers of internal and external links, cosine similarity of text, line by line differences in text or code, and screenshots.
  • Display changes in the text of an archived web page over time – This web app gathers all the available versions of a web page and then visualises changes in its content between versions – what’s been added, removed, and changed?
  • Use screenshots to visualise change in a page over time – Create a series of full page screenshots of a web page over time, then assemble them into a time series.
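
Two of the comparisons mentioned above for archived web pages, cosine similarity of text and line by line differences, can be sketched with Python's standard library alone. This is only an illustration under simple assumptions (plain word counts, no stop words or weighting), not the code used in the notebooks. The function names and sample texts are made up.

```python
import difflib
import math
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased words in a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts based on raw word counts (0.0 to 1.0)."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has nothing in common with anything.
        return 0.0
    return dot / (norm_a * norm_b)


def changed_lines(text_a, text_b):
    """Return (removed, added) lines when going from text_a to text_b."""
    removed, added = [], []
    for line in difflib.ndiff(text_a.splitlines(), text_b.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])
        elif line.startswith("+ "):
            added.append(line[2:])
    return removed, added


# Made-up text standing in for two captures of the same page.
version_1 = "Welcome to our site\nLatest news from 2010"
version_2 = "Welcome to our site\nLatest news from 2020"

similarity = cosine_similarity(version_1, version_2)
removed, added = changed_lines(version_1, version_2)
```

A similarity close to 1 suggests the two captures share most of their vocabulary, while the lists of removed and added lines show exactly where they differ.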

There are also possibilities for using Trove data creatively. For example you can create ‘scissors and paste’ messages from Trove newspaper articles.

Documentation and examples

All the Trove notebooks in the GLAM Workbench help document the possibilities and limits of the Trove API. The examples above can be modified and reworked to suit different research interests. Some notebooks also explore particular aspects of the API, for example:

  • Trove API Introduction – Some very basic examples of making requests and understanding results.
  • Today’s news yesterday – Uses the date index and the firstpageseq parameter to find articles from exactly 100 years ago that were published on the front page. It then selects one of the articles at random and downloads and displays an image of the front page.
  • The use of standard licences and rights statements in Trove image records – Version 2.1 of the Trove API introduced a new rights index that you can use to limit your search results to records that include one of a list of standard licences and rights statements. We can also use this index to build a picture of which rights statements are currently being used, and by whom.
  • Random items from Trove – Changes to the Trove API meant that techniques you could previously use to select resources at random no longer work. This section documents some alternative ways of retrieving random-ish works and newspaper articles from Trove.
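
To give a feel for the kind of query ‘Today’s news yesterday’ relies on, here's a small Python sketch that builds a search string combining the date index with the firstpageseq parameter. The exact range syntax, including starting the range a day before the target date, is an assumption about how the date index behaves rather than something confirmed above, and `front_page_query` is a made-up name. Check the results against the Trove API documentation before relying on it.

```python
from datetime import date, timedelta


def front_page_query(today, years_ago=100):
    """Build a query for front-page articles published `years_ago` years before `today`."""
    try:
        target = today.replace(year=today.year - years_ago)
    except ValueError:
        # 29 February has no match in a non-leap year, so fall back to the 28th.
        target = today.replace(year=today.year - years_ago, day=28)
    # Assumption: the range needs to start a day early to capture the target date.
    day_before = target - timedelta(days=1)
    return (
        f"date:[{day_before.isoformat()}T00:00:00Z TO {target.isoformat()}T00:00:00Z]"
        " firstpageseq:1"
    )


query = front_page_query(date(2022, 5, 10))
```

The resulting string would then be supplied as the search query in an API request for newspaper articles.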

And while it’s not officially part of the GLAM Workbench, I also maintain the Trove API Console which provides lots of examples of the API in action.

Videos

I’ve started making videos to help you get started with the GLAM Workbench.

Datasets

A number of pre-harvested datasets are noted above in the ‘Getting and moving data’ section. Here’s a fairly complete list of ready-to-download datasets harvested from Trove.

Newspapers

Books and periodicals

Other

See also Sources of Australian GLAM data in the GLAM Workbench.

Software

The GLAM Workbench makes use of a number of software packages that I’ve created in Python to work with Trove data. These are openly-licensed and available for installation from PyPi.

  • Trove Harvester – harvest newspaper and gazette articles
  • Trove Query Parser – convert search parameters from the Trove web interface into a form the API understands
  • Trove Newspaper Images – tools for downloading images from Trove’s digitised newspapers and gazettes
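
The idea behind converting a web interface search into API parameters can be sketched with Python's standard library. This is a deliberately simplified illustration, not the Trove Query Parser's implementation or interface: the `RENAMED_PARAMS` mapping, the function name, and the sample url are assumptions, and the real package handles many more cases.

```python
from urllib.parse import parse_qsl, urlencode, urlparse

# Hypothetical, heavily simplified mapping from web interface parameter
# names to API parameter names. Anything not listed is passed through as is.
RENAMED_PARAMS = {"keyword": "q"}


def web_url_to_api_params(search_url):
    """Extract the search parameters from a url and rename them for the API."""
    query_string = urlparse(search_url).query
    return [
        (RENAMED_PARAMS.get(name, name), value)
        for name, value in parse_qsl(query_string)
    ]


# An assumed example of a search url copied from the web interface.
search_url = (
    "https://trove.nla.gov.au/search/category/newspapers"
    "?keyword=wragge&l-state=Queensland"
)
api_query = urlencode(web_url_to_api_params(search_url))
```

Returning a list of pairs rather than a dictionary means that parameters repeated in the url are kept rather than overwritten.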

Other tools and interfaces

Over the years I’ve developed many tools and interfaces using Trove data. Some have been replaced by the GLAM Workbench, but others keep chugging along, for example:

See also More GLAM tools and interfaces in the GLAM Workbench.