Also on tonight’s episode of ‘In GLAM This Week’, the NAA digitised 20,517 files. Of these:

13,850 were name index cards from the Bonegilla Migrant Camp (Series A2571 & A2572).

4,938 were CMF personnel files (Series B884).

See: github.com/wragge/na…

Screenshot of top twenty series with the most files digitised this week.

This week’s changes in Trove newspapers:

+7,451 articles

+16,628 articles with corrections

+8,148 articles with tags

+491 articles with comments

github.com/wragge/tr…

Another overdue maintenance task completed! The Tung Wah Newspaper index has been migrated from a custom Django app on a now-defunct web host to a nice, neat Datasette instance on Heroku. https://resources.chineseaustralia.org/tung_wah_newspaper_index

Screen capture of the home page of the Tung Wah Newspaper index, an English-language index of two Chinese-language newspapers – the Tung Wah News 東華新報 (1898–1902) and the later Tung Wah Times 東華報 (1902–1936) – published in Sydney, Australia. The image shows there are links to search the index, and to browse by issue and year.

TIL you can do date maths in Trove. Searching for date:[NOW-10YEAR TO NOW] in newspapers & gazettes returns articles from the last 10 years. Try it: trove.nla.gov.au/search/ca…
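
If you want to reuse the same date maths outside the web interface, here's a minimal sketch in Python. It assumes you have a Trove API key; the endpoint and response layout are from version 2 of the Trove API.

```python
import requests

API_KEY = "YOUR_API_KEY"  # create a key via your Trove user profile

params = {
    "q": "date:[NOW-10YEAR TO NOW]",  # the same date maths as the web interface
    "zone": "newspaper",
    "encoding": "json",
    "n": 20,
    "key": API_KEY,
}

response = requests.get("https://api.trove.nla.gov.au/v2/result", params=params)
data = response.json()

# Each zone in the response reports its own total number of matches
zone = data["response"]["zone"][0]
print(f"{zone['records']['total']} articles from the last ten years")
```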

My Trove researcher platform wishlist

The ARDC is collecting user requirements for the Trove researcher platform for advanced research. This is a chance to start from scratch, and think about the types of data, tools, or interface enhancements that would support innovative research in the humanities and social sciences. The ARDC will be holding two public roundtables, on 13 and 20 May, to gather ideas. I created a list of possible API improvements in my response to last year’s draft plan, and thought it might be useful to expand that a bit, and add in a few other annoyances, possibilities, and long-held dreams.

My focus is again on the data, and this is for two reasons. First, because public access to consistent, good-quality data makes all other things possible. But, of course, it’s never just a matter of OPENING ALL THE DATA. There will be questions about priorities, about formats, about delivery, about normalisation and enrichment. Many of these questions will arise as people try to make use of the data. There needs to be an ongoing conversation between data providers, research tool makers, and research tool users. This is the second reason I think the data is critically important – our focus should be on developing communities and skills, not products. A series of one-off tools for researchers might be useful, but the benefits will wane. Building tools through networks of collaboration and information sharing, based on good-quality data, offers much more. Researchers should be participants in these processes, not consumers.

Anyway, here’s my current wishlist…

APIs and data

  • Bring the web interface and main public API back into sync, so that researchers can easily transfer queries between the two. The Trove interface update in 2020 reorganised resources into ‘categories’, replacing the original ‘zones’. The API, however, is still organised by zone and knows nothing about these new categories. Why does this matter? The web interface allows researchers to explore the collection and develop research questions. Some of these questions might be answered by downloading data from the API for analysis or visualisation. But, except for the newspapers, there is currently no one-to-one correspondence between searches in the web interface and searches using the API. There’s no way of transferring your questions – you need to start again. (The first sketch after this list illustrates the problem.)

  • Expand the metadata available for digitised resources other than newspapers. In recent years, the NLA has digitised huge quantities of books, journals, images, manuscripts, and maps. The digitisation process has generated new metadata describing these resources, but most of this is not available through the public API. We can get an idea of what’s missing by comparing the digitised journals to the newspapers. The API includes a newspaper endpoint that provides data on all the newspapers in Trove. You can use it to get a list of available issues for any newspaper. There is no comparable way of retrieving a list of digitised journals, or the issues that have been digitised. The data’s somewhere – there’s an internal API that’s used to generate lists of issues in the browse interface and I’ve scraped this to harvest issue details. But this information should be in the public API. Manuscripts are described using finding aids, themselves generated from EAD-formatted XML files, but none of this important structured data is available from the API, or for download. There’s also other resource metadata, such as parent/child relationships between different levels in the object hierarchy (eg publication > pages). These are embedded in web pages but not exposed in the API. The main point is that when it comes to data-driven research, digitised books, journals, manuscripts, images, and maps are second-class citizens, trailing far behind the newspapers in research possibilities. There needs to be a thorough stocktake of available metadata, and a plan to make this available in machine-actionable form.

  • Standardise the delivery of text, images, and PDFs and provide download links through the API. As noted above, digitised resources are treated differently depending on where they sit in Trove. There are no standard mechanisms for downloading the products of digitisation, such as OCRd text and images. OCRd text is available directly through the API for newspaper and journal articles, but to download text from a book or journal issue you need to hack the download mechanisms from the web interface. Links to these should be included in the API. Similarly, machine access to images requires various hacks and workarounds. There should be a consistent approach that allows researchers to compile image datasets from digitised resources using the API. Ideally, IIIF-standard APIs should be used for the delivery of images and maps. This would enable the use of the growing ecosystem of IIIF-compliant tools for integration, analysis, and annotation.

  • Provide an option to exclude search results in tags and comments. The Trove advanced search used to give you the option of excluding search results which only matched tags or comments, rather than the content of the resource. Back when I was working at Trove, the IT folks said this feature would be added to a future version of the API, but instead it disappeared from the web interface with the 2020 update! Why is this important? If you’re trying to analyse the occurrence of search terms within a collection, such as Trove’s digitised newspapers, you want to be sure that the result reflects the actual content, and not a recent annotation by Trove users.

  • Finally add the People & Organisations data to the main API. Trove’s People & Organisations section was ahead of the game in providing machine-readable access, but the original API is out-of-date and uses a completely different query language. Some work was done on adding it to the main RESTful API, but it was never finished. With a bit of long-overdue attention, the People & Organisations data could power new ways of using and linking biographical resources.

  • Improve the web archives CDX API. Although the NLA does little to inform researchers of the possibilities, the web archives software it uses (Pywb) includes some baked-in options for retrieving machine-readable data. This includes support for the Memento protocol, and the provision of a CDX API that delivers basic metadata about individual web page captures. The current CDX API has some limitations (documented here). In particular, there’s no pagination of results, and no support for domain-level queries. Addressing these limitations would make the existing CDX API much more useful. (The second sketch after this list shows a basic CDX request.)

  • Provide new data sources for web archives analysis. There needs to be a constructive, ongoing discussion about the types of data that could be extracted and shared from the Australian web archive. For example, a search API, or downloadable datasets of word frequencies. The scale is a challenge, but some pilot studies could help us all understand both the limits and the possibilities.

  • Provide a Write API for annotations. Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources. Indeed, this would create exciting possibilities for embedding Trove resources within systems of scholarly analysis, allowing insights gained through research to be automatically fed back into Trove to enhance discovery and understanding.

  • Provide historical statistics on Trove resources. It’s important for researchers to understand how Trove itself changes over time. There used to be a page that provided regularly-updated statistics on the number of resources and user annotations, but this was removed by the interface upgrade in 2020. I’ve started harvesting some basic stats relating to Trove newspapers, but access to general statistics should be reinstated.

  • Reassess key authentication and account limits. DigitalNZ recently changed their policy around API authentication, allowing public access without a key. Authentication requirements hinder exploration and limit opportunities for using the API in teaching and workshops. Similarly, I don’t think the account usage limits have been changed since the API was released, even though the capacity of the underlying systems has increased. Both seem overdue for reassessment.
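
To make the first point on this list concrete, here’s a minimal sketch of what ‘transferring a query’ currently involves. The web URL and the zone mapping are illustrative examples – the category-to-zone translation is exactly the step researchers shouldn’t have to do by hand.

```python
import requests

API_KEY = "YOUR_API_KEY"

# A search in the current web interface is organised by category, eg:
#   https://trove.nla.gov.au/search/category/books?keyword=wragge
# The API knows nothing about categories, so you have to translate the
# category back into one or more of the original zones yourself.

params = {
    "q": "wragge",
    "zone": "book",  # the old zone, not the new 'books' category
    "encoding": "json",
    "key": API_KEY,
}

response = requests.get("https://api.trove.nla.gov.au/v2/result", params=params)
total = response.json()["response"]["zone"][0]["records"]["total"]
print(f"{total} results in the 'book' zone")
```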
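
And here’s the CDX sketch promised above. It assumes the Australian Web Archive’s Pywb instance exposes its CDX endpoint at the URL below – treat that as an assumption and check the current documentation before relying on it.

```python
import json

import requests

# Assumed location of the AWA's Pywb CDX endpoint
CDX_URL = "https://web.archive.org.au/awa/cdx"

params = {
    "url": "discontents.com.au",
    "output": "json",  # one JSON object per line (CDXJ), not a single array
    "limit": 25,
}

response = requests.get(CDX_URL, params=params)

# Each line describes a single capture of the page
for line in response.text.splitlines():
    capture = json.loads(line)
    print(capture["timestamp"], capture["url"], capture["status"])
```

Note that `limit` only truncates the results – without pagination there’s no way to pick up where a previous request left off, which is exactly the limitation described above.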

Ok, I’ll admit, that’s a pretty long list, and not everything can be done immediately! I think this would be a good opportunity for the NLA to develop and share an API and Data Roadmap that is regularly updated and invites comments and suggestions. This would help researchers plan future projects, and build a case for further government investment.

Integration

  • Unbreak Zotero integration. The 2020 interface upgrade broke the existing Zotero integration and there’s no straightforward way of fixing it without changes at the Trove end. Zotero used to be able to capture search results, metadata and images from most of the zones in Trove. Now it can only capture individual newspaper articles. This greatly limits the ability of researchers to assemble and manage their own research collections. More generally, a program to examine and support Zotero integration across the GLAM sector would be a useful way of spending some research infrastructure dollars.

  • Provide useful page metadata. Zotero is just one example of a tool that can extract structured metadata from web pages. Such metadata supports reuse and integration, without the need for separate API requests. Only Trove’s newspaper articles currently provide embedded metadata. Libraries used to lead the way in promoting the use of standardised, structured, embedded page metadata (Dublin Core anyone?), but now?

  • Explore annotation frameworks. I’ve mentioned the possibility of a Write API for annotations above, but there are other possibilities for supporting web scale annotations, such as Hypothesis. Again, the current Trove interface makes the use of Hypothesis difficult, and again this sort of integration would be usefully assessed across the whole GLAM sector.

Tools & interfaces

Obviously any discussion of new tools or interfaces needs to start by looking at what’s already available. This is difficult when the NLA won’t even link to existing resources such as the GLAM Workbench. Sharing information about existing tools needs to be the starting point from which to plan investment in the Trove Researcher Platform. From there we can identify gaps and develop processes and collaborations to meet specific research needs. Here’s a list of some Trove-related tools and resources currently available through the GLAM Workbench.

Spending the evening updating the NAA section of the #GLAMWorkbench. Here’s a fresh harvest of the agency functions currently being used in RecordSearch… gist.github.com/wragge/d1…

This morning has been all bug hunting. But at least I’ve now found & fixed the problem, & released version 0.0.14 of the RecordSearch Data Scraper! https://github.com/wragge/recordsearch_data_scraper/releases/tag/v0.0.14 Also updated on PyPI.

Changes in Trove newspaper titles in the last week:

+26,731 articles from Daily News (WA)

+88,485 articles from Sunraysia Daily (Vic)

+6,810 articles from Dalgety’s Review (WA)

See: github.com/wragge/tr…

Changes in Trove newspapers in the past week:

+124,404 articles available

+17,306 articles with corrections

+8,269 articles with tags

+523 articles with comments

See: github.com/wragge/tr…

Somewhat unexpectedly the US National Archives & Records Administration catalogue includes some historic photos of Tasmania… catalog.archives.gov/search

I’ve got a site, I suppose I need to add some content now… glam-workbook.net #GLAMWorkbook

Working with Trove data – a collection of tools and resources

The ARDC is organising a couple of public forums to help gather researcher requirements for the Trove component of the HASS RDC. One of the roundtables will look at ‘Existing tools that utilise Trove data and APIs’. Last year I wrote a summary of what the GLAM Workbench can contribute to the development of humanities research infrastructure, particularly in regard to Trove. I thought it might be useful to update that list to include recent additions to the GLAM Workbench, as well as a range of other datasets, software, tools, and interfaces that exist outside of the GLAM Workbench.

Since last year’s post I’ve also been working hard to integrate the GLAM Workbench with other eResearch services such as Nectar and CloudStor, and to document and support the ways that individuals and institutions can contribute code and documentation.

Getting and moving data

There’s lots of fabulous data in Trove and other GLAM collections. In fact, there’s so much data that it can be difficult for researchers to find and collect what’s relevant to their interests. There are many tools in the GLAM Workbench to help researchers assemble their own datasets. For example:

  • Get newspaper articles in bulk with the Trove Newspaper and Gazette Harvester – This has been around in some form for more than ten years (it pre-dates the Trove API!). Give it the url of a search in Trove’s newspapers and gazettes and the harvester will save all the metadata in a CSV file, and optionally download the complete articles as OCRd text, images, or PDFs. The amount of data you harvest is really only limited by your patience and disk space. I’ve harvested more than a million articles in the past. The GLAM Workbench includes a web app version of the harvester that runs live in the cloud – just paste in your Trove API key and the search url, and click the button.
  • Harvest information about newspaper issues – When you search Trove’s newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? On what dates were newspapers published? This notebook shows how you can get information about issues from the Trove API (see the sketch after this list).
  • Get Trove newspaper pages as images – If you need a nice, high-resolution version of a newspaper page you can use this web app. If you want to harvest every front page (or some other particular page) here’s an example that gets all the covers of the Australian Women’s Weekly. A pre-harvested collection of the AWW covers is included as a bonus extra.
  • Get Trove newspaper articles as images – The Trove web interface makes it difficult to download complete images of articles, but this tool will do the job. There’s a handy web app to grab individual images, but the code from this tool is reused in other places, such as the Trove Newspaper Harvester and the Omeka uploader, and could be built into your own research workflows.
  • Harvest the issues of a newspaper as PDFs – This notebook harvests whole issues of newspapers as PDFs – one PDF per issue.
  • Upload Trove newspaper articles to Omeka – Whether you’re creating an online exhibition or building a research database, Omeka can be very useful. This notebook connects Trove’s newspapers to Omeka for easy upload. Your selected articles can come from a search query, a Trove list, a Zotero library, or just a list of article ids. Metadata records are created in Omeka for each article and newspaper, and an image of each article is attached. My Omeka S Tools software package also includes an example using Trove newspapers.
  • Get OCRd text from digitised periodicals in Trove – They’re often overshadowed by the newspapers, but there’s now lots of digitised journals, magazines, and parliamentary papers in Trove. You can get article-level data from the API, but not issue data. This notebook enables researchers to get metadata and OCRd text from every available issue of a periodical. To make researchers’ lives even easier, I regularly harvest all the available OCRd text from digitised periodicals in Trove. The latest harvest downloaded 51,928 issues from 1,163 periodicals – that’s about 10GB of text. You can browse the list of periodicals with OCRd text, or search this database. All the OCRd text is stored in a public repository on CloudStor.
  • Get page images from digitised periodicals in Trove – There’s more than text in digitised periodicals, and you might want to download images of pages for visual analysis. This notebook shows you how to get cover images, but could be easily modified to get another page, or a PDF. I used a modified version of this to create a collection of 3,471 full page editorial cartoons from The Bulletin, 1886 to 1952 – all available to download from CloudStor.
  • Get OCRd text from digitised books in Trove – Yep, there’s digitised books as well as newspapers and periodicals. You can download OCRd text from an individual book using the Trove web interface, but how do you make a collection of books without all that pointing and clicking? This notebook downloads all the available OCRd text from digitised books in Trove. The latest harvest includes text from 26,762 works. You can explore the results using this database.
  • Harvest parliamentary press releases from Trove – Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. This notebook shows you how to harvest both metadata and fulltext from a search of the parliamentary press releases. For example, here’s a collection of politicians talking about ‘refugees’, and another relating to COVID-19.
  • Harvest details of Radio National programs from Trove – Trove creates records for programs broadcast on ABC Radio National; for the major current affairs programs these records are at segment level. Even though they don’t provide full transcripts, this data provides a rich, fine-grained record of Australia’s recent political, social, and economic history. This notebook shows you how to download the Radio National data. If you just want to dive straight in, there’s also a pre-harvested collection containing more than 400,000 records, with separate downloads for some of the main programs.
  • Find all the versions of an archived web page in Trove – Many of the tools in the Web Archives section of the GLAM Workbench will work with the Australian Web Archive, which is part of Trove. This notebook shows you how to get data about the number of times a web page has been archived over time.
  • Harvesting collections of text from archived web pages in Trove – If you want to explore how the content of a web page changes over time, you can use this notebook to capture the text content of every archived version of a web page.
  • Convert a Trove list into a CSV file – While Trove provides a data download option for lists, it leaves out a lot of useful data. This notebook downloads full details of newspaper articles and other works in a list and saves them as CSV files. Like the Trove Newspaper Harvester, it lets you download OCRd text and images from newspaper articles.
  • Collecting information about Trove user activity – It’s not just the content of Trove that provides interesting research data, it’s also the way people engage with it. Using the Trove API it’s possible to harvest details of all user created lists and tags. And yes, there’s pre-harvested collections of lists and tags for the impatient.
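
As promised above, here’s a minimal sketch of getting issue data from the newspaper title endpoint. The title id and date range are illustrative, and the `include` and `range` parameters are my recollection of the v2 API – treat them as assumptions to check against the API documentation.

```python
import requests

API_KEY = "YOUR_API_KEY"
TITLE_ID = "35"  # The Sydney Morning Herald – an illustrative example

params = {
    "encoding": "json",
    "include": "years",
    # Supplying a date range asks the API to list individual issues
    # (with dates and ids) for the years that fall within it
    "range": "19300101-19301231",
    "key": API_KEY,
}

response = requests.get(
    f"https://api.trove.nla.gov.au/v2/newspaper/title/{TITLE_ID}", params=params
)
title = response.json()["newspaper"]

# Without a range you just get the number of issues per year
for year in title["year"]:
    print(year["date"], year["issuecount"])
```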

While I’m focusing here on Trove, there are also tools to create datasets from the National Archives of Australia, DigitalNZ and Papers Past, the National Museum of Australia, and more. And there’s a big list of readily downloadable datasets from Australian GLAM organisations.

Visualisation and analysis

Many of the notebooks listed above include examples that demonstrate ways of exploring and analysing your harvested data, and there are a number of companion notebooks that examine some of these possibilities in more detail.

There are also many other notebooks that demonstrate methods for analysing Trove’s content, for example:

  • QueryPic – Another tool that’s been around in different forms for a decade, QueryPic visualises searches in Trove’s newspapers. The latest web app couldn’t be simpler, just paste in your API key and a search url and create charts showing the number of matching articles over time. You can combine queries, change time scales, and download the data and visualisations. Interested to see how other researchers have used it? Here’s a Twitter thread with links to some publications.
  • Visualise Trove newspaper searches over time – This is like a deconstructed version of QueryPic that walks you through the process of using Trove’s facets to assemble a dataset of results over time (a minimal facet sketch follows this list). It provides a lot of detail on the sorts of data available, and the questions we can ask of it.
  • Visualise the total number of newspaper articles in Trove by year and state – This notebook uses a modified version of the code above to analyse the construction and context of Trove’s newspaper corpus itself. What are you actually searching? Meet the WWI effect and the copyright cliff of death! This is a great place to start if you want to get people thinking critically about how digital resources are constructed.
  • Trove newspapers – number of issues per day, 1803–2020 – A visualisation of the number of newspaper issues published every day in Trove.
  • Analyse rates of OCR correction – Some more meta-analysis of the Trove corpus itself, this time focusing on patterns of OCR correction by Trove users.
  • Identifying non-English language newspapers in Trove – There are a growing number of non-English language newspapers digitised in Trove. However, if you’re only searching using English keywords, you might never know that they’re there. This notebook analyses a sample of articles from every newspaper in Trove to identify non-English content.
  • Beyond the copyright cliff of death – Most of the newspaper articles on Trove were published before 1955, but there are some from the later period. This notebook helps you find out how many, and which newspapers they were published in.
  • Map Trove newspaper results by state – This notebook uses the Trove state facet to create a choropleth map that visualises the number of search results per state.
  • Map Trove newspaper results by place of publication – This notebook uses the Trove title facet to find the number of results per newspaper, then merges the results with a dataset of geolocated newspapers to map where articles were published.
  • Compare two versions of an archived web page – This notebook demonstrates a number of different ways of comparing versions of archived web pages. Just choose a repository, enter a url, and select two dates to see comparisons based on: page metadata, basic statistics such as file size and number of words, numbers of internal and external links, cosine similarity of text, line by line differences in text or code, and screenshots.
  • Display changes in the text of an archived web page over time – This web app gathers all the available versions of a web page and then visualises changes in its content between versions – what’s been added, removed, and changed?
  • Use screenshots to visualise change in a page over time – Create a series of full-page screenshots of a web page over time, then assemble them into a time series.
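
Here’s the facet sketch mentioned above – a minimal example of turning Trove’s facets into a dataset of results over time. It assumes the v2 API’s `facet=year` and `l-decade` behaviour, where year counts are harvested decade by decade.

```python
import requests

API_KEY = "YOUR_API_KEY"
API_URL = "https://api.trove.nla.gov.au/v2/result"

def year_counts(query, decade):
    """Return the number of matching newspaper articles per year in a decade.

    `decade` uses Trove's three-digit form, eg '191' for the 1910s.
    """
    params = {
        "q": query,
        "zone": "newspaper",
        "facet": "year",
        "l-decade": decade,
        "encoding": "json",
        "n": 0,  # we only want the facet counts, not the articles
        "key": API_KEY,
    }
    response = requests.get(API_URL, params=params)
    terms = response.json()["response"]["zone"][0]["facets"]["facet"]["term"]
    if isinstance(terms, dict):  # a single facet term isn't wrapped in a list
        terms = [terms]
    return {term["search"]: int(term["count"]) for term in terms}

# Eg: matching articles per year across the 1910s
print(year_counts("influenza", "191"))
```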

There are also possibilities for using Trove data creatively. For example, you can create ‘scissors and paste’ messages from Trove newspaper articles.

Documentation and examples

All the Trove notebooks in the GLAM Workbench help document the possibilities and limits of the Trove API. The examples above can be modified and reworked to suit different research interests. Some notebooks also explore particular aspects of the API, for example:

  • Trove API Introduction – Some very basic examples of making requests and understanding results.
  • Today’s news yesterday – Uses the date index and the firstpageseq parameter to find articles from exactly 100 years ago that were published on the front page. It then selects one of the articles at random and downloads and displays an image of the front page. (A rough sketch follows this list.)
  • The use of standard licences and rights statements in Trove image records – Version 2.1 of the Trove API introduced a new rights index that you can use to limit your search results to records that include one of a list of standard licences and rights statements. We can also use this index to build a picture of which rights statements are currently being used, and by whom.
  • Random items from Trove – Changes to the Trove API meant that techniques you could previously use to select resources at random no longer work. This section documents some alternative ways of retrieving random-ish works and newspaper articles from Trove.
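
For the curious, here’s roughly what the ‘Today’s news yesterday’ approach looks like. The exact form of the date value, and whether `firstpageseq` belongs in the query string or as a separate request parameter, are assumptions on my part – the notebook itself is the authoritative version.

```python
import random
from datetime import datetime, timedelta

import requests

API_KEY = "YOUR_API_KEY"

# A day roughly one hundred years ago (ignoring leap-year drift)
day = datetime.now() - timedelta(days=36525)
iso = day.strftime("%Y-%m-%dT00:00:00Z")

params = {
    # The date index pins the search to a single day;
    # firstpageseq:1 keeps only articles from page one (an assumption)
    "q": f"date:[{iso} TO {iso}] firstpageseq:1",
    "zone": "newspaper",
    "encoding": "json",
    "n": 50,
    "key": API_KEY,
}

response = requests.get("https://api.trove.nla.gov.au/v2/result", params=params)
articles = response.json()["response"]["zone"][0]["records"]["article"]

# Pick one front-page article at random
article = random.choice(articles)
print(article["heading"], "-", article["troveUrl"])
```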

And while it’s not officially part of the GLAM Workbench, I also maintain the Trove API Console which provides lots of examples of the API in action.

Videos

I’ve started making videos to help you get started with the GLAM Workbench.

Datasets

A number of pre-harvested datasets are noted above in the ‘Getting and moving data’ section. Here’s a fairly complete list of ready-to-download datasets harvested from Trove.

Newspapers

Books and periodicals

Other

See also Sources of Australian GLAM data in the GLAM Workbench.

Software

The GLAM Workbench makes use of a number of software packages that I’ve created in Python to work with Trove data. These are openly licensed and available for installation from PyPI.

  • Trove Harvester – harvest newspaper and gazette articles
  • Trove Query Parser – convert search parameters from the Trove web interface into a form the API understands
  • Trove Newspaper Images – tools for downloading images from Trove’s digitised newspapers and gazettes

Other tools and interfaces

Over the years I’ve developed many tools and interfaces using Trove data. Some have been replaced by the GLAM Workbench, but others keep chugging along.

See also More GLAM tools and interfaces in the GLAM Workbench. #dhhacks

New articles available this week from specific Trove newspapers:

+39,296 from Daily News (WA)

+39,810 from Australian Jewish Herald (Vic)

+104,190 from Sunraysia Daily (Vic)

More: github.com/wragge/tr…

Changes to Trove newspapers in the last week:

+211,393 articles available

+15,738 articles with corrections

+8,233 articles with tags

+499 articles with comments

More here: github.com/wragge/tr…

And so it starts… #GLAMWorkbench

Screenshot of GLAM Workbook welcome page. Text states: 'This is a companion to the GLAM Workbench. Here you'll find documentation, tips, tutorials, and exercises to help you work with digital collections from galleries, libraries, archives, and museums (the GLAM sector).'

Followed up my last FOI request about HASS research infrastructure to find out why the appendix was missing, & now it’s available! Includes an interesting (& selective) list of HASS infrastructure projects. www.dropbox.com/sh/uq7vz4…

And with that I think I’ll call it quits for #DayofDH2022 in this part of the world. Time to go check the backyard for pademelons before turning in. Hope all the DH-y types around the world have an interesting, inspiring & productive day!

Ok, I’ve created a new #GLAMWorkbench meta issue to try and bring together all the things I’m trying to do to improve & automate the code & documentation. This should help me keep track of things… github.com/GLAM-Work… #DayofDH2022

A couple of hours of #DayofDH2022 left – feeling a bit uninspired, so I’m going to do some pruning & reorganising of the #GLAMWorkbench issues list: github.com/GLAM-Work…

Hmm, didn’t get any writing done and now it’s time to cook dinner for the family. Bit of a frustrating day really. Might do a bit of coding later. #DayofDH2022

Workspace photo for #DayofDH2022. Old iMac for the socials & meetings. Newish laptop running Pop!_OS for dev work. To my left is a big window facing the backyard through which I can see visiting rosellas (Green & Eastern).

Photograph of my desktop. There are three monitors - one iMac, and a laptop on a stand connected to an external monitor. In the foreground are my distance glasses and a notebook.

Now after coding, bug chasing, meeting, & prototype demo, I have to try and get into a headspace to do some writing about government surveillance records in the National Archives of Australia. But first some late lunch! #DayofDH2022

Meeting and demo done. Love working with ANU Archives, & really excited about the latest version of the Sydney Stock Exchange database. Still WIP but you can play with the prototype here: sydney-stock-exchange-xqtkxtd5za-ts.a.run.app/stock_exc… #DayofDH2022

I’ve hit a bit of a brick wall with my Datasette deployment. I’m probably missing something obvious. Suggestions welcome… github.com/simonw/da… #DayofDH2022

And CloudStor is down for maintenance. I suppose I won’t be doing that demo then… #DayofDH2022