11 May 2022

My Trove researcher platform wishlist

The ARDC is collecting user requirements for the Trove researcher platform for advanced research. This is a chance to start from scratch, and think about the types of data, tools, or interface enhancements that would support innovative research in the humanities and social sciences. The ARDC will be holding two public roundtables, on 13 and 20 May, to gather ideas. I created a list of possible API improvements in my response to last year’s draft plan, and thought it might be useful to expand that a bit, and add in a few other annoyances, possibilities, and long-held dreams.

My focus is again on the data, for two reasons. First, public access to consistent, good quality data makes all other things possible. But, of course, it’s never just a matter of OPENING ALL THE DATA. There will be questions about priorities, about formats, about delivery, about normalisation and enrichment. Many of these questions will arise as people try to make use of the data. There needs to be an ongoing conversation between data providers, research tool makers, and research tool users. This is the second reason I think the data is critically important – our focus should be on developing communities and skills, not products. A series of one-off tools for researchers might be useful, but the benefits will wane. Building tools through networks of collaboration and information sharing based on good quality data offers much more. Researchers should be participants in these processes, not consumers.

Anyway, here’s my current wishlist…

APIs and data

Bring the web interface and main public API back into sync, so that researchers can easily transfer queries between the two. The Trove interface update in 2020 reorganised resources into ‘categories’, replacing the original ‘zones’. The API, however, is still organised by zone and knows nothing about these new categories. Why does this matter? The web interface allows researchers to explore the collection and develop research questions. Some of these questions might be answered by downloading data from the API for analysis or visualisation. But, except for the newspapers, there is currently no one-to-one correspondence between searches in the web interface and searches using the API. There’s no way of transferring your questions – you need to start again.
Expand the metadata available for digitised resources other than newspapers. In recent years, the NLA has digitised huge quantities of books, journals, images, manuscripts, and maps. The digitisation process has generated new metadata describing these resources, but most of this is not available through the public API. We can get an idea of what’s missing by comparing the digitised journals to the newspapers. The API includes a newspaper endpoint that provides data on all the newspapers in Trove. You can use it to get a list of available issues for any newspaper. There is no comparable way of retrieving a list of digitised journals, or the issues that have been digitised. The data’s somewhere – there’s an internal API that’s used to generate lists of issues in the browse interface and I’ve scraped this to harvest issue details. But this information should be in the public API. Manuscripts are described using finding aids, themselves generated from EAD formatted XML files, but none of this important structured data is available from the API, or for download. There’s also other resource metadata, such as parent/child relationships between different levels in the object hierarchy (eg publication > pages). These are embedded in web pages but not exposed in the API. The main point is that when it comes to data-driven research, digitised books, journals, manuscripts, images, and maps are second-class citizens, trailing far behind the newspapers in research possibilities. There needs to be a thorough stocktake of available metadata, and a plan to make this available in machine actionable form.
Standardise the delivery of text, images, and PDFs and provide download links through the API. As noted above, digitised resources are treated differently depending on where they sit in Trove. There are no standard mechanisms for downloading the products of digitisation, such as OCRd text and images. OCRd text is available directly through the API for newspaper and journal articles, but to download text from a book or journal issue you need to hack the download mechanisms from the web interface. Links to these should be included in the API. Similarly, machine access to images requires various hacks and workarounds. There should be a consistent approach that allows researchers to compile image datasets from digitised resources using the API. Ideally IIIF standard APIs should be used for the delivery of images and maps. This would enable the use of the growing ecosystem of IIIF compliant tools for integration, analysis, and annotation.
Provide an option to exclude search results in tags and comments. The Trove advanced search used to give you the option of excluding search results which only matched tags or comments, rather than the content of the resource. Back when I was working at Trove, the IT folks said this feature would be added to a future version of the API, but instead it disappeared from the web interface with the 2020 update! Why is this important? If you’re trying to analyse the occurrence of search terms within a collection, such as Trove’s digitised newspapers, you want to be sure that the result reflects the actual content, and not a recent annotation by Trove users.
Finally add the People & Organisations data to the main API. Trove’s People & Organisations section was ahead of the game in providing machine-readable access, but the original API is out-of-date and uses a completely different query language. Some work was done on adding it to the main RESTful API, but it was never finished. With a bit of long-overdue attention, the People & Organisations data could power new ways of using and linking biographical resources.
Improve web archives CDX API. Although the NLA does little to inform researchers of the possibilities, the web archives software it uses (Pywb) includes some baked-in options for retrieving machine-readable data. This includes support for the Memento protocol, and the provision of a CDX API that delivers basic metadata about individual web page captures. The current CDX API has some limitations (documented here). In particular, there’s no pagination of results, and no support for domain-level queries. Addressing these limitations would make the existing CDX API much more useful.
Provide new data sources for web archives analysis. There needs to be a constructive, ongoing discussion about the types of data that could be extracted and shared from the Australian web archive. For example, a search API, or downloadable datasets of word frequencies. The scale is a challenge, but some pilot studies could help us all understand both the limits and the possibilities.
Provide a Write API for annotations. Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources. Indeed, this would create exciting possibilities for embedding Trove resources within systems of scholarly analysis, allowing insights gained through research to be automatically fed back into Trove to enhance discovery and understanding.
Provide historical statistics on Trove resources. It’s important for researchers to understand how Trove itself changes over time. There used to be a page that provided regularly-updated statistics on the number of resources and user annotations, but this was removed by the interface upgrade in 2020. I’ve started harvesting some basic stats relating to Trove newspapers, but access to general statistics should be reinstated.
Reassess key authentication and account limits. DigitalNZ recently changed their policy around API authentication, allowing public access without a key. Authentication requirements hinder exploration and limit opportunities for using the API in teaching and workshops. Similarly, I don’t think the account usage limits have been changed since the API was released, even though the capacity of the systems has increased. It seems like time that both of these were reassessed.
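
To give a sense of what the IIIF wish in the list above would mean in practice: the IIIF Image API addresses every image through a predictable URL pattern, so a tool only needs an identifier to request a whole image, a thumbnail, or a cropped region. Here is a minimal Python sketch of that pattern. The base URL and the object identifier are invented for illustration; they are not real Trove endpoints.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API URL of the form
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    # Identifiers are URL-encoded, including any slashes they contain.
    safe_id = quote(identifier, safe="")
    base = base_url.rstrip("/")
    return f"{base}/{safe_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical base URL and identifier, for illustration only.
BASE = "https://iiif.example.org/image"
OBJECT_ID = "nla.obj-123456789"

full_page = iiif_image_url(BASE, OBJECT_ID)                          # whole image
thumbnail = iiif_image_url(BASE, OBJECT_ID, size="200,")             # 200px wide
detail = iiif_image_url(BASE, OBJECT_ID, region="0,0,1000,1000")     # x,y,w,h crop
```

Because the pattern is the same for every compliant server, a researcher compiling an image dataset would only need a list of identifiers from the API, not a separate workaround for each type of resource.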

Ok, I’ll admit, that’s a pretty long list, and not everything can be done immediately! I think this would be a good opportunity for the NLA to develop and share an API and Data Roadmap, that is regularly updated, and invites comments and suggestions. This would help researchers plan for future projects, and build a case for further government investment.
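
One item from the list, the CDX API, lends itself to a concrete sketch. The Python below builds the kind of request that pagination and domain-level matching would allow, and parses a newline-delimited JSON response. The parameter names (`url`, `output`, `matchType`, `page`, `pageSize`) follow Pywb's CDX server documentation as best understood here; whether the web archive would expose them in this form is an assumption. The endpoint address is a placeholder and the sample response is made up.

```python
import json
from urllib.parse import urlencode

# Placeholder endpoint: substitute the real CDX address of the web archive.
CDX_ENDPOINT = "https://webarchive.example.org/cdx"


def build_cdx_query(url, match_type="exact", page=None, page_size=None):
    """Build a CDX query URL. Parameter names are assumed from Pywb's docs."""
    params = {"url": url, "output": "json"}
    if match_type != "exact":
        params["matchType"] = match_type   # e.g. "prefix", "host", "domain"
    if page is not None:
        params["page"] = page              # zero-based page of results
    if page_size is not None:
        params["pageSize"] = page_size
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_lines(text):
    """Parse newline-delimited JSON, assuming one capture record per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# A domain-level, paginated query of the kind the wishlist asks for.
query = build_cdx_query("nla.gov.au", match_type="domain", page=0, page_size=500)

# Made-up sample response, for illustration only.
sample = '{"urlkey": "au,gov,nla)/", "timestamp": "20200101000000", "status": "200"}\n'
captures = parse_cdx_lines(sample)
```

With pagination in place, a harvester could simply increment `page` until an empty response comes back, rather than requesting every capture in one go.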

Integration

Unbreak Zotero integration. The 2020 interface upgrade broke the existing Zotero integration and there’s no straightforward way of fixing it without changes at the Trove end. Zotero used to be able to capture search results, metadata and images from most of the zones in Trove. Now it can only capture individual newspaper articles. This greatly limits the ability of researchers to assemble and manage their own research collections. More generally, a program to examine and support Zotero integration across the GLAM sector would be a useful way of spending some research infrastructure dollars.
Provide useful page metadata. Zotero is just one example of a tool that can extract structured metadata from web pages. Such metadata supports reuse and integration, without the need for separate API requests. Only Trove’s newspaper articles currently provide embedded metadata. Libraries used to lead the way in promoting the use of standardised, structured, embedded page metadata (Dublin Core anyone?), but now?
Explore annotation frameworks. I’ve mentioned the possibility of a Write API for annotations above, but there are other possibilities for supporting web scale annotations, such as Hypothesis. Again, the current Trove interface makes the use of Hypothesis difficult, and again this sort of integration would be usefully assessed across the whole GLAM sector.
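
To illustrate what embedded page metadata makes possible, the sketch below uses Python's standard `html.parser` module to pull Dublin Core style `<meta>` tags out of a page, much as a tool like Zotero might. The sample HTML is made up for the example; it is not copied from Trove.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            # A field such as DC.creator can repeat, so keep a list of values.
            self.metadata.setdefault(name, []).append(content)


# Made-up page, for illustration only.
SAMPLE_HTML = """
<html><head>
<meta charset="utf-8">
<meta name="DC.title" content="An example digitised book">
<meta name="DC.creator" content="Example, Author">
<meta name="DC.date" content="1901">
</head><body></body></html>
"""

parser = MetaTagParser()
parser.feed(SAMPLE_HTML)

# Keep only the Dublin Core fields.
dc_fields = {k: v for k, v in parser.metadata.items() if k.startswith("DC.")}
```

The point is how little code is needed when the metadata is in the page: no API key, no second request, and no scraping of the visible layout.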

Tools & interfaces

Obviously any discussion of new tools or interfaces needs to start by looking at what’s already available. This is difficult when the NLA won’t even link to existing resources such as the GLAM Workbench. Sharing information about existing tools needs to be the starting point from which to plan investment in the Trove Researcher Platform. From there we can identify gaps and develop processes and collaborations to meet specific research needs. Here’s a list of some Trove-related tools and resources currently available through the GLAM Workbench.

Update (18 May): some extra bonus bugs

I forgot to add these annoying bugs:

The newspaper endpoint returns both newspaper and gazette titles, even though there’s a separate gazette endpoint. This forces you to do silly workarounds like this in the GLAM Workbench.
The list zone has recurring problems. At the moment it’s impossible to harvest a complete set of Trove lists.
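
The newspaper and gazette workaround boils down to a set difference: get the titles from both endpoints, then drop anything in the newspaper list whose identifier also appears in the gazette list. A minimal sketch in Python, using made-up records; the field names and values are illustrative assumptions, not real API output, and this is not necessarily the exact code used in the GLAM Workbench.

```python
def exclude_gazettes(newspaper_titles, gazette_titles):
    """Return the newspaper titles that are not also listed as gazettes."""
    gazette_ids = {title["id"] for title in gazette_titles}
    return [title for title in newspaper_titles if title["id"] not in gazette_ids]


# Made-up results standing in for the two endpoints.
newspaper_endpoint = [
    {"id": "1", "title": "Example Daily News"},
    {"id": "2", "title": "Example Government Gazette"},
    {"id": "3", "title": "Example Weekly Times"},
]
gazette_endpoint = [
    {"id": "2", "title": "Example Government Gazette"},
]

newspapers_only = exclude_gazettes(newspaper_endpoint, gazette_endpoint)
```

It works, but it costs an extra request and an extra step that would be unnecessary if the newspaper endpoint returned only newspapers.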

Tim Sherratt
