Do you want your Trove newspaper articles in bulk? Meet the new Trove Newspaper Harvester Python package!

The Trove Newspaper Harvester has been around in different forms for more than a decade. It helps you download all the articles in a Trove newspaper search, opening up new possibilities for large-scale analysis. You can use it as a command-line tool by installing a Python package, or through the Trove Newspaper Harvester section of the GLAM Workbench.

I’ve just overhauled development of the Python package. The new trove-newspaper-harvester replaces the old troveharvester repository. The command-line interface remains the same (with a few new options), so it’s really a drop-in replacement. Read the full documentation of the new package for more details.

Screenshot of the trove-newspaper-harvester documentation describing its use as a Python library.

Here’s a summary of the changes:

  • the package can now be used as a library (that you incorporate into your own code) as well as a standalone command-line tool – this means you can embed the harvester in your own tools or workflows (there’s a rough sketch of library use after this list)
  • both the library and the CLI now let you set the names of the directories in which your harvests will be saved – this makes it easier to organise your harvests into groups and give them meaningful names
  • the harvesting process now saves results into a newline-delimited JSON file (one JSON object per line) – the library has a save_csv() option to convert this to a CSV file, while the CLI automatically converts the results to CSV to maintain compatibility with previous versions
  • behind the scenes, the package is now developed and maintained using nbdev – this means the code and documentation are all generated from a set of Jupyter notebooks
  • the Jupyter notebooks include a variety of automatic tests which should make maintenance and development much easier in the future
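Here’s a rough sketch of what using the package as a library might look like. Apart from save_csv(), the names below (the import path, the Harvester class, and its parameters) are assumptions based on a quick read of the documentation, so check the package docs for the current API.

```python
# A minimal sketch of using the harvester as a library rather than via the CLI.
# The import path, class name, and parameter names are assumptions – check the
# trove-newspaper-harvester documentation for the current API.
from trove_newspaper_harvester.core import Harvester, prepare_query

# Convert a search url from the Trove web interface into API query parameters
params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=%22inclement%20wragge%22"
)

harvester = Harvester(
    query_params=params,
    key="YOUR_TROVE_API_KEY",
    data_dir="harvests",          # assumed names for the new directory options
    harvest_dir="wragge-weather",
    text=True,                    # also save the OCRd text of each article
)
harvester.harvest()

# Results are saved as newline-delimited JSON; convert them to CSV if needed
harvester.save_csv()
```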

I’ve also updated the Trove Newspaper Harvester section of the GLAM Workbench to use the new package. The new core library will make it easier to develop more complex harvesting examples – for example, searching for articles from a specific day across a range of years. If you find any problems, or want to suggest improvements, please raise an issue.

From 48 PDFs to one searchable database – opening up the Tasmanian Post Office Directories with the GLAM Workbench

A few weeks ago I created a new search interface to the NSW Post Office Directories from 1886 to 1950. Since then, I’ve used the same process on the Sydney Telephone Directories from 1926 to 1954. Both of these publications had been digitised by the State Library of NSW and made available through Trove. To build the new interfaces I downloaded the text from Trove, indexed it by line, and linked it back to the online page images.

But there are similar directories from other states that are not available through Trove. The volumes of the Tasmanian Post Office Directory from 1890 to 1948, for example, have been digitised and made available as 48 individual PDF files by Libraries Tasmania. While it’s great that they’ve been digitised, it’s not really possible to search them without downloading all the PDFs.

As part of the Everyday Heritage project, Kate Bagnall and I are working on mapping Tasmania’s Chinese history – finding new ways of connecting people and places. The Tasmanian Post Office Directories will be a useful source for us, so I thought I’d try converting them into a database as I had with the NSW directories. But how?

There were several stages involved:

  • Downloading the 48 PDF files
  • Extracting the text and images from the PDFs
  • Making the separate images available online so they could be integrated with the search interface
  • Loading all the text and image links into a SQLite database for online delivery using Datasette

And here’s the result!

Screenshot of search interface for the Tasmanian Post Office Directories.

Search for people and places in Tasmania from 1890 to 1948!

The complete process is documented in a series of notebooks, shared through the brand new Libraries Tasmania section of the GLAM Workbench. As with the NSW directories, the processing pipeline I developed could be reused with similar publications in PDF form. Any suggestions?

Some technical details

There were some interesting challenges in connecting up all the pieces. Extracting the text and images from the PDFs was remarkably easy using PyMuPDF, but the quality of the text wasn’t great. In particular, I had trouble with columns – values from neighbouring columns would be munged together, upsetting the order of the text. I tried working with the positional information provided by PyMuPDF to improve column detection, but every improvement seemed to raise another issue. I was also worried that too much processing might result in some text being lost completely.
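If you’re curious, the extraction step looks something like the sketch below. It’s not the exact code from the notebooks – paths, file names, and the rendering resolution are placeholders – but it shows the basic PyMuPDF calls involved.

```python
# Sketch: extract text and page images from a PDF with PyMuPDF (fitz).
# File names, paths, and the rendering resolution are placeholders.
from pathlib import Path

import fitz  # PyMuPDF

Path("text").mkdir(exist_ok=True)
Path("images").mkdir(exist_ok=True)

doc = fitz.open("tasmanian-post-office-directory-1890.pdf")
for page in doc:
    # Plain text extraction – this is where the column-order problems appear
    Path(f"text/page-{page.number + 1:04d}.txt").write_text(page.get_text("text"))

    # Render each page as an image for re-OCRing and IIIF delivery
    # (the original scans could also be pulled out with page.get_images())
    pix = page.get_pixmap(dpi=300)
    pix.save(f"images/page-{page.number + 1:04d}.png")
```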

I tried a few experiments re-OCRing the images with Textract (a paid service from Amazon) and Tesseract. The basic Textract product provides good OCR, but again I needed to work with the positional information to try and reassemble the columns. On the other hand, Tesseract’s automatic layout detection seemed to work pretty well with just the default settings. It wasn’t perfect, but good enough to support search and navigation. So I decided to re-OCR all the images using Tesseract. I’m pretty happy with the result.
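The re-OCRing step is straightforward – something like the sketch below, using the pytesseract wrapper. Whether you call Tesseract this way or via its command line is up to you; treat the paths here as placeholders rather than the exact code I ran.

```python
# Sketch: re-OCR page images with Tesseract's default layout detection.
# Assumes pytesseract and Pillow are installed and tesseract is on the PATH.
from pathlib import Path

import pytesseract
from PIL import Image

Path("ocr").mkdir(exist_ok=True)

for image_path in sorted(Path("images").glob("*.png")):
    # Default page segmentation handles the directory columns reasonably well
    text = pytesseract.image_to_string(Image.open(image_path))
    Path(f"ocr/{image_path.stem}.txt").write_text(text)
```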

The search interfaces for the NSW directories display page images loaded directly from Trove into an OpenSeadragon viewer. The Tasmanian directories have no online images to integrate in this way, so I had to set up some hosting for the images I extracted from the PDFs. I could have just loaded them from an Amazon S3 bucket, but I wanted to use IIIF to deliver the images. Fortunately there’s a great project that uses Amazon’s Lambda service to provide a Serverless IIIF Image API. To prepare the images for IIIF, you convert them to pyramidal TIFFs (a format that contains an image at a number of different resolutions) using VIPS. Then you upload the TIFFs to an S3 bucket and point the Serverless IIIF app at the bucket. There are more details in this notebook. It’s very easy and seems to deliver images amazingly quickly.
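Converting an image to a pyramidal TIFF with VIPS is close to a one-liner – here’s a sketch using the pyvips bindings. Treat the tile size, compression, and quality values as placeholders; see the notebook linked above for the details of what I actually did.

```python
# Sketch: convert extracted page images to pyramidal (tiled) TIFFs for
# Serverless IIIF. The tile size, compression, and quality are placeholders.
from pathlib import Path

import pyvips

Path("tiffs").mkdir(exist_ok=True)

for image_path in sorted(Path("images").glob("*.png")):
    image = pyvips.Image.new_from_file(str(image_path), access="sequential")
    image.tiffsave(
        str(Path("tiffs") / f"{image_path.stem}.tif"),
        tile=True,
        pyramid=True,
        compression="jpeg",
        Q=75,
        tile_width=256,
        tile_height=256,
    )
```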

The rest of the pipeline followed the process I used with the NSW directories – using sqlite-utils and Datasette to package the data and deliver it online via Google Cloudrun.
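The packaging step boils down to something like this sketch – the table layout, column names, and sample row are simplified placeholders, not the real schema.

```python
# Sketch: load line-by-line text into SQLite and enable full-text search.
# The table layout, column names, and sample row are simplified placeholders.
import sqlite_utils

db = sqlite_utils.Database("tasmanian-post-office-directories.db")

rows = [
    {"volume": "1890", "page": 12, "line": 3,
     "text": "Brown William, baker, Elizabeth St", "image": "page-0012"},
    # ... one dict per line of OCRd text
]
db["lines"].insert_all(rows)

# Index the text column so Datasette's search box can query it
db["lines"].enable_fts(["text"])
```

The resulting database is what gets published online with Datasette.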

Postscript: Time and money

I thought I should add a little note about costs (time and money) in case anyone was interested in using this workflow on other publications. I started working on this on Sunday afternoon and had a full working version up about 24 hours later – that includes a fair bit of work that I didn’t end up using, but doesn’t include the time I spent re-OCRing the text a day or so later. This was possible because I was reusing bits of code from other projects, and taking advantage of some awesome open-source software. Now that the processing pipeline is pretty well-defined and documented it should be even faster.

The search interface uses cloud services from Amazon and Google. It’s a bit tricky to calculate the precise costs of these, but here’s a rough estimate.

I uploaded 63.9gb of images to Amazon S3. At standard S3 storage rates (roughly US$0.023 per gb per month), these should cost about US$1.47 per month to store.

The Serverless IIIF API uses Amazon’s Lambda service. At the moment my usage is within the free tier, so $0 so far.

The Datasette instance uses Google Cloudrun. Costs for this service are based on a combination of usage, storage space, and the configuration of the environment. The size of the database for the Tasmanian directories is about 600mb, so I can get away with 2gb of application memory. (The NSW Post Office directory currently uses 8gb.) These services scale to zero – so basically they shut down if they’re not being used. This saves a lot of money, but means there can be a pause if they need to start up again. I’m running the Tasmanian and NSW directories, as well as the GLAM Name Index search, within the same Google Cloud account, and I’m not quite sure how to itemise the costs. But overall, it’s costing me about US$4.00 a month to run them all. Of course if usage increases, so will the costs!

So I suppose the point is that these sorts of approaches can be quite a practical and cost-effective way of improving access to digitised resources, and don’t need huge investments in time or infrastructure.

If you want to contribute to the running costs of the NSW and Tasmanian directories you can sponsor me on GitHub.

Fresh harvest of OCRd text from Trove's digitised periodicals – 9gb of text to explore and analyse!

I’ve updated the GLAM Workbench’s harvest of OCRd text from Trove’s digitised periodicals. This is a completely fresh harvest, so should include any corrections made in recent months. It includes:

  • 1,430 periodicals
  • OCRd text from 41,645 issues
  • About 9gb of text

The easiest way to explore the harvest is probably this human-readable list. The list of periodicals with OCRd text is also available as a CSV. You can find more details in the Trove journals section of the GLAM Workbench, and download the complete corpus from CloudStor.

Finding which periodical issues in Trove have OCRd text you can download is not as easy as it should be. The fullTextInd index doesn’t seem to distinguish between digitised works (with OCR) and born-digital publications (like PDFs) that have no downloadable text. You can use has:correctabletext to find articles with OCR, but you can’t get a full list of the periodicals the articles come from using the title facet. As this notebook explains, you can search for nla.obj, but this returns both digitised works and publications supplied through edeposit. In previous harvests of OCRd text I processed all of the titles returned by the nla.obj search, finding out whether there was any OCRd text by just requesting it and seeing what came back. But the number of non-digitised works on the list of periodicals in digital form has skyrocketed through the edeposit scheme, and this approach is no longer practical. It just means you waste a lot of time asking for things that don’t exist.

For the latest harvest I took a different approach. I only processed periodicals in digital form that weren’t identified as coming through edeposit. These are the publications with a fulltext_url_type value of either ‘digitised’ or ‘other’ in my dataset of digital periodicals. Is it possible that there’s some downloadable text in edeposit works that’s now missing from the harvest? Yep, but I think this is a much more sensible, straightforward, and reproducible approach.
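In code, that selection is just a filter on the fulltext_url_type column – something like this, where the CSV file name is a placeholder for the dataset of periodicals in digital form.

```python
# Sketch: select periodicals whose text comes from digitisation rather than
# edeposit, using the fulltext_url_type column described above.
# The file name is a placeholder for the dataset of periodicals in digital form.
import pandas as pd

df = pd.read_csv("periodicals-in-digital-form.csv")

# Keep titles identified as 'digitised' or 'other' and drop edeposit works
to_harvest = df.loc[df["fulltext_url_type"].isin(["digitised", "other"])]
print(f"{len(to_harvest)} periodicals to process")
```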

That’s not the only problem. As I noted when creating the list of periodicals in digital form, there are duplicates in the list, so they have to be removed. You then have to find information about the issues available for each title. This is not provided by the Trove API, but there is an internal API used in the web interface that you can access – see this notebook for details. I also noticed that sometimes, where a title has only a single issue, it’s presented as if each page is an issue. I think I’ve found a workaround for that as well.

All these doubts, inconsistencies and workarounds mean that I’m fairly certain I don’t have everything. But I do think I have most of the OCRd text available from digitised periodicals, and I do have a methodology, documented in this notebook, that at least provides a starting point for further investigation. As I noted in my wishlist for a Trove Researcher Platform, it would be great if more metadata for digitised works, other than newspapers, was made available through the API.

Explore Trove's digitised newspapers by place

I’ve updated my map displaying places where Trove’s digitised newspapers were published or distributed. You can view all the places on a single map – zoom in for more markers, and click on a marker for title details and a link back to Trove.

A map of Australia with coloured markers indicating the number of Trove’s digitised newspapers published in different locations around the country.

If you want to find newspapers from a particular area, just click on a location using this map to view the 10 closest titles.

A map section focused on Walhalla in eastern Victoria with markers indicating nearby places where Trove’s digitised newspapers were published. A column on the right lists the newspaper titles.

You can view or download the dataset used to construct the map. Place names were extracted from the newspaper titles using the Geoscience Gazetteer.

Making NSW Postal Directories (and other digitised directories) easier to search with the GLAM Workbench and Datasette

As part of my work on the Everyday Heritage project I’m looking at how we can make better use of digitised collections to explore the everyday experiences woven around places such as Parramatta Road in Sydney. For example, the NSW Postal Directories from 1886 to 1908 and 1909 to 1950 have been digitised by the State Library of NSW and made available through Trove. The directories list residences and businesses by name and street location. Using them we can explore changes in the use of Parramatta Road across 60 years of history.

But there’s a problem. While you can browse the directories page by page, searching is clunky. Trove’s main search indexes the contents of the directories by ‘article’. Each ‘article’ can be many pages long, so it’s difficult to focus on the matching text. Clicking through from the search results to the digitised volume lands you in another set of search results, showing all the matches in the volume. However, the internal search index works differently to the main Trove index. In particular it doesn’t seem to understand phrase or boolean searches. If you start off searching for “parramatta road”, Trove tells you there’s 50 matching articles, but if you click through to a volume you’re told there’s no results. If you remove the quotes you get every match for ‘parramatta’ or ‘road’. It’s all pretty confusing.

The idea of ‘articles’ is really not very useful for publications like the Post Office Directories where information is mostly organised by column, row or line. You want to be able to search for a name, and go directly to the line in the directory where that name is mentioned. And now you can! Using Datasette, I’ve created an interface that searches by line across all 54 volumes of the NSW Post Office Directory from 1886 to 1950 (that’s over 30 million lines of text).

Screenshot of home page for the NSW Post Office Directories. There’s a search box to search across all 54 volumes, some search tips, and a summary of the data that notes there are more than 30 million rows.

Try it now!

Basic features

  • The full text search supports phrase, boolean, and wildcard searches. Just enter your query in the main search box to get results from all 54 volumes in a flash.

  • Each search result is a single line of text. Click on the link to view this line in context – it’ll show you 5 lines above and below your match, as well as a zoomable image of the digitised page from Trove.

  • For more context, you can click on View full page to see all the lines of text extracted from that page. You can then use the Next and Previous buttons to browse page by page.

  • To view the full digitised volume, just click on the View page in Trove button.

Screenshot of information about a single row in the NSW Post Office Directories. The row is highlighted in yellow, and displayed in context with five rows above and below. There’s a button to view the full page, and a box displaying a zoomable image of the page from Trove.

How it works

There were a few stages involved in creating this resource, but mostly I was able to reuse bits of code from the GLAM Workbench’s Trove journals and books sections, and other related projects such as the GLAM Name Index Search. Here’s a summary of the processing steps:

  • I started with the two top-level entries for the NSW Postal Directories, harvesting details of the 54 volumes under them.
  • For each of these 54 volumes, I downloaded the OCRd text page by page. Downloading the text by page, rather than volume, was very slow, but I thought it was important to be able to link each line of text back to its original page.
  • To create links back to pages, I also needed the individual identifiers for each page. A list of page identifiers is embedded as a JSON string within each volume’s HTML, so I extracted this data and matched the page ids to the text (there’s a rough sketch of this step after the list).
  • Using sqlite-utils, I created a SQLite database with a separate table for every volume. Then I processed the text by volume, page, and line – adding each line of text and its page details as an individual record in the database.
  • I then ran full text indexing across each line to make it easily searchable.
  • Using Datasette and its search-all plugin, I loaded up the database and BINGO! More than 30 million lines of text across 54 digitised volumes were instantly searchable.
  • To make it all public, I used Datasette’s publish function to push the database to Google’s Cloudrun service.
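To give a sense of the page-identifier step mentioned above, here’s a rough sketch. Treat the regex marker and the assumed JSON structure as approximations – the code linked below is the authoritative version.

```python
# Sketch: pull the list of page identifiers out of the JSON embedded in a
# digitised volume's HTML. The regex marker and the JSON structure shown here
# are approximations – see the GLAM Workbench code linked below for the real thing.
import json
import re

import requests

volume_url = "https://nla.gov.au/nla.obj-0000000000"  # placeholder volume identifier
html = requests.get(volume_url).text

match = re.search(r"var work = JSON\.parse\(JSON\.stringify\((\{.*\})\)\)", html)
if match:
    work = json.loads(match.group(1))
    # Assumed structure: a list of child page records, each with a 'pid' identifier
    pages = work.get("children", {}).get("page", [])
    page_ids = [page["pid"] for page in pages]
    print(f"Found {len(page_ids)} pages")
```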

All the code is available in the journals section of the GLAM Workbench.

Future developments

One of the most exciting things to me is that this processing pipeline can be used with any digitised publication on Trove where it would be easier to search by line rather than article. Any suggestions?

Interested in Victorian shipwrecks? Kim Doyle and Mitchell Harrop have added a new notebook to the Heritage Council of Victoria section of the GLAM Workbench exploring shipwrecks in the Victorian Heritage Database: glam-workbench.net/heritage-…

Updates!

Minor update to RecordSearch Data Scraper – now captures ‘institution title’ for agencies if it is present. pypi.org/project/r…

Many thanks to the British Library – sponsors of the GLAM Workbench’s web archives section!

You might have noticed some changes to the web archives section of the GLAM Workbench.

Screenshot of the web archives section showing the acknowledgement of the British Library's sponsorship.

I’m very excited to announce that the British Library is now sponsoring the web archives section! Many thanks to the British Library and the UK Web Archive for their support – it really makes a difference.

The web archives section was developed in 2020 with the support of the International Internet Preservation Consortium’s Discretionary Funding Programme, in collaboration with the British Library, the National Library of Australia, and the National Library of New Zealand. It’s intended to help historians, and other researchers, understand what sort of data is available through web archives, how to get it, and what you can do with it. It provides a series of tools and examples that document existing APIs, and explore questions such as how web pages change over time. The notebooks focus on four particular web archives: the UK Web Archive, the Australian Web Archive (National Library of Australia), the New Zealand Web Archive (National Library of New Zealand), and the Internet Archive. However, the tools and approaches could be easily extended to other web archives (and soon will be!). I introduced the web archives section of the GLAM Workbench in this seminar for the IIPC in August 2020:

According to the Binder launch stats, the web archives section is the most heavily used part of the GLAM Workbench. In December 2020, it won the British Library Labs Research Award. Last year I updated the repository, automating the build of Docker images, and adding integrations with Zenodo, Reclaim Cloud, and Australia’s Nectar research cloud. I’m also thinking about some new notebooks – watch this space!

The GLAM Workbench receives no direct funding from government or research agencies, and so the support of sponsors like the British Library and all my other GitHub sponsors is really important. Thank you! If you think this work is valuable, have a look at the Get involved! page to see how you can contribute. And if your organisation would like to sponsor a section of the GLAM Workbench, let me know!

New GLAM data to search, visualise and explore using the GLAM Workbench!

There’s lots of GLAM data out there if you know where to look! For the past few years I’ve been harvesting a list of datasets published by Australian galleries, libraries, archives, and museums through open government data portals. I’ve just updated the harvest and there’s now 463 datasets containing 1,192 files. There’s a human-readable version of the list that you can browse. If you just want the data you can download it as a CSV. Or if you’d like to search the list there’s a database version hosted on Glitch. The harvesting and processing code is available in this notebook.

The GLAM data from government portals section of the GLAM Workbench provides more information and a summary of results. For example, here’s a list of the number of data files by GLAM institution.

Table showing the number of datasets published by each GLAM organisation. Queensland State Archives is on top with 108 datasets.

Most of the datasets are in CSV format, and most have a CC-BY licence.

What’s inside?

Obviously it’s great that GLAM organisations are sharing lots of open data, but what’s actually inside all of those CSV files? To help you find out, I created the GLAM CSV Explorer. Click on the blue button to run it in Binder, then just select a dataset from the dropdown list. The CSV Explorer will download the file, and examine the contents of every field to try and determine the type of data it holds – such as text, dates, or numbers. It then summarises the results and builds a series of visualisations to give you an overview of the dataset.
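The field analysis is basically the sort of thing sketched below – a much-simplified version of the idea, not the Explorer’s actual code, and the file name is a placeholder.

```python
# Sketch: guess whether a CSV column holds numbers, dates, or text by checking
# how much of it pandas can convert. A simplified version of the idea only.
import pandas as pd

def guess_column_type(series: pd.Series, threshold: float = 0.8) -> str:
    values = series.dropna().astype(str)
    if values.empty:
        return "empty"
    if pd.to_numeric(values, errors="coerce").notna().mean() >= threshold:
        return "number"
    if pd.to_datetime(values, errors="coerce").notna().mean() >= threshold:
        return "date"
    return "text"

df = pd.read_csv("some-glam-dataset.csv")  # placeholder file name
for column in df.columns:
    print(column, guess_column_type(df[column]))
```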

Screenshot of GLAM CSV Explorer. A series of dropdown boxes allow you to select a dataset to analyse.

Search for names

Many of the datasets are name indexes to collections of records – GLAM staff or volunteers have transcribed the names of people mentioned in records as an aid to users. For Family History Month last year I aggregated all of the name indexes and made them searchable through a single interface using Datasette. The GLAM Name Index Search has been updated as well – it searches across 10.3 million records in 253 indexes from 10 GLAM organisations. And it’s free!

Screenshot of GLAM Name Index Search showing the list of GLAM organisations with the number of tables and rows you can search from each.

And a bit of maintenance…

As well as updating the data, I also updated the code repository, adding the features that I’m rolling out across the whole of the GLAM Workbench. This includes automated Docker builds saved to Quay.io, integrations with Reclaim Cloud and Zenodo, and some basic quality controls through testing and code format checks.

Zotero now saves links to digitised items in Trove from the NLA catalogue!

I’ve made a small change to the Zotero translator for the National Library of Australia’s catalogue. Now, if there’s a link to a digitised version of the work in Trove, that link will be saved in Zotero’s url field. This makes it quicker and easier to view digitised items – just click on the ‘URL’ label in Zotero to open the link.

It’s also handy if you’re viewing a digitised work in Trove and want to capture the metadata about it. Just click on the ‘View catalogue’ link in the details tab of a Trove item, then use Zotero to save the details from the catalogue.

View embedded JSON metadata for Trove's digitised books and journals

The metadata for digitised books and journals in Trove can seem a bit sparse, but there’s quite a lot of useful metadata embedded within Trove’s web pages that isn’t displayed to users or made available through the Trove API. This notebook in the GLAM Workbench shows you how you can access it. To make it even easier, I’ve added a new endpoint to my Trove Proxy that returns the metadata in JSON format.

Just pass the url of a digitised book or journal as a parameter named url to https://trove-proxy.herokuapp.com/metadata/. For example:

https://trove-proxy.herokuapp.com/metadata/?url=https://nla.gov.au/nla.obj-2906940941
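You can also call the endpoint from a script – a quick example using requests (the ‘title’ field is one of those shown in the screenshot below):

```python
# Fetch the embedded metadata for a digitised work via the Trove Proxy
import requests

response = requests.get(
    "https://trove-proxy.herokuapp.com/metadata/",
    params={"url": "https://nla.gov.au/nla.obj-2906940941"},
)
metadata = response.json()
print(metadata["title"])
```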

Screenshot of the collapsed JSON metadata returned from the url above. It includes fields such as 'title', 'accessConditions', 'marcData', and 'children'.

I’ve created a simple bookmarklet to make it easier to open the proxy. To use it just:

  • Drag this link to your bookmarks toolbar: Get Trove work metadata
  • View a digitised book or journal in Trove.
  • Click on the bookmarklet to view the metadata in JSON format.

To view the JSON data in your browser you might need to install an extension like JSONView.

Where did all those NSW articles go? Trove Newspapers Data Dashboard update!

I was looking at my Trove Newspapers Data Dashboard again last night trying to figure out why the number of newspaper articles from NSW seemed to have dropped by more than 700,000 since my harvesting began. It took me a while to figure out, but it seems that the search index was rebuilt on 31 May, and that caused some major shifts in the distribution of articles by state, as reported by the main result API. So the indexing of the articles changed, not the actual number of articles. Interestingly, the number of articles by state reported by the newspaper API doesn’t show the same fluctuations.

Screenshot of data dashboard that compares the number of articles by state as reported by the results and newspapers APIs. There are major differences in the column that shows the change since April 2022.

This adds another layer of complexity to understanding how Trove changes over time. To try and document such things, I’ve added a ‘Significant events’ section to the Dashboard. I’ve also included a new ‘Total articles by publication state’ section that compares results from the result and newspaper APIs. This should make it easier to identify such issues in the future.

Stay alert people – remember, search interfaces lie!

Catching up – some recent GLAM Workbench updates!

There’s been lots of small updates to the GLAM Workbench over the last couple of months and I’ve fallen behind in sharing details. So here’s an omnibus list of everything I can remember…

Data

  • Weekly harvests of basic Trove newspaper data continue – there’s now about three months’ worth. You can view a summary of the harvested data through the brand new Trove Newspaper Data Dashboard. The Dashboard is generated from a Jupyter notebook and is updated whenever there’s a new data harvest.
  • There are also weekly harvests of files digitised by the NAA – now 16 months’ worth of data.
  • Updated harvest of Trove public tags (Zenodo) – includes 2,201,090 unique public tags added to 9,370,614 resources in Trove between August 2008 and July 2022.
  • I’ve started moving other pre-harvested datasets out of the GLAM Workbench code repositories, into their own data repositories. This means better versioning and citability. The first example is the list of Trove newspapers with articles from after the 1955 copyright cliff of death – here’s the GH repo, and the Zenodo record.
  • To bring together datasets that provide historical data about Trove itself, I’ve created a Trove historical data community on Zenodo. Anyone’s welcome to contribute. There’s much more to come.

Tag cloud showing the frequency of the two hundred most commonly-used tags in Trove.

Tag cloud generated from the latest harvest of Trove Tags

Code

  • Big thanks to Mitchell Harrop who contributed a new Heritage Council of Victoria section to the GLAM Workbench providing examples using the Victorian Heritage Database API.
  • The troveharvester Python package has been updated, mainly to remove annoying Pandas warnings and to make use of the trove-query-parser package.
  • As a result of the above, the Trove Newspaper & Gazette Harvester section of the GLAM Workbench has been updated. No major changes to notebooks, but I’ve implemented basic testing and linting to improve code quality.
  • The Trove newspapers section of the GW has been updated. There were a few bug fixes and minor improvements. In particular there was a problem downloading data and HTML files from QueryPic, and some date queries in QueryPic were returning no results.
  • The tool to download complete, high-res newspaper page images has been updated so that you no longer need to supply an API key. Also fixed a problem with displaying the images in Voilà.
  • The recordsearch_data_scraper Python package has been updated. This fixes a bug where agency and series searches with only one result weren’t being captured properly.
  • The RecordSearch section of the GW has been updated. This incorporates the above update, but I took the opportunity to update all packages, and implement basic testing and linting. The Harvest items from a search in RecordSearch notebook has been simplified and reorganised. There are two new notebooks: Exploring harvested series data, 2022 – generates some basic statistics from the harvest of series data in 2022 and compares the results to the previous year; Summary of records digitised in the previous week – run this notebook to analyse the most recent dataset of recently digitised files, summarising the results by series.
  • A new Zotero translator for Libraries Tasmania has been developed.

Updated dataset! Harvests of Trove list metadata from 2018, 2020, and 2022 are now available on Zenodo: doi.org/10.5281/z… Another addition to the growing collection of historical Trove data. #GLAMWorkbench

Screen capture of version information from Zenodo showing that there are three available versions, v1.0, v1.1, and v1.2.

Coz I love making work for myself, I’ve started pulling datasets out of #GLAMWorkbench code repos & creating new data repos for them. This way they’ll have their own version histories in Zenodo. Here’s the first: github.com/GLAM-Work…

Ahead of my session at #OzHA2022 tomorrow, I’ve updated the NAA section of the #GLAMWorkbench. Come along to find out how to harvest file details, digitised images, and PDFs from a search in RecordSearch! github.com/GLAM-Work…

Noticed that QueryPic was having a problem with some date queries. Should be fixed in the latest release of the Trove Newspapers section of the #GLAMWorkbench: glam-workbench.net/trove-new… #maintenance #researchinfrastructure

The Trove Newspapers section of the #GLAMWorkbench has been updated! Voilà was causing a problem in QueryPic, stopping results from being downloaded. A package update did the trick! Everything now updated & tested. glam-workbench.net/trove-new…

Some more #GLAMWorkbench maintenance – this app to download high-res page images from Trove newspapers now doesn’t require an API key if you have a url, & some display problems have been fixed. trove-newspaper-apps.herokuapp.com/voila/ren…

Screenshot of app – ‘Download a page image’. The Trove web interface doesn’t provide a way of getting high-resolution page images from newspapers. This simple app lets you download page images as complete, high-resolution JPG files.

The Trove Newspaper and Gazette Harvester section of the #GLAMWorkbench has been updated! No major changes to notebooks, just lots of background maintenance stuff such as updating packages, testing, linting notebooks etc. glam-workbench.net/trove-har…

Ordering some #GLAMWorkbench stickers…

Proof image of a hexagonal sticker. The sticker has white lettering on a blue background which reads GLAM Workbench. In the centre is a crossed hammer and wrench icon.

Using Datasette on Nectar

If you have a dataset that you want to share as a searchable online database then check out Datasette – it’s a fabulous tool that provides an ever-growing range of options for exploring and publishing data. I particularly like how easy Datasette makes it to publish datasets on cloud services like Google’s Cloudrun and Heroku. A couple of weekends ago I migrated the TungWah Newspaper Index to Datasette. It’s now running on Heroku, and I can push updates to it in seconds.

I’m also using Datasette as the platform for sharing data from the Sydney Stock Exchange Project that I’m working on with the ANU Archives. There’s a lot of data – more than 20 million rows – but getting it running on Google Cloudrun was pretty straightforward with Datasette’s publish command. The problem was, however, that Datasette is configured to run on most cloud services in ‘immutable’ mode and we want authenticated users to be able to improve the data. So I needed to explore alternatives.

I’ve been working with Nectar over the past year to develop a GLAM Workbench application that helps researchers do things like harvesting newspaper articles from a Trove search. So I thought I’d have a go at setting up Datasette in a Nectar instance, and it works! Here’s a few notes on what I did…

  • First of course you need to get yourself a resource allocation on Nectar. I’ve also got a persistent volume storage allocation that I’m using for the data.
  • From the Nectar Dashboard I made sure that I had an SSH keypair configured, and created a security group to allow access via SSH, HTTP and HTTPS. I also set up a new storage volume.
  • I then created a new Virtual Machine using the Ubuntu 22.04 image, attaching the keypair, security group, and volume storage. For the stock exchange data I’m currently using the ‘m3.medium’ flavour of virtual machine, which provides 8gb of RAM and 4 VCPUs. This might be overkill, but I went with the bigger machine because of the size of the SQLite database (around 2gb). This is similar to what I used on Cloudrun after I ran into problems with the memory limit. I think most projects would run perfectly well using one of the ‘small’ flavours. In any case, it’s easy to resize if you run into problems.
  • Once the new machine was running I grabbed the IP address. Because I have DNS configured on my Nectar project, I also created a ‘datasette’ subdomain from the DNS dashboard by pointing an ‘A’ (alias) record to the IP address.
  • Using the IP address I logged into the new machine via SSH.
  • With all the Nectar config done, it was time to set up Datasette. I mainly just followed the excellent instructions in the Datasette documentation for deploying Datasette using systemd. This involved installing datasette via pip, creating a folder for the Datasette data and configuration files, and creating a datasette.service file for systemd.
  • I also used the datasette install command to add a couple of Datasette plugins. One of these is the datasette-github-auth plugin, which needs a couple of secret tokens set. I added these as environment variables in the datasette.service file.
  • The systemd setup uses Datasette’s configuration directory mode. This means you can put your database, metadata definitions, custom templates and CSS, and any other settings all together in a single directory and Datasette will find and use them. I’d previously passed runtime settings via the command line, so I had to create a settings.json for these.
  • Then I just uploaded all my Datasette database and configuration files to the folder I created on the virtual machine using rsync and started the Datasette service. It worked!
  • The next step was to use the persistent volume storage for my Datasette files. The persistent storage exists independently of the virtual machine, so you don’t need to worry about losing data if there’s a change to the instance. I mounted the storage volume as /pvol in the virtual machine as the Nectar documentation describes.
  • I created a datasette-root folder under /pvol, copied the Datasette files to it, and changed the datasette.service file to point to it. This didn’t seem to work and I’m not sure why. So instead I created a symbolic link between /home/ubuntu/datasette-root and /pvol/datasette-root and set the path in the service file back to /home/ubuntu/datasette-root. This worked! So now the database and configuration files are sitting in the persistent storage volume.
  • To make the new Datasette instance visible to the outside world, I installed nginx, and configured it as a Datasette proxy using the example in the Datasette documentation.
  • Finally I configured HTTPS using certbot.

Although the steps above might seem complicated, it was mainly just a matter of copying and pasting commands from the existing documentation. The new Datasette instance is running here, but this is just for testing and will disappear soon. If you’d like to know more about the Stock Exchange Project, check out the ANU Archives section of the GLAM Workbench.

Convert your Trove newspaper searches to an API query with just one click!

I’m thinking about the Trove Researcher Platform discussions & ways of integrating Trove with other apps and platforms (like the GLAM Workbench).

As a simple demo I modified my Trove Proxy app to convert a newspaper search url from the Trove web interface into an API query (using the trove-query-parser package). The proxy app then redirects you to the Trove API Console so you can see the results of the API query without needing a key.
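If you’d rather do the conversion in your own code, you can use trove-query-parser directly – something like the sketch below, though the import path and function name here are assumptions, so check the package documentation for the current API.

```python
# Sketch: convert a Trove newspaper search url into API query parameters.
# The import path and function name are assumptions – check the
# trove-query-parser documentation for the current API.
from trove_query_parser.parser import parse_query

search_url = "https://trove.nla.gov.au/search/category/newspapers?keyword=%22inclement%20wragge%22"
api_params = parse_query(search_url)
print(api_params)
```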

To make it easy to use, I created a bookmarklet that encodes your current url and feeds it to the proxy. To use it just:

  • Drag this link to your bookmarks toolbar: Open Trove API Console.
  • Run a search in Trove’s newspapers.
  • Click on the bookmarklet.

This little hack provides a bit of ‘glue’ to help researchers think about their search results as data, and explore other possibilities for download and analysis. #dhhacks

My Trove researcher platform wishlist

The ARDC is collecting user requirements for the Trove researcher platform for advanced research. This is a chance to start from scratch, and think about the types of data, tools, or interface enhancements that would support innovative research in the humanities and social sciences. The ARDC will be holding two public roundtables, on 13 and 20 May, to gather ideas. I created a list of possible API improvements in my response to last year’s draft plan, and thought it might be useful to expand that a bit, and add in a few other annoyances, possibilities, and long-held dreams.

My focus is again on the data; this is for two reasons. First because public access to consistent, good quality data makes all other things possible. But, of course, it’s never just a matter of OPENING ALL THE DATA. There will be questions about priorities, about formats, about delivery, about normalisation and enrichment. Many of these questions will arise as people try to make use of the data. There needs to be an ongoing conversation between data providers, research tool makers, and research tool users. This is the second reason I think the data is critically important – our focus should be on developing communities and skills, not products. A series of one-off tools for researchers might be useful, but the benefits will wane. Building tools through networks of collaboration and information sharing based on good quality data offers much more. Researchers should be participants in these processes, not consumers.

Anyway, here’s my current wishlist…

APIs and data

  • Bring the web interface and main public API back into sync, so that researchers can easily transfer queries between the two. The Trove interface update in 2020 reorganised resources into ‘categories’, replacing the original ‘zones’. The API, however, is still organised by zone and knows nothing about these new categories. Why does this matter? The web interface allows researchers to explore the collection and develop research questions. Some of these questions might be answered by downloading data from the API for analysis or visualisation. But, except for the newspapers, there is currently no one-to-one correspondence between searches in the web interface and searches using the API. There’s no way of transferring your questions – you need to start again.

  • Expand the metadata available for digitised resources other than newspapers. In recent years, the NLA has digitised huge quantities of books, journals, images, manuscripts, and maps. The digitisation process has generated new metadata describing these resources, but most of this is not available through the public API. We can get an idea of what’s missing by comparing the digitised journals to the newspapers. The API includes a newspaper endpoint that provides data on all the newspapers in Trove. You can use it to get a list of available issues for any newspaper. There is no comparable way of retrieving a list of digitised journals, or the issues that have been digitised. The data’s somewhere – there’s an internal API that’s used to generate lists of issues in the browse interface and I’ve scraped this to harvest issue details. But this information should be in the public API. Manuscripts are described using finding aids, themselves generated from EAD-formatted XML files, but none of this important structured data is available from the API, or for download. There’s also other resource metadata, such as parent/child relationships between different levels in the object hierarchy (eg publication > pages). These are embedded in web pages but not exposed in the API. The main point is that when it comes to data-driven research, digitised books, journals, manuscripts, images, and maps are second-class citizens, trailing far behind the newspapers in research possibilities. There needs to be a thorough stocktake of available metadata, and a plan to make this available in machine-actionable form.

  • Standardise the delivery of text, images, and PDFs and provide download links through the API. As noted above, digitised resources are treated differently depending on where they sit in Trove. There are no standard mechanisms for downloading the products of digitisation, such as OCRd text and images. OCRd text is available directly through the API for newspaper and journal articles, but to download text from a book or journal issue you need to hack the download mechanisms from the web interface. Links to these should be included in the API. Similarly, machine access to images requires various hacks and workarounds. There should be a consistent approach that allows researchers to compile image datasets from digitised resources using the API. Ideally IIIF standard APIs should be used for the delivery of images and maps. This would enable the use of the growing ecosystem of IIIF-compliant tools for integration, analysis, and annotation.

  • Provide an option to exclude search matches in tags and comments. The Trove advanced search used to give you the option of excluding search results which only matched tags or comments, rather than the content of the resource. Back when I was working at Trove, the IT folks said this feature would be added to a future version of the API, but instead it disappeared from the web interface with the 2020 update! Why is this important? If you’re trying to analyse the occurrence of search terms within a collection, such as Trove’s digitised newspapers, you want to be sure that the result reflects the actual content, and not a recent annotation by Trove users.

  • Finally add the People & Organisations data to the main API. Trove’s People & Organisations section was ahead of the game in providing machine-readable access, but the original API is out-of-date and uses a completely different query language. Some work was done on adding it to the main RESTful API, but it was never finished. With a bit of long-overdue attention, the People & Organisations data could power new ways of using and linking biographical resources.

  • Improve the web archives CDX API. Although the NLA does little to inform researchers of the possibilities, the web archives software it uses (Pywb) includes some baked-in options for retrieving machine-readable data. This includes support for the Memento protocol, and the provision of a CDX API that delivers basic metadata about individual web page captures (there’s a quick example of a CDX query after this list). The current CDX API has some limitations (documented here). In particular, there’s no pagination of results, and no support for domain-level queries. Addressing these limitations would make the existing CDX API much more useful.

  • Provide new data sources for web archives analysis. There needs to be a constructive, ongoing discussion about the types of data that could be extracted and shared from the Australian web archive. For example, a search API, or downloadable datasets of word frequencies. The scale is a challenge, but some pilot studies could help us all understand both the limits and the possibilities.

  • Provide a Write API for annotations. Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources. Indeed, this would create exciting possibilities for embedding Trove resources within systems of scholarly analysis, allowing insights gained through research to be automatically fed back into Trove to enhance discovery and understanding.

  • Provide historical statistics on Trove resources. It’s important for researchers to understand how Trove itself changes over time. There used to be a page that provided regularly-updated statistics on the number of resources and user annotations, but this was removed by the interface upgrade in 2020. I’ve started harvesting some basic stats relating to Trove newspapers, but access to general statistics should be reinstated.

  • Reassess key authentication and account limits. DigitalNZ recently changed their policy around API authentication, allowing public access without a key. Authentication requirements hinder exploration and limit opportunities for using the API in teaching and workshops. Similarly, I don’t think the account usage limits have been changed since the API was released, even though the capacity of the systems has increased. It seems like time that both of these were reassessed.
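As promised in the CDX point above, here’s what a CDX query looks like in practice. The example below uses the Internet Archive’s CDX API; Pywb-based archives like the Australian Web Archive provide a similar endpoint at their own base url (not shown here).

```python
# Example: query the Internet Archive's CDX API for basic capture metadata.
# Pywb-based archives (like the Australian Web Archive) expose a similar CDX
# endpoint at their own base url, which isn't shown here.
import requests

response = requests.get(
    "http://web.archive.org/cdx/search/cdx",
    params={"url": "nla.gov.au", "output": "json", "limit": 5},
)
for row in response.json()[1:]:  # the first row lists the field names
    print(row)
```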

Ok, I’ll admit, that’s a pretty long list, and not everything can be done immediately! I think this would be a good opportunity for the NLA to develop and share an API and Data Roadmap, that is regularly updated, and invites comments and suggestions. This would help researchers plan for future projects, and build a case for further government investment.

Integration

  • Unbreak Zotero integration. The 2020 interface upgrade broke the existing Zotero integration and there’s no straightforward way of fixing it without changes at the Trove end. Zotero used to be able to capture search results, metadata and images from most of the zones in Trove. Now it can only capture individual newspaper articles. This greatly limits the ability of researchers to assemble and manage their own research collections. More generally, a program to examine and support Zotero integration across the GLAM sector would be a useful way of spending some research infrastructure dollars.

  • Provide useful page metadata. Zotero is just one example of a tool that can extract structured metadata from web pages. Such metadata supports reuse and integration, without the need for separate API requests. Only Trove’s newspaper articles currently provide embedded metadata. Libraries used to lead the way in promoting the use of standardised, structured, embedded page metadata (Dublin Core anyone?), but now?

  • Explore annotation frameworks. I’ve mentioned the possibility of a Write API for annotations above, but there are other possibilities for supporting web scale annotations, such as Hypothesis. Again, the current Trove interface makes the use of Hypothesis difficult, and again this sort of integration would be usefully assessed across the whole GLAM sector.

Tools & interfaces

Obviously any discussion of new tools or interfaces needs to start by looking at what’s already available. This is difficult when the NLA won’t even link to existing resources such as the GLAM Workbench. Sharing information about existing tools needs to be the starting point from which to plan investment in the Trove Researcher Platform. From there we can identify gaps and develop processes and collaborations to meet specific research needs. Here’s a list of some Trove-related tools and resources currently available through the GLAM Workbench.

Update (18 May): some extra bonus bugs

I forgot to add these annoying bugs: