19 May 2025

No more harvesting data from the National Archives of Australia

A couple of weeks ago I bid farewell to Trove due to the cancellation of my API keys and the NLA’s lack of transparency around changes to API access. Now it seems I have to wave goodbye to 16+ years of work on RecordSearch, the National Archives of Australia’s online database.

I noticed this morning that my weekly harvest of recently digitised files in RecordSearch had failed. A quick check showed that my harvester was being blocked by Cloudflare’s bot protection software. I wasn’t really surprised. Websites are using tools like this to protect themselves against AI scraper bots, and I’d already seen it in action on another Australian government site. In the war between content providers and AI scrapers, researchers and digital preservation efforts are copping collateral damage.

But while we can’t blame the NAA for safeguarding its systems, we can be critical of the fact that it still doesn’t provide its collection data in machine-readable form. There were a couple of datasets shared for a GovHack event many years ago, a short-lived API for the WWI service records in series B2455, and an API attached to a beta discovery service that never saw the light of day (despite many $$$ being spent on it). Without direct access to the data, researchers have had to scrape it from RecordSearch’s web interface. That’s no longer possible.

I started scraping data from RecordSearch back in 2008 when I was working at the NAA. Eventually I packaged up some Python code to help other researchers create datasets. This was completely rewritten as the RecordSearch Data Scraper a few years back, and you can find various tools and examples using it in the RecordSearch section of the GLAM Workbench. In theory, I might be able to modify the scraper to get around the bot protection, but with the bot wars escalating, it hardly seems worth it – I might get the scraper working, only for it to fall foul of the latest bot detection rules. It’s really now up to the NAA to decide whether it will find other ways to give researchers access to its data.

So it seems like I’ll be archiving all my RecordSearch code. Unfortunately, many of the RecordSearch notebooks in the GLAM Workbench will no longer work, so I’ll be adding warnings and explanations over the coming weeks.

While not entirely unexpected, it’s all pretty sad. The RecordSearch scrapers have powered some of my favourite research projects. They enabled Kate Bagnall and me to download the metadata and images behind The Real Face of White Australia – a process we described in our article ‘The people inside’. Using the scrapers I’ve been able to analyse the process of access examination, and extract thousands of redactions from digitised ASIO surveillance files. Without the scrapers I would never have discovered #redactionart!

But while I won’t be harvesting any new data, I have a few datasets that I’d like to explore further. Fortunately, I just finished compiling some summary data about every series in RecordSearch, and I want to compare this latest harvest with datasets from 2021 and 2022. I need to do some more analysis of ten years' worth of data capturing the details of files with the access status of ‘closed’. I’ve been working on an update to my redaction finder, which I think I should still be able to finish. And there’s also a lot of data that volunteers have transcribed from records relating to the Real Face of White Australia that I need to pull together.

While my life has been dominated by Trove in recent years, as a historian my heart has always been with the collections of the National Archives of Australia. I’m hoping this is just a temporary setback, and that new methods for data access will emerge.

Tim Sherratt
