Some big pictures of newspapers in Trove and DigitalNZ

One of the things I really like about Jupyter is the fact that I can share notebooks in a variety of different formats. Tools like QueryPic can run as simple web apps using Voila, static versions of notebooks can be viewed using NBViewer, and live versions can be spun up as required on Binder. It’s also possible to export notebooks as PDFs, slideshows, or just plain-old HTML pages. Just recently I realised I could export notebooks to HTML using the same template I use for Voila. This gives me another way of sharing – static web pages delivered via the main GLAM Workbench site.
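Under the hood, this sort of export is just nbconvert with a template. Here’s a minimal sketch of the general approach (the notebook filename and template name are placeholders, not my actual setup):

import nbformat
from nbconvert import HTMLExporter

# Read an executed notebook (filename is a placeholder)
nb = nbformat.read("my_notebook.ipynb", as_version=4)

# Export to HTML using a named template – swap in your own template as required
exporter = HTMLExporter(template_name="lab")
body, resources = exporter.from_notebook_node(nb)

with open("my_notebook.html", "w", encoding="utf-8") as html_file:
    html_file.write(body)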

Here’s a couple of examples:

Both are HTML pages that embed visualisations created using Altair. The visualisations are rendered using JavaScript, and even though the notebook isn’t running in a live computing environment, there’s some basic interactivity built-in – for example, you can hover for more details, and click on the DigitalNZ chart to search for articles from a newspaper. More to come! #dhhacks

Exploring GLAM data at ResBaz

The video of my key story presentation at ResBaz Queensland (simulcast via ResBaz Sydney) is now available on Vimeo. In it, I explore some of the possibilities of GLAM data by retracing my own journey through WWI service records, The Real Face of White Australia, #redactionart, and Trove – ending up at the GLAM Workbench, which brings together a lot of my tools and resources in a form that anyone can use. The slides are also available, and there’s an archived version of everything in Zenodo.

This and many other presentations about the GLAM Workbench are listed here. It seems I’ve given at least 11 talks and workshops this year! #dhhacks

GLAM Workbench Nectar Cloud Application updated!

The newly-updated DigitalNZ and Te Papa sections of the GLAM Workbench have been added to the list of available repositories in the Nectar Research Cloud’s GLAM Workbench Application. This means you can create your very own version of these repositories running in the Nectar Cloud, simply by choosing them from the app’s dropdown list. See the Using Nectar help page for more information.

I’ve also taken the opportunity to make use of the new container registry service developed by the ARDC as part of the ARCOS project. The app now pulls the GLAM Workbench Docker images via the container registry’s cache. This means that copies of the images are cached locally, speeding things up and saving on data transfers. Yay for integration!

Thanks again to Andy and the Nectar Cloud staff for their help! #dhhacks

DigitalNZ & Te Papa sections of the GLAM Workbench updated!

In preparation for my talk at ResBaz Aotearoa, I updated the DigitalNZ and Te Papa sections of the GLAM Workbench. Most of the changes are related to management, maintenance, and integration of the repositories. Things like:

  • Setting up GitHub Actions to automatically generate Docker images when the repositories change, and to upload the images to the container registry
  • Automatic generation of an index.ipynb file to act as a front page within Jupyter Lab
  • Addition of a reclaim-manifest.jps file to allow for one-click installation of the repository in Reclaim Cloud
  • Additional documentation with instructions on how to run the repository via Binder, Reclaim Cloud, Nectar Research Cloud, and Docker Desktop
  • Addition of a .zenodo.json metadata file so that new releases are preserved in Zenodo
  • Switch to using pip-tools to generate pinned requirements.txt files from a list of unpinned requirements
  • Update of all Python packages

From the user’s point of view, the main benefit of these changes is the ability to run the repositories in a variety of different environments depending on your needs and skills. The Docker images, generated using repo2docker, are used by Binder, Reclaim Cloud, Nectar, and Docker Desktop. Same image, multiple environments! See ‘Run these notebooks’ in the DigitalNZ and Te Papa sections of the GLAM Workbench for more information.

Of course, I’ve also re-run all of the notebooks to make sure everything works and to update any statistics, visualisations, and datasets. As a bonus, there’s a couple of new notebooks in the DigitalNZ repository:


A template for GLAM Workbench development

I’m hoping that the GLAM Workbench will encourage GLAM organisations and GLAM data nerds (like me) to create their own Jupyter notebooks. If they do, they can put a link to them in the list of GLAM Jupyter resources. But what if they want to add the notebooks to the GLAM Workbench itself?

To make this easier, I’ve been working on a template repository for the GLAM Workbench. It generates a new skeleton repository with all the files you need to develop and manage your own section of the GLAM Workbench. It uses GitHub’s built-in templating feature, together with Cookiecutter, and this GitHub Action by Stefan Buck. Stefan has also written a very helpful blog post.

The new repository is configured to do various things automatically, such as generate and save Docker images, and integrate with Reclaim Cloud and Zenodo. Lurking inside the dev folder of each new repository, you’ll find some basic details on how to set up and manage your development environment.

This is just the first step. There’s more documentation to come, but you’re very welcome to try it out. And, of course, if you are interested in contributing to the development of the GLAM Workbench, let me know and I’ll help get you set up!

More thoughts on the Trove researcher platform for advanced research

Previously on ‘What could we do with $2.3 million?’, the National Library of Australia produced a draft plan for an ‘Advanced Researcher Platform’ that was thoroughly inadequate. Rather than submit this plan to the ARDC for consideration as part of the HASS RDC process, the NLA wisely decided to make some fundamental changes. The redrafted draft is now available for re-feedback. This is where we pick up the story…

So what has changed?

Generally speaking, there seem to be two major changes.

  • There’s a much greater focus on consultation and collaboration.
  • There’s less detail about what will actually be developed.

These two changes can work together positively. The details of what’s being developed can be worked out through consultation, ensuring that what’s developed meets real research needs. But this assumes first that the consultation process is effective, and second that the overall scope of the project is appropriate. I’m not convinced on either of these two fronts.

TL;DR – it’s better than it was, but the hand waving around collaboration and integration isn’t convincing, the scope needs rethinking, and I still can’t see how what’s proposed provides good value for money.


Given that there was almost no space for consultation in the previous draft, this version could only do better. There’s certainly lots of talk about consulting with the HASS community, and a new governance structure that includes researcher representatives, but what will the consultation process actually deliver?

The NLA is now partnering with the ANU (why the ANU? because it’s closest?), and the ANU will apparently be driving the consultation process. The whole process is complex and weirdly hierarchical. Any HASS researcher can participate in the initial rounds, but their numbers dwindle as you move up the hierarchy, until you reach the Project Board where only researchers with specific institutional affiliations are admitted. I’m imagining a Hunger Games scenario…

The process aims to gather feedback on ‘what developments would offer the best assistance to the majority and be feasibly achieved within the project timeframe’. That sounds ok, but earlier in the document the outcome of the consultation process is described in the following way:

The outcome will be expressed as one (or potentially two) goals articulated in high level requirements document(s) and align to the objectives as described in this project plan. Indicative goals might include: a graph visualising occurrence of keywords over time; an interface visualising place of publication as geospatial data on a map; or, or a concordance for exploring textual data such as with key word in context.

So all of this complex, hierarchical consultative structure is just to decide whether we have a line graph or a map (or both if we’re really lucky)? If you look at the project deliverables, it seems that the researcher feedback funnels into Deliverable 6 in Work Package 3 – ‘Analysing and visualising research data’. But what about the rest of the project? Will researcher feedback have any role in determining how datasets are created, for example?

I suppose even this very limited consultation is better than what was previously proposed (ie. nothing). But what then happens to the feedback from researchers? An Advisory Panel will be selected (by whom?) to collate the ideas and produce the high-level requirements. Detailed requirements will be generated from the high-level requirements (don’t you just love project management speak?), and then subjected to the argy bargy of the MoSCoW process where priorities will be set. It’s likely that these priorities will be whittled down further as development proceeds. These are crucial decision-making stages where important ideas can be relegated to the ‘nice to have’ category and never heard of again. It’s not clear from the plan who is involved in this, and where the final decision-making power lies.

Of course, some of these details can be worked out later. But given that the big sell of this version of the plan is the expanded consultative process, I think it’s important to know where the power really lies. What role will researchers actually play in determining the outcomes of the project? This is not at all clear.


But what is the project? In general terms it hasn’t really changed. There will be some sort of portal where researchers can create and share datasets and visualisations. Crucially, it’s assumed that this portal will be part of Trove itself. As noted the last time round, the original project description provided by the government made no such assumption. It was focused on ’the delivery of researcher portals accessible through Trove’. The NLA has interpreted ‘through’ to mean ‘as part of’, and given the limits on the consultative process described above it seems this won’t change.

Or will it? I’m still puzzling over a few sections in the plan that talk about looking beyond the NLA to see whether there are existing options to meet user requirements. Deliverable 2 in Work Package 2 will:

Undertake an environmental scan for current research usage and tools such as (Glam Workbench) and a market scan to determine if these gaps can be filled by existing services that the HASS community and/or Trove support.

What’s with the weird brackets around ‘Glam Workbench’? Makes me think it was a last minute addition. I suppose I should be grateful that the NLA wants to spend some money to confirm that the GLAM Workbench actually exists. But then what? The next deliverable will:

determine which requirements will be delivered within the Trove Platform and which will be outsourced to other services.

So if the Trove Newspaper Harvester, for example, meets one of the requirements, will Trove simply link to it? Imagine that, Trove actually linking to one of the dozens of Trove tools and resources provided by the GLAM Workbench. Oh frabjous day! But then does the NLA still get the money to develop the thing that I’ve already developed, or will they share some of the project money with me (yeah right)? I really have no idea what’s envisaged here. How will the ‘solution architecture’ integrate existing tools and services? And what does that mean for the resourcing of the project as a whole?

Elsewhere the plan talks about services ‘dedicated specifically to Trove collections and/or to the Australian research community’ that could be ‘“plugged in” to the platform ecosystem’. That sounds hopeful, but if the platform is an ecosystem of tools and services from the NLA and beyond, then that changes the scope of the project completely. Why not start with that? Start with the idea of developing an ecosystem of tools and services making use of Trove data, rather than just developing a new Trove interface. Then we could work together to build something really useful.

It just seems that the scope of the project as a whole hasn’t been properly thought through. The original plan has been expanded in vague, hand-wavy directions, without thinking through the implications and possibilities of that expansion. Tinkering around the edges isn’t enough; the nature of this project needs to be completely rethought.

Where’s the strategy?

Rather than have an open call for project funding, the HASS RDC process has focused instead on making strategic investments. But where’s the strategy? The current projects were identified through a number of scoping studies undertaken by the Department of Education, Skills, and Employment. But these scoping studies haven’t been publicly released, so we don’t really know why these projects were recommended for funding. Is giving the NLA buckets of money to develop a new interface really what was envisaged? Surely if you were thinking strategically, you’d be considering ways in which the rich data asset represented by Trove could be opened to new research uses. You’d look around at existing tools and resources and think about how they could be leveraged. You’d examine limitations in the delivery of Trove data, and think about what sort of plumbing was needed to connect up new and existing projects. So how did we end up here?

Perhaps we just need to take a step back and recognise that just because Trove provides the data, doesn’t mean it should direct the project. There needs to be another layer of strategic planning which identifies the necessary components and directs resources accordingly. As I noted before, there’s plenty of ways in which Trove’s data and APIs could be improved. Give them money to do that. But should they be building tools for researchers to use that data? Nope. Absolutely not.

I attended the eResearch Australasia Conference recently, and was really impressed with all the activity around research software development. If the tool building component of this project was opened up, it could provide a really useful focus for developing collaboration across the research software community and building capacities and understanding in HASS. This would seem to be a much more strategic intervention.

Assorted questions

I’m not going to go through the plan in detail again. I’ve already spent a couple of weeks engaged in, or worrying about, the HASS RDC process. I’m tired, and I’m frustrated, and I can’t shake the depressing thought that the NLA will end up being rewarded for its bad behaviour. Many of my comments on the earlier draft still apply, particularly those around the API and the development of pathways for researchers.

It’s worth noting, however, that the ‘sustainability’ section of the plan has disappeared completely – perhaps not surprisingly, as the only suggestion last time was for someone to give them more money. There are gestures towards integration, such as including representatives of the other HASS RDC projects on the Trove Project Board. But real integration would happen through technical interchange, not governance structures, and there’s no plan for that.

I’m also a bit confused about the role of ANU. It seems to be mostly focused on consultation, but then there are statements like:

The development phase will be completed as a collaboration between the ANU and NLA, with both institutions working on the development of their own systems to align the with the product goals and Trove.

What ANU systems are we talking about here? And why are they part of the project?

A couple of objectives were also added to the plan:

  • ‘Explore opportunities to enrich the Trove corpus’ – sounds good, but it’s not picked up anywhere else in the plan, and has no deliverables associated with it. Is it just window dressing?
  • ‘Develop Trove’s HASS community relationships and engagement capabilities’ – sorry if I’m cynical, but when your first plan doesn’t even bother to consult the HASS community, when you don’t even link to existing resources in the sector, why should we believe that this will now be an important, ongoing objective?

On the issue of community relationships, a few people in the previous round of feedback indicated to me that they didn’t want to criticise the NLA too harshly because they might want to work with them in the future. That’s not healthy community building.


After two attempts, the NLA has still not delivered a coherent project plan that demonstrates real value to the HASS sector and meets the ARDC’s assessment criteria. I think the project needs to be radically rethought, and leadership sought from outside the NLA to ensure that the available funding is used effectively.

I love Trove. I recognise the way it has transformed research, and was honoured to play a small part in its history. It should be appropriately funded. But it shouldn’t be funded to do everything.

In the end, we could do so much better, and so much more…

Have your say…

You can provide your own feedback on the new draft plan. There’ll be a roundtable event on 10 November when you can ask questions of the project participants. You can also submit your feedback by 17 November using the form at the bottom of this page. You might also want to remind yourself of the ARDC’s evaluation criteria.

I don’t think I’ll attend the roundtable, as this whole process has taken a bit of a toll, but I encourage you to do so. The more voices the better.

Update – 13 November

Here’s the final feedback that I submitted to the ARDC.

Coming up! GLAM Workbench at ResBaz(s)

Want a bit of added GLAM with your digital research skills? You’re in luck, as I’ll be speaking at not one, but three ResBaz events in November. If you haven’t heard of it before, ResBaz (Research Bazaar) is ‘a worldwide festival promoting the digital literacy at the centre of modern research’.

The programs of all three ResBaz events are chock full of excellent opportunities to develop your digital skills, learn new research methods, and explore digital tools. If you’re an HDR student you should check out what’s on offer.

New video – using the Trove Newspaper & Gazette Harvester

The latest help video for the GLAM Workbench walks through the web app version of the Trove Newspaper & Gazette Harvester. Just paste in your search url and Trove API key and you can harvest thousands of digitised newspaper articles in minutes!

Harvest newspaper issues as PDFs

An inquiry on Twitter prompted me to put together a notebook that you can use to download all available issues of a newspaper as PDFs. It was really just a matter of copying code from other tools and making a few modifications. The first step harvests a list of available issues for a particular newspaper from Trove. You can then download the PDFs of those issues, supplying an optional date range. Beware – this could consume a lot of disk space!

The PDF file names have the following structure:

[newspaper identifier]-[issue date as YYYYMMDD]-[issue identifier].pdf

For example, here’s a quick sketch of pulling the parts back out of a harvested filename (the filename below is made up for illustration, not a real issue):
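# Split a harvested PDF filename back into its parts (this filename is invented)
filename = "1180-19140801-123456.pdf"
newspaper_id, issue_date, issue_id = filename[:-len(".pdf")].split("-", 2)
print(newspaper_id, issue_date, issue_id)  # 1180 19140801 123456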


I also took the opportunity to create a new Harvesting data heading in the Trove newspapers section of the GLAM Workbench. #dhhacks

GLAM Workbench now in the Nectar Research Cloud!

The GLAM Workbench isn’t dependent on one big piece of technological infrastructure. It’s basically a collection of Jupyter notebooks, and those notebooks can be used within a variety of different environments. This helps make the GLAM Workbench more sustainable – new components can be swapped in and out as required. It also makes it possible to create different pathways for users, depending on their digital skills, institutional support, and research needs. For example, links to Binder make it easy for users to explore the possibilities of the GLAM Workbench and accomplish quick tasks. But Binder has limits. Where do you go when your research project scales up?

Earlier this year I added one-click installation of GLAM Workbench repositories in Reclaim Cloud. Today I’m very pleased to announce that selected GLAM Workbench repositories can be installed as applications within the Nectar Research Cloud. Using nationally-funded digital infrastructure, researchers in Australian universities can now create their own workbenches in minutes. So whether you’re harvesting truckloads of data from Trove or analysing web archives at scale, you can move beyond Binder and set up an environment dedicated to your research project. Cool huh?

You can install GLAM Workbench repositories using this simple application in Nectar!

Currently four repositories can be installed on Nectar in this way:

But more will be added in the future as I update repositories to generate the necessary Docker images. Nectar installation information is now included in each of these four repositories, and I’ve added a Using the Nectar Cloud section to the help documentation that includes a detailed walkthrough of the installation process. If you strike any problems either raise an issue on GitHub, or ask a question at OzGLAM Chat.

Huge thanks to Andy, Jacob, and Jo at the Australian Research Data Commons (ARDC) who responded enthusiastically to my tweeted query, and packaged the repositories up into an easy-to-install, reusable application. After all the work I’ve put into the GLAM Workbench, it’s really exciting to see it embedded within Australia’s digital research infrastructure. #dhhacks

More GLAM Name Index updates from Queensland State Archives and SLWA

A new version of the GLAM Name Index Search is available. An additional 49 indexes have been added, bringing the total to 246. You can now search for names in more than 10.2 million records from 9 organisations.

The new indexes come from Queensland State Archives and the State Library of WA. QSA announced on Friday that they’d added two new indexes to their site. When I went to harvest them, I realised there were another 25 indexes that I hadn’t previously picked up. It seems that some QSA datasets are tagged as ‘Queensland State Archives’ in the portal, but others are tagged as ‘queensland state archives’ – and the tag search is case sensitive! I now search for both the upper and lower case tags.
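The workaround looks something like this – a rough sketch of the approach rather than my actual harvesting code (the Queensland portal runs CKAN, but treat the parameters as illustrative):

import requests

# Search the Queensland open data portal (CKAN) for packages with each tag –
# tag searches are case sensitive, so try both forms
api_url = "https://www.data.qld.gov.au/api/3/action/package_search"
datasets = []
for tag in ["Queensland State Archives", "queensland state archives"]:
    response = requests.get(api_url, params={"fq": f'tags:"{tag}"', "rows": 100})
    datasets += response.json()["result"]["results"]

print(f"Found {len(datasets)} datasets")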

There’s also a number of additions from the State Library of WA. These datasets were already in my harvest, but because of some oddities in their formatting, I hadn’t included them in the Index Search. Looking at them again, I realised they were right to go, so I’ve added them in.

Here’s the list of additions:

Queensland State Archives

  • Australian South Sea Islanders 1867 to 1908 - A-K
  • Australian South Sea Islanders 1867 to 1908 L-Z
  • Beaudesert Shire Burials - Logan Village 1878-2000 - Beaudesert Shire and Logan Village Burials 1878-2000
  • Immigrants, Bowen Immigration Depot 1885-1892
  • Brisbane Gaol Hospital Admission registers 1889-1911 - Index to Brisbane Gaol Hospital Admission Registers 1889-1911
  • Index to Correspondence of Queensland Colonial Secretary 1859-1861 - Index to Colonial Secretary s Correspondence Bundles 1859 - 1861.csv
  • Dunwich Benevolent Asylum records - Index to Dunwich Benevolent Asylum 1859-1948
  • Dunwich Benevolent Asylum records - Index to Dunwich Benevolent Asylum 1885-1907
  • Immigrants and Crew 1860-1865 (COL/A) - Index to Immigrants and Crew 1860 - 1964
  • Index to Immigration 1909-1932
  • Outdoor Relief 1900-1904 - Index to Outdoor Relief 1892-1920
  • Pensions 1908-1919 - Index to Pensions 1908-1909
  • Cases & treatment Moreton Bay Hospital 1830-1862 - Index to Register of Cases and treatment at Moreton Bay Hospital 1830-1862
  • Index to Registers of Agricultural Lessees 1885-1908
  • Index to Registers of Immigrants, Rockhampton 1882-1915
  • Pneumonic influenza patients, Wallangarra Quarantine Compound - Index to Wallangarra Flu Camp 1918-1919
  • Land selections 1885-1981
  • Lazaret patient registers - Lazaret Patient Registers
  • Leases, Selections and Pastoral Runs and other related records 1850-2014
  • Perpetual Lease Selections of soldier settlements 1917 - 1929 - Perpetual Lease Selections of soldier settlements 1917-1929
  • Photographic records of prisoners 1875-1913 - Photographic Records of Prisoners 1875-1913
  • Redeemed land orders 1860-1906 - Redeemed land orders 1860-1907
  • Register of the Engagement of Immigrants at the Immigration Depot - Bowen 1873-1912
  • Registers of Applications by Selectors 1868-1885
  • Registers of Immigrants Promissory Notes (Maryborough)
  • Education Office Gazette Scholarships 1900 - 1940 - Scholarships in the Education Office Gazette 1900 - 1940
  • Teachers in the Education Office Gazettes 1899-1925

State Library of Western Australia

  • WABI Subset: Eastern Goldfields - Eastern Goldfields
  • Western Australian Biographical Index (WABI) - Index entries beginning with A
  • Western Australian Biographical Index (WABI) - Index entries beginning with B
  • Western Australian Biographical Index (WABI) - Index entries beginning with C
  • Western Australian Biographical Index (WABI) - Index entries beginning with D and E
  • Western Australian Biographical Index (WABI) - Index entries beginning with F
  • Western Australian Biographical Index (WABI) - Index entries beginning with G
  • Western Australian Biographical Index (WABI) - Index entries beginning with H
  • Western Australian Biographical Index (WABI) - Index entries beginning with I and J
  • Western Australian Biographical Index (WABI) - Index entries beginning with K
  • Western Australian Biographical Index (WABI) - Index entries beginning with L
  • Western Australian Biographical Index (WABI) - Index entries beginning with M
  • Western Australian Biographical Index (WABI) - Index entries beginning with N
  • Western Australian Biographical Index (WABI) - Index entries beginning with O
  • Western Australian Biographical Index (WABI) - Index entries beginning with P and Q
  • Western Australian Biographical Index (WABI) - Index entries beginning with R
  • Western Australian Biographical Index (WABI) - Index entries beginning with S
  • Western Australian Biographical Index (WABI) - Index entries beginning with T
  • Western Australian Biographical Index (WABI) - Index entries beginning with U-Z
  • Digital Photographic Collection - Pictorial collection_csv
  • WABI subset: Police - WABI police subset
  • WABI subset: York - York and districts subset

Bonus update

After a bit more work last night I added in a dataset from the State Library of Victoria:

  • Melbourne and metropolitan hotels, pubs and publicans

That’s an extra 21,000 records, and takes the total number of indexes to 247 from 10 different GLAM organisations!


Getting data about newspaper issues in Trove

When you search Trove’s newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? How do you get a list of dates when newspapers were published? This notebook in the GLAM Workbench shows how you can get information about issues from the Trove API.
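In essence, the API’s newspaper title endpoint will include issue details if you ask for them. Here’s a rough sketch of the idea, assuming the v2 API (you’ll need your own API key; title 11 is The Canberra Times):

import requests

API_KEY = "YOUR_API_KEY"  # insert your own Trove API key
title_id = "11"  # The Canberra Times

# Ask the newspaper title endpoint to include the number of issues per year
params = {"encoding": "json", "include": "years", "key": API_KEY}
url = f"https://api.trove.nla.gov.au/v2/newspaper/title/{title_id}"
data = requests.get(url, params=params).json()

for year in data["newspaper"]["year"]:
    print(year["date"], year["issuecount"])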

Using the notebook, I’ve created a couple of datasets ready for download and use.

Total number of issues per year for every newspaper in Trove

Harvested 10 October 2021

CSV formatted dataset containing the number of newspaper issues available on Trove, totalled by title and year – comprises 27,604 rows with the fields:

  • title – newspaper title
  • title_id – newspaper id
  • state – place of publication
  • year – year published
  • issues – number of issues

Download from Cloudstor: newspaper_issues_totals_by_year_20211010.csv (2.1mb)
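Once downloaded, the CSV is easy to explore with Pandas. A quick sketch (assuming the file is in your working directory; again, title 11 is The Canberra Times):

import pandas as pd

# Load the issues-per-year dataset and chart one newspaper's issues over time
df = pd.read_csv("newspaper_issues_totals_by_year_20211010.csv")
canberra_times = df.loc[df["title_id"] == 11]
canberra_times.plot(x="year", y="issues")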

Complete list of issues for every newspaper in Trove

Harvested 10 October 2021

CSV formatted dataset containing a complete list of newspaper issues available on Trove – comprises 2,654,020 rows with the fields:

  • title – newspaper title
  • title_id – newspaper id
  • state – place of publication
  • issue_id – issue identifier
  • issue_date – date of publication (YYYY-MM-DD)

To keep the file size down, I haven’t included an issue_url in this dataset, but these are easily generated from the issue_id – just add the issue_id to the end of the base issue url. Note that when you follow an issue url, you actually get redirected to the url of the first page in the issue.

Download from Cloudstor: newspaper_issues_20211010.csv (222mb)

For more information see the Trove newspapers section of the GLAM Workbench.

GLAM Workbench at eResearch Australasia 2021

Way back in 2013, I went to the eResearch Australasia conference as the manager of Trove to talk about new research possibilities using the Trove API. Eight years later I was back, still spruiking the possibilities of Trove data. This time, however, I was discussing Trove in the broader context of GLAM data – all the exciting possibilities that have emerged as galleries, libraries, archives and museums make more of their collections available in machine-readable form. The big question is, of course, how do researchers, particularly those in the humanities, make use of that data? The GLAM Workbench is my attempt to address that question – to provide humanities researchers with both the tools and information they need, and an understanding of the possibilities that might emerge if they invest a bit of time in working with GLAM data. My eResearch Australasia 2021 presentation provides a quick introduction to the GLAM Workbench, here’s the video, and the slides.

A GLAM Workbench for humanities researchers from Tim Sherratt on Vimeo.

The presentation was pre-recorded, but I managed to sneak in an update via chat for those who attended the session. More news on this next week… 🥳

New Python package to download Trove newspaper images

There’s no reliable way of downloading an image of a Trove newspaper article from the web interface. The image download option produces an HTML page with embedded images, and the article is often sliced into pieces to fit the page.

This Python package includes tools to download articles as complete JPEG images. If an article is printed across multiple newspaper pages, multiple images will be downloaded – one for each page. It’s intended for integration into other tools and processing workflows, or for people who like working on the command line.

You can use it as a library:

from trove_newspaper_images.articles import download_images

# Download images of the article with this Trove article id
images = download_images('107024751')

Or from the command line: 107024751 --output_dir images

If you just want to quickly download an article as an image without installing anything, you can use this web app in the GLAM Workbench. To download images of all articles returned by a search in Trove, you can also use the Trove Newspaper and Gazette Harvester.

See the documentation for more information. #dhhacks

More records for the GLAM Name Index Search

Two more datasets have been added to the GLAM Name Index Search! From the History Trust of South Australia and Collab, I’ve added:

In total there’s 9.67 million name records to search across 197 datasets provided by 9 GLAM organisations!

New preprint – ‘More than newspapers’

Here’s the preprint version of an article, ‘More than newspapers’, that I’ve submitted for a forum about Trove in a forthcoming issue of History Australia.


More QueryPic in action

Recently I created a list of publications that made use of QueryPic, my tool to visualise searches in Trove’s digitised newspapers. Here’s another example of the GLAM Workbench and QueryPic in action, in Professor Julian Meyrick’s recent keynote lecture, ‘Looking Forward to the 1950s: A Hauntological Method for Investigating Australian Theatre History’.

Some thoughts on the ‘Trove Researcher Platform for Advanced Research’ draft plan

Late last year the Federal Government announced it was making an $8.9 million investment in HASS and Indigenous research infrastructure. This program is being managed by the ARDC and will lead to the development of a HASS Research Data Commons. According to the ARDC, a research data commons:

brings together people, skills, data, and related resources such as storage, compute, software, and models to enable researchers to conduct world class data-intensive research

Sounds awesome!

Based on scoping studies commissioned by the Department of Education, Skills, and Employment (which have not yet been made public), four activities were selected for initial funding under this program. Draft project plans for these four activities have now been released for public comment.

One of these activities aims to develop a ‘Trove researcher platform for advanced research’:

Augmenting existing National Library of Australia resources, this platform will enable a focus on the delivery of researcher portals accessible through Trove, Australia’s unique public heritage site. The platform will create tools for visualisation, entity recognition, transcription and geocoding across Trove content and other corpora.

You can download the draft project plan for the Trove platform. Funding for this activity will be capped at $2,301,185 across 2021-23. In this post I’ll try to pull together some of my own thoughts on this plan.

I suppose I’d better start with a disclaimer – I’m not a neutral observer in this. I started scraping data from Trove newspapers way back in 2010, building the first versions of tools like QueryPic and the Trove Newspaper Harvester. While I was manager of Trove, from 2013 to 2016, I argued for recognition of Trove as a key part of Australia’s humanities research infrastructure, and highlighted possible research uses of Trove data available through the API. Since then I’ve worked to bring a range of digital tools, examples, tutorials, and hacks together for researchers in the GLAM Workbench – a large number of these work with data from Trove.

I strongly believe that Trove should receive ongoing funding through NCRIS as a piece of national research infrastructure. Unfortunately though, the draft project plan does not make a strong case for investment – it’s vague, unimaginative, and makes little attempt to integrate with existing tools and services. I think it scores poorly against the ARDC’s evaluation criteria, and doesn’t seem to offer good value for money. As someone who has championed the use of Trove data for research across the last decade, I’m very disappointed.

What’s planned?

So what is being proposed? There seem to be three main components:

  1. Authenticated ‘project’ spaces for researchers where datasets relating to a particular research topic can be stored
  2. The ability to create custom datasets from a search in Trove
  3. Tools to visualise stored datasets.

There’s no doubt that these are all useful functions for researchers, but many problems arise when we look at how they’re going to be implemented.

1. Authenticated project spaces

The draft plan indicates that authentication of users through the Australian Access Federation is preferred. Why? Trove already has a system for the creation of user accounts. Using AAF would limit use of the new platform to those attached to universities or research agencies. I don’t understand what the use of AAF adds to the project, except perhaps to provide an example of integration with existing infrastructure services.

The plan notes that project spaces could be ‘public’ or ‘private’. Presumably a ‘public’ space would give access to stored datasets, but what sort of access controls would be available in relation to individual datasets? It’s also noted (Deliverable 7) that researchers would have ‘an option to “publish” their research findings for public consumption’. Does this mean datasets and visualisations would be assigned a DOI (or other persistent identifier) and preserved indefinitely? How might these spaces integrate with existing data repositories?

2. Create custom datasets

The lack of detail in the plan makes it difficult to assess what’s being proposed here. But it seems that users would be able to construct a search using the Trove web interface (or a new search interface?) and save the results as a dataset.

What data would be searched? It’s not clear, but in reference to the visualisations it’s stated that data would come from ‘Trove’s existing full text collections (newspapers and gazettes, magazines and newsletters, books)’. So no web archives, and no metadata from any of Trove’s aggregated collections (even without full text, collection metadata can create interesting research possibilities, see for example the Radio National records in the GLAM Workbench).

What will be included in each dataset? There are few details, but at a minimum you’d expect something like a CSV containing the metadata of all the matching records, and files containing the full text content of the items. These could potentially be very large. There’s no indication about how storage and processing demands would be managed, but presumably there would be some per user, or per project, limits.

Deliverable 8, ‘Data and visual download’, states that:

All query results must be available as downloadable files, this would include CSV, JSON and XML for the query results list.

But there’s no mention of the full text content at all. Will it be included in downloadable datasets?

As well as the record metadata and full text, you’d want there to be some metadata captured about the dataset itself – the search query used, when it was captured, the number of records, etc. To support integration and reuse, it would be good to align this with something like RO-Crate.
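As a rough sketch, I’m imagining something like the following being saved alongside the records and full text (the field names here are purely illustrative, not any proposed standard):

import json
from datetime import datetime, timezone

# Illustrative only – the sort of dataset-level metadata worth capturing
dataset_metadata = {
    "query": "influenza AND quarantine",  # the search that created the dataset
    "date_harvested": datetime.now(timezone.utc).isoformat(),
    "number_of_records": 12345,
    "api_version": "v2",
}
with open("dataset-metadata.json", "w") as metadata_file:
    json.dump(dataset_metadata, metadata_file, indent=2)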

How will searches be constructed? It’s not clear if this will be integrated with the existing search interface, or be something completely separate; however, the plan does note that ‘limitations are put onto the dataset like keyword search terms and filters corresponding to the filters currently available in the interface’. So it seems that the new platform will be using the existing search indexes. It’s obviously important for the relationship between existing search functions and the new dataset creation tool to be explicit and transparent so that researchers understand what they’re getting.

It’s also worth noting that changes to the search interface last year removed some useful options from the advanced search form. In particular, you can no longer exclude matches in tags or comments. If you’re a researcher looking for the occurrence of a particular word, you generally don’t want to include records where that word only appears in a user added tag (I have a story about ‘Word War I’ that illustrates this!).

This raises a broader issue. There doesn’t seem to be any mention in the project plan of work to improve the metadata and indexing in response to research needs. Even just identifying digitised books in the current web interface can be a bit of a challenge, and digitised books and periodicals can be grouped into work records with other versions. We need to recognise that the needs of discovery sometimes compromise specific research uses.

I’m trying to be constructive in my responses here, but at this point I just have to scream – WHAT ABOUT THE TROVE NEWSPAPER HARVESTER? A tool has existed for ten years that lets users create a dataset containing metadata and full text from a search in Trove’s newspapers and gazettes. I’ve spent a lot of time over recent years adding features and making it easier to use. Now you can download not only full text, but also PDFs and images of articles. The latest web app version in the GLAM Workbench runs in the cloud. Just one click to start it up, then all you need to do is paste in your Trove API key and the url of your search. It can’t get much easier.

The GLAM Workbench also includes tools to create datasets and download OCRd text from Trove’s books and digitised journals. These are still in notebook form, so are not as easy to use, but I have created pre-harvested datasets of all books and periodicals with OCRd text, and stored them on CloudStor. What’s missing at the moment is something to harvest a collection of journal articles, but this would not be difficult. As an added bonus, the GLAM Workbench has tools to create full text datasets from the Australian Web Archive.

So what is this project really adding? And why is there no attempt to leverage existing tools and resources?

3. Visualise datasets

Again, there’s a fair bit of hand waving in the plan, but it seems that users will be able to select a stored dataset and then choose a form of visualisation. The plan says that:

An initial pilot would allow users to create line graphs that plot the frequency of a search term over time and maps that display results based on state-level geolocation.

Up to three additional visualisations would be created later based on research feedback. It’s not clear which researchers will be consulted and when their feedback will be sought.

The value of these sorts of visualisations is obviously dependent on the quality and consistency of the metadata. There’s nothing built into this plan that would, for example, allow a researcher to clean or normalise any of the saved data. You have to take what you’re given. The newspaper metadata is generally consistent, but books and periodicals less so.

It’s also important to clarify what’s meant by ‘the frequency of a search term over time’. Does this mean the number of records matching a search term, or the number of times that the search term actually appears in the full text of all matched records? If the latter, then this would be a major enrichment of the available data. Though if this data was available it should be pushed through the API and/or made available as a downloadable dataset for integration with other platforms (perhaps along the lines of the Hathi Trust’s Extracted Features Dataset). I suspect, however, that what is actually meant is the number of matching search results.

Again, the value of any geospatial visualisation depends on what is actually being visualised! The state facet in newspapers indicates place of publication; it’s not clear what the place facet in other categories represents. For this sort of visualisation to be useful in a research context, there would need to be some explanation of how these values were created, and any gaps or uncertainties.

Time for another scream of frustration — WHAT ABOUT QUERYPIC? Another long-standing tool which has already been cited a number of times in research literature. QueryPic visualises searches in Trove’s newspapers and gazettes over time. You can adjust time scales and intervals, and download the results as images, a CSV file, and an HTML page. The project plan makes a point of claiming that its tools would not require any coding, but neither does QueryPic. Just plug in an API key and a search URL. I even made some videos about it! The GLAM Workbench also includes a number of examples of how you can visualise places of publication of newspaper articles.

But it’s not just the GLAM Workbench. The Linguistics Data Commons of Australia, another activity to be funded as part of the HASS Research Data Commons, will include tools for text analysis and visualisation. The Time Layered Cultural Map is developing tools for geospatial visualisation of Australian collections. Surely the focus should be on connecting and reusing what’s available. Again I’m wondering what this project is really adding.

Portals and platforms

The original language describing the funded activity is interesting — it is intended to ‘focus on the delivery of researcher portals accessible through Trove’.

Portals (plural) accessible through (not in) Trove.

The NLA could meet a fair proportion of its stated objectives right now, simply by including links to QueryPic and the Trove Newspaper and Gazette Harvester. Done! There’s a million dollars saved.

More seriously, there’s no reason why the outcome of this activity should be a new interface attached to Trove and managed by the NLA. Indeed, such an approach works against integration, reuse, and data sharing. I believe the basic assumptions of the draft plan are seriously flawed. We need to separate out the strands of what’s meant by a ‘platform for advanced research’, and think more creatively and collaboratively about how we could achieve something useful, flexible, and sustainable.

Where’s the API?

I think the primary role of the NLA in the development of this research platform should be as the data provider. There are numerous ways in which Trove’s data might be improved and enriched in support of new research uses. These improvements could then be pushed through the API to integrate with a range of tools and resources. Which raises the question — where is the API in this plan?

The only mention of the API comes as an option for a user with ‘high technical expertise’ to extend the analysis provided by the built-in visualisations. This is all backwards. The API is the key pipeline for data-sharing and integration and should be at the heart of this plan.

This program offers an opportunity to make some much-needed improvements to the API. Here’s a few possibilities:

  • Bring the web interface and API back into sync so that researchers can easily transfer queries between the two (Trove’s interface update introduced new categories, while the API still groups resources by the original zones).
  • Provide public API access to additional data about digitised items. For example, you can get lists of newspaper titles and issues from the API, but there’s no comparable method to get titles and issues for digitised periodicals. The data’s there – it’s used to generate lists of issues in the browse interface – but it’s not in the API. There’s also other resource metadata, such as parent/child relationships, which are embedded in web pages but not exposed in the API.
  • Standardise the delivery of OCRd text for different resource types.
  • Finally add the People & Organisations data to the main RESTful API.
  • Fix the limitations of the web archives CDX API (documented here).
  • Add a search API for the web archives.
  • And what about a Write API? Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources.

I think the HASS RDC would benefit greatly by thinking much more about the role of the Trove API in establishing reusable data flows, and connecting up components.


Anyone who’s been to one of my GLAM Workbench talks will know that I talk a lot about ‘pathways’. My concern is not just to provide useful tools and examples, but to try and connect them in ways that encourage researchers to develop their skills and confidence. So a researcher with limited digital skills can spin up QueryPic and start making visualisations without any specialised knowledge. But if they want to explore the data and assumptions behind QueryPic, they can view a notebook that walks them through the process of getting data from facets and assembling a time series. If they find something interesting in QueryPic, they can go to the Newspaper Harvester and assemble a dataset that helps them zoom into a particular period. There are places to go.
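That process boils down to something like the following – a bare-bones sketch using the v2 API’s decade facet, not QueryPic’s actual code:

import requests

API_KEY = "YOUR_API_KEY"
params = {
    "q": "drought",      # the search term to track over time
    "zone": "newspaper",
    "facet": "decade",   # ask for result counts grouped by decade
    "n": 0,              # no need to return the articles themselves
    "encoding": "json",
    "key": API_KEY,
}
data = requests.get("https://api.trove.nla.gov.au/v2/result", params=params).json()

# Assemble the facet counts into a simple time series
# (decade values are like '190' for the 1900s)
terms = data["response"]["zone"][0]["facets"]["facet"]["term"]
for term in sorted(terms, key=lambda t: t["display"]):
    print(term["display"], term["count"])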

Similarly, users can start making use of the GLAM Workbench in the cloud using Binder – one click and it’s running. But as their research develops they might find Binder a bit limiting, so there are options to spin up the GLAM Workbench using Reclaim Cloud or Docker. As a researcher’s skills, needs, and questions change, so does their use of the GLAM Workbench. At least that’s the plan – I’m very aware that there’s much, much more to do to build and document these pathways.

The developments described in the draft plan are focused on providing simple tools for non-technical users. That’s fair enough, but you have to give those users somewhere to go, some path beyond, or else it just becomes another dead end. Users can download their data or visualisation, but then what?

Of course you don’t point a non-coder to API documentation and say ‘there you go’. But coders can use the API to build and share a range of tools that introduce people to the possibilities of data, and scaffold their learning. Why should there be just one interface? It’s not too difficult to imagine a range of introductory visualisation tools aimed at different humanities disciplines. Instead of focusing inward on a single Trove Viz Lite tool, why not look outwards at ways of embedding Trove data within a range of research training contexts?


A number of the HASS RDC Evaluation Criteria focus on issues of integration, collaboration, and reuse of existing resources. For example:

  • Project plans should display robust proposal planning including the maximisation of the use or re-use of existing research infrastructure, platforms, tools, services, data storage and compute.
  • Project plans should display integrated infrastructure layers with other HASS RDC activities, in particular by linking together elements such as data storage, tools, authentication, licensing, networks, cloud and high-performance computing, and access to data resources for reuse.
  • Project plans must be robust and contribute to the HASS RDC as a coherent whole that capitalises on existing data collections, adheres to the F.A.I.R. principles, develops collaborative tools, utilises shared underlying infrastructure and has appropriate governance planning.

There’s little evidence of this sort of thinking in the draft project plan. I’ve mentioned a few obvious opportunities for integration above, but there are many more. Overall, I think the proposed ‘platform for advanced research’ needs to be designed as a series of interconnected components, and not be seen as the product of a single institution.

We could imagine, for example, a system where the NLA focused on the delivery of research-ready data via the Trove API. A layer of data filtering, cleaning, and packaging tools could be built on top of the API to help users assemble actionable datasets. The packaging processes could use standards such as RO-Crate to prepare datasets for ingest into data repositories. Existing storage services, such as CloudStor, could be used for saving and sharing working datasets. Another layer of visualisation and analysis tools could either process these datasets, or integrate directly with the API. These tools could be spread across different projects including LDaCA, TLCMap, and the GLAM Workbench — using standards such as Jupyter to encourage sharing and reuse of individual components, and running on a variety of cloud-hosted platforms. Instead of just adding another component to Trove, we’d be building a collaborative network of tool builders and data wranglers — developing capacities across the research sector, and spreading the burden of maintenance.


The draft project plan includes some pretty worrying comments about long-term support for the new platform. Work Package 5 notes:

The developed product will require support post release which can be guaranteed for a period not exceeding the contracted period for this project


ARDC will be responsible for providing ongoing financial support for this phase. It has not been included in the proposal.

So once the project is over, the NLA will not support the new platform unless the ARDC provides ongoing funding. What researcher would want to ‘publish’ their data on a platform that could disappear at any time? We all know that sustainability is hard, but you would think that the NLA could at least offer to work collaboratively with the research sector to develop a plan for sustainability, instead of just asking for more money. Why would anyone invest so much for so little?

Leadership and community

The development of collaborations and communities also figures prominently in the HASS RDC Evaluation Criteria. For example:

  • Project plans should clearly demonstrate that they enable collaboration and build communities across geographically dispersed research groups through facilitated sharing of high-quality data, particularly for computational analysis; the development of new platforms for collaboration and sharing; and, the encouragement of innovative methodologies through the use of analytic tools.
  • Project plans must include a demonstrated commitment to ongoing community development to ensure the sustainability of the development is vital. The deliverables will act as ongoing national research infrastructure. They must be broadly usable by more than just the project partners and serve as input to a wide range of research.
  • Project plans, and project leads in particular, should demonstrate the research leadership that will foster and encourage the uptake and use of the HASS RDC.

Once again the draft project plan falls short. There are no project partners listed. Instead the plan refers broadly to all of Trove’s content partners, none of whom have direct involvement in this project. Indeed, as noted above, data aggregated from project partners is excluded from the new platform.

There are no new governance arrangements proposed for this project. Instead the plan refers to the Trove Strategic Advisory Committee which includes representatives from partner organisations. But there are no researcher representatives on this committee.

The only consultation with the research sector undertaken in the ‘Consultation Phase’ of the project is that undertaken by the ARDC itself. Does that mean this current process whereby the ARDC is soliciting feedback on the project plans? Whoa, meta…

The plan notes that during the testing phase described in Work Package 3, ‘HASS community members would gain access to a beta version of the product for comment’. However, later it is stated that access would be provided to ‘a subset of researchers’, and that only system bugs and ‘high priority improvements’ would be acted upon.

Generally speaking, it seems that the NLA is seeking as little consultation as possible. It’s not exploring options for collaboration. It’s not engaging with the research community about these developments. That doesn’t seem like an effective way to build communities. Nor does it demonstrate leadership.

Summing up

This project plan can’t be accepted in its current form. We’ve had failures and disappointments in the development of HASS research infrastructure in the past. The HASS RDC program gives us a chance to start afresh, and the focus on integration, data-sharing, and reuse gives hope that we can build something that will continue to grow and develop, and not wither through lack of engagement and support. So should the NLA be getting $2 million to add a new component to Trove that is not integrated with other HASS RDC projects, and substantially duplicates tools available elsewhere? No, I don’t think so. They need to go back to the drawing board, undertake some real consultation, and build collaborations, not products.

Some research projects that have used QueryPic

A Twitter thread about some of the research uses of QueryPic…

Government publications in Trove

Over the last few weeks I’ve been updating my harvests of OCRd text from digitised books and periodicals in Trove. As part of the harvesting process, I’ve created lists of both that are available in digital form – this includes digitised works, as well as those that are born-digital (such as PDFs or epubs). I’ve published the full lists of books and periodicals as searchable databases to make them easy to explore.

One thing that you might notice is that works with the format ‘Government publication’ pop up in both lists – sometimes it’s not clear whether something is a ‘book’ or ‘periodical’. To make it easier to find these items, no matter what their format, I’ve combined data from my two harvests and created a searchable dataset of government publications. It includes links to download OCRd text from CloudStor if available.

All three databases make use of Datasette, which I’ve also used for the GLAM Name Index Search. One of the cool things about Datasette is that it provides its own API, so if you find something interesting in any of these databases, you can easily download the machine-readable data for further analysis.
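For example, Datasette will return any table as JSON if you add ‘.json’ to its URL. A quick sketch (the URL here is a placeholder – substitute the address of one of the databases above):

import requests

# Fetch rows from a Datasette table as JSON
# (_shape=objects returns each row as a dictionary)
url = "https://example-datasette-instance/data/books.json"
params = {"_search": "railways", "_shape": "objects", "_size": 100}
rows = requests.get(url, params=params).json()["rows"]
for row in rows:
    print(row)

#dhhacks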

GLAM Workbench – a platform for digital HASS research

We’re in the midst of planning for the HASS Research Data Commons, which will deliver some much-needed investment in digital research infrastructure for the humanities and social sciences. Amongst the funded programs are tools for text analysis as part of the Linguistics Data Commons, and a platform for more advanced research using Trove. I’m hoping that this will be an opportunity to take stock of existing tools and resources, and build flexible pathways for researchers that enable them to collect, move, analyse, preserve, and share data across different platforms and services.

To this end, I thought it might be useful to try and summarise what the GLAM Workbench offers, particularly for Trove researchers. The GLAM Workbench doesn’t really have an institutional home, and is mostly unfunded – it’s my passion project. That means that it’s easy to overlook, particularly when the big grants are being doled out. But I think it has a lot to offer and I’m looking forward to exploring ways it can connect with these new initiatives.

Getting and moving data

There’s lots of fabulous data in Trove and other GLAM collections. In fact, there’s so much data that it can be difficult for researchers to find and collect what’s relevant to their interests. There are many tools in the GLAM Workbench to help researchers assemble their own datasets. For example:

  • Get newspaper articles in bulk with the Trove Newspaper and Gazette Harvester – This has been around in some form for more than ten years (it pre-dates the Trove API!). Give it the url of a search in Trove’s newspapers and gazettes and the harvester will save all the metadata in a CSV file, and optionally download the complete articles as OCRd text, images, or PDFs. The amount of data you harvest is really only limited by your patience and disk space. I’ve harvested more than a million articles in the past. The GLAM Workbench includes a web app version of the harvester that runs live in the cloud – just paste in your Trove API key and the search url, and click the button. (There’s a sketch of the harvester’s underlying API paging after this list.)
  • Get Trove newspaper pages as images – If you need a nice, high-resolution version of a newspaper page you can use this web app. If you want to harvest every front page (or some other particular page) here’s an example that gets all the covers of the Australian Women’s Weekly. A pre-harvested collection of the AWW covers is included as a bonus extra.
  • Get Trove newspaper articles as images – The Trove web interface makes it difficult to download complete images of articles, but this tool will do the job. There’s a handy web app to grab individual images, but the code from this tool is reused in other places such as the Trove Newspaper Harvester and the Omeka uploader, and could be built into your own research workflows.
  • Upload Trove newspaper articles to Omeka – Whether you’re creating on online exhibition or building a research database, Omeka can be very useful. This notebook connects Trove’s newspapers to Omeka for easy upload. Your selected articles can come from a search query, a Trove list, a Zotero library, or just a list of article ids. Metadata records are created in Omeka for each article and newspaper, and an image of each article is attached.
  • Get OCRd text from digitised periodicals in Trove – They’re often overshadowed by the newspapers, but there’s now lots of digitised journals, magazines, and parliamentary papers in Trove. You can get article-level data from the API, but not issue data. This notebook enables researchers to get metadata and OCRd text from every available issue of a periodical. To make researchers’ lives even easier, I regularly harvest all the available OCRd text from digitised periodicals in Trove. The latest harvest downloaded 51,928 issues from 1,163 periodicals – that’s about 10gb of text. You can browse the list of periodicals with OCRd text, or search this database. All the OCRd text is stored in a public repository on CloudStor.
  • Get page images from digitised periodicals in Trove – There’s more than text in digitised periodicals, and you might want to download images of pages for visual analysis. This notebook shows you how to get cover images, but could be easily modified to get another page, or a PDF. I used a modified version of this to create a collection of 3,471 full page editorial cartoons from The Bulletin, 1886 to 1952 – all available to download from CloudStor.
  • Get OCRd text from digitised books in Trove – Yep, there’s digitised books as well as newspapers and periodicals. You can download OCRd text from an individual book using the Trove web interface, but how do you make a collection of books without all that pointing and clicking? This notebook downloads all the available OCRd text from digitised books in Trove. The latest harvest includes text from 26,762 works. You can explore the results using this database.
  • Harvest parliamentary press releases from Trove – Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. This notebook shows you how to harvest both metadata and fulltext from a search of the parliamentary press releases. For example, here’s a collection of politicians talking about ‘refugees’, and another relating to COVID-19.
  • Harvest details of Radio National programs from Trove – Trove creates records for programs broadcast on ABC Radio National; for the major current affairs programs these records are at segment level. Even though they don’t provide full transcripts, this data provides a rich, fine-grained record of Australia’s recent political, social, and economic history. This notebook shows you how to download the Radio National data. If you just want to dive straight in, there’s also a pre-harvested collection containing more than 400,000 records, with separate downloads for some of the main programs.
  • Find all the versions of an archived web page in Trove – Many of the tools in the Web Archives section of the GLAM Workbench will work with the Australian Web Archive, which is part of Trove. This notebook shows you how to get data about the number of times a web page has been archived over time.
  • Harvesting collections of text from archived web pages in Trove – If you want to explore how the content of a web page changes over time, you can use this notebook to capture the text content of every archived version of a web page.
  • Convert a Trove list into a CSV file – While Trove provides a data download option for lists, it leaves out a lot of useful data. This notebook downloads full details of newspaper articles and other works in a list and saves them as CSV files. Like the Trove Newspaper Harvester, it lets you download OCRd text and images from newspaper articles.
  • Collecting information about Trove user activity – It’s not just the content of Trove that provides interesting research data, it’s also the way people engage with it. Using the Trove API it’s possible to harvest details of all user created lists and tags. And yes, there’s pre-harvested collections of lists and tags for the impatient.
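
Most of these notebooks build on the same underlying pattern – paging through results from the Trove API. If you’re curious about what the Newspaper Harvester is doing under the hood, here’s a minimal sketch (the query is just an example, and you’ll need your own API key):

```python
import requests

API_KEY = "YOUR_API_KEY"  # available from your Trove user profile
API_URL = "https://api.trove.nla.gov.au/v2/result"

params = {
    "q": "wragge",            # an example search query
    "zone": "newspaper",
    "encoding": "json",
    "n": 100,                 # results per request
    "bulkHarvest": "true",    # keeps results in a stable order for paging
    "key": API_KEY,
    "s": "*",                 # start token for the first page
}

articles = []
while True:
    data = requests.get(API_URL, params=params).json()
    records = data["response"]["zone"][0]["records"]
    articles += records.get("article", [])
    # 'nextStart' is only present while there are more results to fetch
    if "nextStart" not in records:
        break
    params["s"] = records["nextStart"]

print(f"Harvested {len(articles)} articles")
```

The real harvester adds error handling, rate limiting, and the text, image, and PDF downloads, but a paging loop like this is the heart of it.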

While I’m focusing here on Trove, there are also tools to create datasets from the National Archives of Australia, DigitalNZ and Papers Past, the National Museum of Australia and more. And there’s a big list of readily downloadable datasets from Australian GLAM organisations.

Visualisation and analysis

Many of the notebooks listed above include examples that demonstrate ways of exploring and analysing your harvested data, and a number of companion notebooks examine some possibilities in more detail.

There are also many other notebooks that demonstrate methods for analysing Trove’s content, for example:

  • QueryPic – Another tool that’s been around in different forms for a decade, QueryPic visualises searches in Trove’s newspapers. The latest web app couldn’t be simpler: just paste in your API key and a search url to create charts showing the number of matching articles over time. You can combine queries, change time scales, and download the data and visualisations.
  • Visualise Trove newspaper searches over time – This is like a deconstructed version of QueryPic that walks you through the process of using Trove’s facets to assemble a dataset of results over time. It provides a lot of detail on the sorts of data available, and the questions we can ask of it. (There’s a sketch of this facet-based approach after this list.)
  • Visualise the total number of newspaper articles in Trove by year and state – This notebook uses a modified version of the code above to analyse the construction and context of Trove’s newspaper corpus itself. What are you actually searching? Meet the WWI effect and the copyright cliff of death! This is a great place to start if you want to get people thinking critically about how digital resources are constructed.
  • Analyse rates of OCR correction – Some more meta-analysis of the Trove corpus itself, this time focusing on patterns of OCR correction by Trove users.
  • Identifying non-English language newspapers in Trove – There are a growing number of non-English language newspapers digitised in Trove. However, if you’re only searching using English keywords, you might never know that they’re there. This notebook analyses a sample of articles from every newspaper in Trove to identify non-English content.
  • Beyond the copyright cliff of death – Most of the newspaper articles on Trove were published before 1955, but there are some from the later period. This notebook helps you find out how many, and which newspapers they were published in.
  • Map Trove newspaper results by state – This notebook uses the Trove state facet to create a choropleth map that visualises the number of search results per state.
  • Map Trove newspaper results by place of publication – This notebook uses the Trove title facet to find the number of results per newspaper, then merges the results with a dataset of geolocated newspapers to map where articles were published.
  • Compare two versions of an archived web page – This notebook demonstrates a number of different ways of comparing versions of archived web pages. Just choose a repository, enter a url, and select two dates to see comparisons based on: page metadata, basic statistics such as file size and number of words, numbers of internal and external links, cosine similarity of text, line by line differences in text or code, and screenshots.
  • Display changes in the text of an archived web page over time – This web app gathers all the available versions of a web page and then visualises changes in its content between versions – what’s been added, removed, and changed?
  • Use screenshots to visualise change in a page over time – Create a series of full page screenshots of a web page over time, then assemble them into a time series.
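
As a taste of the deconstructed approach, here’s a minimal sketch that uses the year facet to count matching newspaper articles within a single decade – the query is just an example, and you’ll need your own API key:

```python
import requests

API_KEY = "YOUR_API_KEY"
API_URL = "https://api.trove.nla.gov.au/v2/result"

def articles_per_year(query, decade):
    """Count matching newspaper articles per year within one decade.
    'decade' uses Trove's format -- e.g. 191 for the 1910s."""
    params = {
        "q": query,
        "zone": "newspaper",
        "encoding": "json",
        "n": 0,              # no articles needed, just the facet counts
        "facet": "year",
        "l-decade": decade,
        "key": API_KEY,
    }
    data = requests.get(API_URL, params=params).json()
    terms = data["response"]["zone"][0]["facets"]["facet"]["term"]
    if isinstance(terms, dict):  # a single matching year comes back as a dict
        terms = [terms]
    return {t["display"]: int(t["count"]) for t in terms}

# Year-by-year counts for 'influenza' articles across the 1910s
print(articles_per_year("influenza", 191))
```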

There are also possibilities for using Trove data creatively. For example you can create ‘scissors and paste’ messages from Trove newspaper articles.

Documentation and examples

All the Trove notebooks in the GLAM Workbench help document the possibilities and limits of the Trove API. The examples above can be modified and reworked to suit different research interests. Some notebooks also explore particular aspects of the API, for example:

  • Trove API Introduction – Some very basic examples of making requests and understanding results.
  • Today’s news yesterday – Uses the date index and the firstpageseq parameter to find articles from exactly 100 years ago that were published on the front page. It then selects one of the articles at random and downloads and displays an image of the front page. (See the sketch after this list.)
  • The use of standard licences and rights statements in Trove image records – Version 2.1 of the Trove API introduced a new rights index that you can use to limit your search results to records that include one of a list of standard licences and rights statements. We can also use this index to build a picture of which rights statements are currently being used, and by whom.
  • Random items from Trove – Changes to the Trove API meant that techniques you could previously use to select resources at random no longer work. This section documents some alternative ways of retrieving random-ish works and newspaper articles from Trove.
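
To give you the flavour of these, here’s roughly what the ‘Today’s news yesterday’ query looks like. This sketch uses the date and firstpageseq indexes inside the q string, and simplifies the date arithmetic:

```python
import datetime
import requests

API_KEY = "YOUR_API_KEY"
API_URL = "https://api.trove.nla.gov.au/v2/result"

# Roughly 100 years ago (good enough for a sketch)
date = datetime.date.today() - datetime.timedelta(days=36525)

# 'date' and 'firstpageseq' are search indexes used within the query itself
params = {
    "q": f"date:[{date}T00:00:00Z TO {date}T00:00:00Z] firstpageseq:1",
    "zone": "newspaper",
    "encoding": "json",
    "n": 100,
    "key": API_KEY,
}
data = requests.get(API_URL, params=params).json()
articles = data["response"]["zone"][0]["records"].get("article", [])
print(f"{len(articles)} front-page articles from {date}")
```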

And while it’s not officially part of the GLAM Workbench, I also maintain the Trove API Console which provides lots of examples of the API in action.


In developing the GLAM Workbench I’m very aware that people will arrive with different levels of digital skill, confidence, and experience. That’s why I’ve been putting a lot of thought and effort into ways of providing a range of entry points.

Someone who might not identify as a ‘digital’ researcher can, with a single click, fire up QueryPic and start exploring changes over time in Trove’s newspapers. This is possible because the GLAM Workbench is configured to make use of Binder, a service that spins up customised computing environments as needed.

Another researcher might start running the Trove Newspaper Harvester using Binder, but find that they want to run bigger and longer harvests. In that case, the GLAM Workbench offers a one-click installation of the Trove Newspaper Harvester on Reclaim Cloud. Unlike Binder, Reclaim Cloud environments are persistent, so you can run the harvester for as long as you want without the worry of interruptions.

Yet another researcher might want to understand how the Trove API works and the sorts of data that it makes available. By exploring the various notebooks they’ll find useful snippets of code they can try out in their own projects.

The GLAM Workbench connects outwards to make use of a range of other services – the notebooks run in Binder, Reclaim Cloud, and Docker; the code is all openly licensed and publicly available through GitHub and Zenodo; data is hosted on GitHub, CloudStor, and Zenodo; datasets can be explored using Datasette running on Glitch or Google CloudRun. I’m hoping that the new investments in HASS research infrastructure will embed a similar philosophy, connecting up existing services rather than starting from scratch.

The future

This is just an outline of what the GLAM Workbench currently offers researchers wanting to make use of the data available from Trove. It’s all there now, publicly accessible, openly licensed, and ready to use – take it, use it, change it, share it. But there’s much more I’d like to do, both in regard to Trove and to encourage use of GLAM data more generally. I’m also interested in your ideas for new tools, examples, or data sources – what would help your research? You can add a suggestion in GitHub, or post a comment in the GLAM Workbench channel of OzGLAM Help.

See the Getting Started section of the GLAM Workbench for more hints and examples. And keep an eye on the news feed for the latest additions and updates.

A Family History Month experiment – search millions of name records from GLAM organisations

There’s a lot of rich historical data contained within the indexes that Australian GLAM organisations provide to help people navigate their records. These indexes, often created by volunteers, allow access by key fields such as name, date or location. They aid discovery, but also allow new forms of analysis and visualisation. Kate Bagnall and I wrote about some of the possibilities, and the difficulties, in this recently published article.

Many of these indexes can be downloaded from government data portals. The GLAM Workbench demonstrates how these can be harvested, and provides a list of available datasets to browse. But what’s inside them? The GLAM CSV Explorer visualises the contents of the indexes to give you a sneak peek and encourage you to dig deeper.

There’s even more indexes available from the NSW State Archives. Most of these aren’t accessible through the NSW government data portal yet, but I managed to scrape them from the website a couple of years ago and made them available as CSVs for easy download.

It’s Family History Month at the moment, and the other night I thought of an interesting little experiment using the indexes. I’ve been playing round with Datasette lately. It’s a fabulous tool for exploring tabular data, like CSVs. I also noticed that Datasette’s creator Simon Willison had added a search-all plugin that enabled you to run a full text search across multiple databases and tables. Hmmm, I wondered, would it be possible to use Datasette to provide a way of searching for names across all those GLAM indexes?

After a few nights’ work, I found the answer was yes.

Try out my new aggregated search interface here!

(The cloud service it uses runs on demand, so if it has gone to sleep, it might take a little while to wake up again – just be patient for a few seconds.)

Currently, the GLAM Name Search interface lets you search for names across 195 indexes from eight GLAM organisations. Altogether there are more than 9.2 million rows of data to explore!

It’s simple to use – just enter a name in the search box and Datasette will search each index in turn, displaying the first five matching results. You can click through to view all results from a specific index. Not surprisingly, the aggregated name search only searches columns containing names. However, once you click through to an individual table, you can apply additional filters or facets.

To create the aggregated search interface I worked through the list of CSVs I’d harvested from government data portals to identify those that contained names of people, and discard those that contained administrative, rather than historical, data. I also made a note of the columns that contained the names so I could index their contents once they’d been added to the database. Usually these were fields such as Surname or Given names, but sometimes names were in the record title or notes.

Datasette uses SQLite databases to store its data. I decided to create one database for each GLAM organisation. I wrote some code to work through my list of datasets, saving them into an SQLite database, indexing the name columns, and writing information about the dataset to a metadata.json file. This file is used by Datasette to display information such as the title, source, licence, and last modified date of each of the indexes.
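
The sqlite-utils library (another of Simon Willison’s tools) does most of the heavy lifting here. A simplified sketch, using a hypothetical CSV and made-up column names:

```python
import csv
import sqlite_utils

db = sqlite_utils.Database("nsw-state-archives.db")

# A hypothetical index from my list of datasets
with open("assisted-immigrants.csv", newline="") as f:
    db["assisted-immigrants"].insert_all(csv.DictReader(f))

# Index the columns containing names so full text search can find them
db["assisted-immigrants"].enable_fts(["Surname", "Given names"])
```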

Once that was done, I could fire up Datasette and feed it all the SQLite databases. Amazingly it all worked – searching across all the indexes was remarkably quick! To make it publicly available I used Datasette’s publish command to push everything to Google CloudRun (about 1.4 gb of data). The first time I used CloudRun it took some time to get the authentication and other settings working properly. This time was much smoother. Before long it was live!

Once I knew it all worked, I decided to add in another 59 indexes from the NSW State Archives. I also plugged in a few extra indexes from the Public Record Office of Victoria. These datasets are stored as ZIP files in the Victorian government data portal, so it took a little bit of extra manual processing to get everything sorted. But finally I had all 195 indexes loaded.

What now? That depends on whether people find this experiment useful. I have a few ideas for improvements. But if people do use it, then the costs will go up. I’m going to have to monitor this over the next couple of months to see if I can afford to keep it going. If you want to help with the running costs, you might like to sign up as a GitHub sponsor.

And please let me know if you think it’s worth developing! #dhhacks

Explore Trove’s digitised books

The Trove books section of the GLAM Workbench has been updated! There’s freshly-harvested data, as well as updated Python packages, integration with Reclaim Cloud, and automated Docker builds.

Included is a notebook to harvest details of all books available from Trove in digital form. This includes both digitised books that have been scanned and OCRd, and born-digital publications, such as PDFs and epubs. The definition of ‘books’ is pretty loose – I’ve harvested details of anything that has been assigned the format ‘Book’ in Trove, but this includes ephemera, such as posters, pamphlets, and advertising.

In the latest harvest, I ended up with details of 42,174 ‘books’. This includes some duplicates, because multiple metadata entries can point to the same digital object. I thought it was best to preserve the duplicates, rather than discard the metadata.

Once I’d harvested the details of the books, I tried to see if there was any OCRd text available for download. If there was, I saved it to a public folder on CloudStor. In total, I was able to download 26,762 files of OCRd text.
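
The check itself is basically just an HTTP request per work. Here’s a rough sketch – the download url pattern is my reconstruction of what the Trove web interface uses, so treat it as an assumption rather than a documented endpoint:

```python
from pathlib import Path
import requests

def save_ocr(trove_id, pages, output_dir="text"):
    """Try to download the OCRd text of a digitised work.
    The url pattern below is an assumption based on the Trove web
    interface's download option -- a sketch, not a documented API."""
    url = f"https://trove.nla.gov.au/{trove_id}/download"
    params = {"downloadOption": "ocr", "firstPage": 0, "lastPage": pages - 1}
    response = requests.get(url, params=params)
    if response.ok and response.text:
        Path(output_dir).mkdir(exist_ok=True)
        Path(output_dir, f"{trove_id}.txt").write_text(response.text)
        return True
    return False

# save_ocr("nla.obj-12345678", pages=250)  # hypothetical identifier
```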

[Screenshot of database showing details of a digital book]

The easiest way to explore the books is using this searchable database. It’s created using Datasette and is running on Glitch. Full text search is available on the ‘title’ and ‘contributors’ fields, and you can filter on things like date, copyright status, number of pages, and whether OCRd text is available for download. If there is OCRd text, a direct link to the file on CloudStor is included. You can use the database to filter the titles, creating your own dataset that you can download in CSV or JSON format.

If you just want the full list of books as a CSV file, you can download it here. And if you want all the OCRd text, you can go straight to the public folder on CloudStor – there’s about 3.6gb of text files to explore! #dhhacks

A miscellany of ephemera, oddities, & estrays

I’m just in the midst of updating my harvest of OCRd text from Trove’s digitised books (more about that soon!). But amongst the items catalogued as ‘books’ are a wide assortment of ephemera, posters, advertisements, and other oddities. There’s no consistent way of identifying these items through the search interface, but because I’ve found the number of pages in each ‘book’ as part of the harvesting process, I can limit results to items with just a single digitised page – there’s more than 1,500! To make it easy to explore this collection of odds and ends, I’ve downloaded all the single page images and compiled them into one big PDF with links back to their entries in Trove. Enjoy your browsing!
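
If you’d like to try something similar with the harvested metadata, here’s a rough sketch – the CSV filename, column names, and image url pattern are all assumptions for illustration:

```python
import io
import pandas as pd
import requests
from PIL import Image

# Load the harvested books metadata (assumed filename and column names)
df = pd.read_csv("trove-digital-books.csv")
ephemera = df.loc[df["pages"] == 1]
print(f"{len(ephemera)} single-page items")

# Download each page image and compile them into one PDF.
# The image url pattern here is a placeholder for illustration.
images = []
for trove_id in ephemera["trove_id"][:10]:  # just the first ten as a test
    response = requests.get(f"https://trove.nla.gov.au/{trove_id}/image")
    images.append(Image.open(io.BytesIO(response.content)).convert("RGB"))

images[0].save("ephemera.pdf", save_all=True, append_images=images[1:])
```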

This is another example of the ways in which we can extend and enrich existing collection interfaces using simple technologies like PDFs and CSVs. We can create slices across existing categories to expose interesting features, and provide new entry points for researchers. Some other examples in the GLAM Workbench are the collection of editorial cartoons from The Bulletin, the list of Trove newspapers with non-English content, the harvest of ABC Radio National programs, and the recent collection of politicians talking about COVID. Let me know if you have any ideas for additional slices! #dhhacks

Everyday heritage and the GLAM Workbench

Some good news on the funding front with the success of the Everyday Heritage project in the latest round of ARC Linkage grants. The project aims to look beyond the formal discourses of ‘national’ heritage to develop a more diverse range of heritage narratives. Working at the intersection of place, digital collections, and material culture, team members will develop a series of ‘heritage biographies’ that document everyday experience and provide new models for the heritage sector.

[Screen capture of project details in ARC grants database]

Digital methods will play a major role in the project. I’ll be leading the ‘Heritage Hacks’ work package that will support the creation of the heritage biographies and develop a range of new tools and tutorials for use in heritage management contexts. All the tools, methods, and data generated through the project will be documented using Jupyter notebooks and published through the GLAM Workbench. Watch this space!

The project is led by Tracy Ireland (University of Canberra), with Jane Lydon (UWA), Kate Bagnall (UTAS), and me as chief investigators. Our industry partner is GML Heritage.