I’m hoping that the GLAM Workbench will encourage GLAM organisations and GLAM data nerds (like me) to create their own Jupyter notebooks. If they do, they can put a link to them in the list of GLAM Jupyter resources. But what if they want to add the notebooks to the GLAM Workbench itself?
To make this easier, I’ve been working on a template repository for the GLAM Workbench. It generates a new skeleton repository with all the files you need to develop and manage your own section of the GLAM Workbench.
Want a bit of added GLAM with your digital research skills? You’re in luck, as I’ll be speaking at not one, but three ResBaz events in November. If you haven’t heard of it before, ResBaz (Research Bazaar) is ‘a worldwide festival promoting the digital literacy at the centre of modern research’.
On Wednesday, 24 November I’ll be giving a key story presentation (like a keynote, but with more story!) entitled ‘Exploring GLAM data’ for ResBaz Queensland.
The latest help video for the GLAM Workbench walks through the web app version of the Trove Newspaper & Gazette Harvester. Just paste in your search url and Trove API key and you can harvest thousands of digitised newspaper articles in minutes!
An inquiry on Twitter prompted me to put together a notebook that you can use to download all available issues of a newspaper as PDFs. It was really just a matter of copying code from other tools and making a few modifications. The first step harvests a list of available issues for a particular newspaper from Trove. You can then download the PDFs of those issues, supplying an optional date range.
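As a rough sketch of the optional date range step (not the notebook's actual code), the example below filters a list of harvested issues before anything is downloaded. The issue records, their `id` and `date` fields, and the `filter_issues` and `pdf_filename` helpers are all assumptions made for illustration. The download request itself is left out, since the PDF url pattern isn't given here.

```python
from datetime import date


def filter_issues(issues, start=None, end=None):
    """Keep issues whose date falls within the optional start/end bounds (inclusive)."""
    selected = []
    for issue in issues:
        issue_date = date.fromisoformat(issue["date"])
        if start is not None and issue_date < start:
            continue
        if end is not None and issue_date > end:
            continue
        selected.append(issue)
    return selected


def pdf_filename(title_id, issue):
    """Build a local filename for an issue PDF (hypothetical naming scheme)."""
    return f"{title_id}-{issue['date'].replace('-', '')}-{issue['id']}.pdf"


# Made-up issue records standing in for the harvested list of issues
issues = [
    {"id": "101", "date": "1930-01-04"},
    {"id": "102", "date": "1930-01-11"},
    {"id": "103", "date": "1930-02-01"},
]

# Supplying no dates keeps everything; here only January 1930 is kept
january = filter_issues(issues, start=date(1930, 1, 1), end=date(1930, 1, 31))
filenames = [pdf_filename("35", issue) for issue in january]
print(filenames)
```

Filtering before downloading means a date range only costs as many requests as there are matching issues.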
The GLAM Workbench isn’t dependent on one big piece of technological infrastructure. It’s basically a collection of Jupyter notebooks, and those notebooks can be used within a variety of different environments. This helps make the GLAM Workbench more sustainable – new components can be swapped in and out as required. It also makes it possible to create different pathways for users, depending on their digital skills, institutional support, and research needs.
A new version of the GLAM Name Index Search is available. An additional 49 indexes have been added, bringing the total to 246. You can now search for names in more than 10.2 million records from 9 organisations.
The new indexes come from Queensland State Archives and the State Library of WA. QSA announced on Friday that they’d added two new indexes to their site. When I went to harvest them, I realised there were another 25 indexes that I hadn’t previously picked up.
When you search Trove’s newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? How do you get a list of dates when newspapers were published? This notebook in the GLAM Workbench shows how you can get information about issues from the Trove API.
Using the notebook, I’ve created a couple of datasets ready for download and use.
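To give a feel for what asking the API about issues might involve, here's a minimal sketch rather than the notebook's own code. It assumes version 2 of the Trove API, where a newspaper title request accepts `include=years` and a `range` of dates, and it assumes the response nests issues under each year. The endpoint, parameter names, response shape, and the sample data are all assumptions to check against the current API documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint for newspaper titles in version 2 of the Trove API
API_BASE = "https://api.trove.nla.gov.au/v2/newspaper/title/"


def issues_request_url(title_id, start, end, api_key):
    """Build a url asking for the issues of one title between two YYYYMMDD dates."""
    params = {
        "encoding": "json",
        "include": "years",
        "range": f"{start}-{end}",
        "key": api_key,
    }
    return f"{API_BASE}{title_id}?{urlencode(params)}"


def flatten_issues(response):
    """Turn the assumed nested year/issue structure into a flat list of issues."""
    issues = []
    for year in response["newspaper"].get("year", []):
        for issue in year.get("issue", []):
            issues.append({"id": issue["id"], "date": issue["date"]})
    return issues


url = issues_request_url("35", "19300101", "19301231", "YOUR_API_KEY")
print(url)

# Made-up response in the assumed shape, so the example runs without a network call
sample_response = {
    "newspaper": {
        "id": "35",
        "year": [
            {"date": "1929", "issuecount": 310},
            {
                "date": "1930",
                "issuecount": 2,
                "issue": [
                    {"id": "101", "date": "1930-01-04"},
                    {"id": "102", "date": "1930-01-11"},
                ],
            },
        ],
    }
}
issue_list = flatten_issues(sample_response)
print(issue_list)
```

Years outside the requested range are treated as having no issue details, which is why `flatten_issues` falls back to an empty list.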
Way back in 2013, I went to the eResearch Australasia conference as the manager of Trove to talk about new research possibilities using the Trove API. Eight years later I was back, still spruiking the possibilities of Trove data. This time, however, I was discussing Trove in the broader context of GLAM data – all the exciting possibilities that have emerged as galleries, libraries, archives and museums make more of their collections available in machine-readable form.
There’s no reliable way of downloading an image of a Trove newspaper article from the web interface. The image download option produces an HTML page with embedded images, and the article is often sliced into pieces to fit the page.
This Python package includes tools to download articles as complete JPEG images. If an article is printed across multiple newspaper pages, multiple images will be downloaded – one for each page.
Two more datasets have been added to the GLAM Name Index Search! From the History Trust of South Australia and Collab, I’ve added:
- Passengers in History – that’s 371,894 records of people arriving in South Australia from 1836 to 1961
- Women’s Suffrage Petition 1894 (South Australia) – another 10,638 names

In total there’s 9.67 million name records to search across 197 datasets provided by 9 GLAM organisations!
Recently I created a list of publications that made use of QueryPic, my tool to visualise searches in Trove’s digitised newspapers. Here’s another example of the GLAM Workbench and QueryPic in action, in Professor Julian Meyrick’s recent keynote lecture, ‘Looking Forward to the 1950s: A Hauntological Method for Investigating Australian Theatre History’.
Late last year the Federal Government announced it was making an $8.9 million investment in HASS and Indigenous research infrastructure. This program is being managed by the ARDC and will lead to the development of a HASS Research Data Commons. According to the ARDC, a research data commons:
brings together people, skills, data, and related resources such as storage, compute, software, and models to enable researchers to conduct world class data-intensive research
A Twitter thread about some of the research uses of QueryPic…
QueryPic, my tool for visualising searches in @TroveAustralia’s digitised newspapers, has been around in different forms for more than 10 years. The latest version is part of the #GLAMWorkbench: https://t.co/qnY5tVDwgY #researchinfrastructure
— Tim Sherratt (@wragge) August 29, 2021
I thought I’d highlight some of the research publications that have made use of QueryPic over the years, so, in no particular order.
Over the last few weeks I’ve been updating my harvests of OCRd text from digitised books and periodicals in Trove. As part of the harvesting process, I’ve created lists of both that are available in digital form – this includes digitised works, as well as those that are born-digital (such as PDFs or epubs). I’ve published the full lists of books and periodicals as searchable databases to make them easy to explore.
We’re in the midst of planning for the HASS Research Data Commons, which will deliver some much-needed investment in digital research infrastructure for the humanities and social sciences. Amongst the funded programs are tools for text analysis as part of the Linguistics Data Commons, and a platform for more advanced research using Trove. I’m hoping that this will be an opportunity to take stock of existing tools and resources, and build flexible pathways for researchers that enable them to collect, move, analyse, preserve, and share data across different platforms and services.
There’s a lot of rich historical data contained within the indexes that Australian GLAM organisations provide to help people navigate their records. These indexes, often created by volunteers, allow access by key fields such as name, date or location. They aid discovery, but also allow new forms of analysis and visualisation. Kate Bagnall and I wrote about some of the possibilities, and the difficulties, in this recently published article.
Many of these indexes can be downloaded from government data portals.
The Trove books section of the GLAM Workbench has been updated! There’s freshly-harvested data, as well as updated Python packages, integration with Reclaim Cloud, and automated Docker builds.
Included is a notebook to harvest details of all books available from Trove in digital form. This includes both digitised books that have been scanned and OCRd, and born-digital publications such as PDFs and epubs. The definition of ‘books’ is pretty loose – I’ve harvested details of anything that has been assigned the format ‘Book’ in Trove, but this includes ephemera, such as posters, pamphlets, and advertising.
I’m just in the midst of updating my harvest of OCRd text from Trove’s digitised books (more about that soon!). But amongst the items catalogued as ‘books’ are a wide assortment of ephemera, posters, advertisements, and other oddities. There’s no consistent way of identifying these items through the search interface, but because I’ve found the number of pages in each ‘book’ as part of the harvesting process, I can limit results to items with just a single digitised page – there’s more than 1,500!
Some good news on the funding front with the success of the Everyday Heritage project in the latest round of ARC Linkage grants. The project aims to look beyond the formal discourses of ‘national’ heritage to develop a more diverse range of heritage narratives. Working at the intersection of place, digital collections, and material culture, team members will develop a series of ‘heritage biographies’, that document everyday experience, and provide new models for the heritage sector.
So far this year I’ve given eight workshops or presentations relating to the GLAM Workbench, with probably a few more yet to come. Here’s the latest:
- Introducing the GLAM Workbench, presentation for the Griffith University Centre for Social and Cultural Research, Digital Humanities Seminar Series, 6 August 2021
- Exploring the GLAM Workbench (slides), presentation for the UTS Digital Histories Seminar Series, 8 July 2021
- The GLAM Workbench: A Labs approach?
For a few years now I’ve been harvesting downloadable text from digitised periodicals in Trove and making it easily available for exploration and research. I’ve just completed the latest harvest – here’s the summary:
- 1,163 digitised periodicals had text available for download
- Text was downloaded from 51,928 individual issues
- Adding up to a total of around 12 GB of text

If you want to dive straight in, here’s a list of all the harvested periodicals, with links to download a summary of available issues, as well as all the harvested text (there’s one file per issue).
The Trove Journals section of the GLAM Workbench includes a notebook that helps you download press releases, speeches, and interview transcripts by Australian federal politicians. These documents are compiled and published by the Parliamentary Library, and the details are regularly harvested into Trove.
Using this notebook, I’ve created a collection of documents that include the words ‘COVID’ or ‘Coronavirus’. It includes all the metadata from Trove, as well as the full text of each document downloaded from the Parliamentary Library.
I’ve always been interested in the way people add value to resources in Trove. OCR correction tends to get all the attention, but Trove users have also been busy organising resources using tags, lists, and comments. I used to refer to tagging quite often in presentations, pointing to the different ways tags were used. For example, ‘TBD’ is a workflow marker, used by text correctors to label articles that are ‘To Be Done’.
I’ve spent a lot of time this year working on ways of improving the GLAM Workbench’s documentation and its integration with other services. Last year I created OzGLAM Help to provide a space where users of GLAM collections could ask questions and share discoveries – including a dedicated GLAM Workbench channel. Earlier this year, I tweaked my Micro.blog powered updates to include a dedicated GLAM Workbench news feed. Now I’ve brought the two together!
I’ve started creating short videos to introduce or explain various components of the GLAM Workbench. The first video shows how you can visualise searches in Trove’s digitised newspapers using the latest version of QueryPic. It’s a useful introduction to the way access to collection data enables us to ask different types of questions of historical sources.
As with all GLAM Workbench resources, the video is openly-licensed – so feel free to drop it into your own course materials or workshops.
To help you make use of the GLAM Workbench, I’ve set up an ‘office hours’ time slot every Friday when people can book in for 30 minute chats via Zoom. Want to talk about how you might use the GLAM Workbench in your latest research project? Are you having trouble getting started with GLAM data? Or perhaps you have some ideas for future notebooks you’d like to share? Just click on the ‘Book a chat’ link in the GLAM Workbench, or head straight to the scheduling page to set up a time!
QueryPic is a tool to visualise searches in Trove’s digitised newspapers. I created the first version way back in 2011, and since then it’s taken a number of different forms. The latest version introduces some new features:
Automatic query creation – construct your search in the Trove web interface, then just copy and paste the url into QueryPic. This means you can take advantage of Trove’s advanced search and facets to build complex queries.
I recently took part in a panel at the IIPC Web Archiving Conference discussing ‘Research use of web archives: a Labs approach’. My fellow panellists described some amazing stuff going on in European cultural heritage organisations to support researchers who want to make use of web archives. My ‘lab’ doesn’t have a physical presence, or an institutional home, but it does provide a starting point for researchers, and with the latest Reclaim Cloud and Docker integrations, everyone can have their own web archives lab!
When the 1-click installer for Reclaim Cloud works its magic and turns GLAM Workbench repositories into your own, personal digital labs, it creates a new work directory mounted inside of your main Jupyter directory. This new directory is independent of the Docker image used to run Jupyter, so it’s a handy place to copy things if you ever want to update the Docker image. However, I just realised that there was a permissions problem with the work directory which meant you couldn’t write files to it from within Jupyter.
Here’s a new little Python package that you might find useful. It simply takes a search url from Trove’s Newspapers & Gazettes category and converts it into a set of parameters that you can use to request data from the Trove API. While some parameters are used both in the web interface and the API, there are a lot of variations – this package means you don’t have to keep track of all the differences!
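The general idea can be sketched with the standard library alone. This is an illustration, not the package's own code. The `WEB_TO_API` mapping is a hypothetical and deliberately tiny stand-in for the many variations the package handles, and the `keyword` to `q` translation and the example url are assumptions.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical, incomplete mapping from web interface parameter names
# to API parameter names; the real differences are more extensive.
WEB_TO_API = {
    "keyword": "q",
}


def web_url_to_api_params(url):
    """Convert a newspaper search url into a dictionary of API-style parameters."""
    query = parse_qs(urlparse(url).query)
    params = {}
    for name, values in query.items():
        # Parameters without a mapping are assumed to be shared and passed through
        api_name = WEB_TO_API.get(name, name)
        # Keep single values as strings and repeated values as lists
        params[api_name] = values[0] if len(values) == 1 else values
    return params


example_url = (
    "https://trove.nla.gov.au/search/category/newspapers"
    "?keyword=wragge&l-state=Queensland&l-decade=190"
)
api_params = web_url_to_api_params(example_url)
print(api_params)
```

The resulting dictionary could then be passed as query parameters to an API request, along with an API key.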