<rss version="2.0">
  <channel>
    <title>glamworkbench on Tim Sherratt</title>
    <link>https://updates.timsherratt.org/categories/glamworkbench/</link>
    <description></description>
    
    <language>en</language>
    
    <lastBuildDate>Mon, 16 Mar 2026 15:14:17 +1100</lastBuildDate>
    
    <item>
      <title>Generosity in practice – a chat with Paula Bray at the State Library of Victoria</title>
      <link>https://updates.timsherratt.org/2026/03/16/generosity-in-practice-a-chat.html</link>
      <pubDate>Mon, 16 Mar 2026 15:14:17 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2026/03/16/generosity-in-practice-a-chat.html</guid>
      <description>&lt;p&gt;While I was in Melbourne during my time as &lt;a href=&#34;https://lab.slv.vic.gov.au/team/tim-sherratt&#34;&gt;Creative Technologist-in-Residence at the State Library of Victoria LAB&lt;/a&gt;, I had a conversation with Paula Bray for the LAB&amp;rsquo;s podcast series. Paula is the SLV&amp;rsquo;s Chief Digital Officer, and has long championed the importance of digital innovation in the GLAM sector. It was fun to chat about stuff that I&amp;rsquo;ve been doing for the last 30 years, and why openness and generosity are important in working with GLAM collections. You can &lt;a href=&#34;https://lab.slv.vic.gov.au/experiments/my-place-tim-sherratt/interview&#34;&gt;listen to our conversation on the LAB site&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Zotero translator for Libraries Tasmania updated!</title>
      <link>https://updates.timsherratt.org/2026/03/10/zotero-translator-for-libraries-tasmania.html</link>
      <pubDate>Tue, 10 Mar 2026 10:31:36 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2026/03/10/zotero-translator-for-libraries-tasmania.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://www.zotero.org&#34;&gt;Zotero&lt;/a&gt; translator for &lt;a href=&#34;https://libraries.tas.gov.au&#34;&gt;Libraries Tasmania&lt;/a&gt; has been updated, fixing a problem with attaching images of digitised resources. The fix is in the main Zotero repository now, so it should find its way to your computer automatically.&lt;/p&gt;
&lt;p&gt;I created the first version of the Libraries Tasmania translator back in 2022 – &lt;a href=&#34;https://updates.timsherratt.org/2022/07/14/calling-all-tasmanian.html&#34;&gt;this post describes what it does&lt;/a&gt;. It works across all three sections of the catalogue, including the archives and the names index. The translator captures metadata, PDFs, and images from records, including things like digitised pages from convict records. This makes it easy for researchers to assemble their own datasets of Tasmanian records in Zotero, where they can add notes and annotations, or share with colleagues.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/zotero-librariestas.png&#34; width=&#34;600&#34; height=&#34;382&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Capture images and metadata from the Libraries Tasmania catalogue using Zotero&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;The update was necessary because Libraries Tasmania changed the way some digitised resources were displayed and downloaded. Keeping Zotero translators working across system updates can take a bit of work! I also took the opportunity to update the code to meet current Zotero guidelines and clean up a few lingering problems. If you notice any oddities, please let me know.&lt;/p&gt;
&lt;p&gt;There are now at least &lt;a href=&#34;https://updates.timsherratt.org/2024/08/22/new-zotero-translators.html#zotero-and-australian-glams&#34;&gt;8 custom translators&lt;/a&gt; to help you work with Australian GLAM collections.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>my place – exploring SLV collections through a street address</title>
      <link>https://updates.timsherratt.org/2026/02/02/my-place-exploring-slv-collections.html</link>
      <pubDate>Mon, 02 Feb 2026 22:43:20 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2026/02/02/my-place-exploring-slv-collections.html</guid>
      <description>&lt;p&gt;&lt;em&gt;&amp;lsquo;What can I find out about my house?&amp;rsquo;&lt;/em&gt; My work as &lt;a href=&#34;https://lab.slv.vic.gov.au/experiments/my-place-tim-sherratt&#34;&gt;Creative Technologist-in-Residence at the SLV LAB&lt;/a&gt; was inspired by questions like this that librarians at the SLV hear every day. I wanted to explore how the Library&amp;rsquo;s place-based collections could be used to provide new entry points for discovery and navigation – entry points based not on words, but locations.&lt;/p&gt;
&lt;p&gt;At the end of my residency, I pulled all the different collections I&amp;rsquo;d been working with into a single interface – &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt;. It&amp;rsquo;s not polished or complete, but I think it&amp;rsquo;s a useful starting point to think about the possibilities. You just type in an address, street name, or place name and my place shows you maps, photos, newspapers, and even extracts from the Sands &amp;amp; MacDougall directories. &lt;strong&gt;&lt;a href=&#34;https://slv.wraggelabs.com/myplace/&#34;&gt;Try it now!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/screenshot-from-2026-02-02-17-44-07.png&#34; width=&#34;600&#34; height=&#34;433&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;&lt;a href=&#34;https://slv.wraggelabs.com/myplace/&#34;&gt;Try &lt;b&gt;&lt;i&gt;my place!&lt;/i&gt;&lt;/b&gt;&lt;/a&gt; Just enter an address in the search box.&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Search results in my place are bookmarkable. So save and share your discoveries!&lt;/p&gt;
&lt;h2 id=&#34;the-collections&#34;&gt;The collections&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; draws its data from a number of different place-based collections that I&amp;rsquo;ve been working on during my residency.&lt;/p&gt;
&lt;h3 id=&#34;openstreetmap&#34;&gt;OpenStreetMap&lt;/h3&gt;
&lt;p&gt;When you enter an address in the search box, &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; looks it up in &lt;a href=&#34;https://www.openstreetmap.org&#34;&gt;OpenStreetMap&lt;/a&gt; to get its geospatial coordinates. It then places a marker and re-centres the map at the top of the app.&lt;/p&gt;
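The lookup step can be sketched in a few lines of Python. This is a minimal sketch, assuming the app geocodes through Nominatim (OpenStreetMap's search service, which is mentioned later in this post); the parameter choices here are illustrative, not my place's actual configuration.

```python
import urllib.parse

# Build a Nominatim search URL for a free-text address. Requesting
# format=json returns a list of candidate matches, each with lat/lon fields.
def nominatim_search_url(address):
    params = {"q": address, "format": "json", "countrycodes": "au", "limit": 5}
    return "https://nominatim.openstreetmap.org/search?" + urllib.parse.urlencode(params)

url = nominatim_search_url("149 Brunswick Street, Fitzroy")
# fetch with urllib.request.urlopen(url) and read lat/lon from the first result
```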
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/mp.png&#34; width=&#34;600&#34; height=&#34;219&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Map centred on 149 Brunswick Street, Fitzroy&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;OpenStreetMap is also used to retrieve additional information about the suburb, including its boundaries.&lt;/p&gt;
&lt;h3 id=&#34;sands--macdougalls-directories&#34;&gt;Sands &amp;amp; MacDougall&amp;rsquo;s directories&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; queries the &lt;a href=&#34;https://updates.timsherratt.org/2025/11/12/a-new-way-of-searching.html&#34;&gt;full-text searchable version of Sands &amp;amp; Mac&lt;/a&gt; for addresses. Results will vary based on the OCR quality and the nature of the query, but it can give you a potted history of who has lived in your house. The search results are displayed in chronological order, and include an &lt;a href=&#34;https://updates.timsherratt.org/2025/11/16/some-sands-mac-tweaks-thanks.html&#34;&gt;image snippet&lt;/a&gt; showing the actual printed entry as well as the text content and metadata.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/mp-sandm.png&#34; width=&#34;600&#34; height=&#34;349&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Occupants of 149 Brunswick Street, Fitzroy from 1875 to 1925&lt;/figcaption&gt;&lt;/figure&gt;
&lt;h3 id=&#34;committee-for-urban-action-photographs&#34;&gt;Committee for Urban Action photographs&lt;/h3&gt;
&lt;p&gt;If you enter a full street address, &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; will search &lt;a href=&#34;https://updates.timsherratt.org/2026/01/29/geolocating-photos-from-the-slvs.html&#34;&gt;the CUA collection&lt;/a&gt; for photos associated with the segment of road that includes the current address. It then displays the individual images from any matching photosets.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/mp-cua.png&#34; width=&#34;600&#34; height=&#34;432&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Photographs from CUA of the currently selected road&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Otherwise &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; will look for CUA photos that are near the current location, and display a randomly-selected image from each photoset.&lt;/p&gt;
&lt;h3 id=&#34;georeferenced-maps&#34;&gt;Georeferenced maps&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; searches through &lt;a href=&#34;https://updates.timsherratt.org/2025/11/04/turning-the-slvs-maps-into.html&#34;&gt;digitised maps from the SLV collection that have been georeferenced by the public&lt;/a&gt;. It finds maps that either intersect with the currently selected location, or are nearby.&lt;/p&gt;
&lt;p&gt;If you enter a full street address, the first 6 georeferenced maps will be positioned on a modern basemap with a marker indicating the currently selected point. This means you can see your address on a historical map. The number of georeferenced maps that can be displayed in this way is determined by the browser – so I&amp;rsquo;ve limited it to 6 to be safe.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/mp-geo.png&#34; width=&#34;600&#34; height=&#34;296&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Georeferenced maps positioned on a modern basemap, showing the location of the currently selected address&lt;/figcaption&gt;&lt;/figure&gt;
&lt;h3 id=&#34;parish-maps&#34;&gt;Parish maps&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; searches through &lt;a href=&#34;https://updates.timsherratt.org/2025/10/06/creating-bounding-boxes-for-parish.html&#34;&gt;parish maps in the SLV collection that have geospatial coordinates or approximate bounding boxes&lt;/a&gt;. It finds maps that either intersect with the currently selected location, or are nearby.&lt;/p&gt;
&lt;h3 id=&#34;newspapers&#34;&gt;Newspapers&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; searches through &lt;a href=&#34;https://updates.timsherratt.org/2025/12/16/exploring-victorian-newspapers.html&#34;&gt;my dataset of newspapers in the SLV collection&lt;/a&gt; that have a place of publication documented in the &amp;lsquo;Place newspaper published&amp;rsquo; metadata field. It finds newspapers that are either associated with the current suburb/town, or a nearby suburb/town. This includes digitised and non-digitised titles. Digitised titles include a link to Trove.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/mp-newspapers.png&#34; width=&#34;600&#34; height=&#34;293&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Newspapers from the SLV collection published in Fitzroy&lt;/figcaption&gt;&lt;/figure&gt;
&lt;h3 id=&#34;photographs&#34;&gt;Photographs&lt;/h3&gt;
&lt;p&gt;I thought it would be cool to include a few photographs of the current suburb or town. To do this, I downloaded a list of place names from VicNames, then used the place names to &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/download_place_images.ipynb&#34;&gt;search the SLV catalogue for photographs with relevant subject headings&lt;/a&gt;. A random selection of the harvested images is displayed in &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/mp-images.png&#34; width=&#34;600&#34; height=&#34;234&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;A few images of Fitzroy, displayed alongside a map of Fitzroy&#39;s current boundaries using data from OpenStreetMap&lt;/figcaption&gt;&lt;/figure&gt;
&lt;h2 id=&#34;the-interface&#34;&gt;The interface&lt;/h2&gt;
&lt;p&gt;The interface is pretty simple. You type an address in the box and hit enter. If the geocoding process finds multiple matches, it&amp;rsquo;ll give you a list to choose from. Once the location is found, a marker is added and the main map re-centres. Then related resources are displayed below the map.&lt;/p&gt;
&lt;p&gt;As you scroll down through the results you gradually zoom out from your initial starting point. This is reflected in the four bands or layers used to group resources: &amp;lsquo;my house&amp;rsquo;, &amp;lsquo;my street&amp;rsquo;, &amp;lsquo;my suburb&amp;rsquo;, and &amp;lsquo;nearby&amp;rsquo;. Each band contains a mix of resources from different collections.&lt;/p&gt;
&lt;p&gt;When I started working on &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt;, I was thinking about a project from around 2010 called &lt;a href=&#34;https://wraggelabs.com/info/history-wall/&#34;&gt;The History Wall&lt;/a&gt;. Like &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt;, The History Wall pulled many different types of resources together into a rich exploratory interface. As you scrolled through The History Wall you moved through time, with randomly selected items appearing from a range of sources including Trove newspapers, the ADB, and museum collections.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/history-wall.jpg&#34; width=&#34;600&#34; height=&#34;505&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;A version of The History Wall created for the National Museum of Australia&#39;s &#39;Irish in Australia&#39; exhibition&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;I originally thought I&amp;rsquo;d inject some of the same randomness into &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt;, but I was worried it might just get too confusing. I thought it was important to keep the relationship between the starting point and the resources in focus even as you zoomed out. So my visual metaphor shifted to something more like a blast radius map, or a stratigraphic diagram, that displayed distinct groups and layers as you moved beyond the baseline. My limited CSS skills couldn&amp;rsquo;t make the vision in my head a reality, but there are lots of headings and colours instead to highlight the transitions!&lt;/p&gt;
&lt;p&gt;The actual mix of groups and layers displayed depends on the nature of your query. If you&amp;rsquo;ve entered a complete street address, and there are results for that address in Sands &amp;amp; Mac, then you&amp;rsquo;ll see &amp;lsquo;my house&amp;rsquo;, &amp;lsquo;my suburb&amp;rsquo;, and &amp;lsquo;nearby&amp;rsquo;. If you&amp;rsquo;ve only entered a suburb or town, or your street address can&amp;rsquo;t be found, you&amp;rsquo;ll see two layers starting with &amp;lsquo;my suburb&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s an overview of what you might expect to see.&lt;/p&gt;
&lt;h3 id=&#34;my-house&#34;&gt;my house&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Sands &amp;amp; MacDougall extracts (text search on full address)&lt;/li&gt;
&lt;li&gt;georeferenced maps (search for maps that contain the base point)&lt;/li&gt;
&lt;li&gt;parish maps (search for maps that contain the base point)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;my-street&#34;&gt;my street&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;CUA photos (search for matching street identifiers)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;if there&amp;rsquo;s no &amp;lsquo;my house&amp;rsquo; layer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sands &amp;amp; MacDougall extracts (text search on street name and suburb)&lt;/li&gt;
&lt;li&gt;georeferenced maps (search for intersections between maps and street)&lt;/li&gt;
&lt;li&gt;parish maps (search for intersections between maps and street)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;my-suburbtown&#34;&gt;my suburb/town&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;suburb boundaries from OSM&lt;/li&gt;
&lt;li&gt;images (search for suburb name in metadata)&lt;/li&gt;
&lt;li&gt;newspapers (search for suburb name in metadata)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;if there&amp;rsquo;s no &amp;lsquo;my house&amp;rsquo; or &amp;lsquo;my street&amp;rsquo; layer:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;georeferenced maps (search for intersections between maps and suburb boundaries)&lt;/li&gt;
&lt;li&gt;parish maps (search for intersections between maps and suburb boundaries)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;nearby&#34;&gt;nearby&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;CUA photos (search for photosets within 5km of the base point, filtered to remove current street)&lt;/li&gt;
&lt;li&gt;georeferenced maps (search for maps within 10km of base point, ordered by distance, max of 24 displayed)&lt;/li&gt;
&lt;li&gt;parish maps (search for maps within 10km of base point, ordered by distance, max of 24 displayed)&lt;/li&gt;
&lt;li&gt;newspapers (search for newspapers within 100km of base point, ordered by distance, max of 24 displayed)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;the-data&#34;&gt;The data&lt;/h2&gt;
&lt;p&gt;Most of the data used in &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; is stored in two SQLite databases – &lt;a href=&#34;https://glam-workbench.net/state-library-victoria/sands-macdougall-directories/&#34;&gt;one for Sands &amp;amp; Mac&lt;/a&gt;, and &lt;a href=&#34;https://slv-places-481615284700.australia-southeast1.run.app/&#34;&gt;the other for CUA, georeferenced maps, parish maps, and newspapers&lt;/a&gt;. The metadata for the collection images is stored in &lt;a href=&#34;https://raw.githubusercontent.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/refs/heads/main/place_images.json&#34;&gt;a JSON file&lt;/a&gt; that is directly loaded by the interface.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve published the SQLite databases online using &lt;a href=&#34;https://datasette.io&#34;&gt;Datasette&lt;/a&gt; and &lt;a href=&#34;https://www.gaia-gis.it/fossil/libspatialite/index&#34;&gt;SpatiaLite&lt;/a&gt;. SpatiaLite makes it possible to find geospatial features that intersect, or are near, a given point. For example, you could find maps that include a specific set of coordinates.&lt;/p&gt;
&lt;p&gt;Datasette has the ability to create &lt;a href=&#34;https://docs.datasette.io/en/stable/sql_queries.html#canned-queries&#34;&gt;&amp;lsquo;canned queries&amp;rsquo;&lt;/a&gt; that feed url parameters into pre-defined SQL queries. This coupled with Datasette&amp;rsquo;s &lt;a href=&#34;https://docs.datasette.io/en/stable/json_api.html&#34;&gt;built-in JSON API&lt;/a&gt; makes it possible to construct query urls in &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; and use them to retrieve JSON results data from my databases.&lt;/p&gt;
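Constructing one of these query urls is just string assembly. Here's a small sketch based on the example url quoted in this post – the endpoint, path, and parameter names are copied from that example, so treat them as illustrative rather than a stable API.

```python
import urllib.parse

BASE = "https://slv-places-481615284700.australia-southeast1.run.app"

# Build a canned-query URL that finds georeferenced maps near a point.
# _shape=array asks Datasette's JSON API for a plain list of row objects.
def maps_near_point(lon, lat, distance=10000):
    params = {"wkt": f"POINT({lon} {lat})", "distance": distance, "_shape": "array"}
    return f"{BASE}/georeferenced_maps/maps_from_wkt.json?" + urllib.parse.urlencode(params)

url = maps_near_point(144.977468, -37.803143)
```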
&lt;p&gt;When you enter an address in &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt;, multiple queries are fired off to find intersecting or nearby resources. For example, this url finds georeferenced maps within 10km of a point at the centre of Fitzroy: &lt;a href=&#34;https://slv-places-481615284700.australia-southeast1.run.app/georeferenced_maps/maps_from_wkt.json?wkt=POINT(144.977468%20-37.803143)&#34;&gt;slv-places-481615284700.australia-southeast1.run.app/georefere&amp;hellip;&lt;/a&gt;&amp;amp;distance=10000&amp;amp;_shape=array.&lt;/p&gt;
&lt;p&gt;In the case of Sands &amp;amp; Mac, &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; uses a canned query that runs a full-text search across the OCRd content of a volume. Suburb names are often abbreviated in Sands &amp;amp; Mac, so the app first runs a query to find possible abbreviations, then adds them into the main query to inject a bit of fuzziness. This is repeated for all 24 digitised volumes.&lt;/p&gt;
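The fuzziness step might look something like this sketch. The helper name, query syntax, and abbreviation list are all hypothetical – the real canned query runs inside SQLite's full-text engine – but it shows the general idea of OR-ing suburb variants into the search.

```python
# Hypothetical sketch: expand the suburb into its possible abbreviations and
# combine them into a single full-text search expression.
def build_fts_query(address, suburb, abbreviations):
    variants = " OR ".join(f'"{s}"' for s in [suburb, *abbreviations])
    return f'"{address}" AND ({variants})'

query = build_fts_query("149 Brunswick Street", "Fitzroy", ["Fitz"])
```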
&lt;p&gt;Once the metadata is retrieved from the databases, images are loaded from the SLV&amp;rsquo;s IIIF service.&lt;/p&gt;
&lt;h2 id=&#34;next-steps&#34;&gt;Next steps?&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m not sure how much more work I&amp;rsquo;ll do on &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt;, but there are a few things I&amp;rsquo;d like to try. In particular, I&amp;rsquo;d like to help the user understand more about what data is being presented, or not presented, and why. Not all digitised maps have been georeferenced, not all parish maps have coordinates, street numbers have changed, and the OCR in Sands &amp;amp; Mac varies in quality. &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; can only present a sample – a gesture towards the wealth of material available from the SLV. I feel that message needs to be made more explicit, though I&amp;rsquo;m not sure how without overloading the interface.&lt;/p&gt;
&lt;p&gt;There are additional data sources I&amp;rsquo;d like to play around with. &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; already includes some code to query &lt;a href=&#34;https://www.wikidata.org/&#34;&gt;Wikidata&lt;/a&gt; for more information about a suburb. But I haven&amp;rsquo;t had a chance to do anything with it. I&amp;rsquo;d like to be able to provide additional contextual information from outside the SLV, such as electoral boundaries, populations, even election results. It would also be fun to display sightings of plants and animals from the &lt;a href=&#34;https://www.ala.org.au&#34;&gt;Atlas of Living Australia&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What can I find out about my house? It would be great if &lt;em&gt;&lt;strong&gt;my place&lt;/strong&gt;&lt;/em&gt; could answer that question by taking the user on an open-ended journey through our cultural, historical, and environmental landscape.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Geolocating photos from the SLV&#39;s Committee for Urban Action collection</title>
      <link>https://updates.timsherratt.org/2026/01/29/geolocating-photos-from-the-slvs.html</link>
      <pubDate>Thu, 29 Jan 2026 18:08:06 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2026/01/29/geolocating-photos-from-the-slvs.html</guid>
      <description>&lt;p&gt;Concerned about the loss of built heritage in the 1970s, the Committee for Urban Action photographed streetscapes across urban and regional Victoria. They compiled a remarkable collection of photographs that is &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/collectionDiscovery?vid=61SLV_INST:SLV&amp;amp;collectionId=81271917420007636&#34;&gt;now being digitised by the State Library of Victoria&lt;/a&gt;. More than 20,000 images are already available online!&lt;/p&gt;
&lt;p&gt;The CUA worked systematically, capturing photos street by street, and recording the locations of each set of photographs. This information is used to prepare the title attached to each photo as it&amp;rsquo;s uploaded to the SLV catalogue. In general, titles include the name of the road where the photo was taken, the name of the suburb or town, and the names of two intersecting roads that define the boundaries of the current road segment. They can also tell you which side of the road the photo was taken on. For example, the title &lt;code&gt;Gore Street, Fitzroy, from Gertrude Street to Webb Street - east side&lt;/code&gt; tells us the photo was taken on the east side of Gore Street, Fitzroy between the intersections with Gertrude Street and Webb Street.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/cua-slv-viewer.png&#34; width=&#34;600&#34; height=&#34;547&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Photos from Gore Street, Fitzroy &lt;a href=&#34;https://viewer.slv.vic.gov.au/?entity=IE7489506&amp;mode=browse&#34;&gt;displayed in the SLV image viewer&lt;/a&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;It&amp;rsquo;s great to have this sort of structured information linking photos to specific locations, but to navigate through the collection &lt;em&gt;in space&lt;/em&gt; we need more. We need to link each photo to a set of geospatial coordinates by mapping each road segment. That was the challenge I took on as part of &lt;a href=&#34;https://lab.slv.vic.gov.au/experiments/my-place-tim-sherratt&#34;&gt;my residency in the SLV LAB&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When I started working on the collection I wasn&amp;rsquo;t really sure what was possible. I had to learn a lot, and ended up revising my processes multiple times as I got deeper into the data. But my aim was always to create some sort of map-based interface that would allow users to click on a street and see any associated CUA photos. It&amp;rsquo;s still a bit buggy and incomplete, but here it is – &lt;a href=&#34;https://slv.wraggelabs.com/cua/&#34;&gt;&lt;strong&gt;explore the CUA collection street by street&lt;/strong&gt;&lt;/a&gt;!&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/cua-browser.png&#34; width=&#34;600&#34; height=&#34;513&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Gore Street, Fitzroy &lt;a href=&#34;https://slv.wraggelabs.com/cua/?photoset=gore-street-fitzroy-gertrude-street-webb-street&#34;&gt;in the new CUA Browser&lt;/a&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;h2 id=&#34;the-process&#34;&gt;The process&lt;/h2&gt;
&lt;p&gt;My basic plan was to find the intersections using &lt;a href=&#34;https://www.openstreetmap.org/&#34;&gt;OpenStreetMap&lt;/a&gt;, then extract geospatial information about the segment of road between the two intersections. This involved much trial and error, but eventually I ended up with a process that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;parsed each item title to try and extract the names of the main road, the suburb, and the two intersecting roads&lt;/li&gt;
&lt;li&gt;queried &lt;a href=&#34;https://nominatim.org&#34;&gt;Nominatim&lt;/a&gt; for the suburb bounding box&lt;/li&gt;
&lt;li&gt;for each intersecting road, queried OSM to find a node at, or around, its intersection with the main road, within the suburb bounding box&lt;/li&gt;
&lt;li&gt;created a new bounding box from the coordinates of the two intersections&lt;/li&gt;
&lt;li&gt;queried OSM for the main road within this bounding box&lt;/li&gt;
&lt;li&gt;extracted the coordinates of the main road segment, removing any points outside of the bounding box&lt;/li&gt;
&lt;/ul&gt;
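The first step above – pulling the road, suburb, and intersecting roads out of a title – can be sketched with a regular expression. This pattern only handles the common title form quoted earlier (&amp;lsquo;Gore Street, Fitzroy, from Gertrude Street to Webb Street - east side&amp;rsquo;); as noted below, real titles vary, so the actual parser has to be much more forgiving.

```python
import re

# Parse a CUA photo title of the form:
#   "<road>, <suburb>, from <cross1> to <cross2> - <side>"
# The trailing "- east side" part is optional.
TITLE = re.compile(
    r"^(?P<road>[^,]+), (?P<suburb>[^,]+), "
    r"from (?P<cross1>.+?) to (?P<cross2>.+?)(?: - (?P<side>\w+ side))?$"
)

m = TITLE.match("Gore Street, Fitzroy, from Gertrude Street to Webb Street - east side")
```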
&lt;p&gt;There are more details below and in these notebooks: &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/cua_finding_intersections.ipynb&#34;&gt;cua_finding_intersections.ipynb&lt;/a&gt; and &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/cua_data_processing.ipynb&#34;&gt;cua_data_processing.ipynb&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;finding-intersections&#34;&gt;Finding intersections&lt;/h2&gt;
&lt;p&gt;As described, the title of each photograph generally includes 4 pieces of information: the road, suburb, intersecting roads, and side. My plan was to find the intersections first to get the limits of the road segment. This is possible thanks to the awesome &lt;a href=&#34;https://www.openstreetmap.org/&#34;&gt;OpenStreetMap&lt;/a&gt; and its &lt;a href=&#34;https://wiki.openstreetmap.org/wiki/Overpass_API&#34;&gt;Overpass API&lt;/a&gt;. It took me a while to get my head around the Overpass query language, but there are lots of &lt;a href=&#34;https://wiki.openstreetmap.org/wiki/Overpass_API/Overpass_API_by_Example&#34;&gt;useful examples online&lt;/a&gt;. The query to find the intersection between Gore Street and Gertrude Street in Fitzroy looks like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[bbox:-37.8089071,144.9732006,-37.7929130,144.9851430];
way[&#39;highway&#39;][name=&amp;quot;Gore Street&amp;quot;];
node(w)-&amp;gt;.n1;
way[&#39;highway&#39;][name=&amp;quot;Gertrude Street&amp;quot;];
node(w)-&amp;gt;.n2;
node.n1.n2; 
out body;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can &lt;a href=&#34;https://overpass-turbo.eu/s/2jq8&#34;&gt;try it out&lt;/a&gt; using Overpass Turbo&amp;rsquo;s web interface.&lt;/p&gt;
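You can also run the same query programmatically. The sketch below sends it to overpass-api.de, one of several public Overpass instances, with an added &lt;code&gt;[out:json]&lt;/code&gt; setting so the response comes back as JSON rather than the default XML.

```python
import urllib.parse

# The intersection query from above, with [out:json] prepended so the
# Overpass API returns JSON instead of XML.
QUERY = """
[out:json][bbox:-37.8089071,144.9732006,-37.7929130,144.9851430];
way['highway'][name="Gore Street"];
node(w)->.n1;
way['highway'][name="Gertrude Street"];
node(w)->.n2;
node.n1.n2;
out body;
"""

url = "https://overpass-api.de/api/interpreter?" + urllib.parse.urlencode({"data": QUERY})
# import urllib.request; print(urllib.request.urlopen(url).read())  # runs the query
```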
&lt;p&gt;In OpenStreetMap, linear features, such as roads or rivers, are represented as &lt;a href=&#34;https://wiki.openstreetmap.org/wiki/Way&#34;&gt;&lt;code&gt;ways&lt;/code&gt;&lt;/a&gt;. Each way is made up of a series of &lt;code&gt;nodes&lt;/code&gt; or points with geospatial coordinates. Every way and node has its own unique identifier. Tags can be added to features to describe what type of things they are.&lt;/p&gt;
&lt;p&gt;The query above looks for &lt;code&gt;ways&lt;/code&gt; named &amp;lsquo;Gore Street&amp;rsquo; and &amp;lsquo;Gertrude Street&amp;rsquo; that are tagged as &lt;code&gt;highway&lt;/code&gt; (a &lt;code&gt;highway&lt;/code&gt; in OpenStreetMap is any road-like feature, including things like bike paths and foot trails).&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;way[&#39;highway&#39;][name=&amp;quot;Gore Street&amp;quot;];
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It then extracts the nodes that make up each way and looks to see if there are any nodes in common between the two ways. A node shared between two ways indicates an intersection.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;node(w)-&amp;gt;.n2;
node.n1.n2; 
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The query is limited using a bounding box that encloses the suburb of Fitzroy. This avoids false positives and keeps down the query load.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[bbox:-37.8089071,144.9732006,-37.7929130,144.9851430];
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The JSON result of this query gives us the latitude and longitude of the node at the intersection of the two roads.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;{
  &amp;quot;version&amp;quot;: 0.6,
  &amp;quot;generator&amp;quot;: &amp;quot;Overpass API 0.7.62.10 2d4cfc48&amp;quot;,
  &amp;quot;osm3s&amp;quot;: {
    &amp;quot;timestamp_osm_base&amp;quot;: &amp;quot;2026-01-27T03:11:45Z&amp;quot;,
    &amp;quot;copyright&amp;quot;: &amp;quot;The data included in this document is from www.openstreetmap.org. The data is made available under ODbL.&amp;quot;
  },
  &amp;quot;elements&amp;quot;: [

{
  &amp;quot;type&amp;quot;: &amp;quot;node&amp;quot;,
  &amp;quot;id&amp;quot;: 224750459,
  &amp;quot;lat&amp;quot;: -37.8062302,
  &amp;quot;lon&amp;quot;: 144.9817848
}

  ]
}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;After a bit of testing, I found this worked pretty well, except for roundabouts&amp;hellip; In OpenStreetMap, roads don&amp;rsquo;t actually cross roundabouts – they end on one side, then begin anew on the other side. In cases like this, looking for shared nodes doesn&amp;rsquo;t work. Instead you have to look to see if the two roads have nodes that are less than a given distance apart. The query is similar to the one above, but uses &lt;code&gt;around&lt;/code&gt; when comparing the nodes. In this case I&amp;rsquo;m looking for nodes that are within 20 metres of each other.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;node(w.w2)(around.w1:20);
&lt;/code&gt;&lt;/pre&gt;&lt;h2 id=&#34;finding-road-segments&#34;&gt;Finding road segments&lt;/h2&gt;
&lt;p&gt;Once I had the coordinates of the two intersections, I could look for the segment of road between them. To do this I created a bounding box using the coordinates of the intersections, and then searched for ways by name within that defined area.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s important to note that there&amp;rsquo;s no one-to-one correspondence between roads and OSM ways. A single road might be represented in OSM as a series of separate, but connected, ways. For example, at a roundabout, or where a road divides, new ways might have been created to document the change. This means that when we query OSM for details of a road we often get back information about multiple ways. Some of these might be things like bike paths which we can filter using tags, but often they&amp;rsquo;ll be sections of the road that we want. For example, this query for Gore Street, within the bounds of its intersections with Gertrude Street and Webb Street, returns details of two ways.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;way[&amp;quot;highway&amp;quot;~&amp;quot;^(trunk|primary|secondary|tertiary|unclassified|residential|service|track|pedestrian|living_street)$&amp;quot;][name=&amp;quot;Gore Street&amp;quot;](-37.8062302,144.98128480000003,-37.8040076,144.9826827);
out body;
&amp;gt;;
out body;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can &lt;a href=&#34;https://overpass-turbo.eu/s/2jqi&#34;&gt;view the result&lt;/a&gt; in Overpass Turbo.&lt;/p&gt;
&lt;p&gt;However, that doesn&amp;rsquo;t mean that the full extent of both ways is contained within the bounding box, just that some of the nodes of both ways are inside. Because of this, I filtered the results from all the ways and only kept nodes whose coordinates were within the desired region.&lt;/p&gt;
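As a rough sketch of these two steps (illustrative Python, not the actual processing code): build the bounding box from the two intersection points, then keep only the nodes whose coordinates fall inside it.

```python
def bounding_box(p1, p2):
    """Build a (south, west, north, east) box from two (lat, lon) intersection points."""
    lats, lons = (p1[0], p2[0]), (p1[1], p2[1])
    return (min(lats), min(lons), max(lats), max(lons))

def nodes_in_box(nodes, box):
    """Keep only nodes whose coordinates fall inside the box --
    a way returned by the query may extend well beyond it."""
    south, west, north, east = box
    return [
        n for n in nodes
        if south <= n["lat"] <= north and west <= n["lon"] <= east
    ]
```

The `(south, west, north, east)` ordering matches the bbox format used in the Overpass query above.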
&lt;h2 id=&#34;problems-finding-intersections&#34;&gt;Problems finding intersections&lt;/h2&gt;
&lt;p&gt;The method described above works pretty well, and once I understood enough about the Overpass API to extract actual paths that I could display on a map, I fed all of the CUA photos through a script and got useful data for more than 80% of them.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/screenshot-from-2025-09-27-17-34-36.png&#34; width=&#34;600&#34; height=&#34;652&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;One of my early tests.&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Then I spent a &lt;em&gt;lot&lt;/em&gt; of time trying to work out why the remainder were failing.&lt;/p&gt;
&lt;p&gt;Some of them failed because the titles were missing information, or were formatted in a way I didn&amp;rsquo;t expect. For example, instead of a second intersecting road, some titles just said &amp;lsquo;to end&amp;rsquo;. This makes perfect sense to a human looking at a map, but it&amp;rsquo;s difficult to handle programmatically.&lt;/p&gt;
&lt;p&gt;Some photos either recorded the wrong suburb, or the boundaries of the suburb had moved since the photos were taken. For example, many of the photos described as being from Eaglehawk are now in California Gully.&lt;/p&gt;
&lt;p&gt;Similarly, some road names were wrong either because of documentation errors, or because the names have changed over time. There are also some variations in the way OSM records road names – in particular, I found that roads with hyphenated names sometimes had spaces around the hyphen and sometimes didn&amp;rsquo;t. There were also a couple of cases where names weren&amp;rsquo;t attached to the corresponding road segment in OSM, but I was able to edit these in OSM directly.&lt;/p&gt;
&lt;p&gt;Other roads had multiple names, or changed names along their path. I mean, what&amp;rsquo;s going on with Brunswick Street and St Georges Road in Fitzroy? Country towns seemed most prone to this – a highway might become &amp;lsquo;Main Road&amp;rsquo; within the town boundaries, or the order of hyphenated places in road names might change. I found one road in Clunes that had four different names within the space of a few hundred metres.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/clunes.png&#34; width=&#34;582&#34; height=&#34;1002&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;One road, four names!&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Finally, the routes of some roads had changed – intersections no longer intersected, roads were closed, or new parks had popped up to split a road in two.&lt;/p&gt;
&lt;p&gt;My processing script logged the titles I couldn&amp;rsquo;t locate and I worked through the list manually, trying to identify what each problem was. I suppose there are two ways I could&amp;rsquo;ve handled these problems – building more fuzziness into the process to check for things like alternative names, or compiling a list of &amp;lsquo;corrected&amp;rsquo; titles. I started off using the first approach, but as I worked through more and more anomalies, the checking logic became very complicated and inefficient. Just think about the knots you can tie yourself in trying to handle a title where the suburb is wrong and the main road changes names in between intersections.&lt;/p&gt;
&lt;p&gt;I refactored the code multiple times, but it&amp;rsquo;s still pretty messy. In the end I created a list of &amp;lsquo;corrected&amp;rsquo; titles as well, so it was a bit of a hybrid approach. I suspect I could have saved myself a lot of pain if I&amp;rsquo;d reversed the process – compiling &amp;lsquo;corrected&amp;rsquo; titles first, then adapting the logic as patterns emerged.&lt;/p&gt;
&lt;p&gt;There are still some photos I haven&amp;rsquo;t located. In some cases I just don&amp;rsquo;t have enough information. In others I need to manually record coordinates or way ids to feed into the process, and I haven&amp;rsquo;t worked out the best way to do this yet. You can see the titles that I haven&amp;rsquo;t geolocated yet in the files: &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/cua-not-found.txt&#34;&gt;&lt;code&gt;cua-not-found.txt&lt;/code&gt;&lt;/a&gt; and &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/cua-not-parsed.txt&#34;&gt;&lt;code&gt;cua-not-parsed.txt&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In total, 18,603 out of 20,644 photos have been geolocated. That&amp;rsquo;s over 90%!&lt;/p&gt;
&lt;h2 id=&#34;assembling-the-data&#34;&gt;Assembling the data&lt;/h2&gt;
&lt;p&gt;I processed the data in a couple of phases to get it in the shape I wanted.&lt;/p&gt;
&lt;p&gt;The first step was to group all the photos by title, so I could link each group to its location. But remember that titles often record which &lt;em&gt;side&lt;/em&gt; of the road a photo was taken on. To bring all sides of a road segment together into a single group, I created a key from a normalised/slugified version of the title with the side value removed. I used this key to save information about each side within the same group.&lt;/p&gt;
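A much-simplified sketch of that key-making step might look like the Python below. It's illustrative only: the stopword list is an assumption, and the real script also has to reconcile titles where the cross streets appear in reversed order (as in the east/west example that follows).

```python
import re

# Strip a trailing '- east side.' style suffix (hyphen or en-dash).
SIDE_SUFFIX = re.compile(r"\s*[-\u2013]\s*(north|south|east|west)\s+side\.?\s*$", re.I)
# Connecting words that don't appear in the grouping key (an assumption).
STOPWORDS = re.compile(r"\b(from|to)\b", re.I)

def group_key(title):
    """Normalise a photo title into a slugified grouping key with the side removed."""
    base = SIDE_SUFFIX.sub("", title)
    base = STOPWORDS.sub(" ", base)
    return re.sub(r"[^a-z0-9]+", "-", base.lower()).strip("-")
```

Titles that differ only in their side suffix then hash to the same key, so all sides of a road segment land in one group.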
&lt;p&gt;I ended up with a dataset with this sort of structure (a truncated example):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-json&#34; data-lang=&#34;json&#34;&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;iffla-street-south-melbourne-coventry-street-normanby-street&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;:&lt;/span&gt;
    {
        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;title&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Iffla Street, South Melbourne, from Coventry Street to Normanby Street&amp;#34;&lt;/span&gt;,
        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;sides&amp;#34;&lt;/span&gt;:
        {
            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;east side&amp;#34;&lt;/span&gt;:
            {
                &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;title&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Iffla Street, South Melbourne, from Coventry Street to Normanby Street - east side.&amp;#34;&lt;/span&gt;,
                &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;images&amp;#34;&lt;/span&gt;:
                [
                    {
                        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;ie_id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;IE20321667&amp;#34;&lt;/span&gt;,
                        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;alma_id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#ae81ff&#34;&gt;9939649155207636&lt;/span&gt;
                    }
                    &lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;more&lt;/span&gt; &lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;photos...&lt;/span&gt;
                ]
            },
            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;west side&amp;#34;&lt;/span&gt;:
            {
                &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;title&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Iffla Street, South Melbourne, from Normanby Street to Coventry Street - west side.&amp;#34;&lt;/span&gt;,
                &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;images&amp;#34;&lt;/span&gt;:
                [
                    {
                        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;ie_id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;IE20320072&amp;#34;&lt;/span&gt;,
                        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;alma_id&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#ae81ff&#34;&gt;9939655629407636&lt;/span&gt;
                    },
					&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;more&lt;/span&gt; &lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;photos...&lt;/span&gt;
                ]
            }
        },
        &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;ways&amp;#34;&lt;/span&gt;:
        {
            &lt;span style=&#34;color:#f92672&#34;&gt;&amp;#34;27631235&amp;#34;&lt;/span&gt;:
            [
                [
                    &lt;span style=&#34;color:#ae81ff&#34;&gt;144.9503379&lt;/span&gt;,
                    &lt;span style=&#34;color:#ae81ff&#34;&gt;-37.835322&lt;/span&gt;
                ],
                &lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;more&lt;/span&gt; &lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;points...&lt;/span&gt;
            ]
        }
    }&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;,&lt;/span&gt;
		
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can see how the sides and matching ways have been brought together under the key value.&lt;/p&gt;
&lt;p&gt;This structure was useful for grouping and processing the data, but to create a map interface I needed to bring the geospatial information to the surface. The first version of the interface used one big GeoJSON file in which the features were &lt;a href=&#34;https://geocrystal.github.io/geojson/GeoJSON/MultiLineString.html&#34;&gt;MultiLineStrings&lt;/a&gt; created from the paths of each road segment. The photo data was saved in the properties of each GeoJSON feature.&lt;/p&gt;
&lt;p&gt;It sort of worked. The roads with photos were highlighted, and clicking on the roads displayed the photos. It was only when I changed the opacity of the lines that I realised that, in many cases, different road segments were being piled on top of each other. When the lines were opaque these piles were invisible, but add a bit of transparency and you could see that some lines were darker than others. Clicking on the lines only displayed the top layer, so some groups of photos were effectively invisible.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/screenshot-from-2025-12-01-14-09-47.png&#34; width=&#34;503&#34; height=&#34;339&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Version one of the interface showing how the colour of the highlighted roads varied once I decreased the opacity.&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Why did this happen? I&amp;rsquo;d wrongly assumed that each segment of road would only have one group of photos associated with it. But it&amp;rsquo;s not hard to find cases where this is not true. Consider Moor Street, Fitzroy, between Nicholson Street and Brunswick Street. On the north side, there is a single group of photos that document the buildings between Nicholson Street and Brunswick Street. However, on the south side there&amp;rsquo;s two groups of photos. One covers the section between Nicholson Street and Fitzroy Street, the other covers Fitzroy Street to Brunswick Street. One section of road, three groups of photos&amp;hellip;&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2026/cua-multiple.png&#34; width=&#34;600&#34; height=&#34;478&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Moor Street, Fitzroy, between Nicholson Street and Brunswick Street, in the new CUA Browser, showing the three photosets associated with the one section of road.&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;To make these layered groups more easily accessible through the interface I had to change the way the data was organised – separating the GeoJSON from the photosets so that multiple photosets could be associated with a single geospatial feature. I decided to create a GeoJSON feature for every OSM way in the dataset. However, I needed to prune the way&amp;rsquo;s coordinates to only include those that were part of the CUA road segments. To do this, I saved all the way data when I found the road segments. Then in the second processing phase, I grouped the way coordinates associated with the road segments and compared this list to the full way path. Any coordinate in the way path that wasn&amp;rsquo;t in the road segments was removed. It seems unnecessarily complex, but I wanted to make sure that only the parts of roads associated with photos were highlighted in the interface.&lt;/p&gt;
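The pruning step reduces to a set-membership check on coordinates. A minimal sketch (assuming coordinates are stored as `[lon, lat]` pairs, as in GeoJSON):

```python
def prune_way(way_coords, segment_coords):
    """Keep only the points of the full OSM way path that also appear
    in the matched CUA road segments, preserving the way's point order."""
    keep = {tuple(pt) for pt in segment_coords}
    return [pt for pt in way_coords if tuple(pt) in keep]
```

The result is a way path trimmed so that only the photographed stretch of road gets highlighted on the map.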
&lt;p&gt;The result was two data files. The first, &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/cua-ways.geojson&#34;&gt;&lt;code&gt;cua-ways.geojson&lt;/code&gt;&lt;/a&gt;, contains the pruned way paths and their OSM identifiers. The second, &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/cua-photos.json&#34;&gt;&lt;code&gt;cua-photos.json&lt;/code&gt;&lt;/a&gt;, contains information about each photo set, including the sides, photos, paths, and associated way identifiers. The datasets are linked by the way identifiers.&lt;/p&gt;
&lt;h2 id=&#34;constructing-the-interface&#34;&gt;Constructing the interface&lt;/h2&gt;
&lt;p&gt;My plan for the interface was pretty simple. There&amp;rsquo;d be a map on which all the road segments associated with CUA photos were highlighted. Clicking on a highlighted section would show the photos. I wanted to display the photos as if you were scanning the streetscape, so I decided to put them all side-by-side in a gallery that scrolled horizontally.&lt;/p&gt;
&lt;p&gt;The first version used Leaflet to display the maps and, as noted above, had some problems where there were multiple photosets associated with a segment of road.&lt;/p&gt;
&lt;p&gt;For the &lt;a href=&#34;https://slv.wraggelabs.com/cua/&#34;&gt;second version&lt;/a&gt; I decided to switch to &lt;a href=&#34;https://maplibre.org&#34;&gt;MapLibre&lt;/a&gt; because it seems a bit more active and up-to-date. I&amp;rsquo;d already used MapLibre in the &lt;a href=&#34;https://slv.wraggelabs.com/newspapers/&#34;&gt;SLV Newspapers Explorer&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The interface first loads the &lt;code&gt;cua-ways.geojson&lt;/code&gt; file to highlight the relevant roads. When you click on one of the roads, the way id is passed to a function that looks for associated photosets in the &lt;code&gt;cua-photos.json&lt;/code&gt; data. If there&amp;rsquo;s only one linked photoset, the photos are displayed. However, if there&amp;rsquo;s more than one linked photoset, they&amp;rsquo;re displayed as a list, and the user selects from the list to display the related photos.&lt;/p&gt;
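The lookup itself is a simple filter over the photoset data, keyed by way id. Sketched here in Python rather than the interface's actual JavaScript, and assuming the `ways` mapping uses string keys as in the JSON example above:

```python
def photosets_for_way(way_id, photos):
    """Return (key, photoset) pairs for every photoset whose 'ways'
    mapping includes the clicked way id."""
    return [
        (key, ps) for key, ps in photos.items()
        if str(way_id) in ps.get("ways", {})
    ]
```

Zero matches means the way isn't linked to any photos; one match displays immediately; more than one triggers the selection list.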
&lt;p&gt;A couple of other things happen when you click on a way or select a photoset:  the colour of the selected road segment changes, and the browser url is updated with the way or photoset identifier. You can bookmark or share these urls to go directly to a specific road or photoset. There&amp;rsquo;s also a button to reverse the order of the images – they scroll left to right, but sometimes they seem to have been photographed right to left.&lt;/p&gt;
&lt;h2 id=&#34;more-information-and-links&#34;&gt;More information and links&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://slv.wraggelabs.com/cua/&#34;&gt;CUA Browser&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;CUA data is also used in the &lt;a href=&#34;https://slv.wraggelabs.com/myplace/&#34;&gt;my place&lt;/a&gt; app&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;CUA code and data is in the &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency&#34;&gt;geo-maps-residency&lt;/a&gt; repository&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Code for the interface is in the &lt;a href=&#34;https://github.com/wragge/slv-demo-apps&#34;&gt;slv-demo-apps&lt;/a&gt; repository&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;All the outcomes of my SLV residency are listed on the &lt;a href=&#34;https://slv.wraggelabs.com&#34;&gt;Experiments with the State Library of Victoria’s collections&lt;/a&gt; page&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Goodbye 2025! A brief summary of the highlights and lowlights…</title>
      <link>https://updates.timsherratt.org/2025/12/31/goodbye-a-brief-summary-of.html</link>
      <pubDate>Wed, 31 Dec 2025 17:14:14 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/12/31/goodbye-a-brief-summary-of.html</guid>
<description>&lt;p&gt;My 2025 started badly and ended well. In the first few months of the year, battles with the gatekeepers at Trove sent me spiralling into a pretty dark place. But by year’s end I was having fun, working with the wonderful people at the State Library of Victoria. In between I caught up on some overdue project maintenance. Most of this is documented in the &lt;a href=&#34;https://updates.timsherratt.org/archive/&#34;&gt;37 blog posts I wrote this year&lt;/a&gt;, but here’s a quick summary.&lt;/p&gt;&lt;h2&gt;Highlights&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;From September to December, I was Creative Technologist-in-Residence at the SLV LAB, exploring ways of opening up the Library’s place-based collections. There are still a few things to finish off, but &lt;a href=&#34;https://slv.wraggelabs.com/&#34;&gt;here’s a list of the results so far&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;As part of my SLV work, I created a &lt;a href=&#34;https://glam-workbench.net/state-library-victoria/sands-macdougall-directories/&#34;&gt;fully searchable version&lt;/a&gt; of the 24 volumes of Sands &amp;amp; MacDougall directories digitised by the Library. This followed the pattern I’d used for 54 volumes of the &lt;a href=&#34;https://glam-workbench.net/trove-journals/nsw-post-office-directories/&#34;&gt;NSW Post Office Directories&lt;/a&gt;, 44 volumes of the &lt;a href=&#34;https://glam-workbench.net/trove-journals/sydney-telephone-directories/&#34;&gt;Sydney Telephone Directories&lt;/a&gt;, and 54 volumes of the &lt;a href=&#34;https://glam-workbench.net/tasmanian-post-office-directories/&#34;&gt;Tasmanian Post Office Directories&lt;/a&gt;. 
So there’s now 176 volumes from the 1880s to the 1950s that can be easily explored for people and places— and all free to use of course.&lt;/li&gt;&lt;li&gt;In April, I added a &lt;a href=&#34;https://updates.timsherratt.org/2025/04/30/new-prov-section-added-to.html&#34;&gt;new section to the GLAM Workbench&lt;/a&gt; documenting the Public Record Office Victoria’s collection API. I also used the API to create a &lt;a href=&#34;https://updates.timsherratt.org/2025/04/10/using-the-public-record-office.html&#34;&gt;Data Dashboard that provides an overview of PROV’s collection&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Also in April, I updated the GLAM Name Index Search &lt;a href=&#34;https://updates.timsherratt.org/2025/04/09/more-than-million-rows-of.html&#34;&gt;to include an additional 6 million records from PROV&lt;/a&gt;. In total, the GLAM Name Index now includes more than 12 million records in 293 datasets from 10 Australian GLAM organisations — another free resource for Australian researchers.&lt;/li&gt;&lt;li&gt;In July I undertook some overdue maintenance on a variety of old apps and projects. In the process, I &lt;a href=&#34;https://updates.timsherratt.org/2025/07/09/the-rebirth-of-wragge-labs.html&#34;&gt;resurrected my old Wragge Labs domain and created a showcase&lt;/a&gt; of many of the websites, apps and experiments I’ve worked on over the past 30 years.&lt;/li&gt;&lt;li&gt;I was particularly pleased &lt;a href=&#34;https://updates.timsherratt.org/2025/07/02/the-future-of-the-past.html&#34;&gt;to get &lt;em&gt;The Future of the Past&lt;/em&gt; working again&lt;/a&gt;, so once more you can create fridge magnet poetry from an odd collection of words harvested from Trove newspapers! I built FOTP back in 2012 when I was the Harold White Fellow at the NLA. 
Also this year I finally got around to &lt;a href=&#34;https://updates.timsherratt.org/2025/06/30/mining-for-meanings.html&#34;&gt;transcribing my Harold White Lecture&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;In June I wrote a &lt;a href=&#34;https://updates.timsherratt.org/2025/06/05/glam-workbench-preprint-for-building.html&#34;&gt;short piece on the GLAM Workbench&lt;/a&gt; for the forthcoming publication &lt;em&gt;Building User-Friendly Toolkits and Platforms for Digital Humanities&lt;/em&gt;. I think it provides a useful summary of what the GLAM Workbench is, and what I’d like it to be. I also wrote up the &lt;a href=&#34;https://updates.timsherratt.org/2025/06/19/a-brief-and-biased-history.html&#34;&gt;short but glorious history of Trove Twitter bots&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;h2&gt;Lowlights&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;Saying goodbye to 15 years of work on Trove&lt;/a&gt;. It still hurts. And I still miss resources such as @TroveNewsBot and the Trove API Console which ran happily for more than a decade before being killed without warning by the NLA.&lt;/li&gt;&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2025/05/19/no-more-harvesting-data-from.html&#34;&gt;Saying goodbye to 17 years of work on the National Archives of Australia’s collections&lt;/a&gt;. 
This will be the first New Year’s Day in a decade when I haven’t updated my &lt;a href=&#34;https://updates.timsherratt.org/2025/02/05/ten-years-of-data-the.html&#34;&gt;harvest of files with the access status of closed&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;h2&gt;Next year&lt;/h2&gt;&lt;ul&gt;&lt;li&gt;In 2026, I’m looking forward to starting work on the &lt;a href=&#34;https://ardc.edu.au/project/reusable-and-accessible-public-interest-documents-rapid/&#34;&gt;RAPID project&lt;/a&gt;, building on the work I’ve done on Commonwealth Hansard over the years to create new examples and documentation.&lt;/li&gt;&lt;li&gt;I’m honoured to be giving the closing keynote at the &lt;a href=&#34;https://www.glamlabs.io/events/glam-labs-futures-26&#34;&gt;GLAM Labs Futures conference&lt;/a&gt; in Edinburgh in June — hoping we can pull together the funds to get there in person!&lt;/li&gt;&lt;/ul&gt;&lt;h2&gt;How you can help&lt;/h2&gt;&lt;p&gt;Much of my work is unfunded, and keeping resources such as the GLAM Name Index running costs real money. I’ve been very grateful for the support of my GitHub sponsors over past years. Their contributions help cover a substantial proportion of my cloud hosting costs. But bidding farewell to Twitter and Trove has had an impact on my sponsorship income. If you use or value the things I build to help researchers make use of GLAM collections, you might like to &lt;a href=&#34;https://github.com/sponsors/wragge&#34;&gt;sponsor me on GitHub&lt;/a&gt;, or &lt;a href=&#34;https://www.buymeacoffee.com/wragge&#34;&gt;Buy Me a Coffee&lt;/a&gt;. All contributions are greatly appreciated!&lt;/p&gt;&lt;p&gt;If you can’t afford a financial contribution, there are other ways you can help!&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Let me know how you’re using my stuff! A bit of positive feedback does wonders when my enthusiasm is flagging. 
You can find my contact details at &lt;a href=&#34;https://timsherratt.au/&#34;&gt;timsherratt.au&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;Tell others how you use my stuff! Getting information about resources out to those who might benefit is really hard, so your help would be greatly appreciated.&lt;/li&gt;&lt;li&gt;The GLAM Workbench describes a few other ways &lt;a href=&#34;https://glam-workbench.net/get-involved/&#34;&gt;you can get involved&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Goodbye 2025!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Exploring Victorian newspapers</title>
      <link>https://updates.timsherratt.org/2025/12/16/exploring-victorian-newspapers.html</link>
      <pubDate>Tue, 16 Dec 2025 13:03:48 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/12/16/exploring-victorian-newspapers.html</guid>
<description>&lt;p&gt;Newspapers are a vital source for local history. That&amp;rsquo;s why, &lt;a href=&#34;https://discontents.com.au/easter-eggsperiments/index.html&#34;&gt;back in 2014&lt;/a&gt;, I created the &lt;a href=&#34;https://wraggelabs.com/trove-places/map/&#34;&gt;Trove Places&lt;/a&gt; app – a map interface to help people find Trove&amp;rsquo;s digitised newspapers by their place of publication or distribution. Trove Places has proved very popular, and the State Libraries of South Australia and Victoria, amongst others, point their users to it to help with their research. I&amp;rsquo;ve updated the data several times over the years, though &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;Trove&amp;rsquo;s new gatekeeping regime&lt;/a&gt; will make future updates difficult.&lt;/p&gt;
&lt;p&gt;During &lt;a href=&#34;https://lab.slv.vic.gov.au/team/tim-sherratt&#34;&gt;my residency at the State Library of Victoria&lt;/a&gt;, one of the librarians noted how useful the app was, and asked whether it might be possible to include undigitised newspapers from the SLV catalogue as well as those in Trove. It was, and I did – here&amp;rsquo;s a &lt;a href=&#34;https://slv.wraggelabs.com/newspapers/&#34;&gt;brand new app to explore Victorian newspapers&lt;/a&gt;, both digitised and undigitised!&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-12-12-12-48-16.png&#34; width=&#34;600&#34; height=&#34;391&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Just &lt;a href=&#34;https://slv.wraggelabs.com/newspapers/&#34;&gt;click on the map&lt;/a&gt; to find Victorian newspapers!&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;It&amp;rsquo;s pretty easy to use. You just click on the map in an area you&amp;rsquo;re interested in. The map will display the 20 nearest places where newspapers were published or distributed. The size of the markers indicates how many titles are associated with each place. In the sidebar, details of the newspapers are listed by place, ordered by their distance from your selected point.&lt;/p&gt;
&lt;p&gt;You can also find local newspapers using the &lt;a href=&#34;https://slv.wraggelabs.com/myplace/&#34;&gt;my place&lt;/a&gt; app. Once you enter an address, newspapers from your suburb or town will be displayed, as well as those from nearby locations.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-12-16-12-51-54.png&#34; width=&#34;600&#34; height=&#34;507&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Newspapers from Geelong displayed in the my place app&lt;/figcaption&gt;&lt;/figure&gt;
&lt;h2 id=&#34;assembling-the-data&#34;&gt;Assembling the data&lt;/h2&gt;
&lt;p&gt;How do you find Victorian newspapers? The reference librarians at the SLV pointed me to the &amp;lsquo;Place newspaper published&amp;rsquo; field in the catalogue. Searching this field for &amp;lsquo;Australia&amp;ndash;Victoria&amp;rsquo; &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/search?query=lds03,exact,Australia--Victoria&amp;amp;tab=searchProfile&amp;amp;search_scope=slv_local&amp;amp;vid=61SLV_INST:SLV&#34;&gt;returns 3,997 results&lt;/a&gt;, compared to the 460 digitised in Trove.&lt;/p&gt;
&lt;p&gt;The first step in assembling the data was to harvest the newspaper records from the SLV catalogue. To do this I made use of the Primo JSON API. The method is &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/download_newspapers.ipynb&#34;&gt;documented in this notebook&lt;/a&gt;. The result was a &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/newspapers.ndjson&#34;&gt;newline-delimited JSON file&lt;/a&gt;, with one record per line.&lt;/p&gt;
&lt;p&gt;The harvested metadata doesn&amp;rsquo;t include links to digitised versions of newspapers in Trove. To add these links I first looked in the &lt;code&gt;856&lt;/code&gt; field of the newspaper&amp;rsquo;s MARC record. I also noticed that some Trove links were being loaded from an &amp;lsquo;edelivery&amp;rsquo; JSON file, so I added these as well. I ended up with 344 unique links to Trove, but not all of these were to digitised newspapers as some more recent newspapers are available through eLegal deposit. In total there were 268 unique links to digitised newspapers. This is well short of the 460 Victorian newspapers in Trove. Why? It&amp;rsquo;s possible that the links haven&amp;rsquo;t been added into the SLV catalogue, or that the &amp;lsquo;place newspaper published&amp;rsquo; field hasn&amp;rsquo;t been populated for records that include the links. It&amp;rsquo;s also possible that Trove links are hiding somewhere else in the SLV catalogue!&lt;/p&gt;
&lt;p&gt;To try and fill this gap, I compared the catalogue metadata with my &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_newspaper.csv&#34;&gt;most recent harvest of Trove newspaper titles&lt;/a&gt;. If the Trove url was missing, I searched the catalogue data for the newspaper title. I then manually checked the results, making sure the dates and titles lined up, and added positive matches to &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/newspaper_manual_additions.csv&#34;&gt;a new CSV file&lt;/a&gt; which I merged back into the main dataset. This added another 152 Trove links.&lt;/p&gt;
&lt;p&gt;The next step was to link the &amp;lsquo;place newspaper published&amp;rsquo; values to places with known locations. The &amp;lsquo;place newspaper published&amp;rsquo; information is included in the &lt;code&gt;lds03&lt;/code&gt; field of the harvested metadata. Records often contain references to multiple places, so I split all the newspaper/place combinations out into separate rows. I then matched the places against a list of Victorian place names and coordinates downloaded from the &lt;a href=&#34;https://maps.land.vic.gov.au/lassi/VicnamesUI.jsp&#34;&gt;VicNames database&lt;/a&gt;. If there were no matches, I manually checked and adjusted the place names – for example, I changed &amp;lsquo;East Kew&amp;rsquo; to &amp;lsquo;Kew East&amp;rsquo;, and &amp;lsquo;Bayside&amp;rsquo; to &amp;lsquo;Bayside City&amp;rsquo;.&lt;/p&gt;
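The splitting step can be sketched like this. It's purely illustrative: the record layout (a `lds03` list of `Australia--Victoria--Place` strings) is an assumption about the harvested metadata, not the notebook's actual code.

```python
def explode_places(records):
    """One row per newspaper/place combination, ready for matching
    against a gazetteer such as VicNames."""
    rows = []
    for rec in records:
        for place in rec.get("lds03", []):
            # Keep only the last element of the 'Australia--Victoria--Place' hierarchy.
            name = place.split("--")[-1].strip()
            rows.append({"title": rec["title"], "place": name})
    return rows
```

Each row can then be matched by place name, with unmatched names flagged for manual correction (the 'East Kew' to 'Kew East' style fixes).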
&lt;p&gt;To add any Trove digitised newspapers that might still be missing, I made use of my existing Trove harvests. First I compared my &lt;a href=&#34;https://docs.google.com/spreadsheets/d/1rURriHBSf3MocI8wsdl1114t0YeyU0BVSXWeg232MZs/edit?usp=sharing&#34;&gt;Trove Places dataset&lt;/a&gt; with my &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_newspaper.csv&#34;&gt;latest harvest of newspaper titles&lt;/a&gt;. There were a few new titles, so I matched them to places &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/get_places_from_newspapers.ipynb&#34;&gt;using this notebook&lt;/a&gt;, based on my original Trove Places code. I then merged the Trove Places dataset with the new titles and checked it against the catalogue dataset. If any urls were missing, I added a record from the Trove data.&lt;/p&gt;
&lt;p&gt;All of the processing steps are &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/process_newspapers.ipynb&#34;&gt;documented in this notebook&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;building-the-apps&#34;&gt;Building the apps&lt;/h2&gt;
&lt;p&gt;To make the data easily searchable by its geospatial coordinates, I loaded all the data into an SQLite/SpatiaLite database and &lt;a href=&#34;https://slv-places-481615284700.australia-southeast1.run.app/newspapers&#34;&gt;published it online using Datasette&lt;/a&gt;. The database contains linked tables for titles and places.&lt;/p&gt;
&lt;p&gt;I also created a couple of canned queries which, together with Datasette&amp;rsquo;s built-in JSON API, made it possible to retrieve places and titles based on their distance from a given point. For example, this url retrieves places ordered by their distance from the point at latitude -36.815, longitude 144.965: &lt;a href=&#34;https://slv-places-481615284700.australia-southeast1.run.app/newspapers/places_from_point.json?longitude=144.965&amp;amp;latitude=-36.815&amp;amp;distance=100000&amp;amp;_shape=array&#34;&gt;https://slv-places-481615284700.australia-southeast1.run.app/newspapers/places_from_point.json?longitude=144.965&amp;amp;latitude=-36.815&amp;amp;distance=100000&amp;amp;_shape=array&lt;/a&gt;&lt;/p&gt;
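&lt;p&gt;Assembling a request like this in Python needs nothing beyond the standard library – a minimal sketch:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Build a request url for the 'places_from_point' canned query.
# The distance parameter is in metres.
base = "https://slv-places-481615284700.australia-southeast1.run.app"
params = {
    "longitude": 144.965,
    "latitude": -36.815,
    "distance": 100000,
    "_shape": "array",
}
url = f"{base}/newspapers/places_from_point.json?{urlencode(params)}"
print(url)

# To fetch the results you could then use a library such as requests:
# places = requests.get(url).json()
```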
&lt;p&gt;When you click on the map in the &lt;a href=&#34;https://slv.wraggelabs.com/newspapers/&#34;&gt;Victorian Newspapers Explorer&lt;/a&gt;, it fires off a request like this to find nearby places. It then makes a second request to find newspapers related to those places and displays the results.&lt;/p&gt;
&lt;p&gt;The Victorian Newspapers Explorer was my first attempt at using &lt;a href=&#34;https://maplibre.org&#34;&gt;MapLibre&lt;/a&gt; rather than Leaflet to display maps using JavaScript. It&amp;rsquo;s more verbose, but more flexible, so I think I&amp;rsquo;ll gradually switch over my other apps, including Trove Places.&lt;/p&gt;
&lt;p&gt;All the code of the Victorian Newspapers Explorer is in the &lt;a href=&#34;https://github.com/wragge/slv-demo-apps&#34;&gt;slv-demo-apps repository&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Why bother?</title>
      <link>https://updates.timsherratt.org/2025/12/03/why-bother.html</link>
      <pubDate>Wed, 03 Dec 2025 16:18:22 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/12/03/why-bother.html</guid>
      <description>&lt;p&gt;&lt;em&gt;This was the introduction to my talk on the results of my time as Creative Technologist-in-Residence at the State Library of Victoria. My slides, with my full notes &lt;a href=&#34;https://slides.com/wragge/slv-my-place&#34;&gt;are available online&lt;/a&gt;, but after a very strange year that has travelled from disappointment to exhilaration, I thought it was worth posting these words separately.&lt;/em&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;The work that I do, that I&amp;rsquo;ve been doing for the past 30 years, is focused on helping people find, use, and understand the wonderfully rich collections held by our libraries, archives, and museums – the GLAM sector. Much of it is quite practical, resulting in tools and applications that are used by a wide range of researchers. Some of it is playful, some of it is critical, and some of it is just weird.&lt;/p&gt;
&lt;p&gt;You can browse through some of this history on &lt;a href=&#34;https://wraggelabs.com&#34;&gt;wraggelabs.com&lt;/a&gt;. And if you&amp;rsquo;re interested in my current crop of tools you can head to the &lt;a href=&#34;https://glam-workbench.net&#34;&gt;GLAM Workbench&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As some of you may know, I had &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;a few setbacks&lt;/a&gt; at the beginning of this year which really made me wonder whether I wanted to continue doing this sort of work.&lt;/p&gt;
&lt;p&gt;I mean, why bother?&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m really grateful that &lt;a href=&#34;https://slv.wraggelabs.com&#34;&gt;this residency&lt;/a&gt; has given me a chance to refocus on the reasons why I do what I do.&lt;/p&gt;
&lt;p&gt;I suppose my starting point is the fact that libraries can&amp;rsquo;t do everything themselves.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m thinking here specifically about the digital research space. There&amp;rsquo;s a lot that libraries, and other GLAM organisations, &lt;em&gt;can&lt;/em&gt; do – provide search interfaces, APIs, downloadable datasets, documentation, and examples of how to access APIs and datasets using code. The sorts of things that the &lt;a href=&#34;https://lab.slv.vic.gov.au&#34;&gt;SLV LAB&lt;/a&gt; is doing.&lt;/p&gt;
&lt;p&gt;I should pause here to unpack some acronyms. APIs deliver data in a form that machines can understand and process. Websites are for humans, APIs are for computers. APIs are also building blocks which can be connected up to create new applications – and I&amp;rsquo;ll be showing some examples of this later on.&lt;/p&gt;
&lt;p&gt;So there is much that GLAM organisations can do to support digital research. But it will never be enough. Researchers – whether they be academics or family historians – will always want more. It is in the nature of research to ask new questions, to head off in new directions.&lt;/p&gt;
&lt;p&gt;But rather than see this as a source of tension, I see it as an opportunity for collaboration. An opportunity to cultivate the &lt;em&gt;in-between&lt;/em&gt; spaces where research methods, tools, and results can feed back into the contextual frameworks of GLAM collections. Where GLAM organisations can share and celebrate the work that&amp;rsquo;s done with their data. Where all can find inspiration, ideas and support.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve tended to call this sort of stuff infrastructure, but I think that really downplays the human aspect. The research sector has started to develop the funding and career structures necessary to allow people to build and maintain these infrastructures, but we need more. We need to recognise that a single tool, developed by an individual without institutional support, can be just as important as a multi-million dollar platform. Passion is precious and needs to be protected.&lt;/p&gt;
&lt;p&gt;Most of all, we need to keep a focus on the ethical imperatives – the reasons &lt;em&gt;why&lt;/em&gt; we bother and &lt;em&gt;why&lt;/em&gt; it matters. For me it boils down to openness and generosity. I have benefited greatly from the openness and generosity of others, and I want to pass that on. It&amp;rsquo;s the glue we need to hold those in-between spaces together; the sustenance we need to maintain our enthusiasm in the face of all the crap; the inspiration we need to try something new.&lt;/p&gt;
&lt;p&gt;Initiatives like the SLV LAB are important, not just because they foster innovation, but because they invite new ideas in. They even give space for ageing hackers like me to spend some dedicated time doing what they love – crafting new pathways for people to explore our glorious GLAM collections.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Counting down... (to the end of my SLV residency)</title>
      <link>https://updates.timsherratt.org/2025/11/19/counting-down-to-the-end.html</link>
      <pubDate>Wed, 19 Nov 2025 16:01:22 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/11/19/counting-down-to-the-end.html</guid>
      <description>&lt;p&gt;My stint as &lt;a href=&#34;https://updates.timsherratt.org/2025/09/22/creative-technologistinresidence-at-the-state.html&#34;&gt;Creative Technologist-in-Residence at the State Library of Victoria LAB&lt;/a&gt; comes to an end in a few weeks&amp;rsquo; time and I&amp;rsquo;m frantically trying to pull things together. I&amp;rsquo;ll be back on-site at the Library from 1 to 5 December for a few events, and to report back to staff on what I&amp;rsquo;ve been doing.&lt;/p&gt;
&lt;p&gt;On Tuesday &lt;strong&gt;2 December&lt;/strong&gt;, there&amp;rsquo;ll be a public workshop on using and contributing to the &lt;a href=&#34;https://glam-workbench.net&#34;&gt;GLAM Workbench&lt;/a&gt;. Here&amp;rsquo;s the blurb:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;More and more GLAM organisations are looking to share their data to foster creativity and support new types of research. But how can you help potential users understand the possibilities of your data? This workshop will explore how GLAM organisations can create and share resources that encourage experimentation.&lt;/p&gt;
&lt;p&gt;The GLAM Workbench is a large collection of tools, hacks, and tutorials aimed at helping researchers make use of collection data. It uses platforms such as Jupyter notebooks to create live, working examples that run in your browser without additional software. Similar repositories of computational resources are being developed by GLAM organisations around the world.&lt;/p&gt;
&lt;p&gt;This workshop will introduce the technologies and standards used in the GLAM Workbench, such as Jupyter notebooks. It will provide an overview of related activity around the world, including best practice guidelines for GLAM organisations developing computational resources. It will explain how organisations and individuals can contribute content to the GLAM Workbench, or use it as a model to create their own specialised workbenches.&lt;/p&gt;
&lt;p&gt;Sharing data is important, but so is sharing skills, tools, and knowledge. Come along to find out how the GLAM Workbench can help.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It&amp;rsquo;s a free, hybrid event (in person and online) and will run from 1.00-3.00pm. A sign up page should be available soon.&lt;/p&gt;
&lt;p&gt;On Wednesday &lt;strong&gt;3 December&lt;/strong&gt; I&amp;rsquo;m presenting the results of my residency in a &amp;lsquo;technologist&amp;rsquo;s talk&amp;rsquo;. It&amp;rsquo;s an internal event, but it&amp;rsquo;s in the public &amp;lsquo;Create quarter&amp;rsquo; of the Library, so I think anyone can pop in. Hopefully there&amp;rsquo;ll be a video I can share.&lt;/p&gt;
&lt;p&gt;To give you an idea of what I&amp;rsquo;ll be talking about, here&amp;rsquo;s some of the outcomes so far:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;hacking the library workshop (&lt;a href=&#34;https://slides.com/wragge/slv-code-club&#34;&gt;slides&lt;/a&gt;, and &lt;a href=&#34;https://updates.timsherratt.org/2025/09/23/exploring-slv-urls.html&#34;&gt;blog post about urls&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;bounding boxes for parish maps (&lt;a href=&#34;https://updates.timsherratt.org/2025/10/06/creating-bounding-boxes-for-parish.html&#34;&gt;blog post&lt;/a&gt;, &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency&#34;&gt;code&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;geolocating the Committee for Urban Action collection of photographs (&lt;a href=&#34;https://wragge.github.io/slv-demo-apps/cua-browser.html&#34;&gt;prototype interface&lt;/a&gt;, still documenting the method)&lt;/li&gt;
&lt;li&gt;a new fully-searchable version of the Sands &amp;amp; MacDougall&amp;rsquo;s directories (&lt;a href=&#34;https://updates.timsherratt.org/2025/11/12/a-new-way-of-searching.html&#34;&gt;blog post&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/state-library-victoria/sands-macdougall-directories/&#34;&gt;database&lt;/a&gt;, and &lt;a href=&#34;https://updates.timsherratt.org/2025/11/16/some-sands-mac-tweaks-thanks.html&#34;&gt;another blog post&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;georeferencing digitised maps – over 500 so far! (&lt;a href=&#34;https://wragge.github.io/slv-allmaps/&#34;&gt;documentation&lt;/a&gt;, &lt;a href=&#34;https://wragge.github.io/slv-allmaps/dashboard.html&#34;&gt;dashboard&lt;/a&gt;, &lt;a href=&#34;https://github.com/wragge/slv-allmaps&#34;&gt;data repository&lt;/a&gt;, &lt;a href=&#34;https://updates.timsherratt.org/2025/11/04/turning-the-slvs-maps-into.html&#34;&gt;blog post&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;and as of yesterday, 3,000+ geolocated newspapers (documentation and interface coming!)&lt;/li&gt;
&lt;/ul&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-11-18-15-22-26.png&#34; width=&#34;600&#34; height=&#34;356&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;First attempt at mapping places of publication and distribution of Victorian newspapers&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;At the moment I&amp;rsquo;m trying to bring it all together in a new interface that lets you type in an address and find collection materials relating to your home, your street, and your suburb. Only two weeks to go! Eeek!&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-11-15-17-30-26.png&#34; width=&#34;600&#34; height=&#34;454&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Work in progress!&lt;/figcaption&gt;&lt;/figure&gt;
</description>
    </item>
    
    <item>
      <title>Some Sands &amp; Mac tweaks thanks to ALTO and IIIF</title>
      <link>https://updates.timsherratt.org/2025/11/16/some-sands-mac-tweaks-thanks.html</link>
      <pubDate>Sun, 16 Nov 2025 15:17:26 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/11/16/some-sands-mac-tweaks-thanks.html</guid>
      <description>&lt;p&gt;I posted recently about &lt;a href=&#34;https://updates.timsherratt.org/2025/11/12/a-new-way-of-searching.html&#34;&gt;my new fully-searchable version of the Sands &amp;amp; MacDougall directories&lt;/a&gt;. I&amp;rsquo;ve now moved on to try and pull together a number of the State Library of Victoria&amp;rsquo;s place-based collections into a new discovery interface. It&amp;rsquo;s going to be a busy couple of weeks as my residency ends in early December!&lt;/p&gt;
&lt;p&gt;I wanted to incorporate Sands &amp;amp; Mac search results into the new interface. Getting the data was easy because &lt;a href=&#34;https://datasette.io&#34;&gt;Datasette&lt;/a&gt; has a JSON API baked in. But what about the images? I could just display a thumbnail of the whole page, but it would be better to show a snippet of the actual entry. Thanks to &lt;a href=&#34;https://iiif.io&#34;&gt;IIIF&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Analyzed_Layout_and_Text_Object&#34;&gt;ALTO&lt;/a&gt;, I now can.&lt;/p&gt;
&lt;p&gt;IIIF makes it easy to cut small sections out of a larger image. You just put the coordinates of the desired section in the IIIF url. As I noted in my previous post, the ALTO files that contain the OCR data from Sands &amp;amp; Mac include the coordinates of every line, and every word. I just had to bring the two together.&lt;/p&gt;
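&lt;p&gt;Given a line&amp;rsquo;s position and size in pixels, the IIIF Image API&amp;rsquo;s region parameter (&lt;code&gt;x,y,w,h&lt;/code&gt;) does the cropping. The base url below is a placeholder, not a real SLV image endpoint:&lt;/p&gt;

```python
# Build an IIIF Image API url that crops one entry out of the full page
# image. The region segment is 'x,y,w,h' in pixels on the full image;
# 'max' requests the cropped region at full size.
def iiif_snippet_url(base, x, y, w, h, size="max"):
    return f"{base}/{x},{y},{w},{h}/{size}/0/default.jpg"

url = iiif_snippet_url("https://example.org/iiif/IMAGE_ID", 120, 845, 980, 40)
print(url)
```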
&lt;p&gt;All I did was update the code that extracts the data from the ALTO files to save the results as newline delimited JSON instead of a plain text file. Each line in each volume of Sands &amp;amp; Mac is now saved as a JSON object that contains the text, as well as the height, width, vertical position, and horizontal position of the line within the page image. When I load up the SQLite database, I add the values for &lt;code&gt;h&lt;/code&gt;, &lt;code&gt;w&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt;, and &lt;code&gt;y&lt;/code&gt; as well as the text for each line.&lt;/p&gt;
&lt;p&gt;What does this make possible?&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;When you go to an individual entry, the page image now automatically pans and zooms so that the current entry is at the centre of the image viewer. I just updated the &lt;a href=&#34;https://openseadragon.github.io&#34;&gt;OpenSeadragon&lt;/a&gt; code to focus on the entry&amp;rsquo;s position.&lt;/li&gt;
&lt;li&gt;If you share an entry on social media, a snipped-out section of the page image showing the selected entry is displayed, because there&amp;rsquo;s now an image &lt;code&gt;META&lt;/code&gt; tag that points to an IIIF url.&lt;/li&gt;
&lt;li&gt;You can retrieve entries via the API and use the coordinates to request snipped out images of them via IIIF.&lt;/li&gt;
&lt;/ol&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-11-16-15-06-52.png&#34; width=&#34;600&#34; height=&#34;330&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Nice image snippets thanks to IIIF and ALTO (and a sneak preview of what&#39;s coming...)&lt;/figcaption&gt;&lt;/figure&gt;
</description>
    </item>
    
    <item>
      <title>A new way of searching Sands &amp; Mac</title>
      <link>https://updates.timsherratt.org/2025/11/12/a-new-way-of-searching.html</link>
      <pubDate>Wed, 12 Nov 2025 22:25:21 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/11/12/a-new-way-of-searching.html</guid>
      <description>&lt;p&gt;In the fortnight I spent onsite at the State Library of Victoria, &amp;lsquo;Sands &amp;amp; Mac&amp;rsquo; was mentioned many times. And no wonder. The &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/collectionDiscovery?vid=61SLV_INST:SLV&amp;amp;collectionId=81213035910007636&#34;&gt;Sands &amp;amp; McDougall&amp;rsquo;s directories&lt;/a&gt; are a goldmine for anyone researching family, local, or social history. They list thousands of names and addresses, enabling you to find individuals, and explore changing land use over time. When people ask the SLV&amp;rsquo;s librarians, &amp;lsquo;What can you tell me about the history of my house?&amp;rsquo;, Sands &amp;amp; Mac is one of the first resources consulted.&lt;/p&gt;
&lt;p&gt;The SLV has digitised &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/collectionDiscovery?vid=61SLV_INST:SLV&amp;amp;collectionId=81213035910007636&#34;&gt;24 volumes of Sands &amp;amp; Mac&lt;/a&gt;, one every five years from 1860 to 1974. You can browse the contents of each volume in the SLV image viewer, using the partial contents listing to help you find your way to sections of interest. To search the full text content you need to use the PDF version, either in the built-in viewer, or by downloading the PDF. There&amp;rsquo;s a &lt;a href=&#34;https://blogs.slv.vic.gov.au/tips-and-tricks/collection-discovery-tips-sands-mcdougalls-directories/&#34;&gt;handy guide to using Sands &amp;amp; Mac&lt;/a&gt; that explains the options.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;However, there&amp;rsquo;s currently no way of searching across all 24 volumes, so as part of my residency at the SLV LAB, I thought I&amp;rsquo;d make one!&lt;/strong&gt;&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/sands-and-mac.png&#34; width=&#34;600&#34; height=&#34;310&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;&lt;a href=&#34;https://glam-workbench.net/state-library-victoria/sands-macdougall-directories/&#34;&gt;&lt;b&gt;Try it now!&lt;/b&gt;&lt;/a&gt;&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;My &lt;a href=&#34;https://glam-workbench.net/state-library-victoria/sands-macdougall-directories/&#34;&gt;new Sands &amp;amp; Mac database&lt;/a&gt; follows the pattern I&amp;rsquo;ve used previously to create fully-searchable versions of the &lt;a href=&#34;https://glam-workbench.net/trove-journals/nsw-post-office-directories/&#34;&gt;NSW Post Office directories&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/trove-journals/sydney-telephone-directories/&#34;&gt;Sydney telephone directories&lt;/a&gt;, and &lt;a href=&#34;https://glam-workbench.net/tasmanian-post-office-directories/&#34;&gt;Tasmanian Post Office directories&lt;/a&gt;. Every line of text is saved to a database, so a single query searches for entries across all volumes. You can also use advanced search features like wildcards and boolean operators.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/sands-and-mac-search.png&#34; width=&#34;600&#34; height=&#34;543&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Search across all 24 volumes!&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;Once you&amp;rsquo;ve found a relevant entry you can view it in context, alongside a zoomable image of the page. You can even use Zotero to save individual entries to your own research database. &lt;a href=&#34;https://chineseaustralia.org/from-the-archive-uncovering-the-everyday-heritage-of-chinese-tasmanians/&#34;&gt;This blog post&lt;/a&gt; from the Everyday Heritage project describes how the Tasmanian directories have been used to map Tasmania&amp;rsquo;s Chinese population.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/sands-and-mac-entry.png&#34; width=&#34;600&#34; height=&#34;370&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;View each entry in context! (Here&#39;s my Dad building his first house in Beaumaris in the 1950s.)&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;There&amp;rsquo;s still a few things I&amp;rsquo;d like to try, such as making use of the table of contents information for each volume. I&amp;rsquo;d also like to create some additional entry points to take users directly to listings for individual suburbs (maybe even streets!). Each volume has a directory of suburbs, so it would be a matter of extracting and cleaning the data and linking the entries to digitised pages. Certainly possible, but I don&amp;rsquo;t think I&amp;rsquo;ll have time to get it all done before the end of my residency. Perhaps I&amp;rsquo;ll try to get at least one volume done to demonstrate how it might work, and the value it would add. As I was writing this blog post I also realised there&amp;rsquo;s &lt;a href=&#34;https://www.environment.vic.gov.au/sustainability/victoria-unearthed/about-the-data/sands-and-mcdougall&#34;&gt;a dataset of businesses&lt;/a&gt; extracted from the Sands &amp;amp; Mac, so I need to think about how I can use that as well!&lt;/p&gt;
&lt;h2 id=&#34;technical-information-follows&#34;&gt;Technical information follows&amp;hellip;&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve documented the process I used to create fully-searchable versions of the &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/&#34;&gt;Tasmanian&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/trove-journals/create-text-db-indexed-by-line/&#34;&gt;NSW directories&lt;/a&gt; in the GLAM Workbench. I followed a similar method for Sands and Mac, though with a few dead-ends and discoveries along the way.&lt;/p&gt;
&lt;h3 id=&#34;downloading-the-pdfs&#34;&gt;Downloading the PDFs&lt;/h3&gt;
&lt;p&gt;I assumed that it would be easiest to work from the PDF versions of each volume, as I&amp;rsquo;d done for Tasmania. So I set about finding a way to download them all. There are only 24 volumes, so I &lt;em&gt;could&lt;/em&gt; have downloaded them manually, but where&amp;rsquo;s the fun in that?&lt;/p&gt;
&lt;p&gt;I started with a CSV file listing the Sands &amp;amp; Mac volumes that I downloaded from the catalogue. This gave me the Alma identifiers for each volume. To download the PDFs I needed two more identifiers, the &lt;code&gt;IE&lt;/code&gt; identifier assigned to each digitised item, and a file identifier that points to the PDF version of the item. The &lt;code&gt;IE&lt;/code&gt; identifier can be extracted from the item&amp;rsquo;s MARC record, as I described in &lt;a href=&#34;https://updates.timsherratt.org/2025/09/23/exploring-slv-urls.html&#34;&gt;my post on exploring urls&lt;/a&gt;. The PDF file identifier was a bit more difficult to track down. The PDF links in the image viewer are generated dynamically, so the data had to be coming from somewhere. Eventually I found that the viewer loaded a JSON file with all sorts of useful metadata in it!&lt;/p&gt;
&lt;p&gt;The url to download the JSON file is: &lt;code&gt;https://viewerapi.slv.vic.gov.au/?entity=[IE identifier]&amp;amp;dc_arrays=1&lt;/code&gt;. In the &lt;code&gt;summary&lt;/code&gt; section I found identifiers for &lt;code&gt;small_pdf&lt;/code&gt; and &lt;code&gt;master_pdf&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;I could then use these identifiers to construct urls to download the PDFs themselves: &lt;code&gt;https://rosetta.slv.vic.gov.au/delivery/DeliveryManagerServlet?dps_func=stream&amp;amp;dps_pid=[PDF id]&lt;/code&gt;&lt;/p&gt;
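&lt;p&gt;Putting the two url patterns together in Python (the identifiers passed in below are placeholders):&lt;/p&gt;

```python
from urllib.parse import urlencode

# Url of the JSON metadata file loaded by the image viewer
def metadata_url(ie_id):
    return "https://viewerapi.slv.vic.gov.au/?" + urlencode(
        {"entity": ie_id, "dc_arrays": 1}
    )

# Url that streams the PDF itself, given a file identifier from the JSON
def pdf_url(pdf_id):
    return (
        "https://rosetta.slv.vic.gov.au/delivery/DeliveryManagerServlet?"
        + urlencode({"dps_func": "stream", "dps_pid": pdf_id})
    )

print(metadata_url("IE1234567"))
# In practice you would fetch the JSON, read the 'small_pdf' or
# 'master_pdf' identifier from its 'summary' section, and pass that
# identifier to pdf_url().
```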
&lt;p&gt;Once I had the PDFs I used &lt;a href=&#34;https://github.com/pymupdf/PyMuPDF&#34;&gt;PyMuPDF&lt;/a&gt; to extract all the text and images. As I suspected, the text wasn&amp;rsquo;t really fit for purpose. The OCR was ok, but the column structures were a mess. Because I wanted to index each entry individually, it was important to try and get the columns represented as accurately as possible. The images in the small PDFs were already bitonal, so I started feeding them to &lt;a href=&#34;https://github.com/tesseract-ocr/tesseract&#34;&gt;Tesseract&lt;/a&gt; to see if I could get better results. After a bit of tweaking, things were looking pretty good. But when I came to compile all the data, I realised there was a potential problem matching the PDF pages to the images available through IIIF. I found one case where some pages were missing from the PDF, and another couple where the page order was different.&lt;/p&gt;
&lt;p&gt;As I was looking around for a solution, I realised that those JSON files I downloaded to get the PDF identifiers also included links to &lt;a href=&#34;https://en.wikipedia.org/wiki/Analyzed_Layout_and_Text_Object&#34;&gt;ALTO XML&lt;/a&gt; files that contain all the original OCR data (before it got mangled by the PDF formatting). There was one ALTO file for every page. Even better, the JSON linked the identifiers for the text and the image together – no more page mismatches!&lt;/p&gt;
&lt;h3 id=&#34;downloading-the-alto-files&#34;&gt;Downloading the ALTO files&lt;/h3&gt;
&lt;p&gt;Let&amp;rsquo;s start this again, shall we? After wasting several days futzing about with the PDFs, I decided to download all the ALTO files and extract the text from them. As I downloaded each XML file, I also grabbed the corresponding image identifier from the JSON and included both identifiers in the file name for safe keeping.&lt;/p&gt;
&lt;p&gt;The ALTO files break the text down by block, line, and word. To extract the text, I just looped through every line, joining the words back together as a string, and writing the result to a new text file – one for each page.&lt;/p&gt;
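&lt;p&gt;In outline, the extraction looks something like this. To keep the example self-contained it builds a single ALTO-style &lt;code&gt;TextLine&lt;/code&gt; element in code rather than parsing a real file (which would also mean handling ALTO&amp;rsquo;s versioned XML namespaces):&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# A stand-in for one TextLine element from an ALTO file: each String
# child holds a word, and the line carries its size and position on
# the page in pixels.
line = ET.Element("TextLine", HEIGHT="40", WIDTH="980", VPOS="845", HPOS="120")
for word in ["Sherratt", "W.,", "Beaumaris"]:
    ET.SubElement(line, "String", CONTENT=word)

# Join the words back together as a single line of text
text = " ".join(s.get("CONTENT") for s in line.iter("String"))
print(text)
```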
&lt;p&gt;It&amp;rsquo;s worth noting that the ALTO files include &lt;em&gt;all&lt;/em&gt; the positional data generated by the OCR process, so you have the size and position of every word on every page. I just pulled out the text, but there are many more interesting things you could do&amp;hellip;&lt;/p&gt;
&lt;h3 id=&#34;assembling-and-publishing-the-database&#34;&gt;Assembling and publishing the database&lt;/h3&gt;
&lt;p&gt;From here on everything pretty much followed the pattern of the NSW and Tasmanian directories. I looped through each volume, page, and line of text, adding the text and metadata to a SQLite database using &lt;a href=&#34;https://sqlite-utils.datasette.io/en/stable/&#34;&gt;sqlite_utils&lt;/a&gt;. I then indexed the text for full-text searching. At the same time I populated a metadata file with titles, urls, and a few configuration details. The metadata file is used by &lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; to fill in parts of the interface.&lt;/p&gt;
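&lt;p&gt;The same pattern can be sketched with nothing but the standard library&amp;rsquo;s &lt;code&gt;sqlite3&lt;/code&gt; module and an FTS5 index (&lt;code&gt;sqlite_utils&lt;/code&gt; wraps this kind of setup; the rows below are invented):&lt;/p&gt;

```python
import sqlite3

# Create an in-memory database with a full-text index over the lines
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE lines USING fts5(volume, page, text)")
rows = [
    ("1900", "12", "Sherratt W., carpenter, Beaumaris"),
    ("1900", "13", "Smith J., grocer, Kew East"),
]
db.executemany("INSERT INTO lines VALUES (?, ?, ?)", rows)

# A single full-text query now searches every volume at once
hits = db.execute(
    "SELECT volume, page, text FROM lines WHERE lines MATCH ?", ("Kew",)
).fetchall()
print(hits)
```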
&lt;p&gt;I made some minor changes to the Datasette template I used for the other directories. In particular, I had to update the urls that loaded the &lt;a href=&#34;https://iiif.io&#34;&gt;IIIF&lt;/a&gt; images into the &lt;a href=&#34;https://openseadragon.github.io&#34;&gt;OpenSeadragon viewer&lt;/a&gt;. But it mostly just worked. It&amp;rsquo;s so nice to be able to reuse existing patterns!&lt;/p&gt;
&lt;p&gt;Finally, I used &lt;a href=&#34;https://docs.datasette.io/en/stable/publish.html&#34;&gt;Datasette&amp;rsquo;s &lt;code&gt;publish&lt;/code&gt; command&lt;/a&gt; to push everything to Google Cloud Run. The final database contains details of more than 50,000 pages, and over 19 million lines of text! It weighs in at about 1.7 GB. The Cloud Run service will &amp;lsquo;scale to zero&amp;rsquo; when not in use. This saves some money and resources, but means it can take a little while to spin up. Once it&amp;rsquo;s loaded, it&amp;rsquo;s very fast. My &lt;a href=&#34;https://updates.timsherratt.org/2022/09/15/from-pdfs-to.html&#34;&gt;original post on the Tasmanian directories&lt;/a&gt; included a little note on costs, if you&amp;rsquo;re interested.&lt;/p&gt;
&lt;h2 id=&#34;more-information&#34;&gt;More information&lt;/h2&gt;
&lt;p&gt;The notebooks I used are on GitHub:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/download_sands_and_mac_pdfs.ipynb&#34;&gt;Download Sands and Mac PDFs and OCR text&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/load_sands_and_mac_into_datasette.ipynb&#34;&gt;Load data from the Sands and Mac directories into an SQLite database (for use with Datasette)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here are some posts about the NSW and Tasmanian directories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2022/09/01/making-nsw-postal.html&#34;&gt;Making NSW Postal Directories (and other digitised directories) easier to search with the GLAM Workbench and Datasette&lt;/a&gt; (September 2022)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2022/09/15/from-pdfs-to.html&#34;&gt;From 48 PDFs to one searchable database – opening up the Tasmanian Post Office Directories with the GLAM Workbench&lt;/a&gt; (September 2022)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2024/09/26/wheres-missing-volume.html&#34;&gt;Where&amp;rsquo;s 1920? Missing volume added to Tasmanian Post Office Directories!&lt;/a&gt; (September 2024)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2024/11/21/six-more-volumes.html&#34;&gt;Six more volumes added to the searchable database of Tasmanian Post Office Directories!&lt;/a&gt; (November 2024)&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Turning the SLV&#39;s maps into data with Allmaps and some GLAM plumbing</title>
      <link>https://updates.timsherratt.org/2025/11/04/turning-the-slvs-maps-into.html</link>
      <pubDate>Tue, 04 Nov 2025 15:02:53 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/11/04/turning-the-slvs-maps-into.html</guid>
      <description>&lt;p&gt;I often describe what I do as GLAM data plumbing. Most of the time I&amp;rsquo;m not creating new tools, I&amp;rsquo;m figuring out what data is available and how I can connect it up to &lt;em&gt;existing&lt;/em&gt; tools. It&amp;rsquo;s rarely straightforward, but if I can get all the pipes connected and data flowing in the right direction, suddenly new things become possible.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Things like turning all the State Library of Victoria&amp;rsquo;s digitised maps into data.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve just &lt;a href=&#34;https://wragge.github.io/slv-allmaps/&#34;&gt;created a workflow&lt;/a&gt; that uses &lt;a href=&#34;https://allmaps.org&#34;&gt;Allmaps&lt;/a&gt; and &lt;a href=&#34;https://iiif.io/&#34;&gt;IIIF&lt;/a&gt; to georeference the SLV&amp;rsquo;s digitised maps. There are some technical details below, but the idea is pretty simple. A userscript links the SLV image viewer to Allmaps – so you just click on a button, and the digitised map opens, ready for georeferencing.&lt;/p&gt;
&lt;p&gt;Why is this useful? Georeferencing relates a digitised map to real world geography. It describes the map&amp;rsquo;s position and extent using geospatial coordinates – turning historic documents into geospatial data that can be indexed, visualised and manipulated. Georeferencing opens digitised maps to new research uses.&lt;/p&gt;
&lt;p&gt;So, how many maps can we georeference before my residency finishes in December? Hundreds? Thousands? If you like maps and want to help, head to &lt;a href=&#34;https://wragge.github.io/slv-allmaps/&#34;&gt;the documentation page&lt;/a&gt; to find out how to get started. And if you want to see how things are progressing, have a look at &lt;a href=&#34;https://wragge.github.io/slv-allmaps/dashboard.html&#34;&gt;the project dashboard&lt;/a&gt;.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/slv-allmaps-docs.png&#34; width=&#34;600&#34; height=&#34;466&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;&lt;a href=&#34;https://wragge.github.io/slv-allmaps/&#34;&gt;View the documentation&lt;/a&gt; to get started&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;A few technical details follow&amp;hellip;&lt;/p&gt;
&lt;p&gt;Early on in my time as Creative Technologist-in-Residence at the State Library of Victoria, I started playing around with Allmaps for georeferencing digitised maps. It&amp;rsquo;s a great tool (really a suite of tools and standards) because instead of constructing a whole new platform it integrates with existing IIIF services. The SLV provides digitised images through IIIF, so I thought it should be possible to use Allmaps to georeference the SLV&amp;rsquo;s map collection.&lt;/p&gt;
&lt;p&gt;But I struck a problem that took some time to unravel. The IIIF urls in the SLV manifests include port numbers and that confused Allmaps. The manifests also sometimes contained references to image formats that weren&amp;rsquo;t actually accessible, generating errors when they were loaded. Hopefully these problems will be fixed by the SLV, but in the meantime I&amp;rsquo;ve created a proxy service that edits the manifest on the fly. The proxied urls can be loaded into the Allmaps Editor without errors. Pipes fixed, data flowing!&lt;/p&gt;
&lt;details&gt;
  &lt;summary&gt;Using the manifest proxy&lt;/summary&gt;
  &lt;p&gt;To generate a link to a proxied manifest, first grab the item&#39;s &lt;code&gt;IE&lt;/code&gt; identifier from the url of the digitised item viewer. For example, the identifier in this url &lt;code&gt;https://viewer.slv.vic.gov.au/?entity=IE15485265&amp;mode=browse&lt;/code&gt; is &lt;code&gt;IE15485265&lt;/code&gt;. Once you have the identifier, add it to the end of the url &lt;code&gt;https://wraggelabs.com/slv_iiif/&lt;/code&gt;. For example, &lt;a href=&#34;https://wraggelabs.com/slv_iiif/IE15485265&#34;&gt;https://wraggelabs.com/slv_iiif/IE15485265&lt;/a&gt;. You can then supply this url to the Allmaps editor.&lt;/p&gt;
&lt;/details&gt;
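&lt;p&gt;In code, the proxy url construction is a one-liner – a minimal sketch (the function name is mine):&lt;/p&gt;

```python
def proxied_manifest_url(ie_id):
    """Build a proxied IIIF manifest url from an SLV IE identifier."""
    return f"https://wraggelabs.com/slv_iiif/{ie_id}"

# For example: proxied_manifest_url("IE15485265")
```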
&lt;p&gt;But having to fiddle around with proxies didn&amp;rsquo;t make a great user experience. I needed some way of integrating the two services, so that a user could just click a button in the SLV website and start editing in Allmaps. Userscripts to the rescue!&lt;/p&gt;
&lt;p&gt;I wrote recently about &lt;a href=&#34;https://updates.timsherratt.org/2025/07/17/glam-hacking-with-userscripts.html&#34;&gt;hacking GLAM collection interfaces using userscripts&lt;/a&gt;. Since I started my residency at the SLV, I&amp;rsquo;ve also created a userscript to &lt;a href=&#34;https://gist.github.com/wragge/a37a4db854deffad956abc7bf918f6b0&#34;&gt;display the IIIF manifest url in the SLV image viewer&lt;/a&gt;, and run a Code Club workshop where we played around with &lt;a href=&#34;https://slides.com/wragge/slv-code-club&#34;&gt;an assortment of SLV website hacks&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;As in a number of these examples, the &lt;a href=&#34;https://gist.github.com/wragge/5680daaec4b4b34ed5537e6ff79559a2&#34;&gt;georeferencing userscript&lt;/a&gt; adds new features to the SLV website, but there&amp;rsquo;s a fair bit more going on under the hood. It runs automatically every time you load the SLV image viewer, and then:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it checks the metadata of the digitised item to see if it&amp;rsquo;s a map (or something that contains maps, like an atlas or street directory)&lt;/li&gt;
&lt;li&gt;if it looks like a map, it generates an Allmaps identifier using the item&amp;rsquo;s IIIF manifest url and checks with Allmaps to see whether the item has already been georeferenced&lt;/li&gt;
&lt;li&gt;it adds a &amp;lsquo;Georeferencing&amp;rsquo; section to the page, with a button to georeference the item (or edit the existing georeferencing)&lt;/li&gt;
&lt;li&gt;if the item has already been georeferenced, it adds a button to view the item in the Allmaps Viewer, and embeds a live preview&lt;/li&gt;
&lt;/ul&gt;
&lt;details&gt;
    &lt;summary&gt;Accessing metadata&lt;/summary&gt;
    &lt;p&gt;
        The userscript gets the item metadata from a JSON file that&#39;s loaded by the image viewer. The JSON file includes a lot of extra, useful information about the digitised item. To access the JSON file, you just construct a url like this: &lt;code&gt;https://viewerapi.slv.vic.gov.au/?entity=[IE identifier]&amp;dc_arrays=1&lt;/code&gt;. The IE identifier is in the url of the image viewer.
    &lt;/p&gt;
&lt;/details&gt;
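&lt;p&gt;The steps in the note above can be sketched in Python – the helper names are mine, and the structure of the returned JSON is not documented here, so this just builds the url and fetches the file:&lt;/p&gt;

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def viewer_metadata_url(ie_id):
    """Build the url of the JSON metadata file loaded by the SLV image viewer."""
    return "https://viewerapi.slv.vic.gov.au/?" + urlencode({"entity": ie_id, "dc_arrays": 1})

def get_viewer_metadata(ie_id):
    """Fetch and parse the metadata JSON (requires network access)."""
    with urlopen(viewer_metadata_url(ie_id)) as resp:
        return json.loads(resp.read())
```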
&lt;details&gt;
    &lt;summary&gt;Allmaps identifiers&lt;/summary&gt;
    &lt;p&gt;
        Allmaps creates its identifiers by hash encoding the IIIF urls. The userscript borrows some code from the &lt;a href=&#34;https://github.com/allmaps/allmaps/tree/main/packages/id&#34;&gt;Allmaps id module&lt;/a&gt; to generate the ids, then sends a HEAD request to the Allmaps API to see whether an entry for the current manifest exists.
    &lt;/p&gt;
&lt;/details&gt;
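&lt;p&gt;The existence check can be sketched like this – generating the identifier itself is left to the borrowed Allmaps code, and the annotations endpoint url is my assumption from poking at the Allmaps API:&lt;/p&gt;

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

def allmaps_manifest_url(allmaps_id):
    # Endpoint path is an assumption based on the public Allmaps annotations API
    return f"https://annotations.allmaps.org/manifests/{allmaps_id}"

def is_georeferenced(allmaps_id):
    """Send a HEAD request; a 404 means no georeferencing exists yet."""
    req = Request(allmaps_manifest_url(allmaps_id), method="HEAD")
    try:
        with urlopen(req) as resp:
            return resp.status == 200
    except HTTPError:
        return False
```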
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/alv-allmaps-not-georeferenced.png&#34; width=&#34;600&#34; height=&#34;313&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Example of an item that hasn&#39;t been georeferenced yet&lt;/figcaption&gt;&lt;/figure&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/slv-allmaps-georeferenced.png&#34; width=&#34;600&#34; height=&#34;462&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;Example of an item that has been georeferenced, displaying an embedded version of the Allmaps viewer&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;I&amp;rsquo;ve also created a GitHub repository to save copies of the data. Every two hours &lt;a href=&#34;https://github.com/wragge/slv-allmaps/blob/main/harvest_allmaps_data.ipynb&#34;&gt;this notebook&lt;/a&gt; is run to query the Allmaps API for newly georeferenced maps. These are added to a dataset which is saved in three formats:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/slv-allmaps/blob/main/georeferenced_maps.csv&#34;&gt;a CSV file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/slv-allmaps/blob/main/georeferenced_maps_datasette.csv&#34;&gt;a CSV file&lt;/a&gt; that includes thumbnails and links for &lt;a href=&#34;https://glam-workbench.net/datasette-lite/?csv=https%3A%2F%2Fgithub.com%2Fwragge%2Fslv-allmaps%2Fblob%2Fmain%2Fgeoreferenced_maps_datasette.csv&amp;amp;install=datasette-homepage-table&amp;amp;install=datasette-json-html&amp;amp;fts=manifest_title%2Cmap_title&#34;&gt;viewing in Datasette-Lite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/slv-allmaps/blob/main/georeferenced_maps.geojson&#34;&gt;a GeoJSON file&lt;/a&gt;, that can be &lt;a href=&#34;https://geojson.io/#id=github:wragge/slv-allmaps/blob/main/georeferenced_maps.geojson&#34;&gt;viewed in services like geojson.io&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At the same time, the data for each individual map is downloaded and saved as &lt;a href=&#34;https://github.com/wragge/slv-allmaps/tree/main/maps&#34;&gt;IIIF annotations&lt;/a&gt; (in JSON) and &lt;a href=&#34;https://github.com/wragge/slv-allmaps/tree/main/geojson&#34;&gt;GeoJSON&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Finally, &lt;a href=&#34;https://github.com/wragge/slv-allmaps/blob/main/allmaps_dashboard.ipynb&#34;&gt;this notebook&lt;/a&gt; is run to generate &lt;a href=&#34;https://wragge.github.io/slv-allmaps/dashboard.html&#34;&gt;a dashboard&lt;/a&gt; that provides an overview of the project&amp;rsquo;s progress.&lt;/p&gt;
&lt;figure&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/geo-dashboard.png&#34; width=&#34;600&#34; height=&#34;616&#34; alt=&#34;&#34;&gt;&lt;figcaption&gt;The project dashboard is updated every two hours&lt;/figcaption&gt;&lt;/figure&gt;
&lt;p&gt;One of the Allmaps developers described all my plumbing and workarounds as a &amp;lsquo;very cool lofi example of how you can set this up with little means&amp;rsquo;, and I think that&amp;rsquo;s pretty apt. It&amp;rsquo;s really just an experiment to demonstrate the possibilities, but by connecting up existing services it&amp;rsquo;s generating real data of long-term value.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Me at 63...</title>
      <link>https://updates.timsherratt.org/2025/11/03/me-at.html</link>
      <pubDate>Mon, 03 Nov 2025 17:55:17 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/11/03/me-at.html</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published in Pharos, newsletter of the Professional Historian&amp;rsquo;s Association (Vic &amp;amp; Tas), October-November 2025, in the &amp;lsquo;Member Profile&amp;rsquo; section.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;what-was-your-first-history-related-job-what-path-have-you-taken-since-then&#34;&gt;What was your first history related job? What path have you taken since then?&lt;/h2&gt;
&lt;p&gt;In the early 1990s I started working for a small self-funded organisation called the Australian Science Archives Project. Our mission was to preserve and raise awareness of Australia&amp;rsquo;s scientific past. When the web came along, we realised it provided an enormous opportunity to communicate history to the public. So I taught myself web development and created the first archives website in Australia. Since then my work has continued to explore what happens when we release GLAM collections into online spaces where people can see and use them differently.&lt;/p&gt;
&lt;h2 id=&#34;what-kind-of-work-have-you-done-what-are-you-working-on-now&#34;&gt;What kind of work have you done? What are you working on now?&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve had a range of jobs in the GLAM and university sectors. While &amp;lsquo;history&amp;rsquo; wasn&amp;rsquo;t often in my job title, I&amp;rsquo;ve always regarded myself as a historian first – whether I was coding, editing, writing, teaching, or managing, history was always the frame through which I understood my work. At the same time, I&amp;rsquo;ve maintained my own independent practice as a &amp;lsquo;historian and hacker&amp;rsquo;, developing tools and resources for other researchers, such as the &lt;a href=&#34;https://glam-workbench.net&#34;&gt;GLAM Workbench&lt;/a&gt;. Much of this work is unfunded, but by sharing it openly I&amp;rsquo;ve created new opportunities for collaboration. For example, I&amp;rsquo;m currently the &amp;lsquo;Creative Technologist-in-Residence&amp;rsquo; at the State Library of Victoria, bringing my years of GLAM hacking to bear on the Library&amp;rsquo;s place-based collections.&lt;/p&gt;
&lt;h2 id=&#34;research-or-writing-what-do-you-enjoy-more-and-why&#34;&gt;Research or writing? (What do you enjoy more and why?)&lt;/h2&gt;
&lt;p&gt;Researching, or writing, or coding, or teaching, or outreaching (what is the correct verb?) – all have their joys and travails. For me, research is less about finding things in archives and libraries, and more about &lt;em&gt;how&lt;/em&gt; we find things in archives and libraries. I poke about in online collections to try and understand how they work, what they reveal, and what they hide. This often leads to the development of new tools, the writing of  documentation and blog posts, and sometimes even real, published articles. It&amp;rsquo;s a process that has consumed my life, for better or worse. Coding often slips into obsession when I have a gnarly problem to crack. Writing is a slog, but there&amp;rsquo;s nothing like the pleasure of a finely-turned sentence. Teaching is exhausting, but also exhilarating when you see the light bulb of understanding flick on.&lt;/p&gt;
&lt;h2 id=&#34;what-are-the-best-and-hardest-things-about-the-kind-of-work-you-do&#34;&gt;What are the best and hardest things about the kind of work you do?&lt;/h2&gt;
&lt;p&gt;The best thing, the absolute hands-down best thing, is hearing from people who use, or have benefited from the tools and resources that I&amp;rsquo;ve created. I make things to help researchers see and use GLAM collections in new ways, so finding out what they&amp;rsquo;ve been doing with my stuff always provides a much-needed jolt of inspiration.&lt;/p&gt;
&lt;p&gt;However, the flip side is that getting information about my tools and resources out to the people who might benefit most is hard and often frustrating work. I churn away in the social media mines, but people and organisations seem much more reluctant to share new work these days. There was a time (yeah, the good old days) when GLAM organisations actively engaged with researchers online, sharing the cool things people were doing with their collections. But not now. We all learn through the generosity of others, and I think it&amp;rsquo;s important that we find ways to support and enlarge the realm of generosity.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Creating bounding boxes for parish maps in the SLV collection</title>
      <link>https://updates.timsherratt.org/2025/10/06/creating-bounding-boxes-for-parish.html</link>
      <pubDate>Mon, 06 Oct 2025 15:17:51 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/10/06/creating-bounding-boxes-for-parish.html</guid>
      <description>&lt;p&gt;The State Library of Victoria holds a collection of &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/search?query=series,exact,Parish%20maps%20of%20Victoria&amp;amp;vid=61SLV_INST:SLV&amp;amp;offset=0&#34;&gt;8,804 parish maps&lt;/a&gt;. As part of my residency at the SLV LAB, I&amp;rsquo;ve been poking around in the metadata.&lt;/p&gt;
&lt;p&gt;SLV staff have geocoded many of the parish maps using the &lt;a href=&#34;https://placenames.fsdf.org.au&#34;&gt;Composite Gazetteer of Australia&lt;/a&gt;, which provides coordinates for Victorian parishes and boroughs. These coordinates give us a point which should be roughly at the centre of each map, enabling us to visualise their locations and distribution. But how much area do they cover? To answer that question we need a bounding box that includes the coordinates of each corner of the map. We could create bounding boxes by using something like &lt;a href=&#34;https://allmaps.org&#34;&gt;Allmaps&lt;/a&gt; or &lt;a href=&#34;https://www.mapwarper.net&#34;&gt;MapWarper&lt;/a&gt; to georeference each individual map, but that&amp;rsquo;s going to take a while! As a quick and dirty alternative, I wondered if it was possible to generate approximate bounding boxes from the available metadata. It seems we can!&lt;/p&gt;
&lt;h2 id=&#34;the-metadata&#34;&gt;The metadata&lt;/h2&gt;
&lt;p&gt;There are three pieces of metadata we need to construct bounding boxes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the latitude and longitude of the centre point&lt;/li&gt;
&lt;li&gt;the size of the physical map&lt;/li&gt;
&lt;li&gt;the scale of the map (ie how the size of the map relates to the real world)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The coordinates and scale can be included in a couple of different places in the map&amp;rsquo;s MARC record. The &lt;a href=&#34;https://www.loc.gov/marc/bibliographic/bd034.html&#34;&gt;&lt;code&gt;034&lt;/code&gt;&lt;/a&gt; field is specifically for &amp;lsquo;Coded Cartographic Mathematical Data&amp;rsquo;. The relevant subfields are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;$a&lt;/code&gt;: category of scale&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$b&lt;/code&gt;: constant ratio linear horizontal scale (this is the most likely type of scale)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$d&lt;/code&gt;: westernmost longitude&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$e&lt;/code&gt;: easternmost longitude&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$f&lt;/code&gt;: northernmost latitude&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$g&lt;/code&gt;: southernmost latitude&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If the coordinates describe a point rather than a bounding box, then &lt;code&gt;$d&lt;/code&gt; and &lt;code&gt;$e&lt;/code&gt; will be the same, and &lt;code&gt;$f&lt;/code&gt; and &lt;code&gt;$g&lt;/code&gt; will be the same.&lt;/p&gt;
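&lt;p&gt;That check is easy to express in code – a sketch, assuming the subfields have been parsed into a dict (the function name is mine):&lt;/p&gt;

```python
def describes_point(sub):
    """True if the 034 subfields record a single point rather than a bounding box.

    `sub` maps subfield codes to their raw values.
    """
    return sub.get("d") == sub.get("e") and sub.get("f") == sub.get("g")
```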
&lt;p&gt;String representations of coordinates and scale can be found in the &lt;code&gt;255&lt;/code&gt; field. The relevant subfields are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;$a&lt;/code&gt;: statement of scale, eg &lt;code&gt;Scale [ca. 1:90,000].&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;$c&lt;/code&gt;: statement of coordinates, eg &lt;code&gt;(E 142°18&#39;/S 37°33&#39;)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The size of the map is recorded in the &lt;a href=&#34;https://www.loc.gov/marc/bibliographic/bd300.html&#34;&gt;&lt;code&gt;300&lt;/code&gt;&lt;/a&gt; (physical description) field under the &lt;code&gt;$c&lt;/code&gt; (dimensions) subfield. For example: &lt;code&gt;on sheet 40 x 51 cm &lt;/code&gt;.&lt;/p&gt;
&lt;h2 id=&#34;the-method&#34;&gt;The method&lt;/h2&gt;
&lt;p&gt;I started with an existing dataset downloaded from the catalogue by SLV staff. This dataset included the scale and coordinate information in the &lt;code&gt;034&lt;/code&gt; field, and the coordinate string in &lt;code&gt;255$c&lt;/code&gt;. At first I didn&amp;rsquo;t realise that the &lt;code&gt;034&lt;/code&gt; held geo data, so I separately downloaded the scale information from &lt;code&gt;255$a&lt;/code&gt; in each item&amp;rsquo;s MARC record (d&amp;rsquo;oh). If the maps were digitised, I also wanted their image identifiers so I could access them through the SLV&amp;rsquo;s IIIF service. The image id from the &lt;code&gt;956$e&lt;/code&gt; field of the MARC record can be used to construct an IIIF manifest url, so I extracted them as well.&lt;/p&gt;
&lt;p&gt;Once I had all the catalogue data, I had to make sure everything was in a format I could work with. The coordinates in the MARC records are recorded as degrees/minutes/seconds, so I had to convert them to decimal values. The scale factor needed to be an integer, and I needed to extract the height and width as integers from the dimensions field.&lt;/p&gt;
&lt;p&gt;I used &lt;a href=&#34;https://pypi.org/project/lat-lon-parser/&#34;&gt;lat_lon_parser&lt;/a&gt; to convert the coordinates to decimal, but needed a bit of regex string manipulation to get the values into a format that could be parsed. Regex also came to the rescue in getting the map dimensions. All the details are &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/process_parish_maps.ipynb&#34;&gt;in this notebook&lt;/a&gt;.&lt;/p&gt;
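&lt;p&gt;As an illustration of the two cleanup steps (not the exact regexes or the lat_lon_parser-based code from the notebook), a simple parser might look like this:&lt;/p&gt;

```python
import re

def dms_to_decimal(coord):
    """Convert a coordinate string like "S 37°33'" to decimal degrees
    (an illustrative parser, not the one used in the notebook)."""
    m = re.match(r"([NSEW])\s*(\d+)°(?:(\d+)')?", coord)
    hemi, degrees, minutes = m.group(1), int(m.group(2)), int(m.group(3) or 0)
    value = degrees + minutes / 60
    # South and west are negative in decimal notation
    return -value if hemi in "SW" else value

def sheet_dimensions(text):
    """Extract height and width in cm from a 300$c value like "on sheet 40 x 51 cm"."""
    m = re.search(r"(\d+)\s*x\s*(\d+)\s*cm", text)
    return int(m.group(1)), int(m.group(2))
```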
&lt;h3 id=&#34;creating-bounding-boxes&#34;&gt;Creating bounding boxes&lt;/h3&gt;
&lt;p&gt;After some searching I found &lt;a href=&#34;https://stackoverflow.com/a/76910048&#34;&gt;this StackOverflow comment&lt;/a&gt; that described how to create a bounding box from a point, distance, and bearing. The point I already had, but the distance and bearing had to be calculated. Trigonometry to the rescue!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/parish-maps-box-trig.png&#34; width=&#34;600&#34; height=&#34;334&#34; alt=&#34;&#34;&gt;
&lt;p&gt;The distance from the point at the centre of the box to one of its corners is the hypotenuse of a right-angled triangle whose sides are equal to half the width and half the height of the map, and thanks to Pythagoras we know:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-10-06-15-19-42.png&#34; width=&#34;579&#34; height=&#34;93&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Once I had the distance in cm, I converted to inches, then multiplied by the scale factor, and finally converted the inches to miles. (It now occurs to me that there&amp;rsquo;s no need to convert to imperial measurements, but it doesn&amp;rsquo;t make any difference either way.)&lt;/p&gt;
&lt;p&gt;The bearing that points to the corner of the box is the angle inside the same right-angled triangle, so it can be calculated using:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-10-06-15-19-29.png&#34; width=&#34;407&#34; height=&#34;73&#34; alt=&#34;&#34;&gt;
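&lt;p&gt;Putting the two formulas together – a sketch of the calculation, not the notebook&amp;rsquo;s exact implementation:&lt;/p&gt;

```python
import math

def corner_distance_and_bearing(width_cm, height_cm, scale):
    """Distance (miles) and bearing (degrees from north) from the map's
    centre to its north-east corner, given sheet size and scale factor."""
    half_w, half_h = width_cm / 2, height_cm / 2
    # Pythagoras: hypotenuse from centre to corner, measured on the sheet
    sheet_cm = math.hypot(half_w, half_h)
    # Apply the scale factor, then convert to miles (63360 inches per mile)
    miles = (sheet_cm / 2.54) * scale / 63360
    # Angle between north (the half-height side) and the hypotenuse
    bearing = math.degrees(math.atan(half_w / half_h))
    return miles, bearing
```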
&lt;p&gt;With the point of origin, distance, and bearing I could use &lt;a href=&#34;https://github.com/geopy/geopy&#34;&gt;geopy&lt;/a&gt; to calculate the corners of the bounding box!&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;color:#f92672&#34;&gt;from&lt;/span&gt; geopy.distance &lt;span style=&#34;color:#f92672&#34;&gt;import&lt;/span&gt; geodesic

destination &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; geodesic(miles&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;distance)&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;destination(origin, bearing)
coords &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; destination&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;longitude, destination&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;latitude
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/process_parish_maps.ipynb&#34;&gt;See this notebook&lt;/a&gt; for the full details.&lt;/p&gt;
&lt;h2 id=&#34;limitations&#34;&gt;Limitations&lt;/h2&gt;
&lt;p&gt;Of course, this method is very rough and has a number of major limitations, in particular:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;only about 38% of the maps have point coordinates&lt;/li&gt;
&lt;li&gt;the point values don&amp;rsquo;t necessarily locate the centre of the map&lt;/li&gt;
&lt;li&gt;not all the maps are oriented towards north&lt;/li&gt;
&lt;li&gt;sometimes a parish includes multiple maps&lt;/li&gt;
&lt;li&gt;the size of the margin around the map will affect the accuracy of the bounding box&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But despite these problems the results seem pretty good. To test this I &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/parish_maps_browser.ipynb&#34;&gt;created a notebook&lt;/a&gt; to overlay the digitised maps on a modern basemap using the bounding boxes. Here&amp;rsquo;s an example.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/parish-maps-overlay.png&#34; width=&#34;600&#34; height=&#34;471&#34; alt=&#34;Screenshot of a parish map of French Island overlaid on a modern basemap. The parish map is slightly offset to the north, but you can see that the size matches the modern map fairly well&#34;&gt;
&lt;p&gt;You can see the map is slightly offset (presumably due to the second problem listed above). But the size seems about right. Certainly good enough to use the bounding boxes in some exploratory analyses!&lt;/p&gt;
&lt;h2 id=&#34;visualising-the-results&#34;&gt;Visualising the results&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve saved the processed data as a &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/parish_maps_final.csv&#34;&gt;new dataset&lt;/a&gt;, and started playing around with a couple of ways of visualising the results. These are experiments, not discovery interfaces. But you can use them for a bit of exploration if you don&amp;rsquo;t mind a few bugs. They&amp;rsquo;re all in Jupyter notebooks that can be run &lt;a href=&#34;https://mybinder.org/v2/gh/StateLibraryVictoria-SLVLAB/geo-maps-residency/HEAD&#34;&gt;using the Binder service&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/parish_maps_browser.ipynb&#34;&gt;parish maps browser&lt;/a&gt; includes a dropdown list of parish maps with point coordinates.  Select a map and:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;if there&amp;rsquo;s a bounding box and an image identifier, the image of the parish map will be overlaid on the modern base map using the bounding box coordinates&lt;/li&gt;
&lt;li&gt;if there&amp;rsquo;s a bounding box, but no image identifier, a rectangle will be drawn on the base map showing the dimensions of the bounding box&lt;/li&gt;
&lt;li&gt;if there are point coordinates, but no bounding box, a marker will be placed on the base map&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/parish-maps-browser.png&#34; width=&#34;600&#34; height=&#34;429&#34; alt=&#34;Screenshot of a parish map of Mallacoota overlaid on a modern basemap. The opacity of the digitised map has been reduced making it easier to see how the two maps align. A popup is visible on the map, listing the basic metadata and including a link to the SLV catalogue.&#34;&gt;
&lt;p&gt;If the image of the map is displayed you can use the slider to adjust the opacity. Clicking on either the image, rectangle, or marker will display metadata about the parish map and a link to the SLV catalogue.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s also a &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency/blob/main/parish_maps_visualise_bounds.ipynb&#34;&gt;visualisation of all the bounding boxes&lt;/a&gt; overlaid on a modern base map.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/parish-maps-bounds.png&#34; width=&#34;600&#34; height=&#34;474&#34; alt=&#34;Screenshot of a modern digital map of Victoria overlaid with 3,000+ transparent blue rectangles, showing the bounds of parish maps. A couple of the maps seem to be in Bass Strait.&#34;&gt;
&lt;p&gt;As you move your mouse over the bounding boxes the titles are displayed on the map, and if you click on a bounding box the metadata is displayed beneath the map, including a link to the SLV catalogue.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s obvious from the image above that some of the coordinates must be wrong! Visualisation is a great way of finding problems with your data. I now need to work through the results, documenting the problems, and thinking about how to make best use of the data. More to come!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Exploring SLV urls</title>
      <link>https://updates.timsherratt.org/2025/09/23/exploring-slv-urls.html</link>
      <pubDate>Tue, 23 Sep 2025 17:22:45 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/09/23/exploring-slv-urls.html</guid>
      <description>&lt;p&gt;I like urls. They take you places. And if you know how to read them, they can tell you things about the systems that created them.
One of the first things I did when I started &lt;a href=&#34;https://updates.timsherratt.org/2025/09/22/creative-technologistinresidence-at-the-state.html&#34;&gt;my residency at SLV LAB&lt;/a&gt; was to try and understand how their collection urls work. There are a couple of well-worn methods I use when digging into a new site.&lt;/p&gt;
&lt;p&gt;The first is url hacking – this involves fiddling around with the parameters in a url and submitting the result to see what happens. The Trove Data Guide includes &lt;a href=&#34;https://tdg.glam-workbench.net/understanding-search/search-hacks.html&#34;&gt;some examples of hacking Trove urls&lt;/a&gt; to change the delivery of search results.&lt;/p&gt;
&lt;p&gt;The second method involves opening up the developer console in your web browser and watching the activity in the network tab as you click on links. This tells you where the information that gets loaded into your browser actually comes from – sometimes exposing handy urls that you can use to shortcut access to useful data.&lt;/p&gt;
&lt;h2 id=&#34;permalinks&#34;&gt;Permalinks&lt;/h2&gt;
&lt;p&gt;The SLV uses Primo for its public-facing catalogue, as well as other systems such as Rosetta and IIIF to deliver digitised content. I&amp;rsquo;d noticed that &lt;a href=&#34;https://www.zotero.org&#34;&gt;Zotero&lt;/a&gt; gets some useful data from the catalogue using the default &amp;lsquo;Primo 2018&amp;rsquo; translator; however, important things like the item url aren&amp;rsquo;t captured. The problem is that Primo&amp;rsquo;s &amp;lsquo;permalinks&amp;rsquo; are generated as required by a browser click – they&amp;rsquo;re not embedded anywhere on the page. This makes it hard for Zotero to grab them. So I started wondering how Zotero could construct short, persistent(ish) links to items.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a link to an item in Primo: &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/fulldisplay?vid=61SLV_INST:SLV&amp;amp;search_scope=slv_local&amp;amp;tab=searchProfile&amp;amp;context=L&amp;amp;docid=alma9941325055707636&#34;&gt;https://find.slv.vic.gov.au/discovery/fulldisplay?vid=61SLV_INST:SLV&amp;amp;search_scope=slv_local&amp;amp;tab=searchProfile&amp;amp;context=L&amp;amp;docid=alma9941325055707636&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;It looks pretty long and messy, but if you start deleting parameters and resubmitting, you&amp;rsquo;ll find that only two parameters are essential, &lt;code&gt;vid&lt;/code&gt; and &lt;code&gt;docid&lt;/code&gt;. This means we can rewrite the url as: &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/fulldisplay?vid=61SLV_INST:SLV&amp;amp;docid=alma9941325055707636&#34;&gt;https://find.slv.vic.gov.au/discovery/fulldisplay?vid=61SLV_INST:SLV&amp;amp;docid=alma9941325055707636&lt;/a&gt; Much nicer.&lt;/p&gt;
&lt;p&gt;The &amp;lsquo;permalink&amp;rsquo; for the same item is: &lt;a href=&#34;https://find.slv.vic.gov.au/permalink/61SLV_INST/1sev8ar/alma9941325055707636&#34;&gt;https://find.slv.vic.gov.au/permalink/61SLV_INST/1sev8ar/alma9941325055707636&lt;/a&gt; If you look closely at the url path and compare it to the example above you&amp;rsquo;ll see the path is constructed from &lt;code&gt;/vid/[some other id]/docid&lt;/code&gt;. One of the librarians explained to me that the other identifier in the permalink is an encoding of the view type, but given that the &amp;lsquo;fulldisplay&amp;rsquo; view is the default, we don&amp;rsquo;t really need it. So the shortened url seems fine for use in Zotero and is easy to generate from the current url. Nice.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s also worth noting that the &lt;code&gt;vid&lt;/code&gt; value doesn&amp;rsquo;t seem to change, so to construct catalogue urls in your code, all you really need is the ALMA identifier that&amp;rsquo;s in the &lt;code&gt;docid&lt;/code&gt; parameter.&lt;/p&gt;
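&lt;p&gt;So a shortened catalogue url can be built from the Alma identifier alone – a minimal sketch (the function name is mine):&lt;/p&gt;

```python
from urllib.parse import urlencode

def slv_catalogue_url(alma_id):
    """Build a shortened Primo item url from an Alma identifier."""
    params = urlencode({"vid": "61SLV_INST:SLV", "docid": alma_id})
    return "https://find.slv.vic.gov.au/discovery/fulldisplay?" + params
```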
&lt;h2 id=&#34;structured-data&#34;&gt;Structured data&lt;/h2&gt;
&lt;p&gt;Item pages in Primo include a link labelled &amp;lsquo;Display source record&amp;rsquo;. If you click on this you&amp;rsquo;re taken to a representation of the item&amp;rsquo;s metadata in MARC. Here&amp;rsquo;s what the urls look like: &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/sourceRecord?vid=61SLV_INST%3ASLV&amp;amp;docId=alma9941325055707636&amp;amp;recordOwner=61SLV_INST&#34;&gt;https://find.slv.vic.gov.au/discovery/sourceRecord?vid=61SLV_INST%3ASLV&amp;amp;docId=alma9941325055707636&amp;amp;recordOwner=61SLV_INST&lt;/a&gt; Notice that the &amp;lsquo;fulldisplay&amp;rsquo; in the url path above has changed to &amp;lsquo;sourceRecord&amp;rsquo;. There&amp;rsquo;s also a new &lt;code&gt;recordOwner&lt;/code&gt; parameter, but it seems you can delete this and still get the same result.&lt;/p&gt;
&lt;p&gt;Having access to the MARC record is handy, because it delivers the metadata in a simple, structured plain text format. But while the &amp;lsquo;source record&amp;rsquo; page looks like a plain text file, it&amp;rsquo;s actually an HTML page that embeds a plain text record. If you open up the network tab of your browser&amp;rsquo;s developer console and reload the &amp;lsquo;source record&amp;rsquo; page, you&amp;rsquo;ll see a different url is loaded under the hood: &lt;a href=&#34;https://find.slv.vic.gov.au/primaws/rest/pub/sourceRecord?docId=alma9941325055707636&amp;amp;vid=61SLV_INST:SLV&amp;amp;recordOwner=61SLV_INST&amp;amp;lang=en&#34;&gt;https://find.slv.vic.gov.au/primaws/rest/pub/sourceRecord?docId=alma9941325055707636&amp;amp;vid=61SLV_INST:SLV&amp;amp;recordOwner=61SLV_INST&amp;amp;lang=en&lt;/a&gt; See how the url path has changed from &lt;code&gt;/discovery/&lt;/code&gt; to &lt;code&gt;/primaws/rest/pub&lt;/code&gt;? This url &lt;em&gt;does&lt;/em&gt; deliver a plain text version of the MARC record.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-09-23-15-21-05.png&#34; width=&#34;600&#34; height=&#34;215&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Once you have the plain text version you can parse the contents to extract the structured data. There are probably tools that can do this automatically, but it&amp;rsquo;s also pretty easy using regular expressions. Here&amp;rsquo;s an example of some code I used to parse map records.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;get_marc_value&lt;/span&gt;(marc, tag, subfield):
    &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;#34;&amp;#34;
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;    Gets the value of a tag/subfield from a text version of an item&amp;#39;s MARC record.
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;    &amp;#34;&amp;#34;&amp;#34;&lt;/span&gt;
    &lt;span style=&#34;color:#66d9ef&#34;&gt;try&lt;/span&gt;:
        tag &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; re&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;search(&lt;span style=&#34;color:#e6db74&#34;&gt;rf&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;^&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;tag&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;\t.+&amp;#34;&lt;/span&gt;, marc, re&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;M)&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;group(&lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;)
        subfield &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; re&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;search(&lt;span style=&#34;color:#e6db74&#34;&gt;rf&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;\$&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;{&lt;/span&gt;subfield&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;([^\$]+)&amp;#34;&lt;/span&gt;, tag)&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;group(&lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;)
    &lt;span style=&#34;color:#66d9ef&#34;&gt;except&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;AttributeError&lt;/span&gt;:
        &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;None&lt;/span&gt;
    &lt;span style=&#34;color:#66d9ef&#34;&gt;return&lt;/span&gt; subfield&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;strip(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34; .,&amp;#34;&lt;/span&gt;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can also access a JSON representation of the record by adding the parameter &lt;code&gt;&amp;amp;showPnx=true&lt;/code&gt; to the catalogue url: &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/fulldisplay?vid=61SLV_INST:SLV&amp;amp;search_scope=slv_local&amp;amp;tab=searchProfile&amp;amp;context=L&amp;amp;docid=alma9941325055707636&amp;amp;showPnx=true&#34;&gt;https://find.slv.vic.gov.au/discovery/fulldisplay?vid=61SLV_INST:SLV&amp;amp;search_scope=slv_local&amp;amp;tab=searchProfile&amp;amp;context=L&amp;amp;docid=alma9941325055707636&amp;amp;showPnx=true&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Once again, this is a JSON representation embedded in a web page. Using the same developer console trick, you can see that the direct url is: &lt;a href=&#34;https://find.slv.vic.gov.au/primaws/rest/pub/pnxs/L/alma9941325055707636?vid=61SLV_INST:SLV&amp;amp;lang=en&amp;amp;search_scope=slv_local&amp;amp;showPnx=true&amp;amp;lang=en&#34;&gt;https://find.slv.vic.gov.au/primaws/rest/pub/pnxs/L/alma9941325055707636?vid=61SLV_INST:SLV&amp;amp;lang=en&amp;amp;search_scope=slv_local&amp;amp;showPnx=true&amp;amp;lang=en&lt;/a&gt; You should be able to parse the response from this url as JSON and use it in your code. I think the Zotero translator makes use of this &lt;code&gt;pnx&lt;/code&gt; data.&lt;/p&gt;
&lt;p&gt;If you want to download the MARC or JSON representations in your code, all you really need is the &lt;code&gt;alma&lt;/code&gt; identifier. Just use it to construct one of the direct urls, such as this: &lt;a href=&#34;https://find.slv.vic.gov.au/primaws/rest/pub/sourceRecord?docId=alma9941325055707636&amp;amp;vid=61SLV_INST:SLV&#34;&gt;https://find.slv.vic.gov.au/primaws/rest/pub/sourceRecord?docId=alma9941325055707636&amp;amp;vid=61SLV_INST:SLV&lt;/a&gt; The &lt;code&gt;recordOwner&lt;/code&gt; and &lt;code&gt;lang&lt;/code&gt; parameters are not needed, and the &lt;code&gt;vid&lt;/code&gt; parameter doesn&amp;rsquo;t change.&lt;/p&gt;
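&lt;p&gt;A couple of Python helpers that assemble these direct urls (a sketch only; the parameter trimming follows the observations above, so check the results against a live record):&lt;/p&gt;

```python
# Base path for the 'under the hood' REST urls identified above.
PRIMAWS = "https://find.slv.vic.gov.au/primaws/rest/pub"

def marc_url(alma_id: str) -> str:
    """Direct url for the plain text MARC record."""
    return f"{PRIMAWS}/sourceRecord?docId={alma_id}&vid=61SLV_INST:SLV"

def pnx_url(alma_id: str) -> str:
    """Direct url for the JSON (pnx) representation."""
    return (
        f"{PRIMAWS}/pnxs/L/{alma_id}"
        "?vid=61SLV_INST:SLV&search_scope=slv_local&showPnx=true"
    )
```

The urls can then be fetched with any HTTP client and, in the JSON case, parsed with `json.loads()`.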
&lt;p&gt;Librarians using Primo have documented a number of tricks like this and &lt;a href=&#34;https://igelu.org/products-and-initiatives/product-working-groups/primo/special-projects/primo-community-support-primo-useful-bookmarklets/&#34;&gt;shared handy bookmarklets&lt;/a&gt; to rewrite urls and get catalogue data in different forms.&lt;/p&gt;
&lt;h2 id=&#34;iiif-and-images&#34;&gt;IIIF and images&lt;/h2&gt;
&lt;p&gt;SLV delivers digitised images using &lt;a href=&#34;https://iiif.io&#34;&gt;IIIF&lt;/a&gt;. The IIIF manifest urls are not directly exposed through the web interface, but you can construct your own.&lt;/p&gt;
&lt;p&gt;IIIF manifest urls look like this: &lt;a href=&#34;https://rosetta.slv.vic.gov.au/delivery/iiif/presentation/2.1/IE24074939/manifest.json&#34;&gt;https://rosetta.slv.vic.gov.au/delivery/iiif/presentation/2.1/IE24074939/manifest.json&lt;/a&gt; All we need to construct them is the &lt;code&gt;IE&lt;/code&gt; identifier, in this case &lt;code&gt;IE24074939&lt;/code&gt;. But where do you find this identifier?&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re looking at an image in the SLV&amp;rsquo;s image viewer, the url will be something like this: &lt;a href=&#34;https://viewer.slv.vic.gov.au/?entity=IE24074939&amp;amp;mode=browse&#34;&gt;https://viewer.slv.vic.gov.au/?entity=IE24074939&amp;amp;mode=browse&lt;/a&gt; Yep, the &lt;code&gt;IE&lt;/code&gt; identifier is right there in the url. Just extract it from the viewer url, and plug it into the manifest url!&lt;/p&gt;
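&lt;p&gt;In Python that extraction is just a few lines:&lt;/p&gt;

```python
from urllib.parse import parse_qs, urlparse

def manifest_url(viewer_url: str) -> str:
    """Extract the IE identifier from a viewer url and build the
    corresponding IIIF manifest url."""
    ie_id = parse_qs(urlparse(viewer_url).query)["entity"][0]
    return (
        "https://rosetta.slv.vic.gov.au/delivery/iiif/"
        f"presentation/2.1/{ie_id}/manifest.json"
    )
```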
&lt;p&gt;If you&amp;rsquo;re looking at a catalogue record, or starting with one of the &lt;code&gt;alma&lt;/code&gt; identifiers, you can get the &lt;code&gt;IE&lt;/code&gt; identifier from the &lt;code&gt;956$e&lt;/code&gt; field of the MARC record.&lt;/p&gt;
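&lt;p&gt;Here&amp;rsquo;s a sketch that combines this with the &lt;code&gt;get_marc_value()&lt;/code&gt; function above (repeated so the example is self-contained, and run against an invented sample line rather than a real record):&lt;/p&gt;

```python
import re

# Repeats the MARC parser from above so this example is self-contained.
def get_marc_value(marc, tag, subfield):
    try:
        line = re.search(rf"^{tag}\t.+", marc, re.M).group(0)
        value = re.search(rf"\${subfield}([^\$]+)", line).group(1)
    except AttributeError:
        return None
    return value.strip(" .,")

# An invented sample 956 field, for illustration only; a real record
# will contain many more fields.
marc = "956\t##\t$a10381/4338980 $eIE24074939"

ie_id = get_marc_value(marc, "956", "e")  # "IE24074939"
manifest = (
    "https://rosetta.slv.vic.gov.au/delivery/iiif/"
    f"presentation/2.1/{ie_id}/manifest.json"
)
```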
&lt;p&gt;The IIIF manifest will, in turn, provide identifiers for individual images that can be requested using the standard IIIF syntax.&lt;/p&gt;
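&lt;p&gt;For example, here&amp;rsquo;s a sketch that walks the standard Presentation 2.1 structure (sequences, canvases, images) to collect the image service ids. Check an actual SLV manifest to confirm the layout:&lt;/p&gt;

```python
def image_service_ids(manifest: dict) -> list:
    """Collect image service ids from a IIIF Presentation 2.1 manifest
    (already parsed from JSON into a dict)."""
    ids = []
    for canvas in (manifest.get("sequences") or [{}])[0].get("canvases", []):
        for image in canvas.get("images", []):
            service = image.get("resource", {}).get("service", {})
            if "@id" in service:
                ids.append(service["@id"])
    return ids
```

Each id can then be plugged into the IIIF Image API syntax, something like `{id}/full/full/0/default.jpg`.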
&lt;p&gt;To save myself a bit of fiddling about, I created &lt;a href=&#34;https://gist.github.com/wragge/a37a4db854deffad956abc7bf918f6b0&#34;&gt;a userscript that exposes the IIIF manifest url&lt;/a&gt; within the image viewer. If you install it you&amp;rsquo;ll see something like this:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-09-23-15-54-09.png&#34; width=&#34;510&#34; height=&#34;229&#34; alt=&#34;&#34;&gt;
&lt;h2 id=&#34;handles&#34;&gt;Handles&lt;/h2&gt;
&lt;p&gt;Links to digitised items sometimes come in the form of &amp;lsquo;handles&amp;rsquo;: &lt;a href=&#34;http://handle.slv.vic.gov.au/10381/4338980&#34;&gt;http://handle.slv.vic.gov.au/10381/4338980&lt;/a&gt; These urls are redirected to the image viewer.&lt;/p&gt;
&lt;p&gt;If you want to construct one of these handles, the identifier can be found in the &lt;code&gt;956$a&lt;/code&gt; field of the MARC record.&lt;/p&gt;
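&lt;p&gt;Constructing the handle url is then trivial (assuming the &lt;code&gt;956$a&lt;/code&gt; value is a bare identifier like &lt;code&gt;10381/4338980&lt;/code&gt;; if the field holds a full url you can use it directly):&lt;/p&gt;

```python
def handle_url(identifier: str) -> str:
    """Build a handle url from the identifier in the 956$a field."""
    return f"http://handle.slv.vic.gov.au/{identifier}"
```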
&lt;h2 id=&#34;from-old-to-new&#34;&gt;From old to new&lt;/h2&gt;
&lt;p&gt;I was looking at the datasets created about 8 years ago in the &lt;a href=&#34;https://github.com/statelibraryvic/opendata&#34;&gt;SLV open data repository&lt;/a&gt; and noticed they included urls from the previous catalogue. Fortunately, the old urls redirect to the new system.&lt;/p&gt;
&lt;p&gt;For example, this url: &lt;a href=&#34;http://search.slv.vic.gov.au/MAIN:Everything:SLV_VOYAGER1842440&#34;&gt;http://search.slv.vic.gov.au/MAIN:Everything:SLV_VOYAGER1842440&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Redirects to: &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/fulldisplay?context=L&amp;amp;vid=61SLV_INST:SLV&amp;amp;search_scope=slv_local&amp;amp;tab=searchProfile&amp;amp;docid=alma9918424403607636&#34;&gt;https://find.slv.vic.gov.au/discovery/fulldisplay?context=L&amp;amp;vid=61SLV_INST:SLV&amp;amp;search_scope=slv_local&amp;amp;tab=searchProfile&amp;amp;docid=alma9918424403607636&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;If you look closely at the urls you&amp;rsquo;ll see that the identifier from the old system is embedded in the new identifier: &lt;code&gt;1842440&lt;/code&gt; is in &lt;code&gt;9918424403607636&lt;/code&gt; – &lt;code&gt;99_1842440_3607636&lt;/code&gt;. This means if you have a lot of old urls, such as in the open datasets, you can easily rewrite them in your code.&lt;/p&gt;
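&lt;p&gt;Here&amp;rsquo;s a sketch of that rewriting. It assumes the &lt;code&gt;99&lt;/code&gt;/&lt;code&gt;3607636&lt;/code&gt; wrapper observed above holds for every &lt;code&gt;SLV_VOYAGER&lt;/code&gt; identifier, which is worth spot-checking first:&lt;/p&gt;

```python
import re

def rewrite_old_url(old_url):
    """Rewrite an old search.slv.vic.gov.au url as a new catalogue url.

    Assumes every new identifier wraps the old Voyager identifier in
    '99' and '3607636', as in the example above.
    """
    match = re.search(r"SLV_VOYAGER(\d+)", old_url)
    if not match:
        return None
    return (
        "https://find.slv.vic.gov.au/discovery/fulldisplay"
        "?context=L&vid=61SLV_INST:SLV&search_scope=slv_local"
        f"&tab=searchProfile&docid=alma99{match.group(1)}3607636"
    )
```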
&lt;h2 id=&#34;the-process-of-glam-hacking&#34;&gt;The process of GLAM hacking&lt;/h2&gt;
&lt;p&gt;No doubt a lot of this is well-known to librarians, and there are probably many subtleties or complexities that my poking about has missed. But I wanted to document the process as much as the results – to give an idea of what I do when I approach a new GLAM collection online. I suppose this is GLAM hacking 101.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Creative Technologist-in-Residence at the State Library of Victoria!</title>
      <link>https://updates.timsherratt.org/2025/09/22/creative-technologistinresidence-at-the-state.html</link>
      <pubDate>Tue, 23 Sep 2025 00:14:09 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/09/22/creative-technologistinresidence-at-the-state.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;m very excited to be the new &lt;a href=&#34;https://lab.slv.vic.gov.au/residencies-opportunities&#34;&gt;Creative Technologist-in-Residence at the SLV LAB&lt;/a&gt;. For the next few months I get to play around with metadata and images, think about online access, experiment with different technologies, and build things to help people to explore the State Library&amp;rsquo;s collections. In other words, I get to be in my happy place!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/2025-09-22-11.36.59.jpg&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;My group at &lt;a href=&#34;https://updates.timsherratt.org/2025/08/29/wikifest-at-the-state-library.html&#34;&gt;the recent SLV WikiFest&lt;/a&gt; was thinking about ways of helping researchers find resources relating to particular locations – how do I find material about my suburb, or my street? Coincidentally, the main focus of my residency will also be place-based collections, so I get to really think through some of the possibilities. SLV staff have already pointed me to some amazing maps and photographs, such as the &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/collectionDiscovery?vid=61SLV_INST:SLV&amp;amp;collectionId=81271917420007636&#34;&gt;Committee for Urban Action collection&lt;/a&gt;, the &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/search?query=any,contains,mahlstedt%20melbourne&amp;amp;tab=searchProfile&amp;amp;search_scope=slv_local&amp;amp;vid=61SLV_INST:SLV&amp;amp;facet=tlevel,include,online_resources&amp;amp;offset=0&#34;&gt;Mahlstedt fire survey maps&lt;/a&gt;, the &lt;a href=&#34;https://guides.slv.vic.gov.au/MMBWplans&#34;&gt;MMBW plans&lt;/a&gt;, and the &lt;a href=&#34;https://find.slv.vic.gov.au/discovery/search?query=series,exact,Parish%20maps%20of%20Victoria&amp;amp;vid=61SLV_INST:SLV&amp;amp;offset=0&#34;&gt;Victorian parish maps&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At the same time, I&amp;rsquo;ll be using my usual GLAM hacking approach to poke around in the SLV website to try and understand what data is currently available, identify any roadblocks, and document opportunities for computational research.&lt;/p&gt;
&lt;p&gt;The results of my residency will be shared on the &lt;a href=&#34;https://lab.slv.vic.gov.au&#34;&gt;SLV LAB site&lt;/a&gt;, in &lt;a href=&#34;https://github.com/StateLibraryVictoria-SLVLAB/geo-maps-residency&#34;&gt;GitHub&lt;/a&gt;, in the &lt;a href=&#34;https://glam-workbench.net/state-library-victoria/&#34;&gt;SLV section of the GLAM Workbench&lt;/a&gt;, and of course here. As usual, I&amp;rsquo;ll be working in the open, documenting things as I go along, so please join me on the journey!&lt;/p&gt;
&lt;p&gt;Although the residency was formally announced today, I&amp;rsquo;ve actually been working with SLV data for the last couple of weeks and I&amp;rsquo;ve already got a backlog of stuff I need to blog about. Here&amp;rsquo;s a taster – what happens when you generate bounding boxes for thousands of parish maps from the available metadata and throw them on a map…?&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-09-22-23-08-25.png&#34; width=&#34;600&#34; height=&#34;406&#34; alt=&#34;&#34;&gt;
</description>
    </item>
    
    <item>
      <title>WikiFest at the State Library of Victoria</title>
      <link>https://updates.timsherratt.org/2025/08/29/wikifest-at-the-state-library.html</link>
      <pubDate>Fri, 29 Aug 2025 16:07:26 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/08/29/wikifest-at-the-state-library.html</guid>
      <description>&lt;p&gt;This week I was lucky enough to participate in WikiFest at the State Library of Victoria. Organised by the &lt;a href=&#34;https://lab.slv.vic.gov.au&#34;&gt;State Library&amp;rsquo;s new innovation LAB&lt;/a&gt; and &lt;a href=&#34;https://wikimedia.org.au&#34;&gt;Wikimedia Australia&lt;/a&gt;, Wikifest was a hands-on, participant-led workshop focused on the possibilities of connecting SLV&amp;rsquo;s collections to (and through!) Wikidata.&lt;/p&gt;
&lt;p&gt;The day kicked off with a series of presentations demonstrating possible uses of Wikidata. I talked a bit about some of my recent GLAM/Wikidata experiments. My &lt;a href=&#34;https://slides.com/wragge/wikifest-slv-2025&#34;&gt;slides are online&lt;/a&gt; and contain plenty of links to code, demonstrations, and documentation. They&amp;rsquo;re openly-licensed, so feel free to take anything of use.&lt;/p&gt;
&lt;iframe src=&#34;https://slides.com/wragge/wikifest-slv-2025/embed&#34; width=&#34;100%&#34; height=&#34;500&#34; title=&#34;WikiFest SLV 2025&#34; scrolling=&#34;no&#34; frameborder=&#34;0&#34; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;The rest of the day was spent in groups, working on particular projects and learning more about Wikidata in the process. My group was looking at providing place-based entry points to SLV collections, and spent a lot of time exploring the representation of Victoria&amp;rsquo;s &lt;a href=&#34;https://query.wikidata.org/embed.html#%23Country%20populations%20together%20with%20total%20city%20populations%0ASELECT%20%3Flga%20%3FlgaLabel%20%3FstartDate%20%3FendDate%20%3Fpoint%20%7B%0A%20%20%3Flga%20wdt%3AP31%20wd%3AQ30129411%20%3B%0A%20%20%20%20%20%20%20wdt%3AP131%20wd%3AQ36687.%0A%20%20%3Flga%20p%3AP625%20%3Fcoordinate.%0A%20%20%3Fcoordinate%20ps%3AP625%20%3Fpoint.%0A%20%20OPTIONAL%20%7B%3Flga%20wdt%3AP571%20%3FstartDate.%7D%0A%20%20OPTIONAL%20%7B%3Flga%20wdt%3AP576%20%3FendDate.%7D%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cmul%2Cen%22%20%7D%0A%7D&#34;&gt;Local Government Areas (LGAs) in Wikidata&lt;/a&gt;. We realised there was quite a bit of work to do in adding things like dates and boundaries, but we could see some exciting future possibilities. We also made a start, adding an &amp;lsquo;inception&amp;rsquo; date for the &lt;a href=&#34;https://www.wikidata.org/wiki/Q5123821&#34;&gt;City of Moe&lt;/a&gt;, based on the Victorian Government Gazette, &lt;a href=&#34;https://gazette.slv.vic.gov.au&#34;&gt;digitised by the SLV&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-08-29-14-45-09.png&#34; width=&#34;600&#34; height=&#34;271&#34; alt=&#34;Screen capture from Wikidata showing the inception property for the City of Moe&#34;&gt;
&lt;h2 id=&#34;bonus-userscript&#34;&gt;Bonus userscript&lt;/h2&gt;
&lt;p&gt;While I was preparing my presentation I was thinking about the way entries for Australian people in Wikidata are linked to a range of different identifiers, such as DAAO, the Encyclopedia of Australian Science, and the Australian Dictionary of Biography (ADB). Often a single person can have multiple identifiers and this means that those identifiers themselves become connected through that person&amp;rsquo;s record. You can query Wikidata with one identifier, and get back links to a range of other information sources about that person.&lt;/p&gt;
&lt;p&gt;To demonstrate this, I created &lt;a href=&#34;https://gist.github.com/wragge/40f66af72c400b2563f95bda60e713dd&#34;&gt;a simple userscript&lt;/a&gt; that adds additional links to biographies in the ADB. The script grabs the ADB identifier from the url, queries Wikidata for additional identifiers, and writes the results into the page&amp;rsquo;s &amp;lsquo;Life Summary&amp;rsquo;. Basic, but useful!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/adb-userscript-example.png&#34; width=&#34;600&#34; height=&#34;253&#34; alt=&#34;Screenshot from the ADB showing the related links from Wikidata added to the Life Summary of Margaret Baskerville&#34;&gt;
&lt;p&gt;For something more advanced, have a look at the &lt;a href=&#34;https://addons.mozilla.org/en-US/firefox/addon/entity-explosion/&#34;&gt;Entity Explosion extension&lt;/a&gt; for Firefox.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>GLAM hacking with userscripts</title>
      <link>https://updates.timsherratt.org/2025/07/17/glam-hacking-with-userscripts.html</link>
      <pubDate>Thu, 17 Jul 2025 18:21:25 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/07/17/glam-hacking-with-userscripts.html</guid>
      <description>&lt;p&gt;In teaching and workshops I used to get students to question the idea that websites are &amp;lsquo;published&amp;rsquo;. They&amp;rsquo;re not released into the world in a fixed, immutable form – they&amp;rsquo;re a set of blueprints which only reach their final form in your browser window. This makes it possible to change the way websites look and behave.&lt;/p&gt;
&lt;p&gt;Mozilla used to have a nifty educational tool called X-Ray Goggles. Using it, you could explore the code underlying a web page and do fun things like inserting new text or images. I encouraged students to try hacking ASIO&amp;rsquo;s home page.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/asio-eggplants.jpg&#34; width=&#34;600&#34; height=&#34;629&#34; alt=&#34;Old, modified screenshot of ASIO homepage with a section of a cartoon from First Dog On the Moon inserted.&#34;&gt;
&lt;p&gt;&lt;em&gt;ASIO home page with some added First Dog on the Moon.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;There are other ways you can fiddle with websites. For example, most browsers have a developer console that exposes the code and styling of a page. You can use the console to edit HTML elements or toggle styles, but your changes won&amp;rsquo;t be saved.&lt;/p&gt;
&lt;p&gt;One way you can save and share your web site customisations is by creating userscripts. Userscripts are little bits of Javascript code that run in your browser after a web page loads. These scripts can change many aspects of a page – not just how it looks, but also how it works.&lt;/p&gt;
&lt;h2 id=&#34;some-old-userscripts&#34;&gt;Some old userscripts&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve been playing around with userscripts for a long time. Back in 2008, I created a userscript that &lt;a href=&#34;https://discontents.com.au/shoebox/archives-shoebox/archives-in-3d.html&#34;&gt;completely overhauled the way that digital files were presented&lt;/a&gt; in the National Archives of Australia&amp;rsquo;s online database, RecordSearch. My userscript added new options for navigating and printing the file, and even made it possible to view the complete file contents on a 3D zoomable wall.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/userscript-screenshot1.jpg&#34; width=&#34;600&#34; height=&#34;577&#34; alt=&#34;Screenshot of a digitised file in RecordSearch showing the features added by the userscript.&#34;&gt;
&lt;p&gt;&lt;em&gt;This customised RecordSearch interface was created by a userscript.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The userscripts I&amp;rsquo;ve created over the years have tended to either be useful little hacks aimed at fixing annoying aspects of GLAM websites, or experiments in thinking about the sort of information that&amp;rsquo;s presented online by GLAM organisations, and how it might be different.&lt;/p&gt;
&lt;p&gt;In the first category are hacks like my &lt;a href=&#34;https://gist.github.com/wragge/b2af9dc56f7cb0a9476b&#34;&gt;RecordSearch show pages userscript&lt;/a&gt;. In 2009, I got annoyed that there was no way of knowing how many pages were in a digitised file until you clicked on the link. &lt;a href=&#34;https://discontents.com.au/doing-it-yourself/index.html&#34;&gt;So I fixed it.&lt;/a&gt; With my userscript running, the links to digitised files are rewritten to display the number of pages. I&amp;rsquo;ve updated the code numerous times over the years, adding new features, and dealing with changes to RecordSearch. The last update was just a few days ago.&lt;/p&gt;
&lt;p&gt;In the second category is my userscript that inserts photos from &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/&#34;&gt;The Real Face of White Australia&lt;/a&gt; into RecordSearch. There are many thousands of records in the National Archives of Australia that document the impact of the White Australia Policy on the lives of ordinary people. But it&amp;rsquo;s often hard to understand this from the file descriptions. The userscript displays portrait images extracted from the files alongside the metadata – it tells you there are &lt;a href=&#34;https://doi.org/10.5281/zenodo.3579530&#34;&gt;people inside&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/people-inside-list.gif&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;Animated gif showing how the userscript changes the display of a list of files in RecordSearch by adding pictures of people.&#34;&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/people-inside-item.gif&#34; width=&#34;600&#34; height=&#34;412&#34; alt=&#34;Animated gif showing how the userscript changes the display of an individual files in RecordSearch by adding pictures of the people inside.&#34;&gt;
&lt;p&gt;Amidst &lt;a href=&#34;https://updates.timsherratt.org/2025/07/09/the-rebirth-of-wragge-labs.html&#34;&gt;my recent self-archiving binge&lt;/a&gt;, I realised I&amp;rsquo;d never updated this userscript to work with the latest data from The Real Face of White Australia, so I spent some time getting it working again. In the process I realised that RecordSearch now included content security policies that made it a bit harder to insert new images. The solution was to use one of the special userscript functions, &lt;a href=&#34;https://www.tampermonkey.net/documentation.php?locale=en#api:GM_addElement&#34;&gt;GM_addElement()&lt;/a&gt;, rather than plain old Javascript. But then I discovered that if the show pages userscript ran after this one, it would trigger the security restrictions nonetheless! To avoid this I made sure that the two userscripts operated on separate elements. So now the &lt;a href=&#34;https://gist.github.com/wragge/2941e473ee70152f4de7&#34;&gt;show people userscript&lt;/a&gt; is working again!&lt;/p&gt;
&lt;h2 id=&#34;and-a-new-userscript-to-improve-trove-lists&#34;&gt;And a new userscript to improve Trove lists&lt;/h2&gt;
&lt;p&gt;Fixing up the &amp;lsquo;people inside&amp;rsquo; code reminded me of how much fun it was playing around with userscripts, so when David Coombe mentioned a problem he had using Trove lists on Mastodon last night, I had to have a go at fixing it.&lt;/p&gt;
&lt;p&gt;The problem is that Trove lists display all the tags associated with each individual item. Some items have lots of tags, so this eats up the screen real estate, making it harder to browse the contents of a list. Notes attached to items can be hidden, but not tags. Why not?&lt;/p&gt;
&lt;p&gt;My &lt;a href=&#34;https://gist.github.com/wragge/ab6a9d6b612bee6bc4d98658e947c420&#34;&gt;brand new userscript&lt;/a&gt; hides tags by default, and adds a new link to toggle their visibility for each individual item. The link also displays the number of tags attached to each item. This gives the user control over which tags are displayed and when.&lt;/p&gt;
&lt;p&gt;&lt;video src=&#34;https://cdn.uploads.micro.blog/8371/2025/simplescreenrecorder-2025-07-17-12.53.06.mp4&#34; poster=&#34;https://updates.timsherratt.org/uploads/2025/poster.png&#34; controls=&#34;controls&#34; preload=&#34;metadata&#34; width=&#34;600px&#34;&gt;&lt;/video&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The new userscript in action – toggle your tags!&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The main difficulty in creating this userscript was knowing when the page had actually finished loading. The current version of Trove uses a lot of Javascript to load and manipulate content, so you have to tell the userscript to wait until everything has settled down. Otherwise the script could fire too soon and cause unexpected results. I tried a number of different approaches to handling this problem, but eventually settled on the &lt;a href=&#34;https://gist.github.com/BrockA/2625891&#34;&gt;waitForKeyElements script&lt;/a&gt;. (I just realised there&amp;rsquo;s a &lt;a href=&#34;https://github.com/CoeJoder/waitForKeyElements.js&#34;&gt;more recent version&lt;/a&gt; of this script that doesn&amp;rsquo;t require jQuery, so I might need to investigate this further.)&lt;/p&gt;
&lt;p&gt;Another Trove problem fixed!&lt;/p&gt;
&lt;h2 id=&#34;using-userscripts&#34;&gt;Using userscripts&lt;/h2&gt;
&lt;p&gt;In addition to the userscripts mentioned above, I&amp;rsquo;ve also created one that &lt;a href=&#34;https://gist.github.com/wragge/af8bd20a14005d267ffc759463bd832c&#34;&gt;enables you to browse Trove newspaper pages using the arrows on your keyboard&lt;/a&gt;. Left and right arrows go to the next and previous pages, while up and down arrows jump between issues. Searching is great, but sometimes you just want to browse. Install this userscript for that old-time, authentic newspaper reading experience!&lt;/p&gt;
&lt;p&gt;But how do you install userscripts? First of all you need a browser extension to manage your userscripts – I use &lt;a href=&#34;https://www.tampermonkey.net/&#34;&gt;TamperMonkey&lt;/a&gt; or &lt;a href=&#34;http://violentmonkey.com/&#34;&gt;ViolentMonkey&lt;/a&gt;. Just follow the instructions to add one of them to your browser.&lt;/p&gt;
&lt;p&gt;To install one of my userscripts, you need to go to the script (saved as a GitHub Gist) and click on the &amp;lsquo;Raw&amp;rsquo; button. Your userscript manager will then ask you if you want to add the userscript. Click install!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-07-17-16-59-11.png&#34; width=&#34;600&#34; height=&#34;244&#34; alt=&#34;&#34;&gt;
&lt;p&gt;&lt;em&gt;Click on the &amp;lsquo;Raw&amp;rsquo; button to install.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Once they&amp;rsquo;re installed the userscripts will run automatically when specified pages are loaded. If you ever want to disable them, you can do that from your userscript manager&amp;rsquo;s dashboard.&lt;/p&gt;
&lt;p&gt;For convenience, here are the Gist links to all the userscripts I&amp;rsquo;ve mentioned:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/wragge/b2af9dc56f7cb0a9476b&#34;&gt;RecordSearch show pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/wragge/2941e473ee70152f4de7&#34;&gt;RecordSearch show people&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/wragge/ab6a9d6b612bee6bc4d98658e947c420&#34;&gt;Trove lists hide tags&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/wragge/af8bd20a14005d267ffc759463bd832c&#34;&gt;Trove newspapers keyboard navigation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As with anything you install on your computer, you want to make sure that you trust the source of any userscripts you add.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The rebirth of Wragge Labs (and moving my Heroku apps)</title>
      <link>https://updates.timsherratt.org/2025/07/09/the-rebirth-of-wragge-labs.html</link>
      <pubDate>Wed, 09 Jul 2025 17:48:23 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/07/09/the-rebirth-of-wragge-labs.html</guid>
      <description>&lt;p&gt;It looks like some paid work I was counting on won&amp;rsquo;t be going ahead, so I&amp;rsquo;m trying to save a bit of money on cloud hosting. As I previously noted, this resulted in &lt;a href=&#34;https://updates.timsherratt.org/2025/07/02/the-future-of-the-past.html&#34;&gt;the resurrection of &lt;em&gt;The future of the past&lt;/em&gt;&lt;/a&gt;, but I&amp;rsquo;ve also been continuing to slog away at migrating all my old Flask apps and experiments from Heroku to a single Digital Ocean droplet. As of today, I&amp;rsquo;ve migrated 11 apps. Here&amp;rsquo;s a few details&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;a-new-old-home&#34;&gt;A new (old) home&lt;/h2&gt;
&lt;p&gt;The first thing I had to figure out was how to group together a series of individual &lt;a href=&#34;https://flask.palletsprojects.com/en/stable/&#34;&gt;Flask&lt;/a&gt; apps so I could easily run and maintain them on a single server, without making major changes to the apps themselves. I decided to go with the &lt;a href=&#34;https://flask.palletsprojects.com/en/stable/patterns/appdispatch/&#34;&gt;application dispatching pattern&lt;/a&gt; described in the Flask documentation. This groups the apps within a single Python environment, so I had to do some alignment of Python versions and packages, but it wasn&amp;rsquo;t too hard and having just one virtual environment to manage seems a lot easier in the long run.&lt;/p&gt;
&lt;p&gt;The application dispatching pattern configures the server to run one application at the web root (&#39;/&#39;), with the other apps assigned individual sub-paths. This raised the question, what did I want sitting at the root address? Rather than selecting an existing application for the prime slot, I decided to take the opportunity to build a showcase that included details of many of the things I&amp;rsquo;ve created over the years.&lt;/p&gt;
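&lt;p&gt;The Flask documentation implements this pattern with werkzeug&amp;rsquo;s &lt;code&gt;DispatcherMiddleware&lt;/code&gt;. The basic idea, sketched here with plain WSGI callables and no dependencies, is just prefix matching (the mount paths shown are invented examples):&lt;/p&gt;

```python
class Dispatcher:
    """Route requests to sub-apps by url prefix, falling back to the
    app mounted at the web root. A stdlib-only sketch of the pattern
    that DispatcherMiddleware implements properly."""

    def __init__(self, root_app, mounts):
        self.root_app = root_app
        self.mounts = mounts  # e.g. {"/headlines": headlines_app}

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        for prefix, app in self.mounts.items():
            if path.startswith(prefix):
                # Shift the prefix from PATH_INFO to SCRIPT_NAME so the
                # sub-app can build its urls correctly.
                environ["SCRIPT_NAME"] = environ.get("SCRIPT_NAME", "") + prefix
                environ["PATH_INFO"] = path[len(prefix):]
                return app(environ, start_response)
        return self.root_app(environ, start_response)
```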
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/wraggelabs.png&#34; width=&#34;600&#34; height=&#34;720&#34; alt=&#34;Screenshot of the original Wragge Labs&#34;&gt;
&lt;p&gt;&lt;em&gt;The old Wragge Labs (circa 2012)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I also needed a new domain name. Or did I? Back in the old days, I had a site where I shared many of my tools and experiments – Wragge Labs. In the intervening years, I&amp;rsquo;d moved or migrated much of the content away and pointed the wraggelabs.com domain to my main site at timsherratt.au. But this seemed like a good opportunity to resurrect it. So if you&amp;rsquo;d like to have a play around with some of the things I&amp;rsquo;ve created over the last 30 years, head along to the all new &lt;a href=&#34;https://wraggelabs.com&#34;&gt;wraggelabs.com&lt;/a&gt;!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/new-wraggelabs.png&#34; width=&#34;600&#34; height=&#34;569&#34; alt=&#34;Screenshot of part of the new Wragge Labs!&#34;&gt;
&lt;p&gt;&lt;em&gt;The new &lt;a href=&#34;https://wraggelabs.com&#34;&gt;Wragge Labs&lt;/a&gt; showcases websites, apps, and experiments from the past 30 years – some useful, some playful, and some creepy&amp;hellip;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;A surprising number of the things I&amp;rsquo;ve built are still working. As I was compiling my list, I made a few running repairs – for example, fixing broken links in &lt;a href=&#34;https://timsherratt.au/shed/culturevic/&#34;&gt;Linking history in place&lt;/a&gt; and &lt;a href=&#34;https://timsherratt.au/shed/magicsquares/&#34;&gt;Magic Squares&lt;/a&gt; to get them working again. However, some things only exist now in web archives, and others have been broken by the recent actions of the &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;NLA&lt;/a&gt; and &lt;a href=&#34;https://updates.timsherratt.org/2025/05/19/no-more-harvesting-data-from.html&#34;&gt;NAA&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To provide a bit of extra context, I&amp;rsquo;ve grouped together publications and presentations documenting many of the experiments. These are saved in &lt;a href=&#34;https://www.zotero.org/&#34;&gt;Zotero&lt;/a&gt;, and tagged with the name of the app. When you click on a &amp;lsquo;Related&amp;rsquo; link, the details of any linked resources are retrieved using the Zotero API and displayed on a new page. This means I can add new related resources simply by dropping them into Zotero.&lt;/p&gt;
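&lt;p&gt;Retrieving the tagged items is a single call to the Zotero web API. A rough sketch (the user ID is a placeholder and the helper names are mine, but the endpoint and parameters follow the public v3 API):&lt;/p&gt;

```python
# Sketch of fetching 'related' resources by tag from the Zotero web API.
import json
import urllib.parse
import urllib.request

ZOTERO_API = "https://api.zotero.org"
USER_ID = "1234567"  # placeholder library ID

def tag_items_url(tag, limit=25):
    """Build the API URL listing items that carry a given tag."""
    params = urllib.parse.urlencode({"tag": tag, "format": "json", "limit": limit})
    return f"{ZOTERO_API}/users/{USER_ID}/items?{params}"

def fetch_related(tag):
    """Retrieve and decode the items tagged with an app's name."""
    with urllib.request.urlopen(tag_items_url(tag)) as response:
        return json.load(response)
```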
&lt;h2 id=&#34;moving-house&#34;&gt;Moving house&lt;/h2&gt;
&lt;p&gt;The process for moving the apps to their new home was pretty straightforward. On my local machine, I copied the code into the new aggregated structure, added any packages needed into a combined requirements file, and created a new top-level app to direct requests. And then I spun everything up and started fixing bugs&amp;hellip;&lt;/p&gt;
&lt;p&gt;All of the problems were easily resolved. Most involved fixing up paths to static assets or in navigation links. The only significant changes to the Python code were caused by the deprecation of the &lt;code&gt;.count()&lt;/code&gt; method in PyMongo.&lt;/p&gt;
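&lt;p&gt;For anyone hitting the same change: &lt;code&gt;Cursor.count()&lt;/code&gt; was removed in PyMongo 4, and &lt;code&gt;count_documents()&lt;/code&gt; on the collection is the replacement. A small compatibility shim (illustrative, not my actual code) looks like this:&lt;/p&gt;

```python
# Count matching documents under both old and new PyMongo APIs.
def count_matching(collection, query):
    if hasattr(collection, "count_documents"):
        return collection.count_documents(query)  # PyMongo 3.7 and later
    return collection.find(query).count()         # removed in PyMongo 4
```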
&lt;p&gt;To make life a little harder, I decided to take the opportunity to make sure that all the assets – JavaScript, CSS, and font files – were loaded from the local system, and not sitting in the cloud. Having everything local should make it easier to maintain the apps in the long term. It was a bit fiddly tracking down where everything was being loaded from, but not too hard.&lt;/p&gt;
&lt;p&gt;The only other changes I made were to add some caching to most of the apps, particularly those that make calls to external databases or APIs. I used &lt;a href=&#34;https://flask-caching.readthedocs.io/en/latest/index.html&#34;&gt;Flask-Caching&lt;/a&gt; with the local file system backend.&lt;/p&gt;
&lt;p&gt;To get the new aggregated application working on a Digital Ocean droplet, I followed the instructions on how to &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-serve-flask-applications-with-uwsgi-and-nginx-on-ubuntu-22-04&#34;&gt;serve Flask applications using uWSGI and Nginx&lt;/a&gt;. I think the only thing I did differently was to use &lt;a href=&#34;https://github.com/pyenv/pyenv&#34;&gt;pyenv&lt;/a&gt; to manage Python versions and the virtual environment. To update the app, I use &lt;code&gt;rsync&lt;/code&gt; to copy across the code and &lt;code&gt;systemctl&lt;/code&gt; to restart it. So far it&amp;rsquo;s all working pretty smoothly.&lt;/p&gt;
&lt;h2 id=&#34;redirecting-heroku&#34;&gt;Redirecting Heroku&lt;/h2&gt;
&lt;p&gt;Once the apps were happy in their new home, I needed to redirect the Heroku addresses to &lt;a href=&#34;https://wraggelabs.com&#34;&gt;wraggelabs.com&lt;/a&gt;. It was surprisingly hard to find good documentation on how to do this, so I&amp;rsquo;ll describe my steps in detail in case it&amp;rsquo;s of use to others.&lt;/p&gt;
&lt;p&gt;There are a few redirect apps for Heroku around, but I decided to use &lt;a href=&#34;https://github.com/fastmonkeys/heroku-redirect&#34;&gt;heroku-redirect&lt;/a&gt; because it basically just configures and runs Nginx without any additional processing. First I cloned &lt;code&gt;heroku-redirect&lt;/code&gt; to my local system, and then for each app I wanted to migrate I followed these steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;cd into the &lt;code&gt;heroku-redirect&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;set the git remote for the app you want to redirect: &lt;code&gt;heroku git:remote -a [app name]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;if you&amp;rsquo;re not using the latest Heroku stack, update it: &lt;code&gt;heroku stack:set heroku-24 -a [app name]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;set the url you want to redirect to (without trailing slash): &lt;code&gt;heroku config:set LOCATION=[new url]&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;to add the full path to the redirected url: &lt;code&gt;heroku config:set PRESERVE_PATH=true&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;I found I had to remove the Python buildpack from the app before updating. I did this using the Heroku dashboard, but no doubt there&amp;rsquo;s also a CLI command&lt;/li&gt;
&lt;li&gt;I also used the dashboard to add a new nginx buildpack: &lt;code&gt;https://github.com/heroku/heroku-buildpack-nginx.git&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;finally I pushed the new app, using &lt;code&gt;--force&lt;/code&gt; to replace it completely: &lt;code&gt;git push --force heroku master:main&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course, you&amp;rsquo;ll need to have the Heroku CLI installed.&lt;/p&gt;
&lt;p&gt;A number of my Heroku apps were using basic dynos (which cost US$7 a month), so once they were redirected, I changed them to use shared eco dynos. Yay – money saved! Hopefully, the redirects won&amp;rsquo;t push the eco dynos beyond their monthly limit.&lt;/p&gt;
&lt;h2 id=&#34;more-experiments-to-come&#34;&gt;More experiments to come?&lt;/h2&gt;
&lt;p&gt;One of the good things about all of this housekeeping is that it&amp;rsquo;s got me thinking about new experiments. I used to love Flask and Heroku because they made it so easy to build and share things. Now I can do the same with &lt;a href=&#34;https://wraggelabs.com&#34;&gt;wraggelabs.com&lt;/a&gt;!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The future of the past... in the present</title>
      <link>https://updates.timsherratt.org/2025/07/02/the-future-of-the-past.html</link>
      <pubDate>Wed, 02 Jul 2025 13:26:45 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/07/02/the-future-of-the-past.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve been on a bit of a self-archiving binge lately. It started because I needed to cut back some of my web hosting costs, and was looking at ways of bringing together a group of separately hosted Heroku apps onto a single Digital Ocean droplet. While taking stock of my various apps and experiments, I remembered there were some that hadn&amp;rsquo;t survived earlier migrations – in particular, &lt;em&gt;the future of the past&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The future of the past&lt;/em&gt; was a weird little app built on top of a collection of 40,000 newspaper articles, harvested from Trove, that included the phrase &amp;lsquo;the future&amp;rsquo;. I created it as part of my Harold White Fellowship at the National Library of Australia in 2012, and told the story of its genesis in &lt;a href=&#34;https://updates.timsherratt.org/2025/06/30/mining-for-meanings.html&#34;&gt;my fellowship lecture&lt;/a&gt;. In short, I extracted words with the highest TF-IDF values for each year in my dataset, and fell in love with them. The word groupings were so odd and evocative that I felt I had to find some way of sharing them.&lt;/p&gt;
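&lt;p&gt;The TF-IDF step boils down to treating each year&amp;rsquo;s harvested text as one document and ranking words by how distinctive they are of that year. A from-scratch sketch of the idea (not the original 2012 code):&lt;/p&gt;

```python
# Surface the most year-distinctive words: high frequency within a year,
# low frequency across the whole collection.
import math
from collections import Counter

def top_tfidf_words(texts_by_year, n=5):
    """Return the n highest TF-IDF words for each year."""
    tf = {year: Counter(text.lower().split()) for year, text in texts_by_year.items()}
    n_docs = len(tf)
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())  # document frequency: years containing the word
    top = {}
    for year, counts in tf.items():
        total = sum(counts.values())
        scores = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
        top[year] = sorted(scores, key=scores.get, reverse=True)[:n]
    return top
```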
&lt;p&gt;&lt;em&gt;The future of the past&lt;/em&gt; made words the primary means of navigating the collection of newspaper articles. At first you were presented with a random selection of words, sized according to their TF-IDF values. When you clicked on a word, you limited the results to years in which that word appeared. You kept clicking words until only one year matched. Then you were shown a random selection of words from that year, along with the words you&amp;rsquo;d followed to get to that point. Once you&amp;rsquo;d arrived at a year, you could click on words to display the content of articles that contained that word. But you could also make poetry.&lt;/p&gt;
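&lt;p&gt;Under the hood, that navigation logic is really just set intersection: each clicked word narrows the candidate years to those in which it appears. A sketch (the data structures here are illustrative, not the app&amp;rsquo;s internals):&lt;/p&gt;

```python
# Narrow the candidate years by intersecting the year sets of clicked words.
def refine_years(years_by_word, clicked_words):
    """Return the years in which every clicked word appears."""
    candidates = None
    for word in clicked_words:
        years = years_by_word.get(word, set())
        candidates = years if candidates is None else candidates.intersection(years)
    return candidates or set()
```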
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/fotp.png&#34; width=&#34;600&#34; height=&#34;347&#34; alt=&#34;Screen capture from the original instance of the future of the past, showing a jumble of words of different sizes in light coloured rectangles. The caption invites users to &#39;choose a word...&#39;.&#34;&gt;
&lt;p&gt;&lt;em&gt;The future of the past&lt;/em&gt; gestured towards fridge magnet poetry in its design and odd jumble of words. And when you finally landed on a single year, you could create &lt;em&gt;your own poems&lt;/em&gt; by dragging words into the box at the bottom of the screen. Once you were happy with your poem you could share it on Twitter. And people did.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screen-shot-2012-10-11-at-7.20.13-pm.png&#34; width=&#34;537&#34; height=&#34;458&#34; alt=&#34;Examples of poems created by Bethany Nowviskie and shared on Twitter.&#34;&gt;
&lt;p&gt;The most exciting and enjoyable part of the project was watching people create and share their poems. &lt;em&gt;The future of the past&lt;/em&gt; even managed to win the &amp;lsquo;Best use of DH for fun&amp;rsquo; in the 2012 DH Awards. As I said in my &lt;a href=&#34;https://updates.timsherratt.org/2025/06/30/mining-for-meanings.html&#34;&gt;Harold White lecture&lt;/a&gt;, I wasn&amp;rsquo;t really sure what &lt;em&gt;the future of the past&lt;/em&gt; was – a discovery interface? a game? a piece of art? But I suppose that&amp;rsquo;s one of the reasons why I liked it so much.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/fridge-poetry.png&#34; width=&#34;565&#34; height=&#34;746&#34; alt=&#34;More examples of poems created using future of the past and shared on Twitter.&#34;&gt;
&lt;p&gt;When I built the app, I was going through a stage of putting everything in Django. Only later did I realise that Flask was a much more suitable framework for the sort of small, experimental apps I was creating. Django was overkill, and the maintenance demands coupled with hosting issues made it difficult to keep things alive. At some point, &lt;em&gt;the future of the past&lt;/em&gt; went dark and it just seemed too hard to get it going again&amp;hellip;&lt;/p&gt;
&lt;p&gt;But last week I had another look, and decided I could resurrect the app in a more maintenance-friendly form by converting it from Django to Flask, and migrating the data from MySQL to SQLite. Django and Flask are both Python frameworks, so it was mainly a matter of unpacking all the logic in Django&amp;rsquo;s views, models, and handlers and consolidating it into a couple of simple Flask functions. Fortunately, I managed to find an SQL dump of the original database in the backed-up downloads folder of an old laptop. It took a bit of fiddling, but I got the dumped data loaded into SQLite without too many problems.&lt;/p&gt;
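&lt;p&gt;The fiddling mostly comes down to stripping MySQL-isms from the dump before feeding it to SQLite. A rough sketch of the idea (real dumps need more massaging than this – it handles only backtick quoting, table options, and &lt;code&gt;AUTO_INCREMENT&lt;/code&gt;):&lt;/p&gt;

```python
# Clean up a MySQL dump just enough for SQLite to swallow it.
import re
import sqlite3

def mysql_dump_to_sqlite(dump_sql, db_path=":memory:"):
    sql = dump_sql.replace("`", '"')                   # identifier quoting
    sql = re.sub(r"\)\s*ENGINE=\w+[^;]*;", ");", sql)  # drop MySQL table options
    sql = re.sub(r"\bAUTO_INCREMENT\b", "", sql)       # SQLite handles rowids itself
    conn = sqlite3.connect(db_path)
    conn.executescript(sql)
    return conn
```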
&lt;p&gt;I also realised I could use the new database to fix up another app I created during my fellowship – &lt;a href=&#34;https://timsherratt.au/shed/presentations/nla/pages/frequencies.html&#34;&gt;a word frequency browser&lt;/a&gt;. It&amp;rsquo;s just a static HTML page, so I added a couple of JSON APIs to the Flask app so it could access the data.&lt;/p&gt;
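&lt;p&gt;Each of those APIs is just a Flask route that queries the SQLite database and returns JSON. A minimal sketch (the route, filename, and column names are illustrative):&lt;/p&gt;

```python
# A small JSON endpoint so a static HTML page can fetch word data.
import sqlite3
from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "fotp.db"  # illustrative filename

@app.route("/api/words")
def words_for_year():
    """Return the words (and TF-IDF scores) recorded for a given year."""
    year = request.args.get("year", type=int)
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row  # rows become dict-like
    rows = conn.execute(
        "SELECT word, tfidf FROM words WHERE year = ?", (year,)
    ).fetchall()
    return jsonify([dict(row) for row in rows])
```

&lt;p&gt;The static page can then request something like &lt;code&gt;/api/words?year=1913&lt;/code&gt;.&lt;/p&gt;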
&lt;p&gt;Both Django and Flask use Jinja2 templates, so I didn&amp;rsquo;t have to do anything much to the interface of &lt;em&gt;the future of the past&lt;/em&gt;. I made sure that all the assets (fonts and javascript) were being loaded from local copies to avoid any future problems and, of course, I had to replace the Twitter integration. I decided to add options to share poems on both Mastodon and Bluesky. Mastodon was a little tricky because you need to know a user&amp;rsquo;s instance before you can post their toot. There are a number of solutions available, but I went with the &lt;a href=&#34;https://github.com/autinerd/simple-mastodon-share-button&#34;&gt;pattern documented in this GitHub repository&lt;/a&gt;. It&amp;rsquo;s a little clunky because you need to enter your instance name each time you post, and you might also have to allow pop-ups for it to work properly, but it seems to do the job. I did think about updating some other aspects of the interface, but decided to preserve it in its original 2012 grey-toned glory.&lt;/p&gt;
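&lt;p&gt;The sharing buttons boil down to building a URL. Bluesky has a central compose intent, while Mastodon needs the user&amp;rsquo;s own instance first – hence the clunkiness. A sketch (treat the exact URL shapes as assumptions based on the pattern linked above, not a spec):&lt;/p&gt;

```python
# Build share links for a poem. Mastodon URLs are assembled per-instance;
# Bluesky uses a single compose intent URL.
import urllib.parse

def mastodon_share_url(instance, text):
    return f"https://{instance}/share?text=" + urllib.parse.quote(text)

def bluesky_share_url(text):
    return "https://bsky.app/intent/compose?text=" + urllib.parse.quote(text)
```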
&lt;p&gt;&lt;strong&gt;So &lt;a href=&#34;https://wraggelabs.com/fotp/&#34;&gt;the future of the past&lt;/a&gt; lives again in the present!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://wraggelabs.com/fotp/&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-07-02-11-45-36.png&#34; width=&#34;600&#34; height=&#34;419&#34; alt=&#34;Screenshot of the current future of the past interface, including options to share poems on Mastodon and Bluesky.&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://wraggelabs.com/fotp/&#34;&gt;Create your own poems!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Mining for meanings</title>
      <link>https://updates.timsherratt.org/2025/06/30/mining-for-meanings.html</link>
      <pubDate>Mon, 30 Jun 2025 18:36:42 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/06/30/mining-for-meanings.html</guid>
      <description>&lt;p&gt;&lt;em&gt;In 2012, I was lucky enough to be awarded a Harold White Fellowship by the National Library of Australia. I used my time to explore ways of using Trove&amp;rsquo;s digitised newspapers as data, and presented my work at a public lecture in May 2012. I spoke from notes and never got round to writing it all up. The recording made by the NLA has disappeared from their website, but is &lt;a href=&#34;https://web.archive.org/web/20140212200542/http://www.nla.gov.au/podcasts/media/Harold-White/tim-sherratt.mp3&#34;&gt;still available in the Internet Archive&lt;/a&gt;. The text below is a transcription of the recording made in June 2025 with some minor editing.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;You can also listen to the audio, &lt;a href=&#34;https://timsherratt.au/shed/presentations/nla/&#34;&gt;browse the full set of slides&lt;/a&gt;, or &lt;a href=&#34;https://doi.org/10.5281/zenodo.15771695&#34;&gt;download a PDF&lt;/a&gt; from Zenodo.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;audio src=&#34;https://cdn.uploads.micro.blog/8371/2025/tim-sherratt.mp3&#34; controls=&#34;controls&#34; preload=&#34;metadata&#34;&gt;&lt;/audio&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/harold-white-2.jpeg&#34; width=&#34;600&#34; height=&#34;430&#34; alt=&#34;&#34;&gt;
&lt;p&gt;&lt;em&gt;Photograph by Christopher Brothers, 2012, &lt;a href=&#34;https://nla.gov.au/nla.obj-132272018&#34;&gt;nla.gov.au/nla.obj-1&amp;hellip;&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;i-beyond-discovery&#34;&gt;I. Beyond discovery&lt;/h2&gt;
&lt;p&gt;Thanks Marie-Louise and thanks to the library for this great opportunity. And of course thanks to all of you for coming along on a night when I&amp;rsquo;m sure you&amp;rsquo;d rather be at home waiting for the budget speech. And this is the API working away here in the background. Okay, well, do I really need to introduce the newspaper database? I suspect I probably don&amp;rsquo;t for this sort of audience. You&amp;rsquo;re probably avid users of the digitized newspapers online. Are you? Yeah. I did my doctoral research back in the dark ages before Trove, and of course that meant spending many weeks, if not months, destroying my eyesight using microfilm readers. Using what are quite fragmentary printed indexes to try and find stuff which might be relevant to my study. But now of course more than 60 million newspaper articles online and most importantly, really the full text of these articles is searchable. It is something which we&amp;rsquo;re quite familiar with now, but it is something which is quite revolutionary in many ways.&lt;/p&gt;
&lt;p&gt;This unprecedented access to a vast volume of material which documents the ordinary lives of Australians is already changing historical practice. We can now go beyond the well-known events, the big stories and explore the small stories, the fragments, the glimpses of lives which might not otherwise be recorded, but this access comes with a cost. What happens when we do a search and instead of getting 10 results or 100 results, we get 10,000 results or 100,000 results? How do we start to use or understand that sort of thing? What do we do when instead of the clarity and excitement of discovery, we end up with the anxiety and confusion that can come with overwhelming abundance?&lt;/p&gt;
&lt;p&gt;Fortunately though, there are a growing number of digital tools which we can turn to. Tools and technologies which enable us to manage this deluge and to explore large volumes of text rather than sort of single search results. Tools that enable us to zoom out of our search results and have a look at the big picture to understand the trends and the patterns to see what&amp;rsquo;s going on. For example, perhaps we might want to try and track events over time. Have a look for example, this graph shows the prevalence of the words drought and floods in the newspaper database over time.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-003.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;So we can actually look at that and use it as a way where we can map it to the specific events. And we can see here, of course, this is the federation drought at this point. We could also start to look for patterns that aren&amp;rsquo;t easy to see within your sort of normal list of search results.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-004.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;I was interested in having a look at how the word decade might be used. So I searched for decade and I found, as you can see, that there&amp;rsquo;s these nice sort of regular peaks and I was wondering why have we got these regular peaks? And I did a bit more digging and I discovered why. That red line shows the usage of the word census. And you see here how the little peaks sit on top of each other?&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-005.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;So obviously the time we talk most about decades is when a census has come out. So again, this is a sort of pattern which would be very hard to find other ways by just sort of working through our list of search results. We can also use these sorts of technologies for exploring changes in language, the way we talk about things, the labels we use.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-006.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;This is an example which I&amp;rsquo;ve taken from the National Library of New Zealand searching through New Zealand newspapers in this case. But what they&amp;rsquo;ve done is to look at the change in usage of the name for the South Island, which was apparently – I didn&amp;rsquo;t know this – originally called Middle Island and changed to the South Island. And so you can see here this sort of process of transition happening before South Island takes over completely.&lt;/p&gt;
&lt;p&gt;We can also challenge our expectations. Now, I was always of the belief that the traditional name for people from English cultural background of that chap who wears a red suit and comes around at Christmas time was Father Christmas. And then in recent years that has been supplanted by the sort of Americanized Santa Claus, but it seems I&amp;rsquo;m wrong.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-007.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;The red line here is Santa Claus, the blue line is Father Christmas. And so if we look here from the late 19th century to the early 20th century, Santa Claus is definitely winning. What&amp;rsquo;s interesting though, really interesting, is when we get the change over. Any guesses as to what&amp;rsquo;s going on there?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; Coke advertising.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Pardon?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; Coca-Cola advertising.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Well actually, I don&amp;rsquo;t know. My hypothesis, what&amp;rsquo;s happening, is this is around sort of 1914 and it seems that over the war period, Father Christmas starts to win over the top of Santa Claus. So whether, I mean this is pure hypothesis at this point, and it&amp;rsquo;s something which would be interesting to explore, whether it&amp;rsquo;s the Germanic sound of Santa Claus, it sort of lapses in popularity or perhaps there are other causes, completely other circumstances. But that&amp;rsquo;s the value of these sorts of things that they do allow you to ask some questions and to prompt you to do some other sorts of investigation.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-008.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Now these graphs which I&amp;rsquo;m showing you, were all created by a tool I developed called &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/querypic/&#34;&gt;QueryPic&lt;/a&gt;. And we won&amp;rsquo;t just show you the slide, we&amp;rsquo;ll actually use it. I want a word. Anybody give a word?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; Brooch.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Broach?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; Because yours is nice.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I&amp;rsquo;m not sure it&amp;rsquo;s going to show anything. Yeah, brooch.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; Know what we&amp;rsquo;ve done &amp;lsquo;automaton&amp;rsquo;, the one we talked about last week.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You have to spell it for me.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; A-U-T-O-M-A-T-O-M. Actually, correct. That&amp;rsquo;s what you&amp;hellip;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;No, no, no, this is right. Okay. So what this is doing now, it&amp;rsquo;s actually going off to the Trove API and it&amp;rsquo;s getting the total results. I mean it&amp;rsquo;s actually a very simple tool. All it&amp;rsquo;s doing is it&amp;rsquo;s taking your query and it&amp;rsquo;s searching for each year across the span of the newspaper database, and it&amp;rsquo;s getting the total number of results for each year and it&amp;rsquo;s then presenting them in the form of the graph. As I say, it&amp;rsquo;s very simple, but it&amp;rsquo;s also quite effective as you can see. And it&amp;rsquo;s useful and it&amp;rsquo;s also quite fun. And what it gives you is the ability to quickly explore a hunch, to get a sort of sense of context or to start exploring, to start framing a more specific research question without spending&amp;hellip; there we go&amp;hellip; without spending days searching or tabulating as you would normally have to do. So you can see how easy it is to use and if you want to actually compare that to something else, you can just type in another word.&lt;/p&gt;
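&lt;p&gt;&lt;em&gt;[In code, the approach described here looks something like this – one search per year, keeping the total result count for each. The fetch function is injected so the Trove API details stay out of the sketch; this is an illustration, not QueryPic&amp;rsquo;s actual code.]&lt;/em&gt;&lt;/p&gt;

```python
# One search per year across the span of the newspaper database,
# recording the total number of results for each year.
def yearly_totals(query, years, fetch_total):
    """Map each year to the total number of matching articles.

    fetch_total(query, year) is expected to run a year-limited search
    and return the total result count reported by the search API.
    """
    return {year: fetch_total(query, year) for year in years}
```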
&lt;p&gt;Okay, now there are obvious limitations to a tool like this. There&amp;rsquo;s a lot to unpack. I wouldn&amp;rsquo;t want to say that it&amp;rsquo;s evidence because there is so many assumptions built into the back end of it. Questions about what the search engine is actually giving you back, different usages of terms, obviously the contexts and things like the quality of the OCR itself. You know there are a whole lot of stuff. But despite all that, I think it is quite useful, as I said, in terms of allowing you to explore things quite quickly and to follow your hunches. I regard it as a starting point, not as an end.&lt;/p&gt;
&lt;p&gt;Now, but there are some folks&amp;hellip; let me see if it&amp;rsquo;s going to finish&amp;hellip; there are some folks who are a bit more confident about techniques such as this and who would suggest that not only can they provide evidence, but they can actually be used to develop mathematical representations of past behavior.&lt;/p&gt;
&lt;h2 id=&#34;ii-finding-formulas&#34;&gt;II. Finding formulas&lt;/h2&gt;
&lt;p&gt;You may have heard of the Culturomics project from Harvard University. These guys got access to the full corpus of Google&amp;rsquo;s digitized books. So 5 million books, the text of 5 million books. They pulled it all apart. They did a bit of cleaning up of the metadata, all sorts of stuff, and then they started searching it to see what they could pull out of it. And when they started searching, they noticed all sorts of patterns appearing and they argued that these patterns could actually form the basis for what they said was a new science of culture, hence culturomics.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-010.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;There&amp;rsquo;s a lot I might say about that, but I just want to look at one example. Okay, this is an example which in their published paper in Science, they called &amp;lsquo;We forget&amp;rsquo;, and I generated it using an online tool called the Ngram Viewer. You can go and do this yourself if you like. And what it&amp;rsquo;s showing as you might be able to see is it&amp;rsquo;s searching for years used within the text. So 1883, 1910, 1950. It&amp;rsquo;s pulling out all the instances where those labels are used within the text, where those terms are used. And there does obviously seem to be some sort of pattern. And the researchers noticed that the graphs have a characteristic shape, obviously rapid ascent and then a decline. But they also noticed changes. Of course, the size of the peaks is changing over time, getting higher. They say that this is indicating a greater focus on the present and the rate of decay is increasing, so that the peak is actually dropping away faster. And they say from this, we are forgetting our past faster with each passing year.&lt;/p&gt;
&lt;p&gt;I thought it would be interesting to repeat this experiment using QueryPic. So I did. It looks a bit different.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-011.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;I mean, before we could interpret this difference, of course, there&amp;rsquo;s a lot that we would want to ask, a lot of, first of all, methodological questions. Again, exactly what are we searching in the two instances and how can we compare the searching, the books in one instance to the newspapers in others - dates obviously play a different role in newspapers than they do in books. But it was actually the conceptual issues, which really struck me in relation to this example and in particular the assumption that we can compare the past, present, and future uses of these labels as if we are talking about the same thing: as if the label 1950 means the same thing before 1950, in 1950 and after 1950. The names for events and periods that we assign, that we share, that we use are themselves the products of historical processes. They slip, they shift, they change.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-012.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;We all know what we mean by the Great Depression. Where&amp;rsquo;s the Great Depression on this graph? So in terms of the usage at the time, the usage of the term &amp;lsquo;Great Depression&amp;rsquo; was actually greater in the 1890s than in the 1930s.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-013.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;We&amp;rsquo;re very familiar with the usage of &amp;lsquo;black&amp;rsquo; days like Black Tuesday. Black Friday, of course, is the one we&amp;rsquo;re most familiar with. And in Australia, these labels are generally attached to bushfires of course, and that&amp;rsquo;s the context where we generally understand them and use them and remember them. And over here, of course we have Black Friday. So what&amp;rsquo;s this big peak here? It&amp;rsquo;s not a bushfire. It refers to the Victorian government&amp;rsquo;s mass sackings of senior civil servants and judges in 1878. Obviously it was an extremely important event at the time, an extremely important event in government in Victoria, but it doesn&amp;rsquo;t quite figure in our collective memory in the same way as Black Friday does.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-014.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;One of my early experiments with QueryPic was to look at the question of when did the Great War become the First World War? At what point did we stop thinking about the Great War as the war to end all wars and realize that it was one in a series of global conflicts? And the graph really does a nice job of confirming our expectations, I suppose, in that we see a nice crossover late in 1941, which if we were thinking about the passage of the war would be about when we probably would expect. But what&amp;rsquo;s missing from this, what&amp;rsquo;s missing of course is just the war.&lt;/p&gt;
&lt;p&gt;Just as with the Great Depression and with Black Wednesday, what&amp;rsquo;s hard is trying to recapture the moment as it was happening, the sense, for want of a better word, of present-ness. Now, if we go back to &amp;lsquo;We forget&amp;rsquo;, what are we doing when we&amp;rsquo;re talking about one of these dates? I mean, if we think of the present as providing a line there – past, present, future – on the past side, what are we doing? We&amp;rsquo;re anticipating. We&amp;rsquo;re predicting. Perhaps, we&amp;rsquo;re dreading. And present, we&amp;rsquo;re experiencing, we&amp;rsquo;re enjoying, maybe suffering. In the future, we are remembering, we are regretting, perhaps reflecting. So instead of lumping all these together, it seems to me that we should be teasing them out and exploring their different interconnections.&lt;/p&gt;
&lt;p&gt;We should be trying to give the past back its own sense of the present. And this in essence was the modest and thoroughly achievable goal of my Harold White Fellowship. I wanted to explore the possibilities of the digitized newspaper collection in supporting this sort of rich temporal contextualization using digital methods to recover the pasts, the presents and the futures of any moment in our history. I have to admit, I haven&amp;rsquo;t got very far yet, and Marie-Louise has been doing a good job of reassuring me that sometimes the fruits of these things take a while to develop. Now, there are a number of reasons why I haven&amp;rsquo;t gotten as far as I wanted, but I do have a few sort of sketches that I want to share with you.&lt;/p&gt;
&lt;h2 id=&#34;iii-the-future-of-the-past&#34;&gt;III. The future of the past&lt;/h2&gt;
&lt;p&gt;Okay. What I decided to do was to try and create a manageable sample set. So I decided to work with articles that included the phrase &amp;lsquo;the future&amp;rsquo; in the heading or the first four lines of the article; that&amp;rsquo;s one of the facets you can use within Trove. Why did I limit it in this way? Well, I&amp;rsquo;ve been doing a lot of different work in Trove, as Marie-Louise said. One project I&amp;rsquo;ve been working on was looking at ways of finding editorials within Trove and exploring the content of editorials over time. And in doing that, I discovered a number of frustrating things, one of which is that sometimes the articles aren&amp;rsquo;t divided up as nicely as we want them to be. Particularly with editorials: editorials on different subjects are often joined together, so it&amp;rsquo;s difficult to separate out the specific ones that you want. But I thought that by limiting my search in this way I would increase my chances of relevance. It also brought the number of matches down to what I thought was a reasonably manageable 60,000 or so.&lt;/p&gt;
&lt;p&gt;So I started harvesting those 60,000 articles. I have over time been developing a number of tools for working with Trove, one of which is a harvester that enables you to get the data in bulk. And of course that&amp;rsquo;s necessary if you&amp;rsquo;re going to do this sort of large-scale analysis. I modified my existing harvesting tools to save the results directly into a database, and when the API became available, I modified them to use the API, which makes a lot of things easier. Now, after about 40,000, I thought I probably had enough, and I decided I&amp;rsquo;d trust in Trove&amp;rsquo;s relevance ranking and just work with that set.&lt;/p&gt;
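&lt;p&gt;The core of a harvester like this is just a pagination loop. Below is a minimal sketch in Python; the &lt;code&gt;fetch&lt;/code&gt; function is a hypothetical stand-in for a real request to the Trove API, which returns results a page at a time, and none of the names here come from the actual harvesting tools.&lt;/p&gt;

```python
def harvest(fetch, page_size=100):
    """Collect every record from a paginated source.

    `fetch(start, page_size)` stands in for a real API request; it
    returns a list of records, empty once the results are exhausted.
    """
    records = []
    start = 0
    while True:
        page = fetch(start, page_size)
        if not page:
            break
        records.extend(page)
        start += page_size
    return records

# Demonstration with an in-memory stub instead of a live API.
def fake_fetch(start, page_size):
    articles = list(range(250))  # pretend these are 250 newspaper articles
    return articles[start:start + page_size]

print(len(harvest(fake_fetch)))  # 250
```

&lt;p&gt;A real harvester would also handle rate limits and save each page to disk as it arrives, so a crash doesn&amp;rsquo;t lose a day&amp;rsquo;s work.&lt;/p&gt;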
&lt;p&gt;And then it was time to do some cleaning. Now, Trove&amp;rsquo;s crowdsourced OCR correction project has been a wonderful success, of course, but it&amp;rsquo;s worth noting that of the sample of articles that I harvested for this project, only 2% had any corrections at all. So 98% were totally uncorrected, totally untouched. While I couldn&amp;rsquo;t hope to correct all of those articles myself, I could at least try to reduce some of the noise created by these sorts of OCR errors. So I developed a series of scripts to clean up some of that OCR output. First of all, they corrected some fairly consistent and hopefully unambiguous OCR errors. And you can test yourself here. What&amp;rsquo;s that meant to be?&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-018.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; His.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Pardon?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Audience:&lt;/strong&gt; His.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Nope. The.&lt;/p&gt;
&lt;p&gt;What about this one? Nah, the. This one? No, of. Should get that one. Ah, yep. And you can check. There we go. Look, of, ah, yep. So there are a series of these which I could just fix up with a script. I then checked each word in the text against a series of dictionaries and word lists, including a list of place names extracted from the gazetteer provided by Geoscience Australia. Anything which didn&amp;rsquo;t seem to match up, I marked in a way that let me extract it later if I wanted to. And all of this, you&amp;rsquo;ve got to understand, went through a lot of trial and error: trying stuff out, seeing what it produced, trying it again, fiddling with it. Lots and lots of trial and error.&lt;/p&gt;
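&lt;p&gt;The two cleaning passes described above can be sketched like this. The substitution table and word list here are illustrative only; the real corrections and dictionaries were assembled through all that trial and error.&lt;/p&gt;

```python
import re

# Illustrative table of consistent OCR confusions ('tho' for 'the' is
# a classic in digitised newspapers); the real table was built by hand.
SUBSTITUTIONS = {"tho": "the", "bo": "be", "aud": "and"}

# Stand-in for the dictionaries, word lists and gazetteer of places.
WORD_LIST = {"the", "be", "and", "of", "future", "war"}

def clean(text):
    """Fix consistent OCR errors, then mark unrecognised words."""
    out = []
    for word in re.findall(r"[A-Za-z]+", text):
        word = word.lower()
        word = SUBSTITUTIONS.get(word, word)
        if word not in WORD_LIST:
            word = "[?" + word + "]"  # marked for later extraction
        out.append(word)
    return " ".join(out)

print(clean("Tho future of tho war"))  # the future of the war
```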
&lt;p&gt;But after that, I could do some fun things. You&amp;rsquo;re all of course familiar with word clouds, but I bet you haven&amp;rsquo;t seen a non-word cloud. This is my non-word cloud.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-019.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Now, of course, the big question is what is &amp;lsquo;others&amp;rsquo; doing there? I don&amp;rsquo;t know. For some reason my word list didn&amp;rsquo;t like the word others, but of course you can see here some more sort of consistent OCR errors. There&amp;rsquo;s another &amp;lsquo;the&amp;rsquo; and another &amp;lsquo;the&amp;rsquo;, and that would be a &amp;lsquo;be&amp;rsquo;, in most cases. And we also see where words have been split up. We&amp;rsquo;ve got a &amp;lsquo;tralia&amp;rsquo; down there. Oh, that&amp;rsquo;s a &amp;lsquo;which&amp;rsquo; obviously. So it&amp;rsquo;s actually quite useful visualizing in this way because I can then feed that back into my process of cleaning. I can see where the common errors are, and I can start to feed that back into the process.&lt;/p&gt;
&lt;p&gt;For each article that I processed in this way, I generated an accuracy score, which was simply the number of recognized words divided by the total number of words within the article. And I could use these scores to develop a couple of overviews.&lt;/p&gt;
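&lt;p&gt;That score is just a simple ratio. A sketch, again with a hypothetical stand-in word list:&lt;/p&gt;

```python
def accuracy_score(text, word_list):
    """Number of recognised words divided by the total number of words."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    if not words:
        return 0.0
    recognised = sum(1 for w in words if w in word_list)
    return recognised / len(words)

# A hypothetical word list; the real ones included a gazetteer of places.
words_known = {"the", "future", "of", "war"}
print(accuracy_score("the future of xqzvt", words_known))  # 0.75
```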
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-020.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;So this is OCR accuracy over time, for my sample only, of course. There aren&amp;rsquo;t many articles in this earlier period, so it&amp;rsquo;s probably not worth worrying about too much. But what&amp;rsquo;s interesting is this decline here, down to the 1920s, where we&amp;rsquo;re going below the 80% mark. Why is that? I&amp;rsquo;ve got no idea. There are a whole lot of variables which could certainly be involved here: the fonts, the quality of the printing, the quality of the paper, the quality of the microfilming. I don&amp;rsquo;t know. It&amp;rsquo;s something which would be interesting to explore further and investigate.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-021.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;We can also have a look at the poorest performing newspapers. So the &lt;em&gt;Perth Gazette and West Australian Times&lt;/em&gt; didn&amp;rsquo;t do too well, and it got 58% in my scorecard. Again, this is only a select sample, so I&amp;rsquo;m not quite sure what you can read into any of this, but it&amp;rsquo;s sort of interesting. These figures weren&amp;rsquo;t particularly important for my work, but I do think that the general issue of OCR quality is really vitally important, particularly as we make more and more scholarly use of these sorts of collections in bulk. I mean, obviously we need to improve the quality, but we also need to expose our assumptions about the OCR quality that underlie our work so that when we are putting forward something, some sort of analysis of the text, we&amp;rsquo;ve got a way of communicating the quality of the material that we&amp;rsquo;re working with.&lt;/p&gt;
&lt;p&gt;I then decided to make my sample set even more manageable by selecting just the first 10,000 articles with accuracy figures of over 80%. So I used my scores and went through and chose just those articles which seemed to have come out pretty well. Of course, as any good digital humanities person does, I then started counting words.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-022.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;As with most of this stuff that I&amp;rsquo;m showing you tonight, &lt;a href=&#34;https://timsherratt.au/shed/presentations/nla/pages/frequencies.html&#34;&gt;it&amp;rsquo;s online and you can go and play with it&lt;/a&gt; and use it yourself. So this shows the word frequencies over time, and there&amp;rsquo;s a time slider here, and you can just drag it along and see what&amp;rsquo;s happening in different years. Now, nothing really significant jumps out at me from looking at the word frequency clouds here. I mean, what is sort of interesting I suppose, is the preponderance of &amp;lsquo;would&amp;rsquo; and &amp;lsquo;could&amp;rsquo;, which I suppose confirms the future orientation of the sample set that I&amp;rsquo;m working with. And there may well be other things within there that jump out at you. And so as I say, jump online and have a look and have a play with this and see what you can make of it.&lt;/p&gt;
&lt;p&gt;I mean, word frequencies&amp;hellip; Okay. So word frequencies can be interesting for getting an overall picture of a large amount of text and starting to track some changes over time. But this sort of word frequency tells you what&amp;rsquo;s common. It doesn&amp;rsquo;t tell you what&amp;rsquo;s distinctive. It doesn&amp;rsquo;t tell you what&amp;rsquo;s interesting in an article. Another measure we can use to try and get at the distinctiveness of a piece of text is something called TF-IDF: term frequency, inverse document frequency. What it does is look not just at the frequency of a word within a particular piece of text, but also at the frequency of that term across a collection of texts. So a word that is common in a particular article, but not very common in the collection as a whole, will appear as more significant, more heavily weighted, in its TF-IDF value.&lt;/p&gt;
&lt;p&gt;You use TF-IDF values all the time. They&amp;rsquo;re used by search engines in calculations of similarity. They can take the TF-IDF values, convert them into a sort of mathematical format and use them to calculate the similarity between two pieces of text. And the results of calculating TF-IDF values for collections like this are pretty interesting, and I&amp;rsquo;ll just show you a little comparison.&lt;/p&gt;
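&lt;p&gt;To make the idea concrete, here is a toy TF-IDF calculation in Python. It&amp;rsquo;s a sketch of the standard formula (term frequency times the log of inverse document frequency), not the actual code used for this work.&lt;/p&gt;

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight words that are frequent in one document but rare across
    the collection. Returns one dict of word scores per document."""
    tokenised = [doc.lower().split() for doc in docs]
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in tokenised:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores
```

&lt;p&gt;A word that appears in every document gets a score of zero (the log of 1), while a word concentrated in one article is weighted heavily. Cosine similarity between two of these score vectors is the &amp;lsquo;mathematical format&amp;rsquo; that search engines use when comparing texts.&lt;/p&gt;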
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-023.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;So on the left-hand side here, this is 1939. The top 10 words on the left-hand side are ranked by plain frequency, and here are the TF-IDF values. So you see we&amp;rsquo;re getting at something quite different and quite interesting here. In 1939, Hitler doesn&amp;rsquo;t figure in the frequency list; he&amp;rsquo;s at the top of the TF-IDF one. But we also get these really odd things like midget and roundabout. I found producing these values really interesting; they were quite evocative and encouraged me to explore more, and I&amp;rsquo;m going to talk some more about this a bit later.&lt;/p&gt;
&lt;p&gt;But finally, I just wanted to show you one other way of understanding a collection of texts, and that&amp;rsquo;s through a thing called topic modeling. There&amp;rsquo;s a lot of topic modeling going on in the digital humanities at the moment, and there are a number of good blog posts, which I&amp;rsquo;ll put links to from here, which tell you what topic modeling is. I&amp;rsquo;m just going to quickly race through it. Basically, I used a piece of software called MALLET. I pointed MALLET at my collection of texts, told it that I wanted to define 10 topics, that is, 10 clusters of articles within those texts, and it just did it.&lt;/p&gt;
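&lt;p&gt;For anyone who wants to try this themselves, a MALLET run looks roughly like the following. The file and directory names are illustrative, not the ones used for this project.&lt;/p&gt;

```shell
# Import a directory of plain-text files (one article per file),
# removing stopwords, then train a 10-topic model.
bin/mallet import-dir --input articles/ --output articles.mallet \
    --keep-sequence --remove-stopwords
bin/mallet train-topics --input articles.mallet --num-topics 10 \
    --output-topic-keys topic_keys.txt --output-doc-topics doc_topics.txt
```

&lt;p&gt;The topic-keys file lists the top words for each topic, like the lists on the slide, and the doc-topics file gives the per-article topic weights.&lt;/p&gt;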
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-024.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;And what it came back with is these lists of words which are grouped according to the topics which it believes existed. You can then go through and look at these lists of words and start to interpret them to try and understand what those topics are. And most of them are pretty clear. This of course, is the topic that tells me that I still didn&amp;rsquo;t clean up the OCR enough, but it&amp;rsquo;s interesting that it brought them all together.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ve got, I mean here we&amp;rsquo;ve got trade, here we&amp;rsquo;ve got technology, here we&amp;rsquo;ve got land/rural, here we&amp;rsquo;ve got international relations, here we&amp;rsquo;ve got government, and here we&amp;rsquo;ve got home and society. And it&amp;rsquo;s amazing once you run these things, how much sense the topics actually make to you.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-025.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;And it also goes through each article in your collection and weights it according to these topics. So for each article you can see which is the most heavily weighted topic, and you can total the weights associated with each topic over time and produce something like this. Okay, that&amp;rsquo;s not terribly instructive as it is, but if you click on that and &lt;a href=&#34;https://timsherratt.au/shed/presentations/nla/pages/topic_totals.html&#34;&gt;go to the live version&lt;/a&gt;, you can click on the legend down the bottom and take away some of the lines, so you can see what&amp;rsquo;s happening underneath and just see the lines that you&amp;rsquo;re interested in.&lt;/p&gt;
&lt;p&gt;But basically, I still want to do a lot more work on these topics; at this stage I haven&amp;rsquo;t really done much interpretation of them. I want to look at how I&amp;rsquo;m using those weightings and find better ways of exploring them. So anyway, here I am. Not a lot of interpretation at this stage. No great insights. I have a data set, and I&amp;rsquo;m going to be continuing to play with it. And as I&amp;rsquo;ve said, all this stuff is available online, so you&amp;rsquo;re welcome to come and play with it too and see what you can make of it. Now, you may think that I&amp;rsquo;ve gone into a lot of tedious detail about what I did. Well, I&amp;rsquo;ve actually saved you from a lot of the gory details.&lt;/p&gt;
&lt;h2 id=&#34;iv-meanings-for-mining&#34;&gt;IV. Meanings for mining&lt;/h2&gt;
&lt;p&gt;The truth of much research in the digital humanities is that large amounts of time are spent yak shaving and data munging. If you don&amp;rsquo;t know the term &amp;lsquo;yak shaving&amp;rsquo;, it&amp;rsquo;s that process that we&amp;rsquo;re all familiar with: you start a particular task and realize that, in order to achieve it, you have to do something else, or research something else, and that continues in infinite regression until you find yourself doing something which seems totally unrelated to the task you started with. I&amp;rsquo;ve had a lot of that recently. There were lots of issues just involved in using this data and starting to manipulate it. As I&amp;rsquo;ve said before, the issue of OCR quality is crucial, and we have to be upfront about the problems and continue to look for the most effective solutions. We have to talk about questions of selection and completeness. What&amp;rsquo;s actually in Trove? How does it change, and how does this influence the results that we get?&lt;/p&gt;
&lt;p&gt;One of my examples here is a thing called the Atomic Age Exhibition, which toured around Australia in 1948-49. It was a big thing. Many, many thousands of people visited. It was at the Easter Show in Sydney. If you search in Trove for Atomic Age Exhibition, you&amp;rsquo;ll find quite a lot of results coming from the &lt;em&gt;Courier Mail&lt;/em&gt; in Brisbane. You&amp;rsquo;ll find virtually nothing from Sydney and Melbourne, and you might be inclined to think that the exhibition didn&amp;rsquo;t actually go to Sydney and Melbourne. Why is there nothing in Sydney and Melbourne? Because the exhibition was sponsored by the &lt;em&gt;Herald&lt;/em&gt; in Melbourne and by the &lt;em&gt;Daily Telegraph&lt;/em&gt; in Sydney, and both of those titles are currently not in the newspaper database.&lt;/p&gt;
&lt;p&gt;So we&amp;rsquo;ve got to bring these sorts of questions and perspectives as we start to do this research. Another barrier, which I started to butt my head up against in doing this was that of computing power. Generating the TF-IDF values for my sample took about a day and a half on my laptop. And of course, then you realize, you did something stupid and you have to do the whole thing again. And I did wonder at various times whether I was reaching the limits of what&amp;rsquo;s practically possible for one bloke and his laptop and wondering whether my sort of piecemeal efforts will be blown away by academic teams with access to research funds, bright young graduate students, and time on a supercomputer.&lt;/p&gt;
&lt;p&gt;Now, this list of problems and concerns might seem a bit depressing, and it might not be what you expected from this talk, but I want to reassure you, there are digital tools that make it easy to get started and start exploring the possibilities. QueryPic of course, there are other things like &lt;a href=&#34;https://voyant-tools.org&#34;&gt;Voyant&lt;/a&gt;, which is a great online tool for starting to do text analysis, but sooner or later you&amp;rsquo;re going to have to confront some pretty hefty questions. But hey, that&amp;rsquo;s just history. The past is messy and it raises difficult questions about things like selection and interpretation. The issues aren&amp;rsquo;t necessarily new, it&amp;rsquo;s just that they&amp;rsquo;re raised in a bigger, more technically challenging context.&lt;/p&gt;
&lt;p&gt;But what does really bug me is a nagging feeling that I should be taking statistics more seriously. That in constructing the sort of examples which I&amp;rsquo;ve been showing and the tools that I&amp;rsquo;ve been demonstrating, that I should actually be less impressionistic and more rigorous, as if I&amp;rsquo;m sort of not doing justice to the vast computing power that I have at my disposal. But I don&amp;rsquo;t want to do that. In January, I was at the American Historical Association meeting and I was actually able to see the culturomics guys live in person doing their spiel. And as they described their vision for a new science based on access to these huge cultural data sets, I tweeted, &amp;ldquo;Yeah, I want to use big data to tell better, more compelling, more human stories.&amp;rdquo;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-027.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;The British historian Tim Hitchcock has similarly described his own unease that the demands of big data seem to be moving him towards a more positivist style of history. In the humanities, we&amp;rsquo;ve been really fortunate to make use of many decades of research into things like information retrieval. We&amp;rsquo;ve adopted many of their concepts, their tools, and their formulae, but we&amp;rsquo;ve also adopted some of their language. And so we talk about what we&amp;rsquo;re doing as mining. Mining is an extractive process. We dig stuff up, we pull it out of the ground. But this seems to be pretty much the opposite of what I want to do. I mean, I do want to find structures and separate them out for different types of analysis, but then again, I want to put them back together. I want to observe them in different contexts as rich and as complex as possible. How do we do that?&lt;/p&gt;
&lt;p&gt;Well, first of all, we have to work out better ways of incorporating these sorts of big data perspectives into the narratives that we write. Just as QueryPic gives you that opportunity to sort of zoom out and get a big picture, I think we have to take control of the zoom and use it to our advantage. And this, by the way, probably means developing new forms of publication that allow easier and better integration of data and text. It&amp;rsquo;s challenging, but there&amp;rsquo;s not much point to dwelling on the dangers and problems of big data, and as Tim Hitchcock concludes, we simply need to get on with it.&lt;/p&gt;
&lt;h2 id=&#34;v-screwmeneutics-and-deformance&#34;&gt;V. Screwmeneutics and deformance&lt;/h2&gt;
&lt;p&gt;The second approach is to foster acts of creative subversion, to use digital tools in new ways. Literary scholars within the digital humanities talk about the possibilities of deformance: using computational methods to change texts in ways that can open them up to new critical perspectives. Stephen Ramsay also talks about moving beyond traditional forms of search and browse and admitting &amp;lsquo;screwing around&amp;rsquo; as a legitimate research methodology. Of course, historians don&amp;rsquo;t want to start deforming their sources. Or do they?&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-029.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;This is an experiment I created called &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/faces/?rsort=3&#34;&gt;The Real Face of White Australia&lt;/a&gt;. I always get a bit teary when I put this up. What I&amp;rsquo;ve done here is use computer vision software to extract portrait photographs from certificates which were used in the administration of the White Australia policy. These are records held by the National Archives of Australia. There are several thousand of these, and this is just from one series, and you can just keep scrolling and scrolling forever, or almost forever. So by manipulating the sources in these ways, by extracting those photographs, I&amp;rsquo;ve created a new way of seeing these records, and it&amp;rsquo;s quite powerful.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-030.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;But we can also be playful. You may have seen this. This is a little game that I created using the newspaper database. Again, it&amp;rsquo;s very simple. It just picks a newspaper article at random from the database and asks you to try and guess the year in which it was published. So any guesses for this one? What would we say? Let&amp;rsquo;s say 1850&amp;hellip; That&amp;rsquo;s a bit later than that&amp;hellip; Let&amp;rsquo;s see, it&amp;rsquo;s earlier. Okay, so you can keep going like this. You can go and try it out yourself later. As I said, it&amp;rsquo;s very simple, but it&amp;rsquo;s also strangely addictive. And of course, it&amp;rsquo;s also a way of exploring the content of Trove by screwing around.&lt;/p&gt;
&lt;p&gt;QueryPic, the Real Face of White Australia and newspaper roulette, my &lt;a href=&#34;https://headlineroulette.net&#34;&gt;Headline Roulette&lt;/a&gt;, also have something else in common. They are public. I want people to use them. I want people to have fun. I want people to be moved. I want people to find things, to be surprised, and to do history.&lt;/p&gt;
&lt;p&gt;Just yesterday I received an email from a self-confessed Australian history addict. Oh no, Australian history fanatic, sorry. And she had become addicted to Headline Roulette. She wanted to know if I could add a facility for users to save their scores, so presumably they could go back and see if they&amp;rsquo;d improved, or share them with their friends. So obviously the next step is the Facebook application. Other people have described to me how scrolling through the Real Face of White Australia brought them to tears. And I&amp;rsquo;ve come to realize that these sorts of interactions really mean more to me than a footnote in an academic article. I&amp;rsquo;ve probably just killed my hopes of an academic career there. I do want to use digital tools to deform history, in a way that makes it accessible to new audiences in new ways. And so I present to you, in honor of my Harold White Fellowship, a new experiment.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-031.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Now, I described to you before the process involved in calculating the TF-IDF values. What I didn&amp;rsquo;t describe was the fun that I had while I was doing it. It was really quite exciting and amusing and funny and all sorts of things, watching the words fly past on the screen. As I completed each year, I had a little script which would show me the top 20 words for that year. And anybody who follows me on Twitter will have a good picture of what was going on, because I couldn&amp;rsquo;t help but share this. They tell their own story. It was really like a sort of wonderful puzzle, as I say there, as they all came up. And then I started tweeting some of them.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-032.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-033.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;This was a nice one. I like the &amp;lsquo;hitler&amp;rsquo; with &amp;lsquo;mudguards&amp;rsquo;, &amp;lsquo;duchess&amp;rsquo;, &amp;lsquo;opossum&amp;rsquo;, &amp;lsquo;hollywood&amp;rsquo; and &amp;lsquo;canberra&amp;rsquo;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-034.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;And as I said here, of course, there&amp;rsquo;s got to be a novel in &amp;lsquo;prince&amp;rsquo;, &amp;lsquo;pronunciation&amp;rsquo;, &amp;lsquo;keyboard&amp;rsquo;, &amp;lsquo;zulu&amp;rsquo;, &amp;lsquo;begged&amp;rsquo;, &amp;lsquo;unbent&amp;rsquo;, &amp;lsquo;diddle&amp;rsquo;, &amp;lsquo;candlesticks&amp;rsquo;, &amp;lsquo;virtuoso&amp;rsquo;, &amp;lsquo;highness&amp;rsquo; and &amp;lsquo;pots&amp;rsquo;. This started me thinking: was there a way I could share this experience and use the TF-IDF values as a way of exploring my data set, a way of opening this experience to others, creating a sort of shifting, playful window on the future of the past? So this is my first attempt. Again, it&amp;rsquo;s public, so &lt;a href=&#34;https://wraggelabs.com/fotp/&#34;&gt;go play with it&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/selection-035.png&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;&#34;&gt;
&lt;p&gt;I&amp;rsquo;ve deliberately tried to keep most of the metadata away from this interface because I wanted the words to be the focus. And yes, it does look a bit like that fridge poetry thing, and that&amp;rsquo;s quite deliberate. At some stage, I want to add a box down here where you can drag your words, make your own sort of collections and tweet them. What it&amp;rsquo;s showing you is just a random selection of TF-IDF values from my sample. You can click on any one of these, and it goes away and sees, first of all, how many years have that value attached. If there&amp;rsquo;s only one year, then it&amp;rsquo;ll return that year. If it appears in more than one year, then it pulls out a random selection of values from those years. Let&amp;rsquo;s see if we can find one that has more than one year.&lt;/p&gt;
&lt;p&gt;Okay. Oh no, we&amp;rsquo;ve got 1943. I&amp;rsquo;m not doing a good job of it this time. Anyway, you can have fun with it. And of course, if you want to actually see what&amp;rsquo;s going on, you can click on these and it will actually load the articles here, and you can explore the text of them there and see where the word&amp;rsquo;s popping up.&lt;/p&gt;
&lt;p&gt;Okay, what is this? I&amp;rsquo;m not quite sure. It&amp;rsquo;s not really a discovery interface, although you can find interesting stuff. It&amp;rsquo;s not quite a game, but it is quite fun to explore. I&amp;rsquo;m sort of in love with it at the moment because, if you think about what I&amp;rsquo;m trying to do in terms of recapturing the presentness of the past, our experience of any moment is not just about the big stories of the day; it includes a whole lot of trivial aspects. And I love the way that this brings together Churchill and corpuscle and Melvin, whoever Melvin is. I love the mix of words, and to me it&amp;rsquo;s incredibly evocative. It makes you want to start imagining stories. It makes you want to explore, it makes you want to find out more, and it just has a wonderful, exciting aspect to it. So I&amp;rsquo;m not quite sure what I&amp;rsquo;m going to do with it or how I&amp;rsquo;m going to develop it, but really, as I say, I&amp;rsquo;m quite in love with it at the moment, and I hope you&amp;rsquo;ll have a play with it and see what you make of it.&lt;/p&gt;
&lt;p&gt;Could it be a discovery interface? I don&amp;rsquo;t know. It does enable you to get into my dataset, though obviously by rather indirect means, and it includes lots of randomness as well. And I&amp;rsquo;m a big fan of randomness in developing new ways of discovery. So there you go. Please take it away, enjoy, play. I may not have conquered the meaning of time yet, but experiments like this make me think about the form in which I present those sorts of arguments and ideas. How do we create resources which give that sense of disjunction and serendipity? So while I may not have achieved all I wanted to, I&amp;rsquo;ve come away with a better sense of what it is that I&amp;rsquo;m trying to do and what I want to do with this material. So thank you.&lt;/p&gt;
&lt;h2 id=&#34;questions&#34;&gt;Questions&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Marie-Louise Ayres:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Thanks very much, Tim. Just before we open up for questions, there were a few things I wanted to say. One is, look, two Australians have corrected more than a million lines of text each. So if you think you couldn&amp;rsquo;t correct 40,000 editorials, you are not being ambitious enough. That&amp;rsquo;s the first thing. The second thing to say is that our own Trove team have found that the only surname that is not in Trove is Kardashian. And the third is, I guess, just thinking about how amazingly creative these visualizations are that Tim has been doing, and I hope you&amp;rsquo;ll ask him about them.&lt;/p&gt;
&lt;p&gt;But the fourth thing I wanted to say is to pick you up on one of your early comments, where you said, &amp;ldquo;I haven&amp;rsquo;t got as far as I wanted.&amp;rdquo; Now that&amp;rsquo;s a very interesting construction that includes the past, the present, the future, and a spatial term as well. So maybe you need to think about that. I don&amp;rsquo;t know where Tim wanted to be, but I think we&amp;rsquo;d all agree he&amp;rsquo;s gotten a long, long way and done things that the rest of us probably haven&amp;rsquo;t even contemplated. So I&amp;rsquo;m hoping you&amp;rsquo;ll ask Tim some questions now, and then we&amp;rsquo;ll have more opportunities afterwards. It&amp;rsquo;s dark out there, so if you want to ask a question, can you just make sure you raise your hand and speak up? Don&amp;rsquo;t be shy. Yes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Q&amp;amp;A 1:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;How appropriate would your methodology be for spelling? Where I&amp;rsquo;m coming from is that I know the Australian Labor Party and the British Labour Party spell their names differently. And I can remember once going through microfilms of the Sydney Morning Herald in the 1920s, and it dawned on me that all those spellings were American. So there must be things happening where you could compare how words are spelled. Or do all the correctors corrupt your data just by correcting?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim Sherratt:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I certainly think you could do that sort of analysis. And actually, one of the nice examples that the culturomics guys used with the Google Books data was looking at changes in irregular verbs over time, which is quite interesting. But their data set goes back quite a long way. And there are challenges. One of the challenges in working with Trove – I mean, obviously the interface is geared towards discovery at the moment, making sure that people find what they&amp;rsquo;re after. But that means that sometimes, if you want to find something exact, it can be a bit tricky. You&amp;rsquo;ve got to know how to turn off the fuzziness in the searching. And sometimes you are foiled in that by the fact that people might&amp;rsquo;ve tagged something, and the search by default also searches the tags and the comments.&lt;/p&gt;
&lt;p&gt;So when I did my first World War I graph – I don&amp;rsquo;t know whether you saw it – you may have noted that there was a little peak for World War I actually during the First World War, which is sort of interesting if you think about it. And that&amp;rsquo;s because people had tagged those articles with World War I. So again, one thing which I would always emphasize as we start to do this research is that we have to develop our literacy in terms of understanding search interfaces and how they work, and be prepared to go into the documentation, to look at the advanced searches and how they work, and to actually start experimenting a bit with what different searches bring back, so that you can have a good picture. Obviously the institutions themselves have a role in communicating this and exposing what&amp;rsquo;s going on behind the scenes. But I think it&amp;rsquo;s an important literacy for researchers going into the future, being able to pull these things apart to understand exactly what&amp;rsquo;s going on. So yes, you can do those sorts of quite detailed, fine-grained comparisons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Q&amp;amp;A 2:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m interested in what year the picture of White Australia was from, and was that the people that were actually accepted, or what?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim Sherratt:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Certainly not. Well, okay. I should say that this is part of a broader project called Invisible Australians. If you just go to InvisibleAustralians.org, there&amp;rsquo;s a lot more information about what we&amp;rsquo;re trying to do with these records. That particular set of records, as I think I said – those photographs were just pulled from one series within the National Archives of Australia, and there are many more series like that. They are a series of certificates. Basically, if a person deemed non-white who was living in Australia wanted to travel overseas and get back into the country again&amp;hellip;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Q&amp;amp;A 2:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Those people that lived in Australia rather than people that tried.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim Sherratt:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;No, because&amp;hellip; Yeah, there was no trying about it. So yes, it is people who were living in Australia. And this is what is particularly interesting about these records, because what we want to do – and this isn&amp;rsquo;t Trove related, I&amp;rsquo;m sorry – is to actually try and extract the biographical information which is contained within those certificates, in order to find out more about the community who was living under the White Australia policy: people who were living here, whose various activities were restricted in a number of ways by the White Australia policy in all its legislative forms. And we&amp;rsquo;re bringing to bear a number of digital techniques to try and do that. As I said, in that particular case, it was a facial recognition script which pulled out the photographs, but we&amp;rsquo;re also harvesting material from the National Archives and doing some topic modeling, as I showed there, on some of the records to try and pull out clusters within those records. So anyway, check it out. It&amp;rsquo;s something I&amp;rsquo;m very passionate about.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Q&amp;amp;A 3:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m just wondering – it seems like the [inaudible 00:57:20] is a bit of an issue, as you talked about. And I guess there are probably a couple of aspects to that. One is the computer vision technology that&amp;rsquo;s involved, and the other part is what you do after you&amp;rsquo;ve got that. Is there anything clever that you can do to guarantee IDF or whatever else to try and make better quality? I don&amp;rsquo;t know if you can describe how you see the next few years panning out with that? Do you think that there&amp;rsquo;s a lot of improvement going on?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim Sherratt:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s certainly a lot of work going on, and there are techniques which could be used. There are people developing specific language models that tell you that if you have a certain combination of letters, then after that combination you can expect a certain range of letters but not others, for example. And you can use those in probabilistic techniques to go across the text and see what&amp;rsquo;s likely to be at particular points. And specifically relating to digitized newspapers, there&amp;rsquo;s stuff going on. There&amp;rsquo;s a project in the US called Mapping Texts, which works with Chronicling America, the digitized newspapers from America. They went through and did something very similar to what I did, but with a bigger budget and access to Stanford&amp;rsquo;s resources – things like the topic modeling and word frequencies. They also did what&amp;rsquo;s called named entity recognition, which is pulling out people and places from the texts, and they ran through their sample set and generated figures for OCR accuracy. And they&amp;rsquo;re fairly similar to my figures, actually.&lt;/p&gt;
&lt;p&gt;So there&amp;rsquo;s obviously a lot of recognition of this in Europe – there&amp;rsquo;s actually a particular research group which has been looking at methods of improving OCR. And of course there are many cases which are much more complex than this, if you think about old Germanic scripts or something like that. So there&amp;rsquo;s a lot of interest, a lot of concern, and a lot of work going on. I think it&amp;rsquo;s something that we have to be prepared to revisit over time – there are going to be more possibilities for doing stuff with computers as this comes online, and so we need to constantly reassess what&amp;rsquo;s actually possible and see what we can do.&lt;/p&gt;
&lt;p&gt;But yeah, I think, given the general awareness of the problem and the problems that it causes, there&amp;rsquo;s certainly going to be a lot of improvement. And I think it&amp;rsquo;s really exciting that we&amp;rsquo;re now starting to get these collections of digitized newspapers all around the world, and the possibilities that opens up for doing comparative stuff. What I didn&amp;rsquo;t mention about QueryPic is that you can also access New Zealand newspapers. It uses the DigitalNZ API to access Papers Past, so you can actually do graphs for New Zealand papers. But what you can&amp;rsquo;t do meaningfully is compare Australian and New Zealand results, and that&amp;rsquo;s because the DigitalNZ search currently searches the titles of articles and not the full text. Wouldn&amp;rsquo;t it be really nice if we were both searching the same things and we could do those sorts of comparisons – and we could do it with the US, and we could do it with Canada. I think there are some really interesting possibilities there.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Q&amp;amp;A 4:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I was just going to say &amp;lsquo;stine&amp;rsquo; is very interesting there, because in my opinion it&amp;rsquo;s obviously a column break from Palestine. That&amp;rsquo;s a common sort of OCR error, E and S being a fragile combination. And not only that, but the rules of line breaking tend to produce chunks like that. It also shows how the TF-IDF is working. Palestine itself as a whole doesn&amp;rsquo;t appear there because it&amp;rsquo;s not actually that important, but &amp;lsquo;stine&amp;rsquo; got promoted because it&amp;rsquo;s extremely uncommon, and &amp;lsquo;Pale&amp;rsquo; got dropped off – because of course &amp;lsquo;pale&amp;rsquo; is a very common word.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim Sherratt:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Yeah, that&amp;rsquo;s nice. But yes, as you go through this, you will see other instances where the OCR issue comes up again. That&amp;rsquo;s also another nice example of thinking about how, using computational techniques, we can start to improve some of the OCR – by looking at the way words break and seeing if we can use that in some way. Thanks, that&amp;rsquo;s great.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Q&amp;amp;A 5:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Just before, you said the word evocative, Tim, and I was saying to myself, evocative – so I wanted to talk about that for a minute rather than talk about a technical thing. It seems to me this is really interesting, and I just want you to say more about it. Is this a different kind of historical mode, this kind of desire to treat the past, to evoke rather than to necessarily narrativize or analyze, or define or pin down? Is there something distinctive about this evocative mode which is to do with the digital techniques, or what&amp;rsquo;s going on?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim Sherratt:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Yeah, this is the thing I&amp;rsquo;ve actually been trying to grapple with over the last few weeks as I started playing with this stuff. And I don&amp;rsquo;t know exactly what it is. All I know is what I feel, as you do, when you see it. These things do make you start to think in different ways, and to imagine and make connections. I think with your work, with Cath&amp;rsquo;s work on Semble, there are possibilities for creating spaces which encourage people to make connections, to see relationships between things. And I think digital technologies do lend themselves to that because – I don&amp;rsquo;t know – as I said, I actually think randomness is something which is rather undervalued in terms of exploration and discovery. And as you know, there&amp;rsquo;s another project that we worked on called The History Wall at the National Museum of Australia, and that brought together material in quite a random fashion. It was, again, quite evocative in terms of being able to see the possible relations between items there.&lt;/p&gt;
&lt;p&gt;As I said, I don&amp;rsquo;t know what it lends itself to. What is the process? Is it discovery? Is it a prompt for research questions? I don&amp;rsquo;t know. But it just seems to me to be something which is worth exploring more, and I find that it&amp;rsquo;s something I keep doing, so I must be interested in it in some way. It&amp;rsquo;s definitely worth thinking about some more. There are all sorts of ways in which you can develop an evocative sense – obviously historical photographs give you a different sort of feeling from seeing a text description. So yeah, I don&amp;rsquo;t know. Really interesting question.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Q&amp;amp;A 6:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Tim, I&amp;rsquo;d just like to congratulate you on your work. I find this really interesting, and I think it&amp;rsquo;s really great that researchers like yourself take our data and play around with it, as you said, because ultimately some of those ideas do lead to really useful actual applications. And I just wanted to say your OCR accuracy results are actually bang on, because we did quite a bit of research on that five years ago before we launched, and it was 65% to 70%, which is of course low – which was why we said, how can we change that and get the public to help?&lt;/p&gt;
&lt;p&gt;But you&amp;rsquo;re quite right – as time moves on and as these big data sets are made more open and available, people develop technologies to improve that. Five years ago, an automated way to improve it didn&amp;rsquo;t exist. We now know of at least three other people like yourself who have figured out how they could really increase that OCR accuracy rate. So I guess the question I would have is how some of this really fantastic research, like the work you mentioned in Europe, can be built back in to improve our services. But I just wanted to ask, did you find it really useful having the API to get at that data? Because I know that was a dream we&amp;rsquo;d had for a long time, and I know you waited a long time.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tim Sherratt:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Well, I didn&amp;rsquo;t wait. I didn&amp;rsquo;t wait, did I? I actually just went ahead and did it myself.&lt;/p&gt;
&lt;p&gt;Yeah, look, the background for those who don&amp;rsquo;t know is that I built my own unofficial API at one point, which I used to do some experiments. But an official API obviously makes a whole lot of things easier. First of all, from the point of view of doing the large data dumps, you&amp;rsquo;re not downloading the whole web page and all the stuff on it – you&amp;rsquo;re just getting the data in a structured way. Great. And as anybody who follows my work will know, I had a number of frustrating experiences where things changed on the web page, and everything I created broke and I had to fix it.&lt;/p&gt;
&lt;p&gt;So APIs do away with all that. It&amp;rsquo;s fantastic. But one of the really good things I like about having API access is how easy it makes it to do something like Headline Roulette. If you have an idea and you&amp;rsquo;ve got a bit of coding experience, you can act on it and you actually build something. And that to me is the most exciting aspect and encouraging people to actually experiment. That&amp;rsquo;s what it&amp;rsquo;s all about to me, is creating an environment where people do experiment with this stuff and build things.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.15771695&#34;&gt;&lt;img src=&#34;https://zenodo.org/badge/DOI/10.5281/zenodo.15771695.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A brief and biased history of Trove Twitter bots</title>
      <link>https://updates.timsherratt.org/2025/06/19/a-brief-and-biased-history.html</link>
      <pubDate>Thu, 19 Jun 2025 12:08:35 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/06/19/a-brief-and-biased-history.html</guid>
      <description>&lt;p&gt;The socials recently alerted me to an &lt;a href=&#34;https://doi.org/10.1177/13548565251334087&#34;&gt;interesting article&lt;/a&gt; by Dominique Carlon, Jean Burgess, and Kateryna Kasianenko on the history of community-created Twitter bots. The article explores bot-making within the context of Twitter&amp;rsquo;s rise and fall, and provides a handy taxonomy of bot species. However, it doesn&amp;rsquo;t include any Australian bots amidst the examples. That&amp;rsquo;s a bit disappointing, as I remember the bot-building years as a time of great fun and creativity. My own contribution to the world of Twitter bots was mainly focused on &lt;a href=&#34;https://trove.nla.gov.au/&#34;&gt;Trove&lt;/a&gt; (what a surprise!), so I thought I might as well jot down a few incomplete and biased notes about the history of Trove Twitter bots.&lt;/p&gt;
&lt;h2 id=&#34;trove-tweeting-trends&#34;&gt;Trove tweeting trends&lt;/h2&gt;
&lt;p&gt;It just so happens that I recently packaged up some &lt;a href=&#34;https://updates.timsherratt.org/2025/06/10/new-dataset-trove-links-shared.html&#34;&gt;data about Trove links shared on Twitter&lt;/a&gt;. Using this data, we can get a broad perspective on the activity of Trove Twitter bots between 2009 and 2020. The identification of bots is based on the Trove bots list I maintained on Twitter, so it&amp;rsquo;s possible I&amp;rsquo;ve missed some.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;In total, 43 bots posted 318,767 tweets containing 270,474 unique Trove urls between June 2013 and December 2020.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a chart showing the total number of links to Trove shared by Twitter bots each year.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/bot-tweets-per-year.png&#34; width=&#34;600&#34; height=&#34;334&#34; alt=&#34;Bar chart showing the number of Trove links tweeted by bots per year from 2013 to 2020. The number of links rises dramatically in 2018 and reaches a peak in 2019, before dropping away in 2020.&#34;&gt;
&lt;p&gt;Most of the bots shared digitised newspaper articles, but some shared works from other Trove zones. This chart breaks the links down by the type of resource (&amp;lsquo;article&amp;rsquo; equals newspaper article).&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/bot-tweets-year-type.png&#34; width=&#34;600&#34; height=&#34;295&#34; alt=&#34;Bar chart showing the number of Trove links tweeted by bots per year from 2013 to 2020, with the type of linked resource indicated by colour. The number of links rises dramatically in 2018 and reaches a peak in 2019, before dropping away in 2020. The vast proportion of links are to newspaper articles, with a fairly consistent number going to other types of resources.&#34;&gt;
&lt;p&gt;And one final chart showing the number of active bots per year.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/active-bots-year.png&#34; width=&#34;600&#34; height=&#34;347&#34; alt=&#34;Bar chart showing the number of bots actively tweeting Trove links by year from 2013 to 2020. The numbers rise slowly to 2017, then rise dramatically in 2018, reaching a peak in 2019, and falling away in 2020.&#34;&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;year&lt;/th&gt;
&lt;th&gt;active bots&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2013&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2014&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2015&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2016&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2018&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2019&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;From the data above you can see that bot activity grew slowly between 2013 and 2017, before taking off dramatically in 2018. The peak year for Trove bots was 2019, when 38 individual bots shared more than 100,000 links to Trove. But a mass extinction event in 2020 almost halved the number of active bots. So what happened?&lt;/p&gt;
&lt;h2 id=&#34;build-a-bot-begins&#34;&gt;Build-a-bot begins&lt;/h2&gt;
&lt;p&gt;In June 2013, inspired by bot creators like Mark Sample, I hooked the Trove API up to Twitter to see what would happen when &lt;a href=&#34;https://discontents.com.au/conversations-with-collections/index.html&#34;&gt;GLAM collections joined online social spaces&lt;/a&gt;. The result was @TroveNewsBot, sharing digitised newspaper articles from Trove.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/trovenewsbot-twitter.png&#34; width=&#34;520&#34; height=&#34;315&#34; alt=&#34;Screen capture of TroveNewsBot&#39;s original Twitter profile.&#34;&gt;
&lt;p&gt;Twitter bots started popping up around the world, sharing collection items from Europeana, the Digital Public Library of America, DigitalNZ, the Cooper Hewitt Museum, and the Brooklyn Museum, amongst others. But @TroveNewsBot was always a bit different. Instead of just sharing randomly selected resources, @TroveNewsBot helped people explore Trove without leaving Twitter. If you tweeted keywords at the bot, it would run a search using the API and tweet back the most relevant result. By adding hashtags, users could &lt;a href=&#34;https://github.com/wragge/trovenewsbot&#34;&gt;control a variety of search parameters&lt;/a&gt; – for example, if you included the hashtag #luckydip you&amp;rsquo;d get back a random article from your search results.&lt;/p&gt;
&lt;p&gt;My favourite bot behaviour was its &amp;lsquo;opinionator&amp;rsquo; mode. If you tweeted a url at @TroveNewsBot, it would retrieve the link, extract keywords from the text, and then search for those keywords in Trove&amp;rsquo;s newspapers. This enabled @TroveNewsBot to have conversations with other online resources – for example, it replied to tweets from DPLA and DigitalNZ, &lt;a href=&#34;https://wakelet.com/wake/fa91d582-33e5-400f-9c27-b6c1c5b992b8&#34;&gt;finding connections between different collections&lt;/a&gt;. I also used the &amp;lsquo;opinionator&amp;rsquo; mode to set up a dialogue between past and present. Several times a day, the bot would grab keywords from the latest news items on the ABC (later the Guardian) website, search for historic newspaper articles, and then tweet both stories, old and new.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/life-on-the-outside.041.jpg&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;Slide illustrating TroveNewsBot&#39;s opinionator mode. There are three images: a screen capture from an ABC news article about the Scottish Independence Referendum, a Trove newspaper article headed &#39;Scottish Independence&#39; from 1928, and a TroveNewsBot tweet linking the two.&#34;&gt;
&lt;p&gt;&lt;em&gt;@TroveNewsBot&amp;rsquo;s opinionator mode in action – a slide from my keynote presentation &lt;a href=&#34;https://doi.org/10.5281/zenodo.3566879&#34;&gt;&amp;lsquo;Life on the outside: connections, contexts, and the wild, wild web&amp;rsquo;&lt;/a&gt; for the Annual Conference of the Japanese Association of Digital Humanities in 2014&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;As well as providing digitised content, such as the newspapers, Trove aggregates collection metadata from hundreds of organisations around Australia and makes it available through its own API. This meant that any organisation could use the Trove API to create a Twitter bot that shared items from &lt;em&gt;their own collection&lt;/em&gt;. To encourage more of this sort of experimentation, I created the &lt;a href=&#34;https://github.com/wragge/trovebuildabot&#34;&gt;Build-a-Bot Workshop&lt;/a&gt; GitHub repository. This repository included instructions and code for anyone wanting to build their own collection bot on top of the Trove API. Like @TroveNewsBot, these collection bots could share random items and respond to user queries.&lt;/p&gt;
&lt;p&gt;Before long, @CurtinLibBot was sharing photos from the Curtin University Library&amp;rsquo;s image collection, and @Kasparbot was tweeting about objects from the National Museum of Australia. By the end of 2013, I&amp;rsquo;d &lt;a href=&#34;https://discontents.com.au/an-addition-to-the-family/index.html&#34;&gt;added to the family&lt;/a&gt; by creating @TroveBot. While @TroveNewsBot dug into the digitised newspaper articles, its younger sibling looked for inspiration amongst Trove&amp;rsquo;s other zones – sharing books, journals, photos, maps and more.&lt;/p&gt;
&lt;p&gt;In 2015, Steve Leahy unleashed @TrovePenguinBot upon the world, searching for sardines amongst the digitised newspapers. In 2016, one of my students at the University of Canberra modified the &lt;a href=&#34;https://github.com/lolibrarian/NYPL-Emoji-Bot&#34;&gt;NYPL Emoji Bot code&lt;/a&gt; to create @TroveEmojiBot – if you tweeted an emoji at the bot, it would respond with a suitably-themed newspaper article. In 2017, the &lt;a href=&#34;https://digitisethedawn.org/&#34;&gt;Digitise the Dawn campaign&lt;/a&gt; bot-ified their Twitter account, posting an article each day from Louisa Lawson&amp;rsquo;s journal, &lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/252&#34;&gt;The Dawn&lt;/a&gt;. Meanwhile, @astrove_bot started sharing newspaper articles relating to astronomy.&lt;/p&gt;
&lt;p&gt;And then there was The Vintage Face Depot&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;things-get-weird&#34;&gt;Things get weird&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;d been experimenting for a few years with &lt;em&gt;faces&lt;/em&gt; as a way of connecting to GLAM collections – as alternative entry points, based not on metadata but &lt;a href=&#34;https://doi.org/10.5281/zenodo.3579530&#34;&gt;the people inside&lt;/a&gt;. In 2015, this led me to create &lt;a href=&#34;https://wragge.github.io/face-depot/&#34;&gt;The Vintage Face Depot&lt;/a&gt;. If you tweeted a photo of yourself to @facedepot, the bot would select a face at random from a collection I&amp;rsquo;d compiled from Trove newspapers and superimpose that face over yours, tweeting you back the result and a link to the original article, so you could find out more about the person you&amp;rsquo;d been matched with.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/unremembering-dh2015.038.jpg&#34; width=&#34;600&#34; height=&#34;450&#34; alt=&#34;Slide showing the operation of The Vintage Face Depot. There are three screen captures from Twitter. Each includes a portrait photo shared by a Twitter user, and facedepot&#39;s reply that includes a modified version of the photo with a face from Trove&#39;s newspapers overlaid, and a link to the original newspaper article.&#34;&gt;
&lt;p&gt;&lt;em&gt;@facedepot in action – a slide from my keynote presentation &lt;a href=&#34;https://doi.org/10.5281/zenodo.3566887&#34;&gt;&amp;lsquo;Unremembering the forgotten&amp;rsquo;&lt;/a&gt; for the Alliance of Digital Humanities Organizations Annual Conference in 2015&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Now, in a time of deep fakes and AI generated images, @facedepot&amp;rsquo;s efforts seem quaint and kludgy. But that was always the point. I wanted to mess around with the barriers that put some people on the other side of this wall we call the past – to explore what historian Devon Elliot suggested on Twitter was an &amp;lsquo;uncanny temporal valley&amp;rsquo;. As I argued in &lt;a href=&#34;https://discontents.com.au/the-perfect-face/&#34;&gt;The Perfect Face&lt;/a&gt;, a presentation at NDF2015:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Vintage Face Depot tells you nothing about yourself. I built it at about the same time as Microsoft launched their How-Old bot that uses machine learning to estimate your age. Face Depot does nothing clever, and yet sometimes the results are uncanny, even unsettling. Microsoft might be able to tell you how old you are, but  Face Depot asks &lt;em&gt;who&lt;/em&gt; you are and pushes you in the direction of a past life, linked merely through chance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/vintage-faces.gif&#34; width=&#34;600&#34; height=&#34;400&#34; alt=&#34;Animated gif showing some images generated during testing of facedepot&#34;&gt;
&lt;h2 id=&#34;glitch-bots-for-all&#34;&gt;Glitch bots for all&lt;/h2&gt;
&lt;p&gt;While I&amp;rsquo;d shared some bot-building code, rolling your own bot still required access to a web-connected server – a significant barrier for most would-be experimenters. This changed in 2017 with the arrival of Glitch, a platform that enabled anyone to build simple web apps for free. Perhaps most importantly, Glitch apps were remixable – simply by clicking a button, you could open an editor and create your own customised version of any app.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/glitch-trove-bots.png&#34; width=&#34;600&#34; height=&#34;372&#34; alt=&#34;Screen capture from the Trove page in Glitch, showing the four bot templates.&#34;&gt;
&lt;p&gt;Glitch seemed like an ideal environment in which to experiment with bots, so I created four remixable Trove Twitter bot recipes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;trove-collection-bot&lt;/strong&gt; – sharing resources from a partner collection&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;trove-list-bot&lt;/strong&gt; – sharing items from a Trove list&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;trove-title-bot&lt;/strong&gt; – sharing articles from specific newspapers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;trove-tag-bot&lt;/strong&gt; – sharing items with specific tags&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These were supported by a &lt;a href=&#34;https://101dhhacks.net/2018/01/21/trove-bots-for-all/&#34;&gt;detailed tutorial&lt;/a&gt; that walked through the process of customisation and suggested ways in which the basic recipes could be extended – for example, by adding a specific search query to a title bot.&lt;/p&gt;
&lt;p&gt;This was the beginning of the bot explosion, with more than 30 Trove Twitter bots born between 2017 and 2019.&lt;/p&gt;
&lt;p&gt;One of these, @NTTimesGazette, was created by curator and journalist Caddie Brain to tweet articles from the &lt;em&gt;Northern Territory Times and Gazette&lt;/em&gt;. The bot was featured on ABC radio in Darwin under the headline: &lt;a href=&#34;https://www.abc.net.au/news/2018-02-16/trove-twitter-unearths-history-newspaper-nt-times-and-gazette/9445458&#34;&gt;Twitter bot offers a rare look inside Darwin&amp;rsquo;s forgotten first newspaper&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Historian Brett Holman created a series of bots related to aviation history. More than just a source of amusement, the bots became part of Brett&amp;rsquo;s research practice, as described in his &lt;em&gt;History Australia&lt;/em&gt; article &lt;a href=&#34;https://doi.org/10.17613/9h30-ke82&#34;&gt;&#39;@TroveAirRaidBot, a 24/7/365 research assistant&#39;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Perhaps the best part of this bot-making extravaganza was the number of self-professed &amp;lsquo;non coders&amp;rsquo; who were able to take their first steps into the world of programming and actually &lt;em&gt;create something&lt;/em&gt;. I have memories of sitting in the shade at Canberra&amp;rsquo;s now defunct Big Splash water park, troubleshooting someone&amp;rsquo;s Twitter bot on my phone, while the kids played on the water slides – it was fun, and it was exciting. Together, Trove, Twitter, and Glitch opened up new possibilities for learning and experimentation, and new ways of knowing Australia&amp;rsquo;s cultural heritage.&lt;/p&gt;
&lt;h2 id=&#34;2019-bot-roll-call&#34;&gt;2019 bot roll call&lt;/h2&gt;
&lt;p&gt;As new bots emerged, I added them to my Trove bots Twitter list (here&amp;rsquo;s a &lt;a href=&#34;https://web.archive.org/web/20180627053546/https://twitter.com/wragge/lists/trove-bots/members&#34;&gt;partially archived copy&lt;/a&gt;).&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/trove-bots.png&#34; width=&#34;600&#34; height=&#34;451&#34; alt=&#34;Screen capture from the Trove bots list in Twitter, showing some of the Trove bots.&#34;&gt;
&lt;p&gt;You can get an idea of their diversity from the bot names – a mix of collections, subjects, and places. Here&amp;rsquo;s a list of Trove Twitter bots active in 2019:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;astrove_bot&lt;/li&gt;
&lt;li&gt;AustWWBot&lt;/li&gt;
&lt;li&gt;BotCBR_QLD&lt;/li&gt;
&lt;li&gt;CatsofTrove&lt;/li&gt;
&lt;li&gt;digitisethedawn&lt;/li&gt;
&lt;li&gt;DoSonTrove&lt;/li&gt;
&lt;li&gt;facedepot&lt;/li&gt;
&lt;li&gt;Kasparbot&lt;/li&gt;
&lt;li&gt;KellyGangBot&lt;/li&gt;
&lt;li&gt;LAAL_bot&lt;/li&gt;
&lt;li&gt;NTTimesGazette&lt;/li&gt;
&lt;li&gt;PenrithPictures&lt;/li&gt;
&lt;li&gt;RemixHistorical&lt;/li&gt;
&lt;li&gt;suthlib&lt;/li&gt;
&lt;li&gt;TroveAirBot&lt;/li&gt;
&lt;li&gt;TroveBot&lt;/li&gt;
&lt;li&gt;TrovecakeBot&lt;/li&gt;
&lt;li&gt;TroveCHIAbot&lt;/li&gt;
&lt;li&gt;TroveDutchbot&lt;/li&gt;
&lt;li&gt;TroveEmojiBot&lt;/li&gt;
&lt;li&gt;trovefacesbot&lt;/li&gt;
&lt;li&gt;TroveHoroscopes&lt;/li&gt;
&lt;li&gt;Troveknitbot&lt;/li&gt;
&lt;li&gt;Trovelandbot&lt;/li&gt;
&lt;li&gt;trovelistbot&lt;/li&gt;
&lt;li&gt;TroveMirrorBot&lt;/li&gt;
&lt;li&gt;TroveNewsBot&lt;/li&gt;
&lt;li&gt;TrovePenguinBot&lt;/li&gt;
&lt;li&gt;TroveRefereeBot&lt;/li&gt;
&lt;li&gt;trovesportsmel&lt;/li&gt;
&lt;li&gt;trovetribunebot&lt;/li&gt;
&lt;li&gt;TroveXmasBot&lt;/li&gt;
&lt;li&gt;TsvBulletinBot&lt;/li&gt;
&lt;li&gt;WomenAtWarBot&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also &lt;a href=&#34;https://wragge.github.io/trovenewsbot2019/&#34;&gt;overhauled @TroveNewsBot&lt;/a&gt; in 2019, adding a number of new features, including article thumbnails.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/trovenewsbot-example.png&#34; width=&#34;600&#34; height=&#34;722&#34; alt=&#34;Screen capture from Twitter showing TroveNewsBot&#39;s reply to the query &#39;library robot&#39;. The reply includes details of the article &#39;Mystery of the week: Robot murder in the library&#39; as well as a thumbnail image of the article.&#34;&gt;
&lt;h2 id=&#34;decline-and-fall&#34;&gt;Decline and fall&lt;/h2&gt;
&lt;p&gt;This golden age of bot-making came to an end late in 2019.&lt;/p&gt;
&lt;p&gt;The first blow came when &lt;a href=&#34;https://updates.timsherratt.org/2019/10/09/creators-and-users.html&#34;&gt;Trove updated its API&lt;/a&gt;. The bots needed some way of selecting random items from the millions available on Trove. This was fairly easy with version one of the API, but version two overhauled the way you accessed items within the result set, making random selections impossible. I eventually managed to hack together &lt;a href=&#34;https://glam-workbench.net/trove-random/&#34;&gt;a random-ish method&lt;/a&gt; that added multiple facets to whittle down the result set until a selection could be made. Using this method, I &lt;a href=&#34;https://updates.timsherratt.org/2019/11/07/the-death-and.html&#34;&gt;created new versions of my Glitch bot recipes&lt;/a&gt; and &lt;a href=&#34;https://101dhhacks.net/trove-bots-for-all/&#34;&gt;updated the tutorial&lt;/a&gt;. But it seemed that the moment had passed, and many bot authors just let their creations die when version one of the API was switched off.&lt;/p&gt;
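&lt;p&gt;The general shape of that facet-whittling approach can be sketched in a few lines of Python. Here &lt;code&gt;count_results&lt;/code&gt; stands in for a Trove API call that returns the size of a result set, and the facet names and threshold are illustrative rather than the values the real method uses.&lt;/p&gt;

```python
import random

def random_item(count_results, facet_options, threshold=100):
    """Apply random facet values one at a time to whittle down a
    result set until it's small enough to pick an item from."""
    applied = {}
    for facet, values in facet_options.items():
        if count_results(applied) <= threshold:
            break
        applied[facet] = random.choice(values)
    # Choose a random position within the whittled-down result set
    return applied, random.randrange(count_results(applied))
```

&lt;p&gt;A real implementation would also need to handle facet combinations that return no results.&lt;/p&gt;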
&lt;p&gt;Surviving bots faced further challenges when Glitch started imposing limits on its free services. Glitch apps were designed to sleep when not in use, so to get your bot tweeting you had to fire regular web requests at it using a cron service. Glitch blocked access by these services and introduced a paid tier for &amp;lsquo;always on&amp;rsquo; apps. More bots died as a result.&lt;/p&gt;
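&lt;p&gt;The keep-alive mechanism itself was trivial – just an HTTP GET fired at the app&amp;rsquo;s URL on a schedule. A minimal sketch, with an injectable &lt;code&gt;fetch&lt;/code&gt; function so it can be exercised without a live app:&lt;/p&gt;

```python
from urllib.request import urlopen

def wake_bot(url, fetch=urlopen):
    # A sleeping Glitch app wakes on any incoming web request, so a
    # cron service only needs to GET the app's URL every few minutes
    with fetch(url) as response:
        return response.status
```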
&lt;p&gt;I was thinking about switching my recipes from Glitch to GitHub, making use of templates and scheduled actions. But while I procrastinated, Twitter started on its long, drawn-out death spiral – first imposing new limits on API use, and later becoming the preferred networking site for Nazis and transphobes. It was no place for creative bot-making.&lt;/p&gt;
&lt;h2 id=&#34;the-serious-side-of-serendipity&#34;&gt;The serious side of serendipity&lt;/h2&gt;
&lt;p&gt;Bot-making wasn&amp;rsquo;t just about fun – Trove Twitter bots had a serious purpose as well. In &lt;a href=&#34;https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/section/be608100-95b6-4e48-bfd5-a82a588da8f1&#34;&gt;&amp;lsquo;Unremembering the forgotten&amp;rsquo;&lt;/a&gt; I wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Twitter bots can interrupt our social media meanderings with pinpoints of surprise, conflict, and meaning. And yet they are lightweight, almost disposable, in their development and implementation. No committees were formed, no grants were obtained—they are quick and creative: hacks in the best sense of the word. Bots are an example of how digital skills and tools allow us to try things, to build and play, without any expectation of significance or impact. We can experiment with the parameters of access.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A number of articles on the value of serendipity have considered how collection bots, like @TroveNewsBot, can puncture our research expectations. The random offerings of bots might offer new modes of discovery. In &lt;a href=&#34;https://muse.jhu.edu/article/585974&#34;&gt;&amp;lsquo;Technologies of Serendipity&amp;rsquo;&lt;/a&gt;, Paul Fyfe argues:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For scholars or other readers, discovery results less from directed searching than from all the tangents encountered on the way. Thus, sources which are plural, redundant, and tangent-rich help promote discovery by the proliferating contingencies of their usage.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Similarly, &lt;a href=&#34;https://doi.org/10.17613/9h30-ke82&#34;&gt;Brett Holman notes&lt;/a&gt; that his own Trove bots help him make connections in his research:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;By impinging on my consciousness when I am preoccupied by other things, @TroveAirRaidBot’s tweets draw my mind back to this research topic that is always sitting at the back of my mind somewhere, and it makes me make connections – randomly, haphazardly, but often very fruitfully leading me to think of something I hadn’t thought of before, or reminding me of something I’d forgotten, or juxtaposing some seemingly unrelated things. It’s a kind of directed serendipity.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Trove Twitter bots were also entry points and interventions – challenging our understanding of access. They offered playful demonstrations of how our experience of GLAM collections might be different. Mitchell Whitelaw &lt;a href=&#34;http://olh.openlibhums.org/articles/10.16995/olh.291/&#34;&gt;suggested that such creations&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;reflect an emerging interest in collections as active sites of meaning-making, and experimentation with how we might encounter such collections in an everyday digital environment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In &lt;a href=&#34;https://doi.org/10.5281/zenodo.3566879&#34;&gt;&amp;lsquo;Life on the outside&amp;rsquo;&lt;/a&gt;, I considered the lives that GLAM collections might lead beyond institutional confines:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;These bots do not simply present collection items outside of the familiar context of discovery interfaces or online exhibitions, they move the encounter itself into a wholly new space. &amp;hellip; Twitter bots loosen the institutional context of collections to allow them to participate in a space where people already congregate. They send collection items out into the wilds of the web, to find new meanings, new connections and perhaps even new love.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The promise of serendipitous discovery has now faded with the poisoning of social media spaces, and the retreat of many GLAM organisations from experimentation and openness. The need to control now carries more weight than the gift of creativity.&lt;/p&gt;
&lt;h2 id=&#34;what-remains&#34;&gt;What remains&lt;/h2&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-06-18-17-33-22.png&#34; width=&#34;600&#34; height=&#34;653&#34; alt=&#34;Photograph of a Raspberry Pi on a table top. Stuck on to the top of the Pi is a photo of a robot from Trove&#39;s newspapers – this photo was also used as TroveNewsBot&#39;s avatar on Twitter.&#34;&gt;
&lt;p&gt;I migrated &lt;a href=&#34;https://wraggebots.net/@trovenewsbot&#34;&gt;@TroveNewsBot to the Fediverse&lt;/a&gt; in May 2023, but sadly it was killed when &lt;a href=&#34;https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html&#34;&gt;NLA gatekeepers cancelled my Trove API keys&lt;/a&gt; without warning in January 2025.&lt;/p&gt;
&lt;p&gt;A number of other Trove bots have survived the Twitter implosion and found their way to alternate platforms. &lt;a href=&#34;https://ausglam.space/@digitisethedawn&#34;&gt;@DigitiseTheDawn&lt;/a&gt; now shares articles on the Fediverse, while &lt;a href=&#34;https://bsky.app/profile/trovepenguinbot.bsky.social&#34;&gt;@TrovePenguinBot&lt;/a&gt; is pursuing sardines on Bluesky. Brett Holman has created new versions of his aviation-themed bots – &lt;a href=&#34;https://bsky.app/profile/troveairbot.airminded.org&#34;&gt;@TroveAirBot&lt;/a&gt;, &lt;a href=&#34;https://bsky.app/profile/troveairraidbot.airminded.org&#34;&gt;@TroveAirRaidBot&lt;/a&gt;, and &lt;a href=&#34;https://bsky.app/profile/troveufobot.airminded.org&#34;&gt;@TroveUFOBot&lt;/a&gt; – on Bluesky. I&amp;rsquo;d be happy to add the details of any other survivors I might have missed.&lt;/p&gt;
&lt;p&gt;In an odd coincidence, recent months have brought &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;new restrictions on access to Trove API keys&lt;/a&gt;, and an announcement of the end of Glitch. There&amp;rsquo;s no going back.&lt;/p&gt;
&lt;p&gt;ActivityPub and the Fediverse seem to offer new digital channels through which collections might flow and connect. See, for example, &lt;a href=&#34;https://millsfield.sfomuseum.org/blog/2024/03/12/activitypub/&#34;&gt;Aaron Straup Cope&amp;rsquo;s work&lt;/a&gt; at the SFO Museum. But how do we support and encourage this type of experimentation?&lt;/p&gt;
&lt;p&gt;Personally speaking, this year&amp;rsquo;s been pretty shit so far, and I&amp;rsquo;ve been having trouble finding any motivation. But in pulling together these notes I found a section in &lt;a href=&#34;https://dhdebates.gc.cuny.edu/read/untitled-f2acf72c-a469-49d8-be35-67f9ac1e3a60/&#34;&gt;&amp;lsquo;Unremembering the forgotten&amp;rsquo;&lt;/a&gt; that reminded me of what&amp;rsquo;s at stake:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;There is no open access to the past. There is no key we can enter to recall a life. I create these projects not because I want to contribute to some form of national memory, but because I want to unsettle what it means to remember: to go beyond the listing of names and the cataloging of files to develop modes of access that are confusing, challenging, inspiring, uncomfortable, and sometimes creepy.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There&amp;rsquo;s still plenty of work to do.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.15694209&#34;&gt;&lt;img src=&#34;https://zenodo.org/badge/DOI/10.5281/zenodo.15694209.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Some Archives Week goodies</title>
      <link>https://updates.timsherratt.org/2025/06/11/some-archives-week-goodies.html</link>
      <pubDate>Wed, 11 Jun 2025 17:40:11 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/06/11/some-archives-week-goodies.html</guid>
      <description>&lt;p&gt;It&amp;rsquo;s &lt;a href=&#34;https://www.ica.org/international-archives-week-2025-archives-are-accessible-archives-for-everyone/&#34;&gt;International Archives Week&lt;/a&gt; and I&amp;rsquo;m feeling a bit crook after being double-vaxxed yesterday, so instead of doing something productive, I&amp;rsquo;m just going to make a list of potentially handy archives-related resources from the Wonderful World of Wragge(TM).&lt;/p&gt;
&lt;p&gt;The theme of Archives Week is &lt;strong&gt;#ArchivesAreAccessible&lt;/strong&gt;, which you&amp;rsquo;d have to regard as rather aspirational given the various ways access is limited by law, policy, practice, technology, and history. But what the heck, discussions about &lt;a href=&#34;https://doi.org/10.5281/zenodo.5035855&#34;&gt;the meaning of &lt;em&gt;access&lt;/em&gt;&lt;/a&gt; are always welcome. It&amp;rsquo;s also a little jarring to see the #ArchivesAreAccessible theme being promoted by the National Archives of Australia just a few weeks after they &lt;a href=&#34;https://updates.timsherratt.org/2025/05/19/no-more-harvesting-data-from.html&#34;&gt;implemented new restrictions that make it impossible to get machine-readable data out of their online database&lt;/a&gt;, RecordSearch. But I&amp;rsquo;m trying to move on, so&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;zotero&#34;&gt;Zotero&lt;/h2&gt;
&lt;p&gt;All Australian archives users should have Zotero installed. Through the magic of user-contributed &amp;lsquo;translators&amp;rsquo;, &lt;a href=&#34;https://www.zotero.org/&#34;&gt;Zotero&lt;/a&gt; can capture structured data and digitised images from a variety of collections, including:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://ozglam.chat/t/zotero-translator-for-recordsearch-updated/27&#34;&gt;National Archives of Australia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2024/08/22/new-zotero-translators.html&#34;&gt;PROV&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2024/08/22/new-zotero-translators.html&#34;&gt;Queensland State Archives&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2022/07/14/calling-all-tasmanian.html&#34;&gt;State Library and Archives of Tasmania&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;State Records Office of WA&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first four are my work, so &lt;a href=&#34;https://timsherratt.au/&#34;&gt;let me know&lt;/a&gt; if you have any suggestions or problems.&lt;/p&gt;
&lt;h2 id=&#34;indexes-to-records&#34;&gt;Indexes to records&lt;/h2&gt;
&lt;p&gt;Archives are well represented in the GLAM Workbench&amp;rsquo;s &lt;a href=&#34;https://glam-workbench.net/glam-datasets-from-gov-portals/&#34;&gt;list of GLAM datasets shared through government open data portals&lt;/a&gt;. Many of these datasets are indexes that link records to people and places. They&amp;rsquo;re openly licensed and &lt;a href=&#34;https://muse.jhu.edu/article/794331&#34;&gt;underused&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;NSW State Archives has also compiled a lot of indexes. These aren&amp;rsquo;t shared through a portal, but you can &lt;a href=&#34;https://glam-workbench.net/nsw-state-archives/&#34;&gt;harvest them from their website&lt;/a&gt;. To save you the effort, I&amp;rsquo;ve &lt;a href=&#34;https://glam-workbench.net/nsw-state-archives/index-repository/&#34;&gt;created a repository of the harvested indexes&lt;/a&gt;.&lt;/p&gt;
 &lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-04-09-15-19-57.png&#34; width=&#34;600&#34; height=&#34;624&#34; alt=&#34;Screenshot of the main search page of GLAM Name Indexes&#34;&gt;
&lt;p&gt;I&amp;rsquo;ve pulled many of these sources together to build a mega database of name indexes that lets you search for people across millions (yes millions) of records. As well as the sources described above, it also includes &lt;a href=&#34;https://updates.timsherratt.org/2025/04/09/more-than-million-rows-of.html&#34;&gt;people-related records from the Public Record Office Victoria&amp;rsquo;s API&lt;/a&gt;. Altogether, the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; contains almost 13 million records in 293 datasets from 10 GLAM organisations. And unlike the commercial genealogical databases, it&amp;rsquo;s free! (If only Australian libraries and archives would link to it from their family history guides&amp;hellip;)&lt;/p&gt;
&lt;h2 id=&#34;other-datasets&#34;&gt;Other datasets&lt;/h2&gt;
&lt;p&gt;Before my scrapers were scuppered by the NAA, I managed to compile a few datasets. Much of this data documents the way RecordSearch itself has changed, and while it might not be of use to researchers seeking particular records, it could &lt;a href=&#34;https://updates.timsherratt.org/2024/09/20/preserving-the-history.html&#34;&gt;help future researchers&lt;/a&gt; who are trying to understand the impact of online collections on the practice of history. This includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Summary data about all series in RecordSearch – a CSV file containing basic descriptive information about all the series currently registered on RecordSearch as well as the total number of items described, digitised, and in each access category. Harvests from &lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/series_totals_May_2021.csv&#34;&gt;May 2021&lt;/a&gt; and &lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/series_totals_April_2022.csv&#34;&gt;April 2022&lt;/a&gt; are currently available, and I&amp;rsquo;ll soon be adding May 2025.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.14744050&#34;&gt;Files digitised by the National Archives of Australia since 2021&lt;/a&gt; – annual compilations of data harvested from RecordSearch&amp;rsquo;s list of recently digitised files. (The automated weekly harvests are now dead.)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.14769172&#34;&gt;Records held by the National Archives of Australia with the access status of &amp;lsquo;closed&amp;rsquo;&lt;/a&gt; – a &lt;em&gt;whole decade&lt;/em&gt; of annual harvests of records held by the NAA that have the access status of &amp;lsquo;closed&amp;rsquo; (withheld from public access). The harvests were run on or about 1 January each year from 2016 to 2025. The aim in saving this data is to enable long-term analysis of the NAA&amp;rsquo;s access examination process.&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/071c7cad60.png&#34; width=&#34;600&#34; height=&#34;571&#34; alt=&#34;Screenshot of visualisation showing the relationships between Australian government agencies over time&#34;&gt;
&lt;p&gt;Thanks to &lt;a href=&#34;https://wikimedia.org.au/wiki/Exploring_government_departments_by_linking_Wikidata_to_the_National_Archives_of_Australia&#34;&gt;support from Wikimedia Australia&lt;/a&gt;, I&amp;rsquo;ve also added information about Australian government agencies from RecordSearch to WikiData. As a result, you can get a list of Australian government departments since Federation using this &lt;a href=&#34;https://w.wiki/5tVh&#34;&gt;Wikidata query&lt;/a&gt;. I&amp;rsquo;ve used the data to build &lt;a href=&#34;https://glam-workbench.net/wikidata/examples/govt-agencies-network.html&#34;&gt;this interactive visualisation&lt;/a&gt; of the relationships between government departments. There are more examples in the &lt;a href=&#34;https://glam-workbench.net/wikidata/&#34;&gt;Wikidata section of the GLAM Workbench&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Also from the NAA is my &lt;a href=&#34;https://github.com/wragge/diy-redactionart&#34;&gt;collection of #redactionart&lt;/a&gt; found in ASIO surveillance files.&lt;/p&gt;
&lt;p&gt;As part of the &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/&#34;&gt;Real Face of White Australia&lt;/a&gt; project, we&amp;rsquo;ve been transcribing records created by the administration of the White Australia Policy, now held by the NAA. Some of the results are available in &lt;a href=&#34;https://github.com/wragge/realface-data&#34;&gt;this data repository&lt;/a&gt;. (Note to self – I need to update this with the latest data!)&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/anu-archives/&#34;&gt;ANU Archives section of the GLAM Workbench&lt;/a&gt; includes some datasets extracted from the Sydney Stock exchange stock and share lists. (Just noticed some CloudStor links in there that I need to fix&amp;hellip;)&lt;/p&gt;
&lt;h2 id=&#34;public-record-office-victoria&#34;&gt;Public Record Office Victoria&lt;/h2&gt;
&lt;p&gt;PROV gets its own section because, as far as I know, they&amp;rsquo;re the only Australian archives with a &lt;a href=&#34;https://prov.vic.gov.au/about-us/our-blog/new-prov-public-api&#34;&gt;functioning public API&lt;/a&gt;. (Brief moment of silence to remember the APIs that have come and gone over the years&amp;hellip;).&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s now a &lt;a href=&#34;https://glam-workbench.net/prov/&#34;&gt;PROV section of the GLAM Workbench&lt;/a&gt; that includes &lt;a href=&#34;https://glam-workbench.net/prov/getting-started/&#34;&gt;a &amp;lsquo;getting started&amp;rsquo; notebook&lt;/a&gt; documenting the basic functionality of the API. There&amp;rsquo;s some &lt;a href=&#34;https://updates.timsherratt.org/2025/04/30/new-prov-section-added-to.html&#34;&gt;more information in this blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve also used the PROV API to create &lt;a href=&#34;https://updates.timsherratt.org/2025/04/10/using-the-public-record-office.html&#34;&gt;an automated data dashboard&lt;/a&gt; that provides an overview of their collection. It&amp;rsquo;s updated every Sunday.&lt;/p&gt;
&lt;h2 id=&#34;other-things&#34;&gt;Other things&lt;/h2&gt;
&lt;p&gt;RecordSearch users will understand the frustration of trying to share a URL to a record, only to get an annoying error. There are a few ways around this (Zotero saves persistent links to things you save), but for a quick fix I created a simple tool to &lt;a href=&#34;https://recordsearch-links.glitch.me/&#34;&gt;create persistent links in RecordSearch&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re comfortable with a little browser hacking, you can also &lt;a href=&#34;https://gist.github.com/wragge/b2af9dc56f7cb0a9476b#file-recordsearch_show_pages-user-js&#34;&gt;install this handy RecordSearch userscript&lt;/a&gt; (scroll to the bottom for installation instructions). It improves the functionality of RecordSearch in a few different ways, such as by indicating the number of pages in a digitised file.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/68747470733a2f2f646c2e64726f70626f7875736572636f6e74656e742e636f6d2f732f666.png&#34; width=&#34;600&#34; height=&#34;467&#34; alt=&#34;Screenshot of RecordSearch showing the number of pages in digitised files&#34;&gt;
&lt;h2 id=&#34;any-ideas&#34;&gt;Any ideas?&lt;/h2&gt;
&lt;p&gt;If you have any ideas for additional resources or datasets, or you&amp;rsquo;re having problems with an online collection, feel free to drop a note in the &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench.github.io/issues&#34;&gt;GLAM Workbench repository&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New dataset – Trove links shared on Twitter, 2009 to 2020</title>
      <link>https://updates.timsherratt.org/2025/06/10/new-dataset-trove-links-shared.html</link>
      <pubDate>Tue, 10 Jun 2025 12:30:54 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/06/10/new-dataset-trove-links-shared.html</guid>
      <description>&lt;p&gt;A few years ago, I harvested the details of tweets that included links to Trove. The data has just been sitting on my computer, so I thought I should package it up and share it, in case it&amp;rsquo;s of use to anyone.&lt;/p&gt;
&lt;p&gt;The story is that back in 2021, I was working on the article &lt;a href=&#34;https://doi.org/10.5281/zenodo.5595420&#34;&gt;&amp;lsquo;More than newspapers&amp;rsquo;&lt;/a&gt; for a special section of &lt;em&gt;History Australia&lt;/em&gt; focusing on Trove. I was thinking that I might include something about the way Trove newspaper articles were mobilised within online discussions about history – a topic I first explored in &lt;a href=&#34;https://doi.org/10.5281/zenodo.3566879&#34;&gt;&amp;lsquo;Life on the outside: connections, contexts, and the wild, wild web&amp;rsquo;&lt;/a&gt;, my keynote for the Annual Conference of the Japanese Association of Digital Humanities in 2014. In the end, the article went in another direction, so I didn&amp;rsquo;t use the data.&lt;/p&gt;
&lt;p&gt;I remembered this recently and thought I should do something with it. I&amp;rsquo;ve now created a dataset and &lt;a href=&#34;https://doi.org/10.5281/zenodo.15627800&#34;&gt;shared it on Zenodo&lt;/a&gt;. I&amp;rsquo;m &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;not working on Trove any more&lt;/a&gt;, but I&amp;rsquo;m hoping that someone else might find the data useful!&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.15694063&#34;&gt;&lt;img src=&#34;https://zenodo.org/badge/DOI/10.5281/zenodo.15694063.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The dataset contains information about tweets from 2009 to 2020 that include links to &lt;a href=&#34;https://trove.nla.gov.au/&#34;&gt;Trove&lt;/a&gt;. The tweet data was compiled using &lt;a href=&#34;https://twarc-project.readthedocs.io/en/latest/&#34;&gt;Twarc&lt;/a&gt; in May 2021, under Twitter&amp;rsquo;s academic access program. The search queries used were:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;url:nla.gov.au/nla.news&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;url:trove.nla.gov.au&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;url:newspapers.nla.gov.au&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Many of the tweets were produced by bots. Fortunately, I&amp;rsquo;d been maintaining a list of &lt;a href=&#34;https://web.archive.org/web/20180627053546/https://twitter.com/wragge/lists/trove-bots/members&#34;&gt;Trove bots&lt;/a&gt; on Twitter, so I used the list to separate the tweets into two files, one for bots and one for ordinary users.&lt;/p&gt;
&lt;p&gt;To respect user intentions and comply with the Twitter API terms of use, I removed all the tweet information except for &lt;code&gt;tweet_id&lt;/code&gt; and &lt;code&gt;tweet_date&lt;/code&gt; from the files. If it hasn&amp;rsquo;t been deleted, the full data for each tweet can probably be obtained from the X API using the &lt;code&gt;tweet_id&lt;/code&gt;, though you might need a paid subscription.&lt;/p&gt;
&lt;p&gt;The two main files are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;trove_url_tweets.csv&lt;/code&gt; – links shared by human users (although it may include some unidentified bots)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;trove_url_tweets_bots.csv&lt;/code&gt; – links shared by bots&lt;/li&gt;
&lt;/ul&gt;
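&lt;p&gt;The splitting and stripping described above can be sketched with pandas. The sample rows and the &lt;code&gt;screen_name&lt;/code&gt; column are illustrative stand-ins for the real Twarc output, not the actual harvested fields.&lt;/p&gt;

```python
import pandas as pd

# Stand-ins for the harvested Twarc data and the bot list – the
# sample rows and the "screen_name" column are illustrative only
tweets = pd.DataFrame([
    {"tweet_id": "1", "tweet_date": "2019-01-01", "screen_name": "someuser"},
    {"tweet_id": "2", "tweet_date": "2019-01-02", "screen_name": "TroveNewsBot"},
])
bot_names = ["TroveNewsBot", "TroveEmojiBot"]

# Split the tweets into bot and human sets, keeping only the id and date
is_bot = tweets["screen_name"].isin(bot_names)
shared_columns = ["tweet_id", "tweet_date"]
humans = tweets.loc[~is_bot, shared_columns]
bots = tweets.loc[is_bot, shared_columns]
humans.to_csv("trove_url_tweets.csv", index=False)
bots.to_csv("trove_url_tweets_bots.csv", index=False)
```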
&lt;p&gt;I also created some additional data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;trove_url_totals.csv&lt;/code&gt; – the number of times each Trove link was shared by users (not including bots)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;active_users_per_year.csv&lt;/code&gt; – the number of unique users each year who shared a link to Trove&lt;/li&gt;
&lt;li&gt;&lt;code&gt;active_bots_per_year.csv&lt;/code&gt; – the number of active bots each year sharing links to Trove&lt;/li&gt;
&lt;/ul&gt;
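&lt;p&gt;These summary files are simple aggregations over the full harvest (computed before the extra fields were stripped). A pandas sketch of the idea – the &lt;code&gt;user&lt;/code&gt; and &lt;code&gt;trove_url&lt;/code&gt; column names and sample values are illustrative:&lt;/p&gt;

```python
import pandas as pd

# Sample rows standing in for the full harvest, before the extra
# fields were stripped – column names and values are illustrative
tweets = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "tweet_date": ["2019-05-01", "2019-06-02", "2020-01-03"],
    "user": ["alice", "bob", "alice"],
    "trove_url": ["article1", "article1", "article2"],
})
tweets["year"] = pd.to_datetime(tweets["tweet_date"]).dt.year

# Number of times each Trove link was shared
url_totals = tweets["trove_url"].value_counts()

# Number of unique users sharing Trove links each year
active_users_per_year = tweets.groupby("year")["user"].nunique()
```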
&lt;p&gt;There&amp;rsquo;s more information about the structure and contents of the data files &lt;a href=&#34;https://doi.org/10.5281/zenodo.15694063&#34;&gt;in the Zenodo record&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;overview&#34;&gt;Overview&lt;/h2&gt;
&lt;p&gt;I haven&amp;rsquo;t explored the data in detail, but here are some quick summaries to give you a taste.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;summary&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;number of unique users sharing Trove links&lt;/td&gt;
&lt;td&gt;9,296&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;number of bots sharing Trove links&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;number of tweets by humans containing Trove links&lt;/td&gt;
&lt;td&gt;48,323&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;number of tweets by bots containing Trove links&lt;/td&gt;
&lt;td&gt;318,767&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;number of unique links shared by humans&lt;/td&gt;
&lt;td&gt;36,906&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;number of unique links shared by bots&lt;/td&gt;
&lt;td&gt;270,474&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What types of links were people sharing?&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;types of link shared by humans&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;newspaper article&lt;/td&gt;
&lt;td&gt;34,568&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;other (search queries, home page etc)&lt;/td&gt;
&lt;td&gt;8,388&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;work (items other than newspapers – books, maps, photos etc)&lt;/td&gt;
&lt;td&gt;4,856&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;newspaper page&lt;/td&gt;
&lt;td&gt;1,378&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;newspaper title&lt;/td&gt;
&lt;td&gt;406&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;How did the number of links shared by humans vary across time?&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/tweets-per-year-updated.png&#34; width=&#34;600&#34; height=&#34;306&#34; alt=&#34;Bar chart showing the number of Trove links shared on Twitter by year from 2009 to 2020. Colours indicate the type of Trove resource.&#34;&gt;
&lt;p&gt;Which articles or pages were shared most often by humans? Here&amp;rsquo;s the top ten (click on the link to view).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;trove_id&lt;/th&gt;
&lt;th&gt;trove_type&lt;/th&gt;
&lt;th&gt;tweets&lt;/th&gt;
&lt;th&gt;retweets&lt;/th&gt;
&lt;th&gt;quotes&lt;/th&gt;
&lt;th&gt;total times shared&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/75869223&#34;&gt;75869223&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;article&lt;/td&gt;
&lt;td&gt;1,232&lt;/td&gt;
&lt;td&gt;61&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;1,327&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/1298497&#34;&gt;1298497&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;article&lt;/td&gt;
&lt;td&gt;141&lt;/td&gt;
&lt;td&gt;1,028&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;1,222&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/102074798&#34;&gt;102074798&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;article&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;693&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;td&gt;844&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/68141866&#34;&gt;68141866&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;article&lt;/td&gt;
&lt;td&gt;138&lt;/td&gt;
&lt;td&gt;522&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;708&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/41602327&#34;&gt;41602327&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;article&lt;/td&gt;
&lt;td&gt;633&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;663&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/100645214&#34;&gt;100645214&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;article&lt;/td&gt;
&lt;td&gt;111&lt;/td&gt;
&lt;td&gt;467&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;598&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/page/502650&#34;&gt;502650&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;page&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;513&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;526&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/60828173&#34;&gt;60828173&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;article&lt;/td&gt;
&lt;td&gt;48&lt;/td&gt;
&lt;td&gt;444&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;511&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/4173156&#34;&gt;4173156&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;article&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;321&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/79410604&#34;&gt;79410604&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;article&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;303&lt;/td&gt;
&lt;td&gt;69&lt;/td&gt;
&lt;td&gt;374&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The most shared article reports that PM Menzies had described Hitler as a &amp;lsquo;great man&amp;rsquo; at a meeting in July 1939. However, most of the tweets sharing this link came from a single user. A number of the other articles relate to the weather, a reflection of the fact that Trove&amp;rsquo;s newspaper articles have been mobilised on both sides of the climate change debate.&lt;/p&gt;
&lt;p&gt;How many Twitter users were sharing links to Trove each year?&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/tweets-humans-per-year-updated.png&#34; width=&#34;600&#34; height=&#34;340&#34; alt=&#34;Bar chart showing the number of Twitter users sharing links to Trove each year from 2009 to 2020&#34;&gt;
&lt;p&gt;I haven&amp;rsquo;t included any of the bot data in these summaries because I think I&amp;rsquo;ll write a second bot-themed post – coming soon!&lt;/p&gt;
&lt;h2 id=&#34;updates&#34;&gt;Updates&lt;/h2&gt;
&lt;p&gt;I updated the data in this post on 19 June 2025, as I realised some Twitter accounts were originally run by humans before being bot-ified.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>GLAM Workbench – preprint for &#39;Building User-Friendly Toolkits and Platforms for Digital Humanities&#39;</title>
      <link>https://updates.timsherratt.org/2025/06/05/glam-workbench-preprint-for-building.html</link>
      <pubDate>Thu, 05 Jun 2025 16:16:53 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/06/05/glam-workbench-preprint-for-building.html</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a preprint of my contribution to the publication &amp;lsquo;Building User-Friendly Toolkits and Platforms for Digital Humanities&amp;rsquo;. It provides a brief overview of the GLAM Workbench. I had to leave a lot out, but hopefully it provides a useful summary of what the GLAM Workbench is, and what I&amp;rsquo;d like it to be.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.15597924&#34;&gt;&lt;img src=&#34;https://zenodo.org/badge/DOI/10.5281/zenodo.15597924.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;The GLAM Workbench is a collection of tools and resources created to help researchers use and explore the digital collections of GLAM organisations (galleries, libraries, archives, and museums).&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt; It&amp;rsquo;s mainly focused on collections from Australia and New Zealand, but some sections venture across international boundaries to explore topics such as web archives and Wikidata.&lt;/p&gt;
&lt;p&gt;GLAM organisations make a lot of rich cultural data available online, but getting that data in a machine-readable form that can be aggregated and analysed is often difficult. The GLAM Workbench tries to fill this gap by providing code examples and API documentation, but data access alone is not enough. Researchers need to understand the history, structure, and extent of the data – both its limits and its possibilities. By sharing snapshots, building overviews, and exploring patterns and inconsistencies, the GLAM Workbench also attempts to contextualise GLAM collections and open them to new types of questions.&lt;sup id=&#34;fnref:2&#34;&gt;&lt;a href=&#34;#fn:2&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h2 id=&#34;history-and-motivation&#34;&gt;History and motivation&lt;/h2&gt;
&lt;p&gt;I created the GLAM Workbench in 2017, but it incorporates the latest versions of tools, such as the Trove Newspaper Harvester, which I&amp;rsquo;ve been maintaining for more than 15 years.&lt;sup id=&#34;fnref:3&#34;&gt;&lt;a href=&#34;#fn:3&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;3&lt;/a&gt;&lt;/sup&gt; One of my motivations was simply to bring together useful snippets, notes, and doodles from a variety of blog posts, web applications, and code repositories, and make them available in a form that could be more easily navigated and maintained.&lt;/p&gt;
&lt;p&gt;I was also keen to explore the way that Jupyter notebooks combine code and narrative. I wanted to find ways to support researchers as they developed their digital skills and confidence, not just dump them at the command line or point them to an app.&lt;/p&gt;
&lt;p&gt;The ongoing development of the GLAM Workbench is also part of my own research. I&amp;rsquo;m interested in the meaning of access within the context of GLAM collections. What changes when you can download data and explore collections beyond the limitations of the web interface?&lt;sup id=&#34;fnref:4&#34;&gt;&lt;a href=&#34;#fn:4&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h2 id=&#34;contents-and-technologies&#34;&gt;Contents and technologies&lt;/h2&gt;
&lt;p&gt;At its heart, the GLAM Workbench comprises at least 171 Jupyter notebooks and 59 datasets shared through more than 70 GitHub repositories.&lt;sup id=&#34;fnref:5&#34;&gt;&lt;a href=&#34;#fn:5&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;5&lt;/a&gt;&lt;/sup&gt; Added to this are a number of web apps, online databases, and guides to related resources. Code from some notebooks has also been spun off into independent Python packages. All of this is brought together within a single documentation site, built using MkDocs Material.&lt;/p&gt;
&lt;p&gt;The contents are mostly organised by institution, reflecting the idiosyncrasies of the data. I&amp;rsquo;ve partially implemented tags to draw together similar resources across institutions, but this needs to be made more consistent, ideally using the TaDiRAH taxonomy.&lt;sup id=&#34;fnref:6&#34;&gt;&lt;a href=&#34;#fn:6&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;6&lt;/a&gt;&lt;/sup&gt; Many of the notebooks describe methods for accessing data and building datasets. Others demonstrate techniques for visualisation and analysis, suggest workarounds for limits imposed by collection interfaces, or provide example-driven documentation for APIs and datasets.&lt;/p&gt;
&lt;p&gt;There is no single platform or server underlying the GLAM Workbench. Instead, it follows a pattern described in the ARDC Community Data Lab&amp;rsquo;s architecture principles as &amp;lsquo;infrastructure at rest&amp;rsquo;.&lt;sup id=&#34;fnref:7&#34;&gt;&lt;a href=&#34;#fn:7&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;7&lt;/a&gt;&lt;/sup&gt; Notebooks can be run as required in a variety of contexts from cloud services to local computers. This is made possible by standardised configuration files and automated processes that build virtual computing environments from each GitHub repository.&lt;sup id=&#34;fnref:8&#34;&gt;&lt;a href=&#34;#fn:8&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;8&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
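&lt;p&gt;As a minimal illustration (not copied from any particular GLAM Workbench repository, and with hypothetical version pins), the kind of standardised configuration read by build tools such as Jupyter&amp;rsquo;s repo2docker can be as small as two files at the root of a repository:&lt;/p&gt;

```text
# requirements.txt: the Python packages the notebooks depend on
jupyterlab==4.1.5
pandas==2.2.1
requests==2.31.0

# runtime.txt: the Python version used to build the environment
python-3.10
```

&lt;p&gt;Given files like these, a service such as Binder (or a local repo2docker build) can reconstruct the same environment on demand, which is what lets the notebooks sit &amp;lsquo;at rest&amp;rsquo; until someone runs them.&lt;/p&gt;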
&lt;h2 id=&#34;impact-and-engagement&#34;&gt;Impact and engagement&lt;/h2&gt;
&lt;p&gt;The GLAM Workbench has helped to expand understanding of the research possibilities of GLAM collection data. The list of publications citing the GLAM Workbench or one of its embedded tools now includes more than 100 entries.&lt;sup id=&#34;fnref:9&#34;&gt;&lt;a href=&#34;#fn:9&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;9&lt;/a&gt;&lt;/sup&gt; Some of these relate to individual research projects, while others survey the practices of GLAM organisations and the needs of research infrastructure around the world.&lt;/p&gt;
&lt;p&gt;My work on the GLAM Workbench has helped inspire organisations such as the National Library of Scotland to explore new ways of supporting digital research.&lt;sup id=&#34;fnref:10&#34;&gt;&lt;a href=&#34;#fn:10&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;10&lt;/a&gt;&lt;/sup&gt; A recent report from the &amp;lsquo;Towards a National Collection&amp;rsquo; project in the UK has mentioned the GLAM Workbench alongside a number of national libraries in Europe and the USA for &amp;lsquo;encouraging innovative research and expanding public engagement with heritage resources&amp;rsquo;.&lt;sup id=&#34;fnref:11&#34;&gt;&lt;a href=&#34;#fn:11&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;11&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;And yet, there are disappointments. Most of the Australian GLAM organisations whose collections are featured in the GLAM Workbench have shown little interest in sharing or engaging with its resources. This makes it difficult to get tools to the people who could benefit from them. There&amp;rsquo;s some irony in the fact that the websites of the National Library of Scotland, the British Library, the UK National Archives, the V&amp;amp;A Museum, and DigitalNZ all include links to the GLAM Workbench, but the National Library of Australia (NLA) and the National Archives of Australia (NAA) do not.&lt;/p&gt;
&lt;h2 id=&#34;maintenance-and-sustainability&#34;&gt;Maintenance and sustainability&lt;/h2&gt;
&lt;p&gt;While a number of individuals have contributed notebooks and additions to the GLAM Workbench, it remains essentially a one-man operation. Over the years, I&amp;rsquo;ve sought to ease the maintenance burden by automating processes, adding some basic testing frameworks, and generating machine-readable metadata that summarises the contents of each repository. For example, I created a GLAM Workbench repository template that makes it easy to start work on a new topic.&lt;sup id=&#34;fnref:12&#34;&gt;&lt;a href=&#34;#fn:12&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;12&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;p&gt;Development of the web archives section of the GLAM Workbench was made possible by a grant from the International Internet Preservation Consortium, and the section&amp;rsquo;s ongoing maintenance is supported by the British Library.&lt;sup id=&#34;fnref:13&#34;&gt;&lt;a href=&#34;#fn:13&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;13&lt;/a&gt;&lt;/sup&gt; I&amp;rsquo;m grateful too for my GitHub sponsors who help cover some of my cloud hosting bills, and to the ARDC for funding to integrate RO-Crate metadata.&lt;sup id=&#34;fnref:14&#34;&gt;&lt;a href=&#34;#fn:14&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;14&lt;/a&gt;&lt;/sup&gt; But beyond this, the GLAM Workbench has received no dedicated funding or institutional support. It has, nonetheless, outlived some well-funded digital infrastructure projects in the HASS sector.&lt;/p&gt;
&lt;p&gt;Sustainability means more than money, though. The GLAM Workbench doesn&amp;rsquo;t have to continue in its current form to have a long-term impact. My focus is on ensuring that its contents are open to future reuse and modification. Everything is openly licensed, published through GitHub, and preserved in Zenodo. If tools are useful they can live on, independent of me.&lt;/p&gt;
&lt;h2 id=&#34;the-future&#34;&gt;The future&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m writing this at a difficult time. Changes wrought by the NLA and NAA in early 2025 have made it impossible for me to continue work on the Trove and RecordSearch sections of the GLAM Workbench.&lt;sup id=&#34;fnref:15&#34;&gt;&lt;a href=&#34;#fn:15&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;15&lt;/a&gt;&lt;/sup&gt; In the Trove section alone, there are more than 70 notebooks.&lt;/p&gt;
&lt;p&gt;The GLAM Workbench is not my job; no one pays me. I work on it because I think it&amp;rsquo;s useful and important, and because I enjoy the process of solving problems and helping researchers. The NLA&amp;rsquo;s actions, in particular, have robbed me of that joy, and made me consider whether I want to continue. Research infrastructure is people.&lt;/p&gt;
&lt;p&gt;On the other hand, there are many more GLAM collections for me to explore. I&amp;rsquo;m also hoping to find new ways of collaborating with individuals and institutions. I&amp;rsquo;m often inspired to create new tools and resources by gnarly questions from researchers. As long as such questions keep coming, the GLAM Workbench will keep growing.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;references&#34;&gt;References&lt;/h2&gt;
&lt;p&gt;Ames, Sarah, and Lucy Havens. “Exploring National Library of Scotland Datasets with Jupyter Notebooks.” &lt;em&gt;IFLA Journal&lt;/em&gt;, December 27, 2021. &lt;a href=&#34;https://doi.org/10.1177/03400352211065484&#34;&gt;doi.org/10.1177/0&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Bailey, Rebecca, Javier Pereda, Chris Michaels, and Tom Callahan. “Unlocking the Potential of Digital Collections. A Call to Action.” Towards a National Collection, November 21, 2024. &lt;a href=&#34;https://doi.org/10.5281/zenodo.13838916&#34;&gt;doi.org/10.5281/z&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Candela, Gustavo, Sally Chambers, and Tim Sherratt. “An Approach to Assess the Quality of Jupyter Projects Published by GLAM Institutions.” &lt;em&gt;Journal of the Association for Information Science and Technology&lt;/em&gt; 74, no. 13 (2023): 1550–64. &lt;a href=&#34;https://doi.org/10.1002/asi.24835&#34;&gt;doi.org/10.1002/a&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;“GLAM Workbench (GitHub Organisation).” Accessed June 5, 2025. &lt;a href=&#34;https://github.com/GLAM-Workbench&#34;&gt;github.com/GLAM-Work&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;IIPC. “Asking Questions with Web Archives – Introductory Notebooks for Historians.” Accessed June 5, 2025. &lt;a href=&#34;https://netpreserve.org/projects/jupyter-notebooks-for-historians/&#34;&gt;netpreserve.org/projects/&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Jackson, Andy. “GLAM Workbench Update.” UK Web Archive Blog. Accessed June 2, 2025. &lt;a href=&#34;https://blogs.bl.uk/webarchive/2022/09/glam-workbench-update.html&#34;&gt;blogs.bl.uk/webarchiv&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sefton, Peter, Tom Honeyman, Tim Sherratt, and Conal Tuohy. “The ARDC Community Data Lab Architecture: Research Software Deployment Principles and Patterns for Integrity, Reproducibility and Sustainability,” May 10, 2024. &lt;a href=&#34;https://zenodo.org/records/11169744&#34;&gt;zenodo.org/records/1&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sherratt, Tim. “Develop a New GLAM Workbench Repository.” GLAM Workbench. Accessed June 5, 2025. &lt;a href=&#34;https://glam-workbench.net/get-involved/developing-repositories/&#34;&gt;glam-workbench.net/get-invol&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “Farewell Trove.” &lt;em&gt;Tim Sherratt – Sharing Recent Updates and Work-in-Progress&lt;/em&gt;, May 7, 2025. &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;updates.timsherratt.org/2025/05/0&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “GLAM Workbench.” Zenodo, June 5, 2025. &lt;a href=&#34;https://doi.org/10.5281/zenodo.15597489&#34;&gt;doi.org/10.5281/z&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “GLAM Workbench.” Accessed June 5, 2025. &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;glam-workbench.net/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “GLAM Workbench Citations.” &lt;em&gt;GLAM Workbench&lt;/em&gt;. Accessed June 5, 2025. &lt;a href=&#34;https://glam-workbench.net/citations/&#34;&gt;glam-workbench.net/citations&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “Hacking Heritage: Understanding the Limits of Online Access.” In &lt;em&gt;The Routledge International Handbook of New Digital Practices in Galleries, Libraries, Archives, Museums and Heritage Sites&lt;/em&gt;, edited by H Lewi, W Smith, S Cooke, and D vom Lehn, 116–30. London &amp;amp; New York: Routledge, 2020. &lt;a href=&#34;https://doi.org/10.5281/zenodo.5035855&#34;&gt;doi.org/10.5281/z&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “No More Harvesting Data from the National Archives of Australia.” &lt;em&gt;Tim Sherratt – Sharing Recent Updates and Work-in-Progress&lt;/em&gt;, May 19, 2025. &lt;a href=&#34;https://updates.timsherratt.org/2025/05/19/no-more-harvesting-data-from.html&#34;&gt;updates.timsherratt.org/2025/05/1&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “Some Important Updates for the Trove Newspaper &amp;amp; Gazette Harvester.” &lt;em&gt;Tim Sherratt – Sharing Recent Updates and Work-in-Progress&lt;/em&gt;, August 31, 2023. &lt;a href=&#34;https://updates.timsherratt.org/2023/08/31/some-important-updates.html&#34;&gt;updates.timsherratt.org/2023/08/3&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “Supporters.” &lt;em&gt;GLAM Workbench&lt;/em&gt;. Accessed June 5, 2025. &lt;a href=&#34;https://glam-workbench.net/get-involved/supporters/&#34;&gt;glam-workbench.net/get-invol&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “Trove Newspapers: Data Dashboard.” Accessed June 5, 2025. &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;wragge.github.io/trove-new&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;———. “Trove-Newspaper-Harvester.” Python, October 23, 2023. &lt;a href=&#34;https://doi.org/10.5281/zenodo.7103174&#34;&gt;doi.org/10.5281/z&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Sherratt, Tim, Harry Keightley, Ben Foley, and Michael Niemann. “GLAM-Workbench/Glam-Workbench-Template.” Python. GLAM Workbench, August 24, 2023. &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench-template&#34;&gt;github.com/GLAM-Work&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;“TaDiRAH: The Taxonomy of Digital Research Activities in the Humanities.” Accessed June 5, 2025. &lt;a href=&#34;https://tadirah.info/&#34;&gt;tadirah.info/&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Talboom, Leontien, and Mark Bell. “Keeping It under Lock and Keywords: Exploring New Ways to Open up the Web Archives with Notebooks.” &lt;em&gt;Archival Science&lt;/em&gt;, July 4, 2022. &lt;a href=&#34;https://doi.org/10.1007/s10502-022-09391-6&#34;&gt;doi.org/10.1007/s&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;“Trove Historical Data.” Accessed June 5, 2025. &lt;a href=&#34;https://zenodo.org/communities/trove-historical-data/records?q=&amp;amp;l=list&amp;amp;p=1&amp;amp;s=10&amp;amp;sort=newest&#34;&gt;zenodo.org/communiti&amp;hellip;&lt;/a&gt;.&lt;/p&gt;
&lt;section class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Sherratt, “GLAM Workbench.”&amp;#160;&lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:2&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;See, for example: Sherratt, “Trove Newspapers: Data Dashboard” and “Trove Historical Data.”&amp;#160;&lt;a href=&#34;#fnref:2&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:3&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Sherratt, “Trove-Newspaper-Harvester.”&amp;#160;&lt;a href=&#34;#fnref:3&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:4&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;See, for example: Sherratt, “Hacking Heritage: Understanding the Limits of Online Access.”&amp;#160;&lt;a href=&#34;#fnref:4&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:5&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;“GLAM Workbench (GitHub Organisation).”&amp;#160;&lt;a href=&#34;#fnref:5&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:6&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;“TaDiRAH: The Taxonomy of Digital Research Activities in the Humanities.”&amp;#160;&lt;a href=&#34;#fnref:6&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:7&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Sefton et al., “The ARDC Community Data Lab Architecture.”&amp;#160;&lt;a href=&#34;#fnref:7&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:8&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;For more on best practices in sharing Jupyter projects, see: Candela, Chambers, and Sherratt, “An Approach to Assess the Quality of Jupyter Projects Published by GLAM Institutions.”&amp;#160;&lt;a href=&#34;#fnref:8&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:9&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Sherratt, “GLAM Workbench Citations.”&amp;#160;&lt;a href=&#34;#fnref:9&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:10&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Ames and Havens, “Exploring National Library of Scotland Datasets with Jupyter Notebooks.” For another example of the GLAM Workbench&amp;rsquo;s influence, see: Talboom and Bell, “Keeping It under Lock and Keywords.”&amp;#160;&lt;a href=&#34;#fnref:10&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:11&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Bailey et al., “Unlocking the Potential of Digital Collections. A Call to Action,” 58.&amp;#160;&lt;a href=&#34;#fnref:11&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:12&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Sherratt et al., “GLAM-Workbench/Glam-Workbench-Template.” For documentation see: Sherratt, “Develop a New GLAM Workbench Repository.”&amp;#160;&lt;a href=&#34;#fnref:12&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:13&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;“Asking Questions with Web Archives – Introductory Notebooks for Historians”; Jackson, “GLAM Workbench Update.”&amp;#160;&lt;a href=&#34;#fnref:13&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:14&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Sherratt, “Supporters”; Sherratt, “Some Important Updates for the Trove Newspaper &amp;amp; Gazette Harvester.”&amp;#160;&lt;a href=&#34;#fnref:14&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&#34;fn:15&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;Sherratt, “Farewell Trove”; Sherratt, “No More Harvesting Data from the National Archives of Australia.”&amp;#160;&lt;a href=&#34;#fnref:15&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</description>
    </item>
    
    <item>
      <title>No more harvesting data from the National Archives of Australia</title>
      <link>https://updates.timsherratt.org/2025/05/19/no-more-harvesting-data-from.html</link>
      <pubDate>Mon, 19 May 2025 18:57:26 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/05/19/no-more-harvesting-data-from.html</guid>
      <description>&lt;p&gt;A couple of weeks ago &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;I bid farewell to Trove&lt;/a&gt; due to the cancellation of my API keys and the NLA&amp;rsquo;s lack of transparency around changes to API access. Now it seems I have to wave goodbye to 16+ years of work on RecordSearch, the National Archives of Australia&amp;rsquo;s online database.&lt;/p&gt;
&lt;p&gt;I noticed this morning that my weekly &lt;a href=&#34;https://github.com/wragge/naa-recently-digitised&#34;&gt;harvest of recently digitised files&lt;/a&gt; in RecordSearch had failed. A quick check showed that my harvester was being blocked by Cloudflare&amp;rsquo;s bot protection software. I wasn&amp;rsquo;t really surprised. Websites are using tools like this to protect themselves against AI scraper bots, and I&amp;rsquo;d already seen it in action on another Australian government site. In the war between content providers and AI scrapers, researchers and digital preservation efforts are copping collateral damage.&lt;/p&gt;
&lt;p&gt;But while we can&amp;rsquo;t blame the NAA for safeguarding its systems, we can be critical of the fact that it still doesn&amp;rsquo;t provide its collection data in machine-readable form. There were a couple of datasets shared for a GovHack event many years ago, a short-lived API for the WWI service records in series B2455, and an API attached to a beta discovery service that never saw the light of day (despite many $$$ being spent on it). Without direct access to the data, researchers have had to scrape it from RecordSearch&amp;rsquo;s web interface. That&amp;rsquo;s no longer possible.&lt;/p&gt;
&lt;p&gt;I &lt;a href=&#34;https://discontents.com.au/tag/recordsearch/index.html%3Fpaged=2.html&#34;&gt;started scraping data&lt;/a&gt; from RecordSearch back in 2008 when I was working at the NAA. Eventually I packaged up &lt;a href=&#34;https://github.com/wragge/recordsearch_tools&#34;&gt;some Python code&lt;/a&gt; to help other researchers create datasets. This was completely rewritten as the &lt;a href=&#34;https://github.com/wragge/recordsearch_data_scraper&#34;&gt;RecordSearch Data Scraper&lt;/a&gt; a few years back, and you can find various tools and examples using it in the &lt;a href=&#34;https://glam-workbench.net/recordsearch/&#34;&gt;RecordSearch section of the GLAM Workbench&lt;/a&gt;. In theory, I might be able to modify the scraper to get around the bot protection, but with the bot wars escalating, it hardly seems worth it – I might get the scraper working, only for it to fall foul of the latest bot detection rules. It&amp;rsquo;s really now up to the NAA to decide whether it will find other ways to give researchers access to its data.&lt;/p&gt;
&lt;p&gt;So it seems like I&amp;rsquo;ll be archiving all my RecordSearch code. Unfortunately, many of the RecordSearch notebooks in the GLAM Workbench will no longer work, so I&amp;rsquo;ll be adding warnings and explanations over coming weeks.&lt;/p&gt;
&lt;p&gt;While not entirely unexpected, it&amp;rsquo;s all pretty sad. The RecordSearch scrapers have powered some of my favourite research projects. They enabled Kate Bagnall and me to download the metadata and images behind &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/&#34;&gt;The Real Face of White Australia&lt;/a&gt; – a process we described in our article &lt;a href=&#34;https://doi.org/10.5281/zenodo.3579530&#34;&gt;&amp;lsquo;The people inside&amp;rsquo;&lt;/a&gt;. Using the scrapers I&amp;rsquo;ve been able to &lt;a href=&#34;https://insidestory.org.au/withheld-pending-advice/&#34;&gt;analyse the process of access examination&lt;/a&gt;, and &lt;a href=&#34;https://updates.timsherratt.org/2021/04/21/secrets-and-lives.html&#34;&gt;extract thousands of redactions&lt;/a&gt; from digitised ASIO surveillance files. Without the scrapers I would never have discovered &lt;a href=&#34;https://github.com/wragge/diy-redactionart&#34;&gt;#redactionart&lt;/a&gt;!&lt;/p&gt;
&lt;iframe src=&#34;https://player.vimeo.com/video/215976633?badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479&#34; frameborder=&#34;0&#34; allow=&#34;autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media&#34; style=&#34;width:100%;height:400px;&#34; title=&#34;The Redaction Zoo&#34;&gt;&lt;/iframe&gt;
&lt;p&gt;But while I won&amp;rsquo;t be harvesting any new data, I have a few datasets that I&amp;rsquo;d like to explore further. Fortunately, I just finished compiling some summary data about every series in RecordSearch, and I want to compare this latest harvest with datasets from 2021 and 2022. I need to do some more analysis of &lt;a href=&#34;https://updates.timsherratt.org/2025/02/05/ten-years-of-data-the.html&#34;&gt;ten years&#39; worth of data&lt;/a&gt; capturing the details of files with the access status of &amp;lsquo;closed&amp;rsquo;. I&amp;rsquo;ve been working on an update to my redaction finder, which I think I should still be able to finish. And there&amp;rsquo;s also a lot of data that volunteers have transcribed from records relating to the Real Face of White Australia that I need to pull together.&lt;/p&gt;
&lt;p&gt;While my life has been dominated by Trove in recent years, as a historian my heart has always been with the collections of the National Archives of Australia. I&amp;rsquo;m hoping this is just a temporary setback, and that new methods for data access will emerge.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Farewell Trove</title>
      <link>https://updates.timsherratt.org/2025/05/07/farewell-trove.html</link>
      <pubDate>Wed, 07 May 2025 14:51:34 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/05/07/farewell-trove.html</guid>
      <description>&lt;p&gt;Over the last few months I&amp;rsquo;ve been grappling with the &lt;a href=&#34;https://updates.timsherratt.org/2025/04/11/update-on-trove-data-access.html&#34;&gt;cancellation of my Trove API keys by the National Library of Australia&lt;/a&gt;. It may seem like a minor technical hiccup from the outside, but it&amp;rsquo;s had a major personal impact. For the sake of my health, I&amp;rsquo;ve decided to stop work on Trove, archive all my code repositories related to Trove, and move on. Farewell Trove.&lt;/p&gt;
&lt;p&gt;But don&amp;rsquo;t panic! All of my Trove tools and resources available through the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; and elsewhere will remain online. They just won&amp;rsquo;t be updated. I&amp;rsquo;ll be adding explanatory notices to the affected resources over coming weeks. All of my stuff is openly licensed, so feel free to take what&amp;rsquo;s useful and develop it further yourself.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll also be adding warnings for researchers planning to use the Trove API in their projects. Given that the NLA is willing to change the API terms of use to restrict access without any consultation, provides no transparency around acceptable use of full text content, and is prepared to cancel API keys without warning, I can no longer recommend Trove as a reliable source for digital research. A PhD student could embark on a project in good faith, only to have the rules change mid-project.&lt;/p&gt;
&lt;p&gt;I think this is a critical issue for the research sector, and hard questions need to be asked of the NLA. But I can&amp;rsquo;t be the one to do this any more. I&amp;rsquo;m sick of being the person calling the NLA out on its bad behaviour. I&amp;rsquo;m sick of their gaslighting.&lt;/p&gt;
&lt;p&gt;I wanted to avoid making any dramatic gestures, but after talking it over with my partner last night, I realised my health is really suffering and I need to make a change. I also realised that even if my API keys were magically restored, I&amp;rsquo;d always be looking over my shoulder, wondering if I&amp;rsquo;d done something to offend the NLA gatekeepers. That&amp;rsquo;s not a good way to live. I&amp;rsquo;d rather spend my time working with organisations who value what I do.&lt;/p&gt;
&lt;h2 id=&#34;addendum-22-may-2025&#34;&gt;Addendum, 22 May 2025&lt;/h2&gt;
&lt;p&gt;I noticed last night that the Trove API key application process had changed. Previously, non-commercial use was approved automatically. Now you have to fill in a two-page form justifying your proposed use. Your application is then assessed against a complex four-level review matrix. Responses are provided within &lt;strong&gt;7 to 28 days&lt;/strong&gt;. If you want to download the full text of resources, such as digitised newspapers, you &lt;strong&gt;additionally&lt;/strong&gt; need to apply for an exemption to the terms of use.&lt;/p&gt;
&lt;p&gt;The NLA has also changed the API terms of use, removing all reference to unauthenticated access. Limited unauthenticated (or keyless) access was introduced with version 3 of the API and was useful for quick experimentation, demonstrations, and teaching. Unauthenticated access has now been disabled, and all attempts to access the API without a key return an error. This change, together with the new and complex application process, makes it difficult, if not impossible, to use the Trove API in teaching and training contexts.&lt;/p&gt;
&lt;p&gt;These changes will further discourage use of the Trove API, and that&amp;rsquo;s probably the point. I wouldn&amp;rsquo;t be surprised if further limits were imposed in the future, or if the NLA decommissioned the API entirely.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;All posts on this topic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;25 February 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html&#34;&gt;15 years of work on Trove threatened by the NLA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;2 March 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/03/02/trove-api-users-beware-the.html&#34;&gt;Trove API users beware! – the latest in the saga of my cancelled API keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;11 April 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/04/11/update-on-trove-data-access.html&#34;&gt;Update on Trove data access and my suspended API keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;7 May 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;Farewell Trove&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>SLV LAB and GLAM Workbench updates</title>
      <link>https://updates.timsherratt.org/2025/05/05/slv-lab-and-glam-workbench.html</link>
      <pubDate>Mon, 05 May 2025 14:26:15 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/05/05/slv-lab-and-glam-workbench.html</guid>
      <description>&lt;p&gt;Last week the State Library of Victoria launched &lt;a href=&#34;https://lab.slv.vic.gov.au/&#34;&gt;SLV LAB&lt;/a&gt;, a prototyping and innovation lab that &amp;lsquo;experiment[s] with technology to open access to collections, data and spaces&amp;rsquo;. The SLV LAB encourages collaboration, and is &lt;a href=&#34;https://lab.slv.vic.gov.au/resources&#34;&gt;sharing code, datasets, and tutorials&lt;/a&gt;. It&amp;rsquo;s an exciting development and I&amp;rsquo;m looking forward to seeing what they get up to. I&amp;rsquo;ve added SLV LAB to the &lt;a href=&#34;https://glam-workbench.net/glam-data-list/#glam-data-portals-repositories&#34;&gt;GLAM data portals &amp;amp; repositories&lt;/a&gt; section of my Australian GLAM data list.&lt;/p&gt;
&lt;p&gt;The launch prompted me to have a look at the SLV section of the GLAM Workbench, which I added about 5 years ago. There are currently two notebooks, both relating to the SLV&amp;rsquo;s use of &lt;a href=&#34;https://iiif.io/&#34;&gt;IIIF&lt;/a&gt; to deliver their images. When I created them, there was an issue with IIIF image links needing to have a cookie set before you could access them, but that now seems to have been fixed, so I thought it was time for an update.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/state-library-victoria/download_image_from_iiif/&#34;&gt;Download an image using the IIIF server and a Handle url&lt;/a&gt; – The SLV uses the Handle system to create persistent urls for images, and IIIF to deliver the images for use. But how do you get from one to the other? This notebook uses the Handle url to find an image&amp;rsquo;s IIIF identifier, and then uses IIIF to download the image. The Handle urls are aggregated into Trove, so you could use this method to download SLV images from Trove metadata harvests.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/state-library-victoria/more_fun_with_iiif/&#34;&gt;More fun with IIIF&lt;/a&gt; – This notebook demonstrates how you can use the standard IIIF API to manipulate images from the SLV collection.&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-05-05-13-13-37.png&#34; width=&#34;600&#34; height=&#34;544&#34; alt=&#34;Screen capture of the More fun with IIIF notebook demonstrating how to rotate an image.&#34;&gt;
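The manipulations in the 'More fun with IIIF' notebook all come down to building URLs according to the IIIF Image API pattern `{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. As a minimal sketch (the server URL and identifier below are placeholders, not real SLV values):

```python
# IIIF Image API URL template: everything about the derived image --
# cropping, scaling, rotation, quality -- is encoded in the URL path.
IIIF_TEMPLATE = "{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

def iiif_url(server, identifier, region="full", size="full",
             rotation="0", quality="default", fmt="jpg"):
    """Build a IIIF Image API URL from its path components."""
    return IIIF_TEMPLATE.format(
        server=server, identifier=identifier, region=region,
        size=size, rotation=rotation, quality=quality, fmt=fmt,
    )

# The whole image, rotated 90 degrees -- 'iiif.example.org' and
# 'IMAGE_ID' are placeholders for a real server and identifier.
print(iiif_url("https://iiif.example.org/iiif/2", "IMAGE_ID", rotation="90"))
```

Requesting that URL returns a freshly derived image, so rotating, cropping, or resizing never needs any code beyond changing a path segment.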
&lt;p&gt;I&amp;rsquo;ve removed the cookie-handling code and simplified some of the code in the notebooks. I&amp;rsquo;ve also updated the repository to embed the GLAM Workbench&amp;rsquo;s latest systems and integrations. This means, among other things, that the repository is &lt;a href=&#34;https://doi.org/10.5281/zenodo.15321603&#34;&gt;preserved in Zenodo with a DOI&lt;/a&gt;, and a Docker image is automatically built that makes it easy to run the notebooks in a variety of contexts – including the &lt;a href=&#34;https://glam-workbench.net/using-ardc-binder/&#34;&gt;ARDC Binder service&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Perhaps most interestingly, I&amp;rsquo;ve also created a &lt;a href=&#34;https://glam-workbench.net/state-library-victoria-jlite/lab/index.html?path=more_fun_with_iiif.ipynb&#34;&gt;Jupyter Lite version of the &amp;lsquo;More fun with IIIF&amp;rsquo; notebook&lt;/a&gt; that runs in your browser without any need for a cloud server. Unfortunately, I can&amp;rsquo;t do the same with the Handle/IIIF notebook because Jupyter Lite runs afoul of CORS permissions when requesting the Handle url.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m looking forward to adding additional notebooks and examples as the SLV LAB develops, and shares more data and code.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New PROV section added to the GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2025/04/30/new-prov-section-added-to.html</link>
      <pubDate>Wed, 30 Apr 2025 15:00:09 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/04/30/new-prov-section-added-to.html</guid>
      <description>&lt;p&gt;There&amp;rsquo;s a brand &lt;a href=&#34;https://glam-workbench.net/prov/&#34;&gt;new GLAM Workbench section&lt;/a&gt; to help you work with data from the Public Record Office Victoria!&lt;/p&gt;
&lt;p&gt;Over the past couple of months, I&amp;rsquo;ve been poking around in the &lt;a href=&#34;https://prov.vic.gov.au/about-us/our-blog/new-prov-public-api&#34;&gt;PROV&amp;rsquo;s collection API&lt;/a&gt;. The API provides data about PROV&amp;rsquo;s archival holdings in a machine-readable format. This makes it possible to use, analyse, and visualise the collection in new ways.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve already shared a few of the results of my explorations. There&amp;rsquo;s &lt;a href=&#34;https://updates.timsherratt.org/2025/04/09/introducing-provbot-sharing-photos-from.html&#34;&gt;PROVbot sharing randomly-selected photos&lt;/a&gt; via the Fediverse; a &lt;a href=&#34;https://updates.timsherratt.org/2025/04/10/using-the-public-record-office.html&#34;&gt;data dashboard providing an overview of the PROV collection&lt;/a&gt;; and &lt;a href=&#34;https://updates.timsherratt.org/2025/04/09/more-than-million-rows-of.html&#34;&gt;6 million rows of PROV data added to the GLAM Name Index Search&lt;/a&gt;. At the same time I&amp;rsquo;ve been documenting how the API works, and the sorts of data it provides. I&amp;rsquo;ve now compiled this documentation into a Jupyter notebook – &lt;a href=&#34;https://glam-workbench.net/prov/getting-started/&#34;&gt;Getting started with the PROV API&lt;/a&gt; – and added it to the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;The PROV API provides a lot of rich, interconnected data, but there&amp;rsquo;s not much documentation on the PROV website. I&amp;rsquo;m hoping that this new section of the GLAM Workbench will encourage people to explore its possibilities. I&amp;rsquo;ll be adding more notebooks over time, examining the nature of the data in more depth, and probably creating a few useful tools and visualisations. Let me know if you have ideas for new notebooks!&lt;/p&gt;
&lt;p&gt;I recently made one of the GLAM Workbench&amp;rsquo;s introductory notebooks &lt;a href=&#34;https://updates.timsherratt.org/2025/04/28/the-glam-workbench-introduction-to.html&#34;&gt;available to run live using Jupyter Lite&lt;/a&gt;. This means the notebook loads everything it needs within your browser, rather than depending on a separate cloud service. The new PROV section comes with Jupyter Lite support baked in. If you go to the &lt;a href=&#34;https://glam-workbench.net/prov/getting-started/&#34;&gt;page describing the API notebook&lt;/a&gt;, you&amp;rsquo;ll notice there&amp;rsquo;s a brand new option to run the notebook using &lt;a href=&#34;https://jupyterlite.readthedocs.io/en/latest/&#34;&gt;Jupyter Lite&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-04-30-13-27-11.png&#34; width=&#34;600&#34; height=&#34;214&#34; alt=&#34;Screenshot of the &#39;Using this notebook&#39; section of the page showing the option to run the notebook using JupyterLite&#34;&gt;
&lt;p&gt;This is the first time I&amp;rsquo;ve integrated Jupyter Lite in this way, and I think it opens up some exciting possibilities. Not only does it make it easier and quicker to jump in and start playing, it means you can embed a live, working version of the notebook in any web page. Like this!&lt;/p&gt;
&lt;iframe src=&#34;https://glam-workbench.net/prov-jlite/lab/index.html?path=getting-started.ipynb&#34; width=&#34;100%&#34;  height=&#34;500&#34;&gt;&lt;/iframe&gt;
&lt;p&gt;And a reminder – this work on the PROV API, like most of the GLAM Workbench, has received no direct funding. I do it  because I want to help researchers use GLAM collections in new ways. If you&amp;rsquo;d like to support the GLAM Workbench you can &lt;a href=&#34;https://github.com/sponsors/wragge&#34;&gt;sponsor me on GitHub&lt;/a&gt; or &lt;a href=&#34;https://www.buymeacoffee.com/wragge&#34;&gt;buy me a coffee&lt;/a&gt;. Most importantly, please share this information with anyone you think might find it useful.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The GLAM Workbench introduction to how notebooks work now runs in Jupyter Lite</title>
      <link>https://updates.timsherratt.org/2025/04/28/the-glam-workbench-introduction-to.html</link>
      <pubDate>Mon, 28 Apr 2025 12:51:42 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/04/28/the-glam-workbench-introduction-to.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve just updated my introduction to &lt;a href=&#34;https://glam-workbench.net/getting-started-jlite/lab/index.html?path=using_jupyter_notebooks.ipynb&#34;&gt;using Jupyter notebooks in the GLAM Workbench&lt;/a&gt; so that it runs in &lt;a href=&#34;https://jupyterlite.readthedocs.io/en/latest/&#34;&gt;Jupyter Lite&lt;/a&gt; – that means no more waiting for cloud services to spin up, it all happens in your browser!&lt;/p&gt;
&lt;p&gt;All the Jupyter notebooks in &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; can be run in the cloud using the free Binder service – either &lt;a href=&#34;https://binderhub.rc.nectar.org.au/&#34;&gt;through the ARDC&lt;/a&gt; (requires authentication), or through the &lt;a href=&#34;https://mybinder.org/&#34;&gt;public, community-run service&lt;/a&gt;. While it&amp;rsquo;s usually just a matter of clicking a link, Binder can take a while to build the necessary computing environment, and sometimes it just fails. &lt;a href=&#34;https://jupyterlite.readthedocs.io/en/latest/&#34;&gt;Jupyter Lite&lt;/a&gt; takes a different approach. Instead of building things in the cloud, it sets up everything it needs to run notebooks &lt;em&gt;within your own browser&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been experimenting with Jupyter Lite a bit over the past couple of years, waiting for the technology to reach the point where I could integrate it into the GLAM Workbench without greatly multiplying the maintenance burden. The obvious place to start was my introductory notebook, which demonstrates how Jupyter notebooks themselves work. Using live data from the &lt;a href=&#34;https://www.nma.gov.au/about/our-collection/museum-api&#34;&gt;National Museum of Australia API&lt;/a&gt;, it describes the basic structure of notebooks, and shows you how to edit and run code within them. I&amp;rsquo;ve now set things up so &lt;a href=&#34;https://glam-workbench.net/getting-started-jlite/lab/index.html?path=using_jupyter_notebooks.ipynb&#34;&gt;this notebook runs in Jupyter Lite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What does this mean? Previously, the link to the introductory notebook spun up a new Binder instance. Now, the link retrieves a static web page hosted on GitHub. As this page loads, it installs a Python kernel and everything else it needs to run the notebook within your browser. It&amp;rsquo;s a lot faster than waiting for Binder, and provides a smoother experience for new users. And because it&amp;rsquo;s just an ordinary web page, I can even embed a live, working version of the notebook within this blog post. Try it out!&lt;/p&gt;
&lt;iframe height=500 width=&#34;100%&#34; src=&#34;https://glam-workbench.net/getting-started-jlite/lab/index.html?path=using_jupyter_notebooks.ipynb&#34;&gt;&lt;/iframe&gt;
&lt;p&gt;Jupyter Lite won&amp;rsquo;t currently work with every notebook in the GLAM Workbench. Some Python packages are difficult to install, and some data sources can&amp;rsquo;t be accessed due to CORS problems. But I&amp;rsquo;m planning to add Jupyter Lite options where I can.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Update on Trove data access and my suspended API keys</title>
      <link>https://updates.timsherratt.org/2025/04/11/update-on-trove-data-access.html</link>
      <pubDate>Fri, 11 Apr 2025 16:27:16 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/04/11/update-on-trove-data-access.html</guid>
      <description>&lt;p&gt;On 21 February, my Trove API keys were &lt;a href=&#34;https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html&#34;&gt;cancelled without warning&lt;/a&gt;. A week later, I met with NLA staff and was shocked to be told that downloading &amp;lsquo;content&amp;rsquo;, such as the text of digitised newspaper articles, was regarded as a &lt;a href=&#34;https://updates.timsherratt.org/2025/03/02/trove-api-users-beware-the.html&#34;&gt;breach of the API terms of use&lt;/a&gt;. Without API access I can&amp;rsquo;t continue my work helping researchers make use of Trove. More generally though, the NLA&amp;rsquo;s actions threaten innovative digital research. This post tries to answer some questions raised by my first two posts, and provides some updates on recent actions by the NLA.&lt;/p&gt;
&lt;h2 id=&#34;whats-an-api-key&#34;&gt;What&amp;rsquo;s an API key?&lt;/h2&gt;
&lt;p&gt;You might be wondering what an API key is and why it&amp;rsquo;s important. At its heart, it&amp;rsquo;s all about access to data. The Trove API delivers information from Trove in a form that computers can understand and process. This allows researchers to compile datasets for detailed analysis or visualisation, and supports the development of innovative tools and interfaces that help all Trove users. But access to the API is controlled by keys – no key, no data.&lt;/p&gt;
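As a minimal sketch of what keyed access looks like in practice (the endpoint is Trove's public v3 API; the key value here is a placeholder, and the exact parameters are illustrative):

```python
from urllib.request import Request
from urllib.parse import urlencode

API_KEY = "YOUR_API_KEY"  # placeholder -- a real key is issued by the NLA

def trove_search(query, key=API_KEY):
    """Build (but don't send) a Trove API v3 search request.

    The key travels in the X-API-KEY header: no key, no data.
    """
    params = urlencode({"q": query, "category": "newspaper", "encoding": "json"})
    return Request(
        f"https://api.trove.nla.gov.au/v3/result?{params}",
        headers={"X-API-KEY": key},
    )

req = trove_search("weather")
print(req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) returns machine-readable search results; the same request with a cancelled or missing key is simply rejected.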
&lt;p&gt;I&amp;rsquo;ve had an API key since 2012, and have used my keys in a variety of ways to help people use and understand Trove. Some keys were linked to particular applications, such as the Trove API Console and Headline Roulette. Others were used in the development of tools and resources such as the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;, &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;, and the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-harvester/&#34;&gt;Trove Newspaper Harvester&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The NLA suspended all my keys without warning or consideration as to how they were being used.&lt;/strong&gt; Services like the Trove API Console which are dependent on keys for their operation simply stopped working.&lt;/p&gt;
&lt;p&gt;No. warning. at. all.&lt;/p&gt;
&lt;p&gt;However, the greater concern for me is that I can no longer develop or maintain things like the GLAM Workbench. I&amp;rsquo;ve spent a lot of time over the last 15 years responding to researcher inquiries, building tools to enable new research projects, and supporting researchers as they begin to explore the possibilities of Trove data. Most of this work has been unpaid. I do it because I think it&amp;rsquo;s important. But with no API keys my hands are tied. My ability to help researchers is severely limited.&lt;/p&gt;
&lt;p&gt;The National Library of Australia chose to do this.&lt;/p&gt;
&lt;h2 id=&#34;what-are-the-terms-of-use&#34;&gt;What are the terms of use?&lt;/h2&gt;
&lt;p&gt;I haven&amp;rsquo;t received a clear explanation as to why all of my API keys were cancelled. &lt;a href=&#34;https://updates.timsherratt.org/2025/03/02/trove-api-users-beware-the.html&#34;&gt;As I noted&lt;/a&gt;, discussion of this in my meeting with the NLA was confused and contradictory. But, in general terms, it seems to relate to the Trove &lt;a href=&#34;https://trove.nla.gov.au/about/create-something/using-api/trove-api-terms-use&#34;&gt;API terms of use&lt;/a&gt;, which were changed in 2020. In particular, the NLA now insists that accessing the &amp;lsquo;content&amp;rsquo; of resources, rather than just the descriptive metadata, is a breach of the API terms of use. This includes the full text of digitised newspaper and journal articles that are included in API responses.&lt;/p&gt;
&lt;p&gt;Through all of this it&amp;rsquo;s important to remember that the API terms of use are not imposed on the NLA by some external authority. They created them and can change them again at any time. If the NLA believes that work like mine has value, if they believe that researcher access to publicly-funded resources is important, they can change the rules to support these sorts of activities. To not do so is a choice.&lt;/p&gt;
&lt;p&gt;The terms of use were changed back in 2020, so why has the NLA suddenly chosen to act? From my point of view nothing has changed. All my work is open. I&amp;rsquo;ve just been doing what I&amp;rsquo;ve been doing since 2010. No-one has reached out to me over the last five years with concerns about the terms of use. There must be something else going on here, but, given the NLA&amp;rsquo;s lack of transparency, it&amp;rsquo;s hard to know.&lt;/p&gt;
&lt;p&gt;On 10 April, API users with keys dating back before 2020 started receiving emails that require them to explicitly accept the 2020 terms of use, or give up their keys. I suspect this is because the NLA has realised that API users weren&amp;rsquo;t informed of the changes at the time. Section 20 of the terms of use states:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The Library may change these API terms at any time in its sole discretion. The Library will notify you of any changes to these API terms by adding a statement on Trove and the amended API terms will take effect 5 working days (in the Australian Capital Territory) after the date on which the statement was added to Trove. If you do not agree to the updated API terms, you must immediately cease using Trove.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If no such statement was added to Trove, then no agreement can be assumed.&lt;/p&gt;
&lt;p&gt;The email sent to API users could have been an opportunity to explain why these changes were made, and support developers and researchers in making any necessary adjustments. But no. &lt;strong&gt;The major change in terms of access to content is not even mentioned.&lt;/strong&gt; It seems the NLA just wants to slip this past quietly without anyone really noticing.&lt;/p&gt;
&lt;h2 id=&#34;is-it-all-about-ai&#34;&gt;Is it all about AI?&lt;/h2&gt;
&lt;p&gt;Some have wondered whether the NLA&amp;rsquo;s actions were motivated by AI crawlers hoovering up vast quantities of online content. The Director General&amp;rsquo;s response to people who wrote to her expressing their concern over my treatment does highlight the challenges of AI, pointing to the NLA&amp;rsquo;s new &lt;a href=&#34;https://www.library.gov.au/visit/about-us/corporate-information/corporate-strategies/artificial-intelligence-framework&#34;&gt;Artificial intelligence framework&lt;/a&gt;. However, this framework is almost exclusively concerned with the NLA&amp;rsquo;s own use of AI technologies. The only reference to external AI use is the statement: &amp;lsquo;We will seek to protect our data from external AI systems where their use contravenes the access rights of publishers and authors&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s also not clear how this relates to API use. Commercial use of the Trove API has always been handled differently to non-commercial use. Applications for commercial use are individually examined and typically involve a quid pro quo, such as access to paywalled services built using the API. So why impose new restrictions on non-commercial uses?&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s also worth noting that the changes to the terms of use were made in 2020, before the impact of AI crawlers was well understood. Their impact, however, might explain why the terms of use are suddenly being enforced.&lt;/p&gt;
&lt;p&gt;Of course, most AI crawlers are probably just going to scrape stuff from the NLA&amp;rsquo;s websites. To try and manage this, it would seem preferable to push bots towards the API, not away. That way their use could be monitored and better controlled. The Wikimedia Foundation recently published an interesting article on &lt;a href=&#34;https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-the-operations-of-the-wikimedia-projects/&#34;&gt;How crawlers impact the operations of the Wikimedia projects&lt;/a&gt;. Their planned responses to this challenge include improving their APIs and working with developers to manage their usage.&lt;/p&gt;
&lt;p&gt;Perhaps the NLA&amp;rsquo;s new policing of their terms of use is some sort of hamfisted response to AI threats, but why go after me? As Seb Chan noted on LinkedIn:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Programmatic access to cultural collections has been vital to establishing new forms of practice, research, new scholarship and community discovery. Tim’s work has been globally recognized as important and vital for a very long time. Even if we acknowledge the increased cybersecurity risks (eg British Library), and concerns about AI bot content harvesting that likely lie behind this suspension of Tim’s generous work, this continues to feel like another own goal in the Australian cultural sector.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;If there&amp;rsquo;s a problem, why isn&amp;rsquo;t the NLA discussing it with the research sector instead of picking on individuals?&lt;/p&gt;
&lt;h2 id=&#34;who-does-this-affect&#34;&gt;Who does this affect?&lt;/h2&gt;
&lt;p&gt;The NLA&amp;rsquo;s actions have hit me pretty hard. Having 15 years of work discarded by an organisation that you&amp;rsquo;ve always sought to promote is disheartening to say the least. The impact on my work and life has been such that I&amp;rsquo;ve considered whether there might be legal recourse. I&amp;rsquo;m sad, anxious, and disappointed.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m also worried about the impact of the NLA&amp;rsquo;s actions on the research sector in general. In the Director General&amp;rsquo;s response she notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Where there is a clear need, the Library grants exemptions to the Terms of Use in order to support research
and will continue to do so. We are working to make the process for applying for exemptions more transparent
and to reach out to API users to clarify the process.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;However, the email sent to API users simply says:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If your use of the Trove API does not align with the Trove API Terms of Use, please get in touch with Trove Support to discuss an exemption agreement.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;No process. No transparency.&lt;/p&gt;
&lt;p&gt;Any researcher wanting to analyse digitised newspaper articles, parliamentary papers, or other full text content will first need to obtain an &amp;lsquo;exemption agreement&amp;rsquo;. There is no information on what such an agreement involves, or how applications are assessed. The Trove API is no longer open. Research projects are now subject to the whims of NLA gatekeepers.&lt;/p&gt;
&lt;p&gt;Established researchers, or large, well-resourced projects, might see no problem here. They can more easily, and more confidently, justify their needs. The greatest impact will be on experimental projects, where the boundaries are not yet charted. What worries me most are the HDR and ECR researchers who will be deterred by these new restrictions. What new questions will not be asked? What new approaches will be discarded because of these barriers erected by the NLA?&lt;/p&gt;
&lt;p&gt;There will also be a significant impact on research training. How can you run a workshop on using the Trove API if participants are required not only to get an API key, but an individual &amp;lsquo;exemption agreement&amp;rsquo; to work with full text?&lt;/p&gt;
&lt;h2 id=&#34;your-support-means-a-lot&#34;&gt;Your support means a lot!&lt;/h2&gt;
&lt;p&gt;The one good thing to come from all of this is the support I&amp;rsquo;ve received from people around the world. Many have written to Trove, the Director General, and the Minister for the Arts. The responses received so far have not been very enlightening but, nonetheless, I think it&amp;rsquo;s really important to remind the NLA that their actions have an impact, and are being watched.&lt;/p&gt;
&lt;p&gt;The Australian Historical Association has made &lt;a href=&#34;https://theaha.org.au/wp-content/uploads/2025/03/AHA-Statement-on-New-Restrictions-to-Trove.pdf&#34;&gt;a public statement&lt;/a&gt;, noting that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The suspension of Dr Sherratt’s API keys, the result of recent changes to Trove’s API policy, marks a troubling restriction to the work of researchers in Australia and overseas.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The Centre for Contemporary Histories at Deakin University included a message from Professor David Lowe about the NLA&amp;rsquo;s actions &lt;a href=&#34;https://cch.deakin.edu.au/news/2025/03/newsletter-17th-march-2025/&#34;&gt;in their recent newsletter&lt;/a&gt;. He wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The best research stems from an environment that maximises the free flow of information and support for independent scholarship.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Kathryn Greenhill published &lt;a href=&#34;https://librariansmatter.com/blog/2025/03/03/tim-sherratts-trove-api-keys/&#34;&gt;an open letter&lt;/a&gt; that argued:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Reinstating Dr Tim Sherratt’s access to Trove API keys is essential to restore a set of cultural, educational and research tools that enhance access and usage for National Library resources.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There were also articles in &lt;a href=&#34;https://ia.acs.org.au/article/2025/national-library-cracks-down-on-public-data-access.html&#34;&gt;Information Age&lt;/a&gt; (published by the Australian Computer Society) and &lt;em&gt;The Sizzle&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Concerns are not just being expressed in Australia. Professor Shawn Graham from Carleton University in Canada &lt;a href=&#34;https://electricarchaeology.ca/2025/02/24/wtf-nla/&#34;&gt;published an open letter&lt;/a&gt; in which he stated:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Mr. Sherratt’s work is well known across the world in the galleries, libraries, archives, and museums sectors. His work developing the GLAM Workbench has promoted Australia’s cultural heritage world wide. Indeed, because of Mr. Sherratt’s work my own students in our Public History graduate program are more familiar with the National Library of Australia’s Trove service, and hence Australian culture, than with what our own Library and Archives Canada provides. &lt;em&gt;By developing the GLAMWorbench with the Trove service, Mr. Sherratt has had a major impact in how cultural heritage materials are understood at scale, across the world&lt;/em&gt;. His work is complementary to your own, and enhances the prestige of the National Library of Australia&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;My battle with the NLA was also used as a case study in a paper on &lt;a href=&#34;https://aldonzel.github.io/intervention-forum-AAF25/&#34;&gt;digital archives as infrastructure&lt;/a&gt; by Anne-Laure Donzel &amp;amp; Julien Benedetti presented at the forum de l’Association des archivistes Français.&lt;/p&gt;
&lt;p&gt;Trove used to be recognised internationally as a beacon of open cultural heritage data, but that reputation has been tarnished. I&amp;rsquo;ve been contacted by a number of researchers and GLAM professionals around the world expressing their shock and disappointment at the NLA&amp;rsquo;s behaviour.&lt;/p&gt;
&lt;h2 id=&#34;what-comes-next&#34;&gt;What comes next?&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s now six weeks since my API keys were cancelled. The recent responses by the Director General and the email to API users indicate that &lt;em&gt;something&lt;/em&gt; is happening at the NLA, but I&amp;rsquo;m not sure whether that&amp;rsquo;s going to help me. Anyone who asks the NLA about my situation is told that it can&amp;rsquo;t be discussed due to their &amp;lsquo;privacy&amp;rsquo; obligations.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m seeking to find out more through my local member, Andrew Wilkie, who has contacted the Minister for the Arts. But now there&amp;rsquo;s an election, so I don&amp;rsquo;t expect any updates soon.&lt;/p&gt;
&lt;p&gt;Beyond my personal situation, there are important issues that need to be considered by the research sector, and I&amp;rsquo;m hoping peak organisations such as the &lt;a href=&#34;https://humanities.org.au/&#34;&gt;Australian Academy of the Humanities&lt;/a&gt;, the &lt;a href=&#34;https://socialsciences.org.au/&#34;&gt;Academy of the Social Sciences in Australia&lt;/a&gt;, and the &lt;a href=&#34;https://ardc.edu.au/&#34;&gt;Australian Research Data Commons&lt;/a&gt; will take up the discussion.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s also worth noting that university libraries are Trove &amp;lsquo;partners&amp;rsquo; and are represented on &lt;a href=&#34;https://trove.nla.gov.au/partners/trove-strategic-advisory-committee&#34;&gt;Trove&amp;rsquo;s Strategic Advisory Committee&lt;/a&gt;, so if you&amp;rsquo;re worried about the impact on your own research let your university librarian know, or tell the &lt;a href=&#34;https://www.caul.edu.au/&#34;&gt;Council of Australasian University Librarians&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It seems the NLA is currently hunkered down, waiting for everything to blow over. As an institution they have a lot of cultural power, and people are often reluctant to criticise them in public. But I think it&amp;rsquo;s important to keep up the pressure.&lt;/p&gt;
&lt;p&gt;What I want is pretty simple:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;my API keys back&lt;/li&gt;
&lt;li&gt;an apology for the way I&amp;rsquo;ve been treated&lt;/li&gt;
&lt;li&gt;more transparency from the NLA about API access&lt;/li&gt;
&lt;li&gt;an open discussion within the research sector about the problems and possibilities of working with Trove data&lt;/li&gt;
&lt;/ul&gt;
&lt;hr&gt;
&lt;p&gt;All posts on this topic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;25 February 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html&#34;&gt;15 years of work on Trove threatened by the NLA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;2 March 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/03/02/trove-api-users-beware-the.html&#34;&gt;Trove API users beware! – the latest in the saga of my cancelled API keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;11 April 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/04/11/update-on-trove-data-access.html&#34;&gt;Update on Trove data access and my suspended API keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;7 May 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;Farewell Trove&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Using the Public Record Office Victoria&#39;s API to build an overview of their collection</title>
      <link>https://updates.timsherratt.org/2025/04/10/using-the-public-record-office.html</link>
      <pubDate>Thu, 10 Apr 2025 14:02:03 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/04/10/using-the-public-record-office.html</guid>
      <description>&lt;p&gt;Over the past few weeks I&amp;rsquo;ve been exploring the Public Record Office Victoria&amp;rsquo;s public API. There&amp;rsquo;s not a lot of documentation, but there is a lot of data!&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s not immediately obvious is that the API includes information about a variety of different entities within the &lt;a href=&#34;https://prov.vic.gov.au/recordkeeping-government/a-z-topics/archival-control-model&#34;&gt;PROV&amp;rsquo;s model for archival description&lt;/a&gt; – not just items, but functions, agencies, series and more. You can limit your API requests to a particular entity using the &lt;code&gt;category&lt;/code&gt; field. You can also request facet counts from the &lt;code&gt;category&lt;/code&gt; field to tell you how many of each type of entity are available from the API.&lt;/p&gt;
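&lt;p&gt;As a sketch of the kind of request involved – assuming the Solr-style &lt;code&gt;search/select&lt;/code&gt; endpoint shown below (check the PROV API documentation for the current URL) – you can ask for just the facet counts without any individual records:&lt;/p&gt;

```python
from urllib.parse import urlencode

# Base search endpoint for the PROV collection API. This URL is an
# assumption based on its Solr-style interface -- check the PROV docs.
PROV_API = "https://api.prov.vic.gov.au/search/select"

def category_facet_url(q="*:*"):
    """Build a request URL asking only for facet counts on `category`."""
    params = {
        "q": q,            # match everything by default
        "rows": 0,         # skip individual records, just return counts
        "facet": "true",
        "facet.field": "category",
        "wt": "json",
    }
    return f"{PROV_API}?{urlencode(params)}"
```

&lt;p&gt;Setting &lt;code&gt;rows&lt;/code&gt; to zero keeps the response small: you get a count for each &lt;code&gt;category&lt;/code&gt; value without paging through millions of records.&lt;/p&gt;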
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-04-10-12-50-21.png&#34; width=&#34;600&#34; height=&#34;275&#34; alt=&#34;Bar chart showing the number of records for each entity including Agency, Consignment, Function, Image, Item, Series, and related Entity.&#34;&gt;
&lt;p&gt;I&amp;rsquo;ve been documenting this sort of information in notebooks for inclusion in the forthcoming PROV section of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;. But I thought it might be useful to pull a few things together as a standalone dashboard, providing an overview of the PROV collection. So, &lt;a href=&#34;https://wragge.github.io/prov-dashboard/&#34;&gt;here it is&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://wragge.github.io/prov-dashboard/&#34;&gt;dashboard&lt;/a&gt; tells you how many records are currently available through the API, and breaks down this count by &lt;code&gt;entity&lt;/code&gt; and &lt;code&gt;category&lt;/code&gt;. It then works through the main entities – functions, agencies, series, items, and images – displaying a series of charts and tables that give you an idea of what they&amp;rsquo;re actually made up of.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-04-10-12-49-49.png&#34; width=&#34;600&#34; height=&#34;504&#34; alt=&#34;Table showing the agencies that have created the most series. It includes the agency id and title, and the number of series created by it.&#34;&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-04-10-12-50-57.png&#34; width=&#34;600&#34; height=&#34;485&#34; alt=&#34;Bar chart showing the number of digitised items for each decade from 1830 to 2020.&#34;&gt;
&lt;p&gt;The &lt;a href=&#34;https://wragge.github.io/prov-dashboard/&#34;&gt;dashboard&lt;/a&gt; is &lt;a href=&#34;https://github.com/wragge/prov-dashboard&#34;&gt;hosted on GitHub&lt;/a&gt; and is automatically updated every Sunday. In the future, I&amp;rsquo;ll do more to highlight changes over time.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>More than 6 million rows of data from Public Record Office Victoria added to the GLAM Name Index Search</title>
      <link>https://updates.timsherratt.org/2025/04/09/more-than-million-rows-of.html</link>
      <pubDate>Wed, 09 Apr 2025 16:28:46 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/04/09/more-than-million-rows-of.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; now includes more than 6 million rows of data from the &lt;a href=&#34;https://prov.vic.gov.au/&#34;&gt;Public Record Office Victoria&lt;/a&gt;, downloaded using their &lt;a href=&#34;https://prov.vic.gov.au/prov-collection-api&#34;&gt;public API&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; brings together records that include the names of people from 10 Australian GLAM organisations. With a single search, you can find information about individuals across millions of rows of data.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-04-09-15-19-57.png&#34; width=&#34;600&#34; height=&#34;624&#34; alt=&#34;Screenshot of GLAM Name Index Search listing the GLAM organisations included and the number of rows of data from each&#34;&gt;
&lt;p&gt;Previous versions of the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; included a few datasets from the Public Record Office Victoria that had been shared through government open data portals. However, as I was exploring PROV&amp;rsquo;s public API recently, I realised that there were many more records that included people&amp;rsquo;s names.&lt;/p&gt;
&lt;p&gt;People&amp;rsquo;s names appear in a number of different fields in the PROV data – including &lt;code&gt;family_name&lt;/code&gt;, &lt;code&gt;description.name&lt;/code&gt;, and &lt;code&gt;sams.description.name_of_person&lt;/code&gt;. They can also be attached to either &amp;lsquo;items&amp;rsquo; or &amp;lsquo;images&amp;rsquo;. &amp;lsquo;Items&amp;rsquo; are individual records, such as files, while &amp;lsquo;images&amp;rsquo; relate metadata to a page from a digitised item. This means that an image record can tell you that a person is mentioned on a specific page. Both item and image records can point you to useful information.&lt;/p&gt;
&lt;p&gt;By using the API to search for records with values in one of the possible name fields, I compiled a list of series that contain items or images that include people&amp;rsquo;s names in their metadata.&lt;/p&gt;
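&lt;p&gt;In Solr-style query syntax, &lt;code&gt;field:[* TO *]&lt;/code&gt; matches records where a field has any value, so a request along these lines can surface the relevant series – this is a sketch, and the series facet field name is an assumption:&lt;/p&gt;

```python
# Name fields identified in the PROV data (as listed above); the series
# facet field name is an assumption -- adjust to the actual PROV schema.
NAME_FIELDS = [
    "family_name",
    "description.name",
    "sams.description.name_of_person",
]

def series_with_names_params():
    """Request facet counts by series for records with any name field set."""
    # `field:[* TO *]` is the Solr idiom for "this field has a value"
    query = " OR ".join(f"{field}:[* TO *]" for field in NAME_FIELDS)
    return {
        "q": query,
        "rows": 0,
        "facet": "true",
        "facet.field": "series_id",
        "wt": "json",
    }
```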
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;series_id&lt;/th&gt;
&lt;th&gt;series_title&lt;/th&gt;
&lt;th&gt;record_category&lt;/th&gt;
&lt;th&gt;number_of_records&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;Probate and Administration Files&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;2,578,652&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;948&lt;/td&gt;
&lt;td&gt;Outward Passengers to Interstate, U.K. and Foreign Ports (Refer to Microfilm Copy VPRS 3506)&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;1,661,181&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;947&lt;/td&gt;
&lt;td&gt;Inward Overseas  Passenger Lists (see Microfiche Copies: VPRS 7666 United Kingdom Ports;  VPRS 7667 Foreign Ports; VPRS 13439 New Zealand Ports)&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;1,608,515&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7591&lt;/td&gt;
&lt;td&gt;Wills&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;933,110&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17379&lt;/td&gt;
&lt;td&gt;Probate and Administration Files (CourtView)&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;294,882&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;Inquest Deposition Files&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;216,768&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Register of Assisted Immigrants from the United Kingdom [refer to microform copy, VPRS 3502]&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;173,167&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5357&lt;/td&gt;
&lt;td&gt;Land Selection And Correspondence Files&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;136,782&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10010&lt;/td&gt;
&lt;td&gt;Body Cards&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;109,968&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4527&lt;/td&gt;
&lt;td&gt;Ward Register (known as Children&amp;rsquo;s Registers 1864 - 1887)&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;55,429&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13579&lt;/td&gt;
&lt;td&gt;Teacher Record Books&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;49,106&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;515&lt;/td&gt;
&lt;td&gt;Central Register of Male Prisoners&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;44,285&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5714&lt;/td&gt;
&lt;td&gt;Land Selection Files, Section 12 Closer Settlement Act 1938 [including obsolete and  top numbered Closer Settlement and WW1 Discharged Soldier Settlement  files]&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;7,721&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;516&lt;/td&gt;
&lt;td&gt;Central Register of Female Prisoners&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;6,782&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7933&lt;/td&gt;
&lt;td&gt;Non-Issued Probate Applications&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;3,301&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5357&lt;/td&gt;
&lt;td&gt;Land Selection And Correspondence Files&lt;/td&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;644&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7592&lt;/td&gt;
&lt;td&gt;Wills and Probate and Administration Files&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;606&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18110&lt;/td&gt;
&lt;td&gt;Clinical Notes and Patient Files (Receiving House)&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;391&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;Articles of Clerkship Files&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;374&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;266&lt;/td&gt;
&lt;td&gt;Inward Registered Correspondence&lt;/td&gt;
&lt;td&gt;Item&lt;/td&gt;
&lt;td&gt;275&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I then looped through each series in this list, saving details of each item or image. Finally, to prepare the data from the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt;, I excluded some fields that weren&amp;rsquo;t useful and converted the data files into CSV format. All the details are in a notebook that I&amp;rsquo;ll add to the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; in coming weeks.&lt;/p&gt;
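&lt;p&gt;The final tidy-up step can be sketched with the standard library alone – the records and excluded field below are made up for illustration:&lt;/p&gt;

```python
import csv
import io

# Made-up example records; real PROV responses have many more fields.
records = [
    {"series_id": 28, "family_name": "Smith", "_version_": 17, "title": "Probate file"},
    {"series_id": 28, "family_name": "Jones", "_version_": 18, "title": "Probate file"},
]

EXCLUDE = {"_version_"}  # internal fields that aren't useful to searchers

def to_csv(rows, exclude=EXCLUDE):
    """Convert dict records to CSV text, dropping the excluded fields."""
    fieldnames = [key for key in rows[0] if key not in exclude]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({k: v for k, v in row.items() if k not in exclude})
    return out.getvalue()
```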
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-04-09-15-20-33.png&#34; width=&#34;600&#34; height=&#34;716&#34; alt=&#34;Screen shot from the GLAM Name Index Search showing a list of datasets from the PROV&#34;&gt;
&lt;p&gt;The result is that the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; now includes 20 datasets from the Public Record Office Victoria, containing 6,645,269 rows of data! Altogether, the GLAM Name Index Search includes around 13 million records, so almost half are from PROV.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Introducing PROVBot – sharing photos from Public Record Office Victoria</title>
      <link>https://updates.timsherratt.org/2025/04/09/introducing-provbot-sharing-photos-from.html</link>
      <pubDate>Wed, 09 Apr 2025 12:56:26 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/04/09/introducing-provbot-sharing-photos-from.html</guid>
      <description>&lt;p&gt;With poor old TroveNewsBot &lt;a href=&#34;https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html&#34;&gt;killed by the NLA,&lt;/a&gt; my Mastodon feed has had less GLAM goodness of late. To try and fill the void I&amp;rsquo;ve created &lt;a href=&#34;https://wraggebots.net/@provbot&#34;&gt;PROVBot&lt;/a&gt;, sharing photos from the &lt;a href=&#34;https://prov.vic.gov.au/&#34;&gt;Public Record Office Victoria&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-04-09-11-50-21.png&#34; width=&#34;600&#34; height=&#34;508&#34; alt=&#34;Screen capture of a post by PROVBot. The text reads: B526 Various scenes of activities at Emily MacPherson Collage - Weaving, art, cookery, design, chemistry, etc – Subject : Emily MacPherson  College Melbourne – part of VPRS14517, Negatives of Photographs  [Publications Branch]. The text is accompanied by a black and white photograph of women using what appear to be looms.&#34;&gt;
&lt;p&gt;PROVBot makes use of the &lt;a href=&#34;https://prov.vic.gov.au/prov-collection-api&#34;&gt;Public Record Office Victoria&amp;rsquo;s public API&lt;/a&gt;. At this stage it just selects and shares a random photograph once a day, but in the future I&amp;rsquo;ll probably add more features, such as the ability to respond to search queries.&lt;/p&gt;
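&lt;p&gt;A common pattern for &amp;lsquo;random item&amp;rsquo; bots – a sketch, not necessarily how PROVBot works internally – is to get the total number of matching records, pick a random offset, then re-run the search to fetch just the single record at that position:&lt;/p&gt;

```python
import random

def random_offset(total_matches, rng=None):
    """Pick a random position within a set of API search results.

    The bot can then repeat the search with `start` set to this offset
    and `rows` set to 1 to retrieve just that one record.
    """
    rng = rng or random.Random()
    return rng.randrange(total_matches)
```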
&lt;p&gt;You can find PROVBot on the Fediverse at &lt;a href=&#34;https://wraggebots.net/@provbot&#34;&gt;https://wraggebots.net/@provbot&lt;/a&gt;. There&amp;rsquo;s also &lt;a href=&#34;https://wraggebots.net/@provbot/feed.rss&#34;&gt;an RSS feed&lt;/a&gt; that you can pop into your preferred feed reader. The bot&amp;rsquo;s code is openly licensed and &lt;a href=&#34;https://github.com/wragge/provbot&#34;&gt;available on GitHub&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Trove API users beware!  – the latest in the saga of my cancelled API keys</title>
      <link>https://updates.timsherratt.org/2025/03/02/trove-api-users-beware-the.html</link>
      <pubDate>Sun, 02 Mar 2025 16:26:51 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/03/02/trove-api-users-beware-the.html</guid>
      <description>&lt;p&gt;After &lt;a href=&#34;https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html&#34;&gt;my Trove API keys were cancelled without warning&lt;/a&gt; on 21 February, I reluctantly agreed to a meeting with the National Library of Australia. They had provided so little information in their emails, that it seemed to be the only way to find out what was really going on. I came out of the meeting shocked by the NLA&amp;rsquo;s change in attitude towards API use.&lt;/p&gt;
&lt;h2 id=&#34;tldr--youre-probably-breaching-the-api-terms-of-use&#34;&gt;TL;DR – you&amp;rsquo;re probably breaching the API terms of use&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;All Trove API users need to be aware that the NLA now insists that accessing the &amp;lsquo;content&amp;rsquo; of resources, rather than just the descriptive metadata, is a breach of the API terms of use. This includes the full text of digitised newspaper and journal articles that are included in API responses. Yes, that&amp;rsquo;s right, using the Trove API in the way that it has been designed and documented is a breach of its own terms of use. You can only download the full text of items using the API if you seek and obtain explicit permission from the NLA beforehand. Note also that the NLA is reviewing people&amp;rsquo;s use of the API and, as demonstrated by my case, they can and will suspend your API keys without warning.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;through-the-looking-glass&#34;&gt;Through the looking glass&lt;/h2&gt;
&lt;p&gt;I hate meetings and avoid conflict, but I didn&amp;rsquo;t see any alternative to meeting with the NLA to find the real reasons behind the cancellation of my API keys. As a classic introvert, I spent the days leading up to the meeting anxiously imagining every possible way the conversation might go. Or so I thought. The outcome was not one I predicted.&lt;/p&gt;
&lt;p&gt;The meeting was attended by three Directors responsible for the delivery of &amp;lsquo;Trove business&amp;rsquo;: the Director of Trove Community Services, the Director of Trove Data and Platforms, and the Director of Strategy and Transformation. Feeling somewhat outnumbered, I took along a colleague. I&amp;rsquo;m glad I did, as afterwards we had to confirm with each other that what we thought had just happened, actually happened. We were both stunned.&lt;/p&gt;
&lt;p&gt;The meeting started with a description of NLA&amp;rsquo;s change in API policy as described above. It was like stepping through the looking glass. I had not imagined a world where the NLA would set itself up as gatekeeper to every use of the digitised newspaper corpus through the API.&lt;/p&gt;
&lt;p&gt;At one point we were told that this change coincided with the release of version 2 of the API. But this doesn&amp;rsquo;t seem right. Checking the web archive, the terms of use page seems to have been updated when the whole Trove web interface changed in 2020. (You can see how it changed between &lt;a href=&#34;https://web.archive.org/web/20200227102147/http://help.nla.gov.au/trove/building-with-trove/api-terms-of-use&#34;&gt;February&lt;/a&gt; and &lt;a href=&#34;https://web.archive.org/web/20200909033745/https://trove.nla.gov.au/about/create-something/using-api/trove-api-terms-use&#34;&gt;September&lt;/a&gt; 2020.) In fact, &lt;a href=&#34;https://webarchive.nla.gov.au/awa/20191107024505/https://help.nla.gov.au/sites/default/files/API%20V2.1%20-%20Whats%20Changed%20-%20Full%20Text%20Search%20Release.pdf&#34;&gt;version 2.1 of the API&lt;/a&gt;, released in September 2019, was described by the NLA as opening up &amp;lsquo;access to richer data for API users, particularly the rapidly growing corpus of 1.6 million digitised articles from Australian journals, magazines and newsletters&amp;rsquo;. There was no indication then that access to this data required special permission.&lt;/p&gt;
&lt;p&gt;But when the change happened is less important than how it was, or wasn&amp;rsquo;t, communicated, and what it means for researchers. At multiple points throughout the meeting I stressed that if this is how they are interpreting and enforcing the API terms of use, they need to be explaining this change to API users. I&amp;rsquo;m certain that I&amp;rsquo;m not alone in being totally blindsided.&lt;/p&gt;
&lt;p&gt;The reasons for the change are not clear. There was some talk of &amp;lsquo;data governance&amp;rsquo;, and the fact that the online world had changed – though I fail to see how researchers downloading newspaper articles from the 1890s can be seen as a possible cyber threat. If there are particular problems or concerns, I suggested that it would be useful to have a broad-ranging conversation with the research sector, to see if it might be possible to carve out space for research uses within the terms of use. In response I was told there already is such a carve out – individual researchers can ask the NLA for permission.&lt;/p&gt;
&lt;h2 id=&#34;implications-for-my-own-work&#34;&gt;Implications for my own work&lt;/h2&gt;
&lt;p&gt;I was so shaken by this turn away from open access that the question of my own API keys hardly seemed to matter. The immediate reason for the cancellation of my keys is still not clear. The NLA admitted that I hadn&amp;rsquo;t used the API to extract text from NED journals as their original email claimed. Then it was suggested I used the API to &lt;em&gt;find&lt;/em&gt; NED journals to extract text from. This is also not true. Discussion then broadened to the whole of the &lt;a href=&#34;https://glam-workbench.net&#34;&gt;GLAM Workbench&lt;/a&gt;, and, yes, I readily admit that under the new interpretation much of my work breaches the API terms of use. The Trove Newspaper Harvester makes it easy for researchers to create datasets from the full text of newspaper articles in a set of search results. There are notebooks within the GLAM Workbench to help users access the full text of journals and books. In most cases researchers want and need content, not just metadata, and I&amp;rsquo;ve developed a range of tools to help them access it. But as I explained in my &lt;a href=&#34;https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html&#34;&gt;last post&lt;/a&gt;, none of this is new. I&amp;rsquo;ve been helping researchers in this way for 15 years. It&amp;rsquo;s the NLA that has changed, not me.&lt;/p&gt;
&lt;p&gt;It was suggested that if I wanted to regain my API keys, an additional series of meetings would be necessary to help me bring my work within the bounds of what is now permissible.&lt;/p&gt;
&lt;p&gt;So it seems I have a choice. Either I try to get out of API jail by submitting to the NLA&amp;rsquo;s re-education program, or I work with others in the research sector and beyond to try and change the NLA&amp;rsquo;s policy. I&amp;rsquo;m inclined to do the latter.&lt;/p&gt;
&lt;h2 id=&#34;whats-at-stake&#34;&gt;What&amp;rsquo;s at stake&lt;/h2&gt;
&lt;p&gt;Ten years ago we celebrated the Trove API because of what it made possible. Everyone was free to explore and create, to analyse changes across 100 years of digitised newspapers, to shift scales and find new meanings. One of &lt;a href=&#34;https://doi.org/10.5281/zenodo.3563238&#34;&gt;my most cited presentations&lt;/a&gt; from 2013 talks about how the API made Trove a platform we could all build upon. Openness was the key:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The more we become aware of the power of networked information, the more we become concerned with making and preserving its ‘openness’. To me open data is a process not a product – each visualisation, or interpretation can challenge our assumptions and help us to see things differently. Each use is an opening into new contexts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Even if it does not directly impact their work, I think all researchers should be alarmed by the NLA&amp;rsquo;s turn away from openness. Experiments now will have to be approved, judged against criteria which are themselves not public. Is that what we want or expect from a major, publicly funded, cultural institution?&lt;/p&gt;
&lt;p&gt;I also fear for the future of the API itself. This change makes it easier for the NLA to impose further limits over time. Perhaps researcher access to the API will be tied to future investment from the research sector. Perhaps it will be claimed the risks are too great and the API will be shut down completely. I&amp;rsquo;m no longer confident in the NLA&amp;rsquo;s commitment to providing researchers with long-term access to Trove data.&lt;/p&gt;
&lt;h2 id=&#34;where-to-now&#34;&gt;Where to now?&lt;/h2&gt;
&lt;p&gt;I want to thank everyone who has offered their support over the last couple of weeks. It&amp;rsquo;s been deeply encouraging to hear how my work over the past 15 years is valued, and how the GLAM Workbench has helped researchers and inspired new projects. Thanks too for making your views known to the NLA.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m now very doubtful that the Trove sections of the GLAM Workbench can be made acceptable to the NLA without major changes that would severely limit their usefulness. However, I&amp;rsquo;ll continue to maintain them as best I can without an API key, and I&amp;rsquo;ll continue to help researchers with their Trove questions.&lt;/p&gt;
&lt;p&gt;To get myself back into a more positive frame of mind, I think I&amp;rsquo;ll also do some work with collections from organisations who value openness and are interested in new uses of their data. Suggestions are welcome!&lt;/p&gt;
&lt;p&gt;But as I suggested above, the most important task ahead is to start talking about the implications of these changes at the NLA, particularly for the research sector.&lt;/p&gt;
&lt;p&gt;Stay tuned&amp;hellip;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;All posts on this topic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;25 February 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html&#34;&gt;15 years of work on Trove threatened by the NLA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;2 March 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/03/02/trove-api-users-beware-the.html&#34;&gt;Trove API users beware! – the latest in the saga of my cancelled API keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;11 April 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/04/11/update-on-trove-data-access.html&#34;&gt;Update on Trove data access and my suspended API keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;7 May 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;Farewell Trove&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>15 years of work on Trove threatened by the NLA</title>
      <link>https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html</link>
      <pubDate>Mon, 24 Feb 2025 12:02:48 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/02/24/years-of-work-on-trove.html</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://updates.timsherratt.org/2025/03/02/trove-api-users-beware-the.html&#34;&gt;See my latest post for an update!&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;On Friday, without warning, I received an email from the National Library of Australia informing me that my Trove API keys had been suspended. This threatens the future of 15 years of work helping people use and understand the possibilities of Trove for new types of research.&lt;/p&gt;
&lt;h2 id=&#34;whats-happened&#34;&gt;What&amp;rsquo;s happened?&lt;/h2&gt;
&lt;p&gt;Here&amp;rsquo;s the full text of the email:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Your recently published work on the GLAM Workbench regarding extracting metadata and text from a National e-Deposit (NED) periodical has been brought to the Library’s attention.&lt;/p&gt;
&lt;p&gt;Trove API Terms of Use specify that developers may access metadata only and do not provide extended rights. We consider the use of an API to extract and save full text as being in violation of the Terms of Use.&lt;/p&gt;
&lt;p&gt;Effective immediately, the four API keys currently registered to you: glamworkbench, headlineroulette, troveconsole and wragge will be suspended.&lt;/p&gt;
&lt;p&gt;Please feel free to get in touch for a more detailed conversation about this.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The reasons given for switching off my access don&amp;rsquo;t make any sense. While the API terms of use only mention metadata, the API, by design, delivers full text from newspapers, digitised periodicals and some books. If you interpret the terms of use as above, simply using the API &lt;strong&gt;as it has been designed and documented&lt;/strong&gt; would be seen as a breach! Surely that&amp;rsquo;s nonsense.&lt;/p&gt;
&lt;p&gt;In any case, the &lt;a href=&#34;https://glam-workbench.net/trove-ned/create-searchable-database/&#34;&gt;notebook they mention&lt;/a&gt; doesn&amp;rsquo;t even use the Trove API, so it&amp;rsquo;s hard to see how it could breach the API terms of use. I extracted the text from the periodicals simply by downloading the PDFs and using a standard PDF library. The notebook does scrape some metadata from the Trove website. This is necessary because the API has major limitations – you can&amp;rsquo;t, for example, get the members of a digitised collection. The NLA might want to argue that scraping breaches the website&amp;rsquo;s terms of use, but that&amp;rsquo;s a different point. I&amp;rsquo;d also note that I&amp;rsquo;ve been scraping data from the Trove website for 15 years without any objections (see below for more).&lt;/p&gt;
&lt;p&gt;When I was Trove manager, I drafted a previous version of the API terms of use. It was a lot less legalistic back then, and I&amp;rsquo;ve always understood that the point of the API and website terms of use was to protect the NLA from exploitation by commercial interests, not to inhibit work done by researchers in good faith.&lt;/p&gt;
&lt;p&gt;I developed the NED notebook in &lt;a href=&#34;https://updates.timsherratt.org/2025/02/19/search-the-content-of-periodicals.html&#34;&gt;response to a request for help by a community group&lt;/a&gt; that uses the National eDeposit service to preserve its newsletter. I did it for free, and I documented the results in the GLAM Workbench in case it might be of use to other communities and researchers.&lt;/p&gt;
&lt;p&gt;The &amp;lsquo;has been brought to the Library’s attention&amp;rsquo; bit is also grimly amusing. Everything I do is open, and wherever possible I tag GLAM organisations on social media to let them know I&amp;rsquo;m making use of their collections. The email makes it sound like I was trying to hide what I was doing, when in fact I tagged them on Facebook and LinkedIn. I thought they might be interested, and I suppose they were, just not in the way I hoped.&lt;/p&gt;
&lt;h2 id=&#34;whats-at-risk&#34;&gt;What&amp;rsquo;s at risk?&lt;/h2&gt;
&lt;p&gt;What&amp;rsquo;s the consequence of switching off my API keys? A few long-running services were broken immediately. Others will continue to work, but I&amp;rsquo;ll be unable to maintain them over time. Obviously I won&amp;rsquo;t be able to develop new Trove-related resources, and perhaps most importantly, my ability to help researchers with their Trove problems will be severely limited.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-02-24-11-46-54.png&#34; width=&#34;600&#34; height=&#34;334&#34; alt=&#34;Screenshot of the Trove API console displaying an error and a notice informing users that it no longer functions due to the NLA revoking my API key&#34;&gt;
&lt;p&gt;Tools and services that were broken immediately:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://troveconsole.herokuapp.com&#34;&gt;The Trove API Console&lt;/a&gt; – running since 2014, the API console has helped many people learn to use the API. I created it when I was Trove Manager, and it&amp;rsquo;s still linked from Trove&amp;rsquo;s &lt;a href=&#34;https://trove.nla.gov.au/about/create-something&#34;&gt;Create something&lt;/a&gt; page.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;The Trove Newspaper Data Dashboard&lt;/a&gt; (and related data harvests) – running since 2022, the data dashboard enables researchers to understand how the Trove newspaper corpus changes over time. The dashboard depends on &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals&#34;&gt;weekly data harvests&lt;/a&gt; that can no longer run, so there will be no further updates. Similarly other automated data harvests capturing and sharing data about Trove &lt;a href=&#34;https://github.com/wragge/trove-zone-totals&#34;&gt;categories&lt;/a&gt; and &lt;a href=&#34;https://github.com/wragge/trove-contributor-totals/&#34;&gt;contributors&lt;/a&gt; are now broken.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/trovenewsbot-fedi/&#34;&gt;@troveNewsBot&lt;/a&gt; – &lt;a href=&#34;https://discontents.com.au/conversations-with-collections/index.html&#34;&gt;created back in 2013&lt;/a&gt;, the bot has survived changes to the Trove API, the demise of Twitter, and a recent forced change of Mastodon instances, but it can&amp;rsquo;t work without an API key. I&amp;rsquo;m already missing its regular posts.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://headlineroulette.net/&#34;&gt;Headline Roulette&lt;/a&gt; – just a simple game, but a fun way to start a workshop and get people thinking about the possibilities of Trove. It&amp;rsquo;s been running since 2010.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tools and resources that I won&amp;rsquo;t be able to update or maintain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; includes &lt;strong&gt;more than 80 Jupyter notebooks&lt;/strong&gt; that demonstrate how to work with data from Trove. Most of these require an API key. Fortunately users will still be able to use them with their own keys, but I won&amp;rsquo;t be able to do any further development or testing. This includes tools like &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/querypic/&#34;&gt;QueryPic&lt;/a&gt;, which I&amp;rsquo;ve been maintaining since 2013 and which is cited in the research literature.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; also includes &lt;strong&gt;more than 30 datasets&lt;/strong&gt; capturing information about Trove. Some document changes in things like &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/csv-newspapers-corrections/&#34;&gt;text correction&lt;/a&gt;, and the use of &lt;a href=&#34;https://glam-workbench.net/trove-lists/trove-public-tags/&#34;&gt;tags&lt;/a&gt;, while others provide alternative entry points to important collections of digitised resources, such as &lt;a href=&#34;https://glam-workbench.net/trove-music/trove-oral-histories/&#34;&gt;oral histories&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/trove-maps/explore-maps/&#34;&gt;maps&lt;/a&gt;. I won&amp;rsquo;t be able to update any of these datasets.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt; incorporates many visualisations and summaries generated from Trove data. I won&amp;rsquo;t be able to update these. There are also many API examples that link to the now broken API console.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/trove-newspaper-harvester/&#34;&gt;trove-newspaper-harvester&lt;/a&gt; is a Python package that&amp;rsquo;s existed in different forms since 2010. It&amp;rsquo;s used by researchers to create datasets of newspaper articles and has been cited a number of times in the research literature. It won&amp;rsquo;t be immediately affected because users supply their own API keys, but I won&amp;rsquo;t be able to maintain it into the future.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://troveplaces.herokuapp.com/&#34;&gt;Trove places&lt;/a&gt; is a map-based interface to Trove&amp;rsquo;s newspapers. It&amp;rsquo;s dependent on data harvested using the API, so I won&amp;rsquo;t be able to update it.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Planned developments I won&amp;rsquo;t be able to undertake:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I had hoped this year to automate more data harvests so I could, for example, provide regular updates on digitised collections such as books, periodicals, and maps.&lt;/li&gt;
&lt;li&gt;I was planning to add a number of new sections to the Trove Data Guide, including maps, photos, ephemera, and manuscripts.&lt;/li&gt;
&lt;li&gt;I was intending to update the trove-newspaper-harvester to make it possible to identify and capture changes in a results set.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;m very disappointed that the automated data harvests are now broken. As I &lt;a href=&#34;https://updates.timsherratt.org/2024/09/20/preserving-the-history.html&#34;&gt;suggested in this post&lt;/a&gt;, I think it&amp;rsquo;s important that we capture information about online collections so that future researchers will be able to investigate their impact. I&amp;rsquo;ve been working to streamline, standardise, and automate this data collection, both through the weekly harvests and the &lt;a href=&#34;https://zenodo.org/communities/trove-historical-data/records&#34;&gt;Trove historical data collection in Zenodo&lt;/a&gt;. But this will now stop.&lt;/p&gt;
&lt;p&gt;Most disappointing of all, however, is that without an API key I won&amp;rsquo;t be able to help researchers who come to me asking how to get data out of Trove. In finding solutions to their problems I often end up creating new notebooks so that the knowledge can be shared and all researchers can benefit. I won&amp;rsquo;t be able to do this any more.&lt;/p&gt;
&lt;p&gt;The GLAM Workbench includes &lt;a href=&#34;https://glam-workbench.net/citations/&#34;&gt;a list of published research articles&lt;/a&gt; that cite the GLAM Workbench or one of its associated tools, such as QueryPic and the Trove Newspaper Harvester. Many of these publications have used my tools to work with data from Trove. This sort of research will suffer if the tools can&amp;rsquo;t be maintained.&lt;/p&gt;
&lt;p&gt;Of course, all of my work is openly licensed and freely available through GitHub and Zenodo. If I can&amp;rsquo;t maintain the code, hopefully others will jump in and take over.&lt;/p&gt;
&lt;h2 id=&#34;trove-and-me&#34;&gt;Trove and me&lt;/h2&gt;
&lt;p&gt;I started scraping data from the digitised newspapers in 2009, before they were even a part of Trove. In 2010, I created the first versions of QueryPic and the Trove Newspaper Harvester. There was no API then, so I built a library of screen scrapers to extract the data. I ended up publishing my own &amp;lsquo;unofficial&amp;rsquo; API using the screen scrapers. I found out later that my &amp;lsquo;unofficial&amp;rsquo; API was used in the design of the official version that became available in 2012.&lt;/p&gt;
&lt;p&gt;The work I was doing analysing digitised newspapers won me the NLA&amp;rsquo;s Harold White Fellowship in 2012. In 2013, I was appointed Trove Manager. Throughout my time at the NLA I lived something of a double life – manager by day, hacker by night. I continued to build tools and demonstrations to help people understand what the API made possible. Talking about the API and the new types of research that Trove opened up was one of my favourite parts of the job.&lt;/p&gt;
&lt;p&gt;Nothing much changed after I left the library. I continued to build tools, help researchers, and &lt;a href=&#34;https://timsherratt.au/about/#invited-posts--presentations&#34;&gt;give talks and workshops&lt;/a&gt; on the possibilities of Trove data. In 2017, I started to bring a lot of this work together within the GLAM Workbench. In 2023-24, I worked with the Australian Research Data Commons to develop the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;, documenting what I knew about Trove&amp;rsquo;s intricacies and inconsistencies.&lt;/p&gt;
&lt;p&gt;My point really is that I&amp;rsquo;ve been doing this for 15 years now. Everything has been in the open, my approach has never really changed, and some of the work was actually supported by the NLA. So what&amp;rsquo;s different now?&lt;/p&gt;
&lt;p&gt;Certainly the NLA&amp;rsquo;s attitude has changed. When I was Trove manager we used to celebrate the interesting things that people did with the Trove API. In contrast, the NLA has never publicly acknowledged that the GLAM Workbench exists, and certainly hasn&amp;rsquo;t shared any links to it. This was taken to ludicrous extremes in 2021, when the NLA&amp;rsquo;s draft project plan for funding as part of the ARDC&amp;rsquo;s HASS Research Data Commons &lt;a href=&#34;https://updates.timsherratt.org/2021/09/10/some-thoughts-on.html&#34;&gt;proposed to duplicate tools already available&lt;/a&gt; through the GLAM Workbench. Just a few months earlier in December 2020, the GLAM Workbench won the &lt;a href=&#34;https://glam-workbench.net/awards/#british-library-lab-awards-2020&#34;&gt;British Library Labs Research Award&lt;/a&gt;. It&amp;rsquo;s strange that there has been much more engagement with the GLAM Workbench from national libraries in Europe than Australia.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know why this is, but it has been immensely frustrating, even heart-breaking. I do the work I do to help people use and understand Trove. But how do they find out about it? You&amp;rsquo;d think that the NLA would be pleased to support researchers by pointing them to tools and resources that would help them make best use of Trove. You&amp;rsquo;d think that the NLA would be thrilled to have people spending their own time and money to build and maintain those resources. But no.&lt;/p&gt;
&lt;p&gt;It seems to me that the NLA has become increasingly closed off and defensive in recent years. Perhaps that&amp;rsquo;s to be expected given the funding pressures they&amp;rsquo;ve faced. But in challenging times you&amp;rsquo;d think it was more important than ever to bring together your supporters.&lt;/p&gt;
&lt;p&gt;Much of my work does involve criticism of Trove. It&amp;rsquo;s an unwieldy beast, with many problems and inconsistencies. It&amp;rsquo;s part of my job (mission? calling?) to expose these problems and help users work around them. It wouldn&amp;rsquo;t help anyone for me to ignore Trove&amp;rsquo;s shortcomings. My criticisms come with suggestions and solutions. My aim is not to undermine, but encourage – to guide people past the many pitfalls and challenges to find the treasure within.&lt;/p&gt;
&lt;p&gt;Back in November 2016, on the day after Trump&amp;rsquo;s first election victory, I gave a short presentation at the &amp;lsquo;Digital Directions&amp;rsquo; conference in Canberra. The main point of my talk, entitled &amp;lsquo;Caring about access&amp;rsquo;, was that GLAM organisations should embrace criticism. Here&amp;rsquo;s part of what I said:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Access is not something that cultural institutions bestow on a grateful public. It’s a struggle for understanding and meaning. Expect to be criticised, expect problems to be found, expect your prejudices to be exposed. That’s the point.&lt;/p&gt;
&lt;p&gt;If cultural institutions want to celebrate their website hits, celebrity visits, or their latest glossy magazines – well that’s just fabulous. But I’d like them to celebrate every flaw that’s found in their data, every gap identified in their collection – that’s engagement, that’s access. We need to get beyond defensive posturing and embrace the risky, exciting possibilities that come from critical engagement with collection data – recognising hacking as a way of knowing.&lt;/p&gt;
&lt;p&gt;In this new post-truth world it’s going to be more important than ever to challenge what is given, what is ‘natural’, what is ‘inevitable’. Our cultural heritage will be a crucially important resource to be mobilised in defence of complexity, nuance, and doubt – the rich and glorious reality of simply being human.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The early part of 2016 was dominated by the &lt;a href=&#34;https://discontents.com.au/fundtrove/index.html&#34;&gt;#fundTrove campaign&lt;/a&gt;, when Trove users mobilised to make the government aware of its importance to the Australian community. It took over my life for a while, and while many were keen to claim credit for the campaign&amp;rsquo;s ultimate success, it left me thinking that GLAM organisations need to better understand who their real friends are – the people who actually give a shit. It seems that the NLA is still struggling with that.&lt;/p&gt;
&lt;h2 id=&#34;so-what-now&#34;&gt;So what now?&lt;/h2&gt;
&lt;p&gt;I have to admit that the NLA&amp;rsquo;s inability to acknowledge the existence of the GLAM Workbench has taken an emotional toll. At times I&amp;rsquo;ve considered giving up the work. Why bother if it&amp;rsquo;s not going to get to the people who might benefit most?&lt;/p&gt;
&lt;p&gt;So at this moment I don&amp;rsquo;t feel like arguing with the NLA. If they think so little of my work that they&amp;rsquo;re happy to simply pull the plug and let it die, then what&amp;rsquo;s the point in trying to continue?&lt;/p&gt;
&lt;p&gt;However, there&amp;rsquo;s a bigger issue. Whatever happens to my work, it&amp;rsquo;s important that this &lt;em&gt;type&lt;/em&gt; of work be encouraged and supported. Trove offers immense possibilities for new types of research and we need to explore and document them together. Central to this is a well-supported API. I&amp;rsquo;m worried that this little battle is actually a sign of waning commitment to the API and what it represents. Earlier this year I was shocked when the NLA suddenly decommissioned version 2 of the API without fixing &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-api-intro/issues/49&#34;&gt;major bugs in version 3&lt;/a&gt;. I think we need to stress that easy access to Trove data is vitally important to the future of Australian HASS research.&lt;/p&gt;
&lt;p&gt;So if you&amp;rsquo;ve used any of my tools or resources, or value the work I&amp;rsquo;ve been doing over the last 15 years, you might like to tell the NLA about it. I don&amp;rsquo;t know if it&amp;rsquo;ll make any difference, but at least they&amp;rsquo;ll be better informed about the sorts of things people are doing with Trove data, and the types of resources that are needed to support them.&lt;/p&gt;
&lt;p&gt;Contact options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://librariesaustraliaref.nla.gov.au/reft100.aspx?key=Trove_Feedback&#34;&gt;Trove feedback form&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Marie-Louise Ayres, Director-General of the NLA (&lt;a href=&#34;mailto:directorgeneral@nla.gov.au&#34;&gt;directorgeneral@nla.gov.au&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Tony Burke, Minister for the Arts (&lt;a href=&#34;mailto:tony.burke.mp@aph.gov.au&#34;&gt;tony.burke.mp@aph.gov.au&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Of course you can also share your thoughts on social media!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;All posts on this topic:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;25 February 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/02/24/years-of-work-on-trove.html&#34;&gt;15 years of work on Trove threatened by the NLA&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;2 March 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/03/02/trove-api-users-beware-the.html&#34;&gt;Trove API users beware! – the latest in the saga of my cancelled API keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;11 April 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/04/11/update-on-trove-data-access.html&#34;&gt;Update on Trove data access and my suspended API keys&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;7 May 2025: &lt;a href=&#34;https://updates.timsherratt.org/2025/05/07/farewell-trove.html&#34;&gt;Farewell Trove&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>The Primary Source – GLAM collection news and help</title>
      <link>https://updates.timsherratt.org/2025/02/20/the-primary-source-glam-collection.html</link>
      <pubDate>Thu, 20 Feb 2025 16:31:54 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/02/20/the-primary-source-glam-collection.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve created a new site (or in fact, renovated an old site) to aggregate news from GLAM collections (that&amp;rsquo;s galleries, libraries, archives, and museums) and help researchers using those collections. It&amp;rsquo;s called &lt;a href=&#34;https://ozglam.chat/&#34;&gt;The Primary Source&lt;/a&gt; which is a bit of a bad history pun.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-02-20-16-27-47.png&#34; width=&#34;600&#34; height=&#34;423&#34; alt=&#34;Screen capture of the new Primary Source&#34;&gt;
&lt;h2 id=&#34;why-is-is-needed&#34;&gt;Why is it needed?&lt;/h2&gt;
&lt;p&gt;Before the nazi takeover of the old bird site, I had a list of GLAM organisation accounts which made it pretty easy to follow what was going on in Australia&amp;rsquo;s galleries, libraries, archives, and museums. Things are more fragmented now and surviving social media accounts seem dominated by event promotion, cute videos, and cultural heritage clickbait. There are a few blogs (though apparently the fashion is to call them &amp;lsquo;stories&amp;rsquo;), but functioning RSS feeds are rare. How can researchers find out about new GLAM collections or resources across Australia? Hopefully &lt;em&gt;The Primary Source&lt;/em&gt; can help by aggregating collection news from a variety of platforms.&lt;/p&gt;
&lt;p&gt;At the same time, I wanted to provide a space where researchers can share their latest work using GLAM collections, and ask for help when needed. Unfortunately, GLAM social media accounts (with a few exceptions) rarely share the work of researchers outside their own fellowship and events programs. &lt;em&gt;The Primary Source&lt;/em&gt; is built on top of the Discourse discussion platform, so anyone can create an account and contribute.&lt;/p&gt;
&lt;h2 id=&#34;discussion-categories&#34;&gt;Discussion categories&lt;/h2&gt;
&lt;p&gt;At the moment there are four main categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://ozglam.chat/c/collection-news/6&#34;&gt;Collection news&lt;/a&gt;: News about recent additions or updates to Australian GLAM collections&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://ozglam.chat/c/useful-resources/11&#34;&gt;Useful resources&lt;/a&gt;: Resources that help people use and understand Australian GLAM collections – digital tools, finding aids, webinars etc.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://ozglam.chat/c/help-needed/5&#34;&gt;Help needed&lt;/a&gt;: If you’re having a problem using Australian GLAM collections, ask for help here. No question is too basic.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://ozglam.chat/c/research-updates/9&#34;&gt;Research updates&lt;/a&gt;: Share news from your own research projects.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;m trying to keep things pretty simple, but ideas for new categories are welcome.&lt;/p&gt;
&lt;h2 id=&#34;how-to-contribute&#34;&gt;How to contribute&lt;/h2&gt;
&lt;p&gt;As mentioned, anyone can create an account and start contributing to &lt;em&gt;The Primary Source&lt;/em&gt;. You can manually add posts using the Discourse interface, or you can take advantage of the site&amp;rsquo;s automated Zotero-bot. If you come across an interesting post or resource on the web, simply use &lt;a href=&#34;https://www.zotero.org/&#34;&gt;Zotero&lt;/a&gt; to save it to the appropriate category in the &lt;a href=&#34;https://www.zotero.org/groups/5835691/primary_source_contributions&#34;&gt;Primary Source contributions Zotero group&lt;/a&gt;. Every 30 minutes, the Zotero-bot will check the group and add any new links to &lt;em&gt;The Primary Source&lt;/em&gt;. If you add tags or comments to a link in Zotero, they&amp;rsquo;ll be attached to the new Discourse post. To use the Zotero-bot, all you need to do is join the &lt;a href=&#34;https://www.zotero.org/groups/5835691/primary_source_contributions&#34;&gt;Zotero group&lt;/a&gt;, then it&amp;rsquo;ll pop up in your list of group libraries.  Once again, membership is open.&lt;/p&gt;
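&lt;p&gt;For the curious, the Zotero-bot&amp;rsquo;s basic logic can be sketched in a few lines of Python. This is a minimal illustration, not the bot&amp;rsquo;s actual code – the field handling and the Discourse payload shape are assumptions, though the item structure follows the Zotero web API, where bibliographic fields live under each item&amp;rsquo;s &lt;code&gt;data&lt;/code&gt; key.&lt;/p&gt;

```python
# Hypothetical sketch of turning a Zotero group item into a Discourse post.
# The payload fields mirror Discourse's posting API; the category id and
# field choices here are illustrative assumptions.
from typing import Dict


def build_discourse_payload(item: Dict, category_id: int) -> Dict:
    """Map a Zotero item's metadata to a Discourse post payload."""
    data = item["data"]
    tags = [t["tag"] for t in data.get("tags", [])]
    # The post body combines the saved link with any comment/abstract.
    body_parts = [data.get("url", ""), data.get("abstractNote", "")]
    return {
        "title": data.get("title", "Untitled"),
        "raw": "\n\n".join(p for p in body_parts if p),
        "category": category_id,
        "tags": tags,
    }


item = {
    "data": {
        "title": "New digitised maps in Trove",
        "url": "https://example.com/maps",
        "abstractNote": "A big batch of parish maps.",
        "tags": [{"tag": "maps"}],
    }
}
payload = build_discourse_payload(item, category_id=6)
print(payload["title"])
```

A real bot would poll the group every 30 minutes, remember which item keys it had already seen, and POST each new payload to Discourse with an API key.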
&lt;p&gt;Obviously any content that is offensive or off-topic will be removed.&lt;/p&gt;
&lt;h2 id=&#34;staying-in-touch&#34;&gt;Staying in touch&lt;/h2&gt;
&lt;p&gt;To keep up-to-date with the latest posts from &lt;em&gt;The Primary Source&lt;/em&gt; you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Create an account at &lt;em&gt;The Primary Source&lt;/em&gt;. Once you’ve done that, you’ll receive a weekly email with posts you might have missed. (Of course you can disable this if you don’t want more emails!)&lt;/li&gt;
&lt;li&gt;Add the &lt;em&gt;The Primary Source&lt;/em&gt; RSS feed to your preferred feed reader: &lt;a href=&#34;https://ozglam.chat/latest.rss&#34;&gt;ozglam.chat/latest.rs&amp;hellip;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Follow &lt;a href=&#34;https://wraggebots.net/@primarysourcebot&#34;&gt;@primarysourcebot&lt;/a&gt; on Mastodon. The bot checks for new posts every hour and shares them through the fediverse.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;third-time-lucky&#34;&gt;Third time lucky?&lt;/h2&gt;
&lt;p&gt;The first version of &lt;em&gt;The Primary Source&lt;/em&gt; was created way back in 1998. Then, as now, I wanted to help researchers find and use collections from Australia&amp;rsquo;s archives and libraries. There weren&amp;rsquo;t many content management systems around back then, so I rolled my own using PHP. It was a pain to maintain, and life went elsewhere, so it didn&amp;rsquo;t survive very long. I&amp;rsquo;ve always thought the name was pretty clever though&amp;hellip;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-02-20-16-23-38.png&#34; width=&#34;600&#34; height=&#34;842&#34; alt=&#34;Screenshot of The Primary Source circa 1998&#34;&gt;
&lt;p&gt;The pandemic made me think again about ways of supporting researchers, so I created the OzGLAM help discussion board. Even though there wasn&amp;rsquo;t much activity, I&amp;rsquo;ve always believed something like this was needed. So with a bit of remodelling and renovation, OzGLAM help has been flipped to make &lt;em&gt;The Primary Source&lt;/em&gt;. Will it survive this time? I suppose that&amp;rsquo;s a matter of whether you find it useful and valuable. If you do, &lt;strong&gt;please share with your friends and colleagues&lt;/strong&gt;!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>National Archives of Australia Digitisation Dashboard</title>
      <link>https://updates.timsherratt.org/2025/02/20/national-archives-of-australia-digitisation.html</link>
      <pubDate>Thu, 20 Feb 2025 14:41:46 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/02/20/national-archives-of-australia-digitisation.html</guid>
      <description>&lt;p&gt;Since March 2021, I&amp;rsquo;ve been &lt;a href=&#34;https://github.com/wragge/naa-recently-digitised&#34;&gt;harvesting details of newly-digitised files&lt;/a&gt; in the National Archives of Australia to help document long-term changes to online access. A few weeks ago, I &lt;a href=&#34;https://updates.timsherratt.org/2025/01/27/files-digitised-by-the-national.html&#34;&gt;summarised the data from 2024&lt;/a&gt;, and published &lt;a href=&#34;https://doi.org/10.5281/zenodo.14744049&#34;&gt;annual compilations in Zenodo&lt;/a&gt;. I&amp;rsquo;ve now created an &lt;a href=&#34;https://wragge.github.io/naa-recently-digitised/&#34;&gt;automatically-updated dashboard&lt;/a&gt; which displays digitisation progress in the past week, the current year, and since my harvests began.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-02-20-14-35-04.png&#34; width=&#34;600&#34; height=&#34;477&#34; alt=&#34;Screenshot of digitisation dashboard showing details from the past week.&#34;&gt;
&lt;p&gt;Each week, after the latest data harvest, a GitHub action runs a Jupyter notebook that pulls in the data, generates some visualisations and summaries, and saves the results as an HTML page. It&amp;rsquo;s similar to the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;Trove newspaper data dashboard&lt;/a&gt;. Check in every Sunday afternoon to see what&amp;rsquo;s changed!&lt;/p&gt;
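&lt;p&gt;The weekly workflow described above might look something like this as a GitHub Actions file. This is a hypothetical sketch – the notebook name, schedule, and publishing steps are assumptions, not the repository&amp;rsquo;s actual configuration.&lt;/p&gt;

```yaml
# Hypothetical workflow -- runs a notebook weekly and saves the result as HTML.
name: Update dashboard
on:
  schedule:
    - cron: "0 3 * * 0"   # weekly, after the Sunday data harvest
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - name: Execute notebook and save as HTML
        run: jupyter nbconvert --execute --to html dashboard.ipynb --output index
      - name: Commit the updated page
        run: |
          git config user.name github-actions
          git add index.html
          git commit -m "Update dashboard" || true
          git push
```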
</description>
    </item>
    
    <item>
      <title>Search the content of periodicals uploaded to Trove through the National eDeposit service </title>
      <link>https://updates.timsherratt.org/2025/02/19/search-the-content-of-periodicals.html</link>
      <pubDate>Wed, 19 Feb 2025 15:12:41 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/02/19/search-the-content-of-periodicals.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve &lt;a href=&#34;https://glam-workbench.net/trove-ned/create-searchable-database/&#34;&gt;added a notebook&lt;/a&gt; to the GLAM Workbench that walks through the steps involved in creating a fully searchable database of content extracted from a periodical uploaded to &lt;a href=&#34;https://trove.nla.gov.au/&#34;&gt;Trove&lt;/a&gt; through the &lt;a href=&#34;https://ned.gov.au/ned/&#34;&gt;National eDeposit service (NED)&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;why-is-this-needed&#34;&gt;Why is this needed?&lt;/h2&gt;
&lt;p&gt;I was contacted recently by a member of the team that publishes &lt;em&gt;&lt;a href=&#34;https://nla.gov.au/nla.obj-3121636851&#34;&gt;The Triangle&lt;/a&gt;&lt;/em&gt;, a community newsletter from the south coast of NSW. Issues of &lt;em&gt;The Triangle&lt;/em&gt; from 2007 to the present have been uploaded to Trove through the National eDeposit service, but they were wondering whether it was possible to search &lt;em&gt;across&lt;/em&gt; all their newsletters in Trove. Unfortunately, the answer is no.&lt;/p&gt;
&lt;p&gt;Issues of &lt;em&gt;The Triangle&lt;/em&gt; are saved in Trove as PDFs with a searchable text layer. Individual issues can be browsed and searched using the built-in PDF viewer, but there seems to be no way of searching across multiple issues in Trove. There are a couple of reasons for this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The text content doesn&amp;rsquo;t seem to have been indexed. Trove has the ability to index text from PDFs, though there&amp;rsquo;s a limit to how much text it consumes. Things like articles from institutional repositories, for example, will have some or all of their content indexed. It&amp;rsquo;s hard to be certain, but searches for content within &lt;em&gt;The Triangle&lt;/em&gt; don&amp;rsquo;t work, so I&amp;rsquo;m assuming they&amp;rsquo;re not being indexed.&lt;/li&gt;
&lt;li&gt;Even if they were indexed, there are no work level records for articles or issues within NED periodicals. So if there was a match on content within an issue, your search results would return the top-level title record, and you&amp;rsquo;d still have to search each issue individually to find the match. Compare this to the way Trove&amp;rsquo;s digitised periodicals are indexed at article level in the &lt;em&gt;Magazines &amp;amp; Newsletters&lt;/em&gt; category.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;On top of this, Trove&amp;rsquo;s inconsistency and lack of transparency means that you&amp;rsquo;re never really sure what you&amp;rsquo;re searching and why. Why do some PDFs get indexed, but others don&amp;rsquo;t? Why are community newsletters contributed through NED in the &amp;lsquo;Books &amp;amp; Libraries&amp;rsquo; category rather than &amp;lsquo;Magazines &amp;amp; Newsletters&amp;rsquo;? Why are some periodicals searchable by article while others are not? I&amp;rsquo;m trying to document many of these inconsistencies in the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt; as they cause confusion and uncertainty for users – if your search returns no results, is it because Trove has no relevant content, or is it because the relevant content isn&amp;rsquo;t fully searchable? You just don&amp;rsquo;t know.&lt;/p&gt;
&lt;h2 id=&#34;why-is-this-important&#34;&gt;Why is this important?&lt;/h2&gt;
&lt;p&gt;I recently updated &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-ned-periodicals-data&#34;&gt;my harvest of periodicals submitted to Trove through the National eDeposit Service&lt;/a&gt;. In total, there are &lt;strong&gt;8,572 different periodicals containing 179,510 issues&lt;/strong&gt;! I used the &lt;code&gt;l-format=Periodical&lt;/code&gt; facet to separate out the periodicals from other types of publications, and I don&amp;rsquo;t think it&amp;rsquo;s always accurate – some of the publications in the dataset look like one-off reports. Nonetheless, there are &lt;em&gt;lots&lt;/em&gt; of periodicals. As &lt;a href=&#34;https://updates.timsherratt.org/2024/04/10/getting-to-know.html&#34;&gt;I&amp;rsquo;ve noted previously&lt;/a&gt;, this includes a rich assortment of local and community newsletters – not just &lt;em&gt;The Triangle&lt;/em&gt;, but &lt;em&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2815835489&#34;&gt;The Apollo Bay News&lt;/a&gt;&lt;/em&gt;, &lt;em&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-1252246096&#34;&gt;Palm Island Voice&lt;/a&gt;&lt;/em&gt;, and many, many others. As local newspapers die out, these sorts of publications capture details of community life that might otherwise be missing from the historical record. Just as the diversity of Trove&amp;rsquo;s digitised newspapers has given historians new perspectives on the past, I believe these NED periodicals will provide researchers with an increasingly important source of information on daily life in Australia. Equally, having these publications accessible online through a free, national service like Trove ensures that communities themselves will have ongoing access to their own histories. But both for researchers and communities, the value of the publications will be affected by their accessibility – how can they be found, searched, and used?&lt;/p&gt;
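&lt;p&gt;A harvest request using that facet can be sketched as a simple URL-building function. The endpoint and parameter names follow version 3 of the Trove API, but the query values below are illustrative – check the current API documentation (and the harvest repository) for the exact query used.&lt;/p&gt;

```python
# Sketch of a Trove API v3 request using the l-format facet discussed above.
# The q and category values are placeholder assumptions, not the real harvest query.
from urllib.parse import urlencode

API_BASE = "https://api.trove.nla.gov.au/v3/result"


def build_periodicals_query(n: int = 100) -> str:
    """Build a Trove search URL that limits results to the Periodical format."""
    params = {
        "q": "*",                  # illustrative -- the real harvest targets NED content
        "category": "all",
        "l-format": "Periodical",  # separate periodicals from other formats
        "encoding": "json",
        "n": n,                    # results per request
    }
    return f"{API_BASE}?{urlencode(params)}"


print(build_periodicals_query())
```

In practice you would send this URL with your API key in an `X-API-KEY` header and page through the results.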
&lt;h2 id=&#34;one-solution--a-diy-search-index&#34;&gt;One solution – a DIY search index&lt;/h2&gt;
&lt;p&gt;If you can&amp;rsquo;t search across a NED periodical within Trove, perhaps we need to develop alternative approaches outside of Trove. Using &lt;em&gt;The Triangle&lt;/em&gt; as my test case, I&amp;rsquo;ve developed a workflow that creates a standalone, fully searchable database of content from a NED periodical. The database supports full text searches across the complete text content, including query options like wildcards and boolean operators that don&amp;rsquo;t work within standard PDF viewers. Have a go! You can try &lt;a href=&#34;https://glam-workbench.net/datasette-lite-search/?url=https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/the-triangle.db&amp;amp;metadata=https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/metadata.json&#34;&gt;searching The Triangle&lt;/a&gt; here.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-02-19-14-44-59.png&#34; width=&#34;600&#34; height=&#34;363&#34; alt=&#34;Screenshot of the Triangle search interface&#34;&gt;
&lt;p&gt;There are a number of different steps involved in creating databases like this from NED periodicals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;harvest basic metadata about all the issues&lt;/li&gt;
&lt;li&gt;download the PDFs of every issue&lt;/li&gt;
&lt;li&gt;extract the text for each page in the PDFs&lt;/li&gt;
&lt;li&gt;create an SQLite database containing the metadata and text for each page&lt;/li&gt;
&lt;li&gt;build a full-text index on the text content to allow easy searching&lt;/li&gt;
&lt;/ul&gt;
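&lt;p&gt;The last two steps above can be sketched with Python&amp;rsquo;s built-in sqlite3 module and SQLite&amp;rsquo;s FTS5 extension. The schema, column names, and sample data below are simplified illustrations, not the exact structure created by the notebook.&lt;/p&gt;

```python
# Minimal sketch: a pages table, an FTS5 index over its text, and a search.
# Table and column names are hypothetical simplifications.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pages (issue_date TEXT, page INTEGER, text TEXT)")
pages = [
    ("2024-03-01", 1, "The oyster festival returns to the Triangle."),
    ("2024-03-01", 2, "Council notes and tide times."),
    ("2024-04-01", 1, "Volunteers wanted for the landcare group."),
]
con.executemany("INSERT INTO pages VALUES (?, ?, ?)", pages)

# Build a full-text index on the text column (FTS5 'external content' mode,
# so the text itself isn't stored twice).
con.execute("CREATE VIRTUAL TABLE pages_fts USING fts5(text, content='pages')")
con.execute("INSERT INTO pages_fts(rowid, text) SELECT rowid, text FROM pages")

# Full-text query with highlighted snippets -- boolean operators just work.
rows = con.execute(
    """SELECT p.issue_date, p.page, snippet(pages_fts, 0, '[', ']', '...', 8)
       FROM pages_fts JOIN pages p ON p.rowid = pages_fts.rowid
       WHERE pages_fts MATCH 'oyster OR landcare'
       ORDER BY pages_fts.rank"""
).fetchall()
for issue_date, page, snip in rows:
    print(issue_date, page, snip)
```

Wildcard queries (`oyster*`) and phrase queries (`"tide times"`) work the same way via the `MATCH` clause.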
&lt;p&gt;This is fully documented in the GLAM Workbench notebook &lt;a href=&#34;https://glam-workbench.net/trove-ned/create-searchable-database/&#34;&gt;Create a searchable database from issues of a NED periodical&lt;/a&gt;. Once you&amp;rsquo;ve created the database you can explore it using any SQLite client, but I like to use &lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt;. The notebook creates a custom metadata file for configuring Datasette, and explains how to open the database using either Datasette or Datasette-Lite.&lt;/p&gt;
&lt;h2 id=&#34;customising-datasette-for-easy-searching&#34;&gt;Customising Datasette for easy searching&lt;/h2&gt;
&lt;p&gt;The standard Datasette interface can look a little intimidating if you just want to run a full text search. To make it easier, I&amp;rsquo;ve developed a customised Datasette theme and canned query that generates a simple search page with a few extra features such as date facets, result snippets, and query highlighting. It&amp;rsquo;s basically designed to look like a standard search interface. Sometimes simple takes a bit of extra work!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-02-19-13-59-07.png&#34; width=&#34;600&#34; height=&#34;445&#34; alt=&#34;Screenshot of the customised search interface showing search results from The Triangle&#34;&gt;
&lt;p&gt;The canned query defines the search parameters and constructs the SQL query called by the search box. It&amp;rsquo;s automatically included in the &lt;code&gt;metadata.json&lt;/code&gt; file created by the notebook. To put everything together, you just need to point Datasette to your database, the metadata file, and the custom template.&lt;/p&gt;
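&lt;p&gt;For readers unfamiliar with Datasette&amp;rsquo;s canned queries, a &lt;code&gt;metadata.json&lt;/code&gt; file along these lines defines a named query with a &lt;code&gt;:query&lt;/code&gt; parameter that the search box can call. The database, query, and column names below are invented for illustration; the notebook generates the real file for you.&lt;/p&gt;

```json
{
  "databases": {
    "my-periodical": {
      "queries": {
        "search": {
          "title": "Search this periodical",
          "sql": "select issue_id, page, snippet(pages_fts, 0, '[', ']', '...', 30) as snippet from pages_fts join pages on pages.rowid = pages_fts.rowid where pages_fts match :query"
        }
      }
    }
  }
}
```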
&lt;p&gt;I&amp;rsquo;ve embedded the template in a &lt;a href=&#34;https://github.com/GLAM-Workbench/datasette-lite-search/&#34;&gt;new Datasette-Lite repository&lt;/a&gt;. The notebook explains how to construct a url that will open your database using this repository. &lt;a href=&#34;https://glam-workbench.net/datasette-lite-search/?url=https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/the-triangle.db&amp;amp;metadata=https://github.com/GLAM-Workbench/trove-ned-periodicals/blob/main/dbs/the-triangle/metadata.json&#34;&gt;The Triangle search interface&lt;/a&gt; runs using this customised version of Datasette-Lite.&lt;/p&gt;
&lt;h2 id=&#34;a-new-ned-section-for-the-glam-workbench&#34;&gt;A new NED section for the GLAM Workbench&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve also created a new &lt;a href=&#34;https://glam-workbench.net/trove-ned/&#34;&gt;Trove NED section&lt;/a&gt; of the GLAM Workbench. The notebook that &lt;a href=&#34;https://glam-workbench.net/trove-ned/harvest-ned-periodicals/&#34;&gt;harvests metadata about NED periodicals&lt;/a&gt; was previously in the Trove periodicals section, but I think it&amp;rsquo;s better to keep the NED documentation separate. When I first started developing the GLAM Workbench, it made sense for the structure to mirror Trove&amp;rsquo;s zones. But over the years I&amp;rsquo;ve become &lt;em&gt;very&lt;/em&gt; aware that the way content is treated in Trove has less to do with its format than with the processing pipelines that get it into Trove. In Trove, digitised newspapers are very different to digitised journals, and digitised journals are very different to NED periodicals, even though they&amp;rsquo;re all really periodicals! There&amp;rsquo;s more of this sort of fun in the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;please-share&#34;&gt;Please share!&lt;/h2&gt;
&lt;p&gt;Trove and the National Library of Australia refuse to share links relating to the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; or the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;, so there&amp;rsquo;s a good chance that the people who might benefit most directly from this work will never know that it exists, and will instead be left struggling on their own. I think that&amp;rsquo;s bad for Australian HASS research. So if it seems this might be useful, please share amongst your networks!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Ten years of data! The files you&#39;re not allowed to see in the National Archives of Australia</title>
      <link>https://updates.timsherratt.org/2025/02/05/ten-years-of-data-the.html</link>
      <pubDate>Wed, 05 Feb 2025 13:26:30 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/02/05/ten-years-of-data-the.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve created a &lt;a href=&#34;https://doi.org/10.5281/zenodo.14769171&#34;&gt;new dataset&lt;/a&gt; containing 10 years of data that can be used to explore the workings of the National Archives of Australia&amp;rsquo;s access examination system.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/screenshot-from-2025-02-05-13-23-15.png&#34; width=&#34;600&#34; height=&#34;377&#34; alt=&#34;Screenshot of dataset record in Zenodo&#34;&gt;
&lt;p&gt;Australian government records become available for public access after 20 years. But before being opened to the public, records go through a process known as access examination to determine whether they should be withheld, either partially or completely. The grounds for exemption are laid out in the &lt;em&gt;Archives Act&lt;/em&gt; and include things like national security and personal privacy. If a record is completely withheld from access, the NAA&amp;rsquo;s database, RecordSearch, records its access status as &amp;lsquo;closed&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;On or about 1 January every year since 2016, I&amp;rsquo;ve harvested details of files in RecordSearch with the access status of &amp;lsquo;closed&amp;rsquo;. On the day when the media is full of revelations from the public release of the latest batch of cabinet records, I thought it was important to find out what we couldn&amp;rsquo;t see, as well as what we could. I&amp;rsquo;ve now published all the annual harvests as a &lt;a href=&#34;https://doi.org/10.5281/zenodo.14769171&#34;&gt;dataset on Zenodo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s important to note that records can be re-examined and their access status can change. Also, some &amp;lsquo;closed&amp;rsquo; files are actually &amp;lsquo;withheld pending advice&amp;rsquo; – in these cases a final access decision hasn&amp;rsquo;t been made, as the NAA has referred the files to their controlling agencies for advice. This means the dataset should be treated as providing annual snapshots of an active system, not a cumulative record of closed files. Some of the complexities of the access examination system revealed by this data are discussed in the &lt;em&gt;Inside Story&lt;/em&gt; article &lt;a href=&#34;https://insidestory.org.au/withheld-pending-advice/&#34;&gt;&amp;lsquo;Withheld pending advice&amp;rsquo;&lt;/a&gt;. I&amp;rsquo;m hoping to do some more analysis later this year.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A Community Data Lab (CDL) wishlist</title>
      <link>https://updates.timsherratt.org/2025/01/31/a-community-data-lab-cdl.html</link>
      <pubDate>Fri, 31 Jan 2025 12:40:52 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/01/31/a-community-data-lab-cdl.html</guid>
      <description>&lt;p&gt;The ARDC is holding &lt;a href=&#34;https://ardc.edu.au/event/help-shape-the-ardc-community-data-lab-for-hass-and-indigenous-research/&#34;&gt;an event on 18 February&lt;/a&gt; to begin shaping the next phase of the &lt;a href=&#34;https://ardc.edu.au/project/ardc-community-data-lab/&#34;&gt;Community Data Lab&lt;/a&gt;. If you&amp;rsquo;re interested in the development of digital tools and resources to support HASS research, I&amp;rsquo;d suggest you go along.&lt;/p&gt;
&lt;p&gt;I worked on the first phase of the Community Data Lab, developing the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt; amongst other things. I&amp;rsquo;m very keen to see the CDL expand, working with researchers to create new possibilities for digital research, particularly using the rich collections of the GLAM sector (galleries, libraries, archives, and museums). As planning gets underway for the next phase of the CDL, I thought I&amp;rsquo;d pull together some rough ideas about what the CDL might be and might do. The ARDC needs co-investment in its projects, so new initiatives have to have some form of institutional support to become part of the CDL. Nonetheless, I think it&amp;rsquo;s useful to continue to think more broadly about HASS research infrastructure needs and possibilities – both long-term requirements and short-term interventions.&lt;/p&gt;
&lt;h2 id=&#34;extending-co-design-throughout-the-life-of-the-cdl&#34;&gt;Extending co-design throughout the life of the CDL&lt;/h2&gt;
&lt;p&gt;Currently CDL co-design processes are focused on the initial design phase of a project and are structured around a plan developed from the institutional partners&#39; interests and priorities. It can be difficult to relate specific research tasks and needs to these larger-scale initiatives. How can co-design processes be embedded in a project&amp;rsquo;s ongoing development?&lt;/p&gt;
&lt;p&gt;One possibility is that CDL projects could have shortish development cycles with co-design processes before each cycle to refine the scope and priorities. This would increase opportunities for community participation and feedback, but the process would still be focused on the needs of the project rather than the needs of researchers.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;d like to see an alternative (or additional) component that operates like a CDL &amp;lsquo;help desk&amp;rsquo; or &amp;lsquo;triage&amp;rsquo; service. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a researcher outlines a specific problem or need&lt;/li&gt;
&lt;li&gt;if useful tools/datasets/documentation already exist, CDL project staff point the researcher towards them&lt;/li&gt;
&lt;li&gt;if there&amp;rsquo;s nothing to meet this specific need, some basic analysis is done to see where it might fit in the CDL universe and how much work might be required&lt;/li&gt;
&lt;li&gt;for small-scale solutions within current capacities, simply create the new resource (eg. a piece of documentation, a single notebook, a slice of an existing dataset)&lt;/li&gt;
&lt;li&gt;for larger-scale solutions, document the gap/need to feed into future CDL design processes&lt;/li&gt;
&lt;li&gt;add all of the information – problem, requirements, solution – to a searchable knowledgebase&lt;/li&gt;
&lt;li&gt;the knowledgebase should allow user comments/links and voting on unsolved problems to help identify future priorities&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Some of the advantages to this approach would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;direct engagement with the CDL whenever a researcher identifies a specific problem, not just when the CDL&amp;rsquo;s co-design timetable and framework allow&lt;/li&gt;
&lt;li&gt;recognises that many problems can be solved with existing tools and resources, both inside and outside of the CDL&lt;/li&gt;
&lt;li&gt;encourages direct researcher engagement through the knowledgebase, bringing in broader community perspectives and suggestions, as well as information about related projects and tools&lt;/li&gt;
&lt;li&gt;helps to put the &amp;lsquo;community&amp;rsquo; in Community Data Lab&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;connecting-things-up&#34;&gt;Connecting things up&lt;/h2&gt;
&lt;p&gt;There&amp;rsquo;s a lot of work to be done connecting up existing tools and data sources. This requires a combination of documentation and the development of small-scale tools or plugins to transform and move data as required. This sort of work fills in the gaps of existing research infrastructures, helping to ensure that best use is made of existing investments, and encouraging re-use and collaboration instead of duplication or reinvention. It could be integrated with the ‘help desk’ function described above.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://tdg.glam-workbench.net/pathways/index.html&#34;&gt;pathways/tutorials in the Trove Data Guide&lt;/a&gt; are all examples of this; indeed, the development of additional Data Guides around particular collections or collection types could help fill a lot of these gaps.&lt;/p&gt;
&lt;p&gt;Similarly, I think it’s important to put CDL projects within a broader context, so researchers are guided to solutions that best meet their needs.&lt;/p&gt;
&lt;p&gt;For example, I&amp;rsquo;m a big fan of Omeka. I gave an Omeka workshop at THATCamp Melbourne way back in 2011, I created a &lt;a href=&#34;https://wragge.github.io/omeka_s_tools/&#34;&gt;Python client for interacting with the Omeka-S API&lt;/a&gt;, and I&amp;rsquo;ve used Omeka in a number of projects. But just as not every blog needs WordPress, not every online collection needs Omeka. There are alternative pathways that don&amp;rsquo;t require the full technology stack and might better suit the needs of individual researchers. It&amp;rsquo;s important to keep these alternatives in mind and provide documentation that will enable researchers to make decisions about what&amp;rsquo;s best for their project.&lt;/p&gt;
&lt;p&gt;For example, if you want to share data (even relational databases) in a form that other users can search and explore, then Datasette might be a better option. Even within the Datasette ecosystem there are a variety of deployment possibilities. Datasette-Lite runs wholly within the browser; all you need is a GitHub repository (or similar) to point at. Most of the CSV files that I share through the GLAM Workbench have an option to explore using Datasette-Lite. All I do is create a url that points my existing Datasette-Lite repository at the CSV file. I&amp;rsquo;ve now created a &lt;a href=&#34;https://updates.timsherratt.org/2024/07/19/share-your-spreadsheet.html&#34;&gt;simple tool to help others create these links&lt;/a&gt;. You can even add full text indexes and embed images. Because Datasette-Lite uses static technologies, the maintenance overheads are minimised.&lt;/p&gt;
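&lt;p&gt;Building one of these links is just a matter of url-encoding the CSV&amp;rsquo;s address into a query string. A rough sketch (the base url below is illustrative, not a real deployment; Datasette-Lite also accepts parameters such as &lt;code&gt;url&lt;/code&gt; and &lt;code&gt;metadata&lt;/code&gt; for SQLite databases):&lt;/p&gt;

```python
from urllib.parse import urlencode

# Any Datasette-Lite deployment works the same way; this base url is
# a placeholder, not an actual GLAM Workbench address.
BASE = "https://example.github.io/datasette-lite/"

def csv_link(csv_url):
    """Build a link that loads a remote CSV into Datasette-Lite.

    Datasette-Lite fetches the file named in the `csv` query-string
    parameter and imports it into an in-browser SQLite database.
    """
    return BASE + "?" + urlencode({"csv": csv_url})

link = csv_link("https://example.com/data/my-dataset.csv")
```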
&lt;p&gt;If your dataset is too large or complex for Datasette-Lite, you can use a standard Datasette instance. It&amp;rsquo;s easy to scale. I have one Datasette instance that contains &lt;a href=&#34;https://updates.timsherratt.org/2024/08/26/more-datasets-added.html&#34;&gt;11 million rows of data, in 279 datasets, from 10 different GLAM organisations&lt;/a&gt; running on Google Cloud Run. The &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tasmanian-post-office-directories/&#34;&gt;Tasmanian Post Office Directories&lt;/a&gt; are also in Datasette, with an IIIF viewer embedded in a customised theme.&lt;/p&gt;
&lt;p&gt;Similarly, if a researcher&amp;rsquo;s main aim is to create a website that displays an annotated collection, then a static option, like &lt;a href=&#34;https://collectionbuilder.github.io/&#34;&gt;CollectionBuilder&lt;/a&gt;, might be all they need.&lt;/p&gt;
&lt;p&gt;Using static technologies was one of the patterns highlighted by the &lt;a href=&#34;https://doi.org/10.5281/zenodo.11169744&#34;&gt;CDL architecture and principles documents&lt;/a&gt;, and I think it would be good to get into a &amp;lsquo;static first&amp;rsquo; mindset, only implementing more complex solutions once the need had been established.&lt;/p&gt;
&lt;p&gt;That said, I also think that investment in the development of additional, openly licensed plugins, themes, resource templates, and vocabularies for use with Omeka-S would also be very useful and welcome.&lt;/p&gt;
&lt;h2 id=&#34;zotero-integration&#34;&gt;Zotero integration&lt;/h2&gt;
&lt;p&gt;An example of the small-scale interventions that can be made to mobilise existing data sources is the development of Zotero translators to extract metadata and images from GLAM collection databases. Using Zotero, researchers can collaborate in the creation and annotation of specialised datasets. Once they&amp;rsquo;ve saved a collection of resources to Zotero, they can use the public API to move the data into other tools for analysis or sharing. For example, at the moment I&amp;rsquo;m working on a collection of newspaper articles and extracts from Tasmanian Post Office Directories which have been saved and annotated in Zotero by &lt;a href=&#34;https://everydayheritage.au/blog/from-the-archive-uncovering-the-everyday-heritage-of-chinese-tasmanians/&#34;&gt;members of the EveryDay Heritage team&lt;/a&gt;. Using the API I&amp;rsquo;m downloading the data, sorting the annotations, and populating an Omeka-S instance with details of sources, people, and places.&lt;/p&gt;
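&lt;p&gt;The retrieval side of this workflow uses Zotero&amp;rsquo;s public web API (version 3), which returns a collection&amp;rsquo;s items in batches. A rough sketch using only the standard library (the group id and collection key below are placeholders, not a real library):&lt;/p&gt;

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_BASE = "https://api.zotero.org"

def items_url(library_type, library_id, collection_key, limit=100, start=0):
    """Build the web API (v3) url for the items in a Zotero collection.

    `library_type` is 'groups' for group libraries or 'users' for
    personal libraries.
    """
    query = urlencode({"format": "json", "limit": limit, "start": start})
    return (f"{API_BASE}/{library_type}/{library_id}"
            f"/collections/{collection_key}/items?{query}")

def fetch_items(library_type, library_id, collection_key, limit=100):
    """Page through a collection until a batch comes back short."""
    items, start = [], 0
    while True:
        req = Request(
            items_url(library_type, library_id, collection_key, limit, start),
            headers={"Zotero-API-Version": "3"},
        )
        with urlopen(req) as resp:
            items.extend(json.load(resp))
        if len(items) == start + limit:  # full batch, there may be more
            start += limit
        else:
            return items

# Placeholder identifiers for illustration only
url = items_url("groups", "1234567", "ABCD2345")
```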
&lt;p&gt;The problem is that support for Zotero by GLAM collection databases is patchy at best. I&amp;rsquo;ve &lt;a href=&#34;https://docs.google.com/spreadsheets/d/1Zb_e9ZazP4zs-K8ZcbaTCnv6cO_OgmUx4A7U4MFyOFE/edit?usp=sharing&#34;&gt;created a spreadsheet to summarise the current situation&lt;/a&gt;. Few GLAM institutions embed anything beyond Facebook-approved metadata, so translators (little bits of JavaScript code) are often necessary to feed rich data to Zotero. There are currently 7 custom translators available for GLAM collections including the National Archives of Australia, PROV, and Trove.&lt;/p&gt;
&lt;p&gt;Improving Zotero integration would open up new GLAM collections to digital research. Coordinated effort in the development of new translators, and best-practice documentation for supporting Zotero, would also help GLAM organisations understand what they can do to expand use of their collections.&lt;/p&gt;
&lt;p&gt;The CDL could play a coordinating role in this, rather than do all the work. For example, it could share updates, compile documentation, and perhaps organise a GLAM hack style event to create/update as many translators as possible. It would be a low-cost but high-impact initiative.&lt;/p&gt;
&lt;h2 id=&#34;collections-as-data&#34;&gt;Collections as data&lt;/h2&gt;
&lt;p&gt;The Australian GLAM landscape is littered with decommissioned APIs, dead open data portals, datasets created for some hack event that never get updated, and disappeared labs. But at the same time, there are &lt;a href=&#34;https://glam-workbench.net/glam-datasets-from-gov-portals/&#34;&gt;hundreds of open datasets&lt;/a&gt; shared by GLAM organisations through government data portals that are little acknowledged. In organisations with diminishing resources, shifting priorities, and internal jealousies, it’s difficult to maintain a persuasive argument around the importance of ‘collections as data’ for research. Perhaps the CDL can help.&lt;/p&gt;
&lt;p&gt;Ultimately, it would be great for the CDL itself to incorporate some sort of national, collaborative GLAM Lab, but I think we should start by helping to develop the ammunition that institutions need to argue for their involvement in such an initiative. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Documenting needs&lt;/strong&gt; – What research might happen if particular datasets or tools were available? This is the sort of thing that could be captured through the ‘help desk’. These needs would be shared and discussed, and fed into co-design sessions with GLAM organisations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Understanding use&lt;/strong&gt; – What research is being done using GLAM collections in Australia and elsewhere? What types of data? What types of tools? Obviously there will be examples from CDL projects &amp;amp; elsewhere.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developing skills&lt;/strong&gt; – Providing opportunities for GLAM staff to develop their own digital research skills. Where CDL projects make use of GLAM collections, invite organisations to nominate staff for active participation in the development processes (like internships).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Creating examples&lt;/strong&gt; – Identify possible GLAM data related developments within CDL projects and pursue small-scale collaborations, for example in relation to government data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Identifying opportunities&lt;/strong&gt; – Which GLAM organisations are in a position to participate now? In particular, identify data assets (like PROV API or datasets in government portals) that are already available but not well documented or understood.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Sharing solutions&lt;/strong&gt; – Many of the ‘connecting things up’ solutions are likely to involve moving or transforming GLAM data. Zotero integrations would benefit researchers and organisations.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Much of this would not be ‘extra’ work, but would involve the coordination and packaging of existing CDL activities – a GLAM window onto CDL activities. Though I suppose there would need to be a coordinating role.&lt;/p&gt;
&lt;p&gt;Obviously the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; would also have a role in this as a repository for tools and examples. It’d be great if the CDL could work with GLAM organisations to &lt;a href=&#34;https://glam-workbench.net/get-involved/developing-repositories/&#34;&gt;develop their own sections/repositories&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Files digitised by the National Archives of Australia in 2024</title>
      <link>https://updates.timsherratt.org/2025/01/27/files-digitised-by-the-national.html</link>
      <pubDate>Mon, 27 Jan 2025 14:16:09 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/01/27/files-digitised-by-the-national.html</guid>
      <description>&lt;p&gt;In 2024, the National Archives of Australia digitised 254,953 files (down from 416,602 in 2023). This chart shows the number of files digitised per day in 2024.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/visualization20.png&#34; width=&#34;600&#34; height=&#34;254&#34; alt=&#34;Line chart showing the number of files digitised by day in 2024.&#34;&gt;
&lt;p&gt;The decrease in the total number of files digitised is probably related to the completion of the &lt;a href=&#34;https://www.naa.gov.au/explore-collection/defence-and-war-service-records/digitising-second-world-war-service-records&#34;&gt;NAA&amp;rsquo;s five year project to digitise Second World War service records&lt;/a&gt;. Thanks to $10 million in government funding, the NAA has digitised more than a million service records since 2019. In 2023, 81% of records digitised were from series containing service records. This has dropped to around 40% in 2024. Here&amp;rsquo;s the total number of files digitised per year since February 2021.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/visualization21.png&#34; width=&#34;400&#34; height=&#34;290&#34; alt=&#34;Bar chart showing the number of files digitised by year from 2021 to 2024&#34;&gt;
&lt;p&gt;The files digitised in 2024 came from 1,439 different series. Here&amp;rsquo;s the top twenty series by number of items digitised in 2024. You&amp;rsquo;ll see that as well as war records, many photos, patents, and immigration records were digitised.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;series&lt;/th&gt;
&lt;th&gt;series_title&lt;/th&gt;
&lt;th&gt;total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A9301&lt;/td&gt;
&lt;td&gt;RAAF Personnel files of Non-Commissioned Officers (NCOs) and other ranks, 1921-1948&lt;/td&gt;
&lt;td&gt;56,635&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B883&lt;/td&gt;
&lt;td&gt;Second Australian Imperial Force Personnel Dossiers, 1939-1947&lt;/td&gt;
&lt;td&gt;39,163&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A6135&lt;/td&gt;
&lt;td&gt;Photographic colour transparencies positives, daily single number series with &amp;lsquo;K&amp;rsquo; [Colour Transparencies] prefix&lt;/td&gt;
&lt;td&gt;15,569&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A13150&lt;/td&gt;
&lt;td&gt;Specifications, examiners reports and correspondence relating to the Registration of Victorian Patents - Second system&lt;/td&gt;
&lt;td&gt;13,519&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A714&lt;/td&gt;
&lt;td&gt;Books of duplicate certificates of naturalization A(1)[Individual person] series&lt;/td&gt;
&lt;td&gt;13,384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A1500&lt;/td&gt;
&lt;td&gt;Photographic colour transparencies [positives], single number series with &amp;lsquo;K&amp;rsquo; [Colour] prefix&lt;/td&gt;
&lt;td&gt;11,533&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A7109&lt;/td&gt;
&lt;td&gt;&amp;ldquo;Dead&amp;rdquo; card index of Registered Aliens&lt;/td&gt;
&lt;td&gt;10,618&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A705&lt;/td&gt;
&lt;td&gt;Correspondence files, multiple number (Melbourne) series (Primary numbers 1-323)&lt;/td&gt;
&lt;td&gt;9,568&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A13149&lt;/td&gt;
&lt;td&gt;Applications for Registration of Victorian Patents - Second system&lt;/td&gt;
&lt;td&gt;8,843&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A9300&lt;/td&gt;
&lt;td&gt;RAAF Officers Personnel files, 1921-1948&lt;/td&gt;
&lt;td&gt;6,371&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D3481&lt;/td&gt;
&lt;td&gt;Photographs (black and white, colour) of buildings, installations, sites, etc&lt;/td&gt;
&lt;td&gt;5,054&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A11708&lt;/td&gt;
&lt;td&gt;Applications for Registration of Queensland Trade Marks&lt;/td&gt;
&lt;td&gt;4,159&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SP388/1&lt;/td&gt;
&lt;td&gt;Personal documents of British Assisted Migrants&lt;/td&gt;
&lt;td&gt;3,395&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D879&lt;/td&gt;
&lt;td&gt;Duplicates of negatives, annual single number series with D prefix (and progressive alpha infix A - K from 1948-1957)&lt;/td&gt;
&lt;td&gt;3,253&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B6286&lt;/td&gt;
&lt;td&gt;Telstra Historical Collection&lt;/td&gt;
&lt;td&gt;2,955&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D1423&lt;/td&gt;
&lt;td&gt;Original plans (negatives), single number series with alpha prefix denoting discipline&lt;/td&gt;
&lt;td&gt;2,616&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D1989&lt;/td&gt;
&lt;td&gt;Application  forms, medical examination documents and related papers of British and  Foreign Immigrants (including Ex Service) in receipt of free and  assisted passages, chronological order of ship arrival.&lt;/td&gt;
&lt;td&gt;2,403&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D1901&lt;/td&gt;
&lt;td&gt;Loveday Internment Camp internees files, single number series with variable alpha prefix&lt;/td&gt;
&lt;td&gt;2,305&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PP9/5&lt;/td&gt;
&lt;td&gt;Medical examination forms (form 47A) for non-British migrants, single number series&lt;/td&gt;
&lt;td&gt;2,250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A6510&lt;/td&gt;
&lt;td&gt;Classified prints of photographs relating mainly to Papua and New Guinea&lt;/td&gt;
&lt;td&gt;2,242&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This data was compiled from weekly harvests of RecordSearch&amp;rsquo;s list of recently digitised files. These weekly harvests are saved in the &lt;a href=&#34;https://github.com/wragge/naa-recently-digitised&#34;&gt;naa-recently-digitised GitHub repository&lt;/a&gt;. The harvesting method is &lt;a href=&#34;https://glam-workbench.net/recordsearch/#harvest-recently-digitised-files-from-recordsearch&#34;&gt;documented in the GLAM Workbench&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve also compiled these weekly harvests into &lt;a href=&#34;https://doi.org/10.5281/zenodo.14744050&#34;&gt;annual datasets published through Zenodo&lt;/a&gt;. &lt;a href=&#34;https://doi.org/10.5281/zenodo.14744050&#34;&gt;&lt;img src=&#34;https://zenodo.org/badge/DOI/10.5281/zenodo.14744050.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Changes to Trove newspapers in 2024</title>
      <link>https://updates.timsherratt.org/2025/01/17/changes-to-trove-newspapers-in.html</link>
      <pubDate>Fri, 17 Jan 2025 17:42:59 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2025/01/17/changes-to-trove-newspapers-in.html</guid>
      <description>&lt;p&gt;Every Sunday I harvest information about the number of digitised newspaper articles in Trove. You can view the current results in the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;Trove Data Dashboard&lt;/a&gt;. By compiling all the data from 2024, you can find out what changed last year.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;6,241,739 digitised newspaper articles were added to Trove in 2024&lt;/strong&gt;. The rate of digitisation was pretty quick until the end of March, when the processing of the Melbourne Sun ended; then things flattened out a bit.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/visualization13.png&#34; width=&#34;600&#34; height=&#34;313&#34; alt=&#34;Line chart showing the number of digitised newspaper articles in Trove every Sunday. The rate of change is most rapid between January and March.&#34;&gt;
&lt;p&gt;While the numbers of articles with corrections, tags, and comments all increased steadily across 2024, there seems to have been a bit of a glitch in the indexing of tags and comments, causing some jumps in the totals.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/visualization14.png&#34; width=&#34;600&#34; height=&#34;318&#34; alt=&#34;Line chart showing the number of digitised newspaper articles with tags  each Sunday in 2024. While the rate of change is generally smooth, there are a couple of notable jumps in May and June that probably indicate  indexing issues.&#34;&gt;
&lt;p&gt;Most of the digitised newspaper articles were published in NSW (3,190,972), Victoria (2,680,855), and South Australia (363,483).&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/visualization16.png&#34; width=&#34;600&#34; height=&#34;243&#34; alt=&#34;Bar chart showing the number of newspaper articles added by state.  Values are visible only for NSW, Victoria, South Australia, and  National. &#34;&gt;
&lt;p&gt;About 83% (5,211,532) of the digitised articles added to Trove in 2024 were published between 1930 and 1954. The publication years with the biggest changes were:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1953 (722,765 articles added)&lt;/li&gt;
&lt;li&gt;1954 (695,104 articles added)&lt;/li&gt;
&lt;li&gt;1952 (586,890 articles added)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;About 7% of articles (450,712) were published after 1954 (the copyright cliff of death). Most of these were from South Australia.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/visualization19.png&#34; width=&#34;600&#34; height=&#34;323&#34; alt=&#34;Bar chart showing the change in the number of articles in 2024 by publication year&#34;&gt;
&lt;p&gt;Thirty-eight newspapers were added to Trove in 2024:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1876&#34;&gt;Australijos Lietuvis = The Australian Lithuanian (SA : 1948 - 1956)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1914&#34;&gt;Belfast Gazette (Port Fairy, Vic. : 1876 - 1890)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1872&#34;&gt;Braidwood News and Goldfields General Advertiser (NSW : 1862)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1881&#34;&gt;Braidwood Observer and Miner&amp;rsquo;s Advocate (NSW : 1859 - 1862)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1887&#34;&gt;Broughton Creek Mail  (Berry, NSW : 1880 - 1881; 1891 - 1893; 1899 - 1907)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1920&#34;&gt;Brunswick and Coburg Courier (Moreland, Vic. 1933 - 1934)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1896&#34;&gt;Bunyip and Garfield Express : Nar-Nar-Goon, Tynong, Pakenham and Bunyip South representative (Vic. : 1938 - 1948)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1499&#34;&gt;Central Western Daily (Orange, NSW : 1945 - 1954)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1933&#34;&gt;Coburg and Moreland Courier (Vic. : 1932)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1868&#34;&gt;Coolamon-Ganmain Farmers&#39; Review (NSW : 1917 - 1942)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1921&#34;&gt;Coolamon-Ganmain Review (NSW : 1942 - 1947)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1932&#34;&gt;Hanho Tʻaimzŭ = Hanho Times (Sydney, NSW : 1985 - 1995)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1894&#34;&gt;Hill End and Tambaroora Times and Miners&#39; Advocate (NSW : 1871 - 1872; 1875)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1891&#34;&gt;Hills Messenger (Port Adelaide, SA : 1984 - 2011)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1882&#34;&gt;Moruya Examiner (NSW : 1881 - 1902)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1901&#34;&gt;Portland Gazette and Belfast Advertiser (Vic. : 1844 - 1849)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1885&#34;&gt;The Ararat and Mount Pleasant Creek Advertiser and Chronicle for the District of the Wimmera (Vic. : 1861 - 1873)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1922&#34;&gt;The Ardlethan-Beckom Times (Temora, NSW : 1924; 1936 - 1937)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1895&#34;&gt;The Banner of Belfast (Vic. : 1855; 1857 - 1876)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1890&#34;&gt;The Bega District News (NSW : 1923 - 1955)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1897&#34;&gt;The Belfast Gazette and Portland and Warnambool Advertiser (Vic. 1849 - 1876)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1889&#34;&gt;The Berry Register and Kangaroo Valley and South Coast Farmer (NSW : 1894; 1898 - 1905)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1880&#34;&gt;The Braidwood News and Southern Goldfields General Advertiser (NSW : 1864)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1888&#34;&gt;The Broughton Creek Register, and Kangaroo Valley and South Coast Farmer (Berry, NSW : 1886 - 1890)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1919&#34;&gt;The Brunswick and Coburg Gazette (Moonee Ponds, Vic. 1928 - 1933)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1893&#34;&gt;The Captain&amp;rsquo;s Flat Mining Record (NSW : 1898)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1934&#34;&gt;The Courier (Moreland, Vic. : 1932 - 1933)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1898&#34;&gt;The Footscray Advertiser (Vic.  : 1884 - 1887)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1886&#34;&gt;The Kangaroo Valley Times (NSW : 1898 - 1904)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1883&#34;&gt;The Mount Ararat Advertiser (Vic. : 1857)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1884&#34;&gt;The Mount Ararat Advertiser and Chronicle for the District of the Wimmera (Vic. : 1857 - 1861)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1913&#34;&gt;The Orroroo Enterprise and Great Northern Advertiser (SA : 1892 - 1906)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1899&#34;&gt;The Portland Mercury and Normanby Advertiser (Vic. : 1842 - 1843)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1900&#34;&gt;The Portland Mercury and Port Fairy Register (Vic. : 1843 - 1844)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1915&#34;&gt;The Queenslander Illustrated Weekly (Brisbane, Qld. : 1927 - 1939)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1875&#34;&gt;The Seasider (Christies Beach, SA : 1956 - 1963)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1879&#34;&gt;The South-East Kingston Leader (SA : 1962 - 1987)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1892&#34;&gt;The Sunny Corner Silver Press and Miners&#39; Advocate (Mitchell, NSW :  1886)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1902&#34;&gt;The Warrnambool Examiner and Western and Western District Advertiser (Vic. :  1856)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Newly digitised articles were added to 67 newspapers (including the new titles). The newspapers that had the largest increase in articles were:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The Sun News-Pictorial (Vic), 2,089,635 articles added&lt;/li&gt;
&lt;li&gt;Daily Mirror (NSW), 1,533,759 articles added&lt;/li&gt;
&lt;li&gt;Maitland Mercury (NSW), 658,331 articles added&lt;/li&gt;
&lt;li&gt;Central Western Daily (NSW), 253,575 articles added&lt;/li&gt;
&lt;li&gt;The Pastoral Times (NSW), 221,982 articles added&lt;/li&gt;
&lt;li&gt;Sunraysia Daily (Vic), 210,759 articles added&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2025/visualization17.png&#34; width=&#34;600&#34; height=&#34;1349&#34; alt=&#34;Bar chart showing the change in the number of articles by newspaper title, ordered by the total change. Most of the changes are small, with only 6 newspapers having more than 200,000 articles added.&#34;&gt;
&lt;p&gt;If you want to drill down into the changes, have a look at the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/archive/index.html&#34;&gt;Trove Data Dashboard archive&lt;/a&gt; that has weekly snapshots. Or get the raw data for each weekly harvest &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals&#34;&gt;from this repository&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>@trovenewsbot has a new home</title>
      <link>https://updates.timsherratt.org/2024/12/12/trovenewsbot-has-a.html</link>
      <pubDate>Thu, 12 Dec 2024 12:41:35 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/12/12/trovenewsbot-has-a.html</guid>
      <description>&lt;p&gt;&lt;strong&gt;@trovenewsbot&lt;/strong&gt; has been around for more than eleven years now – originally sharing &lt;a href=&#34;https://trove.nla.gov.au/&#34;&gt;Trove&lt;/a&gt; newspaper articles on Twitter, and now on the Fediverse. But with the imminent closure of the botsin.space Mastodon instance, I&amp;rsquo;ve had to find it a new home. Say hello to the latest version: &lt;a href=&#34;https://wraggebots.net/@trovenewsbot&#34;&gt;@trovenewsbot@wraggebots.net&lt;/a&gt;!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-12-12-12-36-16.png&#34; width=&#34;600&#34; height=&#34;486&#34; alt=&#34;Screenshot of @trovenewsbot&#39;s new profile on wraggebots.net&#34;&gt;
&lt;p&gt;Instead of just moving the bot to an existing instance, I decided to set up my own using &lt;a href=&#34;https://gotosocial.org/&#34;&gt;GoToSocial&lt;/a&gt;. I thought this would give me more control, and encourage me to resurrect some more of my old Twitter bots. I installed GoToSocial on the smallest available DigitalOcean droplet, following the &lt;a href=&#34;https://docs.gotosocial.org/en/latest/getting_started/installation/metal/&#34;&gt;&amp;lsquo;bare metal&amp;rsquo; instructions&lt;/a&gt;. Beyond the usual faffing around with permissions and DNS, I didn&amp;rsquo;t have any major problems. The &lt;a href=&#34;https://docs.gotosocial.org/en/latest/&#34;&gt;GoToSocial documentation&lt;/a&gt; is very comprehensive, and includes useful advice on things like &lt;a href=&#34;https://docs.gotosocial.org/en/latest/advanced/security/firewall/&#34;&gt;setting up a firewall&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;GoToSocial is a social network server based on the ActivityPub standard. It&amp;rsquo;s not the same as Mastodon, but because it supports the same standards, it interoperates &lt;em&gt;with&lt;/em&gt; Mastodon. For example, once I had a @trovenewsbot account set up on wraggebots.net, I was able to use the standard migrate functions to move all of the bot&amp;rsquo;s existing followers to the new instance. Easy peasy!&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://github.com/wragge/trovenewsbot-fedi&#34;&gt;bot&amp;rsquo;s code&lt;/a&gt; makes use of the &lt;a href=&#34;https://mastodonpy.readthedocs.io/en/stable/&#34;&gt;mastodon.py&lt;/a&gt; Python library, but again because of the similarities between the GoToSocial and Mastodon APIs, I was able to reuse the code with only minor changes. Specifically, I found that I had to add the parameter &lt;code&gt;version_check_mode = &amp;quot;none&amp;quot;&lt;/code&gt; to the Mastodon client initialisation. The &lt;code&gt;notifications_dismiss()&lt;/code&gt; method is not currently implemented in GoToSocial, so I also had to change the way I check for new notifications – saving the id of the last viewed notification, and then using it as the value for &lt;code&gt;since_id&lt;/code&gt; when requesting a list of current notifications.&lt;/p&gt;
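&lt;p&gt;The bookkeeping can be sketched like this – a hypothetical helper, not the bot&amp;rsquo;s actual code. It saves the newest notification id to a state file and reuses it as &lt;code&gt;since_id&lt;/code&gt; on the next poll:&lt;/p&gt;

```python
# Sketch: work around the missing notifications_dismiss() by tracking the
# last-seen notification id yourself. The fetch callable stands in for
# something like Mastodon.notifications(since_id=...) from mastodon.py.
# Function and file names here are illustrative only.
from pathlib import Path

STATE_FILE = Path("last_notification_id.txt")

def load_last_id():
    """Return the saved id of the last notification seen, or None."""
    return STATE_FILE.read_text().strip() if STATE_FILE.exists() else None

def save_last_id(notification_id):
    STATE_FILE.write_text(str(notification_id))

def new_notifications(fetch, last_id):
    """fetch(since_id=...) returns notifications newest-first."""
    notifications = fetch(since_id=last_id)
    if notifications:
        # the newest id becomes the marker for the next poll
        save_last_id(notifications[0]["id"])
    return notifications
```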
&lt;p&gt;&lt;strong&gt;@trovenewsbot&lt;/strong&gt; does a lot more than post random newspaper articles from Trove. By tooting keywords at it, you can search Trove from within the Fediverse. There are many options available for customising your queries, &lt;a href=&#34;https://wragge.github.io/trovenewsbot-fedi/&#34;&gt;see the documentation for full details&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Six more volumes added to the searchable database of Tasmanian Post Office Directories!</title>
      <link>https://updates.timsherratt.org/2024/11/21/six-more-volumes.html</link>
      <pubDate>Thu, 21 Nov 2024 16:05:37 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/11/21/six-more-volumes.html</guid>
      <description>&lt;p&gt;A couple of months ago I realised my &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tasmanian-post-office-directories/&#34;&gt;big, searchable database of Tasmanian Post Office Directories&lt;/a&gt; was missing the volume from 1920. It took a bit of work to add it in, as &lt;a href=&#34;https://updates.timsherratt.org/2024/09/26/wheres-missing-volume.html&#34;&gt;described in this post&lt;/a&gt;. Unfortunately, I&amp;rsquo;d barely finished when I realised that a number of other years were also missing! Argh! The good news is that I&amp;rsquo;ve been steadily working through these missing volumes, adding one a week, and now I&amp;rsquo;m finally, finally finished!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/autas001126438076-1945-46-5.jpg&#34; width=&#34;600&#34; height=&#34;962&#34; alt=&#34;Page from Wise&#39;s Tasmania Post Office Directory 1945-46&#34;&gt;
&lt;p&gt;The new volumes are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1920&lt;/li&gt;
&lt;li&gt;1933-34&lt;/li&gt;
&lt;li&gt;1941-42&lt;/li&gt;
&lt;li&gt;1942-43&lt;/li&gt;
&lt;li&gt;1943-44&lt;/li&gt;
&lt;li&gt;1945-46&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In total there are now 54 volumes from 1890 to 1948. Every line of every volume has been OCRd and indexed, so you can run fulltext searches across all 54 volumes to find matching entries. The fulltext search also supports advanced operators like wildcards and booleans.&lt;/p&gt;
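&lt;p&gt;To give a flavour of those operators, here&amp;rsquo;s a tiny SQLite FTS5 example – the schema and data are illustrative only, not the database&amp;rsquo;s actual structure:&lt;/p&gt;

```python
# Minimal FTS5 demo: prefix wildcards (term*) and boolean operators
# (AND, OR, NOT) in full-text queries across indexed directory lines.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE entries USING fts5(volume, text)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [
        ("1920", "Chung Gon, fruiterer, Launceston"),
        ("1933-34", "Chung Gon, grocer, Hobart"),
        ("1941-42", "Smith John, carpenter, Launceston"),
    ],
)

def search(query):
    """Run an FTS5 MATCH query and return matching (volume, text) rows."""
    return conn.execute(
        "SELECT volume, text FROM entries WHERE entries MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()

# 'chung*' matches any term starting with 'chung'; AND narrows the results
rows = search('chung* AND launceston')
```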
&lt;p&gt;As &lt;a href=&#34;https://updates.timsherratt.org/2024/09/26/wheres-missing-volume.html&#34;&gt;I mentioned in relation to 1920&lt;/a&gt;, while these volumes can be downloaded as PDFs from Libraries Tasmania, they don&amp;rsquo;t contain any OCRd text – they&amp;rsquo;re not searchable (despite what Libraries Tasmania &lt;a href=&#34;https://libraries.tas.gov.au/family-history/where-lived/what-is-online/&#34;&gt;says here&lt;/a&gt;). The quality of the scans is also quite variable – tight bindings cut off text, pages are skewed, and lighting is inconsistent. This means that the OCR processing is far from perfect. There will be names missing from the search index as a result of this. However, because you can search across all volumes at once, the database makes it easier to find people, as you can pick them up in one year and follow them through subsequent volumes, filling in any gaps.&lt;/p&gt;
&lt;p&gt;It would be great if Libraries Tasmania would add a link to the database from their &lt;a href=&#34;https://libraries.tas.gov.au/slat/guides-to-records/directories-and-almanacs/introduction/&#34;&gt;Directories and almanacs&lt;/a&gt; page. I&amp;rsquo;ve sent a couple of emails but haven&amp;rsquo;t received a reply. It seems odd that they&amp;rsquo;d link to commercial offerings like FindMyPast, but not to the free, community-developed version!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Where&#39;s 1920? Missing volume added to Tasmanian Post Office Directories!</title>
      <link>https://updates.timsherratt.org/2024/09/26/wheres-missing-volume.html</link>
      <pubDate>Thu, 26 Sep 2024 13:58:54 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/09/26/wheres-missing-volume.html</guid>
      <description>&lt;p&gt;Visualisation is a great way to find problems in your data.&lt;/p&gt;
&lt;p&gt;As part of the &lt;a href=&#34;https://everydayheritage.au/&#34;&gt;Everyday Heritage project&lt;/a&gt;, I&amp;rsquo;m working with a team to document the lives of Tasmania&amp;rsquo;s Chinese residents in the 19th and early 20th centuries. We&amp;rsquo;re &lt;a href=&#34;https://everydayheritage.au/blog/from-the-archive-uncovering-the-everyday-heritage-of-chinese-tasmanians/&#34;&gt;using a variety of sources&lt;/a&gt; such as Trove&amp;rsquo;s newspapers, the Tasmanian Names Index, and the Tasmanian Post Office Directories. To help with the research, I &lt;a href=&#34;https://updates.timsherratt.org/2022/09/15/from-pdfs-to.html&#34;&gt;converted all the PDF volumes of the Post Office Directories into a public, online, searchable database&lt;/a&gt;. Or at least, I thought I had.&lt;/p&gt;
&lt;p&gt;The Tasmanian Post Office Directories database embeds metadata about each line of text in its results, so it&amp;rsquo;s easy to save items of interest using Zotero. A member of our team has already saved hundreds of entries this way. The other day I started pulling these entries out of Zotero using its web API, and thought I&amp;rsquo;d get an overview by charting the number of results per year. It was then I noticed that 1920 was missing&amp;hellip;&lt;/p&gt;
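&lt;p&gt;The gap-spotting step is easy to sketch – count entries per year and look for empty years. This is a hypothetical helper, not the actual harvesting code:&lt;/p&gt;

```python
# Count saved entries per year; any year with zero entries between the
# earliest and latest stands out immediately (as 1920 did here).
from collections import Counter

def missing_years(years):
    """Return the years with no entries between the earliest and latest seen."""
    counts = Counter(years)
    return [y for y in range(min(counts), max(counts) + 1) if counts[y] == 0]
```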
&lt;p&gt;I checked &lt;a href=&#34;https://stors.tas.gov.au/ILS/SD_ILS-981598&#34;&gt;the PDF volumes in Libraries Tasmania&lt;/a&gt; and the 1920 volume was there, so I worked back through my processing code to figure out why I&amp;rsquo;d missed it. It turns out the 1920 volume is named using a different pattern, and the regular expression I used to scrape the list of volumes was a little too specific. At least that was easy to rectify.&lt;/p&gt;
&lt;p&gt;However, it wasn&amp;rsquo;t just a matter of feeding the 1920 volume through my &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/&#34;&gt;processing notebooks&lt;/a&gt;, because the &lt;em&gt;content&lt;/em&gt; of the 1920 PDF was also quite different to all the other volumes. Most of the PDFs made available from Libraries Tasmania have a single page per image, and the images have been pre-processed for OCR. The PDFs also include searchable, OCRd text. Here&amp;rsquo;s an example of one of the images from 1921:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-09-26-12-27-41.png&#34; width=&#34;600&#34; height=&#34;947&#34; alt=&#34;&#34;&gt;
&lt;p&gt;The images in the 1920 volume are double page spreads, in colour, without any pre-processing. The PDF doesn&amp;rsquo;t include any OCRd text, so it&amp;rsquo;s not searchable. The quality of the images is also quite variable: tight bindings mean some text is cut off, pages are sometimes skewed, and bad lighting causes shadows across the right hand page – when converted to black and white for OCR, these shadows become black blobs that completely obscure the text. Here&amp;rsquo;s an example of one of the images from 1920:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-09-26-12-23-52.png&#34; width=&#34;600&#34; height=&#34;460&#34; alt=&#34;&#34;&gt;
&lt;p&gt;All this meant I had to do a lot of additional processing of the images before I could extract useful text via OCR. Here&amp;rsquo;s a summary of the image pre-processing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;extracted all the images from the PDF&lt;/li&gt;
&lt;li&gt;sliced all the images roughly in half and saved the results as separate files, so I had one image per page&lt;/li&gt;
&lt;li&gt;manually cropped all 800 pages to get them as clean as possible&lt;/li&gt;
&lt;li&gt;removed the shadows from the images thanks to &lt;a href=&#34;https://stackoverflow.com/questions/44752240/how-to-remove-shadow-from-scanned-images-using-opencv&#34;&gt;this recipe in Stack Overflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;binarized and deskewed the images ready for OCR&lt;/li&gt;
&lt;/ul&gt;
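&lt;p&gt;The slicing step can be sketched as computing two crop boxes from the spread&amp;rsquo;s dimensions – boxes that could then be passed to something like Pillow&amp;rsquo;s &lt;code&gt;Image.crop()&lt;/code&gt;. This is a hypothetical sketch, not the actual processing code; the overlap keeps text near the gutter in both halves:&lt;/p&gt;

```python
# Compute left/right crop boxes for splitting a double-page spread into
# two single-page images. Boxes are (x0, y0, x1, y1) pixel tuples.
def spread_to_pages(width, height, overlap=40):
    """Return (left_box, right_box) for a spread of the given size."""
    mid = width // 2
    left = (0, 0, mid + overlap, height)
    right = (mid - overlap, 0, width, height)
    return left, right
```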
&lt;p&gt;From then on I could apply the processes in my &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/&#34;&gt;existing notebooks&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;OCRd the images using Tesseract&lt;/li&gt;
&lt;li&gt;uploaded the cropped, colour images to the AWS bucket for delivery using &lt;a href=&#34;https://github.com/samvera/serverless-iiif&#34;&gt;serverless-IIIF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;added the volume metadata and OCRd text to the &lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; database, generating a full-text index on the text column&lt;/li&gt;
&lt;li&gt;updated the application on Google Cloud Run&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also took the chance to tweak the theme a bit, including a new dark mode.&lt;/p&gt;
&lt;p&gt;The updated database &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tasmanian-post-office-directories/&#34;&gt;is live&lt;/a&gt;, now containing &lt;strong&gt;49 volumes&lt;/strong&gt; from 1890 to 1948 including 1920!&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tasmanian-post-office-directories/&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tasmanian-post-office-directories/&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-09-26-12-40-57.png&#34; width=&#34;600&#34; height=&#34;574&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;update-27-september-2024&#34;&gt;UPDATE: 27 September 2024&lt;/h2&gt;
&lt;p&gt;It seems I was too focused on the gap in 1920 and missed some other missing volumes from the 1930s and 40s. I&amp;rsquo;ve started processing these but it&amp;rsquo;s going to take a fair while to work through them all. I&amp;rsquo;ll add each volume to the database as it&amp;rsquo;s finished. &lt;a href=&#34;https://updates.timsherratt.org/&#34;&gt;Check here&lt;/a&gt; for regular updates.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Major update for the Trove Newspapers section of the GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2024/09/23/major-update-for.html</link>
      <pubDate>Mon, 23 Sep 2024 12:15:42 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/09/23/major-update-for.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove newspapers&lt;/a&gt; section of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; was updated last week. Over the last year I&amp;rsquo;ve been gradually updating notebooks to use &lt;a href=&#34;https://glam-workbench.net/trove-api-v3/&#34;&gt;version 3 of the Trove API&lt;/a&gt;, but when version 2 suddenly disappeared a couple of weeks ago I had to hurriedly pull everything together. The Trove newspapers section includes 23 notebooks and 6 datasets, so it&amp;rsquo;s not a small job. The changes include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;updated all notebooks to use version 3 of the Trove API&lt;/li&gt;
&lt;li&gt;removed remaining datasets from the &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers&#34;&gt;code repository&lt;/a&gt; and created dedicated data repositories for them, integrating them with Zenodo where appropriate&lt;/li&gt;
&lt;li&gt;added metadata to all the notebooks – this is used to build an &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers/blob/master/ro-crate-metadata.json&#34;&gt;RO-Crate metadata file&lt;/a&gt; for the code repository&lt;/li&gt;
&lt;li&gt;updated all the Python packages&lt;/li&gt;
&lt;li&gt;added a &lt;code&gt;voila.json&lt;/code&gt; file to configure Voilà&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of the functionality of the notebooks should have changed. There&amp;rsquo;s a slight difference in the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/find-non-english-newspapers/&#34;&gt;Finding non-English newspapers in Trove&lt;/a&gt; notebook because the language detection library I was using is no longer maintained. I&amp;rsquo;ve swapped in &lt;a href=&#34;https://pypi.org/project/py3langid/&#34;&gt;py3langid&lt;/a&gt; and it seems to work well, though the results are a little different. Interestingly, where the previous library thought that bad OCR was &amp;lsquo;Maltese&amp;rsquo;, the new one detects it as &amp;lsquo;Latin&amp;rsquo;! There&amp;rsquo;s no change to the list of newspapers with non-English language content detected by the notebook.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-09-23-11-09-14.png&#34; width=&#34;600&#34; height=&#34;647&#34; alt=&#34;Screenshot of documentation page for notebook showing the embedded preview&#34;&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;documentation pages&lt;/a&gt; have also been updated. The notebook pages are now built using data from the code repository&amp;rsquo;s RO-Crate file. They also include embedded HTML previews of the notebooks. If a notebook generates visualisations, the visualisations are usually included in the HTML, so you can explore the outputs without running the notebook – see, for example, the charts in &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/visualise-total-newspaper-articles-by-state-year/&#34;&gt;Visualise the total number of newspaper articles in Trove by year and state&lt;/a&gt;. Most of the dataset pages now include links to explore the contents &lt;a href=&#34;https://updates.timsherratt.org/2024/07/19/share-your-spreadsheet.html&#34;&gt;using Datasette-Lite&lt;/a&gt;.&lt;/p&gt;
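&lt;p&gt;Reading notebook metadata out of an RO-Crate file is straightforward, because RO-Crate metadata is JSON-LD with all the entities in a flat &lt;code&gt;@graph&lt;/code&gt; list. This is a rough sketch only – &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; are standard schema.org terms, but the repository&amp;rsquo;s actual crate and build code may differ:&lt;/p&gt;

```python
# Pull (name, description) pairs for notebook entities out of an
# RO-Crate metadata file. Entities live in the flattened "@graph" list;
# here notebooks are identified by their .ipynb file extension.
import json

def notebook_entities(crate_path):
    with open(crate_path) as f:
        graph = json.load(f)["@graph"]
    return [
        (e.get("name"), e.get("description"))
        for e in graph
        if e.get("@id", "").endswith(".ipynb")
    ]
```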
&lt;p&gt;I still have to generate RO-Crate files for all the data repositories, but I wanted to get the code stuff finished first.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Preserving the history of online collections (my love letter to future historians)</title>
      <link>https://updates.timsherratt.org/2024/09/20/preserving-the-history.html</link>
      <pubDate>Fri, 20 Sep 2024 16:18:34 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/09/20/preserving-the-history.html</guid>
      <description>&lt;p&gt;It&amp;rsquo;s pretty obvious that access to digitised resources, like Trove&amp;rsquo;s newspapers, has changed the practice of history in Australia. But how? I&amp;rsquo;m certain that the historiographical implications of the growth and development of online collections will become a topic of increasing interest to historians, and that exploration of this topic will lead to important insights into the relationship between what we keep, what we value, and what we know. But for this to happen we need to have data documenting changes in online collections. What became available when? How was it delivered to users? How did the search indexing work?&lt;/p&gt;
&lt;p&gt;In general, GLAM collection interfaces exist in an eternal present – they&amp;rsquo;re not good at explaining changes, or communicating their own histories. Australian GLAM organisations also share little statistical information. If you&amp;rsquo;re lucky, you might get something useful out of annual reports, but that&amp;rsquo;s about it. Trove, in fact, removed all their online collection statistics in the 2020 interface update. Web archives capture individual pages, but not complete systems. If we don&amp;rsquo;t document the shape and structure of online collections now, how will future historians understand their impact?&lt;/p&gt;
&lt;p&gt;A couple of years ago I gave a short talk entitled &amp;lsquo;Living archaeologies of online collections&amp;rsquo; &lt;a href=&#34;https://www.dpconline.org/events/past-events/event-watch-party-comp-access-launch&#34;&gt;for the Digital Preservation Coalition&lt;/a&gt; (&lt;a href=&#34;https://youtu.be/HHdYi4LjDJk&#34;&gt;video&lt;/a&gt; &amp;amp; &lt;a href=&#34;https://slides.com/wragge/dpc-living-archaeologies&#34;&gt;slides&lt;/a&gt;).&lt;/p&gt;
&lt;iframe width=&#34;100%&#34; height=&#34;400&#34; src=&#34;https://www.youtube.com/embed/HHdYi4LjDJk?si=cTtKF8MWclnozJDQ&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;In the talk, I described some of my piecemeal and inconsistent attempts to capture this history – starting &lt;a href=&#34;https://timsherratt.org/shed/trove/graphs/&#34;&gt;back in 2011&lt;/a&gt; when I first harvested data from Trove about the number of digitised newspaper articles. I&amp;rsquo;ve been continuing to create and update datasets, and have been trying to improve the way they&amp;rsquo;re organised and described – making them more FAIR – but I&amp;rsquo;ve still got a long way to go. At the moment there&amp;rsquo;s information spread across the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;, GitHub repositories, and &lt;a href=&#34;https://zenodo-rdm.web.cern.ch/communities/trove-historical-data/records&#34;&gt;Zenodo records&lt;/a&gt;, so I thought it might be useful if I pulled everything together into one big list. Hence this post. I often forget about what I&amp;rsquo;ve done in the past, so it&amp;rsquo;ll help me keep track of where the gaps are and what&amp;rsquo;s left to do. And hopefully it&amp;rsquo;ll encourage others to think about the significance and possibilities of this data, and perhaps share their own datasets.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-09-20-15-12-52.png&#34; width=&#34;600&#34; height=&#34;745&#34; alt=&#34;Screenshot of some of the datasets in the Trove Historical Data community in Zenodo&#34;&gt;
&lt;p&gt;&lt;em&gt;There&amp;rsquo;s also a growing list of datasets in the &lt;a href=&#34;https://zenodo.org/communities/trove-historical-data/records&#34;&gt;Trove Historical Data&lt;/a&gt; community on Zenodo&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I suppose I could also add the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt; to this list. It&amp;rsquo;s not a dataset, but it is a snapshot of the current state of Trove. I&amp;rsquo;ll continue to update it, and these updates will themselves be saved as versions in GitHub and Zenodo, allowing future researchers to dig back through the layers.&lt;/p&gt;
&lt;p&gt;A lot of what I do is focused on the present – building tools and resources that help researchers make use of GLAM collections right now. Those tools and resources will eventually decay as I shuffle off, as collections evolve, and as technologies change. But I&amp;rsquo;m hoping that these datasets will grow in value over time. I think it was &lt;a href=&#34;https://en.wikipedia.org/wiki/Jason_Scott&#34;&gt;Jason Scott&lt;/a&gt; who coined the phrase &amp;lsquo;metadata is a love letter to the future&amp;rsquo;. I suppose this is my love letter to future historians.&lt;/p&gt;
&lt;h2 id=&#34;1-trove-zones-categories-and-formats&#34;&gt;1. Trove zones, categories, and formats&lt;/h2&gt;
&lt;h3 id=&#34;trove-zone-totals&#34;&gt;trove-zone-totals&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/wragge/trove-zone-totals&#34;&gt;https://github.com/wragge/trove-zone-totals&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This repository contains an automated git scraper that uses the Trove API to save data about the contents of Trove&amp;rsquo;s zones and categories. It runs every week and updates the following data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-zone-totals/blob/main/data/trove-category-totals.csv&#34;&gt;Total number of resources by Trove category&lt;/a&gt;, weekly updates, 13 June 2023 to present&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-zone-totals/blob/main/data/trove-category-formats.csv&#34;&gt;Total number of resources by Trove category and format&lt;/a&gt;, weekly updates, 13 June 2023 to present&lt;/li&gt;
&lt;/ul&gt;
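&lt;p&gt;The core of each scraper run is simple: fetch the current totals, append one dated row per category to the CSV, and let the scheduled job commit the change. A minimal sketch – the totals are passed in rather than requested from the Trove API, and the column layout is illustrative, not the repository&amp;rsquo;s actual schema:&lt;/p&gt;

```python
# Append one dated row per category to a running CSV of totals.
# A scheduled job (e.g. GitHub Actions) would then commit the file,
# preserving each weekly snapshot in the git history.
import csv
from datetime import date

def append_totals(csv_path, totals, harvested=None):
    """totals: mapping of category name -> record count."""
    harvested = harvested or date.today().isoformat()
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f)
        for category, count in sorted(totals.items()):
            writer.writerow([harvested, category, count])
```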
&lt;p&gt;In the web interface, &amp;lsquo;zones&amp;rsquo; were replaced by &amp;lsquo;categories&amp;rsquo; in 2020. However, categories were not available through the Trove API until the release of version 3 in June 2023. To try and document the differences between zones and categories, totals from both were captured until version 2 of the API was switched off in September 2024.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-zone-totals/blob/main/data/trove-zone-totals.csv&#34;&gt;Total number of resources by Trove zone&lt;/a&gt;, weekly updates, 9 March 2023 to 1 September 2024&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-zone-totals/blob/main/data/trove-zone-formats.csv&#34;&gt;Total number of resources by Trove zone and format&lt;/a&gt;, weekly updates, 9 March 2023 to 1 September 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;2-trove-newspapers&#34;&gt;2. Trove newspapers&lt;/h2&gt;
&lt;h3 id=&#34;trove-newspaper-totals-historical&#34;&gt;trove-newspaper-totals-historical&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals-historical/&#34;&gt;https://github.com/wragge/trove-newspaper-totals-historical/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.6471544&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.6471544.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The files in this dataset were created at irregular intervals between 2011 and 2022 for use in visualising Trove&amp;rsquo;s newspaper corpus. Harvests from 2011 were screen scraped from the Trove website. Harvests after 2012 make use of the &lt;code&gt;year&lt;/code&gt; and &lt;code&gt;state&lt;/code&gt; facets from the Trove API. There are 9 versions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;12 April 2011&lt;/li&gt;
&lt;li&gt;4 August 2011&lt;/li&gt;
&lt;li&gt;12 September 2014&lt;/li&gt;
&lt;li&gt;29 November 2015&lt;/li&gt;
&lt;li&gt;14 December 2016&lt;/li&gt;
&lt;li&gt;28 July 2019&lt;/li&gt;
&lt;li&gt;10 July 2020&lt;/li&gt;
&lt;li&gt;27 April 2021&lt;/li&gt;
&lt;li&gt;21 January 2022&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each version includes two data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Total number of newspaper articles by year&lt;/li&gt;
&lt;li&gt;Total number of newspaper articles by year and state&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;trove-newspaper-totals&#34;&gt;trove-newspaper-totals&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals&#34;&gt;https://github.com/wragge/trove-newspaper-totals&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This repository contains an automated git scraper that uses the Trove API to save information about the number of digitised newspaper articles currently available through Trove. It runs every week and updates four data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_year.csv&#34;&gt;Total number of newspaper articles by year&lt;/a&gt;, weekly updates, 19 April 2022 to present&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_year_and_state.csv&#34;&gt;Total number of newspaper articles by year and state&lt;/a&gt;, weekly updates, 19 April 2022 to present&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_newspaper.csv&#34;&gt;Total number of articles by newspaper&lt;/a&gt;, weekly updates, 20 April 2022 to present&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_category.csv&#34;&gt;Total number of newspaper articles by category&lt;/a&gt;, weekly updates, 20 April 2022 to present&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By retrieving all versions of these files from the commit history, you can analyse changes in Trove over time.&lt;/p&gt;
&lt;p&gt;A weekly summary of the harvested data is presented in the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;Trove Newspapers Data Dashboard&lt;/a&gt;.&lt;/p&gt;
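&lt;p&gt;To analyse change over time you can walk the repository&amp;rsquo;s commit history and read each saved version of a data file. Here&amp;rsquo;s a minimal Python sketch using the standard library&amp;rsquo;s &lt;code&gt;subprocess&lt;/code&gt; module – the repository path and file name are placeholders, and it assumes the &lt;code&gt;git&lt;/code&gt; command is available locally:&lt;/p&gt;

```python
import subprocess

def parse_versions(log_output):
    """Parse `git log --format='%H %cs'` output into (commit, date) pairs."""
    return [tuple(line.split(" ", 1)) for line in log_output.splitlines()]

def file_versions(repo_dir, path):
    """List every commit (newest first) that changed `path` in the repo."""
    out = subprocess.run(
        ["git", "log", "--format=%H %cs", "--", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return parse_versions(out)

def file_at_commit(repo_dir, commit, path):
    """Return the contents of `path` as recorded at `commit`."""
    return subprocess.run(
        ["git", "show", f"{commit}:{path}"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout

# Example (assumes a local clone of the trove-newspaper-totals repo):
# for commit, date in file_versions("trove-newspaper-totals", "data/total_articles_by_year.csv"):
#     csv_text = file_at_commit("trove-newspaper-totals", commit, "data/total_articles_by_year.csv")
```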
&lt;h3 id=&#34;trove-newspaper-titles-web-archives&#34;&gt;trove-newspaper-titles-web-archives&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspaper-titles-web-archives&#34;&gt;https://github.com/GLAM-Workbench/trove-newspaper-titles-web-archives&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.13761732&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.13761732.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;These datasets were created by harvesting information about newspaper titles in Trove from web archives. The harvesting method is documented in &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/historical-data-newspaper-titles/&#34;&gt;Gathering historical data about the addition of newspaper titles to Trove&lt;/a&gt; in the GLAM Workbench.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspaper-titles-web-archives/blob/main/trove_newspaper_titles_2009_2021.csv&#34;&gt;All web archive captures of Trove newspaper titles, 2009 to 2021&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspaper-titles-web-archives/blob/main/trove_newspaper_titles_first_appearance_2009_2021.csv&#34;&gt;First appearance of each newspaper title in web archive captures, 2009 to 2021&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspaper-titles-web-archives/blob/v1.2/titles_list.md&#34;&gt;Alphabetical list of newspaper titles showing approximately when they first appeared in Trove, 2009 to 2021&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;trove-newspapers-corrections&#34;&gt;trove-newspapers-corrections&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-corrections&#34;&gt;https://github.com/GLAM-Workbench/trove-newspapers-corrections&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.13761546&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.13761546.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;OCR errors in Trove&amp;rsquo;s digitised newspapers can be corrected by users. To help understand patterns in newspaper correction, this dataset records the number of articles with corrections. The data was extracted from the Trove API using &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Analysing_OCR_corrections/&#34;&gt;this notebook&lt;/a&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-corrections/blob/main/corrections_by_title.csv&#34;&gt;Number of corrections by newspaper&lt;/a&gt;, 6 versions:
&lt;ul&gt;
&lt;li&gt;13 August 2019&lt;/li&gt;
&lt;li&gt;10 July 2020&lt;/li&gt;
&lt;li&gt;27 April 2021&lt;/li&gt;
&lt;li&gt;21 January 2022&lt;/li&gt;
&lt;li&gt;24 June 2022&lt;/li&gt;
&lt;li&gt;14 September 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-corrections/blob/main/corrections_by_year.csv&#34;&gt;Number of corrections by year&lt;/a&gt;, 2 versions:
&lt;ul&gt;
&lt;li&gt;24 June 2024&lt;/li&gt;
&lt;li&gt;14 September 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-corrections/blob/main/corrections_by_category.csv&#34;&gt;Number of corrections by category&lt;/a&gt;, 2 versions:
&lt;ul&gt;
&lt;li&gt;24 June 2024&lt;/li&gt;
&lt;li&gt;14 September 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;trove-newspapers-data-post-54&#34;&gt;trove-newspapers-data-post-54&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-data-post-54/&#34;&gt;https://github.com/GLAM-Workbench/trove-newspapers-data-post-54/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.13761534&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.13761534.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Due to copyright restrictions, most of the digitised newspaper articles on Trove were published before 1955. However, some articles published after 1954 have been made available. This repository provides data about digitised newspapers in Trove that have articles available from after 1954 (the &amp;lsquo;copyright cliff of death&amp;rsquo;). The data was extracted from the Trove API using &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Beyond_the_copyright_cliff_of_death/&#34;&gt;this notebook&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There are 8 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;7 June 2019&lt;/li&gt;
&lt;li&gt;12 August 2019&lt;/li&gt;
&lt;li&gt;10 July 2020&lt;/li&gt;
&lt;li&gt;11 November 2020&lt;/li&gt;
&lt;li&gt;27 April 2021&lt;/li&gt;
&lt;li&gt;21 January 2022&lt;/li&gt;
&lt;li&gt;27 June 2024&lt;/li&gt;
&lt;li&gt;14 September 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;trove-newspapers-non-english&#34;&gt;trove-newspapers-non-english&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-non-english&#34;&gt;https://github.com/GLAM-Workbench/trove-newspapers-non-english&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.13761509&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.13761509.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset contains information about newspapers published in languages other than English that have been digitised and made available through Trove. Data about the languages present in newspapers was generated by harvesting a sample of articles from each newspaper using the Trove API, and then using language detection software on the OCRd text of each article. The method is documented in &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/find-non-english-newspapers/&#34;&gt;this notebook&lt;/a&gt; in the GLAM Workbench.&lt;/p&gt;
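&lt;p&gt;The aggregation step can be sketched in Python. This is an illustration only – the per-article language codes would come from a detector such as the third-party &lt;code&gt;langdetect&lt;/code&gt; package (an assumption on my part; the notebook documents the actual method used):&lt;/p&gt;

```python
from collections import Counter

def summarise_languages(article_langs, min_share=0.1):
    """Aggregate per-article ISO 639-1 codes into a newspaper's main languages.

    `article_langs` is a list of detected language codes, one per sampled
    article. Returns (language, share) pairs for languages detected in at
    least `min_share` of the sample, most common first.
    """
    counts = Counter(article_langs)
    total = sum(counts.values())
    return [
        (lang, round(n / total, 2))
        for lang, n in counts.most_common()
        if n / total >= min_share
    ]
```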
&lt;p&gt;There are two data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-non-english/blob/main/newspapers_non_english.csv&#34;&gt;CSV data file of the main languages detected for each newspaper with non-English language content&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-non-english/blob/main/non-english-newspapers.md&#34;&gt;A markdown formatted list of all newspapers found with non-English language content&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are two versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;9 July 2024&lt;/li&gt;
&lt;li&gt;14 September 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;trove-newspaper-issues&#34;&gt;trove-newspaper-issues&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.13761491&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.13761491.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset contains information about the published issues of newspapers digitised and made available through Trove. The data was harvested from the Trove API, using &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/harvest_newspaper_issues/&#34;&gt;this notebook in the GLAM Workbench&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There are two data files in this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Total number of newspaper issues per year for each digitised newspaper&lt;/li&gt;
&lt;li&gt;A complete list of newspaper issues available from Trove&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are 5 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;18 October 2021&lt;/li&gt;
&lt;li&gt;20 January 2022&lt;/li&gt;
&lt;li&gt;3 August 2023&lt;/li&gt;
&lt;li&gt;26 June 2024&lt;/li&gt;
&lt;li&gt;13 September 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;3-trove-lists-and-tags&#34;&gt;3. Trove lists and tags&lt;/h2&gt;
&lt;h3 id=&#34;trove-lists-metadata&#34;&gt;trove-lists-metadata&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-lists-metadata/&#34;&gt;https://github.com/GLAM-Workbench/trove-lists-metadata/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.11504501&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.11504501.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Trove users can create collections of resources using Trove&amp;rsquo;s &amp;lsquo;lists&amp;rsquo;. This dataset contains metadata describing all public lists, harvested via the Trove API. To reduce file size, the details of the resources collected by each list are not included, just the total number of resources. The data was extracted using &lt;a href=&#34;https://glam-workbench.net/trove-lists/#harvest-summary-data-from-trove-lists&#34;&gt;this notebook&lt;/a&gt; from the &lt;a href=&#34;https://glam-workbench.net/trove-lists/&#34;&gt;Trove lists and tags&lt;/a&gt; section of the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;There are 4 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;20 September 2018&lt;/li&gt;
&lt;li&gt;22 September 2020&lt;/li&gt;
&lt;li&gt;5 July 2022&lt;/li&gt;
&lt;li&gt;6 June 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;public-tags-added-to-resources-in-trove-2008-to-2024&#34;&gt;Public tags added to resources in Trove, 2008 to 2024&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.11496377&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.11496377.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset contains details of 2,495,958 unique public tags added to 10,403,650 resources in Trove between August 2008 and June 2024.&lt;/p&gt;
&lt;p&gt;There are 3 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;10 July 2021&lt;/li&gt;
&lt;li&gt;6 July 2022&lt;/li&gt;
&lt;li&gt;6 June 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;tag-counts&#34;&gt;Tag counts&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.11496519&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.11496519.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset was derived from the full harvest of Trove public tags. It contains a list of unique tags and the total number of resources in Trove each tag is attached to.&lt;/p&gt;
&lt;p&gt;There are 3 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;10 July 2021&lt;/li&gt;
&lt;li&gt;6 July 2022&lt;/li&gt;
&lt;li&gt;6 June 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;4-trove-contributors&#34;&gt;4. Trove contributors&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/wragge/trove-contributor-totals&#34;&gt;https://github.com/wragge/trove-contributor-totals&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This repository contains an automated git scraper that uses the Trove API to save details of organisations and projects that contribute metadata to Trove. As well as counts of total resources by contributor, this dataset includes counts of resources from each contributor by format and category. It runs every week and updates the following data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-contributor-totals/blob/main/data/trove-contributors.json&#34;&gt;Number of resources by contributor (unmodified JSON from API)&lt;/a&gt;, weekly updates, 9 March 2023 to present&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-contributor-totals/blob/main/data/trove-contributors.csv&#34;&gt;Number of resources by contributor (flattened data as CSV)&lt;/a&gt;, weekly updates, 9 March 2023 to present&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-contributor-totals/blob/main/data/trove-contributors-categories.csv&#34;&gt;Number of resources by contributor and category&lt;/a&gt;, weekly updates, 21 June 2024 to present&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-contributor-totals/blob/main/data/trove-contributors-categories-formats.csv&#34;&gt;Number of resources by contributor, category, and format&lt;/a&gt;, weekly updates, 21 June 2024 to present&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These data files were generated using version 2 of the Trove API:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-contributor-totals/blob/main/data/trove-contributors-zones.csv&#34;&gt;Number of resources by contributor and zone&lt;/a&gt;, weekly updates, 9 March 2023 to 12 May 2024&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-contributor-totals/blob/main/data/trove-contributors-formats.csv&#34;&gt;Number of resources by contributor, zone, and format&lt;/a&gt;,  weekly updates, 9 March 2023 to 12 May 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;5-trove-digitised-resources-other-than-newspapers&#34;&gt;5. Trove digitised resources (other than newspapers)&lt;/h2&gt;
&lt;p&gt;To help people find and use digitised resources other than newspapers in Trove, I&amp;rsquo;ve been harvesting, sharing, and visualising metadata relating to specific formats, such as books and periodicals. The methods I&amp;rsquo;ve used have changed over time, and there are some earlier versions that I still need to extract from the Git repositories, but these are the current datasets. I&amp;rsquo;m planning to set up automatic re-harvests for some or all of these, so there&amp;rsquo;ll be a better record of change over time.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s more information about these datasets in both the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; and the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&#34;books&#34;&gt;Books&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-books-data&#34;&gt;https://github.com/GLAM-Workbench/trove-books-data&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are 2 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;20 November 2023&lt;/li&gt;
&lt;li&gt;14 February 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;periodicals&#34;&gt;Periodicals&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-periodicals-data&#34;&gt;https://github.com/GLAM-Workbench/trove-periodicals-data&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset was created by checking, correcting, and enriching data about digitised periodicals obtained from the Trove API. Additional metadata describing periodical titles and issues was extracted from the Trove website and used to check the API results. Where titles were wrongly described as issues, and vice versa, the records were corrected. Additional descriptive metadata was also added to the records. Separate CSV-formatted data files were created for titles and issues. Finally, the titles and issues data was loaded into an SQLite database for use with Datasette.&lt;/p&gt;
&lt;p&gt;There are 4 data files in this repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-periodicals-data/raw/main/titles-issues-added.ndjson&#34;&gt;NDJSON file of titles and issues harvested from Trove API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-periodicals-data/raw/main/periodical-titles.csv&#34;&gt;CSV file of periodical titles enriched with additional metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-periodicals-data/raw/main/periodical-issues.csv&#34;&gt;CSV file of periodical issues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-periodicals-data/raw/main/periodicals.db&#34;&gt;SQLite database with linked titles and issues data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are two versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;29 February 2024&lt;/li&gt;
&lt;li&gt;12 March 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;parliamentary-papers&#34;&gt;Parliamentary Papers&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-parliamentary-papers-data&#34;&gt;https://github.com/GLAM-Workbench/trove-parliamentary-papers-data&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset contains metadata describing Commonwealth Parliamentary Papers that have been digitised and made available through Trove.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s one version of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;23 February 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;maps&#34;&gt;Maps&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-maps-data&#34;&gt;https://github.com/GLAM-Workbench/trove-maps-data&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.11526486&#34;&gt;&lt;img src=&#34;https://zenodo-rdm.web.cern.ch/badge/DOI/10.5281/zenodo.11526486.svg&#34; alt=&#34;DOI&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset contains metadata describing digitised maps in Trove, harvested from the Trove API and other sources.&lt;/p&gt;
&lt;p&gt;There are 2 data files in this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-maps-data/raw/main/single_maps.csv&#34;&gt;Metadata of single maps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-maps-data/raw/main/single_maps_coordinates.csv&#34;&gt;Parsed geospatial coordinates of single maps&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are 2 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 February 2023&lt;/li&gt;
&lt;li&gt;8 June 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;oral-histories&#34;&gt;Oral histories&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-oral-histories-data&#34;&gt;https://github.com/GLAM-Workbench/trove-oral-histories-data&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;There are 2 data files in this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv&#34;&gt;Metadata describing NLA oral histories&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-history-series.txt&#34;&gt;List of series containing oral histories&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are 2 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;16 November 2023&lt;/li&gt;
&lt;li&gt;15 December 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;images&#34;&gt;Images&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-images-rights-data/&#34;&gt;https://github.com/GLAM-Workbench/trove-images-rights-data/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset includes information about the application of licences and rights statements to images by Trove contributors.&lt;/p&gt;
&lt;p&gt;There are 2 data files in this repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-images-rights-data/raw/main/rights-on-images.csv&#34;&gt;Number of images by licence type and contributor&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-images-rights-data/raw/main/rights-on-out-of-copyright-photos.csv&#34;&gt;Number of out-of-copyright images by licence type and contributor&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are 3 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;17 February 2020&lt;/li&gt;
&lt;li&gt;9 March 2022&lt;/li&gt;
&lt;li&gt;24 April 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;finding-aids&#34;&gt;Finding aids&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/nla-finding-aids-data/&#34;&gt;https://github.com/GLAM-Workbench/nla-finding-aids-data/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This repository contains data about the National Library of Australia&amp;rsquo;s digitised manuscript finding aids, harvested from Trove.&lt;/p&gt;
&lt;p&gt;This dataset contains 2 data files:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/nla-finding-aids-data/blob/main/finding-aids.csv&#34;&gt;List of urls of digitised finding aids in Trove&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/nla-finding-aids-data/blob/main/finding-aids-totals.csv&#34;&gt;Summary information describing each digitised finding aid&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is one version of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1 March 2023&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;6-trove-born-digital-resources&#34;&gt;6. Trove born digital resources&lt;/h2&gt;
&lt;h3 id=&#34;pandora-web-archive-collections&#34;&gt;Pandora web archive collections&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-web-archives-collections&#34;&gt;https://github.com/GLAM-Workbench/trove-web-archives-collections&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset contains details of the subject and collection groupings used by Pandora to organise archived web resource titles.&lt;/p&gt;
&lt;p&gt;There are two data files in this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-web-archives-collections/raw/main/pandora-subjects.ndjson&#34;&gt;List of subject groupings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-web-archives-collections/raw/main/pandora-collections.ndjson&#34;&gt;List of collection groupings&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are 2 versions of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;2 May 2024&lt;/li&gt;
&lt;li&gt;7 May 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;ned-periodicals&#34;&gt;NED periodicals&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-ned-periodicals-data&#34;&gt;https://github.com/GLAM-Workbench/trove-ned-periodicals-data&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset contains details of periodical titles and issues submitted to Trove through the NLA&amp;rsquo;s National edeposit (NED) scheme.&lt;/p&gt;
&lt;p&gt;There are 3 data files in this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-ned-periodicals-data/raw/main/ned-periodicals.csv&#34;&gt;Metadata of periodical titles submitted through NED&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-ned-periodicals-data/raw/main/ned-periodical-issues.csv&#34;&gt;Metadata of periodical issues submitted through NED&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-ned-periodicals-data/raw/main/ned-periodicals.db&#34;&gt;SQLite database combining linked titles and issues data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There is one version of this dataset:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;10 April 2024&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;7-national-archives-of-australia&#34;&gt;7. National Archives of Australia&lt;/h2&gt;
&lt;p&gt;The NAA datasets are all over the place at present and I need to do a lot of work to get them standardised and organised. These are the main datasets, but there are others I need to add.&lt;/p&gt;
&lt;h3 id=&#34;records-with-the-access-status-closed&#34;&gt;Records with the access status &amp;lsquo;Closed&amp;rsquo;&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/wragge/closed_access&#34;&gt;https://github.com/wragge/closed_access&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch&#34;&gt;https://github.com/GLAM-Workbench/recordsearch&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Versions in Figshare:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.6084/m9.figshare.3443867.v1&#34;&gt;Files in the National Archives of Australia currently withheld from public access&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.6084/m9.figshare.2060052.v1&#34;&gt;Files in the National Archives of Australia withheld from public access in 2015&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.6084/m9.figshare.4530851.v1&#34;&gt;Files in the National Archives of Australia withheld from public access in 2016&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.6084/m9.figshare.5900125.v1&#34;&gt;Files in the National Archives of Australia withheld from public access in 2017&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Versions in GitHub:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/data/closed-20160101.csv&#34;&gt;January 2016&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/data/closed-20170109.csv&#34;&gt;January 2017&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/data/closed-20180101.csv&#34;&gt;January 2018&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/data/closed-20190101.csv&#34;&gt;January 2019&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/data/closed-20200101.csv&#34;&gt;January 2020&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/data/closed-20210101.csv&#34;&gt;January 2021&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/data/closed-20220101.csv&#34;&gt;January 2022&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;January 2023 (harvested but not in repo yet)&lt;/li&gt;
&lt;li&gt;January 2024 (harvested but not in repo yet)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;summary-data-about-all-series-in-recordsearch&#34;&gt;Summary data about all series in RecordSearch&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch&#34;&gt;https://github.com/GLAM-Workbench/recordsearch&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;CSV file containing basic descriptive information about all the series currently registered on RecordSearch as well as the total number of items described, digitised, and in each access category.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/series_totals_May_2021.csv&#34;&gt;Summary data about all series in RecordSearch, May 2021&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/series_totals_April_2022.csv&#34;&gt;Summary data about all series in RecordSearch, April 2022&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;recently-digitised-files&#34;&gt;Recently digitised files&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch&#34;&gt;https://github.com/GLAM-Workbench/recordsearch&lt;/a&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/data/recently-digitised-20210327&#34;&gt;Details of files digitised between 25 February and 26 March 2021&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;recently-digitised-files--weekly-snapshots&#34;&gt;Recently digitised files – weekly snapshots&lt;/h3&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/wragge/naa-recently-digitised&#34;&gt;https://github.com/wragge/naa-recently-digitised&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This dataset contains weekly harvests of newly digitised files in RecordSearch. The automated scraper is currently scheduled to run each Sunday, saving a list of files that have been digitised in the previous week.&lt;/p&gt;
&lt;p&gt;There are 177 data files, created weekly from 28 March 2021 to the present.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Saving Trove&#39;s digitised periodicals as PDFs</title>
      <link>https://updates.timsherratt.org/2024/09/19/saving-troves-digitised.html</link>
      <pubDate>Thu, 19 Sep 2024 13:46:22 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/09/19/saving-troves-digitised.html</guid>
      <description>&lt;p&gt;I was recently contacted by a researcher who wanted to be able to automatically download the issues of a digitised periodical in Trove as PDFs. There was already a &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/harvest_newspaper_issues_as_pdfs/&#34;&gt;notebook in the GLAM Workbench&lt;/a&gt; that downloads the issues of a digitised &lt;em&gt;newspaper&lt;/em&gt; as PDFs, but newspapers work differently to other digitised periodicals in Trove. While there was no corresponding notebook for other types of periodicals, all the necessary steps were documented in the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;, so it was just a matter of pulling together a few blocks of code.&lt;/p&gt;
&lt;p&gt;There are three main steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;get the &lt;code&gt;nla.obj&lt;/code&gt; identifiers for all the periodical&amp;rsquo;s issues&lt;/li&gt;
&lt;li&gt;get the number of pages in each issue&lt;/li&gt;
&lt;li&gt;construct a url to download each issue as a PDF using the &lt;code&gt;nla.obj&lt;/code&gt; identifier and the number of pages&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;get-issue-identifiers&#34;&gt;Get issue identifiers&lt;/h2&gt;
&lt;p&gt;Version 3 of the Trove API added a new endpoint to &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/periodicals/accessing-data.html#using-the-magazine-titles-api-endpoint&#34;&gt;provide information about periodical titles and issues&lt;/a&gt;. However, the issues data provided by the API is incomplete. A more reliable alternative is to scrape the list of issues from the browse window in the digitised object viewer – see &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/how-to/get-collection-items.html&#34;&gt;HOW TO: Get a list of items from a digitised collection&lt;/a&gt; in the &lt;em&gt;Trove Data Guide&lt;/em&gt;.&lt;/p&gt;
&lt;h2 id=&#34;get-number-of-pages-in-each-issue&#34;&gt;Get number of pages in each issue&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s possible to scrape the number of pages along with the identifiers in the previous step. However, I&amp;rsquo;m not certain that the information is displayed consistently across all periodicals. To play it safe, you can extract embedded metadata from the digitised object viewer to get the number of pages, issue dates, and publication details (if available). See &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/how-to/extract-embedded-metadata.html&#34;&gt;HOW TO: Extract additional metadata from the digitised resource viewer&lt;/a&gt; in the &lt;em&gt;Trove Data Guide&lt;/em&gt;.&lt;/p&gt;
&lt;h2 id=&#34;download-pdfs&#34;&gt;Download PDFs&lt;/h2&gt;
&lt;p&gt;Once you have an issue&amp;rsquo;s identifier and number of pages you can construct a url to download it as a PDF. See: &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/how-to/get-downloads.html&#34;&gt;HOW TO: Get text, images, and PDFs using Trove’s download link&lt;/a&gt; in the &lt;em&gt;Trove Data Guide&lt;/em&gt;.&lt;/p&gt;
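&lt;p&gt;As a rough Python sketch – the query parameter names below follow the pattern described in the &lt;em&gt;Trove Data Guide&lt;/em&gt;&amp;rsquo;s download HOW TO, so treat them as assumptions and check the guide if downloads fail:&lt;/p&gt;

```python
from urllib.parse import urlencode

def pdf_download_url(obj_id, num_pages):
    """Build a download url for a complete digitised issue.

    `obj_id` is the issue's nla.obj identifier (e.g. "nla.obj-123456789").
    Pages are numbered from 0, so the last page is num_pages - 1.
    The query parameter names are assumptions based on the Trove Data Guide.
    """
    params = urlencode(
        {"downloadOption": "pdf", "firstPage": 0, "lastPage": num_pages - 1}
    )
    return f"https://nla.gov.au/{obj_id}/download?{params}"
```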
&lt;h2 id=&#34;putting-it-all-together&#34;&gt;Putting it all together&lt;/h2&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-09-19-12-42-49.png&#34; width=&#34;600&#34; height=&#34;596&#34; alt=&#34;Screencap of page in the GLAM Workbench&#34;&gt;
&lt;p&gt;It seemed like this would be useful to other researchers as well, so I&amp;rsquo;ve created a new notebook in the Trove Periodicals section of the GLAM Workbench that puts all of this together, see: &lt;a href=&#34;https://glam-workbench.net/trove-journals/save-issues-as-pdfs/&#34;&gt;Download issues of a periodical as PDFs&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The future (and past) of Historic Hansard</title>
      <link>https://updates.timsherratt.org/2024/08/29/the-future-and.html</link>
      <pubDate>Thu, 29 Aug 2024 14:59:45 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/08/29/the-future-and.html</guid>
      <description>&lt;p&gt;Don&amp;rsquo;t panic! &lt;a href=&#34;https://historichansard.net/&#34;&gt;Historic Hansard&lt;/a&gt; is not closing down – on the contrary, I&amp;rsquo;m planning a major update in the next few months. But as I look to the future, I thought it was a good time to pull together a few threads documenting my adventures with Commonwealth Hansard.&lt;/p&gt;
&lt;h2 id=&#34;the-past&#34;&gt;The past&lt;/h2&gt;
&lt;p&gt;Commonwealth Hansard is made available &lt;a href=&#34;http://parlinfo.aph.gov.au/parlInfo/search/summary/summary.w3p;adv%3Dyes;orderBy%3D_fragment_number,doc_date-rev;query%3DDataset%3Ahansardr,hansardr80;resCount%3DDefault&#34;&gt;online through ParlInfo&lt;/a&gt; (there&amp;rsquo;s an &lt;a href=&#34;https://www.aph.gov.au/Parliamentary_Business/Hansard/Search&#34;&gt;alternative search interface here&lt;/a&gt;). The Parliamentary Library has invested a lot of time and effort in converting the printed volumes into nicely-structured XML files which break up the sitting day into debates and speeches, and identify individual speakers. For the most part, there&amp;rsquo;s one XML file for each sitting day in each house. However, there&amp;rsquo;s currently a gap between 1981 and 1997 where no XML files are available.&lt;/p&gt;
&lt;p&gt;I started pulling data out of ParlInfo around 2011, but in 2016 I decided it would be more efficient to harvest all of the XML files into a &lt;a href=&#34;https://github.com/wragge/hansard-xml&#34;&gt;dedicated repository&lt;/a&gt;. I started with the House of Representatives debates to 1965, and gradually expanded the coverage. The &lt;a href=&#34;https://github.com/wragge/hansard-xml&#34;&gt;repository&lt;/a&gt; now contains all the Hansard XML files for both houses from 1901 to 1980, and 1998 to 2005. I stopped in 2005 because &lt;a href=&#34;https://www.openaustralia.org.au/&#34;&gt;Open Australia&lt;/a&gt; provides access to the Hansard XML files from 2006 onwards.&lt;/p&gt;
&lt;p&gt;The harvesting process revealed some interesting anomalies. In particular, &lt;a href=&#34;https://timsherratt.org/research-notebook/historic-hansard/notes/investigating-the-hansard-black-hole/&#34;&gt;I discovered that Parlinfo was missing Hansard from 94 sitting days&lt;/a&gt;. Most of the gaps were in the Senate between 1910 and 1920.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-29-13-39-41.png&#34; width=&#34;600&#34; height=&#34;254&#34; alt=&#34;Chart showing &#39;missing&#39; Hansard in the Senate 1901 to 1925&#34;&gt;
&lt;p&gt;Fortunately the Parliamentary Library was quick to investigate the problem and fill the gaps. Over the years since, they&amp;rsquo;ve continued to improve the quality and accuracy of the XML files. Earlier this year I noticed that lots of new versions of the XML files were appearing and so I &lt;a href=&#34;https://updates.timsherratt.org/2024/05/26/commonwealth-hansard-xml.html&#34;&gt;reharvested them all&lt;/a&gt;. It looks like the accuracy of the OCRd text has been improved and some structural issues fixed. This is one reason why a new version of Historic Hansard is needed!&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;ve ever used ParlInfo you&amp;rsquo;ll know that while you can &lt;em&gt;find&lt;/em&gt; things in Hansard, &lt;em&gt;browsing&lt;/em&gt; and &lt;em&gt;reading&lt;/em&gt; are difficult. There&amp;rsquo;s no easy way of just perusing the proceedings of a single day. A few months after I created the XML repository I decided to use the files to build &lt;a href=&#34;https://historichansard.net/&#34;&gt;Historic Hansard&lt;/a&gt; – &amp;lsquo;Commonwealth of Australia parliamentary debates presented in an easy-to-read format for historians and other lovers of political speech&amp;rsquo;. Historic Hansard is mostly just a static site, with one web page for each sitting day. Unlike ParlInfo, the focus is on reading, and you can view each speech within the context of the complete day&amp;rsquo;s proceedings.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-29-13-41-16.png&#34; width=&#34;600&#34; height=&#34;286&#34; alt=&#34;Screenshot of the home page of Historic Hansard&#34;&gt;
&lt;p&gt;You can browse Historic Hansard by house, parliament, year, and day. There are also indexes to the bills presented in the &lt;a href=&#34;https://historichansard.net/hofreps/bills/&#34;&gt;House of Representatives&lt;/a&gt; and the &lt;a href=&#34;https://historichansard.net/senate/bills/&#34;&gt;Senate&lt;/a&gt;, and pages for every person in the &lt;a href=&#34;https://historichansard.net/hofreps/people/&#34;&gt;House of Representatives&lt;/a&gt; and the &lt;a href=&#34;https://historichansard.net/senate/people/&#34;&gt;Senate&lt;/a&gt; with a complete list of their speeches. Because the focus was on browsing, Historic Hansard didn&amp;rsquo;t originally include a full text search function, but I eventually succumbed to user demand and &lt;a href=&#34;https://search.historichansard.net/&#34;&gt;added one in 2017&lt;/a&gt;. You can search for either debates or speeches, and download your results as a CSV file.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-29-13-43-29.png&#34; width=&#34;600&#34; height=&#34;268&#34; alt=&#34;Screenshot of Historic Hansard&#39;s search interface&#34;&gt;
&lt;p&gt;I also integrated Historic Hansard with &lt;a href=&#34;https://web.hypothes.is/&#34;&gt;Hypothes.is&lt;/a&gt; and &lt;a href=&#34;https://voyant-tools.org/&#34;&gt;Voyant Tools&lt;/a&gt;. Using Hypothes.is you can add notes and annotations to the speeches. You can even &lt;a href=&#34;https://timsherratt.org/research-notebook/historic-hansard/notes/two-way-direct-linking/&#34;&gt;create deep links to fragments of text within a speech&lt;/a&gt;. I&amp;rsquo;ve often suggested that you could structure a whole undergraduate history unit around the annotation of a year of Hansard – identifying people and events, and finding and linking to related information. The Voyant Tools integration allowed you to analyse and visualise the language of a complete year of Hansard. Unfortunately I broke this at some point, so it&amp;rsquo;s on my list of things to fix in the new version!&lt;/p&gt;
&lt;p&gt;In 2017, I did a bit of work with David Lowe at Deakin University to analyse &lt;a href=&#34;https://wragge.github.io/hansard-language/&#34;&gt;the language of Hansard&lt;/a&gt;. Most of it was focused on the 1970s, but I did create word clouds using the top 200 TF-IDF weighted words for &lt;a href=&#34;https://wragge.github.io/hansard-language/decades/tfidf-clouds/&#34;&gt;each decade&lt;/a&gt; and &lt;a href=&#34;https://wragge.github.io/hansard-language/decades/parliament-tfidf-clouds/&#34;&gt;each parliament&lt;/a&gt;. In 2019, I compared the &lt;a href=&#34;https://doi.org/10.5281/zenodo.3544686&#34;&gt;usage of the term &amp;lsquo;aliens&amp;rsquo;&lt;/a&gt; in Hansard, newspapers, and &lt;em&gt;The Bulletin&lt;/em&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-29-13-45-43.png&#34; width=&#34;600&#34; height=&#34;466&#34; alt=&#34;Comparison of words associated with &#39;aliens&#39; in The Bulletin and Hansard&#34;&gt;
&lt;p&gt;One interesting feature of the Hansard XML is that it identifies &lt;em&gt;interjections&lt;/em&gt; as well as formal speeches. In 2017, I extracted almost a million interjections from Hansard into a &lt;a href=&#34;https://github.com/wragge/hansard-interjections&#34;&gt;separate dataset&lt;/a&gt;. One of my favourite experiments used the interjections to reimagine political communication in the pre-internet era by &lt;a href=&#34;http://hansard-interjections.herokuapp.com/tweets/&#34;&gt;transforming interjections into (fake) tweets&lt;/a&gt;. Of course I also started feeding the interjections through a text-to-speech processor, creating &lt;a href=&#34;https://youtu.be/8fz734LpyVU&#34;&gt;RoboHansard&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-29-13-48-55.png&#34; width=&#34;600&#34; height=&#34;402&#34; alt=&#34;Screenshot of Real Words :: Imagined Tweets showing Hansard interjections reimagined as tweets&#34;&gt;
&lt;p&gt;During the &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/&#34;&gt;Real Face of White Australia&lt;/a&gt; transcribe-a-thon at Old Parliament House in 2017, I had the chance to take some of the interjections back to the place they were uttered. I set up speakers around the House of Representatives chamber and then started them hurling interjections about the White Australia Policy at each other. The drama (and spookiness) was heightened when a power failure turned off all the lights!&lt;/p&gt;
&lt;p&gt;I summarised my &lt;a href=&#34;https://youtu.be/TFzhiXH3_eg&#34;&gt;adventures with Historic Hansard&lt;/a&gt; for a conference in South Africa in 2020.&lt;/p&gt;
&lt;iframe width=&#34;100%&#34; height=&#34;400&#34; src=&#34;https://www.youtube.com/embed/TFzhiXH3_eg?si=LKUF15B41ul8urZj&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;I also wrote a piece on &amp;lsquo;The language of Parliament&amp;rsquo; for the Museum of Australian Democracy which, unfortunately, seems to have disappeared during their latest site update. You can find it on &lt;a href=&#34;https://doi.org/10.5281/zenodo.6872590&#34;&gt;Zenodo&lt;/a&gt; or in the &lt;a href=&#34;https://webarchive.nla.gov.au/awa/20170401163437/https://moadoph.gov.au/blog/the-language-of-parliament/&#34;&gt;Australian Web Archive&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In more recent times, I&amp;rsquo;ve integrated the XML harvesting code &lt;a href=&#34;https://glam-workbench.net/hansard/&#34;&gt;into the GLAM Workbench&lt;/a&gt; and added some Jupyter notebooks that demonstrate how you can access and analyse files from the repository.&lt;/p&gt;
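&lt;p&gt;If you just want a feel for how the XML can be processed, here&amp;rsquo;s a minimal Python sketch. The element names (&lt;code&gt;debate&lt;/code&gt;, &lt;code&gt;speech&lt;/code&gt;, &lt;code&gt;talker&lt;/code&gt;, &lt;code&gt;para&lt;/code&gt;) follow the general shape of the Commonwealth Hansard XML, but check a file from the repository for the exact schema:&lt;/p&gt;

```python
# Minimal sketch of pulling speeches out of Commonwealth Hansard XML.
# A tiny stand-in document is built programmatically here; the element
# names follow the general shape of the real files, but treat them as
# assumptions and check the repository for the exact structure.
import xml.etree.ElementTree as ET

root = ET.Element("hansard")
debate = ET.SubElement(root, "debate")
speech = ET.SubElement(debate, "speech")
talker = ET.SubElement(ET.SubElement(speech, "talk.start"), "talker")
ET.SubElement(talker, "name").text = "Barton, Edmund"
ET.SubElement(speech, "para").text = "Mr Speaker, I move..."

# Collect (speaker, text) pairs from every speech in the sitting day
speeches = []
for s in root.iter("speech"):
    name = s.find(".//talker/name").text
    text = " ".join(p.text for p in s.iter("para"))
    speeches.append((name, text))
    print(name, "|", text)
```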
&lt;h2 id=&#34;the-future&#34;&gt;The future&lt;/h2&gt;
&lt;p&gt;A number of historians have told me how much &lt;a href=&#34;https://historichansard.net/&#34;&gt;Historic Hansard&lt;/a&gt; has helped their research. A quick &lt;a href=&#34;https://scholar.google.com/scholar?q=%22historichansard.net%22&#34;&gt;search in Google Scholar for &amp;ldquo;historichansard.net&amp;rdquo;&lt;/a&gt; returns 78 publications, many of which seem to include multiple links to specific sitting days. It seems I accidentally created a key piece of digital research infrastructure for Australian historians.&lt;/p&gt;
&lt;p&gt;Given that, I&amp;rsquo;m not intending to change very much. My current plans include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;update all the content to use the latest versions of the XML&lt;/li&gt;
&lt;li&gt;extend the date range to include files from 1998 to 2005&lt;/li&gt;
&lt;li&gt;get the Voyant integration working again&lt;/li&gt;
&lt;li&gt;make some back-end changes to the search application&lt;/li&gt;
&lt;li&gt;fix various outdated links&lt;/li&gt;
&lt;li&gt;clean up the Bills indexes a bit by merging together some of the titles (this will require a bit of experimentation)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Are there any improvements you&amp;rsquo;d like to see? If you have any suggestions, feel free to &lt;a href=&#34;https://github.com/wragge/historic-hansard/issues&#34;&gt;add an issue to the GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://historichansard.net/&#34;&gt;Historic Hansard&lt;/a&gt; is one of my passion projects. I haven&amp;rsquo;t received any funding to create or maintain it. Currently I estimate it costs me about AU$400 a year to run. It&amp;rsquo;s not much in the world of research infrastructure but, from a personal perspective, it all adds up. I&amp;rsquo;m committed to keeping Historic Hansard going, and it&amp;rsquo;s already outlasted some similar, well-funded projects, but if you&amp;rsquo;d like to help with the costs you can &lt;a href=&#34;https://github.com/sponsors/wragge&#34;&gt;sponsor me on GitHub&lt;/a&gt;, or just &lt;a href=&#34;https://www.buymeacoffee.com/wragge&#34;&gt;buy me a coffee&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There seems to be a lot of interest in Hansard amongst digital humanities researchers at the moment, and there&amp;rsquo;s a couple of new projects starting up. It&amp;rsquo;ll be interesting to see where they go, but whatever happens, Historic Hansard will continue to serve lovers of political speech.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Join the Research Data Alliance&#39;s new Collections as Data Interest Group!</title>
      <link>https://updates.timsherratt.org/2024/08/27/join-the-research.html</link>
      <pubDate>Tue, 27 Aug 2024 14:24:10 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/08/27/join-the-research.html</guid>
      <description>&lt;p&gt;If you&amp;rsquo;re interested in opening up GLAM collections for use in research, you might like to join the new &lt;a href=&#34;https://www.rd-alliance.org/groups/collections-as-data-ig/&#34;&gt;Collections as Data Interest Group&lt;/a&gt;, part of the &lt;a href=&#34;https://www.rd-alliance.org/&#34;&gt;Research Data Alliance&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-27-13-19-16.png&#34; width=&#34;600&#34; height=&#34;309&#34; alt=&#34;Screenshot of Collections as Data IG page&#34;&gt;
&lt;p&gt;According to the group description:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This group is aimed at collections professionals such as archivists, librarians, records managers and museum curators, as well as related professions such as IT professionals, knowledge scientists, and those involved in standards development, who serve in a range of critical roles: as experts in ensuring access, preservation, and reuse of digital records, objects, data, and collections; as provocateurs for good collections curation practices; and as advocates for the construction of responsible and sustainable infrastructures for information sharing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The group has been running for a few months now, but communication has been difficult due to the upgrade of the RDA website. Things now seem to be working ok, so it&amp;rsquo;s time to spread the word!&lt;/p&gt;
&lt;p&gt;The group has monthly online meetings. To try and cover a variety of timezones, we&amp;rsquo;re rotating the meeting times. &lt;strong&gt;The next meeting will be held in an Australia/Asia/Pacific timeslot – 3pm AEST (5.00am UTC), on Wednesday, 18 September.&lt;/strong&gt; Further meeting details, including a registration link and agenda &lt;a href=&#34;https://www.rd-alliance.org/groups/collections-as-data-ig/forum/topic/next-rda-collections-as-data-ig-meeting-18-september/&#34;&gt;are available here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;To become part of the Collections as Data Interest Group, first &lt;a href=&#34;https://www.rd-alliance.org/individual-membership/&#34;&gt;register for membership&lt;/a&gt; of the Research Data Alliance. Once you&amp;rsquo;re logged into the RDA site, visit the &lt;a href=&#34;https://www.rd-alliance.org/groups/collections-as-data-ig/&#34;&gt;Collections as Data IG page&lt;/a&gt; and click on the &lt;strong&gt;Join Group&lt;/strong&gt; button. You&amp;rsquo;ll then receive notifications of new group posts and activities. You&amp;rsquo;ll also be able to access archived group meeting minutes under the &amp;lsquo;Wiki&amp;rsquo; tab.&lt;/p&gt;
&lt;p&gt;Current activities include planning for a session on &amp;lsquo;Gathering Metrics and Setting Boundaries: Reusing Collections as Data and the impacts of AI&amp;rsquo; at the &lt;a href=&#34;https://www.rd-alliance.org/event/rda-23rd-plenary-meeting-san-jose-costa-rica/&#34;&gt;RDA&amp;rsquo;s 23rd Plenary Meeting&lt;/a&gt; in November. There&amp;rsquo;s also a &lt;a href=&#34;https://www.rd-alliance.org/groups/collections-as-data-ig/forum/topic/zotero-library-for-cad/&#34;&gt;Zotero group&lt;/a&gt; where we&amp;rsquo;ve started to capture useful resources.&lt;/p&gt;
&lt;p&gt;Given my work on the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; and the &lt;a href=&#34;https://tdg.glam-workbench.net&#34;&gt;Trove Data Guide&lt;/a&gt;, I&amp;rsquo;m particularly interested in ways we can collaborate to engage researchers, build tools, and share resources. But what are your interests? I&amp;rsquo;m the Australian-based co-chair, so let me know if there are topics you&amp;rsquo;d like to discuss in future meetings.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>More datasets added to GLAM Name Index Search – now almost 12 million rows of data!</title>
      <link>https://updates.timsherratt.org/2024/08/26/more-datasets-added.html</link>
      <pubDate>Mon, 26 Aug 2024 18:54:48 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/08/26/more-datasets-added.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; brings datasets from 10 Australian GLAM organisations together into a single  search interface. All these datasets index collections by people’s  names, so with one search you can find information about individuals  across a broad range of records, locations, and periods. It was created as an experiment during Family History Week in 2021, so I thought I&amp;rsquo;d update it for Family History Week 2024.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The update added 18 new datasets, so the GLAM Name Index Search now includes 279 datasets from 10 organisations – almost 12 million rows of data!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;Start exploring now!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Most of the datasets come from government open data portals, so the GLAM Name Index Search is also a demonstration of the value of data sharing. By making these datasets openly available, GLAM organisations support the development of new tools and resources that work across institutional boundaries.&lt;/p&gt;
&lt;p&gt;As well as updating the datasets (details below), I took the chance to tweak the interface a bit. In particular, there&amp;rsquo;s now a new, svelte dark mode for people (like me) who struggle with screen glare.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-26-17-36-46.png&#34; width=&#34;600&#34; height=&#34;484&#34; alt=&#34;Screenshot of the home page of the GLAM Name Index Search in dark mode&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The GLAM Name Index Search is created using &lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; with a custom theme. For more information on how I compiled the data, see the &lt;a href=&#34;https://glam-workbench.net/glam-data-portals/&#34;&gt;GLAM data portals&lt;/a&gt; section of the GLAM Workbench.&lt;/p&gt;
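&lt;p&gt;The full-text search side of this can be sketched with SQLite&amp;rsquo;s built-in FTS5 extension, which is what Datasette searches under the hood (a toy example with made-up rows, not the actual GLAM Name Index build):&lt;/p&gt;

```python
# Toy sketch of the kind of full-text name search Datasette provides
# over a SQLite database: SQLite's built-in FTS5 extension on a few
# made-up rows, not the actual GLAM Name Index build.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE names USING fts5(name, dataset)")
con.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [
        ("Smith, John", "Tasmanian Arrivals"),
        ("Smythe, Johanna", "Brisbane Gaol Prisoners 1850-1898"),
        ("Jones, Mary", "Index to Inquests 1859-1905"),
    ],
)

# FTS5 matches whole tokens case-insensitively, so 'smith' finds
# 'Smith, John' but not 'Smythe, Johanna'.
matches = list(
    con.execute("SELECT name, dataset FROM names WHERE names MATCH ?", ("smith",))
)
for name, dataset in matches:
    print(name, "|", dataset)
```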
&lt;h2 id=&#34;new-datasets&#34;&gt;New datasets&lt;/h2&gt;
&lt;h3 id=&#34;queensland-state-archives&#34;&gt;Queensland State Archives&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Book of Trials 1835-1842 (created 2024-01-08)&lt;/li&gt;
&lt;li&gt;Brisbane Gaol Prisoners 1850-1898 (created 2022-08-30)&lt;/li&gt;
&lt;li&gt;Brisbane Passenger Arrivals 1913-1915 (created 2023-11-14)&lt;/li&gt;
&lt;li&gt;Gympie Mining Claims 1868-1901 (created 2023-01-30)&lt;/li&gt;
&lt;li&gt;Immigrant Remittances, IMA Maryborough 1864-1905. (S13055, S13060 &amp;amp; S13061) (created 2022-08-30)&lt;/li&gt;
&lt;li&gt;Immigrant Remittances, Nanango 1863-1901 (created 2024-08-13)&lt;/li&gt;
&lt;li&gt;Index to Deed Polls (Change of Name) 1889 - 1920.csv (created 2024-01-08)&lt;/li&gt;
&lt;li&gt;Index to Immigrants and Missing Immigrants cards 1920-1945 (created 2024-01-08)&lt;/li&gt;
&lt;li&gt;Index to Inquests 1859-1905 (digital) (created 2023-11-20)&lt;/li&gt;
&lt;li&gt;Index to Register of Immigrants, Maryborough 1871-1915 (created 2023-11-17)&lt;/li&gt;
&lt;li&gt;Land Orders 1861 - 1878 (created 2022-11-30)&lt;/li&gt;
&lt;li&gt;Records on Frontier Wars and violence in Queensland (created 2023-05-29)&lt;/li&gt;
&lt;li&gt;Reformatory School for Boys 1871-1906 (created 2022-09-05)&lt;/li&gt;
&lt;li&gt;Teachers in the Education Office Gazettes 1926-1952 (created 2022-08-30)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;nsw-state-archives&#34;&gt;NSW State Archives&lt;/h3&gt;
&lt;p&gt;Most of the NSW State Archives indexes are harvested from their website. I &lt;a href=&#34;https://updates.timsherratt.org/2023/05/08/updated-harvest-of.html&#34;&gt;last updated them&lt;/a&gt; about a year ago, but there have been some additions since then.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Assisted immigrants digitised shipping lists 1828-1896&lt;/li&gt;
&lt;li&gt;Governor&amp;rsquo;s Court Case Papers&lt;/li&gt;
&lt;li&gt;Small Debts Registers Index&lt;/li&gt;
&lt;li&gt;Wages paid to orphans&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;updated-datasets&#34;&gt;Updated datasets&lt;/h2&gt;
&lt;h3 id=&#34;libraries-tasmania&#34;&gt;Libraries Tasmania&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Bankruptcy - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Colonial Secretary correspondence - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Court - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Education - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Employment - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Hotels &amp;amp; Properties - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Land Records - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Miscellaneous - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Arrivals - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Births - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Census - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Convicts - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Convicts - permission to marry - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Deaths - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Departures - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Divorces - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Health &amp;amp; Welfare Records - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Immigration - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Inquests - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Marriages - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Naturalisations - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Prisoners - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;Tasmanian Wills - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;li&gt;World War One Tasmanian Photographs - CSV (modified 2024-08-14)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;nsw-state-archives-1&#34;&gt;NSW State Archives&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Inquest index 1851, 1916-1963&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;queensland-state-archives-1&#34;&gt;Queensland State Archives&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Assisted immigration 1848 to 1912 - combined (modified 2023-08-28)&lt;/li&gt;
&lt;li&gt;Australian South Sea Islanders 1867 - 1948 (modified 2023-08-11)&lt;/li&gt;
&lt;li&gt;Book of Trials 1835-1842 (modified 2024-07-12)&lt;/li&gt;
&lt;li&gt;Brisbane Gaol Prisoners 1850-1898 (modified 2024-05-28)&lt;/li&gt;
&lt;li&gt;Brisbane Passenger Arrivals 1913-1915 (modified 2024-07-12)&lt;/li&gt;
&lt;li&gt;Civil servants 1851 - 1867 (modified 2023-02-10)&lt;/li&gt;
&lt;li&gt;Consumptive patients 1897 to 1903 (modified 2023-07-24)&lt;/li&gt;
&lt;li&gt;Criminal Depositions 1861-1900 (modified 2024-08-02)&lt;/li&gt;
&lt;li&gt;Farm Lads 1922-1940 (modified 2023-10-24)&lt;/li&gt;
&lt;li&gt;Female prisoners admitted, Toowoomba 1887-1891 (modified 2023-08-14)&lt;/li&gt;
&lt;li&gt;Gympie Mining Claims 1868-1901 (modified 2023-01-30)&lt;/li&gt;
&lt;li&gt;Immigrant Remittances, IMA Maryborough 1864-1905. (S13055, S13060 &amp;amp; S13061) (modified 2022-08-30)&lt;/li&gt;
&lt;li&gt;Immigrant Remittances, Nanango 1863-1901 (modified 2024-08-20)&lt;/li&gt;
&lt;li&gt;Immigrants 1909-1932 (modified 2024-06-13)&lt;/li&gt;
&lt;li&gt;Immigrants and Crew 1860 - 1864 (modified 2024-06-12)&lt;/li&gt;
&lt;li&gt;Immigrants landed Bowen 1888-1896 (modified 2024-06-13)&lt;/li&gt;
&lt;li&gt;Immigrants nominated for passage, Maryborough 1884 to 1907 (modified 2023-02-10)&lt;/li&gt;
&lt;li&gt;Immigration 1922 to 1940 (modified 2023-05-30)&lt;/li&gt;
&lt;li&gt;Index to Deed Polls (Change of Name) 1889 - 1920.csv (modified 2024-01-08)&lt;/li&gt;
&lt;li&gt;Index to Immigrants and Missing Immigrants cards 1920-1945 (modified 2024-02-15)&lt;/li&gt;
&lt;li&gt;Index to Inquests 1859-1905 (digital) (modified 2024-07-09)&lt;/li&gt;
&lt;li&gt;Index to Outdoor Relief 1892-1920 (modified 2023-08-07)&lt;/li&gt;
&lt;li&gt;Index to Register of  Cases and treatment at Moreton Bay Hospital 1830-1862 (modified 2023-02-13)&lt;/li&gt;
&lt;li&gt;Index to Register of Aliens 1922-1923 - Sugar Exemptions (modified 2023-02-13)&lt;/li&gt;
&lt;li&gt;Index to Register of Immigrants, Maryborough 1871-1915 (modified 2024-07-04)&lt;/li&gt;
&lt;li&gt;Inquests 1859 to 1905 (non-digital) (modified 2024-07-09)&lt;/li&gt;
&lt;li&gt;Justices of the Peace 1857 to 1957 (modified 2023-10-26)&lt;/li&gt;
&lt;li&gt;Land Orders 1861 - 1878 (modified 2022-11-30)&lt;/li&gt;
&lt;li&gt;Land Selections 1885-1981 (modified 2024-06-13)&lt;/li&gt;
&lt;li&gt;Land orders 1861 to 1874 (modified 2022-11-30)&lt;/li&gt;
&lt;li&gt;Land orders 1862 to 1878 (modified 2022-11-30)&lt;/li&gt;
&lt;li&gt;Land orders 1865 to 1866 (Lands Dept) (modified 2023-02-10)&lt;/li&gt;
&lt;li&gt;Mackay Hospital admissions 1891 to 1908 (modified 2023-07-24)&lt;/li&gt;
&lt;li&gt;Miners rights 1874 to 1880 A-Z (modified 2022-09-06)&lt;/li&gt;
&lt;li&gt;Nominated immigrants 1905-1928 (modified 2024-07-05)&lt;/li&gt;
&lt;li&gt;Nurses examinations 1912 to 1925 (modified 2023-08-14)&lt;/li&gt;
&lt;li&gt;Passport clearances 1923-1940 (modified 2024-02-16)&lt;/li&gt;
&lt;li&gt;Prisoners admitted, Toowoomba 1895-1906 (modified 2023-08-11)&lt;/li&gt;
&lt;li&gt;Prisoners discharged, Toowoomba 1869-1879 (modified 2023-07-31)&lt;/li&gt;
&lt;li&gt;Prisoners tried, Toowoomba 1864-1903 (modified 2023-07-31)&lt;/li&gt;
&lt;li&gt;Records on Frontier Wars and violence in Queensland (modified 2023-05-29)&lt;/li&gt;
&lt;li&gt;Redeemed Land Orders 1860-1906 (modified 2024-06-13)&lt;/li&gt;
&lt;li&gt;Reformatory School for Boys 1871-1906 (modified 2022-09-05)&lt;/li&gt;
&lt;li&gt;Register of Court Fees Marburg 1885 to 1908 (modified 2023-09-05)&lt;/li&gt;
&lt;li&gt;Register of Lands Sold 1842-1868 (modified 2022-08-30)&lt;/li&gt;
&lt;li&gt;Register of immigrants 1864 to 1878 (modified 2023-09-01)&lt;/li&gt;
&lt;li&gt;Register of immigrants, Toowoomba 1880 to 1888 (modified 2023-08-30)&lt;/li&gt;
&lt;li&gt;Register of the Engagement of Immigrants at the Immigration Depot - Bowen 1873-1912 (modified 2023-09-05)&lt;/li&gt;
&lt;li&gt;Registers of immigrants 1882 to 1938 combined (modified 2023-08-11)&lt;/li&gt;
&lt;li&gt;Seamen 1882 to 1919 (modified 2022-11-30)&lt;/li&gt;
&lt;li&gt;South African (Boer War) service records 1899-1902 (modified 2023-05-30)&lt;/li&gt;
&lt;li&gt;South African (Boer) War Paybooks  1899-1902 (modified 2023-08-14)&lt;/li&gt;
&lt;li&gt;Teachers in the Education Office Gazettes 1926-1952 (modified 2022-08-30)&lt;/li&gt;
&lt;li&gt;Toowoomba Prisoners 1864-1906 (modified 2023-07-31)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;state-library-of-queensland&#34;&gt;State Library of Queensland&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Queensland railway appointees 1890-1915 (modified 2024-07-25)&lt;/li&gt;
&lt;li&gt;Queensland railway removals 1890-1915 (modified 2024-07-25)&lt;/li&gt;
&lt;li&gt;Southern and Western Railway appointees 1866-1876 (modified 2024-07-25)&lt;/li&gt;
&lt;li&gt;Southern and Western Railway removals 1866-1876 (modified 2024-07-25)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;related-updates&#34;&gt;Related updates&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2023/05/08/updated-harvest-of.html&#34;&gt;2023-05-08&lt;/a&gt;:  &lt;strong&gt;Updated harvest of NSW State Archives indexes – more than 2 million rows of data!&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2021/10/18/more-glam-name.html&#34;&gt;2021-10-18&lt;/a&gt;:  &lt;strong&gt;More GLAM Name Index updates from Queensland State Archives and SLWA&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2021/09/29/more-records-for.html&#34;&gt;2021-09-29&lt;/a&gt;: &lt;strong&gt;More records for the GLAM Name Index Search&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2021/08/23/a-family-history.html&#34;&gt;2021-08-23&lt;/a&gt;: &lt;strong&gt;A Family History Month experiment – search millions of name records from GLAM organisations&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>New Zotero translators for PROV and Queensland State Archives</title>
      <link>https://updates.timsherratt.org/2024/08/22/new-zotero-translators.html</link>
      <pubDate>Thu, 22 Aug 2024 14:35:55 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/08/22/new-zotero-translators.html</guid>
      <description>&lt;p&gt;Good news for Australian archives users – you can now use &lt;a href=&#34;https://www.zotero.org/&#34;&gt;Zotero&lt;/a&gt; to capture item details and digitised files from the collections of the &lt;a href=&#34;https://prov.vic.gov.au/&#34;&gt;Public Record Office Victoria&lt;/a&gt; and the &lt;a href=&#34;https://www.qld.gov.au/recreation/arts/heritage/archives&#34;&gt;Queensland State Archives&lt;/a&gt;!&lt;/p&gt;
&lt;h2 id=&#34;what-is-zotero&#34;&gt;What is Zotero?&lt;/h2&gt;
&lt;p&gt;According to the &lt;a href=&#34;https://www.zotero.org/&#34;&gt;Zotero&lt;/a&gt; website:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Zotero is a free, easy-to-use tool to help you collect, organize, annotate, cite, and share research.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;While you can use it instead of commercial reference managers like EndNote, Zotero is much, much more. I use Zotero as my personal research database, capturing publications, websites, PDFs, as well as records from a growing number of GLAM collections. You can add tags and notes to items, organise them into collections, and annotate PDFs and website captures. You can also share your collections, create groups to collaborate with others, and access all your Zotero data via an API. And on top of all that, &lt;a href=&#34;https://www.zotero.org/blog/zotero-7/&#34;&gt;Zotero has just been completely revamped&lt;/a&gt; with a cool new interface (yay, dark mode!).&lt;/p&gt;
&lt;h2 id=&#34;extending-zotero&#34;&gt;Extending Zotero&lt;/h2&gt;
&lt;p&gt;One of my favourite features is the ability for users to extend Zotero by creating new &amp;lsquo;translators&amp;rsquo;. &lt;a href=&#34;https://www.zotero.org/support/dev/translators&#34;&gt;Translators&lt;/a&gt; are little bits of JavaScript code that capture information from a website and load it into Zotero. If there&amp;rsquo;s an online collection or database that doesn&amp;rsquo;t currently work with Zotero, you can write a translator for it and have it added to the main application. The &lt;a href=&#34;https://github.com/zotero/translators&#34;&gt;Zotero translators repository&lt;/a&gt; currently contains more than 700 translators created by more than 200 contributors. That&amp;rsquo;s a pretty significant community effort!&lt;/p&gt;
&lt;p&gt;And as of today, the repository includes translators for the PROV and Queensland State Archives collection databases.&lt;/p&gt;
&lt;h2 id=&#34;capturing-prov-records&#34;&gt;Capturing PROV records&lt;/h2&gt;
&lt;p&gt;The PROV translator makes use of &lt;a href=&#34;https://prov.vic.gov.au/prov-collection-api&#34;&gt;PROV&amp;rsquo;s collection API&lt;/a&gt;. It will capture item records from search results or individual item pages – just click on the Zotero icon in your browser&amp;rsquo;s toolbar. Here&amp;rsquo;s an example of the metadata Zotero captures from &lt;a href=&#34;https://prov.vic.gov.au/archive/84BF6CD8-F54F-11E9-AE98-55CAE942BD8E?image=1&#34;&gt;this record&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-22-11-53-45.png&#34; width=&#34;600&#34; height=&#34;628&#34; alt=&#34;Screenshot of Zotero details pane showing captured metadata&#34;&gt;
&lt;p&gt;As well as expected details like &amp;lsquo;title&amp;rsquo; and &amp;lsquo;date&amp;rsquo;, the translator saves the agencies that created the file as &amp;lsquo;contributors&amp;rsquo;. Some additional archival metadata, such as the series details, is included in the &amp;lsquo;Extra&amp;rsquo; field.&lt;/p&gt;
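If you want to poke at the collection API yourself, it's easy to query from Python. This is just a rough sketch, I'm assuming the Solr-style query endpoint and response structure described in PROV's API documentation, so check that before relying on it:

```python
import urllib.parse

API_URL = "https://api.prov.vic.gov.au/search/query"

def build_prov_search(query):
    """Build a query URL for PROV's Solr-style collection API (assumed endpoint)."""
    return API_URL + "?" + urllib.parse.urlencode({"q": query, "wt": "json"})

def get_docs(data):
    """Pull the list of item records out of a Solr-style JSON response."""
    return data.get("response", {}).get("docs", [])

# Usage (makes a network call):
# import json, urllib.request
# with urllib.request.urlopen(build_prov_search("wills")) as f:
#     for doc in get_docs(json.load(f)):
#         print(doc.get("title"))
```

This is roughly what the translator does behind the scenes before mapping the returned fields onto Zotero's item types.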
&lt;p&gt;If a file has been digitised, the translator also captures a number of &amp;lsquo;attachments&amp;rsquo;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;an image of the first page (particularly handy if the item is a photograph)&lt;/li&gt;
&lt;li&gt;a link to a PDF version of the file&lt;/li&gt;
&lt;li&gt;a link to the IIIF manifest describing this file&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-22-11-57-33.png&#34; width=&#34;600&#34; height=&#34;641&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Why doesn&amp;rsquo;t it attach the complete PDF rather than just a link? The download PDF links in the PROV interface behave differently depending on the size of the file. If the file is small, the link loads the PDF as expected. But if it&amp;rsquo;s large you&amp;rsquo;re redirected to a web page that tells you that the PDF is being generated and could take up to an hour. There wasn&amp;rsquo;t a straightforward way for Zotero to handle these two possible outcomes, so instead of trying to attach the PDFs, I&amp;rsquo;ve saved the link. You can then click on the link to view/generate the PDF and add it manually to the record if you desire.&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s an IIIF manifest? PROV delivers digitised images using the &lt;a href=&#34;https://iiif.io/&#34;&gt;International Image Interoperability Framework (IIIF)&lt;/a&gt;. An IIIF manifest is a standard way of describing a group of related images. PROV uses IIIF manifests to describe the contents of each digitised file. The link to the file manifest isn&amp;rsquo;t available through the PROV web interface, but the translator picks it up from the API and attaches it to the Zotero record.&lt;/p&gt;
&lt;p&gt;Because manifests are based on the IIIF standard, they can be reused across different applications. For example, you could use the PROV manifest links to load a digitised file into &lt;a href=&#34;https://tropy.org/&#34;&gt;Tropy&lt;/a&gt;, a desktop tool for managing and annotating images from the creators of Zotero. There&amp;rsquo;s an example of doing this with an IIIF manifest generated from Trove in the &lt;a href=&#34;https://tdg.glam-workbench.net/pathways/images/tropy.html&#34;&gt;Trove Data Guide&lt;/a&gt;. You can also use the manifest links to download all the images from a file using a tool like &lt;a href=&#34;https://www.lizmfischer.com/iiif-tools/download&#34;&gt;IIIF Download&lt;/a&gt;.&lt;/p&gt;
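To give a sense of why manifests are so reusable, here's a minimal Python sketch that pulls the image URLs out of a IIIF Presentation v2 manifest. The nested sequences/canvases/images structure is the v2 layout; a v3 manifest uses an items hierarchy instead, so adjust accordingly:

```python
def image_urls(manifest):
    """Extract full-size image URLs from a IIIF Presentation v2 manifest dict."""
    urls = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                url = image.get("resource", {}).get("@id")
                if url:
                    urls.append(url)
    return urls

# Usage (fetches the manifest over the network):
# import json, urllib.request
# with urllib.request.urlopen(manifest_url) as f:
#     manifest = json.load(f)
# print(image_urls(manifest))
```

Once you have the URLs you can feed them to a downloader, a Jupyter notebook, or any IIIF-aware viewer.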
&lt;h2 id=&#34;capturing-queensland-state-archives-records&#34;&gt;Capturing Queensland State Archives records&lt;/h2&gt;
&lt;p&gt;The QSA translator will capture item records from search results or individual item pages – just click on the Zotero icon in your browser&amp;rsquo;s toolbar. Here&amp;rsquo;s an example of the metadata Zotero captures from &lt;a href=&#34;https://www.archivessearch.qld.gov.au/items/ITM1662471&#34;&gt;this record&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-22-12-51-43.png&#34; width=&#34;600&#34; height=&#34;529&#34; alt=&#34;Screenshot of Zotero details pane showing item metadata&#34;&gt;
&lt;p&gt;As far as I can see, digitised files in the QSA collection will either be a PDF or an image. Either way, the translator captures the file and attaches it to the record.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-22-12-52-00.png&#34; width=&#34;600&#34; height=&#34;486&#34; alt=&#34;Screenshot of Zotero details pane showing the image attachment&#34;&gt;
&lt;h2 id=&#34;zotero-and-australian-glams&#34;&gt;Zotero and Australian GLAMs&lt;/h2&gt;
&lt;p&gt;Unfortunately there&amp;rsquo;s still a lot of work to be done before all Australian GLAM organisations are integrated with Zotero. Ideally, GLAM collection databases would support Zotero directly by &lt;a href=&#34;https://www.zotero.org/support/dev/exposing_metadata&#34;&gt;embedding metadata&lt;/a&gt; in their web pages. This would make custom translators unnecessary (and support other types of integration as well). But few actually do this. When it comes to embedded metadata we seem to have gone backwards in recent years as systems have been &amp;lsquo;upgraded&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;The good news is that there are now custom translators for &lt;a href=&#34;https://librariestas.ent.sirsidynix.net.au/client/en_AU/library/search/results?qu=&#34;&gt;Libraries Tasmania&lt;/a&gt;, &lt;a href=&#34;https://recordsearch.naa.gov.au/&#34;&gt;National Archives of Australia&lt;/a&gt;, &lt;a href=&#34;https://trove.nla.gov.au/&#34;&gt;Trove&lt;/a&gt;, &lt;a href=&#34;https://searchthecollection.nga.gov.au/landing&#34;&gt;National Gallery of Australia&lt;/a&gt;, &lt;a href=&#34;https://archive.sro.wa.gov.au/&#34;&gt;State Records Office of WA&lt;/a&gt;, &lt;a href=&#34;https://prov.vic.gov.au/explore-collection&#34;&gt;PROV&lt;/a&gt;, and &lt;a href=&#34;https://www.archivessearch.qld.gov.au/&#34;&gt;Queensland State Archives&lt;/a&gt;. But translators need maintenance, and they&amp;rsquo;re often broken by site upgrades. Also some systems (particularly the large commercial ones) make it very difficult (if not impossible) to write useful translators.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re interested in the current situation, I&amp;rsquo;ve created a spreadsheet documenting &lt;a href=&#34;https://docs.google.com/spreadsheets/d/1Zb_e9ZazP4zs-K8ZcbaTCnv6cO_OgmUx4A7U4MFyOFE/edit?usp=sharing&#34;&gt;GLAM Zotero support&lt;/a&gt;. As I&amp;rsquo;ve suggested previously, a project aimed at improving Zotero integration across Australian GLAM organisations would make a big (and relatively cheap) contribution to Australia&amp;rsquo;s HASS research infrastructure.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Explore Trove&#39;s digitised maps</title>
      <link>https://updates.timsherratt.org/2024/08/16/explore-troves-digitised.html</link>
      <pubDate>Fri, 16 Aug 2024 14:27:20 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/08/16/explore-troves-digitised.html</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://trove.nla.gov.au&#34;&gt;Trove&lt;/a&gt; contains thousands of digitised maps from the &lt;a href=&#34;https://www.nla.gov.au/collections/what-we-collect/maps&#34;&gt;collections of the National Library of Australia&lt;/a&gt;, but they&amp;rsquo;re not always easy to find because of the way they&amp;rsquo;re arranged and described. To help you explore these maps I&amp;rsquo;ve created &lt;a href=&#34;https://glam-workbench.net/trove-maps/explore-maps/&#34;&gt;a new database&lt;/a&gt; and published it using Datasette.&lt;/p&gt;
&lt;h3 id=&#34;try-it-nowhttpsglam-workbenchnettrove-mapsexplore-maps&#34;&gt;&lt;a href=&#34;https://glam-workbench.net/trove-maps/explore-maps/&#34;&gt;Try it now!&lt;/a&gt;&lt;/h3&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-16-12-39-50.png&#34; width=&#34;600&#34; height=&#34;455&#34; alt=&#34;Screenshot of a search for &#39;Hobart&#39; in the maps table, showing the results displayed both as a cluster map using Leaflet, and in a table.&#34;&gt;
&lt;p&gt;To get started, head to the &lt;strong&gt;map sheets&lt;/strong&gt; table and search for some keywords. The results are displayed both as a cluster map using Leaflet, and as a table. To view the details of an individual map sheet, click on the &lt;code&gt;id&lt;/code&gt; value.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-16-12-40-43.png&#34; width=&#34;600&#34; height=&#34;365&#34; alt=&#34;Screenshot of a page describing an individual map sheet.&#34;&gt;
&lt;p&gt;The individual record displays a zoomable version of the map image. If the record includes geospatial coordinates, these are also displayed on a modern basemap. There are also options to download the map image as a JPEG or a high-resolution TIFF (if available). Where possible I&amp;rsquo;ve also tried to link the Trove records to the NLA &lt;a href=&#34;https://mapsearch.nla.gov.au/&#34;&gt;MapSearch&lt;/a&gt; interface.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-08-16-13-01-49.png&#34; width=&#34;600&#34; height=&#34;351&#34; alt=&#34;Screenshot of a collection record showing links to subcollections and map sheets.&#34;&gt;
&lt;p&gt;To find map collections, you can either search in the &lt;strong&gt;collections&lt;/strong&gt; table, or click on the &lt;code&gt;collection&lt;/code&gt; link in a map sheet record to move up the descriptive hierarchy. A collection record includes links to browse the sub-collections and map sheets it contains.&lt;/p&gt;
&lt;h2 id=&#34;about-the-interface&#34;&gt;About the interface&lt;/h2&gt;
&lt;p&gt;I usually share GLAM Workbench datasets using &lt;a href=&#34;https://updates.timsherratt.org/2024/07/19/share-your-spreadsheet.html&#34;&gt;Datasette-Lite&lt;/a&gt;, but in this case I&amp;rsquo;ve created a full &lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; instance running on Google Cloud Run. This is because of the size of the database, and because I wanted to try out some plugins that don&amp;rsquo;t work in Datasette-Lite. In particular:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://datasette.io/plugins/datasette-cluster-map&#34;&gt;datasette-cluster-map&lt;/a&gt; – displays results on a cluster map using Leaflet&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://datasette.io/plugins/datasette-leaflet-geojson&#34;&gt;datasette-leaflet-geojson&lt;/a&gt; – converts GeoJSON strings in cells into Leaflet maps, creating the location previews in each row and item view&lt;/li&gt;
&lt;/ul&gt;
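For what it's worth, datasette-leaflet-geojson doesn't need any special column type, just a text column holding GeoJSON strings. A small sketch (the table and column names here are invented for illustration):

```python
import json
import sqlite3

conn = sqlite3.connect("maps.db")  # a file on disk, so Datasette can open it
conn.execute(
    "CREATE TABLE IF NOT EXISTS map_sheets (id TEXT PRIMARY KEY, title TEXT, geometry TEXT)"
)
# Store a GeoJSON Point as a plain JSON string; datasette-leaflet-geojson
# recognises strings like this and renders them as Leaflet previews.
point = {"type": "Point", "coordinates": [147.325, -42.881]}
conn.execute(
    "INSERT OR REPLACE INTO map_sheets VALUES (?, ?, ?)",
    ("sheet-1", "Hobart", json.dumps(point)),
)
conn.commit()
```

With the plugin installed, running Datasette against a database like this should show the point on a small map in each row.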
&lt;h2 id=&#34;about-the-data&#34;&gt;About the data&lt;/h2&gt;
&lt;p&gt;The data was compiled using &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/how-to/harvest-digitised-resources.html&#34;&gt;the method outlined in the &lt;em&gt;Trove Data Guide&lt;/em&gt;&lt;/a&gt;. This involves:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;harvesting all the work records from a &lt;a href=&#34;https://trove.nla.gov.au/search/category/images?keyword=%22nla.obj%22&amp;amp;l-format=Map&amp;amp;l-availability=y&#34;&gt;search for maps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;unpacking any versions grouped within the work records&lt;/li&gt;
&lt;li&gt;using metadata embedded in the digitised object viewers to identify  collections, harvest collection items, and enrich the records&lt;/li&gt;
&lt;li&gt;merging duplicate values and records&lt;/li&gt;
&lt;/ul&gt;
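The harvesting step starts with the Trove API. As a hedged sketch (v3 endpoint and parameter names as per the Trove API documentation, with the same facets as the search linked above; you'll need your own API key):

```python
import urllib.parse

API_URL = "https://api.trove.nla.gov.au/v3/result"

def build_map_search(query='"nla.obj"', n=100):
    """Build a Trove API v3 search URL for digitised, freely-available maps."""
    params = {
        "q": query,
        "category": "image",
        "l-format": "Map",        # same facets as the web search above
        "l-availability": "y",
        "encoding": "json",
        "n": n,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

# Usage (needs a Trove API key sent in the X-API-KEY header):
# import json, urllib.request
# request = urllib.request.Request(build_map_search(), headers={"X-API-KEY": MY_KEY})
# data = json.load(urllib.request.urlopen(request))
```

The full harvest then pages through the results and unpacks the versions inside each work record, as described in the Trove Data Guide.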
&lt;p&gt;In addition, I&amp;rsquo;ve added geospatial information if it&amp;rsquo;s available. If there&amp;rsquo;s a corresponding record in NLA&amp;rsquo;s &lt;a href=&#34;https://mapsearch.nla.gov.au/&#34;&gt;MapSearch&lt;/a&gt; interface, I&amp;rsquo;ve linked the records and added the geospatial data from  MapSearch. Otherwise I&amp;rsquo;ve looked for geospatial information in the item  metadata and attempted to parse the coordinates.&lt;/p&gt;
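Parsing coordinates from item metadata mostly means converting degree and minute strings to decimal degrees. A simplified sketch, assuming strings like "S 42°52'" (the real metadata is much messier, which is where the errors creep in):

```python
import re

def parse_coordinate(text):
    """Convert a string like "S 42°52'" to decimal degrees, or None on failure."""
    match = re.match(r"([NSEW])\s*(\d+)[°\s]+(\d+)?", text.strip())
    if not match:
        return None
    hemisphere, degrees, minutes = match.groups()
    value = int(degrees) + int(minutes or 0) / 60
    # Southern and western hemispheres are negative in decimal degrees.
    sign = -1 if hemisphere in "SW" else 1
    return sign * value
```

A map sheet's bounding box is usually given as two longitude and two latitude values, so a record yields four calls to a function like this.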
&lt;p&gt;The result is a flat structure where every row in the database represents a record with a unique &lt;code&gt;nla.obj&lt;/code&gt; identifier. I&amp;rsquo;ve divided the rows into three tables according to their position in Trove&amp;rsquo;s descriptive hierarchy – &lt;strong&gt;collections&lt;/strong&gt;, &lt;strong&gt;sub-collections&lt;/strong&gt;, and &lt;strong&gt;map sheets&lt;/strong&gt;. Collections have child records, but no parents as they&amp;rsquo;re at the top of the hierarchy. Sub-collections have parent collections as well as child records. Map sheets have no child records but can belong to collections or sub-collections. There are links within each record to help you  navigate up and down this hierarchy.&lt;/p&gt;
&lt;p&gt;You&amp;rsquo;ll notice some problems and limitations with the data. In particular, maps within collections sometimes inherit their metadata from their parent. This means that their geospatial coordinates refer to the whole series, rather than the individual sheet. There are also numerous errors in the geospatial coordinates parsed from metadata.&lt;/p&gt;
&lt;h2 id=&#34;coming-soon&#34;&gt;Coming soon&lt;/h2&gt;
&lt;p&gt;I originally harvested this data to try and help me understand the scope of the digitised maps in Trove. I&amp;rsquo;m currently adding a &amp;lsquo;Maps&amp;rsquo; section to the &lt;a href=&#34;https://tdg.glam-workbench.net/&#34;&gt;Trove Data Guide&lt;/a&gt; in which I&amp;rsquo;ll provide an overview of what&amp;rsquo;s available, and document methods for accessing and using the data.&lt;/p&gt;
&lt;p&gt;The notebooks used to create the dataset will be added soon to the &lt;a href=&#34;https://glam-workbench.net/trove-maps/&#34;&gt;Maps section of the GLAM Workbench&lt;/a&gt;. I&amp;rsquo;ll also be sharing CSV-formatted versions of all the data.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Share your spreadsheet as a searchable online database using Datasette-Lite</title>
      <link>https://updates.timsherratt.org/2024/07/19/share-your-spreadsheet.html</link>
      <pubDate>Fri, 19 Jul 2024 12:13:36 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/07/19/share-your-spreadsheet.html</guid>
      <description>&lt;p&gt;HASS researchers often compile data in spreadsheets. Sometimes they want to &amp;lsquo;publish&amp;rsquo; this data online in a form that encourages others to use and explore – but how? I&amp;rsquo;ve just added &lt;a href=&#34;https://glam-workbench.net/glam-tools/datasette/&#34;&gt;a simple tool&lt;/a&gt; to the GLAM Workbench that helps you construct a url that will open a CSV file as a searchable database using Datasette-Lite.&lt;/p&gt;
&lt;h2 id=&#34;whats-datasette&#34;&gt;What&amp;rsquo;s Datasette?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; is a fantastic tool that helps you publish your data as an interactive website. There are a few different varieties of Datasette, but &lt;a href=&#34;https://github.com/simonw/datasette-lite&#34;&gt;Datasette-Lite&lt;/a&gt; is probably the easiest, as you don&amp;rsquo;t need to install any software. Datasette-Lite runs completely within your web browser, converting your data into a searchable database on demand.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-07-19-11-04-14.png&#34; width=&#34;600&#34; height=&#34;430&#34; alt=&#34;Screenshot of Datasette-Lite displaying a collection of editorial cartoons from the Bulletin&#34;&gt;
&lt;p&gt;I&amp;rsquo;m using Datasette-Lite throughout the GLAM Workbench. For example, try clicking the &lt;strong&gt;Explore in Datasette&lt;/strong&gt; buttons on any of these pages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/periodicals-data-api/&#34;&gt;digitised periodicals in Trove&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/bulletin-cartoons-collection/&#34;&gt;editorial cartoons from the Bulletin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/trove-oral-histories/&#34;&gt;oral histories in Trove&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What do the buttons do? They&amp;rsquo;re simply links to my &lt;a href=&#34;https://github.com/GLAM-Workbench/datasette-lite&#34;&gt;customised version of Datasette-Lite&lt;/a&gt; on GitHub. These links retrieve the Datasette-Lite web page, which instructs your browser to install the necessary software and run the Datasette application. Parameters in the url point Datasette to a specific data file in another GitHub repository. Datasette loads the data file and builds the interface.&lt;/p&gt;
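If you're curious, the generated links look something like this. The sketch below uses the standard Datasette-Lite deployment at lite.datasette.io rather than my customised version, but the csv parameter works the same way:

```python
from urllib.parse import urlencode

def datasette_lite_url(csv_url, base="https://lite.datasette.io/"):
    """Build a link that opens a remote CSV file in Datasette-Lite."""
    return base + "?" + urlencode({"csv": csv_url})

# Hypothetical CSV in a public GitHub repository:
link = datasette_lite_url("https://raw.githubusercontent.com/user/repo/main/data.csv")
print(link)
```

Anyone who follows the link gets the CSV loaded into a searchable database in their own browser.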
&lt;h2 id=&#34;create-your-own-datasette-lite-links&#34;&gt;Create your own Datasette-Lite links&lt;/h2&gt;
&lt;p&gt;All you need to publish your spreadsheet using Datasette-Lite is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;your spreadsheet saved in CSV format in a public GitHub repository&lt;/li&gt;
&lt;li&gt;a url that points Datasette-Lite to the CSV file&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-07-19-10-58-39.png&#34; width=&#34;600&#34; height=&#34;380&#34; alt=&#34;Screenshot of the web form in the GLAM Workbench to generate a Datasette-Lite url.&#34;&gt;
&lt;p&gt;To make this as easy as possible, I&amp;rsquo;ve created &lt;a href=&#34;https://glam-workbench.net/glam-tools/datasette/&#34;&gt;a tool that generates the necessary url&lt;/a&gt;. It&amp;rsquo;s just a simple web form – paste in the link to your CSV file, add some optional parameters, click the button, and the url to open your CSV file in Datasette-Lite will be displayed. Click on the url to open Datasette, or copy it to share your data with others.&lt;/p&gt;
&lt;p&gt;The optional parameters let you index full text columns for easy keyword searching, hide unwanted columns, and add links to more information about your dataset. They&amp;rsquo;re described more fully in the &lt;a href=&#34;https://glam-workbench.net/glam-tools/datasette/#documentation&#34;&gt;documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;limitations-and-alternatives&#34;&gt;Limitations and alternatives&lt;/h2&gt;
&lt;p&gt;This tool makes it easy to share single CSV files as searchable databases. But if your CSV has millions of rows it&amp;rsquo;ll probably make your web browser unhappy if you try and load it in Datasette-Lite. Instead you can make use of &lt;a href=&#34;https://docs.datasette.io/en/stable/publish.html&#34;&gt;Datasette&amp;rsquo;s &amp;lsquo;publish&amp;rsquo; option&lt;/a&gt; to push your data to a dedicated Datasette instance running in the cloud. You can also customise your instance by changing the theme or using canned queries.&lt;/p&gt;
&lt;p&gt;For example, the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; contains more than 10 million rows of data from multiple datasets, all aggregated through a single search interface. The &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tasmanian-post-office-directories/&#34;&gt;Tasmanian Post Office Directories&lt;/a&gt; provides full-text search across 48 digitised volumes and has a heavily customised interface that displays page images as well as the OCRd content. Both of these Datasette instances are hosted on Google Cloud Run.&lt;/p&gt;
&lt;p&gt;Similarly, the method described here is designed to work with single CSV files. If you have multiple, interconnected tables you&amp;rsquo;ll probably want to generate your own SQLite database to use with Datasette. There&amp;rsquo;s an example of how you can do this in the &lt;a href=&#34;https://glam-workbench.net/trove-journals/periodicals-enrich-for-datasette/&#34;&gt;Enrich the list of periodicals from the Trove API&lt;/a&gt; notebook in the GLAM Workbench.&lt;/p&gt;
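As a starting point, here's what building a small multi-table SQLite database looks like with nothing but the Python standard library (the example tables and rows are invented; tools like sqlite-utils streamline this considerably):

```python
import sqlite3

db = sqlite3.connect("research.db")
db.executescript(
    """
    CREATE TABLE IF NOT EXISTS periodicals (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE IF NOT EXISTS issues (
        id TEXT PRIMARY KEY,
        periodical_id TEXT REFERENCES periodicals(id),
        date TEXT
    );
    """
)
# Hypothetical rows -- in practice these come from your own harvests.
db.execute("INSERT OR REPLACE INTO periodicals VALUES (?, ?)", ("p1", "The Bulletin"))
db.execute("INSERT OR REPLACE INTO issues VALUES (?, ?, ?)", ("i1", "p1", "1901-01-05"))
db.commit()
```

Point Datasette at the resulting file and it picks up both tables and the relationship between them.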
</description>
    </item>
    
    <item>
      <title>Updated datasets describing Trove&#39;s digitised newspapers</title>
      <link>https://updates.timsherratt.org/2024/07/09/updated-datasets-describing.html</link>
      <pubDate>Tue, 09 Jul 2024 17:05:03 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/07/09/updated-datasets-describing.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove newspapers&lt;/a&gt; section of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; includes a number of notebooks and datasets that document the context and content of the newspaper corpus. I&amp;rsquo;ve just updated a few of these datasets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/csv-issues-per-year/&#34;&gt;Total number of issues per year for each newspaper in Trove&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/csv-complete-list-issues/&#34;&gt;Complete list of issues for every newspaper in Trove&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/list-non-english-newspapers/&#34;&gt;Trove newspapers with non-English language content&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/csv-newspapers-post-54/&#34;&gt;Trove newspapers with articles published after 1954&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/csv-newspapers-corrections/&#34;&gt;OCR corrections in Trove newspapers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-07-09-15-56-17.png&#34; width=&#34;600&#34; height=&#34;487&#34; alt=&#34;Visualisation showing the number of issues per day in Trove from 1896&#34;&gt;
&lt;p&gt;I&amp;rsquo;ve also used the issues data to update my &lt;a href=&#34;https://glam-workbench.net/examples/trove_newspaper_issues_per_day.html&#34;&gt;visualisation of the number of digitised newspaper issues in Trove published every day&lt;/a&gt; from 1803 to 2021 (there&amp;rsquo;s a lot of data so it can take a little while to load!).&lt;/p&gt;
&lt;p&gt;The notebooks in the Trove newspapers section still need to be updated to work with version 3 of the Trove API. I&amp;rsquo;m part way through and should get it finished in the next few weeks. I&amp;rsquo;ll also be adding some more of this data into the &lt;a href=&#34;https://tdg.glam-workbench.net/newspapers-and-gazettes/newspaper-corpus.html&#34;&gt;Understanding the digitised newspapers&lt;/a&gt; section of the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Understanding Trove at the AHA annual conference</title>
      <link>https://updates.timsherratt.org/2024/07/01/understanding-trove-at.html</link>
      <pubDate>Mon, 01 Jul 2024 21:36:41 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/07/01/understanding-trove-at.html</guid>
      <description>&lt;p&gt;A fairly intensive period of work came to an end today as I delivered a workshop on &amp;lsquo;Understanding Trove&amp;rsquo; at the Australian Historical Association&amp;rsquo;s annual conference in Adelaide. In effect, the workshop was also the launch of the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;, which I&amp;rsquo;ve been developing as part of the &lt;a href=&#34;https://ardc.edu.au/project/ardc-community-data-lab/&#34;&gt;ARDC&amp;rsquo;s Community Data Lab&lt;/a&gt;. The ARDC sponsored today&amp;rsquo;s workshop and has provided bursaries to help five ECRs and HDRs participate in the conference&amp;rsquo;s digital history stream.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/b2bab75987ea4fef.jpeg&#34; width=&#34;600&#34; height=&#34;449&#34; alt=&#34;Photograph of attendees at the Understanding Trove workshop.&#34;&gt;
&lt;p&gt;Thanks to everyone who came to the workshop. It was great to have so much interest in developing a critical understanding of Trove and thinking about new research uses for Trove data. If you couldn&amp;rsquo;t make it, the slides are available below. Like the Trove Data Guide, the GLAM Workbench and pretty much everything else I do, the slides are openly licensed so feel free to share and reuse if any of it is useful to you.&lt;/p&gt;
&lt;iframe src=&#34;https://slides.com/wragge/aha-2024/embed&#34; width=&#34;100%&#34; height=&#34;500&#34; title=&#34;Understanding Trove workshop&#34; scrolling=&#34;no&#34; frameborder=&#34;0&#34; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;Now I think I need a day off before I start thinking about the topics I&amp;rsquo;d still like to add to the Trove Data Guide&amp;hellip;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Who is the Trove Data Guide for?</title>
      <link>https://updates.timsherratt.org/2024/06/21/who-is-the.html</link>
      <pubDate>Fri, 21 Jun 2024 16:32:29 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/06/21/who-is-the.html</guid>
      <description>&lt;p&gt;The Trove Data Guide aims to help researchers understand, access, and use data from Trove. But just because it’s about ‘data’ doesn’t mean  you need to be able to code. To understand Trove data and its  possibilities for research, you first need to understand Trove itself –  its history, its structure, its assumptions, and its limits. This  knowledge is useful to any Trove user.&lt;/p&gt;
&lt;p&gt;For example, all Trove users would benefit from knowing more about &lt;a href=&#34;https://tdg.glam-workbench.net/what-is-trove/works-and-versions.html&#34;&gt;works and versions&lt;/a&gt;, or how to use the &lt;a href=&#34;https://tdg.glam-workbench.net/understanding-search/simple-search-options.html&#34;&gt;‘simple’ search box for complex queries&lt;/a&gt;. There’s also an introduction to &lt;a href=&#34;https://tdg.glam-workbench.net/newspapers-and-gazettes/newspaper-corpus.html&#34;&gt;what’s in (and not in) the digitised newspapers&lt;/a&gt;, and similar overviews for other digitised content such as &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/books/overview.html&#34;&gt;books&lt;/a&gt;, &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/parliamentary-papers/overview.html&#34;&gt;parliamentary papers&lt;/a&gt;, and &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/parliamentary-papers/overview.html&#34;&gt;oral histories&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/newspapers-change.png&#34; width=&#34;600&#34; height=&#34;307&#34; alt=&#34;Line chart showing number of newspaper articles in Trove by publication year, showing the change from 2011 to 2022&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://tdg.glam-workbench.net/newspapers-and-gazettes/newspaper-corpus.html#newspapers-corpus-history&#34;&gt;&lt;em&gt;Number of newspaper articles in Trove by publication year, showing the change from 2011 to 2022&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Several sections document the way the web interface works (or doesn’t work). There’s a field guide to the various &lt;a href=&#34;https://tdg.glam-workbench.net/what-is-trove/interfaces.html&#34;&gt;interfaces&lt;/a&gt; and &lt;a href=&#34;https://tdg.glam-workbench.net/what-is-trove/links-and-identifiers.html&#34;&gt;identifiers&lt;/a&gt; you might come across, and details of &lt;a href=&#34;https://tdg.glam-workbench.net/accessing-data/using-web-interface.html&#34;&gt;options for downloading data&lt;/a&gt;. The Trove Data Guide fills many gaps in the official Trove  documentation, so check here if you run into problems, or can’t figure  out how to achieve a particular task. Perhaps you were wondering how &lt;a href=&#34;https://tdg.glam-workbench.net/accessing-data/how-to/download-higher-resolution-images.html&#34;&gt;to download digitised images at their highest available resolution&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;For people who are more comfortable with code there are plenty of  useful snippets and complete working examples. For example there are  sections that document how to &lt;a href=&#34;https://tdg.glam-workbench.net/newspapers-and-gazettes/data/accessing-data.html&#34;&gt;get metadata, text, and images from newspapers&lt;/a&gt;, and &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/accessing-data.html&#34;&gt;other digitised resources&lt;/a&gt;. There are also a series of ‘HOW TO’ pages that describe more complex data access methods.&lt;/p&gt;
&lt;p&gt;But what can you do with Trove data? The Trove Data Guide’s &lt;a href=&#34;https://tdg.glam-workbench.net/pathways/index.html&#34;&gt;Pathways&lt;/a&gt; provide detailed tutorials that lead you step by step through examples of packaging Trove data for use with other digital tools. Use these examples as a starting point in planning your own projects.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Loading locations of Trove&#39;s digitised maps into the Gazetteer of Historical Australian Placenames</title>
      <link>https://updates.timsherratt.org/2024/06/18/loading-locations-of.html</link>
      <pubDate>Tue, 18 Jun 2024 17:57:16 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/06/18/loading-locations-of.html</guid>
      <description>&lt;p&gt;For this part of the ARDC&amp;rsquo;s &lt;a href=&#34;https://ardc.edu.au/project/ardc-community-data-lab/&#34;&gt;Community Data Lab&lt;/a&gt; project, I&amp;rsquo;ve been focusing in particular on adding a series of &lt;a href=&#34;https://tdg.glam-workbench.net/pathways/index.html&#34;&gt;researcher pathways&lt;/a&gt; to the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;. These pathways link data from Trove to a variety of tools and approaches and include five detailed tutorials. The first four were:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://tdg.glam-workbench.net/pathways/text/newspapers-keywords.html&#34;&gt;Analysing keywords in Trove’s digitised newspapers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tdg.glam-workbench.net/pathways/images/tropy.html&#34;&gt;Working with a Trove collection in Tropy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tdg.glam-workbench.net/pathways/images/mirador.html&#34;&gt;Comparing manuscript collections in Mirador&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://tdg.glam-workbench.net/pathways/collections/collectionbuilder.html&#34;&gt;Sharing a Trove List as a CollectionBuilder exhibition&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;ve now added the fifth and final (for now) tutorial:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://tdg.glam-workbench.net/pathways/geospatial/maps-to-ghap.html&#34;&gt;Create a layer in the Gazetteer of Historical Australian Placenames using metadata from Trove’s digitised maps&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As has been the way in a lot of TDG development, this tutorial builds on and extends resources available through the GLAM Workbench. The &lt;a href=&#34;https://glam-workbench.net/trove-maps/&#34;&gt;Trove maps section&lt;/a&gt; of the GLAM Workbench already included a dataset of &lt;a href=&#34;https://glam-workbench.net/trove-maps/single-maps-data/&#34;&gt;digitised maps&lt;/a&gt; and their &lt;a href=&#34;https://glam-workbench.net/trove-maps/single-maps-coordinates-data/&#34;&gt;coordinates&lt;/a&gt;, but for this tutorial I added a notebook that lets you &lt;a href=&#34;https://glam-workbench.net/trove-maps/create-map-subsets/&#34;&gt;create a subset of maps&lt;/a&gt; relating to a particular region. It does this by putting all the available map locations on a world map. You then draw a rectangle on the map to select a region and display details of all the maps whose centre points fall within that region. It also displays links to download your new dataset as either a CSV or GeoJSON file.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-06-08-23-43-44.png&#34; width=&#34;600&#34; height=&#34;378&#34; alt=&#34;Screenshot of the notebook in action showing how you can select a region to create a subset of digitised maps.&#34;&gt;
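&lt;p&gt;Under the hood, that selection step amounts to a bounding-box filter over the map centre points. Here is a minimal sketch of the idea, assuming each record carries &lt;code&gt;latitude&lt;/code&gt; and &lt;code&gt;longitude&lt;/code&gt; fields (the notebook&amp;rsquo;s actual column names may differ):&lt;/p&gt;

```python
import csv
import io

def maps_in_bbox(maps, min_lon, min_lat, max_lon, max_lat):
    """Return the maps whose centre point falls inside the rectangle."""
    return [
        m for m in maps
        if min_lon <= m["longitude"] <= max_lon
        and min_lat <= m["latitude"] <= max_lat
    ]

def to_csv(maps):
    """Serialise a subset as CSV, ready to download or reuse elsewhere."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["title", "latitude", "longitude"])
    writer.writeheader()
    writer.writerows(maps)
    return out.getvalue()

# A toy dataset: two maps centred near Fiji, one centred on London.
maps = [
    {"title": "Fiji islands", "latitude": -17.7, "longitude": 178.0},
    {"title": "Suva harbour", "latitude": -18.1, "longitude": 178.4},
    {"title": "London", "latitude": 51.5, "longitude": -0.1},
]
subset = maps_in_bbox(maps, 175.0, -20.0, 180.0, -15.0)
```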
&lt;p&gt;The tutorial walks you through this process, then demonstrates how you can upload data from the CSV file to create a new layer in the &lt;a href=&#34;https://ghap.tlcmap.org/&#34;&gt;Gazetteer of Historical Australian Placenames&lt;/a&gt; (GHAP).&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/ghap-layer-point-info.png&#34; width=&#34;600&#34; height=&#34;469&#34; alt=&#34;Screenshot of a GHAP layer containing information about digitised maps relating to Fiji.&#34;&gt;
&lt;p&gt;This part of the Trove Data Guide project is now finished, but I&amp;rsquo;ll be continuing to add and refine content. If you have any suggestions for additional tutorials, feel free to add them to the &lt;a href=&#34;https://github.com/wragge/trove-data-guide/discussions/categories/ideas&#34;&gt;ideas board&lt;/a&gt; (no promises though!).&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Instant exhibitions with Trove and CollectionBuilder</title>
      <link>https://updates.timsherratt.org/2024/06/10/instant-exhibitions-with.html</link>
      <pubDate>Mon, 10 Jun 2024 22:38:04 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/06/10/instant-exhibitions-with.html</guid>
<description>&lt;p&gt;You’ve been collecting and annotating items relating to your research project in a Trove List. You’d like to display the contents of your list as an online exhibition for others to explore. But how? One possible approach is now documented in the &lt;em&gt;Trove Data Guide&lt;/em&gt;. I&amp;rsquo;ve added &lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/collections/collectionbuilder.html&#34;&gt;a tutorial&lt;/a&gt; which walks through the process of using a &lt;a href=&#34;https://glam-workbench.net/trove-lists/convert-list-to-cb-exhibition/&#34;&gt;GLAM Workbench notebook&lt;/a&gt; to extract and process data from a Trove List, before uploading it to &lt;a href=&#34;https://collectionbuilder.github.io/&#34;&gt;CollectionBuilder&lt;/a&gt; to create an instant exhibition.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/cb-wragge-demo.png&#34; width=&#34;600&#34; height=&#34;407&#34; alt=&#34;Screenshot of the demo exhibition&#34;&gt;
&lt;p&gt;CollectionBuilder creates online exhibitions using static web technologies. It provides a GitHub Pages template repository, so all you need to do to create an exhibition is upload your metadata and images to GitHub. The GLAM Workbench notebook gets your list data from the Trove API, and enriches it a bit to take advantage of CollectionBuilder&amp;rsquo;s built-in visualisations. For example, if there are any digitised maps in your list, the notebook will try to extract their coordinates from the digitised map viewer and add them to the metadata so that CollectionBuilder can display the location on a map. The notebook also downloads images of newspaper articles and other digitised resources, and links them to the metadata, ready for upload.&lt;/p&gt;
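&lt;p&gt;The coordinate enrichment can be pictured as a simple merge: look up each item&amp;rsquo;s identifier in a table of extracted coordinates and, where there is a match, add location fields for a map visualisation to read. A rough sketch (the field names here are illustrative, not necessarily those the notebook or CollectionBuilder actually use):&lt;/p&gt;

```python
def enrich_with_coordinates(items, coordinates):
    """Add latitude/longitude to any item we have coordinates for,
    leaving other items untouched (field names assumed)."""
    for item in items:
        point = coordinates.get(item["objectid"])
        if point is not None:
            item["latitude"], item["longitude"] = point
    return items

# Toy list data: one digitised map with known coordinates, one letter without.
items = [
    {"objectid": "nla.obj-123", "title": "Map of Port Phillip"},
    {"objectid": "nla.obj-456", "title": "A letter home"},
]
coordinates = {"nla.obj-123": (-38.1, 144.9)}
enriched = enrich_with_coordinates(items, coordinates)
```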
&lt;p&gt;Check out the tutorial: &lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/collections/collectionbuilder.html&#34;&gt;Sharing a Trove List as a CollectionBuilder exhibition&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Keyword analysis of Trove newspapers with the GLAM Workbench &amp; ATAP</title>
      <link>https://updates.timsherratt.org/2024/06/03/keyword-analysis-of.html</link>
      <pubDate>Mon, 03 Jun 2024 16:06:44 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/06/03/keyword-analysis-of.html</guid>
      <description>&lt;p&gt;There&amp;rsquo;s a &lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/text/newspapers-keywords.html&#34;&gt;new draft tutorial&lt;/a&gt; in the development version of the &lt;em&gt;Trove Data Guide&lt;/em&gt;. It walks through the process of harvesting a collection of digitised newspaper articles from Trove, reshaping the harvest to create sub-collections, and then loading the data into the &lt;a href=&#34;https://github.com/Australian-Text-Analytics-Platform/keywords-analysis&#34;&gt;Keyword Analysis Tool&lt;/a&gt; provided by the &lt;a href=&#34;https://www.atap.edu.au/&#34;&gt;Australian Text Analytics Platform&lt;/a&gt; (ATAP).&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-06-03-14-56-53.png&#34; width=&#34;600&#34; height=&#34;404&#34; alt=&#34;&#34;&gt;
&lt;p&gt;Along the way it goes into a fair bit of detail about constructing searches, using the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper Harvester&lt;/a&gt;, and thinking about your data. Much of the information on creating and reshaping datasets would apply to using the digitised newspapers with other analysis tools as well.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Running Mirador on GitHub Pages</title>
      <link>https://updates.timsherratt.org/2024/06/02/running-mirador-on.html</link>
      <pubDate>Sun, 02 Jun 2024 22:27:40 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/06/02/running-mirador-on.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve just created a &lt;a href=&#34;https://github.com/wragge/mirador-ghpages&#34;&gt;GitHub repository template&lt;/a&gt; that you can use to get your own &lt;a href=&#34;https://projectmirador.org/&#34;&gt;Mirador&lt;/a&gt; version 3 installation running in minutes. You can also configure it to display local or remote IIIF manifests. I was thinking that it could be useful for researchers who want to create their own customised Mirador workspaces to examine a particular set of documents, but don&amp;rsquo;t want to install any software or fiddle about on the command-line.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-06-01-14-33-34.png&#34; width=&#34;600&#34; height=&#34;367&#34; alt=&#34;Screenshot of Mirador workspace showing two windows displaying manuscript images.&#34;&gt;
&lt;p&gt;I&amp;rsquo;ve been doing a lot with &lt;a href=&#34;https://iiif.io/&#34;&gt;IIIF&lt;/a&gt; lately. First &lt;a href=&#34;https://updates.timsherratt.org/2024/05/15/using-iiif-to.html&#34;&gt;a GLAM Workbench notebook&lt;/a&gt; to save a collection of images from Trove as an IIIF manifest. Then a &lt;a href=&#34;https://updates.timsherratt.org/2024/05/21/trove-to-tropy.html&#34;&gt;tutorial for the Trove Data Guide&lt;/a&gt; that walks through the whole process of generating an IIIF manifest from Trove, then loading the manifest into &lt;a href=&#34;https://tropy.org/&#34;&gt;Tropy&lt;/a&gt; for analysis and annotation. I&amp;rsquo;ve also just about finished another tutorial that saves parts of a manuscript collection as IIIF manifests, then loads them into Mirador for side-by-side comparison.&lt;/p&gt;
&lt;p&gt;In the new tutorial I was planning to use GitHub to put the manifests online so that they could then be displayed using the Mirador demo site. But then I thought if we&amp;rsquo;re saving the manifests to GitHub, why not use GitHub Pages to run a customised version of Mirador that can read the manifests from its own repository? There are various examples around of getting Mirador running on services like Netlify, but I couldn&amp;rsquo;t find a GitHub Pages example. So I decided to make my own.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://github.com/ProjectMirador/mirador-integration&#34;&gt;Mirador integration examples&lt;/a&gt; were useful in getting a basic set-up working. Then it was a matter of creating a GitHub action to generate the Mirador site any time the repository changed. As mentioned, I wanted users to be able to load local manifests into their Mirador workspace, so I wrote a little Python script to check the &lt;code&gt;manifests&lt;/code&gt; directory for files, and copy the paths into the Mirador config. So all a researcher has to do is upload a manifest to the repository, then the deploy scripts rewrite the config and create a new version of the Mirador site automatically.&lt;/p&gt;
&lt;p&gt;Similarly, if you want to include some remote manifests in your default workspace, it&amp;rsquo;s just a matter of adding the urls to the &lt;code&gt;manifest_urls.txt&lt;/code&gt; file. The deploy scripts read the file and add the urls to the config.&lt;/p&gt;
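&lt;p&gt;The deploy step can be sketched in a few lines of Python: scan the &lt;code&gt;manifests&lt;/code&gt; directory, read &lt;code&gt;manifest_urls.txt&lt;/code&gt;, and combine the results into the &lt;code&gt;catalog&lt;/code&gt; list that a Mirador 3 config accepts. This is a simplified illustration rather than the repository&amp;rsquo;s actual script, and the base url is a placeholder:&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def build_catalog(manifest_dir, urls_file, base_url):
    """Combine local manifest files and remote manifest urls into a
    Mirador-style catalog list of {'manifestId': ...} entries."""
    catalog = []
    for path in sorted(Path(manifest_dir).glob("*.json")):
        # Local manifests will be served from the published site itself.
        catalog.append({"manifestId": f"{base_url}/manifests/{path.name}"})
    urls_path = Path(urls_file)
    if urls_path.exists():
        for line in urls_path.read_text().splitlines():
            if line.strip():
                catalog.append({"manifestId": line.strip()})
    return catalog

# Simulate a repository checkout with one local manifest and one remote url.
with tempfile.TemporaryDirectory() as repo:
    manifest_dir = Path(repo) / "manifests"
    manifest_dir.mkdir()
    (manifest_dir / "barton.json").write_text("{}")
    urls_file = Path(repo) / "manifest_urls.txt"
    urls_file.write_text("https://example.org/remote-manifest.json\n")
    catalog = build_catalog(manifest_dir, urls_file, "https://example.github.io/mirador")
```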
&lt;p&gt;I&amp;rsquo;ve included the &lt;a href=&#34;https://github.com/ProjectMirador/mirador-image-tools&#34;&gt;Mirador Image Tools&lt;/a&gt; plugin in the default installation, but you can modify the template to add additional plugins if you want. With Mirador version 4 currently being tested, I&amp;rsquo;ll no doubt create a version 4 template before too long. Let me know if you find this useful!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Commonwealth Hansard XML repository updates</title>
      <link>https://updates.timsherratt.org/2024/05/26/commonwealth-hansard-xml.html</link>
      <pubDate>Sun, 26 May 2024 14:45:51 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/05/26/commonwealth-hansard-xml.html</guid>
<description>&lt;p&gt;Hey Australian Hansard fans, I&amp;rsquo;ve done a complete reharvest of all of the Commonwealth Hansard XML files from 1901 to 1980 from ParlInfo. There have been lots of improvements and corrections, and most of the file names have changed (they now have a version flag). The improvements seem to be ongoing, so I&amp;rsquo;ll try to harvest more regularly from now on. You can download the lot from &lt;a href=&#34;https://github.com/wragge/hansard-xml&#34;&gt;the GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I still need to load the updated XML into the &lt;a href=&#34;http://historichansard.net/&#34;&gt;Historic Hansard&lt;/a&gt; site, but that&amp;rsquo;s going to have to wait for a month or two&amp;hellip;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>More tools for harvesting Trove newspaper articles</title>
      <link>https://updates.timsherratt.org/2024/05/24/more-tools-for.html</link>
      <pubDate>Fri, 24 May 2024 16:10:21 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/05/24/more-tools-for.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve just added a couple of new notebooks to the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper &amp;amp; Gazette Harvester section&lt;/a&gt; of the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/basic-harvester-example/&#34;&gt;Using the Trove Harvester as a Python package&lt;/a&gt; provides a basic example of using the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-harvester/&#34;&gt;trove-newspaper-harvester&lt;/a&gt; Python package. While there&amp;rsquo;s already a simple web app version of the harvester, I wanted a notebook version running in the JupyterLab interface that I could integrate with other tools and notebooks. All you need to do to harvest all the articles in a Trove newspaper search is paste in your Trove API key and the search query url from the Trove web interface.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-05-24-15-05-55.png&#34; width=&#34;600&#34; height=&#34;355&#34; alt=&#34;Screenshot from the Reshaping your newspaper harvest notebook, describing the Harvest Slicer&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/reshaping-harvests/&#34;&gt;Reshaping your newspaper harvest&lt;/a&gt; provides a slice and dice wonder tool for Trove newspaper harvests, enabling you to repackage OCRd text by decade, year, and newspaper title. It saves the results as zip files, concatenated text files, or CSV files with embedded text. These repackaged slices should suit a variety of text analysis tools and questions. I&amp;rsquo;ve been thinking about doing something like this for a while, and think it should be quite useful.&lt;/p&gt;
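&lt;p&gt;The slicing itself is a straightforward group-and-concatenate operation. A minimal sketch of the by-year case, assuming each harvested article carries a date string and its OCRd text (illustrative field names, not necessarily those in the harvester&amp;rsquo;s output):&lt;/p&gt;

```python
from collections import defaultdict

def slice_by_year(articles):
    """Group OCRd text by publication year, concatenating each
    year's articles into a single blob of text."""
    groups = defaultdict(list)
    for article in articles:
        year = article["date"][:4]  # dates like '1914-08-05'
        groups[year].append(article["text"])
    return {year: "\n\n".join(texts) for year, texts in groups.items()}

# A toy harvest spanning two years.
articles = [
    {"date": "1914-08-05", "text": "War declared."},
    {"date": "1914-11-02", "text": "Troops depart."},
    {"date": "1915-04-26", "text": "Landing reported."},
]
slices = slice_by_year(articles)
```

The same grouping key could just as easily be the decade (`article["date"][:3] + "0s"`) or the newspaper title.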
&lt;p&gt;In my usual way, I started off writing a tutorial for the Trove Data Guide on ways of loading digitised newspaper data into text analysis tools and then realised I needed these notebooks to fill in some gaps in the data processing pipeline. So after a day or two of yak shaving I now have to get back to the tutorial.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Trove to Tropy via IIIF – documenting data pathways in the Trove Data Guide</title>
      <link>https://updates.timsherratt.org/2024/05/21/trove-to-tropy.html</link>
      <pubDate>Tue, 21 May 2024 23:16:16 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/05/21/trove-to-tropy.html</guid>
      <description>&lt;p&gt;Last week I &lt;a href=&#34;https://updates.timsherratt.org/2024/05/15/using-iiif-to.html&#34;&gt;added a notebook to the GLAM Workbench&lt;/a&gt; that saves a collection of images from Trove as an IIIF manifest. This week I&amp;rsquo;ve &lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/images/tropy.html&#34;&gt;written a tutorial&lt;/a&gt; that shows how you can use the notebook to load the collection data in &lt;a href=&#34;https://tropy.org/&#34;&gt;Tropy&lt;/a&gt; – a desktop tool for managing and annotating images for research.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/tropy-interface.png&#34; width=&#34;600&#34; height=&#34;409&#34; alt=&#34;The Tropy interface showing photographs imported from the B.A.N.Z. Antarctic Research Expedition photographs (https://nla.gov.au/nla.obj-141170265).&#34;&gt;
&lt;p&gt;This is the &lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/images/tropy.html&#34;&gt;first tutorial&lt;/a&gt; in the &lt;em&gt;Trove Data Guide&lt;/em&gt;&amp;rsquo;s &lt;a href=&#34;https://updates.timsherratt.org/2024/04/18/what-do-you.html&#34;&gt;Research Pathways section&lt;/a&gt;. While most of the TDG documents the types of data available in Trove and how you can access it, the pathways aim to connect Trove data with other tools and platforms – to point at possibilities for analysis, enrichment, and sharing. For example, I&amp;rsquo;m planning tutorials on packaging OCRd text from Trove for use with text analysis tools, as well as ways of sharing selected data through tools like CollectionBuilder and Datasette.&lt;/p&gt;
&lt;p&gt;If you have any ideas for additional tutorials, feel free to add them to the &lt;a href=&#34;https://github.com/wragge/trove-data-guide/discussions/categories/ideas&#34;&gt;ideas board&lt;/a&gt; in GitHub.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Using IIIF to explore Trove&#39;s digitised images</title>
      <link>https://updates.timsherratt.org/2024/05/15/using-iiif-to.html</link>
      <pubDate>Thu, 16 May 2024 00:24:24 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/05/15/using-iiif-to.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve just added &lt;a href=&#34;https://glam-workbench.net/trove-images/save-image-collection-iiif/&#34;&gt;a new notebook&lt;/a&gt; to the Trove images section of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;. It helps you save a collection of digitised images as an IIIF manifest. But what does that mean? It means the notebook packages up all the metadata describing the images in a standard form that can be used with a variety of IIIF-compliant tools. These tools let you do things with the collections that you can&amp;rsquo;t do in Trove&amp;rsquo;s own interface.&lt;/p&gt;
&lt;p&gt;Perhaps you&amp;rsquo;d like to browse the complete digitised contents of &lt;a href=&#34;https://nla.gov.au/nla.obj-224441684&#34;&gt;Sir Edmund Barton&amp;rsquo;s manuscript collection&lt;/a&gt;, without all the back and forth and up and down navigation imposed by the Trove web interface. Here you go, thanks to IIIF!&lt;/p&gt;
&lt;iframe src=&#34;https://uv-v3.netlify.app/uv/uv.html#?manifest=https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-224441684-v3-manifest.json&amp;c=0&amp;m=0&amp;s=0&amp;cv=0&#34; width=&#34;100%&#34; height=&#34;600&#34; allowfullscreen frameborder=&#34;0&#34;&gt;&lt;/iframe&gt;
&lt;h2 id=&#34;whats-iiif&#34;&gt;What&amp;rsquo;s IIIF?&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;https://iiif.io/&#34;&gt;International Image Interoperability Framework&lt;/a&gt;, more conveniently known as IIIF, develops open standards for sharing digital objects, such as images. IIIF platforms and standards are used by GLAM organisations around the world to deliver their image collections online.&lt;/p&gt;
&lt;p&gt;Once you have standards for sharing image metadata, people can build tools that work across collections. For example, &lt;a href=&#34;https://universalviewer.io/&#34;&gt;Universal Viewer&lt;/a&gt; and &lt;a href=&#34;https://projectmirador.org/&#34;&gt;Mirador&lt;/a&gt; are both richly featured, open source, community developed image viewing platforms.&lt;/p&gt;
&lt;p&gt;IIIF manifests are JSON files that describe a set of digital objects. They include technical information about the images and how to access them, as well as metadata describing their content and context. Everything you need to explore the images is packaged up in a single, standards-based file. This means that if you point a manifest at tools like Universal Viewer and Mirador, they can present the images to users without any special configuration.&lt;/p&gt;
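&lt;p&gt;To make that concrete, here is a sketch of a minimal IIIF Presentation 3.0 manifest built in Python: one Canvas per image, each painted by a single image Annotation. The urls and dimensions are placeholders:&lt;/p&gt;

```python
import json

def make_manifest(base_id, label, images):
    """Build a minimal IIIF Presentation 3.0 manifest: one Canvas per
    image, each painted by a single image Annotation."""
    canvases = []
    for i, (url, width, height) in enumerate(images):
        canvas_id = f"{base_id}/canvas/{i}"
        canvases.append({
            "id": canvas_id,
            "type": "Canvas",
            "width": width,
            "height": height,
            "items": [{
                "id": f"{canvas_id}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/annotation",
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {"id": url, "type": "Image", "format": "image/jpeg"},
                    "target": canvas_id,
                }],
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_id}/manifest.json",
        "type": "Manifest",
        "label": {"en": [label]},
        "items": canvases,
    }

manifest = make_manifest(
    "https://example.org/iiif",
    "A tiny demonstration collection",
    [("https://example.org/images/page-1.jpg", 1000, 1400)],
)
manifest_json = json.dumps(manifest, indent=2)
```

A real manifest would usually add more descriptive metadata and point image bodies at a IIIF Image API service, but this is enough for a viewer to display.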
&lt;h2 id=&#34;why-is-the-new-notebook-needed&#34;&gt;Why is the new notebook needed?&lt;/h2&gt;
&lt;p&gt;Unfortunately Trove doesn&amp;rsquo;t provide data using IIIF standards. Indeed, it doesn&amp;rsquo;t really supply &lt;em&gt;any&lt;/em&gt; machine-readable data about the contents of digital collections. The notebook scrapes metadata from Trove&amp;rsquo;s digital collection viewer, reassembling it as a standard IIIF manifest.&lt;/p&gt;
&lt;h2 id=&#34;whats-possible&#34;&gt;What&amp;rsquo;s possible?&lt;/h2&gt;
&lt;p&gt;Here are a few more examples of manifests created with the new notebook.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Trove collection&lt;/th&gt;
&lt;th&gt;manifest&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://nla.gov.au/nla.obj-141170265&#34;&gt;B.A.N.Z. Antarctic Research Expedition photographs&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-141170265-v3-manifest.json&#34;&gt;nla.obj-141170265-v3-manifest.json&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://uv-v3.netlify.app/#?c=&amp;amp;m=&amp;amp;s=&amp;amp;cv=54&amp;amp;manifest=https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-141170265-v3-manifest.json&#34;&gt;view in UV3&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://projectmirador.org/embed/?iiif-content=https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-141170265-v3-manifest.json&#34;&gt;view in Mirador&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://nla.gov.au/nla.obj-224441684&#34;&gt;The Papers of Sir Edmund Barton&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-224441684-v3-manifest.json&#34;&gt;nla.obj-224441684-v3-manifest.json&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://uv-v3.netlify.app/#?c=&amp;amp;m=&amp;amp;s=&amp;amp;cv=54&amp;amp;manifest=https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-224441684-v3-manifest.json&#34;&gt;view in UV3&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://projectmirador.org/embed/?iiif-content=https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-224441684-v3-manifest.json&#34;&gt;view in Mirador&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://nla.gov.au/nla.obj-224441858&#34;&gt;Papers relating to the Federation Campaign (a single series from the Barton papers)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-224441858-v3-manifest.json&#34;&gt;nla.obj-224441858-v3-manifest.json&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://uv-v3.netlify.app/#?c=&amp;amp;m=&amp;amp;s=&amp;amp;cv=54&amp;amp;manifest=https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-224441858-v3-manifest.json&#34;&gt;view in UV3&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://projectmirador.org/embed/?iiif-content=https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-224441858-v3-manifest.json&#34;&gt;view in Mirador&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href=&#34;https://nla.gov.au/nla.obj-140670968&#34;&gt;Postcard portraits of actresses, and of Australian towns, 1900s&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-140670968-v3-manifest.json&#34;&gt;nla.obj-140670968-v3-manifest.json&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://uv-v3.netlify.app/#?c=&amp;amp;m=&amp;amp;s=&amp;amp;cv=54&amp;amp;manifest=https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-140670968-v3-manifest.json&#34;&gt;view in UV3&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://projectmirador.org/embed/?iiif-content=https://raw.githubusercontent.com/wragge/iiif-tests/main/nla.obj-140670968-v3-manifest.json&#34;&gt;view in Mirador&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;I&amp;rsquo;m just linking to the standard demonstration versions of Universal Viewer and Mirador, but they can be extended with plugins that add more functionality, such as annotation. As demonstrated above, it&amp;rsquo;s also easy to embed the viewers in your own website.&lt;/p&gt;
&lt;p&gt;Manifests can also be used to move data between programs. &lt;a href=&#34;https://tropy.org/&#34;&gt;Tropy&lt;/a&gt;, for example, is a desktop tool for managing collections of images for research. It can import images and metadata from an IIIF manifest. So if you want to add your own notes and transcriptions to a digitised manuscript collection in Trove, you could save it as an IIIF manifest, then load it in Tropy.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m just about to start documenting some of these possibilities and pathways for the &lt;a href=&#34;https://wragge.github.io/trove-data-guide/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;one-more-thing&#34;&gt;One more thing&lt;/h2&gt;
&lt;p&gt;While I was developing the notebook, I noticed another inconsistency in the way Trove&amp;rsquo;s digitised collections are arranged. This meant that in some cases my notebook to &lt;a href=&#34;https://glam-workbench.net/trove-images/download-image-collection/&#34;&gt;download all the images from a collection&lt;/a&gt; might not get every image. I&amp;rsquo;ve made some changes that should fix this and uploaded a new version of the notebook.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Using Pandora&#39;s collection of archived websites</title>
      <link>https://updates.timsherratt.org/2024/05/07/using-pandoras-collection.html</link>
      <pubDate>Tue, 07 May 2024 17:23:39 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/05/07/using-pandoras-collection.html</guid>
      <description>&lt;p&gt;There&amp;rsquo;s a brand &lt;a href=&#34;https://glam-workbench.net/trove-web-archives/&#34;&gt;new section&lt;/a&gt; of the GLAM Workbench to help you use data from Pandora&amp;rsquo;s collection of archived websites.&lt;/p&gt;
&lt;h2 id=&#34;whats-pandora&#34;&gt;What&amp;rsquo;s Pandora?&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;http://pandora.nla.gov.au/&#34;&gt;Pandora&lt;/a&gt; is an initiative of the National Library of Australia which has been selecting web sites and online resources for preservation since 1996. It&amp;rsquo;s assembled a collection of more than 80,000 archived website titles, organised into subjects and collections. The archived websites are now part of the Australian Web Archive (AWA), which combines the selected titles with broader domain harvests, and is searchable through Trove.&lt;/p&gt;
&lt;h2 id=&#34;why-is-this-needed&#34;&gt;Why is this needed?&lt;/h2&gt;
&lt;p&gt;The GLAM Workbench already has a &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web Archives section&lt;/a&gt; that provides documentation, tools, and examples to help you work with data from a range of web archives, including the Australian Web Archive. But Pandora is unique to the NLA, and its curated collections offer a useful entry point for researchers trying to find web sites relating to particular topics or events.&lt;/p&gt;
&lt;p&gt;Imagine you&amp;rsquo;re a researcher examining Australian election campaigns over the last decade. One thing you might want to do is analyse the language of campaign web sites. But how would you find them? You can search across the full text content of the Australian Web Archive in Trove, but you&amp;rsquo;d need some way of filtering the results to find sites of interest. Or you could just go to the &lt;a href=&#34;http://pandora.nla.gov.au/subject/6&#34;&gt;Elections&lt;/a&gt; category in Pandora.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-05-07-15-06-38.png&#34; width=&#34;600&#34; height=&#34;336&#34; alt=&#34;Screenshot of Pandora&#39;s listing of collections related to election campaigns&#34;&gt;
&lt;p&gt;Full-text search is great for some tasks, but carefully-curated collections with good-quality metadata can save us a lot of time and effort. Unfortunately, the Trove web interface prioritises search over Pandora&amp;rsquo;s collection metadata. If you head to Trove&amp;rsquo;s &amp;lsquo;Categories&amp;rsquo; tab, you&amp;rsquo;ll find a link to &lt;a href=&#34;https://webarchive.nla.gov.au/collection&#34;&gt;Archived Webpage Collections&lt;/a&gt;. This collection hierarchy is basically the same as Pandora&amp;rsquo;s – combining Pandora&amp;rsquo;s subjects, subcategories, and collections into a single hierarchical structure. However, it only includes links to titles that are part of collections. This is important, as less than half of Pandora&amp;rsquo;s selected titles seem to be assigned to collections. Even stranger is the fact that I can&amp;rsquo;t find any link in Trove to the main Pandora site. This means that most researchers using the Australian Web Archive through Trove probably don&amp;rsquo;t even know that Pandora&amp;rsquo;s subject groupings exist!&lt;/p&gt;
&lt;p&gt;For more on Pandora&amp;rsquo;s approach to describing collections see &lt;a href=&#34;https://doi.org/10.48550/arXiv.2209.08649&#34;&gt;Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;whats-in-the-new-section&#34;&gt;What&amp;rsquo;s in the new section?&lt;/h2&gt;
&lt;p&gt;The new &lt;a href=&#34;https://glam-workbench.net/trove-web-archives/&#34;&gt;Trove web archive collections (Pandora)&lt;/a&gt; section of the GLAM Workbench includes three notebooks and three datasets.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The main aim is to help researchers assemble datasets of archived website urls based on Pandora&amp;rsquo;s subject groupings.&lt;/strong&gt; So if, as described above, you want a list of websites associated with election campaigns, you can go to the &lt;a href=&#34;https://glam-workbench.net/trove-web-archives/create-datasets/&#34;&gt;Create title datasets from collections and subjects&lt;/a&gt; notebook and generate a CSV file containing 9,304 urls. Easy! The notebook also generates an &lt;a href=&#34;https://www.researchobject.org/ro-crate/&#34;&gt;RO-Crate&lt;/a&gt; metadata file capturing the context in which your dataset was created.&lt;/p&gt;
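&lt;p&gt;Conceptually, generating such a dataset is a filter over the harvested titles followed by a CSV export. A toy sketch (the real datasets link titles to subjects via collection identifiers; here that&amp;rsquo;s flattened to a simple subjects list per title):&lt;/p&gt;

```python
import csv
import io

def titles_for_subject(titles, subject):
    """Select archived titles assigned to a given Pandora subject."""
    return [t for t in titles if subject in t["subjects"]]

def urls_to_csv(titles):
    """Write the selected titles out as a two-column CSV."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["name", "url"])
    for t in titles:
        writer.writerow([t["name"], t["url"]])
    return out.getvalue()

# A toy titles dataset with made-up names and urls.
titles = [
    {"name": "Campaign site A", "url": "http://example.org/a", "subjects": ["Elections"]},
    {"name": "Gardening blog", "url": "http://example.org/b", "subjects": ["Gardening"]},
]
elections = titles_for_subject(titles, "Elections")
csv_data = urls_to_csv(elections)
```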
&lt;p&gt;To make all that possible, I&amp;rsquo;ve harvested Pandora&amp;rsquo;s complete subject hierarchy and list of titles. The code to do this is in these notebooks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-web-archives/harvest-pandora-subject-collections/&#34;&gt;Harvest Pandora subjects and collections&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-web-archives/harvest-pandora-titles/&#34;&gt;Harvest the full collection of Pandora titles&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And pre-harvested datasets of subjects, collections, and titles are here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-web-archives/pandora-collections-data/&#34;&gt;Pandora collections data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-web-archives/pandora-titles-data/&#34;&gt;Pandora titles data&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To give an overview of Pandora&amp;rsquo;s subject organisation, I&amp;rsquo;ve also created a &lt;a href=&#34;https://glam-workbench.net/trove-web-archives/pandora-subject-hierarchy/&#34;&gt;single-page view of the complete hierarchy&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;what-do-you-do-with-a-dataset-of-archived-website-urls&#34;&gt;What do you do with a dataset of archived website urls?&lt;/h2&gt;
&lt;p&gt;Once you have your own dataset of archived urls you can make use of the tools in the &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web Archives&lt;/a&gt; section to gather more data for analysis. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/get-all-versions/&#34;&gt;Find all the archived versions of a web page using Timemaps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/display-changes-in-text/&#34;&gt;Display changes in the text of an archived web page over time&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/harvesting-text/&#34;&gt;Harvesting collections of text from archived web pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/create-screenshots-over-time/&#34;&gt;Using screenshots to visualise change in a page over time&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
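&lt;p&gt;The first of those steps, finding all the archived versions of a page, usually means requesting a Memento TimeMap and pulling out its memento entries. A rough parser for the standard link format (RFC 7089), assuming one entry per line; real responses may wrap entries differently, and the archive urls below are made up:&lt;/p&gt;

```python
def parse_timemap(text):
    """Extract (datetime, url) pairs for each 'memento' entry in a
    link-format TimeMap."""
    mementos = []
    for entry in text.splitlines():
        if 'rel="memento"' not in entry:
            continue
        # Each entry looks like: <url>; rel="memento"; datetime="..."
        url = entry.split("<", 1)[1].split(">", 1)[0]
        dt = entry.split('datetime="', 1)[1].split('"', 1)[0]
        mementos.append((dt, url))
    return mementos

# A tiny sample TimeMap, as a Memento endpoint might return it.
sample = '''<http://example.org/>; rel="original",
<http://archive.example/20200101000000/http://example.org/>; rel="memento"; datetime="Wed, 01 Jan 2020 00:00:00 GMT",
<http://archive.example/20210101000000/http://example.org/>; rel="memento"; datetime="Fri, 01 Jan 2021 00:00:00 GMT"'''
versions = parse_timemap(sample)
```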
&lt;p&gt;I&amp;rsquo;m hoping to explore some more of these possibilities in the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;, part of the &lt;a href=&#34;https://ardc.edu.au/project/ardc-community-data-lab/&#34;&gt;ARDC&amp;rsquo;s Community Data Lab&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How to download all the images from a digitised collection in Trove (&amp; learn some cool Trove tricks)</title>
      <link>https://updates.timsherratt.org/2024/04/24/how-to-download.html</link>
      <pubDate>Thu, 25 Apr 2024 00:35:50 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/04/24/how-to-download.html</guid>
      <description>&lt;p&gt;Digitised resources in Trove are sometimes grouped into collections – an album of photographs, a set of posters, a bundle of letters. I&amp;rsquo;ve just added &lt;a href=&#34;https://glam-workbench.net/trove-images/download-image-collection/&#34;&gt;a notebook&lt;/a&gt; to the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; that downloads all the images in a collection at the highest available resolution.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-04-24-23-29-35.png&#34; width=&#34;600&#34; height=&#34;345&#34; alt=&#34;Screen capture of file browser showing colourful thumbnails of a harvested collection of posters&#34;&gt;
&lt;p&gt;&lt;em&gt;A sample of the 3,048 posters downloaded from &lt;a href=&#34;https://nla.gov.au/nla.obj-2590804313&#34;&gt;nla.obj-2590804313&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&#34;why-is-it-necessary&#34;&gt;Why is it necessary?&lt;/h2&gt;
&lt;p&gt;Trove&amp;rsquo;s digitised collection viewer includes a download option. But in most cases that seems to be limited to downloading 20 images at a time. Part of the reason is probably that the images are zipped up into a single file, which could get very large if 100 or 200 images were included.&lt;/p&gt;
&lt;p&gt;Another limitation of the built-in download option is that the images are often fairly low resolution copies (many have a maximum width of 1000px). The quality of the images limits how you can use them.&lt;/p&gt;
&lt;p&gt;For many research purposes, you&amp;rsquo;ll want a complete collection at the highest resolution possible. Trove makes that difficult. The new notebook makes it easy.&lt;/p&gt;
&lt;h2 id=&#34;how-does-it-work&#34;&gt;How does it work?&lt;/h2&gt;
&lt;p&gt;The Trove API is not much help. The only collections it knows about are articles in newspapers, and issues in periodicals. However, when you browse through a collection using the digitised collection viewer, a little internal API is called that delivers an HTML list of the next 20 items. By stepping through the collection page by page, you can eventually harvest details of all the items in a collection.&lt;/p&gt;
&lt;p&gt;Once you have the &lt;code&gt;nla.obj&lt;/code&gt; identifiers for each image, you can download a high-resolution version simply by adding &lt;code&gt;/image&lt;/code&gt; to the url. Downloaded this way, the images generally seem to have a longest dimension of 5000px – a considerable improvement!&lt;/p&gt;
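&lt;p&gt;Both tricks can be sketched in a few lines of Python. The &lt;code&gt;browse&lt;/code&gt; endpoint and its parameters below are undocumented assumptions, inferred from watching the requests the collection viewer makes – only the &lt;code&gt;/image&lt;/code&gt; trick is straightforward.&lt;/p&gt;

```python
import re
import requests

def image_url(obj_id):
    # Adding /image to an nla.obj url returns a high-resolution copy
    return f"https://nla.gov.au/{obj_id}/image"

def list_collection_items(collection_id):
    """Step through a digitised collection 20 items at a time.

    The 'browse' endpoint and its parameters are assumptions inferred
    from the requests made by Trove's collection viewer.
    """
    items = []
    start = 0
    while True:
        response = requests.get(
            f"https://nla.gov.au/{collection_id}/browse",
            params={"startIdx": start, "rows": 20, "op": "c"},
        )
        # The response is an HTML fragment; pull out the item identifiers
        new_ids = re.findall(r"nla\.obj-\d+", response.text)
        if not new_ids:
            break
        items.extend(new_ids)
        start += 20
    return items
```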
&lt;p&gt;As well as downloading all the images, the notebook also generates an &lt;a href=&#34;https://www.researchobject.org/ro-crate/&#34;&gt;RO-Crate&lt;/a&gt; metadata file that describes the context of your harvest – when it was run, what collection you downloaded, what notebook you used, as well as the details of each image. This&amp;rsquo;ll help you in the future when you&amp;rsquo;ve forgotten where all the images came from!&lt;/p&gt;
&lt;h2 id=&#34;where-can-i-learn-all-these-trove-tricks&#34;&gt;Where can I learn all these Trove tricks?&lt;/h2&gt;
&lt;p&gt;This new notebook came about because I was documenting the method for harvesting collections in the &lt;em&gt;Trove Data Guide&lt;/em&gt;. I realised that I needed to adapt my original code to work with complex collections that included multiple layers of nested sub-collections (like manuscript finding aids). Having done that, I thought it would be useful to create a working example in the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;Trove Data Guide&lt;/em&gt; is being developed as part of the &lt;a href=&#34;https://ardc.edu.au/project/ardc-community-data-lab/&#34;&gt;ARDC&amp;rsquo;s Community Data Lab&lt;/a&gt;. In it I&amp;rsquo;m trying to document as many of these problems and workarounds as I can to open Trove data to new research uses. Here, for example, is the &lt;a href=&#34;https://wragge.github.io/trove-data-guide/other-digitised-resources/how-to/get-collection-items.html&#34;&gt;page that discusses how to get a list of collection items&lt;/a&gt;, and here are &lt;a href=&#34;https://wragge.github.io/trove-data-guide/accessing-data/how-to/download-higher-resolution-images.html&#34;&gt;some suggestions for downloading high-resolution images&lt;/a&gt;. One of my favourite discoveries from recent weeks was the internal API that &lt;a href=&#34;https://wragge.github.io/trove-data-guide/other-digitised-resources/how-to/get-ocr-layout-data.html&#34;&gt;delivers OCR layout information about book and periodical pages&lt;/a&gt; – find out how to save illustrations from books, visualise the layout of a periodical page, and even create some #redactionart poetry.&lt;/p&gt;
&lt;p&gt;The content of the &lt;em&gt;Trove Data Guide&lt;/em&gt; is changing and developing all the time. It&amp;rsquo;s all happening in the open, so feel free to explore the &lt;a href=&#34;https://wragge.github.io/trove-data-guide/home.html&#34;&gt;bleeding-edge development version&lt;/a&gt; or the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;latest published version&lt;/a&gt;. If there&amp;rsquo;s something you&amp;rsquo;d like to see, please post it on the &lt;a href=&#34;https://github.com/wragge/trove-data-guide/discussions/categories/ideas&#34;&gt;GitHub ideas board&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;where-is-the-new-notebook&#34;&gt;Where is the new notebook?&lt;/h2&gt;
&lt;p&gt;The new notebook is part of the &lt;a href=&#34;https://glam-workbench.net/trove-images/&#34;&gt;Trove images&lt;/a&gt; section in the GLAM Workbench. Because I was adding a new notebook, I also took the chance to update the whole repository. Changes include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;now using Trove API v3&lt;/li&gt;
&lt;li&gt;now using Python 3.10&lt;/li&gt;
&lt;li&gt;updated Python packages&lt;/li&gt;
&lt;li&gt;now includes an &lt;code&gt;ro-crate-metadata.json&lt;/code&gt; containing machine-readable metadata describing this repository&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I also updated &lt;a href=&#34;https://glam-workbench.net/trove-images/trove-images-rights-data/&#34;&gt;the datasets&lt;/a&gt; that provide information about the application of licences and rights statements to images by Trove contributors and moved them to &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-images-rights-data&#34;&gt;their own GitHub repository&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>What do you want to do with Trove data?</title>
      <link>https://updates.timsherratt.org/2024/04/18/what-do-you.html</link>
      <pubDate>Thu, 18 Apr 2024 17:22:39 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/04/18/what-do-you.html</guid>
      <description>&lt;p&gt;In my work on the Trove Data Guide I&amp;rsquo;ve started sketching out a series of &lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/index.html&#34;&gt;research pathways&lt;/a&gt;. These are intended as ways of connecting Trove data to tools and questions – providing examples of the steps involved in gathering, preparing, and using data to explore particular research topics.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve currently defined six pathways, roughly based on different types of data that you can get from Trove:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/text.html&#34;&gt;Text&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/images.html&#34;&gt;Images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/structured-data.html&#34;&gt;Structured data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/geospatial.html&#34;&gt;Maps and places&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/networks.html&#34;&gt;Networks and relationships&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/trove-data-guide/pathways/collections.html&#34;&gt;Creating collections&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&amp;lsquo;Creating collections&amp;rsquo; is a bit different, I suppose, as it&amp;rsquo;s meant to relate to the work of assembling research collections &lt;em&gt;from&lt;/em&gt; data in Trove – for example, creating a collection of annotated newspaper articles in Omeka.&lt;/p&gt;
&lt;p&gt;I have some ideas, of course, about the types of tutorials and examples to include in each pathway, but I&amp;rsquo;m wondering what you would like to see. &lt;strong&gt;What would you like to be able to do with Trove data?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;You might get some inspiration by browsing through what&amp;rsquo;s already in the &lt;a href=&#34;https://wragge.github.io/trove-data-guide/home.html&#34;&gt;Trove Data Guide&lt;/a&gt; and the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;, or perhaps you have a research question that&amp;rsquo;s foundered because you couldn&amp;rsquo;t get the data you needed out of Trove. If you have any ideas please share them via the &lt;a href=&#34;https://github.com/wragge/trove-data-guide/discussions/categories/ideas&#34;&gt;TDG&amp;rsquo;s ideas board&lt;/a&gt;. This is a chance to get some of your gnarly Trove data problems solved!&lt;/p&gt;
&lt;p&gt;Note that the TDG links in this post go to the &lt;a href=&#34;https://wragge.github.io/trove-data-guide/home.html&#34;&gt;development version&lt;/a&gt;, which changes frequently. There is also a &lt;a href=&#34;http://tdg.glamworkbench.cloud.edu.au/&#34;&gt;published version&lt;/a&gt; that doesn&amp;rsquo;t include the latest content.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Update! Saving Trove newspaper articles and pages as images</title>
      <link>https://updates.timsherratt.org/2024/04/18/update-saving-trove.html</link>
      <pubDate>Thu, 18 Apr 2024 13:32:50 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/04/18/update-saving-trove.html</guid>
      <description>&lt;p&gt;You probably know that when you select the &lt;strong&gt;Download as Image&lt;/strong&gt; option for a digitised newspaper article in Trove what you get back is not actually an image ­– it&amp;rsquo;s an HTML document, in which the original image has been sliced up to try and fit on an A4 page when printed. So &lt;a href=&#34;https://trove.nla.gov.au/newspaper/article/203234114&#34;&gt;this article&lt;/a&gt;:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/nla.news-article203234114-22060404.jpg&#34; width=&#34;600&#34; height=&#34;766&#34; alt=&#34;Image of newspaper article as it appeared in the published newspaper, with the central column extending well below the main body of the article&#34;&gt;
&lt;p&gt;Ends up looking like this!!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-04-18-11-22-11.png&#34; width=&#34;600&#34; height=&#34;1563&#34; alt=&#34;Image of the results of using Trove&#39;s download article as image option -- the content of the article has been sliced and jumbled making it very difficult to understand how it originally appeared&#34;&gt;
&lt;p&gt;So what do you do when you just want an image of an article as it appeared in the newspaper? Some years ago I figured out a workaround that involves scraping the OCR positional data that&amp;rsquo;s embedded in Trove&amp;rsquo;s newspaper viewer and cropping the article from a high-resolution image of the page. The method is &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Save-Trove-newspaper-article-as-image/&#34;&gt;documented&lt;/a&gt; in the GLAM Workbench and the &lt;a href=&#34;https://tdg.glam-workbench.net/newspapers-and-gazettes/how-to/get-ocr-coordinates.html&#34;&gt;Trove Data Guide&lt;/a&gt;, and I&amp;rsquo;ve packaged up the code in &lt;a href=&#34;https://wragge.github.io/trove_newspaper_images/&#34;&gt;trove-newspapers-images&lt;/a&gt; so you can embed it in your own projects.&lt;/p&gt;
&lt;p&gt;I also created a web app (using Jupyter and Voilà) to make it as simple as possible for people to download images of articles. Unlike most of the other notebooks in the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; which are spun up on demand, this web app was hosted on a constantly-running server. This made it faster to start and use, but it was relatively expensive, wasteful, and difficult to maintain. So I decided to make a change!&lt;/p&gt;
&lt;p&gt;The new version of the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Save-Trove-newspaper-article-as-image-app/&#34;&gt;Save Trove newspaper article as image web app&lt;/a&gt; is actually embedded within the GLAM Workbench. Behind the scenes, the page calls an AWS Lambda function which uses &lt;a href=&#34;https://wragge.github.io/trove_newspaper_images/&#34;&gt;trove-newspapers-images&lt;/a&gt; to generate the image. So far it seems to be working pretty well. &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Save-Trove-newspaper-article-as-image-app/&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Save-Trove-newspaper-article-as-image-app/&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-04-18-12-08-39.png&#34; width=&#34;600&#34; height=&#34;410&#34; alt=&#34;Screen capture of the web app from the GLAM Workbench&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Even better, I&amp;rsquo;ve made some changes to the image generation code to give users the option of masking the articles. The original version crops a rectangle from the page using the article coordinates. If an article extends over multiple columns with different lengths, the image will include content from neighbouring articles. It&amp;rsquo;s not a big problem, but it always annoyed me. Recently I realised that the solution was quite simple – instead of cropping one big box from the page, you can crop each individual OCR &amp;lsquo;zone&amp;rsquo; and paste them into a new empty image with the same dimensions as the original. Once you&amp;rsquo;ve pasted all the zones, you crop the new page image using the article coordinates. Here&amp;rsquo;s an example of &lt;a href=&#34;http://nla.gov.au/nla.news-article255909273&#34;&gt;an article&lt;/a&gt; without masking:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/cell-9-output-1.jpeg&#34; width=&#34;600&#34; height=&#34;880&#34; alt=&#34;&#34;&gt;
&lt;p&gt;And the same article &lt;em&gt;with&lt;/em&gt; masking:&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/cell-10-output-1.jpeg&#34; width=&#34;600&#34; height=&#34;880&#34; alt=&#34;&#34;&gt;
&lt;p&gt;This enhancement has been pushed to the &lt;a href=&#34;https://wragge.github.io/trove_newspaper_images/&#34;&gt;trove-newspapers-images&lt;/a&gt; package, and is available through the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Save-Trove-newspaper-article-as-image-app/&#34;&gt;web app&lt;/a&gt; by simply checking the &amp;lsquo;mask image&amp;rsquo; option.&lt;/p&gt;
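&lt;p&gt;The masking technique is easy to reproduce with Pillow. This is a simplified sketch – the real implementation in &lt;code&gt;trove-newspapers-images&lt;/code&gt; also handles scraping the zone coordinates for you.&lt;/p&gt;

```python
from PIL import Image

def mask_article(page, zones, article_box):
    """Paste each OCR zone onto a blank page of the same size,
    then crop the article's bounding box from the result."""
    masked = Image.new("RGB", page.size, "white")
    for left, top, right, bottom in zones:
        zone = page.crop((left, top, right, bottom))
        masked.paste(zone, (left, top))
    return masked.crop(article_box)
```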
&lt;p&gt;Another frustrating feature of the Trove web interface is that there&amp;rsquo;s no way of saving a newspaper &lt;em&gt;page&lt;/em&gt; as an image, only as a PDF. In this case the workaround is pretty simple: you just need to know the url pattern used to download page images. This is &lt;a href=&#34;https://tdg.glam-workbench.net/newspapers-and-gazettes/data/pages.html#download-a-page-image&#34;&gt;documented&lt;/a&gt; in the &lt;em&gt;Trove Data Guide&lt;/em&gt;. Once again, I&amp;rsquo;ve been providing a web app to make this easy for users, and once again I&amp;rsquo;ve just updated it so that it&amp;rsquo;s embedded within the GLAM Workbench itself. &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Save-page-image/&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Save-page-image/&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-04-18-12-19-01.png&#34; width=&#34;600&#34; height=&#34;390&#34; alt=&#34;&#34;&gt;&lt;/a&gt;&lt;/p&gt;
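&lt;p&gt;For reference, the url pattern itself is simple enough to sketch. The format and the meaning of the zoom levels here are drawn from the &lt;em&gt;Trove Data Guide&lt;/em&gt; page linked above – treat the details as assumptions and check the guide.&lt;/p&gt;

```python
def page_image_url(page_id, level=7):
    """Url for a newspaper page image.

    'level' sets the resolution, from 1 (smallest) to 7 (largest);
    the pattern follows the Trove Data Guide's notes on page images.
    """
    return f"https://trove.nla.gov.au/ndp/imageservice/nla.news-page{page_id}/level{level}"
```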
</description>
    </item>
    
    <item>
      <title>Getting to know NED – born-digital periodicals in Trove</title>
      <link>https://updates.timsherratt.org/2024/04/10/getting-to-know.html</link>
      <pubDate>Thu, 11 Apr 2024 00:05:01 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/04/10/getting-to-know.html</guid>
      <description>&lt;p&gt;I spend a lot of my time trying to highlight the wealth of resources available through Trove – whether that&amp;rsquo;s &lt;a href=&#34;https://updates.timsherratt.org/2024/02/27/new-glam-workbench.html&#34;&gt;25,000 digitised Parliamentary Papers&lt;/a&gt;, &lt;a href=&#34;https://updates.timsherratt.org/2024/01/04/exploring-oral-histories.html&#34;&gt;6,000 oral histories you can listen to online&lt;/a&gt;, or &lt;a href=&#34;https://updates.timsherratt.org/2024/03/19/a-new-way.html&#34;&gt;3,471 full-page editorial cartoons from &lt;em&gt;The Bulletin&lt;/em&gt;&lt;/a&gt;. Most recently I&amp;rsquo;ve been working on &lt;a href=&#34;https://updates.timsherratt.org/2024/03/26/more-tools-and.html&#34;&gt;digitised periodicals&lt;/a&gt;, developing a &lt;a href=&#34;https://wragge.github.io/trove-data-guide/other-digitised-resources/periodicals/overview.html&#34;&gt;new section&lt;/a&gt; for the &lt;em&gt;Trove Data Guide&lt;/em&gt;. But as I was harvesting data about the 900 periodicals and 37,000 issues that had so far been digitised, I wondered about periodicals that were &lt;em&gt;born digital&lt;/em&gt; – in particular, those that had been submitted to the National Library by publishers and authors through the &lt;a href=&#34;https://ned.gov.au/ned/&#34;&gt;National eDeposit Scheme&lt;/a&gt; (NED). It turns out, there&amp;rsquo;s a lot more than I realised.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve added &lt;a href=&#34;https://glam-workbench.net/trove-journals/harvest-ned-periodicals/&#34;&gt;a new notebook&lt;/a&gt; to the Trove Periodicals section of the GLAM Workbench that harvests data about NED periodicals, and created a &lt;a href=&#34;https://glam-workbench.net/trove-journals/trove-ned-periodicals-data/&#34;&gt;new dataset&lt;/a&gt; with lists of titles and issues. You can also &lt;a href=&#34;https://glam-workbench.net/datasette-lite/?url=https://github.com/GLAM-Workbench/trove-ned-periodicals-data/blob/main/ned-periodicals.db&amp;amp;install=datasette-json-html&amp;amp;install=datasette-template-sql&amp;amp;metadata=https://github.com/GLAM-Workbench/trove-ned-periodicals-data/blob/main/metadata.json&#34;&gt;explore the harvested data using Datasette-Lite&lt;/a&gt;. But here&amp;rsquo;s a quick overview.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;There are at least 7,973 born-digital periodicals contributed through NED, comprising a total of 156,151 issues!&lt;/strong&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-04-10-22-56-27.png&#34; width=&#34;600&#34; height=&#34;383&#34; alt=&#34;Screenshot of Trove displaying the Palm Island Voice, one of the NED periodicals&#34;&gt;
&lt;p&gt;&lt;em&gt;One of the 428 issues of the &lt;a href=&#34;https://nla.gov.au/nla.obj-1252246147&#34;&gt;Palm Island Voice&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What are they? Here are the twenty titles with the most issues.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;title_id&lt;/th&gt;
&lt;th&gt;title&lt;/th&gt;
&lt;th&gt;issues&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-1916881555&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-1916881555&#34;&gt;Western Australian government gazette.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1869&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-2940864261&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2940864261&#34;&gt;The Australian Jewish News.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1067&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-2692666983&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2692666983&#34;&gt;APSjobs-vacancies daily &amp;hellip; daily gazette.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1043&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-2945379691&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2945379691&#34;&gt;Tweed link&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;825&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-2541626239&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2541626239&#34;&gt;Weekly notice&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;798&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-2940863963&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2940863963&#34;&gt;The Australian Jewish News.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;726&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-1252109725&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-1252109725&#34;&gt;Queensland Health services bulletin&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;700&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-1247944368&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-1247944368&#34;&gt;Hyden Karlgarin Householder News.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;642&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-1775015332&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-1775015332&#34;&gt;E-record : your news from across the Archdioce&amp;hellip;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-638303044&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-638303044&#34;&gt;Class ruling&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;580&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-2536144595&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2536144595&#34;&gt;Plantagenet news.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;574&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-1252305285&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-1252305285&#34;&gt;Clermont rag : Community newspaper.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;514&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-2815835489&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2815835489&#34;&gt;The Apollo Bay news.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;513&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-1908935587&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-1908935587&#34;&gt;Assessment reports and exam papers&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-3125539859&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-3125539859&#34;&gt;The Peninsula community access news.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;506&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-2859788676&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2859788676&#34;&gt;Council news : weekly information from us to you&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;469&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-1252119874&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-1252119874&#34;&gt;Rot-Ayr-Ian [electronic resource] : the offici&amp;hellip;&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;467&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-2994765231&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-2994765231&#34;&gt;Townsville Orchid Society Inc. bulletin.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;442&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-1252246096&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-1252246096&#34;&gt;Palm Island Voice.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;428&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nla.obj-3267060622&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;http://nla.gov.au/nla.obj-3267060622&#34;&gt;News &amp;amp; views from George Cochrane.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;399&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These are &lt;em&gt;born-digital&lt;/em&gt;, so they&amp;rsquo;re not images and OCRd text like the digitised periodicals and newspapers. Most of them are PDFs, as we can see from the metadata.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;number of issues&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;application/pdf&lt;/td&gt;
&lt;td&gt;154,976&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;not specified&lt;/td&gt;
&lt;td&gt;1,075&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;application/epub+zip&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Not all NED periodicals can be viewed online. Publishers submitting periodicals through NED can place restrictions on access, specifying that the publications can only be viewed on-site in a library. The three access categories are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Unrestricted&lt;/code&gt; – you can view online and download&lt;/li&gt;
&lt;li&gt;&lt;code&gt;View Only&lt;/code&gt; – you can view online but not download&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Onsite Only&lt;/code&gt; – you can only view when onsite at the designated libraries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Fortunately, the vast majority are &lt;code&gt;Unrestricted&lt;/code&gt;.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Access status&lt;/th&gt;
&lt;th&gt;Number of issues&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unrestricted&lt;/td&gt;
&lt;td&gt;138,557&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;View Only&lt;/td&gt;
&lt;td&gt;12,937&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Onsite Only&lt;/td&gt;
&lt;td&gt;4,657&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
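&lt;p&gt;If you&amp;rsquo;re working with the harvested dataset, filtering issues by access condition is a one-liner in pandas. The column name used here is an assumption – check the dataset&amp;rsquo;s documentation for the actual field names.&lt;/p&gt;

```python
import pandas as pd

def unrestricted(issues):
    """Keep only the issues you can view and download online.

    Assumes a DataFrame with an 'access' column holding the
    access categories described above.
    """
    return issues[issues["access"] == "Unrestricted"]
```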
&lt;p&gt;One of the most important things about the digitised newspaper corpus is its diversity – it&amp;rsquo;s not just the metropolitan dailies, but many local, community, political, and religious newspapers as well. While local newspapers might be dying out in their traditional form, electronic publications are popping up. Look at the titles in the list above – the &lt;em&gt;Apollo Bay News&lt;/em&gt;, the &lt;em&gt;Palm Island Voice&lt;/em&gt; – while current historians mine the digitised newspapers for fragments of everyday life, future historians will be grateful for what&amp;rsquo;s being captured and preserved by NED.&lt;/p&gt;
&lt;p&gt;But wait, there&amp;rsquo;s more! Since 1996, the Australian Web Archive (previously Pandora) has been capturing online periodicals. My next task is to harvest some details of these as well.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>More tools and data for working with Trove&#39;s digitised periodicals</title>
      <link>https://updates.timsherratt.org/2024/03/26/more-tools-and.html</link>
      <pubDate>Tue, 26 Mar 2024 15:35:02 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/03/26/more-tools-and.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/trove-journals/&#34;&gt;Trove Periodicals&lt;/a&gt; section of the GLAM Workbench has been updated! Some changes were necessary to make use of version 3 of the Trove API, but I&amp;rsquo;ve also taken the chance to reorganise things a bit – starting with the name. This section used to be called &amp;lsquo;Trove journals&amp;rsquo;, reflecting the naming of Trove&amp;rsquo;s &amp;lsquo;Journals&amp;rsquo; zone. But zones have gone, and periodicals are now spread across multiple categories, so I thought a name change was necessary to better reflect the type of content being examined.&lt;/p&gt;
&lt;h2 id=&#34;what-periodicals-have-been-digitised&#34;&gt;What periodicals have been digitised?&lt;/h2&gt;
&lt;p&gt;It&amp;rsquo;s surprisingly difficult to find out what periodicals have actually been digitised in Trove. There&amp;rsquo;s no straightforward list of titles as there is in the newspapers category. Over the years I&amp;rsquo;ve created a variety of lists and tools to try and overcome this. I&amp;rsquo;m now trying to consolidate these efforts into &lt;a href=&#34;https://updates.timsherratt.org/2024/01/30/exploring-troves-digitised.html&#34;&gt;a single dataset which you can explore using Datasette-Lite&lt;/a&gt;. I&amp;rsquo;ve made a few improvements to this in recent weeks, in particular, title records now include a link to download &lt;em&gt;all&lt;/em&gt; the OCRd text from a periodical.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-03-26-15-01-09.png&#34; width=&#34;600&#34; height=&#34;384&#34; alt=&#34;Screen capture of Datasette-Lite interface showing a list of periodical titles.&#34;&gt;
&lt;h2 id=&#34;new-notebooks&#34;&gt;New notebooks&lt;/h2&gt;
&lt;p&gt;The notebook pages in the GLAM Workbench now include previews of the notebook&amp;rsquo;s content. There are a number of new notebooks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/periodicals-from-api/&#34;&gt;Get details of periodicals from the &lt;code&gt;/magazine/titles&lt;/code&gt; API endpoint&lt;/a&gt; – shows how you can get a list of titles from version 3 of the Trove API and explores some of the problems with the data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/periodicals-enrich-for-datasette/&#34;&gt;Enrich the list of periodicals from the Trove API&lt;/a&gt; – shows how to work around some of the problems with the titles data, adds some extra metadata, and generates the database described above&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/harvest-illustrations-from-periodicals/&#34;&gt;Harvest illustrations from periodicals&lt;/a&gt; – extract illustrations from periodical pages, issues, articles, and searches using OCR layout data&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
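&lt;p&gt;If you want to try the &lt;code&gt;/magazine/titles&lt;/code&gt; endpoint yourself, a request looks something like this. The parameter names and the &lt;code&gt;X-API-KEY&lt;/code&gt; header are my reading of the v3 API – check the official Trove API documentation before relying on them.&lt;/p&gt;

```python
import requests

API_URL = "https://api.trove.nla.gov.au/v3/magazine/titles"

def titles_params(limit=100):
    """Query parameters for one page of periodical titles."""
    return {"encoding": "json", "limit": limit}

def get_titles(api_key, limit=100):
    response = requests.get(
        API_URL,
        params=titles_params(limit),
        headers={"X-API-KEY": api_key},
    )
    response.raise_for_status()
    return response.json()
```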
&lt;p&gt;If you&amp;rsquo;d like an example of the sorts of illustrations you can extract from the digitised periodicals, here&amp;rsquo;s a &lt;a href=&#34;https://www.dropbox.com/scl/fo/60imdoyf4ss2b6vh01q1w/h?rlkey=zuwbjaqnmr7qvkuinovdu5ot0&amp;amp;dl=0&#34;&gt;collection of photos&lt;/a&gt; found by searching for periodical articles with &lt;code&gt;cat&lt;/code&gt; or &lt;code&gt;kitten&lt;/code&gt; in their titles.&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-03-26-15-15-06.png&#34; width=&#34;600&#34; height=&#34;259&#34; alt=&#34;Thumbnails of cat photos extracted from periodicals.&#34;&gt;
&lt;h2 id=&#34;updated-and-reorganised-datasets&#34;&gt;Updated and reorganised datasets&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve moved all the datasets out of the main GitHub repository into their own separate repositories. Some large collections that were previously stored on the sadly-deceased Cloudstor service are now sitting in an Amazon S3 bucket. These include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/periodicals-data-api/&#34;&gt;Details of digitised periodicals from the &lt;code&gt;/magazine/titles&lt;/code&gt; API endpoint&lt;/a&gt; – these are the datasets created by harvesting and enriching titles and issues data from the Trove API&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/csv-digital-journals/&#34;&gt;CSV formatted list of journals available from Trove in digital form&lt;/a&gt; – this is an update of an older dataset of titles created by searching for digitised works with the format &lt;code&gt;Periodical&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/bulletin-cartoons-collection/&#34;&gt;Editorial cartoons from The Bulletin, 1886 to 1952&lt;/a&gt; – the cartoons haven&amp;rsquo;t been updated, but I&amp;rsquo;ve created a new metadata file and fixed up some problems with page numbering&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/ocrd-text-all-journals/&#34;&gt;OCRd text from Trove digitised journals&lt;/a&gt; – I&amp;rsquo;ve reharvested all of the OCRd text and made it available as individual zip files for each title, and one &lt;em&gt;big&lt;/em&gt; zip file with everything!&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As previously noted, I&amp;rsquo;ve also &lt;a href=&#34;https://updates.timsherratt.org/2024/03/19/a-new-way.html&#34;&gt;made the Bulletin cartoons available through Datasette-Lite for easy exploration&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-03-19-15-43-41.png&#34; width=&#34;600&#34; height=&#34;459&#34; alt=&#34;Screen capture of Datasette-Lite interface showing some of the Bulletin cartoons.&#34;&gt;
</description>
    </item>
    
    <item>
      <title>A new way to explore editorial cartoons from *The Bulletin*</title>
      <link>https://updates.timsherratt.org/2024/03/19/a-new-way.html</link>
      <pubDate>Tue, 19 Mar 2024 15:46:22 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/03/19/a-new-way.html</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://updates.timsherratt.org/2019/05/09/over-the-last.html&#34;&gt;About five years ago&lt;/a&gt; I created a collection of full-page editorial cartoons from &lt;em&gt;The Bulletin&lt;/em&gt;, harvested from Trove. Through a process that might be politely described as &amp;lsquo;iterative&amp;rsquo;, I fiddled with an assortment of queries and methods until I had at least one cartoon from every issue published between 4 September 1886 and 17 September 1952 – 3,471 cartoons in total. The &lt;a href=&#34;https://glam-workbench.net/trove-journals/bulletin-cartoons-collection/&#34;&gt;details of the collection&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/trove-journals/finding-editorial-cartoons-in-bulletin/&#34;&gt;how I created it&lt;/a&gt; are available in the &lt;a href=&#34;https://glam-workbench.net/trove-journals/&#34;&gt;Trove periodicals&lt;/a&gt; section of the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;Last night, as I was tidying up a new release of the Trove periodicals repository, I had a thought – why not put all of the details of the cartoons in a little database and make it available using Datasette-Lite for easy exploration? So I did.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/datasette-lite/?url=https://github.com/GLAM-Workbench/bulletin-editorial-cartoons/blob/main/bulletin-editorial-cartoons.db&amp;amp;install=datasette-json-html&amp;amp;metadata=https://raw.githubusercontent.com/GLAM-Workbench/bulletin-editorial-cartoons/main/metadata.json#/bulletin-editorial-cartoons/cartoons&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-03-19-15-43-41.png&#34; width=&#34;600&#34; height=&#34;459&#34; alt=&#34;screenshot of Datasette interface showing how details of the cartoons are displayed&#34;&gt;
&lt;p&gt;One of the coolest new features is that I&amp;rsquo;ve harvested the OCRd text from each page containing a cartoon and created a full-text index. This means you can find cartoons by searching for words in their captions! Other features include embedded thumbnail images and links to download high-resolution versions of each page image.&lt;/p&gt;
&lt;p&gt;In creating the database, I realised there were a few problems with the original metadata (dodgy page numbers), so I&amp;rsquo;ve fixed that up as well. I&amp;rsquo;ve also moved the mega zip download of every image (over 60GB) from the unfortunately deceased CloudStor service to AWS.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New GLAM Workbench section for working with government publications in Trove</title>
      <link>https://updates.timsherratt.org/2024/02/27/new-glam-workbench.html</link>
      <pubDate>Tue, 27 Feb 2024 14:34:18 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/02/27/new-glam-workbench.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; has a brand new section aimed at helping you find and use government publications in Trove. Most of the GLAM Workbench&amp;rsquo;s existing sections focus on a particular resource format, or are related to one of Trove&amp;rsquo;s top-level categories. This didn&amp;rsquo;t quite work for government publications, as things like Parliamentary Papers are spread across multiple categories, and can encompass a variety of formats. So I thought a new section was the best way of bringing it all together.&lt;/p&gt;
&lt;p&gt;At the moment the &lt;a href=&#34;https://glam-workbench.net/trove-government/&#34;&gt;Trove Government section&lt;/a&gt; includes two notebooks and three pre-harvested datasets.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-government/harvest-parliament-press-releases/&#34;&gt;Harvest parliament press releases from Trove&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-government/harvest-parliamentary-papers/&#34;&gt;Harvest details of Commonwealth Parliamentary Papers digitised in Trove&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-government/trove-parliamentary-papers-data/&#34;&gt;Digitised Parliamentary Papers in Trove&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-government/trove-parliament-press-releases-refugees/&#34;&gt;Press releases relating to refugees&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-government/trove-parliament-press-releases-covid/&#34;&gt;Press releases relating to COVID&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It took a bit longer than I was originally expecting because I also made some changes in the way I store and display information about GLAM Workbench resources. You might notice, for example, that each of the datasets lives in its own separate GitHub repository, rather than being rolled together with the notebooks into one big repository. This makes it easier to manage and share information about individual datasets, and also trims down the size of the Docker images built from the code repository.&lt;/p&gt;
&lt;p&gt;Each of these data and code repositories has its own machine-readable metadata following the &lt;a href=&#34;https://www.researchobject.org/ro-crate/&#34;&gt;RO-Crate&lt;/a&gt; standard. This continues &lt;a href=&#34;https://updates.timsherratt.org/2023/08/31/some-important-updates.html&#34;&gt;work I&amp;rsquo;ve been doing&lt;/a&gt; with the &lt;a href=&#34;https://ardc.edu.au/project/ardc-community-data-lab/&#34;&gt;ARDC Community Data Lab&lt;/a&gt; to describe GLAM Workbench resources and outputs using RO-Crate. Having this metadata in a standard format creates new possibilities for integration and automation. I&amp;rsquo;m now using the RO-Crate files to produce different, public-facing views of the resources they describe. The README files in each repository and all the GLAM Workbench pages in the Trove Government section are automatically generated from the RO-Crate data. In the latter case, I&amp;rsquo;ve extended my MkDocs setup using macros to pull in the RO-Crate JSON files and make the data available to the page templates. Connecting all the bits up took a lot of time, but I&amp;rsquo;m pretty happy with the result and will eventually extend this approach to the rest of the GLAM Workbench.&lt;/p&gt;
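&lt;p&gt;For the curious, an RO-Crate metadata file is just a JSON-LD document named &lt;code&gt;ro-crate-metadata.json&lt;/code&gt;. Here&amp;rsquo;s a heavily trimmed sketch of what one might contain – the dataset name and CSV file come from this section, but the details are illustrative rather than copied from the actual crates:&lt;/p&gt;

```json
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@id": "ro-crate-metadata.json",
      "@type": "CreativeWork",
      "about": { "@id": "./" },
      "conformsTo": { "@id": "https://w3id.org/ro/crate/1.1" }
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Digitised Parliamentary Papers in Trove",
      "hasPart": [{ "@id": "trove-parliamentary-papers.csv" }]
    },
    {
      "@id": "trove-parliamentary-papers.csv",
      "@type": "File",
      "encodingFormat": "text/csv"
    }
  ]
}
```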
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-02-27-14-19-39.png&#34; width=&#34;600&#34; height=&#34;496&#34; alt=&#34;Screenshot from the GLAM Workbench showing the new &#39;Preview&#39; feature.&#34;&gt;
&lt;p&gt;I also fiddled a bit with the way Jupyter notebooks are presented in the GLAM Workbench. The Trove Government pages include a notebook preview – basically an HTML rendering of the notebook in an &lt;code&gt;iframe&lt;/code&gt;. This means you can browse the content of the notebook without having to do anything extra, or go anywhere else. In other sections you can view notebook content by following links to GitHub or NBViewer, but the embedded previews seem cleaner and more useful.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-02-27-14-18-25.png&#34; width=&#34;600&#34; height=&#34;242&#34; alt=&#34;Screenshot from GLAM Workbench showing the tabbed &#39;Using this notebook&#39; options.&#34;&gt;
&lt;p&gt;I&amp;rsquo;ve also changed the way options to run the notebook are presented. In the Trove Government section, these options are displayed as tabs beneath the preview – allowing you to choose, for example, between ARDC Binder and the public MyBinder service. In other sections I have a big blue button to launch the notebook using a specific service, with other options listed below. This new approach means I don&amp;rsquo;t have to prioritise one particular service – it&amp;rsquo;s left to the user to choose. It&amp;rsquo;s also expandable. In the future, I&amp;rsquo;m hoping to make some of the GLAM Workbench&amp;rsquo;s notebooks available using JupyterLite. As I do this, I can just add the JupyterLite option under another tab.&lt;/p&gt;
&lt;p&gt;As with some other sections of the GLAM Workbench, the dataset pages are integrated with Datasette-Lite. If there&amp;rsquo;s a CSV file in the dataset, you&amp;rsquo;ll see a button to explore it using Datasette. For example, &lt;a href=&#34;https://glam-workbench.net/datasette-lite/?csv=https://raw.githubusercontent.com/GLAM-Workbench/trove-parliamentary-papers-data/main/trove-parliamentary-papers.csv&amp;amp;fts=title,alternative_title,contributor&amp;amp;drop=work_type,fulltext_url_text,parent,parent_url,children&#34;&gt;this link leads to a searchable database&lt;/a&gt; with details of 24,997 digitised Parliamentary Papers. That same dataset has also been used in the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;, to &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/parliamentary-papers/overview.html&#34;&gt;visualise Trove&amp;rsquo;s holdings of Parliamentary Papers&lt;/a&gt;. Yay for integration and reuse!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Digital history stream at AHA annual conference in July</title>
      <link>https://updates.timsherratt.org/2024/02/16/digital-history-stream.html</link>
      <pubDate>Fri, 16 Feb 2024 11:44:32 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/02/16/digital-history-stream.html</guid>
      <description>&lt;p&gt;This year the annual conference of the Australian Historical Association will include a digital history stream, sponsored by the Australian Research Data Commons (ARDC), and convened by me!&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://www.flinders.edu.au/content/dam/documents/engage/events/aha-conference/digital-history-stream-aha-2024.pdf&#34;&gt;call for papers is available here&lt;/a&gt; or through the &lt;a href=&#34;https://www.flinders.edu.au/engage/culture/whats-on/aha-conference&#34;&gt;Conference website&lt;/a&gt;. The list of possible topics is deliberately broad and inclusive – if you’re using digital tools or methods in the organisation, analysis, and visualisation of historical data we’d love to hear from you. Proposals are due on &lt;del&gt;23 February&lt;/del&gt; 4 March and can be submitted through the &lt;a href=&#34;https://www.flinders.edu.au/engage/culture/whats-on/aha-conference&#34;&gt;Conference website&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.flinders.edu.au/content/dam/documents/engage/events/aha-conference/digital-history-stream-aha-2024.pdf&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-02-16-11-41-34.png&#34; width=&#34;600&#34; height=&#34;594&#34; alt=&#34;Screen capture of the Call for Papers&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We’re particularly keen for HDR and ECR scholars to be involved. To help meet registration and travel costs, the ARDC is funding up to four $1000 bursaries. &lt;a href=&#34;https://flinders-my.sharepoint.com/:b:/g/personal/anto0105_flinders_edu_au/EfSLhqTZn8REqyEQm8VYwx8B1Pll8knu2fJ88DmNoc_f1g?e=VFGXgC&#34;&gt;More details are available here&lt;/a&gt;. Bursary applications close on 31 March.&lt;/p&gt;
&lt;p&gt;There’s also likely to be a digital history workshop, as well as updates on the work of the &lt;a href=&#34;https://ardc.edu.au/hass-and-indigenous-research-data-commons/&#34;&gt;HASS &amp;amp; Indigenous Research Data Commons&lt;/a&gt; and &lt;a href=&#34;https://ardc.edu.au/project/ardc-community-data-lab/&#34;&gt;ARDC Community Data Lab&lt;/a&gt;, including the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Time is short! Get your proposals in now! Contact me at &lt;a href=&#34;mailto:tim@timsherratt.au&#34;&gt;tim@timsherratt.au&lt;/a&gt; if you have any questions.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Some recent presentations on the GLAM Workbench and Trove Data Guide</title>
      <link>https://updates.timsherratt.org/2024/02/13/some-recent-presentations.html</link>
      <pubDate>Tue, 13 Feb 2024 09:11:17 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/02/13/some-recent-presentations.html</guid>
      <description>&lt;p&gt;Last week I attended the ARDC Workshop on Repositories &amp;amp; Workspaces where I gave a &lt;a href=&#34;https://slides.com/wragge/cdl-gw-workspaces&#34;&gt;quick intro to the GLAM Workbench and the Community Data Lab&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-01-31-16-28-37.png&#34; width=&#34;600&#34; height=&#34;442&#34; alt=&#34;Cover slide of my presentation on the mysteries of Trove&#34;&gt;
&lt;p&gt;Then it was off to the ARDC HASS&amp;amp;I Research Data Commons Summer School where I explored some of &lt;a href=&#34;https://slides.com/wragge/hass-tdg&#34;&gt;the mysteries of Trove&lt;/a&gt; in a walk-through of the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Exploring Trove’s digitised periodicals</title>
      <link>https://updates.timsherratt.org/2024/01/30/exploring-troves-digitised.html</link>
      <pubDate>Tue, 30 Jan 2024 22:53:14 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/01/30/exploring-troves-digitised.html</guid>
      <description>&lt;p&gt;While Trove’s digitised newspapers get all the attention, there are many other digitised periodicals to explore. But it’s not easy to find them from the Trove web interface – unlike the newspapers, there’s no list of digitised titles. So to help researchers find and use Trove’s digitised periodicals, I’ve created a searchable database using Datasette-Lite. &lt;a href=&#34;https://glam-workbench.net/datasette-lite/?url=https://github.com/GLAM-Workbench/trove-periodicals-data/blob/main/periodicals.db&amp;amp;install=datasette-json-html&amp;amp;install=datasette-template-sql&amp;amp;metadata=https://github.com/GLAM-Workbench/trove-periodicals-data/blob/main/metadata.json&#34;&gt;&lt;strong&gt;Try it out!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Search for the titles of digitised periodicals.&lt;/em&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-01-29-21-45-34.png&#34; width=&#34;600&#34; height=&#34;405&#34; alt=&#34;Screenshot showing a list of periodical titles in Datasette-Lite.&#34;&gt;
&lt;p&gt;&lt;em&gt;View the details of an individual title (note the link to available issues at the bottom).&lt;/em&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-01-29-21-45-47.png&#34; width=&#34;600&#34; height=&#34;405&#34; alt=&#34;Screenshot showing details of a single periodical title&#34;&gt;
&lt;p&gt;&lt;em&gt;Browse a list of available issues.&lt;/em&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-01-29-21-45-57.png&#34; width=&#34;600&#34; height=&#34;405&#34; alt=&#34;Screenshot showing a list of periodical issues&#34;&gt;
&lt;p&gt;The database currently contains details of 923 different titles, and over 37,000 individual issues. You can search for titles by keyword, then click through to view a full list of issues from a periodical. As well as basic descriptive metadata and links back to Trove, there are a couple of other handy inclusions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Titles include a ‘Search for articles in Trove’ link that opens up the Trove interface and pre-populates the search box with the title’s identifier. By adding some keywords you can search for articles within the publication.&lt;/li&gt;
&lt;li&gt;Issues include a &lt;code&gt;text_download_url&lt;/code&gt; link that downloads all the OCRd text from the issue.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Regular viewers might be thinking – wasn’t there already something like this? Yes indeed, for several years I’ve been maintaining the &lt;a href=&#34;https://trove-titles.herokuapp.com/&#34;&gt;Trove Titles&lt;/a&gt; app, which provides a similar list. I’ve also provided &lt;a href=&#34;https://glam-workbench.net/trove-journals/journals-with-ocr/&#34;&gt;harvests of OCRd text&lt;/a&gt;. So why the new database? First of all, I’ve harvested the data in a different way – making use of the new &lt;code&gt;/magazine/titles&lt;/code&gt; API endpoint. This approach had several problems (see below), but I’m hoping that in the long term it will make updates easier.&lt;/p&gt;
&lt;p&gt;Second, I’m exploring ways to make these sorts of resources available in a more sustainable way. The current Trove Titles app runs on the Heroku platform and there are costs associated with the app and the databases it uses. It just seems a bit silly for a relatively small amount of data. Datasette-Lite takes a very different approach – there’s no constantly running server, just a static site pointing at a dataset. All the magic happens within your browser!&lt;/p&gt;
&lt;p&gt;I’ve written previously about how I’ve been &lt;a href=&#34;https://updates.timsherratt.org/2024/01/12/customising-datasettelite-to.html&#34;&gt;customising Datasette-Lite for use within the GLAM Workbench&lt;/a&gt;, but I had to handle the periodicals data a bit differently. Because there’s a foreign key relationship between the titles and the issues (each issue is linked back to a title), I loaded the harvested data into a SQLite database (using &lt;a href=&#34;https://sqlite-utils.datasette.io/en/stable/index.html&#34;&gt;sqlite-utils&lt;/a&gt;), defined the foreign key, and built a fulltext index on the periodical titles. Then I just saved the whole SQLite database to a GitHub repository and pointed Datasette-Lite at it.&lt;/p&gt;
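&lt;p&gt;If you want to do something similar, the load step is roughly as follows – a minimal sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module rather than sqlite-utils, with illustrative table and column names:&lt;/p&gt;

```python
import sqlite3

# Two related tables plus a full-text index on the titles.
# (The real dataset is built with sqlite-utils, which wraps these steps.)
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE titles (id TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE issues (
        id TEXT PRIMARY KEY,
        title_id TEXT REFERENCES titles (id),  -- foreign key back to the title
        date TEXT
    );
    CREATE VIRTUAL TABLE titles_fts USING fts5(title);
    """
)
db.execute("INSERT INTO titles VALUES (?, ?)", ("nla.obj-123", "The Home"))
db.execute(
    "INSERT INTO issues VALUES (?, ?, ?)", ("nla.obj-456", "nla.obj-123", "1920-02-01")
)
# Copy the titles into the full-text index, keeping the rowids aligned
db.execute("INSERT INTO titles_fts (rowid, title) SELECT rowid, title FROM titles")

# Full-text search over the periodical titles
matches = db.execute(
    "SELECT title FROM titles_fts WHERE titles_fts MATCH ?", ("home",)
).fetchall()
```

Saving the database to a file instead of &lt;code&gt;:memory:&lt;/code&gt; gives you something you can commit to a repository and point Datasette-Lite at.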
&lt;p&gt;I had to modify the GLAM Workbench template a bit to insert links back to the title when you view an individual issue. This happens automatically when you view a list of results, but not when you view an individual item. First I used the &lt;code&gt;install&lt;/code&gt; parameter to tell Datasette-Lite to install the &lt;a href=&#34;https://datasette.io/plugins/datasette-template-sql&#34;&gt;datasette-template-sql&lt;/a&gt; plugin. This plugin lets you run SQL queries within a template. Then I could run a query to see if there was a foreign key associated with the current item:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-jinja2&#34; data-lang=&#34;jinja2&#34;&gt;{% set fk = sql(&amp;quot;SELECT * FROM pragma_foreign_key_list(?)&amp;quot;, [table]) %}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If there is a foreign key, I run another query to get the title of the linked record:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-jinja2&#34; data-lang=&#34;jinja2&#34;&gt;{% if fk %}
    {% set flinks = sql(&amp;quot;select title, &amp;quot; + fk.0.to + &amp;quot; from &amp;quot; + fk.0.table + &amp;quot; where id = ?&amp;quot;, [display_rows.0[fk.0.from]]) %}
    {% set ftitle = flinks.0.title %}
{% endif %}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Then when rendering the column containing the linked value I can insert the title and a link:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-jinja2&#34; data-lang=&#34;jinja2&#34;&gt;{% if fk and cell.column == fk.0.from %}
	&amp;lt;a href=&amp;quot;/{{database}}/{{fk.0.table}}/{{cell.value}}&amp;quot;&amp;gt;{{ftitle}}&amp;lt;/a&amp;gt; &amp;lt;em&amp;gt;{{cell.value}}&amp;lt;/em&amp;gt;
{% else %}
	{{ row.display(cell.column) }}
{% endif %}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;It seems to work ok, and doesn’t cause problems on databases where there are no foreign keys.&lt;/p&gt;
&lt;p&gt;I’m also using the &lt;a href=&#34;https://datasette.io/plugins/datasette-json-html&#34;&gt;datasette-json-html&lt;/a&gt; plugin to render the thumbnails, and the &lt;code&gt;metadata&lt;/code&gt; parameter to point Datasette-Lite at a custom metadata file – this was primarily to define a custom sort order for the tables.&lt;/p&gt;
&lt;h2 id=&#34;the-data&#34;&gt;The data&lt;/h2&gt;
&lt;p&gt;I’ll write up more about the data and the harvesting process in coming weeks. There’ll also be a new section in the &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt; and some updated notebooks in the &lt;a href=&#34;https://glam-workbench.net/trove-journals/&#34;&gt;journals section of the GLAM Workbench&lt;/a&gt;. But a few notes about the &lt;code&gt;/magazine/titles&lt;/code&gt; API endpoint:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;there are a few hundred duplicate records – I’ve removed these from the dataset&lt;/li&gt;
&lt;li&gt;the API doesn’t provide full information about issues, in particular undated issues are not returned – I’ve tried to fill these gaps&lt;/li&gt;
&lt;li&gt;the data includes a thousand or more Parliamentary Papers – I’ve harvested these separately and thought it was best to exclude them from this dataset&lt;/li&gt;
&lt;li&gt;some titles are really nested collections, so their ‘issues’ are another level of title, while conversely some titles are really issues – I’ve tried to sort as much of this out as I can, but it gets confusing!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So I’m not confident that I’ve got everything, but I think it’s a useful start. I’ve reported the API problems to Trove but haven’t heard anything back yet.&lt;/p&gt;
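&lt;p&gt;For anyone wanting to experiment with the endpoint themselves, a harvesting step might start something like this – a rough sketch only, assuming the v3 API base URL and &lt;code&gt;X-API-KEY&lt;/code&gt; header, with the response field names as guesses rather than gospel:&lt;/p&gt;

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.trove.nla.gov.au/v3"  # assumed v3 API base URL


def dedupe_titles(titles):
    """Drop duplicate title records, keeping the first occurrence of each id."""
    seen = set()
    unique = []
    for title in titles:
        if title["id"] not in seen:
            seen.add(title["id"])
            unique.append(title)
    return unique


def harvest_titles(api_key, limit=100):
    """Fetch one page of results from /magazine/titles and remove duplicates."""
    params = urllib.parse.urlencode({"encoding": "json", "limit": limit})
    request = urllib.request.Request(
        f"{API_BASE}/magazine/titles?{params}",
        headers={"X-API-KEY": api_key},  # assumed authentication header
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return dedupe_titles(data.get("magazine", []))
```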
</description>
    </item>
    
    <item>
      <title>The Trove Newspaper Data Dashboard now has an archive!</title>
      <link>https://updates.timsherratt.org/2024/01/15/the-trove-newspaper.html</link>
      <pubDate>Mon, 15 Jan 2024 10:51:58 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/01/15/the-trove-newspaper.html</guid>
      <description>&lt;p&gt;Since July 2022 I’ve been generating weekly snapshots of the contents of the Trove newspaper corpus. Every Sunday a new version of the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;Trove Newspaper Data Dashboard&lt;/a&gt; is created, highlighting what’s changed over the previous week, and visualising trends since April 2022 (when I first started &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/&#34;&gt;regular data harvests&lt;/a&gt;).&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-01-15-10-42-09.png&#34; width=&#34;600&#34; height=&#34;378&#34; alt=&#34;Screenshot of part of the Trove Newspaper Data Dashboard, showing trends in number of articles, corrections, comments, and tags since April 2022.&#34;&gt;
&lt;p&gt;All of the past versions of the dashboard are preserved in GitHub, but there wasn’t an easy way to browse them, until now. If you want to find out what changed in any week since July 2022, you can now visit the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/archive/index.html&#34;&gt;Trove Data Dashboard Archive&lt;/a&gt; and select a date from the list!&lt;/p&gt;
&lt;p&gt;I created the archive by pulling all the versions from GitHub and saving them as individual files. I’ve also added some code to the weekly process that should automatically archive the past week from now on – we’ll see if it works next Sunday…&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Customising Datasette-Lite to explore datasets in the GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2024/01/12/customising-datasettelite-to.html</link>
      <pubDate>Fri, 12 Jan 2024 16:09:41 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/01/12/customising-datasettelite-to.html</guid>
      <description>&lt;p&gt;As well as tools and code, the GLAM Workbench includes a number of pre-harvested datasets for researchers to play with. But just including a link to a CSV file in GitHub or Zenodo isn’t very useful – it doesn’t help researchers understand &lt;em&gt;what’s in&lt;/em&gt; the dataset, and &lt;em&gt;why&lt;/em&gt; it might be useful. That’s why I’ve also started including links that open the CSV files in &lt;a href=&#34;https://github.com/simonw/datasette-lite&#34;&gt;Datasette-Lite&lt;/a&gt;, enabling the contents to be searched, filtered, and faceted. Just look for the &lt;strong&gt;Explore in Datasette&lt;/strong&gt; buttons!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/6bfdff462d.png&#34; width=&#34;600&#34; height=&#34;193&#34; alt=&#34;Example of a Explore in Datasette button&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; is an excellent tool for sharing and exploring data. I’ve used it in a number of projects such as the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; and the &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tasmanian-post-office-directories/&#34;&gt;Tasmanian Post Office Directories&lt;/a&gt;. &lt;a href=&#34;https://github.com/simonw/datasette-lite&#34;&gt;Datasette-Lite&lt;/a&gt; is a version of Datasette that runs completely in the user’s web browser – no need for separate servers! All you do is point a Datasette-Lite GitHub repository at a publicly available CSV file, and it builds a searchable database in your browser. So instead of having to configure and maintain a series of servers running Datasette, I just have one static GitHub repository that only springs into action when needed.&lt;/p&gt;
&lt;p&gt;For example, &lt;a href=&#34;https://glam-workbench.net/datasette-lite/?csv=https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv&amp;amp;fts=title,contributor,is_part_of&amp;amp;drop=publisher,work_type,fulltext_url_text&#34;&gt;click this link&lt;/a&gt; to explore metadata describing oral history collections in Trove using Datasette-Lite.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/datasette-lite/?csv=https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv&amp;amp;fts=title,contributor,is_part_of&amp;amp;drop=publisher,work_type,fulltext_url_text&#34;&gt;&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-01-12-16-02-56.png&#34; width=&#34;600&#34; height=&#34;425&#34; alt=&#34;Screenshot of oral histories metadata in Datasette&#34;&gt;&lt;/a&gt;&lt;/p&gt;
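&lt;p&gt;The query string does all the work in links like this. Here’s a hypothetical helper showing how they’re assembled – the &lt;code&gt;csv&lt;/code&gt;, &lt;code&gt;fts&lt;/code&gt;, and &lt;code&gt;drop&lt;/code&gt; parameter names come from Datasette-Lite itself, but the function is just for illustration:&lt;/p&gt;

```python
from urllib.parse import urlencode


def datasette_lite_url(base, csv_url, fts_columns=(), drop_columns=()):
    """Build an 'Explore in Datasette' link.

    csv points at the data file, fts names columns to full-text index,
    and drop lists columns to leave out of the database.
    """
    params = [("csv", csv_url)]
    if fts_columns:
        params.append(("fts", ",".join(fts_columns)))
    if drop_columns:
        params.append(("drop", ",".join(drop_columns)))
    return f"{base}?{urlencode(params)}"


url = datasette_lite_url(
    "https://glam-workbench.net/datasette-lite/",
    "https://raw.githubusercontent.com/GLAM-Workbench/trove-oral-histories-data/main/trove-oral-histories.csv",
    fts_columns=("title", "contributor", "is_part_of"),
    drop_columns=("publisher", "work_type"),
)
```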
&lt;p&gt;I’ve made a few changes to the standard Datasette-Lite application for use with the GLAM Workbench. These are all included in the &lt;a href=&#34;https://github.com/GLAM-Workbench/datasette-lite&#34;&gt;GLAM Workbench’s Datasette-Lite fork&lt;/a&gt; and described below.&lt;/p&gt;
&lt;h2 id=&#34;custom-theme&#34;&gt;Custom theme&lt;/h2&gt;
&lt;p&gt;I’d already created a custom Datasette theme for use in other projects. The question was how do I get it to work with Datasette-Lite? Just putting a &lt;code&gt;templates&lt;/code&gt; folder in the repository wasn’t enough, as the virtual environment created within the browser doesn’t have direct access to all the files. I eventually figured out that I could zip up the templates folder, fetch the zip file using Javascript, and then unzip the folder into the browser’s virtual environment. This is the code in &lt;code&gt;webworker.js&lt;/code&gt; that does all that:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-javascript&#34; data-lang=&#34;javascript&#34;&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;let&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;templateResponse&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;await&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;fetch&lt;/span&gt;(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;templates.zip&amp;#34;&lt;/span&gt;);
&lt;span style=&#34;color:#66d9ef&#34;&gt;let&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;templateBinary&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;await&lt;/span&gt; &lt;span style=&#34;color:#a6e22e&#34;&gt;templateResponse&lt;/span&gt;.&lt;span style=&#34;color:#a6e22e&#34;&gt;arrayBuffer&lt;/span&gt;();
&lt;span style=&#34;color:#a6e22e&#34;&gt;pyodide&lt;/span&gt;.&lt;span style=&#34;color:#a6e22e&#34;&gt;unpackArchive&lt;/span&gt;(&lt;span style=&#34;color:#a6e22e&#34;&gt;templateBinary&lt;/span&gt;, &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;zip&amp;#34;&lt;/span&gt;);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then it’s just a matter of changing the Datasette initialisation to point to the &lt;code&gt;templates&lt;/code&gt; directory:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;ds &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; Datasette(
    names, 
    settings&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;{
    	&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;num_sql_threads&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#ae81ff&#34;&gt;0&lt;/span&gt;, 
    	&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;truncate_cells_html&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#ae81ff&#34;&gt;100&lt;/span&gt; &lt;span style=&#34;color:#75715e&#34;&gt;# truncate cells&lt;/span&gt;
    }, 
    metadata&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;metadata, 
    template_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;templates&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#75715e&#34;&gt;# point to custom templates directory&lt;/span&gt;
    plugins_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;plugins&amp;#34;&lt;/span&gt;, &lt;span style=&#34;color:#75715e&#34;&gt;# point to custom plugins directory&lt;/span&gt;
    memory&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;$&lt;/span&gt;{settings&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;memory &lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;?&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;True&amp;#39;&lt;/span&gt; : &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;False&amp;#39;&lt;/span&gt;}
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As you can see, I’ve also added the &lt;code&gt;&amp;quot;truncate_cells_html&amp;quot;: 100&lt;/code&gt; setting to truncate the contents of cells in the table view.&lt;/p&gt;
&lt;h2 id=&#34;custom-plugins&#34;&gt;Custom plugins&lt;/h2&gt;
&lt;p&gt;Sometimes fields can contain multiple urls. While Datasette will make single urls clickable, multiple urls are just left as plain text. The &lt;a href=&#34;https://datasette.io/plugins/datasette-multiline-links&#34;&gt;datasette-multiline-links&lt;/a&gt; plugin fixes this for urls separated by line breaks, but I generally separate multiple values in CSV fields using the &lt;code&gt;|&lt;/code&gt; character. It wasn’t hard to &lt;a href=&#34;https://github.com/GLAM-Workbench/datasette-lite/blob/main/plugins/datasette-multi-links.py&#34;&gt;modify the plugin&lt;/a&gt;, but again it wasn’t clear how to make the modified plugin work with Datasette-Lite. You can use the &lt;code&gt;install&lt;/code&gt; parameter to load plugins, but the plugins have to either be published in PyPI or available in GitHub as a Python wheel. That all seemed like overkill for my tiny plugin modification, but then I realised that I could use the same method as I was using for the custom template – zip, fetch, unzip, then point Datasette to the new plugins directory.&lt;/p&gt;
&lt;p&gt;It also took me a while to figure out how to get the plugin to work nicely with the &lt;code&gt;truncate_cells_html&lt;/code&gt; setting. Unless a cell-formatting plugin returns &lt;code&gt;None&lt;/code&gt;, other cell format operations, such as truncation, aren’t applied. So I had to make sure that the plugin returned &lt;code&gt;None&lt;/code&gt; if there were no urls in a cell.&lt;/p&gt;
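The return-None contract can be sketched without any Datasette machinery. This is a simplified, dependency-free stand-in for the logic described above, not the plugin's actual code — a real Datasette plugin wraps a function like this in the hookimpl decorator and returns markup containing anchor tags.

```python
def render_multi_links(value):
    """Return rendered output for pipe-separated urls, or None."""
    # Hypothetical sketch of the render_cell contract: names are
    # illustrative, not the actual datasette-multi-links code.
    if not isinstance(value, str):
        return None  # let Datasette handle non-string cells
    parts = [p.strip() for p in value.split("|")]
    urls = [p for p in parts if p.startswith("http")]
    if not urls:
        # Returning None is the crucial part: it tells Datasette the
        # plugin didn't handle the cell, so later formatting such as
        # truncate_cells_html is still applied.
        return None
    # The real plugin turns each url into an anchor tag; this sketch
    # just joins them to stay dependency-free.
    return "\n".join(urls)
```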
&lt;h2 id=&#34;custom-metadata&#34;&gt;Custom metadata&lt;/h2&gt;
&lt;p&gt;You can use the &lt;code&gt;metadata&lt;/code&gt; parameter in Datasette-Lite to point to a metadata file in either JSON or YAML. I’ve added a custom &lt;code&gt;metadata.json&lt;/code&gt; file to the GLAM Workbench repository, and adjusted the &lt;code&gt;webworker.js&lt;/code&gt; code to load it by default.&lt;/p&gt;
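For context, a Datasette metadata file looks something like this — the keys (title, description, license, databases) are standard Datasette metadata keys, but the values here are invented, not the contents of the GLAM Workbench file:

```json
{
  "title": "GLAM Workbench datasets",
  "description": "Pre-harvested GLAM datasets, explored with Datasette-Lite",
  "license": "CC BY 4.0",
  "databases": {
    "data": {
      "tables": {}
    }
  }
}
```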
&lt;h2 id=&#34;full-text-indexing&#34;&gt;Full text indexing&lt;/h2&gt;
&lt;p&gt;One really cool thing about Datasette is the ability to run full text searches across specified columns. If Datasette detects a full text index, it automatically adds a keyword search box.&lt;/p&gt;
&lt;p&gt;There wasn’t a way of adding full text indexes to CSV datasets in Datasette-Lite, so I added a new &lt;code&gt;fts&lt;/code&gt; url parameter and used the value in &lt;code&gt;webworker.js&lt;/code&gt; to modify the database using &lt;a href=&#34;https://sqlite-utils.datasette.io/en/stable/&#34;&gt;SQLite-utils&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;fts_cols &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;$&lt;/span&gt;{JSON&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;stringify(settings&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;ftsCols &lt;span style=&#34;color:#f92672&#34;&gt;||&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;)} 
&lt;span style=&#34;color:#66d9ef&#34;&gt;try&lt;/span&gt;:
    db[bit]&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;enable_fts(fts_cols&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;split(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;,&amp;#34;&lt;/span&gt;)) &lt;span style=&#34;color:#75715e&#34;&gt;# add full text indexes to columns&lt;/span&gt;
&lt;span style=&#34;color:#66d9ef&#34;&gt;except&lt;/span&gt; sqlite3&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;OperationalError:
    print(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;Column not found&amp;#34;&lt;/span&gt;)
    &lt;span style=&#34;color:#66d9ef&#34;&gt;pass&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For example, adding &lt;code&gt;fts=title&lt;/code&gt; to a Datasette-Lite url will automatically add a full text index to the title column. You can also index multiple columns – just separate the column names with commas.&lt;/p&gt;
&lt;p&gt;This url opens a CSV dataset with oral history metadata harvested from Trove and indexes the &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;contributor&lt;/code&gt; columns: &lt;a href=&#34;https://glam-workbench.net/datasette-lite/?csv=https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv&amp;amp;fts=title,contributor&#34;&gt;https://glam-workbench.net/datasette-lite/?csv=https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv&amp;amp;fts=title,contributor&lt;/a&gt;&lt;/p&gt;
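Under the hood, what SQLite-utils' enable_fts() provides is an FTS5 virtual table mirroring the indexed columns. A rough stdlib-only sketch (table and column contents invented; assumes your SQLite build includes FTS5, which is true of most recent Python distributions):

```python
import sqlite3

# Create an in-memory database with an FTS5 index on two columns,
# roughly what enable_fts() sets up for the indexed table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE items_fts USING fts5(title, contributor)")
conn.executemany(
    "INSERT INTO items_fts VALUES (?, ?)",
    [
        ("Interview with a shearer", "National Library"),
        ("Wartime memories", "Oral History Project"),
    ],
)

# A plain keyword search, like the one behind Datasette's search box.
rows = conn.execute(
    "SELECT title FROM items_fts WHERE items_fts MATCH ?", ("shearer",)
).fetchall()

# Advanced FTS5 syntax, such as this prefix query, is what setting
# searchmode to 'raw' exposes in the Datasette interface.
prefixed = conn.execute(
    "SELECT title FROM items_fts WHERE items_fts MATCH ?", ("war*",)
).fetchall()
```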
&lt;p&gt;Datasette converts your CSV file to a SQLite database, and SQLite supports a number of &lt;a href=&#34;https://www.sqlite.org/fts5.html#full_text_query_syntax&#34;&gt;advanced search options&lt;/a&gt;. These options aren’t enabled by default in Datasette – you need to set &lt;code&gt;searchmode&lt;/code&gt; to &lt;code&gt;raw&lt;/code&gt; in the table metadata. To enable advanced searches, I’ve added a line in &lt;code&gt;webworker.js&lt;/code&gt; to modify the default metadata:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;metadata[&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;databases&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;data&amp;#34;&lt;/span&gt;][&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;tables&amp;#34;&lt;/span&gt;][bit] &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; {&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;searchmode&amp;#34;&lt;/span&gt;: &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;raw&amp;#34;&lt;/span&gt;}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;drop-unwanted-columns&#34;&gt;Drop unwanted columns&lt;/h2&gt;
&lt;p&gt;Not all the columns in pre-harvested datasets are useful or interesting. To remove selected columns from Datasette-Lite, I added a &lt;code&gt;drop&lt;/code&gt; url parameter. Once again, you can submit multiple values separated by commas.&lt;/p&gt;
&lt;p&gt;In &lt;code&gt;webworker.js&lt;/code&gt; the &lt;code&gt;drop&lt;/code&gt; values are used with the SQLite-utils &lt;code&gt;transform()&lt;/code&gt; function to remove the columns from the database.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;drop_cols &lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#960050;background-color:#1e0010&#34;&gt;$&lt;/span&gt;{JSON&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;stringify(settings&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;dropCols &lt;span style=&#34;color:#f92672&#34;&gt;||&lt;/span&gt; &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&amp;#34;&lt;/span&gt;)}
db[bit]&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;transform(drop&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;set(drop_cols&lt;span style=&#34;color:#f92672&#34;&gt;.&lt;/span&gt;split(&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;,&amp;#34;&lt;/span&gt;)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This url opens a CSV dataset with oral history metadata harvested from Trove and drops the &lt;code&gt;publisher&lt;/code&gt; and &lt;code&gt;work_type&lt;/code&gt; columns: &lt;a href=&#34;https://glam-workbench.net/datasette-lite/?csv=https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv&amp;amp;drop=publisher,work_type&#34;&gt;https://glam-workbench.net/datasette-lite/?csv=https://github.com/GLAM-Workbench/trove-oral-histories-data/blob/main/trove-oral-histories.csv&amp;amp;drop=publisher,work_type&lt;/a&gt;&lt;/p&gt;
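Because SQLite historically couldn't drop columns with ALTER TABLE, transform() works by rebuilding the table: create a new table without the dropped columns, copy the data across, then swap the names. A rough stdlib-only sketch of that pattern (table and column names invented for illustration):

```python
import sqlite3

# Set up a toy table with a column we want to drop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (title TEXT, publisher TEXT, work_type TEXT)")
conn.execute("INSERT INTO data VALUES ('A history', 'NLA', 'book')")

# Rebuild the table keeping only the surviving columns -- roughly
# what sqlite-utils' transform(drop=...) does behind the scenes.
keep = ["title"]
cols = ", ".join(keep)
conn.execute(f"CREATE TABLE data_new ({cols})")
conn.execute(f"INSERT INTO data_new SELECT {cols} FROM data")
conn.execute("DROP TABLE data")
conn.execute("ALTER TABLE data_new RENAME TO data")

# PRAGMA table_info confirms only the kept column remains.
remaining = [r[1] for r in conn.execute("PRAGMA table_info(data)")]
```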
</description>
    </item>
    
    <item>
      <title>What’s going on?</title>
      <link>https://updates.timsherratt.org/2024/01/04/whats-going-on.html</link>
      <pubDate>Thu, 04 Jan 2024 18:21:41 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/01/04/whats-going-on.html</guid>
      <description>&lt;p&gt;The hardest part of developing tools and resources like the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; is getting information about them to the people who might benefit. The collapse of Twitter has only added to the difficulty, as has the reluctance of GLAM organisations to share new resources with their users. I’d rather spend my time making new tools, but what’s the point if no-one knows they exist?&lt;/p&gt;
&lt;p&gt;Anyway, I thought I’d do a bit of a communications refresh for the new year. If you’re interested in GLAM Workbench and &lt;a href=&#34;https://tdg.glam-workbench.net/&#34;&gt;Trove Data Guide&lt;/a&gt; updates you can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Keep an eye on the &lt;a href=&#34;https://updates.timsherratt.org/categories/glamworkbench/&#34;&gt;GLAM Workbench&lt;/a&gt; channel of my microblog (or add &lt;a href=&#34;https://updates.timsherratt.org/categories/glamworkbench/feed.xml&#34;&gt;the feed&lt;/a&gt; to your RSS reader)&lt;/li&gt;
&lt;li&gt;Follow the &lt;a href=&#34;https://www.facebook.com/glamworkbench&#34;&gt;GLAM Workbench Facebook page&lt;/a&gt; for cross-posted updates from the RSS feed&lt;/li&gt;
&lt;li&gt;Follow the &lt;a href=&#34;https://www.linkedin.com/company/glam-workbench/&#34;&gt;GLAM Workbench LinkedIn page&lt;/a&gt; for cross-posted updates from the RSS feed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;m also working on an email newsletter thing that&amp;rsquo;ll compile the updates at regular intervals.&lt;/p&gt;
&lt;p&gt;For more social socials, as well as questions, requests and problems, you can always find me on Mastodon: &lt;a href=&#34;https://hcommons.social/@wragge&#34;&gt;@wragge@hcommons.social&lt;/a&gt;. My email address is not too hard to find, but, honestly, your chances of getting a reply are slim.&lt;/p&gt;
&lt;p&gt;If you’ve got a bug report, or a suggestion for a new notebook or data source, feel free to &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench.github.io/issues&#34;&gt;create an issue on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;And, of course, everything I do is openly licensed, so you are very welcome to modify and share! See the GLAM Workbench for &lt;a href=&#34;https://glam-workbench.net/get-involved/&#34;&gt;other ways to get involved&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/gw-hex.png&#34; width=&#34;250&#34; height=&#34;289&#34; alt=&#34;GLAM Workbench Logo&#34;&gt;
</description>
    </item>
    
    <item>
      <title>Exploring oral histories in Trove</title>
      <link>https://updates.timsherratt.org/2024/01/04/exploring-oral-histories.html</link>
      <pubDate>Thu, 04 Jan 2024 12:20:30 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/01/04/exploring-oral-histories.html</guid>
      <description>&lt;p&gt;The National Library of Australia holds over &lt;a href=&#34;https://www.nla.gov.au/collections/what-we-collect/oral-history-and-folklore&#34;&gt;55,000 hours of oral history and folklore recordings&lt;/a&gt; dating back to the 1950s. This collection is being made available online, and many recordings can now be listened to using &lt;a href=&#34;https://trove.nla.gov.au/help/navigating/audio-player&#34;&gt;Trove’s audio player&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, the oral history collection is not easy to find in Trove. You need to go to the ‘Music, Audio, &amp;amp; Video’ category and check the ‘Sound/Interview, lecture, talk’ format facet. To limit results to oral histories that have been digitised, you can &lt;a href=&#34;https://trove.nla.gov.au/search/category/music?keyword=%22nla.obj%22&amp;amp;l-format=Sound%2FInterview,%20lecture,%20talk&amp;amp;l-availability=y&#34;&gt;add &lt;code&gt;“nla.obj”&lt;/code&gt; to your query and set the ‘Access’ facet to ‘Online’&lt;/a&gt;. But what’s actually &lt;em&gt;in&lt;/em&gt; the oral history collection and what can you do with it?&lt;/p&gt;
&lt;p&gt;To help researchers explore and analyse the NLA’s oral history collection, I’ve added some notebooks to the &lt;a href=&#34;https://glam-workbench.net/trove-music/&#34;&gt;Music, sound, and oral histories&lt;/a&gt; section of the GLAM Workbench:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/harvest-oral-histories/&#34;&gt;Harvest oral histories metadata&lt;/a&gt; – harvests metadata describing the NLA&amp;rsquo;s oral history collection from Trove and saves the results as a CSV file&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/save-series/&#34;&gt;Save a list of oral history collections and projects&lt;/a&gt; – extracts a list of series from metadata describing oral histories held by the NLA and described in Trove&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/download-transcripts/&#34;&gt;Download summaries and transcripts from oral histories&lt;/a&gt; – download all the available transcripts and summaries from digitised oral histories available in Trove&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are also a couple of associated datasets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/trove-oral-histories/&#34;&gt;NLA oral histories metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/trove-oral-history-series/&#34;&gt;List of NLA oral history collections and projects&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;em&gt;Trove Data Guide&lt;/em&gt; uses these datasets &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/overview.html&#34;&gt;to create an overview of the collection&lt;/a&gt;. For example, here’s how the oral histories are distributed over time.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/visualization42.png&#34; width=&#34;600&#34; height=&#34;276&#34; alt=&#34;Chart showing the number of oral histories per year and online status&#34;&gt;
&lt;p&gt;And here’s the top ten subjects of digitised oral histories.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;subject&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Painters &amp;ndash; Australia &amp;ndash; Interviews&lt;/td&gt;
&lt;td&gt;193&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Politicians &amp;ndash; Australia&lt;/td&gt;
&lt;td&gt;192&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prime ministers &amp;ndash; Australia &amp;ndash; Quotations&lt;/td&gt;
&lt;td&gt;188&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Older people &amp;ndash; New South Wales &amp;ndash; Biography&lt;/td&gt;
&lt;td&gt;187&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Menzies, Robert, Sir, 1894-1978. Speeches&lt;/td&gt;
&lt;td&gt;185&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Federal politicians&lt;/td&gt;
&lt;td&gt;184&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Politicians &amp;ndash; Australia &amp;ndash; Quotations&lt;/td&gt;
&lt;td&gt;183&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Australia &amp;ndash; Politics and government &amp;ndash; 1945-1965&lt;/td&gt;
&lt;td&gt;172&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Politicians &amp;ndash; Australia &amp;ndash; Interviews&lt;/td&gt;
&lt;td&gt;171&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Academics&lt;/td&gt;
&lt;td&gt;126&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The Trove Data Guide also includes information on &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/accessing-data.html&#34;&gt;the types of data from the oral histories and how you can access it&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Mapping MARC Geographic Area codes to Wikidata</title>
      <link>https://updates.timsherratt.org/2024/01/03/mapping-marc-geographic.html</link>
      <pubDate>Wed, 03 Jan 2024 18:03:48 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/01/03/mapping-marc-geographic.html</guid>
      <description>&lt;p&gt;Trove uses codes from the &lt;a href=&#34;https://www.loc.gov/marc/geoareas/gacs_code.html&#34;&gt;MARC Geographic Areas list&lt;/a&gt; to identify locations in metadata records. I couldn&amp;rsquo;t find any mappings of these codes to other sources of geospatial information, so I fired up &lt;a href=&#34;https://openrefine.org/&#34;&gt;OpenRefine&lt;/a&gt; and reconciled the geographic area names against &lt;a href=&#34;https://www.wikidata.org/wiki/Wikidata:Main_Page&#34;&gt;Wikidata&lt;/a&gt;. Once I&amp;rsquo;d linked as many as possible, I copied additional information from Wikidata, such as &lt;a href=&#34;https://en.wikipedia.org/wiki/List_of_ISO_3166_country_code&#34;&gt;ISO country codes&lt;/a&gt;, &lt;a href=&#34;https://www.geonames.org/&#34;&gt;GeoNames&lt;/a&gt; identifiers, and geographic coordinates.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve saved the resulting dataset in two formats – as a flattened CSV file (handy for loading as a dataframe), and as a JSON file that uses the geographic area codes as keys (handy for looking up values). You can &lt;a href=&#34;https://github.com/GLAM-Workbench/marc-geographicareas&#34;&gt;download the datasets from this GitHub repository&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve also written the codes back into the Wikidata records, so you can now find them with a &lt;a href=&#34;https://w.wiki/8iKM&#34;&gt;SPARQL query like this&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The columns in the CSV file are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;code&lt;/code&gt; – MARC geographic areas code (without any trailing dashes)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;place&lt;/code&gt; – name of geographic area from the MARC list&lt;/li&gt;
&lt;li&gt;&lt;code&gt;wikidata_label&lt;/code&gt; – name of geographic area from Wikidata&lt;/li&gt;
&lt;li&gt;&lt;code&gt;wikidata_id&lt;/code&gt; – Wikidata identifier&lt;/li&gt;
&lt;li&gt;&lt;code&gt;coordinates&lt;/code&gt; – pair of decimal coordinates in the form &lt;code&gt;latitude,longitude&lt;/code&gt; (multiple values are pipe &lt;code&gt;|&lt;/code&gt; separated)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;iso_country_code&lt;/code&gt; – ISO two letter country code (multiple values are pipe &lt;code&gt;|&lt;/code&gt; separated)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;iso_numeric_country_code&lt;/code&gt; – ISO numeric country code (multiple values are pipe &lt;code&gt;|&lt;/code&gt; separated)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;geonames_id&lt;/code&gt; – GeoNames identifier (multiple values are pipe &lt;code&gt;|&lt;/code&gt; separated)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Note that some fields can contain multiple values. For example the area &lt;code&gt;Mediterranean Region&lt;/code&gt; is linked to 22 countries, so there will be multiple values in the ISO code fields.&lt;/p&gt;
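If you're loading the CSV yourself, the pipe-separated fields are easy to unpack into lists. A small stdlib sketch — the sample row is invented, not taken from the dataset, and only the column names come from the list above:

```python
import csv
import io

# Invented sample mirroring the CSV layout described above, with a
# pipe-separated multi-value field.
sample = (
    "code,place,iso_country_code\n"
    "mm,Mediterranean Region,ES|FR|IT\n"
)
multi_value = {"iso_country_code"}  # fields that can hold multiple values

rows = []
for row in csv.DictReader(io.StringIO(sample)):
    for field in multi_value:
        # Split on the pipe separator; empty fields become empty lists.
        row[field] = row[field].split("|") if row[field] else []
    rows.append(row)
```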
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-01-03-18-01-07.png&#34; width=&#34;600&#34; height=&#34;335&#34; alt=&#34;Choropleth map showing the countries associated with oral history records&#34;&gt;
&lt;p&gt;For an example of this dataset in use, see &lt;a href=&#34;https://tdg.glam-workbench.net/other-digitised-resources/oral-histories/overview.html#which-countries-do-the-oral-histories-relate-to&#34;&gt;Which countries do the oral histories relate to?&lt;/a&gt; in the &lt;em&gt;Trove Data Guide&lt;/em&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>National Archives of Australia in 2023 – digitisation of files</title>
      <link>https://updates.timsherratt.org/2024/01/03/national-archives-of.html</link>
      <pubDate>Wed, 03 Jan 2024 11:18:21 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/01/03/national-archives-of.html</guid>
      <description>&lt;p&gt;In 2023 the National Archives of Australia digitised 416,602 files (down from 575,597 in 2022). This chart shows the number of files digitised per day in 2023.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-01-03-10-59-46.png&#34; width=&#34;600&#34; height=&#34;250&#34; alt=&#34;Chart showing the number of files digitised per day across 2023&#34;&gt;
&lt;p&gt;These files were drawn from 1,423 different series, but the vast bulk (81%) were from 4 series of World War Two service records. (This &lt;a href=&#34;https://www.naa.gov.au/about-us/media-and-publications/media-releases/large-scale-effort-sees-1-million-second-world-war-records-digitised&#34;&gt;media release&lt;/a&gt; includes some details about the funding of the WW2 digitisation.)&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the top twenty series by number of items digitised in 2023.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;series&lt;/th&gt;
&lt;th&gt;series_title&lt;/th&gt;
&lt;th&gt;total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;B883&lt;/td&gt;
&lt;td&gt;Second Australian Imperial Force Personnel Dossiers, 1939-1947&lt;/td&gt;
&lt;td&gt;201,511&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A9301&lt;/td&gt;
&lt;td&gt;RAAF Personnel files of Non-Commissioned Officers (NCOs) and other ranks, 1921-1948&lt;/td&gt;
&lt;td&gt;111,673&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A9300&lt;/td&gt;
&lt;td&gt;RAAF Officers Personnel files, 1921-1948&lt;/td&gt;
&lt;td&gt;14,125&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B884&lt;/td&gt;
&lt;td&gt;Citizen Military Forces Personnel Dossiers, 1939-1947&lt;/td&gt;
&lt;td&gt;11,265&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A14435&lt;/td&gt;
&lt;td&gt;Stanley Fowler photographs showing the Australian fishing industry and coastline, numerical series with ‘LA’ prefix&lt;/td&gt;
&lt;td&gt;10,512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D3481&lt;/td&gt;
&lt;td&gt;Photographs (black and white, colour) of buildings, installations, sites, etc&lt;/td&gt;
&lt;td&gt;8,295&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A1&lt;/td&gt;
&lt;td&gt;Correspondence files, annual single number series [Main correspondence files series of the agency]&lt;/td&gt;
&lt;td&gt;7,571&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K1145&lt;/td&gt;
&lt;td&gt;Certificates of Exemption from Dictation Test, annual certificate number order&lt;/td&gt;
&lt;td&gt;4,169&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A13150&lt;/td&gt;
&lt;td&gt;Specifications, examiners reports and correspondence relating to the Registration of Victorian Patents - Second system&lt;/td&gt;
&lt;td&gt;3,941&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;J853&lt;/td&gt;
&lt;td&gt;Architectural plans, annual single number series with alpha (denoting Papua New Guinea and discipline) prefix and/or alpha/numeric (denoting size and amendment) suffix&lt;/td&gt;
&lt;td&gt;3,322&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A2571&lt;/td&gt;
&lt;td&gt;Name Index Cards, Migrants Registration [Bonegilla]&lt;/td&gt;
&lt;td&gt;2,204&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D1423&lt;/td&gt;
&lt;td&gt;Original plans (negatives), single number series with alpha prefix denoting discipline&lt;/td&gt;
&lt;td&gt;2,102&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AP67/1&lt;/td&gt;
&lt;td&gt;Personal documents of British migrants (including ex-service) in receipt of free and assisted passages&lt;/td&gt;
&lt;td&gt;2,058&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E1652&lt;/td&gt;
&lt;td&gt;Northern Territory Pastoral Applications (Pastoral Claims)&lt;/td&gt;
&lt;td&gt;1,822&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D5440&lt;/td&gt;
&lt;td&gt;Photographs of post office buildings, personnel and equipment, single number series (with variations)&lt;/td&gt;
&lt;td&gt;1,488&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C609&lt;/td&gt;
&lt;td&gt;Payment cards for employees&#39; entitlements claims, alphabetical series&lt;/td&gt;
&lt;td&gt;1,317&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B3104&lt;/td&gt;
&lt;td&gt;Photographs, Trans-Australian Railway, single number series&lt;/td&gt;
&lt;td&gt;1,222&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MP1117/2&lt;/td&gt;
&lt;td&gt;Microfilm reels of RAAF Engineering Drawings&lt;/td&gt;
&lt;td&gt;1,168&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A2572&lt;/td&gt;
&lt;td&gt;Name Index Cards, Migrants Registration [Bonegilla]&lt;/td&gt;
&lt;td&gt;1,130&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B6295&lt;/td&gt;
&lt;td&gt;Photographs and negatives of Commonwealth building sites and Works departmental activities, single number series&lt;/td&gt;
&lt;td&gt;1,113&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For more data, see the &lt;a href=&#34;https://github.com/wragge/naa-recently-digitised&#34;&gt;naa-recently-digitised&lt;/a&gt; GitHub repository which runs a process every Sunday to save details of files digitised in the previous week.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Trove newspapers in 2023</title>
      <link>https://updates.timsherratt.org/2024/01/02/trove-newspapers-in.html</link>
      <pubDate>Tue, 02 Jan 2024 17:56:58 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2024/01/02/trove-newspapers-in.html</guid>
      <description>&lt;p&gt;I’ve been capturing weekly snapshots of the Trove newspaper corpus for the last couple of years. You can see the latest results in the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;Trove Newspaper Data Dashboard&lt;/a&gt;. Using this data I’ve compiled a quick summary of changes over the last year.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;7,518,764&lt;/strong&gt; digitised newspaper articles were added to Trove in 2023. The total number of articles increased from 236,530,127 to 244,048,891. The chart below shows how the number of articles varied across the year. You&amp;rsquo;ll notice that the rate of digitisation increased about the same time the government announced new funding for Trove. Were more articles digitised because of the funding, or were articles in the digitisation pipeline held back until the funding was announced? Or both?&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2023-12-31-17-57-21.png&#34; width=&#34;600&#34; height=&#34;306&#34; alt=&#34;Number of digitised newspaper articles in Trove by week, 2023&#34;&gt;
&lt;p&gt;Most of the new articles were published in either Victoria or NSW – both these states had an increase of more than 3 million articles each! There were smaller increases for WA and SA. This chart shows the distribution of articles by state.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2023-12-31-18-37-59.png&#34; width=&#34;600&#34; height=&#34;239&#34; alt=&#34;Change in the number of articles in 2023 by state&#34;&gt;
&lt;p&gt;Fifty-seven new newspaper titles were added to Trove in 2023:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1832&#34;&gt;Bairnsdale Advertiser and East Gippsland Stock and Station Journal (Vic. : 1946 - 1954)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1864&#34;&gt;Cessnock Express and Mining and Farming Representative (NSW : 1905 - 1910; 1925 - 1928)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1824&#34;&gt;Coolamon Echo (NSW : 1898 - 1905)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1867&#34;&gt;Coolamon Farmers&#39; Review (NSW : 1910 - 1917)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1866&#34;&gt;Coolamon-Ganmain Farmers&#39; Review (NSW : 1906 - 1910)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1846&#34;&gt;Daily Commercial News and Shipping List (Perth, WA : 1927 - 1934)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1852&#34;&gt;Daily Mirror (Sydney, NSW : 1941 - 1955)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1869&#34;&gt;Direct Action (Adelaide, SA : 1928 - 1930)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1849&#34;&gt;Dubbo Dispatch (NSW : 1942)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1841&#34;&gt;Essendon Gazette (Moonee Ponds, Vic. : 1900 - 1905)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1826&#34;&gt;Essendon and Flemington Chronicle (Vic. : 1882; 1884 - 1894)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1851&#34;&gt;Hills Gazette (Port Adelaide, SA: 1973 - 1984)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1838&#34;&gt;Kyabram Free Press (Vic. : 1892 - 1894)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1293&#34;&gt;Lawloit Times (Kaniva, Vic. : 1910 - 1929) &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1828&#34;&gt;Myrtleford Times and Ovens Valley Advertiser (Vic. : 1930 - 1955)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1856&#34;&gt;Northern Planter (Ingham, Qld. : 1907 - 1908)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1835&#34;&gt;Omeo Standard (Vic. : 1928)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1860&#34;&gt;Pastoral Times and Deniliquin and Echuca Chronicle (NSW : 1862)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1812&#34;&gt;Pastoral Times and Echuca and Moama Chronicle (Deniliquin, NSW : 1863 - 1866)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1829&#34;&gt;Rutherglen Miner and Howlong and Wahgunyah Times (Vic. : 1903 - 1912)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1854&#34;&gt;South Sydney News (NSW : 1940)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1853&#34;&gt;South Sydney Sentinel (NSW : 1932 - 1935)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1842&#34;&gt;Sportsman (Perth, WA : 1903)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1865&#34;&gt;The Araluen Star and Miners&#39; Right (NSW : 1863 - 1864)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1833&#34;&gt;The Bairnsdale Liberal News and North Gippsland District Advertiser (Vic. : 1879 - 1880)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1878&#34;&gt;The Braidwood Express and People&amp;rsquo;s Advocate (NSW : 1904 - 1907)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1850&#34;&gt;The Braidwood and Araluen Express (NSW : 1899 - 1907)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1877&#34;&gt;The Braidwood and Araluen Express and People&amp;rsquo;s Advocate (NSW : 1899 - 1904)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1831&#34;&gt;The Corryong Courier and Walwa District News (Vic. : 1946 - 1952)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1827&#34;&gt;The Essendon Gazette and Keilor, Bulla and Broadmeadows Reporter (Moonee Ponds, Vic. : 1888 - 1900)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1834&#34;&gt;The Gippsland Daily News (Bairnsdale, Vic. : 1890 - 1894)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1870&#34;&gt;The Gippsland Farmers&#39; and Glengarry, Toongabbie and Cowwarr Journal (Traralgon, Vic. : 1922 - 1923)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1848&#34;&gt;The Hampden Guardian and Western Province Advertiser (Camperdown, Vic. : 1871 - 1872 ; 1874 - 1877)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1840&#34;&gt;The Irishman (Melbourne, Vic. : 1872 - 1873)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1871&#34;&gt;The Journal : Glengarry, Toongabbie and Cowwarr Journal (Traralgon, Vic. : 1923 - 1929)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1815&#34;&gt;The Manning River Chronicle (Wingham, NSW : 1886 - 1888) &lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1830&#34;&gt;The North Eastern Despatch (Wangaratta, Vic. : 1907 - 1913)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1813&#34;&gt;The Pastoral Times (South Deniliquin, NSW : 1866 - 1950)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1858&#34;&gt;The Pastoral Times : incorporated with the Southern Courier (South Deniliquin, NSW : 1861)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1857&#34;&gt;The Pastoral Times and Deniliquin Telegraph (Deniliquin, NSW : 1859 - 1861)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1861&#34;&gt;The Pastoral Times and Deniliquin and Moama Reporter (NSW : 1863)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1859&#34;&gt;The Pastoral Times and Southern Courier (Deniliquin, N.S.W : 1861 - 1862)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1874&#34;&gt;The People&amp;rsquo;s Weekly (Moonta, SA : 1890 - 1926)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1727&#34;&gt;The Rutherglen Sun (Vic. : 1885)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1728&#34;&gt;The Rutherglen Sun and Murray Valley Advertiser (Vic. : 1886)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1836&#34;&gt;The Shepparton Advertiser and Moira and Rodney Farmers&#39; Chronicle (Vic. : 1886 - 1887)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1511&#34;&gt;The Skipton Standard and Streatham Gazette (Vic. : 1914 - 1928)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1843&#34;&gt;The Sportsman (Perth, WA : 1904)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1873&#34;&gt;The Standard (Port Adelaide, SA : 1959 - 1965)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1512&#34;&gt;The Tatura Guardian (Vic. : 1895 - 1903)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1823&#34;&gt;The Wingham Chronicle and Manning Advertiser (NSW : 1888)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1862&#34;&gt;The Yarrawonga Mercury and Lake Rowan, Tungamah and Mulwala (N.S.W.) News (Vic. : 1882)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1863&#34;&gt;The Yarrawonga Mercury and Mulwala (N.S.W.) News (Vic. : 1882 - 1897)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1825&#34;&gt;The Yarrawonga Mercury and Southern Riverina Advertiser (Vic. : 1897 - 1905; 1913 - 1920)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1837&#34;&gt;Tungamah and Lake Rowan Express (Vic. : 1882 - 1883)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1847&#34;&gt;Western Press and Camperdown, Colac, Mortlake and Terang Representative (Vic. : 1866-1867 ; 1870)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/1839&#34;&gt;Woodend Star and Macedon Advocate (Vic. : 1942 - 1955)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Trove Data Guide update – accessing data from newspapers and gazettes</title>
      <link>https://updates.timsherratt.org/2023/09/15/trove-data-guide.html</link>
      <pubDate>Fri, 15 Sep 2023 15:53:20 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/09/15/trove-data-guide.html</guid>
      <description>&lt;p&gt;I’m continuing to slog away at the &lt;a href=&#34;https://wragge.github.io/trove-data-guide/home.html&#34;&gt;Trove Data Guide&lt;/a&gt; (part of the &lt;a href=&#34;https://ardc.edu.au/project/ardc-community-data-lab/&#34;&gt;ARDC’s HASS Community Data Lab&lt;/a&gt;) – dumping everything I know about Trove into a format that I hope will be useful for researchers.&lt;/p&gt;
&lt;p&gt;I’ve just finished a first pass through the section on &lt;a href=&#34;https://wragge.github.io/trove-data-guide/accessing-data/newspapers-and-gazettes.html&#34;&gt;accessing data from newspapers and gazettes&lt;/a&gt;, and it’s online if you want to have a look. There are still lots of things to add, update, and reorganise, but getting the basic content of the section defined is a bit of a milestone, so I’ll allow myself a little moment of celebration. Yay!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/screenshot-from-2023-09-15-14-47-24.png&#34; width=&#34;600&#34; height=&#34;466&#34; alt=&#34;Screencapture from the Trove Data Guide&#34;&gt;
&lt;p&gt;Of course it took longer than I expected, but that’s largely due to the fact that I was sketching out related sections as I went along. You’ll see lots of pages in the navigation that only contain a list of dot points, but they’ll get filled out over the next couple of months.&lt;/p&gt;
&lt;p&gt;The ‘Accessing data’ section is going to be the most code heavy, as it’s focused on using the API to develop reusable methods for harvesting machine-readable data. Other sections, such as ‘Understanding search’ and ‘Collections and contexts’ will be more discursive, aimed at helping all Trove users better understand what Trove is and how it works.&lt;/p&gt;
&lt;p&gt;Comments and suggestions are welcome! You can &lt;a href=&#34;https://github.com/wragge/trove-data-guide/issues&#34;&gt;add issues on GitHub&lt;/a&gt;, or &lt;a href=&#34;https://web.hypothes.is/&#34;&gt;use Hypothes.is&lt;/a&gt; to annotate the text.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Some important updates for the Trove Newspaper &amp; Gazette Harvester </title>
      <link>https://updates.timsherratt.org/2023/08/31/some-important-updates.html</link>
      <pubDate>Thu, 31 Aug 2023 18:00:54 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/08/31/some-important-updates.html</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-api-v3/&#34;&gt;Version 3&lt;/a&gt; of the Trove API is out, and version 2 is scheduled to be decommissioned in 2024 – that means I have a lot of code to update! First cab off the rank is the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-harvester/&#34;&gt;Trove Newspaper &amp;amp; Gazette Harvester&lt;/a&gt; with version &lt;a href=&#34;https://github.com/wragge/trove-newspaper-harvester/releases/tag/v0.7.1&#34;&gt;0.7.1&lt;/a&gt; now available.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/screenshot-from-2023-08-31-16-54-15.png&#34; width=&#34;600&#34; height=&#34;543&#34; alt=&#34;Screenshot of the Trove Newspaper and Gazette Harvester documentation page.&#34;&gt;
&lt;p&gt;The Harvester is a Python package that can be used as either a library or a command-line tool. It’s been around in some form for more than 10 years. The latest updates include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;support for version 3 of the Trove API&lt;/li&gt;
&lt;li&gt;automatic creation of a metadata file describing each harvest according to the &lt;a href=&#34;https://www.researchobject.org/ro-crate/&#34;&gt;RO-Crate&lt;/a&gt; format&lt;/li&gt;
&lt;li&gt;automatic creation of a harvester config file, capturing the query parameters sent to Trove as well as the Harvester options&lt;/li&gt;
&lt;li&gt;the ability to initiate a harvest from an existing config file&lt;/li&gt;
&lt;li&gt;more memory-friendly generation of CSV result files (no loading everything into Pandas)&lt;/li&gt;
&lt;/ul&gt;
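&lt;p&gt;The memory-friendly CSV generation boils down to streaming – write each row as it’s read, rather than accumulating everything in a dataframe first. A minimal sketch of the idea (with hypothetical, made-up fields – this isn’t the Harvester’s actual code):&lt;/p&gt;

```python
import csv
import io
import json

def ndjson_to_csv(ndjson_lines, csv_file, fields):
    # Stream one record at a time instead of loading the whole harvest
    # into a dataframe, so memory use stays constant however big the file is.
    writer = csv.DictWriter(csv_file, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for line in ndjson_lines:
        writer.writerow(json.loads(line))

# Hypothetical records with made-up fields; a real harvest has many more.
records = [
    '{"id": "1", "heading": "FLOODS.", "category": "Article"}',
    '{"id": "2", "heading": "SHIPPING.", "category": "Advertising"}',
]
out = io.StringIO()
ndjson_to_csv(records, out, ["id", "heading", "category"])
print(out.getvalue())
```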
&lt;p&gt;The &lt;a href=&#34;https://www.researchobject.org/ro-crate/&#34;&gt;RO-Crate&lt;/a&gt; integration was part of my work for the ARDC’s &lt;a href=&#34;https://ardc.edu.au/project/hass-community-data-lab/&#34;&gt;HASS Community Data Lab&lt;/a&gt;. The Harvester was already generating a simple metadata file that captured &lt;em&gt;some&lt;/em&gt; of the harvest parameters, but now it documents the context of the harvest in much more detail, and saves it in a standard, Linked Open Data based, format.&lt;/p&gt;
&lt;p&gt;Every harvest now creates an &lt;code&gt;ro-crate-metadata.json&lt;/code&gt; file. This  file includes details of the datasets created by the Harvester, such as the &lt;code&gt;results.csv&lt;/code&gt; file that includes article metadata, and the &lt;code&gt;text&lt;/code&gt; directory that contains the OCRd text. It also captures contextual information about the Harvester itself. The Harvester and the datasets are linked through a &lt;code&gt;CreateAction&lt;/code&gt; that describes the harvesting process. The &lt;code&gt;harvester_config.json&lt;/code&gt; file that saves the query parameters and Harvester options is also linked as an input to this process. In this way, all the components of the harvest are described and linked.&lt;/p&gt;
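&lt;p&gt;In outline, that linking can be sketched like this (a hypothetical, heavily abridged example following RO-Crate 1.1 conventions – not the Harvester’s actual output):&lt;/p&gt;

```python
import json

# Heavily abridged sketch of an ro-crate-metadata.json structure: the root
# dataset lists the harvested files, and a CreateAction links the Harvester
# (instrument), the config file (object), and the datasets (result).
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
        {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "results.csv"}, {"@id": "text/"}]},
        {"@id": "results.csv", "@type": "File", "description": "Article metadata"},
        {"@id": "text/", "@type": "Dataset", "description": "OCRd text"},
        {
            "@id": "#harvest",
            "@type": "CreateAction",
            "instrument": {"@id": "https://github.com/wragge/trove-newspaper-harvester"},
            "object": {"@id": "harvester_config.json"},
            "result": [{"@id": "results.csv"}, {"@id": "text/"}],
        },
    ],
}
print(json.dumps(crate, indent=2))
```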
&lt;p&gt;Here’s an &lt;a href=&#34;https://gist.github.com/wragge/23f88bf4c5945174a0986dad91691b06&#34;&gt;example RO-Crate file&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Trove is changing all the time. By capturing information such as the query, the harvester version, the date, and the number of results, the RO-Crate file will help researchers document, manage, and share their research. And now that you can start a new harvest with an existing config file, it’s easy for researchers to re-run a harvest to see what changes over time.&lt;/p&gt;
&lt;p&gt;As well as updating the Python package, I’ve also updated the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper &amp;amp; Gazette Harvester&lt;/a&gt; section of the GLAM Workbench. Here you’ll find examples of the Harvester in action, as well as some ways of exploring the harvested data. If you’d like to take the Harvester for a spin, the easiest way to start is with the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/harvester-web-app/&#34;&gt;web app version&lt;/a&gt; – no software to install, no code to navigate! If you’re an Australian university researcher you can spin it up on the new &lt;a href=&#34;https://updates.timsherratt.org/2023/08/31/run-glam-workbench.html&#34;&gt;ARDC Binder service&lt;/a&gt; in seconds.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Run GLAM Workbench notebooks on the ARDC’s new Binder service</title>
      <link>https://updates.timsherratt.org/2023/08/31/run-glam-workbench.html</link>
      <pubDate>Thu, 31 Aug 2023 13:52:11 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/08/31/run-glam-workbench.html</guid>
      <description>&lt;p&gt;There are a number of different ways to &lt;a href=&#34;https://glam-workbench.net/getting-started/#using-the-glam-workbench&#34;&gt;run the Jupyter notebooks&lt;/a&gt; in the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; depending on your needs and technical skills. But the easiest and quickest has always been the public, &lt;a href=&#34;https://mybinder.org/&#34;&gt;international Binder service&lt;/a&gt;, based in Europe. One click in the GLAM Workbench and Binder prepares a customised computing environment and loads up the Jupyter notebooks ready for you to explore. Unfortunately, the public Binder service has been having some capacity issues in the last few months, and sometimes repositories fail to run. The good news is that Australian university researchers now have another option with the launch of the &lt;a href=&#34;https://ardc.edu.au/services/ardc-nectar-research-cloud/ardc-binderhub-service/&#34;&gt;Australian Research Data Commons Binder service&lt;/a&gt;!&lt;/p&gt;
&lt;p&gt;The big difference between the ARDC’s Binder service and the international version is that you need to log in using your university credentials. While that’s an extra hassle, the service itself should be faster and more reliable for Australian researchers. For this reason, I’ve started making ARDC Binder links the default in a number of GLAM Workbench sections. Of course, not all GLAM Workbench users are attached to Australian universities, so the international Binder links remain – it’s just a matter of emphasis.&lt;/p&gt;
&lt;p&gt;For example, near the top of many GLAM Workbench pages you’ll see &lt;strong&gt;Explore live on Binder&lt;/strong&gt; buttons that launch the current repository. I’ve now added an &lt;strong&gt;Explore live on ARDC Binder&lt;/strong&gt; option.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/screenshot-from-2023-08-31-12-36-10.png&#34; width=&#34;600&#34; alt=&#34;Screenshot from GLAM Workbench page showing two buttons –  the first is labelled *Explore live on ARDC Binder*, while the second is *Explore live on Binder*&#34;&gt;
&lt;p&gt;Most notebooks in the GLAM Workbench now have their own documentation page with a big blue button to launch the notebook on Binder. I’ve started changing these buttons to use the ARDC Binder service but, as you can see in the screenshot below, there’s also a link to run the notebook on the original Binder service, with no authentication required.&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2023/f23a54bfbf.png&#34; width=&#34;600&#34; height=&#34;397&#34; alt=&#34;Screenshot of a notebook documentation page showing the big blue *Run live on ARDC Binder button*&#34;&gt;
&lt;p&gt;I’ve added some &lt;a href=&#34;https://glam-workbench.net/using-ardc-binder/&#34;&gt;information on using the ARDC Binder service&lt;/a&gt; to the GLAM Workbench help pages.&lt;/p&gt;
&lt;p&gt;I’ll be continuing to explore new options for running GLAM Workbench notebooks (I’m particularly interested in the possibilities of &lt;a href=&#34;https://jupyterlite.readthedocs.io/en/latest/&#34;&gt;Jupyter Lite&lt;/a&gt;). Also the ARDC’s &lt;a href=&#34;https://ardc.edu.au/project/ardc-community-data-lab/&#34;&gt;HASS Community Data Lab project&lt;/a&gt; is currently investigating ways of adding more authentication options to the Binder service to open it up to researchers outside of universities.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Trove Query Parser updated!</title>
      <link>https://updates.timsherratt.org/2023/08/26/trove-query-parser.html</link>
      <pubDate>Sat, 26 Aug 2023 18:21:23 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/08/26/trove-query-parser.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve just updated the &lt;a href=&#34;https://github.com/wragge/trove_query_parser/&#34;&gt;Trove Query Parser&lt;/a&gt; to work with version 3 of the Trove API. You just give it the url of a search in Trove&amp;rsquo;s newspapers, and it translates the search into a set of parameters that the API will understand. So this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;parse_query(&amp;quot;https://trove.nla.gov.au/search/category/newspapers?keyword=wragge&amp;amp;l-artType=newspapers&amp;amp;l-state=Queensland&amp;amp;l-category=Article&amp;amp;l-illustrationType=Cartoon&amp;quot;, 3)&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Produces this:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;{&#39;q&#39;: &#39;wragge&#39;, &#39;l-artType&#39;: &#39;newspapers&#39;, &#39;l-state&#39;: [&#39;Queensland&#39;], &#39;l-category&#39;: [&#39;Article&#39;], &#39;l-illustrated&#39;: &#39;true&#39;, &#39;l-illustrationType&#39;: [&#39;Cartoon&#39;], &#39;category&#39;: &#39;newspaper&#39;}&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;You can then feed the parameters to the Trove API with your API key and you&amp;rsquo;ll get data back. Easy! It&amp;rsquo;s simple but handy – I use the Query Parser in other tools like the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper Harvester&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/querypic/&#34;&gt;QueryPic&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This version adds a second parameter to the &lt;code&gt;parse_query()&lt;/code&gt; function so you can specify the version of the Trove API you&amp;rsquo;re using. The default value is &lt;code&gt;2&lt;/code&gt; for backwards compatibility. See &lt;a href=&#34;https://wragge.github.io/trove_query_parser/&#34;&gt;the documentation&lt;/a&gt; for more information.&lt;/p&gt;
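&lt;p&gt;Under the hood it’s doing the kind of parameter mapping you could sketch like this (a simplified illustration of the idea – the real parser handles many more cases):&lt;/p&gt;

```python
# Simplified illustration of translating Trove web-interface search
# parameters into API parameters; not the trove_query_parser source.
def web_to_api(web_params):
    api = {"category": "newspaper"}
    for key, value in web_params.items():
        if key == "keyword":
            api["q"] = value  # free-text search becomes the API's 'q' parameter
        elif key == "l-artType":
            api[key] = value  # single-valued facet
        elif key.startswith("l-"):
            api[key] = [value]  # most facets can take multiple values
    return api

print(web_to_api({"keyword": "wragge", "l-artType": "newspapers", "l-state": "Queensland"}))
```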
</description>
    </item>
    
    <item>
      <title>Family history resources in the GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2023/08/18/family-history-resources.html</link>
      <pubDate>Fri, 18 Aug 2023 15:06:21 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/08/18/family-history-resources.html</guid>
      <description>&lt;p&gt;It’s Family History Month, so I thought a brief post was in order describing some of the family history related resources in the GLAM Workbench.&lt;/p&gt;
&lt;h2 id=&#34;glam-name-index-search&#34;&gt;GLAM Name Index Search&lt;/h2&gt;
&lt;p&gt;This is the biggie (in more ways than one). I’ve brought 263 datasets from 10 Australian GLAM organisations together into a single search interface. All these datasets index collections by people’s names, so with one search you can find information about individuals across a broad range of records, locations, and periods. There are more than 10 million rows of data to explore!&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2023/screenshot-from-2023-08-18-13-19-53.png&#34; width=&#34;600&#34; height=&#34;809&#34; alt=&#34;Screenshot of the home page of the GLAM Name Index Search showing the list of GLAM organisations contributing data.&#34;&gt;
&lt;h2 id=&#34;nsw-post-office-and-sydney-telephone-directories&#34;&gt;NSW Post Office and Sydney Telephone Directories&lt;/h2&gt;
&lt;p&gt;Many volumes of the NSW Post Office and Sydney Telephone Directories have been digitised and made available through Trove. However, they’re not easy to search. I’ve taken the text from these volumes and indexed it by line to make it easier to find people and places. There are two databases to explore:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/nsw-post-office-directories/&#34;&gt;&lt;strong&gt;New South Wales Post Office Directories&lt;/strong&gt;&lt;/a&gt; (54 volumes from 1886 to 1950)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/sydney-telephone-directories/&#34;&gt;&lt;strong&gt;Sydney Telephone Directories&lt;/strong&gt;&lt;/a&gt; (44 volumes from 1926 to 1954)&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/8c1b80625f.png&#34; width=&#34;600&#34; height=&#34;335&#34; alt=&#34;Screenshot of the NSW Post Office Directories search, showing how matching results on a page are displayed.&#34;&gt;
&lt;h2 id=&#34;tasmanian-post-office-directories&#34;&gt;Tasmanian Post Office Directories&lt;/h2&gt;
&lt;p&gt;The Tasmanian Post Office Directories have been digitised by Libraries Tasmania, but each volume is available as a separate PDF, making it difficult to search across the collection. I’ve downloaded all the PDFs, extracted the text and images, and indexed the content by line. Now you can search all &lt;strong&gt;48 volumes, from 1890 to 1948&lt;/strong&gt;, at once!&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/tasmanian-post-office-directories/&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/91aff3faba.png&#34; width=&#34;600&#34; height=&#34;326&#34; alt=&#34;Screenshot of the search screen for Tasmanian Post Office Directories.&#34;&gt;
&lt;h2 id=&#34;trove-places&#34;&gt;Trove Places&lt;/h2&gt;
&lt;p&gt;If your family history research is focused on a particular region, it can be useful to know what newspapers from that region are digitised in Trove. Trove Places is a map interface to Trove’s digitised newspapers – just click on a place to find newspapers published or distributed nearby.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://troveplaces.herokuapp.com/map/&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/cccbceaddd.png&#34; width=&#34;600&#34; height=&#34;316&#34; alt=&#34;Screenshot of Trove Places showing how a search for ‘Walhalla&#39; in Victoria displays a number of markers on the map, each marker providing information about Trove newspapers published or distributed in that location.&#34;&gt;
&lt;h2 id=&#34;save-a-trove-newspaper-article-as-an-image&#34;&gt;Save a Trove newspaper article as an image&lt;/h2&gt;
&lt;p&gt;You might have noticed that Trove’s download option for digitised newspaper articles can slice articles up in unfortunate ways. This simple web app saves the complete article as a single image (or multiple images if the article is split across different pages). Simple, but very useful.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/Save-Trove-newspaper-article-as-image-app/&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2024/screenshot-from-2024-04-18-12-08-39.png&#34; width=&#34;600&#34; height=&#34;410&#34; alt=&#34;Screenshot of the web app to save a newspaper article as an image&#34;&gt;
&lt;h2 id=&#34;trove-newspapers-data-dashboard&#34;&gt;Trove Newspapers Data Dashboard&lt;/h2&gt;
&lt;p&gt;Just about every week more digitised newspaper articles are added to Trove. The search you tried a couple of months ago might now produce different results. How can you keep track of these changes? The Trove Data Dashboard uses weekly snapshots of the digitised newspapers to visualise changes over time. It’s updated every Sunday!&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/screenshot-from-2023-08-18-13-56-41.png&#34; width=&#34;600&#34; height=&#34;470&#34; alt=&#34;Screenshot of Trove Data Dashboard showing how the total number of newspaper articles changes over time.&#34;&gt;
&lt;h2 id=&#34;and-more&#34;&gt;And more!&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;&lt;strong&gt;GLAM Workbench&lt;/strong&gt;&lt;/a&gt; provides a wide range of tools and examples to help you work with data from libraries, archives, and museums.&lt;/p&gt;
&lt;p&gt;The resources above cost money to keep online. If you find them useful, you might like to &lt;a href=&#34;https://github.com/sponsors/wragge&#34;&gt;sponsor me on GitHub&lt;/a&gt; or &lt;a href=&#34;https://www.buymeacoffee.com/wragge&#34;&gt;buy me a coffee&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;2024-update&#34;&gt;2024 update!&lt;/h2&gt;
&lt;p&gt;If you want to know more about using Trove, head to the new &lt;a href=&#34;https://tdg.glam-workbench.net/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;. The Guide explores the different types of data available from Trove – what it is, how you can access it, and what you can do with it. It aims to give users a critical understanding of Trove data, both its limits and its possibilities.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Exploring the front pages of newspapers (10 years on)</title>
      <link>https://updates.timsherratt.org/2023/08/08/exploring-the-front.html</link>
      <pubDate>Tue, 08 Aug 2023 17:28:18 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/08/08/exploring-the-front.html</guid>
      <description>&lt;p&gt;Way back in 2012, I used the &lt;em&gt;brand new&lt;/em&gt; Trove API to download the details of 4 million  articles published on the front pages of newspapers. I did it for two reasons: first, I wanted to see how the content of front pages changed over time; and second, I wanted to show that large-scale data wrangling was entirely possible using nothing more than a laptop and a home broadband connection. I described my adventures in &lt;a href=&#34;https://discontents.com.au/4-million-articles-later/index.html&#34;&gt;this blog post&lt;/a&gt;, but if you look at it now you’ll see lots of sad, empty boxes where live charts used to be. This is because I shared my results though a custom web application which, 10 years later, seems like a really, really dumb idea. Needless to say the app fell foul of web hosting changes and is no more. Nowadays I use GitHub and Jupyter notebooks to share my data noodling, so I thought it was time to revisit the topic of newspaper front pages, and create something a bit more robust.&lt;/p&gt;
&lt;p&gt;If you just want the data and code, head over to the &lt;a href=&#34;https://github.com/wragge/newspaper-front-pages/&#34;&gt;GitHub repository&lt;/a&gt;. I’ve shared the notebooks used to &lt;a href=&#34;https://github.com/wragge/newspaper-front-pages/blob/main/large-harvest-example.ipynb&#34;&gt;harvest&lt;/a&gt;, &lt;a href=&#34;https://github.com/wragge/newspaper-front-pages/blob/main/convert-front-pages-harvest.ipynb&#34;&gt;convert&lt;/a&gt;, and &lt;a href=&#34;https://github.com/wragge/newspaper-front-pages/blob/main/explore-front-pages.ipynb&#34;&gt;explore&lt;/a&gt; the data, as well as two parquet formatted datasets.&lt;/p&gt;
&lt;p&gt;As you might expect, there are a &lt;strong&gt;lot more&lt;/strong&gt; articles now, as new articles and newspapers are added to Trove every week (see the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;Trove Newspaper Data Dashboard&lt;/a&gt; for details). Instead of 4 million articles, I ended up downloading details of more than &lt;strong&gt;19 million articles&lt;/strong&gt;! I used my trusty &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper Harvester&lt;/a&gt; as a library so I could more easily manage the way the data was saved. After several days I had a 14.6GB newline-delimited JSON file. I pared the data down to only the necessary fields, and used DuckDB to create two parquet files – &lt;a href=&#34;https://github.com/wragge/newspaper-front-pages/blob/main/front_pages.parquet&#34;&gt;one with the article data&lt;/a&gt;, and &lt;a href=&#34;https://github.com/wragge/newspaper-front-pages/blob/main/front_pages_totals.parquet&#34;&gt;another that added up the number of words on each page in the different article categories&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One nifty thing about the Trove API is that it tells you the number of words in each newspaper article. Articles are also assigned to a series of categories, such as ‘Article’ (your standard news-type piece) and ‘Advertising’. By adding up the number of words in each category I could explore how the front page’s mix of articles and advertising changed over time. Here, for example, is what happened to the front page of the &lt;em&gt;Sydney Morning Herald&lt;/em&gt;.&lt;/p&gt;
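&lt;p&gt;The aggregation itself is straightforward – something along these lines (a simplified sketch with made-up numbers, not the DuckDB query I actually used):&lt;/p&gt;

```python
from collections import defaultdict

def words_per_category(articles):
    # Total the API-reported word counts for each article category on a page.
    totals = defaultdict(int)
    for article in articles:
        totals[article["category"]] += article["wordCount"]
    return dict(totals)

# Hypothetical front page: two news articles and one block of advertising.
page = [
    {"category": "Article", "wordCount": 1200},
    {"category": "Advertising", "wordCount": 3400},
    {"category": "Article", "wordCount": 800},
]
print(words_per_category(page))  # {'Article': 2000, 'Advertising': 3400}
```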
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2023/screenshot-from-2023-08-08-15-55-41.png&#34; width=&#34;600&#34; height=&#34;411&#34; alt=&#34;Screenshot from Jupyter notebook displaying two charts that show changed in the number of words in articles and advertising on front pages over time&#34;&gt;
&lt;p&gt;The first chart shows the average number of words per page in the ‘Article’ and ‘Advertising’ categories across the full span of the &lt;em&gt;SMH’s&lt;/em&gt; publication run digitised in Trove (advertising is blue, and articles orange). The second focuses on the point where the number of words in articles overtakes the number of words in advertisements. You can see that for most of the publication run, the front page was dominated by advertising. But when change came it was quite abrupt. Sydney-siders awoke on 15 April 1944 to a new-look newspaper. Here’s what the front page looked like at the beginning and end of the period represented by the second chart.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/screenshot-from-2023-08-08-16-06-55.png&#34; width=&#34;600&#34; height=&#34;371&#34; alt=&#34;Images of the SMH’s front page on 1 January 1943 and 31 December 1946&#34;&gt;
&lt;p&gt;Different newspapers have different histories. I’ve created &lt;a href=&#34;https://nbviewer.org/github/wragge/newspaper-front-pages/blob/main/explore-front-pages.ipynb&#34;&gt;a notebook that generates more of these visualisations&lt;/a&gt; for a range of newspapers, and tries to provide a bit more context around the changes taking place.&lt;/p&gt;
&lt;p&gt;As I continue my work on the &lt;a href=&#34;https://wragge.github.io/trove-data-guide/home.html&#34;&gt;Trove Data Guide&lt;/a&gt; for the &lt;a href=&#34;https://ardc.edu.au/project/hass-community-data-lab/&#34;&gt;HASS Community Data Lab&lt;/a&gt;, I’m discovering more and more inconsistencies, oddities, and undocumented possibilities. As I was finishing up the work on front pages, I realised there was another way of exploring the same topic – by looking at the &lt;em&gt;space&lt;/em&gt; articles take up. This data isn’t in the API, but it can be scraped from the web site. Hmmm, interesting…&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Trove API Console updates</title>
      <link>https://updates.timsherratt.org/2023/07/18/trove-api-console.html</link>
      <pubDate>Tue, 18 Jul 2023 12:27:11 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/07/18/trove-api-console.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://troveconsole.herokuapp.com/&#34;&gt;Trove API Console&lt;/a&gt; provides examples of the Trove API in action that you can run, edit, and share. It’s been online for 9 years now, and I’ve just updated it to use version 3 of the Trove API by default. I’ve also added a new ‘Share’ button that makes it easier to share and embed examples.&lt;/p&gt;
&lt;p&gt;If you click on the ‘Share’ button, a box will pop up.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/screenshot-from-2023-07-18-10-55-23.png&#34; width=&#34;600&#34; height=&#34;256&#34; alt=&#34;Screenshot of the Share dialogue box showing a text box where you can enter a comment and two buttons: ‘Copy share url’, and ‘Copy Markdown button’&#34;&gt;
&lt;p&gt;If you add a comment, this will appear above the example query when users follow the shared link. You can use this to provide them with some context or a description.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/9a91947ca6.png&#34; width=&#34;600&#34; height=&#34;343&#34; alt=&#34;Screenshot of an example query with a comment highlighted.&#34;&gt;
&lt;p&gt;There are two buttons providing different share options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Copy share url&lt;/strong&gt; – copies the full url to the shared example&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Copy Markdown button&lt;/strong&gt; – copies a Markdown snippet that embeds a button like this &lt;img src=&#34;https://wragge.github.io/trove-data-guide/_images/try-trove-api-console.svg&#34; alt=&#34;Try it!&#34;&gt; linked to your example. Just paste it into your Markdown-formatted documentation!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other improvements include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The parameters used in any request are now displayed in a table for easy reference.&lt;/li&gt;
&lt;li&gt;The Console includes a link to an &lt;a href=&#34;http://glam-workbench.net/trove-api-test/&#34;&gt;API Status page&lt;/a&gt;. This page runs all the example queries in the Console and checks the results to make sure the API is working as expected. It’s updated every 6 hours.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Version 3 of the Trove API includes a standard &lt;a href=&#34;https://api.trove.nla.gov.au/v3/index.html&#34;&gt;Swagger UI&lt;/a&gt; that you can use to build queries, and provides limited anonymous access (without the need for an API key). But if, like me, you learn best by looking at examples, then you should find the API Console handy. In particular, the API Console makes it easy to share live examples which can be very useful in training, troubleshooting, and writing documentation. The &lt;a href=&#34;https://wragge.github.io/trove-data-guide/home.html&#34;&gt;Trove Data Guide&lt;/a&gt;, which I’m working on at the moment, includes lots of ‘Try it!’ buttons.&lt;/p&gt;
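&lt;p&gt;If you want to move from the Console to code, a Console-style query can be reproduced in a few lines of Python. This is just a minimal sketch: the base URL is the documented v3 search endpoint, but the query term and category shown are purely illustrative.&lt;/p&gt;

```python
import urllib.parse

# Base URL for the version 3 search endpoint of the Trove API.
API_BASE = "https://api.trove.nla.gov.au/v3/result"

def build_query_url(q, category="newspaper", encoding="json", key=None):
    # Assemble a search URL like the examples in the API Console.
    params = {"q": q, "category": category, "encoding": encoding}
    if key:
        # Supplying an API key lifts the limits of anonymous access.
        params["key"] = key
    return f"{API_BASE}?{urllib.parse.urlencode(params)}"

# An illustrative search for 'wragge' in the newspaper category.
url = build_query_url("wragge")
```

&lt;p&gt;You could paste the resulting url into the Console, the Swagger UI, or an HTTP client of your choice.&lt;/p&gt;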
</description>
    </item>
    
    <item>
      <title>Updated harvest of NSW State Archives indexes – more than 2 million rows of data!</title>
      <link>https://updates.timsherratt.org/2023/05/08/updated-harvest-of.html</link>
      <pubDate>Mon, 08 May 2023 13:25:09 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/05/08/updated-harvest-of.html</guid>
      <description>&lt;p&gt;The NSW State Archives (now part of &lt;a href=&#34;https://mhnsw.au/&#34;&gt;Museums of History NSW&lt;/a&gt;) publishes a series of &lt;a href=&#34;https://mhnsw.au/archive/subjects/?filter=indexes&#34;&gt;useful indexes&lt;/a&gt; to its collections. The indexes include basic data transcribed from the records, such as names, dates, and places, providing fine-grained access to the collections. But when they’re explored as data, the indexes also suggest new ways of analysing, visualising, and linking sets of records. (For some of the possibilities and challenges of using this sort of data see &lt;a href=&#34;http://doi.org/10.1353/jwh.2021.0025&#34;&gt;Missing Links: Data Stories from the Archive of British Settler Colonial Citizenship&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;In 2016, I started harvesting the index data from the NSW State Archives website to make each dataset available as an easily downloadable CSV file. In 2019, changes to the website made it impossible to access the complete indexes, so I was unable to update the CSV files. Fortunately, the website has changed again, and I’ve been able to reharvest all the indexes, capturing any changes since 2019. There are currently &lt;strong&gt;75 indexes containing 2,481,881 rows of data&lt;/strong&gt;.&lt;/p&gt;
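&lt;p&gt;The download links in the table all follow the same pattern, so you can also construct the URL for any index from its file slug and work with the data programmatically. A minimal sketch (the slug used here comes from one of the links in the table):&lt;/p&gt;

```python
# The CSV downloads share a common base path in the srnsw-indexes repository.
CSV_BASE = "https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data"

def csv_url(slug):
    # Build the download URL for an index dataset from its file slug.
    return f"{CSV_BASE}/{slug}.csv"

# e.g. the 1841 census index listed in the table below
url = csv_url("census-1841")
```

&lt;p&gt;From there you could load the data straight into a dataframe with something like pandas’ &lt;code&gt;read_csv(url)&lt;/code&gt;.&lt;/p&gt;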
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Number of rows&lt;/th&gt;
&lt;th&gt;Download data&lt;/th&gt;
&lt;th&gt;View at State Archives&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Aboriginal People in the Register of Aboriginal Reserves 1875-1904&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/aboriginal-people-in-the-register-of-aboriginal-reserves.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/first-nations/aboriginal-people-in-the-register-of-aboriginal-reserves/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Assisted Immigrants Index 1839-1896&lt;/td&gt;
&lt;td&gt;200,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/assisted-immigrants-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/immigration-and-shipping/assisted-immigrants-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Australian Railway Supply Detachment 1914&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/australian-railway-supply-detachment.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/wwi/australian-railway-supply-detachment/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bankruptcy index 1888-1929&lt;/td&gt;
&lt;td&gt;30,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/bankruptcy-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/bankruptcy-and-insolvency/bankruptcy-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bench of Magistrates Index 1788-1820&lt;/td&gt;
&lt;td&gt;4,442&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/bench-of-magistrates-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/courts-lower/bench-of-magistrates-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Botanic Gardens and government domains employees&lt;/td&gt;
&lt;td&gt;916&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/botanic-gardens-and-government-domains-employees-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/professions-and-occupations/botanic-gardens-and-government-domains-employees-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bubonic plague index 1900-1908&lt;/td&gt;
&lt;td&gt;567&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/bubonic-plague-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/bubonic-plague/bubonic-plague-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Census - 1841&lt;/td&gt;
&lt;td&gt;9,355&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/census-1841.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/census-and-musters/census-1841/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chemists, druggists and pharmacists index 1876-1920&lt;/td&gt;
&lt;td&gt;2,967&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/chemists-druggists-and-pharmacists-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/professions-and-occupations/chemists-druggists-and-pharmacists-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Child care and protection index 1817-1942&lt;/td&gt;
&lt;td&gt;21,292&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/child-care-and-protection-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/child-care-and-protection/child-care-and-protection-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Colonial (Government) Architect index 1837-1970&lt;/td&gt;
&lt;td&gt;2,373&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/colonial-architect-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/architecture-and-design/colonial-architect-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Colonial Secretary Letters Received, 1826-1896&lt;/td&gt;
&lt;td&gt;205,863&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/colonial-secretary-letters-received-1826-1896.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/colonial-secretary/colonial-secretary-letters-received-1826-1896/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Colonial Secretary&amp;rsquo;s Papers 1788-1825&lt;/td&gt;
&lt;td&gt;144,572&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/colonial-secretarys-papers-1788-1825.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/colonial-secretary/colonial-secretarys-papers-1788-1825/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Colonial Secretary&amp;rsquo;s letters relating to land 1826-1856&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/colonial-secretarys-letters-re-land.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/land/colonial-secretarys-letters-re-land/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Colonial Secretary&amp;rsquo;s main series of letters received&lt;/td&gt;
&lt;td&gt;7,638&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/colonial-secretarys-main-series-of-letters-received.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/colonial-secretary/colonial-secretarys-main-series-of-letters-received/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convict assignments index 1821-1825&lt;/td&gt;
&lt;td&gt;6,156&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/convict-assignments-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/convicts/convict-assignments-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convict exiles index 1849-1850&lt;/td&gt;
&lt;td&gt;3,004&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/convict-exiles-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/convicts/convict-exiles-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convict indents (digitised) index 1788-1801&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/convict-indents-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/convicts/convict-indents-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convicts applications to marry 1825-1851&lt;/td&gt;
&lt;td&gt;14,327&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/convicts-applications-to-marry.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/convicts/convicts-applications-to-marry/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Convicts index 1791-1873&lt;/td&gt;
&lt;td&gt;150,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/convicts-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/convicts/convicts-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coroners&#39; inquests index 1796-1824&lt;/td&gt;
&lt;td&gt;808&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/coroners-inquests-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/inquests-and-coronial-inquiries/coroners-inquests-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Court of Civil Jurisdiction index 1799-1814&lt;/td&gt;
&lt;td&gt;2,876&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/court-of-civil-jurisdiction-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/courts-lower/court-of-civil-jurisdiction-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Court of Claims (Land) index 1833-1922&lt;/td&gt;
&lt;td&gt;2,966&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/court-of-claims-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/land/court-of-claims-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crew and passengers 1828-1841&lt;/td&gt;
&lt;td&gt;2,560&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/crew-and-passengers-index-1828-1841.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/immigration-and-shipping/crew-and-passengers-index-1828-1841/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Criminal court records index 1788-1833&lt;/td&gt;
&lt;td&gt;5,028&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/criminal-court-records-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/criminal-courts/criminal-court-records-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Criminal depositions (Deposition Books) index 1849-1949&lt;/td&gt;
&lt;td&gt;117,508&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/criminal-depositions-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/criminal-courts/criminal-depositions-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Criminal indictments index 1863-1919&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/criminal-indictments-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/criminal-courts/criminal-indictments-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deceased estates index 1880-1958&lt;/td&gt;
&lt;td&gt;577,891&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/deceased-estates-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/deceased-estates/deceased-estates-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Depasturing licenses index 1837-1851&lt;/td&gt;
&lt;td&gt;7,449&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/depasturing-licenses-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/land/depasturing-licenses-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependent children registers 1883-1923&lt;/td&gt;
&lt;td&gt;28,910&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/dependent-children-registers.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/child-care-and-protection/dependent-children-registers/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Devonshire Street Cemetery reinterment index&lt;/td&gt;
&lt;td&gt;9,559&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/devonshire-street-cemetery-reinterment-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/cemeteries-and-burials/devonshire-street-cemetery-reinterment-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Divorce records index 1873-1923&lt;/td&gt;
&lt;td&gt;21,239&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/divorce-records-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/divorce/divorce-records-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fire Commissioners Personnel&lt;/td&gt;
&lt;td&gt;3,767&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/fire-commissioners-personnel.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/firefighters-fires-and-fire-brigades/fire-commissioners-personnel/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gaol inmates &amp;amp; prisoners photos index 1870-1930&lt;/td&gt;
&lt;td&gt;52,055&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/gaol-inmates-prisoners-photos-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/gaol-inmates-and-prisoners/gaol-inmates-prisoners-photos-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gold (auriferous) lease registers 1874-1953&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/gold-lease-registers.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/goldmining/gold-lease-registers/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Indigenous colonial court cases 1788-1838&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/indigenous-colonial-court-cases.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/first-nations/indigenous-colonial-court-cases/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infirm &amp;amp; destitute (Government) asylums index 1880-1896&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/infirm-destitute-asylums-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/asylums/infirm-destitute-asylums-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inquest index 1942-1963&lt;/td&gt;
&lt;td&gt;45,547&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/inquest-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/inquests-and-coronial-inquiries/inquest-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Insolvency index 1842-1887&lt;/td&gt;
&lt;td&gt;23,108&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/insolvency-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/bankruptcy-and-insolvency/insolvency-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intestate estates index 1821-1913&lt;/td&gt;
&lt;td&gt;30,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/intestate-estates-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/intestates/intestate-estates-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Land grants and leases (registers) 1792-1865&lt;/td&gt;
&lt;td&gt;5,627&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/land-grants-and-leases.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/land/land-grants-and-leases/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Letters re migration to NSW 1838-1857&lt;/td&gt;
&lt;td&gt;22,771&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/letters-re-migration-to-nsw-1838-1857.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/immigration-and-shipping/letters-re-migration-to-nsw-1838-1857/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance registers - Metropolitan Children&amp;rsquo;s Court 1915-1917&lt;/td&gt;
&lt;td&gt;1,372&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/maintenance-registers-metropolitan-childrens-court.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/child-care-and-protection/maintenance-registers-metropolitan-childrens-court/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Miscellaneous immigrants index 1828-1843&lt;/td&gt;
&lt;td&gt;8,821&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/miscellaneous-immigrants-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/immigration-and-shipping/miscellaneous-immigrants-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NSW Government employees granted military leave&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/nsw-government-employees-granted-military-leave.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/military-and-war/nsw-government-employees-granted-military-leave/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NSW King’s / Queen’s Counsel appointment correspondence&lt;/td&gt;
&lt;td&gt;2,083&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/nsw-kings-queens-counsel-appointment-correspondence.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/professions-and-occupations/nsw-kings-queens-counsel-appointment-correspondence/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Naturalization index 1834-1903&lt;/td&gt;
&lt;td&gt;9,860&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/naturalization-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/naturalisation-and-citizenship/naturalization-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nominal Roll of the First Railway Section (AIF)&lt;/td&gt;
&lt;td&gt;416&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/nominal-roll-of-the-first-railway-section-aif.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/wwi/nominal-roll-of-the-first-railway-section-aif/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Norfolk Island special bundles index 1794-1813&lt;/td&gt;
&lt;td&gt;216&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/norfolk-island-special-bundles-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/norfolk-island/norfolk-island-special-bundles-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nurses index 1926-1954&lt;/td&gt;
&lt;td&gt;46,499&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/nurses-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/nurses-and-midwives/nurses-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Police service registers 1852-1913&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/police-service-registers.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/police/police-service-registers/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Port Macquarie Small Debts Register, 1845-1887&lt;/td&gt;
&lt;td&gt;2,036&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/port-macquarie-small-debts-register-1845-1887.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/courts-lower/port-macquarie-small-debts-register-1845-1887/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Probate records - supplementary index 1790-1875&lt;/td&gt;
&lt;td&gt;1,626&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/probate-records-supplementary-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/probates-and-wills/probate-records-supplementary-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public Works Salary Registers&lt;/td&gt;
&lt;td&gt;523&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/public-works-salary-registers.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/professions-and-occupations/public-works-salary-registers/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Publicans&#39; licenses index 1830-1861&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/publicans-licenses-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/publicans-hoteliers-innkeepers/publicans-licenses-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quarter sessions cases 1824-1837&lt;/td&gt;
&lt;td&gt;6,232&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/quarter-sessions-cases-1824-1837.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/criminal-courts/quarter-sessions-cases-1824-1837/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Railway employment records 1856-1917&lt;/td&gt;
&lt;td&gt;763&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/railway-employment-records.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/railways-and-railway-workers/railway-employment-records/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Railways and Tramways Roll of Honour&lt;/td&gt;
&lt;td&gt;1,214&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/railways-and-tramways-roll-of-honour.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/military-and-war/railways-and-tramways-roll-of-honour/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Register of Firms index 1903-1922&lt;/td&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/register-of-firms-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/business-and-company-records/register-of-firms-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;School teachers&#39; rolls 1869-1908&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/teachers-rolls.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/teachers/teachers-rolls/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schools and related records 1876-1979&lt;/td&gt;
&lt;td&gt;30,181&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/schools-and-related-records.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/schools-and-education/schools-and-related-records/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soldier (Closer) Settlement - Returned Soldiers Transfer files 1907-1951&lt;/td&gt;
&lt;td&gt;9,656&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/soldier-settlement-returned-soldiers-transfer-files.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/soldier-settlement/soldier-settlement-returned-soldiers-transfer-files/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soldier (Closer) Settlement transfer registers 1919-1925&lt;/td&gt;
&lt;td&gt;4,957&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/soldier-settlement-transfer-registers.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/soldier-settlement/soldier-settlement-transfer-registers/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soldier (Closer) settlement promotion files index 1913-1958&lt;/td&gt;
&lt;td&gt;4,354&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/soldier-settlement-promotion-files-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/soldier-settlement/soldier-settlement-promotion-files-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soldier Settlement loan files index 1906-1960&lt;/td&gt;
&lt;td&gt;7,642&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/soldier-settlement-loan-files-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/soldier-settlement/soldier-settlement-loan-files-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soldier Settlement miscellaneous files index 1916&lt;/td&gt;
&lt;td&gt;1,050&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/soldier-settlement-miscellaneous-files-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/soldier-settlement/soldier-settlement-miscellaneous-files-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soldier Settlement purchases index 1905-1937&lt;/td&gt;
&lt;td&gt;9,776&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/soldier-settlement-purchases-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/soldier-settlement/soldier-settlement-purchases-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Squatters and graziers index 1837-1849&lt;/td&gt;
&lt;td&gt;9,003&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/squatters-and-graziers-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/squatters-and-graziers/squatters-and-graziers-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Surveyor General&amp;rsquo;s crown plans 1792-1886&lt;/td&gt;
&lt;td&gt;5,455&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/surveyor-generals-crown-plans.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/maps-and-plans/surveyor-generals-crown-plans/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Surveyors&#39; field books 1794-1860&lt;/td&gt;
&lt;td&gt;813&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/surveyors-field-books.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/maps-and-plans/surveyors-field-books/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Surveyors’ letters 1822-1855&lt;/td&gt;
&lt;td&gt;157&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/surveyors-letters-1822-1855.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/surveyor-general/surveyors-letters-1822-1855/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tramway employees 1879-1911&lt;/td&gt;
&lt;td&gt;10,606&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/tramway-employees.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/tramways/tramway-employees/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unassisted immigrants index 1842-1855&lt;/td&gt;
&lt;td&gt;140,000&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/unassisted-immigrants-index.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/immigration-and-shipping/unassisted-immigrants-index/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unemployed in Sydney 1866&lt;/td&gt;
&lt;td&gt;3,222&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/unemployed-in-sydney.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/unemployment/unemployed-in-sydney/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vessels arrived in Sydney 1837-1925&lt;/td&gt;
&lt;td&gt;129,999&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://media.githubusercontent.com/media/wragge/srnsw-indexes/master/data/vessels-arrived-in-sydney.csv&#34;&gt;CSV file&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href=&#34;https://mhnsw.au/indexes/immigration-and-shipping/vessels-arrived-in-sydney/&#34;&gt;Browse index&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The list above is also available as &lt;a href=&#34;https://glam-workbench.net/nsw-state-archives/index-list/&#34;&gt;a CSV file&lt;/a&gt;. The &lt;a href=&#34;https://github.com/wragge/srnsw-indexes&#34;&gt;complete repository of CSV files is available on GitHub&lt;/a&gt;, and the methods used to harvest the data are &lt;a href=&#34;https://glam-workbench.net/nsw-state-archives/&#34;&gt;documented in the GLAM Workbench&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But what’s in all these wonderfully rich datasets? You can find out using &lt;a href=&#34;https://glam-workbench.net/nsw-state-archives/index-explorer/&#34;&gt;the Index Explorer&lt;/a&gt; in the GLAM Workbench. Just select an index from the dropdown list, and the Index Explorer will analyse every column, summarising the contents and building visualisations to help you understand the range of values.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/059fb0fddf.gif&#34; width=&#34;600&#34; height=&#34;549&#34; alt=&#34;&#34; /&gt;
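&lt;p&gt;The kind of per-column profiling the Index Explorer performs can be sketched with a few lines of standard-library Python. This is only an illustration of the general idea, not the Explorer&amp;rsquo;s actual code:&lt;/p&gt;

```python
import csv
import io
from collections import Counter

def summarise_columns(csv_text, top=5):
    """Build a simple per-column summary of a CSV: how many rows have
    a value, how many distinct values there are, and the most common
    values. A rough sketch of the kind of analysis the Index Explorer
    performs, not its actual code."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        # Ignore empty cells when counting values.
        values = [row[column] for row in rows if row[column]]
        summary[column] = {
            "filled": len(values),
            "distinct": len(set(values)),
            "top_values": Counter(values).most_common(top),
        }
    return summary

sample = "surname,year\nSmith,1920\nJones,1921\nSmith,1920\n"
print(summarise_columns(sample))
```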
&lt;p&gt;Data from the NSW State Archives Indexes is also included in the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt;, which helps you find people in 263 different indexes from 10 Australian GLAM organisations.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A big milestone, Trove contributor data, and the coming of API v3 – recent GLAM Workbench updates</title>
      <link>https://updates.timsherratt.org/2023/03/24/a-big-milestone.html</link>
      <pubDate>Fri, 24 Mar 2023 12:40:58 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/03/24/a-big-milestone.html</guid>
      <description>&lt;p&gt;There have been quite a few GLAM Workbench updates over the last month; here are some notes. (See &lt;a href=&#34;https://updates.timsherratt.org/2023/02/17/maps-people-lists.html&#34;&gt;February’s update&lt;/a&gt; for earlier changes…)&lt;/p&gt;
&lt;h2 id=&#34;general-developments&#34;&gt;General developments&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;After many months of work, &lt;strong&gt;all thirteen Trove repositories&lt;/strong&gt; within the GLAM Workbench have been updated to include standard configurations, integrations, and basic tests. This will make ongoing development and maintenance much easier. Docker images of every repository are now built automatically whenever the code changes. These images can be used across multiple computing environments, including cloud services such as &lt;a href=&#34;https://glam-workbench.net/using-binder/&#34;&gt;Binder&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/using-nectar/&#34;&gt;Nectar&lt;/a&gt;, and &lt;a href=&#34;https://glam-workbench.net/using-reclaim-cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt;, as well as on a &lt;a href=&#34;https://glam-workbench.net/using-docker/&#34;&gt;local computer&lt;/a&gt;. This means users have more options for running the notebooks within a consistent, pre-configured, and tested environment.&lt;/li&gt;
&lt;li&gt;With all the Trove repositories now Docker-ised, I worked with the &lt;a href=&#34;https://ardc.edu.au/services/ardc-nectar-research-cloud/&#34;&gt;Nectar Cloud&lt;/a&gt; team to update the &lt;a href=&#34;https://support.ehelp.edu.au/support/solutions/articles/6000253108-nectar-applications-glam-workbench&#34;&gt;&lt;strong&gt;GLAM Workbench app&lt;/strong&gt;&lt;/a&gt;. Now when you install the app, you can select from any of the Trove repositories (excluding API intro and Random items), as well as the &lt;a href=&#34;https://glam-workbench.net/recordsearch/&#34;&gt;NAA RecordSearch&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web Archives&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/digitalnz/&#34;&gt;Digital NZ&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/tepapa/&#34;&gt;Te Papa&lt;/a&gt; repositories.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Datasette Lite integration!&lt;/strong&gt; &lt;a href=&#34;https://github.com/simonw/datasette-lite&#34;&gt;Datasette Lite&lt;/a&gt; turns CSV and JSON files into fully searchable databases running within your browser (no server required). I spent some time creating a &lt;a href=&#34;https://github.com/GLAM-Workbench/datasette-lite&#34;&gt;customised version of Datasette Lite&lt;/a&gt; for the GLAM Workbench. Now I can just point the urls for CSV datasets to my Datasette Lite repository to open them up for quick exploration – &lt;a href=&#34;https://glam-workbench.net/datasette-lite/?csv=https://github.com/GLAM-Workbench/trove-journals/blob/master/digital-journals-with-text-20220831.csv#/data/digital-journals-with-text-20220831&#34;&gt;like this&lt;/a&gt;! I’ve started adding &lt;strong&gt;Explore in Datasette&lt;/strong&gt; buttons to dataset pages in the GLAM Workbench. Some examples are mentioned below.&lt;/li&gt;
&lt;/ul&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2023/6bfdff462d.png&#34; width=&#34;600&#34; height=&#34;193&#34; alt=&#34;Screenshot showing an example of an ‘Explore in Datasette&#39; button.&#34; /&gt;
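&lt;p&gt;The URL pattern is simple enough to generate programmatically. The sketch below is inferred from the example link above (the &lt;code&gt;?csv=&lt;/code&gt; parameter and &lt;code&gt;#/data/&lt;/code&gt; fragment are assumptions based on that one example, so check a current link if it stops working):&lt;/p&gt;

```python
def datasette_lite_url(csv_url):
    """Build a link that opens a CSV file in the GLAM Workbench's
    customised Datasette Lite. The URL pattern is inferred from the
    example links on the site, not from any official spec."""
    # Datasette names the table after the file, minus its extension.
    table = csv_url.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return f"https://glam-workbench.net/datasette-lite/?csv={csv_url}#/data/{table}"

url = datasette_lite_url(
    "https://github.com/GLAM-Workbench/trove-journals/blob/master/digital-journals-with-text-20220831.csv"
)
print(url)
```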
&lt;h2 id=&#34;new-sections&#34;&gt;New sections&lt;/h2&gt;
&lt;h3 id=&#34;trove-contributorshttpsglam-workbenchnettrove-contributors&#34;&gt;&lt;a href=&#34;https://glam-workbench.net/trove-contributors/&#34;&gt;Trove contributors&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;Trove aggregates metadata from thousands of organisations and projects around Australia. Data about contributors is available from the &lt;code&gt;/contributor&lt;/code&gt; endpoint of the Trove API. This section includes examples of harvesting and exploring this data.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/93226e3f32.png&#34; width=&#34;600&#34; height=&#34;735&#34; alt=&#34;Table showing the top five metadata contributors per Trove zone.&#34; /&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Notebooks:&lt;/strong&gt; One notebook &lt;a href=&#34;https://glam-workbench.net/trove-contributors/get_contributors/&#34;&gt;converts the nested data&lt;/a&gt; available from the &lt;code&gt;/contributor&lt;/code&gt; endpoint into a single flat list of contributors. Another uses this list &lt;a href=&#34;https://glam-workbench.net/trove-contributors/get_contributors_totals_zone_format/&#34;&gt;to find out the number of records contributed by each organisation&lt;/a&gt;, aggregated by zone and format.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Datasets&lt;/strong&gt;: Three datasets, generated by the code in the notebooks above, are being updated weekly – a &lt;a href=&#34;https://glam-workbench.net/trove-contributors/trove-contributors-list/&#34;&gt;list of organisations contributing metadata to Trove&lt;/a&gt;, a &lt;a href=&#34;https://glam-workbench.net/trove-contributors/trove-contributors-zones/&#34;&gt;count of records by contributor and zone&lt;/a&gt;, and a &lt;a href=&#34;https://glam-workbench.net/trove-contributors/trove-contributors-formats/&#34;&gt;count of records by contributor, zone, and format&lt;/a&gt;. You can explore them using Datasette Lite.&lt;/li&gt;
&lt;/ul&gt;
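&lt;p&gt;The flattening step is straightforward to sketch. This is just an illustration of the general approach, not the notebook&amp;rsquo;s actual code, and the field names (&lt;code&gt;children&lt;/code&gt;, &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;) are assumptions – check the JSON actually returned by the &lt;code&gt;/contributor&lt;/code&gt; endpoint:&lt;/p&gt;

```python
def flatten_contributors(records, parent=None):
    """Recursively flatten nested contributor records into a single
    flat list, keeping track of each record's parent. Field names
    ('children', 'id', 'name') are illustrative, not guaranteed to
    match the real /contributor response."""
    flat = []
    for record in records:
        flat.append({
            "id": record.get("id"),
            "name": record.get("name"),
            "parent": parent,
        })
        # Child agencies are nested inside their parent record.
        flat.extend(
            flatten_contributors(record.get("children", []), parent=record.get("id"))
        )
    return flat

nested = [
    {"id": "NLA", "name": "National Library of Australia",
     "children": [{"id": "NLA:PA", "name": "People Australia"}]}
]
print(flatten_contributors(nested))
```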
&lt;h3 id=&#34;trove-api-v3httpsglam-workbenchnettrove-api-v3&#34;&gt;&lt;a href=&#34;https://glam-workbench.net/trove-api-v3/&#34;&gt;Trove API v3&lt;/a&gt;&lt;/h3&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/fffe8f9381.png&#34; width=&#34;600&#34; height=&#34;411&#34; alt=&#34;Screenshot of the new v3 beta section of the Trove API Console&#34; /&gt;
&lt;p&gt;This is a temporary section of the GLAM Workbench created to bring together information and examples relating to version 3 beta of the Trove API. It will probably disappear once the new version is officially released and I reorganise the Trove sections of the GLAM Workbench accordingly. It currently includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A link to the new &lt;a href=&#34;https://troveconsole.herokuapp.com/v3/&#34;&gt;v3 beta section of the Trove API Console&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;A &lt;a href=&#34;https://glam-workbench.net/trove-api-v3/#summary-of-breaking-changes&#34;&gt;summary of breaking changes&lt;/a&gt; in the new version – these are the things you&amp;rsquo;ll need to change in your code before v2 of the API is switched off in early 2024.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Over the coming months I’ll be adding updated notebooks and examples from other Trove sections.&lt;/p&gt;
&lt;h2 id=&#34;updated-sections&#34;&gt;Updated sections&lt;/h2&gt;
&lt;h3 id=&#34;trove-newspaper--gazette-harvesterhttpsglam-workbenchnettrove-harvester&#34;&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove newspaper &amp;amp; gazette harvester&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;New notebook!&lt;/strong&gt; When I overhauled the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-harvester/&#34;&gt;trove-newspaper-harvester&lt;/a&gt; Python package last year, I made it possible to use the harvester as a library as well as a command line tool. This means you can integrate the harvester into your own tools or workflows. This new notebook, &lt;a href=&#34;https://glam-workbench.net/trove-harvester/harvest-specific-days/&#34;&gt;Harvesting articles that mention &amp;ldquo;Anzac Day&amp;rdquo; on Anzac Day&lt;/a&gt;, gives an example of a complex search query where it might be easier to use the harvester as a library – in this case, finding newspaper articles with particular keywords, published on a particular day, over a span of years!&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;trove-unpublished-works-diaries-letters-and-archiveshttpsglam-workbenchnettrove-unpublished&#34;&gt;&lt;a href=&#34;https://glam-workbench.net/trove-unpublished/&#34;&gt;Trove unpublished works (diaries, letters, and archives)&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Repository upgraded&lt;/strong&gt;: Python packages updated, configuration files standardised, basic tests added, automated Docker builds configured, integrations with Zenodo and Reclaim Cloud added.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;New notebooks!&lt;/strong&gt; I’ve added a series of notebooks related to finding, using, &amp;amp; analysing the NLA&amp;rsquo;s digitised manuscript finding aids (which are somewhat submerged beneath other content). One notebook, &lt;a href=&#34;https://glam-workbench.net/trove-unpublished/find-finding-aids/&#34;&gt;finds the finding aids&lt;/a&gt;, another one &lt;a href=&#34;https://glam-workbench.net/trove-unpublished/get-info-finding-aids/&#34;&gt;gathers some summary information&lt;/a&gt; about all 2,337 finding aids, and a third helps you &lt;a href=&#34;https://glam-workbench.net/trove-unpublished/convert-fa-to-json/&#34;&gt;reconstruct the hierarchy from the HTML&lt;/a&gt; of a single finding aid &amp;amp; saves the content as JSON.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;New datasets!&lt;/strong&gt; I used the new finding aids notebooks to generate a couple of datasets. One is just a &lt;a href=&#34;https://glam-workbench.net/trove-unpublished/finding-aids-urls/&#34;&gt;list of Trove urls&lt;/a&gt; pointing to finding aids. The other dataset provides some &lt;a href=&#34;https://glam-workbench.net/trove-unpublished/finding-aids-summary/&#34;&gt;summary information about each finding aid&lt;/a&gt; – this includes the number of items described, digitised, and searchable. You can &lt;a href=&#34;https://glam-workbench.net/datasette-lite/?csv=https://github.com/GLAM-Workbench/nla-finding-aids-data/blob/main/finding-aids-totals.csv&#34;&gt;explore the summary data&lt;/a&gt; using Datasette Lite.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;trove-music--soundhttpsglam-workbenchnettrove-music&#34;&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/&#34;&gt;Trove music &amp;amp; sound&lt;/a&gt;&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Repository upgraded&lt;/strong&gt;: Python packages updated, configuration files standardised, basic tests added, automated Docker builds configured, integrations with Zenodo and Reclaim Cloud added.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Updated harvest&lt;/strong&gt; of &lt;a href=&#34;https://glam-workbench.net/trove-music/abcrn-data/&#34;&gt;ABC Radio National program metadata&lt;/a&gt;: there are now 421,277 records from about 163 programs, though there don&amp;rsquo;t seem to have been any additions since early 2022.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;random-items-from-trovehttpsglam-workbenchnettrove-random&#34;&gt;&lt;a href=&#34;https://glam-workbench.net/trove-random/&#34;&gt;Random items from Trove&lt;/a&gt;&lt;/h3&gt;
&lt;p&gt;This section documents some ways of retrieving &lt;em&gt;random-ish&lt;/em&gt; works and newspaper articles from Trove.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Repository upgraded&lt;/strong&gt;: Python packages updated, configuration files standardised, basic tests added, automated Docker builds configured, integrations with Zenodo and Reclaim Cloud added.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;related-developments&#34;&gt;Related developments&lt;/h2&gt;
&lt;p&gt;I’ve also created a couple of new ‘git scrapers’ to capture information about Trove. These are just bits of code that run on a schedule using GitHub Actions and save their results into a GitHub repository. Along with the &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;Trove Newspaper Data Dashboard&lt;/a&gt; and other &lt;a href=&#34;https://zenodo.org/communities/trove-historical-data/?page=1&amp;amp;size=20&#34;&gt;historical datasets&lt;/a&gt;, these help researchers understand how the contents of Trove change over time.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-zone-totals&#34;&gt;trove-zone-totals&lt;/a&gt;: saves data about the contents of Trove&amp;rsquo;s zones (this will need to be changed to categories with the release of v3 of the Trove API)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-contributor-totals&#34;&gt;trove-contributor-totals&lt;/a&gt;: saves details of organisations and projects that contribute metadata to Trove&lt;/li&gt;
&lt;/ul&gt;
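&lt;p&gt;The core of a git scraper is tiny: harvest some data, write it to a dated file, and let a scheduled workflow commit the result so the repository&amp;rsquo;s history becomes the time series. A minimal sketch of the save step (illustrative only – the real scrapers fetch their data from the Trove API):&lt;/p&gt;

```python
import json
from datetime import date
from pathlib import Path

def save_snapshot(data, data_dir="data"):
    """Save today's harvested totals as a dated JSON file. A GitHub
    Actions workflow with a cron schedule would run this and then
    commit any changes back to the repository -- that commit history
    is what makes it a 'git scraper'."""
    path = Path(data_dir)
    path.mkdir(exist_ok=True)
    outfile = path / f"totals-{date.today().isoformat()}.json"
    outfile.write_text(json.dumps(data, indent=2))
    return outfile

# In a real scraper the data would come from the Trove API.
snapshot = save_snapshot({"newspapers": 1000}, data_dir="example-data")
print(snapshot)
```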
</description>
    </item>
    
    <item>
      <title>Maps, people, lists &amp; more – recent updates to Trove resources in the GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2023/02/17/maps-people-lists.html</link>
      <pubDate>Fri, 17 Feb 2023 16:53:07 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2023/02/17/maps-people-lists.html</guid>
      <description>&lt;p&gt;Once again I’ve gotten a bit behind in noting &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; updates, so here’s a quick catch up on some Trove-related changes from the last couple of months.&lt;/p&gt;
&lt;h2 id=&#34;trove-api-introduction&#34;&gt;Trove API introduction&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/trove/&#34;&gt;section that introduces the Trove API&lt;/a&gt; (or APIs!) hasn’t had much love over recent years. I’m hoping to add some more content in the coming months, but for now I just did a bit of maintenance – updating Python packages and config files, including tests, and setting up automated builds of Docker containers. The documentation pages have also had a bit of a refresh. More soon!&lt;/p&gt;
&lt;h2 id=&#34;trove-lists-and-tags&#34;&gt;Trove lists and tags&lt;/h2&gt;
&lt;p&gt;For the Everyday Heritage digital workshop last November, I added a new notebook to &lt;a href=&#34;https://glam-workbench.net/trove-lists/convert-list-to-cb-exhibition/&#34;&gt;convert a Trove list into a CollectionBuilder exhibition&lt;/a&gt;. The notebook harvests metadata and images from the items in a Trove list, then packages everything up in a form that can be uploaded to a &lt;a href=&#34;https://github.com/CollectionBuilder/collectionbuilder-gh&#34;&gt;CollectionBuilder-GH&lt;/a&gt; repository. Just upload the files to create your own instant exhibition! For example, &lt;a href=&#34;https://wragge.github.io/trove-wragge-list-demo/&#34;&gt;this exhibition&lt;/a&gt; was generated from this &lt;a href=&#34;https://trove.nla.gov.au/list/83777&#34;&gt;Trove list&lt;/a&gt;.&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2023/6588d32798.png&#34; width=&#34;600&#34; height=&#34;374&#34; alt=&#34;Screenshot of CollectionBuilder exhibition front page, showing the image viewer and subject navigation.&#34; /&gt;
&lt;p&gt;More recently, I updated and reorganised &lt;a href=&#34;https://glam-workbench.net/trove-lists/&#34;&gt;the documentation pages&lt;/a&gt;. In particular, I updated the links to the pre-harvested datasets which are now all saved in Zenodo:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.6827253&#34;&gt;Trove lists metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.6814722&#34;&gt;Trove public tags&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.7563995&#34;&gt;Trove tag counts&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These datasets have also been added to the &lt;a href=&#34;https://zenodo.org/communities/trove-historical-data/?page=1&amp;amp;size=20&#34;&gt;Trove historical data collection&lt;/a&gt; in Zenodo.&lt;/p&gt;
&lt;h2 id=&#34;trove-maps&#34;&gt;Trove maps&lt;/h2&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/trove-maps/&#34;&gt;Trove Maps&lt;/a&gt; section also needed a refresh and rebuild. The &lt;a href=&#34;https://glam-workbench.net/trove-maps/exploring-digitised-maps/&#34;&gt;existing notebook&lt;/a&gt; harvested metadata about digitised maps to build an overview of what was available. I updated the harvesting code to capture a bit more information, including spatial coordinates where available. These coordinates aren’t available through the API, and aren’t visible on a map’s web page, but they &lt;em&gt;are&lt;/em&gt; embedded within the HTML (this is a &lt;a href=&#34;https://glam-workbench.net/trove-books/metadata-for-digital-works/&#34;&gt;little trick&lt;/a&gt; I’ve used with other digitised materials to get additional metadata). You can download the &lt;a href=&#34;https://glam-workbench.net/trove-maps/single-maps-data/&#34;&gt;updated dataset&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I also added a new notebook that attempts to parse the spatial data strings and convert the coordinates into decimal values that we can display on a map. Some of the coordinates were bounding boxes, while others were just points. If there was a bounding box, I calculated the centre point and saved that as well. I ended up with decimal coordinates for 26,591 digitised maps. You can download the &lt;a href=&#34;https://glam-workbench.net/trove-maps/single-maps-coordinates-data/&#34;&gt;parsed and converted coordinates&lt;/a&gt; as a dataset.&lt;/p&gt;
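&lt;p&gt;The conversion itself is simple once the degrees, minutes, and seconds have been pulled out of the string. This sketch shows only that final step (the actual parsing in the notebook has to cope with the many variant formats found in the metadata):&lt;/p&gt;

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter to a
    decimal value. Southern and western hemispheres become negative."""
    value = degrees + minutes / 60 + seconds / 3600
    if hemisphere in ("S", "W"):
        value = -value
    return value

def bbox_centre(west, east, north, south):
    """Return the centre point of a bounding box as (latitude, longitude),
    with all values already in decimal degrees."""
    return ((north + south) / 2, (west + east) / 2)

# e.g. a map covering longitudes 144-150 E and latitudes 34-38 S
print(bbox_centre(144.0, 150.0, -34.0, -38.0))
```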
&lt;p&gt;From this data I was able to generate some interesting visualisations. I used &lt;a href=&#34;https://python-visualization.github.io/folium/index.html&#34;&gt;Folium&lt;/a&gt; and the &lt;a href=&#34;https://python-visualization.github.io/folium/plugins.html#folium.plugins.FastMarkerCluster&#34;&gt;FastMarkerCluster&lt;/a&gt; plugin to map all of the centre points. I’ve had trouble before displaying lots of markers in Jupyter, but FastMarkerCluster handled it easily. I also saved the Folium map as &lt;a href=&#34;https://glam-workbench.net/trove-maps/trove-map-clusters.html&#34;&gt;an HTML page&lt;/a&gt; to make it easy for people to explore. Just zoom around and click on markers to display the map’s title and a link to Trove.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-maps/trove-map-clusters.html&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/0b83b60e03.png&#34; width=&#34;600&#34; height=&#34;468&#34; alt=&#34;Screenshot of cluster map.&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Where there are bounding boxes, you can overlay the map images themselves on a modern map. Of course, this isn’t as accurate as georectifying the map, particularly if the map doesn’t fill the whole image, but it’s still pretty fun. There’s a demonstration in the notebook that selects a random map and overlays the image on a modern basemap using &lt;a href=&#34;https://ipyleaflet.readthedocs.io/en/latest/&#34;&gt;IPyleaflet&lt;/a&gt;. It includes a widget to adjust the opacity of the map image (something I didn’t seem to be able to include using Folium?).&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/c9f69f7544.png&#34; width=&#34;600&#34; height=&#34;292&#34; alt=&#34;Screenshot of a historical map of the Nile overlaid on a modern map of Egypt.&#34; /&gt;
&lt;h2 id=&#34;trove-people--organisations&#34;&gt;Trove People &amp;amp; Organisations&lt;/h2&gt;
&lt;p&gt;I’ve finally added a &lt;a href=&#34;https://glam-workbench.net/trove-people/&#34;&gt;section for the Trove People &amp;amp; Organisations zone&lt;/a&gt;! This has been in the works for a while, but thanks to the &lt;a href=&#34;https://www.acd-engine.org/&#34;&gt;Australian Cultural Data Engine&lt;/a&gt; I was able to devote some time to it. Trove&amp;rsquo;s People and Organisations zone aggregates information about individuals and organisations, bringing multiple sources together under a single identifier. Data is available from a series of APIs, which are not well-documented. The notebooks show you how to &lt;a href=&#34;https://glam-workbench.net/trove-people/complete_harvest/&#34;&gt;harvest all the available people and organisations data&lt;/a&gt; as EAC-CPF encoded XML files. Once you have the data, you can extract some summary information about sources and occupations, and use this to &lt;a href=&#34;https://glam-workbench.net/trove-people/intersections/&#34;&gt;explore the way that records from different sources have been merged into unique identities&lt;/a&gt;. For example, you can create a network graph of relationships between sources.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/a33ef94fc9.png&#34; width=&#34;600&#34; height=&#34;585&#34; alt=&#34;Screenshot of a network graph displaying links between data sources.&#34; /&gt;
&lt;p&gt;Or use an UpSet chart to show the most common groupings of sources.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2023/50fe7a2836.png&#34; width=&#34;600&#34; height=&#34;301&#34; alt=&#34;Screenshot of an UpSet chart.&#34; /&gt;
&lt;p&gt;There’s also a couple of notebooks with some handy examples of code to &lt;a href=&#34;https://glam-workbench.net/trove-people/get_sru_results_as_json/&#34;&gt;convert the XML results from the SRU API to JSON&lt;/a&gt;, and to &lt;a href=&#34;https://glam-workbench.net/trove-people/viaf/&#34;&gt;find extra identity links through VIAF&lt;/a&gt;. I’ve also shared a pre-harvested version of the &lt;a href=&#34;https://glam-workbench.net/trove-people/complete_harvest_dataset/&#34;&gt;complete dataset&lt;/a&gt; and the &lt;a href=&#34;https://glam-workbench.net/trove-people/aggregated_datasets/&#34;&gt;extracted summaries&lt;/a&gt;.&lt;/p&gt;
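&lt;p&gt;The XML-to-JSON conversion can be sketched with the standard library. This is a deliberately naive illustration of the idea, not the notebook&amp;rsquo;s code – real SRU responses are EAC-CPF XML with namespaces, attributes, and repeated elements that need more careful handling:&lt;/p&gt;

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(element):
    """Recursively convert an XML element (e.g. part of an SRU search
    result) into plain Python dicts, ready for json.dumps. A very
    simple approach -- attributes and repeated sibling elements with
    the same tag are not handled."""
    children = list(element)
    if not children:
        return element.text
    return {child.tag: element_to_dict(child) for child in children}

# Build a tiny stand-in for an SRU record (real responses are
# EAC-CPF XML with many more fields).
record = ET.Element("record")
name = ET.SubElement(record, "name")
name.text = "Tim Sherratt"
print(json.dumps({record.tag: element_to_dict(record)}))
```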
</description>
    </item>
    
    <item>
      <title>Recent presentations – Library of Congress Data Jam, Everyday Heritage, Wikidata, and GLAM Workbench!</title>
      <link>https://updates.timsherratt.org/2022/12/10/recent-presentations-library.html</link>
      <pubDate>Sat, 10 Dec 2022 14:00:33 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/12/10/recent-presentations-library.html</guid>
      <description>&lt;p&gt;October and November brought a flurry of presentations from which I’m still recovering. Here’s a few details and links.&lt;/p&gt;
&lt;h2 id=&#34;library-of-congress-data-jam&#34;&gt;Library of Congress Data Jam&lt;/h2&gt;
&lt;p&gt;In October, the &lt;em&gt;Computing Cultural Heritage in the Cloud&lt;/em&gt; project at the Library of Congress organised a &lt;a href=&#34;https://blogs.loc.gov/thesignal/2022/12/now-playing-the-cchc-data-jam/&#34;&gt;Data Jam&lt;/a&gt;. I was invited to spend a couple of weeks playing around with one of their datasets and to report on the results. I ended up trying to find references to countries in a collection of 90,000 OCRd books. Of course I struck a few problems along the way, and didn’t get nearly as much done as I’d hoped, but that was really part of the point – to find the problems and explore the possibilities. You can watch a &lt;a href=&#34;https://vimeo.com/763639279&#34;&gt;video of my presentation&lt;/a&gt;, or the &lt;a href=&#34;https://www.loc.gov/item/webcast-10669/?loclr=blogsig&#34;&gt;whole Data Jam&lt;/a&gt;.&lt;/p&gt;
&lt;div style=&#34;padding:56.25% 0 0 0;position:relative;&#34;&gt;&lt;iframe src=&#34;https://player.vimeo.com/video/763639279?h=847e5ae3ce&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479&#34; frameborder=&#34;0&#34; allow=&#34;autoplay; fullscreen; picture-in-picture&#34; allowfullscreen style=&#34;position:absolute;top:0;left:0;width:100%;height:100%;&#34; title=&#34;LoC Data Jam Project&#34;&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;script src=&#34;https://player.vimeo.com/api/player.js&#34;&gt;&lt;/script&gt;
&lt;p&gt;You can also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://slides.com/wragge/loc-datajam&#34;&gt;View the slides of my presentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/loc-datajam&#34;&gt;Examine the Jupyter notebooks I used to process the data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://loc-books-yajhxrvxsa-ts.a.run.app/&#34;&gt;Explore the data I extracted using Datasette&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://wragge.github.io/loc-books-demo/&#34;&gt;Play with a simple app based on the extracted data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/loc-books-demo&#34;&gt;View the app code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;wikimedia-australia-community-meeting&#34;&gt;Wikimedia Australia Community Meeting&lt;/h2&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/071c7cad60.png&#34; width=&#34;600&#34; height=&#34;571&#34; alt=&#34;Network visualisation of Australian government departments consisting of a series of hierarchically-ordered coloured nodes connected by curved lines. Each node represents a department, the colour indicates the decade in which it was created and the size indicates its lifespan. The connecting lines join departments and their successors. The chart is ordered chronologically with 1901 at the top and the present day at the bottom.&#34; /&gt;
&lt;p&gt;Thanks to a grant from Wikimedia Australia, I’ve been able to spend some time this year working to align information about government agencies in Wikidata with details from the National Archives of Australia’s RecordSearch database. You can read about the project in this &lt;a href=&#34;https://wikimedia.org.au/wiki/Exploring_government_departments_by_linking_Wikidata_to_the_National_Archives_of_Australia&#34;&gt;blog post&lt;/a&gt;. In November, I gave a report on the project at Wikimedia Australia’s monthly Community Meeting. You can watch the &lt;a href=&#34;https://youtu.be/XkQ_-QgUOBw&#34;&gt;video on YouTube&lt;/a&gt;. You can also explore the &lt;a href=&#34;https://glam-workbench.net/wikidata/&#34;&gt;new Wikidata section&lt;/a&gt; of the GLAM Workbench.&lt;/p&gt;
&lt;h2 id=&#34;everyday-heritage-workshop-and-symposium&#34;&gt;Everyday Heritage workshop and symposium&lt;/h2&gt;
&lt;p&gt;November also brought the first of what will be a series of annual symposia relating to the ARC-funded &lt;a href=&#34;https://everydayheritage.au/&#34;&gt;Everyday Heritage project&lt;/a&gt;. On 9 November, Kate Bagnall and I ran a &lt;a href=&#34;https://everydayheritage.au/events/workshop-november/&#34;&gt;Connecting People and Place&lt;/a&gt; workshop at the University of Canberra. We walked participants through some ways of finding and using digital collections such as Trove, and worked with them to create projects based on their own research using &lt;a href=&#34;https://storymap.knightlab.com/&#34;&gt;StoryMapJS&lt;/a&gt; and &lt;a href=&#34;https://collectionbuilder.github.io/&#34;&gt;CollectionBuilder&lt;/a&gt;. You can view the &lt;a href=&#34;https://slides.com/wragge/trove-newspapers-tips-tricks&#34;&gt;slides of my Trove tips &amp;amp; tricks presentation&lt;/a&gt;. In preparation for the workshop I also created a new notebook in the GLAM Workbench that &lt;a href=&#34;https://glam-workbench.net/trove-lists/#convert-a-trove-list-into-a-collectionbuilder-exhibition&#34;&gt;converts a Trove List into a CollectionBuilder exhibition&lt;/a&gt;.&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/4c1592995e.png&#34; width=&#34;600&#34; height=&#34;672&#34; alt=&#34;Screenshot of tweet showing participants at the Connecting People and Place workshop.&#34; /&gt;
&lt;p&gt;The following day the Everyday Heritage Symposium, &lt;a href=&#34;https://everydayheritage.au/news/everyday-heritage-project-2022-public-workshop-and-symposium/&#34;&gt;Connecting digital archives people &amp;amp; place&lt;/a&gt;, was held at the National Film and Sound Archive in Canberra. You can view the slides from my presentation &lt;a href=&#34;https://slides.com/wragge/everyday-heritage-2022&#34;&gt;Access to the everyday through digital collections&lt;/a&gt;. You can also &lt;a href=&#34;https://vimeo.com/776032956&#34;&gt;watch a video of the full symposium&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;building-dh&#34;&gt;Building DH&lt;/h2&gt;
&lt;p&gt;On 15 November I took part in a panel discussion on &lt;a href=&#34;https://web.cvent.com/event/811e389e-78de-46cd-877d-b20b9ae9ed85/websitePage:ab53bd08-f535-4c29-b2f9-af0b5e31c2d2?RefId=15%20NOV%20Session%2002&#34;&gt;Designing user-friendly Platforms and Toolkits for Digital Humanities&lt;/a&gt; as part of the Building DH online conference. You can &lt;a href=&#34;https://youtu.be/CvWRBjl0VYs&#34;&gt;watch the video of the session&lt;/a&gt; on YouTube.&lt;/p&gt;
&lt;h2 id=&#34;capos-2022&#34;&gt;CAPOS 2022&lt;/h2&gt;
&lt;p&gt;On 16 November I gave the feature presentation at &lt;a href=&#34;https://inke.ca/reviewing-revising-and-refining-open-social-scholarship-australasia/&#34;&gt;Reviewing, Revising, and Refining Open Social Scholarship: Australasia&lt;/a&gt;, an event organised by the Canadian-Australian Partnership for Open Scholarship. My talk, ‘DIY Infrastructure – Building the GLAM Workbench’, described work I’ve been doing over the past year or so to make the GLAM Workbench more sustainable by automating and standardising basic tasks, and integrating it with a range of existing services and tools. You can view &lt;a href=&#34;https://echo360.ca/media/14297fe8-2a18-4f60-9da6-5986fde1fcfb/public&#34;&gt;the video of my talk&lt;/a&gt;, or &lt;a href=&#34;https://slides.com/wragge/capos22&#34;&gt;browse the slides&lt;/a&gt;.&lt;/p&gt;
&lt;iframe src=&#34;https://slides.com/wragge/capos22/embed&#34; width=&#34;600&#34; height=&#34;420&#34; title=&#34;DIY Infrastructure: Building the GLAM Workbench&#34; scrolling=&#34;no&#34; frameborder=&#34;0&#34; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;
&lt;h2 id=&#34;worlds-of-wikimedia-2022&#34;&gt;Worlds of Wikimedia 2022&lt;/h2&gt;
&lt;p&gt;Finally, on 17 November, I gave a keynote presentation at the &lt;a href=&#34;https://www.wow2022.net/&#34;&gt;Worlds of Wikimedia Conference&lt;/a&gt;. My talk, ‘Portals, platforms, and participation: building online collaboration around GLAM collections’, revisited my &lt;a href=&#34;https://doi.org/10.5281/zenodo.3563238&#34;&gt;Portals to platforms&lt;/a&gt; paper from 2014, looking at ways that people can engage and create in the space around GLAM collections. You can &lt;a href=&#34;https://slides.com/wragge/wow2022&#34;&gt;view my slides online&lt;/a&gt;.&lt;/p&gt;
&lt;iframe src=&#34;https://slides.com/wragge/wow2022/embed&#34; width=&#34;600&#34; height=&#34;420&#34; title=&#34;Portals, platforms, and participation&#34; scrolling=&#34;no&#34; frameborder=&#34;0&#34; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>Do you want your Trove newspaper articles in bulk? Meet the new Trove Newspaper Harvester Python package!</title>
      <link>https://updates.timsherratt.org/2022/09/22/do-you-want.html</link>
      <pubDate>Thu, 22 Sep 2022 14:53:49 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/09/22/do-you-want.html</guid>
      <description>&lt;p&gt;The Trove Newspaper Harvester has been around in different forms for more than a decade. It helps you download all the articles in a &lt;a href=&#34;https://trove.nla.gov.au/&#34;&gt;Trove&lt;/a&gt; newspaper search, opening up new possibilities for large-scale analysis. You can use it as a command-line tool by installing a Python package, or through the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper Harvester&lt;/a&gt; section of the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;I’ve just overhauled development of the Python package. The new &lt;a href=&#34;https://github.com/wragge/trove-newspaper-harvester&#34;&gt;trove-newspaper-harvester&lt;/a&gt; replaces the old &lt;a href=&#34;https://github.com/wragge/troveharvester&#34;&gt;troveharvester&lt;/a&gt; repository. The command-line interface remains the same (with a few new options), so it’s really a drop-in replacement. &lt;a href=&#34;https://wragge.github.io/trove-newspaper-harvester/&#34;&gt;&lt;strong&gt;Read the full documentation&lt;/strong&gt;&lt;/a&gt; of the new package for more details.&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/7b5c0bea8e.png&#34; width=&#34;600&#34; height=&#34;483&#34; alt=&#34;Screenshot of the trove-newspaper-harvester documentation describing its use as a Python library.&#34; /&gt;
&lt;p&gt;Here’s a summary of the changes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the package can &lt;strong&gt;now be used as a &lt;a href=&#34;https://wragge.github.io/trove-newspaper-harvester/core.html&#34;&gt;library&lt;/a&gt;&lt;/strong&gt; (that you incorporate into your own code) as well as a standalone &lt;a href=&#34;https://wragge.github.io/trove-newspaper-harvester/cli.html&#34;&gt;command-line tool&lt;/a&gt; – this means you can embed the harvester in your own tools or workflows&lt;/li&gt;
&lt;li&gt;both the library and the CLI now let you set the names of the directories in which your harvests will be saved – this makes it easier to organise your harvests into groups and give them meaningful names&lt;/li&gt;
&lt;li&gt;the harvesting process now &lt;a href=&#34;https://wragge.github.io/trove-newspaper-harvester/core.html#results&#34;&gt;saves results&lt;/a&gt; into a newline-delimited JSON file (one JSON object per line) – the library has a &lt;code&gt;save_csv()&lt;/code&gt; option to convert this to a CSV file, while the CLI automatically converts the results to CSV to maintain compatibility with previous versions&lt;/li&gt;
&lt;li&gt;behind the scenes, the package is now developed and maintained using &lt;a href=&#34;https://nbdev.fast.ai/&#34;&gt;nbdev&lt;/a&gt; – this means the code and documentation are all generated from a set of Jupyter notebooks&lt;/li&gt;
&lt;li&gt;the Jupyter notebooks include a variety of automatic tests which should make maintenance and development much easier in the future&lt;/li&gt;
&lt;/ul&gt;
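&lt;p&gt;The newline-delimited JSON format is easy to work with in plain Python. Here’s a minimal sketch of the NDJSON-to-CSV conversion described above – it’s not the package’s own &lt;code&gt;save_csv()&lt;/code&gt; implementation, just an illustration of the idea using the standard library, with made-up article records:&lt;/p&gt;

```python
import csv
import io
import json

def ndjson_to_csv(ndjson_text, csv_file):
    """Convert newline-delimited JSON (one object per line) to CSV."""
    records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    # Collect every field name so rows with missing keys still align
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return len(records)

# Two harvested-article-like records (hypothetical data)
ndjson = '{"article_id": "123", "title": "Flood news"}\n{"article_id": "456", "title": "Drought", "page": 4}'
output = io.StringIO()
count = ndjson_to_csv(ndjson, output)
```

&lt;p&gt;Because field names are collected across all records, later records can introduce new fields without breaking earlier rows – one reason NDJSON makes a forgiving intermediate format for harvests.&lt;/p&gt;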
&lt;p&gt;I’ve also updated the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper Harvester section&lt;/a&gt; of the GLAM Workbench to use the new package. The new core library will make it easier to develop more complex harvesting examples – for example, searching for articles from a specific day across a range of years. If you find any problems, or want to suggest improvements, please &lt;a href=&#34;https://github.com/wragge/trove-newspaper-harvester/issues/new&#34;&gt;raise an issue&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>From 48 PDFs to one searchable database – opening up the Tasmanian Post Office Directories with the GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2022/09/15/from-pdfs-to.html</link>
      <pubDate>Thu, 15 Sep 2022 23:15:21 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/09/15/from-pdfs-to.html</guid>
      <description>&lt;p&gt;A few weeks ago I &lt;a href=&#34;https://updates.timsherratt.org/2022/09/01/making-nsw-postal.html&#34;&gt;created a new search interface&lt;/a&gt; to the &lt;a href=&#34;https://glam-workbench.net/trove-journals/nsw-post-office-directories/&#34;&gt;NSW Post Office Directories from 1886 to 1950&lt;/a&gt;. Since then, I’ve used the same process on the &lt;a href=&#34;https://glam-workbench.net/trove-journals/sydney-telephone-directories/&#34;&gt;Sydney Telephone Directories from 1926 to 1954&lt;/a&gt;. Both of these publications had been digitised by the State Library of NSW and made available through Trove. To build the new interfaces I downloaded the text from Trove, indexed it by line, and linked it back to the online page images.&lt;/p&gt;
&lt;p&gt;But there are similar directories from other states that are not available through Trove. The Tasmanian Post Office Directory, for example, has been &lt;a href=&#34;https://stors.tas.gov.au/ILS/SD_ILS-981598&#34;&gt;digitised between 1890 and 1948&lt;/a&gt; and made available as 48 individual PDF files from Libraries Tasmania. While it’s great that they’ve been digitised, it’s not really possible to search them without downloading all the PDFs.&lt;/p&gt;
&lt;p&gt;As part of the &lt;a href=&#34;https://everydayheritage.au/&#34;&gt;Everyday Heritage&lt;/a&gt; project, &lt;a href=&#34;https://katebagnall.com/&#34;&gt;Kate Bagnall&lt;/a&gt; and I are working on mapping Tasmania’s Chinese history – finding new ways of connecting people and places. The Tasmanian Post Office Directories will be a useful source for us, so I thought I’d try converting them into a database as I had with the NSW directories. But how?&lt;/p&gt;
&lt;p&gt;There were several stages involved:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Downloading the 48 PDF files&lt;/li&gt;
&lt;li&gt;Extracting the text and images from the PDFs&lt;/li&gt;
&lt;li&gt;Making the separate images available online so they could be integrated with the search interface&lt;/li&gt;
&lt;li&gt;Loading all the text and image links into a SQLite database for online delivery using Datasette&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And here’s the result!&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/91aff3faba.png&#34; width=&#34;600&#34; height=&#34;326&#34; alt=&#34;Screenshot of search interface for the Tasmanian Post Office Directories.&#34; /&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/tasmanian-post-office-directories/&#34;&gt;&lt;strong&gt;Search for people and places in Tasmania from 1890 to 1948!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The complete process is documented in a &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/#tasmanian-post-office-directories&#34;&gt;series of notebooks&lt;/a&gt;, shared through the brand new &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/&#34;&gt;Libraries Tasmania section&lt;/a&gt; of the GLAM Workbench. As with the NSW directories, the processing pipeline I developed could be reused with similar publications in PDF form. Any suggestions?&lt;/p&gt;
&lt;h2 id=&#34;some-technical-details&#34;&gt;Some technical details&lt;/h2&gt;
&lt;p&gt;There were some interesting challenges in connecting up all the pieces. &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tas-pod-save-text-images/&#34;&gt;Extracting the text and images from the PDFs&lt;/a&gt; was remarkably easy using &lt;a href=&#34;https://pymupdf.readthedocs.io/en/latest/&#34;&gt;PyMuPDF&lt;/a&gt;, but the quality of the text wasn’t great. In particular, I had trouble with columns – values from neighbouring columns would be munged together, upsetting the order of the text. I tried working with the positional information provided by PyMuPDF to improve column detection, but every improvement seemed to raise another issue. I was also worried that too much processing might result in some text being lost completely.&lt;/p&gt;
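&lt;p&gt;For the curious, one common approach to the column problem is to cluster words by their left-hand x-coordinates. This is just an illustrative sketch – the word tuples are simplified stand-ins for the positional data PyMuPDF returns, and the fixed gap threshold is exactly the kind of assumption that kept raising new issues:&lt;/p&gt;

```python
def assign_columns(words, gap=50):
    """Group words into columns by clustering their left x-coordinates.

    words: list of (x0, y0, text) tuples, simplified from the output of
    PyMuPDF's page.get_text("words"). A new column starts wherever the
    gap between sorted x0 values exceeds `gap` (an assumed threshold).
    """
    xs = sorted({x0 for x0, _, _ in words})
    boundaries = [xs[0]]
    for prev, curr in zip(xs, xs[1:]):
        if curr - prev > gap:
            boundaries.append(curr)
    columns = [[] for _ in boundaries]
    for x0, y0, text in words:
        # Assign the word to the right-most boundary at or left of it
        idx = max(i for i, b in enumerate(boundaries) if b <= x0)
        columns[idx].append((y0, text))
    # Order each column top to bottom
    return [[t for _, t in sorted(col)] for col in columns]

# Two columns of a directory-style page: (x0, y0, text)
words = [
    (20, 10, "Smith"), (25, 30, "Jones"),
    (300, 10, "grocer"), (305, 30, "baker"),
]
cols = assign_columns(words)
```

&lt;p&gt;A fixed threshold like this works on clean pages but fails on indented entries or skewed scans, which is why tuning it tends to trade one error for another.&lt;/p&gt;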
&lt;p&gt;I tried a few experiments re-OCRing the images with &lt;a href=&#34;https://aws.amazon.com/textract/&#34;&gt;Textract&lt;/a&gt; (a paid service from Amazon) and &lt;a href=&#34;https://github.com/tesseract-ocr/tesseract&#34;&gt;Tesseract&lt;/a&gt;. The basic Textract product provides good OCR, but again I needed to work with the positional information to try and reassemble the columns. On the other hand, Tesseract’s automatic layout detection seemed to work pretty well with just the default settings. It wasn’t perfect, but good enough to support search and navigation. So I decided to &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tas-pod-ocr-with-tesseract/&#34;&gt;re-OCR all the images using Tesseract&lt;/a&gt;. I’m pretty happy with the result.&lt;/p&gt;
&lt;p&gt;The search interfaces for the NSW directories display page images loaded directly from Trove into an &lt;a href=&#34;https://openseadragon.github.io/&#34;&gt;OpenSeadragon viewer&lt;/a&gt;. The Tasmanian directories have no online images to integrate in this way, so I had to set up some hosting for the images I extracted from the PDFs. I could have just loaded them from an Amazon s3 bucket, but I wanted to use IIIF to deliver the images. Fortunately there’s a great project that uses Amazon’s Lambda service to provide a &lt;a href=&#34;https://github.com/samvera-labs/serverless-iiif&#34;&gt;Serverless IIIF Image API&lt;/a&gt;. To prepare the images for IIIF, you convert them to pyramidal TIFFs (a format that contains an image at a number of different resolutions) using &lt;a href=&#34;https://github.com/libvips/libvips&#34;&gt;VIPS&lt;/a&gt;. Then you upload the TIFFs to an s3 bucket and point the Serverless IIIF app at the bucket. There’s &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tas-pod-upload-images/&#34;&gt;more details in this notebook&lt;/a&gt;. It’s very easy and seems to deliver images amazingly quickly.&lt;/p&gt;
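&lt;p&gt;Once the IIIF service is running, image requests are just predictable URLs following the IIIF Image API pattern. This little helper shows the URL structure – the endpoint and identifier here are hypothetical, not the actual addresses used by the search interface:&lt;/p&gt;

```python
def iiif_image_url(base, identifier, region="full", size="full",
                   rotation=0, quality="default", fmt="jpg"):
    """Build a IIIF Image API request URL.

    The path follows the Image API pattern:
    {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    return f"{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Hypothetical endpoint and image id for a directory page
url = iiif_image_url(
    "https://iiif.example.org/iiif/2",
    "tas-pod-1890-p001",
    size="600,",  # scale to 600px wide, preserving aspect ratio
)
```

&lt;p&gt;Viewers like OpenSeadragon build exactly these sorts of requests behind the scenes, asking for just the tiles and resolutions they need from the pyramidal TIFFs.&lt;/p&gt;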
&lt;p&gt;The rest of the processing followed the process I used with the NSW directories – &lt;a href=&#34;https://glam-workbench.net/libraries-tasmania/tas-pod-add-to-datasette/&#34;&gt;using SQLite-utils and Datasette to package the data&lt;/a&gt; and deliver it online via Google Cloudrun.&lt;/p&gt;
&lt;h2 id=&#34;postscript-time-and-money&#34;&gt;Postscript: Time and money&lt;/h2&gt;
&lt;p&gt;I thought I should add a little note about costs (time and money) in case anyone was interested in using this workflow on other publications. I started working on this on Sunday afternoon and had a full working version up about 24 hours later – that includes a fair bit of work that I didn’t end up using, but doesn’t include the time I spent re-OCRing the text a day or so later. This was possible because I was reusing bits of code from other projects, and taking advantage of some awesome open-source software. Now that the processing pipeline is pretty well-defined and documented it should be even faster.&lt;/p&gt;
&lt;p&gt;The search interface uses cloud services from Amazon and Google. It’s a bit tricky to calculate the precise costs of these, but here’s a rough estimate.&lt;/p&gt;
&lt;p&gt;I uploaded 63.9gb of images to Amazon s3. These should cost about US$1.47 per month to store.&lt;/p&gt;
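&lt;p&gt;If you want to estimate storage costs for your own collection, the arithmetic is simple. This assumes Amazon’s standard s3 rate of about US$0.023 per gb per month – rates vary by region and storage tier, so check current pricing:&lt;/p&gt;

```python
# Rough s3 standard-storage estimate; the per-gb rate is an assumption,
# check current pricing for your region and tier
gb_stored = 63.9
rate_per_gb_month = 0.023
monthly_cost = round(gb_stored * rate_per_gb_month, 2)
```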
&lt;p&gt;The Serverless IIIF API uses Amazon’s Lambda service. At the moment my usage is within the free tier, so $0 so far.&lt;/p&gt;
&lt;p&gt;The Datasette instance uses Google Cloudrun. Costs for this service are based on a combination of usage, storage space, and the configuration of the environment. The size of the database for the Tasmanian directories is about 600mb, so I can get away with 2gb of application memory. (The NSW Post Office directory currently uses 8gb.) These services scale to zero – so basically they shut down if they’re not being used. This saves a lot of money, but means there can be a pause if they need to start up again. I’m running the Tasmanian and NSW directories, as well as the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index search&lt;/a&gt;, within the same Google Cloud account, and I’m not quite sure how to itemise the costs. But overall, it’s costing me about US$4.00 a month to run them all. Of course if usage increases, so will the costs!&lt;/p&gt;
&lt;p&gt;So I suppose the point is that these sorts of approaches can be quite a practical and cost-effective way of improving access to digitised resources, and don’t need huge investments in time or infrastructure.&lt;/p&gt;
&lt;p&gt;If you want to contribute to the running costs of the NSW and Tasmanian directories you can &lt;a href=&#34;https://github.com/sponsors/wragge?o=esb&#34;&gt;sponsor me on GitHub&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Fresh harvest of OCRd text from Trove&#39;s digitised periodicals – 9gb of text to explore and analyse!</title>
      <link>https://updates.timsherratt.org/2022/09/05/fresh-harvest-of.html</link>
      <pubDate>Mon, 05 Sep 2022 17:13:16 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/09/05/fresh-harvest-of.html</guid>
      <description>&lt;p&gt;I’ve updated the GLAM Workbench’s harvest of OCRd text from Trove&amp;rsquo;s digitised periodicals. This is a completely fresh harvest, so should include any corrections made in recent months. It includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1,430 periodicals&lt;/li&gt;
&lt;li&gt;OCRd text from 41,645 issues&lt;/li&gt;
&lt;li&gt;About 9gb of text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The easiest way to explore the harvest is probably this &lt;a href=&#34;https://glam-workbench.net/trove-journals/journals-with-ocr/&#34;&gt;human-readable list&lt;/a&gt;. The list of periodicals with OCRd text is also &lt;a href=&#34;https://glam-workbench.net/trove-journals/csv-journals-with-ocr/&#34;&gt;available as a CSV&lt;/a&gt;. You can find &lt;a href=&#34;https://glam-workbench.net/trove-journals/ocrd-text-all-journals/&#34;&gt;more details&lt;/a&gt; in the Trove journals section of the GLAM Workbench, and &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/QOmnqpGQCNCSC2h&#34;&gt;download the complete corpus&lt;/a&gt; from CloudStor.&lt;/p&gt;
&lt;p&gt;Finding which periodical issues in Trove have OCRd text you can download is not as easy as it should be. The &lt;code&gt;fullTextInd&lt;/code&gt; index doesn&amp;rsquo;t seem to distinguish between digitised works (with OCR) and born-digital publications (like PDFs) without downloadable text. You can use &lt;code&gt;has:correctabletext&lt;/code&gt; to find articles with OCR, but you can&amp;rsquo;t get a full list of the periodicals the articles come from using the &lt;code&gt;title&lt;/code&gt; facet. As &lt;a href=&#34;https://glam-workbench.net/trove-journals/create-list-digitised-journals/&#34;&gt;this notebook explains&lt;/a&gt;, you can search for &lt;code&gt;nla.obj&lt;/code&gt;, but this returns both digitised works and publications supplied through edeposit. In previous harvests of OCRd text I processed all of the titles returned by the &lt;code&gt;nla.obj&lt;/code&gt; search, finding out whether there was any OCRd text by just requesting it and seeing what came back. But the number of non-digitised works on the list of periodicals in digital form has skyrocketed through the edeposit scheme and this approach is no longer practical. It just means you waste a lot of time asking for things that don&amp;rsquo;t exist.&lt;/p&gt;
&lt;p&gt;For the latest harvest I took a different approach. I only processed periodicals in digital form that &lt;em&gt;weren&amp;rsquo;t&lt;/em&gt; identified as coming through edeposit. These are the publications with a &lt;code&gt;fulltext_url_type&lt;/code&gt; value of either &amp;lsquo;digitised&amp;rsquo; or &amp;lsquo;other&amp;rsquo; in my &lt;a href=&#34;https://glam-workbench.net/trove-journals/csv-digital-journals/&#34;&gt;dataset of digital periodicals&lt;/a&gt;. Is it possible that there&amp;rsquo;s some downloadable text in edeposit works that&amp;rsquo;s now missing from the harvest? Yep, but I think this is a much more sensible, straightforward, and reproducible approach.&lt;/p&gt;
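&lt;p&gt;Filtering the dataset down to the non-edeposit titles is straightforward. Here’s a sketch using the standard library – the column name and values come from the dataset described above, but the titles are made up:&lt;/p&gt;

```python
import csv
import io

# A tiny stand-in for the dataset of digital periodicals
data = """title,fulltext_url_type
The Bulletin,digitised
Some Report,edeposit
Another Journal,other
"""

# Keep only titles identified as 'digitised' or 'other'
keep = [
    row for row in csv.DictReader(io.StringIO(data))
    if row["fulltext_url_type"] in ("digitised", "other")
]
```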
&lt;p&gt;That&amp;rsquo;s not the only problem. As I noted when creating the list of periodicals in digital form, there are duplicates in the list, so they have to be removed. You then have to find information about the issues available for each title. This is not provided by the Trove API, but there is an internal API used in the web interface that you can access – see &lt;a href=&#34;https://glam-workbench.net/trove-journals/get-ocrd-text-from-digitised-journal/&#34;&gt;this notebook for details&lt;/a&gt;. I also noticed that sometimes where there&amp;rsquo;s a single issue of a title, it&amp;rsquo;s presented as if each page is an issue. I think I&amp;rsquo;ve found a workaround for that as well.&lt;/p&gt;
&lt;p&gt;All these doubts, inconsistencies and workarounds mean that I&amp;rsquo;m fairly certain I don&amp;rsquo;t have &lt;em&gt;everything&lt;/em&gt;. But I do think I have &lt;em&gt;most&lt;/em&gt; of the OCRd text available from digitised periodicals, and I do have a methodology, &lt;a href=&#34;https://glam-workbench.net/trove-journals/get-ocrd-text-from-all-journals/&#34;&gt;documented in this notebook&lt;/a&gt;, that at least provides a starting point for further investigation. As I noted in my &lt;a href=&#34;https://updates.timsherratt.org/2022/05/11/my-trove-researcher.html&#34;&gt;wishlist for a Trove Researcher Platform&lt;/a&gt;, it would be great if more metadata for digitised works, other than newspapers, was made available through the API.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Explore Trove&#39;s digitised newspapers by place</title>
      <link>https://updates.timsherratt.org/2022/09/05/explore-troves-digitised.html</link>
      <pubDate>Mon, 05 Sep 2022 16:34:05 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/09/05/explore-troves-digitised.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve updated my map displaying places where Trove digitised newspapers were published or distributed. You can view &lt;a href=&#34;https://troveplaces.herokuapp.com/all/&#34;&gt;all the places on a single map&lt;/a&gt; – zoom in for more markers, and click on a marker for title details and a link back to Trove.&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/e91962f3b3.png&#34; width=&#34;600&#34; height=&#34;552&#34; alt=&#34;A map of Australia with coloured markers indicating the number of Trove’s digitised newspapers published in different locations around the country.&#34; /&gt;
&lt;p&gt;If you want to find newspapers from a particular area, just click on a location &lt;a href=&#34;https://troveplaces.herokuapp.com/map/&#34;&gt;using this map&lt;/a&gt; to view the 10 closest titles.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/cccbceaddd.png&#34; width=&#34;600&#34; height=&#34;316&#34; alt=&#34;A map section focused on Walhalla in eastern Victoria with markers indicating nearby places where Trove’s digitised newspapers were published. A column on the right lists the newspaper titles.&#34; /&gt;
&lt;p&gt;You can &lt;a href=&#34;https://docs.google.com/spreadsheets/d/1rURriHBSf3MocI8wsdl1114t0YeyU0BVSXWeg232MZs/edit?usp=sharing&#34;&gt;view or download the dataset&lt;/a&gt; used to construct the map. Place names were extracted from the newspaper titles using the Geoscience Gazetteer.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Making NSW Postal Directories (and other digitised directories) easier to search with the GLAM Workbench and Datasette</title>
      <link>https://updates.timsherratt.org/2022/09/01/making-nsw-postal.html</link>
      <pubDate>Thu, 01 Sep 2022 17:22:09 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/09/01/making-nsw-postal.html</guid>
      <description>&lt;p&gt;As part of my work on the &lt;a href=&#34;https://everydayheritage.au/about/&#34;&gt;Everyday Heritage&lt;/a&gt; project I’m looking at how we can make better use of digitised collections to explore the everyday experiences woven around places such as Parramatta Road in Sydney. For example, the NSW Postal Directories from &lt;a href=&#34;https://nla.gov.au/nla.obj-518308191&#34;&gt;1886 to 1908&lt;/a&gt; and &lt;a href=&#34;https://nla.gov.au/nla.obj-522689844&#34;&gt;1909 to 1950&lt;/a&gt; have been digitised by the State Library of NSW and made available through Trove. The directories list residences and businesses by name and street location. Using them we can explore changes in the use of Parramatta Road across 60 years of history. But there’s a problem. While you can browse the directories page by page, searching is clunky. Trove’s main search indexes the contents of the directories by ‘article’. Each ‘article’ can be many pages long, so it’s difficult to focus in on the matching text. Clicking through from the search results to the digitised volume lands you in another set of search results, showing all the matches in the volume. However, the internal search index works differently to the main Trove index. In particular it doesn’t seem to understand phrase or boolean searches. If you start off searching for “parramatta road”, Trove tells you there are 50 matching articles, but if you click through to a volume you’re told there are no results. If you remove the quotes you get every match for ‘parramatta’ or ‘road’. It’s all pretty confusing.&lt;/p&gt;
&lt;p&gt;The idea of ‘articles’ is really not very useful for publications like the Post Office Directories where information is mostly organised by column, row or line. You want to be able to search for a name, and go directly to the line in the directory where that name is mentioned. And now you can! Using Datasette, I’ve created an interface that searches &lt;em&gt;by line&lt;/em&gt; across all 54 volumes of the NSW Post Office Directory from 1886 to 1950 (that’s over 30 million lines of text).&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/nsw-post-office-directories/&#34;&gt;&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/cb7e107b84.png&#34; width=&#34;600&#34; height=&#34;400&#34; alt=&#34;Screenshot of home page for the NSW Post Office Directories. There’s a search box to search across all 54 volumes, some search tips, and a summary of the data that notes there are more than 30 million rows.&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/nsw-post-office-directories/&#34;&gt;&lt;strong&gt;Try it now!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2 id=&#34;basic-features&#34;&gt;Basic features&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The full text search supports phrase, boolean, and wildcard searches. Just enter your query in the main search box to get results from all 54 volumes in a flash.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Each search result is a single line of text. Click on the link to view this line in context – it’ll show you 5 lines above and below your match, as well as a zoomable image of the digitised page from Trove.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;For more context, you can click on &lt;strong&gt;View full page&lt;/strong&gt; to see all the lines of text extracted from that page. You can then use the &lt;strong&gt;Next&lt;/strong&gt; and &lt;strong&gt;Previous&lt;/strong&gt; buttons to browse page by page.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;To view the full digitised volume, just click on the &lt;strong&gt;View page in Trove&lt;/strong&gt; button.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
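&lt;p&gt;The ‘5 lines above and below’ context view boils down to a simple list slice. An illustrative sketch, not the interface’s actual code:&lt;/p&gt;

```python
def context_window(lines, idx, span=5):
    """Return the line at idx with up to `span` lines above and below."""
    start = max(0, idx - span)  # clamp at the top of the page
    return lines[start:idx + span + 1]

lines = [f"line {n}" for n in range(20)]
window = context_window(lines, 2)  # near the top, so fewer lines above
```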
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/8c1b80625f.png&#34; width=&#34;600&#34; height=&#34;335&#34; alt=&#34;Screenshot of information about a single row in the NSW Post Office Directories. The row is highlighted in yellow, and displayed in context with five rows above and below. There’s a button to view the full page, and box displaying a zoomable image of the page from Trove.&#34; /&gt;
&lt;h2 id=&#34;how-it-works&#34;&gt;How it works&lt;/h2&gt;
&lt;p&gt;There were a few stages involved in creating this resource, but mostly I was able to reuse bits of code from the GLAM Workbench’s Trove &lt;a href=&#34;https://glam-workbench.net/trove-journals/&#34;&gt;journals&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/trove-books/&#34;&gt;books&lt;/a&gt; sections, and other related projects such as the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt;. Here’s a summary of the processing steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;I started with the two top-level entries for the NSW Postal Directories, harvesting details of the 54 volumes under them.&lt;/li&gt;
&lt;li&gt;For each of these 54 volumes, I downloaded the OCRd text &lt;strong&gt;page by page&lt;/strong&gt;. Downloading the text by page, rather than volume, was very slow, but I thought it was important to be able to link each line of text back to its original page.&lt;/li&gt;
&lt;li&gt;To create links back to pages, I also needed the individual identifiers for each page. A list of page identifiers is embedded as a JSON string within each volume’s HTML, so I extracted this data and matched the page ids to the text.&lt;/li&gt;
&lt;li&gt;Using &lt;a href=&#34;https://sqlite-utils.datasette.io/&#34;&gt;sqlite-utils&lt;/a&gt;, I created a SQLite database with a separate table for every volume. Then I processed the text by volume, page, and line – adding each line of text and its page details as an individual record in the database.&lt;/li&gt;
&lt;li&gt;I then ran full text indexing across each line to make it easily searchable.&lt;/li&gt;
&lt;li&gt;Using &lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; and its &lt;a href=&#34;https://github.com/simonw/datasette-search-all&#34;&gt;search-all plugin&lt;/a&gt;, I loaded up the database and BINGO! More than 30 million lines of text across 54 digitised volumes were instantly searchable.&lt;/li&gt;
&lt;li&gt;To make it all public, I used Datasette’s &lt;code&gt;publish&lt;/code&gt; function to push the database to Google’s Cloudrun service.&lt;/li&gt;
&lt;/ul&gt;
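&lt;p&gt;The heart of the pipeline is SQLite’s built-in full-text search. Here’s a minimal, self-contained sketch of the line-per-row approach using Python’s &lt;code&gt;sqlite3&lt;/code&gt; module – the schema and sample rows are simplified stand-ins for what sqlite-utils actually builds:&lt;/p&gt;

```python
import sqlite3

# One row per line of OCRd text, keyed back to its source page
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE lines_fts USING "
    "fts5(page_id UNINDEXED, line_num UNINDEXED, text)"
)
rows = [
    ("nla.obj-111", 1, "Smith John, grocer, 12 Parramatta rd"),
    ("nla.obj-111", 2, "Jones Mary, dressmaker, George st"),
    ("nla.obj-112", 1, "Brown James, ironmonger, Parramatta rd"),
]
conn.executemany("INSERT INTO lines_fts VALUES (?, ?, ?)", rows)

# Phrase search across every indexed line, as in the Datasette interface
matches = conn.execute(
    "SELECT page_id, line_num FROM lines_fts "
    "WHERE lines_fts MATCH '\"parramatta rd\"'"
).fetchall()
```

&lt;p&gt;Each match points straight back to a page identifier, which is what makes it possible to link search results to the digitised page images in Trove.&lt;/p&gt;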
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/create-text-db-indexed-by-line/&#34;&gt;All the code is available in the journals section&lt;/a&gt; of the GLAM Workbench.&lt;/p&gt;
&lt;h2 id=&#34;future-developments&#34;&gt;Future developments&lt;/h2&gt;
&lt;p&gt;One of the most exciting things to me is that this processing pipeline can be used with any digitised publication on Trove where it would be easier to search by line rather than article. Any suggestions?&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/08/29/interested-in-victorian.html</link>
      <pubDate>Mon, 29 Aug 2022 16:55:30 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/08/29/interested-in-victorian.html</guid>
      <description>&lt;p&gt;Interested in Victorian shipwrecks? Kim Doyle and Mitchell Harrop have added a new notebook to the Heritage Council of Victoria section of the GLAM Workbench exploring shipwrecks in the Victorian Heritage Database: &lt;a href=&#34;https://glam-workbench.net/heritage-council-of-victoria/&#34;&gt;glam-workbench.net/heritage-&amp;hellip;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/08/29/updates-troveharvester-python.html</link>
      <pubDate>Mon, 29 Aug 2022 14:53:26 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/08/29/updates-troveharvester-python.html</guid>
      <description>&lt;p&gt;Updates!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;troveharvester Python package updated to v0.5.1: &lt;a href=&#34;https://github.com/wragge/troveharvester/releases/tag/v0.5.1&#34;&gt;github.com/wragge/tr&amp;hellip;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Trove Newspaper Harvester section of #GLAMWorkbench updated to v1.1.1 to use latest troveharvester: &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;glam-workbench.net/trove-har&amp;hellip;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/08/25/minor-update-to.html</link>
      <pubDate>Thu, 25 Aug 2022 14:29:39 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/08/25/minor-update-to.html</guid>
      <description>&lt;p&gt;Minor update to RecordSearch Data Scraper – now captures &amp;lsquo;institution title&amp;rsquo; for agencies if it is present. &lt;a href=&#34;https://pypi.org/project/recordsearch-data-scraper/0.0.17/&#34;&gt;pypi.org/project/r&amp;hellip;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Many thanks to the British Library – sponsors of the GLAM Workbench’s web archives section!</title>
      <link>https://updates.timsherratt.org/2022/08/16/many-thanks-to.html</link>
      <pubDate>Tue, 16 Aug 2022 11:21:59 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/08/16/many-thanks-to.html</guid>
      <description>&lt;p&gt;You might have noticed some changes to the web archives section of the GLAM Workbench.&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/07d12ab5f7.png&#34; width=&#34;600&#34; height=&#34;293&#34; alt=&#34;Screenshot of the web archives section showing the acknowledgement of the British Library&#39;s sponsorship.&#34; /&gt;
&lt;p&gt;I’m very excited to announce that the &lt;a href=&#34;https://www.bl.uk/&#34;&gt;British Library&lt;/a&gt; is now sponsoring the &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;web archives section&lt;/a&gt;! Many thanks to the British Library and the &lt;a href=&#34;https://www.webarchive.org.uk/&#34;&gt;UK Web Archive&lt;/a&gt; for their support – it really makes a difference.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;web archives section&lt;/a&gt; was developed in 2020 with the support of the International Internet Preservation Consortium&amp;rsquo;s &lt;a href=&#34;http://netpreserve.org/projects/&#34;&gt;Discretionary Funding Programme&lt;/a&gt;, in collaboration with the British Library, the National Library of Australia, and the National Library of New Zealand. It’s intended to help historians and other researchers understand what sort of data is available through web archives, how to get it, and what you can do with it. It provides a series of tools and examples that document existing APIs, and explore questions such as how web pages change over time. The notebooks focus on four particular web archives: the &lt;a href=&#34;https://www.webarchive.org.uk/&#34;&gt;UK Web Archive&lt;/a&gt;, the &lt;a href=&#34;https://trove.nla.gov.au/website&#34;&gt;Australian Web Archive&lt;/a&gt; (National Library of Australia), the &lt;a href=&#34;https://natlib.govt.nz/collections/a-z/new-zealand-web-archive&#34;&gt;New Zealand Web Archive&lt;/a&gt; (National Library of New Zealand), and the &lt;a href=&#34;https://archive.org/web/&#34;&gt;Internet Archive&lt;/a&gt;. However, the tools and approaches could be easily extended to other web archives (and soon will be!). I introduced the web archives section of the GLAM Workbench in this seminar for the IIPC in August 2020:&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/rVidh_wexoo&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;According to the &lt;a href=&#34;https://updates.timsherratt.org/2021/06/13/some-glam-workbench.html&#34;&gt;Binder launch stats&lt;/a&gt;, the web archives section is the most heavily used part of the GLAM Workbench. In December 2020, it &lt;a href=&#34;https://updates.timsherratt.org/2020/12/16/glam-workbench-wins.html&#34;&gt;won the British Library Labs Research Award&lt;/a&gt;. Last year I &lt;a href=&#34;https://updates.timsherratt.org/2021/05/17/web-archives-section.html&#34;&gt;updated the repository&lt;/a&gt;, automating the build of Docker images, and adding integrations with Zenodo, Reclaim Cloud, and &lt;a href=&#34;https://updates.timsherratt.org/2021/10/21/glam-workbench-now.html&#34;&gt;Australia’s Nectar research cloud&lt;/a&gt;. I’m also thinking about some new notebooks – watch this space!&lt;/p&gt;
&lt;p&gt;The GLAM Workbench &lt;a href=&#34;https://glam-workbench.net/about/#how-is-the-glam-workbench-funded&#34;&gt;receives no direct funding&lt;/a&gt; from government or research agencies, and so the support of sponsors like the British Library and all my other &lt;a href=&#34;https://glam-workbench.net/get-involved/supporters/&#34;&gt;GitHub sponsors&lt;/a&gt; is really important. &lt;strong&gt;Thank you!&lt;/strong&gt; If you think this work is valuable, have a look at the &lt;a href=&#34;https://glam-workbench.net/get-involved/&#34;&gt;Get involved!&lt;/a&gt; page to see how you can contribute. And if your organisation would like to sponsor a section of the GLAM Workbench, let me know!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New GLAM data to search, visualise and explore using the GLAM Workbench!</title>
      <link>https://updates.timsherratt.org/2022/08/15/new-glam-data.html</link>
      <pubDate>Mon, 15 Aug 2022 17:43:44 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/08/15/new-glam-data.html</guid>
      <description>&lt;p&gt;There’s lots of GLAM data out there if you know where to look! For the past few years I’ve been harvesting a list of datasets published by Australian galleries, libraries, archives, and museums through open government data portals. I’ve just updated the harvest and there’s now 463 datasets containing 1,192 files. There’s a &lt;a href=&#34;https://glam-workbench.net/glam-datasets-from-gov-portals/&#34;&gt;human-readable version of the list&lt;/a&gt; that you can browse. If you just want the data you can &lt;a href=&#34;https://github.com/GLAM-Workbench/ozglam-data/blob/master/glam-datasets-from-gov-portals.csv&#34;&gt;download it as a CSV&lt;/a&gt;. Or if you’d like to search the list there’s a &lt;a href=&#34;https://ozglam-datasets.glitch.me/data/glam-datasets&#34;&gt;database version hosted on Glitch&lt;/a&gt;. The harvesting and processing code is &lt;a href=&#34;https://glam-workbench.net/glam-data-portals/#harvesting-glam-data-from-government-portals&#34;&gt;available in this notebook&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/glam-data-portals/&#34;&gt;GLAM data from government portals&lt;/a&gt; section of the GLAM Workbench provides more information and a summary of results.  For example, here’s a list of the number of data files by GLAM institution.&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/ee819455a2.png&#34; width=&#34;600&#34; height=&#34;302&#34; alt=&#34;Table showing the number of datasets published by each GLAM organisation. Queensland State Archives is on top with 108 datasets.&#34; /&gt;
&lt;p&gt;Most of the datasets are in CSV format, and most have a CC-BY licence.&lt;/p&gt;
&lt;h2 id=&#34;whats-inside&#34;&gt;What’s inside?&lt;/h2&gt;
&lt;p&gt;Obviously it’s great that GLAM organisations are sharing lots of open data, but what’s actually inside all of those CSV files? To help you find out, I created the &lt;a href=&#34;https://glam-workbench.net/csv-explorer/&#34;&gt;GLAM CSV Explorer&lt;/a&gt;. Click on the blue button to run it in Binder, then just select a dataset from the dropdown list. The CSV Explorer will download the file, and examine the contents of every field to try and determine the type of data it holds – such as text, dates, or numbers. It then summarises the results and builds a series of visualisations to give you an overview of the dataset.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/755e6bddaf.png&#34; width=&#34;600&#34; height=&#34;312&#34; alt=&#34;Screenshot of GLAM CSV Explorer. A series of dropdown boxes allow you to select a dataset to analyse.&#34; /&gt;
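&lt;p&gt;As a rough illustration of the idea – this is a much-simplified sketch with pandas, not the CSV Explorer&amp;rsquo;s actual code – field-type profiling might look something like this:&lt;/p&gt;

```python
import io

import pandas as pd

def profile_fields(csv_text):
    """Classify each column of a CSV as 'number', 'date', or 'text'.

    Illustrative sketch only: the real CSV Explorer does much more,
    including summary statistics and visualisations.
    """
    df = pd.read_csv(io.StringIO(csv_text))
    profile = {}
    for col in df.columns:
        values = df[col].dropna().astype(str)
        if pd.to_numeric(values, errors="coerce").notna().all():
            profile[col] = "number"
        elif pd.to_datetime(values, errors="coerce").notna().all():
            profile[col] = "date"
        else:
            profile[col] = "text"
    return profile

sample = "name,year,visited\nAlice,1901,1901-05-12\nBob,1902,1902-06-01\n"
print(profile_fields(sample))
```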
&lt;h2 id=&#34;search-for-names&#34;&gt;Search for names&lt;/h2&gt;
&lt;p&gt;Many of the datasets are name indexes to collections of records – GLAM staff or volunteers have transcribed the names of people mentioned in records as an aid to users. For Family History Month last year I aggregated all of the name indexes and made them searchable through a single interface using Datasette. The &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; has been updated as well – it searches across 10.3 million records in 253 indexes from 10 GLAM organisations. And it’s free!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/f3291b49e2.png&#34; width=&#34;600&#34; height=&#34;555&#34; alt=&#34;Screenshot of GLAM Name Index Search showing the list of GLAM organisations with the number of tables and rows you can search from each.&#34; /&gt;
&lt;h2 id=&#34;and-a-bit-of-maintenance&#34;&gt;And a bit of maintenance…&lt;/h2&gt;
&lt;p&gt;As well as updating the data, I also updated &lt;a href=&#34;https://github.com/GLAM-Workbench/ozglam-data&#34;&gt;the code repository&lt;/a&gt;, adding the features that I’m rolling out across the whole of the GLAM Workbench. This includes automated Docker builds saved to Quay.io, integrations with Reclaim Cloud and Zenodo, and some basic quality controls through testing and code format checks.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Zotero now saves links to digitised items in Trove from the NLA catalogue!</title>
      <link>https://updates.timsherratt.org/2022/08/09/zotero-now-saves.html</link>
      <pubDate>Tue, 09 Aug 2022 10:52:53 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/08/09/zotero-now-saves.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve made a small change to the Zotero translator for the National Library of Australia&amp;rsquo;s catalogue. Now, if there&amp;rsquo;s a link to a digitised version of the work in Trove, that link will be saved in Zotero&amp;rsquo;s &lt;code&gt;url&lt;/code&gt; field. This makes it quicker and easier to view digitised items – just click on the &amp;lsquo;URL&amp;rsquo; label in Zotero to open the link.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s also handy if you&amp;rsquo;re viewing a digitised work in Trove and want to capture the metadata about it. Just click on the &amp;lsquo;View catalogue&amp;rsquo; link in the details tab of a Trove item, then use Zotero to save the details from the catalogue.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>View embedded JSON metadata for Trove&#39;s digitised books and journals</title>
      <link>https://updates.timsherratt.org/2022/08/01/view-embedded-json.html</link>
      <pubDate>Mon, 01 Aug 2022 19:22:47 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/08/01/view-embedded-json.html</guid>
      <description>&lt;p&gt;The metadata for digitised books and journals in Trove can seem a bit sparse, but there&amp;rsquo;s quite a lot of useful metadata embedded within Trove&amp;rsquo;s web pages that isn&amp;rsquo;t displayed to users or made available through the Trove API. This &lt;a href=&#34;https://glam-workbench.net/trove-books/metadata-for-digital-works/&#34;&gt;notebook&lt;/a&gt; in the GLAM Workbench shows you how you can access it. To make it even easier, I&amp;rsquo;ve added a new endpoint to my &lt;a href=&#34;https://trove-proxy.herokuapp.com/&#34;&gt;Trove Proxy&lt;/a&gt; that returns the metadata in JSON format.&lt;/p&gt;
&lt;p&gt;Just pass the url of a digitised book or journal as a parameter named &lt;code&gt;url&lt;/code&gt; to &lt;code&gt;https://trove-proxy.herokuapp.com/metadata/&lt;/code&gt;. For example:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://trove-proxy.herokuapp.com/metadata/?url=https://nla.gov.au/nla.obj-2906940941&#34;&gt;https://trove-proxy.herokuapp.com/metadata/?url=https://nla.gov.au/nla.obj-2906940941&lt;/a&gt;&lt;/p&gt;
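&lt;p&gt;Building the request yourself is just a matter of URL-encoding the work&amp;rsquo;s address as the &lt;code&gt;url&lt;/code&gt; parameter. A minimal Python sketch using only the standard library (the function name here is just for illustration):&lt;/p&gt;

```python
from urllib.parse import urlencode

# Proxy endpoint as described above
PROXY_ENDPOINT = "https://trove-proxy.herokuapp.com/metadata/"

def proxy_metadata_url(work_url):
    """Return the proxy URL that delivers a digitised work's embedded metadata as JSON.

    urlencode() percent-encodes the work URL, which keeps the parameter
    safe even if the work URL contains its own query string.
    """
    return PROXY_ENDPOINT + "?" + urlencode({"url": work_url})

print(proxy_metadata_url("https://nla.gov.au/nla.obj-2906940941"))
# → https://trove-proxy.herokuapp.com/metadata/?url=https%3A%2F%2Fnla.gov.au%2Fnla.obj-2906940941
```

&lt;p&gt;You can then fetch that URL in your browser, or with a library like &lt;code&gt;requests&lt;/code&gt;, to retrieve the JSON.&lt;/p&gt;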
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/65a7caed7d.png&#34; width=&#34;600&#34; height=&#34;978&#34; alt=&#34;Screenshot of the collapsed JSON metadata returned from the url above. It includes fields such as &#39;title&#39;, &#39;accessConditions&#39;, &#39;marcData&#39;, and &#39;children&#39;.&#34; /&gt;
&lt;p&gt;I&amp;rsquo;ve created a simple bookmarklet to make it easier to open the proxy. To use it just:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Drag this link to your bookmarks toolbar: &lt;a href=&#34;javascript:(function(){let o=window.location.href,e=&#39;https://trove-proxy.herokuapp.com/metadata/?url=&#39;+encodeURIComponent(o);window.location.href=e})();&#34; &gt;Get Trove work metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;View a digitised book or journal in Trove.&lt;/li&gt;
&lt;li&gt;Click on the bookmarklet to view the metadata in JSON format.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To view the JSON data in your browser you might need to install an extension like &lt;a href=&#34;https://jsonview.com/&#34;&gt;JSONView&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Where did all those NSW articles go? Trove Newspapers Data Dashboard update!</title>
      <link>https://updates.timsherratt.org/2022/07/29/where-did-all.html</link>
      <pubDate>Fri, 29 Jul 2022 14:20:34 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/07/29/where-did-all.html</guid>
      <description>&lt;p&gt;I was looking at my &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;Trove Newspapers Data Dashboard&lt;/a&gt; again last night trying to figure out why the number of newspaper articles from NSW seemed to have dropped by more than 700,000 since my harvesting began. It took me a while to figure out, but it seems that the search index was rebuilt on 31 May, and that caused some major shifts in the distribution of articles by state, as reported by the main &lt;code&gt;result&lt;/code&gt; API. So the indexing of the articles changed, not the actual number of articles. Interestingly, the number of articles by state reported by the &lt;code&gt;newspaper&lt;/code&gt; API doesn&amp;rsquo;t show the same fluctuations.&lt;/p&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/c20b4dbf89.png&#34; width=&#34;600&#34; height=&#34;421&#34; alt=&#34;Screenshot of data dashboard that compares the number of articles by state as reported by the results and newspapers APIs. There are major differences in the column that shows the change since April 2022.&#34; /&gt;
&lt;p&gt;This adds another layer of complexity to understanding how Trove changes over time. To try and document such things, I&amp;rsquo;ve added a &amp;lsquo;Significant events&amp;rsquo; section to the Dashboard. I&amp;rsquo;ve also included a new &amp;lsquo;Total articles by publication state&amp;rsquo; section that compares results from the &lt;code&gt;result&lt;/code&gt; and &lt;code&gt;newspaper&lt;/code&gt; APIs. This should make it easier to identify such issues in the future.&lt;/p&gt;
&lt;p&gt;Stay alert people – remember, search interfaces lie!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Catching up – some recent GLAM Workbench updates!</title>
      <link>https://updates.timsherratt.org/2022/07/28/catching-up-some.html</link>
      <pubDate>Thu, 28 Jul 2022 15:59:13 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/07/28/catching-up-some.html</guid>
      <description>&lt;p&gt;There’s been lots of small updates to the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; over the last couple of months and I’ve fallen behind in sharing details. So here’s an omnibus list of everything I can remember…&lt;/p&gt;
&lt;h3 id=&#34;data&#34;&gt;Data&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals&#34;&gt;Weekly harvests of basic Trove newspaper data&lt;/a&gt; continue – there&amp;rsquo;s now about three months&amp;rsquo; worth. You can view a summary of the harvested data through the &lt;strong&gt;brand new&lt;/strong&gt; &lt;a href=&#34;https://wragge.github.io/trove-newspaper-totals/&#34;&gt;Trove Newspaper Data Dashboard&lt;/a&gt;. The Dashboard is generated from a Jupyter notebook and is updated whenever there’s a new data harvest.&lt;/li&gt;
&lt;li&gt;There&amp;rsquo;s also &lt;a href=&#34;https://github.com/wragge/naa-recently-digitised&#34;&gt;weekly harvests of files digitised by the NAA&lt;/a&gt;, now 16 months worth of data.&lt;/li&gt;
&lt;li&gt;Updated harvest of &lt;a href=&#34;https://doi.org/10.5281/zenodo.6814722&#34;&gt;Trove public tags&lt;/a&gt; (Zenodo) – includes 2,201,090 unique public tags added to 9,370,614 resources in Trove between August 2008 and July 2022.&lt;/li&gt;
&lt;li&gt;I&amp;rsquo;ve started moving other pre-harvested datasets out of the GLAM Workbench code repositories, into their own data repositories. This means better versioning and citability. The first example is the list of Trove newspapers with articles post the 1955 copyright cliff of death – here&amp;rsquo;s the &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-data-post-54/&#34;&gt;GH repo&lt;/a&gt;, and the &lt;a href=&#34;https://doi.org/10.5281/zenodo.6812811&#34;&gt;Zenodo record&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;To bring together datasets that provide historical data about Trove itself, I&amp;rsquo;ve created a &lt;a href=&#34;https://zenodo.org/communities/trove-historical-data/?page=1&amp;amp;size=20&#34;&gt;Trove historical data&lt;/a&gt; community on Zenodo. Anyone&amp;rsquo;s welcome to contribute. There&amp;rsquo;s much more to come.&lt;/li&gt;
&lt;/ul&gt;
&lt;img class=&#34;u-photo&#34; src=&#34;https://cdn.uploads.micro.blog/8371/2022/98dd084059.png&#34; width=&#34;600&#34; height=&#34;401&#34; alt=&#34;Tag cloud showing the frequency of the two hundred most commonly-used tags in Trove.&#34; /&gt;
&lt;p&gt;&lt;em&gt;Tag cloud generated from the latest harvest of Trove Tags&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;code&#34;&gt;Code&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Big thanks to Mitchell Harrop who contributed a new &lt;a href=&#34;https://glam-workbench.net/heritage-council-of-victoria/&#34;&gt;Heritage Council of Victoria section&lt;/a&gt; to the GLAM Workbench providing examples using the Victorian Heritage Database API.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://pypi.org/project/troveharvester/&#34;&gt;&lt;code&gt;troveharvester&lt;/code&gt; Python package&lt;/a&gt; has been updated. Mainly to remove annoying Pandas warnings and to make use of the &lt;a href=&#34;https://pypi.org/project/trove-query-parser/&#34;&gt;&lt;code&gt;trove-query-parser&lt;/code&gt;&lt;/a&gt; package.&lt;/li&gt;
&lt;li&gt;As a result of the above, the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper &amp;amp; Gazette Harvester section&lt;/a&gt; of the GLAM Workbench has been updated. No major changes to notebooks, but I&amp;rsquo;ve implemented basic testing and linting to improve code quality.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove newspapers section&lt;/a&gt; of the GW has been updated. There were a few bug fixes and minor improvements. In particular there was a problem downloading data and HTML files from QueryPic, and some date queries in QueryPic were returning no results.&lt;/li&gt;
&lt;li&gt;The tool to &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#download-a-page-image&#34;&gt;download complete, high-res newspaper page images&lt;/a&gt; has been updated so that you now no longer need to supply an API key. Also fixed a problem with displaying the images in Voila.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://pypi.org/project/recordsearch-data-scraper/&#34;&gt;&lt;code&gt;recordsearch_data_scraper&lt;/code&gt; Python package&lt;/a&gt; has been updated. This fixes a bug where agency and series searches with only one result weren&amp;rsquo;t being captured properly.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://glam-workbench.net/recordsearch/&#34;&gt;RecordSearch section&lt;/a&gt; of the GW has been updated. This incorporates the above update, but I took the opportunity to update all packages, and implement basic testing and linting. The &lt;strong&gt;Harvest items from a search in RecordSearch&lt;/strong&gt; notebook has been simplified and reorganised. There are two new notebooks: &lt;strong&gt;Exploring harvested series data, 2022&lt;/strong&gt; – generates some basic statistics from the harvest of series data in 2022 and compares the results to the previous year; &lt;strong&gt;Summary of records digitised in the previous week&lt;/strong&gt; – run this notebook to analyse the most recent dataset of recently digitised files, summarising the results by series.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://updates.timsherratt.org/2022/07/14/calling-all-tasmanian.html&#34;&gt;A new Zotero translator for Libraries Tasmania&lt;/a&gt; has been developed.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/07/13/updated-dataset-harvests.html</link>
      <pubDate>Thu, 14 Jul 2022 00:04:24 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/07/13/updated-dataset-harvests.html</guid>
      <description>&lt;p&gt;Updated dataset! Harvests of Trove list metadata from 2018, 2020, and 2022 are now available on Zenodo: &lt;a href=&#34;https://doi.org/10.5281/zenodo.6827077&#34;&gt;doi.org/10.5281/z&amp;hellip;&lt;/a&gt; Another addition to the growing collection of historical Trove data. #GLAMWorkbench&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/9d215dd3f6.png&#34; width=&#34;600&#34; height=&#34;547&#34; alt=&#34;Screen capture of version information from Zenodo showing that there are three available versions, v1.0, v1.1, and v1.2.&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/07/09/coz-i-love.html</link>
      <pubDate>Sat, 09 Jul 2022 18:34:26 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/07/09/coz-i-love.html</guid>
      <description>&lt;p&gt;Coz I love making work for myself, I&amp;rsquo;ve started pulling datasets out of #GLAMWorkbench code repos &amp;amp; creating new data repos for them. This way they&amp;rsquo;ll have their own version histories in Zenodo. Here&amp;rsquo;s the first: &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers-data-post-54/&#34;&gt;github.com/GLAM-Work&amp;hellip;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/06/28/ahead-of-my.html</link>
      <pubDate>Tue, 28 Jun 2022 09:48:32 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/06/28/ahead-of-my.html</guid>
      <description>&lt;p&gt;Ahead of my session at #OzHA2022 tomorrow, I&amp;rsquo;ve updated the NAA section of the  #GLAMWorkbench. Come along to find out how to harvest file details, digitsed images, and PDFs, from a search in RecordSearch! &lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/releases/tag/v1.1.0&#34;&gt;github.com/GLAM-Work&amp;hellip;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/06/26/noticed-that-querypic.html</link>
      <pubDate>Sun, 26 Jun 2022 12:48:52 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/06/26/noticed-that-querypic.html</guid>
      <description>&lt;p&gt;Noticed that QueryPic was having a problem with some date queries. Should be fixed in the latest release of the Trove Newspapers section of the #GLAMWorkbench: &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;glam-workbench.net/trove-new&amp;hellip;&lt;/a&gt; #maintenance #researchinfrastructure&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/06/24/the-trove-newspapers.html</link>
      <pubDate>Fri, 24 Jun 2022 17:38:04 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/06/24/the-trove-newspapers.html</guid>
      <description>&lt;p&gt;The Trove Newspapers section of the #GLAMWorkbench has been updated! Voilá was causing a problem in QueryPic, stopping results from being downloaded. A package update did the trick! Everything now updated &amp;amp; tested. &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;glam-workbench.net/trove-new&amp;hellip;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/06/24/some-more-glamworkbench.html</link>
      <pubDate>Fri, 24 Jun 2022 15:40:33 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/06/24/some-more-glamworkbench.html</guid>
      <description>&lt;p&gt;Some more #GLAMWorkbench maintenance – this app to download a high-res page images from Trove newspapers now doesn&amp;rsquo;t require an API key if you have a url, &amp;amp; some display problems have been fixed. &lt;a href=&#34;https://trove-newspaper-apps.herokuapp.com/voila/render/Save-page-image.ipynb&#34;&gt;trove-newspaper-apps.herokuapp.com/voila/ren&amp;hellip;&lt;/a&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/d174ff5c1a.png&#34; width=&#34;600&#34; height=&#34;238&#34; alt=&#34;Screen shot of app --  Download a page image  The Trove web interface doesn&#39;t provide a way of getting high-resolution page images from newspapers. This simple app lets you download page images as complete, high-resolution JPG files.&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/06/23/the-trove-newspaper.html</link>
      <pubDate>Thu, 23 Jun 2022 17:11:35 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/06/23/the-trove-newspaper.html</guid>
      <description>&lt;p&gt;The Trove Newspaper and Gazette Harvester section of the #GLAMWorkbench has been updated! No major changes to notebooks, just lots of background maintenance stuff such as updating packages, testing, linting notebooks etc. &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;glam-workbench.net/trove-har&amp;hellip;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/06/01/ordering-some-glamworkbench.html</link>
      <pubDate>Wed, 01 Jun 2022 17:43:27 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/06/01/ordering-some-glamworkbench.html</guid>
      <description>&lt;p&gt;Ordering some #GLAMWorkbench stickers&amp;hellip;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/da543e0090.png&#34; width=&#34;324&#34; height=&#34;340&#34; alt=&#34;Proof image of a hexagonal sticker. The sticker has white lettering on a blue background which reads GLAM Workbench. In the centre is a crossed hammer and wrench icon.&#34; /&gt;
</description>
    </item>
    
    <item>
      <title>Using Datasette on Nectar</title>
      <link>https://updates.timsherratt.org/2022/05/26/using-datasette-on.html</link>
      <pubDate>Thu, 26 May 2022 16:24:49 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/05/26/using-datasette-on.html</guid>
      <description>&lt;p&gt;If you have a dataset that you want to share as a searchable online database then check out &lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; – it’s a fabulous tool that provides an ever-growing range of options for exploring and publishing data. I particularly like how easy Datasette makes it to publish datasets on cloud services like Google’s Cloudrun and Heroku. A couple of weekends ago I migrated the &lt;a href=&#34;https://resources.chineseaustralia.org/tung_wah_newspaper_index&#34;&gt;TungWah Newspaper Index&lt;/a&gt; to Datasette. It’s now running on Heroku, and I can push updates to it in seconds.&lt;/p&gt;
&lt;p&gt;I’m also using Datasette as the platform for sharing data from the &lt;a href=&#34;https://glam-workbench.net/anu-archives/&#34;&gt;Sydney Stock Exchange Project&lt;/a&gt; that I’m working on with the ANU Archives. There’s a lot of data – more than 20 million rows – but getting it running on Google Cloudrun was pretty straightforward with Datasette’s &lt;code&gt;publish&lt;/code&gt; command. The problem was, however, that Datasette is configured to run on most cloud services in ‘immutable’ mode and we want authenticated users to be able to improve the data. So I needed to explore alternatives.&lt;/p&gt;
&lt;p&gt;I’ve been working with &lt;a href=&#34;https://ardc.edu.au/services/nectar-research-cloud/how-to-access-the-ardc-nectar-research-cloud/&#34;&gt;Nectar&lt;/a&gt; over the past year to develop &lt;a href=&#34;https://glam-workbench.net/using-nectar/&#34;&gt;a GLAM Workbench application&lt;/a&gt; that helps researchers do things like harvesting newspaper articles from a Trove search. So I thought I’d have a go at setting up Datasette in a Nectar instance, and it works! Here’s a few notes on what I did…&lt;/p&gt;
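&lt;p&gt;For reference, the &lt;code&gt;systemd&lt;/code&gt; service file mentioned in the notes below follows the pattern in the Datasette documentation. A minimal sketch – the paths, user, and secret values here are illustrative placeholders, not my actual configuration – looks something like:&lt;/p&gt;

```ini
[Unit]
Description=Datasette
After=network.target

[Service]
Type=simple
User=ubuntu
# Secrets for plugins like datasette-github-auth are passed as
# environment variables (placeholder values shown).
Environment=GITHUB_CLIENT_ID=xxx
Environment=GITHUB_CLIENT_SECRET=xxx
# Configuration directory mode: Datasette picks up the database,
# metadata, settings.json, templates etc. from this directory.
WorkingDirectory=/home/ubuntu/datasette-root
ExecStart=/home/ubuntu/.local/bin/datasette serve . --host 127.0.0.1 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```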
&lt;ul&gt;
&lt;li&gt;First, of course, you need to &lt;a href=&#34;https://tutorials.rc.nectar.org.au/allocation-management/01-overview&#34;&gt;get yourself a resource allocation&lt;/a&gt; on Nectar. I’ve also got a persistent volume storage allocation that I’m using for the data.&lt;/li&gt;
&lt;li&gt;From the Nectar Dashboard I made sure that I had an &lt;a href=&#34;https://tutorials.rc.nectar.org.au/keypairs&#34;&gt;SSH keypair configured&lt;/a&gt;, and created a &lt;a href=&#34;https://tutorials.rc.nectar.org.au/sec-groups-101&#34;&gt;security group&lt;/a&gt; to allow access via SSH, HTTP and HTTPS. I also &lt;a href=&#34;https://tutorials.rc.nectar.org.au/volume-storage/01-overview&#34;&gt;set up a new storage volume&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I then &lt;a href=&#34;https://tutorials.rc.nectar.org.au/launching-virtual-machines/01-overview&#34;&gt;created a new Virtual Machine&lt;/a&gt; using the Ubuntu 22.04 image, attaching the keypair, security group, and volume storage. For the stock exchange data I’m currently using the ‘m3.medium’ flavour of virtual machine, which provides 8gb of RAM and 4 VCPUs. This might be overkill, but I went with the bigger machine because of the size of the SQLite database (around 2gb). This is similar to what I used on Cloudrun after I ran into problems with the memory limit. I think most projects would run perfectly well using one of the ‘small’ flavours. In any case, it’s easy to resize if you run into problems.&lt;/li&gt;
&lt;li&gt;Once the new machine was running I grabbed the IP address. Because I have DNS configured on my Nectar project, I also created a ‘datasette’ subdomain from the DNS dashboard by pointing an ‘A’ (alias) record to the IP address.&lt;/li&gt;
&lt;li&gt;Using the IP address I &lt;a href=&#34;https://tutorials.rc.nectar.org.au/connecting/02-terminal-and-ssh&#34;&gt;logged into the new machine via SSH&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;With all the Nectar config done, it was time to set up Datasette. I mainly just followed the &lt;a href=&#34;https://docs.datasette.io/en/stable/deploying.html&#34;&gt;excellent instructions in the Datasette documentation&lt;/a&gt; for deploying Datasette using &lt;code&gt;systemd&lt;/code&gt;. This involved installing &lt;code&gt;datasette&lt;/code&gt; via &lt;code&gt;pip&lt;/code&gt;, creating a folder for the Datasette data and configuration files, and creating a &lt;code&gt;datasette.service&lt;/code&gt; file for &lt;code&gt;systemd&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;I also used the &lt;code&gt;datasette install&lt;/code&gt; command to add a couple of Datasette plugins. One of these is the &lt;code&gt;datasette-github-auth&lt;/code&gt; plugin, which needs a couple of secret tokens set. I added these as environment variables in the &lt;code&gt;datasette.service&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;systemd&lt;/code&gt; setup uses Datasette’s &lt;a href=&#34;https://docs.datasette.io/en/stable/settings.html#configuration-directory-mode&#34;&gt;configuration directory mode&lt;/a&gt;. This means you can put your database, metadata definitions, custom templates and CSS, and any other settings all together in a single directory and Datasette will find and use them. I’d previously passed runtime settings via the command line, so I had to create a &lt;code&gt;settings.json&lt;/code&gt; for these.&lt;/li&gt;
&lt;li&gt;Then I just uploaded all my Datasette database and configuration files to the folder I created on the virtual machine using &lt;code&gt;rsync&lt;/code&gt; and started the Datasette service. It worked!&lt;/li&gt;
&lt;li&gt;The next step was to use the persistent volume storage for my Datasette files. The persistent storage exists independently of the virtual machine, so you don’t need to worry about losing data if there’s a change to the instance. I mounted the storage volume as &lt;code&gt;/pvol&lt;/code&gt; in the virtual machine &lt;a href=&#34;https://tutorials.rc.nectar.org.au/volume-storage/04-format-mount&#34;&gt;as the Nectar documentation describes&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;I created a &lt;code&gt;datasette-root&lt;/code&gt; folder under &lt;code&gt;/pvol&lt;/code&gt;, copied the Datasette files to it, and changed the &lt;code&gt;datasette.service&lt;/code&gt; file to point to it. This didn’t seem to work, and I’m not sure why. So instead I created a symbolic link between &lt;code&gt;/home/ubuntu/datasette-root&lt;/code&gt; and &lt;code&gt;/pvol/datasette-root&lt;/code&gt; and set the path in the service file back to &lt;code&gt;/home/ubuntu/datasette-root&lt;/code&gt;. This worked! So now the database and configuration files are sitting in the persistent storage volume.&lt;/li&gt;
&lt;li&gt;To make the new Datasette instance visible to the outside world, I &lt;a href=&#34;https://www.digitalocean.com/community/tutorials/how-to-install-nginx-on-ubuntu-20-04&#34;&gt;installed nginx&lt;/a&gt;, and configured it as a Datasette proxy using the example in the Datasette documentation.&lt;/li&gt;
&lt;li&gt;Finally I configured HTTPS &lt;a href=&#34;https://certbot.eff.org/&#34;&gt;using certbot&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
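&lt;p&gt;To give a rough idea of what the &lt;code&gt;systemd&lt;/code&gt; setup looks like, here’s a minimal sketch of a &lt;code&gt;datasette.service&lt;/code&gt; file. The paths, port, and environment variable names are illustrative only (my actual plugin secrets obviously aren’t shown, and the variable names your plugin expects may differ) – the Datasette deployment documentation has the canonical version.&lt;/p&gt;

```ini
# /etc/systemd/system/datasette.service (illustrative values)
[Unit]
Description=Datasette
After=network.target

[Service]
Type=simple
User=ubuntu
# Plugin secrets (e.g. for GitHub auth) passed as environment variables
Environment=GITHUB_CLIENT_ID=xxx
Environment=GITHUB_CLIENT_SECRET=xxx
# Configuration directory mode: database, metadata, templates,
# and settings.json all live in this one folder
WorkingDirectory=/home/ubuntu/datasette-root
ExecStart=datasette serve . -h 127.0.0.1 -p 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

&lt;p&gt;With Datasette bound to localhost like this, nginx handles the outside world as a reverse proxy.&lt;/p&gt;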
&lt;p&gt;Although the steps above might seem complicated, it was mainly just a matter of copying and pasting commands from the existing documentation. The new Datasette instance is &lt;a href=&#34;https://datasette.glamworkbench.cloud.edu.au&#34;&gt;running here&lt;/a&gt;, but this is just for testing and will disappear soon. If you’d like to know more about the Stock Exchange Project, check out the &lt;a href=&#34;https://glam-workbench.net/anu-archives/&#34;&gt;ANU Archives&lt;/a&gt; section of the GLAM Workbench.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Convert your Trove newspaper searches to an API query with just one click!</title>
      <link>https://updates.timsherratt.org/2022/05/20/convert-your-trove.html</link>
      <pubDate>Fri, 20 May 2022 16:43:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/05/20/convert-your-trove.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;m thinking about the Trove Researcher Platform discussions &amp;amp; ways of integrating Trove with other apps and platforms (like the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;As a simple demo I modified my &lt;a href=&#34;https://trove-proxy.herokuapp.com/&#34;&gt;Trove Proxy app&lt;/a&gt; to convert a newspaper search url from the Trove web interface into an API query (using the &lt;a href=&#34;https://pypi.org/project/trove-query-parser/&#34;&gt;trove-query-parser&lt;/a&gt; package). The proxy app then redirects you to the &lt;a href=&#34;https://troveconsole.herokuapp.com/&#34;&gt;Trove API Console&lt;/a&gt; so you can see the results of the API query without needing a key.&lt;/p&gt;
&lt;p&gt;To make it easy to use, I created a bookmarklet that encodes your current url and feeds it to the proxy. To use it just:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Drag this link to your bookmarks toolbar: &lt;a href=&#39;javascript:(function(){let o=window.location.href,e=&#34;https://trove-proxy.herokuapp.com/parse/?url=&#34;+encodeURIComponent(o);window.location.href=e})();&#39;&gt;Open Trove API Console&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Run a search in Trove&amp;rsquo;s newspapers.&lt;/li&gt;
&lt;li&gt;Click on the bookmarklet.&lt;/li&gt;
&lt;/ul&gt;
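&lt;p&gt;If you’re curious what the bookmarklet is doing, here’s the same logic unminified (the helper function name is just for readability, it’s not part of the bookmarklet itself):&lt;/p&gt;

```javascript
// Build the proxy URL that converts the current Trove search into an API query.
function makeProxyUrl(searchUrl) {
  return "https://trove-proxy.herokuapp.com/parse/?url=" + encodeURIComponent(searchUrl);
}

// The bookmarklet simply redirects the browser to the proxy:
// window.location.href = makeProxyUrl(window.location.href);
```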
&lt;p&gt;This little hack provides a bit of &amp;lsquo;glue&amp;rsquo; to help researchers think about their search results as data, and explore other possibilities for download and analysis. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>My Trove researcher platform wishlist</title>
      <link>https://updates.timsherratt.org/2022/05/11/my-trove-researcher.html</link>
      <pubDate>Wed, 11 May 2022 14:26:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/05/11/my-trove-researcher.html</guid>
      <description>&lt;p&gt;The ARDC is collecting user requirements for the &lt;a href=&#34;https://ardc.edu.au/collaborations/strategic-activities/hass-and-indigenous-research-data-commons/project-plans/trove-researcher-platform-for-advanced-research-project-plan/&#34;&gt;Trove researcher platform for advanced research&lt;/a&gt;. This is a chance to start from scratch, and think about the types of data, tools, or interface enhancements that would support innovative research in the humanities and social sciences. The ARDC will be holding &lt;a href=&#34;https://ardc.edu.au/events/trove-researcher-platform-roundtables/&#34;&gt;two public roundtables&lt;/a&gt;, on 13 and 20 May, to gather ideas. I created a list of possible API improvements in my response to last year&amp;rsquo;s draft plan, and thought it might be useful to expand that a bit, and add in a few other annoyances, possibilities, and long-held dreams.&lt;/p&gt;
&lt;p&gt;My focus is again on the data; this is for two reasons. First because public access to consistent, good quality data makes all other things possible. But, of course, it&amp;rsquo;s never just a matter of OPENING ALL THE DATA. There will be questions about priorities, about formats, about delivery, about normalisation and enrichment. Many of these questions will arise as people try to make use of the data. There needs to be an ongoing conversation between data providers, research tool makers, and research tool users. This is the second reason I think the data is critically important – our focus should be on developing communities and skills, not products. A series of one-off tools for researchers might be useful, but the benefits will wane. Building tools through networks of collaboration and information sharing based on good quality data offers much more. Researchers should be participants in these processes, not consumers.&lt;/p&gt;
&lt;p&gt;Anyway, here&amp;rsquo;s my current wishlist&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;apis-and-data&#34;&gt;APIs and data&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Bring the web interface and main public API back into sync, so that researchers can easily transfer queries between the two.&lt;/strong&gt; The Trove interface update in 2020 reorganised resources into &amp;lsquo;categories&amp;rsquo;, replacing the original &amp;lsquo;zones&amp;rsquo;. The API, however, is still organised by zone and knows nothing about these new categories. Why does this matter? The web interface allows researchers to explore the collection and develop research questions. Some of these questions might be answered by downloading data from the API for analysis or visualisation. But, except for the newspapers, there is currently no one-to-one correspondence between searches in the web interface and searches using the API. There&amp;rsquo;s no way of transferring your questions – you need to start again.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Expand the metadata available for digitised resources other than newspapers.&lt;/strong&gt; In recent years, the NLA has digitised huge quantities of books, journals, images, manuscripts, and maps. The digitisation process has generated new metadata describing these resources, but most of this is not available through the public API. We can get an idea of what&amp;rsquo;s missing by comparing the digitised journals to the newspapers. The API includes &lt;a href=&#34;https://troveconsole.herokuapp.com/#get-a-list-of-all-newspaper-titles&#34;&gt;a &lt;code&gt;newspaper&lt;/code&gt; endpoint&lt;/a&gt; that provides data on all the newspapers in Trove. You can use it to get a list of available issues for any newspaper. There is no comparable way of retrieving a list of digitised journals, or the issues that have been digitised. The data’s somewhere – there&amp;rsquo;s an internal API that’s used to generate lists of issues in the browse interface and I&amp;rsquo;ve &lt;a href=&#34;https://glam-workbench.net/trove-journals/get-ocrd-text-from-digitised-journal/&#34;&gt;scraped this to harvest issue details&lt;/a&gt;. But this information should be in the public API. Manuscripts are described using finding aids, themselves generated from EAD-formatted XML files, but none of this important structured data is available from the API, or for download. There’s also &lt;a href=&#34;https://glam-workbench.net/trove-books/metadata-for-digital-works/&#34;&gt;other resource metadata&lt;/a&gt;, such as parent/child relationships between different levels in the object hierarchy (eg publication &amp;gt; pages). These are embedded in web pages but not exposed in the API. The main point is that when it comes to data-driven research, &lt;strong&gt;digitised books, journals, manuscripts, images, and maps are second-class citizens&lt;/strong&gt;, trailing far behind the newspapers in research possibilities. There needs to be a thorough stocktake of available metadata, and a plan to make this available in machine-actionable form.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Standardise the delivery of text, images, and PDFs and provide download links through the API.&lt;/strong&gt; As noted above, digitised resources are treated differently depending on where they sit in Trove. There are no standard mechanisms for downloading the products of digitisation, such as OCRd text and images. OCRd text is available directly though the API for newspaper and journal articles, but to download text from a book or journal issue you need to hack the download mechanisms from the web interface. Links to these should be included in the API. Similarly, machine access to images requires various hacks and workarounds. There should be a consistent approach that allows researchers to compile image datasets from digitised resources using the API. Ideally IIIF standard APIs should be used for the delivery of images and maps. This would enable the use of the growing ecosystem of IIIF compliant tools for integration, analysis, and annotation.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provide an option to exclude search results in tags and comments.&lt;/strong&gt; The Trove advanced search used to give you the option of excluding search results which only matched tags or comments, rather than the content of the resource. Back when I was working at Trove, the IT folks said this feature would be added to a future version of the API, but instead it disappeared from the web interface with the 2020 update! Why is this important? If you&amp;rsquo;re trying to analyse the occurrence of search terms within a collection, such as Trove&amp;rsquo;s digitised newspapers, you want to be sure that the result reflects the actual content, and not a recent annotation by Trove users.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Finally add the People &amp;amp; Organisations data to the main API.&lt;/strong&gt; Trove&amp;rsquo;s People &amp;amp; Organisations section was ahead of the game in providing machine-readable access, but the original API is out-of-date and uses a completely different query language. Some work was done on adding it to the main RESTful API, but it was never finished. With a bit of long-overdue attention, the People &amp;amp; Organisations data could power new ways of using and linking biographical resources.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Improve the web archives CDX API.&lt;/strong&gt; Although the NLA does little to inform researchers of the possibilities, the web archives software it uses (Pywb) includes some baked-in options for retrieving machine-readable data. This includes &lt;a href=&#34;https://glam-workbench.net/web-archives/timegates-timemaps-mementos/&#34;&gt;support for the Memento protocol&lt;/a&gt;, and the provision of a CDX API that delivers basic metadata about individual web page captures. The current CDX API has some limitations (&lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/web-archives/blob/master/comparing_cdx_apis.ipynb&#34;&gt;documented here&lt;/a&gt;). In particular, there&amp;rsquo;s no pagination of results, and no support for domain-level queries. Addressing these limitations would make the existing CDX API much more useful.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provide new data sources for web archives analysis.&lt;/strong&gt; There needs to be a constructive, ongoing discussion about the types of data that could be extracted and shared from the Australian web archive. For example, a search API, or downloadable datasets of word frequencies. The scale is a challenge, but some pilot studies could help us all understand both the limits and the possibilities.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provide a Write API for annotations.&lt;/strong&gt; Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources. Indeed, this would create exciting possibilities for embedding Trove resources within systems of scholarly analysis, allowing insights gained through research to be automatically fed back into Trove to enhance discovery and understanding.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provide historical statistics on Trove resources.&lt;/strong&gt; It&amp;rsquo;s important for researchers to understand how Trove itself changes over time. There used to be a page that provided regularly-updated statistics on the number of resources and user annotations, but this was removed by the interface upgrade in 2020. I&amp;rsquo;ve &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals&#34;&gt;started harvesting some basic stats&lt;/a&gt; relating to Trove newspapers, but access to general statistics should be reinstated.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reassess key authentication and account limits.&lt;/strong&gt; DigitalNZ recently &lt;a href=&#34;https://digitalnz.org/blog/posts/accessing-the-digitalnz-api&#34;&gt;changed their policy around API authentication&lt;/a&gt;, allowing public access without a key. Authentication requirements hinder exploration and limit opportunities for using the API in teaching and workshops. Similarly, I don&amp;rsquo;t think the account usage limits have been changed since the API was released, even though the capacity of the systems has increased. It seems like time that both of these were reassessed.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
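&lt;p&gt;To make the journals/newspapers asymmetry concrete, here’s a minimal sketch of building a request against the v2 newspaper titles endpoint. The URL shape follows the public v2 API; you’d need your own key, and the point is precisely that there’s no equivalent endpoint name you could substitute for digitised journals:&lt;/p&gt;

```python
# Build a request URL for a Trove v2 'titles' endpoint.
# 'newspaper' (and 'gazette') exist today; a journals equivalent doesn't.

def trove_titles_url(endpoint: str, key: str) -> str:
    """Return the v2 API URL listing all titles for the given endpoint."""
    return f"https://api.trove.nla.gov.au/v2/{endpoint}/titles?encoding=json&key={key}"

# Works now, returns every newspaper title in Trove:
newspapers_url = trove_titles_url("newspaper", "YOUR_API_KEY")

# What researchers would like to be able to do (hypothetical endpoint):
# journals_url = trove_titles_url("journal", "YOUR_API_KEY")
```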
&lt;p&gt;Ok, I&amp;rsquo;ll admit, that&amp;rsquo;s a pretty long list, and not everything can be done immediately! I think this would be a good opportunity for the NLA to develop and share an API and Data Roadmap that is regularly updated and invites comments and suggestions. This would help researchers plan for future projects, and build a case for further government investment.&lt;/p&gt;
&lt;h2 id=&#34;integration&#34;&gt;Integration&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unbreak Zotero integration.&lt;/strong&gt; The 2020 interface upgrade broke the existing Zotero integration and there&amp;rsquo;s no straightforward way of fixing it without changes at the Trove end. Zotero used to be able to capture search results, metadata and images from most of the zones in Trove. Now it can only capture individual newspaper articles. This greatly limits the ability of researchers to assemble and manage their own research collections. More generally, a program to examine and support &lt;a href=&#34;https://updates.timsherratt.org/2022/01/29/zotero-support-in.html&#34;&gt;Zotero integration across the GLAM sector&lt;/a&gt; would be a useful way of spending some research infrastructure dollars.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Provide useful page metadata.&lt;/strong&gt; Zotero is just one example of a tool that can extract structured metadata from web pages. Such metadata supports reuse and integration, without the need for separate API requests. Only Trove&amp;rsquo;s newspaper articles currently provide embedded metadata. Libraries used to lead the way in promoting the use of standardised, structured, embedded page metadata (Dublin Core anyone?), but now?&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Explore annotation frameworks.&lt;/strong&gt; I&amp;rsquo;ve mentioned the possibility of a Write API for annotations above, but there are other possibilities for supporting web scale annotations, such as &lt;a href=&#34;https://web.hypothes.is/&#34;&gt;Hypothesis&lt;/a&gt;. Again, the current Trove interface makes the use of Hypothesis difficult, and again this sort of integration would be usefully assessed across the whole GLAM sector.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;tools--interfaces&#34;&gt;Tools &amp;amp; interfaces&lt;/h2&gt;
&lt;p&gt;Obviously any discussion of new tools or interfaces needs to start by looking at what&amp;rsquo;s already available. This is difficult when the NLA won&amp;rsquo;t even link to existing resources such as the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;. Sharing information about existing tools needs to be the starting point from which to plan investment in the Trove Researcher Platform. From there we can identify gaps and develop processes and collaborations to meet specific research needs. Here&amp;rsquo;s a list of some &lt;a href=&#34;https://updates.timsherratt.org/2022/05/02/working-with-trove.html&#34;&gt;Trove-related tools and resources&lt;/a&gt; currently available through the GLAM Workbench.&lt;/p&gt;
&lt;h2 id=&#34;update-18-may-some-extra-bonus-bugs&#34;&gt;Update (18 May): some extra bonus bugs&lt;/h2&gt;
&lt;p&gt;I forgot to add these annoying bugs:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;newspaper&lt;/code&gt; endpoint returns both newspaper and gazette titles, even though there&amp;rsquo;s a separate &lt;code&gt;gazette&lt;/code&gt; endpoint. This forces you to do silly workarounds &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#get-a-list-of-trove-newspapers-that-doesnt-include-government-gazettes&#34;&gt;like this in the GLAM Workbench&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;list&lt;/code&gt; zone has recurring problems. At the moment it&amp;rsquo;s impossible to &lt;a href=&#34;https://glam-workbench.net/trove-lists/#harvest-summary-data-from-trove-lists&#34;&gt;harvest a complete set of Trove lists&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
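&lt;p&gt;The gazette workaround boils down to fetching both lists and subtracting. A minimal sketch (the record structure is an assumption based on the v2 API’s JSON responses, so check against a live call):&lt;/p&gt;

```python
# Workaround sketch: the 'newspaper' endpoint also returns gazette titles,
# so to get newspapers only you have to exclude anything that also appears
# in the 'gazette' endpoint's results.

def exclude_gazettes(newspapers: list[dict], gazettes: list[dict]) -> list[dict]:
    """Return only the titles that aren't gazettes, matching on 'id'."""
    gazette_ids = {g["id"] for g in gazettes}
    return [n for n in newspapers if n["id"] not in gazette_ids]
```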
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/05/10/spending-the-evening.html</link>
      <pubDate>Tue, 10 May 2022 22:26:35 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/05/10/spending-the-evening.html</guid>
      <description>&lt;p&gt;Spending the evening updating the NAA section of the #GLAMWorkbench. Here&amp;rsquo;s a fresh harvest of the agency functions currently being used in RecordSearch&amp;hellip; &lt;a href=&#34;https://gist.github.com/wragge/d1612daa4c87c4a0f1eeb27c6adeb4ab&#34;&gt;gist.github.com/wragge/d1&amp;hellip;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Working with Trove data – a collection of tools and resources</title>
      <link>https://updates.timsherratt.org/2022/05/02/working-with-trove.html</link>
      <pubDate>Mon, 02 May 2022 15:23:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/05/02/working-with-trove.html</guid>
      <description>&lt;p&gt;The ARDC is organising &lt;a href=&#34;https://ardc.edu.au/events/trove-researcher-platform-roundtables/&#34;&gt;a couple of public forums&lt;/a&gt; to help gather researcher requirements for the &lt;a href=&#34;https://ardc.edu.au/collaborations/strategic-activities/hass-and-indigenous-research-data-commons/project-plans/trove-researcher-platform-for-advanced-research-project-plan/&#34;&gt;Trove component of the HASS RDC&lt;/a&gt;. One of the roundtables will look at &amp;lsquo;Existing tools that utilise Trove data and APIs&amp;rsquo;. Last year I wrote a summary of what &lt;a href=&#34;https://updates.timsherratt.org/2021/08/26/glam-workbench-a.html&#34;&gt;the GLAM Workbench can contribute to the development of humanities research infrastructure&lt;/a&gt;, particularly in regard to Trove. I thought it might be useful to update that list to include recent additions to the GLAM Workbench, as well as a range of other datasets, software, tools, and interfaces that exist outside of the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;Since last year&amp;rsquo;s post I&amp;rsquo;ve also been working hard to integrate the GLAM Workbench with other eResearch services such as &lt;a href=&#34;https://glam-workbench.net/using-nectar/&#34;&gt;Nectar&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/glam-tools/&#34;&gt;CloudStor&lt;/a&gt;, and to document and support the ways that &lt;a href=&#34;https://updates.timsherratt.org/2022/03/02/the-glam-workbench.html&#34;&gt;individuals and institutions can contribute&lt;/a&gt; code and documentation.&lt;/p&gt;
&lt;h2 id=&#34;getting-and-moving-data&#34;&gt;Getting and moving data&lt;/h2&gt;
&lt;p&gt;There’s lots of fabulous data in Trove and other GLAM collections. In fact, there’s so much data that it can be difficult for researchers to find and collect what’s relevant to their interests. There are many tools in the GLAM Workbench to help researchers assemble their own datasets. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Get newspaper articles in bulk with the Trove Newspaper and Gazette Harvester&lt;/a&gt;&lt;/strong&gt; – This has been around in some form for more than ten years (it pre-dates the Trove API!). Give it the url of a search in Trove’s newspapers and gazettes and the harvester will save all the metadata in a CSV file, and optionally download the complete articles as OCRd text, images, or PDFs. The amount of data you harvest is really only limited by your patience and disk space. I’ve harvested more than a million articles in the past. The GLAM Workbench includes a web app version of the harvester that runs live in the cloud – just paste in your Trove API key and the search url, and click the button.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#harvest-information-about-newspaper-issues&#34;&gt;&lt;strong&gt;Harvest information about newspaper issues&lt;/strong&gt;&lt;/a&gt; – When you search Trove&amp;rsquo;s newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? On what dates were newspapers published? This notebook shows how you can get information about issues from the Trove API.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Get Trove newspaper pages as images&lt;/strong&gt; – If you need a nice, high-resolution version of a newspaper page you can &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#download-a-page-image&#34;&gt;use this web app&lt;/a&gt;. If you want to harvest every front page (or some other particular page) here’s an example that &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#harvest-australian-womens-weekly-covers-or-the-front-pages-of-any-newspaper&#34;&gt;gets all the covers of the &lt;em&gt;Australian Women’s Weekly&lt;/em&gt;&lt;/a&gt;. A pre-harvested &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#australian-womens-weekly-front-covers-1933-to-1982&#34;&gt;collection of the AWW covers&lt;/a&gt; is included as a bonus extra.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#save-a-trove-newspaper-article-as-an-image&#34;&gt;Get Trove newspaper articles as images&lt;/a&gt;&lt;/strong&gt; – The Trove web interface makes it difficult to download complete images of articles, but this tool will do the job. There’s a handy web app to grab individual images, but the code from this tool is reused in other places such as the Trove Newspaper Harvester and the Omeka uploader, and could be built-in to your own research workflows.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#harvest-the-issues-of-a-newspaper-as-pdfs&#34;&gt;&lt;strong&gt;Harvest the issues of a newspaper as PDFs&lt;/strong&gt;&lt;/a&gt; – This notebook harvests &lt;em&gt;whole issues&lt;/em&gt; of newspapers as PDFs – one PDF per issue.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#upload-trove-newspaper-articles-to-omeka-s&#34;&gt;Upload Trove newspaper articles to Omeka&lt;/a&gt;&lt;/strong&gt; – Whether you’re creating on online exhibition or building a research database, Omeka can be very useful. This notebook connects Trove’s newspapers to Omeka for easy upload. Your selected articles can come from a search query, a Trove list, a Zotero library, or just a list of article ids. Metadata records are created in Omeka for each article and newspaper, and an image of each article is attached. My &lt;a href=&#34;https://wragge.github.io/omeka_s_tools/api.html&#34;&gt;Omeka S Tools software package&lt;/a&gt; also includes an example using Trove newspapers.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/#get-ocrd-text-from-a-digitised-journal-in-trove&#34;&gt;Get OCRd text from digitised periodicals in Trove&lt;/a&gt;&lt;/strong&gt; – They’re often overshadowed by the newspapers, but there’s now lots of digitised journals, magazines, and parliamentary papers in Trove. You can get article-level data from the API, but not issue data. This notebook enables researchers to get metadata and OCRd text from every available issue of a periodical. To make researchers’ lives even easier, I regularly harvest &lt;a href=&#34;https://glam-workbench.net/trove-journals/#ocrd-text-from-trove-digitised-journals&#34;&gt;&lt;strong&gt;all&lt;/strong&gt; the available OCRd text&lt;/a&gt; from digitised periodicals in Trove. The latest harvest downloaded 51,928 issues from 1,163 periodicals – that’s about 10gb of text. You can &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-journals/blob/master/digital-journals-with-text.md&#34;&gt;browse the list of periodicals&lt;/a&gt; with OCRd text, or &lt;a href=&#34;https://trove-digital-periodicals.glitch.me/data/trove-digital-journals&#34;&gt;search this database&lt;/a&gt;. All the OCRd text is stored in a public repository on CloudStor.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/#get-covers-or-any-other-pages-from-a-digitised-journal-in-trove&#34;&gt;Get page images from digitised periodicals in Trove&lt;/a&gt;&lt;/strong&gt; – There’s more than text in digitised periodicals, and you might want to download images of pages for visual analysis. This notebook shows you how to get cover images, but could be easily modified to get another page, or a PDF. I used a modified version of this to create &lt;a href=&#34;https://glam-workbench.net/trove-journals/#editorial-cartoons-from-the-bulletin-1886-to-1952&#34;&gt;a collection of 3,471 full page editorial cartoons&lt;/a&gt; from &lt;em&gt;The Bulletin&lt;/em&gt;, 1886 to 1952 – all available to download from CloudStor.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-books/#harvesting-the-text-of-digitised-books-and-ephemera&#34;&gt;Get OCRd text from digitised books in Trove&lt;/a&gt;&lt;/strong&gt; – Yep, there’s digitised books as well as newspapers and periodicals. You can download OCRd text from an individual book using the Trove web interface, but how do you make a collection of books without all that pointing and clicking? This notebook downloads all the available OCRd text from digitised books in Trove. The latest harvest includes &lt;a href=&#34;https://glam-workbench.net/trove-books/#ocrd-text-from-trove-books-and-ephemera&#34;&gt;text from 26,762 works&lt;/a&gt;. You can explore the results &lt;a href=&#34;https://trove-digital-books.glitch.me/data/trove-digital-books&#34;&gt;using this database&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/#harvest-parliament-press-releases-from-trove&#34;&gt;Harvest parliamentary press releases from Trove&lt;/a&gt;&lt;/strong&gt; – Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. This notebook shows you how to harvest both metadata and fulltext from a search of the parliamentary press releases. For example, here’s a collection of &lt;a href=&#34;https://glam-workbench.net/trove-journals/#politicians-talking-about-immigrants-and-refugees&#34;&gt;politicians talking about ‘refugees’&lt;/a&gt;, and another &lt;a href=&#34;https://glam-workbench.net/trove-journals/#politicians-talking-about-covid&#34;&gt;relating to COVID-19&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/#harvest-abc-radio-national-records-from-trove&#34;&gt;Harvest details of Radio National programs from Trove&lt;/a&gt;&lt;/strong&gt; – Trove creates records for programs broadcast on ABC Radio National; for the major current affairs programs, these records are at segment level. Even though they don’t provide full transcripts, this data provides a rich, fine-grained record of Australia’s recent political, social, and economic history. This notebook shows you how to download the Radio National data. If you just want to dive straight in, there’s also a &lt;a href=&#34;https://glam-workbench.net/trove-music/#abc-radio-national-programs&#34;&gt;pre-harvested collection&lt;/a&gt; containing more than 400,000 records, with separate downloads for some of the main programs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#find-all-the-archived-versions-of-a-web-page&#34;&gt;Find all the versions of an archived web page in Trove&lt;/a&gt;&lt;/strong&gt; – Many of the tools in the &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web Archives section&lt;/a&gt; of the GLAM Workbench will work with the Australian Web Archive, which is part of Trove. This notebook shows you how to get data about the number of times a web page has been archived over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#harvesting-collections-of-text-from-archived-web-pages&#34;&gt;Harvesting collections of text from archived web pages in Trove&lt;/a&gt;&lt;/strong&gt; – If you want to explore how the content of a web page changes over time, you can use this notebook to capture the text content of every archived version of a web page.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-lists/#convert-a-trove-list-into-a-csv-file&#34;&gt;Convert a Trove list into a CSV file&lt;/a&gt;&lt;/strong&gt; – While Trove provides a data download option for lists, it leaves out a lot of useful data. This notebook downloads full details of newspaper articles and other works in a list and saves them as CSV files. Like the Trove Newspaper Harvester, it lets you download OCRd text and images from newspaper articles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collecting information about Trove user activity&lt;/strong&gt; – It’s not just the content of Trove that provides interesting research data; it’s also the way people engage with it. Using the Trove API it’s possible to harvest details of &lt;a href=&#34;https://glam-workbench.net/trove-lists/&#34;&gt;all user-created lists and tags&lt;/a&gt;. And yes, there are pre-harvested collections of &lt;a href=&#34;https://glam-workbench.net/trove-lists/#trove-lists-metadata&#34;&gt;lists&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/trove-lists/#trove-public-tags&#34;&gt;tags&lt;/a&gt; for the impatient.&lt;/li&gt;
&lt;/ul&gt;
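&lt;p&gt;For the web archive notebooks, the underlying data source is a Memento TimeMap: a machine-readable list of every capture of a URL. Here&amp;rsquo;s a minimal sketch of the kind of parsing involved; the sample TimeMap below is illustrative, not a real response.&lt;/p&gt;

```python
import re
from datetime import datetime

def parse_timemap(link_text):
    """Return (datetime, url) pairs for each memento in a link-format TimeMap."""
    pattern = r'<([^>]+)>;\s*rel="memento";\s*datetime="([^"]+)"'
    return [
        (datetime.strptime(dt, "%a, %d %b %Y %H:%M:%S %Z"), url)
        for url, dt in re.findall(pattern, link_text)
    ]

# An illustrative sample of a TimeMap (the URL patterns are made up)
sample = (
    '<http://example.org/>; rel="original",'
    '<https://web.archive.org.au/awa/19990101000000/http://example.org/>;'
    ' rel="memento"; datetime="Fri, 01 Jan 1999 00:00:00 GMT",'
    '<https://web.archive.org.au/awa/20050601120000/http://example.org/>;'
    ' rel="memento"; datetime="Wed, 01 Jun 2005 12:00:00 GMT"'
)

versions = parse_timemap(sample)
print(len(versions))  # 2 captures
```

&lt;p&gt;Counting the parsed versions over time is exactly the sort of data the &amp;lsquo;find all the archived versions&amp;rsquo; notebook charts.&lt;/p&gt;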
&lt;p&gt;While I’m focusing here on Trove, there are also tools to create datasets from the &lt;a href=&#34;https://glam-workbench.net/recordsearch/&#34;&gt;National Archives of Australia&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/digitalnz/&#34;&gt;Digital NZ and Papers Past&lt;/a&gt;, the &lt;a href=&#34;https://glam-workbench.net/nma/&#34;&gt;National Museum of Australia&lt;/a&gt; and more. And there’s a &lt;a href=&#34;https://glam-workbench.net/glam-data-list/&#34;&gt;big list of readily downloadable datasets&lt;/a&gt; from Australian GLAM organisations.&lt;/p&gt;
&lt;h2 id=&#34;visualisation-and-analysis&#34;&gt;Visualisation and analysis&lt;/h2&gt;
&lt;p&gt;Many of the notebooks listed above include examples that demonstrate ways of exploring and analysing your harvested data. There are also a number of companion notebooks that examine some possibilities in more detail, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/#exploring-your-troveharvester-data&#34;&gt;Explore your Trove newspaper harvests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/#display-the-results-of-a-harvest-as-a-searchable-database-using-datasette&#34;&gt;Load your Trove newspaper harvest in Datasette&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/#exploring-abc-radio-national-metadata&#34;&gt;Exploring ABC Radio National metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-lists/#analyse-public-tags-added-to-trove&#34;&gt;Analyse public tags added to Trove&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But there are also many other notebooks that demonstrate methods for analysing Trove’s content, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#querypic&#34;&gt;QueryPic&lt;/a&gt;&lt;/strong&gt; – Another tool that’s been around in different forms for a decade, QueryPic visualises searches in Trove’s newspapers. The latest web app couldn’t be simpler: just paste in your API key and a search URL to create charts showing the number of matching articles over time. You can combine queries, change time scales, and download the data and visualisations. Interested to see how other researchers have used it? Here&amp;rsquo;s a &lt;a href=&#34;https://updates.timsherratt.org/2021/08/30/some-research-projects.html&#34;&gt;Twitter thread with links to some publications&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#visualise-trove-newspaper-searches-over-time&#34;&gt;Visualise Trove newspaper searches over time&lt;/a&gt;&lt;/strong&gt; – This is like a deconstructed version of QueryPic that walks you through the process of using Trove’s facets to assemble a dataset of results over time. It provides a lot of detail on the sorts of data available, and the questions we can ask of it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#visualise-the-total-number-of-newspaper-articles-in-trove-by-year-and-state&#34;&gt;Visualise the total number of newspaper articles in Trove by year and state&lt;/a&gt;&lt;/strong&gt; – This notebook uses a modified version of the code above to analyse the construction and context of Trove’s newspaper corpus itself. What are you actually searching? Meet the WWI effect and the copyright cliff of death! This is a great place to start if you want to get people thinking critically about how digital resources are constructed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/examples/trove_newspaper_issues_per_day.html&#34;&gt;Trove newspapers – number of issues per day, 1803–2020&lt;/a&gt;&lt;/strong&gt; – A visualisation of the number of newspaper issues published &lt;strong&gt;every day&lt;/strong&gt; in Trove.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#analyse-rates-of-ocr-correction&#34;&gt;Analyse rates of OCR correction&lt;/a&gt;&lt;/strong&gt; – Some more meta-analysis of the Trove corpus itself, this time focusing on patterns of OCR correction by Trove users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#finding-non-english-newspapers-in-trove&#34;&gt;Identifying non-English language newspapers in Trove&lt;/a&gt;&lt;/strong&gt; – There are a growing number of non-English language newspapers digitised in Trove. However, if you&amp;rsquo;re only searching using English keywords, you might never know that they&amp;rsquo;re there. This notebook analyses a sample of articles from every newspaper in Trove to identify non-English content.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#beyond-the-copyright-cliff-of-death&#34;&gt;Beyond the copyright cliff of death&lt;/a&gt;&lt;/strong&gt; – Most of the newspaper articles on Trove were published before 1955, but there are some from the later period. This notebook helps you find out how many, and which newspapers they were published in.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#map-trove-newspaper-results-by-state&#34;&gt;Map Trove newspaper results by state&lt;/a&gt;&lt;/strong&gt; – This notebook uses the Trove &lt;code&gt;state&lt;/code&gt; facet to create a choropleth map that visualises the number of search results per state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#map-trove-newspaper-results-by-place-of-publication&#34;&gt;Map Trove newspaper results by place of publication&lt;/a&gt;&lt;/strong&gt; – This notebook uses the Trove &lt;code&gt;title&lt;/code&gt; facet to find the number of results per newspaper, then merges the results with a dataset of geolocated newspapers to map where articles were published.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#compare-two-versions-of-an-archived-web-page&#34;&gt;Compare two versions of an archived web page&lt;/a&gt;&lt;/strong&gt; – This notebook demonstrates a number of different ways of comparing versions of archived web pages. Just choose a repository, enter a url, and select two dates to see comparisons based on: page metadata, basic statistics such as file size and number of words, numbers of internal and external links, cosine similarity of text, line by line differences in text or code, and screenshots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#display-changes-in-the-text-of-an-archived-web-page-over-time&#34;&gt;Display changes in the text of an archived web page over time&lt;/a&gt;&lt;/strong&gt; – This web app gathers all the available versions of a web page and then visualises changes in its content between versions – what’s been added, removed, and changed?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#using-screenshots-to-visualise-change-in-a-page-over-time&#34;&gt;Use screenshots to visualise change in a page over time&lt;/a&gt;&lt;/strong&gt; – Create a series of full page screenshots of a web page over time, then assemble them into a time series.&lt;/li&gt;
&lt;/ul&gt;
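&lt;p&gt;Most of the time-series notebooks above share one core move: ask the API for a &lt;code&gt;year&lt;/code&gt; or &lt;code&gt;decade&lt;/code&gt; facet instead of individual records, then reshape the facet terms into a dataset. Here&amp;rsquo;s a minimal sketch of that reshaping step, using made-up facet terms; in a real response they&amp;rsquo;re nested inside the API&amp;rsquo;s JSON.&lt;/p&gt;

```python
# Reshape facet terms into a sorted (year, count) time series.
# These terms are invented for illustration; in a real Trove response
# they sit inside nested JSON and may arrive as strings.
facet_terms = [
    {"display": "1915", "count": "1850"},
    {"display": "1914", "count": "1200"},
    {"display": "1916", "count": "1700"},
]

# Convert to integers and sort chronologically
series = sorted((int(t["display"]), int(t["count"])) for t in facet_terms)
total = sum(count for _, count in series)

print(series)  # [(1914, 1200), (1915, 1850), (1916, 1700)]
print(total)   # 4750
```

&lt;p&gt;A series like this is what QueryPic and the &amp;lsquo;searches over time&amp;rsquo; notebook chart, one point per year.&lt;/p&gt;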
&lt;p&gt;There are also possibilities for using Trove data creatively. For example you can &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#create-scissors-and-paste-messages-from-trove-newspaper-articles&#34;&gt;create &amp;lsquo;scissors and paste&amp;rsquo; messages from Trove newspaper articles&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;documentation-and-examples&#34;&gt;Documentation and examples&lt;/h2&gt;
&lt;p&gt;All the Trove notebooks in the GLAM Workbench help document the possibilities and limits of the Trove API. The examples above can be modified and reworked to suit different research interests. Some notebooks also explore particular aspects of the API, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove/&#34;&gt;Trove API Introduction&lt;/a&gt;&lt;/strong&gt; – Some very basic examples of making requests and understanding results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#todays-news-yesterday&#34;&gt;Today’s news yesterday&lt;/a&gt;&lt;/strong&gt; – Uses the &lt;code&gt;date&lt;/code&gt; index and the &lt;code&gt;firstpageseq&lt;/code&gt; parameter to find articles from exactly 100 years ago that were published on the front page. It then selects one of the articles at random and downloads and displays an image of the front page.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-images/#the-use-of-standard-licences-and-rights-statements-in-trove-image-records&#34;&gt;The use of standard licences and rights statements in Trove image records&lt;/a&gt;&lt;/strong&gt; – Version 2.1 of the Trove API introduced a new rights index that you can use to limit your search results to records that include one of a list of standard licences and rights statements. We can also use this index to build a picture of which rights statements are currently being used, and by whom.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-random/&#34;&gt;Random items from Trove&lt;/a&gt;&lt;/strong&gt; – Changes to the Trove API meant that techniques you could previously use to select resources at random no longer work. This section documents some alternative ways of retrieving random-ish works and newspaper articles from Trove.&lt;/li&gt;
&lt;/ul&gt;
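&lt;p&gt;All of these notebooks start from the same building block: a parameterised request to the Trove API. Here&amp;rsquo;s a rough sketch of assembling one; the endpoint and parameter names follow the v2 API documentation, and the key is a placeholder for your own.&lt;/p&gt;

```python
from urllib.parse import urlencode

# Endpoint for the Trove API v2 (per the NLA's API documentation)
API_URL = "https://api.trove.nla.gov.au/v2/result"

params = {
    "q": "refugees",       # search query
    "zone": "newspaper",   # search the digitised newspapers zone
    "encoding": "json",    # ask for JSON rather than the default XML
    "facet": "year",       # include per-year counts in the response
    "n": 0,                # no individual records, just the facet data
    "key": "YOUR_API_KEY", # placeholder – use your own key
}

query_url = f"{API_URL}?{urlencode(params)}"
print(query_url)
```

&lt;p&gt;Pasting the resulting URL into a browser returns the JSON response, which is a quick way to check a query before building it into a notebook.&lt;/p&gt;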
&lt;p&gt;And while it’s not officially part of the GLAM Workbench, I also maintain the &lt;a href=&#34;https://troveconsole.herokuapp.com/&#34;&gt;Trove API Console&lt;/a&gt; which provides lots of examples of the API in action.&lt;/p&gt;
&lt;h2 id=&#34;videos&#34;&gt;Videos&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;ve started making videos to help you get started with the GLAM Workbench.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://youtu.be/vdyKNowv9gw&#34;&gt;&lt;strong&gt;GLAM Workbench – use QueryPic to visualise searches in Trove&amp;rsquo;s digitised newspapers&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://youtu.be/J_LgNL2EM4M&#34;&gt;&lt;strong&gt;GLAM Workbench – use QueryPic to visualise searches in Trove&amp;rsquo;s digitised newspapers (part 2)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://youtu.be/WKFuJR6lLF4&#34;&gt;&lt;strong&gt;GLAM Workbench – using the Trove Newspaper &amp;amp; Gazette Harvester (the web app version)&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;datasets&#34;&gt;Datasets&lt;/h2&gt;
&lt;p&gt;A number of pre-harvested datasets are noted above in the &amp;lsquo;Getting and moving data&amp;rsquo; section. Here&amp;rsquo;s a fairly complete list of ready-to-download datasets harvested from Trove.&lt;/p&gt;
&lt;h3 id=&#34;newspapers&#34;&gt;Newspapers&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.6471544&#34;&gt;Trove – newspaper totals (historical datasets harvested between 2011 and 2022)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals&#34;&gt;Trove – current newspaper totals (harvested weekly since April 2022)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#total-number-of-issues-per-year-for-every-newspaper-in-trove&#34;&gt;Trove – Total number of issues per year for every newspaper in Trove&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#complete-list-of-issues-for-every-newspaper-in-trove&#34;&gt;Trove – Complete list of issues for every newspaper in Trove&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#csv-formatted-lists-of-newspaper-titles-in-trove&#34;&gt;Trove – data from web archives showing when digitised newspaper titles were added to Trove&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://docs.google.com/spreadsheets/d/1rURriHBSf3MocI8wsdl1114t0YeyU0BVSXWeg232MZs/edit?usp=sharing&#34;&gt;Trove – geolocated newspaper titles&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#trove-newspapers-with-articles-published-after-1954&#34;&gt;Trove – newspapers with articles published after 1954&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#trove-newspapers-with-non-english-language-content&#34;&gt;Trove – newspapers with non-English language content&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://doi.org/10.6084/m9.figshare.1439432.v1&#34;&gt;Trove – faces extracted from Trove newspaper photographs&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#csv-formatted-list-of-australian-womens-weekly-issues-1933-to-1982&#34;&gt;Trove – Australian Women&amp;rsquo;s Weekly issues, 1933 to 1982&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#australian-womens-weekly-front-covers-1933-to-1982&#34;&gt;Trove – Australian Women&amp;rsquo;s Weekly front covers, 1933 to 1982&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;books-and-periodicals&#34;&gt;Books and periodicals&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-books/#csv-formatted-list-of-books-available-in-digital-form&#34;&gt;Trove – books available in digital form&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.github.io/trove-journals/#csv-formatted-list-of-journals-available-from-trove-in-digital-form&#34;&gt;Trove – periodicals available in digital form&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.github.io/trove-journals/#csv-formatted-list-of-journals-with-ocrd-text&#34;&gt;Trove – periodicals in Trove with OCRd text&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-books/#government-publications-in-digital-form&#34;&gt;Trove – government publications available in digital form&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.github.io/trove-books/#ocrd-text-from-trove-books-and-ephemera&#34;&gt;Trove – OCRd text from digitised books (and ephemera)&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.github.io/trove-journals/#ocrd-text-from-trove-digitised-journals&#34;&gt;Trove – OCRd text of digitised journals&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.github.io/trove-books/#ocrd-text-from-the-internet-archive-of-australian-books-listed-in-trove&#34;&gt;Trove &amp;amp; Internet Archive – OCRd text from the Internet Archive of &amp;lsquo;Australian&amp;rsquo; books listed in Trove&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/#politicians-talking-about-covid&#34;&gt;Trove &amp;amp; Parliamentary Library – parliamentary press releases relating to COVID-19&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.github.io/trove-journals/#politicians-talking-about-immigrants-and-refugees&#34;&gt;Trove &amp;amp; Parliamentary Library – parliamentary press releases relating to immigrants and refugees&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.github.io/trove-journals/#editorial-cartoons-from-the-bulletin-1886-to-1952&#34;&gt;Trove – editorial cartoons from The Bulletin, 1886 to 1952&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;other&#34;&gt;Other&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.github.io/trove-lists/#trove-lists-metadata&#34;&gt;Trove – public lists metadata&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-lists/#trove-tag-counts&#34;&gt;Trove – public tag counts&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.5094314&#34;&gt;Trove – public tags added to resources, 2008 to 2021&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/#abc-radio-national-programs&#34;&gt;Trove – Radio National program data&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.github.io/trove-maps/#csv-formatted-list-of-maps-with-high-resolution-downloads&#34;&gt;Trove – maps with high-resolution downloads&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;See also &lt;a href=&#34;https://glam-workbench.net/glam-data-list/&#34;&gt;Sources of Australian GLAM data&lt;/a&gt; in the GLAM Workbench.&lt;/p&gt;
&lt;h2 id=&#34;software&#34;&gt;Software&lt;/h2&gt;
&lt;p&gt;The GLAM Workbench makes use of a number of Python packages that I&amp;rsquo;ve created to work with Trove data. These are openly licensed and available for installation from PyPI.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://pypi.org/project/troveharvester/&#34;&gt;&lt;strong&gt;Trove Harvester&lt;/strong&gt;&lt;/a&gt; – harvest newspaper and gazette articles&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://pypi.org/project/trove-query-parser/&#34;&gt;Trove Query Parser&lt;/a&gt;&lt;/strong&gt; – convert search parameters from the Trove web interface into a form the API understands&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://pypi.org/project/trove-newspaper-images/&#34;&gt;&lt;strong&gt;Trove Newspaper Images&lt;/strong&gt;&lt;/a&gt; – tools for downloading images from Trove&amp;rsquo;s digitised newspapers and gazettes&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;other-tools-and-interfaces&#34;&gt;Other tools and interfaces&lt;/h2&gt;
&lt;p&gt;Over the years I&amp;rsquo;ve developed many tools and interfaces using Trove data. Some have been replaced by the GLAM Workbench, but others keep chugging along, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://trove-titles.herokuapp.com/&#34;&gt;Explore Trove&amp;rsquo;s Digitised Journals&lt;/a&gt;&lt;/strong&gt; – a tool to help you browse, search, and explore Trove&amp;rsquo;s digitised journals&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://headlineroulette.net/&#34;&gt;Headline Roulette&lt;/a&gt;&lt;/strong&gt; – a simple, customisable game using Trove newspapers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://inaword.herokuapp.com/&#34;&gt;In a word&amp;hellip;&lt;/a&gt;&lt;/strong&gt; – currents in Australian affairs, 2003–2013, using data from ABC Radio National&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://troveconsole.herokuapp.com/&#34;&gt;Trove API Console&lt;/a&gt;&lt;/strong&gt; – use this to learn how to construct API queries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://www.timsherratt.org/shed/trovetraces/traces/index.html&#34;&gt;Trove Traces&lt;/a&gt;&lt;/strong&gt; – archived version of a 2014 experiment to see who was citing Trove newspapers&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://twitter.com/trovenewsbot&#34;&gt;&lt;strong&gt;@TroveNewsBot&lt;/strong&gt;&lt;/a&gt; – unlike most GLAM Twitter bots, @TroveNewsBot doesn&amp;rsquo;t just tweet random stuff: you can use it to search Trove from Twitter. See the &lt;a href=&#34;https://wragge.github.io/trovenewsbot2019/&#34;&gt;full operating instructions&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://troveplaces.herokuapp.com/&#34;&gt;&lt;strong&gt;Trove Places&lt;/strong&gt;&lt;/a&gt; – click on a map to find Trove newspapers by place&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://ozglam.chat/t/capturing-trove-newspaper-articles-with-zotero/21&#34;&gt;&lt;strong&gt;Trove Zotero Translator&lt;/strong&gt;&lt;/a&gt; – lets you capture metadata, OCRd text, and a PDF from an article in Trove newspapers, installed as part of &lt;a href=&#34;https://www.zotero.org/&#34;&gt;Zotero&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://ozglam.chat/t/easy-browsing-of-trove-newspapers-with-these-keyboard-shortcuts/105&#34;&gt;Keyboard shortcuts for Trove newspapers&lt;/a&gt;&lt;/strong&gt; – this userscript activates your arrow keys to help you navigate newspapers by page and issue.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;See also &lt;a href=&#34;https://glam-workbench.net/glam-tools-interfaces/&#34;&gt;More GLAM tools and interfaces&lt;/a&gt; in the GLAM Workbench. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/04/30/and-so-it.html</link>
      <pubDate>Sat, 30 Apr 2022 16:39:07 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/04/30/and-so-it.html</guid>
      <description>&lt;p&gt;And so it starts&amp;hellip; #GLAMWorkbench&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/94f142b02c.png&#34; width=&#34;600&#34; height=&#34;215&#34; alt=&#34;Screenshot of GLAM Workbook welcome page. Text states: &#39;This is a companion to the GLAM Workbench. Here you&#39;ll find documentation, tips, tutorials, and exercises to help you work with digital collections from galleries, libraries, archives, and museums (the GLAM sector).&#39;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/04/28/ok-ive-created.html</link>
      <pubDate>Thu, 28 Apr 2022 23:49:31 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/04/28/ok-ive-created.html</guid>
      <description>&lt;p&gt;Ok, I&amp;rsquo;ve created a new #GLAMWorkbench meta issue to bring together all the things I&amp;rsquo;m trying to do to improve &amp;amp; automate the code &amp;amp; documentation. This should help me keep track of things&amp;hellip; &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench.github.io/issues/53&#34;&gt;github.com/GLAM-Work&amp;hellip;&lt;/a&gt; #DayofDH2022&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2022/04/28/a-couple-of.html</link>
      <pubDate>Thu, 28 Apr 2022 22:33:41 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/04/28/a-couple-of.html</guid>
      <description>&lt;p&gt;A couple of hours of #DayofDH2022 left – feeling a bit uninspired, so I&amp;rsquo;m going to do some pruning &amp;amp; reorganising of the #GLAMWorkbench issues list: &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench.github.io/issues&#34;&gt;github.com/GLAM-Work&amp;hellip;&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Tracking Trove changes over time</title>
      <link>https://updates.timsherratt.org/2022/04/20/tracking-trove-changes.html</link>
      <pubDate>Wed, 20 Apr 2022 15:49:46 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/04/20/tracking-trove-changes.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve been doing a bit of cleaning up, trying to make some old datasets more easily available. In particular I&amp;rsquo;ve been pulling together harvests of the number of newspaper articles in Trove by year and state. My &lt;a href=&#34;https://timsherratt.org/shed/trove/graphs/&#34;&gt;first harvests&lt;/a&gt; date all the way back to 2011, before there was even a Trove API. Unfortunately, I didn&amp;rsquo;t run the harvests as often as I should&amp;rsquo;ve and there are some big gaps. Nonetheless, if you&amp;rsquo;re interested in how Trove&amp;rsquo;s newspaper corpus has grown and changed over time, you might find them useful. They&amp;rsquo;re available in &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals-historical&#34;&gt;this repository&lt;/a&gt; and also &lt;a href=&#34;https://doi.org/10.5281/zenodo.6471544&#34;&gt;in Zenodo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/2c1e7dd328.png&#34; alt=&#34;Chart showing number of newspaper articles per year available in Trove – harvested multiple times from 2011 to 2022&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;This chart shows how the number of newspaper articles per year in Trove has changed from 2011 to 2022. Note the rapid growth between 2011 and 2015.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To make sure there&amp;rsquo;s a more consistent record from now on, I&amp;rsquo;ve also created a new Git Scraper – &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals&#34;&gt;a GitHub repository&lt;/a&gt; that automatically harvests and saves data at weekly intervals. As well as the number of articles by &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_year.csv&#34;&gt;year&lt;/a&gt; and &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_year_and_state.csv&#34;&gt;state&lt;/a&gt;, it also harvests the number of articles by &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_newspaper.csv&#34;&gt;newspaper&lt;/a&gt; and &lt;a href=&#34;https://github.com/wragge/trove-newspaper-totals/blob/master/data/total_articles_by_category.csv&#34;&gt;category&lt;/a&gt;. If you want to get all the changes over time, you can retrieve earlier versions from the repository&amp;rsquo;s commit history.&lt;/p&gt;
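&lt;p&gt;Because each harvest is a commit, you can check out two snapshots of a file like &lt;code&gt;total_articles_by_year.csv&lt;/code&gt; and compare them to see where the corpus grew. Here&amp;rsquo;s a toy sketch of the comparison step; the totals below are invented, and the real numbers come from the repository&amp;rsquo;s CSV files.&lt;/p&gt;

```python
# Compare two harvest snapshots of yearly article totals.
# These values are invented for illustration; real snapshots come from
# the current CSV files plus earlier versions in the commit history.
harvest_2021 = {1914: 1200000, 1915: 1310000, 1954: 900000}
harvest_2022 = {1914: 1250000, 1915: 1330000, 1954: 905000, 1960: 15000}

# Growth per year: years missing from the earlier harvest count from zero
growth = {
    year: count - harvest_2021.get(year, 0)
    for year, count in sorted(harvest_2022.items())
}
print(growth)  # {1914: 50000, 1915: 20000, 1954: 5000, 1960: 15000}
```

&lt;p&gt;Applied to the real data, a diff like this shows both newly digitised titles and articles added to existing years.&lt;/p&gt;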
&lt;p&gt;All the datasets are CC-0 licensed and validated with Frictionless.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s also a &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#visualise-the-total-number-of-newspaper-articles-in-trove-by-year-and-state&#34;&gt;notebook in the GLAM Workbench&lt;/a&gt; that explores this sort of data.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The GLAM Workbench wants you!</title>
      <link>https://updates.timsherratt.org/2022/03/02/the-glam-workbench.html</link>
      <pubDate>Wed, 02 Mar 2022 15:52:20 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/03/02/the-glam-workbench.html</guid>
      <description>&lt;p&gt;Over the past few months I&amp;rsquo;ve been doing a lot of behind-the-scenes work on the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; – &lt;a href=&#34;https://updates.timsherratt.org/2021/12/01/digitalnz-te-papa.html&#34;&gt;automating&lt;/a&gt;, &lt;a href=&#34;https://updates.timsherratt.org/2022/01/28/testing-testing.html&#34;&gt;standardising&lt;/a&gt;, and documenting processes for developing and managing repositories. These sorts of things ease the maintenance burden on me and help make the GLAM Workbench &lt;a href=&#34;https://glam-workbench.net/about/#is-the-glam-workbench-sustainable&#34;&gt;sustainable&lt;/a&gt;, even as it continues to grow. But these changes are also aimed at making it easier for &lt;strong&gt;you&lt;/strong&gt; to contribute to the GLAM Workbench!&lt;/p&gt;
&lt;p&gt;Perhaps you&amp;rsquo;re part of a GLAM organisation that wants to help researchers explore its collection data – why not create your own section of the GLAM Workbench? It would be a great opportunity for staff to develop their own digital skills and learn about the possibilities of Jupyter notebooks. I&amp;rsquo;ve developed a &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench-template&#34;&gt;repository template&lt;/a&gt; and some &lt;a href=&#34;https://glam-workbench.net/get-involved/developing-repositories/&#34;&gt;detailed documentation&lt;/a&gt; to get you started. The repository template includes everything you need to create and test notebooks, as well as built-in integration with Binder, Docker, Reclaim Cloud, and Zenodo. And, of course, I&amp;rsquo;ll be around to help you through the process.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2022/a46891dbeb.png&#34; alt=&#34;Screenshot of documentation&#34;&gt;&lt;/p&gt;
&lt;p&gt;Or perhaps you&amp;rsquo;re a researcher who wants to &lt;a href=&#34;https://glam-workbench.net/get-involved/contribute-code/&#34;&gt;share some code you&amp;rsquo;ve developed&lt;/a&gt; that extends or improves an existing GLAM Workbench repository. Yes please!  Or maybe you&amp;rsquo;re a GLAM Workbench user who has &lt;a href=&#34;https://glam-workbench.net/get-involved/add-links/&#34;&gt;something to add&lt;/a&gt; to one of the lists of resources; or you&amp;rsquo;ve noticed a problem with some of the documentation that &lt;a href=&#34;https://glam-workbench.net/get-involved/editing-documentation/&#34;&gt;you&amp;rsquo;d like to fix&lt;/a&gt;. All contributions welcome!&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/get-involved/&#34;&gt;Get involved!&lt;/a&gt; page includes links to all this information, as well as some other possibilities such as becoming a &lt;a href=&#34;https://glam-workbench.net/get-involved/supporters/&#34;&gt;sponsor&lt;/a&gt;, or sharing &lt;a href=&#34;https://updates.timsherratt.org/categories/glamworkbench/&#34;&gt;news&lt;/a&gt;. And to recognise those who make a contribution to the code or documentation there&amp;rsquo;s also a brand new &lt;a href=&#34;https://glam-workbench.net/contributors/&#34;&gt;contributors&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m looking forward to exploring how we can build the GLAM Workbench together. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Omeka S Tools – new Python package</title>
      <link>https://updates.timsherratt.org/2022/02/17/omeka-s-tools.html</link>
      <pubDate>Thu, 17 Feb 2022 10:31:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/02/17/omeka-s-tools.html</guid>
      <description>&lt;p&gt;Over the last couple of years I&#39;ve been fiddling with bits of Python code to work with the &lt;a href=&#34;https://omeka.org/s/&#34;&gt;Omeka S&lt;/a&gt; REST API. The &lt;a href=&#34;https://omeka.org/s/docs/developer/api/rest_api/&#34;&gt;Omeka S API&lt;/a&gt; is powerful, but the documentation is patchy, and doing basic things like uploading images can seem quite confusing. My code was an attempt to simplify common tasks, like creating new items.&lt;/p&gt;
&lt;p&gt;In case it&#39;s of use to others, I&#39;ve now &lt;a href=&#34;https://github.com/wragge/omeka_s_tools&#34;&gt;shared my code as a Python package&lt;/a&gt;. So you can just &lt;code&gt;pip install omeka-s-tools&lt;/code&gt; to get started. The code helps you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;download lists of resources&lt;/li&gt;
&lt;li&gt;search and filter lists of items&lt;/li&gt;
&lt;li&gt;create new items&lt;/li&gt;
&lt;li&gt;create new items based on a resource template&lt;/li&gt;
&lt;li&gt;update and delete resources&lt;/li&gt;
&lt;li&gt;add media to items&lt;/li&gt;
&lt;li&gt;add map markers to items (assuming the &lt;a href=&#34;https://omeka.org/s/modules/Mapping/&#34;&gt;Mapping&lt;/a&gt; module is installed)&lt;/li&gt;
&lt;li&gt;upload templates exported from one Omeka instance to a new instance&lt;/li&gt;
&lt;/ul&gt;
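&lt;p&gt;To give a sense of what the package is smoothing over: the Omeka S API expects item data as JSON-LD. Here&#39;s a minimal sketch of the kind of payload involved in creating an item (illustrative only, not the package&#39;s own code; the &lt;code&gt;property_id&lt;/code&gt; of 1 for &lt;code&gt;dcterms:title&lt;/code&gt; assumes a default Omeka S installation):&lt;/p&gt;

```python
# Sketch of the JSON-LD payload Omeka S expects when creating an item.
# Illustrative only -- omeka-s-tools builds payloads like this for you.
# The property_id of 1 for dcterms:title assumes a default install.
import json

def make_item_payload(title, property_id=1):
    """Build a minimal Omeka S item payload with a single title value."""
    return {
        "dcterms:title": [
            {
                "type": "literal",
                "property_id": property_id,
                "@value": title,
            }
        ]
    }

payload = make_item_payload("Glen Innes Examiner, 28 May 1932")
body = json.dumps(payload)
```

&lt;p&gt;The package constructs and posts structures like this for you, so you can work with plain titles and values instead of raw JSON-LD.&lt;/p&gt;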
&lt;p&gt;There&#39;s quite &lt;a href=&#34;https://wragge.github.io/omeka_s_tools/&#34;&gt;detailed documentation&lt;/a&gt; available, including an example of adding a newspaper article from Trove to Omeka. If you want to see the code in action, there&#39;s also a &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#upload-trove-newspaper-articles-to-omeka-s&#34;&gt;notebook&lt;/a&gt; in the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove newspapers&lt;/a&gt; section of the GLAM Workbench that uploads newspaper articles (including images and OCRd text) to Omeka from a variety of sources, including Trove searches, Trove lists, and Zotero libraries.&lt;/p&gt;
&lt;p&gt;
  If you find any problems, or would like additional features, feel free to &lt;a href=&#34;https://github.com/wragge/omeka_s_tools/issues&#34;&gt;create an issue&lt;/a&gt; in the GitHub repository. #dhhacks
  &lt;br /&gt;
&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Testing, testing...</title>
      <link>https://updates.timsherratt.org/2022/01/28/testing-testing.html</link>
      <pubDate>Fri, 28 Jan 2022 15:01:15 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2022/01/28/testing-testing.html</guid>
      <description>&lt;p&gt;I regularly update the Python packages used in the different sections of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;; though probably not as often as I should. Part of the problem is that once I&#39;ve updated the packages, I have to run all the notebooks to make sure I haven&#39;t inadvertently broken something -- and this takes time. And in those cases where the notebooks need an API key to run, I have to copy and paste the key in at the appropriate spots, then remember to delete them afterwords. They&#39;re little niggles, but they add up, particularly as the GLAM Workbench itself expands.&lt;/p&gt;
&lt;p&gt;I&#39;ve been looking around at Jupyter notebook automated testing options for a while. There&#39;s &lt;a href=&#34;https://github.com/treebeardtech/nbmake&#34;&gt;nbmake&lt;/a&gt;, &lt;a href=&#34;https://github.com/nteract/testbook&#34;&gt;testbook&lt;/a&gt;, and &lt;a href=&#34;https://github.com/computationalmodelling/nbval&#34;&gt;nbval&lt;/a&gt;, as well as custom solutions involving things like &lt;a href=&#34;https://github.com/nteract/papermill&#34;&gt;papermill&lt;/a&gt; and &lt;a href=&#34;https://github.com/jupyter/nbconvert&#34;&gt;nbconvert&lt;/a&gt;. After much wavering, I finally decided to give &lt;code&gt;nbval&lt;/code&gt; a go. The thing that I like about &lt;code&gt;nbval&lt;/code&gt; is that I can start simple, then increase the complexity of my testing as required. The &lt;code&gt;--nbval-lax&lt;/code&gt; option just checks to make sure that all the cells in a notebook run without generating exceptions. You can also tag individual cells that you want to exclude from testing. This gives me a testing baseline -- this notebook runs without errors -- it might not do exactly what I think it&#39;s doing, but at least it&#39;s not exploding in flames. Working from this baseline, I can start tagging individual cells where I want the output of the cell to be checked. This will let me test whether a cell is doing what it&#39;s supposed to.&lt;/p&gt;
&lt;p&gt;This approach means that I can start testing without making major changes to existing notebooks. The main thing I had to think about is how to handle API keys or other variables which are manually set by users. I decided the easiest approach was to store them in a &lt;code&gt;.env&lt;/code&gt; file and use &lt;a href=&#34;https://github.com/theskumar/python-dotenv&#34;&gt;dotenv&lt;/a&gt; to load them within the notebook. This also makes it easy for users to save their own credentials and use them across multiple notebooks -- no more cutting and pasting of keys! Some notebooks are designed to run as web apps using Voila, so they expect human interaction. In these cases, I added extra cells that only run in the testing environment -- they populate the necessary fields and simulate button clicks to start.&lt;/p&gt;
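&lt;p&gt;The dotenv pattern is simple enough to sketch with the standard library -- &lt;code&gt;python-dotenv&lt;/code&gt; does this (and more robustly) for you, so treat this as a simplified stand-in:&lt;/p&gt;

```python
# Simplified stand-in for what python-dotenv's load_dotenv() does:
# read KEY=value lines from a .env file into os.environ, so notebooks
# can pick up credentials without any copying and pasting of keys.
import os

def load_env(path=".env"):
    """Load KEY=value pairs from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # skip blank lines and comments
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ[key.strip()] = value.strip().strip('"')

# In a notebook you'd then read the key with os.environ.get("TROVE_API_KEY")
```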
&lt;p&gt;While I was in a QA frame of mind, I also started playing with &lt;a href=&#34;https://github.com/nbQA-dev/nbQA&#34;&gt;nbqa&lt;/a&gt; -- a framework for all sorts of code formatting, linting, and checking tools. I decided I&#39;d try to standardise the formatting of my notebook code by running &lt;a href=&#34;https://github.com/PyCQA/isort&#34;&gt;isort&lt;/a&gt;, &lt;a href=&#34;https://github.com/psf/black&#34;&gt;black&lt;/a&gt;, and &lt;a href=&#34;https://flake8.pycqa.org/en/latest/&#34;&gt;flake8&lt;/a&gt;. As well as making the code cleaner and more readable, they pick up things like unused imports or variables. To further automate this process, I configured the &lt;code&gt;nbqa&lt;/code&gt; checks to run when I try to commit any notebook code changes using &lt;code&gt;git&lt;/code&gt;. This was made easy by the &lt;a href=&#34;https://pre-commit.com/&#34;&gt;pre-commit&lt;/a&gt; package.&lt;/p&gt;
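&lt;p&gt;For reference, wiring &lt;code&gt;nbqa&lt;/code&gt; into &lt;code&gt;pre-commit&lt;/code&gt; only takes a few lines of configuration -- something like the following &lt;code&gt;.pre-commit-config.yaml&lt;/code&gt; (the &lt;code&gt;rev&lt;/code&gt; shown is illustrative; pin whichever release you&#39;re actually using):&lt;/p&gt;

```yaml
# .pre-commit-config.yaml -- run nbqa checks on notebooks before each commit.
# The rev below is illustrative; pin the release you're actually using.
repos:
  - repo: https://github.com/nbQA-dev/nbQA
    rev: 1.2.2
    hooks:
      - id: nbqa-isort
      - id: nbqa-black
      - id: nbqa-flake8
```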
&lt;p&gt;This is all set up and running in the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove newspapers&lt;/a&gt; repository -- you can &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers/releases/tag/v1.3.0&#34;&gt;see the changes here&lt;/a&gt;. Now if I update the Python packages or make any other changes to the repository, I can just run &lt;code&gt;pytest --nbval-lax&lt;/code&gt; to test &lt;b&gt;every&lt;/b&gt; notebook at once. And if I make changes to an individual notebook, &lt;code&gt;nbqa&lt;/code&gt; will automatically give the changes a code quality check before I save them to the repository. I&#39;m planning to roll these changes out across the whole of the GLAM Workbench in coming months.&lt;/p&gt;
&lt;p&gt;
  Developments like these are not very exciting for users, but they&#39;re important for the management and sustainability of the GLAM Workbench, and help create a solid foundation for future development and collaboration. Last year I created a &lt;a href=&#34;https://updates.timsherratt.org/2021/11/11/a-template-for.html&#34;&gt;GLAM Workbench repository template&lt;/a&gt; to help people or organisations thinking about contributing new sections. I can now add these testing and QA steps into the template to further share and standardise the work of developing the GLAM Workbench.
  &lt;br /&gt;
&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Some big pictures of newspapers in Trove and DigitalNZ</title>
      <link>https://updates.timsherratt.org/2021/12/09/some-big-pictures.html</link>
      <pubDate>Thu, 09 Dec 2021 10:44:24 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/12/09/some-big-pictures.html</guid>
      <description>&lt;p&gt;One of the things I really like about Jupyter is the fact that I can share notebooks in a variety of different formats. Tools like &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#querypic&#34;&gt;QueryPic&lt;/a&gt; can run as simple web apps using Voila, static versions of notebooks can be viewed using NBViewer, and live versions can be spun up as required on Binder. It’s also possible to export notebooks at PDFs, slideshows, or just plain-old HTML pages. Just recently I realised I could export notebooks to HTML using the same template I use for Voila. This gives me &lt;em&gt;another&lt;/em&gt; way of sharing – static web pages delivered via the main &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; site.&lt;/p&gt;
&lt;p&gt;Here’s a couple of examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/digitalnz-views/papers_past_newspapers.html&#34;&gt;Papers Past newspapers in DigitalNZ&lt;/a&gt; – showing which Papers Past newspapers are available through DigitalNZ&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/examples/trove_newspaper_issues_per_day.html&#34;&gt;Trove newspapers – number of issues per day, 1803–2020&lt;/a&gt; – the number of newspaper issues published &lt;strong&gt;every day&lt;/strong&gt; in Trove&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/83610c54c6.png&#34; alt=&#34;&#34; title=&#34;Chart showing the number of newspaper issues in Trove published on each day in 1834.&#34;&gt;&lt;/p&gt;
&lt;p&gt;Both are HTML pages that embed visualisations created using &lt;a href=&#34;https://altair-viz.github.io/&#34;&gt;Altair&lt;/a&gt;. The visualisations are rendered using javascript, and even though the notebook isn’t running in a live computing environment, there’s some basic interactivity built-in – for example, you can hover for more details, and click on the DigitalNZ chart to search for articles from a newspaper. More to come! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Exploring GLAM data at ResBaz</title>
      <link>https://updates.timsherratt.org/2021/12/09/exploring-glam-data.html</link>
      <pubDate>Thu, 09 Dec 2021 10:13:48 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/12/09/exploring-glam-data.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://vimeo.com/652696272&#34;&gt;video&lt;/a&gt; of my key story presentation at &lt;a href=&#34;https://resbaz.github.io/resbaz2021qld/&#34;&gt;ResBaz Queensland&lt;/a&gt; (simulcast via &lt;a href=&#34;https://resbaz.github.io/resbaz2021/sydney/&#34;&gt;ResBaz Sydney&lt;/a&gt;) is now available on Vimeo. In it, I explore some of the possibilities of GLAM data by retracing my own journey through WWI service records, &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/&#34;&gt;The Real Face of White Australia&lt;/a&gt;, #redactionart, and Trove – ending up at the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;, which brings together a lot of my tools and resources in a form that anyone can use. The &lt;a href=&#34;https://slides.com/wragge/resbaz-2021&#34;&gt;slides&lt;/a&gt; are also available, and there’s an &lt;a href=&#34;https://doi.org/10.5281/zenodo.5760042&#34;&gt;archived version&lt;/a&gt; of everything in Zenodo.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://vimeo.com/652696272&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/34f5c3763d.png&#34; alt=&#34;&#34; title=&#34;Screencap of video showing #redactionart.&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This and many other presentations about the GLAM Workbench are &lt;a href=&#34;https://glam-workbench.net/presentations/&#34;&gt;listed here&lt;/a&gt;. It seems I’ve given at least 11 talks and workshops this year! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>GLAM Workbench Nectar Cloud Application updated!</title>
      <link>https://updates.timsherratt.org/2021/12/01/glam-workbench-nectar.html</link>
      <pubDate>Wed, 01 Dec 2021 11:20:53 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/12/01/glam-workbench-nectar.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://updates.timsherratt.org/2021/12/01/digitalnz-te-papa.html&#34;&gt;newly-updated&lt;/a&gt; &lt;a href=&#34;https://glam-workbench.net/digitalnz/&#34;&gt;DigitalNZ&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/tepapa/&#34;&gt;Te Papa&lt;/a&gt; sections of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench &lt;/a&gt;have been added to the list of available repositories in the &lt;a href=&#34;https://dashboard.rc.nectar.org.au/&#34;&gt;Nectar Research Cloud&lt;/a&gt;’s GLAM Workbench Application. This means you can create your very own version of these repositories running in the Nectar Cloud, simply by choosing them from the app’s dropdown list. See the &lt;a href=&#34;https://glam-workbench.net/using-nectar/&#34;&gt;Using Nectar&lt;/a&gt; help page for more information.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/ad2303f5e6.png&#34; alt=&#34;&#34; title=&#34;Screenshot of Nectar app showing dropdown list&#34;&gt;&lt;/p&gt;
&lt;p&gt;I’ve also taken the opportunity to make use of the new container registry service developed by the ARDC as part of the &lt;a href=&#34;https://arcos.ardc.edu.au/&#34;&gt;ARCOS&lt;/a&gt; project. The app now pulls the GLAM Workbench Docker images from &lt;a href=&#34;https://quay.io/organization/glamworkbench&#34;&gt;Quay.io&lt;/a&gt; via the container registry’s cache. This means that copies of the images are cached locally, speeding things up and saving on data transfers. Yay for integration!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/cb4e1280cb.png&#34; alt=&#34;&#34; title=&#34;Screenshot of container registry interface showing cached images&#34;&gt;&lt;/p&gt;
&lt;p&gt;Thanks again to Andy and the Nectar Cloud staff for their help! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>DigitalNZ &amp; Te Papa sections of the GLAMWorkbench updated!</title>
      <link>https://updates.timsherratt.org/2021/12/01/digitalnz-te-papa.html</link>
      <pubDate>Wed, 01 Dec 2021 10:41:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/12/01/digitalnz-te-papa.html</guid>
      <description>&lt;p&gt;In preparation for my talk at &lt;a href=&#34;https://resbaz.auckland.ac.nz/&#34;&gt;ResBaz Aotearoa&lt;/a&gt;, I updated the &lt;a href=&#34;https://glam-workbench.net/digitalnz/&#34;&gt;DigitalNZ&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/tepapa/&#34;&gt;Te Papa&lt;/a&gt; sections of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;. Most of the changes are related to management, maintenance, and integration of the repositories. Things like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Setting up GitHub actions to automatically generate Docker images when the repositories change, and to upload the images to the &lt;a href=&#34;https://quay.io/organization/glamworkbench&#34;&gt;Quay.io container registry&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Automatic generation of an &lt;code&gt;index.ipynb&lt;/code&gt; file based on &lt;code&gt;README.md&lt;/code&gt; to act as a front page within Jupyter Lab&lt;/li&gt;
&lt;li&gt;Addition of a &lt;code&gt;reclaim-manifest.jps&lt;/code&gt; file to allow for one-click installation of the repository in &lt;a href=&#34;https://reclaim.cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Additional documentation in &lt;code&gt;README.md&lt;/code&gt; with instructions on how to run the repository via &lt;a href=&#34;https://mybinder.org/&#34;&gt;Binder&lt;/a&gt;, &lt;a href=&#34;https://reclaim.cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt;, &lt;a href=&#34;https://dashboard.rc.nectar.org.au/&#34;&gt;Nectar Research Cloud&lt;/a&gt;, and Docker Desktop.&lt;/li&gt;
&lt;li&gt;Addition of a &lt;code&gt;.zenodo.json&lt;/code&gt; metadata file so that new releases are preserved in Zenodo&lt;/li&gt;
&lt;li&gt;Switch to using &lt;code&gt;pip-tools&lt;/code&gt; for generating &lt;code&gt;requirements.txt&lt;/code&gt; files, with unpinned requirements listed in &lt;code&gt;requirements.in&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Update of all Python packages&lt;/li&gt;
&lt;/ul&gt;
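&lt;p&gt;The &lt;code&gt;index.ipynb&lt;/code&gt; generation is less magic than it might sound: a notebook file is just JSON, so the simplest possible version wraps the whole of &lt;code&gt;README.md&lt;/code&gt; in a single markdown cell. A rough sketch (the actual build step in the repositories may do more than this):&lt;/p&gt;

```python
# Minimal sketch of turning README.md into an index.ipynb front page.
# A .ipynb file is just JSON, so the simplest approach wraps the whole
# README in a single markdown cell. The real build step in the GLAM
# Workbench repositories may well be more sophisticated than this.
import json

def readme_to_notebook(markdown_text):
    """Wrap markdown text in a minimal nbformat 4 notebook structure."""
    return {
        "cells": [
            {
                "cell_type": "markdown",
                "metadata": {},
                "source": markdown_text.splitlines(keepends=True),
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }

nb = readme_to_notebook("# My GLAM Workbench section\n\nWelcome!\n")
nb_json = json.dumps(nb, indent=1)
```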
&lt;p&gt;From the user’s point of view, the main benefit of these changes is the ability to run the repositories in a variety of different environments depending on your needs and skills. The Docker images, generated using &lt;a href=&#34;https://github.com/jupyterhub/repo2docker&#34;&gt;repo2docker&lt;/a&gt;, are used by Binder, Reclaim Cloud, Nectar, and Docker Desktop. Same image, multiple environments! See ‘Run these notebooks’ in the &lt;a href=&#34;https://glam-workbench.net/digitalnz/&#34;&gt;DigitalNZ&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/tepapa/&#34;&gt;Te Papa&lt;/a&gt; sections of the GLAM Workbench for more information.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/d40a7d95cb.png&#34; alt=&#34;&#34; title=&#34;Screenshot of visualisation displaying information about Papers Past newspapers available through DigitalNZ.&#34;&gt;&lt;/p&gt;
&lt;p&gt;Of course, I’ve also re-run all of the notebooks to make sure everything works and to update any statistics, visualisations, and datasets. As a bonus, there’s a couple of new notebooks in the DigitalNZ repository:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/digitalnz/#querypic-digitalnz&#34;&gt;QueryPic DigitalNZ&lt;/a&gt; – a web app to visualise searches in Papers Past over time (I’ll post some further info about this shortly)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/digitalnz/#papers-past-newspapers-in-digitalnz&#34;&gt;Papers Past newspapers in DigitalNZ&lt;/a&gt; – displays details of the Papers Past newspapers available through DigitalNZ (you can &lt;a href=&#34;https://glam-workbench.net/digitalnz-views/papers_past_newspapers.html&#34;&gt;view the results&lt;/a&gt; as an HTML page)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;#dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A template for GLAM Workbench development</title>
      <link>https://updates.timsherratt.org/2021/11/11/a-template-for.html</link>
      <pubDate>Thu, 11 Nov 2021 17:03:43 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/11/11/a-template-for.html</guid>
      <description>&lt;p&gt;I’m hoping that the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; will encourage GLAM organisations and GLAM data nerds (like me) to create their own Jupyter notebooks. If they do, they can put a link to them in the list of &lt;a href=&#34;https://glam-workbench.net/more-glam-notebooks/&#34;&gt;GLAM Jupyter resources&lt;/a&gt;. But what if they want to add the notebooks to the GLAM Workbench itself?&lt;/p&gt;
&lt;p&gt;To make this easier, I’ve been working on &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench-template&#34;&gt;a template repository for the GLAM Workbench&lt;/a&gt;. It generates a new skeleton repository with all the files you need to develop and manage your own section of the GLAM Workbench. It uses GitHub’s built-in templating feature, together with &lt;a href=&#34;https://cookiecutter.readthedocs.io/en/1.7.2/&#34;&gt;Cookiecutter&lt;/a&gt;, and this &lt;a href=&#34;https://github.com/stefanbuck/cookiecutter-template/blob/main/.github/workflows/setup-repository.yml&#34;&gt;GitHub Action&lt;/a&gt; by Stefan Buck. Stefan has also written a &lt;a href=&#34;https://stefanbuck.com/blog/repository-templates-meets-github-actions&#34;&gt;very helpful blog post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The new repository is configured to do various things automatically, such as generate and save Docker images, and integrate with Reclaim Cloud and Zenodo. Lurking inside the &lt;code&gt;dev&lt;/code&gt; folder of each new repository, you’ll find &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench-template/tree/master/%7B%7Bcookiecutter.temp_directory%7D%7D/dev#readme&#34;&gt;some basic details&lt;/a&gt; on how to set up and manage your development environment.&lt;/p&gt;
&lt;p&gt;This is just the first step. There’s more documentation to come, but you’re very welcome to try it out. And, of course, if you &lt;em&gt;are&lt;/em&gt; interested in contributing to the development of the GLAM Workbench, let me know and I’ll help get you set up!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Coming up! GLAM Workbench at ResBaz(s)</title>
      <link>https://updates.timsherratt.org/2021/11/04/coming-up-glam.html</link>
      <pubDate>Thu, 04 Nov 2021 15:28:30 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/11/04/coming-up-glam.html</guid>
      <description>&lt;p&gt;Want a bit of added GLAM with your digital research skills? You’re in luck, as I’ll be speaking at not one, but three ResBaz events in November. If you haven’t heard of it before, ResBaz (Research Bazaar) is ‘a worldwide festival promoting the digital literacy at the centre of modern research’.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;On Wednesday, 24 November I’ll be giving a key story presentation (like a keynote, but with more story!) entitled &lt;a href=&#34;https://resbaz.github.io/resbaz2021qld/schedule/#session-100&#34;&gt;Exploring GLAM data&lt;/a&gt; for &lt;a href=&#34;https://resbaz.github.io/resbaz2021qld/&#34;&gt;ResBaz Queensland&lt;/a&gt;. This presentation will also be simulcast through &lt;a href=&#34;https://resbaz.github.io/resbaz2021/sydney/&#34;&gt;ResBaz Sydney&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;On Thursday, 25 November I’ll be giving a presentation on the &lt;a href=&#34;https://resbaz.auckland.ac.nz/schedule/#session-1479&#34;&gt;GLAM Workbench&lt;/a&gt; for &lt;a href=&#34;https://resbaz.auckland.ac.nz/&#34;&gt;ResBaz Aotearoa&lt;/a&gt; – focusing in particular on NZ GLAM data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The programs of all three ResBaz events are chock full of excellent opportunities to develop your digital skills, learn new research methods, and explore digital tools. If you’re an HDR student you should check out what’s on offer.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New video – using the Trove Newspaper &amp; Gazette Harvester</title>
      <link>https://updates.timsherratt.org/2021/11/01/new-video-using.html</link>
      <pubDate>Mon, 01 Nov 2021 10:32:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/11/01/new-video-using.html</guid>
      <description>&lt;p&gt;The latest help video for the GLAM Workbench walks through the web app version of the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper &amp;amp; Gazette Harvester&lt;/a&gt;. Just paste in your search url and Trove API key and you can harvest thousands of digitised newspaper articles in minutes!&lt;/p&gt;
&lt;iframe width=&#34;100%&#34; height=&#34;400px&#34; src=&#34;https://www.youtube.com/embed/WKFuJR6lLF4&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&#34; allowfullscreen&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>Harvest newspaper issues as PDFs</title>
      <link>https://updates.timsherratt.org/2021/11/01/harvest-newspaper-issues.html</link>
      <pubDate>Mon, 01 Nov 2021 09:53:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/11/01/harvest-newspaper-issues.html</guid>
      <description>&lt;p&gt;An inquiry on Twitter prompted me to put together &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#harvest-the-issues-of-a-newspaper-as-pdfs&#34;&gt;a notebook&lt;/a&gt; that you can use to &lt;strong&gt;download all available issues of a newspaper as PDFs&lt;/strong&gt;. It was really just a matter of copying code from other tools and making a few modifications. The first step harvests a list of available issues for a particular newspaper from Trove. You can then download the PDFs of those issues, supplying an optional date range. Beware – this could consume a lot of disk space!&lt;/p&gt;
&lt;p&gt;The PDF file names have the following structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[newspaper identifier]-[issue date as YYYYMMDD]-[issue identifier].pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;903-19320528-1791051.pdf
&lt;/code&gt;&lt;/pre&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;903&lt;/code&gt; – the &lt;a href=&#34;https://trove.nla.gov.au/newspaper/title/903&#34;&gt;Glen Innes Examiner&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;19320528&lt;/code&gt; – 28 May 1932&lt;/li&gt;
&lt;li&gt;&lt;code&gt;1791051&lt;/code&gt; – to view in Trove just add this to &lt;code&gt;http://nla.gov.au/nla.news-issue&lt;/code&gt;, eg &lt;a href=&#34;http://nla.gov.au/nla.news-issue1791051&#34;&gt;http://nla.gov.au/nla.news-issue1791051&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
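&lt;p&gt;Those components are easy to pull back apart programmatically. A small sketch (it assumes every filename follows the structure above):&lt;/p&gt;

```python
# Parse a harvested PDF filename like '903-19320528-1791051.pdf' back
# into its components, and build the corresponding Trove issue url.
# Assumes the filename always follows the structure described above.
from datetime import datetime

def parse_issue_filename(filename):
    """Split a harvested PDF filename into newspaper id, date, and issue id."""
    stem = filename.rsplit(".", 1)[0]
    newspaper_id, date_str, issue_id = stem.split("-")
    return {
        "newspaper_id": newspaper_id,
        "date": datetime.strptime(date_str, "%Y%m%d").date().isoformat(),
        "issue_id": issue_id,
        "url": f"http://nla.gov.au/nla.news-issue{issue_id}",
    }

details = parse_issue_filename("903-19320528-1791051.pdf")
# details["date"] is "1932-05-28"
```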
&lt;p&gt;I also took the opportunity to create a new &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#harvesting-data&#34;&gt;Harvesting data&lt;/a&gt; heading in the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove newspapers&lt;/a&gt; section of the GLAM Workbench. #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/af229b4fa2.png&#34; alt=&#34;&#34; title=&#34;Screenshot of the Harvesting data section in Trove newspapers&#34;&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>GLAM Workbench now in the Nectar Research Cloud!</title>
      <link>https://updates.timsherratt.org/2021/10/21/glam-workbench-now.html</link>
      <pubDate>Thu, 21 Oct 2021 10:18:11 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/10/21/glam-workbench-now.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; isn’t dependent on one big piece of technological infrastructure. It’s basically a collection of Jupyter notebooks, and those notebooks can be used within a variety of different environments. This helps make the GLAM Workbench more sustainable – new components can be swapped in and out as required. It also makes it possible to create different pathways for users, depending on their digital skills, institutional support, and research needs. For example, links to &lt;a href=&#34;https://glam-workbench.net/using-binder/&#34;&gt;Binder&lt;/a&gt; make it easy for users to explore the possibilities of the GLAM Workbench and accomplish quick tasks. But Binder has limits. Where do you go when your research project scales up?&lt;/p&gt;
&lt;p&gt;Earlier this year I added one-click installation of GLAM Workbench repositories in &lt;a href=&#34;https://glam-workbench.net/using-reclaim-cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt;. &lt;strong&gt;Today I’m very pleased to announce that selected GLAM Workbench repositories can be installed as applications within the &lt;a href=&#34;https://ardc.edu.au/services/nectar-research-cloud/&#34;&gt;Nectar Research Cloud&lt;/a&gt;.&lt;/strong&gt; Using nationally-funded digital infrastructure, researchers in Australian universities can now create their own workbenches in minutes. So whether you’re harvesting truckloads of data from Trove or analysing web archives at scale, you can move beyond Binder and set up an environment dedicated to your research project. Cool huh?&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/e10ef8218a.png&#34; alt=&#34;You can install GLAM Workbench repositories using this simple application in Nectar!&#34;&gt;&lt;/p&gt;
&lt;p&gt;Currently four repositories can be installed on Nectar in this way:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove newspapers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove newspaper and gazette harvester&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/recordsearch/&#34;&gt;RecordSearch&lt;/a&gt; (National Archives of Australia)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web archives&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But more will be added in the future as I update repositories to generate the necessary Docker images. Nectar installation information is now included in each of these four repositories, and I’ve added a &lt;a href=&#34;https://glam-workbench.net/using-nectar/&#34;&gt;Using the Nectar Cloud&lt;/a&gt; section to the help documentation that includes a detailed walkthrough of the installation process. If you strike any problems either &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench.github.io/issues&#34;&gt;raise an issue&lt;/a&gt; on GitHub, or ask a question at &lt;a href=&#34;https://ozglam.chat/c/glam-workbench/8&#34;&gt;OzGLAM Chat&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Huge thanks to Andy, Jacob, and Jo&lt;/strong&gt; at the &lt;a href=&#34;https://ardc.edu.au/&#34;&gt;Australian Research Data Commons&lt;/a&gt; (ARDC) who responded enthusiastically to my tweeted query, and packaged the repositories up into an easy-to-install, &lt;a href=&#34;https://github.com/NeCTAR-RC/murano-glamworkbench&#34;&gt;reusable application&lt;/a&gt;. After all the work I’ve put into the GLAM Workbench, it’s really exciting to see it embedded within Australia’s digital research infrastructure. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>More GLAM Name Index updates from Queensland State Archives and SLWA</title>
      <link>https://updates.timsherratt.org/2021/10/18/more-glam-name.html</link>
      <pubDate>Mon, 18 Oct 2021 10:40:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/10/18/more-glam-name.html</guid>
      <description>&lt;p&gt;A new version of the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt; is available. An additional 49 indexes have been added, bringing the total to 246. You can now search for names in more than &lt;strong&gt;10.2 million records&lt;/strong&gt; from 9 organisations.&lt;/p&gt;
&lt;p&gt;The new indexes come from Queensland State Archives and the State Library of WA. QSA &lt;a href=&#34;https://twitter.com/QSArchives/status/1448891637116067840&#34;&gt;announced on Friday&lt;/a&gt; that they’d added two new indexes to their site. When I went to harvest them, I realised there was another 25 indexes that I hadn’t previously picked up. It seems that some QSA datasets are tagged as ‘Queensland State Archives’ in the data.qld.gov.au portal, but others are tagged as ‘queensland state archives’ – and the tag search is case sensitive! I now search for both the upper and lower case tags.&lt;/p&gt;
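&lt;p&gt;Harvesting under both tag variants means the same dataset can turn up twice, so the results need to be merged by dataset id. The pattern looks something like this (the &lt;code&gt;fetch&lt;/code&gt; function is a placeholder for whatever actually queries the data.qld.gov.au API):&lt;/p&gt;

```python
# Query both case variants of a tag and merge the results by dataset id,
# since the data.qld.gov.au tag search is case sensitive. The fetch
# function here is a placeholder for a real API call.
def harvest_datasets(fetch, tags=("Queensland State Archives", "queensland state archives")):
    """Fetch datasets for each tag variant, deduplicating by dataset id."""
    seen = {}
    for tag in tags:
        for dataset in fetch(tag):
            seen[dataset["id"]] = dataset
    return list(seen.values())

# Example with a stubbed fetch function standing in for the portal API:
def fake_fetch(tag):
    if tag.islower():
        return [{"id": "b", "title": "Index B"}]
    return [{"id": "a", "title": "Index A"}, {"id": "b", "title": "Index B"}]

results = harvest_datasets(fake_fetch)
# results contains two unique datasets, not three
```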
&lt;p&gt;There’s also a number of additions from the State Library of WA. These datasets were already in my harvest, but because of some oddities in their formatting, I hadn’t included them in the Index Search. Looking at them again, I realised they were right to go, so I’ve added them in.&lt;/p&gt;
&lt;p&gt;Here’s the list of additions:&lt;/p&gt;
&lt;h3 id=&#34;queensland-state-archives&#34;&gt;Queensland State Archives&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Australian South Sea Islanders 1867 to 1908 - A-K&lt;/li&gt;
&lt;li&gt;Australian South Sea Islanders 1867 to 1908 L-Z&lt;/li&gt;
&lt;li&gt;Beaudesert Shire Burials - Logan Village 1878-2000 - Beaudesert Shire and Logan Village Burials 1878-2000&lt;/li&gt;
&lt;li&gt;Immigrants, Bowen Immigration Depot 1885-1892&lt;/li&gt;
&lt;li&gt;Brisbane Gaol Hospital Admission registers 1889-1911 - Index to Brisbane Gaol Hospital Admission Registers 1889-1911&lt;/li&gt;
&lt;li&gt;Index to Correspondence of Queensland Colonial Secretary 1859-1861 - Index to Colonial Secretary s Correspondence Bundles 1859 - 1861.csv&lt;/li&gt;
&lt;li&gt;Dunwich Benevolent Asylum records - Index to Dunwich Benevolent Asylum 1859-1948&lt;/li&gt;
&lt;li&gt;Dunwich Benevolent Asylum records - Index to Dunwich Benevolent Asylum 1885-1907&lt;/li&gt;
&lt;li&gt;Immigrants and Crew 1860-1865 (COL/A) - Index to Immigrants and Crew 1860 - 1964&lt;/li&gt;
&lt;li&gt;Index to Immigration 1909-1932&lt;/li&gt;
&lt;li&gt;Outdoor Relief 1900-1904 - Index to Outdoor Relief 1892-1920&lt;/li&gt;
&lt;li&gt;Pensions 1908-1919 - Index to Pensions 1908-1909&lt;/li&gt;
&lt;li&gt;Cases &amp;amp; treatment Moreton Bay Hospital 1830-1862 - Index to Register of Cases and treatment at Moreton Bay Hospital 1830-1862&lt;/li&gt;
&lt;li&gt;Index to Registers of Agricultural Lessees 1885-1908&lt;/li&gt;
&lt;li&gt;Index to Registers of Immigrants, Rockhampton 1882-1915&lt;/li&gt;
&lt;li&gt;Pneumonic influenza patients, Wallangarra Quarantine Compound - Index to Wallangarra Flu Camp 1918-1919&lt;/li&gt;
&lt;li&gt;Land selections 1885-1981&lt;/li&gt;
&lt;li&gt;Lazaret patient registers - Lazaret Patient Registers&lt;/li&gt;
&lt;li&gt;Leases, Selections and Pastoral Runs and other related records 1850-2014&lt;/li&gt;
&lt;li&gt;Perpetual Lease Selections of soldier settlements 1917 - 1929 - Perpetual Lease Selections of soldier settlements 1917-1929&lt;/li&gt;
&lt;li&gt;Photographic records of prisoners 1875-1913 - Photographic Records of Prisoners 1875-1913&lt;/li&gt;
&lt;li&gt;Redeemed land orders 1860-1906 - Redeemed land orders 1860-1907&lt;/li&gt;
&lt;li&gt;Register of the Engagement of Immigrants at the Immigration Depot - Bowen 1873-1912&lt;/li&gt;
&lt;li&gt;Registers of Applications by Selectors 1868-1885&lt;/li&gt;
&lt;li&gt;Registers of Immigrants Promissory Notes (Maryborough)&lt;/li&gt;
&lt;li&gt;Education Office Gazette Scholarships 1900 - 1940 - Scholarships in the Education Office Gazette 1900 - 1940&lt;/li&gt;
&lt;li&gt;Teachers in the Education Office Gazettes 1899-1925&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;state-library-of-western-australia&#34;&gt;State Library of Western Australia&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;WABI Subset: Eastern Goldfields - Eastern Goldfields&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with A&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with B&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with C&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with D and E&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with F&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with G&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with H&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with I and J&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with K&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with L&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with M&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with N&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with O&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with P and Q&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with R&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with S&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with T&lt;/li&gt;
&lt;li&gt;Western Australian Biographical Index (WABI) - Index entries beginning with U-Z&lt;/li&gt;
&lt;li&gt;Digital Photographic Collection - Pictorial collection_csv&lt;/li&gt;
&lt;li&gt;WABI subset: Police - WABI police subset&lt;/li&gt;
&lt;li&gt;WABI subset: York - York and districts subset&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;bonus-update&#34;&gt;Bonus update&lt;/h3&gt;
&lt;p&gt;After a bit more work last night I added in a dataset from the State Library of Victoria:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Melbourne and metropolitan hotels, pubs and publicans&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That&amp;rsquo;s an extra 21,000 records, and takes the total number of indexes to 247 from 10 different GLAM organisations!&lt;/p&gt;
&lt;p&gt;#dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Getting data about newspaper issues in Trove</title>
      <link>https://updates.timsherratt.org/2021/10/15/getting-data-about.html</link>
      <pubDate>Fri, 15 Oct 2021 10:44:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/10/15/getting-data-about.html</guid>
      <description>&lt;p&gt;When you search Trove&amp;rsquo;s newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? How do you get a list of dates when newspapers were published? This &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#harvest-information-about-newspaper-issues&#34;&gt;notebook in the GLAM Workbench&lt;/a&gt; shows how you can get information about issues from the Trove API.&lt;/p&gt;
&lt;p&gt;Using the notebook, I’ve created a couple of datasets ready for download and use.&lt;/p&gt;
&lt;h3 id=&#34;total-number-of-issues-per-year-for-every-newspaper-in-trove&#34;&gt;Total number of issues per year for every newspaper in Trove&lt;/h3&gt;
&lt;p&gt;Harvested 10 October 2021&lt;/p&gt;
&lt;p&gt;CSV formatted dataset containing the number of newspaper issues available on Trove, totalled by title and year – comprises 27,604 rows with the fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;title&lt;/code&gt; – newspaper title&lt;/li&gt;
&lt;li&gt;&lt;code&gt;title_id&lt;/code&gt; – newspaper id&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state&lt;/code&gt; – place of publication&lt;/li&gt;
&lt;li&gt;&lt;code&gt;year&lt;/code&gt; – year published&lt;/li&gt;
&lt;li&gt;&lt;code&gt;issues&lt;/code&gt; – number of issues&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Download from Cloudstor: &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/oEkqztgGELlvluQ&#34;&gt;newspaper_issues_totals_by_year_20211010.csv&lt;/a&gt; (2.1mb)&lt;/p&gt;
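&lt;p&gt;As a quick example of working with the fields above, here&amp;rsquo;s a sketch that totals issues by place of publication using only the Python standard library. The filename is the one given above; adjust the path to wherever you&amp;rsquo;ve saved the download.&lt;/p&gt;

```python
import csv
from collections import Counter

def issues_per_state(csv_path):
    """Total the number of newspaper issues by place of publication."""
    totals = Counter()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            totals[row["state"]] += int(row["issues"])
    return totals

# totals = issues_per_state("newspaper_issues_totals_by_year_20211010.csv")
```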
&lt;h3 id=&#34;complete-list-of-issues-for-every-newspaper-in-trove&#34;&gt;Complete list of issues for every newspaper in Trove&lt;/h3&gt;
&lt;p&gt;Harvested 10 October 2021&lt;/p&gt;
&lt;p&gt;CSV formatted dataset containing a complete list of newspaper issues available on Trove – comprises 2,654,020 rows with the fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;title&lt;/code&gt; – newspaper title&lt;/li&gt;
&lt;li&gt;&lt;code&gt;title_id&lt;/code&gt; – newspaper id&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state&lt;/code&gt; – place of publication&lt;/li&gt;
&lt;li&gt;&lt;code&gt;issue_id&lt;/code&gt; – issue identifier&lt;/li&gt;
&lt;li&gt;&lt;code&gt;issue_date&lt;/code&gt; – date of publication (YYYY-MM-DD)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To keep the file size down, I haven&amp;rsquo;t included an &lt;code&gt;issue_url&lt;/code&gt; in this dataset, but these are easily generated from the &lt;code&gt;issue_id&lt;/code&gt;. Just add the &lt;code&gt;issue_id&lt;/code&gt; to the end of &lt;code&gt;http://nla.gov.au/nla.news-issue&lt;/code&gt;. For example: &lt;a href=&#34;http://nla.gov.au/nla.news-issue495426&#34;&gt;http://nla.gov.au/nla.news-issue495426&lt;/a&gt;. Note that when you follow an issue url, you actually get redirected to the url of the first page in the issue.&lt;/p&gt;
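&lt;p&gt;In Python, generating the urls is a one-liner:&lt;/p&gt;

```python
def issue_url(issue_id):
    """Build a persistent URL for a Trove newspaper issue.

    Following the URL redirects to the first page of the issue.
    """
    return f"http://nla.gov.au/nla.news-issue{issue_id}"

# issue_url(495426) returns "http://nla.gov.au/nla.news-issue495426"
```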
&lt;p&gt;Download from Cloudstor: &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/BWVyJDsdrXQbQAg&#34;&gt;newspaper_issues_20211010.csv&lt;/a&gt; (222mb)&lt;/p&gt;
&lt;p&gt;For more information see the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove newspapers&lt;/a&gt; section of the GLAM Workbench.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>GLAM Workbench at eResearch Australasia 2021</title>
      <link>https://updates.timsherratt.org/2021/10/15/glam-workbench-at.html</link>
      <pubDate>Fri, 15 Oct 2021 09:50:58 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/10/15/glam-workbench-at.html</guid>
      <description>&lt;p&gt;Way back in 2013, I went to the eResearch Australasia conference as the manager of Trove to talk about &lt;a href=&#34;https://www.slideshare.net/wragge/beyond-discovery&#34;&gt;new research possibilities using the Trove API&lt;/a&gt;. Eight years years later &lt;a href=&#34;https://conference.eresearch.edu.au/events/a-glam-workbench-for-humanities-researchers/&#34;&gt;I was back&lt;/a&gt;, still spruiking the possibilities of Trove data. This time, however, I was discussing Trove in the broader context of &lt;strong&gt;GLAM&lt;/strong&gt; data – all the exciting possibilities that have emerged as galleries, libraries, archives and museums make more of their collections available in machine-readable form. The &lt;strong&gt;big question&lt;/strong&gt; is, of course, how do researchers, particularly those in the humanities, make use of that data? The &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; is my attempt to address that question – to provide humanities researchers with both the tools and information they need, and an understanding of the possibilities that might emerge if they invest a bit of time in working with GLAM data. My eResearch Australasia 2021 presentation provides a quick introduction to the GLAM Workbench, here’s the &lt;a href=&#34;https://vimeo.com/631475562&#34;&gt;video&lt;/a&gt;, and the &lt;a href=&#34;https://slides.com/wragge/eresearch2021&#34;&gt;slides&lt;/a&gt;.&lt;/p&gt;
&lt;div style=&#34;padding:56.25% 0 0 0;position:relative;&#34;&gt;&lt;iframe src=&#34;https://player.vimeo.com/video/631475562?h=6b2e2b5636&#34; style=&#34;position:absolute;top:0;left:0;width:100%;height:100%;&#34; frameborder=&#34;0&#34; allow=&#34;autoplay; fullscreen; picture-in-picture&#34; allowfullscreen&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;script src=&#34;https://player.vimeo.com/api/player.js&#34;&gt;&lt;/script&gt;
&lt;p&gt;&lt;a href=&#34;https://vimeo.com/631475562&#34;&gt;A GLAM Workbench for humanities researchers&lt;/a&gt; from &lt;a href=&#34;https://vimeo.com/wragge&#34;&gt;Tim Sherratt&lt;/a&gt; on &lt;a href=&#34;https://vimeo.com&#34;&gt;Vimeo&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The presentation was pre-recorded, but I managed to sneak in an update via chat for those who attended the session. More news on this next week… 🥳&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/29b72b05bc.png&#34; alt=&#34;&#34;&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New Python package to download Trove newspaper images</title>
      <link>https://updates.timsherratt.org/2021/10/05/new-python-package.html</link>
      <pubDate>Tue, 05 Oct 2021 12:03:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/10/05/new-python-package.html</guid>
      <description>&lt;p&gt;There&amp;rsquo;s no reliable way of downloading an image of a Trove newspaper article from the web interface. The image download option produces an HTML page with embedded images, and the article is often sliced into pieces to fit the page.&lt;/p&gt;
&lt;p&gt;This &lt;a href=&#34;https://pypi.org/project/trove-newspaper-images/&#34;&gt;Python package&lt;/a&gt; includes tools to download articles as complete JPEG images. If an article is printed across multiple newspaper pages, multiple images will be downloaded – one for each page. It&amp;rsquo;s intended for integration into other tools and processing workflows, or for people who like working on the command line.&lt;/p&gt;
&lt;p&gt;You can use it as a library:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;from trove_newspaper_images.articles import download_images

images = download_images(&#39;107024751&#39;)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or from the command line:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;trove_newspaper_images.download 107024751 --output_dir images
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you just want to quickly download an article as an image without installing anything, you can use &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#save-a-trove-newspaper-article-as-an-image&#34;&gt;this web app&lt;/a&gt; in the GLAM Workbench. To download images of all articles returned by a search in Trove, you can also use the &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper and Gazette Harvester&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;See the &lt;a href=&#34;https://wragge.github.io/trove_newspaper_images/&#34;&gt;documentation&lt;/a&gt; for more information. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>More records for the GLAM Name Index Search</title>
      <link>https://updates.timsherratt.org/2021/09/29/more-records-for.html</link>
      <pubDate>Wed, 29 Sep 2021 12:17:35 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/09/29/more-records-for.html</guid>
      <description>&lt;p&gt;Two more datasets have been added to the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Index Search&lt;/a&gt;! From the &lt;a href=&#34;https://history.sa.gov.au/&#34;&gt;History Trust of South Australia&lt;/a&gt; and &lt;a href=&#34;https://collab.sa.gov.au/&#34;&gt;Collab&lt;/a&gt;, I’ve added:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://data.sa.gov.au/data/dataset/passengers-in-history&#34;&gt;Passengers in History&lt;/a&gt; – that’s 371,894 records of people arriving in South Australia from 1836 to 1961&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://data.sa.gov.au/data/dataset/3f6fab54-8cc8-4732-9c1e-fb3f73df53b0&#34;&gt;Women’s Suffrage Petition 1894 (South Australia)&lt;/a&gt; – another 10,638 names&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/7eadb95bde.png&#34; alt=&#34;&#34; title=&#34;Screenshot of GLAM Name Index Search home page&#34;&gt;&lt;/p&gt;
&lt;p&gt;In total there are 9.67 million name records to search across 197 datasets provided by 9 GLAM organisations!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>More QueryPic in action</title>
      <link>https://updates.timsherratt.org/2021/09/29/more-querypic-in.html</link>
      <pubDate>Wed, 29 Sep 2021 11:30:26 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/09/29/more-querypic-in.html</guid>
      <description>&lt;p&gt;Recently I created a &lt;a href=&#34;https://updates.timsherratt.org/2021/08/30/some-research-projects.html&#34;&gt;list of publications&lt;/a&gt; that made use of &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#querypic&#34;&gt;QueryPic&lt;/a&gt;, my tool to visualise searches in Trove’s digitised newspapers. Here’s another example of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; and QueryPic in action, in Professor Julian Meyrick’s recent keynote lecture, &amp;lsquo;Looking Forward to the 1950s: A Hauntological Method for Investigating Australian Theatre History’.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/iOLmEBlKeQs&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&#34; allowfullscreen&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>Some thoughts on the ‘Trove Researcher Platform for Advanced Research’ draft plan</title>
      <link>https://updates.timsherratt.org/2021/09/10/some-thoughts-on.html</link>
      <pubDate>Fri, 10 Sep 2021 11:12:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/09/10/some-thoughts-on.html</guid>
      <description>&lt;p&gt;Late last year the Federal Government &lt;a href=&#34;https://ministers.dese.gov.au/tehan/improving-hass-and-indigenous-research-infrastructure&#34;&gt;announced&lt;/a&gt; it was making an $8.9 million investment in HASS and Indigenous research infrastructure. This program is being managed by the ARDC and will lead to the development of a &lt;a href=&#34;https://ardc.edu.au/collaborations/strategic-activities/hass-and-indigenous-research-data-commons/&#34;&gt;HASS Research Data Commons&lt;/a&gt;. According to the ARDC, a research data commons:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;brings together people, skills, data, and related resources such as storage, compute, software, and models to enable researchers to conduct world class data-intensive research&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Sounds awesome!&lt;/p&gt;
&lt;p&gt;Based on scoping studies commissioned by the Department of Education, Skills, and Employment (which have not yet been made public), &lt;a href=&#34;https://ardc.edu.au/collaborations/strategic-activities/hass-and-indigenous-research-data-commons/recommendations-for-co-investment-in-humanities-arts-and-social-sciences-research-data-commons-program/&#34;&gt;four activities were selected for initial funding&lt;/a&gt; under this program. Draft project plans for these four activities have now been &lt;a href=&#34;https://ardc.edu.au/collaborations/strategic-activities/hass-and-indigenous-research-data-commons/project-plans/&#34;&gt;released for public comment&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;One of these activities aims to develop a ‘Trove researcher platform for advanced research’:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Augmenting existing National Library of Australia resources, this platform will enable a focus on the delivery of researcher portals accessible through Trove, Australia’s unique public heritage site. The platform will create tools for visualisation, entity recognition, transcription and geocoding across Trove content and other corpora.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You can &lt;a href=&#34;https://ardc.edu.au/wp-content/uploads/2021/09/Research_Platform_TROVE.pdf&#34;&gt;download the draft project plan&lt;/a&gt; for the Trove platform. Funding for this activity will be capped at $2,301,185 across 2021-23. In this post I’ll try to pull together some of my own thoughts on this plan.&lt;/p&gt;
&lt;p&gt;I suppose I’d better start with a disclaimer – I’m not a neutral observer in this. I started scraping data from Trove newspapers way back in 2010, building the first versions of tools like QueryPic and the Trove Newspaper Harvester. While I was manager of Trove, from 2013 to 2016, I argued for recognition of Trove as a key part of Australia’s humanities research infrastructure, and highlighted possible research uses of Trove data available through the API. Since then I’ve worked to bring a range of digital tools, examples, tutorials, and hacks together for researchers in the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; – a &lt;a href=&#34;https://updates.timsherratt.org/2021/08/26/glam-workbench-a.html&#34;&gt;large number of these&lt;/a&gt; work with data from Trove.&lt;/p&gt;
&lt;p&gt;I strongly believe that Trove should receive ongoing funding through &lt;a href=&#34;https://www.dese.gov.au/ncris&#34;&gt;NCRIS&lt;/a&gt; as a piece of national research infrastructure. Unfortunately though, the draft project plan does not make a strong case for investment – it’s vague, unimaginative, and makes little attempt to integrate with existing tools and services. I think it scores poorly against the ARDC’s &lt;a href=&#34;https://ardc.edu.au/wp-content/uploads/2021/09/Evaluation-Criteria-HASS-RDC-and-Indigenous-Research-Capability-program.pdf&#34;&gt;evaluation criteria&lt;/a&gt;, and doesn’t seem to offer good value for money. As someone who has championed the use of Trove data for research across the last decade, I’m very disappointed.&lt;/p&gt;
&lt;h2 id=&#34;whats-planned&#34;&gt;What’s planned?&lt;/h2&gt;
&lt;p&gt;So what is being proposed? There seem to be three main components:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Authenticated ‘project’ spaces for researchers where datasets relating to a particular research topic can be stored&lt;/li&gt;
&lt;li&gt;The ability to create custom datasets from a search in Trove&lt;/li&gt;
&lt;li&gt;Tools to visualise stored datasets.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There’s no doubt that these are all useful functions for researchers, but many problems arise when we look at how they’re going to be implemented.&lt;/p&gt;
&lt;h3 id=&#34;1-authenticated-project-spaces&#34;&gt;1. Authenticated project spaces&lt;/h3&gt;
&lt;p&gt;The draft plan indicates that authentication of users through the &lt;a href=&#34;https://aaf.edu.au/&#34;&gt;Australian Access Federation&lt;/a&gt; is preferred. Why? Trove already has a system for the creation of user accounts. Using AAF would limit use of the new platform to those attached to universities or research agencies. I don’t understand what the use of AAF adds to the project, except perhaps to provide an example of integration with existing infrastructure services.&lt;/p&gt;
&lt;p&gt;The plan notes that project spaces could be ‘public’ or ‘private’. Presumably a ‘public’ space would give access to stored datasets, but what sort of access controls would be available in relation to individual datasets? It’s also noted (Deliverable 7) that researchers would have ‘an option to “publish” their research findings for public consumption‘. Does this mean datasets and visualisations would be assigned a DOI (or other persistent identifier) and preserved indefinitely? How might these spaces integrate with existing data repositories?&lt;/p&gt;
&lt;h3 id=&#34;2-create-custom-datasets&#34;&gt;2. Create custom datasets&lt;/h3&gt;
&lt;p&gt;The lack of detail in the plan makes it difficult to assess what’s being proposed here. But it seems that users would be able to construct a search using the Trove web interface (or a new search interface?) and save the results as a dataset.&lt;/p&gt;
&lt;p&gt;What data would be searched? It’s not clear, but in reference to the visualisations it’s stated that data would come from ‘Trove’s existing full text collections (newspapers and gazettes, magazines and newsletters, books)’. So no web archives, and no metadata from any of Trove’s aggregated collections (even without full text, collection metadata can create interesting research possibilities, see for example the &lt;a href=&#34;https://glam-workbench.net/trove-music/&#34;&gt;Radio National records&lt;/a&gt; in the GLAM Workbench).&lt;/p&gt;
&lt;p&gt;What will be included in each dataset? There are few details, but at a minimum you’d expect something like a CSV containing the metadata of all the matching records, and files containing the full text content of the items. These could potentially be &lt;em&gt;very&lt;/em&gt; large. There’s no indication about how storage and processing demands would be managed, but presumably there would be some per user, or per project, limits.&lt;/p&gt;
&lt;p&gt;Deliverable 8, ‘Data and visual download’, states that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All query results must be available as downloadable files, this would include CSV, JSON and XML for the query results list.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;But there’s no mention of the full text content at all. Will it be included in downloadable datasets?&lt;/p&gt;
&lt;p&gt;As well as the record metadata and full text, you’d want there to be some metadata captured about the dataset itself – the search query used, when it was captured, the number of records, etc. To support integration and reuse, it would be good to align this with something like &lt;a href=&#34;https://www.researchobject.org/ro-crate/&#34;&gt;RO Crate&lt;/a&gt;.&lt;/p&gt;
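&lt;p&gt;To make the idea concrete, here&amp;rsquo;s a rough sketch of what minimal RO-Crate-style metadata for a harvested dataset might look like. The property names are illustrative only – the actual profile such a platform adopted would need to be worked out against the RO-Crate specification.&lt;/p&gt;

```python
import json
from datetime import date

def dataset_metadata(query, num_records, api_query_url):
    """Rough sketch of RO-Crate-style metadata for a harvested dataset.

    The property names here are illustrative, not a settled profile.
    """
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "./",
                "@type": "Dataset",
                "description": f"Records harvested from a Trove search for '{query}'",
                "dateCreated": date.today().isoformat(),  # capture date
                "size": num_records,                      # number of records
                "isBasedOn": api_query_url,               # the query used
            }
        ],
    }

# print(json.dumps(dataset_metadata("radio astronomy", 1200,
#       "https://api.trove.nla.gov.au/v2/result"), indent=2))
```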
&lt;p&gt;How will searches be constructed? It’s not clear if this will be integrated with the existing search interface, or be something completely separate; however, the plan does note that ‘limitations are put onto the dataset like keyword search terms and filters corresponding to the filters currently available in the interface’. So it seems that the new platform will be using the existing search indexes. It’s obviously important for the relationship between existing search functions and the new dataset creation tool to be explicit and transparent so that researchers understand what they’re getting.&lt;/p&gt;
&lt;p&gt;It’s also worth noting that changes to the search interface last year removed some useful options from the advanced search form. In particular, you can no longer exclude matches in tags or comments. If you’re a researcher looking for the occurrence of a particular word, you generally don’t want to include records where that word only appears in a user added tag (I have a story about ‘Word War I’ that illustrates this!).&lt;/p&gt;
&lt;p&gt;This raises a broader issue. There doesn’t seem to be any mention in the project plan of work to improve the metadata and indexing in response to research needs. Even just identifying digitised books in the current web interface can be a bit of a challenge, and digitised books and periodicals can be grouped into work records with other versions. We need to recognise that the needs of discovery sometimes compromise specific research uses.&lt;/p&gt;
&lt;p&gt;I’m trying to be constructive in my responses here, but at this point I just have to scream – WHAT ABOUT THE &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;TROVE NEWSPAPER HARVESTER&lt;/a&gt;? A tool has existed for &lt;em&gt;ten years&lt;/em&gt; that lets users create a dataset containing metadata and full text from a search in Trove’s newspapers and gazettes. I’ve spent a lot of time over recent years adding features and making it easier to use. Now you can download not only full text, but also PDFs and images of articles. The latest web app version in the GLAM Workbench runs in the cloud. Just one click to start it up, then all you need to do is paste in your Trove API key and the url of your search. It can’t get much easier.&lt;/p&gt;
&lt;p&gt;The GLAM Workbench also includes tools to create datasets and download OCRd text from Trove’s &lt;a href=&#34;https://glam-workbench.net/trove-books/&#34;&gt;books&lt;/a&gt; and digitised &lt;a href=&#34;https://glam-workbench.net/trove-journals/&#34;&gt;journals&lt;/a&gt;. These are still in notebook form, so are not as easy to use, but I have created pre-harvested datasets of all books and periodicals with OCRd text, and stored them on CloudStor. What’s missing at the moment is something to harvest a collection of journal articles, but this would not be difficult. As an added bonus, the GLAM Workbench has tools to &lt;a href=&#34;https://glam-workbench.net/web-archives/#harvesting-collections-of-text-from-archived-web-pages&#34;&gt;create full text datasets&lt;/a&gt; from the Australian Web Archive.&lt;/p&gt;
&lt;p&gt;So what is this project really adding? And why is there no attempt to leverage existing tools and resources?&lt;/p&gt;
&lt;h3 id=&#34;3-visualise-datasets&#34;&gt;3. Visualise datasets&lt;/h3&gt;
&lt;p&gt;Again, there’s a fair bit of hand waving in the plan, but it seems that users will be able to select a stored dataset and then choose a form of visualisation. The plan says that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;An initial pilot would allow users to create line graphs that plot the frequency of a search term over time and maps that display results based on state-level geolocation.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Up to three additional visualisations would be created later based on research feedback. It’s not clear which researchers will be consulted and when their feedback will be sought.&lt;/p&gt;
&lt;p&gt;The value of these sorts of visualisations is obviously dependent on the quality and consistency of the metadata. There’s nothing built into this plan that would, for example, allow a researcher to clean or normalise any of the saved data. You have to take what you’re given. The newspaper metadata is generally consistent, but books and periodicals less so.&lt;/p&gt;
&lt;p&gt;It’s also important to clarify what’s meant by ‘the frequency of a search term over time’. Does this mean the number of records matching a search term, or the number of times that the search term actually appears in the full text of all matched records? If the latter, then this &lt;em&gt;would&lt;/em&gt; be a major enrichment of the available data. Though if this data was available it should be pushed through the API and/or made available as a downloadable dataset for integration with other platforms (perhaps along the lines of the Hathi Trust’s &lt;a href=&#34;https://analytics.hathitrust.org/datasets&#34;&gt;Extracted Features Dataset&lt;/a&gt;). I suspect, however, that what is actually meant is the number of matching search results.&lt;/p&gt;
&lt;p&gt;Again, the value of any geospatial visualisation depends on what is actually being visualised! The &lt;code&gt;state&lt;/code&gt; facet in newspapers indicates place of publication; it’s not clear what the &lt;code&gt;place&lt;/code&gt; facet in other categories represents. For this sort of visualisation to be useful in a research context, there would need to be some explanation of how these values were created, and any gaps or uncertainties.&lt;/p&gt;
&lt;p&gt;Time for another scream of frustration — WHAT ABOUT &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#querypic&#34;&gt;QUERYPIC&lt;/a&gt;? Another long-standing tool which has already been &lt;a href=&#34;https://updates.timsherratt.org/2021/08/30/some-research-projects.html&#34;&gt;cited a number of times&lt;/a&gt; in research literature. &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#querypic&#34;&gt;QueryPic&lt;/a&gt; visualises searches in Trove’s newspapers and gazettes over time. You can adjust time scales and intervals, and download the results as images, a CSV file, and an HTML page. The project plan makes a point of claiming that its tools would not require any coding, but neither does QueryPic. Just plug in an API key and a search URL. I even made &lt;a href=&#34;https://www.youtube.com/playlist?list=PLAclcciEeCD2z2BWQ2r3xD_Q8c05HppfP&#34;&gt;some videos&lt;/a&gt; about it! The GLAM Workbench also includes a number of examples of how you can &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#map-trove-newspaper-results-by-state&#34;&gt;visualise places of publication&lt;/a&gt; of newspaper articles.&lt;/p&gt;
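&lt;p&gt;For the curious, here&amp;rsquo;s roughly what a QueryPic-style chart does under the hood – ask the Trove API for the &lt;code&gt;year&lt;/code&gt; facet rather than for the articles themselves. This is a sketch against the v2 API as I understand it (check the current API documentation before relying on it); note that the newspaper zone returns year facets one decade at a time, so you loop over decades to build a full chart.&lt;/p&gt;

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.trove.nla.gov.au/v2/result"

def parse_year_facet(data):
    """Pull sorted (year, count) pairs out of a Trove API facet response."""
    terms = data["response"]["zone"][0]["facets"]["facet"]["term"]
    return sorted((int(t["display"]), int(t["count"])) for t in terms)

def year_counts(query, api_key, decade):
    """Number of matching articles per year for one decade of a search.

    The newspaper zone applies the year facet within a decade
    (e.g. decade="191" for the 1910s), so a QueryPic-style chart
    loops over decades and concatenates the results.
    """
    params = urlencode({
        "q": query,
        "zone": "newspaper",
        "facet": "year",
        "l-decade": decade,
        "encoding": "json",
        "n": 0,  # facet counts only, no article records
        "key": api_key,
    })
    with urlopen(f"{API}?{params}") as response:
        return parse_year_facet(json.load(response))
```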
&lt;p&gt;But it’s not just the GLAM Workbench. The &lt;a href=&#34;https://ardc.edu.au/wp-content/uploads/2021/09/Language_Commons_Australia.pdf&#34;&gt;Linguistics Data Commons of Australia&lt;/a&gt;, another activity to be funded as part of the HASS Research Data Commons, will include tools for text analysis and visualisation. The &lt;a href=&#34;https://www.tlcmap.org/&#34;&gt;Time Layered Cultural Map&lt;/a&gt; is developing tools for geospatial visualisation of Australian collections. Surely the focus should be on connecting and reusing what’s available. Again I’m wondering what this project is really adding.&lt;/p&gt;
&lt;h2 id=&#34;portals-and-platforms&#34;&gt;Portals and platforms&lt;/h2&gt;
&lt;p&gt;The original language describing the funded activity is interesting — it is intended to ‘focus on the delivery of researcher portals accessible through Trove’.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Portals&lt;/em&gt; (plural) accessible &lt;em&gt;through&lt;/em&gt; (not in) Trove.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The NLA could meet a fair proportion of its stated objectives right now, simply by including links to QueryPic and the Trove Newspaper and Gazette Harvester. Done! There’s a million dollars saved.&lt;/p&gt;
&lt;p&gt;More seriously, there’s no reason why the outcome of this activity should be a new interface attached to Trove and managed by the NLA. Indeed, such an approach works against integration, reuse, and data sharing. I believe the basic assumptions of the draft plan are seriously flawed. We need to separate out the strands of what’s meant by a ‘platform for advanced research’, and think more creatively and collaboratively about how we could achieve something useful, flexible, and sustainable.&lt;/p&gt;
&lt;h2 id=&#34;wheres-the-api&#34;&gt;Where’s the API?&lt;/h2&gt;
&lt;p&gt;I think the primary role of the NLA in the development of this research platform should be as the data provider. There are numerous ways in which Trove’s data might be improved and enriched in support of new research uses. These improvements could then be pushed through the API to integrate with a range of tools and resources. Which raises the question — where is the API in this plan?&lt;/p&gt;
&lt;p&gt;The only mention of the API comes as an option for a user with ‘high technical expertise’ to extend the analysis provided by the built-in visualisations. This is all backwards. The API is the key pipeline for data-sharing and integration and should be at the heart of this plan.&lt;/p&gt;
&lt;p&gt;This program offers an opportunity to make some much-needed improvements to the API. Here are a few possibilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Bring the web interface and API back into sync so that researchers can easily transfer queries between the two (Trove’s interface update introduced new categories, while the API still groups resources by the original zones).&lt;/li&gt;
&lt;li&gt;Provide public API access to additional data about digitised items. For example, you can get lists of newspaper titles and issues from the API, but there’s no comparable method to get titles and issues for digitised periodicals. The data’s there – it’s used to generate lists of issues in the browse interface – but it’s not in the API. There’s also other resource metadata, such as parent/child relationships, which are embedded in web pages but not exposed in the API.&lt;/li&gt;
&lt;li&gt;Standardise the delivery of OCRd text for different resource types.&lt;/li&gt;
&lt;li&gt;Finally add the People &amp;amp; Organisations data to the main RESTful API.&lt;/li&gt;
&lt;li&gt;Fix the limitations of the web archives CDX API (&lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/web-archives/blob/master/comparing_cdx_apis.ipynb&#34;&gt;documented here&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Add a search API for the web archives.&lt;/li&gt;
&lt;li&gt;And what about a Write API? Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources.&lt;/li&gt;
&lt;/ul&gt;
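&lt;p&gt;To make the first couple of gaps concrete: newspapers already have a titles endpoint, while digitised periodicals have nothing comparable. Here’s a minimal sketch of using the existing endpoint (the state parameter and the response structure are my assumptions, so check them against the API documentation):&lt;/p&gt;

```python
# A sketch of requesting the list of digitised newspaper titles from the
# v2 API -- the sort of per-title data this post argues should also exist
# for digitised periodicals. Parameter and response details are assumptions.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def titles_url(api_key, state=None):
    """Build a request for the list of digitised newspaper titles."""
    params = {"encoding": "json", "key": api_key}
    if state:
        params["state"] = state  # optionally limit to one state
    return "https://api.trove.nla.gov.au/v2/newspaper/titles?" + urlencode(params)

# Example (needs a real API key and network access):
# data = json.load(urlopen(titles_url("YOUR_API_KEY", state="vic")))
# for t in data["response"]["records"]["newspaper"]:
#     print(t["id"], t["title"])
```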
&lt;p&gt;I think the HASS RDC would benefit greatly by thinking much more about the role of the Trove API in establishing reusable data flows, and connecting up components.&lt;/p&gt;
&lt;h2 id=&#34;pathways&#34;&gt;Pathways&lt;/h2&gt;
&lt;p&gt;Anyone who’s been to one of my &lt;a href=&#34;https://glam-workbench.net/presentations/&#34;&gt;GLAM Workbench talks&lt;/a&gt; will know that I talk a lot about ‘pathways’. My concern is not just to provide useful tools and examples, but to try and connect them in ways that encourage researchers to develop their skills and confidence. So a researcher with limited digital skills can spin up QueryPic and start making visualisations without any specialised knowledge. But if they want to explore the data and assumptions behind QueryPic, they can view a notebook that walks them through the process of getting data from facets and assembling a time series. If they find something interesting in QueryPic, they can go to the Newspaper Harvester and assemble a dataset that helps them zoom into a particular period. There are places to go.&lt;/p&gt;
&lt;p&gt;Similarly, users can start making use of the GLAM Workbench in the cloud using &lt;a href=&#34;https://glam-workbench.net/using-binder/&#34;&gt;Binder&lt;/a&gt; – one click and it’s running. But as their research develops they might find Binder a bit limiting, so there are options to spin up the GLAM Workbench using &lt;a href=&#34;https://glam-workbench.net/using-reclaim-cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt; or &lt;a href=&#34;https://glam-workbench.net/using-docker/&#34;&gt;Docker&lt;/a&gt;. As a researcher’s skills, needs, and questions change, so does their use of the GLAM Workbench. At least that’s the plan – I’m very aware that there’s much, much more to do to build and document these pathways.&lt;/p&gt;
&lt;p&gt;The developments described in the draft plan are focused on providing simple tools for non-technical users. That’s fair enough, but you have to give those users somewhere to go, some path beyond, or else it just becomes another dead end. Users can download their data or visualisation, but then what?&lt;/p&gt;
&lt;p&gt;Of course you don’t point a non-coder to API documentation and say ‘there you go’. But coders can use the API to build and share a range of tools that introduce people to the possibilities of data, and scaffold their learning. Why should there be just one interface? It’s not too difficult to imagine a range of introductory visualisation tools aimed at different humanities disciplines. Instead of focusing inward on a single Trove Viz Lite tool, why not look outwards at ways of embedding Trove data within a range of research training contexts?&lt;/p&gt;
&lt;h2 id=&#34;integration&#34;&gt;Integration&lt;/h2&gt;
&lt;p&gt;A number of the HASS RDC Evaluation Criteria focus on issues of integration, collaboration, and reuse of existing resources. For example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Project plans should display robust proposal planning including the maximisation of the use or re-use of existing research infrastructure, platforms, tools, services, data storage and compute.&lt;/li&gt;
&lt;li&gt;Project plans should display integrated infrastructure layers with other HASS RDC activities, in particular by linking together elements such as data storage, tools, authentication, licensing, networks, cloud and high-performance computing, and access to data resources for reuse.&lt;/li&gt;
&lt;li&gt;Project plans must be robust and contribute to the HASS RDC as a coherent whole that capitalises on existing data collections, adheres to the F.A.I.R. principles, develops collaborative tools, utilises shared underlying infrastructure and has appropriate governance planning.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;There’s little evidence of this sort of thinking in the draft project plan. I’ve mentioned a few obvious opportunities for integration above, but there are many more. Overall, I think the proposed ‘platform for advanced research’ needs to be designed as a series of interconnected components, and not be seen as the product of a single institution.&lt;/p&gt;
&lt;p&gt;We could imagine, for example, a system where the NLA focused on the delivery of research-ready data via the Trove API. A layer of data filtering, cleaning, and packaging tools could be built on top of the API to help users assemble actionable datasets. The packaging processes could use standards such as RO-Crate to prepare datasets for ingest into data repositories. Existing storage services, such as CloudStor, could be used for saving and sharing working datasets. Another layer of visualisation and analysis tools could either process these datasets, or integrate directly with the API. These tools could be spread across different projects including LDaCA, TLCMap, and the GLAM Workbench — using standards such as Jupyter to encourage sharing and reuse of individual components, and running on a variety of cloud-hosted platforms. Instead of just adding another component to Trove, we’d be building a collaborative network of tool builders and data wranglers — developing capacities across the research sector, and spreading the burden of maintenance.&lt;/p&gt;
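&lt;p&gt;The packaging step needn’t be complicated. As a rough sketch, a minimal RO-Crate description of a harvested CSV file might be assembled like this (the properties shown are a small, illustrative subset of the RO-Crate 1.1 model):&lt;/p&gt;

```python
# A minimal sketch of the RO-Crate packaging step: build a
# ro-crate-metadata.json document describing a single harvested CSV file.
# Properties shown are an illustrative subset of the RO-Crate 1.1 model.
import json

def make_crate(dataset_name, data_file):
    """Build a minimal RO-Crate metadata document for one data file."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # the metadata file descriptor required by the spec
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # the root dataset
                "@id": "./",
                "@type": "Dataset",
                "name": dataset_name,
                "hasPart": [{"@id": data_file}],
            },
            {"@id": data_file, "@type": "File", "encodingFormat": "text/csv"},
        ],
    }

# Serialise, ready to write out alongside the data file
crate_json = json.dumps(make_crate("Trove newspaper harvest", "results.csv"), indent=2)
```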
&lt;h2 id=&#34;sustainability&#34;&gt;Sustainability&lt;/h2&gt;
&lt;p&gt;The draft project plan includes some pretty worrying comments about long-term support for the new platform. Work Package 5 notes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The developed product will require support post release which can be guaranteed for a period not exceeding the contracted period for this project&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;ARDC will be responsible for providing ongoing financial support for this phase. It has not been included in the proposal.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So once the project is over, the NLA will not support the new platform unless the ARDC provides ongoing funding. What researcher would want to ‘publish’ their data on a platform that could disappear at any time? We all know that sustainability is hard, but you would think that the NLA could at least offer to work collaboratively with the research sector to develop a plan for sustainability, instead of just asking for more money. Why would anyone invest so much for so little?&lt;/p&gt;
&lt;h2 id=&#34;leadership-and-community&#34;&gt;Leadership and community&lt;/h2&gt;
&lt;p&gt;The development of collaborations and communities also figures prominently in the HASS RDC Evaluation Criteria. For example:&lt;/p&gt;
&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;Project plans should clearly demonstrate that they enable collaboration and build communities across geographically dispersed research groups through facilitated sharing of high-quality data, particularly for computational analysis; the development of new platforms for collaboration and sharing; and, the encouragement of innovative methodologies through the use of analytic tools.&lt;/li&gt;
&lt;li&gt;Project plans must include a demonstrated commitment to ongoing community development to ensure the sustainability of the development is vital. The deliverables will act as ongoing national research infrastructure. They must be broadly usable by more than just the project partners and serve as input to a wide range of research.&lt;/li&gt;
&lt;li&gt;Project plans, and project leads in particular, should demonstrate the research leadership that will foster and encourage the uptake and use of the HASS RDC.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;Once again the draft project plan falls short. There are no project partners listed. Instead the plan refers broadly to all of Trove’s content partners, none of whom have direct involvement in this project. Indeed, as noted above, data aggregated from project partners is excluded from the new platform.&lt;/p&gt;
&lt;p&gt;There are no new governance arrangements proposed for this project. Instead the plan refers to the Trove Strategic Advisory Committee which includes representatives from partner organisations. But there are no researcher representatives on this committee.&lt;/p&gt;
&lt;p&gt;The only consultation with the research sector undertaken in the ‘Consultation Phase’ of the project is that undertaken by the ARDC itself. Does that mean this current process whereby the ARDC is soliciting feedback on the project plans? Whoa, meta…&lt;/p&gt;
&lt;p&gt;The plan notes that during the testing phase described in Work Package 3, ‘HASS community members would gain access to a beta version of the product for comment’. However, later it is stated that access would be provided to ‘a subset of researchers’, and that only system bugs and ‘high priority improvements’ would be acted upon.&lt;/p&gt;
&lt;p&gt;Generally speaking, it seems that the NLA is seeking as little consultation as possible. It’s not exploring options for collaboration. It’s not engaging with the research community about these developments. That doesn’t seem like an effective way to build communities. Nor does it demonstrate leadership.&lt;/p&gt;
&lt;h2 id=&#34;summing-up&#34;&gt;Summing up&lt;/h2&gt;
&lt;p&gt;This project plan can’t be accepted in its current form. We’ve had failures and disappointments in the development of HASS research infrastructure in the past. The HASS RDC program gives us a chance to start afresh, and the focus on integration, data-sharing, and reuse gives hope that we can build something that will continue to grow and develop, and not wither through lack of engagement and support. So should the NLA be getting $2 million to add a new component to Trove that is not integrated with other HASS RDC projects, and substantially duplicates tools available elsewhere? No, I don’t think so. They need to go back to the drawing board, undertake some real consultation, and build collaborations, not products.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Some research projects that have used QueryPic</title>
      <link>https://updates.timsherratt.org/2021/08/30/some-research-projects.html</link>
      <pubDate>Mon, 30 Aug 2021 12:48:11 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/30/some-research-projects.html</guid>
      <description>&lt;p&gt;A Twitter thread about some of the research uses of QueryPic&amp;hellip;&lt;/p&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;QueryPic, my tool for visualising searches in &lt;a href=&#34;https://twitter.com/TroveAustralia?ref_src=twsrc%5Etfw&#34;&gt;@TroveAustralia&lt;/a&gt;’s digitised newspapers, has been around in different forms for more than 10 years. The latest version is part of the &lt;a href=&#34;https://twitter.com/hashtag/GLAMWorkbench?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#GLAMWorkbench&lt;/a&gt;: &lt;a href=&#34;https://t.co/qnY5tVDwgY&#34;&gt;https://t.co/qnY5tVDwgY&lt;/a&gt;  &lt;a href=&#34;https://twitter.com/hashtag/researchinfrastructure?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#researchinfrastructure&lt;/a&gt; &lt;a href=&#34;https://t.co/QyHWJwGV3u&#34;&gt;pic.twitter.com/QyHWJwGV3u&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431841378720370691?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://platform.twitter.com/widgets.js&#34; charset=&#34;utf-8&#34;&gt;&lt;/script&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;I thought I’d highlight some of the research publications that have made use of QueryPic over the years, so, in no particular order...&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431841710477242378?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;There’s &lt;a href=&#34;https://twitter.com/Airminded?ref_src=twsrc%5Etfw&#34;&gt;@Airminded&lt;/a&gt;’s article in &lt;a href=&#34;https://twitter.com/HistAustJournal?ref_src=twsrc%5Etfw&#34;&gt;@HistAustJournal&lt;/a&gt; – Brett Holman (2013) &amp;#39;Dreaming War: Airmindedness and the Australian Mystery Aeroplane Scare of 1918&amp;#39;, History Australia, 10:2, 180-201, DOI: &lt;a href=&#34;https://t.co/2wgiLueHGL&#34;&gt;https://t.co/2wgiLueHGL&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431842737041510403?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;A book! Simon Sleight, Young People and the Shaping of Public Space in Melbourne, 1870–1914, Ashgate Publishing, Ltd., 2013. &lt;a href=&#34;https://t.co/CPgGMrYYYq&#34;&gt;https://t.co/CPgGMrYYYq&lt;/a&gt; &lt;a href=&#34;https://t.co/XryAF0hJ5K&#34;&gt;pic.twitter.com/XryAF0hJ5K&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431843658664275973?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Yorick Smaal (2013) Keeping it in the family: prosecuting incest in colonial Queensland, Journal of Australian Studies, 37:3, 316-332, DOI: &lt;a href=&#34;https://t.co/n5tQlER9Vo&#34;&gt;https://t.co/n5tQlER9Vo&lt;/a&gt; &lt;a href=&#34;https://t.co/tKzpAosu1i&#34;&gt;pic.twitter.com/tKzpAosu1i&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431844330474409988?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;In &lt;a href=&#34;https://twitter.com/AHSjournal?ref_src=twsrc%5Etfw&#34;&gt;@AHSjournal&lt;/a&gt; there’s – Murray G. Phillips &amp;amp; Gary Osmond (2015) Australia&amp;#39;s Women Surfers: History, Methodology and the Digital Humanities, Australian Historical Studies, 46:2, 285-303, DOI: &lt;a href=&#34;https://t.co/Gxs1Ru6Ojt&#34;&gt;https://t.co/Gxs1Ru6Ojt&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431844996370419712?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Gary Osmond (2015) ‘Pink Tea and Sissy Boys’: Digitized Fragments of Male Homosexuality, Non-Heteronormativity and Homophobia in the Australian Sporting Press, 1845–1954, The International Journal of the History of Sport, 32:13, 1578-1592, DOI: &lt;a href=&#34;https://t.co/C6FndD7C4E&#34;&gt;https://t.co/C6FndD7C4E&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431845470419050499?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Murray G. Phillips, Gary Osmond &amp;amp; Stephen Townsend (2015) A Bird’s-Eye View of the Past: Digital History, Distant Reading and Sport History, The International Journal of the History of Sport, 32:15, 1725-1740, DOI: &lt;a href=&#34;https://t.co/4rB2hkmmDM&#34;&gt;https://t.co/4rB2hkmmDM&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431845844060282880?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Sarah Ailwood and Maree Sainsbury, ‘Copyright Law, Readers and Authors in Colonial Australia’, Journal of the Association for the Study of Australian Literature, vol. 14, no. 3, 2014. &lt;a href=&#34;https://t.co/XWqx8XJGLQ&#34;&gt;https://t.co/XWqx8XJGLQ&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431846330142314497?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Sarah Ailwood and Maree Sainsbury, ‘The Imperial Effect: Literary Copyright Law in Colonial Australia’, Law, Culture and the Humanities, vol. 12, no. 3, 1 October 2016, pp. 716–740. &lt;a href=&#34;https://t.co/s6HrBZmQ6N&#34;&gt;https://t.co/s6HrBZmQ6N&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431846863200677888?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;A book chapter by &lt;a href=&#34;https://twitter.com/JVLamond?ref_src=twsrc%5Etfw&#34;&gt;@JVLamond&lt;/a&gt; – Lamond, J, 2016, &amp;#39;Zones of Connection: Common Reading in a Regional Australian Library&amp;#39;, in Print Culture Histories Beyond the Metropolis, University of Toronto Press, Toronto, pp. 355-374. &lt;a href=&#34;https://t.co/o3oAmreYne&#34;&gt;https://t.co/o3oAmreYne&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431848340660965378?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;Not just history – Scifleet, P., Henninger, M. &amp;amp; Albright, K.H. (2013). When social media are your source. Information Research, 18(3) paper C41. &lt;a href=&#34;https://t.co/qOYbZ3TMTf&#34;&gt;https://t.co/qOYbZ3TMTf&lt;/a&gt; &lt;a href=&#34;https://t.co/GDP2TmeUzp&#34;&gt;pic.twitter.com/GDP2TmeUzp&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431849020478033926?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;There’s also a number of references to QueryPic as a tool in the DH &amp;amp; library literature, that I won’t list.&lt;br&gt;&lt;br&gt;There’s probably more – citation of tools like QueryPic can be a bit hit and miss.&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431850352396029955?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;The latest version of QueryPic is designed to be both easy-to-use and flexible – click a link to start it up, paste in your &lt;a href=&#34;https://twitter.com/TroveAustralia?ref_src=twsrc%5Etfw&#34;&gt;@TroveAustralia&lt;/a&gt; API key, and a search url from Trove… and bingo!&lt;br&gt;&lt;br&gt;For a quick intro, see this video: &lt;a href=&#34;https://t.co/Hh1oDIOh9a&#34;&gt;https://t.co/Hh1oDIOh9a&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431851213847347203?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;But even though it’s easy to get started, QueryPic can do interesting things like compare queries. You can also adjust facets, date ranges, and time scales.&lt;br&gt;&lt;br&gt;This video shows you how to create more complex queries: &lt;a href=&#34;https://t.co/0CoJhO7vaJ&#34;&gt;https://t.co/0CoJhO7vaJ&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431851859065507844?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
&lt;blockquote class=&#34;twitter-tweet&#34; data-conversation=&#34;none&#34; data-cards=&#34;hidden&#34; data-partner=&#34;tweetdeck&#34;&gt;&lt;p lang=&#34;en&#34; dir=&#34;ltr&#34;&gt;As I often say, not all &lt;a href=&#34;https://twitter.com/hashtag/researchinfrastructure?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#researchinfrastructure&lt;/a&gt; has to be big. A simple tool like this can help researchers see their topics in new ways.&lt;br&gt;&lt;br&gt;And from this starting point, there’s all sorts of pathways to follow in the &lt;a href=&#34;https://twitter.com/hashtag/GLAMWorkbench?src=hash&amp;amp;ref_src=twsrc%5Etfw&#34;&gt;#GLAMWorkbench&lt;/a&gt; &lt;a href=&#34;https://t.co/AC2tipN8eY&#34;&gt;https://t.co/AC2tipN8eY&lt;/a&gt;&lt;/p&gt;&amp;mdash; Tim Sherratt (@wragge) &lt;a href=&#34;https://twitter.com/wragge/status/1431853486564536320?ref_src=twsrc%5Etfw&#34;&gt;August 29, 2021&lt;/a&gt;&lt;/blockquote&gt;
</description>
    </item>
    
    <item>
      <title>Government publications in Trove</title>
      <link>https://updates.timsherratt.org/2021/08/30/government-publications-in.html</link>
      <pubDate>Mon, 30 Aug 2021 12:21:46 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/30/government-publications-in.html</guid>
      <description>&lt;p&gt;Over the last few weeks I’ve been updating my harvests of OCRd text from digitised &lt;a href=&#34;https://updates.timsherratt.org/2021/08/16/explore-troves-digitised.html&#34;&gt;books&lt;/a&gt; and &lt;a href=&#34;https://updates.timsherratt.org/2021/08/06/updated-lots-and.html&#34;&gt;periodicals&lt;/a&gt; in Trove. As part of the harvesting process, I’ve created lists of both that are available in digital form – this includes digitised works, as well as those that are born-digital (such as PDFs or epubs). I’ve published the full lists of &lt;a href=&#34;https://trove-digital-books.glitch.me/data/trove-digital-books&#34;&gt;books&lt;/a&gt; and &lt;a href=&#34;https://trove-digital-periodicals.glitch.me/data/trove-digital-journals&#34;&gt;periodicals&lt;/a&gt; as searchable databases to make them easy to explore.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/51c3edce9a.png&#34; alt=&#34;&#34; title=&#34;Screenshot of database of Trove&#39;s government publications.&#34;&gt;&lt;/p&gt;
&lt;p&gt;One thing that you might notice is that works with the format ‘Government publication’ pop up in both lists – sometimes it’s not clear whether something is a ‘book’ or ‘periodical’. To make it easier to find these items, no matter what their format, I’ve combined data from my two harvests and created a &lt;a href=&#34;https://trove-government-publications.glitch.me/data/trove-government-publications&#34;&gt;searchable dataset of government publications&lt;/a&gt;. It includes links to download OCRd text from CloudStor if available.&lt;/p&gt;
&lt;p&gt;All three databases make use of Datasette, which I’ve also used for the &lt;a href=&#34;https://updates.timsherratt.org/2021/08/23/a-family-history.html&#34;&gt;GLAM Name Index Search&lt;/a&gt;. One of the cool things about &lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; is that it provides its own API, so if you find something interesting in any of these databases, you can easily download the machine-readable data for further analysis. #dhhacks&lt;/p&gt;
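&lt;p&gt;Every Datasette table also has a JSON endpoint, so getting machine-readable rows is trivial. A quick sketch (the .json extension and the _shape/_search parameters are standard Datasette, but the search term is just an example, and _search only works where full-text search is enabled):&lt;/p&gt;

```python
# A sketch of pulling machine-readable rows from one of the Datasette
# databases mentioned above. The .json extension and _shape/_search
# parameters are standard Datasette; the example search term is arbitrary.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def table_json_url(table_url, search=None):
    """Turn a Datasette table URL into its JSON API equivalent."""
    params = {"_shape": "array"}  # return a bare list of row objects
    if search:
        params["_search"] = search  # needs full-text search enabled
    return table_url + ".json?" + urlencode(params)

# Example (needs network access):
# url = table_json_url(
#     "https://trove-government-publications.glitch.me/data/trove-government-publications",
#     search="census",
# )
# rows = json.load(urlopen(url))
# print(len(rows), "matching publications")
```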
</description>
    </item>
    
    <item>
      <title>GLAM Workbench – a platform for digital HASS research</title>
      <link>https://updates.timsherratt.org/2021/08/26/glam-workbench-a.html</link>
      <pubDate>Thu, 26 Aug 2021 17:31:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/26/glam-workbench-a.html</guid>
      <description>&lt;p&gt;We’re in the midst of planning for the &lt;a href=&#34;https://ardc.edu.au/collaborations/strategic-activities/hass-and-indigenous-research-data-commons/&#34;&gt;HASS Research Data Commons&lt;/a&gt;, which will deliver some much-needed investment in digital research infrastructure for the humanities and social sciences. Amongst the funded programs are tools for text analysis as part of the Linguistics Data Commons, and a platform for more advanced research using Trove. I’m hoping that this will be an opportunity to take stock of existing tools and resources, and build flexible pathways for researchers that enable them to collect, move, analyse, preserve, and share data across different platforms and services.&lt;/p&gt;
&lt;p&gt;To this end, I thought it might be useful to try and summarise what the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; offers, particularly for &lt;a href=&#34;https://trove.nla.gov.au/&#34;&gt;Trove&lt;/a&gt; researchers. The GLAM Workbench doesn’t really have an institutional home, and is mostly unfunded – it’s my passion project. That means that it’s easy to overlook, particularly when the big grants are being doled out. But I think it has a lot to offer and I’m looking forward to exploring ways it can connect with these new initiatives.&lt;/p&gt;
&lt;h2 id=&#34;getting-and-moving-data&#34;&gt;Getting and moving data&lt;/h2&gt;
&lt;p&gt;There’s lots of fabulous data in Trove and other GLAM collections. In fact, there’s so much data that it can be difficult for researchers to find and collect what’s relevant to their interests. There are many tools in the GLAM Workbench to help researchers assemble their own datasets. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Get newspaper articles in bulk with the Trove Newspaper and Gazette Harvester&lt;/a&gt;&lt;/strong&gt; – This has been around in some form for more than ten years (it pre-dates the Trove API!). Give it the url of a search in Trove’s newspapers and gazettes and the harvester will save all the metadata in a CSV file, and optionally download the complete articles as OCRd text, images, or PDFs. The amount of data you harvest is really only limited by your patience and disk space. I’ve harvested more than a million articles in the past. The GLAM Workbench includes a web app version of the harvester that runs live in the cloud – just paste in your Trove API key and the search url, and click the button.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Get Trove newspaper pages as images&lt;/strong&gt; – If you need a nice, high-resolution version of a newspaper page you can &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#download-a-page-image&#34;&gt;use this web app&lt;/a&gt;. If you want to harvest every front page (or some other particular page) here’s an example that &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#harvest-australian-womens-weekly-covers-or-the-front-pages-of-any-newspaper&#34;&gt;gets all the covers of the &lt;em&gt;Australian Women’s Weekly&lt;/em&gt;&lt;/a&gt;. A pre-harvested &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#australian-womens-weekly-front-covers-1933-to-1982&#34;&gt;collection of the AWW covers&lt;/a&gt; is included as a bonus extra.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#save-a-trove-newspaper-article-as-an-image&#34;&gt;Get Trove newspaper articles as images&lt;/a&gt;&lt;/strong&gt; – The Trove web interface makes it difficult to download complete images of articles, but this tool will do the job. There’s a handy web app to grab individual images, but the code from this tool is reused in other places such as the Trove Newspaper Harvester and the Omeka uploader, and could be built-in to your own research workflows.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#upload-trove-newspaper-articles-to-omeka-s&#34;&gt;Upload Trove newspaper articles to Omeka&lt;/a&gt;&lt;/strong&gt; – Whether you’re creating on online exhibition or building a research database, Omeka can be very useful. This notebook connects Trove’s newspapers to Omeka for easy upload. Your selected articles can come from a search query, a Trove list, a Zotero library, or just a list of article ids. Metadata records are created in Omeka for each article and newspaper, and an image of each article is attached.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/#get-ocrd-text-from-a-digitised-journal-in-trove&#34;&gt;Get OCRd text from digitised periodicals in Trove&lt;/a&gt;&lt;/strong&gt; – They’re often overshadowed by the newspapers, but there’s now lots of digitised journals, magazines, and parliamentary papers in Trove. You can get article-level data from the API, but not issue data. This notebook enables researchers to get metadata and OCRd text from every available issue of a periodical. To make researchers’ lives even easier, I regularly harvest &lt;a href=&#34;https://glam-workbench.net/trove-journals/#ocrd-text-from-trove-digitised-journals&#34;&gt;&lt;strong&gt;all&lt;/strong&gt; the available OCRd text&lt;/a&gt; from digitised periodicals in Trove. The latest harvest downloaded 51,928 issues from 1,163 periodicals – that’s about 10gb of text. You can &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-journals/blob/master/digital-journals-with-text.md&#34;&gt;browse the list of periodicals&lt;/a&gt; with OCRd text, or &lt;a href=&#34;https://trove-digital-periodicals.glitch.me/data/trove-digital-journals&#34;&gt;search this database&lt;/a&gt;. All the OCRd text is stored in a public repository on CloudStor.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/#get-covers-or-any-other-pages-from-a-digitised-journal-in-trove&#34;&gt;Get page images from digitised periodicals in Trove&lt;/a&gt;&lt;/strong&gt; – There’s more than text in digitised periodicals, and you might want to download images of pages for visual analysis. This notebook shows you how to get cover images, but could be easily modified to get another page, or a PDF. I used a modified version of this to create &lt;a href=&#34;https://glam-workbench.net/trove-journals/#editorial-cartoons-from-the-bulletin-1886-to-1952&#34;&gt;a collection of 3,471 full page editorial cartoons&lt;/a&gt; from &lt;em&gt;The Bulletin&lt;/em&gt;, 1886 to 1952 – all available to download from CloudStor.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-books/#harvesting-the-text-of-digitised-books-and-ephemera&#34;&gt;Get OCRd text from digitised books in Trove&lt;/a&gt;&lt;/strong&gt; – Yep, there’s digitised books as well as newspapers and periodicals. You can download OCRd text from an individual book using the Trove web interface, but how do you make a collection of books without all that pointing and clicking? This notebook downloads all the available OCRd text from digitised books in Trove. The latest harvest includes &lt;a href=&#34;https://glam-workbench.net/trove-books/#ocrd-text-from-trove-books-and-ephemera&#34;&gt;text from 26,762 works&lt;/a&gt;. You can explore the results &lt;a href=&#34;https://trove-digital-books.glitch.me/data/trove-digital-books&#34;&gt;using this database&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-journals/#harvest-parliament-press-releases-from-trove&#34;&gt;Harvest parliamentary press releases from Trove&lt;/a&gt;&lt;/strong&gt; – Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. This notebook shows you how to harvest both metadata and fulltext from a search of the parliamentary press releases. For example, here’s a collection of &lt;a href=&#34;https://glam-workbench.net/trove-journals/#politicians-talking-about-immigrants-and-refugees&#34;&gt;politicians talking about ‘refugees’&lt;/a&gt;, and another &lt;a href=&#34;https://glam-workbench.net/trove-journals/#politicians-talking-about-covid&#34;&gt;relating to COVID-19&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/#harvest-abc-radio-national-records-from-trove&#34;&gt;Harvest details of Radio National programs from Trove&lt;/a&gt;&lt;/strong&gt; – Trove creates records for programs broadcast on ABC Radio National; for the major current affairs programs, these records are at segment level. Even though they don’t provide full transcripts, this data provides a rich, fine-grained record of Australia’s recent political, social, and economic history. This notebook shows you how to download the Radio National data. If you just want to dive straight in, there’s also a &lt;a href=&#34;https://glam-workbench.net/trove-music/#abc-radio-national-programs&#34;&gt;pre-harvested collection&lt;/a&gt; containing more than 400,000 records, with separate downloads for some of the main programs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#find-all-the-archived-versions-of-a-web-page&#34;&gt;Find all the versions of an archived web page in Trove&lt;/a&gt;&lt;/strong&gt; – Many of the tools in the &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web Archives section&lt;/a&gt; of the GLAM Workbench will work with the Australian Web Archive, which is part of Trove. This notebook shows you how to get data about the number of times a web page has been archived over time.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#harvesting-collections-of-text-from-archived-web-pages&#34;&gt;Harvesting collections of text from archived web pages in Trove&lt;/a&gt;&lt;/strong&gt; – If you want to explore how the content of a web page changes over time, you can use this notebook to capture the text content of every archived version of a web page.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-lists/#convert-a-trove-list-into-a-csv-file&#34;&gt;Convert a Trove list into a CSV file&lt;/a&gt;&lt;/strong&gt; – While Trove provides a data download option for lists, it leaves out a lot of useful data. This notebook downloads full details of newspaper articles and other works in a list and saves them as CSV files. Like the Trove Newspaper Harvester, it lets you download OCRd text and images from newspaper articles.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Collecting information about Trove user activity&lt;/strong&gt; – It’s not just the content of Trove that provides interesting research data, it’s also the way people engage with it. Using the Trove API it’s possible to harvest details of &lt;a href=&#34;https://glam-workbench.net/trove-lists/&#34;&gt;all user created lists and tags&lt;/a&gt;. And yes, there’s pre-harvested collections of &lt;a href=&#34;https://glam-workbench.net/trove-lists/#trove-lists-metadata&#34;&gt;lists&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/trove-lists/#trove-public-tags&#34;&gt;tags&lt;/a&gt; for the impatient.&lt;/li&gt;
&lt;/ul&gt;
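&lt;p&gt;Most of the harvesting tools above follow the same underlying pattern: build a request to the Trove API and page through the JSON results. Here’s a minimal sketch of that first step – the &lt;code&gt;build_search_url&lt;/code&gt; helper and the placeholder API key are illustrative only, not part of the Workbench code:&lt;/p&gt;

```python
# Sketch of the common first step: building a Trove API v2 search request.
# "YOUR_API_KEY" is a placeholder -- you need to register for a real key.
from urllib.parse import urlencode

API_BASE = "https://api.trove.nla.gov.au/v2/result"

def build_search_url(query, api_key, zone="newspaper", page_size=20):
    """Return a Trove search URL that asks for JSON results."""
    params = {
        "q": query,          # search query
        "zone": zone,        # e.g. newspaper, book, picture
        "encode": "json",    # the API returns XML by default
        "n": page_size,      # results per request
        "key": api_key,
    }
    return f"{API_BASE}?{urlencode(params)}"

print(build_search_url("weather wragge", "YOUR_API_KEY"))
```

&lt;p&gt;The notebooks wrap this basic request in code that handles paging, rate limits, and saving the results.&lt;/p&gt;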
&lt;p&gt;While I’m focusing here on Trove, there’s also tools to create datasets from the &lt;a href=&#34;https://glam-workbench.net/recordsearch/&#34;&gt;National Archives of Australia&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/digitalnz/&#34;&gt;Digital NZ and Papers Past&lt;/a&gt;, the &lt;a href=&#34;https://glam-workbench.net/nma/&#34;&gt;National Museum of Australia&lt;/a&gt; and more. And there’s a &lt;a href=&#34;https://glam-workbench.net/glam-data-list/&#34;&gt;big list of readily downloadable datasets&lt;/a&gt; from Australian GLAM organisations.&lt;/p&gt;
&lt;h2 id=&#34;visualisation-and-analysis&#34;&gt;Visualisation and analysis&lt;/h2&gt;
&lt;p&gt;Many of the notebooks listed above include examples that demonstrate ways of exploring and analysing your harvested data. There are also a number of companion notebooks that examine some possibilities in more detail, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/#exploring-your-troveharvester-data&#34;&gt;Explore your Trove newspaper harvests&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-harvester/#display-the-results-of-a-harvest-as-a-searchable-database-using-datasette&#34;&gt;Load your Trove newspaper harvest in Datasette&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-music/#exploring-abc-radio-national-metadata&#34;&gt;Exploring ABC Radio National metadata&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-lists/#analyse-public-tags-added-to-trove&#34;&gt;Analyse public tags added to Trove&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But there are also many other notebooks that demonstrate methods for analysing Trove’s content, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#querypic&#34;&gt;QueryPic&lt;/a&gt;&lt;/strong&gt; – Another tool that’s been around in different forms for a decade, QueryPic visualises searches in Trove’s newspapers. The latest web app couldn’t be simpler: just paste in your API key and a search url and create charts showing the number of matching articles over time. You can combine queries, change time scales, and download the data and visualisations.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#visualise-trove-newspaper-searches-over-time&#34;&gt;Visualise Trove newspaper searches over time&lt;/a&gt;&lt;/strong&gt; – This is like a deconstructed version of QueryPic that walks you through the process of using Trove’s facets to assemble a dataset of results over time. It provides a lot of detail on the sorts of data available, and the questions we can ask of it.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#visualise-the-total-number-of-newspaper-articles-in-trove-by-year-and-state&#34;&gt;Visualise the total number of newspaper articles in Trove by year and state&lt;/a&gt;&lt;/strong&gt; – This notebook uses a modified version of the code above to analyse the construction and context of Trove’s newspaper corpus itself. What are you actually searching? Meet the WWI effect and the copyright cliff of death! This is a great place to start if you want to get people thinking critically about how digital resources are constructed.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#analyse-rates-of-ocr-correction&#34;&gt;Analyse rates of OCR correction&lt;/a&gt;&lt;/strong&gt; – Some more meta-analysis of the Trove corpus itself, this time focusing on patterns of OCR correction by Trove users.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#finding-non-english-newspapers-in-trove&#34;&gt;Identifying non-English language newspapers in Trove&lt;/a&gt;&lt;/strong&gt; – There are a growing number of non-English language newspapers digitised in Trove. However, if you&amp;rsquo;re only searching using English keywords, you might never know that they&amp;rsquo;re there. This notebook analyses a sample of articles from every newspaper in Trove to identify non-English content.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#beyond-the-copyright-cliff-of-death&#34;&gt;Beyond the copyright cliff of death&lt;/a&gt;&lt;/strong&gt; – Most of the newspaper articles on Trove were published before 1955, but there are some from the later period. This notebook helps you find out how many, and which newspapers they were published in.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#map-trove-newspaper-results-by-state&#34;&gt;Map Trove newspaper results by state&lt;/a&gt;&lt;/strong&gt; – This notebook uses the Trove &lt;code&gt;state&lt;/code&gt; facet to create a choropleth map that visualises the number of search results per state.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#map-trove-newspaper-results-by-place-of-publication&#34;&gt;Map Trove newspaper results by place of publication&lt;/a&gt;&lt;/strong&gt; – This notebook uses the Trove &lt;code&gt;title&lt;/code&gt; facet to find the number of results per newspaper, then merges the results with a dataset of geolocated newspapers to map where articles were published.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#compare-two-versions-of-an-archived-web-page&#34;&gt;Compare two versions of an archived web page&lt;/a&gt;&lt;/strong&gt; – This notebook demonstrates a number of different ways of comparing versions of archived web pages. Just choose a repository, enter a url, and select two dates to see comparisons based on: page metadata, basic statistics such as file size and number of words, numbers of internal and external links, cosine similarity of text, line by line differences in text or code, and screenshots.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#display-changes-in-the-text-of-an-archived-web-page-over-time&#34;&gt;Display changes in the text of an archived web page over time&lt;/a&gt;&lt;/strong&gt; – This web app gathers all the available versions of a web page and then visualises changes in its content between versions – what’s been added, removed, and changed?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/web-archives/#using-screenshots-to-visualise-change-in-a-page-over-time&#34;&gt;Use screenshots to visualise change in a page over time&lt;/a&gt;&lt;/strong&gt; – Create a series of full page screenshots of a web page over time, then assemble them into a time series.&lt;/li&gt;
&lt;/ul&gt;
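&lt;p&gt;Under the hood, tools like QueryPic reshape the API’s &lt;code&gt;year&lt;/code&gt; facet data into a time series of article counts. As a minimal sketch, using a simplified stand-in for the facet JSON (real API responses nest these terms more deeply, so treat the structure here as an assumption):&lt;/p&gt;

```python
# Turn a list of Trove year-facet terms into a sorted time series.
# `sample_facets` is a simplified stand-in for the nested JSON the
# Trove API actually returns -- field names may differ in real responses.

def facet_terms_to_series(terms):
    """Convert facet terms to (year, count) pairs sorted by year."""
    series = [(int(t["display"]), int(t["count"])) for t in terms]
    return sorted(series)

sample_facets = [
    {"count": "120", "display": "1915"},
    {"count": "85", "display": "1914"},
    {"count": "210", "display": "1916"},
]

print(facet_terms_to_series(sample_facets))
# -> [(1914, 85), (1915, 120), (1916, 210)]
```

&lt;p&gt;Once the counts are in this shape, charting them with a library like Altair or Plotly is straightforward.&lt;/p&gt;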
&lt;p&gt;There are also possibilities for using Trove data creatively. For example you can &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#create-scissors-and-paste-messages-from-trove-newspaper-articles&#34;&gt;create &amp;lsquo;scissors and paste&amp;rsquo; messages from Trove newspaper articles&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;documentation-and-examples&#34;&gt;Documentation and examples&lt;/h2&gt;
&lt;p&gt;All the Trove notebooks in the GLAM Workbench help document the possibilities and limits of the Trove API. The examples above can be modified and reworked to suit different research interests. Some notebooks also explore particular aspects of the API, for example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove/&#34;&gt;Trove API Introduction&lt;/a&gt;&lt;/strong&gt; – Some very basic examples of making requests and understanding results.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#todays-news-yesterday&#34;&gt;Today’s news yesterday&lt;/a&gt;&lt;/strong&gt; – Uses the &lt;code&gt;date&lt;/code&gt; index and the &lt;code&gt;firstpageseq&lt;/code&gt; parameter to find articles from exactly 100 years ago that were published on the front page. It then selects one of the articles at random and downloads and displays an image of the front page.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-images/#the-use-of-standard-licences-and-rights-statements-in-trove-image-records&#34;&gt;The use of standard licences and rights statements in Trove image records&lt;/a&gt;&lt;/strong&gt; – Version 2.1 of the Trove API introduced a new rights index that you can use to limit your search results to records that include one of a list of standard licences and rights statements. We can also use this index to build a picture of which rights statements are currently being used, and by whom.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/trove-random/&#34;&gt;Random items from Trove&lt;/a&gt;&lt;/strong&gt; – Changes to the Trove API meant that techniques you could previously use to select resources at random no longer work. This section documents some alternative ways of retrieving random-ish works and newspaper articles from Trove.&lt;/li&gt;
&lt;/ul&gt;
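&lt;p&gt;To give a flavour of how a notebook like ‘Today’s news yesterday’ assembles its request, here’s a rough sketch of building the search parameters. The exact &lt;code&gt;date&lt;/code&gt; range syntax and the use of &lt;code&gt;firstpageseq&lt;/code&gt; are assumptions to check against the Trove API documentation, and the API key is a placeholder:&lt;/p&gt;

```python
# Build search parameters for front-page articles published exactly
# 100 years ago. The date-range string and the firstpageseq parameter
# follow the pattern described above -- verify both against the
# Trove API documentation before relying on them.
from datetime import date
from urllib.parse import urlencode

def century_ago_params(today, api_key):
    """Return query parameters targeting one day, 100 years before `today`."""
    then = today.replace(year=today.year - 100)
    stamp = then.strftime("%Y-%m-%dT00:00:00Z")
    return {
        "q": f"date:[{stamp} TO {stamp}]",
        "zone": "newspaper",
        "firstpageseq": "1",  # limit to articles on page one
        "encode": "json",
        "key": api_key,
    }

print(urlencode(century_ago_params(date(2021, 8, 6), "YOUR_API_KEY")))
```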
&lt;p&gt;And while it’s not officially part of the GLAM Workbench, I also maintain the &lt;a href=&#34;https://troveconsole.herokuapp.com/&#34;&gt;Trove API Console&lt;/a&gt; which provides lots of examples of the API in action.&lt;/p&gt;
&lt;h2 id=&#34;pathways&#34;&gt;Pathways&lt;/h2&gt;
&lt;p&gt;In developing the GLAM Workbench I’m very aware that people will arrive with different levels of digital skill, confidence, and experience. That’s why I’ve been putting a lot of thought and effort into ways of providing a range of entry points.&lt;/p&gt;
&lt;p&gt;Someone who might not identify as a ‘digital’ researcher can, with a single click, start up QueryPic and start exploring changes over time in Trove’s newspapers. This is possible because the GLAM Workbench is configured to &lt;a href=&#34;https://glam-workbench.net/using-binder/&#34;&gt;make use of Binder&lt;/a&gt;, a service that spins up customised computing environments as needed.&lt;/p&gt;
&lt;p&gt;Another researcher might start running the Trove Newspaper Harvester using Binder, but find that they want to run bigger and longer harvests. In that case, the GLAM Workbench offers a one-click installation of the Trove Newspaper Harvester on &lt;a href=&#34;https://glam-workbench.net/using-reclaim-cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt;. Unlike Binder, Reclaim Cloud environments are persistent, so you can run the harvester for as long as you want without the worry of interruptions.&lt;/p&gt;
&lt;p&gt;Yet another researcher might want to understand how the Trove API works and the sorts of data that it makes available. By exploring the various notebooks they’ll find useful snippets of code they can try out in their own projects.&lt;/p&gt;
&lt;p&gt;The GLAM Workbench connects outwards to make use of a range of other services – the notebooks run in Binder, Reclaim Cloud, and Docker; the code is all openly licensed and publicly available through GitHub and Zenodo; data is hosted on GitHub, CloudStor, and Zenodo; datasets can be explored using Datasette running on Glitch or Google CloudRun. I’m hoping that the new investments in HASS research infrastructure will embed a similar philosophy, connecting up existing services rather than starting from scratch.&lt;/p&gt;
&lt;h2 id=&#34;the-future&#34;&gt;The future&lt;/h2&gt;
&lt;p&gt;This is just an outline of what the GLAM Workbench currently offers researchers wanting to make use of the data available from Trove. It&amp;rsquo;s all there now, publicly accessible, openly licensed, and ready to use – take it, use it, change it, share it. But there&amp;rsquo;s &lt;strong&gt;much more&lt;/strong&gt; I&amp;rsquo;d like to do, both in regard to Trove and to encourage use of GLAM data more generally. I&amp;rsquo;m also interested in your ideas for new tools, examples, or data sources – what would help your research? You can &lt;a href=&#34;https://glam-workbench.net/suggest-a-topic/&#34;&gt;add a suggestion&lt;/a&gt; in GitHub, or post a comment in the &lt;a href=&#34;https://ozglam.chat/c/glam-workbench/8&#34;&gt;GLAM Workbench channel&lt;/a&gt; of OzGLAM Help.&lt;/p&gt;
&lt;p&gt;See the &lt;a href=&#34;https://glam-workbench.net/getting-started/&#34;&gt;Getting Started section&lt;/a&gt; of the GLAM Workbench for more hints and examples. And keep an eye on the &lt;a href=&#34;https://updates.timsherratt.org/categories/glamworkbench/&#34;&gt;news feed&lt;/a&gt; for the latest additions and updates.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A Family History Month experiment – search millions of name records from GLAM organisations</title>
      <link>https://updates.timsherratt.org/2021/08/23/a-family-history.html</link>
      <pubDate>Mon, 23 Aug 2021 11:05:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/23/a-family-history.html</guid>
      <description>&lt;p&gt;There’s a lot of rich historical data contained within the indexes that Australian GLAM organisations provide to help people navigate their records. These indexes, often created by volunteers, allow access by key fields such as name, date or location. They aid discovery, but also allow new forms of analysis and visualisation. Kate Bagnall and I wrote about some of the possibilities, and the difficulties, in this &lt;a href=&#34;http://doi.org/10.1353/jwh.2021.0025&#34;&gt;recently published article&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Many of these indexes can be downloaded from government data portals. The GLAM Workbench demonstrates &lt;a href=&#34;https://glam-workbench.net/glam-data-portals/&#34;&gt;how these can be harvested&lt;/a&gt;, and provides a &lt;a href=&#34;https://glam-workbench.net/glam-datasets-from-gov-portals/&#34;&gt;list of available datasets&lt;/a&gt; to browse. But what’s inside them? The &lt;a href=&#34;https://glam-workbench.net/csv-explorer/&#34;&gt;GLAM CSV Explorer&lt;/a&gt; visualises the contents of the indexes to give you a sneak peek and encourage you to dig deeper.&lt;/p&gt;
&lt;p&gt;There’s even more indexes available from the NSW State Archives. Most of these aren’t accessible through the NSW government data portal yet, but I managed to &lt;a href=&#34;https://glam-workbench.net/nsw-state-archives/&#34;&gt;scrape them from the website&lt;/a&gt; a couple of years ago and made them &lt;a href=&#34;https://glam-workbench.net/nsw-state-archives/#nsw-state-archives-online-indexes&#34;&gt;available as CSVs&lt;/a&gt; for easy download.&lt;/p&gt;
&lt;p&gt;It’s &lt;a href=&#34;https://familyhistorymonth.org.au/&#34;&gt;Family History Month&lt;/a&gt; at the moment, and the other night I thought of an interesting little experiment using the indexes. I’ve been playing round with &lt;a href=&#34;https://datasette.io/&#34;&gt;Datasette&lt;/a&gt; lately. It’s a fabulous tool for exploring tabular data, like CSVs. I also noticed that Datasette’s creator Simon Willison had added a &lt;a href=&#34;https://simonwillison.net/2020/Mar/9/datasette-search-all/&#34;&gt;search-all plugin&lt;/a&gt; that enabled you to run a full text search across multiple databases and tables. Hmmm, I wondered, would it be possible to use Datasette to provide a way of searching for names across &lt;strong&gt;all&lt;/strong&gt; those GLAM indexes?&lt;/p&gt;
&lt;p&gt;After a few nights’ work, I found the answer was &lt;strong&gt;yes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;Try out my new aggregated search interface here!&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;(The cloud service it uses runs on demand, so if it has gone to sleep, it might take a little while to wake up again – just be patient for a few seconds.)&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/bf9da8208a.png&#34; alt=&#34;&#34; title=&#34;Screenshot of GLAM Name index search&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Currently, the &lt;a href=&#34;https://glam-workbench.net/name-search/&#34;&gt;GLAM Name Search&lt;/a&gt; interface lets you search for names across 195 indexes from eight GLAM organisations. All together, there’s a total of more than &lt;strong&gt;9.2 million rows of data&lt;/strong&gt; to explore!&lt;/p&gt;
&lt;p&gt;It’s simple to use – just enter a name in the search box and Datasette will search each index in turn, displaying the first five matching results. You can click through to view all results from a specific index. Not surprisingly, the aggregated name search only searches columns containing names. However, once you click through to an individual table, you can apply additional filters or facets.&lt;/p&gt;
&lt;p&gt;To create the aggregated search interface I worked through the &lt;a href=&#34;https://github.com/GLAM-Workbench/ozglam-data/blob/master/glam-datasets-from-gov-portals-csvs.csv&#34;&gt;list of CSVs&lt;/a&gt; I’d harvested from government data portals to identify those that contained names of people, and discard those that contained administrative, rather than historical data. I also made a note of the columns that contained the names so I could index their contents once they’d been added to the database. Usually these were fields such as &lt;code&gt;Surname&lt;/code&gt; or &lt;code&gt;Given names&lt;/code&gt;, but sometimes names were in the record title or notes.&lt;/p&gt;
&lt;p&gt;Datasette uses SQLite databases to store its data. I decided to create one database for each GLAM organisation. I wrote some code to work through my list of datasets, saving them into an SQLite database, indexing the name columns, and writing information about the dataset to a &lt;code&gt;metadata.json&lt;/code&gt; file. This file is used by Datasette to display information such as the title, source, licence, and last modified date of each of the indexes.&lt;/p&gt;
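&lt;p&gt;The indexing step might look something like this rough sketch using Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The table and column names are made up for illustration, and the real code also handles the CSV parsing and writes the &lt;code&gt;metadata.json&lt;/code&gt; entries:&lt;/p&gt;

```python
# Minimal sketch of the indexing step: load name rows into SQLite and
# build a full-text index over just the name columns, as described above.
# The table and column names here are illustrative, not the real schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE convicts (surname TEXT, given_names TEXT, year INTEGER)")
rows = [
    ("Wragge", "Clement", 1887),
    ("Smith", "Jane", 1852),
]
conn.executemany("INSERT INTO convicts VALUES (?, ?, ?)", rows)

# An FTS5 virtual table indexing only the name columns, backed by
# the content table so the data isn't duplicated.
conn.execute(
    "CREATE VIRTUAL TABLE convicts_fts USING fts5(surname, given_names, content='convicts')"
)
conn.execute("INSERT INTO convicts_fts(convicts_fts) VALUES ('rebuild')")

matches = conn.execute(
    "SELECT surname, given_names FROM convicts_fts WHERE convicts_fts MATCH 'wragge'"
).fetchall()
print(matches)  # -> [('Wragge', 'Clement')]
```

&lt;p&gt;In practice a tool like &lt;code&gt;sqlite-utils&lt;/code&gt; (from the Datasette ecosystem) takes care of most of this plumbing for you.&lt;/p&gt;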
&lt;p&gt;Once that was done, I could fire up Datasette and feed it all the SQLite databases. Amazingly it all worked – searching across all the indexes was remarkably quick! To make it publicly available I used the Datasette &lt;a href=&#34;https://docs.datasette.io/en/stable/publish.html&#34;&gt;&lt;code&gt;publish&lt;/code&gt;&lt;/a&gt; command to push everything to Google CloudRun (about 1.4 gb of data). The first time I used CloudRun it took some time to get the authentication and other settings working properly. This time was much smoother. Before long it was live!&lt;/p&gt;
&lt;p&gt;Once I knew it all worked, I decided to add in another 59 indexes from the NSW State Archives. I also plugged in a few extra indexes from the Public Record Office of Victoria. These datasets are stored as ZIP files in the Victorian government data portal, so it took a little bit of extra manual processing to get everything sorted. But finally I had all &lt;strong&gt;195 indexes&lt;/strong&gt; loaded.&lt;/p&gt;
&lt;p&gt;What now? That depends on whether people find this experiment useful. I have a few ideas for improvements. But if people do use it, then the costs will go up. I’m going to have to monitor this over the next couple of months to see if I can afford to keep it going. If you want to help with the running costs, you might like to sign up as a &lt;a href=&#34;https://github.com/sponsors/wragge?o=esb&#34;&gt;GitHub sponsor&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;And please let me know if you think it’s worth developing!&lt;/strong&gt; #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Explore Trove’s digitised books</title>
      <link>https://updates.timsherratt.org/2021/08/16/explore-troves-digitised.html</link>
      <pubDate>Mon, 16 Aug 2021 16:40:15 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/16/explore-troves-digitised.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/trove-books/&#34;&gt;Trove books section of the GLAM Workbench &lt;/a&gt;has been updated! There’s freshly-harvested data, as well as updated Python packages, integration with Reclaim Cloud, and automated Docker builds.&lt;/p&gt;
&lt;p&gt;Included is &lt;a href=&#34;https://glam-workbench.net/trove-books/#harvesting-the-text-of-digitised-books-and-ephemera&#34;&gt;a notebook to harvest details of all books&lt;/a&gt; available from Trove in digital form. This includes both digitised books, that have been scanned and OCRd, as well as born digital publications, such as PDFs and epubs. The definition of ‘books’ is pretty loose – I’ve harvested details of anything that has been assigned the format ‘Book’ in Trove, but this includes &lt;a href=&#34;https://updates.timsherratt.org/2021/08/13/a-miscellany-of.html&#34;&gt;ephemera, such as posters, pamphlets, and advertising&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the latest harvest, I ended up with details of 42,174 ‘books’. This includes some duplicates, because multiple metadata entries can point to the same digital object. I thought it was best to preserve the duplicates, rather than discard the metadata.&lt;/p&gt;
&lt;p&gt;Once I’d harvested the details of the books, I tried to see if there was any OCRd text available for download. If there was, I saved it to a &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL&#34;&gt;public folder on CloudStor&lt;/a&gt;. In total, I was able to download 26,762 files of OCRd text.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/7b5774a071.png&#34; alt=&#34;Screenshot of database showing details of digital book&#34;&gt;&lt;/p&gt;
&lt;p&gt;The easiest way to explore the books is &lt;a href=&#34;https://trove-digital-books.glitch.me/data/trove-digital-books&#34;&gt;using this searchable database&lt;/a&gt;. It’s created using Datasette and is running on Glitch. Full text search is available on the ‘title’ and ‘contributors’ fields, and you can filter on things like date, copyright status, number of pages, and whether OCRd text is available for download. If there is OCRd text, a direct link to the file on CloudStor is included. You can use the database to filter the titles, creating your own dataset that you can download in CSV or JSON format.&lt;/p&gt;
&lt;p&gt;If you just want the full list of books as a CSV file, you can &lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/trove-books/blob/master/trove_digitised_books_with_ocr.csv&#34;&gt;download it here&lt;/a&gt;. And if you want &lt;em&gt;all&lt;/em&gt; the OCRd text, you can go straight to the &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL&#34;&gt;public folder on CloudStor&lt;/a&gt; – there’s about 3.6gb of text files to explore! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A miscellany of ephemera, oddities, &amp; estrays</title>
      <link>https://updates.timsherratt.org/2021/08/13/a-miscellany-of.html</link>
      <pubDate>Fri, 13 Aug 2021 12:02:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/13/a-miscellany-of.html</guid>
      <description>&lt;p&gt;I’m just in the midst of updating my harvest of OCRd text from Trove’s digitised books (more about that soon!). But amongst the items catalogued as ‘books’ are a wide assortment of ephemera, posters, advertisements, and other oddities. There’s no consistent way of identifying these items through the search interface, but because I’ve found the number of pages in each ‘book’ as part of the harvesting process, I can limit results to items with just a single digitised page – there’s more than 1,500! To make it easy to explore this collection of odds and ends, I’ve downloaded all the single page images and compiled them into &lt;a href=&#34;https://www.dropbox.com/s/xi84y12zz6iryfu/trove-ephemera.pdf?dl=0&#34;&gt;one big PDF&lt;/a&gt; with links back to their entries in Trove. Enjoy your browsing!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/1f185ad85f.png&#34; alt=&#34;&#34; title=&#34;Screenshot from PDF showing a &#39;To let&#39; poster&#34;&gt;&lt;/p&gt;
&lt;p&gt;This is another example of the ways in which we can extend and enrich existing collection interfaces using simple technologies like PDFs and CSVs. We can create slices across existing categories to expose interesting features, and provide new entry points for researchers. Some other examples in the GLAM Workbench are the collection of &lt;a href=&#34;https://glam-workbench.net/trove-journals/#editorial-cartoons-from-the-bulletin-1886-to-1952&#34;&gt;editorial cartoons from &lt;em&gt;The Bulletin&lt;/em&gt;&lt;/a&gt;, the list of Trove newspapers with &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#trove-newspapers-with-non-english-language-content&#34;&gt;non-English content&lt;/a&gt;, the harvest of &lt;a href=&#34;https://glam-workbench.net/trove-music/#abc-radio-national-programs&#34;&gt;ABC Radio National programs&lt;/a&gt;, and the recent collection of &lt;a href=&#34;https://glam-workbench.net/trove-journals/#politicians-talking-about-covid&#34;&gt;politicians talking about COVID&lt;/a&gt;. &lt;a href=&#34;https://glam-workbench.net/suggest-a-topic/&#34;&gt;Let me know&lt;/a&gt; if you have any ideas for additional slices! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Everyday heritage and the GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2021/08/09/everyday-heritage-and.html</link>
      <pubDate>Mon, 09 Aug 2021 12:26:38 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/09/everyday-heritage-and.html</guid>
<description>&lt;p&gt;Some good news on the funding front with the success of the &lt;a href=&#34;https://dataportal.arc.gov.au/NCGP/Web/Grant/Grant/LP200301446&#34;&gt;Everyday Heritage project&lt;/a&gt; in the latest round of ARC Linkage grants. The project aims to look beyond the formal discourses of ‘national’ heritage to develop a more diverse range of heritage narratives. Working at the intersection of place, digital collections, and material culture, team members will develop a series of ‘heritage biographies’ that document everyday experience and provide new models for the heritage sector.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/37dbda004e.png&#34; alt=&#34;Screen capture of project details in ARC grants database&#34;&gt;&lt;/p&gt;
&lt;p&gt;Digital methods will play a major role in the project. I’ll be leading the ‘Heritage Hacks’ work package that will support the creation of the heritage biographies and develop a range of new tools and tutorials for use in heritage management contexts. All the tools, methods, and data generated through the project will be documented using Jupyter notebooks and published through the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;. Watch this space!&lt;/p&gt;
&lt;p&gt;The project is led by &lt;a href=&#34;https://researchprofiles.canberra.edu.au/en/persons/tracy-ireland&#34;&gt;Tracy Ireland&lt;/a&gt; (University of Canberra), with &lt;a href=&#34;https://research-repository.uwa.edu.au/en/persons/jane-lydon&#34;&gt;Jane Lydon&lt;/a&gt; (UWA), &lt;a href=&#34;https://www.utas.edu.au/profiles/staff/humanities/kate-bagnall&#34;&gt;Kate Bagnall&lt;/a&gt; (UTAS), and me as chief investigators. Our industry partner is &lt;a href=&#34;https://www.gml.com.au/&#34;&gt;GML Heritage&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Recent GLAM Workbench presentations</title>
      <link>https://updates.timsherratt.org/2021/08/06/recent-glam-workbench.html</link>
      <pubDate>Fri, 06 Aug 2021 18:42:17 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/06/recent-glam-workbench.html</guid>
      <description>&lt;p&gt;So far this year I’ve given eight workshops or presentations relating to the GLAM Workbench, with probably a few more yet to come. Here’s the latest:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://slides.com/wragge/gcscr-2021&#34;&gt;Introducing the GLAM Workbench&lt;/a&gt;, presentation for the Griffith University Centre for Social and Cultural Research, Digital Humanities Seminar Series, 6 August 2021&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://youtu.be/pkC-seP00Kc&#34;&gt;Exploring the GLAM Workbench&lt;/a&gt; (&lt;a href=&#34;https://slides.com/wragge/uts-dh-2021&#34;&gt;slides&lt;/a&gt;), presentation for the UTS Digital Histories Seminar Series, 8 July 2021&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.5121188&#34;&gt;The GLAM Workbench: A Labs approach?&lt;/a&gt;, presentation for the panel &amp;lsquo;Research use of web archives: A labs approach&amp;rsquo;, at the IIPC Web Archiving Conference, 15 June 2021&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://slides.com/wragge/pha-workshop-may-2021&#34;&gt;Hands-on introduction to the GLAM Workbench&lt;/a&gt;, workshop for the Professional Historians Association of Victoria and Tasmania, 27 May 2021&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://doi.org/10.5281/zenodo.5121224&#34;&gt;Exploring collections through the GLAM Workbench&lt;/a&gt;, keynote presentation for the XVIII Congrés d&amp;rsquo;Arxvística i Gestió de Documents de Catalunya, 11 May 2021&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://slides.com/wragge/aha-ecr-workshop&#34;&gt;Quick hacks and DIY data: Innovations for the discerning historian&lt;/a&gt;, presentation for AHA ECR digital skills seminar, 5 March 2021&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can view &lt;a href=&#34;https://glam-workbench.net/presentations/&#34;&gt;all the GLAM Workbench presentations&lt;/a&gt; here.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Updated! Lots and lots of text freshly harvested from Trove periodicals</title>
      <link>https://updates.timsherratt.org/2021/08/06/updated-lots-and.html</link>
      <pubDate>Fri, 06 Aug 2021 10:49:54 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/06/updated-lots-and.html</guid>
      <description>&lt;p&gt;For a few years now I’ve been harvesting downloadable text from digitised periodicals in Trove and making it easily available for exploration and research. I’ve just completed the latest harvest – here’s the summary:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;1,163 digitised periodicals had text available for download&lt;/li&gt;
&lt;li&gt;Text was downloaded from 51,928 individual issues&lt;/li&gt;
&lt;li&gt;Adding up to a total of around 12gb of text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you want to dive straight in, here’s a &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-journals/blob/master/digital-journals-with-text.md&#34;&gt;list of all the harvested periodicals&lt;/a&gt;, with links to download a summary of available issues, as well as all the harvested text (there’s one file per issue). You’ll notice that the list includes a large number of parliamentary papers and government reports as well as published journals.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/008359faba.png&#34; alt=&#34;List of Trove periodicals with downloadable text&#34;&gt;&lt;/p&gt;
&lt;p&gt;All of the harvested text is available from a &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/QOmnqpGQCNCSC2h&#34;&gt;public folder on CloudStor&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The harvesting process involves a few different steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First I generate a list of periodicals available in digital form from Trove. This includes digitised titles, as well as born-digital titles submitted through e-Legal Deposit. This produces &lt;a href=&#34;https://glam-workbench.net/trove-journals/#csv-formatted-list-of-journals-available-from-trove-in-digital-form&#34;&gt;a CSV file&lt;/a&gt; containing the details of 7,270 titles. See &lt;a href=&#34;https://glam-workbench.net/trove-journals/#create-a-list-of-troves-digitised-journals&#34;&gt;this notebook&lt;/a&gt; for details.&lt;/li&gt;
&lt;li&gt;Then I work through this list of titles to find out how many issues of each title are available through Trove. This information isn’t accessible through the API, so I have to do some screen scraping.&lt;/li&gt;
&lt;li&gt;Next I work through the list of issues and try to download the text contents. Most of the born-digital titles don’t have downloadable text.&lt;/li&gt;
&lt;li&gt;Once I’ve downloaded all the text I can from a title, I create a CSV file for it that lists the available issues and notes whether text is available for each. This file is stored with the text on CloudStor.&lt;/li&gt;
&lt;li&gt;Once I’ve checked all the titles, I generate &lt;a href=&#34;https://glam-workbench.net/trove-journals/#csv-formatted-list-of-journals-with-ocrd-text&#34;&gt;another CSV file&lt;/a&gt; that lists the details of all the periodicals that have downloadable text.&lt;/li&gt;
&lt;li&gt;The code to harvest and document the downloaded text is &lt;a href=&#34;https://glam-workbench.net/trove-journals/#download-the-ocrd-text-for-all-the-digitised-journals-in-trove&#34;&gt;available in this notebook&lt;/a&gt;. #dhhacks&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>New dataset – Politicians talking about COVID</title>
      <link>https://updates.timsherratt.org/2021/08/02/new-dataset-politicians.html</link>
      <pubDate>Mon, 02 Aug 2021 11:23:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/08/02/new-dataset-politicians.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/trove-journals/&#34;&gt;Trove Journals&lt;/a&gt; section of the GLAM Workbench includes &lt;a href=&#34;https://glam-workbench.net/trove-journals/#harvest-parliament-press-releases-from-trove&#34;&gt;a notebook&lt;/a&gt; that helps you download press releases, speeches, and interview transcripts by Australian federal politicians. These documents are compiled and published by the Parliamentary Library, and the details are regularly harvested into Trove.&lt;/p&gt;
&lt;p&gt;Using this notebook, I’ve created a collection of documents that include the words ‘COVID’ or ‘Coronavirus’. It includes all the &lt;strong&gt;metadata&lt;/strong&gt; from Trove, as well as the &lt;strong&gt;full text&lt;/strong&gt; of each document downloaded from the Parliamentary Library. There are &lt;strong&gt;3,995 documents in total&lt;/strong&gt;, covering the period up until early April 2021. You can &lt;strong&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-journals/raw/6f4d805186716853b81c7d93cac6754685b384bf/press-releases/press-releases-coronavirus-or-covid.zip&#34;&gt;download them all as a zip file&lt;/a&gt;&lt;/strong&gt; (12 mb).&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/44a7f87b2c.png&#34; alt=&#34;&#34; title=&#34;Screenshot showing a sample of the harvested metadata&#34;&gt;&lt;/p&gt;
&lt;p&gt;While I was compiling this dataset, I also made a few improvements to the notebook. You can now filter the results to weed out false positives, and identify duplicates. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>8 million Trove tags to explore!</title>
      <link>https://updates.timsherratt.org/2021/07/14/million-trove-tags.html</link>
      <pubDate>Wed, 14 Jul 2021 18:06:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/07/14/million-trove-tags.html</guid>
      <description>&lt;p&gt;I’ve always been interested in the way people add value to resources in Trove. OCR correction tends to get all the attention, but Trove users have also been busy organising resources using tags, lists, and comments. I used to refer to tagging quite often in &lt;a href=&#34;http://discontents.com.au/myths-mega-projects-and-making/&#34;&gt;presentations&lt;/a&gt;, pointing to the different ways they were used. For example, ‘TBD’ is a workflow marker, used by text correctors to label articles that are ‘To Be Done’. My favourite was ‘LRRSA’, one of the most heavily-used tags across the whole of Trove. What does it mean? It stands for the Light Rail Research Society of Australia, and the tag is used by members to mark items of shared interest. It’s a great example of how something as simple as plain text tags can be used to support collaboration and build communities.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/f76b2e110b.png&#34; alt=&#34;Word cloud showing the top 200 Trove tags&#34;&gt;&lt;/p&gt;
&lt;p&gt;Until its update last year, Trove used to provide some basic stats about user activity. There was also a tag cloud that let you explore the most commonly-used tags. It’s now much harder to access this sort of information. However, you can extract some basic information about tags from the Trove API. First of all, you can filter a search using ‘has:tags’ to limit the results to items that have tags attached to them. Then to find out what the tags actually are, you can add the &lt;code&gt;include=tags&lt;/code&gt; parameter. This embeds the tags within the item record, so you can work through a set of results, extracting all the tags as you go. To save you the trouble, I’ve done this for the whole of Trove, and ended up with &lt;strong&gt;a dataset containing more than 8 million tags&lt;/strong&gt;!&lt;/p&gt;
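&lt;p&gt;In code, the approach looks something like the sketch below. The &lt;code&gt;YOUR_KEY&lt;/code&gt; placeholder and the exact shape of the embedded tag records are assumptions here – check the Trove API documentation for current details:&lt;/p&gt;

```python
# Parameters for a Trove API search that returns only tagged items,
# with the tags embedded in each item record.
params = {
    "q": "has:tags",     # limit results to items that have tags attached
    "zone": "newspaper",
    "include": "tags",   # embed the tags within each item record
    "encoding": "json",
    "key": "YOUR_KEY",   # placeholder for your own Trove API key
}

def extract_tags(record):
    """Pull lower-cased tag values out of an item record.

    Assumes the tags are embedded as a list of dicts with a 'value' key –
    the shape used here is an assumption for illustration.
    """
    return [t["value"].lower() for t in record.get("tag", [])]
```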
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/6ef630ccef.png&#34; alt=&#34;Chart showing the number of tags per year and zone.&#34;&gt;&lt;/p&gt;
&lt;p&gt;The dataset is saved as a 500mb CSV file, and contains the following fields:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;tag&lt;/code&gt; – lower-cased version of the tag&lt;/li&gt;
&lt;li&gt;&lt;code&gt;date&lt;/code&gt; – date the tag was added&lt;/li&gt;
&lt;li&gt;&lt;code&gt;zone&lt;/code&gt; – the API zone that contains the tagged resource&lt;/li&gt;
&lt;li&gt;&lt;code&gt;resource_id&lt;/code&gt; – the identifier of the tagged resource&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There’s a few things to note about the data:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works (such as books) in Trove can have tags attached at either work or version level. This dataset aggregates all tags at the work level, removing any duplicates.&lt;/li&gt;
&lt;li&gt;A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the &amp;lsquo;book&amp;rsquo;, &amp;lsquo;picture&amp;rsquo;, and &amp;lsquo;map&amp;rsquo; zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. Depending on your interests, you might want to remove these duplicates.&lt;/li&gt;
&lt;li&gt;While most of the tags were added by Trove users, more than 500,000 tags were added by Trove itself in November 2009. I think these tags were automatically generated from related Wikipedia pages. Depending on your interests, you might want to exclude these by limiting the date range or zones.&lt;/li&gt;
&lt;li&gt;User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.&lt;/li&gt;
&lt;/ul&gt;
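&lt;p&gt;To illustrate the point about duplicates across zones, here's a small, hypothetical sketch that counts tag usage from rows of the dataset, optionally counting each tag once per resource rather than once per zone:&lt;/p&gt;

```python
from collections import Counter

def count_tags(rows, dedupe_across_zones=True):
    """Count tag usage from dataset rows (dicts with 'tag', 'zone', and
    'resource_id' keys). A resource appearing in multiple zones has its
    tags duplicated, so by default each (tag, resource_id) pair is
    counted only once."""
    if dedupe_across_zones:
        seen = {(r["tag"], r["resource_id"]) for r in rows}
        return Counter(tag for tag, _ in seen)
    return Counter(r["tag"] for r in rows)
```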
&lt;p&gt;You can download the complete dataset from &lt;a href=&#34;https://doi.org/10.5281/zenodo.5094314&#34;&gt;Zenodo&lt;/a&gt;, or from &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/YiWStNrhnTo18JI&#34;&gt;CloudStor&lt;/a&gt;. For more information on how I harvested the data, and some of its limits and complexities, see the notebooks in the new &lt;a href=&#34;https://glam-workbench.net/trove-lists/#tags&#34;&gt;‘Tags’ section in the GLAM Workbench&lt;/a&gt;. There’s also some examples of analysing and visualising the tags. As an extra bonus, there’s a more compact &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-lists/blob/master/trove_tag_counts_20210710.csv&#34;&gt;50mb CSV dataset&lt;/a&gt; which lists each unique tag and the number of times it has been used.&lt;/p&gt;
&lt;p&gt;Of course, it’s worth remembering that this sort of dataset is out of date before the harvest is even finished. More tags are being added all the time! But hopefully this data will help us better understand the way people work to organise and enrich complex resources like Trove. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Integrating GLAM Workbench news and discussion</title>
      <link>https://updates.timsherratt.org/2021/07/01/integrating-glam-workbench.html</link>
      <pubDate>Thu, 01 Jul 2021 16:18:41 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/07/01/integrating-glam-workbench.html</guid>
      <description>&lt;p&gt;I’ve spent a lot of time this year working on ways of improving the GLAM Workbench’s documentation and its integration with other services. Last year I created &lt;a href=&#34;https://ozglam.chat/&#34;&gt;OzGLAM Help&lt;/a&gt; to provide a space where users of GLAM collections could ask questions and share discoveries – including a dedicated &lt;a href=&#34;https://ozglam.chat/c/glam-workbench/8&#34;&gt;GLAM Workbench channel&lt;/a&gt;. Earlier this year, I tweaked my Micro.blog powered updates to include a dedicated &lt;a href=&#34;https://updates.timsherratt.org/categories/glamworkbench&#34;&gt;GLAM Workbench news feed&lt;/a&gt;. Now I’ve brought the two together! What does this mean?&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Any GLAM Workbench news that I post to my updates feed is now automatically added to OzGLAM Help&lt;/li&gt;
&lt;li&gt;Links are automatically added to items in the news feed that let you add comments or questions in OzGLAM Help&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So now there’s two-way communication between the services, providing more ways for people to discover and discuss how the GLAM Workbench can help them.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>GLAM Workbench now on YouTube!</title>
      <link>https://updates.timsherratt.org/2021/07/01/glam-workbench-now.html</link>
      <pubDate>Thu, 01 Jul 2021 15:51:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/07/01/glam-workbench-now.html</guid>
      <description>&lt;p&gt;I’ve started creating short videos to introduce or explain various components of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;. The first video shows how you can visualise searches in Trove’s digitised newspapers using the latest version of QueryPic. It’s a useful introduction to the way access to collection data enables us to ask different types of questions of historical sources.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/vdyKNowv9gw&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;As with all GLAM Workbench resources, the video is openly-licensed – so feel free to drop it into your own course materials or workshops. It could, for example, provide an interesting little digital methods task in an Australian history unit.&lt;/p&gt;
&lt;p&gt;I’ll be creating a second QueryPic video shortly, demonstrating how you can work with complex queries and differing timescales. Let me know if you find it useful, or if you have any ideas for future topics. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>GLAM Workbench office hours</title>
      <link>https://updates.timsherratt.org/2021/06/28/glam-workbench-office.html</link>
      <pubDate>Mon, 28 Jun 2021 15:04:56 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/06/28/glam-workbench-office.html</guid>
      <description>&lt;p&gt;To help you make use of the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;, I’ve set up an ‘office hours’ time slot every Friday when people can book in for 30 minute chats via Zoom. Want to talk about how you might use the GLAM Workbench in your latest research project? Are you having trouble getting started with GLAM data? Or perhaps you have some ideas for future notebooks you’d like to share? Just click on the ‘Book a chat’ link in the GLAM Workbench, or head straight to the &lt;a href=&#34;https://calendly.com/timsherratt/30minchat&#34;&gt;scheduling page&lt;/a&gt; to set up a time!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/94822f8fd4.png&#34; alt=&#34;Book a chat!&#34; title=&#34;Screenshot of booking pop up in the GLAM Workbench&#34;&gt;&lt;/p&gt;
&lt;p&gt;This is yet another experiment to see how I can support the use of GLAM data and the development of digital skills with the GLAM Workbench. Let me know if you think it’s worthwhile. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>QueryPic: The Next Generation</title>
      <link>https://updates.timsherratt.org/2021/06/21/querypic-the-next.html</link>
      <pubDate>Mon, 21 Jun 2021 11:50:19 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/06/21/querypic-the-next.html</guid>
      <description>&lt;p&gt;QueryPic is a tool to visualise searches in Trove’s digitised newspapers. I created the first version &lt;a href=&#34;http://discontents.com.au/mining-the-treasures-of-trove-part-2/&#34;&gt;way back in 2011&lt;/a&gt;, and since then it’s taken a number of different forms. The latest version introduces some new features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Automatic query creation&lt;/strong&gt; – construct your search in the Trove web interface, then just copy and paste the url into QueryPic. This means you can take advantage of Trove’s advanced search and facets to build complex queries.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multiple time scales&lt;/strong&gt; – previous versions only aggregated search results by year, but now you can also aggregate by month, or by day. QueryPic will automatically choose a time unit based on the date range of your query, but if you’re not happy with the result you can change it!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Links back to Trove&lt;/strong&gt; – click on any of the points on the chart to search Trove within that time period. This enables you to zoom in and out of your results, from the high-level visualisation, to individual articles.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/b02780ed6c.png&#34; alt=&#34;Screenshot of QueryPic chart&#34;&gt;&lt;/p&gt;
&lt;p&gt;This version of QueryPic is built within a Jupyter notebook, and designed to run using Voila (which hides all the code and makes the notebook look like a web app). See the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#querypic&#34;&gt;Trove Newspapers section&lt;/a&gt; of the GLAM Workbench for more information. If you’d like to give it a try, just click the button below to run it live using Binder.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://mybinder.org/v2/gh/GLAM-Workbench/trove-newspapers/master?urlpath=voila/render/querypic.ipynb&#34;&gt;&lt;img src=&#34;https://static.mybinder.org/badge_logo.svg&#34; alt=&#34;Binder badge&#34;&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Hope you find it useful! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Everyone gets a Lab!</title>
      <link>https://updates.timsherratt.org/2021/06/21/everyone-gets-a.html</link>
      <pubDate>Mon, 21 Jun 2021 11:00:30 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/06/21/everyone-gets-a.html</guid>
      <description>&lt;p&gt;I recently took part in a panel at the &lt;a href=&#34;https://netpreserve.org/ga2021/&#34;&gt;IIPC Web Archiving Conference&lt;/a&gt; discussing ‘Research use of web archives: a Labs approach’. My fellow panellists described some amazing stuff going on in European cultural heritage organisations to support researchers who want to make use of web archives. My ‘lab’ doesn’t have a physical presence, or an institutional home, but it does provide a starting point for researchers, and with the latest Reclaim Cloud and Docker integrations, everyone can have their own web archives lab! Here’s my 8 minute video. The slides are &lt;a href=&#34;https://slides.com/wragge/wac-labs-panel&#34;&gt;available here&lt;/a&gt;.&lt;/p&gt;
&lt;iframe src=&#34;https://player.vimeo.com/video/563996783&#34; width=&#34;640&#34; height=&#34;360&#34; frameborder=&#34;0&#34; allow=&#34;autoplay; fullscreen; picture-in-picture&#34; allowfullscreen&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>Minor change to Reclaim Cloud config</title>
      <link>https://updates.timsherratt.org/2021/06/14/minor-change-to.html</link>
      <pubDate>Mon, 14 Jun 2021 15:44:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/06/14/minor-change-to.html</guid>
      <description>&lt;p&gt;When the 1-click installer for Reclaim Cloud works its magic and turns GLAM Workbench repositories into your own, personal digital labs, it creates a new &lt;code&gt;work&lt;/code&gt; directory mounted inside of your main Jupyter directory. This new directory is independent of the Docker image used to run Jupyter, so it’s a handy place to copy things if you ever want to update the Docker image. However, I just realised that there was a permissions problem with the &lt;code&gt;work&lt;/code&gt; directory which meant you couldn’t write files to it from within Jupyter.&lt;/p&gt;
&lt;p&gt;To fix the problem, I’ve added an extra line to the &lt;code&gt;reclaim-manifest.jps&lt;/code&gt; config file to make the Jupyter user the owner of the &lt;code&gt;work&lt;/code&gt; directory:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;	- cmd[cp]: chown -R jovyan:jovyan /home/jovyan/work
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This takes care of any new installations. If you have an existing installation, you can either just create a completely new environment using the updated config, or you can manually change the permissions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hover over the name of your environment in the control panel to display the option buttons.&lt;/li&gt;
&lt;li&gt;Click on the Settings button. A new box will open at the bottom of the control panel with all the settings options.&lt;/li&gt;
&lt;li&gt;Click on &amp;lsquo;SSH Access&amp;rsquo; in the left hand menu of the settings box.&lt;/li&gt;
&lt;li&gt;Click on the &amp;lsquo;SSH Connection&amp;rsquo; tab.&lt;/li&gt;
&lt;li&gt;Under &amp;lsquo;Web SSH&amp;rsquo; click on the Connect button and select the default node.&lt;/li&gt;
&lt;li&gt;A terminal session will open. At the command line enter the following:&lt;/li&gt;
&lt;/ul&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;	chown -R jovyan:jovyan /home/jovyan/work
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Done! See the &lt;a href=&#34;https://glam-workbench.net/using-reclaim-cloud/&#34;&gt;Using Reclaim Cloud&lt;/a&gt; section of the GLAM Workbench for more information.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Trove Query Parser</title>
      <link>https://updates.timsherratt.org/2021/06/14/trove-query-parser.html</link>
      <pubDate>Mon, 14 Jun 2021 13:46:01 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/06/14/trove-query-parser.html</guid>
      <description>&lt;p&gt;Here’s a &lt;a href=&#34;https://github.com/wragge/trove_query_parser/&#34;&gt;new little Python package&lt;/a&gt; that you might find useful. It simply takes a search url from Trove’s Newspapers &amp;amp; Gazettes category and converts it into a set of parameters that you can use to request data from the Trove API. While some parameters are used both in the web interface and the API, there are a lot of variations – this package means you don’t have to keep track of all the differences!&lt;/p&gt;
&lt;p&gt;It’s very simple to use.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/5b7e85ea88.png&#34; alt=&#34;How to use the Trove Query Parser.&#34;&gt;&lt;/p&gt;
&lt;p&gt;The code for the parser has been basically lifted from the &lt;a href=&#34;https://pypi.org/project/troveharvester/&#34;&gt;Trove Newspaper Harvester&lt;/a&gt;. I wanted to separate it out so that I could use it at various spots in the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt; and in other projects.&lt;/p&gt;
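&lt;p&gt;As a rough illustration of what the parser does – this is not the package's actual code, and the parameter mappings below are simplified assumptions using only the standard library – the translation from a web search url to API-style parameters looks something like:&lt;/p&gt;

```python
from urllib.parse import urlparse, parse_qs

def web_to_api_params(url):
    """Simplified, illustrative translation of a Trove Newspapers & Gazettes
    search url into API-style parameters. The real trove_query_parser
    package handles many more parameters and edge cases."""
    qs = parse_qs(urlparse(url).query)
    params = {"zone": "newspaper", "encoding": "json"}
    if "keyword" in qs:
        # The web interface's 'keyword' becomes the API's 'q' parameter
        params["q"] = qs["keyword"][0]
    if "l-decade" in qs:
        # Facet names sometimes carry across; only the first value is
        # used here for simplicity
        params["l-decade"] = qs["l-decade"][0]
    return params
```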
&lt;p&gt;This package, the documentation, and the tests were all created using &lt;a href=&#34;https://github.com/fastai/nbdev&#34;&gt;nbdev&lt;/a&gt;, which is really quite a fun way to develop Python packages. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Some GLAM Workbench stats</title>
      <link>https://updates.timsherratt.org/2021/06/13/some-glam-workbench.html</link>
      <pubDate>Sun, 13 Jun 2021 19:01:23 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/06/13/some-glam-workbench.html</guid>
      <description>&lt;p&gt;I deliberately don’t keep any stats about GLAM Workbench visits, because I think they’re pretty meaningless. On the other hand, I’m always interested to see how often GLAM Workbench repositories are launched on &lt;a href=&#34;https://archive.analytics.mybinder.org/&#34;&gt;Binder&lt;/a&gt;. Rather than just random clicks, these numbers represent the number of times users started new computing sessions using the GLAM Workbench. I just compiled these stats for the past year, and I was very pleased to see that the &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web Archives&lt;/a&gt; section has been launched over 1,000 times in the past twelve months! The &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove Newspapers&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper Harvester&lt;/a&gt; repositories are also well used – on average these are both being launched more than once a day.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/3aa3ee045b.png&#34; alt=&#34;Binder launches of GLAM Workbench repositories, 1 June 2020 to 2 June 2021.&#34;&gt;&lt;/p&gt;
&lt;p&gt;The GLAM Workbench is never going to attract massive numbers of users – it’s all about &lt;em&gt;being there&lt;/em&gt; when a researcher needs help to use GLAM collections. One or two launches per day means one or two researchers from somewhere around the world are able to explore new datasets, or ask new questions. I think that’s pretty important.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>More Reclaim Cloud integrations!</title>
      <link>https://updates.timsherratt.org/2021/06/13/more-reclaim-cloud.html</link>
      <pubDate>Sun, 13 Jun 2021 18:28:12 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/06/13/more-reclaim-cloud.html</guid>
      <description>&lt;p&gt;Five of the GLAM Workbench repositories now have automatically built Docker images and 1-click integration with &lt;a href=&#34;https://reclaim.cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt; – &lt;a href=&#34;https://glam-workbench.net/anu-archives/&#34;&gt;ANU Archives&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove Newspapers&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/trove-harvester/&#34;&gt;Trove Newspaper Harvester&lt;/a&gt;, &lt;a href=&#34;https://glam-workbench.net/recordsearch/&#34;&gt;NAA RecordSearch&lt;/a&gt;, &amp;amp; &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web Archives&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/45605b531f.png&#34; alt=&#34;&#34; title=&#34;Screencap showing Reclaim Cloud details&#34;&gt;&lt;/p&gt;
&lt;p&gt;This means you can launch your very own version of these GLAM Workbench repositories in the cloud, where all your downloads and experiments will be saved! Find out more on the &lt;a href=&#34;https://glam-workbench.net/using-reclaim-cloud/&#34;&gt;Using Reclaim Cloud&lt;/a&gt; page.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Get your GLAM datasets here!</title>
      <link>https://updates.timsherratt.org/2021/06/13/get-your-glam.html</link>
      <pubDate>Sun, 13 Jun 2021 18:12:37 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/06/13/get-your-glam.html</guid>
      <description>&lt;p&gt;I’ve updated my harvest of Australian GLAM datasets from state/national government open data portals. There’s now 387 datasets, containing 1049 files (including 684 CSVs). &lt;a href=&#34;https://glam-workbench.net/glam-datasets-from-gov-portals/&#34;&gt;There’s a list&lt;/a&gt; if you want to browse, and &lt;a href=&#34;https://github.com/GLAM-Workbench/ozglam-data/blob/master/glam-datasets-from-gov-portals.csv&#34;&gt;a CSV file&lt;/a&gt; if you want to download all the metadata. For more more information see the &lt;a href=&#34;https://glam-workbench.net/glam-data-portals/&#34;&gt;data portals section&lt;/a&gt; of the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/ef977dfcfc.png&#34; alt=&#34;Number of datatsets by institution&#34; title=&#34;Screencap showing number of datasets per institution&#34;&gt;&lt;/p&gt;
&lt;p&gt;If you’re interested in finding out what’s inside all those 684 CVS files, take the &lt;a href=&#34;https://glam-workbench.net/csv-explorer/&#34;&gt;GLAM CSV Explorer&lt;/a&gt; for a spin! It’s also been given a refresh, with new data and a new interface. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>NAA RecordSearch section of the GLAM Workbench updated!</title>
      <link>https://updates.timsherratt.org/2021/05/24/naa-recordsearch-section.html</link>
      <pubDate>Mon, 24 May 2021 11:50:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/05/24/naa-recordsearch-section.html</guid>
      <description>&lt;p&gt;If you work with the collections of the National Archives of Australia, you might find the &lt;a href=&#34;https://glam-workbench.net/recordsearch/&#34;&gt;RecordSearch section&lt;/a&gt; of the GLAM Workbench helpful. I’ve just updated the repository to add new options for running the notebooks, including 1-click installation on Reclaim Cloud. There’s also a few new notebooks.&lt;/p&gt;
&lt;h2 id=&#34;new-notebooks-and-datasets&#34;&gt;New notebooks and datasets&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/recordsearch/#harvest-details-of-all-series-in-recordsearch&#34;&gt;Harvest details of all series in RecordSearch&lt;/a&gt; – get details of all series registered in RecordSearch, also generates a summary dataset with the total number of items digitised, described and in each access category&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/recordsearch/#exploring-harvested-series-data&#34;&gt;Exploring harvested series data&lt;/a&gt;  – generates some basic statistics from the harvest of series data&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/series_totals_May_2021.csv&#34;&gt;Summary data about all series in RecordSearch&lt;/a&gt;  (15MB CSV) – contains basic descriptive information about all the series currently registered on RecordSearch (May 2021) as well as the total number of items described, digitised, and in each access category&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/4239169ca4.png&#34; width=&#34;594&#34; height=&#34;678&#34; alt=&#34;&#34; /&gt;
&lt;h2 id=&#34;updated&#34;&gt;Updated&lt;/h2&gt;
&lt;p&gt;I’ve started (but not completed) updating all the notebooks in this repository to use my new &lt;a href=&#34;https://wragge.github.io/recordsearch_data_scraper/&#34;&gt;RecordSearch Data Scraper&lt;/a&gt;. The new scraper is simpler and more efficient, and enables me to get rid of a lot of boilerplate code. Updated notebooks include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/recordsearch/#harvest-items-from-a-search-in-recordsearch&#34;&gt;Harvest items from a search in RecordSearch&lt;/a&gt; – save the results of an item search in RecordSearch as a downloadable dataset; you can also save images and PDFs from digitised files (PDF saving is new!)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/recordsearch/#harvest-files-with-the-access-status-of-closed&#34;&gt;Harvest files with the access status of ‘closed’&lt;/a&gt; – find out what we’re not allowed to see by harvesting details of ‘closed’ files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Other updates include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Python packages updated&lt;/li&gt;
&lt;li&gt;Integration with Reclaim Cloud allowing 1-click installation of the whole repository and environment&lt;/li&gt;
&lt;li&gt;Automatic creation of Docker images when the repository is updated&lt;/li&gt;
&lt;li&gt;Updated &lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch#readme&#34;&gt;README&lt;/a&gt; and repository index with list of all notebooks&lt;/li&gt;
&lt;li&gt;Notebooks intended to run as apps now use Voila rather than Appmode for better integration with Jupyter Lab&lt;/li&gt;
&lt;li&gt;&lt;code&gt;requirements-unpinned.txt&lt;/code&gt; added to repository for people who want to develop the notebooks in their own clean environment&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hope you find these changes useful! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Web archives section of GLAM Workbench updated!</title>
      <link>https://updates.timsherratt.org/2021/05/17/web-archives-section.html</link>
      <pubDate>Mon, 17 May 2021 13:22:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/05/17/web-archives-section.html</guid>
      <description>&lt;p&gt;My program of rolling out new features and integrations across the GLAM Workbench continues. The latest section to be updated is the &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web Archives&lt;/a&gt; section!&lt;/p&gt;
&lt;p&gt;There are no new notebooks with this update, but some important changes under the hood. If you haven’t used it before, the Web Archives section contains 16 notebooks providing documentation, tools, apps, and examples to help you make use of web archives in your research. The notebooks are grouped by the following topics: &lt;strong&gt;Types of data&lt;/strong&gt;, &lt;strong&gt;Harvesting data and creating datasets&lt;/strong&gt;, and &lt;strong&gt;Exploring change over time&lt;/strong&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/51d0d97ca3.png&#34; width=&#34;721&#34; height=&#34;622&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;I’ve updated all the Python packages used in this repository and changed the app-ified notebooks to run using Voila (which is better integrated with Jupyter Lab than Appmode). But most importantly, you can now install the repository into your own persistent environment using &lt;a href=&#34;https://reclaim.cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt; or Docker.&lt;/p&gt;
&lt;p&gt;As &lt;a href=&#34;https://circulatingnow.nlm.nih.gov/2021/05/13/exploring-the-data-of-web-archives-as-part-of-data-science-nlm/&#34;&gt;Christie Moffatt noted recently&lt;/a&gt;, harvesting data from web archives can take a long time, and you might hit the limits of the free Binder service. These new integrations mean you don’t have to worry about your notebooks timing out. Just click on the &lt;strong&gt;Launch on Reclaim Cloud&lt;/strong&gt; button and you can have your own fully-provisioned, persistent environment up and running in minutes!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/bf594af78e.png&#34; width=&#34;553&#34; height=&#34;343&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;This is possible because every change to the Web Archives repository now triggers the build of a new Docker image with all the software that you need pre-installed. You can also run this Docker image on your own computer, or using another cloud service.&lt;/p&gt;
&lt;p&gt;The Web Archives section now &lt;a href=&#34;https://glam-workbench.net/web-archives/#run-these-notebooks&#34;&gt;includes documentation&lt;/a&gt; on running the notebooks using Binder, Reclaim Cloud, or Docker. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Using web archives to find out when newspapers were added to Trove</title>
      <link>https://updates.timsherratt.org/2021/05/12/using-web-archives.html</link>
      <pubDate>Wed, 12 May 2021 12:36:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/05/12/using-web-archives.html</guid>
      <description>&lt;p&gt;There’s no doubt that Trove’s digitised newspapers have had a significant impact on the practice of history in Australia. But analysing that impact is difficult when Trove itself is always changing – more newspapers and articles are being added all the time.&lt;/p&gt;
&lt;p&gt;In an attempt to chart the development of Trove, I’ve created a dataset that shows (approximately) when particular newspaper titles were first added. This gives a rough snapshot of what Trove contained at any point in the last 12 years.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/a272374b48.png&#34; width=&#34;519&#34; height=&#34;242&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;I say &lt;em&gt;approximately&lt;/em&gt; because the only public sources of this information are web archives like the &lt;a href=&#34;https://archive.org/web/&#34;&gt;Internet Archive’s Wayback Machine&lt;/a&gt; and &lt;a href=&#34;https://webarchive.nla.gov.au/collection&#34;&gt;Trove itself&lt;/a&gt;. By downloading &lt;a href=&#34;https://web.archive.org/web/*/http://trove.nla.gov.au/ndp/del/titles&#34;&gt;captures of Trove’s browse page&lt;/a&gt;, I was able to extract a list of newspaper titles available &lt;strong&gt;when that capture was made&lt;/strong&gt;. Depending on the frequency of captures, the titles may have been first made available some time earlier.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#gathering-historical-data-about-the-addition-of-newspaper-titles-to-trove&#34;&gt;method I used&lt;/a&gt; to create the dataset is documented in the Trove Newspapers section of the GLAM Workbench. I used the Internet Archive as my source rather than Trove just because there were more captures available. Most of the code I could conveniently copy from the &lt;a href=&#34;https://glam-workbench.net/web-archives/&#34;&gt;Web Archives&lt;/a&gt; section of the GLAM Workbench, in particular the &lt;a href=&#34;https://glam-workbench.net/web-archives/#find-all-the-archived-versions-of-a-web-page&#34;&gt;Find all the archived versions of a particular web page&lt;/a&gt; notebook.&lt;/p&gt;
&lt;p&gt;The result was actually two datasets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers/blob/master/trove_newspaper_titles_2009_2021.csv&#34;&gt;trove_newspaper_titles_2009_2021.csv&lt;/a&gt;  – complete dataset of captures and titles&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers/blob/master/trove_newspaper_titles_first_appearance_2009_2021.csv&#34;&gt;trove_newspaper_titles_first_appearance_2009_2021.csv&lt;/a&gt;  – filtered dataset, showing only the first appearance of each title / place / date range combination&lt;/li&gt;
&lt;/ul&gt;
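The filtered dataset can be derived from the complete one by keeping only the earliest capture for each title / place / date range combination. Here's a minimal sketch of that filtering step – the tuple layout and sample values are illustrative assumptions, not the exact columns of the CSVs above:

```python
def first_appearances(rows):
    """Keep the earliest capture for each (title, place, date_range) combo.

    rows: iterable of (capture_date, title, place, date_range) tuples,
    with capture_date in a sortable format like 'YYYY-MM-DD'.
    """
    seen = {}
    # Sorting by capture date means the first time we meet a combination
    # is its earliest appearance in the web archive captures.
    for capture_date, title, place, date_range in sorted(rows):
        key = (title, place, date_range)
        if key not in seen:
            seen[key] = capture_date
    return seen

rows = [
    ("2011-02-01", "The Argus", "Melbourne", "1848-1957"),
    ("2009-11-03", "The Argus", "Melbourne", "1848-1957"),
    ("2015-06-10", "The Argus", "Melbourne", "1846-1957"),  # date range changed
]
print(first_appearances(rows))
```

A change to a title's date range shows up as a new combination, which is why the filtered dataset records those changes as well as brand new titles.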
&lt;p&gt;There’s also an &lt;a href=&#34;https://gist.github.com/wragge/7d80507c3e7957e271c572b8f664031a&#34;&gt;alphabetical list of newspaper titles&lt;/a&gt; for easy browsing. The list shows the date of the capture in which the title was first recorded, as well as any changes to its date range. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>GLAM Jupyter Resources</title>
      <link>https://updates.timsherratt.org/2021/05/12/glam-jupyter-resources.html</link>
      <pubDate>Wed, 12 May 2021 11:58:29 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/05/12/glam-jupyter-resources.html</guid>
      <description>&lt;p&gt;To make it easier for people to suggest additions, I’ve created a &lt;a href=&#34;https://github.com/GLAM-Workbench/GLAM-jupyter-resources&#34;&gt;GitHub repository&lt;/a&gt; for my list of GLAM Jupyter examples and resources. Contributions are welcome!&lt;/p&gt;
&lt;p&gt;This list is automatically pulled into the &lt;a href=&#34;https://glam-workbench.net/more-glam-notebooks/&#34;&gt;GLAM Workbench&amp;rsquo;s help documentation&lt;/a&gt;. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Running notebooks – a sign of things to come in the GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2021/05/12/running-notebooks-a.html</link>
      <pubDate>Wed, 12 May 2021 11:51:11 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/05/12/running-notebooks-a.html</guid>
      <description>&lt;p&gt;I recently made some changes in the GLAM Workbench’s Help documentation, adding a new &lt;strong&gt;Running notebooks&lt;/strong&gt; section. This section provides detailed information of running and managing GLAM Workbench repositories using &lt;a href=&#34;https://glam-workbench.net/using-reclaim-cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt; and &lt;a href=&#34;https://glam-workbench.net/using-docker/&#34;&gt;Docker&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I’m still rolling out this functionality across all the repositories, but it’s going to take a while. When I’m finished you’ll be able to create your own persistent environment on Reclaim Cloud from any repository with just the click of a button. See the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove Newspapers&lt;/a&gt; section to try this out now! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Sponsor my work on GitHub!</title>
      <link>https://updates.timsherratt.org/2021/05/12/sponsor-my-work.html</link>
      <pubDate>Wed, 12 May 2021 11:25:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/05/12/sponsor-my-work.html</guid>
      <description>&lt;p&gt;As I &lt;a href=&#34;https://updates.timsherratt.org/2021/03/26/moving-on-from.html&#34;&gt;foreshadowed some weeks ago&lt;/a&gt;, I’ve shut down my Patreon page. Thanks to everyone who has supported me there over the last few years!&lt;/p&gt;
&lt;p&gt;I’ve now shifted across to GitHub Sponsors, which is focused on supporting open source projects. This seems like a much better fit for the things that I do, which are all &lt;strong&gt;free&lt;/strong&gt; and &lt;strong&gt;open&lt;/strong&gt; by default.&lt;/p&gt;
&lt;p&gt;So if you think things like the &lt;a href=&#34;https://glam-workbench.net/&#34;&gt;GLAM Workbench&lt;/a&gt;, &lt;a href=&#34;https://historichansard.net/&#34;&gt;Historic Hansard&lt;/a&gt;, &lt;a href=&#34;https://ozglam.chat/&#34;&gt;OzGLAM Help&lt;/a&gt;, and &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/&#34;&gt;The Real Face of White Australia&lt;/a&gt; are worth supporting, you can sign up using my &lt;a href=&#34;https://github.com/sponsors/wragge?o=esb&#34;&gt;&lt;strong&gt;GitHub Sponsors page&lt;/strong&gt;&lt;/a&gt;. Sponsorship tiers start at just $1 a month. Financially, your contributions help pay some of my cloud hosting bills and keep everything online. But just as important is the encouragement and motivation I get from knowing that there are people out there who think this work is important and useful.&lt;/p&gt;
&lt;iframe src=&#34;https://github.com/sponsors/wragge/card&#34; title=&#34;Sponsor wragge&#34; height=&#34;225&#34; width=&#34;600&#34; style=&#34;border: 0;&#34;&gt;&lt;/iframe&gt;
&lt;p&gt;To recognise my GitHub sponsors, I&amp;rsquo;ve also created a &lt;a href=&#34;https://glam-workbench.net/supporters/&#34;&gt;new Supporters page&lt;/a&gt; in the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;Thanks!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Updates to the Trove Newspapers section of GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2021/05/12/updates-to-the.html</link>
      <pubDate>Wed, 12 May 2021 10:52:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/05/12/updates-to-the.html</guid>
      <description>&lt;p&gt;I’ve updated, refreshed, and reorganised the &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/&#34;&gt;Trove newspapers section&lt;/a&gt; of the GLAM Workbench.  There’s currently 22 Jupyter notebooks organised under the following headings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#trove-newspapers-in-context&#34;&gt;&lt;strong&gt;Trove newspapers in context&lt;/strong&gt;&lt;/a&gt; – Notebooks in this section look at the Trove newspaper corpus as a whole, to try and understand what’s there, and what’s not.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#visualising-searches&#34;&gt;&lt;strong&gt;Visualising searches&lt;/strong&gt;&lt;/a&gt; – Notebooks in this section demonstrate some ways of visualising searches in Trove newspapers – seeing everything rather than just a list of search results.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#useful-tools&#34;&gt;&lt;strong&gt;Useful tools&lt;/strong&gt;&lt;/a&gt; – Notebooks in this section provide useful tools that extend or enhance the Trove web interface and API.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#tips-and-tricks&#34;&gt;&lt;strong&gt;Tips and tricks&lt;/strong&gt;&lt;/a&gt; – Notebooks in this section provide some useful hints to use with the Trove API.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#get-creative&#34;&gt;&lt;strong&gt;Get creative&lt;/strong&gt;&lt;/a&gt; – Notebooks in this section look at ways you can use data from Trove newspapers in creative ways.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There’s also a number of pre-harvested datasets.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/78c376493d.png&#34; width=&#34;702&#34; height=&#34;550&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;Recently refreshed analyses, visualisations, and datasets include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/trove-newspapers/blob/master/visualise-total-newspaper-articles-by-state-year.ipynb&#34;&gt;Number of Trove newspaper articles by year and state&lt;/a&gt; (notebook)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/trove-newspapers/blob/master/Analysing_OCR_corrections.ipynb&#34;&gt;Analysing OCR correction in Trove’s newspapers&lt;/a&gt; (notebook)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/wragge/9aa385648cff5f0de0c7d4837896df97&#34;&gt;List of Trove newspapers in languages other than English&lt;/a&gt; (markdown formatted list)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers/blob/master/newspapers_post_54.csv&#34;&gt;Newspapers with content from beyond the 1954 copyright ‘cliff of death’&lt;/a&gt; (CSV file)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As part of the update, notebooks that are intended to run as apps (with all the code hidden) have been updated to use Voila. But perhaps the thing I’m most excited about are the new options for &lt;strong&gt;running&lt;/strong&gt; the notebooks. As well as being able to launch the notebooks on Binder, you can now create your very own, &lt;strong&gt;persistent&lt;/strong&gt; environment on Reclaim Cloud with just a click of a button.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/c39e0d2441.png&#34; width=&#34;532&#34; height=&#34;337&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;There’s also an automatically-built Docker image of this repository, containing everything you need to run the notebooks on your own computer. Check out the new &lt;a href=&#34;https://glam-workbench.net/trove-newspapers/#run-these-notebooks&#34;&gt;Run these notebooks&lt;/a&gt;  section for details. I’m gradually rolling this out across all the repositories in the GLAM Workbench. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Introducing the new, improved RecordSearch Data Scraper!</title>
      <link>https://updates.timsherratt.org/2021/04/27/introducing-the-new.html</link>
      <pubDate>Tue, 27 Apr 2021 10:55:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/04/27/introducing-the-new.html</guid>
      <description>&lt;p&gt;It was way back in 2009 that I created &lt;a href=&#34;http://discontents.com.au/some-archives-hacking/&#34;&gt;my first scraper&lt;/a&gt; for getting machine-readable data out of the National Archives of Australia&amp;rsquo;s online database, RecordSearch. Since then I’ve used versions of this scraper in a number of different projects such as &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/&#34;&gt;The Real Face of White Australia&lt;/a&gt;, &lt;a href=&#34;https://closedaccess.herokuapp.com/&#34;&gt;Closed Access&lt;/a&gt;, and &lt;a href=&#34;https://owebrowse.herokuapp.com/redactions/&#34;&gt;Redacted&lt;/a&gt; (including the &lt;a href=&#34;https://updates.timsherratt.org/2021/04/21/secrets-and-lives.html&#34;&gt;recent update&lt;/a&gt;). The scraper is also embedded in many of the notebooks that I’ve created for the &lt;a href=&#34;https://glam-workbench.net/recordsearch/&#34;&gt;RecordSearch section&lt;/a&gt; of the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;However, the scraper was showing its age. The main problem was that one of its dependencies, Robobrowser, is no longer maintained. This made it difficult to update. I&amp;rsquo;d put off a major rewrite, thinking that RecordSearch itself might be getting a much-needed overhaul, but I could wait no longer. Introducing the brand new &lt;a href=&#34;https://github.com/wragge/recordsearch_data_scraper&#34;&gt;RecordSearch Data Scraper&lt;/a&gt;.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/ab67b3d0c3.png&#34; width=&#34;806&#34; height=&#34;422&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;Just like the old version, the new scraper delivers machine-readable data relating to Items, Series and Agencies – both from individual records, and search results. It also adds a little extra to the basic metadata, for example, if an Item is digitised, the data includes the number of pages in the file. Series records can include the number of digitised files, and the breakdown of files by access category.&lt;/p&gt;
&lt;p&gt;The new scraper adds some additional search parameters for Series and Agencies. It also uses a simple caching system to improve speed and efficiency. RecordSearch makes use of an odd assortment of sessions, redirects, and hidden forms, which makes scraping a challenge. Hopefully I’ve nailed down the idiosyncrasies, but I expect to be catching bugs for a while.&lt;/p&gt;
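The caching idea is simple even if the scraper's internals differ: remember each page you've already fetched so repeat requests (common when paging through search results) don't hit RecordSearch again. A generic sketch of the pattern – this is not the scraper's actual implementation, and the fetch function is injected here purely for illustration:

```python
class CachedFetcher:
    """Cache fetched pages by URL so repeat requests skip the network.

    The fetch callable is injected so the sketch stays self-contained;
    in real use it would wrap an HTTP GET (ideally with polite delays
    between requests to the live site).
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self.misses = 0  # how many times we actually had to fetch

    def get(self, url):
        if url not in self._cache:
            self.misses += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]

# Hypothetical URL, stand-in fetch function.
fetcher = CachedFetcher(lambda url: "page for " + url)
fetcher.get("https://example.com/series/B13")
fetcher.get("https://example.com/series/B13")  # served from the cache
print(fetcher.misses)
```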
&lt;p&gt;I created the new scraper in Jupyter using &lt;a href=&#34;https://github.com/fastai/nbdev&#34;&gt;NBDev&lt;/a&gt;. NBDev helps you to keep your code, examples, tests, and documentation all together in Jupyter notebooks. When you&amp;rsquo;re ready, it converts the code from the notebooks  into distributable Python libraries, runs all your tests, and builds a &lt;a href=&#34;https://wragge.github.io/recordsearch_data_scraper/&#34;&gt;documentation site&lt;/a&gt;. It’s very cool.&lt;/p&gt;
&lt;p&gt;Having updated the scraper, I now need to update the notebooks in the GLAM Workbench – more on that soon. The maintenance never ends! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Recently digitised files in the National Archives of Australia</title>
      <link>https://updates.timsherratt.org/2021/03/29/recently-digitised-files.html</link>
      <pubDate>Mon, 29 Mar 2021 10:00:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/03/29/recently-digitised-files.html</guid>
<description>&lt;p&gt;I’m interested in understanding what gets digitised and when by our cultural institutions, but accessible data is scarce. The National Archives of Australia lists ‘newly scanned’ records in RecordSearch, so I thought I’d see if I could convert that list into a machine-readable form for analysis. I’ve had a lot of experience trying to &lt;a href=&#34;https://glam-workbench.github.io/recordsearch/&#34;&gt;get data out of RecordSearch&lt;/a&gt;, but even so it took me a while to figure out how the ‘newly scanned’ page worked. Eventually I was able to extract all the file metadata from the list and save it to a CSV file. The details are in &lt;a href=&#34;https://glam-workbench.github.io/recordsearch/#harvest-recently-digitised-files-from-recordsearch&#34;&gt;this notebook in the GLAM Workbench&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I used the code to create a dataset of &lt;a href=&#34;https://github.com/GLAM-Workbench/recordsearch/blob/master/data/recently-digitised-20210327&#34;&gt;all the files digitised in the past month&lt;/a&gt;. The ‘newly scanned’ list only displays a month&amp;rsquo;s worth of additions, so that&amp;rsquo;s as much as I could get in one hit. In the past month, 24,039 files were digitised. 22,500 of these (about 93%) come from just four series of military records. This is no surprise, as the NAA is currently undertaking a major project to digitise WW2 service records. What is perhaps more interesting is the long tail of series from which a small number of files were digitised. 357 of the 375 series represented in the dataset (about 95%) appear 20 or fewer times. 210 series have had only one file digitised in the last month. I’m assuming that this diversity represents research interests, refracted through the digitisation on demand service. But this really needs more data, and more analysis.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/53a63e0de5.jpg&#34; width=&#34;600&#34; height=&#34;266&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;As I mentioned, only one month&amp;rsquo;s data is available from RecordSearch at any time. To try and capture a longer record of the digitisation process, I’ve set up an automated &lt;a href=&#34;https://simonwillison.net/2020/Oct/9/git-scraping/&#34;&gt;‘git scraper’&lt;/a&gt; that runs every Sunday and captures metadata of all the files digitised in the preceding week. The weekly datasets are saved as CSV files in a &lt;a href=&#34;https://github.com/wragge/naa-recently-digitised&#34;&gt;public GitHub repository&lt;/a&gt;. Over time, this should become a useful dataset for exploring long-term patterns in digitisation. #dhhacks&lt;/p&gt;
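The git-scraping pattern itself is simple: a scheduled job harvests the current list, writes it to a dated CSV, and commits the result, so the repository's history becomes the time series. A minimal sketch of the save step – the column names, filename pattern, and sample metadata are illustrative assumptions, not necessarily what the actual repository uses:

```python
import csv
from datetime import date
from pathlib import Path

def save_weekly_snapshot(rows, data_dir, snapshot_date=None):
    """Write this week's harvested file metadata to a dated CSV.

    rows: list of dicts of harvested item metadata. Committing the
    file afterwards (e.g. from a scheduled CI job running git commit)
    preserves each week's snapshot in the repository history.
    """
    snapshot_date = snapshot_date or date.today()
    path = Path(data_dir) / f"digitised-{snapshot_date.isoformat()}.csv"
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path

# Usage with hypothetical sample metadata.
sample = [{"item_id": "123456", "series": "B883", "title": "Service record"}]
out = save_weekly_snapshot(sample, ".", snapshot_date=date(2021, 3, 28))
print(out.name)
```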
</description>
    </item>
    
    <item>
      <title>Moving on from Patreon...</title>
      <link>https://updates.timsherratt.org/2021/03/26/moving-on-from.html</link>
      <pubDate>Fri, 26 Mar 2021 13:37:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/03/26/moving-on-from.html</guid>
      <description>&lt;p&gt;Over the last few years, I&amp;rsquo;ve been very grateful for the support of my Patreon subscribers. Financially, their contributions have helped me cover a substantial proportion of the cloud hosting costs associated with projects like &lt;a href=&#34;https://historichansard.net/&#34;&gt;Historic Hansard&lt;/a&gt; and &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/&#34;&gt;The Real Face of White Australia&lt;/a&gt;. But, more importantly, just knowing that they thought my work was of value has helped keep me going, and inspired me to develop a range of new resources.&lt;/p&gt;
&lt;p&gt;However, while I&amp;rsquo;ve been grateful for the platform provided by Patreon, I&amp;rsquo;ve increasingly felt that it&amp;rsquo;s not a good fit for the sort of work I do. Patreon is geared towards providing special content to supporters, but, as you know, all my work is open. And that&amp;rsquo;s really important to me.&lt;/p&gt;
&lt;p&gt;Recently GitHub opened up its own sponsorship program for the development of open source software. This program seems to align more closely with what I do. I already share and manage my code through GitHub, so integrating sponsorship seems to make a lot of sense. It&amp;rsquo;s worth noting too, that, unlike Patreon, GitHub charges no fees and takes no cut of your contributions. As a result I&amp;rsquo;ve decided to close my Patreon account by the end of April, and create a GitHub sponsors page.&lt;/p&gt;
&lt;h2 id=&#34;what-does-this-mean-for-you&#34;&gt;What does this mean for you?&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;If you&amp;rsquo;re a Patreon subscriber and you&amp;rsquo;d like to keep supporting me&lt;/strong&gt;, you should cancel your Patreon contribution, then head over to my brand new &lt;a href=&#34;https://github.com/sponsors/wragge&#34;&gt;&lt;strong&gt;GitHub sponsors page&lt;/strong&gt;&lt;/a&gt; and sign up! Thanks for your continued support!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you&amp;rsquo;d prefer to let your contributions lapse&lt;/strong&gt;, just do nothing. Your payments will stop when I close the account at the end of April. I understand that circumstances change – thank you so much for your support over the years, and I hope you will continue to make use of the things I create.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If you make use of any of my tools or resources and would like to support their continued development&lt;/strong&gt;, please think about &lt;a href=&#34;https://github.com/sponsors/wragge&#34;&gt;becoming a sponsor&lt;/a&gt;. For a sample of the sorts of things I&amp;rsquo;ve been working on lately, see my &lt;a href=&#34;https://updates.timsherratt.org/&#34;&gt;updates feed&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&#34;the-future&#34;&gt;The future!&lt;/h2&gt;
&lt;p&gt;I&amp;rsquo;m very excited about the possibilities ahead. The &lt;a href=&#34;https://glam-workbench.github.io/&#34;&gt;GLAM Workbench&lt;/a&gt; has received a lot of attention around the world (including a &lt;a href=&#34;https://glam-workbench.github.io/awards/#british-library-lab-awards-2020&#34;&gt;Research Award&lt;/a&gt; from the British Library Labs), and I&amp;rsquo;m planning some &lt;a href=&#34;https://updates.timsherratt.org/2021/03/25/reclaim-cloud-integration.html&#34;&gt;major developments&lt;/a&gt; over coming months. And, of course, I won&amp;rsquo;t forget all my other resources – I spent a lot of time in 2020 migrating databases and platforms to keep everything chugging along.&lt;/p&gt;
&lt;p&gt;On my &lt;a href=&#34;https://github.com/sponsors/wragge&#34;&gt;GitHub sponsors page&lt;/a&gt;, I&amp;rsquo;ve set an initial target of 50 sponsors. That might be ambitious, but as I said above, it&amp;rsquo;s not just about money. Being able to point to a group of people who use and value this work will help me argue for new ways of enabling digital research in the humanities. So please help me spread the word – let&amp;rsquo;s make things together!&lt;/p&gt;
&lt;iframe src=&#34;https://github.com/sponsors/wragge/card&#34; title=&#34;Sponsor wragge&#34; height=&#34;225&#34; width=&#34;600&#34; style=&#34;border: 0;&#34;&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>What can you do with the GLAM Workbench?</title>
      <link>https://updates.timsherratt.org/2021/03/25/what-can-you.html</link>
      <pubDate>Thu, 25 Mar 2021 11:43:47 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/03/25/what-can-you.html</guid>
      <description>&lt;p&gt;You might have noticed some changes to the &lt;a href=&#34;https://glam-workbench.github.io/&#34;&gt;GLAM Workbench&lt;/a&gt; home page recently. One of the difficulties has always been trying to explain what the GLAM Workbench actually is, so I thought it might be useful to put more examples up front. The home page now lists about 25 notebooks under the headings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.github.io/#finding-glam-data&#34;&gt;Finding GLAM data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.github.io/#asking-different-questions&#34;&gt;Asking different questions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.github.io/#hacking-heritage&#34;&gt;Hacking heritage&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://glam-workbench.github.io/#bringing-documentation-alive&#34;&gt;Bringing documentation alive&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hopefully they give a decent representation of the sorts of things you can do using the GLAM Workbench. I’ve also included a little rotating slideshow built using &lt;a href=&#34;https://slides.com&#34;&gt;Slides.com&lt;/a&gt;.&lt;/p&gt;
&lt;iframe src=&#34;https://slides.com/wragge/gw-highlights/embed?byline=hidden&amp;share=hidden&#34; width=&#34;576&#34; height=&#34;420&#34; scrolling=&#34;no&#34; frameborder=&#34;0&#34; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;Other recent additions include a new &lt;a href=&#34;https://glam-workbench.github.io/awards/&#34;&gt;Grants and Awards&lt;/a&gt; page. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Reclaim Cloud integration coming soon to the GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2021/03/25/reclaim-cloud-integration.html</link>
      <pubDate>Thu, 25 Mar 2021 11:18:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/03/25/reclaim-cloud-integration.html</guid>
      <description>&lt;p&gt;I’ve been doing a bit of work behind the scenes lately to prepare for a major update to the &lt;a href=&#34;https://glam-workbench.github.io/&#34;&gt;GLAM Workbench&lt;/a&gt;. My plan is to provide one click installation of any of the GLAM Workbench repositories on the &lt;a href=&#34;https://reclaim.cloud/&#34;&gt;Reclaim Cloud&lt;/a&gt; platform. This will provide a useful step up from Binder for any researcher who wants to do large-scale or sustained work using the GLAM Workbench. Reclaim Cloud is a paid service, but they do a great job supporting digital scholarship in the humanities, and it’s fairly easy to minimise your costs by shutting down environments when they&amp;rsquo;re not in use.&lt;/p&gt;
&lt;p&gt;I’ve still got a lot of work to do to roll this out across the GLAM Workbench&amp;rsquo;s 40 repositories, but if you&amp;rsquo;d like a preview head to the &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspaper-harvester&#34;&gt;Trove Newspaper and Gazette Harvester repository&lt;/a&gt; on GitHub. Get yourself a Reclaim Cloud account and click on the &lt;strong&gt;Launch on Reclaim Cloud&lt;/strong&gt; button. It&amp;rsquo;s that easy!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/a2c7a079eb.jpg&#34; width=&#34;600&#34; height=&#34;270&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;There&amp;rsquo;s &lt;a href=&#34;https://community.reclaimhosting.com/t/a-one-click-jupyter-install-example-for-the-glam-workbench/3676&#34;&gt;some technical notes&lt;/a&gt; in the Reclaim Hosting forum, and &lt;a href=&#34;https://bavatuesdays.com/reclaim-clouds-got-glam/&#34;&gt;a post&lt;/a&gt; by Reclaim Hosting guru Jim Groom describing his own experience spinning up the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;Watch this space for more news! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Some recent GLAM Workbench presentations</title>
      <link>https://updates.timsherratt.org/2021/03/25/some-recent-glam.html</link>
      <pubDate>Thu, 25 Mar 2021 10:40:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/03/25/some-recent-glam.html</guid>
      <description>&lt;p&gt;I’ve given a couple of talks lately on the &lt;a href=&#34;https://glam-workbench.github.io/&#34;&gt;GLAM Workbench&lt;/a&gt; and some of my other work relating to the construction of online access to GLAM collections. Videos and slides are available for both:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;From collections as data to collections as infrastructure: Building the GLAM Workbench&lt;/strong&gt;, seminar for the Centre for Creative and Cultural Research, University of Canberra, 22 February 2021 – &lt;a href=&#34;https://vimeo.com/528145007&#34;&gt;video&lt;/a&gt; (40 minutes) and &lt;a href=&#34;https://slides.com/wragge/uc-cccr-glamworkbench&#34;&gt;slides&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;iframe src=&#34;https://player.vimeo.com/video/528145007&#34; width=&#34;640&#34; height=&#34;360&#34; frameborder=&#34;0&#34; allow=&#34;autoplay; fullscreen; picture-in-picture&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Building the GLAM Workbench&lt;/strong&gt; (and various other projects such as &lt;a href=&#34;https://www.realfaceofwhiteaustralia.net/&#34;&gt;The Real Face of White Australia&lt;/a&gt;, &lt;a href=&#34;http://closedaccess.herokuapp.com/&#34;&gt;Closed Access&lt;/a&gt;, and &lt;a href=&#34;https://owebrowse.herokuapp.com/redactions/&#34;&gt;redacted&lt;/a&gt;), guest lecture for the &lt;a href=&#34;https://wiki.epfl.ch/cultural.data.sculpting&#34;&gt;Cultural Data Sculpting&lt;/a&gt; course, EPFL, Switzerland, 18 March 2021 – &lt;a href=&#34;https://vimeo.com/525872948&#34;&gt;video&lt;/a&gt; (1hr 40mins) and &lt;a href=&#34;https://slides.com/wragge/data-sculpting-2021/&#34;&gt;slides&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I’ve also updated the &lt;a href=&#34;https://glam-workbench.github.io/presentations/&#34;&gt;presentations&lt;/a&gt; page in the GLAM Workbench. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Some GLAM Workbench datasets to explore for Open Data Day</title>
      <link>https://updates.timsherratt.org/2021/03/08/some-glam-workbench.html</link>
      <pubDate>Mon, 08 Mar 2021 14:54:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/03/08/some-glam-workbench.html</guid>
      <description>&lt;p&gt;It was Open Data Day on Saturday 6 March – here are some of the ready-to-go datasets you can find in the &lt;a href=&#34;https://glam-workbench.github.io/&#34;&gt;GLAM Workbench&lt;/a&gt; – there’s something for historians, humanities researchers, teachers &amp;amp; more!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;First here’s a &lt;a href=&#34;https://glam-workbench.github.io/glam-data-list/&#34;&gt;list of Australian GLAM (that’s galleries, libraries, archives &amp;amp; museums) data sources&lt;/a&gt;. It includes APIs, portals, and downloadable datasets. Suggested additions welcome!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;There’s also a &lt;a href=&#34;https://glam-workbench.github.io/glam-datasets-from-gov-portals/&#34;&gt;list of Australian GLAM datasets that are available through government open data portals&lt;/a&gt;. There are hundreds of them, but they’re not always easy to find. Convicts, immigration, hospitals, WWI – includes lots of useful biographical data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you’re not sure where to start with a list of 600 CSV files, have a look at the &lt;a href=&#34;https://glam-workbench.github.io/csv-explorer/&#34;&gt;GLAM CSV Explorer&lt;/a&gt;! Select a file and this Jupyter-powered app will build a series of visualisations based on the contents of each column.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;While they’re not yet in an open data portal, NSW State Archives has a rich collection of indexes transcribed by volunteers. I’ve scraped 64 indexes, with over 1.4 million rows of data and &lt;a href=&#34;https://glam-workbench.github.io/nsw-state-archives/#nsw-state-archives-online-indexes&#34;&gt;put them in a repository for easy download&lt;/a&gt;. There’s even a &lt;a href=&#34;https://glam-workbench.github.io/nsw-state-archives/#nsw-state-archives-index-explorer&#34;&gt;version of the CSV Explorer&lt;/a&gt;, just for the NSW State Archives indexes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Here’s a &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/#csv-formatted-list-of-australian-womens-weekly-issues-1933-to-1982&#34;&gt;CSV file&lt;/a&gt; containing details of every issue of the Australian Women’s Weekly in Trove.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/#australian-womens-weekly-front-covers-1933-to-1982&#34;&gt;collection of front covers from the Australian Women’s Weekly from 1933 to 1982&lt;/a&gt;! That’s 2,566 images you can download from Cloudstor or browse in a series of convenient PDFs.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/e829a7a078.jpg&#34; width=&#34;600&#34; height=&#34;376&#34; alt=&#34;&#34; /&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Here’s a &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/#trove-newspapers-with-non-english-language-content&#34;&gt;list of non-English language newspapers&lt;/a&gt; in Trove.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;And another &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers/blob/master/newspapers_post_54.csv&#34;&gt;list of newspapers in Trove&lt;/a&gt; with articles available from beyond the 1954 copyright cliff of death.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;While we’re on newspapers, here’s a &lt;a href=&#34;https://docs.google.com/spreadsheets/d/1rURriHBSf3MocI8wsdl1114t0YeyU0BVSXWeg232MZs/edit?usp=sharing&#34;&gt;spreadsheet&lt;/a&gt; that identifies places of publication or circulation of Trove newspapers, and provides geocoordinates.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What about some text? Here’s &lt;a href=&#34;https://glam-workbench.github.io/trove-books/#ocrd-text-from-trove-books-and-ephemera&#34;&gt;24,620 files of OCRd text from digitised books and ephemera&lt;/a&gt; in Trove. There’s also a &lt;a href=&#34;https://glam-workbench.github.io/trove-books/#csv-formatted-list-of-books-with-ocrd-text&#34;&gt;CSV-formatted list&lt;/a&gt; with the basic details of each book.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;More text! Here’s &lt;a href=&#34;https://glam-workbench.github.io/trove-journals/#ocrd-text-from-trove-digitised-journals&#34;&gt;OCRd text from 26,234 issues of 397 digitised journals&lt;/a&gt; in Trove.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Something different – a &lt;a href=&#34;https://glam-workbench.github.io/trove-journals/#politicians-talking-about-immigrants-and-refugees&#34;&gt;collection of 12,619 press releases &amp;amp; speeches by Australian politicians&lt;/a&gt; that include any of the terms ‘immigrant’, ‘asylum seeker’, ‘boat people’, ‘illegal arrivals’, or ‘boat arrivals’. From the Parliamentary Library via Trove.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Some more images – a &lt;a href=&#34;https://glam-workbench.github.io/trove-journals/#editorial-cartoons-from-the-bulletin-1886-to-1952&#34;&gt;collection of 3,471 full-page editorial cartoons from The Bulletin&lt;/a&gt;, 1886 to 1952 (with a warning for racist content). Available both as individual images and compiled into PDFs.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;From the ABC via Trove, there’s &lt;a href=&#34;https://glam-workbench.github.io/trove-music/#abc-radio-national-programs&#34;&gt;400,000 records from Radio National programs broadcast since the late 1990s&lt;/a&gt;. That includes every segment broadcast on AM, PM, RN Breakfast etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;This might be handy – from some work I’m doing with ANU Archives, here’s a &lt;a href=&#34;https://github.com/GLAM-Workbench/anu-archives/blob/master/nsw_holidays_1900_1950.csv&#34;&gt;CSV file containing details of holidays in NSW from 1901 to 1950&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The Department of Prime Minister and Cabinet provides XML versions of more than 20,000 speeches &amp;amp; interviews from recent PMs for download. I’ve &lt;a href=&#34;https://glam-workbench.github.io/pm-transcripts/&#34;&gt;saved them to a repository and compiled some indexes&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;And finally – Commonwealth Hansard from the Parliamentary Library – lots of well-structured XML files! I’ve &lt;a href=&#34;https://glam-workbench.github.io/hansard/&#34;&gt;created a repo&lt;/a&gt; with one file for each sitting day from 1901 to 1980 &amp;amp; 1998 to 2005 (hopefully the gap will be filled soon). There’s also a &lt;a href=&#34;https://glam-workbench.github.io/hansard/#list-of-sitting-days-1901-to-2005&#34;&gt;CSV index to sitting days&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And if that’s not enough data, the GLAM Workbench provides tools to help you create your own datasets from Trove, the National Archives of Australia, the National Museum of Australia, Archives NZ, DigitalNZ, &amp;amp; more! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2021/02/11/the-naa-recently.html</link>
      <pubDate>Thu, 11 Feb 2021 10:24:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/02/11/the-naa-recently.html</guid>
      <description>&lt;p&gt;The NAA recently changed field labels in RecordSearch, so that ‘Barcode’ is now ‘Item ID’. This required an update to my &lt;a href=&#34;https://github.com/wragge/recordsearch_tools&#34;&gt;&lt;code&gt;recordsearch_tools&lt;/code&gt; screen scraper&lt;/a&gt;. I also had to make a few changes in the &lt;a href=&#34;https://glam-workbench.github.io/recordsearch/&#34;&gt;RecordSearch section&lt;/a&gt; of the GLAM Workbench. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New! DigitalNZ API Query Builder added to GLAM Workbench</title>
      <link>https://updates.timsherratt.org/2021/02/03/new-digitalnz-api.html</link>
      <pubDate>Wed, 03 Feb 2021 10:08:17 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/02/03/new-digitalnz-api.html</guid>
      <description>&lt;p&gt;I’ve added an API Query Builder to the &lt;a href=&#34;https://glam-workbench.github.io/digitalnz/&#34;&gt;DigitalNZ section of the GLAM Workbench&lt;/a&gt;. You can use it to learn about the different parameters available from the search API, and experiment with different queries. Just get your API key from DigitalNZ, then try entering keywords and selecting options. Once you understand how the API works, you can start thinking about how you can make use of it in your own projects.&lt;/p&gt;
&lt;p&gt;👉🏻 &lt;a href=&#34;https://mybinder.org/v2/gh/GLAM-Workbench/digitalnz/master?urlpath=voila%2Frender%2Fbuild_api_query.ipynb&#34;&gt;Try it out live on Binder!&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Under the hood the API Query Builder is a Jupyter notebook (of course), but it uses &lt;a href=&#34;https://ipyvuetify.readthedocs.io/en/latest/index.html&#34;&gt;ipyvuetify&lt;/a&gt; to create good-looking, responsive form widgets. It’s intended to be run using &lt;a href=&#34;https://voila.readthedocs.io/en/stable/index.html&#34;&gt;Voilà&lt;/a&gt;, which turns notebooks into interactive apps and dashboards. You can now run any Jupyter notebook &lt;a href=&#34;https://voila.readthedocs.io/en/stable/deploy.html#deployment-on-binder&#34;&gt;using Voilà on Binder&lt;/a&gt;, just by changing the URL.&lt;/p&gt;
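&lt;p&gt;As a rough sketch (not part of the GLAM Workbench itself), that URL rewriting is simple enough to script. This assumes Binder&amp;rsquo;s standard &lt;code&gt;/v2/gh/&lt;/code&gt; launch pattern and the &lt;code&gt;urlpath&lt;/code&gt; parameter shown in the link above; the function name is just for illustration:&lt;/p&gt;

```python
from urllib.parse import quote

def binder_voila_url(org, repo, notebook, branch="master"):
    """Build a Binder link that opens a notebook as a Voila app.

    The urlpath parameter routes the request through Voila's
    renderer instead of the normal Jupyter interface.
    """
    # Percent-encode the slashes so the whole path travels as one value
    urlpath = quote(f"voila/render/{notebook}", safe="")
    return f"https://mybinder.org/v2/gh/{org}/{repo}/{branch}?urlpath={urlpath}"

print(binder_voila_url("GLAM-Workbench", "digitalnz", "build_api_query.ipynb"))
# https://mybinder.org/v2/gh/GLAM-Workbench/digitalnz/master?urlpath=voila%2Frender%2Fbuild_api_query.ipynb
```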
&lt;p&gt;If this app seems useful (let me know!) I might put a version on Heroku so the start up time is reduced. I’m also thinking of using this sort of pattern to create apps for other APIs in the GLAM Workbench. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/48f0490cce.jpg&#34; width=&#34;600&#34; height=&#34;445&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title>OpenGLAM fireworks! Finding open collections in DigitalNZ</title>
      <link>https://updates.timsherratt.org/2021/01/28/openglam-fireworks-finding.html</link>
      <pubDate>Thu, 28 Jan 2021 11:29:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/01/28/openglam-fireworks-finding.html</guid>
      <description>&lt;p&gt;Lately I’ve been updating and expanding the notebooks in the &lt;a href=&#34;https://glam-workbench.github.io/digitalnz/&#34;&gt;DigitalNZ section&lt;/a&gt; of the GLAM Workbench. In particular, I’ve been looking at the &lt;code&gt;usage&lt;/code&gt; facet to understand how much of the aggregated content is ‘open’. What do I mean by ‘open’? The &lt;a href=&#34;https://opendefinition.org/&#34;&gt;Open Knowledge Foundation definition&lt;/a&gt; states that ‘open data and content can be freely used, modified, and shared by anyone for any purpose’. Obviously things that are in the public domain, such as out-of-copyright resources, are open. But so are resources with an &lt;a href=&#34;https://opendefinition.org/licenses/&#34;&gt;open licence&lt;/a&gt; such as &lt;a href=&#34;https://opendefinition.org/licenses/cc-by&#34;&gt;CC-BY&lt;/a&gt; or &lt;a href=&#34;https://opendefinition.org/licenses/cc-by-sa&#34;&gt;CC-BY-SA&lt;/a&gt;. The Creative Commons ‘Non-commercial’ and ‘No derivatives’ licences are &lt;em&gt;not&lt;/em&gt; open because they put limits on how you can use resources.&lt;/p&gt;
&lt;p&gt;How does this definition map to DigitalNZ? The &lt;code&gt;usage&lt;/code&gt; facet includes &lt;a href=&#34;https://github.com/GLAM-Workbench/digitalnz/blob/master/facets/usage.csv&#34;&gt;five values&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Share&lt;/li&gt;
&lt;li&gt;Modify&lt;/li&gt;
&lt;li&gt;Use commercially&lt;/li&gt;
&lt;li&gt;All rights reserved&lt;/li&gt;
&lt;li&gt;Unknown&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These values have been assigned by DigitalNZ based on the &lt;a href=&#34;https://github.com/GLAM-Workbench/digitalnz/blob/master/facets/rights.csv&#34;&gt;35,000 different rights statements&lt;/a&gt; and &lt;a href=&#34;https://github.com/GLAM-Workbench/digitalnz/blob/master/facets/copyright.csv&#34;&gt;30 different copyright statements&lt;/a&gt; that are included in DigitalNZ metadata records. I find I have to turn the &lt;code&gt;usage&lt;/code&gt; values inside out to really understand them. A resource that only allows you to ‘Share’ excludes the ‘Modify’ and ‘Use commercially’ permissions, and so is roughly equivalent to a CC-BY-ND-NC licence. The only open value, according to the definition above, is ‘Use commercially’, which is like CC-BY. I’m assuming that ‘Use commercially’ has been assigned to resources that are either out of copyright (or have no known copyright restrictions) or are openly licensed.&lt;/p&gt;
&lt;p&gt;It’s also worth noting that the ‘usage’ values are not mutually exclusive. A record with a ‘usage’ value of ‘Use commercially’ will also be assigned ‘Share’ and ‘Modify’ values. This is because ‘Use commercially’ includes the &amp;lsquo;Share&amp;rsquo; and ‘Modify’ permissions. This seems a bit counter-intuitive, but makes sense if you think about doing a search for everything you&amp;rsquo;re allowed to share.&lt;/p&gt;
&lt;p&gt;A rough calculation based on the usage facet indicates that 71.76% of the resources aggregated by DigitalNZ are open. That seems pretty good, though a lot of those are probably out-of-copyright newspaper articles from Papers Past. For a more fine-grained analysis, I decided to look at the ‘usage’ data for each combination of ‘content_partner’ and ‘primary_collection’. How open is each individual collection in DigitalNZ?&lt;/p&gt;
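&lt;p&gt;The calculation itself is straightforward. Here&amp;rsquo;s a minimal sketch with made-up facet counts (the real numbers come from the DigitalNZ search API, or the harvested CSVs linked above), assuming that only ‘Use commercially’ counts as open:&lt;/p&gt;

```python
# Hypothetical usage facet counts -- substitute the real values
# harvested from the DigitalNZ search API.
total_records = 32_000_000
usage_counts = {
    "Share": 25_000_000,
    "Modify": 24_000_000,
    "Use commercially": 23_000_000,
    "All rights reserved": 6_000_000,
    "Unknown": 3_000_000,
}

# The usage values overlap ('Use commercially' implies 'Share' and
# 'Modify'), so the facet counts can't just be summed. Only
# 'Use commercially' meets the Open Definition, so the open share is:
open_share = usage_counts["Use commercially"] / total_records * 100
print(f"{open_share:.2f}% open")
```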
&lt;p&gt;For added excitement, and to stretch my knowledge of what &lt;a href=&#34;https://altair-viz.github.io/&#34;&gt;Altair&lt;/a&gt; can do, I decided to visualise the results as a display of colourful fireworks. The higher the explosion, the more open the collection! I’m pretty pleased with the result.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://glam-workbench.github.io/images/dnz-fireworks.png&#34;&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/533add5b13.jpg&#34; width=&#34;600&#34; height=&#34;174&#34; alt=&#34;&#34; /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve saved &lt;a href=&#34;http://timsherratt.org/shed/digitalnz/open_collections_digitalnz.html&#34;&gt;an HTML version of the chart&lt;/a&gt; so you can mouse over the explosions for more details. All the code is &lt;a href=&#34;https://glam-workbench.github.io/digitalnz/#visualising-open-collections-in-digitalnz&#34;&gt;included in this notebook&lt;/a&gt;, along with a &lt;a href=&#34;https://github.com/GLAM-Workbench/digitalnz/blob/master/facets/usage_by_collection_and_partner.csv&#34;&gt;CSV file&lt;/a&gt; containing all the harvested facet data. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New dataset and notebooks – twenty years of ABC Radio National</title>
      <link>https://updates.timsherratt.org/2021/01/18/new-dataset-twenty.html</link>
      <pubDate>Mon, 18 Jan 2021 10:27:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2021/01/18/new-dataset-twenty.html</guid>
      <description>&lt;p&gt;There’s a &lt;a href=&#34;https://glam-workbench.github.io/trove-music/&#34;&gt;new GLAM Workbench section&lt;/a&gt; for working with data from Trove’s Music &amp;amp; Sound zone!&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Inside you&amp;rsquo;ll find out how to harvest all the metadata from ABC Radio National program records – that&amp;rsquo;s 400,000+ records, from 160 Radio National programs, over more than 20 years.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;It’s metadata only, so not full transcripts or audio, though there are links back to the ABC site where you might find transcripts. Most records should at least have a title, a date, the name of the program it was broadcast on, a list of contributors, and perhaps a brief abstract/summary. It&amp;rsquo;s also worth noting that many of these records, particularly those from the main current affairs programs, represent individual stories or segments – so they provide a detailed record of the major news stories for the last couple of decades!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/43f5a3e067.jpg&#34; width=&#34;600&#34; height=&#34;365&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/c1f9a22191.jpg&#34; width=&#34;600&#34; height=&#34;243&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.github.io/trove-music/#harvest-abc-radio-national-records-from-trove&#34;&gt;harvesting notebook&lt;/a&gt; shows you how to get the data from the Trove API. There are a number of duplicate records, and some inconsistencies in the way the data is formatted, so the harvesting code tries to clean things up a bit. You can of course adjust this to meet your own needs.&lt;/p&gt;
&lt;p&gt;If you don&amp;rsquo;t want to do the harvesting yourself, there are &lt;a href=&#34;https://glam-workbench.github.io/trove-music/#abc-radio-national-programs&#34;&gt;pre-harvested datasets&lt;/a&gt; that you can download immediately from Cloudstor and start exploring. The complete harvest of all 400,000+ records is available both in JSONL (newline-separated JSON) and CSV formats. There&amp;rsquo;s also a series of separate datasets for the most frequently occurring programs: RN Breakfast, RN Drive, AM, PM, The World Today, Late Night Live, Life Matters, and the Science Show.&lt;/p&gt;
&lt;p&gt;There’s also a &lt;a href=&#34;https://glam-workbench.github.io/trove-music/#exploring-abc-radio-national-metadata&#34;&gt;notebook&lt;/a&gt; that demonstrates a few possible ways you might start to play with the data – looking at the range of programs, the distribution of records over time, the people involved in each story, and words in the titles of each segment.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/228b9e1e74.jpg&#34; width=&#34;600&#34; height=&#34;506&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/61b8d46dbe.jpg&#34; width=&#34;600&#34; height=&#34;300&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This is a very rich source of data for examining Australia&amp;rsquo;s political and social history over the last twenty years. Dive in and see what you can find! #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2021/2e05accfd1.jpg&#34; width=&#34;403&#34; height=&#34;600&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title>GLAM Workbench wins British Library Labs Research Award!</title>
      <link>https://updates.timsherratt.org/2020/12/16/glam-workbench-wins.html</link>
      <pubDate>Wed, 16 Dec 2020 11:46:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/12/16/glam-workbench-wins.html</guid>
      <description>&lt;p&gt;&lt;strong&gt;Asking questions with web archives – introductory notebooks for historians&lt;/strong&gt; has won the British Library Labs Research Award for 2020. &lt;a href=&#34;https://data.bl.uk/bl_labs_awards/index.html&#34;&gt;The awards&lt;/a&gt; recognise &amp;lsquo;exceptional projects that have used the Library’s digital collections and data&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;This project gave me a chance to work with web archives collections and staff from the British Library, the National Library of Australia, and the National Library of New Zealand, and was &lt;a href=&#34;https://netpreserve.org/projects/jupyter-notebooks-for-historians/&#34;&gt;supported&lt;/a&gt; by the International Internet Preservation Consortium&amp;rsquo;s Discretionary Funding Program.&lt;/p&gt;
&lt;p&gt;We developed a range of tools, examples, and documentation to help researchers use and explore the vast historical resources available through web archives. A new &lt;a href=&#34;https://glam-workbench.github.io/web-archives/&#34;&gt;web archives section&lt;/a&gt; was added to the GLAM Workbench, and 16 Jupyter notebooks, combining text, images, and live code, were created.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a &lt;a href=&#34;https://youtu.be/qhaRQ0LxNAo&#34;&gt;30 second summary&lt;/a&gt; of the project!&lt;/p&gt;
&lt;p&gt;The judges noted:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;“The panel were impressed with the level of documentation and thought that went into how to work computationally through Jupyter notebooks with web archives which are challenging to work with because of their size. These tools were some of the first of their kind.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;“The Labs Advisory Board wanted to acknowledge and reward the incredible work of Tim Sherratt in particular. Tim you have been a pioneer as a one-person lab over many years and these 16 notebooks are a fine addition to your already extensive suite in your GLAM Workbench. Your work has inspired so many in GLAM, the humanities community, and BL Labs to develop their own notebooks. To our audience, we strongly recommend that you look at the GLAM Workbench if you’re interested in doing computational experiments with many institutions’ data sources.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thanks to Andy, Olga, Alex, and Ben for your advice and support. And thanks to the British Library Labs for the award! #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The GLAM Workbench as research infrastructure (some basic stats)</title>
      <link>https://updates.timsherratt.org/2020/12/15/the-glam-workbench.html</link>
      <pubDate>Tue, 15 Dec 2020 10:43:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/12/15/the-glam-workbench.html</guid>
      <description>&lt;p&gt;Repositories in the &lt;a href=&#34;https://glam-workbench.github.io/&#34;&gt;GLAM Workbench&lt;/a&gt; have been launched on &lt;a href=&#34;https://mybinder.org/&#34;&gt;Binder&lt;/a&gt; 3,529 times since the start of this year (according to data from the &lt;a href=&#34;https://archive.analytics.mybinder.org&#34;&gt;Binder Events log&lt;/a&gt;). That’s repository launches, not notebooks. Having launched a repository, users might use multiple notebooks. And of course these stats don’t include people using the notebooks in contexts other than Binder – on their own machines, servers, or services like AARNet’s SWAN. Or just viewing the notebooks in GitHub and copying code into their own projects.&lt;/p&gt;
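&lt;p&gt;If you want to poke at the launch data yourself, the events log is just newline-delimited JSON. Here&amp;rsquo;s a rough sketch of counting launches per repository – the field names (&lt;code&gt;spec&lt;/code&gt;, &lt;code&gt;provider&lt;/code&gt;) are assumed from the Binder events archive, and the sample events here are invented:&lt;/p&gt;

```python
import json
from collections import Counter

# Invented sample events in the newline-delimited JSON layout of the
# mybinder.org events archive (one launch per line).
events = """\
{"timestamp": "2020-06-01T01:00:00+00:00", "provider": "GitHub", "spec": "GLAM-Workbench/trove-harvester/master", "status": "success"}
{"timestamp": "2020-06-01T02:00:00+00:00", "provider": "GitHub", "spec": "GLAM-Workbench/web-archives/master", "status": "success"}
{"timestamp": "2020-06-01T03:00:00+00:00", "provider": "GitHub", "spec": "GLAM-Workbench/trove-harvester/master", "status": "success"}
""".splitlines()

# Count launches of GLAM Workbench repositories; spec is 'org/repo/ref'.
launches = Counter()
for line in events:
    event = json.loads(line)
    org, repo, _ref = event["spec"].split("/", 2)
    if org == "GLAM-Workbench":
        launches[repo] += 1

print(launches.most_common())
# [('trove-harvester', 2), ('web-archives', 1)]
```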
&lt;p&gt;I’m suspicious of web stats, but the Binder data indicates that people have actually done more than ‘visit’ – they’ve spun up a Binder session ready to do some exploration.&lt;/p&gt;
&lt;p&gt;Every Jupyter notebook in the GLAM Workbench has a link that opens the notebook in Binder. If you click on the link, Binder reads configuration details from the repository and loads a customised computing environment. All in your browser! That means you can start using the GLAM Workbench without installing any software. Just click on the Binder link and start exploring!&lt;/p&gt;
&lt;p&gt;There are about &lt;a href=&#34;https://github.com/GLAM-Workbench/&#34;&gt;40 different repositories&lt;/a&gt; in the GLAM Workbench, helping you work with data from Trove, DigitalNZ, NAA, SLNSW, NSW Archives, NMA, ArchivesNZ, ANU Archives &amp;amp; more! The image below shows them ranked by number of Binder launches this year.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.github.io/web-archives/&#34;&gt;web archives section&lt;/a&gt; was added this year in collaboration with the IIPC, the UK Web Archive, the Australian Web Archive, and the NZ Web Archive. Its annual number of launches is inflated a bit by the development process. But there have been 426 launches since it went public in June.&lt;/p&gt;
&lt;p&gt;I’m really pleased to see the &lt;a href=&#34;https://glam-workbench.github.io/trove-harvester/&#34;&gt;Trove newspaper harvester&lt;/a&gt; up near the top. At least once a day (on average) someone’s been firing up the repository to grab Trove newspaper articles in bulk.&lt;/p&gt;
&lt;p&gt;Overall, that’s about 11 GLAM Workbench repository launches a day on Binder. It might not seem like much, but that’s 11 research opportunities that didn’t exist before, 11 GLAM collections opened to exploration, 11 researchers building their digital skills…&lt;/p&gt;
&lt;p&gt;As humanities researchers continue to learn about the possibilities of GLAM data and develop their digital skills, the numbers will grow. It’s a start. And a reminder that not all research infrastructure needs to be built in Go8 unis, by large teams, with $millions. We can all contribute by sharing our tools and methods. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/5db095f0fe.jpg&#34; width=&#34;600&#34; height=&#34;592&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/11/27/earlier-this-year.html</link>
      <pubDate>Fri, 27 Nov 2020 16:02:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/11/27/earlier-this-year.html</guid>
      <description>&lt;p&gt;Earlier this year I gave a seminar for the International Internet Preservation Consortium (IIPC) introducing the &lt;a href=&#34;https://glam-workbench.github.io/web-archives/&#34;&gt;web archives section&lt;/a&gt; of the GLAM Workbench. The seminar is now available online: &lt;a href=&#34;https://youtu.be/rVidh_wexoo&#34;&gt;youtu.be/rVidh_wex&amp;hellip;&lt;/a&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/51f7e056dd.jpg&#34; width=&#34;600&#34; height=&#34;397&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;Here are &lt;a href=&#34;https://slides.com/wragge/iipc-jupyter&#34;&gt;the slides&lt;/a&gt; if you want to follow along. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Harvest text from the Australian Women&#39;s Weekly!</title>
      <link>https://updates.timsherratt.org/2020/11/25/harvest-text-from.html</link>
      <pubDate>Wed, 25 Nov 2020 15:52:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/11/25/harvest-text-from.html</guid>
      <description>&lt;p&gt;The Trove Newspaper &amp;amp; Gazette Harvester has been updated to version 0.4.0. The major change is that if the OCRd text for an article isn&amp;rsquo;t available through the API, it will be automatically downloaded via the web interface. What does this mean in practice? Well previously you couldn&amp;rsquo;t harvest OCRd text from the &lt;em&gt;Australian Women&amp;rsquo;s Weekly&lt;/em&gt; because it&amp;rsquo;s not included in API results, but now you can!&lt;/p&gt;
&lt;p&gt;You don&amp;rsquo;t need to do anything differently. If there are AWW articles in your search, and you ask for all the OCRd text using the &lt;code&gt;--text&lt;/code&gt; option, the AWW text files will automagically appear in your harvest.&lt;/p&gt;
&lt;p&gt;Under the hood, I&amp;rsquo;ve started using &lt;a href=&#34;https://pypi.org/project/html2text/&#34;&gt;html2text&lt;/a&gt; to remove tags from the OCRd text. I think this should produce more consistent results. As previously, line breaks are removed by default from the OCRd text files. However, I&amp;rsquo;ve now added a &lt;code&gt;--include_linebreaks&lt;/code&gt; option if you&amp;rsquo;d like to keep them. This generally produces text that is more human-readable, but note that the line breaks produced by OCR aren&amp;rsquo;t always accurate.&lt;/p&gt;
&lt;p&gt;Head to the &lt;a href=&#34;https://glam-workbench.github.io/trove-harvester/&#34;&gt;GLAM Workbench to try it out&lt;/a&gt;, or &lt;a href=&#34;https://pypi.org/project/troveharvester/0.4.0/&#34;&gt;download the code from PyPi&lt;/a&gt;. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Beyond the copyright cliff of death</title>
      <link>https://updates.timsherratt.org/2020/11/13/beyond-the-copyright.html</link>
      <pubDate>Fri, 13 Nov 2020 09:56:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/11/13/beyond-the-copyright.html</guid>
      <description>&lt;p&gt;If you&amp;rsquo;ve done any searching in Trove&amp;rsquo;s digitised newspapers, you&amp;rsquo;ve probably noticed that there aren&amp;rsquo;t many results after 1954. This is basically because of copyright restrictions (though given the complexities of Australia&amp;rsquo;s copyright system, you can&amp;rsquo;t be sure that everything published before 1955 is &lt;em&gt;out&lt;/em&gt; of copyright). We can visualise the impact of this by looking at the number of newspaper articles in Trove by year.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/dcf55d9941.jpg&#34; width=&#34;600&#34; height=&#34;333&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;You can see why I started referring to it as the &lt;strong&gt;copyright cliff of death&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;But you can also see a little trickle of articles continuing post-1954. The number of newspapers from beyond the copyright cliff of death continues to increase as agreements are made with publishers to put them online. I just checked and there are now 83 newspapers that have at least &lt;em&gt;some&lt;/em&gt; post-1954 articles available. Here&amp;rsquo;s the top 10 (by number of articles).&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/95e9d27bd6.jpg&#34; width=&#34;600&#34; height=&#34;223&#34; alt=&#34;&#34; /&gt;
&lt;p&gt;If you&amp;rsquo;d like to browse the full list of post-1954 newspapers, &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers/blob/master/newspapers_post_54.csv&#34;&gt;here&amp;rsquo;s the data&lt;/a&gt; as a CSV (spreadsheet) file.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;d like to see how I generated this list, have a look &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/#beyond-the-copyright-cliff-of-death&#34;&gt;at this notebook&lt;/a&gt; in the Trove Newspapers section of the GLAM Workbench.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;d like to know how I created the chart above, have a look at &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/#visualise-the-total-number-of-newspaper-articles-in-trove-by-year-and-state&#34;&gt;Visualise the total number of newspaper articles in Trove by year and state&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Questions? Ask away at &lt;a href=&#34;https://ozglam.chat/t/trove-newspapers-and-the-copyright-cliff-of-death/94?u=wragge&#34;&gt;OzGLAM Help&lt;/a&gt;. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/10/26/ive-added-a.html</link>
      <pubDate>Mon, 26 Oct 2020 17:15:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/10/26/ive-added-a.html</guid>
      <description>&lt;p&gt;I’ve added a &lt;a href=&#34;https://glam-workbench.github.io/anu-archives/&#34;&gt;new section to the GLAM Workbench&lt;/a&gt; for the ANU Archives. The first set of notebooks relates to the &lt;a href=&#34;http://archivescollection.anu.edu.au/index.php/or59j&#34;&gt;Sydney Stock exchange stock and share lists&lt;/a&gt;. As the content note describes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;These are large format bound volumes of the official lists that were posted up for the public to see - 3 times a day - forenoon, noon and afternoon - at the close of the trading session in the call room at the Sydney Stock Exchange. The closing prices of stocks and shares were entered in by hand on pre-printed sheets.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The volumes have been digitised, resulting in a collection of 70,000+ high resolution images. You can browse the details of each volume using &lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/anu-archives/blob/master/stock-exchange-details-by-volume.ipynb&#34;&gt;this notebook&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I’ve been exploring ways of getting useful, machine-readable data out of the images. There’s more information about the processes involved in &lt;a href=&#34;https://github.com/wragge/sydney-stock-exchange&#34;&gt;this repository&lt;/a&gt;. I’ve also been working on improving the metadata and have managed to assign a date and session (Morning, Noon, or Afternoon) to each page. With these, we can start to explore the content!&lt;/p&gt;
&lt;p&gt;One of the notebooks creates a &lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/anu-archives/blob/master/stock-exchange-pages-calendar.ipynb&#34;&gt;calendar-like view&lt;/a&gt; of the whole collection, showing the number of pages surviving from each trading day. This makes it easy to find the gaps and changes in process. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/00cfdb0dde.jpg&#34; width=&#34;600&#34; height=&#34;470&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/10/26/ive-added-more.html</link>
      <pubDate>Mon, 26 Oct 2020 16:52:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/10/26/ive-added-more.html</guid>
      <description>&lt;p&gt;I’ve added more years to my &lt;a href=&#34;https://github.com/wragge/hansard-xml&#34;&gt;repository of Commonwealth Hansard&lt;/a&gt;! The repository now includes XML-formatted text files for both houses from 1901 to 1980, and 1998 to 2005. I’ve done some more checking and confirmed that the XML files for 1981 to 1997 aren&amp;rsquo;t currently available through ParlInfo, however, the Parliamentary Library are looking into it. I’ve also created a &lt;a href=&#34;https://github.com/GLAM-Workbench/australian-commonwealth-hansard/blob/master/data/all-sitting-days.csv&#34;&gt;CSV-formatted list of sitting days&lt;/a&gt; from 1901 to 2005 (based on ParlInfo search results). Details of the harvesting process are available &lt;a href=&#34;https://glam-workbench.github.io/hansard/&#34;&gt;in the GLAM Workbench&lt;/a&gt;. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/08/14/another-glamworkbench-update.html</link>
      <pubDate>Fri, 14 Aug 2020 18:29:50 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/08/14/another-glamworkbench-update.html</guid>
      <description>&lt;p&gt;Another #GLAMWorkbench update! Snip words out of @TroveAustralia newspaper pages and create big composite images. OCR art! &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/#create-large-composite-images-from-snipped-words&#34;&gt;glam-workbench.github.io/trove-new&amp;hellip;&lt;/a&gt; #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/84a098fba1.jpg&#34; width=&#34;600&#34; height=&#34;380&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/30/ok-so-do.html</link>
      <pubDate>Thu, 30 Jul 2020 12:37:16 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/30/ok-so-do.html</guid>
      <description>&lt;p&gt;Ok, so do you want to make your own ‘scissors &amp;amp; paste’ messages using words from @TroveAustralia  newspaper articles? Go to &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/#create-scissors-and-paste-messages-from-trove-newspaper-articles&#34;&gt;the notebook&lt;/a&gt; in #GLAMWorkbench &amp;amp; click on ‘Run live on Binder in Appmode’. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/f81fa29de5.jpg&#34; width=&#34;600&#34; height=&#34;534&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/29/another-glamworkbench-update.html</link>
      <pubDate>Wed, 29 Jul 2020 14:17:56 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/29/another-glamworkbench-update.html</guid>
      <description>&lt;p&gt;Another #GLAMWorkbench update! The Trove Harvester will now download both newspaper &lt;em&gt;and gazette&lt;/em&gt; articles in bulk. You can optionally include full text, and save copies of the articles as images and PDFs. #dhhacks &lt;a href=&#34;https://glam-workbench.github.io/trove-harvester/&#34;&gt;glam-workbench.github.io/trove-har&amp;hellip;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/6517b75846.jpg&#34; width=&#34;600&#34; height=&#34;401&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/f24a2a85a7.jpg&#34; width=&#34;600&#34; height=&#34;418&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/8512bd6dcc.jpg&#34; width=&#34;600&#34; height=&#34;426&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/28/interested-in-using.html</link>
      <pubDate>Tue, 28 Jul 2020 10:27:51 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/28/interested-in-using.html</guid>
      <description>&lt;p&gt;Interested in using web archives in your research? Join us on 5/6 August for a free @netpreserve webinar introducing the tools and examples available in the new #webarchives section of the #GLAMWorkbench. There are two timeslots to cover multiple timezones: &lt;a href=&#34;https://www.eventbrite.com/e/iipc-rss-webinar-jupyter-notebooks-for-web-archives-i-tickets-111349651806&#34;&gt;www.eventbrite.com/e/iipc-rs&amp;hellip;&lt;/a&gt; and &lt;a href=&#34;https://www.eventbrite.com/e/iipc-rss-webinar-jupyter-notebooks-for-web-archives-ii-tickets-112728556146&#34;&gt;www.eventbrite.com/e/iipc-rs&amp;hellip;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/7c05a95c47.jpg&#34; width=&#34;600&#34; height=&#34;600&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/2b04bdefc6.jpg&#34; width=&#34;600&#34; height=&#34;534&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/97c4d924f2.jpg&#34; width=&#34;600&#34; height=&#34;255&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/27/introducing-a-brand.html</link>
      <pubDate>Mon, 27 Jul 2020 18:46:40 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/27/introducing-a-brand.html</guid>
      <description>&lt;p&gt;Introducing a brand new section of the #GLAMWorkbench, exploring the @MuseumsVictoria collection API. Harvest species records, display random images, and download ALL THE ANTECHINUSES! &lt;a href=&#34;https://glam-workbench.github.io/museumsvictoria/&#34;&gt;glam-workbench.github.io/museumsvi&amp;hellip;&lt;/a&gt; #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/49e33f96fa.jpg&#34; width=&#34;504&#34; height=&#34;600&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/4c51985cb0.jpg&#34; width=&#34;600&#34; height=&#34;510&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/8e81a5f34f.jpg&#34; width=&#34;600&#34; height=&#34;369&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/27/new-additions-to.html</link>
      <pubDate>Mon, 27 Jul 2020 16:32:34 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/27/new-additions-to.html</guid>
      <description>&lt;p&gt;New additions to the @TroveAustralia books section of the #GLAMWorkbench – word frequency examples with OCRd text from digitised books, and a random recipe generator powered by a 19th C cook book! &lt;a href=&#34;https://glam-workbench.github.io/trove-books/&#34;&gt;glam-workbench.github.io/trove-boo&amp;hellip;&lt;/a&gt; #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/70f278df90.jpg&#34; width=&#34;600&#34; height=&#34;485&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/5fdbedbcea.jpg&#34; width=&#34;600&#34; height=&#34;229&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/27/with-the-recent.html</link>
      <pubDate>Mon, 27 Jul 2020 11:52:18 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/27/with-the-recent.html</guid>
      <description>&lt;p&gt;With the recent changes to @TroveAustralia, the Australian Women’s Weekly cover browser was retired. As a low-tech alternative, I’ve harvested all the cover images from the Women&amp;rsquo;s Weekly and saved them into PDFs for easy browsing, one for each decade. There are 2,566 images from 1933 to 1982.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/0j6zpeuw6tbey5k/aww-1933-1939.pdf?dl=0&#34;&gt;1933 to 1939&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/y1he8dd6h655weu/aww-1940-1949.pdf?dl=0&#34;&gt;1940 to 1949&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/i9gp9i51nofmlqo/aww-1950-1959.pdf?dl=0&#34;&gt;1950 to 1959&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/2of63tovcnphijo/aww-1960-1969.pdf?dl=0&#34;&gt;1960 to 1969&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/f2yxpg8u4dx5uf2/aww-1970-1979.pdf?dl=0&#34;&gt;1970 to 1979&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/xanohtas1fi7eu4/aww-1980-1982.pdf?dl=0&#34;&gt;1980 to 1982&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Just click on the link below each image to explore the complete issue on Trove. You can also download the full collection of images &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/NaKjoKNFOGXXDNN&#34;&gt;from Cloudstor&lt;/a&gt;. There&amp;rsquo;s a &lt;a href=&#34;https://github.com/GLAM-Workbench/trove-newspapers/blob/58307d3ccae4d2c939ecb6aff59944f27d213842/data/aww-issues.csv&#34;&gt;CSV file&lt;/a&gt; containing all the issue metadata.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/#harvest-australian-womens-weekly-covers-or-the-front-pages-of-any-newspaper&#34;&gt;notebook used to harvest the images&lt;/a&gt; is in the Trove newspapers section of the GLAM Workbench. You could easily adapt the notebook to harvest the front pages of any newspaper. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/a8730e6b43.jpg&#34; width=&#34;447&#34; height=&#34;600&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/17/the-trove-books.html</link>
      <pubDate>Fri, 17 Jul 2020 23:11:35 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/17/the-trove-books.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.github.io/trove-books/&#34;&gt;Trove books section&lt;/a&gt; of the #GLAMWorkbench has been updated. There&amp;rsquo;s a fresh harvest of OCRd text &amp;amp; the notebooks have been changed to work with the new @TroveAustralia interface. Download &amp;amp; explore 24,620 files (3gb) of OCRd text! #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/766d9155ba.jpg&#34; width=&#34;600&#34; height=&#34;527&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/17/revisiting-my-historic.html</link>
      <pubDate>Fri, 17 Jul 2020 17:07:44 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/17/revisiting-my-historic.html</guid>
      <description>&lt;p&gt;Revisiting my Historic Hansard XML repository &amp;amp; realising how easy it is to load files as needed via the GitHub API &amp;amp; explore with Pandas &amp;amp; Jupyter. This #GLAMWorkbench &lt;a href=&#34;https://glam-workbench.github.io/hansard/#convert-a-years-worth-of-historic-hansard-into-a-dataframe-for-analysis&#34;&gt;notebook&lt;/a&gt; helps you explore a particular year/house. #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/2839d8e003.jpg&#34; width=&#34;600&#34; height=&#34;357&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/ec52714756.jpg&#34; width=&#34;600&#34; height=&#34;265&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/a4c3f7667d.jpg&#34; width=&#34;600&#34; height=&#34;396&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/a788b76bb3.jpg&#34; width=&#34;600&#34; height=&#34;436&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/14/the-trove-journals.html</link>
      <pubDate>Tue, 14 Jul 2020 14:31:47 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/14/the-trove-journals.html</guid>
      <description>&lt;p&gt;The &lt;a href=&#34;https://glam-workbench.github.io/trove-journals/&#34;&gt;Trove Journals section&lt;/a&gt; of the #GLAMWorkbench has been updated to work with the new @TroveAustralia interface! I’ve also re-harvested ALL the OCRd text from digitised journals — 6gb of text from 397 journals now downloadable in bulk from CloudStor. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/d9b0c2e64d.jpg&#34; width=&#34;600&#34; height=&#34;264&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/07/12/new-in-glamworkbench.html</link>
      <pubDate>Sun, 12 Jul 2020 14:18:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/07/12/new-in-glamworkbench.html</guid>
      <description>&lt;p&gt;New in #GLAMWorkbench! After you’ve used the @TroveAustralia Newspaper Harvester to download lots &amp;amp; lots of articles, try exploring the results in Datasette. &lt;a href=&#34;https://glam-workbench.github.io/trove-harvester/#display-the-results-of-a-harvest-as-a-searchable-database-using-datasette&#34;&gt;This notebook&lt;/a&gt; sets everything up, you can even add full text search &amp;amp; images! #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/43c8c5e1e7.jpg&#34; width=&#34;600&#34; height=&#34;428&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/06/29/download-newspaper-articles.html</link>
      <pubDate>Mon, 29 Jun 2020 10:48:53 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/06/29/download-newspaper-articles.html</guid>
      <description>&lt;p&gt;Download newspaper articles in bulk! The Trove Newspaper Harvester has been updated to work with the new @TroveAustralia interface. I’ve also added the ability to save articles as .jpg images! The easiest way to get started is &lt;a href=&#34;https://glam-workbench.github.io/trove-harvester/&#34;&gt;via the #GLAMWorkbench&lt;/a&gt;. #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/fe9cb9a58b.jpg&#34; width=&#34;600&#34; height=&#34;487&#34; alt=&#34;Screenshot of Trove Harvester page in GLAM Workbench&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/5b8395e147.jpg&#34; width=&#34;600&#34; height=&#34;255&#34; alt=&#34;Screenshot of TroveHarvester web app&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/39a828a8c3.jpg&#34; width=&#34;600&#34; height=&#34;270&#34; alt=&#34;Details of image file naming scheme&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/af98a9856b.jpg&#34; width=&#34;600&#34; height=&#34;302&#34; alt=&#34;Thumbnails of newspaper articles saved as images&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New GLAM Workbench section on web archives!</title>
      <link>https://updates.timsherratt.org/2020/05/27/new-glam-workbench.html</link>
      <pubDate>Wed, 27 May 2020 14:11:25 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/05/27/new-glam-workbench.html</guid>
      <description>&lt;p&gt;We tend to think of a web archive as a site we go to when links are broken – a useful fallback, rather than a source of new research data. But web archives don&amp;rsquo;t just store old web pages, they capture multiple versions of web resources over time. Using web archives we can observe change – we can ask historical questions.  But web archives store &lt;strong&gt;huge&lt;/strong&gt; amounts of data, and access is often limited for legal reasons. Just knowing what data is available and how to get to it can be difficult. Where do you start?&lt;/p&gt;
&lt;p&gt;The GLAM Workbench’s &lt;a href=&#34;https://glam-workbench.github.io/web-archives/&#34;&gt;new web archives section&lt;/a&gt; can help! Here you’ll find a collection of Jupyter notebooks that  document web archive data sources and standards, and walk through methods of harvesting, analysing, and visualising that data. It’s a mix of examples, explorations, apps and tools. The notebooks  use existing APIs to get data in manageable chunks, but many of the examples demonstrated can also be scaled up to build substantial datasets for research – you just have to be patient!&lt;/p&gt;
&lt;p&gt;Have you ever wanted to find when a particular fragment of text first appeared in a web page?  Or compare full-page screenshots  of archived sites?  Perhaps you want to explore how the text content of a page has changed over time, or create a side-by-side comparison of web archive captures. There are notebooks to help you with all of these.&lt;/p&gt;
&lt;p&gt;To dig deeper you might want to assemble a dataset of text extracted from archived web pages, construct your own database of archived Powerpoint files, or explore patterns within a whole domain. The notebooks provide a range of approaches that can be extended or modified according to your research questions.&lt;/p&gt;
&lt;p&gt;The development of these notebooks was supported by the International Internet Preservation Consortium&amp;rsquo;s &lt;a href=&#34;http://netpreserve.org/projects/&#34;&gt;Discretionary Funding Programme 2019-2020&lt;/a&gt;, with the participation of the British Library, the National Library of Australia, and the National Library of New Zealand. #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/b678824058.jpg&#34; width=&#34;600&#34; height=&#34;569&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/7db7dee281.jpg&#34; width=&#34;600&#34; height=&#34;600&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/deac772182.jpg&#34; width=&#34;600&#34; height=&#34;132&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/5bfddf3b1a.jpg&#34; width=&#34;600&#34; height=&#34;255&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/05/08/thanks-to-netpreserve.html</link>
      <pubDate>Fri, 08 May 2020 12:18:15 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/05/08/thanks-to-netpreserve.html</guid>
      <description>&lt;p&gt;Thanks to @NetPreserve, I’ve been spending time lately working on a set of web archive exploration notebooks for the #GLAMWorkbench. Here’s &lt;a href=&#34;https://mybinder.org/v2/gh/GLAM-Workbench/web-archives/master?urlpath=%2Fapps%2Fsave_screenshot.ipynb&#34;&gt;an example&lt;/a&gt; to create/compare screenshots of captures. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/3c45289d77.jpg&#34; width=&#34;600&#34; height=&#34;534&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/04/03/the-glam-csv.html</link>
      <pubDate>Fri, 03 Apr 2020 00:19:24 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/04/03/the-glam-csv.html</guid>
      <description>&lt;p&gt;The GLAM CSV Explorer has had a few updates — you can now filter by organisation, and upload your own CSV files! #GLAMWorkbench &lt;a href=&#34;https://mybinder.org/v2/gh/GLAM-Workbench/csv-explorer/master?urlpath=%2Fapps%2Fcsv-explorer.ipynb&#34;&gt;Try it live on Binder.&lt;/a&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/7bdc1f9c9a.jpg&#34; width=&#34;600&#34; height=&#34;523&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title>Buildings might be closed, but the data is open – explore hundreds of datasets from Australian GLAM organisations!</title>
      <link>https://updates.timsherratt.org/2020/03/31/buildings-might-be.html</link>
      <pubDate>Tue, 31 Mar 2020 13:48:07 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/03/31/buildings-might-be.html</guid>
      <description>&lt;p&gt;For a couple of years I’ve been harvesting datasets created or published by Australian GLAM organisations through government data portals. I’ve just completed &lt;a href=&#34;https://glam-workbench.github.io/glam-data-portals/#results-30-march-2020&#34;&gt;the latest harvest&lt;/a&gt;, and there’s now 369 datasets, containing 983 files, from 23 GLAM organisations. 628 of these files are in CSV (spreadsheet) format.&lt;/p&gt;
&lt;p&gt;There’s a number of ways that you can explore the harvested data. You can &lt;a href=&#34;https://glam-workbench.github.io/glam-datasets-from-gov-portals/&#34;&gt;browse a big list of datasets&lt;/a&gt;, or &lt;a href=&#34;https://github.com/GLAM-Workbench/ozglam-data/blob/master/glam-datasets-from-gov-portals.csv&#34;&gt;download a CSV&lt;/a&gt; containing all the harvested data or &lt;a href=&#34;https://github.com/GLAM-Workbench/ozglam-data/blob/master/glam-datasets-from-gov-portals-csvs.csv&#34;&gt;just those formatted as CSVs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;With this harvest I’ve added a &lt;a href=&#34;https://ozglam-datasets.glitch.me/data/glam-datasets-from-gov-portals&#34;&gt;new way of searching and filtering the harvested data&lt;/a&gt; using &lt;a href=&#34;https://github.com/simonw/datasette&#34;&gt;Datasette&lt;/a&gt; running on Glitch. This interface lets you narrow your queries by field or facet, and search text fields for keywords.&lt;/p&gt;
&lt;p&gt;But what’s actually in all those CSV files? If you’d like to start exploring the &lt;em&gt;content&lt;/em&gt; of the datasets, then give my &lt;a href=&#34;https://glam-workbench.github.io/csv-explorer/&#34;&gt;GLAM CSV Explorer&lt;/a&gt; a go! The CSV Explorer looks at each column in the dataset and tries to identify the type of data inside. It then attempts to tell you something useful about it.&lt;/p&gt;
&lt;p&gt;For all the details, links, and harvesting code, &lt;a href=&#34;https://glam-workbench.github.io/glam-data-portals/&#34;&gt;see the #GLAMWorkbench&lt;/a&gt;. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/9b14bfb26b.jpg&#34; width=&#34;448&#34; height=&#34;600&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/03/11/my-harvest-of.html</link>
      <pubDate>Wed, 11 Mar 2020 20:05:04 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/03/11/my-harvest-of.html</guid>
      <description>&lt;p&gt;My harvest of OCRd text from @TroveAustralia digitised books, ephemera, and parliamentary papers has been updated! There&amp;rsquo;s now 19,795 text files (about 3gb) to explore! Harvesting details and links to browse/download files from Cloudstor are &lt;a href=&#34;https://glam-workbench.github.io/trove-books/&#34;&gt;in the #GLAMWorkbench&lt;/a&gt;. #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/bc31f3a83b.jpg&#34; width=&#34;600&#34; height=&#34;278&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/cf8c124f49.jpg&#34; width=&#34;585&#34; height=&#34;600&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/6ca36757e7.jpg&#34; width=&#34;600&#34; height=&#34;341&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/03/03/ive-added-some.html</link>
      <pubDate>Tue, 03 Mar 2020 11:09:17 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/03/03/ive-added-some.html</guid>
      <description>&lt;p&gt;I’ve added some more documentation to the &lt;a href=&#34;https://glam-workbench.github.io/trove-harvester/https://glam-workbench.github.io/trove-harvester/&#34;&gt;Trove Newspaper Harvester&lt;/a&gt; page in the #GLAMWorkbench. Get your @TroveAustralia newspaper articles in bulk! #dhhacks #collectionsasdata&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/6538bf262c.jpg&#34; width=&#34;600&#34; height=&#34;462&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/02/27/new-section-added.html</link>
      <pubDate>Thu, 27 Feb 2020 12:47:18 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/02/27/new-section-added.html</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://glam-workbench.github.io/slv/&#34;&gt;New section added&lt;/a&gt; to the #GLAMWorkbench with examples from @Library_Vic! #slvdata #dhhacks #collectionsasdata&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/33f7643329.jpg&#34; width=&#34;600&#34; height=&#34;468&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/02/27/more-fun-with.html</link>
      <pubDate>Thu, 27 Feb 2020 12:18:57 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/02/27/more-fun-with.html</guid>
      <description>&lt;p&gt;More fun with @iiif_io and images from @library_vic – resize, rotate, crop and more! Try it out with this &lt;a href=&#34;https://github.com/GLAM-Workbench/state-library-victoria/blob/086f17821d0ffcb0d7d6db4251a6208d1da6a146/more_fun_with_iiif.ipynb&#34;&gt;new notebook&lt;/a&gt; in the #GLAMWorkbench. #slvdata #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/10c0427d8b.jpg&#34; width=&#34;600&#34; height=&#34;320&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/284c6f2898.jpg&#34; width=&#34;600&#34; height=&#34;435&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/671198b3cf.jpg&#34; width=&#34;600&#34; height=&#34;382&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/02/26/new-glamworkbench-notebook.html</link>
      <pubDate>Wed, 26 Feb 2020 22:39:10 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/02/26/new-glamworkbench-notebook.html</guid>
      <description>&lt;p&gt;New #GLAMWorkbench &lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/state-library-victoria/blob/master/download_image_from_iiif.ipynb&#34;&gt;notebook&lt;/a&gt;! Download images from @Library_Vic using IIIF and Handle… #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/2029641121.jpg&#34; width=&#34;600&#34; height=&#34;593&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/02/21/want-to-save.html</link>
      <pubDate>Fri, 21 Feb 2020 10:10:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/02/21/want-to-save.html</guid>
      <description>&lt;p&gt;Want to save @TroveAustralia newspaper articles as images (that aren&amp;rsquo;t sliced up in annoying ways)? There&amp;rsquo;s an &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/#save-a-trove-newspaper-article-as-an-image&#34;&gt;app for that&lt;/a&gt; in the #GLAMWorkbench. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/d6dedfab88.jpg&#34; width=&#34;600&#34; height=&#34;530&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/02/17/new-trove-images.html</link>
      <pubDate>Mon, 17 Feb 2020 21:08:39 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/02/17/new-trove-images.html</guid>
      <description>&lt;p&gt;New &lt;a href=&#34;https://glam-workbench.github.io/trove-images/&#34;&gt;‘Trove images&#39; section&lt;/a&gt; added to the #GLAMWorkbench! Here you’ll find my latest Jupyter notebook harvesting data about the use of standard licences &amp;amp; rights statements in Trove&amp;rsquo;s picture zone. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/117cf2e18f.jpg&#34; width=&#34;600&#34; height=&#34;463&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2020/02/14/voting-in-the.html</link>
      <pubDate>Fri, 14 Feb 2020 09:09:06 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2020/02/14/voting-in-the.html</guid>
      <description>&lt;p&gt;Voting in the 2019 @dhawards is &lt;a href=&#34;http://dhawards.org/dhawards2019/voting/&#34;&gt;now open&lt;/a&gt;! Go and check out all the cool #DigitalHumanities projects from around the world. And while you’re there, you might like to vote for my #GLAMWorkbench in the ‘Tools’ category!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2020/f9821a9126.jpg&#34; width=&#34;600&#34; height=&#34;532&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/11/20/new-glamworkbench-section.html</link>
      <pubDate>Wed, 20 Nov 2019 22:24:46 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/11/20/new-glamworkbench-section.html</guid>
      <description>&lt;p&gt;New #GLAMWorkbench &lt;a href=&#34;https://glam-workbench.github.io/trove-random/&#34;&gt;section&lt;/a&gt; with examples of how to get &lt;em&gt;random-ish&lt;/em&gt; works and newspaper articles from @TroveAustralia. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/ea26af5df6.jpg&#34; width=&#34;600&#34; height=&#34;394&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/09/04/the-naagovau-recordsearch.html</link>
      <pubDate>Thu, 05 Sep 2019 00:03:04 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/09/04/the-naagovau-recordsearch.html</guid>
      <description>&lt;p&gt;The @naagovau RecordSearch section of the #GLAMWorkbench has been updated with more notebooks to help you get Australian archives data in a usable form. &lt;a href=&#34;https://glam-workbench.github.io/recordsearch/&#34;&gt;glam-workbench.github.io/recordsea&amp;hellip;&lt;/a&gt; Useful for #twitterstorians, #ozhist, &amp;amp; #govhack!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/2443ab1a76.jpg&#34; width=&#34;600&#34; height=&#34;474&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/08/25/ive-updated-my.html</link>
      <pubDate>Sun, 25 Aug 2019 14:33:18 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/08/25/ive-updated-my.html</guid>
      <description>&lt;p&gt;I’ve updated my harvest of OCRd text from digitised journals in @TroveAustralia. The complete dataset now includes 33,035 issues from 720 titles – about 8gb of text to explore. Details in the #GLAMWorkbench: &lt;a href=&#34;https://glam-workbench.github.io/trove-journals/#data-and-text&#34;&gt;glam-workbench.github.io/trove-jou&amp;hellip;&lt;/a&gt; #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/799cee870c.jpg&#34; width=&#34;600&#34; height=&#34;277&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/08/09/theres-a-new.html</link>
      <pubDate>Fri, 09 Aug 2019 23:55:23 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/08/09/theres-a-new.html</guid>
      <description>&lt;p&gt;There’s a &lt;a href=&#34;https://glam-workbench.github.io/nma/&#34;&gt;new section of the GLAM Workbench&lt;/a&gt; devoted to the National Museum of Australia collection API! Harvest @nma data, then explore it by time and place. #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/600a9eb2f5.jpg&#34; width=&#34;600&#34; height=&#34;359&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/07e1f50468.jpg&#34; width=&#34;600&#34; height=&#34;341&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/5ffd854d09.jpg&#34; width=&#34;600&#34; height=&#34;472&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/07/24/updates-to-the.html</link>
      <pubDate>Wed, 24 Jul 2019 18:09:56 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/07/24/updates-to-the.html</guid>
      <description>&lt;p&gt;Updates to the &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/&#34;&gt;Trove newspapers section&lt;/a&gt; of GLAM Workbench – adding links to app-ified versions of some notebooks, &amp;amp; direct links to @mybinderteam for everything. If you work with @TroveAustralia newspapers you might find it useful.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/10876f160b.jpg&#34; width=&#34;600&#34; height=&#34;547&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title>Download &amp; explore 1,499,259 rows of open data from NSW State Archives Online Indexes</title>
      <link>https://updates.timsherratt.org/2019/07/24/download-explore-rows.html</link>
      <pubDate>Wed, 24 Jul 2019 14:34:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/07/24/download-explore-rows.html</guid>
      <description>&lt;p&gt;NSW State Archives publishes a number of &lt;a href=&#34;https://www.records.nsw.gov.au/archives/collections-and-research/guides-and-indexes/indexes-a-z&#34;&gt;detailed indexes&lt;/a&gt; containing data manually extracted from their records. These provide additional entry points to the records, such as a person&amp;rsquo;s name, or a place. But they also provide useful data for analysis. However, to explore the index data we need to get it out of the web interface and into a form that can be easily downloaded and manipulated.&lt;/p&gt;
&lt;p&gt;I’ve created a series of Jupyter notebooks to harvest all the indexes and save the data as CSV files. I’ve also updated my repository containing all the harvested CSV files. It’s available from the new &lt;a href=&#34;https://glam-workbench.github.io/nsw-state-archives/&#34;&gt;NSW State Archives section&lt;/a&gt; of my GLAM Workbench. There are currently 64 different index datasets, containing 1,499,259 rows of data.&lt;/p&gt;
&lt;p&gt;And to help you get a sense of what&amp;rsquo;s actually in all those CSV files, I’ve created an interactive Index Explorer. Just select an index from the list, and the Index Explorer will generate a series of tables and visualisations that provide an overview of the data. Try &lt;a href=&#34;https://mybinder.org/v2/gh/GLAM-Workbench/nsw-state-archives/master?urlpath=%2Fapps%2Findex-explorer.ipynb&#34;&gt;running it live&lt;/a&gt; on Binder.&lt;/p&gt;
&lt;p&gt;Thanks to the State Archives staff and volunteers for preparing all this most excellent data. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/61527f660d.gif&#34; width=&#34;600&#34; height=&#34;497&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/07/11/new-in-glam.html</link>
      <pubDate>Thu, 11 Jul 2019 22:00:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/07/11/new-in-glam.html</guid>
      <description>&lt;p&gt;New in GLAM Workbench! &lt;a href=&#34;https://glam-workbench.github.io/pm-transcripts/&#34;&gt;Notebooks&lt;/a&gt; to harvest, index, analyse, and aggregate transcripts of speeches &amp;amp; interviews by Australian prime ministers. Plus links to harvested data and aggregated files. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/72363568ec.jpg&#34; width=&#34;600&#34; height=&#34;233&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/07/11/reorganising-things-a.html</link>
      <pubDate>Thu, 11 Jul 2019 12:26:29 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/07/11/reorganising-things-a.html</guid>
      <description>&lt;p&gt;Reorganising things a little at &lt;a href=&#34;https://glam-workbench.github.io/&#34;&gt;GLAM Workbench&lt;/a&gt;. @statelibrarynsw gets its own section. Hansard and @datagovau GLAM datasets now under ‘Australian government’. Making some space for further additions…&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/a32652f0ce.jpg&#34; width=&#34;600&#34; height=&#34;448&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/06/17/kicked-off-a.html</link>
      <pubDate>Mon, 17 Jun 2019 18:17:43 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/06/17/kicked-off-a.html</guid>
      <description>&lt;p&gt;Kicked off a new GLAM Workbench repository dedicated to @SLSA with a &lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/slsa/blob/master/Getting-higher-res-images.ipynb&#34;&gt;quick notebook hack&lt;/a&gt; to get higher res versions of digitised photos. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/7d61a71399.jpg&#34; width=&#34;600&#34; height=&#34;474&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/06/07/recent-additions-to.html</link>
      <pubDate>Fri, 07 Jun 2019 14:57:28 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/06/07/recent-additions-to.html</guid>
      <description>&lt;p&gt;Recent additions to the Trove Newspapers section of the GLAM Workbench: getting images from @TroveAustralia newspaper articles, and uploading articles to @Omeka-S: &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/&#34;&gt;glam-workbench.github.io/trove-new&amp;hellip;&lt;/a&gt;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/9990b25c5b.jpg&#34; width=&#34;600&#34; height=&#34;498&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/05/26/more-glam-workbench.html</link>
      <pubDate>Sun, 26 May 2019 17:39:22 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/05/26/more-glam-workbench.html</guid>
      <description>&lt;p&gt;More GLAM Workbench updates! More full text of Australian books! I&amp;rsquo;ve added &lt;a href=&#34;https://glam-workbench.github.io/trove-books/&#34;&gt;the notebook &amp;amp; data&lt;/a&gt; from my harvest of @TroveAustralia books in the @InternetArchive. There&amp;rsquo;s metadata and text of 1,153 books to explore. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/1f1ac0e8f8.jpg&#34; width=&#34;600&#34; height=&#34;440&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/05/19/some-overdue-updates.html</link>
      <pubDate>Sun, 19 May 2019 23:22:47 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/05/19/some-overdue-updates.html</guid>
      <description>&lt;p&gt;Some overdue updates to the GLAM Workbench. First here&amp;rsquo;s &lt;a href=&#34;https://glam-workbench.github.io/glam-data-portals/&#34;&gt;details, data, and code&lt;/a&gt; from a harvest of GLAM datasets on @datagovau. Includes details of more than 400 CSV datasets. #dhhacks&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/8ecb3eba56.jpg&#34; width=&#34;600&#34; height=&#34;438&#34; alt=&#34;&#34; /&gt;&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/58c8fa43a4.jpg&#34; width=&#34;554&#34; height=&#34;600&#34; alt=&#34;&#34; /&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/05/09/over-the-last.html</link>
      <pubDate>Thu, 09 May 2019 20:14:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/05/09/over-the-last.html</guid>
      <description>&lt;p&gt;Over the last week I&amp;rsquo;ve been downloading editorial cartoons published in &lt;em&gt;The Bulletin&lt;/em&gt; from @TroveAustralia. There&amp;rsquo;s 3,471 cartoons – at least one from every issue published between 4 Sep 1886 and 17 Sep 1952. And you can browse them all&amp;hellip;&lt;/p&gt;
&lt;p&gt;To make it easier to explore the images, I&amp;rsquo;ve compiled them into a series of PDFs – one PDF for each decade. The PDFs include lower resolution versions of the images together with their publication details and a link to Trove. They&amp;rsquo;re all &lt;a href=&#34;https://www.dropbox.com/sh/rulkbsqgfe8cyhv/AABel9b95buJSG5hZrVCvaQsa?dl=0&#34;&gt;available from DropBox&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/altjl6jixwv5pt0/bulletin-1886-1889.pdf?dl=0&#34;&gt;1886 to 1889&lt;/a&gt; (45mb PDF)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/p15swmact2c9euf/bulletin-1890-1899.pdf?dl=0&#34;&gt;1890 to 1899&lt;/a&gt; (139mb PDF)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/0rivg50s8qam2et/bulletin-1900-1909.pdf?dl=0&#34;&gt;1900 to 1909&lt;/a&gt; (147mb PDF)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/pdsj6xjot0l928w/bulletin-1910-1919.pdf?dl=0&#34;&gt;1910 to 1919&lt;/a&gt; (153mb PDF)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/64x9y5nvgez1q3o/bulletin-1920-1929.pdf?dl=0&#34;&gt;1920 to 1929&lt;/a&gt; (159mb PDF)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/8mytp5qhqcrctt3/bulletin-1930-1939.pdf?dl=0&#34;&gt;1930 to 1939&lt;/a&gt; (151mb PDF)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/go3vyuqq0td6oqd/bulletin-1940-1949.pdf?dl=0&#34;&gt;1940 to 1949&lt;/a&gt; (146mb PDF)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.dropbox.com/s/klcv0gyjs81c0pm/bulletin-1950-1952.pdf?dl=0&#34;&gt;1950 to 1952&lt;/a&gt; (42mb PDF)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The complete collection of high resolution images (about 60gb in total) can be &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/bI7hJREvO0oJLGL&#34;&gt;downloaded from CloudStor&lt;/a&gt;. The names of each image file provide useful contextual metadata. For example, the file name &lt;code&gt;19330412-2774-nla.obj-606969767-7.jpg&lt;/code&gt; tells you:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;19330412&lt;/code&gt; – the cartoon was published on 12 April 1933&lt;/li&gt;
&lt;li&gt;&lt;code&gt;2774&lt;/code&gt; – it was published in issue number 2774&lt;/li&gt;
&lt;li&gt;&lt;code&gt;nla.obj-606969767&lt;/code&gt; – the Trove identifier for the issue, which can be used to make a url, eg &lt;a href=&#34;https://nla.gov.au/nla.obj-606969767&#34;&gt;nla.gov.au/nla.obj-606969767&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7&lt;/code&gt; – on page 7&lt;/li&gt;
&lt;/ul&gt;
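&lt;p&gt;As a rough illustration (not part of the original harvest code), the file-name pattern described above can be unpacked with a few lines of Python:&lt;/p&gt;

```python
# Hypothetical sketch: unpack the metadata encoded in a Bulletin image
# file name, assuming the date-issue-identifier-page pattern described above.

def parse_bulletin_filename(filename):
    stem = filename.rsplit(".", 1)[0]       # drop the .jpg extension
    date, issue, rest = stem.split("-", 2)  # '19330412', '2774', 'nla.obj-606969767-7'
    identifier, page = rest.rsplit("-", 1)  # 'nla.obj-606969767', '7'
    return {
        "date": f"{date[:4]}-{date[4:6]}-{date[6:]}",  # YYYY-MM-DD
        "issue": int(issue),
        "trove_id": identifier,
        "url": f"https://nla.gov.au/{identifier}",
        "page": int(page),
    }

print(parse_bulletin_filename("19330412-2774-nla.obj-606969767-7.jpg"))
```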
&lt;p&gt;There&amp;rsquo;s some details of the method that I used to find the cartoons &lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/trove-journals/blob/master/Finding_editorial_cartoons_in_the_Bulletin.ipynb&#34;&gt;in this notebook&lt;/a&gt;. I&amp;rsquo;ve also documented everything in the &lt;a href=&#34;https://glam-workbench.github.io/trove-journals/&#34;&gt;Trove Journals section&lt;/a&gt; of my GLAM Workbench.&lt;/p&gt;
&lt;p&gt;Be warned – the language, images, and ideas presented in &lt;em&gt;The Bulletin&lt;/em&gt; were often racist, anti-Semitic, and sexist. You won’t have to look far within this collection to find something offensive. This was, after all, the journal whose slogan for many years was ‘Australia for the white man’. This is our history&amp;hellip; #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/a92c91b746.jpg&#34; width=&#34;380&#34; height=&#34;600&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/04/27/and-now-my.html</link>
      <pubDate>Sat, 27 Apr 2019 12:44:19 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/04/27/and-now-my.html</guid>
      <description>&lt;p&gt;And now my GLAM Workbench has a &amp;lsquo;Trove Maps&amp;rsquo; section to document examples and explorations using data from @TroveAustralia&amp;rsquo;s &amp;lsquo;map&amp;rsquo; zone: &lt;a href=&#34;https://glam-workbench.github.io/trove-maps/&#34;&gt;glam-workbench.github.io/trove-map&amp;hellip;&lt;/a&gt; Includes a list of 20,158 maps with high-res downloads. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/4b7e1bf53e.jpg&#34; width=&#34;600&#34; height=&#34;545&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/04/23/ive-been-busy.html</link>
      <pubDate>Tue, 23 Apr 2019 14:49:00 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/04/23/ive-been-busy.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve been busy lately harvesting LOTS of full text data from @TroveAustralia&amp;rsquo;s digitised journals – so many opportunities for research! You should be able to get to all the code &amp;amp; data from the new &lt;a href=&#34;https://glam-workbench.github.io/trove-journals/&#34;&gt;Trove journals section&lt;/a&gt; of my GLAM Workbench. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/9f44f2edae.jpg&#34; width=&#34;600&#34; height=&#34;571&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/04/22/ive-added-a.html</link>
      <pubDate>Mon, 22 Apr 2019 23:07:01 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/04/22/ive-added-a.html</guid>
      <description>&lt;p&gt;I’ve &lt;a href=&#34;https://glam-workbench.github.io/trove-books/&#34;&gt;added a section&lt;/a&gt; for the @TroveAustralia ‘book’ zone to the GLAM Workbench.&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/452579e763.jpg&#34; width=&#34;600&#34; height=&#34;483&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/04/22/all-ocrd-text.html</link>
      <pubDate>Mon, 22 Apr 2019 12:11:06 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/04/22/all-ocrd-text.html</guid>
      <description>&lt;p&gt;All 9,738 OCRd text files harvested from books, pamphlets and leaflets in @TroveAustralia&amp;rsquo;s ‘book’ zone have been uploaded to @aarnet’s CloudStor for &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/ugiw3gdijSKaoTL&#34;&gt;easy browsing/download&lt;/a&gt;. There&amp;rsquo;s also a &lt;a href=&#34;https://cloudstor.aarnet.edu.au/plus/s/SrNqP4IOwF1fMBz&#34;&gt;400mb zip file&lt;/a&gt; if you want the whole lot.&lt;/p&gt;
&lt;p&gt;The harvesting method and code are available &lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/trove-books/blob/master/Harvesting-digitised-books.ipynb&#34;&gt;in this notebook&lt;/a&gt;. All this and more will be documented soon in my &lt;a href=&#34;http://glam-workbench.github.io/&#34;&gt;GLAM Workbench&lt;/a&gt;. #dhhacks&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/03/31/train-from-canberra.html</link>
      <pubDate>Sun, 31 Mar 2019 21:23:43 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/03/31/train-from-canberra.html</guid>
      <description>&lt;p&gt;Train from Canberra to Melbourne booked for #VALATechCamp. I&amp;rsquo;ll be hanging around both days, so let me know if you&amp;rsquo;d like to chat about the &lt;a href=&#34;https://glam-workbench.github.io/&#34;&gt;GLAM Workbench&lt;/a&gt;, Jupyter, Trove data, or any of the other things I fiddle with&amp;hellip;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/02/24/ive-updated-the.html</link>
      <pubDate>Sun, 24 Feb 2019 20:15:56 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/02/24/ive-updated-the.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve updated the notebook for harvesting records from @archivesnz&amp;rsquo;s Archway database in &lt;a href=&#34;https://glam-workbench.github.io/archway/&#34;&gt;my GLAM Workbench&lt;/a&gt;. I just used it to harvest more than 8,000 records from series 8333 relating to naturalisation. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/b97644687c.jpg&#34; width=&#34;600&#34; height=&#34;313&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/02/21/new-section-added.html</link>
      <pubDate>Thu, 21 Feb 2019 11:46:17 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/02/21/new-section-added.html</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://glam-workbench.github.io/qsa/&#34;&gt;New section added&lt;/a&gt; to my GLAM Workbench for the Queensland State Archives (@qsarchives). Includes a notebook to add series information to their Naturalisations 1851-1904 index. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/23a43eec3b.jpg&#34; width=&#34;600&#34; height=&#34;363&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/02/17/suggestions-of-new.html</link>
      <pubDate>Sun, 17 Feb 2019 14:33:44 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/02/17/suggestions-of-new.html</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://glam-workbench.github.io/suggest-a-topic/&#34;&gt;Suggestions of new topics and collections&lt;/a&gt; for my GLAM workbench are welcome!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/bc41952023.jpg&#34; width=&#34;600&#34; height=&#34;352&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/02/17/ive-added-a.html</link>
      <pubDate>Sun, 17 Feb 2019 12:40:14 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/02/17/ive-added-a.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve added a &lt;a href=&#34;https://glam-workbench.github.io/lac/&#34;&gt;section for Library and Archives Canada&lt;/a&gt; to my GLAM workbench. The first notebook extracts records of people from a specific country from their naturalisations database and saves the results as a CSV file. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/6dac64955d.jpg&#34; width=&#34;600&#34; height=&#34;334&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/02/15/current-status-extracting.html</link>
      <pubDate>Fri, 15 Feb 2019 23:59:24 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/02/15/current-status-extracting.html</guid>
      <description>&lt;p&gt;Current status — extracting data from Library and Archives Canada&amp;rsquo;s 1915-1946 naturalisation database. Coming soon to my GLAM Workbench&amp;hellip;&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/a43f9aa170.jpg&#34; width=&#34;600&#34; height=&#34;479&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/02/01/ive-added-a.html</link>
      <pubDate>Fri, 01 Feb 2019 22:26:26 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/02/01/ive-added-a.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve added a &amp;lsquo;save chart&amp;rsquo; option to the &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/&#34;&gt;QueryPic app in my GLAM Workbench&lt;/a&gt;. Visualise your searches in @TroveAustralia newspapers, then save the results as HTML for easy download. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/7e313ccec9.jpg&#34; width=&#34;600&#34; height=&#34;545&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/23/one-more-and.html</link>
      <pubDate>Wed, 23 Jan 2019 23:17:20 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/23/one-more-and.html</guid>
      <description>&lt;p&gt;One more and I&amp;rsquo;m done for the night&amp;hellip; New GLAM Workbench page for the &lt;a href=&#34;https://glam-workbench.github.io/trove/&#34;&gt;&amp;lsquo;Trove API introduction&amp;rsquo;&lt;/a&gt; notebooks.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/23/ive-finished-putting.html</link>
      <pubDate>Wed, 23 Jan 2019 22:22:10 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/23/ive-finished-putting.html</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve finished putting details of all the current GLAM Workbench repositories into the &lt;a href=&#34;https://glam-workbench.github.io/&#34;&gt;new documentation site&lt;/a&gt;. Still a few notebooks to migrate from the original workbench, but getting there! There&amp;rsquo;s about 50 Jupyter notebooks so far. #dhhacks&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/d1161383c0.jpg&#34; width=&#34;600&#34; height=&#34;457&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/23/added-a-data.html</link>
      <pubDate>Wed, 23 Jan 2019 17:07:30 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/23/added-a-data.html</guid>
      <description>&lt;p&gt;Added a &amp;lsquo;data&amp;rsquo; section to the GLAM Workbench docs, with info on &lt;a href=&#34;https://glam-workbench.github.io/glam-data-portals/&#34;&gt;harvests from government data portals&lt;/a&gt;, as well as series from @naagovau relating to ASIO and the White Australia Policy.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/23/and-now-a.html</link>
      <pubDate>Wed, 23 Jan 2019 11:23:08 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/23/and-now-a.html</guid>
      <description>&lt;p&gt;And now a &lt;a href=&#34;https://glam-workbench.github.io/tepapa/&#34;&gt;GLAM Workbench page&lt;/a&gt; for @Te_Papa&amp;hellip;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/23/added-a-page.html</link>
      <pubDate>Wed, 23 Jan 2019 10:49:01 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/22/added-a-page.html</guid>
      <description>&lt;p&gt;Added a page for @ArchivesNZ&amp;rsquo;s &lt;a href=&#34;https://glam-workbench.github.io/archway/&#34;&gt;Archway&lt;/a&gt; to the GLAM Workbench docs&amp;hellip;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/22/so-heres-some.html</link>
      <pubDate>Tue, 22 Jan 2019 17:30:49 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/22/so-heres-some.html</guid>
      <description>&lt;p&gt;So here&amp;rsquo;s &lt;a href=&#34;https://glam-workbench.github.io/trove-newspapers/&#34;&gt;some fun things to do&lt;/a&gt; with @TroveAustralia newspapers&amp;hellip; (via GLAM Workbench)&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/dca9e5a815.jpg&#34; width=&#34;600&#34; height=&#34;170&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/22/ok-more-documentation.html</link>
      <pubDate>Tue, 22 Jan 2019 15:06:50 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/22/ok-more-documentation.html</guid>
      <description>&lt;p&gt;Ok, &lt;a href=&#34;https://glam-workbench.github.io/digitalnz/&#34;&gt;more documentation&lt;/a&gt; for you — page for the @DigitalNZ API in GLAM Workbench updated!&lt;/p&gt;
&lt;img src=&#34;https://cdn.uploads.micro.blog/8371/2019/47fb133507.jpg&#34; width=&#34;600&#34; height=&#34;231&#34; alt=&#34;&#34; /&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/22/slowly-working-my.html</link>
      <pubDate>Tue, 22 Jan 2019 13:11:33 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/22/slowly-working-my.html</guid>
      <description>&lt;p&gt;Slowly working my way through the documentation for my GLAM Workbench. Still &lt;strong&gt;lots&lt;/strong&gt; to do, but I think the page for @naagovau&amp;rsquo;s &lt;a href=&#34;https://glam-workbench.github.io/recordsearch/&#34;&gt;RecordSearch&lt;/a&gt; is now up-to-date.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/22/if-there-are.html</link>
      <pubDate>Tue, 22 Jan 2019 12:38:15 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/22/if-there-are.html</guid>
      <description>&lt;p&gt;If there are APIs or other data sources you&amp;rsquo;d like me to add to my GLAM Workbench, feel free to &lt;a href=&#34;https://github.com/GLAM-Workbench/glam-workbench.github.io/issues&#34;&gt;create an issue&lt;/a&gt;. You could also describe what sorts of tools or examples using that data source would be useful.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title></title>
      <link>https://updates.timsherratt.org/2019/01/16/new-notebook-added.html</link>
      <pubDate>Wed, 16 Jan 2019 17:41:45 +1100</pubDate>
      
      <guid>http://wragge.micro.blog/2019/01/16/new-notebook-added.html</guid>
      <description>&lt;p&gt;New notebook added to the #GLAMWorkbench RecordSearch repository — get the basic details of agencies associated with all government functions used in @naagovau&amp;rsquo;s RecordSearch and save to a single JSON data file. &lt;a href=&#34;https://nbviewer.jupyter.org/github/GLAM-Workbench/recordsearch/blob/master/get_all_agencies_by_function.ipynb&#34;&gt;View code and data&lt;/a&gt;. #dhhacks&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>