29 Aug 2024

The future (and past) of Historic Hansard

Don’t panic! Historic Hansard is not closing down – on the contrary, I’m planning a major update in the next few months. But as I look to the future, I thought it was a good time to pull together a few threads documenting my adventures with Commonwealth Hansard.

The past

Commonwealth Hansard is made available online through ParlInfo (there’s an alternative search interface here). The Parliamentary Library has invested a lot of time and effort in converting the printed volumes into nicely-structured XML files which break up the sitting day into debates and speeches, and identify individual speakers. For the most part, there’s one XML file for each sitting day in each house. However, there’s currently a gap between 1981 and 1997 where no XML files are available.

I started pulling data out of ParlInfo around 2011, but in 2016 I decided it would be more efficient to harvest all of the XML files into a dedicated repository. I started with the House of Representatives debates to 1965, and gradually expanded the coverage. The repository now contains all the Hansard XML files for both houses from 1901 to 1980, and 1998 to 2005. I stopped in 2005 because Open Australia provides access to the Hansard XML files from 2006 onwards.

The harvesting process revealed some interesting anomalies. In particular, I discovered that ParlInfo was missing Hansard from 94 sitting days. Most of the gaps were in the Senate between 1910 and 1920.

Chart showing 'missing' Hansard in the Senate 1901 to 1925

Fortunately the Parliamentary Library was quick to investigate the problem and fill the gaps. Over the years since, they’ve continued to improve the quality and accuracy of the XML files. Earlier this year I noticed that lots of new versions of the XML files were appearing and so I reharvested them all. It looks like the accuracy of the OCRd text has been improved and some structural issues fixed. This is one reason why a new version of Historic Hansard is needed!

If you’ve ever used ParlInfo you’ll know that while you can find things in Hansard, browsing and reading are difficult. There’s no easy way of just perusing the proceedings of a single day. A few months after I created the XML repository I decided to use the files to build Historic Hansard – ‘Commonwealth of Australia parliamentary debates presented in an easy-to-read format for historians and other lovers of political speech’. Historic Hansard is mostly just a static site, with one web page for each sitting day. Unlike ParlInfo, the focus is on reading, and you can view each speech within the context of the complete day’s proceedings.

Screenshot of the home page of Historic Hansard

You can browse Historic Hansard by house, parliament, year, and day. There are also indexes to the bills presented in the House of Representatives and the Senate, and pages for every person in the House of Representatives and the Senate with a complete list of their speeches. Because the focus was on browsing, Historic Hansard didn’t originally include a full text search function, but I eventually succumbed to user demand and added one in 2017. You can search for either debates or speeches, and download your results as a CSV file.

Screenshot of Historic Hansard's search interface

I also integrated Historic Hansard with Hypothes.is and Voyant Tools. Using Hypothes.is you can add notes and annotations to the speeches. You can even create deep links to fragments of text within a speech. I’ve often suggested that you could structure a whole undergraduate history unit around the annotation of a year of Hansard – identifying people and events, and finding and linking to related information. The Voyant Tools integration allowed you to analyse and visualise the language of a complete year of Hansard. Unfortunately I broke this at some point, so it’s on my list of things to fix in the new version!

In 2017, I did a bit of work with David Lowe at Deakin University to analyse the language of Hansard. Most of it was focused on the 1970s, but I did create word clouds using the top 200 TF-IDF weighted words for each decade and each parliament. In 2019, I compared the usage of the term ‘aliens’ in Hansard, newspapers, and The Bulletin.

Comparison of words associated with 'aliens' in The Bulletin and Hansard
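
For readers unfamiliar with TF-IDF weighting, here is a rough sketch of how the top-weighted words for each decade might be computed. It is not the code behind the word clouds: the function name `top_tfidf_words`, the toy token lists, and the particular TF-IDF variant (relative term frequency multiplied by the natural log of inverse document frequency) are all assumptions made for illustration.

```python
import math
from collections import Counter


def top_tfidf_words(docs, n=200):
    """Return the n highest-scoring TF-IDF words for each document.

    docs maps a label (e.g. a decade) to a list of tokens.
    """
    # Document frequency: how many documents each word appears in.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    total = len(docs)

    results = {}
    for label, tokens in docs.items():
        counts = Counter(tokens)
        length = len(tokens) or 1
        # Relative term frequency times log inverse document frequency.
        scores = {
            word: (count / length) * math.log(total / df[word])
            for word, count in counts.items()
        }
        # Highest score first; ties broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results[label] = ranked[:n]
    return results


# Invented sample data, standing in for a decade's worth of Hansard text.
decades = {
    "1900s": "federation tariff tariff customs defence".split(),
    "1940s": "war war rationing defence munitions".split(),
    "1970s": "inflation medibank defence uranium inflation".split(),
}
top = top_tfidf_words(decades, n=2)
```

Words that appear in every document (like 'defence' in this toy sample) score zero, which is why TF-IDF tends to surface the vocabulary that distinguishes one decade or parliament from the others.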

One interesting feature of the Hansard XML is that it identifies interjections as well as formal speeches. In 2017, I extracted almost a million interjections from Hansard into a separate dataset. One of my favourite experiments used the interjections to reimagine political communication in the pre-internet era by transforming interjections into (fake) tweets. Of course I also started feeding the interjections through a text-to-speech processor, creating RoboHansard.

Screenshot of Real Words :: Imagined Tweets showing Hansard interjections reimagined as tweets
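
To give a sense of what pulling interjections out of structured XML can look like, here is a minimal sketch. The sample document and its element names (`debate`, `title`, `speech`, `interjection`, `speaker`, `para`) are invented for illustration and are not the actual Hansard schema, so the real files would need different tag names. The function `extract_interjections` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample. The real Hansard XML uses its own schema.
SAMPLE = """
<hansard>
  <debate>
    <title>Immigration Restriction Bill</title>
    <speech>
      <speaker>Mr EXAMPLE</speaker>
      <para>I move that the bill be now read a second time.</para>
      <interjection>
        <speaker>Mr OTHER</speaker>
        <para>Nonsense!</para>
      </interjection>
      <para>The honourable member will have his turn.</para>
    </speech>
  </debate>
</hansard>
"""


def extract_interjections(xml_text):
    """Collect each interjection with its debate title, speaker, and text."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for debate in root.iter("debate"):
        title = debate.findtext("title", default="")
        for interjection in debate.iter("interjection"):
            paragraphs = [
                p.text.strip() for p in interjection.iter("para") if p.text
            ]
            rows.append(
                {
                    "debate": title,
                    "speaker": interjection.findtext("speaker", default=""),
                    "text": " ".join(paragraphs),
                }
            )
    return rows


rows = extract_interjections(SAMPLE)
```

Each row could then be written out to a CSV file or passed to whatever comes next, whether that is a fake tweet generator or a text-to-speech processor.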

During the Real Face of White Australia transcribe-a-thon at Old Parliament House in 2017, I had the chance to take some of the interjections back to the place they were uttered. I set up speakers around the House of Representatives chamber and then started them hurling interjections about the White Australia Policy at each other. The drama (and spookiness) was heightened when a power failure turned off all the lights!

I summarised my adventures with Historic Hansard for a conference in South Africa in 2020.

I also wrote a piece on ‘The language of Parliament’ for the Museum of Australian Democracy which, unfortunately, seems to have disappeared during their latest site update. You can find it on Zenodo or in the Australian Web Archive.

In more recent times, I’ve integrated the XML harvesting code into the GLAM Workbench and added some Jupyter notebooks that demonstrate how you can access and analyse files from the repository.

The future

A number of historians have told me how much Historic Hansard has helped their research. A quick search in Google Scholar for “historichansard.net” returns 78 publications, many of which seem to include multiple links to specific sitting days. It seems I accidentally created a key piece of digital research infrastructure for Australian historians.

Given that, I’m not intending to change very much. My current plans include:

update all the content to use the latest versions of the XML
extend the date range to include files from 1998 to 2005
get the Voyant integration working again
make some back-end changes to the search application
fix various outdated links
I’d also like to see if I can clean up the Bills indexes a bit by merging together some of the titles, but that’ll require a bit of experimentation

Are there any improvements you’d like to see? If you have any suggestions, feel free to add an issue to the GitHub repository.

Historic Hansard is one of my passion projects. I haven’t received any funding to create or maintain it. Currently I estimate it costs me about AU$400 a year to run. It’s not much in the world of research infrastructure but, from a personal perspective, it all adds up. I’m committed to keeping Historic Hansard going, and it’s already outlasted some similar, well-funded projects, but if you’d like to help with the costs you can sponsor me on GitHub, or just buy me a coffee.

There seems to be a lot of interest in Hansard amongst digital humanities researchers at the moment, and there are a couple of new projects starting up. It’ll be interesting to see where they go, but whatever happens, Historic Hansard will continue to serve lovers of political speech.

Tim Sherratt