New dataset and notebooks – twenty years of ABC Radio National

There’s a new GLAM Workbench section for working with data from Trove’s Music & Sound zone!

Inside you’ll find out how to harvest all the metadata from ABC Radio National program records – that’s 400,000+ records, from 160 Radio National programs, over more than 20 years.

It’s metadata only, so not full transcripts or audio, though there are links back to the ABC site where you might find transcripts. Most records should at least have a title, a date, the name of the program it was broadcast on, a list of contributors, and perhaps a brief abstract/summary. It’s also worth noting that many of these records, particularly those from the main current affairs programs, represent individual stories or segments – so they provide a detailed record of the major news stories for the last couple of decades!

The harvesting notebook shows you how to get the data from the Trove API. There are a number of duplicate records, and some inconsistencies in the way the data is formatted, so the harvesting code tries to clean things up a bit. You can of course adjust this to meet your own needs.
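
To give you a sense of the basic approach, here’s a stripped-down sketch using the Trove API’s bulk harvest cursor. The query string is an assumption (check the notebook for the one it actually uses), and the real harvesting code does a lot more clean-up than this.

```python
import requests

API_KEY = "YOUR_API_KEY"  # you'll need your own Trove API key
API_URL = "https://api.trove.nla.gov.au/v2/result"

params = {
    "q": 'nuc:"ABC:RN"',    # assumed query for ABC RN records -- the notebook may use something different
    "zone": "music",
    "encoding": "json",
    "n": 100,               # records per request
    "bulkHarvest": "true",  # use the bulk harvest cursor for consistent paging
    "key": API_KEY,
}

records = []
start = "*"
while start:
    params["s"] = start
    response = requests.get(API_URL, params=params)
    response.raise_for_status()
    zone = response.json()["response"]["zone"][0]
    records.extend(zone["records"].get("work", []))
    start = zone["records"].get("nextStart")  # missing when there are no more pages

print(f"Harvested {len(records)} records")
```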

If you don’t want to do the harvesting yourself, there are pre-harvested datasets that you can download immediately from Cloudstor and start exploring. The complete harvest of all 400,000+ records is available in both JSONL (newline-delimited JSON) and CSV formats. There’s also a series of separate datasets for the most frequently occurring programs: RN Breakfast, RN Drive, AM, PM, The World Today, Late Night Live, Life Matters, and The Science Show.
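
Once you’ve downloaded one of the pre-harvested files, loading it with pandas is a one-liner. The filename below is just a placeholder – use whatever you’ve downloaded from Cloudstor.

```python
import pandas as pd

# Placeholder filename -- substitute the JSONL file you downloaded from Cloudstor.
# (For the CSV version, use pd.read_csv instead.)
df = pd.read_json("abcrn-metadata.jsonl", lines=True)

print(df.shape)    # how many records and fields
print(df.columns)  # what metadata fields are available
```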

There’s also a notebook that demonstrates a few possible ways you might start to play with the data – looking at the range of programs, the distribution of records over time, the people involved in each story, and words in the titles of each segment.

This is a very rich source of data for examining Australia’s political and social history over the last twenty years. Dive in and see what you can find! #dhhacks

Finding non-English newspapers in Trove

There are a growing number of non-English newspapers in Trove, but how do you know what’s there? After trying a few different approaches, I generated a list of 48 newspapers with non-English content. The full details are in this notebook.

As the notebook describes, I found the language metadata for newspapers was incomplete, so I ran some language detection code over a sample of articles from every newspaper to try to find those with non-English content. But this had its own problems – such as the fact that the language detection code thought that bad OCR looked like Maltese…
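
The basic idea looks something like this sketch using the langdetect package – the notebook’s own code, sampling, and thresholds differ in the details.

```python
from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's results reproducible

def detect_language(ocr_text):
    """Return a language code for some OCRd text, or None if detection fails."""
    try:
        return detect(ocr_text)
    except LangDetectException:
        return None

# Bad OCR can come back as unexpected languages (hello Maltese!), so look at the
# distribution of results across a sample of articles from each newspaper rather
# than trusting any single article.
print(detect_language("Die Zeitung wurde jede Woche in Adelaide gedruckt"))  # 'de'
```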

Anyway, if you’re just searching using English keywords, you might not even be aware that these titles exist. It’s important to explore ways of making diversity visible within large digitised collections. #dhhacks

Last year I did some analysis of the availability of open access versions of research articles published between 2008 and 2018 in Australian Historical Studies. I’ve now broadened this out to cover all individual articles (with a DOI) across a number of journals. It’s pretty grim. Despite Green OA policies that allow researchers to share versions of their articles through institutional repositories, Australian history journals still seem to be about 94% closed.

Full details are in this notebook.
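
If you’d like to run a similar check on another journal, the Unpaywall API is a simple way to look up the open access status of a DOI. This is just a minimal sketch – the notebook documents the exact approach I used, which may differ.

```python
import requests

EMAIL = "you@example.com"  # Unpaywall asks for a contact email with each request

def is_open_access(doi):
    """Look up a DOI in Unpaywall and return True if an OA copy is recorded."""
    response = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": EMAIL})
    response.raise_for_status()
    return response.json().get("is_oa", False)

# Substitute a real article DOI before running.
print(is_open_access("10.1234/placeholder-doi"))
```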

But this can be fixed! If you’re in a university, talk to your librarians about depositing a Green OA version of your article in an institutional repository. If not, you can use the Share your paper service to upload a Green OA version to Zenodo. Your research will be easier to find, easier to access, easier to cite, and available to everyone – not just those with the luxury of an institutional subscription. #dhhacks

A long thread exploring files in the National Archives of Australia with the access status of ‘closed’. This is the 6th consecutive year I’ve harvested ‘closed’ files on or about 1 January.

More updates from The Real Face of White Australia – running facial detection code over NAA: SP42/1.
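
If you’re wondering what that involves, here’s a rough illustration of face detection using OpenCV’s bundled Haar cascade – the project’s actual pipeline is documented in its own repository and may use a different detector.

```python
import cv2

# Load OpenCV's bundled frontal-face Haar cascade.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

# Assumption: 'page.jpg' is a downloaded image of a digitised file page.
image = cv2.imread("page.jpg")
grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Detect faces and save each one as a separate cropped image.
faces = detector.detectMultiScale(grey, scaleFactor=1.1, minNeighbors=5)
for i, (x, y, w, h) in enumerate(faces):
    cv2.imwrite(f"face-{i}.jpg", image[y:y + h, x:x + w])
```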

GLAM Workbench wins British Library Labs Research Award!

Asking questions with web archives – introductory notebooks for historians has won the British Library Labs Research Award for 2020. The awards recognise ‘exceptional projects that have used the Library’s digital collections and data’.

This project gave me a chance to work with web archives collections and staff from the British Library, the National Library of Australia, and the National Library of New Zealand, and was supported by the International Internet Preservation Consortium’s Discretionary Funding Program.

We developed a range of tools, examples, and documentation to help researchers use and explore the vast historical resources available through web archives. A new web archives section was added to the GLAM Workbench, and 16 Jupyter notebooks, combining text, images, and live code, were created.

Here’s a 30-second summary of the project!

The judges noted:

“The panel were impressed with the level of documentation and thought that went into how to work computationally through Jupyter notebooks with web archives which are challenging to work with because of their size. These tools were some of the first of their kind.

“The Labs Advisory Board wanted to acknowledge and reward the incredible work of Tim Sherratt in particular. Tim you have been a pioneer as a one-person lab over many years and these 16 notebooks are a fine addition to your already extensive suite in your GLAM Workbench. Your work has inspired so many in GLAM, the humanities community, and BL Labs to develop their own notebooks. To our audience, we strongly recommend that you look at the GLAM Workbench if you’re interested in doing computational experiments with many institutions’ data sources.”

Thanks to Andy, Olga, Alex, and Ben for your advice and support. And thanks to the British Library Labs for the award! #dhhacks

Want to relive the early days of digital humanities in Australia? I’ve archived the websites created for THATCamp Canberra in 2010, 2011, and 2014. They’re now static sites so search and commenting won’t work, but all the content should be there! #dhhacks

The Invisible Australians website has been given a much needed overhaul, and we’ve brought all our related projects together under the title The real face of White Australia. This includes an updated version of the wall of faces. #dhhacks

The GLAM Workbench as research infrastructure (some basic stats)

Repositories in the GLAM Workbench have been launched on Binder 3,529 times since the start of this year (according to data from the Binder Events log). That’s repository launches, not notebooks. Having launched a repository, users might use multiple notebooks. And of course these stats don’t include people using the notebooks in contexts other than Binder – on their own machines, servers, or services like AARNet’s SWAN. Or just viewing the notebooks in GitHub and copying code into their own projects.
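
If you want to check the numbers yourself, the Binder events archive publishes launch data you can count. Here’s a rough sketch – the file layout and field names are assumptions based on my reading of mybinder.org’s analytics pages.

```python
import json
import requests

# One day's worth of launch events from the mybinder.org events archive.
url = "https://archive.analytics.mybinder.org/events-2020-12-01.jsonl"

launches = 0
for line in requests.get(url).text.splitlines():
    event = json.loads(line)
    # The 'spec' looks like 'owner/repository/branch'.
    if event.get("spec", "").lower().startswith("glam-workbench/"):
        launches += 1

print(f"{launches} GLAM Workbench launches on 1 December 2020")
```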

I’m suspicious of web stats, but the Binder data indicates that people have actually done more than ‘visit’ – they’ve spun up a Binder session ready to do some exploration.

Every Jupyter notebook in the GLAM Workbench has a link that opens the notebook in Binder. If you click on the link, Binder reads configuration details from the repository and loads a customised computing environment. All in your browser! That means you can start using the GLAM Workbench without installing any software. Just click on the Binder link and start exploring!

There are about 40 different repositories in the GLAM Workbench, helping you work with data from Trove, DigitalNZ, NAA, SLNSW, NSW Archives, NMA, ArchivesNZ, ANU Archives & more! The image below shows them ranked by number of Binder launches this year.

The web archives section was added this year in collaboration with the IIPC, the UK Web Archive, the Australian Web Archive, and the NZ Web Archive. Its annual number of launches is inflated a bit by the development process, but there have been 426 launches since it went public in June.

I’m really pleased to see the Trove newspaper harvester up near the top. At least once a day (on average) someone’s been firing up the repository to grab Trove newspaper articles in bulk.

Overall, that’s about 11 GLAM Workbench repository launches a day on Binder. It might not seem like much, but that’s 11 research opportunities that didn’t exist before, 11 GLAM collections opened to exploration, 11 researchers building their digital skills…

As humanities researchers continue to learn about the possibilities of GLAM data and develop their digital skills, the numbers will grow. It’s a start. And a reminder that not all research infrastructure needs to be built in Go8 unis, by large teams, with $millions. We can all contribute by sharing our tools and methods. #dhhacks

Earlier this year I gave a seminar for the International Internet Preservation Consortium (IIPC) introducing the web archives section of the GLAM Workbench. The seminar is now available online: youtu.be/rVidh_wex…

Here are the slides if you want to follow along. #dhhacks

Harvest text from the Australian Women's Weekly!

The Trove Newspaper & Gazette Harvester has been updated to version 0.4.0. The major change is that if the OCRd text for an article isn’t available through the API, it will be automatically downloaded via the web interface. What does this mean in practice? Well, previously you couldn’t harvest OCRd text from the Australian Women’s Weekly because it’s not included in API results, but now you can!

You don’t need to do anything differently. If there are AWW articles in your search, and you ask for all the OCRd text using the --text option, the AWW text files will automagically appear in your harvest.

Under the hood, I’ve started using html2text to remove tags from the OCRd text. I think this should produce more consistent results. As before, line breaks are removed by default from the OCRd text files. However, I’ve now added a --include_linebreaks option if you’d like to keep them. Keeping the line breaks generally produces text that is more human-readable, but note that the line breaks produced by OCR aren’t always accurate.
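
If you’re curious what the html2text step actually does, here’s a minimal illustration – the harvester’s own settings may differ.

```python
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = True  # drop link markup, we just want the words
converter.body_width = 0       # don't hard-wrap the output at 78 characters

# A made-up scrap of article HTML, just to show the conversion.
raw_html = "<p>THE WEEKLY ROUND.<br>News from town and country.</p>"
print(converter.handle(raw_html))
```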

Head to the GLAM Workbench to try it out, or download the code from PyPi. #dhhacks

Beyond the copyright cliff of death

If you’ve done any searching in Trove’s digitised newspapers, you’ve probably noticed that there aren’t many results after 1954. This is basically because of copyright restrictions (though given the complexities of Australia’s copyright system, you can’t be sure that everything published before 1955 is out of copyright). We can visualise the impact of this by looking at the number of newspaper articles in Trove by year.
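
One way to get those counts is via the Trove API’s year facet. Here’s a rough sketch – the decade encoding and empty-query trick are assumptions based on my reading of the API, and the notebook mentioned at the end of this post does it more carefully.

```python
import requests

API_KEY = "YOUR_API_KEY"  # you'll need your own Trove API key
API_URL = "https://api.trove.nla.gov.au/v2/result"

totals = {}
for decade in range(180, 202):  # Trove encodes decades as '180' (1800s) up to '201' (2010s)
    params = {
        "q": " ",            # a single space matches everything
        "zone": "newspaper",
        "facet": "year",
        "l-decade": decade,
        "n": 0,              # we only want the facet counts, not the articles
        "encoding": "json",
        "key": API_KEY,
    }
    data = requests.get(API_URL, params=params).json()
    facets = data["response"]["zone"][0].get("facets") or {}
    for term in facets.get("facet", {}).get("term", []):
        totals[term["display"]] = int(term["count"])

print(sorted(totals.items())[-10:])  # counts for the most recent years
```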

You can see why I started referring to it as the copyright cliff of death.

But you can also see a little trickle of articles continuing post-1954. The number of newspapers from beyond the copyright cliff of death continues to increase as agreements are made with publishers to put them online. I just checked, and there are now 83 newspapers that have at least some post-1954 articles available. Here’s the top 10 (by number of articles).

If you’d like to browse the full list of post-1954 newspapers, here’s the data as a CSV (spreadsheet) file.

If you’d like to see how I generated this list, have a look at this notebook in the Trove Newspapers section of the GLAM Workbench.

If you’d like to know how I created the chart above, have a look at Visualise the total number of newspaper articles in Trove by year and state. #dhhacks

Updated! Find Trove newspapers by place of publication with this simple interface – just click on the map to find the 10 closest newspapers. Now including newspapers added to Trove since June.

You can also browse the locations of all newspapers across Australia.

The underlying data file is available as a spreadsheet. Feel free to add a comment if you notice any problems. I’m geolocating place names found in newspaper titles, so it’s not always exact.
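
If you’d rather work with the data directly, finding the closest titles to a point is just a bit of arithmetic over the spreadsheet. The file and column names below are assumptions – check the downloaded file for the real ones.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in kilometres."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# Assumed file and column names -- adjust to match the downloaded spreadsheet.
df = pd.read_csv("trove-newspaper-locations.csv")
df["distance_km"] = haversine_km(-35.28, 149.13, df["latitude"], df["longitude"])  # central Canberra

print(df.sort_values("distance_km")[["title", "distance_km"]].head(10))
```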

Questions? Ask away at OzGLAM Help. #dhhacks

I’ve added a new section to the GLAM Workbench for the ANU Archives. The first set of notebooks relates to the Sydney Stock Exchange stock and share lists. As the content note describes:

These are large format bound volumes of the official lists that were posted up for the public to see - 3 times a day - forenoon, noon and afternoon - at the close of the trading session in the call room at the Sydney Stock Exchange. The closing prices of stocks and shares were entered in by hand on pre-printed sheets.

The volumes have been digitised, resulting in a collection of 70,000+ high resolution images. You can browse the details of each volume using this notebook.

I’ve been exploring ways of getting useful, machine-readable data out of the images. There’s more information about the processes involved in this repository. I’ve also been working on improving the metadata and have managed to assign a date and session (Morning, Noon, or Afternoon) to each page. With these, we can start to explore the content!
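
For example, a simple starting point is to count the surviving pages for each trading day. Here’s a rough pandas sketch – the file and column names are my assumptions, so check the notebooks for the real ones.

```python
import pandas as pd

# Assumed file and column names -- one row per digitised page, with the trading date.
pages = pd.read_csv("stock-exchange-pages.csv", parse_dates=["date"])

# Count surviving pages for each trading day.
pages_per_day = pages.groupby(pages["date"].dt.date).size()

# Days with no surviving pages won't appear at all, so reindex over the full
# date range to make the gaps visible.
all_days = pd.date_range(pages["date"].min(), pages["date"].max(), freq="D").date
print(pages_per_day.reindex(all_days, fill_value=0).head(20))
```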

One of the notebooks creates a calendar-like view of the whole collection, showing the number of pages surviving from each trading day. This makes it easy to find the gaps and changes in process. #dhhacks

Any regular user of RecordSearch, the National Archives of Australia’s online database, will understand its frustrations. But here’s a handy little hack to fix a couple of annoying problems and add some useful functionality!

The RecordSearch Show Pages userscript updates links to digitised files in search results and item details pages, inserting the number of pages in a file. This means that you can easily scan a list of search results to see where the big fat files are, without having to click through to each one individually.

But wait, there’s more! The script also rewrites the link to the digitised file viewer so that it opens in the current tab, as you would expect, and not in an annoying pop-up window!

And as an extra bonus if you install now, the script also inserts a link on the barcode of an item in the digitised file viewer that takes you back to the item details page. Links to the digitised file viewer are shareable (unlike most RecordSearch links), but they don’t give you a way to find more information about the item. That problem is also fixed by this handy little script.

For more information see OzGLAM Help. #dhhacks

I’ve added more years to my repository of Commonwealth Hansard! The repository now includes XML-formatted text files for both houses from 1901 to 1980, and 1998 to 2005. I’ve done some more checking and confirmed that the XML files for 1981 to 1997 aren’t currently available through ParlInfo, however, the Parliamentary Library are looking into it. I’ve also created a CSV-formatted list of sitting days from 1901 to 2005 (based on ParlInfo search results). Details of the harvesting process are available in the GLAM Workbench. #dhhacks

It was Open Access Week last week, so I tried a little experiment. How many research articles published in Australian Historical Studies between 2008 and 2018 are available via Open Access? Just 9.5% (23 out of 242). This is despite the fact that all articles published in 2018 or earlier are outside of the journal’s embargo period and Green OA versions could be shared through repositories.

Here’s all the code – it could easily be modified to work with other journals: nbviewer.jupyter.org/github/wr… #dhhacks

The Trove Newspaper and Gazette Harvester has been updated to include the snippet field in the harvested metadata. https://ozglam.chat/t/trove-newspaper-gazette-harvester-updated-to-version-0-3-3/56 #dhhacks

Calling users of Australian galleries, libraries, archives, & museums – OzGLAM Help is now live! Ask a question or simply share your latest discoveries. There’s handy tips, news about recent developments, & links to useful tools. Please use & share! #dhhacks

The Zotero translator for RecordSearch (the National Archives of Australia’s online database) has been updated. There are many fixes and enhancements – see the full details. #dhhacks

If you try to share or bookmark the URL of an item in RecordSearch (the National Archives of Australia’s online database), you’ll often get a ‘Session time out’ error when you access it. That’s because the URLs only work within the current active RecordSearch session. So how can you create a shareable link that works across sessions? I’ve created a simple app that helps you create shareable links: recordsearch-links.glitch.me #dhhacks

The Zotero translator for Trove was failing on newspaper articles with tags. I’ve submitted a fix for approval: github.com/zotero/tr…

I’m not sure yet whether the capture of works and search results can be fixed following the Trove redesign. React is not very scraper friendly…

Another #GLAMWorkbench update! Snip words out of @TroveAustralia newspaper pages and create big composite images. OCR art! glam-workbench.github.io/trove-new… #dhhacks

Just in time for #GovHack, I’ve given the Trove API Console a major overhaul. It’s been updated for the latest API versions and has MANY MANY more examples. Explore all the data you can get from @TroveAustralia! troveconsole.herokuapp.com #dhhacks