Secrets and lives

Here’s the video of my presentation, ‘Secrets and lies’, for the (Re)create symposium at the University of Canberra, 21 April 2021. It’s mainly about finding and extracting redactions in ASIO surveillance files held by the National Archives of Australia.


Here are links to the various sites and resources mentioned in the video:

I haven’t yet written up the details of training my latest redaction finder. When I do, I’ll post it here! #dhhacks

Recently digitised files in the National Archives of Australia

I’m interested in understanding what gets digitised and when by our cultural institutions, but accessible data is scarce. The National Archives of Australia lists ‘newly scanned’ records in RecordSearch, so I thought I’d see if I could convert that list into a machine-readable form for analysis. I’ve had a lot of experience trying to get data out of RecordSearch, but even so it took me a while to figure out how the ‘newly scanned’ page worked. Eventually I was able to extract all the file metadata from the list and save it to a CSV file. The details are in this notebook in the GLAM Workbench.

I used the code to create a dataset of all the files digitised in the past month. The ‘newly scanned’ list only displays a month’s worth of additions, so that’s as much as I could get in one hit. In the past month, 24,039 files were digitised. 22,500 of these (about 93%) come from just four series of military records. This is no surprise, as the NAA is currently undertaking a major project to digitise WW2 service records. What is perhaps more interesting is the long tail of series from which a small number of files were digitised. 357 of the 375 series represented in the dataset (about 95%) appear 20 or fewer times. 210 series have had only one file digitised in the last month. I’m assuming that this diversity represents research interests, refracted through the digitisation on demand service. But this really needs more data, and more analysis.
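The analysis itself is straightforward once the metadata is in a CSV. As a rough sketch (assuming a `series` column in the harvested file; the actual column names may differ), counting the long tail looks something like this:

```python
import csv
from collections import Counter

def series_counts(csv_path):
    """Count how many digitised files belong to each series.

    Assumes the harvested CSV has a 'series' column identifying
    the NAA series each file belongs to.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row["series"] for row in csv.DictReader(f))

def long_tail(counts, threshold=20):
    """Return the number of series with `threshold` or fewer digitised files."""
    return sum(1 for n in counts.values() if n <= threshold)
```

With the month's dataset loaded, `long_tail(series_counts("digitised.csv"))` gives the count of series appearing 20 or fewer times.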

As I mentioned, only one month’s data is available from RecordSearch at any time. To try and capture a longer record of the digitisation process, I’ve set up an automated ‘git scraper’ that runs every Sunday and captures metadata of all the files digitised in the preceding week. The weekly datasets are saved as CSV files in a public GitHub repository. Over time, this should become a useful dataset for exploring long-term patterns in digitisation. #dhhacks

Moving on from Patreon...

Over the last few years, I’ve been very grateful for the support of my Patreon subscribers. Financially, their contributions have helped me cover a substantial proportion of the cloud hosting costs associated with projects like Historic Hansard and The Real Face of White Australia. But, more importantly, just knowing that they thought my work was of value has helped keep me going, and inspired me to develop a range of new resources.

However, while I’ve been grateful for the platform provided by Patreon, I’ve increasingly felt that it’s not a good fit for the sort of work I do. Patreon is geared towards providing special content to supporters, but, as you know, all my work is open. And that’s really important to me.

Recently GitHub opened up its own sponsorship program for the development of open source software. This program seems to align more closely with what I do. I already share and manage my code through GitHub, so integrating sponsorship seems to make a lot of sense. It’s worth noting, too, that unlike Patreon, GitHub charges no fees and takes no cut of your contributions. As a result I’ve decided to close my Patreon account by the end of April, and create a GitHub sponsors page.

What does this mean for you?

If you’re a Patreon subscriber and you’d like to keep supporting me, you should cancel your Patreon contribution, then head over to my brand new GitHub sponsors page and sign up! Thanks for your continued support!

If you’d prefer to let your contributions lapse, just do nothing. Your payments will stop when I close the account at the end of April. I understand that circumstances change – thank you so much for your support over the years, and I hope you will continue to make use of the things I create.

If you make use of any of my tools or resources and would like to support their continued development, please think about becoming a sponsor. For a sample of the sorts of things I’ve been working on lately, see my updates feed.

The future!

I’m very excited about the possibilities ahead. The GLAM Workbench has received a lot of attention around the world (including a Research Award from the British Library Labs), and I’m planning some major developments over coming months. And, of course, I won’t forget all my other resources – I spent a lot of time in 2020 migrating databases and platforms to keep everything chugging along.

On my GitHub sponsors page, I’ve set an initial target of 50 sponsors. That might be ambitious, but as I said above, it’s not just about money. Being able to point to a group of people who use and value this work will help me argue for new ways of enabling digital research in the humanities. So please help me spread the word – let’s make things together!

What can you do with the GLAM Workbench?

You might have noticed some changes to the GLAM Workbench home page recently. One of the difficulties has always been trying to explain what the GLAM Workbench actually is, so I thought it might be useful to put more examples up front. The home page now lists about 25 notebooks under the headings:

Hopefully they give a decent representation of the sorts of things you can do using the GLAM Workbench. I’ve also included a little rotating slideshow built using

Other recent additions include a new Grants and Awards page. #dhhacks

Reclaim Cloud integration coming soon to the GLAM Workbench

I’ve been doing a bit of work behind the scenes lately to prepare for a major update to the GLAM Workbench. My plan is to provide one click installation of any of the GLAM Workbench repositories on the Reclaim Cloud platform. This will provide a useful step up from Binder for any researcher who wants to do large-scale or sustained work using the GLAM Workbench. Reclaim Cloud is a paid service, but they do a great job supporting digital scholarship in the humanities, and it’s fairly easy to minimise your costs by shutting down environments when they’re not in use.

I’ve still got a lot of work to do to roll this out across the GLAM Workbench’s 40 repositories, but if you’d like a preview, head to the Trove Newspaper and Gazette Harvester repository on GitHub. Get yourself a Reclaim Cloud account and click on the Launch on Reclaim Cloud button. It’s that easy!

There are some technical notes in the Reclaim Hosting forum, and a post by Reclaim Hosting guru Jim Groom describing his own experience spinning up the GLAM Workbench.

Watch this space for more news! #dhhacks

Some recent GLAM Workbench presentations

I’ve given a couple of talks lately on the GLAM Workbench and some of my other work relating to the construction of online access to GLAM collections. Videos and slides are available for both:

  • From collections as data to collections as infrastructure: Building the GLAM Workbench, seminar for the Centre for Creative and Cultural Research, University of Canberra, 22 February 2021 – video (40 minutes) and slides

I’ve also updated the presentations page in the GLAM Workbench. #dhhacks

Some GLAM Workbench datasets to explore for Open Data Day

It was Open Data Day on Saturday 6 March – here are some of the ready-to-go datasets you can find in the GLAM Workbench – there’s something for historians, humanities researchers, teachers & more!

And if that’s not enough data, the GLAM Workbench provides tools to help you create your own datasets from Trove, the National Archives of Australia, the National Museum of Australia, Archives NZ, DigitalNZ, & more! #dhhacks

Zotero translator for NAA RecordSearch updated

The recent change of labels from ‘Barcode’ to ‘ItemID’ in the National Archives of Australia’s RecordSearch database broke the Zotero translator. I’ve now updated the translator, and the new version has been merged into the Zotero translators repository. It should be updated when you restart Zotero, but if not you can go to Preferences > Advanced > Files and folders and click on the Reset translators button.

The translator lets you:

  • Capture metadata from single items, series, or photos
  • Capture multiple items, series, or photos from search results
  • Save photos from PhotoSearch
  • Save digitised files as PDFs
  • Save page images from the digitised file viewer

For more information or to ask questions, go to OzGLAM Help. #dhhacks

TroveNewsBot upgraded – now sharing articles published 'on this day'!

@TroveNewsBot has been sharing Trove newspaper articles on Twitter for over 7 years. With its latest upgrade the bot now has an ‘on this day’ function. Every day at 9.00am AEST, TroveNewsBot will share an article published on that day in the past.

Even better, you can make your own ‘on this day’ queries by tweeting to @TroveNewsBot with the hashtag #onthisday. For example:

  • Tweeting ‘#onthisday #luckydip’ – will return a random article published on this day in the past.
  • Tweeting ‘cat #onthisday #luckydip’ – will return a random article about cats published on this day in the past.
  • Tweeting ‘1920 #year #onthisday’ – will return an article published on this day in 1920.

See @TroveNewsBot’s help page for a full list of parameters. #dhhacks

The NAA recently changed field labels in RecordSearch, so that ‘Barcode’ is now ‘Item ID’. This required an update to my recordsearch_tools screen scraper. I also had to make a few changes in the RecordSearch section of the GLAM Workbench. #dhhacks

Open access publishing for Australian historians

After some recent investigations of the availability of open access versions of articles published in paywalled Australian history journals, I’ve started a Google doc to capture useful links and information for Australian historians wanting to make their research open access. Comments and additions are welcome. #dhhacks

Who was linking to Trove newspapers in 2014?

In 2014 I pulled together a sample of web pages that included links back to digitised newspaper articles in Trove and created the ‘Trove Traces’ app. It was interesting, and sometimes disturbing, to see the diversity of sites that made use of Trove. Amongst the family and local history enthusiasts were climate change deniers and racists who found ‘evidence’ for their views in past newspapers. And of course, the sample only includes links in web pages, not social media sharing.

The app has since disappeared, but I’ve now published a static version. The search and sort functions won’t work, but you can browse through the sample of backlinks by web page or by linked Trove newspaper article. There are also lists of the most frequently occurring pages and articles. #dhhacks

New! DigitalNZ API Query Builder added to GLAM Workbench

I’ve added an API Query Builder to the DigitalNZ section of the GLAM Workbench. You can use it to learn about the different parameters available from the search API, and experiment with different queries. Just get your API key from DigitalNZ, then try entering keywords and selecting options. Once you understand how the API works, you can start thinking about how you can make use of it in your own projects.

👉🏻 Try it out live on Binder!

Under the hood the API Query Builder is a Jupyter notebook (of course), but it uses ipyvuetify to create good-looking, responsive form widgets. It’s intended to be run using Voilà, which turns notebooks into interactive apps and dashboards. You can now run any Jupyter notebook using Voilà on Binder, just by changing the URL.
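The URL change is simple enough to script. As a sketch (the repository and notebook names below are placeholders, not necessarily the real ones), a standard Binder link becomes a Voilà app by pointing the `urlpath` parameter at `voila/render/`:

```python
from urllib.parse import quote

def binder_voila_url(owner, repo, branch, notebook):
    """Build a mybinder.org link that opens a notebook as a Voilà app.

    Instead of the normal notebook interface, Binder hands the
    session straight to `voila/render/<notebook>`.
    """
    urlpath = quote(f"voila/render/{notebook}", safe="")
    return f"https://mybinder.org/v2/gh/{owner}/{repo}/{branch}?urlpath={urlpath}"

# Hypothetical example repository and notebook names:
print(binder_voila_url("GLAM-Workbench", "digitalnz", "master",
                       "api_query_builder.ipynb"))
```

The `urlpath` value is percent-encoded so the slashes survive as part of a single query parameter.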

If this app seems useful (let me know!) I might put a version on Heroku so the start up time is reduced. I’m also thinking of using this sort of pattern to create apps for other APIs in the GLAM Workbench. #dhhacks

OpenGLAM fireworks! Finding open collections in DigitalNZ

Lately I’ve been updating and expanding the notebooks in the DigitalNZ section of the GLAM Workbench. In particular, I’ve been looking at the usage facet to understand how much of the aggregated content is ‘open’. What do I mean by ‘open’? The Open Knowledge Foundation definition states that ‘open data and content can be freely used, modified, and shared by anyone for any purpose’. Obviously things that are in the public domain, such as out-of-copyright resources, are open. But so are resources with an open licence such as CC-BY or CC-BY-SA. The Creative Commons ‘Non commercial’ and ‘No derivatives’ licences are not open because they put limits on how you can use resources.

How does this definition map to DigitalNZ? The usage facet includes five values:

  • Share
  • Modify
  • Use commercially
  • All rights reserved
  • Unknown

These values have been assigned by DigitalNZ based on the 35,000 different rights statements and 30 different copyright statements that are included in DigitalNZ metadata records. I find I have to turn the usage values inside out to really understand them. A resource that only allows you to ‘Share’ excludes the ‘Modify’ and ‘Use commercially’ permissions, and so is roughly equivalent to a CC-BY-NC-ND licence. The only open value, according to the definition above, is ‘Use commercially’, which is like CC-BY. I’m assuming that ‘Use commercially’ has been assigned to resources that are either out of copyright (or have no known copyright restrictions) or openly licensed.

It’s also worth noting that the ‘usage’ values are not mutually exclusive. A record with a ‘usage’ value of ‘Use commercially’ will also be assigned the ‘Share’ and ‘Modify’ values, because ‘Use commercially’ includes the ‘Share’ and ‘Modify’ permissions. This seems a bit counter-intuitive, but makes sense if you think about doing a search for everything you’re allowed to share.

A rough calculation based on the usage facet indicates that 71.76% of the resources aggregated by DigitalNZ are open. That seems pretty good, though a lot of those are probably out-of-copyright newspaper articles from Papers Past. For a more fine-grained analysis, I decided to look at the ‘usage’ data for each combination of ‘content_partner’ and ‘primary_collection’. How open is each individual collection in DigitalNZ?
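The calculation is easy to reproduce once you have the facet counts. As a rough sketch (the numbers below are made up, and I’m assuming the facet arrives as a simple mapping of value to count, as you might parse it from a DigitalNZ search response), openness is just the ‘Use commercially’ share of the total:

```python
def openness(usage_facet, total_records):
    """Calculate the proportion of 'open' records from a usage facet.

    Following the definition above, only records tagged
    'Use commercially' count as open. `usage_facet` maps each
    facet value to its record count.
    """
    return usage_facet.get("Use commercially", 0) / total_records

# Hypothetical facet counts for a single collection:
facet = {"Share": 900, "Modify": 750, "Use commercially": 700,
         "All rights reserved": 250, "Unknown": 50}
print(f"{openness(facet, 1000):.1%} open")  # prints '70.0% open'
```

Note that because the usage values overlap, you divide by the total record count, not by the sum of the facet counts.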

For added excitement, and to stretch my knowledge of what Altair can do, I decided to visualise the results as a display of colourful fireworks. The higher the explosion, the more open the collection! I’m pretty pleased with the result.

I’ve saved an HTML version of the chart so you can mouseover the explosions for more details. All the code is included in this notebook, along with a CSV file containing all the harvested facet data. #dhhacks

Easy browsing of Trove newspapers with these keyboard shortcuts!

If you like browsing Trove’s digitised newspapers page by page, you might have found that the current interface is a bit clunky. To move between pages you have to hover over the page number and click on ‘Next’ or ‘Previous’. Wouldn’t it be good if you could just use the arrow keys on your keyboard? Well now you can!

I’ve created a very simple script that allows you to use the arrows on your keyboard to move between pages in Trove’s digitised newspapers. Once it’s enabled you can use the following shortcuts:

  • ⬆️ Up arrow – go to the first page of the previous issue
  • ⬇️ Down arrow – go to the first page of the next issue
  • ⬅️ Left arrow – go to the previous page
  • ➡️ Right arrow – go to the next page

Go here for installation instructions.

This makes browsing through a newspaper much easier. Enjoy! #dhhacks

New dataset and notebooks – twenty years of ABC Radio National

There’s a new GLAM Workbench section for working with data from Trove’s Music & Sound zone!

Inside you’ll find out how to harvest all the metadata from ABC Radio National program records – that’s 400,000+ records, from 160 Radio National programs, over more than 20 years.

It’s metadata only, so not full transcripts or audio, though there are links back to the ABC site where you might find transcripts. Most records should at least have a title, a date, the name of the program it was broadcast on, a list of contributors, and perhaps a brief abstract/summary. It’s also worth noting that many of these records, particularly those from the main current affairs programs, represent individual stories or segments – so they provide a detailed record of the major news stories for the last couple of decades!

The harvesting notebook shows you how to get the data from the Trove API. There are a number of duplicate records, and some inconsistencies in the way the data is formatted, so the harvesting code tries to clean things up a bit. You can of course adjust this to meet your own needs.

If you don’t want to do the harvesting yourself, there are pre-harvested datasets that you can download immediately from Cloudstor and start exploring. The complete harvest of all 400,000+ records is available in both JSONL (newline-delimited JSON) and CSV formats. There’s also a series of separate datasets for the most frequently occurring programs: RN Breakfast, RN Drive, AM, PM, The World Today, Late Night Live, Life Matters, and the Science Show.
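If you grab the JSONL version, each line is a separate JSON record, so you can process it with nothing but the standard library. A minimal sketch (assuming a `program` field in each record; check the harvested data for the actual field names):

```python
import json
from collections import Counter

def count_programs(jsonl_path):
    """Tally records per program in a JSONL harvest.

    Each line is parsed as a JSON object; records without a
    'program' field are grouped under 'unknown'.
    """
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            counts[record.get("program", "unknown")] += 1
    return counts
```

Something like `count_programs("abcrn.jsonl").most_common(10)` would then give you the ten biggest programs in the harvest.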

There’s also a notebook that demonstrates a few possible ways you might start to play with the data – looking at the range of programs, the distribution of records over time, the people involved in each story, and words in the titles of each segment.

This is a very rich source of data for examining Australia’s political and social history over the last twenty years. Dive in and see what you can find! #dhhacks

Finding non-English newspapers in Trove

There are a growing number of non-English newspapers in Trove, but how do you know what’s there? After trying a few different approaches, I generated a list of 48 newspapers with non-English content. The full details are in this notebook.

As the notebook describes, I found the language metadata for newspapers was incomplete, so I used some language detection code on a sample of articles from every newspaper to try and find those with non-English content. But this had its own problems – such as the fact that the language detection code thought that bad OCR looked like Maltese…
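To give a flavour of the approach (this is a simplified stand-in, not the detection library used in the notebook), a crude stopword-ratio check shows why OCR quality matters: garbled text contains few common function words, so it scores as ‘not English’ just like genuinely non-English text:

```python
import re

# A tiny set of common English function words.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that",
                     "for", "with", "was"}

def probably_english(text, threshold=0.05):
    """Crude check: what share of tokens are common English stopwords?

    Both bad OCR and non-English text score low, which echoes the
    problem described above -- garbled OCR can look like anything.
    """
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    if not tokens:
        return False
    ratio = sum(t in ENGLISH_STOPWORDS for t in tokens) / len(tokens)
    return ratio >= threshold

print(probably_english("The price of wheat in the colony was high"))  # True
print(probably_english("xjq zzv kplm wrtk"))                          # False
```

A proper language detector works on character n-gram statistics rather than word lists, which is exactly why noisy OCR can end up classified as Maltese.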

Anyway, if you’re just searching using English keywords, you might not even be aware that these titles exist. It’s important to explore ways of making diversity visible within large digitised collections. #dhhacks

Open access versions of Australian history articles

Last year I did some analysis of the availability of open access versions of research articles published between 2008 and 2018 in Australian Historical Studies. I’ve now broadened this out to cover all individual articles (with a DOI) across a number of journals. It’s pretty grim. Despite Green OA policies that allow researchers to share versions of their articles through institutional repositories, Australian history journals still seem to be about 94% closed.

Full details are in this notebook.
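If you want to check numbers like this yourself, Unpaywall is the obvious starting point: its API reports an `is_oa` flag for each DOI. A minimal sketch (the email address is a required courtesy parameter; substitute your own):

```python
import json
from urllib.request import urlopen

def is_open_access(doi, email):
    """Look up a DOI in Unpaywall and report whether an OA copy exists."""
    url = f"https://api.unpaywall.org/v2/{doi}?email={email}"
    with urlopen(url) as response:
        return json.load(response).get("is_oa", False)

def percent_closed(statuses):
    """Given a list of is_oa booleans, return the percentage closed."""
    if not statuses:
        return 0.0
    return 100 * statuses.count(False) / len(statuses)
```

Loop `is_open_access` over a journal’s DOIs (politely, with a delay), then feed the results to `percent_closed` to get a figure comparable to the ~94% above.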

But this can be fixed! If you’re in a university, talk to your librarians about depositing a Green OA version of your article in an institutional repository. If not, you can use the Share your paper service to upload a Green OA version to Zenodo. Your research will be easier to find, easier to access, easier to cite, and available to everyone – not just those with the luxury of an institutional subscription. #dhhacks

A long thread exploring files in the National Archives of Australia with the access status of ‘closed’. This is the 6th consecutive year I’ve harvested ‘closed’ files on or about 1 January.

More updates from The Real Face of White Australia – running facial detection code over NAA: SP42/1.

GLAM Workbench wins British Library Labs Research Award!

Asking questions with web archives – introductory notebooks for historians has won the British Library Labs Research Award for 2020. The awards recognise ‘exceptional projects that have used the Library’s digital collections and data’.

This project gave me a chance to work with web archives collections and staff from the British Library, the National Library of Australia, and the National Library of New Zealand, and was supported by the International Internet Preservation Consortium’s Discretionary Funding Program.

We developed a range of tools, examples, and documentation to help researchers use and explore the vast historical resources available through web archives. A new web archives section was added to the GLAM Workbench, and 16 Jupyter notebooks, combining text, images, and live code, were created.

Here’s a 30 second summary of the project!

The judges noted:

“The panel were impressed with the level of documentation and thought that went into how to work computationally through Jupyter notebooks with web archives which are challenging to work with because of their size. These tools were some of the first of their kind.

“The Labs Advisory Board wanted to acknowledge and reward the incredible work of Tim Sherratt in particular. Tim you have been a pioneer as a one-person lab over many years and these 16 notebooks are a fine addition to your already extensive suite in your GLAM Workbench. Your work has inspired so many in GLAM, the humanities community, and BL Labs to develop their own notebooks. To our audience, we strongly recommend that you look at the GLAM Workbench if you’re interested in doing computational experiments with many institutions’ data sources.”

Thanks to Andy, Olga, Alex, and Ben for your advice and support. And thanks to the British Library Labs for the award! #dhhacks

Want to relive the early days of digital humanities in Australia? I’ve archived the websites created for THATCamp Canberra in 2010, 2011, and 2014. They’re now static sites so search and commenting won’t work, but all the content should be there! #dhhacks

The Invisible Australians website has been given a much needed overhaul, and we’ve brought all our related projects together under the title The real face of White Australia. This includes an updated version of the wall of faces. #dhhacks

The GLAM Workbench as research infrastructure (some basic stats)

Repositories in the GLAM Workbench have been launched on Binder 3,529 times since the start of this year (according to data from the Binder Events log). That’s repository launches, not notebooks. Having launched a repository, users might use multiple notebooks. And of course these stats don’t include people using the notebooks in contexts other than Binder – on their own machines, servers, or services like AARNet’s SWAN. Or just viewing the notebooks in GitHub and copying code into their own projects.

I’m suspicious of web stats, but the Binder data indicates that people have actually done more than ‘visit’ – they’ve spun up a Binder session ready to do some exploration.

Every Jupyter notebook in the GLAM Workbench has a link that opens the notebook in Binder. If you click on the link, Binder reads configuration details from the repository and loads a customised computing environment. All in your browser! That means you can start using the GLAM Workbench without installing any software. Just click on the Binder link and start exploring!

There are about 40 different repositories in the GLAM Workbench, helping you work with data from Trove, DigitalNZ, NAA, SLNSW, NSW Archives, NMA, ArchivesNZ, ANU Archives & more! The image below shows them ranked by number of Binder launches this year.

The web archives section was added this year in collaboration with the IIPC, the UK Web Archive, the Australian Web Archive, and the NZ Web Archive. Its annual number of launches is inflated a bit by the development process. But there have been 426 launches since it went public in June.

I’m really pleased to see the Trove newspaper harvester up near the top. At least once a day (on average) someone’s been firing up the repository to grab Trove newspaper articles in bulk.

Overall, that’s about 11 GLAM Workbench repository launches a day on Binder. It might not seem like much, but that’s 11 research opportunities that didn’t exist before, 11 GLAM collections opened to exploration, 11 researchers building their digital skills…

As humanities researchers continue to learn of the possibilities of GLAM data and develop their digital skills the numbers will grow. It’s a start. And a reminder that not all research infrastructure needs to be built in Go8 unis, by large teams, with $millions. We can all contribute by sharing our tools and methods. #dhhacks