I deliberately don’t keep any stats about GLAM Workbench visits, because I think they’re pretty meaningless. On the other hand, I’m always interested to see how often GLAM Workbench repositories are launched on Binder. Rather than just random clicks, these numbers represent the number of times users started new computing sessions using the GLAM Workbench. I just compiled these stats for the past year, and I was very pleased to see that the Web Archives section has been launched over 1,000 times in the past twelve months!
Five of the GLAM Workbench repositories now have automatically built Docker images and 1-click integration with Reclaim Cloud – ANU Archives, Trove Newspapers, Trove Newspaper Harvester, NAA RecordSearch, & Web Archives.
This means you can launch your very own version of these GLAM Workbench repositories in the cloud, where all your downloads and experiments will be saved! Find out more on the Using Reclaim Cloud page.
I’ve updated my harvest of Australian GLAM datasets from state/national government open data portals. There are now 387 datasets, containing 1049 files (including 684 CSVs). There’s a list if you want to browse, and a CSV file if you want to download all the metadata. For more information see the data portals section of the GLAM Workbench.
If you’re interested in finding out what’s inside all those 684 CSV files, take the GLAM CSV Explorer for a spin!
If you work with the collections of the National Archives of Australia, you might find the RecordSearch section of the GLAM Workbench helpful. I’ve just updated the repository to add new options for running the notebooks, including 1-click installation on Reclaim Cloud. There are also a few new notebooks.
New notebooks and datasets

Harvest details of all series in RecordSearch – gets details of all series registered in RecordSearch, and also generates a summary dataset with the total number of items digitised, described, and in each access category
Exploring harvested series data – generates some basic statistics from the harvest of series data
Summary data about all series in RecordSearch (15MB CSV) – contains basic descriptive information about all the series currently registered on RecordSearch (May 2021) as well as the total number of items described, digitised, and in each access category

Updated

I’ve started (but not completed) updating all the notebooks in this repository to use my new RecordSearch Data Scraper.
My program of rolling out new features and integrations across the GLAM Workbench continues. The latest section to be updated is the Web Archives section!
There are no new notebooks with this update, but some important changes under the hood. If you haven’t used it before, the Web Archives section contains 16 notebooks providing documentation, tools, apps, and examples to help you make use of web archives in your research. The notebooks are grouped by the following topics: Types of data, Harvesting data and creating datasets, and Exploring change over time.
There’s no doubt that Trove’s digitised newspapers have had a significant impact on the practice of history in Australia. But analysing that impact is difficult when Trove itself is always changing – more newspapers and articles are being added all the time.
In an attempt to chart the development of Trove, I’ve created a dataset that shows (approximately) when particular newspaper titles were first added. This gives a rough snapshot of what Trove contained at any point in the last 12 years.
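A minimal sketch of how a dataset like this could be turned into a snapshot: given the date each title was first seen, list the titles available on a chosen date. The function name, the data layout, and the titles and dates below are invented placeholders, not values from the actual dataset.

```python
from datetime import date


def titles_available(first_seen, on_date):
    """Return the sorted titles whose first-seen date is on or before on_date."""
    return sorted(title for title, seen in first_seen.items() if seen <= on_date)


# Placeholder titles and dates, invented purely for illustration.
first_seen = {
    "Example Title A": date(2009, 11, 1),
    "Example Title B": date(2012, 3, 15),
    "Example Title C": date(2018, 7, 30),
}

snapshot = titles_available(first_seen, date(2013, 1, 1))
print(snapshot)
```

Because the first-seen dates are only approximate, any snapshot built this way is approximate too.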
To make it easier for people to suggest additions, I’ve created a GitHub repository for my list of GLAM Jupyter examples and resources. Contributions are welcome!
This list is automatically pulled into the GLAM Workbench’s help documentation. #dhhacks
I recently made some changes in the GLAM Workbench’s Help documentation, adding a new Running notebooks section. This section provides detailed information on running and managing GLAM Workbench repositories using Reclaim Cloud and Docker.
I’m still rolling out this functionality across all the repositories, but it’s going to take a while. When I’m finished you’ll be able to create your own persistent environment on Reclaim Cloud from any repository with just the click of a button.
As I foreshadowed some weeks ago, I’ve shut down my Patreon page. Thanks to everyone who has supported me there over the last few years!
I’ve now shifted across to GitHub Sponsors, which is focused on supporting open source projects. This seems like a much better fit for the things that I do, which are all free and open by default.
So if you think things like the GLAM Workbench, Historic Hansard, OzGLAM Help, and The Real Face of White Australia are worth supporting, you can sign up using my GitHub Sponsors page.
I’ve updated, refreshed, and reorganised the Trove newspapers section of the GLAM Workbench. There are currently 22 Jupyter notebooks organised under the following headings:
Trove newspapers in context – Notebooks in this section look at the Trove newspaper corpus as a whole, to try and understand what’s there, and what’s not.
Visualising searches – Notebooks in this section demonstrate some ways of visualising searches in Trove newspapers – seeing everything rather than just a list of search results.
It was way back in 2009 that I created my first scraper for getting machine-readable data out of the National Archives of Australia’s online database, RecordSearch. Since then I’ve used versions of this scraper in a number of different projects such as The Real Face of White Australia, Closed Access, and Redacted (including the recent update). The scraper is also embedded in many of the notebooks that I’ve created for the RecordSearch section of the GLAM Workbench.
I’m interested in understanding what gets digitised by our cultural institutions, and when, but accessible data is scarce. The National Archives of Australia lists ‘newly scanned’ records in RecordSearch, so I thought I’d see if I could convert that list into a machine-readable form for analysis. I’ve had a lot of experience trying to get data out of RecordSearch, but even so it took me a while to figure out how the ‘newly scanned’ page worked.
Over the last few years, I’ve been very grateful for the support of my Patreon subscribers. Financially, their contributions have helped me cover a substantial proportion of the cloud hosting costs associated with projects like Historic Hansard and The Real Face of White Australia. But, more importantly, just knowing that they thought my work was of value has helped keep me going, and inspired me to develop a range of new resources.
You might have noticed some changes to the GLAM Workbench home page recently. One of the difficulties has always been trying to explain what the GLAM Workbench actually is, so I thought it might be useful to put more examples up front. The home page now lists about 25 notebooks under the headings:
Finding GLAM data
Asking different questions
Hacking heritage
Bringing documentation alive

Hopefully they give a decent representation of the sorts of things you can do using the GLAM Workbench.
I’ve been doing a bit of work behind the scenes lately to prepare for a major update to the GLAM Workbench. My plan is to provide one-click installation of any of the GLAM Workbench repositories on the Reclaim Cloud platform. This will provide a useful step up from Binder for any researcher who wants to do large-scale or sustained work using the GLAM Workbench. Reclaim Cloud is a paid service, but they do a great job supporting digital scholarship in the humanities, and it’s fairly easy to minimise your costs by shutting down environments when they’re not in use.
I’ve given a couple of talks lately on the GLAM Workbench and some of my other work relating to the construction of online access to GLAM collections. Videos and slides are available for both:
From collections as data to collections as infrastructure: Building the GLAM Workbench, seminar for the Centre for Creative and Cultural Research, University of Canberra, 22 February 2021 – video (40 minutes) and slides
Building the GLAM Workbench (and various other projects such as The Real Face of White Australia, Closed Access, and redacted), guest lecture for the Cultural Data Sculpting course, EPFL, Switzerland, 18 March 2021 – video (1hr 40mins) and slides

I’ve also updated the presentations page in the GLAM Workbench.
It was Open Data Day on Saturday 6 March – here are some of the ready-to-go datasets you can find in the GLAM Workbench – there’s something for historians, humanities researchers, teachers & more!
First here’s a list of Australian GLAM (that’s galleries, libraries, archives & museums) data sources. It includes APIs, portals, and downloadable datasets. Suggested additions welcome!
There’s also a list of Australian GLAM datasets that are available through government open data portals.
The NAA recently changed field labels in RecordSearch, so that ‘Barcode’ is now ‘Item ID’. This required an update to my recordsearch_tools screen scraper. I also had to make a few changes in the RecordSearch section of the GLAM Workbench. #dhhacks
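To illustrate why a label change like this matters to a scraper, here is a small sketch of a lookup that accepts either label. It is not the code used in recordsearch_tools; the function name, the dictionary layout, and the sample values are hypothetical.

```python
def get_item_id(details):
    """Return the item identifier from a dict of scraped labels and values.

    Checks the newer 'Item ID' label first, then falls back to the older
    'Barcode' label. Returns None if neither label is present.
    """
    for label in ("Item ID", "Barcode"):
        if label in details:
            return details[label]
    return None


# Hypothetical scraped records, one using each label.
old_style = {"Title": "Example item", "Barcode": "0000001"}
new_style = {"Title": "Example item", "Item ID": "0000001"}

print(get_item_id(old_style), get_item_id(new_style))
```

Keeping the old label as a fallback means data harvested before the change can still be processed with the same code.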
I’ve added an API Query Builder to the DigitalNZ section of the GLAM Workbench. You can use it to learn about the different parameters available from the search API, and experiment with different queries. Just get your API key from DigitalNZ, then try entering keywords and selecting options. Once you understand how the API works, you can start thinking about how you can make use of it in your own projects.
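As a rough idea of what a query to the search API looks like, the sketch below assembles a request URL from a keyword and some options. The endpoint and the parameter names (`api_key`, `text`, `per_page`, `page`) are assumptions to check against the DigitalNZ API documentation, and `YOUR_API_KEY` is a placeholder. The code only builds the URL; it doesn’t send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the DigitalNZ API documentation.
API_URL = "https://api.digitalnz.org/v3/records.json"


def build_query_url(api_key, text, **options):
    """Build a search URL from an API key, keywords, and optional parameters."""
    params = {"api_key": api_key, "text": text}
    params.update(options)
    return f"{API_URL}?{urlencode(params)}"


url = build_query_url("YOUR_API_KEY", "kiwi birds", per_page=10, page=1)
print(url)
```

Building the URL separately from sending the request makes it easy to inspect exactly what is being asked of the API before you run a large harvest.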
Lately I’ve been updating and expanding the notebooks in the DigitalNZ section of the GLAM Workbench. In particular, I’ve been looking at the usage facet to understand how much of the aggregated content is ‘open’. What do I mean by ‘open’? The Open Knowledge Foundation definition states that ‘open data and content can be freely used, modified, and shared by anyone for any purpose’. Obviously things that are in the public domain, such as out-of-copyright resources, are open.
There’s a new GLAM Workbench section for working with data from Trove’s Music & Sound zone!
Inside you’ll find out how to harvest all the metadata from ABC Radio National program records – that’s 400,000+ records, from 160 Radio National programs, over more than 20 years.
It’s metadata only, so not full transcripts or audio, though there are links back to the ABC site where you might find transcripts. Most records should at least have a title, a date, the name of the program it was broadcast on, a list of contributors, and perhaps a brief abstract/summary.
Asking questions with web archives – introductory notebooks for historians has won the British Library Labs Research Award for 2020. The awards recognise ‘exceptional projects that have used the Library’s digital collections and data’.
This project gave me a chance to work with web archives collections and staff from the British Library, the National Library of Australia, and the National Library of New Zealand, and was supported by the International Internet Preservation Consortium’s Discretionary Funding Program.
Repositories in the GLAM Workbench have been launched on Binder 3,529 times since the start of this year (according to data from the Binder Events log). That’s repository launches, not notebooks. Having launched a repository, users might use multiple notebooks. And of course these stats don’t include people using the notebooks in contexts other than Binder – on their own machines, servers, or services like AARNet’s SWAN. Or just viewing the notebooks in GitHub and copying code into their own projects.
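A minimal sketch of how launch counts could be tallied from event records. It assumes each record is a line of JSON with `provider` and `spec` fields, where `spec` looks like `owner/repo/ref`; check those field names against the actual log files. The sample records are made up for illustration and are not real launch data.

```python
import json
from collections import Counter


def count_launches(lines, owner):
    """Count launch events per repository for one GitHub owner.

    Assumes each non-empty line is a JSON object with 'provider' and 'spec'
    fields, and that spec has the form 'owner/repo/ref'.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("provider") != "GitHub":
            continue
        parts = event.get("spec", "").split("/")
        if len(parts) >= 2 and parts[0].lower() == owner.lower():
            counts[parts[1]] += 1
    return counts


# Made-up sample records, not real launch data.
sample_lines = [
    '{"provider": "GitHub", "spec": "GLAM-Workbench/web-archives/master"}',
    '{"provider": "GitHub", "spec": "GLAM-Workbench/web-archives/master"}',
    '{"provider": "GitHub", "spec": "GLAM-Workbench/trove-newspapers/master"}',
    '{"provider": "GitHub", "spec": "someone-else/other-repo/main"}',
]

launches = count_launches(sample_lines, "GLAM-Workbench")
print(launches.most_common())
```

Counting by repository rather than by notebook matches what the log can actually tell you: a launch starts a session, but says nothing about which notebooks were opened afterwards.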
Earlier this year I gave a seminar for the International Internet Preservation Consortium (IIPC) introducing the web archives section of the GLAM Workbench. The seminar is now available online: youtu.be/rVidh_wex…
Here are the slides if you want to follow along. #dhhacks
The Trove Newspaper & Gazette Harvester has been updated to version 0.4.0. The major change is that if the OCRd text for an article isn’t available through the API, it will be automatically downloaded via the web interface. What does this mean in practice? Well, previously you couldn’t harvest OCRd text from the Australian Women’s Weekly because it’s not included in API results, but now you can!
You don’t need to do anything differently.
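The fallback described above can be sketched roughly as follows. This is not the harvester’s actual code: the function names are hypothetical, the `articleText` field name is an assumption about the shape of an API record, and the web download is replaced by a stand-in function so the example runs without a network connection.

```python
def get_article_text(article, fetch_from_web):
    """Return OCRd text for an article, preferring the text in the API record.

    If the record has no text, fall back to fetch_from_web, a callable that
    takes an article id and returns the text from the web interface.
    """
    text = article.get("articleText")  # assumed field name
    if text:
        return text
    return fetch_from_web(article["id"])


def fake_web_fetch(article_id):
    """Stand-in for a real download, used only for this illustration."""
    return f"text for article {article_id} from the web interface"


with_text = {"id": "1", "articleText": "text from the API"}
without_text = {"id": "2"}

print(get_article_text(with_text, fake_web_fetch))
print(get_article_text(without_text, fake_web_fetch))
```

Passing the download step in as a function keeps the fallback logic easy to test separately from any real requests.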
If you’ve done any searching in Trove’s digitised newspapers, you’ve probably noticed that there aren’t many results after 1954. This is basically because of copyright restrictions (though given the complexities of Australia’s copyright system, you can’t be sure that everything published before 1955 is out of copyright). We can visualise the impact of this by looking at the number of newspaper articles in Trove by year.
You can see why I started referring to it as the copyright cliff of death.
These are large format bound volumes of the official lists that were posted up for the public to see - 3 times a day - forenoon, noon and afternoon - at the close of the trading session in the call room at the Sydney Stock Exchange. The closing prices of stocks and shares were entered in by hand on pre-printed sheets.
The volumes have been digitised, resulting in a collection of 70,000+ high resolution images. You can browse the details of each volume using this notebook.
I’ve been exploring ways of getting useful, machine-readable data out of the images. There’s more information about the processes involved in this repository. I’ve also been working on improving the metadata and have managed to assign a date and session (Morning, Noon, or Afternoon) to each page. With these, we can start to explore the content!
One of the notebooks creates a calendar-like view of the whole collection, showing the number of pages surviving from each trading day. This makes it easy to find the gaps and changes in process. #dhhacks
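The counting behind a view like this can be sketched in a few lines: tally the pages for each date, then list the days in a range that have none. This isn’t the notebook’s code. The function names and dates are invented placeholders, and treating Monday to Friday as trading days is an assumption that may not match the exchange’s actual calendar.

```python
from collections import Counter
from datetime import date, timedelta


def pages_per_day(page_dates):
    """Count how many pages are recorded for each date."""
    return Counter(page_dates)


def missing_weekdays(counts, start, end):
    """List Monday-to-Friday dates from start to end (inclusive) with no pages."""
    missing = []
    current = start
    while current <= end:
        if current.weekday() < 5 and counts[current] == 0:
            missing.append(current)
        current += timedelta(days=1)
    return missing


# Placeholder dates, invented for illustration (one entry per page).
page_dates = (
    [date(1930, 1, 6)] * 3
    + [date(1930, 1, 7)] * 2
    + [date(1930, 1, 9)] * 3
)

counts = pages_per_day(page_dates)
gaps = missing_weekdays(counts, date(1930, 1, 6), date(1930, 1, 9))
print(counts[date(1930, 1, 6)], gaps)
```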
I’ve added more years to my repository of Commonwealth Hansard! The repository now includes XML-formatted text files for both houses from 1901 to 1980, and 1998 to 2005. I’ve done some more checking and confirmed that the XML files for 1981 to 1997 aren’t currently available through ParlInfo; however, the Parliamentary Library are looking into it. I’ve also created a CSV-formatted list of sitting days from 1901 to 2005 (based on ParlInfo search results). Details of the harvesting process are available in the GLAM Workbench. #dhhacks
Another #GLAMWorkbench update! Snip words out of @TroveAustralia newspaper pages and create big composite images. OCR art! glam-workbench.github.io/trove-new… #dhhacks
Ok, so do you want to make your own ‘scissors & paste’ messages using words from @TroveAustralia newspaper articles? Go to the notebook in #GLAMWorkbench & click on ‘Run live on Binder in Appmode’. #dhhacks