Using Pandora's collection of archived websites

Tuesday, May 7, 2024

There’s a brand new section of the GLAM Workbench to help you use data from Pandora’s collection of archived websites.

What’s Pandora?

Pandora is an initiative of the National Library of Australia which has been selecting web sites and online resources for preservation since 1996. It’s assembled a collection of more than 80,000 archived website titles, organised into subjects and collections. The archived websites are now part of the Australian Web Archive (AWA), which combines the selected titles with broader domain harvests, and is searchable through Trove.

Why is this needed?

The GLAM Workbench already has a Web Archives section that provides documentation, tools, and examples to help you work with data from a range of web archives, including the Australian Web Archive. But Pandora is unique to the NLA, and its curated collections offer a useful entry point for researchers trying to find web sites relating to particular topics or events.

Imagine you’re a researcher examining Australian election campaigns over the last decade. One thing you might want to do is analyse the language of campaign web sites. But how would you find them? You can search across the full text content of the Australian Web Archive in Trove, but you’d need some way of filtering the results to find sites of interest. Or you could just go to the Elections category in Pandora.

Screenshot of Pandora's listing of collections related to election campaigns

Full-text search is great for some tasks, but carefully curated collections with good-quality metadata can save us a lot of time and effort. Unfortunately, the Trove web interface prioritises search over Pandora’s collection metadata. If you head to Trove’s ‘Categories’ tab, you’ll find a link to Archived Webpage Collections. This collection hierarchy is basically the same as Pandora’s, combining Pandora’s subjects, subcategories, and collections into a single hierarchical structure. However, it only includes links to titles that are part of collections. This is important, as fewer than half of Pandora’s selected titles seem to be assigned to collections. Even stranger is the fact that I can’t find any link in Trove to the main Pandora site. This means that most researchers using the Australian Web Archive through Trove probably don’t even know that Pandora’s subject groupings exist!

For more on Pandora’s approach to describing collections see Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists.

What’s in the new section?

The new Trove web archive collections (Pandora) section of the GLAM Workbench includes three notebooks and three datasets.

The main aim is to help researchers assemble datasets of archived website URLs based on Pandora’s subject groupings. So if, as described above, you want a list of websites associated with election campaigns, you can go to the Create title datasets from collections and subjects notebook and generate a CSV file containing 9,304 URLs. Easy! The notebook also generates an RO-Crate metadata file capturing the context in which your dataset was created.
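To give a rough sense of what assembling a dataset like this involves, here’s a minimal Python sketch. It is not the notebook’s actual code: the `SUBJECTS` and `TITLES` structures, their field names, and the example URLs are all hypothetical stand-ins for the harvested data. The idea is simply to walk down from a subject through everything nested beneath it, keep the titles attached to any of those groups, and write them out as CSV.

```python
import csv
import io

# Hypothetical miniature of the harvested data. The real datasets
# may well use different field names and a different structure.
SUBJECTS = {
    "elections": {"name": "Elections", "children": ["federal-2022"]},
    "federal-2022": {"name": "Federal election 2022", "children": []},
}

TITLES = [
    {"title_id": "1", "name": "Party A campaign site",
     "url": "http://example.org/a", "groups": ["federal-2022"]},
    {"title_id": "2", "name": "Candidate B",
     "url": "http://example.org/b", "groups": ["elections"]},
    {"title_id": "3", "name": "Gardening blog",
     "url": "http://example.org/c", "groups": []},
]


def descendant_groups(group_id, subjects):
    """Return the group id plus the ids of everything nested beneath it."""
    found = {group_id}
    for child in subjects[group_id]["children"]:
        found |= descendant_groups(child, subjects)
    return found


def titles_in_group(group_id, subjects, titles):
    """Select titles attached to the group or to any of its descendants."""
    wanted = descendant_groups(group_id, subjects)
    return [t for t in titles if wanted & set(t["groups"])]


def to_csv(rows):
    """Serialise the selected titles as CSV text, dropping the 'groups' field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["title_id", "name", "url"], extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


election_titles = titles_in_group("elections", SUBJECTS, TITLES)
print(to_csv(election_titles))
```

Including the descendants matters because a title might be attached directly to a subject, or only to a collection nested somewhere below it.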

To make all that possible, I’ve harvested Pandora’s complete subject hierarchy and list of titles. The code to do this is in these notebooks:

And pre-harvested subject, collections, and titles datasets are here:

To give an overview of Pandora’s subject organisation, I’ve also created a single-page view of the complete hierarchy.

What do you do with a dataset of archived website URLs?

Once you have your own dataset of archived URLs you can make use of the tools in the Web Archives section to gather more data for analysis. For example:
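Before reaching for those tools, one simple first step might be to get a feel for what’s in the dataset. The sketch below counts how many URLs come from each domain. It assumes a CSV with a `url` column, which is a guess rather than the documented layout of the generated files, and uses made-up sample rows in place of a real download.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlsplit

# Made-up sample rows standing in for a generated dataset.
# The 'url' column name is an assumption.
SAMPLE_CSV = """name,url
Party A campaign site,http://www.example.org/campaign
Party A policies,http://www.example.org/policies
Candidate B,https://candidate-b.example.net/
"""


def count_domains(csv_text, url_column="url"):
    """Count how many rows in the CSV text fall under each domain."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(urlsplit(row[url_column]).netloc.lower() for row in reader)


domain_counts = count_domains(SAMPLE_CSV)
for domain, count in domain_counts.most_common():
    print(domain, count)
```

With a real file you’d read the text from disk instead of `SAMPLE_CSV`, and adjust `url_column` to match whatever the dataset actually calls it.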

I’m hoping to explore some more of these possibilities in the Trove Data Guide, part of the ARDC’s Community Data Lab.
