31 Jan 2025

A Community Data Lab (CDL) wishlist

The ARDC is holding an event on 18 February to begin shaping the next phase of the Community Data Lab. If you’re interested in the development of digital tools and resources to support HASS research, I’d suggest you go along.

I worked on the first phase of the Community Data Lab, developing the Trove Data Guide amongst other things. I’m very keen to see the CDL expand, working with researchers to create new possibilities for digital research, particularly using the rich collections of the GLAM sector (galleries, libraries, archives, and museums). As planning gets underway for the next phase of the CDL, I thought I’d pull together some rough ideas about what the CDL might be and might do. The ARDC needs co-investment in its projects, so new initiatives have to have some form of institutional support to become part of the CDL. Nonetheless, I think it’s useful to continue to think more broadly about HASS research infrastructure needs and possibilities – both long-term requirements and short-term interventions.

Extending co-design throughout the life of the CDL

Currently CDL co-design processes are focused on the initial design phase of a project and are structured around a plan developed from the institutional partners' interests and priorities. It can be difficult to relate specific research tasks and needs to these larger-scale initiatives. How can co-design processes be embedded in a project’s ongoing development?

One possibility is that CDL projects could have shortish development cycles with co-design processes before each cycle to refine the scope and priorities. This would increase opportunities for community participation and feedback, but the process would still be focused on the needs of the project rather than the needs of researchers.

I’d like to see an alternative (or additional) component that operates like a CDL ‘help desk’ or ‘triage’ service. For example:

a researcher outlines a specific problem or need
if useful tools/datasets/documentation already exist, CDL project staff point the researcher towards them
if there’s nothing to meet this specific need, some basic analysis is done to see where it might fit in the CDL universe and how much work might be required
for small-scale solutions within current capacities, simply create the new resource (eg. a piece of documentation, a single notebook, a slice of an existing dataset)
for larger-scale solutions, document the gap/need to feed into future CDL design processes
add all of the information – problem, requirements, solution – to a searchable knowledgebase
the knowledgebase should allow user comments/links and voting on unsolved problems to help identify future priorities

Some of the advantages to this approach would be:

direct engagement with CDL at any time when the researcher identifies a specific problem, not when allowed by the CDL’s co-design timetable and framework
recognises that many problems can be solved with existing tools and resources, both inside and outside of the CDL
encourages direct researcher engagement through the knowledgebase, bringing in broader community perspectives and suggestions, as well as information about related projects and tools
helps to put the ‘community’ in Community Data Lab

Connecting things up

There’s a lot of work to be done connecting up existing tools and data sources. This requires a combination of documentation and the development of small-scale tools or plugins to transform and move data as required. This sort of work fills in the gaps of existing research infrastructures helping to ensure that best use is made of existing investments, and encouraging re-use and collaboration instead of duplication or reinvention. It could be integrated with the ‘help desk’ function described above.

The pathways/tutorials in the Trove Data Guide are all examples of this, indeed, the development of additional Data Guides around particular collections or collection types could help fill a lot of these gaps.

Similarly, I think it’s important to put CDL projects within a broader context, so researchers are guided to solutions that best meet their needs.

For example, I’m a big fan of Omeka. I gave an Omeka workshop at THATCamp Melbourne way back in 2011, I created a Python client for interacting with the Omeka-S API, and I’ve used Omeka in a number of projects. But just as not every blog needs WordPress, not every online collection needs Omeka. There are alternative pathways that don’t require the full technology stack and might better suit the needs of individual researchers. It’s important to keep these alternatives in mind and provide documentation that will enable researchers to make decisions about what’s best for their project.

For example, if you want to share data (even relational databases) in a form that other users can search and explore, then Datasette might be a better option. Even within the Datasette ecosystem there are a variety of deployment possibilities. Datasette-Lite runs wholly within the browser, all you need is a GitHub repository (or similar) to point at. Most of the CSV files that I share through the GLAM Workbench have an option to explore using Datasette-Lite. All I do is create a url that points the CSV file at my existing Datasette-Lite repository. I’ve now created a simple tool to help others create these links. You can even add full text indexes and embed images. Because Datasette-Lite uses static technologies, the maintenance overheads are minimised.

If your dataset is too large or complex for Datasette-Lite, you can use a standard Datasette instance. It’s easy to scale. I have one Datasette instance that contains 11 million rows of data, in 279 datasets, from 10 different GLAM organisations running on Google Cloudrun. The Tasmanian Post Office Directories are also in Datasette and includes a IIIF viewer embedded in a customised theme.

Similarly, if a researcher’s main aim is to create a website that displays an annotated collection, then a static option, like CollectionBuilder, might be all they need.

Using static technologies was one of the patterns highlighted by the CDL architecture and principles documents, and I think it would be good to get into a ‘static first’ mindset, only implementing more complex solutions once the need had been established.

That said, I also think that investment in the development of additional, openly licensed plugins, themes, resource templates, and vocabularies for use with Omeka-S would also be very useful and welcome.

Zotero integration

An example of the small-scale interventions that can be made to mobilise existing data sources is the development of Zotero translators to extract metadata and images from GLAM collection databases. Using Zotero, researchers can collaborate in the creation and annotation of specialised datasets. Once they’ve saved a collection of resources to Zotero, they can use the public API to move the data into other tools for analysis or sharing. For example, at the moment I’m working on a collection of newspaper articles and extracts from Tasmanian Post Office Directories which have been saved and annotated in Zotero by members of the EveryDay Heritage team. Using the API I’m downloading the data, sorting the annotations, and populating an Omeka-S instance with details of sources, people, and places.

The problem is that support for Zotero by GLAM collection databases is patchy at best. I’ve created a spreadsheet to summarise the current situation. Few GLAM institutions embed anything beyond Facebook approved metadata, so translators (little bits of Javascript code) are often necessary to feed rich data to Zotero. There are currently 7 custom translators available for GLAM collections including the National Archives of Australia, PROV, and Trove.

Improving Zotero integration would open up new GLAM collections to digital research. Coordinated effort in the development of new translators, and best-practice documentation for supporting Zotero, would also help GLAM organisations understand what they can do to expand use of their collections.

The CDL could play a coordinating role in this, rather than do all the work. For example, it could share updates, compile documentation, and perhaps organise a GLAM hack style event to create/update as many translators as possible. It would be a low cost, but high impact initiative.

Collections as data

The Australian GLAM landscape is littered with decommissioned APIs, dead open data portals, datasets created for some hack event that never get updated, and disappeared labs. But at the same time, there are hundreds of open datasets shared by GLAM organisations through government data portals that are little acknowledged. In organisations with diminishing resources, shifting priorities, and internal jealousies, it’s difficult to maintain a persuasive argument around the importance of ‘collections as data’ for research. Perhaps the CDL can help.

Ultimately, it would be great for the CDL itself to incorporate some sort of national, collaborative GLAM Lab, but I think we should start by helping to develop the ammunition that institutions need to argue for their involvement in such an initiative. For example:

Documenting needs – What research might happen if particular datasets or tools were available? This is the sort of thing that could be captured through the ‘help desk’. These needs would be shared and discussed, and fed into co-design sessions with GLAM organisations.
Understanding use – What research is being done using GLAM collections in Australia and elsewhere? What types of data? What types of tools? Obviously there will be examples from CDL projects & elsewhere.
Developing skills – Providing opportunities for GLAM staff to develop their own digital research skills. Where CDL projects make use of GLAM collections, invite organisations to nominate staff for active participation in the development processes (like internships).
Creating examples – identify possible GLAM data related developments within CDL projects and pursue small-scale collaborations. For example in relation to government data.
Identifying opportunities – Which GLAM organisations are in a position to participate now? In particular, identify data assets (like PROV API or datasets in government portals) that are already available but not well documented or understood.
Sharing solutions – Many of the ‘connecting things up’ solutions are likely to involve moving or transforming GLAM data. Zotero integrations would benefit researchers and organisations.

Much of this would not be ‘extra’ work, but involve the coordination and packaging of existing CDL activities – a GLAM window onto CDL activities. Though I suppose there would need to be a coordinating role.

Obviously the GLAM Workbench would also have a role in this as a repository for tools and examples. It’d be great if the CDL could work with GLAM organisations to develop their own sections/repositories.

Tim Sherratt

A Community Data Lab (CDL) wishlist

Extending co-design throughout the life of the CDL

Connecting things up

Zotero integration

Collections as data