I’ve always been interested in the way people add value to resources in Trove. OCR correction tends to get all the attention, but Trove users have also been busy organising resources using tags, lists, and comments. I used to refer to tags quite often in presentations, pointing to the different ways they were used. For example, ‘TBD’ is a workflow marker, used by text correctors to label articles that are ‘To Be Done’. My favourite was ‘LRRSA’, one of the most heavily-used tags across the whole of Trove. What does it mean? It stands for the Light Rail Research Society of Australia, and the tag is used by members to mark items of shared interest. It’s a great example of how something as simple as plain text tags can be used to support collaboration and build communities.
Until its update last year, Trove used to provide some basic stats about user activity. There was also a tag cloud that let you explore the most commonly-used tags. It’s now much harder to access this sort of information. However, you can extract some basic information about tags from the Trove API. First of all, you can filter a search using ‘has:tags’ to limit the results to items that have tags attached to them. Then to find out what the tags actually are, you can add the include=tags parameter. This embeds the tags within the item record, so you can work through a set of results, extracting all the tags as you go. To save you the trouble, I’ve done this for the whole of Trove, and ended up with a dataset containing more than 8 million tags!
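If you want to try the ‘has:tags’ and include=tags approach yourself, here's a minimal sketch of the two pieces involved: building the search URL and pulling tags out of a record. It is not the harvesting code used for the dataset. The endpoint URL, the other parameter names (key, zone, encoding, n), and the shape of a record (a ‘tag’ list whose entries have ‘value’ and ‘lastupdated’) are assumptions based on version 2 of the Trove API, so check them against the current API documentation. The helper names are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint for version 2 of the Trove API; check the current docs.
API_URL = "https://api.trove.nla.gov.au/v2/result"


def build_search_url(api_key, zone="newspaper"):
    """Build a search URL limited to items that have tags attached."""
    params = {
        "key": api_key,       # your own Trove API key
        "zone": zone,         # assumed zone name
        "q": "has:tags",      # only return items with tags
        "include": "tags",    # embed the tags in each item record
        "encoding": "json",   # assumed parameter for JSON responses
        "n": 100,             # assumed parameter for results per request
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_tags(record):
    """Return (lower-cased tag, date) pairs from one item record.

    Assumes each record carries its tags in a 'tag' list, where every
    entry has 'value' and 'lastupdated' keys. Records without tags
    simply yield an empty list.
    """
    pairs = []
    for tag in record.get("tag", []):
        pairs.append((tag.get("value", "").lower(), tag.get("lastupdated")))
    return pairs
```

To harvest everything you would request each page of results in turn and run something like extract_tags over every record, saving the pairs as you go.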
The dataset is saved as a 500MB CSV file, and contains the following fields:
tag – lower-cased version of the tag
date – date the tag was added
zone – the API zone that contains the tagged resource
resource_id – the identifier of the tagged resource
There are a few things to note about the data:
You can download the complete dataset from Zenodo, or from CloudStor. For more information on how I harvested the data, and some of its limits and complexities, see the notebooks in the new ‘Tags’ section in the GLAM Workbench. There are also some examples of analysing and visualising the tags. As an extra bonus, there’s a more compact 50MB CSV dataset which lists each unique tag and the number of times it has been used.
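Once you have the full CSV, counting how often each tag is used takes only a few lines of Python. The sketch below assumes the file has a header row using the field names listed above (tag, date, zone, resource_id). The sample rows are invented purely to show the idea, and the function name is hypothetical. It is one simple way to get counts like those in the compact dataset, not necessarily how that file was produced.

```python
import csv
import io
from collections import Counter


def count_tags(csv_file):
    """Count how many times each tag appears in an open CSV file.

    Assumes a header row that includes a 'tag' column. Rows are read
    one at a time, so the whole file is never held in memory at once.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row["tag"] for row in reader)


# Invented sample rows standing in for the real dataset.
sample = io.StringIO(
    "tag,date,zone,resource_id\n"
    "lrrsa,2010-01-01,newspaper,1\n"
    "tbd,2011-02-03,newspaper,2\n"
    "lrrsa,2012-04-05,book,3\n"
)

counts = count_tags(sample)
print(counts.most_common(2))  # [('lrrsa', 2), ('tbd', 1)]
```

To run it over the downloaded file, open the CSV with newline="" and encoding="utf-8" and pass the file object to count_tags in place of the sample.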
Of course, it’s worth remembering that this sort of dataset is out of date before the harvest is even finished. More tags are being added all the time! But hopefully this data will help us better understand the way people work to organise and enrich complex resources like Trove. #dhhacks