More GLAM Name Index updates from Queensland State Archives and SLWA

A new version of the GLAM Name Index Search is available. An additional 49 indexes have been added, bringing the total to 246. You can now search for names in more than 10.2 million records from 9 organisations.

The new indexes come from Queensland State Archives and the State Library of WA. QSA announced on Friday that they’d added two new indexes to their site. When I went to harvest them, I realised there were another 25 indexes that I hadn’t previously picked up. It seems that some QSA datasets are tagged as ‘Queensland State Archives’ in the data.qld.gov.au portal, but others are tagged as ‘queensland state archives’ – and the tag search is case-sensitive! I now search for both the upper and lower case tags.

There’s also a number of additions from the State Library of WA. These datasets were already in my harvest, but because of some oddities in their formatting, I hadn’t included them in the Index Search. Looking at them again, I realised they were right to go, so I’ve added them in.

Here’s the list of additions:

Queensland State Archives

  • Australian South Sea Islanders 1867 to 1908 - A-K
  • Australian South Sea Islanders 1867 to 1908 L-Z
  • Beaudesert Shire Burials - Logan Village 1878-2000 - Beaudesert Shire and Logan Village Burials 1878-2000
  • Immigrants, Bowen Immigration Depot 1885-1892
  • Brisbane Gaol Hospital Admission registers 1889-1911 - Index to Brisbane Gaol Hospital Admission Registers 1889-1911
  • Index to Correspondence of Queensland Colonial Secretary 1859-1861 - Index to Colonial Secretary s Correspondence Bundles 1859 - 1861.csv
  • Dunwich Benevolent Asylum records - Index to Dunwich Benevolent Asylum 1859-1948
  • Dunwich Benevolent Asylum records - Index to Dunwich Benevolent Asylum 1885-1907
  • Immigrants and Crew 1860-1865 (COL/A) - Index to Immigrants and Crew 1860 - 1964
  • Index to Immigration 1909-1932
  • Outdoor Relief 1900-1904 - Index to Outdoor Relief 1892-1920
  • Pensions 1908-1919 - Index to Pensions 1908-1909
  • Cases & treatment Moreton Bay Hospital 1830-1862 - Index to Register of Cases and treatment at Moreton Bay Hospital 1830-1862
  • Index to Registers of Agricultural Lessees 1885-1908
  • Index to Registers of Immigrants, Rockhampton 1882-1915
  • Pneumonic influenza patients, Wallangarra Quarantine Compound - Index to Wallangarra Flu Camp 1918-1919
  • Land selections 1885-1981
  • Lazaret patient registers - Lazaret Patient Registers
  • Leases, Selections and Pastoral Runs and other related records 1850-2014
  • Perpetual Lease Selections of soldier settlements 1917 - 1929 - Perpetual Lease Selections of soldier settlements 1917-1929
  • Photographic records of prisoners 1875-1913 - Photographic Records of Prisoners 1875-1913
  • Redeemed land orders 1860-1906 - Redeemed land orders 1860-1907
  • Register of the Engagement of Immigrants at the Immigration Depot - Bowen 1873-1912
  • Registers of Applications by Selectors 1868-1885
  • Registers of Immigrants Promissory Notes (Maryborough)
  • Education Office Gazette Scholarships 1900 - 1940 - Scholarships in the Education Office Gazette 1900 - 1940
  • Teachers in the Education Office Gazettes 1899-1925

State Library of Western Australia

  • WABI Subset: Eastern Goldfields - Eastern Goldfields
  • Western Australian Biographical Index (WABI) - Index entries beginning with A
  • Western Australian Biographical Index (WABI) - Index entries beginning with B
  • Western Australian Biographical Index (WABI) - Index entries beginning with C
  • Western Australian Biographical Index (WABI) - Index entries beginning with D and E
  • Western Australian Biographical Index (WABI) - Index entries beginning with F
  • Western Australian Biographical Index (WABI) - Index entries beginning with G
  • Western Australian Biographical Index (WABI) - Index entries beginning with H
  • Western Australian Biographical Index (WABI) - Index entries beginning with I and J
  • Western Australian Biographical Index (WABI) - Index entries beginning with K
  • Western Australian Biographical Index (WABI) - Index entries beginning with L
  • Western Australian Biographical Index (WABI) - Index entries beginning with M
  • Western Australian Biographical Index (WABI) - Index entries beginning with N
  • Western Australian Biographical Index (WABI) - Index entries beginning with O
  • Western Australian Biographical Index (WABI) - Index entries beginning with P and Q
  • Western Australian Biographical Index (WABI) - Index entries beginning with R
  • Western Australian Biographical Index (WABI) - Index entries beginning with S
  • Western Australian Biographical Index (WABI) - Index entries beginning with T
  • Western Australian Biographical Index (WABI) - Index entries beginning with U-Z
  • Digital Photographic Collection - Pictorial collection_csv
  • WABI subset: Police - WABI police subset
  • WABI subset: York - York and districts subset

Bonus update

After a bit more work last night I added in a dataset from the State Library of Victoria:

  • Melbourne and metropolitan hotels, pubs and publicans

That’s an extra 21,000 records, and takes the total number of indexes to 247 from 10 different GLAM organisations!

#dhhacks

Getting data about newspaper issues in Trove

When you search Trove’s newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? How do you get a list of dates when newspapers were published? This notebook in the GLAM Workbench shows how you can get information about issues from the Trove API.
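If you just want to see the basic approach without opening the notebook, here’s a minimal sketch using version 2 of the Trove API – you’ll need your own API key, and the field names below are my best reading of the newspaper title endpoint, so check them against the API documentation:

import requests

API_KEY = "YOUR_TROVE_API_KEY"  # get your own key from Trove
API_URL = "https://api.trove.nla.gov.au/v2/newspaper/title/{}"

def issues_per_year(title_id):
    # Ask the newspaper title endpoint to include year-level issue counts
    params = {"encoding": "json", "include": "years", "key": API_KEY}
    response = requests.get(API_URL.format(title_id), params=params)
    response.raise_for_status()
    title = response.json()["newspaper"]
    return [(y["date"], int(y["issuecount"])) for y in title.get("year", [])]

# For example, title 35 is the Sydney Morning Herald
for year, count in issues_per_year("35"):
    print(year, count)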

Using the notebook, I’ve created a couple of datasets ready for download and use.

Total number of issues per year for every newspaper in Trove

Harvested 10 October 2021

CSV formatted dataset containing the number of newspaper issues available on Trove, totalled by title and year – comprises 27,604 rows with the fields:

  • title – newspaper title
  • title_id – newspaper id
  • state – place of publication
  • year – year published
  • issues – number of issues

Download from Cloudstor: newspaper_issues_totals_by_year_20211010.csv (2.1mb)
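Once you’ve downloaded the CSV, the fields listed above make it easy to slice and total – something like this quick pandas sketch will do the job:

import pandas as pd

df = pd.read_csv("newspaper_issues_totals_by_year_20211010.csv")

# Total number of issues per year across all newspapers
issues_by_year = df.groupby("year")["issues"].sum()
print(issues_by_year.head())

# The ten newspapers with the most issues overall
print(df.groupby("title")["issues"].sum().nlargest(10))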

Complete list of issues for every newspaper in Trove

Harvested 10 October 2021

CSV formatted dataset containing a complete list of newspaper issues available on Trove – comprises 2,654,020 rows with the fields:

  • title – newspaper title
  • title_id – newspaper id
  • state – place of publication
  • issue_id – issue identifier
  • issue_date – date of publication (YYYY-MM-DD)

To keep the file size down, I haven’t included an issue_url in this dataset, but these are easily generated from the issue_id. Just add the issue_id to the end of http://nla.gov.au/nla.news-issue. For example: http://nla.gov.au/nla.news-issue495426. Note that when you follow an issue url, you actually get redirected to the url of the first page in the issue.
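If you need the page urls, a couple of lines of Python will build and resolve them:

import requests

issue_id = "495426"
issue_url = f"http://nla.gov.au/nla.news-issue{issue_id}"

# Requesting the issue url redirects you to the first page of that issue
response = requests.get(issue_url)
print(response.url)  # the url of the issue's first page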

Download from Cloudstor: newspaper_issues_20211010.csv (222mb)

For more information see the Trove newspapers section of the GLAM Workbench.

GLAM Workbench at eResearch Australasia 2021

Way back in 2013, I went to the eResearch Australasia conference as the manager of Trove to talk about new research possibilities using the Trove API. Eight years later I was back, still spruiking the possibilities of Trove data. This time, however, I was discussing Trove in the broader context of GLAM data – all the exciting possibilities that have emerged as galleries, libraries, archives and museums make more of their collections available in machine-readable form. The big question is, of course, how do researchers, particularly those in the humanities, make use of that data? The GLAM Workbench is my attempt to address that question – to provide humanities researchers with both the tools and information they need, and an understanding of the possibilities that might emerge if they invest a bit of time in working with GLAM data. My eResearch Australasia 2021 presentation provides a quick introduction to the GLAM Workbench – here’s the video, and the slides.

A GLAM Workbench for humanities researchers from Tim Sherratt on Vimeo.

The presentation was pre-recorded, but I managed to sneak in an update via chat for those who attended the session. More news on this next week… 🥳

New Python package to download Trove newspaper images

There’s no reliable way of downloading an image of a Trove newspaper article from the web interface. The image download option produces an HTML page with embedded images, and the article is often sliced into pieces to fit the page.

This Python package includes tools to download articles as complete JPEG images. If an article is printed across multiple newspaper pages, multiple images will be downloaded – one for each page. It’s intended for integration into other tools and processing workflows, or for people who like working on the command line.

You can use it as a library:

from trove_newspaper_images.articles import download_images

images = download_images('107024751')

Or from the command line:

trove_newspaper_images.download 107024751 --output_dir images

If you just want to quickly download an article as an image without installing anything, you can use this web app in the GLAM Workbench. To download images of all articles returned by a search in Trove, you can also use the Trove Newspaper and Gazette Harvester.

See the documentation for more information. #dhhacks

More records for the GLAM Name Index Search

Two more datasets have been added to the GLAM Name Index Search! From the History Trust of South Australia and Collab, I’ve added:

In total there’s 9.67 million name records to search across 197 datasets provided by 9 GLAM organisations!

More QueryPic in action

Recently I created a list of publications that made use of QueryPic, my tool to visualise searches in Trove’s digitised newspapers. Here’s another example of the GLAM Workbench and QueryPic in action, in Professor Julian Meyrick’s recent keynote lecture, ‘Looking Forward to the 1950s: A Hauntological Method for Investigating Australian Theatre History’.

Some thoughts on the ‘Trove Researcher Platform for Advanced Research’ draft plan

Late last year the Federal Government announced it was making an $8.9 million investment in HASS and Indigenous research infrastructure. This program is being managed by the ARDC and will lead to the development of a HASS Research Data Commons. According to the ARDC, a research data commons:

brings together people, skills, data, and related resources such as storage, compute, software, and models to enable researchers to conduct world class data-intensive research

Sounds awesome!

Based on scoping studies commissioned by the Department of Education, Skills, and Employment (which have not yet been made public), four activities were selected for initial funding under this program. Draft project plans for these four activities have now been released for public comment.

One of these activities aims to develop a ‘Trove researcher platform for advanced research’:

Augmenting existing National Library of Australia resources, this platform will enable a focus on the delivery of researcher portals accessible through Trove, Australia’s unique public heritage site. The platform will create tools for visualisation, entity recognition, transcription and geocoding across Trove content and other corpora.

You can download the draft project plan for the Trove platform. Funding for this activity will be capped at $2,301,185 across 2021-23. In this post I’ll try to pull together some of my own thoughts on this plan.

I suppose I’d better start with a disclaimer – I’m not a neutral observer in this. I started scraping data from Trove newspapers way back in 2010, building the first versions of tools like QueryPic and the Trove Newspaper Harvester. While I was manager of Trove, from 2013 to 2016, I argued for recognition of Trove as a key part of Australia’s humanities research infrastructure, and highlighted possible research uses of Trove data available through the API. Since then I’ve worked to bring a range of digital tools, examples, tutorials, and hacks together for researchers in the GLAM Workbench – a large number of these work with data from Trove.

I strongly believe that Trove should receive ongoing funding through NCRIS as a piece of national research infrastructure. Unfortunately though, the draft project plan does not make a strong case for investment – it’s vague, unimaginative, and makes little attempt to integrate with existing tools and services. I think it scores poorly against the ARDC’s evaluation criteria, and doesn’t seem to offer good value for money. As someone who has championed the use of Trove data for research across the last decade, I’m very disappointed.

What’s planned?

So what is being proposed? There seem to be three main components:

  1. Authenticated ‘project’ spaces for researchers where datasets relating to a particular research topic can be stored
  2. The ability to create custom datasets from a search in Trove
  3. Tools to visualise stored datasets.

There’s no doubt that these are all useful functions for researchers, but many problems arise when we look at how they’re going to be implemented.

1. Authenticated project spaces

The draft plan indicates that authentication of users through the Australian Access Federation is preferred. Why? Trove already has a system for the creation of user accounts. Using AAF would limit use of the new platform to those attached to universities or research agencies. I don’t understand what the use of AAF adds to the project, except perhaps to provide an example of integration with existing infrastructure services.

The plan notes that project spaces could be ‘public’ or ‘private’. Presumably a ‘public’ space would give access to stored datasets, but what sort of access controls would be available in relation to individual datasets? It’s also noted (Deliverable 7) that researchers would have ‘an option to “publish” their research findings for public consumption’. Does this mean datasets and visualisations would be assigned a DOI (or other persistent identifier) and preserved indefinitely? How might these spaces integrate with existing data repositories?

2. Create custom datasets

The lack of detail in the plan makes it difficult to assess what’s being proposed here. But it seems that users would be able to construct a search using the Trove web interface (or a new search interface?) and save the results as a dataset.

What data would be searched? It’s not clear, but in reference to the visualisations it’s stated that data would come from ‘Trove’s existing full text collections (newspapers and gazettes, magazines and newsletters, books)’. So no web archives, and no metadata from any of Trove’s aggregated collections (even without full text, collection metadata can create interesting research possibilities, see for example the Radio National records in the GLAM Workbench).

What will be included in each dataset? There’s few details, but at a minimum you’d expect something like a CSV containing the metadata of all the matching records, and files containing the full text content of the items. These could potentially be very large. There’s no indication about how storage and processing demands would be managed, but presumably there would be some per user, or per project, limits.

Deliverable 8, ‘Data and visual download’, states that:

All query results must be available as downloadable files, this would include CSV, JSON and XML for the query results list.

But there’s no mention of the full text content at all. Will it be included in downloadable datasets?

As well as the record metadata and full text, you’d want there to be some metadata captured about the dataset itself – the search query used, when it was captured, the number of records, etc. To support integration and reuse, it would be good to align this with something like RO Crate.
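To give a rough idea of what I mean – this is purely illustrative, nothing like it appears in the plan – the dataset-level metadata might look something like:

import json
from datetime import datetime, timezone

# Purely hypothetical example of dataset-level provenance metadata
dataset_metadata = {
    "query": '"white australia policy"',
    "date_harvested": datetime.now(timezone.utc).isoformat(),
    "number_of_records": 0,  # filled in once the harvest is complete
    "source": "Trove newspapers & gazettes",
}

with open("dataset-metadata.json", "w") as f:
    json.dump(dataset_metadata, f, indent=2)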

How will searches be constructed? It’s not clear if this will be integrated with the existing search interface, or be something completely separate; however, the plan does note that ‘limitations are put onto the dataset like keyword search terms and filters corresponding to the filters currently available in the interface’. So it seems that the new platform will be using the existing search indexes. It’s obviously important for the relationship between existing search functions and the new dataset creation tool to be explicit and transparent so that researchers understand what they’re getting.

It’s also worth noting that changes to the search interface last year removed some useful options from the advanced search form. In particular, you can no longer exclude matches in tags or comments. If you’re a researcher looking for the occurrence of a particular word, you generally don’t want to include records where that word only appears in a user added tag (I have a story about ‘Word War I’ that illustrates this!).

This raises a broader issue. There doesn’t seem to be any mention in the project plan of work to improve the metadata and indexing in response to research needs. Even just identifying digitised books in the current web interface can be a bit of a challenge, and digitised books and periodicals can be grouped into work records with other versions. We need to recognise that the needs of discovery sometimes compromise specific research uses.

I’m trying to be constructive in my responses here, but at this point I just have to scream – WHAT ABOUT THE TROVE NEWSPAPER HARVESTER? A tool has existed for ten years that lets users create a dataset containing metadata and full text from a search in Trove’s newspapers and gazettes. I’ve spent a lot of time over recent years adding features and making it easier to use. Now you can download not only full text, but also PDFs and images of articles. The latest web app version in the GLAM Workbench runs in the cloud. Just one click to start it up, then all you need to do is paste in your Trove API key and the url of your search. It can’t get much easier.

The GLAM Workbench also includes tools to create datasets and download OCRd text from Trove’s books and digitised journals. These are still in notebook form, so are not as easy to use, but I have created pre-harvested datasets of all books and periodicals with OCRd text, and stored them on CloudStor. What’s missing at the moment is something to harvest a collection of journal articles, but this would not be difficult. As an added bonus, the GLAM Workbench has tools to create full text datasets from the Australian Web Archive.

So what is this project really adding? And why is there no attempt to leverage existing tools and resources?

3. Visualise datasets

Again, there’s a fair bit of hand waving in the plan, but it seems that users will be able to select a stored dataset and then choose a form of visualisation. The plan says that:

An initial pilot would allow users to create line graphs that plot the frequency of a search term over time and maps that display results based on state-level geolocation.

Up to three additional visualisations would be created later based on research feedback. It’s not clear which researchers will be consulted and when their feedback will be sought.

The value of these sorts of visualisations is obviously dependent on the quality and consistency of the metadata. There’s nothing built into this plan that would, for example, allow a researcher to clean or normalise any of the saved data. You have to take what you’re given. The newspaper metadata is generally consistent, but books and periodicals less so.

It’s also important to clarify what’s meant by ‘the frequency of a search term over time’. Does this mean the number of records matching a search term, or the number of times that the search term actually appears in the full text of all matched records? If the latter, then this would be a major enrichment of the available data. Though if this data was available it should be pushed through the API and/or made available as a downloadable dataset for integration with other platforms (perhaps along the lines of the Hathi Trust’s Extracted Features Dataset). I suspect, however, that what is actually meant is the number of matching search results.
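That sort of count is already easy to get from the existing API using facets. Here’s a sketch based on version 2 of the API – the response parsing assumes a single facet has been requested, so adjust it if the structure differs:

import requests

API_KEY = "YOUR_TROVE_API_KEY"
API_URL = "https://api.trove.nla.gov.au/v2/result"

params = {
    "q": "drought",        # the search term
    "zone": "newspaper",
    "facet": "year",       # ask for results grouped by year
    "l-decade": "193",     # the year facet needs a decade filter (here the 1930s)
    "n": 0,                # no articles, just the facet counts
    "encoding": "json",
    "key": API_KEY,
}
data = requests.get(API_URL, params=params).json()

# With a single facet requested, 'facet' is an object with a list of terms
facet = data["response"]["zone"][0]["facets"]["facet"]
for term in facet["term"]:
    print(term["search"], term["count"])  # year, number of matching articles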

Again, the value of any geospatial visualisation depends on what is actually being visualised! The state facet in newspapers indicates place of publication; it’s not clear what the place facet in other categories represents. For this sort of visualisation to be useful in a research context, there would need to be some explanation of how these values were created, and any gaps or uncertainties.

Time for another scream of frustration — WHAT ABOUT QUERYPIC? Another long-standing tool which has already been cited a number of times in research literature. QueryPic visualises searches in Trove’s newspapers and gazettes over time. You can adjust time scales and intervals, and download the results as images, a CSV file, and an HTML page. The project plan makes a point of claiming that its tools would not require any coding, but neither does QueryPic. Just plug in an API key and a search URL. I even made some videos about it! The GLAM Workbench also includes a number of examples of how you can visualise places of publication of newspaper articles.

But it’s not just the GLAM Workbench. The Linguistics Data Commons of Australia, another activity to be funded as part of the HASS Research Data Commons, will include tools for text analysis and visualisation. The Time Layered Cultural Map is developing tools for geospatial visualisation of Australian collections. Surely the focus should be on connecting and reusing what’s available. Again I’m wondering what this project is really adding.

Portals and platforms

The original language describing the funded activity is interesting — it is intended to ‘focus on the delivery of researcher portals accessible through Trove’.

Portals (plural) accessible through (not in) Trove.

The NLA could meet a fair proportion of its stated objectives right now, simply by including links to QueryPic and the Trove Newspaper and Gazette Harvester. Done! There’s a million dollars saved.

More seriously, there’s no reason why the outcome of this activity should be a new interface attached to Trove and managed by the NLA. Indeed, such an approach works against integration, reuse, and data sharing. I believe the basic assumptions of the draft plan are seriously flawed. We need to separate out the strands of what’s meant by a ‘platform for advanced research’, and think more creatively and collaboratively about how we could achieve something useful, flexible, and sustainable.

Where’s the API?

I think the primary role of the NLA in the development of this research platform should be as the data provider. There are numerous ways in which Trove’s data might be improved and enriched in support of new research uses. These improvements could then be pushed through the API to integrate with a range of tools and resources. Which raises the question — where is the API in this plan?

The only mention of the API comes as an option for a user with ‘high technical expertise’ to extend the analysis provided by the built-in visualisations. This is all backwards. The API is the key pipeline for data-sharing and integration and should be at the heart of this plan.

This program offers an opportunity to make some much-needed improvements to the API. Here’s a few possibilities:

  • Bring the web interface and API back into sync so that researchers can easily transfer queries between the two (Trove’s interface update introduced new categories, while the API still groups resources by the original zones).
  • Provide public API access to additional data about digitised items. For example, you can get lists of newspaper titles and issues from the API, but there’s no comparable method to get titles and issues for digitised periodicals. The data’s there – it’s used to generate lists of issues in the browse interface – but it’s not in the API. There’s also other resource metadata, such as parent/child relationships, which are embedded in web pages but not exposed in the API.
  • Standardise the delivery of OCRd text for different resource types.
  • Finally add the People & Organisations data to the main RESTful API.
  • Fix the limitations of the web archives CDX API (documented here).
  • Add a search API for the web archives.
  • And what about a Write API? Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources.

I think the HASS RDC would benefit greatly by thinking much more about the role of the Trove API in establishing reusable data flows, and connecting up components.

Pathways

Anyone who’s been to one of my GLAM Workbench talks will know that I talk a lot about ‘pathways’. My concern is not just to provide useful tools and examples, but to try and connect them in ways that encourage researchers to develop their skills and confidence. So a researcher with limited digital skills can spin up QueryPic and start making visualisations without any specialised knowledge. But if they want to explore the data and assumptions behind QueryPic, they can view a notebook that walks them through the process of getting data from facets and assembling a time series. If they find something interesting in QueryPic, they can go to the Newspaper Harvester and assemble a dataset that helps them zoom into a particular period. There are places to go.

Similarly, users can start making use of the GLAM Workbench in the cloud using Binder – one click and it’s running. But as their research develops they might find Binder a bit limiting, so there are options to spin up the GLAM Workbench using Reclaim Cloud or Docker. As a researcher’s skills, needs, and questions change, so does their use of the GLAM Workbench. At least that’s the plan – I’m very aware that there’s much, much more to do to build and document these pathways.

The developments described in the draft plan are focused on providing simple tools for non-technical users. That’s fair enough, but you have to give those users somewhere to go, some path beyond, or else it just becomes another dead end. Users can download their data or visualisation, but then what?

Of course you don’t point a non-coder to API documentation and say ‘there you go’. But coders can use the API to build and share a range of tools that introduce people to the possibilities of data, and scaffold their learning. Why should there be just one interface? It’s not too difficult to imagine a range of introductory visualisation tools aimed at different humanities disciplines. Instead of focusing inward on a single Trove Viz Lite tool, why not look outwards at ways of embedding Trove data within a range of research training contexts?

Integration

A number of the HASS RDC Evaluation Criteria focus on issues of integration, collaboration, and reuse of existing resources. For example:

  • Project plans should display robust proposal planning including the maximisation of the use or re-use of existing research infrastructure, platforms, tools, services, data storage and compute.
  • Project plans should display integrated infrastructure layers with other HASS RDC activities, in particular by linking together elements such as data storage, tools, authentication, licensing, networks, cloud and high-performance computing, and access to data resources for reuse.
  • Project plans must be robust and contribute to the HASS RDC as a coherent whole that capitalises on existing data collections, adheres to the F.A.I.R. principles, develops collaborative tools, utilises shared underlying infrastructure and has appropriate governance planning.

There’s little evidence of this sort of thinking in the draft project plan. I’ve mentioned a few obvious opportunities for integration above, but there are many more. Overall, I think the proposed ‘platform for advanced research’ needs to be designed as a series of interconnected components, and not be seen as the product of a single institution.

We could imagine, for example, a system where the NLA focused on the delivery of research-ready data via the Trove API. A layer of data filtering, cleaning, and packaging tools could be built on top of the API to help users assemble actionable datasets. The packaging processes could use standards such as RO-Crate to prepare datasets for ingest into data repositories. Existing storage services, such as CloudStor, could be used for saving and sharing working datasets. Another layer of visualisation and analysis tools could either process these datasets, or integrate directly with the API. These tools could be spread across different projects including LDaCA, TLCMap, and the GLAM Workbench — using standards such as Jupyter to encourage sharing and reuse of individual components, and running on a variety of cloud-hosted platforms. Instead of just adding another component to Trove, we’d be building a collaborative network of tool builders and data wranglers — developing capacities across the research sector, and spreading the burden of maintenance.

Sustainability

The draft project plan includes some pretty worrying comments about long-term support for the new platform. Work Package 5 notes:

The developed product will require support post release which can be guaranteed for a period not exceeding the contracted period for this project

And:

ARDC will be responsible for providing ongoing financial support for this phase. It has not been included in the proposal.

So once the project is over, the NLA will not support the new platform unless the ARDC provides ongoing funding. What researcher would want to ‘publish’ their data on a platform that could disappear at any time? We all know that sustainability is hard, but you would think that the NLA could at least offer to work collaboratively with the research sector to develop a plan for sustainability, instead of just asking for more money. Why would anyone invest so much for so little?

Leadership and community

The development of collaborations and communities also figure prominently in the HASS RDC Evaluation Criteria. For example:

  • Project plans should clearly demonstrate that they enable collaboration and build communities across geographically dispersed research groups through facilitated sharing of high-quality data, particularly for computational analysis; the development of new platforms for collaboration and sharing; and, the encouragement of innovative methodologies through the use of analytic tools.
  • Project plans must include a demonstrated commitment to ongoing community development to ensure the sustainability of the development is vital. The deliverables will act as ongoing national research infrastructure. They must be broadly usable by more than just the project partners and serve as input to a wide range of research.
  • Project plans, and project leads in particular, should demonstrate the research leadership that will foster and encourage the uptake and use of the HASS RDC.

Once again the draft project plan falls short. There are no project partners listed. Instead the plan refers broadly to all of Trove’s content partners, none of whom have direct involvement in this project. Indeed, as noted above, data aggregated from project partners is excluded from the new platform.

There are no new governance arrangements proposed for this project. Instead the plan refers to the Trove Strategic Advisory Committee which includes representatives from partner organisations. But there are no researcher representatives on this committee.

The only consultation with the research sector undertaken in the ‘Consultation Phase’ of the project is that undertaken by the ARDC itself. Does that mean this current process whereby the ARDC is soliciting feedback on the project plans? Whoa, meta…

The plan notes that during the testing phase described in Work Package 3, ‘HASS community members would gain access to a beta version of the product for comment’. However, later it is stated that access would be provided to ‘a subset of researchers’, and that only system bugs and ‘high priority improvements’ would be acted upon.

Generally speaking, it seems that the NLA is seeking as little consultation as possible. It’s not exploring options for collaboration. It’s not engaging with the research community about these developments. That doesn’t seem like an effective way to build communities. Nor does it demonstrate leadership.

Summing up

This project plan can’t be accepted in its current form. We’ve had failures and disappointments in the development of HASS research infrastructure in the past. The HASS RDC program gives us a chance to start afresh, and the focus on integration, data-sharing, and reuse gives hope that we can build something that will continue to grow and develop, and not wither through lack of engagement and support. So should the NLA be getting $2 million to add a new component to Trove that is not integrated with other HASS RDC projects, and substantially duplicates tools available elsewhere? No, I don’t think so. They need to go back to the drawing board, undertake some real consultation, and build collaborations, not products.

Some research projects that have used QueryPic

A Twitter thread about some of the research uses of QueryPic…

Government publications in Trove

Over the last few weeks I’ve been updating my harvests of OCRd text from digitised books and periodicals in Trove. As part of the harvesting process, I’ve created lists of the books and periodicals that are available in digital form – this includes digitised works, as well as those that are born-digital (such as PDFs or epubs). I’ve published the full lists of books and periodicals as searchable databases to make them easy to explore.

One thing that you might notice is that works with the format ‘Government publication’ pop up in both lists – sometimes it’s not clear whether something is a ‘book’ or ‘periodical’. To make it easier to find these items, no matter what their format, I’ve combined data from my two harvests and created a searchable dataset of government publications. It includes links to download OCRd text from CloudStor if available.

All three databases make use of Datasette, which I’ve also used for the GLAM Name Index Search. One of the cool things about Datasette is that it provides its own API, so if you find something interesting in any of these databases, you can easily download the machine-readable data for further analysis. #dhhacks

GLAM Workbench – a platform for digital HASS research

We’re in the midst of planning for the HASS Research Data Commons, which will deliver some much-needed investment in digital research infrastructure for the humanities and social sciences. Amongst the funded programs are tools for text analysis as part of the Linguistics Data Commons, and a platform for more advanced research using Trove. I’m hoping that this will be an opportunity to take stock of existing tools and resources, and build flexible pathways for researchers that enable them to collect, move, analyse, preserve, and share data across different platforms and services.

To this end, I thought it might be useful to try and summarise what the GLAM Workbench offers, particularly for Trove researchers. The GLAM Workbench doesn’t really have an institutional home, and is mostly unfunded – it’s my passion project. That means that it’s easy to overlook, particularly when the big grants are being doled out. But I think it has a lot to offer and I’m looking forward to exploring ways it can connect with these new initiatives.

Getting and moving data

There’s lots of fabulous data in Trove and other GLAM collections. In fact, there’s so much data that it can be difficult for researchers to find and collect what’s relevant to their interests. There are many tools in the GLAM Workbench to help researchers assemble their own datasets. For example:

  • Get newspaper articles in bulk with the Trove Newspaper and Gazette Harvester – This has been around in some form for more than ten years (it pre-dates the Trove API!). Give it the url of a search in Trove’s newspapers and gazettes and the harvester will save all the metadata in a CSV file, and optionally download the complete articles as OCRd text, images, or PDFs. The amount of data you harvest is really only limited by your patience and disk space. I’ve harvested more than a million articles in the past. The GLAM Workbench includes a web app version of the harvester that runs live in the cloud – just paste in your Trove API key and the search url, and click the button.
  • Get Trove newspaper pages as images – If you need a nice, high-resolution version of a newspaper page you can use this web app. If you want to harvest every front page (or some other particular page) here’s an example that gets all the covers of the Australian Women’s Weekly. A pre-harvested collection of the AWW covers is included as a bonus extra.
  • Get Trove newspaper articles as images – The Trove web interface makes it difficult to download complete images of articles, but this tool will do the job. There’s a handy web app to grab individual images, but the code from this tool is reused in other places such as the Trove Newspaper Harvester and the Omeka uploader, and could be built into your own research workflows.
  • Upload Trove newspaper articles to Omeka – Whether you’re creating on online exhibition or building a research database, Omeka can be very useful. This notebook connects Trove’s newspapers to Omeka for easy upload. Your selected articles can come from a search query, a Trove list, a Zotero library, or just a list of article ids. Metadata records are created in Omeka for each article and newspaper, and an image of each article is attached.
  • Get OCRd text from digitised periodicals in Trove – They’re often overshadowed by the newspapers, but there’s now lots of digitised journals, magazines, and parliamentary papers in Trove. You can get article-level data from the API, but not issue data. This notebook enables researchers to get metadata and OCRd text from every available issue of a periodical. To make researchers’ lives even easier, I regularly harvest all the available OCRd text from digitised periodicals in Trove. The latest harvest downloaded 51,928 issues from 1,163 periodicals – that’s about 10gb of text. You can browse the list of periodicals with OCRd text, or search this database. All the OCRd text is stored in a public repository on CloudStor.
  • Get page images from digitised periodicals in Trove – There’s more than text in digitised periodicals, and you might want to download images of pages for visual analysis. This notebook shows you how to get cover images, but could be easily modified to get another page, or a PDF. I used a modified version of this to create a collection of 3,471 full page editorial cartoons from The Bulletin, 1886 to 1952 – all available to download from CloudStor.
  • Get OCRd text from digitised books in Trove – Yep, there’s digitised books as well as newspapers and periodicals. You can download OCRd text from an individual book using the Trove web interface, but how do you make a collection of books without all that pointing and clicking? This notebook downloads all the available OCRd text from digitised books in Trove. The latest harvest includes text from 26,762 works. You can explore the results using this database.
  • Harvest parliamentary press releases from Trove – Trove includes more than 380,000 press releases, speeches, and interview transcripts issued by Australian federal politicians and saved by the Parliamentary Library. This notebook shows you how to harvest both metadata and fulltext from a search of the parliamentary press releases. For example, here’s a collection of politicians talking about ‘refugees’, and another relating to COVID-19.
  • Harvest details of Radio National programs from Trove – Trove creates records for programs broadcast on ABC Radio National; for the major current affairs programs these records are at segment level. Even though they don’t provide full transcripts, this data provides a rich, fine-grained record of Australia’s recent political, social, and economic history. This notebook shows you how to download the Radio National data. If you just want to dive straight in, there’s also a pre-harvested collection containing more than 400,000 records, with separate downloads for some of the main programs.
  • Find all the versions of an archived web page in Trove – Many of the tools in the Web Archives section of the GLAM Workbench will work with the Australian Web Archive, which is part of Trove. This notebook shows you how to get data about the number of times a web page has been archived over time.
  • Harvesting collections of text from archived web pages in Trove – If you want to explore how the content of a web page changes over time, you can use this notebook to capture the text content of every archived version of a web page.
  • Convert a Trove list into a CSV file – While Trove provides a data download option for lists, it leaves out a lot of useful data. This notebook downloads full details of newspaper articles and other works in a list and saves them as CSV files. Like the Trove Newspaper Harvester, it lets you download OCRd text and images from newspaper articles.
  • Collecting information about Trove user activity – It’s not just the content of Trove that provides interesting research data, it’s also the way people engage with it. Using the Trove API it’s possible to harvest details of all user created lists and tags. And yes, there’s pre-harvested collections of lists and tags for the impatient.

While I’m focusing here on Trove, there’s also tools to create datasets from the National Archives of Australia, Digital NZ and Papers Past, the National Museum of Australia and more. And there’s a big list of readily downloadable datasets from Australian GLAM organisations.

Visualisation and analysis

Many of the notebooks listed above include examples that demonstrate ways of exploring and analysing your harvested data. There are also a number of companion notebooks that examine some possibilities in more detail, for example:

But there are also many other notebooks that demonstrate methods for analysing Trove’s content, for example:

  • QueryPic – Another tool that’s been around in different forms for a decade, QueryPic visualises searches in Trove’s newspapers. The latest web app couldn’t be simpler, just paste in your API key and a search url and create charts showing the number of matching articles over time. You can combine queries, change time scales, and download the data and visualisations.
  • Visualise Trove newspaper searches over time – This is like a deconstructed version of QueryPic that walks you through the process of using Trove’s facets to assemble a dataset of results over time. It provides a lot of detail on the sorts of data available, and the questions we can ask of it.
  • Visualise the total number of newspaper articles in Trove by year and state – This notebook uses a modified version of the code above to analyse the construction and context of Trove’s newspaper corpus itself. What are you actually searching? Meet the WWI effect and the copyright cliff of death! This is a great place to start if you want to get people thinking critically about how digital resources are constructed.
  • Analyse rates of OCR correction – Some more meta-analysis of the Trove corpus itself, this time focusing on patterns of OCR correction by Trove users.
  • Identifying non-English language newspapers in Trove – There are a growing number of non-English language newspapers digitised in Trove. However, if you’re only searching using English keywords, you might never know that they’re there. This notebook analyses a sample of articles from every newspaper in Trove to identify non-English content.
  • Beyond the copyright cliff of death – Most of the newspaper articles on Trove were published before 1955, but there are some from the later period. This notebook helps you find out how many, and which newspapers they were published in.
  • Map Trove newspaper results by state – This notebook uses the Trove state facet to create a choropleth map that visualises the number of search results per state.
  • Map Trove newspaper results by place of publication – This notebook uses the Trove title facet to find the number of results per newspaper, then merges the results with a dataset of geolocated newspapers to map where articles were published.
  • Compare two versions of an archived web page – This notebook demonstrates a number of different ways of comparing versions of archived web pages. Just choose a repository, enter a url, and select two dates to see comparisons based on: page metadata, basic statistics such as file size and number of words, numbers of internal and external links, cosine similarity of text, line by line differences in text or code, and screenshots.
  • Display changes in the text of an archived web page over time – This web app gathers all the available versions of a web page and then visualises changes in its content between versions – what’s been added, removed, and changed?
  • Use screenshots to visualise change in a page over time – Create a series of full page screenshots of a web page over time, then assemble them into a time series.

There are also possibilities for using Trove data creatively. For example you can create ‘scissors and paste’ messages from Trove newspaper articles.

Documentation and examples

All the Trove notebooks in the GLAM Workbench help document the possibilities and limits of the Trove API. The examples above can be modified and reworked to suit different research interests. Some notebooks also explore particular aspects of the API, for example:

  • Trove API Introduction – Some very basic examples of making requests and understanding results.
  • Today’s news yesterday – Uses the date index and the firstpageseq parameter to find articles from exactly 100 years ago that were published on the front page. It then selects one of the articles at random and downloads and displays an image of the front page.
  • The use of standard licences and rights statements in Trove image records – Version 2.1 of the Trove API introduced a new rights index that you can use to limit your search results to records that include one of a list of standard licences and rights statements. We can also use this index to build a picture of which rights statements are currently being used, and by who.
  • Random items from Trove – Changes to the Trove API meant that techniques you could previously use to select resources at random no longer work. This section documents some alternative ways of retrieving random-ish works and newspaper articles from Trove.

And while it’s not officially part of the GLAM Workbench, I also maintain the Trove API Console which provides lots of examples of the API in action.

Pathways

In developing the GLAM Workbench I’m very aware that people will arrive with different levels of digital skill, confidence, and experience. That’s why I’ve been putting a lot of thought and effort into ways of providing a range of entry points.

Someone who might not identify as a ‘digital’ researcher can, with a single click, start up QueryPic and start exploring changes over time in Trove’s newspapers. This is possible because the GLAM Workbench is configured to make use of Binder, a service that spins up customised computing environments as needed.

Another researcher might start running the Trove Newspaper Harvester using Binder, but find that they want to run bigger and longer harvests. In that case, the GLAM Workbench offers a one-click installation of the Trove Newspaper Harvester on Reclaim Cloud. Unlike Binder, Reclaim Cloud environments are persistent, so you can run the harvester for as long as you want without the worry of interruptions.

Yet another researcher might want to understand how the Trove API works and the sorts of data that it makes available. By exploring the various notebooks they’ll find useful snippets of code they can try out in their own projects.

The GLAM Workbench connects outwards to make use of a range of other services – the notebooks run in Binder, Reclaim Cloud, and Docker; the code is all openly licensed and publicly available through GitHub and Zenodo; data is hosted on GitHub, CloudStor, and Zenodo; datasets can be explored using Datasette running on Glitch or Google CloudRun. I’m hoping that the new investments in HASS research infrastructure will embed a similar philosophy, connecting up existing services rather than starting from scratch.

The future

This is just an outline of what the GLAM Workbench currently offers researchers wanting to make use of the data available from Trove. It’s all there now, publicly accessible, openly licensed, and ready to use – take it, use it, change it, share it. But there’s much more I’d like to do, both in regard to Trove and to encourage use of GLAM data more generally. I’m also interested in your ideas for new tools, examples, or data sources – what would help your research? You can add a suggestion in GitHub, or post a comment in the GLAM Workbench channel of OzGLAM Help.

See the Getting Started section of the GLAM Workbench for more hints and examples. And keep an eye on the news feed for the latest additions and updates.

A Family History Month experiment – search millions of name records from GLAM organisations

There’s a lot of rich historical data contained within the indexes that Australian GLAM organisations provide to help people navigate their records. These indexes, often created by volunteers, allow access by key fields such as name, date or location. They aid discovery, but also allow new forms of analysis and visualisation. Kate Bagnall and I wrote about some of the possibilities, and the difficulties, in this recently published article.

Many of these indexes can be downloaded from government data portals. The GLAM Workbench demonstrates how these can be harvested, and provides a list of available datasets to browse. But what’s inside them? The GLAM CSV Explorer visualises the contents of the indexes to give you a sneak peek and encourage you to dig deeper.

There’s even more indexes available from the NSW State Archives. Most of these aren’t accessible through the NSW government data portal yet, but I managed to scrape them from the website a couple of years ago and made them available as CSVs for easy download.

It’s Family History Month at the moment, and the other night I thought of an interesting little experiment using the indexes. I’ve been playing round with Datasette lately. It’s a fabulous tool for exploring tabular data, like CSVs. I also noticed that Datasette’s creator Simon Willison had added a search-all plugin that enabled you to run a full text search across multiple databases and tables. Hmmm, I wondered, would it be possible to use Datasette to provide a way of searching for names across all those GLAM indexes?

After a few nights’ work, I found the answer was yes.

Try out my new aggregated search interface here!

(The cloud service it uses runs on demand, so if it has gone to sleep, it might take a little while to wake up again – just be patient for a few seconds.)

Currently, the GLAM Name Search interface lets you search for names across 195 indexes from eight GLAM organisations. Altogether, there’s a total of more than 9.2 million rows of data to explore!

It’s simple to use – just enter a name in the search box and Datasette will search each index in turn, displaying the first five matching results. You can click through to view all results from a specific index. Not surprisingly, the aggregated name search only searches columns containing names. However, once you click through to an individual table, you can apply additional filters or facets.

To create the aggregated search interface I worked through the list of CSVs I’d harvested from government data portals to identify those that contained names of people, and discard those that contained administrative, rather than historical data. I also made a note of the columns that contained the names so I could index their contents once they’d been added to the database. Usually these were fields such as Surname or Given names, but sometimes names were in the record title or notes.

Datasette uses SQLite databases to store its data. I decided to create one database for each GLAM organisation. I wrote some code to work through my list of datasets, saving them into an SQLite database, indexing the name columns, and writing information about the dataset to a metadata.json file. This file is used by Datasette to display information such as the title, source, licence, and last modified date of each of the indexes.
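I won’t walk through the actual code here, but the core of the process can be sketched with sqlite-utils (the library that pairs with Datasette) – the file, table, and column names below are just placeholders:

import csv
import sqlite_utils

# One SQLite database per GLAM organisation
db = sqlite_utils.Database("qsa.db")

# Load a harvested CSV into a table
with open("dunwich_benevolent_asylum.csv", newline="") as f:
    rows = list(csv.DictReader(f))
db["dunwich_benevolent_asylum"].insert_all(rows)

# Index the columns containing names so their contents can be searched
db["dunwich_benevolent_asylum"].enable_fts(["Surname", "Given names"])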

Once that was done, I could fire up Datasette and feed it all the SQLite databases. Amazingly it all worked – searching across all the indexes was remarkably quick! To make it publicly available I used Datasette’s publish command to push everything to Google CloudRun (about 1.4 gb of data). The first time I used CloudRun it took some time to get the authentication and other settings working properly. This time was much smoother. Before long it was live!

Once I knew it all worked, I decided to add in another 59 indexes from the NSW State Archives. I also plugged in a few extra indexes from the Public Record Office of Victoria. These datasets are stored as ZIP files in the Victorian government data portal, so it took a little bit of extra manual processing to get everything sorted. But finally I had all 195 indexes loaded.

What now? That depends on whether people find this experiment useful. I have a few ideas for improvements. But if people do use it, then the costs will go up. I’m going to have to monitor this over the next couple of months to see if I can afford to keep it going. If you want to help with the running costs, you might like to sign up as a GitHub sponsor.

And please let me know if you think it’s worth developing! #dhhacks

Explore Trove’s digitised books

The Trove books section of the GLAM Workbench has been updated! There’s freshly-harvested data, as well as updated Python packages, integration with Reclaim Cloud, and automated Docker builds.

Included is a notebook to harvest details of all books available from Trove in digital form. This includes both digitised books that have been scanned and OCRd, and born-digital publications, such as PDFs and epubs. The definition of ‘books’ is pretty loose – I’ve harvested details of anything that has been assigned the format ‘Book’ in Trove, but this includes ephemera, such as posters, pamphlets, and advertising.

In the latest harvest, I ended up with details of 42,174 ‘books’. This includes some duplicates, because multiple metadata entries can point to the same digital object. I thought it was best to preserve the duplicates, rather than discard the metadata.

Once I’d harvested the details of the books, I tried to see if there was any OCRd text available for download. If there was, I saved it to a public folder on CloudStor. In total, I was able to download 26,762 files of OCRd text.

Screenshot of database showing details of digital book

The easiest way to explore the books is using this searchable database. It’s created using Datasette and is running on Glitch. Full text search is available on the ‘title’ and ‘contributors’ fields, and you can filter on things like date, copyright status, number of pages, and whether OCRd text is available for download. If there is OCRd text, a direct link to the file on CloudStor is included. You can use the database to filter the titles, creating your own dataset that you can download in CSV or JSON format.
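You can also get at the filtered data programmatically – every Datasette table has a JSON version. The url below is just an illustration, so point it at whichever database you’re exploring:

import requests

# Add .json to any Datasette table url to get machine-readable results
url = "https://your-datasette-instance.glitch.me/books/books.json"
params = {
    "_search": "federation",  # full text search (title and contributors fields)
    "_shape": "objects",      # return rows as a list of dictionaries
    "_size": 100,             # up to 100 rows per request
}
data = requests.get(url, params=params).json()
for row in data["rows"][:5]:
    print(row)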

If you just want the full list of books as a CSV file, you can download it here. And if you want all the OCRd text, you can go straight to the public folder on CloudStor – there’s about 3.6gb of text files to explore! #dhhacks

A miscellany of ephemera, oddities, & estrays

I’m just in the midst of updating my harvest of OCRd text from Trove’s digitised books (more about that soon!). But amongst the items catalogued as ‘books’ are a wide assortment of ephemera, posters, advertisements, and other oddities. There’s no consistent way of identifying these items through the search interface, but because I’ve found the number of pages in each ‘book’ as part of the harvesting process, I can limit results to items with just a single digitised page – there’s more than 1,500! To make it easy to explore this collection of odds and ends, I’ve downloaded all the single page images and compiled them into one big PDF with links back to their entries in Trove. Enjoy your browsing!
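If you want to do something similar with your own set of downloaded images, combining them into a single PDF is straightforward – here’s a rough sketch using Pillow (it skips the links back to Trove that the compiled PDF includes):

from pathlib import Path
from PIL import Image

# Collect the downloaded page images and convert them to RGB for PDF output
images = [Image.open(p).convert("RGB") for p in sorted(Path("images").glob("*.jpg"))]

# Save the first image as a PDF and append the rest as additional pages
images[0].save("ephemera.pdf", save_all=True, append_images=images[1:])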

This is another example of the ways in which we can extend and enrich existing collection interfaces using simple technologies like PDFs and CSVs. We can create slices across existing categories to expose interesting features, and provide new entry points for researchers. Some other examples in the GLAM Workbench are the collection of editorial cartoons from The Bulletin, the list of Trove newspapers with non-English content, the harvest of ABC Radio National programs, and the recent collection of politicians talking about COVID. Let me know if you have any ideas for additional slices! #dhhacks

Everyday heritage and the GLAM Workbench

Some good news on the funding front with the success of the Everyday Heritage project in the latest round of ARC Linkage grants. The project aims to look beyond the formal discourses of ‘national’ heritage to develop a more diverse range of heritage narratives. Working at the intersection of place, digital collections, and material culture, team members will develop a series of ‘heritage biographies’ that document everyday experience and provide new models for the heritage sector.

Screen capture of project details in ARC grants database

Digital methods will play a major role in the project. I’ll be leading the ‘Heritage Hacks’ work package that will support the creation of the heritage biographies and develop a range of new tools and tutorials for use in heritage management contexts. All the tools, methods, and data generated through the project will be documented using Jupyter notebooks and published through the GLAM Workbench. Watch this space!

The project is led by Tracy Ireland (University of Canberra), with Jane Lydon (UWA), Kate Bagnall (UTAS), and me as chief investigators. Our industry partner is GML Heritage.

Recent GLAM Workbench presentations

So far this year I’ve given eight workshops or presentations relating to the GLAM Workbench, with probably a few more yet to come. Here’s the latest:

You can view all the GLAM Workbench presentations here.

Updated! Lots and lots of text freshly harvested from Trove periodicals

For a few years now I’ve been harvesting downloadable text from digitised periodicals in Trove and making it easily available for exploration and research. I’ve just completed the latest harvest – here’s the summary:

  • 1,163 digitised periodicals had text available for download
  • Text was downloaded from 51,928 individual issues
  • Adding up to a total of around 12gb of text

If you want to dive straight in, here’s a list of all the harvested periodicals, with links to download a summary of available issues, as well as all the harvested text (there’s one file per issue). You’ll notice that the list includes a large number of parliamentary papers and government reports as well as published journals.

List of Trove periodicals with downloadable text

All of the harvested text is available from a public folder on CloudStor.

The harvesting process involves a few different steps (there’s a rough sketch of the download loop after the list):

  • First I generate a list of periodicals available in digital form from Trove. This includes digitised titles, as well as born-digital titles submitted through e-Legal Deposit. The latest run produced a CSV file containing the details of 7,270 titles. See this notebook for details.
  • Then I work through this list of titles to find out how many issues of each title are available through Trove. This information isn’t accessible through the API, so I have to do some screen scraping.
  • Next I work through the list of issues and try to download the text contents. Most of the born-digital titles don’t have downloadable text.
  • Once I’ve downloaded all the text I can from a title, I create a CSV file for it that lists the available issues and notes whether text is available for each. This file is stored with the text on CloudStor.
  • Once I’ve checked all the titles, I generate another CSV file that lists the details of all the periodicals that have downloadable text.
  • The code to harvest and document the downloaded text is available in this notebook.
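Steps three and four are where most of the work happens. As a rough illustration (not the notebook’s actual code – the download function below is just a placeholder for the scraping logic), the loop for a single title looks something like this:

    from pathlib import Path
    import pandas as pd

    def download_ocr_text(issue_id):
        """Placeholder for the notebook's download/scraping logic."""
        return None

    issues = ["nla.obj-123456789"]  # placeholder issue identifiers for one title
    output_dir = Path("periodical-texts")
    output_dir.mkdir(exist_ok=True)

    records = []
    for issue_id in issues:
        text = download_ocr_text(issue_id)
        if text:
            (output_dir / f"{issue_id}.txt").write_text(text)
        records.append({"issue_id": issue_id, "has_text": bool(text)})

    # One CSV per title, listing issues and noting whether text was available
    pd.DataFrame(records).to_csv(output_dir / "issues.csv", index=False)

#dhhacks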

New dataset – Politicians talking about COVID

The Trove Journals section of the GLAM Workbench includes a notebook that helps you download press releases, speeches, and interview transcripts by Australian federal politicians. These documents are compiled and published by the Parliamentary Library, and the details are regularly harvested into Trove.

Using this notebook, I’ve created a collection of documents that include the words ‘COVID’ or ‘Coronavirus’. It includes all the metadata from Trove, as well as the full text of each document downloaded from the Parliamentary Library. There’s 3,995 documents in total, covering the period up until early April 2021. You can download them all as a zip file (12 mb).

While I was compiling this dataset, I also made a few improvements to the notebook. You can now filter the results to weed out false positives, and identify duplicates. #dhhacks

8 million Trove tags to explore!

I’ve always been interested in the way people add value to resources in Trove. OCR correction tends to get all the attention, but Trove users have also been busy organising resources using tags, lists, and comments. I used to refer to tagging quite often in presentations, pointing to the different ways tags were used. For example, ‘TBD’ is a workflow marker, used by text correctors to label articles that are ‘To Be Done’. My favourite was ‘LRRSA’, one of the most heavily-used tags across the whole of Trove. What does it mean? It stands for the Light Railway Research Society of Australia, and the tag is used by members to mark items of shared interest. It’s a great example of how something as simple as plain-text tags can be used to support collaboration and build communities.

Word cloud showing the top 200 Trove tags

Until its update last year, Trove used to provide some basic stats about user activity. There was also a tag cloud that let you explore the most commonly-used tags. It’s now much harder to access this sort of information. However, you can extract some basic information about tags from the Trove API. First of all, you can filter a search using ‘has:tags’ to limit the results to items that have tags attached to them. Then to find out what the tags actually are, you can add the include=tags parameter. This embeds the tags within the item record, so you can work through a set of results, extracting all the tags as you go. To save you the trouble, I’ve done this for the whole of Trove, and ended up with a dataset containing more than 8 million tags!
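If you want to try this yourself, the request looks something like the sketch below (simplified – the shape of the response varies between zones, and you’ll need your own API key):

    import requests

    API_KEY = "YOUR_TROVE_API_KEY"
    params = {
        "q": "has:tags",      # only return items that have tags attached
        "zone": "newspaper",
        "include": "tags",    # embed the tags in each item record
        "encoding": "json",
        "n": 100,
        "key": API_KEY,
    }
    data = requests.get("https://api.trove.nla.gov.au/v2/result", params=params).json()
    for article in data["response"]["zone"][0]["records"]["article"]:
        print(article.get("tag", []))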

Chart showing the number of tags per year and zone.

The dataset is saved as a 500mb CSV file, and contains the following fields:

  • tag – lower-cased version of the tag
  • date – date the tag was added
  • zone – the API zone that contains the tagged resource
  • resource_id – the identifier of the tagged resource

There’s a few things to note about the data:

  • Works (such as books) in Trove can have tags attached at either work or version level. This dataset aggregates all tags at the work level, removing any duplicates.
  • A single resource in Trove can appear in multiple zones – for example, a book that includes maps and illustrations might appear in the ‘book’, ‘picture’, and ‘map’ zones. This means that some of the tags will essentially be duplicates – harvested from different zones, but relating to the same resource. Depending on your interests, you might want to remove these duplicates (there’s a sketch of this after the list).
  • While most of the tags were added by Trove users, more than 500,000 tags were added by Trove itself in November 2009. I think these tags were automatically generated from related Wikipedia pages. Depending on your interests, you might want to exclude these by limiting the date range or zones.
  • User content added to Trove, including tags, is available for reuse under a CC-BY-NC licence.
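To make the last couple of points concrete, here’s a hedged pandas sketch of the sort of clean-up you might do. It assumes ISO-formatted dates and that cross-zone duplicates can be identified by matching tag and resource_id pairs:

    import pandas as pd

    df = pd.read_csv("trove_tags.csv")  # file name is illustrative

    # Remove cross-zone duplicates: the same tag on the same resource
    deduped = df.drop_duplicates(subset=["tag", "resource_id"])

    # Exclude the machine-generated tags added by Trove in November 2009
    user_tags = deduped[~deduped["date"].astype(str).str.startswith("2009-11")]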

You can download the complete dataset from Zenodo, or from CloudStor. For more information on how I harvested the data, and some of its limits and complexities, see the notebooks in the new ‘Tags’ section in the GLAM Workbench. There’s also some examples of analysing and visualising the tags. As an extra bonus, there’s a more compact 50mb CSV dataset which lists each unique tag and the number of times it has been used.

Of course, it’s worth remembering that this sort of dataset is out of date before the harvest is even finished. More tags are being added all the time! But hopefully this data will help us better understand the way people work to organise and enrich complex resources like Trove. #dhhacks

Integrating GLAM Workbench news and discussion

I’ve spent a lot of time this year working on ways of improving the GLAM Workbench’s documentation and its integration with other services. Last year I created OzGLAM Help to provide a space where users of GLAM collections could ask questions and share discoveries – including a dedicated GLAM Workbench channel. Earlier this year, I tweaked my Micro.blog-powered updates to include a dedicated GLAM Workbench news feed. Now I’ve brought the two together! What does this mean?

  • Any GLAM Workbench news that I post to my updates feed is now automatically added to OzGLAM Help
  • Links are automatically added to items in the news feed that let you add comments or questions in OzGLAM Help

So now there’s two-way communication between the services, providing more ways for people to discover and discuss how the GLAM Workbench can help them.

GLAM Workbench now on YouTube!

I’ve started creating short videos to introduce or explain various components of the GLAM Workbench. The first video shows how you can visualise searches in Trove’s digitised newspapers using the latest version of QueryPic. It’s a useful introduction to the way access to collection data enables us to ask different types of questions of historical sources.

As with all GLAM Workbench resources, the video is openly licensed – so feel free to drop it into your own course materials or workshops. It could, for example, provide an interesting little digital methods task in an Australian history unit.

I’ll be creating a second QueryPic video shortly, demonstrating how you can work with complex queries and differing timescales. Let me know if you find it useful, or if you have any ideas for future topics. #dhhacks

GLAM Workbench office hours

To help you make use of the GLAM Workbench, I’ve set up an ‘office hours’ time slot every Friday when people can book in for 30 minute chats via Zoom. Want to talk about how you might use the GLAM Workbench in your latest research project? Are you having trouble getting started with GLAM data? Or perhaps you have some ideas for future notebooks you’d like to share? Just click on the ‘Book a chat’ link in the GLAM Workbench, or head straight to the scheduling page to set up a time!

Book a chat!

This is yet another experiment to see how I can support the use of GLAM data and the development of digital skills with the GLAM Workbench. Let me know if you think it’s worthwhile. #dhhacks

QueryPic: The Next Generation

QueryPic is a tool to visualise searches in Trove’s digitised newspapers. I created the first version way back in 2011, and since then it’s taken a number of different forms. The latest version introduces some new features:

  • Automatic query creation – construct your search in the Trove web interface, then just copy and paste the url into QueryPic. This means you can take advantage of Trove’s advanced search and facets to build complex queries.
  • Multiple time scales – previous versions only aggregated search results by year, but now you can also aggregate by month, or by day. QueryPic will automatically choose a time unit based on the date range of your query, but if you’re not happy with the result you can change it!
  • Links back to Trove – click on any of the points on the chart to search Trove within that time period. This enables you to zoom in and out of your results, from the high-level visualisation, to individual articles.

Screenshot of QueryPic chart

This version of QueryPic is built within a Jupyter notebook, and designed to run using Voila (which hides all the code and makes the notebook look like a web app). See the Trove Newspapers section of the GLAM Workbench for more information. If you’d like to give it a try, just click the button below to run it live using Binder.

Binder badge

Hope you find it useful! #dhhacks

Everyone gets a Lab!

I recently took part in a panel at the IIPC Web Archiving Conference discussing ‘Research use of web archives: a Labs approach’. My fellow panellists described some amazing stuff going on in European cultural heritage organisations to support researchers who want to make use of web archives. My ‘lab’ doesn’t have a physical presence, or an institutional home, but it does provide a starting point for researchers, and with the latest Reclaim Cloud and Docker integrations, everyone can have their own web archives lab! Here’s my 8 minute video. The slides are available here.

Minor change to Reclaim Cloud config

When the 1-click installer for Reclaim Cloud works its magic and turns GLAM Workbench repositories into your own, personal digital labs, it creates a new work directory mounted inside your main Jupyter directory. This new directory is independent of the Docker image used to run Jupyter, so it’s a handy place to copy things if you ever want to update the Docker image. However, I’ve just realised that there was a permissions problem with the work directory that meant you couldn’t write files to it from within Jupyter.

To fix the problem, I’ve added an extra line to the reclaim-manifest.jps config file to make the Jupyter user the owner of the work directory:

	- cmd[cp]: chown -R jovyan:jovyan /home/jovyan/work

This takes care of any new installations. If you have an existing installation, you can either just create a completely new environment using the updated config, or you can manually change the permissions:

  • Hover over the name of your environment in the control panel to display the option buttons.
  • Click on the Settings button. A new box will open at the bottom of the control panel with all the settings options.
  • Click on ‘SSH Access’ in the left hand menu of the settings box.
  • Click on the ‘SSH Connection’ tab.
  • Under ‘Web SSH’ click on the Connect button and select the default node.
  • A terminal session will open. At the command line enter the following:

    	chown -R jovyan:jovyan /home/jovyan/work
    

Done! See the Using Reclaim Cloud section of the GLAM Workbench for more information.

Trove Query Parser

Here’s a new little Python package that you might find useful. It simply takes a search url from Trove’s Newspapers & Gazettes category and converts it into a set of parameters that you can use to request data from the Trove API. While some parameters are used both in the web interface and the API, there are a lot of variations – this package means you don’t have to keep track of all the differences!

It’s very simple to use.

How to use the Trove Query Parser.
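In short, you hand it a search url and get back a dictionary of API parameters – something like the sketch below (the import path is an assumption based on memory, so check the package documentation for the exact form):

    # Assumed import path – see the trove-query-parser docs for the exact location
    from trove_query_parser.parser import parse_query

    # A search url copied from the Newspapers & Gazettes category
    url = "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
    params = parse_query(url)

    # params is a dictionary ready to send to the Trove API
    print(params)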

The code for the parser has been basically lifted from the Trove Newspaper Harvester. I wanted to separate it out so that I could use it at various spots in the GLAM Workbench and in other projects.

This package, the documentation, and the tests were all created using nbdev, which is really quite a fun way to develop Python packages. #dhhacks