I’ve added a notebook to the GLAM Workbench that walks through the steps involved in creating a fully searchable database of content extracted from a periodical uploaded to Trove through the National eDeposit service (NED).
I was contacted recently by a member of the team that publishes The Triangle, a community newsletter from the south coast of NSW. Issues of The Triangle from 2007 to the present have been uploaded to Trove through the National eDeposit service, but they were wondering whether it was possible to search across all their newsletters in Trove. Unfortunately, the answer is no.
Issues of The Triangle are saved in Trove as PDFs with a searchable text layer. Individual issues can be browsed and searched using the built-in PDF viewer, but there seems to be no way of searching across multiple issues in Trove. There are a couple of reasons for this:
On top of this, Trove’s inconsistency and lack of transparency mean that you’re never really sure what you’re searching and why. Why do some PDFs get indexed, but others don’t? Why are community newsletters contributed through NED in the ‘Books & Libraries’ category rather than ‘Magazines & Newsletters’? Why are some periodicals searchable by article while others are not? I’m trying to document many of these inconsistencies in the Trove Data Guide as they cause confusion and uncertainty for users – if your search returns no results, is it because Trove has no relevant content, or is it because the relevant content isn’t fully searchable? You just don’t know.
I recently updated my harvest of periodicals submitted to Trove through the National eDeposit Service. In total, there are 8,572 different periodicals containing 179,510 issues! I used the l-format=Periodical facet to separate out the periodicals from other types of publications, and I don’t think it’s always accurate – some of the publications in the dataset look like one-off reports. Nonetheless, there are lots of periodicals. As I’ve noted previously, this includes a rich assortment of local and community newsletters – not just The Triangle, but The Apollo Bay News, Palm Island Voice, and many, many others. As local newspapers die out, these sorts of publications capture details of community life that might otherwise be missing from the historical record. Just as the diversity of Trove’s digitised newspapers has given historians new perspectives on the past, I believe these NED periodicals will provide researchers with an increasingly important source of information on daily life in Australia. Equally, having these publications accessible online through a free, national service like Trove ensures that communities themselves will have ongoing access to their own histories. But both for researchers and communities, the value of the publications will be affected by their accessibility – how can they be found, searched, and used?
If you can’t search across a NED periodical within Trove, perhaps we need to develop alternative approaches outside of Trove. Using The Triangle as my test case, I’ve developed a workflow that creates a standalone, fully searchable database of content from a NED periodical. The database supports full text searches across the complete text content, including query options like wildcards and boolean operators that don’t work within standard PDF viewers. Have a go! You can try searching The Triangle here.
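To give a sense of what wildcard and boolean queries look like in a SQLite database, here is a minimal sketch. It assumes the full text index is built with SQLite's FTS5 extension, which may not match what the notebook does. The table name, column names, and sample text are all made up for illustration.

```python
import sqlite3


def build_demo_index():
    """Create an in-memory FTS5 index holding a few invented pages.

    The 'pages' table and its columns are hypothetical, not the
    schema produced by the GLAM Workbench notebook.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE pages USING fts5(issue, text)")
    conn.executemany(
        "INSERT INTO pages (issue, text) VALUES (?, ?)",
        [
            ("2007-06", "The market committee met to plan the winter festival."),
            ("2012-03", "Volunteers repaired the community hall after the storm."),
            ("2019-11", "Bushfire recovery meetings were held at the hall."),
        ],
    )
    return conn


def search(conn, query):
    """Return the issue labels of pages matching an FTS5 query string."""
    rows = conn.execute(
        "SELECT issue FROM pages WHERE pages MATCH ? ORDER BY issue", (query,)
    )
    return [row[0] for row in rows]


conn = build_demo_index()
print(search(conn, "hall AND storm"))   # boolean AND: both words required
print(search(conn, "volunteer*"))       # wildcard: any word starting with 'volunteer'
print(search(conn, "hall NOT storm"))   # boolean NOT: 'hall' without 'storm'
```

Because the index covers every page of every issue, a single query like this searches the whole run of the periodical at once, rather than one PDF at a time.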
There are a number of different steps involved in creating databases like this from NED periodicals:
This is fully documented in the GLAM Workbench notebook Create a searchable database from issues of a NED periodical. Once you’ve created the database you can explore it using any SQLite client, but I like to use Datasette. The notebook creates a custom metadata file for configuring Datasette, and explains how to open the database using either Datasette or Datasette-Lite.
The standard Datasette interface can look a little intimidating if you just want to run a full text search. To make it easier, I’ve developed a customised Datasette theme and canned query that generates a simple search page with a few extra features such as date facets, result snippets, and query highlighting. It’s basically designed to look like a standard search interface. Sometimes simple takes a bit of extra work!
The canned query defines the search parameters and constructs the SQL query called by the search box. It’s automatically included in the metadata.json file created by the notebook. To put everything together, you just need to point Datasette to your database, the metadata file, and the custom template.
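As a rough illustration of how a canned query sits inside a Datasette metadata file, the sketch below builds a small metadata structure in Python. The database name (triangle), table name (pages), query name (search), and the SQL itself are assumptions made for this example; the metadata.json generated by the notebook will differ, and adds features such as date facets.

```python
import json


def build_metadata(db_name, table="pages"):
    """Build Datasette metadata containing one canned full text query.

    Assumes a hypothetical FTS5 table whose second column (index 1)
    holds the page text. ':query' is a named parameter that Datasette
    turns into a form field on the canned query page.
    """
    sql = (
        f"SELECT issue, snippet({table}, 1, '<b>', '</b>', '...', 30) AS snippet "
        f"FROM {table} WHERE {table} MATCH :query ORDER BY rank"
    )
    return {
        "databases": {
            db_name: {
                "queries": {
                    "search": {
                        "title": "Search",
                        "sql": sql,
                    }
                }
            }
        }
    }


metadata = build_metadata("triangle")
print(json.dumps(metadata, indent=2))
```

Saved as metadata.json, a file like this can be passed to Datasette alongside the database using the --metadata option, with --template-dir pointing at a folder of custom templates.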
I’ve embedded the template in a new Datasette-Lite repository. The notebook explains how to construct a URL that will open your database using this repository. The Triangle search interface runs using this customised version of Datasette-Lite.
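The notebook has the exact details, but the general idea can be sketched as follows. Standard Datasette-Lite accepts url and metadata query parameters that point to a publicly accessible SQLite file and metadata file. This example assumes the customised repository works the same way. All the addresses below are placeholders, not real locations.

```python
from urllib.parse import urlencode


def datasette_lite_url(base_url, db_url, metadata_url=None):
    """Build a link that asks Datasette-Lite to load a remote database.

    base_url is wherever the Datasette-Lite page is hosted (a
    placeholder here). db_url and metadata_url must be publicly
    accessible for the browser to fetch them.
    """
    params = {"url": db_url}
    if metadata_url:
        params["metadata"] = metadata_url
    # urlencode percent-encodes the nested URLs so they survive as parameter values
    return f"{base_url}?{urlencode(params)}"


link = datasette_lite_url(
    "https://example.org/datasette-lite/",    # hypothetical Datasette-Lite location
    "https://example.org/triangle.db",        # hypothetical database location
    "https://example.org/metadata.json",      # hypothetical metadata location
)
print(link)
```

Since everything runs in the browser, a link like this is all that's needed to share a searchable database; there is no server to maintain.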
I’ve also created a new Trove NED section of the GLAM Workbench. The notebook that harvests metadata about NED periodicals was previously in the Trove periodicals section, but I think it’s better to keep the NED documentation separate. When I first started developing the GLAM Workbench, it made sense for the structure to mirror Trove’s zones. But over the years I’ve become very aware of the fact that the way content is treated in Trove has less to do with its format than the processing pipelines that get it into Trove. In Trove, digitised newspapers are very different to digitised journals, and digitised journals are very different to NED periodicals, even though they’re all really periodicals! There’s more of this sort of fun in the Trove Data Guide.
Trove and the National Library of Australia refuse to share links relating to the GLAM Workbench or the Trove Data Guide, so there’s a good chance that the people who might benefit most directly from this work will never know that it exists, and will instead be left struggling on their own. I think that’s bad for Australian HASS research. So if it seems this might be useful, please share amongst your networks!