Tim Sherratt

Sharing recent updates and work-in-progress

Apr 2024

Update! Saving Trove newspaper articles and pages as images

You probably know that when you select the Download as Image option for a digitised newspaper article in Trove what you get back is not actually an image – it’s an HTML document, in which the original image has been sliced up to try and fit on an A4 page when printed. So this article:

Image of newspaper article as it appeared in the published newspaper, with the central column extending well below the main body of the article

Ends up looking like this!!

Image of the results of using Trove's download article as image option -- the content of the article has been sliced and jumbled making it very difficult to understand how it originally appeared

So what do you do when you just want an image of an article as it appeared in the newspaper? Some years ago I figured out a workaround that involves scraping the OCR positional data that’s embedded in Trove’s newspaper viewer and cropping the article from a high-resolution image of the page. The method is documented in the GLAM Workbench and the Trove Data Guide, and I’ve packaged up the code in trove-newspapers-images so you can embed it in your own projects.

I also created a web app (using Jupyter and Voilà) to make it as simple as possible for people to download images from articles. Unlike most of the other notebooks in the GLAM Workbench, which are spun up on demand, this web app was hosted on a constantly-running server. This made it faster to start and use, but it was relatively expensive, wasteful, and difficult to maintain. So I decided to make a change!

The new version of the Save Trove newspaper article as image web app is actually embedded within the GLAM Workbench. Behind the scenes, the page calls an AWS Lambda function which uses trove-newspapers-images to generate the image. So far it seems to be working pretty well. Try it now!

Screen capture of the web app from the GLAM Workbench

Even better, I’ve made some changes to the image generation code to give users the option of masking the articles. The original version crops a rectangle from the page using the article coordinates. If an article extends over multiple columns with different lengths, the image will include content from neighbouring articles. It’s not a big problem, but it always annoyed me. Recently I realised that the solution was quite simple – instead of cropping one big box from the page, you can crop each individual OCR ‘zone’ and paste them into a new empty image with the same dimensions as the original. Once you’ve pasted all the zones, you crop the new page image using the article coordinates. Here’s an example of an article without masking:

And the same article with masking:

This enhancement has been pushed to the trove-newspapers-images package, and is available through the web app by simply checking the ‘mask image’ option.
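If you're curious how the masking step works, here's a minimal sketch using Pillow. It's not the code from trove-newspapers-images, just an illustration of the crop-and-paste idea. The function name `mask_article` and the zone format (a list of `(left, upper, right, lower)` pixel boxes) are my assumptions for the example, and the final crop uses the box enclosing all the zones as a stand-in for the article coordinates.

```python
from PIL import Image


def mask_article(page, zones, background="white"):
    """Return an article image that only contains pixels from its zones.

    page  -- a PIL image of the full newspaper page
    zones -- list of (left, upper, right, lower) boxes in page pixels
             (an assumed format for this sketch)
    """
    # Start with an empty image the same size as the page.
    canvas = Image.new(page.mode, page.size, background)

    # Copy each zone across, keeping it in its original position.
    for left, upper, right, lower in zones:
        canvas.paste(page.crop((left, upper, right, lower)), (left, upper))

    # Crop to the box that encloses every zone (standing in here for
    # the article coordinates).
    article_box = (
        min(z[0] for z in zones),
        min(z[1] for z in zones),
        max(z[2] for z in zones),
        max(z[3] for z in zones),
    )
    return canvas.crop(article_box)


# Toy example: a black "page" with an article whose two columns
# have different lengths.
page = Image.new("RGB", (100, 100), "black")
zones = [(10, 10, 30, 60), (30, 10, 50, 40)]
article = mask_article(page, zones)
```

In the toy example, the area below the shorter second column comes out blank instead of showing whatever was on the page underneath it.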

Another frustrating feature of the Trove web interface is that there’s no way of saving a newspaper page as an image, only as a PDF. In this case the workaround is pretty simple: you just have to know the URL pattern used to download page images. This is documented in the Trove Data Guide. Once again, I’ve been providing a web app to make this easy for users, and once again I’ve just updated it so that it’s embedded within the GLAM Workbench itself. Try it now!
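As a rough sketch of the page image workaround, something like the following should do the job using only the Python standard library. The URL pattern and the idea that `level7` is the highest resolution are written from memory and should be treated as assumptions, so check them against the Trove Data Guide before relying on them. The function names and the page identifier in the usage note are hypothetical.

```python
import shutil
import urllib.request

# Assumed URL pattern for page images; confirm it in the Trove Data Guide.
PAGE_IMAGE_URL = (
    "https://trove.nla.gov.au/ndp/imageservice/nla.news-page{page_id}/level{level}"
)


def page_image_url(page_id, level=7):
    """Build the download URL for a newspaper page image.

    level is the zoom level; 7 is assumed to be the highest resolution.
    """
    return PAGE_IMAGE_URL.format(page_id=page_id, level=level)


def save_page_image(page_id, path, level=7):
    """Download a page image and write it to path."""
    with urllib.request.urlopen(page_image_url(page_id, level)) as response:
        with open(path, "wb") as image_file:
            shutil.copyfileobj(response, image_file)
    return path
```

Calling `save_page_image(12345, "page.jpg")` would save the page with the (made-up) identifier 12345, where the identifier is the number from the page's URL in Trove.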