Tim Sherratt

Sharing recent updates and work-in-progress

Sep 2024

Where's 1920? Missing volume added to Tasmanian Post Office Directories!

Visualisation is a great way to find problems in your data.

As part of the Everyday Heritage project, I’m working with a team to document the lives of Tasmania’s Chinese residents in the 19th and early 20th centuries. We’re using a variety of sources such as Trove’s newspapers, the Tasmanian Names Index, and the Tasmanian Post Office Directories. To help with the research, I converted all the PDF volumes of the Post Office Directories into a public, online, searchable database. Or at least, I thought I had.

The Tasmanian Post Office Directories database embeds metadata about each line of text in its results, so it’s easy to save items of interest using Zotero. A member of our team has already saved hundreds of entries this way. The other day I started pulling these entries out of Zotero using its web API, and thought I’d get an overview by charting the number of results per year. It was then I noticed that 1920 was missing…
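The tallying step can be sketched like this. It's a minimal, hedged example: the item structure mimics what Zotero's web API returns (real items would come from a client such as pyzotero), and the sample records are invented for illustration.

```python
import re
from collections import Counter

def count_by_year(items):
    """Tally saved entries by the year in each item's 'date' field."""
    years = Counter()
    for item in items:
        match = re.search(r"\b(18|19|20)\d{2}\b", item.get("data", {}).get("date", ""))
        if match:
            years[int(match.group())] += 1
    return years

# Invented sample -- real items would be fetched from the Zotero web API
sample = [
    {"data": {"date": "1919"}},
    {"data": {"date": "1921"}},
    {"data": {"date": "1921"}},
]
print(sorted(count_by_year(sample).items()))
```

Charting these counts makes a missing year jump out: it appears as a gap in the sequence of keys.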

I checked the PDF volumes in Libraries Tasmania and the 1920 volume was there, so I worked back through my processing code to figure out why I’d missed it. It turns out the 1920 volume is named using a different pattern, and the regular expression I used to scrape the list of volumes was a little too specific. At least that was easy to rectify.
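The kind of bug involved can be illustrated like this. The filenames and patterns below are invented (the post doesn't show the real naming scheme): a regular expression anchored to one naming pattern silently skips a volume named differently, while a looser pattern catches both.

```python
import re

# Hypothetical filenames -- the real naming pattern isn't shown in the post
volumes = [
    "tasmanian-post-office-directory-1919.pdf",
    "post-office-directory-1920.pdf",   # the odd one out
    "tasmanian-post-office-directory-1921.pdf",
]

# Too specific: expects the full title, so the 1920 volume slips through
strict = re.compile(r"tasmanian-post-office-directory-(\d{4})\.pdf")
# Looser: only requires a directory-year pair
loose = re.compile(r"directory-(\d{4})\.pdf")

strict_years = [m.group(1) for v in volumes if (m := strict.search(v))]
loose_years = [m.group(1) for v in volumes if (m := loose.search(v))]
print(strict_years)  # 1920 missing
print(loose_years)   # all three volumes found
```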

However, it wasn’t just a matter of feeding the 1920 volume through my processing notebooks, because the content of the 1920 PDF was also quite different to all the other volumes. Most of the PDFs made available from Libraries Tasmania have a single page per image, and the images have been pre-processed for OCR. The PDFs also include searchable, OCRd text. Here’s an example of one of the images from 1921:

The images in the 1920 volume are double page spreads, in colour, without any pre-processing. The PDF doesn’t include any OCRd text, so it’s not searchable. The quality of the images is also quite variable: tight bindings mean some text is cut off, pages are sometimes skewed, and bad lighting casts shadows across the right-hand page – when converted to black and white for OCR, these shadows become black blobs that completely obscure the text. Here’s an example of one of the images from 1920:

All this meant I had to do a lot of additional processing of the images before I could extract useful text via OCR. Here’s a summary of the image pre-processing:

  • extracted all the images from the PDF
  • sliced all the images roughly in half and saved the results as separate files, so I had one image per page
  • manually cropped all 800 pages to get them as clean as possible
  • removed the shadows from the images using this recipe on Stack Overflow
  • binarized and deskewed the images ready for OCR

From then on I could apply the processes in my existing notebooks:

  • OCRd the images using Tesseract
  • uploaded the cropped, colour images to the AWS bucket for delivery using serverless-iiif
  • added the volume metadata and OCRd text to the Datasette database, generating a full-text index on the text column
  • updated the application on Google Cloud Run
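The database step can be sketched with SQLite's FTS5 extension, which is what Datasette's full-text search builds on. This is a minimal in-memory stand-in: the table and column names are assumptions, and the sample rows are invented.

```python
import sqlite3

# In-memory stand-in for the Datasette database; schema names are assumed
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pages (volume TEXT, page INTEGER, text TEXT)")
con.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("1920", 1, "Ah Chung, market gardener, Launceston"),
        ("1921", 1, "Smith, John, blacksmith, Hobart"),
    ],
)

# Full-text index on the text column, backed by the pages table
con.execute(
    "CREATE VIRTUAL TABLE pages_fts USING fts5(text, content='pages', content_rowid='rowid')"
)
con.execute("INSERT INTO pages_fts(rowid, text) SELECT rowid, text FROM pages")

rows = con.execute(
    "SELECT volume, page FROM pages WHERE rowid IN "
    "(SELECT rowid FROM pages_fts WHERE pages_fts MATCH ?)",
    ("gardener",),
).fetchall()
print(rows)
```

In practice a tool like sqlite-utils can set up the same index with a single `enable-fts` command rather than hand-written SQL.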

I also took the chance to tweak the theme a bit, including a new dark mode.

The updated database is live, now containing 49 volumes from 1890 to 1948 including 1920!

Try it now!

UPDATE 27 September 2024

It seems I was too focused on the gap in 1920 and missed some other missing volumes from the 1930s and 40s. I’ve started processing these but it’s going to take a fair while to work through them all. I’ll add each volume to the database as it’s finished. Check here for regular updates.