Visualisation is a great way to find problems in your data.
As part of the Everyday Heritage project, I’m working with a team to document the lives of Tasmania’s Chinese residents in the 19th and early 20th centuries. We’re using a variety of sources such as Trove’s newspapers, the Tasmanian Names Index, and the Tasmanian Post Office Directories. To help with the research, I converted all the PDF volumes of the Post Office Directories into a public, online, searchable database. Or at least, I thought I had.
The Tasmanian Post Office Directories database embeds metadata about each line of text in its results, so it’s easy to save items of interest using Zotero. A member of our team has already saved hundreds of entries this way. The other day I started pulling these entries out of Zotero using its web API, and thought I’d get an overview by charting the number of results per year. It was then I noticed that 1920 was missing…
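A simple count of results per year is often enough to make a gap like this jump out. Here's a minimal sketch of the idea — the year values are invented stand-ins for the real Zotero data, not actual results:

```python
from collections import Counter

# Hypothetical publication years extracted from saved Zotero entries
years = [1919, 1919, 1921, 1922, 1921, 1919, 1922]

counts = Counter(years)
# Fill the full range with zeros so empty years show up as gaps in a chart
full = {y: counts.get(y, 0) for y in range(min(years), max(years) + 1)}
missing = [y for y, n in full.items() if n == 0]
print(missing)  # years with no results at all
```

Feeding `full` to any charting library produces the kind of bar chart where an empty year is immediately visible.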
I checked the PDF volumes in Libraries Tasmania and the 1920 volume was there, so I worked back through my processing code to figure out why I’d missed it. It turns out the 1920 volume is named using a different pattern, and the regular expression I used to scrape the list of volumes was a little too specific. At least that was easy to rectify.
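An over-specific regex fails silently — it simply doesn't match, so nothing flags the missing volume. As an illustration (the filenames below are invented, not the real Libraries Tasmania names, and this is not the project's actual code):

```python
import re

# Hypothetical filenames: one volume uses a different naming pattern
filenames = [
    "post-office-directory-1919.pdf",
    "PO-directory-1920.pdf",          # the odd one out
    "post-office-directory-1921.pdf",
]

# Too specific: anchored to one naming pattern, silently skips 1920
strict = re.compile(r"post-office-directory-(\d{4})\.pdf")

# Looser: just look for a four-digit year anywhere in the name
loose = re.compile(r"(\d{4})\.pdf")

strict_years = [m.group(1) for f in filenames if (m := strict.search(f))]
loose_years = [m.group(1) for f in filenames if (m := loose.search(f))]
print(strict_years, loose_years)
```

The strict pattern quietly returns two years instead of three — exactly the kind of error a quick visualisation catches.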
However, it wasn’t just a matter of feeding the 1920 volume through my processing notebooks, because the content of the 1920 PDF was also quite different to all the other volumes. Most of the PDFs available from Libraries Tasmania have a single page per image, and the images have been pre-processed for OCR. The PDFs also include searchable, OCRd text. Here’s an example of one of the images from 1921:
The images in the 1920 volume are double page spreads, in colour, without any pre-processing. The PDF doesn’t include any OCRd text, so it’s not searchable. The quality of the images is also quite variable: tight bindings mean some text is cut off, pages are sometimes skewed, and bad lighting casts shadows across the right-hand page – when converted to black and white for OCR, these shadows become black blobs that completely obscure the text. Here’s an example of one of the images from 1920:
All this meant I had to do a lot of additional processing of the images before I could extract useful text via OCR. Here’s a summary of the image pre-processing:
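Two of the key steps – splitting each double-page spread into single pages, and binarising with a local rather than global threshold so shadows don't collapse into solid black – might be sketched like this, assuming greyscale NumPy arrays (an illustration of the general technique, not the project's actual code):

```python
import numpy as np

def split_spread(spread):
    """Split a double-page spread (H x W greyscale array) down the middle."""
    h, w = spread.shape
    return spread[:, : w // 2], spread[:, w // 2 :]

def adaptive_binarise(page, block=32, offset=10):
    """Binarise with a per-block threshold. A global threshold turns a
    shadowed page into a black blob; thresholding each small block against
    its own local mean keeps the text legible under uneven lighting."""
    h, w = page.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = page[y : y + block, x : x + block]
            t = tile.mean() - offset          # local threshold for this block
            out[y : y + block, x : x + block] = np.where(tile > t, 255, 0)
    return out

# Tiny synthetic example: a bright left half, a shadowed right half,
# with one dark "ink" pixel in each region
page = np.full((8, 8), 200.0)
page[:, 4:] = 80.0     # simulated shadow
page[2, 2] = 50.0      # ink on the bright page
page[5, 6] = 20.0      # ink inside the shadow
left, right = split_spread(page)
bw = adaptive_binarise(page, block=4, offset=10)
```

With a global threshold the whole shadowed half would go black; the local threshold keeps ink dark and background white in both regions.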
From then on I could apply the processes in my existing notebooks:
I also took the chance to tweak the theme a bit, including a new dark mode.
The updated database is live, now containing 49 volumes from 1890 to 1948 – including 1920!
It seems I was too focused on the gap in 1920 and missed some other missing volumes from the 1930s and 40s. I’ve started processing these but it’s going to take a fair while to work through them all. I’ll add each volume to the database as it’s finished. Check here for regular updates.