Tim Sherratt

Sharing recent updates and work-in-progress

Nov 2020

Harvest text from the Australian Women's Weekly!

The Trove Newspaper & Gazette Harvester has been updated to version 0.4.0. The major change is that if the OCRd text for an article isn’t available through the API, it will be automatically downloaded via the web interface. What does this mean in practice? Well, previously you couldn’t harvest OCRd text from the Australian Women’s Weekly because it’s not included in API results, but now you can!
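
The fallback described above can be sketched roughly like this. It's a minimal illustration, not the harvester's actual code: the `articleText` and `id` keys are assumptions about what an API record looks like, and `fetch_text_from_web` is a hypothetical stand-in for whatever downloads the text via the web interface.

```python
from typing import Callable, Optional


def get_ocr_text(
    article: dict,
    fetch_text_from_web: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return OCRd text for an article, preferring the API record."""
    # Assumption: the API record holds the text under 'articleText' when it's available.
    text = article.get("articleText")
    if text:
        return text
    # Otherwise fall back to the (hypothetical) web download, e.g. for AWW articles.
    return fetch_text_from_web(article["id"])
```

Passing the downloader in as a function keeps the sketch independent of any particular URL pattern or HTTP library.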

You don’t need to do anything differently. If there are AWW articles in your search, and you ask for all the OCRd text using the --text option, the AWW text files will automagically appear in your harvest.

Under the hood, I’ve started using html2text to remove tags from the OCRd text. I think this should produce more consistent results. As before, line breaks are removed from the OCRd text files by default. However, I’ve now added a --include_linebreaks option if you’d like to keep them. This generally produces text that is more human-readable, but note that the line breaks produced by OCR aren’t always accurate.
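
To give a feel for the difference between the two modes, here's a rough sketch using only Python's standard library. Note that it swaps in `html.parser` for html2text, so it won't match the harvester's output exactly; the `clean_ocr_text` function and its `include_linebreaks` parameter are hypothetical names that just mirror the command-line option.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, turning a few block-level tags into line breaks."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: paragraphs, divs, and <br> mark the OCR line breaks.
        if tag in ("p", "br", "div"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def clean_ocr_text(html: str, include_linebreaks: bool = False) -> str:
    """Strip tags from OCRd HTML, optionally keeping line breaks."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    if include_linebreaks:
        # Keep the breaks, but tidy whitespace within each line and drop empty lines.
        lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
        return "\n".join(line for line in lines if line)
    # Default: collapse all whitespace, including line breaks, into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_ocr_text("<p>The Weekly<br/>goes to press</p>")` gives `"The Weekly goes to press"`, while adding `include_linebreaks=True` gives the same words split over two lines.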

Head to the GLAM Workbench to try it out, or download the code from PyPI. #dhhacks