Tim Sherratt

Sharing recent updates and work-in-progress

Jan 2021

Finding non-English newspapers in Trove

There are a growing number of non-English newspapers in Trove, but how do you know what’s there? After trying a few different approaches, I generated a list of 48 newspapers with non-English content. The full details are in this notebook.

As the notebook describes, I found the language metadata for newspapers was incomplete, so I ran some language detection code on a sample of articles from every newspaper to try to find those with non-English content. But this had its own problems, such as the language detection code thinking that bad OCR looked like Maltese…
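To give a rough idea of the sampling approach, here is a minimal sketch in Python. It is not the code from the notebook: the function names, the `min_share` threshold, the newspaper titles, and the `toy_detect` stand-in are all hypothetical, and in practice you would pass in a real language detection function. The idea is to run a detector over a sample of articles for each newspaper, work out what share of the sample falls into each language, and only flag a title when a non-English language makes up a reasonable share, which might help filter out the odd article where bad OCR fools the detector.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional

# Any function that takes text and returns a language code (or None if unsure).
Detector = Callable[[str], Optional[str]]


def language_shares(texts: Iterable[str], detect: Detector) -> Dict[str, float]:
    """Return the share of sampled articles assigned to each language code."""
    counts: Counter = Counter()
    for text in texts:
        code = detect(text)
        if code is not None:  # skip articles the detector could not classify
            counts[code] += 1
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


def flag_non_english(
    samples: Dict[str, List[str]],
    detect: Detector,
    min_share: float = 0.1,  # assumed threshold, tune against your own data
) -> Dict[str, Dict[str, float]]:
    """Return newspapers where a non-English language reaches min_share of the sample."""
    flagged = {}
    for title, texts in samples.items():
        shares = language_shares(texts, detect)
        non_english = {
            code: share
            for code, share in shares.items()
            if code != "en" and share >= min_share
        }
        if non_english:
            flagged[title] = non_english
    return flagged


def toy_detect(text: str) -> Optional[str]:
    """Stand-in detector for illustration only; swap in a real library."""
    words = text.lower().split()
    if len(words) < 3:  # too little text to guess from
        return None
    if "und" in words or "der" in words:
        return "de"
    return "en"


# Made-up sample data: newspaper title -> list of sampled article texts.
samples = {
    "Hypothetical German Herald": [
        "Der Hund und die Katze",
        "Der Zug und das Haus",
        "The cat sat down",
    ],
    "Hypothetical English Times": [
        "The cat sat down",
        "A dog ran off home",
        "x y",
    ],
}

flagged = flag_non_english(samples, toy_detect)
print(flagged)
```

Raising `min_share` makes the list more conservative, but at the cost of missing newspapers that only occasionally publish non-English content, so the results still need checking by eye.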

Anyway, if you’re just searching using English keywords, you might not even be aware that these titles exist. It’s important to explore ways of making diversity visible within large digitised collections. #dhhacks