Tim Sherratt

Sharing recent updates and work-in-progress

Jan 2019

Adventures in stemming, or what happens when you search Trove for 'naturalization'

Fun fact — the Porter stemming algorithm treats the words ‘naturalisation’ and ‘naturalization’ differently. Naturalisation is stemmed to ‘naturalis’, naturalization to ‘natur’. You can try this yourself using this NLTK stemming demo.
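If you'd rather check this locally than use the demo, here's a minimal sketch using the PorterStemmer class from NLTK. It assumes the nltk package is installed, and the word list is just a selection for illustration. Trove's own stemmer may differ in detail from NLTK's implementation.

```python
from nltk.stem import PorterStemmer

# NLTK's implementation of the Porter stemming algorithm
stemmer = PorterStemmer()

# Illustrative word list: both spellings, a related form, and an unrelated word
words = ["naturalisation", "naturalization", "naturalized", "nature"]

# Map each word to its stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The 'z' spelling ends up with the same stem as 'nature', while the 's' spelling keeps a longer, more specific stem.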

Why does this matter? If you try searching for ‘naturalization’ in Trove you get almost 14 million results, most of which aren’t relevant because they’re matching words like ‘nature’.

Of course you can switch off stemming in Trove by using the text: modifier. A search for text:naturalization returns only 17,000 results. But now we’ll be missing related terms such as ‘naturalized’, so we have to explicitly include any forms of the word that we want.
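One way to avoid typing out every form by hand is to generate the query string. This is only a sketch: build_exact_query is a hypothetical helper, the list of forms is my own, and joining the terms with OR is an assumption about Trove's query syntax that you should test against the search interface.

```python
def build_exact_query(forms):
    """Prefix each word form with text: and join them into one query string.

    Joining with OR is an assumption about Trove's query syntax.
    """
    return " OR ".join(f"text:{form}" for form in forms)


# Illustrative list of forms; add or remove spellings as needed
forms = ["naturalization", "naturalized", "naturalisation", "naturalised"]

query = build_exact_query(forms)
print(query)
```

Paste the resulting string into the Trove search box, or adapt it for whatever tool you use to send queries.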

I’m not sure if this applies to other s/z spellings, but given that historical sources (in Australia at least) often use both forms, it’s something to keep in mind when you’re constructing your searches. #dhhacks