Sharing recent updates and work-in-progress
Fun fact — the Porter stemming algorithm treats the words ‘naturalisation’ and ‘naturalization’ differently. Naturalisation is stemmed to ‘naturalis’, naturalization to ‘natur’. You can try this yourself using this NLTK stemming demo.
Why does this matter? If you try searching for ‘naturalization’ in Trove you get almost 14 million results, most of which aren’t relevant because the stem ‘natur’ also matches words like ‘nature’.
Of course you can switch off stemming in Trove by using the text: modifier. A search for text:naturalization returns only 17,000 results. But now we’ll be missing related terms such as ‘naturalized’, so we have to explicitly include any forms of the word that we want.
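Here’s a minimal sketch of how you might generate those forms in Python rather than typing them out. The function names and the list of endings are just illustrations, and joining text: terms with OR is an assumption about how Trove combines them, so check the result against Trove’s search help before relying on it.

```python
from itertools import product


def sz_variants(prefix, endings):
    """Return prefix + s/z + ending for every ending, e.g. naturali + s + ation."""
    return [f"{prefix}{letter}{ending}" for ending, letter in product(endings, "sz")]


def exact_match_query(words):
    """Prefix each word with text: (stemming off) and join them into one query.

    Assumption: Trove accepts OR between text: terms.
    """
    return " OR ".join(f"text:{word}" for word in words)


# Illustrative endings only; add whichever forms matter for your search.
words = sz_variants("naturali", ["ation", "ed", "e"])
query = exact_match_query(words)
print(query)
```

This builds a single query covering ‘naturalisation’, ‘naturalization’, ‘naturalised’, ‘naturalized’, ‘naturalise’ and ‘naturalize’, each matched exactly rather than by stem.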
I’m not sure if this applies to other s/z spellings, but given that historical sources (in Australia at least) often use both forms, it’s something to keep in mind when you’re constructing your searches. #dhhacks