Tim Sherratt - Sharing recent updates and work-in-progress

21 Jan 2019

Just updated my harvest of metadata and full text from The Bulletin in @TroveAustralia. There’s about 2 GB of OCRd text from 4,534 issues (1880-1968). Full text for about 60 issues has been added since my last harvest. 111 have no OCRd text. Download it all from GitHub #dhhacks

20 Jan 2019

Fifty most common words occurring before the word ‘aliens’ in @TroveAustralia newspapers (213,000 articles)…
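For anyone wanting to try this sort of count on their own harvest, here is a minimal sketch. It assumes the article texts are already loaded as plain strings; the function name `words_before` and the simple regular expression are illustrative only and not necessarily how the list above was produced.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def words_before(texts: Iterable[str], target: str = "aliens",
                 top: int = 50) -> List[Tuple[str, int]]:
    """Count the word immediately preceding `target` across all texts.

    Hypothetical helper: only matches a word separated from the target
    by whitespace, so 'enemy, aliens' (with a comma) is not counted.
    """
    pattern = re.compile(r"\b([a-z']+)\s+" + re.escape(target) + r"\b")
    counts: Counter = Counter()
    for text in texts:
        # Lowercase so 'Enemy aliens' and 'enemy aliens' are counted together
        counts.update(pattern.findall(text.lower()))
    return counts.most_common(top)


sample = ["The enemy aliens were interned.", "Coloured aliens and enemy aliens."]
print(words_before(sample))
```

No stop words are filtered here, so on real OCRd text words like ‘the’ and ‘of’ are likely to sit near the top of the list.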

19 Jan 2019

You want big data? I just harvested 213,340 newspaper articles (including full OCRd text) from @TroveAustralia in 82 minutes, at about 40 articles a second. https://mybinder.org/v2/gh/GLAM-Workbench/trove-newspaper-harvester/master?urlpath=%2Fapps%2Fnewspaper_harvester_app.ipynb
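A quick check of that rate, using only the figures quoted above (the function name is just for illustration):

```python
def articles_per_second(articles: int, minutes: float) -> float:
    """Average harvest rate in articles per second."""
    return articles / (minutes * 60)


rate = articles_per_second(213_340, 82)
print(f"{rate:.1f} articles per second")
```

That works out to a little over 43 articles a second averaged across the whole harvest.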

19 Jan 2019

So now I’ve updated TroveHarvester and built a new interface I can get back to the task I wanted the TroveHarvester for a couple of days ago — harvesting all references to ‘aliens’ in newspapers… #yakshaving

19 Jan 2019

Want an easy way to download @TroveAustralia newspaper articles in bulk? No installation? Point and click? I’ve created a simple web app version of my TroveHarvester using a Jupyter notebook & running on @mybinderteam. Try it live! #dhhacks

19 Jan 2019

And version 0.2.2 of TroveHarvester quickly follows 0.2.1 as I squash a bug when downloading PDFs… Also managed to get the README displaying properly on PyPI. pypi.org/project/t…

18 Jan 2019

TroveHarvester 0.2.1: updated to work with version 2 of the @TroveAustralia API. Now on PyPI! More details shortly…

18 Jan 2019

Ok, that’s more like it. Full text and metadata of 29,203 newspaper articles harvested using the @TroveAustralia API in under 10 minutes. Testing nearly done…

18 Jan 2019

Ah ok, I forgot about the new ‘bulkHarvest’ parameter in the @TroveAustralia API. Setting that to ‘true’ seems to make all the difference…
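As a rough illustration of what that looks like in a request, here is a minimal sketch of a paged harvest loop with `bulkHarvest` set to `true`. This is not the TroveHarvester code. The parameter names and the response layout (`response` → `zone` → `records` → `article` / `nextStart`) are my understanding of version 2 of the API and should be checked against the API documentation; `fetch` is a hypothetical callable you supply to make the HTTP request.

```python
from typing import Callable, Dict, Iterator

API_URL = "https://api.trove.nla.gov.au/v2/result"


def build_params(query: str, api_key: str, start: str = "*") -> Dict[str, str]:
    """Query parameters for one page of newspaper results (assumed v2 names)."""
    return {
        "q": query,
        "zone": "newspaper",
        "encoding": "json",
        "n": "100",                # results per page
        "s": start,                # '*' for the first page, then nextStart
        "bulkHarvest": "true",     # the parameter mentioned above
        "include": "articleText",  # ask for the OCRd text as well
        "key": api_key,
    }


def harvest(query: str, api_key: str,
            fetch: Callable[[str, Dict[str, str]], dict]) -> Iterator[dict]:
    """Yield article records page by page until no nextStart is returned."""
    start = "*"
    while start is not None:
        data = fetch(API_URL, build_params(query, api_key, start))
        # Assumed response layout; adjust if the API returns something else
        records = data["response"]["zone"][0]["records"]
        yield from records.get("article", [])
        start = records.get("nextStart")
```

With the `requests` library installed, `fetch` could be something like `lambda url, params: requests.get(url, params=params).json()`. Passing the request function in keeps the paging logic easy to test without hitting the API.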

18 Jan 2019

Uh, never come across one of these before from the @TroveAustralia API. Needless to say it causes the Newspaper Harvester to die.

It’s easy to check for these things once you know they exist, but…

18 Jan 2019

Testing the updated Trove Newspaper Harvester…

Ran into a problem with the @TroveAustralia API not returning the complete result set in large harvests. Trying to figure out why…

18 Jan 2019

Thanks to the @TroveAustralia API upgrade, the new version of the Trove Newspaper Harvester should be a lot faster. For harvests with full text (but not PDFs, which slow things down a lot) I’m getting 40-50 articles a second.

18 Jan 2019

Since I’m updating the Trove Newspaper Harvester to work with version 2 of the @TroveAustralia API, I thought I might as well fix up a few other things too…

Now with added progress bars!

17 Jan 2019

I’m enjoying using micro.blog as a way of capturing what I’m working on: updates.timsherratt.org

Just need to get the GitHub mirror site working…

17 Jan 2019

Finally biting the bullet and getting to work on updating the TroveHarvester to work with version 2 of the API…

17 Jan 2019

That’s cool, just realised I can easily share live versions of Altair charts from Jupyter notebooks using Vega. Here’s the complete ‘aliens’ chart.

17 Jan 2019

And also “coloured alien” which, not surprisingly, peaks in 1901 when the Immigration Restriction Act is passed…

17 Jan 2019

Exploring some of the adjectives attached to ‘alien’ in @TroveAustralia newspapers…

You can create these sorts of comparisons yourself using this app. #dhhacks

17 Jan 2019

Just to emphasise my point the other day about the impact of stemming on searches for naturalisation/naturalization in @TroveAustralia. Compare these — the stemming on/off results for ‘naturalisation’ are pretty much in proportion, but not for ‘naturalization’…

17 Jan 2019

Nothing like browsing the databases of another country’s national/state archives to make you realise how useful the series system is…

16 Jan 2019

The Australian version of ‘Who’s responsible?’ is up! Just select a government function and explore the different agencies associated with it over time. It’s built with data from @naagovau’s RecordSearch. Try it live! #dhhacks

16 Jan 2019

New notebook added to the #GLAMWorkbench RecordSearch repository — get the basic details of agencies associated with all government functions used in @naagovau’s RecordSearch and save to a single JSON data file. View code and data. #dhhacks

16 Jan 2019

Hmm, wondering why the ‘National Council of Women of the Australian Capital Territory’ is assigned the function ‘CITIZENSHIP’ in @naagovau’s RecordSearch…

15 Jan 2019

As well as cross-posting updates to Twitter and Mastodon, I’ve now set up IFTTT to keep an eye on my micro.blog feed and post anything with the hashtag #dhhacks to my 101 DH Hacks FB page!

15 Jan 2019

Adventures in stemming, or what happens when you search Trove for 'naturalization'

Fun fact — the Porter stemming algorithm treats the words ‘naturalisation’ and ‘naturalization’ differently. Naturalisation is stemmed to ‘naturalis’, naturalization to ‘natur’. You can try this yourself using this NLTK stemming demo. Why d...