Now that I’ve updated TroveHarvester and built a new interface, I can get back to the task I wanted the TroveHarvester for a couple of days ago: harvesting all references to ‘aliens’ in newspapers… #yakshaving
Want an easy way to download @TroveAustralia newspaper articles in bulk? No installation? Point and click? I’ve created a simple web app version of my TroveHarvester using a Jupyter notebook & running on @mybinderteam. Try it live! #dhhacks
And version 0.2.2 of TroveHarvester quickly follows 0.2.1 as I squash a bug when downloading PDFs… Also managed to get the README displaying properly on PyPI. pypi.org/project/t…
TroveHarvester 0.2.1: updated to work with version 2 of the @TroveAustralia API. Now on PyPI! More details shortly…
Ok, that’s more like it. Full text and metadata of 29,203 newspaper articles harvested using the @TroveAustralia API in under 10 minutes. Testing nearly done…
Ah ok, I forgot about the new ‘bulkHarvest’ parameter in the @TroveAustralia API. Setting that to ‘true’ seems to make all the difference…
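For anyone wondering what that looks like in a request, here is a minimal sketch that builds a search URL with ‘bulkHarvest’ switched on. Only the ‘bulkHarvest’ parameter comes from the post above; the base URL, the other parameter names, and the `build_harvest_url` helper are assumptions for illustration and should be checked against the API documentation. This is not the TroveHarvester's own code.

```python
from urllib.parse import urlencode

# Assumed base URL for version 2 of the Trove API (check the official docs).
API_URL = "https://api.trove.nla.gov.au/v2/result"


def build_harvest_url(query, api_key, start="*"):
    """Build a newspaper search URL with bulkHarvest set to 'true'.

    `build_harvest_url` is a hypothetical helper, not part of TroveHarvester.
    """
    params = {
        "q": query,
        "zone": "newspaper",    # assumed parameter name and value
        "encoding": "json",     # assumed parameter name and value
        "bulkHarvest": "true",  # the parameter mentioned in the post
        "s": start,             # assumed paging cursor parameter
        "key": api_key,         # assumed parameter name for the API key
    }
    return f"{API_URL}?{urlencode(params)}"


url = build_harvest_url("aliens", "YOUR_API_KEY")
```

The function only assembles the URL, so it can be inspected before any request is sent.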
Uh, never come across one of these before from the @TroveAustralia API. Needless to say it causes the Newspaper Harvester to die.
It’s easy to check for these things once you know they exist, but…
Testing the updated Trove Newspaper Harvester…
I’ve run into a problem with the @TroveAustralia API not returning the complete result set in large harvests, trying to figure out why…
Thanks to the @TroveAustralia API upgrade, the new version of the Trove Newspaper Harvester should be a lot faster. For harvests with full text (but not PDFs which slow things down a lot) I’m getting 40-50 articles a second.
Since I’m updating the Trove Newspaper Harvester to work with version 2 of the @TroveAustralia API, I thought I might as well fix up a few other things too…
Now with added progress bars!
I’m enjoying using micro.blog as a way of capturing what I’m working on: updates.timsherratt.org
Just need to get the GitHub mirror site working…
Finally biting the bullet and getting to work on updating the TroveHarvester to work with version 2 of the API…
That’s cool: I just realised I can easily share live versions of Altair charts from Jupyter notebooks using Vega. Here’s the complete ‘aliens’ chart.
And also “coloured alien” which, not surprisingly, peaks in 1901 when the Immigration Restriction Act is passed…
Exploring some of the adjectives attached to ‘alien’ in @TroveAustralia newspapers…
You can create these sorts of comparisons yourself using this app. #dhhacks
Just to emphasise my point the other day about the impact of stemming on searches for naturalisation/naturalization in @TroveAustralia. Compare these — the stemming on/off results for ‘naturalisation’ are pretty much in proportion, but not for ‘naturalization’…
Nothing like browsing the databases of another country’s national/state archives to make you realise how useful the series system is…
The Australian version of ‘Who’s responsible?’ is up! Just select a government function and explore the different agencies associated with it over time. It’s built with data from @naagovau’s RecordSearch. Try it live! #dhhacks
New notebook added to the #GLAMWorkbench RecordSearch repository — get the basic details of agencies associated with all government functions used in @naagovau’s RecordSearch and save to a single JSON data file. View code and data. #dhhacks
Hmm, wondering why the ‘National Council of Women of the Australian Capital Territory’ is assigned the function ‘CITIZENSHIP’ in @naagovau’s RecordSearch…