Perhaps appropriately for #DayofDH2022, I’ve spent most of the morning trying to hunt down a bug. Everything works perfectly running locally, but fails mysteriously when deployed to the cloud…
Micro.blog offers another alternative for people wanting more control over their socials. I’m posting this from my Micro account – it’ll be cross-posted to Mastodon & Twitter, saved to GitHub, syndicated via RSS, and accessible from updates.timsherratt.org
I’ve been doing a bit of cleaning up, trying to make some old datasets more easily available. In particular I’ve been pulling together harvests of the number of newspaper articles in Trove by year and state. My first harvests date all the way back to 2011, before there was even a Trove API. Unfortunately, I didn’t run the harvests as often as I should’ve and there are some big gaps. Nonetheless, if you’re interested in how Trove’s newspaper corpus has grown and changed over time, you might find them useful. They’re available in this repository and also in Zenodo.
This chart shows how the number of newspaper articles per year in Trove has changed from 2011 to 2022. Note the rapid growth between 2011 and 2015.
To try and make sure that there’s a more consistent record from now on, I’ve also created a new Git Scraper – a GitHub repository that automatically harvests and saves data at weekly intervals. As well as the number of articles by year and state, it also harvests the number of articles by newspaper and category. As mentioned, these four datasets are updated weekly. If you want to get all the changes over time, you can retrieve earlier versions from the repository’s commit history.
All the datasets are CC-0 licensed and validated with Frictionless.
There’s also a notebook in the GLAM Workbench that explores this sort of data.
So my FOI request to release the scoping studies that informed investments in the current round of ARDC-managed HASS research infrastructure development was partially successful. As I’ve previously noted, reports from the ARDC and Academy of Humanities are now publicly available. There was a third document identified as relevant to my request that was finally released yesterday, but it isn’t what I was expecting. It’s a report of consultation relating to a discussion paper by Dandalo Partners, rather than the actual discussion paper itself! Clearly I spoke the wrong magic words, and should have asked for ‘discussion papers’ as well as ‘scoping studies’. FOI can be a bit of a fishing expedition. I’m considering whether I want to start the whole FOI process again by asking for the final version of the discussion paper, but in the meantime, I’ve put the currently-available documents in a Dropbox folder. Enjoy!
Thanks again to those who helped me cover the FOI costs!
Over the past few months I’ve been doing a lot of behind-the-scenes work on the GLAM Workbench – automating, standardising, and documenting processes for developing and managing repositories. These sorts of things ease the maintenance burden on me and help make the GLAM Workbench sustainable, even as it continues to grow. But these changes are also aimed at making it easier for you to contribute to the GLAM Workbench!
Perhaps you’re part of a GLAM organisation that wants to help researchers explore its collection data – why not create your own section of the GLAM Workbench? It would be a great opportunity for staff to develop their own digital skills and learn about the possibilities of Jupyter notebooks. I’ve developed a repository template and some detailed documentation to get you started. The repository template includes everything you need to create and test notebooks, as well as built-in integration with Binder, Docker, Reclaim Cloud, and Zenodo. And, of course, I’ll be around to help you through the process.
Or perhaps you’re a researcher who wants to share some code you’ve developed that extends or improves an existing GLAM Workbench repository. Yes please! Or maybe you’re a GLAM Workbench user who has something to add to one of the lists of resources; or you’ve noticed a problem with some of the documentation that you’d like to fix. All contributions welcome!
The Get involved! page includes links to all this information, as well as some other possibilities such as becoming a sponsor, or sharing news. And to recognise those who make a contribution to the code or documentation there’s also a brand new contributors page.
I’m looking forward to exploring how we can build the GLAM Workbench together. #dhhacks
Over the last couple of years I've been fiddling with bits of Python code to work with the Omeka S REST API. The Omeka S API is powerful, but the documentation is patchy, and doing basic things like uploading images can seem quite confusing. My code was an attempt to simplify common tasks, like creating new items.
In case it's of use to others, I've now shared my code as a Python package. So you can just `pip install omeka-s-tools` to get started. The code helps you:
There's quite detailed documentation available, including an example of adding a newspaper article from Trove to Omeka. If you want to see the code in action, there's also a notebook in the Trove newspapers section of the GLAM Workbench that uploads newspaper articles (including images and OCRd text) to Omeka from a variety of sources, including Trove searches, Trove lists, and Zotero libraries.
If you find any problems, or would like additional features, feel free to create an issue in the GitHub repository. #dhhacks
Last year I started compiling information about the level of Zotero integration provided by Australian GLAM organisations through their online collections. The basic test is: can Zotero capture useful, structured information about an item from the collection interface? The results are not great.
Zotero extracts information from a web site using a variety of 'translators'. Some of these translators look for generic information embedded in a web page, such as <TITLE> and <META> tags. Others are written to work with widely-used software systems, such as library catalogues. For bespoke systems, custom translators can be created to extract the necessary data. But sometimes, the data just isn't available in a form Zotero can access.
So in the results spreadsheet you'll notice that the National Archives of Australia's RecordSearch database works with Zotero. This is because I created a custom translator for it many years ago. There's also a custom translator for Trove's digitised newspapers, but the translator for other parts of Trove was broken by the site's upgrade in 2020.
A number of the results note that Zotero support is 'limited'. This usually means that Zotero can get a title and a url, and maybe a description. This basic metadata is often embedded in pages for use by social media sites like Facebook and Twitter, but Zotero can also grab it. The very least you might expect from a GLAM organisation is to make sure this sort of basic metadata is embedded in every page, but as you can see by all the red in the results table, even this is missing from many online collections. Libraries used to champion the use of Dublin Core metadata in web pages, and Australian government agencies were expected to make use of AGLS to aid web discovery. But it seems these are no longer priorities.
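To give a rough idea of what that 'limited' capture involves, here's a minimal sketch in Python that pulls the `<title>` and any named `<meta>` tags out of a page. It is only an illustration of the kind of embedded metadata involved, not how Zotero's translators are actually written, and the sample page and its tag values are invented.

```python
from html.parser import HTMLParser


class BasicMetadataParser(HTMLParser):
    """Collect the <title> text and any named <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Open Graph tags use "property"; Dublin Core and others use "name".
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_basic_metadata(html):
    """Return the page title and a dict of meta tag names to content."""
    parser = BasicMetadataParser()
    parser.feed(html)
    return {"title": parser.title.strip(), "meta": parser.meta}


# Hypothetical page, invented for illustration only.
sample_page = """
<html><head>
<title>Example collection item</title>
<meta property="og:title" content="Example collection item">
<meta name="DC.creator" content="Example Creator">
<meta name="description" content="A made-up record used for testing.">
</head><body></body></html>
"""

metadata = extract_basic_metadata(sample_page)
```

If a collection page returns an empty `meta` dict here, a generic tool has very little to work with beyond the page title.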
The good news is we can fix this, either by supporting and encouraging GLAM organisations to embed useful metadata, or by creating custom translators. There's been a lot of talk about HASS research infrastructure over the last six months, and it seems to me that some integration and cooperation in this area would make a big difference to the way researchers work with GLAM collections.
There's an online form available if you'd like to contribute new results to the spreadsheet, or update existing details.
I regularly update the Python packages used in the different sections of the GLAM Workbench; though probably not as often as I should. Part of the problem is that once I've updated the packages, I have to run all the notebooks to make sure I haven't inadvertently broken something -- and this takes time. And in those cases where the notebooks need an API key to run, I have to copy and paste the key in at the appropriate spots, then remember to delete them afterwards. They're little niggles, but they add up, particularly as the GLAM Workbench itself expands.
I've been looking around at Jupyter notebook automated testing options for a while. There's nbmake, testbook, and nbval, as well as custom solutions involving things like papermill and nbconvert. After much wavering, I finally decided to give `nbval` a go. The thing that I like about `nbval` is that I can start simple, then increase the complexity of my testing as required. The `--nbval-lax` option just checks to make sure that all the cells in a notebook run without generating exceptions. You can also tag individual cells that you want to exclude from testing. This gives me a testing baseline -- this notebook runs without errors -- it might not do exactly what I think it's doing, but at least it's not exploding in flames. Working from this baseline, I can start tagging individual cells where I want the output of the cell to be checked. This will let me test whether a cell is doing what it's supposed to.
This approach means that I can start testing without making major changes to existing notebooks. The main thing I had to think about is how to handle API keys or other variables which are manually set by users. I decided the easiest approach was to store them in a `.env` file and use dotenv to load them within the notebook. This also makes it easy for users to save their own credentials and use them across multiple notebooks -- no more cutting and pasting of keys! Some notebooks are designed to run as web apps using Voila, so they expect human interaction. In these cases, I added extra cells that only run in the testing environment -- they populate the necessary fields and simulate button clicks to start.
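As a rough illustration of the `.env` approach, here's a small sketch that reads `KEY=VALUE` lines from a file into an environment mapping. It's a simplified, standard-library stand-in for what the dotenv package does, not the code used in the notebooks, and the `TROVE_API_KEY` variable name is just an assumption for the example.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path, environ=None):
    """Read KEY=VALUE lines from a .env-style file into an environment mapping.

    Blank lines and comments are ignored, and values that are already set
    are left alone rather than overwritten.
    """
    environ = os.environ if environ is None else environ
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Strip surrounding whitespace and optional quotes from the value.
        environ.setdefault(key.strip(), value.strip().strip("'\""))
    return environ


# Demonstration with a throwaway file and a plain dict, so no real
# credentials or environment variables are touched.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# credentials for testing\nTROVE_API_KEY=not-a-real-key\n",
        encoding="utf-8",
    )
    settings = load_env_file(env_path, environ={})

api_key = settings.get("TROVE_API_KEY")
```

The point of the design is that the key lives in one file outside the notebook, so nothing needs to be pasted in (or deleted) before a test run.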
While I was in a QA frame of mind, I also started playing with nbqa -- a framework for all sorts of code formatting, linting, and checking tools. I decided I'd try to standardise the formatting of my notebook code by running isort, black, and flake8. As well as making the code cleaner and more readable, they pick up things like unused imports or variables. To further automate this process, I configured the `nbqa` checks to run when I try to commit any notebook code changes using `git`. This was made easy by the pre-commit package.
This is all set up and running in the Trove newspapers repository -- you can see the changes here. Now if I update the Python packages or make any other changes to the repository, I can just run `pytest --nbval-lax` to test every notebook at once. And if I make changes to an individual notebook, `nbqa` will automatically give the changes a code quality check before I save them to the repository. I'm planning to roll these changes out across the whole of the GLAM Workbench in coming months.
Developments like these are not very exciting for users, but they're important for the management and sustainability of the GLAM Workbench, and help create a solid foundation for future development and collaboration. Last year I created a GLAM Workbench repository template to help people or organisations thinking about contributing new sections. I can now add these testing and QA steps into the template to further share and standardise the work of developing the GLAM Workbench.
One of the things I really like about Jupyter is the fact that I can share notebooks in a variety of different formats. Tools like QueryPic can run as simple web apps using Voila, static versions of notebooks can be viewed using NBViewer, and live versions can be spun up as required on Binder. It’s also possible to export notebooks as PDFs, slideshows, or just plain-old HTML pages. Just recently I realised I could export notebooks to HTML using the same template I use for Voila. This gives me another way of sharing – static web pages delivered via the main GLAM Workbench site.
Here’s a couple of examples:
The video of my key story presentation at ResBaz Queensland (simulcast via ResBaz Sydney) is now available on Vimeo. In it, I explore some of the possibilities of GLAM data by retracing my own journey through WWI service records, The Real Face of White Australia, #redactionart, and Trove – ending up at the GLAM Workbench, which brings together a lot of my tools and resources in a form that anyone can use. The slides are also available, and there’s an archived version of everything in Zenodo.
This and many other presentations about the GLAM Workbench are listed here. It seems I’ve given at least 11 talks and workshops this year! #dhhacks
The newly-updated DigitalNZ and Te Papa sections of the GLAM Workbench have been added to the list of available repositories in the Nectar Research Cloud’s GLAM Workbench Application. This means you can create your very own version of these repositories running in the Nectar Cloud, simply by choosing them from the app’s dropdown list. See the Using Nectar help page for more information.
I’ve also taken the opportunity to make use of the new container registry service developed by the ARDC as part of the ARCOS project. The app now pulls the GLAM Workbench Docker images from Quay.io via the container registry’s cache. This means that copies of the images are cached locally, speeding things up and saving on data transfers. Yay for integration!
Thanks again to Andy and the Nectar Cloud staff for their help! #dhhacks
In preparation for my talk at ResBaz Aotearoa, I updated the DigitalNZ and Te Papa sections of the GLAM Workbench. Most of the changes are related to management, maintenance, and integration of the repositories. Things like:
`index.ipynb` file based on `README.md` to act as a front page within Jupyter Lab
`reclaim-manifest.jps` file to allow for one-click installation of the repository in Reclaim Cloud
`README.md` with instructions on how to run the repository via Binder, Reclaim Cloud, Nectar Research Cloud, and Docker Desktop.
`.zenodo.json` metadata file so that new releases are preserved in Zenodo
`requirements.txt` files, and the include unpinned requirements in
From the user’s point of view, the main benefit of these changes is the ability to run the repositories in a variety of different environments depending on your needs and skills. The Docker images, generated using repo2Docker, are used by Binder, Reclaim Cloud, Nectar, and Docker Desktop. Same image, multiple environments! See ‘Run these notebooks’ in the DigitalNZ and Te Papa sections of the GLAM Workbench for more information.
Of course, I’ve also re-run all of the notebooks to make sure everything works and to update any statistics, visualisations, and datasets. As a bonus, there’s a couple of new notebooks in the DigitalNZ repository:
I’m hoping that the GLAM Workbench will encourage GLAM organisations and GLAM data nerds (like me) to create their own Jupyter notebooks. If they do, they can put a link to them in the list of GLAM Jupyter resources. But what if they want to add the notebooks to the GLAM Workbench itself?
To make this easier, I’ve been working on a template repository for the GLAM Workbench. It generates a new skeleton repository with all the files you need to develop and manage your own section of the GLAM Workbench. It uses GitHub’s built-in templating feature, together with Cookiecutter, and this GitHub Action by Stefan Buck. Stefan has also written a very helpful blog post.
The new repository is configured to do various things automatically, such as generate and save Docker images, and integrate with Reclaim Cloud and Zenodo. Lurking inside the `dev` folder of each new repository, you’ll find some basic details on how to set up and manage your development environment.
This is just the first step. There’s more documentation to come, but you’re very welcome to try it out. And, of course, if you are interested in contributing to the development of the GLAM Workbench, let me know and I’ll help get you set up!
Previously on ‘What could we do with $2.3 million?’, the National Library of Australia produced a draft plan for an ‘Advanced Researcher Platform’ that was thoroughly inadequate. Rather than submit this plan to the ARDC for consideration as part of the HASS RDC process, the NLA wisely decided to make some fundamental changes. The redrafted draft is now available for re-feedback. This is where we pick up the story…
Generally speaking there seem to be two major changes.
These two changes can work together positively. The details of what’s being developed can be worked out through consultation, ensuring that what’s developed meets real research needs. But this assumes first that the consultation process is effective, and second that the overall scope of the project is appropriate. I’m not convinced on either of these two fronts.
TL;DR – it’s better than it was, but the hand waving around collaboration and integration isn’t convincing, the scope needs rethinking, and I still can’t see how what’s proposed provides good value for money.
Given that there was almost no space for consultation in the previous draft, this version could only do better. There’s certainly lots of talk about consulting with the HASS community, and a new governance structure that includes researcher representatives, but what will the consultation process actually deliver?
The NLA is now partnering with the ANU (why the ANU? because it’s closest?), and the ANU will apparently be driving the consultation process. The whole process is complex and weirdly hierarchical. Any HASS researcher can participate in the initial rounds, but their numbers dwindle as you move up the hierarchy, until you reach the Project Board where only researchers with specific institutional affiliations are admitted. I’m imagining a Hunger Games scenario…
The process aims to gather feedback on ‘what developments would offer the best assistance to the majority and be feasibly achieved within the project timeframe’. That sounds ok, but earlier in the document the outcome of the consultation process is described in the following way:
The outcome will be expressed as one (or potentially two) goals articulated in high level requirements document(s) and align to the objectives as described in this project plan. Indicative goals might include: a graph visualising occurrence of keywords over time; an interface visualising place of publication as geospatial data on a map; or, or a concordance for exploring textual data such as with key word in context.
So all of this complex, hierarchical consultative structure is just to decide whether we have a line graph or a map (or both if we’re really lucky)? If you look at the project deliverables, it seems that the researcher feedback funnels into Deliverable 6 in Work Package 3 – ‘Analysing and visualising research data’. But what about the rest of the project? Will researcher feedback have any role in determining how datasets are created, for example?
I suppose even this very limited consultation is better than what was previously proposed (ie. nothing). But what then happens to the feedback from researchers? An Advisory Panel will be selected (by whom?) to collate the ideas and produce the high-level requirements. Detailed requirements will be generated from the high-level requirements (don’t you just love project management speak?), and then subjected to the argy bargy of the MoSCoW process where priorities will be set. It’s likely that these priorities will be whittled down further as development proceeds. These are crucial decision-making stages where important ideas can be relegated to the ‘nice to have’ category and never heard of again. It’s not clear from the plan who is involved in this, and where the final decision-making power lies.
Of course, some of these details can be worked out later. But given that the big sell of this version of the plan is the expanded consultative process, I think it’s important to know where the power really lies. What role will researchers actually play in determining the outcomes of the project? This is not at all clear.
But what is the project? In general terms it hasn’t really changed. There will be some sort of portal where researchers can create and share datasets and visualisations. Crucially, it’s assumed that this portal will be part of Trove itself. As noted the last time round, the original project description provided by the government made no such assumption. It was focused on ‘the delivery of researcher portals accessible through Trove’. The NLA has interpreted ‘through’ to mean ‘as part of’, and given the limits on the consultative process described above it seems this won’t change.
Or will it? I’m still puzzling over a few sections in the plan that talk about looking beyond the NLA to see whether there are existing options to meet user requirements. Deliverable 2 in Work Package 2 will:
Undertake an environmental scan for current research usage and tools such as (Glam Workbench) and a market scan to determine if these gaps can be filled by existing services that the HASS community and/or Trove support.
What’s with the weird brackets around ‘Glam Workbench’? Makes me think it was a last minute addition. I suppose I should be grateful that the NLA wants to spend some money to confirm that the GLAM Workbench actually exists. But then what? The next deliverable will:
determine which requirements will be delivered within the Trove Platform and which will be outsourced to other services.
So if the Trove Newspaper Harvester, for example, meets one of the requirements, will Trove simply link to it? Imagine that, Trove actually linking to one of the dozens of Trove tools and resources provided by the GLAM Workbench. Oh frabjous day! But then does the NLA still get the money to develop the thing that I’ve already developed, or will they share some of the project money with me (yeah right)? I really have no idea what’s envisaged here. How will the ‘solution architecture’ integrate existing tools and services? And what does that mean for the resourcing of the project as a whole?
Elsewhere the plan talks about services ‘dedicated specifically to Trove collections and/or to the Australian research community’ that could be ‘“plugged in” to the platform ecosystem’. That sounds hopeful, but if the platform is an ecosystem of tools and services from the NLA and beyond, then that changes the scope of the project completely. Why not start with that? Start with the idea of developing an ecosystem of tools and services making use of Trove data, rather than just developing a new Trove interface. Then we could work together to build something really useful.
It just seems that the scope of the project as a whole hasn’t been properly thought through. The original plan has been expanded in vague, hand wavy directions, without thinking through the implications and possibilities of that expansion. Tinkering around the edges isn’t enough; the nature of this project needs to be completely rethought.
Rather than have an open call for project funding, the HASS RDC process has focused instead on making strategic investments. But where’s the strategy? The current projects were identified through a number of scoping studies undertaken by the Department of Education, Skills, and Employment. But these scoping studies haven’t been publicly released, so we don’t really know why these projects were recommended for funding. Is giving the NLA buckets of money to develop a new interface really what was envisaged? Surely if you were thinking strategically, you’d be considering ways in which the rich data asset represented by Trove could be opened to new research uses. You’d look around at existing tools and resources and think about how they could be leveraged. You’d examine limitations in the delivery of Trove data, and think about what sort of plumbing was needed to connect up new and existing projects. So how did we end up here?
Perhaps we just need to take a step back and recognise that just because Trove provides the data, doesn’t mean it should direct the project. There needs to be another layer of strategic planning which identifies the necessary components and directs resources accordingly. As I noted before, there’s plenty of ways in which Trove’s data and APIs could be improved. Give them money to do that. But should they be building tools for researchers to use that data? Nope. Absolutely not.
I attended the eResearch Australasia Conference recently, and was really impressed with all the activity around research software development. If the tool building component of this project was opened up, it could provide a really useful focus for developing collaboration across the research software community and building capacities and understanding in HASS. It would also encourage greater reuse and integration. This would seem to be a much more strategic intervention.
I’m not going to go through the plan in detail again. I’ve already spent a couple of weeks engaged in, or worrying about, the HASS RDC process. I’m tired, and I’m frustrated, and I can’t shake the depressing thought that the NLA will end up being rewarded for its bad behaviour. Many of my comments on the earlier draft still apply, particularly those around the API and the development of pathways for researchers.
It’s worth noting, however, that the ‘sustainability’ section of the plan has disappeared completely – perhaps not surprisingly, as the only suggestion last time was for someone to give them more money. There are gestures towards integration, such as including representatives of the other HASS RDC projects on the Trove Project Board. But real integration would happen through technical interchange, not governance structures, and there’s no plan for that.
I’m also a bit confused about the role of ANU. It seems to be mostly focused on consultation, but then there are statements like:
The development phase will be completed as a collaboration between the ANU and NLA, with both institutions working on the development of their own systems to align the with the product goals and Trove.
What ANU systems are we talking about here? And why are they part of the project?
A couple of objectives were also added to the plan:
On the issue of community relationships, a few people in the previous round of feedback indicated to me that they didn’t want to criticise the NLA too harshly because they might want to work with them in the future. That’s not healthy community building.
After two attempts, the NLA has still not delivered a coherent project plan that demonstrates real value to the HASS sector and meets the ARDC’s assessment criteria. I think the project needs to be radically rethought, and leadership sought from outside the NLA to ensure that the available funding is used effectively.
I love Trove. I recognise the way it has transformed research, and was honoured to play a small part in its history. It should be appropriately funded. But it shouldn’t be funded to do everything.
In the end, we could do so much better, and so much more…
You can provide your own feedback on the new draft plan. There’ll be a roundtable event on 10 November when you can ask questions of the project participants. You can also submit your feedback by 17 November using the form at the bottom of this page. You might also want to remind yourself of the ARDC’s evaluation criteria.
I don’t think I’ll attend the roundtable, as this whole process has taken a bit of a toll, but I encourage you to do so. The more voices the better.
Here’s the final feedback that I submitted to the ARDC.
Want a bit of added GLAM with your digital research skills? You’re in luck, as I’ll be speaking at not one, but three ResBaz events in November. If you haven’t heard of it before, ResBaz (Research Bazaar) is ‘a worldwide festival promoting the digital literacy at the centre of modern research’.
The programs of all three ResBaz events are chock full of excellent opportunities to develop your digital skills, learn new research methods, and explore digital tools. If you’re an HDR student you should check out what’s on offer.
The latest help video for the GLAM Workbench walks through the web app version of the Trove Newspaper & Gazette Harvester. Just paste in your search url and Trove API key and you can harvest thousands of digitised newspaper articles in minutes!
An inquiry on Twitter prompted me to put together a notebook that you can use to download all available issues of a newspaper as PDFs. It was really just a matter of copying code from other tools and making a few modifications. The first step harvests a list of available issues for a particular newspaper from Trove. You can then download the PDFs of those issues, supplying an optional date range. Beware – this could consume a lot of disk space!
The PDF file names have the following structure:
`[newspaper identifier]-[issue date as YYYYMMDD]-[issue identifier].pdf`
For example, in `903-19320528-1791051.pdf`:
`903` – the Glen Innes Examiner
`19320528` – 28 May 1932
`1791051` – the issue identifier; to view the issue in Trove just add this to `http://nla.gov.au/nla.news-issue`, eg http://nla.gov.au/nla.news-issue1791051
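If you want to work with the downloaded files programmatically, the file name structure can be unpacked with a few lines of Python. This is just a sketch: the function name is my own, it assumes the identifiers are always numeric, and the example file name is assembled from the components listed above.

```python
import re
from datetime import date, datetime

ISSUE_URL_PREFIX = "http://nla.gov.au/nla.news-issue"

# [newspaper identifier]-[issue date as YYYYMMDD]-[issue identifier].pdf
# Assumes both identifiers are made up of digits only.
FILENAME_PATTERN = re.compile(
    r"^(?P<title_id>\d+)-(?P<date>\d{8})-(?P<issue_id>\d+)\.pdf$"
)


def parse_issue_filename(filename):
    """Split a PDF file name into its newspaper id, issue date, and issue id."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return {
        "title_id": match["title_id"],
        "issue_date": datetime.strptime(match["date"], "%Y%m%d").date(),
        "issue_id": match["issue_id"],
        # Appending the issue identifier to the prefix gives the Trove link.
        "issue_url": ISSUE_URL_PREFIX + match["issue_id"],
    }


details = parse_issue_filename("903-19320528-1791051.pdf")
```

Parsing the date into a proper `date` object makes it easy to sort or filter the downloaded issues later on.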
The GLAM Workbench isn’t dependent on one big piece of technological infrastructure. It’s basically a collection of Jupyter notebooks, and those notebooks can be used within a variety of different environments. This helps make the GLAM Workbench more sustainable – new components can be swapped in and out as required. It also makes it possible to create different pathways for users, depending on their digital skills, institutional support, and research needs. For example, links to Binder make it easy for users to explore the possibilities of the GLAM Workbench and accomplish quick tasks. But Binder has limits. Where do you go when your research project scales up?
Earlier this year I added one-click installation of GLAM Workbench repositories in Reclaim Cloud. Today I’m very pleased to announce that selected GLAM Workbench repositories can be installed as applications within the Nectar Research Cloud. Using nationally-funded digital infrastructure, researchers in Australian universities can now create their own workbenches in minutes. So whether you’re harvesting truckloads of data from Trove or analysing web archives at scale, you can move beyond Binder and set up an environment dedicated to your research project. Cool huh?
Currently four repositories can be installed on Nectar in this way:
But more will be added in the future as I update repositories to generate the necessary Docker images. Nectar installation information is now included in each of these four repositories, and I’ve added a Using the Nectar Cloud section to the help documentation that includes a detailed walkthrough of the installation process. If you strike any problems either raise an issue on GitHub, or ask a question at OzGLAM Chat.
Huge thanks to Andy, Jacob, and Jo at the Australian Research Data Commons (ARDC) who responded enthusiastically to my tweeted query, and packaged the repositories up into an easy-to-install, reusable application. After all the work I’ve put into the GLAM Workbench, it’s really exciting to see it embedded within Australia’s digital research infrastructure. #dhhacks
A new version of the GLAM Name Index Search is available. An additional 49 indexes have been added, bringing the total to 246. You can now search for names in more than 10.2 million records from 9 organisations.
The new indexes come from Queensland State Archives and the State Library of WA. QSA announced on Friday that they’d added two new indexes to their site. When I went to harvest them, I realised there were another 25 indexes that I hadn’t previously picked up. It seems that some QSA datasets are tagged as ‘Queensland State Archives’ in the data.qld.gov.au portal, but others are tagged as ‘queensland state archives’ – and the tag search is case sensitive! I now search for both the upper and lower case tags.
There are also a number of additions from the State Library of WA. These datasets were already in my harvest, but because of some oddities in their formatting, I hadn’t included them in the Index Search. Looking at them again, I realised they were right to go, so I’ve added them in.
Here’s the list of additions:
After a bit more work last night I added in a dataset from the State Library of Victoria:
That’s an extra 21,000 records, and takes the total number of indexes to 247 from 10 different GLAM organisations!
When you search Trove’s newspapers, you find articles – these articles are grouped by page, and all the pages from a particular date make up an issue. But how do you find out what issues are available? How do you get a list of dates when newspapers were published? This notebook in the GLAM Workbench shows how you can get information about issues from the Trove API.
Using the notebook, I’ve created a couple of datasets ready for download and use.
Harvested 10 October 2021
CSV formatted dataset containing the number of newspaper issues available on Trove, totalled by title and year – comprises 27,604 rows with the fields:
title – newspaper title
title_id – newspaper id
state – place of publication
year – year published
issues – number of issues
Download from Cloudstor: newspaper_issues_totals_by_year_20211010.csv (2.1mb)
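As a rough illustration of what you can do with this file, here’s a minimal sketch that adds up the number of issues across all titles for each year. It only relies on the `year` and `issues` fields listed above. The function names are made up for this example, and it assumes you’ve saved the CSV locally.

```python
import csv
from collections import defaultdict


def issues_per_year(rows):
    """Sum the 'issues' column across all titles for each year."""
    totals = defaultdict(int)
    for row in rows:
        totals[int(row["year"])] += int(row["issues"])
    # Return the totals as a plain dict, ordered by year.
    return dict(sorted(totals.items()))


def load_issue_totals(path):
    """Read the downloaded CSV and return {year: total issues}."""
    with open(path, newline="", encoding="utf-8") as csv_file:
        return issues_per_year(csv.DictReader(csv_file))
```

With the file in your working directory, `load_issue_totals("newspaper_issues_totals_by_year_20211010.csv")` should give you a dictionary of yearly totals that’s ready to chart.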
Harvested 10 October 2021
CSV formatted dataset containing a complete list of newspaper issues available on Trove – comprises 2,654,020 rows with the fields:
title – newspaper title
title_id – newspaper id
state – place of publication
issue_id – issue identifier
issue_date – date of publication (YYYY-MM-DD)
To keep the file size down, I haven’t included an issue_url in this dataset, but these are easily generated from the issue_id. Just add the issue_id to the end of http://nla.gov.au/nla.news-issue. For example: http://nla.gov.au/nla.news-issue495426. Note that when you follow an issue url, you actually get redirected to the url of the first page in the issue.
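In Python, generating an issue url might look something like this. It’s a minimal sketch; the function and constant names are just for illustration.

```python
ISSUE_URL_PREFIX = "http://nla.gov.au/nla.news-issue"


def issue_url(issue_id):
    """Build an issue url by adding the issue_id to the end of the prefix."""
    return f"{ISSUE_URL_PREFIX}{issue_id}"


print(issue_url(495426))
# prints http://nla.gov.au/nla.news-issue495426
```

The same function could be applied to every row of the dataset if you do want an issue_url column.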
Download from Cloudstor: newspaper_issues_20211010.csv (222mb)
For more information see the Trove newspapers section of the GLAM Workbench.
Way back in 2013, I went to the eResearch Australasia conference as the manager of Trove to talk about new research possibilities using the Trove API. Eight years later I was back, still spruiking the possibilities of Trove data. This time, however, I was discussing Trove in the broader context of GLAM data – all the exciting possibilities that have emerged as galleries, libraries, archives and museums make more of their collections available in machine-readable form. The big question is, of course, how do researchers, particularly those in the humanities, make use of that data? The GLAM Workbench is my attempt to address that question – to provide humanities researchers with both the tools and information they need, and an understanding of the possibilities that might emerge if they invest a bit of time in working with GLAM data. My eResearch Australasia 2021 presentation provides a quick introduction to the GLAM Workbench. Here’s the video and the slides.
The presentation was pre-recorded, but I managed to sneak in an update via chat for those who attended the session. More news on this next week… 🥳
There’s no reliable way of downloading an image of a Trove newspaper article from the web interface. The image download option produces an HTML page with embedded images, and the article is often sliced into pieces to fit the page.
This Python package includes tools to download articles as complete JPEG images. If an article is printed across multiple newspaper pages, multiple images will be downloaded – one for each page. It’s intended for integration into other tools and processing workflows, or for people who like working on the command line.
You can use it as a library:
from trove_newspaper_images.articles import download_images
images = download_images('107024751')
Or from the command line:
trove_newspaper_images.download 107024751 --output_dir images
If you just want to quickly download an article as an image without installing anything, you can use this web app in the GLAM Workbench. To download images of all articles returned by a search in Trove, you can also use the Trove Newspaper and Gazette Harvester.
See the documentation for more information. #dhhacks
In total there’s 9.67 million name records to search across 197 datasets provided by 9 GLAM organisations!
Here’s the preprint version of an article, ‘More than newspapers’, that I’ve submitted for a forum about Trove in a forthcoming issue of History Australia.
Recently I created a list of publications that made use of QueryPic, my tool to visualise searches in Trove’s digitised newspapers. Here’s another example of the GLAM Workbench and QueryPic in action, in Professor Julian Meyrick’s recent keynote lecture, ‘Looking Forward to the 1950s: A Hauntological Method for Investigating Australian Theatre History’.