Noticed that QueryPic was having a problem with some date queries. Should be fixed in the latest release of the Trove Newspapers section of the #GLAMWorkbench: glam-workbench.net/trove-new… #maintenance #researchinfrastructure
The Trove Newspapers section of the #GLAMWorkbench has been updated! Voilà was causing a problem in QueryPic, stopping results from being downloaded. A package update did the trick! Everything now updated & tested. glam-workbench.net/trove-new…
Some more #GLAMWorkbench maintenance – this app for downloading high-res page images from Trove newspapers no longer requires an API key if you have a url, & some display problems have been fixed. trove-newspaper-apps.herokuapp.com/voila/ren…
The Trove Newspaper and Gazette Harvester section of the #GLAMWorkbench has been updated! No major changes to notebooks, just lots of background maintenance stuff such as updating packages, testing, linting notebooks etc. glam-workbench.net/trove-har…
If you have a dataset that you want to share as a searchable online database then check out Datasette – it’s a fabulous tool that provides an ever-growing range of options for exploring and publishing data. I particularly like how easy Datasette makes it to publish datasets on cloud services like Google’s Cloudrun and Heroku. A couple of weekends ago I migrated the Tung Wah Newspaper Index to Datasette. It’s now running on Heroku, and I can push updates to it in seconds.
I’m also using Datasette as the platform for sharing data from the Sydney Stock Exchange Project that I’m working on with the ANU Archives. There’s a lot of data – more than 20 million rows – but getting it running on Google Cloudrun was pretty straightforward with Datasette’s `publish` command. The problem was, however, that Datasette is configured to run on most cloud services in ‘immutable’ mode, and we want authenticated users to be able to improve the data. So I needed to explore alternatives.
I’ve been working with Nectar over the past year to develop a GLAM Workbench application that helps researchers do things like harvesting newspaper articles from a Trove search. So I thought I’d have a go at setting up Datasette in a Nectar instance, and it works! Here’s a few notes on what I did…
1. I set Datasette up on the instance following its documentation for running under `systemd`. This involved installing `datasette` via `pip`, creating a folder for the Datasette data and configuration files, and creating a `datasette.service` file for `systemd`.
2. I used the `datasette install` command to add a couple of Datasette plugins. One of these is the `datasette-github-auth` plugin, which needs a couple of secret tokens set. I added these as environment variables in the `datasette.service` file.
3. The `systemd` setup uses Datasette’s configuration directory mode. This means you can put your database, metadata definitions, custom templates and CSS, and any other settings all together in a single directory and Datasette will find and use them. I’d previously passed runtime settings via the command line, so I had to create a `settings.json` for these.
4. I copied the files across with `rsync` and started the Datasette service. It worked!
5. To keep the data safe, I mounted a persistent storage volume at `/pvol` in the virtual machine as the Nectar documentation describes.
6. I created a `datasette-root` folder under `/pvol`, copied the Datasette files to it, and changed the `datasette.service` file to point to it. This didn’t seem to work and I’m not sure why. So instead I created a symbolic link between `/home/ubuntu/datasette-root` and `/pvol/datasette-root` and set the path in the service file back to `/home/ubuntu/datasette-root`. This worked! So now the database and configuration files are sitting in the persistent storage volume.

Although the steps above might seem complicated, it was mainly just a matter of copying and pasting commands from the existing documentation. The new Datasette instance is running here, but this is just for testing and will disappear soon. If you’d like to know more about the Stock Exchange Project, check out the ANU Archives section of the GLAM Workbench.
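If you want to try something similar, here’s a rough sketch of what a Datasette configuration directory might contain. The layout follows Datasette’s configuration directory mode, but the database name and settings values below are just placeholders, not the ones I actually used:

```python
from pathlib import Path
import json

# Hypothetical layout for Datasette's configuration directory mode.
# The directory and database names are placeholders.
config_dir = Path("datasette-root")
config_dir.mkdir(exist_ok=True)

# Runtime settings that would otherwise be passed on the command line.
# These are standard Datasette settings, but the values here are examples only.
settings = {
    "default_page_size": 50,
    "sql_time_limit_ms": 3500,
    "allow_download": True,
}
(config_dir / "settings.json").write_text(json.dumps(settings, indent=2))

# Your SQLite database(s), metadata.json, and any custom templates or static
# files also live in this directory, and Datasette finds them automatically:
#   datasette-root/
#   ├── stock-exchange.db   <- placeholder name
#   ├── metadata.json
#   ├── settings.json
#   └── templates/
```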
I’m thinking about the Trove Researcher Platform discussions & ways of integrating Trove with other apps and platforms (like the GLAM Workbench).
As a simple demo I modified my Trove Proxy app to convert a newspaper search url from the Trove web interface into an API query (using the trove-query-parser package). The proxy app then redirects you to the Trove API Console so you can see the results of the API query without needing a key.
To make it easy to use, I created a bookmarklet that encodes your current url and feeds it to the proxy – just run your search in Trove, then click the bookmarklet.
This little hack provides a bit of ‘glue’ to help researchers think about their search results as data, and explore other possibilities for download and analysis. #dhhacks
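To give a sense of what the proxy is doing behind the scenes, here’s a much-simplified sketch of the translation. The real trove-query-parser package handles many more parameters and categories; the mapping below only covers a basic keyword search and is purely illustrative:

```python
from urllib.parse import urlparse, parse_qs, urlencode

def web_url_to_api_query(url):
    """Convert a Trove newspaper search url into an API query (simplified sketch)."""
    params = parse_qs(urlparse(url).query)
    api_params = {
        "zone": "newspaper",
        "encoding": "json",
        # The web interface uses 'keyword'; the API uses 'q'.
        "q": params.get("keyword", [""])[0],
    }
    return f"https://api.trove.nla.gov.au/v2/result?{urlencode(api_params)}"

print(web_url_to_api_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
))
```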
The ARDC is collecting user requirements for the Trove researcher platform for advanced research. This is a chance to start from scratch, and think about the types of data, tools, or interface enhancements that would support innovative research in the humanities and social sciences. The ARDC will be holding two public roundtables, on 13 and 20 May, to gather ideas. I created a list of possible API improvements in my response to last year’s draft plan, and thought it might be useful to expand that a bit, and add in a few other annoyances, possibilities, and long-held dreams.
My focus is again on the data; this is for two reasons. First because public access to consistent, good quality data makes all other things possible. But, of course, it’s never just a matter of OPENING ALL THE DATA. There will be questions about priorities, about formats, about delivery, about normalisation and enrichment. Many of these questions will arise as people try to make use of the data. There needs to be an ongoing conversation between data providers, research tool makers, and research tool users. This is the second reason I think the data is critically important – our focus should be on developing communities and skills, not products. A series of one-off tools for researchers might be useful, but the benefits will wane. Building tools through networks of collaboration and information sharing based on good quality data offers much more. Researchers should be participants in these processes, not consumers.
Anyway, here’s my current wishlist…
Bring the web interface and main public API back into sync, so that researchers can easily transfer queries between the two. The Trove interface update in 2020 reorganised resources into ‘categories’, replacing the original ‘zones’. The API, however, is still organised by zone and knows nothing about these new categories. Why does this matter? The web interface allows researchers to explore the collection and develop research questions. Some of these questions might be answered by downloading data from the API for analysis or visualisation. But, except for the newspapers, there is currently no one-to-one correspondence between searches in the web interface and searches using the API. There’s no way of transferring your questions – you need to start again.
Expand the metadata available for digitised resources other than newspapers. In recent years, the NLA has digitised huge quantities of books, journals, images, manuscripts, and maps. The digitisation process has generated new metadata describing these resources, but most of this is not available through the public API. We can get an idea of what’s missing by comparing the digitised journals to the newspapers. The API includes a `newspaper` endpoint that provides data on all the newspapers in Trove. You can use it to get a list of available issues for any newspaper. There is no comparable way of retrieving a list of digitised journals, or the issues that have been digitised. The data’s somewhere – there’s an internal API that’s used to generate lists of issues in the browse interface and I’ve scraped this to harvest issue details. But this information should be in the public API. Manuscripts are described using finding aids, themselves generated from EAD formatted XML files, but none of this important structured data is available from the API, or for download. There’s also other resource metadata, such as parent/child relationships between different levels in the object hierarchy (eg publication > pages). These are embedded in web pages but not exposed in the API. The main point is that when it comes to data-driven research, digitised books, journals, manuscripts, images, and maps are second-class citizens, trailing far behind the newspapers in research possibilities. There needs to be a thorough stocktake of available metadata, and a plan to make this available in machine-actionable form.
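To make the comparison concrete, here’s roughly what you can already do for newspapers using the v2 API (you’ll need your own API key, and the exact response structures are best checked against the API documentation):

```python
import requests

API_KEY = "YOUR_API_KEY"  # insert your own Trove API key

# List all of the newspaper titles available in Trove
titles = requests.get(
    "https://api.trove.nla.gov.au/v2/newspaper/titles",
    params={"encoding": "json", "key": API_KEY},
).json()

# Get details of a single title, including the number of issues available per year
# (title 11 is The Canberra Times)
canberra_times = requests.get(
    "https://api.trove.nla.gov.au/v2/newspaper/title/11",
    params={"include": "years", "encoding": "json", "key": API_KEY},
).json()

# There's no equivalent of either request for digitised journals,
# manuscripts, maps, or images.
# (Check the exact response structure against the API documentation.)
print(len(titles["response"]["records"]["newspaper"]))
```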
Standardise the delivery of text, images, and PDFs and provide download links through the API. As noted above, digitised resources are treated differently depending on where they sit in Trove. There are no standard mechanisms for downloading the products of digitisation, such as OCRd text and images. OCRd text is available directly through the API for newspaper and journal articles, but to download text from a book or journal issue you need to hack the download mechanisms from the web interface. Links to these should be included in the API. Similarly, machine access to images requires various hacks and workarounds. There should be a consistent approach that allows researchers to compile image datasets from digitised resources using the API. Ideally, IIIF-standard APIs should be used for the delivery of images and maps. This would enable the use of the growing ecosystem of IIIF-compliant tools for integration, analysis, and annotation.
Provide an option to exclude search results in tags and comments. The Trove advanced search used to give you the option of excluding search results which only matched tags or comments, rather than the content of the resource. Back when I was working at Trove, the IT folks said this feature would be added to a future version of the API, but instead it disappeared from the web interface with the 2020 update! Why is this important? If you’re trying to analyse the occurrence of search terms within a collection, such as Trove’s digitised newspapers, you want to be sure that the result reflects the actual content, and not a recent annotation by Trove users.
Finally add the People & Organisations data to the main API. Trove’s People & Organisations section was ahead of the game in providing machine-readable access, but the original API is out-of-date and uses a completely different query language. Some work was done on adding it to the main RESTful API, but it was never finished. With a bit of long-overdue attention, the People & Organisations data could power new ways of using and linking biographical resources.
Improve the web archives CDX API. Although the NLA does little to inform researchers of the possibilities, the web archives software it uses (Pywb) includes some baked-in options for retrieving machine-readable data. This includes support for the Memento protocol, and the provision of a CDX API that delivers basic metadata about individual web page captures. The current CDX API has some limitations (documented here). In particular, there’s no pagination of results, and no support for domain-level queries. Addressing these limitations would make the existing CDX API much more useful.
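For example, here’s the sort of CDX query Pywb already supports. The base url below is what I understand the Australian Web Archive’s CDX endpoint to be – treat it as an assumption and check the current documentation; the parameters are standard Pywb CDX options:

```python
import requests

# Query the (assumed) Australian Web Archive CDX endpoint for captures of a page.
response = requests.get(
    "https://web.archive.org.au/awa/cdx",
    params={
        "url": "http://www.nla.gov.au/",
        "output": "json",   # one JSON record per line
        "from": "2000",
        "to": "2010",
        "limit": 10,
    },
)

# Each line describes a single capture (timestamp, url, status, digest, etc.)
for line in response.text.splitlines():
    print(line)
```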
Provide new data sources for web archives analysis. There needs to be a constructive, ongoing discussion about the types of data that could be extracted and shared from the Australian web archive. For example, a search API, or downloadable datasets of word frequencies. The scale is a challenge, but some pilot studies could help us all understand both the limits and the possibilities.
Provide a Write API for annotations. Integration between components in the HASS RDC would be greatly enhanced if other projects could automatically add structured annotations to existing Trove resources. Indeed, this would create exciting possibilities for embedding Trove resources within systems of scholarly analysis, allowing insights gained through research to be automatically fed back into Trove to enhance discovery and understanding.
Provide historical statistics on Trove resources. It’s important for researchers to understand how Trove itself changes over time. There used to be a page that provided regularly-updated statistics on the number of resources and user annotations, but this was removed by the interface upgrade in 2020. I’ve started harvesting some basic stats relating to Trove newspapers, but access to general statistics should be reinstated.
Reassess key authentication and account limits. DigitalNZ recently changed their policy around API authentication, allowing public access without a key. Authentication requirements hinder exploration and limit opportunities for using the API in teaching and workshops. Similarly, I don’t think the account usage limits have been changed since the API was released, even though the capacity of the systems has increased. It seems like time that both of these were reassessed.
Ok, I’ll admit, that’s a pretty long list, and not everything can be done immediately! I think this would be a good opportunity for the NLA to develop and share an API and Data Roadmap that is regularly updated and invites comments and suggestions. This would help researchers plan for future projects, and build a case for further government investment.
Unbreak Zotero integration. The 2020 interface upgrade broke the existing Zotero integration and there’s no straightforward way of fixing it without changes at the Trove end. Zotero used to be able to capture search results, metadata and images from most of the zones in Trove. Now it can only capture individual newspaper articles. This greatly limits the ability of researchers to assemble and manage their own research collections. More generally, a program to examine and support Zotero integration across the GLAM sector would be a useful way of spending some research infrastructure dollars.
Provide useful page metadata. Zotero is just one example of a tool that can extract structured metadata from web pages. Such metadata supports reuse and integration, without the need for separate API requests. Only Trove’s newspaper articles currently provide embedded metadata. Libraries used to lead the way in promoting the use of standardised, structured, embedded page metadata (Dublin Core anyone?), but now?
Explore annotation frameworks. I’ve mentioned the possibility of a Write API for annotations above, but there are other possibilities for supporting web scale annotations, such as Hypothesis. Again, the current Trove interface makes the use of Hypothesis difficult, and again this sort of integration would be usefully assessed across the whole GLAM sector.
Obviously any discussion of new tools or interfaces needs to start by looking at what’s already available. This is difficult when the NLA won’t even link to existing resources such as the GLAM Workbench. Sharing information about existing tools needs to be the starting point from which to plan investment in the Trove Researcher Platform. From there we can identify gaps and develop processes and collaborations to meet specific research needs. Here’s a list of some Trove-related tools and resources currently available through the GLAM Workbench.
I forgot to add these annoying bugs:
- The `newspaper` endpoint returns both newspaper and gazette titles, even though there’s a separate `gazette` endpoint. This forces you to do silly workarounds like this in the GLAM Workbench.
- The `list` zone has recurring problems. At the moment it’s impossible to harvest a complete set of Trove lists.

Spending the evening updating the NAA section of the #GLAMWorkbench. Here’s a fresh harvest of the agency functions currently being used in RecordSearch… gist.github.com/wragge/d1…
The ARDC is organising a couple of public forums to help gather researcher requirements for the Trove component of the HASS RDC. One of the roundtables will look at ‘Existing tools that utilise Trove data and APIs’. Last year I wrote a summary of what the GLAM Workbench can contribute to the development of humanities research infrastructure, particularly in regard to Trove. I thought it might be useful to update that list to include recent additions to the GLAM Workbench, as well as a range of other datasets, software, tools, and interfaces that exist outside of the GLAM Workbench.
Since last year’s post I’ve also been working hard to integrate the GLAM Workbench with other eResearch services such as Nectar and CloudStor, and to document and support the ways that individuals and institutions can contribute code and documentation.
There’s lots of fabulous data in Trove and other GLAM collections. In fact, there’s so much data that it can be difficult for researchers to find and collect what’s relevant to their interests. There are many tools in the GLAM Workbench to help researchers assemble their own datasets. For example:
While I’m focusing here on Trove, there’s also tools to create datasets from the National Archives of Australia, Digital NZ and Papers Past, the National Museum of Australia and more. And there’s a big list of readily downloadable datasets from Australian GLAM organisations.
Many of the notebooks listed above include examples that demonstrate ways of exploring and analysing your harvested data. There are also a number of companion notebooks that examine some possibilities in more detail, for example:
But there are also many other notebooks that demonstrate methods for analysing Trove’s content, for example:
- One notebook uses the `state` facet to create a choropleth map that visualises the number of search results per state (see the sketch below).
- Another uses the `title` facet to find the number of results per newspaper, then merges the results with a dataset of geolocated newspapers to map where articles were published.

There are also possibilities for using Trove data creatively. For example you can create ‘scissors and paste’ messages from Trove newspaper articles.
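Here’s a rough sketch of the kind of request behind that first example – asking the API for the `state` facet on a newspaper search and turning the counts into a dataframe that could feed a choropleth map. You’ll need your own API key, and the path into the JSON response is based on the v2 API, so adjust as needed:

```python
import pandas as pd
import requests

API_KEY = "YOUR_API_KEY"  # insert your own Trove API key

response = requests.get(
    "https://api.trove.nla.gov.au/v2/result",
    params={
        "q": "radio",        # an example search term
        "zone": "newspaper",
        "facet": "state",    # ask for result counts by state
        "n": 0,              # we only want the facets, not the articles
        "encoding": "json",
        "key": API_KEY,
    },
).json()

# Dig the state facet terms out of the response (v2 structure)
facets = response["response"]["zone"][0]["facets"]["facet"]
if isinstance(facets, dict):  # a single facet can come back as a dict rather than a list
    facets = [facets]
df = pd.DataFrame(
    [{"state": t["display"], "total": int(t["count"])} for t in facets[0]["term"]]
)
print(df)
```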
All the Trove notebooks in the GLAM Workbench help document the possibilities and limits of the Trove API. The examples above can be modified and reworked to suit different research interests. Some notebooks also explore particular aspects of the API, for example:
- One notebook uses the `date` index and the `firstpageseq` parameter to find articles from exactly 100 years ago that were published on the front page. It then selects one of the articles at random and downloads and displays an image of the front page.

And while it’s not officially part of the GLAM Workbench, I also maintain the Trove API Console which provides lots of examples of the API in action.
I’ve started making videos to help you get started with the GLAM Workbench.
A number of pre-harvested datasets are noted above in the ‘Getting and moving data’ section. Here’s a fairly complete list of ready-to-download datasets harvested from Trove.
See also Sources of Australian GLAM data in the GLAM Workbench.
The GLAM Workbench makes use of a number of software packages that I’ve created in Python to work with Trove data. These are openly-licensed and available for installation from PyPi.
Over the years I’ve developed many tools and interfaces using Trove data. Some have been replaced by the GLAM Workbench, but others keep chugging along, for example:
See also More GLAM tools and interfaces in the GLAM Workbench. #dhhacks
Ok, I’ve created a new #GLAMWorkbench meta issue to try and bring together all the things I’m trying to do to improve & automate the code & documentation. This should help me keep track of things… github.com/GLAM-Work… #DayofDH2022
A couple of hours of #DayofDH2022 left – feeling a bit uninspired, so I’m going to do some pruning & reorganising of the #GLAMWorkbench issues list: github.com/GLAM-Work…
I’ve been doing a bit of cleaning up, trying to make some old datasets more easily available. In particular I’ve been pulling together harvests of the number of newspaper articles in Trove by year and state. My first harvests date all the way back to 2011, before there was even a Trove API. Unfortunately, I didn’t run the harvests as often as I should’ve and there are some big gaps. Nonetheless, if you’re interested in how Trove’s newspaper corpus has grown and changed over time, you might find them useful. They’re available in this repository and also in Zenodo.
This chart shows how the number of newspaper articles per year in Trove has changed from 2011 to 2022. Note the rapid growth between 2011 and 2015.
To try and make sure that there’s a more consistent record from now on, I’ve also created a new Git Scraper – a GitHub repository that automatically harvests and saves data at weekly intervals. As well as the number of articles by year and state, it also harvests the number of articles by newspaper and category. As mentioned, these four datasets are updated weekly. If you want to get all the changes over time, you can retrieve earlier versions from the repository’s commit history.
All the datasets are CC-0 licensed and validated with Frictionless.
There’s also a notebook in the GLAM Workbench that explores this sort of data.
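If you’re wondering how this sort of harvesting works, here’s a rough sketch using the v2 API’s decade and year facets. As far as I’m aware you need to set `l-decade` to get year-level counts, but treat the details as assumptions and check the API documentation:

```python
import requests

API_KEY = "YOUR_API_KEY"  # insert your own Trove API key
API_URL = "https://api.trove.nla.gov.au/v2/result"

totals = {}
# Trove identifies decades as 180 (1800s), 181 (1810s) ... 201 (2010s)
for decade in range(180, 202):
    response = requests.get(
        API_URL,
        params={
            "q": " ",            # a single space matches everything
            "zone": "newspaper",
            "facet": "year",
            "l-decade": decade,  # year facets are only returned within a decade
            "n": 0,
            "encoding": "json",
            "key": API_KEY,
        },
    ).json()
    facets = response["response"]["zone"][0].get("facets")
    if not facets:
        continue
    facet = facets["facet"]
    if isinstance(facet, dict):  # a single facet can come back as a dict
        facet = [facet]
    for term in facet[0]["term"]:
        totals[term["display"]] = int(term["count"])

print(dict(sorted(totals.items())))
```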
Over the past few months I’ve been doing a lot of behind-the-scenes work on the GLAM Workbench – automating, standardising, and documenting processes for developing and managing repositories. These sorts of things ease the maintenance burden on me and help make the GLAM Workbench sustainable, even as it continues to grow. But these changes are also aimed at making it easier for you to contribute to the GLAM Workbench!
Perhaps you’re part of a GLAM organisation that wants to help researchers explore its collection data – why not create your own section of the GLAM Workbench? It would be a great opportunity for staff to develop their own digital skills and learn about the possibilities of Jupyter notebooks. I’ve developed a repository template and some detailed documentation to get you started. The repository template includes everything you need to create and test notebooks, as well as built-in integration with Binder, Docker, Reclaim Cloud, and Zenodo. And, of course, I’ll be around to help you through the process.
Or perhaps you’re a researcher who wants to share some code you’ve developed that extends or improves an existing GLAM Workbench repository. Yes please! Or maybe you’re a GLAM Workbench user who has something to add to one of the lists of resources; or you’ve noticed a problem with some of the documentation that you’d like to fix. All contributions welcome!
The Get involved! page includes links to all this information, as well as some other possibilities such as becoming a sponsor, or sharing news. And to recognise those who make a contribution to the code or documentation there’s also a brand new contributors page.
I’m looking forward to exploring how we can build the GLAM Workbench together. #dhhacks
Over the last couple of years I've been fiddling with bits of Python code to work with the Omeka S REST API. The Omeka S API is powerful, but the documentation is patchy, and doing basic things like uploading images can seem quite confusing. My code was an attempt to simplify common tasks, like creating new items.
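To give a sense of what working with the raw API looks like, here's a minimal sketch of fetching items from an Omeka S site with requests – the site url and keys are placeholders you'd replace with your own:

```python
import requests

# Placeholder values -- substitute your own Omeka S site and API keys
OMEKA_URL = "https://my-omeka-site.org/api"
KEY_IDENTITY = "YOUR_KEY_IDENTITY"
KEY_CREDENTIAL = "YOUR_KEY_CREDENTIAL"

# List items using the Omeka S REST API
response = requests.get(
    f"{OMEKA_URL}/items",
    params={
        "key_identity": KEY_IDENTITY,
        "key_credential": KEY_CREDENTIAL,
        "per_page": 10,
    },
)

# Items are returned as JSON-LD, so titles live under 'o:title'
for item in response.json():
    print(item.get("o:title"))
```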
In case it's of use to others, I've now shared my code as a Python package. So you can just `pip install omeka-s-tools` to get started. The code helps you with common tasks like creating new items and uploading media files.
There's quite detailed documentation available, including an example of adding a newspaper article from Trove to Omeka. If you want to see the code in action, there's also a notebook in the Trove newspapers section of the GLAM Workbench that uploads newspaper articles (including images and OCRd text) to Omeka from a variety of sources, including Trove searches, Trove lists, and Zotero libraries.
If you find any problems, or would like additional features, feel free to create an issue in the GitHub repository. #dhhacks
I regularly update the Python packages used in the different sections of the GLAM Workbench; though probably not as often as I should. Part of the problem is that once I've updated the packages, I have to run all the notebooks to make sure I haven't inadvertently broken something -- and this takes time. And in those cases where the notebooks need an API key to run, I have to copy and paste the key in at the appropriate spots, then remember to delete them afterwards. They're little niggles, but they add up, particularly as the GLAM Workbench itself expands.
I've been looking around at Jupyter notebook automated testing options for a while. There's nbmake, testbook, and nbval, as well as custom solutions involving things like papermill and nbconvert. After much wavering, I finally decided to give `nbval` a go. The thing that I like about `nbval` is that I can start simple, then increase the complexity of my testing as required. The `--nbval-lax` option just checks to make sure that all the cells in a notebook run without generating exceptions. You can also tag individual cells that you want to exclude from testing. This gives me a testing baseline -- this notebook runs without errors -- it might not do exactly what I think it's doing, but at least it's not exploding in flames. Working from this baseline, I can start tagging individual cells where I want the output of the cell to be checked. This will let me test whether a cell is doing what it's supposed to.
This approach means that I can start testing without making major changes to existing notebooks. The main thing I had to think about is how to handle API keys or other variables which are manually set by users. I decided the easiest approach was to store them in a `.env` file and use dotenv to load them within the notebook. This also makes it easy for users to save their own credentials and use them across multiple notebooks -- no more cutting and pasting of keys! Some notebooks are designed to run as web apps using Voila, so they expect human interaction. In these cases, I added extra cells that only run in the testing environment -- they populate the necessary fields and simulate button clicks to start.
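The dotenv approach is pretty simple -- something like this at the top of a notebook (the `TROVE_API_KEY` variable name is just what I'd use; any name works as long as it matches your `.env` file):

```python
import os

from dotenv import load_dotenv

# Load variables from a local .env file (which stays out of version control)
load_dotenv()

# Fall back to an empty string so the notebook still runs if no key is set
API_KEY = os.getenv("TROVE_API_KEY", "")
```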
While I was in a QA frame of mind, I also started playing with nbqa -- a framework for all sorts of code formatting, linting, and checking tools. I decided I'd try to standardise the formatting of my notebook code by running isort, black, and flake8. As well as making the code cleaner and more readable, they pick up things like unused imports or variables. To further automate this process, I configured the `nbqa` checks to run when I try to commit any notebook code changes using `git`. This was made easy by the pre-commit package.
This is all set up and running in the Trove newspapers repository -- you can see the changes here. Now if I update the Python packages or make any other changes to the repository, I can just run `pytest --nbval-lax` to test every notebook at once. And if I make changes to an individual notebook, `nbqa` will automatically give the changes a code quality check before I save them to the repository. I'm planning to roll these changes out across the whole of the GLAM Workbench in coming months.
Developments like these are not very exciting for users, but they're important for the management and sustainability of the GLAM Workbench, and help create a solid foundation for future development and collaboration. Last year I created a GLAM Workbench repository template to help people or organisations thinking about contributing new sections. I can now add these testing and QA steps into the template to further share and standardise the work of developing the GLAM Workbench.
One of the things I really like about Jupyter is the fact that I can share notebooks in a variety of different formats. Tools like QueryPic can run as simple web apps using Voila, static versions of notebooks can be viewed using NBViewer, and live versions can be spun up as required on Binder. It’s also possible to export notebooks as PDFs, slideshows, or just plain-old HTML pages. Just recently I realised I could export notebooks to HTML using the same template I use for Voila. This gives me another way of sharing – static web pages delivered via the main GLAM Workbench site.
Here’s a couple of examples:
Both are HTML pages that embed visualisations created using Altair. The visualisations are rendered using javascript, and even though the notebook isn’t running in a live computing environment, there’s some basic interactivity built-in – for example, you can hover for more details, and click on the DigitalNZ chart to search for articles from a newspaper. More to come! #dhhacks
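For the record, exporting a notebook to HTML with nbconvert’s Python API looks something like this – the notebook name is a placeholder and I’m using a stock template for illustration; swapping in a custom Voila-style template takes a bit more configuration:

```python
from pathlib import Path

from nbconvert import HTMLExporter

# Convert a notebook to a static HTML page using nbconvert's Python API.
# 'lab' is one of the built-in templates; a custom template can be swapped in.
exporter = HTMLExporter(template_name="lab")
body, resources = exporter.from_filename("my-notebook.ipynb")  # placeholder file name

Path("my-notebook.html").write_text(body, encoding="utf-8")
```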
The video of my key story presentation at ResBaz Queensland (simulcast via ResBaz Sydney) is now available on Vimeo. In it, I explore some of the possibilities of GLAM data by retracing my own journey through WWI service records, The Real Face of White Australia, #redactionart, and Trove – ending up at the GLAM Workbench, which brings together a lot of my tools and resources in a form that anyone can use. The slides are also available, and there’s an archived version of everything in Zenodo.
This and many other presentations about the GLAM Workbench are listed here. It seems I’ve given at least 11 talks and workshops this year! #dhhacks
The newly-updated DigitalNZ and Te Papa sections of the GLAM Workbench have been added to the list of available repositories in the Nectar Research Cloud’s GLAM Workbench Application. This means you can create your very own version of these repositories running in the Nectar Cloud, simply by choosing them from the app’s dropdown list. See the Using Nectar help page for more information.
I’ve also taken the opportunity to make use of the new container registry service developed by the ARDC as part of the ARCOS project. The app now pulls the GLAM Workbench Docker images from Quay.io via the container registry’s cache. This means that copies of the images are cached locally, speeding things up and saving on data transfers. Yay for integration!
Thanks again to Andy and the Nectar Cloud staff for their help! #dhhacks
In preparation for my talk at ResBaz Aotearoa, I updated the DigitalNZ and Te Papa sections of the GLAM Workbench. Most of the changes are related to management, maintenance, and integration of the repositories. Things like:
- adding an `index.ipynb` file, based on `README.md`, to act as a front page within Jupyter Lab
- adding a `reclaim-manifest.jps` file to allow for one-click installation of the repository in Reclaim Cloud
- updating `README.md` with instructions on how to run the repository via Binder, Reclaim Cloud, Nectar Research Cloud, and Docker Desktop
- adding a `.zenodo.json` metadata file so that new releases are preserved in Zenodo
- using `pip-tools` to generate `requirements.txt` files, with unpinned requirements listed in `requirements.in`
From the user’s point of view, the main benefit of these changes is the ability to run the repositories in a variety of different environments depending on your needs and skills. The Docker images, generated using repo2Docker, are used by Binder, Reclaim Cloud, Nectar, and Docker Desktop. Same image, multiple environments! See ‘Run these notebooks’ in the DigitalNZ and Te Papa sections of the GLAM Workbench for more information.
Of course, I’ve also re-run all of the notebooks to make sure everything works and to update any statistics, visualisations, and datasets. As a bonus, there’s a couple of new notebooks in the DigitalNZ repository:
#dhhacks
I’m hoping that the GLAM Workbench will encourage GLAM organisations and GLAM data nerds (like me) to create their own Jupyter notebooks. If they do, they can put a link to them in the list of GLAM Jupyter resources. But what if they want to add the notebooks to the GLAM Workbench itself?
To make this easier, I’ve been working on a template repository for the GLAM Workbench. It generates a new skeleton repository with all the files you need to develop and manage your own section of the GLAM Workbench. It uses GitHub’s built-in templating feature, together with Cookiecutter, and this GitHub Action by Stefan Buck. Stefan has also written a very helpful blog post.
The new repository is configured to do various things automatically, such as generate and save Docker images, and integrate with Reclaim Cloud and Zenodo. Lurking inside the `dev` folder of each new repository, you’ll find some basic details on how to set up and manage your development environment.
This is just the first step. There’s more documentation to come, but you’re very welcome to try it out. And, of course, if you are interested in contributing to the development of the GLAM Workbench, let me know and I’ll help get you set up!
Want a bit of added GLAM with your digital research skills? You’re in luck, as I’ll be speaking at not one, but three ResBaz events in November. If you haven’t heard of it before, ResBaz (Research Bazaar) is ‘a worldwide festival promoting the digital literacy at the centre of modern research’.
The programs of all three ResBaz events are chock full of excellent opportunities to develop your digital skills, learn new research methods, and explore digital tools. If you’re an HDR student you should check out what’s on offer.
The latest help video for the GLAM Workbench walks through the web app version of the Trove Newspaper & Gazette Harvester. Just paste in your search url and Trove API key and you can harvest thousands of digitised newspaper articles in minutes!
An inquiry on Twitter prompted me to put together a notebook that you can use to download all available issues of a newspaper as PDFs. It was really just a matter of copying code from other tools and making a few modifications. The first step harvests a list of available issues for a particular newspaper from Trove. You can then download the PDFs of those issues, supplying an optional date range. Beware – this could consume a lot of disk space!
The PDF file names have the following structure:
[newspaper identifier]-[issue date as YYYYMMDD]-[issue identifier].pdf
For example:
903-19320528-1791051.pdf
- `903` – the Glen Innes Examiner
- `19320528` – 28 May 1932
- `1791051` – the issue identifier; to view the issue in Trove just add this to http://nla.gov.au/nla.news-issue, eg http://nla.gov.au/nla.news-issue1791051

I also took the opportunity to create a new Harvesting data heading in the Trove newspapers section of the GLAM Workbench. #dhhacks
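In case it’s useful, here’s a tiny snippet for pulling those components back out of a harvested file name:

```python
from pathlib import Path

def parse_issue_filename(filename):
    """Split a harvested PDF file name into its component identifiers."""
    title_id, issue_date, issue_id = Path(filename).stem.split("-")
    return {
        "newspaper_id": title_id,
        "date": f"{issue_date[:4]}-{issue_date[4:6]}-{issue_date[6:]}",
        "issue_id": issue_id,
        "issue_url": f"http://nla.gov.au/nla.news-issue{issue_id}",
    }

print(parse_issue_filename("903-19320528-1791051.pdf"))
# {'newspaper_id': '903', 'date': '1932-05-28', 'issue_id': '1791051',
#  'issue_url': 'http://nla.gov.au/nla.news-issue1791051'}
```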