Do you want your Trove newspaper articles in bulk? Meet the new Trove Newspaper Harvester Python package!

The Trove Newspaper Harvester has been around in different forms for more than a decade. It helps you download all the articles in a Trove newspaper search, opening up new possibilities for large-scale analysis. You can use it as a command-line tool by installing a Python package, or through the Trove Newspaper Harvester section of the GLAM Workbench.

I’ve just overhauled development of the Python package. The new trove-newspaper-harvester package replaces the old troveharvester repository. The command-line interface remains the same (with a few new options), so it’s really a drop-in replacement. Read the full documentation of the new package for more details.

Screenshot of the trove-newspaper-harvester documentation describing its use as a Python library.

Here’s a summary of the changes:

  • the package can now be used as a library (that you incorporate into your own code) as well as a standalone command-line tool – this means you can embed the harvester in your own tools or workflows (see the sketch after this list)
  • both the library and the CLI now let you set the names of the directories in which your harvests will be saved – this makes it easier to organise your harvests into groups and give them meaningful names
  • the harvesting process now saves results into a newline-delimited JSON file (one JSON object per line) – the library has a save_csv() option to convert this to a CSV file, while the CLI automatically converts the results to CSV to maintain compatibility with previous versions
  • behind the scenes, the package is now developed and maintained using nbdev – this means the code and documentation are all generated from a set of Jupyter notebooks
  • the Jupyter notebooks include a variety of automatic tests which should make maintenance and development much easier in the future
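As a rough illustration, using the package as a library looks something like this – check the documentation for the exact class and parameter names, as those shown here are indicative only:

```python
# An illustrative sketch only – the class and parameter names are assumptions,
# so check the trove-newspaper-harvester documentation before copying.
from trove_newspaper_harvester.core import Harvester, prepare_query

# Assumed helper that converts a Trove web interface search url into API query parameters
query_params = prepare_query(
    "https://trove.nla.gov.au/search/category/newspapers?keyword=wragge"
)

harvester = Harvester(
    query_params=query_params,
    key="YOUR_TROVE_API_KEY",       # your Trove API key
    harvest_dir="wragge-articles",  # assumed: set the name of the directory your harvest is saved in
    text=True,                      # save the OCRd text of each article
)
harvester.harvest()   # results are written to a newline-delimited JSON file
harvester.save_csv()  # optionally convert the results to a CSV file
```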

I’ve also updated the Trove Newspaper Harvester section of the GLAM Workbench to use the new package. The new core library will make it easier to develop more complex harvesting examples – for example, searching for articles from a specific day across a range of years. If you find any problems, or want to suggest improvements, please raise an issue.

From 48 PDFs to one searchable database – opening up the Tasmanian Post Office Directories with the GLAM Workbench

A few weeks ago I created a new search interface to the NSW Post Office Directories from 1886 to 1950. Since then, I’ve used the same process on the Sydney Telephone Directories from 1926 to 1954. Both of these publications had been digitised by the State Library of NSW and made available through Trove. To build the new interfaces I downloaded the text from Trove, indexed it by line, and linked it back to the online page images.

But there are similar directories from other states that are not available through Trove. The Tasmanian Post Office Directory, for example, has been digitised for the years 1890 to 1948 and made available as 48 individual PDF files by Libraries Tasmania. While it’s great that they’ve been digitised, it’s not really possible to search them without downloading all the PDFs.

As part of the Everyday Heritage project, Kate Bagnall and I are working on mapping Tasmania’s Chinese history – finding new ways of connecting people and places. The Tasmanian Post Office Directories will be a useful source for us, so I thought I’d try converting them into a database as I had with the NSW directories. But how?

There were several stages involved:

  • Downloading the 48 PDF files
  • Extracting the text and images from the PDFs
  • Making the separate images available online so they could be integrated with the search interface
  • Loading all the text and image links into a SQLite database for online delivery using Datasette

And here’s the result!

Screenshot of search interface for the Tasmanian Post Office Directories.

Search for people and places in Tasmania from 1890 to 1948!

The complete process is documented in a series of notebooks, shared through the brand new Libraries Tasmania section of the GLAM Workbench. As with the NSW directories, the processing pipeline I developed could be reused with similar publications in PDF form. Any suggestions?

Some technical details

There were some interesting challenges in connecting up all the pieces. Extracting the text and images from the PDFs was remarkably easy using PyMuPDF, but the quality of the text wasn’t great. In particular, I had trouble with columns – values from neighbouring columns would be munged together, upsetting the order of the text. I tried working with the positional information provided by PyMuPDF to improve column detection, but every improvement seemed to raise another issue. I was also worried that too much processing might result in some text being lost completely.
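To give a sense of how little code the extraction step needs, here’s an illustrative sketch using PyMuPDF – not the exact code from the notebooks, and the filename is just a placeholder:

```python
# Illustrative sketch of text and image extraction with PyMuPDF
import fitz  # PyMuPDF

doc = fitz.open("tas-post-office-directory-1890.pdf")  # placeholder filename
for page_number, page in enumerate(doc, start=1):
    # Plain text extraction – reading order can jumble multi-column layouts
    text = page.get_text()
    # Word-level positional data, useful when trying to reconstruct columns
    words = page.get_text("words")  # list of (x0, y0, x1, y1, word, ...) tuples
    # Render the page as an image for re-OCRing and IIIF delivery
    pix = page.get_pixmap(dpi=300)
    pix.save(f"page-{page_number:04d}.png")
```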

I tried a few experiments re-OCRing the images with Textract (a paid service from Amazon) and Tesseract. The basic Textract product provides good OCR, but again I needed to work with the positional information to try and reassemble the columns. On the other hand, Tesseract’s automatic layout detection seemed to work pretty well with just the default settings. It wasn’t perfect, but good enough to support search and navigation. So I decided to re-OCR all the images using Tesseract. I’m pretty happy with the result.
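The re-OCRing step itself is tiny – something like this, assuming the Tesseract binary and the pytesseract wrapper are installed:

```python
# Re-OCR an extracted page image with Tesseract's default layout analysis
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("page-0001.png"))  # placeholder filename
```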

The search interfaces for the NSW directories display page images loaded directly from Trove into an OpenSeadragon viewer. The Tasmanian directories have no online images to integrate in this way, so I had to set up some hosting for the images I extracted from the PDFs. I could have just loaded them from an Amazon S3 bucket, but I wanted to use IIIF to deliver the images. Fortunately there’s a great project that uses Amazon’s Lambda service to provide a Serverless IIIF Image API. To prepare the images for IIIF, you convert them to pyramidal TIFFs (a format that contains an image at a number of different resolutions) using VIPS. Then you upload the TIFFs to an S3 bucket and point the Serverless IIIF app at the bucket. There are more details in this notebook. It’s very easy and seems to deliver images amazingly quickly.
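The conversion step looks something like this using pyvips (the Python binding for VIPS) – a sketch only, see the linked notebook for the settings actually used:

```python
# Convert a page image to a tiled, pyramidal TIFF for the Serverless IIIF Image API
import pyvips

image = pyvips.Image.new_from_file("page-0001.png")  # placeholder filename
image.tiffsave(
    "page-0001.tif",
    tile=True,           # save as tiles so regions can be fetched individually
    pyramid=True,        # include progressively smaller versions of the image
    compression="jpeg",  # keep file sizes manageable
    tile_width=256,
    tile_height=256,
)
```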

The rest of the processing followed the approach I used with the NSW directories – using sqlite-utils and Datasette to package the data and deliver it online via Google Cloud Run.

Postscript: Time and money

I thought I should add a little note about costs (time and money) in case anyone was interested in using this workflow on other publications. I started working on this on Sunday afternoon and had a full working version up about 24 hours later – that includes a fair bit of work that I didn’t end up using, but doesn’t include the time I spent re-OCRing the text a day or so later. This was possible because I was reusing bits of code from other projects, and taking advantage of some awesome open-source software. Now that the processing pipeline is pretty well-defined and documented it should be even faster.

The search interface uses cloud services from Amazon and Google. It’s a bit tricky to calculate the precise costs of these, but here’s a rough estimate.

I uploaded 63.9 GB of images to Amazon S3. These should cost about US$1.47 per month to store.

The Serverless IIIF API uses Amazon’s Lambda service. At the moment my usage is within the free tier, so $0 so far.

The Datasette instance uses Google Cloud Run. Costs for this service are based on a combination of usage, storage space, and the configuration of the environment. The size of the database for the Tasmanian directories is about 600 MB, so I can get away with 2 GB of application memory. (The NSW Post Office directory currently uses 8 GB.) These services scale to zero – so basically they shut down if they’re not being used. This saves a lot of money, but means there can be a pause if they need to start up again. I’m running the Tasmanian and NSW directories, as well as the GLAM Name Index Search, within the same Google Cloud account, and I’m not quite sure how to itemise the costs. But overall, it’s costing me about US$4.00 a month to run them all. Of course if usage increases, so will the costs!

So I suppose the point is that these sorts of approaches can be quite a practical and cost-effective way of improving access to digitised resources, and don’t need huge investments in time or infrastructure.

If you want to contribute to the running costs of the NSW and Tasmanian directories you can sponsor me on GitHub.

Fresh harvest of OCRd text from Trove's digitised periodicals – 9 GB of text to explore and analyse!

I’ve updated the GLAM Workbench’s harvest of OCRd text from Trove’s digitised periodicals. This is a completely fresh harvest, so should include any corrections made in recent months. It includes:

  • 1,430 periodicals
  • OCRd text from 41,645 issues
  • About 9 GB of text

The easiest way to explore the harvest is probably this human-readable list. The list of periodicals with OCRd text is also available as a CSV. You can find more details in the Trove journals section of the GLAM Workbench, and download the complete corpus from CloudStor.

Finding which periodical issues in Trove have OCRd text you can download is not as easy as it should be. The fullTextInd index doesn’t seem to distinguish between digitised works (which have OCRd text) and born-digital publications (like PDFs) that have no downloadable text. You can use has:correctabletext to find articles with OCR, but you can’t get a full list of the periodicals the articles come from using the title facet. As this notebook explains, you can search for nla.obj, but this returns both digitised works and publications supplied through edeposit. In previous harvests of OCRd text I processed all of the titles returned by the nla.obj search, finding out whether there was any OCRd text by just requesting it and seeing what came back. But the number of non-digitised works on the list of periodicals in digital form has skyrocketed through the edeposit scheme, and this approach is no longer practical – you end up wasting a lot of time asking for things that don’t exist.

For the latest harvest I took a different approach. I only processed periodicals in digital form that weren’t identified as coming through edeposit. These are the publications with a fulltext_url_type value of either ‘digitised’ or ‘other’ in my dataset of digital periodicals. Is it possible that there’s some downloadable text in edeposit works that’s now missing from the harvest? Yep, but I think this is a much more sensible, straightforward, and reproducible approach.
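In code terms the filtering step is trivial – something like this, where the filename is just a placeholder for the dataset of periodicals in digital form:

```python
# Keep only the periodicals identified as digitised (rather than edeposit)
import pandas as pd

df = pd.read_csv("periodicals-in-digital-form.csv")  # placeholder filename
digitised = df.loc[df["fulltext_url_type"].isin(["digitised", "other"])]
```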

That’s not the only problem. As I noted when creating the list of periodicals in digital form, there are duplicates in the list, so they have to be removed. You then have to find information about the issues available for each title. This is not provided by the Trove API, but there is an internal API used in the web interface that you can access – see this notebook for details. I also noticed that sometimes, where a title has only a single issue, it’s presented as if each page is a separate issue. I think I’ve found a workaround for that as well.

All these doubts, inconsistencies and workarounds mean that I’m fairly certain I don’t have everything. But I do think I have most of the OCRd text available from digitised periodicals, and I do have a methodology, documented in this notebook, that at least provides a starting point for further investigation. As I noted in my wishlist for a Trove Researcher Platform, it would be great if more metadata for digitised works, other than newspapers, was made available through the API.

Explore Trove's digitised newspapers by place

I’ve updated my map displaying places where Trove digitised newspapers were published or distributed. You can view all the places on a single map – zoom in for more markers, and click on a marker for title details and a link back to Trove.

A map of Australia with coloured markers indicating the number of Trove’s digitised newspapers published in different locations around the country.

If you want to find newspapers from a particular area, just click on a location using this map to view the 10 closest titles.

A map section focused on Walhalla in eastern Victoria with markers indicating nearby places where Trove’s digitised newspapers were published. A column on the right lists the newspaper titles.

You can view or download the dataset used to construct the map. Place names were extracted from the newspaper titles using the Geoscience Gazetteer.

Making NSW Postal Directories (and other digitised directories) easier to search with the GLAM Workbench and Datasette

As part of my work on the Everyday Heritage project I’m looking at how we can make better use of digitised collections to explore the everyday experiences woven around places such as Parramatta Road in Sydney. For example, the NSW Postal Directories from 1886 to 1908 and 1909 to 1950 have been digitised by the State Library of NSW and made available through Trove. The directories list residences and businesses by name and street location. Using them we can explore changes in the use of Parramatta Road across 60 years of history.

But there’s a problem. While you can browse the directories page by page, searching is clunky. Trove’s main search indexes the contents of the directories by ‘article’. Each ‘article’ can be many pages long, so it’s difficult to focus in on the matching text. Clicking through from the search results to the digitised volume lands you in another set of search results, showing all the matches in the volume. However, the internal search index works differently to the main Trove index – in particular, it doesn’t seem to understand phrase or boolean searches. If you start off searching for “parramatta road”, Trove tells you there are 50 matching articles, but if you click through to a volume you’re told there are no results. If you remove the quotes you get every match for ‘parramatta’ or ‘road’. It’s all pretty confusing.

The idea of ‘articles’ is really not very useful for publications like the Post Office Directories where information is mostly organised by column, row or line. You want to be able to search for a name, and go directly to the line in the directory where that name is mentioned. And now you can! Using Datasette, I’ve created an interface that searches by line across all 54 volumes of the NSW Post Office Directory from 1886 to 1950 (that’s over 30 million lines of text).

Screenshot of home page for the NSW Post Office Directories. There’s a search box to search across all 54 volumes, some search tips, and a summary of the data that notes there are more than 30 million rows.

Try it now!

Basic features

  • The full text search supports phrase, boolean, and wildcard searches. Just enter your query in the main search box to get results from all 54 volumes in a flash.

  • Each search result is a single line of text. Click on the link to view this line in context – it’ll show you 5 lines above and below your match, as well as a zoomable image of the digitised page from Trove.

  • For more context, you can click on View full page to see all the lines of text extracted from that page. You can then use the Next and Previous buttons to browse page by page.

  • To view the full digitised volume, just click on the View page in Trove button.

Screenshot of information about a single row in the NSW Post Office Directories. The row is highlighted in yellow, and displayed in context with five rows above and below. There’s a button to view the full page, and a box displaying a zoomable image of the page from Trove.

How it works

There were a few stages involved in creating this resource, but mostly I was able to reuse bits of code from the GLAM Workbench’s Trove journals and books sections, and other related projects such as the GLAM Name Index Search. Here’s a summary of the processing steps:

  • I started with the two top-level entries for the NSW Postal Directories, harvesting details of the 54 volumes under them.
  • For each of these 54 volumes, I downloaded the OCRd text page by page. Downloading the text by page, rather than volume, was very slow, but I thought it was important to be able to link each line of text back to its original page.
  • To create links back to pages, I also needed the individual identifiers for each page. A list of page identifiers is embedded as a JSON string within each volume’s HTML, so I extracted this data and matched the page ids to the text.
  • Using sqlite-utils, I created a SQLite database with a separate table for every volume. Then I processed the text by volume, page, and line – adding each line of text and its page details as an individual record in the database (see the sketch after this list).
  • I then ran full text indexing across each line to make it easily searchable.
  • Using Datasette and its search-all plugin, I loaded up the database and BINGO! More than 30 million lines of text across 54 digitised volumes were instantly searchable.
  • To make it all public, I used Datasette’s publish function to push the database to Google’s Cloud Run service.
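To give a sense of the database-building and indexing steps, here’s a simplified sketch using sqlite-utils – the table and column names are illustrative rather than the exact ones used in the notebooks:

```python
# Simplified sketch of building the line-indexed database with sqlite-utils
import sqlite_utils

db = sqlite_utils.Database("nsw-post-office-directories.db")  # placeholder filename

rows = [
    {
        "page_id": "nla.obj-123456789",  # placeholder page identifier
        "page": 10,
        "line": 3,
        "text": "Smith John, grocer, 45 Parramatta rd",
    },
    # ... one record for every line of OCRd text in the volume
]

volume = db["1886"]          # a separate table for each volume
volume.insert_all(rows)
volume.enable_fts(["text"])  # full-text index each line of text

# The finished database can then be published with Datasette, e.g.:
# datasette publish cloudrun nsw-post-office-directories.db --service=post-office
```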

All the code is available in the journals section of the GLAM Workbench.

Future developments

One of the most exciting things to me is that this processing pipeline can be used with any digitised publication on Trove where it would be easier to search by line rather than article. Any suggestions?

Interested in Victorian shipwrecks? Kim Doyle and Mitchell Harrop have added a new notebook to the Heritage Council of Victoria section of the GLAM Workbench exploring shipwrecks in the Victorian Heritage Database: glam-workbench.net/heritage-…

Updates!

Minor update to RecordSearch Data Scraper – now captures ‘institution title’ for agencies if it is present. pypi.org/project/r…

Many thanks to the British Library – sponsors of the GLAM Workbench’s web archives section!

You might have noticed some changes to the web archives section of the GLAM Workbench.

Screenshot of the web archives section showing the acknowledgement of the British Library's sponsorship.

I’m very excited to announce that the British Library is now sponsoring the web archives section! Many thanks to the British Library and the UK Web Archive for their support – it really makes a difference.

The web archives section was developed in 2020 with the support of the International Internet Preservation Consortium’s Discretionary Funding Programme, in collaboration with the British Library, the National Library of Australia, and the National Library of New Zealand. It’s intended to help historians, and other researchers, understand what sort of data is available through web archives, how to get it, and what you can do with it. It provides a series of tools and examples that document existing APIs, and explore questions such as how web pages change over time. The notebooks focus on four particular web archives: the UK Web Archive, the Australian Web Archive (National Library of Australia), the New Zealand Web Archive (National Library of New Zealand), and the Internet Archive. However, the tools and approaches could be easily extended to other web archives (and soon will be!). I introduced the web archives section of the GLAM Workbench in a seminar for the IIPC in August 2020.

According to the Binder launch stats, the web archives section is the most heavily used part of the GLAM Workbench. In December 2020, it won the British Library Labs Research Award. Last year I updated the repository, automating the build of Docker images, and adding integrations with Zenodo, Reclaim Cloud, and Australia’s Nectar research cloud. I’m also thinking about some new notebooks – watch this space!

The GLAM Workbench receives no direct funding from government or research agencies, and so the support of sponsors like the British Library and all my other GitHub sponsors is really important. Thank you! If you think this work is valuable, have a look at the Get involved! page to see how you can contribute. And if your organisation would like to sponsor a section of the GLAM Workbench, let me know!

New GLAM data to search, visualise and explore using the GLAM Workbench!

There’s lots of GLAM data out there if you know where to look! For the past few years I’ve been harvesting a list of datasets published by Australian galleries, libraries, archives, and museums through open government data portals. I’ve just updated the harvest and there are now 463 datasets containing 1,192 files. There’s a human-readable version of the list that you can browse. If you just want the data you can download it as a CSV. Or if you’d like to search the list there’s a database version hosted on Glitch. The harvesting and processing code is available in this notebook.

The GLAM data from government portals section of the GLAM Workbench provides more information and a summary of results. For example, here’s a list of the number of data files by GLAM institution.

Table showing the number of datasets published by each GLAM organisation. Queensland State Archives is on top with 108 datasets.

Most of the datasets are in CSV format, and most have a CC-BY licence.

What’s inside?

Obviously it’s great that GLAM organisations are sharing lots of open data, but what’s actually inside all of those CSV files? To help you find out, I created the GLAM CSV Explorer. Click on the blue button to run it in Binder, then just select a dataset from the dropdown list. The CSV Explorer will download the file, and examine the contents of every field to try and determine the type of data it holds – such as text, dates, or numbers. It then summarises the results and builds a series of visualisations to give you an overview of the dataset.
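Under the hood it’s doing something like the following – a simplified sketch, not the Explorer’s actual code:

```python
# Sketch of field-by-field type inspection for a GLAM CSV file
import pandas as pd

df = pd.read_csv("glam-dataset.csv")  # placeholder filename
for column in df.columns:
    values = df[column].dropna().astype(str)
    numbers = pd.to_numeric(values, errors="coerce")
    dates = pd.to_datetime(values, errors="coerce")
    # Guess the field type from the proportion of values that parse cleanly
    if numbers.notna().mean() > 0.8:
        kind = "numbers"
    elif dates.notna().mean() > 0.8:
        kind = "dates"
    else:
        kind = "text"
    print(f"{column}: mostly {kind}")
```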

Screenshot of GLAM CSV Explorer. A series of dropdown boxes allow you to select a dataset to analyse.

Search for names

Many of the datasets are name indexes to collections of records – GLAM staff or volunteers have transcribed the names of people mentioned in records as an aid to users. For Family History Month last year I aggregated all of the name indexes and made them searchable through a single interface using Datasette. The GLAM Name Index Search has been updated as well – it searches across 10.3 million records in 253 indexes from 10 GLAM organisations. And it’s free!

Screenshot of GLAM Name Index Search showing the list of GLAM organisations with the number of tables and rows you can search from each.

And a bit of maintenance…

As well as updating the data, I also updated the code repository, adding the features that I’m rolling out across the whole of the GLAM Workbench. This includes automated Docker builds saved to Quay.io, integrations with Reclaim Cloud and Zenodo, and some basic quality controls through testing and code format checks.

Zotero now saves links to digitised items in Trove from the NLA catalogue!

I’ve made a small change to the Zotero translator for the National Library of Australia’s catalogue. Now, if there’s a link to a digitised version of the work in Trove, that link will be saved in Zotero’s url field. This makes it quicker and easier to view digitised items – just click on the ‘URL’ label in Zotero to open the link.

It’s also handy if you’re viewing a digitised work in Trove and want to capture the metadata about it. Just click on the ‘View catalogue’ link in the details tab of a Trove item, then use Zotero to save the details from the catalogue.

View embedded JSON metadata for Trove's digitised books and journals

The metadata for digitised books and journals in Trove can seem a bit sparse, but there’s quite a lot of useful metadata embedded within Trove’s web pages that isn’t displayed to users or made available through the Trove API. This notebook in the GLAM Workbench shows you how you can access it. To make it even easier, I’ve added a new endpoint to my Trove Proxy that returns the metadata in JSON format.

Just pass the url of a digitised book or journal as a parameter named url to https://trove-proxy.herokuapp.com/metadata/. For example:

https://trove-proxy.herokuapp.com/metadata/?url=https://nla.gov.au/nla.obj-2906940941
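Or from Python, using the same example url:

```python
# Fetch the embedded metadata for a digitised work via the Trove Proxy
import requests

response = requests.get(
    "https://trove-proxy.herokuapp.com/metadata/",
    params={"url": "https://nla.gov.au/nla.obj-2906940941"},
)
metadata = response.json()
print(list(metadata.keys()))  # fields such as 'title', 'accessConditions', 'marcData', 'children'
```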

Screenshot of the collapsed JSON metadata returned from the url above. It includes fields such as 'title', 'accessConditions', 'marcData', and 'children'.

I’ve also created a simple bookmarklet that opens the proxy for you. To use it:

  • Drag this link to your bookmarks toolbar: Get Trove work metadata
  • View a digitised book or journal in Trove.
  • Click on the bookmarklet to view the metadata in JSON format.

To view the JSON data in your browser you might need to install an extension like JSONView.

Where did all those NSW articles go? Trove Newspapers Data Dashboard update!

I was looking at my Trove Newspapers Data Dashboard again last night trying to figure out why the number of newspaper articles from NSW seemed to have dropped by more than 700,000 since my harvesting began. It took me a while to figure out, but it seems that the search index was rebuilt on 31 May, and that caused some major shifts in the distribution of articles by state, as reported by the main result API. So the indexing of the articles changed, not the actual number of articles. Interestingly, the number of articles by state reported by the newspaper API doesn’t show the same fluctuations.

Screenshot of data dashboard that compares the number of articles by state as reported by the results and newspapers APIs. There are major differences in the column that shows the change since April 2022.

This adds another layer of complexity to understanding how Trove changes over time. To try and document such things, I’ve added a ‘Significant events’ section to the Dashboard. I’ve also included a new ‘Total articles by publication state’ section that compares results from the result and newspaper APIs. This should make it easier to identify such issues in the future.

Stay alert people – remember, search interfaces lie!

Catching up – some recent GLAM Workbench updates!

There’s been lots of small updates to the GLAM Workbench over the last couple of months and I’ve fallen behind in sharing details. So here’s an omnibus list of everything I can remember…

Data

  • Weekly harvests of basic Trove newspaper data continue – there’s now about three months’ worth. You can view a summary of the harvested data through the brand new Trove Newspaper Data Dashboard. The Dashboard is generated from a Jupyter notebook and is updated whenever there’s a new data harvest.
  • There are also weekly harvests of files digitised by the NAA – now 16 months’ worth of data.
  • Updated harvest of Trove public tags (Zenodo) – includes 2,201,090 unique public tags added to 9,370,614 resources in Trove between August 2008 and July 2022.
  • I’ve started moving other pre-harvested datasets out of the GLAM Workbench code repositories and into their own data repositories. This means better versioning and citability. The first example is the list of Trove newspapers with articles from after the 1955 copyright cliff of death – here’s the GitHub repo, and the Zenodo record.
  • To bring together datasets that provide historical data about Trove itself, I’ve created a Trove historical data community on Zenodo. Anyone’s welcome to contribute. There’s much more to come.

Tag cloud showing the frequency of the two hundred most commonly-used tags in Trove.

Tag cloud generated from the latest harvest of Trove Tags

Code

  • Big thanks to Mitchell Harrop who contributed a new Heritage Council of Victoria section to the GLAM Workbench providing examples using the Victorian Heritage Database API.
  • The troveharvester Python package has been updated, mainly to remove annoying Pandas warnings and to make use of the trove-query-parser package.
  • As a result of the above, the Trove Newspaper & Gazette Harvester section of the GLAM Workbench has been updated. No major changes to notebooks, but I’ve implemented basic testing and linting to improve code quality.
  • The Trove newspapers section of the GW has been updated. There were a few bug fixes and minor improvements. In particular there was a problem downloading data and HTML files from QueryPic, and some date queries in QueryPic were returning no results.
  • The tool to download complete, high-res newspaper page images has been updated so that you no longer need to supply an API key. I’ve also fixed a problem with displaying the images in Voilà.
  • The recordsearch_data_scraper Python package has been updated. This fixes a bug where agency and series searches with only one result weren’t being captured properly.
  • The RecordSearch section of the GW has been updated. This incorporates the above update, but I took the opportunity to update all packages, and implement basic testing and linting. The Harvest items from a search in RecordSearch notebook has been simplified and reorganised. There are two new notebooks: Exploring harvested series data, 2022 – generates some basic statistics from the harvest of series data in 2022 and compares the results to the previous year; Summary of records digitised in the previous week – run this notebook to analyse the most recent dataset of recently digitised files, summarising the results by series.
  • A new Zotero translator for Libraries Tasmania has been developed – see the next post for details.

Calling all Tasmanian historians – you can now save resources from Libraries Tasmania into Zotero!

I’ve created a Zotero translator for the Libraries Tasmania catalogue. Using it, you can save metadata and digital resources to your own research database with a single click. Libraries Tasmania actually has three catalogues rolled into one – the main library catalogue, the Archives catalogue, and the Names Index. The translator works across all three. Features include:

  • Select and save items from a page of search results.
  • Save individual items across the full range of formats. (By default, individual records in the catalogue open in a modal overlay. For Zotero to recognise the item you need to click on the Permalink button and open the record on a separate page.)
  • Automatically download digital images and PDFs attached to records. This works when the record points to a particular page – it won’t download multiple images from a single link. However, if a record contains multiple links to digitised pages (such as the Convict records in the Names Index), you’ll get them all!
  • Fields in the Archives catalogue and Names Index that don’t map to Zotero properties are saved as key/value pairs in Zotero’s ‘Extra’ field.

Screenshot of Zotero interface showing captured Libraries Tasmania records.

The translator is now included in the main Zotero repository so should install and update itself automatically. If the Zotero browser extension doesn’t seem to be detecting Libraries Tasmania items you can force an update by right clicking on the Zotero icon in your browser toolbar and clicking on Preferences > Advanced > Update translators.

My work on this translator was not entirely altruistic – it’s going to be very useful in the Everyday Heritage project as Kate Bagnall and I try to bring together sources relating to Chinese heritage in Tasmania.

But I’m also very happy to be able to update my spreadsheet of Zotero support in Australian GLAM organisations and put Libraries Tasmania in the green! #dhhacks

Screen capture of spreadsheet showing full Zotero support for Libraries Tasmania

Updated dataset! Harvests of Trove list metadata from 2018, 2020, and 2022 are now available on Zenodo: doi.org/10.5281/z… Another addition to the growing collection of historical Trove data. #GLAMWorkbench

Screen capture of version information from Zenodo showing that there are three available versions, v1.0, v1.1, and v1.2.

Updated dataset! Details of 2,201,090 unique public tags added to 9,370,614 resources in Trove between August 2008 and July 2022. Useful for exploring folksonomies, and the way people organise and use massive online resources like Trove. doi.org/10.5281/z…

Ok, I’ve created a Zenodo community for datasets documenting changes in the content and structure of Trove. Lots more to add… zenodo.org/communiti…

Coz I love making work for myself, I’ve started pulling datasets out of #GLAMWorkbench code repos & creating new data repos for them. This way they’ll have their own version histories in Zenodo. Here’s the first: github.com/GLAM-Work…

Ahead of my session at #OzHA2022 tomorrow, I’ve updated the NAA section of the #GLAMWorkbench. Come along to find out how to harvest file details, digitised images, and PDFs, from a search in RecordSearch! github.com/GLAM-Work…

55,633 items digitised by the National Archives of Australia last week. Including:

  • Bonegilla name index cards (A2571 & A2572): +42,434
  • CMF Personnel Dossiers (B884): +10,150
  • Aust Women’s Land Army personnel cards (C610): +961

github.com/wragge/na…

  • A2571, Name Index Cards, Migrants Registration [Bonegilla]: 33,686 files digitised
  • B884, Citizen Military Forces Personnel Dossiers, 1939-1947: 10,150 files digitised
  • A2572, Name Index Cards, Migrants Registration [Bonegilla]: 8,748 files digitised
  • C610, Australian Women's Land Army - personnel cards, alphabetical series: 961 files digitised
  • A9301, RAAF Personnel files of Non-Commissioned Officers (NCOs) and other ranks, 1921-1948: 735 files digitised
  • D874, Still photograph outdoor and studio negatives, annual single number series with N prefix (and progressive alpha infix A-K from 1948-1957): 624 files digitised
  • B883, Second Australian Imperial Force Personnel Dossiers, 1939-1947: 163 files digitised
  • J853, Architectural plans, annual single number series with alpha (denoting Papua New Guinea and discipline) prefix and/or alpha/numeric (denoting size and amendment) suffix: 161 files digitised
  • A14487, Royal Australian Air Force Air Board and Air Council Agendas, Submissions and Determinations - Master Copy: 102 files digitised
  • A2478, Non-British European migrant selection documents: 21 files digitised
  • D4881, Alien registration cards, alphabetical series: 18 files digitised
  • A471, Courts-Martial files [including war crimes trials], single number series: 10 files digitised
  • A1877, British migrants - Selection documents for free or assisted passage (Commonwealth nominees): 9 files digitised
  • A13860, Medical Documents - Army (Department of Defence Medical Documents): 9 files digitised
  • A1196, Correspondence files, multiple number series [Class 501] [501-539] [Classified] [Main correspondence files series of the agency]: 9 files digitised
  • B78, Alien registration documents: 8 files digitised
  • A712, Letters received, annual single number series with letter prefix or infix: 6 files digitised
  • A12372, RAAF Personnel files - All Ranks [Main correspondence files series of the agency]: 6 files digitised
  • AP476/4, Applications etc. for registration of copyright of literary, dramatic and musical productions, pictures etc.: 6 files digitised
  • A714, Books of duplicate certificates of naturalization A(1)[Individual person] series: 6 files digitised

Newspapers added to Trove last week

  • Freelance (WA)
  • The Standard (WA)
  • Berrigan Advocate (NSW)
  • Baileys Sporting & Dramatic Weekly (WA)
  • Farmers’ Weekly (WA)
  • Harvey-Waroona Mail (WA)
  • W.A. Family Sphere (WA)
  • Coonabarabran Times (NSW)

github.com/wragge/tr…

Noticed that QueryPic was having a problem with some date queries. Should be fixed in the latest release of the Trove Newspapers section of the #GLAMWorkbench: glam-workbench.net/trove-new… #maintenance #researchinfrastructure

The Trove Newspapers section of the #GLAMWorkbench has been updated! Voilà was causing a problem in QueryPic, stopping results from being downloaded. A package update did the trick! Everything now updated & tested. glam-workbench.net/trove-new…

Some more #GLAMWorkbench maintenance – this app to download high-res page images from Trove newspapers no longer requires an API key if you have a url, & some display problems have been fixed. trove-newspaper-apps.herokuapp.com/voila/ren…

Screenshot of the ‘Download a page image’ app. The Trove web interface doesn't provide a way of getting high-resolution page images from newspapers. This simple app lets you download page images as complete, high-resolution JPG files.