26 May 2022

Using Datasette on Nectar

If you have a dataset that you want to share as a searchable online database then check out Datasette – it’s a fabulous tool that provides an ever-growing range of options for exploring and publishing data. I particularly like how easy Datasette makes it to publish datasets on cloud services like Google Cloud Run and Heroku. A couple of weekends ago I migrated the TungWah Newspaper Index to Datasette. It’s now running on Heroku, and I can push updates to it in seconds.

I’m also using Datasette as the platform for sharing data from the Sydney Stock Exchange Project that I’m working on with the ANU Archives. There’s a lot of data – more than 20 million rows – but getting it running on Google Cloud Run was pretty straightforward with Datasette’s publish command. The problem, however, was that Datasette is configured to run on most cloud services in ‘immutable’ mode, and we want authenticated users to be able to improve the data. So I needed to explore alternatives.

I’ve been working with Nectar over the past year to develop a GLAM Workbench application that helps researchers do things like harvesting newspaper articles from a Trove search. So I thought I’d have a go at setting up Datasette in a Nectar instance, and it works! Here are a few notes on what I did…

First, of course, you need to get yourself a resource allocation on Nectar. I’ve also got a persistent volume storage allocation that I’m using for the data.
From the Nectar Dashboard I made sure that I had an SSH keypair configured, and created a security group to allow access via SSH, HTTP and HTTPS. I also set up a new storage volume.
I then created a new virtual machine using the Ubuntu 22.04 image, attaching the keypair, security group, and volume storage. For the stock exchange data I’m currently using the ‘m3.medium’ flavour of virtual machine, which provides 8GB of RAM and 4 vCPUs. This might be overkill, but I went with the bigger machine because of the size of the SQLite database (around 2GB). This is similar to what I used on Cloudstor after I ran into problems with the memory limit. I think most projects would run perfectly well using one of the ‘small’ flavours. In any case, it’s easy to resize if you run into problems.
Once the new machine was running I grabbed the IP address. Because I have DNS configured on my Nectar project, I also created a ‘datasette’ subdomain from the DNS dashboard by pointing an ‘A’ (address) record to the IP address.
Using the IP address I logged into the new machine via SSH.
With all the Nectar config done, it was time to set up Datasette. I mainly just followed the excellent instructions in the Datasette documentation for deploying Datasette using systemd. This involved installing datasette via pip, creating a folder for the Datasette data and configuration files, and creating a datasette.service file for systemd.
I also used the datasette install command to add a couple of Datasette plugins. One of these is the datasette-auth-github plugin, which needs a couple of secret tokens set. I added these as environment variables in the datasette.service file.
The systemd setup uses Datasette’s configuration directory mode. This means you can put your database, metadata definitions, custom templates and CSS, and any other settings all together in a single directory and Datasette will find and use them. I’d previously passed runtime settings via the command line, so I had to create a settings.json for these.
Then I just uploaded all my Datasette database and configuration files to the folder I created on the virtual machine using rsync and started the Datasette service. It worked!
The next step was to use the persistent volume storage for my Datasette files. The persistent storage exists independently of the virtual machine, so you don’t need to worry about losing data if there’s a change to the instance. I mounted the storage volume as /pvol in the virtual machine as the Nectar documentation describes.
I created a datasette-root folder under /pvol, copied the Datasette files to it, and changed the datasette.service file to point to it. This didn’t seem to work and I’m not sure why. So instead I created a symbolic link at /home/ubuntu/datasette-root pointing to /pvol/datasette-root and set the path in the service file back to /home/ubuntu/datasette-root. This worked! So now the database and configuration files are sitting in the persistent storage volume.
To make the new Datasette instance visible to the outside world, I installed nginx, and configured it as a Datasette proxy using the example in the Datasette documentation.
Finally I configured HTTPS using certbot.
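
The switch from command-line settings to a settings.json file in the configuration directory can be sketched in Python. This is only an illustration, not the configuration used for the stock exchange data: the setting names are standard Datasette settings, but the values and the write_settings helper are assumptions, and it assumes a Datasette release that reads settings.json in configuration directory mode.

```python
import json
import tempfile
from pathlib import Path


def write_settings(config_dir, settings):
    """Write Datasette settings to settings.json inside config_dir."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    settings_path = config_dir / "settings.json"
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings_path


# Hypothetical values, standing in for options that might otherwise be
# passed on the command line as `--setting sql_time_limit_ms 3500` and so on.
EXAMPLE_SETTINGS = {
    "sql_time_limit_ms": 3500,
    "max_returned_rows": 2000,
    "default_page_size": 50,
}

if __name__ == "__main__":
    # A temporary directory is used here so the sketch runs anywhere;
    # on the server the target would be the datasette-root folder.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_settings(Path(tmp) / "datasette-root", EXAMPLE_SETTINGS)
        print(path.read_text())
```

Keeping the settings in a file alongside the database and metadata means the systemd service only needs to point Datasette at the one directory.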
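
The symbolic link arrangement, where the service keeps using a path in the home directory while the files live on the persistent volume, can also be sketched in Python. The link_to_persistent helper is a hypothetical name for illustration, and the demo uses temporary folders standing in for /pvol/datasette-root and /home/ubuntu/datasette-root so that it runs without touching a real server.

```python
import tempfile
from pathlib import Path


def link_to_persistent(link_path, target_dir):
    """Create link_path as a symbolic link pointing at target_dir."""
    link_path = Path(link_path)
    target_dir = Path(target_dir)
    if not target_dir.is_dir():
        raise FileNotFoundError(f"{target_dir} is not a directory")
    if link_path.is_symlink():
        # Replace an existing link so the function can be re-run safely.
        link_path.unlink()
    elif link_path.exists():
        # Refuse to clobber a real file or folder.
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    link_path.symlink_to(target_dir, target_is_directory=True)
    return link_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for /pvol/datasette-root on the persistent volume.
        target = Path(tmp) / "pvol" / "datasette-root"
        target.mkdir(parents=True)
        # Stand-in for /home/ubuntu/datasette-root.
        link = link_to_persistent(Path(tmp) / "datasette-root", target)
        print(link, "->", link.resolve())
```

On the virtual machine itself the same result comes from a single ln -s command; the point of the sketch is just the direction of the link, from the home directory path to the folder on the volume.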

Although the steps above might seem complicated, it was mainly just a matter of copying and pasting commands from the existing documentation. The new Datasette instance is running here, but this is just for testing and will disappear soon. If you’d like to know more about the Stock Exchange Project, check out the ANU Archives section of the GLAM Workbench.

Tim Sherratt
