Generating a GraphQL API for Calibre (or any SQLite database)

Ramblin' ahead, skip to the code?

Coming home, autopilot usually kicks in. Doors naturally close, shoes evaporate off somewhere and all of the little bits and bobs, keys and cards that have been weighing down my pockets unceremoniously get dumped in a small bowl.

A keen eye might spot similar bowls around the place, unobtrusive gifts, and souvenirs keeping the daily debris from spilling over.

I see the SQLite databases on my devices as the digital equivalent of those bowls. There they are, hiding in plain sight across the filesystems, keeping their application's data from cluttering up the place.

It's a remarkable piece of technology, and one of the rarer examples that actually gets used everywhere: likely trillions of installations. Much (but not enough!) has been said about the project's testing strategy.

One such installation is the database that keeps track of my books inside Calibre. If you're not familiar, Calibre is an ebook library management tool for browsing, editing, and converting books between different formats. Sometimes I see people having... let's call them unfulfilled expectations about the look and feel; it hasn't bothered me, though.

Mainly because I rarely use it, only for fixing otherwise tricky issues, and I've no complaints there. To interact with my library I've been using Calibre-Web, a sister project which, as the name implies, is a web service instead of a desktop app. Different developers and feature sets, but it reads the same database.

Now yesterday, I was itching to try out some small experiments. These would have been simple if Calibre-Web had a REST/GraphQL API to play with, but alas, it is a server-side rendered Flask app. So is there a quick way to spin up an API to expose the data in the database?

Of course, one of the big strengths of SQLite is that it's easily embedded into pretty much any language and project. It's simple to write some CRUD handlers or GraphQL resolvers for the data.

Do you want to know how I feel about writing another controller, service, repository, followed by some semi-decent solution to get sorting and filtering in there without making a mess or losing interest along the way?

:'(

That is the goal of this article: A GraphQL API surfacing data from a Calibre SQLite database, without tears.

The Plan

Tools that automatically generate APIs from databases are popping up like dandelions through the asphalt of ORMs. Some of these target SQLite, e.g. the data exploration tool Datasette can provide a read-only JSON API. For my purpose, however, I'm more interested in read-write, and I'm also more familiar with PostgREST for Postgres.

Postgres. Yes, you have to squint quite hard to confuse Postgres with SQLite. But don't worry, to help blur the vision Postgres helpfully provides Foreign Data Wrappers (FDW).

It looks like you're querying a table, but aha! Is it actually a Parquet file on S3, or your Stripe payments? Quite useful, and we'll use sqlite_fdw to bridge the two databases.

The data will still be stored in SQLite (unless you add a materialized view), but with the tables accessible in Postgres, pointing PostgREST at them gives us an API ready to go.

The picture we've painted is a little bit absurd. Part of me wants more, piling on layers: use a BLOB column to store a SQLite database, fetch and query it client-side with SQLite Wasm, then re-persist it.

Strange Loop ambitions aside, replace SQLite with another FDW, and the utility of the dance we're lining up for changes. Let's take the first steps now!

Connecting to Calibre

First of all, let's get a Calibre database and connect to it. You can download a sample database from here to follow along.

$ sqlite3 metadata.db "select title, author_sort from books limit 3;" -line
      title = Nation
author_sort = Pratchett, Terry

      title = A Pattern Language
author_sort = Alexander, Christopher

      title = Brewer's Dictionary of Phrase and Fable
author_sort = Brewer, Ebenezer Cobham 

Looks like we have some books in there. If you're working with an unknown database, you can omit the query to open the db in interactive mode and view the schema using the .schema command.
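
If you prefer staying in SQL, the same information is available through SQLite's built-in sqlite_master table:

-- List the tables and their CREATE statements
SELECT name, sql FROM sqlite_master WHERE type = 'table' ORDER BY name;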

Loading sqlite_fdw into Postgres

To read the SQLite database from Postgres we need to install the sqlite_fdw extension, so let's prepare a quick Dockerfile.

FROM postgres:16

RUN apt-get update -y && apt-get -y install \
  make \
  gcc \
  postgresql-server-dev-16 \
  libsqlite3-dev \
  pgxnclient \
  && rm -rf /var/lib/apt/lists/*

# Install sqlite_fdw extension using the pgxn repository
RUN USE_PGXS=1 pgxn install 'sqlite_fdw=2.4.0'

Build the image and run it using your favorite dark incantations, or try this docker-compose.yml.

version: '3.9'
services:
  db:
    build: .
    volumes:
      - ./metadata.db:/data/metadata.db
    ports:
      - 5432:5432
    environment:
      POSTGRES_PASSWORD: postgres

Let's see if it makes us a Postgres. If you encounter an error, I'd expect dependency issues, so take a gander through the sqlite_fdw readme.

docker compose up -d
docker compose exec db psql -U postgres

Hopefully, you've been dropped into a postgres shell. If not, I'm going to grab a cup of tea while you consult your LLM or search engine of choice to debug the issue. If it did work, I'm already on my way to the kettle so see you here when I'm back.

...

Good, next up is a little bit of configuration inside Postgres. We need to enable the extension we installed in the Dockerfile, tell Postgres how to connect to our database, and finally import or set up the tables.

-- Enable the extension
CREATE EXTENSION sqlite_fdw;

-- Set up the FDW for our SQLite database
CREATE SERVER sqlite_server
FOREIGN DATA WRAPPER sqlite_fdw
OPTIONS (database '/data/metadata.db');

-- Import all tables in the database into our public schema
IMPORT FOREIGN SCHEMA calibre
FROM SERVER sqlite_server
INTO public;

Now let's try to query the books, this time in Postgres.

postgres=# SELECT id, title, author_sort FROM books LIMIT 3;

 id |            title            |    author_sort
----+-----------------------------+-------------------
  1 | Dune                        | Herbert, Frank
  2 | The Return of the King      | Tolkien, J.R.R.
  3 | One Piece #1 - Romance Dawn | Oda, Eiichiro

Fantastic!
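
Since sqlite_fdw also supports writes, the read-write part I was after should work too. A quick sanity check (on a copy of the database, unless you want your library edited for real):

-- This UPDATE goes through the FDW and lands in metadata.db itself
UPDATE books SET title = 'Dune (Annotated)' WHERE id = 1;
SELECT id, title FROM books WHERE id = 1;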

If you enter \d to view the tables, you might see a few that you don't really care about or want to expose. In that case, you can use the LIMIT TO and EXCEPT options of IMPORT FOREIGN SCHEMA, or import the tables one by one using CREATE FOREIGN TABLE.
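
For instance, a sketch that imports only the handful of Calibre tables used later in this post:

-- Restrict the import to a subset of tables (drop the others first if you already imported everything)
IMPORT FOREIGN SCHEMA calibre LIMIT TO (books, authors, series, books_authors_link, books_series_link)
FROM SERVER sqlite_server
INTO public;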

Adding PostgREST

With half of our strange connection working, let's turn to the API by adding a PostgREST container to our compose file. Normally, you would use a separate user for the API, granting it access to only those schemas, tables, or rows (via RLS) that it should be able to read and update; a sketch of that follows the compose file, but here we'll keep things simple.

version: '3.9'
services:
  db:
    build: .
    volumes:
      - ./metadata.db:/data/metadata.db
    ports:
      - 5432:5432
    environment:
      POSTGRES_PASSWORD: postgres

  api:
    image: postgrest/postgrest
    environment:
      PGRST_DB_URI: "postgres://postgres:postgres@db:5432/postgres"
      PGRST_DB_SCHEMA: "public"
      # We're skipping over authentication, using the superuser account
      PGRST_DB_ANON_ROLE: "postgres"
    ports:
      - 3000:3000
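
As promised, a minimal sketch of what a dedicated API role could look like (the role names are illustrative, and we'll stay on the superuser for this experiment):

-- Read-only role that PostgREST would use for anonymous requests
CREATE ROLE calibre_anon NOLOGIN;
GRANT USAGE ON SCHEMA public TO calibre_anon;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO calibre_anon;

-- Login role for PostgREST itself, allowed to switch into calibre_anon
CREATE ROLE authenticator LOGIN PASSWORD 'change-me' NOINHERIT;
GRANT calibre_anon TO authenticator;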

After a little docker compose up -d to get our new friend up and running, let's try to curl the API!

$ curl "localhost:3000/books?select=id,title&limit=3&offset=6"

[
  {"id":7,"title":"Oathbringer","author_sort":"Sanderson, Brandon"},
  {"id":8,"title":"The Nature of Order: An Essay on the Art of Building and the Nature of the Universe, Book 1 - The Phenomenon of Life","author_sort":"Alexander, Christopher"},
  {"id":9,"title":"Around the World in Eighty Days","author_sort":"Verne, Jules"}
]

There they are! Our little books are now exposed over HTTP. You can see the query parameters mimicking parts of a SQL query, and while the PostgREST API is extensive, it does not provide nested routes for related resources like you might expect from a RESTful API. Instead, embedding related resources is done in the select query parameter: /books?select=id,title,author(id,name).

At this step, there are a few directions this dance could go, but rather than walking through them I'd recommend reading the PostgREST documentation. What I want to show instead is how to extend this API to also speak GraphQL, which handles nested resources more elegantly. That is, after all, what I put in the title of the post.

Adding GraphQL

While there are other ways of doing this (Hasura, PostGraphile), we are going to leverage the pg_graphql extension from Supabase to keep parsing, querying, and building JSON objects all inside the database.

The first step is to extend the Dockerfile to install the pg_graphql extension as well. It is built using pgrx, so we need a little bit more code here.

FROM postgres:16

RUN apt-get update -y && apt-get -y install \
  make \
  gcc \
  postgresql-server-dev-16 \
  libsqlite3-dev \
  pgxnclient \
  curl \
  pkg-config \
  git \
  && rm -rf /var/lib/apt/lists/*

RUN USE_PGXS=1 pgxn install 'sqlite_fdw=2.4.0'

WORKDIR /home/postgres
ENV HOME=/home/postgres \
  PATH=/home/postgres/.cargo/bin:$PATH
RUN chown postgres:postgres /home/postgres
USER postgres

RUN \
  curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y --no-modify-path --profile minimal --default-toolchain nightly && \
  rustup --version && \
  rustc --version && \
  cargo --version

# Install pgrx
RUN cargo install cargo-pgrx --version 0.11.2 --locked
RUN cargo pgrx init --pg${PG_MAJOR} $(which pg_config)

# Fetch the pg_graphql sources
RUN git clone https://github.com/supabase/pg_graphql.git

USER root
# Build and install pg_graphql
RUN cd pg_graphql && cargo pgrx install
USER postgres

We need a little bit more setup on the SQL side as well.

create extension pg_graphql;

-- Create a function we can call from PostgREST /rpc/graphql
create function graphql(
    "operationName" text default null,
    query text default null,
    variables jsonb default null,
    extensions jsonb default null
)
    returns jsonb
    language sql
as $$
    select graphql.resolve(
        query := query,
        variables := coalesce(variables, '{}'),
        "operationName" := "operationName",
        extensions := extensions
    );
$$;

-- Enable inflection so snake_case columns become camelCase fields
-- https://supabase.github.io/pg_graphql/configuration/
comment on schema public is '@graphql({"inflect_names": true})';
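
Before involving HTTP at all, you can sanity-check the resolver straight from psql; a simple introspection query should come back with the query type:

-- Resolve a GraphQL query entirely inside the database
SELECT graphql.resolve($$ query { __schema { queryType { name } } } $$);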

You should now be able to fire GraphQL queries towards localhost:3000/rpc/graphql and get some valid data back. Let's try it using curl again, but you can use your favorite client.

curl --data-urlencode "query=query { booksCollection(first: 3, offset: 9) { edges { node { id title authorSort } } } }" \
  http://localhost:3000/rpc/graphql | jq

{
  "data": null,
  "errors": [
    {
      "message": "Unknown field \"booksCollection\" on type Query"
    }
  ]
}

And that's it! Wait, Unknown field? I did not want any errors on the main quest here.

Turns out that pg_graphql can only generate GraphQL schema entities for tables with a primary key, and technically our tables are foreign tables, not normal ones. Helpfully, we can overcome this by setting a comment on the table to tell pg_graphql what our primary key is.

COMMENT ON FOREIGN TABLE books IS e'@graphql({ "primary_key_columns": ["id"] })';

Let's run that query again!

{
  "data": {
    "booksCollection": {
      "edges": [
        {
          "node": {
            "id": "12",
            "title": "Why Has Nobody Told Me This Before?",
            "authorSort": "Smith, Julie"
          }
        },
        {
          "node": {
            "id": "13",
            "title": "Dungeon Magazine #1",
            "authorSort": "Unknown"
          }
        },
        {
          "node": {
            "id": "13",
            "title": "Dungeon Magazine #1",
            "authorSort": "Unknown"
          }
        }
      ]
    }
  }
}

We have books. But maybe we'd also like some Authors and Series, which gives us two options:

  1. Copy-pasting the comment for each table in the schema.
  2. Writing some PL/pgSQL that uses the metadata tables to generate the comment string.

For more control, you'd probably want to take the first option, but my tea is starting to get cold now.

-- Use metadata tables to add "fake" primary keys for pg_graphql
-- https://supabase.github.io/pg_graphql/views/
DO $$
DECLARE
  r RECORD;
BEGIN
  FOR r IN SELECT c.relname AS table_name, a.attname AS column_name
           FROM pg_class c
           JOIN pg_attribute a ON c.oid = a.attrelid
           JOIN pg_type t ON a.atttypid = t.oid
           WHERE c.relkind = 'f'  -- select only foreign tables
           AND a.attfdwoptions='{key=true}' -- Where the key is a primary key
           AND t.typcategory = 'N' -- and the column is numeric, tweak this in case of UUIDs or similar
  LOOP
    -- Add the comment for pg_graphql on each table
    EXECUTE format('COMMENT ON FOREIGN TABLE %I IS E''@graphql({"primary_key_columns": ["%I"]})'';', r.table_name, r.column_name);
  END LOOP;
END;
$$ LANGUAGE plpgsql;
-- Note: this could also be extended to non-numeric key columns, and might misbehave if a table has multiple key columns.

Refreshing our GraphQL schema, the list of tables should hopefully have grown a bit. You might notice that the relationships, a main reason for using GraphQL, are missing, even though pg_graphql supports them.
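
To check what pg_graphql is exposing at this point, you can run an introspection query straight from psql (the field list depends on which tables you imported and commented):

-- List the fields pg_graphql currently exposes on the Query type
SELECT jsonb_pretty(
  graphql.resolve($$ query { __type(name: "Query") { fields { name } } } $$)
);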

A look at the SQL schema for the database reveals the answer though:

sqlite> .schema 
CREATE TABLE books(id INTEGER PRIMARY KEY /* rest of keys omitted */);
CREATE TABLE authors(id INTEGER PRIMARY KEY /* rest of keys omitted */);

CREATE TABLE books_authors_link ( 
  id INTEGER PRIMARY KEY,
  book INTEGER NOT NULL,
  author INTEGER NOT NULL,
  UNIQUE(book, author)
);

CREATE INDEX books_authors_link_aidx ON books_authors_link (author);
CREATE INDEX books_authors_link_bidx ON books_authors_link (book);
CREATE TRIGGER fkc_insert_books_authors_link
        BEFORE INSERT ON books_authors_link
        BEGIN 
          SELECT CASE
              WHEN (SELECT id from books WHERE id=NEW.book) IS NULL
              THEN RAISE(ABORT, 'Foreign key violation: book not in books')
              WHEN (SELECT id from authors WHERE id=NEW.author) IS NULL
              THEN RAISE(ABORT, 'Foreign key violation: author not in authors')
          END;
        END;
CREATE TRIGGER fkc_update_books_authors_link_a
        BEFORE UPDATE OF book ON books_authors_link
        BEGIN
            SELECT CASE
                WHEN (SELECT id from books WHERE id=NEW.book) IS NULL
                THEN RAISE(ABORT, 'Foreign key violation: book not in books')
            END;
        END;
CREATE TRIGGER fkc_update_books_authors_link_b
        BEFORE UPDATE OF author ON books_authors_link
        BEGIN
            SELECT CASE
                WHEN (SELECT id from authors WHERE id=NEW.author) IS NULL
                THEN RAISE(ABORT, 'Foreign key violation: author not in authors')
            END;
        END;

For pg_graphql (and PostgREST) to infer the relationships, foreign keys need to be defined in the schema. SQLite does support foreign keys, but Calibre is not using them, opting to simulate them with triggers instead. I'm not exactly sure why; I could imagine performance concerns?

We have two ways of working around this: faking those foreign keys with comments, or using computed fields. Using comments is quite easy; however, they come with the downside of a (in my opinion) slightly ugly external API. As a quick intro to what we're dealing with in the schema: there is a many-to-many relationship between books and authors, represented with a join table.

COMMENT ON FOREIGN TABLE books_authors_link IS e'
    @graphql({
        "primary_key_columns": ["id"],
        "foreign_keys": [
          {
            "local_name": "authors",
            "local_columns": ["book"],
            "foreign_name": "Book",
            "foreign_schema": "public",
            "foreign_table": "books",
            "foreign_columns": ["id"]
          },
          {
            "local_name": "books",
            "local_columns": ["author"],
            "foreign_name": "Author",
            "foreign_schema": "public",
            "foreign_table": "authors",
            "foreign_columns": ["id"]
          }
        ]
    })';

With that comment in place, the related authors can be reached through the link table:

query {
  booksCollection {
    edges {
      node { 
        id 
        title 
        authors {
          edges {
            node {
              Author {
                id
                name
              }
            }
          }
        }
      }
    }
  }
}

The other option, and the nicer one in my opinion, is using computed columns; this works for both to-one and to-many relationships. In this case, it ends up being less code as well.

create or replace function "authors"("books")
    returns setof "authors" -- add 'row 1' for a one-to-one relationship
    language sql
    as
$$
    select
        authors
    from
        authors
    join books_authors_link ba
        on ba.author = authors.id
        and ba.book = $1.id
$$;

The query then becomes:

query {
  booksCollection {
    edges {
      node {
        id
        title
        authors {
          edges {
            node {
              id
              name
            }
          }
        }
      }
    }
  }
}

The new query allows us to skip one level of indirection, cleaning up the API in the process.
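
The same computed-column trick covers to-one relationships. As a sketch, assuming Calibre's books_series_link table, a book's (at most one) series could be exposed like this:

-- 'rows 1' tells pg_graphql to treat this as a to-one relationship
create or replace function "series"("books")
    returns setof "series" rows 1
    language sql
    as
$$
    select
        series
    from
        series
    join books_series_link bs
        on bs.series = series.id
        and bs.book = $1.id
$$;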

Wrapping Up

With that, we are finished and so is my tea. If you made it all the way to the end, thank you! I'm sure there are a few thoughts in your head, hopefully mostly curious ones.

On the positive side, I'm pretty happy with the time it took to glue these technologies together. I also think we ended up with something quite flexible.

It doesn't have to be a SQLite database on the other end, as there are FDWs for a surprising number of technologies (with varying quality). If you've been dreading creating an API to front that old IBM DB2 instance, pulling your hair out navigating the documentation, getting the SDK to compile, and understanding the licensing, this could be a more appealing alternative.

Or, if you want to build something similar to Calibre-Web but don't want to juggle multiple databases for application and library data, you could keep the metadata in SQLite for compatibility, store the application data natively in Postgres, and add a frontend or BFF layer for the business logic.
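
As a rough sketch of that split (the table and columns here are hypothetical, and note that a real foreign key can't point at a foreign table):

-- Application data lives natively in Postgres...
CREATE TABLE reading_progress (
  id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  book_id integer NOT NULL,              -- logically references books.id over in SQLite
  percent numeric(5,2) NOT NULL DEFAULT 0
);

-- ...and joins happily against the FDW-backed Calibre tables
SELECT b.title, p.percent
FROM reading_progress p
JOIN books b ON b.id = p.book_id;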

But it also raises an eyebrow.

Even though sqlite_fdw pushes down WHERE clauses (running them in SQLite rather than in Postgres) to reduce data transfer, an application querying SQLite directly would consume fewer resources and be faster. This is likely to be true for most FDWs.
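
If you're curious about what actually gets pushed down, EXPLAIN VERBOSE on a foreign table shows the statement handed to SQLite (the exact label and plan shape depend on your sqlite_fdw version):

EXPLAIN VERBOSE
SELECT title FROM books WHERE id = 1;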

A lot of clever people will also object, for good reasons, to putting logic in the database or treating it as more than a dumb store. And this architecture can start to pull in that direction.

All in all, I would regard this as a neat trick that certainly has its uses, but I would be careful about making it a load-bearing part of your architecture without consideration.