Why I wrote my own SQL query builder (in Python)

2021-06-28 ∙ 10 minute read

Previously

This is the third article in a series about writing an SQL query builder in 150 lines of Python.

Today, we'll talk about:

  • why I decided to write my own
  • the alternatives I considered
  • why I didn't use an existing library
  • how I knew it wouldn't become a maintenance burden

The first part is in rough chronological order. If you're only interested in the libraries I looked at, you can find the list at the end.

July 2021 update: I added a few more interesting libraries.

Background #

reader is a Python feed reader library – it allows users to retrieve, store, and manage web feeds through a high-level API, without having to deal with feed-related details.

It is a hobby and learning project, and I can only spend a limited amount of time on it, sometimes quite far in between. It's not necessarily about learning technologies; rather, it is about library design, writing code that's maintainable long-term, and showing restraint – if I were to make the best library I could, what would it look like?

Because of this, the conclusions of this article, if any, may not be directly applicable to regular "work" projects. With reader, I have different constraints over different time scales, a somewhat different definition of success, and more freedom in some aspects.

However, not all projects are the same, and not all parts of a project are the same. Sometimes, this kind of long-term thinking can be useful, and it can actually be achieved through a combination of planning, strategic technical debt, and saying no to features, reducing their scope, or postponing them.

In the spirit of keeping things maintainable, the core library is fully typed and has 100% test coverage, to make refactoring straightforward (if not painless).

Also, almost from the start, I put all storage code in a single module, behind a data access object1; the rest of the library doesn't even know it's talking to a database.

One of reader's main features is filtering articles by article metadata, user-set metadata (read, important, feed tags), and full-text search; the results can be sorted in various ways and paginated.

In May 2019, with less than half of the above implemented, the function building the SQL query was already over 100 lines, and I felt the need to add this comment:

This needs some sort of query builder so badly.

The first prototype #

I opened an issue, and did some research.

At some point I stumbled upon sqlbuilder.mini, which was built around an interesting insight – queries can be represented as plain data structures:

>>> sql = [
...     'SELECT', ['first_name'],
...     'FROM', ['author'],
...     'WHERE', ['status', '==', P('new')],
... ]
>>> compile(sql)
('SELECT first_name FROM author WHERE status == %s', ['new'])

You can then modify the query directly:

>>> sql[sql.index('SELECT') + 1].append('last_name')
>>> compile(sql)
('SELECT first_name, last_name FROM author WHERE status == %s', ['new'])

Or via a wrapper that simplifies navigation:

>>> sql = Q(sql)
>>> sql.append_child(
...     ['SELECT'],  # path
...     ['age']
... )  # returns self to allow method chaining
<sqlbuilder.mini.Q object>
>>> compile(sql)
('SELECT first_name, last_name, age FROM author WHERE status == %s', ['new'])

I really liked how simple and flexible this is, and the choice of not dealing with SQL correctness or dialects – a middle ground between building strings by hand and "proper" query builders. On the other hand, it seemed too verbose (even with the wrapper), and the generated SQL wasn't very readable.

Surely, it would be possible to make this look like sql.SELECT('age'), right?
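Something along these lines – a toy sketch of the interface I had in mind, not the actual prototype (names and details are made up for illustration):

```python
# Hypothetical sketch: a builder where each clause is a plain list,
# and methods named after SQL keywords append to the matching clause.
class Query:
    def __init__(self):
        self.clauses = {'SELECT': [], 'FROM': [], 'WHERE': []}

    def __getattr__(self, name):
        # SELECT, FROM, WHERE, ORDER_BY ("ORDER BY"), and so on;
        # each returns self, so calls can be chained in any order.
        keyword = name.replace('_', ' ')
        def add(*things):
            self.clauses.setdefault(keyword, []).extend(things)
            return self
        return add

    def __str__(self):
        return '\n'.join(
            keyword + '\n' + ',\n'.join('    ' + t for t in things)
            for keyword, things in self.clauses.items() if things
        ) + ';\n'

query = Query().SELECT('first_name').FROM('author').WHERE("status = 'new'")
query.SELECT('last_name')  # clauses can still be extended later
print(query)
```

The underlying representation stays plain data, like in sqlbuilder.mini, but adding to a clause is a single method call.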

So I made a prototype – with no real intention of using it, just to see how easy it is to do. The core was quite short, about 80 lines; my thoughts at the time:

The end result looks nice, but using it would add ~150 lines of code (that need to be tested), and it's less useful for simpler queries.

Also, it looks nice now, when I just wrote it; 6 months from now it may be hard to understand.

Afraid I was too happy with it, and with my curiosity satisfied, I did just that: postponed it for six months.

Requirements, and existing libraries #

My main concern with building my own was that over time, with additions and fixes, the effort would be greater than that of getting an existing library to do what I needed.

I did two things to deal with this.


First, I wrote down detailed requirements.

Whatever I used had to support the following:

  • SELECT with conditional WHERE, ORDER BY, JOIN etc. (example)
  • common table expressions (WITH)
  • scrolling window queries (or the ability to build them on top)
  • arbitrary SQL (so I don't have to use the query builder for everything)
  • the order in which you add clauses shouldn't matter

Also, it should be easy to understand and maintain, and it should be possible to support additional SQL features.
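Of these, scrolling window queries (also known as keyset pagination) are the least common: instead of OFFSET, the query filters on the sort key of the last row seen, so indexes can be used. A minimal sqlite3 sketch (table and column names are made up, not reader's actual schema):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, updated TEXT)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [(1, '2021-01-01'), (2, '2021-01-02'),
     (3, '2021-01-02'), (4, '2021-01-03')],
)

# Keyset pagination: compare a (sort key, unique id) row value against
# the last row of the previous page; no OFFSET to scan past.
query = """
    SELECT id, updated FROM entries
    WHERE (updated, id) > (:updated, :id)
    ORDER BY updated, id
    LIMIT 2
"""
page = conn.execute(query, {'updated': '2021-01-02', 'id': 2}).fetchall()
print(page)  # the two rows after (2021-01-02, 2)
```

(The `(a, b) > (x, y)` row-value syntax needs SQLite 3.15 or newer.)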

Because reader is a library, I wanted to keep the number of (transitive) dependencies as small as possible, since any extra dependency gets passed down to the users.

Both to keep things simple and due to historical reasons, I did not want to switch to an abstraction layer like SQLAlchemy Core just for query building – I needed (and still need) only SQLite support, and already had code to deal with stuff like migrations.


Second, I did a slightly more serious survey of existing libraries.

I didn't feel any of the ones I looked at was ideal, for at least one of these reasons:

  • They came with a full abstraction layer. This isn't bad in itself, but meant I had to switch everything, eventually – using a mix would make things worse.
  • They had too many features. Usually this is good, but it means there's more of everything: more features, more documentation to go through, more concepts to keep in your head, more things contributors need to know or learn.
  • They didn't make things more readable or more composable.
  • They weren't actively maintained.

Again, I chose to wait until more features were implemented.

The second prototype #

By May 2020, most of the features were implemented. The function building the query was 150 lines, with part of it duplicated for search. At some point, I tried to optimize it to use indexes, but gave up because trying things simply took too long.

So, a full year later, I made the prototype support all the required features and a few extra (UNION, nested queries), and tried it out on the full real queries.

It didn't take all that long, and it remained around 100 lines.

Deciding to use my own #

At this point, most of the work was already done, and integrating it took less than an hour.

Excluding the 136 lines of the builder itself with scrolling window query support, the code went from 1400 to 1300 lines. I took that as a win, since for the price of 36 lines I was able to reuse the filtering logic. (Now, one year later, it enabled a lot more reuse, without growing significantly.)

I ended up keeping it, because:

  • Using an existing library would take too much effort. (I'll reconsider when the requirements change, though.)
  • It is tiny, which makes it relatively easy to understand and modify; the prototypes made me quite confident it's likely to stay that way. Because it is only used internally, I can leave out a lot of nice things that aren't actually needed (including the extra features).
  • It has 0 dependencies. That's even better than 1.
  • reader already had great test coverage, so little additional testing was required.

My query builder is not directly comparable with that of an ORM. Instead, it is an alternative to building plain SQL strings by hand.

The caveats that apply to plain SQL apply to it as well: using user-supplied values directly in an SQL query exposes you to SQL injection attacks. Instead, use parametrized queries whenever possible, and fall back to escaping only as a last resort.
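For example, with Python's sqlite3 module, a placeholder keeps the value out of the SQL text entirely (table and data here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE author (first_name TEXT, status TEXT)")
conn.execute("INSERT INTO author VALUES ('Ann', 'new')")

status = "new"  # imagine this came from the user

# BAD: the value becomes part of the SQL text (injection risk).
rows = conn.execute(
    f"SELECT first_name FROM author WHERE status = '{status}'"
).fetchall()

# GOOD: the value is passed separately, and is never parsed as SQL.
rows = conn.execute(
    "SELECT first_name FROM author WHERE status = ?", (status,)
).fetchall()
print(rows)  # [('Ann',)]
```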

Since I was coming from plain SQL, I was already doing all of this.

Other alternatives #

Here's a non-exhaustive list of other things I looked at. I'm only covering the libraries I actually considered using, or that are interesting in some way. There are others out there; some aren't actively maintained, some I simply missed.

(update) I added a few more interesting libraries after publishing the article.

Do nothing #

It's worth keeping in mind that "do nothing" is always an option – probably the first option to consider, in many cases.

There are two kinds of doing nothing: passive, where you wait for new requirements to come up – for the problem to reveal itself – and active, where you explore options, but don't commit to anything just yet.

I ended up doing both, to a point.

Disable parts of the query #

An interesting (but quickly abandoned) idea was to not build queries; instead, have just one query, and disable parts of it with boolean or optional parameters, and hope the query planner optimizes it:

SELECT ...
FROM entries
WHERE
    (:read IS NULL OR entries.read = :read) AND
    ...  -- 7 more expressions like this

There are two huge issues:

  • I'm not sure any optimizer is that smart (also, the query might be optimized before the parameters are passed in). Even if it were, I'm not smart enough to design indexes for a query like this.
  • It doesn't seem possible to do it for JOIN, different ORDER BY terms, or even an arbitrary number of WHERE conditions (e.g. for tags).

Even if they were all possible, the result would be almost impossible to understand.
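By contrast, building only the conditions that are actually needed keeps each query simple, and lets the planner see exactly what it has to optimize. A hand-rolled sketch of the idea (names are made up):

```python
def make_query(read=None, important=None):
    # Collect only the conditions we need, keeping the values
    # in a separate parameter list (parametrized, not interpolated).
    conditions, params = [], []
    if read is not None:
        conditions.append("entries.read = ?")
        params.append(read)
    if important is not None:
        conditions.append("entries.important = ?")
        params.append(important)
    query = "SELECT id FROM entries"
    if conditions:
        query += "\nWHERE " + " AND ".join(conditions)
    return query, params

print(make_query(read=True))
# ('SELECT id FROM entries\nWHERE entries.read = ?', [True])
```

This is roughly what the 100-line function mentioned earlier was doing by hand, and what the query builder ends up abstracting away.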

SQLAlchemy Core, Peewee query builder #

SQLAlchemy and Peewee are both SQL toolkits / object-relational mappers.

SQLAlchemy has over 15 years of history, and is the database toolkit for Python. Hell, there's even an Architecture of Open Source Applications chapter about it.

Peewee is a bit younger (~10 years), simple and small by design.

Both have a lot of extensions2, and can be used without the ORM part; Peewee can even generate plain SQL without defining models.

In the end, both seemed too complicated, and meant I had to switch to them eventually, adding the burden of researching a use case I don't have yet. However, if I ever need multi-database support, it's likely I'll use one of them.3

SQLBuilder, PyPika, python-sql #

SQLBuilder and PyPika are standalone query builders – no ORM, no connection management, just SQL generation; they are similar to the Peewee query builder.

SQLBuilder doesn't seem actively maintained. Aside from that, I didn't use it because it would make a potential migration to SQLAlchemy or Peewee more difficult.

PyPika I discovered while writing this article; it is actively maintained and has somewhat better documentation.

(update) Another actively maintained query builder is python-sql. It is part of Tryton, an open-source ERP platform; it's been around for a while, and will likely continue to be. I missed this one during my research :)

sqlbuilder.mini, psycopg2.sql #

SQLBuilder comes with another, extremely lightweight SQL builder, sqlbuilder.mini.

As discussed at the beginning, I like the overall approach (and at ~500 lines, it's small enough to vendor), but it still seems verbose, and the generated SQL isn't very readable.

(update) psycopg2.sql has a similar philosophy in how it treats SQL strings. Unlike my builder, it's "inside-out" (you append stuff to lists explicitly), so it's more verbose. It does support escaping identifiers and placeholders, though; I didn't really deal with that in any way.

JinjaSQL, SQLpy, PugSQL #

These libraries are interesting because they use templating.

JinjaSQL is exactly what you'd expect: generate SQL from Jinja templates. I didn't use it because composition through macros would still be verbose, and a bit tricky (careful with the comma after that last column).

(update) Template engines like Jinja are useful if you need to allow end users to build queries, since you can sandbox templates (sandboxing Python is not easy, at least not with CPython). dbt, brought to my attention by a reader, seems to be using Jinja in this way.

SQLpy is similar, but different. You put your named queries in a separate file, and access them from Python as functions. Query building happens via named parameters: if you don't pass a parameter when executing the query, the lines using that parameter aren't included in the query (as you'd expect, this comes with a lot of caveats).

(update) I also re-discovered PugSQL, which is like SQLpy, but is actively maintained and doesn't do the magic line disappearing stuff. Turns out they're both inspired by a pair of Clojure libraries.

If your SQL doesn't change, but is parametrized, PugSQL looks like a good lightweight solution. For me, adding WHERE conditions was a strong requirement.

Pony #

I don't think I considered Pony at the time, but it's worth mentioning: it has been around since 2012, is actively maintained, and has commercial support.

And it can translate generator expressions like this one into SQL queries:

select(c for c in Customer if sum(c.orders.price) > 1000)

For reader it is definitely overkill.

It does look really, really interesting, though (too interesting, maybe?).


That's it for now.

Learned something new today? Share this with others, it really helps!

Want more? Get updates via email or Atom feed.

This is my first planned series, and still a work in progress. This means you get a say in it. Email me, and I'll address your questions or comments in one of the future articles.

Next: Write an SQL query builder in 150 lines of Python!


  1. See this for an example of what the DAO looks like. [return]

  2. That's one of the benefits of using libraries that have been around for a while. Some extensions relevant to my project: SQLAlchemy has Alembic for migrations (from the same author) and sqlakeyset for scrolling window queries; Peewee has a lot of SQLite-specific functionality. [return]

  3. They'd also be my first choice for a project with resources and deadlines. [return]

