Why use an SQL query builder in the first place?

2021-05-18 ∙ eight minute read

Previously

This is the second article in a series about writing an SQL query builder in 150 lines of Python, why I wrote it, how I thought about it, and the decisions I had to make.

Today, we'll talk about why I needed a query builder in the first place, and how it:

The examples below are actual use cases I had for my feed reader library. In practice, they overlap a bit, since they're different aspects of the same problem.

Contents

The problem #

You have a method that retrieves all or some of the entries from the database:

def get_entries(
    feed: Optional[str],
    entry: Optional[Tuple[str, str]],
    read: Optional[bool],
    important: Optional[bool],
    has_enclosures: Optional[bool],
    sort: Literal['recent', 'random'],
) -> Iterable[Entry]: ...

In the end, the method queries the entries table. The query changes depending on the arguments: different columns are selected, different WHERE expressions and ordering terms are used.

The simplest solution in terms of code is to have one query per variation; if the method started with no arguments and they were added over time, this may be a natural thing to do (initially).

Let's count the queries you'd need get_entries() above:

  • all queries are parametrized, so the query doesn't change for different values (e.g. if feed is not None, we add WHERE feed = :feed)
  • optional arguments are either present or not present (2 queries)
    • to make queries more readable, for optional boolean arguments I emit 3 queries: one for True (condition), one for False (NOT condition), and one for None (nothing added); however, since it can be done with only 2 queries, we'll count 2
  • sort changes the query in more complicated ways, but it currently has 2 values, so we'll count it as 2 queries (additional sorts can be added in the future, though)

So, there's 6 parameters with 2 variations each: 26 = 64 variations!

I'm not sure having 64 different queries is such a good idea... 😕

The real method has one parameter that can take an arbitrary number of tags in a bunch of different formats (True, ['one'], [['one'], ['two']], [['one', 'two']], ...). I don't know how to count that, so I left it out; even without it, the conclusion is the same.

Preventing combinatorial explosion #

OK, what now?

The next easiest thing is to build the query by concatenating strings. To keep things short, we'll only do read and important:

def make_entries_query(read=None, important=None):
    where_snippets = []

    if read is not None:
        where_snippets.append(f"{'' if read else 'NOT'} entries.read")
    if important is not None:
        where_snippets.append(f"{'' if important else 'NOT'} entries.important")

    if any(where_snippets):
        where_keyword = 'WHERE'
        where_snippet = ' AND\n            '.join(where_snippets)
    else:
        where_keyword = ''
        where_snippet = ''

    return f"""
        SELECT
            entries.id,
            entries.title
        FROM entries
        {where_keyword}
            {where_snippet}
    """

make_entries_query(False, True) outputs:

        SELECT
            entries.id,
            entries.title
        FROM entries
        WHERE
             entries.read AND
            NOT entries.important

Not bad...

Separation of concerns #

OK, it is a bit verbose, especially as we add more snippets in other places. Don't think so? Here's the full original code; that's a lot of work just to append() some strings.

Also, while the SQL above is acceptable, it's not the best:

  • it's needlessly indented
  • there's an extra space in front of entries.read
  • the where_snippets joiner needs more spaces if we indent the function
  • (not shown above) multi-line snippets are tricky to indent properly

We could wrap everything in dedent/indent calls, but that would make things even more verbose.

What if there was an object with lots of append()-like methods? ...something like:

def make_entries_query(read=None, important=None):
    query = (
        Query()
        .SELECT('entries.id', 'entries.title')
        .FROM('entries')
    )

    if read is not None:
        query.WHERE(f"{'' if read else 'NOT'} entries.read")
    if important is not None:
        query.WHERE(f"{'' if important else 'NOT'} entries.important")

    return query

The important bits are still there, but most of the noise went away. Formatting is a separate concern, so it happens somewhere else; things are magically cleaned up, dedented, and stitched back together into clean, regular SQL:

SELECT
    entries.id,
    entries.title
FROM
    entries
WHERE
    entries.read AND
    NOT entries.important

Composition and reuse #

Let's say we later add a method to count entries.

Surely, we'd want it to have the same filtering capabilities as get_entries(). So we move that part into a function:

def apply_filter_options(query, read=None, important=None):
    if read is not None:
        query.WHERE(f"{'' if read else 'NOT'} entries.read")
    if important is not None:
        query.WHERE(f"{'' if important else 'NOT'} entries.important")

def make_entry_counts_query(read=None, important=None):
    query = (
        Query()
        .SELECT(
            'count(*)',
            'coalesce(sum(read == 1), 0)',
            'coalesce(sum(important == 1), 0)',
        )
        .FROM("entries")
    )
    apply_filter_options(query, read, important)
    return query

Why pass the query in, instead of just returning the WHERE snippets? Because it allows doing other kinds of changes transparently:

def apply_filter_options(query, ..., updates_enabled=None):
    ...
    if updates_enabled is not None:
        query.JOIN("feeds ON feeds.url = entries.feed")
        query.WHERE(f"{'' if updates_enabled else 'NOT'} feeds.updates_enabled")

The query builder offers a standard interface for making changes to a query, which makes it easier to compose operations, which in turn makes it easier to reuse logic.

If you find this a bit contrived, check out this function for a real-world example.

Intermission: scrolling window queries #

If you're familiar with scrolling window queries / cursor pagination / keyset pagination, feel free to skip to the next section.

Before we go forward, we need to learn how to implement pagination.

Assume we have a table like this:

sqlite> CREATE TABLE things(name);
sqlite> INSERT INTO things VALUES ('b'), ('a'), ('c');

Let's say we need to get the things sorted by name, but split over multiple queries, each returning a 2-row page. There are two main ways of doing it.

The obvious one is LIMIT+OFFSET:

sqlite> SELECT * FROM things ORDER BY name LIMIT 2 OFFSET 0;
a
b
sqlite> SELECT * FROM things ORDER BY name LIMIT 2 OFFSET 2;
c

This returns the correct answer, but has one big performance issue: each query fetches and discards the rows up to OFFSET, and then returns (at most) LIMIT rows; the more data you have, and the nearer the end of the result set you are, the slower this gets.

The other one is scrolling window queries (also known as cursor based pagination):

sqlite> SELECT * FROM things ORDER BY name LIMIT 2;
a
b
sqlite> SELECT * FROM things WHERE name > 'b' ORDER BY name LIMIT 2;
c

Instead of using OFFSET to skip n rows, it uses WHERE to skip to after the last row from the previous query, using the sort key (the cursor). While superficially the same, now the query can use indices to avoid fetching rows it doesn't need.

An additional benefit, unrelated to performance, is that you won't get duplicate/missing results if rows are inserted/deleted between queries.

See this article for a more detailed explanation, complete with a benchmark you can run yourself.

Introspection #

Now that we're familiar with scrolling window queries, let's do it for a real query:

def make_feeds_query():
    query = Query().SELECT('url', 'coalesce(user_title, title)').FROM('feeds')

    # if sort == 'title':
    query.SELECT(("title_sort", "lower(coalesce(user_title, title))"))
    query.ORDER_BY('title_sort', 'url')

    return query

This retrieves feeds sorted by title, preferring custom titles; for feeds with the same title, it falls back to the primary key, so the result stays consistent between queries. I omitted other supported orderings for brevity.

To get the first page, we add a limit:

query = make_feeds_query()
query.LIMIT(':limit')
SELECT
    url,
    coalesce(user_title, title),
    lower(coalesce(user_title, title)) AS title_sort
FROM
    feeds
ORDER BY
    title_sort,
    url
LIMIT
    :limit

After we run the query, we save the last sort key:

>>> params = dict(limit=2)
>>> first_page = list(db.execute(str(query), params)
>>> last_result = first_page[-1]
>>> last_result
('file:feed.xml', 'Title', 'title')
>>> last = dict(last_title_sort=last_result[2], last_url=last_result[0])

To get the second page, we add a limit and a WHERE condition:

query = make_feeds_query()
query.LIMIT(':limit')
query.WHERE('(title_sort, url) > (:last_title_sort, :last_url)')

...and use the last sort key from the previous page as query parameters:

>>> params = dict(limit=2)
>>> params.update(last)
>>> second_page = list(db.execute(str(query), params)

But, every time we do this, we have to remember the indices of the sort key values in the result and the parameter names for the current ordering.

There must be a better way.

The WHERE parameters don't actually need a descriptive name; also, the computer already knows both what we're sorting by, and where those things are in the result tuple – it's right there in make_feeds_query().

Let's make the computer do the work, then. To add the WHERE condition:

query = make_feeds_query()
query.LIMIT(':limit')

order_by_things = [t.value for t in query.data['ORDER BY']]
labels = [f':last_{i}' for i in range(len(order_by_things))]
# use the query builder do the formatting (sneaky, but it works)
comparison = Query({'(': order_by_things, ') > (':labels, ')': ['']})

query.WHERE(str(comparison))
SELECT
    url,
    coalesce(user_title, title),
    lower(coalesce(user_title, title)) AS title_sort
FROM
    feeds
WHERE
    (
        title_sort,
        url
    ) > (
        :last_0,
        :last_1
    )
ORDER BY
    title_sort,
    url
LIMIT
    :limit

To extract the sort key parameters from the last result:

>>> order_by_things = [t.value for t in query.data['ORDER BY']]
>>> names = [t.alias or t.value for t in query.data['SELECT']]
>>> last = {
...     f'last_{i}': last_result[names.index(t)]
...     for i, t in enumerate(order_by_things)
... }
>>> last
{'last_0': 'title', 'last_1': 'file:feed.xml'}

Now, we only need to specify what we're sorting by once, in make_feeds_query(). Because the query builder provides a standard representation of a query, we can inspect that programmatically, and let the computer do the work itself.

Can we do better?

Abstraction #

The pattern above seems quite useful, but it's complicated enough I probably wouldn't get it right all the time; I bet I missed some corner cases, too.

If only there was a way to tell the computer directly what I want:

def make_feeds_query():
    query = Query().SELECT('url', 'coalesce(user_title, title)').FROM('feeds')

    # if sort == 'title':
    query.SELECT(("title_sort", "lower(coalesce(user_title, title))"))
    query.scrolling_window_order_by('title_sort', 'url')

    return query

...and then have it execute the queries for me:

>>> query = make_feeds_query()
>>> first_page = list(paginated_query(db, query, limit=2))
>>> first_page[-1]
(('file:feed.xml', 'Title', 'title'), ('title', 'file:feed.xml'))
>>> last = first_page[-1][1]
>>> query = make_feeds_query()
>>> second_page = list(paginated_query(db, query, limit=2, last=last))

Pretty neat, huh?

This is provided by a mixin that's not counted in the 150 lines – but hey, not even SQLAlchemy has it built-in.


That's it for now.

Learned something new today? Share this article with others, it really helps! :)

This is my first planned series, and still a work in progress.

This means you get a say in it. Email me your questions or comments, and I'll do my best to address them in one of the future articles.

Next: Why I wrote my own SQL query builder


This is part of a series: