Write an SQL query builder in 150 lines of Python!
August 2021 ∙ 20 minute read ∙
This is the fourth article in a series about writing an SQL query builder for my feed reader library.
Today, we'll dive into the code by rewriting it from scratch.
Think of it as part walk-through, part tutorial; along the way, we'll explore:
- API design
- knowing when to be lazy
- worse and better ways of doing things
As you read this, keep in mind it is a story, thus linear by necessity. Development was decidedly not so: I tried things out, I changed my mind multiple times, and I rewrote everything once. Even now, there are other equally-good or better implementations; this one is simply good enough.
Contents
What are we trying to build? #
We want a way of building SQL strings that takes care of formatting:
>>> query = Query()
>>> query.SELECT('one').FROM('table')
<builder.Query object at 0x7fc953e60640>
>>> print(query)
SELECT
one
FROM
table
... and allows us to add parts incrementally:
>>> query.SELECT('two').WHERE('condition')
<builder.Query object at 0x7fc953e60640>
>>> print(query)
SELECT
one,
two
FROM
table
WHERE
condition
While not required, I recommend reading the previous articles to get a better idea of the problem we're trying to solve, and the context we're solving it in.
In short, whatever we build should:
- support SELECT with conditional WITH, WHERE, ORDER BY, JOIN etc.
- expose the names of the result columns (for scrolling window queries)
- be easy to use, understand and maintain
Attention
This query builder is not directly comparable with that of an ORM. Instead, it is an alternative to building plain SQL strings by hand.
The caveats that apply to plain SQL apply to it as well: Using user-supplied values directly in an SQL query exposes you to SQL injection attacks. Instead, use parametrized queries whenever possible, and escaping only as a last resort.
Trade-offs #
Our solution does not exist in a void; it exists to be used by my feed reader library.
Notably, we're not making a general-purpose library with external users whose needs we're trying to anticipate; there's exactly one user with a pretty well-defined use case, and strict backwards compatibility is not necessary.
This allows us to make some upfront decisions to help with maintainability:
- No needless customization. We can change the code directly if we need to.
- No other features except the known requirements. We can add new ones when we need them.
- No effort to support other syntax than SQLite.
- No extensive testing. We can rely on the exising comprehensive functional tests.
- No SQL validation. The database does this already.
- However, it would be nice to get at least a little error checking. No need for custom exceptions, any kind is acceptable – they should come up only during development and testing anyway.
A minimal plausible solution #
Data representation #
As mentioned before, my prototype was based on the idea that queries can be represented as plain data structures.
Looking at a nicely formatted query, a natural representation may reveal itself:
SELECT
one,
two
FROM
table
WHERE
condition AND
another-condition
See it?
It's a mapping with a list of strings for each clause:
{
'SELECT': [
'one',
'two',
],
'FROM': [
'table',
],
'WHERE': [
'condition',
'another-condition',
],
}
Let's use this as our starting model, and make ourselves a query builder.
Classes #
We start with a class:
|
|
We use a class because most of the time we don't want to interact with the underlying data structure, since it's more likely to change. We're not subclassing dict, since that would unintentionally expose its methods (and thus, behavior), and we may need those names for something else.
Also, a class allows us to reduce verbosity:
# we want
query.SELECT('one', 'two').FROM('table')
# not
query['SELECT'].extend(['one', 'two'])
query['FROM'].append('table')
We use class variables for "static" data instead of hardcoding or module variables so it's easy to override (more on that later).
We don't customize anything in __init__()
fow now;
if we need more clauses, we can add them to keywords
directly.
Adding all known keywords to data
upfront gets us free error checking:
data[keyword]
raises KeyError for unknown keywords.
Note
Unless specified otherwise,
I'll use clause and keyword to mean "item in self.data
",
not "SQL clause or keyword in general".
We could use dataclasses,
but of the generated magic methods, we'd only use __repr__()
,
and its output would be too long to be useful anyway.
Adding things #
Next, we add code for adding string fragments to each clause:
|
|
add()
is roughly equivalent to data[keyword].extend(args)
.
The main difference is that we dedent the arguments and remove trailing whitespace. This is intentional: we clean everything up and make as many choices when adding things, so we don't have to care about that when generating output, and so error checking happens as early as possible.
Also, add()
returns self
to enable method chaining: query.add(...).add(...)
.
__getattr__()
is called when an attribute does not exist,
and allows us to return something instead of getting the default AttributeError.
What we return is a KEYWORD(*args)
callable made on the fly
by wrapping add()
in a partial;
a closure capturing name
would be functionally equivalent.
Requiring the keywords to be uppercase is a stylistic choice,
but does have advantages:
it signals to the reader these are special "methods",
and avoids shadowing dunder methods like __deepcopy__()
without extra checks.
To indicate the attribute really doesn't exist,
we need to raise AttributeError;
we let getattr() do it for us
(the parent object doesn't have a custom __getattr__()
).
We could store the partial on the instance,
which would side-step __getattr__()
on subsequent calls,
so we only make one partial per keyword;
we could do it in __init__()
, and not use __getattr__()
at all;
we could even use partialmethod,
so there's only one per keyword per class!
Or we can do nothing – they're likely premature optimization,
and what we're doing now is more flexible anyway.
I said error checking happens as early as possible;
that's almost true:
if you look carefully at the code,
you may notice query.ESLECT
doesn't raise an exception
until called – query.ESLECT()
.
Doing most of the work in add()
does have some benefits, though:
we can use it with partial and get chaining for free,
and it's an escape hatch for when we want
to use a "keyword" that's not a Python identifier
(this will be useful later).
Output #
Finally, we turn the query into SQL:
|
|
|
|
The only output API is str(); being the standard way of turning objects into strings in Python, it requires zero effort to learn.
str(query)
calls __str__
, which delegates to _lines()
.
We use a generator mainly because it allows us to write
yield line
instead of rv.append(line)
,
making for somewhat cleaner code.
Another benefit of a generator is that it's lazy,
so we can pass it around
without having to build intermediary lists in memory;
for example, to a file's writelines() method,
or in yield from
in another generator
(e.g. for nested subqueries).
We don't need it here,
but it's useful when generating a lot of values.
We split the logic for individual clauses into _lines_keyword()
,
because we'll keep adding stuff to it.
(I initially left everything in _lines()
,
and refactored when things got too complicated;
no need to do that now.)
Since we'll want to indent things in the same way in more than one place, we make it a static "method" using partial.
You may notice we're not sorting the clauses in any way;
dicts guarantee insertion order in Python 3.6+1,
and we built data
from keywords
, so the order is preserved.
Tests #
Let's add a simple test to make sure we don't break already working stuff:
|
|
We'll keep adding to it with each feature.
For a minimal solution, we are done. We've "spent" 62 lines, or 38 statements.
The code so far: builder.py, test_builder.py.
Separators #
At this point, WHERE doesn't really make sense:
>>> print(Query().WHERE('a', 'b'))
WHERE
a,
b
We fix it by special-casing separators for a few clauses:
|
|
|
|
We could've used defaultdict instead of using default_separator
,
but then we'd have to remember non-comma separators need a space: ' AND'
;
putting it in code means we don't have to remember anything.
Also, we could've put the separator on a new line:
'one\nAND two'
vs. 'one AND\ntwo'
.
While slightly better style,
it makes code more complicated for little benefit,
and makes it less obvious that AND is just another separator.
We add WHERE to the test.
|
|
The code so far: builder.py, test_builder.py.
Aliases #
One of the requirements is that it should be possible to implement scrolling window queries on top. For this, code needs to get the result column names – the SELECT expressions or their aliases – and add them to a generated WHERE condition.
Parsing the result column is straightforward only for simple cases:
>>> query = Query().SELECT(
... 'column',
... 'column AS alias',
... 'column as alias',
... '(SELECT column FROM table AS another-table)',
... )
>>> [s.rpartition(' AS ')[2] for s in query.data['SELECT']]
['column', 'alias', 'column as alias', 'another-table)']
An acceptable compromise is using pairs of strings for aliased columns. Since the column expression might be quite long, we'll make the alias the first thing in the pair.
>>> print(Query().SELECT(('alias', 'one'), 'two'))
SELECT
one AS alias,
two
As mentioned earlier, we store everything in a standard way to keep output code simpler. A plain 2-tuple is a decent choice, but a named tuple is more readable.
|
|
Conveniently, this gives us a place where to convert the string-or-pair:
the from_arg()
alternate constructor.
We could've made it a stand-alone function,
but this way it's easier to see what type is being returned.
Note
Note that we use an empty string to mean "no alias". In general, it's a good idea to distinguish this kind of absence by using None, since the empty string may be a valid input, and None can prevent some bugs – e.g. you can't concatenate None to a string. Here, an empty string cannot be a valid alias, and we use format strings, so we don't bother.
Using it is just a one-line change to add()
:
|
|
On output, we have two concerns:
- there may or may not be an alias
- the order differs depending on the keyword:
you have
SELECT expr AS column-alias
, butWITH table-name AS (stmt)
(we treat the CTE table name as an alias)
We can model this with mostly-empty defaultdicts with per-clause format strings:
|
|
... and choose the right defaultdict using the alias's boolean value:2
|
|
We add an aliased expression to the test.
|
|
The code so far: builder.py, test_builder.py.
Subqueries #
Currently, WITH is still a little broken:
>>> print(Query().WITH(('table-name', 'SELECT 1')))
WITH
table-name AS SELECT 1
Since common table expressions always have the SELECT statement paranthesized, we'd like to have it out of the box, with proper indentation:
WITH
table-name AS (
SELECT 1
)
A simple way of handling this is to change the WITH format string to
'{alias} AS (\n{indented}\n)'
,
where indented
is the value, but indented.3
This kinda works, but is limited in usefulness; for instance, we can't easily build something like this on top:
Query().FROM(('alias', 'SELECT 1'), is_subquery=True)
Instead, let's keep refining our model, and use a flag to mark subqueries:
|
|
We can then check if a clause always has subqueries, and set the flag accordingly:
|
|
|
|
Using it for output is just an extra if:
|
|
We add WITH to our test.
|
|
The code so far: builder.py, test_builder.py.
Joins #
One clause that's entirely missing is JOIN. And it's important, changing your mind about what you're selecting from happens quite often.
JOIN is a bit more complicated, mostly because it has different forms – JOIN, LEFT JOIN and so on; SQLite supports at least 10 variations.
I initially treated any keyword containing JOIN
as a separate keyword,
and dealt with it during output.
This has a few drawbacks, though;
aside from making the code more complicated,
it reorders the tables:
query.JOIN('a').LEFT_JOIN('b').JOIN('c')
results in JOIN a JOIN c LEFT JOIN b
.
A better solution is to refine our model even further.
Take a look at these railroad diagrams for the SELECT statement:
You may notice table-or-subquery
followed by ,
in FROM
is actually a subset of table-or-subquery
followed by join-operator
in join-clause
.
That is, for SQLite, a comma is just another join operator.
Put the other way around, a join operator is just another separator.
Because our separators come after things, not before,
we'll model join operators separately, as fake keywords
(that is, not used to index into data
).
First, let's set them:
|
|
|
|
|
|
We could've probably just hardcoded this in add()
(if 'JOIN' in keyword: ...
),
but doing it like this makes it easier to see at a glance that
"JOIN is a fake FROM".
Using keyword
as a separator is relatively straightforward:
|
|
Since FROM always comes before JOIN, we make sure to output the real ones first.
We add a JOIN to the test.
|
|
The code so far: builder.py, test_builder.py.
Distinct #
The final model change is to support SELECT DISTINCT.
DISTINCT and ALL are flags that apply to the whole clause; we'll model them as such:
|
|
|
|
Since most of the time we're OK with the default flag
,
we don't bother setting it in __init__
,
and use a class variable instead.
If we need to customize it,
we can set flag
on the instance, shadowing the class variable.
A __repr__
showing the flag would be nice,
but it'd only be useful during debugging, so we skip it as well.
We set the flag based on a known set for each clause; like with fake keywords, we pull the "parsing" logic into a separate method:
|
|
|
|
|
|
Using it for output is again straightforward:
|
|
We add a SELECT DISTINCT to our test.
|
|
The code so far: builder.py, test_builder.py.
More tests #
Our only test isn't all that simple anymore; maybe it's time to split it in two: one with a really simple query, and one with a really complicated query.
... something like this.
|
|
The code so far: builder.py, test_builder.py.
More init #
One last feature: I'd like to reuse the formatting logic for paranthesized lists.
Good thing __init__
doesn't take any arguments yet:
|
|
Using it looks like:
>>> print(Query({'(': ['one', 'two'], ')': ['']}, separators={'(': 'OR'}))
(
one OR
two
)
We could have required data
to have the same structure as the attribute;
however, it would be too verbose to use,
and I'd have to do all the clean up myself;
that's not very convenient.
Instead, we make it mean
"add()
these strings for these keywords".4
We add a separate test for the fancy __init__
.
|
|
OK, now we're really done. We've spent 148 lines, or 101 statements.
The final version: builder.py, test_builder.py.
Note
Since the code in this article is essentially the reader one, but without type annotations, feel free to use it under the same BSD 3-Clause license.
That's it for now. :)
Learned something new today? Share this with others, it really helps!
Bonus: things that didn't make the cut #
When talking about trade-offs, I said we'll only add features as needed; this may seem a bit handwavy – how can I tell adding them won't make the code explode?
Because I did add them; that's what prototyping was for. But since they weren't actually used, I removed them – there's no point in them rotting away.
Here's how you'd go about implementing a few of them.
Insert / update / delete #
Make them flag keywords, to support the OR ABORT/FAIL/...
variants.
To make VALUES bake in the parentheses, set its format
to ({value})
.
That's to add one values tuple at a time.
To add one column at a time, we could do this:
- allow
add()
ing INSERT with arbitrary flags - make
INSERT('column', into='table')
a synonym ofadd('INSERT INTO table', 'column')
- classify INSERT and VALUES as
parens_keywords
– likesubquery_keywords
, but they apply once per keyword, not per value
It'd look like this:
# first insert sets flag
query.INSERT('one', into='table').VALUES(':one')
# later we just add stuff
query.INSERT('two').VALUES(':two')
Arbitrary strings as subqueries #
Allow setting add(..., is_subquery=True)
; you'd then do:
query.FROM('subquery', is_subquery=True)
query.FROM('not subquery')
Query objects as subqueries #
Using Query objects as subqueries
without having to convert them explicitly to strings
would allow changing them after being add()
ed.
To do it, we just need to allow _Thing.value
to be a Query,
and override its is_subquery
based on an isinstance() check.
Union / intersect / except #
This one goes a bit meta:
- add a "virtual" COMPOUND keyword
- add a new
compound(keyword)
method, which moves everything except WITH and ORDER BY to a subquery, and appends the subquery todata['COMPOUND']
with the appropriate fake keyword - make
__getattr__
return acompound()
partial for compound keywords - special-case COMPOUND in
_lines()
The guaranteed insertion order was actually added to the language specification in 3.7, but in 3.6 both CPython and PyPy had it as an implementation detail. [return]
This works because
False == 0
andTrue == 1
, and is likely too clever. [return]That's what I did initially. [return]
An alternate constructor or a subclass might have been a better choice here. We'll fix it if we need
__init__
for something else. ¯\_(ツ)_/¯ [return]
This is part of a series: