DynamoDB crash course: part 1 – philosophy
January 2026 ∙ five minute read
This is part one of a series covering core DynamoDB concepts and patterns, from the data model and features all the way up to single-table design.
The goal is to get you to understand what idiomatic usage looks like and what the trade-offs are in under an hour, providing entry points to detailed documentation.
(Don't get me wrong, the AWS documentation is comprehensive, but it can be quite complex, and DynamoDB being a relatively low-level product with lots of features added over the years doesn't really help with that.)
Today, we're looking at what DynamoDB is and why it is the way it is.
What is DynamoDB? #
Quoting Wikipedia:
Amazon DynamoDB is a managed NoSQL database service provided by AWS. It supports key-value and document data structures and is designed to handle a wide range of applications requiring scalability and performance.
This definition should suffice for now; we'll take a more detailed look at the data model in the next article.
The DynamoDB data model can be summarized as follows:
A table is a collection of items, and an item is a collection of named attributes. Items are uniquely identified by a partition key attribute and an optional sort key attribute. The partition key determines where (i.e. on what computer) an item is stored. The sort key is used to get ordered ranges of items from a specific partition.
That's it, that's the whole data model. Sure, there's indexes and transactions and other features, but at its core, this is it. Put another way:
A DynamoDB table is a hash table of B-trees¹ – partition keys are hash table keys, and sort keys are B-tree keys. Because of this, any access not based on partition and sort key is expensive, since you end up doing a full table scan.
If you were to implement this model in Python, it'd look something like this:
from collections import defaultdict

from sortedcontainers import SortedDict  # pip install sortedcontainers


class Table:

    def __init__(self, pk_name, sk_name):
        self._pk_name = pk_name
        self._sk_name = sk_name
        # the hash table of B-trees (a sorted dict standing in for the B-tree)
        self._partitions = defaultdict(SortedDict)

    def put_item(self, item):
        # create the item, or replace it entirely if it already exists
        pk, sk = item[self._pk_name], item[self._sk_name]
        old_item = self._partitions[pk].setdefault(sk, {})
        old_item.clear()
        old_item.update(item)

    def get_item(self, pk, sk):
        return dict(self._partitions[pk][sk])

    def query(self, pk, minimum=None, maximum=None,
              inclusive=(True, True), reverse=False):
        # get an ordered range of items from a single partition;
        # in the real DynamoDB, this operation is paginated
        partition = self._partitions[pk]
        for sk in partition.irange(minimum, maximum, inclusive, reverse):
            yield dict(partition[sk])

    def scan(self):
        # read every item in every partition;
        # in the real DynamoDB, this operation is paginated
        for partition in self._partitions.values():
            for item in partition.values():
                yield dict(item)

    def update_item(self, item):
        # create the item, or merge the new attributes into an existing one
        pk, sk = item[self._pk_name], item[self._sk_name]
        old_item = self._partitions[pk].setdefault(sk, {})
        old_item.update(item)

    def delete_item(self, pk, sk):
        del self._partitions[pk][sk]
>>> table = Table('Artist', 'SongTitle')
>>>
>>> table.put_item({'Artist': '1000mods', 'SongTitle': 'Vidage', 'Year': 2011})
>>> table.put_item({'Artist': '1000mods', 'SongTitle': 'Claws', 'Album': 'Vultures'})
>>> table.put_item({'Artist': 'Kyuss', 'SongTitle': 'Space Cadet'})
>>>
>>> table.get_item('1000mods', 'Claws')
{'Artist': '1000mods', 'SongTitle': 'Claws', 'Album': 'Vultures'}
>>> [i['SongTitle'] for i in table.query('1000mods')]
['Claws', 'Vidage']
>>> [i['SongTitle'] for i in table.query('1000mods', minimum='Loose')]
['Vidage']
Philosophy #
One can't help but feel that this kind of simplicity must be severely limiting.
A consequence of DynamoDB being this low level is that, unlike with most relational databases, query planning and sometimes index management happen at the application level, i.e. you have to do them yourself in code. In turn, this means you need to have a clear, upfront understanding of your application's access patterns, and accept that changes in access patterns will require changes to the application.
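To make this concrete with the toy Table from above (entity and attribute names made up for illustration): the key schema is chosen to serve a known access pattern – a customer's orders by date – and any access pattern it wasn't designed for degrades to a full scan.
>>> orders = Table('CustomerId', 'OrderDate')
>>>
>>> orders.put_item({'CustomerId': 'c1', 'OrderDate': '2025-01-15', 'Status': 'shipped'})
>>> orders.put_item({'CustomerId': 'c1', 'OrderDate': '2025-03-02', 'Status': 'pending'})
>>> orders.put_item({'CustomerId': 'c2', 'OrderDate': '2025-02-20', 'Status': 'pending'})
>>>
>>> # the planned access pattern: key-based, reads a single partition
>>> [o['OrderDate'] for o in orders.query('c1', minimum='2025-01-01', maximum='2025-01-31')]
['2025-01-15']
>>> # an unplanned access pattern: filtering on a non-key attribute reads everything
>>> [o['OrderDate'] for o in orders.scan() if o.get('Status') == 'pending']
['2025-03-02', '2025-02-20']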
In return, you get a fully managed, highly-available database that scales infinitely²: there are no servers to take care of, there's almost no downtime, and there are no limits on table size or the number of items in a table; where limits do exist, they are clearly documented, allowing for predictable performance.
This highlights an intentional design decision that is essentially DynamoDB's main proposition to you as its user: data modeling complexity is always preferable to complexity coming from infrastructure maintenance, availability, and scalability (what AWS marketing calls "undifferentiated heavy lifting").
To help manage this complexity, a number of design patterns have arisen; they're covered extensively by the official documentation, and we'll discuss them in a future article. Even so, the toll can be heavy – by AWS's own admission, the prime disadvantage of single-table design, the fundamental design pattern, is that:
[the] learning curve can be steep due to paradoxical design compared to relational databases
As this walkthrough puts it:
a well-optimized single-table DynamoDB layout looks more like machine code than a simple spreadsheet
...which, admittedly, sounds pretty cool, but also why would I want that? After all, most useful programming most people do is one or two abstraction levels above assembly, itself one over machine code.
See also
- NoSQL design
- The DynamoDB philosophy of limits (unofficial)
A bit of history #
Perhaps it's worth having a look at where DynamoDB comes from.
Amazon.com used Oracle databases for a long time. To cope with the increasing scale, they first adopted a database-per-service model, and then sharding, with all the architectural and operational overhead you would expect. At the 2017 peak (five years after DynamoDB was released in AWS, and over ten years after some version of it was available internally), they still had 75 PB of data in nearly 7,500 Oracle databases, owned by 100+ teams, with thousands of applications, for OLTP workloads alone. That sounds pretty traumatic – bad enough, allegedly, that OLTP relational databases were banned internally, with teams required to get VP approval to use one.
Yeah, coming from that, it's hard to argue DynamoDB adds complexity.
That is not to say relational databases cannot be as scalable as DynamoDB, just that Amazon doesn't believe in them – distributed SQL databases like Google's Spanner and CockroachDB have existed for a while now, and even AWS seems to be warming up to the idea.
This might also explain why the design patterns are so slow to make their way into SDKs, or even better, into DynamoDB itself; when you have so many applications and so many experienced teams, the cost of yet another bit of code to do partition key sharding just isn't that great.
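For the curious, here's roughly what that bit of code amounts to – a minimal write sharding sketch on top of the toy Table from earlier, with an arbitrary shard count and a made-up key format:
import random

N_SHARDS = 4  # arbitrary; real shard counts get sized to the write throughput

def sharded_put_item(table, item, pk_name):
    # spread writes for one hot "logical" partition key across N physical ones
    item = dict(item)
    item[pk_name] = f"{item[pk_name]}#{random.randrange(N_SHARDS)}"
    table.put_item(item)

def sharded_query(table, pk, **kwargs):
    # reads fan out to every shard; note that the merged results
    # are only sorted within each shard, not globally
    for shard in range(N_SHARDS):
        yield from table.query(f"{pk}#{shard}", **kwargs)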
Anyway, that's it for now.
In the next article, we'll have a closer look at the DynamoDB data model and features.
Learned something new today? Share it with others, it really helps!
1. Or any other sorted data structure that allows fast searches, sequential access, insertions, and deletions.
2. As the saying goes, the cloud is just someone else's computers. Here, "infinitely" means it scales horizontally, and you'll run out of money before AWS runs out of computers.