DynamoDB crash course: part 2 – data model
February 2026 ∙ 12 minute read
This is part two of a series covering core DynamoDB concepts, from philosophy all the way to single-table design. The goal is to get you to understand idiomatic usage and trade-offs in under an hour.
Today, we're looking at the DynamoDB data model – what the main abstractions are, what you can do with them, and how they scale.
(While the AWS documentation is mostly comprehensive, it's also all over the place, including some other places that aren't the documentation at all, like the AWS blog. This series brings the important stuff in one place, so you can get a mental model of how it all ties together without having to read the entire documentation twice).
Contents
Core components #
According to the documentation, the core components of DynamoDB are tables, items, and attributes. This is accurate in the sense of what you can act on through the API, but can be deceptively simple, and leaves out two other equally important aspects: what you can do with it (the logical model) and how it scales (the physical model).
Let's put it all together, starting from the top.
API model: tables, items, attributes #
As far as the API is concerned, "a table is a collection of items, and each item is a collection of attributes".1
An item is uniquely identified by two attributes, the partition key and the sort key,2 which together compose its primary key.3 A group of items with the same partition key value is called an item collection,4 but this is more of a logical grouping, and does not exist as a distinct entity in the API.
An attribute is a named data element, with its value either a scalar (number, string, binary, boolean, null), a set of scalars, or a document (a list or map of possibly nested attributes, similar to JSON).
There are no limits on table size or number of items, nor on those of an item collection. Items do have a size limit of 400 KB / item, which indirectly limits attribute size.
As we've seen in the previous article, the core DynamoDB data operations are:
- PutItem, GetItem, UpdateItem, DeleteItem
- Query items with the same partition key, sorted by sort key, and optionally narrowed down to a specific range of sort keys
- Scan all the items in the table, possibly in parallel
Besides whole items, the API allows getting and updating specific attributes, as well as filtering query and scan results by expressions using them.
Logical model: hash table of B-trees #
The operations above may seem arbitrarily restrictive – for example, why can't I query items by sort key alone? It might make more sense to think about it like this:
Conceptually, a DynamoDB table is a hash table of B-trees, with partition keys being hash table keys, and sort keys being B-tree keys (making item collections B-trees). The hash table allows efficient find collection by partition key operations; within each collection, the B-tree keeps the items sorted, and allows efficient find item by sort key and find items by sort key range operations.
As a consequence, any access not based on partition and sort key is expensive, since instead of taking advantage of the underlying data structure, you have to go through all the items in the table to find anything (aka a full table scan), and at the scales you'd use DynamoDB at, this can mean billions of items.5
Example
(from here) Take a Music table where items correspond to songs, with Artist as primary key and Song as sort key:
# table Music (partition key: Artist, sort key: Song)
1000mods: !btree
Claws: { Album: Vultures }
Vidage: { Year: 2011 }
Kyuss: !btree
Space Cadet: { }
You can efficiently:
- query songs by artist (sorted by song title)
- get the song by artist and song title
...and that's it, anything else requires a full table scan.
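For illustration, here's that mental model as toy Python – a dict of dicts we keep sorted on demand. This is nothing like DynamoDB's actual implementation, just the logical shape:

```python
# Toy model of a DynamoDB table: a hash table (dict) of "B-trees"
# (here, plain dicts sorted on demand). Illustration only.

class ToyTable:
    def __init__(self):
        self._collections = {}  # partition key -> {sort key -> item}

    def put_item(self, pk, sk, item):
        self._collections.setdefault(pk, {})[sk] = item

    def get_item(self, pk, sk):
        # efficient: one hash lookup + one tree lookup
        return self._collections.get(pk, {}).get(sk)

    def query(self, pk, sk_start=None, sk_end=None):
        # efficient: one hash lookup + an ordered range over the "B-tree"
        collection = self._collections.get(pk, {})
        return [
            (sk, item)
            for sk, item in sorted(collection.items())
            if (sk_start is None or sk >= sk_start)
            and (sk_end is None or sk <= sk_end)
        ]

    def scan(self):
        # anything else: look at every item in the table
        for pk, collection in self._collections.items():
            for sk, item in collection.items():
                yield pk, sk, item

music = ToyTable()
music.put_item("1000mods", "Claws", {"Album": "Vultures"})
music.put_item("1000mods", "Vidage", {"Year": 2011})
music.put_item("Kyuss", "Space Cadet", {})

songs = music.query("1000mods")  # sorted by song title
```

Note that `scan()` has no choice but to visit every item – that's the full table scan.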
To a first approximation, this is also a decent model of how DynamoDB scales – you could imagine that each collection has its own dedicated computer, which in theory would account for the unlimited number of collections.
Physical model: partitions #
Of course, there are not infinitely many computers, and that would be wildly inefficient anyway. Instead, collections are packed together into a smaller number of partitions, each a few gigabytes in size. To figure out which partition an item should go on, DynamoDB hashes its partition key (also called a hash key, for obvious reasons).
This is similar to hash table buckets,6 except there's one more level of indirection – instead of mapping to a single number, each partition maps to a range of numbers, which allows splitting a partition into two new ones by splitting its range. Furthermore, an item collection can be split on multiple partitions too, by using the sort key.7
And that is how the scaling magic happens:
- When you increase provisioned capacity, partitions are split as needed.
- If a partition or collection becomes too big, it gets split.
- If the throughput to a partition or collection is high enough for long enough, it also gets split,8 possibly with a bias towards keys with higher utilization; this is a feature of adaptive capacity.
Partition management is handled entirely by DynamoDB and is transparent to the user, but it doesn't happen instantly – it takes several minutes to allocate new partitions and shuffle things around.
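To make the extra level of indirection concrete, here's a toy sketch of partitions owning hash ranges and splitting – made-up names and hash function, not DynamoDB internals:

```python
import hashlib

# Toy sketch of partition key ranges (not DynamoDB's actual internals):
# each partition owns a range of the hash space, and splitting a
# partition splits its range in two.

HASH_SPACE = 2 ** 32

def key_hash(partition_key):
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:4], "big")

partitions = [(0, HASH_SPACE)]  # list of (start, end) ranges

def find_partition(partition_key):
    h = key_hash(partition_key)
    for i, (start, end) in enumerate(partitions):
        if start <= h < end:
            return i
    raise AssertionError("hash space not fully covered")

def split_partition(i):
    start, end = partitions[i]
    mid = (start + end) // 2
    partitions[i:i + 1] = [(start, mid), (mid, end)]

before = find_partition("Kyuss")
split_partition(before)  # e.g. because the partition got too big
# every key still maps to exactly one partition after the split
```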
Since partitions are backed by real computers, they do have a throughput limit.
See also
- Partitions and data distribution
- Burst and adaptive capacity
- Isolate frequently accessed items
- (blog) How Amazon DynamoDB adaptive capacity accommodates uneven data access patterns (2018)
- (blog) How partitions, hot keys, and split for heat impact performance (2023)
- (unofficial) Everything you need to know about DynamoDB Partitions
Limits #
Part of DynamoDB's appeal is that it scales "infinitely" for specific dimensions: there are no limits on table size or number of items. However, there are some hard, non-adjustable limits you will have to take into account when designing your application.
Partition throughput #
The most important limit is that on partition throughput (aka capacity) – how much data DynamoDB can read from or write to a partition in a given amount of time:9
- 1 MB/s for writes
- 24 MB/s for reads, eventually consistent
- 12 MB/s for reads, strongly consistent
Throughput measures whole items DynamoDB has to access, not the data that goes through the API. While you can touch single attributes and filter query results, the consumed capacity is always that of the whole items DynamoDB had to read or write.10
Once you reach the limit, the operation is throttled, and you can try again later, ideally with exponential backoff (the AWS SDK usually takes care of this for you).
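In case you do need to implement retries yourself, exponential backoff with full jitter is only a few lines. A sketch with a stand-in error type (the real SDK retry logic is more sophisticated):

```python
import random
import time

# Sketch of retrying a throttled operation with exponential backoff and
# full jitter; the AWS SDKs implement something like this out of the box.

def with_backoff(operation, max_attempts=5, base_delay=0.05):
    for attempt in range(max_attempts):
        try:
            return operation()
        except RuntimeError:  # stand-in for a throttling error
            if attempt == max_attempts - 1:
                raise
            # full jitter: sleep a random amount up to the exponential cap
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))

# toy operation that gets throttled twice before succeeding
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("throttled")
    return "ok"

result = with_backoff(flaky)
```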
The best way to avoid throttling is to distribute the load uniformly across partitions by using a high-cardinality partition key.11 Uneven key distribution can create hot partitions that suffer from persistent throttling.
Nowadays, this is less of a problem. For long-term imbalances, partition splitting should rebalance things over time; you might even end up with a single popular item per partition. For short-term ones like traffic spikes, burst and adaptive capacity will, on a best-effort basis, "borrow" capacity above the table limit and between partitions. However, AWS is very non-committal about their behavior, and there's nothing you can do besides increasing traffic gradually, so good partition key design remains key.
Of note, while the throughput is fixed, the other dimensions are not; this means that you have a trade-off between how often you access items, the number of items, and item size; for example, you can split items into smaller ones based on how attributes are accessed, aka vertical partitioning.
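The accounting behind the capacity units mentioned in the footnotes is simple enough to write down – item size rounded up to 4 KB per read unit (halved for eventually consistent reads) and 1 KB per write unit:

```python
import math

# Capacity unit accounting: item size is rounded up to 4 KB per RCU
# (halved for eventually consistent reads) and 1 KB per WCU.

def read_capacity_units(item_size_bytes, consistent=True):
    rcus = math.ceil(item_size_bytes / 4096)
    return rcus if consistent else rcus / 2

def write_capacity_units(item_size_bytes):
    return math.ceil(item_size_bytes / 1024)

# a 5 KB item costs 2 RCUs strongly consistent, 1 eventually
# consistent, and 5 WCUs to write
```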
See also
- Read and write operations (capacity unit consumption)
- Partition key design and Distributing workloads
- Sort key design
- (blog) Choosing the Right DynamoDB Partition Key
Item size #
Second, the maximum item size is 400 KB, which ought to be enough for anybody. You can work around this limit either by splitting items into parts, or by putting the data somewhere else entirely, like S3, and keeping only a reference in DynamoDB.
Page size #
Finally, the maximum response size for query and scan operations is 1 MB (a page). You can continue from the end of the previous page by passing the LastEvaluatedKey response element to subsequent calls, which is essentially keyset pagination.
One consequence of this is that, throughput limit aside, there's an implicit limit on how fast you can query the items in a collection, since the calls are sequential.12
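The pagination loop looks roughly like this; `fake_query` below is a stand-in that mimics the shape of a real response (with boto3, you'd pass `ExclusiveStartKey` to `query` instead):

```python
# Sketch of paginating query results via LastEvaluatedKey. fake_query
# mimics the shape of a DynamoDB response over an in-memory collection.

SONGS = [{"Artist": "Kyuss", "Song": s} for s in ["One", "Three", "Two"]]
SONGS.sort(key=lambda item: item["Song"])  # collections are sorted

def fake_query(exclusive_start_key=None, limit=2):
    start = 0
    if exclusive_start_key is not None:
        keys = [item["Song"] for item in SONGS]
        start = keys.index(exclusive_start_key["Song"]) + 1
    page = SONGS[start:start + limit]
    response = {"Items": page}
    if start + limit < len(SONGS):
        # more items remain: tell the caller where to continue from
        response["LastEvaluatedKey"] = {"Song": page[-1]["Song"]}
    return response

def query_all():
    start_key = None
    while True:
        response = fake_query(exclusive_start_key=start_key)
        yield from response["Items"]
        start_key = response.get("LastEvaluatedKey")
        if start_key is None:
            break  # no LastEvaluatedKey means this was the last page

all_items = list(query_all())
```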
Indexes #
As discussed in the logical model, access not based on primary key is very inefficient.
Secondary indexes allow queries and scans that use alternative primary keys, composed of different attributes than those of the base table. Unlike table primary keys, index sort keys do not have to be unique for a given partition key. An item that is missing one of the index primary key attributes will not appear in the index.
Changes to the table are automatically propagated to any secondary indexes. Aside from the index and table primary key attributes, an index can include copies of other attributes (aka attribute projection), which allows the index to answer queries alone, without extra reads to the base table.13
Global secondary indexes #
A global secondary index allows using different partition and sort key attributes.
Conceptually, a global secondary index is just a table: it has its own separate capacity, no limits on size or number of items, and the same partition throughput limits apply.
Although GSIs are updated asynchronously, an index without enough capacity to process the updates will still cause write throttling on the base table. To retrieve attributes not in the index, you have to get them yourself from the table (batch operations can speed this up).
Example
(from here) Continuing with the music example, a GSI with Genre and Album as partition and sort keys would allow you to also efficiently:
- query songs by genre (and, with additional processing, albums by genre)
- query songs by genre and album (but, since two albums can have the same genre and title, you might want to group by artist in application code)
# table Music (partition key: Artist, sort key: Song)
Kyuss: !btree
Demon Cleaner: { Album: Welcome To Sky Valley, Genre: Rock }
Space Cadet: { Album: Welcome To Sky Valley, Genre: Rock }
1000mods: !btree
Claws: { Genre: Rock } # has no Album!
Vidage: { Album: Super Van Vacation, Genre: Rock }
Solar Fields: !btree
Air Song: { Album: Leaving Home, Genre: Electronic }
# GSI Genres (partition key: Genre, sort key: Album)
Rock: !btree
Super Van Vacation: { Artist: 1000mods, Song: Vidage }
Welcome To Sky Valley: { Artist: Kyuss, Song: Space Cadet }
Welcome To Sky Valley: { Artist: Kyuss, Song: Demon Cleaner }
Electronic: !btree
Leaving Home: { Artist: Solar Fields, Song: Air Song }
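Fetching non-projected attributes then amounts to extra reads from the base table by primary key – sketched here with plain dicts (in practice, you'd use BatchGetItem):

```python
# After querying a GSI, attributes not projected into the index must be
# fetched from the base table by primary key (sketched with dicts; the
# items and the Year attribute are made up for illustration).

table = {  # (Artist, Song) -> full item
    ("Kyuss", "Demon Cleaner"):
        {"Album": "Welcome To Sky Valley", "Genre": "Rock", "Year": 1994},
    ("Solar Fields", "Air Song"):
        {"Album": "Leaving Home", "Genre": "Electronic", "Year": 2005},
}

# index query result: only the table and index key attributes
index_items = [
    {"Artist": "Kyuss", "Song": "Demon Cleaner",
     "Genre": "Rock", "Album": "Welcome To Sky Valley"},
]

def fetch_full_items(index_items):
    # one extra read per item; BatchGetItem would do up to 100 at once
    return [table[(item["Artist"], item["Song"])] for item in index_items]

full = fetch_full_items(index_items)
```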
Local secondary indexes #
A local secondary index allows using a different sort key attribute.
LSI data is stored together with partition data (the index is local to the partition), so besides the table B-tree, each collection has one B-tree per LSI.
This allows strongly consistent reads and fetching non-projected attributes, but also limits collection size to 10 GB and collection throughput to the partition limit, since it prevents further partition splitting (as each sort key would split the items in a different way).
Example
A LSI with Year as sort key would allow you to also efficiently:
- query songs by artist, in chronological order
- query songs by artist and year
# table Music (partition key: Artist, sort key: Song)
1000mods: !btree
Claws: { Year: 2014 }
Road To Burn: { Year: 2011 }
Vidage: { Year: 2011 }
Solar Fields: !btree
Sombrero: { Year: 2011 }
# table Music (partition key: Artist, LSI sort key: Year)
1000mods: !btree
2011: { Song: Vidage }
2011: { Song: Road To Burn }
2014: { Song: Claws }
Solar Fields: !btree
2011: { Song: Sombrero }
Features #
Let's look at some of the things DynamoDB can do besides CRUD operations.
Eventual consistency #
So, remember I said partitions are backed by real computers? I didn't say how many.
To allow the high-availability magic to happen, a partition is backed by three nodes in separate data centers:14 a leader that handles writes and two asynchronous replicas.
This explains why there are two kinds of reads:
- strongly consistent reads go to the leader, so you always get the latest data
- eventually consistent reads go to any node, so you may get slightly older data, but if you repeat the read later, you will eventually get the latest data; because they use all the available nodes, they are more efficient, and thus cheaper
Note that strongly consistent reads do not replace synchronization primitives like conditional writes and transactions, but they can be useful to lower the rate at which these operations fail for highly-contended items.
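With boto3, choosing between the two is a single parameter. Here's a sketch that only builds the request (actually sending it needs credentials and a real table, so the `Music` table and key below are made up):

```python
# Building GetItem parameters in boto3's low-level shape; ConsistentRead
# selects a strongly consistent read (the default is eventually
# consistent). Table name and key are made up for illustration.

def get_item_params(table_name, key, consistent=False):
    params = {"TableName": table_name, "Key": key}
    if consistent:
        params["ConsistentRead"] = True
    return params

params = get_item_params(
    "Music",
    {"Artist": {"S": "Kyuss"}, "Song": {"S": "Space Cadet"}},
    consistent=True,
)
# with boto3: boto3.client("dynamodb").get_item(**params)
```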
Conditional writes #
Write operations can specify a condition expression that must be true for the write to happen (e.g. an attribute has a specific value); if the expression is false, the write fails. Condition expressions can refer only to the item being modified.
Conditional writes are critical for data consistency and avoiding concurrency bugs, since they are the only way to run logic server-side while the item is being modified. You can use conditional writes to build higher-level abstractions like optimistic locking, distributed locks, and atomic counters.15
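As a sketch of the optimistic locking idea, here's a toy in-memory stand-in for a conditional put – the real thing would be a PutItem with a condition expression like `version = :expected`:

```python
# Toy optimistic locking: each write asserts that the version it read
# is still current, mimicking a conditional PutItem.

class ConditionFailed(Exception):
    pass

store = {}  # primary key -> item

def put_if_version(key, item, expected_version):
    current = store.get(key)
    current_version = current["version"] if current else 0
    if current_version != expected_version:
        raise ConditionFailed  # someone else wrote first
    store[key] = {**item, "version": expected_version + 1}

put_if_version("song#Claws", {"Year": 2014}, expected_version=0)

# a concurrent writer that also read version 0 now fails instead of
# silently overwriting the first write
try:
    put_if_version("song#Claws", {"Year": 2011}, expected_version=0)
except ConditionFailed:
    conflict = True
```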
Transactions #
Transactions allow performing multiple writes as a single atomic operation, isolated from other operations; if two operations attempt to change an item at the same time, one of them fails. Transactions can target up to 100 distinct items in one or more tables in the same region, and consume twice as much capacity.
You can use transactions with condition expressions – if a condition fails for one item, none of the items are modified; you can also check an item without modifying it. As with single-item writes, an expression can refer only to an individual item (you can't have a condition about another item in the transaction).
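As a sketch, a TransactWriteItems request combining a conditional put with a condition check on another item might look like this (table and item names are made up; building the request is pure data, sending it needs boto3 and a real table):

```python
# Sketch of a TransactWriteItems request: a conditional Put plus a
# ConditionCheck on a second item. If either condition fails, neither
# item is touched. Table and item names are made up.

transact_items = [
    {
        "Put": {
            "TableName": "Music",
            "Item": {"Artist": {"S": "Kyuss"}, "Song": {"S": "Gardenia"}},
            # fails the whole transaction if the song already exists
            "ConditionExpression": "attribute_not_exists(Song)",
        }
    },
    {
        "ConditionCheck": {
            "TableName": "Music",
            "Key": {"Artist": {"S": "Kyuss"}, "Song": {"S": "Space Cadet"}},
            # check another item without modifying it
            "ConditionExpression": "attribute_exists(Song)",
        }
    },
]
# with boto3:
# boto3.client("dynamodb").transact_write_items(TransactItems=transact_items)
```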
Batch operations #
Batch operations allow you to put/delete up to 25 items or read up to 100 items in a single request, up to 16 MB in total, more efficiently than using single-item operations. Batch writes don't support updates or condition expressions.
The operations in a batch are independent from one another – some writes may fail, or only some of the read items may be returned (e.g. if throughput limits are reached).
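Anything over 25 writes has to be chunked client-side; a minimal helper (the request shapes are illustrative):

```python
# BatchWriteItem accepts at most 25 put/delete requests per call, so
# larger workloads have to be chunked client-side.

def chunked(items, size=25):
    for i in range(0, len(items), size):
        yield items[i:i + size]

requests = [
    {"PutRequest": {"Item": {"Song": {"S": f"song-{n}"}}}}
    for n in range(60)
]
batches = list(chunked(requests))
# each batch would go in one BatchWriteItem call; anything the service
# couldn't process comes back in UnprocessedItems and should be retried
```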
Streams #
Streams allow you to capture changes to the items in a table in near-real time. There are two flavors of streams, DynamoDB Streams and Kinesis Data Streams, each with different features and integrations.
Notable applications of streams are Lambda triggers (similar to the ones in relational databases, except they run after the change), replication to places like S3 or Redshift via Firehose, and automatic archival.
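A Lambda trigger receives batches of change records; a minimal handler might look like this (the record shape follows DynamoDB Streams' wire format, but the handler itself is a toy):

```python
# Sketch of a Lambda handler for a DynamoDB stream: each record carries
# the event name and, depending on the stream view type, old/new item
# images in DynamoDB's attribute-value format.

def handler(event, context=None):
    seen = []
    for record in event["Records"]:
        name = record["eventName"]  # INSERT, MODIFY, or REMOVE
        if name in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            seen.append((name, image["Song"]["S"]))
        else:
            seen.append((name, record["dynamodb"]["Keys"]["Song"]["S"]))
    return seen

# a made-up sample event with one insert and one delete
sample_event = {
    "Records": [
        {"eventName": "INSERT",
         "dynamodb": {"NewImage": {"Song": {"S": "Claws"}}}},
        {"eventName": "REMOVE",
         "dynamodb": {"Keys": {"Song": {"S": "Vidage"}}}},
    ]
}
result = handler(sample_event)
```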
Anyway, that's it for now.
In the next article, we'll have a closer look at core DynamoDB design patterns, including the fundamental single-table design.
Learned something new today? Share it with others, it really helps!
Here, "collection" just means a "group of things". [return]
Ignore the names for now, it'll make sense in a bit. [return]
It is also possible to have a table with only a partition key and no sort key, but you can think of that like a degenerate case where the sort key is a constant value, and thus each partition key can have only one item. [return]
Yes, an item collection is different from a "collection of items". Don't look at me, I didn't pick the names. ಠ_ಠ [return]
Indexes, which we'll discuss later, offer an escape hatch to this. [return]
A quick rant on naming. With a hash table, you would say "hash table key", or maybe even "item key"; you would not say "bucket key", since that's a low level detail, and also a bucket can have multiple keys. You know, like in DynamoDB.
THEN WHY IS IT CALLED A PARTITION KEY [return]
You have to admit that this is a great explanation. Surely you'd find it in the docs, and not buried in a random blog post published four years after the feature was announced (also in a blog post), which itself explains more than the official documentation does to this day, seven years later! [return]
Assuming the table has enough configured throughput. [return]
Converted to normal people units for your convenience. DynamoDB uses its own capacity units as a convoluted way of saying that for accounting purposes, item size is rounded up to 4 KB (1 RCU) for reads and 1 KB (1 WCU) for writes. This is presumably because the size of a capacity unit can increase over time. [return]
Yes, this includes just counting them. [return]
If there is no good natural partition key, you can make one by sharding a low-cardinality attribute, which we'll cover in the next article. [return]
You could make it twice as fast by querying from both ends, but probably no faster, since for binary search you'd need to jump in the middle of two sort keys. Unsurprisingly, we'll look at a potential solution in the next article. [return]
This is the same as a covering index in relational databases. [return]
Or better said, availability zones. [return]
Although you can also use update expressions for atomic counters. [return]