Ordered key sharding in DynamoDB

So, you want to keep a sorted index in DynamoDB, but for whatever reason – usually throughput-related – it won't fit on a single partition. What do you do?

Today, we look at the available solutions, do the math, and find out which is best.

Tip

This worked example is part of my DynamoDB crash course series.

Contents

Requirements
A sparse index is almost enough
But scan results are not ordered
But a single partition key causes throttling
But random suffixes are random
But hash suffixes are not ordered
But there are a lot of first characters
But some first bytes need multiple shards
But tries and prefix ranges are complicated
But the prefix distribution can change

Requirements #

Say you're using single table design with a table of artists, albums, and songs.¹

You keep an artist's items in a single collection (aka same partition key), and use sort keys artist, album#{Album}, and song#{Album}#{Song}, depending on their type:

# table Music (partition key: Artist, sort key: sk)
Solar Fields: !btree
  'album#Leaving Home': { Album: Leaving Home, ... }
  'artist': { ... }
  'song#Leaving Home#Air Song': { ... }
  'song#Leaving Home#Monogram': { ... }

To list albums without doing a full table scan, you need a global secondary index.

Let's come up with some reasonable requirements; the GSI should support:

items up 500 bytes (we project additional attributes besides the keys)
10,000 queries/second, max 100 items/query, sorted alphabetically
- list all albums
- list albums by title
10,000 writes/second (to avoid write throttling during imports)

A sparse index is almost enough #

One way to do it is to use a dedicated sparse index, taking advantage of the fact that items with missing index keys don't appear in the index.

If only albums have an Album attribute, we just create a new GSI:

# GSI Albums (partition key: Album, sort key: Artist)
Leaving Home: !btree
  International Pony: { sk: 'album#Leaving Home', ... }
  Solar Fields: { sk: 'album#Leaving Home', ... }
Heavy Migration: !btree
  Dday One: { sk: 'album#Heavy Migration', ...}

If songs have an Album too, we add a dedicated AlbumsPK attribute instead.

In many ways, this is the ideal solution. To list all albums, we scan the index. To list albums by title, we query an index partition key. We have lots of unique partition keys with items spread pretty evenly across them, which should prevent throttling.

But scan results are not ordered #

...except scan results are not ordered, so we're missing the sorted alphabetically part.

What is ordered are sort keys, so we can use a single index collection instead:

# GSI GSI1 (partition key: gsi1pk, sort key: gsi1sk)
'albums': !btree
  Heavy Migration: { Artist: Dday One, sk: 'album#Heavy Migration', ... }
  Leaving Home: { Artist: Solar Fields, sk: 'album#Leaving Home', ... }
  Leaving Home: { Artist: International Pony, sk: 'album#Leaving Home', ... }

This is also seemingly ideal. To list all albums, we query the entire index partition key. To list albums by title, we use a sort key. The results are sorted as required, and there's no limit on the number of items in a collection.

But a single partition key causes throttling #

However, there are per-partition limits of 24 MB/s for reads and 1 MB/s for writes.

Let's see how they compare to our requirements:

reads: 500 bytes/item * 10k queries/s * 100 items/query = 500 MB/s (~21x)
writes: 500 bytes/item * 10k items/s = 5 MB/s (5x)

Uh-oh, turns out we need 21 times the throughput one partition can deliver.

One way to spread the load is sharding, using multiple synthetic partition keys of the form album#{shard_id}. A common option for the shard id is a random number from a known range, e.g. album#{randrange(21)}:

# GSI GSI1 (partition key: gsi1pk, sort key: gsi1sk)
'album#1': !btree
  Leaving Home: { Artist: Solar Fields, ... }
'album#12': !btree
  Heavy Migration: { Artist: Dday One, ... }
'album#20': !btree
  Leaving Home: { Artist: International Pony, ... }

To list all albums, query each shard in turn:

for shard in range(21):
    for item in dynamodb.query(f"album#{shard}"):
        yield item

But random suffixes are random #

There's a problem, though – with random shard ids we can't easily list albums by title, since albums with the same title may end up on any shard.

A better option is to calculate the shard id from the album title using a hash function:

def hash(s):
    return int.from_bytes(sha256(s.encode()).digest())

def album_shard_id(album_title):
    return hash(album_title) % 21

# GSI GSI1 (partition key: gsi1pk, sort key: gsi1sk)
'album#6': !btree
  Leaving Home: { Artist: Solar Fields, ... }
  Leaving Home: { Artist: International Pony, ... }
'album#8': !btree
  Heavy Migration: { Artist: Dday One, ... }

To list albums by title:

dynamodb.query(f"album#{album_shard_id(album_title)}", sk=album_title, index='GSI1')

But hash suffixes are not ordered #

That takes care of throughput, but now results aren't sorted alphabetically anymore. We can sort items within each shard using the sort key, but they are spread uniformly across shards, and there's no order between shards.

Maybe we could use the first letter as shard id instead?

Of course, we have to account for some first letters being more frequent than others. In this case, we can approximate the actual distribution by using MusicBrainz data.

There are 5.5 million albums:

>>> import polars as pl
>>> titles = pl.read_csv(
...     'mbdump/release',
...     has_header=False,
...     separator='\t',
...     quote_char=None,
...     columns=[2],
...     new_columns=['title'],
... )[:,0]
>>> titles.count()
5535986

...but only 3.3 million unique titles, partly due to different releases of the same album, partly due to some titles being more popular – a few of them, really popular:

>>> titles.value_counts(sort=True)
shape: (3_370_505, 2)
 title            count
 Greatest Hits    4638
 Demo             3140
 …                …
 Salsa salsa      1
 Glamour: Deluxe  1
>>> titles.unique_counts().quantile([.9, .99, .999, .9999, 1])
[2.0, 11.0, 58.0, 221.0, 4638.0]

Let's look at first characters:

>>> normalized = titles.str.to_lowercase().str.normalize('NFKD').sort()
>>> normalized.str.slice(0, 1).value_counts(sort=True)
shape: (5_402, 2)
 title  count
 t      584760
 s      509065
 a      317513
 l      298757
 …      …
 🫀     1
 🫂     1
 🫧     1
 󠀼       1

But there are a lot of first characters #

5402?! Indeed, there's more to Unicode than the Latin alphabet:

>>> normalized.str.slice(0, 1).unique().sample(10).to_list()
['學', '舒', 'і', '进', '੦', '潮', '向', '妳', '陳', '🍅']

And it's actually worse than that – there are five thousand characters in our dataset, but there are hundreds of thousands of possible Unicode characters.

This is not a problem when adding the albums, but it is a problem when listing them, since we need to enumerate all the shards in a reasonable amount of time (and most shards being empty doesn't help, either).

As a very bad compromise, we could use the first byte of the UTF-8 encoding instead; this caps the number of shard ids at 256, and at least Latin titles would be sorted (I did say it's a bad compromise). There:

>>> firstbytes = normalized.map_elements(str.encode).bin.slice(0, 1)
>>> firstbytes.value_counts(sort=True)
shape: (136, 2)
 title    count
 b"t"     584760
 b"s"     509065
 b"a"     317513
 b"l"     298757
 …        …
 b"\xd4"  2
 b"U"     1
 b"\xee"  1
 b"\xf3"  1

But some first bytes need multiple shards #

We knew the first byte distribution would be skewed, but some of them don't even fit on a single shard (and it gets worse the more shards we need):

>>> shard_count = 21
>>> firstbytes.value_counts(sort=True).with_columns(
...     pl.col('count') / (len(firstbytes) / shard_count)
... ).head()
shape: (5, 2)
 title  count
 b"t"   2.218206
 b"s"   1.931068
 b"a"   1.204442
 b"l"   1.133294
 b"b"   1.106949

We're back to where we started: how do we sort between shards with the same prefix?

We don't – we find longer prefixes that fit in one shard.

That sounds like the perfect job for a trie (aka prefix tree). This would also allow us to switch back to characters, and merge small prefixes into ranges until each range fits one shard. But that's complicated, and as often the case, there must be a better way.²

But tries and prefix ranges are complicated #

We're looking for contiguous ranges, each of a certain size. Tries are good for finding the shortest prefix, but we don't really care about prefix length.

Why not just split the sorted titles into N equal ranges instead? This takes care of the uneven distribution:

>>> boundaries = normalized.gather_every(2000).str.slice(0, 4)
>>> boundaries.value_counts(sort=True)
shape: (2_103, 2)
 title     count
 the       161
 live      23
 …         …
 風吹けは  1
 魔法少女  1

...provided a long enough prefix:

>>> boundaries = normalized.gather_every(2000).str.slice(0, 16)
>>> boundaries.value_counts(sort=True).filter(pl.col('count') > 1)
shape: (3, 2)
 title             count
 greatest hits     3
 demo              2
 the very best of  2

...almost there:

>>> boundaries = normalized.gather_every(2000).str.slice(0, 20)
>>> boundaries.value_counts(sort=True).filter(pl.col('count') > 1)
shape: (2, 2)
 title          count
 greatest hits  3
 demo           2

This highlights another problem – if the shard size is too small, there may be more than a shard's worth of albums with identical titles; we can fix this by using another, random suffix (ordering doesn't matter anymore, since they have the same title).

Thankfully, our shards are huge, so it's not an issue:

>>> shard_size = int(math.ceil(len(normalized) / shard_count))
>>> shard_size
263619
>>> boundaries = normalized.gather_every(shard_size).str.slice(0, 20)

To use this, save the list of boundaries in code, and find the index of the biggest boundary smaller than a given album title:

import bisect
import unicodedata

ALBUM_TITLE_BOUNDARIES = [
    '',  # replaced with the smallest possible string
    'agartha',
    'barstow / crazy',
    'can you feel it',
    'cyan rot',
    'dreams take over eve',
    'feud semiotics (rb. ',
    'grave poetry',
    'i live',
    'kannaval',
    'live in florence',
    "mir ist's gleich / i",
    'notice',
    'platforms ep',
    'rituals',
    'skylten',
    'surtr / absorbed',
    'the human touch',
    'tonttujen jouluyö: ',
    'walking away',
    'голос',
]

def album_shard_id(album_title):
    normalized = unicodedata.normalize('NFKD', album_title.lower())
    return bisect.bisect(ALBUM_TITLE_BOUNDARIES, normalized) - 1

>>> album_shard_id('2 Pie Island')
0
>>> album_shard_id('Heavy Migration')
7
>>> album_shard_id('Leaving Home')
9
>>> album_shard_id('Space Cadet')
15

But the prefix distribution can change #

We were lucky to have data on the prefix distribution, but that's not always the case, and even if it is, the distribution can change over time.

For example, the last of the 21 shards above starts in the Cyrillic Unicode block, which means most existing scripts go into a single shard. What if we import 1 million Chinese and Japanese albums at some point?

One way to deal with this is to give more weight to known gaps in the data. Another is to have more shards from the start to account for unknown gaps – 210 shards instead of 21 sounds pretty reasonable.

If all else fails, you can move to a new index with more shards, but that comes with its own complications, so it's best to get it roughly right from the start.

Anyway, that's it for now.

Learned something new today? Share it with others, it really helps! PyCoder's Weekly HN Bluesky linkedin Twitter

This is a simplified example; as the MusicBrainz database shows, the schema for this kind of thing would be way more complicated in practice. ^[return]
You're welcome to try, though, especially if you're preparing for an interview. ^[return]

This is part of a series: