Deterministic hashing of Python data objects

2021-03-19 ∙ 10 minute read

... in which we calculate deterministic hashes for Python data objects, stable across interpreter versions and implementations.

If you're in a hurry, you can find the final version of the code at the end.

Why is this useful? #

Let's say you have a feed reader library; one of its main features is retrieving and storing web feeds (Atom, RSS, and so on).

Entries (articles) usually have an updated date, indicating the last time the entry was modified in a significant way. An entry is updated only if the updated date in the feed is newer than the one we have stored for it.

However, you notice the content of some entries changes without updated changing, so you decide to update entries whenever they change, regardless of updated.

After the feed is retrieved, entries are converted to data objects like these (the real ones have more attributes, though):

from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence

@dataclass
class Content:
    value: str
    language: Optional[str] = None

@dataclass
class Entry:
    id: str
    updated: Optional[datetime] = None
    content: Sequence[Content] = ()

A naive approach is to get the stored entry data and compare it with the feed version, but that's pretty inefficient.

A better solution is to use a hash function – a way to map data of arbitrary size (the message) to a fixed-size chunk of data (the hash value), such that:

  • it is quick to compute the hash value for any given message
  • the same message always results in the same hash
  • it is extremely unlikely that two different messages have the same hash

Then, instead of getting the full entry data from storage, we just get its (previously computed) hash, and compare it with the hash of the one we just retrieved.

Cryptographic hash functions have more properties, but the three listed above are everything we need.
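Concretely, the update decision might look like the sketch below; the `stored_hashes` mapping and the entry ids are made up for illustration, standing in for whatever storage the library actually uses:

```python
import hashlib

# hypothetical storage: entry id -> previously computed hash
stored_hashes = {'entry-1': hashlib.md5(b'old serialized entry').digest()}

def entry_changed(entry_id, serialized):
    """Return True if the retrieved entry differs from the stored one."""
    new_hash = hashlib.md5(serialized).digest()
    return stored_hashes.get(entry_id) != new_hash
```

Note that an entry we've never seen before (no stored hash) also counts as changed, which is what we want.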

Requirements #

Looking at our use case, we need a hash function that:

  1. supports (almost) arbitrary data objects; in our case, the various built-in types, datetimes, and dataclass instances should be enough
  2. is safe; passing an unsupported object should be an error
  3. is stable across interpreter versions and implementations, operating systems, and host machines
  4. ignores "empty" values, to allow adding new fields without the hash changing
  5. can skip some of the fields (I actually realized this is needed much later)

Because I'm using it in an existing library, I have some additional requirements:

  1. it should not have other dependencies outside the standard library, since any extra dependency gets passed down to the users
  2. it should be minimally invasive to existing code
  3. it should work with static typing

Problem: we need a stable hash function #

hash(): not stable, too restrictive #

An easy solution seems to be the built-in hash() function, which returns the integer hash of an object. However, it has a couple of issues.

By default, the hashes of str and bytes objects are randomized for security reasons (details, second note), so they're not predictable between Python processes:

$ python3 -c 'print(hash("abc"))'
-4743820898567001518
$ python3 -c 'print(hash("abc"))'
-6699381079787346150

Also, hash() only supports hashable objects; this means no lists, dicts, or non-frozen dataclasses. For my specific use case, this wasn't actually a problem, but it already puts huge constraints on how arbitrary the input can be.
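To illustrate the restriction: lists and dicts are unhashable, and so are non-frozen dataclass instances, since @dataclass sets __hash__ to None when eq=True (the default) and frozen=False:

```python
from dataclasses import dataclass

@dataclass
class Mutable:
    value: int

@dataclass(frozen=True)
class Frozen:
    value: int

def hashable(obj):
    try:
        hash(obj)
        return True
    except TypeError:
        return False

assert not hashable([1, 2])      # lists are mutable, hence unhashable
assert not hashable({'a': 1})    # same for dicts
assert not hashable(Mutable(1))  # @dataclass sets __hash__ = None here
assert hashable(Frozen(1))       # frozen dataclasses are hashable
```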

hashlib: still restrictive #

hashlib contains many different secure hash algorithms, which are by definition deterministic.

But it has one big problem – it only takes bytes:

>>> hashlib.md5(b'abc').hexdigest()
'900150983cd24fb0d6963f7d28e17f72'
>>> hashlib.md5('abc').hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Unicode-objects must be encoded before hashing
>>> hashlib.md5(1).hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object supporting the buffer API required

This is something we can work with, though: it changes our problem from we need a stable hash function to we need a deterministic way of serializing objects to bytes.

If you're curious why it only takes bytes, check out this article.

Problem: we need a deterministic way of serializing objects #

pickle: not stable #

pickle can turn most Python objects into bytes.

It does have multiple protocols, and the default one can change with the Python version; but we can select one and stick with it – we'll use version 4, added in Python 3.4.

Again, the easy solution is deceptive, since it seems to work:

$ function pickle {
$1 <<EOD
import pickle
from datetime import datetime
print(pickle.dumps($3, protocol=$2))
EOD
}
$ pickle python3.6 4 '[1, "abc"]' | md5sum
02fa88b9fea0912efe731ed56906b251  -
$ pickle python3.7 4 '[1, "abc"]' | md5sum
02fa88b9fea0912efe731ed56906b251  -
$ pickle python3.8 4 '[1, "abc"]' | md5sum
02fa88b9fea0912efe731ed56906b251  -
$ pickle pypy3.6 4 '[1, "abc"]' | md5sum
02fa88b9fea0912efe731ed56906b251  -

... until it doesn't:

$ pickle python3.6 4 'datetime(1, 1, 1)' | md5sum
9c4423b791578d865d8fbeb070a1b934  -
$ pickle pypy3.6 4 'datetime(1, 1, 1)' | md5sum
3c7c834cb2f1cf4aba8be5c326bb9ddd  -

Version 0 isn't stable either, but in a different way:

$ pickle python3.6 0 'datetime(1, 1, 1)' | md5sum
01acd91b95556a09f5ff9b7495e120da  -
$ pickle pypy3.6 0 'datetime(1, 1, 1)' | md5sum
01acd91b95556a09f5ff9b7495e120da  -
$ pickle python3.7 0 'datetime(1, 1, 1)' | md5sum
a6c815eca494dbf716cd4a7e5556d779  -

Version 3 does seem to work fine across all of the above, on both macOS and Linux (I also tested it with more complicated data). But what's to say it'll remain that way?

In fairness, this is not an issue with pickle – it guarantees you'll get the same object back after unpickling, not that you'll get the same binary stream after pickling.

And it's easy to explain why: pickles are actually programs. Some relevant and quite interesting comments from pickletools:

"A pickle" is a program for a virtual pickle machine (PM, but more accurately called an unpickling machine). It's a sequence of opcodes, interpreted by the PM, building an arbitrarily complex Python object.

Historically, many enhancements have been made to the pickle protocol in order to do a better (faster, and/or more compact) job on [builtin scalar and container types, like ints and tuples].

As explained below [for compatibility], pickle opcodes never go away, not even when better ways to do a thing get invented. The repertoire of the PM just keeps growing over time.

This means there can be multiple pickles that unpickle to the same object [1], even within the bounds of a specific protocol version.
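This is easy to demonstrate even without switching interpreters: the same object has different pickles under different protocols, but they all unpickle to the same value:

```python
import pickle

data = [1, 'abc']
p0 = pickle.dumps(data, protocol=0)
p4 = pickle.dumps(data, protocol=4)

# different byte streams...
assert p0 != p4
# ...that unpickle to the same object
assert pickle.loads(p0) == pickle.loads(p4) == data
```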

str() and repr(): not stable, not safe #

str() and repr() might seem like valid solutions, but neither of them is.

First, they're not stable, and not guaranteed to have all the information we may care about:

str(object) returns object.__str__(), which is the "informal" or nicely printable string representation of object. [...] If object does not have a __str__() method, then str() falls back to returning repr(object).

For many types, [repr()] makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.

More importantly, they're not safe – all Python objects have them, even some that we would not want to serialize at all:

>>> Content(object())
Content(value=<object object at 0x7f993cd5ff40>, language=None)
>>> Content(lambda: 1)
Content(value=<function <lambda> at 0x7f993e96ec10>, language=None)
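Worse, for objects without a custom __repr__, the representation is based on identity, not value: it embeds the memory address, which differs between objects and between runs, so it's useless as a serialization.

```python
x, y = object(), object()

# the default repr() includes the object's address
assert repr(x).startswith('<object object at ')

# two objects of the same type still get different representations
assert repr(x) != repr(y)
```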

json 👍 #

json might be just what we need; even the pickle documentation recommends it, although for different reasons.

Compared to the previous solutions, json has the opposite problem: it only supports dicts (with string/number keys only), lists, strings, and a few other basic types.

But that's not that big of an issue as you may think, since the json module makes it really easy to support other types: we just have to convert them to something it already understands.

Let's start with datetimes:

import json
from datetime import datetime

def json_default(thing):
    if isinstance(thing, datetime):
        return thing.isoformat(timespec='microseconds')
    raise TypeError(f"object of type {type(thing).__name__} not serializable")

>>> json.dumps(datetime(1, 1, 1), default=json_default)
'"0001-01-01T00:00:00.000000"'

Dataclasses aren't much harder to add either, thanks to asdict(), which converts them to dicts and recurses into any nested dataclasses, dicts, lists and tuples along the way:

import dataclasses

def json_default(thing):
    try:
        return dataclasses.asdict(thing)
    except TypeError:
        pass
    if isinstance(thing, datetime):
        return thing.isoformat(timespec='microseconds')
    raise TypeError(f"object of type {type(thing).__name__} not serializable")
>>> json.dumps(Entry('id', datetime(1, 1, 1), content=[Content('value')]), default=json_default)
'{"id": "id", "updated": "0001-01-01T00:00:00.000000", "content": [{"value": "value", "language": null}]}'

You may notice the dataclass type does not appear in the result, which for our use case is actually fine: dataclasses One(value=1) and Two(value=1) will both get converted to the dict {'value': 1}, resulting in the same hash. To make them different, we can include the type name:

>>> {'__type': type(thing).__name__, **asdict(thing)}
{'__type': 'Content', 'value': 1, 'language': None}

To ensure the output remains stable across Python versions, we'll force all of the dumps() default arguments to known values, and require it to sort dict keys:

def json_dumps(thing):
    return json.dumps(
        thing,
        default=json_default,
        ensure_ascii=False,
        sort_keys=True,
        indent=None,
        separators=(',', ':'),
    )
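With these settings, logically equal dicts serialize identically regardless of key insertion order, and the output contains no gratuitous whitespace:

```python
import json

kwargs = dict(ensure_ascii=False, sort_keys=True, indent=None, separators=(',', ':'))

# key order at construction time doesn't leak into the output
a = json.dumps({'b': 1, 'a': 2}, **kwargs)
b = json.dumps({'a': 2, 'b': 1}, **kwargs)
assert a == b == '{"a":2,"b":1}'
```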

One more wrapper to hash the serialized value, and we're done:

def get_hash(thing):
    return hashlib.md5(json_dumps(thing).encode('utf-8')).digest()
>>> get_hash(Entry('id', datetime(1, 1, 1), content=[Content('value')])).hex()
'78b4b8120af5f832b7ecfc34db1fe02b'

Problem: we need to ignore empty values #

Say we have a dataclass like the following:

>>> @dataclass
... class Data:
...     value: int
...
>>> get_hash(Data(2)).hex()
'5d872de403edb944a7b10450eda2f46a'

Which in time evolves to get another attribute:

>>> @dataclass
... class Data:
...     value: int
...     another: Optional[str] = None
...
>>> get_hash(Data(2)).hex()
'e54ad2c961239bd70da9a603cd078e18'

The old and new versions result in different dicts, so they have different hashes.

But should they? The only "real" data Data(2) contains is value=2. Likewise, there's not much of a difference in actual information between None and [] (for our use case, at least).

We can ignore "empty" values quite easily by using the asdict() dict_factory argument. I overlooked it initially, thinking it has to be a mapping class; it turns out any function that takes a key-value pair iterable works:
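A quick way to convince yourself any callable works: pass a plain function that records what it receives (the spy below is just for illustration):

```python
import dataclasses

@dataclasses.dataclass
class Point:
    x: int
    y: int

received = []

def spy_factory(pairs):
    # asdict() calls this with a list of (name, value) tuples
    pairs = list(pairs)
    received.append(pairs)
    return dict(pairs)

assert dataclasses.asdict(Point(1, 2), dict_factory=spy_factory) == {'x': 1, 'y': 2}
assert received == [[('x', 1), ('y', 2)]]
```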

from collections.abc import Collection

def dict_drop_empty(pairs):
    return dict(
        (k, v)
        for k, v in pairs
        if not (
            v is None
            or not v and isinstance(v, Collection)
        )
    )

def json_default(thing):
    try:
        return dataclasses.asdict(thing, dict_factory=dict_drop_empty)
    # ...

We now get the same hash as for the first version of the object:

>>> Data(2)
Data(value=2, another=None)
>>> get_hash(Data(2)).hex()
'5d872de403edb944a7b10450eda2f46a'

Note that we are specific about the falsy values we ignore: an empty string, list, or dict are empty, but 0 or False are not.

collections.abc provides abstract base classes that can be used to test whether a class provides a particular interface, without requiring it to subclass anything.

isinstance(v, (str, tuple, list, dict)) works in our example, but isinstance(v, Collection) checks for other collections we may have failed to think about that don't inherit from them; for example: sets, deques, numpy arrays, and even ones that don't exist yet.
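A few checks to show what Collection covers (and doesn't):

```python
from collections import deque
from collections.abc import Collection

# containers count, whether or not they subclass the built-in types
assert isinstance('', Collection)
assert isinstance(set(), Collection)
assert isinstance(deque(), Collection)

# scalars don't, so falsy values like 0 and False are never dropped
assert not isinstance(0, Collection)
assert not isinstance(False, Collection)
```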

Problem: we need to skip some fields #

Let's look at a more advanced version of our Entry class:

@dataclass
class Entry:
    feed_url: str
    id: str
    updated: Optional[datetime] = None
    content: Sequence[Content] = ()

If the feed URL changes, does the data of the entry change? That is, would we want to update the entry next time we retrieve it?

No, in our context, both feed_url and id are metadata; changing them should not change the hash, since the actual data does not change.

We could deal with it by changing json_default() to remove specific keys from the asdict() result.

However, this moves class-specific logic away from the class, and into json_default(), which forces us to change it whenever we add a new class, or the list of metadata attributes changes. Passing the list as an argument to get_hash() just moves the problem elsewhere.

A better way is to allow classes to tell json_default() what attributes to remove via a well-known class attribute.

Neither of these approaches works with asdict() and nested dataclasses, since asdict() is recursive, and we only get to look at the top-level class. We cannot do anything in dict_factory either, since we don't know which class the attribute pairs belong to.

To work around it, we'll implement a non-recursive version of asdict(), and rely on json.dumps() for recursion (it's cool like that):

def dataclass_dict(thing):
    # fields() raises TypeError for anything that's not a dataclass
    fields = dataclasses.fields(thing)
    if isinstance(thing, type):
        raise TypeError("got dataclass type, expected instance")

    exclude = getattr(thing, '_hash_exclude_', ())

    rv = {}
    for field in fields:
        if field.name in exclude:
            continue
        value = getattr(thing, field.name)
        if value is None or not value and isinstance(value, Collection):
            continue
        rv[field.name] = value

    return rv

def json_default(thing):
    try:
        return dataclass_dict(thing)
    # ...

We're using _hash_exclude_ (with an underscore at the end) to mimic the __double_underscore__ convention Python uses for magic methods; we are not using double underscores because they are reserved by Python.

To exclude some fields from the hash, we just need to set _hash_exclude_ to a tuple containing their names:

@dataclass
class Entry:
    feed_url: str
    id: str
    updated: Optional[datetime] = None
    content: Sequence[Content] = ()
    _hash_exclude_ = ('feed_url', 'id')
>>> get_hash(Entry('feed one', 'id', datetime(1, 1, 1))).hex()
'4f97b1e8e99d3304f50cd3c4428f224e'
>>> get_hash(Entry('feed two', 'id', datetime(1, 1, 1))).hex()
'4f97b1e8e99d3304f50cd3c4428f224e'
>>> get_hash(Entry('feed one', 'id', datetime(2, 2, 2))).hex()
'0c739e158792c5a91ec1632804d905c1'

Conclusion #

By leveraging dataclasses and the json module, we've managed to get stable, deterministic hashes for (almost) arbitrary Python data objects, with a decent trade-off between generality and safety, and two additional features, all in under 50 lines of code!

It's true that the solution is pretty specific to my use case, but with this little code, it should be trivial to adapt to something else.
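For reference, here are the snippets above, assembled into a single self-contained module (using MD5 as before):

```python
import dataclasses
import hashlib
import json
from collections.abc import Collection
from datetime import datetime


def get_hash(thing):
    return hashlib.md5(json_dumps(thing).encode('utf-8')).digest()


def json_dumps(thing):
    return json.dumps(
        thing,
        default=json_default,
        ensure_ascii=False,
        sort_keys=True,
        indent=None,
        separators=(',', ':'),
    )


def json_default(thing):
    try:
        return dataclass_dict(thing)
    except TypeError:
        pass
    if isinstance(thing, datetime):
        return thing.isoformat(timespec='microseconds')
    raise TypeError(f"object of type {type(thing).__name__} not serializable")


def dataclass_dict(thing):
    # fields() raises TypeError for anything that's not a dataclass
    fields = dataclasses.fields(thing)
    if isinstance(thing, type):
        raise TypeError("got dataclass type, expected instance")

    exclude = getattr(thing, '_hash_exclude_', ())

    rv = {}
    for field in fields:
        if field.name in exclude:
            continue
        value = getattr(thing, field.name)
        if value is None or not value and isinstance(value, Collection):
            continue
        rv[field.name] = value

    return rv
```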


As always, if you found this useful and would like to see more, drop me a line :)

  1. I found this idea somewhere on Stack Overflow, but I can't seem to find that specific post again.