hashlib: object supporting the buffer API required

April 2021 ∙ three minute read

So you're trying to compute a hash using hashlib, and get an exception like this:

>>> import hashlib
>>> hashlib.md5(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object supporting the buffer API required

... or like this:

>>> hashlib.md5('two')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Unicode-objects must be encoded before hashing

What does it mean?

The first clue is in these two bits from the docs:

You can now feed this object with bytes-like objects (normally bytes) using the update() method.

Feeding string objects into update() is not supported, as hashes work on bytes, not on characters.

Now, "object supporting the buffer API required" is a more precise way of saying "the object is not bytes-like". That is, it cannot export a series of bytes through the buffer interface, a way for Python objects to provide access to their underlying binary data.
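One way to check whether an object is bytes-like is to try wrapping it in a memoryview, which only works for objects that export a buffer. This is a minimal sketch; is_bytes_like() is a hypothetical helper, not something hashlib provides:

```python
def is_bytes_like(obj):
    """Return True if obj exports its data through the buffer interface."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True


# bytes, bytearray and memoryview export a buffer; str and int do not
for candidate in (b'two', bytearray(b'two'), memoryview(b'two'), 'two', 2):
    print(type(candidate).__name__, is_bytes_like(candidate))
```

The first three candidates print True, and the last two print False, which matches the two exceptions above.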

Note

In the code above, the constructor passes the initial data to update().
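To see that the constructor argument and update() are interchangeable, here is a small sketch that hashes the same bytes in three ways; the variable names are only for illustration:

```python
import hashlib

data = b'two'

# Pass the data straight to the constructor ...
one_step = hashlib.md5(data).hexdigest()

# ... or create an empty hash object and feed it afterwards.
h = hashlib.md5()
h.update(data)
two_steps = h.hexdigest()

# Repeated update() calls behave like one call with the concatenated data.
h = hashlib.md5()
h.update(b't')
h.update(b'wo')
chunked = h.hexdigest()

print(one_step == two_steps == chunked)
```

All three digests are the same, so anything that update() rejects will be rejected by the constructor too.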

Why does this happen?

update() refuses to take anything other than bytes-like objects because there are many different ways of converting arbitrary objects to bytes, and it cannot guess which one you want (some objects, such as file objects or sockets, can't even be meaningfully converted).
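For the str from the second traceback, "converting to bytes" means picking an encoding, and the choice changes the result. A short sketch, using an example string chosen here because it contains a non-ASCII character:

```python
import hashlib

text = 'naïve'

# The same characters become different bytes under different encodings
utf8 = text.encode('utf-8')
latin1 = text.encode('latin-1')
utf16 = text.encode('utf-16-le')

for encoded in (utf8, latin1, utf16):
    # Different bytes in, different digest out
    print(encoded, hashlib.md5(encoded).hexdigest())
```

For pure ASCII text like 'two', UTF-8 and Latin-1 happen to produce the same bytes, but it is still safer to name the encoding explicitly and stick to it.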

Let's look at the initial example, where we're trying to get the hash of an int.

One way of converting an int to bytes is to get its string representation and encode that; encode() defaults to UTF-8, which is fine here:

>>> x = 2
>>> repr(x)
'2'
>>> repr(x).encode()
b'2'

Alternatively, we can use to_bytes() to convert it directly; to do so, we specify an explicit byte length and order (both arguments are required before Python 3.11):

>>> x.to_bytes(2, 'big')
b'\x00\x02'
>>> x.to_bytes(2, 'little')
b'\x02\x00'
>>> x.to_bytes(4, 'big')
b'\x00\x00\x00\x02'
>>> x.to_bytes(4, 'little')
b'\x02\x00\x00\x00'

The struct module allows doing the same thing for C structs composed of bools, bytes, integers and floats, with varied representations:

>>> import struct
>>> struct.pack('>i', x)
b'\x00\x00\x00\x02'
>>> struct.pack('<i', x)
b'\x02\x00\x00\x00'
>>> struct.pack('>q', x)
b'\x00\x00\x00\x00\x00\x00\x00\x02'
>>> struct.pack('<q', x)
b'\x02\x00\x00\x00\x00\x00\x00\x00'
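A format string can also describe several fields at once. The following sketch packs a bool, an int and a float into one bytes object; the field layout is made up for illustration:

```python
import struct

# '>' = big-endian with standard sizes and no padding,
# '?' = bool (1 byte), 'i' = int (4 bytes), 'f' = float (4 bytes)
record_format = '>?if'

packed = struct.pack(record_format, True, 2, 1.5)
print(packed)
print(struct.calcsize(record_format))  # total size in bytes: 1 + 4 + 4
```

Switching '>' to '<' reverses the bytes of the int and float fields, just like in the single-value calls above.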

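As you can see, we get different bytes depending on the method used. Different bytes produce different hashes; only the methods that happen to produce identical bytes (for example, to_bytes(4, 'big') and struct.pack('>i', x)) produce the same hash: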

>>> values = [
...     repr(x).encode('utf-8'),
...     x.to_bytes(2, 'big'),
...     x.to_bytes(2, 'little'),
...     x.to_bytes(4, 'big'),
...     x.to_bytes(4, 'little'),
...     struct.pack('>i', x),
...     struct.pack('<i', x),
... ]
>>> for value in values:
...     print(hashlib.md5(value).hexdigest())
...
c81e728d9d4c2f636f067f89cc14862c
7209a1ce16f85bd1cbd287134ff5cbb6
11870cb56df12527e588f2ef967232e8
f11177d2ec63d995fb4ac628e0d782df
f2dd0dedb2c260419ece4a9e03b2e828
f11177d2ec63d995fb4ac628e0d782df
f2dd0dedb2c260419ece4a9e03b2e828

What now?

In general, you have to pick a standard way of converting things to bytes.

If you only want to hash integers, you can pick one of the methods above. If you go with to_bytes() or struct, the byte size has to fit the biggest number you expect. For example, 255 is the biggest unsigned number you can express with 1 byte; 256 needs 2 bytes:

>>> (255).to_bytes(1, 'big')
b'\xff'
>>> (256).to_bytes(1, 'big')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: int too big to convert
>>> (256).to_bytes(2, 'big')
b'\x01\x00'
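If you don't know the biggest number in advance, one option is to compute the length from the value itself with bit_length(). This is only a sketch: int_to_bytes() is a hypothetical helper, and it deliberately handles non-negative integers only:

```python
def int_to_bytes(n, byteorder='big'):
    """Encode a non-negative int using as few bytes as possible (at least 1)."""
    if n < 0:
        raise ValueError('this sketch only supports non-negative integers')
    # Round the number of bits up to whole bytes; 0 still needs one byte
    length = max(1, (n.bit_length() + 7) // 8)
    return n.to_bytes(length, byteorder)


print(int_to_bytes(0))
print(int_to_bytes(255))
print(int_to_bytes(256))
```

Keep in mind that a variable length is fine for hashing a single integer, but if you concatenate several values before hashing, you also need a length prefix or a separator so that different inputs can't produce the same bytes.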

If you want to hash arbitrary objects, you have to find a standard way of converting them to bytes for each type you need to support, recursively. I've written an article about doing this for (almost) arbitrary objects.
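For data made only of dicts, lists, strings, numbers, booleans and None, one possible approach (not necessarily the one from that article) is to serialize to JSON in a deterministic way and hash the result. A minimal sketch, with stable_hash() as a hypothetical helper name:

```python
import hashlib
import json


def stable_hash(obj):
    """Hash JSON-compatible data so that equal dicts give equal digests."""
    # sort_keys=True makes dict key order irrelevant; fixed separators pin
    # down the whitespace, so the same data always serializes the same way.
    serialized = json.dumps(obj, sort_keys=True, separators=(',', ':'))
    return hashlib.md5(serialized.encode('utf-8')).hexdigest()


a = {'name': 'two', 'values': [1, 2]}
b = {'values': [1, 2], 'name': 'two'}
print(stable_hash(a) == stable_hash(b))
```

This has limits you should decide on up front: tuples and lists serialize identically, while 1 and 1.0 compare equal but serialize differently.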

In particular, note that repr(...).encode() will only work if the result of the object's __repr__ method has all the data you need, in a predictable order, and nothing that changes between equal objects (including across processes etc.):

>>> class C:
...     def __init__(self, n):
...         self.n = n
...     def __eq__(self, other):
...         if isinstance(other, type(self)):
...             return self.n == other.n
...         return False
...
>>> a = C(2)
>>> b = C(2)
>>> a == b
True
>>> repr(a) == repr(b)
False
>>> repr(a)
'<__main__.C object at 0x7f8890132df0>'
>>> repr(b)
'<__main__.C object at 0x7f88901be580>'

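Here, a and b are equal, but their reprs differ: C doesn't define __repr__, so it inherits the default one from object, which just returns the type name and the memory address of the object. Hashing repr(a).encode() and repr(b).encode() would therefore give two different hashes for equal objects.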
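One way to fix this, assuming you control the class, is to define a __repr__ built only from the data that takes part in equality. A sketch based on the class above:

```python
import hashlib


class C:
    def __init__(self, n):
        self.n = n

    def __eq__(self, other):
        if isinstance(other, type(self)):
            return self.n == other.n
        return False

    def __repr__(self):
        # Depends only on the value used by __eq__, not on the memory address
        return f'{type(self).__name__}({self.n!r})'


a = C(2)
b = C(2)
print(repr(a))
print(repr(a) == repr(b))

hash_a = hashlib.md5(repr(a).encode('utf-8')).hexdigest()
hash_b = hashlib.md5(repr(b).encode('utf-8')).hexdigest()
print(hash_a == hash_b)
```

Now equal objects have equal reprs, so they also hash to the same value. This still relies on repr(self.n) being stable, so the same caveat applies recursively to whatever n is.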


That's it for now.
