namedtuple in a post-dataclasses world

2021-07-21 ∙ six minute read

namedtuple has been around since forever,[1] and over time its convenience saw it used far outside its originally intended purpose.

With dataclasses now covering part of those use cases, what should one use named tuples for?

In this article, we take a look at exactly that, with a few examples from real code.

What are named tuples used for?

namedtuple has been in the standard library since Python 2.6; it builds tuple subclasses whose fields are also accessible by attribute lookup.

>>> from collections import namedtuple
>>> Point = namedtuple('Point', 'x y')

In general, this is useful when wrapping structured data; from the docs:

Named tuples are especially useful for assigning field names to result tuples returned by the csv or sqlite3 modules.
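As a minimal sketch of the sqlite3 case, here is one way to turn query rows into named tuples. The in-memory database and the points table are made up for illustration; only the sqlite3 and namedtuple calls are standard library API.

```python
import sqlite3
from collections import namedtuple

Point = namedtuple('Point', 'x y')

# An in-memory database with a made-up "points" table, just for illustration.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE points (x INTEGER, y INTEGER)')
db.executemany('INSERT INTO points VALUES (?, ?)', [(1, 2), (3, 4)])

# Point._make() builds an instance from any iterable, such as a result row.
points = [
    Point._make(row)
    for row in db.execute('SELECT x, y FROM points ORDER BY x')
]
db.close()

for p in points:
    print(p.x, p.y)
```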

Because of how easy they are to define, named tuples have also been used for:

  • quick-and-dirty temporary data structures, more readable than plain tuples and regular classes (you get constructor keyword arguments and a __repr__ for free)
  • hashable instances (to use as dict keys or set members, or as arguments to functions decorated with e.g. functools.lru_cache)
  • immutable instances[2]

dataclasses was added in Python 3.7, and allows writing regular classes just as easily, by generating the required special methods. With frozen instances, it even covers hashable and immutable instances.
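A short sketch of what frozen instances give you; the Point class and the norm_squared function are illustrative names, not taken from any real code base.

```python
from dataclasses import dataclass, FrozenInstanceError
from functools import lru_cache

@dataclass(frozen=True)
class Point:
    x: int
    y: int

# frozen=True (with the default eq=True) generates __hash__,
# so instances can be cache keys and set members.
@lru_cache(maxsize=None)
def norm_squared(p):
    return p.x ** 2 + p.y ** 2

p = Point(1, 2)
seen = {p}
cached = norm_squared(p)

# Assigning to a field of a frozen instance raises FrozenInstanceError.
try:
    p.x = 10
except FrozenInstanceError:
    mutated = False
else:
    mutated = True
```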

Before dataclasses, named tuples were used for the last three use cases because there were no other good alternatives in the standard library – you can do it with a normal class definition, but you have to write all the special methods by hand.

In case you've never used them, here's a comparison.

I'm using typing.NamedTuple because it looks similar to dataclasses; the result is the same as that of the traditional collections.namedtuple factory.

>>> from typing import NamedTuple
>>> class Point(NamedTuple):
...     x: int
...     y: int
...
>>> p = Point(1, y=2)
>>> p
Point(x=1, y=2)
>>> p.x
1
>>> p[0]
1
>>> list(p)
[1, 2]

>>> from dataclasses import dataclass
>>> @dataclass
... class Point:
...     x: int
...     y: int
...
>>> p = Point(1, y=2)
>>> p
Point(x=1, y=2)
>>> p.x
1
>>> p[0]
Traceback (most recent call last):
  ...
TypeError: 'Point' object is not subscriptable
>>> list(p)
Traceback (most recent call last):
  ...
TypeError: 'Point' object is not iterable

The problems with named tuples

PEP 557[3] explains why namedtuple sometimes isn't good enough; in summary:

  • The instances are always iterable; this can make it difficult to add fields, because adding a new field will break code that uses unpacking.
    • Also, if used as a return value in a backwards-compatible API, the result must remain iterable/indexable forever, even if you later stop using namedtuple.
  • Instances can be accidentally compared with any other tuple.
  • There's no mutable version (in the standard library).
  • Fields can't be combined by inheritance.
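Two of these problems are easy to reproduce. The sketch below uses made-up Point, Size and Point3D classes to show the accidental comparison and the unpacking breakage.

```python
from typing import NamedTuple

class Point(NamedTuple):
    x: int
    y: int

class Size(NamedTuple):
    width: int
    height: int

# Unrelated named tuples (and plain tuples) compare equal
# whenever their values match.
same = Point(1, 2) == Size(1, 2) == (1, 2)

# A later version of Point that grew a field, with a default
# so that existing constructor calls keep working.
class Point3D(NamedTuple):
    x: int
    y: int
    z: int = 0

# Code written against the two-field version breaks on unpacking.
try:
    x, y = Point3D(1, 2)
except ValueError as e:
    unpack_error = str(e)
else:
    unpack_error = None
```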

What are named tuples still good for?

With the drawbacks mentioned above, and with dataclasses covering a lot of their (maybe unintended) use cases, are named tuples good for anything anymore?

As you'd expect, the answer is yes.

The data is naturally a tuple

Named tuples remain perfect for their originally intended purpose: ordered, structured data.

Some examples:

  • rows returned by a database query
  • the result of parsing a binary file format
  • pairs of things, like HTTP headers (a dict is not always appropriate, since the same header can appear more than once, and the order does matter in some cases)
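The binary format case might look like the sketch below. The Header fields and the format string describe a made-up file header, not any real format.

```python
import struct
from collections import namedtuple

# Hypothetical header layout: 4-byte magic, 16-bit version,
# 32-bit payload length, all little-endian.
Header = namedtuple('Header', 'magic version length')
HEADER_FORMAT = '<4sHI'

# Stand-in for bytes read from a file.
data = struct.pack(HEADER_FORMAT, b'DEMO', 1, 42)

# struct.unpack() returns a plain tuple; _make() gives the fields names.
header = Header._make(struct.unpack(HEADER_FORMAT, data))
print(header)
```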

Pairs of things are interesting, because both unpacking and attribute access are valid usage patterns.

For example, for my feed reader library I use a named tuple to model the result of a feed update, a (feed URL, update details or exception) pair.

This makes it easier to make sense of what a value means in interactive sessions or when debugging; compare the named and unnamed versions:

>>> result = next(reader.update_feeds_iter())
>>> result
UpdateResult(url='http://antirez.com/rss', value=None)
>>> tuple(result)
('http://antirez.com/rss', None)

Also, the distinct class allows having a docstring that users can look at with help(), and better semantics via properties (error/ok here).
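A rough sketch of what such a class could look like. This is an assumption based on the description above, not the library's actual definition; in particular, the type of value and the exact semantics of error and ok are guesses.

```python
from typing import NamedTuple, Optional, Union

class UpdateResult(NamedTuple):
    """The result of updating a feed (hypothetical sketch)."""

    url: str
    # Assumed: update details, None, or the exception that was raised.
    value: Union[None, dict, Exception]

    @property
    def error(self) -> Optional[Exception]:
        """The exception, if the update failed; None otherwise."""
        return self.value if isinstance(self.value, Exception) else None

    @property
    def ok(self) -> bool:
        """True if the update did not fail."""
        return self.error is None

result = UpdateResult('http://example.com/feed', ValueError('bad feed'))

# Both usage patterns work on the same object.
url, value = result
failed = not result.ok
```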

You're already using a tuple

You're already using a tuple, and want to make new code more readable: a namedtuple gets you this while guaranteeing that old code, which indexes or unpacks the result, keeps working.

Some people argue that wherever you return a non-trivial tuple, you should be returning a namedtuple instead. I tend to agree.
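For instance, a function that used to return a plain (low, high) tuple can switch to a named tuple without touching its callers. The bounds function and the Bounds type here are illustrative.

```python
from collections import namedtuple

Bounds = namedtuple('Bounds', 'low high')

def bounds(values):
    # Previously: return min(values), max(values)
    return Bounds(min(values), max(values))

# Old callers that unpack or index keep working unchanged...
low, high = bounds([3, 1, 2])
first = bounds([3, 1, 2])[0]

# ...while new callers can use the more readable attribute access.
b = bounds([3, 1, 2])
span = b.high - b.low
```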

You want consumers that do unpacking to fail

In some cases, you want consumers that do unpacking to fail.

For example, in my feed reader library, I use a named tuple to group arguments related to filtering, because there are a lot of them, and they get passed around quite a bit before being used (I cover why in more detail here).

I know all arguments should always be handled, so I use unpacking specifically because I want the code to fail when a new one is added – if I used attribute access, the code would silently succeed. This is no substitute for tests, but the early warning is nice, especially in a larger code base.
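A sketch of the idea, with made-up names: FilterOptions, its fields, and build_conditions are hypothetical stand-ins, not the library's real code.

```python
from typing import NamedTuple, Optional

class FilterOptions(NamedTuple):
    feed_url: Optional[str] = None
    read: Optional[bool] = None

def build_conditions(options):
    # Unpacking on purpose: this line raises ValueError
    # as soon as the options type gains a field.
    feed_url, read = options
    conditions = []
    if feed_url is not None:
        conditions.append(('feed_url', feed_url))
    if read is not None:
        conditions.append(('read', read))
    return conditions

conditions = build_conditions(FilterOptions(read=True))

# Later, someone adds a field but forgets to handle it here.
class FilterOptionsV2(NamedTuple):
    feed_url: Optional[str] = None
    read: Optional[bool] = None
    important: Optional[bool] = None

try:
    build_conditions(FilterOptionsV2(read=True))
except ValueError:
    failed_early = True
else:
    failed_early = False
```

With attribute access (options.feed_url, options.read), the second call would have succeeded and quietly ignored the new field.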

Memory and speed

Last, but not least, named tuples are useful if you care about memory or speed; they are much smaller and faster than the equivalent (data)class. In most cases, the difference doesn't matter, but it can become noticeable if you create millions of instances.

Setting __slots__ helps with memory, but doesn't really help with speed.

Here's a quick comparison:

                     cls(1, 2)   obj.a   hash(obj)   size   total size
dataclass                846.9    49.7       361.7    152          320
dataclass + slots        709.1    45.5       342.5     48          104
namedtuple               465.3    43.2        99.6     56          112
dataobject + gc          150.3    43.1       104.1     48           48
dataobject               136.4    45.1       106.5     32           32

cls(1, 2), obj.a, hash(obj) are timings for that expression, in nanoseconds.

size is the sys.getsizeof of the object itself plus that of its __dict__ (if any), excluding the actual values. total size includes the values, as returned by Pympler's asizeof. Both are in bytes.

I ran this with 64-bit CPython 3.8 on macOS; Linux looks roughly the same.

When increasing the number of fields, obj.a remains constant, while the other timings increase proportionally. The slots dataclass is always 8 bytes smaller than the namedtuple.
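A minimal sketch of how numbers like these could be collected with the standard library alone, assuming timeit for the timings and sys.getsizeof for size. This is not necessarily the script behind the table; the size and time_ns helpers are made up, and total size (Pympler) and recordclass are left out.

```python
import sys
import timeit
from dataclasses import dataclass
from typing import NamedTuple

class NT(NamedTuple):
    a: int
    b: int

@dataclass(frozen=True)
class DC:
    a: int
    b: int

def size(obj):
    # Size of the object itself plus its __dict__, if it has one;
    # the field values are not included.
    total = sys.getsizeof(obj)
    if hasattr(obj, '__dict__'):
        total += sys.getsizeof(obj.__dict__)
    return total

def time_ns(stmt, cls, number=100_000):
    # Average time per execution of stmt, in nanoseconds.
    namespace = {'cls': cls, 'obj': cls(1, 2)}
    seconds = timeit.timeit(stmt, globals=namespace, number=number)
    return seconds / number * 1e9

for cls in (DC, NT):
    timings = [time_ns(s, cls) for s in ('cls(1, 2)', 'obj.a', 'hash(obj)')]
    print(cls.__name__, [round(t, 1) for t in timings], size(cls(1, 2)))
```

The absolute numbers depend heavily on the machine and the Python version, so only the relative differences are worth reading into.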

For the dataobject rows I used recordclass, which provides dataclass/namedtuple-equivalent types. The version without gc doesn't participate in cyclic garbage collection, so it shouldn't be used for recursive data structures.

The library still has some rough edges, though: the documentation is a bit confusing, and I had to use the (as yet unreleased) 0.15 version to get it working; also, note the wrong total size for dataobject + gc (it may be a Pympler bug). Nevertheless, the numbers are pretty compelling, and if you have this problem, it's definitely worth a look.

For dataclasses, __slots__ must be set explicitly; Python 3.10 added a slots=True parameter to the dataclass decorator that generates it for you.

The class definitions:

from typing import NamedTuple
from dataclasses import dataclass
from recordclass import dataobject

class NT(NamedTuple):
    a: int
    b: int

@dataclass(frozen=True)
class DC:
    a: int
    b: int

@dataclass(frozen=True)
class DS:
    a: int
    b: int
    __slots__ = ('a', 'b')

class DO(dataobject):
    a: int
    b: int
    __options__ = dict(readonly=True, fast_new=True)

class DG(dataobject):
    a: int
    b: int
    __options__ = dict(readonly=True, fast_new=True, gc=True)

That's it for now. :)


  1. 2008 seems like forever enough these days.

  2. See this for an example of why you might want immutable instances.

  3. I also cover PEP 557 and what can be learned from it here.