Learn by reading code: Python standard library design decisions explained (for advanced beginners)

2021-04-12 ∙ five minute read

So, you're an advanced beginner – you've learned your way past Python basics and can solve real problems.

You're now moving past tutorials and blog posts; maybe you feel they offer one-dimensional solutions to simple, made-up problems; maybe instead of solving one specific problem, you want to get better at solving problems in general.

Maybe you've heard you should develop an eye by reading and writing a lot of code.

It's true.

So, what code should you read?

"Just read what you like."

What if you don't know what you like?

What if you don't like the right thing? Or worse, what if you like the wrong thing, and get stuck with bad habits because of it?

After all, you have to have an eye for that...

...but that's what you're trying to develop in the first place.

"There are so many projects on GitHub – pick one you like and see how they did it."

But most successful projects are quite large; where do you start?

And even if you knew where to start, how they did it isn't always obvious. Yes, the code is right there, but it doesn't really tell you why they did it, what they didn't do, or how they thought about the whole thing.

In other words, it is not obvious from the code itself what the design philosophy was, and what choices were considered before settling on an implementation.

In this article, we'll look at some Python standard library modules where it is.

A note about the standard library #

As a whole, the Python standard library isn't great for learning "good" style.

While all the modules are useful, they're not very uniform:

  • they have different authors;
  • some of them are old (pythonic was different 10-20 years ago); and
  • they have to preserve backwards compatibility (refactoring risks introducing bugs, and major API changes are out of the question).

On the other hand, at least some of them have detailed proposals explaining the design goals and tradeoffs, and the newer ones are actually quite consistent.

It's a few of the latter we'll look at.

Style aside, there's a lot to learn from the standard library, since it solves real problems for a diverse population of developers.

It's interesting to look at the differences between stdlib stuff and newer external alternatives – the gap shows a perceived deficiency (otherwise they wouldn't have bothered with the new thing). A decent example of this is urllib vs. requests.

How to read these #

Roughly in this order:

  • Get familiar with the library as a user: read the documentation, play with the examples a bit.
  • Read the corresponding Python Enhancement Proposal (PEP). The interesting sections usually are the abstract, rationale, design decisions, discussion, and rejected ideas.
  • Read the code; it's conveniently linked at the top of each documentation page.

statistics #

The statistics module adds statistical functions to the standard library; it's not intended to be a competitor to libraries like NumPy, but is rather "aimed at the level of graphing and scientific calculators".

It was introduced in PEP 450. Even if you are not familiar with the subject matter, it is a very interesting read:

  • The Rationale section compares the proposal with NumPy and do-it-yourself solutions; it's particularly good at showing what gets added to the standard library, and why.
  • There's also a Design Decisions section that makes explicit what the general design philosophy was; Discussion and FAQ have some interesting details as well.

The documentation is also very nice. This is by design; as the proposal says: "Plenty of documentation, aimed at readers who understand the basic concepts but may not know (for example) which variance they should use [...] But avoid going into tedious mathematical detail."

The code is relatively simple, and when it's not, there are comments and links to detailed explanations or papers. This may be useful if you're just learning about this stuff and find it easier to read code than maths notation.
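To make the "which variance should I use" point concrete, here is a minimal sketch of the two variance functions the module offers. The data is made up for illustration; the function names are the real ones from the statistics module.

```python
import statistics

# Made-up data, purely for illustration.
data = [1, 2, 3, 4]

# Sample variance: divides by n - 1. Use it when the data is a sample
# drawn from a larger population.
sample_var = statistics.variance(data)

# Population variance: divides by n. Use it when the data *is* the
# whole population.
population_var = statistics.pvariance(data)

print(statistics.mean(data), sample_var, population_var)
```

The two results differ only in the divisor, which is exactly the kind of detail the documentation spells out without going deep into the maths.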

pathlib #

The pathlib module provides a simple hierarchy of classes to handle filesystem paths; it is a higher-level alternative to os.path.

It was introduced in PEP 428. Most of the examples serve to illustrate the underlying philosophy, with the code left as specification.

The code is a good read for a few reasons:

  • You're likely already familiar with the subject matter; even if you haven't used pathlib before, you may have used os.path, or a similar library in some other language.

  • It is a good object-oriented solution. It uses object-oriented programming with abstract (read: invented) concepts to achieve better code structure and reuse. It's probably a much better example than the old Animal–Dog–Cat–Duck–speak().

  • It is a good comparative study subject: both pathlib and os.path solve the same problem with vastly different programming styles. Also, there was another proposal that was rejected, and there are at least five similar libraries out there; pathlib learns from all of them.
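To get a feel for the two styles, here is a small side-by-side sketch. It uses posixpath (the module os.path resolves to on POSIX systems) and PurePosixPath so the result is the same on any platform; the path itself is made up.

```python
import posixpath
from pathlib import PurePosixPath

# os.path style: paths are plain strings, passed through module-level functions.
config_str = posixpath.join("/etc", "myapp", "settings.toml")
parent_str = posixpath.dirname(config_str)
stem_str = posixpath.splitext(posixpath.basename(config_str))[0]

# pathlib style: paths are objects; "/" joins components, and the parts
# are available as attributes.
config_path = PurePosixPath("/etc") / "myapp" / "settings.toml"
parent = config_path.parent
stem = config_path.stem

print(config_str, parent_str, stem_str)
print(config_path, parent, stem)
```

Both halves compute the same things; the difference is whether the operations live on functions that take strings or on the path object itself.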

dataclasses #

The dataclasses module reduces the boilerplate of writing classes by generating special methods like __init__ and __repr__. (See this tutorial for an introduction that has more concrete examples than the official documentation.)
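As a quick illustration of the boilerplate being generated, here is a minimal sketch; the Point class and its fields are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Point:
    # Annotated class attributes become fields; __init__, __repr__ and
    # __eq__ are generated from them.
    x: float
    y: float = 0.0
    # Mutable defaults go through default_factory, so each instance
    # gets its own list.
    tags: list = field(default_factory=list)


p = Point(1.5)
print(p)  # Point(x=1.5, y=0.0, tags=[])
print(p == Point(1.5, 0.0))  # True
```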

It was introduced in PEP 557 as a simpler version of attrs. The Specification section is similar to the documentation; the good stuff is in Rationale, Discussion, and Rejected Ideas.

The code is extremely well commented; particularly interesting is this use of decision tables (ASCII version, nested if version).
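If you haven't met decision tables before, the idea can be sketched as below. This is a hypothetical example and not taken from the dataclasses source: the function names, flags, and actions are all invented to show the same decision written once as nested ifs and once as a table lookup.

```python
# Hypothetical example: decide what to do when asked to write a file,
# based on two boolean flags.

def write_action_nested(exists, overwrite):
    """The decision written as nested conditionals."""
    if not exists:
        return "create"
    if overwrite:
        return "replace"
    return "error"


# The same decision as a table: every combination of inputs is listed
# explicitly, so it is easy to check that none was forgotten.
# Key: (exists, overwrite)
_WRITE_ACTIONS = {
    (False, False): "create",
    (False, True): "create",
    (True, False): "error",
    (True, True): "replace",
}


def write_action_table(exists, overwrite):
    """The decision written as a lookup in the table above."""
    return _WRITE_ACTIONS[bool(exists), bool(overwrite)]
```

With two flags the nested version is still readable; the table starts to pay off as the number of flags grows, since every combination stays visible at a glance.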

It is also a good example of metaprogramming; Raymond Hettinger's talk "Dataclasses: The code generator to end all code generators" [1] covers this aspect in detail. If you're having trouble understanding the code, watch the talk first; I found its examination of the generated code quite helpful.

Bonus: graphlib #

graphlib was added in Python 3.9, and at the moment contains just one thing: an implementation of a topological sort algorithm (here's a refresher on what that is and how it's useful).

It doesn't come with a PEP; it does however have an issue with lots of comments from various core developers, including Raymond Hettinger and Tim Peters (of Zen of Python fame).

Since this is essentially a solved problem, the discussion focuses on the API: where to put it, what to call it, how to represent the input and output, how to make it easy to use and flexible at the same time.

One thing they're trying to do is reconcile two different use cases:

  • Here's a graph, give me all the nodes in topological order.
  • Here's a graph, give me the nodes that can be processed right now (either because they don't have dependencies, or because their dependencies have already been processed). This is useful to parallelize work, for example downloading and installing packages that depend on other packages.
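Here is a minimal sketch of how the two use cases look with the API that ended up in graphlib (Python 3.9+). The package names are made up; the graph maps each node to the set of nodes it depends on, which is the input format TopologicalSorter accepts.

```python
from graphlib import TopologicalSorter

# Made-up dependency graph: node -> set of nodes it depends on.
graph = {
    "app": {"lib", "utils"},
    "lib": {"core"},
    "utils": {"core"},
    "core": set(),
}

# Use case 1: give me all the nodes in topological order.
order = list(TopologicalSorter(graph).static_order())

# Use case 2: give me the nodes that can be processed right now.
sorter = TopologicalSorter(graph)
sorter.prepare()
batches = []
while sorter.is_active():
    ready = sorter.get_ready()   # nodes whose dependencies are all done
    batches.append(set(ready))   # a real program could process these in parallel
    sorter.done(*ready)          # marking them done unblocks their dependents

print(order)
print(batches)
```

In the second loop, "lib" and "utils" come out in the same batch, since both depend only on "core"; that is the parallelization opportunity the second use case is about.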

Unlike with PEPs, you can see the solution evolving as you read. Most proposals summarize the other choices as well, but if you don't follow the mailing list it's easy to get the impression they just appear, fully formed.

Compared to the discussion in the issue, the code itself is tiny – just under 250 lines, mostly comments and documentation.

That's it for now.

If you found this useful, please consider sharing it with others :)

  1. Recording, HTML slides, PDF slides.

This is part of a series: