ProcessThreadPoolExecutor: when I‍/‍O becomes CPU-bound

So, you're doing some I‍/‍O bound stuff, in parallel.

Maybe you're scraping some websites – a lot of websites.

Maybe you're updating or deleting millions of DynamoDB items.

You've got your ThreadPoolExecutor, you've increased the number of threads and tuned connection limits... but after some point, it's just not getting any faster. You look at your Python process, and you see CPU utilization hovers above 100%.

You could split the work into batches and have a ProcessPoolExecutor run your original code in separate processes. But that requires yet more code, and a bunch of changes, which is no fun. And maybe your input is not that easy to split into batches.

If only we had an executor that worked seamlessly across processes and threads.

Well, you're in luck, since that's exactly what we're building today!

And even better, in a couple years you won't even need it anymore.

Establishing a baseline
Why not both?
Bonus: free threading

Establishing a baseline #

To measure things, we'll use a mock that pretends to do mostly I‍/‍O, with a sprinkling of CPU-bound work thrown in – a stand-in for something like a database connection, a Requests session, or a DynamoDB client.

class Client:
    io_time = 0.02
    cpu_time = 0.0008

    def method(self, arg):
        # simulate I/O
        time.sleep(self.io_time)

        # simulate CPU-bound work
        start = time.perf_counter()
        while time.perf_counter() - start < self.cpu_time:
            for i in range(100): i ** i

        return arg

We sleep() for the I‍/‍O, and do some math in a loop for the CPU stuff; it doesn't matter exactly how long each takes, as long I‍/‍O time dominates.

Real multi-threaded clients are usually backed by a shared connection pool, which allows for connection reuse (so you don't pay the cost of a new connection on each request) and multiplexing (so you can use the same connection for multiple concurrent requests, possible with protocols like HTTP/2 or newer). We could simulate this with a semaphore, but limiting connections is not relevant here – we're assuming the connection pool is effectively unbounded.

Since we'll use our client from multiple processes, we write an initializer function to set up a global, per-process client instance (remember, we want to share potential connection pools between threads); we can then pass the initializer to the executor constructor, along with any arguments we want to pass to the client. Similarly, we do the work through a function that uses this global client.

# this code runs in each worker process

client = None

def init_client(*args):
    global client
    client = Client(*args)

def do_stuff(*args):
    return client.method(*args)

Finally, we make a simple timing context manager:

@contextmanager
def timer():
    start = time.perf_counter()
    yield
    end = time.perf_counter()
    print(f"elapsed: {end-start:1.3f}")

...and put everything together in a function that measures how long it takes to do a bunch of work using a concurrent.futures executor:

def benchmark(executor, n=10_000, timer=timer, chunksize=10):
    with executor:
        # make sure all the workers are started,
        # so we don't measure their startup time
        list(executor.map(time.sleep, [0] * 200))

        with timer():
            values = list(executor.map(do_stuff, range(n), chunksize=chunksize))

        assert values == list(range(n)), values

Threads #

So, a ThreadPoolExecutor should suffice here, since we're mostly doing I‍/‍O, right?

>>> from concurrent.futures import *
>>> from bench import *
>>> init_client()
>>> benchmark(ThreadPoolExecutor(10))
elapsed: 24.693

More threads!

>>> benchmark(ThreadPoolExecutor(20))
elapsed: 12.405

Twice the threads, twice as fast. More!

>>> benchmark(ThreadPoolExecutor(30))
elapsed: 8.718

Good, it's still scaling linearly. MORE!

>>> benchmark(ThreadPoolExecutor(40))
elapsed: 8.638

confused cat with question marks around its head

...more?

>>> benchmark(ThreadPoolExecutor(50))
elapsed: 8.458
>>> benchmark(ThreadPoolExecutor(60))
elapsed: 8.430
>>> benchmark(ThreadPoolExecutor(70))
elapsed: 8.428

squinting confused cat

Problem: CPU becomes a bottleneck #

It's time we take a closer look at what our process is doing. I'd normally use the top command for this, but since the flags and output vary with the operating system, we'll implement our own using the excellent psutil library.

@contextmanager
def top():
    """Print information about current and child processes.

    RES is the resident set size. USS is the unique set size.
    %CPU is the CPU utilization. nTH is the number of threads.

    """
    process = psutil.Process()
    processes = [process] + process.children(True)
    for p in processes: p.cpu_percent()

    yield

    print(f"{'PID':>7} {'RES':>7} {'USS':>7} {'%CPU':>7} {'nTH':>7}")
    for p in processes:
        try:
            m = p.memory_full_info()
        except psutil.AccessDenied:
            m = p.memory_info()
        rss = m.rss / 2**20
        uss = getattr(m, 'uss', 0) / 2**20
        cpu = p.cpu_percent()
        nth = p.num_threads()
        print(f"{p.pid:>7} {rss:6.1f}m {uss:6.1f}m {cpu:7.1f} {nth:>7}")

And because it's a context manager, we can use it as a timer:

>>> init_client()
>>> benchmark(ThreadPoolExecutor(10), timer=top)
    PID     RES     USS    %CPU     nTH
  51395   35.2m   28.5m    38.7      11

So, what happens if we increase the number of threads?

>>> benchmark(ThreadPoolExecutor(20), timer=top)
    PID     RES     USS    %CPU     nTH
  13912   16.8m   13.2m    70.7      21
>>> benchmark(ThreadPoolExecutor(30), timer=top)
    PID     RES     USS    %CPU     nTH
  13912   17.0m   13.4m    99.1      31
>>> benchmark(ThreadPoolExecutor(40), timer=top)
    PID     RES     USS    %CPU     nTH
  13912   17.3m   13.7m   100.9      41

With more threads, the compute part of our I‍/‍O bound workload increases, eventually becoming high enough to saturate one CPU – and due to the global interpreter lock, one CPU is all we can use, regardless of the number of threads.¹

Processes? #

I know, let's use a ProcessPoolExecutor instead!

>>> benchmark(ProcessPoolExecutor(20, initializer=init_client))
elapsed: 12.374
>>> benchmark(ProcessPoolExecutor(30, initializer=init_client))
elapsed: 8.330
>>> benchmark(ProcessPoolExecutor(40, initializer=init_client))
elapsed: 6.273

Hmmm... I guess it is a little bit better.

More? More!

>>> benchmark(ProcessPoolExecutor(60, initializer=init_client))
elapsed: 4.751
>>> benchmark(ProcessPoolExecutor(80, initializer=init_client))
elapsed: 3.785
>>> benchmark(ProcessPoolExecutor(100, initializer=init_client))
elapsed: 3.824

OK, it's better, but with diminishing returns – there's no improvement after 80 processes, and even then, it's only 2.2x faster than the best time with threads, when, in theory, it should be able to make full use of all 4 CPUs.

Also, we're not making best use of connection pools (since we now have 80 of them, one per process), nor multiplexing (since we now have 80 connections, one per pool).

Problem: more processes, more memory #

But it gets worse!

>>> benchmark(ProcessPoolExecutor(80, initializer=init_client), timer=top)
    PID     RES     USS    %CPU     nTH
   2479   21.2m   15.4m    15.0       3
   2480   11.2m    6.3m     0.0       1
   2481   13.8m    8.5m     3.4       1
  ... 78 more lines ...
   2560   13.8m    8.5m     4.4       1

13.8 MiB * 80 ~= 1 GiB ... that is a lot of memory.

Now, there's some nuance to be had here.

First, on most operating systems that have virtual memory, code segment pages are shared between processes – there's no point in having 80 copies of libc or the Python interpreter in memory.

The unique set size is probably a better measurement than the resident set size, since it excludes memory shared between processes.² So, for the macOS output above,³ the actual usage is more like 8.5 MiB * 80 = 680 MiB.

Second, if you use the fork or forkserver start methods, processes also share memory allocated before the fork() via copy-on-write; for Python, this includes module code and variables. On Linux, the actual usage is 1.7 MiB * 80 = 136 MiB:

>>> benchmark(ProcessPoolExecutor(80, initializer=init_client), timer=top)
    PID     RES     USS    %CPU     nTH
 329801   17.0m    6.6m     5.1       3
 329802   13.3m    1.6m     2.1       1
  ... 78 more lines ...
 329881   13.3m    1.7m     2.0       1

However, it's important to note that's just a lower bound; memory allocated after fork() is not shared, and most real work will unavoidably allocate more memory.

Liking this so far? Here's another article you might like:

When to use classes in Python? When your functions take the same arguments

Why not both? #

One reasonable way of dealing with this would be to split the input into batches, one per CPU, and pass them to a ProcessPoolExecutor, which in turn runs the batch items using a ThreadPoolExecutor.⁴

But that would mean we need to change our code, and that's no fun.

If only we had an executor that worked seamlessly across processes and threads.

A minimal plausible solution #

In keeping with what has become tradition by now, we'll take an iterative, problem-solution approach; since we're not sure what to do yet, we start with the simplest thing that could possibly work.

We know we want a process pool executor that starts one thread pool executor per process, so let's deal with that first.

class ProcessThreadPoolExecutor(concurrent.futures.ProcessPoolExecutor):

    def __init__(self, max_threads=None, initializer=None, initargs=()):
        super().__init__(
            initializer=_init_process,
            initargs=(max_threads, initializer, initargs)
        )

By subclassing ProcessPoolExecutor, we get the map() implementation for free, since the original is implemented in terms of submit().⁵ By going with the default max_workers, we get one process per CPU (which is what we want); we can add more arguments later if needed.

In a custom process initializer, we set up a global thread pool executor,⁶ and then call the process initializer provided by the user:

# this code runs in each worker process

_executor = None

def _init_process(max_threads, initializer, initargs):
    global _executor

    _executor = concurrent.futures.ThreadPoolExecutor(max_threads)

    if initializer:
        initializer(*initargs)

Likewise, submit() passes the work along to the thread pool executor:

class ProcessThreadPoolExecutor(concurrent.futures.ProcessPoolExecutor):
    # ...
    def submit(self, fn, *args, **kwargs):
        return super().submit(_submit, fn, *args, **kwargs)

# this code runs in each worker process
# ...
def _submit(fn, *args, **kwargs):
    return _executor.submit(fn, *args, **kwargs).result()

OK, that looks good enough; let's use it and see if it works:

def _do_stuff(n):
    print(f"doing: {n}")
    return n ** 2

if __name__ == '__main__':
    with ProcessThreadPoolExecutor() as e:
        print(list(e.map(_do_stuff, [0, 1, 2])))

 $ python ptpe.py
doing: 0
doing: 1
doing: 2
[0, 1, 4]

Wait, we got it on the first try?!

Let's measure that:

>>> from bench import *
>>> from ptpe import *
>>> benchmark(ProcessThreadPoolExecutor(30, initializer=init_client), n=1000)
elapsed: 6.161

Hmmm... that's unexpectedly slow... almost as if:

>>> multiprocessing.cpu_count()
4
>>> benchmark(ProcessPoolExecutor(4, initializer=init_client), n=1000)
elapsed: 6.067

Ah, because _submit() waits for the result() in the main thread of the worker process, this is just a ProcessPoolExecutor with extra steps.

But what if we send back the future object instead?

    def submit(self, fn, *args, **kwargs):
        return super().submit(_submit, fn, *args, **kwargs).result()

def _submit(fn, *args, **kwargs):
    return _executor.submit(fn, *args, **kwargs)

Alas:

$ python ptpe.py
doing: 0
doing: 1
doing: 2
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "concurrent/futures/process.py", line 210, in _sendback_result
    result_queue.put(_ResultItem(work_id, result=result,
  File "multiprocessing/queues.py", line 391, in put
    obj = _ForkingPickler.dumps(obj)
  File "multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
TypeError: cannot pickle '_thread.RLock' object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "ptpe.py", line 42, in <module>
    print(list(e.map(_do_stuff, [0, 1, 2])))
  ...
TypeError: cannot pickle '_thread.RLock' object

The immediate cause of the error is that the future has a condition that has a lock that can't be pickled, because threading locks only make sense within the same process.

The deeper cause is that the future is not just data, but encapsulates state owned by the thread pool executor, and sharing state between processes requires extra work.

It may not seem like it, but this is a partial success: the work happens, we just can't get the results back. Not surprising, to be honest, it couldn't have been that easy.

Getting results #

If you look carefully at the traceback, you'll find a hint of how ProcessPoolExecutor gets its own results back from workers – a queue; the module docstring even has a neat data-flow diagram:

|======================= In-process =====================|== Out-of-process ==|

+----------+     +----------+       +--------+     +-----------+    +---------+
|          |  => | Work Ids |       |        |     | Call Q    |    | Process |
|          |     +----------+       |        |     +-----------+    |  Pool   |
|          |     | ...      |       |        |     | ...       |    +---------+
|          |     | 6        |    => |        |  => | 5, call() | => |         |
|          |     | 7        |       |        |     | ...       |    |         |
| Process  |     | ...      |       | Local  |     +-----------+    | Process |
|  Pool    |     +----------+       | Worker |                      |  #1..n  |
| Executor |                        | Thread |                      |         |
|          |     +----------- +     |        |     +-----------+    |         |
|          | <=> | Work Items | <=> |        | <=  | Result Q  | <= |         |
|          |     +------------+     |        |     +-----------+    |         |
|          |     | 6: call()  |     |        |     | ...       |    |         |
|          |     |    future  |     |        |     | 4, result |    |         |
|          |     | ...        |     |        |     | 3, except |    |         |
+----------+     +------------+     +--------+     +-----------+    +---------+

Now, we could probably use the same queue somehow, but it would involve touching a lot of (private) internals.⁷ Instead, let's use a separate queue:

    def __init__(self, max_threads=None, initializer=None, initargs=()):
        self.__result_queue = multiprocessing.Queue()
        super().__init__(
            initializer=_init_process,
            initargs=(self.__result_queue, max_threads, initializer, initargs)
        )

On the worker side, we make it globally accessible:

# this code runs in each worker process

_executor = None
_result_queue = None

def _init_process(queue, max_threads, initializer, initargs):
    global _executor, _result_queue

    _executor = concurrent.futures.ThreadPoolExecutor(max_threads)
    _result_queue = queue

    if initializer:
        initializer(*initargs)

...so we can use it from a task callback registered by _submit():

def _submit(fn, *args, **kwargs):
    task = _executor.submit(fn, *args, **kwargs)
    task.add_done_callback(_put_result)

def _put_result(task):
    if exception := task.exception():
        _result_queue.put((False, exception))
    else:
        _result_queue.put((True, task.result()))

Back in the main process, we handle the results in a thread:

    def __init__(self, max_threads=None, initializer=None, initargs=()):
        # ...
        self.__result_handler = threading.Thread(target=self.__handle_results)
        self.__result_handler.start()

    def __handle_results(self):
        for ok, result in iter(self.__result_queue.get, None):
            print(f"{'ok' if ok else 'error'}: {result}")

Finally, to stop the handler, we use None as a sentinel on executor shutdown:

    def shutdown(self, wait=True):
        super().shutdown(wait=wait)
        if self.__result_queue:
            self.__result_queue.put(None)
            if wait:
                self.__result_handler.join()
            self.__result_queue.close()
            self.__result_queue = None

Let's see if it works:

$ python ptpe.py
doing: 0
ok: [0]
doing: 1
ok: [1]
doing: 2
ok: [4]
Traceback (most recent call last):
  File "concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
AttributeError: 'NoneType' object has no attribute 'result'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'cancel'

Yay, the results are making it to the handler!

The error happens because instead of returning a Future, our submit() returns the result of _submit(), which is always None.

Fine, we'll make our own futures #

But submit() must return a future, so we make our own:

    def __init__(self, max_threads=None, initializer=None, initargs=()):
        # ...
        self.__tasks = {}
        # ...

    def submit(self, fn, *args, **kwargs):
        outer = concurrent.futures.Future()
        task_id = id(outer)
        self.__tasks[task_id] = outer

        outer.set_running_or_notify_cancel()
        inner = super().submit(_submit, task_id, fn, *args, **kwargs)

        return outer

In order to map results to their futures, we can use a unique identifier; the id() of the outer future should do, since it is unique for the object's lifetime.

We pass the id to _submit(), then to _put_result() as an attribute on the future, and finally back in the queue with the result:

def _submit(task_id, fn, *args, **kwargs):
    task = _executor.submit(fn, *args, **kwargs)
    task.task_id = task_id
    task.add_done_callback(_put_result)

def _put_result(task):
    if exception := task.exception():
        _result_queue.put((task.task_id, False, exception))
    else:
        _result_queue.put((task.task_id, True, task.result()))

Back in the result handler, we find the maching future, and set the result accordingly:

    def __handle_results(self):
        for task_id, ok, result in iter(self.__result_queue.get, None):
            outer = self.__tasks.pop(task_id)
            if ok:
                outer.set_result(result)
            else:
                outer.set_exception(result)

And it works:

$ python ptpe.py
doing: 0
doing: 1
doing: 2
[0, 1, 4]

I mean, it really works:

>>> benchmark(ProcessThreadPoolExecutor(10, initializer=init_client))
elapsed: 6.220
>>> benchmark(ProcessThreadPoolExecutor(20, initializer=init_client))
elapsed: 3.397
>>> benchmark(ProcessThreadPoolExecutor(30, initializer=init_client))
elapsed: 2.575
>>> benchmark(ProcessThreadPoolExecutor(40, initializer=init_client))
elapsed: 2.664

3.3x is not quite the 4 CPUs my laptop has, but it's pretty close, and much better than the 2.2x we got from processes alone.

The executor code so far.

Death becomes a problem #

I wonder what happens when a worker process dies.

For example, the initializer can fail:

>>> executor = ProcessPoolExecutor(initializer=divmod, initargs=(0, 0))
>>> executor.submit(int).result()
Exception in initializer:
Traceback (most recent call last):
  ...
ZeroDivisionError: integer division or modulo by zero
Traceback (most recent call last):
  ...
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

...or a worker can die some time later, which we can help along with a custom timer:⁸

@contextmanager
def terminate_child(interval=1):
    threading.Timer(interval, psutil.Process().children()[-1].terminate).start()
    yield

>>> executor = ProcessPoolExecutor(initializer=init_client)
>>> benchmark(executor, timer=terminate_child)
[ one second later ]
Traceback (most recent call last):
  ...
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Now let's see our executor:

>>> executor = ProcessThreadPoolExecutor(30, initializer=init_client)
>>> benchmark(executor, timer=terminate_child)
[ one second later ]
[ ... ]
[ still waiting ]
[ ... ]
[ hello? ]

If the dead worker is not around to send back results, its futures never get completed, and map() keeps waiting until the end of time, when the expected behavior is to detect when this happens, and fail all pending tasks with BrokenProcessPool.

Before we do that, though, let's address a more specific issue.

If map() hasn't finished submitting tasks when the worker dies, inner fails with BrokenProcessPool, which right now we're ignoring entirely. While we don't need to do anything about it in particular because it gets covered by handling the general case, we should still propagate all errors to the outer task anyway.

    def submit(self, fn, *args, **kwargs):
        # ...
        inner = super().submit(_submit, task_id, fn, *args, **kwargs)
        inner.task_id = task_id
        inner.add_done_callback(self.__handle_inner)

        return outer

    def __handle_inner(self, inner):
        task_id = inner.task_id
        if exception := inner.exception():
            if outer := self.__tasks.pop(task_id, None):
                outer.set_exception(exception)

This fixes the case where a worker dies almost instantly:

>>> executor = ProcessThreadPoolExecutor(30, initializer=init_client)
>>> benchmark(executor, timer=lambda: terminate_child(0))
Traceback (most recent call last):
  ...
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

For the general case, we need to check if the executor is broken – but how? We've already decided we don't want to depend on internals, so we can't use ProcessPoolExecutor._broken. Maybe we can submit a dummy task and see if it fails instead:

    def __check_broken(self):
        try:
            super().submit(int).cancel()
        except concurrent.futures.BrokenExecutor as e:
            return type(e)(str(e))
        except RuntimeError as e:
            if 'shutdown' not in str(e):
                raise
        return None

Using it is a bit involved, but not completely awful:

    def __handle_results(self):
        last_broken_check = time.monotonic()

        while True:
            now = time.monotonic()
            if now - last_broken_check >= .1:
                if exc := self.__check_broken():
                    break
                last_broken_check = now

            try:
                value = self.__result_queue.get(timeout=.1)
            except queue.Empty:
                continue

            if not value:
                return

            task_id, ok, result = value
            if outer := self.__tasks.pop(task_id, None):
                if ok:
                    outer.set_result(result)
                else:
                    outer.set_exception(result)

        while self.__tasks:
            try:
                _, outer = self.__tasks.popitem()
            except KeyError:
                break
            outer.set_exception(exc)

When there's a steady stream of results coming in, we don't want to check too often, so we enforce a minimum delay between checks. When there are no results coming in, we want to check regularly, so we use the Queue.get() timeout to avoid waiting forever. If the check fails, we break out of the loop and fail the pending tasks. Like so:

>>> executor = ProcessThreadPoolExecutor(30, initializer=init_client)
>>> benchmark(executor, timer=terminate_child)
Traceback (most recent call last):
  ...
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore

cool smoking cat wearing denim jacket and sunglasses

So, yeah, I think we're done. Here's the final executor and benchmark code.

Some features left as an exercise for the reader:

providing a ThreadPoolExecutor initializer
using other start methods
shutdown()'s cancel_futures

Learned something new today? Share this with others, it really helps! PyCoder's Weekly HN Reddit linkedin Twitter

Bonus: free threading #

You may have heard people being excited about the experimental free threading support added in Python 3.13, which allows running Python code on multiple CPUs.

And for good reason:

$ python3.13t
Python 3.13.2 experimental free-threading build
>>> from concurrent.futures import *
>>> from bench import *
>>> init_client()
>>> benchmark(ThreadPoolExecutor(30))
elapsed: 8.224
>>> benchmark(ThreadPoolExecutor(40))
elapsed: 6.193
>>> benchmark(ThreadPoolExecutor(120))
elapsed: 2.323

3.6x over to the GIL version, with none of the shenanigans in this article!

Alas, packages with extensions need to be updated to support it:

>>> import psutil
zsh: segmentation fault  python3.13t

...but the ecosystem is slowly catching up.

cat patiently waiting on balcony

If you've made it this far, you might like:

This is not interview advice: a priority-expiry LRU cache in Python without heaps or trees

At least, all we can use for pure-Python code. I‍/‍O always releases the global interpreter lock, and so do some extension modules. ^[return]
The psutil documentation for memory_full_info() explains the difference quite nicely and links to further resources, because good libraries educate. ^[return]
You may have to run Python as root to get the USS of child processes. ^[return]
And no, asyncio is not a solution, since the event loop runs in a single thread, so you'd still need to run one event loop per CPU in dedicated processes. ^[return]
We could have used composition instead, but then we'd have to implement the full Executor interface, defining each method explicitly to delegate to the inner process pool executor, and keep things up to date when the interface gets new methods (and we'd have no way to trick the inner executor's map() to use our submit(), so we'd have to implement it from scratch).

Yet another option would be to use both inheritance and composition – inherit the Executor base class directly for the common methods (assuming they're defined there and not in subclasses), and delegate to the inner executor only where needed (likely just map() and shutdown()). But, the only difference from the current code would be that it'd say self._inner instead of super() in a few places, so it's not really worth it, in my opinion. ^[return]
A previous version of this code attempted to shutdown() the thread pool executor using atexit, but since atexit functions run after non-daemon threads finish, it wasn't actually doing anything. Not shutting it down seems to work for now, but we may still need do it to support shutdown(cancel_futures=True) properly. ^[return]
Check out nilp0inter/threadedprocess for an idea of what that looks like. ^[return]
pkill -fn '[Pp]ython' would've done it too, but it gets tedious if you do it a lot, and it's a different command on Windows. ^[return]