Has your password been pwned? Or, how I almost failed to search a 37 GB text file in under 1 millisecond (in Python)

December 2022 ∙ 19 minute read

So, there's this website, Have I Been Pwned, where you can check if your email address has appeared in a data breach.

There's also a Pwned Passwords section for passwords ... but, typing your password on a random website probably isn't such a great idea, right?

Of course, you could read about how HIBP protects the privacy of searched passwords, and understand how k-Anonymity works, and check that the website uses the k-Anonymity API, but that's waaay too much work.

Of course, you could use the API yourself, but where's the fun in that?

Instead, we'll do it the hard way – we'll check passwords offline.

And we're not stopping until it's fast.

The Pwned Passwords list #

OK, first we need the password list.

Go to Pwned Passwords, scroll to Downloading the Pwned Passwords list, and download the SHA-1 ordered-by-hash file – be nice and use the torrent, if you can.

Note

You can also use the downloader to get an updated version of the file, but that didn't exist when I started writing this.

The archive extracts to a 37 GB text file:

$ pushd ~/Downloads
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.7z
16257755606
$ 7zz e pwned-passwords-sha1-ordered-by-hash-v8.7z
$ stat -f %z pwned-passwords-sha1-ordered-by-hash-v8.txt
37342268646
$ popd

... that looks like this:

$ head -n5 ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt
000000005AD76BD555C1D6D771DE417A4B87E4B4:10
00000000A8DAE4228F821FB418F59826079BF368:4
00000000DD7F2A1C68A35673713783CA390C9E93:873
00000001E225B908BAC31C56DB04D892E47536E0:6
00000006BAB7FC3113AA73DE3589630FC08218E7:3

Each line has the format:

<SHA-1 of the password>:<times it appeared in various breaches>

Note

For more details: why hash the passwords, and why include a count.
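
For illustration, splitting one line into its hash and count looks something like this (a toy snippet, not part of the script yet):

>>> line = b'000000005AD76BD555C1D6D771DE417A4B87E4B4:10'
>>> digest, _, count = line.partition(b':')
>>> digest, int(count)
(b'000000005AD76BD555C1D6D771DE417A4B87E4B4', 10)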

To make commands shorter, we link the file in the current directory:

$ ln -s ~/Downloads/pwned-passwords-sha1-ordered-by-hash-v8.txt pwned.txt

A minimal plausible solution #

We'll take an iterative, problem-solution approach to this. But since right now we don't have any solution, we start with the simplest thing that could possibly work.

First, we get the imports out of the way:

import os
import sys
import time
import getpass
import hashlib

Then, we open the file:

path = sys.argv[1]
file = open(path, 'rb')

Opening the file in binary mode makes searching a bit faster, as it skips needless decoding. By opening it early, we get free error checking – no point in reading the password if the file isn't there.

Next, the password:

try:
    password = sys.argv[2]
except IndexError:
    password = getpass.getpass()
hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password

print("looking for", hexdigest)

We either take the password as the second argument to the script, or read it with getpass(), so real passwords don't remain in the shell history. After computing its hash, we delete it; we don't go as far as to zero1 it, but at least we avoid accidental printing.

Let's see if it works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
$ python pwned.py pwned.txt
Password:
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8

sha1sum seems to agree:

$ echo -n password | sha1sum
5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8  -

To find the hash, we go through the file line by line – remember, simplest thing that could possibly work.

def find_line(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None

If a line was found, we print the count:

line = find_line(file, hexdigest.upper().encode())

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")

Finally, let's add some timing code:

start = time.monotonic()
line = find_line(file, hexdigest.upper().encode())
end = time.monotonic()

print(f"in {end-start:.6f} seconds")

And, it works:

$ python pwned.py pwned.txt blocking
looking for 000085013a02852372159cb94101b99ccaec59e1
pwned! seen 587 times before
in 0.002070 seconds

The code so far.

Problem: it's slow #

You may have noticed I switched from password to blocking. That's because I deliberately chose a password whose hash is at the beginning of the file.

On my 2013 laptop, password actually takes 86 seconds!

To put a lower bound on the time it takes to go through the file:

$ time python -c 'for line in open("pwned.txt", "rb"): pass'

real	1m28.325s
user	1m10.049s
sys	0m15.577s
$ time wc -l pwned.txt
 847223402 pwned.txt

real	0m51.729s
user	0m27.908s
sys	0m12.119s

There must be a better way.

Skipping #

There's a hint in the file name: the hashes are ordered.

That means we don't have to check all the lines. We can skip ahead repeatedly until we're past the hash, go back one step, and only check each line from there.

Lines are different lengths, so we can't skip exactly X lines without reading them. But we don't need to: any line that's a reasonable amount ahead will do.

def skip_to_before_line(file, prefix, offset):
    old_position = file.tell()

    while True:
        file.seek(offset, os.SEEK_CUR)

        file.readline()
        line = file.readline()
        # print("jumped to", (line or b'<eof>').decode().rstrip())

        if not line or line >= prefix:
            file.seek(old_position)
            break

        old_position = file.tell()

So we just seek() ahead a set number of bytes. Since that might not leave us at the start of a line, we discard the incomplete line, and use the next one.
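
To see why that first readline() gets thrown away, here's a tiny standalone demo on a made-up in-memory file (illustrative only, not part of the script):

>>> import io
>>> f = io.BytesIO(b'aaaa:1\nbbbb:2\ncccc:3\n')
>>> f.seek(9)          # lands in the middle of the second line
9
>>> f.readline()       # discard the partial line ...
b'bb:2\n'
>>> f.readline()       # ... and use the next complete one
b'cccc:3\n'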

Finally, we wrap the original find_line() to do the skipping:

def find_line(file, prefix):
    skip_to_before_line(file, prefix, 16 * 2**20)
    return find_line_linear(file, prefix)


def find_line_linear(lines, prefix):
    for line in lines:

It works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.027203 seconds

To find the magic 16M offset, I had to try a bunch of values:

offset (MiB)   time (s)
           1      0.05
           4      0.035
           8      0.030
          16      0.027  <-- sweet spot
          32      0.14

The code so far.

Problem: it needs tuning, it's still slow #

While three orders of magnitude faster, we still have a bunch of issues:

  1. The ideal offset depends on the computer you're running this on.
  2. The run time still increases linearly with file size – we haven't really solved the problem so much as made it smaller by a (large, but) constant factor.
  3. The run time still increases linearly with where the hash is in the file.
  4. It's still kinda slow. ¯\_(ツ)_/¯

There must be a better way.

Binary skipping #

To make the "linear" part painfully obvious, uncomment the jumped to line.

$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
 139 jumped to 0
 139 jumped to 1
 139 jumped to 2
 139 jumped to 3
 139 jumped to 4
 103 jumped to 5

Surely, after we've jumped to 0 once, we don't need to do it 138 more times, right?

We could jump directly to a line in the middle of the file; if it's there at all, the hash will be in one of the two halves. We could then jump to the middle of that half, and to the middle of that half, and so on, until we either find the hash or there's nowhere left to jump.

Note

If that sounds a lot like binary search, that's because it is – it's just not wearing its regular array clothing.

And most of the work is already done: we can jump to a line at most X bytes from where the hash should be; we only need to do it repeatedly, in smaller and smaller fractions of the file size:

def skip_to_before_line(file, prefix, offset):
    while offset > 2**8:
        offset //= 2
        skip_to_before_line_linear(file, prefix, offset)


def skip_to_before_line_linear(file, prefix, offset):
    old_position = file.tell()

The only thing left is to get the file size:

def find_line(file, prefix):
    file.seek(0, os.SEEK_END)
    size = file.tell()
    file.seek(0)
    skip_to_before_line(file, prefix, size)
    return find_line_linear(file, prefix)

It works:

$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.009559 seconds
$ python pwned.py pwned.txt password
looking for 5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8
pwned! seen 9,545,824 times before
in 0.000268 seconds

The huge time difference is due to operating system and/or disk caches – on the second run, the same parts of the file are likely already in memory.

Anyway, look again at the jumped to output: instead of jumping blindly through the whole file, now we're jumping around the hash, getting closer and closer to it.

$ python pwned.py pwned.txt password | grep -o 'jumped to .' | uniq -c
   1 jumped to 7
   1 jumped to 3
   1 jumped to 7
   1 jumped to 5
   1 jumped to 4
  39 jumped to 5
$ python pwned.py pwned.txt password | grep -o 'jumped to ..' | uniq -c
   1 jumped to 7F
   1 jumped to 3F
   1 jumped to 7F
   1 jumped to 5F
   1 jumped to 4F
   1 jumped to 5F
   1 jumped to 57
   1 jumped to 5F
   1 jumped to 5B
   1 jumped to 59
   1 jumped to 5B
   1 jumped to 5A
  32 jumped to 5B

Notice we land on the same 7F... prefix twice; this makes sense – we skip ahead by half the file size, then back, then ahead two times by a quarter. Caching allows us to not care about that.

The code so far.

Better timing #

Given the way caching muddies the waters, how fast is it really?

This function averages a hundred runs, each with a different password:

function average-many {
    for _ in {1..100}; do
        python $@ $( python -c 'import time; print(time.time())' )
    done \
    | grep seconds \
    | cut -d' ' -f2 \
    | paste -sd+ - \
    | sed 's#^#scale=6;(#' | sed 's#$#)/100#' \
    | bc
}

After a few repeats, the average time settles around 3 ms.

$ for _ in {1..20}; do average-many pwned.py pwned.txt; done
.004802
.003904
.004088
.003486
.003451
.003476
.003414
.003442
.003169
.003297
.002931
.003077
.003092
.003011
.002980
.003147
.003112
.002942
.002984
.002934

Again, this is due to caching: the more we run the script, the more likely it is that the pages at {half, quarters, eighths, sixteenths, ...} of the file size are already in memory – and the path to any line starts with a subset of those.


And, we're done.

I wave my hands, get a 2020 laptop, and a miracle happens. It's far enough into the totally unpredictable future, now, that you can search any password in under 1 millisecond. You can do anything you want.

So, there we go. Wasn't that an interesting story? That's the end of the article.

Don't look at the scroll bar. Don't worry about it.

If you came here to check your password, you can go.

Subscribe on the way out if you'd like, but take care :)

Go solve Advent of Code or something.

I'm just chilling out.

See ya.

Failing to get to under 1 millisecond #

I swear this was supposed to be the end. This really was supposed to be a short one.

Here's a quote from a friend of mine that chronologically belongs way later in the article, but makes a great summary of what follows:

And now that you have arrived at this point, spend a moment to ponder the arbitrary nature of 1 millisecond given its dependency on the current year and the choice of your particular hardware.

After that moment, continue celebrating.

Nah, fuck it, it has to take less than 1 millisecond on the old laptop.

... so yeah, here's a bunch of stuff that didn't work.


Profile before optimizing #

With the obvious improvements out of the way, it's probably a good time to stop and find out where time is being spent.

$ python -m cProfile -s cumulative pwned.py pwned.txt "$( date )"
looking for 3960626a8c59fe927d3cf2e991d67f4c505ae198
not found
in 0.004902 seconds
         1631 function calls (1614 primitive calls) in 0.010 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
        1    0.000    0.000    0.005    0.005 02-binary-search.py:8(find_line)
        1    0.000    0.000    0.005    0.005 02-binary-search.py:22(skip_to_before_line)
       28    0.000    0.000    0.005    0.000 02-binary-search.py:28(skip_to_before_line_linear)
       86    0.004    0.000    0.004    0.000 {method 'readline' of '_io.BufferedReader' objects}
      ...
       71    0.000    0.000    0.000    0.000 {method 'seek' of '_io.BufferedReader' objects}
      ...
       44    0.000    0.000    0.000    0.000 {method 'tell' of '_io.BufferedReader' objects}
      ...

From the output above, we learn that almost all the time is spent in readline(). It's implemented in C, so there's not much we can change there.

What we can change, however, is how often we call it.


Another thing of interest is how long individual readline() calls take.

In skip_to_before_line_linear():

        start = time.monotonic()
        file.readline()
        line = file.readline()
        end = time.monotonic()
        print("jumped to", (line[:5] or b'<eof>').decode().rstrip(),
              f"in {(end-start)*1000000:4.0f} us",
              f"at offset {file.tell():16,}")

The output is pretty enlightening:

$ python pwned.py pwned.txt asdf
looking for 3da541559918a808c2402bba5012f6c60b27661c
jumped to 7FF9E in   10 us at offset   18,671,134,394  <-- 1/2 file size
jumped to 3FF89 in    4 us at offset    9,335,567,234  <-- 1/4 file size
jumped to 1FFBA in    3 us at offset    4,667,783,663  <-- 1/8 file size
jumped to 3FF89 in    3 us at offset    9,335,567,322  <-- 1/4 file size
jumped to 2FFA4 in    5 us at offset    7,001,675,508
jumped to 3FF89 in    4 us at offset    9,335,567,366  <-- 1/4 file size
jumped to 37F98 in    4 us at offset    8,168,621,453
jumped to 3FF89 in    3 us at offset    9,335,567,410  <-- 1/4 file size
jumped to 3BF94 in    3 us at offset    8,752,094,477
jumped to 3FF89 in    2 us at offset    9,335,567,498  <-- 1/4 file size
jumped to 3DF8E in    3 us at offset    9,043,831,007
jumped to 3CF90 in    3 us at offset    8,897,962,782
jumped to 3DF8E in    2 us at offset    9,043,831,095
jumped to 3D790 in    3 us at offset    8,970,896,964
jumped to 3DF8E in    2 us at offset    9,043,831,139
jumped to 3DB90 in  253 us at offset    9,007,364,072
jumped to 3D990 in  206 us at offset    8,989,130,552
jumped to 3DB90 in    6 us at offset    9,007,364,160
jumped to 3DA8F in  270 us at offset    8,998,247,402  <-- page 2,196,837
jumped to 3DA0F in  189 us at offset    8,993,689,007
jumped to 3DA8F in    5 us at offset    8,998,247,446  <-- page 2,196,837
jumped to 3DA4F in  212 us at offset    8,995,968,274
jumped to 3DA8F in    5 us at offset    8,998,247,534  <-- page 2,196,837
jumped to 3DA6F in  266 us at offset    8,997,107,921
jumped to 3DA5F in  203 us at offset    8,996,538,139
jumped to 3DA57 in  195 us at offset    8,996,253,241
jumped to 3DA53 in  197 us at offset    8,996,110,772
jumped to 3DA57 in    6 us at offset    8,996,253,285
jumped to 3DA55 in  193 us at offset    8,996,182,045
jumped to 3DA54 in  178 us at offset    8,996,146,471
jumped to 3DA54 in  189 us at offset    8,996,128,666
jumped to 3DA54 in  191 us at offset    8,996,119,760  <-- page 2,196,318
jumped to 3DA54 in   32 us at offset    8,996,128,710
jumped to 3DA54 in    5 us at offset    8,996,124,259
jumped to 3DA54 in   10 us at offset    8,996,122,057  <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,955  <-- page 2,196,318
jumped to 3DA54 in    4 us at offset    8,996,120,382  <-- page 2,196,318
jumped to 3DA54 in    9 us at offset    8,996,120,112  <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,470  <-- page 2,196,318
jumped to 3DA54 in    1 us at offset    8,996,120,338  <-- page 2,196,318
pwned! seen 324,774 times before
in 0.003654 seconds

Half the reads are pretty fast:

  • In the beginning, because searches start with the same few pages.
  • At the end, because searches end on the same page.
  • All reads of any page, after the first.

So, it's those reads in the middle that we need to get rid of.

Position heuristic #

In theory, the output of a good hash function should be uniformly distributed.

This means that with a bit of math, we can estimate where a hash would be – a hash that's ~1/5 of the way into the range of all possible hashes should be at ~1/5 of the file.

Here's a tiny example:

>>> digest = '5b'  # 1-byte hash (2 hex digits)
>>> size = 1000    # 1000-byte long file
>>> int_digest = int(digest, 16)  # == 91
>>> int_end = 16 ** len(digest)   # == 0xff + 1 == 256
>>> int(size * int_digest / int_end)
355
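
Wrapped up as a function, the estimate might look like this (a sketch of the idea only; estimate_offset() is a name I made up, not from the code linked below):

def estimate_offset(hexdigest, size):
    # where a uniformly distributed hash "should" land in a size-byte file
    int_digest = int(hexdigest, 16)
    int_end = 16 ** len(hexdigest)
    return int(size * int_digest / int_end)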

We can do this once, and then binary search a safety interval around that position. Alas, this only gets rid of the fast jumps at the beginning of the binary search, and for some reason, it ends up being slightly slower than binary search alone. (code)

We can also narrow down around the estimated position iteratively, making the interval smaller by a constant factor each time. This seems to work: a factor of 1000 yields 1.7 ms, and a factor of 8000 yields 1.2 ms, both in 2 steps. (code)

However, it has deeper issues:

  • Having arbitrary start/end offsets complicates the code quite a bit.
  • I don't know how to reliably determine the factor.2
  • I don't know how to prove it's correct,3 especially for smaller intervals, where the hashes are less uniform. To be honest, I don't think it can be 100% correct, and I don't know how to estimate how correct it is.

Anyway, if the implementation is hard to explain, it's a bad idea.

Index file #

An arbitrary self-imposed restriction I had was that any solution should mostly use the original passwords list, with little to no preparation.

By relaxing this a bit, and going through the file once, we can build an index like:

<SHA-1 of the password>:<offset in file>

... that we can then search with skip_to_before_line().

Of course, we won't include all the hashes – by including lines a few kilobytes apart from each other, we can seek directly to within a few kilobytes in the big file.

The only thing left to figure out is how much "a few kilobytes" is.

After my endless harping about caching and pages, the answer should be obvious: one page size (4K). And this actually gets us 0.8 ms! But back when I wrote the code, that hadn't really sunk in, so after getting 1.2 ms with a 32K distance, I moved on.
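
If you'd rather check the page size on your machine than take 4K on faith, the standard library can tell you (a quick aside, not part of any of the scripts):

>>> import mmap
>>> mmap.PAGESIZE
4096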

Code: pwned.py, generate-index.py.

Binary file #

Already on the additional file slippery slope, I converted the list to binary, mostly to make it smaller – smaller file, fewer reads.

I packed each line into 24 bytes:

| binary hash (20 bytes) | count (4 bytes) |

This halved the file, but only lowered the runtime to a modest 2.6 ms.

More importantly, it made the code much, much simpler: because items are fixed size, you can know where the Nth item is, so I was able to use bisect for the binary search.
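
For a rough idea of what the conversion involves (a sketch based on the layout above, not the actual convert-to-binary.py), each text line becomes a fixed 24-byte record, and the Nth record starts at byte N * 24:

def pack_record(line):
    # text line -> 20-byte raw hash + 4-byte big-endian count
    digest, _, count = line.rstrip().partition(b':')
    return bytes.fromhex(digest.decode()) + int(count).to_bytes(4, 'big')


def read_record(file, n):
    # fixed-size records make random access trivial
    file.seek(n * 24)
    buffer = file.read(24)
    return buffer[:20], int.from_bytes(buffer[20:], 'big')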

Code: pwned.py, convert-to-binary.py.

Getting to under 1 millisecond #

OK, what now? This is what we have so far:

  • The position heuristic kinda (maybe?) works, but is hard to reason about.
  • The index file gets us there, but barely, and the index is pretty big.
  • The binary file isn't much faster, and it creates a huge file. But, less code!

I don't know what to do with the first one, but I think we can combine the last two.

Generating the index #

Let's start with the script I made for the text index:

import os, sys

file = sys.stdin.buffer
outf = sys.stdout.buffer

while True:
    pos = file.tell()
    line = file.readline()
    if not line:
        break

    outf.write(line.partition(b':')[0] + b':' + str(pos).encode() + b'\n')

    file.seek(2**12, os.SEEK_CUR)
    file.readline()

The output looks like this:

$ python generate-index.py < pwned.txt 2>/dev/null | head -n5
000000005AD76BD555C1D6D771DE417A4B87E4B4:0
00000099A4D3034E14DF60EF50799F695C27C0EC:4157
00000172E8E1D1BD54AC23B3F9AB4383F291CA17:8312
000002C8F808A7DB504BBC3C711BE8A8D508C0F9:12453
0000047139578F13D70DD96BADD425C372DB64A9:16637

We need to pack that into bytes.

A hash takes 20 bytes. But, we only need slightly more than 3 bytes (6 hex digits) to distinguish between the index lines:

$ python generate-index.py < pwned.txt 2>/dev/null | cut -c-6 | uniq -c | head
   2 000000
   1 000001
   1 000002
   1 000004
   1 000005
   1 000007
   1 000008
   1 00000A
   1 00000B
   1 00000C

To represent all the offsets in the file, we need log2(35G) / 8 = 4.39... bytes, which results in a total of 9 bytes (maybe even 8, if we mess with individual bits).

Let's make it future-proof: 6 bytes for the hash buys at least 55 trillion lines (2.4 petabyte files), and 6 bytes for the offset buys 0.28 petabyte files.

    digest, _, _ = line.partition(b':')
    outf.write(bytes.fromhex(digest.decode())[:6])
    outf.write(pos.to_bytes(6, 'big'))

If you look at the text index, you'll notice the offsets are 4K + ~50 bytes apart; this results in sometimes having to read 2 pages, because not all pages have an index entry. Let's fix that by reading the first whole line after a 4K boundary instead:

    file.seek((pos // 4096 + 1) * 4096)
    file.readline()

OK, we're done:

$ time python generate-index.py < pwned.txt > index.bin

real	1m2.729s
user	0m34.292s
sys	0m21.392s

Using the index #

We start with a skeleton that's functionally identical to the naive script. The only difference is that I've added stubs for passing and using the index:

def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)


def skip_to_before_line_index(file, prefix, index):
    ...


index_path = sys.argv[2]
index = ...


line = find_line(file, hexdigest.upper().encode(), index)

The whole skeleton, if you want to see it:

import os
import sys
import time
import bisect
import getpass
import hashlib


def find_line(file, prefix, index):
    skip_to_before_line_index(file, prefix, index)
    return find_line_linear(file, prefix)


def find_line_linear(lines, prefix):
    for line in lines:
        if line.startswith(prefix):
            return line
        if line > prefix:
            break
    return None


def skip_to_before_line_index(file, prefix, index):
    ...


path = sys.argv[1]
file = open(path, 'rb')

index_path = sys.argv[2]
index = ...

try:
    password = sys.argv[3]
except IndexError:
    password = getpass.getpass()
hexdigest = hashlib.sha1(password.encode()).hexdigest()
del password

print("looking for", hexdigest)

start = time.monotonic()
line = find_line(file, hexdigest.upper().encode(), index)
end = time.monotonic()

if line:
    times = int(line.decode().rstrip().partition(':')[2])
    print(f"pwned! seen {times:,} times before")
else:
    print("not found")

print(f"in {end-start:.6f} seconds")

As mentioned before, we'll use the standard library bisect module to search the index.

We could read the entire index into memory, as a list of 12-byte bytes objects. But that would still be slow, even if outside the current timing code, and memory usage would increase with the size of the file.

Fortunately, bisect doesn't only work with lists; it works with any sequence – that is, any object that can pretend to be a list. So we'll build our own, by implementing the two required4 special methods.

class BytesArray:

    item_size = 12

    def __init__(self, file):
        self.file = file

We can go ahead and plug it in:

index_path = sys.argv[2]
index = BytesArray(open(index_path, 'rb'))

The first special method is __getitem__(), for a[i]:

    def __getitem__(self, index):
        self.file.seek(index * self.item_size)
        buffer = self.file.read(self.item_size)
        if len(buffer) != self.item_size:
            raise IndexError(index)  # out of bounds
        return buffer

The second special method is __len__(), for len(a):

    def __len__(self):
        self.file.seek(0, os.SEEK_END)
        size = self.file.tell()
        return size // self.item_size

Using the index becomes straightforward:

def skip_to_before_line_index(file, prefix, index):
    item_prefix = bytes.fromhex(prefix.decode())[:6]
    item = find_lt(index, item_prefix)
    offset = int.from_bytes(item[6:], 'big') if item else 0
    file.seek(offset)


def find_lt(a, x):
    i = bisect.bisect_left(a, x)
    if i:
        return a[i-1]
    return None

We get the first 6 bytes of the hash, find the rightmost value less than that, extract the offset from it, and seek to there. find_lt() comes from bisect's recipes for searching sorted lists.

And we're done:

$ average-many pwned.py pwned.txt index.bin
.002546

Huh? ... that's unexpected...

Oh.

I said we won't read the index into memory. But we can force it into the cache by reading it a bunch of times:

$ for _ in {1..10}; do cat index.bin > /dev/null; done

Finally:

$ average-many pwned.py pwned.txt index.bin
.000421

Code: pwned.py, generate-index.py.

I heard you like indexes (the end) #

Hmmm... isn't that cold start bugging you? If we make an index for the big index, we get 1.2 ms from a cold start. Maybe another smaller index can take us to below 1 ms?

...

Just kidding, this is it, this really is the end.

And now, let's take that moment to ponder:

method          statements   time (ms, order of magnitude)
linear                  29   100,000
linear+skip             42   100
binary search           49   10
binary index       59 (72)   1

For twice the code, it's 5 orders of magnitude faster! I'm deliberately not counting bisect or the OS cache here, because that's the point: they're basically free.

Turns out, you can get pretty far with just a few tricks.


That's it for now.

Learned something new today? Share this with others, it really helps!


Bonus: better data structures #

As always, a specialized data structure can solve a problem better.

In Sketchy Pwned Passwords, Scott Helme manages to "pack" the entire passwords list into a 1.5G in-memory count-min sketch, which can then be queried in 1 microsecond. And, if you don't care about the counts, a plain Bloom filter can even take you to 0.1 µs! (There is a trade-off, and it's that it takes 11 hours to create either.)
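
For a taste of how a Bloom filter answers "have I seen this before?", here's a toy sketch (nowhere near Scott's tuned setup): hash each item a few times, set the corresponding bits in a bit array, and at query time, if any of those bits is unset, the item was definitely never added; otherwise, it was probably added.

import hashlib


class BloomFilter:

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # derive num_hashes bit positions from SHA-1(item + counter)
        for i in range(self.num_hashes):
            digest = hashlib.sha1(item + i.to_bytes(1, 'big')).digest()
            yield int.from_bytes(digest, 'big') % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # False means "definitely not added", True means "probably added"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

Used like this:

>>> bloom = BloomFilter(10_000_000, 7)
>>> bloom.add(b'password')
>>> b'password' in bloom
True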

  1. It's kinda difficult to do in Python anyway. [return]

  2. Something like (size / 4096) ** (1 / int(log(size, 4096))), maybe? [return]

  3. I mean, I did cross-check it with other solutions for a few thousand values, and it seems correct, but that's not proof. [return]

  4. We're only implementing the parts of the protocol that bisect uses.

    For the full protocol, __getitem__() would need to implement negative indexes and slicing. For more free sequence methods, inherit from collections.abc.Sequence.

    Interestingly, the class will work in a for loop without an __iter__() method. That's because there are actually two iteration protocols: an older one, using __getitem__(), and a newer one, added in Python 2.1, using __iter__() and __next__(). For details on the logic, see Unravelling for statements. [return]