The unreasonable effectiveness of f-strings and re.VERBOSE

May 2022 ∙ seven minute read

... in which we look at one or two ways to make life easier when working with Python regular expressions.

tl;dr: You can compose verbose regular expressions using f-strings.

Here's a real-world example – instead of this:

pattern = r"((?:\(\s*)?[A-Z]*H\d+[a-z]*(?:\s*\+\s*[A-Z]*H\d+[a-z]*)*(?:\s*[\):+])?)(.*?)(?=(?:\(\s*)?[A-Z]*H\d+[a-z]*(?:\s*\+\s*[A-Z]*H\d+[a-z]*)*(?:\s*[\):+])?(?![^\w\s])|$)"

... do this:

code = r"""
[A-Z]*H  # prefix
\d+      # digits
[a-z]*   # suffix
"""

multicode = fr"""
(?: \( \s* )?               # maybe open paren and maybe space
{code}                      # one code
(?: \s* \+ \s* {code} )*    # maybe followed by other codes, plus-separated
(?: \s* [\):+] )?           # maybe space and maybe close paren or colon or plus
"""

pattern = fr"""
( {multicode} )             # code (capture)
( .*? )                     # message (capture): everything ...
(?=                         # ... up to (but excluding) ...
    {multicode}             # ... the next code
        (?! [^\w\s] )       # (but not when followed by punctuation)
    | $                     # ... or the end
)
"""

For comparison, the same pattern without f-strings:

pattern = r"""
(                       # code (capture)
    # BEGIN multicode

    (?: \( \s* )?       # maybe open paren and maybe space

    # code
    [A-Z]*H  # prefix
    \d+      # digits
    [a-z]*   # suffix

    (?:                 # maybe followed by other codes,
        \s* \+ \s*      # ... plus-separated

        # code
        [A-Z]*H  # prefix
        \d+      # digits
        [a-z]*   # suffix
    )*

    (?: \s* [\):+] )?   # maybe space and maybe close paren or colon or plus

    # END multicode
)

( .*? )                 # message (capture): everything ...

(?=                     # ... up to (but excluding) ...
    # ... the next code

    # BEGIN multicode

    (?: \( \s* )?       # maybe open paren and maybe space

    # code
    [A-Z]*H  # prefix
    \d+      # digits
    [a-z]*   # suffix

    (?:                 # maybe followed by other codes,
        \s* \+ \s*      # ... plus-separated

        # code
        [A-Z]*H  # prefix
        \d+      # digits
        [a-z]*   # suffix
    )*

    (?: \s* [\):+] )?   # maybe space and maybe close paren or colon or plus

    # END multicode

        # (but not when followed by punctuation)
        (?! [^\w\s] )

    # ... or the end
    | $
)

"""

It's better than the non-verbose one, but even with careful formatting and comments, the repetition makes it pretty hard to follow – and wait until you have to change something!
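
To see the composed pattern in action, here is a minimal sketch that runs it over a sample string. The input text is made up for illustration, and the pattern must be used with re.VERBOSE, since it relies on whitespace and comments being ignored:

```python
import re

code = r"""
[A-Z]*H  # prefix
\d+      # digits
[a-z]*   # suffix
"""

multicode = fr"""
(?: \( \s* )?               # maybe open paren and maybe space
{code}                      # one code
(?: \s* \+ \s* {code} )*    # maybe followed by other codes, plus-separated
(?: \s* [\):+] )?           # maybe space and maybe close paren or colon or plus
"""

pattern = fr"""
( {multicode} )             # code (capture)
( .*? )                     # message (capture): everything ...
(?=                         # ... up to (but excluding) ...
    {multicode}             # ... the next code
        (?! [^\w\s] )       # (but not when followed by punctuation)
    | $                     # ... or the end
)
"""

# Hypothetical input, made up for this sketch.
text = "H1: first message H2: second message"

# One (code, message) tuple per match, because the pattern has two groups.
pairs = re.findall(pattern, text, re.VERBOSE)
print(pairs)
# [('H1:', ' first message '), ('H2:', ' second message')]
```

Note that the messages keep their surrounding whitespace; stripping it is left to the caller.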

Read on for details and some caveats.

Prerequisites

Formatted string literals (f-strings) were added in Python 3.6 [1], and provide a way to embed expressions inside string literals, using a syntax similar to that of str.format():

>>> name = "world"
>>>
>>> "Hello, {name}!".format(name=name)
'Hello, world!'
>>>
>>> f"Hello, {name}!"
'Hello, world!'

Verbose regular expressions (re.VERBOSE) have been around since forever [2], and allow writing regular expressions with non-significant whitespace and comments:

>>> import re
>>>
>>> text = "H1 code (AH2b+EUH3) fancy code"
>>>
>>> code = r"[A-Z]*H\d+[a-z]*"
>>> re.findall(code, text)
['H1', 'AH2b', 'EUH3']
>>>
>>> code = r"""
... [A-Z]*H  # prefix
... \d+      # digits
... [a-z]*   # suffix
... """
>>> re.findall(code, text, re.VERBOSE)
['H1', 'AH2b', 'EUH3']

The "one weird trick"

Once you see it, it's obvious – you can use f-strings to compose regular expressions:

>>> multicode = fr"""
... (?: \( )?         # maybe open paren
... {code}            # one code
... (?: \+ {code} )*  # maybe other codes, plus-separated
... (?: \) )?         # maybe close paren
... """
>>> re.findall(multicode, text, re.VERBOSE)
['H1', '(AH2b+EUH3)']

It's so obvious, it only took me three years to do it after I started using Python 3.6+, despite using both features during all that time.

Of course, there's any number of libraries for building regular expressions; the benefit of this is that it has zero dependencies, and zero extra things you need to learn.

Caveats

Hashes and spaces need to be escaped

Because a hash is used to mark the start of a comment, and spaces are mostly ignored, you have to represent them in some other way.

The documentation of re.VERBOSE is quite helpful:

When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

That is, this won't work like the non-verbose version:

>>> re.findall(r"\d+#\d+", "1#23a")
['1#23']
>>> re.findall(r"\d+ # \d+", "1#23a", re.VERBOSE)
['1', '23']

... but these will:

>>> re.findall(r"\d+ [#] \d+", "1#23a", re.VERBOSE)
['1#23']
>>> re.findall(r"\d+ \# \d+", "1#23a", re.VERBOSE)
['1#23']

The same is true for spaces:

>>> re.findall(r"\d+ [ ] \d+", "1 23a", re.VERBOSE)
['1 23']
>>> re.findall(r"\d+ \  \d+", "1 23a", re.VERBOSE)
['1 23']

Hashes need extra care

When composing regexes, ending a pattern on the same line as a comment might accidentally comment out the rest of that line in the enclosing pattern:

>>> one = "1 # comment"
>>> onetwo = f"{one} 2"
>>> re.findall(onetwo, '0123', re.VERBOSE)
['1']
>>> print(onetwo)
1 # comment 2

This can be avoided by always ending the pattern on a new line:

>>> one = """\
... 1 # comment
... """
>>> onetwo = f"""\
... {one} 2
... """
>>> re.findall(onetwo, '0123', re.VERBOSE)
['12']

While a bit cumbersome, in real life most patterns would span multiple lines anyway, so it's not really an issue.

(Note that this is only needed if you use comments.)
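
If remembering the trailing newline everywhere feels fragile, one option is to let a small helper add it. This is only a sketch: group() is a hypothetical helper, not something from the re module. It wraps the sub-pattern in a non-capturing group and puts a newline before the close paren, so a trailing comment cannot swallow what follows:

```python
import re

def group(pattern):
    """Wrap a sub-pattern in a non-capturing group (hypothetical helper)."""
    # The newline before the close paren ends any trailing comment.
    return f"(?:{pattern}\n)"

one = "1 # comment"
onetwo = f"{group(one)} 2"
print(re.findall(onetwo, "0123", re.VERBOSE))
# ['12']

# The group also keeps an alternation from leaking into the enclosing pattern.
either = "a | b  # a or b"
print(re.findall(f"{group(either)} c", "ac bc", re.VERBOSE))
# ['ac', 'bc']
```

The cost is an extra group around every sub-pattern, which makes the expanded regex a bit noisier to read when printed.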

Brace quantifiers need to be escaped

Because f-strings already use braces for replacements, to represent brace quantifiers you must double the braces:

>>> re.findall("m{2}", "entire mm but only two of mmm")
['mm', 'mm']
>>> letter = "m"
>>> pattern = f"{letter}{{2}}"
>>> re.findall(pattern, "entire mm but only two of mmm")
['mm', 'mm']
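
A related case is when the count itself comes from a variable. As a small sketch (the count variable is made up for illustration), that takes three braces on each side: the doubled pair produces the literal braces, and the inner pair is the replacement field:

```python
import re

letter = "m"
count = 2

# "{{" and "}}" are literal braces; "{count}" in between is a replacement field.
pattern = f"{letter}{{{count}}}"
print(pattern)
# m{2}
print(re.findall(pattern, "entire mm but only two of mmm"))
# ['mm', 'mm']
```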

I don't control the flags

Maybe you'd like to use verbose regexes, but don't control the flags passed to the re functions (for example, because you're passing the regex to an API).

Worry not! The regular expression syntax supports inline flags:

(?aiLmsux)
(One or more letters [...]) The group matches the empty string; the letters set the corresponding flags: [...] re.X (verbose), for the entire regular expression. [...] This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function. Flags should be used first in the expression string.
(?aiLmsux-imsx:...)
[...] The letters set or remove the corresponding flags [...] for the part of the expression. [...]

So, you can do this:

>>> onetwo = """\
... (?x)
... 1 # look, ma
... 2 # no flags
... """
>>> re.findall(onetwo, '0123')
['12']

... or this:

>>> onetwo = """\
... (?x:
...     1 # verbose until the close paren
... )2"""
>>> re.findall(onetwo, '0123')
['12']
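
The scoped form also combines with f-strings. Since the global (?x) is meant to come first in the expression, it is awkward to carry inside a sub-pattern that ends up in the middle of a bigger one; (?x: ... ) has no such restriction. A minimal sketch, reusing the code pattern and sample text from earlier, where each sub-pattern is verbose on its own and the enclosing pattern is not:

```python
import re

code = r"""(?x:
    [A-Z]*H  # prefix
    \d+      # digits
    [a-z]*   # suffix
)"""

# The enclosing pattern is not verbose, so whitespace here would be significant.
multicode = fr"{code}(?:\+{code})*"

text = "H1 code (AH2b+EUH3) fancy code"
print(re.findall(multicode, text))
# ['H1', 'AH2b+EUH3']
```

Because the close paren sits on its own line, a trailing comment inside the sub-pattern cannot leak into the enclosing pattern either.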

That's it for now.

Bonus: I don't use Python

Lots of other languages support the inline verbose flag, too! You can build a pattern in whichever language is more convenient, and use it in any other one. [3] Languages like...

C (with PCRE – and by extension, C++, PHP, and many others):

echo '0123' | pcregrep -o '(?x)
1 2  # such inline
'

... yeah, the C version is actually really long:

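#define PCRE2_CODE_UNIT_WIDTH 8  /* must be defined before including pcre2.h */
#include <stdio.h>
#include <string.h>
#include <pcre2.h>

int main(void) {
    const char *pattern =
        "(?x)\n"
        "1 2  # much verbose\n"
    ;
    const char *subject = "0123";
    size_t subject_length = strlen(subject);

    int errornumber;
    PCRE2_SIZE erroroffset;

    pcre2_code *re = pcre2_compile(
        (PCRE2_SPTR)pattern,
        PCRE2_ZERO_TERMINATED,
        0,
        &errornumber,
        &erroroffset,
        NULL
    );
    if (re == NULL) return 1;  /* the pattern did not compile */

    pcre2_match_data *match_data = pcre2_match_data_create_from_pattern(re, NULL);

    int rc = pcre2_match(
        re,
        (PCRE2_SPTR)subject,
        subject_length,
        0,
        0,
        match_data,
        NULL
    );
    if (rc < 0) return 1;  /* no match, or a matching error */

    /* ovector[0] and ovector[1] are the start and end of the whole match */
    PCRE2_SIZE *ovector = pcre2_get_ovector_pointer(match_data);
    PCRE2_SPTR substring_start = (PCRE2_SPTR)subject + ovector[0];
    size_t substring_length = ovector[1] - ovector[0];
    printf("%.*s\n", (int)substring_length, (char *)substring_start);

    pcre2_match_data_free(match_data);
    pcre2_code_free(re);
    return 0;
}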

C#:

Console.WriteLine(new Regex(@"(?x)
1 2  # wow
").Match("0123"));

grep (only the GNU one):

echo '0123' | grep -Po '(?x) 1 2  # no line'

Java (and by extension, lots of JVM languages, like Scala):

var p = Pattern.compile(
    "(?x)\n" +
    "1 2  # much class\n"
);
var m = p.matcher("0123");
m.find();
System.out.println(m.group(0));

Perl:

"0123" =~ /(?x)(
1 2  # no scare
)/;
print $1 . "\n";

PostgreSQL:

select substring(
  '0123' from
    $$(?x)
    1 2  # such declarative
    $$
);

Ruby:

puts /(?x)
1 2  # nice
/.match('0123')

Rust:

let re = Regex::new(
    r"(?x)
    1 2  # much safe
    "
).unwrap();
println!("{}", re.find("0123").unwrap().as_str());

Swift:

let string = "0123"
let range = string.range(
    of : """
    (?x)
    1 2  # omg hi
    """,
    options : .regularExpression
)
print(string[range!])

Notable languages that don't support inline verbose flags out of the box:

  • C (regex.h – POSIX regular expressions)
  • C++ (regex)
  • Go (regexp)
  • JavaScript
  • Lua

Bonus: DEFINE

(update) Interestingly, Perl, PCRE, and the regex Python library all support reusing subpatterns without string interpolation. [4]

To define a named subpattern, use the DEFINE pseudo-condition:

(?(DEFINE)
    (?<name> subpattern )
    ...
)

To use it, do a "subroutine" call:

(?&name)

The example at the beginning of the article would then look like this:

(?(DEFINE)
    (?<code>
        [A-Z]*H  # prefix
        \d+      # digits
        [a-z]*   # suffix
    )
    (?<multicode>
        (?: \( \s* )?               # maybe open paren and maybe space
        (?&code)                    # one code
        (?: \s* \+ \s* (?&code) )*  # maybe followed by other codes, plus-separated
        (?: \s* [\):+] )?           # maybe space and maybe close paren or colon or plus
    )
)

( (?&multicode) )           # code (capture)
( .*? )                     # message (capture): everything ...
(?=                         # ... up to (but excluding) ...
    (?&multicode)           # ... the next code
        (?! [^\w\s] )       # (but not when followed by punctuation)
    | $                     # ... or the end
)

  1. In PEP 498 – Literal String Interpolation.

  2. That is, since at least Python 1.5.2, released in 1999 – for all except a tiny minority of Python users, that's before forever.

  3. If they both support inline flags, they likely share most other features.

  4. Thanks to webstrand and asicsp for pointing it out!