Dealing with YAML with arbitrary tags in Python

January 2022 ∙ 10 minute read

... in which we use PyYAML to safely read and write YAML with any tags, in a way that's as straightforward as interacting with built-in types.

If you're in a hurry, you can find the code at the end.

Why is this useful? #

People mostly use YAML as a friendlier alternative to JSON[1], but it can do way more.

Among other things, it can represent user-defined and native data structures.

Say you need to read (or write) an AWS CloudFormation template:

EC2Instance:
  Type: AWS::EC2::Instance
  Properties:
    ImageId: !FindInMap [
      AWSRegionArch2AMI,
      !Ref 'AWS::Region',
      !FindInMap [AWSInstanceType2Arch, !Ref InstanceType, Arch],
    ]
    InstanceType: !Ref InstanceType

>>> yaml.safe_load(text)
Traceback (most recent call last):
  ...
yaml.constructor.ConstructorError: could not determine a constructor for the tag '!FindInMap'
  in "<unicode string>", line 4, column 14:
        ImageId: !FindInMap [
                 ^

... or, you need to safely read untrusted YAML that represents Python objects:

!!python/object/new:module.Class { attribute: value }

>>> yaml.safe_load(text)
Traceback (most recent call last):
 ...
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/new:module.Class'
  in "<unicode string>", line 1, column 1:
    !!python/object/new:module.Class ...
    ^

Warning

Historically, yaml.load(thing) was unsafe for untrusted data, because it allowed running arbitrary code. Consider using safe_load() instead.

For example, you could do this:

>>> yaml.load("!!python/object/new:os.system [echo WOOSH. YOU HAVE been compromised]")
WOOSH. YOU HAVE been compromised
0

There were a bunch of CVEs about it.

To address the issue, load() requires an explicit Loader since PyYAML 6. Also, version 5.1 added two new functions and corresponding loaders:

  • full_load() resolves all tags except those known to be unsafe (note that this was broken before 5.4, and thus vulnerable)
  • unsafe_load() resolves all tags, even those known to be unsafe (the old load() behavior)

safe_load() resolves only basic tags, remaining the safest.
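
As a rough illustration of the difference (a minimal sketch; the helper name is_rejected is made up for this example), safe_load() behaves like load() with an explicit SafeLoader, and refuses tags that would construct arbitrary Python objects:

```python
import yaml

def is_rejected(text):
    """Return True if safe_load() refuses to construct the document."""
    try:
        yaml.safe_load(text)
    except yaml.constructor.ConstructorError:
        return True
    return False

# safe_load() is shorthand for load() with the SafeLoader class
plain = "a: [1, 2]"
assert yaml.safe_load(plain) == yaml.load(plain, Loader=yaml.SafeLoader)

# tags that would build arbitrary Python objects are not resolved
print(is_rejected("!!python/object/new:os.system [echo pwned]"))  # True
print(is_rejected(plain))  # False
```

Passing the Loader explicitly works on both older and newer PyYAML versions, so it is a reasonable habit either way.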


Can I just get the data, without it being turned into objects?

You can! The YAML spec says:

In a given processing environment, there need not be an available native type corresponding to a given tag. If a node’s tag is unavailable, a YAML processor will not be able to construct a native data structure for it. In this case, a complete representation may still be composed and an application may wish to use this representation directly.

And PyYAML obliges:

>>> text = """\
... one: !myscalar string
... two: !mysequence [1, 2]
... """
>>> yaml.compose(text)
MappingNode(
    tag='tag:yaml.org,2002:map',
    value=[
        (
            ScalarNode(tag='tag:yaml.org,2002:str', value='one'),
            ScalarNode(tag='!myscalar', value='string'),
        ),
        (
            ScalarNode(tag='tag:yaml.org,2002:str', value='two'),
            SequenceNode(
                tag='!mysequence',
                value=[
                    ScalarNode(tag='tag:yaml.org,2002:int', value='1'),
                    ScalarNode(tag='tag:yaml.org,2002:int', value='2'),
                ],
            ),
        ),
    ],
)
>>> print(yaml.serialize(_))
one: !myscalar 'string'
two: !mysequence [1, 2]

... the spec didn't say the representation has to be concise. ¯\_(ツ)_/¯

Here's how YAML processing works, to give you an idea what we're looking at:

[Figure: YAML Processing Overview]

The output of compose() above is the representation (node graph).

From that, safe_load() does its best to construct objects, but it can't do anything for tags it doesn't know about.
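
To get a feel for the node graph, here is a small sketch that walks it and collects the local (application-specific) tags. The iter_tags helper is not part of PyYAML, it's just made up for illustration, and it assumes the document has no recursive aliases:

```python
import yaml
from yaml.nodes import MappingNode, SequenceNode

def iter_tags(node):
    """Yield the tag of every node in a composed graph, depth-first."""
    yield node.tag
    if isinstance(node, MappingNode):
        # mapping nodes hold a list of (key node, value node) pairs
        for key_node, value_node in node.value:
            yield from iter_tags(key_node)
            yield from iter_tags(value_node)
    elif isinstance(node, SequenceNode):
        # sequence nodes hold a list of child nodes
        for item_node in node.value:
            yield from iter_tags(item_node)
    # scalar nodes have no children

text = """\
one: !myscalar string
two: !mysequence [1, 2]
"""

root = yaml.compose(text)
local_tags = sorted({tag for tag in iter_tags(root) if tag.startswith('!')})
print(local_tags)  # ['!myscalar', '!mysequence']
```

Nothing is constructed here; compose() stops at the representation, so unknown tags are not a problem.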


There must be a better way!

Thankfully, the spec also says:

That said, tag resolution is specific to the application. YAML processors should therefore provide a mechanism allowing the application to override and expand these default tag resolution rules.

We'll use this mechanism to convert tagged nodes to almost-native types, while preserving the tags.

A note on PyYAML extensibility #

PyYAML is a bit unusual.

For each processing direction, you have a corresponding Loader/Dumper class.

For each processing step, you can add callbacks, stored in class-level registries.

The callbacks are method-like – they receive the Loader/Dumper as the first argument:

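from collections import namedtuple

import yaml

Dice = namedtuple('Dice', 'a b')

def dice_representer(dumper, data):
    # the dumper comes first, as if this were a method on the Dumper class
    return dumper.represent_scalar('!dice', '%sd%s' % data)

yaml.Dumper.add_representer(Dice, dice_representer)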

You may notice the add_...() methods modify the class in-place, for everyone, which isn't necessarily great; imagine getting a Dice from safe_load(), when you were expecting only built-in types.

We can avoid this by subclassing, since the registry is copied from the parent. Note that because of how copying is implemented, registries from two direct parents are not merged – you only get the registry of the first parent in the MRO.
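
A quick way to convince yourself the registry is copied on first use (a sketch; MyLoader, construct_upper and the !upper tag are invented for this example):

```python
import yaml

class MyLoader(yaml.SafeLoader):
    pass

def construct_upper(loader, node):
    # hypothetical constructor: upper-case the scalar
    return loader.construct_scalar(node).upper()

# registers on MyLoader only; SafeLoader's registry is left alone
MyLoader.add_constructor('!upper', construct_upper)

print('!upper' in MyLoader.yaml_constructors)         # True
print('!upper' in yaml.SafeLoader.yaml_constructors)  # False
print(yaml.load('!upper hello', Loader=MyLoader))     # HELLO
```

Code that calls safe_load() elsewhere in the program keeps seeing only built-in types.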


So, we'll start by subclassing SafeLoader/SafeDumper:

class Loader(yaml.SafeLoader):
    pass

class Dumper(yaml.SafeDumper):
    pass

Preserving tags #

Constructing unknown objects #

For now, we can use named tuples for objects with unknown tags, since they are naturally tag/value pairs:

class Tagged(typing.NamedTuple):
    tag: str
    value: object

Tag or no tag, all YAML nodes are either a scalar, a sequence, or a mapping. For unknown tags, we delegate construction to the loader's default constructors, and wrap the resulting value:

def construct_undefined(self, node):
    if isinstance(node, yaml.nodes.ScalarNode):
        value = self.construct_scalar(node)
    elif isinstance(node, yaml.nodes.SequenceNode):
        value = self.construct_sequence(node)
    elif isinstance(node, yaml.nodes.MappingNode):
        value = self.construct_mapping(node)
    else:
        assert False, f"unexpected node: {node!r}"
    return Tagged(node.tag, value)

Loader.add_constructor(None, construct_undefined)

Constructors are registered by tag, with None meaning "unknown".
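
For contrast, when you do know a tag ahead of time, you can register a constructor under its name. The following is only a sketch of that, using the !Ref tag from the CloudFormation example; the Ref class and CfnLoader are hypothetical, not something PyYAML or AWS provide:

```python
import dataclasses

import yaml

@dataclasses.dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for a CloudFormation reference."""
    name: str

class CfnLoader(yaml.SafeLoader):
    pass

def construct_ref(loader, node):
    # !Ref is applied to a scalar, so the scalar constructor is enough
    return Ref(loader.construct_scalar(node))

# registered under a specific tag, instead of None
CfnLoader.add_constructor('!Ref', construct_ref)

data = yaml.load("InstanceType: !Ref InstanceType\n", Loader=CfnLoader)
print(data)  # {'InstanceType': Ref(name='InstanceType')}
```

This doesn't scale to arbitrary tags, which is why the catch-all None registration is what we want here.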

Things look much better already:

>>> yaml.load(text, Loader=Loader)
{
    'one': Tagged(tag='!myscalar', value='string'),
    'two': Tagged(tag='!mysequence', value=[1, 2]),
}

A better wrapper #

That's nice, but every time we use any value, we have to check if it's tagged, and then go through value if it is:

>>> one = _['one']
>>> one.tag
'!myscalar'
>>> one.value.upper()
'STRING'

We could subclass the Python types corresponding to core YAML tags (str, list, and so on), and add a tag attribute to each. We could subclass most of them, anyway – neither bool nor NoneType can be subclassed.

Or, we could wrap tagged objects in a class with the same interface, that delegates method calls and attribute access to the wrapee, with a tag attribute on top.

Tip

This is known as the decorator design pattern (not to be confused with Python decorators).

Doing this naively entails writing one wrapper per type, with one wrapper method per method and one property per attribute. That's even worse than subclassing!
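
You might hope a single __getattr__ would do the delegation for every type at once. A minimal sketch (NaiveTagged is made up for this example) shows where that falls short: regular methods work, but isinstance() and anything Python looks up on the type, like slicing, do not:

```python
class NaiveTagged:
    """Delegates plain attribute access to the wrapped object."""

    def __init__(self, tag, wrapped):
        self.tag = tag
        self._wrapped = wrapped

    def __getattr__(self, name):
        # only called when normal attribute lookup fails
        return getattr(self._wrapped, name)

one = NaiveTagged('!myscalar', 'string')

print(one.upper())           # STRING
print(isinstance(one, str))  # False

try:
    one[:3]
except TypeError as e:
    # special methods like __getitem__ are looked up on the type,
    # so __getattr__ never gets a chance to delegate them
    print('slicing failed:', e)
```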

There must be a better way!


Of course, this is Python, so there is.

We can use an object proxy instead (also known as "dynamic wrapper"). While they're not perfect in general, the one wrapt provides is damn near perfect enough[2]:

class Tagged(wrapt.ObjectProxy):

    # tell wrapt to set the attribute on the proxy, not the wrapped object
    tag = None

    def __init__(self, tag, wrapped):
        super().__init__(wrapped)
        self.tag = tag

    def __repr__(self):
        return f"{type(self).__name__}({self.tag!r}, {self.__wrapped__!r})"

>>> yaml.load(text, Loader=Loader)
{
    'one': Tagged('!myscalar', 'string'),
    'two': Tagged('!mysequence', [1, 2]),
}

The proxy behaves identically to the proxied object:

>>> one = _['one']
>>> one.tag
'!myscalar'
>>> one.upper()
'STRING'
>>> one[:3]
'str'

...up to and including fancy things like isinstance():

>>> isinstance(one, str)
True
>>> isinstance(one, Tagged)
True

And now you don't have to care about tags if you don't want to.

Representing tagged objects #

The trip back is exactly the same, but much shorter:

def represent_tagged(self, data):
    assert isinstance(data, Tagged), data
    node = self.represent_data(data.__wrapped__)
    node.tag = data.tag
    return node

Dumper.add_representer(Tagged, represent_tagged)

Representers are registered by type.

>>> print(yaml.dump(Tagged('!hello', 'world'), Dumper=Dumper))
!hello 'world'
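
One consequence of registering by type: as far as I can tell, add_representer() matches the exact type only, so subclasses of known types are not picked up by SafeDumper. If you need that, add_multi_representer() matches subclasses as well. A small sketch, where Config and SubclassDumper are invented names:

```python
import yaml

class Config(dict):
    """Hypothetical dict subclass."""

def dumps_safely(data):
    """Return True if plain SafeDumper can represent the data."""
    try:
        yaml.dump(data, Dumper=yaml.SafeDumper)
    except yaml.representer.RepresenterError:
        return False
    return True

print(dumps_safely({'a': 1}))      # True
print(dumps_safely(Config(a=1)))   # False: no representer for Config itself

class SubclassDumper(yaml.SafeDumper):
    pass

# a multi-representer applies to dict and all of its subclasses
SubclassDumper.add_multi_representer(
    dict, yaml.representer.SafeRepresenter.represent_dict
)

dumped = yaml.dump(Config(a=1), Dumper=SubclassDumper)
print(yaml.safe_load(dumped))  # {'a': 1}
```

Registering Tagged by its exact type is all we need, though.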

Let's mark the occasion with some tests.

Since we still have stuff to do, we parametrize the tests from the start.

BASIC_TEXT = """\
one: !myscalar string
two: !mymapping
  three: !mysequence [1, 2]
"""

BASIC_DATA = {
    'one': Tagged('!myscalar', 'string'),
    'two': Tagged('!mymapping', {'three': Tagged('!mysequence', [1, 2])}),
}

DATA = [
    (BASIC_TEXT, BASIC_DATA),
]

Loading works:

@pytest.mark.parametrize('text, data', DATA)
def test_load(text, data):
    assert yaml.load(text, Loader=Loader) == data

And dumping works:

@pytest.mark.parametrize('text', [t[0] for t in DATA])
def test_roundtrip(text):
    data = yaml.load(text, Loader=Loader)
    assert data == yaml.load(yaml.dump(data, Dumper=Dumper), Loader=Loader)

... but only for known types:

def test_dump_error():
    with pytest.raises(yaml.representer.RepresenterError):
        yaml.dump(object(), Dumper=Dumper)

Unhashable keys #

Let's try an example from the PyYAML documentation:

>>> text = """\
... ? !!python/tuple [0,0]
... : The Hero
... ? !!python/tuple [1,0]
... : Treasure
... ? !!python/tuple [1,1]
... : The Dragon
... """

This is supposed to result in something like:

>>> yaml.unsafe_load(text)
{(0, 0): 'The Hero', (1, 0): 'Treasure', (1, 1): 'The Dragon'}

Instead, we get:

>>> yaml.load(text, Loader=Loader)
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'list'

That's because the keys are tagged lists, and neither type is hashable:

>>> yaml.load("!!python/tuple [0,0]", Loader=Loader)
Tagged('tag:yaml.org,2002:python/tuple', [0, 0])

This limitation comes from how Python dicts are implemented[3], not from YAML; quoting from the spec again:

The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique. YAML places no further restrictions on the nodes. In particular, keys may be arbitrary nodes, the same node may be used as the value of several key/value pairs and a mapping could even contain itself as a key or a value.

Constructing pairs #

What now?

Same strategy as before: wrap the things we can't handle.

Specifically, whenever we have a mapping with unhashable keys, we return a list of pairs instead. To tell it apart from plain lists, we use a subclass:

class Pairs(list):
    def __repr__(self):
        return f"{type(self).__name__}({super().__repr__()})"

Again, we let the loader do most of the work:

def construct_mapping(self, node):
    value = self.construct_pairs(node)
    try:
        return dict(value)
    except TypeError:
        return Pairs(value)

Loader.construct_mapping = construct_mapping
Loader.add_constructor('tag:yaml.org,2002:map', Loader.construct_mapping)

We set construct_mapping so that any other Loader constructor wanting to make a mapping gets to use it (like our own construct_undefined() above). Don't be fooled by the assignment; it's a method like any other.[4] Since we're changing the class from outside anyway, it's best to stay consistent.

Note that overriding construct_mapping() is not enough: we have to register the constructor explicitly, otherwise the constructor inherited from SafeLoader will still be used for plain mappings (since that's what was in the registry before), and it only knows how to build a dict.

Note

In case you're wondering, this feature is orthogonal to handling unknown tags; we could have used different classes for them. However, as mentioned before, the constructor registry breaks multiple inheritance, so we couldn't use the two features together.

Anyway, it works:

>>> yaml.load(text, Loader=Loader)
Pairs(
    [
        (Tagged('tag:yaml.org,2002:python/tuple', [0, 0]), 'The Hero'),
        (Tagged('tag:yaml.org,2002:python/tuple', [1, 0]), 'Treasure'),
        (Tagged('tag:yaml.org,2002:python/tuple', [1, 1]), 'The Dragon'),
    ]
)
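
Since Pairs is a list and not a dict, getting a value by key means scanning it. A minimal sketch of what that could look like (pairs_get is a hypothetical helper, and plain lists stand in for the tagged keys):

```python
class Pairs(list):
    def __repr__(self):
        return f"{type(self).__name__}({super().__repr__()})"

def pairs_get(pairs, key):
    """Return the value of the first pair whose key equals key."""
    # linear scan: keys are compared with ==, never hashed
    for candidate, value in pairs:
        if candidate == key:
            return value
    raise KeyError(key)

grid = Pairs([([0, 0], 'The Hero'), ([1, 0], 'Treasure')])
print(pairs_get(grid, [1, 0]))  # Treasure
```

This is O(n) per lookup, which is fine for small documents; for bigger ones, you'd probably want to convert the keys to something hashable first.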

Representing pairs #

Like before, the trip back is short and uneventful:

def represent_pairs(self, data):
    assert isinstance(data, Pairs), data
    node = self.represent_dict(data)
    return node

Dumper.add_representer(Pairs, represent_pairs)

>>> print(yaml.dump(Pairs([([], 'one')]), Dumper=Dumper))
[]: one

Let's test this more thoroughly.

Because the tests are parametrized, we just need to add more data:

UNHASHABLE_TEXT = """\
[0,0]: one
!key {0: 1}: {[]: !value three}
"""

UNHASHABLE_DATA = Pairs(
    [
        ([0, 0], 'one'),
        (Tagged('!key', {0: 1}), Pairs([([], Tagged('!value', 'three'))])),
    ]
)

DATA = [
    (BASIC_TEXT, BASIC_DATA),
    (UNHASHABLE_TEXT, UNHASHABLE_DATA),
]

Conclusion #

YAML is extensible by design. I hope that besides what it says on the tin, this article shed some light on how to customize PyYAML for your own purposes, and that you've learned at least one new Python thing.

You can get the code here, and the tests here.

Bonus: hashable wrapper #

You may be asking, why not make the wrapper hashable?

Most unhashable (data) objects are unhashable for a reason: they're mutable.

We have two options:

  • Make the wrapper hash change with the content. This will break dictionaries in strange and unexpected ways (and other things too) – Python expects an object's hash to stay the same for its whole lifetime, which is why mutable built-in containers are unhashable.

  • Make the wrapper hash not change with the content, and wrappers equal only to themselves – that's what user-defined classes do by default anyway.

    This works, but it's not very useful, because equal values don't compare equal anymore (data != load(dump(data))). Also, it means you can only get things from a dict if you already have the object used as key:

    >>> data = {Hashable([1]): 'one'}
    >>> data[Hashable([1])]
    Traceback (most recent call last):
      ...
    KeyError: Hashable([1])
    >>> key = list(data)[0]
    >>> data[key]
    'one'
    

    I'd file this under "strange and unexpected" too.

    (You can find the code for the example above here.)
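
For reference, the second option can be reproduced with a plain class that doesn't override __eq__ or __hash__. This is only a sketch, assuming a bare-bones Hashable; the linked code may well look different:

```python
class Hashable:
    """Wrapper relying on the default identity-based __eq__ and __hash__."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return f"{type(self).__name__}({self.value!r})"

data = {Hashable([1]): 'one'}

# an equal-looking wrapper is a different object, so it is not found
print(Hashable([1]) in data)  # False

# only the original key object works
key = next(iter(data))
print(data[key])  # one
```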

Bonus: broken YAML #

We can venture even farther, into arguably broken YAML. Let's look at some examples.

First, there are undefined tag handles:

>>> yaml.load("!m!xyz x", Loader=Loader)
Traceback (most recent call last):
  ...
yaml.parser.ParserError: while parsing a node
found undefined tag handle '!m!'
  in "<unicode string>", line 1, column 1:
    !m!xyz x
    ^

A valid version:

>>> yaml.load("""\
... %TAG !m! !my-
... ---
... !m!xyz x
... """, Loader=Loader)
Tagged('!my-xyz', 'x')

Second, there are undefined aliases:

>>> yaml.load("two: *anchor", Loader=Loader)
Traceback (most recent call last):
  ...
yaml.composer.ComposerError: found undefined alias 'anchor'
  in "<unicode string>", line 1, column 6:
    two: *anchor
         ^

A valid version:

>>> yaml.load("""\
... one: &anchor [1]
... two: *anchor
... """, Loader=Loader)
{'one': [1], 'two': [1]}

It's likely possible to handle these in a way similar to how we handled undefined tags, but we'd have to go deeper – the exceptions hint at which processing step to look at.

Since I haven't actually encountered them in real life, we'll "save them for later" :)
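
If you want to see which step rejects a document, one option is to look at where the exception class is defined. A small sketch (failing_step is a made-up helper; it uses SafeLoader so it stands on its own):

```python
import yaml

def failing_step(text):
    """Return (module, problem) for the error raised while loading, or None."""
    try:
        yaml.load(text, Loader=yaml.SafeLoader)
    except yaml.MarkedYAMLError as e:
        # the defining module names the processing step that gave up
        return type(e).__module__, e.problem
    return None

print(failing_step("!m!xyz x"))
# ('yaml.parser', "found undefined tag handle '!m!'")
print(failing_step("two: *anchor"))
# ('yaml.composer', "found undefined alias 'anchor'")
print(failing_step("one: 1"))
# None
```

So undefined tag handles would need changes in the parser, and undefined aliases in the composer.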

  1. Of which YAML is actually a superset.

  2. Timothy 20:9.

  3. Using a hash table. For a nice explanation of how it all works, complete with a pure-Python implementation, check out Raymond Hettinger's talk Modern Python Dictionaries: A confluence of a dozen great ideas (code).

  4. Almost. The zero argument form of super() won't work for methods defined outside of a class definition, but we're not using it here.