Dealing with YAML with arbitrary tags in Python
January 2022 ∙ 10 minute read
... in which we use PyYAML to safely read and write YAML with any tags, in a way that's as straightforward as interacting with built-in types.
If you're in a hurry, you can find the code at the end.
Why is this useful? #
People mostly use YAML as a friendlier alternative to JSON,[1] but it can do way more.
Among other things, it can represent both user-defined and native data structures.
Say you need to read (or write) an AWS CloudFormation template:
EC2Instance:
  Type: AWS::EC2::Instance
  Properties:
    ImageId: !FindInMap [
      AWSRegionArch2AMI,
      !Ref 'AWS::Region',
      !FindInMap [AWSInstanceType2Arch, !Ref InstanceType, Arch],
    ]
    InstanceType: !Ref InstanceType
>>> yaml.safe_load(text)
Traceback (most recent call last):
...
yaml.constructor.ConstructorError: could not determine a constructor for the tag '!FindInMap'
in "<unicode string>", line 4, column 14:
ImageId: !FindInMap [
^
... or, you need to safely read untrusted YAML that represents Python objects:
!!python/object/new:module.Class { attribute: value }
>>> yaml.safe_load(text)
Traceback (most recent call last):
...
yaml.constructor.ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/object/new:module.Class'
in "<unicode string>", line 1, column 1:
!!python/object/new:module.Class ...
^
Warning
Historically, yaml.load(thing) was unsafe for untrusted data, because it allowed running arbitrary code. Consider using safe_load() instead.
For example, you could do this:
>>> yaml.load("!!python/object/new:os.system [echo WOOSH. YOU HAVE been compromised]")
WOOSH. YOU HAVE been compromised
0
There were a bunch of CVEs about it.
To address the issue, load() requires an explicit Loader since PyYAML 6.
Also, version 5 added two new functions and corresponding loaders:
full_load() resolves all tags except those known to be unsafe (note that this was broken before 5.4, and thus vulnerable)
unsafe_load() resolves all tags, even those known to be unsafe (the old load() behavior)
safe_load() resolves only basic tags, remaining the safest.
Can I just get the data, without it being turned into objects?
You can! The YAML spec says:
In a given processing environment, there need not be an available native type corresponding to a given tag. If a node’s tag is unavailable, a YAML processor will not be able to construct a native data structure for it. In this case, a complete representation may still be composed and an application may wish to use this representation directly.
And PyYAML obliges:
>>> text = """\
... one: !myscalar string
... two: !mysequence [1, 2]
... """
>>> yaml.compose(text)
MappingNode(
    tag='tag:yaml.org,2002:map',
    value=[
        (
            ScalarNode(tag='tag:yaml.org,2002:str', value='one'),
            ScalarNode(tag='!myscalar', value='string'),
        ),
        (
            ScalarNode(tag='tag:yaml.org,2002:str', value='two'),
            SequenceNode(
                tag='!mysequence',
                value=[
                    ScalarNode(tag='tag:yaml.org,2002:int', value='1'),
                    ScalarNode(tag='tag:yaml.org,2002:int', value='2'),
                ],
            ),
        ),
    ],
)
>>> print(yaml.serialize(_))
one: !myscalar 'string'
two: !mysequence [1, 2]
... the spec didn't say the representation has to be concise. ¯\_(ツ)_/¯
Here's how YAML processing works, to give you an idea what we're looking at:
The output of compose() above is the representation (node graph). From that, safe_load() does its best to construct objects, but it can't do anything for tags it doesn't know about.
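To make the stages a bit more concrete, here's a rough sketch that runs a small document (made up for this example) through PyYAML one step at a time. The stage names in the comments follow the YAML spec's processing overview; parse(), compose(), and safe_load() are regular PyYAML functions.

```python
import yaml

text = "two: [1, 2]\n"

# Presentation -> serialization: a flat stream of parsing events.
events = list(yaml.parse(text))

# Serialization -> representation: the node graph.
node = yaml.compose(text)

# Representation -> native: plain Python objects.
data = yaml.safe_load(text)

print([type(event).__name__ for event in events])
print(type(node).__name__, node.tag)
print(data)
```

Unknown tags survive the first two steps just fine; it's only the last one that needs to know what to do with them.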
There must be a better way!
Thankfully, the spec also says:
That said, tag resolution is specific to the application. YAML processors should therefore provide a mechanism allowing the application to override and expand these default tag resolution rules.
We'll use this mechanism to convert tagged nodes to almost-native types, while preserving the tags.
A note on PyYAML extensibility #
PyYAML is a bit unusual.
For each processing direction, you have a corresponding Loader/Dumper class.
For each processing step, you can add callbacks, stored in class-level registries.
The callbacks are method-like – they receive the Loader/Dumper as the first argument:
import yaml
from collections import namedtuple

Dice = namedtuple('Dice', 'a b')
def dice_representer(dumper, data):
    # the dumper is passed first, like self in a method
    return dumper.represent_scalar('!dice', '%sd%s' % data)
yaml.Dumper.add_representer(Dice, dice_representer)
You may notice the add_...() methods modify the class in-place, for everyone, which isn't necessarily great; imagine getting a Dice from safe_load(), when you were expecting only built-in types.
We can avoid this by subclassing, since the registry is copied from the parent. Note that because of how copying is implemented, registries from two direct parents are not merged – you only get the registry of the first parent in the MRO.
So, we'll start by subclassing SafeLoader/Dumper:
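A minimal sketch of that starting point, assuming PyYAML is installed. The Loader and Dumper names match the ones used in the examples that follow; the !demo tag is made up, and is only there to show that registering something on the subclass leaves the parent alone.

```python
import yaml


class Loader(yaml.SafeLoader):
    """Our own loader; inherits SafeLoader's constructors."""


class Dumper(yaml.SafeDumper):
    """Our own dumper; inherits SafeDumper's representers."""


# Hypothetical tag, registered on the subclass only.
Loader.add_constructor(
    '!demo', lambda loader, node: loader.construct_scalar(node)
)

print('!demo' in Loader.yaml_constructors)           # True
print('!demo' in yaml.SafeLoader.yaml_constructors)  # False
```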
Preserving tags #
Constructing unknown objects #
For now, we can use named tuples for objects with unknown tags, since they are naturally tag/value pairs:
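Something along these lines would do; this is a sketch, with the class and field names chosen to match the output shown a bit further down.

```python
from typing import Any, NamedTuple


class Tagged(NamedTuple):
    """A constructed value, together with the YAML tag it came with."""
    tag: str
    value: Any


one = Tagged('!myscalar', 'string')
print(one)  # Tagged(tag='!myscalar', value='string')
```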
Tag or no tag, all YAML nodes are either a scalar, a sequence, or a mapping. For unknown tags, we delegate construction to the loader's default constructors, and wrap the resulting value:
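A sketch of what that can look like. Tagged and Loader are repeated so the snippet runs on its own; the construct_*() methods it delegates to are the ones SafeLoader already has.

```python
from typing import Any, NamedTuple

import yaml


class Tagged(NamedTuple):
    tag: str
    value: Any


class Loader(yaml.SafeLoader):
    pass


def construct_undefined(self, node):
    # Pick the default constructor that matches the kind of node.
    if isinstance(node, yaml.nodes.ScalarNode):
        value = self.construct_scalar(node)
    elif isinstance(node, yaml.nodes.SequenceNode):
        value = self.construct_sequence(node)
    elif isinstance(node, yaml.nodes.MappingNode):
        value = self.construct_mapping(node)
    else:
        raise AssertionError(f"unexpected node: {node!r}")
    # Keep the tag next to the constructed value.
    return Tagged(node.tag, value)


# None is the registry key for tags that have no constructor of their own.
Loader.add_constructor(None, construct_undefined)

text = """\
one: !myscalar string
two: !mysequence [1, 2]
"""
data = yaml.load(text, Loader=Loader)
print(data)
```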
Constructors are registered by tag, with None meaning "unknown".
Things look much better already:
>>> yaml.load(text, Loader=Loader)
{
'one': Tagged(tag='!myscalar', value='string'),
'two': Tagged(tag='!mysequence', value=[1, 2]),
}
A better wrapper #
That's nice, but every time we use any value, we have to check if it's tagged, and then go through value if it is:
>>> one = _['one']
>>> one.tag
'!myscalar'
>>> one.value.upper()
'STRING'
We could subclass the Python types corresponding to core YAML tags (str, list, and so on), and add a tag attribute to each. We could subclass most of them, anyway – neither bool nor NoneType can be subclassed.
Or, we could wrap tagged objects in a class with the same interface, that delegates method calls and attribute access to the wrappee, with a tag attribute on top.
Tip
This is known as the decorator design pattern (not to be confused with Python decorators). Caching and retrying methods dynamically is another interesting use for it.
Doing this naively entails writing one wrapper per type, with one wrapper method per method and one property per attribute. That's even worse than subclassing!
There must be a better way!
Of course, this is Python, so there is.
We can use an object proxy instead (also known as "dynamic wrapper"). While they're not perfect in general, the one wrapt provides is damn near perfect enough:
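Here's a sketch of such a wrapper, assuming the third-party wrapt package is installed. The tag = None class attribute matters: wrapt forwards attribute assignments to the wrapped object, unless the attribute already exists on the proxy class.

```python
import wrapt  # third-party: pip install wrapt


class Tagged(wrapt.ObjectProxy):
    # Declared on the class, so that wrapt stores it on the proxy
    # itself instead of trying to set it on the wrapped object.
    tag = None

    def __init__(self, tag, wrapped):
        super().__init__(wrapped)
        self.tag = tag

    def __repr__(self):
        return f"{type(self).__name__}({self.tag!r}, {self.__wrapped__!r})"


one = Tagged('!myscalar', 'string')
print(one.tag)      # the tag lives on the proxy
print(one.upper())  # everything else is forwarded to the wrapped str
```

Since the new class still takes a tag and a value, in that order, a constructor that calls Tagged(node.tag, value) keeps working as it is.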
>>> yaml.load(text, Loader=Loader)
{
'one': Tagged('!myscalar', 'string'),
'two': Tagged('!mysequence', [1, 2]),
}
The proxy behaves identically to the proxied object:
>>> one = _['one']
>>> one.tag
'!myscalar'
>>> one.upper()
'STRING'
>>> one[:3]
'str'
...up to and including fancy things like isinstance():
>>> isinstance(one, str)
True
>>> isinstance(one, Tagged)
True
And now you don't have to care about tags if you don't want to.
Representing tagged objects #
The trip back is exactly the same, but much shorter:
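A sketch of the representer, under the same assumptions as before (wrapt-based Tagged, Dumper subclassing SafeDumper). The represent_tagged name is made up for this example.

```python
import wrapt  # third-party
import yaml


class Tagged(wrapt.ObjectProxy):
    tag = None  # stored on the proxy, not on the wrapped object

    def __init__(self, tag, wrapped):
        super().__init__(wrapped)
        self.tag = tag


class Dumper(yaml.SafeDumper):
    pass


def represent_tagged(self, data):
    # Let the dumper represent the wrapped value as usual...
    node = self.represent_data(data.__wrapped__)
    # ...then put the original tag back on the resulting node.
    node.tag = data.tag
    return node


Dumper.add_representer(Tagged, represent_tagged)

print(yaml.dump(Tagged('!mysequence', [1, 2]), Dumper=Dumper))
```

This works because PyYAML looks up representers by type(data), which for a proxy is still Tagged, even though isinstance() says it's a list.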
Representers are registered by type.
>>> print(yaml.dump(Tagged('!hello', 'world'), Dumper=Dumper))
!hello 'world'
Let's mark the occasion with some tests.
Since we still have stuff to do, we parametrize the tests from the start.
Loading works:
And dumping works:
... but only for known types:
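Put together, the three checks above might look roughly like this. It's a sketch assuming pytest, with the loader and dumper set up as described so far; the test names and the sample data are made up.

```python
import pytest  # third-party, like wrapt and PyYAML
import wrapt
import yaml


class Tagged(wrapt.ObjectProxy):
    tag = None

    def __init__(self, tag, wrapped):
        super().__init__(wrapped)
        self.tag = tag


class Loader(yaml.SafeLoader):
    pass


class Dumper(yaml.SafeDumper):
    pass


def construct_undefined(self, node):
    if isinstance(node, yaml.nodes.ScalarNode):
        value = self.construct_scalar(node)
    elif isinstance(node, yaml.nodes.SequenceNode):
        value = self.construct_sequence(node)
    else:
        value = self.construct_mapping(node)
    return Tagged(node.tag, value)


def represent_tagged(self, data):
    node = self.represent_data(data.__wrapped__)
    node.tag = data.tag
    return node


Loader.add_constructor(None, construct_undefined)
Dumper.add_representer(Tagged, represent_tagged)

# (text, data) pairs; extend the list to cover more cases.
DATA = [
    (
        "one: !myscalar string\ntwo: !mysequence [1, 2]\n",
        {
            'one': Tagged('!myscalar', 'string'),
            'two': Tagged('!mysequence', [1, 2]),
        },
    ),
]


def tags(mapping):
    return {key: value.tag for key, value in mapping.items()}


@pytest.mark.parametrize('text, data', DATA)
def test_load(text, data):
    loaded = yaml.load(text, Loader=Loader)
    assert loaded == data
    # A proxy compares like the value it wraps, so check tags separately.
    assert tags(loaded) == tags(data)


@pytest.mark.parametrize('text, data', DATA)
def test_roundtrip(text, data):
    loaded = yaml.load(yaml.dump(data, Dumper=Dumper), Loader=Loader)
    assert loaded == data
    assert tags(loaded) == tags(data)


def test_dump_error():
    # SafeDumper refuses objects it has no representer for.
    with pytest.raises(yaml.representer.RepresenterError):
        yaml.dump(object(), Dumper=Dumper)
```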
Unhashable keys #
Let's try an example from the PyYAML documentation:
>>> text = """\
... ? !!python/tuple [0,0]
... : The Hero
... ? !!python/tuple [1,0]
... : Treasure
... ? !!python/tuple [1,1]
... : The Dragon
... """
This is supposed to result in something like:
>>> yaml.unsafe_load(text)
{(0, 0): 'The Hero', (1, 0): 'Treasure', (1, 1): 'The Dragon'}
Instead, we get:
>>> yaml.load(text, Loader=Loader)
Traceback (most recent call last):
...
TypeError: unhashable type: 'list'
That's because the keys are tagged lists, and neither type is hashable:
>>> yaml.load("!!python/tuple [0,0]", Loader=Loader)
Tagged('tag:yaml.org,2002:python/tuple', [0, 0])
This limitation comes from how Python dicts are implemented,[3] not from YAML; quoting from the spec again:
The content of a mapping node is an unordered set of key/value node pairs, with the restriction that each of the keys is unique. YAML places no further restrictions on the nodes. In particular, keys may be arbitrary nodes, the same node may be used as the value of several key/value pairs and a mapping could even contain itself as a key or a value.
Constructing pairs #
What now?
Same strategy as before: wrap the things we can't handle.
Specifically, whenever we have a mapping with unhashable keys, we return a list of pairs instead. To tell it apart from plain lists, we use a subclass:
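The subclass can be as small as this sketch; the custom __repr__() is optional, but it makes the difference visible in the REPL.

```python
class Pairs(list):
    """A list of (key, value) pairs standing in for a mapping."""

    def __repr__(self):
        return f"{type(self).__name__}({super().__repr__()})"


pairs = Pairs([([0, 0], 'The Hero')])
print(pairs)                    # Pairs([([0, 0], 'The Hero')])
print(isinstance(pairs, list))  # True
```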
Again, we let the loader do most of the work:
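A sketch of the constructor, using the loader's own construct_pairs(). To keep it free of other dependencies, the example uses a plain sequence as key instead of a tagged one; the deep argument is there only to stay compatible with the signature of the method being replaced.

```python
import yaml


class Pairs(list):
    def __repr__(self):
        return f"{type(self).__name__}({super().__repr__()})"


class Loader(yaml.SafeLoader):
    pass


def construct_mapping(self, node, deep=False):
    # construct_pairs() builds the (key, value) list without hashing keys.
    value = self.construct_pairs(node, deep=deep)
    try:
        return dict(value)
    except TypeError:
        # At least one key is unhashable; fall back to the list of pairs.
        return Pairs(value)


Loader.construct_mapping = construct_mapping
Loader.add_constructor('tag:yaml.org,2002:map', Loader.construct_mapping)

plain = yaml.load("a: 1\nb: 2\n", Loader=Loader)
pairs = yaml.load("? [0, 0]\n: The Hero\n", Loader=Loader)
print(plain)
print(pairs)
```

Mappings with hashable keys still come back as plain dicts; only the ones that can't be turned into a dict end up as Pairs.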
We set construct_mapping so that any other Loader constructor wanting to make a mapping gets to use it (like our own construct_undefined() above). Don't be fooled by the assignment, it's a method like any other[4] ... but we're changing the class from outside anyway, so it's best to stay consistent.
Note that overriding construct_mapping() is not enough: we have to register the constructor explicitly, otherwise SafeLoader's construct_mapping() will be used (since that's what was in the registry before).
Note
In case you're wondering, this feature is orthogonal to handling unknown tags; we could have used different classes for them. However, as mentioned before, the constructor registry breaks multiple inheritance, so we couldn't use the two features together.
Anyway, it works:
>>> yaml.load(text, Loader=Loader)
Pairs(
    [
        (Tagged('tag:yaml.org,2002:python/tuple', [0, 0]), 'The Hero'),
        (Tagged('tag:yaml.org,2002:python/tuple', [1, 0]), 'Treasure'),
        (Tagged('tag:yaml.org,2002:python/tuple', [1, 1]), 'The Dragon'),
    ]
)
Representing pairs #
Like before, the trip back is short and uneventful:
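A sketch of the representer; the represent_pairs name is made up. It leans on the fact that, in current PyYAML versions, represent_dict() is happy with any iterable of (key, value) pairs, not just with dicts.

```python
import yaml


class Pairs(list):
    pass


class Dumper(yaml.SafeDumper):
    pass


def represent_pairs(self, data):
    # A list of pairs has no .items(), so it gets iterated as it is
    # and comes out as a regular YAML mapping.
    return self.represent_dict(data)


Dumper.add_representer(Pairs, represent_pairs)

print(yaml.dump(Pairs([([0, 0], 'The Hero')]), Dumper=Dumper))
```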
>>> print(yaml.dump(Pairs([([], 'one')]), Dumper=Dumper))
[]: one
Let's test this more thoroughly.
Because the tests are parametrized, we just need to add more data:
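For instance, the extra cases could look like the ones in this sketch. It assumes pytest and only covers the pairs half of things, with made-up data, so that it runs without the tagging code.

```python
import pytest  # third-party
import yaml


class Pairs(list):
    pass


class Loader(yaml.SafeLoader):
    pass


class Dumper(yaml.SafeDumper):
    pass


def construct_mapping(self, node, deep=False):
    value = self.construct_pairs(node, deep=deep)
    try:
        return dict(value)
    except TypeError:
        return Pairs(value)


Loader.construct_mapping = construct_mapping
Loader.add_constructor('tag:yaml.org,2002:map', Loader.construct_mapping)
Dumper.add_representer(Pairs, lambda dumper, data: dumper.represent_dict(data))

DATA = [
    # hashable keys still come back as a plain dict
    ("a: 1\n", {'a': 1}),
    # an unhashable (sequence) key yields Pairs instead
    ("? [0, 0]\n: The Hero\n", Pairs([([0, 0], 'The Hero')])),
]


@pytest.mark.parametrize('text, data', DATA)
def test_load(text, data):
    loaded = yaml.load(text, Loader=Loader)
    assert loaded == data
    assert type(loaded) is type(data)


@pytest.mark.parametrize('text, data', DATA)
def test_roundtrip(text, data):
    loaded = yaml.load(yaml.dump(data, Dumper=Dumper), Loader=Loader)
    assert loaded == data
    assert type(loaded) is type(data)
```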
Conclusion #
YAML is extensible by design. I hope that besides what it says on the tin, this article shed some light on how to customize PyYAML for your own purposes, and that you've learned at least one new Python thing.
You can get the code here, and the tests here.
Bonus: hashable wrapper #
You may be asking, why not make the wrapper hashable?
Most unhashable (data) objects are so for a reason: because they're mutable.
We have two options:
Make the wrapper hash change with the content. This will break dictionaries in strange and unexpected ways (and other things too) – Python expects an object's hash to never change during its lifetime.
Make the wrapper hash not change with the content, and wrappers equal only to themselves – that's what user-defined classes do by default anyway.
This works, but it's not very useful, because equal values don't compare equal anymore (data != load(dump(data))). Also, it means you can only get things from a dict if you already have the object used as key:
>>> data = {Hashable([1]): 'one'}
>>> data[Hashable([1])]
Traceback (most recent call last):
...
KeyError: Hashable([1])
>>> key = list(data)[0]
>>> data[key]
'one'
I'd file this under "strange and unexpected" too.
(You can find the code for the example above here.)
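In case the link doesn't work for you, here's a guess at what such a wrapper could look like. It's a sketch of the second option built on wrapt, not necessarily the code used for the example above.

```python
import wrapt  # third-party


class Hashable(wrapt.ObjectProxy):
    # Identity-based equality and hashing, which is what plain
    # user-defined classes get by default.

    def __eq__(self, other):
        return self is other

    def __hash__(self):
        return id(self)

    def __repr__(self):
        return f"{type(self).__name__}({self.__wrapped__!r})"


data = {Hashable([1]): 'one'}
key = list(data)[0]

print(data[key])              # found, since it's the very same object
print(Hashable([1]) in data)  # False: an equal-looking key is not found
```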
Bonus: broken YAML #
We can venture even farther, into arguably broken YAML. Let's look at some examples.
First, there are undefined tag prefixes:
>>> yaml.load("!m!xyz x", Loader=Loader)
Traceback (most recent call last):
...
yaml.parser.ParserError: while parsing a node
found undefined tag handle '!m!'
in "<unicode string>", line 1, column 1:
!m!xyz x
^
A valid version:
>>> yaml.load("""\
... %TAG !m! !my-
... ---
... !m!xyz x
... """, Loader=Loader)
Tagged('!my-xyz', 'x')
Second, there are undefined aliases:
>>> yaml.load("two: *anchor", Loader=Loader)
Traceback (most recent call last):
...
yaml.composer.ComposerError: found undefined alias 'anchor'
in "<unicode string>", line 1, column 6:
two: *anchor
^
A valid version:
>>> yaml.load("""\
... one: &anchor [1]
... two: *anchor
... """, Loader=Loader)
{'one': [1], 'two': [1]}
It's likely possible to handle these in a way similar to how we handled undefined tags, but we'd have to go deeper – the exceptions hint at which processing step to look at.
Since I haven't actually encountered them in real life, we'll "save them for later" :)
[1] Of which YAML is actually a superset.
[3] Using a hash table. For a nice explanation of how it all works, complete with a pure-Python implementation, check out Raymond Hettinger's talk Modern Python Dictionaries: A confluence of a dozen great ideas (code).
[4] Almost. The zero argument form of super() won't work for methods defined outside of a class definition, but we're not using it here.