We need to parse YAML files that contain duplicate keys, and every occurrence has to be preserved; skipping duplicates is not enough. I know this is against the YAML spec and I would rather not do it, but a third-party tool we use allows this and we have to deal with it.
File example:

    build:
      step: 'step1'

    build:
      step: 'step2'

After parsing, we should have a data structure similar to this:

    yaml.load('file.yml')
    # [('build', [('step', 'step1')]), ('build', [('step', 'step2')])]

A dict can no longer be used to represent the parsed contents.
I am looking for a solution in Python and have not found a library that supports this. Have I missed anything?
Alternatively, I am happy to write my own parser, but I would like to keep it as simple as possible. ruamel.yaml looks like the most advanced YAML parser in Python and appears to be moderately extensible. Can it be extended to support duplicate keys?
PyYAML will just silently overwrite the first entry; ruamel.yaml¹ will give a DuplicateKeyFutureWarning if used with the legacy API, and raise a DuplicateKeyError with the new API.
If you don't want to create a full Constructor for all types, overriding the mapping constructor in SafeConstructor should do the job. Every mapping, including the nested step: 'step1', is then loaded as a list of (key, value) pairs.
However, it doesn't seem necessary to turn step: 'step1' into a list. A variant of the constructor can create the list only if there are duplicate keys and leave all other mappings as plain dicts (it could be optimised if necessary, by caching the result of self.construct_object(key_node, deep=True)).
Some points:
- You cannot use YAML merge keys (<<: *xyz) with this approach.
- If you need ruamel.yaml's round-trip capabilities (yaml = YAML()), that will require a more complex construct_yaml_map.
If you want to dump the output, you should instantiate a new YAML() instance for that, instead of re-using the "patched" one used for loading (it might work, this is just to be sure). With the first construct_yaml_map, the pairs are dumped as nested sequences.
What doesn't work in either PyYAML or ruamel.yaml is yaml.load('file.yml'). If you don't want to open() the file yourself, with ruamel.yaml you can pass a pathlib.Path instance to load() instead.
¹ Disclaimer: I am the author of that package.
You can override how PyYAML constructs mappings. For example, you could use a defaultdict that collects a list of values for each key.
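One way this could look, as a sketch rather than a definitive implementation. PreserveDuplicatesLoader and construct_multimap are hypothetical names, and the constructor is registered on a subclass of yaml.SafeLoader so that PyYAML's own classes are left untouched:

```python
from collections import defaultdict

import yaml  # PyYAML

YAML_TEXT = """\
build:
  step: 'step1'

build:
  step: 'step2'
"""


class PreserveDuplicatesLoader(yaml.SafeLoader):
    """Hypothetical loader subclass; yaml.SafeLoader itself is not modified."""


def construct_multimap(loader, node, deep=False):
    # Collect every value seen for a key in a list, so duplicates survive.
    mapping = defaultdict(list)
    for key_node, value_node in node.value:
        key = loader.construct_object(key_node, deep=deep)
        value = loader.construct_object(value_node, deep=deep)
        mapping[key].append(value)
    return dict(mapping)  # plain dict, so missing keys still raise KeyError


# Replace the handler for the standard mapping tag on the subclass only.
PreserveDuplicatesLoader.add_constructor("tag:yaml.org,2002:map", construct_multimap)

data = yaml.load(YAML_TEXT, Loader=PreserveDuplicatesLoader)
print(data)
# {'build': [{'step': ['step1']}, {'step': ['step2']}]}
```

Every value ends up wrapped in a list, even for keys that occur once, which keeps the shape of the result predictable. The relative order of entries with different keys is not preserved beyond dict insertion order.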
If you can modify the input data very slightly, you should be able to do this by converting the single YAML-like file into multiple YAML documents. YAML documents can live in the same file if they are separated by --- on a line by itself, and your entries handily appear to be separated by two consecutive newlines, which can be replaced with that separator.
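A short sketch of that idea with PyYAML, assuming the entries really are separated by exactly one blank line and that no blank lines occur inside an entry. An inline string is used in place of the file so the example is self-contained:

```python
import yaml  # PyYAML

YAML_TEXT = """\
build:
  step: 'step1'

build:
  step: 'step2'
"""

# Turn each blank line into a document separator.
as_documents = YAML_TEXT.replace("\n\n", "\n---\n")

# safe_load_all() yields one Python object per YAML document.
documents = list(yaml.safe_load_all(as_documents))
for document in documents:
    print(document)
# {'build': {'step': 'step1'}}
# {'build': {'step': 'step2'}}
```

Each document is an ordinary dict, so no custom constructor is needed; the duplicates are kept apart by living in separate documents.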