PyYaml parses '9:00' as int

Posted 2019-06-26 00:03

Question:

I have a file with the following data:

classes:
  - 9:00
  - 10:20
  - 12:10

(and so on up to 21:00)

I use Python 3 and the yaml module to parse it. Specifically, the code is config = yaml.load(open(filename, 'r')). But when I print config, I get the following output for this part of the data:

'classes': [540, 630, 730, 820, 910, 1000, 1090, 1180],

The values in the list are ints.

Previously, when I used Python 2 (with BaseLoader for YAML), I got the values as strings, and I use them as such. BaseLoader is no longer acceptable, since I want to read Unicode strings from the file and it gives me byte strings.

So, first, why does PyYAML parse my data as ints?

And, second, how do I prevent PyYAML from doing this? Is it possible without changing the data file (e.g. without adding !!str)?

Answer 1:

You should probably check the YAML documentation.

Colons are used for mapping values.

I presume you want a string and not an integer, so you should double quote your strings.



Answer 2:

The documentation of YAML is a bit difficult to "parse" so I can imagine you missed this little bit of info about colons:

Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps. It is also a potential source for confusion as “a:1” is a plain scalar and not a key: value pair.

What you have in your input is a sexagesimal (base-60) integer: your 9:00 is read much like 9 minutes and 0 seconds, for a total of 9 * 60 + 0 = 540 seconds.
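The base-60 arithmetic can be reproduced by hand. The following is only a sketch of that conversion, not the library's own code; the helper name sexagesimal_to_int is made up for illustration:

```python
def sexagesimal_to_int(text):
    """Read 'a:b:c' as base-60 digits (hypothetical helper, for illustration)."""
    sign = -1 if text.startswith("-") else 1
    value = 0
    # Each colon-separated part is one base-60 digit, most significant first.
    for part in text.lstrip("+-").split(":"):
        value = value * 60 + int(part)
    return sign * value


for raw in ["9:00", "10:20", "12:10"]:
    print(raw, "->", sexagesimal_to_int(raw))
```

Under this reading, 10:20 becomes 10 * 60 + 20 = 620, which matches the SafeLoader output shown further down.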

Unfortunately this doesn't get constructed as some special Sexagesimal instance that can be used for calculations as if it were an integer but can be printed in its original form. Therefore, if you want to use this as a string internally you have to single quote them:

classes:
  - '9:00'
  - '10:20'
  - '12:10'

which is what you would get if you dump {'classes': ['9:00', '10:20', '12:10']} (note that the unambiguous key classes does not get any quotes).

That the BaseLoader gives you strings is not surprising. The BaseConstructor that is used by the BaseLoader handles any scalar as string, including integers, booleans and "your" sexagesimals:

import ruamel.yaml as yaml

yaml_str = """\
classes:
  - 12345
  - 10:20
  - abc
  - True
"""

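# BaseLoader: every scalar stays a string
data = yaml.load(yaml_str, Loader=yaml.BaseLoader)
print(data)
# SafeLoader: implicit resolvers turn scalars into int, bool, etc.
data = yaml.load(yaml_str, Loader=yaml.SafeLoader)
print(data)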

gives:

{u'classes': [u'12345', u'10:20', u'abc', u'True']}
{'classes': [12345, 620, 'abc', True]}

If you really don't want to use quotes, then you have to "reset" the implicit resolver for scalars that start with numbers:

import ruamel.yaml as yaml
from ruamel.yaml.resolver import Resolver
import re

yaml_str = """\
classes:
  - 9:00
  - 10:20
  - 12:10
"""

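# Drop every implicit resolver keyed on a leading sign or digit. Note that this
# also drops the float and timestamp resolvers registered for those characters.
for ch in list(u'-+0123456789'):
    del Resolver.yaml_implicit_resolvers[ch]
# Re-register only the int resolver, with the sexagesimal alternative left out.
Resolver.add_implicit_resolver(
    u'tag:yaml.org,2002:int',
    re.compile(u'''^(?:[-+]?0b[0-1_]+
    |[-+]?0o?[0-7_]+
    |[-+]?(?:0|[1-9][0-9_]*)
    |[-+]?0x[0-9a-fA-F_]+)$''', re.X),  # <- copy from resolver.py without sexagesimal support
    list(u'-+0123456789'))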

data = yaml.load(yaml_str, Loader=yaml.SafeLoader)
print(data)

gives you:

{'classes': ['9:00', '10:20', '12:10']}
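To see why dropping the sexagesimal alternative is enough, the two patterns can be compared using only the standard re module. This is a rough sketch: YAML11_INT is assumed to mirror the YAML 1.1 style int pattern (its last alternative is the sexagesimal one), and the names YAML11_INT, INT_NO_SEXAGESIMAL and is_int are invented for this example:

```python
import re

# Assumed YAML 1.1 style int pattern; the last alternative accepts 9:00, 10:20, ...
YAML11_INT = re.compile(r'''^(?:[-+]?0b[0-1_]+
    |[-+]?0[0-7_]+
    |[-+]?(?:0|[1-9][0-9_]*)
    |[-+]?0x[0-9a-fA-F_]+
    |[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+)$''', re.X)

# The reduced pattern from the answer above: same idea, no sexagesimal alternative.
INT_NO_SEXAGESIMAL = re.compile(r'''^(?:[-+]?0b[0-1_]+
    |[-+]?0o?[0-7_]+
    |[-+]?(?:0|[1-9][0-9_]*)
    |[-+]?0x[0-9a-fA-F_]+)$''', re.X)


def is_int(pattern, scalar):
    """Return True if the plain scalar would be resolved as an int by this pattern."""
    return pattern.match(scalar) is not None


for scalar in ["9:00", "10:20", "12345", "abc"]:
    print(scalar, is_int(YAML11_INT, scalar), is_int(INT_NO_SEXAGESIMAL, scalar))
```

With the reduced pattern, 9:00 and 10:20 no longer match any int alternative, so they fall through to the default string type, while 12345 is still recognized as an int.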