UnicodeEncodeError when parsing month name with Py

Posted 2019-04-13 07:25

Question:

I have to parse a German date in MONTH YEAR format where MONTH is the full name of the month. I set the appropriate locale in Python and then try to parse the date with strptime. For example:

locale.setlocale(locale.LC_ALL, "deu_deu") # Locale name on Windows
datetime.strptime(dt, "%B %Y")

On encountering a month with a non-ASCII character in its name, I get a UnicodeEncodeError. The date is being pulled from an XML file delivered via a web service. Is there a way I can transform my date string so that it works with strptime?

EDIT

datetime.strptime(dt.encode("iso-8859-16"), "%B %Y") 

worked.

Answer 1:

No answer, just a test (on Unix, though):

>>> import locale, datetime
>>> locale.setlocale(locale.LC_ALL, "de_de")
>>> datetime.datetime.strptime("März 2012", "%B %Y")
datetime.datetime(2012, 3, 1, 0, 0)

The above works as expected. Now simulate unicode as input - März contains a LATIN SMALL LETTER A WITH DIAERESIS:

>>> datetime.datetime.strptime(u"M\u00E4rz 2012", "%B %Y")
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' ...

The same can be achieved with the built-in unicode function:

>>> datetime.datetime.strptime(unicode("März 2012", "utf-8"), "%B %Y")
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' ....

Now try with appropriate encoding:

>>> datetime.datetime.strptime(u"M\u00E4rz 2012".encode('utf-8'), "%B %Y")
datetime.datetime(2012, 3, 1, 0, 0)

Again, this is not on Windows - so not really an answer, but it may contain a hint.
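The hint from the session above can be sketched as a small helper that encodes unicode input before it reaches strptime. The name `to_native_str` is hypothetical, and using `locale.getpreferredencoding()` as the default is an assumption: it matched the locale's encoding in the Unix test above, but on Windows it may not reflect a locale set through `setlocale`, so the encoding can be passed explicitly.

```python
import locale
import sys


def to_native_str(text, encoding=None):
    """Return text as the interpreter's native str type (hypothetical helper).

    Python 2: unicode is encoded to a byte str, which is what strptime
    compares against the locale's month names.
    Python 3: str is returned unchanged and bytes are decoded.
    """
    if isinstance(text, str):
        return text  # already the native string type
    # Assumption: the preferred encoding matches the active locale.
    encoding = encoding or locale.getpreferredencoding()
    if sys.version_info[0] == 2:
        return text.encode(encoding)  # unicode -> byte str
    return text.decode(encoding)      # bytes -> str
```

With the German locale set, the call would then read datetime.datetime.strptime(to_native_str(dt), "%B %Y"), where dt is the date string from the question.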


Just to investigate a bit more - a scenario where one deals with an external data source (using JSON for this example, YMMV for XML):

I think a proper JSON encoder will give you unicode, and RFC 4627 seems to hint at that:

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase.

So to simulate that with Python (nobody would parse JSON that way, this is just a simulation):

>>> import json
>>> s = json.dumps({"date" : "März 2012"}).split(":")[1].replace(
        '"', "").replace("}", "").strip().decode("unicode_escape")
>>> # and sure enough ...
>>> datetime.datetime.strptime(s, "%B %Y")
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' ...

>>> # and again, with the right encoding ...
>>> datetime.datetime.strptime(s.encode("utf-8"), "%B %Y")
datetime.datetime(2012, 3, 1, 0, 0)
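The simulation above takes the JSON text apart by hand. A more conventional round trip, sketched here with only the standard json module, shows the same thing: the encoder escapes the umlaut as a \u sequence, and the decoder hands back a unicode string, which on Python 2 would still need encoding before strptime. The sample payload is made up for illustration.

```python
import json

# json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
# so the umlaut travels as the six-character sequence \u00e4.
encoded = json.dumps({"date": u"M\u00e4rz 2012"})

# json.loads turns the escape back into a single character and returns
# a unicode string (unicode on Python 2, str on Python 3).
decoded = json.loads(encoded)["date"]
```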


Answer 2:

The following code solves the problem.

locale.setlocale(locale.LC_ALL, "deu_deu") # Locale name on Windows
datetime.strptime(dt.encode("iso-8859-16"), "%B %Y")
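A likely reason why "iso-8859-16" works here while the Unix test needed "utf-8": the encoded bytes have to match the encoding in which the locale reports its month names. The sketch below only compares byte sequences. That the German Windows locale uses code page 1252 is an assumption, not something stated in the thread.

```python
month = u"M\u00e4rz"

# Three single-byte encodings: the one from the answer, the assumed
# Windows code page for German, and Latin-1.
single_byte = [month.encode(name)
               for name in ("iso-8859-16", "cp1252", "latin-1")]

# UTF-8, as used in the Unix test, needs two bytes for the umlaut.
utf8_bytes = month.encode("utf-8")

# True when all three single-byte encodings yield identical bytes.
same_bytes = all(b == single_byte[0] for b in single_byte)
```

All three single-byte encodings map the umlaut to the byte 0xE4, so for German month names they are interchangeable, while UTF-8 produces the two bytes 0xC3 0xA4 and would only match a UTF-8 locale.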