I have to parse a German date in MONTH YEAR format, where MONTH is the full name of the month. I set the appropriate locale in Python and then try to parse the date with strptime. For example:
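import locale
from datetime import datetime
locale.setlocale(locale.LC_ALL, "deu_deu") # Locale name on Windows
datetime.strptime(dt, "%B %Y") # dt is the unicode date string, e.g. u"M\u00e4rz 2012"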
On encountering a month with a non-ASCII character in its name, I get a UnicodeEncodeError. The date is being pulled from an XML file delivered via a web service. Is there a way I can transform my date string so that it works with strptime?
EDIT
datetime.strptime(dt.encode("iso-8859-16"), "%B %Y")
worked.
Not an answer, just a test (on Unix, though):
>>> import locale, datetime
>>> locale.setlocale(locale.LC_ALL, "de_de")
>>> datetime.datetime.strptime("März 2012", "%B %Y")
datetime.datetime(2012, 3, 1, 0, 0)
The above works as expected. Now simulate unicode as input - März contains a LATIN SMALL LETTER A WITH DIAERESIS:
>>> datetime.datetime.strptime(u"M\u00E4rz 2012", "%B %Y")
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' ...
The same can be achieved with the built-in unicode function:
>>> datetime.datetime.strptime(unicode("März 2012", "utf-8"), "%B %Y")
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' ....
Now try with appropriate encoding:
>>> datetime.datetime.strptime(u"M\u00E4rz 2012".encode('utf-8'), "%B %Y")
datetime.datetime(2012, 3, 1, 0, 0)
Again, this is not on Windows - so not really an answer, but it may contain a hint.
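The encode-before-parsing step from the test above could be wrapped in a small helper. This is only a sketch: `parse_month_year` and its `encoding` parameter are made-up names, and it assumes that `locale.getpreferredencoding()` matches the encoding the active locale uses for its month names, which may not hold on every platform (pass `encoding` explicitly in that case). On Python 3 strptime accepts text directly, so the helper only encodes under Python 2.

```python
import locale
import sys
from datetime import datetime


def parse_month_year(text, encoding=None):
    """Parse a 'MONTH YEAR' string using the month names of the active locale.

    Hypothetical helper, not part of any library. `encoding` is only used on
    Python 2, where a unicode object has to be turned into a byte string first.
    """
    if sys.version_info[0] == 2 and not isinstance(text, str):
        # Assumption: the locale's preferred encoding is the one its month
        # names are stored in; override via `encoding` if that is not the case.
        encoding = encoding or locale.getpreferredencoding()
        text = text.encode(encoding)
    return datetime.strptime(text, "%B %Y")
```

After a successful locale.setlocale(locale.LC_ALL, ...) call for a German locale, parse_month_year(u"M\u00e4rz 2012") should then behave like the encoded call shown above, provided the encoding assumption holds.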
Just to investigate a bit more, here is a scenario where one deals with an external data source (using JSON for this example, YMMV for XML):
I think a proper JSON encoder will give you unicode, and RFC 4627 seems to hint at that:
Any character may be escaped. If the character is in the Basic
Multilingual Plane (U+0000 through U+FFFF), then it may be
represented as a six-character sequence: a reverse solidus, followed
by the lowercase letter u, followed by four hexadecimal digits that
encode the character's code point. The hexadecimal letters A though
F can be upper or lowercase.
So to simulate that with Python (nobody would parse JSON that way; this is just a simulation):
>>> import json
>>> s = json.dumps({"date" : "März 2012"}).split(":")[1].replace(
...     '"', "").replace("}", "").strip().decode("unicode_escape")
>>> # and sure enough ...
>>> datetime.datetime.strptime(s, "%B %Y")
...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' ...
>>> # and again, with the right encoding ...
>>> datetime.datetime.strptime(s.encode("utf-8"), "%B %Y")
datetime.datetime(2012, 3, 1, 0, 0)
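For comparison, the same value can be obtained with the regular decoder instead of the string surgery above. A minimal sketch, where `raw` is a made-up payload standing in for what a service might deliver; json.loads resolves the \u escape and returns text (a unicode object on Python 2, a str on Python 3), so on Python 2 the value would still need the encode step before going to strptime.

```python
import json

# Hypothetical payload; the month name arrives as a \u escape, as RFC 4627 allows.
raw = '{"date": "M\\u00e4rz 2012"}'

# The decoder turns the six-character escape sequence into the single character U+00E4.
value = json.loads(raw)["date"]
```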
The following code solves the problem.
import locale
from datetime import datetime
locale.setlocale(locale.LC_ALL, "deu_deu") # Locale name on Windows
datetime.strptime(dt.encode("iso-8859-16"), "%B %Y") # dt is the unicode date string
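A possible reason why iso-8859-16 works here, offered as a guess rather than a verified explanation: for the letter ä it yields the same single byte as iso-8859-1 and cp1252, and the latter is presumably what a German Windows locale uses for its month names. The sketch below only compares the encoded bytes; the choice of codecs to compare is my assumption, not something stated in the question.

```python
month = u"M\u00e4rz"

# Encode the same month name with several single-byte codecs and with UTF-8.
single_byte = dict(
    (name, month.encode(name)) for name in ("iso-8859-16", "iso-8859-1", "cp1252")
)
utf8 = month.encode("utf-8")

# All three single-byte codecs agree on the byte for U+00E4; UTF-8 uses two bytes.
same_bytes = len(set(single_byte.values())) == 1
```

If that guess is right, the choice would start to matter for characters on which these codecs disagree, so encoding with the code page the locale actually uses seems like the safer option.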