I am reading a list of strings that were imported into an Excel XML file from another software program. I am not sure what the encoding of the Excel file is, but I am pretty sure it's not windows-1252, because when I try to use that encoding, I wind up with a lot of errors.
The specific word that is causing me trouble right now is: "Zmysłowska, Magdalena" (notice the "l" is not a standard "l", but rather, has a slash through it).
I have tried a few things; I'll mention three of them here:
(1)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
page = page.encode("utf-8", "ignore")
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
(2)
page = unicode(page, "utf-8")
page = unicodedata.normalize("NFKD", page)
Output: Zmys\u0142owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
Note: this is great, but I need to encode it back to utf-8 before putting the string into my db. When I do that, by running page.encode("utf-8", "ignore"), I end up with Zmysłowska, Magdalena again.
(3) Do nothing (no normalization, no decode, no encode). It seems like the string is already utf-8 when it comes in. However, when I do nothing, the string ends up with the following output again:
Output: Zmys\xc5\x82owska, Magdalena
Output after print statement: Zmysłowska, Magdalena
Is there a way for me to convert this string to utf-8?
Your problem isn't your encoding and decoding. Your code correctly takes a UTF-8 string and converts it to an NFKD-normalized UTF-8 string. (You might want to use page.decode("utf-8") instead of unicode(page, "utf-8"), just for future-proofing in case you ever go to Python 3, and to make the code a bit easier to read because the encode and decode calls are more obviously parallel, but you don't have to; the two are equivalent.)
Your actual problem is that you're printing UTF-8 strings to some context that isn't UTF-8. Most likely you're printing to the cmd window, which is defaulting to Windows-1252. So cmd tries to interpret the UTF-8 bytes as Windows-1252, and gets garbage.
There's a pretty easy way to test this. Make Python decode the UTF-8 string as if it were Windows-1252 and see if the resulting Unicode string looks like what you're seeing.
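A quick sketch of that check, using the bytes from the question as a literal. The variable names are illustrative rather than taken from the original code, and the snippet is written to run under both Python 2 and Python 3:

```python
# The UTF-8 bytes shown in the question's repr output.
page = b"Zmys\xc5\x82owska, Magdalena"

# Decode them the wrong way on purpose, the way a Windows-1252 console would.
misread = page.decode("windows-1252")

# 0xC5 turns into U+00C5 (A with ring) and 0x82 into U+201A (low single quote),
# so the one character "l with stroke" shows up as two unrelated characters.
print(repr(misread))
```

If what this produces matches the garbage on your screen, the data is fine and the console encoding is the culprit.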
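There are two ways around this: (1) print Unicode strings and let Python take care of the console encoding, or (2) print byte strings that you have already converted to the console's encoding.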
For option 1:
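A minimal sketch, assuming option 1 means handing print a Unicode string and letting Python encode it for the console. The byte literal stands in for the page variable from the question, and the fallback branch is an addition for consoles that cannot represent the character:

```python
page = b"Zmys\xc5\x82owska, Magdalena"   # UTF-8 bytes, as in the question

upage = page.decode("utf-8")             # now a Unicode string

try:
    # print encodes a Unicode string using the console's own encoding.
    print(upage)
except UnicodeEncodeError:
    # Raised when the console encoding lacks "l with stroke"
    # (Windows-1252 does not have it); show the escaped form instead.
    print(repr(upage))
```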
For option 2, it's going to be one of the following:
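A sketch of the likely candidates, assuming option 2 means encoding explicitly to whatever the console expects. The three encodings below are guesses at what that might be, and the "replace" error handler is an addition so the example does not raise on characters the target encoding lacks:

```python
import locale
import sys

page = b"Zmys\xc5\x82owska, Magdalena"   # UTF-8 bytes, as in the question
upage = page.decode("utf-8")

# Three plausible guesses at the console's encoding.
candidates = [
    "windows-1252",
    # stdout's encoding can be missing or None (e.g. piped output on Python 2).
    getattr(sys.stdout, "encoding", None) or "ascii",
    locale.getpreferredencoding(),
]

encoded = {}
for name in candidates:
    # "replace" substitutes "?" for characters the target encoding lacks.
    encoded[name] = upage.encode(name, "replace")
    print("%s -> %r" % (name, encoded[name]))
```

Note that Windows-1252 has no "l with stroke", so under that encoding the character can only come out as "?". A Central European code page such as Windows-1250 does include it.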
Of course, if you keep the intermediate Unicode string around, you don't need all those decode calls:
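A minimal sketch of that pattern: decode once, then encode separately for each destination. The names upage, for_console and for_db are hypothetical, and falling back to "ascii" when stdout reports no encoding is an assumption made for this example:

```python
import sys

page = b"Zmys\xc5\x82owska, Magdalena"   # UTF-8 bytes, as in the question

upage = page.decode("utf-8")             # decode once, keep the Unicode string

# Bytes for display only, in whatever encoding the console reports.
console_encoding = getattr(sys.stdout, "encoding", None) or "ascii"
for_console = upage.encode(console_encoding, "replace")

# Bytes for storage; this round-trips back to the original UTF-8.
for_db = upage.encode("utf-8")

print(repr(for_console))
print(repr(for_db))
```

The bytes that go to the database stay UTF-8 regardless of what the console can display, which is why the garbled print output says nothing about the stored data.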