I have an Excel spreadsheet that I'm reading in that contains some £ signs.
When I try to read it in using the xlrd module, I get the following error:
    x = table.cell_value(row, col)
    x = x.decode("ISO-8859-1")
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 0: ordinal not in range(128)
If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.
How can I fix this, and read the £ signs in correctly?
--- UPDATE ---
Some kind readers have suggested that I don't need to decode it at all, or that I can just encode it to Latin-1 when I need to. The problem with this is that I need to write the data to a CSV file eventually, and it seems to object to the raw strings.
If I don't encode or decode the data at all, then this happens (after I've added the string to an array called items):
    for item in items:
        #item = [x.encode('latin-1') for x in item]
        cleancsv.writerow(item)
      File "clean_up_barnet.py", line 104, in <module>
        cleancsv.writerow(item)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2022' in position 43: ordinal not in range(128)
I get the same error even if I uncomment the Latin-1 line.
xlrd works with Unicode, so the string you get back is a Unicode string. The £-sign has code point U+00A3, so the representation of said string should be u'\xa3'. This has been read in correctly; it is the string that you should be working with throughout your program.
When you write this (abstract, Unicode) string somewhere, you need to choose an encoding. At that point, you should .encode it into that encoding, say latin-1.
Look closely: you got a UnicodeEncodeError (an encode error) from calling the decode method.
The reason for this is that decode is intended to convert from a byte sequence (str) to a unicode object. But, as John said, xlrd already uses Unicode strings, so x is already a unicode object.
In this situation, Python 2.x assumes that you meant to decode a str object, so it "helpfully" creates one for you. But in order to convert a unicode to a str, it needs an encoding, and chooses ASCII because it's the lowest common denominator of character encodings. Your code effectively gets interpreted as
    x = x.encode('ascii').decode("ISO-8859-1")
which fails because x contains a non-ASCII character.
Since x is already a unicode object, the decode is unnecessary. However, now you run into the problem that the Python 2.x csv module doesn't support Unicode. You have to convert your data to str objects, as the commented-out line in your loop does:
    item = [x.encode('latin-1') for x in item]
This would be correct, except that you have the • character (U+2022 BULLET) in your data, and Latin-1 can't represent it. There are several ways around this problem:
- Write x.encode('latin-1', 'ignore') to remove the bullet (or other non-Latin-1 characters).
- Write x.encode('latin-1', 'replace') to replace the bullet with a question mark.
- Replace the bullet with a Latin-1 character such as * or ·.
- Use an encoding that can represent every character in your data. These days, UTF-8 is widely supported, so there is little reason to use any other encoding for text files.
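The workarounds listed above can be sketched roughly as follows. The sample string is made up for illustration, and the snippet is written so that it also runs on Python 3 (where text literals are already Unicode); on Python 2 the same calls apply to unicode objects.

```python
# Hypothetical sample text: a bullet (U+2022) followed by a pound sign (U+00A3).
text = u"\u2022 price \xa3"

# Strict encoding (the default) fails, because Latin-1 has no bullet.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'ignore' silently drops characters the codec cannot represent.
ignored = text.encode("latin-1", "ignore")

# 'replace' substitutes a question mark for each such character.
replaced = text.encode("latin-1", "replace")

# Swapping the bullet for a Latin-1 character before encoding.
swapped = text.replace(u"\u2022", u"*").encode("latin-1")

# UTF-8 can represent every character, so nothing is lost.
as_utf8 = text.encode("utf-8")
```

Note that 'ignore' and 'replace' both lose information, so they are only reasonable when the dropped characters do not matter to whoever reads the file.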
A very easy way around all the "'ascii' codec can't encode character…" issues with csv.writer is to use unicodecsv instead, a drop-in replacement for the csv module.
Install unicodecsv with pip and then you can use it in exactly the same way as csv.
Your code snippet says x.decode, but you're getting an encode error -- meaning x is Unicode already, so, to "decode" it, it must be first turned into a string of bytes (and that's where the default codec, ascii, comes up and fails). In your text then you say "if I rewrite it to x.encode"... which seems to imply that you do know x is Unicode.
So what IS it you're doing -- and what is it you mean to be doing -- encoding a unicode x to get a coded string of bytes, or decoding a string of bytes into a unicode object?
I find it unfortunate that you can call encode on a byte string, and decode on a unicode object, because it seems to lead users to nothing but confusion... but at least in this case you seem to manage to propagate the confusion (at least to me;-).
If, as it seems, x is unicode, then you never want to "decode" it -- you may want to encode it to get a byte string with a certain codec, e.g. latin-1, if that's what you need for some kind of I/O purposes (for your own internal program use I recommend sticking with unicode all the time -- only encode/decode if and when you absolutely need, or receive, coded byte strings for input / output purposes).
Working with xlrd, I have in a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my German words. But changing it into ...xl_data.find(cell.value)... gives no error. So I suppose that wrapping xlrd values in str() before passing them to certain calls causes these encoding problems.
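For readers who want to see the mechanism, here is a rough sketch. Python 2 performs an ASCII encode step implicitly when you call decode or str() on a unicode object; the sketch spells that step out explicitly so that it also runs on Python 3. The sample value is made up for illustration.

```python
# Hypothetical cell value containing a non-ASCII character (U+00DF).
value = u"Stra\xdfe"

# The step Python 2 performs behind the scenes for value.decode(...)
# or str(value): encode to bytes with the ASCII codec.
try:
    value.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Choosing a codec explicitly that can represent the character works.
explicit = value.encode("latin-1")
```

Leaving the value as unicode, as in the find(cell.value) call above, avoids the implicit step altogether.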
For what it's worth: I'm the author of xlrd.
Does xlrd produce unicode?
Option 1: Read the Unicode section at the bottom of the first screenful of the xlrd doc: "This module presents all text strings as Python unicode objects."
Option 2: print type(text), repr(text)
You say """If I rewrite this to x.encode('utf-8') it stops throwing an error, but unfortunately when I then write the data out somewhere else (as latin-1), the £ signs have all become garbled.""" Of course if you write UTF-8-encoded text to a device that's expecting latin1, it will be garbled. What did you expect?
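A small sketch of that garbling, assuming the consumer decodes the bytes as Latin-1. It runs on Python 3 as written; on Python 2 the same calls apply to a unicode object.

```python
pound = u"\xa3"  # POUND SIGN

# UTF-8 represents the pound sign as two bytes: 0xC2 0xA3.
utf8_bytes = pound.encode("utf-8")

# A consumer expecting Latin-1 reads each byte as one character,
# so it sees U+00C2 followed by U+00A3 instead of a lone pound sign.
misread = utf8_bytes.decode("latin-1")

# Encoding with the codec the consumer expects gives a single byte.
latin1_bytes = pound.encode("latin-1")
```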
You say in your edit: """I get the same error even if I uncomment the Latin-1 line""". This is very unlikely -- much more likely is that you got a slightly different error (mentioning the latin1 codec instead of the ascii codec) in a different source line (the uncommented latin1 line instead of the writerow line). Reading error messages carefully aids understanding.
Your problem here is that in general your data is NOT encodable in latin1; very little real-world data is. Your POUND SIGN is encodable in latin1, but that's not all your non-ASCII data. The problematic character is U+2022 BULLET which is not encodable in latin1.
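One way to find out which characters are causing trouble before choosing an encoding is sketched below. The helper name unencodable and the sample string are hypothetical, not part of xlrd or the csv module.

```python
def unencodable(text, encoding):
    """Return the distinct characters in text that encoding cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


# Hypothetical sample: a pound sign and two bullets.
sample = u"\xa3 \u2022 \u2022"
problems_latin1 = unencodable(sample, "latin-1")
problems_utf8 = unencodable(sample, "utf-8")
```

Running something like this over every cell would show whether the bullet is the only character Latin-1 cannot handle.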
It would have helped you get a better answer sooner if you had mentioned up front that you were working on Mac OS X ... the usual suspect for a CSV-suitable encoding is cp1252 (Windows), not mac-roman.
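To illustrate why the choice matters, the following sketch compares the bytes that a few codecs produce for the two characters in question. It only uses Python's built-in codecs; which encoding the program reading the CSV actually expects is something you would have to check for yourself.

```python
pound, bullet = u"\xa3", u"\u2022"

# Map each codec name to the bytes it produces for (pound, bullet).
encoded = {}
for name in ("cp1252", "mac_roman", "utf-8"):
    encoded[name] = (pound.encode(name), bullet.encode(name))
```

Both cp1252 and mac-roman can represent the bullet, unlike Latin-1, but they use different byte values for it, so a file written with one and read with the other will show the wrong character.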