可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效,请关闭广告屏蔽插件后再试):
问题:
I am receiving an Excel file whose content I cannot influence. It contains some Unicode characters like "á" or "é".
My code has been unchanged, but I migrated from Eclipse Juno to LiClipse together to a migration to a different python package (2.6 from 2.5). In principle the specific package I am using has a working version on win32com package.
When I read the Excel file my code is crashing when extracting and converting to to strings using str(). The console output is the following:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 89: ordinal not in range(128)
Being more concrete I perform the following:
Read the Excel:
xlApp = Dispatch("Excel.Application")
excel = xlApp.Workbooks.Open(excel_location)
in an internal loop I extract the value of the cell:
cell_value = self.excel.ActiveSheet.Cells(excel_line + 1, excel_column + 1)
and finally, if I try to convert cell_value to str, crashes:
print str(cell_value)
If I go to the Excel and remove the non-ASCII characters everything is working smoothly. I have tried this encode proposal. Any other solution I have googled proposes saving the file in a specific format, that I can't do.
What puzzles me is that the code was working before with the same input Excel but this change to LiClipse and 2.6 Python killed everything.
Any idea how can I progress?
回答1:
This is a common problem when working with UTF-8 encoded Unicode data in Python 2.x. The handling of this has changed in a few places between 2.4 and 2.7, so it's no surprise that you suddenly get an error.
The source of the error is print
: In Python 2.x, print
doesn't try to assume what encoding your terminal supports. It just plays save and assumes that ascii
is the only supported charset (which means characters between 0 and 127 are fine, everything else gives an error).
Now you convert a COMObject
to a string. str
is just a bunch of bytes (values 0 to 255) as far as Python 2.x is concerned. It doesn't have an encoding.
Combining the two is a recipe for trouble. When Python prints, it tries to validate the input (the string) and suddenly finds UTF-8 encoded characters (UTF-8 adds these odd \xe1
markers which tells the decoder that the next byte is special in some way; check Wikipedia for the gory details).
That's when the ascii
encoder says: Sorry, can't help you there.
That means you can work with this value, compare it and such, but you can't print
it. A simple fix for the printing problem is:
s = str(cell_value) # Convert COM -> UTF-8 encoded string
print repr(s) # repr() converts anything to ascii
If your terminal supports UTF-8, then you need to tell Python about it:
import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
You should also have a look at sys.stdout.encoding
which tells what Python currently thinks the output encoding is/should be. When Python 2 is properly configured (like on modern Linux distributions), then the correct codec for output should be used automatically.
Related:
- Python 2 Unicode howto
- Pragmatic Unicode, or, How do I stop the pain?
- Setting the correct encoding when piping stdout in Python
回答2:
.Cells(row,col)
returns a Range
object. You probably want the text from the cell:
cell = xl.ActiveSheet.Cells(1,2).Text
or
cell = xl.ActiveSheet.Range('B1').Text
The resulting value will be a Unicode string. To convert to bytes that you can write to a file, use .encode(encoding)
, for example:
bytes = cell.encode('utf8')
The below example uses the following spreadsheet:
import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
xl.Workbooks.Open(r'book1.xlsx')
cell = xl.ActiveSheet.Cells(1,2)
cell_value = cell.Text
print repr(cell)
print repr(cell_value)
print cell_value
Output (Note, Chinese will only print if console/IDE supports the characters):
<win32com.gen_py.Microsoft Excel 14.0 Object Library.Range instance at 0x129909424>
u'\u4e2d\u56fd\u4eba'
中国人
回答3:
What is described here is a hack, you should not use as a long term
solution. Looking at the comments it could crush the terminal.
Finally I found a solution helped by the suggestion that @Huan-YuTseng provided, probably the solutions offered by other might work in other context but not in this one.
So, what happened is that I migrated from Eclipse Juno version (as Pydev stopped working due to Java upgrade needed that I can't accomplish in this computer) to LiClipse direct package (I did not upgraded a downloaded Eclipse version).
By default, in my LiClipse version (1.4.0.201502042042) the Console output is not by default utf-8. So I needed to change the output from either LiClipse or using my code. Fourtunately, there was another question related to a similar problem that helped me. You can see more details here, but essentially what you need to do is to include at the begginning of your code the following code:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
And everything works. In the answers from @AarongDigulla the solution is there, but is actually the very last solution.
However, I need to say that LiClipse is giving me an error on sys.setdefaultencoding statement, that during execution is not creating any issue... no idea what's happening. That stopped me testing this solution before. Maybe there is something wrong in LiClipse (is alowing me to execute code with errors!)
回答4:
Use 'utf-8 BOM' which in python used as utf_8_sig for Unicode character & also to avoid irrelevant results in Excel sheet.