ElementTree and unicode

I have this character (è) in an XML file:

<data>
  <products>
      <color>fumè</color>
  </products>
</data>

I try to generate an instance of ElementTree with the following code:

string_data = open('file.xml')
x = ElementTree.fromstring(unicode(string_data.encode('utf-8')))

and I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 185: ordinal not in range(128)

(NOTE: The position is not exact, I sampled the xml from a larger one).

How can I solve it? Thanks.

Tags: python unicode encoding utf-8 elementtree

6 Answers

爱情/是我丢掉的垃圾

#2 · 2019-01-18 03:01

open() does not return a string; it returns a file object. Use open('file.xml').read() instead.
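
A minimal sketch of that fix, assuming the file is UTF-8 encoded. The temporary file and the name `path` are made up for illustration and stand in for file.xml; opening in binary mode hands ElementTree the raw bytes so that it does the decoding itself.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Write a small UTF-8 sample file (hypothetical stand-in for file.xml).
xml_bytes = u'<data><products><color>fum\xe8</color></products></data>'.encode('utf-8')
fd, path = tempfile.mkstemp(suffix='.xml')
with os.fdopen(fd, 'wb') as f:
    f.write(xml_bytes)

# open() returns a file object; .read() returns the file's contents.
with open(path, 'rb') as f:
    string_data = f.read()

# Pass the undecoded bytes straight to fromstring().
root = ET.fromstring(string_data)
color = root[0][0].text

os.remove(path)
```

No call to unicode(), encode() or decode() is needed on the way in.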


Fickle 薄情

#3 · 2019-01-18 03:03

Most likely your file is not UTF-8. The è character may come from some other encoding, Latin-1 for example.
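
One way to check this, sketched under the assumption that the file is either UTF-8 or Latin-1 (the sample document and variable names are illustrative, not taken from the asker's real file): Latin-1 bytes without an encoding declaration are rejected by the parser, while the same bytes with a declaration parse fine.

```python
import xml.etree.ElementTree as ET

text = u'<data><products><color>fum\xe8</color></products></data>'
latin1_bytes = text.encode('latin-1')   # e-grave becomes the single byte 0xE8
utf8_bytes = text.encode('utf-8')       # e-grave becomes the two bytes 0xC3 0xA8

# With no declaration the parser assumes UTF-8, so UTF-8 bytes are fine.
utf8_color = ET.fromstring(utf8_bytes)[0][0].text

# The Latin-1 bytes are not valid UTF-8 and fail to parse.
try:
    ET.fromstring(latin1_bytes)
    latin1_parsed = True
except ET.ParseError:
    latin1_parsed = False

# Declaring the real encoding in the XML prolog makes them parse.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_bytes
color = ET.fromstring(declared)[0][0].text
```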


(This account has been banned)

#4 · 2019-01-18 03:04

Have you tried using the parse function instead of opening the file yourself? (Opening it would, by the way, require a .read() before .fromstring() can work.)

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
# etc...


我欲成王，谁敢阻挡

#5 · 2019-01-18 03:05

You might have stumbled upon this problem while using Requests (HTTP for Humans). response.text decodes the response by default; use response.content to get the undecoded bytes so that ElementTree can decode them itself. Just remember to use the correct encoding.

More info: http://docs.python-requests.org/en/latest/user/quickstart/#response-content
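
A rough sketch of the idea. To keep it runnable without a network call, `FakeResponse` is a hypothetical stand-in that only mimics the `content` and `text` attributes of a Requests response, and `parse_xml_response` is an illustrative helper name, not part of Requests.

```python
import xml.etree.ElementTree as ET


class FakeResponse(object):
    """Hypothetical stand-in exposing the two attributes discussed above."""

    def __init__(self, raw, encoding):
        self.content = raw                  # undecoded bytes
        self.text = raw.decode(encoding)    # already-decoded text


def parse_xml_response(response):
    # Hand the raw bytes to the parser so it applies the encoding itself.
    return ET.fromstring(response.content)


body = u'<?xml version="1.0" encoding="UTF-8"?><data><color>fum\xe8</color></data>'.encode('utf-8')
root = parse_xml_response(FakeResponse(body, 'utf-8'))
color = root[0].text
```

With a real response object the same helper would be called on the result of requests.get(...).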


闹够了就滚

#6 · 2019-01-18 03:10

You need to decode utf-8 strings into a unicode object. So

string_data.encode('utf-8')

should be

string_data.decode('utf-8')

assuming string_data is actually a UTF-8 string.

So to summarize: to get a UTF-8 string from a unicode object, you encode the unicode object (using the UTF-8 encoding); to turn a string into a unicode object, you decode the string using the respective encoding.
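
The two directions can be sketched as a round trip. The variable names are illustrative; the `u''` and `b''` literals are used so the snippet behaves the same on Python 2 and Python 3.

```python
text = u'fum\xe8'                     # text (a unicode object in Python 2)

encoded = text.encode('utf-8')        # text -> bytes
decoded = encoded.decode('utf-8')     # bytes -> text, same codec

# Decoding with the wrong codec raises no error here but yields garbage.
wrong = encoded.decode('latin-1')
```

Here `encoded` is the byte string `'fum\xc3\xa8'`, `decoded` equals the original text again, and `wrong` is the five-character string `u'fum\xc3\xa8'`.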

For more details on the concepts I suggest reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (not Python specific).


霸刀☆藐视天下

#7 · 2019-01-18 03:28

You do not need to decode XML for ElementTree to work. XML carries its own encoding information (defaulting to UTF-8) and ElementTree does the work for you, outputting unicode:

>>> data = '''\
... <data>
...   <products>
...       <color>fumè</color>
...   </products>
... </data>
... '''
>>> x = ElementTree.fromstring(data)
>>> x[0][0].text
u'fum\xe8'

If your data is contained in a file (or file-like) object, just pass the filename or file object directly to the ElementTree.parse() function:

x = ElementTree.parse('file.xml')
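
A small sketch of the file-object variant, assuming UTF-8 data. `io.BytesIO` is used here only as a stand-in for an open binary file so that the example runs without touching the disk.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = u'<data><products><color>fum\xe8</color></products></data>'.encode('utf-8')

# parse() accepts a filename or any object with a read() method.
tree = ET.parse(io.BytesIO(xml_bytes))
root = tree.getroot()

# The parser has already decoded the text for us.
color = root.find('products/color').text
```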

