ParseError: not well-formed (invalid token) using

I receive XML strings from an external source that can contain unsanitized user-contributed content.

The following XML string gave a ParseError in cElementTree:

>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    ET.XML(s)
  File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17

Is there a way to make cElementTree not complain?

Tags: python parsing elementtree

9 answers

一纸荒年 Trace。

#2 · 2019-01-26 04:55

It seems to complain about \x08; you will need to escape that.

Edit:

Or, with lxml, you can have the parser ignore the errors by passing recover=True:

from lxml import etree
parser = etree.XMLParser(recover=True)
etree.fromstring(xmlstring, parser=parser)


劳资没心，怎么记你

#3 · 2019-01-26 04:56

Here is a gotcha that caught me when using Python's ElementTree. This raises the invalid token error:

# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

xml = u"""<?xml version='1.0' encoding='utf8'?>
<osm generator="pycrocosm server" version="0.6"><changeset created_at="2017-09-06T19:26:50.302136+00:00" id="273" max_lat="0.0" max_lon="0.0" min_lat="0.0" min_lon="0.0" open="true" uid="345" user="john"><tag k="test" v="Съешь же ещё этих мягких французских булок да выпей чаю" /><tag k="foo" v="bar" /><discussion><comment data="2015-01-01T18:56:48Z" uid="1841" user="metaodi"><text>Did you verify those street names?</text></comment></discussion></changeset></osm>"""

xmltest = ET.fromstring(xml.encode("utf-8"))

However, it works with the addition of a hyphen in the encoding type:

<?xml version='1.0' encoding='utf-8'?>

Most odd. Someone found this footnote in the Python docs:

The encoding string included in XML output should conform to the appropriate standards. For example, “UTF-8” is valid, but “UTF8” is not.
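If you cannot change the producer of the XML, one workaround is to rewrite the declaration before parsing. The following is only a sketch, assuming Python 3 and UTF-8 encoded input bytes; normalize_utf8_declaration is a hypothetical helper name, not part of ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# Matches encoding="utf8" (either quote style) inside a leading XML declaration.
_UTF8_DECLARATION = re.compile(
    rb"""^(<\?xml[^>]*\bencoding=["'])utf8(["'])""", re.IGNORECASE
)

def normalize_utf8_declaration(xml_bytes):
    """Hypothetical helper: rewrite a declared 'utf8' to the standard 'utf-8'."""
    return _UTF8_DECLARATION.sub(rb"\g<1>utf-8\g<2>", xml_bytes, count=1)

doc = (
    "<?xml version='1.0' encoding='utf8'?>"
    "<osm><tag k=\"test\" v=\"Съешь\" /></osm>"
).encode("utf-8")

root = ET.fromstring(normalize_utf8_declaration(doc))
print(root.find("tag").get("v"))
```

Documents that already declare 'utf-8', or that have no declaration at all, pass through unchanged.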


Emotional °昔

#4 · 2019-01-26 04:57

I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value from a single XML node, I gave in and wrote my own function to do so:

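import re

def ParseXmlTagContents(source, tag, tagContentsRegex):
    # Build the literal opening and closing tags, e.g. <childNode> and </childNode>.
    openTagString = "<"+tag+">"
    closeTagString = "</"+tag+">"
    found = re.search(openTagString + tagContentsRegex + closeTagString, source)
    if found:
        # span() gives the start and end of the whole match, tags included.
        start, end = found.span()
        # Strip the tags so that only the contents are returned.
        return source[start+len(openTagString):end-len(closeTagString)]
    return ""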

Example usage would be:

<?xml version="1.0" encoding="utf-16"?>
<parentNode>
    <childNode>123</childNode>
</parentNode>

ParseXmlTagContents(xmlString, "childNode", "[0-9]+")


傲

#5 · 2019-01-26 05:07

I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)

import xml.etree.ElementTree as ET
parser = ET.XMLParser(encoding="utf-8")
tree = ET.fromstring(xmlstring, parser=parser)

EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...
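For what it is worth, here is a small sketch of the same idea under Python 3. It assumes the input is a bytes object in ISO-8859-1 with no XML declaration (sample data made up for illustration), so the parser would otherwise assume UTF-8. The encoding argument of XMLParser overrides that assumption.

```python
import xml.etree.ElementTree as ET

# Made-up sample: "café" encoded as ISO-8859-1, with no XML declaration.
latin1_bytes = b"<a>caf\xe9</a>"

# Tell the parser which encoding the bytes are really in.
parser = ET.XMLParser(encoding="iso-8859-1")
element = ET.fromstring(latin1_bytes, parser=parser)
print(element.text)
```

Note that this only helps when the error really is an encoding mismatch; it does nothing for characters such as \x08 that XML forbids outright.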


贼婆χ

#6 · 2019-01-26 05:07

None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree as follows:

from bs4 import BeautifulSoup

with open("data/myfile.xml") as fp:
    soup = BeautifulSoup(fp, 'xml')

Then you can search the tree as:

soup.find_all('mytag')


仙女界的扛把子

#7 · 2019-01-26 05:09

See this answer to another question and the corresponding part of the XML spec.

The backspace U+0008 is an invalid character in XML 1.0 documents. It cannot occur plainly, and XML 1.0 does not accept it as the character reference &#8; either; only XML 1.1 permits that escaped form.

If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.
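A minimal sketch of that replacement, assuming Python 3 and that simply dropping the offending characters is acceptable for your data. The helper name strip_invalid_xml_chars is made up for illustration, and the pattern covers only the C0 control characters that XML 1.0 forbids (everything below U+0020 except tab, line feed and carriage return).

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 does not allow in a document.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_invalid_xml_chars(text):
    """Hypothetical helper: remove control characters XML 1.0 rejects."""
    return _INVALID_XML_CHARS.sub("", text)

s = "<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>"
element = ET.fromstring(strip_invalid_xml_chars(s))
print(repr(element.text))
```

If the control characters carry meaning for you, substitute a placeholder of your choice instead of the empty string.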

