I receive XML strings from an external source that can contain unsanitized user-contributed content.
The following XML string raised a ParseError in cElementTree:
>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    ET.XML(s)
  File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17
Is there a way to make cElementTree not complain?
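It seems to complain about \x08; you will need to escape that.
Edit: Or you can have the parser ignore the errors by using the recover option of lxml's XMLParser (recover=True). The standard-library ElementTree parser has no such option.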
Here is a gotcha that caught me when using Python's ElementTree... this raises the invalid token error:
However, it works with the addition of a hyphen in the encoding name:
Most odd. Someone found this footnote in the Python docs:
I tried the other solutions in the answers here but had no luck. Since I only needed to extract the value from a single XML node, I gave in and wrote my own function to do so:
Example usage would be:
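The function from this answer is not shown above. As a rough sketch of the idea only, assuming a regular-expression approach and a hypothetical helper named extract_node_value, it might look like this. It returns the raw text of the first matching element and does not handle nested elements of the same name or unescape entities.

```python
import re

def extract_node_value(xml_text, tag):
    """Return the raw text between the first <tag ...> and </tag>, or None.

    Hypothetical helper for illustration; it bypasses the XML parser
    entirely, so invalid characters such as \x08 do not raise an error.
    """
    pattern = r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(re.escape(tag))
    match = re.search(pattern, xml_text, re.DOTALL)
    return match.group(1) if match else None

s = "<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>"
value = extract_node_value(s, "Comment")
print(repr(value))
```

Because no parser is involved, the backspace characters are still present in the returned value and may need to be removed separately.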
I was having the same error (with ElementTree). In my case it was because of encodings, and I was able to solve it without having to use an external library. Hope this helps other people finding this question based on the title. (reference)
EDIT: Based on comments, this answer might be outdated. But this did work back when it was answered...
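The snippet for this answer is not shown above. A minimal sketch of what an encoding fix with the standard library alone can look like, assuming the input is a byte string in a known encoding with no XML declaration (the sample bytes and the encoding name here are assumptions for illustration):

```python
import xml.etree.ElementTree as ET

# Latin-1 encoded bytes without an XML declaration (illustrative input).
raw = b"<Comment>caf\xe9</Comment>"

# With no declared encoding the parser assumes UTF-8, and \xe9 is not
# valid UTF-8 here, so parsing fails with an "invalid token" error.
try:
    ET.fromstring(raw)
    default_ok = True
except ET.ParseError:
    default_ok = False

# Tell the parser explicitly which encoding the bytes use.
parser = ET.XMLParser(encoding="iso-8859-1")
root = ET.fromstring(raw, parser=parser)
print(default_ok, root.text)
```

This only helps when the error comes from an encoding mismatch; it does not make control characters such as \x08 acceptable.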
None of the above fixes worked for me. The only thing that worked was to use BeautifulSoup instead of ElementTree, as follows:
Then you can search the tree as:
See this answer to another question and the according part of the XML spec.
The backspace U+0008 is an invalid character in XML documents. XML 1.1 permits it only as an escaped character reference (&#8;), XML 1.0 does not permit it at all, and in neither version can it occur plainly.
If you need to process this XML snippet, you must replace \x08 in s before feeding it into an XML parser.
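One way to do that replacement, as a sketch: remove every control character that XML 1.0 disallows before parsing. The helper name strip_invalid_xml_chars is made up for this example, and it assumes that silently dropping those characters is acceptable for your data.

```python
import re
import xml.etree.ElementTree as ET

# Control characters outside the XML 1.0 Char production
# (tab \x09, newline \x0a and carriage return \x0d are allowed).
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_invalid_xml_chars(text):
    """Return text with XML 1.0-invalid control characters removed."""
    return _INVALID_XML_CHARS.sub("", text)

s = "<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>"
root = ET.fromstring(strip_invalid_xml_chars(s))
print(root.text)  # dddddddd_____
```

If the backspaces carry meaning, substitute a visible placeholder instead of the empty string in the sub() call.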