I receive xml strings from an external source that can contains unsanitized user contributed content.
The following xml string gave a ParseError in cElementTree
:
>>> print repr(s)
'<Comment>dddddddd\x08\x08\x08\x08\x08\x08_____</Comment>'
>>> import xml.etree.cElementTree as ET
>>> ET.XML(s)
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
ET.XML(s)
File "<string>", line 106, in XML
ParseError: not well-formed (invalid token): line 1, column 17
Is there a way to make cElementTree not complain?
This is most probably an encoding error. For example I had an xml file encoded in UTF-8-BOM (checked from the Notepad++ Encoding menu) and got similar error message.
The workaround (Python 3.6)
Check the encoding of your xml file. If it is using different encoding, change the 'utf-8-sig' accordingly.
What helped me with that error was Juan's answer - https://stackoverflow.com/a/20204635/4433222 But wasn't enough - after struggling I found out that an XML file needs to be saved with UTF-8 without BOM encoding.
The solution wasn't working for "normal" UTF-8.
I have been in stuck with similar problem. Finally figured out the what was the root cause in my particular case. If you read the data from multiple XML files that lie in same folder you will parse also .DS_Store file. Before parsing add this condition
This trick helped me as well