I have been trying to parse a file with xml.etree.ElementTree
:
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError
def analyze(xml):
it = ET.iterparse(file(xml))
count = 0
last = None
try:
for (ev, el) in it:
count += 1
last = el
except ParseError:
print("catastrophic failure")
print("last successful: {0}".format(last))
print('count: {0}'.format(count))
This is of course a simplified version of my code, but this is enough to break my program. I get this error with some files if I remove the try-catch block:
Traceback (most recent call last):
File "<pyshell#22>", line 1, in <module>
from yparse import analyze; analyze('file.xml')
File "C:\Python27\yparse.py", line 10, in analyze
for (ev, el) in it:
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
self._parser.feed(data)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
self._raiseerror(v)
File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
raise err
ParseError: reference to invalid character number: line 1, column 52459
The results are deterministic though, if a file works it will always work. If a file fails, it always fails and always fails at the same point.
The strangest thing is I'm using the trace to find out if I have any malformed XML that's breaking the parser. I then isolate the node that caused the failure. But when I create an XML file containing that node and a few of its neighbors, the parsing works!
This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.
Any ideas?
As @John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.
In fact, all of these entities appear in the text:
Most are not allowed. Looks like this parser is quite strict, you'll need to find another that is not so strict, or pre-process the XML.
I felt it might also be important to note here that you could rather easily catch your error and avoid having to completely stop your program by simply using what you're already using later on in the function, placing your statement:
inside a try & except bracket:
Of course, this will not fix your XML file or pre-processing technique, but could help in identifying which file (if you're parsing lots) is causing your error.
Here are some ideas:
(0) Explain "a file" and "occasionally": do you really mean it works sometimes and fails sometimes with the same file?
Do the following for each failing file:
(1) Find out what is in the file at the point that it is complaining about:
(2) Throw your file at a web-based XML validation service e.g. http://www.validome.org/xml/ or http://validator.aborla.net/
and edit your question to display your findings.
Update: Here is the minimal xml file that illustrates your problem:
Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.
You may wish to examine your files using regexes like
r'&#([0-9]+);'
andr'&#x([0-9A-Fa-f]+);'
, convert the matched text to an int ordinal and check against the valid list from the spec i.e.#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
... or maybe the numeric character reference is syntactically invalid e.g. not terminated by a
;
',&#not-a-digit
etc etcUpdate 2 I was wrong, the number in the ElementTree error message is counting Unicode code points, not bytes. See the code below and snippets from the output from running it over the two bad files.
Output:
I'm not sure if this answers your question, but if you want to use an exception with the ParseError raised by element tree, you would do this:
Source: http://effbot.org/zone/elementtree-13-intro.htm