I'm currently trying to iteratively parse a very large HTML document (I know... yuck) to reduce the amount of memory used. The problem I'm having is that I'm getting XML syntax errors such as:
lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59
This then causes everything to stop.
Is there a way to iteratively parse HTML without choking on syntax errors?
At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document, and then restarting the process. Seems like a pretty disgusting solution. Is there a better way?
Edit:
This is what I'm currently doing:
    import re
    from lxml import etree

    context = etree.iterparse(tfile, events=('start', 'end'), html=True)
    in_table = False
    header_row = True
    while context:
        try:
            event, el = context.next()

            # do something

            # remove old elements
            while el.getprevious() is not None:
                del el.getparent()[0]
        except etree.XMLSyntaxError, e:
            print e.msg
            lineno = int(re.search(r'line (\d+),', e.msg).group(1))
            remove_line(tfilename, lineno)
            tfile = open(tfilename)
            context = etree.iterparse(tfile, events=('start', 'end'), html=True)
        except KeyError:
            print 'oops keyerror'
Try parsing your HTML document with lxml.html:
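A minimal sketch of that approach, reusing tfilename from the question. Note that lxml.html.parse builds the whole tree in memory, so it trades the memory savings of iterparse for robustness:

    import lxml.html

    # lxml.html wraps libxml2's forgiving HTML parser, so markup that
    # raises XMLSyntaxError under plain etree will usually parse cleanly.
    tree = lxml.html.parse(tfilename)
    for row in tree.iter('tr'):
        print row.text_content()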
Use True for both of iterparse's arguments html and huge_tree.
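For example, keeping tfile from the question:

    from lxml import etree

    # Same iterparse call as above, with huge_tree=True added.
    # huge_tree lifts libxml2's limits on very deep or very large trees.
    context = etree.iterparse(tfile, events=('start', 'end'),
                              html=True, huge_tree=True)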
At the moment, lxml's etree.iterparse supports the keyword argument recover=True, so instead of writing a custom subclass of HTMLParser to fix broken HTML you can just pass this argument to iterparse.
To properly parse huge and broken HTML you only need to do the following:
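A minimal sketch, again assuming tfile is the open file object from the question:

    from lxml import etree

    # recover=True tells libxml2 to skip over errors such as
    # redefined attributes instead of raising XMLSyntaxError.
    context = etree.iterparse(tfile, events=('start', 'end'),
                              html=True, recover=True, huge_tree=True)
    for event, el in context:
        # process el here, then free the elements already seen
        while el.getprevious() is not None:
            del el.getparent()[0]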
The perfect solution ended up being Python's very own HTMLParser [docs]. This is the (pretty bad) code I ended up using:
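A sketch of that idea, with hypothetical class and handler names: a stdlib HTMLParser subclass that streams table cells out as parse events, never building a tree.

    from HTMLParser import HTMLParser

    class TableParser(HTMLParser):
        # Collects <tr>/<td> content from arbitrarily broken HTML
        # one event at a time, so memory use stays flat.
        def __init__(self):
            HTMLParser.__init__(self)
            self.in_cell = False
            self.row = []
            self.rows = []

        def handle_starttag(self, tag, attrs):
            if tag in ('td', 'th'):
                self.in_cell = True

        def handle_endtag(self, tag):
            if tag in ('td', 'th'):
                self.in_cell = False
            elif tag == 'tr' and self.row:
                self.rows.append(self.row)
                self.row = []

        def handle_data(self, data):
            if self.in_cell and data.strip():
                self.row.append(data.strip())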
With that code I could then do this:
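For instance, feeding the file through in chunks so the full document never sits in memory (the chunk size and variable names are illustrative):

    parser = TableParser()
    with open(tfilename) as tfile:
        for chunk in iter(lambda: tfile.read(8192), ''):
            parser.feed(chunk)
    parser.close()
    print parser.rows[:5]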