I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document along with its position (line, column). The position information is what is tripping me up here. And to be clear, I have no need to build an object tree. I simply want to find certain pieces of data and their position in the original document (think of a spell checker, for example: 'word "foo" at line x, column y, is misspelled').
As an example I want something like this (using ElementTree's Target API):
    import xml.etree.ElementTree as ET

    class EchoTarget:
        # NOTE: self.getpos() is the method I wish existed; it is not part of
        # the Target API. somecondition() is a placeholder for my own checks.
        def start(self, tag, attrib):
            if somecondition():
                print("start", tag, attrib, self.getpos())

        def end(self, tag):
            if somecondition():
                print("end", tag, self.getpos())

        def data(self, data):
            if somecondition():
                print("data", repr(data), self.getpos())

    target = EchoTarget()
    parser = ET.XMLParser(target=target)
    parser.feed("<p>some text</p>")
    parser.close()
However, as far as I can tell, the getpos() method (or something like it) doesn't exist. And, of course, that is using an XML parser. I want to parse potentially malformed HTML.
Interestingly, the HTMLParser class in the Python standard library does offer support for obtaining the location info (with a getpos() method), but it is horrible at handling malformed HTML and has been eliminated as a possible solution. I need to parse HTML that exists in the real world without breaking the parser.
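For reference, here is a minimal sketch of the kind of position reporting I mean, assuming Python 3, where the module is `html.parser` (it was named `HTMLParser` in Python 2). The class name `PositionEcho` and the helper `echo_positions` are hypothetical names of my own. `getpos()` returns a `(line, column)` tuple, with the line counted from 1 and the column from 0.

```python
from html.parser import HTMLParser


class PositionEcho(HTMLParser):
    """Record each parser event together with the position from getpos()."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports where the tag starts in the source text.
        self.events.append(("start", tag, self.getpos()))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, self.getpos()))

    def handle_data(self, data):
        # For text, getpos() reports where the text chunk begins.
        self.events.append(("data", data, self.getpos()))


def echo_positions(html):
    """Hypothetical helper: parse a string and return the recorded events."""
    parser = PositionEcho()
    parser.feed(html)
    parser.close()
    return parser.events
```

Calling `echo_positions("<p>some text</p>")` yields one entry per event, each carrying the position at which that event began. This is exactly the information I would like to get out of lxml or html5lib.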
I'm aware of two HTML parsers that would work well at parsing malformed HTML, namely lxml and html5lib. And in fact, I would prefer to use either one of them over any other options available in Python.
However, as far as I can tell, html5lib offers no event API and would require that the document be parsed to a tree object. Then I would have to iterate through the tree. Of course, by that point, there is no association with the source document and all location information is lost. So, html5lib is out, which is a shame because it seems like the best parser for handling malformed HTML.
The lxml library offers a Target API which mostly mirrors ElementTree's, but again, I'm not aware of any way to access location information for each event. A glance at the source code offered no hints either.
lxml also offers an API to SAX events. Interestingly, Python's standard lib mentions that SAX has support for Locator Objects, but offers little documentation about how to use them. This SO Question provides some info (when using a SAX Parser), but I don't see how that relates to the limited support for SAX events that lxml provides.
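As far as I understand the Locator mechanism, it looks roughly like the sketch below when used with the standard library's own `xml.sax` parser. This is XML only, so it does not solve the HTML problem, and I don't know whether lxml's SAX support honours it. The names `LocatingHandler` and `sax_positions` are hypothetical; `setDocumentLocator`, `getLineNumber` and `getColumnNumber` are the standard SAX names.

```python
import xml.sax


class LocatingHandler(xml.sax.ContentHandler):
    """Record SAX events with the position reported by the Locator."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.events = []

    def setDocumentLocator(self, locator):
        # The parser calls this once, before any other event.
        self.locator = locator

    def _pos(self):
        return (self.locator.getLineNumber(), self.locator.getColumnNumber())

    def startElement(self, name, attrs):
        self.events.append(("start", name, self._pos()))

    def endElement(self, name):
        self.events.append(("end", name, self._pos()))

    def characters(self, content):
        # Text may arrive in more than one chunk.
        self.events.append(("data", content, self._pos()))


def sax_positions(xml_text):
    """Hypothetical helper: parse an XML string and return the events."""
    handler = LocatingHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

Each recorded event carries whatever line and column the Locator reported at the time of the callback. Being a strict XML parser, it raises an exception on malformed input, which is precisely what I need to avoid.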
Finally, before anyone suggests Beautiful Soup, I will point out that, as stated on the home page, "Beautiful Soup sits on top of popular Python parsers like lxml and html5lib". All it gives me is an object to extract data from, with no connection to the original source document. As with html5lib, all location info is lost by the time I have access to the data. I want/need raw access to the parser directly.
To expand on the spell checker example I mentioned at the beginning, I would want to check the spelling only of words in the document text (not tag names or attributes) and may want to skip the content of specific tags (like the script or code tags). Therefore, I need a real HTML parser. However, I am only interested in the position of the misspelled words in the original source document for reporting purposes, and have no need to build a tree object. To be clear, this is only one example of a potential use. I may use it for something completely different, but the needs would be essentially the same. In fact, I once built something very similar using HTMLParser, but never used it because the error handling wasn't going to work for that use case. That was years ago, and I seem to have lost that file somewhere along the line. I'd like to use lxml or html5lib instead this time around.
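To make the requirement concrete, here is a rough sketch of the behaviour I am after. It uses the standard library parser only because that is the one that exposes `getpos()`, and it assumes Python 3. Everything named here (`SpellCheckParser`, `find_misspelled`, `SKIP_TAGS`, the `known_words` set standing in for a real dictionary) is hypothetical. Character references are left unconverted so that offsets within each text chunk still match the source.

```python
import re
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "code"}  # assumption: tags whose text is ignored
WORD_RE = re.compile(r"[A-Za-z']+")      # assumption: a crude notion of "word"


class SpellCheckParser(HTMLParser):
    """Collect (word, (line, column)) for words missing from known_words."""

    def __init__(self, known_words):
        # convert_charrefs=False keeps text chunks identical to the source.
        super().__init__(convert_charrefs=False)
        self.known_words = known_words
        self.skip_depth = 0
        self.misspelled = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        line, col = self.getpos()  # position where this text chunk starts
        for match in WORD_RE.finditer(data):
            word = match.group()
            if word.lower() in self.known_words:
                continue
            before = data[:match.start()]
            newlines = before.count("\n")
            if newlines:
                # Column restarts after the last newline inside the chunk.
                pos = (line + newlines, match.start() - before.rfind("\n") - 1)
            else:
                pos = (line, col + match.start())
            self.misspelled.append((word, pos))


def find_misspelled(html, known_words):
    """Hypothetical helper: return misspelled words with source positions."""
    parser = SpellCheckParser(known_words)
    parser.feed(html)
    parser.close()
    return parser.misspelled
```

Tag names, attributes, and the contents of the skipped tags never reach the word check, and no tree is built. What I want is the same thing on top of a parser that copes with real-world markup.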
So, is there something I'm missing? I have a hard time believing that none of these parsers (aside from the mostly useless HTMLParser) have any way to access the position information. But if they do it must be undocumented, which seems strange to me.