Can I parse an HTML file using an XML parser?
Why can (or can't) I do this? I know that XML is used to store data and that HTML is used to display data, but syntactically they are almost identical.
The intended use is to build an HTML parser that is part of a web crawler application.
You can try parsing an HTML file using an XML parser, but it’s likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don’t understand:
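Elements that never have end tags, such as <br>, <meta>, <link>, and <img> (also known as void elements).
Elements whose end tags can be omitted, such as <p>, <dt>, and <li> (their end tags can be implied).
Elements that can contain unescaped "<" characters; e.g., style, textarea, title, script; for example <script> if (a < b) … </script> and <title>Using the "<" operator</title>.
Attributes with unquoted values, such as <meta charset=utf-8>.
Empty attributes with no value at all, such as <input disabled>.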
An XML parser will fail to parse any HTML document that uses any of those features.
An HTML parser, on the other hand, will basically never fail no matter what a document contains.
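To make the contrast concrete, here is a minimal Python sketch that feeds one small snippet per feature to the standard library's XML parser (xml.etree.ElementTree) and to its lenient HTML tokenizer (html.parser). The snippets and the names SNIPPETS, xml_accepts, TagCollector, and html_tags are made up for illustration. Note that html.parser is only used here to show forgiving behaviour; it is not presented as a parser that conforms to the HTML standard.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical one-line examples of each HTML feature.
SNIPPETS = {
    "void element": "<p>line one<br>line two</p>",
    "implied end tag": "<ul><li>one<li>two</ul>",
    "raw < in script": "<script> if (a < b) {} </script>",
    "unquoted attribute": "<meta charset=utf-8>",
    "empty attribute": "<input disabled>",
}


def xml_accepts(text):
    """Return True if the XML parser considers text well formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record the name of every start tag the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(text):
    collector = TagCollector()
    collector.feed(text)
    collector.close()  # flush any buffered input
    return collector.tags


for label, text in SNIPPETS.items():
    print(f"{label}: XML ok={xml_accepts(text)}, HTML start tags={html_tags(text)}")
```

Every snippet is rejected by the XML parser, while the HTML tokenizer reports start tags for all of them without raising.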
All that said, there has also been work toward a new type of XML parsing (so-called XML5 parsing) capable of handling things like empty and unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.
If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.
These days, there are such conformant HTML parsers for many (or even most) languages.
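As a rough idea of what the link-extraction step of a crawler looks like with an HTML parser, here is a minimal Python sketch. It uses the standard library's html.parser only so that the example is self-contained; that module tolerates messy markup but is not presented here as conforming to the HTML standard, so for real crawling a conformant library would be the better choice. The class LinkExtractor, the function extract_links, and the example URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> start tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


# Deliberately sloppy markup: unquoted attribute, <br>, unclosed <p> and <a>.
page = '<p>See <a href=docs.html>docs<br><a href="https://example.org/x">x</a>'
print(extract_links(page, "https://example.com/start"))
# prints ['https://example.com/docs.html', 'https://example.org/x']
```

Resolving each href against the page's own URL with urljoin is what lets the crawler follow relative links.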
Computers are picky. "Almost identical" isn't good enough. HTML allows things that XML doesn't, therefore an XML parser will reject (many, though not all) HTML documents.
In addition, there's a different quality culture. With HTML the culture for a parser is "try to do something with the input if you possibly can". With XML the culture is "if it's faulty, send it back for repair or replacement".
XML parsers will stop as soon as the XML content isn't well formed.
Some XML rules don't apply to HTML (illegal characters, for instance), so an XML parser will consider your HTML not well formed and won't proceed further.
Consider the following HTML "page":
This is perfectly well-formed and valid HTML, as you can check with the W3C validator (validator.w3.org).
Now just try validating the following XML (on http://www.xmlvalidation.com, for instance):
You'll be notified that it's not well-formed XML, since the attribute checked is not followed by an equals sign and a value. Correct this, then you'll be told that '&' is an illegal character. Replace it with the corresponding entity, &amp;, then you'll learn that '>' is an illegal character too.
The tool you're trying to use to parse HTML as XML will surely find some error of this kind. As soon as it finds the first one, it stops processing your not-well-formed XML document.
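The same fix-one-error, hit-the-next loop can be reproduced with Python's standard XML parser. This is only a sketch: the fragment is a small hypothetical one containing a checked attribute with no value and a bare &, and first_xml_error and attempts are names invented for the example. The exact wording of the error messages depends on the underlying parser, so none is shown.

```python
import xml.etree.ElementTree as ET


def first_xml_error(text):
    """Return the parser's first complaint, or None if text is well formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return str(err)
    return None


# Each attempt repairs the problem reported for the previous one.
attempts = [
    # 1. attribute without a value
    '<body><input type="checkbox" checked/> Q & A</body>',
    # 2. attribute fixed, but the bare & is still there
    '<body><input type="checkbox" checked="checked"/> Q & A</body>',
    # 3. & written as the &amp; entity
    '<body><input type="checkbox" checked="checked"/> Q &amp; A</body>',
]

for text in attempts:
    print(first_xml_error(text) or "well formed")
```

Only one error is ever reported per attempt, because the parser gives up at the first problem it meets.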
You'll still have a chance if the HTML page you're trying to parse is well-formed XHTML 1.0 Strict or XHTML 1.1.
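For comparison, a small well-formed XHTML-style document does go through an XML parser. The sketch below assumes a hypothetical document that declares the XHTML namespace, closes every element, and gives every attribute a quoted value; the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed XHTML: <br/> is self-closed, checked has a value.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Demo</title></head>"
    "<body><p>First<br/>second</p>"
    '<input type="checkbox" checked="checked"/></body>'
    "</html>"
)

root = ET.fromstring(xhtml)

# Elements live in the XHTML namespace, so lookups need a prefix mapping.
ns = {"x": "http://www.w3.org/1999/xhtml"}
title = root.find("x:head/x:title", ns).text
checkbox = root.find(".//x:input", ns)

print(title)                    # prints Demo
print(checkbox.get("checked"))  # prints checked
```

Real-world pages served as text/html rarely meet these constraints, which is why this route only works for documents known to be well formed.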