Parsing an HTML document using an XML parser

Posted 2019-04-26 15:33

Can I parse an HTML file using an XML parser?

Why can (or can't) I do this? I know that XML is used to store data and that HTML is used to display data, but syntactically they are almost identical.

The intended use is to make an HTML parser that is part of a web crawler application.

3 answers
姐就是有狂的资本
#2 · 2019-04-26 16:13

You can try parsing an HTML file using an XML parser, but it’s likely to fail. The reason is that HTML documents can use the following features, which XML parsers don’t understand:

  • elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
  • elements that don’t require end tags; e.g., <p> <dt> <li> (their end tags can be implied)
  • elements whose content can include unescaped "<" characters; e.g., style, textarea, title, and script, as in <script> if (a < b) … </script> or <title>Using the "<" operator</title>
  • attributes with unquoted values; e.g., <meta charset=utf-8>
  • attributes that are empty, with no separate value given at all; e.g., <input disabled>

An XML parser will fail to parse any HTML document that uses any of those features.
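
As a rough illustration of the list above, here is a minimal Python sketch that feeds one small snippet per feature to the standard library's xml.etree.ElementTree parser. The helper name xml_accepts and the sample snippets are made up for this example; other XML parsers should behave similarly, but only this one is shown.

```python
import xml.etree.ElementTree as ET


def xml_accepts(markup: str) -> bool:
    """Return True if the markup parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# One illustrative snippet per HTML feature from the list above.
SAMPLES = {
    "void element without end tag": "<p>line one<br>line two</p>",
    "implied end tags": "<ul><li>one<li>two</ul>",
    "unescaped '<' in script": "<script>if (a < b) run();</script>",
    "unquoted attribute value": "<meta charset=utf-8>",
    "empty attribute": "<input disabled>",
}

for feature, markup in SAMPLES.items():
    print(f"{feature}: accepted={xml_accepts(markup)}")

# For contrast, the XML-style spelling of a void element is accepted.
print("self-closing <br/>:", xml_accepts("<p>line one<br/>line two</p>"))
```

Every sample is ordinary HTML, yet each one is rejected, while the self-closing variant at the end goes through.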

An HTML parser, on the other hand, will basically never fail no matter what a document contains.
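
To give a feel for that tolerance, here is a small sketch using Python's built-in html.parser module to pull links out of deliberately sloppy markup. Note the assumptions: html.parser is lenient but is not a fully standard-conformant HTML parser, and the LinkCollector class, the collect_links helper, and the sample page are invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical crawler helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for empty attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Sloppy but legal HTML: void elements, implied end tags,
# an unquoted attribute value and an empty attribute.
MESSY = """<!doctype html>
<meta charset=utf-8>
<p>First paragraph<br>
<p>Second <a href=/about>about</a>
<ul><li><a href="https://example.com/">example</a><li>no end tags
<input disabled>
"""


def collect_links(html: str) -> list:
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(collect_links(MESSY))
```

The parser never raises on this input; it simply reports the tags it sees, which is the behaviour a crawler needs.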


All that said, there has also been work toward a new type of XML parsing, so-called XML5 parsing, that is capable of handling things like empty or unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.


"The intended use is to make an HTML parser that is part of a web crawler application."

If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.

These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:


来,给爷笑一个
#3 · 2019-04-26 16:13

"syntactically they are almost identical"

Computers are picky. "Almost identical" isn't good enough. HTML allows things that XML doesn't, therefore an XML parser will reject (many, though not all) HTML documents.

In addition, there's a different quality culture. With HTML the culture for a parser is "try to do something with the input if you possibly can". With XML the culture is "if it's faulty, send it back for repair or replacement".

Melony?
#4 · 2019-04-26 16:16

XML parsers stop as soon as they find content that isn't well-formed.
Some XML rules don't apply to HTML (which characters must be escaped, for instance), so an XML parser will often consider your HTML not well-formed and won't proceed any further.

Consider the following HTML "page":

<!doctype html>
<html>
  <head><title>Test</title></head>
  <body>
    <input type="checkbox" name="azerty" checked />
    <p>if A=B & B>D, then A>D</p>
  </body>
</html>

This is perfectly well-formed and valid HTML, as you can check with the W3C validator (validator.w3.org).

Now try validating the following XML (on http://www.xmlvalidation.com, for instance):

<?xml version="1.0"?>
<html>
  <head><title>Test</title></head>
  <body>
    <input type="checkbox" name="azerty" checked />
    <p>if A=B & B>D, then A>D</p>
  </body>
</html>

You'll be notified that it's not well-formed XML, since the attribute checked is not followed by an equals sign and a value.
Correct this, and you'll be told that '&' is an illegal character. Replace it with the corresponding entity, &amp;. A validator may then also complain about '>'; strictly speaking, the XML specification allows a bare '>' in text content (except in the sequence ']]>'), but escaping it as &gt; is the usual convention.

The tool you're trying to use to parse HTML as XML will surely find some error of this kind. As soon as it finds the first one, it stops processing your document, because it is not well-formed XML.
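
The fix-one-error-at-a-time experience described above can be reproduced with a short Python sketch. It uses the standard library's expat-based xml.etree.ElementTree; the first_error_line helper and the TEMPLATE string are made up for this example. Note that expat accepts a bare '>' in text content, which the XML specification permits, so other tools may report it differently.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# The XML document from the answer, with the two problem spots parameterised.
TEMPLATE = """<?xml version="1.0"?>
<html>
  <head><title>Test</title></head>
  <body>
    <input type="checkbox" name="azerty" {checked} />
    <p>if A=B {amp} B>D, then A>D</p>
  </body>
</html>
"""


def first_error_line(markup: str) -> Optional[int]:
    """Return the line of the first well-formedness error, or None if there is none."""
    try:
        ET.fromstring(markup)
        return None
    except ET.ParseError as exc:
        line, _column = exc.position
        return line


ORIGINAL = TEMPLATE.format(checked="checked", amp="&")
ATTR_FIXED = TEMPLATE.format(checked='checked="checked"', amp="&")
BOTH_FIXED = TEMPLATE.format(checked='checked="checked"', amp="&amp;")

for label, doc in [("original", ORIGINAL),
                   ("attribute fixed", ATTR_FIXED),
                   ("attribute and ampersand fixed", BOTH_FIXED)]:
    print(label, "-> first error on line:", first_error_line(doc))
```

The parser reports only the first problem each time: the valueless attribute on the input line, then the raw ampersand on the paragraph line, and nothing once both are fixed.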

You'll still have a chance if the HTML page you're trying to parse is well formed XHTML 1.0 strict, or XHTML 1.1...
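
For comparison, here is a minimal sketch of the XHTML case, again with Python's xml.etree.ElementTree. The sample page and the paragraph_texts helper are invented for this example. One assumption worth stating: this parser does not load the external DTD, so named entities such as &nbsp; would still fail; only the five predefined XML entities are used here.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The earlier test page rewritten as well-formed XHTML 1.0 Strict.
XHTML_PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Test</title></head>
  <body>
    <p><input type="checkbox" name="azerty" checked="checked" /></p>
    <p>if A=B &amp; B&gt;D, then A&gt;D</p>
  </body>
</html>
"""


def paragraph_texts(xhtml: str) -> list:
    """Return the text of every non-empty <p> element in an XHTML document."""
    root = ET.fromstring(xhtml)
    ns = {"x": XHTML_NS}  # XHTML elements live in this namespace
    texts = ("".join(p.itertext()).strip() for p in root.findall(".//x:p", ns))
    return [t for t in texts if t]


print(paragraph_texts(XHTML_PAGE))
```

Because every element is closed, every attribute has a quoted value, and the special characters are escaped, the XML parser accepts the page and the text can be queried like any other XML tree.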
