JSoup parsing invalid HTML with unclosed tags

Posted 2019-02-12 22:00

Question:

In JSoup, up to and including the latest release (1.7.2), there is a bug when parsing invalid HTML with unclosed tags.

Example:

String tmp = "<a href='www.google.com'>Link<p>Error link</a>";
Jsoup.parse(tmp);

The Document it generates is:

<html>
 <head></head>
 <body>
  <a href="www.google.com">Link</a>
  <p><a>Error link</a></p>
 </body>
</html>

Browsers would generate something like:

<html>
 <head></head>
 <body>
  <a href="www.google.com">Link</a>
  <p><a href="www.google.com">Error link</a></p>
 </body>
</html>

Jsoup should behave the way browsers do, or else preserve the source as written.

Is there any solution? I looked through the API and didn't find anything.

Answer 1:

The correct behavior is to do what browsers do when parsing this invalid HTML. Thanks for filing this bug. I've fixed the issue that was preventing the adoption agency algorithm from keeping the original attributes in the new node. The fix will be available in 1.7.3, or you can build from head now.



Answer 2:

If your goal is to get the source the way browsers generate it, you could use Selenium and then pass the result to Jsoup to parse. Selenium has to open a real browser, but it can do so automatically. The code looks like this:

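import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class BrowserSource {
    public static void main(String[] args) {
        // To use Chrome instead, point Selenium at chromedriver and swap the driver:
        //System.setProperty("webdriver.chrome.driver", "./chromedriver.exe");
        //WebDriver driver = new ChromeDriver();
        WebDriver driver = new FirefoxDriver();
        driver.get("file:///C:/Users/jgong/Desktop/a.html");

        // The page source as serialized by the browser after its own error recovery
        String html = driver.getPageSource();
        System.out.println(html);
        driver.quit();

        // Hand the browser-repaired markup to Jsoup
        Document doc = Jsoup.parse(html);
        System.out.println(doc.html());
    }
}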

and a.html is:

<html><head></head><body><a href="www.google.com">Link<p>Error link</a></body></html>

and the result is what you wanted:

<html><head></head> <body> <a href="www.google.com">Link</a><p><a href="www.google.com">Error link</a> </p></body></html>


Answer 3:

Your HTML is not valid:

document type does not allow element "P" here; missing one of "APPLET", "OBJECT", "MAP", "IFRAME", "BUTTON" start-tag

<a href='www.google.com'>Link<p>Error link</a>

The mentioned element is not allowed to appear in the context in which you've placed it; the other mentioned elements are the only ones that are both allowed there and can contain the element mentioned. This might mean that you need a containing element, or possibly that you've forgotten to close a previous element.

One possible cause for this message is that you have attempted to put a block-level element (such as "<p>" or "<table>") inside an inline element (such as "<a>", "<span>", or "<font>").
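To make the "block-level inside inline" cause concrete, here is a minimal sketch of a checker that flags a block-level start tag appearing while an inline element is still open. It is only an illustration: the class `NestingCheck`, the method `findBlockInsideInline`, and the two small element sets are hypothetical names chosen for this example, not part of Jsoup or of any validator, and the regex-based tag scan is deliberately naive (it would be confused by `>` inside attribute values, for instance).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a real validator.
class NestingCheck {
    // Small illustrative subsets, taken from the elements named in the message above.
    private static final Set<String> INLINE = Set.of("a", "span", "font");
    private static final Set<String> BLOCK = Set.of("p", "table");

    // Group 1 is "/" for an end tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    static List<String> findBlockInsideInline(String html) {
        List<String> problems = new ArrayList<>();
        Deque<String> openInline = new ArrayDeque<>(); // innermost element first
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            boolean isEndTag = !m.group(1).isEmpty();
            if (isEndTag) {
                // Close the innermost open inline element with this name, if any.
                openInline.removeFirstOccurrence(name);
            } else if (BLOCK.contains(name) && !openInline.isEmpty()) {
                // A block-level element started while an inline element is open.
                problems.add("<" + name + "> inside <" + openInline.peek() + ">");
            } else if (INLINE.contains(name)) {
                openInline.push(name);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String tmp = "<a href='www.google.com'>Link<p>Error link</a>";
        System.out.println(findBlockInsideInline(tmp)); // prints [<p> inside <a>]
    }
}
```

Running it on the markup from the question reports the `<p>` inside the still-open `<a>`, while `<p><a href='www.google.com'>Link</a></p>` produces an empty list.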

There's no standard way to fix broken HTML, and each parser will try its best. If you want repeatable results for invalid HTML, you should stick strictly to the same version of the same parser.