Library to query HTML with XPath in Java?

Posted 2019-01-15 04:49

Question:

Can anyone recommend a Java library that lets me run XPath queries over URLs? I've tried JAXP without success.

Thank you.

Answer 1:

jsoup, a Java HTML parser. Its query syntax is very similar to jQuery's.



Answer 2:

There are several different approaches to this documented on the Web:

Using HtmlCleaner

  • HtmlCleaner / Java DOM parser - Using XPath Contains against HTML in Java (this is the approach I recommend)
  • HtmlCleaner itself has a built-in utility supporting XPath. See the Javadocs at http://htmlcleaner.sourceforge.net/doc/org/htmlcleaner/XPather.html or this example: http://thinkandroid.wordpress.com/2010/01/05/using-xpath-and-html-cleaner-to-parse-html-xml/

Using Jericho

  • Jericho and Jaxen http://sujitpal.blogspot.com/2009/04/xpath-over-html-using-jericho-and-jaxen.html

I have tried a few different variations of these approaches, i.e. HtmlParser plus the Java DOM parser, and jsoup plus Jaxen, but the combination that worked best was HtmlCleaner plus the Java DOM parser. The next best combination was Jericho plus Jaxen.



Answer 3:

You could use TagSoup together with Saxon. That way you simply replace whatever XML SAX parser you were using with TagSoup, and the XPath 2.0, XSLT 2.0, or XQuery 1.0 implementation works as usual.
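
The parser swap described here can be sketched with JDK classes only. This is a rough illustration under stated assumptions, not the answer's actual setup: the JDK's own SAX reader stands in for TagSoup, the JDK's XPath 1.0 engine stands in for Saxon, and the class name SaxReaderSwap, its methods, and the sample markup are all made up for the example. Because the stand-in reader is a strict XML parser, the sample has to be well-formed; tolerating real-world HTML is exactly what TagSoup would add.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

final class SaxReaderSwap {

    // Stand-in reader from the JDK (assumption for this sketch). With TagSoup
    // on the classpath, this is the one place to return
    // new org.ccil.cowan.tagsoup.Parser() instead.
    static XMLReader newReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("could not create SAX reader", e);
        }
    }

    // Feeds the reader's SAX events into a DOM via an identity transform,
    // then evaluates the XPath expression and returns its string value.
    static String evaluate(XMLReader reader, String markup, String expression) {
        try {
            SAXSource source =
                new SAXSource(reader, new InputSource(new StringReader(markup)));
            DOMResult dom = new DOMResult();
            TransformerFactory.newInstance().newTransformer().transform(source, dom);
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, dom.getNode());
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, well-formed sample page.
        String page = "<html><head><title>Example page</title></head>"
            + "<body><p>Hello</p></body></html>";
        System.out.println(evaluate(newReader(), page, "/html/head/title"));
    }
}
```

The rest of the pipeline only sees an XMLReader, so it does not need to change when the reader behind newReader() is replaced.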



Answer 4:

I've used JTidy to make HTML into a proper DOM, then used plain XPath to query the DOM.
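
The second half of that approach, querying a DOM with plain XPath, needs nothing beyond the JDK. Here is a minimal sketch, assuming the HTML has already been cleaned into well-formed markup; the JTidy step itself is not shown. The class name XPathOverDom, the select method, and the sample markup are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathOverDom {

    // Hypothetical sample standing in for the output of an HTML cleaner:
    // it is already well-formed, so the stock JDK parser accepts it.
    static final String CLEANED_HTML =
        "<html><body>"
        + "<a href=\"/docs/intro.html\">Intro</a>"
        + "<a href=\"/blog/post.html\">Post</a>"
        + "<a href=\"/docs/api.html\">API</a>"
        + "</body></html>";

    // Parses the markup into a W3C DOM and returns the text content of
    // every node matched by the XPath expression, in document order.
    static List<String> select(String wellFormedMarkup, String expression) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wellFormedMarkup)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, dom, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath query failed", e);
        }
    }

    public static void main(String[] args) {
        // Select the href of every link that points into /docs/.
        System.out.println(select(CLEANED_HTML, "//a[contains(@href, '/docs/')]/@href"));
    }
}
```

The same select call should work on any org.w3c.dom.Document, so a DOM produced by an HTML cleaner can be queried the same way once the parsing step is swapped out.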

If you want to do cross-document or cross-URL queries, it's better to use JTidy with XQuery.