Parse HTML via XPath

2019-01-16 10:08发布

In .Net, I found this great library, HtmlAgilityPack that allows you to easily parse non-well-formed HTML using XPath. I've used this for a couple years in my .Net sites, but I've had to settle for more painful libraries for my Python, Ruby and other projects. Is anyone aware of similar libraries for other languages?

8条回答
smile是对你的礼貌
2楼-- · 2019-01-16 10:48

In python, ElementTidy parses tag soup and produces an element tree, which allows querying using XPath:

>>> from elementtidy.TidyHTMLTreeBuilder import TidyHTMLTreeBuilder as TB
>>> tb = TB()
>>> tb.feed("<p>Hello world")
>>> e= tb.close()
>>> e.find(".//{http://www.w3.org/1999/xhtml}p")
<Element {http://www.w3.org/1999/xhtml}p at 264eb8>
查看更多
Viruses.
3楼-- · 2019-01-16 10:49

The most stable results I've had have been using lxml.html's soupparser. You'll need to install python-lxml and python-beautifulsoup, then you can do the following:

from lxml.html.soupparser import fromstring
tree = fromstring('<mal form="ed"><html/>here!')
matches = tree.xpath("./mal[@form=ed]")
查看更多
戒情不戒烟
4楼-- · 2019-01-16 11:02

It seems the question could be more precisely stated as "How to convert HTML to XML so that XPath expressions can be evaluated against it".

Here are two good tools:

  1. TagSoup, an open-source program, is a Java and SAX - based tool, developed by John Cowan. This is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
    Taggle is a commercial C++ port of TagSoup.

  2. SgmlReader is a tool developed by Microsoft's Chris Lovett.
    SgmlReader is an XmlReader API over any SGML document (including built in support for HTML). A command line utility is also provided which outputs the well formed XML result.
    Download the zip file including the standalone executable and the full source code: SgmlReader.zip

查看更多
霸刀☆藐视天下
5楼-- · 2019-01-16 11:05

An outstanding achievement is the pure XSLT 2.0 Parser of HTML written by David Carlisle.

Reading its code would be a great learning exercise for everyone of us.

From the description:

"d:htmlparse(string)
 d:htmlparse(string,namespace,html-mode)

  The one argument form is equivalent to)
  d:htmlparse(string,'http://ww.w3.org/1999/xhtml',true()))

  Parses the string as HTML and/or XML using some inbuilt heuristics to)
  control implied opening and closing of elements.

  It doesn't have full knowledge of HTML DTD but does have full list of
  empty elements and full list of entity definitions. HTML entities, and
  decimal and hex character references are all accepted. Note html-entities
  are recognised even if html-mode=false().

  Element names are lowercased (if html-mode is true()) and placed into the
  namespace specified by the namespace parameter (which may be "" to denote
  no-namespace unless the input has explict namespace declarations, in
  which case these will be honoured.

  Attribute names are lowercased if html-mode=true()
"

Read a more detailed description here.

Hope this helped.

Cheers,

Dimitre Novatchev.

查看更多
劫难
6楼-- · 2019-01-16 11:06

I'm surprised there isn't a single mention of lxml. It's blazingly fast and will work in any environment that allows CPython libraries.

Here's how you can parse HTML via XPATH using lxml.

>>> from lxml import etree
>>> doc = '<foo><bar></bar></foo>'
>>> tree = etree.HTML(doc)

>>> r = tree.xpath('/foo/bar')
>>> len(r)
1
>>> r[0].tag
'bar'

>>> r = tree.xpath('bar')
>>> r[0].tag
'bar'
查看更多
forever°为你锁心
7楼-- · 2019-01-16 11:07

There is a free C implementation for XML called libxml2 which has some api bits for XPath which I have used with great success which you can specify HTML as the document being loaded. This had worked for me for some less than perfect HTML documents..

For the most part, XPath is most useful when the inbound HTML is properly coded and can be read 'like an xml document'. You may want to consider using a utility that is specific to this purpose for cleaning up HTML documents. Here is one example: http://tidy.sourceforge.net/

As far as these XPath tools go- you will likely find that most implementations are actually based on pre-existing C or C++ libraries such as libxml2.

查看更多
登录 后发表回答