Why does my XPath query (scraping HTML tables) onl

Posted 2018-12-31 09:41

This is meant to provide a canonical Q&A for all the similar questions (each much too specific to be a close-target candidate) that pop up once or twice a week.

I'm developing an application that needs to parse a website with tables in it. As deriving XPath expressions for scraping web pages is boring and error-prone work, I'd like to use the XPath extractor feature of Firebug (or similar tools in other browsers) for this.

Example input looks like this:

<!-- snip -->
<table id="example">
  <tr>
    <th>Example Cell</th>
    <th>Another one</th>
  </tr>
  <tr>
    <td>foobar</td>
    <td>42</td>
  </tr>
</table>
<!-- snip -->

I want to extract the first data cell ("foobar"). Firebug proposes the XPath expression

//table[@id="example"]/tbody/tr[2]/td[1]

which works fine in any XPath tester plugin, but not in my own application (no results found). If I cut the query down to //table[@id], it works again.

What's going wrong?

2 Answers
无色无味的生活
#2 · 2018-12-31 10:21

The Problem: DOM Requires <tbody/> Tags

Firebug, Chrome's Developer Tools, the XPath functions in JavaScript and others work on the DOM, not on the plain HTML source code.

The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) are wrapped in table body tags (<tbody/>). Thus, browsers add this tag if it's missing while parsing (X)HTML. For example, Microsoft's DOM documentation says

The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element.

There is an in-depth explanation in another answer on Stack Overflow.

On the other hand, HTML does not necessarily require that tag to be used:

The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections.

Most XPath Processors Work on raw XML

Excluding JavaScript, most XPath processors work on raw XML, not the DOM, and thus do not add <tbody/> tags. Likewise, HTML parser libraries that clean up markup generally output only XHTML, not "DOM-HTML".

This is a common problem posted on Stack Overflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM -- so it is not affected!

Reproducing the Issue

Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browser) -- or by using curl http://your.example.org on the command line. The latter will probably not contain any <tbody/> elements (they're rarely used), while Firebug will always show them.
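
The same effect can be reproduced without a browser. Below is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree as a stand-in for "an XPath processor that works on the raw source" (it only implements a subset of XPath). The table is the one from the question; the <html><body> wrapper and the variable names are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# The table from the question, wrapped in <html><body> so that it is a
# descendant of the root element (the wrapper is an assumption).
SOURCE = """
<html><body>
<table id="example">
  <tr>
    <th>Example Cell</th>
    <th>Another one</th>
  </tr>
  <tr>
    <td>foobar</td>
    <td>42</td>
  </tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(SOURCE)

# The query proposed by the browser tool expects a <tbody> element that
# the raw source does not contain, so nothing matches.
with_tbody = root.findall(".//table[@id='example']/tbody/tr[2]/td[1]")

# The cut-down query from the question still finds the table.
tables = root.findall(".//table[@id]")

print(len(with_tbody))  # 0
print(len(tables))      # 1
```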


Solution 1: Remove /tbody Axis Step

Check if the table you're stuck at really does not contain a <tbody/> element (see last paragraph). If it does, you've probably got another kind of problem.

Now remove the /tbody axis step, so your query will look like

//table[@id="example"]/tr[2]/td[1]
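
As a quick check, here is a minimal sketch that runs the shortened query against the raw source, again assuming Python's standard xml.etree.ElementTree (a subset of XPath) and a hypothetical <html><body> wrapper around the table from the question.

```python
import xml.etree.ElementTree as ET

# Raw source without <tbody>, as a non-browser parser would see it.
SOURCE = """<html><body>
<table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr><td>foobar</td><td>42</td></tr>
</table>
</body></html>"""

root = ET.fromstring(SOURCE)

# Same query as above with the /tbody step removed.
cell = root.find(".//table[@id='example']/tr[2]/td[1]")
print(cell.text)  # foobar
```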

Solution 2: Skip <tbody/> Tags

This is a rather dirty solution and likely to fail for nested tables (it can jump into inner tables). I would only recommend doing this in very rare cases.

Replace the /tbody axis step by a descendant-or-self step:

//table[@id="example"]//tr[2]/td[1]
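
To illustrate the nested-table pitfall, here is a small sketch using Python's standard xml.etree.ElementTree (a subset of XPath). The nested inner table is a made-up example, not part of the question's input.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the second cell of the data row holds an inner table.
NESTED = """<html><body>
<table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr>
    <td>foobar</td>
    <td>
      <table>
        <tr><th>inner head</th></tr>
        <tr><td>inner cell</td></tr>
      </table>
    </td>
  </tr>
</table>
</body></html>"""

root = ET.fromstring(NESTED)

# Strict child steps only look at rows directly below the outer table.
strict = [c.text for c in root.findall(".//table[@id='example']/tr[2]/td[1]")]

# The descendant step also reaches the second row of the inner table.
loose = [c.text for c in root.findall(".//table[@id='example']//tr[2]/td[1]")]

print(strict)  # ['foobar']
print(loose)   # ['foobar', 'inner cell']
```

The extra 'inner cell' match is exactly the kind of false positive that makes this variant risky.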

Solution 3: Allow Both Input With and Without <tbody/> Tags

If you're not sure in advance whether your table contains <tbody/> elements, or you use the query in both "HTML source" and DOM contexts, and you don't want to (or cannot) use the hack from Solution 2, provide an alternative query (XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).

  • XPath 1.0:
    //table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
  • XPath 2.0: //table[@id="example"]/(tbody, .)/tr[2]/td[1]
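
If the XPath library at hand supports neither form, the same idea can be emulated in application code. The sketch below assumes Python's standard xml.etree.ElementTree, which implements only a subset of XPath and, as far as I know, does not evaluate the | operator, so it runs both alternatives separately and concatenates the matches. The helper name first_data_cells and both sample inputs are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two alternatives from the XPath 1.0 variant, as separate queries.
PATHS = (
    ".//table[@id='example']/tr[2]/td[1]",
    ".//table[@id='example']/tbody/tr[2]/td[1]",
)

def first_data_cells(root):
    """Run both alternatives and concatenate whatever they match."""
    cells = []
    for path in PATHS:
        cells.extend(root.findall(path))
    return cells

# Raw source (no <tbody>) and a DOM-like serialization (with <tbody>).
RAW = ("<html><table id='example'><tr><th>Example Cell</th></tr>"
       "<tr><td>foobar</td></tr></table></html>")
DOM_LIKE = ("<html><table id='example'><tbody><tr><th>Example Cell</th></tr>"
            "<tr><td>foobar</td></tr></tbody></table></html>")

raw_cells = [c.text for c in first_data_cells(ET.fromstring(RAW))]
dom_cells = [c.text for c in first_data_cells(ET.fromstring(DOM_LIKE))]
print(raw_cells)  # ['foobar']
print(dom_cells)  # ['foobar']
```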
梦寄多情
#3 · 2018-12-31 10:29

Just came across the same problem. I almost wrote a recursive function to check for every tbody tag whether it exists and traverse the DOM that way, then I remembered I know regex. :)

Before parsing, get the HTML as a string. Insert missing <tbody> and </tbody> tags with regex, then load it back into your DOMDocument object.

Jens Erat gives a good explanation, but here is

Solution 4: Make sure the HTML source always has the <tbody> tags with regex

JavaScript
    var html = '<html><table><tr><td>foo</td><td>bar</td></tr></table></html>';
    // String.prototype.replace returns a new string, so assign the result back.
    html = html.replace(/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/g,"$1<tbody>")
               .replace(/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/g,"$1</tbody>$4");

PHP
    $html = $dom->saveHTML();
    $html = preg_replace(array('/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/','/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/'),array('$1<tbody>','$1</tbody>$4'),$html);
    $dom->loadHTML($html);

Just the regex:

The first pattern matches a `<table>` tag (with any attributes inside it) plus any text between it and the next tag, provided the next tag is NOT `<tbody>` (again with optional attributes):

    /(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/

replace with

    $1<tbody>

Here $1 references the captured `<table>` tag together with the text that follows it.
Do the same for the closing tag like this:

    /(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/

replace with

    $1</tbody>$4

This way the DOM will always have the <tbody> tags where necessary.
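
For completeness, here is a hedged Python port of the two patterns, checked against the compact sample string from the JavaScript snippet. The names OPEN_RE, CLOSE_RE and add_tbody are mine, and the standard xml.etree.ElementTree parser is only used to confirm that a query with a tbody step now matches. One caveat I noticed while porting: the closing pattern only fires when </table> directly follows the previous tag, and the opening pattern can add a second <tbody> when whitespace separates <table> from an existing <tbody>, so pretty-printed input may need whitespace handling first.

```python
import re
import xml.etree.ElementTree as ET

# Same patterns as above, written as Python raw strings.
OPEN_RE = re.compile(r"(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)")
CLOSE_RE = re.compile(r"(<(?!(/tbody))([^>]+)?>)(</table([^>]+)?>)")

def add_tbody(html):
    """Insert <tbody>...</tbody> where a table has none (hypothetical helper)."""
    html = OPEN_RE.sub(r"\g<1><tbody>", html)
    return CLOSE_RE.sub(r"\g<1></tbody>\g<4>", html)

html = "<html><table><tr><td>foo</td><td>bar</td></tr></table></html>"
fixed = add_tbody(html)
print(fixed)
# <html><table><tbody><tr><td>foo</td><td>bar</td></tr></tbody></table></html>

# A query containing the tbody step now matches the normalized source.
root = ET.fromstring(fixed)
first_cell = root.find(".//table/tbody/tr[1]/td[1]")
print(first_cell.text)  # foo

# Running the helper on already normalized (compact) input changes nothing.
print(add_tbody(fixed) == fixed)  # True
```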
