This is meant to provide a canonical Q&A for all the similar questions (each much too specific to be a close target candidate) that pop up once or twice a week.
I'm developing an application that needs to parse a website with tables in it. As deriving XPath expressions for scraping web pages is boring and error-prone work, I'd like to use the XPath extractor feature of Firebug (or similar tools in other browsers) for this.
Example input looks like this:
<!-- snip -->
<table id="example">
<tr>
<th>Example Cell</th>
<th>Another one</th>
</tr>
<tr>
<td>foobar</td>
<td>42</td>
</tr>
</table>
<!-- snip -->
I want to extract the first data cell ("foobar"). Firebug proposes the XPath expression
//table[@id="example"]/tbody/tr[2]/td[1]
which works fine in any XPath tester plugin, but not in my own application (no results found). If I cut the query down to //table[@id], it works again.
What's going wrong?
The Problem: DOM Requires <tbody/> Tags
Firebug, Chrome's Developer Tools, XPath functions in JavaScript and others work on the DOM, not the raw HTML source code.
The DOM for HTML requires that all table rows not contained in a table header or footer (<thead/>, <tfoot/>) are included in table body tags (<tbody/>). Thus, browsers add this tag if it's missing while parsing (X)HTML. Microsoft's DOM documentation, for example, states as much. There is an in-depth explanation in another answer on Stack Overflow.
On the other hand, HTML does not necessarily require that tag to be used.
Most XPath Processors Work on raw XML
Excluding JavaScript, most XPath processors work on raw XML, not the DOM, and thus do not add <tbody/> tags. HTML parser libraries like tag-soup and htmltidy also only output XHTML, not "DOM-HTML".
This is a common problem posted on Stack Overflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM, so it is not affected!
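The mismatch is easy to see outside the browser. The following is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree (which implements only a subset of XPath) and assuming the example table is wrapped in a made-up <body> element so that it parses as XML. It runs the Firebug-style path against the raw source, where no <tbody/> exists.

```python
import xml.etree.ElementTree as ET

# The example table as it appears in the raw page source (no <tbody>).
# The surrounding <body> element is an assumption made for this sketch.
RAW_SOURCE = """
<body>
<table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr><td>foobar</td><td>42</td></tr>
</table>
</body>
"""

root = ET.fromstring(RAW_SOURCE.strip())

# Path as proposed by Firebug; ".//" replaces the leading "//" because
# ElementTree paths are relative to the element they are called on.
with_tbody = root.find(".//table[@id='example']/tbody/tr[2]/td[1]")

# Same path with the /tbody step removed.
without_tbody = root.find(".//table[@id='example']/tr[2]/td[1]")

print(with_tbody)          # None: the raw source has no <tbody> element
print(without_tbody.text)  # foobar
```

Other XPath processors that work on the raw source should behave the same way for this input, since the /tbody step has nothing to match.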
Reproducing the Issue
Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browser), or by using curl http://your.example.org on the command line. The latter will probably not contain any <tbody/> elements (they're rarely used), whereas Firebug will always show them.
Solution 1: Remove /tbody Axis Step
Check if the table you're stuck at really does not contain a <tbody/> element (see the previous section). If it does, you've probably got another kind of problem.
Now remove the /tbody axis step, so your query will look like
//table[@id="example"]/tr[2]/td[1]
Solution 2: Skip <tbody/> Tags
This is a rather dirty solution and likely to fail for nested tables (it can jump into inner tables). I would only recommend doing this in very rare cases.
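Replacing /tbody with a descendant step (//) makes the query match rows at any depth below the table, which is how it can jump into inner tables. Here is a minimal sketch of that effect, assuming Python's standard-library xml.etree.ElementTree and a made-up nested table (the inner table and its cell texts are purely illustrative):

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the second row of the outer table contains an inner table.
NESTED = ET.fromstring("""
<body>
<table id="example">
  <tr><th>Example Cell</th><th>Another one</th></tr>
  <tr>
    <td>
      <table>
        <tr><td>inner header</td></tr>
        <tr><td>inner cell</td></tr>
      </table>
    </td>
    <td>42</td>
  </tr>
</table>
</body>
""".strip())

# Child steps only: rows must be direct children of the outer table.
strict = NESTED.findall(".//table[@id='example']/tr[2]/td[1]")

# Descendant step: any second row below the outer table qualifies,
# including the second row of the inner table.
loose = NESTED.findall(".//table[@id='example']//tr[2]/td[1]")

print(len(strict))                 # number of cells matched by the strict path
print(len(loose))                  # number of cells matched by the loose path
print([cell.text.strip() for cell in loose])
```

The strict path matches only the outer cell, while the descendant variant additionally returns the cell from the inner table.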
Replace the /tbody axis step by a descendant-or-self step:
//table[@id="example"]//tr[2]/td[1]
Solution 3: Allow Both Input With and Without <tbody/> Tags
If you're not sure in advance whether your table contains <tbody/> tags, or you use the query in both "HTML source" and DOM contexts, and you don't want to or cannot use the hack from solution 2, provide an alternative query (for XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).
XPath 1.0:
//table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
XPath 2.0:
//table[@id="example"]/(tbody, .)/tr[2]/td[1]
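To illustrate the idea of accepting both inputs, here is a minimal sketch in Python using the standard-library xml.etree.ElementTree. That module implements only a subset of XPath, so the sketch does not rely on the | operator and instead tries the two alternative paths one after the other. The helper name first_data_cell and the <body> wrapper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

ROWS = (
    "<tr><th>Example Cell</th><th>Another one</th></tr>"
    "<tr><td>foobar</td><td>42</td></tr>"
)

# Raw page source: rows are direct children of <table>.
RAW = ET.fromstring(f'<body><table id="example">{ROWS}</table></body>')

# DOM-like serialization: the browser has wrapped the rows in <tbody>.
DOM_LIKE = ET.fromstring(
    f'<body><table id="example"><tbody>{ROWS}</tbody></table></body>'
)

# The two branches of the alternative query, without and with /tbody.
PATHS = (
    ".//table[@id='example']/tr[2]/td[1]",
    ".//table[@id='example']/tbody/tr[2]/td[1]",
)


def first_data_cell(root):
    """Return the text of the first cell matched by any path, or None."""
    for path in PATHS:
        cell = root.find(path)
        if cell is not None:
            return cell.text
    return None


print(first_data_cell(RAW))       # foobar
print(first_data_cell(DOM_LIKE))  # foobar
```

With a full XPath 1.0 processor, the single union expression shown above should make the helper loop unnecessary.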
Just came across the same problem. I almost wrote a recursive function to check for every tbody tag if it exists and traverse the DOM that way, then I remembered I know regex. :)
Before parsing, get the HTML as a string. Insert missing <tbody> and </tbody> tags with regex, then load it back into your DOMDocument object.
Jens Erat gives a good explanation, but here is
Solution 4: Make sure the HTML source always has the <tbody> tags with regex
Just the regex:
This way the DOM will ALWAYS have the <tbody> tags where necessary.