I am having trouble scraping a table-heavy page with DOMXPath.
The layout is really ugly: I am trying to get content out of a table within a table within a table.
Using Firebug's FirePath, I get the following path for the table element:
html/body/table/tbody/tr[3]/td/table[1]/tbody/tr[2]/td[1]/table[1]/tbody/tr[3]/td[4]
After endless experimenting, I found out that with a standalone table I need to remove the "tbody" step from the path to make it work. But this doesn't seem to be enough for tables within tables.
So my question is: how do I best get content out of tables within tables within tables?
I uploaded the file which I am trying to scrape here:1
I ran into the same problem as you when scraping complicated, poorly formatted HTML where I wanted to get the values from a table nested inside other tables.
My approach was to home in on the part I wanted with a series of functions like this:
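class TableParser {
    private $xpath;
    public function __construct(DOMXPath $xpath) { $this->xpath = $xpath; }
    // Find the rows to extract: every <tr> that carries a data-eventid attribute.
    public function parseHtml() {
        $rows = array();
        foreach ($this->xpath->query('//tr[@data-eventid]/@data-eventid') as $id) {
            $rows[$id->value] = $this->parseTable($id->value);
        }
        return $rows;
    }
    // Extract the content of one row (e.g. data-eventid="405412"): the title of its "impact" cell.
    private function parseTable($eventId) {
        $titles = $this->xpath->query('//tr[@data-eventid="' . $eventId . '"]/td[@class="impact"]/span[@title]/@title');
        // ...etc
        return $this->parseEvaluate($titles->length ? $titles->item(0)->value : null);
    }
    private function parseEvaluate($value) {
        // ...verify that the values are correct
        return $value;
    }
}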
Just to give you the idea.
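To make the staged idea concrete, here is a minimal, self-contained sketch. The sample markup, the second event id and the variable names are made up for illustration; only the data-eventid / impact / title pattern comes from the queries above. It first selects the rows by attribute, wherever they are nested, and then runs a relative query with each row as the context node, so the nesting depth of the tables never appears in the expression.

```php
<?php
// Hypothetical sample markup: a table nested in a table, with no <tbody> in the source.
$html = <<<HTML
<html><body>
<table><tr><td>
  <table>
    <tr data-eventid="405412"><td class="impact"><span title="High">!</span></td></tr>
    <tr data-eventid="405413"><td class="impact"><span title="Low">.</span></td></tr>
  </table>
</td></tr></table>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings about messy HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$impacts = array();
// Stage 1: pick the rows by attribute, regardless of how deeply they are nested.
foreach ($xpath->query('//tr[@data-eventid]') as $row) {
    $id = $row->getAttribute('data-eventid');
    // Stage 2: relative query, evaluated with the current row as context node.
    $title = $xpath->query('td[@class="impact"]/span[@title]/@title', $row);
    // Stage 3: only keep a value if the query actually matched something.
    $impacts[$id] = $title->length > 0 ? $title->item(0)->value : null;
}
print_r($impacts);
```

The second argument of DOMXPath::query() is what keeps each step short: the inner expression only describes the path from the row to the value.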
How about:
//*[contains(text(),"GRABME")]
I know that's probably not what you want, but you get the idea: identify a pattern and use that pattern to construct the XPath.
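A rough sketch of how that could look with DOMXPath, using an invented three-level table as a stand-in for the real page (the markup and variable names are assumptions). It also compares a browser-style path containing tbody steps with the same path stripped of them. This assumes the tbody elements come from the browser's DOM rather than from the source HTML, which is the usual case when the markup omits them; PHP's DOMDocument does not add them.

```php
<?php
// Hypothetical sample page: a table within a table within a table, no <tbody> in the source.
$html = <<<HTML
<html><body>
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td>label</td><td>GRABME 42</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings about messy HTML quiet
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Browser-style path with tbody steps, and the same path with every tbody step removed.
$withTbody = '/html/body/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td[2]';
$withoutTbody = str_replace('/tbody', '', $withTbody);
$countWith = $xpath->query($withTbody)->length;
$countWithout = $xpath->query($withoutTbody)->length;

// Pattern-based query: independent of how deeply the cell is nested.
$hits = $xpath->query('//*[contains(text(),"GRABME")]');
$text = $hits->length > 0 ? trim($hits->item(0)->textContent) : null;

echo "with tbody: $countWith, without tbody: $countWithout, text: $text\n";
```

Note that the tbody step has to be removed at every nesting level, not just the outermost one. The pattern-based query avoids the issue entirely, since it never spells out the table structure.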