I'm using Symfony, Goutte, and DomCrawler to scrape a page. Unfortunately, this page has many old-fashioned tables of data, with no IDs, classes, or other identifying attributes. So I'm trying to find a table by parsing through the source code I get back from the request, but I can't seem to access any information.
I think when I try to filter it, it only filters the first node, and that's not where my desired data is, so it returns nothing.
So I have a $crawler object, and I've tried to loop through the following to get what I want:
$title = $crawler->filterXPath('//td[. = "Title"]/following-sibling::td[1]')->each(function (Crawler $node, $i) {
    return $node->text();
});
I'm not sure what Crawler $node is; I just got it from the example on the web page. Perhaps if I can get this working, it will loop through each node in the $crawler object and find what I'm actually looking for.
Here's an example of the page:
<table>
<tr>
<td>Title</td>
<td>The Harsh Face of Mother Nature</td>
<td>The Harsh Face of Mother Nature</td>
</tr>
.
.
.
</table>
And this is just one table; there are many tables and a huge sloppy mess outside of this one. Any ideas?
(Note: earlier I was able to apply a filter to the $crawler object for some information I needed, then I called serialize() on the result and finally had a string, which made sense. But I cannot get a string at all anymore, and I don't know why.)
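For reference, the XPath expression from the question can be checked against the sample table using only PHP's built-in DOM extension, without Goutte or DomCrawler. This is a minimal sketch, assuming the markup looks like the sample above; the variable names ($html, $titles) are illustrative and not from the original code. Note that the result is an array with one entry per match, not a single string.

```php
<?php
// Sample markup modeled on the table in the question (trimmed to one row).
$html = <<<HTML
<table>
<tr>
<td>Title</td>
<td>The Harsh Face of Mother Nature</td>
<td>The Harsh Face of Mother Nature</td>
</tr>
</table>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings internally; real-world markup is often sloppy.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Same expression as in the question: the first cell after a "Title" cell.
$nodes = $xpath->query('//td[. = "Title"]/following-sibling::td[1]');

// Gather the text of every match into an array of strings.
$titles = [];
foreach ($nodes as $node) {
    $titles[] = trim($node->textContent);
}

print_r($titles);
```

If this returns the expected cell on the real page source, the expression itself is fine and the problem lies in how the result is being read back.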
The DomCrawler html() function doesn't dump the whole HTML. As per the function description:
http://api.symfony.com/2.6/Symfony/Component/DomCrawler/Crawler.html#method_html
it returns only the first node, which is what happened in your case.
You may be able to use DOMDocument::saveHTML() (http://php.net/manual/en/domdocument.savehtml.php), as the Crawler extends SplObjectStorage and holds a set of DOM nodes.
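A rough sketch of that idea, using plain DOMDocument and DOMXPath so it runs without Symfony: each matched node is serialized on its own with saveHTML($node). The two-row table and the $fragments name are made up for illustration. With a Crawler, the assumption is that you would iterate over it the same way (foreach ($crawler as $node)) and call saveHTML on each node's ownerDocument.

```php
<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true);
// Hypothetical two-row table, just to have more than one match.
$doc->loadHTML(
    '<table>'
    . '<tr><td>Title</td><td>A</td></tr>'
    . '<tr><td>Title</td><td>B</td></tr>'
    . '</table>'
);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//td[. = "Title"]/following-sibling::td[1]');

$fragments = [];
foreach ($nodes as $node) {
    // Passing a node to saveHTML() serializes only that node,
    // rather than the whole document.
    $fragments[] = $node->ownerDocument->saveHTML($node);
}

echo implode("\n", $fragments), "\n";
```

This yields one HTML fragment per matched node instead of only the first one.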
If you view the source for the Crawler::html() you will see that it is performing the following: