DOMCrawler not dumping data properly for parsing

Posted 2019-06-06 13:13

Question:

I'm using Symfony, Goutte, and DomCrawler to scrape a page. Unfortunately, this page has many old-fashioned tables of data, with no IDs, classes, or other identifying attributes. So I'm trying to find a table by parsing the source code I get back from the request, but I can't seem to access any information.

I think when I try to filter it, it only filters the first node, and that's not where my desired data is, so it returns nothing.

I have a $crawler object, and I've tried the following to get what I want:

$title = $crawler->filterXPath('//td[. = "Title"]/following-sibling::td[1]')->each(function (Crawler $node, $i) {
        return $node->text();
});

I'm not sure what Crawler $node is; I just took it from the example on the web page. Perhaps if I can get this working, it will loop through each node in the $crawler object and find what I'm actually looking for.

Here's an example of the page:

<table> 
<tr>
    <td>Title</td>
    <td>The Harsh Face of Mother Nature</td>
   <td>The Harsh Face of Mother Nature</td>
</tr>
.
.
.
</table>

And this is just one table; there are many more tables, and a huge sloppy mess outside of this one. Any ideas?

(Note: earlier I was able to apply a filter to the $crawler object for some information I needed, then call serialize() on the result and finally get a string, which made sense. But I cannot get a string at all anymore, and I don't know why.)

Answer 1:

The DomCrawler html() function doesn't dump the whole HTML. As its description says:

http://api.symfony.com/2.6/Symfony/Component/DomCrawler/Crawler.html#method_html

it returns the HTML of the first node only, which is what happened in your case.

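You may be able to use DOMDocument::saveHTML() (http://php.net/manual/en/domdocument.savehtml.php) instead, as the Crawler in this version is an SplObjectStorage of DOM nodes, and each node gives access to its owner document: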

$html = $crawler->getNode(0)->ownerDocument->saveHTML();
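
To illustrate the difference between serializing one node and serializing the whole document, here is a minimal sketch that uses only PHP's built-in DOM extension instead of a real Crawler. The sample markup, including the second "Author" table, is hypothetical and only stands in for the scraped page.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$markup = '<html><body>'
    . '<table><tr><td>Title</td><td>The Harsh Face of Mother Nature</td></tr></table>'
    . '<table><tr><td>Author</td><td>Unknown</td></tr></table>'
    . '</body></html>';

$document = new DOMDocument();
$document->loadHTML($markup);

// Any node in the document exposes the document it belongs to.
$firstTable = $document->getElementsByTagName('table')->item(0);

// Passing a node serializes only that node (here, the first table).
$firstTableHtml = $firstTable->ownerDocument->saveHTML($firstTable);

// Calling saveHTML() without arguments serializes the whole document.
$fullHtml = $firstTable->ownerDocument->saveHTML();

echo $firstTableHtml, PHP_EOL;
```

With a Crawler, $crawler->getNode(0) would play the role of $firstTable here, assuming the crawler holds at least one node.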


Answer 2:

If you view the source of Crawler::html(), you will see that it only looks at the first node and does the following:

$html = '';
foreach ($this->getNode(0)->childNodes as $child) {
    $html .= $child->ownerDocument->saveHTML($child);
}
return $html;
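
Because that loop only covers node 0, one way to reach every match is to visit the nodes one at a time. The following is a rough sketch using PHP's built-in DOMXPath rather than the Crawler itself; innerHtml() is a hypothetical helper that mirrors the loop above, and the sample markup (including the <b> tag) is made up for illustration.

```php
<?php
// Hypothetical helper mirroring the loop quoted above: inner HTML of one node.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $child->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Sample markup modelled on the table in the question (the <b> tag is invented).
$document = new DOMDocument();
$document->loadHTML(
    '<table><tr><td>Title</td><td><b>The Harsh Face</b> of Mother Nature</td></tr></table>'
);

$xpath = new DOMXPath($document);
$titles = [];

// Visit every node matched by the XPath from the question, not only the first.
foreach ($xpath->query('//td[. = "Title"]/following-sibling::td[1]') as $cell) {
    $titles[] = innerHtml($cell);
}

print_r($titles);
```

The same XPath expression can be passed to filterXPath() on a Crawler, where each() hands the callback one single-node Crawler per match.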