How to crawl with php Goutte and Guzzle if data is

When crawling, we often run into pages whose content is generated with JavaScript (e.g. AJAX requests, jQuery), so the crawler cannot reach it.

Tags: php web-crawler guzzle scraper goutte

4 answers

Explosion°爆炸

#2 · 2019-04-09 06:57

Have a look at PhantomJS. If you need it to work from PHP, there is this PHP wrapper:

http://jonnnnyw.github.io/php-phantomjs/

You could read the page with PhantomJS and then feed the contents to the crawler, so that you can use its convenient functions (searching the content, etc.). Depending on your needs, you may be able to simply use the DOM instead, as in this question:

How to get element by class name?
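
As a minimal sketch of the plain-DOM route, assuming the page HTML is already in a string: PHP's built-in DOMDocument and DOMXPath can select elements by class name without any extra library. The markup and the class name `headline` are made up for illustration.

```php
<?php
// Hypothetical markup standing in for a page that has already been fetched.
$html = <<<HTML
<html><body>
  <p class="headline featured">First</p>
  <p class="headline">Second</p>
  <p class="byline">Someone</p>
</body></html>
HTML;

$dom = new DOMDocument();
// Collect parser warnings internally; real-world HTML is rarely strict.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match "headline" as a whole class token, not as a substring of another class.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' headline ')]";

$headlines = [];
foreach ($xpath->query($query) as $node) {
    $headlines[] = trim($node->textContent);
}

print_r($headlines);
```

The `concat`/`normalize-space` wrapper is there so that an element with several classes still matches, while a class such as `headline-small` does not.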

Here is some working code that fetches the page with php-phantomjs and feeds it into the crawler.

  // Assumes: use JonnyW\PhantomJs\Client as PhantomClient;

  // Fetch the rendered page once and feed it into the crawler.
  $content = $this->getHeadlessResponse($url);
  $this->crawler->addContent($content);

  /**
   * Get response using a headless browser (PhantomJS in this case).
   *
   * @param string $url
   *   URL to fetch headless.
   *
   * @return string
   *   Rendered page content, or an empty string on a non-200 status.
   */
  public function getHeadlessResponse($url) {
      $phantomClient = PhantomClient::getInstance();

      // Build the request for the page.
      $request = $phantomClient->getMessageFactory()->createRequest($url, 'GET');

      /** @see JonnyW\PhantomJs\Http\Response */
      $response = $phantomClient->getMessageFactory()->createResponse();

      // Send the request through PhantomJS.
      $phantomClient->send($request, $response);

      if ($response->getStatus() === 200) {
          // Return the requested page content.
          return $response->getContent();
      }

      // Nothing usable came back.
      return '';
  }

The only disadvantage of PhantomJS is that it is slower than Guzzle, but that is unavoidable: you have to wait for all that pesky JavaScript to load.

Juvenile、少年°

#3 · 2019-04-09 06:59

Guzzle (which Goutte uses internally) is an HTTP client. As a result, javascript content will not be parsed or executed. Javascript files which reside outside of the requested endpoint will not be downloaded.
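
To illustrate the point, here is a small sketch that parses a made-up response body the way an HTTP-only client would see it. The markup and the element ids are hypothetical; the built-in DOM extension stands in for the crawler.

```php
<?php
// Hypothetical response body, exactly as an HTTP client would receive it.
$body = <<<HTML
<html><body>
  <div id="static">Rendered on the server</div>
  <div id="dynamic"></div>
  <script>
    document.getElementById('dynamic').textContent = 'Rendered in the browser';
  </script>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($body);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// The server-rendered node carries its text in the markup itself.
$static = trim($xpath->query('//*[@id="static"]')->item(0)->textContent);

// The script is parsed as inert text and never runs, so this node stays empty.
$dynamic = trim($xpath->query('//*[@id="dynamic"]')->item(0)->textContent);

var_dump($static, $dynamic);
```

Only a real browser engine would run the script and fill the second element, which is why a parser working on the raw response finds it empty.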

Depending upon your environment, I suppose it would be possible to utilize PHPv8 (a PHP extension that embeds the Google V8 javascript engine) and a custom handler / middleware to perform what you want.

Then again, depending on your environment, it might be easier to simply perform the scraping with a javascript client.

\"骚年 ilove

#4 · 2019-04-09 07:07

I would recommend getting the response content. Parse it into new HTML (if you have to) and pass it as $html when initializing a new Crawler object. After that you can work with all the data in the response just like with any other Crawler object.

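// Assumes: use Symfony\Component\DomCrawler\Crawler;
$crawler = $client->submit($form);
// Take the raw response body and build a fresh Crawler from it.
$html = $client->getResponse()->getContent();
$newCrawler = new Crawler($html);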

啃猪蹄的小仙女

#5 · 2019-04-09 07:09

Since Goutte cannot execute JavaScript, I can suggest another solution:

Google Chrome > right-click > Inspect Element > right-click the node > Edit as HTML > copy > work with the copied HTML

        // Assumes: use Symfony\Component\DomCrawler\Crawler;
        $html = $the_copied_html;
        $crawler = new Crawler($html);

        // Collect the text of every node matching the selector.
        $data = $crawler->filter('.your-selector')->each(function (Crawler $node, $i) {
            return [
                'text' => $node->text(),
            ];
        });

        // Do whatever you want with $data.
        return $data; // array

This only works for one-off jobs, not for automated processes. In my case that is enough.
