Scrape web page data generated by JavaScript

Posted 2019-01-03 18:43

My question is: how can I scrape data from this website, http://vtis.vn/index.aspx? The data is not shown until you click on, for example, "Danh sách chậm". I have looked at this very carefully: when you click "Danh sách chậm", an onclick event triggers several JavaScript functions, one of which fetches the data from the server and inserts it into a tag/placeholder. At that point you can use something like Firefox's developer tools to examine the data, and it is displayed to users/viewers on the web page. So again, how can we scrape this data programmatically?

I wrote a scraping function, but of course it does not get the data I want, because the data is not available until I click the "Danh sách chậm" button:

    <?php
    // Fetch the raw HTML. Note: this only returns the server-rendered markup,
    // not the data that JavaScript inserts after "Danh sách chậm" is clicked.
    $Page = file_get_contents('http://vtis.vn/index.aspx');

    $dom_document = new DOMDocument();
    @$dom_document->loadHTML($Page); // suppress warnings about malformed markup

    $dom_xpath = new DOMXpath($dom_document);
    $elements  = $dom_xpath->query("//td[@class='IconMenuColumn']");

    foreach ($elements as $element) {
        foreach ($element->childNodes as $node) {
            $html = $node->C14N();
            echo mb_convert_encoding($html, 'iso-8859-1', mb_detect_encoding($html, 'UTF-8', true));
        }
    }

Thank you kindly, Stack Overflow is a great place. D.

2 Answers
冷血范
#2 · 2019-01-03 19:09

First, you need PhantomJS:

Second, you need the php-phantomjs package:

  1. Install Composer (if it does not already exist on your server)
  2. Install the package (php-phantomjs); you might have a look at this guide:

https://github.com/jonnnnyw/php-phantomjs http://jonnnnyw.github.io/php-phantomjs/4.0/2-installation/

Third, load the package in your script: require('vendor/autoload.php');

Finally, instead of file_get_contents, you load the page via PhantomJS:

    use JonnyW\PhantomJs\Client;

    require('vendor/autoload.php');

    // Create the client and point it at the PhantomJS binary.
    $client = Client::getInstance();
    $client->getEngine()->setPath('/usr/local/bin/phantomjs');

    $request  = $client->getMessageFactory()->createRequest();
    $response = $client->getMessageFactory()->createResponse();

    $request->setMethod('GET');
    $request->setUrl('https://www.your_page_embeded_ajax_request');

    // PhantomJS loads the page, runs its JavaScript, and fills $response.
    $client->send($request, $response);

    if ($response->getStatus() === 200) {
        echo "Do something here";
    }
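
To tie this back to the question, here is a minimal sketch (untested against vtis.vn) of feeding the PhantomJS-rendered markup into the same DOMXPath logic from the question. $response->getContent() returns the rendered HTML; the XPath expression is carried over from the question and may need adjusting once you inspect the rendered page. Also note that if the data only appears after the "Danh sách chậm" click, a plain page load may still not contain it, and you may need to hit the underlying AJAX request directly (see the other answer).

    if ($response->getStatus() === 200) {
        // HTML after PhantomJS has executed the page's JavaScript.
        $html = $response->getContent();

        $dom = new DOMDocument();
        @$dom->loadHTML($html); // suppress warnings about malformed markup

        $xpath = new DOMXpath($dom);
        foreach ($xpath->query("//td[@class='IconMenuColumn']") as $td) {
            echo $dom->saveHTML($td), "\n";
        }
    }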
来,给爷笑一个
#3 · 2019-01-03 19:14

You need to look at PhantomJS.

From their site:

PhantomJS is a headless WebKit scriptable with a JavaScript API. It has fast and native support for various web standards: DOM handling, CSS selector, JSON, Canvas, and SVG.

Using the API you can script the "browser" to interact with that page and scrape the data you need. You can then do whatever you need with it; including passing it to a PHP script if necessary.


That being said, if at all possible, try not to "scrape" the data. If the page is making an AJAX call, maybe there is an API you can use instead? If not, maybe you can convince them to make one. That would of course be much easier and more maintainable than screen scraping.
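
For example, if the browser's network panel shows that clicking "Danh sách chậm" issues a simple HTTP request, you can often call that endpoint directly from PHP. The sketch below is purely illustrative: the endpoint URL, query parameter, and JSON response format are assumptions; inspect the real request in Firefox's network tab first.

    <?php
    // Hypothetical endpoint spotted in the browser's network tab; the real
    // URL, parameters, and response format on vtis.vn will differ.
    $url = 'http://vtis.vn/some-ajax-endpoint.aspx?list=cham';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['X-Requested-With: XMLHttpRequest']);
    $body = curl_exec($ch);
    curl_close($ch);

    // If the endpoint returns JSON, decode it; otherwise parse the HTML fragment.
    $data = json_decode($body, true);
    var_dump($data);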
