Screen scraping JS page

Posted 2019-02-18 17:35

I'm trying to scrape this page http://www.buddytv.com/trivia/game-of-thrones-trivia.aspx and it's not working.

I tried

$html = new simple_html_dom();
$html->load_file($url);

But the question I'm trying to grab (.trivia-question) can't be found. Can anybody tell me what I'm doing wrong?

Thanks a lot!

I also tried:

<?php
$Page = file_get_contents('http://www.buddytv.com/trivia/game-of-thrones-trivia.aspx');
$dom_document = new DOMDocument();
// Errors are suppressed because the page's mismatched HTML tags trigger warnings
@$dom_document->loadHTML($Page);
// Build the XPath object from the document that was just loaded
$dom_xpath = new DOMXPath($dom_document);
$elements = $dom_xpath->query('//*[@id="id60questionText"]');
var_dump($elements);

Tags: php parsing dom
1 answer
我命由我不由天
Reply #2 · 2019-02-18 18:00

OK, here is a PhantomJS example:

Download PhantomJS from http://phantomjs.org/ and put it somewhere your script can easily access.

Test it by running {installationdir}/bin/phantomjs --version (phantomjs.exe on Windows).

Then create a JS file somewhere in your project, e.g. browser.js:

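var page = require('webpage').create();

page.open('http://www.buddytv.com/trivia/game-of-thrones-trivia.aspx', function() {
    // Inject jQuery so the page can be queried with a CSS selector
    page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        // This callback runs inside the page and returns the question text
        var search = page.evaluate(function() {
            return $('#id60questionText').text();
        });

        // Print the result to stdout so the calling PHP script can read it
        console.log(search);

        phantom.exit();
    });
});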

Then read the output in your PHP script like this:

$pathToPhatomJs = '/home/aurimas/Downloads/phantomjs/phantomjs-1.9.1-linux-x86_64/bin/phantomjs';

$pathToJsScript = '/home/aurimas/Downloads/phantomjs/phantomjs-1.9.1-linux-x86_64/browser.js';

// exec() returns the last line of the command's output; every line is also appended to $out
$stdOut = exec(sprintf('%s %s', $pathToPhatomJs, $pathToJsScript), $out);

echo $stdOut;

Change $pathToPhatomJs and $pathToJsScript according to your configuration.
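
If the paths contain spaces, or you want to know whether PhantomJS actually ran, a slightly more defensive version might look like the sketch below. The function name runPhantomScript is made up for illustration; it only uses PHP's built-in escapeshellarg() and exec(), and it assumes that a non-zero exit code means the run failed.

```php
<?php
// Hypothetical helper (not part of any library): runs PhantomJS with a script
// and returns everything it printed, or null if the command exited non-zero.
function runPhantomScript(string $phantomBinary, string $scriptPath): ?string
{
    // escapeshellarg() quotes each path so spaces and shell metacharacters are safe.
    // "2>&1" merges stderr into stdout so error messages are captured too.
    $command = sprintf(
        '%s %s 2>&1',
        escapeshellarg($phantomBinary),
        escapeshellarg($scriptPath)
    );

    $outputLines = [];
    $exitCode = 0;
    exec($command, $outputLines, $exitCode);

    // Assumption: treat any non-zero exit code as a failure.
    if ($exitCode !== 0) {
        return null;
    }

    // exec() collects the output line by line; join the lines back together.
    return implode("\n", $outputLines);
}
```

You would then call it as runPhantomScript($pathToPhatomJs, $pathToJsScript) and check for null before echoing the result.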

If you are on Windows this may not work. In that case, change the PHP script to:

$pathToPhatomJs = '/home/aurimas/Downloads/phantomjs/phantomjs-1.9.1-linux-x86_64/bin/phantomjs';

$pathToJsScript = '/home/aurimas/Downloads/phantomjs/phantomjs-1.9.1-linux-x86_64/browser.js';

// Redirect PhantomJS's output to a file, then read the file back
exec(sprintf('%s %s > phatom.txt', $pathToPhatomJs, $pathToJsScript), $out);

$fileContents = file_get_contents('phatom.txt');

echo $fileContents;
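
Note that a fixed file name like phatom.txt ends up in the current working directory and would be overwritten if two requests run at the same time. One possible variation, assuming PHP is allowed to write to the system temp directory, is to redirect into a unique temp file instead. The function name runPhantomToTempFile is hypothetical:

```php
<?php
// Hypothetical helper: same redirect-to-file idea, but with a unique temp file
// that is deleted after it has been read.
function runPhantomToTempFile(string $phantomBinary, string $scriptPath): string
{
    // tempnam() creates an empty file with a unique name and returns its path.
    $tempFile = tempnam(sys_get_temp_dir(), 'phantom');
    if ($tempFile === false) {
        throw new RuntimeException('Could not create a temp file');
    }

    // Quote every argument, including the redirect target.
    $command = sprintf(
        '%s %s > %s',
        escapeshellarg($phantomBinary),
        escapeshellarg($scriptPath),
        escapeshellarg($tempFile)
    );
    exec($command);

    $contents = file_get_contents($tempFile);
    unlink($tempFile); // clean up so temp files do not pile up

    return $contents === false ? '' : $contents;
}
```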