php crawl - javascript enabled

Posted 2020-06-29 09:22

Question:

Bonjour, does anyone know of a way to create a spider that behaves as if it had JavaScript enabled?

If you used this PHP code:

$html = file_get_contents("http://www.google.co.uk/search?hl=en&q=" . urlencode($keyword) . "&start=" . ($x * 10) . "&sa=N");

it would retrieve the output of that page. But if you used this PHP code:

$html = file_get_contents("http://www.facebook.com/something/something.something.php");
(I'm not sure about the URL; I just know Facebook is a good example.)

it would return the output, which I'm guessing would include something along the lines of "you must have JavaScript enabled to continue", because it is a JavaScript-operated site (not accessible without it).

EDIT: Just checked with this PHP code:

$link = "http://www.facebook.com/index.php";
$contents = file_get_contents($link);
echo $contents;

returns:

You are using an incompatible web browser.

Sorry, we're not cool enough to support your browser. Please keep it real with one of the following browsers:

* Mozilla Firefox
* Safari
* Microsoft Internet Explorer

I have tested the site in all of the above browsers, and it works there. So why does the script get this message?

Answer 1:

Apparently, in this specific case, Facebook is only checking the "User-Agent" HTTP header.

If I use this piece of code, based on curl, which lets me set a lot of options using curl_setopt:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.facebook.com/index.php");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$html = curl_exec($ch);
curl_close($ch);
echo $html;

I get the same message as you do.


But if I try sending a User-Agent that corresponds to Firefox (I just copy-pasted the one my real Firefox is actually sending):

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.facebook.com/index.php");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.3) Gecko/20090910 Ubuntu/9.04 (jaunty) Shiretoko/3.5.3");
$html = curl_exec($ch);
curl_close($ch);
echo $html;

I get the real Facebook homepage, and not that error message about an incompatible browser.
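Since the question uses file_get_contents rather than curl, here is a minimal sketch of how the same User-Agent header could be sent with it, using a stream context. The helper names make_browser_context() and fetch_with_user_agent() and the 10-second timeout are assumptions made for illustration, not part of PHP or of the original code. Like the curl version, this only changes the header that is sent; it does not execute any JavaScript.

```php
<?php
// Hypothetical helper: builds a stream context that sends a browser-like
// User-Agent. The "http" wrapper options are also used for https:// URLs.
function make_browser_context(string $userAgent, int $timeoutSeconds = 10)
{
    return stream_context_create([
        'http' => [
            'method'          => 'GET',
            'user_agent'      => $userAgent,
            'follow_location' => 1, // follow redirects, like CURLOPT_FOLLOWLOCATION
            'timeout'         => $timeoutSeconds,
        ],
    ]);
}

// Hypothetical helper: fetches a page with the given User-Agent.
// Returns the response body, or false if the request failed.
function fetch_with_user_agent(string $url, string $userAgent)
{
    $context = make_browser_context($userAgent);
    return file_get_contents($url, false, $context);
}
```

It could then be called as $html = fetch_with_user_agent("http://www.facebook.com/index.php", $userAgent); where $userAgent holds the Firefox string used above. Check the result with === false before echoing it, since file_get_contents returns false on failure.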


Of course, this will not solve the problem of JavaScript not being executed...

... But executing JavaScript without a browser is quite a difficult thing (not even Google has solved it ^^ )

There are engines that let you run JavaScript code without a browser (Rhino, for instance, or the SpiderMonkey PECL extension for PHP); but even though they let you run JavaScript code, you will not have the whole environment and all the methods that a browser provides, and on which websites rely...


One idea, if you need to crawl a JavaScript-dependent website, might be to use Selenium, which opens a real browser (e.g. Firefox, or another one) and lets you control it from your PHP code via Selenium RC.

But that means you must have a graphical environment and a browser on your PHP machine; this is also quite heavy and slow, a lot slower than just loading a webpage ^^