I'm scraping a Facebook page with the PhantomJS node module (https://github.com/sgentle/phantomjs-node), but when I try to evaluate the page, the function I pass to it is not evaluated. Running the code as a standalone script with the Node interpreter works, but the same code in an Express.js app does not.
This is my code:
facebookScraper.prototype.scrapeFeed = function (url, cb) {
    f = ':scrapeFeed:';
    var evaluator = function (s) {
        var posts = [];
        for (var i = 0; i < FEED_ITEMS; i++) {
            log.info(__filename+f+' iterating step ' + i);
            log.info(__filename+f+util.inspect(document, false, null));
        }
        return {
            news: posts
        };
    }
    phantom.create(function (ph) {
        ph.createPage(function (page) {
            log.fine(__filename+f+' opening url ' + url);
            page.open(url, function (status) {
                log.fine(__filename+f+' opened site? ' + status);
                setTimeout(function() {
                    page.evaluate(evaluator, function (result) {
                        log.info(__filename+f+'Scraped feed: ' + util.inspect(result, false, null));
                        cb(result, ph);
                    });
                }, 5000);
            });
        });
    });
};
The output I get:
{"level":"fine","message":"PATH/fb_regular.js:scrapeFeed: opening url <URL> ","timestamp":"2012-09-23T18:35:10.151Z"}
{"level":"fine","message":"PATH/fb_regular.js:scrapeFeed: opened site? success","timestamp":"2012-09-23T18:35:12.682Z"}
{"level":"info","message":"PATH/fb_regular.js:scrapeFeed: Scraped feed: null","timestamp":"2012-09-23T18:35:12.687Z"}
As you can see, it calls the phantom callback function (the second parameter of the evaluate function) with a null argument, but it doesn't execute the first parameter (my evaluator function, which should print "iterating step X").
Does anyone know what the problem is?
I'm not sure which version of PhantomJS you are using, but according to the documentation for versions 1.6+, logging inside an evaluated script writes to the console of the contained page, not to your own console. To see those messages you have to bind a handler to the page's onConsoleMessage event:
page.onConsoleMessage = function (msg) { console.log(msg); };
As for the result not being available: in plain PhantomJS, page.evaluate takes the function to execute as its first argument, and any remaining arguments are passed as input to that function. The result is returned directly:
var title = page.evaluate(function (s) {
    return document.querySelector(s).innerText;
}, 'title');
console.log(title);
evaluate is run in sandbox mode, which means that none of the variables defined in the containing environment are available, including cb, the phantom object, or any functions that you may have defined.
You can explicitly tunnel information into the sandbox as additional arguments to evaluate.
page.evaluate(function(cb){...}, cb);
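To make the sandbox behaviour concrete, here is a rough simulation that runs in plain Node without PhantomJS. It assumes the evaluated function is shipped as source text and its arguments as JSON, which is a simplification and not the library's actual implementation; evaluateInSandbox is a hypothetical helper name.

```javascript
const vm = require('vm');

// Hypothetical stand-in for page.evaluate: send the function's source text
// to a fresh context and pass the arguments through JSON.
function evaluateInSandbox(fn, ...args) {
  const source = '(' + fn.toString() + ').apply(null, ' + JSON.stringify(args) + ')';
  try {
    return { value: vm.runInNewContext(source, {}) };
  } catch (err) {
    return { error: err.name };
  }
}

const FEED_ITEMS = 3; // outer variable, not visible inside the sandbox

// Fails: FEED_ITEMS is read from the enclosing scope, which is not shipped.
const viaClosure = evaluateInSandbox(function () {
  return FEED_ITEMS * 2;
});

// Works: the value is passed in explicitly as an argument.
const viaArgument = evaluateInSandbox(function (n) {
  return n * 2;
}, FEED_ITEMS);

console.log(viaClosure);  // { error: 'ReferenceError' }
console.log(viaArgument); // { value: 6 }
```

The same thing happens to log, util and FEED_ITEMS in the question's evaluator: they exist in Node, but not in the context where the function actually runs.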
PhantomJS' page.evaluate() function is the door to the DOM context (page context). It is only possible to access the DOM through this function. Since the function is sandboxed, you cannot use variables defined outside of it; they have to be passed in explicitly. There are limitations on what can be passed in and out, though (docs):
Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.
Closures, functions, DOM nodes, etc. will not work!
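The rule of thumb can be checked in plain Node with a JSON round trip. This only approximates what crosses the page boundary (the exact serialization PhantomJS uses may differ in details), and roundTrip is a hypothetical helper name.

```javascript
// Approximates the trip across the page boundary with a JSON round trip.
function roundTrip(value) {
  return JSON.parse(JSON.stringify(value));
}

const sent = {
  count: 3,
  tags: ['a', 'b'],
  when: new Date(0),         // becomes an ISO date string
  callback: function () {},  // functions are dropped
  missing: undefined         // undefined properties are dropped
};

const received = roundTrip(sent);
console.log(received);
// { count: 3, tags: [ 'a', 'b' ], when: '1970-01-01T00:00:00.000Z' }
```

Numbers, strings and plain arrays or objects come through unchanged; anything that JSON cannot represent is lost or converted.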
phantomjs-node is a bridge between PhantomJS and node.js and as such has a slightly different API than PhantomJS itself. Functions that are synchronous in PhantomJS don't return anything in phantomjs-node, but take a callback where the result is passed in. The callback executes in the outer context and is not sandboxed.
The arguments can be passed in this way:
page.evaluate(function(arg1, arg2){
    // use arg1 and arg2 in the page
    // return `result`
}, function(result){
    // use `result` in the node context
}, "some arg1", "another arg");
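Applied to the question, the pattern above might look like the following sketch. It assumes the callback-style page.evaluate(fn, callback, args...) signature described in this answer; the selector argument and the use of innerText are illustrative assumptions, not something taken from the original code.

```javascript
// Runs inside the page: it may only use its own arguments and page globals
// such as `document`. No log, util or FEED_ITEMS from Node in here.
function evaluator(feedItems, selector) {
  var posts = [];
  var nodes = document.querySelectorAll(selector);
  for (var i = 0; i < feedItems && i < nodes.length; i++) {
    posts.push(nodes[i].innerText); // plain strings survive serialization
  }
  return { news: posts };
}

// Runs in Node: `page` is assumed to be a phantomjs-node page object.
// The callback is not sandboxed, so it can use cb and any other Node variable.
function scrapeFeed(page, feedItems, selector, cb) {
  page.evaluate(evaluator, function (result) {
    cb(result);
  }, feedItems, selector);
}
```

Logging is done in the Node callback rather than in the evaluator, and the loop limit is passed in as an argument instead of being read from an outer variable.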
The following worked for me to evaluate a page:
page.evaluate(function(s) {
    return document.querySelector(s)
}, 'body').then(res => {
    console.log(res)
})
Someone else had an evaluate block containing only a console.log line, and it never executed, so it's not always a sandbox problem.
See: On PhantomJS I can't include jQuery and without jQuery I can't post form data