I'm scraping a Facebook page with the PhantomJS node module (https://github.com/sgentle/phantomjs-node), but when I try to evaluate the page, the function I pass to it is not executed. The code works when I put it in a standalone script and run it with the Node interpreter, but the same code in an Express.js app does not work.
This is my code:
facebookScraper.prototype.scrapeFeed = function (url, cb) {
    f = ':scrapeFeed:';
    var evaluator = function (s) {
        var posts = [];
        for (var i = 0; i < FEED_ITEMS; i++) {
            log.info(__filename+f+' iterating step ' + i);
            log.info(__filename+f+util.inspect(document, false, null));
        }
        return {
            news: posts
        };
    }
    phantom.create(function (ph) {
        ph.createPage(function (page) {
            log.fine(__filename+f+' opening url ' + url);
            page.open(url, function (status) {
                log.fine(__filename+f+' opened site? ' + status);
                setTimeout(function() {
                    page.evaluate(evaluator, function (result) {
                        log.info(__filename+f+'Scraped feed: ' + util.inspect(result, false, null));
                        cb(result, ph);
                    });
                }, 5000);
            });
        });
    });
};
The output I get:
{"level":"fine","message":"PATH/fb_regular.js:scrapeFeed: opening url <URL> ","timestamp":"2012-09-23T18:35:10.151Z"}
{"level":"fine","message":"PATH/fb_regular.js:scrapeFeed: opened site? success","timestamp":"2012-09-23T18:35:12.682Z"}
{"level":"info","message":"PATH/fb_regular.js:scrapeFeed: Scraped feed: null","timestamp":"2012-09-23T18:35:12.687Z"}
So, as you see, it calls the phantom callback function (second parameter in the evaluate function) with a null argument, but it doesn't execute the first parameter (my evaluator function, which prints iterating step X).
Does anyone know what the problem is?
PhantomJS' page.evaluate() function is the door to the DOM context (page context). It is only possible to access the DOM through this function. Since the function is sandboxed, you cannot use variables defined outside of it; they have to be passed in explicitly. There are limitations on what can be passed in and out, though (see the docs): arguments and the return value must be simple primitive objects, roughly anything that can be serialized via JSON. Closures, functions and DOM nodes will not work.
phantomjs-node is a bridge between PhantomJS and node.js and as such has a slightly different API than PhantomJS itself. Functions that are synchronous in PhantomJS don't return anything in phantomjs-node, but take a callback to which the result is passed. The callback executes in the outer context and is not sandboxed.
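To see why outside variables vanish, the sandbox can be imitated in plain Node. This is only a rough imitation, not PhantomJS itself: it assumes that just the function's source text crosses the boundary, so the rebuilt copy has lost its closure. `runDetached` and `demo` are hypothetical names used for illustration.

```javascript
// Rebuilds a function from its source text, the way a sandbox that only
// receives the text would, then calls the rebuilt copy with `args`.
function runDetached(fn, args) {
  var source = fn.toString();
  var rebuilt = new Function('return (' + source + ');')();
  return rebuilt.apply(null, args || []);
}

function demo() {
  var feedItems = 3; // closure variable, invisible to the rebuilt copy

  var usesClosure = function () { return feedItems; };
  var usesArgument = function (n) { return n; };

  var closureOutcome;
  try {
    closureOutcome = runDetached(usesClosure);
  } catch (err) {
    closureOutcome = err.name; // the lookup of feedItems fails
  }

  return {
    closure: closureOutcome,
    argument: runDetached(usesArgument, [feedItems])
  };
}
```

In the question's code, `FEED_ITEMS`, `log`, `f` and `util` are all in the same position as `feedItems` here, which would explain why the evaluator never appears to run.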
The arguments can be passed in this way:
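A minimal sketch of that calling convention, assuming the callback-style phantomjs-node API of the time, where extra arguments follow the callback. `readTitleWithSuffix`, `suffix` and `done` are hypothetical names; `page` is assumed to be the object handed to the `ph.createPage` callback.

```javascript
// `page` is assumed to be a phantomjs-node page object from ph.createPage().
function readTitleWithSuffix(page, suffix, done) {
  page.evaluate(
    function (s) {
      // Page context: only `s` and the page's own globals are visible here.
      return document.title + s;
    },
    function (result) {
      // Outer (Node) context: closures such as `done` work normally.
      done(result);
    },
    suffix // forwarded to the sandboxed function as `s`
  );
}
```

The first function must be self-contained, while the second one can freely use loggers, callbacks and anything else defined in the Node process.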
Someone else has an evaluate block containing only a console.log line, and it never executes, so it's not always a sandbox problem.
See: On PhantomJS I can't include jQuery and without jQuery I can't post form data
I'm unsure which version of PhantomJS you are using, but according to the documentation for versions 1.6+, logging inside an evaluated script logs to the contained page, not to your console. To see those messages you have to bind a handler to the page's onConsoleMessage event:
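A minimal sketch of such a binding, assuming a plain PhantomJS page object whose `onConsoleMessage` property accepts a handler. `forwardPageConsole` and `sink` are hypothetical names. With phantomjs-node the handler is, as far as I know, registered through the bridge (for example via `page.set('onConsoleMessage', ...)`) rather than by direct assignment, so treat that part as an assumption.

```javascript
// `page` is assumed to be a PhantomJS page, e.g. require('webpage').create().
function forwardPageConsole(page, sink) {
  // PhantomJS invokes this handler for each console message emitted in the page.
  page.onConsoleMessage = function (msg, lineNum, sourceId) {
    sink('PAGE: ' + msg);
  };
}
```

Calling `forwardPageConsole(page, console.log)` before `page.evaluate` would then surface any `console.log` output produced inside the evaluated function.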
As for the result not being available: in PhantomJS itself, the page.evaluate function takes the function to be executed as its first argument and passes the remaining arguments to that function as input. The result is returned directly:
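A short sketch of that synchronous form, assuming a plain PhantomJS page object (not phantomjs-node, where the result arrives in a callback instead). `countMatches` and `selector` are hypothetical names.

```javascript
// `page` is assumed to be a PhantomJS page, e.g. require('webpage').create().
function countMatches(page, selector) {
  // The second argument is handed to the sandboxed function as `sel`,
  // and the sandboxed function's return value becomes evaluate()'s result.
  return page.evaluate(function (sel) {
    return document.querySelectorAll(sel).length;
  }, selector);
}
```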
The following worked for me to evaluate a page:
evaluate is run in sandbox mode, which means that none of the variables defined in the containing environment are available, including cb or even the phantom object or any functions that you may have defined. You can explicitly tunnel information into the sandbox as additional arguments to evaluate.
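As an illustration, here is one way the evaluator from the question might be restructured so that it depends only on its arguments. This is a sketch under assumptions: it uses the callback-style phantomjs-node signature, and `collectPosts`, `scrapePosts`, `feedItems` and `selector` are hypothetical names (the question does not say which selector matches a post).

```javascript
// Sandboxed function: uses only its arguments and the page's own globals.
function collectPosts(feedItems, selector) {
  var nodes = document.querySelectorAll(selector);
  var posts = [];
  for (var i = 0; i < feedItems && i < nodes.length; i++) {
    // Return plain strings: DOM nodes cannot be passed back out of the page.
    posts.push(nodes[i].textContent);
  }
  return { news: posts };
}

// `page` is assumed to be a phantomjs-node page object (callback-style API).
function scrapePosts(page, feedItems, selector, done) {
  page.evaluate(collectPosts, function (result) {
    done(result); // runs in Node, so closures are available again
  }, feedItems, selector);
}
```

Logging is deliberately left out of `collectPosts`, since the `log` object lives in the Node process and cannot be reached from the page.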