PhantomJS has the loadImages setting,
but I want finer control.
How can I make PhantomJS skip downloading certain kinds of resources,
such as CSS?
=====
Good news: this feature has been added.
https://code.google.com/p/phantomjs/issues/detail?id=230
The gist:
page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
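The regex in the gist only matches http:// URLs that contain ".css". Here is a rough sketch of a more general check that also covers https and URLs with query strings. The helper name shouldSkip and the extension list are my own assumptions, not part of PhantomJS:

```javascript
// Hypothetical helper: decide by file extension whether a URL should be skipped.
// The extension list is an assumption; adjust it to the resources you want to drop.
var BLOCKED_EXTENSIONS = ['css', 'woff', 'ttf'];

function shouldSkip(url) {
    // Strip the query string and fragment before looking at the extension.
    var path = String(url).split(/[?#]/)[0];
    var match = /\.([a-z0-9]+)$/i.exec(path);
    return match !== null &&
        BLOCKED_EXTENSIONS.indexOf(match[1].toLowerCase()) !== -1;
}

// Only wire the handler up when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceRequested = function (requestData, request) {
        if (shouldSkip(requestData.url)) {
            console.log('Skipping: ' + requestData.url);
            request.abort();
        }
    };
}
```

Keeping the decision in a plain function makes it easy to try out against a few URLs before using it in a real crawl.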
Update (working):
Since PhantomJS 1.9, the earlier answer no longer works. Use this code instead:
var webPage = require('webpage');
var page = webPage.create();
page.onResourceRequested = function(requestData, networkRequest) {
    // Escape the dot so that it matches a literal "." in the file name.
    var match = requestData.url.match(/wordfamily\.js/g);
    if (match !== null) {
        console.log('Request (#' + requestData.id + '): ' + JSON.stringify(requestData));
        networkRequest.cancel(); // or .abort()
    }
};
If you use abort() instead of cancel(), it will trigger onResourceError.
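Since abort() ends up in onResourceError, it can help to tell your own aborts apart from real failures. This is only a sketch: createBlocker is a made-up name, and it assumes the id reported in onResourceError is the same id that was seen in onResourceRequested:

```javascript
// Hypothetical factory: builds a pair of handlers that share a record of
// the request ids that were aborted on purpose.
// Pass a pattern without the "g" flag, since test() on a global regex is stateful.
function createBlocker(pattern) {
    var aborted = {};   // request id -> true for requests we aborted ourselves
    var failures = [];  // URLs of errors that we did not cause

    return {
        failures: failures,
        onResourceRequested: function (requestData, networkRequest) {
            if (pattern.test(requestData.url)) {
                aborted[requestData.id] = true;
                networkRequest.abort();
            }
        },
        onResourceError: function (resourceError) {
            // Assumption: resourceError.id equals the requestData.id seen earlier.
            if (aborted[resourceError.id]) {
                return; // expected: this error was raised by our own abort()
            }
            failures.push(resourceError.url);
        }
    };
}

// Only wire the handlers up when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var blocker = createBlocker(/\.css(\?|$)/i);
    page.onResourceRequested = blocker.onResourceRequested;
    page.onResourceError = blocker.onResourceError;
}
```

After the page has loaded, blocker.failures should then hold only the URLs that failed for some other reason.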
See the PhantomJS docs for details.
Finally, you can try this: http://github.com/eugenehp/node-crawler
Otherwise, you can still try one of the approaches below with PhantomJS.
The easy way is to load the page -> parse the page -> exclude the unwanted resources -> load the result into PhantomJS.
Another way is simply to block the hosts in the firewall.
Optionally, you can use a proxy to block certain URLs and requests to them.
One more option: load the page and then remove the unwanted resources, but I don't think that is the right approach here.
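The "load -> parse -> exclude -> load into PhantomJS" route could look roughly like this. It is a sketch under assumptions: stripStylesheets is a made-up helper, the regex is a crude stand-in for a real HTML parser, and how you fetch the raw HTML in the first place is left out:

```javascript
// Hypothetical helper: remove <link rel="stylesheet"> tags from an HTML string.
// A regex is only a rough approximation and will miss unusual markup.
function stripStylesheets(html) {
    return html.replace(
        /<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>/gi,
        ''
    );
}

// Only touch the page object when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://example.com/'; // placeholder URL
    // Placeholder markup; in practice this is the HTML you fetched yourself.
    var html = '<html><head><link rel="stylesheet" href="a.css"></head>' +
               '<body>Hello</body></html>';
    // setContent() loads the cleaned markup as if it came from the given URL.
    page.setContent(stripStylesheets(html), url);
    // Continue with page.evaluate(), page.render(), etc.
}
```

Note that this only removes stylesheets referenced from the markup; anything injected later by scripts would still be requested.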
Use page.onResourceRequested, as in the example loadurlwithoutcss.js:
page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) ||
            requestData.headers['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
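One caveat about the header check: as far as I can tell, requestData.headers is a list of {name, value} objects rather than a map, so looking it up by key may never match. Below is a sketch of a lookup helper. The names getHeader and looksLikeCssRequest are made up, and the Accept-based check assumes WebKit asks for stylesheets with an Accept header that starts with text/css:

```javascript
// Hypothetical helper: case-insensitive lookup in a list of {name, value} headers.
function getHeader(headers, name) {
    var list = headers || [];
    var wanted = name.toLowerCase();
    for (var i = 0; i < list.length; i++) {
        if (String(list[i].name).toLowerCase() === wanted) {
            return list[i].value;
        }
    }
    return undefined;
}

// Assumption: stylesheet requests carry an Accept header starting with "text/css".
function looksLikeCssRequest(requestData) {
    var accept = getHeader(requestData.headers, 'Accept');
    return typeof accept === 'string' && accept.indexOf('text/css') === 0;
}

// Only wire the handler up when running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.onResourceRequested = function (requestData, request) {
        if (looksLikeCssRequest(requestData)) {
            console.log('Aborting: ' + requestData.url);
            request.abort();
        }
    };
}
```

It is worth dumping JSON.stringify(requestData) once to confirm what the headers look like in your PhantomJS version before relying on this.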
There is no way for now (PhantomJS 1.7); it does NOT support that.
A nasty workaround is to use an HTTP proxy, so you can filter out the requests that you don't need.
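For the proxy workaround, something like the following could serve as a starting point. It is a minimal sketch for Node.js (not PhantomJS itself), it handles plain HTTP only (no HTTPS/CONNECT tunnelling), and the names isBlocked and createFilteringProxy as well as the port are my own choices:

```javascript
var http = require('http');

// Hypothetical rule: block anything whose path ends in .css.
function isBlocked(requestUrl) {
    return /\.css(\?|$)/i.test(requestUrl);
}

// Builds (but does not start) a forwarding proxy that rejects blocked URLs.
function createFilteringProxy() {
    return http.createServer(function (clientReq, clientRes) {
        if (isBlocked(clientReq.url)) {
            clientRes.writeHead(403);
            clientRes.end();
            return;
        }
        // A client talking to a proxy sends the absolute URL in the request line.
        var target = new URL(clientReq.url);
        var upstream = http.request({
            hostname: target.hostname,
            port: target.port || 80,
            path: target.pathname + target.search,
            method: clientReq.method,
            headers: clientReq.headers
        }, function (upstreamRes) {
            clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
            upstreamRes.pipe(clientRes);
        });
        upstream.on('error', function () {
            if (!clientRes.headersSent) {
                clientRes.writeHead(502);
            }
            clientRes.end();
        });
        clientReq.pipe(upstream);
    });
}

// To run it: createFilteringProxy().listen(8080);
```

With the proxy listening, PhantomJS can be pointed at it with the --proxy command-line option, for example phantomjs --proxy=127.0.0.1:8080 yourscript.js (the script name is a placeholder).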