How can I control PhantomJS to skip downloading some kinds of resources?

Posted 2019-01-21 13:05

Question:

PhantomJS has a loadImages setting, but I want more control. How can I make PhantomJS skip downloading certain kinds of resources, such as CSS?

=====

Good news: this feature has been added.

https://code.google.com/p/phantomjs/issues/detail?id=230

The gist:

page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};

Answer 1:

UPDATED, Working!

As of PhantomJS 1.9, the existing answer no longer works. Use this code instead:

var webPage = require('webpage');
var page = webPage.create();

page.onResourceRequested = function(requestData, networkRequest) {
  // Escape the dot so that only the literal file name "wordfamily.js" matches.
  var match = requestData.url.match(/wordfamily\.js/);
  if (match != null) {
    console.log('Request (#' + requestData.id + '): ' + JSON.stringify(requestData));
    networkRequest.cancel(); // or .abort()
  }
};

If you use abort() instead of cancel(), it will trigger onResourceError.
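
To see which requests ended up in onResourceError, a handler can log them. The following is only a minimal sketch, not part of the original answer: describeResourceError is a hypothetical helper name, and it assumes the error object carries the url, errorCode and errorString fields described in the PhantomJS documentation for onResourceError.

```javascript
// Hypothetical helper: turns a PhantomJS resource error into one log line.
function describeResourceError(resourceError) {
  return 'Unable to load ' + resourceError.url +
    ' (code ' + resourceError.errorCode + '): ' + resourceError.errorString;
}

// Inside a PhantomJS script the helper would be attached like this:
//   page.onResourceError = function(resourceError) {
//     console.log(describeResourceError(resourceError));
//   };
```

Keeping the formatting in a separate function means it can be checked outside PhantomJS with a plain object.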

See the PhantomJS docs for details.



Answer 2:

You can try this: http://github.com/eugenehp/node-crawler

Otherwise, you can still try the approaches below with PhantomJS.

The easy way is to load the page, parse it, exclude the unwanted resources, and then load the result into PhantomJS.
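
A rough sketch of that first approach, assuming the page HTML has already been fetched as a string by some other means. stripStylesheetLinks is a hypothetical helper, and the regular expression only handles simple link tags with rel="stylesheet", so treat it as an illustration rather than a robust HTML filter. The html variable and the example.com address in the comments are placeholders.

```javascript
// Hypothetical helper: removes <link rel="stylesheet" ...> tags from an HTML string.
function stripStylesheetLinks(html) {
  return html.replace(/<link\b[^>]*\brel\s*=\s*["']?stylesheet["']?[^>]*>/gi, '');
}

// Inside a PhantomJS script the cleaned markup could then be loaded with
// page.setContent, which takes the HTML and the URL to use as the page address:
//   var page = require('webpage').create();
//   page.setContent(stripStylesheetLinks(html), 'http://example.com/');
```

Stylesheets pulled in by other means, such as @import rules or scripts, are not covered by this sketch.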

Another way is simply to block the hosts in the firewall.

Optionally, you can use a proxy to block certain URLs and the requests to them.

One more option is to load the page and then remove the unwanted resources, but I don't think that is the right approach here.



Answer 3:

Use page.onResourceRequested, as in the example loadurlwithoutcss.js:

page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || 
            requestData.headers['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
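
Note that the pattern in this example only matches http:// URLs. A variant that also covers https:// and ignores query strings might look like the sketch below. isStylesheetUrl and makeResourceFilter are hypothetical names introduced here, not part of PhantomJS; only abort() on the request object is the PhantomJS call used above.

```javascript
// Hypothetical helper: true when the URL path ends in .css,
// regardless of scheme, query string or fragment.
function isStylesheetUrl(url) {
  var path = url.split(/[?#]/)[0];
  return /\.css$/i.test(path);
}

// Hypothetical factory: builds a handler suitable for page.onResourceRequested.
function makeResourceFilter(shouldSkip) {
  return function (requestData, networkRequest) {
    if (shouldSkip(requestData.url)) {
      console.log('Skipping: ' + requestData.url);
      networkRequest.abort();
    }
  };
}

// Inside a PhantomJS script the handler would be attached like this:
//   var page = require('webpage').create();
//   page.onResourceRequested = makeResourceFilter(isStylesheetUrl);
```

Passing the predicate in as an argument makes it easy to swap in a different rule, for example one that skips fonts or scripts instead.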


Answer 4:

There is no way to do this for now (PhantomJS 1.7); it does NOT support that.

But a nasty workaround is to use an HTTP proxy, so you can screen out the requests that you don't need.
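
As an illustration of the proxy idea, here is a minimal sketch of a filtering proxy. It runs under Node.js, not inside PhantomJS, and rests on several assumptions: it forwards plain HTTP only (HTTPS tunnelling is not handled), the blocking rule and the names isBlocked and createFilteringProxy are hypothetical, and port 8080 is an arbitrary choice.

```javascript
var http = require('http');

// Hypothetical rule: block anything that looks like a stylesheet request.
function isBlocked(requestUrl) {
  return /\.css(\?|#|$)/i.test(requestUrl);
}

// Builds a forwarding proxy that answers 403 for blocked URLs.
// Plain HTTP only; HTTPS tunnelling (CONNECT) is not handled.
function createFilteringProxy(shouldBlock) {
  return http.createServer(function (clientReq, clientRes) {
    if (shouldBlock(clientReq.url)) {
      clientRes.writeHead(403);
      clientRes.end();
      return;
    }
    var target;
    try {
      // A proxy receives the absolute URL on the request line.
      target = new URL(clientReq.url);
    } catch (e) {
      clientRes.writeHead(400);
      clientRes.end();
      return;
    }
    var upstream = http.request({
      hostname: target.hostname,
      port: target.port || 80,
      path: target.pathname + target.search,
      method: clientReq.method,
      headers: clientReq.headers
    }, function (upstreamRes) {
      clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
      upstreamRes.pipe(clientRes);
    });
    upstream.on('error', function () {
      if (!clientRes.headersSent) {
        clientRes.writeHead(502);
      }
      clientRes.end();
    });
    clientReq.pipe(upstream);
  });
}

// To run it: createFilteringProxy(isBlocked).listen(8080);
// then start PhantomJS with: phantomjs --proxy=127.0.0.1:8080 script.js
```

PhantomJS accepts the proxy address through its --proxy command-line option; script.js stands in for whatever script is being run.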



Tags: phantomjs