Grab the resource contents in CasperJS or PhantomJS


Question:

I see that CasperJS has a "download" function and an "on resource received" callback but I do not see the contents of a resource in the callback, and I don't want to download the resource to the filesystem.

I want to grab the contents of the resource so that I can do something with it in my script. Is this possible with CasperJS or PhantomJS?
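
For reference, the onResourceReceived callback only hands back response metadata, not the body; a minimal sketch of what I mean:

var page = require('webpage').create();

page.onResourceReceived = function(response) {
    //response carries url, status, contentType, headers, bodySize, stage, etc., but no content
    console.log(response.stage + ' ' + response.status + ' ' + response.url);
};

page.open('http://google.com', function() {
    phantom.exit();
});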

Answer 1:

This problem has been in my way for the last couple of days. The proxy solution wasn't very clean in my environment, so I found out where PhantomJS's QtNetwork core puts resources when it caches them.

Long story short, here is my gist. You need the cache.js and mimetype.js files: https://gist.github.com/bshamric/4717583

//for this to work, you have to call phantomjs with the cache enabled:
//usage:  phantomjs --disk-cache=true test.js

var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');

//this is the path that the QtNetwork classes use for caching files for their HTTP client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';

var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };

//when the resource is received, go ahead and include a reference to it in the cache object
page.onResourceReceived = function(response) {
  //I only cache images, but you can change this
    if(response.contentType.indexOf('image') >= 0)
    {
        cache.includeResource(response);
    }
};

//when the page is done loading, go through each cachedResource and do something with it, 
//I'm just saving them to a file
page.onLoadFinished = function(status) {
    for(var index in cache.cachedResources) {
        var file = cache.cachedResources[index].cacheFileNoPath;
        var ext = mimetype.ext[cache.cachedResources[index].mimetype];
        var finalFile = file.replace("."+cache.cacheExtension,"."+ext);
        fs.write('saved/'+finalFile,cache.cachedResources[index].getContents(),'b');
    }
};

page.open(url, function () {
    page.render('saved/google.pdf');
    phantom.exit();
});
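
For reference, the mimetype.ext lookup used above is just a content-type-to-extension map. A rough sketch of its shape (an assumption for illustration only; the actual file is in the gist):

//hypothetical sketch of the kind of table mimetype.js exposes (see the gist for the real one)
exports.ext = {
    'image/png': 'png',
    'image/jpeg': 'jpg',
    'image/gif': 'gif'
};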

Then when you call phantomjs, just make sure the cache is enabled:

phantomjs --disk-cache=true test.js

Some notes: I wrote this for the purpose of getting the images on a page without using the proxy or taking a low-res snapshot. Qt compresses certain text resources in its cache, so you will have to deal with decompression if you use this for text files. Also, I ran a quick test to pull in HTML resources and it didn't parse the HTTP headers out of the result. But this is useful to me; hopefully someone else will find it so. Modify it if you have problems with a specific content type.



Answer 2:

I've found that, until PhantomJS matures a bit, this is a bit of a headache for them; see issue 158: http://code.google.com/p/phantomjs/issues/detail?id=158

So you want to do it anyway? I opted to go one level higher to accomplish this: I grabbed PyMiProxy from https://github.com/allfro/pymiproxy, downloaded and installed it, set it up, took their example code, and made this in proxy.py:

from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO

class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):

        def do_request(self, data):
            # strip gzip from Accept-Encoding so the response comes back uncompressed
            data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1)
            return data

        def do_response(self, data):
            #print '<< %s' % repr(data[:100])
            request_line, headers_alone = data.split('\r\n', 1)
            headers = Message(StringIO(headers_alone))
            print "Content type: %s" %(headers['content-type'])
            if headers['content-type'] == 'text/x-comma-separated-values':
                # data still includes the raw response headers; strip them if you only want the body
                f = open('data.csv', 'w')
                f.write(data)
                f.close()
            print ''
            return data

if __name__ == '__main__':
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(DebugInterceptor)
    try:
        proxy.serve_forever()
    except KeyboardInterrupt:
        proxy.server_close()

Then I fire it up

python proxy.py

Next I execute phantomjs with the proxy specified...

phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js

You may want to leave web security turned on; it was needless for me since I'm currently scraping just one source. You should now see a bunch of text flowing through your proxy console, and if it lands on something with the MIME type "text/x-comma-separated-values" it'll save it as data.csv. This will also save all the headers and everything, but if you've come this far I'm sure you can figure out how to strip those off.

One other detail: I found I had to disable gzip encoding. I could use zlib to decompress gzipped data from my own Apache web server, but if it comes out of IIS or the like the decompression gets errors, and I'm not sure about that part of it.

So my power company won't offer me an API? Fine! We do it the hard way!



Answer 3:

I didn't realize I could grab the source from the document object like this:

casper.start(url, function() {
    var js = this.evaluate(function() {
        return document; 
    }); 
    this.echo(js.all[0].outerHTML); 
});

More info here.
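
In plain PhantomJS, a similar result can be read straight from the page.content property, which holds the current frame's HTML (an alternative sketch, not part of the original answer):

var page = require('webpage').create();

page.open('http://google.com/', function() {
    //page.content is the full HTML source of the loaded frame
    console.log(page.content);
    phantom.exit();
});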



Answer 4:

You can use Casper.debugHTML() to print out the contents of an HTML resource:

var casper = require('casper').create();

casper.start('http://google.com/', function() {
    this.debugHTML();
});

casper.run();

You can also store the HTML contents in a variable using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in latest master).
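
For example, a minimal sketch of stashing the HTML in a variable with getPageContent() (assuming a build recent enough to have it):

var casper = require('casper').create();

casper.start('http://google.com/', function() {
    //getPageContent() returns the current page's HTML as a string
    var html = this.getPageContent();
    this.echo('Got ' + html.length + ' characters of HTML');
});

casper.run();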