Taking reliable screenshots of websites? Phantomjs

2019-01-31 03:29发布

问题:

Open a web page and take a screenshot.

Using ONLY phantomjs: (this is a simple script, in fact it is the example script used in their docs. http://phantomjs.org/screen-capture.html

var page = require('webpage').create();
page.open('http://github.com/', function() {
  page.render('github.png');
  phantom.exit();
});

Problem is that for some websites (like github) funny enough are somehow detecting and not serving phantomjs and nothing is being rendered. Result is github.png is a blank white png file.

Replace github with say: "google.com" and you get a nice (proper) screenshot as is intended.

At first I thought this was a Phantomjs issue so I tried running it through Casperjs with:

casper.start('http://www.github.com/', function() {
    this.captureSelector('github.png', 'body');
});

casper.run();

But I get same behavior as with Phantomjs.

So I figured ok this is most likely a user agent issue. As in: Github sniffs out Phantomjs and decides not to show the page. So I set the user agent like below but that still didn't work.

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com/', function() {
  page.render('github.png');
  phantom.exit();
});

So then I tried to parse the page and apparently some sites (again like github) don't appear to be sending anything down the wire.

Using casperjs I tried to print the title. And for google.com I got back Google but for github.com I got back bupkis. Example code:

var casper = require('casper').create();

casper.start('http://github.com/', function() {
    this.echo(this.getTitle());
});

casper.run();  

The same as above also produces the same result in purely phantomjs.

Update:

Could this be a timing issue? Is github just super slow? I doubt it but lets test anyway..

var page = require('webpage').create();
page.open('http://github.com', function (status) {
    /* irrelevant */
   window.setTimeout(function () {
            page.render('github.png');
            phantom.exit();
        }, 3000);
});

And the result is still bupkis. So no it's not a timing issue.

  1. How are some sites like github blocking phantomjs?
  2. How can we reliably take screenshots of ALL webpages? Required to be fast, and headless.

回答1:

After bouncing this around for some time I was able to narrow down the problem. Apparently PhantomJS uses a default ssl of sslv3 which causes github to refuse the connection due to a bad ssl handshake

phantomjs --debug=true github.js

Shows output of:

. . .
2014-10-22T19:48:31 [DEBUG] WebPage - updateLoadingProgress: 10 
2014-10-22T19:48:32 [DEBUG] Network - Resource request error: 6 ( "SSL handshake failed" ) URL: "https://github.com/" 
2014-10-22T19:48:32 [DEBUG] WebPage - updateLoadingProgress: 100 

So from this we can conclude that no screen was taken because github was refusing the connection. Great that makes perfect sense. So let's set SSL flag to --ssl-protocol=any and lets also ignore ssl-errors with --ignore-ssl-errors=true

phantomjs --ignore-ssl-errors=true --ssl-protocol=any --debug=true github.js

Great success! A screenshot is now being rendered and saved properly but debugger is showing us a TypeError:

TypeError: 'undefined' is not a function (evaluating 'Array.prototype.forEach.call.bind(Array.prototype.forEach)')

  https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
  https://assets-cdn.github.com/assets/frameworks-dabc650f8a51dffd1d4376a3522cbda5536e4807e01d2a86ff7e60d8d6ee3029.js:29
2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 72 
2014-10-22T19:52:32 [DEBUG] WebPage - updateLoadingProgress: 88 
ReferenceError: Can't find variable: $

  https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1
  https://assets-cdn.github.com/assets/github-fa2f009761e3bc4750ed00845b9717b09646361cbbc3fa473ad64de9ca6ccf5b.js:1

I checked the github homepage manually just to see if a TypeError existed and it does NOT.

My next guess is that the assets aren't loading quick enough.. Phantomjs is faster than a speeding bullet!

So lets try to slow it down artificially and see if we can get rid of that TypeError...

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com', function (status) {
   window.setTimeout(function () {
            page.render('github.png');
            phantom.exit();
        }, 3000);
});

That didn't work... After a closer inspection of the image - it is clear that some elements are missing. Mainly some icons and the logo.

Success? Partially because we are now at least getting a screen shot where earlier, we weren't getting a thing.

Job done? Not exactly. Need to determine what is causing that TypeError because it preventing some assets from loading and distorting the image.

Additional

Attempted to recreate with CasperJS --debug is very ugly and hard to follow compared to PhantomJS:

casper.start();
casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X)');
casper.thenOpen('https://www.github.com/', function() {
    this.captureSelector('github.png', 'body');
});

casper.run();

console:

casperjs test --ssl-protocol=any --debug=true github.js

Further the image is missing the same icons but is also visually distorted. Being that CasperJs relies on Phantomjs, I do not see the value in using it for this specific task.

If you would like to add to my answer, please share your findings. Very interested in a flawless PhantomJS solution

Update #1 : Removing the TypeError

@ArtjomB points out that Phantomjs does not support js bind in it's current version as of this update (1.9.7). For this reason he explains: ArtjomB: PhantomJs Bind Issue Answer

The TypeError: 'undefined' is not a function refers to bind, because PhantomJS 1.x doesn't support it. PhantomJS 1.x uses an old fork of QtWebkit which is comparable to Chrome 13 or Safari 5. The newer PhantomJS 2 will use a newer engine which will support bind. For now you need to add a shim inside of the page.onInitialized event handler:

Ok great, so the following code will take care of our TypeError from above. (But not fully functional, see below for details)

var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36';
page.open('http://github.com', function (status) {
   window.setTimeout(function () {
            page.render('github.png');
            phantom.exit();
        }, 5000);
});
page.onInitialized = function(){
    page.evaluate(function(){
        var isFunction = function(o) {
          return typeof o == 'function';
        };

        var bind,
          slice = [].slice,
          proto = Function.prototype,
          featureMap;

        featureMap = {
          'function-bind': 'bind'
        };

        function has(feature) {
          var prop = featureMap[feature];
          return isFunction(proto[prop]);
        }

        // check for missing features
        if (!has('function-bind')) {
          // adapted from Mozilla Developer Network example at
          // https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/bind
          bind = function bind(obj) {
            var args = slice.call(arguments, 1),
              self = this,
              nop = function() {
              },
              bound = function() {
                return self.apply(this instanceof nop ? this : (obj || {}), args.concat(slice.call(arguments)));
              };
            nop.prototype = this.prototype || {}; // Firefox cries sometimes if prototype is undefined
            bound.prototype = new nop();
            return bound;
          };
          proto.bind = bind;
        }
    });
}

Now the above code will get us a screenshot same as we were getting before AND debug will not show a TypeError so from the surface, everything appears to work. Progress has been made.

Unfortunately, all of the image icons [logo, etc] are still not loading correctly. We see some sort of 3W icon not sure where thats from.

Thanks for the help @ArtjomB