Casperjs iterating over a list of links using casp

Posted 2020-03-13 04:54

Question:

I am trying to use CasperJS to get a list of links from a page, open each of those links, and collect a particular type of data from each page into an object.

The problem I am having is with the loop that executes over each of the list items.

First I get a listOfLinks from the original page. This part works and using length I can check that this list is populated.

However, when I use this.each as below, none of the console statements ever show up, and CasperJS appears to skip over this block.

Replacing this.each with a standard for loop, the execution only gets part way through the first link, as the statement "Creating new array in object for x.html" appears once and then the code stops executing. Using an IIFE doesn't change this.

Edit: in verbose debugging mode the following happens:

Creating new array object for https://example.com 
[debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true

So for some reason the URL that is passed into the thenOpen function gets changed to blank...

I feel like there is something about CasperJS's asynchronous nature that I am not grasping here, and I would be grateful to be pointed towards a working example.

casper.then(function () {

  var date = Date.now();
  console.log(date);

  var object = {};
  object[date] = {}; // new object for date

  var listOfLinks = this.evaluate(function(){
    console.log("getting links");
    return document.getElementsByClassName('importantLink');
  });

  console.log(listOfLinks.length);

  this.each(listOfLinks, function(self, link) {

    var eachPageHref = link.href;

    console.log("Creating new array in object for " + eachPageHref);

    object[date][eachPageHref] = []; // array for page to store names

    self.thenOpen(eachPageHref, function () {

      var listOfItems = this.evaluate(function() {
        var items = [];
        // Perform DOM manipulation to get items
        return items;
      });
    });

    object[date][eachPageHref] = items;

  });
  console.log(JSON.stringify(object));

});

Answer 1:

I used Stack Overflow itself as a demo site to run your script against. I corrected a few minor things in your code, and the result is this exercise in collecting comments from PhantomJS bounty questions.

var casper = require('casper').create();

casper
.start()
.open('http://stackoverflow.com/questions/tagged/phantomjs?sort=featured&pageSize=30')
.then(function () {

    var date = Date.now(), object = {};
    object[date] = {};

    var listOfLinks = this.evaluate(function(){

        // Getting links to other pages to scrape, this will be 
        // a primitive array that will be easily returned from page.evaluate
        var links = [].map.call(document.querySelectorAll("#questions .question-hyperlink"), function(link) {
          return link.href;
        });    
        return links;
    });

    // Now to iterate over that array of links
    this.each(listOfLinks, function(self, eachPageHref) {

        object[date][eachPageHref] = []; // array for page to store names

        self.thenOpen(eachPageHref, function () {

            // Getting comments from each page, also as an array
            var listOfItems = this.evaluate(function() {
                var items = [].map.call(document.getElementsByClassName("comment-text"), function(comment) {
                    return comment.innerText;
                });    
                return items;
            });
            object[date][eachPageHref] = listOfItems;
        });
    });

    // After every link has been scraped, output the resulting object
    this.then(function(){
        console.log(JSON.stringify(object));
    });
})

casper.run();

What changed: page.evaluate now returns a plain array of strings, which casper.each() needs in order to iterate correctly. The href attributes are extracted inside page.evaluate rather than afterwards. There is also this correction:

 object[date][eachPageHref] = listOfItems; // previously this assigned items, which was undefined in this scope

The result of running the script is:

{"1478596579898":{"http://stackoverflow.com/questions/40410927/phantomjs-from-node-on-windows":["en.wikipedia.org/wiki/File_URI_scheme – Igor 2 days ago\n","@Igor is there something in particular you see wrong, or are you suggesting the phantom module has an incorrect URI? – Danny Buonocore 2 days ago\n","Probably windows security issue not allowing to run an unsigned program. – Vaviloff yesterday\n"],"http://stackoverflow.com/questions/40412726/casperjs-iterating-over-a-list-of-links-using-casper-each":["Thanks, this looked really promising. I made the changes but it didn't solve the problem. And I just realised that in debug mode the following happens: Creating new array object for https://example.com [debug] [phantom] Navigation requested: url=about:blank, type=Other, willNavigate=true, isMainFrame=true and then Casperjs silently fails. It seems that the correct link that gets passed into thenOpen gets changed to about:blank... – cyc665 yesterday\n"]}}
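The final console.log in the script sits inside its own this.then() because then() and thenOpen() only schedule a step; the callbacks run later, once the earlier steps have finished. The following is a minimal sketch of that ordering in plain JavaScript, with no CasperJS involved. MiniQueue and its methods are hypothetical stand-ins invented for this illustration, and the real scheduler is more elaborate. It shows that code placed directly after the loop runs before any scheduled callback has filled in the results, while a step scheduled after the loop sees them.

```javascript
// Hypothetical stand-in for a step queue; NOT CasperJS's real implementation.
function MiniQueue() {
    this.steps = [];
}
// then() only records the callback; nothing runs yet
MiniQueue.prototype.then = function (fn) {
    this.steps.push(fn);
    return this;
};
// run() executes recorded steps in order, including steps added while running
MiniQueue.prototype.run = function () {
    while (this.steps.length > 0) {
        this.steps.shift().call(this);
    }
};

var queue = new MiniQueue();
var results = {};
var snapshots = [];

queue.then(function () {
    ["a.html", "b.html"].forEach(function (href) {
        results[href] = [];                 // placeholder, as in the question
        this.then(function () {             // scheduled, runs later
            results[href] = ["item from " + href];
        });
    }, this);

    // Runs immediately: the scheduled callbacks have not fired yet
    snapshots.push("inline: " + JSON.stringify(results));

    // Scheduled after the per-link steps, so it sees the filled-in results
    this.then(function () {
        snapshots.push("deferred: " + JSON.stringify(results));
    });
});

queue.run();
console.log(snapshots.join("\n"));
// inline: {"a.html":[],"b.html":[]}
// deferred: {"a.html":["item from a.html"],"b.html":["item from b.html"]}
```

The "inline" line corresponds to the console.log(JSON.stringify(object)) placed right after the loop in the question, and the "deferred" line to the this.then() wrapper used in the script above.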


Answer 2:

You are returning DOM nodes in the evaluate() function, which is not allowed. You can return the actual URLs instead.

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!

Reference: PhantomJS#evaluate
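The JSON rule of thumb can be tried out without CasperJS. The sketch below is plain JavaScript meant to be run with Node.js (an assumption; it does not call PhantomJS at all). roundTrip and fakeLink are hypothetical names introduced for illustration, and the JSON round trip only approximates what happens at the evaluate() boundary. It suggests why returning objects whose data is not plain JSON loses that data, while returning the URL strings works.

```javascript
// Approximates the evaluate() boundary with a JSON round trip.
// Illustration only, not PhantomJS's actual serializer.
function roundTrip(value) {
    var json = JSON.stringify(value);
    return json === undefined ? undefined : JSON.parse(json);
}

// Stand-in for a DOM link: the data sits behind a method, not a plain property
var fakeLink = {
    getHref: function () { return "https://example.com/a.html"; }
};

var asObject = roundTrip([fakeLink]);            // the method is dropped
var asStrings = roundTrip([fakeLink.getHref()]); // a plain string survives

console.log(JSON.stringify(asObject));  // [{}]
console.log(JSON.stringify(asStrings)); // ["https://example.com/a.html"]
```

This mirrors the fix of mapping each link to its href inside evaluate() and returning an array of strings.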



Answer 3:

If I understand your problem correctly, the fix is to declare items in an outer scope, so that it is visible outside the thenOpen callback. In your code, I would have done the following:

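var items = []; // declared outside the loop so every step callback can reach it
this.each(listOfLinks, function(self, link) {

    var eachPageHref = link.href;

    console.log("Creating new array in object for " + eachPageHref);

    object[date][eachPageHref] = []; // array for page to store names

    self.thenOpen(eachPageHref, function () {

        // evaluate() runs inside the page and cannot see the outer items,
        // so return the data and append it on the CasperJS side
        var pageItems = this.evaluate(function() {
            var found = [];
            // Perform DOM manipulation to get items
            return found;
        });
        items = items.concat(pageItems || []);
    });
});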

Hope this helps.