CasperJS: scraping dynamic content

Posted 2019-01-29 00:56

Question:

I'm trying to scrape this page using CasperJS. The main part of my code works just fine, but the content is loaded dynamically and I can't figure out how to trigger the loading.

This is what I'm doing right now:

casper.waitFor(function() {

    this.scrollToBottom();

    var count = this.evaluate(function() {
        var match = document.querySelectorAll('.loading-msg');
        return match.length;
    });

    // Done once at most one '.loading-msg' placeholder is left.
    return count <= 1;

}, function() {
    // do stuff
});

The wait timeout just expires, even though I've increased it to 20s, and the new content never gets loaded. I've tried adapting this function to my case:

function tryAndScroll(casper) {
  casper.waitFor(function() {
    this.page.scrollPosition = { top: this.page.scrollPosition["top"] + 4000, left: 0 };
    return true;
  }, function() {
    var info = this.getElementInfo('p[loading-spinner="!loading"]');
    if (info["visible"] == true) {
      this.waitWhileVisible('p[loading-spinner="!loading"]', function () {
        this.emit('results.loaded');
      }, function () {
        this.echo('next results not loaded');
      }, 5000);
    }
  }, function() {
    this.echo("Scrolling failed. Sorry.").exit();
  }, 500);
}

But I couldn't figure it out and I'm not even sure it's relevant here. Any ideas?

Answer 1:

I've looked at the page. It behaves in such a way that it doesn't load the images in the middle when you jump to the end.

When the page first loads, the first couple of rows are fully loaded, while some further rows are only partially loaded (a missing image is marked by a '.loading-msg' element). When you jump to the end with this.scrollToBottom(); there is no continuous scroll: the viewport jumps straight to the end, so the page's JavaScript never sees the middle rows in the viewport, not even briefly. The page goes on to load the next rows, but not the missing images of the rows that were jumped over.
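
To make the effect concrete, here is a rough simulation in plain JavaScript (no CasperJS needed). It is only a sketch under assumptions: the row count, row height and viewport height are made-up numbers, the helpers simulateScroll and viewportSteps are hypothetical names, and the real page's lazy-loading logic is probably more involved than "a row loads once it overlaps the viewport".

```javascript
// Hypothetical model of a lazy-loading page: a row loads its image only if
// some part of the row overlaps the viewport at one of the scroll positions.
function simulateScroll(rowCount, rowHeight, viewportHeight, scrollPositions) {
    var loaded = [];
    for (var i = 0; i < rowCount; i++) {
        loaded.push(false);
    }
    scrollPositions.forEach(function (top) {
        var bottom = top + viewportHeight;
        for (var row = 0; row < rowCount; row++) {
            var rowTop = row * rowHeight;
            var rowBottom = rowTop + rowHeight;
            // The row is "seen" when it overlaps the visible range [top, bottom).
            if (rowTop < bottom && rowBottom > top) {
                loaded[row] = true;
            }
        }
    });
    return loaded;
}

// Build scroll positions that advance one viewport height at a time
// and finish exactly at the bottom of the page.
function viewportSteps(pageHeight, viewportHeight) {
    var positions = [];
    var lastTop = Math.max(pageHeight - viewportHeight, 0);
    for (var top = 0; top < lastTop; top += viewportHeight) {
        positions.push(top);
    }
    positions.push(lastTop);
    return positions;
}

// Assumed dimensions, chosen only for illustration.
var rows = 10, rowHeight = 300, viewport = 600;
var pageHeight = rows * rowHeight;

// Jumping: initial position, then straight to the bottom.
var jumped = simulateScroll(rows, rowHeight, viewport, [0, pageHeight - viewport]);
// Stepping: one viewport height per scroll.
var stepped = simulateScroll(rows, rowHeight, viewport, viewportSteps(pageHeight, viewport));

console.log('jump to bottom: ' + jumped.filter(Boolean).length + ' of ' + rows + ' rows loaded');
console.log('viewport steps: ' + stepped.filter(Boolean).length + ' of ' + rows + ' rows loaded');
```

With these assumed numbers, the jump loads only the two top and two bottom rows (4 of 10), while stepping by one viewport height brings all 10 rows into view at some point.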

You have to reduce the distance of the jump in both of your snippets.

The first one can be changed like this:

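var pos = 0,
    height = casper.page.viewportSize.height;
casper.waitFor(function() {
    // Advance one viewport height per check so every row passes through the viewport.
    this.scrollTo(0, pos++ * height);
    return !this.exists('.loading-msg');
}, function() {
    // do stuff
}, function() {
    this.echo('Timed out before all images were loaded.').exit();
}, 20000);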

The second one might work by changing

this.page.scrollPosition = { top: this.page.scrollPosition["top"] + 4000, left: 0 };

to

var height = casper.page.viewportSize.height;
this.page.scrollPosition = { top: this.page.scrollPosition.top + height, left: 0 };