I am trying to scrape links from a page that generates content dynamically as the user scrolls down to the bottom (infinite scrolling). I have tried different things with PhantomJS but am not able to gather links beyond the first page. Let's say the element at the bottom which loads content has the class .has-more-items. It is available until the final content is loaded while scrolling and then becomes unavailable in the DOM (display:none). Here are the things I have tried:
- Setting viewportSize to a large height right after
var page = require('webpage').create();
page.viewportSize = { width: 1600, height: 10000, };
- Using
page.scrollPosition = { top: 10000, left: 0 }
but it has no effect, for example:
page.open('http://example.com/?q=houston', function(status) {
    if (status == "success") {
        page.scrollPosition = { top: 10000, left: 0 };
    }
});
- Also tried putting it inside the page.evaluate function, but that gives
Reference error: Can't find variable page
- Tried using jQuery and JS code inside
but to no avail-
$("html, body").animate({ scrollTop: $(document).height() }, 10, function() {
    //console.log('check for execution');
});
as it is and also inside document.ready. Similarly for JS code:
as it is and also inside window.onload
I have been really stuck on this for 2 days now and am not able to find a way. Any help or hint would be appreciated.
I have found a helpful piece of code at https://groups.google.com/forum/?fromgroups=#!topic/phantomjs/8LrWRW8ZrA0
var hitRockBottom = false;
while (!hitRockBottom) {
    // Scroll the page (not sure if this is the best way to do so...)
    page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
    // Check if we've hit the bottom
    hitRockBottom = page.evaluate(function() {
        return document.querySelector(".has-more-items") === null;
    });
}
where .has-more-items is the element class I want to access. It is available at the bottom of the page initially; as we scroll down, it moves further down until all data is loaded and then becomes unavailable.
However, when I tested it, it was clear that it runs into an infinite loop without scrolling down (I rendered pictures to check). I have also tried replacing page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
with each of the lines below (one at a time):
window.document.body.scrollTop = '1000';
location.href = ".has-more-items";
page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };
But nothing seems to work.
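One likely reason the loop never scrolls: page.scrollPosition is an object, so page.scrollPosition + 1000 is string concatenation rather than arithmetic, and top never receives a number. A minimal sketch in plain JavaScript to show the difference (the scrollPosition variable is a stand-in for the PhantomJS property, and nextScrollPosition is a hypothetical helper name, not a PhantomJS API):

```javascript
// Stand-in for the object PhantomJS returns from page.scrollPosition.
var scrollPosition = { top: 0, left: 0 };

// What the loop computes: object + number coerces both sides to strings.
var broken = scrollPosition + 1000;
console.log(typeof broken, broken); // string [object Object]1000

// Hypothetical helper: read the numeric `top` field before adding the step.
function nextScrollPosition(current, step) {
    return { top: current.top + step, left: current.left };
}

var fixed = nextScrollPosition(scrollPosition, 1000);
console.log(fixed.top); // 1000
```

Even with that fixed, a synchronous while loop probably gives the page no chance to load the next batch between checks. Also, since the element is described as hidden with display:none rather than removed, querySelector(".has-more-items") === null may never become true.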
The code snippet below works just fine for Pinterest. I researched a lot to scrape Pinterest without PhantomJS, but it is impossible to find the infinite scroll trigger link. I think the code below will help with scraping other infinite-scroll web pages.
I know that it has been answered a long time ago, but I also found a solution to my specific scenario. The result is a piece of javascript that scrolls to the bottom of the page. It is optimized to reduce waiting time.
It is not written for PhantomJS by default, so that will have to be modified. However, for a beginner or someone who doesn't have root access, an Iframe with injected javascript (run Google Chrome with --disable-javascript parameter) is a good alternative method for scraping a smaller set of ajax pages. The main benefit is that it's easily debuggable, because you have a visual overview of what's going on with your scraper.
scrollmaxtime is a timeout variable. Hope this is useful to someone :)
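For anyone landing here, a rough sketch of the scroll, wait, check idea with a timeout. This is an assumption about how such a script could look, not the exact snippet referred to above. scrollUntilDone, loaderHidden, and the option names step, interval, and maxTime are made up for illustration; the .has-more-items selector and the URL come from the question. It relies only on page.scrollPosition, page.evaluate, and setTimeout.

```javascript
// Scroll `page` step by step until `options.isDone` reports true inside the
// page, or until `options.maxTime` milliseconds have elapsed.
// `page` only needs a `scrollPosition` property and an `evaluate` method,
// which is the shape of a PhantomJS webpage object.
function scrollUntilDone(page, options, callback) {
    var step = options.step || 1000;          // pixels per scroll
    var interval = options.interval || 500;   // ms to wait between scrolls
    var maxTime = options.maxTime || 30000;   // overall timeout in ms
    var started = Date.now();

    function tick() {
        // isDone runs in the page context, so it must not use outer variables.
        if (page.evaluate(options.isDone)) {
            return callback('done');
        }
        if (Date.now() - started >= maxTime) {
            return callback('timeout');
        }
        var current = page.scrollPosition;
        page.scrollPosition = { top: current.top + step, left: current.left };
        // Yield to the event loop so the page can fetch and render more items.
        setTimeout(tick, interval);
    }
    tick();
}

// Predicate for the question's page: the loader is gone or hidden (display:none).
function loaderHidden() {
    var el = document.querySelector('.has-more-items');
    return el === null || window.getComputedStyle(el).display === 'none';
}

// This part only runs under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.viewportSize = { width: 1600, height: 1000 };
    page.open('http://example.com/?q=houston', function (status) {
        if (status !== 'success') { phantom.exit(1); return; }
        scrollUntilDone(page, { isDone: loaderHidden, maxTime: 60000 }, function (reason) {
            var links = page.evaluate(function () {
                var anchors = document.querySelectorAll('a[href]');
                return Array.prototype.map.call(anchors, function (a) { return a.href; });
            });
            console.log(reason + ': ' + links.length + ' links');
            phantom.exit(0);
        });
    });
}
```

The setTimeout between steps is the important part: it hands control back so the page can load the next batch before the next check. The maxTime option plays the same role as a scroll timeout and keeps the script from running forever if the loader never disappears.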
Found a way to do it and tried to adapt it to your situation. I didn't test the best way of finding the bottom of the page because I had a different context, but check it out. The problem is that you have to wait a little for the page to load, and JavaScript works asynchronously, so you have to use
(see).
The "correct" solution didn't work for me. And, from what I've read, CasperJS doesn't use
(but I may be wrong on that), which makes me doubt that window
works.
The following works for me in the Firefox/Chrome console, but doesn't work in CasperJS (within
function).
What did work for me in CasperJS was:
Which also worked when moving
into Casper's then
function.
However, the above solution won't work on some sites like Twitter; jQuery seems to break the
function, and I had to remove the clientScripts
reference to jQuery when working within Twitter.
Some websites (e.g. BoingBoing.net) seem to work fine with jQuery and CasperJS. Not sure why some sites work and others don't.