Use PhantomJS in R to scrape a page with dynamically loaded content

Posted 2019-06-28 03:27

Question:

Background I'm currently scraping product information from some websites in R using rvest. This works on all but one website, where the content seems to be loaded dynamically via AngularJS (?) and therefore cannot be loaded iteratively, e.g. via URL parameters (as I did for the other websites). The specific URL is as follows:

http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html

Please keep in mind that I don't have admin rights on my machine and can only implement solutions that require either no admin rights or a single one-time grant.

Desired Output In the end, a table in R with product information (e.g. label, price, rating). In this question, though, I purely need help to dynamically load and store the website; I can handle the post-processing in R on my own. It would be absolutely great if you could push me in the right direction; maybe one of the approaches listed below is on the right track, but I just seem unable to transfer them to the stated website.

Current approach I found PhantomJS, a headless browser that, as far as I know, should be able to handle this issue. I have close to no knowledge of JavaScript, and its syntax differs (at least for me) so heavily from the languages I'm used to (R, Matlab, SQL) that I really struggle to adapt approaches suggested elsewhere to my code. Based on this example (thanks a lot) I managed to retrieve at least the information from the first page with this code:

R:

require(rvest)

## change Phantom.js scrape file
url <- 'http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html'

lines <- readLines("scrape_final.js")
lines[1] <- paste0("var url ='", url ,"';")
writeLines(lines, "scrape_final.js")

## Download website
system("phantomjs scrape_final.js")

### use Rvest to scrape the downloaded website.
web <- read_html("1.html")
content <- html_nodes(web, 'div.paging-indicator')# %>% html_attr('href')
content <- html_text(content) %>% as.data.frame()

and the corresponding PhantomJS script:

var url ='http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html';
var page = require('webpage').create();
var fs = require('fs'); 

page.open(url, function (status) {
    just_wait();
});

function just_wait() {
    setTimeout(function () {
        fs.write('1.html', page.content, 'w');
        phantom.exit();
    }, 2500);
}
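As an aside, the fixed 2500 ms wait is a guess that will sometimes be too short and usually too long. A sturdier pattern is to poll for a condition instead; here is a generic polling helper in plain JavaScript (a sketch of the well-known waitFor idea; the condition function and the timings you pass in are up to you, nothing here is taken from the site):

```javascript
// Generic polling helper (plain JavaScript, works in PhantomJS or Node):
// call testFx repeatedly until it returns true, then run onReady; give
// up after timeoutMs and run onTimeout instead.
function waitFor(testFx, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (testFx()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start > timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs || 100);
}
```

In the script above one could then wait until, say, the product list appears in `page.content` before writing the file, instead of hoping 2.5 seconds is enough.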

What does not work / research While this code retrieves information from the first page, I need to iterate through all x product pages. I tried to extend the code above based on the following examples:

Unable to scrape multiple pages using phantomjs in r

Scrape dynamic loading pages with phantomjs

Web scraping dynamically loading data in R

The examples led me to two ideas:

  • either clicking the "next page" button
  • or somehow injecting the correct pagination value

    1. Clicking the "next page" button

This looks like the following:

    var url ='http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html';
    var page = require('webpage').create(); 
    var fs = require('fs'); 

    page.open(url, function (status) {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        page.evaluate(function() {
          $("a.paging-btn.right").click(); // class selector needs dots
        });
        just_wait(); // give the new page time to load before saving
      });
    });


    function just_wait() {
        setTimeout(function () {
            fs.write('1.html', page.content, 'w');
            phantom.exit();
        }, 2500);
    }

But that doesn't get me anywhere, due to poor syntax and maybe other things. Unfortunately, calling this script from R doesn't produce an error; it just runs for ages, so I have to kill it (while the working script only takes a few seconds).

I used the inspector gadget in Firefox to retrieve the button's class, but that might also be wrong:

<a class="paging-btn right" rel="next" ng-click="goToNextPage()" 
ng-hide="articleData.pageNumber == articleData.pageCount" 
href="javascript:void(0);">right</a> 
    2. Changing the pagination parameter on load

I tried to work on the example given here: Passing variable into page.evaluate - PhantomJS

but again just got a script that never ended in R.
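For the second idea, if the Angular app happened to accept the page number as a URL parameter, the page URLs could be generated up front instead of clicking through. A minimal sketch in plain JavaScript; the `page` query-parameter name is purely an assumption and would have to be checked against the real site (e.g. in the browser's network tab):

```javascript
// Hypothetical: build candidate URLs for all product pages, assuming the
// site accepts a numeric "page" query parameter (an assumption -- verify
// the real parameter name before relying on this).
function pageUrls(baseUrl, pageCount) {
    var sep = baseUrl.indexOf('?') === -1 ? '?' : '&';
    var urls = [];
    for (var i = 1; i <= pageCount; i++) {
        urls.push(baseUrl + sep + 'page=' + i);
    }
    return urls;
}
```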

Additional notes It looks like I'm only allowed to post two links, so unfortunately I couldn't link all the sources I researched and tested.

I'm well aware this is a huge and messy load of information at once; if you can help me improve or better structure my question, please let me know. I'll do my best to be responsive and get you anything you need to assist.

Answer 1:

I've split the PhantomJS code into two parts, which avoids the error messages. I'm quite confident it is possible to first read and store the website and afterwards click the "next page" button and output the new URL in a single script, but unfortunately I couldn't get that to work without an error message.

The following R code is the innermost scraping loop (it retrieves information from the pages of one sub-sub-category and calls/changes the PhantomJS scripts accordingly).

   for (i3 in 1:num_prod_pages) {

      system('phantomjs readhtml.js') # return html of current page via PhantomJS

      ### Use Rvest to scrape the downloaded website.

      html_filename <- paste0(as.character(i3), '.html') # file generated in PhantomJS readhtml.js
      web <- read_html(html_filename)
      content <- html_nodes(web, 'div.article-pricing') # %>% html_attr('href')
      content <- html_text(content) %>% as.data.frame()

      ### generate URL of next page

      url_i3 <- capture.output(system("phantomjs nextpage.js", intern = TRUE)) %>%
         .[length(.)] %>% # last line of output contains the URL
         str_sub(str_locate(., 'http')[1], -2) # cut '[1] \' at start and ' \" ' at end

      # Adapt PhantomJS scripts to new url

      lines <- readLines("readhtml.js")
      lines[2] <- paste0("var url ='", url_i3 ,"';")
      lines[11] <- paste0("              fs.write('", as.character(i3), ".html', page.content, 'w');")
      writeLines(lines, "readhtml.js")

      lines <- readLines("nextpage.js")
      lines[2] <- paste0("var url ='", url_i3 ,"';")
      writeLines(lines, "nextpage.js")
   } 
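The string surgery on the captured output (cutting everything before `http` and a trailing quote) is the fragile part of the loop above. The same extraction expressed as a small JavaScript helper, just to make the intent explicit; the exact line format is an assumption based on what `capture.output` wraps around the PhantomJS console line:

```javascript
// Sketch of the URL extraction done in the R pipe above: take one line
// of captured output and cut out the URL it contains, stripping any
// trailing quote that capture.output() wrapped around it.
function extractUrl(line) {
    var start = line.indexOf('http');
    if (start === -1) {
        return null; // no URL on this line
    }
    return line.slice(start).replace(/["'\s]+$/, '');
}
```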

The following PhantomJS script "readhtml.js" stores the website at the current URL locally:

var webPage = require('webpage');
var url ='http://www.hornbach.de/shop/Badarmaturen/S476/artikelliste.html';
var fs = require('fs'); 
var page = webPage.create();
var system = require('system');

//page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0'
page.settings.userAgent = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

page.open(url, function(status) {
    if (status === 'success') {
        fs.write('1.html', page.content, 'w');
        console.log('htmlfile ready');
    }
    phantom.exit(); // exit even on failure so the script cannot hang
});
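Rewriting "readhtml.js" by line number on every iteration is brittle. PhantomJS exposes command-line arguments via `require('system').args` (where `args[0]` is the script name), so R could instead call something like `system('phantomjs readhtml.js <url> <outfile>')`. A sketch of the argument handling; the two-argument convention is my own choice, not something from the original scripts:

```javascript
// Sketch: parse the argument vector PhantomJS exposes as
// require('system').args. args[0] is the script name, so the URL and
// output file arrive as args[1] and args[2]. With this, the R loop
// could pass values on the command line instead of editing the script
// file on each iteration.
function parseArgs(args) {
    return {
        url: args[1],
        outfile: args[2] || '1.html' // fall back to the original name
    };
}
```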

The following PhantomJS script "nextpage.js" clicks the "next page" button and returns the new URL:

var webPage = require('webpage');
var url ='http://www.hornbach.de/shop/Badarmaturen/S476/artikelliste.html';
var fs = require('fs'); 
var page = webPage.create();
var system = require('system');

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0';

page.open(url, function(status) {
    if (status === 'success') {
        page.evaluate(function() {
            document.querySelector('a.right:nth-child(3)').click();
        });
        setTimeout(function() {
            console.log('URL: ' + page.url);
            phantom.exit();
        }, 2000);
    } else {
        phantom.exit(); // exit on failure so the script cannot hang
    }
});

All in all not really elegant, but lacking other input I'll close this one, as it works without any error messages.