Background I'm currently scraping product information from several websites in R using rvest. This works on all but one website, where the content seems to be loaded dynamically via AngularJS (?), so I cannot iterate over the pages, e.g. via URL parameters (as I did for the other websites). The specific URL is as follows:
http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html
Please keep in mind that I don't have admin rights on my machine and can only implement solutions that require either no admin rights or granting them only once.
Desired Output In the end, a table in R with product information (e.g. label, price, rating). In this question, though, I purely need help to dynamically load and store the website; I can handle the postprocessing in R on my own. It'd be absolutely great if you could push me in the right direction; maybe one of my approaches listed below is on the right track, but I just seem unable to transfer them to the stated website.
Current approach I found PhantomJS, a headless browser that, as far as I know, should be able to handle this. I have close to no knowledge of JavaScript, and its syntax differs so heavily (at least for me) from the languages I'm more used to (R, Matlab, SQL) that I really struggle to adapt approaches suggested elsewhere to my code. Based on this example (thanks a lot) I managed to retrieve at least the information from the first page shown, with this code:
R:
require(rvest)
## change Phantom.js scrape file
url <- 'http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html'
lines <- readLines("scrape_final.js")
lines[1] <- paste0("var url ='", url ,"';")
writeLines(lines, "scrape_final.js")
## Download website
system("phantomjs scrape_final.js")
### use Rvest to scrape the downloaded website.
web <- read_html("1.html")
content <- html_nodes(web, 'div.paging-indicator')# %>% html_attr('href')
content <- html_text(content) %>% as.data.frame()
and the corresponding PhantomJS script:
var url ='http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html';
var page = new WebPage();
var fs = require('fs');

page.open(url, function (status) {
    just_wait();
});

function just_wait() {
    setTimeout(function() {
        fs.write('1.html', page.content, 'w');
        phantom.exit();
    }, 2500);
}
What does not work / research While this code retrieves information from the first page, I need to iterate through all x product pages. I tried to extend the code above based on the following examples:
Unable to scrape multiple pages using phantomjs in r
Scrape dynamic loading pages with phantomjs
Web scraping dynamically loading data in R
The examples led me to two ideas: either click the "next page" button, or somehow inject the correct pagination value.
- Click the "next page" button
This looks like the following:
var url ='http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html';
var page = require('webpage').create();
var fs = require('fs');
page.open(url, function (status) {
age.open(url, function() {
page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
page.evaluate(function() {
$("paging-btn right").click();
just_wait();
});
phantom.exit()
});
});
function just_wait() {
setTimeout(function() {
fs.write('1.html', page.content, 'w');
phantom.exit();
}, 2500);
}
But that doesn't get me anywhere, due to poor syntax and maybe other things. Unfortunately, calling this script from R doesn't produce an error; it just runs for ages, so I have to kill it (while the working script takes only a few seconds).
I used the inspector in Firefox to retrieve the button name, but that might also be wrong:
<a class="paging-btn right" rel="next" ng-click="goToNextPage()"
ng-hide="articleData.pageNumber == articleData.pageCount"
href="javascript:void(0);">right</a>
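From the linked examples, my understanding is that the click approach would need to look roughly like the sketch below. I have not been able to verify it against the site, so please treat it as a guess. Assumptions that are mine and not confirmed: the button can be selected with the class selector 'a.paging-btn.right' (jQuery's "paging-btn right" would instead look for a <right> element inside a <paging-btn> element), Angular marks the hidden button on the last page with the CSS class 'ng-hide', 2500 ms is enough for each page to render, and the helper name scrapeAllPages and the cap of 50 pages are made up for illustration. Unlike my attempt above, the waiting happens outside page.evaluate (which runs inside the web page and cannot see just_wait), and phantom.exit() is only called once the last page has been saved.

```javascript
// Sketch only: save the current page, click "next", wait, repeat.
// 'page' needs a .content string and an .evaluate(fn) method (as a PhantomJS page has).
function scrapeAllPages(page, writeFile, done, waitMs, maxPages) {
    var pageNo = 1;

    function saveAndAdvance() {
        // Store what is rendered right now as 1.html, 2.html, ...
        writeFile(pageNo + '.html', page.content);

        // Runs inside the web page; returns true if a visible "next" button was clicked.
        var clicked = page.evaluate(function () {
            var btn = document.querySelector('a.paging-btn.right'); // assumed selector
            if (!btn || /\bng-hide\b/.test(btn.className)) {        // assumed last-page marker
                return false;
            }
            var ev = document.createEvent('MouseEvents');
            ev.initEvent('click', true, true);
            btn.dispatchEvent(ev);
            return true;
        });

        if (clicked && pageNo < maxPages) {   // maxPages is a hypothetical safety cap
            pageNo += 1;
            setTimeout(saveAndAdvance, waitMs);
        } else {
            done(pageNo);
        }
    }

    // Give the first page time to render before saving it.
    setTimeout(saveAndAdvance, waitMs);
}

// Only wire things up when the file is actually run by PhantomJS.
if (typeof phantom !== 'undefined') {
    var url = 'http://www.hornbach.de/shop/Badarmaturen/Waschtischarmaturen/S3584/artikelliste.html';
    var page = require('webpage').create();
    var fs = require('fs');
    page.open(url, function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        scrapeAllPages(
            page,
            function (name, html) { fs.write(name, html, 'w'); },
            function () { phantom.exit(); },
            2500,
            50
        );
    });
}
```

Saved under a name such as scrape_all.js (hypothetical), this would be called from R like the working script, via system("phantomjs scrape_all.js"), and should leave 1.html, 2.html, ... in the working directory for rvest to read.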
- Change the pagination parameter on load
I tried to work with the example given in "Passing variable into page.evaluate - PhantomJS",
but again I just got a script that never finished when called from R.
Additional notes It looks like I'm only allowed to post two links, so unfortunately I couldn't link all sources I've researched and tested.
I'm well aware that this is a lot of messy information at once; if you can help me improve or better structure my question, please let me know. I'll do my best to be responsive and get you anything you need to assist.