Using R to scrape tables when URL does not change

Published 2019-08-27 21:10

Question:

I'm relatively new to scraping in R and have had great luck using "rvest", but I've run into an issue I cannot solve.

The website I am trying to scrape keeps the same URL no matter which page of the table you are on. For example, the main page is www.blah.com and contains one main table, which has 10 additional "next" pages of the same table, just continuing in order (I apologize for not linking to the actual page; I can't due to work restrictions).

So, if I'm on page 1 of the table, the URL is www.blah.com. If I'm on page 2 of the table the URL is www.blah.com and so on... The URL never changes.

Here is my code so far, using a combination of rvest and PhantomJS. It works perfectly, but only for page 1 of the table, not for the ten "next" pages:

library(rvest)

url <- "http://www.blah.com"

# Write a PhantomJS script that loads the page and dumps its rendered HTML
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
   console.log(page.content); // page source
   phantom.exit();
});", url), con = "scrape.js")

# Run PhantomJS and capture the rendered page (note the quoted command string)
system("phantomjs scrape.js > scrape.html")

page <- read_html("scrape.html")  # html() is deprecated; read_html() is current
page %>% html_nodes("td:nth-child(4)") %>% html_text()

And this is the HTML for the page-2 link of the table on the website (the links for all other pages are identical except the 2 becomes 3, and so on up the list):

<li><a href="#" id="p_2">2</a></li>
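
Since the pagination links are plain anchors with ids like p_2, one possible approach is to have the PhantomJS script click each link in turn and dump the rendered HTML after every click. This is only a sketch under assumptions I can't verify against the real site: that the ids run p_2 through p_11, that clicking the anchor triggers the table to re-render, and that a fixed one-second delay is enough for it to finish; the marker string and file names are made up for illustration:

library(rvest)

# Sketch: click pagination links p_2 ... p_11 inside PhantomJS, dumping the
# rendered HTML after each click, with a marker line between pages.
js <- "var page = require('webpage').create();
page.open('http://www.blah.com', function () {
  var i = 1;
  function dumpAndNext() {
    console.log(page.content);           // rendered HTML of the current table page
    console.log('---PAGE-BREAK---');     // marker so R can split the pages apart
    i = i + 1;
    if (i > 11) { phantom.exit(); return; }
    page.evaluate(function (n) {         // simulate a click on the next page link
      document.getElementById('p_' + n).click();
    }, i);
    window.setTimeout(dumpAndNext, 1000); // assumed delay for the table to update
  }
  window.setTimeout(dumpAndNext, 1000);
});"
writeLines(js, con = "scrape_all.js")
system("phantomjs scrape_all.js > all_pages.html")

# Split the combined dump on the marker and parse each page with rvest
pages <- strsplit(paste(readLines("all_pages.html"), collapse = "\n"),
                  "---PAGE-BREAK---", fixed = TRUE)[[1]]
results <- lapply(pages, function(p) {
  read_html(p) %>% html_nodes("td:nth-child(4)") %>% html_text()
})

If the clicks fire AJAX requests rather than re-rendering in place, a tool that drives a live browser session (e.g. RSelenium) may be a more robust fit, since it can wait for the table contents to actually change between clicks.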

Thanks so much for any advice/help you can give!