Scraping data from TripAdvisor using R

Posted 2019-03-15 21:15

Question:

I want to create a crawler that will scrape some data from TripAdvisor. Ideally, it will (a) identify the links to all locations to crawl, (b) collect links to all attractions in each location, and (c) collect the destination names, dates and ratings for all reviews. I'd like to focus on part (a) for now.

Here is the website I'm starting off with: http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html

There is a problem here: the page shows only the top 10 destinations to begin with, and clicking "See more popular destinations" expands the list. It appears to use a JavaScript function to achieve this. Unfortunately, I'm not familiar with JavaScript, but I think the following chunk may give clues about how it works:

<div class="morePopularCities" onclick="ta.call('ta.servlet.Tourism.showNextChildPage', event, this)">
<img id='lazyload_2067453571_25' height='27' width='27' src='http://e2.tacdn.com/img2/x.gif'/>
See more popular destinations in New Zealand </div>

I've found a few useful web-scraping packages for R, such as rvest, RSelenium, XML and RCurl. Of these, only RSelenium appears able to handle this, but I still haven't been able to work it out.

Here is some relevant code:

tu = "http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html"
RSelenium::startServer()
remDr = RSelenium::remoteDriver(browserName = "internet explorer")
remDr$open()
remDr$navigate(tu)
# remDr$executeScript("JS_FUNCTION")

The last line should do the trick, but I'm not sure which function I need to call.

Once I manage to expand this list, I will be able to obtain the links for each destination the same way I would solve part (b), which I think I've already solved (for those interested):

library(rvest)
tu = "http://www.tripadvisor.co.nz/Tourism-g255104-New_Zealand-Vacations.html"
tu = html_session(tu)
tu %>% html_nodes(xpath='//div[@class="popularCities"]/a') %>% html_attr("href")
 [1] "/Tourism-g255122-Queenstown_Otago_Region_South_Island-Vacations.html"                      
 [2] "/Tourism-g255106-Auckland_North_Island-Vacations.html"                                     
 [3] "/Tourism-g255117-Blenheim_Marlborough_Region_South_Island-Vacations.html"                  
 [4] "/Tourism-g255111-Rotorua_Rotorua_District_Bay_of_Plenty_Region_North_Island-Vacations.html"
 [5] "/Tourism-g255678-Nelson_Nelson_Tasman_Region_South_Island-Vacations.html"                  
 [6] "/Tourism-g255113-Taupo_Taupo_District_Waikato_Region_North_Island-Vacations.html"          
 [7] "/Tourism-g255109-Napier_Hawke_s_Bay_Region_North_Island-Vacations.html"                    
 [8] "/Tourism-g612500-Wanaka_Otago_Region_South_Island-Vacations.html"                          
 [9] "/Tourism-g255679-Russell_Bay_of_Islands_Northland_Region_North_Island-Vacations.html"      
[10] "/Tourism-g255114-Tauranga_Bay_of_Plenty_Region_North_Island-Vacations.html"  
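
The hrefs above are relative, so they need the site's base URL before they can be crawled. Below is a minimal base-R sketch, assuming the paths keep the /Tourism-g<digits>-<name>-Vacations.html pattern seen in the output. The names base_url, hrefs, pattern and destinations are illustrative only, and treating the number after "g" as a location ID is an assumption.

```r
# Base URL taken from the starting page in the question.
base_url <- "http://www.tripadvisor.co.nz"

# Two of the relative links returned above, used as sample input.
hrefs <- c(
  "/Tourism-g255122-Queenstown_Otago_Region_South_Island-Vacations.html",
  "/Tourism-g255106-Auckland_North_Island-Vacations.html"
)

# Assumed pattern: /Tourism-g<digits>-<name>-Vacations.html
pattern <- "^/Tourism-g([0-9]+)-(.*)-Vacations\\.html$"

destinations <- data.frame(
  geo_id = sub(pattern, "\\1", hrefs),                 # digits after "g" (assumed to be an ID)
  name   = gsub("_", " ", sub(pattern, "\\2", hrefs)), # underscores turned into spaces
  url    = paste0(base_url, hrefs),                    # absolute URL for the next crawl step
  stringsAsFactors = FALSE
)

print(destinations)
```

Keeping the ID, the readable name and the absolute URL in one data frame makes it straightforward to loop over destinations in part (b).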

As for step (c), I've found some links that might be helpful:

https://github.com/hadley/rvest/blob/master/demo/tripadvisor.R
http://notesofdabbler.github.io/201408_hotelReview/scrapeTripAdvisor.html

If you have any tips on how to expand the list of top destinations, or how to go through the other steps in a smarter way, please let me know. I'd be really keen to hear from you.

Many thanks in advance!

Answer 1:

Basically, you can try sending a click event to the <div class="morePopularCities">. Something like this:

remDr$navigate(tu)
div <- remDr$findElement(using = "class name", value = "morePopularCities")
div$clickElement()
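
If the native click proves unreliable, the same click could probably be triggered from JavaScript through executeScript, which is what the commented-out line in the question was reaching for. The sketch below only builds the script string. click_js is a hypothetical helper rather than part of RSelenium, and it assumes the element's onclick handler fires on a scripted click.

```r
# Build a JavaScript snippet that clicks the first element matching a CSS
# selector. click_js() is a hypothetical helper, not an RSelenium function.
click_js <- function(css_selector) {
  sprintf("document.querySelector('%s').click();", css_selector)
}

js <- click_js(".morePopularCities")
print(js)

# With an open session (remDr as in the question), this would be run as:
#   remDr$executeScript(js, args = list("dummy"))
# The placeholder argument is a precaution; whether it is needed may depend
# on the RSelenium version (assumption).
```

Calling ta.call(...) directly is harder because the handler expects the event and the element as arguments, so letting the browser dispatch the click is the simpler route.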

To expand all locations, you can repeat the above logic in a while loop. Keep clicking on the <div> until no more items are available (that is, until the div is no longer in the page):

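# Keep clicking while the "see more" div is still on the page.
divs <- remDr$findElements(using = "class name", value = "morePopularCities")
while (length(divs) > 0) {
  for (div in divs) {
    div$clickElement()
  }
  Sys.sleep(1)  # assumption: a short pause lets the next batch load
  divs <- remDr$findElements(using = "class name", value = "morePopularCities")
}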
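
A loop like this never ends if the div stays in the page after the last batch has loaded, so a cap on the number of clicks may be worth adding. Here is a hedged sketch of a bounded variant. expand_all, max_rounds and pause are hypothetical names, and the function assumes only that remDr exposes findElements() and that each returned element exposes clickElement(), as RSelenium's objects do.

```r
# expand_all() is a hypothetical helper, not part of RSelenium.
# It clicks the "see more" div until it disappears or max_rounds is reached.
expand_all <- function(remDr, max_rounds = 50, pause = 1) {
  rounds <- 0
  while (rounds < max_rounds) {
    divs <- remDr$findElements(using = "class name", value = "morePopularCities")
    if (length(divs) == 0) break   # nothing left to expand
    divs[[1]]$clickElement()
    Sys.sleep(pause)               # give the page time to load the next batch
    rounds <- rounds + 1
  }
  rounds                           # number of clicks performed
}

# Usage with a live session: expand_all(remDr)
```

Returning the click count makes it easy to tell whether the loop stopped because the list was fully expanded or because it hit the cap.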

I'm not fluent in R, so my code example may not be pretty. Feel free to suggest improvements.