R Disparity between browser and GET / getURL

Posted 2020-03-04 02:14

Question:

I'm trying to download the content from a page and I'm finding that the response data is either malformed or incomplete, as if GET or getURL are pulling before those data are loaded.

library(httr)
library(RCurl)
url <- "https://www.vanguardcanada.ca/individual/etfs/etfs.htm"
d1 <- GET(url) # This shows a lot of {{ moustache style }} code that's not filled
d2 <- getURL(url) # This shows "" as if it didn't get anything

I'm not sure how to proceed. My goal is to get the numbers associated with the links that show in the browser:

https://www.vanguardcanada.ca/individual/etfs/etfs-detail-overview.htm?portId=9548

So in this case, I want to download and scrape '9548'.

I'm not sure why getURL and GET return results so different from what the browser shows. It looks as if the data loads slowly and GET and getURL fetch the page before it has finished loading.

For example, look at:

library(XML)  # provides htmlParse() and readHTMLTable()
x <- "https://www.vanguardcanada.ca/individual/etfs/etfs-detail-prices.htm?portId=9548"
readHTMLTable(htmlParse(content(GET(x), as = "text"), asText = TRUE))

Answer 1:

It's important to understand that when you scrape a web page, you get the raw HTML source of that page, which isn't necessarily what you interact with in a web browser. When you call GET(url), you receive the HTML text exactly as the server sends it. Nowadays most web pages assume the browser will not only display the HTML but also execute the JavaScript on the page, and a lot of in-page content is generated later by that JavaScript. That's exactly what's going on here: the "content" isn't in the HTML source of the page; it is downloaded afterwards by JavaScript.

Neither httr nor RCurl will execute the JavaScript required to "fill" the page with the table you see in the browser. There is a package called RSelenium that can drive a browser to execute JavaScript, but in this case we can get around that.
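One quick way to confirm that a response is an unrendered template is to look for leftover {{ ... }} placeholders in the text. Here is a minimal sketch in base R, using a made-up HTML fragment in place of the real response; find_placeholders is a hypothetical helper name, not something httr or RCurl provides.

```r
# Hypothetical helper: return every {{ ... }} placeholder left in the text.
find_placeholders <- function(html) {
  hits <- gregexpr("\\{\\{[^{}]*\\}\\}", html, perl = TRUE)
  regmatches(html, hits)[[1]]
}

# Made-up fragment standing in for the page source returned by GET().
sample_html <- "<td>{{ fund.name }}</td><td>{{ fund.price }}</td>"
placeholders <- find_placeholders(sample_html)

# Any match suggests the page is meant to be filled in by JavaScript.
length(placeholders) > 0
```

With the real page, the text extracted from the response (for example content(d1, as="text")) would be passed in place of sample_html.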

First, a side note on why getURL didn't work. It seems this web server sniffs the user agent sent by the requesting program and sends back different content depending on it. RCurl's default user agent apparently isn't deemed "good" enough to get the HTML from the server. You can get around this by specifying a different user agent. For example,

d2 <- getURL(url, .opts=list(useragent="Mozilla 5.0"))

seems to work.
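If you prefer httr, the same idea can be sketched with its user_agent() helper. This is only an illustration: fetch_html is a hypothetical wrapper name, the "Mozilla/5.0" string is an arbitrary choice, and whether this particular server accepts it is an assumption.

```r
# Hypothetical wrapper: fetch a page with an explicit user agent via httr.
# Requires the httr package to be installed when the function is called.
fetch_html <- function(url, ua = "Mozilla/5.0") {
  resp <- httr::GET(url, httr::user_agent(ua))
  httr::stop_for_status(resp)  # fail loudly on 4xx/5xx responses
  httr::content(resp, as = "text", encoding = "UTF-8")
}
```

Calling fetch_html(url) would then return the page source as a single string, or raise an error if the server answers with an HTTP error status.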

But getting back to the main problem. When working on problems like this, I strongly recommend using the Chrome Developer Tools (or the equivalent in your favorite browser). On the Network tab, you can see every request Chrome makes to fetch the page's data.

If you click on the first one ("etfs.htm") you can see the headers and response for that request. On the Response sub-tab, you should see exactly the same content that GET or getURL returns. After that, the browser downloads a bunch of CSS and JavaScript files. The one that looked most interesting was "GetETFJson.js", which seems to hold most of the data in an almost-JSON format: there is some actual JavaScript in front of the JSON block that gets in the way. But we can download that file with

d3 <- GET("https://www.vanguardcanada.ca/individual/mvc/GetETFJson.js")

and extract the content as text with

p3 <- content(d3, as="text")

and then turn it into an R object with

library(jsonlite)
r3 <- fromJSON(substr(p3,13,nchar(p3)))

Again, substr is used above to strip off the non-JSON JavaScript at the beginning (the first 12 characters) so that the rest parses as JSON.
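The hard-coded offset of 13 will break if the JavaScript prefix ever changes length. A slightly more defensive sketch, assuming the payload is a single JSON object, is to cut from the first "{" to the last "}". The helper name extract_json_text and the sample string are made up for illustration; the real file's prefix may look different.

```r
# Hypothetical helper: keep everything from the first "{" to the last "}".
# Assumes the file wraps exactly one JSON object in some JavaScript.
extract_json_text <- function(js_text) {
  start <- as.integer(regexpr("{", js_text, fixed = TRUE))
  ends <- as.integer(gregexpr("}", js_text, fixed = TRUE)[[1]])
  if (start == -1 || ends[1] == -1) {
    stop("no JSON object found in the text")
  }
  substr(js_text, start, max(ends))
}

# Made-up stand-in for the downloaded file's text.
sample_js <- 'var payload={"fundData":{"Fund":[1,2]}};'
json_text <- extract_json_text(sample_js)
```

The result could then be handed to fromJSON in place of the substr call, for example fromJSON(extract_json_text(p3)).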

Now you can explore the returned object. It looks like the data you want is stored in the following vectors:

cbind(r3$fundData$Fund$profile$portId, r3$fundData$Fund$profile$benchMark)

      [,1]   [,2]                                                                            
 [1,] "9548" "FTSE All World ex Canada Index in CAD"                                         
 [2,] "9561" "FTSE Canada All Cap Index in CAD"                                              
 [3,] "9554" "Spliced Canada Index"                                                          
 [4,] "9559" "FTSE Canada All Cap Real Estate Capped 25% Index"                              
 [5,] "9560" "FTSE Canada High Dividend Yield Index"                                         
 [6,] "9550" "FTSE Developed Asia Pacific Index in CAD"                                      
 [7,] "9549" "FTSE Developed Europe Index in CAD"                                            
 [8,] "9558" "FTSE Developed ex North America Index in CAD"                                  
 [9,] "9555" "Spliced FTSE Developed ex North America Index Hedged in CAD"                   
[10,] "9556" "Spliced Emerging Markets Index in CAD"                                         
[11,] "9563" "S&P 500 Index in CAD"                                                          
[12,] "9562" "S&P 500 Index in CAD Hedged"                                                   
[13,] "9566" "NASDAQ US Dividend Achievers Select Index in CAD"                              
[14,] "9564" "NASDAQ US Dividend Achievers Select Index Hedged in CAD"                       
[15,] "9557" "CRSP US Total Market Index in CAD"                                             
[16,] "9551" "Spliced US Total Market Index Hedged in CAD"                                   
[17,] "9552" "Barclays Global Aggregate CAD Float Adjusted Index in CAD"                     
[18,] "9553" "Barclays Global Aggregate CAD 1-5 Year Govt/Credit Float Adj Ix in CAD"        
[19,] "9565" "Barclays Global Aggregate Canadian 1-5 Year Credit Float Adjusted Index in CAD"
[20,] "9568" "Barclays Global Aggregate ex-USD Float Adjusted RIC Capped Index Hedged in CAD"
[21,] "9567" "Barclays U.S. Aggregate Float Adjusted Index Hedged in CAD"  

Hopefully that is sufficient to extract the portId values you need to build the URLs of the pages with more data.
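As a rough sketch of that last step, the portId values can be pasted onto the detail-page URL from the question, and the ID can be pulled back out of such a URL with a regular expression. The three IDs are taken from the output above; extract_port_id is a hypothetical helper name, and the URL pattern is assumed to be the same for every fund.

```r
# A few portId values from the table above.
port_ids <- c("9548", "9561", "9554")

# Build one overview URL per fund (pattern taken from the question).
base_url <- "https://www.vanguardcanada.ca/individual/etfs/etfs-detail-overview.htm?portId="
detail_urls <- paste0(base_url, port_ids)

# Hypothetical helper: recover the numeric portId from a detail URL.
extract_port_id <- function(u) {
  sub(".*[?&]portId=([0-9]+).*", "\\1", u)
}

recovered <- extract_port_id(detail_urls)
```

In practice port_ids would come straight from r3$fundData$Fund$profile$portId instead of being typed in by hand.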



Tags: r curl rcurl httr