Fast URL query with R

Posted 2019-04-09 01:07

Hi, I have to query a website 10000 times, and I am looking for a really fast way to do it with R.

A template URL:

url <- "http://mutationassessor.org/?cm=var&var=7,55178574,G,A"

My code is:

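library(XML)     # readHTMLTable
library(gtools)  # smartbind

# Fetch the first URL and keep the 10th table on the page
url <- mydata$mutationassessorurl[1]
rawurl <- readHTMLTable(url)
Mutator <- data.frame(rawurl[[10]])

# Fetch the remaining URLs one at a time, appending each result
for (i in 2:27566) {
  url <- mydata$mutationassessorurl[i]
  rawurl <- readHTMLTable(url)
  Mutator <- smartbind(Mutator, data.frame(rawurl[[10]]))
  print(i)
}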

Using microbenchmark, I measured 680 milliseconds per query. I was wondering if there is a faster way to do it.

Thanks

1 answer
我想做一个坏孩纸
Reply #2 · 2019-04-09 01:13

One way to speed up HTTP connections is to leave the connection open between requests. The following example shows the difference it makes for httr. The first option is most similar to the default behaviour in RCurl.

library(httr)
test_server <- "http://had.co.nz"

# Return times in ms for easier comparison
timed_GET <- function(...) {
  req <- GET(...)
  round(req$times * 1000)
}

# Create a new handle for every request - no connection sharing
rowMeans(replicate(20, 
  timed_GET(handle = handle(test_server), path = "index.html")
))

##      redirect    namelookup       connect   pretransfer starttransfer 
##          0.00         20.65         75.30         75.40        133.20 
##         total 
##        135.05

test_handle <- handle(test_server)
# Reuse the same handle for multiple requests
rowMeans(replicate(20, 
  timed_GET(handle = test_handle, path = "index.html")
))

##      redirect    namelookup       connect   pretransfer starttransfer 
##          0.00          0.00          2.55          2.55         59.35 
##         total 
##         60.80

# With httr, handles are automatically pooled
rowMeans(replicate(20,
  timed_GET(test_server, path = "index.html")
))

##      redirect    namelookup       connect   pretransfer starttransfer 
##          0.00          0.00          2.55          2.55         57.75 
##         total 
##         59.40

Note the difference in namelookup and connect: if you share a handle, each of these operations is needed only once, which saves quite a bit of time.

There's quite a lot of variation between requests, so on average the last two methods should be very similar.
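Applied to the loop in the question, the idea might look like the sketch below. It is only a sketch under some assumptions: the helper names parse_table and fetch_table are made up for illustration, the 10th table is taken from the question's code, and it assumes the page is valid HTML that XML::htmlParse can read from a string. Passing the full URL to GET lets httr pool the handle per host, as in the third timing above. Collecting the results in a list and binding once is a separate change that avoids re-copying the growing data frame on every iteration.

```r
library(httr)
library(XML)

# Hypothetical helper: parse an HTML string and return one of its tables.
parse_table <- function(html, index = 10) {
  doc <- htmlParse(html, asText = TRUE)
  tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
  tables[[index]]
}

# Hypothetical helper: download one page and extract the table.
# GET() with a full URL reuses httr's pooled handle for that host,
# so the name lookup and connect steps are not repeated per request.
fetch_table <- function(url, index = 10) {
  resp <- GET(url)
  stop_for_status(resp)
  parse_table(content(resp, as = "text", encoding = "UTF-8"), index)
}

# Usage (needs network access and the question's mydata, so left commented):
# results <- lapply(mydata$mutationassessorurl, fetch_table)
# Mutator <- do.call(gtools::smartbind, results)
```

This only removes the per-request connection overhead; the time the server takes to answer each query stays the same.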
