How to optimise scraping with getURL() in R

Posted 2020-06-04 07:02

Question:

I am trying to scrape all bills from two pages on the website of the French lower chamber of parliament. The pages cover 2002-2012 and each lists fewer than 1,000 bills.

To do this, I scrape with getURL() in the following loop:

b <- "http://www.assemblee-nationale.fr" # base
l <- c("12","13") # legislature id

lapply(l, FUN = function(x) {
  print(data <- paste(b, x, "documents/index-dossier.asp", sep = "/"))

  # scrape
  data <- getURL(data); data <- readLines(tc <- textConnection(data)); close(tc)
  data <- unlist(str_extract_all(data, "dossiers/[[:alnum:]_-]+.asp"))
  data <- paste(b, x, data, sep = "/")
  data <- getURL(data)
  write.table(data,file=n <- paste("raw_an",x,".txt",sep="")); str(n)
})

Is there any way to optimise the getURL() calls here? I cannot seem to enable concurrent downloading by passing the async = TRUE option, which gives me the same error every time:

Error in function (type, msg, asError = TRUE)  : 
Failed to connect to 0.0.0.12: No route to host
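
For completeness, the concurrent variant I am attempting is just the second getURL() call with async switched on; as far as I understand, getURIAsynchronous() is what getURL() hands a vector of URLs to in that case:

# data is the vector of dossier URLs built in the loop above
data <- getURL(data, async = TRUE)
# or, calling the concurrent downloader directly:
data <- getURIAsynchronous(data)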

Any ideas? Thanks!

Answer 1:

Try mclapply {multicore} instead of lapply.

"mclapply is a parallelized version of lapply, it returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X." (http://www.rforge.net/doc/packages/multicore/mclapply.html)

If that doesn't work, you may get better performance using the XML package. Functions like xmlTreeParse use asynchronous calling.

"Note that xmlTreeParse does allow a hybrid style of processing that allows us to apply handlers to nodes in the tree as they are being converted to R objects. This is a style of event-driven or asynchronous calling." (http://www.inside-r.org/packages/cran/XML/docs/xmlEventParse)



Answer 2:

Why use R? For big scraping jobs you are better off using something already developed for the task. I've had good results with DownThemAll, a browser add-on. Just tell it where to start, how deep to go, what patterns to follow, and where to dump the HTML.

Then use R to read the data from the HTML files.
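
For example, reading the dump back into R might look something like this (a sketch; the folder name and the fields you extract are hypothetical and depend on what the add-on saved):

library(XML)

# hypothetical folder where the browser add-on dumped the pages
files <- list.files("an_dump", pattern = "\\.html?$", full.names = TRUE)

# parse each saved page and pull out whatever you need, e.g. the <title>
titles <- sapply(files, function(f) {
  xpathSApply(htmlParse(f), "//title", xmlValue)
})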

The advantages are massive: these add-ons are developed specifically for the task, so they will do multiple downloads (controllable by you), send the right headers so your next question won't be 'how do I set the user agent string with RCurl?', and cope with retrying when some of the downloads fail, as they inevitably will.

Of course the disadvantage is that you can't easily start this process automatically, in which case maybe you'd be better off with 'curl' on the command line, or some other command-line mirroring utility.

Honestly, you've got better things to do with your time than write web-scraping code in R...