Scraping non-HTML websites with R?

Posted 2019-04-17 17:16

Scraping data from HTML tables on HTML websites is cool and easy. However, how can I do this if the website is not written in HTML and requires a browser to show the relevant information, e.g. if it's an ASP website, or the data is not in the page source but comes in through java code?

Like it is here: http://www.bwea.com/ukwed/construction.asp.

With VBA for Excel one can write a function that opens an IE session, calls the website, and then basically copies and pastes the content of the page. Any chance to do something similar with R?

Tags: r scrape
2 answers
Rolldiameter
#2 · 2019-04-17 17:30

This is normal HTML, with the associated normal trouble of having to clean up after scraping the data.

The following does the trick:

  • Read the page with readHTMLTable in package XML
  • It's the fifth table on the page, so extract the fifth element
  • Take the first row and assign it to the names of the table
  • Delete the first row

The code:

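library(XML)

# Read every table on the page into a list of data frames
x <- readHTMLTable("http://www.bwea.com/ukwed/construction.asp",
                   as.data.frame=TRUE, stringsAsFactors=FALSE)
dat <- x[[5]]                           # the fifth table holds the data
names(dat) <- unname(unlist(dat[1, ]))  # the first row contains the column names
dat <- dat[-1, ]                        # drop that header row from the data
str(dat)

The resulting data: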

'data.frame':   39 obs. of  10 variables:
 $ Date                : chr  "September 2011" "August 2011" "August 2011" "August 2011" ...
 $ Wind farm           : chr  "Baillie Wind farm - Bardnaheigh Farm" "Mains of Hatton" "Coultas Farm" "White Mill (Coldham ext)" ...
 $ Location            : chr  "Highland" "Aberdeenshire" "Nottinghamshire" "Cambridgeshire" ...
 $ Power(MW)           : chr  "2.5" "0.8" "0.33" "2" ...
 $ Turbines            : chr  "21" "3" "1" "7" ...
 $ MW Capacity         : chr  "52.5" "2.4" "0.33" "14" ...
 $ Annual homes equiv*.: chr  "29355" "1342" "185" "7828" ...
 $ Developer           : chr  "Baillie" "Eco2" "" "COOP" ...
 $ Latitude            : chr  "58 02 52N" "57 28 11N" "53 04 33N" "52 35 47N" ...
 $ Longitude           : chr  "04 07 40W" "02 30 32W" "01 18 16W" "00 07 41E" ...
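
As for the clean-up mentioned at the start: every column comes back as character, so counts and coordinates still need converting. A minimal sketch in base R, assuming the coordinates always follow the "DD MM SS" plus hemisphere letter pattern seen in the output above. The helper `dms_to_decimal` and the `dat_sample` data frame are made up for illustration; the sample values are copied from that output.

```r
# Hypothetical helper: convert "DD MM SS<H>" strings to signed decimal degrees.
# Assumes degrees, minutes and seconds separated by single spaces,
# followed directly by one of N, S, E, W.
dms_to_decimal <- function(x) {
  parts <- regmatches(x, regexec("^([0-9]+) ([0-9]+) ([0-9]+)([NSEW])$", x))
  vapply(parts, function(p) {
    if (length(p) != 5) return(NA_real_)  # value did not match the assumed format
    deg <- as.numeric(p[2]) + as.numeric(p[3]) / 60 + as.numeric(p[4]) / 3600
    if (p[5] %in% c("S", "W")) -deg else deg  # south and west are negative
  }, numeric(1))
}

# A few values copied from the output above, standing in for the scraped table
dat_sample <- data.frame(
  Turbines  = c("21", "3", "1", "7"),
  Latitude  = c("58 02 52N", "57 28 11N", "53 04 33N", "52 35 47N"),
  Longitude = c("04 07 40W", "02 30 32W", "01 18 16W", "00 07 41E"),
  stringsAsFactors = FALSE
)

dat_sample$Turbines <- as.numeric(dat_sample$Turbines)
dat_sample$lat <- dms_to_decimal(dat_sample$Latitude)
dat_sample$lon <- dms_to_decimal(dat_sample$Longitude)
print(dat_sample)
```

Values that do not match the assumed pattern come back as NA rather than raising an error, which makes it easy to spot rows that need a closer look.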
家丑人穷心不美
#3 · 2019-04-17 17:32

That site just delivers HTML, as Thomas comments. Some sites use JavaScript to fetch values via an AJAX call and insert them into the document dynamically; simple scraping won't work on those. The trick there is to use a JavaScript debugger to see what the AJAX calls are, and then reverse engineer them from the request and response.
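
Once you have found the request, replaying it from R is often short. A rough sketch, assuming the jsonlite package is installed and that the call returns a JSON array of objects. The URL, the helper `fetch_ajax_table` and the field names are all made up for illustration; substitute whatever the debugger actually shows.

```r
# Assumes the jsonlite package is installed.
library(jsonlite)

# Hypothetical endpoint, standing in for the request URL seen in the debugger.
ajax_url <- "http://example.com/api/windfarms?status=construction"

# Hypothetical helper: download the JSON response and parse it in one step.
fetch_ajax_table <- function(url) {
  fromJSON(url)  # fromJSON accepts a URL as well as a JSON string
}

# Made-up response body, so the parsing step can be tried without a network.
response_body <- '[{"farm": "Mains of Hatton", "turbines": 3},
                   {"farm": "Coultas Farm", "turbines": 1}]'

farms <- fromJSON(response_body)  # an array of objects becomes a data frame
print(farms)
```

With a real endpoint you would call `fetch_ajax_table(ajax_url)` instead of parsing the literal string. Some sites also expect particular headers or cookies on the request, in which case a plain URL fetch like this will not be enough.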

The hardest thing will be sites driven by Java Applets, but thankfully these are rare. These could be getting their data via just about any network mechanism, and you'd have to reverse engineer all that by inspecting network traffic.

Even IE/VBA will fail if it's a Java applet, I reckon.

Also, don't confuse Java and JavaScript.
