Scraping data from HTML tables on plain HTML websites is nice and easy. But how can I do this if the website is not written in plain HTML and requires a browser to show the relevant information, e.g. if it is an ASP website or the data is not in the source code but comes in through Java code?
Like it is here: http://www.bwea.com/ukwed/construction.asp.
With VBA for Excel one can write a function that opens an IE session, calls the website and then basically copies and pastes the content of the page. Is there any chance to do something similar with R?
This is normal HTML, with the usual cleanup needed after scraping the data.
The following does the trick: readHTMLTable in package XML.
The code, sketched minimally (the table index below is a guess; inspect the parsed list to pick the right one):
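    library(XML)

    url <- "http://www.bwea.com/ukwed/construction.asp"

    # readHTMLTable fetches the page and returns one data frame
    # per <table> element it finds
    tables <- readHTMLTable(url, stringsAsFactors = FALSE)
    str(tables)                  # inspect to find the table you want

    construction <- tables[[1]]  # index is a guess; adjust after str()
    head(construction)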
The resulting data come back as a list of data frames, one per table element on the page; pick out the one holding the construction projects.
That site just delivers HTML, as Thomas comments. Some sites use JavaScript to fetch values via an AJAX call and insert them into the document dynamically; those won't work via simple scraping. The trick with those is to use a JavaScript debugger (or the browser's network panel) to see what the AJAX calls are and reverse-engineer them from the request and response, as sketched below.
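As an illustration only (the endpoint, query parameters and JSON shape here are invented, not a real API), replaying such a call in R often boils down to something like:

    library(httr)      # for GET(), stop_for_status(), content()
    library(jsonlite)  # for fromJSON()

    # Hypothetical endpoint discovered in the browser's network panel;
    # replace the URL and query with whatever the debugger shows you
    resp <- GET("http://example.com/api/projects",
                query = list(status = "construction"))
    stop_for_status(resp)

    # Many AJAX endpoints return JSON, which parses straight to a data frame
    dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    head(dat)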
The hardest thing will be sites driven by Java Applets, but thankfully these are rare. These could be getting their data via just about any network mechanism, and you'd have to reverse engineer all that by inspecting network traffic.
Even IE/VBA will fail if it's a Java applet, I reckon.
Also, don't confuse Java and JavaScript.