I'd like to scrape the HTML of this URL as it appears in the browser's "view source": "https://portal.tirol.gv.at/wisPvpSrv/wisSrv/wis/wbo_wis_auszug.aspx?ATTR=Y&TREE=N&ANL_ID=T20889658R3&TYPE=0".
What I get with..
library(RCurl)
library(XML)

# fetch the raw page, following redirects and skipping SSL peer verification
myurl <- "https://portal.tirol.gv.at/wisPvpSrv/wisSrv/wis/wbo_wis_auszug.aspx?ATTR=Y&TREE=N&ANL_ID=T20889658R3&TYPE=0"
x <- getURL(myurl, followlocation = TRUE, ssl.verifypeer = FALSE)

# parse the downloaded string as HTML
htmlParse(x, asText = TRUE)
..is not what I see in the browser's source code. How can I work around this?
If that website uses a lot of JavaScript (and it seems it does) to generate content, then you are pretty much stuck, for starters.
If you use Firefox and get the developer toolbar, you can disable JavaScript to see what the site looks like without it, and what content might be scrapable. You may hope that the site has a usable non-JavaScript version (this is called 'graceful degradation', where JS is only used for fancy extras).
Otherwise, use Firebug or some other JS debugger to see how the site pulls in content if it's using AJAX, then replicate those calls in R and scrape the response directly; a rough sketch follows.
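For example, a minimal sketch of replaying such a request with RCurl, assuming the XHR endpoint returns JSON (the endpoint URL and header below are illustrative guesses, not something I've verified against this site):

library(RCurl)
library(RJSONIO)

# hypothetical endpoint spotted in Firebug's Net panel -- replace with
# whatever URL the site actually requests via XMLHttpRequest
ajaxurl <- "https://portal.tirol.gv.at/wisPvpSrv/wisSrv/wis/some_ajax_endpoint"

# replay the request; some backends check this header before answering
resp <- getURL(ajaxurl,
               httpheader = c("X-Requested-With" = "XMLHttpRequest"),
               followlocation = TRUE, ssl.verifypeer = FALSE)

# if the endpoint returns JSON, parse it directly instead of scraping HTML
dat <- fromJSON(resp)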
Not that I can test any of this, because when I go to that URL I get a Benutzername (username) and Passwort (password) prompt, and I don't have a Benutzername. If the content is behind authentication then you'll have to handle that in the RCurl process too, which might mean mucking about with cookies and so on; see the sketch below.
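Something along these lines might do it, assuming the portal accepts HTTP basic authentication (it may well use a login form instead, in which case you'd postForm() the credentials first and reuse the same handle):

library(RCurl)

# cookiefile = "" switches on libcurl's cookie engine, so any session
# cookies the server sets are kept and sent on later requests made
# with the same handle
h <- getCurlHandle(cookiefile = "",
                   followlocation = TRUE,
                   ssl.verifypeer = FALSE,
                   userpwd = "myBenutzername:myPasswort")  # placeholder credentials

x <- getURL(myurl, curl = h)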
Good luck with that.
Here ya go:
If you cannot get past the SSL verification, have a look at this post: using Rcurl with HTTPs. A sketch of the usual fix is below.
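In short, the usual fix is to hand libcurl a CA bundle rather than switching verification off; RCurl ships one, so something like this should work:

library(RCurl)

# point libcurl at the CA bundle shipped with RCurl instead of
# turning peer verification off with ssl.verifypeer = FALSE
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
x <- getURL(myurl, cainfo = cafile, followlocation = TRUE)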