I'd like to scrape the HTML of this URL as it appears in the browser's "view source": "https://portal.tirol.gv.at/wisPvpSrv/wisSrv/wis/wbo_wis_auszug.aspx?ATTR=Y&TREE=N&ANL_ID=T20889658R3&TYPE=0".
What I get with..
library(RCurl)
library(XML)

# fetch the raw page, following redirects and skipping SSL peer verification
myurl <- "https://portal.tirol.gv.at/wisPvpSrv/wisSrv/wis/wbo_wis_auszug.aspx?ATTR=Y&TREE=N&ANL_ID=T20889658R3&TYPE=0"
x <- getURL(myurl, followlocation = TRUE, ssl.verifypeer = FALSE)

# parse the downloaded string as HTML
htmlParse(x, asText = TRUE)
..is not what I see in the browser's source code. How can I work around this?
If that website uses a lot of JavaScript (and it seems it does) to generate content, then you are pretty much stuck, for starters.
If you use Firefox and get the developer toolbar, you can disable JavaScript to see what the site looks like without it, and what content might be scrapable. You may hope that the site has a usable non-JavaScript version (this is called 'graceful degradation', where JS is only used for fancy extras).
Otherwise, use Firebug or some other JS debugger to see how the site pulls in content if it's using AJAX, then replicate those calls in R and scrape the response directly; a rough sketch follows.
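For example, a minimal sketch of replaying such a request with RCurl, assuming the XHR endpoint returns JSON (the endpoint URL and header below are illustrative guesses, not something I've verified against this site):

library(RCurl)
library(RJSONIO)

# hypothetical endpoint spotted in Firebug's Net panel -- replace with
# whatever URL the site actually requests via XMLHttpRequest
ajaxurl <- "https://portal.tirol.gv.at/wisPvpSrv/wisSrv/wis/some_ajax_endpoint"

# replay the request; some backends check this header before answering
resp <- getURL(ajaxurl,
               httpheader = c("X-Requested-With" = "XMLHttpRequest"),
               followlocation = TRUE, ssl.verifypeer = FALSE)

# if the endpoint returns JSON, parse it directly instead of scraping HTML
dat <- fromJSON(resp)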
Not that I can test any of this, because when I go to that URL I get a Benutzername (username) and Passwort (password) prompt, and I don't have a Benutzername. If the content is behind authentication then you'll have to handle that in the RCurl process too, which might mean mucking about with cookies and so on; see the sketch below.
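Something along these lines might do it, assuming the portal accepts HTTP basic authentication (it may well use a login form instead, in which case you'd postForm() the credentials first and reuse the same handle):

library(RCurl)

# cookiefile = "" switches on libcurl's cookie engine, so any session
# cookies the server sets are kept and sent on later requests made
# with the same handle
h <- getCurlHandle(cookiefile = "",
                   followlocation = TRUE,
                   ssl.verifypeer = FALSE,
                   userpwd = "myBenutzername:myPasswort")  # placeholder credentials

x <- getURL(myurl, curl = h)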
Good luck with that.
Here ya go:
If you cannot get past the SSL verification, have a look at this post: using Rcurl with HTTPs. A sketch of the usual fix is below.
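In short, the usual fix is to hand libcurl a CA bundle rather than switching verification off; RCurl ships one, so something like this should work:

library(RCurl)

# point libcurl at the CA bundle shipped with RCurl instead of
# turning peer verification off with ssl.verifypeer = FALSE
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
x <- getURL(myurl, cainfo = cafile, followlocation = TRUE)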