Scraping data from HTML tables on plain HTML websites is nice and easy. But how can I do this if the website is not written in plain HTML and requires a browser to show the relevant information, e.g. if it is an ASP website or the data is not in the source code but comes in through Java code?
Like it is here: http://www.bwea.com/ukwed/construction.asp.
With VBA for Excel one can write a function that opens an IE session, calls the website and then basically copies and pastes the content of the page. Is there any chance to do something similar with R?
This is normal HTML, with the usual cleanup needed after scraping the data.
The following does the trick: readHTMLTable in package XML.
The code, sketched minimally (the table index below is a guess; inspect the parsed list to pick the right one):
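    library(XML)

    url <- "http://www.bwea.com/ukwed/construction.asp"

    # readHTMLTable fetches the page and returns one data frame
    # per <table> element it finds
    tables <- readHTMLTable(url, stringsAsFactors = FALSE)
    str(tables)                  # inspect to find the table you want

    construction <- tables[[1]]  # index is a guess; adjust after str()
    head(construction)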
The resulting data come back as a list of data frames, one per table element on the page; pick out the one holding the construction projects.
That site just delivers HTML, as Thomas comments. Some sites use JavaScript to fetch values via an AJAX call and insert them into the document dynamically; those won't work via simple scraping. The trick with those is to use a JavaScript debugger (or the browser's network panel) to see what the AJAX calls are and reverse-engineer them from the request and response, as sketched below.
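As an illustration only (the endpoint, query parameters and JSON shape here are invented, not a real API), replaying such a call in R often boils down to something like:

    library(httr)      # for GET(), stop_for_status(), content()
    library(jsonlite)  # for fromJSON()

    # Hypothetical endpoint discovered in the browser's network panel;
    # replace the URL and query with whatever the debugger shows you
    resp <- GET("http://example.com/api/projects",
                query = list(status = "construction"))
    stop_for_status(resp)

    # Many AJAX endpoints return JSON, which parses straight to a data frame
    dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
    head(dat)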
The hardest thing will be sites driven by Java Applets, but thankfully these are rare. These could be getting their data via just about any network mechanism, and you'd have to reverse engineer all that by inspecting network traffic.
Even IE/VBA will fail if it's a Java applet, I reckon.
Also, don't confuse Java and JavaScript.