I have a WebBrowser control and a Label in Visual Studio, and basically what I'm trying to do is grab a section of another webpage.
I tried using WebClient.DownloadString and WebClient.DownloadFile, and both of them give me the source code of the page before the JavaScript loads the content. My next idea was to use a WebBrowser control and just call webBrowser.DocumentText after the page loaded, but that did not work either; it still gives me the original source of the page.
Is there a way I can grab the page after the JavaScript has loaded its content?
Here is the page I'm trying to scrape.
http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083
I need to get the comment from that page, which is generated by JavaScript.
OK, I will show you how to run the JavaScript using PhantomJS and Selenium with C#.
In your Main function, use this code:
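A minimal sketch of the idea, assuming the Selenium.WebDriver NuGet package, a phantomjs.exe reachable from the working directory, and an older Selenium release that still ships PhantomJSDriver (it was dropped from later versions):

    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.PhantomJS;

    class Program
    {
        static void Main()
        {
            // Launch the headless PhantomJS browser through Selenium.
            using (IWebDriver driver = new PhantomJSDriver())
            {
                driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");

                // PageSource reflects the DOM after the page's JavaScript has run,
                // unlike WebClient.DownloadString, which only sees the initial HTML.
                Console.WriteLine(driver.PageSource);
            }
        }
    }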
Have a great time coding, and thanks to wbennett.
Thanks to wbennett, I discovered https://phantomjscloud.com. The free tier is enough to scrape pages through web API calls.
Yeah.
The problem is that the browser usually executes the JavaScript, which results in an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. In the past I ran into the same issue and used Selenium and PhantomJS to render the page. After it rendered the page, I would use the WebDriver client to navigate the DOM and retrieve the content I needed, post-AJAX.
At a high level, these are the steps:
Install-Package Selenium.WebDriver
Here is an example usage of the PhantomJS WebDriver:
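A sketch of that usage, assuming a Selenium hub running locally at http://localhost:4444/wd/hub with a PhantomJS node registered to it, and an older Selenium .NET binding that still exposes DesiredCapabilities; the CSS selector for the comment is a placeholder, not taken from the actual page:

    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Remote;

    class Scraper
    {
        static void Main()
        {
            // Request a PhantomJS session from the hub (assumes a PhantomJS node is registered).
            var capabilities = new DesiredCapabilities();
            capabilities.SetCapability(CapabilityType.BrowserName, "phantomjs");

            using (IWebDriver driver = new RemoteWebDriver(new Uri("http://localhost:4444/wd/hub"), capabilities))
            {
                // Give the page's AJAX calls time to populate the DOM.
                driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromSeconds(10));

                driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");

                // ".comment-text" is a placeholder -- inspect the rendered page to find
                // the element that actually holds the comment.
                IWebElement comment = driver.FindElement(By.CssSelector(".comment-text"));
                Console.WriteLine(comment.Text);
            }
        }
    }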
More info on Selenium, PhantomJS, and WebDriver can be found at the following links:
http://docs.seleniumhq.org/
http://docs.seleniumhq.org/projects/webdriver/
http://phantomjs.org/
EDIT: Easier Method
It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do massive scraping in this manner):
Install web driver:
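From the NuGet Package Manager Console (the same package named in the steps above):

    Install-Package Selenium.WebDriver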
Install embedded exe:
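One NuGet package that bundles the PhantomJS executable into the build output is PhantomJS (the package id here is an assumption; any package that copies phantomjs.exe next to your binaries will do):

    Install-Package PhantomJS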
Updated code:
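A sketch of the hub-free version, assuming the embedded phantomjs.exe ends up in the output directory where PhantomJSDriver can find it; the CSS selector is still a placeholder:

    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.PhantomJS;

    class Scraper
    {
        static void Main()
        {
            // No hub needed: PhantomJSDriver starts the bundled phantomjs.exe directly.
            using (IWebDriver driver = new PhantomJSDriver())
            {
                driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromSeconds(10));
                driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");

                // Placeholder selector -- replace with the real one from the rendered page.
                IWebElement comment = driver.FindElement(By.CssSelector(".comment-text"));
                Console.WriteLine(comment.Text);
            }
        }
    }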