I have a WebBrowser control and a Label in Visual Studio, and basically what I'm trying to do is grab a section of another webpage.
I tried using WebClient.DownloadString and WebClient.DownloadFile, and both of them give me the source code of the page before the JavaScript loads the content. My next idea was to use a WebBrowser control and just call webBrowser.DocumentText after the page loaded, but that did not work either; it still gives me the original source of the page.
Is there a way I can grab the page after the JavaScript has loaded its content?
Here is the page I'm trying to scrape.
http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083
I need to get the comment from that page, which is generated by JavaScript.
OK, I will show you how to run the JavaScript using PhantomJS and Selenium with C#.
In your Main function, use this code:
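A minimal sketch of the idea, assuming the Selenium.WebDriver NuGet package, a phantomjs.exe reachable from the working directory, and an older Selenium release that still ships PhantomJSDriver (it was dropped from later versions):

    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.PhantomJS;

    class Program
    {
        static void Main()
        {
            // Launch the headless PhantomJS browser through Selenium.
            using (IWebDriver driver = new PhantomJSDriver())
            {
                driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");

                // PageSource reflects the DOM after the page's JavaScript has run,
                // unlike WebClient.DownloadString, which only sees the initial HTML.
                Console.WriteLine(driver.PageSource);
            }
        }
    }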
Have a great time coding, and thanks to wbennett.
Thanks to wbennett, I discovered https://phantomjscloud.com. The free tier is enough to scrape pages through web API calls.
Yeah.
The problem is that the browser usually executes the JavaScript, which results in an updated DOM. Unless you can analyze the JavaScript or intercept the data it uses, you will need to execute the code as a browser would. In the past I ran into the same issue and used Selenium and PhantomJS to render the page. After it rendered the page, I would use the WebDriver client to navigate the DOM and retrieve the content I needed, post-AJAX.
At a high level, these are the steps:
Install-Package Selenium.WebDriver
Here is an example usage of the PhantomJS WebDriver:
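A sketch of that usage, assuming a Selenium hub running locally at http://localhost:4444/wd/hub with a PhantomJS node registered to it, and an older Selenium .NET binding that still exposes DesiredCapabilities; the CSS selector for the comment is a placeholder, not taken from the actual page:

    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.Remote;

    class Scraper
    {
        static void Main()
        {
            // Request a PhantomJS session from the hub (assumes a PhantomJS node is registered).
            var capabilities = new DesiredCapabilities();
            capabilities.SetCapability(CapabilityType.BrowserName, "phantomjs");

            using (IWebDriver driver = new RemoteWebDriver(new Uri("http://localhost:4444/wd/hub"), capabilities))
            {
                // Give the page's AJAX calls time to populate the DOM.
                driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromSeconds(10));

                driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");

                // ".comment-text" is a placeholder -- inspect the rendered page to find
                // the element that actually holds the comment.
                IWebElement comment = driver.FindElement(By.CssSelector(".comment-text"));
                Console.WriteLine(comment.Text);
            }
        }
    }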
More info on Selenium, PhantomJS, and WebDriver can be found at the following links:
http://docs.seleniumhq.org/
http://docs.seleniumhq.org/projects/webdriver/
http://phantomjs.org/
EDIT: Easier Method
It appears there is a NuGet package for PhantomJS, so you don't need the hub (I used a cluster to do massive scraping in this manner):
Install web driver:
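From the NuGet Package Manager Console (the same package named in the steps above):

    Install-Package Selenium.WebDriver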
Install embedded exe:
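One NuGet package that bundles the PhantomJS executable into the build output is PhantomJS (the package id here is an assumption; any package that copies phantomjs.exe next to your binaries will do):

    Install-Package PhantomJS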
Updated code:
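A sketch of the hub-free version, assuming the embedded phantomjs.exe ends up in the output directory where PhantomJSDriver can find it; the CSS selector is still a placeholder:

    using System;
    using OpenQA.Selenium;
    using OpenQA.Selenium.PhantomJS;

    class Scraper
    {
        static void Main()
        {
            // No hub needed: PhantomJSDriver starts the bundled phantomjs.exe directly.
            using (IWebDriver driver = new PhantomJSDriver())
            {
                driver.Manage().Timeouts().ImplicitlyWait(TimeSpan.FromSeconds(10));
                driver.Navigate().GoToUrl("http://www.regulations.gov/#!documentDetail;D=APHIS-2013-0013-0083");

                // Placeholder selector -- replace with the real one from the rendered page.
                IWebElement comment = driver.FindElement(By.CssSelector(".comment-text"));
                Console.WriteLine(comment.Text);
            }
        }
    }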