Scraping dynamic web content in C#

Posted 2020-03-30 15:57

Is it possible to scrape data generated by a dynamic web page? For example, this website generates the <font> tag with some JavaScript:

document.write("<font class=spy2>:<\/font>"+(v2j0j0^o5r8)+(r8d4x4^y5i9)+(b2r8e5^u1p6)+(r8d4x4^y5i9))

The values change on each page refresh. Each generated code represents a digit from 0 to 9, for example (code1)+(code2)+(code3)+(code4), and some kind of parser on the back end understands the codes and generates the numbers accordingly.

Once the page is rendered, if code1 was assigned to the digit 4 somewhere, then wherever the digit 4 appears on that page it comes from that code after parsing.

If we use HtmlAgilityPack, we see the JavaScript code but not its generated output. Is there any way to read the tag it creates once the page is rendered?
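To make the encoding concrete, here is a minimal sketch of how one such expression could be evaluated outside a browser. It assumes the identifiers (v2j0j0, o5r8, ...) are JavaScript variables assigned elsewhere in the page's scripts and that ^ is JavaScript's bitwise XOR; the question does not show those assignments, so the values used below are invented for illustration, and XorDigitDecoder is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class XorDigitDecoder
{
    // Matches one "(name1^name2)" term of the document.write(...) expression.
    private static readonly Regex Term = new Regex(@"\((\w+)\^(\w+)\)");

    // XORs the two variables of every term and concatenates the results,
    // the way JavaScript's "+" concatenates them onto the leading string.
    public static string Decode(string expression, IDictionary<string, int> variables)
    {
        var output = new StringBuilder();
        foreach (Match m in Term.Matches(expression))
        {
            int left = variables[m.Groups[1].Value];
            int right = variables[m.Groups[2].Value];
            output.Append(left ^ right);
        }
        return output.ToString();
    }
}
```

With the invented values v2j0j0 = 5, o5r8 = 13, r8d4x4 = 3, y5i9 = 3, b2r8e5 = 9 and u1p6 = 1, the expression from the question decodes to "8080". On a real page the variable values would first have to be extracted from the page's own script, which is the part a browser engine does for you.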

2 Answers
看我几分像从前
#2 · 2020-03-30 16:35

Thanks for pointing that out. I tried implementing that and got the same results, but then, following another comment that suggested using the IE engine, I switched and made a small application that does the job. I added an IE browser control, navigated it to the website, and read the content. Here is the code:

private void webBrowser1_DocumentCompleted(object sender, System.Windows.Forms.WebBrowserDocumentCompletedEventArgs e)
{
    // Fires once the document has finished loading, so the DOM includes
    // the <font> elements written by the page's script.
    System.Windows.Forms.HtmlElementCollection elementsforViewPost =
        webBrowser1.Document.GetElementsByTagName("font");
    foreach (System.Windows.Forms.HtmlElement current2 in elementsforViewPost)
    {
        // Keep only text that looks like a proxy address and is not already collected.
        // Any() requires: using System.Linq;
        if (current2.InnerText != null && CheckForValidProxyAddress(current2.InnerText) &&
            !ObtainedProxies.Any(index => index.ProxyAddress == current2.InnerText.Trim()))
        {
            Proxy data = new Proxy();
            data.IsRetired = false;
            data.IsActive = true;
            data.DomainsVisited = 0;
            data.ProxyAddress = current2.InnerText.Trim();

            ObtainedProxies.Add(data);
        }
    }
}
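The filter-and-deduplicate step in the handler can also be sketched independently of the WebBrowser control, which makes it easier to test. This is only an illustration: the Proxy class below is a guess at the shape implied by the properties used above, and ProxyCollector and its isValid parameter are hypothetical names standing in for the handler body and CheckForValidProxyAddress.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of the Proxy type, based on the properties set in the handler.
public class Proxy
{
    public string ProxyAddress { get; set; }
    public bool IsActive { get; set; }
    public bool IsRetired { get; set; }
    public int DomainsVisited { get; set; }
}

public static class ProxyCollector
{
    // Trims each text, drops nulls, invalid entries and duplicates,
    // and wraps every remaining address in a Proxy object.
    public static List<Proxy> Collect(IEnumerable<string> texts, Func<string, bool> isValid)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<Proxy>();
        foreach (string text in texts)
        {
            if (text == null) continue;
            string address = text.Trim();
            // HashSet.Add returns false when the address was already seen.
            if (!isValid(address) || !seen.Add(address)) continue;
            result.Add(new Proxy
            {
                ProxyAddress = address,
                IsActive = true,
                IsRetired = false,
                DomainsVisited = 0
            });
        }
        return result;
    }
}
```

Using a HashSet keeps the duplicate check constant-time per element instead of rescanning the whole list for every <font> tag, which may matter if a page lists many proxies.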

To check that the received text is a valid proxy address, I used the following, which I found on some page long ago by googling:

private bool CheckForValidProxyAddress(string address)
{
    // Requires: using System.Text.RegularExpressions;
    // Match pattern: four dot-separated octets (0-255), a colon, then zero to four port digits.
    //string pattern = @"^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])){3}$:([0-9][0-9][0-9][0-9])";
    string pattern = @"\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b\:[0-9]{0,4}";
    Regex check = new Regex(pattern);

    // No address provided, so it cannot be valid.
    if (string.IsNullOrEmpty(address))
    {
        return false;
    }

    // The pattern is not anchored, so IsMatch succeeds if it occurs anywhere in the text.
    return check.IsMatch(address);
}
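Because the pattern above is unanchored and allows an empty port, it also accepts strings such as "1.2.3.4:" or an address embedded in longer text. If the whole string should be exactly IP:port, a stricter variant could look like the sketch below. It is not the code from the answer: ProxyAddressValidator is a hypothetical name, and the 1 to 65535 port range is an added assumption.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProxyAddressValidator
{
    // One octet in the range 0-255, without leading zeros.
    private const string Octet = @"(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])";

    // Whole string must be "a.b.c.d:port"; groups 1-4 are the octets, group 5 the port.
    private static readonly Regex Pattern = new Regex(
        "^" + Octet + @"\." + Octet + @"\." + Octet + @"\." + Octet + @":([0-9]{1,5})$");

    public static bool IsValid(string address)
    {
        if (string.IsNullOrWhiteSpace(address)) return false;

        Match m = Pattern.Match(address.Trim());
        if (!m.Success) return false;

        // At most five digits, so int.Parse cannot overflow.
        int port = int.Parse(m.Groups[5].Value);
        return port >= 1 && port <= 65535;
    }
}
```

For example, ProxyAddressValidator.IsValid("10.0.0.1:8080") returns true, while "10.0.0.1", "256.1.1.1:80" and "10.0.0.1:99999" are all rejected.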
够拽才男人
#3 · 2020-03-30 16:47

I think you have to use the IE engine somehow.
