Retrieve a string containing HTML document source

Posted 2019-08-07 12:12

Question:

I'm hoping someone here is experienced with both the TPL and the System.Net classes and methods.

What started as a simple idea, using the TPL on a currently sequential set of actions, has brought my project to a halt.

I am still new to .NET, so using the TPL means jumping straight into deep water ...

I was trying to extract an ASPX page's source/content (HTML) using WebClient.

There are multiple requests per day (around 20-30 pages to go through), and specific values have to be extracted from each page's source. This is only one of a few daily tasks the server has on its list, which led me to try implementing it with the TPL to gain some speed.

I tried using Task.Factory.StartNew() to iterate over a few WebClient (WC) instances, but on the very first execution the application simply does not get any result from the WebClient.

This is my latest attempt:

    static void Main(string[] args)
    {
        EnumForEach<Act>(Execute);
        Task.WaitAll();
    }

    public static void EnumForEach<Mode>(Action<Mode> Exec)
    {
        foreach (Mode mode in Enum.GetValues(typeof(Mode)))
        {
            Mode Curr = mode;

            Task.Factory.StartNew(() => Exec(Curr));
        }
    }

    string ResultsDirectory = Environment.CurrentDirectory,
        URL = "",
        TempSourceDocExcracted = "",
        ResultFile = "";

    enum Act
    {
        dolar, ValidateTimeOut
    }

    void Execute(Act Exc)
    {
        switch (Exc)
        {
            case Act.dolar:
                URL = "http://www.AnyDomainHere.Com";
                ResultFile =ResultsDirectory + "\\TempHtm.htm";
                TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
                File.WriteAllText(ResultFile, TempSourceDocExcracted);
                break;
            case Act.ValidateTimeOut:
                URL = "http://www.AnotherDomainHere.Com";
                ResultFile += "\\TempHtm.htm";
                TempSourceDocExcracted = IeNgn.AgilityPacDocExtraction(URL).GetElementbyId("Dv_Main").InnerHtml;
                File.WriteAllText(ResultFile, TempSourceDocExcracted);
                break;
        }
    }

    // Uses HtmlAgilityPack to extract values of elements by their attributes/properties
    // (called above as IeNgn.AgilityPacDocExtraction)
    public HtmlAgilityPack.HtmlDocument AgilityPacDocExtraction(string URL)
    {
        using (WC = new WebClient())
        {
            WC.Proxy = null;
            WC.Encoding = Encoding.GetEncoding("UTF-8");
            tmpExtractedPageValue = WC.DownloadString(URL);
            retAglPacHtmDoc.LoadHtml(tmpExtractedPageValue);
            return retAglPacHtmDoc;
        }
    }

What am I doing wrong? Is it possible to use a WebClient with the TPL at all, or should I use another tool (I am not able to use IIS 7 / .NET 4.5)?

Answer 1:

I see several issues:

  1. Naming: FlNm is not a name. Visual Studio is a modern IDE with smart code completion, so there's no need to save keystrokes. You could start with the C# Coding Conventions; there are alternatives too, and the main thing is to keep it consistent.

  2. If you're using multithreading, you need to care about resource sharing. For example, FlNm is a static string that is assigned inside each thread, so its value is not deterministic. Even if the code ran sequentially it would be faulty: you would append the file name to the path in each iteration, so it would end up like c:\TempHtm.htm\TempHtm.htm\TempHtm.htm.

  3. You're writing to the same file from different threads (at least, I think that was your intent), which is usually a recipe for disaster in multithreading. The question is whether you need to write anything to disk at all, or whether the page can be downloaded as a string and parsed without touching the disk (there's a good example of what it means to touch a disk).

  4. Overall, I think you should parallelize only the downloading, so do not involve HtmlAgilityPack in the multithreading, as I don't think you know whether it is thread safe. Besides, downloading gives a good performance-to-thread-count ratio; HTML parsing not so much, maybe if the thread count equals the core count, but not more. Moreover, I would separate downloading from parsing, as that would be easier to test, understand and maintain.
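Points 2 and 3 could be sketched roughly like this. It is only an illustration under assumptions: the PerTaskFiles class, the BuildResultPath and RunAll names, the fetch delegate standing in for the real download, and the one-file-per-mode naming scheme are all hypothetical and not from the question. Each task builds its path from local values with Path.Combine, so nothing is shared or appended to between threads, and each task writes its own file. The sketch also keeps the started tasks and passes them to Task.WaitAll, since a WaitAll call with no arguments has no tasks to wait for.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Mirrors the enum from the question.
public enum Act { dolar, ValidateTimeOut }

// Hypothetical helper, for illustration only.
public static class PerTaskFiles
{
    // Builds the output path from its arguments only; no shared field is read or modified.
    // The "<mode>.htm" naming scheme is an assumption so that tasks never share a file.
    public static string BuildResultPath(string directory, Act mode)
    {
        return Path.Combine(directory, mode + ".htm");
    }

    // Starts one task per enum value, waits for all of them, and returns the written paths.
    // 'fetch' stands in for whatever produces the HTML for a mode (e.g. a download).
    public static string[] RunAll(string directory, Func<Act, string> fetch)
    {
        var tasks = new List<Task<string>>();
        foreach (Act mode in Enum.GetValues(typeof(Act)))
        {
            Act current = mode; // per-iteration copy for the closure
            tasks.Add(Task.Factory.StartNew(() =>
            {
                string path = BuildResultPath(directory, current); // local, not shared
                File.WriteAllText(path, fetch(current));
                return path;
            }));
        }

        Task<string>[] started = tasks.ToArray();
        Task.WaitAll((Task[])started); // wait on the tasks that were actually started

        var paths = new string[started.Length];
        for (int i = 0; i < started.Length; i++)
        {
            paths[i] = started[i].Result;
        }
        return paths;
    }
}
```

From Main this would be called as something like PerTaskFiles.RunAll(Environment.CurrentDirectory, mode => DownloadFor(mode)), where DownloadFor is again a placeholder for your own download code. If the files are not needed at all, the tasks can simply return the HTML strings instead of writing them.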

Update: I don't understand your full intent, but this may help you get started (it's not production code; you should add retry/error catching, etc.). At the end there is also an extended WebClient class that lets you get more threads spinning, because by default WebClient allows only two concurrent connections per host.

    using System;
    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Net;
    using System.Threading.Tasks;

    class Program
    {
        static void Main(string[] args)
        {
            var urlList = new List<string>
                              {
                                  "http://google.com",
                                  "http://yahoo.com",
                                  "http://bing.com",
                                  "http://ask.com"
                              };

            // Download in parallel; the thread-safe dictionary collects url -> html.
            var htmlDictionary = new ConcurrentDictionary<string, string>();
            Parallel.ForEach(urlList, new ParallelOptions { MaxDegreeOfParallelism = 20 }, url => Download(url, htmlDictionary));

            // Parse sequentially, after all downloads have finished.
            foreach (var pair in htmlDictionary)
            {
                Process(pair);
            }
        }

        private static void Process(KeyValuePair<string, string> pair)
        {
            // do the html processing
        }

        private static void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
        {
            // One WebClient per download; instances are not shared between threads.
            using (var webClient = new SmartWebClient())
            {
                htmlDictionary.TryAdd(url, webClient.DownloadString(url));
            }
        }
    }

    // WebClient that raises the connection limit on the requests it creates.
    public class SmartWebClient : WebClient
    {
        private readonly int maxConcurentConnectionCount;

        public SmartWebClient(int maxConcurentConnectionCount = 20)
        {
            this.maxConcurentConnectionCount = maxConcurentConnectionCount;
        }

        protected override WebRequest GetWebRequest(Uri address)
        {
            var request = base.GetWebRequest(address);

            // Non-HTTP requests (e.g. file://) are passed through unchanged
            // instead of failing on a hard cast to HttpWebRequest.
            var httpWebRequest = request as HttpWebRequest;
            if (httpWebRequest != null && maxConcurentConnectionCount != 0)
            {
                httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
            }

            return request;
        }
    }
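
As for the retry/error catching mentioned above, a minimal sketch could look like the following. The DownloadRetry class, the TryDownload name, and the choice to return null after the last failed attempt are assumptions for illustration, not part of WebClient or of the code above. It retries only on WebException and uses a fixed delay between attempts; real code would probably want logging and a smarter back-off.

```csharp
using System;
using System.Net;
using System.Threading;

// Hypothetical helper, for illustration only.
public static class DownloadRetry
{
    // Calls 'download' up to maxAttempts times, retrying only when it throws WebException.
    // Returns the downloaded string, or null when every attempt failed.
    public static string TryDownload(Func<string> download, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException("maxAttempts");
        }

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return download();
            }
            catch (WebException)
            {
                if (attempt == maxAttempts)
                {
                    break; // give up; the caller decides what to do with null
                }
                Thread.Sleep(delay); // fixed pause before the next attempt
            }
        }

        return null;
    }
}
```

Inside a Download method like the one above it could be used as DownloadRetry.TryDownload(() => webClient.DownloadString(url), 3, TimeSpan.FromSeconds(1)), adding the result to the dictionary only when it is not null. Catching the exception inside each download also keeps one failing URL from aborting the whole Parallel.ForEach.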