HttpClient crawling results in memory leak

Posted 2019-02-01 01:34

Question:

I am working on a WebCrawler implementation but am facing a strange memory leak in ASP.NET Web API's HttpClient.

Here is the cut-down version:


[UPDATE 2]

I found the problem and it is not HttpClient that is leaking. See my answer.


[UPDATE 1]

I have added Dispose calls, with no effect:

    static void Main(string[] args)
    {
        int waiting = 0;
        const int MaxWaiting = 100;
        var httpClient = new HttpClient();
        foreach (var link in File.ReadAllLines("links.txt"))
        {

            while (waiting>=MaxWaiting)
            {
                Thread.Sleep(1000);
                Console.WriteLine("Waiting ...");
            }
            httpClient.GetAsync(link)
                .ContinueWith(t =>
                                  {
                                      try
                                      {
                                          var httpResponseMessage = t.Result;
                                          if (httpResponseMessage.IsSuccessStatusCode)
                                              httpResponseMessage.Content.LoadIntoBufferAsync()
                                                  .ContinueWith(t2=>
                                                                    {
                                                                        if(t2.IsFaulted)
                                                                        {
                                                                            httpResponseMessage.Dispose();
                                                                            Console.ForegroundColor = ConsoleColor.Magenta;
                                                                            Console.WriteLine(t2.Exception);
                                                                        }
                                                                        else
                                                                        {
                                                                            httpResponseMessage.Content.
                                                                                ReadAsStringAsync()
                                                                                .ContinueWith(t3 =>
                                                                                {
                                                                                    Interlocked.Decrement(ref waiting);

                                                                                    try
                                                                                    {
                                                                                        Console.ForegroundColor = ConsoleColor.White;

                                                                                        Console.WriteLine(httpResponseMessage.RequestMessage.RequestUri);
                                                                                        string s =
                                                                                            t3.Result;

                                                                                    }
                                                                                    catch (Exception ex3)
                                                                                    {
                                                                                        Console.ForegroundColor = ConsoleColor.Yellow;

                                                                                        Console.WriteLine(ex3);
                                                                                    }
                                                                                    httpResponseMessage.Dispose();
                                                                                });                                                                                
                                                                        }
                                                                    }
                                                  );
                                      }
                                      catch(Exception e)
                                      {
                                          Interlocked.Decrement(ref waiting);
                                          Console.ForegroundColor = ConsoleColor.Red;                                             
                                          Console.WriteLine(e);
                                      }
                                  }
                );

            Interlocked.Increment(ref waiting);

        }

        Console.Read();
    }

The file containing links is available here.

This results in steadily rising memory usage. Memory analysis shows many bytes held, possibly by the AsyncCallback. I have done many memory leak analyses before, but this one seems to be at the HttpClient level.

I am using C# 4.0, so there is no async/await here; only TPL 4.0 is used.

The code above works but is not optimised and sometimes throws a tantrum, yet it is enough to reproduce the effect. The point is that I cannot find anything that could cause memory to leak.

Answer 1:

OK, I got to the bottom of this. Thanks to @Tugberk, @Darrel and @youssef for spending time on this.

Basically, the initial problem was that I was spawning too many tasks. This started to take its toll, so I had to cut back and keep some state to make sure the number of concurrent tasks is limited. This is a big challenge when writing processes that use the TPL to schedule tasks. We can control the threads in the thread pool, but we also need to control the tasks we are creating, so no amount of async/await will help with this.
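One way to keep that kind of state is a semaphore that caps how many operations are in flight. The following is only a minimal sketch, assuming .NET 4.5 or later (for async/await and SemaphoreSlim.WaitAsync); the Throttle class and its ForEachAsync method are hypothetical names made up for illustration, not the code I actually ended up with.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs one async operation per item while capping
// the number of operations that are in flight at the same time.
static class Throttle
{
    public static async Task ForEachAsync<T>(
        IEnumerable<T> items, int maxConcurrent, Func<T, Task> work)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var running = new List<Task>();
            foreach (var item in items)
            {
                // Wait (without blocking a thread) until a slot is free.
                await gate.WaitAsync().ConfigureAwait(false);
                running.Add(RunOneAsync(item, work, gate));
            }

            // Wait for the tail of the work and surface any failure.
            await Task.WhenAll(running).ConfigureAwait(false);
        }
    }

    private static async Task RunOneAsync<T>(
        T item, Func<T, Task> work, SemaphoreSlim gate)
    {
        try
        {
            await work(item).ConfigureAwait(false);
        }
        finally
        {
            // Free the slot whether the operation succeeded or failed.
            gate.Release();
        }
    }
}
```

For the crawler, the work delegate would be something like a lambda that calls GetAsync on one shared HttpClient and disposes the response, with maxConcurrent set to 100. Note that the sketch keeps every Task in a list until the end, which is fine for a few thousand links but would need pruning for a very long run.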

I managed to reproduce the leak only a couple of times with this code; other times, after growing, memory would just suddenly drop. I know there was a revamp of the GC in 4.5, so perhaps the issue here is that the GC did not kick in often enough, although I have been looking at the perf counters for GC generation 0, 1 and 2 collections.

So the take-away here is that re-using HttpClient does NOT cause a memory leak.
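For anyone who wants to watch similar figures from inside the process rather than in perfmon, here is a rough sketch using GC.CollectionCount and GC.GetTotalMemory. The GcSnapshot class is a hypothetical name for illustration, and the managed heap size it reports is not the same number as the process memory shown by Task Manager.

```csharp
using System;

// Hypothetical helper: formats the per-generation collection counts and
// the approximate managed heap size as a single log line.
static class GcSnapshot
{
    public static string Describe()
    {
        return string.Format(
            "gen0={0} gen1={1} gen2={2} managedBytes={3}",
            GC.CollectionCount(0),
            GC.CollectionCount(1),
            GC.CollectionCount(2),
            // false: report the current estimate without forcing a collection.
            GC.GetTotalMemory(false));
    }
}
```

Logging GcSnapshot.Describe() every few hundred requests should show whether the heap keeps growing while gen 2 collections are still happening, which is what would point to a real leak rather than a lazy GC.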



Answer 2:

I'm no good at diagnosing memory issues, but I gave it a try with the following code. It targets .NET 4.5 and uses the async/await feature of C#, too. It seems to keep memory usage around 10 - 15 MB for the entire process (not sure whether you would consider this better memory usage, though). But if you watch the # Gen 0 Collections, # Gen 1 Collections and # Gen 2 Collections perf counters, they are pretty high with the code below.

If you remove the GC.Collect calls below, it goes back and forth between 30 MB and 50 MB for the entire process. The interesting part is that when I run your code on my 4-core machine, I don't see abnormal memory usage by the process either. I have .NET 4.5 installed on my machine; if you don't, the problem might be related to CLR internals of .NET 4.0, and I am sure that the resource usage of the TPL has improved a lot in .NET 4.5.

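using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class Program {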

    static void Main(string[] args) {

        ServicePointManager.DefaultConnectionLimit = 500;
        CrawlAsync().ContinueWith(task => Console.WriteLine("***DONE!"));
        Console.ReadLine();
    }

    private static async Task CrawlAsync() {

        int numberOfCores = Environment.ProcessorCount;
        List<string> requestUris = File.ReadAllLines(@"C:\Users\Tugberk\Downloads\links.txt").ToList();
        ConcurrentDictionary<int, Tuple<Task, HttpRequestMessage>> tasks = new ConcurrentDictionary<int, Tuple<Task, HttpRequestMessage>>();
        List<HttpRequestMessage> requestsToDispose = new List<HttpRequestMessage>();

        var httpClient = new HttpClient();

        // Start one request per core, but never more than there are links.
        for (int i = 0; i < numberOfCores && requestUris.Count > 0; i++) {

            string requestUri = requestUris.First();
            var requestMessage = new HttpRequestMessage(HttpMethod.Get, requestUri);
            Task task = MakeCall(httpClient, requestMessage);
            tasks.AddOrUpdate(task.Id, Tuple.Create(task, requestMessage), (index, t) => t);
            requestUris.RemoveAt(0);
        }

        while (tasks.Values.Count > 0) {

            Task task = await Task.WhenAny(tasks.Values.Select(x => x.Item1));

            Tuple<Task, HttpRequestMessage> removedTask;
            tasks.TryRemove(task.Id, out removedTask);
            removedTask.Item1.Dispose();
            removedTask.Item2.Dispose();

            if (requestUris.Count > 0) {

                var requestUri = requestUris.First();
                var requestMessage = new HttpRequestMessage(HttpMethod.Get, requestUri);
                Task newTask = MakeCall(httpClient, requestMessage);
                tasks.AddOrUpdate(newTask.Id, Tuple.Create(newTask, requestMessage), (index, t) => t);
                requestUris.RemoveAt(0);
            }

            GC.Collect(0);
            GC.Collect(1);
            GC.Collect(2);
        }

        httpClient.Dispose();
    }

    private static async Task MakeCall(HttpClient httpClient, HttpRequestMessage requestMessage) {

        Console.WriteLine("**Starting new request for {0}!", requestMessage.RequestUri);
        var response = await httpClient.SendAsync(requestMessage).ConfigureAwait(false);
        Console.WriteLine("**Request is completed for {0}! Status Code: {1}", requestMessage.RequestUri, response.StatusCode);

        using (response) {
            if (response.IsSuccessStatusCode){
                using (response.Content) {

                    Console.WriteLine("**Getting the HTML for {0}!", requestMessage.RequestUri);
                    string html = await response.Content.ReadAsStringAsync().ConfigureAwait(false);
                    Console.WriteLine("**Got the HTML for {0}! Length: {1}", requestMessage.RequestUri, html.Length);
                }
            }
            else if (response.Content != null) {

                response.Content.Dispose();
            }
        }
    }
}


Answer 3:

A recently reported "Memory Leak" in our QA environment taught us this:

Consider the TCP Stack

Don't assume the TCP Stack can do what is asked in the time "thought appropriate for the application". Sure we can spin off Tasks at will and we just love async, but....

Watch the TCP Stack

Run NETSTAT when you think you have a memory leak. If you see residual sessions or half-baked states, you may want to rethink your design along the lines of HttpClient reuse and limiting the amount of concurrent work being spun up. You may also need to consider load balancing across multiple machines.

Half-baked sessions show up in NETSTAT in states such as FIN_WAIT_1, FIN_WAIT_2 and TIME_WAIT. Even "Established" sessions can be virtually dead, just waiting for time-outs to fire.
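If running NETSTAT by hand is inconvenient, a similar summary can be produced from inside .NET. This is only a sketch, built on IPGlobalProperties.GetActiveTcpConnections from System.Net.NetworkInformation; the TcpStateSummary class is a hypothetical name, and the counts cover every TCP connection on the machine, not just those of your own process.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.NetworkInformation;

// Hypothetical helper: counts the machine's TCP connections per state,
// roughly what you would tally by eye from NETSTAT output.
static class TcpStateSummary
{
    public static Dictionary<TcpState, int> CountByState()
    {
        return IPGlobalProperties.GetIPGlobalProperties()
            .GetActiveTcpConnections()
            .GroupBy(connection => connection.State)
            .ToDictionary(group => group.Key, group => group.Count());
    }

    public static void Print()
    {
        // Largest groups first, e.g. a pile of TimeWait or FinWait2 entries.
        foreach (var pair in CountByState().OrderByDescending(p => p.Value))
        {
            Console.WriteLine("{0,-12} {1}", pair.Key, pair.Value);
        }
    }
}
```

Calling TcpStateSummary.Print() periodically while the application is under load shows whether TimeWait, FinWait1 or FinWait2 counts keep climbing.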

The Stack and .NET are most likely not broken

Overloading the stack puts the machine to sleep. Recovery takes time and 99% of the time the stack will recover. Remember also that .NET will not release resources before their time and that no user has full control of GC.

If you kill the app and it takes 5 minutes for NETSTAT to settle down, that's a pretty good sign the system is overwhelmed. It's also a good show of how the stack is independent of the application.



Answer 4:

The default HttpClient leaks when you use it as a short-lived object and create new HttpClients per request.

Here is a reproduction of this behavior.

As a workaround, I was able to keep using HttpClient as a short-lived object by using the following NuGet package instead of the built-in System.Net.Http assembly: https://www.nuget.org/packages/HttpClient

I am not sure what the origin of this package is; however, as soon as I referenced it, the memory leak disappeared. Make sure that you remove the reference to the built-in .NET System.Net.Http library and use the NuGet package instead.