I am working on a WebCrawler implementation but am facing a strange memory leak in ASP.NET Web API's HttpClient.
So the cut down version is here:
[UPDATE 2]
I found the problem and it is not HttpClient that is leaking. See my answer.
[UPDATE 1]
I have added dispose with no effect:
static void Main(string[] args)
{
int waiting = 0;
const int MaxWaiting = 100;
var httpClient = new HttpClient();
foreach (var link in File.ReadAllLines("links.txt"))
{
while (waiting>=MaxWaiting)
{
Thread.Sleep(1000);
Console.WriteLine("Waiting ...");
}
httpClient.GetAsync(link)
.ContinueWith(t =>
{
try
{
var httpResponseMessage = t.Result;
if (httpResponseMessage.IsSuccessStatusCode)
httpResponseMessage.Content.LoadIntoBufferAsync()
.ContinueWith(t2=>
{
if(t2.IsFaulted)
{
httpResponseMessage.Dispose();
Console.ForegroundColor = ConsoleColor.Magenta;
Console.WriteLine(t2.Exception);
}
else
{
httpResponseMessage.Content.
ReadAsStringAsync()
.ContinueWith(t3 =>
{
Interlocked.Decrement(ref waiting);
try
{
Console.ForegroundColor = ConsoleColor.White;
Console.WriteLine(httpResponseMessage.RequestMessage.RequestUri);
string s =
t3.Result;
}
catch (Exception ex3)
{
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine(ex3);
}
httpResponseMessage.Dispose();
});
}
}
);
}
catch(Exception e)
{
Interlocked.Decrement(ref waiting);
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine(e);
}
}
);
Interlocked.Increment(ref waiting);
}
Console.Read();
}
The file containing links is available here.
This results in constant rising of the memory. Memory analysis shows many bytes held possibly by the AsyncCallback. I have done many memory leak analysis before but this one seems to be at the HttpClient level.
I am using C# 4.0 so no async/await here so only TPL 4.0 is used.
The code above works but is not optimised and sometimes throws tantrum yet is enough to reproduce the effect. Point is I cannot find any point that could cause memory to be leaked.
OK, I got to the bottom of this. Thanks to @Tugberk, @Darrel and @youssef for spending time on this.
Basically the initial problem was I was spawning too many tasks. This started to take its toll so I had to cut back on this and have some state for making sure the number of concurrent tasks are limited. This is basically a big challenge for writing processes that have to use TPL to schedule the tasks. We can control threads in the thread pool but we also need to control the tasks we are creating so no level of
async/await
will help this.I managed to reproduce the leak only a couple of times with this code - other times after growing it would just suddenly drop. I know that there was a revamp of GC in 4.5 so perhaps the issue here is that GC did not kick in enough although I have been looking at perf counters on GC generation 0, 1 and 2 collections.
So the take-away here is that re-using
HttpClient
does NOT cause memory leak.The default
HttpClient
leaks when you use it as a short-lived object and create new HttpClients per request.Here is a reproduction of this behavior.
As a workaround, I was able to keep using HttpClient as a short-lived object by using the following Nuget package instead of the built-in
System.Net.Http
assembly: https://www.nuget.org/packages/HttpClientNot sure what the origin of this package is, however, as soon as I referenced it the memory leak disappeared. Make sure that you remove the reference to the built-in .NET
System.Net.Http
library and use the Nuget package instead.I'm no good at defining memory issues but I gave it a try with the following code. It's in .NET 4.5 and uses async/await feature of C#, too. It seems to keep memory usage around 10 - 15 MB for the entire process (not sure if you see this a better memory usage though). But if you watch # Gen 0 Collections, # Gen 1 Collections and # Gen 2 Collections perf counters, they are pretty high with the below code.
If you remove the
GC.Collect
calls below, it goes back and forth between 30MB - 50MB for entire process. The interesting part is that when I run your code on my 4 core machine, I don't see abnormal memory usage by the process either. I have .NET 4.5 installed on my machine and if you don't, the problem might be related to CLR internals of .NET 4.0 and I am sure that TPL has improved a lot on .NET 4.5 based on resource usage.A recent reported "Memory Leak" in our QA environment taught us this:
Consider the TCP Stack
Don't assume the TCP Stack can do what is asked in the time "thought appropriate for the application". Sure we can spin off Tasks at will and we just love asych, but....
Watch the TCP Stack
Run NETSTAT when you think you have a memory leak. If you see residual sessions or half-baked states, you may want to rethink your design along the lines of HTTPClient reuse and limiting the amount of concurrent work being spun up. You also may need to consider using Load Balancing across multiple machines.
Half-baked sessions show up in NETSTAT with Fin-Waits 1 or 2 and Time-Waits or even RST-WAIT 1 and 2. Even "Established" sessions can be virtually dead just waiting for time-outs to fire.
The Stack and .NET are most likely not broken
Overloading the stack puts the machine to sleep. Recovery takes time and 99% of the time the stack will recover. Remember also that .NET will not release resources before their time and that no user has full control of GC.
If you kill the app and it takes 5 minutes for NETSTAT to settle down, that's a pretty good sign the system is overwhelmed. It's also a good show of how the stack is independent of the application.