Question:
We have an application running on 5 server nodes (16 cores and 128 GB of memory each) that loads almost 70 GB of data on each machine. The application is distributed and serves concurrent clients, so it makes heavy use of sockets. For synchronization between threads it uses a few techniques, mostly System.Threading.Monitor.
The problem is that while the application is running and data is moving between the server nodes and between clients and servers, one or two server machines start throwing OutOfMemoryException even though more than 40% of memory is still available. We suspect that the exception comes from unmanaged code. Although we do not make any unmanaged calls directly, the last call in the OOM stack trace is always a framework call that internally calls unmanaged code.
Following are a couple of examples.
Exception of type 'System.OutOfMemoryException' was thrown.
at System.Threading.Monitor.ObjPulseAll(Object obj)
....
Exception of type 'System.OutOfMemoryException' was thrown.
at System.Threading.Monitor.ObjWait(Boolean exitContext, Int32 millisecondsTimeout, Object obj)
at System.Threading.Monitor.Wait(Object obj, TimeSpan timeout)
....
We are clueless here as to what is causing this issue. We have induced GC on these machines multiple times but that also doesn't seem to help.
Any help would be appreciated..
EDIT:
Following are some more details;
- Application is running in x64 process.
- Windows Server 2012 R2
- .NET Framework 4.5
- Server GC enabled
- AllowLargeObject flag is set.
EDIT2: Please note that this is not a memory leak. 70 GB process size is valid here.
Answer 1:
Some of the preliminary questions that other users have suggested are cool, but have you considered being lazy and profiling your app?
I can think of ANTS Memory Profiler from Redgate or dotMemory from JetBrains, links below.
http://www.red-gate.com/products/dotnet-development/ants-memory-profiler/
https://www.jetbrains.com/dotmemory/
Answer 2:
Even if there is a memory leak in unmanaged code, with 40% of memory available you should still be able to allocate objects. I suspect this is a fragmentation problem rather than a memory leak.
1- Is the data you are trying to allocate in big or small chunks?
2- Did you try forcing a garbage collection (by calling GC.Collect())? Garbage collection not only frees memory but also compacts the heap, which removes fragmentation. Note that the large object heap is not compacted by default.
Answer 3:
GC.Collect() will only free memory held by objects that are no longer referenced by anything else.
A common scenario where a leak can occur is not disconnecting an event handler from an object before setting its reference to null.
As an exercise in avoiding leaks, it's a good idea to implement IDisposable on objects (even though it's meant for releasing unmanaged resources), simply to ensure that all handlers are disconnected, collections are cleared correctly and any other object references are set to null.
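The pattern described above can be sketched roughly as follows. This is only an illustration: PriceFeed and PriceView are hypothetical names that do not come from the question, and the SubscriberCount property exists purely to make the effect visible.

```csharp
using System;

// Hypothetical long-lived publisher (assumption, for illustration only).
public sealed class PriceFeed
{
    public event EventHandler Updated;

    public void RaiseUpdated()
    {
        var handler = Updated;
        if (handler != null) handler(this, EventArgs.Empty);
    }

    // Number of handlers currently attached to the event.
    public int SubscriberCount
    {
        get { return Updated == null ? 0 : Updated.GetInvocationList().Length; }
    }
}

// Hypothetical short-lived subscriber that detaches itself in Dispose.
public sealed class PriceView : IDisposable
{
    private readonly PriceFeed _feed;
    private bool _disposed;

    public int UpdatesSeen { get; private set; }

    public PriceView(PriceFeed feed)
    {
        _feed = feed;
        _feed.Updated += OnUpdated; // the feed now holds a reference to this view
    }

    private void OnUpdated(object sender, EventArgs e)
    {
        UpdatesSeen++;
    }

    public void Dispose()
    {
        if (_disposed) return;
        _feed.Updated -= OnUpdated; // remove the feed's reference to this view
        _disposed = true;
    }
}
```

Until Dispose runs, the feed's delegate list keeps the view reachable, so no amount of GC.Collect() calls will reclaim it. Wrapping the subscriber in a using block is one way to make sure the handler is always detached.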
Answer 4:
I suggest using ADPlus or a similar tool to capture a dump of your process when this exception occurs. You can then debug the dump file with WinDbg. All of the commands below are taken from the blog post Investigating ASP.Net Memory Dumps for Idiots (like Me).
Investigating memory leaks
In order to get a view on memory, we need to use the following command
!dumpheap
The "dumpheap" command gives you object counts and the memory usage of objects.
You can then investigate which object types use most of your memory.
!dumpheap -type System.IO.MemoryStream
The "dumpheap -type" command lists all of the objects on the heap that are of type MemoryStream.
A good thing about WinDbg is that you can also investigate unmanaged memory leaks: Example 1 and Example 2.
Answer 5:
If it is a fragmentation problem, you cannot solve it without some sort of profiling. Look for a memory profiler that supports fragmentation detection so you can identify the exact cause of the fragmentation.
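Answer 6:
Running a garbage collection with LargeObjectHeapCompactionMode set to CompactOnce may help to fix fragmentation. Note that this setting is available from .NET Framework 4.5.1 onward, so it requires upgrading from 4.5.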
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
Answer 7:
Note that while an event handler is subscribed, the publisher of the event holds a reference to the subscriber. This is a common cause of memory leaks in .NET. In your case it would not be a serious leak by itself, but if a managed object that is kept alive this way holds a pointer or handle to an unmanaged object, the unmanaged object is never freed, and that can cause memory fragmentation.
If you are sure that the fragmentation comes from the unmanaged component and that you are not missing something, and if you have access to that component's source code, you can recompile it and link it against a better memory allocator such as Hoard. But this should be a last resort, after serious profiling.
Answer 8:
In .NET 4.5, the CLR team improved large object heap (LOH) allocation. Even so, they still recommend object pooling to help large object performance. It sounds like LOH fragmentation happens less often in 4.5, but it can still happen. From the stack trace, however, this looks unrelated to the LOH.
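As a rough idea of what object pooling for large buffers could look like, here is a minimal sketch. BufferPool is a hypothetical class, not a framework type, and a real pool would probably also want an upper bound on how many buffers it retains.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical fixed-size buffer pool (assumption, for illustration only).
// Reusing large arrays avoids repeatedly allocating and freeing LOH blocks.
public sealed class BufferPool
{
    private readonly ConcurrentBag<byte[]> _buffers = new ConcurrentBag<byte[]>();
    private readonly int _bufferSize;

    public BufferPool(int bufferSize)
    {
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException("bufferSize");
        _bufferSize = bufferSize;
    }

    // Hands out a pooled buffer if one is available, otherwise allocates a new one.
    public byte[] Rent()
    {
        byte[] buffer;
        return _buffers.TryTake(out buffer) ? buffer : new byte[_bufferSize];
    }

    // Puts a buffer back into the pool; buffers of the wrong size are ignored.
    public void Return(byte[] buffer)
    {
        if (buffer == null || buffer.Length != _bufferSize) return;
        Array.Clear(buffer, 0, buffer.Length); // do not leak old data to the next user
        _buffers.Add(buffer);
    }
}
```

Callers rent a buffer, use it, and return it in a finally block so that the same arrays are recycled instead of new large objects being allocated for every request.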
Daniel Lane suggested GC deadlocks. We have seen those happen on production systems, too, and they definitely cause issues with process size and out of memory conditions.
One thing you could do is run Debug Diagnostics Tool, capture a full dump when the OutOfMemoryException occurs, and then have the tool analyze the dump for crash and memory information. I've seen some interesting things happen with both native and managed heaps from this report. For example, we found a printer driver had allocated 1 GB of unmanaged heap on a 32-bit system. Updating the driver fixed the issue. Granted, that was a client system, but something similar could be happening to your server.
I agree that this sounds like a native mode error. Looking at the implementation of System.Threading.Monitor.Wait, ObjWait, PulseAll, and ObjPulseAll in the .NET 4.5 Reference Code reveals that these methods call into native code:
/*========================================================================
** Sends a notification to all waiting objects.
========================================================================*/
[System.Security.SecurityCritical]  // auto-generated
[ResourceExposure(ResourceScope.None)]
[MethodImplAttribute(MethodImplOptions.InternalCall)]
private static extern void ObjPulseAll(Object obj);

[System.Security.SecuritySafeCritical]  // auto-generated
public static void PulseAll(Object obj)
{
    if (obj == null)
    {
        throw new ArgumentNullException("obj");
    }
    Contract.EndContractBlock();

    ObjPulseAll(obj);
}
A comment on Raymond Chen's article about "PulseEvent is fundamentally flawed" by "Mike Dimmick" says:
Monitor.PulseAll is a wrapper around Monitor.ObjPulseAll, which is an
internal call to the CLR internal function ObjectNative::PulseAll.
This in turn wraps ObjHeader::PulseAll, which wraps
SyncBlock::PulseAll. This simply sits in a loop calling SetEvent until
no more threads are waiting on the object.
If anyone has access to the source code for the CLI, maybe they could post more about this function and what the memory error could be coming from.
Answer 9:
An educated guess without seeing your code is that you have an issue with STA deadlocking on finalisation, especially since this is a high-concurrency system, judging by your hefty hardware requirements. Given that you have already tried forcing GC, a deadlock makes sense: if finalisation is deadlocked, the GC cannot do its job. Hope this helps.
Advanced Techniques to Prevent and Detect Deadlocks in .Net Applications
The section of interest is quoted below:
When your code is executing on a single-threaded apartment (STA) thread, the equivalent of an exclusive lock occurs. Only one thread can update a GUI window or run code inside an Apartment-threaded COM component inside an STA at once. Such threads own a message queue into which to-be-processed information is placed by the system and other parts of the application. GUIs use this queue for information such as repaint requests, device input to be processed, and window close requests. COM proxies use the message queue to transitioning cross-Apartment method calls into the apartment for which a component has affinity. All code running on an STA is responsible for pumping the message queue—looking for and processing new messages using the message loop—otherwise the queue can become clogged, leading to lost responsiveness. In Win32 terms, this means using the MsgWaitForSingleObject, MsgWaitForMultipleObjects (and their Ex counterparts), or CoWaitForMultipleHandles APIs. A non-pumping wait such as WaitForSingleObject or WaitForMultipleObjects (and their Ex counterparts) won't pump incoming messages.
In other words, the STA "lock" can only be released by pumping the message queue. Applications that perform operations whose performance characteristics vary greatly on the GUI thread without pumping for messages, like those noted earlier, can easily deadlock. Well-written programs either schedule such long-running work to occur elsewhere, or pump for messages each time they block to avoid this problem. Thankfully, the CLR pumps for you whenever you block in managed code (via a call to a contentious Monitor.Enter, WaitHandle.WaitOne, FileStream.EndRead, Thread.Join, and so forth), helping to mitigate this problem. But plenty of code—and even some fraction of the .NET Framework itself—ends up blocking in unmanaged code, in which case a pumping wait may or may not have been added by the author of the blocking code.
Here's a classic example of an STA-induced deadlock. A thread running in an STA generates a large quantity of Apartment threaded COM component instances and, implicitly, their corresponding Runtime Callable Wrappers (RCWs). Of course, these RCWs must be finalized by the CLR when they become unreachable, or they will leak. But the CLR's finalizer thread always joins the process's Multithreaded Apartment (MTA), meaning it must use a proxy that transitions to the STA in order to call Release on the RCWs. If the STA isn't pumping to receive the finalizer's attempt to invoke the Finalize method on a given RCW—perhaps because it has chosen to block using a non-pumping wait—the finalizer thread will be stuck. It is blocked until the STA unblocks and pumps. If the STA never pumps, the finalizer thread will never make any progress, and a slow, silent build-up of all finalizable resources will occur over time. This can, in turn, lead to a subsequent out-of-memory crash or a process-recycle in ASP.NET. Clearly, both outcomes are unsatisfactory.
High-level frameworks like Windows Forms, Windows Presentation Foundation, and COM hide much of the complexity of STAs, but they can still fail in unpredictable ways, including deadlocking. COM synchronization contexts introduce similar, but subtly different, challenges. And furthermore, many of these failures will only occur in a small fraction of test runs and often only under high stress.
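One way to check for the blocked-finalizer scenario described in the quote, short of taking a dump, is to probe the finalizer thread from a worker thread with a timeout. The following is only a sketch: FinalizerProbe is a hypothetical helper, and the timeout you pass is an arbitrary choice that depends on how long your finalizers legitimately take.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (assumption, for illustration only).
public static class FinalizerProbe
{
    // Returns true if the finalizer thread drained its queue within the timeout.
    // A false result suggests the finalizer thread may be blocked, for example
    // waiting on an STA thread that is not pumping messages.
    public static bool FinalizerThreadResponds(TimeSpan timeout)
    {
        // Run the blocking call on a thread-pool thread so the caller is never stuck.
        Task probe = Task.Run(() => GC.WaitForPendingFinalizers());
        return probe.Wait(timeout);
    }
}
```

If the probe times out repeatedly, a dump of the process should show where the finalizer thread is waiting.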
Answer 10:
The GC doesn't take the unmanaged heap into account. If you are creating lots of objects that are merely C# wrappers around larger blocks of unmanaged memory, then your memory is being devoured, but the GC can't make rational decisions about it because it only sees the managed heap.
You end up in a situation where the garbage collector doesn't think you are short of memory, because most of the things on your gen 1 heap are 8-byte references, when in fact they are like icebergs at sea: most of the memory is below the surface.
You can make use of these GC calls:
System::GC::AddMemoryPressure(sizeOfField);
System::GC::RemoveMemoryPressure(sizeOfField);
These methods let the garbage collector account for the unmanaged memory (provided you give it the right figures).
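In C#, a wrapper that reports its unmanaged allocation to the GC might look roughly like this. NativeBuffer is a hypothetical class used only for illustration, and Marshal.AllocHGlobal stands in for whatever native allocation your wrappers actually perform.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around a block of unmanaged memory (assumption).
public sealed class NativeBuffer : IDisposable
{
    private IntPtr _ptr;
    private readonly long _size;

    public NativeBuffer(int sizeInBytes)
    {
        if (sizeInBytes <= 0) throw new ArgumentOutOfRangeException("sizeInBytes");
        _ptr = Marshal.AllocHGlobal(sizeInBytes);
        _size = sizeInBytes;
        GC.AddMemoryPressure(_size); // tell the GC about the hidden native allocation
    }

    public IntPtr Pointer { get { return _ptr; } }

    public bool IsDisposed { get { return _ptr == IntPtr.Zero; } }

    public void Dispose()
    {
        Release();
        GC.SuppressFinalize(this);
    }

    // Safety net in case Dispose is never called.
    ~NativeBuffer()
    {
        Release();
    }

    private void Release()
    {
        if (_ptr == IntPtr.Zero) return;
        Marshal.FreeHGlobal(_ptr);
        _ptr = IntPtr.Zero;
        GC.RemoveMemoryPressure(_size); // must match the amount added earlier
    }
}
```

Every AddMemoryPressure call needs a matching RemoveMemoryPressure with the same size, otherwise the GC's view of the process drifts away from reality over time.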