On one production site our application(*) crashes repeatedly, but not reproducibly. Analyzing the crash dumps clearly shows heap corruption: the crashes occur at different locations, but they are always access violations inside `kernel32!HeapFree`/`ntdll!RtlpLowFragHeapFree`. WinDbg's `!analyze -v` also reports heap corruption.
What we have tried so far is running the application with the GFlags Page Heap option. The problem is that the memory overhead of Page Heap is so large that the application can no longer operate (it hits the virtual memory limit of the 32-bit process).
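To make the overhead concrete, here is a rough back-of-the-envelope sketch. It assumes 4 KiB pages, a 2 GiB user address space, and that full Page Heap gives every allocation its own data page(s) plus one inaccessible guard page. These figures are assumptions for illustration, not measurements from our process, and the function names are made up.

```cpp
#include <cstdint>

// Assumed figures, not measured values from the affected process.
constexpr std::uint64_t kPageSize = 4096;                             // 4 KiB pages
constexpr std::uint64_t kUserAddressSpace = 2ull * 1024 * 1024 * 1024; // 2 GiB

// Address space consumed by one allocation of `bytes` under the assumed
// layout: the data rounded up to whole pages, plus one guard page.
constexpr std::uint64_t page_heap_footprint(std::uint64_t bytes) {
    return ((bytes == 0 ? 1 : (bytes + kPageSize - 1) / kPageSize) + 1) * kPageSize;
}

// Upper bound on simultaneously live allocations of the given size,
// ignoring everything else that lives in the address space.
constexpr std::uint64_t max_live_allocations(std::uint64_t bytes) {
    return kUserAddressSpace / page_heap_footprint(bytes);
}
```

Under these assumptions a 32-byte allocation occupies 8192 bytes of address space, so `max_live_allocations(32)` is 262144, and in practice the limit is lower because code, stacks, and other mappings share the same address space.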
So, we cannot use Page Heap. Which other flags would be useful to add so that we either
- get a crash at the corruption site
- or can at least get more information out of the crash dump that will eventually be generated when we crash in `HeapFree`?
We are currently trying out the flags:
- Enable heap tagging
- Enable heap tail checking
in the hope that the next crash dump will contain more information about what went wrong.
I considered these flags, but left them out for now:
- Enable heap parameter checking ... I would expect quite some overhead when the system checks every time a heap function is called
- Enable heap free checking ... not sure whether this would actually buy me anything
- Enable heap validation on call ... here even the docs warn of the high overhead
One problem I (also) have is that I'm unsure how these flags help when a memory corruption occurs. Page Heap obviously will generate an access violation when something writes into the guard pages, but how do the other flags operate?
Do I have to run the app with Application Verifier for these other flags to help? Or will an exception be raised when the checking code detects something?
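For contrast with Page Heap, here is a minimal sketch of the general idea behind tail checking as I understand it from the GFlags documentation: a short pattern is written after each allocation and verified when the block is freed. This is not the Windows heap implementation; `checked_alloc`, `checked_free`, the pattern byte, and the tail size are all hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Illustrative only: hypothetical names and values, not the Windows heap.
constexpr std::size_t kTailSize = 8;        // assumed length of the tail pattern
constexpr unsigned char kTailFill = 0xAB;   // arbitrary fill byte for this sketch

// Header stored in front of the user block; aligned so that the user
// pointer keeps the alignment malloc provides.
struct alignas(std::max_align_t) Header {
    std::size_t size;
};

// Allocate `size` user bytes followed by a tail pattern.
void* checked_alloc(std::size_t size) {
    auto* raw = static_cast<unsigned char*>(
        std::malloc(sizeof(Header) + size + kTailSize));
    if (!raw) return nullptr;
    reinterpret_cast<Header*>(raw)->size = size;
    std::memset(raw + sizeof(Header) + size, kTailFill, kTailSize);
    return raw + sizeof(Header);
}

// Free the block and report whether the tail pattern was still intact.
// An overrun is only noticed here, at free time, not when it happened.
bool checked_free(void* p) {
    if (!p) return true;
    auto* user = static_cast<unsigned char*>(p);
    auto* raw = user - sizeof(Header);
    const std::size_t size = reinterpret_cast<Header*>(raw)->size;
    bool intact = true;
    for (std::size_t i = 0; i < kTailSize; ++i) {
        if (user[size + i] != kTailFill) intact = false;
    }
    std::free(raw);
    return intact;
}
```

If the real flag works along these lines, it would flag an overrun only when the damaged block is freed, which is later than the corrupting write and so would not give a crash at the corruption site the way Page Heap does.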
Which combination of these flags makes most sense so that the application can still run with OK performance and memory consumption in production?
(*): It's a 32-bit Windows desktop application in industrial automation, running on 64-bit Windows 7 in this case (which it does just fine at a whole lot of other sites).