GFlags setting to catch heap corruption (other tha

Posted 2019-03-19 12:08

Question:

On one production site our application(*) crashes repeatedly, but non-reproducibly. Analyzing the crash dumps clearly shows heap corruption: the crashes are at different locations, but they are always access violations inside kernel32!HeapFree/ntdll!RtlpLowFragHeapFree. WinDbg's !analyze -v also reports heap corruption.

What we have tried so far is to run the application with the GFlags option Page Heap. The problem is that the memory overhead of Page Heap is so large that the application no longer runs (it hits the virtual memory limit of a 32-bit process).

So, we cannot use Page Heap. Which other flags would be useful to add so that we either

  • get a crash at the corruption site
  • or at least can get more info out of a crash dump that will eventually be generated when we crash in HeapFree?

We are currently trying out the flags:

  • Enable heap tagging
  • Enable heap tail checking

in the hopes that the next crash dump will contain some more information of what went wrong.

I considered these flags, but left them out for now:

  • Enable heap parameter checking ... I would expect quite some overhead when the system checks every time a heap function is called
  • Enable heap free checking ... not sure whether this would actually buy me anything
  • Enable heap validation on call ... here even the docs warn of the high overhead

One problem I (also) have is that I'm unsure how these flags help when a memory corruption occurs. Page Heap obviously will generate an access violation when something writes into the guard pages, but how do the other flags operate?

Do I have to run the app with Application Verifier for these other flags to help? Or will an exception be raised when the checking code detects something?

Which combination of these flags makes most sense so that the application can still run with OK performance and memory consumption in production?


(*) : It's a 32-bit Windows desktop application in industrial automation, running on 64-bit Windows 7 in this case (which it does just fine at a whole lot of other sites).

Answer 1:

  1. IMHO the easiest way to control all this checking is Application Verifier. It has a convenient UI and lets you experiment with all the flags.
  2. Heap free checking is a good flag without too much overhead. If a heap block has been corrupted, you get a break into the debugger when that block is freed. If the corruption occurs near the allocation or freeing of the block, this might help.
  3. AFAIK "Heap parameter checking" is just a lightweight version of "heap validation on call". I never had any success with it.
  4. Heap tail checking and tagging are easy and fast. They sometimes work for me.
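To make the idea behind tail checking and free checking concrete, here is a minimal conceptual sketch in C. It is not the Windows heap manager's implementation: the wrapper names `checked_alloc`, `tail_intact` and `checked_free`, the header layout, the fill byte and the tail length are all assumptions made for illustration. The point it shows is that a known pattern is written behind each block and verified only when the block is freed, so an overrun is noticed at free time rather than at the moment of the bad write.

```c
#include <stdlib.h>
#include <string.h>

#define TAIL_FILL 0xAB  /* assumed fill byte, chosen for illustration */
#define TAIL_SIZE 8     /* assumed tail length in bytes */

/* Hypothetical bookkeeping header stored in front of each user block. */
typedef struct {
    size_t user_size;
} block_header;

/* Allocate user_size bytes and write a known pattern right behind them. */
void *checked_alloc(size_t user_size) {
    unsigned char *raw =
        (unsigned char *)malloc(sizeof(block_header) + user_size + TAIL_SIZE);
    if (raw == NULL) {
        return NULL;
    }
    ((block_header *)raw)->user_size = user_size;
    memset(raw + sizeof(block_header) + user_size, TAIL_FILL, TAIL_SIZE);
    return raw + sizeof(block_header);
}

/* Return 1 if the tail pattern is intact, 0 if something wrote past the block. */
int tail_intact(const void *user_ptr) {
    const unsigned char *p = (const unsigned char *)user_ptr;
    const block_header *h = (const block_header *)(p - sizeof(block_header));
    size_t i;
    for (i = 0; i < TAIL_SIZE; i++) {
        if (p[h->user_size + i] != TAIL_FILL) {
            return 0;
        }
    }
    return 1;
}

/* Free the block. The overrun is only noticed here, at free time.
   Returns 1 if the block was clean, 0 if its tail was damaged. */
int checked_free(void *user_ptr) {
    int ok = tail_intact(user_ptr);
    free((unsigned char *)user_ptr - sizeof(block_header));
    return ok;
}
```

A write one byte past a 16-byte block (`b[16] = 'X'`) makes `checked_free(b)` return 0, but it says nothing about which code did the write. That matches the limitation described above: the check helps most when the corruption and the free are close together.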

You can also control this on a per-application basis with gflags:

gflags.exe /i Testapp.exe e0
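If the trailing `e0` in that command is read as a hexadecimal flag mask, it can be decoded into individual heap flags. The sketch below does that in C, using the bit values and abbreviations from the GFlags flag table. The helper name `decode_heap_flags` and the table layout are hypothetical, and only the heap-related flags discussed in this thread are listed.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned long bit;   /* bit value in the global flags mask */
    const char *abbrev;  /* abbreviation accepted by gflags */
    const char *name;    /* label shown in the GFlags GUI */
} gflag_entry;

/* Heap-related flags only; not a complete flag table. */
static const gflag_entry HEAP_FLAGS[] = {
    { 0x00000010UL, "htc", "Enable heap tail checking" },
    { 0x00000020UL, "hfc", "Enable heap free checking" },
    { 0x00000040UL, "hpc", "Enable heap parameter checking" },
    { 0x00000080UL, "hvc", "Enable heap validation on call" },
    { 0x00000800UL, "htg", "Enable heap tagging" },
    { 0x02000000UL, "hpa", "Enable page heap" },
};

/* Write the abbreviations of all heap flags set in hex_mask into out,
   separated by spaces. Returns the number of flags found. */
int decode_heap_flags(const char *hex_mask, char *out, size_t out_size) {
    unsigned long mask = strtoul(hex_mask, NULL, 16);
    int count = 0;
    size_t i;
    if (out_size == 0) {
        return 0;
    }
    out[0] = '\0';
    for (i = 0; i < sizeof(HEAP_FLAGS) / sizeof(HEAP_FLAGS[0]); i++) {
        size_t used;
        if ((mask & HEAP_FLAGS[i].bit) == 0) {
            continue;
        }
        used = strlen(out);
        snprintf(out + used, out_size - used, "%s%s",
                 count > 0 ? " " : "", HEAP_FLAGS[i].abbrev);
        count++;
    }
    return count;
}
```

Under that reading, `e0` is 0x20 + 0x40 + 0x80, so it decodes to `hfc hpc hvc`: free checking, parameter checking and validation on call. Tail checking and tagging, the two flags the question is currently trying, would correspond to the mask `810`.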

But: the best way to find such problems is to use the Debug CRT, if that is possible for you. So if there is a chance to use your debug build in the production environment, do it. The Debug CRT in turn offers a lot of flags that you can set...
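As a rough sketch of what setting Debug CRT flags can look like, the following C code turns on periodic heap validation through `_CrtSetDbgFlag` and exposes an on-demand check through `_CrtCheckMemory`. The wrapper names `enable_debug_heap_checks` and `heap_looks_ok` are hypothetical. The choice of checking every 1024 allocations is an assumption meant to keep overhead tolerable, not a recommendation from the answer. The code only does anything in an MSVC debug build; elsewhere it compiles to no-ops.

```c
#if defined(_MSC_VER) && defined(_DEBUG)
#include <crtdbg.h>
#endif

/* Returns 1 if debug-heap checking was switched on, 0 if it is not
   available in this build (non-MSVC compiler or release build). */
int enable_debug_heap_checks(void) {
#if defined(_MSC_VER) && defined(_DEBUG)
    /* Read the current flags without changing them. */
    int flags = _CrtSetDbgFlag(_CRTDBG_REPORT_FLAG);

    /* Keep debug-heap bookkeeping enabled. */
    flags |= _CRTDBG_ALLOC_MEM_DF;

    /* The upper 16 bits hold the check frequency. Validate the whole heap
       every 1024 allocation calls; _CRTDBG_CHECK_ALWAYS_DF would check on
       every call, at a much higher cost. */
    flags = (flags & 0x0000FFFF) | _CRTDBG_CHECK_EVERY_1024_DF;

    _CrtSetDbgFlag(flags);
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the debug heap looks consistent (or the check is not
   available in this build), 0 if damage was detected. */
int heap_looks_ok(void) {
#if defined(_MSC_VER) && defined(_DEBUG)
    return _CrtCheckMemory() ? 1 : 0;
#else
    return 1;
#endif
}
```

Calling `heap_looks_ok()` at a few well-chosen points, for example after each processing cycle, can narrow down the time window in which the corruption happens.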



Answer 2:

"Enable Page Heap" in the gflags GUI enables full page heap verification, which can cause the problem you describe. The gflags command line gives you more control and lets you enable standard page heap verification, which uses less memory but is less powerful. The command line also lets you use a mix of standard and full verification through the /size, /dlls, and /address options.

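To see why full page heap can exhaust a 32-bit address space, a back-of-the-envelope estimate helps. The C sketch below assumes 4 KB pages and assumes that, under full verification, each allocation occupies its own run of whole pages plus one inaccessible guard page. It ignores the page heap's own bookkeeping, so it is a lower bound under those assumptions, not a measurement. The function names are hypothetical.

```c
#include <stddef.h>

#define ASSUMED_PAGE_SIZE 4096u  /* assumed page size on x86 */

/* Address space assumed for one allocation under full page heap:
   the request rounded up to whole pages, plus one guard page. */
unsigned long long full_page_heap_bytes(size_t request) {
    unsigned long long pages =
        ((unsigned long long)request + ASSUMED_PAGE_SIZE - 1) / ASSUMED_PAGE_SIZE;
    if (pages == 0) {
        pages = 1;  /* even a zero-byte request is assumed to take a page */
    }
    return (pages + 1) * ASSUMED_PAGE_SIZE;
}

/* Address space assumed for live_allocations blocks of the same size. */
unsigned long long estimate_total(size_t request,
                                  unsigned long long live_allocations) {
    return full_page_heap_bytes(request) * live_allocations;
}
```

Under these assumptions a 32-byte allocation costs 8192 bytes of address space, so 250,000 live allocations of that size already need about 2 GB. Restricting full verification to a size range or to specific DLLs keeps that cost limited to the allocations most likely to be involved.
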
Here are the options listed in the debugger.chm help file:

To enable and configure page heap verification:

    gflags /p /enable ImageFile  [ /full [/backwards] | /random Probability | /size SizeStart SizeEnd | /address AddressStart AddressEnd | /dlls DLL [DLL...] ]  [/debug ["DebuggerCommand"] | /kdebug] [/unaligned] [/notraces] [/fault Rate [TimeOut]] [/leaks] [/protect] [/no_sync] [/no_lock_checks]