Debugging heap corruptions

If I had to choose the family of bugs I hate the most (I would have to be under the gun, obviously, as it’s a little bit like choosing your favourite illness), I would be inclined to go with heap corruptions. They’re silent, random, and hard to reproduce and track down. Today I’ll describe my small collection of simple tips that have helped me stay sane during debugging sessions.

First of all, let’s list the most obvious heap problems:

  • overrun: the most common one. Writing more data than the block can hold. If we’re lucky, we overwrite unused data; in most cases we’ll wipe some other block.
  • using memory after releasing it. It may appear to work in some cases, but it will crash sooner or later. If the access happens just after the release, chances are the memory hasn’t changed yet, but of course you can’t rely on that.
  • uninitialized variables, especially pointers.

There are more, of course (buffer underruns, freeing the same pointer multiple times, etc.), but those are the most popular in my experience. If you only use the standard system allocation functions, it’s much easier, because the Win32 C runtime library (CRT) provides functions that, in debug mode, help a lot with debugging memory-related problems. The thing is, usually at least part of your allocations will go through your own functions. If something’s screwed up there, we’re on our own. Let’s do what the CRT does: allocate more memory and place guards at the beginning and end of the memory block. In the most basic implementation we’d only need 2 extra bytes: 1 byte for the header, 1 for the footer. In some cases you’ll need more information stored in the header (the block size, for example).

Figure 1: Layout of complete memory block.

When the user allocates a block of memory we:

  • fill the leading/trailing guards with ‘magic’ values (make them different if possible, so you can tell them apart),
  • fill the allocation with an in-use ‘magic’ value (different from the guards, of course),
  • fill in the header information if needed.
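The allocation steps above could be sketched roughly like this. It’s only a minimal illustration layered on top of malloc, not what the CRT actually does; the names (GuardedAlloc, BlockHeader), the magic values and the 16-byte header size are all my own assumptions, chosen just to make the example complete.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical magic values: any distinct, unlikely byte patterns will do.
constexpr unsigned char kHeadGuard = 0xA1;   // leading guard
constexpr unsigned char kTailGuard = 0xB2;   // trailing guard
constexpr unsigned char kInUseFill = 0xCD;   // allocated, not yet written user data
constexpr std::size_t   kTailGuardSize = 8;  // trailing guard length in bytes

// Header stored right in front of the user data. It is padded to 16 bytes
// so the user pointer keeps the alignment malloc gave us.
struct BlockHeader {
    std::size_t   size;                             // bytes requested by the user
    unsigned char guard[16 - sizeof(std::size_t)];  // leading guard bytes
};
static_assert(sizeof(BlockHeader) == 16, "header must not disturb alignment");

// Returns a pointer to `size` usable bytes, or nullptr on failure.
void* GuardedAlloc(std::size_t size) {
    const std::size_t overhead = sizeof(BlockHeader) + kTailGuardSize;
    if (size > SIZE_MAX - overhead) return nullptr;  // total size would overflow

    void* raw = std::malloc(overhead + size);
    if (raw == nullptr) return nullptr;

    // Header information and leading guard.
    BlockHeader* header = new (raw) BlockHeader;
    header->size = size;
    std::memset(header->guard, kHeadGuard, sizeof(header->guard));

    // In-use fill for the user data, then the trailing guard right behind it.
    unsigned char* user = static_cast<unsigned char*>(raw) + sizeof(BlockHeader);
    std::memset(user, kInUseFill, size);
    std::memset(user + size, kTailGuard, kTailGuardSize);
    return user;
}
```

The in-use fill has a nice side effect: an uninitialized pointer read from such a block shows up as a recognisable pattern in the debugger instead of random garbage.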

When freeing memory:

  • check the guards for overwrites. It’s much quicker to find the culprit if only the footer is destroyed, because that means it was probably overwritten by code related to the block being freed,
  • fill the block and guards with new ‘freed’ values, so that any later use of the released memory stands out.
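The freeing side might look something like the sketch below. Again, this is just an illustration under my own assumptions: the layout (a 16-byte header holding the size plus guard bytes, and an 8-byte footer), the magic values and the names GuardedFree, CheckBlock and BlockState are all made up for the example. The allocation function is included only so that there is something to free.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical magic values and layout, for illustration only.
constexpr unsigned char kHeadGuard = 0xA1;
constexpr unsigned char kTailGuard = 0xB2;
constexpr unsigned char kInUseFill = 0xCD;
constexpr unsigned char kFreedFill = 0xDD;
constexpr std::size_t   kTailGuardSize = 8;

struct BlockHeader {
    std::size_t   size;                             // bytes requested by the user
    unsigned char guard[16 - sizeof(std::size_t)];  // leading guard bytes
};

enum class BlockState { Ok, FooterDamaged, HeaderDamaged };

static bool AllBytesAre(const unsigned char* p, std::size_t count, unsigned char value) {
    for (std::size_t i = 0; i < count; ++i)
        if (p[i] != value) return false;
    return true;
}

// Minimal allocation side: header, in-use fill, footer.
void* GuardedAlloc(std::size_t size) {
    void* raw = std::malloc(sizeof(BlockHeader) + size + kTailGuardSize);
    if (raw == nullptr) return nullptr;
    BlockHeader* header = new (raw) BlockHeader;
    header->size = size;
    std::memset(header->guard, kHeadGuard, sizeof(header->guard));
    unsigned char* user = static_cast<unsigned char*>(raw) + sizeof(BlockHeader);
    std::memset(user, kInUseFill, size);
    std::memset(user + size, kTailGuard, kTailGuardSize);
    return user;
}

BlockState CheckBlock(const void* user) {
    const unsigned char* bytes = static_cast<const unsigned char*>(user);
    const BlockHeader* header =
        reinterpret_cast<const BlockHeader*>(bytes - sizeof(BlockHeader));
    // Header first: if it is damaged, the stored size cannot be trusted.
    if (!AllBytesAre(header->guard, sizeof(header->guard), kHeadGuard))
        return BlockState::HeaderDamaged;
    if (!AllBytesAre(bytes + header->size, kTailGuardSize, kTailGuard))
        return BlockState::FooterDamaged;
    return BlockState::Ok;
}

// Checks the guards, refills the block, releases it and reports what it found.
BlockState GuardedFree(void* user) {
    if (user == nullptr) return BlockState::Ok;
    const BlockState state = CheckBlock(user);
    unsigned char* raw = static_cast<unsigned char*>(user) - sizeof(BlockHeader);
    if (state != BlockState::HeaderDamaged) {
        // The size is trustworthy, so the whole block can get the 'freed' fill.
        const std::size_t size = reinterpret_cast<BlockHeader*>(raw)->size;
        std::memset(raw, kFreedFill, sizeof(BlockHeader) + size + kTailGuardSize);
    }
    std::free(raw);
    return state;
}
```

One caveat: with plain malloc/free underneath, the ‘freed’ fill is of limited use, since the compiler may drop a memset that directly precedes free and the system heap is free to reuse the memory. It pays off when the block goes back to your own pool, which is the case this post is about.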

This approach will let you catch most of the basic mistakes easily. The biggest problem is overruns, because they’re caught when the block is freed, not when the overrun happens. Windows can enable “full pageheap” mode (via gflags) for its own heaps, which catches overruns immediately. The main problem (apart from the obvious one: it only works for system allocations) is that I don’t think it’s possible to have it enabled for any medium-sized application; the overhead is just too big (it adds an inaccessible guard page after each block). In the case of a typical game, consider yourself lucky if you can even load a level. What are the other options?

  • record all footer pointers and validate them every ‘n’ memory operations. Obviously, it’ll affect both performance and memory requirements, but by tweaking the ‘n’ factor it can be made usable. It still won’t catch the bug at the exact point of execution, but it should narrow down the scope.

  • my favourite “trick” is to provide a ValidateUsedBlock method (you most probably have one already and call it before freeing a block; just make it public). Now you can call it in suspicious places (typical examples: after constructing an object, after the most common operations, inside the Tick() method and so on), running in smaller and smaller circles and shortening the list of possible culprits. In most cases memory blocks are not touched that often, so it’s possible to catch bugs this way quicker than you may expect… Of course, it’s not ideal and still rather boring, but it’s better than the previous method.
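Both options can be sketched together. The version below is a simplified illustration under my own assumptions: it only keeps a footer guard, tracks live blocks in a std::unordered_map, and the names TrackedAlloc, TrackedFree, ValidateAllBlocks and kCheckInterval are hypothetical (only ValidateUsedBlock comes from the text above). std::abort stands in for whatever you use to break into the debugger.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <unordered_map>

constexpr unsigned char kTailGuard = 0xB2;    // hypothetical footer magic value
constexpr std::size_t   kTailGuardSize = 8;   // footer length in bytes
constexpr unsigned      kCheckInterval = 64;  // the 'n' factor: tune to taste

// Live blocks: user pointer -> requested size (the footer sits at pointer + size).
// Note that the map itself allocates through operator new, not TrackedAlloc.
static std::unordered_map<const void*, std::size_t> g_liveBlocks;
static unsigned g_operationCount = 0;

static bool FooterIntact(const void* user, std::size_t size) {
    const unsigned char* footer = static_cast<const unsigned char*>(user) + size;
    for (std::size_t i = 0; i < kTailGuardSize; ++i)
        if (footer[i] != kTailGuard) return false;
    return true;
}

// Public check for a single block: call it from suspicious places.
bool ValidateUsedBlock(const void* user) {
    const auto it = g_liveBlocks.find(user);
    return it != g_liveBlocks.end() && FooterIntact(it->first, it->second);
}

// Full sweep over every recorded footer.
bool ValidateAllBlocks() {
    for (const auto& block : g_liveBlocks)
        if (!FooterIntact(block.first, block.second)) return false;
    return true;
}

// Runs the full sweep on every kCheckInterval-th memory operation.
static void CountMemoryOperation() {
    if (++g_operationCount % kCheckInterval == 0 && !ValidateAllBlocks())
        std::abort();  // stand-in for breaking into the debugger
}

void* TrackedAlloc(std::size_t size) {
    CountMemoryOperation();
    unsigned char* user =
        static_cast<unsigned char*>(std::malloc(size + kTailGuardSize));
    if (user == nullptr) return nullptr;
    std::memset(user + size, kTailGuard, kTailGuardSize);
    g_liveBlocks[user] = size;
    return user;
}

void TrackedFree(void* user) {
    if (user == nullptr) return;
    CountMemoryOperation();
    if (!ValidateUsedBlock(user)) std::abort();  // corrupted or unknown block
    g_liveBlocks.erase(user);
    std::free(user);
}
```

In practice I’d sprinkle calls like ValidateUsedBlock(this) or ValidateUsedBlock(m_buffer) (whatever block is under suspicion) at the end of a constructor or inside Tick(), and move them closer together as the list of suspects shrinks.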

Old comments

therealremi 2008-03-12 08:24:21

I think that errors caused by incorrect use of 3d libraries are more annoying. For me it’s easier to debug a memory problem than for example some small random dots in various places of the level ;) especially when the only OpenGL debugger I’ve heard of is very expensive.