Instrumenting crash dumps

I’ve been planning to write a post about debugging multiplayer games (post-mortem) for a while now, but it keeps getting bigger and bigger, so instead of waiting until I can get enough time to finish it, I thought it’d be easier to share some quick’n’easy tricks first.

I’d like to show a simple way of “instrumenting” a crash dump so that it gives us more info about the crime scene. Short side note first: I’m talking here about crash dumps coming from the public, and usually about extremely rare cases, too. If this were something we could repro in-house, we wouldn’t be having this conversation. Let’s assume we’ve received a crash from the following piece of code (it’s actually very close to a real-life scenario I encountered):

struct SType
{
    SType() : parentType(NULL) {}
    SType* parentType;
};
struct SObj
{
    SObj(SType* t) : type(t) {}
    SType* type;

    bool IsA(const SType* t) const;
};
bool SObj::IsA(const SType* t) const
{
    const SType* iter = type;
    while(iter)
    {
        if(iter == t)
        {
            return true;
        }
        // Crashes in the line below. Access Violation when reading 'iter->parentType'
        iter = iter->parentType;
    }
    return false;
}

Our crash dump points to the ‘iter = iter->parentType’ line, so obviously something’s wrong with the ‘iter’ variable. Corresponding assembly code (just the while loop):

003A2172 3B 44 24 04    cmp   eax,dword ptr [esp+4]
003A2176 74 0B          je    SObj::IsA+15h (3A2183h)
003A2178 8B 00          mov   eax,dword ptr [eax]
003A217A 85 C0          test  eax,eax
003A217C 75 F4          jne   SObj::IsA+4 (3A2172h)

The crash happens in the mov eax, dword ptr [eax] line. Sadly, given only this dump, it’s hard to form any solid theories. We don’t even know which iteration this is, so we can’t tell whether it’s the object itself that’s corrupted (so ‘iter’ came straight from obj->type) or the type chain. We suspect something wrote over either the object instance or the type definition, but most of the time we can’t see what’s in the memory corresponding to either of these objects (I’ve tried inspecting the memory associated with ECX, but there was nothing interesting there; it’s not included in the dump). Yes, you can try grabbing full memory dumps, but they’re usually too huge to be practical (good luck having people send you gigs of data). Normal crash dumps only contain very limited information: registers, stack, call stacks and so on.

Wait a moment, did I say “stack”? What if we instrument our function, so that it tries to store crucial information where we can find it? For the sake of this example let’s assume our object is corrupted by the following code:

#include <cstring> // memcpy

SType st;
SObj obj(&st);

static void Corrupt()
{
    const char* str = "Hello world!";
    // Deliberately stomps over obj (and a few bytes past it) with string data
    memcpy(&obj, str, 10);
}

Here’s a temporarily modified version of the IsA method:

bool SObj::IsA(const SType* t) const
{
    volatile unsigned int stackInfo[4];
    const volatile unsigned int* fmem = reinterpret_cast<const volatile unsigned int*>(this);
    for(size_t i = 0; i < 4; ++i)
    {
        stackInfo[i] = fmem[i];
    }
    // Everything else stays the same
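If the same instrumentation ends up in more than one function, the copy loop can be wrapped in a small helper. What follows is only a minimal sketch and not part of the original code: SnapshotToStack is a hypothetical name, and it makes the same assumption as the loop above, namely that reading N words starting at the source address is safe in your build. The array itself is still declared by the caller, so it lives in the stack frame of the function being instrumented.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (illustration only): copies the first N 32-bit words
// found at 'src' into a caller-owned volatile array. Because the array is
// volatile, the stores cannot be optimized away, so the data stays on the
// caller's stack where a crash dump can pick it up.
// Assumption: N * 4 bytes starting at 'src' are readable.
template <std::size_t N>
inline void SnapshotToStack(volatile std::uint32_t (&dst)[N], const void* src)
{
    std::uint32_t tmp[N];
    std::memcpy(tmp, src, sizeof(tmp)); // sidesteps alignment/aliasing concerns
    for (std::size_t i = 0; i < N; ++i)
    {
        dst[i] = tmp[i];
    }
}
```

Inside IsA this would boil down to declaring "volatile std::uint32_t stackInfo[4];" and calling "SnapshotToStack(stackInfo, this);" at the top of the function.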

Remember, this is temporary diagnostic code; it doesn’t need to be pretty. The idea is to store the data associated with the “this” pointer on the stack, so that we can take a look at it later. Yes, it comes with a slight performance hit, but it’s less severe than most alternatives (e.g. logging). We deploy a new build and wait for fresh dumps. Finally, we can answer some of our questions:

(Immediate Window)
> stackInfo
0x012afcc4
[0x0]: 0x6c6c6548
[0x1]: 0x6f77206f
[0x2]: 0x00ee6c72
[0x3]: 0x00000000
> eax
0x6c6c6548

At this point we know it’s actually the memory associated with the object itself that’s been written over, and that we crashed during the first iteration (EAX == first 4 bytes of the object memory == obj.type). Let’s see if we can get more info about the data that’s in memory right now:

> (const char*)stackInfo
0x012afcc4 "Hello worl??"

A-ha! We’re being written over by a familiar-looking string. Obviously, it’s a contrived example and in real life it’s rarely that easy, but in my experience looking at “corrupted” memory can often give you valuable hints (is it a string, or maybe some common floating-point bit pattern like 0x3f800000, which is 1.0f, etc.).
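When the debugger isn’t at hand (say, you only have the raw words pasted into a bug report), the same “is it a string, is it a float” check is easy to script. The snippet below is a rough sketch under a few assumptions: WordsToAscii and WordToFloat are hypothetical helper names, the words are interpreted in host byte order, and float is assumed to be a 32-bit IEEE-754 type.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: renders raw dump words as text, byte by byte in
// memory order, replacing non-printable bytes with '.'.
inline std::string WordsToAscii(const std::uint32_t* words, std::size_t count)
{
    std::string out;
    for (std::size_t i = 0; i < count; ++i)
    {
        unsigned char bytes[sizeof(std::uint32_t)];
        std::memcpy(bytes, &words[i], sizeof(bytes));
        for (unsigned char b : bytes)
        {
            out += std::isprint(b) ? static_cast<char>(b) : '.';
        }
    }
    return out;
}

// Hypothetical helper: reinterprets one dump word as a single-precision float.
inline float WordToFloat(std::uint32_t word)
{
    float f;
    static_assert(sizeof(f) == sizeof(word), "assumes a 32-bit float");
    std::memcpy(&f, &word, sizeof(f));
    return f;
}
```

On a little-endian machine, feeding the four stackInfo words from the dump above to WordsToAscii gives "Hello worl......", and WordToFloat(0x3f800000) gives 1.0f.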

Tricks like this are most valuable in scenarios where we have the luxury of regular, frequent deployments (so that we can push the instrumented build and then the fix) and, sadly, are less helpful for more traditional, boxed products. Even then, it’s good to remember that you can stuff some of your crucial info (global state) in the stack space of your main loop as well. This can be especially helpful on consoles, where dumps are often all you get (no log files). In most cases you can get 90% of what you need from a raw dump, but having a way to get the extra few percent when needed can be priceless. It has saved my sanity many times in the past.
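To make the main-loop idea a bit more concrete, here is one possible shape for it. Everything in this sketch is made up for illustration: the SFrameBreadcrumbs struct, its fields, the magic values and UpdateBreadcrumbs are all hypothetical, and it assumes your dumps include the stack of the thread running the main loop. The magic markers are there only to make the block easy to spot when scanning raw stack memory.

```cpp
#include <cstdint>

// Hypothetical block of global state kept in the main loop's stack frame.
// The fields are placeholders; store whatever is crucial for your game.
struct SFrameBreadcrumbs
{
    std::uint32_t magicBegin;   // marker to search for in raw stack memory
    std::uint32_t frameNumber;
    std::uint32_t gameState;
    std::uint32_t numPlayers;
    std::uint32_t magicEnd;     // closing marker, helps spot partial overwrites
};

// Arbitrary, easy-to-recognize values (assumption, pick your own).
const std::uint32_t kBreadcrumbsBegin = 0xC0DEF00Du;
const std::uint32_t kBreadcrumbsEnd   = 0xF00DC0DEu;

// Refreshes the breadcrumbs; volatile keeps the stores from being optimized out.
inline void UpdateBreadcrumbs(volatile SFrameBreadcrumbs& bc,
                              std::uint32_t frameNumber,
                              std::uint32_t gameState,
                              std::uint32_t numPlayers)
{
    bc.magicBegin  = kBreadcrumbsBegin;
    bc.frameNumber = frameNumber;
    bc.gameState   = gameState;
    bc.numPlayers  = numPlayers;
    bc.magicEnd    = kBreadcrumbsEnd;
}
```

The main loop would declare a "volatile SFrameBreadcrumbs" local and call UpdateBreadcrumbs once per frame before ticking the game, so a dump taken anywhere deeper in the call stack still carries the latest values.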
