Instrumenting crash dumps

I’ve been planning to write a post about debugging multiplayer games (post-mortem) for a while now, but it keeps getting bigger and bigger, so instead of waiting until I can get enough time to finish it, I thought it’d be easier to share some quick’n’easy tricks first.

I’d like to show a simple way of “instrumenting” a crash dump so that it gives us more info about the crime scene. Let’s assume we’ve received a crash from the following piece of code (it’s actually very close to a real-life scenario I encountered). Short side note first: I’m talking about crash dumps coming from the public here, and usually about extremely rare cases, too. If this were something we could repro in-house, we wouldn’t be having this conversation.

 1 struct SType
 2 {
 3     SType() : parentType(NULL) {}
 4     SType* parentType;
 5 };
 6 struct SObj
 7 {
 8     SObj(SType* t) : type(t) {}
 9     SType* type;
11     bool IsA(const SType* t) const;
12 };
13 bool SObj::IsA(const SType* t) const
14 {
15     const SType* iter = type;
16     while(iter)
17     {
18         if(iter == t)
19         {
20             return true;
21         }
22         // Crashes in the line below. Access Violation when reading 'iter->parentType'
23         iter = iter->parentType;
24     }
25     return(false);
26 }

Our crash dump points to line 23, so obviously something’s wrong with the ‘iter’ variable. The corresponding assembly code (just the while loop):

003A2172 3B 44 24 04  cmp  eax,dword ptr [esp+4]
003A2176 74 0B        je   SObj::IsA+15h (3A2183h)
003A2178 8B 00        mov  eax,dword ptr [eax]
003A217A 85 C0        test eax,eax
003A217C 75 F4        jne  SObj::IsA+4 (3A2172h)

The crash happens at the mov eax, dword ptr [eax] instruction. Sadly, given only this dump, it’s hard to form any solid theories. We don’t even know which iteration this is, so we can’t tell whether it’s the object itself that’s corrupted (so we crashed accessing obj->type) or the type chain. We suspect something wrote over either the object instance or a type definition, but most of the time we can’t see what’s in the memory belonging to either of them (I tried inspecting the memory associated with ECX, but there was nothing interesting there; it wasn’t included in the dump). Yes, you can try grabbing full memory dumps, but they’re usually too huge to be practical (good luck having people send you gigs of data). Normal crash dumps contain only very limited information: registers, stack, call stacks and so on. Wait a moment, did I say “stack”? What if we instrument our function so that it stores crucial information where we can find it? For the sake of this example, let’s assume our object is corrupted by the following code:

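#include <cstring> // std::memcpy

SType st;
SObj obj(&st);

static void Corrupt()
{
    const char* str = "Hello world!";
    // Deliberately overwrite obj (and the bytes right after it)
    // with the first 10 characters of the string.
    std::memcpy(&obj, str, 10);
}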

Here’s a temporarily modified version of the IsA method:

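bool SObj::IsA(const SType* t) const
{
    // Diagnostic only: copy the first 16 bytes found at 'this' into a stack array.
    // 'volatile' keeps the compiler from optimizing the otherwise unused copy away.
    volatile unsigned int stackInfo[4];
    const volatile unsigned int* fmem = reinterpret_cast<const volatile unsigned int*>(this);
    for(size_t i = 0; i < 4; ++i)
    {
        stackInfo[i] = fmem[i];
    }
    // Everything else stays the same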
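
For reference, here’s a more defensive variant of the same idea, written out in full. Treat it as a sketch under a few assumptions rather than the code that actually shipped: CopyToVolatile and the depth counter are hypothetical additions of mine, and the snapshot is limited to sizeof(SObj) bytes so that it never reads past the object (the version above intentionally grabs 16 bytes to see the neighbourhood as well).

```cpp
#include <cassert>
#include <cstddef>

struct SType
{
    SType() : parentType(nullptr) {}
    SType* parentType;
};

struct SObj
{
    explicit SObj(SType* t) : type(t) {}
    SType* type;
    bool IsA(const SType* t) const;
};

// Hypothetical helper (not part of the original code): byte-wise copy into a
// volatile buffer. Volatile stores cannot be optimized away, and reading any
// object through 'unsigned char' is always allowed.
inline void CopyToVolatile(volatile unsigned char* dst, const void* src, std::size_t size)
{
    assert(dst != nullptr && src != nullptr);
    const unsigned char* bytes = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < size; ++i)
    {
        dst[i] = bytes[i];
    }
}

bool SObj::IsA(const SType* t) const
{
    // Raw bytes of *this, kept in this stack frame so the dump captures them.
    volatile unsigned char stackInfo[sizeof(SObj)];
    CopyToVolatile(stackInfo, this, sizeof(SObj));

    // Hypothetical extra: number of parent links followed successfully so far.
    volatile unsigned int depth = 0;

    const SType* iter = type;
    while (iter)
    {
        if (iter == t)
        {
            return true;
        }
        iter = iter->parentType; // the crash site in the original dump
        depth = depth + 1;
    }
    return false;
}
```

If the crash hits while reading iter->parentType, depth tells us directly which iteration we were in (0 means the pointer stored in the object itself was already bad), and stackInfo holds whatever the object contained on entry.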

Remember, this is temporary diagnostic code; it doesn’t need to be pretty. The idea is to store the data behind the “this” pointer on the stack, so that we can take a look at it later. Yes, it comes with a slight performance hit, but it’s less severe than most alternatives (e.g. logging). We deploy a new build and wait for fresh dumps. Finally, we can answer some of our questions:

(Immediate Window)
> stackInfo
[0x0]: 0x6c6c6548
[0x1]: 0x6f77206f
[0x2]: 0x00ee6c72
[0x3]: 0x00000000
> eax
0x6c6c6548

At this point we know it’s actually the memory of the object itself that’s been overwritten, and that we crashed during the first iteration (EAX == first 4 bytes of the object memory == obj.type). Let’s see if we can get more info about the data that’s in memory right now:

> (const char*)stackInfo
0x012afcc4 "Hello worl??"

A-ha! We’re being overwritten by a familiar-looking string. Obviously, it’s a contrived example and in real life it’s rarely that easy, but in my experience looking at “corrupted” memory can often give you valuable hints (is it a string, or maybe some common floating-point bit pattern like 0x3f800000, which is 1.0f, etc.).
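
If you end up staring at dwords like these often, a tiny decoder helps. The following is just a sketch: WordsAsAscii and WordAsFloat are hypothetical helper names, and it assumes the dwords were stored low byte first (as on x86) and that float is a 32-bit IEEE 754 type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: treats every dword as four bytes stored low byte first
// and renders them as text. Non-printable bytes are shown as '.'.
inline std::string WordsAsAscii(const std::uint32_t* words, std::size_t count)
{
    assert(words != nullptr || count == 0);
    std::string out;
    for (std::size_t w = 0; w < count; ++w)
    {
        for (int b = 0; b < 4; ++b)
        {
            const unsigned int c = (words[w] >> (8 * b)) & 0xFFu;
            out += (c >= 0x20u && c < 0x7Fu) ? static_cast<char>(c) : '.';
        }
    }
    return out;
}

// Hypothetical helper: reinterprets the bits of a dword as a 32-bit float.
inline float WordAsFloat(std::uint32_t word)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    float f;
    std::memcpy(&f, &word, sizeof(f));
    return f;
}
```

Fed with the first three stackInfo values from the dump (0x6c6c6548, 0x6f77206f, 0x00ee6c72), WordsAsAscii gives "Hello worl..", and WordAsFloat(0x3f800000) gives 1.0f.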

Tricks like this are most valuable in scenarios where we have the luxury of regular, frequent deployments (so that we can push an instrumented build and then the fix) and, sadly, are less helpful for more traditional, boxed products. Even then, it’s good to remember that you can stuff some of your crucial info (global state) into the stack space of your main loop as well. This can be especially helpful on consoles, where dumps are often all you get (no log files). In most cases you can get 90% of what you need from a raw dump, but having a way to get the extra few percent when needed can be priceless. It has saved my sanity many times in the past.
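
Here’s a rough sketch of the “global state in the main loop’s stack frame” idea. Everything in it is made up for illustration: GameState, StackBreadcrumbs, TickGame, RunMainLoop and the magic marker value are all hypothetical, and it assumes your dumps include the stack of the thread running the main loop.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical global state; substitute whatever matters in your game.
struct GameState
{
    std::uint32_t frameNumber;
    std::uint32_t levelId;
    std::uint32_t playerCount;
};

// Snapshot living in the main loop's stack frame. The magic value is an
// arbitrary marker that makes the block easy to spot in raw stack memory.
struct StackBreadcrumbs
{
    std::uint32_t magic;
    std::uint32_t frameNumber;
    std::uint32_t levelId;
    std::uint32_t playerCount;
};

const std::uint32_t kBreadcrumbMagic = 0xC0DEF00Du;

// Copies the interesting bits of the global state into the snapshot.
inline void UpdateBreadcrumbs(volatile StackBreadcrumbs& crumbs, const GameState& state)
{
    crumbs.magic = kBreadcrumbMagic;
    crumbs.frameNumber = state.frameNumber;
    crumbs.levelId = state.levelId;
    crumbs.playerCount = state.playerCount;
}

// Stand-in for the real per-frame update.
inline void TickGame(GameState& state)
{
    ++state.frameNumber;
}

inline void RunMainLoop(GameState& state, std::uint32_t frameCount)
{
    // 'volatile' forces the stores to happen even though nothing reads them.
    volatile StackBreadcrumbs crumbs = {0, 0, 0, 0};
    for (std::uint32_t i = 0; i < frameCount; ++i)
    {
        UpdateBreadcrumbs(crumbs, state);
        assert(crumbs.magic == kBreadcrumbMagic); // sanity check, debug builds only
        TickGame(state); // a crash anywhere below here still has 'crumbs' on the stack
    }
}
```

Since every function called from the loop sits above this frame on the same stack, the snapshot should still be there when something deeper crashes, and the marker lets you find it even without symbols.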
