Whac-A-Mole

I’ve been debugging a rare memory corruption bug recently and - as usual - it turned out to be an exercise in frustration. This time only part of that was because of the bug itself; the rest was because the methods I chose were not very helpful (in my defense, it’s been a while, so I was a little bit rusty).

The bug itself was fairly boring - sometimes, when leaving a certain UI screen, the game would crash. It was usually in one of a few places - either our UI drawing/updating code or some memory manager function. All traces led to memory corruption. Fortunately for me, it was quite consistent: the UI crash was almost always caused by the exact same element getting corrupted. To be more precise, the UI code keeps an array of pointers to elements to tick/draw. One of the pointers (usually index 2 or 3) was getting modified. The object itself was still there, it’s just that the pointer was now pointing to some other address (not far from the object, too). I’ll describe the process briefly; hopefully it’ll save someone from going down a dead-end route and wasting time.

I knew the array was OK shortly after initialization, so it was getting corrupted at some point before shutting down the movie. My first thought was to use VirtualAlloc+VirtualProtect to protect the region (we were not writing to the array again), but I dismissed it quickly. This would give us a completely different address range than the one we get from our mem manager, so whoever was corrupting the memory would probably now just stomp over something else. I still tried it, but - as expected - came up short (it was not an easy repro, but I could usually get it within 5-10 minutes of a constant open/close loop).
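For reference, a minimal sketch of the page-protection idea: allocate the array on its own pages, fill it, then mark the pages read-only so any later write faults immediately. ProtectedBlock, AllocProtectable, MakeReadOnly and FreeProtectable are hypothetical names, not the actual code used here, and the mmap/mprotect branch is only an assumption added so the sketch also builds outside Windows.

```cpp
#include <cassert>
#include <cstddef>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#include <unistd.h>
#endif

// Hypothetical helper type: a block made of whole pages, so that its
// protection can be changed without affecting unrelated data.
struct ProtectedBlock {
    void*       ptr  = nullptr;
    std::size_t size = 0;  // rounded up to a multiple of the page size
};

inline std::size_t PageSize() {
#ifdef _WIN32
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwPageSize;
#else
    return static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
#endif
}

// Allocates read/write pages directly from the OS. Note that this bypasses
// the regular allocator, so the addresses differ from a normal run.
inline ProtectedBlock AllocProtectable(std::size_t bytes) {
    assert(bytes > 0);
    const std::size_t page = PageSize();
    ProtectedBlock b;
    b.size = (bytes + page - 1) / page * page;
#ifdef _WIN32
    b.ptr = VirtualAlloc(nullptr, b.size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
#else
    void* p = mmap(nullptr, b.size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    b.ptr = (p == MAP_FAILED) ? nullptr : p;
#endif
    return b;
}

// Marks the block read-only; from now on a write raises an access violation
// (or SIGSEGV) at the offending instruction. Returns true on success.
inline bool MakeReadOnly(const ProtectedBlock& b) {
#ifdef _WIN32
    DWORD oldProtect = 0;
    return VirtualProtect(b.ptr, b.size, PAGE_READONLY, &oldProtect) != 0;
#else
    return mprotect(b.ptr, b.size, PROT_READ) == 0;
#endif
}

inline void FreeProtectable(ProtectedBlock& b) {
#ifdef _WIN32
    VirtualFree(b.ptr, 0, MEM_RELEASE);
#else
    munmap(b.ptr, b.size);
#endif
    b = ProtectedBlock{};
}
```

The catch is the one described above: the block no longer lives where the regular allocator would have put it, so a stray write aimed at the old address range can land somewhere else entirely.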

My next idea was to use debug registers. They are my go-to tool when trying to track small, localized corruptions like this. I quickly hacked something together and, to my surprise, the game eventually crashed but my single-step exception was never thrown! I started digging and, to my dismay, quickly confirmed that debug registers are pretty much useless in a multi-threaded environment. You could kind of expect that, seeing as we use the thread context to control them, but having them work for a single thread only seemed so bizarre that I went with it anyway. It seems I’m not the only one who got that impression. Maybe they used to work differently in the past, but as of VS2012, if you enable them in thread A and thread B comes along writing to your protected address - it’ll go through without a single peep. I tried a half-hearted approach of sprinkling the same code over our other threads, but I had a feeling it was some thread that was being spawned later/not by us that was triggering the crash. Further tests seemed to confirm that theory.
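To make the per-thread limitation concrete, here is a rough sketch of arming a hardware write watchpoint through the thread context. MakeDr7Bits and WatchWritesOnThisThread are hypothetical names, not the code from this investigation; the bit layout follows the x86 DR7 register (local-enable bits at 0/2/4/6, R/W and LEN fields starting at bit 16). The Windows part modifies the context of the calling thread, which tends to work in practice for debug registers but is best treated as an assumption.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#endif

// DR7 R/W field encoding: 1 = break on write, 3 = break on read or write.
enum class WatchKind : std::uint64_t { Write = 1, ReadWrite = 3 };

// Builds the DR7 bits for one of the four hardware breakpoint slots.
inline std::uint64_t MakeDr7Bits(unsigned slot, unsigned lengthBytes, WatchKind kind) {
    assert(slot < 4);
    std::uint64_t len = 0;  // DR7 LEN field encoding
    switch (lengthBytes) {
        case 1: len = 0; break;
        case 2: len = 1; break;
        case 4: len = 3; break;
        case 8: len = 2; break;  // 8-byte watches need a 64-bit target
        default: assert(false && "unsupported watch length"); break;
    }
    const std::uint64_t enable = 1ull << (slot * 2);  // local-enable bit L<slot>
    const std::uint64_t rw     = static_cast<std::uint64_t>(kind) << (16 + slot * 4);
    const std::uint64_t ln     = len << (18 + slot * 4);
    return enable | rw | ln;
}

#ifdef _WIN32
// Arms slot 0 for the CALLING thread only. Writes made by any other thread
// to the same address will not trigger the single-step exception.
inline bool WatchWritesOnThisThread(const void* address, unsigned lengthBytes) {
    CONTEXT ctx = {};
    ctx.ContextFlags = CONTEXT_DEBUG_REGISTERS;
    HANDLE thread = GetCurrentThread();
    if (!GetThreadContext(thread, &ctx)) return false;
    ctx.Dr0 = reinterpret_cast<DWORD_PTR>(address);
    // Clear the old slot 0 settings (bits 0-1 and 16-19), then apply ours.
    ctx.Dr7 = (ctx.Dr7 & ~static_cast<DWORD_PTR>(0xF0003)) |
              static_cast<DWORD_PTR>(MakeDr7Bits(0, lengthBytes, WatchKind::Write));
    return SetThreadContext(thread, &ctx) != 0;
}
#endif
```

Since the registers live in each thread's context, covering every thread means repeating the SetThreadContext step for each of them, including threads that are created later.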

This basically means we’re down to the good old brute-force method - debugger data breakpoints. They do use debug registers, but the debugger will actually take a snapshot of all running threads and set debug registers for all of them (plus any threads that are spawned afterwards). This is something you can maybe do yourself as well, but I was running it on a platform that didn’t have toolhelp32 functionality exposed (...and that still doesn’t solve the problem of threads that are spawned later). Once again, this turned out to be a little bit more cumbersome than it should have been. Back in the good old days of VS2008 you could run macros from a breakpoint (i.e. ‘run when hit’). I actually had a bunch of macros for cases like this (enable/disable next bp etc). At some point Microsoft decided to remove that and only left an option to print a message (shame). I guess you can still do it if you write your own add-in, but that seemed like overkill. I added some code detecting my case, put an ordinary breakpoint there and then kept enabling my data breakpoint by hand. 20 minutes of keyboard mashing later I finally found my culprit. As expected, it wasn’t coming from a thread that had been spawned by us (the callback was ours, though).

The whole thing took me way longer than it should have and in the end it turned out to be something fairly straightforward: a recent modification to an async handler that wasn’t flushed before destroying its parent object, so the parent could have been released/reallocated in the meantime. We were only writing to one byte, but that was enough, obviously. At least I learned a few things on the way (mostly not to rely on debug registers for stuff like that).
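The general shape of this class of bug, and of one possible fix, can be sketched as follows. AsyncQueue and Screen are hypothetical stand-ins, not the actual engine code, and the sketch assumes that "flushing" means dropping the pending callbacks of the dying object (running them before destruction would be the other option).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an async system that runs callbacks later.
class AsyncQueue {
public:
    using Callback = std::function<void()>;

    // Queues a callback on behalf of `owner`.
    void Post(const void* owner, Callback cb) {
        pending_.push_back(Entry{owner, std::move(cb)});
    }

    // Drops every pending callback that was posted on behalf of `owner`.
    void CancelFor(const void* owner) {
        std::vector<Entry> kept;
        for (auto& e : pending_) {
            if (e.owner != owner) kept.push_back(std::move(e));
        }
        pending_.swap(kept);
    }

    // Runs and removes everything that is currently pending.
    void RunAll() {
        std::vector<Entry> batch;
        batch.swap(pending_);
        for (auto& e : batch) e.cb();
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    struct Entry {
        const void* owner;
        Callback    cb;
    };
    std::vector<Entry> pending_;
};

// Hypothetical parent object whose callback writes a single byte into it.
class Screen {
public:
    explicit Screen(AsyncQueue& queue) : queue_(queue) {}

    // Without this cancel, a late callback would write through a dangling
    // `this` into memory that may already belong to somebody else.
    ~Screen() { queue_.CancelFor(this); }

    void RequestLoad() {
        queue_.Post(this, [this] { loaded_ = true; });
    }

    bool loaded() const { return loaded_; }

private:
    AsyncQueue& queue_;
    bool        loaded_ = false;
};
```

With the cancel in the destructor, a Screen destroyed while a request is still in flight leaves nothing behind in the queue, so a later RunAll has no stale pointer to write through.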

Old comments

WheretIB 2015-08-04 07:05:05

> My first thought was to use VirtualAlloc+VirtualProtect to protect the region
My first thought here was ‘why not use a debug breakpoint?’
> My next idea was to use debug registers
And again, first thought - ‘why not use a debug breakpoint?’
> Added some code detecting my case, put an ordinary breakpoint and then kept enabling my data breakpoint by hand.
Why didn’t you use a debug breakpoint condition?
Back on track, for very specific errors there is an even crazier solution that doesn’t change the memory layout:
VirtualProtect on an existing memory region, coupled with a Vectored Exception Handler that tracks the address, enables access to the page, single-steps the process and restores the protection.
This might be a solution to track some hard to find corruptions :)

WheretIB 2015-08-04 07:11:42

In my previous post, when I mentioned ‘Debug breakpoint’, I intended to say ‘Debug data breakpoint’.

Lachlan Stuart 2015-08-04 09:57:04

Microsoft’s “Application Verifier” and “Page Heap” utilities could probably have helped a lot here.
It’s a shame you’re on Windows though. Address Sanitizer on Linux would have found this bug pretty quickly.

admin 2015-08-05 01:53:15

@WheretIB: well, you’d be right in this case. The thing is, it was a very unreliable repro, requiring hundreds of tries. I hacked some of our scripts to give me an automated test, didn’t want to undo it all by having to enable/disable data breakpoints all the time :)
@Lachlan: sadly IME AppVerifier is pretty much unusable on any bigger game, it usually runs out of memory before loading anything.