Who ate my stack
19/Feb 2018
The Old New Thing is one of my favorite blogs. It’s a collection of Windows development anecdotes, but every now and then Raymond will post a gnarly debugging/crash story. I recently found some old notes about a crash I was chasing in a third-party library. They reminded me a little of The Old New Thing, so I decided to try something similar.
The whole thing took place a few years ago. We were using some super early versions of a certain vendor library (no sources), so if it crashed, it was just between you and the assembly. At some point I slightly modified the way we used their API (to better re-use certain primitives and not recreate them so frequently). Suddenly, we started crashing a few levels deep in the library. We triple-checked it was not our fault and started digging (the aforementioned vendor was pretty busy at this point, so they were not super thrilled about bug reports without a proper repro/details).
As mentioned before, the crash would happen (sometimes) 3 call levels deep (i.e. we’d call Cat, Cat would call Lol, Lol would call Bar, and Bar would crash here, deep in the guts of an external library):
mov r13,qword ptr [rbp-90h]
mov r12d,0
jle 000000080029BA21h
mov esi,dword ptr [r13+0] ; ***
The crash was happening in the last line, marked with stars. R13 would be a value around 270-290, definitely not a valid address. Fortunately for us, even in this short snippet it’s easy to see where it comes from: it’s loaded just a few instructions earlier, from RBP-0x90. Sadly, as mentioned, this is 2 function calls later, so in order to see where that slot is first set, we have to go all the way to the top.
The crash was actually fairly rare. I could only reproduce it a few times, so it was mostly a matter of deciphering the code and understanding what could have gone wrong. The top-level API function was one of these ‘get event’ functions that take a table to fill with results and the maximum number of results the table can hold. In our case we only expected 1 event, so the call looked roughly like:
Result results[1];
Cat(&results[0], 1);
Here’s what the Cat prologue looked like:
mov eax,edx
lea rax,[rax+rax*2]
lea rax,[rax*8+0Fh]
and rax,0FFFFFFFFFFFFFFF0h
mov rcx,rsp
sub rcx,rax
mov qword ptr [rbp-88h],rcx
neg rax
lea rax,[rsp+rax+10h]
mov qword ptr [rbp-90h],rax
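For readers who prefer C to assembly, here is a minimal sketch of the size computation in the first four instructions of that prologue. The function name scratch_bytes is made up for illustration, and the 3-qwords-per-result layout is only inferred from the multiply by 3 and by 8, not confirmed by the vendor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the prologue's size computation.
   Assumption: one result takes 3 qwords (24 bytes), as the two lea
   instructions suggest. */
static uint64_t scratch_bytes(uint32_t max_events)
{
    uint64_t rax = max_events;        /* mov eax,edx               */
    rax = rax + rax * 2;              /* lea rax,[rax+rax*2]  (x3) */
    rax = rax * 8 + 0x0F;             /* lea rax,[rax*8+0Fh]       */
    rax &= 0xFFFFFFFFFFFFFFF0ull;     /* and rax,0FFFFFFFFFFFFFFF0h */

    /* The mask guarantees a multiple of 16, i.e. stack alignment. */
    assert(rax % 16 == 0);
    return rax;
}
```

With the single-event call shown earlier, scratch_bytes(1) works out to 24 + 15 = 39, masked down to 32 bytes.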
EDX = second argument = the number of events we can handle. We multiply it by 24 (presumably the size of the result structure), round up to a multiple of 16 bytes, then ‘allocate’ this much space on the stack. With our argument of 1, that comes to 32 bytes. Notice also that we store RAX at RBP-0x90. RAX was a valid address at this point. So far so good, let’s keep going. I followed the code and eventually reached the location that was corrupting our stack, although I was not sure why yet:
mov rax,qword ptr [rdi]
mov qword ptr [r13+rdx*8+8],rax ; rax=274 etc
add rdx, 3
inc r12d
cmp r9d,r15d
jne 0000000800165F00h
Clearly a loop of some kind. RDX = 0 before entering the loop, and R13 = the stack address we calculated before (the buffer sized for number of events * 3 qwords). Now, in most cases R15 was 1, so we’d only loop once, write to ‘good’ memory and bail. Where did R15 come from? Well, we have to rewind one more time, but this was basically just an unknown function call (presumably returning the number of all events in the queue?):
call 00000008003AC270h
mov r15d,eax
In some rare cases, this function would actually return more than 1 in EAX, and it seemed like the caller did not take any precautions. We’d loop 2 (or more, but 2 was enough) times. In the second iteration RDX was 3, which means we’d try to write to R13+(3*8)+8 = R13+32, which happened to be RBP-0x90. A classic out-of-bounds access: we only had 32 bytes allocated, but tried to use more than that. Now that I knew what the problem was (well, roughly), I could just use a bigger array and give the library more breathing space. Obviously, this was just a short-term band-aid, but it stopped our game from crashing and now we had a proper repro, so the bug was properly fixed a few days later.
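To make the offset arithmetic concrete, here is a small C sketch of the store address used by the loop, plus the kind of guard the library appeared to be missing. Both function names are hypothetical, and the clamp is only my guess at what a fix could look like; I never saw the vendor’s actual change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte offset from R13 of the store in a given loop iteration, following
   mov qword ptr [r13+rdx*8+8],rax with RDX advancing by 3 each time. */
static size_t store_offset(uint32_t iteration)
{
    size_t rdx = (size_t)iteration * 3;   /* add rdx,3 per iteration */
    return rdx * 8 + 8;
}

/* Hypothetical guard (an assumption, not the vendor's code): never process
   more events than the caller's table can hold. */
static uint32_t clamp_event_count(uint32_t queued, uint32_t max_events)
{
    uint32_t n = queued < max_events ? queued : max_events;
    assert(n <= max_events);              /* invariant the loop relied on */
    return n;
}
```

store_offset(0) is 8, which stays inside the scratch space, while store_offset(1) is 32, the offset that landed on RBP-0x90. With a clamp like this, a queue holding 2 events and a table sized for 1 would loop only once.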