Whac-A-Mole

August 4, 2015 – 4:45 am

I’ve been debugging a rare memory corruption bug recently and – as usual – it turned out to be an exercise in frustration. This time only part of that was because of the bug itself; the rest was because the methods I chose were not very helpful (in my defense, it’s been a while, so I was a little bit rusty).

The bug itself was fairly boring – sometimes, when leaving a certain UI screen, the game would crash. It was usually in one of a few places – either our UI drawing/updating code or some memory manager function. All traces led to memory corruption. Fortunately for me, it was quite consistent and the UI crash was almost always happening because the exact same element was getting corrupted. To be more precise, the UI code keeps an array of pointers to elements to tick/draw. One of the pointers (usually index 2 or 3) was getting modified. The object itself was still there; it’s just the pointer that was now pointing to some other address (not far from the object, too). I’ll describe the process briefly – hopefully it’ll save someone from going down the same dead-end route and wasting time.

I knew the array was OK shortly after initialization, so it was getting corrupted at some point before shutting down the movie. My first thought was to use VirtualAlloc+VirtualProtect to protect the region (we were not writing to the array again), but I dismissed it quickly. This would give us a completely different address range than when using our mem manager, so whoever was corrupting the memory would probably now just stomp over something else. I still tried that, but as expected – came up short (it was not an easy repro, but I could usually get it within 5-10 minutes of a constant open/close loop).
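
The idea I dismissed, for reference: give the array its own pages via VirtualAlloc and flip them to read-only once initialization is done, so any stray write faults immediately at the culprit. A minimal sketch – sizes and names here are illustrative, not the engine’s actual code:

#include <windows.h>

void** AllocProtectedArray(size_t count)
{
    // Reserve + commit whole pages; the region is page-aligned by definition.
    return (void**)VirtualAlloc(NULL, count * sizeof(void*), MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}

void FreezeArray(void* array, size_t count)
{
    // After this call, any write to the array raises an access violation on the spot.
    DWORD oldProtect = 0;
    VirtualProtect(array, count * sizeof(void*), PAGE_READONLY, &oldProtect);
}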

My next idea was to use debug registers. They are my go-to tool when trying to track small, localized corruptions like this. I quickly hacked something together and to my surprise — the game eventually crashed, but my single-step exception was never thrown! I started digging and to my dismay quickly confirmed that debug registers are pretty much useless in a multi-threaded environment. You could kind of expect that, seeing as we use the thread context to control them, but making them work only from a single thread seemed so bizarre I went with it anyway. It seems I’m not the only one who got that impression. Maybe they used to work differently in the past, but as of VS2012, if you enable them in thread A and thread B comes along and writes to your protected address — it’ll go through without a single peep. I tried a half-hearted approach of sprinkling the same code over our other threads, but I had a feeling it was some thread being spawned later/not by us that was triggering the crash. Further tests seemed to confirm that theory.
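
For reference, this is roughly how debug registers are driven through the thread context. A minimal sketch, not the original code – the helper name and the “4-byte write watch” bit values are mine, and the target has to be another thread opened with THREAD_GET_CONTEXT | THREAD_SET_CONTEXT | THREAD_SUSPEND_RESUME access:

#include <windows.h>

bool ArmWriteBreakpoint(HANDLE targetThread, const void* address)
{
    SuspendThread(targetThread);                // context is only reliable for a suspended thread

    CONTEXT ctx = {};
    ctx.ContextFlags = CONTEXT_DEBUG_REGISTERS;
    if (!GetThreadContext(targetThread, &ctx))
    {
        ResumeThread(targetThread);
        return false;
    }

    ctx.Dr0 = (DWORD_PTR)address;               // slot 0: linear address to watch
    ctx.Dr7 |= 1u;                              // L0 - enable slot 0 for this thread
    ctx.Dr7 |= 1u << 16;                        // R/W0 = 01b - break on data writes
    ctx.Dr7 |= 3u << 18;                        // LEN0 = 11b - watch 4 bytes

    const bool ok = SetThreadContext(targetThread, &ctx) != 0;
    ResumeThread(targetThread);
    return ok;                                  // note: threads spawned later are NOT covered
}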

This basically means we’re down to the good, old brute force method – debugger data breakpoints. They do use debug registers, but the debugger will actually take a snapshot of all running threads and set debug registers for each of them (plus any threads that are spawned afterwards). This is something you can maybe do yourself as well, but I was running on a platform that didn’t have toolhelp32 functionality exposed (..and that still doesn’t solve the problem for threads that are forked later). Once again, this turned out to be a little bit more cumbersome than it should be. Back in the good old days of VS2008 you could run macros from a breakpoint (i.e. ‘run when hit’). I actually had a bunch of macros for cases like this (enable/disable the next bp etc). At some point Microsoft decided to remove that feature and only left an option to print a message (shame). I guess you can still do it if you write your own add-in, but that seemed like overkill. So I added some code detecting my case (a sketch follows below), put an ordinary breakpoint there and then kept enabling my data breakpoint by hand. 20 minutes of keyboard mashing later I finally found my culprit (as expected, it wasn’t coming from a thread spawned by us – the callback was ours, though).
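
For what it’s worth, the detection part boiled down to something along these lines (a sketch, every name here is hypothetical): watch the suspicious array slot, break into the debugger when it changes, then re-arm the data breakpoint by hand.

#include <intrin.h>

static void* s_lastSeenPtr = nullptr;

void CheckWatchedSlot(void* const* elements)
{
    if (s_lastSeenPtr && elements[2] != s_lastSeenPtr)
        __debugbreak();            // slot changed - stop here and re-enable the data breakpoint
    s_lastSeenPtr = elements[2];
}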

The whole thing took me way longer than it should have and in the end it turned out to be something fairly straightforward (a recent modification to an async handler that wasn’t flushed before destroying a parent object, so it could have been released/reallocated in the meantime; we were only writing to one byte, but that was enough, obviously), but at least I learned a few things on the way (mostly not to rely on debug registers for stuff like that).

Know your assembly (part N)

June 3, 2015 – 5:48 am

An entertaining head-scratcher from today. The application suddenly started crashing at launch, seemingly inside steam_api.dll. It’d be easy to blame external code, but as mentioned – it’s a new thing, so most likely related to our changes. To make things more interesting, it’s only the 32-bit build that crashes, the 64-bit one seems fine. High-level code looks like (simplified):

struct tester
{
    virtual bool foo();
    virtual void bar()
    {
    }
    virtual void lol()
    {
        if(!foo())
        {
            printf("Failed\n");
            return;
        }
        bar();
    }
};

Crash occurs when trying to call bar() [the original code was actually crashing ‘inside’ bar, which debugger claimed to be inside aforementioned DLL]:

00199FB3 8B 06            mov         eax,dword ptr [esi]
00199FB5 8B 10            mov         edx,dword ptr [eax]
00199FB7 FF D2            call        edx
00199FB9 84 C0            test        al,al
00199FBB 75 10            jne         tester::lol+1Dh (199FCDh)
[...]
00199FCD 8B 06            mov         eax,dword ptr [esi]
00199FCF 8B 50 04         mov         edx,dword ptr [eax+4] ***
00199FD2 8B CE            mov         ecx,esi
00199FD4 5E               pop         esi
00199FD5 FF E2            jmp         edx

Crash line marked with stars – access violation reading location 0x00000018, EAX=0x14 at this point. That’d suggest something’s wrong with our vtable. How can this be, though – we have just called another virtual method (foo) and it was fine! As you might have guessed, it’s foo itself that’s wreaking havoc. It’s been modified recently and now contains the following code:

static BOOL MyCallback(LPVOID, PENUM_PAGE_FILE_INFORMATION, LPCTSTR);
virtual bool foo()
{
    EnumPageFiles(MyCallback, NULL);
    return true;
}

EnumPageFiles is a system function, so we’ll ignore it for now. Perhaps there’s something in MyCallback that causes problems. Let’s remove everything and try running with an empty callback function. Still crashes. Remove the call to EnumPageFiles altogether – works fine…

Did some more poking and discovered that the vtable itself is actually OK and never modified by our code (data breakpoint at EAX+4 before calling foo). It’s the value of ESI that changes! See the 2 places where we move [ESI] to EAX? ESI differs between these 2 points. It’s like foo doesn’t restore it properly! Let’s keep digging… Deep inside EnumPageFiles there is a push/pop esi pair, so it should be fine, right? The problem is, ESP doesn’t match. When trying to pop, it’s 12 bytes less than it should be, so we’re popping a completely unrelated value (in the original code it happened to point to the middle of some function in the Steam DLL). You can probably guess where this is going. Stack mismatches like that are usually a sign of calling convention issues. Quick inspection of the code that executes our callback confirms it’s the problem:

75C08D83 50               push        eax
75C08D84 8D 45 E0         lea         eax,[ebp-20h]
75C08D87 50               push        eax
75C08D88 FF 75 0C         push        dword ptr [ebp+0Ch]
75C08D8B FF 55 08         call        dword ptr [ebp+8]
...
// Callback body (return true):
003C9F90 B8 01 00 00 00   mov         eax,1
003C9F95 C3               ret

As you can see, the caller pushes 3 arguments to the stack (12 bytes!), but there’s no code that pops them (the callee is expected to clean up). Consulting the documentation confirms our findings – the callback function is supposed to follow the stdcall convention. We’re not done yet, though; adding

static BOOL CALLBACK MyCallback(LPVOID, PENUM_PAGE_FILE_INFORMATION, LPCTSTR)

doesn’t seem to cut it – the compiler complains:
‘EnumPageFilesW’ : cannot convert parameter 1 from ‘BOOL (__stdcall *)(LPVOID,PENUM_PAGE_FILE_INFORMATION,LPCTSTR)’ to ‘PENUM_PAGE_FILE_CALLBACKW’

As it turns out, the PENUM_PAGE_FILE_CALLBACKW type doesn’t actually include __stdcall… Let’s try to force it nonetheless:

PENUM_PAGE_FILE_CALLBACKW cb = (PENUM_PAGE_FILE_CALLBACKW)MyCallback;
EnumPageFiles(cb, NULL);
...
// Callback code:
00D99F90 B8 01 00 00 00   mov         eax,1
00D99F95 C2 0C 00         ret         0Ch

As you can see, MyCallback now cleans everything up properly and – as expected – the code runs without crashing. Not sure where the discrepancy between the system headers and the documentation comes from.

Now, you might wonder – why did the x64 build work fine? Well, we’ve been lucky. The callback function has only 3 arguments and the x64 calling convention passes them all in registers (rcx, rdx, r8), so the stack stays untouched. If it had just 1 more — we’d run into trouble. (Correction: as pointed out by Ofek, this would still be fine, there’s only 1 calling convention on x64 anyway). Interestingly enough, the debug version worked “fine” as well (or at least was hiding the problem more effectively, as it was using more registers, so one of the top functions was saving/restoring ESI too).

C++ 11 final

April 30, 2015 – 5:20 am

I’ve been doing some micro-optimizations recently. One of the things I’ve been trying is eliminating virtual method calls in ‘leaf’ types. Consider the following snippet (simplified):

struct Bar
{
  virtual bool Func1() { return false; }
};
struct Foo : public Bar
{
  virtual bool Func1()
  {
    return true;
  }
  virtual void Func2()
  {
    if (Func1())
      printf("Hello");
  }
};

void DoSomething(Foo& f)
{
  f.Func2();
}

Calling DoSomething will result in 2 virtual method calls – and rightly so, there’s no way to tell whether Func1/Func2 are overridden in some class derived from Foo.

It can be a little bit wasteful, especially if we – the programmers – know for a fact that nothing derives from Foo. Func2 calling Func1 will always, in every single case, call just that – Func1. I used a not-so-sophisticated method to work around that:

if(this->Foo::Func1())
    printf("Hello");

This works, but can be dangerous. Imagine one day some other programmer decides to derive from Foo and provide a new implementation of Func1. He’s in for a nasty surprise (and I’m in for some public shaming). Traditionally, C++ offers a few ways of preventing inheritance, but they’re all fairly ugly (private constructors etc). Fortunately, C++11 introduced a new keyword – final – which does exactly what we want.
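
For reference, here’s the snippet above with both forms spelled out – final on the class forbids deriving from Foo at all, while final on Func1 alone would merely forbid overriding that one method further:

#include <cstdio>

struct Bar
{
  virtual bool Func1() { return false; }
};

struct Foo final : public Bar
{
  virtual bool Func1() final { return true; }
  virtual void Func2()
  {
    if (Func1())
      printf("Hello");
  }
};

// struct Evil : public Foo {}; // error: cannot derive from a 'final' class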

It also got me thinking – does this mean the compiler now has additional knowledge it was lacking before? Couldn’t it use it to employ the same optimizations we’ve just tried to force? Are my changes even necessary? Sadly, as it often happens, the answer is — it depends.

AFAICT, Visual Studio doesn’t care much. Yes, it’ll prevent inheritance, but it doesn’t seem to affect code generation at all. Here’s the assembly for our code fragment (with the final keyword added):

// struct Foo final : public Bar

// DoSomething
000000013FE15460 48 8B 01             mov         rax,qword ptr [rcx]
000000013FE15463 48 FF 60 08          jmp         qword ptr [rax+8] ***

virtual void Func2()
00000001407426D0 48 83 EC 28          sub         rsp,28h
    if(Func1())
00000001407426D4 48 8B 01             mov         rax,qword ptr [rcx]
00000001407426D7 FF 10                call        qword ptr [rax] ***
00000001407426D9 84 C0                test        al,al
00000001407426DB 74 10                je          Foo::Func2+1Dh (01407426EDh)
      printf("Hello");
00000001407426DD 48 8D 0D 04 3B EF 00 lea         rcx,[string "Hello" (01416361E8h)]
00000001407426E4 48 83 C4 28          add         rsp,28h
00000001407426E8 E9 B3 FB 85 00       jmp         printf (0140FA22A0h)
00000001407426ED 48 83 C4 28          add         rsp,28h
00000001407426F1 C3                   ret

As you can see – still 2 vtable accesses (lines marked with stars). Let’s see if GCC/Clang do any better (Compiler Explorer to the rescue). Just look at this beauty (that’s the body of DoSomething):

    movl    $.L.str, %edi
    xorl    %eax, %eax
    jmp    printf                  # TAILCALL
.L.str:
    .asciz    "Hello"

Not only did the compiler de-virtualize both calls, it also inlined them and topped it off with a tail call. Impressive.

Sadly, as mentioned – we can’t rely on these optimizations being employed consistently, but at the very least – final will prevent another programmer from making a mistake he’d later regret.

Instrumenting crash dumps

April 4, 2015 – 2:16 am

I’ve been planning to write a post about debugging multiplayer games (post-mortem) for a while now, but it keeps getting bigger and bigger, so instead of waiting until I can get enough time to finish it, I thought it’d be easier to share some quick’n’easy tricks first.

I’d like to show a simple way of “instrumenting” a crash dump so that it gives us more info about the crime scene. Let’s assume we’ve received a crash from the following piece of code (it’s actually very close to a real-life scenario I encountered). Short side note first: I’m talking about crash dumps coming from the public here, and usually extremely rare cases, too. If this was something you could repro in-house, we wouldn’t be having this conversation.

struct SType
{
    SType() : parentType(NULL) {}
    SType* parentType;
};
struct SObj
{
    SObj(SType* t) : type(t) {}
    SType* type;

    bool IsA(const SType* t) const;
};
bool SObj::IsA(const SType* t) const
{
    const SType* iter = type;
    while(iter)
    {
        if(iter == t)
        {
            return true;
        }
        // Crashes in the line below. Access Violation when reading 'iter->parentType'
        iter = iter->parentType;
    }
    return(false);
}

Our crash dump points to the line marked with the comment above – obviously something’s wrong with the ‘iter‘ variable. Corresponding assembly code (just the while loop):

003A2172 3B 44 24 04      cmp         eax,dword ptr [esp+4]
003A2176 74 0B            je          SObj::IsA+15h (3A2183h)
003A2178 8B 00            mov         eax,dword ptr [eax]
003A217A 85 C0            test        eax,eax
003A217C 75 F4            jne         SObj::IsA+4 (3A2172h)

Crash happens in the mov eax, dword ptr [eax] line. Sadly, given only this dump, it’s hard to form any solid theories. We don’t even know which iteration that is, so we can’t tell if it’s the object itself that’s corrupted (so accessing obj->type) or the type chain. We suspect something wrote over either the object instance or the type definition, but most of the time we can’t see what’s in the memory corresponding to either of these objects (I tried inspecting memory associated with ECX, but nothing interesting there – not included in the dump). Yes, you can try grabbing full memory dumps, but they’re usually too huge to be practical (good luck having people send you gigs of data).

Normal crash dumps only contain very limited information: registers, stack, call stacks and so on. Wait a moment, did I say “stack”? What if we instrument our function so that it tries to store crucial information where we can find it? For the sake of this example let’s assume our object is corrupted by the following code:

SType st;
SObj obj(&st);

static void Corrupt()
{
    const char* str = "Hello world!";
    memcpy(&obj, str, 10);
}

Here’s a temporarily modified version of the IsA method:

bool SObj::IsA(const SType* t) const
{
    volatile unsigned int stackInfo[4];
    const volatile unsigned int* fmem = reinterpret_cast<const volatile unsigned int*>(this);
    for(size_t i = 0; i < 4; ++i)
    {
        stackInfo[i] = fmem[i];
    }
// Everything else stays the same

Remember, this is temporary, diagnostic code – it doesn’t need to be pretty. The idea is to store data associated with the “this” pointer on the stack, so that we can take a look at it later. Yes, it comes with a slight performance hit, but it’s less severe than most alternatives (e.g. logging). We deploy a new build and wait for fresh dumps. Finally, we can answer some of our questions:

(Immediate Window)
> stackInfo
0x012afcc4
[0x0]: 0x6c6c6548
[0x1]: 0x6f77206f
[0x2]: 0x00ee6c72
[0x3]: 0x00000000
> eax
0x6c6c6548

At this point we know it’s actually memory associated with the object itself that’s been written over, and that we crashed during the first iteration (EAX == first 4 bytes of the object memory == obj.type). Let’s see if we can get more info about the data that’s in memory right now:

> (const char*)stackInfo
0x012afcc4 "Hello worlî"

A-ha! We’re being written over by a familiar-looking string. Obviously, it’s a contrived example – in real life it rarely is that easy, but in my experience looking at “corrupted” memory can often give you valuable hints (is it a string, maybe some common floating-point bit pattern like 0x3f800000, etc).

Tricks like this are most valuable in scenarios where we have the luxury of regular, frequent deployments (so that we can push an instrumented build & the fix) and, sadly, are less helpful for more traditional, boxed products. Even then, it’s good to remember that you can stuff some of your crucial info (global state) in the stack space of your main loop as well. This can be especially helpful on consoles, where dumps are often all you get (no log files). In most cases you can get 90% of what you need from a raw dump, but having a way to get the extra few % when needed can be priceless. It has saved my sanity many times in the past.
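
A sketch of that “main loop” variant, with every name made up: copy a handful of globals into a volatile stack buffer once per frame, so any dump taken from inside the frame carries them in its stack memory.

extern unsigned int g_frameIndex;
extern unsigned int g_currentScreenId;
extern bool GameIsRunning();
extern void TickGame();
extern void DrawGame();

void MainLoop()
{
    while (GameIsRunning())
    {
        volatile unsigned int frameInfo[4];
        frameInfo[0] = 0xFADEBEEF;          // marker, easy to spot in raw stack memory
        frameInfo[1] = g_frameIndex;
        frameInfo[2] = g_currentScreenId;
        frameInfo[3] = 0;                   // room for whatever else matters to you
        (void)frameInfo;                    // silence "unused variable" warnings

        TickGame();
        DrawGame();
    }
}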

NaN memories

March 12, 2015 – 6:02 am

A nice thing about Twitter is that a single tweet can bring back so many memories. It all started when Fiora posted this. It reminded me of an old bug I encountered a few years ago when working on a multi-platform (PC/X360/PS3) title. One day we started getting strange bug reports. Apparently, if you jumped down from the roof at a very specific position, the player would start to slide across the map. To make things more interesting, this was only happening on consoles, the PC version was fine. After a few minutes of investigation I narrowed it down to a fragment of code that looked roughly like this (I don’t have access to the original code anymore, so I’m trying to recreate it from memory; it was the player movement module):

float x = y / sfactor;
float vx = fmax(0.0001f, fmin(x, MAX_RESPONSE));

Looks innocent, but it’d break completely for y==0.0. The whole block had been added recently and the ‘sfactor‘ property was exposed to the editor and controlled by designers. As it turned out, it had been set to 0.0 as well. If y == 0.0 and sfactor == 0.0, then x is NaN. fmax and fmin are, as you probably have guessed, floating-point versions of the min/max functions. For PC we were using a naive/reference implementation, i.e.:

float fmax(float a, float b) { return a > b ? a : b; }
float fmin(float a, float b) { return a < b ? a : b; }

Let’s see what happens in our case. fmin(NaN, MAX_RESPONSE) returns MAX_RESPONSE, as any comparison against NaN returns false. It’ll be fed to fmax and since it’s greater than 0.0001f, vx = MAX_RESPONSE.

Things are a little bit more interesting on PPC, though. If you coded for PowerPC, you’re aware of two facts:

  • it hates branches,
  • it has a ‘fsel’ instruction that allows for branchless floating-point code. Basically it ‘selects’ one of two given values based on a third one (depending on whether it’s greater than or equal to zero, or less than zero).

The usual way of implementing fmin & fmax using fsel would be:

float fmax(float a, float b) { return fsel(a - b, a, b); }
float fmin(float a, float b) { return fsel(a - b, b, a); }
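
If you haven’t used it, here’s a plain-C model of fsel’s behaviour – just a sketch, the real thing is a single branchless instruction. Note the NaN case: if ‘cmp’ is NaN, the comparison below is false and the third argument is returned, which matches the description quoted next:

float fsel(float cmp, float ge, float lt)
{
    return (cmp >= 0.0f) ? ge : lt;     // NaN compares false, so 'lt' wins
}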

You can probably spot a problem already. If not, refer to this bit of fsel description (+my notes): “If the value in FRA [a – b in our case] is less than zero or is a NaN, floating point register FRT is set to the contents of floating-point register FRB [3rd argument]”. Think about what happens as our NaN flows through the fmin & fmax functions in this case:

  • fmin(x, MAX_RESPONSE): a – b is still NaN, so ‘a’ (x) is returned,
  • fmax(0.0001f, x): a – b is NaN, b (x) is returned.

In this case, the NaN just goes through both functions and we end up with an invalid vx. It was an unfortunate coincidence that the original author used fmin(x, MAX_RESPONSE) — fmin(MAX_RESPONSE, x) would have been “fine” — but at least it helped us find the actual problem, which was an invalid value of the ‘sfactor‘ property (..and incompatibilities between fmin/fmax on different platforms).

MESIng with cache

December 22, 2014 – 7:12 am

(Please excuse the terrible pun, couldn’t help myself).

As we all know, the CPU cache is a touchy beast – seemingly small modifications to the code can result in major performance changes. I’ve been playing with performance monitoring counters recently (using Agner Fog’s library I mentioned before). I was mostly interested in testing how the cmpxchg instruction behaves under the hood, but wanted to share some other tidbits as well.

Let’s assume we’re working with a simple spinlock. There are a few ways of implementing one, each with slightly different characteristics. Let’s start with the most basic one:

void Lock()
{
    while(_InterlockedExchange(&mLock, 1) == 1)
    {
        [..]// spin
    }
}

Notes:

  • “spin” code could be a whole post in itself and it’ll affect the performance. As mentioned, I was mostly interested in the mechanics of cmpxchg, so I went with a simple exponential backoff (if I wanted to amplify the effects of the chosen ‘lock’ mechanism I could have gone with just spinning on a single ‘pause’ instruction, but that felt too contrived).
  • InterlockedExchange compiles to xchg reg,dword ptr [mem]
  • Not posting the code for Unlock – it’s the same in all experiments and boils down to setting mLock to 0 atomically (see the sketch below)
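
For completeness, a minimal sketch of that Unlock, assuming mLock is a volatile long (the original code isn’t shown here):

void Unlock()
{
    _InterlockedExchange(&mLock, 0);    // atomically write 0 back - release the lock
}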

Let’s now think about what happens if we have a high-contention scenario with many threads (more importantly — many cores) trying to access the same variable and obtain the same lock. I’ll assume you’re familiar with the MESI protocol, so I’ll spare you the details (if not, Wikipedia has a decent write-up actually). The important part here is that as long as the cache line containing mLock is in the Modified or Exclusive state, we can read from it/write to it without having to communicate with other caches (we’ll need to write it back to main memory if it changes, obviously, but that’s another issue). Sadly, with many threads banging on it, that’s quite unlikely, as different caches keep “stealing” the ownership from each other. As mentioned, InterlockedExchange compiles to xchg reg, [mem]. You might be surprised there’s no “lock” prefix, but it doesn’t matter in this particular case — Intel processors will automatically lock the bus “when executing an XCHG instruction that references memory” (see the Intel Architecture Software Developer’s Manual Vol 3 for details). Not only do we lock the bus, we also issue the infamous RFO message (Read For Ownership) in most cases (whenever we don’t own the line exclusively). This will cause all other processors to drop this line (set it to Invalid), so the next time they try to access it, they’ll miss. Modern CPUs try to be smart about it and hide some of the associated overhead with store buffers and invalidate queues, but it still hurts.
Consider the following modification to our lock code:


while(mLock == 1 || _InterlockedExchange(&mLock, 1) == 1)

Before analyzing this change, let’s run a quick benchmark – quad-core CPU, 4 threads, all fighting to access the same variable and increase it (500000 times each).
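
Not the original harness, just a minimal sketch of this kind of benchmark, assuming free-standing Lock/Unlock like the snippets above: N threads each take the lock, bump a shared counter and release it, 500000 times.

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

extern void Lock();
extern void Unlock();

static long long s_counter = 0;
constexpr int kThreads = 4;
constexpr int kIterations = 500000;

int main()
{
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int t = 0; t < kThreads; ++t)
    {
        threads.emplace_back([]
        {
            for (int i = 0; i < kIterations; ++i)
            {
                Lock();
                ++s_counter;            // the contended critical section
                Unlock();
            }
        });
    }
    for (auto& th : threads)
        th.join();

    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("%lld increments in %lld ms\n", s_counter, (long long)ms);
    return 0;
}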

  • v1: ~64ms on average,
  • v2: ~59ms on average.

Not a huge, but a significant difference, and it actually increases with contention. That’s hardly surprising and actually well known – we’ve just implemented a test-and-set lock (v1) and a test-and-test-and-set lock (v2) [and if we want, we can complicate things further with tickets or array locks]. The idea here is that we spin mostly on reads from the local cache, so there’s no need to communicate with other CPUs – we only do it when we think we have a chance of succeeding. Things get a little bit more interesting as the contention decreases. With 2 threads fighting for access, the results are as follows:

  • v1: ~19ms
  • v2: ~23ms

Uh, oh… The lesson here, I guess, is not to apply “one size fits all” solutions to everything. Lots of benchmarks out there tend to focus on super high contention scenarios. They are important, sure, but sometimes they feel a little bit counter-intuitive as well. After all, if we have 4+ threads banging on the same lock, perhaps it’s a good idea to reduce the contention first? Treat the cause, not the symptom. It’s hard to come up with solutions that are clearly superior for all scenarios. There’s actually an interesting discussion on the Linux Kernel mailing list on this very subject (cmpxchg, not xchg, but similar principles apply; in the end they decided to reject TTAS). In the case of ‘light’ contention, our xchg will succeed in the majority of cases, so the extra read actually hurts us more than it helps.

Let’s dig a little bit deeper now and run our tests with the PMC snippets. I added a bunch of performance counters, mostly related to cache activity, and ran the tests again. 4 threads:

TAS (PMC results, 4 threads)

TTAS (PMC results, 4 threads)

(Results for CPUs 0 & 1 were very similar.) As you can see, there’s clearly more cache traffic in the TAS case, even though the instruction count is very similar. I added the following counters:

{161, S_ID3,  INTEL_IVY,    0,   3,     0,   0x24,       0x0C, "L2 RFOs"    },
{162, S_ID3,  INTEL_IVY,    0,   3,     0,   0xF2,       0x0F, "L2Evict" },
{163, S_ID3,  INTEL_IVY,    0,   3,     0,   0x27,       0x02, "RFO.S"},
{164, S_ID3,  INTEL_IVY,    0,   3,     0,   0x26,       0x1, "L2Miss"},

  • L2 RFOs = number of store RFO requests (= L1D misses & prefetches),
  • L2Evict = L2 lines evicted for any reason
  • L2Miss = take a guess

Let’s try with 2 threads next:

TAS (PMC results, 2 threads)

TTAS (PMC results, 2 threads)

As you can see, there are still fewer RFOs, but interestingly — the number of misses is almost the same, and TTAS executes more instructions, obviously.

There’s one more way of implementing our spinlock and that’s the cmpxchg instruction:


while(_InterlockedCompareExchange(&mLock, 1, 0) == 1)

What do you think – is it closer to TAS or TTAS? Your first thought might be TTAS; after all, it’s a very similar idea – we compare against the expected value first, then exchange. There are a few differences, though. For one, _InterlockedCompareExchange compiles to lock cmpxchg, so we lock the bus before reading. Also, it’s 1 fairly complicated instruction, not 2 or more. According to Agner Fog’s tables, lock cmpxchg is 9 uops (compared to 7 for xchg). There are some more interesting (and perhaps surprising) properties, but first some benchmarks (v3). 4 threads:

cmpxchg (PMC results, 4 threads)

It seems to be very close to the xchg instruction. This is what you could expect based on this paper on scalable locks from Intel, but to be honest, I was a little bit surprised at first, especially by the fact that it seems to generate similar cache traffic. As it turns out — the cmpxchg instruction itself is actually quite close to xchg as well (they work differently, but trigger similar mechanisms):

  • cmpxchg implies an RFO in all cases, even if the comparison fails. Some confirmation here (LKML again) and it’s also what shows up in the PMC tests above,
  • another interesting question is — does lock cmpxchg always result in a write? Again, the answer seems to be “yes”. That’s based on Agner’s tables (1 p4 uop; p4 = memory write) and the fact that ops that lock the bus are expected to write to memory. There’s some more information here – for example, if the comparison fails, the destination operand is simply written back as if nothing had happened.

The beauty of cmpxchg is that it does the comparison & swap atomically, so it’s perfect for more complicated scenarios (like MPMC containers, where we need to swap the list head, for example), but our case here is very simple – we just ping-pong between 0 & 1. When trying to obtain the lock using xchg, if it’s already taken, we’ll simply write 1 to it again; it doesn’t break anything, so cmpxchg doesn’t really buy us much. I actually found a patent application for a FASTCMPXCHG instruction (from Intel engineers). The idea is that in some cases the CPU replaces the whole load-compare-store chain with a simple final store (AFAIK it’s not implemented in any hardware).

For some more benchmarks of various memory operations/different CPUs see also this Gist from Ryg.

Rust pathtracer

November 14, 2014 – 4:25 am

Last year I briefly described my adventure with writing a pathtracer in the Go language. This year, I decided to give Rust a try. It’s an almost exact 1:1 port of my Go version, so I’ll spare you the details; without further ado – here’s a short list of observations and comparisons. As previously, please remember this is written from the position of someone who didn’t know anything about the language 2 weeks ago and is still a complete newbie (feel free to point out my mistakes!):

  • Rust is much more “different” from most mainstream languages. It was the first time in years that I had to spend a lot of time Googling and scratching my head just to get a program to compile. Go’s learning curve seemed much more gentle. One reason is that it’s still a very young, evolving language. In many cases the information you find is outdated and relates to some older version (it changes a lot, too), so there’s lots of conflicting data out there. The other is the unorthodox memory management system that takes a while to wrap your head around.
  • As mentioned – it still feels a little bit immature, with the API and language mechanisms changing all the time. I’m running the Windows version, which probably makes me lag behind even more. Suffice to say, it’s quite easy to get a program that crashes when trying to start (and I’m quite sure it’s not the Rust code that causes the issues, it’s the generated binary… it crashes before printing even the first message). It’s always the same code, too, accessing some spinlock-guarded variable, it seems:
    00000000774EE4B4 8B 43 08         mov         eax,dword ptr [rbx+8]
    00000000774EE4B7 A8 01            test        al,1
    00000000774EE4B9 0F 85 B5 47 FD FF jne         00000000774C2C74
    00000000774EE4BF 8B C8            mov         ecx,eax
    00000000774EE4C1 2B CD            sub         ecx,ebp
    00000000774EE4C3 F0 0F B1 4B 08   lock cmpxchg dword ptr [rbx+8],ecx
    00000000774EE4C8 0F 85 9B 47 FD FF jne         00000000774C2C69
    00000000774EE4CE 48 8B 03         mov         rax,qword ptr [rbx]
    00000000774EE4D1 4C 89 AC 24 C0 00 00 00 mov         qword ptr [rsp+0C0h],r13
    00000000774EE4D9 33 ED            xor         ebp,ebp
    00000000774EE4DB 45 33 ED         xor         r13d,r13d
    00000000774EE4DE 48 83 F8 FF      cmp         rax,0FFFFFFFFFFFFFFFFh
    00000000774EE4E2 74 03            je          00000000774EE4E7
    00000000774EE4E4 FF 40 24         inc         dword ptr [rax+24h]  
  • Memory management, which is one of its distinctive features, is both interesting & confusing at first. They seem to go back & forth on optional GC, but the canonical way is to use one of a few types of smart pointers. Memory leaks and other issues are detected at compile time. The Rust compiler in general is pretty good at detecting potential problems – once it builds, it’ll probably run fine. The only runtime error I encountered in my app was an out-of-bounds array access. (It’s a good thing too, as I’ve no idea how to debug my application… I don’t think there’s a decent debugger for Windows.)
  • Unlike Go, it has operator overloading, but the syntax is a little bit confusing to be honest. You don’t use the operator itself when overloading, you have to know what function name it corresponds to. E.g. operator-(Vector, Vector) is:
    impl Sub<Vector, Vector> for Vector
    {
        fn sub(&self, other: &Vector) -> Vector
        {
            Vector { x : self.x - other.x, y : self.y - other.y, z : self.z - other.z }
        }
    }
    

    Not sure what the rationale behind this is, but it’s one more thing to remember.

  • Changing pointer types feels cumbersome at times. For example, let’s imagine we want to change a boxed pointer:
    struct Scene
    {
        camera  : Box<Camera>,
    }
    

    to a reference. We now have to provide a lifetime specifier, so it changes to:

    struct Scene<'r>
    {
        camera  : &'r mut Camera,
    }
    

    …and you also have to modify your impl block (2 specifiers):

    impl<'r> Scene<'r>
    

    It seems a little bit redundant to me – even in C++ it’d be a matter of changing one typedef.

  • [EDIT, forgot about this one initially] Comparing references will actually try to compare the referenced objects by value. If you want to compare the actual memory addresses, you need to do this:
        if object as *const Sphere == light as *const Sphere // compares ptrs
        if object == light // compares objects
    
  • Rust code is almost exactly the same length as Go – around 700 lines
  • Running the tracer in multiple threads took more effort than with Go. By default, Rust’s tasks spawn native threads (I was a little surprised when I opened my app in Process Hacker and noticed I had 100+ threads running); you need to explicitly request “green” tasks. It also involved way more memory hacks. By default, Rust doesn’t allow sharing mutable data (for safety reasons), the recommended way is to use channels for communication. I didn’t really feel like copying parts of the framebuffer was a good idea, so I had to resort to some “unsafe” hacks (I still clone immutable data). I quite like this approach, it’s now obvious what data can be modified by background threads. I’m not convinced I chose the most effective way, though; there’s no thread profiler yet (AFAIK).
  • Performance. Surprisingly, in my tests Rust was quite a bit slower than Go. Even disregarding the task pool/data sharing code, when running in a single thread it takes 1m25s to render a 128×128 image using Go and almost 3 minutes with Rust. With multiple threads, the difference is smaller, but still noticeable (43s vs 70s).
    [EDIT] Embarrassingly enough, it turned out I was testing a version with no optimizations. After compiling with -O, Rust now renders the same picture in 34s instead of 70 (and 1m09s with 1 thread). I’ll leave the old figures, just to show that the debug build still runs at a decent speed.
  • Random stuff I liked:
    • data immutable by default. It makes it immediately obvious when it’s modified. E.g.
      let r = trace(&mut ray, &context.scene, &context.samples, u1, u2, &mut rng);
      Can you tell what’s modified inside the trace function?
    • as mentioned – compiler is very diligent, once the app builds, there’s a good chance it’ll run fine. Error messages are clear & descriptive
    • pattern matching
  • I realize it might seem like I’m mostly praising Go and bashing Rust a little bit here, but that’s not the case. I’ll admit I couldn’t help but think it seemed immature compared to Go, but I realized it’s an unfair comparison. Go has been around for 5 years and I only started using it 12 months ago, so they had lots of time to iron out most of the wrinkles. Rust is a very ambitious project and I definitely hope it gains more popularity, but for the time being Go’s minimalism resonates with me better. I mostly code in C++ at work – a language that offers you 100 ways of shooting yourself in the foot. Working with Go, where there’s usually only one true way (and it involves blunt tools), is very refreshing. Rust sits somewhere in-between for the time being; it’ll be fascinating to see where it ends up.

Obligatory screenshot:

rust_trace

512×512, 256spp, 2x2AA

Source code can be downloaded here.

Hidden danger of the BSF instruction

October 26, 2014 – 5:49 am

Had a very interesting debugging session recently. I’ve been investigating some external crash reports; all I had was a crash dump related to a fairly innocent-looking piece of code. Here’s a minimal snippet demonstrating the problem:

struct Tree
{
    void* items[32];
};

#pragma intrinsic(_BitScanForward)
__declspec(noinline) void* Grab(Tree* t, unsigned int n)
{
    unsigned int j = 0;
    _BitScanForward((unsigned long *)&j, n);
    return t->items[j];
}

Easy enough, right? Seemingly, nothing can go wrong here. We find the first set bit and use it to index our table. ‘n’ is 32-bit, so our index is guaranteed to be in the 0-31 range, so it’s all good. Well, why is it crashing in extremely rare cases then? Let’s take a look at the generated assembly code (x64), maybe it’ll help decipher this madness:

and    DWORD PTR j$[rsp], 0
bsf    eax, edx
; 667  :     return t->items[j];
mov    rax, QWORD PTR [rcx+rax*8]
ret    0

Three measly instructions, nothing extraordinary here, right? Don’t click “read the rest” if you want to give it a try yourself.

Read the rest of this entry »

Z-Machine interpreter in Go

September 23, 2014 – 3:32 am

Recently, I had an inspiring discussion with fellow programmers; we were talking about interesting side projects/programs to quickly “try out” a new programming language or to use as job interview tasks. One that’s been mentioned was coding a Z-machine interpreter capable of playing Zork I. The Z-machine is a virtual machine developed by Joel Berez and Marc Blank, used for numerous Infocom text adventure games, most notably the Zork series. In all honesty, I’m probably a few years too young, so I didn’t get to play Zork when it was big (I did play old Sierra adventures back when you actually had to type commands, though – one of the reasons I started to learn English was Police Quest I. Took me more than 3 months to finish that game). A few weeks later I had a whole weekend to myself and decided to give it a try. As it turned out — it really was a lot of fun. I also gained lots of respect for the Infocom guys, there are some really creative ideas there, especially given the space/memory limitations (the zork1.dat file I found was ~90k). At first I wanted to do it in Rust (a language I wanted to experiment with), but in the end decided to play it safe, limit the number of unknowns and went with Go (my second time). It actually turned out to be a good choice – the basic implementation is ~1500 lines of Go and comes with some nice features for free (like cycling through past commands with the up arrow). It went fairly smoothly too; I stumbled a few times, mostly because of me missing some little detail (like call 0 == return false, or some off-by-1 mistake when indexing properties). The one that took me probably the most time was a subtle bug in the ‘change parent’ routine that’d cause the game to break apart after I had picked something up. Luckily, I found an easy repro case: if I didn’t pick up a water bottle, I could move the rug fine, otherwise it’d complain about the rug not being there. I didn’t want to spend time writing a full-blown debugger (it was a weekend project after all), so I spent some time comparing instruction traces for “good” and “bad” runs, trying to see where they drift apart. Eventually I coded a quick diff application (comparing them in Notepad was too slow) and found what was going on; it was fairly smooth sailing after that.

The good thing is, it’s very easy to start with a basic framework that does nothing but advance the IP accordingly and then keep filling in the gaps, adding implementations for the required opcode types, the opcodes themselves etc. I simply started with NOPs everywhere (with the basic implementation calling panic(“NOP”)) and then kept on implementing until finally seeing the “you are standing in an open field west of a white house” message. Conveniently, getting to this point requires implementing most of the basic functionality; it’s mostly adding opcodes after that (aka the easy stuff).

Big pieces that are still missing are save/restore; other than that it should be fairly complete (it supports version 3 only).

Useful links, if someone would like to give it a try:

A Byte Too Far

July 17, 2014 – 5:19 am

A short cautionary tale from today. I’ve been modifying some code and one of the changes I made was to use the type Lol as a key in a map-like structure (a key-value pair container that uses the < operator for comparisons). The structure itself looked like this:

struct Lol
{
    byte tab[16];
    short cat;
    bool foo;
};

…and here’s the operator<

bool Lol::operator<(const Lol& other) const
{
    return(memcmp(this, &other, sizeof(other)) < 0);
}

The problem was – it seemed like sometimes, in seemingly random cases, we’d end up inserting an instance of Lol into the container even though exactly the same element was already there. In other words (pseudo-code):

container.insert(std::make_pair(lol, cat));
lol2 = lol;
auto it = container.find(lol2);
// In some cases it == container.end()!

As you’ve already guessed (or perhaps spotted immediately) – the problem was caused by operator<. The “original” version of the Lol type didn’t have the ‘foo’ member; it was added quite recently (*cough* by me *cough*). Can you guess why it broke?

Read the rest of this entry »