MESIng with cache

(Please excuse the terrible pun, I couldn’t help myself.) As we all know, the CPU cache is a touchy beast: seemingly small modifications to the code can result in major performance changes. I’ve been playing with performance-monitoring counters recently (using Agner Fog’s library I mentioned before). I was mostly interested in testing how the cmpxchg instruction behaves under the hood, but wanted to share some other tidbits as well. Let’s assume we’re working with a simple spinlock.
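For context, here is a minimal sketch of the kind of spinlock in question (not the post’s exact snippet) - the compare-exchange in lock() is what compiles down to a LOCK CMPXCHG on x86:

    #include <atomic>

    struct Spinlock
    {
        std::atomic<int> state{0};

        void lock()
        {
            int expected = 0;
            // Spin until we manage to swap 0 -> 1; compare_exchange_weak
            // overwrites `expected` on failure, so reset it each iteration.
            while (!state.compare_exchange_weak(expected, 1, std::memory_order_acquire))
                expected = 0;
        }

        void unlock()
        {
            state.store(0, std::memory_order_release);
        }
    };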

Rust pathtracer

Last year I briefly described my adventure with writing a pathtracer in the Go language. This year, I decided to give Rust a try. It’s an almost exact 1:1 port of my Go version, so I’ll spare you the details; without further ado, here’s a short list of observations and comparisons. As previously, please remember this is written from the position of someone who didn’t know anything about the language two weeks ago and is still a complete newbie (feel free to point out my mistakes!)

Hidden danger of the BSF instruction

Had a very interesting debugging session recently. I’ve been investigating some external crash reports; all I had was a crash dump related to a fairly innocent-looking piece of code. Here’s a minimal snippet demonstrating the problem:

    struct Tree
    {
        void* items[32];
    };

    #pragma intrinsic(_BitScanForward)

    __declspec(noinline) void* Grab(Tree* t, unsigned int n)
    {
        unsigned int j = 0;
        _BitScanForward((unsigned long*)&j, n);
        return t->items[j];
    }

Easy enough, right? Seemingly, nothing can go wrong here.
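The excerpt cuts off before the punchline, but the documented trap with _BitScanForward is the zero-mask case: when n == 0 the intrinsic returns 0 and the index is left undefined, so the function above can end up indexing with garbage. A defensive sketch - my wording, not necessarily the post’s fix:

    #include <intrin.h>

    __declspec(noinline) void* GrabSafe(Tree* t, unsigned int n)
    {
        unsigned long j;
        // Only trust the index if the intrinsic reports that a set bit was found.
        if (!_BitScanForward(&j, n))
            return nullptr;
        return t->items[j];
    }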

Z-Machine interpreter in Go

Recently, I had an inspiring discussion with fellow programmers; we were talking about interesting side projects/programs to quickly “try out” a new programming language, or to use as job interview tasks. One that was mentioned was coding a Z-machine interpreter capable of playing Zork I. The Z-machine is a virtual machine developed by Joel Berez and Marc Blank, used for numerous Infocom text adventure games, most notably the Zork series. In all honesty, I’m probably a few years too young, so I didn’t get to play Zork when it was big (I did play old Sierra adventures back when you actually had to type commands, though; one of the reasons I started to learn English was Police Quest I.)

A Byte Too Far

A short cautionary tale from today. I’ve been modifying some code, and one of the changes I made was to use the type Lol as a key in a map-like structure (a key-pair container that uses the < operator for comparisons). The structure itself looked like:

    struct Lol
    {
        byte tab[16];
        short cat;
        bool foo;
    };

…and here’s the operator<:

    bool Lol::operator<(const Lol& other) const
    {
        return (memcmp(this, &other, sizeof(other)) < 0);
    }

The problem was, it seemed like sometimes, in seemingly random cases, we’d try to insert an instance of Lol into a container even though exactly the same element was already there.
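The excerpt stops before the explanation, but a memcmp over the whole struct also compares the compiler-inserted padding bytes, which are not guaranteed to match even when every member does. A member-wise comparison avoids that; a sketch, assuming that is the issue (not necessarily the post’s exact fix):

    #include <cstring>
    #include <tuple>

    bool Lol::operator<(const Lol& other) const
    {
        // Compare the members themselves, so padding between/after
        // tab, cat and foo can no longer influence the ordering.
        const int c = memcmp(tab, other.tab, sizeof(tab));
        if (c != 0)
            return c < 0;
        return std::tie(cat, foo) < std::tie(other.cat, other.foo);
    }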

Going deeper - addendum

There have been some comments on my previous post wondering about C++ compilers and their capabilities. Normally, I’m all for compiler bashing, but in this case I’d probably cut them some slack. It’s easy to optimize when you’re focused on a single piece of code, way more difficult when you have to handle a plethora of cases. On top of that, uops are handled differently on different CPUs; e.g. in my limited tests Haswell seems to care less.

Going deeper

A few weeks ago I encountered a discussion on a Polish gamedev forum – participants were wondering whether it’s faster to access stack or heap memory. I didn’t pay much attention (there should be no significant difference) until someone posted a test case. Out of curiosity, I ran it and to my surprise discovered that it was consistently faster for buffers allocated on the stack. We’re not talking about a few cycles here and there, either; the difference was almost 20% (Ivy Bridge laptop).
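A minimal sketch of the kind of comparison being discussed (my example, not the original forum test case) - time a loop over a stack buffer vs. an identically sized heap buffer:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main()
    {
        constexpr int N = 64 * 1024;
        unsigned char stackBuf[N] = {};
        std::vector<unsigned char> heapBuf(N);

        volatile unsigned int sink = 0;
        auto touch = [&](const unsigned char* p) {
            for (int pass = 0; pass < 1000; ++pass)
                for (int i = 0; i < N; ++i)
                    sink = sink + p[i];
        };
        auto time = [&](const unsigned char* p) {
            auto t0 = std::chrono::high_resolution_clock::now();
            touch(p);
            auto t1 = std::chrono::high_resolution_clock::now();
            return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        };

        std::printf("stack: %lld us\n", (long long)time(stackBuf));
        std::printf("heap : %lld us\n", (long long)time(heapBuf.data()));
    }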

Patching binaries

There may come a time in a game programmer’s life when he has to fix a bug in a library he doesn’t have the source code for. It doesn’t happen often, it might never happen, but it’s good to be prepared. If I remember correctly, I’ve had to do it only twice; one of those was fairly recently. We were getting quite a few crash reports and were assured that a fix in the third-party library was coming, but I decided to see if it was possible to do anything about it in the meantime.

Smartness overload #2

Today’s article is brought to you by a friend of mine. He’s been doing some home experiments and noticed a particular piece of code was running faster in Debug than in Release (using the Visual C++ compiler). He mentioned this to me; I found it intriguing enough to ask him for the snippet and investigated it yesterday. It turned out to be a classic case of the compiler trying to be too smart and shooting itself in the foot.

Delete current instruction macro

I must admit I am not as die-hard a fan of the ProDG debugger as some other coders out there; perhaps I’ve not been using it long enough. One tiny thing I do miss, though, is the ability to replace the assembly instruction under the cursor with a NOP with a single keystroke. Sure, with Visual Studio you can achieve the same result with the memory/immediate window, but it’s much more cumbersome. Today I decided to finally bite the bullet and recreate this little feature with a VB macro:

Go pathtracer

Recently I’ve been experimenting with the Go programming language a little bit. It’s my second approach actually - I gave it a half-hearted try last year, but gave up pretty quickly (I think it was some petty reason, too, probably K&R braces). This time around I actually managed to stick to it a little bit longer and learn a thing or two. I decided my test application would be a simple pathtracer.

Performance-Monitoring Events

When experimenting with some of the more esoteric features of modern CPUs it’s sometimes not immediately obvious whether we’re actually taking advantage of them. Sure, you can compare cycles, but the differences are not always big enough to justify conclusions. Luckily for us, with the Pentium processor Intel introduced a set of performance-monitoring counters. They are model specific (not compatible among different processor families) and allow you to monitor just about every aspect of the CPU pipeline.
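For reference, this is roughly what reading one of those counters looks like once it has been programmed (a sketch for GCC/Clang on x86-64; the counter setup itself normally happens in a kernel-mode driver, such as the one shipped with Agner Fog’s library, and user-mode RDPMC must be enabled or the instruction faults):

    #include <cstdint>

    // Read programmable performance-monitoring counter `idx` via RDPMC.
    static inline uint64_t ReadPmc(uint32_t idx)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(idx));
        return (uint64_t(hi) << 32) | lo;
    }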

Second Reality - source code

Just a quick follow-up to my previous note. As mentioned by Michal, the Future Crew guys decided to celebrate the 20th anniversary of Second Reality in the best way possible - they released the full source code. Obviously, it’s more of a tidbit than anything else, but it’s still interesting to finally see how certain effects were done. Apparently Fabian is already working on a code analysis article, but in the meantime I’ll only mention two things that caught my eye so far:

Demoscene tribute - Second Reality

I know I claimed I would not write about Second Reality here, mostly because everyone knows it, but it’s a special day today… It’s been exactly 20 years since Second Reality was shown for the first time, at Assembly 1993. I saw it a few months later and still remember that day. I was living in Torun at the time and had just started high school. I would often visit the local computer store, just to see what was new.

Forward that store

I know I’ve been bitching about load-hit-store (too) many times before, but it’s been one of the most annoying things we had to deal with on the previous generation of consoles. LHS happens when we try to load data from an address that has recently been written to. X360/PPC CPUs were fairly simple: they could neither try to execute some other instruction in the meantime (no OOE) nor retrieve the data without waiting for it to reach the cache.
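A classic (if simplified) example of the pattern, not taken from the post - the float goes out to memory and is read back as an integer straight away, so an in-order PPC core stalls until the store drains:

    // Reinterpret a float's bits as an integer by bouncing through memory
    // (the old type-punning idiom console code commonly relied on).
    int FloatToIntBits(float f)
    {
        union { float f; int i; } u;
        u.f = f;      // store
        return u.i;   // load from the same address right away -> LHS stall
    }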

cmov fun

If you’ve been coding for current (prev?) gen consoles, you know your optimization guidelines: be nice to your cache, avoid LHS and branches - in general, do not stress the pipeline too much. With the next (current?) generation moving back to x86, things get a little bit more blurry. With out-of-order/speculative execution, register renaming and advanced branch predictors, it’s sometimes easy to shoot yourself in the foot when trying to be smarter than the compiler/CPU.
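A tiny illustration of the trade-off (not the post’s benchmark): a value select that the compiler may lower either to a CMOV or to a compare-and-branch. CMOV avoids mispredictions but adds a data dependency on both inputs, so forcing one form over the other only pays off when you actually know how predictable the condition is:

    int SelectMax(int a, int b)
    {
        // The compiler decides between CMP+CMOV and CMP+Jcc here.
        return (a > b) ? a : b;
    }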

Parallel 101

With the unveiling of the next-gen console specifications it’s clear that multi-threaded code is here to stay (I don’t think anyone expected otherwise). If anything, it’ll be even more common; the new CPUs run at relatively low frequencies (when compared to modern PCs), so we’ll definitely have to go wide to use their potential. Here’s a quick cheat sheet I usually follow when trying to move code to a background thread. Please note: move.
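To make the idea of moving work to a background thread concrete, here is a minimal sketch (my example, not the post’s cheat sheet; all names are made up) - the expensive step runs off the main thread and is joined at a single, explicit sync point:

    #include <future>
    #include <numeric>
    #include <vector>

    static std::vector<int> BuildExpensiveData()
    {
        std::vector<int> v(1 << 20);
        std::iota(v.begin(), v.end(), 0);   // stand-in for real work
        return v;
    }

    static void DoIndependentMainThreadWork() { /* work that doesn't need the data */ }

    int main()
    {
        // Kick the job off on a background thread.
        auto pending = std::async(std::launch::async, BuildExpensiveData);

        DoIndependentMainThreadWork();

        // The only place we wait for the result.
        const std::vector<int> data = pending.get();
        return static_cast<int>(data.size() & 0xff);
    }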

MemTracer 64

It’s been a long time coming, but I finally found time to update the C# side of MemTracer to support 64-bit applications (so 64-bit memory pointers + callstack addresses).

Choices & consequences

Spent a little bit of time tweaking the RDE vector class again. As I already mentioned, the container itself is not terribly fascinating; there are not too many choices here. There’s another battle going on at a lower level, though - it’s interesting to see how an innocent function like size() can be a cause of a slowdown. Typically, a vector class has 3 properties it needs to keep track of: buffer properties (pointer and size),
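The excerpt is cut short here, but for illustration, the two classic layouts being weighed look roughly like this (a sketch, not RDE’s actual code) - with three pointers, size() costs a pointer subtraction (and an implicit division by sizeof(T)), while with a pointer plus two counters it is a plain load, but end() now needs an add:

    #include <cstddef>

    template <typename T>
    struct VecThreePointers
    {
        T* begin_;
        T* end_;
        T* capacity_end_;

        std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }
        T*          end()  const { return end_; }
    };

    template <typename T>
    struct VecPointerPlusSize
    {
        T*          begin_;
        std::size_t size_;
        std::size_t capacity_;

        std::size_t size() const { return size_; }
        T*          end()  const { return begin_ + size_; }
    };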

A world without Lucas Arts

Working in the game industry for more than a few years tends to desensitize one to all the news about mass layoffs & companies going bust. Sadly, it happens so often that we’re slowly becoming used to it. I first found out about Disney shutting down Lucas Arts at work. Sure, I was surprised - it’s big news after all - but I was busy, so didn’t think twice about it… I finished work, came home, read all the updates and then it suddenly hit me.