A Byte Too Far

July 17, 2014 – 5:19 am

A short cautionary tale from today. I’ve been modifying some code, and one of the changes was to use a type called Lol as a key in a map-like structure (a key-value container that uses the < operator for comparisons). The structure itself looked like:

struct Lol
{
    byte tab[16];
    short cat;
    bool foo;
};

…and here’s the operator<

bool Lol::operator<(const Lol& other) const
{
    return(memcmp(this, &other, sizeof(other)) < 0);
}

The problem was that sometimes, in seemingly random cases, we’d try to insert an instance of Lol into a container even though exactly the same element was already there. In other words (pseudocode):

container.insert(std::make_pair(lol, cat));
lol2 = lol;
auto it = container.find(lol2);
// In some cases it == container.end()!

As you’ve already guessed (or perhaps spotted immediately) – the problem was caused by operator<. The “original” version of the Lol type didn’t have the ‘foo’ member; it was added quite recently (*cough*by me*cough*). Can you guess why it broke?
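The culprit is almost certainly padding: with a 16-byte array, a short and a bool, sizeof(Lol) is typically 20, so memcmp also compares an uninitialized padding byte, and two instances with identical members can still compare as different (struct assignment isn’t required to copy padding, either). A member-wise comparison sidesteps this; here’s a minimal sketch (the std::tie approach is my suggested fix, not necessarily the original one):

```cpp
#include <cstring>
#include <tuple>

typedef unsigned char byte;

struct Lol
{
    byte tab[16];
    short cat;
    bool foo;

    bool operator<(const Lol& other) const
    {
        // The array itself has no internal padding, so memcmp is safe for it.
        int c = std::memcmp(tab, other.tab, sizeof(tab));
        if (c != 0)
            return c < 0;
        // Compare the remaining members lexicographically; padding never participates.
        return std::tie(cat, foo) < std::tie(other.cat, other.foo);
    }
};
```

Since only named members take part in the comparison, equal keys now always compare as equivalent, and container.find() behaves as expected.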


Going deeper – addendum

May 4, 2014 – 5:56 am

There have been some comments on my previous post wondering about C++ compilers and their capabilities. Normally I’m all for compiler bashing, but in this case I’d probably cut them some slack. It’s easy to optimize when you’re focused on a single piece of code; it’s much more difficult when you have to handle a plethora of cases. On top of that, uops are handled differently on different CPUs; e.g., in my limited tests Haswell seems to care less. Anyhow, I’d expect compilers to replace INC with ADD x,1 in most cases, but I’d be much less optimistic about SIB byte elimination. MSVC seems a little inconsistent about it (sometimes it uses INC, sometimes ADD; I’m not sure what determines that). Out of curiosity, I decided to use Matt Godbolt’s excellent Compiler Explorer to see how different compilers from the GCC family behave. Results:

  • GCC 4.9.0 eliminates both the SIB byte & uses ADD instead of INC (it generates pretty much identical code to my final, hand-optimized version)
  • Clang 3.2 eliminates the SIB byte, but uses INC [mem]
  • g++ 4.4 – same as Clang

I didn’t really test the more exotic versions, follow the link if you’re interested.

Going deeper

April 27, 2014 – 5:57 am

A few weeks ago I encountered a discussion on a Polish gamedev forum – participants were wondering whether it’s faster to access stack or heap memory. I didn’t pay much attention (there should be no significant difference) until someone posted a test case. Out of curiosity I ran it, and to my surprise discovered it was consistently faster for buffers allocated on the stack. We’re not talking about a few cycles here and there, either; the difference was almost 20% (Ivy Bridge laptop). The test itself is a little contrived, but it turned out to be an interesting study. The code is exactly the same in both cases; we simply increment all elements of an array:

for(int i = 0; i < iterations; i++)
{
    for(int j = 0; j < arraySize; j++)
    {
        stackArray[j]++;
    }
}
[...]
char stackArray[arraySize];
char* heapArray = new char[arraySize];

I refused to believe it was actually the memory access making the difference, so I decided it was time to dig deeper. Let’s take a look at the innermost loop for the two cases; as it turns out, it’s slightly different:

// Stack:
00D1100D BB 01 00 00 00   mov         ebx,1
[...]
00F11014 00 98 40 30 F1 00 add         byte ptr stackArray (0F13040h)[eax],bl
00F1101A 03 C3            add         eax,ebx
00F1101C 3B 05 3C 21 F1 00 cmp         eax,dword ptr [arraySize (0F1213Ch)]

// Heap:
00D11036 8B 0D 48 30 E1 00 mov         ecx,dword ptr [heapArray (0E13048h)]
[...]
00D1103E FE 04 01         inc         byte ptr [ecx+eax]
00D11041 40               inc         eax
00D11042 3B 05 3C 21 D1 00 cmp         eax,dword ptr [arraySize (0D1213Ch)]

As you can see, it’s pretty close; the main difference seems to be using INC instead of ADD. Back in the old days comparing performance was easy – we’d just check how many cycles INC/ADD cost. The Pentium complicated things a little with U/V pairing, and with modern out-of-order processors it’s even trickier. Without going into gory details: CPUs can split instructions into smaller, more RISCy operations named micro-ops. add [mem], reg is a good example; it’ll be split into 4 uops (load, add, calc write address, store). To make things more interesting, some of these micro-ops can then be fused back into a single operation (some pipeline stages can only process a limited number of uops, so this reduces bottlenecks). Different CPUs have different fusion capabilities, but my Ivy Bridge should be able to fuse our 4 uops back into 2. Let’s verify our assumptions using Agner’s MSR driver (I mentioned it before). Results from a test run of the “stack version”, array size = 1024, 1 iteration:

add_0

Uops F.D = uops fused domain, fused uops count as 1
Fused uops = # of fused uops

As we can see, inner loop costs 4 uops per iteration (after fusion). Let’s see the results for inc byte ptr [ecx+eax]:

inc_0

Ouch. As you can see, the number of instructions is almost exactly the same, but we generate way more uops (6k vs 4k, 2 uops more per iteration) and micro-op fusion doesn’t seem to do a very good job. Where does the difference come from? As usual in such cases, I started with Agner Fog’s site. As it turns out, there’s a subtlety to memory store instructions – if they don’t need a SIB byte, they only cost 1 uop; otherwise it’s 2. If you look at our “heap code”, it’s clear we need a SIB byte (it’s required when the address uses more than one register), and that means 2 uops. Let’s try to modify the assembly code so that we don’t use 2 registers:

mov eax, dword ptr heapArray
mov ecx, eax
add ecx, arraySize
l1:
inc byte ptr [eax]
inc eax
cmp eax, ecx
jb l1

Results:

inc_1

Progress! We eliminated 1 uop per iteration. Still not exactly on par with the first version, but we’re getting there. The remaining difference is INC vs ADD. I actually asked Agner Fog about it; he rightfully pointed out it’s probably related to the fact that INC needs to preserve the carry flag (for legacy reasons). (Sidenote: every time I ask Agner about something, he replies almost immediately. I then imagine the volume of mail he must be getting, compare it to the time it takes me to answer an email, and start feeling bad.) It’s also clearly noted in his “Instruction tables” doc (INC = 3 uops in the fused domain, ADD = 2), and Intel discourages using INC in their docs as well. Let’s replace INC with ADD and see the results:

add_1

That’s more like it! As you can see, we now generate 4 uops per iteration (same as the first version). The only thing still bothering me was the difference in the fused uop number (~2k here, ~3k in the first one). I figured it had something to do with the cmp instruction using a register (vs memory in the first one). Using the “Instruction tables” again, we can see that CMP m, r/i is actually not fused, while CMP r, r/i is. I also discovered another tool that’s helpful in such situations – the Intel Architecture Code Analyzer. Compare the output for cmp reg, mem & cmp reg, reg:

|   2^   | 0.3 |     | 0.5  0.5 | 0.5  0.5 |     |     | 0.8 |     |    | cmp eax, dword ptr [0x0]
|   1    | 0.3 |     |          |          |     |     | 0.8 |     |    | cmp eax, 0x400

The ^ means that micro-op fusion happened there (2 uops fused into 1). IACA doesn’t seem 100% reliable (e.g. it doesn’t catch the INC vs ADD difference), but it’s another helpful tool for our collection.

Obviously, going down to the micro-op level is not something you’ll be doing often (or ever, to be honest); it only makes sense if you’re actually writing assembly, otherwise it’s in the hands of the compiler. I still think it’s fun and an interesting way to understand how modern CPUs work under the hood.

Patching binaries

November 11, 2013 – 6:24 am

There may come a time in a game programmer’s life when he has to fix a bug in a library he doesn’t have the source code for. It doesn’t happen often, it might never happen, but it’s good to be prepared. If I remember correctly, I’ve only had to do it twice; one of those times was fairly recent. We were getting quite a few crash reports and were assured that a fix in the third-party library was coming, but I decided to see if it was possible to do anything about it in the meantime. Things were further complicated by the fact that we’d never seen this crash internally; it was all based on user reports (and it was quite rare in the wild, too). I started by investigating crash dumps in WinDbg. The crash itself was a division by zero; it seemed like the code was not handling all the edge cases correctly. It’d load a value from a table, do some transformations and divide by the result. That worked fine in most cases, but it would break if the value read from the table was zero, too (it’d pass all the transformations and come out as zero on the other end). We had no sources and no symbols, so I wasn’t even sure what this function was supposed to do, but it seemed like the array should not contain zeros in the first place. Now, I didn’t really care about a 100% correct solution (it was obvious I was treating symptoms); I just wanted something that’d eliminate the crashes and wouldn’t break rendering completely – I was fine with temporary artifacts. What I had to do was squeeze in a test against zero, handle it, and also set the original array element to something other than 0 (to cut a long story short, I found out about the last requirement in the process; it’d crash in another function without it). Easy, right? That’s about 12 bytes worth of opcodes in x64. The block I was comfortable modifying (I didn’t want to mess with the whole function) was roughly 40-45 bytes, maybe a little more, so I had to find a way to shrink it down by ~25-30%.
I will not focus on the actual modifications too much, as they’re not applicable to anyone else and – to stress this one more time – you do not need stuff like this often, if ever. Instead, I’ll try to present some of the tricks & tools of the trade that can come in handy in other situations, too.

Let’s start with writing code that does the same as the original fragment, but is smaller. Luckily for me, the code was using additional registers (outside the EAX-EDI range) even though it was not operating on 64-bit numbers (so it was only using the lower 32 bits). When using extended registers, we have to emit an additional REX prefix, so most of the time the opcodes are at least 1 byte longer than their 32-bit counterparts. Example:

mov ecx, eax            ; 8b c8 = 2 bytes
mov ecx, r8d            ; 41 8b c8 = 3 bytes
                        ; (0x41 is a REX prefix: keeps the default
                        ; 32-bit operand size & extends the ModRM
                        ; r/m field so it can reach r8d)

By changing parts of the code to operate on EAX-EDI I was able to get within 1 byte of my goal, but for the last stretch I had to resort to a riskier modification: using CDQ (1-byte opcode) instead of XOR EDX, EDX (2 bytes). They are equivalent assuming we’re operating on non-negative numbers, which luckily was the case here. Surprisingly, the x86 version was somewhat easier; I could not use “smaller” registers, but the generated code was a little redundant, so I modified the algorithm slightly to do the same thing with fewer instructions.

Getting the final opcodes was trivial for x86: I simply used inline assembly and copy-pasted from the disassembly window. I could not do the same for x64, as MSVC does not support inline assembly in that mode (intrinsics only). Looking back, I should have just downloaded some x64 assembler, but if I had, I would not have discovered ODA. It’s a great online disassembler supporting every platform you’ve ever coded for and a bunch you’ve never heard of. My only complaint is that it sometimes takes a while to realize the opcodes have changed and still shows you the old code, but other than that it’s simply awesome. x64 encoding is not terribly user friendly, especially when you need to generate instructions like INC BYTE PTR [R12+0x4], but I kept plowing through. Intel’s manuals are a good starting point, but I found the OSDev Wiki to be a more concise reference.

For the actual editing I used HTE for x86. It probably pushes the definition of oldschool a little too far (no mouse support…), but it has a built-in disassembler, so I could immediately verify my changes made sense. I could not find any hex editor/disassembler for 64 bits, so I used my trusty XVI32 and the debugger for verification.

This brings us to the last point – how to set a breakpoint in an unknown piece of code, with no function name and no symbols. Well, immediate window to the rescue! (Side note: I feel this is probably one of the most underappreciated features of Visual Studio. IME, many of the programmers complaining about the MSVC debugger either don’t use it often enough or don’t use it to its full potential.) We know the opcodes, so we can search for them in memory. Start by getting the address range of the module you’re interested in (open the Modules window, copy-paste). Now, in the immediate window, we can use the memory search command. For example, let’s assume you’re looking for the mov ecx, eax instruction (in a real-life scenario you’d probably want to choose something less common, obviously) and your module address range is 0x003D0000-0x0044B000:
.S -W 0x003D0000 0x0044B000 0xc88b (-W = 16-bit number, -D = 32-bit).

All that’s left is opening the disassembly window and copy-pasting the addresses returned by the .S command (hopefully not too many of them) into the ‘Address’ field one by one. It shouldn’t take long to find the function we’re looking for. That was the last stage of my experiment: I could now verify the code was indeed running, modify data on the fly, and prove that it did crash upon encountering zero in the array (remember, I had never actually experienced this bug myself; it was all based on user error reports, so hoping to just run into it was a fool’s errand). More importantly, I could verify it was no longer crashing after my changes and introduced no noticeable side effects.
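At its core, the .S command is just a byte-pattern search over an address range. The same idea can be sketched in a few lines of C++ (the names here are mine; in a real session the range would come from the Modules window):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the address of every occurrence of `pattern` in [begin, end),
// much like the immediate window's .S command.
std::vector<const unsigned char*> findPattern(const unsigned char* begin,
                                              const unsigned char* end,
                                              const unsigned char* pattern,
                                              std::size_t patternLen)
{
    std::vector<const unsigned char*> hits;
    const unsigned char* p = begin;
    while ((p = std::search(p, end, pattern, pattern + patternLen)) != end)
    {
        hits.push_back(p);
        ++p;  // resume just past this hit
    }
    return hits;
}
```

E.g. searching for { 0x8B, 0xC8 } (mov ecx, eax) over the module’s range yields the candidate addresses to paste into the disassembly window.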

Road trip 2013

October 30, 2013 – 5:24 am

Last summer we visited Alberta and BC, so this year it was time to go east. We were driving and only had 9 days, so we didn’t go all the way to the coast (something for 2014, perhaps?), but had a great trip anyhow. The itinerary was London-Kingston (1000 Islands cruise)-Ottawa-Montreal-Quebec City. A little over 2100 km; I wasn’t driving every day, though, so it wasn’t too demanding. Ottawa & Montreal were nice, but it’s QC that was the highlight of our trip. It’s a charming little city with a very European feel (the old town, at least); it was nice to see buildings older than 150 years for a change. We happened to be in Ottawa just in time for Canada Day, which was a nice experience as well. Some photos (more here):

Smartness overload #2

October 23, 2013 – 3:33 am

Today’s article is brought to you by a friend of mine. He’s been doing some home experiments and noticed that a particular piece of code was running *faster* in Debug than in Release (using the Visual C++ compiler). He mentioned this to me; I found it intriguing enough to ask him for the snippet and investigated it yesterday. It turned out to be a classic case of the compiler trying to be too smart and shooting itself in the foot. Consider the following piece of code (it’s not exactly the same as the one I analyzed – I stripped it to a bare minimum), built for x86:

// numToTest & lc are __int64s
for(__int64 i = 2; i < numToTest; ++i)
{
    if(numToTest % i == 0)
    {
        lc = numToTest / i;
        sd += i;
        //.. some other stuff, not too much, though

As you can see, it’s fairly innocent. The inner loop is tight (compiles to maybe 15-20 instructions) and doesn’t touch memory much, so I could see Debug running at roughly the same speed as Release, but being faster? That’s something you don’t see very often. It was consistently faster, too, and by a fair margin.

Let’s see generated assembly code (loop body). Debug:

013C167C 8B 45 D8         mov         eax,dword ptr [ebp-28h]
013C167F 50               push        eax
013C1680 8B 4D D4         mov         ecx,dword ptr [i]
013C1683 51               push        ecx
013C1684 8B 55 0C         mov         edx,dword ptr [ebp+0Ch]
013C1687 52               push        edx
013C1688 8B 45 08         mov         eax,dword ptr [numToTest]
013C168B 50               push        eax
013C168C E8 D8 F9 FF FF   call        @ILT+100(__allrem) (13C1069h)
013C1691 89 85 08 FF FF FF mov         dword ptr [ebp-0F8h],eax
013C1697 89 95 0C FF FF FF mov         dword ptr [ebp-0F4h],edx
013C169D 8B 8D 08 FF FF FF mov         ecx,dword ptr [ebp-0F8h]
013C16A3 0B 8D 0C FF FF FF or          ecx,dword ptr [ebp-0F4h]
013C16A9 75 2D            jne         sd_Local+0C8h (13C16D8h)
{
lc = numToTest / i;
013C16AB 8B 45 D8         mov         eax,dword ptr [ebp-28h]
013C16AE 50               push        eax
013C16AF 8B 4D D4         mov         ecx,dword ptr [i]
013C16B2 51               push        ecx
013C16B3 8B 55 0C         mov         edx,dword ptr [ebp+0Ch]
013C16B6 52               push        edx
013C16B7 8B 45 08         mov         eax,dword ptr [numToTest]
013C16BA 50               push        eax
013C16BB E8 BD F9 FF FF   call        @ILT+120(__alldiv) (13C107Dh)
013C16C0 89 45 E4         mov         dword ptr [lc],eax
013C16C3 89 55 E8         mov         dword ptr [ebp-18h],edx

Release:

00291038 56               push        esi
00291039 57               push        edi
0029103A 55               push        ebp
0029103B 50               push        eax
0029103C E8 CF 08 00 00   call        _alldvrm (291910h)
00291041 0B CB            or          ecx,ebx
00291043 75 10            jne         sd_Local+55h (291055h)
{
lc = numToTest / i;
sd += i;
00291045 01 7C 24 10      add         dword ptr [esp+10h],edi
00291049 89 44 24 18      mov         dword ptr [esp+18h],eax
0029104D 89 54 24 1C      mov         dword ptr [esp+1Ch],edx
00291051 11 74 24 14      adc         dword ptr [esp+14h],esi

At first glance this doesn’t explain much; if anything, Release is tighter and keeps data in registers (as expected). It only calls 1 function vs 2 in Debug, too… The interesting part is that it calls a different function. As you probably know, 64-bit operations are a little tricky in 32-bit applications. Numbers take 2 registers and we need to deal with overflows properly. That’s why MSVC encapsulates 64-bit divisions in helper functions (as opposed to just spitting out 40+ assembly instructions every time). As it turns out, there are at least 3 of these (plus variants for unsigned types):

  • lldiv – 64-bit divide (x/y),
  • llrem – 64-bit remainder (x % y),
  • lldvrm – both (x/y AND x % y)

It does sometimes make sense to calculate both the quotient & remainder at once, as it’s not that much additional work. That’s what the compiler tried to do here: it was smart enough to notice we use both numToTest % i and numToTest / i, so it generated a call to lldvrm to get them both in one shot. The problem is, lldvrm is also a little more expensive than llrem (not much, but it adds up); see the reference implementations here & here. If you think about it, our if(numToTest % i == 0) test will evaluate to false in the vast majority of cases. This means the Debug build gets away with the cheaper llrem call and only has to execute lldiv in a few situations. The Release build, on the other hand, pays for calculating both x/y & x%y on every iteration, even though it usually only uses the latter. That’s enough to nullify the effects of better register management and of not having to call lldiv in the handful of cases where the test passes.

Other observations:

  • obviously, the problem doesn’t manifest itself in x64 builds, as there’s no need for helper function calls there (both builds run much faster than the 32-bit version, Release faster than Debug),
  • I couldn’t find an easy way to persuade the compiler to stop using lldvrm. MSVC doesn’t have branching hints; I tried PGO, hoping it’d realize the if evaluates to false in 95% of cases, but that didn’t help either. Doing:
    __declspec(noinline) __int64 Rem64(__int64 x, __int64 y)
    {
        return(x % y);
    }
    

    ..and using Rem64(numToTest, i) instead of numToTest % i did help, but it’s not ideal, as you pay the price of an additional function call (it has to stay there, though, so that the compiler doesn’t notice the optimization “opportunity”).

Delete current instruction macro

October 6, 2013 – 4:48 am

I must admit I’m not as die-hard a fan of the ProDG debugger as some other coders out there; perhaps I just haven’t used it long enough. One tiny thing I do miss, though, is the ability to replace the assembly instruction under the cursor with a NOP with a single keystroke. Sure, with Visual Studio you can achieve the same result with the memory/immediate windows, but it’s much more cumbersome. Today I decided to finally bite the bullet and recreate this little feature with a VB macro:

Imports System
Imports EnvDTE
Imports EnvDTE80
Imports EnvDTE90
Imports System.Diagnostics
Imports System.Windows.Forms

Public Module Module1

    Sub NopInstruction()
        Dim debugger As EnvDTE.Debugger
        debugger = DTE.Debugger

        Dim projs As System.Array
        Dim proj As Project
        projs = DTE.ActiveSolutionProjects()
        Dim is64 As Boolean
        is64 = False
        If projs.Length > 0 Then
            proj = CType(projs.GetValue(0), EnvDTE.Project)
            is64 = proj.ConfigurationManager.ActiveConfiguration.PlatformName.IndexOf("64") >= 0
        End If

        Dim expr As Expression
        If is64 Then
            expr = debugger.GetExpression("@rip")
        Else
            expr = debugger.GetExpression("@eip")
        End If


        If expr.IsValidValue Then
            Dim deleteExpression As String
            deleteExpression = "*(unsigned char*)" + expr.Value + " = 0x90"
            debugger.ExecuteStatement(deleteExpression)
            debugger.StepOver()
        End If

    End Sub

End Module

All you need to do is add this macro and bind it to a key of your choice. Unfortunately, life is never easy, and my solution comes with a few drawbacks:

  • you probably don’t want to bind it to Delete. Visual Studio doesn’t support different bindings for different modes… Well, it kind of does, but I don’t think you can distinguish between the editor and the debugger. I have it under Ctrl+Del.
  • it’s rather dumb. IIRC, ProDG was smart enough to know the instruction size and insert the proper number of NOPs; my macro will always insert one per keystroke. I think it’s possible to fix this by writing a “proper” add-in and using the information provided by IDebugDisassemblyStream2, but this was supposed to be quick evening coding fueled by Ardbeg & Genesis. It took me long enough to figure out how to get the current code location (CurrentThread.Location returns the function name, not the address, if symbols are provided :/)
  • the version I originally posted only worked with 64-bit code; 32-bit would have required changing rip->eip, which sucked, as it meant having 2 macros. Not a deal breaker for me – I work with 64-bit code 90% of the time – but still a little annoying.
    [EDIT] That was a problem with the first version I posted. After some hacking I’ve been able to add 64/32-bit detection (it’s based on the platform name of the first active solution project).

Go pathtracer

September 23, 2013 – 2:47 am

Recently I’ve been experimenting with the Go programming language a little. It’s actually my second approach – I gave it a half-hearted try last year, but gave up pretty quickly (I think it was for some petty reason, too, probably K&R braces). This time around I managed to stick with it a little longer and learn a thing or two. I decided my test application would be a simple pathtracer. Probably not exactly the area the language creators had in mind, but I wasn’t really interested in coding web server stuff (Go did surprisingly well in the end). As far as features go, the pathtracer itself is not really interesting; my main goal was to get to know the language a little. It’s pretty much a poor man’s version of smallpt (I borrowed the scene description and the “use a huge sphere to fake a box” trick from there). It doesn’t support dielectric objects (I did implement glossy materials, though – I like that look) and I didn’t really care about squeezing it into as few lines as possible (so it’s ~700 lines, not 99). Some of my observations (please take them for what they are – the random ramblings of someone who’s been coding in Go for ~10h total and used it for quite an unorthodox purpose):

  • if you know C/C++ – you know Go. The syntax is very similar, and you can learn the most important differences in 5 minutes. There’s a good summary here.
  • interestingly, the forced K&R braces were not a problem this time; it somehow felt natural (even though I don’t use this convention in C/C++). If you’re struggling, there is gofmt (a tool that converts your code to the ‘official’ format)
  • my biggest problem was the “weird” declaration syntax, different from most languages I’ve tried – in Go it’s the name followed by the type, i.e.:
    var normal Vector    // That's Vector normal in C/C++/Java/C# etc
    

    The type-name order is so ingrained in my mind/fingers that it took me a few days to switch, and I still hesitate for a split second every time (I actually had to think twice when typing the example above [EDIT: and I still got it wrong!]).

  • no operator overloading. Yes, you don’t need it often, and in the wrong hands it can be downright dangerous, but math code is one area where it’s actually quite useful. Take a ‘reflect’ function, for example: with operator overloading it looks very similar to the math formula; without it, it might take a while to understand what’s going on exactly. It also results in longer code. Not a big deal, but worth mentioning.
  • no built-in assertions/compile-time assertions. The authors try to justify their decision in the FAQ, but to be honest I don’t really buy it (yes, you can code one yourself or use some public library, but that didn’t make much sense for a program of this scale)
  • profiling. Go has a built-in profiler (well, almost – you need to add some code to your program), described here. Sadly, out of the box it doesn’t seem to work properly on Windows. I used a custom version of the Perl script found at the Ex Nihilo blog. It worked OK for big-picture views, but I still couldn’t get the list command to work (admittedly I didn’t try too hard; it seemed to be missing some GCC tools like c++filt & nm)
  • debugging. Apparently you can use gdb; I didn’t try it, so I can’t comment. I used fmt.Printf and visual debugging – mostly for tracking logic bugs, as crashes are usually very informative and it’s obvious what went wrong. Go is a fairly “safe” language, too: you can’t easily stomp over memory or address arrays with invalid indices.
  • goroutines. The most important thing to remember – these are not hardware threads, so don’t be too stingy. Switching between goroutines is cheap, so it’s OK to create hundreds of them (in my case it was a single goroutine per 16×16 pixel chunk). In gamedev terms they’re closer to jobs in a worker-pool scheme (rather than worker threads). In case anyone is interested in the gory details, here’s their scheduler implementation (check it out, it’s actually very nice code). This is one of the areas where Go shines: modifying my tracer to take advantage of multiple threads took me 2 minutes or so. Truth be told, the first version had a data race bug, but here’s where Go surprised me again – it actually has a built-in race detector. All you need to do is add the -race flag to the command line. One of the authors of the detector is Dmitry Vjukov, the guy behind Relacy and someone who knows more about low-level multi-threading than most of us combined, so I expected it to work well. I know it found my problem in no time.
  • memory management. I must admit I still don’t really “feel” it. Go uses an automatic memory management (mark-and-sweep GC), so in theory you don’t have to worry much, but I like to understand what’s going on under the hood (especially if you care about performance). The tricky part is, Go tries to be smart and avoid dangling pointers, so heap allocations are not always immediately obvious. Consider:
    func f2() *S {
        var s S
        return &s
    }
    

    In this example, s will be allocated on the heap, so that the caller is not left with a dangling pointer (even though there’s no explicit new).

  • one that actually caused me some head scratching. Consider the following code where I tried to precalculate radius^2 for each sphere:
    func (scene *Scene) Initialize() {
        for _, s := range scene.objects {
            s.radiusSqr = s.radius * s.radius
        }
    }
    

    If you’re familiar with Go, you probably see the mistake already… After I introduced this “optimization”, all I was getting was a black screen. As it turns out, index/value iteration actually operates on a copy of the array element! Seems like a weird decision to me, but I might be missing something. It also has performance implications. My original “intersect” loop was this:

    for i, s := range scene.objects {
        t := SphereIntersect(&s, ray)
        [...]
    

    The same problem here – we copy the sphere object every iteration. Rendering a 256×256 image with 144 samples per pixel was taking around 2m20s. One little change:

    for i := range scene.objects {
        s := &scene.objects[i]
        t := SphereIntersect(s, ray)
    [...]
    

    …and same scene renders in 1m40s!

  • most importantly – is it fun to program in Go? It sure is. I actually did most of this stuff during a few late nights and had to drag myself to bed (mostly because I knew my daughter would wake me up at 8am the next day). Even though it’s a fairly young language, it feels very mature. It takes a few seconds to compile, so it might not feel as snappy as some scripting languages, but the generated code is really fast. It has a few quirks that I don’t love (most of them mentioned above), but at the end of the day I really enjoyed it.

Some pretty and not-so-pretty pictures and the source code:

Raytrace

First version of path tracing, extremely noisy

Comparison between stratified (left) & uniform sampling. Nothing spectacular, but noticeable (click to enlarge).

512×512, 256spp, 2×2 AA. Takes 11 minutes to render, fairly decent quality, you’re only getting minor improvement afterwards.

Final render. 1024 samples, 2×2 AA, light sampling

Performance-Monitoring Events

August 9, 2013 – 4:36 am

When experimenting with some of the more esoteric features of modern CPUs, it’s sometimes not immediately obvious whether we’re actually taking advantage of them. Sure, you can compare cycle counts, but the differences are not always big enough to justify conclusions. Luckily for us, starting with the Pentium processor, Intel introduced a set of performance-monitoring counters. They are model-specific (not compatible across processor families) and allow you to monitor just about every aspect of the CPU pipeline. Before explaining the details, let’s take a look at the high-level usage steps:

  • choose the events to monitor (from a big selection, ranging from non-forwarded stores to cache misses; modern CPUs support up to 4 general-purpose counters),
  • read the performance-monitoring counter (using the RDPMC instruction),
  • run your code,
  • read the performance-monitoring counter again and compare with the original value.

Doesn’t sound too complicated, does it? The only tricky part is the first step. In order to select the monitored events, we have to write to a Model Specific Register (WRMSR mnemonic). This instruction must be executed at privilege level 0 (kernel), so we can’t simply execute it from our application; it has to be done by a driver. Sure, we could use a full-blown profiler like VTune, which has this built in, but that’s probably too fancy for our simple needs. Personally, I use the MSRDriver by Agner Fog (the name should be familiar to any programmer interested in optimization); it’s perfect for small experiments.

Most recently, I’ve been experimenting a little with the loop stream detector, but wasn’t sure if it was really kicking in (the performance differences were inconclusive). The loop stream detector does pretty much what it says on the tin: if it detects a loop that fits in the micro-op queue, it starts streaming uops directly from the queue, with no fetching/decoding. It has some limitations: loops can’t be too long (up to 18 instructions originally, bumped to 28 in Nehalem), have to run more than a certain number of iterations (64), can’t contain too many branches (absolutely no CALLs/RETs), etc. MSRDriver experiments are shown below. The first run didn’t have enough iterations to qualify for the LSD (63); the second one had 65. Notice the third column – the number of micro-ops delivered by the loop stream detector (to be honest, I’m not exactly sure why it’s non-zero for the first two tests):

loop stream detector

Here’s another test – I went back to my store-forwarding experiment (the last column is the number of loads that could not be forwarded):

PMC

Agner’s test app doesn’t include a counter definition for Ld.Block, so I added it; here’s the corresponding entry:


{312, S_ID3,  INTEL_7,  0,   3,     0,   0x3,      0x2, "Ld.Block" },

Refer to Intel 64 and IA-32 Architectures Software Developer’s Manual (volume 3B) for a complete list of counters.

Second Reality – source code

August 3, 2013 – 3:39 am

Just a quick follow-up to my previous note. As mentioned by Michal, the Future Crew guys decided to celebrate the 20th anniversary of Second Reality in the best way possible – they released the full source code. Obviously, it’s more of a tidbit than anything else, but it’s still interesting to finally see how certain effects were done. Apparently Fabian is already working on a code analysis article, but in the meantime I’ll mention just two things that have caught my eye so far:

  • lots of auto-generated code. They have small C programs generating assembly inner loops. Neat idea
  • loop unrolling galore. There’s a neat trick we used back in the 90s (no longer applicable on modern CPUs). In C/C++ terms it’s an extreme case of Duff’s device – instead of processing 4/8/16/xx elements per step, we unroll the whole loop. This means no loop counter manipulation and no need to update the data pointer. Example (Gouraud shader, AVIDFILL.ASM, line 458):
    mov	ax,cx
	shl	 cx,3
	sub	ax,cx
	;ax=-cx*7
	add	ax,OFFSET @@vdfg_fcode
	mov	bx,cs:vdfg_color2
	sub	bx,dx ;one extra
	jmp	ax

zzz=MAXCOLS
REPT	MAXCOLS+1
	add	bx,dx			;2 bytes
	mov	cs:vdfg_buf[zzz],bh	;5 bytes
	zzz=zzz-1
ENDM
	mov ax,bx			;2 bytes, 1 clock (filler)
@@vdfg_fcode:

(REPT is a macro that repeats the code between REPT/ENDM the specified number of times)
As you can see, the only thing modified in every “iteration” is the BX register – 2 instructions per pixel (instead of 4 + a conditional jump). To find the jump address, we multiply the number of pixels to fill (originally in CX) by 7 (bytes per “iteration”), then subtract that from the end address (so we jump back “x” iterations).
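For comparison, here’s what the non-extreme form of the trick looks like in C++ – the classic Duff’s device, where a switch jumps into the middle of an 8x-unrolled copy loop to handle the remainder, much like the jmp ax above lands “x” iterations before the end:

```cpp
#include <cstddef>

// Classic Duff's device: copy `count` bytes, unrolled 8x; the switch jumps
// into the middle of the loop body to deal with count % 8 up front.
void duffCopy(unsigned char* to, const unsigned char* from, std::size_t count)
{
    if (count == 0) return;
    std::size_t n = (count + 7) / 8;    // number of times through the loop
    switch (count % 8)
    {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}
```

The Second Reality version goes one step further: with the whole loop unrolled, there’s no counter left at all, just a computed jump to the right entry point.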