Know your assembly (part N+1)

Recently we have updated some of our compilers (Clang) and started running into mysterious problems, mostly in the physics code. We were able to narrow it down to raycasts failing to hit… sometimes. What made it a bit more annoying - it was only happening on 1 platform I had no easy access too. It was ARM based, but another ARM platform was fine. It seemed like the problem was deep in the guts of PhysX, code is publicly available, so I can link it here (rayAABBIntersect2).

Fantasy football and statistics

I’m playing in a fantasy football league with some coworkers this year. I have always liked all kinds of sports, but it’s a little bit challenging to follow NFL from Europe. There’s not too many stations that actually show it (except for the Super Bowl) and even if they do, the time difference make it difficult to watch. I still followed major news and managed to catch a game every now and then, but I was very far from expert.

Know your assembly (part N)

An entertaining head-scratcher from today. Application has suddenly started crashing at launch, seemingly inside steam_api.dll. It’d be easy to blame external code, but as mentioned - it’s a new thing, so most likely related to our changes. To make things more interesting, it’s only 32-bit build that crashes, 64-bit seems fine. High-level code looks like (simplified): 1struct tester 2{ 3 virtual bool foo(); 4 virtual void bar() 5 { 6 } 7 virtual void lol() 8 { 9 if(!

Know your assembly (part N)

An entertaining head-scratcher from today. Application has suddenly started crashing at launch, seemingly inside steam_api.dll. It’d be easy to blame external code, but as mentioned - it’s a new thing, so most likely related to our changes. To make things more interesting, it’s only 32-bit build that crashes, 64-bit seems fine. High-level code looks like (simplified): 1struct tester 2{ 3 virtual bool foo(); 4 virtual void bar() 5 { 6 } 7 virtual void lol() 8 { 9 if(!

NaN memories

A nice thing about Twitter is that single tweet can bring so many memories. It all started when Fiora posted this: This reminds me of how HLSL very carefully defines "saturate" in a way that makes NaNs turn into 0: https://t.co/0tWzaa6H2K @rygorous — Fiora (@FioraAeterna) March 11, 2015 It reminded me of an old bug I encountered few years ago when working on a multi-platform (PC/X360/PS3) title. One day we started getting strange bug reports.

MESIng with cache

(Please excuse the terrible pun, couldn’t help myself). As we all know, computer cache is a touchy beast, seemingly little modifications to the code can result in major performance changes. I’ve been playing with performance monitoring counters recently (using Agner Fog’s library I mentioned before). I was mostly interested in testing how cmpxchg instruction behaves under the hood, but wanted to share some other tidbits as well. Let’s assume we’re working with a simple spinlock code.

Z-Machine interpreter in Go

Recently, I had an inspiring discussion with fellow programmers, we were talking about interesting side projects/programs to quickly “try out” new programming language/job interview tasks. One that’s been mentioned was coding a Z-machine interpreter that’s capable of playing Zork I. The Z-machine is a virtual machine developed by Joel Berez and Marc Blank, used for numerous Infocom text adventure games, most notably the Zork series. In all honesty, I’m probably a few years too young so didn’t get to play Zork when it was big (I did play old Sierra adventures back when you actually had to type commands, though, one of the the reasons I started to learn English was Police Quest I.

Going deeper

Few weeks ago I encountered a discussion on a Polish gamedev forum – participants were wondering whether it’s faster to access stack or heap memory. I didn’t pay much attention (there should be no significant difference) until someone had posted a test case. Out of curiosity, I ran it and to my surprise discovered, it was consistently faster for buffers allocated on the stack. We’re not talking about few cycles here and there, too, the difference was almost 20% (Ivy Bridge laptop).

Smartness overload #2

Today’s article is brought to you by a friend of mine. He’s been doing some home experiments and noticed a particular piece of code was running faster in Debug than Release (using Visual C++ compiler). He mentioned this to me, I found this intriguing enough to ask him for this snippet and investigated it yesterday. Turned out it was a classical case of compiler trying to be too smart and shooting itself in the foot.

Delete current instruction macro

I must admit I am not as die hard fan of ProDG debugger as some other coders out there, perhaps I’ve not been using it long enough. One tiny thing I miss though was the possibility of replacing an assembly instruction under the cursor with NOP with a single keystroke. Sure, with Visual Studio you can achieve same result with memory/immediate window, but it’s much more cumbersome. Today I decided to finally bite the bullet and recreate this little feature with VB macro:

cmov fun

If you’ve been coding for current (prev?) gen consoles, you know your optimization guidelines - be nice to your cache, avoid LHS and branches, in general - do not stress the pipeline too much. With next (current?) generation moving back to x86, things get a little bit more blurry. With out-of-order/speculative execution, register renaming and advanced branch predictors, it’s sometimes easy to shoot yourself in the foot when trying to be smarter than compiler/CPU.

The undefined flag

This week I had one of the most interesting debugging sessions in a while. Here’s the minimal code snippet exhibiting the problem. It doesn’t really make much sense in itself, but I tried to remove everything not related to the bug itself (the actual case was much more convoluted): 1struct SVar 2{ 3 SVar() : m_v(23000) {} 4 void operator=(unsigned short t) 5 { 6 if(_rotr16(m_v, 1) != t) 7 { 8 m_v += t * 100; 9 } 10 } 11 unsigned short m_v; 12}; 13static const int MAX_T = 1000; 14struct Lol 15{ 16 __declspec(noinline) void Cat(int t) 17 { 18 unsigned short newT = static_cast<unsigned short>(t < MAX_T ?

Conditional breakpoints on steroids

Imagine a situation where you’re debugging some problem and need to set a breakpoint in function that’s called very often from many points in your application (like operator[] in your array class for example). Thing is, you’re only interested in one particular codepath, as that’s where things go wrong, calls from other places are fine. If you simply set a breakpoint in line that interests you, it’ll trigger hundreds of times every frame, making it tricky to find the moment that’s interesting for us.

TMplayer to MicroDVD converter

One of my Christmas presents (that I bought myself) was WD TV Live Plus media streamer. I left my projector back in Warsaw (sniff) and I was tired of watching stuff on my 15” laptop. WD TV is a pretty neat little machine, it effortlessly plays every video I throw at it. The only problem I had was that it seemed to struggle with TMPlayer subtitle format which is very popular in Poland for some reason.

Playing with bits

Little challenge posted by a workmate. Good for those 5 minute breaks at work when you need to force your brain to think about something else than your current task. Recode the following piece of code so that it doesn’t use branches/multiplies: unsigned char* oldStart = 0; if (m_start + size < m_bottom) { oldStart = m_start; m_start += size; } return oldStart; Don’t read below if you want to try it yourself.

Anatomy of Duff's Device

Duff’s Device is one of the most brilliant exploits of C syntax. It’s used to unroll loops and save some cycles spent on loop ‘maintenance’. Let’s take a look at typical fill_n function: template RDE_FORCEINLINE void fill_n(T* first, size_t n, const T& val) { for (size_t i = 0; i < n; ++i) first[i] = val; } Now, assuming that element is not too expensive to copy, execution time of this function can be heavily influenced just by loop ‘maintenance’ I mentioned (counter modifications + jumps).

Vector swizzling in C++

Everyone who’s done at least some vertex/pixel shader/HLSL programming has probably encountered mechanism called “swizzling”. It’s an operation where we create new vector using arbitrarily selected components of another vector (also a little bit similiar to SSE shuffling). Code snippet is worth 100 words, so some examples: a = b.zyzx; // a.x = b.z, a.y = b.y, a.z = b.z, a.w = b.x a = b.wy; // a.x = b.

Aligning arrays

When developing stack-based containers for RDESTL I’ve encountered the following problem - how to get block of uninitialized memory that’s aligned properly for type T. Consider fixed_vector class: template class fixed_array { ... char m_data[N * sizeof(T)]; Size is OK (we need N elements of type T), sadly alignment is invalid here. It may not be a problem for majority of cases, but try storing _m128s… Even when using 32-bit variables, they should be aligned on natural boundary (4 bytess) in order to rely on writes/reads being atomic.

Number of array elements - addendum

OK, so how does this snippet work? There’s a function that takes array of N elements as an argument. It returns an array of N chars. Assuming sizeof(char) == 1, size of the return type for this function is N. There’s no function body, because it’s not needed, function is never called. On the other note, we’ve been hit by the Reddit Effect. Of course, in this case it wasnt a problem of breaking the server, but it screwed all my stats.

Number of array elements

(or one more reason to love C++). Problem: how to find out number of elements in a C++ array? The most popular form is probably: #define RDE_COUNT_OF(arr) (sizeof(arr)/sizeof(arr[0])) ... Foo myTab[10]; size_t numElems = RDE_COUNT_OF(myTab); assert(numElems == 10); It’s being widely used, however most people do not realize the potential danger with this small piece of code. Imagine that one day we decide to change our static array to dynamically allocated block of memory: Foo* myTab = new Foo[numNeededFoos]; size_t numElems = RDE_COUNT_OF(myTab); // Huh?

More new/delete overriding fun.

Consider the following code snippet: struct Foo { void* operator new(size_t bytes); void operator delete(void* ptr); }; struct Bar : public Foo { void* operator new(size_t bytes); void operator delete(void* ptr); }; [...] Foo* b = new Bar(); delete b; Can you see a problem here? Yes, there’s no virtual destructor. It will not manifest itself (at least under Visual Studio) in an obvious way if you do not do anything in derived class’ destructor.

Spot a bug

I’ve found an interesting bug in a very old code today. Consider the following code snippet: 1struct foo 2{ 3 void* operator new(size_t t); 4 void operator delete(void* p); 5 int i; 6}; 7[...] 8{ 9 foo* f = new foo(); 10 ::delete(f); 11} Can you spot a bug here? Read rest of the article for an answer. It all boils down to those two ‘:’ chars. It forces compiler to call global operator delete instead of this defined for class foo.

Compile-time iterator type

Something we’ve been wondering at work one day (just for kicks, but being resident template freak I couldnt resist to give it a try). Challenge: given STL collection, find a way to determine ‘compatible’ iterator type for it. Example of intended usage: std::vector v; for (ITER_TYPE(int, v)::iterator it = v.begin(); it != v.end(); ++it) ; // do something here. Do not read further if you want to try it by yourself, solution follows.