Hidden enemy

Consider the following piece of code:

void FindFrames(const S* seq, float d, UINT& outFrameA)
    const UINT n= static_cast<UINT>(seq->fd.size());
    outFrameA = 0;
    while(outFrameA + 1 < n && seq->fd[outFrameA + 1] < d)

Looks innocent, right? Well, it was slow enough to show up quite high in the profiler… It might seem surprising at first, but if you’ve done some console programming in the past, you can probably recognized our old foe - load-hit-store here. Let’s see generated assembly code:

823690D4  cmplwi           cr6,r10,1
823690D8  ble              cr6,FindFrames + 00d0h (82369110h)
823690DC  lwz              r11,0(r31)
823690E0  lwz              r8,0A4h(r30)
823690E4  addi             r7,r11,1
823690E8  slwi             r6,r7,2
823690EC  lfsx             fr0,r6,r8
823690F0  fcmpu            cr6,fr0,fr31
823690F4  bge              cr6,FindFrames + 00d0h (82369110h)
823690F8  addi             r11,r11,1
823690FC  stw              r11,0(r31)
82369100  rlwinm           r11,r11,0,0,31
82369104  addi             r8,r11,1
82369108  cmplw            cr6,r8,r10
8236910C  blt              cr6,FindFrames + 009ch (823690dch)

Whooah, some serious spills there. As you can see compiler doesn’t store outFrameA in register, instead it loads (823690DC) & stores (823690FC) it in memory on every iteration. There’s not much work done between those calls, so obviously – that’s where all those LHSes come from. To be honest - I am not exactly sure why it behaves like this. There’s no aliasing, perhaps it’s so that outside code sees the most up-to-date value of outFrameA (even though it’s not volatile?). Either way, doesn’t make too much sense to me (see also Charles’ post + comments). Solution? Simple, copy to local variable, operate on it then store back. Modified code:

UINT frameA = 0;
while(frameA + 1 < n && seq->fd[frameA + 1] < d)
outFrameA = frameA;

Compare generated assembly:

823690D8  cmplwi           cr6,r7,1
823690DC  ble              cr6,FindFrames + 00d0h (82369110h)
        UINT frameA = 0;
823690E0  mr               r11,r5
823690E4  li               r10,1
823690E8  add              r6,r5,r9
823690EC  lfs              fr0,4(r6)
823690F0  fcmpu            cr6,fr0,fr31
823690F4  bge              cr6,FindFrames + 00d0h (82369110h)
823690F8  addi             r11,r11,4
823690FC  addi             r10,r10,1
82369100  addi             r8,r8,1
82369104  cmplw            cr6,r10,r7
82369108  add              r6,r11,r9
8236910C  blt              cr6,FindFrames + 00ach (823690ech)
Ahh, much better now. frameA is stored in r8, more stores/loads on every iteration, there’ll be only one when we’re done. That’s a very trivial modification, but version #2 will run much much faster than version #1… After while it’s easier to notice those patterns, but from time to time I’m still surprised when looking at the profiler data. Also, LHS is usually more sneaky than cache miss, because it depends more heavily on compiler. You can have perfectly innocent code that will run like dog just because compiler fails to put stuff into registers. Another common source of LHSes are bitfields, they might look like independent variables, but obviously they’re not. Typical case of speed/memory tradeoff here, sometimes it might still be a win (we’re essentially trading cache misses for LHSes…).

Old comments

NickP 2011-06-20 12:54:01

>> To be honest â?? I am not exactly sure why it behaves like this. Thereâ??s no aliasing <<
It looks like fd is an STL vector or some other templated container? My guess is that the compiler is storing + loading the value because of the call to operator[]. It's odd that it does that even though the function gets inlined. Perhaps if operator[] were declared __forceinline it would not do so?
It's also interesting that the problem cascades. Notice that it's reloading the base pointer of the array from memory every iteration, presumably because it cannot be sure outFrameA and the base pointer of the array do not alias.

admin 2011-06-20 17:09:57

Nick - yeah, fd is my version of STL vector. It doesn’t matter here, though, even if I change function to look like this:
[code lang=“cpp”]
void FindFrames(const float* fd, int n, float d, UINT& outFrameA)
Generated assembly will be exactly the same.

Arseny Kapoulkine 2011-06-20 20:01:18

> Thereâ??s no aliasing
The assembly style suggests that the platform is X360 - I was under the impression that MSVC does not perform type-based alias analysis (i.e. ‘strict aliasing’ problems don’t apply), so outFrameA could alias anything, including the container struct & container memory.

admin 2011-06-20 22:06:04

Yeah, I thought so, too, but adding __restrict to both outFrame (changing it to pointer, obviously) & fd (function signature showed above) will not help here.

More Reading