Virtual functions - an experiment

Some months ago I’ve read this article by Elan Ruskin, when he measures the overhead of virtual functions. There’s also a follow-up with test code, make sure to check it out as well. I’ve decided to extend the test application a little bit and add one more solution - direct function pointers.

Instead of having global per-class vtable, we can store function pointers in each object, thus removing one level of indirection. With virtual functions we first have to jump to class vtable, load function address, then jump to function. With per-object function pointers, the jump address is right there. Code for this version:

class TestVector4_FuncPtr
static float CALL_DECL S_GetX(const TestVector4_FuncPtr* v) { return v->x; }
static float CALL_DECL S_GetY(const TestVector4_FuncPtr* v) { return v->y; }
static float CALL_DECL S_GetZ(const TestVector4_FuncPtr* v) { return v->z; }
static float CALL_DECL S_GetW(const TestVector4_FuncPtr* v) { return v->w; }
static float CALL_DECL S_SetX(TestVector4_FuncPtr* v, float in) { return v->x = in; }
static float CALL_DECL S_SetY(TestVector4_FuncPtr* v, float in) { return v->y = in; }
static float CALL_DECL S_SetZ(TestVector4_FuncPtr* v, float in) { return v->z = in; }
static float CALL_DECL S_SetW(TestVector4_FuncPtr* v, float in) { return v->w = in; }

float GetX() const    { return    m_pfnGetX(this); }
float GetY() const    { return    m_pfnGetY(this); }
float GetZ() const    { return    m_pfnGetZ(this); }
float GetW() const    { return    m_pfnGetW(this); }

float SetX(float in)    { return m_pfnSetX(this, in); }
float SetY(float in)    { return m_pfnSetY(this, in); }
float SetZ(float in)    { return m_pfnSetZ(this, in); }
float SetW(float in)    { return m_pfnSetW(this, in); }

m_pfnGetX = &S;_GetX;
m_pfnGetY = &S;_GetY;
m_pfnGetZ = &S;_GetZ;
m_pfnGetW = &S;_GetW;

m_pfnSetX = &S;_SetX;
m_pfnSetY = &S;_SetY;
m_pfnSetZ = &S;_SetZ;
m_pfnSetW = &S;_SetW;

typedef float (CALL_DECL *Getter)(const TestVector4_FuncPtr*);
typedef float (CALL_DECL *Setter)(TestVector4_FuncPtr*, float);

float x, y, z, w;

Getter    m_pfnGetX;
Getter    m_pfnGetY;
Getter    m_pfnGetZ;
Getter    m_pfnGetW;

Setter    m_pfnSetX;
Setter    m_pfnSetY;
Setter    m_pfnSetZ;
Setter    m_pfnSetW;
(I’ll explain CALL_DECL later).

Let’s see the results now (array size: 1024 elements, 200000 iterations, 2.2GHz laptop):

  • virtual functions: 5008477914 ticks (2282.140339ms),
  • inline functions: 701710284 ticks (319.738127 ms),
  • normal (direct) functions: 4791365040 ticks (2183.227339 ms),
  • per-object function pointers: 5290894092 ticks (2410.842116 ms)

That’s a little discouraging. Not only our version is not faster than vtable, it’s actually quite a bit slower. I didn’t have too high expectations to be honest (I agree with Elan that cost of looking-up function pointer is not a major problem here), but still – it was surprising. I decided to dig a little deeper. Let’s compare generated code (one component, function pointer version):

//; 166  :         out[i].SetY( in1[i].GetY() + in2[i].GetY() );
push    ebx
call    DWORD PTR [ebx+20]
fstp    DWORD PTR $T39106[ebp]
lea    eax, DWORD PTR [esi-16]
push    eax
call    DWORD PTR [esi+4]
fadd    DWORD PTR $T39106[ebp]
add    esp, 12                    ; 0000000cH
fstp    DWORD PTR [esp]
push    edi
call    DWORD PTR [edi+36] 

A little too many pushes for my taste, but doesn’t look too bad, let’s investigate further. Let’s see GetX function:

//; 55   :     float GetX() const    { return    m_pfnGetX(this); }
push    ecx
call    DWORD PTR [ecx+16]
pop    ecx
ret    0

//; 46   :     static float CALL_DECL S_GetX(const TestVector4_FuncPtr* v) { return v->x; }
mov    eax, DWORD PTR _v$[esp-4]
fld    DWORD PTR [eax]

// For comparison -- vtable version
; 119  : V_GETTER(Virtual, x, X);
fld    DWORD PTR [ecx+4]
ret    0

Yikes. First problem, when we re-direct from GetX to S_GetX, compiler needs to save ECX (this pointer). Actually, in this case, we could expect it to notice “this” is not touched afterwards, so it’s not really necessary, but oh, well. Another difference is that in vtable version we don’t have to load object pointer from the stack, as calling convention (thiscall) guarantees it’s passed in the register. It seems like our modifications did more harm than good in this case. Let’s not surrender yet, though. What would happen if we tried to change calling convention for our functions? Maybe there’s a way to force compiler to deal with it in more efficient manner. Here are results with static functions compiled in thiscall convention (we know that ECX is OK in GetX, so it should be good also in S_GetX):

  • per-object function pointers: 4936676352 ticks (2249.440999 ms)

Generated code looks more reasonable as well:

//; 46   :     static float CALL_DECL S_GetX(const TestVector4_FuncPtr* v) { return v->x; }
fld    DWORD PTR [ecx]
ret    0

//; 55   :     float GetX() const    { return    m_pfnGetX(this); }
jmp    DWORD PTR [ecx+16]

//; 166  :         out[i].SetY( in1[i].GetY() + in2[i].GetY() );
mov    ecx, ebx
call    DWORD PTR [ebx+20]
fstp    DWORD PTR $T39110[ebp]
lea    ecx, DWORD PTR [esi-16]
call    DWORD PTR [esi+4]
fadd    DWORD PTR $T39110[ebp]
push    ecx
mov    ecx, edi
fstp    DWORD PTR [esp]
call    DWORD PTR [edi+36]

Obviously, it’s not a solution of all vtable related problems. Gains are not really significant, but what’s more problematic - it increases size of every object (the more functions - the bigger overhead). In real-life scenario it’d only be usable for big objects, definitelly not for arrays of vectors as in this example. It’s just another tool in our box.

Another problem is – results seem to be very platform specific. While on PC the function-pointer version seems to be a little bit quicker, on consoles it doesn’t really change much. Let’s think how does virtual function call work on PPC. It first loads CTR using “mtctr” and later performs a branch using CTR. Without advanced branch predictors, CPU cannot start fetching new instructions, because jump address is not known yet. To cut the long story short – our “trick” won’t help us much on consoles :/. Code generated on PPC (for function pointer version) is roughly the same as Elan posted in the second article. There’s one lwz instruction less (no need to load the vtable), but it doesn’t seem to be enough to pay for increased object size.

[Edit] As suggested in comments, added milliseconds timings (I like ticks, because they’re more “portable”, but I agree that in this case it wasn’t very clear]

Old comments

J 2009-06-15 04:28:15

I suggest that you convert the tics into something more human readable. In this incarnation, it’s difficult to even see how their orders of magnitude compare.

ed 2009-06-16 12:05:18

You missed one factor in the testing. You had 1024 elements of the same object so vtable was cached in L1/L2. The question is of course how you’ll use objects and how many vtables will be involved. In real life app difference between vtable and direct should be much bigger. Storing pointer in object may have advantage as it will be read along with object but still, it will not be as fast as direct.
The biggest performance problem is that virtual functions are not inlined and pushing input parameters through stack (at least on PC) will have much bigger impact than calling function from vtable.

Anonymous 2009-06-16 16:53:51

Also be aware that most modern compilers will inline indirect function calls (and virtual function calls) if they can determine the target statically.

admin 2009-06-16 17:50:08

ed: good points.
On PC virtual functions usually will not be a problem anyway :)

getw 2010-03-06 22:37:49

[…] getz geta getq getw gete getd ge6s ge5s gers gefs gegs geys ggets geets getts getss g ets ge …Virtual functions an experiment | .mischief.mayhem.soap.Some months ago I've read this article by Elan Ruskin, when he measures the overhead of virtual […]