template<typename T, std::size_t Dim>
class Vec
{
    enum { MaxIndex = Dim - 1 };

private:
    T m_v[Dim];
};
As you can see, we also define a private MaxIndex constant, which holds the maximum valid index for this vector; for example, a 3D vector may be indexed with values from the [0, 2] range. Even at this stage we can implement some simple functions that don't require any tweaks, for example the copy constructor:
Vec(const Vec& rhs)
{
    for (std::size_t i = 0; i < Dim; ++i)
    {
        m_v[i] = rhs[i];
    }
}
It would be good if we could just trust our compiler to unroll this loop at compilation time. Theoretically this should be possible (Dim is a constant), but as we all know, practice has this nasty habit of not conforming to theory. Let's see what GCC generates from this piece of code:
    xorl    %edx, %edx
    movl    8(%esp), %ecx
    movl    12(%esp), %ebx
L140:
    movl    (%ebx,%edx,4), %eax
    movl    %eax, (%ecx,%edx,4)
    incl    %edx
    cmpl    $2, %edx
    jbe     L140
Yes, a runtime loop. Too bad. This means we need to find a way to unroll the loop by hand, at compile time. First, let's create an assignment operation:
template<typename T>
struct AssignOp
{
    static void Apply(T& lhs, const T& rhs) { lhs = rhs; }
};
Alternatively, you can make Apply() a template member of AssignOp, but I tend to choose the "weakest" solution that works. Member templates may cause problems, so we try to avoid them (even if they'll appear later anyway :). This operation should be applied to each vector element.
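For illustration, a sketch of that member-template variant (not used in the presented code) could look like this:

// Hypothetical alternative: Apply as a member template, so AssignOp itself
// no longer needs a template parameter. Not used in the presented code.
struct AssignOp
{
    template<typename T>
    static void Apply(T& lhs, const T& rhs) { lhs = rhs; }
};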
template<class LHS, class RHS, class Op, std::size_t i, std::size_t MaxIndex>
struct Unroller
{
    static void Go(LHS lhs, RHS rhs)
    {
        Op::Apply(lhs[i], rhs[i]);
        Unroller<LHS, RHS, Op, i + 1, MaxIndex>::Go(lhs, rhs);
    }
};
This is the unroller, which applies the given operation (Op) to the i-th element of LHS/RHS. The interesting part is the second line of Go(), where we move on to the next index. Now the only thing left is terminating when MaxIndex has been reached. I use template specialization for this:
template<class LHS, class RHS, class Op, std::size_t MaxIndex>
struct Unroller<LHS, RHS, Op, MaxIndex, MaxIndex>
{
    static void Go(LHS lhs, RHS rhs)
    {
        Op::Apply(lhs[MaxIndex], rhs[MaxIndex]);
    }
};
The code is essentially the same, but this time we just terminate. You've most probably seen implementations that traverse from Dim - 1 down to 0, with a specialization for i = 0, but I prefer my solution, as it may be more cache friendly. We can implement Vec::operator= using Unroller now:
template<typename T, std::size_t Dim>
RT_FORCEINLINE Vec<T, Dim>& Vec<T, Dim>::operator=(const Vec& rhs)
{
    Et::Unroller<T*, const T*, Et::AssignOp<T>, 0, MaxIndex>::Go(
        Begin(), rhs.Begin());
    return *this;
}
(RT_FORCEINLINE is my wrapper over 'inline'; Et is the namespace where all the unrolling/expression templates code resides.) Now you know what we needed Begin() for, BTW. Another solution would be to make operator[] return const T& instead of just 'T' and write &rhs[0] instead of rhs.Begin(), but it seems a little "dirty" to me. In this implementation you can modify Begin() should the need ever arise. Let's now see whether our hard work has bought us anything:
    movl    8(%esp), %ecx
    movl    4(%esp), %eax
    movl    (%ecx), %edx
    movl    %edx, (%eax)
    movl    4(%ecx), %edx
    movl    %edx, 4(%eax)
    movl    8(%ecx), %edx
    movl    %edx, 8(%eax)
Nice, isn't it? It's exactly the same code as is generated for a[0] = b[0]; a[1] = b[1]; a[2] = b[2]; -- just what we wanted! Now we can extend our system by implementing the copy constructor in the same fashion and introducing +=/-= operators for in-place vector addition/subtraction. The only difference is that now we use AddAssignOp/SubAssignOp instead of AssignOp. The code for AddAssignOp follows:
template<typename T>
struct AddAssignOp
{
    static void Apply(T& lhs, const T& rhs) { lhs += rhs; }
};
The nice thing is that our Unroller works perfectly with these operations; we only need to provide a valid 'Op' template argument.
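For completeness, here's a sketch of what operator+= might look like; it simply mirrors the operator= shown earlier (treat it as an illustration, not the exact library code):

// A sketch mirroring the operator= above; assumes the same Begin()/MaxIndex
// conventions used throughout this article.
template<typename T, std::size_t Dim>
RT_FORCEINLINE Vec<T, Dim>& Vec<T, Dim>::operator+=(const Vec& rhs)
{
    Et::Unroller<T*, const T*, Et::AddAssignOp<T>, 0, MaxIndex>::Go(
        Begin(), rhs.Begin());
    return *this;
}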
Another common vector operation is multiplication/division by a scalar. In this case we need a special unroller, because one of the arguments (the scalar) shouldn't be indexed:
template<class LHS, class RHS, class Op, std::size_t i, std::size_t MaxIndex>
struct ScalarUnroller
{
    static void Go(LHS lhs, RHS rhs)
    {
        Op::Apply(lhs[i], rhs);
        ScalarUnroller<LHS, RHS, Op, i + 1, MaxIndex>::Go(lhs, rhs);
    }
};
I don't present the specialization for i = MaxIndex here; it should be obvious by now. An example, operator*= :
template<typename T, std::size_t Dim>
RT_FORCEINLINE Vec<T, Dim>& Vec<T, Dim>::operator*=(T s)
{
    Et::ScalarUnroller<T*, const T, Et::InPlaceMultOp<T>, 0, MaxIndex>::Go(
        Begin(), s);
    return *this;
}
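InPlaceMultOp isn't shown here; by analogy with AddAssignOp, a minimal sketch would be:

// Hypothetical sketch -- InPlaceMultOp is referenced above but not shown;
// by analogy with AddAssignOp it would look like this.
template<typename T>
struct InPlaceMultOp
{
    static void Apply(T& lhs, const T& rhs) { lhs *= rhs; }
};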
Division is implemented by multiplying by the reciprocal; this way we not only reuse the code, but also optimize (for Intel-family CPUs, at least):
template<typename T, std::size_t Dim>
RT_FORCEINLINE Vec<T, Dim>& Vec<T, Dim>::operator/=(T s)
{
    return this->operator*=(T(1) / s);
}
Another solution would be to create a wrapper over scalar types, which would return the same value for each call to operator[].
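Such a wrapper (not part of the presented code, just a sketch of the idea) might look like:

// Hypothetical sketch: makes a scalar look like an indexable container,
// so the ordinary Unroller could be reused for vector/scalar operations.
template<typename T>
class ScalarWrapper
{
public:
    explicit ScalarWrapper(T s) : m_s(s) { }
    T operator[](std::size_t) const { return m_s; } // same value for any index

private:
    T m_s;
};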
What's left? Vector length, normalization... But first, let's create the dot product of two vectors; it will help us with the other operations.
template<typename T, std::size_t i, std::size_t MaxIndex>
struct DotProductUnroller
{
    static T Go(const T* lhs, const T* rhs)
    {
        return (lhs[i] * rhs[i]) +
            DotProductUnroller<T, i + 1, MaxIndex>::Go(lhs, rhs);
    }
};

template<typename T, std::size_t Dim>
inline T Dot(const Vec<T, Dim>& a, const Vec<T, Dim>& b)
{
    return Et::DotProductUnroller<T, 0, Dim - 1>::Go(a.Begin(), b.Begin());
}
DotProductUnroller is very simple; it cannot have custom LHS/RHS. I don't provide the resulting assembly code, but believe me, it's as good as it gets. GetLength()/GetLengthSquared() are pretty trivial to implement now: v.GetLengthSquared() is just Dot(v, v) and v.GetLength() is sqrt(v.GetLengthSquared()). Simple as that. I have two normalization routines. The first one assumes that the vector is always valid (that is, has non-zero length; it verifies this with an assertion), the second one (named SafeNormalize) throws an exception in such a case. Below I present SafeNormalize:
template<typename T, std::size_t Dim>
void Vec<T, Dim>::SafeNormalize()
{
    const T len = GetLength();
    if (IsAlmostEqual(len, T(0)))
        throw ZeroLengthVector();
    (*this) *= T(1) / len;
}
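By the way, spelled out, the two length functions mentioned a moment ago could look like this (a sketch, assuming matching const-member declarations inside Vec):

#include <cmath>

// Sketches following the description above; assume matching declarations
// inside Vec and the Dot() function from the previous snippet.
template<typename T, std::size_t Dim>
inline T Vec<T, Dim>::GetLengthSquared() const
{
    return Dot(*this, *this);
}

template<typename T, std::size_t Dim>
inline T Vec<T, Dim>::GetLength() const
{
    return std::sqrt(GetLengthSquared());
}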
IsAlmostEqual performs a "comparison with tolerance", using an epsilon value from std::numeric_limits. Now, if someone would *really* like to play with templates, they could implement this in terms of "on-zero-vector" policies à la Alexandrescu, but I leave this as an exercise for the reader :-).
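A minimal sketch of such a comparison (the real routine may scale the tolerance differently):

#include <cmath>
#include <limits>

// Sketch: absolute-tolerance comparison as described above; an actual
// implementation may scale the epsilon by the magnitudes of the operands.
template<typename T>
inline bool IsAlmostEqual(T a, T b)
{
    return std::abs(a - b) <= std::numeric_limits<T>::epsilon();
}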
Uff, I think that by now you have the general idea of how this system works. Time for binary operators; the classic implementation of operator+ looks like this:
template<typename T, std::size_t Dim>
RT_FORCEINLINE Vec<T, Dim> operator+(const Vec<T, Dim>& lhs, const Vec<T, Dim>& rhs)
{
    Vec<T, Dim> nrv(lhs);
    nrv += rhs;
    return nrv;
}
This method relies on the NRVO (Named Return Value Optimization) feature of the compiler. It's present in GCC, so we can expect good code to be generated; please see the thread about ZUTO on the Boost discussion list for details and comparisons of different forms of (N)RVO. We do the same for subtraction and multiplication/division by a scalar. Following Michael Abrash's advice, we test these operators first, just to make sure there's a need to optimize them at all. The test consists of two functions:
Vec3f VecTest0(const Vec3f& a, const Vec3f& b, const Vec3f& c, const Vec3f& d)
{
    return (a + b + c) * 5.0f - d;
}

Vec4i VecTest1()
{
    Vec4i a;
    Vec4i b;
    Vec4i c;
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
    b[0] = 4; b[1] = 5; b[2] = 6; b[3] = 7;
    return a + b;
}
In the first case we have all the basic kinds of vector operations, executed on objects passed as arguments. Both functions will be compared to their hand-optimized equivalents:
StupidVec3f VecTest0HandOptimized(const StupidVec3f& a, const StupidVec3f& b,
                                  const StupidVec3f& c, const StupidVec3f& d)
{
    return StupidVec3f((a.x + b.x + c.x) * 5 - d.x,
                       (a.y + b.y + c.y) * 5 - d.y,
                       (a.z + b.z + c.z) * 5 - d.z);
}

Vec4i VecTest1HandOptimized()
{
    Vec4i a;
    Vec4i b;
    Vec4i c;
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
    b[0] = 4; b[1] = 5; b[2] = 6; b[3] = 7;
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
    return c;
}
StupidVec3f is just a flat structure of 3 floats. For the first example, the hand-optimized source gives us this code:
LFB3:
    pushl   %esi
LCFI48:
    pushl   %ebx
LCFI49:
    movl    20(%esp), %edx
    movl    16(%esp), %ebx
    movl    24(%esp), %ecx
    movl    28(%esp), %esi
    flds    (%edx)
    flds    4(%edx)
    flds    8(%edx)
    fxch    %st(2)
    fadds   (%ebx)
    fxch    %st(1)
    fadds   4(%ebx)
    fxch    %st(2)
    fadds   8(%ebx)
    fxch    %st(1)
    fadds   (%ecx)
    fxch    %st(2)
    fadds   4(%ecx)
    fxch    %st(1)
    fadds   8(%ecx)
    flds    LC5
    movl    12(%esp), %eax
    fmul    %st, %st(3)
    fmul    %st, %st(2)
    fxch    %st(3)
    fsubs   (%esi)
    fxch    %st(1)
    fmulp   %st, %st(3)
    fxch    %st(1)
    fsubs   4(%esi)
    fxch    %st(2)
    fsubs   8(%esi)
    fxch    %st(1)
    fstps   (%eax)
    fxch    %st(1)
    fstps   4(%eax)
    fstps   8(%eax)
    popl    %ebx
LCFI50:
    popl    %esi
LCFI51:
    ret     $4
Let's compare it to the assembly output for the version with "standard" operators:
LFB4:
    pushl   %esi
LCFI52:
    pushl   %ebx
LCFI53:
    subl    $52, %esp
LCFI54:
    flds    LC7
    movl    72(%esp), %edx
    movl    68(%esp), %ecx
    movl    76(%esp), %ebx
    movl    64(%esp), %eax
    flds    (%edx)
    flds    4(%edx)
    flds    8(%edx)
    fxch    %st(2)
    fadds   (%ecx)
    fxch    %st(1)
    fadds   4(%ecx)
    fxch    %st(2)
    fadds   8(%ecx)
    fxch    %st(1)
    fadds   (%ebx)
    fxch    %st(2)
    fadds   4(%ebx)
    fxch    %st(1)
    fadds   8(%ebx)
    fxch    %st(2)
    fmul    %st(3), %st
    fxch    %st(1)
    movl    80(%esp), %esi
    fmul    %st(3), %st
    fxch    %st(2)
    fmulp   %st, %st(3)
    fsts    (%eax)
    fxch    %st(1)
    fsts    4(%eax)
    fxch    %st(2)
    fsts    8(%eax)
    fxch    %st(1)
    fsubs   (%esi)
    fstps   (%eax)
    fxch    %st(1)
    fsubs   4(%esi)
    fstps   4(%eax)
    fsubs   8(%esi)
    fstps   8(%eax)
    addl    $52, %esp
LCFI55:
    popl    %ebx
LCFI56:
    popl    %esi
LCFI57:
    ret     $4
As you can see, GCC did a pretty good job with this one -- the snippets are almost the same! Also, for the second example the compiler was smart enough to calculate the value of the whole expression during compilation:
LFB8:
    subl    $60, %esp
LCFI60:
    movl    64(%esp), %eax
    movl    $5, (%eax)
    movl    $7, 4(%eax)
    movl    $9, 8(%eax)
    movl    $11, 12(%eax)
    addl    $60, %esp
LCFI61:
    ret     $4
It seems there's not much for us to fix. However, I decided to go further and see if I could push the performance even more. I was glad that GCC did such a good job, but on the other hand I wasn't sure about other compilers, possibly without NRVO support. Finally, I wanted to experiment a little bit with expression templates magic. The entry points for my tests were "Disambiguated Glommable Expression Templates Reintroduced" by Geoffrey Furnish, published in C++ Report (Oct 2002), and a small code snippet by Gottfried Chen posted on his page. Both of these examples had some flaws. The former was too complicated for my simple purposes: it would work great for a full-blown math library with many kinds of arbitrarily sized matrices and vectors, but for 3D engine purposes it's just overkill. Plus, it doesn't take advantage of compile-time loop unrolling. The latter used virtual functions, which may affect performance. I wanted to somehow combine the best of these two solutions. I've also found some nice ideas in Stroustrup's "TC++PL". Our goal is to eliminate all the temporaries and produce code as tight as possible for constructions like:
a = b + c + d;
Without any optimization we may have 2 temporary values here (assuming that no conversions are needed): one to hold the result of the (b + c) expression and another to hold the result of (b + c) + d before copying it to 'a'. Now, imagine that operator+ doesn't return a vector, but only a structure holding its two arguments, so that we can accumulate operations and evaluate the expression just before assigning the result to 'a'. The binary operation structure can look like this:
template<typename T, class LHS, class RHS, class Op>
struct BinOp
{
public:
    BinOp(LHS lhs, RHS rhs) : m_lhs(lhs), m_rhs(rhs) { }
    T operator[](std::size_t i) const { return Op::Apply(m_lhs[i], m_rhs[i]); }

private:
    LHS m_lhs;
    RHS m_rhs;
};
The binary operation takes two arguments in its constructor; they represent the left and right sides of the operation. Evaluation is delayed and happens via operator[]. Now let's define some functors that can be passed as the 'Op' template argument. Such a functor should have a static Apply function that performs an operation on two arguments and returns the result (of type 'T'). For example:
template<typename T>
struct AddOp
{
    static T Apply(const T& lhs, const T& rhs) { return lhs + rhs; }
};
This operation returns the sum of its two arguments (provided that type 'T' has operator+ defined). Now there's only one piece missing: the expression structure, or rather a kind of wrapper over an expression. In fact, this structure can hold sub-expressions of arbitrary complexity, communicating with them only via the subscript operator. Our implementation follows.
template<typename T, class Xpr>
class Expr
{
public:
    Expr(const Xpr& e) : m_expr(e) { }
    T operator[](std::size_t i) const { return m_expr[i]; }

private:
    const Xpr& m_expr;
};
This looks very simple and innocent, but as you'll see it allows for creating truly powerful constructions. Uff, OK, first task: addition of two vectors. The return type is Expr<T, BinOp<T, const T*, const T*, AddOp<T> > >; LHS and RHS are just pointers to the beginning of the vector data. The full operator looks like:
template<typename T, std::size_t Dim>
RT_FORCEINLINE Et::Expr<T, Et::BinOp<T, const T*, const T*, Et::AddOp<T> > >
operator+(const Vec<T, Dim>& lhs, const Vec<T, Dim>& rhs)
{
    typedef Et::BinOp<T, const T*, const T*, Et::AddOp<T> > OpT;
    return Et::Expr<T, OpT>(OpT(lhs.Begin(), rhs.Begin()));
}
A careful reader should notice that something is wrong: how are we supposed to assign the result of such an expression to a vector? We need an assignment operator/copy constructor taking expressions. Chen solved this problem by deriving all expressions from an Evaluatable base class with a virtual 'evaluateTo(Vector3<T>&)' method. I'd like to avoid virtual functions and prefer a simple kind of compile-time polymorphism, which may be achieved using member templates.
template<class Xpr>
Vec(const Et::Expr<T, Xpr>& rhs)
{
    Et::Unroller<T*, const Et::Expr<T, Xpr>&, Et::AssignOp<T>, 0, MaxIndex>::Go(
        Begin(), rhs);
}
Unroller comes to the rescue one more time. In the same fashion we can implement operator=, operator+=, operator-= and so on. However, we're not done with the whole system yet, not by a long shot. We also need operators covering (roughly speaking):

- an expression on the left and a vector on the right,
- a vector on the left and an expression on the right,
- two expressions,
- an expression and a scalar.
Why? Imagine for example:
a = b + c + d;
Here, the result of (b + c) is an expression, so we need an operator+ taking an expression (b + c) and a vector (d). One of our test cases:
return (a + b + c) * 5.0f - d;
is even more complicated. Here we need:

- operator+ taking two vectors (a + b),
- operator+ taking an expression and a vector ((a + b) + c),
- operator* taking an expression and a scalar ((a + b + c) * 5.0f),
- operator- taking an expression and a vector (subtracting d).
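As a taste, here's what the expression + vector overload could look like, following the same pattern as the vector + vector operator shown earlier (a sketch; the exact code is in the attached source):

// Sketch following the pattern of the vector + vector overload above.
template<typename T, class Xpr, std::size_t Dim>
RT_FORCEINLINE
Et::Expr<T, Et::BinOp<T, Et::Expr<T, Xpr>, const T*, Et::AddOp<T> > >
operator+(const Et::Expr<T, Xpr>& lhs, const Vec<T, Dim>& rhs)
{
    typedef Et::BinOp<T, Et::Expr<T, Xpr>, const T*, Et::AddOp<T> > OpT;
    return Et::Expr<T, OpT>(OpT(lhs, rhs.Begin()));
}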
You can find the implementation of all these combinations in the source code attached to this article. Uff, finally we can see the result of all this template magic. Here's the assembler output for both tests:
LFB4:
pushl %edi
LCFI52:
pushl %esi
LCFI53:
pushl %ebx
LCFI54:
subl $32, %esp
LCFI55:
flds LC7
leal 24(%esp), %edx
movl 56(%esp), %ecx
movl %edx, 16(%esp)
movl 52(%esp), %esi
leal 16(%esp), %edx
movl 60(%esp), %ebx
movl %edx, 8(%esp)
movl 64(%esp), %edi
leal 8(%esp), %edx
movl 48(%esp), %eax
movl %edx, (%esp)
flds (%ecx)
fadds (%esi)
fadds (%ebx)
fmul %st(1), %st
fsubs (%edi)
fstps (%eax)
flds 4(%ecx)
fadds 4(%esi)
fadds 4(%ebx)
fmul %st(1), %st
fsubs 4(%edi)
fstps 4(%eax)
flds 8(%ecx)
fadds 8(%esi)
fadds 8(%ebx)
fmulp %st, %st(1)
fsubs 8(%edi)
fstps 8(%eax)
addl $32, %esp
LCFI56:
popl %ebx
LCFI57:
popl %esi
LCFI58:
popl %edi
LCFI59:
ret $4
// -- And test #2
LCFI62:
movl 80(%esp), %eax
movl $5, (%eax)
movl $7, 4(%eax)
movl $9, 8(%eax)
movl $11, 12(%eax)
addl $76, %esp
LCFI63:
ret $4
Ain't this interesting? The code generated for test #1 is shorter than the one for the hand-optimized function :-). Is it also faster? Not necessarily -- 'fxch' operations are virtually free, so the longer version isn't automatically slower -- but this is something to find out with a profiler; guessing will lead us nowhere. Please notice how the final value of each component is calculated, and compare this to the previous snippets. The next step was an external benchmark (Jukka's), built around the following test function:
// count = 3200000 in this case
float TestFunc2(int count, float a, float b, float c)
{
    vec3f v(a, b, c);
    while (--count)
    {
        v = (v + v*3) / 2.000001f - v * c;
    }
    return DotProduct(v, v);
}
In the package there are the following implementations:
I couldn't compile test #5 (D3DX) as I had some problems with DX8 under MinGW; however, Jukka reported it's quite on par with implementations (1)/(2). Each function has been executed 16 times and the min/max/avg performance recorded (using the standard clock() function). The results are:
It's veeeery interesting. At first I thought it was a problem with timing and replaced clock() with timeGetTime(), but the results were basically the same. Version (3) (hand-optimized) is more than two times slower than the vector class. The function-based version is clearly the fastest one. I tried to plug in my library, but first I had to add one operator I had forgotten (division). The results were not so bad, but still a little disappointing:
8: min: 290 ms max: 311 ms average: 302.313 ms 0.424437 x
Definitely slower than Jukka's library, not even mentioning versions 1/2/6. Still, I was ready to sacrifice some speed for a more generic system, but this was also ~5 times slower than the fastest version! A little too much for me to accept, especially if this library was to be used in time-critical places (there's not much gain in using ETs if at the end of the day you have to hand-optimize your tight loops anyway). Now let's see how fast it works without ET (RTMATH_USES_EXPRESSION_TEMPLATES set to zero):
8: min: 100 ms max: 111 ms average: 103.25 ms 1.22458 x
Much better -- this is the second fastest version! At this point I could have finished the tests. However, I felt there was one more thing to do: speed up my ET version to reach at least 90% of the speed of Jukka's code (4). I chose this implementation as a reference point because, as was said, it's based on similar ideas. I didn't have much hope of creating something faster than ~200 ms/test; after all, my implementation was really very generic and required quite a lot from the compiler. Let's see what we can do, anyway... All the operators in rExpr.h take const T& as an argument. The argument type is usually float or double -- in any case, something that may be better passed by value. Let's modify this, as sketched below:
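In other words, signatures change along these lines (a sketch only, using the ScalarBinOp/MultOp names that appear a bit further down):

// Sketch of the modification: the scalar 's' is now passed by value
// instead of const T&. ScalarBinOp/MultOp follow the article's naming.
template<typename T, class Xpr>
RT_FORCEINLINE
Et::Expr<T, Et::ScalarBinOp<T, Et::Expr<T, Xpr>, Et::MultOp<T> > >
operator*(const Et::Expr<T, Xpr>& e, T s) // by value, not const T&
{
    typedef Et::ScalarBinOp<T, Et::Expr<T, Xpr>, Et::MultOp<T> > OpT;
    return Et::Expr<T, OpT>(OpT(e, s));
}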
8: min: 290 ms max: 311 ms average: 302.313 ms
Erm, OK, no change at all. Another thing: our "expression/scalar" division operator was quickly implemented using Et::DivOp. Now let's optimize it by hand, to force multiplication by the reciprocal:
template<typename T, class Xpr>
RT_FORCEINLINE
Et::Expr<T, Et::ScalarBinOp<T, Et::Expr<T, Xpr>, Et::MultOp<T> > >
operator/(const Et::Expr<T, Xpr>& e, T s)
{
    typedef Et::ScalarBinOp<T, Et::Expr<T, Xpr>, Et::MultOp<T> > OpT;
    const T r = T(1) / s;
    return Et::Expr<T, OpT>(OpT(e, r));
}
8: min: 110 ms max: 130 ms average: 118.313 ms
Uff, already much, much better. I was very curious how far I could push the performance. I also wanted to make sure that there weren't any unwanted temporary objects, so I made all the constructors, copy constructors and assignment operators of BinOp/ScalarBinOp private. A copy constructor was only needed for Expr; however, this time instead of trusting the compiler I wrote my own, inlined copy ctor (according to Lippman, this may help with RVO optimizations). Another thing, a little more subtle, was to make it easier for the compiler to unroll the loop. I created a basic function:
template<typename T, typename Container, class Op, std::size_t Dim>
RT_FORCEINLINE Vec<T, Dim>& VecAssign(Vec<T, Dim>& v, const Container& c, Op)
{
    Et::Unroller<T*, const Container&, Op, 0, Dim - 1>::Go(v.Begin(), c);
    return v;
}
and provided overloads for the most common kinds of vectors:
template<typename T, typename Container, class Op>
RT_FORCEINLINE Vec<T, 2>& VecAssign(Vec<T, 2>& v, const Container& c, Op)
{
    Op::Apply(v[0], c[0]);
    Op::Apply(v[1], c[1]);
    return v;
}
Results were better than I had expected:
8: min: 70 ms max: 81 ms average: 75.75 ms!
This is quite on par with the fastest possible implementation! However, there are still problems that I find with extensive use of templates. It's hard (unless your surname starts with 'A' and ends with 'u' :-) to say what impact a given change will have on your code. It's quite possible that the same alterations with one of them missing wouldn't give us much, and they work so well only in this combination. Experiments with ET are a little bit like defusing a bomb: you touch the wrong wire and everything explodes :-). Nevertheless, I expect that these modifications should speed up operations on all compilers. If they compile, that is, because Jukka has already informed me it doesn't work with MSVC :-). I really have no time right now to fix it; maybe someone out there can help? One more small refactor I'm considering is to finally add the policy-based normalization routine, but that has not much to do with speed, of course. Some cosmetic modifications also changed a little the way data is passed to/stored in BinOp. In the first version it was done by value, but since I mainly passed pointers (obtained via Vec::Begin()) it was not a big problem. However, in some situations the argument of the BinOp ctor may be an expression; in such cases it would be better to pass it by reference. The solutions I can think of are:
In the end I decided to choose the simplest thing that could possibly work: I modified the code so that all parameters are passed by reference and removed the calls to Vec::Begin(). Maybe in the future I'll switch to the second solution, which looks the most generic to me.
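Under those assumptions, the final shape of BinOp and operator+ might look roughly like this (a sketch; the actual RDE code may differ in details):

// Sketch only: BinOp stores const references and operator+ forwards the
// vectors themselves instead of Begin() pointers. As usual with expression
// templates, the references are only valid within the full expression.
template<typename T, class LHS, class RHS, class Op>
class BinOp
{
public:
    BinOp(const LHS& lhs, const RHS& rhs) : m_lhs(lhs), m_rhs(rhs) { }
    T operator[](std::size_t i) const { return Op::Apply(m_lhs[i], m_rhs[i]); }

private:
    const LHS& m_lhs;
    const RHS& m_rhs;
};

template<typename T, std::size_t Dim>
RT_FORCEINLINE
Et::Expr<T, Et::BinOp<T, Vec<T, Dim>, Vec<T, Dim>, Et::AddOp<T> > >
operator+(const Vec<T, Dim>& lhs, const Vec<T, Dim>& rhs)
{
    typedef Et::BinOp<T, Vec<T, Dim>, Vec<T, Dim>, Et::AddOp<T> > OpT;
    return Et::Expr<T, OpT>(OpT(lhs, rhs));
}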
The included source code is part of my RDE library. In the future I'm going to extend it, of course, by adding matrices, quaternions and splines. Make sure to visit RDE's page often, as I hope to be able to update this part of the library more frequently than the others (I work on them too, but only in my limited spare time; we're going into alpha stage, you know...).
-- Maciej Sinilo
mail: maciejsinilo@o2.pl (contact me in case of bugs, ideas, questions...)