Fast vector math library using ET & loop unrolling

Author:: Maciej "Yarpen" Sinilo (maciejsinilo@o2.pl)

Introduction

Everyone, who has written even one linear algebra package knows that there are many sections of code common for different types. After all, if we had all the time we wanted for computations we wouldn't bother with writing separate, optimized classes for 2D/3D/4D vectors, we would just create general N-D vector class and use it everywhere. Unfortunately in most cases the price you pay for such decision is loss of execution speed, especially if the dimension is dynamic (not known at compile time). Another common problem is that constructs like: a = b + c; suffer from unnecessary overhead connected with temporary objects. In this article I'll show how to avoid both of these problems using expression templates and template metaprogramming. All the code was compiled under MinGW 2.0.0 (GCC 3.2)

Basics

First, let's create some basic skeleton of N-dimensional vector class. Template arguments are: scalar type (type of vector components) and vector dimension:

template<typename T, std::size_t Dim>
class Vec
{
    enum { MaxIndex = Dim - 1 };

private:
    T   m_v[Dim];
};

As you can see we also define private MaxIndex constant, which contains maximum valid index for this vector. For example, 3D vector may be indexed with values from [0, 2] range. Even at this stage we can implement some of simple functions, which do not require any tweaks. These are:

default constructor. It does nothing, not even initialize components to zero. The rationale behind this is to make construction of vector arrays as fast as possible.
subscription operator, two versions, for const and non-const vectors: T operator[](std::size_t i) const; T& operator[](std::size_t i);
Begin() function, which returns the beginning of raw vector data. This is not so common vector class member, taken rather from std::vector. It's needed for our ET code (later).

Compile-time loop unrolling

Our first task: copying vectors. The straightforward and slow method would be:

    Vec(const Vec& rhs)
    {
        for (std::size_t i = 0; i < Dim; ++i)
        {
            m_v[i] = rhs[i];
        }
    }

It would be good if we were able just to trust that our compiler would unroll this loop during compilation stage. Theoretically this should be possible (Dim is constant), but as we all know practice has this nasty habit not to conform the theory. Let's see what GCC generates from this piece of code:

    xorl    %edx, %edx
    movl    8(%esp), %ecx
    movl    12(%esp), %ebx
L140:
    movl    (%ebx,%edx,4), %eax
    movl    %eax, (%ecx,%edx,4)
    incl    %edx
    cmpl    $2, %edx
    jbe L140

Yes, runtime loop. Too bad. This means we need to find a way to unroll the loop by hand, during the compile time. First let's create assignment operator:

template<typename T>
struct AssignOp
{
    static void Apply(T& lhs, const T& rhs)
    {
        lhs = rhs;
    }
};

Alternatively you can make Apply() template member of AssignOp, but I tend to choose the "weakest" solution that works. Member templates may cause problems, so we try to avoid them (even if they'll appear later anyway :). This operator should be applied to each vector element.

template<class LHS, class RHS, class Op, std::size_t i, std::size_t MaxIndex>
struct Unroller
{
    static void Go(LHS lhs, RHS rhs)
    {
        Op::Apply(lhs[i], rhs[i]);
        Unroller<LHS, RHS, Op, i + 1, MaxIndex>::Go(lhs, rhs);
    }
};

This is the unroller, which applies given operator (Op) to i-th element of LHS/RHS. The interesting part is second line, here we move to the next index. Now the only thing which is left is terminating when MaxIndex has been reached. I use template specialization for this:

template<class LHS, class RHS, class Op, std::size_t MaxIndex>
struct Unroller<LHS, RHS, Op, MaxIndex, MaxIndex>
{
    static void Go(LHS lhs, RHS rhs)
    {
        Op::Apply(lhs[MaxIndex], rhs[MaxIndex]);
    }
};

Code is essentially the same, but this time we just terminate. You've most probably seen implementations that traverse from Dim - 1 to 0, with specialization for i=0, but I prefer my solution as it may be more cache friendly. We can implement Vec::operator= using Unroller now:

template<typename T, std::size_t Dim> RT_FORCEINLINE
Vec<T, Dim>& Vec<T, Dim>::operator=(const Vec& rhs)
{
    Et::Unroller<T*, const T*, Et::AssignOp<T>, 0, MaxIndex>::Go(
        Begin(), rhs.Begin());
    return *this;
}

(RT_FORCEINLINE is my wrapper over 'inline', Et is namespace where all the unrolling/expression templates code resides). Now you know what we needed Begin() for, BTW. Other solution would be to make operator[] return const T& instead of just 'T' and doing &rhs[0] instead of rhs.Begin(), but it seems a little "dirty" to me. In this implementation you can modify Begin() should there arise a need for this. Let's now see whether our hard work has given us something:

    movl    8(%esp), %ecx
    movl    4(%esp), %eax
    movl    (%ecx), %edx
    movl    %edx, (%eax)
    movl    4(%ecx), %edx
    movl    %edx, 4(%eax)
    movl    8(%ecx), %edx
    movl    %edx, 8(%eax)

Nice, isn't it? It's exactly the same code as generated for: a[0] = b[0]; a[1] = b[1]; a[2] = b[2], just what we wanted! Now we can extend our system by implementing copy constructor in the same fashion and introducing +=/-= operators for in-place vector addition/subtraction. The only difference is that now we use AddAssignOp/SubAssignOp instead of AssignOp. Code for AddAssignOp follows:

template<typename T>
struct AddAssignOp
{
    static void Apply(T& lhs, const T& rhs)
    {
        lhs += rhs;
    }
};

The nice thing is that our Unroller works perfectly with this operators, we only need to provide valid 'Op' template argument.

Another common vector operation is multiply/divide by scalar. In this case we need special unroller, that's because one of the arguments (scalar) shouldn't be indexed:

template<class LHS, class RHS, class Op, std::size_t i, std::size_t MaxIndex>
struct ScalarUnroller
{
    static void Go(LHS lhs, RHS rhs)
    {
        Op::Apply(lhs[i], rhs);
        ScalarUnroller<LHS, RHS, Op, i + 1, MaxIndex>::Go(lhs, rhs);
    }
};

I don't present specialization for i=MaxIndex here, it should be obvious by now. Example of operator*= :

template<typename T, std::size_t Dim> RT_FORCEINLINE
Vec<T, Dim>& Vec<T, Dim>::operator*=(T s)
{
    Et::ScalarUnroller<T*, const T, Et::InPlaceMultOp<T>, 0, MaxIndex>::Go(
        Begin(), s);
    return *this;
}

Dividing is implemented by multiplying by reciprocal, this way not only we reuse the code, but also optimize (for Intel-family CPUs at least):

template<typename T, std::size_t Dim> RT_FORCEINLINE
Vec<T, Dim>& Vec<T, Dim>::operator/=(T s)
{
    return this->operator*=(T(1) / s);
}

Another solution would be to create wrapper over scalar types, which would return the same value for each call to operator[].

What's left? Vector length, normalization... But first, let's create dot product of two vectors, it will help us with other operations.

template<typename T, std::size_t i, std::size_t MaxIndex>
struct DotProductUnroller
{
    static T Go(const T* lhs, const T* rhs)
    {
        return (lhs[i] * rhs[i]) + 
            DotProductUnroller<T, i + 1, MaxIndex>::Go(lhs, rhs);
    }
};


template<typename T, std::size_t Dim>
inline 
T Dot(const Vec<T, Dim>& a, const Vec<T, Dim>& b)
{
    return Et::DotProductUnroller<T, 0, Dim - 1>::Go(a.Begin(), b.Begin());
}

DotProductUnroller is very simple, it cannot have custom LHS/RHS. I don't provide result assembly code, but believe me, it's just as good as it gets. GetLength()/GetLengthSquared() are pretty trivial to implement now. v.GetLengthSquared() is just Dot(v, v) and v.Length() is sqrt(v.GetLengthSquared()). Simple as that. I've two normalization routines. First one assumes that vector is always valid (that is, has non-zero length, it verifies this with assertion), second one (named SafeNormalize) throws exception in such case. Below I present SafeNormalize:

template<typename T, std::size_t Dim>
void Vec<T, Dim>::SafeNormalize()
{
    const T len = GetLength();
    if (IsAlmostEqual(len, T(0)))
    throw ZeroLengthVector();

    (*this) *= T(1) / len;
}

IsAlmostEqual performs "comparison with tolerance" using epsilon value from std::numeric_limits. Now, if someone would *really* like to play with templates, he could implement this in terms of "on-zero-vector" policies a'la Alexandrescu, but I leave this as an exercise for the reader :-).

uff, I think that by now you've the general idea of how this system works.

Eliminating temporaries

Let's now try to implement some basic arithmetic operations for vectors, like addition and subtraction. Recommended path in such cases is to implement operator+= first, then create binary operator+ using the former. The first stage is done, so it should be easy:

template<typename T, std::size_t Dim> RT_FORCEINLINE
Vec<T, Dim> operator+(const Vec<T, Dim>& lhs, const Vec<T, Dim>& rhs)
{
    Vec<T, Dim> nrv(lhs);
    nrv += rhs;
    return nrv;
}

This method bases on NRVO (Named Return Value Optimization) feature of the compiler. It's present in GCC, so we can expect good code being generated. Please see the thread about ZUTO at the Boost discussion list for details and comparisons of different forms of (N)RVO. We do the same for subtraction and multiplication/division by scalar. Following Michael Abrash advice we test these operators, just to make sure if there's a need to optimize it. Test consists of two functions:

Vec3f VecTest0(const Vec3f& a, const Vec3f& b, const Vec3f& c, const Vec3f& d)
{
    return (a + b + c) * 5.0f - d;
}

Vec4i VecTest1()
{
    Vec4i   a;
    Vec4i   b;
    Vec4i   c;
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
    b[0] = 4; b[1] = 5; b[2] = 6; b[3] = 7;

    return a + b;
}

In the first case we have all basic kinds of vector operations, executed on objects passed as arguments. Both these functions will be compared to their hand-optimized equivalents:

StupidVec3f VecTest0HandOptimized(const StupidVec3f& a, 
    const StupidVec3f& b, const StupidVec3f& c, const StupidVec3f& d)
{
    return StupidVec3f((a.x + b.x + c.x) * 5 - d.x, 
        (a.y + b.y + c.y) * 5 - d.y, 
        (a.z + b.z + c.z) * 5 - d.z);
}

Vec4i VecTest4HandOptimized()
{
    Vec4i   a;
    Vec4i   b;
    Vec4i   c;
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
    b[0] = 4; b[1] = 5; b[2] = 6; b[3] = 7;

    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
    return c;
}

StupidVec3f is just flat structure of 3 floats. For the first example hand-optimized source gives us this code:

LFB3:
    pushl   %esi
LCFI48:
    pushl   %ebx
LCFI49:
    movl    20(%esp), %edx
    movl    16(%esp), %ebx
    movl    24(%esp), %ecx
    movl    28(%esp), %esi
    flds    (%edx)
    flds    4(%edx)
    flds    8(%edx)
    fxch    %st(2)
    fadds   (%ebx)
    fxch    %st(1)
    fadds   4(%ebx)
    fxch    %st(2)
    fadds   8(%ebx)
    fxch    %st(1)
    fadds   (%ecx)
    fxch    %st(2)
    fadds   4(%ecx)
    fxch    %st(1)
    fadds   8(%ecx)
    flds    LC5
    movl    12(%esp), %eax
    fmul    %st, %st(3)
    fmul    %st, %st(2)
    fxch    %st(3)
    fsubs   (%esi)
    fxch    %st(1)
    fmulp   %st, %st(3)
    fxch    %st(1)
    fsubs   4(%esi)
    fxch    %st(2)
    fsubs   8(%esi)
    fxch    %st(1)
    fstps   (%eax)
    fxch    %st(1)
    fstps   4(%eax)
    fstps   8(%eax)
    popl    %ebx
LCFI50:
    popl    %esi
LCFI51:
    ret $4

Let's compare it to assembly output for version with "standard" operators:

LFB4:
    pushl   %esi
LCFI52:
    pushl   %ebx
LCFI53:
    subl    $52, %esp
LCFI54:
    flds    LC7
    movl    72(%esp), %edx
    movl    68(%esp), %ecx
    movl    76(%esp), %ebx
    movl    64(%esp), %eax
    flds    (%edx)
    flds    4(%edx)
    flds    8(%edx)
    fxch    %st(2)
    fadds   (%ecx)
    fxch    %st(1)
    fadds   4(%ecx)
    fxch    %st(2)
    fadds   8(%ecx)
    fxch    %st(1)
    fadds   (%ebx)
    fxch    %st(2)
    fadds   4(%ebx)
    fxch    %st(1)
    fadds   8(%ebx)
    fxch    %st(2)
    fmul    %st(3), %st
    fxch    %st(1)
    movl    80(%esp), %esi
    fmul    %st(3), %st
    fxch    %st(2)
    fmulp   %st, %st(3)
    fsts    (%eax)
    fxch    %st(1)
    fsts    4(%eax)
    fxch    %st(2)
    fsts    8(%eax)
    fxch    %st(1)
    fsubs   (%esi)
    fstps   (%eax)
    fxch    %st(1)
    fsubs   4(%esi)
    fstps   4(%eax)
    fsubs   8(%esi)
    fstps   8(%eax)
    addl    $52, %esp
LCFI55:
    popl    %ebx
LCFI56:
    popl    %esi
LCFI57:
    ret $4

As you can see, GCC did pretty good job with this one, the snippets are almost the same! Also for the second example compiler was smart enough to calculate the value of expression during compilation:

LFB8:
    subl    $60, %esp
LCFI60:
    movl    64(%esp), %eax
    movl    $5, (%eax)
    movl    $7, 4(%eax)
    movl    $9, 8(%eax)
    movl    $11, 12(%eax)
    addl    $60, %esp
LCFI61:
    ret $4

It seems there's not much of us to fix. However, I decided to go further and see if I can push the performance even more. I was glad that GCC did so good job, but on the other hand I wasn't sure about other compilers, possibly w/o NRVO support. Finally, I wanted to experiment a little bit with expression templates magic. Entry points for my tests were: "Disambiguated Glommable Expression Templates Reintroduced" by Geoffrey Furnish, published in C++ Report (Oct 2002) and small code snippet by Gottfried Chen posted at his page. Both these examples had some flaws. The former was too complicated for my simple purposes. It would work great for full-blown math library with many kinds of arbitraly sized matrices and vectors, but for 3D engine purposes it's just overkill. Plus, it doesn't take advantage of compile-time loop unrolling. The latter used virtual functions, this may affect the performance. I wanted to combine somehow the bese of these two solutions. I've also found some nice ideas in Stroustrup's "TC++L". Our goal is to eliminate all the temporaries and produce as tight code as it's possible for constructions like:

    a = b + c + d;

Without any optimization we may have 2 temporary values here (assuming that no conversions are needed). One to hold the result of (b + c) expression and another one to hold the result (b + c) + d before copying it to 'a'. Now, imagine that operator+ doesn't return vector, but only structure holding it's two arguments, so that we can accumulate operations and evaluate the expression just before assigning the result to 'a'. Binary operator structure can look like:

template<typename T, class LHS, class RHS, class Op>
struct BinOp
{
public:
    BinOp(LHS lhs, RHS rhs)
    :   m_lhs(lhs),
        m_rhs(rhs)
    {
        
    }

    T operator[](std::size_t i) const
    {
        return Op::Apply(m_lhs[i], m_rhs[i]);
    }

private:
    LHS m_lhs;
    RHS m_rhs;
};

Binary operator takes two arguments in the constructor -- they represent left & right side of the operator. Evaluation is delayed, and can be executed via operator[]. Now let's define some functors that can be passed as 'Op' template argument. Such functor should have static Apply function that performs operation on two arguments and return a result (of type 'T'). For example:

template<typename T>
struct AddOp
{
    static T Apply(const T& lhs, const T& rhs)
    {
        return lhs + rhs;
    }
};

This operator returns sum of its two arguments (provided that 'T' type has '+' operator defined). Now there's only one piece missing -- the expression structure. Or rather a kind of wrapper over the expression. In fact, this structure can hold sub-expressions of arbitrary complexity, communicating with them only via subscription operator. Our implementation follows.

template<typename T, class Xpr>
class Expr
{
public:
    Expr(const Xpr& e)
    :   m_expr(e)
    {
        
    }

    T operator[](std::size_t i) const
    {
        return m_expr[i];
    }

private:
    const Xpr&  m_expr;
};

This looks very simple and innocent, but as you'll see it allows for creating truly powerful constructions. Uff, OK, first task: addition of two vectors. Return type is Expr<T, BinOp<T, const T*, const T*, AddOp<T> >. LHS and RHS are just pointers to the beginning of vector data. Full operator looks like:

template<typename T, std::size_t Dim> RT_FORCEINLINE                    
Et::Expr<T, Et::BinOp<T, const T*, const T*, Et::AddOp<T> > >
operator+(const Vec<T, Dim>& lhs, const Vec<T, Dim>& rhs)     
{                                                                       
    typedef Et::BinOp<T, const T*, const T*, Et::AddOp<T> > OpT;        
    return Et::Expr<T, OpT>(OpT(lhs.Begin(), rhs.Begin()));             
}

Careful reader should notice that something is wrong. How are we supposed to assign the result of such expression to vector? We need assignment operator/copy constructor taking expressions. Chen solved this problem by deriving all expressions from Evaluatable base class with virtual 'evaluateTo(Vector3<T>&)' method. I'd like to avoid virtual functions and prefer a simple kind of compile-time polymorphism, which may be achieved using member templates.

template<class Xpr> Vec(const Et::Expr<T, Xpr>& rhs)
{
    Et::Unroller<T*, const E&, Et::AssignOp<T>, 0, MaxIndex>::Go(
        Begin(), rhs);
}

Unroller comes to the rescue one more time. In the same fashion we can implement operator=, operator+=, operator-= etc. However, we're not done with the whole system yet, not by the long shot. We need also the following operators:

expression/expression operations
scalar/vector & vector/scalar operations
scalar expression/"standard" expression and the opposite
vector/expression & expression/vector
and maybe also some unary operators (like negation)

Why? Imagine for example:

    a = b + c + d;

Here, the result of (b + c) is expression, so we need to have operator+ taking expression (b + c) and vector (d). One of our test cases:

    return (a + b + c) * 5.0f - d;

is even more complicated. Here we need:

operator+(V, V)
operator+(E, V)
operator*(E, S)
operator-(E, V).

You can find the implementation of all these combinations in source code attached to this article. Uff, finally, we can see the result of all this template magic. Here's assembler output for both tests:

LFB4:
    pushl   %edi
LCFI52:
    pushl   %esi
LCFI53:
    pushl   %ebx
LCFI54:
    subl    $32, %esp
LCFI55:
    flds    LC7
    leal    24(%esp), %edx
    movl    56(%esp), %ecx
    movl    %edx, 16(%esp)
    movl    52(%esp), %esi
    leal    16(%esp), %edx
    movl    60(%esp), %ebx
    movl    %edx, 8(%esp)
    movl    64(%esp), %edi
    leal    8(%esp), %edx
    movl    48(%esp), %eax
    movl    %edx, (%esp)
    flds    (%ecx)
    fadds   (%esi)
    fadds   (%ebx)
    fmul    %st(1), %st
    fsubs   (%edi)
    fstps   (%eax)
    flds    4(%ecx)
    fadds   4(%esi)
    fadds   4(%ebx)
    fmul    %st(1), %st
    fsubs   4(%edi)
    fstps   4(%eax)
    flds    8(%ecx)
    fadds   8(%esi)
    fadds   8(%ebx)
    fmulp   %st, %st(1)
    fsubs   8(%edi)
    fstps   8(%eax)
    addl    $32, %esp
LCFI56:
    popl    %ebx
LCFI57:
    popl    %esi
LCFI58:
    popl    %edi
LCFI59:
    ret $4

// -- And test #2

LCFI62:
    movl    80(%esp), %eax
    movl    $5, (%eax)
    movl    $7, 4(%eax)
    movl    $9, 8(%eax)
    movl    $11, 12(%eax)
    addl    $76, %esp
LCFI63:
    ret $4

Ain't this interesting? Code generated for test#1 is shorter than the one for hand-optimized function :-). Is it also faster? Not necessarily, as 'fxch' operations are virtually free, but this is something to find out with profiler, guessing will lead us nowhere. Please notice how final value of each component is calculated and compare this to previous snippets.

Optimizations, modifications

Two days after I had published the article, I found this thread on Flipcode: http://www.flipcode.com/cgi-bin/msg.cgi?showThread=00003401&forum=general&id=-1 . Basically it's about a little different topic, but there's also very interesting sub-thread about vector-math and ETs. One of the participants, Jukka Liimatta, presented his own math library (ET based) + test application. It also seems that we independently developed quite similar code. Main difference is that Jukka focuses on 3D vectors and provides specialization both for vector class (loop-unrolling by hand) and expressions. In his code there's also given another reference that may prove very valuable (sadly I had no occasion to read it) - "Jim Blinn's Corner: Notation, Notation, Notation". Complete test package can be downloaded here. It seemed very interesting to me, so I decided to test my engine against the included ones. All the timing/testing was done after compiling the code with MinGW 2.0.0 (with -O3), executable ran on Celeron 1.2Ghz, WindowsXP. Test code is:

// count = 3200000 in this case
float TestFunc2(int count, float a, float b, float c)
{
    vec3f v(a,b,c);
    while ( --count )
    {
        v = (v + v*3) / 2.000001f - v * c;
    }
    return DotProduct(v,v);
}

In the package there are the following implementations:

"typical" vector class (no templates), binary operators not implemented in terms of +=/-= etc (1)
analogical vector class, but this time template (argument: type) (2)
hand-optimized version, not using vector class, only scalars (3)
Jukka's prmath library (4)
version using D3DX vector (5)
function based implementation. Interesting solution employing macro-like functions (div, madd, etc) (6)
template vector class presented by cgk (in mentioned thread) (7)

I couldn't compile test #5 (D3DX) as I had some problems with DX8 under MinGW, however Jukka reported it's quite on par with (1)/(2) implementation. Each function has been executed 16 times and min/max/avg performance has been recorded (using standard clock() function). The results are:

1: min: 120 ms max: 141 ms average: 126.437 ms 1 x
2: min: 110 ms max: 120 ms average: 115.188 ms 1.09767 x
3: min: 301 ms max: 331 ms average: 313.563 ms 0.403229 x
4: min: 190 ms max: 201 ms average: 196.563 ms 0.643243 x
6: min: 70 ms max: 81 ms average: 76.3125 ms 1.65684 x
7: min: 360 ms max: 380 ms average: 365.563 ms 0.345871 x

It's veeeery interesting. First, I thought it's the problem with timing and replaced clock() with timeGetTime(), but the results were basically the same. Version (3) (hand-optimized) is more then two times slower than vector class. Function-based is clearly the fastest one. I tried to plug in my library, but first I had to add one operator I had forgotten (division). The results were not so bad, but still a little disappointing:

8: min: 290 ms max: 311 ms average: 302.313 ms 0.424437 x

Definitelly slower than Jukka's library, not even mentioning 1/2/6 versions. Still, I was ready to sacrifice some speed for more generic system, but it was also ~5 times slower than the fastest version! A little too much for me to accept, especially if this library was to be used also in time-critical place (there's not much gain of using ETs if at the end of a day you have to hand-optimize your tight loops anyway). Now let's see how fast would it work without ET (RTMATH_USES_EXPRESSION_TEMPLATES set to zero):

8: min: 100 ms max: 111 ms average: 103.25 ms 1.22458 x

Much better, this is second fastest version! At this point I could finish tests. However, I felt there's one more thing to do -- speed up my ET version to reach at least 90% of Jukka's code (4). I chose this implementation as a reference point because, as it was said, it based on similiar ideas. I didn't have much hope to create something faster than ~200ms/test. After all my implementation was really very generic and required quite a lot from a compiler. Let's see what we can do, anyway... All the operators from rExpr.h take const T& as argument. Type of argument s usually float or double, anyway, something that may be better to pass by value. Let's modify this.

8: min: 290 ms max: 311 ms average: 302.313 ms

Erm, OK. Another thing, our "expression/scalar" operator was quickly implemented using Et::DivOp. Now let's optimize it by hand, to force multiplication by reciprocal:

template<typename T, class Xpr> RT_FORCEINLINE
Et::Expr<T, Et::ScalarBinOp<T, Et::Expr<T, Xpr>, Et::MultOp<T> > >
operator/(const Et::Expr<T, Xpr>& e, T s)
{
    typedef Et::ScalarBinOp<T, Et::Expr<T, Xpr>, Et::MultOp<T> > OpT;
    const T r = T(1) / s;
    return Et::Expr<T, OpT>(OpT(e, r));
}

8: min: 110 ms max: 130 ms average: 118.313 ms

Uff, already much, much better. I was very curious how far can I push the performance. I also wanted to make sure that there aren't any unwanted temporary objects. I made all the constructors, copy constructors and assignment operators for BinOp/ScalarBinOp private. Copy constructor was only needed for Expr. However, this time instead of trusting the compiler I generated my own, inlined copy ctor (according to Lippman, this may help in RVO optimizations). Another thing, a little more subtle, was to make it easier for the compiler to unroll the loop. I created basic function:

template<typename T, typename Container, class Op, std::size_t Dim>
RT_FORCEINLINE
Vec<T, Dim>& VecAssign(Vec<T, Dim>& v, const Container& c, Op)
{
    Et::Unroller<T*, const Container&, Op, 0, Dim - 1>::Go(v.Begin(), c);
    return v;
}

and provided overloads for the most common kinds of vectors:

template<typename T, typename Container, class Op> RT_FORCEINLINE
Vec<T, 2>& VecAssign(Vec<T, 2>& v, const Container& c, Op)
{
    Op::Apply(v[0], c[0]);
    Op::Apply(v[1], c[1]);
    return v;
}

Results were better than I had expected:

8: min: 70 ms max: 81 ms average: 75.75 ms!

This is quite on par with the fastest possible implementation! However, there are still problems that I find with extensive use of templates. It's hard (unless your surname starts with 'A' and ends with 'u' :-) to say what impact on your code will have certain change. It's quite possible that the same alterations with one of them missing wouldn't give us much, and they work so well only in this combination. Experiments with ET are a little bit like unarming the bomb, you touch the wrong wire and everything explodes :-). Nevertheless, I expect that these modifications should speed up operations on all compilers. If they compile, that's, because Jukka already informed me it doesnt work with MSVC :-). I really have no time right now to fix it, maybe someone out there may help? One more small refactor I'm considering is to finally add this policy-based normalization routine, but this has not much to do with the speed of course. Some cosmetic modifications were also to change a little bit passing/storing data to BinOp. In first version it was done by value, but since I mainly passed pointers (using Vec::Begin()) it was not a big problem. However, in some situation it's possible that argument of BinOp ctor is expression. In such case it would be better to send it via reference. The solutions I can think of are:

using ConstRef wrapper (as described in [2])
type traits with some kind of ParameterType

In the end I decided to choose the simplest way the can possibly work -- I modified the code so that all parameters are passed by reference and removed calls to Vec::Begin(). Maybe in the future I'll switch to the second solution, which looks the most generic to me.

Conclusions, future work

What did I learn during creating this small library? First of all, that GCC optimizes code pretty well :-). Second, that (N)RVO is just behind the corner. I'd say that if your compiler can handle it -- use it and forget about ET for small types as vectors. It's good to have this code somewhere anyway, so when it's possible to switch to ET when needed (after making sure that compiler cannot eliminate temporaries by itself). ET code is more complex, and harder to understand, with very cryptic error messages... OTOH it's also a good way to learn more about templates and the way your compiler works. Also, there's a limit to which your compiler can inline all the templates you use. For example, I found out that all my unrolling/ET stops to work when I use classes derived from Vec (I tried Vec3 : public Vec<T, 3>). The problem is that behaviour may vary from compiler to compiler, so it's hard to find one universal solution that fullfills all needs. Some compilers may not even accept ET code at all, you've to remember that GCC 3.2 is pretty standard-conforming.

Included source code is part of my RDE library. In the future I'm going to extend it of course, by adding matrices, quaternions, splines. Make sure to visit RDE's page often, as I hope to be able to update this part of library more frequently then others (I work on them, too, but only in my limited spare time, we're going into alpha stage, you know...).

-- Maciej Sinilo
mail: maciejsinilo@o2.pl (contact me in case of bugs, ideas, questions...)

Source code: RtMath

References

[1] Chen G., code snippet for expression templates, published at www.chengine.com
[2] Furnish G. "Disambiguated Glommable Expression Templates Reintroduced", C++ Report, 10/2002
[3] Stroustrup B., "Jezyk C++" (Polish edition of "The C++ Programming Language")
[4] Thread on Flipcode.com
[5] Source code by Jukka Liimatta (mathtest.zip).