Vector swizzling in C++

Everyone who’s done at least some vertex/pixel shader or HLSL programming has probably encountered a mechanism called “swizzling”. It’s an operation that creates a new vector from arbitrarily selected components of another vector (a little bit similar to SSE shuffling). A code snippet is worth 100 words, so here are some examples:

a = b.zyzx;   // a.x = b.z, a.y = b.y, a.z = b.z, a.w = b.x
a = b.wy;     // a.x = b.w, a.y = b.y, a.z = b.y, a.w = b.y
a = b.z;      // a.x = a.y = a.z = a.w = b.z

Vector swizzling may come in handy in C++ as well. Recently, I saw a discussion about it on a programming forum and thought that implementing it could be an interesting experiment. The most straightforward, brute-force way would be to simply generate a method for every possible component combination, but that doesn’t sound very interesting. I started with another simple approach, where you pass 4 component indices to a function and hope that the compiler will figure out they’re all constant and optimize it nicely. Code:

// Component indices into Vec4::m_v.
enum EVecCoord
{
    X, Y, Z, W
};

struct Vec4
{
    Vec4(float x, float y, float z, float w)
    {
        m_v[X] = x;
        m_v[Y] = y;
        m_v[Z] = z;
        m_v[W] = w;
    }
    // Builds a new vector from the four selected components of this one.
    Vec4 Swizzle(EVecCoord c0, EVecCoord c1, EVecCoord c2, EVecCoord c3) const
    {
        return Vec4(m_v[c0], m_v[c1], m_v[c2], m_v[c3]);
    }
    float    m_v[4];
};

// Defined in another translation unit, so the compiler cannot see its body.
void Foo(const Vec4& v);

// Test function
void SwizzleTest_3(const Vec4& v)
{
    Vec4 v2 = v.Swizzle(X, X, W, Y);
    Foo(v2);
    Vec4 v3 = v.Swizzle(X, Y, Y, Y);
    Foo(v3);
}
Foo is a dummy external function that prevents the compiler from optimizing Swizzle() away. Let’s take a look at the generated assembly (MSVC):
; 481  :     Vec4 v2 = v.Swizzle(X, X, W, Y);
mov       esi, DWORD PTR _v$[ebp]
movss    xmm0, DWORD PTR [esi]
movss    DWORD PTR _v2$[ebp], xmm0
movss    DWORD PTR _v2$[ebp+4], xmm0
movss    xmm0, DWORD PTR [esi+12]
; 482  :     Foo(v2);
lea        eax, DWORD PTR _v2$[ebp]
movss    DWORD PTR _v2$[ebp+8], xmm0
movss    xmm0, DWORD PTR [esi+4]
push      eax
movss    DWORD PTR _v2$[ebp+12], xmm0
call        ?Foo@@YAXABUVec4@@@Z            ; Foo
; 483  :     Vec4 v3 = v.Swizzle(X, Y, Y, Y);
movss    xmm0, DWORD PTR [esi]
; 484  :     Foo(v3);
lea        eax, DWORD PTR _v3$[ebp]
movss    DWORD PTR _v3$[ebp], xmm0
movss    xmm0, DWORD PTR [esi+4]
push      eax
movss    DWORD PTR _v3$[ebp+4], xmm0
movss    DWORD PTR _v3$[ebp+8], xmm0
movss    DWORD PTR _v3$[ebp+12], xmm0
call        ?Foo@@YAXABUVec4@@@Z            ; Foo
Looking good, more or less what I’d code by hand. I could stop here, to be honest, but this was supposed to be an experiment, so let’s play along. There are two things I’m still not 100% happy about. First, I had to repeat the ‘Y’ component manually; it’s not replicated automatically like in the shader example at the top. Second, even after checking the generated code, relying on the compiler to perform all those optimizations makes me a little bit anxious. Let’s try a template version, where everything should be guaranteed to resolve at compile time. This time I provide four different methods:
template<EVecCoord c0>
Vec4 Swizzle() const
{
    return Vec4(m_v[c0], m_v[c0], m_v[c0], m_v[c0]);
}

template<EVecCoord c0, EVecCoord c1>
Vec4 Swizzle() const
{
    return Vec4(m_v[c0], m_v[c1], m_v[c1], m_v[c1]);
}

template<EVecCoord c0, EVecCoord c1, EVecCoord c2>
Vec4 Swizzle() const
{
    return Vec4(m_v[c0], m_v[c1], m_v[c2], m_v[c2]);
}

template<EVecCoord c0, EVecCoord c1, EVecCoord c2, EVecCoord c3>
Vec4 Swizzle() const
{
    return Vec4(m_v[c0], m_v[c1], m_v[c2], m_v[c3]);
}
// Test
Vec4 v2 = v.Swizzle<X, X, W, Y>();
Foo(v2);
Vec4 v3 = v.Swizzle<X, Y>();
Foo(v3);
The generated assembly is exactly the same as before, but we stress the compiler a little less, since we don’t really pass any arguments to the function. This version also lets us specify only 2 components in the second Swizzle call. One tiny drawback is that every distinct component combination is a separate function. Sure, they’re inlined in the end anyway, but the .obj file still contains the generated routines. They will be stripped eventually, but that’s a little bit of additional work for the linker.
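For readers who want to try the template variant in isolation, here is a minimal self-contained sketch. It assumes a C++11 compiler, which is newer than what the listings above were built with: the constexpr qualifiers and the static_assert checks are my additions, not part of the original code, and the names kSource, kXXWY, kXY and TemplateSwizzleDemo are made up for illustration. The three-index overload is left out to keep it short. The point is only to show that, with constant indices, the result can be checked by the compiler itself.

```cpp
#include <cassert>

enum EVecCoord { X, Y, Z, W };

struct Vec4
{
    // constexpr is an assumption of this sketch (C++11), not in the original.
    constexpr Vec4(float x, float y, float z, float w) : m_v{x, y, z, w} {}

    // One index: replicated into all four components.
    template<EVecCoord c0>
    constexpr Vec4 Swizzle() const
    {
        return Vec4(m_v[c0], m_v[c0], m_v[c0], m_v[c0]);
    }

    // Two indices: the last one fills the remaining components.
    template<EVecCoord c0, EVecCoord c1>
    constexpr Vec4 Swizzle() const
    {
        return Vec4(m_v[c0], m_v[c1], m_v[c1], m_v[c1]);
    }

    // Four indices: fully explicit.
    template<EVecCoord c0, EVecCoord c1, EVecCoord c2, EVecCoord c3>
    constexpr Vec4 Swizzle() const
    {
        return Vec4(m_v[c0], m_v[c1], m_v[c2], m_v[c3]);
    }

    float m_v[4];
};

// Evaluated at compile time: a failed check is a build error.
constexpr Vec4 kSource(1.0f, 2.0f, 3.0f, 4.0f);
constexpr Vec4 kXXWY = kSource.Swizzle<X, X, W, Y>();
constexpr Vec4 kXY   = kSource.Swizzle<X, Y>();
static_assert(kXXWY.m_v[0] == 1.0f && kXXWY.m_v[1] == 1.0f &&
              kXXWY.m_v[2] == 4.0f && kXXWY.m_v[3] == 2.0f, "XXWY");
static_assert(kXY.m_v[0] == 1.0f && kXY.m_v[1] == 2.0f &&
              kXY.m_v[2] == 2.0f && kXY.m_v[3] == 2.0f, "XY replicates Y");

// The same calls still work on ordinary run-time values.
inline void TemplateSwizzleDemo()
{
    const Vec4 v(5.0f, 6.0f, 7.0f, 8.0f);
    const Vec4 r = v.Swizzle<W>();
    assert(r.m_v[0] == 8.0f && r.m_v[3] == 8.0f);
}
```

The overloads differ only in the number of template arguments, so Swizzle<X, Y>() can only match the two-index version.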

Are we done? Normally we would be, but this is an experiment, remember? I thought it’d be cool to have a version where you don’t have to use commas, so you could write a = b.Swizzle(WYZY), for example. First, I needed all the possible combinations of components. It was simple to generate them with a Python script (you can get it here; it requires Python 2.6 for the itertools module… I think Python may have libraries for just about everything. I fully expect version 3.0 to come with life.findMeaning). [I had a Scala version as well. I wanted to learn it, but somehow it didn’t click for me, so I must find another language. Scala looks like it may be really efficient, but it doesn’t lend itself that well to home fun, IMHO.]

The idea was to pass a single ‘mask’ and then somehow extract the components from it. At first, I thought about having an offset table, but that would stress the compiler even more, plus I’d have to obtain an index into this table (from the mask) anyway. Then again, if I was going to generate an index, I could just as well compute the offset directly. That’s what I did. Every component combination is an 8-bit mask with 2 bits per component; the masks are generated by the same Python script. Here’s the non-template solution:

Vec4 Swizzle(EVecCoord c) const
{
    return Vec4(m_v[c], m_v[c], m_v[c], m_v[c]);
}
Vec4 Swizzle(EVecSwizzle2 swizzle) const
{
    return Vec4(m_v[swizzle & 0x3], m_v[swizzle >> 2], m_v[swizzle >> 2], m_v[swizzle >> 2]);
}
Vec4 Swizzle(EVecSwizzle3 swizzle) const
{
    return Vec4(m_v[swizzle & 0x3], m_v[(swizzle >> 2) & 0x3],
        m_v[(swizzle >> 4) & 0x3], m_v[(swizzle >> 4) & 0x3]);
}
__forceinline Vec4 Swizzle(EVecSwizzle4 swizzle) const
{
    return Vec4(m_v[swizzle & 0x3], m_v[(swizzle >> 2) & 0x3],
        m_v[(swizzle >> 4) & 0x3], m_v[swizzle >> 6]);
}
// Test
Vec4 v2 = v.Swizzle(XXWY);
Foo(v2);
Vec4 v3 = v.Swizzle(XY);
Foo(v3);
The generated assembly is exactly the same as before. As you can see, I had to force inlining for the 4-component version; the compiler wouldn’t do it automatically. Finally, just for kicks, there’s also a template version taking a mask argument. Sadly, it only supports the full, 4-component mask, as the compiler cannot differentiate between functions with the same name whose template arguments are different enum types. For the non-template versions the syntax can be made even shorter by using operator() instead of the Swizzle() method (personally, I prefer more explicit ways), so you’d write:
Vec4 v2 = v(XYWW);
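To make the mask layout concrete, here is a hedged sketch of how the pieces could fit together. The enumerators in the original are produced by the Python script, which is not shown here, so MakeMask4 is a hypothetical stand-in for it (and needs C++11 constexpr). The bit layout is the one the Swizzle overloads above decode, with the first component in the lowest two bits. The template-on-mask method is my guess at the variant mentioned in the text, not the author’s actual code, and only three of the 256 possible masks are listed.

```cpp
#include <cassert>

enum EVecCoord { X, Y, Z, W };

// Hypothetical helper standing in for the generator script: packs four
// component indices, 2 bits each, first component in the lowest bits.
constexpr int MakeMask4(EVecCoord c0, EVecCoord c1, EVecCoord c2, EVecCoord c3)
{
    return c0 | (c1 << 2) | (c2 << 4) | (c3 << 6);
}

// Only three of the 256 possible 4-component masks are listed here.
enum EVecSwizzle4
{
    XXWY = MakeMask4(X, X, W, Y),
    XYWW = MakeMask4(X, Y, W, W),
    WZYX = MakeMask4(W, Z, Y, X)
};

static_assert(XXWY == 0x70, "X | X<<2 | W<<4 | Y<<6");

struct Vec4
{
    Vec4(float x, float y, float z, float w)
    {
        m_v[X] = x; m_v[Y] = y; m_v[Z] = z; m_v[W] = w;
    }
    // Mask passed as an ordinary argument, decoded 2 bits at a time.
    Vec4 Swizzle(EVecSwizzle4 s) const
    {
        return Vec4(m_v[s & 0x3], m_v[(s >> 2) & 0x3],
                    m_v[(s >> 4) & 0x3], m_v[(s >> 6) & 0x3]);
    }
    // Assumed shape of the template variant: the mask is a template argument.
    template<EVecSwizzle4 s>
    Vec4 Swizzle() const
    {
        return Vec4(m_v[s & 0x3], m_v[(s >> 2) & 0x3],
                    m_v[(s >> 4) & 0x3], m_v[(s >> 6) & 0x3]);
    }
    // Shorter call syntax: v(XYWW).
    Vec4 operator()(EVecSwizzle4 s) const { return Swizzle(s); }

    float m_v[4];
};

inline void MaskSwizzleDemo()
{
    const Vec4 v(1.0f, 2.0f, 3.0f, 4.0f);
    const Vec4 a = v(XYWW);              // operator() form
    const Vec4 b = v.Swizzle<WZYX>();    // mask as template argument
    assert(a.m_v[2] == 4.0f && a.m_v[3] == 4.0f);
    assert(b.m_v[0] == 4.0f && b.m_v[3] == 1.0f);
}
```

Computing the enumerators with a helper instead of pasting generated numbers keeps the encoding and the decoding visibly in sync, at the cost of requiring a newer compiler.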
If I had to choose between those 4 versions, I’d probably go with the first one anyway; it’s the simplest, and writing a few commas won’t kill me. If the compiler had problems resolving all the offsets at compile time, then going with version #2 would be a safer bet. Source code testing all presented approaches can be found here.

Old comments

Jonathan 2009-10-22 00:54:35

Very nice.
I’ve played with something not dissimilar for Cell BE SPU, but due to the way the shufb instruction works, I used the preprocessor - http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/cell/spu/spu_shuffle.h

Arseny Kapoulkine 2009-10-22 05:18:53

I have a macro for that kind of stuff; on Altivec/SPU it expands to shufb (nothing non-obvious here); for PC I had to write, uhm, this:
http://www.everfall.com/paste/id.php?89p3wsk7gft0
and the test suite
http://www.everfall.com/paste/id.php?3qzz76p339ld

blackpawn 2009-10-24 00:31:40

cool! i still like the simple syntax in HLSL though. how have float4’s not made it in as native C/C++ types by now? the CPUs have had these registers for what 15 years now?? :P

C++ Swizzling | Dwight Design 2010-04-03 00:14:54

[…] libraries and forum posts where people had tried to implement it. I was able to find quite a few implementation attempts. But none of these did what I wanted. Only one had true write swizzling, and try as I […]

admin 2010-11-30 01:13:48

We do use ‘proper’ vector instructions (ie. SSE/Altivec/SPU in our case, no Wii). How you implement the shuffling itself is one thing, you still need to expose this functionality somehow and ideally - generate masks automatically.

fries 2010-11-29 13:35:28

I’m pretty sure that if you wrote your vector classes using proper vector instructions, you would get swizzles, splats, masks, permutes, etc. from almost any vector instruction set you ported to… Even the Wii’s paired singles can do some of this kind of thing in only 1 or 2 instructions.

gwiazdorrr 2013-11-12 22:58:41

Hi Maciej,
With C++ there are far more idiomatic ways of implementing swizzling. Hell, you can even replicate whole GLSL/HLSL swizzling syntax.
My take on this is CxxSwizzle (https://github.com/gwiazdorrr/CxxSwizzle). Bottom line, you can take a GLSL fragment shader and run it as C++ code, without any changes. With all the goodies (and baggage) of it. From your favourite IDE, using your favourite compiler.
It can be adopted to simulate HLSL as well.
I’d love to read your opinion on it.

admin 2013-11-18 05:59:43

Sorry, busy week. I didn’t have a chance to actually run it (no C++11-compliant compiler at home), but it looks impressive. I have to say, though, while it’s nice for experiments like C++ shaders (and as a proof of concept), it’s probably a little bit of overkill for a general-purpose multi-platform vector library.

gwiazdorrr 2013-12-04 09:16:15

Thanks for taking a look.
Regarding your concerns, given naive math implementation and poor set of support functions I have to agree with you. This was not the goal of this project. However, these problems are not rocket science, so maybe in the future…
As a side note, swizzling in D can be done in just few lines, putting C++ to shame: https://github.com/Dav1dde/gl3n/blob/master/gl3n/linalg.d#L361