Vector swizzling in C++
October 21, 2009 – 9:50 pm
Everyone who has done at least some vertex/pixel shader or HLSL programming has probably encountered a mechanism called “swizzling”. It is an operation that creates a new vector from arbitrarily selected components of another vector (also a little bit similar to SSE shuffling). A code snippet is worth 100 words, so here are some examples:
a = b.zyzx; // a.x = b.z, a.y = b.y, a.z = b.z, a.w = b.x
a = b.wy;   // a.x = b.w, a.y = b.y, a.z = b.y, a.w = b.y
a = b.z;    // a.x = a.y = a.z = a.w = b.z
Vector swizzling may come in handy in C++ as well. Recently, I saw a discussion about it on a programming forum and thought it could be an interesting experiment to implement it. The most straightforward, brute-force way would be simply to generate a method for every possible component combination, but that doesn’t sound very interesting.
I started with another simple approach, where you pass 4 component indices to a function and hope that the compiler will be able to figure out they’re all constant and optimize it nicely. Code:
enum EVecCoord
{
    X, Y, Z, W
};
struct Vec4
{
    Vec4(float x, float y, float z, float w)
    {
        m_v[X] = x;
        m_v[Y] = y;
        m_v[Z] = z;
        m_v[W] = w;
    }
    // Builds a new vector from the four selected components of this one.
    Vec4 Swizzle(EVecCoord c0, EVecCoord c1, EVecCoord c2, EVecCoord c3) const
    {
        return Vec4(m_v[c0], m_v[c1], m_v[c2], m_v[c3]);
    }
    float m_v[4];
};
// Defined in another translation unit.
void Foo(const Vec4& v);
// Test function
void SwizzleTest_3(const Vec4& v)
{
    Vec4 v2 = v.Swizzle(X, X, W, Y);
    Foo(v2);
    Vec4 v3 = v.Swizzle(X, Y, Y, Y);
    Foo(v3);
}
Foo is a dummy external function that prevents the compiler from optimizing Swizzle() away. Let’s take a look at the generated assembly (MSVC):
; 481 : Vec4 v2 = v.Swizzle(X, X, W, Y);
    mov esi, DWORD PTR _v$[ebp]
    movss xmm0, DWORD PTR [esi]
    movss DWORD PTR _v2$[ebp], xmm0
    movss DWORD PTR _v2$[ebp+4], xmm0
    movss xmm0, DWORD PTR [esi+12]
; 482 : Foo(v2);
    lea eax, DWORD PTR _v2$[ebp]
    movss DWORD PTR _v2$[ebp+8], xmm0
    movss xmm0, DWORD PTR [esi+4]
    push eax
    movss DWORD PTR _v2$[ebp+12], xmm0
    call ?Foo@@YAXABUVec4@@@Z ; Foo
; 483 : Vec4 v3 = v.Swizzle(X, Y, Y, Y);
    movss xmm0, DWORD PTR [esi]
; 484 : Foo(v3);
    lea eax, DWORD PTR _v3$[ebp]
    movss DWORD PTR _v3$[ebp], xmm0
    movss xmm0, DWORD PTR [esi+4]
    push eax
    movss DWORD PTR _v3$[ebp+4], xmm0
    movss DWORD PTR _v3$[ebp+8], xmm0
    movss DWORD PTR _v3$[ebp+12], xmm0
    call ?Foo@@YAXABUVec4@@@Z ; Foo
Looking good, just as I’d code it by hand, more or less. I could stop here to be honest, but after all, this was supposed to be an experiment, so let’s play along.
There are two things I’m still not 100% happy about. First, I had to repeat the ‘Y’ component manually; it’s not replicated automatically as in the shader example above. Second, for some reason, even after checking the generated code, relying on the compiler to perform all those optimizations makes me a little bit anxious. Let’s try a template version, where everything should be guaranteed to resolve at compile time. This time I provide four different methods:
// One component: broadcast to all four lanes.
template<EVecCoord c0>
Vec4 Swizzle() const
{
    return Vec4(m_v[c0], m_v[c0], m_v[c0], m_v[c0]);
}
// Two components: the last one is replicated.
template<EVecCoord c0, EVecCoord c1>
Vec4 Swizzle() const
{
    return Vec4(m_v[c0], m_v[c1], m_v[c1], m_v[c1]);
}
// Three components: the last one is replicated.
template<EVecCoord c0, EVecCoord c1, EVecCoord c2>
Vec4 Swizzle() const
{
    return Vec4(m_v[c0], m_v[c1], m_v[c2], m_v[c2]);
}
// Four components: full swizzle.
template<EVecCoord c0, EVecCoord c1, EVecCoord c2, EVecCoord c3>
Vec4 Swizzle() const
{
    return Vec4(m_v[c0], m_v[c1], m_v[c2], m_v[c3]);
}
// Test
Vec4 v2 = v.Swizzle<X, X, W, Y>();
Foo(v2);
Vec4 v3 = v.Swizzle<X, Y>();
Foo(v3);
The generated assembly is exactly the same as before, but we stress the compiler a little bit less, since we don’t really pass any arguments to the function. This version also lets us specify only 2 components in the second Swizzle call. One tiny drawback is that every distinct component combination is actually a separate function. Sure, in the end they’re inlined anyway, but the .obj file still contains the generated routines. They will be stripped eventually, but it’s a little bit of additional work for the linker.
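To check the replication rule without reading assembly, the overloads can be dropped into a small standalone file. This is only a sketch: the SameAs helper and the CheckTemplateSwizzle function are hypothetical additions for comparing results, not part of the original code, and the 3-component overload is left out for brevity.

```cpp
#include <cassert>

enum EVecCoord { X, Y, Z, W };

struct Vec4
{
    Vec4(float x, float y, float z, float w)
    {
        m_v[X] = x; m_v[Y] = y; m_v[Z] = z; m_v[W] = w;
    }

    // One index: broadcast the same component to all four lanes.
    template<EVecCoord c0>
    Vec4 Swizzle() const
    {
        return Vec4(m_v[c0], m_v[c0], m_v[c0], m_v[c0]);
    }

    // Two indices: the last one given is replicated into z and w.
    template<EVecCoord c0, EVecCoord c1>
    Vec4 Swizzle() const
    {
        return Vec4(m_v[c0], m_v[c1], m_v[c1], m_v[c1]);
    }

    // Four indices: a full swizzle, nothing is replicated.
    template<EVecCoord c0, EVecCoord c1, EVecCoord c2, EVecCoord c3>
    Vec4 Swizzle() const
    {
        return Vec4(m_v[c0], m_v[c1], m_v[c2], m_v[c3]);
    }

    // Hypothetical helper (not in the original post): exact component-wise comparison.
    bool SameAs(float x, float y, float z, float w) const
    {
        return m_v[X] == x && m_v[Y] == y && m_v[Z] == z && m_v[W] == w;
    }

    float m_v[4];
};

// Small self-check; the extra parentheses keep the template commas
// from being parsed as separate macro arguments by assert.
inline void CheckTemplateSwizzle()
{
    const Vec4 v(1.0f, 2.0f, 3.0f, 4.0f);
    assert((v.Swizzle<Z>().SameAs(3.0f, 3.0f, 3.0f, 3.0f)));
    assert((v.Swizzle<X, Y>().SameAs(1.0f, 2.0f, 2.0f, 2.0f)));
    assert((v.Swizzle<X, X, W, Y>().SameAs(1.0f, 1.0f, 4.0f, 2.0f)));
}
```

Overload selection here depends only on how many template arguments are supplied, which is why Swizzle<X, Y>() picks the replicating 2-component version.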
Are we done? Normally we would be, but this is an experiment, remember? I thought it’d be cool to have a version where you don’t have to use commas, so you could write a = b.Swizzle(WYZY), for example. First, I needed all the possible combinations of components. It was simple to generate them using a Python script (you can get it here; it requires Python 2.6 for the itertools module… I think Python may have libraries for just about everything. I fully expect version 3.0 to come with life.findMeaning). [I had a Scala version as well, as I wanted to learn the language, but somehow it didn’t click for me, so I must find another one. Scala looks like it may be really efficient, but it doesn’t lend itself that well to home fun, IMHO.]
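The original generator is a Python script and is not reproduced here. As a rough C++ stand-in, the sketch below enumerates every swizzle name of a given length. SwizzleNames and CheckSwizzleNames are made-up names, and the function only produces the names, not the final enum declarations.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns every swizzle name with `count` components,
// in the order "XX", "XY", "XZ", "XW", "YX", ... for count == 2.
std::vector<std::string> SwizzleNames(int count)
{
    const char letters[] = "XYZW";
    std::vector<std::string> names(1, std::string()); // start from one empty name
    for (int i = 0; i < count; ++i)
    {
        // Extend every name built so far with each of the four letters.
        std::vector<std::string> next;
        for (std::size_t n = 0; n < names.size(); ++n)
            for (int c = 0; c < 4; ++c)
                next.push_back(names[n] + letters[c]);
        names.swap(next);
    }
    return names;
}

// Small self-check: there are 4^2, 4^3 and 4^4 names respectively.
inline void CheckSwizzleNames()
{
    assert(SwizzleNames(2).size() == 16);
    assert(SwizzleNames(3).size() == 64);
    assert(SwizzleNames(4).size() == 256);
}
```

That makes 336 multi-component names in total, which is why generating them by script beats typing them by hand.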
The idea was just to pass a single ‘mask’ and then somehow extract the components. At first, I thought about having an offset table, but that would stress the compiler even more, plus I’d have to obtain an index into this table (from the mask) anyway. Then again, if I was going to generate an index, I could compute the offset directly as well. That’s what I did. Every component combination is an 8-bit mask with 2 bits per component; the masks are generated by the same Python script. Here’s the non-template solution:
// Single component: broadcast it to all four lanes.
Vec4 Swizzle(EVecCoord c) const
{
    return Vec4(m_v[c], m_v[c], m_v[c], m_v[c]);
}
// 2 components (4-bit mask): the second one is replicated.
Vec4 Swizzle(EVecSwizzle2 swizzle) const
{
    return Vec4(m_v[swizzle & 0x3], m_v[swizzle >> 2], m_v[swizzle >> 2], m_v[swizzle >> 2]);
}
// 3 components (6-bit mask): the third one is replicated.
Vec4 Swizzle(EVecSwizzle3 swizzle) const
{
    return Vec4(m_v[swizzle & 0x3], m_v[(swizzle >> 2) & 0x3],
        m_v[(swizzle >> 4) & 0x3], m_v[(swizzle >> 4) & 0x3]);
}
// 4 components (8-bit mask): full swizzle.
__forceinline Vec4 Swizzle(EVecSwizzle4 swizzle) const
{
    return Vec4(m_v[swizzle & 0x3], m_v[(swizzle >> 2) & 0x3],
        m_v[(swizzle >> 4) & 0x3], m_v[swizzle >> 6]);
}
// Test
Vec4 v2 = v.Swizzle(XXWY);
Foo(v2);
Vec4 v3 = v.Swizzle(XY);
Foo(v3);
The generated assembly is exactly the same as before. As you can see, I had to force inlining for the 4-component version; the compiler wouldn’t do it automatically.
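The mask enums (EVecSwizzle2 and friends) come from the generator script and are not listed in the post. A minimal sketch of what a few entries might look like, assuming the first component sits in the lowest 2 bits as the Swizzle() code implies; only a handful of values are written out, and CheckMaskSwizzle is a hypothetical test helper:

```cpp
#include <cassert>

enum EVecCoord { X, Y, Z, W };

// Assumed layout: 2 bits per component, first component in the lowest bits.
enum EVecSwizzle2
{
    XY = X | (Y << 2),
    WY = W | (Y << 2)
};
enum EVecSwizzle4
{
    XXWY = X | (X << 2) | (W << 4) | (Y << 6),
    WYZY = W | (Y << 2) | (Z << 4) | (Y << 6)
};

struct Vec4
{
    Vec4(float x, float y, float z, float w)
    {
        m_v[X] = x; m_v[Y] = y; m_v[Z] = z; m_v[W] = w;
    }

    // 2-component mask: the second component fills y, z and w.
    Vec4 Swizzle(EVecSwizzle2 swizzle) const
    {
        const int c0 = swizzle & 0x3;
        const int c1 = (swizzle >> 2) & 0x3;
        return Vec4(m_v[c0], m_v[c1], m_v[c1], m_v[c1]);
    }

    // 4-component mask: each 2-bit field selects one source component.
    Vec4 Swizzle(EVecSwizzle4 swizzle) const
    {
        return Vec4(m_v[swizzle & 0x3], m_v[(swizzle >> 2) & 0x3],
                    m_v[(swizzle >> 4) & 0x3], m_v[(swizzle >> 6) & 0x3]);
    }

    float m_v[4];
};

// Small self-check using v = (1, 2, 3, 4).
inline void CheckMaskSwizzle()
{
    const Vec4 v(1.0f, 2.0f, 3.0f, 4.0f);
    const Vec4 a = v.Swizzle(XXWY); // (x, x, w, y)
    assert(a.m_v[0] == 1.0f && a.m_v[1] == 1.0f && a.m_v[2] == 4.0f && a.m_v[3] == 2.0f);
    const Vec4 b = v.Swizzle(XY);   // (x, y, y, y)
    assert(b.m_v[0] == 1.0f && b.m_v[1] == 2.0f && b.m_v[2] == 2.0f && b.m_v[3] == 2.0f);
}
```

Because each mask length has its own enum type, ordinary overload resolution picks the right Swizzle() from the argument alone.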
Finally, just for kicks, there’s also a template version taking a mask argument. Sadly, it only supports the full, 4-component mask, as the compiler cannot differentiate between functions with the same name whose template arguments are different enum types.
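The template variant is not listed in the post. A minimal sketch of what it could look like, limited to the 4-component mask as described; the enum values assume 2 bits per component with the first component in the lowest bits, and CheckTemplateMask is a hypothetical test helper:

```cpp
#include <cassert>

enum EVecCoord { X, Y, Z, W };

// Assumed packing: 2 bits per component, first component in the lowest bits.
enum EVecSwizzle4
{
    XXWY = X | (X << 2) | (W << 4) | (Y << 6),
    WYZY = W | (Y << 2) | (Z << 4) | (Y << 6)
};

struct Vec4
{
    Vec4(float x, float y, float z, float w)
    {
        m_v[X] = x; m_v[Y] = y; m_v[Z] = z; m_v[W] = w;
    }

    // The mask is a template argument, so every index below is a constant expression.
    template<EVecSwizzle4 swizzle>
    Vec4 Swizzle() const
    {
        return Vec4(m_v[swizzle & 0x3], m_v[(swizzle >> 2) & 0x3],
                    m_v[(swizzle >> 4) & 0x3], m_v[(swizzle >> 6) & 0x3]);
    }

    float m_v[4];
};

// Small self-check using v = (1, 2, 3, 4).
inline void CheckTemplateMask()
{
    const Vec4 v(1.0f, 2.0f, 3.0f, 4.0f);
    const Vec4 r = v.Swizzle<WYZY>(); // (w, y, z, y)
    assert(r.m_v[0] == 4.0f && r.m_v[1] == 2.0f && r.m_v[2] == 3.0f && r.m_v[3] == 2.0f);
}
```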
For the non-template versions, the syntax can be made even shorter by using operator() instead of the Swizzle() method, so you’d write (personally, I prefer more explicit ways):
Vec4 v2 = v(XYWW);
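A sketch of how that operator could be declared, assuming the same mask packing as the Swizzle() overloads (2 bits per component, first component lowest). Only the one enum value needed for the example is defined, and CheckCallOperator is a hypothetical test helper:

```cpp
#include <cassert>

enum EVecCoord { X, Y, Z, W };

// Assumed packing; just the single entry used in the example.
enum EVecSwizzle4
{
    XYWW = X | (Y << 2) | (W << 4) | (W << 6)
};

struct Vec4
{
    Vec4(float x, float y, float z, float w)
    {
        m_v[X] = x; m_v[Y] = y; m_v[Z] = z; m_v[W] = w;
    }

    // Same body as the mask-based Swizzle(), exposed through the call operator.
    Vec4 operator()(EVecSwizzle4 swizzle) const
    {
        return Vec4(m_v[swizzle & 0x3], m_v[(swizzle >> 2) & 0x3],
                    m_v[(swizzle >> 4) & 0x3], m_v[(swizzle >> 6) & 0x3]);
    }

    float m_v[4];
};

// Small self-check using v = (1, 2, 3, 4).
inline void CheckCallOperator()
{
    const Vec4 v(1.0f, 2.0f, 3.0f, 4.0f);
    const Vec4 v2 = v(XYWW); // (x, y, w, w)
    assert(v2.m_v[0] == 1.0f && v2.m_v[1] == 2.0f && v2.m_v[2] == 4.0f && v2.m_v[3] == 4.0f);
}
```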
If I had to choose between those 4 versions, I’d probably go with the first one anyway; it’s the simplest, and writing a few commas won’t kill me. If the compiler had problems resolving all the offsets at compile time, then going with version #2 would be a safer bet.
Source code testing all presented approaches can be found here.

7 Responses to “Vector swizzling in C++”
Very nice.
I’ve played with something not dissimilar for Cell BE SPU, but due to the way the shufb instruction works, I used the preprocessor – http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/cell/spu/spu_shuffle.h
By Jonathan on Oct 22, 2009
I have a macro for that kind of stuff; on Altivec/SPU it expands to shufb (nothing non-obvious here); for PC I had to write, uhm, this:
http://www.everfall.com/paste/id.php?89p3wsk7gft0
and the test suite
http://www.everfall.com/paste/id.php?3qzz76p339ld
By Arseny Kapoulkine on Oct 22, 2009
cool! i still like the simple syntax in HLSL though. how have float4’s not made it in as native C/C++ types by now? the CPUs have had these registers for what 15 years now?? :P
By blackpawn on Oct 24, 2009
I’m pretty sure that if you wrote your vector classes using proper vector instructions, you would get swizzles, splats, masks, permutes, etc. from almost any vector instruction set you ported to… Even the Wii’s paired singles can do some of this kind of thing in only 1 or 2 instructions.
By fries on Nov 29, 2010
We do use ‘proper’ vector instructions (ie. SSE/Altivec/SPU in our case, no Wii). How you implement the shuffling itself is one thing, you still need to expose this functionality somehow and ideally – generate masks automatically.
By admin on Nov 30, 2010