To vectorize or not to vectorize...
31 Dec 2025
If you’ve ever looked at optimized code generated by Clang, you might have noticed that it loves vectorizing. Vectorization is enabled by default at higher optimization levels (i.e. when not optimizing for size). Clang is a great compiler and it knows what it’s doing, but I was curious about two things:
- is it worth the increased binary size?
- which functions are the most ‘affected’?
The first question could be answered by profiling, but the results were somewhat inconclusive: whatever gains there were stayed mostly at “noise” level (in both frametimes and CPU counters). Still, I didn’t want to discard vectorization without a better understanding of what code is actually different. This leads us to the second point: how can we easily find and inspect the functions that change the most when we turn vectorization off? Fortunately, as already mentioned, vectorization typically increases the code size (and if it doesn’t do so meaningfully, we don’t really care). We can dump all the symbols with their sizes, build again with -fno-vectorize, dump again, find the functions with the biggest differences in size and inspect them individually.
The first step is easy: we just run objdump -t -C <file>. -t prints the symbol table and -C demangles the names so we get something human readable. (There are other ways to do this; you can also use a .map file if your build produces one, nm, etc. The format might be a tiny bit different, but the general idea stays the same.)
After this is done we’re left with data that looks like:
address flags section size name
0000000000c39a10 l F .text 0000000000002c4a stbi__load_main(stbi__context*, <more args>)
The next step is to find the N biggest symbols, re-run with different options (vectorization off) and compare. Technically, this could be done in one step, but I found it quite informative to have separate information about the biggest symbols, regardless of vectorization. It is also not 100% correct, as theoretically we could have a big delta in some small function, but that is fairly unlikely (it’d be on the top-N list in at least one of the builds). To achieve this I wrote two simple Python scripts: sort_objdump_symbols.py and compare_sizes.py. Usage is simple:
python sort_objdump_symbols.py <objdump-output> [num biggest symbols to output] > a.txt (…and b.txt)
python compare_sizes.py a.txt b.txt <num diffs>
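The core of the sorting script can be sketched in a few lines. This is a simplified reconstruction, not the actual script: the regex, the `top_symbols` helper and the `size<TAB>name` output format are my assumptions about how such a tool might be structured.

```python
import re

# Matches objdump -t -C symbol table entries, e.g.:
# 0000000000c39a10 l     F .text  0000000000002c4a stbi__load_main(stbi__context*, ...)
SYMBOL_RE = re.compile(
    r"^(?P<addr>[0-9a-fA-F]{8,16})\s+.*?"  # address, then flag characters
    r"(?P<section>\.\S+)\s+"               # section, e.g. .text
    r"(?P<size>[0-9a-fA-F]{8,16})\s+"      # size, in hex
    r"(?P<name>.+)$"                       # demangled name
)

def parse_symbols(lines):
    """Yield (size_in_bytes, demangled_name) for every symbol table entry."""
    for line in lines:
        m = SYMBOL_RE.match(line)
        if m:
            yield int(m.group("size"), 16), m.group("name").strip()

def top_symbols(lines, n=50):
    """Return the n biggest symbols, largest first."""
    return sorted(parse_symbols(lines), reverse=True)[:n]

if __name__ == "__main__":
    sample = [
        "0000000000c39a10 l     F .text  0000000000002c4a stbi__load_main(stbi__context*)",
        "0000000000000010 g     F .text  0000000000000040 small_helper()",
    ]
    for size, name in top_symbols(sample):
        print(f"{size}\t{name}")
```

The real script would read the objdump output from the file given on the command line and print `size<TAB>name` lines, so two captures can be diffed by name later.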
…and eventually we get something like:
11705 0x0000000000002db9 -> 0x0000000000000000 stbi__convert_format(unsigned char*, ...)
which means stbi__convert_format went from 11705 bytes to zero… which is obviously not true; it’s just that in this case the function is not even in the top 50. If we were interested in the real value, we’d need to include more results in the capture where it’s missing (it was 1268 bytes). For the most part, though, even that’s enough: I’m not so much interested in the exact delta as in the ‘inspection candidates’ that we can now disassemble and dig into.
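The comparison logic can be sketched as follows. Again, this is a hedged reconstruction rather than the actual compare_sizes.py: it assumes the inputs are `size<TAB>name` lines and treats a symbol missing from one capture as size 0, which is exactly the behavior that produced the misleading zero in the example above.

```python
def load_sizes(lines):
    """Parse 'size<TAB>name' lines into {name: size}."""
    sizes = {}
    for line in lines:
        size, _, name = line.rstrip("\n").partition("\t")
        if name:
            sizes[name] = int(size)
    return sizes

def diff_sizes(before, after):
    """Return (abs_delta, old_size, new_size, name) tuples, biggest change first.
    A symbol present in only one capture is treated as size 0 in the other."""
    rows = []
    for name in before.keys() | after.keys():
        old, new = before.get(name, 0), after.get(name, 0)
        if old != new:
            rows.append((abs(old - new), old, new, name))
    rows.sort(reverse=True)
    return rows

if __name__ == "__main__":
    before = {"stbi__convert_format(unsigned char*, ...)": 11705}
    after = {}  # the symbol fell out of the top-N list in the second capture
    for delta, old, new, name in diff_sizes(before, after):
        # Same shape as the output shown above: decimal delta, hex old -> new sizes.
        print(f"{delta} {old:#018x} -> {new:#018x} {name}")
```

Sorting the tuples directly works because the absolute delta is the first element; ties fall back to comparing the remaining fields.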
Running these scripts on our codebase made it a bit more obvious why it’s hard to notice major differences. The compiler doesn’t know which functions are “hot” (not without PGO, anyway), so it applies the optimizations ‘equally’. The functions that grew the most were typically way off the critical path: random inventory code, some rare serialization, preparing data for a network request, etc. I’m not saying they don’t matter at all, but they’re not something that will improve your framerate. Whether it’s “worth it” is another question, one that will obviously vary between codebases and can’t be answered without understanding what’s going on. The process outlined here gives us better insight and maybe even lets us pick some middle ground, like enabling vectorization only for some modules.