Parallel 101
18/Jun 2013
With the unveiling of next-gen console specifications it’s clear that multi-threaded code is here to stay (I don’t think anyone expected otherwise). If anything, it’ll be even more common: new CPUs run at relatively low frequencies (compared to modern PCs), so we’ll definitely have to go wide to use their potential. Here’s a quick cheat sheet I usually follow when trying to move code to a background thread. Please note: move. It’s not about writing code from scratch, it’s about taking an existing subsystem and making it run on multiple threads.
choose wisely. The first step is to decide which piece of code we want to parallelize. It’s tempting to start with the slowest function, but that’s not always the best choice (except in an ideal world with no deadlines). What you really want is the fragment that’ll give you decent gains and is not crazy hard to modify (every codebase has its dark corners). It’s a little bit like expected value (EV), but with “easiness” instead of probability (or rather, the probability of actually being able to pull it off). This rule applies at a few different levels, including the actual code modifications. Yes, it’s tempting to go completely lock-free and have no contention whatsoever, but holding a lock for a short duration is not the end of the world. We should strive for perfection, but not necessarily when it means your changes now take 10x as much time and require rewriting 90% of the original code.
think about the data. What are your inputs, what are your outputs, what memory do you touch in any way (focus on writes, though)? I like to use the compiler to do the heavy lifting for me and move the code I’m going to modify outside the class, sometimes even to a separate file. Obviously, it will not compile at first, which is the whole point: now the compiler will complain about all the data/functions I’m going to need… The next step is trying to get it to compile… and stopping there. I rarely use this version for anything, it’s solely to estimate the scope of the task. YMMV, but usually the more changes I have to make to get the code to compile, the trickier the final implementation will be.
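A minimal sketch of where this trick ends up (all names here are made up for illustration): once the function lives outside the class, every hidden dependency becomes an explicit parameter, which is exactly the list of data the future job will need.

```cpp
#include <vector>

struct Particle { float x, y, vx, vy; };

// Before: void ParticleSystem::Update(float dt) reads m_gravity and
// writes m_particles, all of it hidden behind 'this'.
// After: a free function; inputs and outputs are visible at the call site.
void UpdateParticles(std::vector<Particle>& particles, // written (output)
                     float gravity,                    // read (input)
                     float dt)                         // read (input)
{
    for (Particle& p : particles)
    {
        p.vy += gravity * dt;
        p.x  += p.vx * dt;
        p.y  += p.vy * dt;
    }
}
```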
there is no “one size fits all” solution. Typically, when operating on objects, a single job packet = a single object. It’s not always the best or easiest approach, though, so don’t choose it automatically. For example, let’s think about rendering multiple scenes at once (not a real-life example, but close enough hopefully, I had to obfuscate it slightly to protect the innocent). The most obvious approach would be to spawn a job for each object and let the thread pool deal with the rest. This causes potential contention issues, though: now we have multiple threads touching the same data (the scene). A less orthodox solution would be to process each scene in a separate thread (so a single job = all scene objects in this case). We know we’re the exclusive owners, so there is no need to synchronize access. Yes, load balancing will be slightly worse, and on some systems you might need to yield manually to give other cores a chance, but in many cases it’s much, much faster to implement (see point 1).
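A quick sketch of the per-scene split (Scene, Object and the raw std::thread usage here are simplified stand-ins for whatever your engine provides; a real job system would replace the explicit threads):

```cpp
#include <functional>
#include <thread>
#include <vector>

struct Object { float dummy; /* ... */ };
struct Scene  { std::vector<Object> objects; };

// One job = one whole scene. The job exclusively owns its scene,
// so no synchronization is needed inside the loop.
void ProcessScene(Scene& scene)
{
    for (Object& obj : scene.objects)
    {
        // Update obj; we're the only thread touching this scene.
        obj.dummy += 1.0f;
    }
}

void ProcessAllScenes(std::vector<Scene>& scenes)
{
    std::vector<std::thread> workers;
    for (Scene& s : scenes)
        workers.emplace_back(ProcessScene, std::ref(s));
    for (std::thread& t : workers)
        t.join();
}
```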
use your original, single-threaded implementation for verification purposes.
actually, it’s nice to have at least 3-4 different “modes” and an option to switch between them while the game is running (see the sketch after this list):
“old”, single-threaded,
“jobified” version, but running sequentially (i.e. only one job at a time),
final multi-threaded version,
on top of that, we might need even more variations. For example, when using double-buffering of data, it’s handy to have single-threaded/single-buffer + single-threaded/double-buffer versions (to verify the buffering code). For platforms without shared memory, it’s nice to have a way of “faking” the memory transfer (e.g. with memcpy instead of a DMA move).
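Here’s a minimal sketch of such a switch (all names are hypothetical; in practice this would hang off a console variable or debug menu so it can be flipped at runtime):

```cpp
// Hypothetical hooks into the subsystem being parallelized; the real
// bodies would live in the subsystem itself.
void UpdateOldSingleThreaded() { /* original single-threaded path */ }
void RunJobs(int maxConcurrentJobs) { /* jobified path */ (void)maxConcurrentJobs; }

enum class UpdateMode
{
    SingleThreaded,  // original code, kept around for verification
    JobsSequential,  // jobified, but only one job in flight at a time
    JobsParallel     // the final multi-threaded version
};

UpdateMode g_updateMode = UpdateMode::JobsParallel;

void UpdateSubsystem()
{
    switch (g_updateMode)
    {
    case UpdateMode::SingleThreaded: UpdateOldSingleThreaded(); break;
    case UpdateMode::JobsSequential: RunJobs(1);                break;
    case UpdateMode::JobsParallel:   RunJobs(0);                break; // 0 = no limit
    }
}
```

Extra variations (double-buffering on/off, faked DMA) slot into the same enum.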
having multiple modes makes it easier to debug & profile our changes. When something goes wrong (yes, when, not if), switching between different modes should give us some first clues. Sequential jobs working OK? The job code is probably fine, so the problems are most likely caused by data races. “memcpy” version OK, but problems when running on the actual device? Might be a sign of a missing DMA wait.
just as important, it’s now easier to monitor our progress. After all, our goal is not just to have something running on a background thread, it is to make the game run faster. When modifying complex subsystems, often originally coded by someone else, it’s easy to miss problems that become obvious after a look at the profiler results… Word of wisdom: even a naive implementation should give you a noticeable speedup. If your brand new multi-threaded code runs at the exact same speed as the old function, don’t start by chasing false sharing or other esoteric culprits, it’s probably something stupid (like a critical section call buried deep in the call chain… this happened to my *cough* friend recently). My go-to tool for such cases is my trusted thread profiler, your engine probably has something similar. A poor man’s way of verifying this is to set a breakpoint in the critical section code.
code compiles, runs, it’s faster than it was before, no obvious problems, we check it in. Are we done? Sadly, not yet, time for more subtle issues… We’re now getting rare crash reports with access violations or weird results (if nothing like this happens after a few days, high-five yourself). Smells like a race condition… One trick I like to employ to track these down is a special kind of lock. When taking the lock, verify it’s not already taken; when releasing, verify it is taken. The easiest version of this pattern is just an integer. Increment it atomically when trying to acquire the “lock” and assert that the previous value was zero. When releasing, decrement it and verify that it really was 1. It’s surprisingly helpful. When I first started using it I didn’t have much hope, but in many cases it tripped after just a few runs. These assertions are way more likely to trip than the actual data race (you really have to be (un)lucky with context switches for those). It’s a good idea to hold such a lock for as long as possible (i.e. enclose as much of the code as you can, as long as you’re sure none of the memory accesses inside may legally happen from multiple threads) to improve our chances even more. Once it breaks into the debugger, it usually boils down to inspecting what the other threads are doing…
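A minimal sketch of this “assert lock”: unlike a real lock it never blocks, it only trips an assert when two threads are inside the guarded region at the same time.

```cpp
#include <atomic>
#include <cassert>

struct AssertLock
{
    std::atomic<int> counter{0};

    void Enter()
    {
        // fetch_add returns the previous value; anything but 0 means
        // another thread is already inside the guarded region.
        const int prev = counter.fetch_add(1, std::memory_order_acquire);
        (void)prev; // unused in release builds
        assert(prev == 0 && "Race detected: region entered concurrently");
    }

    void Leave()
    {
        const int prev = counter.fetch_sub(1, std::memory_order_release);
        (void)prev;
        assert(prev == 1 && "Race detected: unbalanced Leave()");
    }
};
```

Wrap Enter()/Leave() around every access to the suspect data, leave it enabled in internal builds and wait for the assert to fire; when it does, the debugger shows exactly which call sites collided.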
practice, practice, practice; there’s no way around it. With every case completed, you’ll learn new things and discover where the dragons are.