Optimizing without fear
21/Aug 2011
As you can probably tell from this blog, I spend a fair chunk of my time optimizing code/thinking about optimizations. One of the main problems with optimizing code is making sure we don't break anything in the process. It can be tricky, especially when modifying algorithms with many edge cases, but even simple changes are dangerous, probably because people don't test them as thoroughly. I still feel a little bit nervous upon seeing "Little optimization in XXX" in a P4 changelist description (heck, I'll be the first one to admit, I broke code as recently as last week…). What can we do to make the optimization process a little less scary?
First thing – having some kind of testing system goes a long way here; it's so much easier to modify code that has extensive test coverage. For some modules (math) it might be easier, for others more complicated, but even having some tests helps. That's our first line of defense; I'm guessing it should be fairly common these days, so I won't focus on it.
Use existing code/data! Assuming your original code works as expected, it's the best verification tool you can dream of. A few months ago I was optimizing our animation preprocessing code. It's a dangerous area, because it's tricky to verify: lots of special cases, lots of data. Sure, you can try to eyeball it, but that's not a very reliable method. A better way is to use data generated with the old code as a regression set for the new algorithm. In my case, I used the old code to spit out a C++ header containing regression data (not the whole set, just every 10th keyframe). Then I added an animation preprocessing unit test making sure anims processed with the new code matched the regression data. The only thing left was choosing a representative set of animations, so that all code paths were being tested. The whole process took me a day or so, a little bit of tree flattening gave an almost 2x speedup, and not a single bug was reported afterwards.
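To give an idea of the shape of that setup, here's a minimal sketch (names, data layout and the PreprocessKeyframe entry point are all made up for illustration, not the actual code): the generated header holds reference keyframes dumped by the old path, and the unit test reruns the new preprocessing and compares against them.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Shape of the generated header: reference keyframes dumped by the OLD
// preprocessing code (every 10th keyframe). Values are placeholders.
static const float kRefKeyframes[][4] =
{
    { 0.000f,  0.000f, 0.000f, 1.000f },   // keyframe 0
    { 0.012f, -0.305f, 0.087f, 0.948f },   // keyframe 10
    // ...
};
static const size_t kNumRefKeyframes =
    sizeof(kRefKeyframes) / sizeof(kRefKeyframes[0]);

// Hypothetical entry point of the NEW preprocessing code
// (declared only; the real one lives in the animation tools).
void PreprocessKeyframe(size_t keyframeIndex, float out[4]);

// Regression test: the new code must reproduce the old code's output.
void TestAnimPreprocessRegression()
{
    for (size_t i = 0; i < kNumRefKeyframes; ++i)
    {
        float out[4];
        PreprocessKeyframe(i * 10, out);   // every 10th keyframe was dumped

        for (int c = 0; c < 4; ++c)
        {
            // Small epsilon: the optimized path may reorder float ops.
            assert(std::fabs(out[c] - kRefKeyframes[i][c]) < 1e-5f);
        }
    }
}
```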
A big advantage of a unit test is that it'll stay there forever, so if in the future someone wants to optimize the code even further or do some other tweaks – we'll know immediately if it breaks. When modifying game code that's not covered by a test, we don't have this luxury, but we can still analyze our changes extensively before checking them in. Similar idea – verify results generated after modifications against those known to be "good". What I usually do is run the game with both algorithms running side by side (yeah, it might be slow) and compare outputs every frame. I play a few levels like this, trying various things. If I don't hit an assertion, I usually assume it's safe to commit. In most cases, it really is…
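The side-by-side check doesn't need to be anything clever. A minimal sketch, assuming hypothetical Pose/UpdateOld/UpdateNew names (not the actual code): wire it into the per-frame update while play-testing, then rip it out once the optimized path has earned some trust.

```cpp
#include <cassert>
#include <cmath>

// Made-up output type; stands in for whatever the real update produces.
struct Pose
{
    static const int kNumChannels = 16;
    float channels[kNumChannels];
};

// Stand-ins for the real before/after animation update functions
// (declared only; the real ones live in the game code).
Pose UpdateOld(int objectId, float dt);   // known-good, slow path
Pose UpdateNew(int objectId, float dt);   // optimized path under test

// Called every frame for every object while play-testing levels.
void UpdateWithVerification(int objectId, float dt)
{
    const Pose reference = UpdateOld(objectId, dt);
    const Pose optimized = UpdateNew(objectId, dt);

    for (int i = 0; i < Pose::kNumChannels; ++i)
    {
        // Assert the moment the two versions diverge; the epsilon
        // tolerates reordered floating-point operations.
        assert(std::fabs(optimized.channels[i] - reference.channels[i]) < 1e-5f);
    }
}
```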
As I mentioned, though, last week I hit an interesting case. I'd been changing some animation code (runtime this time), sped it up, compared outputs, everything worked fine. Soon, we got a bug saying that sometimes animations looked completely wrong… I quickly rolled back my changes, just to see if it helped, and to my surprise – it did. I modified the code to run both versions at the same time again. Hmmm, not only were the results for every frame and every object the same, it also looked correct now. I removed the old version, ran with the new one – oops, broken again. OK, try a debug build, see what's going on exactly. Well, what do you know – it worked perfectly fine. At this point I realized I had only been testing the PS3 version. I ran X360 (release, optimized) – worked, so the problem was PS3 specific. I forced all the animation jobs to finish before continuing with PPU code – still broken, so it wasn't a case of accessing the same regions of memory without synchronization. I forced processing to run completely on the PPU and it fixed the whole thing again. At this point it was quite obvious it was most probably something related to DMA transfers; I reviewed the code around the modified fragment and discovered we were missing a DMA wait. The old code was slow enough to cover this up in the majority of cases, the new one made it much more obvious… Bam, there you have it – that's one case where my tests not only gave a false sense of security, but also hid the underlying problem. Then again, I must admit I didn't run the old test I'd written a few months before, which would compare results from the SPU job to those calculated on the PPU. Had I done that, the problem would have been visible much sooner…
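That SPU-vs-PPU cross-check is the same double-compute idea, just across processors. A generic sketch, with standard C++ threads standing in for the actual SPU job machinery and made-up function names: run the offloaded path, explicitly wait for it to complete, then compare against a reference computed locally. The explicit wait is the point – forgetting the equivalent DMA wait was exactly the bug.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-ins for the real code (declared only): the "job" path is the
// optimized version normally offloaded to an SPU, the reference path
// is the plain, known-good version running on the main processor.
std::vector<float> ComputePoseJob(int animId);
std::vector<float> ComputePoseReference(int animId);

void CrossCheckJobAgainstReference(int animId)
{
    std::vector<float> jobResult;

    // Kick off the offloaded version asynchronously...
    std::thread job([&] { jobResult = ComputePoseJob(animId); });

    // ...and compute the reference locally in the meantime.
    const std::vector<float> reference = ComputePoseReference(animId);

    // Crucial step: wait for the job (and its output transfer) to finish
    // BEFORE touching jobResult -- the missing DMA wait was exactly this.
    job.join();

    assert(jobResult.size() == reference.size());
    for (size_t i = 0; i < reference.size(); ++i)
        assert(std::fabs(jobResult[i] - reference[i]) < 1e-5f);
}
```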
In the past, I'd sometimes skip proper regression testing for simple changes, mostly because I assumed I couldn't possibly break something so trivial (Dunning-Kruger effect kicking in? Might be…). It took a few humbling experiences to change my ways. My spider sense no longer tingles that strongly when I check in my stuff…
Old comments
El Marcel 2011-08-22 15:21:59
Very interesting post, as usual.