Be nice to your cache

Short list of tips & guidelines that every game developer should keep in mind. No rocket science, common sense, really, but it’s still relatively rare to find codebases that apply to them. It’s especially aimed at gameplay programmers, who operate a little bit further up from the metal. With todays hardware, cache can be your biggest friend or enemy, CPUs got the the point where a few extra calculations hurt much less than a single cache miss (so LUT is not always the best option, for example). I’m not going to write too much about cache architecture itself, it’s a topic on its own. If you’re interested see Gustavo’s article or this paper by Ulrich Drepper. In a nutshell - cache is a very fast memory organized in lines of 32-128 bytes. When memory is referenced, cache is tested first, if data is there - access is very quick, if not - whole line is loaded. In the latter case, this means cache miss and costs hundreds of cycles. To put things into perspective, consider numbers from famous Tony Albrecht’s paper - Pitfalls of OOP, he speeds bounding spheres calculations by 35% just by changing data layout (not touching code at all). There are 3 main reasons of cache misses:

  • data is read for the first time (and wasn’t lucky enough to be in same cacheline as previously loaded variables),
  • data has been evicted from cache,
  • cache thrashing (todays caches are set associative, so memory location maps to limited number of possible cache lines).

How to avoid them?

  • do not bloat your structures. Think about padding (Cruncher# to the rescue!), use bitfields instead of multiple bools, consider using smaller data types. Of course, this should never be done automatically, think about every case (see below for example, don’t break access patterns to minimize padding). As I mentioned above - it may be worth to trade off CPU speed for memory and compress data.

  • think about your access patterns. Put members that are used together next to each other. Too often I see code that first accesses member at offset 2, then jumps to 130. I’ve seen fragments that’d cause 3 cache misses on 3 accesses to the same object! I’ve been able to remove hundreds of cache misses per frame just by moving variables around.

  • separate hot/cold data. That’s related both to point #1 & #2, it’ll make your structures smaller and your access patterns more cache-friendly. Consider:

    struct Lol
        int a, b;
        Cat cat; // 140 bytes
    Now, imagine we have a loop iterating over Lols, performing calculations on a & b. Even jumping to next object is guaranteed miss. Then we have another loop, operating on Cats mostly. Same story. Now, imagine having only pointer to Cat in Lol. Suddenly, structure is 12 bytes, first loop is much more cache-friendly. Cats can be allocated from pool, or stored in another linear table. Maybe they can be made smaller using similar technique as well (140 bytes object will still kill your cache). Another typical (anti)pattern that’s often observed is:
    for every object
        if (object->isActive)
    Now, if object is not active, we still load whole cacheline, only to touch 1 byte! That’s about the worst thing we can do. It’s like inviting cache to our house, just to kick it out one second later. According to this presentation by Microsoft, typical console title has 30-40% cache usage (my experience is similar). This means, it only uses 1 byte of every 3 loaded. Better solution:
    for every object
        if (isObjectActive[i])
    That’s a little bit better. If object is not active, it won’t even be touched. Array of active object flags is a nice contiguous block of memory, we can even use bits to compress it even more. It is still not perfect, though. One thing that both cache & CPU love is linear access. Ideally, you want to load data that’s stored in linear blocks one after another, same thing with code, we don’t want to jump around (we touch two tables here). Again, there’s no one size fits all solution, but maybe it’d be better to only have list of active objects and traverse them as:
    for every active object
    Of course, we pay the price of maintaining this list, so it all depends on how often does it change, but may be worth considering. Some questions to answer here: how often does this flag change, what is the ratio of processed to ignored objects? One of the most pathological situations I’ve seen:
    while (itObject != 0)
        if (itObject->isDisposed())
        itObject = itObject->next;
    This was called every frame. In 99% of cases, there was nothing to dispose. It’d still iterate over hundreds of objects (‘notice, it’s a linked list!) just to find out there’s nothing to do. My first ‘fix’ was to introduce flag indicating if there’s anything to dispose (3 minutes of work, 1k cache misses/frame less in average situation). Separating hot/cold data pays off especially in situations where we need to iterate over objects often. Real-life scenario - for every NPC we had a knowledge base structure containing informations about other agents. It was often needed to query for info about particular agent, using ID. At first, I used sorted array. It’s a nice, cache-friendly structure, however, agent info data was quite big, so even skipping over elements when binary searching meant cache misses. I changed it so that array contained only ID+pointer pairs, actual data stored elsewhere. Now, when searching we only touch local data, when correct ID is found - pointer is returned and only then agent info is actually referenced.

  • consider using component-based model. It improves data locality & makes objects smaller. Now, we can batch-process components and load only data that we actually need. In traditional approach, base object structures can grow really big and again - we load 400 bytes only to touch one here, then 4 here and then maybe 64 there. With component based approach, all this data would be stored next to each other with no other stuff nearby. We can push it further and store components of same type in arrays, one after another, this way when we’re done with component A1, very next byte (well, depends on aligment) is component A2 (of same type, but belonging to different object), so it’s perfect for batch processing.

  • think about your data structures. Again - cache loves linear access. This means - be careful when using trees & linked lists. They can be made a little bit more friendly by using pool allocators, but maybe switching to another data structure should be considered as well. You may need to forget your CS classes and not focus on big O notation so much. If there are not many elements, simple array may still beat lists, even when inserting/removing objects. Sorted-arrays/hash tables are nice alternatives to map/rb-tree.

  • do not copy. If you look at many game engines, they spent big chung of time copying/moving/transforming objects between different lists. Every time you copy data around - you waste time. Keep asking yourself - do I really need to do it, maybe it’s enough to store reference somewhere. [Edit] As pointed out by Brian in comments - this may sound a little too radical. Let’s rephrase it to “do not copy, unless you have to”. As with other tips – sometimes you need to choose lesser evil. In an ideal world, we’d have everything in perfect, cache-friendly format. In reality, we sometimes may need to rearrange data (as in isActive example). See Brian’s comment for more info.

  • profile. Even if you’re sure you understand the situation - validate your assumptions. Focus on fixing place with poor bytes/cache line usage first. Profile it afterwards again to make sure your changes actually helped.

  • prefetch. Left it for the end as it’s the most brute-force solution, sometimes unavoidable, though. Even with data organized in a linear fashion, you will pay for cache miss from time to time (whenever new object is not in cache). In order to avoid it, you can try to let the CPU know you’ll need it soon. That’s what prefetch does, basically, it loads the data to cache, in the background. Here’s the trick though - it still needs few hundreds of cycles to complete. So, don’t expect miracles if you do this:

    prefetch(0, tab[i]);
    int a = tab[i].a;
    Popular pattern is to prefetch next element when processing current one. Depending on object sizes/processing time (it may be shorter than time required for prefetch), consider loading 2/3/4/etc elements ahead. Prefetching can be very useful, but it shouldn’t be your first option, try to employ other solutions first and resort to prefetching at the very end, as a finishing touch.

  • Publications every programmer should read (apart from those already mentioned):

More Reading
Older// Venice
comments powered by Disqus