Be nice to your cache

April 25, 2010 – 2:25 am

Short list of tips & guidelines that every game developer should keep in mind. No rocket science – common sense, really – but it’s still relatively rare to find codebases that apply them. It’s aimed especially at gameplay programmers, who operate a little further up from the metal. With today’s hardware, the cache can be your biggest friend or enemy; CPUs got to the point where a few extra calculations hurt much less than a single cache miss (so a LUT is not always the best option, for example). I’m not going to write too much about cache architecture itself – it’s a topic of its own. If you’re interested, see Gustavo’s article or this paper by Ulrich Drepper.

In a nutshell: cache is a very fast memory organized in lines of 32-128 bytes. When memory is referenced, the cache is checked first; if the data is there, access is very quick, if not, a whole line is loaded. The latter case is a cache miss and costs hundreds of cycles. To put things into perspective, consider the numbers from Tony Albrecht’s famous paper, Pitfalls of OOP – he speeds up bounding-sphere calculations by 35% just by changing the data layout (not touching the code at all).
There are three main reasons for cache misses:

  • data is read for the first time (and wasn’t lucky enough to share a cache line with previously loaded variables),
  • data has been evicted from cache,
  • cache thrashing (today’s caches are set-associative, so a memory location maps to a limited number of possible cache lines).

How to avoid them?

  • do not bloat your structures. Think about padding (Cruncher# to the rescue!), use bitfields instead of multiple bools, consider using smaller data types. Of course, this should never be done blindly – think about every case (see below for an example; don’t break access patterns just to minimize padding). As I mentioned above, it may be worth trading some CPU work for memory and compressing data.
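    To illustrate – a rough sketch of what reordering members and packing bools into bitfields can buy (exact numbers depend on compiler and ABI; these assume 4-byte ints on a 32-bit target):

    // Careless layout – every bool drags padding along with it:
    struct Bloated                // 20 bytes
    {
        bool isActive;            // 1 byte + 3 bytes padding
        int  health;              // 4 bytes
        bool isVisible;           // 1 byte + 3 bytes padding
        int  score;               // 4 bytes
        bool isDirty;             // 1 byte + 3 bytes tail padding
    };

    // Members sorted by size, flags packed into one word:
    struct Lean                   // 12 bytes
    {
        int health;
        int score;
        unsigned isActive  : 1;
        unsigned isVisible : 1;
        unsigned isDirty   : 1;
    };
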
  • think about your access patterns. Put members that are used together next to each other. Too often I see code that first accesses a member at offset 2, then jumps to offset 130. I’ve seen fragments that’d cause 3 cache misses on 3 accesses to the same object! I’ve been able to remove hundreds of cache misses per frame just by moving variables around.
  • separate hot/cold data. This relates to both points above – it’ll make your structures smaller and your access patterns more cache-friendly. Consider:
    struct Lol
    {
        int a, b;
        Cat cat; // 140 bytes
    };
    

    Now, imagine we have a loop iterating over Lols, performing calculations on a & b. Each Lol is 148 bytes, so even jumping to the next object is a guaranteed miss. Then we have another loop, operating mostly on Cats – same story. Now, imagine keeping only a pointer to Cat in Lol. Suddenly the structure is 12 bytes and the first loop is much more cache-friendly. Cats can be allocated from a pool, or stored in another linear table. Maybe they can be made smaller using a similar technique as well (a 140-byte object will still kill your cache).
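    In code, the split might look like this (a sketch – pool management omitted):

    struct Cat;               // still 140 bytes, but allocated from a pool
                              // or stored in its own linear table

    struct Lol                // 12 bytes now (on a 32-bit target)
    {
        int a, b;
        Cat* cat;             // the a & b loop never pulls Cat data into cache
    };
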
    Another typical (anti)pattern that’s often observed is:

    for every object
    {
        if (object->isActive)
            object->Update();
    }
    

    Now, if the object is not active, we still load a whole cache line only to touch one byte! That’s about the worst thing we can do – like inviting the cache to our house just to kick it out one second later. According to this presentation by Microsoft, a typical console title has 30-40% cache-line utilization (my experience is similar); in other words, it uses only about 1 byte out of every 3 loaded. A better solution:

    for every object i
    {
        if (isObjectActive[i])
            objects[i]->Update();
    }
    

    That’s a little better. If an object is not active, it won’t even be touched. The array of active-object flags is a nice contiguous block of memory, and we can compress it further by storing bits instead of bools. It’s still not perfect, though. One thing both the cache and the CPU love is linear access: ideally, you want to load data stored in linear blocks, one after another – the same goes for code, we don’t want to jump around (and here we touch two tables). Again, there’s no one-size-fits-all solution, but maybe it’d be better to keep only a list of active objects and traverse it as:

    for every active object
        object->Update();
    

    Of course, we pay the price of maintaining this list, so it all depends on how often it changes, but it may be worth considering. Some questions to answer here: how often does the flag change, and what is the ratio of processed to ignored objects? One of the most pathological situations I’ve seen:

    while (itObject != 0)
    {
        if (itObject->isDisposed())
            disposedObjects.push_back(itObject);
        itObject = itObject->next;
    }
    

    This was called every frame. In 99% of cases there was nothing to dispose, yet it would still iterate over hundreds of objects (…notice, it’s a linked list!) just to find out there’s nothing to do. My first “fix” was to introduce a flag indicating whether there’s anything to dispose at all (3 minutes of work, about 1k fewer cache misses per frame in the average case).
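    A minimal sketch of that fix (names invented for illustration):

    #include <vector>

    struct Object
    {
        bool    disposed;
        Object* next;
        bool isDisposed() const { return disposed; }
    };

    Object* g_objects = 0;               // head of the linked list
    bool g_anythingToDispose = false;    // the added flag

    void MarkDisposed(Object* obj)
    {
        obj->disposed = true;
        g_anythingToDispose = true;      // remember there is work to do
    }

    void CollectDisposed(std::vector<Object*>& disposedObjects)
    {
        if (!g_anythingToDispose)        // the 99% case: no list walk,
            return;                      // no cache misses
        for (Object* it = g_objects; it != 0; it = it->next)
            if (it->isDisposed())
                disposedObjects.push_back(it);
        g_anythingToDispose = false;
    }
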
    Separating hot/cold data pays off especially in situations where we need to iterate over objects often. Real-life scenario: for every NPC we had a knowledge-base structure containing information about other agents, and we often needed to query it for info about a particular agent, by ID. At first I used a sorted array. It’s a nice, cache-friendly structure; however, the agent-info entries were quite big, so even skipping over elements during the binary search meant cache misses. I changed it so that the array contained only ID+pointer pairs, with the actual data stored elsewhere. Now the search touches only this compact, local data; when the correct ID is found, the pointer is returned and only _then_ is the agent info actually referenced.
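    A sketch of that final layout (assuming plain integer IDs, and pre-C++11 style to match the era):

    #include <algorithm>
    #include <vector>

    struct AgentInfo;                  // the big, cold structure, stored elsewhere

    struct KnowledgeEntry              // 8 bytes on a 32-bit target
    {
        int        id;                 // hot: all the binary search ever touches
        AgentInfo* info;               // cold: dereferenced only on a hit
    };

    inline bool LessById(const KnowledgeEntry& e, int id) { return e.id < id; }

    // entries must be kept sorted by id
    AgentInfo* Find(const std::vector<KnowledgeEntry>& entries, int id)
    {
        std::vector<KnowledgeEntry>::const_iterator it =
            std::lower_bound(entries.begin(), entries.end(), id, LessById);
        if (it != entries.end() && it->id == id)
            return it->info;           // only _now_ is the agent info referenced
        return 0;
    }
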

  • consider using a component-based model. It improves data locality and makes objects smaller: we can batch-process components and load only the data we actually need. In the traditional approach, base object structures can grow really big and again – we load 400 bytes only to touch 1 byte here, then 4 there and maybe 64 somewhere else. With a component-based approach, all that data is stored next to each other, with no unrelated stuff nearby.
    We can push it further and store components of the same type in arrays, one after another. This way, when we’re done with component A1, the very next bytes (well, alignment permitting) are component A2 – same type, but belonging to a different object – which makes it perfect for batch processing.
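    Roughly like this (a sketch, with a made-up movement component):

    struct Vec3 { float x, y, z; };

    struct MovementComponent           // small and homogeneous
    {
        Vec3 position;
        Vec3 velocity;
    };

    // All movement components packed together: after movement[i],
    // movement[i + 1] occupies the very next bytes.
    static const int MAX_OBJECTS = 1024;
    MovementComponent movement[MAX_OBJECTS];

    void UpdateMovement(int count, float dt)
    {
        for (int i = 0; i < count; ++i)    // pure linear access
        {
            movement[i].position.x += movement[i].velocity.x * dt;
            movement[i].position.y += movement[i].velocity.y * dt;
            movement[i].position.z += movement[i].velocity.z * dt;
        }
    }
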
  • think about your data structures. Again – the cache loves linear access, which means being careful with trees and linked lists. They can be made a little friendlier with pool allocators, but switching to another data structure may be worth considering too. You may need to forget your CS classes and not focus on big-O notation so much: if there are not many elements, a simple array may still beat a list, even when inserting/removing objects. Sorted arrays and hash tables are nice alternatives to map/red-black trees.
  • do not copy. If you look at many game engines, they spend a big chunk of time copying/moving/transforming objects between different lists. Every time you copy data around, you waste time. Keep asking yourself: do I really need to do this – maybe it’s enough to store a reference somewhere?
    [Edit] As pointed out by Brian in the comments, this may sound a little too radical. Let’s rephrase it as “do not copy, unless you have to”. As with the other tips, sometimes you need to choose the lesser evil. In an ideal world we’d have everything in a perfect, cache-friendly format; in reality we may sometimes need to rearrange data (as in the isActive example). See Brian’s comment for more info.
  • profile. Even if you’re sure you understand the situation, validate your assumptions. Focus first on fixing the places with a poor bytes-used-per-cache-line ratio, then profile again to make sure your changes actually helped.
  • prefetch. I left it for the end as it’s the most brute-force solution – sometimes unavoidable, though. Even with data organized in a linear fashion, you’ll pay for a cache miss from time to time (whenever the next object is not in cache yet). To avoid it, you can let the CPU know you’ll need the data soon – that’s basically what prefetch does: it loads the data into cache in the background. Here’s the trick, though – it still needs a few hundred cycles to complete. So don’t expect miracles if you do this:
    prefetch(0, tab[i]);
    int a = tab[i].a;   // this load starts immediately – the prefetch above
                        // hasn't had time to finish, so we miss anyway
    

    A popular pattern is to prefetch the next element while processing the current one. Depending on object size and processing time (which may be shorter than the prefetch latency), consider loading 2/3/4/etc. elements ahead. Prefetching can be very useful, but it shouldn’t be your first option – try to employ other solutions first and resort to prefetching at the very end, as a finishing touch.
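    A sketch of the look-ahead pattern (the intrinsic is platform-specific – _mm_prefetch on x86, dcbt on PowerPC; Item and Process are placeholders):

    #include <xmmintrin.h>             // _mm_prefetch on x86

    struct Item { int a; };
    void Process(Item& item);

    void ProcessAll(Item* tab, int count)
    {
        const int AHEAD = 2;           // tune: must cover the prefetch latency
        for (int i = 0; i < count; ++i)
        {
            if (i + AHEAD < count)     // start pulling a future element in now
                _mm_prefetch((const char*)&tab[i + AHEAD], _MM_HINT_T0);
            Process(tab[i]);           // tab[i]'s prefetch was issued AHEAD
                                       // iterations ago – it had time to land
        }
    }
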

Publications every programmer should read (apart from those already mentioned):
21 Responses to “Be nice to your cache”

  1. Thanks for these tips! :)

    By triton on Apr 25, 2010

  2. Pretty amazing stuff.

    It’s good to know that there are still coders out there who know about low-level stuff and care about performance!

    By Stephane on Apr 25, 2010

  3. Some of your words could’ve been picked better. It’s not clear whether you’re referring to component-oriented programming (a method of structuring programs and frameworks) or structure-of-arrays (a data-layout principle).

    By Anon on Apr 25, 2010

  4. Very informative post. Going to subscribe to your blog.

    By hanDerPeder on Apr 25, 2010

  5. Anon: Both, in this case. I’ll try to rephrase this bullet point a little bit.

    By admin on Apr 25, 2010

  6. Great post.

    I disagree with the “do not copy” part, though. It also seemed odd because I’m not sure how that point relates to cache coherence. Copying data can be a useful tool in improving cache performance: it can be used to grab parts of a larger data structure while it is being accessed anyway, so the data can be used later in a cache-friendly way instead of accessing the same large structure again or chasing pointers to get there.

    You could argue that the data should already be broken apart from that larger structure and the benefit could be had without the copy. In many cases this is true; in others it isn’t. Copying can be a very simple way of lining up just the data you will need later, with no gaps. A very simple example is an operation that doesn’t touch every object. A list of just the objects that need the operation could hold pointers, or even pointers to components; either way, if only every other object needs to be operated on, your cache usage can get bad. If only a small amount of data is needed from each object, it’s better to copy that relevant data into the list instead of references to it. The later operation can then be done with 100% cache usage.

    In most cases copying is as bad as it seems, but it can also be a nice tool for improving cache usage.

    By Brian Karis on Apr 26, 2010

  7. This is cool but there’s no coverage of the metrics. How do you measure what’s happening to the cache and how it effects your program?

    By Dan on Apr 26, 2010

  8. *affects

    By Dan on Apr 26, 2010

  9. Thanks for this great post! Very informative.

    By Riddlemaster on Apr 26, 2010

  10. Very good explanation of how to align and operate on data structures in both C and C++. Very insightful article. I also recommend reading processor-specific manuals, such as the i386 microprocessor handbook and similar books. Also, in some cases I recommend using atomic operations if the compiler supports them; for example, GCC has some atomic builtins: http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html

    By Daniel Molina Wegener on Apr 26, 2010

  11. As always, relevant and nicely to the point :)
    Thanks a lot!

    By Emil on Apr 26, 2010

  12. @Brian: good catch, added a little explanation, thanks!
    @Dan: It depends on the development environment. For X360 there are PIX and the performance counters, for PS3 – SN Tuner. Not sure about PC, to be honest; I usually just do console optimization and it helps with PC as well.

    By admin on Apr 27, 2010

  13. Using bitfields instead of multiple bools may introduce LHS (load-hit-store) issues if those bools are used together (e.g. one is set and then another is tested).

    By RCL on Apr 27, 2010

  14. The link to ‘Pitfalls of OOP’ seems to be broken. Good read.

    By Cristian on Apr 27, 2010

  15. Pitfalls of OOP – link broken. Could you please provide a good link? Thanks in advance.

    By Ganeshram Iyer on Apr 28, 2010

  16. Link fixed, thanks for reporting.

    By admin on Apr 28, 2010

  17. How do you prefetch data on the PC, and what profiler do you use there to look for cache misses?

    By Reg on May 2, 2010

  18. @Reg: _mm_prefetch. For profiling – I don’t know, to be honest; never needed it. I optimize for X360/PS3 and it helps with PC as well (cache misses are less of a problem there, anyway).

    By admin on May 2, 2010

3 Trackback(s)

  1. Apr 25, 2010: Tweets that mention Be nice to your cache | .mischief.mayhem.soap. -- Topsy.com
  2. Apr 27, 2010: Rant Things » today’s amuse-cerveau – April 27, 2010
  3. Dec 22, 2010: Data Oriented Programming « JustinLiew.Com
