Be nice to your cache

²⁵/Apr 2010 By admin

Short list of tips & guidelines that every game developer should keep in mind. No rocket science, common sense, really, but it’s still relatively rare to find codebases that apply to them. It’s especially aimed at gameplay programmers, who operate a little bit further up from the metal. With todays hardware, cache can be your biggest friend or enemy, CPUs got the the point where a few extra calculations hurt much less than a single cache miss (so LUT is not always the best option, for example). I’m not going to write too much about cache architecture itself, it’s a topic on its own. If you’re interested see Gustavo’s article or this paper by Ulrich Drepper. In a nutshell - cache is a very fast memory organized in lines of 32-128 bytes. When memory is referenced, cache is tested first, if data is there - access is very quick, if not - whole line is loaded. In the latter case, this means cache miss and costs hundreds of cycles. To put things into perspective, consider numbers from famous Tony Albrecht’s paper - Pitfalls of OOP, he speeds bounding spheres calculations by 35% just by changing data layout (not touching code at all). There are 3 main reasons of cache misses:

data is read for the first time (and wasn’t lucky enough to be in same cacheline as previously loaded variables),
data has been evicted from cache,
cache thrashing (todays caches are set associative, so memory location maps to limited number of possible cache lines).

How to avoid them?

do not bloat your structures. Think about padding (Cruncher# to the rescue!), use bitfields instead of multiple bools, consider using smaller data types. Of course, this should never be done automatically, think about every case (see below for example, don’t break access patterns to minimize padding). As I mentioned above - it may be worth to trade off CPU speed for memory and compress data.
think about your access patterns. Put members that are used together next to each other. Too often I see code that first accesses member at offset 2, then jumps to 130. I’ve seen fragments that’d cause 3 cache misses on 3 accesses to the same object! I’ve been able to remove hundreds of cache misses per frame just by moving variables around.
separate hot/cold data. That’s related both to point #1 & #2, it’ll make your structures smaller and your access patterns more cache-friendly. Consider:
```
struct Lol
{
    int a, b;
    Cat cat; // 140 bytes
};
```
Now, imagine we have a loop iterating over Lols, performing calculations on a & b. Even jumping to next object is guaranteed miss. Then we have another loop, operating on Cats mostly. Same story. Now, imagine having only pointer to Cat in Lol. Suddenly, structure is 12 bytes, first loop is much more cache-friendly. Cats can be allocated from pool, or stored in another linear table. Maybe they can be made smaller using similar technique as well (140 bytes object will still kill your cache). Another typical (anti)pattern that’s often observed is:
```
for every object
{
    if (object->isActive)
        object->Update();
}
```
Now, if object is not active, we still load whole cacheline, only to touch 1 byte! That’s about the worst thing we can do. It’s like inviting cache to our house, just to kick it out one second later. According to this presentation by Microsoft, typical console title has 30-40% cache usage (my experience is similar). This means, it only uses 1 byte of every 3 loaded. Better solution:
```
for every object
{
    if (isObjectActive[i])
        object->Update();
}
```
That’s a little bit better. If object is not active, it won’t even be touched. Array of active object flags is a nice contiguous block of memory, we can even use bits to compress it even more. It is still not perfect, though. One thing that both cache & CPU love is linear access. Ideally, you want to load data that’s stored in linear blocks one after another, same thing with code, we don’t want to jump around (we touch two tables here). Again, there’s no one size fits all solution, but maybe it’d be better to only have list of active objects and traverse them as:
```
for every active object
    object->Update();
```
Of course, we pay the price of maintaining this list, so it all depends on how often does it change, but may be worth considering. Some questions to answer here: how often does this flag change, what is the ratio of processed to ignored objects? One of the most pathological situations I’ve seen:
```
while (itObject != 0)
{
    if (itObject->isDisposed())
        disposedObjects.push_back(itObject);
    itObject = itObject->next;
```
This was called every frame. In 99% of cases, there was nothing to dispose. It’d still iterate over hundreds of objects (‘notice, it’s a linked list!) just to find out there’s nothing to do. My first ‘fix’ was to introduce flag indicating if there’s anything to dispose (3 minutes of work, 1k cache misses/frame less in average situation). Separating hot/cold data pays off especially in situations where we need to iterate over objects often. Real-life scenario - for every NPC we had a knowledge base structure containing informations about other agents. It was often needed to query for info about particular agent, using ID. At first, I used sorted array. It’s a nice, cache-friendly structure, however, agent info data was quite big, so even skipping over elements when binary searching meant cache misses. I changed it so that array contained only ID+pointer pairs, actual data stored elsewhere. Now, when searching we only touch local data, when correct ID is found - pointer is returned and only then agent info is actually referenced.
consider using component-based model. It improves data locality & makes objects smaller. Now, we can batch-process components and load only data that we actually need. In traditional approach, base object structures can grow really big and again - we load 400 bytes only to touch one here, then 4 here and then maybe 64 there. With component based approach, all this data would be stored next to each other with no other stuff nearby. We can push it further and store components of same type in arrays, one after another, this way when we’re done with component A1, very next byte (well, depends on aligment) is component A2 (of same type, but belonging to different object), so it’s perfect for batch processing.
think about your data structures. Again - cache loves linear access. This means - be careful when using trees & linked lists. They can be made a little bit more friendly by using pool allocators, but maybe switching to another data structure should be considered as well. You may need to forget your CS classes and not focus on big O notation so much. If there are not many elements, simple array may still beat lists, even when inserting/removing objects. Sorted-arrays/hash tables are nice alternatives to map/rb-tree.
do not copy. If you look at many game engines, they spent big chung of time copying/moving/transforming objects between different lists. Every time you copy data around - you waste time. Keep asking yourself - do I really need to do it, maybe it’s enough to store reference somewhere. [Edit] As pointed out by Brian in comments - this may sound a little too radical. Let’s rephrase it to “do not copy, unless you have to”. As with other tips – sometimes you need to choose lesser evil. In an ideal world, we’d have everything in perfect, cache-friendly format. In reality, we sometimes may need to rearrange data (as in isActive example). See Brian’s comment for more info.
profile. Even if you’re sure you understand the situation - validate your assumptions. Focus on fixing place with poor bytes/cache line usage first. Profile it afterwards again to make sure your changes actually helped.
prefetch. Left it for the end as it’s the most brute-force solution, sometimes unavoidable, though. Even with data organized in a linear fashion, you will pay for cache miss from time to time (whenever new object is not in cache). In order to avoid it, you can try to let the CPU know you’ll need it soon. That’s what prefetch does, basically, it loads the data to cache, in the background. Here’s the trick though - it still needs few hundreds of cycles to complete. So, don’t expect miracles if you do this:
```
prefetch(0, tab[i]);
int a = tab[i].a;
```
Popular pattern is to prefetch next element when processing current one. Depending on object sizes/processing time (it may be shorter than time required for prefetch), consider loading 2/3/4/etc elements ahead. Prefetching can be very useful, but it shouldn’t be your first option, try to employ other solutions first and resort to prefetching at the very end, as a finishing touch.
Publications every programmer should read (apart from those already mentioned):
- Christer Ericson - Memory Optimization (this article is a watered down version of Christer’s paper, basically),
- Bruce Dawson - Multi-Core Memory Coherence - The Hidden Perils of Sharing Data,
- Igor Ostrovsky - Gallery of Processor Cache Effects,

Old comments

triton 2010-04-25 17:18:40

Thanks for these tips! :)

Stephane 2010-04-25 21:04:51

Pretty amazing stuff.
That’s good to know that there are still coders out there who know about low level stuff and care about performances !

Anon 2010-04-25 21:58:47

Some of your words could’ve been picked better. It’s not clear whether you’re referring to component-oriented programming (a method of structuring programs and frameworks) or structures-of-arrays (a data layout principle).

hanDerPeder 2010-04-25 22:04:08

Very informative post. Going to subscribe to your blog.

admin 2010-04-25 22:33:08

Anon: Both, in this case. I’ll try to rephrase this bullet point a little bit.

Brian Karis 2010-04-26 01:43:35

Great post.
I disagree with the “do not copy” part though. It also seemed odd because I’m not sure how that point was related to cache coherence. Copying data can be a useful tool in improving cache performance. Copying can be used to grab parts of a larger data structure while it is being accessed in the first place to be used later in a cache friendly way instead of accessing the same large data structure again or chasing pointers to get there.
You could argue that the data should already be broken apart from that larger structure and the benefit could be made without the copy. In many cases this is true. In others it isn’t. Copying can be a very simple way to lining up just the data you will need later with no gaps. A very simple example of this is an operation that doesn’t operate on every object. A list of just the objects that need this operation could be pointers, or even pointers to components. Either way if every other object needs to be operated on your cache usage can get bad. If a small amount of data is needed from the objects it’s better to copy that relevant data to the list instead of references to it. The later operation can then be done with 100% cache usage.
In most cases copying is as bad as it seems but it can also be a nice tool for improving cache usage.

Dan 2010-04-26 05:48:06

This is cool but the there’s no coverage of the metrics. How do you measure what’s happening to the cache and how it effects your program?

Dan 2010-04-26 05:53:14

*affects

Riddlemaster 2010-04-26 07:15:57

Thanks for this great post! Very informative.

Daniel Molina Wegener 2010-04-26 12:33:21

Very good explanation on how to align and operate over data structures in both C and C++. Very insightful article. I also recommend to read processor specific manuals, such as i386 microprocessor handbook, and similar books. Also, in some cases I recommend to use atomic operations if the compiler supports them, for example on GCC we have some Atomic Builtins: http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html

Emil 2010-04-26 19:20:59

As always, relevant and nicely to the point :)
Thanks alot!

admin 2010-04-26 23:30:31

@Brian: good catch, added little explanation, thanks!
@Dan: It depends on a development environment. For X360 there is PIX and performance counters, for PS3 - SN Tuner. Not sure about PC, to be honest, I usually just go with console optimization and it helps with PC as well.

Rant Things » today’s amuse-cerveau – April 27, 2010 2010-04-27 06:11:26

[…] Be nice to your cache | .mischief.mayhem.soap.April 26, 2010 – Understand your hardware, and you’ll make better software. […]

RCL 2010-04-27 09:42:46

Using bitfields instead of multiple bools may introduce LHS (load hit store) issues if those bools are used together (e.g. one is set and then another is tested).

Cristian 2010-04-27 14:54:37

The link to ‘Pitfalls of OOP’ seems to be broken. Good read.

Ganeshram Iyer 2010-04-27 23:15:30

Pitfalls of OOP - link broken. Could you please provide a good link? Thanks in advance.

admin 2010-04-28 00:40:52

Link fixed, thanks for reporting.

Reg 2010-05-02 09:19:37

How do you prefetch data on the PC and what profiler do you use on PC to look for cache misses?

admin 2010-05-02 16:40:56

@Reg: _mm_prefetch. For profiling - don’t know to be honest, never needed it. I optimize for X360/PS3, it helps with PC as well. (cache misses are less of a problem here, anyway)

Data Oriented Programming « JustinLiew.Com 2010-12-22 18:20:26

[…] http://msinilo.pl/blog/?p=614 – being nice to your cache. […]

.m.m.s.

Maciej Sinilo