Go with the flow

If you’re interested in lock-free/multicore programming, make sure to visit Charles Bloom blog and read his Low Level Threading series. Table of contents is a good start. You’ll find plenty of valuable links there, always good to have them gathered in one place (if you’re not familiar with the topic, reserve two weeks, reading all this stuff may take some time). On top of that, some Relacy-tested code (I tend to agree that code that has not been extensively tested with some automatic suite, cannot be considered safe) and many good insights. Basic one, that cannot be stressed enough, is: fight the cause, not the result. Try to minimize/eliminate contention before introducing some fancy, light-weight mutexes or containers. If you’ve 100 threads fighting for access to single variable – you’re screwed anyway, switching from critical-section guarded to lock-free container will not save you.

Once again, there’s no silver bullet. If the underlying concept is broken, fix it first, before resorting to low-level tweaks. One useful pattern I found is dividing problem to the most simple elements. Usually, more complicated structures can be expressed as collection of more basic ones. Consider MPSC problem (multiple producers, single consumer). This is the case I encountered when coding C++ part of MemTracer.

I have multiple threads producing data (allocation info) and single thread consuming messages and sending them to C# analyzer. Canonical solution would be global queue, accessed by all producers and consumer. This obviously means high contention. However, we can try to simplify the scenario and take advantage of the fact there’s only one consumer. Every producer thread can output messages to its own queue (stored in TLS for example) and consumer grabs data from each queue in turn. It effectively transforms the problem to N SPSC (single producer, single consumer) containers instead of one MPMC. SPSC queue can be implemented in a much more efficient way, in fact, you won’t even need to use interlocked ops, all push/pop operations take few cycles only (depends on stored data, of course).

For example of such implementation see one of my previous notes. This is still not perfect, because it may stall when we’re producing faster than consuming, but this can be solved by using non-bounded container (…which in turn makes us to provide quick allocator, but that’s another story) or at least amortized by using bigger queue. Net result is not only much more effective, it’s also way simpler, easier and less bug-prone (SPSC vs MPMC).

More Reading
Older// Aligning arrays