Streaky locks
4/May 2024
I have been comparing how locks (critical sections) behave on different platforms recently and noticed that Windows seemed to be quite different from the others. The behavior is consistent across the two PC machines I tried (a desktop and a laptop). The test code is a bit too long to include in the post, but it is linked here. The gist of it is really simple, though - we have multiple threads (even 2 is enough) fighting for access to a shared resource, guarded by a critical section. The actual thread function is as simple as:
void Run(size_t tid)
{
    // Signal our event, wait for others
    SetEvent(readyEvents[tid]);
    WaitForMultipleObjects(NUM_THREADS, readyEvents, TRUE, INFINITE);
    for (size_t i = 0; i != NUM_LOOPS; ++i)
    {
        EnterCriticalSection(&cs);
        ++sharedCounter;
        lockHistory.push_back(tid);
        LeaveCriticalSection(&cs);
    }
    printf("%zd done\n", tid);
}
(The lockHistory vector is preallocated, so there are no memory operations while holding the lock.)
The important part is that the threads should run on different cores; otherwise the flow is very different - most notably, 2 threads can't spin at the same time. The lock history table is exactly what it sounds like - we record which thread owned the lock at any given moment.
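For context, the rest of the harness looks roughly like this. This is a sketch reconstructed from the snippet above, not the linked test code verbatim; in particular, the event creation and the affinity setup are my assumption about how the threads end up on different cores:

#include <windows.h>
#include <cstdio>
#include <thread>
#include <vector>

constexpr size_t NUM_THREADS = 2;
constexpr size_t NUM_LOOPS = 10000;

CRITICAL_SECTION cs;
HANDLE readyEvents[NUM_THREADS];
size_t sharedCounter = 0;
std::vector<size_t> lockHistory; // preallocated up front, no allocations under the lock

void Run(size_t tid); // the thread function shown above

int main()
{
    InitializeCriticalSection(&cs);
    lockHistory.reserve(NUM_THREADS * NUM_LOOPS);
    for (size_t i = 0; i != NUM_THREADS; ++i)
        readyEvents[i] = CreateEvent(nullptr, TRUE, FALSE, nullptr); // manual reset

    std::thread threads[NUM_THREADS];
    for (size_t i = 0; i != NUM_THREADS; ++i)
    {
        threads[i] = std::thread(Run, i);
        // pin each thread to its own core, so they can actually spin at the same time
        SetThreadAffinityMask(threads[i].native_handle(), DWORD_PTR(1) << i);
    }
    for (auto& t : threads)
        t.join();

    printf("counter = %zd, history entries = %zd\n", sharedCounter, lockHistory.size());
    DeleteCriticalSection(&cs);
    return 0;
}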
In our simple test, let's use 2 threads. This means there are 2 potential 'extreme' scenarios: each thread holding the lock for the entire duration of its loop (a single ownership change) and 'trading' the lock for every increment (for… uh, roughly NUM_THREADS * NUM_LOOPS changes). Obviously, it is not likely for either of these extremes to happen; we will see some "in-between" case, and the interesting question is where it falls in that interval. As it turns out, most platforms are closer to #2, trading ownership fairly frequently… Windows, however, is more 'streaky'. Once a thread grabs the lock, it seems to have a high chance of getting it again. It is also fairly consistent, which could be surprising, because multithreaded scenarios rarely are. On my laptop, with 2 threads and 10,000 loops each, I get 500-700 ownership changes with an average streak of 30. That means, on average, a thread increments our counter 30 times(!) before the other one grabs the lock. This is with a 'default' critical section (created with InitializeCriticalSection). As you might know, Windows critical sections do not issue a system call immediately; they first spin for a bit, testing the lock state and issuing a pause instruction (see this post for more details).
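The numbers above come straight from lockHistory. Counting them is trivial; a small helper along these lines (mine, not from the linked test) is all it takes:

// Count ownership changes and the average streak length in the recorded history.
void PrintStreakStats(const std::vector<size_t>& history)
{
    if (history.empty())
        return;
    size_t changes = 0;
    for (size_t i = 1; i != history.size(); ++i)
        if (history[i] != history[i - 1])
            ++changes;
    // N changes split the history into N+1 streaks
    printf("%zd ownership changes, average streak %.1f\n",
        changes, double(history.size()) / double(changes + 1));
}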
Spinning detour
What does 'for a bit' mean, though? Well, the answer, as usual, is - it depends. We can actually inspect the CRITICAL_SECTION object; it has a SpinCount member variable, but its value is somewhat intriguing - 33556432 (AFAICT it is the same across different machines). Does it really mean we spin 33556432 times? Quite unlikely. The pause instruction itself typically has a reciprocal throughput of ~30-70 cycles, so it would take a very long time. Luckily, we can inspect the assembly for RtlpEnterCriticalSectionContended and see what it is doing exactly. Here's the relevant fragment:
mov rdi,qword ptr [rsi+20h] // CS at RSI, load SpinCount to RDI
mov rax,rdi
and edi,0FFFFFFh
As we can see, one of the first things it does is mask the value with 0FFFFFFh. This means 33556432 becomes 2000. Does that mean we spin up to 2000 times, then? Well, again, no… After that we transform it further - we multiply it by 10 and divide by "some" factor. Dennis puts it at 47, but in my experience it varies between different machines. For example, on my laptop it is 83 (so * 10 / 83), on my desktop it is 75. I can only assume it is calculated based on your system/CPU. It does come from a global variable, presumably calculated at startup (already set before hitting main, though). To be even more exact, if this "magical" constant is 12, the formula is different, but I have never seen that happen, so let's stick with: (CS_COUNT & 0xFFFFFF) * 10 / C.
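Putting it together, the effective number of spin iterations works out to something like this (a reconstruction of the formula above for illustration, not actual OS code; the divisor is the machine-dependent global):

// (SpinCount & 0xFFFFFF) * 10 / C, where C is the machine-dependent constant
// (83 on my laptop, 75 on my desktop, 47 in Dennis' case).
size_t EffectiveSpinIterations(size_t spinCount, size_t divisor)
{
    return (spinCount & 0xFFFFFF) * 10 / divisor;
}

// EffectiveSpinIterations(33556432, 83) == 240 -- the 'default' CS on my laptop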
To conclude: all of this was to say that the default spin count is actually 2000, and for that value it'll spin up to ~240 times on my laptop. Why did I want to know that? Well, my hypothesis was that the 'streaky locks' were a property of the Windows scheduler, but if we spin while waiting, we should notice the release almost immediately and grab the lock. The only problem with this theory was that even the default value should be high enough; after all, our 'lock window' is just a few instructions (inc + whatever the fast path of vector push_back is doing). I tried bumping the spin count anyway, but it didn't seem to matter.
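For reference, picking a non-default spin count is just a different initialization call (the concrete value below is mine; the post doesn't say how high I went), and it presumably goes through the same mask/scale in the contended path:

// Instead of InitializeCriticalSection(&cs), pass an explicit spin count.
InitializeCriticalSectionAndSpinCount(&cs, 100000);

// An already-initialized CS can be adjusted, too:
SetCriticalSectionSpinCount(&cs, 100000);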
I tried going the other direction and only using 1 spin, so pretty much always putting the thread to sleep… and it definitely made a difference. It is now pretty much guaranteed that one thread will own the lock for the whole time (i.e. 10,000 increments). That might not be that surprising; after all, the other thread now needs to wake up before it can grab the lock. Still, it is a race between one thread leaving the CS, then re-entering and grabbing it again, and another thread that is already waking up inside the 'enter' function - you would think the latter would 'win' at least a few times. Other platforms I tried do not show that pattern, even when they hit the system/put threads to sleep. Of course, this is a very contrived example; we are not doing anything else, just iterate/lock/increment, so the thread that just released the lock grabs it again almost immediately. There are 2 quite interesting wrinkles, though:
- adding some actual work after releasing the lock doesn't change this pattern, as long as it is short (sub-100 ticks on my machine), as sketched below,
- as mentioned, other platforms do not behave this way.
Just for kicks, I added some busy work and was still observing similar results (up to a certain point - once it gets too long, we are back to 'trading' locks).
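The 'busy work' itself was nothing elaborate; something along these lines reproduces the effect (a sketch - using rdtsc as the tick source is my assumption, the linked test may measure it differently):

#include <intrin.h>

// Burn roughly 'ticks' timestamp-counter ticks outside the lock.
void BusyWork(unsigned long long ticks)
{
    const unsigned long long start = __rdtsc();
    while (__rdtsc() - start < ticks)
        YieldProcessor(); // pause
}

// Inside the loop from Run():
//     LeaveCriticalSection(&cs);
//     BusyWork(100); // sub-~100 ticks: the same thread still re-grabs the lock almost every time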
To be 100% honest, I still can't exactly explain why Windows seems to behave this way and why it is different from the other platforms. It sounds highly improbable, but sometimes it seemed like thread A (the lock owner) was simply running faster - it would leave the CS, do the work, then re-grab the CS before the other thread was done with its work.