The C++11 memory consistency model is probably one of the most significant aspects of Modern C++; and yet probably one of the least well-understood. I think the reason is simple: it’s really difficult to understand what the problem actually is.
The memory consistency problem is a concurrency problem. That is, it’s a problem that occurs when we start writing multi-threaded code. More specifically, it’s a parallelism problem – the real subtleties occur when you have two or more processors executing code.
In the first part of this two-part article we’ll have a look at the causes of the memory consistency problem.
In a modern multi-processor environment the program you wrote is almost certainly NOT the program that is executed
In the Good Old Days (whenever that was), computer programs behaved pretty much the way you might instinctively expect them to from looking at the source code
- Things happened in the way specified in the program.
- Things happened in the order specified in the program.
- Things happened the number of times specified in the program (no more, no less).
- Things happened one at a time.
This nostalgic fantasy is sometimes referred to as the Sequential Execution Model.
With sequential execution the only way to speed up execution is to clock the processor faster, but there’s a limit to how fast you can clock; and there are smarter ways of increasing performance. In a modern microprocessor the hardware manufacturers (and the compiler writers) add mechanisms to their architecture, designed to optimize the execution of sequential code.
An optimizing compiler can heavily refactor your code in order to hide pipeline latencies or take advantage of micro-architectural optimizations. Although there are a large number of optimisations a compiler can apply, from our perspective there are two that are significant:
- It can remove what it considers redundant reads or writes.
- It can decide to move a memory access earlier in order to give it more time to complete before the value is required, or later in order to balance out the accesses through the program.
Microprocessor optimisations allow the processor to execute instructions out of order, or speculatively, in order to reduce latency between opcodes and remove stalls while waiting for data.
Memory optimisation features like caches, interconnects and write buffers minimise the impact of high-latency main memory operations by allowing the processor to read / write from low-latency memory structures. Their operation means that there is often asynchronicity between the processor and the main memory.
The above is in no way an exhaustive list, but gives a flavour of the typical optimisations (in reality we could easily spend an entire article – or more! – on each topic). The actual complexity of each optimization, and the interactions between the optimisations, is, quite literally, mind-boggling. In reality, the days of being able to precisely determine the behaviour of our program at the machine level are over.
In order for existing programs and programming models to remain functional, even the most extreme modern processors will still (attempt to) preserve the illusion of Sequential Execution from within the executing program. That is, from an external perspective, the outputs from the program and the system’s measurable states should appear exactly as if the program were running on a machine with no optimisations.
However, even with the Sequential Execution illusion there are some things you can’t guarantee.
- The relative order of main memory reads for separate objects cannot be guaranteed.
- Writes to main memory must be considered asynchronous – that is, there is a ‘significant’ time between the write instruction and the value appearing in main memory
Note the emphasis on main memory reads/writes. As far as your program is concerned reads/writes are happening exactly as specified; it’s just the main memory believes otherwise.
It also turns out each of these optimisations are effectively equivalent – so (for example) code optimisations will yield the same issues as cache coherency management. In other words, if you have any one of those optimization features the net effect is the same as having all of them.
Moving to multiple processors
If you’re running on a single processor system then all is well with the world (even if you’re running with an RTOS; although there are still problems you need to be aware of, so keep reading)
The problem comes when you have two (or more) processors in your system.
(As an aside: If your two processors are running completely independent programs, and never communicate with each other then, once again, you have no problem; but this is a curiously bizarre corner-case system.)
Typically, in a multiprocessor system the programs running on each processor must communicate, either to exchange data or to synchronise operations. The only (software) mechanism they have to do that is via main memory.
Now we have a problem. If you cannot guarantee the relative order of reads between processors or you cannot guarantee when a write will arrive at main memory (the term used is, I believe, “eventually”) then there’s almost no way to reason about the overall behavior of the system.
Breaking the sequential execution illusion
In the example below we’ve got a four processor system. We are using two flags to synchronise the execution of threads C and D. Given this code it is reasonable to speculate that the output from the program will either be “flagA”, “flagB” (or nothing). The condition statements for Thread C and Thread D must never both be true.
Notice, to show the effects of read / write of the two flags I’ve added reads of the flags into local variables. Consider this as the equivalent of reading the values into registers.
On the diagram below time flows down the page so you can see relative timings and sequencing between the operations.
In the simple case where there is no optimisation the sequential execution constraint holds true.
A simple compiler optimisation can cause a failure of our sequential execution. In this case Thread C’s code is optimised so that it reads flagB before flagA. It is now possible to provide an execution sequence where both Thread C and Thread D’s conditions are true; and both messages are output.
In Thread C:
- flagB is read before it is set by Thread B; hence b = false
- flagA is read after it is set by Thread A; hence a = true
In Thread D:
- flagA is read before its value is set by Thread A; hence a = false
- flagB is read after its value is set by Thread B; hence b = true
Please note, the same effect could occur as the result of a processor out-of-order execution, or by a memory optimisation architecture (for example a write-back buffer). Remember the rule: you can’t guarantee read order for main memory reads (which I’ve just exploited), and writes are asynchronous (which would potentially cause the same effect)
(I’ll leave it as a classic ‘exercise for the reader’ to see how many ways you can get this example to produce incorrect output!)
Leave it on a cliffhanger
As it stands, if we write multi-threaded code on a modern multi-processor system, the collective set of compiler and hardware optimisations mean there is the potential for almost-impossible-to-find-and-fix bugs.
In part 2 of this article we’ll have a look at how we can solve this problem. And also how NOT to solve this solution!