It feels as though not a day goes by without a new announcement of a major development in multicore technology. With so much press surrounding multicore, you have to ask the question “Is it for me?”, i.e. can I utilise multicore technology in my embedded application?
However, from a software developer’s perspective, all the code examples seem to demonstrate the (same) massive performance improvements to “rendering fractals” or “ray tracing programs”. The examples always refer to Amdahl’s Law, showing gains when using, say, 16 or 128 cores. This is all very interesting, but not what I would imagine most embedded developers would consider “embedded”. These types of programs are sometimes referred to as “embarrassingly parallel”, as it is so obvious they would benefit from parallel processing. In addition, the examples use proprietary solutions, such as TBB from Intel, or language extensions with limited platform support, e.g. OpenMP. This area of parallelisation is also increasingly being addressed by General-Purpose computing on Graphics Processing Units (GPGPU), using multicore GPUs such as PowerVR from Imagination Technologies and Mali from ARM, programmed with OpenCL; however, this is getting off-topic.
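For reference, Amdahl’s Law bounds the speedup of a program in which a fraction p of the run time can be parallelised across n cores as 1 / ((1 - p) + p / n). A minimal sketch of that formula follows; the function name is purely illustrative.

```cpp
#include <cassert>

// Amdahl's Law: upper bound on speedup when `parallel_fraction` of the
// run time can be spread evenly across `cores` cores.
// (The function name is illustrative only.)
double amdahl_speedup(double parallel_fraction, unsigned cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

Even with 95% of the work parallelisable, this gives a bound of roughly 9x on 16 cores and roughly 17x on 128 cores, which is why the headline examples favour workloads with almost no serial part.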
So taking “fractals”, OpenMP and GPGPUs out of the equation, is multicore really useful for embedded systems?
For designs currently based around the traditional Linux process-and-thread model (using POSIX threads, or pthreads), moving to the most common form of multicore solution (Symmetric Multi-Processing; SMP) should be relatively straightforward, as Linux natively supports SMP within the kernel design. SMP is where there is a single General Purpose Operating System (GPOS) running on top of multiple processing cores. The GPOS takes care of allocating and distributing tasks across the cores, in a manner transparent to the application. Today it is almost impossible to buy a modern desktop PC that is not already running Windows on an SMP platform.
In reality this transition may not be so simple; a well designed application should work on SMP multicore without modification, but many applications aren’t “well designed” with regard to threading. The first common problem is that subtle bugs may appear that didn’t exist when executing on a single core. These are generally due to poor design practices (for example, using same-priority FIFO scheduling to enforce mutual exclusion) or the misunderstanding or misuse of certain inter-thread synchronisation primitives (e.g. using thread priorities to guarantee ordering). Both tricks rely on only one thread running at a time, which no longer holds once threads genuinely execute in parallel on different cores. So first and foremost we need to ensure an application is “SMP Correct”.
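As a minimal sketch of what “SMP Correct” means in practice, the shared data below is protected by an explicit mutex rather than by any assumption about scheduling order or priority. All the names (SharedCounter, run_workers) are hypothetical and only there for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Shared state guarded by an explicit lock, not by scheduling assumptions.
// (All names here are illustrative.)
struct SharedCounter {
    std::mutex guard;
    long value = 0;

    void increment()
    {
        std::lock_guard<std::mutex> lock(guard); // holds on one core or many
        ++value;
    }
};

// Starts `threads` workers, each incrementing the counter `iterations`
// times, and returns the final count once all workers have finished.
long run_workers(unsigned threads, long iterations)
{
    assert(threads > 0);
    SharedCounter counter;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (long i = 0; i < iterations; ++i) {
                counter.increment();
            }
        });
    }
    for (auto& worker : pool) {
        worker.join();
    }
    return counter.value;
}
```

With the lock in place the result is the same however the threads are distributed across cores; remove it and the count may come up short on a multicore target even if it always appeared correct on a single core.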
Assuming our application is “SMP Correct”, it may still not make major performance gains. There are numerous reasons for this, but the major one is that either the application or its libraries are not “SMP Optimised”. For example, a library may be written in C++ using Standard Template Library (STL) algorithms such as std::find_if or std::search. Standard library implementations of these algorithms are unlikely to exploit multicore parallelism. To become “SMP Optimised” the application would need to be reworked to use, for example, the GNU parallel mode of the C++ standard library (built on top of OpenMP), replacing the standard library calls with their “SMP Optimised” equivalents: __gnu_parallel::find_if and __gnu_parallel::search.
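A hedged sketch of that substitution is shown below. It assumes GCC’s libstdc++ with OpenMP enabled (e.g. g++ -fopenmp) and falls back to the sequential algorithm otherwise; the function first_negative is a made-up example, not taken from any real library.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// The parallel mode is specific to libstdc++ and needs OpenMP;
// without both, this sketch uses the ordinary sequential algorithm.
#if defined(_OPENMP) && defined(__GLIBCXX__)
#include <parallel/algorithm>
#endif

// Returns the index of the first negative sample, or -1 if there is none.
// (Hypothetical function, for illustration only.)
long first_negative(const std::vector<int>& samples)
{
    auto is_negative = [](int s) { return s < 0; };
#if defined(_OPENMP) && defined(__GLIBCXX__)
    auto it = __gnu_parallel::find_if(samples.begin(), samples.end(), is_negative);
#else
    auto it = std::find_if(samples.begin(), samples.end(), is_negative);
#endif
    assert(it == samples.end() || *it < 0); // sanity check on the result
    return it == samples.end() ? -1 : static_cast<long>(it - samples.begin());
}
```

The call site is almost unchanged, which is the attraction, but whether it is actually faster depends on the size of the container and the cost of the predicate, so it needs measuring on the target.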
Nevertheless, our biggest stumbling block is likely to be how an existing non-threaded application, built around a unicore design, can utilise a multicore solution. Unfortunately an Operating System cannot automatically parallelise your application; it will be left to you to partition it into threads. This is potentially the most difficult challenge, as there are numerous and subtle complexities that come into play. There are a number of approaches to aid this decomposition process, for example data decomposition, task decomposition and temporal decomposition. But, still, can you work out where to refactor the code to utilise those extra cores without wasting a huge amount of time on trial and error?
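To make one of those approaches concrete, here is a minimal sketch of data decomposition: the input is split into contiguous chunks, each chunk is processed on its own thread, and the partial results are combined at the end. The function name and chunking scheme are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Data decomposition: sum each contiguous chunk of `data` on its own
// thread, then combine the partial sums. (Illustrative sketch only.)
long long parallel_sum(const std::vector<int>& data, unsigned workers)
{
    assert(workers > 0);
    std::vector<long long> partial(workers, 0);
    std::vector<std::thread> pool;
    // Chunk size rounded up so that every element is covered.
    const std::size_t chunk = (data.size() + workers - 1) / workers;

    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(data.size(), w * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        pool.emplace_back([&data, &partial, w, begin, end] {
            // Each thread writes only its own slot, so no lock is needed.
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& worker : pool) {
        worker.join();
    }
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

This works neatly only because the chunks are independent; most real embedded code has dependencies between the pieces, which is where the trial and error comes in.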
For example, I have personally seen a ported application running slower on a quad-core system than on its original unicore system. This was a simple case of multiple threads sharing global data (not my code, I should add), which in turn led to cache coherency issues. There are numerous games you can play to help here, such as processor affinity, but all require a detailed understanding of multicore technologies. Interestingly there are already commercial products, such as Pareon from Vector Fabrics and Prism from Critical Blue, which are specifically designed to aid and semi-automate this process.
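One common form of this problem is false sharing, where variables used by different threads happen to sit in the same cache line. I can’t say that was the exact cause in the case above, but a minimal sketch of the usual remedy is to give each thread’s data its own cache line. The 64-byte line size and all names below are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;
constexpr unsigned kWorkers = 4;

// Aligning (and therefore padding) each counter to a full cache line means
// threads updating different counters never write to the same line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");

// Each worker increments only its own counter; the totals are summed
// after all workers have been joined.
long total_after(long iterations)
{
    assert(iterations >= 0);
    std::array<PaddedCounter, kWorkers> counters{};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < kWorkers; ++w) {
        pool.emplace_back([&counters, w, iterations] {
            for (long i = 0; i < iterations; ++i) {
                ++counters[w].value; // private to this thread: no lock needed
            }
        });
    }
    for (auto& worker : pool) {
        worker.join();
    }
    long total = 0;
    for (const auto& c : counters) {
        total += c.value;
    }
    return total;
}
```

Padding trades memory for speed, so on a small target it is worth confirming with a profiler that the cache really is the bottleneck before applying it.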
Moving away from GPOS, many of the “High-end” Real-Time Operating Systems (RTOS) also support SMP (e.g. QNX Neutrino, VxWorks from Wind River and Integrity from Green Hills Software). What makes these solutions very attractive to the “Real-Time” designer is the support for Hypervisor technology. In short, Hypervisors allow a GPOS to co-exist with an RTOS (e.g. you can run Android alongside the RTOS, an arrangement often called Asymmetric Multi-Processing (AMP)), while ensuring the real-time aspects of the design aren’t compromised by the GPOS. Modern multicores are adding native features to support Hypervisor technology (e.g. ARM’s TrustZone and Intel’s VT-x). For the real-time designer this gives detailed management of both software (i.e. locking tasks to cores) and hardware (i.e. mapping interrupts to cores), which, when done well, can actually lower power consumption compared to a single-core solution.
Finally, for “traditional” RTOS and bare-metal applications on lower-end processors such as the ARM7TDMI, can they utilise multicore? In short, no, not without a major amount of rework. Even though it is possible to create multicore solutions based on, say, ARM’s Cortex-M family, the smaller RTOS is fundamentally not designed to support SMP. But before we discount it completely, there is another model: Hybrid multicore, where the cores differ and run separate programs. As an example, NXP have a dual-core design based around an ARM Cortex-M4 and a Cortex-M0 on the same chip (LPC4300). Hybrid designs are also starting to appear that mix multiple high-end cores (e.g. 2 x Cortex-A8) with one or more low-end cores (e.g. a Cortex-M4). In this model the Cortex-A cores run as an SMP system, with the Cortex-M doing the “real-time” work.
In summary, multicore is making major inroads into embedded computing. Recent developments in operating systems and support tools help that transition. However, for the smaller embedded system, until product evolution demands it, SMP multicore may still be some time off.
Co-Founder and Director of Feabhas since 1995.
Niall has been designing and programming embedded systems for over 30 years. He has worked in different sectors, including aerospace, telecoms, government and banking.
His current interests lie in IoT Security and Agile for Embedded Systems.
I wonder how truly real-time (by which I mean "having easily determinable timings") an SMP system can ever be.
If the number of tasks (or threads, for those who prefer this term) does not exceed the number of processors, then one-fixed-task-per-core is just as good, and simpler overall, so I will discount this case.
SMP comes into its own when the number of tasks can exceed the number of processors. In this case, a task, on becoming ready, is scheduled for execution on some processor. This could be an idle processor or it could be one on which some other task is already running. Clearly, the time taken to get the task running will differ substantially between the two cases, and which of them will arise in any given circumstance is, for practical purposes, unpredictable.
To get the worst case, we have to assume that all tasks in all circumstances contend for the same processor, which is exactly what we do now in our single-processor RTOS designs. If our design is constrained by the worst case - as, indeed, it should be for anything approaching "hard" real time - then it will either work on a single core or it can't be relied on to work at all. The extra cores (when used with an SMP OS) are useless to us! The same goes, of course, for all the caches and similar go-faster gadgets which come as standard, these days, even with most single cores. We have to assume they are disabled, therefore their availability is of no use to us.
What you call Hybrid Multicore is, of course, very useful but only the buzz-words are new. We have been doing this kind of thing for decades, but using more chips to do it.
Another problem with multicore chips (SMP or hybrid AMP) is keeping all the cores running at their maximum execution rate when program and/or data are external to the chip. In other words, the external memory subsystem becomes a resource bottleneck that can have a severe impact on the chip's overall computational performance. For example, consider the effect of running two cores at 300MHz from a shared external memory subsystem running burst-read DDR2 at 150MHz.
Dave, I think use of shared inter-processor memory is bound to be a bottleneck in any multiprocessor system. Some improvement can be made by segmenting the memory. For instance, a 4-core system could have 4 dual-port memory segments for interprocessor communication rather than one 4-port segment. I assume that I am not the first person to have thought of this!
Shared memory presents other callenges, as well, such as mutual exclusion and cache coherency, which all take their toll on performance and complicate the system.
I believe that something like point-to-point buffered serial (or parallel) interconnects have greater promise, if they can be made to go fast enough.
However it's done, data-passing is the problem area. Multicore is about partitioning the *processing* load and any code a processor might run can, in principle, be stored locally, even at the expense of duplication. Failing that, it can be cached quite effectively (but see my previous comment about worst-case timing). Data, though, will always have to be passed around. For this reason, I think SMP will continue to be much more useful in computationally intensive systems than it ever will become in systems which run small numbers of small algorithms on constantly changing data. The latter, I believe, describes most small embedded systems perfectly, including DSP applications!
P.S. For full connectivity, I would need 6 memory segments in my 4-core example. Sorry about that, and about the misspelling of "challenges".
I have not worked on a multi-core firmware project (yet) - but have worked on numerous projects with "co-processors". Could the multi-core technology be used to do something like "have a core dedicated to I/O processing and another for user interface" - a typical use of co-processors? Or are the CPUs and RTOSs set up to "conveniently hide" the multi-core nature of the device, so SMP happens auto-magically?
There are techniques where you can lock down processes to processors (processor affinity) even within an SMP model. This then gives you a nice halfway house between SMP and a full Asymmetric model (e.g. running two different OSs, which is possible with Hypervisor support). As you can guess, there are many ways to slice and dice the solution.
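As a rough sketch of processor affinity, and assuming a Linux target with glibc (other operating systems and RTOSs have their own, different APIs), a thread can be pinned to a single core using sched_setaffinity. The helper names below are made up for illustration.

```cpp
#include <cassert>
#include <sched.h> // sched_getaffinity, sched_setaffinity, CPU_* (Linux/glibc)

// Returns the lowest-numbered core the calling thread is currently allowed
// to run on, or -1 on error. (Helper names are illustrative only.)
int first_allowed_core()
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    for (int core = 0; core < CPU_SETSIZE; ++core) {
        if (CPU_ISSET(core, &mask)) {
            return core;
        }
    }
    return -1;
}

// Restricts the calling thread (pid 0 means "the caller") to one core.
// Returns true on success.
bool pin_calling_thread(int core)
{
    assert(core >= 0 && core < CPU_SETSIZE);
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}
```

So, for the I/O and user-interface split asked about above, one thread could call pin_calling_thread with one core and the other thread with a different core, while the rest of the system is still scheduled by the SMP kernel.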