It feels as though not a day goes by without a new announcement regarding a major development in multicore technology. With so much press surrounding multicore, you have to ask the question “Is it for me?”, i.e. can I utilise multicore technology in my embedded application?
However, from a software developer’s perspective, all the code examples seem to demonstrate the (same) massive performance improvements to “rendering fractals” or “ray tracing programs”. The examples always refer to Amdahl’s Law, showing gains when using, say, 16 or 128 cores. This is all very interesting, but not what I imagine most embedded developers would consider “embedded”. These types of programs are sometimes referred to as “embarrassingly parallel”, as it is so obvious they would benefit from parallel processing. In addition, the examples use vendor-specific libraries, such as TBB from Intel, or language extensions with limited platform support, e.g. OpenMP. This area of parallelisation is also being addressed more and more by General-Purpose computing on Graphics Processing Units (GPGPU), using GPUs such as PowerVR from Imagination Technologies and Mali from ARM, programmed with OpenCL; however, this is getting off-topic.
So taking “fractals”, OpenMP and GPGPUs out of the equation, is multicore really useful for embedded systems?
For designs currently based around the traditional Linux model of processes and threads (POSIX threads, or pthreads), moving to the most common form of multicore solution, Symmetric Multi-Processing (SMP), should be relatively straightforward, as Linux natively supports SMP within the kernel design. SMP is where there is a single General Purpose Operating System (GPOS) running on top of multiple processing cores. The GPOS takes care of allocating and distributing tasks across the cores, in a manner transparent to the application. Today it is almost impossible to buy a modern desktop PC that is not already running Windows on an SMP platform.
In reality this transition may not be so simple. A well-designed application should work on SMP multicore without modification, but many applications aren’t “well designed” with regard to threading. The first common problem is that subtle bugs may appear that didn’t exist when executing on a single core. These are generally due to poor design practices (for example, using same-priority FIFO scheduling to enforce mutual exclusion) or the misunderstanding or misuse of inter-thread synchronisation primitives (e.g. using thread priorities to guarantee ordering). On a single core only one thread runs at any instant, so these tricks can appear to work; on SMP, threads genuinely run simultaneously and the assumption breaks. So first and foremost we need to ensure an application is “SMP Correct”.
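As a minimal sketch of what “SMP Correct” means in practice, the fragment below protects shared data with an explicit mutex instead of relying on scheduling policy or thread priorities. The names `SharedCounter` and `run_workers` are hypothetical, chosen only for this illustration, and standard C++11 threading stands in for whatever threading API a real design would use.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared counter, for illustration only.
// Every access takes the mutex, so correctness does not depend on
// how many cores exist or how the scheduler orders the threads.
class SharedCounter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++value_;
    }

    long value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    long value_ = 0;
};

// Starts `threads` workers that each increment the counter `iterations`
// times, waits for them all, and returns the final count.
long run_workers(unsigned threads, long iterations) {
    assert(threads > 0 && iterations >= 0);
    SharedCounter counter;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (long i = 0; i < iterations; ++i) {
                counter.increment();
            }
        });
    }
    for (auto& worker : pool) {
        worker.join();
    }
    return counter.value();
}
```

Calling `run_workers(4, 10000)` should return 40000 on one core or many. Without the lock, the same code could lose updates once the threads really do run at the same time.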
Assuming our application is “SMP Correct”, it will not necessarily show major performance gains. There are numerous reasons for this, but the major one is that either the application or the libraries it uses are not “SMP Optimised”. For example, a library may be written in C++ using Standard Template Library (STL) algorithms such as std::find_if or std::search. Standard library implementations of these algorithms are unlikely to exploit multicore parallelism. To become “SMP Optimised” the application would need to be reworked to use, for example, the GNU parallel library for the C++ STL (built on top of OpenMP), replacing the standard library calls with their “SMP Optimised” equivalents: __gnu_parallel::find_if and __gnu_parallel::search.
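To give a feel for what an “SMP Optimised” search does under the hood, here is a rough, portable sketch that splits a `std::find_if` across several asynchronous tasks. It is not the GNU parallel library's implementation; `chunked_find_if` and its `chunks` parameter are hypothetical names invented for this example, and a production version would also stop the later chunks early once an earlier one has found a match.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper, for illustration only.
// Returns the index of the first element satisfying `pred`,
// or data.size() if no element matches.
template <typename T, typename Pred>
std::size_t chunked_find_if(const std::vector<T>& data, Pred pred,
                            std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t n = data.size();
    const std::size_t step = (n + chunks - 1) / chunks;  // ceiling division
    const T* base = data.data();

    // Launch one search task per chunk of the input.
    std::vector<std::future<std::size_t>> results;
    for (std::size_t begin = 0; begin < n; begin += step) {
        const std::size_t end = std::min(begin + step, n);
        results.push_back(std::async(
            std::launch::async, [base, pred, begin, end, n] {
                const T* last = base + end;
                const T* it = std::find_if(base + begin, last, pred);
                return it == last ? n : static_cast<std::size_t>(it - base);
            }));
    }

    // Chunks are visited in order, so the first hit is the lowest index.
    std::size_t found = n;
    for (auto& result : results) {
        const std::size_t index = result.get();
        if (found == n && index != n) {
            found = index;
        }
    }
    return found;
}
```

The point of the sketch is the cost it exposes: each call creates tasks and waits on them, so for small containers the plain `std::find_if` may well be faster.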
Nevertheless, our biggest stumbling block is likely to be this: how can an existing non-threaded application, built around a unicore design, utilise a multicore solution? Unfortunately an operating system cannot automatically parallelise your application; it is left to you to partition it into threads. This is potentially the most difficult challenge, as numerous subtle complexities come into play. There are a number of approaches to aid this decomposition process, for example data decomposition, task decomposition and temporal decomposition. But, still, can you work out where to refactor the code to utilise those extra cores without wasting a huge amount of time on trial and error?
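As a small illustration of task decomposition, the sketch below takes two independent calculations over the same read-only buffer and runs one of them on another thread. The `FrameStats` type and `analyse_frame` function are hypothetical, made up for this example; the only assumption is that the two calculations share no writable state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical result type, for illustration only.
struct FrameStats {
    std::int64_t sum;
    int peak;
};

// Task decomposition: the sum and the peak are independent of each
// other, so they can be computed concurrently over the same input.
FrameStats analyse_frame(const std::vector<int>& samples) {
    assert(!samples.empty());

    // Task 1 runs on a separate thread.
    auto sum_task = std::async(std::launch::async, [&samples] {
        return std::accumulate(samples.begin(), samples.end(),
                               std::int64_t{0});
    });

    // Task 2 runs on the calling thread in the meantime.
    const int peak = *std::max_element(samples.begin(), samples.end());

    // get() blocks until task 1 has finished.
    return FrameStats{sum_task.get(), peak};
}
```

Both tasks only read `samples`, which is what makes the split safe. Data decomposition would instead give each thread a different slice of the buffer and the same calculation.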
For example, I have personally seen a ported application run slower on a quad-core system than on its original unicore system. This was a simple case of multiple threads sharing global data (not my code, I should add), which in turn led to cache coherency issues. There are numerous games you can play to help here, such as processor affinity, but all require a detailed understanding of multicore technologies. Interestingly, there are already commercial products, such as Pareon from Vector Fabrics and Prism from Critical Blue, which are specifically designed to aid and semi-automate this process.
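One common mitigation for this kind of slowdown, sketched below, is to give each thread its own data, padded so that no two threads write to the same cache line. The 64-byte line size is an assumption (it varies between processors), and `PaddedCounter` and `count_in_parallel` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache line size; check the target core's documentation.
constexpr std::size_t kCacheLine = 64;

// Each counter occupies a whole (assumed) cache line, so neighbouring
// counters in an array never share one.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "counter should fill exactly one assumed cache line");

// Each worker increments only its own counter; the totals are combined
// once, after every worker has finished.
long count_in_parallel(unsigned threads, long iterations) {
    assert(threads > 0);
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counters, t, iterations] {
            for (long i = 0; i < iterations; ++i) {
                ++counters[t].value;
            }
        });
    }
    for (auto& worker : pool) {
        worker.join();
    }
    long total = 0;
    for (const auto& counter : counters) {
        total += counter.value;
    }
    return total;
}
```

Because no location is written by more than one thread, no lock is needed, and the cores are not forced to pass ownership of a shared cache line back and forth.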
Moving away from the GPOS, many of the “high-end” Real-Time Operating Systems (RTOS) also support SMP (e.g. QNX Neutrino, VxWorks from Wind River and Integrity from Green Hills Software). What makes these solutions very attractive to the “real-time” designer is the support for hypervisor technology. In short, hypervisors allow a GPOS to co-exist with an RTOS (e.g. you can run Android alongside the RTOS, an arrangement often called Asymmetric Multi-Processing (AMP)), while ensuring the real-time aspects of the design aren’t compromised by the GPOS. Modern multicores are adding native features to support hypervisor technology (e.g. ARM’s TrustZone and Intel’s VT-x). For the real-time designer this gives detailed management of both software (i.e. locking tasks to cores) and hardware (i.e. mapping interrupts to cores), which, when done well, can actually lower power consumption compared to a single-core solution.
Finally, can “traditional” RTOS and bare-metal applications on lower-end processors, such as the ARM7TDMI, utilise multicore? In short, no, not without a major amount of rework. Even though it is possible to create multicore solutions based on, say, ARM’s Cortex-M family, the smaller RTOS is fundamentally not designed to support SMP. But before we discount it completely, there is another model: hybrid multicore, where the cores differ and run separate programs. As an example, NXP have a dual-core design based around an ARM Cortex-M4 and a Cortex-M0 on the same chip (LPC4300). Hybrid designs are also starting to appear that mix multiple high-end cores (e.g. 2 x Cortex-A8) with one or more low-end cores (e.g. a Cortex-M4). In this model the Cortex-A cores run in SMP mode, with the Cortex-M doing the “real-time” work.
In summary, multicore is making major inroads into embedded computing. Recent developments in operating systems and support tools help that transition. However, for the smaller embedded system, until product evolution demands it, SMP multicore may still be some time off.