Introduction to the ARM® Cortex®-M7 Cache – Part 3 Optimising software to use cache

Caches – Why do we miss?

Cold Start

As stated, both data and instruction caches are required to be invalidated on system start. Therefore, the first load of any object (code or data) cannot be in cache (thus the cold start condition).

One available technique to help with cold-start conditions is to pre-load data into the cache. The ARMv7-M instruction set includes the Preload Data (PLD) instruction. The PLD instruction signals to the memory system that data memory accesses from a specified address are likely in the near future. If the address is cacheable, then the memory system responds by pre-loading the cache line containing the specified address into the cache. Unfortunately, there is currently no CMSIS intrinsic support for the PLD instruction.
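Although CMSIS offers no intrinsic, GCC and Clang provide the `__builtin_prefetch` builtin, which on Arm targets that support it is typically emitted as a PLD instruction. The following is a minimal sketch, assuming a GCC-compatible compiler; the function name `sum_with_prefetch` and the prefetch distance of one 32-byte line are illustrative assumptions, not tuned values.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines, i.e. eight 32-bit words per line. */
#define WORDS_PER_LINE 8u

/* Sum an array, hinting the next cache line while working on the current one. */
uint32_t sum_with_prefetch(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Every eight words, request the data eight words ahead (if any).
           Second argument 0 = read access, third argument 3 = keep in cache. */
        if ((i % WORDS_PER_LINE) == 0 && (i + WORDS_PER_LINE) < n) {
            __builtin_prefetch(&data[i + WORDS_PER_LINE], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint: the result is the same with or without it, and whether it helps depends on the memory system and the access pattern.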

It is worth noting that some processor data caches implement an automatic prefetcher (e.g. Cortex-A15). This monitors cache misses, and when a pattern is detected, the automatic prefetcher starts linefills in the background. Unfortunately, the Cortex-M7 data cache does not support automatic prefetch.

Capacity

The other most obvious reason for misses is that of cache capacity. The larger the cache, the higher the probability of a cache hit and the lower the frequency of eviction. However, all this comes at a cost, not only financial but also power.

A larger cache is, naturally, going to contribute to the overall System-on-Chip (SoC) costs, making the end microprocessor more expensive. In high volume designs, this is always a significant factor in SoC choice.

Among all processor components, the cache and memory subsystem generally consumes a large portion of the total microprocessor system power, commonly 30-50% [Zang13]. Caches thus add a further level of complexity for the poor, overworked engineer trying to calculate the design’s power model, and have an impact on all battery-based designs.

Conflict

Finally, misses will occur due to natural eviction followed by a reload. So a simple loop such as:

for(uint32_t i = 0; i < N; ++i) {
   dst[i] = src[i];
}

may result in multiple eviction/reload cycles depending on the memory addresses of dst and src. Also, any dst[i] eviction will result in a memory write as the line is marked dirty. The 4-way data cache goes a long way to help reduce the potential of dst[i] eviction, but because of the pseudo-random replacement policy, it may happen more often than we would expect or like.
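Whether dst and src compete for the same cache lines depends on which set their addresses select. A minimal sketch of that calculation follows, assuming a 16 KB, 4-way data cache with 32-byte lines. The Cortex-M7 data cache size is configurable, so the 16 KB figure is an assumption, as are the function names.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE   32u                 /* bytes per cache line */
#define NUM_WAYS    4u                  /* 4-way set-associative */
#define CACHE_SIZE  (16u * 1024u)       /* assumed: 16 KB data cache */
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))   /* 128 sets */

/* Set selected by an address: the line number modulo the number of sets. */
uint32_t cache_set_index(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_SETS;
}

/* True if two addresses compete for the same set (and so for its four ways). */
bool same_cache_set(uint32_t a, uint32_t b)
{
    return cache_set_index(a) == cache_set_index(b);
}
```

With these assumed parameters, buffers whose addresses differ by a multiple of 4 KB (128 sets × 32 bytes) select the same set.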

Code Optimizations

There are a number of key areas where we, as software developers, can potentially impact the performance of cache:

  • Algorithms
  • Data structures
  • Code structures

Posted in ARM, C/C++ Programming, CMSIS, Cortex, Design Issues

Introduction to the ARM® Cortex®-M7 Cache – Part 2 Cache Replacement Policy

Instruction Cache Replacement Policy

Starting with the simpler instruction cache case; when we encounter a cache miss the normal policy is to evict the current cache line and replace it with the new cache line. This is known as a read-allocate policy and is the default on all instruction caches.

Cold start (first read)

It should also be noted that on system power-up the initial state of the cache is unknown. On the ARMv7-M all caches are disabled at reset. Before the cache is accessed, it needs invalidating. As well as each line having a tag associated with it, each line also has a valid flag (V-bit) which indicates whether the current cache line has been populated from main memory or not.

The cache must be invalidated before being enabled. The method for this is implementation-defined. In some cases, invalidation is performed by hardware, whereas in other cases it is a requirement of the boot code. CMSIS has added a specific instruction cache API to support these operations for the Cortex-M7:

void SCB_InvalidateICache(void);
void SCB_EnableICache(void);    
void SCB_DisableICache(void);

Direct Mapped Cache

So far, we have assumed that the whole of the cache is mapped as one contiguous area; we call this a Direct Mapped cache. A lot of work has been done over the years in trying to improve cache performance, and the key metric has been found to be the cache hit/miss rate, i.e. the ratio of reads that are satisfied from the cache against those requiring a main memory fetch and cache line eviction.

Studies have shown that the Direct Mapped Cache may not always achieve the best cache hit ratios. Take, for example, the following code:

// file1.c
int filter_algo(int p)
{
   ...
   return result;
}

// file2.c
void apply_filter(int* p, int N)
{
   ...
   for(int i = 0; i < N; ++i) {
     x[i] = filter_algo(*(p+i)) + k[i];
     ...
   }
   ...
}

If, unluckily, apply_filter was in the address range of 0x00004000 and filter_algo was around 0x00005000 then each time apply_filter called on filter_algo, this would result in an eviction of the apply_filter code and the filling of the filter_algo instructions. But upon return, we would have to evict filter_algo code and refill with apply_filter instructions. As the algorithm executed it would cause cache thrashing.
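The thrashing described above can be reproduced with a small software model. This is a minimal sketch, assuming a 4 KB direct-mapped cache with 32-byte lines so that 0x00004000 and 0x00005000 select the same line; the type and function names are hypothetical and model only the tag and V-bit, not the cached data.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE  32u
#define NUM_LINES  128u     /* assumed: 4 KB direct-mapped cache */

typedef struct {
    bool     valid[NUM_LINES];   /* V-bit per line */
    uint32_t tag[NUM_LINES];
    uint32_t misses;
} DirectMappedCache;

/* Clear every V-bit, as required before the cache is enabled. */
void dm_invalidate(DirectMappedCache *c)
{
    for (uint32_t i = 0; i < NUM_LINES; ++i) {
        c->valid[i] = false;
    }
    c->misses = 0;
}

/* Model one fetch: returns true on a hit, otherwise evicts and refills. */
bool dm_access(DirectMappedCache *c, uint32_t addr)
{
    uint32_t line  = addr / LINE_SIZE;
    uint32_t index = line % NUM_LINES;
    uint32_t tag   = line / NUM_LINES;

    if (c->valid[index] && c->tag[index] == tag) {
        return true;
    }
    c->valid[index] = true;
    c->tag[index]   = tag;
    c->misses++;
    return false;
}

/* Alternate between caller and callee 'calls' times; return the miss count. */
uint32_t dm_call_misses(uint32_t caller, uint32_t callee, uint32_t calls)
{
    DirectMappedCache c;
    dm_invalidate(&c);
    for (uint32_t i = 0; i < calls; ++i) {
        (void)dm_access(&c, caller);
        (void)dm_access(&c, callee);
    }
    return c.misses;
}
```

In this model, ten call/return pairs between 0x00004000 and 0x00005000 miss on every access, whereas two addresses that select different lines miss only on their first (cold) access.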

Set-associative cache

Due to the principles of locality, research suggested that rather than a single direct-mapped cache, a better approach is to split the cache into an array of buffers, where two addresses with the same line index can reside in different array indices.

In cache terminology, each array index is known as a way, so we talk about N-Way caches. The number of ways can vary, typically ranging from 2 to 8. For the Cortex-M7 the instruction cache is a 2-way system. When we access an address, we now have ‘N’ possible lines to make a tag match against. The group of lines involved in the tag comparison, one per way, is called the set.
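To see why extra ways help, here is a minimal sketch of a 2-way set-associative lookup, assuming a 4 KB cache organised as 64 sets of two 32-byte lines. The names are hypothetical, and the round-robin replacement is an assumption chosen to keep the model deterministic; it is not a description of the Cortex-M7 policy.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE  32u
#define NUM_WAYS   2u
#define NUM_SETS   64u      /* assumed: 4 KB = 64 sets x 2 ways x 32 bytes */

typedef struct {
    bool     valid[NUM_SETS][NUM_WAYS];
    uint32_t tag[NUM_SETS][NUM_WAYS];
    uint32_t next_victim[NUM_SETS];   /* round-robin replacement (assumption) */
    uint32_t misses;
} TwoWayCache;

/* Clear every V-bit before first use. */
void twoway_invalidate(TwoWayCache *c)
{
    for (uint32_t s = 0; s < NUM_SETS; ++s) {
        for (uint32_t w = 0; w < NUM_WAYS; ++w) {
            c->valid[s][w] = false;
        }
        c->next_victim[s] = 0;
    }
    c->misses = 0;
}

/* Model one fetch: compare the tag against every way of the selected set. */
bool twoway_access(TwoWayCache *c, uint32_t addr)
{
    uint32_t line = addr / LINE_SIZE;
    uint32_t set  = line % NUM_SETS;
    uint32_t tag  = line / NUM_SETS;

    for (uint32_t w = 0; w < NUM_WAYS; ++w) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            return true;                        /* hit in way w */
        }
    }
    uint32_t victim = c->next_victim[set];      /* miss: refill one way */
    c->valid[set][victim] = true;
    c->tag[set][victim]   = tag;
    c->next_victim[set]   = (victim + 1u) % NUM_WAYS;
    c->misses++;
    return false;
}

/* Alternate between caller and callee 'calls' times; return the miss count. */
uint32_t twoway_call_misses(uint32_t caller, uint32_t callee, uint32_t calls)
{
    TwoWayCache c;
    twoway_invalidate(&c);
    for (uint32_t i = 0; i < calls; ++i) {
        (void)twoway_access(&c, caller);
        (void)twoway_access(&c, callee);
    }
    return c.misses;
}
```

In this model, 0x00004000 and 0x00005000 still select the same set, but each can occupy its own way, so only the two cold misses remain.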

Posted in ARM, CMSIS, Cortex, Design Issues

Introduction to the ARM® Cortex®-M7 Cache – Part 1 Cache Basics

For many years, the majority of smaller microprocessor-based systems have typically not used caches. With the launch of the ARMv7 architectures, caches were supported in the ARMv7-A family (e.g. Cortex-A8, etc.) but not supported in the core design of the ARMv7-M micro-controllers such as the Cortex-M3 and Cortex-M4. However, when the Cortex-M7 was announced, it broke that mould by offering cache support for the smaller embedded micro-controller.

This series is broken down into three parts:

  1. Basic principles of cache
  2. Cache replacement policies
  3. Optimising software to use cache

Why introduce caches into the architecture?

The purpose of a cache is to increase the average speed of memory access. The most immediate and obvious benefit is one of improved application performance, which in turn can lead to an enhanced power model. Caches have been used for many years (dating back as far as the 1960s) in high-end processor-based systems.

The driver behind the development and use of a cache is based on The Locality Principle.

Caches operate on two principles of locality:

  • Spatial locality
    • Access to one memory location is likely to be followed by accesses to adjacent locations.
  • Temporal locality
    • Access to an area of memory is likely to be repeated within a short period.

Also of note is Sequentiality: given that a reference has been made to a particular location s, it is likely that within the next several references, a reference to the location s + 1 will be made. Sequentiality is a restricted type of spatial locality and can be regarded as a subset of it.
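Spatial locality and sequentiality are easiest to see in how an array is traversed. The following minimal sketch uses hypothetical function names and an arbitrary 64 × 64 byte matrix; both functions compute the same sum, but only the first visits memory in address order.

```c
#include <stddef.h>
#include <stdint.h>

#define ROWS 64u
#define COLS 64u

/* Row-by-row walk: each access is to the byte after the previous one,
   so consecutive accesses fall in the same or the next cache line. */
uint32_t sum_by_rows(const uint8_t *m)
{
    uint32_t sum = 0;
    for (size_t r = 0; r < ROWS; ++r) {
        for (size_t c = 0; c < COLS; ++c) {
            sum += m[r * COLS + c];
        }
    }
    return sum;
}

/* Column-by-column walk: each access is COLS bytes away from the previous
   one, so consecutive accesses land in different cache lines. */
uint32_t sum_by_cols(const uint8_t *m)
{
    uint32_t sum = 0;
    for (size_t c = 0; c < COLS; ++c) {
        for (size_t r = 0; r < ROWS; ++r) {
            sum += m[r * COLS + c];
        }
    }
    return sum;
}
```

How much the second form costs in practice depends on the cache size and line length of the target.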

In high-end modern systems, there can be many forms of cache, including network and disk caches, but here we will focus on main memory caches. In addition, main memory caches can also be hierarchical, i.e. there are multiple caches between the processor and main memory, often referred to as L1, L2, L3, etc., with L1 being nearest to the processor core.

The Cache

The simplest way to think of a cache is as a small, high-speed buffer placed between the central processor unit (CPU) and main memory that stores blocks of recently referenced main memory.

Once we’re using a cache, each memory read will result in one of two outcomes:

  • A cache hit – the memory for the address is already in cache.
  • A cache miss – the memory access was not in cache, and therefore we have to go out to main memory to access it.
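The effect of hits and misses on the average speed of memory access can be put into numbers. This is a minimal sketch with a hypothetical function name; the cycle counts passed in are whatever the designer assumes for a hit and for a miss, not Cortex-M7 figures.

```c
#include <stdint.h>

/* Average cycles per memory access, given hit/miss counts and their costs. */
double average_access_cycles(uint32_t hits, uint32_t misses,
                             uint32_t hit_cycles, uint32_t miss_cycles)
{
    uint32_t total = hits + misses;
    if (total == 0u) {
        return 0.0;     /* no accesses recorded */
    }
    return ((double)hits * hit_cycles + (double)misses * miss_cycles)
           / (double)total;
}
```

For example, with an assumed 1-cycle hit and 11-cycle miss, 90 hits and 10 misses average out at 2 cycles per access.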

Posted in ARM, CMSIS, Cortex, Design Issues

TDD with Compiler Explorer

Compiler Explorer (CE) has been around for several years now. When it first appeared on the scene, it immediately became an invaluable tool. Its ability to show generated assembler from given source code across many different compilers and ISAs (Instruction Set Architectures) is “mind-blowing”. We use it extensively when teaching as it allows you to clarify the effect your code can have on both performance and memory usage. 

However, rather than limiting itself to only showing generated assembler, recent developments include the ability to execute the code and examine the program output. Having online support for this is nothing especially new (e.g. Coliru, Wandbox, etc.), but it’s helpful to have it within one tool.

For example, given a simple “hello, world!” program, we see the standard output in a new tab:

Test-Driven Development

One of the significant benefits to come out of the growth of Agile development is the acceptance that unit testing is just part of the development cycle, rather than a separate activity after coding.

Agile unit-testing, better known as Test-Driven Development, or TDD for short, has led to a growth of unit-test frameworks, all based around the original xUnit model, typified by GoogleTest (gtest).

As part of the continuing improvements and feature extensions, CE added support for various libraries to be included as part of the build. Included in this set is support for gtest, as well as two other, more modern, test frameworks: Catch2 and doctest.

Using Google Test with Compiler Explorer

Posted in Agile, C/C++ Programming, Testing

Side effects and sequence points; why volatile matters

Introduction

Most embedded programmers, and indeed anyone who has attended a Feabhas programming course, are familiar with using the volatile qualifier when accessing registers. But the ‘whys and wherefores’ of using volatile are not always obvious.

In this article, we explore why using volatile works, but more importantly, why it is needed in the first place.

Peripheral register access

Let’s start with a simple, fictitious example. Suppose we have a peripheral with the following register layout:

register       width  offset
control        byte   0x00
configuration  byte   0x01
data           byte   0x02
status         byte   0x03

with a base address of 0x40020000.

In a previous posting we covered using structures for register access. Let’s assume the following (incorrect) code has been written:

#include <stdint.h>

typedef struct {
    uint8_t ctrl;
    uint8_t cfg;
    uint8_t data;
    uint8_t status;
} Port_t;

Port_t* const port   = (Port_t*) 0x40020000;

void write(uint8_t data)
{
  port->ctrl = 1;         // Enter configuration mode
  port->cfg  = 3;         // Configure the device
  port->ctrl = 0;         // Enter operational mode.

  while(port->status == 0) 
  {
    // Wait for data...
  }
  port->data = data;
}

If we compile this code using GCC 8.2 for Arm using the flags:

  • -O3 – high optimisation
  • -mcpu=cortex-m4

we get the following generated Arm Thumb-2 assembler:


write:
  ldr r3, .L5
  ldrb r2, [r3, #3] @ zero_extendqisi2
  mov r1, #768
  strh r1, [r3] @ movhi
  cbnz r2, .L2
.L3:
  b .L3
.L2:
  strb r0, [r3, #2]
  bx lr
.L5:
  .word 1073872896
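In the listing, the three stores to ctrl and cfg have been merged into a single halfword store of 768 (0x0300, i.e. ctrl = 0 and cfg = 3), and status is loaded only once before the loop, so a zero status spins forever at .L3. For comparison, here is a minimal sketch of the same code with the register structure accessed through a volatile-qualified pointer. The function name `port_write` and the pointer parameter are assumptions made so the sketch can be exercised off-target; on the device it would be called with the fixed base address.

```c
#include <stdint.h>

typedef struct {
    uint8_t ctrl;
    uint8_t cfg;
    uint8_t data;
    uint8_t status;
} Port_t;

/* On the target this would be called with (volatile Port_t *)0x40020000.
   The volatile qualifier makes every read and write below an observable
   access, so the compiler may not merge, reorder among themselves, or
   remove them. */
void port_write(volatile Port_t *port, uint8_t data)
{
    port->ctrl = 1;         /* Enter configuration mode */
    port->cfg  = 3;         /* Configure the device */
    port->ctrl = 0;         /* Enter operational mode */

    while (port->status == 0) {
        /* Wait: status is re-read on every iteration */
    }
    port->data = data;
}
```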

Complete example

Explaining the assembler

Posted in ARM, C/C++ Programming, CMSIS, Cortex

Practice makes perfect, part 3 – Idiomatic kata

Previously, we looked at some of the foundational C++ code kata – that is, elements of C++ coding that are absolutely key to master if you’re going to be programming in C++.
Practice makes perfect, part 1 – Code kata
Practice makes perfect, part 2 – foundation kata

In this article I want to introduce what I call ‘idiomatic’ kata.  These exercises have a bit more latitude (and variation) in how they can be implemented.  In that respect they are closer to traditional code kata.  The idea with these kata is to reinforce C++ constructs that aren’t encountered quite so often in typical C++ programming and so are more easily forgotten (or even avoided).

There’s no order to these kata.  None is more important than any other.

If you’re new to C++ there’s nothing wrong with practicing these exercises; but use them as a means to explore new language features.  It’s good to learn how these idiomatic patterns work.

Posted in C/C++ Programming, Design Issues, training

Running the eclipse-mosquitto MQTT Broker in a docker container

I first wrote about MQTT and IoT back in 2012, when I developed a simple C-based library to publish and subscribe Quality of Service (QoS) level 0 MQTT messages.

Subsequently, MQTT has grown to be one of the most widely used IoT connectivity protocols, with direct support from services such as AWS. Back in 2010, the first open-source MQTT Broker was Mosquitto. Mosquitto is now part of the Eclipse Foundation, and an iot.eclipse.org project, sponsored by cedalo.com.

Another area that has grown during the interim period is the use of container technology, such as Docker, for both testing and deployment. We have also covered Docker extensively in previous blog posts.

For another internal dogfood project, I wanted to run a local MQTT Broker rather than a web-based broker, such as http://mqtt.eclipse.org/. Mosquitto can be installed natively on Windows, Mac and Linux. Still, one of the significant benefits of Docker is not polluting your working machine with lots of different tools.

Running Mosquitto in a Docker container is, therefore, a perfect test environment. Rather than, as in the previous Docker blog articles, build our own Docker image containing Mosquitto, we can use the official Dockerhub image.

eclipse-mosquitto Docker image

Pull the latest image

I’m assuming you have Docker installed and configured for your local working environment.

First, pull the latest image from Dockerhub:

% docker pull eclipse-mosquitto

Note of caution: the instructions on the Dockerhub site are incorrect!

Run the docker image

Run the basic Docker image with default settings:

% docker run -it --name mosquitto -p 1883:1883 eclipse-mosquitto 
1582194844: mosquitto version 1.6.8 starting
1582194844: Config loaded from /mosquitto/config/mosquitto.conf.
1582194844: Opening ipv4 listen socket on port 1883.
1582194844: Opening ipv6 listen socket on port 1883.

The -p 1883:1883 argument maps the docker container’s default MQTT port 1883 to port 1883 on the localhost (127.0.0.1). Alternatively, we could map it onto another localhost port if it clashed with a locally running MQTT broker, e.g. -p 11883:1883.

Using the --name directive also allows the container to be stopped and restarted, using:

% docker stop mosquitto

and

% docker start mosquitto

Testing the eclipse-mosquitto Docker container

To test the setup of the running Mosquitto container, I used my original software, still available on github. To build this, you’ll need a C compiler (ideally gcc or clang) and CMake.

Alternatively, any MQTT client should work for test purposes.

Subscribe

Next, we must subscribe to a topic. In a command window, invoke the subscribe client; by default our project subscribes to the topic hello/world on the broker at 127.0.0.1:1883, e.g.

% ./mqttsub
MQTT SUB Test Code
port:1883 
Connected to MQTT Server at 127.0.0.1:1883
Subscribed to MQTT Service hello/world with QoS 0

Publish

To test publishing, open another command window and invoke the publisher-client. The publisher-client, by default, publishes 10 messages to the topic hello/world and then closes the connection, e.g.

% ./mqttpub 
MQTT PUB Test Code
port:1883 
Connected to MQTT Server at 127.0.0.1:1883
Published to MQTT Service hello/world with QoS0
Sent 1 messages
Published to MQTT Service hello/world with QoS0
Sent 2 messages
Published to MQTT Service hello/world with QoS0
Sent 3 messages
Published to MQTT Service hello/world with QoS0
Sent 4 messages
Published to MQTT Service hello/world with QoS0
Sent 5 messages
Published to MQTT Service hello/world with QoS0
Sent 6 messages
Published to MQTT Service hello/world with QoS0
Sent 7 messages
Published to MQTT Service hello/world with QoS0
Sent 8 messages
Published to MQTT Service hello/world with QoS0
Sent 9 messages
Published to MQTT Service hello/world with QoS0
Sent 10 messages
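For readers curious about what each of these publishes puts on the wire, the following is a minimal sketch of encoding a QoS 0 PUBLISH packet using the MQTT 3.1.1 layout. It is not taken from the library used above: the function name is hypothetical, and the payload text in the usage note is an assumption based on the subscriber output.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode an MQTT 3.1.1 PUBLISH packet with QoS 0 (no packet identifier).
   Returns the number of bytes written, or 0 if the buffer is too small
   or a length is out of range. */
size_t mqtt_encode_publish_qos0(uint8_t *buf, size_t cap,
                                const char *topic,
                                const uint8_t *payload, size_t payload_len)
{
    size_t topic_len = strlen(topic);
    if (topic_len > 0xFFFFu) {
        return 0;                       /* topic length is a 16-bit field */
    }
    size_t remaining = 2u + topic_len + payload_len;
    if (remaining > 268435455u) {
        return 0;                       /* limit of the 4-byte length field */
    }

    /* Remaining Length: 7 bits per byte, top bit set if more bytes follow. */
    uint8_t len_bytes[4];
    size_t  n = 0;
    size_t  x = remaining;
    do {
        uint8_t b = (uint8_t)(x % 128u);
        x /= 128u;
        if (x > 0u) {
            b = (uint8_t)(b | 0x80u);
        }
        len_bytes[n++] = b;
    } while (x > 0u);

    if (1u + n + remaining > cap) {
        return 0;
    }

    size_t pos = 0;
    buf[pos++] = 0x30u;                 /* PUBLISH, DUP=0, QoS=0, RETAIN=0 */
    memcpy(&buf[pos], len_bytes, n);
    pos += n;
    buf[pos++] = (uint8_t)(topic_len >> 8);      /* topic length, MSB first */
    buf[pos++] = (uint8_t)(topic_len & 0xFFu);
    memcpy(&buf[pos], topic, topic_len);
    pos += topic_len;
    if (payload_len > 0u) {
        memcpy(&buf[pos], payload, payload_len);
        pos += payload_len;
    }
    return pos;
}
```

Encoding the topic hello/world with an assumed 16-byte payload of "Message number 1" gives a 31-byte packet: one header byte, one length byte, two bytes of topic length, 11 bytes of topic and 16 bytes of payload.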

Subscribe output

On returning to the subscriber window, we will see the received messages displayed.

Message number 1
Message number 2
Message number 3
Message number 4
Message number 5
Message number 6
Message number 7
Message number 8
Message number 9
Message number 10

Mosquitto window output

Returning to the window where the docker image was invoked, various log messages are shown:


1582194844: mosquitto version 1.6.8 starting
1582194844: Config loaded from /mosquitto/config/mosquitto.conf.
1582194844: Opening ipv4 listen socket on port 1883.
1582194844: Opening ipv6 listen socket on port 1883.
1582205221: New connection from 172.17.0.1 on port 1883.
1582205221: New client connected from 172.17.0.1 as default_sub (p1, c1, k30).
1582205225: New connection from 172.17.0.1 on port 1883.
1582205225: New client connected from 172.17.0.1 as default_pub (p1, c1, k30).
1582205235: Client default_pub disconnected.

Setting up persistent files

Mosquitto can be configured, for example, to change logging, password, listener-ports, etc. This is achieved using the mosquitto.conf file.

To set up mosquitto.conf, first create a local working directory with three sub-directories of config, data and log, e.g.


% cd
% mkdir docker-mosquitto
% cd docker-mosquitto
% mkdir mosquitto 
% mkdir mosquitto/config/ 
% mkdir mosquitto/data/
% mkdir mosquitto/log/

Create a config file

Next, create an empty file called mosquitto.conf in the newly created subdirectory mosquitto/config/:

% touch mosquitto/config/mosquitto.conf

Edit the config file

Using your favourite editor (okay vi isn’t my favourite, but it’s convenient):

% vi mosquitto/config/mosquitto.conf

And add, as a minimum, the following set of conf directives:


persistence true
persistence_location /mosquitto/data/
log_dest file /mosquitto/log/mosquitto.log

The full list of configuration items can be found [here](https://mosquitto.org/man/mosquitto-conf-5.html).

Run the docker image with a mounted volume

Now, when invoking the docker image, we use the -v flag to map the local filesystem into the docker container. The running container will now pick up the locally defined mosquitto.conf. Invoke, e.g.:

% docker run -it --name mosquitto -p 1883:1883 -v $(pwd)/mosquitto:/mosquitto/ eclipse-mosquitto 

Closing

I hope this post gave you a useful overview of getting an MQTT Mosquitto Broker up and running using Docker.

Hopefully, in future posts, I will be able to share further details of the dogfood project.

Posted in General

Practice makes perfect, part 2 – foundation kata

In the previous article we looked at the need for repetitive practice – code kata.  In this article I want to present some of my preferred foundational kata.

If you’re a beginner to C++ I recommend you fully internalize all these examples before having a look at the idiomatic kata.

If you’re a more experienced C++ programmer you may be looking at these kata and thinking “Jeez – these are so basic!  Who couldn’t do this!”.  Bear in mind though – we all started somewhere!  I still practice most of these kata regularly.   Remember, you practice exercises like this not until you get them right, but until you can’t get them wrong!

Posted in C/C++ Programming, Design Issues, training

Practice makes perfect, part 1 – Code kata

Imagine you’re at a Jazz club, enjoying a smooth jazz quartet.  It’s time for the sax player’s solo.  All of a sudden, he stops the band, rifles in a bag and pulls out a book of music theory.

“What the?!” you think.

The saxophonist looks to the audience, “I’ve just got to look up the notes for E-flat minor.  I can never remember them.”

It’s understandable you’re unlikely to be too impressed with this particular musician.

If you’re a musician, a sportsperson, a dancer, martial artist or anyone who practices an activity that requires skill and competence you’ll be familiar with repetitive practice.

For a musician it means seemingly endless hours of practicing scales and arpeggios to learn music theory and even more hours practicing studies to learn techniques to play your chosen instrument more precisely, more effectively, more beautifully.

If you practice Japanese martial arts you repeat kihon (fundamentals) and kata (or, as they were known in the west, flourishes) to refine your movements and skills.

Almost every activity includes these two practices alongside more creative, freeform practices.  They are essential to developing muscle-memory.  Without them, any creative practice will always be clumsy and immature; and there will always be a ‘ceiling’ to what you can achieve.

What does this have to do with software engineering?

Posted in C/C++ Programming, Design Issues, training

Function function return return values values*

The latest C++ standard is now upon us, so it’s time to have a look at some of its new features.

To put one of the new features into context we’re going to have a look at – as the title suggests – multiple function return values.

I should really distinguish between the following:

  • A Subroutine (or Subprogram) is a parameterised block of code that can be called multiple times from within a program.
  • A Procedure is a subroutine that may have multiple input and/or output parameters and usually does not return a value.  Procedures may change the state of the system.
  • A Function is a subroutine that has only input parameters and produces a return value.  Functions are stateless – they will always produce the same result for the same inputs.

C++ programmers typically blur these distinctions (or ignore them).  To keep with the C++ vernacular I will use the term ‘function’ to mean any of the above.
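As a point of reference for the distinction above, here is a minimal sketch in plain C (not the C++ feature this article goes on to discuss) of the two traditional ways to get more than one result out of a call. The names `divide_proc`, `divide_fn` and `DivResult` are hypothetical.

```c
#include <stdint.h>

typedef struct {
    int32_t quotient;
    int32_t remainder;
} DivResult;

/* 'Procedure' style: the results come back through output parameters.
   The divisor must be non-zero. */
void divide_proc(int32_t num, int32_t den,
                 int32_t *quotient, int32_t *remainder)
{
    *quotient  = num / den;
    *remainder = num % den;
}

/* 'Function' style: inputs only; both results are returned in one struct.
   The divisor must be non-zero. */
DivResult divide_fn(int32_t num, int32_t den)
{
    DivResult r = { num / den, num % den };
    return r;
}
```

The struct-returning form keeps the call stateless, at the price of declaring a type just to carry the results.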

 

(*Sorry – this is a really terrible pun for a title)

Posted in C/C++ Programming