Contents
Function call basics
When teaching classes about embedded C or embedded C++ programming, one of the topics we always address is “Where does the memory come from for function arguments?”
Take the following simple C function:
void test_function(int a, int b, int c, int d);
When we invoke the function, where are the arguments stored?
int main(void)
{
//...
test_function(1,2,3,4);
//...
}
Unsurprisingly, the most common answer after “I don’t know” is “the stack”; and of course, if you were compiling for x86, this would be true. This can be seen from the following x86 assembler for main setting up the call to test_function (note: your mileage will vary if compiling for a 64-bit processor):
...
subl $16, %esp
movl $4, 12(%esp)
movl $3, 8(%esp)
movl $2, 4(%esp)
movl $1, (%esp)
call _test_function
...
The stack is decremented by 16 bytes, then the four ints are moved onto the stack prior to the call to test_function.
In addition to the function arguments being pushed, the call will also push the return address (i.e. the program counter of the next instruction after the call) and, in x86 terms, what is often referred to as the saved frame pointer onto the stack. The frame pointer is used to reference local variables also stored on the stack.
This stack frame format is quite widely understood and has historically been the target of malicious buffer-overflow attacks that modify the return address.
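To make the risk concrete, here is a minimal sketch (purely illustrative; vulnerable is a made-up function, not part of the example above) of the classic pattern such attacks exploit:
#include <string.h>
/* If 'input' is longer than 'buf', strcpy writes past the end of 'buf' and
   can overwrite the saved frame pointer and return address sitting above it
   in the stack frame. */
void vulnerable(const char *input)
{
    char buf[16];
    strcpy(buf, input); /* no bounds check */
}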
But, of course, we’re not here to discuss x86, it’s the ARM architecture we’re interested in.
The AAPCS
ARM is a RISC architecture, whereas x86 is CISC. Since 2003, ARM has published a document detailing how separately compiled and linked code units work together. Over the years it has gone through a couple of name changes, but it is now officially referred to as the “Procedure Call Standard for the ARM Architecture”, or the AAPCS (I know, don’t ask!).
If we recompile main.c for ARM using the armcc compiler:
> armcc -S main.c
we get the following:
...
MOV r3,#4
MOV r2,#3
MOV r1,#2
MOV r0,#1
BL test_function
...
Here we can see that the four arguments have been placed in registers r0-r3, followed by the “Branch with Link” (BL) instruction. So how much stack has been used for this call? The short answer is none, as the BL instruction moves the return address into the link register (LR/r14) rather than pushing it onto the stack, as per the x86 model.
Note: around a function call there may be other stack operations (for example, a non-leaf function must preserve LR before making its own calls), but that’s not the focus of this post.
The Register Set
I’d imagine many readers are familiar with the ARM register set, but just to review:
- There are 16 data/core registers r0-r15
- Of these 16, three are special purpose registers
- Register r13 acts as the stack pointer (SP)
- Register r14 acts as the link register (LR)
- Register r15 acts as the program counter (PC)
Basic Model
So the basic function call model is that if there are four or fewer 32-bit parameters, r0 through r3 are used to pass the arguments, and the return address is stored in the link register.
If we add a fifth parameter, as in:
void test_function2(int a, int b, int c, int d, int e);
int main(void)
{
//...
test_function2(1,2,3,4,5);
//...
}
We get the following:
...
MOV r0,#5
MOV r3,#4
MOV r2,#3
STR r0,[sp,#0]
MOV r1,#2
MOV r0,#1
BL test_function2
...
Here, the fifth argument (5) is being stored on the stack prior to the call.
Note, however, that in a larger code base you are likely to see at least one extra stack “push” here (quite often r4) which is never accessed in the called function. This is because the stack alignment requirements defined by the AAPCS differ between functions called within the same translation unit and those called across translation units. The basic requirement on the stack is that:
SP % 4 == 0
However, if the call is classed as a public interface, then the stack must adhere to:
SP % 8 == 0
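If you want to check this on your own target, something along the following lines will read the stack pointer so its alignment can be inspected (a sketch: GCC-style inline assembly assumed, and read_sp is a made-up helper name):
#include <stdint.h>
/* Read the current stack pointer (ARM, GCC-style extended asm). */
static inline uint32_t read_sp(void)
{
    uint32_t sp;
    __asm volatile ("MOV %0, sp" : "=r" (sp));
    return sp;
}
int sp_meets_public_interface_rule(void)
{
    return (read_sp() % 8) == 0; /* SP % 8 == 0 at a public interface */
}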
Return values
Given the following code:
int test_function(int a, int b, int c, int d);
int val;
int main(void)
{
//...
val = test_function(1,2,3,4);
//...
}
By analyzing the assembler, we can see the return value is placed in r0:
...
MOV r3,#4
MOV r2,#3
MOV r1,#2
MOV r0,#1
BL test_function
LDR r1,|L0.40| ; load address of extern val into r1
STR r0,[r1,#0] ; store function return value in val
...
C99 long long Arguments
The AAPCS defines the size and alignment of the C base types. The C99 long long is 8 bytes in both size and alignment. So how does this change our model?
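As a quick sanity check of those guarantees on your own toolchain, here is a sketch using C11’s _Static_assert (under strict C99 you would need a negative-array-size trick instead):
/* Compile-time checks of the AAPCS base-type sizes and alignments. */
_Static_assert(sizeof(long long) == 8, "AAPCS: long long is 8 bytes");
_Static_assert(_Alignof(long long) == 8, "AAPCS: long long is 8-byte aligned");
_Static_assert(sizeof(double) == 8, "AAPCS: double is 8 bytes");
_Static_assert(_Alignof(double) == 8, "AAPCS: double is 8-byte aligned");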
Given:
long long test_ll(long long a, long long b);
long long ll_val;
extern long long ll_p1;
extern long long ll_p2;
int main(void)
{
//...
ll_val = test_ll(ll_p1, ll_p2);
//...
}
We get:
...
LDR r0,|L0.40|
LDR r1,|L0.44|
LDRD r2,r3,[r0,#0]
LDRD r0,r1,[r1,#0]
BL test_ll
LDR r2,|L0.48|
STRD r0,r1,[r2,#0]
...
|L0.40|
DCD ll_p2
|L0.44|
DCD ll_p1
This code demonstrates that a 64-bit long long uses two registers: r0-r1 for the first parameter and r2-r3 for the second. In addition, the 64-bit return value comes back in r0-r1.
Doubles
As with the long long, a double type (as per the IEEE 754 standard) is also 8 bytes in size and alignment on ARM. However, the code generated depends on the actual core. For example, given the code:
double test_dbl(double a, double b);
double dval;
extern double dbl_p1;
extern double dbl_p2;
int main(void)
{
//...
dval = test_dbl(dbl_p1, dbl_p2);
//...
}
When compiled for a Cortex-M3 (armcc --cpu=Cortex-M3 --c99 -S main.c), the output is almost identical to the long long example:
...
LDR r0,|L0.28|
LDR r1,|L0.32|
LDRD r2,r3,[r0,#0]
LDRD r0,r1,[r1,#0]
BL test_dbl
LDR r2,|L0.36|
STRD r0,r1,[r2,#0]
...
|L0.28|
DCD dbl_p2
|L0.32|
DCD dbl_p1
However, if we recompile this for a Cortex-A9 (armcc --cpu=Cortex-A9 --c99 -S main.c), we get quite different generated instructions:
...
LDR r0,|L0.40|
VLDR d1,[r0,#0]
LDR r0,|L0.44|
VLDR d0,[r0,#0]
BL test_dbl
LDR r0,|L0.48|
VSTR d0,[r0,#0]
...
|L0.40|
DCD dbl_p2
|L0.44|
DCD dbl_p1
The VLDR and VSTR instructions are generated because the Cortex-A9 has Vector Floating Point (VFP) technology; here the doubles are passed and returned in the VFP registers d0 and d1 rather than in core registers.
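If you need to know at compile time which floating-point calling convention is in effect, the ACLE predefined macros can be tested; a sketch (exact macro support depends on your compiler and version):
/* __ARM_PCS_VFP is defined when the VFP (hard-float) variant of the AAPCS
   is in use; __ARM_PCS when the base (soft-float) variant is. */
#if defined(__ARM_PCS_VFP)
#define DOUBLES_IN_VFP_REGS 1 /* doubles passed in d0, d1, ... */
#elif defined(__ARM_PCS)
#define DOUBLES_IN_VFP_REGS 0 /* doubles passed in r0-r3 register pairs */
#endif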
Mixing 32-bit and 64-bit parameters
Assuming we change our function to accept a mixture of 32-bit and 64-bit parameters, e.g.
void test_iil(int a, int b, long long c);
extern long long ll_p1;
int main(void)
{
//...
test_iil(1, 2, ll_p1);
//...
}
As expected, we get: a in r0, b in r1, and ll_p1 in r2-r3.
...
LDR r0,|L0.32|
MOV r1,#2
LDRD r2,r3,[r0,#0]
MOV r0,#1
BL test_iil
...
|L0.32|
DCD ll_p1
However, if we subtly change the order to:
void test_ili(int a, long long c, int b);
extern long long ll_p1;
int main(void)
{
//...
test_ili(1, ll_p1, 2);
//...
}
We get a different result: a is in r0 and c is in r2-r3, but now b is stored on the stack (remember, this may also involve extra stack-alignment operations).
...
MOV r0,#2
STR r0,[sp,#0] ; store parameter b on the stack
LDR r0,|L0.36|
LDRD r2,r3,[r0,#0]
MOV r0,#1
BL test_ili
...
|L0.36|
DCD ll_p1
So why doesn’t parameter ‘c’ use r1-r2? Because the AAPCS states:
“A double-word sized type is passed in two consecutive registers (e.g., r0 and r1, or r2 and r3). The content of the registers is as if the value had been loaded from memory representation with a single LDM instruction”
As the compiler is not allowed to rearrange the parameter ordering, parameter ‘b’ has to come after ‘c’; it therefore cannot use the unused register r1 and ends up on the stack.
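A small signature change avoids the spill. Here is a sketch (test_lii is a hypothetical reordering, not from the example above): moving the long long to the front lets it take r0-r1, leaving r2 and r3 free for the two ints.
void test_lii(long long c, int a, int b);
extern long long ll_p1;
void caller(void)
{
    test_lii(ll_p1, 1, 2); /* c in r0-r1, a in r2, b in r3: no stack needed */
}
Equally, the original test_iil ordering (int, int, long long) needs no stack, as we saw earlier.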
C++
For all you C++ programmers out there, it is important to realize that for non-static class member functions the implicit ‘this’ argument is passed as a 32-bit value in r0. So, hopefully, you can see the implications when targeting ARM of:
class Ex
{
public:
void mf(long long d, int i);
};
vs.
class Ex
{
public:
void mf(int i, long long d);
};
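Writing the two member functions as their (hypothetical) C equivalents, with the implicit ‘this’ made explicit, shows the expected register assignment under the rules above:
struct Ex;
/* First form: this_p in r0; d needs an even register pair, so r1 is skipped
   and d takes r2-r3, pushing i onto the stack. */
void Ex_mf_a(struct Ex *this_p, long long d, int i);
/* Second form: this_p in r0, i in r1, d in r2-r3; no stack needed. */
void Ex_mf_b(struct Ex *this_p, int i, long long d);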
Summary
Even though keeping arguments in registers may be seen as “marginal gains“, for large code bases I have seen, first-hand, significant performance and power improvements simply by rearranging the parameter ordering.
And finally…
I’ll leave you with one more bit of code to puzzle over. An often-quoted guideline when programming in C is not to pass structs by value, but rather to pass them by pointer.
So given the following code:
typedef struct
{
int a;
int b;
int c;
int d;
} Example;
void pass_by_copy(Example p);
void pass_by_ptr(const Example* const p);
Example ex = {1,2,3,4};
int main(void)
{
//...
pass_by_copy(ex);
pass_by_ptr(&ex);
//...
}
Can you guess/predict the difference in performance and memory implications of each option?
This post originally appeared on the ARM Connected Community site.