Using final in C++ to improve performance

Dynamic polymorphism (virtual functions) is central to Object-Oriented Programming (OOP). Used well, it provides hooks into an existing codebase where new functionality and behaviour can (relatively) easily be integrated into a proven, tested codebase.

Subtype inheritance can bring significant benefits, including easier integration, reduced regression test time and improved maintenance.

However, using virtual functions in C++ brings a runtime performance overhead. This overhead may appear inconsequential for individual calls, but in a non-trivial real-time embedded application, these overheads may build up and impact the system’s overall responsiveness.

Refactoring an existing codebase late in the project lifecycle to try and achieve performance goals is never a welcome task. Project deadline pressures mean any rework may introduce potential new bugs to existing well-tested code. And yet we don’t want to perform unnecessary premature optimization (as in avoiding virtual functions altogether) as this tends to create technical debt, which may come back to bite us (or some other poor soul) during maintenance.

The final specifier was introduced in C++11 to ensure that a virtual function cannot be further overridden, or that a class cannot be further derived from. However, as we shall investigate, it also allows the compiler to perform an optimization known as devirtualization, improving runtime performance.

Interfaces and subtyping

Unlike Java, C++ does not explicitly have the concept of Interfaces built into the language. Interfaces play a central role in Design Patterns and are the principal mechanism for implementing the ‘D’ in SOLID, the Dependency Inversion Principle (DIP).

Simple Interface Example

Let’s take a simplified example: we have a mechanism layer defining a class named PDO_protocol. To decouple the protocol from the underlying utility layer, we introduce an interface called Data_link. The concrete class CAN_bus then realizes the Interface.

This design would yield the following Interface class:
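The original listing isn’t reproduced here, so the following is a minimal sketch of what the Data_link interface could look like; the send/receive signatures (a single byte each way) are assumptions, not the article’s exact code.

```cpp
#include <cstdint>

// Hypothetical sketch; the article's exact signatures may differ.
class Data_link {
public:
    virtual void send(std::uint8_t value) = 0;   // pure virtual: must be overridden
    virtual std::uint8_t receive() = 0;          // pure virtual: must be overridden
    virtual ~Data_link() = default;              // virtual default destructor
};
```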

Side note: I’ll park the discussion about using pragma once, virtual-default-destructors and pass-by-copy for another day.

The client (in our case, PDO_protocol) is only dependent on the Interface, e.g.
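As a sketch (with the interface repeated so the fragment stands alone), the client might look as follows; the write/read member functions are hypothetical names, not the article’s exact code:

```cpp
#include <cstdint>

class Data_link {                                // repeated so the fragment stands alone
public:
    virtual void send(std::uint8_t value) = 0;
    virtual std::uint8_t receive() = 0;
    virtual ~Data_link() = default;
};

// Hypothetical client: it names only the abstraction, never a concrete link.
class PDO_protocol {
public:
    explicit PDO_protocol(Data_link& link) : link_{link} {}
    void write(std::uint8_t value) { link_.send(value); }   // dispatched via the vtable
    std::uint8_t read() { return link_.receive(); }         // dispatched via the vtable
private:
    Data_link& link_;   // reference to the interface, bound at construction
};
```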

Any class realizing the Interface, such as CAN_bus, must override the pure-virtual functions in the Interface:
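A minimal sketch of such a realization; the buffered send/receive bodies are stand-ins for real CAN peripheral access:

```cpp
#include <cstdint>

class Data_link {                                // the interface, repeated for completeness
public:
    virtual void send(std::uint8_t value) = 0;
    virtual std::uint8_t receive() = 0;
    virtual ~Data_link() = default;
};

// Hypothetical realization; a real driver would access the CAN
// peripheral registers, stubbed out here with a buffer.
class CAN_bus : public Data_link {
public:
    void send(std::uint8_t value) override { buffer_ = value; }   // stub TX path
    std::uint8_t receive() override { return buffer_; }           // stub RX path
private:
    std::uint8_t buffer_{0};
};
```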

Finally, in main, we can bind a CAN_bus object to a PDO_protocol object. The calls from PDO_protocol invoke the overridden functions in CAN_bus.

Using dynamic polymorphism

It then becomes very straightforward to swap out the CAN_bus for an alternative utility object, e.g. RS422:
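A sketch of the alternative driver; as before, the buffered bodies are stubs standing in for real UART access:

```cpp
#include <cstdint>

class Data_link {
public:
    virtual void send(std::uint8_t value) = 0;
    virtual std::uint8_t receive() = 0;
    virtual ~Data_link() = default;
};

// Hypothetical RS422 driver realizing the same interface.
class RS422 : public Data_link {
public:
    void send(std::uint8_t value) override { buffer_ = value; }   // stub TX path
    std::uint8_t receive() override { return buffer_; }           // stub RX path
private:
    std::uint8_t buffer_{0};
};
```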

In main, we bind the PDO_protocol object to the alternative class.

Importantly, there are no changes to the PDO_protocol class. With appropriate unit testing, introducing the RS422 code into the existing codebase involves integration testing (rather than a blurred unit/integration test).

There are many ways we could create the new type (i.e. using factories, etc.), but, again, let’s park that for this post.

The cost of Dynamic Polymorphic behaviour

Using subtyping and polymorphic behaviour is an important tool when trying to manage change. But, like all things in life, it comes at a cost.

The code in the following examples was generated using the Arm GNU Toolchain v11.2.1.

A previous posting covered the Arm calling convention for AArch32 ISA. For a simple member-function call, e.g.:
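The original listing isn’t shown here; the following is a plausible reconstruction of the Sensor example, where the member bodies are assumptions:

```cpp
#include <cstdint>

// Plausible reconstruction of the article's Sensor example;
// the member bodies are assumptions.
class Sensor {
public:
    std::uint32_t get_value() const { return value_; }   // non-virtual (for now)
    void set_ID(std::uint32_t id) { id_ = id; }
private:
    std::uint32_t value_{0};
    std::uint32_t id_{0};
};

std::uint32_t read_sensor(Sensor& sensor) {
    return sensor.get_value();   // a direct, statically bound call
}
```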

We get the following assembler for the call to the member function in read_sensor:
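The listing itself isn’t reproduced; schematically, the call reduces to a single instruction (a real listing would show the mangled symbol name):

```asm
bl   Sensor::get_value()    @ direct branch-with-link; r0 holds the object's address
```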

The Branch with Link (bl) opcode performs the call as per the AArch32 function calling convention (r0 contains the object’s address, i.e. the implicit this parameter).

So what happens at the call site when we make this function virtual?

The generated assembler for sensor.get_value() becomes:
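The original listing isn’t reproduced here; reconstructed from the step-by-step description below, it is roughly (exact registers and scheduling will vary with the toolchain):

```asm
ldr  r3, [r0]       @ r0 = this; load the vtable pointer (vtptr) into r3
ldr  r3, [r3]       @ load vtable[0]: the address of Sensor::get_value
mov  lr, pc         @ save the return address in the link register
bx   r3             @ indirect branch to Sensor::get_value
```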

The actual code generated, naturally, depends on the specific ABI (Application Binary Interface). But, for all C++ compilers, it will involve a similar set of steps. Visualizing the implementation:

And examining the generated assembler, we can deduce the following behaviour:

  • r0 contains the address of the object (passed as the parameter to read_sensor)
  • the contents at this address are loaded into r3
  • r3 now contains the vtable-pointer (vtptr)
  • The vtptr points to the vtable, which is, in effect, an array of function pointers.
  • The first entry in the vtable is loaded back into r3 (e.g. vtable[0])
  • r3 now contains the address of Sensor::get_value
  • the current program counter (pc) is moved into the link register (lr) before the function call
  • The branch-with-exchange opcode is executed. So, the instruction bx r3 calls Sensor::get_value

If, for example, we were calling sensor.set_ID(), then the second memory load would be LDR r3,[r3,#4] to load the address of Sensor::set_ID into r3 (e.g. vtable[1]). Most ABIs structure the vtable based on the order of virtual function declaration.

We can deduce that the overhead of using a virtual function (for an Armv7-M Cortex-M core) is two additional memory loads (first the vtptr, then the vtable entry) plus an indirect branch, in place of the single direct branch-with-link of the non-virtual call.

However, what is significant is the second memory load (LDR r3,[r3]), as this memory read requires Flash access. A read from Flash is typically slower than an equivalent read from SRAM. A lot of design effort goes into improving Flash read performance, so your “mileage may vary” regarding the actual timing overhead.

Using polymorphic functions

If we create a class that derives from Sensor, e.g.
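A sketch of such a derived type; get_value is now virtual in the base, and the overridden body (returning a fixed placeholder) is purely illustrative:

```cpp
#include <cstdint>

class Sensor {
public:
    virtual std::uint32_t get_value() const { return value_; }   // now virtual
    virtual void set_ID(std::uint32_t id) { id_ = id; }
    virtual ~Sensor() = default;
private:
    std::uint32_t value_{0};
    std::uint32_t id_{0};
};

// Hypothetical derived type; the fixed return value is purely illustrative.
class Rotary_encoder : public Sensor {
public:
    std::uint32_t get_value() const override { return 42; }
};
```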

And then pass an object of the derived type to the function read_sensor, then the same assembler is executed.

But by visualizing the memory model, it becomes clear how the same code invokes the derived function.

The derived class has its own vtable, populated at link-time. Any overridden function replaces the corresponding vtable entry with the address of the new function. The constructors are responsible for storing the address of the vtable in the class’s vtptr.

Any virtual functions in the base class that are not overridden still point at the base class implementation. Pure-virtual functions (as used in the interface pattern) have no entry populated in the vtable, so they must be overridden.

Introducing final

As previously noted, the final specifier was introduced alongside override in C++11.

The final specifier was introduced to ensure that a derived class cannot override a virtual function or that a class cannot be further derived from it.

For example, currently, we could derive further from the Rotary_encoder class.

When defining the Rotary_encoder class, this may not have been our intended design. Adding the final specifier stops any further derivation.

A class may be specified as final, e.g.
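For example (the member bodies are placeholders, and the Absolute_encoder name is hypothetical):

```cpp
#include <cstdint>

class Sensor {
public:
    virtual std::uint32_t get_value() const { return 0; }
    virtual ~Sensor() = default;
};

class Rotary_encoder final : public Sensor {   // no further derivation allowed
public:
    std::uint32_t get_value() const override { return 42; }   // placeholder body
};

// class Absolute_encoder : public Rotary_encoder {};
//   would now fail to compile: cannot derive from 'final' base 'Rotary_encoder'
```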

If we then attempt to inherit from it, the compiler generates an error along the lines of “cannot derive from ‘final’ base ‘Rotary_encoder’”.

Or an individual function can be tagged as final, e.g.:
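A sketch, with placeholder bodies (the commented-out Absolute_encoder is a hypothetical name used to illustrate the error):

```cpp
#include <cstdint>

class Sensor {
public:
    virtual std::uint32_t get_value() const { return 0; }
    virtual ~Sensor() = default;
};

class Rotary_encoder : public Sensor {
public:
    std::uint32_t get_value() const final { return 42; }   // cannot be overridden further
};

// struct Absolute_encoder : Rotary_encoder {
//     std::uint32_t get_value() const override { return 0; }
// };   // would fail to compile: overriding a 'final' function
```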

Devirtualization

Okay, so how can this help with compiler optimization?

When calling a function such as read_sensor, where the parameter is a pointer/reference to the Base class, which in turn calls a virtual member function, the call must be polymorphic.

If we overload read_sensor to take a Rotary_encoder object by reference, e.g.
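A sketch of the two overloads (class bodies are placeholders; note that Rotary_encoder is deliberately not yet marked final here):

```cpp
#include <cstdint>

class Sensor {
public:
    virtual std::uint32_t get_value() const { return 0; }
    virtual ~Sensor() = default;
};

class Rotary_encoder : public Sensor {   // not yet marked final
public:
    std::uint32_t get_value() const override { return 42; }   // placeholder body
};

std::uint32_t read_sensor(Sensor& sensor) {
    return sensor.get_value();   // must be a polymorphic (vtable) call
}

std::uint32_t read_sensor(Rotary_encoder& sensor) {
    return sensor.get_value();   // a candidate for devirtualization
}
```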

If the compiler can prove exactly which actual method is called at compile time, it can change a virtual method call into a direct method call.

Without the final specifier, the compiler cannot prove that the Rotary_encoder reference, sensor, isn’t bound to an instance of a further derived class. So the generated assembler for both read_sensor functions is identical.

However, if we apply the final specifier to the Rotary_encoder class, the compiler can prove that the only matching call must be Rotary_encoder::get_value, so it can apply devirtualization and generate the following code for read_sensor(Rotary_encoder&):
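Schematically, the indirect sequence collapses to a single direct call (with a trivial function body, the compiler may even inline it entirely):

```asm
bl   Rotary_encoder::get_value()   @ direct call: both vtable loads are eliminated
```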

Templates and final

As our two read_sensor functions are identical, the DRY (Don’t Repeat Yourself) principle comes into play. If we modify the code so that read_sensor is a template function, e.g.
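A sketch of the template version, with the same placeholder class bodies as before:

```cpp
#include <cstdint>

class Sensor {
public:
    virtual std::uint32_t get_value() const { return 0; }
    virtual ~Sensor() = default;
};

class Rotary_encoder final : public Sensor {
public:
    std::uint32_t get_value() const override { return 42; }   // placeholder body
};

// One definition replaces the two overloads.
template <typename Sensor_type>
std::uint32_t read_sensor(Sensor_type& sensor) {
    return sensor.get_value();   // dynamic for Sensor, static for the final Rotary_encoder
}
```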

The code generator will bind dynamically or statically as appropriate, depending on whether we call with a Sensor object or a Rotary_encoder object.

Revising the Interface

Given the potential for devirtualization, can we utilize this in our Interface design?

Unfortunately, for the compiler to be able to prove the actual method call, we must use final in conjunction with a pointer/reference to the derived type.

In the original code, the compiler cannot perform devirtualization because PDO_protocol holds a reference to the interface (base) class and not the derived class. This leaves us with two potential refactoring solutions:

  • Modify the link type to the derived type
  • Make the client a template class

Devirtualization using a direct link

Using a direct link is a “quick and dirty” fix.
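A sketch of what this refactoring might look like; member names and bodies are assumptions, not the article’s exact code:

```cpp
#include <cstdint>

class Data_link {
public:
    virtual void send(std::uint8_t value) = 0;
    virtual std::uint8_t receive() = 0;
    virtual ~Data_link() = default;
};

class CAN_bus final : public Data_link {   // final enables devirtualization
public:
    void send(std::uint8_t value) override { buffer_ = value; }
    std::uint8_t receive() override { return buffer_; }
private:
    std::uint8_t buffer_{0};
};

// The "quick and dirty" fix: the client now names the concrete type.
class PDO_protocol {
public:
    explicit PDO_protocol(CAN_bus& link) : link_{link} {}
    void write(std::uint8_t value) { link_.send(value); }   // direct call to CAN_bus::send
    std::uint8_t read() { return link_.receive(); }         // direct call to CAN_bus::receive
private:
    CAN_bus& link_;   // concrete reference: couples the layers (breaks the DIP)
};
```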

It does change the PDO_protocol header, but otherwise, it “does the job”. The generated code now calls CAN_bus::send and CAN_bus::receive directly rather than through a vtable call.

However, using this approach, we reintroduce the coupling between the “Mechanism layer” and the “Utility layer”, breaking the DIP.

Devirtualization using templates

Alternatively, we can rework the client code as a template class, where the template parameter specifies the link class.
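A sketch of the templated client; as before, the member names and bodies are assumptions:

```cpp
#include <cstdint>

class Data_link {
public:
    virtual void send(std::uint8_t value) = 0;
    virtual std::uint8_t receive() = 0;
    virtual ~Data_link() = default;
};

class CAN_bus final : public Data_link {
public:
    void send(std::uint8_t value) override { buffer_ = value; }
    std::uint8_t receive() override { return buffer_; }
private:
    std::uint8_t buffer_{0};
};

// The client as a template: the link type becomes a compile-time parameter.
template <typename Link_type>
class PDO_protocol {
public:
    explicit PDO_protocol(Link_type& link) : link_{link} {}
    void write(std::uint8_t value) { link_.send(value); }   // static binding if Link_type is final
    std::uint8_t read() { return link_.receive(); }
private:
    Link_type& link_;
};
```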

Templates bring their own complications, but this approach does ensure we get static binding to any classes specified as final.

Summary

The final specifier offers an opportunity to refactor existing interface code to alter the binding from dynamic to static polymorphism, typically improving runtime performance. The actual gains will depend significantly on the underlying ABI and machine architecture (start throwing in pipelining and caching, and the waters get even muddier).

Ideally, when using virtual functions in embedded applications, whether a class should be specified as final should be decided at design time, rather than late in the project’s timeline.

Niall Cooling

Co-Founder and Director of Feabhas since 1995.
Niall has been designing and programming embedded systems for over 30 years. He has worked in different sectors, including aerospace, telecomms, government and banking.
His current interests lie in IoT Security and Agile for Embedded Systems.

This entry was posted in C/C++ Programming and Design Issues.

12 Responses to Using final in C++ to improve performance

  1. Rud Merriam says:

    "However, using virtual functions in C++ brings a runtime performance overhead. This overhead may appear inconsequential for individual calls, but in a non-trivial real-time embedded application, these overheads may build up and impact the system’s overall responsiveness."

    This is a ubiquitous misleading assertion. Before going into why, I'll say this article is a good discussion on optimization of this technique.

    The fallacy of virtual function cost is that virtual functions eliminate a decision process, namely, which class to invoke. When the decision process is considered, the cost of virtual calls is less. See https://hackaday.com/2015/11/13/code-craft-embedding-c-timing-virtual-functions/ for specific details.

  2. My assertion, "... these overheads may build up and impact the system’s overall responsiveness." is based on observations while working on a commercial, real-world, real-time, deeply embedded project written in C++, where the widespread use of Interfaces contributed to performance challenges.

    As I tried to convey, the actual overheads will depend on a number of factors. As stated, "This overhead may appear inconsequential for individual calls" which your blog asserts. But it is an oversimplification, as my direct experience shows, to imply they can be completely ignored when writing C++ for real-world, real-time embedded systems.

  3. Susanne Oberhauser-Hirschof says:

    Thank you for this article on how to optimize away virtual function calls. The true motivation to do this however is not the vtable lookup (as Rud Merriam said). The true motivation is that without vtable lookup, the function call can be inlined, unlocking compiler optimization across function boundaries. And _that_ can give significant, measurable performance boost, plus better cache locality.

  4. Susanne Oberhauser-Hirschoff says:

    Thank you for this article on how to optimize away virtual function calls. The true motivation to do this however is not the vtable lookup (as Rud Merriam said). The true motivation is that without vtable lookup, the function call can be inlined, unlocking compiler optimization across function boundaries. And _that_ can give significant, measurable performance boost, plus better cache locality.

  5. I have just discovered the final keyword today. Beside using it, there is another method to devirtualize virtual function calls, it is to fully qualify the function name.

    class Derived : public B
    {
        virtual void foo();
        void bar() { Derived::foo(); }
    };

  6. Great article - thanks!

    What tool do you use for the diagrams, like the "Fig-15-derived-vtable-image"?

  7. Cheers, I wish it was something better, but they are done in PowerPoint as they come from our Advanced C++ programming courses, where they are animated as part of the explanation about vtable performance.

  8. Rud Merriam says:

    Oliver - your technique is not needed. If you directly call a virtual function the vtable is not used for lookup. The vtable is only used when calling through a pointer to the base class.

  9. Prashanth says:

    Fascinating discussion. Is the performance gain enough to justify introducing a templated call to Data_Link? The template implies a direct binding of the link class with the interface. Mixing template and inheritance in this case, in my opinion, muddies the design intent.

  10. VictorG says:

    2 Notes:
    1. If you are going with the "template" solution, you had better not use 'virtual' at all (do not give the compiler a chance to use virtualization), so there is no need for 'override' or 'final'.
    2. When you are talking about embedded, there may be different constraints. If you are really worried only about run-time performance - fine. If you have limited memory for your application - another story.

  11. Hi Prashanth,
    In isolation, it's impossible to answer that. Always be 'obvious and simple' where you can, and certainly, templates add a (high) degree of complexity for comprehension.
    We always need to separate the 'design intent' from the implementation.
    As with all 'optimisations', it should be viewed as a last resort and only applied if you can demonstrate a performance issue through hardware trace.
    regards,
    Niall.

  12. Hi Victor,
    Point 1 - absolutely, the whole interface can be dropped (and replaced with a `concept` in C++20).
    Point 2 - of course, embedded systems have numerous 'design forces' in play, with performance just being one of them. In this case, `devirtualization` reduces opcode count (and could possibly remove the memory required for `vtables` and `vtable pointers` - so you could argue it improves performance and reduces both code & data memory!)

    Like (1)
    Dislike (0)
