Introduced in 2004, the ARM Cortex-M architecture is currently the most popular 32-bit architecture on the market, adopted by most, if not all, major MCU manufacturers. The Cortex-M was designed from the outset to be RTOS-kernel friendly: a dedicated RTOS tick timer, a context switch handler, interrupt service routines written in C, tail-chaining, easy critical section management and more. Many Cortex-M MCU implementations are complemented with a floating-point unit (FPU), DSP extensions, a highly versatile debug port and a memory protection unit (MPU).
The ARM Cortex-M
In 2004, Arm introduced a new family of CPU cores called Cortex-M (M stands for Microcontroller) based on a Reduced Instruction Set Computer (RISC) architecture. The first Cortex-M was the Cortex-M3, and the family has since grown to include a number of derivative cores: the Cortex-M0/M0+, the Cortex-M4, the high-performance Cortex-M7, and the recently introduced Cortex-M23 and Cortex-M33 with TrustZone security technology.
The programmer’s model (see Figure 1) of the Cortex-M processor family is highly consistent. For example, R0 to R15, PSR, CONTROL and PRIMASK are available on all Cortex-M processors. Two special registers, FAULTMASK and BASEPRI, are available only on the Cortex-M3, Cortex-M4, Cortex-M7 and Cortex-M33. The floating-point register bank and the floating-point status and control register (FPSCR) are available on the Cortex-M4, Cortex-M7 and Cortex-M33 as part of the optional floating-point unit. Some Cortex-M implementations are also equipped with a Memory Protection Unit (MPU).
[Figure 1 | Armv7-M-based CPU register model.]
Both the CPU registers and the FPU registers (assuming the processor is equipped with an FPU) are saved and restored by the RTOS during a context switch. Because each task’s MPU configuration is obtained from a table, the RTOS only needs to load the MPU registers when the task is switched in. In other words, there is no need to save the MPU configuration of the task being switched out. The details are described in an upcoming section.
Cortex-M privilege levels
At power up, the Cortex-M starts in privileged mode, which gives it access to all the features of the CPU. It can access any memory or I/O location, enable and disable interrupts, set up the nested vectored interrupt controller (NVIC), configure the FPU and MPU, and so on.
To keep a system safe and secure, privileged mode must be reserved for code that has been fully tested and is known to be trustworthy. Most RTOSs undergo thorough testing, so they are generally considered trusted, while most application code is not. There are a few exceptions to this practice. ISRs, for example, are typically assumed to be trusted and therefore also run in privileged mode, as long as they are kept as short as possible and not abused. Most RTOS vendors give this recommendation.
Application code can be made to run on a Cortex-M in non-privileged mode, which restricts what the code can do. Specifically, non-privileged mode prevents code from disabling interrupts, changing the settings of the NVIC, switching the mode back to privileged and altering MPU settings, among other things. This is a desirable feature because we don’t want untrusted code to grant itself privileges and thereby bypass the protection put in place by the system designer.
Since the CPU always starts up in privileged mode, each task needs to do one of two things. It can be created to run in non-privileged mode from the start, or it can switch to non-privileged mode (by calling an API) shortly after starting. Once in non-privileged mode, the CPU can only switch back to privileged mode when servicing an interrupt or an exception.
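The sketch below is a minimal host-side model in C that shows why the privilege switch is one-way. It assumes the Armv7-M rules:

- Thread mode is unprivileged when CONTROL.nPRIV (bit 0) is 1.
- Handler mode is always privileged.
- Writes to CONTROL from unprivileged code are ignored.

The helper names are hypothetical. On a real target, the RTOS would use the CMSIS intrinsics __get_CONTROL() and __set_CONTROL(), followed by __ISB().

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-side model of the Armv7-M privilege rules (hypothetical helpers).
 * On a target, CONTROL is read and written with the CMSIS intrinsics
 * __get_CONTROL() and __set_CONTROL(), followed by __ISB(). */
#define CONTROL_nPRIV (1u << 0)   /* 1 = Thread mode runs unprivileged */

/* Handler mode (ISRs, SVC, faults) is always privileged; Thread mode is
 * privileged only while CONTROL.nPRIV is 0. */
static bool cpu_is_privileged(bool handler_mode, uint32_t control)
{
    return handler_mode || (control & CONTROL_nPRIV) == 0u;
}

/* Models a write to CONTROL: it takes effect only when the CPU is currently
 * privileged. An unprivileged task's write is ignored, so the task cannot
 * hand itself privileges back. */
static uint32_t control_write(bool handler_mode, uint32_t control, uint32_t value)
{
    return cpu_is_privileged(handler_mode, control) ? value : control;
}
```

A task that starts privileged drops to non-privileged mode with a single write. From then on, only code running in Handler mode, such as the RTOS’s SVC handler, can clear nPRIV again.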
Since non-privileged code cannot disable interrupts through either the CPU or the NVIC, application code is forced to use RTOS services to access shared resources. RTOS services need to run in privileged mode (to disable interrupts during critical sections). Non-privileged tasks must therefore go through a special Cortex-M mechanism called the Supervisor Call (SVC) to switch back to privileged mode. The SVC behaves like an interrupt but is invoked by a CPU instruction, aptly named SVC. This is also known as a software interrupt.
On the Cortex-M, the SVC instruction takes an 8-bit argument that specifies which of 256 possible RTOS functions (or services) the caller wants to execute. The system designer decides which RTOS services should be made available to non-privileged code. For example, you might not want to allow a non-privileged task to terminate another task (or itself). Also, none of these services should allow interrupts to be disabled, as that would defeat one of the reasons for running code in non-privileged mode. Once invoked, the SVC instruction vectors to an exception handler called the SVC handler.
This process is shown in Figure 2. (1) Some non-privileged code executes SVC #5 to wait on a mutex. (2) The SVC instruction forces the SVC exception handler to execute, just as if an interrupt had been generated. (3) The SVC handler extracts the argument (i.e., the value 5) and uses it as an index into the SVC jump table. (4) The desired RTOS service is executed (in privileged mode), and upon completion, the RTOS returns to the non-privileged code.
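The RTOS supplies the SVC handler, but a short sketch shows how little work it takes to recover the argument and dispatch through a jump table. This host-runnable C model makes two assumptions:

- The exception frame is the standard Cortex-M one (R0-R3, R12, LR, PC, xPSR).
- The SVC instruction uses the 16-bit Thumb encoding 0xDF00 | imm8.

svc_number(), svc_dispatch(), the table and the demo service are hypothetical names, not the API of any particular RTOS.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-runnable sketch of SVC dispatch (all names hypothetical).
 * On exception entry the Cortex-M stacks R0-R3, R12, LR, PC and xPSR;
 * word 6 (the stacked PC) is the address of the instruction after SVC. */
#define FRAME_R0 0
#define FRAME_PC 6

/* The 16-bit Thumb SVC encoding is 0xDF00 | imm8, stored little-endian, so
 * the 8-bit argument is the byte two bytes before the return address.
 * On a target: return_addr = (const uint8_t *)frame[FRAME_PC]. */
static uint8_t svc_number(const uint8_t *return_addr)
{
    return return_addr[-2];
}

typedef uint32_t (*svc_fn_t)(uint32_t *frame);

/* Placeholder for an RTOS service such as "wait on a mutex". */
static uint32_t svc_mutex_pend_demo(uint32_t *frame)
{
    (void)frame;             /* arguments would be read from frame[0..3] */
    return 0u;               /* hypothetical "no error" code */
}

static uint32_t svc_unsupported(uint32_t *frame)
{
    (void)frame;
    return 0xFFFFFFFFu;      /* hypothetical "service not available" code */
}

/* One slot per possible imm8; the system designer decides which are filled. */
static svc_fn_t const svc_table[256] = {
    [5] = svc_mutex_pend_demo,
};

/* Looks up the service and writes its result into the stacked R0, which the
 * caller sees as the return value after the exception returns. */
static void svc_dispatch(uint32_t *frame, const uint8_t *return_addr)
{
    svc_fn_t fn = svc_table[svc_number(return_addr)];
    if (fn == NULL) {
        fn = svc_unsupported;
    }
    frame[FRAME_R0] = fn(frame);
}
```

Unpopulated slots fall back to an error return. This is one way to ensure that non-privileged code can only reach the services the designer chose to expose.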
The SVC handler is part of the RTOS, so you don’t have to implement it yourself. In fact, your application code invokes the same RTOS APIs whether your task runs in privileged or non-privileged mode.
Going through the SVC handler comes at a price: additional code and CPU cycles. On the Cortex-M3, the SVC handler adds about 1 Kbyte of code and takes between 75 and 125 CPU instructions to execute. So, any RTOS service invoked by non-privileged code requires more processing time than the same service called from privileged mode.
[Figure 2 | Limiting CPU, NVIC and MPU access from non-privileged code.]
Running code in non-privileged mode also prevents user code from disabling interrupts, which reduces the chances of locking up the system. Of course, lockups are still possible if user code gets into an infinite loop, especially in a high-priority task or ISR. In that case, however, a watchdog can be used to recover from the lockup.
As a side note, the Cortex-M generates a fault (a Bus Fault) if a non-privileged task attempts to disable interrupts through the NVIC. Your application code will need to account for this.
Running in non-privileged mode still doesn’t prevent application code from accessing any memory location or peripheral device, or from executing code out of RAM. This is where the MPU comes in.
The Cortex-M MPU in the Armv7-M architecture
The MPU on the Cortex-M (assuming Armv7-M) is a device that allows a process to access up to eight (8) or sixteen (16) memory or peripheral regions, depending on the MCU implementation. The location and size of each region are configurable, subject to two rules:

- The size of each region must be a power of two and cannot be smaller than 32 bytes.
- The base address of a region must be aligned on a multiple of the region size. So, if a region is 8 Kbytes, it must be aligned on an 8-Kbyte boundary.

Because the MPU offers relatively few regions, they are typically used to limit access to RAM and peripherals rather than to code. However, at least one region must be used to provide access to code space.
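These two rules are easy to check mechanically. Here is a minimal sketch, assuming the Armv7-M constraints just described; the helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if (base, size) satisfies the Armv7-M MPU rules.
 * The size must be a power of two of at least 32 bytes, and the base must be
 * a multiple of the size. (A 4-Gbyte region can't be expressed in a uint32_t
 * size and is not handled here.) */
static bool mpu_region_is_valid(uint32_t base, uint32_t size)
{
    bool pow2 = size != 0u && (size & (size - 1u)) == 0u;
    return pow2 && size >= 32u && (base & (size - 1u)) == 0u;
}
```

Some examples of how the checks play out:

- An 8-Kbyte region at 0x20002000 passes.
- The same region at 0x20001000 fails the alignment check.
- A 7-Kbyte region fails because 7 Kbytes is not a power of two.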
A convenient way to organize the memory is to group the RAM needed for a process in one contiguous block as shown in Figure 3. Each of the processes would be set up in a similar fashion. An expanded view of Process A shows that it consists of four tasks, each with its own stack. Process A also manages a peripheral device. The white spaces represent unused memory or I/O space possibly due to alignment restrictions of the MPU.
[Figure 3 | Grouping regions by process.]
F3(1) An MPU region is needed to provide access to code space. The region can be set up to only allow access to the code associated with the process, but that can sometimes be problematic when a process shares code (e.g., libraries) with other processes.
F3(2) An MPU region is needed to allow all tasks within the process to access peripheral devices assigned to the process. For example, if Process A manages an Ethernet controller, then the region must allow access to all the registers associated with this device.
F3(3) An MPU region is used to access all the RAM allocated to the process. It is assumed here that the process’s global variables and heap are shared by all tasks within the process. As a side note, it’s not possible to use a single global heap shared by all processes. You would not be able to set up an MPU table that separates the dynamically allocated memory of one process from that of another.
F3(4) An MPU region is used for RedZone stack checking. In fact, a single region is enough for all the task stacks in a process, because the RTOS simply moves the RedZone during a context switch. This implies that each task requires a slightly different MPU process table, although that greatly depends on how the RTOS manages the MPU during a context switch. For example, the RTOS might load only the first seven regions from the MPU process table and load the last region with the base address of the task’s stack to set the RedZone. Most of the time, the RTOS stores the base address of the task stack in the task’s control block (TCB). With this scheme, all tasks within a process can share the exact same process table and still get a correctly placed RedZone for each task’s stack.
F3(5) This represents unused RAM caused by the Cortex-M MPU’s requirement that the size of every region be a binary power of two. So, if Process A requires 7 Kbytes of RAM, 1 Kbyte would be lost because Process A’s region would need to be 8 Kbytes.

Instead of letting that space go to waste, you might simply increase the size of certain stacks within the process to reduce the chance of stack overflows. The drawback is that if you ever need to add functionality to the process, you might not remember how much memory you can reclaim. From a safety-critical point of view, if you qualify your system with a given memory configuration, you might not be able to reclaim that memory at all. Thus, it’s probably best to allocate only the stacks needed for the process and live with the wasted space.
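The arithmetic in F3(5) can be sketched as follows. The helper rounds a process’s RAM requirement up to the next legal region size and reports what is left over. The function names are hypothetical, and the sketch assumes the requirement is at most 2 Gbytes.

```c
#include <stdint.h>

/* Hypothetical helper: smallest legal Armv7-M region size (a power of two,
 * at least 32 bytes) that holds `need` bytes. Assumes need <= 0x80000000,
 * otherwise the doubling would overflow. */
static uint32_t mpu_region_size_for(uint32_t need)
{
    uint32_t size = 32u;
    while (size < need) {
        size <<= 1;
    }
    return size;
}

/* Bytes left unused in the process's region. */
static uint32_t mpu_region_waste(uint32_t need)
{
    return mpu_region_size_for(need) - need;
}
```

With a requirement of 7 Kbytes (7168 bytes), the region becomes 8192 bytes and 1024 bytes are wasted, as in the Process A example. A 9-Kbyte process would waste 7 Kbytes in a 16-Kbyte region, which is where the sub-region feature described later helps.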
From a programmer’s point of view, the Cortex-M MPU is a fairly simple device. For an eight-region MPU, it consists of nineteen 32-bit registers, as shown in Figure 4. You will note that this model differs from the one presented in Figure 1. The RBAR and RASR registers are banked, one pair per region, and are accessed indirectly through a single address pair selected by RNR. Internally, however, this is how they appear.
[Figure 4 | The Cortex-M MPU registers.]
The TYPE register reports the number of regions supported by the MPU; its DREGION field always reads as 0, 8 or 16. The CTRL register configures some aspects of the MPU, but practically speaking, it is used to enable or disable the MPU. In fact, the MPU should be disabled before the configuration of any region is changed. The RNR (region number register) selects the region that is accessed through the RBAR and RASR registers.
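As a rough illustration, here is how software might interpret TYPE and build a CTRL value. The bit positions are those of the Armv7-M MPU, and the helper names are hypothetical. The sketch optionally sets PRIVDEFENA, which lets privileged code such as the RTOS use the default memory map while unprivileged tasks are confined to the enabled regions. That is one common choice, not a requirement. On a target, the values would be read from and written to the CMSIS registers MPU->TYPE and MPU->CTRL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field positions in the Armv7-M MPU_TYPE and MPU_CTRL registers. */
#define MPU_TYPE_DREGION_POS  8u
#define MPU_TYPE_DREGION_MSK  (0xFFu << MPU_TYPE_DREGION_POS)

#define MPU_CTRL_ENABLE       (1u << 0)  /* MPU on */
#define MPU_CTRL_HFNMIENA     (1u << 1)  /* keep MPU on in HardFault/NMI handlers */
#define MPU_CTRL_PRIVDEFENA   (1u << 2)  /* privileged code may use the default map */

/* Number of regions reported by TYPE.DREGION (0 means no MPU). */
static uint32_t mpu_region_count(uint32_t type)
{
    return (type & MPU_TYPE_DREGION_MSK) >> MPU_TYPE_DREGION_POS;
}

/* Hypothetical helper: CTRL value that disables the MPU, or enables it with
 * or without the privileged background map. */
static uint32_t mpu_ctrl_value(bool enable, bool privileged_default_map)
{
    uint32_t v = 0u;
    if (enable) {
        v |= MPU_CTRL_ENABLE;
        if (privileged_default_map) {
            v |= MPU_CTRL_PRIVDEFENA;
        }
    }
    return v;
}
```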
Referring to Figure 4, you will notice that the lower five bits of RBAR have a special purpose. When the VALID bit (‘V’, bit 4) is written as 1, the lower four bits (the REGION field) specify the region number, overriding RNR. The upper bits of RBAR specify the base address of the region. The base address must be aligned on a boundary that matches the size of the region; e.g., a 1-Kbyte region must be aligned on a 1-Kbyte boundary.
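A one-line helper captures the RBAR layout. CMSIS offers a comparable ARM_MPU_RBAR(Region, BaseAddress) macro; the function below is a hypothetical stand-in that makes the bit arithmetic visible.

```c
#include <stdint.h>

#define MPU_RBAR_ADDR_MSK  0xFFFFFFE0u  /* bits [31:5]: base address */
#define MPU_RBAR_VALID     (1u << 4)    /* use the REGION field instead of RNR */
#define MPU_RBAR_REGION    0xFu         /* bits [3:0]: region number */

/* Hypothetical helper: the value to write to RBAR so that a single write
 * selects `region` and sets its base address. The caller must already have
 * aligned `base` to the region size; bits below bit 5 are discarded. */
static uint32_t mpu_rbar(uint32_t base, uint32_t region)
{
    return (base & MPU_RBAR_ADDR_MSK) | MPU_RBAR_VALID | (region & MPU_RBAR_REGION);
}
```

Because VALID is set, an RTOS can program a region with back-to-back RBAR and RASR writes and never touch RNR.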
For the most part, setting up the attributes for a given region is fairly straightforward:
RASR.XN: It’s highly recommended that you set this bit to 1 when the region covers RAM and you don’t expect to execute code out of that region. This helps catch code-injection attacks.
RASR.AP: If the region covers RAM, you’d set these bits to ‘011’ (full read/write access), and if the region covers ROM, you’d set this field to ‘110’ (read-only).
RASR.TEX, S, C, B: Figure 4 shows the typical values of these bits based on where the memory region resides.
RASR.SRD: This field divides a region into eight equal sub-regions, each of which can be disabled individually. This feature can greatly reduce wasted memory. For example, a 16-Kbyte region has eight 2-Kbyte sub-regions. If a process only needs 5 Kbytes (3 sub-regions), you can disable the other five sub-regions and assign that memory to one or more other processes.
RASR.SIZE: This field is a bit more complicated to set because it requires some manual work, specifically looking at the linker map file. From that you determine the region size, which is encoded as a binary power of two (region size = 2^(SIZE+1) bytes).
RASR.EN: This bit enables (1) or disables (0) the region. If you don’t need all eight regions, you must disable the unused ones so that you don’t inadvertently leave enabled regions from a different process.
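Packing these fields by hand is error-prone, so a helper is worth having. The sketch assumes the Armv7-M RASR bit positions:

- XN at bit 28
- AP at bits 26:24
- TEX at bits 21:19
- S, C and B at bits 18, 17 and 16
- SRD at bits 15:8
- SIZE at bits 5:1
- EN at bit 0

CMSIS provides a similar ARM_MPU_RASR() macro, but the functions here are hypothetical.

```c
#include <stdint.h>

#define MPU_AP_RW 3u   /* '011': full read/write access */
#define MPU_AP_RO 6u   /* '110': read-only, privileged and unprivileged */

/* Encodes a byte count for RASR.SIZE, where size = 2^(SIZE + 1).
 * Returns 0 (not a valid encoding) unless bytes is a power of two >= 32. */
static uint32_t mpu_size_field(uint32_t bytes)
{
    if (bytes < 32u || (bytes & (bytes - 1u)) != 0u) {
        return 0u;
    }
    uint32_t n = 0u;
    while ((1u << n) != bytes) {
        n++;
    }
    return n - 1u;
}

/* SRD bit i set = sub-region i disabled. Keeps the lowest `used` of the
 * eight sub-regions enabled. Sub-regions exist only for regions of 256 bytes
 * or larger; smaller regions must use SRD = 0. */
static uint32_t mpu_srd_lowest(uint32_t used)
{
    return used >= 8u ? 0u : (0xFFu << used) & 0xFFu;
}

/* Packs the RASR fields and sets EN. An invalid size yields 0, i.e. a
 * disabled region, rather than an enabled region of undefined size. */
static uint32_t mpu_rasr(uint32_t xn, uint32_t ap, uint32_t tex, uint32_t s,
                         uint32_t c, uint32_t b, uint32_t srd, uint32_t bytes)
{
    uint32_t size = mpu_size_field(bytes);
    if (size == 0u) {
        return 0u;
    }
    return ((xn & 1u) << 28) | ((ap & 7u) << 24) | ((tex & 7u) << 19) |
           ((s & 1u) << 18) | ((c & 1u) << 17) | ((b & 1u) << 16) |
           ((srd & 0xFFu) << 8) | (size << 1) | 1u;   /* EN = 1 */
}
```

For example, take an 8-Kbyte, execute-never, read/write RAM region with S = 1, C = 1 and TEX = B = 0 (values taken here only for illustration). It encodes as 0x13060019. For the 16-Kbyte region with only three 2-Kbyte sub-regions enabled, mpu_srd_lowest(3) supplies SRD = 0xF8.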
Listing 1 shows the assembly language code of an optimized function that loads all eight MPU regions. I show this as an example of how efficiently we can change the MPU configuration, but this is not something you have to worry about. It’s really the responsibility of the RTOS to determine the best way to manage the MPU. However, you will need to follow the RTOS guidelines on how to set up the MPU process table for each task. For this particular implementation, you need to create an MPU process table that assigns all eight regions even if fewer are used. The prototype for the function is:
void OS_MPU_ProcessSet (ARM_MPU_Region_t *p_process);
p_process is a pointer to an MPU process table that contains eight pairs of RBAR and RASR values. ARM_MPU_Region_t is a data type defined by Arm’s Cortex Microcontroller Software Interface Standard (CMSIS) [3] and is declared as follows:
typedef struct {
    uint32_t RBAR; // Region base address
    uint32_t RASR; // Region attributes (type, region size, enable, etc.)
} ARM_MPU_Region_t;
So, for each task, you would need to declare an array of ARM_MPU_Region_t containing eight entries. The last entry contains the base address of the task’s stack and assumes that the RedZone size is 32 bytes.
[Listing 1 | Configuring all 8 MPU regions.]
The MPU in the Cortex-M is a fairly simple device. The RTOS is responsible for configuring the MPU on every context switch. However, it’s the application developer’s responsibility to set up the MPU process table for the application. Tasks within a process can share the same MPU process table if the RTOS sets up the RedZone for each task.
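The switch-in step described in F3(4) could look roughly like this:

- Regions 0 through 6 come from the process table shared by every task in the process.
- Region 7 is a 32-byte, no-access, execute-never RedZone placed at the base (lowest address) of the task’s downward-growing stack, whose address the RTOS keeps in the TCB.

This is a host-side model with a simulated register block, not the assembly code of Listing 1. The names are hypothetical, and setting PRIVDEFENA is an assumption. On a target, CMSIS’s ARM_MPU_Disable(), ARM_MPU_Load() and ARM_MPU_Enable() (in mpu_armv7.h) would access the real registers.

```c
#include <stdint.h>

/* Same layout as CMSIS ARM_MPU_Region_t: one RBAR/RASR pair per region. */
typedef struct {
    uint32_t RBAR;
    uint32_t RASR;
} mpu_region_t;

/* Simulated MPU registers so the sketch runs on a host (hypothetical). */
typedef struct {
    uint32_t CTRL;
    uint32_t RBAR[16];   /* base address of each region */
    uint32_t RASR[16];   /* attributes of each region */
} mpu_model_t;

#define MPU_CTRL_ENABLE     (1u << 0)
#define MPU_CTRL_PRIVDEFENA (1u << 2)
#define RBAR_ADDR_MSK       0xFFFFFFE0u
#define RBAR_VALID          (1u << 4)
#define RBAR_REGION_MSK     0xFu
#define REDZONE_RASR        ((1u << 28) | (4u << 1) | 1u) /* XN, AP=000, 32 B, EN */

/* Models writing RBAR (with VALID set, REGION selects the region) followed
 * by RASR for that same region. */
static void mpu_write_region(mpu_model_t *mpu, uint32_t rbar, uint32_t rasr)
{
    uint32_t n = rbar & RBAR_REGION_MSK;
    mpu->RBAR[n] = rbar & RBAR_ADDR_MSK;
    mpu->RASR[n] = rasr;
}

/* Hypothetical switch-in: regions 0-6 from the shared process table (each
 * RBAR entry carries VALID and its region number), region 7 = RedZone at the
 * task's stack base. Region 7 has the highest priority, so where it overlaps
 * the process RAM region, its no-access attributes win. */
static void mpu_switch_in(mpu_model_t *mpu, const mpu_region_t table[7],
                          uint32_t stack_base)
{
    mpu->CTRL = 0u;                          /* disable while reprogramming */
    for (uint32_t i = 0u; i < 7u; i++) {
        mpu_write_region(mpu, table[i].RBAR, table[i].RASR);
    }
    mpu_write_region(mpu, (stack_base & RBAR_ADDR_MSK) | RBAR_VALID | 7u,
                     REDZONE_RASR);
    mpu->CTRL = MPU_CTRL_ENABLE | MPU_CTRL_PRIVDEFENA;
    /* On a target, follow with __DSB() and __ISB(). */
}
```

Because only region 7 depends on the task, every task in a process can point to the same seven-entry table.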
There are still a few things to take care of to get an application running with an MPU. Specifically, how do you group the RAM by process? How does a process communicate with another process? What happens if a task accesses memory or a peripheral device outside its allocated memory space? Apart from task stacks, should kernel objects be allocated within the process memory space? We will address these questions in Part 3.
- Jean J. Labrosse. Detecting Stack Overflows (Part 1 of 2). March 8, 2016.
- Jean J. Labrosse. Detecting Stack Overflows (Part 2 of 2). March 14, 2016.
- ARM. MPU functions for the Armv7-M.