Architecture of ARMv8-based Firmware Systems

This article by Sergey Temerkhanov and Igor Pochinok describes the use of ARMv8 processors in server (and, to a lesser extent, embedded) systems from the perspective of the firmware components.

This article was originally published on www.embedded.com.


Since its release in 2011, the ARMv8 processor architecture has become quite widespread in the mobile device market. According to forecasts by the CEO of ARM Limited, processors of this generation will reach a worldwide market share of up to 25% by 2020. It is only natural that software support took shape, and continues to develop, by inheriting the features and general principles of the historically established infrastructure.

A fundamentally different situation is observed in the server segment of the market. x86-based servers have dominated this area for a long time, while ARMv8 is only now finding its way in, and only into specific business niches. The novelty of this market for ARM, and the fact that most of the accepted standards and specifications (primarily ACPI and UEFI) were not adapted to ARM systems until recently, has left its mark on the development of the software infrastructure.

This article provides an overview of the features of ARM-based server systems and processors and makes no claim to being an exhaustive description. The authors would also like to draw the reader's attention to the fact that the information provided can quickly become obsolete: soon enough, new processors will arrive with new technical solutions that may require a different approach to the implementation of the software infrastructure.

First, we should point out that the current firmware implementations for ARMv8 server systems consist of several relatively independent components. This brings a number of advantages, such as the possibility of reusing the same components in both server and embedded systems' firmware, as well as keeping changes to one component relatively independent of the others.

So, what modules and components are used in these systems, and what are their functions? The overall chart for the loading and interaction of modules is shown in Fig. 1. The process begins with the initialization of subsystems, such as RAM and interprocessor interfaces. In current implementations, this is executed by a separate module in the EL3S mode immediately after switching on the main CPU power. Thus, this component of the system has the maximum possible privileges. It does not usually interact with the OS directly.

Fig 1. The loading and interaction of modules. (Source: Auriga)

Next, control is transferred to the following component, most often the ARM Trusted Firmware (ATF) module, which is executed in the same mode. Control can be passed to ATF either directly from the level 0 loader described in the previous paragraph or indirectly through a special UEFI module that implements the PEI (Pre-EFI Initialization) phase. ATF consists of several modules that receive control at different times. The BL1 start module performs the initialization of the platform parts assigned to the secure processor mode. Since ARMv8-based systems use hardware separation of trusted and non-trusted resources, including RAM, the BL1 module prepares an environment in which trusted code can be executed. In particular, this initialization includes the configuration of memory/cache controllers (trusted and non-trusted zones are marked by programming the registers of these devices) and the marking of on-chip devices (e.g., non-volatile memory controllers) as trusted or non-trusted. This markup also enables the filtering of DMA transactions based on device type (trusted/non-trusted): memory writes and reads are possible only to/from areas whose security settings match those of the device.

Implementations of a trusted environment can be quite complex; for example, they can include a separate OS. However, a description of such implementations is beyond the scope of this article. The BL1 module configures the MMU address translation tables, as well as the exception handler table, where the most important element is the handler for the exception raised by the Secure Monitor Call (SMC) instruction. At this point, the handler is minimal and can only transfer control to images loaded into RAM. While running, the BL1 module loads the next stage (BL2) into RAM and transfers control to it. The BL2 module works in the EL1S mode with reduced privileges; therefore, the transfer of control to it is performed using the ERET instruction.
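To illustrate the EL3-to-EL1S handoff just described, here is a minimal C sketch (with GCC-style inline assembly) of how a BL1-like loader could program SPSR_EL3 and ELR_EL3 and then execute ERET. The image_info structure and the enter_bl2 name are hypothetical; a real implementation also has to set up SCR_EL3, stacks, and the rest of the secure context.

```c
#include <stdint.h>

/* Hypothetical descriptor for the next boot image (BL2). */
struct image_info {
    uintptr_t entry_point;   /* where BL2 was loaded in trusted RAM */
    uintptr_t args;          /* parameter block passed in x0        */
};

/* Drop from EL3 to Secure EL1 and enter the loaded image.
 * SPSR_EL3 selects the target state (EL1h, DAIF masked) and
 * ELR_EL3 holds the address consumed by ERET; SCR_EL3.NS must
 * already be 0 so that the target EL1 is the secure one.        */
static void __attribute__((noreturn)) enter_bl2(const struct image_info *img)
{
    uint64_t spsr = 0x5 /* EL1h */ | (0xF << 6) /* mask D, A, I, F */;

    __asm__ volatile(
        "msr spsr_el3, %0\n"
        "msr elr_el3,  %1\n"
        "mov x0,       %2\n"
        "eret\n"
        : : "r"(spsr), "r"((uint64_t)img->entry_point),
            "r"((uint64_t)img->args)
        : "x0", "memory");
    __builtin_unreachable();
}
```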

The purpose of the BL2 module is to load the remaining firmware modules (the BL3 parts) and transfer control to them. The reduced privilege level is used to avoid possible damage to the EL3S code and data already residing in memory. The loaded images are then executed by invoking the EL3S code installed during the BL1 stage via the SMC instruction.
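The following sketch shows how an EL1S component such as BL2 might issue such an SMC back into the EL3 code. The register usage follows the SMC Calling Convention, but the SMC_RUN_IMAGE function identifier is purely an assumed, vendor-style value, not a documented ATF interface.

```c
#include <stdint.h>

/* Hypothetical BL1-provided service: "run this image at EL3".
 * The ID is laid out in SMC Calling Convention style (fast call,
 * SMC64, SiP-owned range), but the exact value is assumed here.  */
#define SMC_RUN_IMAGE 0xC2000001ULL

static uint64_t smc_call(uint64_t fid, uint64_t a1, uint64_t a2)
{
    register uint64_t x0 __asm__("x0") = fid;
    register uint64_t x1 __asm__("x1") = a1;
    register uint64_t x2 __asm__("x2") = a2;

    /* Trap into the EL3 exception handler installed by BL1. */
    __asm__ volatile("smc #0"
                     : "+r"(x0), "+r"(x1), "+r"(x2)
                     :
                     : "x3", "memory");
    return x0;   /* result comes back in x0 */
}

/* BL2 asks the EL3 code to jump to the BL3-1 image it just loaded. */
static inline uint64_t run_bl31(uintptr_t entry, uintptr_t params)
{
    return smc_call(SMC_RUN_IMAGE, entry, params);
}
```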

The third stage of ATF loading and initialization can itself consist of three parts, but the second part is usually omitted, so in practice only two remain. The BL3-1 module is the part of the trusted code that remains accessible to general-purpose software (the OS, etc.) at runtime. Its key element is the exception handler invoked by the SMC instruction. The module contains the functions that implement standard SMC calls: the standard PSCI interface (designed to control the platform as a whole, e.g., enabling and disabling processor cores, platform-wide power management, and rebooting) as well as vendor-specific calls (providing information about the platform, managing embedded devices, etc.).
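As an illustration, a heavily simplified BL3-1-style dispatcher might look like the following. The PSCI function identifiers are taken from the PSCI specification, while the handler structure and the platform_* helpers are hypothetical.

```c
#include <stdint.h>

/* PSCI function IDs per the PSCI specification (SMC64 where relevant). */
#define PSCI_VERSION      0x84000000U
#define PSCI_CPU_ON_64    0xC4000003U
#define PSCI_SYSTEM_RESET 0x84000009U

#define SMC_UNKNOWN       (~0ULL)   /* "unknown function" return value */

/* Platform hooks assumed to exist elsewhere in the firmware. */
uint64_t platform_cpu_on(uint64_t mpidr, uint64_t entry, uint64_t ctx);
void     platform_reset(void);

/* Simplified top-level SMC dispatcher reached from the EL3 exception
 * vector; x0..x3 arrive as the call arguments.                        */
uint64_t smc_handler(uint64_t fid, uint64_t x1, uint64_t x2, uint64_t x3)
{
    switch (fid) {
    case PSCI_VERSION:
        return (1U << 16) | 0U;              /* report PSCI v1.0      */
    case PSCI_CPU_ON_64:
        /* x1 = target MPIDR, x2 = entry point, x3 = context id       */
        return platform_cpu_on(x1, x2, x3);
    case PSCI_SYSTEM_RESET:
        platform_reset();
        return 0;
    default:
        /* Vendor-specific ranges would be decoded here. */
        return SMC_UNKNOWN;
    }
}
```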

As mentioned above, the BL3-2 module is optional; when present, its code is executed in the EL1S mode. It usually serves as a specialized service/monitor for events that occur during platform operation (interrupts from certain timers, devices, etc.).

In fact, BL3-3 is not an ATF module but an image of the firmware executed in non-secure mode. It usually takes control in the EL2 mode and is an image of either a bootloader similar to the widely known U-Boot or a UEFI environment, which is standard for server systems.

The overall chart of ATF module initialization is shown in Fig. 2.

Fig. 2. ATF module initialization. (Source: Auriga)

Another initialization path may be used in certain ARMv8-based server systems: ATF is started during the UEFI PEI phase, after which the transition to the UEFI DXE phase occurs.

UEFI on ARMv8 differs significantly from its x86 counterpart. The PEI and DXE (Driver Execution Environment) phases are used on both x86 and ARMv8. However, on many ARMv8 systems the PEI phase is significantly reduced, and no hardware initialization is performed during it. This stage consists of setting up the MMU translation tables, configuring the interrupt controller and the CPU timer (according to the UEFI specification, the timer interrupt is the only interrupt processed in this environment), building the EFI Hand-Off Blocks (HOBs), and launching the DXE core. At this stage, native UEFI modules tend to use the platform-specific SMC calls described above.
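For illustration, a reduced PEI phase might publish the platform RAM to DXE with EDK II-style HOB library calls, roughly as sketched below. The DRAM base and size values are placeholders, not real platform data.

```c
#include <PiPei.h>
#include <Library/HobLib.h>

/* Sketch of the ARMv8 "reduced" PEI phase: no device init, just
 * describe RAM to DXE through resource-descriptor HOBs.          */
#define PLAT_DRAM_BASE  0x0080000000ULL   /* assumed value */
#define PLAT_DRAM_SIZE  0x0100000000ULL   /* assumed value */

VOID
DescribePlatformMemory (
  VOID
  )
{
  /* Tell DXE which physical range is normal system memory. */
  BuildResourceDescriptorHob (
    EFI_RESOURCE_SYSTEM_MEMORY,
    EFI_RESOURCE_ATTRIBUTE_PRESENT |
    EFI_RESOURCE_ATTRIBUTE_INITIALIZED |
    EFI_RESOURCE_ATTRIBUTE_WRITE_BACK_CACHEABLE |
    EFI_RESOURCE_ATTRIBUTE_TESTED,
    PLAT_DRAM_BASE,
    PLAT_DRAM_SIZE
    );

  /* Record the range already occupied by the firmware itself so the
     DXE memory allocator does not hand it out.                      */
  BuildMemoryAllocationHob (PLAT_DRAM_BASE, SIZE_1MB, EfiBootServicesData);
}
```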

The bulk of UEFI's work is performed in the DXE phase. Above all, this involves loading and starting hardware drivers, both for on-chip peripherals and for external devices connected via interfaces such as PCIe, USB, and SATA.

It should be noted that ARMv8-based systems differ significantly from similar x86-based systems in terms of configuration, device detection mechanisms, and so on. For example, the main device detection mechanism on x86 is scanning the PCI configuration space and assigning memory addresses to devices, which the devices then have to decode. In ARMv8-based systems, built-in peripherals almost always have fixed addresses in the memory space (I/O ports are unused, as they are not supported by the CPU architecture) and in some cases are not visible in the PCI configuration space at all. For such systems, the hardware description takes the form of a Flattened Device Tree (FDT): a tree-like description of device connections that also describes resources such as memory ranges and interrupt numbers associated with these devices.
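As an example of how firmware or an OS consumes such a description, the following libfdt-based sketch looks up a fixed-address UART by its compatible string and extracts its base address from the "reg" property. The "arm,pl011" string and the assumption of two address cells are illustrative.

```c
#include <stdint.h>
#include <libfdt.h>

/* Look up a fixed-address UART in the flattened device tree and
 * return its first "reg" address.                                 */
static uint64_t find_uart_base(const void *fdt)
{
    int node, len;
    const fdt32_t *reg;

    node = fdt_node_offset_by_compatible(fdt, -1, "arm,pl011");
    if (node < 0)
        return 0;                      /* node not present */

    reg = fdt_getprop(fdt, node, "reg", &len);
    if (!reg || len < 2 * (int)sizeof(fdt32_t))
        return 0;

    /* Assumes #address-cells = 2, i.e. a 64-bit address split into
       two big-endian cells, which is typical for ARMv8 platforms.  */
    return ((uint64_t)fdt32_to_cpu(reg[0]) << 32) | fdt32_to_cpu(reg[1]);
}
```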

In more advanced systems, SoCs support access through the PCI configuration space and contain controllers that implement access to this space via the Enhanced Configuration Access Mechanism (ECAM). However, since the memory addresses of such units are fixed, the usual PCI device configuration mechanism is not applicable. To resolve this conflict for systems with fixed PCI device address windows, the Enhanced Allocation (EA) PCI capability was developed. A separate article could be written on the unique properties of this capability; in brief, it is a set of alternative registers containing information about memory addresses, bus numbers (for built-in PCI-PCI bridges), and so on.
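The following C sketch shows the fixed address arithmetic behind ECAM and a minimal walk of the standard capability list to detect the Enhanced Allocation capability; decoding of the EA entries themselves is omitted.

```c
#include <stdint.h>

/* ECAM: each function's 4 KiB configuration space sits at a fixed
 * offset from the segment base, derived from bus/device/function.  */
static inline volatile uint32_t *
ecam_cfg(uintptr_t ecam_base, uint8_t bus, uint8_t dev, uint8_t fn,
         uint16_t offset)
{
    return (volatile uint32_t *)(ecam_base +
                                 ((uintptr_t)bus << 20) +
                                 ((uintptr_t)dev << 15) +
                                 ((uintptr_t)fn  << 12) +
                                 (offset & ~3u));
}

/* Walk the standard capability list looking for the Enhanced
 * Allocation capability (capability ID 0x14).                      */
#define PCI_CAP_PTR   0x34
#define PCI_CAP_ID_EA 0x14

static int has_enhanced_allocation(uintptr_t ecam_base,
                                   uint8_t bus, uint8_t dev, uint8_t fn)
{
    uint8_t ptr = *ecam_cfg(ecam_base, bus, dev, fn, PCI_CAP_PTR) & 0xFF;

    while (ptr) {
        uint32_t hdr = *ecam_cfg(ecam_base, bus, dev, fn, ptr);
        if ((hdr & 0xFF) == PCI_CAP_ID_EA)
            return 1;                  /* EA capability present   */
        ptr = (hdr >> 8) & 0xFF;       /* next capability pointer */
    }
    return 0;
}
```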

UEFI is inseparable from another method of passing platform configuration information to the OS: ACPI. At the moment, the ACPI specifications are still being developed and refined to improve support for the ARMv8 architecture. According to the available information, ACPI is expected to become the key method of describing essential platform information for ARMv8 (primarily the number and configuration of processor cores and PCI/PCIe controllers) and of managing the platform. Some of the ARMv8 OSes planned for release support only the ACPI mechanism.
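For instance, the ECAM regions mentioned above are typically communicated to the OS through the ACPI MCFG table; a simplified C rendering of its layout (per the PCI Firmware Specification) is shown below for reference.

```c
#include <stdint.h>

/* Layout of the ACPI MCFG table, through which the firmware tells
 * the OS where each PCIe segment's ECAM region lives.              */
#pragma pack(1)
struct acpi_table_header {
    char     signature[4];      /* "MCFG"                          */
    uint32_t length;
    uint8_t  revision;
    uint8_t  checksum;          /* whole table sums to zero        */
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    uint32_t creator_id;
    uint32_t creator_revision;
};

struct mcfg_allocation {
    uint64_t ecam_base;         /* physical base of the ECAM window */
    uint16_t segment;           /* PCI segment group number         */
    uint8_t  start_bus;
    uint8_t  end_bus;
    uint32_t reserved;
};

struct acpi_mcfg {
    struct acpi_table_header header;
    uint64_t reserved;
    struct mcfg_allocation alloc[];   /* one entry per host bridge  */
};
#pragma pack()
```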

The DXE stage includes device detection, initialization, and registration in UEFI, as well as preparation for booting the OS. The latter consists of preparing the system memory map and the configuration information: loading, generating, and publishing ACPI tables, modifying them to reflect the current platform configuration, making similar changes to the FDT, and verifying and generating checksums. The modules loaded at this stage may implement UEFI Runtime Services, i.e., functions that remain available to the OS at runtime. It should be noted that in all the systems the authors of this article have worked with, device detection was implemented via the PCI ECAM mechanism.
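One small but representative piece of this work is recomputing table checksums after patching: an ACPI table is valid when all of its bytes sum to zero modulo 256, as in the following sketch.

```c
#include <stdint.h>
#include <stddef.h>

/* Recompute the byte checksum of an ACPI table after it has been
 * patched; the sum of every byte, including the checksum field,
 * must equal zero modulo 256.                                      */
static void acpi_fix_checksum(void *table, size_t length, size_t csum_offset)
{
    uint8_t *bytes = table;
    uint8_t sum = 0;

    bytes[csum_offset] = 0;            /* exclude the old value       */
    for (size_t i = 0; i < length; i++)
        sum += bytes[i];
    bytes[csum_offset] = (uint8_t)(0x100 - sum);   /* make total zero */
}
```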

Upon completion of this stage, Boot Device Selection (BDS) commences. A separate module is usually used at this stage; it processes the values of the "BootOrder," "BootNext," and other related variables. Often, this module also implements a (pseudo)graphical user interface. At this point, there are many commonalities with x86-based systems: the same boot methods are used, such as PXE, iSCSI, and block devices (SATA/SAS/USB drives, SSDs, NVMe devices, etc.).
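As a simple illustration, a BDS-style module could read the "BootOrder" variable through the runtime services roughly as follows (EDK II-style code; the function name and the fixed-size buffer are simplifications).

```c
#include <Uefi.h>
#include <Guid/GlobalVariable.h>
#include <Library/DebugLib.h>
#include <Library/UefiRuntimeServicesTableLib.h>

EFI_STATUS
PrintBootOrder (
  VOID
  )
{
  UINT16      Order[32];        /* enough for this sketch */
  UINTN       Size;
  UINTN       Index;
  EFI_STATUS  Status;

  Size   = sizeof (Order);
  Status = gRT->GetVariable (
                  L"BootOrder",
                  &gEfiGlobalVariableGuid,
                  NULL,         /* attributes not needed */
                  &Size,
                  Order
                  );
  if (EFI_ERROR (Status)) {
    return Status;
  }

  for (Index = 0; Index < Size / sizeof (UINT16); Index++) {
    /* Each entry names a Boot#### variable, e.g. Boot0001. */
    DEBUG ((DEBUG_INFO, "Boot%04x\n", Order[Index]));
  }

  return EFI_SUCCESS;
}
```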

The drivers of external devices (usually PCIe devices) for ARMv8 UEFI deserve special attention. They can be implemented as modules stored on storage devices (on a FAT32 file system) or reside directly on the device itself (Option ROM). Adding ARMv8 to the list of supported architectures causes problems for vendors in some cases. Simply recompiling the source code for ARMv8 is not always sufficient, because some modules are not designed to function in a full 64-bit address space. Difficulties may also arise because translation between PCI bus addresses and processor addresses (and vice versa) is widely used in ARMv8 systems; this is a consequence of the decision to abandon legacy "windows" located within the lower 32 bits of the memory address space. As for improving compatibility, drivers compiled to EBC bytecode may provide the required level of portability; however, at the time this article was written, the EBC interpreter for ARMv8 was at an early stage of development.

Control is transferred to the module loaded into memory (a boot loader or the OS kernel directly) in accordance with the UEFI specification: the module's UEFI handle is passed in the X0 register, the system table pointer in X1, and the return address in X30 (LR).
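In source form this is simply the standard UEFI entry point; on AArch64 the two parameters below arrive in X0 and X1, as described above (the UefiMain name is an EDK II convention rather than a requirement).

```c
#include <Uefi.h>

/* Standard UEFI image entry point: ImageHandle in X0, the system
 * table pointer in X1, the return address into firmware in X30.   */
EFI_STATUS
EFIAPI
UefiMain (
  IN EFI_HANDLE        ImageHandle,
  IN EFI_SYSTEM_TABLE  *SystemTable
  )
{
  /* A boot loader would locate its boot device and kernel image here. */
  return EFI_SUCCESS;
}
```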

The OS kernel performs some preparation steps using the UEFI services, then sets up its own translation tables and calls the UEFI methods ExitBootServices() and SetVirtualAddressMap(). This is necessary because the UEFI code executes in the same address space as the OS kernel. In addition, timer interrupts and any possible DMA transfers have to be disabled. There is a notable aspect of the ARMv8 Linux design: the main kernel code executes in the EL1 mode, whereas the EL2 mode is reserved for a part of the KVM hypervisor code only. Thus, the kernel drops its privilege level from EL2 to EL1 during initialization. After that, only the Runtime Services, a subset of all UEFI services, remain available to the kernel.

The Linux kernel on ARMv8 makes extensive use of the PSCI interface when it is implemented in one of the ATF modules, as mentioned earlier. This is especially characteristic of multicore systems. The interface itself and the process of secondary CPU core initialization can be briefly described as issuing SMC calls with the PSCI function number and the entry point of the initialization function as parameters. As a matter of fact, calls to UEFI and SMC services are currently the main means of interaction between the OS and the firmware. There are draft specifications for other firmware event notification facilities, but there have been no reports of any completed implementations to date (2015).
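Below is a minimal, EDK II-style sketch of the handoff sequence just described: fetching the memory map, leaving boot services, and passing a virtual address map back to UEFI. The LeaveFirmware name is illustrative, and passing the map unchanged to SetVirtualAddressMap() is a simplification; a real kernel rewrites the VirtualStart fields of the runtime descriptors first.

```c
#include <Uefi.h>
#include <Library/MemoryAllocationLib.h>
#include <Library/UefiBootServicesTableLib.h>
#include <Library/UefiRuntimeServicesTableLib.h>

EFI_STATUS
LeaveFirmware (
  IN EFI_HANDLE  ImageHandle
  )
{
  EFI_MEMORY_DESCRIPTOR  *Map = NULL;
  UINTN                  MapSize = 0, MapKey, DescSize;
  UINT32                 DescVersion;
  EFI_STATUS             Status;

  /* First call reports the required buffer size. */
  Status = gBS->GetMemoryMap (&MapSize, Map, &MapKey, &DescSize, &DescVersion);
  if (Status != EFI_BUFFER_TOO_SMALL) {
    return Status;
  }
  MapSize += 2 * DescSize;             /* slack: the allocation may grow it */
  Map = AllocatePool (MapSize);
  if (Map == NULL) {
    return EFI_OUT_OF_RESOURCES;
  }

  Status = gBS->GetMemoryMap (&MapSize, Map, &MapKey, &DescSize, &DescVersion);
  if (EFI_ERROR (Status)) {
    return Status;
  }

  /* No boot services (or console output) may be used after this call. */
  Status = gBS->ExitBootServices (ImageHandle, MapKey);
  if (EFI_ERROR (Status)) {
    return Status;
  }

  /* Hand UEFI the kernel's view of the runtime regions. */
  return gRT->SetVirtualAddressMap (MapSize, DescSize, DescVersion, Map);
}
```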

To sum up, it should be mentioned that this article does not provide an exhaustive description of the functioning and interaction of ARMv8-based firmware components. Moreover, the development of the architecture is undergoing constant refinement and refactoring, which are likely to provide material for new publications.