IOMMU Support in Linux
Earlier this year, we contributed to Linux a driver for the Input/Output Memory Management Unit (IOMMU) introduced with the Allwinner H6 SoC; it was merged in Linux 5.8. It felt like a good occasion to write about what an IOMMU is, what it is useful for, and how it integrates into Linux.
At the lowest level, a CPU executes an infinite sequence of instructions, some of them reading from or writing to an address in memory. Depending on the system design, the hardware maps that address to the main RAM of the system, or to a specific device. The set of all the memory addresses that the CPU can access is the memory space.
In older or simpler designs, the CPU directly accesses the physical location of a memory word or register on the memory bus: the physical address. This simple design is still in use on some micro-controllers today, but it introduces issues related to security and reliability.
Indeed, that design doesn’t enforce any isolation between processes or any permission checks. This means that every process in the system has read/write access to the entire memory, including the kernel and other applications. A buggy application writing at some random place in memory can thus corrupt the whole system. And that is the best-case scenario: this design also lets any process snoop on memory belonging to another process or the kernel, leading to sensitive information leaks if that process is malicious.
To fix that issue, CPU designs introduced a hardware unit called the Memory Management Unit (MMU) that, once enabled, intercepts all the memory accesses done by the CPU and translates them. This mechanism creates two distinct address spaces: the physical address space we talked about earlier, and a virtual address space that the CPU operates on, with the MMU in charge of translating virtual addresses into physical addresses when the CPU accesses them.
The best part is that the kernel can change the virtual address space at any time, so it actually uses one virtual address space per process (and potentially one for the kernel). In addition, the MMU allows setting extra attributes on memory ranges to carry out access control: whether the associated process is allowed to read, write, or execute the data stored in a particular range.
This mechanism lets the kernel prevent any process from messing with the memory of another process at the hardware level, almost entirely closing the class of attacks we mentioned earlier, and it enables the operating system to provide more optimizations and features.
Even though it looks complete, there’s still a limitation: the MMU only translates the addresses used by the CPU. Yet some devices can also access memory directly, without CPU intervention, using a feature called Direct Memory Access (DMA), and since the CPU isn’t involved, the MMU isn’t either. This means that we’re still vulnerable to the same class of issues as the ones described earlier if we perform a misconfigured DMA operation, or if an attacker manages to trick a device into reading other parts of memory than the ones it’s supposed to access. As an example of such an attack, you can find an in-depth article explaining how the Google Project Zero team was able to pull this off on a Nexus 6 modem here.
To close that avenue of attack, system designers later integrated IOMMUs, which fill much the same needs and provide the same features as regular MMUs, but apply that translation and protection to devices performing DMA instead of to the CPU. It’s not unusual to have several IOMMUs: the GPU almost always has its own, and there can be another one for the rest of the devices in the system, or a subset of them.
IOMMUs also address a common issue where the device isn’t able to access the entirety of a buffer. This mainly happens either when the buffer it’s supposed to perform DMA on isn’t contiguous in physical memory and is thus split into several chunks, while the device can only perform DMA on a single chunk; or when the device can’t reach the buffer’s location in RAM due to some limitation. One common limitation is a device that can only address the lowest 4 GB of RAM (since it uses 32-bit addresses), while the system might have more than that. Either way, the traditional way to deal with that situation is to create a bounce buffer: a temporary buffer that the device can access, into which we copy the content of the original buffer. There’s an obvious performance cost though, since we introduce a copy of the whole buffer before the device can operate on it. IOMMUs save the day once again, since you can create essentially any mapping: one where the buffer is contiguous in the device’s virtual address space in the first case, and one at a virtual address the device can reach in the second.
Integration into Linux
The IOMMU Framework
Since an IOMMU is so useful, Linux gained support for IOMMUs through a dedicated driver framework under drivers/iommu. That framework uses three different concepts to enforce permissions in as fine-grained a way as possible for a given device: domains, groups, and devices.
A device is quite straightforward: it’s a device that can perform DMA operations through our IOMMU. A domain is a representation of a virtual address space, and a group is a set of devices that the IOMMU can’t isolate from each other. The framework then relies on a set of functions to manage those three concepts. As you can see from the operations a driver can implement, those functions manage the relationships between the devices, groups, and domains in the system, and manage each one individually (to create a mapping in a particular domain, for example).
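The three concepts map quite directly onto the callbacks an IOMMU driver registers with the framework. The sketch below is a simplified, non-exhaustive view of `struct iommu_ops` as it looked around the Linux 5.8 era (the `my_*` handlers are hypothetical placeholders; member names and grouping are from memory of that version and have since evolved):

```c
/* Sketch: the operations an IOMMU driver exposes, grouped by concept. */
static const struct iommu_ops my_iommu_ops = {
	/* Domains: create/destroy a virtual address space, and manage
	 * the mappings inside it. */
	.domain_alloc	= my_domain_alloc,
	.domain_free	= my_domain_free,
	.map		= my_map,
	.unmap		= my_unmap,
	.iova_to_phys	= my_iova_to_phys,

	/* Devices: attach or detach them from a domain when they need
	 * (or stop needing) translation. */
	.attach_dev	= my_attach_dev,
	.detach_dev	= my_detach_dev,

	/* Groups: report which group a given device belongs to, i.e.
	 * which devices the hardware cannot isolate from each other. */
	.device_group	= my_device_group,
};
```

Everything the framework does (creating mappings, attaching devices, building groups) funnels through callbacks of this shape in the individual IOMMU drivers.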
The DMA API
The second part of the integration in Linux, once the kernel knows how to manage the various address spaces through the IOMMU framework and drivers, is to let the drivers for the devices that need to perform DMA use that IOMMU.
One of the main goals of Linux is to be as portable as possible; it does so by abstracting most of the usual driver operations through frameworks and APIs, which decouples the driver itself (how one needs to program a device to perform a given operation) from how the platform works (in our case, how the platform does DMA, or how Linux should configure the IOMMU).
DMA is no different here, and Linux has two distinct APIs depending on the context: the streaming API and the coherent API. Which one you use depends on the situation you’re in, and on who is in charge of allocating the buffer accessed through DMA. If user-space hands the buffer to the driver (for example when sending a network packet, or writing to a file), you need to set up a temporary mapping through the streaming API. In other situations, the kernel is in charge of allocating the buffer and then passes it to user-space (the Direct Rendering Manager (DRM) and Video4Linux2 (v4l2) can operate like that), and that is where you need the coherent API.
Both APIs exist to make sure that the device can perform DMA on the buffer we want it to access. What that actually means depends on the platform: which memory ranges the device can reach, which memory allocator to use, and so on. Part of that decision involves whether the device’s accesses go through an IOMMU, which then gets configured to create the mappings for our buffer in the device’s virtual address space.
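In driver code, the two APIs look roughly like the fragment below. The function names are the real kernel DMA API; the surrounding context (`dev`, `buf`, `size`, the device programming steps) is assumed from an imaginary driver, so this is a sketch rather than a complete driver:

```c
/* Streaming API: map a buffer someone else allocated (e.g. handed
 * over by user-space) for the duration of a single transfer. */
dma_addr_t handle = dma_map_single(dev, buf, size, DMA_TO_DEVICE);
if (dma_mapping_error(dev, handle))
	return -ENOMEM;
/* ... program the device with 'handle', wait for completion ... */
dma_unmap_single(dev, handle, size, DMA_TO_DEVICE);

/* Coherent API: have the kernel allocate a long-lived buffer that
 * both the CPU (through 'vaddr') and the device (through 'handle')
 * can access at any time. */
void *vaddr = dma_alloc_coherent(dev, size, &handle, GFP_KERNEL);
if (!vaddr)
	return -ENOMEM;
/* ... use 'vaddr' from the CPU and 'handle' from the device ... */
dma_free_coherent(dev, size, vaddr, handle);
```

In both cases the driver only ever sees a `dma_addr_t` handle: whether that handle is a plain physical address or an IOMMU-translated virtual address is the platform’s business, not the driver’s.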
The Allwinner H6 IOMMU
We contributed to Linux 5.8 a driver for the IOMMU introduced by Allwinner in their H6 SoC. Compared to the IOMMUs found on modern PCs, its design is fairly simple: it doesn’t support virtualization, there’s a single address space, and it doesn’t cover all the devices doing DMA in the system but only the DMA-intensive ones: the video and display devices.
The driver thus registers and allows a single domain and a single group, since the IOMMU can’t isolate devices from one another and supports a single address space. Despite these limitations, it’s still a welcome addition for all the reasons above.
On earlier SoCs, the video and display controllers also require a lot of big, physically contiguous buffers, allocated through the Contiguous Memory Allocator (CMA). CMA works by setting aside a part of the system RAM at boot and allocating big buffers from that area. This has a significant drawback: the system has to maintain a rather big (on the order of 100 MB), static pool of reserved RAM, whether or not it’s going to use its video or display capabilities, and with no way to give the RAM back if it doesn’t. As we discussed, an IOMMU can map several separate chunks of physical memory into a contiguous buffer in the device’s virtual address space, effectively reducing (or even removing) the need for CMA.
From Linux 5.8 onward, the Allwinner H6 uses its IOMMU with the DRM display driver and the Cedrus video decoder driver. As we enable its usage across more drivers over time, more devices will benefit. And with any luck, more Allwinner SoCs will feature an IOMMU in the future.