
Monday, August 25, 2014

IOMMU Groups, inside and out

Sometimes VFIO users are befuddled that they aren't able to separate devices between host and guest or multiple guests due to IOMMU grouping.  Some revert to using legacy KVM device assignment, or, as is the case with many VFIO-VGA users, apply the PCIe ACS override patch to avoid the problem.  Let's take a moment to look at what this grouping is really doing.

Hopefully we all have at least some vague notion of what an IOMMU does in a system: it allows mapping of an I/O virtual address (IOVA) to a physical memory address.  Without an IOMMU, all devices share a flat view of physical memory without any memory translation operation.  With an IOMMU we have a new address space, the IOVA space, that we can put to use.

Different IOMMUs have different levels of functionality.  Before the proliferation of virtualization, IOMMUs often provided only translation, and often only for a small aperture or window of the address space.  These IOMMUs mostly provided two capabilities, avoiding bounce buffers and creating contiguous DMA operations.  Bounce buffers are necessary when the addressing capabilities of the device are less than that of the platform, for instance if the device can only address 4GB of memory, but your system supports 8GB.  If the driver allocates a buffer above 4GB, the device cannot directly DMA to it.  A bounce buffer is buffer space in lower memory, where the device can temporarily DMA, which is then copied to the driver allocated buffer on completion.  An IOMMU can avoid the extra buffer and copy operation by providing an IOVA within the device's address space, backed by the driver's buffer that is outside of the device's address space.  Creating contiguous DMA operations comes into play when the driver makes use of multiple buffers, scattered throughout the physical address space, and gathered together for a single I/O operation.  The IOMMU can take these scatter-gather lists and map them into the IOVA space to form a contiguous DMA operation for the device.  In the simplest example, a driver may allocate two 4KB buffers that are not contiguous in the physical memory space.  The IOMMU can allocate a contiguous range for these buffers allowing the I/O device to do a single 8KB DMA rather than two separate 4KB DMAs.

Both of these features are still important for high performance I/O on the host, but the IOMMU feature we love from a virtualization perspective is the isolation capabilities of modern IOMMUs.  Isolation wasn't possible on a wide scale prior to PCI-Express because conventional PCI does not tag transactions with an ID of the requesting device (requester ID).  PCI-X included some degree of a requester ID, but rules for interconnecting devices taking ownership of the transaction made the support incomplete for isolation.  With PCIe, each device tags transactions with a requester ID unique to the device (the PCI bus/device/function number, BDF), which is used to reference a unique IOVA table for that device.  Suddenly we go from having a shared IOVA space used to offload unreachable memory and consolidate memory, to a per device IOVA space that we can not only use for those features, but also to restrict DMA access from the device.  For assignment to a virtual machine, we now simply need to populate the IOVA space for the assigned device with the guest physical to host physical memory mappings for the VM and the device can transparently perform DMA in the guest address space.

Back to IOMMU groups: IOMMU groups try to describe the smallest sets of devices which can be considered isolated from the perspective of the IOMMU.  The first step in doing this is that each device must associate to a unique IOVA space.  That is, if multiple devices alias to the same IOVA space, then the IOMMU cannot distinguish between them.  This is the reason that a typical x86 PC will group all conventional PCI devices together; all of them are aliased to the same PCIe-to-PCI bridge.  Legacy KVM device assignment will allow a user to assign these devices separately, but the configuration is guaranteed to fail.  VFIO is governed by IOMMU groups and therefore prevents configurations which violate this most basic requirement of IOMMU granularity.
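To see where a device sits relative to the bridges above it on your own system, the PCI topology can be inspected directly; this is just a convenience sketch using lspci (from pciutils) and sysfs, and the BDF shown is the wireless adapter from the example later in this post:

```shell
# Print the PCI bus topology as a tree; conventional PCI devices appear
# beneath a PCIe-to-PCI bridge and therefore alias to its requester ID.
lspci -tv || echo 'lspci (pciutils) not installed'

# The sysfs path of a device encodes the bridges above it, so resolving
# the device link shows every port between it and the root complex.
readlink -f /sys/bus/pci/devices/0000:03:00.0 || echo 'no such device on this system'
```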

Beyond this first step of being able to simply differentiate one device from another, we next need to determine whether the transactions from a device actually reach the IOMMU.  The PCIe specification allows for transactions to be re-routed within the interconnect fabric.  A PCIe downstream port can re-route a transaction from one downstream device to another.  The downstream ports of a PCIe switch may be interconnected to allow re-routing from one port to another.  Even within a multifunction endpoint device, a transaction from one function may be delivered directly to another function.  These transactions from one device to another are called peer-to-peer transactions and can be bad news for devices operating in separate IOVA spaces.  Imagine for instance if the network interface card assigned to your guest attempted a DMA write to a guest physical address (IOVA) that matched the MMIO space for a peer disk controller owned by the host.  An interconnect attempting to optimize the data path of that transaction could send the DMA write straight to the disk controller before it gets to the IOMMU for translation.

This is where PCIe Access Control Services (ACS) comes into play.  ACS provides us with the ability to determine whether these redirects are possible as well as the ability to disable them.  This is an essential component in being able to isolate devices from one another and sadly one that is too often missing in interconnects and multifunction endpoints.  Without ACS support at every step from the device to the IOMMU, we must assume that redirection is possible at the highest upstream device lacking ACS, thereby breaking isolation of all devices below that point in the topology.  IOMMU groups in a PCI environment take this isolation into account, grouping together devices which are capable of untranslated peer-to-peer DMA.
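Whether a given port or endpoint exposes an ACS capability can be checked by dumping its extended capabilities with lspci; this needs to run as root to read beyond the standard config space header, and the root port address below is taken from the example later in this post, so substitute one from your own system:

```shell
# Look for an ACS extended capability on a root port.  No output means
# the port has no ACS capability and isolation below it cannot be assumed.
lspci -s 00:1c.0 -vvv | grep -i 'acs' || echo 'no ACS capability found'
# When ACS is present, lspci prints capability and control lines such as:
#   ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmplRedir+ UpstreamFwd+ ...
#   ACSCtl: SrcValid+ TransBlk- ReqRedir+ CmplRedir+ UpstreamFwd+ ...
```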

Combining these two things, the IOMMU group represents the smallest set of devices for which the IOMMU has visibility and which is isolated from other groups.  VFIO uses this information to enforce safe ownership of devices for userspace.  With the exception of bridges, root ports, and switches (i.e., interconnect fabric), all devices within an IOMMU group must be bound to a VFIO device driver or known safe stub driver.  For PCI, these drivers are vfio-pci and pci-stub.  We allow pci-stub simply because it's known that the host does not interact with devices via this driver (using legacy KVM device assignment on such devices while the group is in use with VFIO for a different VM is strongly discouraged).  If, when attempting to use VFIO, you see an error message indicating the group is not viable, it relates to this rule: all of the devices in the group need to be bound to an appropriate host driver.

IOMMU groups are visible to the user through sysfs:

$ find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/2/devices/0000:00:14.0
/sys/kernel/iommu_groups/3/devices/0000:00:16.0
/sys/kernel/iommu_groups/4/devices/0000:00:19.0
/sys/kernel/iommu_groups/5/devices/0000:00:1a.0
/sys/kernel/iommu_groups/6/devices/0000:00:1b.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.1
/sys/kernel/iommu_groups/7/devices/0000:00:1c.2
/sys/kernel/iommu_groups/7/devices/0000:02:00.0
/sys/kernel/iommu_groups/7/devices/0000:03:00.0
/sys/kernel/iommu_groups/8/devices/0000:00:1d.0
/sys/kernel/iommu_groups/9/devices/0000:00:1f.0
/sys/kernel/iommu_groups/9/devices/0000:00:1f.2
/sys/kernel/iommu_groups/9/devices/0000:00:1f.3

Here we see that devices like the audio controller (0000:00:1b.0) have their own IOMMU group, while a wireless adapter (0000:03:00.0) and flash card reader (0000:02:00.0) share an IOMMU group.  The latter is a result of the lack of ACS support at the PCIe root ports (0000:00:1c.*).  Each device also has links back to its IOMMU group:

$ readlink -f /sys/bus/pci/devices/0000\:03\:00.0/iommu_group/
/sys/kernel/iommu_groups/7

The set of devices can thus be found using:

$ ls /sys/bus/pci/devices/0000\:03\:00.0/iommu_group/devices/
0000:00:1c.0  0000:00:1c.1  0000:00:1c.2  0000:02:00.0  0000:03:00.0
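To survey an entire system at once, a small loop over sysfs pairs each group number with a human-readable device description; this is just a convenience sketch assuming the sysfs layout shown above and lspci from pciutils:

```shell
# For every device in every IOMMU group, print the group number
# followed by the lspci description of the device.
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    [ -e "$dev" ] || continue   # skip if no IOMMU groups exist
    group=${dev%/devices/*}     # strip the trailing /devices/BDF
    group=${group##*/}          # keep only the group number
    printf 'IOMMU group %s: ' "$group"
    lspci -nns "${dev##*/}"     # the BDF is the last path component
done
```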

Using this example, if I wanted to assign the wireless adapter (0000:03:00.0) to a guest, I would also need to bind the flash card reader (0000:02:00.0) to either vfio-pci or pci-stub in order to make the group viable.  An important point here is that the flash card reader does not also need to be assigned to the guest; it simply needs to be held by a driver which is known to either participate in VFIO, like vfio-pci, or known not to do DMA, like pci-stub.  Newer kernels than used for this example will split this IOMMU group as support has been added to expose the isolation capabilities of this chipset, even though it does not support PCIe ACS directly.
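One way to hand a device to vfio-pci is the driver_override mechanism (kernel 3.16 or newer); the helper below is a sketch of that procedure, must be run as root, and takes the BDF as its argument:

```shell
# Sketch: rebind a PCI device to vfio-pci via driver_override.
# Usage (as root): bind_to_vfio 0000:02:00.0
bind_to_vfio() {
    dev=$1
    modprobe vfio-pci
    # Tell the PCI core that only vfio-pci may claim this device.
    echo vfio-pci > /sys/bus/pci/devices/$dev/driver_override
    # Release the current host driver, if one is bound.
    if [ -e /sys/bus/pci/devices/$dev/driver ]; then
        echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
    fi
    # Re-run driver matching; vfio-pci now claims the device.
    echo $dev > /sys/bus/pci/drivers_probe
}
```

In this example, `bind_to_vfio 0000:02:00.0` would hand the flash card reader to vfio-pci, making group 7 viable for assigning the wireless adapter.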

In closing, let's discuss strategies for dealing with IOMMU groups that contain more devices than desired.  For a plug-in card, the first option would be to determine whether installing the card into a different slot may produce the desired grouping.  On a typical Intel chipset, PCIe root ports are provided via both the processor and the PCH (Platform Controller Hub).  The capabilities of these root ports can be very different.  On the latest Linux kernels we have support for exposing the isolation of the PCH root ports, even though many of them do not have native PCIe ACS support.  These are therefore often a good target for creating smaller IOMMU groups.  On Xeon class processors (except E3-1200 series), the processor-based PCIe root ports typically support ACS.  Client processors, such as the Core i5/i7, do not support ACS, but we can hope that future products from Intel will update this support.

Another option that many users have found is a kernel patch which overrides PCIe ACS support in the kernel, allowing command line options to falsely expose isolation capabilities of various components.  In many cases this appears to work well, but without vendor confirmation, we cannot be sure that the devices are truly isolated.  The occurrence of a misdirected DMA may be sufficiently rare to mask association with this option.  We may also find differences in chipset programming or address assignment between vendors that allows relatively safe use of this override on one system, while other systems may experience issues.  Adjusting slot usage and using a platform with proper isolation support are clearly the best options.

The final option is to work with the vendor to determine whether isolation is present and quirk the kernel to recognize this isolation.  This is generally a matter of determining whether internal peer-to-peer between functions is possible, or in the case of downstream ports, also determining whether redirection is possible.  Multifunction endpoints that do not support peer-to-peer can expose this using a single static ACS table in configuration space, exposing no capabilities.

Hopefully this entry helps to describe why we have IOMMU groups, why they take the shape that they do, and how they operate with VFIO.  Please comment if I can provide further clarification anywhere.