Monday, September 22, 2014

VFIO interrupts and how to coax Windows guests to use MSI

Interrupts are used by devices for signaling attention.  In the case of a NIC, there might be an interrupt indicating a packet received or that a transmit queue is empty.  As with everything else, how we signal interrupts has evolved over time.

In the PCI space, we started with just four physical interrupt lines, INT{A,B,C,D}.  These are known as INTx, where x may be any one of the four lines.  The configuration space for each PCI function indicates which interrupt line is used by that function.  A common configuration is that function 0 may use INTA, while function 1 uses INTB, which helps to distribute devices evenly among the interrupt lines.  PCI bridges also incorporate a standard swizzle to remap interrupt lines between primary and secondary interfaces, so that we don't over-use some of the interrupt lines.  Each slot may also have different mappings, so INTA on one slot doesn't actually pull the same line as INTA on another slot.
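The bridge swizzle mentioned above can be sketched in a few lines of Python. This is only an illustration of the conventional PCI-to-PCI bridge rule (pin on the primary side = (pin + device number) mod 4); the function and variable names are made up for this example.

```python
# Pins are numbered 0..3 for INTA..INTD.
PINS = "ABCD"


def swizzle(pin: int, device: int) -> int:
    """Pin seen on a bridge's primary side for `pin` raised by the
    device in slot `device` on the bridge's secondary bus."""
    return (pin + device) % 4


def route_to_root(pin: int, slots: list) -> int:
    """Apply the swizzle once per bridge on the way to the root bus.

    slots[0] is the slot of the device itself; each later entry is the
    slot of the next bridge up, for bridges that sit behind another bridge.
    """
    for slot in slots:
        pin = swizzle(pin, slot)
    return pin


if __name__ == "__main__":
    # INTA from the devices in slots 0..3 behind one bridge lands on
    # four different upstream pins.
    for slot in range(4):
        print(f"slot {slot}: INTA -> INT{PINS[swizzle(0, slot)]}")
```

The effect is that four single-function devices that all use INTA end up on four different lines upstream of the bridge.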

These interrupt lines are wired to be active-low, meaning that when an interrupt is not being signaled, the physical wire floats to a high value (ex. 5 volts), with a low current.  If a device wants to signal an interrupt, it pulls the line to ground.  For electrical reasons, this makes it possible for multiple devices to share the same interrupt line.  Any one of the devices may pull the interrupt line low, then it's the task of the operating system to poll each of the devices using that particular line to determine which require service.

One of the issues for device assignment may quickly become apparent here.  When the OS polls each device to determine which device requires service, it typically does so via device specific drivers.  Drivers like vfio-pci don't know how to determine which device pulled the interrupt line without some extra information.  That extra information is provided by a feature introduced in the 2.3 version of the PCI spec which provides two important bits in PCI configuration space.  The first is the Interrupt Status bit of the Device Status register, which tells us when the device is signaling an interrupt.  This gives us a standard way to determine information that was previously device specific.  The second bit is the Interrupt Disable bit in the Device Command register.  This allows us to mask the interrupt at the device such that it ceases to pull the interrupt line low.
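As a rough illustration of those two bits, the sketch below operates on a raw copy of a device's configuration space. The register offsets and bit positions are the standard PCI ones (Command at 0x04 with Interrupt Disable at bit 10, Status at 0x06 with Interrupt Status at bit 3); the helper names are invented for this example and it is not how vfio-pci itself is written.

```python
import struct

PCI_COMMAND = 0x04                  # 16-bit Command register
PCI_STATUS = 0x06                   # 16-bit Status register
PCI_COMMAND_INTX_DISABLE = 1 << 10  # Interrupt Disable bit
PCI_STATUS_INTERRUPT = 1 << 3       # Interrupt Status bit


def read16(config: bytes, offset: int) -> int:
    """Read a little-endian 16-bit register from config space."""
    return struct.unpack_from("<H", config, offset)[0]


def intx_pending(config: bytes) -> bool:
    """True if the device reports that it is signaling INTx."""
    return bool(read16(config, PCI_STATUS) & PCI_STATUS_INTERRUPT)


def intx_disabled(config: bytes) -> bool:
    """True if INTx is masked at the device."""
    return bool(read16(config, PCI_COMMAND) & PCI_COMMAND_INTX_DISABLE)


def with_intx_disabled(config: bytes) -> bytes:
    """Return a copy of config space with the Interrupt Disable bit set."""
    buf = bytearray(config)
    command = read16(config, PCI_COMMAND) | PCI_COMMAND_INTX_DISABLE
    struct.pack_into("<H", buf, PCI_COMMAND, command)
    return bytes(buf)
```

On a real host the bytes could come from the device's sysfs config file, but writing the Command register of a live device is something to leave to the kernel.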

These two important features allow us to assign a device making use of INTx signalling to a guest, because we can now identify when it is our assigned device signaling the interrupt and also prevent the device from continuing to signal in the host while it is being serviced by the guest.  This latter feature means that the guest cannot saturate the host with interrupts by failing to service the interrupt.

VFIO does also have support for non-PCI-2.3 compliant devices, but it requires a much more restricted configuration.  We still need to identify when the assigned device is signaling an interrupt, which we can only do in a non-device specific way by requiring only a single device per interrupt line.  Also when this is the case, we can mask the interrupt at the system APIC rather than at the device itself.  Therefore we can achieve the same results, but we require an exclusive interrupt line for the device, which can often be an insurmountable configuration restriction.

An astute reader may note that in either case, we forward the interrupt signal to the guest with either the device or the APIC configured to mask further interrupts.  This implies that we need some sort of acknowledgement from the guest in order to unmask the device and allow subsequent interrupts.  I won't go into the details of KVM IRQFD resamplers, but suffice it to say we need a return signal from the hypervisor for this unmask.  All of this masking and unmasking adds to the interrupt latency and makes INTx less desirable from a throughput and overhead perspective when assigning a device.  Note that I say less desirable rather than undesirable, because in many cases the interrupt rate and latency requirements for the device are more than satisfied using this mechanism.  However, if the device supports a more efficient mechanism, why not use it?
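The mask, forward, and unmask cycle can be pictured with a toy model. To be clear, this is not how VFIO or KVM is implemented; the class and method names are invented, and the model only captures the ordering described above: mask when forwarding, unmask on the guest's acknowledgement, and fire again if the level-triggered line is still asserted.

```python
class IntxForwarder:
    """Toy model of forwarding a level-triggered INTx to a guest."""

    def __init__(self):
        self.line_asserted = False  # device is pulling the line low
        self.masked = False         # masked at the device or the APIC
        self.injected = 0           # interrupts forwarded to the guest

    def device_assert(self):
        self.line_asserted = True
        self._host_handler()

    def device_deassert(self):
        self.line_asserted = False

    def _host_handler(self):
        # Forward at most one interrupt, then mask so the host is not
        # flooded while the guest takes its time servicing the device.
        if self.line_asserted and not self.masked:
            self.masked = True
            self.injected += 1

    def guest_ack(self):
        # Return path from the hypervisor: unmask, and because the line
        # is level triggered, a still-asserted line fires again at once.
        self.masked = False
        self._host_handler()
```

Every forwarded interrupt costs one extra mask and one extra unmask, which is the added latency the paragraph above refers to.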

Message Signaled Interrupts (MSI) provide that more efficient mechanism.  An MSI is simply a DMA write by the device of specific message data to a specific address.  This improves two things from a virtualization perspective.  First, we have a much larger address space of interrupts, which generally means that each interrupt source will have an exclusive interrupt, removing the problem of determining the source when the interrupt is shared.  Second, message signaled interrupts are interpreted as edge-triggered, eliminating the need for masking and thus the need for an unmask path.  Therefore, to handle a device making use of MSI, we have no hardware interaction with the device upon receiving the interrupt (no device or APIC masking) and no return path from the hypervisor to re-enable subsequent interrupts.
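To make "a write of specific data to a specific address" concrete, here is a small decoder for the message format used on x86 without interrupt remapping. That platform choice is an assumption of this sketch, and the field positions follow the usual x86 layout (address 0xFEExxxxx with the destination ID in bits 19:12, vector in the low byte of the data); the function name is made up.

```python
def decode_msi_x86(address: int, data: int) -> dict:
    """Decode a 32-bit MSI address/data pair using the x86 layout."""
    if (address >> 20) != 0xFEE:
        raise ValueError("not in the x86 MSI address window 0xFEExxxxx")
    return {
        "dest_id": (address >> 12) & 0xFF,      # target APIC ID
        "vector": data & 0xFF,                  # interrupt vector
        "delivery_mode": (data >> 8) & 0x7,     # 0 = fixed
        "edge_triggered": ((data >> 15) & 1) == 0,
    }


if __name__ == "__main__":
    # Example values chosen for illustration only.
    print(decode_msi_x86(0xFEE01000, 0x4041))
```

Since the vector travels in the message itself, each source can get its own vector, which is why sharing stops being a problem.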

In most cases devices that can use MSI will be automatically configured to do so.  You can verify this by looking in /proc/interrupts on a Linux guest or looking at the device resources in Device Manager for a Windows guest.  In the Linux case the device interrupts will be listed as MSI rather than APIC.  In the Windows case a negative number for the interrupt indicates MSI while a positive number indicates standard INTx.  Another way to tell is by looking at /proc/interrupts on the host.  Both the interrupt type and VFIO name of the interrupt will indicate the signaling method being used.
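For the host-side check, a short script can pull the VFIO entries out of /proc/interrupts. This sketch assumes the interrupt names that the Linux vfio-pci driver registers (vfio-intx(...), vfio-msi[N](...), vfio-msix[N](...)); the sample text and its counts are invented for illustration.

```python
import re

# Illustrative excerpt only; the counts and IRQ numbers are made up.
SAMPLE = """\
 16:       1204          0   IO-APIC-fasteoi   vfio-intx(0000:02:00.0)
 53:      88231          0   PCI-MSI-edge      vfio-msi[0](0000:01:00.0)
"""

VFIO_NAME = re.compile(r"vfio-(?:intx|msix?)[^\s,]*")


def vfio_signaling(proc_interrupts: str) -> dict:
    """Map each VFIO interrupt name to 'INTx', 'MSI' or 'MSI-X'."""
    result = {}
    for line in proc_interrupts.splitlines():
        for name in VFIO_NAME.findall(line):
            if name.startswith("vfio-msix"):
                result[name] = "MSI-X"
            elif name.startswith("vfio-msi"):
                result[name] = "MSI"
            else:
                result[name] = "INTx"
    return result


if __name__ == "__main__":
    for name, kind in vfio_signaling(SAMPLE).items():
        print(f"{name}: {kind}")
```

On a real host, pass in the contents of /proc/interrupts instead of SAMPLE.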

To determine whether a device supports MSI, lspci on the host can be used to look at the capabilities of the device.  For example:

$ sudo lspci -v -s 1:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750] (rev a2) (prog-if 00 [VGA controller])
    Subsystem: Corp. Device 2753
    Flags: bus master, fast devsel, latency 0, IRQ 53
    Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
    Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Memory at d0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at e000 [size=128]
    Expansion ROM at f7000000 [disabled] [size=512K]
    Capabilities: [60] Power Management version 3
    Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [78] Express Legacy Endpoint, MSI 00
    Capabilities: [100] Virtual Channel
    Capabilities: [250] Latency Tolerance Reporting
    Capabilities: [258] L1 PM Substates
    Capabilities: [128] Power Budgeting <?>
    Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
    Capabilities: [900] #19
    Kernel driver in use: vfio-pci
    Kernel modules: nouveau

Here we can see the capability at [68] is an MSI capability, which is currently enabled.  A device may also report MSI-X, which is a further extension of MSI, providing some additional flexibility beyond the scope of our discussion here.  Reporting either MSI or MSI-X indicates support for Message Signaled Interrupts.
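If you want to check many devices, the same information can be scraped from lspci -v output. The sketch below is a simple regular-expression match on the capability lines, assuming the lspci format shown in the listing above; the function name is made up.

```python
import re

# Three capability lines taken from the lspci listing in this post.
SAMPLE = """\
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
"""

CAP_RE = re.compile(r"Capabilities: \[([0-9a-f]+)\] (MSI-X|MSI): Enable([+-])")


def msi_capabilities(lspci_output: str) -> list:
    """Return the MSI and MSI-X capabilities found in lspci -v output."""
    caps = []
    for match in CAP_RE.finditer(lspci_output):
        caps.append({
            "offset": int(match.group(1), 16),   # capability offset
            "type": match.group(2),              # "MSI" or "MSI-X"
            "enabled": match.group(3) == "+",    # Enable+ vs Enable-
        })
    return caps


if __name__ == "__main__":
    print(msi_capabilities(SAMPLE))
```

Note that the "Express Legacy Endpoint, MSI 00" line is deliberately not matched, since it is the PCI Express capability rather than an MSI capability.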

If you find that your device supports MSI but it's not being enabled, and your guest is Windows, you can follow the steps found here to attempt to enable it.  Please note the part about making backups, not guaranteed to work, etc.  If it doesn't work, you may find your VM unbootable and need to restore from backup.  Being an assigned device, there's also a good chance you can remove the device, undo the settings, and re-add the device.

The summary of the procedure is to identify the Device Instance Path from the Details tab of the device in Device Manager.  Run regedit and find the same path under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\.  After following down the tree using the Device Instance Path information, continue to follow down through "Device Parameters" and "Interrupt Management".  Here you will find that you either have or need to create a new key named "MessageSignaledInterruptProperties".  Within that key, find or create a DWORD value named "MSISupported".  The value should be 1 to enable MSI or 0 to disable it.
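As an alternative to clicking through regedit, the same key and value can be written out as a .reg file and imported in the guest. The sketch below only builds the file text; the Device Instance Path in the example is a placeholder, not a real device, and the function name is invented. The same backup warnings apply.

```python
ENUM_ROOT = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum"


def msi_reg_file(device_instance_path: str, enable: bool = True) -> str:
    """Build .reg file text that sets MSISupported for one device."""
    key = "\\".join([
        ENUM_ROOT,
        device_instance_path,
        "Device Parameters",
        "Interrupt Management",
        "MessageSignaledInterruptProperties",
    ])
    value = 1 if enable else 0
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{key}]\n"
        f'"MSISupported"=dword:{value:08x}\n'
    )


if __name__ == "__main__":
    # Placeholder path: copy the real one from Device Manager.
    placeholder = r"PCI\VEN_XXXX&DEV_XXXX&SUBSYS_XXXXXXXX&REV_XX\X&XXXXXXXX&X&XXXX"
    print(msi_reg_file(placeholder))
```

Passing enable=False produces the file that sets the value back to 0.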

In my case, the Windows 8.1 VM seems to work well with MSI added and enabled on the GPU and enabled on the audio function.  I can't, however, say that I see a noticeable performance difference, although given what we know from above about the overhead of the various paths, we can suspect that the hypervisor load is reduced in MSI mode.

If you're using a Linux guest and find devices that aren't using MSI, use modinfo on the kernel module to see if it has an option to turn it on.
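One way to scan a module's options is to filter the output of modinfo -p, which prints one "name:description" line per parameter. The helper below is a naive substring match, and the sample text describes a hypothetical module, so treat the result as a hint about where to look rather than a definitive answer.

```python
def msi_parameters(modinfo_p_output: str) -> list:
    """Return parameter names whose name or description mentions MSI."""
    names = []
    for line in modinfo_p_output.splitlines():
        name, _, description = line.partition(":")
        if "msi" in name.lower() or "msi" in description.lower():
            names.append(name.strip())
    return names


if __name__ == "__main__":
    # Made-up output for a hypothetical module.
    sample = (
        "enable_msi:Enable Message Signaled Interrupts (int)\n"
        "debug:Debug level (int)\n"
    )
    print(msi_parameters(sample))
```

To use it for real, feed it the text printed by modinfo -p for the module in question.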

A couple of points are beyond the scope of this post, but I'll mention them for completeness.  First, INTx on PCI Express is no longer based on a physical wire.  It's actually more like MSI, using a transaction based mechanism.  However, for compatibility the semantics of the interrupts are the same as if it was a physical wire.  Second, while INTA-INTD are the PCI standard, chipsets can actually route more interrupt lines to help spread out the interrupt load.  The Q35 QEMU model for instance has PIRQ lines A-H, and these are interleaved among devices in chipset specific ways.

Another important note is that hardware gets interrupts wrong.  A lot.  If your device doesn't work with MSI enabled, it may be because the hardware is broken and the vendor never intended MSI to be enabled.  In the case of Linux, MSI is specifically disabled for some vendors' products based on past transgressions and may or may not work on your hardware.  Good luck and please comment on successes or failures, particularly successes that result in a measurable performance improvement.  Thanks.

Monday, September 15, 2014

OVMF split image support

Gerd Hoffmann's Fedora OVMF builds have been updated to support installing the split CODE/VARS binaries.  Wherever you get your OVMF binaries, the advantage of this is that the EFI variables, ex. bootloader information, are stored separately from the executable code of the firmware, allowing the firmware to be updated without blasting the variable store.  The libvirt update mentioned the other day already supports this quite nicely.  Rather than having a loader entry with a single read-write image, we switch that to a read-only entry and add nvram storage.  The XML looks like this:

<domain type='kvm'>
  <os>
    <loader readonly='yes' type='pflash'>/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram template='/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd'/>
  </os>

Once the guest is started, a copy of the NVRAM template is made and placed under /var/lib/libvirt/qemu/nvram/${DOMAIN}_VARS.fd.  This then becomes part of the state of the VM.

On the QEMU commandline, you'll need to manually create a copy of the VARS file for each VM and specify the CODE and VARS as:

/usr/libexec/qemu-kvm ... \
    -drive if=pflash,format=raw,readonly,file=/path/to/OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=/copy/of/OVMF_VARS.fd
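The per-VM copy can be scripted. This is a minimal sketch that copies the VARS template once per VM and returns the two -drive arguments in the form shown above; the function name, the file naming scheme, and the directory layout are assumptions of the example.

```python
import shutil
from pathlib import Path


def pflash_args(code: Path, vars_template: Path, vm_name: str,
                state_dir: Path) -> list:
    """Copy the VARS template for this VM if needed and return the
    QEMU arguments for the read-only CODE and writable VARS images."""
    vm_vars = state_dir / f"{vm_name}_VARS.fd"
    if not vm_vars.exists():
        # Never overwrite an existing copy: it holds the VM's EFI variables.
        shutil.copyfile(vars_template, vm_vars)
    return [
        "-drive", f"if=pflash,format=raw,readonly,file={code}",
        "-drive", f"if=pflash,format=raw,file={vm_vars}",
    ]
```

The order matters here: the CODE image is given first and the VARS copy second, matching the command line above.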

I'm also told that virt-install and virt-manager support for OVMF is coming real soon and the interface will be similar to the XML, allowing selection of both a CODE file and a template VARS file.  The libvirt config file, /etc/libvirt/qemu.conf, also allows a default VARS template image to be specified per code image, so that the <nvram> entry gets filled in automatically based on the file used for the <loader> entry.

Finally, how do you tell whether you have a split or unified image for OVMF?  Lacking some sort of parser, apparently the best way to tell is by file size.  A unified image will be exactly 2MB while the split CODE image will be 2MB-128KB and the VARS image will be 128KB.  Unsurprisingly then, you can also create a split image with dd, taking the first 128K as VARS and the rest as CODE.
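The size check and the split are easy to express in code. The sketch below uses only the sizes given above (2MB unified, 128KB VARS at the start, the remainder as CODE); the function names are invented, and it works on bytes in memory rather than calling dd.

```python
KIB = 1024
UNIFIED_SIZE = 2 * 1024 * KIB          # 2MB unified image
VARS_SIZE = 128 * KIB                  # 128KB variable store
CODE_SIZE = UNIFIED_SIZE - VARS_SIZE   # 2MB-128KB firmware code


def classify(size: int) -> str:
    """Guess the OVMF image type from its file size alone."""
    kinds = {UNIFIED_SIZE: "unified", CODE_SIZE: "code", VARS_SIZE: "vars"}
    return kinds.get(size, "unknown")


def split_unified(image: bytes) -> tuple:
    """Split a unified image into (vars, code): first 128KB, then the rest."""
    if len(image) != UNIFIED_SIZE:
        raise ValueError("expected a 2MB unified image")
    return image[:VARS_SIZE], image[VARS_SIZE:]
```

A size that matches none of the three comes back as "unknown", which is a reminder that file size is only a heuristic.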

Good luck.

Thursday, September 11, 2014

libvirt now supports OVMF

Thanks to the work of Michal Privoznik and support of Laszlo Ersek and others, libvirt can now manage VMs using OVMF natively.  If you're on Fedora and using Gerd's OVMF RPMs, you simply need to create a copy of /usr/share/edk2.git/ovmf-x64/OVMF-pure-efi.fd for each VM (put it somewhere like /var/lib/libvirt/images/), and make it writable (support is still new and it doesn't seem to change file permissions for the VM yet).  Then, edit the domain XML to include this:

<domain type='kvm'>
  <os>
    <loader type='pflash'>/var/lib/libvirt/images/VM1-OVMF.fd</loader>
  </os>

Since the OVMF image we're using is a "unified" image, it contains both the UEFI code itself as well as variable storage space, so the above adds it as writable by the VM.  There are also ways to have a split image so you can maintain the UEFI code separate from the variables, but I'll wait for builds from Gerd that support that before I attempt to document it.
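If you would rather script the XML change than edit it by hand, something like the following could be applied to the output of virsh dumpxml. It is a sketch using Python's standard ElementTree; the function name is made up, and it relies on the loader element living inside the domain's os element, as it does in libvirt's domain XML.

```python
import xml.etree.ElementTree as ET


def set_unified_loader(domain_xml: str, image_path: str) -> str:
    """Point the domain at a writable, unified OVMF image via pflash."""
    root = ET.fromstring(domain_xml)
    os_elem = root.find("os")
    if os_elem is None:
        os_elem = ET.SubElement(root, "os")
    loader = os_elem.find("loader")
    if loader is None:
        loader = ET.SubElement(os_elem, "loader")
    loader.set("type", "pflash")
    # A unified image also holds the variable store, so it must stay writable.
    loader.attrib.pop("readonly", None)
    loader.text = image_path
    return ET.tostring(root, encoding="unicode")
```

The result can be fed back to libvirt with virsh define after reviewing it.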

With support for both the kvm=off cpu option and OVMF in libvirt, we're now able to run completely native libvirt VMs with GeForce and Radeon GPU assignment.  Support is already underway for virt-manager and virt-install of OVMF.

Also, a VM CPU selection tip: since we don't care about migration with an assigned GPU, there are few reasons left not to want to use the -cpu host option for QEMU.  To enable that through libvirt, change the CPU definition in the XML to this:

<domain type='kvm'>
  <cpu mode='host-passthrough'/>

Automatic vCPU pinning is also available:

<domain type='kvm'>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
  </cputune>
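For guests with more vCPUs, the pinning entries can be generated rather than typed. This sketch emits a cputune element, which is where libvirt expects vcpupin entries to live; the function name and the choice of host CPUs in the example are only illustrations.

```python
import xml.etree.ElementTree as ET


def pin_vcpus(host_cpus: list) -> str:
    """Build a <cputune> element pinning vCPU i to host_cpus[i]."""
    cputune = ET.Element("cputune")
    for vcpu, host_cpu in enumerate(host_cpus):
        ET.SubElement(cputune, "vcpupin",
                      vcpu=str(vcpu), cpuset=str(host_cpu))
    return ET.tostring(cputune, encoding="unicode")


if __name__ == "__main__":
    # Pin vCPUs 0 and 1 to host CPUs 0 and 1, as in the XML above.
    print(pin_vcpus([0, 1]))
```

Which host CPUs make sense depends on the host topology, so the list is left for the caller to choose.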

And yes, hugepage support is also available, see libvirt documentation for details.  Enjoy.

Monday, September 1, 2014

KVM Forum 2014

The schedule is up for KVM Forum 2014.  It looks like I'll be talking about VFIO GPU assignment with OVMF on the afternoon of Tuesday, October 14th.  If you're in or around Düsseldorf, register for the conference and come see.  There's also a talk just before mine on KvmGT that should be interesting.