Monday, September 22, 2014

VFIO interrupts and how to coax Windows guests to use MSI

Interrupts are used by devices for signaling attention.  In the case of a NIC, there might be an interrupt indicating a packet received or that a transmit queue is empty.  As with everything else, how we signal interrupts has evolved over time.

In the PCI space, we started with just four physical interrupt lines, INT{A,B,C,D}.  These are known as INTx, where x may be any one of the four lines.  The configuration space for each PCI function indicates which interrupt line is used by that function.  A common configuration is that function 0 may use INTA, while function 1 uses INTB, which helps to distribute devices evenly among the interrupt lines.  PCI bridges also incorporate a standard swizzle to remap interrupt lines between primary and secondary interfaces, so that we don't over-use some of the interrupt lines.  Each slot may also have different mappings, so INTA on one slot doesn't actually pull the same line as INTA on another slot.
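The bridge swizzle mentioned above is simple arithmetic.  A minimal sketch, assuming the conventional (device number + pin) mod 4 mapping with pins numbered INTA=0 through INTD=3 (the device number here is a hypothetical example):

```shell
# Conventional PCI bridge interrupt swizzle: the pin a secondary-bus device
# asserts maps to (device number + pin) mod 4 on the primary side.
pins=(INTA INTB INTC INTD)
dev=2   # hypothetical device number on the secondary bus
pin=0   # asserting INTA (pins numbered INTA=0 .. INTD=3)
echo "device $dev INTA routes to ${pins[$(( (dev + pin) % 4 ))]} upstream"
```

So in this sketch, INTA from device 2 appears as INTC on the primary bus, which is how the load gets spread across the four lines.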

These interrupt lines are wired active-low, meaning that when an interrupt is not being signaled, the physical wire floats to a high value (e.g. 5 volts) with low current.  If a device wants to signal an interrupt, it pulls the line to ground.  For electrical reasons, this makes it possible for multiple devices to share the same interrupt line: any one of the devices may pull the line low, and it's then the task of the operating system to poll each of the devices using that particular line to determine which require service.

One of the issues for device assignment may quickly become apparent here.  When the OS polls each device to determine which device requires service, it typically does so via device specific drivers.  Drivers like vfio-pci don't know how to determine which device pulled the interrupt line without some extra information.  That extra information is provided by a feature introduced in the 2.3 version of the PCI spec which provides two important bits in PCI configuration space.  The first is the Interrupt Status bit of the Device Status register, which tells us when the device is signaling an interrupt.  This gives us a standard way to determine information that was previously device specific.  The second bit is the Interrupt Disable bit in the Device Command register.  This allows us to mask the interrupt at the device such that it ceases to pull the interrupt line low.
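Both bits can be inspected from the host with setpci; a sketch of decoding them, with hypothetical register values (the bit positions come from the PCI spec: Interrupt Disable is bit 10 of the Command register, Interrupt Status is bit 3 of the Status register):

```shell
# Read the 16-bit Command (offset 0x04) and Status (offset 0x06) registers,
# e.g. on a real host:
#   cmd=0x$(sudo setpci -s 01:00.0 COMMAND)
#   sts=0x$(sudo setpci -s 01:00.0 STATUS)
cmd=0x0407   # hypothetical Command register value
sts=0x0018   # hypothetical Status register value

# Interrupt Disable is Command bit 10; Interrupt Status is Status bit 3
(( cmd & 0x0400 )) && echo "INTx masked (Interrupt Disable set)" \
                   || echo "INTx enabled"
(( sts & 0x0008 )) && echo "device is asserting INTx (Interrupt Status set)" \
                   || echo "device is not asserting INTx"
```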

These two important features allow us to assign a device making use of INTx signaling to a guest, because we can now identify when it is our assigned device signaling the interrupt and also prevent the device from continuing to signal in the host while it is being serviced by the guest.  This latter feature means that the guest cannot saturate the host with interrupts by failing to service the interrupt.

VFIO does also have support for non-PCI-2.3 compliant devices, but it requires a much more restricted configuration.  We still need to identify when the assigned device is signaling an interrupt, which we can only do in a non-device specific way by requiring only a single device per interrupt line.  Also when this is the case, we can mask the interrupt at the system APIC rather than at the device itself.  Therefore we can achieve the same results, but we require an exclusive interrupt line for the device, which can often be an insurmountable configuration restriction.

An astute reader may note that in either case, we forward the interrupt signal to the guest with either the device or the APIC configured to mask further interrupts.  This implies that we need some sort of acknowledgement from the guest in order to unmask the device and allow subsequent interrupts.  I won't go into the details of KVM IRQFD resamplers, but suffice to say we need a return signal from the hypervisor for this unmask.  All of this masking and unmasking adds to the interrupt latency and makes INTx less desirable from a throughput and overhead perspective when assigning a device.  Note that I say less desirable rather than undesirable, because in many cases the interrupt rate and latency requirements of the device are more than satisfied by this mechanism.  However, if the device supports a more efficient mechanism, why not use it?

Message Signaled Interrupts (MSI) provide that more efficient mechanism.  An MSI is simply a DMA write by the device of specific message data to a specific address.  This improves two things from a virtualization perspective.  First, we have a much larger address space of interrupts, which generally means that each interrupt source will have an exclusive interrupt, removing the problem of determining the source when an interrupt line is shared.  Second, message signaled interrupts are interpreted as edge-triggered, eliminating the need for masking and thus the need for an unmask path.  Therefore, to handle a device making use of MSI, we have no hardware interaction with the device upon receiving the interrupt (no device or APIC masking) and no return path from the hypervisor to re-enable subsequent interrupts.
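On x86, that address/data pair is visible in lspci -vv output once MSI is enabled, and the fields can be picked apart by hand.  A sketch using hypothetical values (the layout, with the destination APIC ID in address bits 19:12 and the vector in the low byte of the data, is the x86 MSI message format):

```shell
# Hypothetical Message Address/Data as shown by lspci -vv when MSI is
# enabled; on x86 the address always falls in the 0xFEExxxxx range.
addr=0xfee01000   # hypothetical Message Address
data=0x4021       # hypothetical Message Data
echo "destination APIC ID: $(( (addr >> 12) & 0xff ))"   # address bits 19:12
echo "interrupt vector:    $(( data & 0xff ))"           # data bits 7:0
```

The device simply writes that data word to that address; the interrupt controller does the rest, with no wire to share and no level to deassert.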

In most cases devices that can use MSI interrupts will be automatically configured to do so.  You can verify this by looking in /proc/interrupts on a Linux guest or looking at the device resources in Device Manager for a Windows guest.  In the Linux case the device interrupts will be listed as MSI rather than APIC, in the Windows case a negative number for the interrupt indicates MSI while a positive number indicates standard INTx.  Another way to tell is by looking at /proc/interrupts on the host.  Both the interrupt type and VFIO name of the interrupt will indicate the signaling method being used.
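A sketch of what that host-side check looks like — the /proc/interrupts lines below are illustrative samples (the interrupt numbers and counts are invented, but the vfio-msi/vfio-intx naming is what the host actually shows):

```shell
# Illustrative /proc/interrupts entries; on a real host simply run:
#   grep vfio /proc/interrupts
sample=' 34:        0   109753   IR-PCI-MSI-edge      vfio-msi[0](0000:01:00.0)
 45:     2210        0   IR-IO-APIC-fasteoi   vfio-intx(0000:01:00.1)'

# The interrupt name encodes the signaling method in use
echo "$sample" | grep -o 'vfio-[a-z]*'
```

Here function 0 of the device is running in MSI mode while function 1 is still using INTx.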

To determine whether a device supports MSI, lspci on the host can be used to look at the capabilities of the device.  For example:

$ sudo lspci -v -s 1:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750] (rev a2) (prog-if 00 [VGA controller])
Subsystem: Corp. Device 2753
Flags: bus master, fast devsel, latency 0, IRQ 53
Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
Expansion ROM at f7000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Kernel driver in use: vfio-pci
Kernel modules: nouveau

Here we can see the capability at [68] is an MSI capability, which is currently enabled.  A device may also report MSI-X, which is a further extension of MSI, providing additional flexibility beyond the scope of our discussion here.  Reporting either MSI or MSI-X indicates support for Message Signaled Interrupts.

If you find that your device supports MSI but it's not being enabled, and your guest is Windows, you can follow the steps found here to attempt to enable it.  Please note the part about making backups, not guaranteed to work, etc.  If it doesn't work, you may find your VM unbootable and need to restore from backup.  Being an assigned device, there's also a good chance you can remove the device, undo the settings, and re-add the device.

The summary of the procedure is to identify the Device Instance Path from the Details tab of the device in Device Manager.  Run regedit and find the same path under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\.  After following down the tree using the Device Instance Path information, continue down through "Device Parameters" and "Interrupt Management".  Here you will find that you either have or need to create a new key named "MessageSignaledInterruptProperties".  Within that key, find or create a DWORD value named "MSISupported".  Set the value to 1 to enable MSI or 0 to disable it.
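The same edit can be captured in a .reg file for import.  This is a sketch only: the Device Instance Path below is a placeholder, and you must substitute the exact path shown in Device Manager for your device (and remember the backup warnings above before importing anything):

```
Windows Registry Editor Version 5.00

; Placeholder path - replace everything after Enum\ with your device's
; Device Instance Path from Device Manager (Details tab)
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\VEN_XXXX&DEV_XXXX\X&XXXXXXXX&X&XXXX\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties]
"MSISupported"=dword:00000001
```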

In my case, the Windows 8.1 VM seems to work well with MSI added and enabled on the GPU and enabled on the audio function.  I can't, however, say that I see a noticeable performance difference, although given what we know from above about the overhead of the various paths, we can suspect that hypervisor load is reduced in MSI mode.

If you're using a Linux guest and find devices that aren't using MSI, use modinfo on the kernel module to see if it has an option to turn it on.
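For example, snd-hda-intel (the driver that binds to a passed-through GPU's audio function in a Linux guest) really does expose such an option.  A sketch with the modinfo output inlined so the filtering step is self-contained (the second parameter line is just for shape):

```shell
# Illustrative 'modinfo -p' output; on a real guest run:
#   modinfo -p snd-hda-intel
params='enable_msi:Enable Message Signaled Interrupt (MSI) (bint)
power_save:Automatic power-saving period in secs (int)'

# Pick out any parameter mentioning MSI
echo "$params" | grep -i msi | cut -d: -f1
```

Having found the option, it can be enabled persistently with a line like `options snd-hda-intel enable_msi=1` in a file under /etc/modprobe.d/ in the guest.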

A couple of points are beyond the scope of this post, but I'll mention them for completeness.  First, INTx on PCI Express is no longer based on a physical wire; it's actually more like MSI, using a transaction-based mechanism.  However, for compatibility, the semantics of the interrupts are the same as if it were a physical wire.  Second, while INTA-INTD are the PCI standard, chipsets can actually route more interrupt lines to help spread out the interrupt load.  The Q35 QEMU model, for instance, has PIRQ lines A-H, and these are interleaved among devices in chipset-specific ways.

Another important note is that hardware gets interrupts wrong.  A lot.  If your device doesn't work with MSI enabled, it may be because the hardware is broken and the vendor never intended MSI to be enabled.  In the case of Linux, MSI is specifically disabled for some vendors' products based on past transgressions and may or may not work on your hardware.  Good luck, and please comment on successes or failures, particularly successes that result in a measurable performance improvement.  Thanks.


  1. Does it make a difference, if it's i440FX or Q35?
    I'm passing GPU along with its HD Audio + PCH's HD Audio. While GPU is MSI enabled, I can't get HD Audio codecs to get it (both are MSI enabled on host).

    1. I've only tried on 440FX and only with Nvidia GPU+audio. I don't see why Q35 would make a difference, but I'm really not sure what sort of success level to expect with these sorts of regedit backdoors on Windows.

  2. choppy sound disappeared in win 8 vm using hdmi audio, thanks. on a Z97 extreme 4, I was able to enable MSI on the GPU + GPU sound card, but not the passthrough usb controller.

  3. Excellent post! It seems the crackles mostly gone. I spend last night trying to tweak buffer size and frequency, without luck. The IRQ MSI fix seem like improved things a lot. Need to test it a bit longer though.

  4. For me `MSI: Enable+` changes to `MSI: Enable-` when I bind the device to `vfio-pci`. I don't even run QEMU yet. And changes back to `Enable+` if I unbind it. I can't figure out how to keep it enabled.

    1. vfio-pci only enables MSI when the guest does, a device sitting idle and unused bound to vfio-pci should never have MSI enabled.

  5. Couldn't use HDMI audio on VM windows 8.1, with gtx 970 passed through. Had cracking sound, and crashing audio settings.

    Seems to work well with MSI , but need further tests.
    Thanks a lot!

  6. Thanks a lot. Got rid of audio crackles/popping and random frame drops while gaming (for the most part).

  7. Thanks for this blog Alex, success using a new Asus Strix GTX 970...note I had to create the registry entries for both the video and audio devices of the card...just doing one didn't work.

    Thanks to reading your blogs I found that an FXO/PSTN card I have isn't INTx compliant so I need to provide it a full interrupt, the GTX 970 used the same interrupt when not using MSI so this fix not only improves GFX performance but also allows me to virtualise my PBX :-)

    Thanks for all your hard work!

  8. Is there anyway to force/set the initial interrupt used by vfio-intx before its switches to vfio-msi? Trying to work around one device that doesn't support PCI2.3 INTx disable.

    1. INTx routing is not typically adjustable at the OS level. You can try moving the card to a different slot or looking in the BIOS to see if your motherboard offers any interrupt configuration, otherwise you'll need to disable or unbind the conflicting devices to get an exclusive interrupt.

  9. fixed my issues! couldnt for the life of me figure out why windows crawled so slow when my GPU was passed through, but worked fine otherwise. interrupt storms, excessive device polling... sounds right


Comments are not a support forum. For help with problems, please try the vfio-users mailing list (