Monday, September 22, 2014

VFIO interrupts and how to coax Windows guests to use MSI

Interrupts are used by devices for signaling attention.  In the case of a NIC, there might be an interrupt indicating a packet received or that a transmit queue is empty.  As with everything else, how we signal interrupts has evolved over time.

In the PCI space, we started with just four physical interrupt lines, INT{A,B,C,D}.  These are known as INTx, where x may be any one of the four lines.  The configuration space for each PCI function indicates which interrupt line is used by that function.  A common configuration is that function 0 may use INTA, while function 1 uses INTB, which helps to distribute devices evenly among the interrupt lines.  PCI bridges also incorporate a standard swizzle to remap interrupt lines between primary and secondary interfaces, so that we don't over-use some of the interrupt lines.  Each slot may also have different mappings, so INTA on one slot doesn't actually pull the same line as INTA on another slot.

These interrupt lines are wired to be active-low, meaning that when an interrupt is not being signaled, the physical wire floats to a high value (ex. 5 volts), with a low current.  If a device wants to signal an interrupt, it pulls the line to ground.  For electrical reasons, this makes it possible for multiple devices to share the same interrupt line.  Any one of the devices may pull the interrupt line low, then it's the task of the operating system to poll each of the devices using that particular line to determine which require service.

One of the issues for device assignment may quickly become apparent here.  When the OS polls each device to determine which device requires service, it typically does so via device specific drivers.  Drivers like vfio-pci don't know how to determine which device pulled the interrupt line without some extra information.  That extra information is provided by a feature introduced in the 2.3 version of the PCI spec which provides two important bits in PCI configuration space.  The first is the Interrupt Status bit of the Device Status register, which tells us when the device is signaling an interrupt.  This gives us a standard way to determine information that was previously device specific.  The second bit is the Interrupt Disable bit in the Device Command register.  This allows us to mask the interrupt at the device such that it ceases to pull the interrupt line low.

These two important features allow us to assign a device making use of INTx signaling to a guest, because we can now identify when our assigned device is signaling the interrupt and also prevent the device from continuing to signal in the host while it is being serviced by the guest.  This latter feature means that the guest cannot saturate the host with interrupts by failing to service the interrupt.
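These bits are visible from the host with setpci, which can be handy for seeing what vfio-pci is doing under the covers.  A minimal sketch (the device address is just an example, and vfio-pci manages these bits itself, so there's normally no reason to touch them by hand):

# The Device Command register is at config offset 0x04; bit 10 (0x400) is Interrupt Disable
sudo setpci -s 01:00.0 COMMAND
# The Device Status register is at config offset 0x06; bit 3 (0x8) is Interrupt Status
sudo setpci -s 01:00.0 STATUS
# Setting Interrupt Disable by hand uses setpci's value:mask form
sudo setpci -s 01:00.0 COMMAND=0400:0400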

VFIO also supports non-PCI-2.3-compliant devices, but it requires a much more restricted configuration.  We still need to identify when the assigned device is signaling an interrupt, which we can only do in a non-device-specific way by requiring a single device per interrupt line.  When this is the case, we can also mask the interrupt at the system APIC rather than at the device itself.  We can therefore achieve the same results, but we require an exclusive interrupt line for the device, which can often be an insurmountable configuration restriction.

An astute reader may note that in either case, we forward the interrupt signal to the guest with either the device or the APIC configured to mask further interrupts.  This implies that we need some sort of acknowledgement from the guest in order to unmask the device and allow subsequent interrupts.  I won't go into the details of KVM IRQFD resamplers, but suffice to say we need a return signal from the hypervisor for this unmask.  All of this masking and unmasking adds to the interrupt latency and makes INTx less desirable from a throughput and overhead perspective when assigning a device.  Note that I say less desirable rather than undesirable, because in many cases the interrupt rate and latency requirements for the device are more than satisfied using this mechanism.  However, if the device supports a more efficient mechanism, why not use it?

Message Signaled Interrupts (MSI) provide that more efficient mechanism.  An MSI is simply a DMA write by the device of a specific message data value to a specific address.  This improves two things from a virtualization perspective.  First, we have a much larger interrupt address space, which generally means that each interrupt source will have an exclusive interrupt, removing the problem of determining the source when the interrupt is shared.  Second, message signaled interrupts are interpreted as edge-triggered, eliminating the need for masking and thus the need for an unmask path.  Therefore, to handle a device making use of MSI, we have no hardware interaction with the device upon receiving the interrupt (no device or APIC masking) and no return path from the hypervisor to re-enable subsequent interrupts.

In most cases, devices that can use MSI interrupts will be automatically configured to do so.  You can verify this by looking in /proc/interrupts on a Linux guest or at the device resources in Device Manager for a Windows guest.  In the Linux case, the device interrupts will be listed as PCI-MSI rather than IO-APIC.  In the Windows case, a negative number for the interrupt resource indicates MSI, while a positive number indicates standard INTx.  Another way to tell is by looking at /proc/interrupts on the host, where both the interrupt type and the VFIO name of the interrupt will indicate the signaling method being used.
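For example, from the host:

$ grep vfio /proc/interrupts

vfio-pci names its interrupt handlers after the signaling mode and device, so expect entries along the lines of vfio-intx(0000:01:00.0) when INTx is in use versus vfio-msi[0](0000:01:00.0) or vfio-msix[0](0000:01:00.0) for MSI/MSI-X (the address here is an example).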

To determine whether a device supports MSI, lspci on the host can be used to look at the capabilities of the device.  For example:

$ sudo lspci -v -s 1:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750] (rev a2) (prog-if 00 [VGA controller])
Subsystem: eVga.com. Corp. Device 2753
Flags: bus master, fast devsel, latency 0, IRQ 53
Memory at f6000000 (32-bit, non-prefetchable) [size=16M]
Memory at c0000000 (64-bit, prefetchable) [size=256M]
Memory at d0000000 (64-bit, prefetchable) [size=32M]
I/O ports at e000 [size=128]
Expansion ROM at f7000000 [disabled] [size=512K]
Capabilities: [60] Power Management version 3
Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [78] Express Legacy Endpoint, MSI 00
Capabilities: [100] Virtual Channel
Capabilities: [250] Latency Tolerance Reporting
Capabilities: [258] L1 PM Substates
Capabilities: [128] Power Budgeting <?>
Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Capabilities: [900] #19
Kernel driver in use: vfio-pci
Kernel modules: nouveau

Here we can see that the capability at [68] is an MSI capability, which is currently enabled (Enable+).  The device may also report MSI-X, which is a further extension of MSI providing some additional flexibility beyond the scope of our discussion here.  Reporting either MSI or MSI-X indicates support for Message Signaled Interrupts.

If you find that your device supports MSI but it's not being enabled, and your guest is Windows, you can follow the steps found here to attempt to enable it.  Please note the part about making backups, not guaranteed to work, etc.  If it doesn't work, you may find your VM unbootable and need to restore from backup.  Being an assigned device, there's also a good chance you can remove the device, undo the settings, and re-add the device.

The summary of the procedure is to identify the Device Instance Path from the Details tab of the device in Device Manager.  Run regedit and find the same path under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\.  After following down the tree using the Device Instance Path information, continue down through "Device Parameters" and "Interrupt Management".  Here you will find that you either have or need to create a new key named "MessageSignaledInterruptProperties".  Within that key, find or create a DWORD value named "MSISupported".  The value should be 1 to enable MSI or 0 to disable it.
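For reference, the same change can be sketched as a single command from an administrator prompt in the guest, where <Device Instance Path> is a placeholder for the path identified above (a hypothetical example; make your backup first):

reg add "HKLM\SYSTEM\CurrentControlSet\Enum\<Device Instance Path>\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties" /v MSISupported /t REG_DWORD /d 1 /f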

In my case, the Windows 8.1 VM seems to work well with MSI added and enabled on the GPU and enabled on the audio function.  I can't say that I see a noticeable performance difference, although given what we know from above about the overhead of the various paths, we can suspect that the hypervisor load is reduced in MSI mode.

If you're using a Linux guest and find devices that aren't using MSI, use modinfo on the kernel module to see if it has an option to turn it on.
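For example, assuming the device in question is an Intel HDA audio function, whose snd-hda-intel driver exposes an enable_msi parameter:

# List the module parameters and look for anything MSI related
modinfo -p snd_hda_intel | grep -i msi
# If such a parameter exists, enable it persistently in the guest
echo "options snd_hda_intel enable_msi=1" >> /etc/modprobe.d/local.conf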

A couple of points are beyond the scope of this post, but I'll mention them for completeness.  First, INTx on PCI Express is no longer based on a physical wire.  It's actually more like MSI, using a transaction-based mechanism; however, for compatibility, the semantics of the interrupts are the same as if it were a physical wire.  Second, while INTA-INTD are the PCI standard, chipsets can actually route more interrupt lines to help spread out the interrupt load.  The Q35 QEMU model, for instance, has PIRQ lines A-H, and these are interleaved among devices in chipset-specific ways.

Another important note is that hardware gets interrupts wrong.  A lot.  If your device doesn't work with MSI enabled, it may be because the hardware is broken and the vendor never intended MSI to be enabled.  In the case of Linux, MSI is specifically disabled for some vendors' products based on past transgressions and may or may not work on your hardware.  Good luck, and please comment on successes or failures, particularly successes that result in a measurable performance improvement.  Thanks.

Monday, September 15, 2014

OVMF split image support

Gerd Hoffmann's Fedora OVMF builds have been updated to support installing the split CODE/VARS binaries.  Wherever you get your OVMF binaries, the advantage of the split layout is that the EFI variables, ex. bootloader information, are stored separately from the executable code of the firmware, allowing the code to be updated without blasting the variable store.  The libvirt update mentioned the other day already supports this quite nicely.  Rather than having a loader entry with a single read-write image, we switch that to a read-only entry and add nvram storage.  The XML looks like this:

<domain type='kvm'>
  ...
  <os>
    <loader readonly='yes' type='pflash'>/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram template='/usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd'/>
    ...
  </os>
</domain>

Once the guest is started, a copy of the NVRAM template is made and placed under /var/lib/libvirt/qemu/nvram/$DOMAIN_VARS.fd.  This then becomes part of the state of the VM.

On the QEMU commandline, you'll need to manually create a copy of the VARS file for each VM and specify the CODE and VARS as:

/usr/libexec/qemu-kvm ... \
    -drive if=pflash,format=raw,readonly,file=/path/to/OVMF_CODE.fd \
    -drive if=pflash,format=raw,file=/copy/of/OVMF_VARS.fd
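Creating that copy is a simple cp, ex:

cp /usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd /copy/of/OVMF_VARS.fd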

I'm also told that virt-install and virt-manager support for OVMF is coming real soon, and the interface will be similar to the XML, allowing selection of both a CODE image and a template VARS file.  The libvirt config file, /etc/libvirt/qemu.conf, also allows a default VARS template image to be specified per code image, so that the <nvram> entry gets filled in automatically based on the file used for the <loader> entry.

Finally, how do you tell whether you have a split or unified image for OVMF?  Lacking some sort of parser, apparently the best way to tell is by file size.  A unified image will be exactly 2MB while the split CODE image will be 2MB-128KB and the VARS image will be 128KB.  Unsurprisingly then, you can also create a split image with dd, taking the first 128K as VARS and the rest as CODE.
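For example, assuming a unified image named OVMF.fd and the sizes above:

$ dd if=OVMF.fd of=OVMF_VARS.fd bs=1K count=128
$ dd if=OVMF.fd of=OVMF_CODE.fd bs=1K skip=128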

Good luck.

Thursday, September 11, 2014

libvirt now supports OVMF

Thanks to the work of Michal Privoznik and support of Laszlo Ersek and others, libvirt can now manage VMs using OVMF natively.  If you're on Fedora and using Gerd's OVMF RPMs, you simply need to create a copy of /usr/share/edk2.git/ovmf-x64/OVMF-pure-efi.fd for each VM (put it somewhere like /var/lib/libvirt/images/), and make it writable (support is still new and it doesn't seem to change file permissions for the VM yet).  Then, edit the domain XML to include this:

<domain type='kvm'>
  ...
  <os>
    ...
    <loader type='pflash'>/var/lib/libvirt/images/VM1-OVMF.fd</loader>
  </os>
</domain>

Since the OVMF image we're using is a "unified" image, it contains both the UEFI code itself as well as variable storage space, so the above adds it as writable by the VM.  There are also ways to have a split image so you can maintain the UEFI code separate from the variables, but I'll wait for builds from Gerd that support that before I attempt to document it.

With support for both the kvm=off cpu option and OVMF in libvirt, we're now able to run completely native libvirt VMs with GeForce and Radeon GPU assignment.  Support is already underway for virt-manager and virt-install of OVMF.

Also, a VM CPU selection tip: since we don't care about migration with an assigned GPU, there are few reasons left not to use the -cpu host option for QEMU.  To enable that through libvirt, change the CPU definition in the XML to this:

<domain type='kvm'>
  ...
  <cpu mode='host-passthrough'/>
  ...
</domain>

vCPU pinning is also available:

<domain type='kvm'>
  ...
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
  </cputune>
  ...
</domain>

And yes, hugepage support is also available; see the libvirt documentation for details.
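For reference, a minimal sketch of that XML (hugepages must also be allocated on the host):

<domain type='kvm'>
  ...
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  ...
</domain>

Enjoy.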

Monday, September 1, 2014

KVM Forum 2014

The schedule is up for KVM Forum 2014.  It looks like I'll be talking about VFIO GPU assignment with OVMF on the afternoon of Tuesday, October 14th.  If you're in or around Düsseldorf, register for the conference and come see.  There's also a talk just before mine on KvmGT that should be interesting.

Sunday, August 31, 2014

Does my graphics card ROM support EFI?

If you want to try legacy-free, OVMF-based GPU assignment, it might be a good idea to start by testing whether your graphics card has EFI support in its PCI option ROM.  I've written a small program to parse the ROM and report some basic info about the contents.  Here's example output:

$ ./rom-parser GT635.rom 
Valid ROM signature found @0h, PCIR offset 190h
 PCIR: type 0, vendor: 10de, device: 1280, class: 030000
 PCIR: revision 0, vendor revision: 1
Valid ROM signature found @f400h, PCIR offset 1ch
 PCIR: type 3, vendor: 10de, device: 1280, class: 030000
 PCIR: revision 3, vendor revision: 0
  EFI: Signature Valid
 Last image

This is what we typically expect to see: there are two headers, the first of type 0, a standard PC BIOS ROM, and the second of type 3, an EFI ROM.  If you don't have EFI support in the ROM, the OVMF solution will not work for you.  Newer graphics cards will hopefully all have EFI support.

To get this program:

$ git clone https://github.com/awilliam/rom-parser
$ cd rom-parser
$ make

You'll need to copy the ROM to a file first; the program does not support enabling the ROM through pci-sysfs.  To do this from the host:

# cd /sys/bus/pci/devices/0000:01:00.0/
# echo 1 > rom
# cat rom > /tmp/image.rom
# echo 0 > rom

If you get a zero-sized file, look for an error in dmesg. The ROM may only be readable initially after boot, before any drivers have bound to it. Use the pci-stub.ids= boot option to attempt to keep the device in a pristine, unused state.

Thursday, August 28, 2014

Upstream updates for August 28th 2014

qemu.git now includes the MTRR fixes that eliminate the long delay in guest reboot when using OVMF with an assigned device on Intel hardware that does not support IOMMU snoop control.

Wednesday, August 27, 2014

Fixes for Linux Radeon with 440FX guests

The DRM and Radeon drivers in Linux assume that there's always a parent device to the GPU.  We can break this assumption easily with either the 440FX or Q35 QEMU machine models by attaching the GPU to the root bus.  This has been one of the problems drawing users to the more complicated Q35 model, which more accurately reflects the host hardware.  We can also fix the driver to avoid such assumptions.

Tuesday, August 26, 2014

Upstream updates for August 26th 2014

A couple updates relevant to Nvidia GeForce assignment:

QEMU

fe08275d is now in qemu.git, decoupling the primary Nvidia GPU device quirk from the x-vga=on option.  This means that an Nvidia GPU assigned to a legacy-free OVMF VM will now enable this quirk automatically.

libvirt

d0711642 is now in libvirt.git enabling libvirt support for the kvm=off QEMU cpu option.  To enable this in your XML, add this to your VM definition:

<domain type='kvm'...>
  <features>
    <kvm>
      <hidden state='on'/>
    </kvm>
  </features>
  ...
</domain>

VFIO+VGA FAQ

Question 1:

I get the following error when attempting to start the guest:
vfio: error, group $GROUP is not viable, please ensure all devices within the iommu_group are bound to their vfio bus driver.
Answer:

There are more devices in the IOMMU group than you're assigning; they all need to be bound to the vfio bus driver (vfio-pci) or pci-stub for the group to be viable.  See my previous post about IOMMU groups for more information.  To reduce the size of the IOMMU group, install the device into a different slot, try a platform that has better isolation support, or (at your own risk) bypass ACS using the ACS override patch.

Question 2: 

I've applied the ACS override patch, but it doesn't work.  The IOMMU group is the same regardless of the patch.

Answer:

The ACS override patch needs to be enabled with kernel command line options.  The patch file adds the following documentation:


pcie_acs_override =
        [PCIE] Override missing PCIe ACS support for:
    downstream
        All downstream ports - full ACS capabilities
    multifunction
        All multifunction devices - multifunction ACS subset
    id:nnnn:nnnn
        Specific device - full ACS capabilities
        Specified as vid:did (vendor/device ID) in hex

The option pcie_acs_override=downstream is usually sufficient to split IOMMU grouping caused by lack of ACS at a PCIe root port.  Also see my post discussing IOMMU groups, ACS, and why use of this patch is potentially dangerous.

Question 3:

I have Intel host graphics; when I start the VM I don't get any output on the assigned VGA monitor and my host graphics are corrupted.  I also see errors in dmesg indicating unexpected drm interrupts.

Answer:

You're doing VGA assignment with IGD and have failed to apply or enable the i915 VGA arbiter patch.  The patch needs to be enabled with i915.enable_hd_vgaarb=1 on the kernel commandline.  See also my previous post about VGA arbitration and my previous post about using OVMF as an alternative to VGA assignment.

Question 4:

I have non-Intel host graphics and have a problem similar to Question 3.

Answer:

You need the other VGA arbiter patch.  This one is simply a bug in the VGA arbiter logic.  There are no kernel command line options to enable it.

Question 5:

I have Intel host graphics; I applied and enabled the i915 patch, and now I don't have DRI support on the host.  How can I fix this?

Answer:

See my previous post about VGA arbitration to understand why this happens.  This is a known side-effect of enabling VGA arbitration on the i915 driver.  The only solution is to use a host graphics device that can properly opt-out of VGA arbitration or avoid VGA altogether by using a legacy-free guest.

Question 6:

How can I prevent host drivers from attaching to my assigned devices?

Answer:

The easiest option is to use the pci-stub.ids= option on the kernel commandline.  This parameter takes a comma separated list of PCI vendor:device IDs (found via lspci -n) for devices to be claimed by pci-stub during boot.  Note that if vfio-pci is built statically into the kernel, vfio-pci.ids= can be used instead.  There is currently no way to select only a single device if there are multiple matches for the vendor:device ID.
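For example, assuming a GPU at 1:00.0 with an audio function at 1:00.1, first find the IDs:

$ lspci -n -s 1:00.

Then add the reported vendor:device pairs to the kernel commandline (the IDs below are those of my GT635 and its audio function; substitute your own):

pci-stub.ids=10de:1280,10de:0e0f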

Question 7:

Do I need the NoSnoop patch?

Answer:

No, it was deprecated long ago.

Question 8:

Do I need vfio_iommu_type1.allow_unsafe_interrupts=1?

Answer:

Probably not.  Try vfio-based device assignment without it; if it fails, look in dmesg for this:
No interrupt remapping support.  Use the module param "allow_unsafe_interrupts" to enable VFIO IOMMU support on this platform
If, and only if, you see that error message do you need the module option.  Also note that this means you opt-in to running vfio device assignment on a platform that does not protect against MSI-based interrupt injection attacks by guests.  Only trusted guests should be run in this configuration.  (Actually, I just wish this was a frequently asked question; common practice seems to be to blindly use the option without question.)

Question 9:

I use the nvidia driver in the host.  When I start my VM nothing happens.  What's wrong?

Answer:

The nvidia driver locks the VGA arbiter and does not release it, causing the VM to stop on its first access to VGA resources.  If this is not yet fixed in the nvidia driver release, user-contributed patches can be found to avoid this problem.

Question 10:

I'm assigning an Nvidia card to a Windows guest and get a Code 43 error in device manager.

Answer:

The Nvidia driver, starting with 337.88, identifies the hypervisor and disables the driver when KVM is found.  Nvidia claims this is an unintentional bug, but has no plans to fix it.  To work around the problem, we can hide the hypervisor by adding kvm=off to the list of cpu options provided (QEMU 2.1+ required).  libvirt support for this option is currently upstream.

Note that -cpu kvm=off is not a valid incantation of the cpu parameter; a CPU model such as host or SandyBridge must also be provided, ex: -cpu host,kvm=off.

Update: The above workaround is sufficient for drivers 337.88 and 340.52.  With 344.11 and presumably later, the Hyper-V CPUID extensions supported by KVM also trigger the Code 43 error.  Disabling these extensions appears to be sufficient to allow the 344.11 driver to work.  This includes all of the hv_* options to -cpu.  In libvirt, this includes:

    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
    </hyperv>

and

  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
  </clock>

Unfortunately removing these options will impose a performance penalty as these paravirtual interfaces are designed to improve the efficiency of virtual machines.

Monday, August 25, 2014

Primary graphics assignment without VGA

We really have a love-hate relationship with VGA when talking about graphics assignment.  We love that it initializes our cards and provides a standard interface, but we hate all the baggage that it brings along.  There is however an alternative emerging, UEFI by way of OVMF.  UEFI is a legacy-free firmware for PCs (among other things) that aims to replace the BIOS.  Let me repeat that, "legacy-free".  It doesn't get any more legacy than VGA.

So how do we get primary graphics without VGA?  Well, assuming your graphics card isn't too terribly old, it probably already contains support for both UEFI and VBIOS in the ROM, and OVMF will use that newer entry point to initialize the card.  There are, however, some additional restrictions in going this route.  First, the guest operating system needs to support UEFI.  This means a relatively recent version of Linux, Windows 8+, or some of the newer Windows Server versions.  A gaming platform is often the target for "enthusiast" use of GPU assignment, so Windows 8/8.1 is probably a good target if you can bear the user interface long enough to start a game.  AFAIK, Windows 7 does not support UEFI natively, requiring the CSM (Compatibility Support Module), which I believe defeats the purpose of using UEFI.  If one of these guests does not meet your needs, turn away now.

Next up, OVMF doesn't (yet) support the Q35 chipset model.  In a previous post I showed Windows 7 happily running on a 440FX QEMU machine with both GeForce and Radeon graphics.  The same is true for Windows 8/8.1 here.  Linux isn't so happy about this though.  The radeon driver in particular will oops while blindly looking for a downstream port above the graphics card.  You may or may not have better luck with Nvidia; neither nouveau nor nvidia plays nice with my GT635.  The radeon driver problem may be fixable without Q35, but it needs further investigation.  fglrx is untested.  Therefore, if Windows 8/8.1 is an acceptable guest or you're willing to help test or make Linux guests work, let's move on; otherwise turn back now.

Ok, you're still reading, let's get started.  First you need an OVMF binary.  You can build this from source using the TianoCore EDK2 tree, but it is a massive pain.  Therefore, I recommend using a pre-built binary, like the one Gerd Hoffmann provides.  With Gerd's repo setup (or one appropriate to your distribution), you can install the edk2.git-ovmf-x64 package, which gives us the OVMF-pure-efi.fd OVMF image.

Next create a basic libvirt VM using virt-manager or your favorite setup tool.  We'll need to edit the domain XML before we install, so just point it to a pre-existing (empty) image or whatever gets it to the point that you can have the VM saved without installing it yet.  Also, don't assign any devices just yet.  Once there, edit the domain XML with virsh edit.  To start, we need to add a tag on the first line to make libvirt accept QEMU commandline options.  libvirt does not yet have support for natively handling the OVMF image, but it is under active development upstream, so these instructions may change quickly.  Make the first line of the XML look like this:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

The xmlns specification is the part that needs to be added.  Next we need to add OVMF, just before the </domain> close at the end of the file, add something like this:

  <qemu:commandline>
    <qemu:arg value='-drive'/>
    <qemu:arg value='if=pflash,format=raw,readonly,file=/usr/share/edk2.git/ovmf-x64/OVMF-pure-efi.fd'/>
  </qemu:commandline>

Adjust the path as necessary if you're using a different installation.  If you're using selinux, you may wish to copy this file to /usr/share/qemu/ and run restorecon on it to set up permissions for QEMU to use it.

Save the XML and you should now be able to start the VM with OVMF.  From here, we can do the rest of the VM management using virt-manager.  You can add the GPU and audio as a standard PCI assigned device.  If you remove the "Graphics" (ie. VNC/Spice) and "Video" (ie. VGA/QXL/Cirrus) devices from the VM, the assigned GPU will be the primary display.

If you are not assigning the audio function or otherwise need to use the host audio for the guest, I recommend using QXL+Spice.  When the guest is running you can turn the QXL display off and use the connection only for sound.

UEFI purists will also note that I'm not providing a second flash drive for EFI variable storage.  In simple configurations this is unnecessary, but if you find that you've lost your boot options, this is the reason why.

In the video below I'm running two separate Windows 8.1 VMs using OVMF and managed through libvirt.  The one on the left is assigned a GeForce GT635, the one on the right a Radeon HD8570.  The host processor is an i5-3470T (dual-core + threads) and each VM is given 2 vCPUs, exposed as a single core with 2 threads.

Astute viewers will note that I'm using the version of the Nvidia driver which requires the kvm=off QEMU options.  There are two ways around this currently.  The first is to add additional qemu:commandline options:

<qemu:arg value='-cpu'/>
<qemu:arg value='host,hv_time,kvm=off'/>

We can do this because QEMU doesn't explode at the second instance of -cpu on the commandline, but it's not an ideal option.  The preferred method, and what I'm using, is support that is just going into libvirt which creates a new feature element for this purpose.  The syntax there is:

<features>
  <kvm>
    <hidden state='on'/>
  </kvm>
</features>

This will add kvm=off to the set of parameters that libvirt uses.

Also note that while Radeon doesn't seem to need any of the device quirks previously enabled with the x-vga=on option, Nvidia still does.  This commit, which will soon be available in qemu.git, enables the necessary quirk anytime an Nvidia VGA device is assigned.

To summarize: for Radeon, no non-upstream patches are required.  For Nvidia, libvirt will soon be updated for the hidden feature above (until then, use the additional commandline option), and QEMU will soon be updated to always quirk Nvidia VGA devices (until then, the referenced commit needs to be applied).  Both cases currently need the qemu:commandline OVMF support, which should also be in libvirt soon.

In the video below, I'm running a stock Fedora 20 kernel.  The IGD device is used by the host with full DRI support (as noted by glschool running in the background... which I assume doesn't work when DRI is disabled by using the i915 patch).  QEMU is qemu.git plus the above referenced Nvidia quirk enablement patch, and libvirt is patched with this, both of which have been accepted upstream and should appear in the development trees at any moment.  If you're running OVMF from a source other than above, make sure it includes commit b0bc24af from mid-June of this year.  If it doesn't, OVMF will abort when it finds an assigned device with a PCIe capability.  Also, you'll likely experience a long (~1 minute) delay on guest reboot.  This is due to lack of MTRR initialization on reboot; patches have been accepted upstream and will be in qemu.git shortly.


Enjoy.

IOMMU Groups, inside and out

Sometimes VFIO users are befuddled that they aren't able to separate devices between host and guest or between multiple guests due to IOMMU grouping, and revert to using legacy KVM device assignment, or, as is the case with many VFIO-VGA users, apply the PCIe ACS override patch to avoid the problem.  Let's take a moment to look at what this is really doing.

Hopefully we all have at least some vague notion of what an IOMMU does in a system: it allows mapping of an I/O virtual address (IOVA) to a physical memory address.  Without an IOMMU, all devices share a flat view of physical memory without any memory translation operation.  With an IOMMU we have a new address space, the IOVA space, that we can put to use.

Different IOMMUs have different levels of functionality.  Before the proliferation of virtualization, IOMMUs often provided only translation, and often only for a small aperture or window of the address space.  These IOMMUs mostly provided two capabilities, avoiding bounce buffers and creating contiguous DMA operations.  Bounce buffers are necessary when the addressing capabilities of the device are less than that of the platform, for instance if the device can only address 4GB of memory, but your system supports 8GB.  If the driver allocates a buffer above 4GB, the device cannot directly DMA to it.  A bounce buffer is buffer space in lower memory, where the device can temporarily DMA, which is then copied to the driver allocated buffer on completion.  An IOMMU can avoid the extra buffer and copy operation by providing an IOVA within the device's address space, backed by the driver's buffer that is outside of the device's address space.  Creating contiguous DMA operations comes into play when the driver makes use of multiple buffers, scattered throughout the physical address space, and gathered together for a single I/O operation.  The IOMMU can take these scatter-gather lists and map them into the IOVA space to form a contiguous DMA operation for the device.  In the simplest example, a driver may allocate two 4KB buffers that are not contiguous in the physical memory space.  The IOMMU can allocate a contiguous range for these buffers allowing the I/O device to do a single 8KB DMA rather than two separate 4KB DMAs.

Both of these features are still important for high performance I/O on the host, but the IOMMU feature we love from a virtualization perspective is the isolation capabilities of modern IOMMUs.  Isolation wasn't possible on a wide scale prior to PCI-Express because conventional PCI does not tag transactions with an ID of the requesting device (requester ID).  PCI-X included some degree of a requester ID, but rules for interconnecting devices taking ownership of the transaction made the support incomplete for isolation.  With PCIe, each device tags transactions with a requester ID unique to the device (the PCI bus/device/function number, BDF), which is used to reference a unique IOVA table for that device.  Suddenly we go from having a shared IOVA space used to offload unreachable memory and consolidate memory, to a per device IOVA space that we can not only use for those features, but also to restrict DMA access from the device.  For assignment to a virtual machine, we now simply need to populate the IOVA space for the assigned device with the guest physical to host physical memory mappings for the VM and the device can transparently perform DMA in the guest address space.

Back to IOMMU groups; IOMMU groups try to describe the smallest sets of devices which can be considered isolated from the perspective of the IOMMU.  The first step in doing this is that each device must associate to a unique IOVA space.  That is, if multiple devices alias to the same IOVA space, then the IOMMU cannot distinguish between them.  This is the reason that a typical x86 PC will group all conventional PCI devices together, all of them are aliased to the same PCIe-to-PCI bridge.  Legacy KVM device assignment will allow a user to assign these devices separately, but the configuration is guaranteed to fail.  VFIO is governed by IOMMU groups and therefore prevents configurations which violate this most basic requirement of IOMMU granularity.

Beyond this first step of being able to simply differentiate one device from another, we next need to determine whether the transactions from a device actually reach the IOMMU.  The PCIe specification allows for transactions to be re-routed within the interconnect fabric.  A PCIe downstream port can re-route a transaction from one downstream device to another.  The downstream ports of a PCIe switch may be interconnected to allow re-routing from one port to another.  Even within a multifunction endpoint device, a transaction from one function may be delivered directly to another function.  These transactions from one device to another are called peer-to-peer transactions and can be bad news for devices operating in separate IOVA spaces.  Imagine for instance if the network interface card assigned to your guest attempted a DMA write to a guest physical address (IOVA) that matched the MMIO space for a peer disk controller owned by the host.  An interconnect attempting to optimize the data path of that transaction could send the DMA write straight to the disk controller before it gets to the IOMMU for translation.

This is where PCIe Access Control Services (ACS) comes into play.  ACS provides us with the ability to determine whether these redirects are possible as well as the ability to disable them.  This is an essential component in being able to isolate devices from one another and sadly one that is too often missing in interconnects and multifunction endpoints.  Without ACS support at every step from the device to the IOMMU, we must assume that redirection is possible at the highest upstream device lacking ACS, thereby breaking isolation of all devices below that point in the topology.  IOMMU groups in a PCI environment take this isolation into account, grouping together devices which are capable of untranslated peer-to-peer DMA.
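You can check for ACS at each step of the topology with lspci.  A quick sketch for a root port (the address is an example; ACS, when present, appears as an extended capability with ACSCap/ACSCtl detail lines):

$ sudo lspci -vvv -s 00:1c.0 | grep -A 2 'Access Control'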

Combining these two things, the IOMMU group represents the smallest set of devices for which the IOMMU has visibility and which is isolated from other groups.  VFIO uses this information to enforce safe ownership of devices for userspace.  With the exception of bridges, root ports, and switches (ie. interconnect fabric), all devices within an IOMMU group must be bound to a VFIO device driver or known safe stub driver.  For PCI, these drivers are vfio-pci and pci-stub.  We allow pci-stub simply because it's known that the host does not interact with devices via this driver (using legacy KVM device assignment on such devices while the group is in use with VFIO for a different VM is strongly discouraged).  If when attempting to use VFIO you see an error message indicating the group is not viable, it relates to this rule that all of the devices in the group need to be bound to an appropriate host driver.

IOMMU groups are visible to the user through sysfs:

$ find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/0/devices/0000:00:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/2/devices/0000:00:14.0
/sys/kernel/iommu_groups/3/devices/0000:00:16.0
/sys/kernel/iommu_groups/4/devices/0000:00:19.0
/sys/kernel/iommu_groups/5/devices/0000:00:1a.0
/sys/kernel/iommu_groups/6/devices/0000:00:1b.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.0
/sys/kernel/iommu_groups/7/devices/0000:00:1c.1
/sys/kernel/iommu_groups/7/devices/0000:00:1c.2
/sys/kernel/iommu_groups/7/devices/0000:02:00.0
/sys/kernel/iommu_groups/7/devices/0000:03:00.0
/sys/kernel/iommu_groups/8/devices/0000:00:1d.0
/sys/kernel/iommu_groups/9/devices/0000:00:1f.0
/sys/kernel/iommu_groups/9/devices/0000:00:1f.2
/sys/kernel/iommu_groups/9/devices/0000:00:1f.3

Here we see that devices like the audio controller (0000:00:1b.0) have their own IOMMU group, while a wireless adapter (0000:03:00.0) and flash card reader (0000:02:00.0) share an IOMMU group.  The latter is a result of lack of ACS support at the PCIe root ports (0000:00:1c.*).  Each device also has links back to its IOMMU group:

$ readlink -f /sys/bus/pci/devices/0000\:03\:00.0/iommu_group/
/sys/kernel/iommu_groups/7

The set of devices can thus be found using:

$ ls /sys/bus/pci/devices/0000\:03\:00.0/iommu_group/devices/
0000:00:1c.0  0000:00:1c.1  0000:00:1c.2  0000:02:00.0  0000:03:00.0

Using this example, if I wanted to assign the wireless adapter (0000:03:00.0) to a guest, I would also need to bind the flash card reader (0000:02:00.0) to either vfio-pci or pci-stub in order to make the group viable.  An important point here is that the flash card reader does not also need to be assigned to the guest; it simply needs to be bound to a driver which is known to either participate in VFIO, like vfio-pci, or known not to do DMA, like pci-stub.  Newer kernels than used for this example will split this IOMMU group, as support has been added to expose the isolation capabilities of this chipset, even though it does not support PCIe ACS directly.
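For reference, a sketch of binding that card reader to vfio-pci by hand, assuming a hypothetical vendor:device ID of 1180:e823 (check yours with lspci -n -s 02:00.0):

# Release the device from its current host driver, if it has one
echo 0000:02:00.0 > /sys/bus/pci/devices/0000:02:00.0/driver/unbind
# Teach vfio-pci the ID so it will claim the device
echo 1180 e823 > /sys/bus/pci/drivers/vfio-pci/new_id

Tools like virsh nodedev-detach can automate the same thing.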

In closing, let's discuss strategies for dealing with IOMMU groups that contain more devices than desired.  For a plug-in card, the first option would be to determine whether installing the card into a different slot may produce the desired grouping.  On a typical Intel chipset, PCIe root ports are provided via both the processor and the PCH (Platform Controller Hub).  The capabilities of these root ports can be very different.  On the latest Linux kernels we have support for exposing the isolation of the PCH root ports, even though many of them do not have native PCIe ACS support.  These are therefore often a good target for creating smaller IOMMU groups.  On Xeon class processors (except E3-1200 series), the processor-based PCIe root ports typically support ACS.  Client processors, such as the i5/i7 Core processors, do not support ACS, but we can hope future products from Intel will update this support.

Another option that many users have found is a kernel patch which overrides PCIe ACS support in the kernel, allowing command line options to falsely expose isolation capabilities of various components.  In many cases this appears to work well, but without vendor confirmation, we cannot be sure that the devices are truly isolated.  The occurrence of a misdirected DMA may be sufficiently rare to mask association with this option.  We may also find differences in chipset programming or address assignment between vendors that allows relatively safe use of this override on one system, while other systems may experience issues.  Adjusting slot usage and using a platform with proper isolation support are clearly the best options.

The final option is to work with the vendor to determine whether isolation is present and quirk the kernel to recognize this isolation.  This is generally a matter of determining whether internal peer-to-peer between functions is possible, or in the case of downstream ports, also determining whether redirection is possible.  Multifunction endpoints that do not support peer-to-peer can expose this using a single static ACS table in configuration space, exposing no capabilities.

Hopefully this entry helps to describe why we have IOMMU groups, why they take the shape that they do, and how they operate with VFIO.  Please comment if I can provide further clarification anywhere.

Thursday, August 21, 2014

What's the deal with VGA arbitration?

Let's take a step back on the VFIO VGA train and take a look at what exactly is VGA, why does it need to be arbitrated, and why can't we seem to get that arbitration working upstream.

VGA (Video Graphics Array) is a remnant of early PCs and certainly falls within the category of a legacy interface.  If you take a look at my slides from KVM Forum 2013, you can get a little taste of the history.  VGA came after things like EGA and CGA and incorporated many of their features for compatibility, but it was effectively the dawn of the PC era.  We didn't have interfaces like PCI.  Devices lived at known, fixed addresses, and there was only ever intended to be one device.  VGA devices are initialized through proprietary code provided as the VBIOS, ie. the ROM on PCI VGA cards of today.  How does the VBIOS know where to find the VGA device for initialization?  It's always at a known, fixed address, MMIO regions 0xa0000-0xbffff and a couple sets of I/O port ranges.  Remember ISA NICs with jumpers that could only be programmed to a couple addresses and required specifying which address to the driver?  VGA devices don't have jumpers.

In a modern PC, we're no longer using an ISA bus and we can clearly support more than a single VGA device, but the mechanisms to do so are via layers of compatibility.  PCI bridges actually have a VGA Enable bit in their configuration space which defines whether the bridge will do positive decode on transactions to the VGA spaces, effectively defining whether it will take ownership of a transaction to the VGA area.  Each PCI bridge has one of these enable bits, but the PCI specification only defines the results when a single bridge per bus enables VGA.  Therefore, at any given time, VGA access can only be directed to a single PCI bus.  It's the responsibility of system software to manage which bus VGA is routed to.  Managing the VGA routing is what we call VGA arbitration.
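This routing is visible from the host: the VGA Enable bit shows up as "VGA+" in the BridgeCtl line of lspci output for a bridge or root port, and it's bit 3 of the Bridge Control register if you'd rather read it with setpci (the bridge address below is an example):

$ sudo lspci -vv -s 00:01.0 | grep BridgeCtl
$ sudo setpci -s 00:01.0 BRIDGE_CONTROL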

In an ideal configuration, each VGA device is on a separate, leaf bus, where there are no further VGA devices downstream.  System software then needs only to disable the VGA enable bit on one set of bridges and enable it on another.  Things get tricky when we have multiple VGA devices on the same bus or one VGA device downstream of another.  In this case we need to prevent the other VGA devices from claiming the transactions so that they reach the intended VGA device.  One way to do this is via the PCI MMIO and I/O port space enable bits on the device.  The PCI specification defines that a device can only claim a transaction to an address space owned by the device when the appropriate bit is set in the PCI COMMAND register.  By disabling these bits, we can guarantee that the device won't claim the transaction, allowing the intended device on the same bus or downstream bus to do so.  Another option to do this is to use device specific mechanisms to configure the controller to ignore these transactions, in effect putting the device into a legacy-free operating state.

When a PC boots, the system BIOS selects a primary VGA device.  Some BIOS/motherboard manufacturers allow the user to select the primary VGA device.  This configures the chain of PCI bridges to one of the VGA devices, disabling the others.  The ROM is copied from the PCI option ROM space of the VGA controller and stored in the shadow ROM area at 0xc0000 to be compatible with legacy PCs.  The VBIOS executes from there and relies on being able to reach its device at the known, fixed VGA address spaces.

The operating system may continue to use VGA access to the device for compatibility, and boot loaders like GRUB and SYSLINUX don't have device specific video drivers.  Instead they make use of other standards layered on top of VGA, like VESA, that allow them to switch modes and use the VGA device with standard drivers.  In general though, we expect that once we're using device specific drivers in the operating system, the VGA space sees little to no use.  This means that when we want to switch VGA routing to the non-primary device so that we can repeat the above boot process in a virtual machine, we generally don't have many conflicts and can even boot two VMs simultaneously, switching VGA routing between VGA devices, with bearable performance.

Where we get into trouble (AIUI) is that DRI support in Xorg wants to create a fixed mapping (mmap) to the VGA MMIO space.  This fixed mapping implies that we can no longer change the VGA routing on the bridges since the mmap would suddenly target a different device.  Xorg therefore disables DRI support when there are multiple participants in VGA arbitration.  People like DRI support [citation needed] and multiple VGA devices are fairly common, especially when VGA support is built into the processor, such as with Intel IGD.  As a result, host drivers want to opt-out of VGA arbitration as quickly as they can by disabling legacy interfaces on the device so that there are never multiple VGA arbitration participants and DRI remains enabled.

For a typical plugin PCIe VGA card, the device can be guaranteed that there are no downstream VGA devices and there are no other devices on the same bus, simply because the point-to-point topology of PCI Express does not allow it.  Such devices don't need to do anything special to opt-out of VGA arbitration, they simply need to not rely on  VGA address spaces and notify system software that arbitration is no longer necessary.

Intel IGD isn't that lucky.  The IGD VGA device lives on the PCIe root complex where nearly all of the other devices in the system are either sibling devices on the same bus or devices on subordinate buses from the root complex.  In order for VGA to be routed to any other device in the system, the IGD device needs to be configured to not claim the transaction.  As noted earlier, this can be done either by using standard PCI disabling of MMIO and I/O or via device specific methods.  The native IGD driver would really like to continue running when VGA routing is directed elsewhere, so the PCI method of disabling access is sub-optimal.  At some point Intel knew this and included mechanisms in the device that allowed VGA MMIO and I/O port space to be disabled separately from PCI MMIO and I/O port space.  These mechanisms are what the i915 driver uses today when it opts-out of VGA arbitration.

The problem is that modern IGD devices don't implement the same mechanisms for disabling legacy VGA.  The i915 driver continues to twiddle the old bits and opt-out of arbitration, but the IGD VGA is still claiming VGA transactions.  This is why many users who claim they don't need the i915 patch finish their proclamation of success with a note about the VGA color palette being corrupted or noting some strange drm interrupt messages in the host dmesg.  They've ignored the fact that the VGA device is only working further into guest boot, when the non-legacy parts of the device drivers have managed to activate the device.  The errors they see are a direct result of VGA accesses by the guest being claimed by the IGD device rather than the assigned VGA device.

So why can't we simply fix the i915 driver to correctly disable legacy VGA support?  Well, in their infinite wisdom, hardware designers have removed (or broken) that feature of IGD.  The only mechanism is therefore to use the PCI disabling mechanism, but the host driver can't run permanently with PCI I/O space disabled, so it must participate in VGA arbitration, enabling I/O space only when needed since it will always claim VGA transactions as well as PCI space transactions.  This has the side-effect that when another VGA device is present, we now have multiple VGA arbitration participants, and Xorg disables DRI.

How can we resolve this?

a) Have a mode switch in i915 that allows it to behave correctly with VGA arbitration.

This is the currently available i915 patch.  The problems here are that DRI will be disabled and the maintainer is not willing to accept a driver mode switch option, instead pushing for a solution in other layers.

b) Remove Xorg dependency on mapping VGA space for DRI.

Honestly, I have no idea how difficult this is.  The problem is one of compatibility and deprecation.  If we could fix Xorg today to decouple DRI from VGA arbitration, how long would it be before we could fix the i915 driver to work correctly with VGA arbitration?  Could we ever do it or would we need to maintain compatibility with older Xorg?

c) Next-gen VGA arbitration

We can do tricks with page fault handlers to allow user space to have an apparently consistent mmap of VGA MMIO space.  On switching VGA routing we can invalidate the mmap pages and re-route VGA in the fault handler.  The trouble is again one of compatibility and deprecation.  If we provided Xorg with a v2 VGA arbitration interface today, could the i915 driver ever rely on it being used?

d) Don't use VGA

This is actually a promising path, and one that we'll talk more about in the future...

Dual VGA assignment, GeForce + Radeon

This is a little old, but it's a nice bit of eye candy to start things off.  On the left we have an Nvidia 8400GS and on the right an AMD HD5450.  These are both very, very low-end cards (but they're quiet; fanless).  A big bottleneck here is the hard disk.  Two VMs exercising an old spinning rust drive can get pretty slow.  In future posts I'll show how we can improve this.

The script I used has evolved from what's shown in the video, so this will differ from what you can read off the screen.  Here's the current version:

#!/bin/bash -x

#PASS_AUDIO="yes"

GPU=$(lspci -D | grep VGA | grep -i geforce | awk '{print $1}')
AUDIO=$(lspci -D -s $(echo $GPU | colrm 12 1) | grep -i audio | awk '{print $1}')
LIBVIRT_GPU=pci_$(echo $GPU | tr ':' '_' | tr '.' '_')
LIBVIRT_AUDIO=pci_$(echo $AUDIO | tr ':' '_' | tr '.' '_')

virsh nodedev-detach $LIBVIRT_GPU
virsh nodedev-detach $LIBVIRT_AUDIO

DISK=/mnt/store/vm/win7-geforce

if [ -v PASS_AUDIO ]; then
    AUDIO_CMD="-device vfio-pci,host=$AUDIO,addr=9.1"
fi

MEM=4096

HUGE="-mem-path /dev/hugepages"

NEED=$(( $MEM / 2 ))
TOTAL=$(grep HugePages_Total /proc/meminfo | awk '{print $NF}')
WANT=$(( $TOTAL + $NEED ))

echo $WANT > /proc/sys/vm/nr_hugepages

AVAIL=$(grep HugePages_Free /proc/meminfo | awk '{print $NF}')
if [ $AVAIL -lt $NEED ]; then
    echo $TOTAL > /proc/sys/vm/nr_hugepages
    HUGE=
fi

QEMU=/home/alwillia/local/bin/qemu-system-x86_64
#QEMU=/bin/qemu-kvm

QEMU_PA_SAMPLES=8192 QEMU_AUDIO_DRV=pa \
$QEMU \
-enable-kvm -rtc base=localtime \
-m $MEM $HUGE -smp sockets=1,cores=2 -cpu host,hv-time,kvm=off \
-vga none -nographic \
-monitor stdio -serial none -parallel none \
-nodefconfig \
-device intel-hda -device hda-output \
-netdev tap,id=br0,vhost=on \
-device virtio-net-pci,mac=02:12:34:56:78:91,netdev=br0 \
-drive file=$DISK,cache=none,if=none,id=drive0,aio=threads \
-device virtio-blk-pci,drive=drive0,ioeventfd=on,bootindex=1 \
-device vfio-pci,host=$GPU,multifunction=on,addr=9.0,x-vga=on \
$AUDIO_CMD

TOTAL=$(grep HugePages_Total /proc/meminfo | awk '{print $NF}')
RELEASE=$(( $TOTAL - $NEED ))
echo $RELEASE > /proc/sys/vm/nr_hugepages

virsh nodedev-reattach $LIBVIRT_GPU
virsh nodedev-reattach $LIBVIRT_AUDIO

Much of this is similar to what you'll find in the ArchLinux BBS thread, but there are some differences.  First, notice that I'm not using the vfio bind scripts found there; I'm using libvirt through virsh for that.  Next, note that I'm not using the QEMU Q35 chipset model.  The importance of Q35 has been largely exaggerated.  If you are mostly concerned with assigning GPUs to Windows guests, Windows seems to be perfectly happy using the default 440FX chipset model.  Linux guests won't like this, particularly with Radeon cards, because the driver blindly attempts to poke at the upstream PCIe root port.

On the QEMU side of things, everything should be included in QEMU 2.1.  The last required piece was the kvm=off cpu option, which is necessary for Nvidia 340+ guest drivers.  On the kernel side, this setup requires only the i915 patch, since IGD is my host graphics.  This means DRI is disabled on the center (host) monitor, and i915.enable_hd_vgaarb=1 is used on the kernel commandline.  The GeForce and Radeon devices are bound to pci-stub using kernel commandline options, ex. pci-stub.ids=10de:0e0f,1002:aab0,10de:1280,1002:6611.  The only other kernel option necessary is intel_iommu=on.  I do not require the PCIe ACS override patch because one card is installed in a slot from the processor-based (PEG) root port and the other is installed in a slot from the PCH root port.  I find many users on the ArchLinux thread using the option vfio_iommu_type1.allow_unsafe_interrupts=1.  In most cases this is entirely unnecessary, since most new processors support VT-d2 and therefore have interrupt remapping support.  AMD IOMMU users have always had hardware support for interrupt remapping, and any recent kernel can be configured to enable it.

For mouse and keyboard in the guest, I use synergy.