VFIO tips and tricks

Thursday, October 13, 2016

How to improve performance in Windows 7

A contribution from Thomas Lindroth on the vfio-users mailing list:

I thought I'd share a trick for improving the performance on win7 guests. The
tl;dr version is add <feature policy='disable' name='hypervisor'/> to the
<cpu> section of your libvirt xml like so:

<cpu mode='host-passthrough'>
<topology sockets='1' cores='3' threads='1'/>
<feature policy='disable' name='hypervisor'/>
</cpu>

The long story is that according to Microsoft's documentation "On systems
where the TSC is not suitable for timekeeping, Windows automatically selects
a platform counter (either the HPET timer or the ACPI PM timer) as the basis
for QPC." QPC = QueryPerformanceCounter() which is a windows api for getting
timing info. Some redhat documentation says: "Windows 7 do not use the TSC as
a time source if the hypervisor-present bit is set". Instead it falls back on
acpi_pm or hpet if hpet is enabled in the xml.

The hypervisor present bit is a fake cpuid flag qemu and other hypervisors
inject to show the guest it's running under a hypervisor. This is different
from the KVM signature that can be hidden with <kvm><hidden state='on'>.
With the hypervisor flag disabled in libvirt xml windows 7 started using TSC
as timing source for me.

Nvidia has a "Timer Function Performance" benchmark on their web page to
measure overhead from timers. With acpi_pm the timer query took 3,605ns on
average and with TSC 12.52ns. Passmark's CPU floating point performance
benchmark, which queries timers 265,000 times/sec, went from 3952 points with
acpi_pm to 5594 points with TSC. The reason TSC is so much faster is because
both acpi_pm and hpet are emulated by qemu in userspace and TSC is handled by
KVM in kernel space.

All games I've tested use the timer at least 25,000 times/sec. I'm guessing
it's the graphics drivers doing that. Some games like Outlast query the timer
~275,000 times/sec. The performance for those games is basically limited by
how fast the host can do context switches. I expect the performance
improvement with TSC is great in those games. Unfortunately 3dmark's fire
strike benchmark still does 25,000 queries/sec to the acpi_pm even with the
hypervisor flag hidden. There must be some other windows api for using the
"platform counter" as Microsoft calls it but most games don't use it.

Unless you are using windows 7 you'll probably not benefit from this. Windows
10 is probably using the hypervclock instead. That redhat documentation
talking about the hypervisor bit was actually a guide for how to turn off TSC
to "resolve guest timing issues". I don't experience any problems myself but
if you got one of those "clocksource tsc unstable" systems this might not
work so well.

Google says this is the NVIDIA timer benchmark.

It would be interesting to see how the hyper-v extensions compare and whether tests like Fire Strike actually make use of them. Thanks Thomas!

(copied with permission)
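
As a rough sanity check on the numbers in Thomas's post, the sketch below multiplies a query rate by the average cost per query to estimate how much of each second a single thread would spend just reading the timer. It assumes the query rate stays fixed when the timer source changes and that the queries are serialized on one thread, neither of which the post confirms, so treat the result as an upper bound rather than a measurement. The function name `timer_overhead_fraction` is made up for this illustration.

```python
def timer_overhead_fraction(queries_per_sec, ns_per_query):
    """Fraction of one second of wall time spent inside timer queries."""
    return queries_per_sec * ns_per_query * 1e-9

# Average query costs quoted in the post (read here as 3,605 ns and 12.52 ns).
ACPI_PM_NS = 3605.0
TSC_NS = 12.52

# Query rates quoted in the post; the labels are only for printing.
RATES = [("typical game", 25_000), ("Passmark FP", 265_000), ("Outlast", 275_000)]

for label, rate in RATES:
    acpi = timer_overhead_fraction(rate, ACPI_PM_NS)
    tsc = timer_overhead_fraction(rate, TSC_NS)
    print(f"{label}: acpi_pm {acpi:.1%}, TSC {tsc:.2%}")
```

Under those assumptions acpi_pm would eat about 9% of a second at 25,000 queries/sec and over 95% at 265,000 queries/sec, while TSC stays well under 1% in every case. In practice the query rate itself probably drops when each query is slow, which would explain why the Passmark score fell by less than this naive model suggests.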

Monday, September 26, 2016

Passing QEMU command line options through libvirt

This one comes from Stefan's blog with some duct tape and baling wire courtesy of Laine. I've talked previously about using wrapper scripts to launch QEMU, which typically use sed to insert options that libvirt doesn't know about. This is by far better than defining full vfio-pci devices using <qemu:arg> options, which many guides suggest, but which hides the devices from libvirt and causes all sorts of problems with device permissions and locked memory, etc. But there's a nice compromise, as Stefan shows in his last example at the link above. Say we only want to add x-vga=on to one of our hostdev entries in the libvirt domain XML. We can do something like this:

<qemu:commandline>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev0.x-vga=on'/>
</qemu:commandline>

The effect is that we add the option x-vga=on to the hostdev0 device, which is defined via a normal <hostdev> section in the XML and gets all the device permission and locked memory love from libvirt. So which device is hostdev0? Well, things get a little mushy there. libvirt invents the names based on the order of the hostdev entries in the XML, so you can simply count (starting from zero) to pick the entry for the additional option. It's a bit cleaner overall than needing to manage a wrapper script separately from the VM. Also, don't forget that to use <qemu:commandline> you need to first enable the QEMU namespace in the XML by updating the first line in the domain XML to:

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>

Otherwise libvirt will promptly discard the extra options when you save the domain.
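
If counting entries by eye feels error prone, the counting rule can be sketched in a few lines of Python. This is only an illustration of "count the <hostdev> entries starting from zero"; the helper name `hostdev_aliases` and the sample XML are invented here, and the names libvirt actually assigned are the `<alias name='...'/>` elements shown by `virsh dumpxml` for a running guest, so check against those if in doubt.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML with two assigned PCI functions.
SAMPLE_XML = """
<domain type='kvm'>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source><address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/></source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source><address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/></source>
    </hostdev>
  </devices>
</domain>
"""

def hostdev_aliases(domain_xml):
    """Pair each <hostdev> with the alias implied by its position in the XML."""
    root = ET.fromstring(domain_xml)
    result = []
    for index, hostdev in enumerate(root.findall("./devices/hostdev")):
        addr = hostdev.find("./source/address")
        if addr is not None:
            # Host PCI address formatted as bus:slot.function.
            where = "{:02x}:{:02x}.{:x}".format(
                int(addr.get("bus", "0"), 16),
                int(addr.get("slot", "0"), 16),
                int(addr.get("function", "0"), 16),
            )
        else:
            where = "?"
        result.append((f"hostdev{index}", where))
    return result

for alias, where in hostdev_aliases(SAMPLE_XML):
    print(alias, where)
```

With the sample above, the GPU function at 01:00.0 comes out as hostdev0, which is the name to use in the -set option.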

"Intel-IOMMU: enabled": It doesn't mean what you think it means

A quick post just because I keep seeing this in practically every how-to guide I come across. The instructions grep dmesg for "IOMMU" and come up with either "Intel-IOMMU: enabled" or "DMAR: IOMMU enabled". Clearly that means it's enabled, right? Wrong. That line comes from a __setup() function that parses the options for "intel_iommu=". Nothing has been done at that point, not even a check to see if VT-d hardware is present. Pass intel_iommu=on as a boot option to an AMD system and you'll see this line. Yes, this is clearly not a very intuitive message. So for the record, the mouthful that you should be looking for is this line:

DMAR: Intel(R) Virtualization Technology for Directed I/O

or on older kernels the prefix is different:

PCI-DMA: Intel(R) Virtualization Technology for Directed I/O

When you see this, you're pretty much past all the failure points of initializing VT-d. FWIW, the "DMAR" flavors of the above appeared in v4.2, so on a more recent kernel, that's your better option.
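
For scripts that want to test for this, here is a minimal sketch that looks for the real initialization line instead of the misleading "enabled" one. The function name `vtd_initialized` is made up for this example, and it only inspects text you hand it; how you collect the kernel log is up to you.

```python
import re

# The line the post says to look for. The prefix is "DMAR:" on v4.2 and newer
# kernels and "PCI-DMA:" on older ones.
VTD_READY = re.compile(
    r"(DMAR|PCI-DMA): Intel\(R\) Virtualization Technology for Directed I/O"
)

def vtd_initialized(dmesg_text):
    """True only if the kernel log shows VT-d actually finished initializing."""
    return VTD_READY.search(dmesg_text) is not None

# The option-parsing message alone is not enough...
print(vtd_initialized("DMAR: IOMMU enabled"))
# ...but the full line is.
print(vtd_initialized("DMAR: Intel(R) Virtualization Technology for Directed I/O"))
```

Feed it the output of dmesg (for example via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may need root on some systems).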

Thursday, September 1, 2016

And now you're an expert

Video from my KVM Forum 2016 talk:

Wednesday, August 24, 2016

KVM Forum 2016 - An Introduction to PCI Device Assignment with VFIO

Slides available here:

http://awilliam.github.io/presentations/KVM-Forum-2016

Video to come

Friday, July 15, 2016

Intel Graphics assignment

Hey folks, it feels like it's time to mention that assignment of Intel graphics devices (IGD) is currently available in qemu.git and will be part of the upcoming QEMU 2.7 release. There's already pretty thorough documentation of the modes available in the source tree, please give it a read. There are two modes described there, "legacy" and "Universal Passthrough" (UPT), each with its pros and cons. Which ones are available to you depends on your hardware. UPT mode is only available for Broadwell and newer processors while legacy mode is available all the way back through SandyBridge. If you have a processor older than SandyBridge, stop now, this is not going to work for you. If you don't know what any of these strange names mean, head to Wikipedia and Ark to figure it out.

The high level overview is that "legacy" mode is much like our GeForce support, the IGD is meant to be the primary and exclusive graphics in the VM. Additionally the IGD address in the VM must be at PCI 00:02.0, only SeaBIOS is currently supported, only the 440FX chipset model is supported (no Q35), the IGD device must be the primary host graphics device, and the host needs to be running kernel v4.6 or newer. Clearly assigning the host primary graphics is a bit of an about-face for our GPU assignment strategy, but we depend on running the IGD video ROM, which depends on VGA and imposes most of the above requirements as well (oh, add CONFIG_VFIO_PCI_VGA to the requirements list). I have yet to see an IGD ROM with UEFI support, which is why OVMF is not yet supported, but it seems possible to support with a CSM and some additional code in OVMF.

Legacy mode should work with both Linux and Windows guests (and hopefully others if you're so inclined). The i915 driver does suffer from the typical video driver problem that sometimes the whole system explodes (not literally) when unbinding or re-binding the IGD to the driver. Personally I avoid this by blacklisting the i915 driver. Of course as some have found out trying to do this with discrete GPUs, there are plenty of other drivers ready to jump on the device to keep the console working. The primary ones I've seen are vesafb and efifb; which one is used on your system depends on your host firmware settings, legacy BIOS vs UEFI respectively. To disable these, simply add video=vesafb:off or video=efifb:off to the kernel command line (not sure which to use? try both, video=vesafb:off,efifb:off). The first thing you'll notice when you boot an IGD system with i915 blacklisted and the more basic framebuffer drivers disabled is that you don't get anything on the graphics head after grub. Plan for this. I use a serial console, but perhaps you're more comfortable running blind and willing to hope the system boots and you can ssh into it remotely.
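
Since you'll be booting blind, it can help to confirm over ssh or serial that the video= options actually made it onto the kernel command line. The sketch below parses a command line string for the video=<driver>:off form described above; the helper name `disabled_framebuffers` is invented for this example, and it only reports what was requested, not whether the drivers really stayed away.

```python
def disabled_framebuffers(cmdline):
    """Return the set of drivers turned off via video=<driver>:off options."""
    disabled = set()
    for token in cmdline.split():
        if not token.startswith("video="):
            continue
        # A single video= option may carry several comma-separated entries.
        for entry in token[len("video="):].split(","):
            driver, _, setting = entry.partition(":")
            if setting == "off":
                disabled.add(driver)
    return disabled

# Example command line; on a live host, read /proc/cmdline instead.
example = "root=/dev/sda1 console=ttyS0 video=vesafb:off,efifb:off"
print(sorted(disabled_framebuffers(example)))
```

On the host itself, pass in the contents of /proc/cmdline, e.g. `disabled_framebuffers(open("/proc/cmdline").read())`.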

If you've followed along with this procedure, you should be able to simply create a <hostdev> entry in your libvirt XML, which ought to look something like this:

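<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</source>
<address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</hostdev>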

Again, assigning the IGD device (which is always 00:02.0) to address 00:02.0 in the VM is required. Delete the <video> and <graphics> sections and everything should just magically work. Caveat emptor: my newest CPU is Broadwell. I've been told this works with Skylake, but IGD is hardly standardized and each new implementation seems to tweak things just a bit.
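
The legacy mode requirements that are visible in the domain XML can be checked mechanically. The following is a rough sketch, not a complete validator: it covers only the guest address, the machine type, and the leftover <video> and <graphics> sections, and says nothing about the host kernel, SeaBIOS, or CONFIG_VFIO_PCI_VGA. The function name `legacy_igd_problems`, the sample XML, and the machine type string in it are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML that satisfies the checks below.
EXAMPLE_XML = """
<domain type='kvm'>
  <os><type arch='x86_64' machine='pc-i440fx-2.6'>hvm</type></os>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </hostdev>
  </devices>
</domain>
"""

def _is_00_02_0(addr):
    """True if an <address> element points at PCI bus 0, slot 2, function 0."""
    return (
        addr is not None
        and int(addr.get("bus", "0"), 16) == 0
        and int(addr.get("slot", "0"), 16) == 2
        and int(addr.get("function", "0"), 16) == 0
    )

def legacy_igd_problems(domain_xml):
    """List the XML-visible legacy mode requirements that are not met."""
    root = ET.fromstring(domain_xml)
    problems = []
    os_type = root.find("./os/type")
    machine = os_type.get("machine", "") if os_type is not None else ""
    if "q35" in machine:
        problems.append("machine type is Q35; legacy mode needs 440FX")
    # The IGD is the hostdev whose host-side source address is 00:02.0.
    igd = [h for h in root.findall("./devices/hostdev")
           if _is_00_02_0(h.find("./source/address"))]
    if not igd:
        problems.append("no <hostdev> with host address 00:02.0")
    for hostdev in igd:
        # The direct child <address> is the guest-side address.
        if not _is_00_02_0(hostdev.find("./address")):
            problems.append("IGD is not at guest address 00:02.0")
    for tag in ("video", "graphics"):
        if root.find(f"./devices/{tag}") is not None:
            problems.append(f"<{tag}> section still present")
    return problems

print(legacy_igd_problems(EXAMPLE_XML))
```

An empty list means none of these particular checks failed; it does not guarantee the guest will light up the display.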

Some of you are probably also curious why this doesn't work on Q35, which leads into the discussion of UPT mode; IGD clearly is not a discrete GPU, but "integrated" not only means that the GPU is embedded in the system, in this case it means that the GPU is kind of smeared across the system. This is why IGD assignment hasn't "just worked" and why you need a host kernel with support for exposing certain regions through vfio and a BIOS that's aware of IGD, and it needs to be at a specific address, etc, etc, etc. One of those requirements is that the video ROM actually also cares about a few properties of the device at PCI address 00:1f.0, the ISA/LPC bridge. Q35 includes its own bridge at that location and we cannot simply modify the IDs of that bridge for compatibility reasons. Therefore that bridge being an implicit part of Q35 means that IGD assignment doesn't work on Q35. This also means that PCI address 00:1f.0 is not available for use in a 440FX machine.

Ok, so UPT. Intel has known for a while that the sprawl of IGD has made it difficult to deal with for device assignment. To combat this, both software and hardware changes have been made that help to consolidate IGD to be more assignment-friendly. Great news, right? Well sort of. First off, in UPT mode the IGD is meant to be a secondary graphics device in the VM, there's no VGA mode support (oh, BTW, x-vga=on is automatically added by QEMU in legacy mode). In fact, um, there's no output support of any kind by default in UPT mode. How's this useful, you ask? Well, between the emulated graphics and IGD you can set up mirroring so you actually have a remote-capable, hardware accelerated graphics VM. Plus, if you add the option x-igd-opregion=on to the vfio-pci device, you can get output to a physical display, but there again you're going to need the host running kernel v4.6 or newer and the upcoming QEMU 2.7 support, while no-output UPT has probably actually worked for quite a while. UPT mode has no requirements for the IGD PCI address, but note that most VM firmware, SeaBIOS or OVMF, will define the primary graphics as the one having the lowest PCI address. Usually not a problem, but some of you create some crazy configs. You'll also still need to do all the blacklisting and video disabling above, or just risk binding and unbinding i915 from the host, gambling each time whether it'll explode.

So UPT sounds great except why is this opregion thing optional? Well, it turns out that if you want to do that cool mirroring thing I mention above and a physical output is enabled with the opregion, you actually need to have a monitor attached to the device or else your apps don't get any hardware acceleration love. Whereas if IGD doesn't know about any outputs, it's happy to apply hardware acceleration regardless of what's physically connected. Sucks, but readers here should already know how to create wrapper scripts to add this extra option if they want it (similar to x-vga=on). I don't think Intel really wants to support this hacky hybrid mode either, thus the experimental x- option prefix tag.
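
As an alternative to a wrapper script, the <qemu:commandline> approach from the September 26 post should carry over to this option as well. The sketch below just generates that XML fragment for a given device alias and property; it assumes -set works for x-igd-opregion the same way it does for x-vga, that the IGD happens to be hostdev0 in your domain, and that the QEMU namespace is enabled on the <domain> element. The helper name `qemu_set_fragment` is invented for this example.

```python
from xml.sax.saxutils import quoteattr

def qemu_set_fragment(alias, prop, value):
    """Build a <qemu:commandline> fragment that sets one property on one device."""
    setting = f"device.{alias}.{prop}={value}"
    return "\n".join([
        "<qemu:commandline>",
        # quoteattr() adds the surrounding quotes and escapes special characters.
        f"<qemu:arg value={quoteattr('-set')}/>",
        f"<qemu:arg value={quoteattr(setting)}/>",
        "</qemu:commandline>",
    ])

# Assumption: the IGD is the first <hostdev> entry, so its alias is hostdev0.
print(qemu_set_fragment("hostdev0", "x-igd-opregion", "on"))
```

Paste the printed fragment into the domain XML next to the <devices> section, adjusting the alias to match your own hostdev ordering.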

Oh, one more gotcha for UPT mode: Intel seems to expect otherwise, but I've had zero success trying to run Linux guests with UPT. Just go ahead and assume this is for your Windows guests only at this point.

What else... laptop displays should work, I believe switching outputs even works, but working on laptops is rather inconvenient since you're unlikely to have a serial console available. Also note that while you can use input-linux to attach a laptop keyboard and mouse (not trackpad IME), I don't know how to make the hotkeys work, so that's a bummer. Some IGD devices will generate DMAR error spew on the host when assigned, particularly the first time per host boot. Don't be too alarmed by this, especially if it stops before the display is initialized. This seems to be caused by resetting the IGD in an IOMMU context where it can't access its page tables set up by the BIOS/host. Unless you have an ongoing spew of these, they can probably be ignored. If you have something older than SandyBridge that you wish you could use this with and continued reading even after being told to stop, sorry, there was a hardware change at SandyBridge and I don't have anything older to test with and don't really want to support additional code for such outdated hardware. Besides, those are pretty old and you need an excuse for an upgrade anyway.

With this support I've switched my desktop system so that the host actually runs from a USB stick and the previous bare-metal Fedora install is virtualized with IGD, running alongside my existing GeForce VM. Give it a try and good luck.

Sunday, January 3, 2016

Comments on the 7 Gamers, 1 CPU video

In case you've seen this video:

And you're thinking to yourself that the R9 Nano they used looks like a great choice for your own GPU assignment build, think again. This GPU is known to have reset issues, so while it's impressive to see a build with this degree of consolidation, you should be suspicious about the problems Linus alludes to and the limited functionality of the overall system shown in the video. For instance, does rebooting a VM require a host reboot, or perhaps a manual soft eject of the GPU from the VM? We see this a lot with newer AMD cards, well beyond the partial workarounds we have for Bonaire and Hawaii based GPUs.

Personally I would have preferred to see an NVIDIA based solution, but due to the scale of this build, and unique slot and power restrictions, the compatibility with GPU assignment was mostly an afterthought. NVIDIA is not without issues, but for the time being we understand those issues and have workarounds for them, and even a path for supported configurations with Quadro cards.

On the plus side, yes, this is using KVM and VFIO and it's an impressive example of what this technology can do. However, when you're spec'ing your own build, do your own research and don't rely on videos like this to choose your components. My 2 cents...