Thursday, October 13, 2016

How to improve performance in Windows 7

A contribution from Thomas Lindroth on the vfio-users mailing list:
I thought I'd share a trick for improving the performance of win7 guests. The
tl;dr version is: add <feature policy='disable' name='hypervisor'/> to the
<cpu> section of your libvirt XML, like so:

<cpu mode='host-passthrough'>
<topology sockets='1' cores='3' threads='1'/>
<feature policy='disable' name='hypervisor'/>
</cpu>
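
If you would rather script the change than edit the XML by hand, the snippet below is a minimal sketch of one way to do it. It assumes you have the domain XML as a string (for example from `virsh dumpxml`); the function name `disable_hypervisor_flag` is made up for illustration and is not part of libvirt.

```python
import xml.etree.ElementTree as ET


def disable_hypervisor_flag(domain_xml: str) -> str:
    """Return the domain XML with the 'hypervisor' CPU feature disabled.

    Hypothetical helper, not a libvirt API. It only edits the XML text;
    loading the result back into libvirt is left to you.
    """
    root = ET.fromstring(domain_xml)
    cpu = root.find("cpu")
    if cpu is None:
        # No <cpu> element yet: create one. host-passthrough mirrors the
        # example above and is an assumption about your setup.
        cpu = ET.SubElement(root, "cpu", {"mode": "host-passthrough"})
    for feature in cpu.findall("feature"):
        if feature.get("name") == "hypervisor":
            # The feature is already listed: just force the policy.
            feature.set("policy", "disable")
            break
    else:
        # The feature is not listed yet: append it to <cpu>.
        ET.SubElement(cpu, "feature", {"policy": "disable", "name": "hypervisor"})
    return ET.tostring(root, encoding="unicode")
```

ElementTree drops XML comments and may rewrite namespace prefixes, so compare the output with the original before feeding it to `virsh define`.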

The long story is that according to Microsoft's documentation "On systems
where the TSC is not suitable for timekeeping, Windows automatically selects
a platform counter (either the HPET timer or the ACPI PM timer) as the basis
for QPC." QPC = QueryPerformanceCounter(), which is a Windows API for getting
timing info. Some Red Hat documentation says: "Windows 7 does not use the TSC
as a time source if the hypervisor-present bit is set". Instead it falls back
on acpi_pm, or on hpet if hpet is enabled in the XML.
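
A rough way to see which case a guest falls into is to time the counter from inside the guest. The sketch below assumes Python 3.7 or later is installed there; on Windows, CPython's `time.perf_counter_ns()` is backed by QueryPerformanceCounter(). The function name and sample count are illustrative choices, and the result includes Python's own loop overhead, so treat it as a relative comparison before and after the change rather than a precise measurement.

```python
import time


def average_timer_query_ns(samples: int = 200_000) -> float:
    """Average cost of one time.perf_counter_ns() call, in nanoseconds.

    The figure includes interpreter loop overhead, so it overstates the
    cost of a fast counter; a slow, emulated counter still stands out.
    """
    start = time.perf_counter_ns()
    for _ in range(samples):
        time.perf_counter_ns()
    end = time.perf_counter_ns()
    return (end - start) / samples


print(f"{average_timer_query_ns():.1f} ns per query (including loop overhead)")
```

Run it once with the hypervisor flag visible and once with it disabled; if the guest switched counters, the two numbers should differ noticeably.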

The hypervisor-present bit is a fake CPUID flag that QEMU and other
hypervisors inject to show the guest it's running under a hypervisor. This is
different from the KVM signature, which can be hidden with
<kvm><hidden state='on'/></kvm>. With the hypervisor flag disabled in the
libvirt XML, Windows 7 started using TSC as its timing source for me.

Nvidia has a "Timer Function Performance" benchmark on their web page to
measure overhead from timers. With acpi_pm the timer query took 3,605 ns on
average and with TSC 12.52 ns. Passmark's CPU floating point performance
benchmark, which queries timers 265,000 times/sec, went from 3952 points with
acpi_pm to 5594 points with TSC. TSC is so much faster because both acpi_pm
and hpet are emulated by QEMU in userspace, while TSC is handled by KVM in
kernel space.
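
To put those numbers side by side, here is a back-of-the-envelope sketch that multiplies the quoted query rate by the quoted per-query cost. The latencies and the query rate come from two different benchmarks, so this is an estimate under that assumption, not a measurement; the function names are made up for illustration.

```python
ACPI_PM_NS = 3605.0                 # average acpi_pm query cost quoted above
TSC_NS = 12.52                      # average TSC query cost quoted above
PASSMARK_QUERIES_PER_SEC = 265_000  # Passmark query rate quoted above


def timer_time_fraction(queries_per_sec: float, ns_per_query: float) -> float:
    """Fraction of each second spent inside timer queries."""
    return queries_per_sec * ns_per_query / 1e9


def max_queries_per_sec(ns_per_query: float) -> float:
    """Upper bound on back-to-back queries one thread can make per second."""
    return 1e9 / ns_per_query


acpi = timer_time_fraction(PASSMARK_QUERIES_PER_SEC, ACPI_PM_NS)
tsc = timer_time_fraction(PASSMARK_QUERIES_PER_SEC, TSC_NS)
print(f"acpi_pm: {acpi:.1%} of each second spent in timer queries")
print(f"TSC:     {tsc:.2%} of each second spent in timer queries")
print(f"acpi_pm ceiling: {max_queries_per_sec(ACPI_PM_NS):,.0f} queries/sec")
```

At 265,000 queries/sec this works out to roughly 95% of each second with acpi_pm against roughly 0.3% with TSC, and a single thread could not issue more than about 277,000 acpi_pm queries per second in any case. The Passmark score improved by far less than that ratio, so the benchmark presumably did not sustain the full query rate while on acpi_pm.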

All games I've tested use the timer at least 25,000 times/sec. I'm guessing
it's the graphics drivers doing that. Some games like Outlast query the timer
~275,000 times/sec. The performance of those games is basically limited by
how fast the host can do context switches. I expect the performance
improvement with TSC to be large in those games. Unfortunately 3DMark's Fire
Strike benchmark still does 25,000 queries/sec to the acpi_pm even with the
hypervisor flag hidden. There must be some other Windows API for using the
"platform counter", as Microsoft calls it, but most games don't use it.

Unless you are using Windows 7 you'll probably not benefit from this. Windows
10 is probably using the hypervclock instead. That Red Hat documentation
talking about the hypervisor bit was actually a guide for how to turn off TSC
to "resolve guest timing issues". I don't experience any problems myself but
if you have one of those "clocksource tsc unstable" systems this might not
work so well.
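
One quick way to check for this is to look at which clocksource the Linux host is using. The sketch below reads the standard sysfs files for that; it assumes "clocksource tsc unstable" refers to the host kernel's message, and the function name is made up for illustration.

```python
from pathlib import Path

# Standard sysfs location on Linux hosts.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def host_clocksource(base: Path = CLOCKSOURCE_DIR) -> tuple[str, list[str]]:
    """Return (current clocksource, available clocksources) for a Linux host."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if CLOCKSOURCE_DIR.exists():
    current, available = host_clocksource()
    print(f"current: {current}, available: {', '.join(available)}")
```

If the host is not currently using tsc, it has probably fallen back to hpet or acpi_pm itself, which is the kind of system where this trick may misbehave.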

Google says this is the NVIDIA timer benchmark.

It would be interesting to see how the Hyper-V extensions compare and whether tests like Fire Strike actually make use of them. Thanks Thomas!

(copied with permission)