Friday, April 24, 2015

Progress on the AMD front

With the help of Alex Deucher, I've been able to make progress on the long-standing reset issue we've had with some Radeon GPUs.  This issue typically manifests itself as a BSOD on VM restart.  Some users attempt to do things like hot-unplug the GPU from the VM before reboot/shutdown or even suspend/resume the host in an attempt to work around this.  Rebooting the host is also an option, but even more undesirable.

The problem is believed to be limited to Bonaire and Hawaii GPUs.  It's a hardware bug and should be fixed on more recent ASICs.  In my experience, it appears that the GPU is sufficiently disconnected from the PCI bus that the PCI bus reset we rely on for most graphics cards has no effect on these particular GPUs.  This leaves the internal SMC engine running microcode loaded from the guest driver, interfering with the driver re-load on the next boot.
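
For anyone who wants to poke at the generic reset path by hand, here is a minimal sketch using the kernel's sysfs `reset` attribute for a PCI device.  The helper names and the device address in the comments are hypothetical, and the kernel picks the reset method itself, so this is not necessarily the same bus reset vfio performs.  On an affected GPU the write may well succeed and still leave the SMC running, which is exactly the problem described above.

```python
import os

# Standard sysfs location for PCI devices on Linux.
SYSFS_PCI_DEVICES = "/sys/bus/pci/devices"


def pci_reset_path(bdf, sysfs_root=SYSFS_PCI_DEVICES):
    """Return the path of the sysfs 'reset' attribute for a PCI address."""
    return os.path.join(sysfs_root, bdf, "reset")


def can_reset(bdf, sysfs_root=SYSFS_PCI_DEVICES):
    """True if the kernel exposes a reset method for this device."""
    return os.path.exists(pci_reset_path(bdf, sysfs_root))


def trigger_reset(bdf, sysfs_root=SYSFS_PCI_DEVICES):
    """Ask the kernel to reset the device by writing '1' to the attribute.

    Needs root on a real system, and the device should not be in use.
    """
    path = pci_reset_path(bdf, sysfs_root)
    if not os.path.exists(path):
        raise FileNotFoundError("no reset attribute for " + bdf)
    with open(path, "w") as f:
        f.write("1")


# Example (hypothetical address, run as root with the VM shut down):
#   if can_reset("0000:01:00.0"):
#       trigger_reset("0000:01:00.0")
```

The `sysfs_root` parameter is only there so the helpers can be pointed at a scratch directory for testing rather than at real hardware.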

What we've found is that there are some ASIC-specific reset mechanisms we can use on these cards that get us to a sufficiently fresh state for the card to be used repeatedly, most of the time.  I add that qualifier because I do still see occasional failures, but they manifest as if the card never wakes up rather than as a BSOD.  The solution for this is still a host reboot, but it should be a relatively rare occurrence.

I had originally hoped this could be implemented as a device-specific reset in the kernel, allowing a kernel update to transparently enable this for users, but given the nature of the workaround, I now feel more comfortable implementing it in the QEMU vfio driver.  This will go in after the QEMU v2.3 release.

If you're affected by this problem, I'd encourage you to give this patch a try.