Friday, April 24, 2015

Progress on the AMD front

With the help of Alex Deucher, I've been able to make progress on the long-standing reset issue we've had with some Radeon GPUs.  This issue typically manifests itself as a BSOD on VM restart.  Some users try to work around it by hot-unplugging the GPU from the VM before reboot/shutdown, or even by suspending and resuming the host.  Rebooting the host also works, but is even less desirable.

The problem is believed to be limited to Bonaire and Hawaii GPUs.  It's a hardware bug and should be fixed on more recent ASICs.  In my experience, the GPU appears to be sufficiently disconnected from the PCI bus that the PCI bus reset we rely on for most graphics cards has no effect on these particular GPUs.  This leaves the internal SMC engine running the microcode loaded by the guest driver, which interferes with the driver reload on the next boot.

What we've found is that there are some ASIC-specific reset mechanisms we can use on these cards that get them to a sufficiently fresh state to be used repeatedly, most of the time.  I add that qualifier because I do still see occasional failures, but they manifest as the card never waking up rather than as a BSOD.  The solution for this is still a host reboot, but it should be a relatively rare occurrence.

I had originally hoped this could be implemented as a device-specific reset in the kernel, allowing a kernel update to transparently enable it for users, but given the nature of the workaround, I now feel more comfortable implementing it in the QEMU vfio driver.  This will go in after the QEMU v2.3 release.
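A device-specific reset in the vfio driver amounts to choosing a different reset routine based on the card's PCI vendor and device IDs. The following is a minimal, hypothetical sketch of that dispatch, not the actual QEMU code. The names `struct vfio_dev`, `setup_reset_quirk()`, `radeon_asic_reset()`, `generic_bus_reset()` and `reset_device()` are invented for illustration. The ASIC-specific reset sequence is left as a stub. Only a few Bonaire and Hawaii device IDs from pci.ids are listed.

```c
#include <stdint.h>
#include <stdio.h>

#define PCI_VENDOR_ID_ATI 0x1002

/* Hypothetical per-device state; a real vfio device structure is far larger. */
struct vfio_dev {
    uint16_t vendor_id;
    uint16_t device_id;
    int (*resetfn)(struct vfio_dev *dev); /* NULL means "use the generic path" */
};

/* Stub standing in for the ASIC-specific reset; the real sequence is not shown. */
static int radeon_asic_reset(struct vfio_dev *dev)
{
    printf("ASIC-specific reset for %04x:%04x\n",
           (unsigned)dev->vendor_id, (unsigned)dev->device_id);
    return 0;
}

/* Stub standing in for the PCI bus reset that works for most graphics cards. */
static int generic_bus_reset(struct vfio_dev *dev)
{
    printf("PCI bus reset for %04x:%04x\n",
           (unsigned)dev->vendor_id, (unsigned)dev->device_id);
    return 0;
}

/* Install the quirk only for the affected ASICs (illustrative subset of IDs). */
static void setup_reset_quirk(struct vfio_dev *dev)
{
    dev->resetfn = NULL;
    if (dev->vendor_id != PCI_VENDOR_ID_ATI) {
        return;
    }
    switch (dev->device_id) {
    /* Bonaire */
    case 0x6658: /* Bonaire XTX [Radeon R7 260X] */
    case 0x665c: /* Bonaire XT [Radeon HD 7790] */
    /* Hawaii */
    case 0x67b0: /* Hawaii XT [Radeon R9 290X] */
    case 0x67b1: /* Hawaii PRO [Radeon R9 290] */
        dev->resetfn = radeon_asic_reset;
        break;
    default: /* e.g. Tonga 0x6939 or Tobago 0x665f: no quirk */
        break;
    }
}

/* Called between VM runs: prefer the quirk, otherwise fall back to a bus reset. */
static int reset_device(struct vfio_dev *dev)
{
    return dev->resetfn ? dev->resetfn(dev) : generic_bus_reset(dev);
}
```

With this design, a Hawaii R9 290X (1002:67b0) gets `radeon_asic_reset` installed. A card outside the list, such as the Tonga R9 285 (1002:6939) from the comments, keeps the ordinary bus reset.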

If you're affected by this problem, I'd encourage you to give this patch a try.
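Whether a card falls under the workaround depends on its PCI device ID. The start scripts in the comments read these IDs from sysfs, and the same lookup can be sketched in C. This sketch assumes that the `vendor` and `device` files under `/sys/bus/pci/devices/<address>/` each hold one hex value, such as `0x1002`, followed by a newline. `parse_pci_id()` and `read_pci_id()` are hypothetical helper names.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a sysfs ID string such as "0x665f\n"; returns -1 on malformed input. */
static long parse_pci_id(const char *text)
{
    char *end;
    long id = strtol(text, &end, 16); /* base 16 accepts an optional "0x" prefix */

    if (end == text || (*end != '\0' && *end != '\n') || id < 0 || id > 0xffff) {
        return -1;
    }
    return id;
}

/* Read e.g. /sys/bus/pci/devices/0000:01:00.0/device; returns -1 on error. */
static long read_pci_id(const char *addr, const char *attr)
{
    char path[256];
    char buf[32];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/%s", addr, attr);
    f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_pci_id(buf);
}
```

For example, `read_pci_id("0000:01:00.0", "device")` on a Hawaii R9 290X would typically return 0x67b0. The R7 360 discussed in the comments reports 0x665f instead, which pci.ids lists as Tobago rather than Bonaire.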

11 comments:

  1. I have patched hw/vfio/pci.c to add my Radeon R9 285 (Tonga) to the device cases for this workaround. It works! If I shut down my Windows 8.1 guest and run qemu again, it works fine.

    Unfortunately, a soft reboot of the guest doesn't work. The system gets to the Windows loading screen and then the monitor loses signal. I don't know if the guest system is frozen. Killing qemu and launching it again works, and Windows boots OK.

    Thank you!

    Below is my patch:

    diff hw/vfio/pci.c /home/lucas/pci.c
    3477a3478,3479
    > /* Tonga */
    > case 0x6939: /* [Radeon R9 285] */

    1. I'm told that Tonga does not have the same reset problem as Bonaire and Hawaii, so this shouldn't be necessary. Can you clarify what problem you're seeing and what kernel and qemu version you're using? On Bonaire and Hawaii this fix works for both VM reboot as well as shutdown/restart, so it's possible you're suffering from something a bit different.

    2. $ uname -a
      Linux desktop 3.18.0 #1 SMP Fri May 29 01:01:54 EEST 2015 x86_64 x86_64 x86_64 GNU/Linux

      $ /usr/local/bin/qemu-system-x86_64 --version
      QEMU emulator version 2.3.50, Copyright (c) 2003-2008 Fabrice Bellard

      I'm passing through the Radeon R9 285 to the Windows 8.1 guest. My VM start script is as follows:

      #!/bin/bash

      configfile=/etc/vfio-pci0.cfg

      # Unbind a device from its current driver and let vfio-pci claim it
      vfiobind() {
      dev="$1"
      vendor=$(cat "/sys/bus/pci/devices/$dev/vendor")
      device=$(cat "/sys/bus/pci/devices/$dev/device")
      if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
      echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
      fi
      echo "$vendor $device" > /sys/bus/pci/drivers/vfio-pci/new_id
      }

      modprobe vfio-pci

      # The config file lists one PCI address per line; skip comments and blank lines
      while read -r line; do
      case "$line" in ''|\#*) continue ;; esac
      vfiobind "$line"
      done < "$configfile"

      sudo /usr/local/bin/qemu-system-x86_64 -enable-kvm -M q35 -m 7168 -cpu host \
      -smp 3,sockets=1,cores=3,threads=1 \
      -bios /usr/share/seabios/bios.bin -vga none \
      -device ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=root.1 \
      -device vfio-pci,host=02:00.0,bus=root.1,addr=00.0,multifunction=on,x-vga=on \
      -device vfio-pci,host=02:00.1,bus=root.1,addr=00.1 \
      -drive file=/windows/windows8.qcow2,if=virtio,boot=on,cache=writeback \
      -device qxl

      exit 0

      Whenever I have used the proprietary Catalyst driver in the Windows guest and then shut down (or rebooted) the guest, on the next start I see the SeaBIOS boot screen and the Windows "loading" splash screen. But then the monitor goes blank (no signal). I can only use the Windows guest again after rebooting the Linux host.

      I tested your solution and it worked, as described in my first comment. But since then it has never worked again (i.e. I have not been able to shut down and start the guest again without rebooting the host). I don't know why it stopped working, since I have changed nothing in the setup.

      How can I debug the problem?

      Thanks,
      Lucas

    3. OK, it sounds like Tonga should not be included in the Bonaire/Hawaii workaround. I really have no good advice on how to debug these; even with a Bonaire on hand and prodding AMD, it took a long time to come up with these workarounds. We really expect a PCI bus reset to put the device back into a pristine state. On most devices that works, but on Bonaire and Hawaii, the PCI bus interface seems to be completely independent of the GPU core. I have no idea what the problem might be on Tonga :(

    4. Hello,

      Today I bought a Sapphire AMD R7 360 with a UEFI "BIOS" and tested a reboot of a Windows 7 x64 guest on a Debian 8 host using OVMF.
      It works without problems.

      If you need more information, send me a message via my Google account or at qemu@suppser.de.

      br Tobias


      Bonaire, sold as Radeon R7 360 (HD 7790/R7 260)

      unpatched kernel
      Linux kvmtest01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1 (2015-05-24) x86_64 GNU/Linux

      qemu-system-x86_64 -version
      QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-12+deb8u1), Copyright (c) 2003-2008 Fabrice Bellard

      01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 665f (rev 81) (prog-if 00 [VGA controller])
      Subsystem: PC Partner Limited / Sapphire Technology Device e258
      Flags: bus master, fast devsel, latency 0, IRQ 50
      Memory at e0000000 (64-bit, prefetchable) [size=256M]
      Memory at f0000000 (64-bit, prefetchable) [size=8M]
      I/O ports at e000 [size=256]
      Memory at f7c00000 (32-bit, non-prefetchable) [size=256K]
      Expansion ROM at f7c40000 [disabled] [size=128K]
      Capabilities:
      Kernel driver in use: vfio-pci

      01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aac0
      Subsystem: PC Partner Limited / Sapphire Technology Device aac0
      Flags: bus master, fast devsel, latency 0, IRQ 17
      Memory at f7c60000 (64-bit, non-prefetchable) [size=16K]
      Capabilities:
      Kernel driver in use: vfio-pci


      user@kvmtest01:~/kvm$ cat kvm-start.sh
      #!/bin/bash

      INSTALLFILE=win7-uefi-x64_system.qcow2
      IMAGEFILE=win7-uefi-x64_system-01.qcow2
      #FILESIZE=50G

      # PCI addresses of the passthrough devices
      DEVICE1="01:00.0"
      DEVICE2="01:00.1"

      # load vfio-pci module
      modprobe vfio-pci

      for dev in "0000:$DEVICE1" "0000:$DEVICE2"; do
      vendor=$(cat /sys/bus/pci/devices/${dev}/vendor)
      device=$(cat /sys/bus/pci/devices/${dev}/device)
      if [ -e /sys/bus/pci/devices/${dev}/driver ]; then
      echo ${dev} > /sys/bus/pci/devices/${dev}/driver/unbind
      fi
      echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id
      done

      # create an image file from the backing file if it does not exist
      if [ ! -e $IMAGEFILE ]; then
      qemu-img create -f qcow2 -o backing_file=$INSTALLFILE,backing_fmt=qcow2 $IMAGEFILE
      fi


      QEMU_PA_SAMPLES=4096 QEMU_AUDIO_DRV=pa \
      taskset -c 0-2 \
      qemu-system-x86_64 \
      -enable-kvm \
      -m 4096 \
      -cpu host,kvm=off \
      -smp 3,sockets=1,cores=3,threads=1 \
      -machine pc-i440fx-2.1,accel=kvm \
      -soundhw hda \
      -bios /usr/share/ovmf/OVMF.fd `# SID version of OVMF` \
      -device vfio-pci,host=$DEVICE1,addr=0x8.0x0,multifunction=on,x-vga=on \
      -device vfio-pci,host=$DEVICE2,addr=0x8.0x1 \
      -vga none \
      -device qxl \
      -device virtio-net-pci,netdev=user.0,mac=52:54:00:a0:66:43 \
      -netdev user,id=user.0 \
      -drive file=$IMAGEFILE,if=none,id=drive-virtio-disk0,format=qcow2,cache=none \
      -device virtio-blk-pci,scsi=off,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
      -rtc base=localtime,driftfix=slew \
      -smb /home/user/Downloads/toinstall \
      -usb \
      -device usb-mouse \
      -device usb-kbd

    5. pci-ids says your R7 360 is not Bonaire:

      665f Tobago PRO [Radeon R7 360 / R9 360 OEM]

    6. OK, I got the info from this page. Sorry, it's in German, but it has a technical table.

      http://ht4u.net/news/31194_von_radeon_r7_360_bis_radeon_r9_390x_-_alle_daten_und_preise_bekannt/

      I will test the configuration with the i915 patch (it's an Intel i5 Haswell CPU) in the next few days.

    7. This reset fix didn't work for me on an R9 380. I got a BSOD with "thread stuck in device driver".

  3. Hi, after testing with the i915 patch and without OVMF, I can give you some feedback.

    qemu with -machine q35 and ioh3420 goes to a bluescreen after "Starting Windows".
    The AMD install tool can't install the driver, so I installed it from Device Manager.
    It works in Windows 7 safe mode, but not in normal mode.
    See the screenshots of three boots with -vga qxl; the same thing happens with -vga none and x-vga=on -device qxl.
    http://www.suppser.de/sharestuff/img/bluescr001.png
    http://www.suppser.de/sharestuff/img/bluescr002.png
    http://www.suppser.de/sharestuff/img/bluescr003.png

    qemu with -machine q35 and normal PCI goes to a black screen and the monitor powers off.
    The AMD install tool does not work in the installation configuration with -vga qxl and x-vga=off.
    It works in Windows 7 safe mode, but not in normal mode.

    qemu with -machine pc and normal PCI (PCIe is not supported) works in normal mode and across reboots.
    The AMD install tool does not work in the installation configuration with -vga qxl and x-vga=off.

    From yesterday:
    qemu with -machine pc, normal PCI, and the OVMF BIOS works in normal mode and across reboots.
    The AMD install tool does not work.

    Here is a GPU-Z screenshot; it also says Tobago. Don't trust a list from the internet ;-)
    http://www.suppser.de/sharestuff/img/Sapphire_R7_360.gif

    Can I help with more debugging or anything else? If so, let me know.

    br Tobias

  4. I believe that Fiji might also have problems with PCI reset. My current Fiji-based GPU does not crash on reset, but after a guest reset the display shows nothing until I either reboot or suspend the host. As far as I could test, the guest keeps running, but the GPU produces no display output. Since the workaround is the same (rebooting or suspending the host), I'm guessing this might be a different manifestation of the reset bug.

Comments are not a support forum. For help with problems, please try the vfio-users mailing list (https://www.redhat.com/mailman/listinfo/vfio-users)