Virtually Gaming, Part 1: In the Beginning – Hardware and Xen

For about two years now I have managed to stick to the “No Windows on bare metal.” policy. This was instituted for many reasons, including security and ease of backups (it is difficult to beat ZFS snapshots and send/receive functionality). The key reason for using Windows at all has been gaming, and both my wife and I play various games, mostly of the co-op FPS genre. While native Linux support has increased dramatically in that time, the availability of native Linux games still hasn’t quite reached parity with availability on the Windows platform.

Combining the “No Windows on bare metal.” policy with the requirement for high performance gaming capability meant that the only solution that fits is PCI passthrough of a high end GPU to the virtual machine. In this article I will describe the journey to the solution over the past two years, including (often unfortunate) choices of hardware, software, working around hardware, firmware, driver and software bugs, crippling and limitations, and other bumps on the road to virtualized gaming.

Hardware

When I first embarked on this project, it was an off-shoot of the project to upgrade my workstation. While there was nothing wrong with my Quad Core 2 in terms of performance, I needed to get a second machine up and running for my wife. So, somewhat optimistically, I thought this would be an ideal opportunity to solve three problems at the same time:

  1. Get a gaming grade workstation up and running for my wife
  2. Virtualize the Windows part of my dual-boot setup so I never have to reboot for the sake of joining a game when my friends invite me
  3. Implement the “No Windows on bare metal.” policy

The motherboard that caught my eye was the EVGA SR-2. It seemed to fit all of the necessary requirements:

  1. Plenty of CPU power (dual socket, capable of taking up to two 6-core Xeons)
  2. Full support for plenty of ECC memory (after the last build I have vowed to never build another machine without ECC RAM, having spent days troubleshooting a stock-setting stability issue that turned out to be marginal memory)
  3. Plenty of PCIe slots (7 x16 slots, with 64 usable PCIe lanes between them)
  4. VT-d support (originally listed in the spec, and confirmed with EVGA tech support prior to purchase – a claim that turned out to be rather stretching the truth)

A sizable investment into the motherboard, a pair of 6-core X5650 Xeons, and 48GB (6x8GB) of registered ECC RAM later, problems began.

Hardware Problems

The first motherboard I got turned out to have a faulty PCIe slot #1. The retailer I bought the motherboard from went bust a few weeks after my purchase, but EVGA generally have excellent RMA service, and I registered the motherboard as soon as I had received it to qualify for the full 10 year manufacturer’s warranty that was offered on this motherboard.

In order to not put my build on hold, before I RMA-ed the faulty SR-2, I bought another, second hand SR-2 on eBay. I thoroughly tested it, and to this day, this is the SR-2 that has been completely fault free in the main workstation that was the product of this project. It turns out I was quite lucky to have bought a second motherboard, because the replacement that was sent was also faulty, and failed to reliably finish POST-ing with either of my CPUs in either socket. That got RMA-ed as well, and the replacement is currently in use as a prototyping rig for the next incarnation of this workstation, but that motherboard also has problems which cause it to fail to boot on a hot reboot (I am putting off RMA-ing it until the prototyping stage of the project is completed and being without a working prototyping machine for a week won’t be a problem).

In conclusion: Beware EVGA warranty replacement motherboards – they are all refurbished items that were sent back as faulty, and either repaired or the fault was never reproducible by their testing team so they got recycled as is. Always test any refurbished replacements extremely thoroughly (all slots, sockets, ports and features) when you receive them – if you get a faulty replacement, EVGA will pay for the shipping costs back to them for another replacement, but only within the first month after you receive the replacement, so acting quickly and thoroughly is of vital importance to avoid courier costs that can quickly add up to a lot.

More RAM

At this time I looked into using 96GB of RAM on the SR-2. This turned out to be very difficult as the machine would generally refuse to POST, except after a fresh CMOS reset. This was particularly annoying because the CPUs themselves (which contain the MCH) officially support 192GB of RAM each. After a lot of trial and error, I found a way to make the machine reliably POST with 96GB of RAM:

  1. Use dual-ranked (this is important, single ranked won’t work for 96GB!) x4 registered 1600MHz 1.35V DIMMs
  2. Boot the machine with only 6 DIMMs. Go to the memory settings, and manually set all of the memory timings to what they defaulted to. Make sure you set the command rate to 2T (defaults to 1T).
  3. If you are overclocking, make sure you set the MCH strap to 1600MHz.

Do this and your SR-2 should POST with 96GB. It may require a few attempts where the motherboard resets itself and re-attempts the POST, but both of mine successfully POST within 30 seconds.

All of the symptoms indicate that there is a BIOS bug in timeouts at various stages of the POST that cause some initializations to fail and time out when more than 48GB of RAM is used. Officially, EVGA only claim the SR-2 supports up to 48GB of RAM, and it is unlikely they will be fixing this BIOS bug.

Hypervisor

Back when I began this project (late 2012), the only hypervisor with notable reports of GPU passthrough success without requiring a lot of manually applied experimental patches was Xen, so this was what I chose for the project. Additionally, my previous tests indicated that the performance overheads of using Xen were among the lowest of all the available hypervisors, so it seemed like a win-win situation.

The primary GPU in the machine was an ageing but perfectly adequate GeForce 8800GT that came from my previous workstation. Then I had to select a suitable GPU for passthrough to a virtual machine. Nvidia passthrough only worked on expensive Quadro (and not all Quadros, only the expensive ones), Tesla and Grid cards which they refer to as “MultiOS compatible”. The cost of most of those made them not an option worth considering. That meant trying an ATI card, so I got a cheap passively cooled single-slot Radeon HD6450. This is where a whole array of real problems began:

  1. The EVGA SR-2 motherboard uses a pair of NF200 PCIe bridges to multiplex 32 PCIe lanes available on the upstream Intel 5520 PCIe hub into 64 PCIe lanes available for GPUs. NF200 bridges have severe bugs and limitations when it comes to compatibility with VT-d. They bypass IOMMU for DMA transfers, so when the VM tries to access RAM within its virtual address space that overlaps the physical address of a PCI BAR (aperture) that belongs to a hardware device, the memory writes will hit the BAR, which will crash the machine (and maybe corrupt your disks, if the BAR being trampled belongs to a disk controller). The solution to this was to write a hvmloader patch that marked all of the IOMEM areas from the host as reserved. This was an ugly bodge that resulted in a fair amount of memory in the domU (what Xen calls guest VMs) becoming unusable, but it worked (and with enough RAM it wasn’t a major problem).
  2. More than likely related to point 1, this motherboard appears to have broken (or non-existent) support for interrupt remapping, which means that any devices passed to a VM have to have dedicated, unshared interrupts. If you pass a device sharing an interrupt to the VM, the VM will most likely crash the entire host. Problems 1 and 2 are very similar in symptoms (host crash), which made them quite difficult to troubleshoot and get to the bottom of because no one change to the configuration made the problem go away. It took some help from the Xen developers and a fair amount of guesswork to figure it all out. The only solution is to move cards around to different slots until all of the hardware you intend to pass through to virtual machines has dedicated interrupts that aren’t shared with other hardware. This can be fiddly, but it is generally achievable – in my final configuration, I am successfully passing two GPUs and three USB controllers to VMs.
  3. ATI cards suffer from terrible drivers that fail to re-initialize the card without full BIOS level re-POST-ing (and said re-POST-ing doesn’t happen when the VM is rebooted, only when the entire physical machine is rebooted). The consequence is that they work OK when the VM is first booted up after a host reboot, but subsequent VM reboots result in massive performance degradation, glitches, and sometimes complete host crashes. While some of this is being worked on (e.g. functionality to reset the GPU via a bus reset from Xen dom0), it is still not available in the current released version. This particular problem turned out to not be easily solvable (having already written a patch for Xen’s hvmloader, I was very keen to avoid having to write any more to implement PCI bus resetting functionality for the Xen pci-stub driver). To at least be able to prove the concept, I bought the cheapest Nvidia Quadro that is supported for GPU passthrough (Quadro 2000), and this worked absolutely fine. Having finally found a solution that works perfectly, I went on to find ways of making GeForce cards work with PCI passthrough by fooling the Nvidia driver into initializing them even though they weren’t expensive enough, by changing the cards’ ID number to that of an equivalent Quadro card. As discussed in previous articles, Nvidia cards up to and including the Fermi generation can be modified into equivalent Quadro cards by changing the appropriate ID strap bits in the cards’ BIOS using nvflash. Kepler cards require a small hardware modification. The easiest modifications are GTX 680 to Tesla K10 (remove one resistor) and GTX 780Ti to Quadro K6000 (add one large, easy to solder resistor across appropriate pins on the EEPROM). I am currently running a pair of GTX 780Ti cards.
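
The dedicated-interrupt requirement from item 2 can be checked before and after moving cards between slots. The following is a minimal Python sketch, not part of the original setup: it parses text in the format of /proc/interrupts and reports IRQ lines that have more than one handler attached. The function name shared_irqs and the sample data are illustrative assumptions, and the parsing is a heuristic that assumes handler names are separated by a comma and a space, as on typical Linux kernels; under a Xen dom0 the controller and handler names will look different.

```python
def shared_irqs(interrupts_text):
    """Return {irq: [handler names]} for numeric IRQs with more than one handler."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or not label.isdigit():
            continue  # skip NMI:, LOC:, ERR: and similar summary rows
        fields = rest.split()
        # Everything after the per-CPU counters: controller type, then handlers.
        tail = " ".join(fields[ncpus:])
        if ", " not in tail:
            continue  # zero or one handler on this line
        parts = tail.split(", ")
        head = parts[0].split()
        first = head[-1] if head else ""
        shared[int(label)] = [first] + [p.strip() for p in parts[1:]]
    return shared


# Made-up sample in the usual /proc/interrupts layout (hypothetical data).
SAMPLE = """\
           CPU0       CPU1
  0:         46          0   IO-APIC-edge      timer
 16:       1234          5   IO-APIC-fasteoi   uhci_hcd:usb3, ahci
 19:         10          0   IO-APIC-fasteoi   ehci_hcd:usb1
NMI:          0          0   Non-maskable interrupts
"""

for irq, handlers in sorted(shared_irqs(SAMPLE).items()):
    print("IRQ %d is shared by: %s" % (irq, ", ".join(handlers)))
```

On a live system the same function can be fed the contents of /proc/interrupts read from dom0. Any device intended for passthrough that shows up in the result is a candidate for moving to a different slot.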

Issues 1 and 2 listed above are why I said that claiming the SR-2 supports VT-d was seriously stretching the truth. On a well designed workstation motherboard, the above problems should never have arisen. After all that, and many, many man-days invested in working around the various bugs mentioned above, I have Xen working on the system, with EL6 (CentOS) dom0, and two domUs, one running XP x64 and one running Windows 7 x64. The hardware passed through at the PCIe level is:

XP x64:

  • Intel ICH10 HD Audio
  • 2x ICH10 USB
  • GeForce GTX 780Ti

Windows 7 x64:

  • NEC USB 3 controller
  • GeForce GTX 780Ti

GRUB options:

 kernel /xen.gz noreboot unrestricted_guest=1 msi=1
 module <kernel and options> intel_iommu=on pcie_ports=compat

Note that unrestricted_guest=1 and pcie_ports=compat are required on the SR-2, but may not be required if your hardware behaves better. If your IOMMU implementation is good and includes ACS functionality, you shouldn’t need unrestricted_guest=1.

pcie_ports=compat is required because without it the SR-2 makes the PCI hotplug driver flap very quickly on one of the PCI devices built into the south bridge chipset, which causes an interrupt flood that makes the machine grind to a halt. (Have I mentioned enough times yet that the SR-2 is extremely buggy?)

Xen domU config:

name="mydomu"
description="None"
uuid="a57e6840-e9f5-4a14-a822-abcdef012345"
memory=16384
maxmem=16384
vcpus=6
on_poweroff="destroy"
on_reboot="restart"
on_crash="destroy"
localtime=1
keymap="en-gb"

builder="hvm"
device_model_override="/usr/lib/xen/bin/qemu-dm"
device_model_version="qemu-xen-traditional"
boot="c"
disk=[ '/dev/zvol/ssd/mydomu,raw,hda,rw', '/dev/sr0,raw,hdc:cdrom,rw' ]
vif=[ 'mac=00:11:22:33:44:55,bridge=br0,model=e1000', ]
stdvga=1
usb=1
acpi=1
apic=1
pae=1
gfx_passthru=0
pci = [ '07:00.0', '07:00.1', '00:1b.0', '00:1a.1' ]
xen_platform_pci=1
pci_msitranslate=0
pci_power_mgmt=1

Obviously you will need to change things like PCI addresses, MAC addresses, block device paths, and suchlike to suit your own system.

/etc/modprobe.d/xen-pciback.conf:

options xen-pciback permissive=1 hide=(07:00.0)(07:00.1)(00:1b.0)(00:1a.1)

Note that the PCI addresses in the xen-pciback module options correspond to the PCI addresses in the Xen domU configuration. You may not need permissive=1 if you have better hardware than I do.
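
Since the two lists of PCI addresses have to stay in step, one way to avoid drift is to generate both lines from a single list. The following Python sketch is only an illustration under that assumption; the helper names pciback_options and domu_pci_line are hypothetical, and the output simply reproduces the format of the two configuration lines shown above.

```python
import re

# Matches bus:device.function, with an optional 4-digit PCI domain prefix.
BDF_RE = re.compile(r"^(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def check_addresses(bdfs):
    """Raise ValueError if any entry does not look like a PCI address."""
    for bdf in bdfs:
        if not BDF_RE.match(bdf):
            raise ValueError("not a PCI address: %r" % (bdf,))


def pciback_options(bdfs, permissive=True):
    """Build the line for /etc/modprobe.d/xen-pciback.conf."""
    check_addresses(bdfs)
    hide = "".join("(%s)" % bdf for bdf in bdfs)
    perm = "permissive=1 " if permissive else ""
    return "options xen-pciback %shide=%s" % (perm, hide)


def domu_pci_line(bdfs):
    """Build the matching pci = [ ... ] line for the domU config."""
    check_addresses(bdfs)
    return "pci = [ " + ", ".join("'%s'" % bdf for bdf in bdfs) + " ]"


# The addresses used in the configuration above; change to suit your system.
DEVICES = ["07:00.0", "07:00.1", "00:1b.0", "00:1a.1"]

print(pciback_options(DEVICES))
print(domu_pci_line(DEVICES))
```

Passing permissive=False drops the permissive=1 option for hardware that does not need it.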

And Another Thing

One thing I feel I have to mention is that I have had extremely bad experience with every SAS card I have tried to use in the SR-2 with virtualization. This includes two different LSI cards, an Adaptec card and a 3ware card. They all work fine in a normal bare metal setup, but cause all kinds of crash inducing problems, some more difficult to debug than others, when IOMMU is enabled and VMs are running with PCI devices passed through to them. SATA cards (I tried Silicon Image and Marvell), on the other hand, seem to always work just fine, with no problems whatsoever, including when using 1:5 SATA port multipliers. In some cases this is caused by the SAS controller being a native PCI-X chip and using a phantom PCI-X to PCIe bridge. In other cases it seems to be caused by the SAS card’s driver trying to do some interesting DMA accesses that crash the entire host when virtual machines are running with PCI devices passed through to them. In short – avoid using SAS cards and stick with SATA – but then again I find that to be good advice to follow regardless.

This setup has worked without any significant problems for the past two years. But things have changed in that time. There is now a native Linux version of Steam, and many games have native Linux ports. It is time that this long term reliable system is updated accordingly. More on that in the next article.

Virtualized Gaming: Nvidia Cards, Part 2: GeForce, Quadro, and GeForce Modified into a Quadro – Higher End Fermi Models

Following the success with QuadForce 2450 modification (GeForce GTS450 -> Quadro 2000), I went on to investigate whether the same modification will work on the GTX470 to turn it into a Quadro 5000 and on a GTX480 to turn it into a Quadro 6000. Modifying a GTX580 into a somewhat obscure Quadro 7000 was also undertaken.

MODEL            CORE CONFIGURATION   MEMORY CHANNELS   MEMORY
GeForce GTX470   448:56:40            5x                1.25GB
GeForce GTX480   480:60:48            6x                1.50GB
Quadro 5000      352:44:40            5x                2.50GB
Quadro 6000      448:56:48            6x                6.00GB

In all three cases, the modifications were successful, and they all worked as expected – features like VGA passthrough work on the 5000 and 6000 models and gaming performance is excellent, as you would expect – I can play Crysis at 3840×2400 in a virtual machine. Again, the extra GL functions aren’t there (if you compare the output of glxinfo between a real Quadro and a QuadForce, you will find a number of GL primitives missing), so some aspects of OpenGL performance are still crippled. PhysX support is also a little hit-and-miss. In a VM, on Windows 7 it seems to work on Quadro cards; on XP it appears to not be working. On bare metal on Windows XP it works. This appears to be due to the Quadro driver itself, rather than due to the cards not being genuine Quadros.

Finally, the GF100 based cards (GTX470/480) also get an extra feature enabled by the modification – second DMA channel. Normally there is a unidirectional DMA channel between the host and the card. Following the modification, the second DMA channel in the other direction is activated. This has a relatively moderate impact on gaming performance, but it can have a very large impact on performance of I/O bound number crunching applications since it increases the memory bandwidth between the card and the system memory (you can read and write to/from the GPU memory at the same time). Compare the CUDA-Z Memory report for the GTX470 before and after modifying it into a Quadro 5000 – GTX470 only has a unidirectional async memory engine, but after modifying it the engine becomes bidirectional:

The same happens on the GTX480 – its async engine also becomes bidirectional after modification.

Quadro 7000 is a little different from the other two. It doesn’t have dual DMA channels, and Nvidia don’t list it as MultiOS capable. The drivers do not make the adjustments necessary for it to work with VGA passthrough. That means that, unfortunately, the benefit of modifying a GTX580 is questionable. Note, however, that the Quadro 7000 was never aimed at the virtualization market; it was only available as a part of the QuadroPlex 7000 product – an external GPU enclosure designed for driving multiple monitors for various visualisation work. Hence the lack of MultiOS support on it.

Here is how the QuadForce 5470 does in SPECviewperf (GTX470 = 100%):

Compared to the QuadForce 2450, the performance improvements are more modest – the only real difference is observable in the lightwave benchmark.

Unfortunately, my QuadForce 6480 is currently in use, so I cannot get measurements from it, but since they are both based on the GF100 GPU, the results are expected to be very similar.

On the QuadForce 7580 there was no observed SPEC performance improvement.

I have since acquired a Kepler-based 4GB GTX680 and successfully modified it into a Quadro K5000. Modifying it into a Grid K2 also works, but there don’t appear to be any obvious advantages from doing so at the moment (K5000 works fine for virtualization passthrough, even though it wasn’t listed as MultiOS last time I checked). This QuadForce K5680 is why my GTX470 became free for testing again. More on Quadrifying Keplers in the next article. I also have a GTX690 now (essentially two 680s on the same PCB), which will be replacing the QuadForce 6480, so this will also be written up in due time. Unfortunately, however, quadrifying Keplers in most cases requires some hardware as well as BIOS modifications. I will post more on all this soon, along with a tutorial on soft-modding.

Virtualized Gaming: Nvidia Cards, Part 1: GeForce, Quadro, and GeForce Modified into a Quadro

Recently I built a new system with the primary intention of running Linux the vast majority of the time and never having to stop what I am doing to reboot into Windows every time I wanted to play a game. That meant gaming in a VM, which in turn meant VGA passthrough. I am an Enterprise Linux 6 user, and Fedora is too bleeding edge for me. What I really wanted to run is KVM virtualization, but the support for VGA passthrough didn’t seem to work for me with EL6 packages, even after a selective update to a much newer kernel, qemu and libvirt related packages. VMware ESX won’t work with PCI passthrough on my EVGA SR-2 motherboard because EVGA, in their infinite wisdom, decided to put all the PCIe slots behind Nvidia NF200 routers/bridges which don’t support PCIe ACS functionality, which ESX requires for PCI passthrough. That left me with Xen as the only remaining option. I now mostly have Xen working the way I want – not without issues, but I will cover virtualized gaming and Xen details in another article. For now, what matters is that Xen VGA passthrough currently only works with ATI cards and Nvidia Quadro (but not GeForce) cards.

ATI cards are not an option for me due to various driver bugs (e.g. handling monitors on which refresh rate is dependent on resolution due to bandwidth limitations), lack of features (no option to use anything but EDID modes, to the extent of completely ignoring monitor driver .inf files; the custom mode feature used to exist in the drivers, and the documentation for it can still be found on the AMD website, but it has been removed at some point) and, most importantly, lack of multiple DL-DVI outputs on cards more recent than the Radeon HD4xxx series (Radeon HD5xxx and later cards only come with a single DL-DVI port – on those that come with a second DVI port, even though it physically looks like a DL, it only provides a single link).

Nvidia GeForce cards don’t work in a virtual machine, at least not without unmaintained patches that don’t work with all cards and guest operating systems.

That leaves Nvidia Quadro cards. Unfortunately, those are eyewateringly expensive. But, on paper, the spec lists the same GPUs used on GeForce and Quadro cards. This got me looking into what makes a Quadro a Quadro, and a few days of research and a weekend of experimentation yielded some interesting and very useful results. While it looks like some features such as certain GL functions are disabled in the chips (probably by laser cutting), some features are purely down to the driver deciding whether to enable them or not. It turns out, making cards work in a VM is one of the driver-dependent features.

Phase 1: Verify That Quadro Cards Work in a VM When GeForce Don’t

Looking at the specification and feature list of Quadro cards, Quadro 2000, 4000, 5000 and 6000 models support the “MultiOS” feature, which is what Nvidia calls VGA passthrough. So, the first thing I did was acquire a “cheap” second hand Quadro 2000 on eBay. Cheap here being a relative term because a second hand Quadro costs between 3 and 8 times the amount the equivalent (and usually higher specification) GeForce card costs. The Quadro card proved to work flawlessly, but the Quadro 2000 is based on a GF106 chip with only 192 shaders, so gaming performance was unusable at 3840×2400 (I will let go of my T221 monitors when they are pried out of my cold, dead fingers). Gaming at 1920×1200 was just about bearable with some detail level reductions, but even so it was borderline.

Here is how the genuine Quadro 2000 shows up in GPU-Z and CUDA-Z:

And here are the genuine Quadro 2000 SPECviewperf11 results:

VIEWSET        COMPOSITE
catia-03       23.86
ensight-04     16.63
lightwave-01   43.12
maya-03        36.25
proe-05        7.07
sw-02          32.21
tcvis-02       18.82
snx-01         17.50

Phase 2: Get an Equivalent GeForce Card and Investigate What Makes a Quadro a Quadro

The next item on the acquisition list was a GeForce GTS450 card. On paper the spec for a GTS450 is identical to a Quadro 2000:

  • GF106 GPU
  • 192 shaders
  • 1GB of GDDR5

Note: There are some models that are different despite also being called GTS450. Specifically, there is an OEM model that only has 144 shaders, and there is a model with 192 shaders but with GDDR3 memory rather than GDDR5. The GDDR3 model may be more difficult to modify due to various differences, and the 144 shader model may not work properly as a Quadro 2000.

Armed with the information I dug out, I set out to modify the GTS450 into a QuadForce (a splice between a Quadro and a GeForce – and Gedro just doesn’t sound right). This was successful: the card is now detected as a Quadro 2000, and everything seemed to work accordingly. The VGA passthrough worked, and since the GTS450 is clocked significantly higher than the Quadro 2000, the gaming performance was improved to the point where 1920×1200 performance was quite livable with. What didn’t improve to Quadro levels is OpenGL performance of certain functions that appear to have been disabled on the GeForce GPUs. Consequently, SPECviewperf11 results are much lower than on a real Quadro 2000 card, but the GeForce GTS450 scores higher on every gaming test since games don’t use the missing functionality, and the GeForce card is clocked higher. It is unclear at the moment whether the extra GL functionality was disabled on the GPU die by laser cutting or whether it is disabled externally to the GPU, e.g. by different hardware strapping or pin shorting via the PCB components – more research into this will need to be done by someone more interested in those features than me. Since the stamped-on GPU markings are different between the GTS450 (GF106-250, checked against 3 completely different GDDR5 GTS450 cards) and the Quadro 2000 (GF106-875 on the one I have), it seems likely the extra GL functionality is laser cut out of the GPU.

Here is how the GTS450 modified to Quadro 2000 shows up in GPU-Z and CUDA-Z:

CUDA-Z performance seems to scale with the clock speeds, so the faux-Quadro card wins.

Here are the SPECviewperf11 results for a GTS450 before and after modifying it into a Quadro 2000. As you can see, in this test those missing GL functions make a huge difference, but in some tests there is still a substantial improvement:

GTS450:

VIEWSET        COMPOSITE
catia-03       3.33
ensight-04     20.67
lightwave-01   10.80
maya-03        5.38
proe-05        0.36
sw-02          6.75
tcvis-02       0.35
snx-01         2.37

QuadForce 2450:

VIEWSET        COMPOSITE
catia-03       3.24
ensight-04     17.83
lightwave-01   10.72
maya-03        7.75
proe-05        0.37
sw-02          6.87
tcvis-02       0.35
snx-01         2.35

Here is the data in chart form (relative performance, real Quadro 2000 = 100%).

As you can see, the real Quadro dominates in all tests except ensight-04, where it gets soundly beaten by the GeForce card. Modification does seem to improve some aspects of performance. In particular, Maya results seem to improve by a whopping 44% following the modification.
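
The relative figures can be recomputed directly from the three SPECviewperf11 tables above. The following Python sketch does just that; the dictionary and function names are illustrative, and the scores are copied from the tables in this article.

```python
# SPECviewperf11 composite scores, copied from the tables above.
QUADRO_2000 = {
    "catia-03": 23.86, "ensight-04": 16.63, "lightwave-01": 43.12,
    "maya-03": 36.25, "proe-05": 7.07, "sw-02": 32.21,
    "tcvis-02": 18.82, "snx-01": 17.50,
}
GTS450 = {
    "catia-03": 3.33, "ensight-04": 20.67, "lightwave-01": 10.80,
    "maya-03": 5.38, "proe-05": 0.36, "sw-02": 6.75,
    "tcvis-02": 0.35, "snx-01": 2.37,
}
QUADFORCE_2450 = {
    "catia-03": 3.24, "ensight-04": 17.83, "lightwave-01": 10.72,
    "maya-03": 7.75, "proe-05": 0.37, "sw-02": 6.87,
    "tcvis-02": 0.35, "snx-01": 2.35,
}


def relative(scores, baseline):
    """Scores as a percentage of the baseline card (baseline = 100%)."""
    return {k: round(100.0 * scores[k] / baseline[k], 1) for k in baseline}


def percent_change(before, after):
    """Percentage change in each viewset from 'before' to 'after'."""
    return {k: round(100.0 * (after[k] / before[k] - 1.0), 1) for k in before}


gts_rel = relative(GTS450, QUADRO_2000)
qf_rel = relative(QUADFORCE_2450, QUADRO_2000)
delta = percent_change(GTS450, QUADFORCE_2450)

print("%-13s %10s %10s %10s" % ("viewset", "GTS450 %", "QuadForce %", "change %"))
for name in QUADRO_2000:
    print("%-13s %10.1f %10.1f %+10.1f" % (name, gts_rel[name], qf_rel[name], delta[name]))
```

The maya-03 change comes out at roughly 44%, and ensight-04 is the only viewset where either GeForce-based card exceeds 100% of the real Quadro 2000.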

If you are only interested in support and VGA passthrough for virtual machines, modifying a GeForce card to a Quadro can be an extremely cost effective solution (especially if your budget wouldn’t stretch to a real Quadro card anyway). If you are only interested in performance of the kind measured by SPECviewperf, then depending on the applications you use, a real Quadro is still a better option in most cases.