QNAP TS-421 – Review, Modification and RedSleeve Linux

Requirement

With the RedSleeve Linux release rapidly approaching, I needed a new server. The current one is a DreamPlug with an SSD, and although it has so far worked valiantly with perfect reliability, it doesn't have enough space to hold all of the newly built RPM packages (over 10,000 of them, including the multiple versions that the upstream distribution contains), and it is a little lower on CPU (1.2GHz single core) and RAM (512MB) than ideal to handle the load spike that will inevitably happen once the new release becomes available. I also wanted a self-contained system that doesn't require special handling, with no cables hanging off of it (such as SATA or USB external disks). I briefly considered the Tonido2 Plug, but between the slower CPU (800MHz) and the US plug, it seemed like a step backwards just for the added tidiness of having an internal disk.

Specification

The requirements I had in mind needed to cover at least the following:
1) ARM CPU
2) SATA
3) At least a 1.2GHz CPU
4) At least 512MB of RAM
5) Everything should be self-contained (no externally attached components)

Selection

Very quickly the choice started to focus on various NAS appliances, but most of them had relatively non-existent community support for running custom Linux based firmware. The one exception is QNAP NAS devices, which have rather good support from the Debian community; and where there is a procedure to get one Linux distribution running, getting another to run is usually very straightforward. After a quick look through the specifications, I settled on the QNAP TS-421, which seems to be the highest spec ARM based model:

CPU: 2GHz ARMv5 Marvell Kirkwood (same as in the DreamPlug but 66% higher clock speed)
RAM: 1GB (twice as much as DreamPlug)
SATA: 4x 3.5″ SATA disk trays, based on the excellent Marvell 88SX7042 PCIe SATA controller
eSATA: 2x
Ethernet: 2x Gigabit (same as DreamPlug)
USB: 2x 2.0, 2x 3.0

Disks

At the time I ordered the QNAP TS-421, it was listed as supporting 4TB drives – the largest air-filled drives available at the time. I ordered 4x 4TB HGST drives because they are known to be more reliable than other brands. In the 10 days since then, Toshiba has announced 5TB drives, but these are not yet commercially available. I briefly considered the 6TB helium-filled Hitachi drives, but these are based on a new technology that has not been around long enough for long term reliability trends to emerge – and besides, they are prohibitively expensive (£87/TB vs £29/TB for the 4TB model), and to top it all off, they are not yet available to buy.

Overview

Once the machine arrived, it was immediately obvious that the build quality is superb. One thing, however, bothered me immediately – it uses an external power brick, which seems like a hugely inconvenient oversight on an otherwise extremely well designed machine.

In order to play with alternative Linux installations, I needed serial console access. For this you will need a 3.3V TTL serial cable, the same type used on the Raspberry Pi; these are cheaply available from many sources. One thing I discovered the hard way, after some trial and error, is that you need to cross over the RX and TX lines between the cable and the QNAP motherboard, i.e. RX on the cable needs to connect to TX on the motherboard, and vice versa. There is also no need to connect the VCC line (red) – leave it disconnected. My final goal was to get RedSleeve Linux running on this machine; the process for that is documented on the RedSleeve wiki, so I will not go into it here.
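
If you want to capture the console output programmatically rather than with a terminal program like minicom or screen, a minimal sketch along these lines (using pyserial) will do. The device name and the 115200 8N1 settings are assumptions based on the usual Marvell Kirkwood U-Boot defaults, so adjust them if you only see garbage:

# Minimal serial console reader sketch; needs pyserial (pip install pyserial).
# /dev/ttyUSB0 and 115200 8N1 are assumed defaults - check your own setup.
import serial

console = serial.Serial("/dev/ttyUSB0", baudrate=115200,
                        bytesize=8, parity="N", stopbits=1, timeout=1)
try:
    while True:
        data = console.read(4096)   # returns b"" when the timeout expires
        if data:
            print(data.decode("utf-8", errors="replace"), end="", flush=True)
except KeyboardInterrupt:
    console.close()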

Modifying

One thing that becomes very obvious upon opening the QNAP TS-421 is that there is ample space inside it for a PSU, which makes the decision to use an external power brick all the more ill-considered. So much so that I felt I had to do something about it. It turns out the standard power brick it ships with fits just fine inside the case. Here is what it looks like fitted.

QNAP TS-421 with internalized PSU

It is very securely attached using double-sided foam tape. Make sure you make some kind of a gasket to fit between the PSU and the back of the case – this is to avoid upsetting the carefully designed airflow through the case. I used some 3mm thick expanded polyurethane, which works very well for this purpose. The cable tie is there just for extra security and to tidy up the coiled DC cable that goes back out of the case and into the motherboard's power input port. This necessitated punching two 1 inch holes in the back of the case – one for the input power cable and one for the 12V DC output cable. I used a Q.Max 1 inch sheet metal hole punch to do this. There is an iris type grommet for the DC cable to prevent any potential damage arising from it rubbing on the metal casing.

QNAP TS-421 with cable holes punched through the back of the case

The finished modification looks reasonably tidy and is a vast improvement on a trailing power brick.

QNAP TS-421 running RedSleeve Linux

One other thing worth mentioning is that internalizing the PSU makes no measurable difference to internal temperatures with the case closed. In fact, if anything, the PSU itself runs cooler than it does on the outside, thanks to the cooling fan inside the case. The airflow inside the case is incredibly well designed, which is why it is vital to use a gasket to seal the gap between the power input port on the PSU and the back of the case. To give you an idea of just how well the airflow is designed: with the case off, the HGST drives run at about 50-55C idle and 60-65C under load; with the case on, they run at about 30C idle and 35C under full load (e.g. ZFS scrub or SMART self tests).

Virtualized Gaming: Nvidia Cards, Part 3: How to Modify 2xx – 4xx series GeForce into a Quadro

There has been a large amount of interest in the previous two articles in this series and many calls for a modifying guide. In this article I will explain the details of how to modify your Fermi based GeForce card into the corresponding equivalent Quadro card. Specifically, the following conversions are covered:

GeForce Model     GPU      Quadro Model
GeForce GTS450    GF106    Quadro 2000
GeForce GTX470    GF100    Quadro 5000
GeForce GTX480    GF100    Quadro 6000

The Tesla (2xx/3xx) and Fermi (4xx) series of GPUs can both be converted by modifying the BIOS. Earlier cards can also be modified, but the procedure is slightly different to what is described in this article. There is no hardware modification required on any of these cards. The modification is performed by changing what are known as the "straps" that configure the GPU at initialization time. The nouveau project (the free, open source Nvidia driver implementation for Xorg) has reverse engineered and documented some of the straps, including the device ID locations. We can use this to change the device ID the card reports. This causes the driver to enable a set of features that it wouldn't normally expose on a gaming grade card, even though the hardware is perfectly capable of them (you are only supposed to have those features if you paid 4-8x more for what is essentially the same (and sometimes even inferior) card by buying a Quadro).

The main benefit of doing this modification is enabling the card to work in a virtual machine (e.g. Xen). If the driver recognizes a GeForce card, it will refuse to initialize the card from a guest domain. Change the card’s device ID into a corresponding Quadro, and it will work just fine. On the GF100 models, it will even enable the bidirectional asynchronous DMA engine which it wouldn’t normally expose on a GeForce card even though it is there (on GF100 based GeForce cards only a unidirectional DMA engine is exposed). This can potentially significantly improve the bandwidth between the main memory and GPU memory (although you probably won’t notice any difference in gaming – it has been proven time and again that the bandwidth between the host machine and the GPU is not a bottleneck for gaming workloads).

Another thing that this modification will enable is TCC mode. This is particularly of interest to users of Windows Vista and later, because it avoids some of the graphics driver overheads by putting the card in a mode used only for number-crunching. Note: Although most Quadros have TCC mode available, you may want to look into modifying the card into a corresponding Tesla model if you are planning to use it purely for number crunching. You can use the same method described below: just find a Tesla based on the same GPU with an equal or lower number of enabled shader processors, find its device ID in the list linked at the bottom of the article, and change the device ID using the strap.

Before you even begin contemplating this, make sure you know what you are doing – the instructions here come with no warranty. If you are not confident that you know what you are doing, buy a pre-modified card from someone instead, or get somebody who does know what they are doing to do it for you.

To do this, you will require the following:

  • NVFlash for Windows and/or NVFlash for DOS
    Note: You may need to use the DOS version – for some reason the Windows version didn’t work on some of my Fermi cards. If you use the DOS version, make sure you have a USB stick or other media set up to boot into DOS.
  • Hex editor. There are many available. I prefer to use various Linux utilities, but if you want to use Windows, HxD is a pretty good hex editor for that OS. It is free, but please consider making a small donation to the author if you use it regularly.
  • Spare Graphics card, in case you get it wrong. If you are new to this, your boot graphics card (the spare one, not the one you are planning to modify) should preferably not be an Nvidia one (to avoid potential embarrassment of flashing the wrong card). Skip this part at your peril.

On Fermi BIOSes the strap area is 16 bytes long and starts at file offset 0x58. Here is an example based on my PNY GTX480 card:
0000050: e972 2a00 de10 5f07 ff3f fc7f 0040 0000 .r*..._..?...@..
0000060: ffff f17f 0000 0280 7338 a5c7 e92d 44e9 ........s8...-D.

The very important thing to note here is that the byte order is little-endian. That means that in order to decode this easily, you should re-write the strap data (the last 8 bytes of the first line and the first 8 bytes of the second) as:
7FFC 3FFF 0000 4000 7FF1 FFFF 8002 0000

This represents two sets of straps, each containing an AND mask and an OR mask. The hardware level straps are AND-ed with the AND mask, and then OR-ed with the OR mask.
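
If you would rather not do the byte swapping by hand, a small sketch like the one below reads the 16 strap bytes out of a BIOS dump and prints the two AND/OR pairs. The file name is just a placeholder for whatever your nvflash backup is called:

# Decode the Fermi strap area from a BIOS dump (offset 0x58, 16 bytes).
import struct

with open("gtx480.rom", "rb") as rom:   # placeholder name for your own dump
    rom.seek(0x58)
    raw = rom.read(16)

and0, or0, and1, or1 = struct.unpack("<4I", raw)   # four little-endian uint32s
print("AND-0: 0x%08X  OR-0: 0x%08X" % (and0, or0))
print("AND-1: 0x%08X  OR-1: 0x%08X" % (and1, or1))
# For the dump above this prints:
#   AND-0: 0x7FFC3FFF  OR-0: 0x00004000
#   AND-1: 0x7FF1FFFF  OR-1: 0x80020000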

The bits that control the device ID are 10-13 (ID bits 0-3) and 28 (ID bit 4). We can ignore the last 8 bytes of the strap, since all the bits controlling the device ID are in the first 8 bytes.

This makes the layout of the strap bits we need to change a little more obvious:

Fxx4xxxx xxxxxxxx xx3210xx xxxxxxxx
   ^                ^^^^
   |                ||||-pci dev id[0]
   |                |||--pci dev id[1]
   |                ||---pci dev id[2]
   |                |----pci dev id[3]
   |---------------------pci dev id[4]
F - cannot be set, always fixed to 0

The device ID of the GTX480 is 0x06C0. In binary, that is:
0000 0110 1100 0000
We want to modify it into a Quadro 6000, which has the device ID 0x06D8. In binary that is:
0000 0110 1101 1000

The device ID differs only in the low 5 bits, which is good because we only have the low 5 bits available in the soft strap.

So we need to modify as follows:
From:   0000 0110 1100 0000
To:     0000 0110 1101 1000
Change: xxxx xxxx xxx1 1xxx

We only need to change two of the strap bits from 0 to 1. We can do this by only adjusting the OR part of the strap.

It is easier to see what is going on if we represent this as follows:

ID Bit:   4                  32 10
Strap: -xxA xxxx xxxx xxxx xxAx xxxx xxxx xxxx
Old Strap:
AND-0: 7F        FC        3F        FF
       0111 1111 1111 1100 0011 1111 1111 1111
OR-0:  00        00        40        00
       0000 0000 0000 0000 0100 0000 0000 0000
New Strap:
AND-0: 7F        FC        3F        FF
       0111 1111 1111 1100 0011 1111 1111 1111
OR-0:  10        00        60        00
       0001 0000 0000 0000 0110 0000 0000 0000

Note that in the edit mask above, bit 31 is marked as “-”. Bit 31 is always 0 in both AND and OR strap masks.
Bits we must keep the same are marked with “x”. Bits we need to amend are marked with “A”.
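
To double-check the bit bookkeeping before going anywhere near the card, a short sketch like this derives the new OR-0 mask from the two device IDs. The values below are the GTX480/Quadro 6000 ones used in this example, so substitute your own for other cards:

# Derive the new OR-0 strap mask needed to report a different device ID.
# Strap bit mapping as documented by nouveau and described above:
# device ID bits 0-3 live in strap bits 10-13, ID bit 4 lives in strap bit 28.
STRAP_BIT = {0: 10, 1: 11, 2: 12, 3: 13, 4: 28}

OLD_ID, NEW_ID = 0x06C0, 0x06D8   # GeForce GTX480 -> Quadro 6000
OR0_OLD = 0x00004000              # OR-0 value read out of the BIOS dump

diff = OLD_ID ^ NEW_ID
assert diff < 0x20, "IDs differ outside the low 5 bits - not strap-moddable"
assert diff & OLD_ID == 0, "a bit needs to go 1->0; the AND mask must be edited too"

or0_new = OR0_OLD
for id_bit, strap_bit in STRAP_BIT.items():
    if diff & (1 << id_bit):      # set every ID bit that has to flip from 0 to 1
        or0_new |= 1 << strap_bit

print("ID bits changing: 0x%02X" % diff)      # prints 0x18 (ID bits 3 and 4)
print("New OR-0 mask:    0x%08X" % or0_new)   # prints 0x10006000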

So what we need to do is flash the edited strap to the card. We could edit the BIOS image directly, but that would require recalculating the strap checksum, which is tedious. Instead we can use nvflash to rewrite the strap for us, and it will take care of the checksum transparently.
The new strap is:
0x7FFC3FFF 0x10006000 0x7FF1FFFF 0x80020000
The second strap pair is unchanged from what we read out of the BIOS above. Make sure you have ONLY changed the device ID bits and that your binary to hex conversion is correct – otherwise you stand a very good chance of bricking the card.

We flash this onto the card using:
nvflash --index=X --straps 0x7FFC3FFF 0x10006000 0x7FF1FFFF 0x00020000
Note:
1) The last OR strap is 0x00020000 even though the data in the BIOS reads as if it should be 0x80020000. You cannot set the high bit (the left-most one) to 1 in the OR strap (just like you cannot set it to 0 in the AND strap). Upon flashing, nvflash will turn the high bit to 1 for you, and what ends up in the BIOS will be 0x80020000 even though you set it to 0x00020000. This is rather unintuitive and poorly documented.
2) You will need to check the index of the card you plan to flash using nvflash -a, and replace X with the appropriate value.

Here is an example (from my GTX480, directly corresponding to the pre-modification fragment above) of how the ROM differs after changing the strap:

0000050: e972 2a00 de10 5f07 ff3f fc7f 0060 0010 .r*..._..?...`..
0000060: ffff f17f 0000 0280 7338 a597 e92d 44e9 ........s8...-D.

The difference at byte 0x6B is the strap checksum that nvflash calculated for us.
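
If you want to confirm that nothing else was touched, a byte-by-byte comparison of the pre- and post-flash dumps is enough. The file names are placeholders for your own nvflash backups:

# List every byte that differs between two BIOS dumps. After a strap change
# you should only see differences in the strap area (0x58-0x67) plus the
# strap checksum byte.
with open("gtx480-before.rom", "rb") as f:
    before = f.read()
with open("gtx480-after.rom", "rb") as f:
    after = f.read()

for offset, (old, new) in enumerate(zip(before, after)):
    if old != new:
        print("0x%04X: 0x%02X -> 0x%02X" % (offset, old, new))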

Reboot and your card should now get detected as a Quadro 6000, and you should be able to pass it through to your virtual machine without problems. I have used this extensively to enable me to pass my GeForce 4xx series cards to my Xen VMs for gaming. I will cover the details of virtualization use with Xen in a separate article. Note that I have had reports of cards modified using this method also working virtualized using VMware vDGA, so if this is your preferred hypervisor, you are in luck. Quadro 5000 and 6000 are also listed as supported for VMware vSGA virtualization, so that should work, too – if you have tried vSGA with a modified GeForce card, please post a comment with the details.

The same modification method described here should work for modifying any Fermi card into the equivalent Quadro card. Simply follow the same process. You may find this list of Nvidia GPU device IDs useful to establish what device ID you want to modify the card to. The GPU should match between the GeForce card and the Quadro/Tesla/Grid card you are modifying it into – so check which Nvidia card uses which GPU.

Many thanks to the nouveau project for reverse engineering and documenting the initialization straps, and all the people who have contributed to the effort.

In the next article I will cover modifying Kepler GPU based cards. They are quite different and require a different approach. There are also a number of pitfalls that can leave you chasing your tail for days trying to figure out why everything checks out but the modification doesn’t work (i.e. the card doesn’t function in a VM).

Virtualized Gaming: Nvidia Cards, Part 2: GeForce, Quadro, and GeForce Modified into a Quadro – Higher End Fermi Models

Following the success of the QuadForce 2450 modification (GeForce GTS450 -> Quadro 2000), I went on to investigate whether the same modification would work on the GTX470 to turn it into a Quadro 5000 and on a GTX480 to turn it into a Quadro 6000. I also modified a GTX580 into the somewhat obscure Quadro 7000.

Model            Core Configuration    Memory Channels    Memory
GeForce GTX470   448:56:40             5x                 1.25GB
GeForce GTX480   480:60:48             6x                 1.50GB
Quadro 5000      352:44:40             5x                 2.50GB
Quadro 6000      448:56:48             6x                 6.00GB

In all three cases the modifications were successful, and they all worked as expected – features like VGA passthrough work on the 5000 and 6000 models, and gaming performance is excellent, as you would expect – I can play Crysis at 3840×2400 in a virtual machine. Again, the extra GL functions aren't there (if you compare the output of glxinfo between a real Quadro and a QuadForce, you will find a number of GL primitives missing), so some aspects of OpenGL performance are still crippled. PhysX support is also a little hit-and-miss: in a VM it seems to work on Windows 7 but not on XP, while on bare metal on Windows XP it works. This appears to be down to the Quadro driver itself, rather than to the cards not being genuine Quadros.

Finally, the GF100 based cards (GTX470/480) also get an extra feature enabled by the modification – second DMA channel. Normally there is a unidirectional DMA channel between the host and the card. Following the modification, the second DMA channel in the other direction is activated. This has a relatively moderate impact on gaming performance, but it can have a very large impact on performance of I/O bound number crunching applications since it increases the memory bandwidth between the card and the system memory (you can read and write to/from the GPU memory at the same time). Compare the CUDA-Z Memory report for the GTX470 before and after modifying it into a Quadro 5000 – GTX470 only has a unidirectional async memory engine, but after modifying it the engine becomes bidirectional:

GTX470 CUDA-Z Memory
QuadForce 5000 CUDA-Z Memory

The same happens on the GTX480 – its async engine also becomes bidirectional after modification.

The Quadro 7000 is a little different from the other two. It doesn't have dual DMA channels, and Nvidia don't list it as MultiOS capable. The drivers do not make the necessary adjustments for it to work with VGA passthrough, which means that, unfortunately, there is little to be gained from modifying a GTX580. Note, however, that the Quadro 7000 was never aimed at the virtualization market; it was only available as part of the QuadroPlex 7000 product – an external GPU enclosure designed for driving multiple monitors for various visualisation work. Hence the lack of MultiOS support on it.

Here is how the QuadForce 5470 does in SPECviewperf (GTX470 = 100%):

QuadForce 5470 SPECviewperf

Compared to the QuadForce 2450, the performance improvements are more modest – the only real difference is observable in the lightwave benchmark.

Unfortunately, my QuadForce 6480 is currently in use, so I cannot get measurements from it, but since they are both based on the GF100 GPU, the results are expected to be very similar.

On the QuadForce 7580 there was no observed SPEC performance improvement.

I have since acquired a Kepler based 4GB GTX680 and successfully modified it into a Quadro K5000. Modifying it into a Grid K2 also works, but there don't appear to be any obvious advantages to doing so at the moment (the K5000 works fine for virtualization passthrough, even though it wasn't listed as MultiOS last time I checked). This QuadForce K5680 is why my GTX470 became free for testing again. More on Quadrifying Keplers in the next article. I also have a GTX690 now (essentially two 680s on the same PCB), which will be replacing the QuadForce 6480, so this will also be written up in due time. Unfortunately, quadrifying Keplers in most cases requires hardware as well as BIOS modifications. I will post more on all this soon, along with a tutorial on soft-modding.

Virtualized Gaming: Nvidia Cards, Part 1: GeForce, Quadro, and GeForce Modified into a Quadro

Recently I built a new system with the primary intention of running Linux the vast majority of the time and never having to stop what I am doing to reboot into Windows every time I wanted to play a game. That meant gaming in a VM, which in turn meant VGA passthrough. I am an Enterprise Linux 6 user, and Fedora is too bleeding edge for me. What I really wanted to run is KVM virtualization, but VGA passthrough didn't seem to work for me with the EL6 packages, even after selectively updating to much newer kernel, qemu and libvirt related packages. VMware ESX won't work with PCI passthrough on my EVGA SR-2 motherboard because EVGA, in their infinite wisdom, decided to put all the PCIe slots behind Nvidia NF200 routers/bridges, which don't support the PCIe ACS functionality that ESX requires for PCI passthrough. That left Xen as the only remaining option. I now mostly have Xen working the way I want – not without issues, but I will cover virtualized gaming and the Xen details in another article. For now, what matters is that Xen VGA passthrough currently only works with ATI cards and Nvidia Quadro (but not GeForce) cards.

ATI cards are not an option for me due to various driver bugs (e.g. mishandling monitors whose refresh rate depends on the resolution due to bandwidth limitations), lack of features (no option to use anything but EDID modes, to the extent of completely ignoring monitor driver .inf files; the custom mode feature used to exist in the drivers (the documentation for it can still be found on the AMD website) but has been removed at some point) and, most importantly, the lack of multiple DL-DVI outputs on cards more recent than the Radeon HD4xxx series (Radeon HD5xxx and later cards only come with a single DL-DVI port – on those that come with a second DVI port, even though it physically looks like a DL port, it only provides a single link).

Nvidia GeForce cards don’t work in a virtual machine, at least not without unmaintained patches that don’t work with all cards and guest operating systems.

That leaves Nvidia Quadro cards. Unfortunately, those are eyewateringly expensive. But, on paper, GeForce and Quadro cards use the same GPUs. This got me looking into what makes a Quadro a Quadro, and a few days of research plus a weekend of experimentation yielded some interesting and very useful results. While it looks like some features, such as certain GL functions, are disabled in the chips (probably by laser cutting), other features are purely down to the driver deciding whether to enable them or not. It turns out that making cards work in a VM is one of the driver-dependent features.

Phase 1: Verify That Quadro Cards Work in a VM When GeForce Cards Don't

Looking at the specification and feature list of Quadro cards, the Quadro 2000, 4000, 5000 and 6000 models support the "MultiOS" feature, which is what Nvidia calls VGA passthrough. So, the first thing I did was acquire a "cheap" second hand Quadro 2000 on eBay. Cheap here is a relative term, because a second hand Quadro costs between 3 and 8 times what the equivalent (and usually higher specification) GeForce card costs. The Quadro card proved to work flawlessly, but the Quadro 2000 is based on a GF106 chip with only 192 shaders, so gaming performance was unusable at 3840×2400 (I will let go of my T221 monitors when they are pried out of my cold, dead fingers). Gaming at 1920×1200 was just about bearable with some detail level reductions, but even so it was borderline.

Here is how the genuine Quadro 2000 shows up in GPU-Z and CUDA-Z:

Quadro 2000 GPU-Z
Quadro 2000 CUDA-Z Core
Quadro 2000 CUDA-Z Memory
Quadro 2000 CUDA-Z Performance

And here are the genuine Quadro 2000 SPECviewperf11 results:

Viewset Composite
catia-03 23.86
ensight-04 16.63
lightwave-01 43.12
maya-03 36.25
proe-05 7.07
sw-02 32.21
tcvis-02 18.82
snx-01 17.50

Phase 2: Get an Equivalent GeForce Card and Investigate What Makes a Quadro a Quadro

The next item on the acquisition list was a GeForce GTS450 card. On paper the spec for a GTS450 is identical to a Quadro 2000:
  • GF106 GPU
  • 192 shaders
  • 1GB of GDDR5
Note: There are some models that are different despite also being called GTS450. Specifically, there is an OEM model that only has 144 shaders, and there is a model with 192 shaders but with GDDR3 memory rather than GDDR5. The GDDR3 model may be more difficult to modify due to various differences, and the 144 shader model may not work properly as a Quadro 2000.

Armed with the information I dug out, I set out to modify the GTS450 into a QuadForce (a splice of Quadro and GeForce – Gedro just doesn't sound right). This was successful: the card was now detected as a Quadro 2000, and everything seemed to work accordingly. VGA passthrough worked, and since the GTS450 is clocked significantly higher than the Quadro 2000, gaming performance improved to the point where 1920×1200 was quite livable with. What didn't improve to Quadro levels is the OpenGL performance of certain functions that appear to have been disabled on the GeForce GPUs. Consequently, SPECviewperf11 results are much lower than on a real Quadro 2000 card, but the GeForce GTS450 scores higher on every gaming test, since games don't use the missing functionality and the GeForce card is clocked higher. It is unclear at the moment whether the extra GL functionality is disabled on the GPU die by laser cutting or whether it is disabled externally to the GPU, e.g. by different hardware strapping or pin shorting via the PCB components – more research into this will need to be done by someone more interested in those features than me. Since the stamped-on GPU markings are different between the GTS450 (GF106-250, checked against 3 completely different GDDR5 GTS450 cards) and the Quadro 2000 (GF106-875 on the one I have), it seems likely the extra GL functionality is laser cut out of the GPU.

Here is how the GTS450 modified to Quadro 2000 shows up in GPU-Z and CUDA-Z:
QuadForce 2000 GPU-Z
QuadForce 2000 CUDA-Z Core
QuadForce 2000 CUDA-Z Memory
QuadForce 2000 CUDA-Z Performance

CUDA-Z performance seems to scale with the clock speeds, so the faux-Quadro card wins.

Here are the SPECviewperf11 results for a GTS450 before and after modifying it into a Quadro 2000. As you can see, in this test those missing GL functions make a huge difference, but in some tests there is still a substantial improvement:

GTS450:

Viewset Composite
catia-03 3.33
ensight-04 20.67
lightwave-01 10.80
maya-03 5.38
proe-05 0.36
sw-02 6.75
tcvis-02 0.35
snx-01 2.37

QuadForce 2450:

Viewset Composite
catia-03 3.24
ensight-04 17.83
lightwave-01 10.72
maya-03 7.75
proe-05 0.37
sw-02 6.87
tcvis-02 0.35
snx-01 2.35

Here is the data in chart form (relative performance, real Quadro 2000 = 100%).

GTS450 vs. Quadro 2000

As you can see, the real Quadro dominates in all tests except ensight-04, where it gets soundly beaten by the GeForce card. The modification does seem to improve some aspects of performance; in particular, the Maya results improve by a whopping 44% following the modification.

If you are only interested in support and VGA passthrough for virtual machines, modifying a GeForce card to a Quadro can be an extremely cost effective solution (especially if your budget wouldn’t stretch to a real Quadro card anyway). If you are only interested in performance of the kind measured by SPECviewperf, then depending on the applications you use, a real Quadro is still a better option in most cases.

Note: I am selling one of my Quadrified GTS450 cards. I bought several fully expecting to brick a few in the process of attempting to modify them, but the success rate was 100% so I now have more of them than I need.

IBM T221 3840×2400 204dpi Monitor – Part 7: Positive Update

For once it would appear that I have a positive update on the subject of Nvidia drivers. It would seem that patching the latest (319.23) driver is no longer required on Linux. Even better, there is a way to achieve a working T221 setup without RandR getting in the way by insisting the two halves are separate monitors. I covered the issues with Nvidia drivers in a previous article.

The build part now works as expected out of the box. Simply:

export IGNORE_XEN_PRESENCE=1
bash ./NVIDIA-Linux-x86_64-319.23.run

and everything should “just work”.

Best of all, there appears to be a workaround for the RandR information being visible even when the Xinerama info is being overridden. It turns out that Xinerama and RandR seem to be mutually exclusive. So even though the option for explicitly disabling RandR seems to get silently ignored, enabling Xinerama fixes that problem. And since the Nvidia driver's Xinerama info override still works, this solves the problem!

You may recall from a previous article the following in xorg.conf:

[...]
Section "ServerLayout"
	Identifier "Layout0"
	Screen 0 "Screen0" 0 0
	Option "Xinerama" "0"
EndSection
[...]
Section "Screen"
	Identifier "Screen0"
[...]
	Option "NoTwinViewXineramaInfo" "True"
	Option "TwinView" "1"
	Option "TwinViewOrientation" "RightOf"
	Option "metamodes" "DFP-0:1920x2400, DFP-3:1920x2400"
[...]
EndSection

It turns out the solution is to simply enable Xinerama:

Section "ServerLayout"
	Identifier "Layout0"
	Screen 0 "Screen0" 0 0
	Option "Xinerama" "1"
EndSection

This implicitly disables RandR and Nvidia driver’s Xinerama info override takes care of the rest. Magic. :)

Update:
If you are still having problems when using KDE, there is another trick you can use to force Xinerama and disable RandR. Amend the following line in kdmrc:

/etc/kde/kdm/kdmrc:
ServerArgsLocal=-extension RANDR +xinerama -nr -nolisten tcp

WQUXGA – IBM T221 3840×2400 204dpi Monitor – Part 6: Regressing Drivers and Xen

I recently built a new machine, primarily because I got fed up with having to stop what I'm working on and reboot from Linux into Windows whenever my friends and/or family invited me to join them in a Borderlands 2 session. Unfortunately, my old machine was just a tiny bit too old (Intel X38 based) to have the full, bug-free VT-d/IOMMU support required for VGA passthrough to work, so after 5 years I finally decided it was time to rectify this. More on this in another article, but the important point is that VGA passthrough requires a recent version of Xen. And that is where this part of the story really begins.

Some of you may have figured out that RHEL derivatives are my Linux distribution of choice (RedSleeve was a big hint). Unfortunately, RedHat have dropped support for Xen Dom0 kernels in EL6, but thankfully other people have picked up the torch and provide a set of up to date, supported Xen Dom0 kernels and packages for EL6. So far so good. But it was never going to be that simple, at a time when drivers are simultaneously getting dumber, sparser on features and more bloated. That is really what this story is about.

For a start, a few details about the system setup that I am using, and have been using for years.

  • I am a KDE, rather than Gnome, user. EL6 comes with KDE 4, which uses X RandR rather than the Xinerama extension to establish the geometry of the screen layout. This isn't a problem in itself, but there is no way to override whatever RandR reports, so on a T221 you end up with a regular desktop on half of the T221 and an empty desktop on the other, which looks messy and unnatural.
  1. EL6 had a Xorg package update that bumped the ABI version from 10 to 11.
  2. Nvidia drivers have changed the way TwinView works after version 295.x (TwinView option in xorg.conf is no longer recognized)
  3. Nvidia drivers 295.x do not support Xorg ABI v11.
  4. Nvidia kernel drivers 295.x do not build against kernels 3.8.x.

And therein lies the complication.

Nvidia drivers v295.x, when used with the TwinView and NoTwinViewXineramaInfo options, also seem to override the RandR geometry to show a single large screen rather than two separate screens. This is exactly what we want when using the T221. Drivers after 295.x (304.x seems to be the next version) don't recognize the TwinView configuration option, and while they provide a Xinerama geometry override when using the NoTwinViewXineramaInfo option, they no longer override the RandR information. This means that you end up with a desktop that looks as you would expect if you used two separate monitors (e.g. the status bar is only on the first screen, no wallpaper stretch, etc.), rather than a single, seamless desktop.

As you can see, there is a large compound issue in play here. We cannot use the 295.x drivers, because:

  1. They don’t support Xorg ABI 11 – this can be solved by downgrading the xorg-x11-server-* and xorg-x11-drv-* packages to an older version (1.10 from EL 6.3). Easily enough done – just make sure you add xorg-x11-* to your exclude line in /etc/yum.conf after downgrading to avoid accidentally updating them in the future.
  2. They don’t build against 3.8.x kernels (which is what the Xen kernel I am using is – this is regardless of the long standing semi-allergy of Nvidia binary drivers to Xen). This is more of an issue – but with a bit of manual source editing I was able to solve it.

Here is how to get the latest 295.x driver (295.75) to build against Xen kernel 3.8.6. You may need to do this as root.

Kernel source acquisition and preparation:

wget http://uk1.mirror.crc.id.au/repo/el6/SRPMS/kernel-xen-3.8.6-1.el6xen.src.rpm
rpm -ivh kernel-xen-3.8.6-1.el6xen.src.rpm
cd ~/rpmbuild/SPECS
rpmbuild -bp kernel-xen.spec
cd ~/rpmbuild/BUILD/linux-3.8.6
cp /boot/config-3.8.6-1.el6xen.x86_64 .config
make prepare
make all

Now that you have the kernel sources ready, get the Nvidia driver 295.75, the patch, patch it and build it.

wget http://uk.download.nvidia.com/XFree86/Linux-x86_64/295.75/NVIDIA-Linux-x86_64-295.75.run
wget https://dl.dropboxusercontent.com/u/61491808/NVIDIA-Linux-x86_64-295.75.patch
bash ./NVIDIA-Linux-x86_64-295.75.run --extract-only
patch < NVIDIA-Linux-x86_64-295.75.patch
cd NVIDIA-Linux-x86_64-295.75
export IGNORE_XEN_PRESENCE=y
export SYSSRC=~/rpmbuild/BUILD/linux-3.8.6
cp /usr/include/linux/version.h $SYSSRC/include/linux/
./nvidia-installer -s

And there you have it: Nvidia driver 295.75 that builds cleanly and works against 3.8.6 kernels. The same xorg.conf given in part 3 of this series will continue to work.

It is really quite disappointing that all this is necessary. What is more concerning is that the ability to use a monitor like the T221 is diminishing by the day. Without the ability to override what RandR returns, it may well be gone completely soon. It seems the only remaining option is to write a fakerandr library (similar to fakexinerama). Any volunteers?

It seems that Nvidia drivers are both losing features and becoming more bloated at the same time. 295.75 is 56MB. 304.88 is 65MB. That is 16% bloat for a driver that is regressively missing a feature, in this case an important one. Can there really be any doubt that the quality of software is deteriorating at an alarming rate?

Clevo M860TU / Sager NP8662 / mySN XMG5 GPU (GTX260M / FX 3700M) Replacement / Upgrade and Temperature Management Modifications

Recently, my wife’s Clevo M860TU laptop suffered a GPU failure. Over our last few Borderlands 2 sessions, it would randomly crash more and more frequently, until any sort of activity requiring 3D acceleration refused to work for more than a few seconds. The temperatures as measured by GPU-Z looked fine (all our computers get their heatsinks and fans cleaned regularly), so it looked very much like the GPU itself was starting to fail. A few days later, it failed completely, with the screen staying permanently blank.

The original GPU in it was an Nvidia GTX260M. These proved near impossible to come by in the MXM III-HE form factor. Every once in a while a suitable GTX280M would turn up on eBay, but the prices were quite ludicrous (and consequently they never sold, either). Interestingly, Nvidia Quadro FX 3700M MXM III-HE modules seem to be fairly abundant and reasonably priced – interesting considering that they cost several times more than the GTX280M when new. Their spec (128 shaders, 75W TDP) is identical to the GTX280M's.

MXM-III HE Nvidia Quadro FX 3700M

The GTX260M has 112 shaders and a lower TDP of 65W, so the cooling was going to be put under increased strain (especially since I decided to upgrade it from a dual core to a quad core CPU at the same time – more on that later). Having fitted it all (it is a straight drop-in replacement, but make sure you use shims and fresh thermal pads for the RAM if required to ensure proper thermal contact with the heatsink plate), I ran some stress tests.

Within 10 minutes of the OCCT GPU test, it hit 97C and started throttling and producing errors. I don't remember what temperatures the GTX260M was reaching before, but I am quite certain it was not this high. I had to find a way to reduce the heat output of the GPU. Given the cooling constraints in a laptop, even a well designed one like the Clevo M860TU, the only way to reduce the heat was to reduce either the clock speed or the voltage – or both. Since the heat produced by a circuit is proportional to the product of the clock speed and the square of the voltage, reducing the voltage has a much bigger effect than reducing the clock speed. Of course, reducing the voltage necessitates a reduction in clock speed to maintain stability. The only way to do this on an Nvidia GPU is by modifying the BIOS; thankfully, the tools for doing so are readily available.

After some experimentation, it wasn’t difficult to find the optimal setting given the cooling constraints. The original settings were:

  • Core: 550MHz
  • Shaders: 1375MHz
  • Memory: 799MHz (1598MHz DDR)
  • Voltage: 1.03V (Extra)
  • Temperature: Throttles at 97C and gets unstable (OCCT GPU test)
  • FPS: ~17

The settings I found that provided 100% stability and reduced the temperatures down to a reasonable level are as follows:

  • Core: 475MHz
  • Shaders: 1250MHz
  • Memory: 799MHz (1598MHz DDR)
  • Voltage: 0.95V (Extra)
  • Temperature: 82C peak (OCCT GPU test)
  • FPS: ~16

The temperature drop is very significant, but the performance reduction is relatively minimal. It is worth noting that OCCT is specifically designed to produce maximum heat load. Playing Borderlands 2 and Crysis with all the settings set to maximum at 1920×1200 resulted in peak temperatures around 10C lower than the OCCT test.
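
As a rough sanity check, plugging the two sets of settings into the clock-times-voltage-squared relationship mentioned above suggests the undervolted card should dissipate somewhere around a quarter less power than stock. This is only a ballpark estimate, since it ignores static leakage and memory power:

# Rough relative power estimate using P ~ f * V^2 as described above.
old_clock, old_volt = 550.0, 1.03   # stock core clock (MHz) and voltage (V)
new_clock, new_volt = 475.0, 0.95   # undervolted/underclocked settings

ratio = (new_clock / old_clock) * (new_volt / old_volt) ** 2
print("Estimated GPU power vs. stock: %.0f%%" % (ratio * 100))   # ~73%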

While I had the laptop open, I figured this would be a good time to upgrade the CPU as well. Not that I think the 2.67GHz P9600 Core2 was underpowered, but with the 2.26GHz Q9100 quad core Core2s being quite cheap these days, it seemed like a good idea. And considering that when overclocking the M860TU from 1066 to 1333FSB I had to reduce the multiplier on the P9600 (not that there was often any need for this), the Q9100's lower multiplier seemed like a promising overall upgrade. The downside, of course, was that the Q9100 is rated at a TDP of 45W compared to the P9600's 25W. Given that the heatsink on the Clevo M860TU is shared between the CPU and the GPU, this no doubt didn't help the temperatures observed under OCCT stress testing. Something could be done about this, too, though.

Enter RMClock – a fantastic utility for tweaking VIDs to achieve undervolting on x86 CPUs at above-minimum clock speeds. Intel Enhanced SpeedStep reduces both the clock speed and the voltage when applying power management. The voltage VID and clock multipliers are overridable (within the minimum and maximum hard-set in the CPU for each), which means that in theory, with a very good CPU, we could run the maximum multiplier at the minimum VID to reduce power consumption. In most cases, of course, this would result in instability. But it turns out my Q9100 was stable under several hours of OCCT testing at the minimum VID (1.05V) and the top multiplier (nominal VID 1.275V). This resulted in a 10C drop in peak OCCT CPU load tests, and a 6C drop in peak OCCT GPU load tests (down to 76C from 82C peak).
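
The same back-of-the-envelope arithmetic explains the CPU side: with the clock unchanged, dropping from the nominal 1.275V VID to 1.05V should cut dynamic CPU power to roughly two thirds of stock:

# Same P ~ f * V^2 ballpark for the CPU undervolt: the clock stays the same,
# only the voltage drops from the nominal VID to the minimum VID.
nominal_vid, min_vid = 1.275, 1.05
ratio = (min_vid / nominal_vid) ** 2
print("Estimated CPU dynamic power vs. stock: %.0f%%" % (ratio * 100))   # ~68%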

WQUXGA – IBM T221 3840×2400 204dpi Monitor – Part 5: When You Are Really Stuck With a SL-DVI

I recently had to make one of these beasts work bearably well with only a single SL-DVI link. This was dictated by the fact that I needed to get it working on a graphics card with only a single DVI output, and my 2xDL-DVI -> 2xLFH-60 adapter was already in use. As I mentioned previously, I found the 13Hz refresh rate of a standard single SL-DVI link just too slow (I could see the mouse pointer skipping along the screen), but the default 20Hz from 2xSL-DVI was just fine for practically any purpose.

So, faced with the need to run with just a single SL-DVI port, it was time to see if a bit of tweaking could be applied to reduce the blanking periods and squeeze a few more FPS out of the monitor. In the end, 17.1Hz turned out to be the limit of what could be achieved. And it turns out, this is sufficient for the mouse skipping to go away and make the monitor reasonably pleasant to use.

(Note: My wife disagrees – she claims she can see the mouse skipping at 17.1Hz. OTOH, she is unable to read my normal font size (MiscFixed 8-point) on this monitor at full resolution. So how you get along with this setup will largely depend on whether your eyes’ sensitivity is skewed toward high pixel density or high frame rates.)

The xorg.conf I used is here:

Section "Monitor"
  Identifier    "DVI-0"
  HorizSync    31.00 - 105.00
  VertRefresh    12.00 - 60.00
  Modeline "3840x2400@17.1"  165.00  3840 3848 3880 4008  2400 2402 2404 2406 +hsync +vsync
EndSection

Section "Device"
  Identifier    "ATI"
  Driver        "radeon"
EndSection

Section "Screen"
  Identifier    "Default Screen"
  Device        "ATI"
  Monitor        "DVI-0"
  DefaultDepth    24
  SubSection "Display"
    Modes    "3840x2400@17.1"
  EndSubSection
EndSection

The Modeline could easily be used to create an equivalent setting in Windows using PowerStrip or a similar tool, or you could hand-craft a custom monitor .inf file.
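
For reference, the 17.1Hz figure falls straight out of the modeline: the pixel clock divided by the total horizontal and vertical timings gives the refresh rate. A quick check of the arithmetic:

# Refresh rate implied by the modeline above:
# Modeline "3840x2400@17.1"  165.00  3840 3848 3880 4008  2400 2402 2404 2406
pixel_clock_hz = 165.00e6
htotal, vtotal = 4008, 2406   # the last value of each timing group

print("Horizontal: %.1f kHz" % (pixel_clock_hz / htotal / 1e3))       # ~41.2 kHz
print("Vertical:   %.2f Hz" % (pixel_clock_hz / (htotal * vtotal)))   # ~17.11 Hz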

In the process of this, however, I discovered a major limitation of some of the Xorg drivers. The generic frame buffer (fbdev) and VESA (vesa) drivers do not support Modelines, and will in fact ignore them. ATI's binary driver (fglrx) also doesn't support Modelines. The Linux CCC application mentions a section for custom resolutions, but there is no such section in the program. So if you want to use a monitor in any mode other than what its EDID reports, you cannot use any of these drivers. This is a hugely frustrating limitation. In the case of the fbdev driver it is reasonably forgivable, because it relies on whatever modes the kernel frame buffer exposes. In the case of the VESA driver it is understandable that it only supports standard VESA modes. But ATI's official binary driver lacking this feature is quite difficult to forgive – it has clearly been dumbed down too far.

Getting the Best out of the MacBook Pro Retina 15 Screen in VMware Fusion

I make no secret of the fact that I am neither a fan of Apple nor a fan of virtualization. But sometimes they make for the best available option. I have recently found myself in such a situation. My current employer, mercifully, allows employees a choice of something other than vanilla Windows machines to work on, and there was an option of getting a MacBook Pro. As you can probably guess from some of the previous articles here, I find the single most important productivity feature of a computer to be the screen resolution, an opinion I appear to share with Linus Torvalds. So I opted for the 15″ MacBook Pro Retina.

Unfortunately, the native Linux support on that machine still isn't quite perfect. Since speed is not a concern in this particular case, I opted to run Linux using VMware Fusion on OSX. Sadly, VMware Fusion cannot handle the full 2880×1800 resolution of the display, and with lower resolutions running in full screen mode the quality is badly degraded by blurring and aliasing. The solution is to create a custom 2880×1800 mode in /etc/X11/xorg.conf that fits within the VMware virtual graphics driver's capabilities. This took a bit of working out, since the mode had to fit within the horizontal and vertical refresh rate limits of the driver and the total pixel clock the driver allows. The following are the settings that work for me:

Section "Monitor"
        Identifier "MacBookPro"
        HorizSync 30.0 - 90.0
        VertRefresh 30.0 - 60.0
        ModeLine "2880x1800C" 358.21 2880 2912 4272 4304 1800 1839 1852 1891
EndSection

Section "Screen"
        Identifier "Default Screen"
        Monitor "MacBookPro"
        DefaultDepth 24
        SubSection "Display"
                Modes "2880x1800C"
        EndSubSection
EndSection

The result is being able to run a full screen 2880×1800 mode, and it looks absolutely superb.
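
As a quick sanity check, the same modeline arithmetic shows why this mode fits within the sync ranges declared above:

# Check the custom 2880x1800 mode against the sync ranges declared above.
pixel_clock_hz = 358.21e6
htotal, vtotal = 4304, 1891   # totals from the ModeLine

print("Horizontal: %.1f kHz (HorizSync allows 30-90)" % (pixel_clock_hz / htotal / 1e3))        # ~83.2 kHz
print("Vertical:   %.1f Hz (VertRefresh allows 30-60)" % (pixel_clock_hz / (htotal * vtotal)))  # ~44.0 Hz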

Virtual Performance – Or Lack Thereof

Updated: Results for VMware ESXi 5.0.0 and the newly released VMware Player 5.0.0 have been added.

People always seem very shocked when I suggest that virtualization comes with a very substantial performance penalty even when virtualization hardware extensions are used. Concerningly, this surprise often comes from people who have already either committed their organization’s IT infrastructure to virtualization, or have made firm plans to do so. The only thing I can conclude in these cases, unbelievable as it may appear, is that they haven’t done any performance testing of their own to assess the solution they are planning to adopt.

So I decided to document some basic performance tests that show just how substantial the performance hit of virtualization is.

Test Setup

Hardware:
Core2 Quad 3.2GHz
8GB of RAM
2x500GB 7200rpm SATA DM RAID1 for the main system
1x250GB 7200rpm SATA for testing

Virtual Test Configuration (VMware Player 4.0.4, Xen 4.1.2 (PV and HVM), KVM (RHEL6), VirtualBox 4.1.18):
CPU Cores: 4 (all)
RAM: 6GB
Disk: System booting off the 2×500 RAID1. Raw 250GB SATA disk passed to the VM.

Disk write caching was enabled in the VMware configuration. You may think that this unfairly gives the VM configuration an advantage, but as you will see from the results, even with this “cheat”, the performance is still very disappointing compared to bare metal. In any case, the amount of disk I/O is negligible – the caches and the working set always fit into memory.

Physical Test Configuration:
CPU Cores: 4 (all)
RAM: 6GB (limited using mem=6G boot parameter)
Disk: Booting directly off the same 250GB SATA disk used for VM testing, with the same kernel and configuration.

The Test

The test performed is the compile of the vanilla 2.6.32.59 Linux kernel. This is the script used for testing:

#!/bin/bash

echo Cleaning...
make clean > /dev/null 2>&1
make mrproper > /dev/null 2>&1
sync
echo 3 > /proc/sys/vm/drop_caches
echo Configuring...
make allmodconfig > /dev/null 2>&1
echo Syncing...
sync
find . -type f -print0 | xargs --null cat > /dev/null 
echo "Timing build..."
time (make -j16 all > /dev/null 2>&1)

The source tree is cleaned and all caches are dropped. The allmodconfig configuration is used to get some degree of disk I/O testing by creating the maximum number of files. The caches are then primed by pre-loading all the source files; this is done in order to more accurately measure the CPU and RAM subsystems without bottlenecking on disk I/O. The CPU in the system has 4 cores, and 16 build threads are used to ensure the CPU and memory I/O are saturated, but without creating enough memory pressure to cause swapping.

On the host and in the guest, all unnecessary services and processes were stopped (especially crond which could theoretically cause additional load on the system that would distort the results).

All tests were carried out 3 times in a row, and the best result for each is considered here (the differences between the runs were minimal).

This is very much a redneck, brute-force test. There isn’t much finesse to it. But I like tests like this because they cannot be cheated with the sort of smoke and mirrors illusions that virtualization software is very good at applying.

Results

Bare metal: 1,042.523s (100%)
Xen 4.1.2 (PV): 1,316.984s (79.16%)
VMware ESXi 5.0.0: 1,361.321s (76.58%)
VMware Player 5.0.0: 1,478.732s (70.50%)
VMware Player 4.0.4: 1,520.023s (68.59%)
KVM (RHEL6): 1,691.849s (61.62%)
Xen 4.1.2 (HVM): 2,839.442s (36.72%)
VirtualBox 4.1.18: 8,876.945s (19.06%)

Note: No, this is not a typo – VirtualBox really is that bad.

To make this difference easier to visualise, here it is on graphs

Virtualization Performance - Time in Seconds

To give a better idea of relative performance, here it is in % points, with bare metal being 100%.

Virtualization Performance - Relative Difference

The difference is substantial even with the least poorly performing hypervisor. Virtualization performance is over a fifth (21%) down on bare metal with paravirtualized Xen, nearly a quarter (24%) down with VMware ESXi, and even worse with KVM. Or if you prefer to look at it the other way around, bare metal is more than a quarter as fast again (26.32%) as the best performing hypervisor on the same hardware.

Don't get me wrong – virtualization is handy for all sorts of low-performance tasks. In cases where it is used to consolidate a number of mostly idle systems into one mostly idle system, it brings clear benefits. (Except maybe in the case of VirtualBox – the performance there is just too appalling for anything, and HVM Xen is pretty poor, too.) But for uses where performance is important, thoughts of virtualizing need to undergo a serious reality check. Even if your system is designed to scale completely horizontally, requiring 26%+ extra hardware (in the best case scenario – it could be a lot worse depending on which hypervisor you use) is likely to put a significant strain on your budget and running costs.

Note: It is worth stressing that these tests are carried out on hardware with VT-x, and support for this is enabled and used for all the tested hypervisors. So the results here are based on optimal hardware support.