WQUXGA a.k.a. OMGWTF – IBM T221 3840×2400 204dpi Monitor – Part 3: ATI vs. Nvidia

It is at times like this that I get to fully consider just how bad a decision it was to jump ship from ATI to Nvidia when it came to graphics cards. But now that sense has been forced back upon me, I will hopefully not consider such madness again for the best part of the next decade.

Due to the ATI drivers being fundamentally unable to handle the T221 reliably, I bit the bullet and decided to put my old 8800GT card back in. The first WTF came about when it transpired that ATI drivers cannot be uninstalled from Windows XP using their bundled uninstaller in Safe or VGA modes. This is quite bad when you consider that it could be the ATI drivers that are preventing the machine from booting into normal mode. Credit to it, though, the ATI uninstaller was not too bad once I ran it in normal mode, and after using it to remove all ATI software and uninstalling the ATI devices in Device Manager, there wasn’t enough left to cause problems on the next reboot, by which point the machine contained an Nvidia card. Everything booted up fine, and after a quick run of the Auslogics Registry Cleaner (just to make sure – easily the best registry cleaner I have used to date), everything was ready for the installation of Nvidia drivers. That went quite painlessly, and a reboot later I had the T221 configured for 2x1920x2400@20Hz mode. The only thing that didn’t come up perfectly by default was that I had to add the 1920×2400@20Hz mode in Nvidia Control Panel (click the Customize button).

By this point, the superior features of Nvidia were already becoming apparent:

  • Text and low-resolution mode anti-aliasing in firmware – such modes look vastly better than on ATI hardware.
  • Until the driver enables the secondary port, it remains disabled. This is really nice on the T221 because it means you don’t get the same thing on the screen twice during the BIOS POST and early stages of the boot process. I can imagine the duplicated output also being annoying on a multi-head setup.
  • The primary port wasn’t switching for no apparent reason in Windows with multiple screens plugged in.
  • Best of all – Windows XP drivers Just Work ™. They don’t forget their settings between reboots.
  • No tearing down the middle of the screen where the two halves meet. With the ATI card, the mouse couldn’t be drawn on both halves at the same time. In the middle, you could make it virtually disappear. Not a big deal, but yet another example of general bugginess. Also, in games the tearing along the same line disappeared (I always run with vsync forced to on, and it was still visible from time to time with the ATI card).

Just the properly working drivers would have easily convinced me of the error of my recent ways, but all the other niceties really make for a positive difference to the experience.

After I got Windows working (it only took 20 minutes, after giving up on ATI after wasting a whole day on getting it to work properly and remember the settings between reboots), it was time to get things working in Linux. The first thing that jumped out at me about this part of the exercise is just how much better ATI’s Linux drivers are compared to their Windows drivers. It is obvious that they are actually being developed by somebody competent. Unlike the Windows drivers, the Linux drivers worked out of the box, and the only unusual thing that I needed to do was to make sure Fake Xinerama was configured and preloaded. Removing them was a simple case of:

# rpm -e fglrx

Simple, efficient, reliable. It seems ATI’s Windows driver team have a lot to learn from their Linux driver team.

The machine came up fine with the nouveau drivers loaded, but I wanted to get Nvidia’s binary drivers working. The experience here was a little more problematic than it had been with the ATI drivers. The nvidia-xconfig and nvidia-settings utilities weren’t as intuitive as the ATI configuration utility, and the setup suffered from a particularly annoying problem where GPU scaling would default to on. This resulted in the screen mode being left stretched and unusable, although sometimes just starting the nvidia-settings program would fix some of it. In the end I just gave up on the utilities and wrote my own xorg.conf according to the documentation – and that worked perfectly. You may want to set the following environment variable to force vsync in GL modes (e.g. for mplayer’s GL output):

# export __GL_SYNC_TO_VBLANK=1

This ensured there was no tearing visible during video playback.
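
If you would rather not export the variable for the whole session, it can also be set for a single command. The following is a minimal sketch assuming a POSIX shell; run_vsynced is a hypothetical helper name, and the mplayer invocation in the comment is only an illustration with a placeholder file name.

```shell
#!/bin/sh
# run_vsynced (hypothetical helper): run one command with
# __GL_SYNC_TO_VBLANK=1 set in that command's environment only.
run_vsynced() {
    env __GL_SYNC_TO_VBLANK=1 "$@"
}

# Illustrative usage (video.mkv is a placeholder):
#   run_vsynced mplayer -vo gl video.mkv
```

Using env keeps the setting out of the calling shell, so other GL applications started from the same session are not affected.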

One thing worth noting is that Nvidia drivers bring their own Xinerama layer with them, so the Xorg Xinerama should be disabled. There is also an option for faking the Xinerama information (NoTwinViewXineramaInfo), so no need for Fake Xinerama, either.

In conclusion, it is quite clear that Nvidia win hands down in terms of features and user experience, especially on Windows, due to their more stable drivers and more intuitive configuration utilities. The story is different on Linux – I would put ATI slightly ahead on that platform, at least in terms of configuration utilities. Having to use Fake Xinerama isn’t a big deal for the technically minded. Even on Linux, however, in terms of the overall outcome and the end experience, I feel Nvidia still come out ahead, since ATI drivers still occasionally produce visible tearing when playing back high definition video.

All this made me think about what the most important thing is about a product such as a graphics card. In the end it is not just about performance. Performance is only a part of the overall package. What I find is that the most important thing about a product is the whole experience of configuring it and using it. How easy is it to get it working under edge case conditions? How reliable is it – once it is working, does it stay working? Are there any experience-ruining artifacts such as tearing visible in applications, even with vsync enabled? These sorts of things, along with the crowning touches such as anti-aliasing of low resolution modes and only having one active video output until the drivers specifically enable the others, are what really impact the experience. And based on my experience of Nvidia and ATI cards over the past few years, I hope somebody talks some sense into me if I consider an ATI product again – except perhaps if their FireGL team starts writing their Windows drivers.

WQUXGA a.k.a. OMGWTF – IBM T221 3840×2400 204dpi Monitor – Part 2: Windows

When I set out to do this, I thought getting everything working under Windows would be easier than it was under Linux. After all, the drivers should be more mature and AMD would have likely put more effort into making sure things “just work” with their drivers. The experience has shown this expectation was unfounded. Getting the T221 working in SL-DVI 3840×2400@13Hz mode was trivial enough, but getting the 2xSL-DVI 2x1920x2400@20Hz mode working reliably has proven to be quite impossible.

The first problem has been the utter lack of intuitiveness in the Catalyst Control Center. It took a significant amount of research to finally find that the option for desktop stretching across two monitors lies behind a right click menu on an otherwise unmarked object:

CCC Desktop Stretch Option


Results, however, were intermittent. Sometimes the resolution for the second half of the screen would randomly get mis-set, sometimes it would work. Sometimes the desktop stretching would fail. Eventually, when it all worked (and it would usually require a lot of unplugging of the secondary port to get a usable screen back), it would be fine for that Windows session, but it would all go wrong again after a reboot. The screen would just go to sleep at the point where the login screen should come up, and the only way to wake it up was to unplug the secondary DVI link, log in, and then plug in the second cable, usually a few times, before it would come up in a usable mode. Then the same resolution and desktop stretching configuration process would have to be repeated – with a non-deterministic number of attempts required, using both the Windows display settings configuration and the Catalyst Control Center.

At first I thought it could be due to the fact that I am using a 4870X2 card, so I disabled one of the GPUs. That didn’t help. Then I tried using a different monitor driver, rather than the “Default Monitor” which is purely based on the EDID settings the monitor provides. I tried a ViewSonic VP2290b driver (this was a rebranded T221), and a custom driver created using PowerStrip based on the EDID settings, and neither helped. Since I only use Windows for occasional gaming and not for any serious work, this isn’t a show stopping issue for me, but I am stunned that AMD’s Linux drivers are more stable and usable than the Windows ones when using even slightly unusual configurations.

To add a final insult to injury, the 4870X2 card doesn’t end up running the monitor with one GPU driving each 1920×2400 section. Instead, one GPU ends up running both, and the second GPU remains idle. At first I attributed the tearing between the two halves of the screen to each half being rendered by a different GPU. Unfortunately, considering that all tests show that one GPU remains cold and idle while the other one is under heavy load, I have to conclude that this is not the case. This is particularly disappointing because the experience is both visually bad (tearing between the two 1920×2400 sections) and poorly performing (one GPU always remains idle and the frame rates suffer quite badly – 7-9fps in the Crysis Demo Benchmark). I clearly recall that the Nvidia 9800GX2 card I had before had a configuration option to enable dual-screen dual-GPU mode.

I am just about ready to give up on AMD GPUs, purely because the drivers are of such poor quality and lacking important features (e.g. requirement of fakexinerama under Linux, something that Nvidia drivers have a built in option for). I’m going to dig out my trusty old 8800GT card and see how that compares.

Genesi Efika MX Smartbook’s 0 Button Mouse

I love my Genesi Efika MX Smartbook – it’s an awesome little machine. But there have been three things that have bothered me about it since I got mine, and they are the sort of things that can make a difference between sub-mediocrity and brilliance. I have already covered one of the issues in a previous post concerning the screen upgrade.

The second big problem I have with it is that the buttons on the touch pad are completely unusable. This is not an exaggeration. Due to the way they are designed, it is only possible to use them for dragging with a copious amount of luck – not skill – luck. Clicking using the buttons in the touchpad requires only an infinitesimally smaller amount of luck than dragging. This isn’t acceptable, and since I otherwise rather like the Smartbook, I decided to find a good workaround that doesn’t involve carrying a mouse or a trackball with me – this would ruin one of the best things about it – the portability.

I used to have Sony Vaio PCG-U1 and PCG-U3 machines in the past. They were quite awesome, and competed quite successfully on spec with the Genesi Efika MX Smartbook – which is fairly impressive considering the Vaios in question were produced in 2002 – 9 years ago. The main reason why I finally needed to upgrade from the old Vaio was that the 1024×768 screen resolution simply stopped being sufficient for any serious use. The standard Efika would have failed this requirement even worse were it not for the possibility of the 1280×720 screen upgrade. Plus, the Efika is much thinner and doesn’t require a battery pack as big as the rest of the laptop for 6 hours’ battery life. But I digress. The main point I was getting to is that the Vaio had mouse buttons that were quite separate from the joypad, while still being very ergonomic and easy to use. This made me think about using a similar trick on the Efika. All I needed was two conveniently placed yet redundant keys on the keyboard to remap into mouse buttons. The “House” (the one with an icon of a house, as opposed to “Home”) and “Alt” keys in the bottom left corner seemed perfect for this task.

To do this, we need to do three things:

  1. Disable Xorg’s usage of the keys using xmodmap. I put mine in /etc/X11/xmodmap.
  2. Configure actkbd to trap the low-level keystrokes and execute xdotool commands to issue Xorg mouse button events. Put this in /etc/actkbd.conf
  3. Put the two together and make it happen automatically on login using a script /etc/X11/Xsession.d/95-keyremap.
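
As a rough illustration of the first two steps, the sketch below prints candidate contents for the two configuration files. It is not the configuration used on the machine described here: the “House” keycode is a guess, the helper names are made up for this example, and it assumes the usual evdev convention that an X keycode is the kernel keycode plus 8. Check the real keycodes on your own hardware (e.g. with xev) before using any of it.

```shell
#!/bin/sh
# Sketch only: keycodes are assumptions, not values taken from the article.
HOUSE_KEY=172      # hypothetical kernel keycode for the "House" key
LEFT_ALT_KEY=56    # kernel keycode of Left Alt on typical keyboards

# Lines for the xmodmap file: stop Xorg from interpreting the two keys.
# Assumes X keycode = kernel keycode + 8 (evdev convention).
xmodmap_lines() {
    for key in "$HOUSE_KEY" "$LEFT_ALT_KEY"; do
        echo "keycode $((key + 8)) = NoSymbol"
    done
}

# Lines for actkbd.conf, in its
#   <keys>:<event type>:<attributes>:<command>
# format. "key" fires on press and "rel" on release, so holding a key
# holds the mouse button down, which is what makes dragging possible.
actkbd_lines() {
    echo "$HOUSE_KEY:key::xdotool mousedown 1"
    echo "$HOUSE_KEY:rel::xdotool mouseup 1"
    echo "$LEFT_ALT_KEY:key::xdotool mousedown 3"
    echo "$LEFT_ALT_KEY:rel::xdotool mouseup 3"
}

xmodmap_lines
actkbd_lines
```

Since actkbd normally runs as a daemon outside the X session, xdotool will most likely also need DISPLAY (and possibly XAUTHORITY) set in its environment; how best to arrange that depends on the login setup.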

That is pretty much it. The “House” and “Left Alt” keys will now act as left and right mouse buttons respectively. I hope you find it to be as big an improvement as I did. It feels like having mouse buttons again after being stuck with a 0 button mouse.

These instructions are for Ubuntu, since that is what the Efika ships with and I haven’t gotten around to putting Fedora on it yet. It shouldn’t be difficult to adapt the above approach for other distributions.

WQUXGA a.k.a. OMGWTF – IBM T221 3840×2400 204dpi Monitor – Part 1: Linux

I’m not sure how many people occasionally stop to notice this sort of thing, but to me it frequently seems that technology regresses for long periods from its infrequent peaks. In the 60s we saw flights of the likes of the XB-70 Valkyrie and the SR-71 Blackbird, and people walked on the moon. Yet in 2011 we are reading about the last flight of the Space Shuttle rather than about the first colony on Mars. It makes a quote from Idiocracy all the more uncanny: “… sadly the world’s greatest minds and resources were focused on conquering hair loss and prolonging erections.”

The same pattern seems to apply to some aspects of the computer industry, when cost pressures take precedence over quality, features and innovation. In 2001, we saw the introduction of the IBM T220 monitor, with a resolution of 3840×2400 on a 22.2″ panel. It was later superseded by the T221 with very similar specifications, but it was ultimately discontinued in 2005. Nothing matching it has been available since. Today, screen resolutions seem to be undergoing an erosion. On small panels the “standards” (sub-standards?) have settled at the completely unusable 1024×600, and with a total of five exceptions from Dell (3007WFP, 3008WFP, U3011), Samsung (305T) and Apple (Cinema HD), the commonly available screens are limited to 1920×1080 resolution. Even 1920×1200 screens are getting more and more rare, especially on laptops, because screens are marketed by diagonal size and for any given diagonal length, 16:9 ratio screens have a smaller surface area than 16:10 ratio screens.
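
The claim about surface area follows from Pythagoras: for a diagonal d and an aspect ratio W:H, the area is d² × W×H / (W² + H²). The snippet below is a small sketch of that arithmetic; panel_area is a made-up helper name and the 22″ diagonal is just an illustrative value.

```shell
#!/bin/sh
# panel_area DIAGONAL W H: area of a screen with the given diagonal
# and W:H aspect ratio, using area = d^2 * W*H / (W^2 + H^2).
panel_area() {
    awk -v d="$1" -v w="$2" -v h="$3" \
        'BEGIN { printf "%.1f\n", d * d * w * h / (w * w + h * h) }'
}

panel_area 22 16 9     # prints 206.8 (square inches)
panel_area 22 16 10    # prints 217.5 (square inches)
```

At the same diagonal, the 16:10 panel comes out about 5% larger, which is the area a buyer gives up when choosing by diagonal size alone.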

IBM T221 monitors, especially of the latest DG5 variety, are very hard to come by and still expensive if you can ever find one. Typically they sell for double what you can get a Dell 3007WFP for. But you do get more than twice the pixel count and more than twice the pixel density. I have recently acquired a T221 and if your eyes can handle it (and mine can), the experience is quite amazing – once you get it working properly. Getting it working properly, however, can be quite a painful experience if you want to get the most out of it.

My T221 came with a single LFH-60 -> 2x SL-DVI (single link DVI) cable. There are two LFH-60 connectors on the T221, which allows the screen to be run using 4x SL-DVI inputs. This provides a maximum refresh of 48Hz. There is also a way to run this monitor using 2x DL-DVI inputs at 48Hz, but this requires special adapters. That is a subject for another article, since I haven’t got any of those yet.

Using a single LFH-60 -> 2x SL-DVI cable, there are only two modes in which the T221 can be run:

1) As a single 3840×2400 panel @ 13Hz using a single SL-DVI port

2) As two separate monitors, each being 1920×2400 @ 20Hz, using two SL-DVI ports

The 13Hz mode is completely straightforward to get working on both RHEL6 and XP x64, but 13Hz is just not fast enough. You can actually see the mouse pointer skipping as you move it, and playing back a video also results in visible frame skipping. So I have spent the effort to get the 2x1920x2400@20Hz mode working on my ATI HD4870X2. The end results are worth it, but the process isn’t entirely straightforward. The important thing to consider is that when running in anything other than 3840×2400@13Hz mode, the T221 appears to the computer as two completely separate 1920×2400 monitors.

IBM T221 with Linux

ATI’s Linux drivers aren’t really mature enough for the job, and to achieve the best results, you have to use aticonfig to generate xorg.conf without xinerama support, start X-Windows, fire up the amdcccle configuration utility for ATI cards, enable dual screens, then add xinerama support. If all this sounds complicated to you – it is, and it took a lot of trial and error to get right. So to save you the effort, here is a copy of my xorg.conf file. This is from a RHEL6 machine using the ATI fglrx driver. It will almost certainly work on other distributions, too, with little or no modification.

This still won’t work quite as you’d hope, though – xinerama passes information to the applications about the geometry of the desktop, and apps will only maximize to one screen. This also goes for the task bar, and applies to video playback. The last bit of magic involves faking the xinerama information. Nvidia drivers come with a built in option for this: “NoTwinViewXineramaInfo”. Unfortunately, ATI drivers have no such option. But, this being the world of Linux, there is a backup plan. There is a LD_PRELOAD library called Fake Xinerama that can be used to override the screen geometry passed to applications, and make the applications think they are on a single 3840×2400 screen. All you need to do is the following:

1) Compile Fake Xinerama from the link above
2) Add the line “/usr/local/lib64/libXinerama.so” to your /etc/ld.so.preload file.
3) Create a file ~/.fakexinerama containing:

1
0 0 3840 2400

The first line contains the number of screens. Each following line describes one screen in the format:
<origin X> <origin Y> <width> <height>
If you are booting into graphical environment immediately (runlevel 5), you will need the .fakexinerama file in root’s home directory, too, since gdm/kdm run as root.
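
The file is simple enough to generate from a script, which helps if you need it in more than one home directory. This is only a sketch; write_fakexinerama is a made-up helper name, and the demo writes to a scratch file rather than to a real home directory.

```shell
#!/bin/sh
# write_fakexinerama FILE WIDTH HEIGHT
# Writes a single-screen Fake Xinerama configuration in the format
# described above: the screen count, then "x y width height".
write_fakexinerama() {
    printf '1\n0 0 %s %s\n' "$2" "$3" > "$1"
}

# Demonstration on a scratch file.
demo_file="${TMPDIR:-/tmp}/fakexinerama.demo"
write_fakexinerama "$demo_file" 3840 2400
cat "$demo_file"
```

For real use you would call it as write_fakexinerama "$HOME/.fakexinerama" 3840 2400, and once more as root if the display manager needs its own copy.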

And if you have managed to follow all that, you will have a single seamless 3840×2400@20Hz desktop.

Hardware Accelerated SSL on SheevaPlug (Marvell Kirkwood ARM) Using OpenSSL on Fedora

I have recently been spending quite a lot of time working on Linux on various ARM devices. It is quite amazing what ARM hardware is capable of nowadays. One of the most popular ARM based machines available is the SheevaPlug. Its performance is pretty good for a small server – my experience shows that the 1.2GHz Marvell Kirkwood 88F6281 compares quite favourably to the likes of the 1.66GHz Intel Atom N450 in terms of both server performance and especially power usage. Atom N450 systems have a typical power draw of about 22W idle and 28W under load – a far cry from the supposed 7.6W total of 5.5W N450 + 2.1W NM10. The SheevaPlug, on the other hand, draws 2.3W idle and 7W under load.

In some areas, however, the Atom does hold a performance advantage, especially in usage that requires heavy number crunching – unlike the Marvell Kirkwood, the Atom N450 has an FPU and SIMD capability via the SSE/SSE2/SSSE3 instruction sets. One set of applications that get better performance on the Atom N450 are the ones doing encryption, for example OpenSSL. Or do they…

Not quite. The Kirkwood ARM has an ace up its sleeve, and as it turns out, it is one powerful enough to allow it to close the gap against a processor with 4x the power budget. It has a hardware crypto engine that supports MD5, SHA1 and AES-128 acceleration.

Unfortunately, mainstream Linux distributions don’t come with the hardware crypto acceleration enabled, and most of the documentation available is sufficiently out of date to be inapplicable to the current generation of distributions. All of it points at OCF Linux, which hasn’t been updated for kernels past 2.6.33 and OpenSSL 0.9.8n, both of which are deprecated. I have modified the kernel patches to make them work on 2.6.35, but unfortunately the cryptodev driver uses the locked ioctl operation, which has been removed from the kernel starting with 2.6.36, so further modifications are required to make it work on later kernels. OCF Linux also doesn’t appear to have been updated since late 2010. But things are not as bad as they initially seem – it turns out that there is an alternative.

The reason kernel patches are required is because acceleration depends on the BSD style cryptodev kernel interface. There is an alternative, more up to date project that provides this much less intrusively: Cryptodev-linux. It provides a standalone driver that doesn’t require the entire kernel to be recompiled for it, and it works with the 2.6.36+ kernels.

That just leaves OpenSSL support. Well, it turns out that OpenSSL 1.0.0 already comes with support for cryptodev hardware offload, it just isn’t enabled by default. It has to be enabled during the configure stage by providing -DHAVE_CRYPTODEV (for encryption offload) and -DUSE_CRYPTODEV_DIGESTS (for hashing offload). If you are building against Cryptodev-linux you will also have to provide the -DHASH_MAX_LEN=64 parameter – this is normally in OCF’s cryptodev.h header file, but isn’t present in the header files that Cryptodev-linux provides. Not a big deal, but something to bear in mind when you are building your own OpenSSL with cryptodev engine support.

So, how big a difference does the Kirkwood’s acceleration make? Quite a substantial one. Here is what the openssl speed test produces:

Kirkwood without cryptodev:
# openssl speed -evp aes-128-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 1870065 aes-128 cbc’s in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 516074 aes-128 cbc’s in 3.00s
Doing aes-128 cbc for 3s on 256 size blocks: 132474 aes-128 cbc’s in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 33342 aes-128 cbc’s in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 4171 aes-128 cbc’s in 3.00s

Kirkwood with cryptodev:
# openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 85277 aes-128-cbc’s in 0.08s
Doing aes-128-cbc for 3s on 64 size blocks: 82960 aes-128-cbc’s in 0.08s
Doing aes-128-cbc for 3s on 256 size blocks: 59806 aes-128-cbc’s in 0.03s
Doing aes-128-cbc for 3s on 1024 size blocks: 40939 aes-128-cbc’s in 0.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 8227 aes-128-cbc’s in 0.00s

The results show, predictably, that with very small (unrealistically small) data blocks, software-only userspace crypto is faster due to less context switching. With 1KB blocks, however, hardware crypto is 23% faster, and with 8KB blocks the hardware engine goes twice as fast as the software-only option. But what is really impressive is the reduction in CPU time. Because the hardware crypto engine is asynchronous, there is practically no CPU time required when using it, which is important since it leaves the CPU free to get on with other tasks.

For comparison, here are the Atom N450 results:

# openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 3813930 aes-128-cbc’s in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 1098375 aes-128-cbc’s in 2.99s
Doing aes-128-cbc for 3s on 256 size blocks: 294884 aes-128-cbc’s in 2.99s
Doing aes-128-cbc for 3s on 1024 size blocks: 74520 aes-128-cbc’s in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 9245 aes-128-cbc’s in 2.99s

So the Atom is faster all around – on 1KB blocks it is 82% faster, which reduces to a 12% advantage using 8KB blocks. But let us not forget that we could, in theory, run two instances of OpenSSL, one with hardware offload and one without, which would give us the combined total performance of both, if that is all we needed the machine to do. This would give us figures of approximately:

1KB: 33342+40939=74281
8KB: 4171+8227=12398

This ties with the Atom using 1KB blocks, and beats it by 34% using 8KB blocks – all in a power envelope 4x smaller. Pretty impressive.
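
The percentages quoted in this section can be reproduced directly from the block counts in the openssl speed output above. The snippet below is just that arithmetic; pct_faster and the variable names are made up for this example.

```shell
#!/bin/sh
# pct_faster A B: by what percentage A exceeds B, rounded to a whole number.
pct_faster() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a / b - 1) * 100 }'
}

# Blocks processed per 3 second run, copied from the results above.
KW_SW_1K=33342; KW_SW_8K=4171    # Kirkwood, software only
KW_HW_1K=40939; KW_HW_8K=8227    # Kirkwood, cryptodev offload
ATOM_1K=74520;  ATOM_8K=9245     # Atom N450

echo "Kirkwood hw vs sw, 1KB: $(pct_faster "$KW_HW_1K" "$KW_SW_1K")%"
echo "Kirkwood hw vs sw, 8KB: $(pct_faster "$KW_HW_8K" "$KW_SW_8K")%"
echo "Atom vs Kirkwood hw, 1KB: $(pct_faster "$ATOM_1K" "$KW_HW_1K")%"
echo "Atom vs Kirkwood hw, 8KB: $(pct_faster "$ATOM_8K" "$KW_HW_8K")%"
echo "Kirkwood hw+sw vs Atom, 8KB: $(pct_faster $((KW_SW_8K + KW_HW_8K)) "$ATOM_8K")%"
```

This prints 23%, 97%, 82%, 12% and 34% respectively, matching the figures in the text (97% being the "twice as fast" case).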

Installing Cryptodev-linux is trivial: it is simply a matter of the usual “make; make install” procedure after extracting the tarball (make sure you have the kernel headers for your kernel installed and available in /lib/modules/$(uname -r)/build/).
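
Once the module is installed, a quick sanity check is to confirm that the device node exists. The sketch below assumes the module is named cryptodev and exposes /dev/crypto; cryptodev_ready is a made-up helper name.

```shell
#!/bin/sh
# cryptodev_ready [DEVICE]: report whether the cryptodev character
# device exists. Defaults to /dev/crypto (assumed device path).
cryptodev_ready() {
    dev="${1:-/dev/crypto}"
    if [ -c "$dev" ]; then
        echo "ready: $dev"
    else
        echo "missing: $dev"
        return 1
    fi
}

# Typical use after installation (as root):
#   modprobe cryptodev && cryptodev_ready
```

If the device is missing after loading the module, the kernel log is the first place to look.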

I mentioned above the additional parameters required to make OpenSSL build with cryptodev support. In Fedora 13’s OpenSSL source package, you can edit the relevant line in the spec file. The relevant section in my version reads:

./Configure --prefix=/usr --openssldir=%{_sysconfdir}/pki/tls ${sslflags} zlib enable-camellia enable-seed enable-tlsext enable-rfc3779 enable-cms enable-md2 no-idea no-mdc2 no-rc5 no-ec no-ecdh no-ecdsa --with-krb5-flavor=MIT --enginesdir=%{_libdir}/openssl/engines --with-krb5-dir=/usr -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -DHASH_MAX_LEN=64 shared threads ${sslarch} fips

In case you cannot modify/build it yourself, here are the packages:
openssl-1.0.0-1.kw.fc13.src.rpm
openssl-1.0.0-1.kw.fc13.armv5tel.rpm
openssl-devel-1.0.0-1.kw.fc13.armv5tel.rpm

Enabling Write-Read-Verify Feature on Disks

Given the appalling reliability of modern disks, any feature that helps ensure data integrity and early detection of failure has to be deemed a good thing. What caught my attention recently is that all of the Seagate Barracuda disks I have (a number of ST31000333AS, ST31000340AS and ST31000528AS models) support the Write-Read-Verify feature. But there is a snag – disks from different batches, even for the same model, seem to disagree about the default state of this feature. Worse, the feature gets reset to its default setting on every reboot. This wouldn’t be a problem if the usual tool for such things on Linux, hdparm, had an option for controlling the state of this feature – but it didn’t. So I wrote a patch to add control of the write-read-verify capability to hdparm. This has been included upstream in hdparm as of version 9.39. Hopefully this will help keep your data a little safer.
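
Because the setting is lost on every reboot, it makes sense to reapply it from a boot script. The sketch below only prints the commands rather than running them. It assumes that -R is the Write-Read-Verify switch in your hdparm build (check the man page for your version), and wrv_commands is a made-up helper name.

```shell
#!/bin/sh
# wrv_commands STATE DISK...: print the hdparm commands that would set
# Write-Read-Verify (1 = on, 0 = off) on each listed disk.
# Assumption: -R is the Write-Read-Verify flag (hdparm 9.39 or later).
# This is a dry run; nothing is executed against the disks.
wrv_commands() {
    state="$1"
    shift
    for disk in "$@"; do
        echo "hdparm -R$state $disk"
    done
}

wrv_commands 1 /dev/sda /dev/sdb
```

To apply the commands for real, pipe the output to sh as root from something like rc.local, after reviewing what it prints.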

The Appalling Quality of Hard Disks

I decided to write this article after having spent the last three years fighting unreliable and buggy disks of certain brands. The most prominent anti-star of this article is the disk model HD501LJ – a 500GB SATA disk. If you Google the model number, I am sure you can find out who the manufacturer of it is.

The story begins back in February 2008 when I bought two of these disks for the machine I was building. Approximately 15 months later, both of the disks (used in a RAID1 mirror) failed about 20 minutes apart with massive unrecoverable media failure, taking the data with them. This was annoying and inconvenient, but I learned some decades ago about the importance of keeping backups, so it wasn’t that serious a setback – more an annoyance at the waste of time than anything else.

Before I say anything else, I would like to say that the quality of the service provided by Rexo (the company in UK that handles warranty replacements for this particular disk manufacturer) is superb. They always send the replacement disk the same day the faulty disks arrive, and as the saga that I am about to recount unfolded, they even sent couriers to pick up faulty disks and deliver the replacements at the same time. Their superb service was in fact the only reason why I bought more disks made by the same manufacturer. This turned out to have been a big mistake since Rexo no longer handle warranty services directly – you have to arrange for it via the manufacturer’s web site. No doubt they were too efficient and helpful toward the end customers for the manufacturer’s liking.

The real story begins with the disks that arrived as replacements under warranty. One of them worked OK, and passed all the cursory tests I threw at it (short and long SMART tests and a rudimentary pass of badblocks). The other initially passed the tests, but approximately 50% of the time, the actuator would get stuck and the disk would just click indefinitely when trying to power up. Power cycling typically rectified the problem but I wasn’t prepared to put up with it, so it went back to get replaced.

The next replacement disk exhibited a different interesting problem. It failed the SMART tests immediately, on a sector that was beyond the LBA addressable range. This turned into a pending sector and couldn’t be fixed because it wasn’t a writable sector – it was a spare sector for remapping unrecoverable sectors. Clearly the firmware is buggy in its handling of such a condition – it doesn’t handle the physical and logical block addressing correctly. Since this rendered the built in SMART diagnostics useless, this disk went back to be replaced again. Another 4 disks arrived after it, all with the same issue. This indicates a systematic fault and very poor quality control procedures on refurbished disks.

Eventually my case got escalated to their engineering department, and one of the engineers hand-picked a disk on which the problem wasn’t manifesting, ran a full set of tests on it (including SMART, which shockingly do not form a part of the quality control check on disks from this manufacturer – or at least they didn’t form a part of the checks in 2009), and sent me that disk to replace the one that was faulty.

Now, another year later, and the problem manifested again on one of those disks (bad sector at LBA+1 address). The only way this could be cured (refusal to reallocate sectors on overwrite and bad sectors beyond the end of LBA addressable range) was by performing a secure erase. That made the disks afterwards pass the full SMART self diagnostics and badblocks tests. The HD501LJ and HD103UJ, however, had an additional problem. Once the security on the disks was activated and the password set (required to perform a secure erase), they didn’t automatically disable security upon the erase. It also appears that the security implementation is buggy, and if the disk is secured, it will cause the machine to crash during booting on certain combinations of BIOS/SATA controller and motherboard. I worked around this by putting the disks in a machine that didn’t end up crashing and disabling the security on the disks manually.
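
For reference, the secure erase sequence with hdparm looks roughly like the sketch below. It only prints the commands, because a secure erase destroys all data on the disk. The password and device name are placeholders, secure_erase_plan is a made-up helper name, and the final step is only relevant on disks that, like the ones described above, leave security enabled after the erase.

```shell
#!/bin/sh
# secure_erase_plan DISK PASSWORD: print (not run) the hdparm commands
# for an ATA secure erase. Running them wipes the whole disk.
secure_erase_plan() {
    disk="$1"
    pass="$2"
    # 1. Set a user password, which enables the security feature set.
    echo "hdparm --user-master u --security-set-pass $pass $disk"
    # 2. Issue the secure erase itself.
    echo "hdparm --user-master u --security-erase $pass $disk"
    # 3. Only if security is still enabled afterwards: disable it manually.
    echo "hdparm --user-master u --security-disable $pass $disk"
}

secure_erase_plan /dev/sdX placeholder-password
```

Before starting, hdparm -I on the disk shows whether the security feature set is supported and whether the drive is currently frozen.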

Over time I did a bit more investigative work on these disks, and found additional bugginess in the firmware. I found that unreadable sectors that come up as “pending” sectors in SMART, once written to, disappear rather than show up as reallocated sectors. The pending sector count goes down to 0, and the reallocated sector count stays at 0. This is extremely bad behaviour, and affects not just the HD501LJ model, but also the 1TB HD103UJ disk from the same manufacturer. Since it isn’t limited to a specific model, it seems likely that it affects all disks made by this manufacturer. I should also point out that there is another model of disk from which I have observed the exact same behaviour: WD5000AAKS. You have been warned – these disks lie about the number of reallocated sectors they have, which means that one of the most important metrics indicating the health of the disk is missing. In some cases these disks also refused to reallocate the sectors on overwrite. It is worth noting that Google’s research on disk reliability shows the sector reallocation count to be the most reliable indicator of the disk’s imminent failure. They have found that 40% of the failed disks show reallocated sectors (see Figure 14. in the linked document). It seems reasonable to assume that the reason the disks made by the two manufacturers in question do not track this value is to reduce their warranty claims by making the problem remain unnoticed for longer – no doubt hoping that you won’t notice until just after the warranty period expires. Any manufacturer that does this doesn’t deserve your custom – spend your money instead on a brand of disks that works correctly! Consider yourself warned.
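
To check a disk for this behaviour yourself, compare the two counters before and after overwriting a pending sector. The sketch below extracts them from smartctl -A output; sector_counts is a made-up helper name, and the sample lines are invented purely to demonstrate the parsing, laid out like smartctl’s attribute table.

```shell
#!/bin/sh
# sector_counts: read `smartctl -A` output on stdin and print the raw
# values of attribute 5 (reallocated) and 197 (pending) as name=value.
sector_counts() {
    awk '$1 == 5 || $1 == 197 { print $2 "=" $10 }'
}

# Hypothetical sample data, not taken from a real disk.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3'

printf '%s\n' "$sample" | sector_counts
```

On a real disk you would run smartctl -A /dev/sdX | sector_counts, overwrite the affected sector, and run it again. If the pending count drops while the reallocated count stays at 0, the disk is behaving as described above.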

None of the remedial actions listed above are something an average user would likely be able to carry out, and a more knowledgeable user would get there in the end after spending more time on the task than the cost of a disk would justify. All I can say in conclusion is that unless your data and time are worthless, buy disks made by a more competent manufacturer. Good manufacturers make a bad model from time to time. Bad manufacturers make bad models all the time.

Update: I have recently come across an interesting article on disk failure rates. I cannot help but wonder how much higher the return rates would be on disks from the two manufacturers mentioned above, whose disks I’ve had the misfortune to own, if they weren’t misreporting reallocated sectors.

Shared Root Single System Image Clustering

When running multiple servers that are supposed to be nearly identical (e.g. a cluster where all nodes have the same functionality), managing the configuration can be a daunting task if not approached in the right way. In this article we will explore the tools at our disposal to make this easier and the approaches we can take to ensure that our solution has the best available compromise between performance and maintainability.

Shared Root

An excellent way to implicitly maintain the equality of configuration and package installations of multiple servers is to configure them to share the root file system. This ensures that any configuration changes on one of the nodes are implicitly and immediately applied to all the other nodes. Since there is no scope for the configuration and packages to get out of sync, the complexities of maintenance are greatly reduced. It effectively simplifies the task from one of administering n nodes down to administering 1 node.

Open Shared Root

Open shared root is the de facto standard for implementing shared root clusters on Linux. It provides all the required tools for creating an initrd that contains everything needed to start and mount the shared file system that will be used for root. Something like this is required if the shared root file system is anything other than NFS. Linux kernels support NFS booting on the kernel level, so an additional bootstrap that OSR provides isn’t strictly necessary, but even then it is still useful.

File systems that can be used for shared root fall broadly into two categories:

Cluster file systems
Examples of cluster file systems include:
GFS and GFS2 (from RedHat)
OCFS, OCFS2 (from Oracle)
VMFS (VMware)
VxCFS (Veritas/Symantec)

These file systems are characterized by the fact that they exist directly on top of a block device. In the context of shared root these are typically provided by iSCSI, ATAoE, Fibre Channel or DRBD, but directly shared SCSI buses are also sometimes used.

Network file systems
Examples of network file systems include:
NFS
CIFS (a.k.a. SMB or Samba)
GlusterFS

These file systems are characterized by the fact that they export an already existing, underlying file system. The underlying file system is typically one that exists directly on a block device, such as the cluster file systems mentioned above or any of the many non-cluster local file systems (e.g. ext3, NTFS, …).

When a cluster file system is used for the shared root, there is another important consideration – not only does the file system itself have to be supported by the bootstrapping process, but the underlying block device has to be supported, too.

OSR supports a number of combinations of the file systems and block devices mentioned above (see the OSR OS platform support matrix for the most up to date information). At the time of writing of this article, support for GlusterFS based OSR isn’t listed in the matrix, but support does exist for it – the author of this article knows this for a fact since he developed and contributed the patch and the corresponding documentation to the OSR project :-) .

The OSR website has plenty of excellent documentation and howtos on how to configure it for most common scenarios, so there seems little need to repeat them here.

Pitfalls

While OSR has a lot going for it, it isn’t without its own share of potential pitfalls. It is great when it works (and it is mature enough that most of the time it does just work). One of the ongoing aspects of its development is the minimization of the initrd bootstrap footprint. However, the fact that it has to start up all the clustering components opens a sizable scope for things to break. The author of this article has seen it happen more than once that an update to key clustering software components even on enterprise distributions has resulted in the said components becoming broken and rendering the entire cluster unbootable. This is particularly problematic on OSR because the entire system is rendered unbootable with relatively limited ability to troubleshoot the problem, since the entire root file system is unavailable. For just such cases it would be very useful to have a more fully featured bootstrap root that would allow for much more graceful troubleshooting and recovery.

Another complication introduced by OSR is that the startup and shutdown sequences have to be adjusted to cooperate gracefully with the fact that they are running inside the pre-initialisation bootstrap. Specifically, this means being careful which processes need to be excluded during shutdown’s killall5 execution, and which file systems should be left mounted during the first stage of the shutdown sequence (the rootfs needs to be left to the bootstrap root shutdown sequence to unmount). OSR comes with patches to the init scripts to take care of this, but in some cases (such as with GlusterFS), the process is not as straightforward and foolproof as one might hope. When this process hits a problem, the shutdown sequence hangs.

All this got me thinking about a similar approach but with key differences to address the above concerns.

Virtualized Shared Root

To summarize, the two features that this approach was designed to add compared to the vanilla OSR method are:

  1. Availability of a fully featured bootstrap environment.
  2. Removal of need to pay special attention to startup and shutdown sequences due to the peculiarities of running on a shared root.

The obvious way to implement 1) is to ignore the OSR bootstrap and simply use a normal (albeit a relatively minimal) OS install to prepare and bootstrap the volumes for the shared root instance. This works reasonably well, but it does bring a problem with it – the bootstrap OS isn’t implicitly identical between the nodes. In OSR this is addressed by the fact that the same initrd is used on all the nodes, so even though the bootstrap OS isn’t permanently shared, a high degree of consistency exists due to the bootstrap being initialized from the same image at every boot. So for the sake of tidiness and feature equivalence with OSR, some method must be applied to ensure that the copies are kept in sync. The tool used to achieve this is csync2.

csync2 is similar to rsync, but is specifically designed for synchronizing a set of files to a large number of remote nodes. I am not going to go into details of csync2 setup here because good documentation exists on the Linbit website and elsewhere. The csync2 configuration file I use is provided because it lists which files should be excluded from the synchronization.

group openvz-osr
{
    host openvz-osr1;
    host (openvz-osr2);
    key /etc/csync2/openvz-osr.key;

    include /*;
    exclude /dev;
    exclude /etc/adjtime;
    exclude /etc/blkid;
    exclude /etc/csync2/csync2_ssl_*;
    exclude /etc/mtab;
    exclude /etc/glusterfs;
    exclude /etc/sysconfig/hwconf;
    exclude /etc/sysconfig/network;
    exclude /etc/sysconfig/network-scripts/ifcfg-eth0;
    exclude /etc/sysconfig/networking;
    exclude /etc/sysconfig/vz-scripts;
    exclude /gluster;
    exclude /proc;
    exclude /sys;
    exclude /tmp;
    exclude /usr/libexec/hal-*;
    exclude /usr/libexec/hald-*;
    exclude /var/cache;
    exclude /var/csync2/backup;
    exclude /var/ftp;
    exclude /var/lib/csync2;
    exclude /var/lib/nfs/rpc_pipefs;
    exclude /var/lib/openais;
    exclude /var/lock;
    exclude /var/log;
    exclude /var/run;
    exclude /var/spool;
    exclude /var/tmp;
    exclude /vz;
    include /vz/template;

    backup-directory /var/csync2/backup;
    backup-generations 3;

    auto none;
}

The main thing to pay attention to here is that some files need to be host specific, rather than shared/mirrored (this is, BTW, also the case with OSR). Specifically, these include things like csync2 host keys and network configuration settings (the two nodes still have different names and IP addresses even if they are supposed to be identical in all other ways). As a bare minimum, on any shared root system at least the files/directories highlighted in red in the above config should be kept host-specific. The directories highlighted in blue are virtual file systems that are node-specific and unshareable. The rest will depend on the exact nature and purpose of the system.

The first csync2 run typically takes a few minutes, and subsequent syncs typically take a few seconds. If run as a daily cron job (or manually after any software or configuration update), this will ensure that nodes’ bootstrap OS is kept in sync.

The way 2) is achieved is by using OpenVZ container (operating-system-level) virtualization. What originally got me thinking about taking this approach is that OSR effectively fires up the shared root init chrooted to the shared volume it brought up. This is conceptually very similar to FreeBSD’s Jails and Solaris’ Zones. The Linux equivalent of those is OpenVZ. It provides very thin virtualization of the process ID space (in some cases init not having PID of 1 can cause problems) and the networking stack (so that each VM can have independent networking). Just like Jails and Zones, OpenVZ doesn’t use a disk image – instead the VM’s files live as ordinary files in the directory path where the OpenVZ chroot exists (usually /vz/private/). This makes it particularly convenient to use shared root – all that is required is that the shared file system is mounted in /vz/private.

This approach delivers in full on the original goal of making the startup and shutdown processes more robust and avoiding the need for init script patches. (Note: for cleanliness a few lines of rc.sysinit could do with commenting out because some features and /proc paths aren’t applicable to OpenVZ chroots, but this is purely to avoid errors being reported during startup.) Additionally, due to the shared root node being virtualized, it is possible to reboot the shared root node without rebooting the entire server. This is in itself quite a useful feature. As with OSR and csync2 approaches, some files and directories should be unshared (see the red list above).

Disk and File System Optimization

While RAID and flash disks have become much more common over the recent years, some of the old advisories on extracting best performance out of them appear to have become deprecated for most common uses. In this article I will try to cover the basic file system optimisations that every Linux system administrator should know and apply regularly.

Performance from the ground up

The default RAID block size offered by most controllers and Linux software RAID of 64-256KB is way too big for normal use. It will kill the performance of small IO operations without yielding a significant increase in performance for large IOs, and sometimes it even hurts large IO operations, too.

To pick the optimum RAID block size, we have to consider the capability of the disks.

Multi-sector transfers

Modern disks can handle transfers of multiple sectors in a single operation, thus significantly reducing the overheads caused by the latencies of the bus. You can find the multi-sector transfer capability of a disk using hdparm. Here is an example:

# hdparm -i /dev/hda

/dev/hda:

 Model=WDC WD400BB-75FRA0, FwRev=77.07W77, SerialNo=WD-WMAJF1111111
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=74
 BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=78125000
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3
ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6

 * signifies the current active mode

The thing we are interested in here is the following:

MaxMultSect=16, MultSect=16

I have not seen any physical disks recently that have this figure at anything other than 16, and 16 sectors * 512 bytes/sector = 8KB. Thus, 8KB is a reasonable choice for the RAID block (a.k.a. chunk in software RAID) size for traditional mechanical disks.
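The same calculation can be scripted so that it is easy to repeat for every disk in an array. This is a minimal sketch, assuming 512-byte sectors and the hdparm -i output format shown above; the function name is invented for this example:

```shell
#!/bin/sh
# Read `hdparm -i` output on stdin, extract MaxMultSect and convert it to a
# suggested RAID chunk size in KB, assuming 512-byte sectors.
chunk_kb_from_hdparm() {
    sed -n 's/.*MaxMultSect=\([0-9][0-9]*\).*/\1/p' | head -n 1 |
    while read -r sectors; do
        echo $(( sectors * 512 / 1024 ))
    done
}

# Fed with the relevant line from the example output above, this prints 8
echo ' BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16' |
    chunk_kb_from_hdparm

# Typical use (needs root):
#   hdparm -i /dev/sdX | chunk_kb_from_hdparm
```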

There are a few exceptions to these figures - some modern disks such as the very recent Western Digital ones support sector sizes of 4KB, so it is important to check the details and make sure you know what you're dealing with. Also, some virtualization platforms provide virtual disk emulation that supports transfers of as many as 128 sectors. However, in the case of virtual disk images, especially sparse ones, it is impossible to make any reasonable guesstimates about the underlying physical hardware so none of this applies in a meaningful way.

Flash memory

For flash based solid state disks, however, there are a few additional things to consider. Flash disks can only be erased in relatively large blocks, typically between 128KiB and 512KiB. Size of writes can have a massive influence on performance because in the extreme case, a single sector write of 512 bytes still ends up resulting in a whole 512KiB block being erased and re-written. Worse, if we are not careful about the way we align our partitions, we could easily end up with file system blocks (typically 4KB) that span multiple physical flash blocks, which means that for all writes to those spanning blocks we would end up having to write two flash blocks. This is bad for both performance and longevity of the disk.
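To make the spanning problem concrete, the sketch below counts how many erase blocks a single write touches, given its byte offset on the disk. The 128KiB erase block and the partition starting at the classic DOS offset of 63 sectors (32256 bytes) are assumptions chosen for illustration, and the function name is invented:

```shell
#!/bin/sh
# Count how many erase blocks a write of `length` bytes starting at byte
# `offset` touches, for an erase block of `erase` bytes.
erase_blocks_touched() {
    offset="$1"; length="$2"; erase="$3"
    first=$(( offset / erase ))
    last=$(( (offset + length - 1) / erase ))
    echo $(( last - first + 1 ))
}

# Partition starts at sector 63 (32256 bytes); the 25th 4KB file system
# block (index 24) starts at 32256 + 24 * 4096 = 130560 and crosses the
# 131072-byte boundary, so it touches 2 erase blocks.
erase_blocks_touched 130560 4096 131072

# The same 4KB block on an aligned partition touches only 1.
erase_blocks_touched 131072 4096 131072
```

With that misaligned start, one in every 32 file system blocks straddles an erase block boundary, and each write to such a block costs two erase cycles instead of one.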

Disk Geometry - Mechanical Disks

There is an additional complication in that alignment of the virtual disk geometry also plays a major role with flash memory.

On a mechanical disk the geometry is variable due to the inherently required logical translation for compensating for varying numbers of sectors per cylinder (inner cylinders have smaller circumference and thus fewer sectors than the outer cylinders). This means that any optimization we may try to do to ensure that superblock and extent beginnings never span cylinder boundaries (and thus avoid a track-to-track seek overhead, which is nowadays, fortunately, very low) is relatively meaningless, because the next cylinder shrink could throw out our calculation. While this used to be a worthwhile optimisation 25 years ago, it is, sadly, no longer the case.

There is a useful side-effect of this translation that one should be aware of. Since outer cylinders have more sectors and the rpm is constant, it follows that the beginning of the disk is faster than the end. Thus, the most performance critical partitions (e.g. swap) should be physically at the front of the disk. The difference in throughput between the beginning and the end of the disk can be as much as two-fold, so this is quite important!

Disk Geometry - Flash

Flash disks require no such translation so careful geometry alignment is both useful and worthwhile. To take advantage of it, we first have to look at the erase block size of the flash disk in use. If we are lucky, the manufacturer will have provided the erase block size in the documentation. Most, however, don't seem to. In the absence of definitive documentation, we can try to guesstimate this by doing some benchmarking. The theory is simple - we disable hardware disk caching (hdparm -W0) and test the speed of unbuffered writes to the disk using:

dd if=/dev/zero of=/dev/[hs]d[a-z] oflag=direct bs=[8192|16384|32768|65536|131072|262144|524288]

What we should be able to observe is that the performance will increase nearly linearly up to erase block size (typically, but not always, 128KiB), and then go flat.
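The benchmark is easier to compare if every run writes the same total amount of data. The sketch below only prints the dd commands, one per block size, because running them overwrites the target device. The count= argument and the 64MiB default total are additions for this example rather than part of the command above, and the function name is invented:

```shell
#!/bin/sh
# Print one dd command per block size, each writing the same total amount,
# so that the throughput figures reported by dd are directly comparable.
# WARNING: running the printed commands overwrites data on the device.
dd_bench_plan() {
    dev="$1"
    total=$(( ${2:-64} * 1024 * 1024 ))    # bytes per run, default 64MiB
    for bs in 8192 16384 32768 65536 131072 262144 524288; do
        echo "dd if=/dev/zero of=$dev oflag=direct bs=$bs count=$(( total / bs ))"
    done
}

dd_bench_plan /dev/sdX
```

The block size at which the reported throughput stops climbing is the best guess for the erase block size.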

Once we have this, we need to partition the disk with the geometry such that cylinders always start at the beginning of an erase block. Since these will always be powers of 2, the default CHS geometry with 255 heads and 63 sectors per track is pretty much the worst that can be chosen. If we set it for 128 heads and 32 sectors per track, however, things become much more sane for aligning cylinders to erase block boundaries. This yields 2MB cylinders which should work well for just about all flash disks. Thus, we can run fdisk by explicitly telling it the geometry:

fdisk -H 128 -S 32 /dev/[hs]d[a-z]
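The difference between the two geometries is easy to verify with a little arithmetic. A small sketch, assuming 512-byte sectors and a 128KiB erase block (function names are invented for this example):

```shell
#!/bin/sh
# Size of one cylinder in bytes for a given heads / sectors-per-track
# geometry, assuming 512-byte sectors.
cylinder_bytes() {
    echo $(( $1 * $2 * 512 ))
}

# Show the cylinder size and what is left over after dividing it into
# erase blocks of `erase_kib` KiB. A remainder of 0 means every cylinder
# starts on an erase block boundary.
check_geometry() {
    heads="$1"; sectors="$2"; erase_kib="$3"
    cyl=$(cylinder_bytes "$heads" "$sectors")
    echo "$heads/$sectors: cylinder=$cyl bytes, remainder=$(( cyl % (erase_kib * 1024) ))"
}

check_geometry 255 63 128    # default geometry: remainder is non-zero
check_geometry 128 32 128    # 2MiB cylinders: remainder is 0
```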

One important thing to note is that the first partition (physically) on the disk doesn't start at sector 0. This is a hangover from DOS days, but if we used the first cylinder as is, we would end up messing up the alignment of our first partition. So, what we can do instead is make a partition spanning only the 1st cylinder and simply not use it. We waste a bit of space but that is hardly a big deal. Alternatively, we could also put the /boot partition at the beginning of the disk as that changes very infrequently and is never accessed after booting.

Next we have to look at what options are available for the file system we intend to use. The ext2/3/4 file systems provide several parameters that are worth looking at.

Stride

The stride parameter is used to adjust the file system layout so that data and metadata for each block is placed on different disks. This improves performance because the operations can be parallelized.

This is specifically related to RAID - on a single disk we cannot distribute this load and there is more to be gained by keeping the data and metadata in adjacent blocks to avoid seek times and make better use of read-ahead.

The stride parameter should be set so that file system block size (usually 4KB) * stride = RAID chunk size. In this case: block size = 4KB, stride = 2, RAID chunk = 8KB.

mkfs.ext4 -E stride=2

Stripe Width

This is a setting that both RAID arrays and flash media can benefit from. It aims to arrange blocks so that writes will be to a whole stripe at once, rather than suffer a double hit on the read-modify-write operation that RAID levels with parity (RAID 3,4,5,6) suffer.
This benefit is also directly applicable to flash media because on flash we have to write an entire erase block, so cramming more useful data-writes into that single operation has a positive effect both in terms of performance and disk longevity. If the erase block size (or stripe width size for RAID) is, for example, 128KiB, we should set stripe-width = 128KiB / 4KiB = 32:

mkfs.ext4 -E stripe-width=32
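For a RAID array both values follow from the chunk size, the file system block size and the number of data-bearing disks. A minimal sketch of that arithmetic, using a hypothetical helper name and the figures of an 8-disk RAID6 (6 data disks) with 8KB chunks and 4KB blocks:

```shell
#!/bin/sh
# Derive the ext2/3/4 stride and stripe-width values from the RAID chunk
# size (KB), the file system block size (KB) and the number of data disks
# (i.e. excluding parity).
ext_raid_opts() {
    chunk_kb="$1"; block_kb="$2"; data_disks="$3"
    stride=$(( chunk_kb / block_kb ))
    echo "stride=$stride,stripe-width=$(( stride * data_disks ))"
}

# 8KB chunk, 4KB blocks, 6 data disks
ext_raid_opts 8 4 6
```

The resulting string can be passed to mkfs.ext4 after -E.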

Block Groups

So far so good, but we're not done yet. Next we need to consider the extent / block group size. The beginning of each block group contains a superblock for that group. It is the top of that inode subtree, and needs to be checked to find any file/block in that group. That means the beginning block of a block group is a major hot-spot for I/O, as it has to be accessed for every I/O operation on that group. This, in turn, means that for anything like reasonable performance we need to have the block group beginnings distributed evenly across all the disks in the RAID array, or else one disk will end up doing most of the work while the others are sitting idle.

For example, the default for ext2/3/4 is 32768 blocks in a block group. The adjustment can only be made in increments of 8 blocks (32KB assuming 4KB blocks). Other file systems may have different granularity.

The optimum number of blocks in a group will depend on the RAID level and the number of disks in the array, but you can simplify it into a RAID0 equivalent for the purpose of this exercise e.g. 8 disks in RAID6 can be considered to be 6 disks in RAID0. Ideally you want the block group size to align to the stripe width +/- 1 stride width so that the block group beginnings rotate among the disks (upward for +1 stride, downward for -1 stride, both will achieve the same effect).

The stripe width in the case described is 8KB * 6 disks = 48KB. So, for optimal performance, the block group should align to a multiple of 8KB * 7 disks = 56KB. Be careful here - in the example case we need a number that is a multiple of 56KB, but not a multiple of 48KB because if they line up, we haven't achieved anything and are back where we started!

56KB is 14 4KB blocks. Without getting involved in a major factoring exercise, 28,000 blocks sounds good (default is 32768 for ext3, which is in a reasonable ball park). 28,000*4KB is a multiple of 56KB but not 48KB, so it looks like a reasonable choice in this example.
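Candidate block group sizes can be checked mechanically rather than by eye. The sketch below (function name invented for this example) applies the three conditions discussed above: the block count must be a multiple of 8, the group size must be a multiple of the rotating size (56KB here), and it must not be a multiple of the stripe width (48KB here):

```shell
#!/bin/sh
# Check a candidate blocks-per-group value.
#   blocks     number of blocks per group (the mkfs -g value)
#   block_kb   file system block size in KB
#   rotate_kb  stripe width +/- one chunk, in KB (56 in the example)
#   stripe_kb  stripe width in KB (48 in the example)
check_block_group() {
    blocks="$1"; block_kb="$2"; rotate_kb="$3"; stripe_kb="$4"
    size_kb=$(( blocks * block_kb ))
    if [ $(( blocks % 8 )) -eq 0 ] &&
       [ $(( size_kb % rotate_kb )) -eq 0 ] &&
       [ $(( size_kb % stripe_kb )) -ne 0 ]; then
        echo ok
    else
        echo unsuitable
    fi
}

check_block_group 28000 4 56 48    # ok
check_block_group 32768 4 56 48    # unsuitable: not a multiple of 56KB
```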

Obviously, you'll have to work out the optimal numbers for your specific RAID configuration. The above example is for:
disk multi-sector = 16
ext3 block size = 4KB
RAID chunk size = 8KB
ext3 stripe-width = 12
ext3 stride = 2
RAID = 6
disks = 8

mkfs.ext4 -g 28000

In case a flash disk is being used, the default value of 32768 is fine since this results in block groups that are 128MB in size. 128MB is a clean multiple of all likely erase block sizes, so no adjustment is necessary.

Journal Size

Journal size can also be adjusted to optimize array performance. Ideally, the journal should be sized to fill a multiple of stripe size. In the example above, this means a multiple of 48KB. The default is 128MB, which doesn't quite fit, but 126MB (for example) does.

mkfs.ext4 -J size=126
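Note that mke2fs takes the -J size value in megabytes. A suitable value for other stripe sizes can be found by counting down from the desired maximum until the journal is a whole number of stripes. A minimal sketch, with an invented function name:

```shell
#!/bin/sh
# Find the largest journal size in MB, not exceeding max_mb, that is an
# exact multiple of the stripe size given in KB.
journal_mb_for_stripe() {
    max_mb="$1"; stripe_kb="$2"
    mb="$max_mb"
    while [ "$mb" -gt 0 ] && [ $(( mb * 1024 % stripe_kb )) -ne 0 ]; do
        mb=$(( mb - 1 ))
    done
    echo "$mb"
}

# 48KB stripe, at most 128MB: prints 126
journal_mb_for_stripe 128 48
```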

Since flash disks typically have very fast reads and access times, it is possible to not use journalling at all. Some crash-proofing will be lost, but fsck will typically complete very quickly on an SSD, thus minimizing the requirement for having a journal in environments that don't require the extra degree of crash-proof data consistency. If journalling is not required, simply use the ext2 file system instead:

mkfs.ext2

or disable the journal:

mkfs.ext4 -O ^has_journal

Growing

If you are certain that the file system will never need to be grown, you can adjust the amount of space reserved for growing the block group descriptor table. Unfortunately, the growth limit has to be a few percent bigger than the current file system size, but this is still better than the default of 1000x bigger or 16TB, whichever is smaller. This will also free up some space for data.

mkfs.ext4 -E resize=6T

Crippling Abstraction

The sort of finesse explained above that can be applied to extract better (and sometimes _massively_ better) performance from disks is one of the key reasons why LVM (Logical Volume Management) should be avoided where possible. It abstracts things and encourages a lack of forward thinking. Adding a new volume is the same as adding a new disk to a software RAID to stretch it. It'll upset the block group size calculation and disrupt the advantage of load balancing across all the disks in the array that we have just carefully established. By doing this you can cripple the performance on some operations from scaling linearly with the number of disks to being bogged down to the performance of just one disk.

This can make a massive difference to the IOPS figures you get out of a storage system. There is scope for offsetting this, but it reduces the flexibility somewhat. You could carve up the storage into identical logical volumes, each of which is carefully aligned to the underlying physical storage, and add logical volumes in appropriate quantity (rather than just one at a time) so that the block groups and journal size still align in an optimal way.