Flash Module Benchmark Collection: SD Cards, CF Cards, USB Sticks

Having spent a considerable amount of time, effort, and ultimately money trying to find decently performing SD, CF and USB flash modules, I feel I really need to ensure that I make the lives of other people with the same requirements easier by publishing my findings – especially since I have been unable to find a reasonable comprehensive data source with similar information.

Unfortunately, virtually all SD/microSD (referred to as uSD from now on), CF and USB flash modules have truly atrocious performance for use as normal disks (e.g. when running the OS from them on a small, low power or embedded device), regardless of what their advertised performance may be. The performance problem is specifically related to their appalling random-write performance, so this is the figure that you should be specifically paying attention to in the tables below.

As you will see, the sequential read and write performance of flash modules is generally quite good, as is random-read performance. But on their own these are largely irrelevant to overall performance you will observe when using the card to run the operating system from, if the random-write performance is below a certain level. And yes, your system will do several MB of writing to the disk just by booting up, before you even log in, so don’t think that it’s all about reads and that writes are irrelevant.

For comparison, a typical cheap laptop disk spinning at 5400rpm disk can typically achieve 90 IOPS on both random reads and random writes with typical (4KB) block size. This is an important figure to bear in mind purely to be able to see just how appalling the random write performance of most removable flash media is.

All media was primed with two passes of:

 dd if=/dev/urandom of=/dev/$device bs=1M oflag=direct

in order to simulate long term use and ensure that the performance figures reasonably accurately reflect what you might expect after the device has been in use for some time.

There are two sets of results:

1) Linear read/write test performed using:

dd if=/dev/$device of=/dev/null    iflag=direct
dd if=/dev/zero    of=/dev/$device oflag=direct

The linear read-write test script I use can be downloaded here.

2) Random read/write test performed using:

iozone -i 0 -i 2 -I -r 4K -s 512m -o -O +r +D -f /path/to/file

In all cases, the test size was 512MB. Partitions are aligned to 2MB boundaries. File system is ext4 with 4KB block size (-b 4096) and 16-block (64KB) stripe-width (-E stride=1,stripe-width=16), no journal (-O ^has_journal), and mounted without access time logging (-o noatime). The partition used for the tests starts at half of the card’s capacity, e.g. on a 16GB card, the test partition spans the space from 8GB up to the end. This is in done in order to nullify the effect of some cards having faster flash at the front of the card.

The data here is only the first modules I have tested and will be extensively updated as and when I test additional modules. Unfortunately, a single module can take over 24 hours to complete testing if their performance is poor (e.g. 1 IOPS) – and unfortunately, most of them are that bad, even those made by reputable manufacturers.

The dd linear test is probably more meaningful if you intend to use the flash card in a device that only ever performs large, sequential writes (e.g. a digital camera). For everything else, however, the dd figures are meaningless and you should instead be paying attention to the iozone results, particularly the random-write (r-w). Good random write performance also usually indicates a better flash controller, which means better wear leveling and better longevity of the card, so all other things being similar, the card with faster random-write performance is the one to get.

Due to WordPress being a little too rigid in it’s templates to allow for wide tables, you can see the SD / CF / USB benchmark data here. This table will be updated a lot so check back often.

More/Better Internal Storage on the Toshiba AC100

One of the unfortunate things about the AC100 is that the internal storage isn’t removable, and thus isn’t easily upgradable or replaceable. The latter could be an issue in the longer term because it is flash memory, so it will eventually wear out, and I since it is relatively basic eMMC, I don’t expect the flash controller to be particularly advanced when it comes to wear leveling and minimizing write amplification. Using the SD slot is an option, but if we are running the operating system from it, we cannot use it for removable media, which could be handy. We could use a USB stick instead, but then we lose the only USB port on the machine. There is no SATA controller inside the AC100.

What can be done about this? Well, models that have a 3G modem have it on a mini-PCIe USB card. Even though Tegra 2 has a PCIe controller built into it, the mini-PCIe slot isn’t fully wired up – only USB lines are connected. Since most of us can tether a data connection via our phones, and since this is more cost effective than paying for two separate mobile connections, the 3G module isn’t particularly vital. The main issue that the slot only has USB wired up. So what we would need is a USB mini-PCIe SSD. Is there such a thing? It turns out that there is. I have been able to find two:

  1. EMPhase Mini PCIe USB S1 SSD
  2. InnoDisk miniDOM-U SSD

The specification of the two modules is virtually identical (both use SLC flash among other similarities), so I decided to investigate both of them. Unfortunately, having contacted an EMPhase re-seller, they called me back having spoken to the manufacturer and talked me out of buying one, citing unspecified issues.

My local InnoDisk re-seller was more interested in selling me a product, but there were two reasons why despite very good pre-sales service I ultimately decided against buying one of these. The first and foremost was the performance specification. According to the manufacturer’s own figures, the random access performance with 4KB blocks is 1440 random read IOPS and 30 random write IOPS. Considering the price per GB of these modules is approximately 4x that of similarly performing SLC SD cards, this module was discarded on the basis of cost effectiveness.

Having discarded the above modules, there are still a few alternative options available. The low risk, tidy options include an SD mini-PCIe USB adapter and a micro-SD mini-PCIe USB adapter. They are very reasonably priced so I got one of each for testing, and I am pleased to say that they work absolutely fine in the AC100. Here is what they look like fitted into the AC100.

Dual micro-SD mini-PCIe USB Adapter
Dual micro-SD mini-PCIe USB Adapter
SD mini-PCIe USB Adapter
SD mini-PCIe USB Adapter

The SD cards will appear as USB disks. If you use the dual micro-SD adapter you can RAID the two cards together.

Unfortunately, I have found that the best results are achieved using a single SD card, purely because I haven’t found any micro-SD cards that have reasonable performance when it comes to random-write IOPS. SD cards fare a little better, but the best SD card I have found in terms of random write IOPS still tops out at a mere 19 random write IOPS using 4KB blocks. Still, it is 2/3 of the marketed figures for the InnoDisk SSD at 4x lower price per GB, and the performance just about scrapes past what I would consider minimal requirements for reasonable use.

I am currently putting together a list of SD, micro SD and USB flash devices and consistent benchmark performance figures for them, which should hopefully help you to choose the ones most suitable for your application. I hope to have the article up reasonably soon, but don’t expect it too soon – benchmarking SD cards takes a long time to do properly.

Alleviating Memory Pressure on Toshiba AC100

After all the upgrades and tweaks to the AC100 (screen upgrade to 1280×720, cooling improvements and boosting the clock speed by over 40%), only one significant issue remains: it only has 512MB of RAM. Unfortunately, the memory controller initialization is done by the closed-source boot loader, so even if we were to solder in bigger chips (Tegra2 can handle up to 1GB of RAM), it is unlikely in the extreme that it would just work.

So, other than increasing the physical amount of memory, can we actually do anything to improve the situation? Well, as a matter of fact, there are a few things.

Clawing Back Some Memory

By default, the GPU gets allocated a hefty 64MB of RAM out of 512MB that we have. This is quite a substantial fraction of our memory, and it would be nice to claw some of it back if we are not using it. I find the Nvidia’s Tegra binary accelerated driver to be too buggy to use under normal circumstances, so I use the basic unaccelerated frame buffer driver instead. There are two frame buffer allocations on the AC100: the internal display and the HDMI port. The latter is only intended for use with TVs which means we shouldn’t need a resulition of more than 1920×1080 on that port. The highest resolution display we can have on the internal port is 1280×720. That means that the maximum amount of memory used by those two frame buffers is 8100KB + 3600KB 11700KB. To be on the safe side, let’s call that 16MB. That still leaves us 48MB that we should be able to safely reclaim. We can do that by telling the kernel that there is extra memory at certain addresses using the following boot parameters:

mem=448M@0M mem=48M@464M

Make sure the accelerated binary Tegra driver is disabled in your xorg.conf, reboot and you should now have 496MB of usable RAM instead of 448MB. It’s just over an extra 10%, which should make a noticeable difference given how tight the memory is to begin with.

If you aren’t using the HDMI interface, my tests show that it is in fact possible to reduce the GPU memory to just 2MB with no ill effects, when using the 1280×720 display panel, because the frame buffer seems to operate in 16-bit mode by default:

mem=448M@0M mem=62M@450M

That leaves a total of 510MB of for applications.

Memory Compression

In the recent kernels, there are two modules that are very useful when we have plenty of CPU resources but very little memory – just the case on the AC100. They are zcache and zram. On the 3.0 kernels instead of zram we can use frontcache which is similar but has the advantage that it is aware and cooperates with zcache. Since at the time of writing this 3.0 isn’t quite as polished and stable for the AC100 as 2.6.38, let’s focus on zram instead.

Assuming you have compiled zcache support into your kernel, all you need to do to enable it is add the kernel boot paramter “zcache”. From there on, your caches should be compressed, thus increasing the amount they can store.

zram provides a virtual block device backed by RAM, but the contents are compressed, so it should always end up using less than the amount of memory it presents as a block device (unless all of the data is uncompressible, which is very unlikely). To err on the side of caution we shouldn’t set this to more than half of the total memory across all the zram devices. To ensure optimal performance, we should also set the number of zram devices to be the same as the number of CPUs cores in the system to make sure that all CPUs end up being used (each zram device handler is a single thread).

To set the number of zram devices to 2 (Tegra2 has 2 CPU cores), we need to create the file /etc/modprobe.d/zram.conf containing the following line:

options zram num_devices=2

Then once we load the zram module (modprobe zram), we should see device nodes called /dev/zram*. We can configure the devices:

echo <memory_size_in_bytes> > /sys/block/<zram_device>/disksize

The amount of memory assigned to each zram device should be such that their total combined size doesn’t exceed half of the total physical memory in the system.

Then we can create swap headers on those zram devices using mkswap (e.g. mkswap /dev/zram0) and enable swapping to them (swapon -p100 /dev/zram0).

We should now have some compressed RAM for swapping to instead of swapping to a slow SD card.

Tweaks

It turns out that some of the default settings on Linux distributions aren’t as sensible as they could be. By default the amount of stack space each thread is allocated is 8MB. This is unnecessarily large and results in more memory consumption than is necessary. Instead we can set the soft limit to 256KB using “ulimit -s 256”. Ideally we should make this happen automatically at startup by creating a file /etc/security/limits.d/90-stack.conf containing the following:

* soft stack 256

Some users have reported that this can increase the amount of available memory after booting by a a rather substantial amount. Since this is a soft limit, programs that require more stack space can still allocate it by asking for it.

Choice of Software

One of the most commonly used types of software nowdays is a web browser, and unfortunately, most web browsers have become unreasonably bloated in recent years. This is a problem when the amount of memory is as limited as in it is on most ARM machines. Firefox and to a somewhat lesser extent Chrome require a substantial amount of memory. However, there is another reasonably fully featured alternative that works on ARM – Midori. Midori is based on the Webkit rendering engine, the same one that is used by Chrome and Safari. However, it’s memory footprint is approximately half of the other browsers. Unfortunately, it’s JavaScript support isn’t quite as good as on Firefox and Chrome yet, but it is sufficiently good for most things, and if memory pressure is a serious issue, you might want to try it out.