Disk and File System Optimisation

How to get the most out of your data storage systems

While RAID arrays and flash disks have become much more common in recent years, much of the old advice on extracting the best performance from them seems to have fallen by the wayside for most common uses. In this article, I will try to cover the basic file system optimisations that every Linux system administrator should know and apply regularly.

Performance from the ground up

The default RAID block size offered by most controllers and by Linux software RAID, 64-256KB, is far too big for normal use. It kills the performance of small IO operations without yielding a significant gain for large IOs, and can sometimes even hurt large IO operations too.

To pick the optimum RAID block size, we have to consider the capability of the disks.

Multi-sector transfers

Modern disks can handle transfers of multiple sectors in a single operation, thus significantly reducing the overheads caused by the latencies of the bus. You can find the multi-sector transfer capability of a disk using hdparm. Here is an example:

# hdparm -i /dev/hda

/dev/hda:

 Model=WDC WD400BB-75FRA0, FwRev=77.07W77, SerialNo=WD-WMAJF1111111
 Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
 RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=74
 BuffType=DualPortCache, BuffSize=2048kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=78125000
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: Unspecified:  ATA/ATAPI-1 ATA/ATAPI-2 ATA/ATAPI-3
ATA/ATAPI-4 ATA/ATAPI-5 ATA/ATAPI-6

 * signifies the current active mode

The thing we are interested in here is the following:

MaxMultSect=16, MultSect=16

I have not seen any physical disks recently that have this figure at anything other than 16, and 16 sectors * 512 bytes/sector = 8KB. Thus, 8KB is a reasonable choice for the RAID block (a.k.a. chunk in software RAID) size for traditional mechanical disks.
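
As an illustration of how this translates into practice (a sketch only – md0, the sd[bcde]1 partitions and the RAID level are placeholder choices, not a recommendation), mdadm takes the chunk size in KB:

mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=8 /dev/sd[bcde]1
mdadm --detail /dev/md0 | grep -i chunk    # verify the chunk size of an existing array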

There are a few exceptions to these figures – some modern disks, such as the very recent Western Digital ones, support sector sizes of 4KB, so it is important to check the details and be sure of what you're dealing with. Also, some virtualisation platforms provide virtual disk emulation that supports transfers of as many as 128 sectors. However, in the case of virtual disk images, especially sparse ones, it is impossible to make any reasonable guess about the underlying physical hardware, so none of this applies in a meaningful way.
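
If in doubt, the kernel reports what it believes the sector sizes to be via sysfs (sda here is just an example device; the figures are in bytes):

cat /sys/block/sda/queue/logical_block_size     # sector size presented to the OS, usually 512
cat /sys/block/sda/queue/physical_block_size    # physical sector size, 4096 on 4KB-sector disks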

Flash memory

For flash based solid state disks, however, there are a few additional things to consider. Flash can only be erased in relatively large blocks, typically between 128KiB and 512KiB. The size of writes can have a massive influence on performance because, in the extreme case, a single 512-byte sector write still results in a whole 512KiB block being erased and re-written. Worse, if we are not careful about the way we align our partitions, we could easily end up with file system blocks (typically 4KB) that span two physical flash blocks, which means that every write to such a block requires two flash blocks to be erased and re-written. This is bad for both the performance and the longevity of the disk.
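
For SD cards and eMMC devices the kernel will often report the erase block size directly (mmcblk0 is just an example device; SATA SSDs sadly expose no such figure):

cat /sys/block/mmcblk0/device/preferred_erase_size    # erase block size in bytes, as reported by the card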

Disk Geometry – Mechanical Disks

There is an additional complication in that alignment of the virtual disk geometry also plays a major role with flash memory.

On a mechanical disk the geometry is variable due to the logical translation required to compensate for the varying number of sectors per cylinder (inner cylinders have a smaller circumference and thus fewer sectors than the outer cylinders). This means that any optimisation we might attempt to ensure that superblock and extent beginnings never span cylinder boundaries (and thus avoid a track-to-track seek overhead, which is nowadays, fortunately, very low) is relatively meaningless, because the next cylinder shrink could throw our calculation out. While this used to be a worthwhile optimisation 25 years ago, it is, sadly, no longer the case.

There is a useful side-effect of this translation that one should be aware of. Since the outer cylinders have more sectors and the rotational speed is constant, it follows that the beginning of the disk is faster than the end. Thus, the most performance critical partitions (e.g. swap) should be physically at the front of the disk. The difference in throughput between the beginning and the end of the disk can be as much as two-fold, so this is quite important!
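
If you want to see the difference on your own disk, a rough read-only comparison can be made with dd (sda is just an example, and the skip figure needs adjusting so the second read lands near the end of your particular disk):

dd if=/dev/sda of=/dev/null bs=1M count=512 iflag=direct                # read 512MB from the very beginning
dd if=/dev/sda of=/dev/null bs=1M count=512 skip=37000 iflag=direct     # read 512MB from near the end of a 40GB disk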

Disk Geometry – Flash

Flash disks require no such translation, so careful geometry alignment is both useful and worthwhile. To take advantage of it, we first have to look at the erase block size of the flash disk in use. If we are lucky, the manufacturer will have provided the erase block size in the documentation. Most, however, don't seem to. In the absence of definitive documentation, we can try to guesstimate it with some benchmarking. The theory is simple – we disable the disk's write cache (hdparm -W0) and test the speed of unbuffered writes to the disk using:

dd if=/dev/zero of=/dev/[hs]d[a-z] oflag=direct bs=[8192|16384|32768|65536|131072|262144|524288]

What we should be able to observe is that performance increases nearly linearly with block size up to the erase block size (typically, but not always, 128KiB), and then flattens out.
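
A minimal loop along these lines (destructive – it writes over the raw device, so only run it on a disk holding no data you care about; sdb is assumed to be the flash disk under test):

hdparm -W0 /dev/sdb    # disable the write cache first
for bs in 8k 16k 32k 64k 128k 256k 512k; do
    echo -n "$bs: "
    dd if=/dev/zero of=/dev/sdb oflag=direct bs=$bs count=256 2>&1 | grep -o '[0-9.,]* [MG]B/s'
done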

Once we have this, we need to partition the disk with a geometry such that cylinders always start at the beginning of an erase block. Since erase block sizes are always powers of 2, the default CHS geometry of 255 heads and 63 sectors per track is pretty much the worst possible choice. If we set it to 128 heads and 32 sectors per track, however, aligning cylinders to erase block boundaries becomes much saner: this yields 2MB cylinders, which should work well for just about all flash disks. Thus, we can run fdisk and explicitly tell it the geometry:

fdisk -H 128 -S 32 /dev/[hs]d[a-z]

One important thing to note is that the first partition on the disk doesn't physically start at sector 0. This is a hangover from the DOS days, but if we used the first cylinder as-is, we would end up messing up the alignment of our first partition. So, what we can do instead is make a partition spanning only the first cylinder and simply not use it. We waste a bit of space, but that is hardly a big deal. Alternatively, we could put the /boot partition at the beginning of the disk, as it changes very infrequently and is never accessed after booting.
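
Once partitioned, it is worth double-checking the result. With the 2MB cylinders chosen above, every partition should start on a sector number divisible by 4096 (4096 sectors * 512 bytes = 2MB); sdb and sdb1 are again just example names:

fdisk -l -u /dev/sdb            # list the partition table with start offsets in sectors
cat /sys/block/sdb/sdb1/start   # start sector of the first partition, as seen by the kernel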

Next, we have to look at what options are available for the file system we intend to use. The ext2/3/4 family of file systems provides several parameters that are worth looking at.

Stride

The stride parameter is used to adjust the file system layout so that the data and metadata for each block are placed on different disks. This improves performance because the operations can be parallelised.

This is specifically related to RAID – on a single disk we cannot distribute this load, and there is more to be gained by keeping the data and metadata in adjacent blocks to avoid seek times and make better use of read-ahead.

The stride parameter should be set so that file system block size (usually 4KB) * stride = RAID chunk size. In this case the block size is 4KB and the RAID chunk is 8KB, so stride = 2:

mkfs.ext4 -E stride=2

Stripe Width

This is a setting that both RAID arrays and flash media can benefit from. It aims to arrange blocks so that writes cover a whole stripe at once, rather than incurring the double hit of the read-modify-write cycle that parity RAID levels (RAID 3, 4, 5, 6) suffer from.
This benefit also applies directly to flash media, because on flash an entire erase block has to be written anyway, so cramming more useful data writes into that single operation helps both performance and the longevity of the disk. If the erase block size (or the stripe size for RAID) is, for example, 128KiB, we should set stripe-width = 128KiB / 4KiB = 32:

mkfs.ext4 -E stripe-width=32
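
Both parameters can be passed together in a single -E option. As a sketch (md0 and sdb1 are placeholder device names), the RAID6 example worked through below would use a stripe width of 12 blocks, while the 128KiB erase block flash case uses 32:

mkfs.ext4 -b 4096 -E stride=2,stripe-width=12 /dev/md0    # parity RAID: 8KB chunks, 6 data disks
mkfs.ext4 -b 4096 -E stripe-width=32 /dev/sdb1            # flash: 128KiB erase blocks, 4KB file system blocks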

Block Groups

So far so good, but we're not done yet. Next we need to consider the extent / block group size. The beginning of each block group contains a superblock for that group. It is the top of that group's inode subtree and needs to be consulted to find any file or block in that group. That means the first block of a block group is a major hot-spot for I/O, as it has to be accessed for every I/O operation on that group. This, in turn, means that for anything like reasonable performance we need the block group beginnings distributed evenly across all the disks in the RAID array, or else one disk will end up doing most of the work while the others sit idle.

For example, the default for ext2/3/4 is 32768 blocks in a block group. The adjustment can only be made in increments of 8 blocks (32KB assuming 4KB blocks). Other file systems may have different granularity.

The optimum number of blocks in a group will depend on the RAID level and the number of disks in the array, but for the purpose of this exercise you can simplify it to a RAID0 equivalent, e.g. 8 disks in RAID6 can be considered to be 6 disks in RAID0. Ideally you want the block group size to align to the stripe width plus or minus one stride, so that the block group beginnings rotate among the disks (upward for +1 stride, downward for -1 stride; both achieve the same effect).

The stripe width in the case described is 8KB * 6 data disks = 48KB. So, for optimal performance, the block group size should align to a multiple of the stripe width plus one chunk: 48KB + 8KB = 56KB. Be careful here – we need a number that is a multiple of 56KB but not a multiple of 48KB, because if the two line up we haven't achieved anything and are back where we started!

56KB is 14 blocks of 4KB. Without getting involved in a major factoring exercise, 28,000 blocks sounds good (the default is 32,768 for ext3, which is in a reasonable ball park): 28,000 is a multiple of 8, and 28,000 * 4KB = 112,000KB is a multiple of 56KB but not of 48KB, so it looks like a reasonable choice in this example.
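
If you would rather not do the factoring by hand, a quick shell loop (just a sketch for the 56KB/48KB figures above; the block group size must also remain a multiple of 8 blocks, which the seq step takes care of) will list the candidates in a given range:

for g in $(seq 27000 8 29000); do
    # keep multiples of 14 blocks (56KB) that are not multiples of 12 blocks (48KB)
    if [ $((g % 14)) -eq 0 ] && [ $((g % 12)) -ne 0 ]; then
        echo $g
    fi
done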

Obviously, you’ll have to work out the optimal numbers for your specific RAID configuration, the above example is for:
disk multi-sector = 16
ext3 block size = 4KB
RAID chunk size = 8KB
ext3 stripe-width = 12
ext3 stride = 2
RAID = 6
disks = 8

mkfs.ext4 -g 28000

If a flash disk is being used, the default value of 32768 is fine, since it results in block groups that are 128MB in size. 128MB is a clean multiple of all likely erase block sizes, so no adjustment is necessary.

Journal Size

Journal size can also be adjusted to optimise array performance. Ideally, the journal should be sized to a multiple of the stripe size. In the example above, this means a multiple of 48KB. The default is 128MB, which doesn't quite fit, but 126MB (for example) does:

mkfs.ext4 -J size=126

Since flash disks typically have very fast reads and access times, it is possible to forgo journalling altogether. Some crash-proofing will be lost, but fsck will typically complete very quickly on an SSD, which reduces the need for a journal in environments that don't require the extra degree of crash-proof data consistency. If journalling is not required, simply use the ext2 file system instead:

mkfs.ext2

or disable the journal:

mkfs.ext4 -O ^has_journal
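
The same can be done to an existing ext3/4 file system with tune2fs, run against an unmounted file system (sdb1 being a placeholder):

tune2fs -O ^has_journal /dev/sdb1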

Growing

If you are certain that the file system will never need to be grown, you can reduce the amount of space reserved to allow for future growth (the reserved group descriptor blocks). Unfortunately, the growth limit has to be a few percent bigger than the current file system size, but this is still better than the default of 1000x bigger or 16TB, whichever is smaller. This will also free up some space for data.

mkfs.ext4 -E resize=6T
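
Whether the reservation actually shrank can be checked afterwards by looking at the reserved group descriptor blocks in the superblock (sdb1 again being just an example):

dumpe2fs -h /dev/sdb1 | grep -i 'reserved gdt'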

Crippling Abstraction

The sort of finesse described above, which can be applied to extract better (and sometimes _massively_ better) performance from disks, is one of the key reasons why LVM (Logical Volume Management) should be avoided where possible. It abstracts things away and encourages a lack of forward thinking. Adding a new volume is much like adding a new disk to a software RAID to stretch it: it will upset the block group size calculation and disrupt the load balancing across all the disks in the array that we have just carefully established. Doing this can take some operations from scaling linearly with the number of disks down to being bogged down at the performance of just one disk.

This can make a massive difference to the IOPS figures you get out of a storage system. There is scope for offsetting this, but it reduces the flexibility somewhat. You could carve up the storage into identical logical volumes, each carefully aligned to the underlying physical storage, and add logical volumes in appropriate quantities (rather than just one at a time) so that the block groups and the journal size still align in an optimal way.
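
As a rough sketch of that approach (the 48KB stripe and the md0 device come from the RAID6 example above, and the exact figures would need working out for your own array), the LVM data area can be aligned to the stripe and the extent size made a stripe multiple so that every logical volume starts on a stripe boundary:

pvcreate --dataalignment 48k /dev/md0    # start the LVM data area on a stripe boundary
vgcreate -s 48m vg0 /dev/md0             # extent size a multiple of the 48KB stripe, so LVs begin stripe-aligned
lvcreate -L 96g -n data0 vg0             # sizes in whole extents keep the next LV aligned too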

This post appeared first on altechnative.net.