Shared Root Single System Image Clustering

When running multiple servers that are supposed to be nearly identical (e.g. a cluster where all nodes have the same functionality), managing the configuration can be a daunting task if not approached in the right way. In this article we will explore the tools at our disposal to make this easier, and the approaches we can take to ensure that our solution strikes the best available compromise between performance and maintainability.

Shared Root

An excellent way to implicitly keep the configuration and package installations of multiple servers identical is to configure them to share the root file system. This ensures that any configuration change made on one of the nodes is implicitly and immediately applied to all the other nodes. Since there is no scope for the configuration and packages to get out of sync, the complexity of maintenance is greatly reduced: the task is effectively reduced from administering n nodes to administering one.

Open Shared Root

Open Shared Root (OSR) is the de facto standard for implementing shared root clusters on Linux. It provides all the tools required to create an initrd containing everything needed to start and mount the shared file system that will be used as root. A bootstrap of this kind is required whenever the shared root file system is anything other than NFS; the Linux kernel supports NFS root booting natively, so the additional bootstrap OSR provides isn't strictly necessary in that case, although it is still useful.
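For reference, a native NFS root boot needs nothing beyond the right kernel parameters; a minimal sketch (the server address and export path below are illustrative assumptions) would be a kernel command line along these lines:

# Illustrative kernel command line for a native NFS root boot;
# the server address and export path are assumptions.
root=/dev/nfs nfsroot=192.168.0.10:/srv/nfsroot ip=dhcp rw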

File systems that can be used for shared root fall broadly into two categories:

Cluster file systems
Examples of cluster file systems include:
GFS and GFS2 (from Red Hat)
OCFS and OCFS2 (from Oracle)
VMFS (VMware)
VxCFS (Veritas/Symantec)

These file systems are characterized by the fact that they exist directly on top of a block device. In the context of shared root these are typically provided by iSCSI, ATAoE, Fibre Channel or DRBD, but directly shared SCSI buses are also sometimes used.
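As a hedged illustration of what such a stack can look like (the device path, cluster name and journal count below are assumptions, not a recommended configuration), a GFS2 file system sitting directly on a DRBD device might be created and mounted like this:

# Illustrative only; device path, cluster name and journal count are assumptions.
# Mounting with lock_dlm requires the cluster locking stack to be up.
mkfs.gfs2 -p lock_dlm -t mycluster:sharedroot -j 2 /dev/drbd0
mount -t gfs2 /dev/drbd0 /mnt/sharedroot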

Network file systems
Examples of network file systems include:
NFS
CIFS (a.k.a. SMB; implemented on Linux by Samba)
GlusterFS

These file systems are characterized by the fact that they export an already existing, underlying file system. The underlying file system is typically one that exists directly on a block device, such as the cluster file systems mentioned above or any of the many non-cluster local file systems (e.g. ext3, NTFS, …).
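To illustrate the difference (the export path and host name are assumptions), an NFS based shared root simply re-exports a directory of an ordinary local file system and mounts it on the nodes:

# On the storage server, an /etc/exports entry (path and options are illustrative):
/srv/sharedroot  *(rw,sync,no_root_squash)

# On each cluster node:
mount -t nfs storage1:/srv/sharedroot /mnt/sharedroot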

When a cluster file system is used for the shared root, there is another important consideration – not only does the file system itself have to be supported by the bootstrapping process, but the underlying block device has to be supported, too.

OSR supports a number of combinations of the file systems and block devices mentioned above (see the OSR OS platform support matrix for the most up-to-date information). At the time of writing, GlusterFS based OSR isn't listed in the matrix, but support for it does exist: the author of this article knows this for a fact, since he developed and contributed the patch and the corresponding documentation to the OSR project :-).

The OSR website has plenty of excellent documentation and howtos covering the most common scenarios, so there is little need to repeat them here.

Pitfalls

While OSR has a lot going for it, it isn't without its share of potential pitfalls. It is great when it works (and it is mature enough that most of the time it does just work). One of the ongoing aspects of its development is the minimization of the initrd bootstrap footprint, but the fact that the bootstrap has to start up all the clustering components leaves sizable scope for things to break. The author of this article has seen it happen more than once that an update to key clustering components, even on enterprise distributions, has left those components broken and the entire cluster unbootable. This is particularly problematic on OSR because the whole system becomes unbootable with relatively limited ability to troubleshoot the problem, since the root file system itself is unavailable. For just such cases it would be very useful to have a more fully featured bootstrap root that allowed for more graceful troubleshooting and recovery.

Another complication introduced by OSR is that the startup and shutdown sequences have to be adjusted to cooperate gracefully with the fact that they are running inside the pre-initialisation bootstrap. Specifically, this means being careful about which processes need to be excluded from shutdown's killall5 sweep, and which file systems should be left mounted during the first stage of the shutdown sequence (the root file system needs to be left to the bootstrap root shutdown sequence to unmount). OSR comes with patches to the init scripts to take care of this, but in some cases (such as with GlusterFS) the process is not as straightforward and foolproof as one might hope, and when it goes wrong the shutdown sequence hangs.
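As a rough sketch of the kind of adjustment involved (this is not the actual OSR patch, and the daemon names are assumptions for a GlusterFS backed root), the shutdown script has to hand killall5 the PIDs of the processes that keep the root file system alive so that they survive the sweep:

# Sketch only, not the actual OSR patch; daemon names are assumptions.
# Collect the PIDs of the processes backing the root file system...
OMIT_PIDS=$(pidof glusterfs glusterfsd 2>/dev/null | tr ' ' ',')
# ...and tell killall5 to leave them alone while terminating everything else.
if [ -n "$OMIT_PIDS" ]; then
    killall5 -15 -o "$OMIT_PIDS"
else
    killall5 -15
fi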

All this got me thinking about a similar approach but with key differences to address the above concerns.

Virtualized Shared Root

To summarize, the two features that this approach was designed to add compared to the vanilla OSR method are:

  1. Availability of a fully featured bootstrap environment.
  2. Removal of the need to pay special attention to the startup and shutdown sequences due to the peculiarities of running on a shared root.

The obvious way to implement 1) is to ignore the OSR bootstrap and simply use a normal (albeit relatively minimal) OS install to prepare and bootstrap the volumes for the shared root instance. This works reasonably well, but it brings a problem with it – the bootstrap OS isn't implicitly identical between the nodes. In OSR this is addressed by the fact that the same initrd is used on all the nodes, so even though the bootstrap OS isn't permanently shared, a high degree of consistency exists because the bootstrap is initialized from the same image at every boot. So for the sake of tidiness and feature equivalence with OSR, some method must be applied to ensure that the copies are kept in sync. The tool used to achieve this is csync2.

csync2 is similar to rsync, but is specifically designed for synchronizing a set of files across a large number of remote nodes. I am not going to go into the details of csync2 setup here because good documentation exists on the Linbit website. The csync2 configuration file I use is provided below because it shows which files should be excluded from the synchronization.

group openvz-osr
{
host openvz-osr1;
host (openvz-osr2);
key /etc/csync2/openvz-osr.key;

include /*;
exclude /dev;
exclude /etc/adjtime;
exclude /etc/blkid;
exclude /etc/csync2/csync2_ssl_*;
exclude /etc/mtab;
exclude /etc/glusterfs;
exclude /etc/sysconfig/hwconf;
exclude /etc/sysconfig/network;
exclude /etc/sysconfig/network-scripts/ifcfg-eth0;

exclude /etc/sysconfig/networking;
exclude /etc/sysconfig/vz-scripts;
exclude /gluster;
exclude /proc;
exclude /sys;

exclude /tmp;
exclude /usr/libexec/hal-*;
exclude /usr/libexec/hald-*;
exclude /var/cache;
exclude /var/csync2/backup;
exclude /var/ftp;
exclude /var/lib/csync2;
exclude /var/lib/nfs/rpc_pipefs;
exclude /var/lib/openais;
exclude /var/lock;
exclude /var/log;
exclude /var/run;
exclude /var/spool;
exclude /var/tmp;

exclude /vz;

include /vz/template;

backup-directory /var/csync2/backup;
backup-generations 3;

auto none;
}

The main thing to pay attention to here is that some files need to be host specific, rather than shared/mirrored (this is, by the way, also the case with OSR). Specifically, these include things like the csync2 host keys and the network configuration settings (the two nodes still have different names and IP addresses even if they are supposed to be identical in all other ways). As a bare minimum, on any shared root system those host-specific entries (the csync2_ssl_* keys and the network configuration files excluded in the config above) should be kept unshared, and the virtual file systems (such as /dev, /proc and /sys) are node-specific and unshareable by nature. The rest will depend on the exact nature and purpose of the system.

The first csync2 run typically takes a few minutes, and subsequent syncs typically take a few seconds. If run as a daily cron job (or manually after any software or configuration update), this will ensure that the nodes' bootstrap OS is kept in sync.
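A minimal sketch of such a cron job, assuming csync2 is installed in the usual location and the cluster is defined as in the config above (the script path and file name are assumptions):

#!/bin/sh
# /etc/cron.daily/csync2-sync (illustrative; path and file name are assumptions)
# Check for locally modified files and propagate them to the peer node(s);
# csync2 -x performs the full check-and-sync cycle.
exec /usr/sbin/csync2 -x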

The way 2) is achieved is by using OpenVZ container based (OS-level) virtualization. What originally got me thinking about taking this approach is that OSR effectively fires up the shared root init chrooted to the shared volume it has brought up. This is conceptually very similar to FreeBSD's Jails and Solaris' Zones, and the Linux equivalent of those is OpenVZ. It provides very thin virtualization of the process ID space (in some cases init not having a PID of 1 can cause problems) and of the networking stack (so that each VM can have independent networking). Just like Jails and Zones, OpenVZ doesn't use a disk image – instead a VM's files live as ordinary files in the directory path where the OpenVZ chroot exists (usually /vz/private/). This makes it particularly convenient for shared root use – all that is required is that the shared file system is mounted in /vz/private.
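A hedged sketch of what this looks like in practice (the container ID, GlusterFS volume name and host name are assumptions): the shared file system is mounted at the container's private area and the container is then started normally:

# Illustrative only; container ID, volume name and host name are assumptions.
mount -t glusterfs gluster1:/sharedroot /vz/private/101
vzctl start 101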

This approach delivers in full on the original goal of making the startup and shutdown processes more robust and avoiding the need for init script patches. (Note: for cleanliness, a few lines of rc.sysinit could do with being commented out, because some features and /proc paths aren't applicable to OpenVZ chroots, but this is purely to avoid errors being reported during startup.) Additionally, because the shared root node is virtualized, it is possible to reboot it without rebooting the entire server, which is in itself quite a useful feature. As with the OSR and csync2 approaches, some files and directories should be kept unshared (see the host-specific list above).
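For example, under the same assumptions as in the sketch above, the shared root instance can be restarted independently of the host node:

# Restart only the shared root container; the host keeps running.
# The container ID 101 is an assumption carried over from the sketch above.
vzctl restart 101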