Hardware Accelerated SSL on SheevaPlug (Marvell Kirkwood ARM) Using OpenSSL on Fedora

I have recently been spending quite a lot of time working on Linux on various ARM devices. It is quite amazing what ARM hardware is capable of nowadays. One of the most popular ARM-based machines available is the SheevaPlug. Its performance is pretty good for a small server – my experience shows that the 1.2GHz Marvell Kirkwood 88F6281 compares quite favourably to the likes of the 1.66GHz Intel Atom N450 in terms of server performance and especially in terms of power usage. Atom N450 systems have a typical power draw of about 22W idle and 28W under load – a far cry from the supposed 7.6W total of 5.5W N450 + 2.1W NM10. The SheevaPlug, on the other hand, draws 2.3W idle and 7W under load.

In some areas, however, the Atom does hold a performance advantage, especially in usage that requires heavy number crunching – unlike the Marvell Kirkwood, the Atom N450 has an FPU and SIMD capability via the SSE/SSE2/SSSE3 instruction sets. Applications that do encryption, such as OpenSSL, are one group that gets better performance on the Atom N450. Or do they…

Not quite. The Kirkwood ARM has an ace up its sleeve, and as it turns out, it is one powerful enough to allow it to close the gap against a processor with 4x the power budget. It has a hardware crypto engine that supports MD5, SHA1 and AES-128 acceleration.

Unfortunately, mainstream Linux distributions don’t come with the hardware crypto acceleration enabled, and most of the available documentation is sufficiently out of date to be inapplicable to the current generation of distributions. All of it points at OCF Linux, which hasn’t been updated for kernels past 2.6.33 and OpenSSL 0.9.8n, both of which are deprecated. I have modified the kernel patches to make them work on 2.6.35, but unfortunately the cryptodev driver uses the locked ioctl operation, which has been removed from the kernel starting with 2.6.36, so further modifications are required to make it work on later kernels. OCF Linux also doesn’t appear to have been updated since late 2010. But things are not as bad as they initially seem – it turns out that there is an alternative.

Kernel patches are required because acceleration depends on the BSD-style cryptodev kernel interface. There is an alternative, more up-to-date project that provides this much less intrusively: Cryptodev-linux. It provides a standalone driver that doesn’t require the entire kernel to be recompiled, and it works with the 2.6.36+ kernels.

That just leaves OpenSSL support. Well, it turns out that OpenSSL 1.0.0 already comes with support for cryptodev hardware offload; it just isn’t enabled by default. It has to be enabled during the configure stage by providing -DHAVE_CRYPTODEV (for encryption offload) and -DUSE_CRYPTODEV_DIGESTS (for hashing offload). If you are building against Cryptodev-linux you will also have to provide the -DHASH_MAX_LEN=64 parameter – this is normally defined in OCF’s cryptodev.h header file, but isn’t present in the header files that Cryptodev-linux provides. Not a big deal, but something to bear in mind when you are building your own OpenSSL with cryptodev engine support.
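For a build from an unpacked upstream source tree rather than a distribution package, the flags above could be passed roughly as in the sketch below. The helper function names are made up for illustration, and the use of ./config with the shared option is an assumption about a typical build, not something taken from my own setup. The last helper assumes that a build with the engine compiled in lists it under the id "cryptodev" in the output of openssl engine.

```shell
# The three defines named in the text above.
# -DHASH_MAX_LEN=64 is only needed when building against Cryptodev-linux headers.
cryptodev_defines() {
    echo "-DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -DHASH_MAX_LEN=64"
}

# Hypothetical helper: configure and build inside an unpacked OpenSSL 1.0.0 tree.
# Defined here but not called; run it from the source directory.
build_openssl_with_cryptodev() {
    # Word splitting of the defines is intentional.
    ./config shared $(cryptodev_defines) && make
}

# Reads `openssl engine` output on stdin and succeeds if a cryptodev engine is listed.
has_cryptodev_engine() {
    grep -q '^(cryptodev)'
}

# Usage, once the new openssl binary is installed:
#   openssl engine | has_cryptodev_engine && echo "cryptodev engine present"
```

If the engine is not listed, the defines most likely did not reach the compiler, which is worth checking before blaming the kernel module.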

So, how big a difference does the Kirkwood’s acceleration make? Quite a substantial one. Here is what the openssl speed test produces:

Kirkwood without cryptodev:
# openssl speed -evp aes-128-cbc
Doing aes-128 cbc for 3s on 16 size blocks: 1870065 aes-128 cbc’s in 3.00s
Doing aes-128 cbc for 3s on 64 size blocks: 516074 aes-128 cbc’s in 3.00s
Doing aes-128 cbc for 3s on 256 size blocks: 132474 aes-128 cbc’s in 3.00s
Doing aes-128 cbc for 3s on 1024 size blocks: 33342 aes-128 cbc’s in 3.00s
Doing aes-128 cbc for 3s on 8192 size blocks: 4171 aes-128 cbc’s in 3.00s

Kirkwood with cryptodev:
# openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 85277 aes-128-cbc’s in 0.08s
Doing aes-128-cbc for 3s on 64 size blocks: 82960 aes-128-cbc’s in 0.08s
Doing aes-128-cbc for 3s on 256 size blocks: 59806 aes-128-cbc’s in 0.03s
Doing aes-128-cbc for 3s on 1024 size blocks: 40939 aes-128-cbc’s in 0.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 8227 aes-128-cbc’s in 0.00s

The results show, predictably, that with very small (unrealistically small) data blocks, software-only userspace crypto is faster due to less context switching. With 1KB blocks, however, hardware crypto is 23% faster, and with 8KB blocks the hardware engine goes twice as fast as the software-only option. But what is really impressive is the reduction in CPU time. Because the hardware crypto engine is asynchronous, there is practically no CPU time required when using it, which is important since it leaves the CPU free to get on with other tasks.
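To put the block counts into more familiar units, the sketch below converts them to MB/s. It assumes each run lasted the nominal 3 seconds of wall-clock time (the much smaller times printed in the cryptodev run appear to be CPU time rather than elapsed time), and throughput_mbps is a made-up helper name, not part of OpenSSL.

```shell
# throughput_mbps COUNT BLOCK_BYTES [SECONDS]
# Prints throughput in MB/s (10^6 bytes per second) to two decimal places.
# SECONDS defaults to 3, the nominal length of each openssl speed run (assumed wall-clock).
throughput_mbps() {
    awk -v n="$1" -v b="$2" -v s="${3:-3}" \
        'BEGIN { printf "%.2f\n", n * b / s / 1000000 }'
}

throughput_mbps 33342 1024   # software,  1KB blocks: prints 11.38
throughput_mbps 40939 1024   # cryptodev, 1KB blocks: prints 13.97
throughput_mbps 4171 8192    # software,  8KB blocks: prints 11.39
throughput_mbps 8227 8192    # cryptodev, 8KB blocks: prints 22.47
```

Seen this way, the software path is flat at a little over 11 MB/s regardless of block size, while the hardware path keeps improving as the blocks get larger.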

For comparison, here are the Atom N450 results:

# openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 3813930 aes-128-cbc’s in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 1098375 aes-128-cbc’s in 2.99s
Doing aes-128-cbc for 3s on 256 size blocks: 294884 aes-128-cbc’s in 2.99s
Doing aes-128-cbc for 3s on 1024 size blocks: 74520 aes-128-cbc’s in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 9245 aes-128-cbc’s in 2.99s

So the Atom is faster all around – on 1KB blocks it is 82% faster, which reduces to a 12% advantage using 8KB blocks. But let us not forget that we could, in theory, run two instances of OpenSSL, one with hardware offload and one without, which would give us the combined total performance of both, if that is all we needed the machine to do. This would give us figures of approximately:

1KB: 33342+40939=74281
8KB: 4171+8227=12398

This ties with the Atom using 1KB blocks, and beats it by 34% using 8KB blocks – all in a power envelope 4x smaller. Pretty impressive.
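The percentages quoted above can be reproduced from the raw block counts with a few lines of shell. This is only a sketch of the arithmetic; pct_faster is a made-up helper name, and the combined figures rest on the same assumption as the text, namely that the two OpenSSL instances do not slow each other down.

```shell
# pct_faster A B: how much larger A is than B, in percent (one decimal place).
pct_faster() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a / b - 1) * 100 }'
}

# Atom N450 versus Kirkwood with cryptodev
pct_faster 74520 40939           # 1KB blocks: prints 82.0
pct_faster 9245 8227             # 8KB blocks: prints 12.4

# Kirkwood software + cryptodev combined (theoretical best case)
combined_1k=$((33342 + 40939))   # 74281
combined_8k=$((4171 + 8227))     # 12398

pct_faster "$combined_1k" 74520  # versus Atom, 1KB blocks: prints -0.3 (effectively a tie)
pct_faster "$combined_8k" 9245   # versus Atom, 8KB blocks: prints 34.1
```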

Installing Cryptodev-linux is trivial: after extracting the tarball, it is just a matter of the usual “make; make install” procedure (make sure you have the kernel headers for your kernel installed and available in /lib/modules/$(uname -r)/build/).
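As a rough sketch, the install and a quick sanity check might look like the following. The function names are hypothetical, and the depmod and modprobe steps are my assumption about what is needed to get the module loaded after make install; the module is expected to be called cryptodev and to expose the /dev/crypto device node.

```shell
# Hypothetical helper: build, install and load the module.
# Defined here but not called; run it as root from the extracted source directory.
install_cryptodev() {
    make && make install && depmod -a && modprobe cryptodev
}

# Succeeds if the given path (default /dev/crypto) exists and is a character device.
crypto_device_ready() {
    [ -c "${1:-/dev/crypto}" ]
}

# Usage:
#   install_cryptodev && crypto_device_ready && echo "cryptodev is available"
```

If the device node is missing after the module loads, OpenSSL will quietly fall back to software crypto, so this check is worth doing before running any benchmarks.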

I mentioned above the additional parameters required to make OpenSSL build with cryptodev support. In Fedora 13’s OpenSSL source package, you can edit the relevant line in the spec file. The relevant section in my version reads:

./Configure --prefix=/usr --openssldir=%{_sysconfdir}/pki/tls ${sslflags} zlib enable-camellia enable-seed enable-tlsext enable-rfc3779 enable-cms enable-md2 no-idea no-mdc2 no-rc5 no-ec no-ecdh no-ecdsa --with-krb5-flavor=MIT --enginesdir=%{_libdir}/openssl/engines --with-krb5-dir=/usr -DHAVE_CRYPTODEV -DUSE_CRYPTODEV_DIGESTS -DHASH_MAX_LEN=64 shared threads ${sslarch} fips

In case you cannot modify/build it yourself, here are the packages:
openssl-1.0.0-1.kw.fc13.src.rpm
openssl-1.0.0-1.kw.fc13.armv5tel.rpm
openssl-devel-1.0.0-1.kw.fc13.armv5tel.rpm

25 thoughts on “Hardware Accelerated SSL on SheevaPlug (Marvell Kirkwood ARM) Using OpenSSL on Fedora”

  1. I’ve tried getting this to work myself and while the tests you show succeed, actually performing SSL bombs out. Did you encounter any further trouble after you made this post?

    • How did you test SSL to make it break? I haven’t tested apache/mod_ssl yet, but I can verify that ssh definitely uses it, since the module reference count follows the number of open ssh sessions I have running, so at least that is getting correctly offloaded. Can you check with lsmod that your module reference count goes up with additional ssh sessions?

  2. Pingback: SheevaPlug: Hardware accelarated encryption | Keeping Myself Together

  3. It would be interesting to see a performance comparison between cryptodev-linux and the AF_ALG interface. AF_ALG looks like it requires more resources and system calls.

  4. I’ve successfully compiled openssl 1.0.0g in Debian Wheezy (kernel 3.2.0) on the Marvell Kirkwood platform as described in your article. Thanks!

    Benchmarks run fine and show that it is using hw acceleration. However, I can’t get any applications to use the cryptodev API!
    nginx fails to serve any https connections when cryptodev is available, with the following in the logfile:
    [alert] 8176#0: worker process 8185 exited on signal 11
    repeated a lot of times with different pids

    vsftpd seems to ignore cryptodev completely (I’ve set it to use aes128-sha as the algorithm, but for some unknown reason it does not use cryptodev for this)

    • I presume you checked that the applications you mention are actually dynamically linked against the OpenSSL library?

      Have you tried OpenSSH? I haven’t tested other things yet, but I know that OpenSSH definitely uses cryptodev if available – module reference count goes up, and it works without any ill effects.

      • yes, both applications show a link to the custom compiled version of libssl when I use ldd to check

        Furthermore, I’ve confirmed that openssh uses cryptodev (with scp -c aes256-cbc ).

        I’ll check some more applications. It’s weird that they don’t all work properly with cryptodev.

        • “scp -c aes256-cbc” doesn’t make sense for an offload test – IIRC Kirkwood only supports AES128, MD5 and SHA160 offload. Everything else will either be handled by a different kernel driver or by the SSL library in userspace. Could it be that your tests are failing and the apps are having problems because you haven’t disabled AES256 in them?

  5. The comparison was not accurate. You should add the -elapsed option to the openssl speed command. Otherwise the time you see is just the user-space time, which, in the case of hardware acceleration, misses the time used by the kernel driver for cryptodev.

    • Not quite. The tests in both cases run for 3 seconds so the elapsed time is fixed. Therefore the hardware offload throughput figures are correct.

      • Yes. But the statement about the CPU time is misleading. The time shown by your test is user-space only; a very small time in user space doesn’t mean the CPU spends no time doing the work. I tested with scp between a PC and an Orion machine (Buffalo LinkStation Pro), and the CPU utilization was smaller with hardware acceleration, but not by much (10~20%). I have no solid data since I don’t know how to measure that precisely. I’m pretty sure cryptodev was working, since the interrupt counter of CESA kept rising.

        • Not really. The AES offload is asynchronous – from what I can tell, the CPU usage is just busy waiting which will yield to any other CPU consuming task. Offloaded AES doesn’t actually appear to use any CPU time over and above passing data block pointers to the AES engine (so it is relatively more expensive in terms of CPU time for small blocks than it is for large blocks).

          • With -elapsed, the time the openssl thread spends waiting for a response from the hardware engine is also included. Without -elapsed, only user-space time is reported, not kernel-space time. Once the user-space thread hands a job over to the kernel crypto driver, the packet also spends some time in the kernel being handed over to the hardware engine. Likewise, when the hardware engine completes its job, it takes a finite time for the result to come back to user space. In between, the kernel is free to schedule other user or kernel threads. To get correct kernel and user times, I used “time openssl speed -evp aes-128-cbc”. This also showed how long the openssl thread spent handling ioctls in the kernel.

            This still has a minor issue, because in certain cases the job-completion callback from the hardware engine runs in one of the kernel’s own contexts (e.g. kthread_softirq, softIRQ or interrupt), which is not counted in the process’s time.

    • Yes and no. The systime ends up being almost completely spent on iowait, which is effectively treated as idle time if something else is trying to run.
