eSTREAM in Linux: Salsa20 (x86-64)

It has been a while. But finally I’ve ported Bernstein’s x86-64 Salsa20 code (the original code is at: http://cr.yp.to/snuffle/salsa20/amd64-3/salsa20.s). In the process, I realized that the “glue” code used to call the i586 version and the x86-64 version is exactly the same so I combined them into just one salsa20_glue.c file. This was what Sebastian Siewior did for the AES “glue” code too and I think it is a good thing to do for code reuse.

As usual, the patches and the Python script used to automate the reformatting of salsa20.s are available in the Tools & Patches section I am not sure how the x86-64 patch works out since I don’t have a 64-bit machine to test on. Hopefully someone will give it a test-drive.

18 Dec 2007: Herbert and Sebastian tested it on their 64-bit machines. It works! Yeah!

Advertisements

eSTREAM in Linux: Salsa20 (i586)

While testing against a large test vector, I discovered that my previous patch have a serious bug: I’ve switched the source and destination buffers around when calling salsa20_encrypt_bytes()! It doesn’t show up against small test vectors because in those cases, blkcipher_walk-ing typically uses the same buffer as source and destination. It is usually when we are near a page boundary that source and destination points to different buffers and then the bug becomes apparent. A patch has been submitted.

Sebastian Siewior also commented that I left too many comments in the assembly code. I explained that I wanted to make it easy for others to verify that I did not tamper with Bernstein’s code using “diff -b“. But alas, diff was confused by my removal of some of the global labels and so “diff -b” did not work for this purpose. That being the case, I decided that “diff -b” was not a good idea and rewrote my indent.py script to remove the extraneous comments and functions. If you are interested to verify that I did not tamper with the code, you should study the indent.py script instead.

eSTREAM in Linux: Salsa20 (i586)

While I am at it, I decided to resubmit the Salsa20 i586 implementation.

The previous submission did not make it as the assembly code was not indented properly. The original assembly code came from http://cr.yp.to/snuffle/salsa20/x86-pm/salsa20.s and I did not change it initially as I was worried about introducing bugs. Since it is important that the indentation is correct, I eventually wrote a script to automate the process.

This salsa20-i586 version is pretty fast. Under UML on my laptop, it is nearly 1.8 times as fast as the salsa20-generic implementation:

testing speed of salsa20-generic encryption
test 0 (128 bit key, 16 byte blocks): 930563 operations in 1 seconds (14889008 bytes)
test 1 (128 bit key, 64 byte blocks): 966858 operations in 1 seconds (61878912 bytes)
test 2 (128 bit key, 256 byte blocks): 289514 operations in 1 seconds (74115584 bytes)
test 3 (128 bit key, 1024 byte blocks): 75999 operations in 1 seconds (77822976 bytes)
test 4 (128 bit key, 8192 byte blocks): 9273 operations in 1 seconds (75964416 bytes)
test 5 (256 bit key, 16 byte blocks): 1003746 operations in 1 seconds (16059936 bytes)
test 6 (256 bit key, 64 byte blocks): 966838 operations in 1 seconds (61877632 bytes)
test 7 (256 bit key, 256 byte blocks): 288998 operations in 1 seconds (73983488 bytes)
test 8 (256 bit key, 1024 byte blocks): 75048 operations in 1 seconds (76849152 bytes)
test 9 (256 bit key, 8192 byte blocks): 9267 operations in 1 seconds (75915264 bytes)
testing speed of salsa20-i586 encryption
test 0 (128 bit key, 16 byte blocks): 1210840 operations in 1 seconds (19373440 bytes)
test 1 (128 bit key, 64 byte blocks): 1464200 operations in 1 seconds (93708800 bytes)
test 2 (128 bit key, 256 byte blocks): 486000 operations in 1 seconds (124416000 bytes)
test 3 (128 bit key, 1024 byte blocks): 132620 operations in 1 seconds (135802880 bytes)
test 4 (128 bit key, 8192 byte blocks): 16395 operations in 1 seconds (134307840 bytes)
test 5 (256 bit key, 16 byte blocks): 1371696 operations in 1 seconds (21947136 bytes)
test 6 (256 bit key, 64 byte blocks): 1457508 operations in 1 seconds (93280512 bytes)
test 7 (256 bit key, 256 byte blocks): 491316 operations in 1 seconds (125776896 bytes)
test 8 (256 bit key, 1024 byte blocks): 133732 operations in 1 seconds (136941568 bytes)
test 9 (256 bit key, 8192 byte blocks): 16186 operations in 1 seconds (132595712 bytes)

Salsa20 issue resolved!

As I’ve mentioned before, the Salsa20 implementation I submitted had a bug which caused erroneous processing near page boundary. This is similar to the CTR bug.

Well, the bug has been squashed! The patches are available here. It uses the same idea as Herbert’s CTR patch.

I’ve also added an optimization for the case walk.nbytes == nbytes, which I believe occurs rather frequently.

eSTREAM in Linux: Salsa20 (i586)

I’ve ported Bernstein’s x86-pm assembly code of Salsa20 into the kernel. What I did is nothing more than adding glue code to call the assembly code.

More interesting is the patch to dm-crypt to support stream ciphers using the blkcipher interface. In other words, it is now possible to do something like:

cryptsetup luksFormat -c salsa20-stream-plain /dev/loop0

The patches have been sent to linux-crypto@vger.kernel.org. They are also available here where the status of the patches are tracked. Not all patches make it into the kernel. :-)

eSTREAM in Linux: Salsa20

It was pointed out by Herbert Xu that I could achieve what I want with eSTREAM ciphers simply by using the blkcipher interface. The estream.c and stream.c were entirely redundant. So I gave his suggestion a try and it worked for Salsa20! The patch is available on this page. This approach is much more elegant and does not disturb the existing “ecosystem”.

I’ve renamed the posts’ title to “eSTREAM in Linux: <algo name>” as I am now submitting one patch for each cipher. Seems more sensible this way.

PS: It was also suggested to me that Salsa20 can also be implemented as ctr(salsa20,0,16,8) where ctr is the counter mode template and salsa20 is the Salsa20 expansion function. Initially I thought it was possible too but when I read through ctr.c, I realize I can’t specify the block size of the Salsa20 expansion function. Should it be the 16 (blocksize of input) or 64 (blocksize of output)? If 16, then crypto_ctr_crypt_{segment,inplace} will be walking in steps smaller than the output block size; if 64, then the test for ((noncesize + ivsize + countersize) < alg->cra_blocksize) will trigger an error.

PS (19 Nov): This is commit 9ea2097f7339a03cb149c70b512f755cf0a529da in the kernel now.

Porting eSTREAM ciphers into Linux: Part 2

After more experiments with the Linux kernel CryptoAPI, I find that the eSTREAM ciphers are a misfit. (This statement is not quite right as explained in the postscript at the end of this post.) The problem is that eSTREAM ciphers require a call to setiv() but cipher_alg and cipher_tfm do not provide such a callback.

The setiv() call is important for eSTREAM ciphers as most of them use it to mix the IV into their key expansion. This is very different from general block ciphers where the IV is handled at the “mode of operation” level and does not affect the cipher’s key expansion.

So I created a new crypto_type called crypto_estream_type (and the associated *_alg and *_tfm) to address the needs of eSTREAM ciphers. These patches (available here) pass the tcrypt regression test and seem to capture the semantics of eSTREAM ciphers better. Currently I am discussing with the veterans on linux-crypto@vger.kernel.org whether this is the right approach.

PS (14 Nov): I realized today that dm-crypt.c uses crypto_cipher directly so my patches will not work with dm-crypt. Bummer!

PS (15 Nov): Herbert Xu pointed out that I can achieve what I want to do using the blkcipher interface instead of the cipher interface. There is really no need for a new crypto_type. I think he is right so I will be giving it a try. (Note: Herbert is the current maintainer of the APIs and he developed many of these interfaces so he knows them much better than I do.)