eSTREAM in Linux: Salsa20 (x86-64)

It has been a while. But finally I’ve ported Bernstein’s x86-64 Salsa20 code (the original code is at: http://cr.yp.to/snuffle/salsa20/amd64-3/salsa20.s). In the process, I realized that the “glue” code used to call the i586 version and the x86-64 version is exactly the same so I combined them into just one salsa20_glue.c file. This was what Sebastian Siewior did for the AES “glue” code too and I think it is a good thing to do for code reuse.

As usual, the patches and the Python script used to automate the reformatting of salsa20.s are available in the Tools & Patches section I am not sure how the x86-64 patch works out since I don’t have a 64-bit machine to test on. Hopefully someone will give it a test-drive.

18 Dec 2007: Herbert and Sebastian tested it on their 64-bit machines. It works! Yeah!

eSTREAM in Linux: Salsa20 (i586)

While testing against a large test vector, I discovered that my previous patch have a serious bug: I’ve switched the source and destination buffers around when calling salsa20_encrypt_bytes()! It doesn’t show up against small test vectors because in those cases, blkcipher_walk-ing typically uses the same buffer as source and destination. It is usually when we are near a page boundary that source and destination points to different buffers and then the bug becomes apparent. A patch has been submitted.

Sebastian Siewior also commented that I left too many comments in the assembly code. I explained that I wanted to make it easy for others to verify that I did not tamper with Bernstein’s code using “diff -b“. But alas, diff was confused by my removal of some of the global labels and so “diff -b” did not work for this purpose. That being the case, I decided that “diff -b” was not a good idea and rewrote my indent.py script to remove the extraneous comments and functions. If you are interested to verify that I did not tamper with the code, you should study the indent.py script instead.

eSTREAM in Linux: Salsa20 (i586)

While I am at it, I decided to resubmit the Salsa20 i586 implementation.

The previous submission did not make it as the assembly code was not indented properly. The original assembly code came from http://cr.yp.to/snuffle/salsa20/x86-pm/salsa20.s and I did not change it initially as I was worried about introducing bugs. Since it is important that the indentation is correct, I eventually wrote a script to automate the process.

This salsa20-i586 version is pretty fast. Under UML on my laptop, it is nearly 1.8 times as fast as the salsa20-generic implementation:

testing speed of salsa20-generic encryption
test 0 (128 bit key, 16 byte blocks): 930563 operations in 1 seconds (14889008 bytes)
test 1 (128 bit key, 64 byte blocks): 966858 operations in 1 seconds (61878912 bytes)
test 2 (128 bit key, 256 byte blocks): 289514 operations in 1 seconds (74115584 bytes)
test 3 (128 bit key, 1024 byte blocks): 75999 operations in 1 seconds (77822976 bytes)
test 4 (128 bit key, 8192 byte blocks): 9273 operations in 1 seconds (75964416 bytes)
test 5 (256 bit key, 16 byte blocks): 1003746 operations in 1 seconds (16059936 bytes)
test 6 (256 bit key, 64 byte blocks): 966838 operations in 1 seconds (61877632 bytes)
test 7 (256 bit key, 256 byte blocks): 288998 operations in 1 seconds (73983488 bytes)
test 8 (256 bit key, 1024 byte blocks): 75048 operations in 1 seconds (76849152 bytes)
test 9 (256 bit key, 8192 byte blocks): 9267 operations in 1 seconds (75915264 bytes)
testing speed of salsa20-i586 encryption
test 0 (128 bit key, 16 byte blocks): 1210840 operations in 1 seconds (19373440 bytes)
test 1 (128 bit key, 64 byte blocks): 1464200 operations in 1 seconds (93708800 bytes)
test 2 (128 bit key, 256 byte blocks): 486000 operations in 1 seconds (124416000 bytes)
test 3 (128 bit key, 1024 byte blocks): 132620 operations in 1 seconds (135802880 bytes)
test 4 (128 bit key, 8192 byte blocks): 16395 operations in 1 seconds (134307840 bytes)
test 5 (256 bit key, 16 byte blocks): 1371696 operations in 1 seconds (21947136 bytes)
test 6 (256 bit key, 64 byte blocks): 1457508 operations in 1 seconds (93280512 bytes)
test 7 (256 bit key, 256 byte blocks): 491316 operations in 1 seconds (125776896 bytes)
test 8 (256 bit key, 1024 byte blocks): 133732 operations in 1 seconds (136941568 bytes)
test 9 (256 bit key, 8192 byte blocks): 16186 operations in 1 seconds (132595712 bytes)

Salsa20 issue resolved!

As I’ve mentioned before, the Salsa20 implementation I submitted had a bug which caused erroneous processing near page boundary. This is similar to the CTR bug.

Well, the bug has been squashed! The patches are available here. It uses the same idea as Herbert’s CTR patch.

I’ve also added an optimization for the case walk.nbytes == nbytes, which I believe occurs rather frequently.

eSTREAM in Linux: Salsa20 (i586)

I’ve ported Bernstein’s x86-pm assembly code of Salsa20 into the kernel. What I did is nothing more than adding glue code to call the assembly code.

More interesting is the patch to dm-crypt to support stream ciphers using the blkcipher interface. In other words, it is now possible to do something like:

cryptsetup luksFormat -c salsa20-stream-plain /dev/loop0

The patches have been sent to linux-crypto@vger.kernel.org. They are also available here where the status of the patches are tracked. Not all patches make it into the kernel. :-)