eSTREAM in Linux: Salsa20 (x86-64)

It has been a while. But finally I’ve ported Bernstein’s x86-64 Salsa20 code (the original code is at: http://cr.yp.to/snuffle/salsa20/amd64-3/salsa20.s). In the process, I realized that the “glue” code used to call the i586 version and the x86-64 version is exactly the same so I combined them into just one salsa20_glue.c file. This was what Sebastian Siewior did for the AES “glue” code too and I think it is a good thing to do for code reuse.

As usual, the patches and the Python script used to automate the reformatting of salsa20.s are available in the Tools & Patches section I am not sure how the x86-64 patch works out since I don’t have a 64-bit machine to test on. Hopefully someone will give it a test-drive.

18 Dec 2007: Herbert and Sebastian tested it on their 64-bit machines. It works! Yeah!

Advertisements

eSTREAM in Linux: Salsa20 (i586)

While testing against a large test vector, I discovered that my previous patch have a serious bug: I’ve switched the source and destination buffers around when calling salsa20_encrypt_bytes()! It doesn’t show up against small test vectors because in those cases, blkcipher_walk-ing typically uses the same buffer as source and destination. It is usually when we are near a page boundary that source and destination points to different buffers and then the bug becomes apparent. A patch has been submitted.

Sebastian Siewior also commented that I left too many comments in the assembly code. I explained that I wanted to make it easy for others to verify that I did not tamper with Bernstein’s code using “diff -b“. But alas, diff was confused by my removal of some of the global labels and so “diff -b” did not work for this purpose. That being the case, I decided that “diff -b” was not a good idea and rewrote my indent.py script to remove the extraneous comments and functions. If you are interested to verify that I did not tamper with the code, you should study the indent.py script instead.

eSTREAM in Linux: Salsa20 (i586)

While I am at it, I decided to resubmit the Salsa20 i586 implementation.

The previous submission did not make it as the assembly code was not indented properly. The original assembly code came from http://cr.yp.to/snuffle/salsa20/x86-pm/salsa20.s and I did not change it initially as I was worried about introducing bugs. Since it is important that the indentation is correct, I eventually wrote a script to automate the process.

This salsa20-i586 version is pretty fast. Under UML on my laptop, it is nearly 1.8 times as fast as the salsa20-generic implementation:

testing speed of salsa20-generic encryption
test 0 (128 bit key, 16 byte blocks): 930563 operations in 1 seconds (14889008 bytes)
test 1 (128 bit key, 64 byte blocks): 966858 operations in 1 seconds (61878912 bytes)
test 2 (128 bit key, 256 byte blocks): 289514 operations in 1 seconds (74115584 bytes)
test 3 (128 bit key, 1024 byte blocks): 75999 operations in 1 seconds (77822976 bytes)
test 4 (128 bit key, 8192 byte blocks): 9273 operations in 1 seconds (75964416 bytes)
test 5 (256 bit key, 16 byte blocks): 1003746 operations in 1 seconds (16059936 bytes)
test 6 (256 bit key, 64 byte blocks): 966838 operations in 1 seconds (61877632 bytes)
test 7 (256 bit key, 256 byte blocks): 288998 operations in 1 seconds (73983488 bytes)
test 8 (256 bit key, 1024 byte blocks): 75048 operations in 1 seconds (76849152 bytes)
test 9 (256 bit key, 8192 byte blocks): 9267 operations in 1 seconds (75915264 bytes)
testing speed of salsa20-i586 encryption
test 0 (128 bit key, 16 byte blocks): 1210840 operations in 1 seconds (19373440 bytes)
test 1 (128 bit key, 64 byte blocks): 1464200 operations in 1 seconds (93708800 bytes)
test 2 (128 bit key, 256 byte blocks): 486000 operations in 1 seconds (124416000 bytes)
test 3 (128 bit key, 1024 byte blocks): 132620 operations in 1 seconds (135802880 bytes)
test 4 (128 bit key, 8192 byte blocks): 16395 operations in 1 seconds (134307840 bytes)
test 5 (256 bit key, 16 byte blocks): 1371696 operations in 1 seconds (21947136 bytes)
test 6 (256 bit key, 64 byte blocks): 1457508 operations in 1 seconds (93280512 bytes)
test 7 (256 bit key, 256 byte blocks): 491316 operations in 1 seconds (125776896 bytes)
test 8 (256 bit key, 1024 byte blocks): 133732 operations in 1 seconds (136941568 bytes)
test 9 (256 bit key, 8192 byte blocks): 16186 operations in 1 seconds (132595712 bytes)

Salsa20 issue resolved!

As I’ve mentioned before, the Salsa20 implementation I submitted had a bug which caused erroneous processing near page boundary. This is similar to the CTR bug.

Well, the bug has been squashed! The patches are available here. It uses the same idea as Herbert’s CTR patch.

I’ve also added an optimization for the case walk.nbytes == nbytes, which I believe occurs rather frequently.

CTR issue resolved!

The bug in CTR has been resolved! Herbert wrote a better patch than the one I did. After some minor fixes to the patch, CTR now does multi-page processing properly. The same idea should be applied to Salsa20, which also has the same bug.

A large test vectors (large in the sense that it forces multi-page access) of 4100 bytes was added to tcrypt.h to test the code.

The patches are available here. Unfortunately, the patches results in substantial bloat of tcrypt.ko.

Bugs in CTR and Salsa20?

I’ve stumbled on what looks like a bug in the implementation of CTR and Salsa20 in the kernel. I’ve derived salsa20_generic.c from ctr.c so the bugs are essentially similar.

To understand the problem, we need to first understand the following code extract:

blkcipher_walk_init(&walk, dst, src, nbytes);
err = blkcipher_walk_virt_block(desc, &walk, bsize);
while(walk.nbytes) {
    if (walk.src.virt.addr == walk.dst.virt.addr) {
        nbytes = encrypt_inplace(&walk, child, ...);
    else
        nbytes = encrypt_segment(&walk, child, ...);
    err = blkcipher_walk_done(desc, &walk, nbytes);
}

This is the “design pattern” used in various “mode of operations” code, e.g. CBC, CTR, PCBC.

  • The blkcipher_walk_init() call initializes a struct blkcipher_walk with the destination pointer of the ciphertext, the source pointer of the plaintext and the length of data to encrypt.
  • The blkcipher_walk_virt_block() call starts the walking process and also tells the blkcipher APIs the block size of data we are processing. For CBC and PCBC, the block size is simply the blkcipher‘s blocksize so usually blkcipher_walk_virt() is called. For CTR (and Salsa20), the blkcipher‘s blocksize is set to 1 so we need to call blkcipher_walk_virt_block() with the underlying cipher algorithm block size.
  • The encrypt_inplace() and encrypt_segment() are similar code that handles optimally the case where src == dst and src != dst respectively. Their return value is important to the blkcipher_walk_done() call described in the next item.
  • The blkcipher_walk_done() call tells the blkcipher mechanisms how many bytes there are left that the call to encrypt_inplace() or encrypt_segment() did not handle. These are typically the bytes at the end of the block and usually nbytes = walk.nbytes % bsize. The blkcipher API will auto-magically move those bytes into the start of the next loop.

For small buffers of data, the while() loop is seldom executed more than once. However if the buffer size is near the page size, e.g. 4096 bytes, then the blkcipher API will break it up. For example, with AES in CBC mode the while() loop is executed 3 times: first, walk.nbytes = 4008, dst = src; then, walk.nbytes = 16, dst != src; finally, walk.nbytes = 80, dst = src. In the first loop, we see that 4008 % 16 = 8 bytes at the end are not processed. They are only done in the second loop. As the second and third loop has walk.nbytes that is divisble by 16, there are no unprocessed bytes. In total, 4000 + 16 + 80 = 4096 bytes are processed.

In the current implementation of CTR (and Salsa20), the nbytes passed into blkcipher_walk_done() is always zero. This makes sense for small buffers as encrypt_inplace() and encrypt_segment() typically finish the encryption in one loop. However for large buffer that crosses a page (or require at least two loops in the while loop), this may cause the counter to increment unnecessarily.

For example, if 4096 bytes were to be encrypted, the call for CTR (and Salsa20) would be split into 2 parts: first, walk.nbytes = 4008, dst == src; then walk.nbytes = 88, dst == src. This means that the last 8 bytes of the 4008 bytes in the first call will be encrypted with half of the encrypted counter block. It also means that the remaining 8 bytes are discarded as the current APIs do not keep track of how many encrypted bytes were left unused. If we look at the test vectors, we will notice a half-block slip in the output at the end. This is the source of the bug!

My patch to this problem is available here. What it does is to use the encrypt_inplace() and encrypt_segment() code as if they were performing encryption of full blocks all the time, except for the last call in the while loop.

eSTREAM in Linux: Salsa20 (i586)

I’ve ported Bernstein’s x86-pm assembly code of Salsa20 into the kernel. What I did is nothing more than adding glue code to call the assembly code.

More interesting is the patch to dm-crypt to support stream ciphers using the blkcipher interface. In other words, it is now possible to do something like:

cryptsetup luksFormat -c salsa20-stream-plain /dev/loop0

The patches have been sent to linux-crypto@vger.kernel.org. They are also available here where the status of the patches are tracked. Not all patches make it into the kernel. :-)