I got in contact with Sebastien Pouliot in order to contribute back to mono. And I got a really nice email back. He suggested a couple of other optimizations, explained some of the performance gains I was experiencing and was looking forward to seeing more of my optimizations. But unfortunatelly he also wrote this:

For many reasons I don't want to add unsafe code into crypto. Beside being "unsafe" many projects use copies of Mono crypto source code directly into their apps and I don't want to require them to use unsafe code (and limit the condition were the code can be executed, e.g. FullTrust)

That makes it impossible to use the one optimization that gives the biggest performance boost on little endian machines. Since that optimization will not make it into mono, I just wanted to blog about it here - just in case somebody needs the extra edge.

Optimizing the md5 algo on little endian machines

This decoding step is a major part within the md5 algo. It translates a 64 byte array into a 16 uint array. The general endian-safe step looks like this:

C#:
  1. for (i = 0; i <16; i++)
  2. {
  3.   buff[i] = (uint)(inputBuffer[inputOffset + 4 * i])
  4.     | (((uint)(inputBuffer[inputOffset + 4 * i + 1])) <<8 )
  5.     | (((uint)(inputBuffer[inputOffset + 4 * i + 2])) <<16)
  6.     | (((uint)(inputBuffer[inputOffset + 4 * i + 3])) <<24);
  7. }

If this step is performed on a little endian machine, the bytes are already safed in memory in the right order and there is no need for all the bitlogic. Therefore optimizing it using unsafe code and direct copying 4 consecutive bytes as one uint gives a huge speed increase:

C#:
  1. unsafe
  2. {
  3.   fixed (byte* bFixed = inputBuffer)
  4.   {
  5.     byte* inputPointer = bFixed + inputOffset;
  6.  
  7.     for (i = 0; i <16; i++)
  8.     {
  9.       buff[i] = *(uint*)(inputPointer);
  10.       inputPointer += 4;
  11.     }
  12.   }
  13. }

As we are already in unsafe context using an additional pointer for buff gives an additional speed increase:

C#:
  1. unsafe
  2. {
  3.   fixed (byte* bFixed = inputBuffer)
  4.   {
  5.     fixed (uint* aFixed = buff)
  6.     {
  7.       byte* inputPointer = bFixed + inputOffset;
  8.       uint* bufferPointer = aFixed;
  9.  
  10.       for (i = 0; i <16; i++)
  11.       {
  12.         *bufferPointer = *(uint*)(inputPointer);
  13.         inputPointer += 4;
  14.         bufferPointer++;
  15.       }
  16.     }
  17.   }
  18. }

Be sure to still make this Md5 implementation endian safe by introducing a if:

C#:
  1. if (BitConverter.IsLittleEndian)
  2. {
  3.   // unsafe decode
  4. }
  5. else
  6. {
  7.   // bitshifting decode
  8. }

I really can understand why this is important. There should not be any unsafe code in a crypto environment, and I agree on that. Unfortunatelly I have not come up with an alternative solution for that particular decode step. Using

C#:
  1. BitConverter.ToUInt32(inputBuffer, inputOffset + 4 * i);

really is no good. Unfortunatelly this method does some extensive testing before converting the bytes. That slows it down so much, that it performs worse than the bitops.