Well, right now I am testing some other optimizations that use strictly safe code. One of the things I came across:

On my fast machine it helps to do as much arithmetic as possible in one go. For example:

C#:
  b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
  b = (b << 22) | (b >> 10);
  b += c;

Keeping b += c as a separate statement is not a smart choice here and actually costs performance. Better would be:

C#:
  b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
  b = ((b << 22) | (b >> 10)) + c;

Applying this change to all such steps in the MD5 algorithm adds up to a 10-15% performance increase!
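To double-check that folding the addition does not change the result, here is a minimal sketch that puts both forms side by side. The class and method names (Md5StepSketch, StepSeparate, StepFused) are made up for illustration, and the parameters k and x are stand-ins for K[7] and buff[7]; the arithmetic itself is copied from the two snippets above.

```csharp
using System;

static class Md5StepSketch
{
    // Variant 1: rotate left by 22, then add c in a separate statement.
    // k and x are hypothetical stand-ins for K[7] and buff[7].
    public static uint StepSeparate(uint a, uint b, uint c, uint d, uint k, uint x)
    {
        unchecked // uint arithmetic is meant to wrap around
        {
            b += (((d ^ a) & c) ^ a) + k + x;
            b = (b << 22) | (b >> 10);
            b += c;
            return b;
        }
    }

    // Variant 2: the addition of c is folded into the rotate expression.
    public static uint StepFused(uint a, uint b, uint c, uint d, uint k, uint x)
    {
        unchecked
        {
            b += (((d ^ a) & c) ^ a) + k + x;
            b = ((b << 22) | (b >> 10)) + c;
            return b;
        }
    }
}
```

Both methods should return the same value for any input, so the only difference is the extra store and load of b that the first form produces.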

Let's look at the IL:

IL:
  ldloc.1
  ldloc.3
  ldloc.0
  xor
  ldloc.2
  and
  ldloc.0
  xor
  ldsfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::K
  ldc.i4.7
  ldelem.u4
  add
  ldarg.0
  ldfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::buff
  ldc.i4.7
  ldelem.u4
  add
  add
  stloc.1
  ldloc.1
  ldc.i4.s 22
  shl
  ldloc.1
  ldc.i4.s 10
  shr.un
  or
  stloc.1  // stores b
  ldloc.1  // loads b
  ldloc.2
  add
  stloc.1

And here is the IL of the second version:

IL:
  ldloc.1
  ldloc.3
  ldloc.0
  xor
  ldloc.2
  and
  ldloc.0
  xor
  ldsfld unsigned int32[] MD5AddOptimization3::K
  ldc.i4.7
  ldelem.u4
  add
  ldarg.0
  ldfld unsigned int32[] MD5AddOptimization3::buff
  ldc.i4.7
  ldelem.u4
  add
  add
  stloc.1
  ldloc.1
  ldc.i4.s 22
  shl
  ldloc.1
  ldc.i4.s 10
  shr.un
  or
  // store, load eliminated
  ldloc.2
  add
  stloc.1

I found this odd, because I assumed the compiler would be able to spot this kind of unnecessary store/load pair. I guess such cases are harder to locate than it seems, especially since a wrong optimization could end up producing broken code.

A weird thing is: I just did some testing on my old, old i586, and the optimization I described here actually decreases performance under some circumstances! Obviously the lesson learned is: never trust "optimizations" without testing them!
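Since the result depends so much on the machine, it is worth timing both forms wherever the code has to run. Here is a rough sketch of how that could be done with System.Diagnostics.Stopwatch. The class StepBenchmarkSketch, its method names, and the constants used for k and x are made up for illustration (they stand in for K[7] and buff[7]), and a loop over one isolated step is only a crude approximation of the full MD5 transform. A different runtime or JIT may well compile both loops to the same machine code, so treat the numbers as a hint, not proof.

```csharp
using System;
using System.Diagnostics;

static class StepBenchmarkSketch
{
    // Repeats the step with the separate "b += c" statement.
    public static uint RunSeparate(int iterations, uint k, uint x)
    {
        uint a = 0x67452301, b = 0xEFCDAB89, c = 0x98BADCFE, d = 0x10325476;
        unchecked // uint arithmetic is meant to wrap around
        {
            for (int i = 0; i < iterations; i++)
            {
                b += (((d ^ a) & c) ^ a) + k + x;
                b = (b << 22) | (b >> 10);
                b += c;
            }
        }
        return b; // returned so the loop cannot be optimized away
    }

    // Repeats the step with the addition folded into the rotate.
    public static uint RunFused(int iterations, uint k, uint x)
    {
        uint a = 0x67452301, b = 0xEFCDAB89, c = 0x98BADCFE, d = 0x10325476;
        unchecked
        {
            for (int i = 0; i < iterations; i++)
            {
                b += (((d ^ a) & c) ^ a) + k + x;
                b = ((b << 22) | (b >> 10)) + c;
            }
        }
        return b;
    }

    // Times both variants and prints the elapsed milliseconds.
    public static void Report(int iterations)
    {
        const uint k = 0x9E3779B9; // arbitrary stand-in for K[7]
        const uint x = 0x12345678; // arbitrary stand-in for buff[7]

        // Warm-up calls so JIT compilation is not part of the measurement.
        RunSeparate(1000, k, x);
        RunFused(1000, k, x);

        Stopwatch sw = Stopwatch.StartNew();
        uint r1 = RunSeparate(iterations, k, x);
        sw.Stop();
        long separateMs = sw.ElapsedMilliseconds;

        sw = Stopwatch.StartNew();
        uint r2 = RunFused(iterations, k, x);
        sw.Stop();
        long fusedMs = sw.ElapsedMilliseconds;

        Console.WriteLine("separate: {0} ms, fused: {1} ms, results equal: {2}",
            separateMs, fusedMs, r1 == r2);
    }
}
```

Calling StepBenchmarkSketch.Report(100000000) from a Main method, several times and on each target machine, gives a first impression of whether the change pays off there.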

Well, I'll soon write about more safe optimizations I found. But one thing is certain once again: optimizing is all about trial and error. There is almost no way to tell in advance whether a change will perform better or worse.