Getting started using Microsoft Compute Cluster

by Tobias Hertkorn on January 5th, 2007

Well, strictly for academic reasons I wanted to find out how fast the new Computing Cluster Server by Microsoft really is – I took the chance and installed it on my fresh and new Fujitsu-Siemens Celsius M. Yes, I got it today from my vendor – and thought, well before installing all of Linux and WinXP on it – why not test my code on it that I am writing as part of my diploma thesis. Granted, this is not a real test of the speed of the CCS since it will be only one node and the code will run on the head knot. But it will for sure be a good test if the Fortran 90 code will compile and run. Hell, it will even be a good estimate how much CPU Cycles are used by the underlying 2003 Server and the MPICH framework. Because I will provide some data on how fast the same code performs on the same machine using Linux (Debian’s AMD64 port) and MPICH.

Some more specs about the setup:

Intel® Core 2 Duo E6400 / 2.13 GHz “Conroe” – L1 Cache 2x 32 KB – L2 Cache 2x 1024 KB
Chipset Intel® 975X Express
2 GB RAM DDR II SDRAM 667 MHz
1 x ST3250820AS – 250 GB – Serial ATA-300 – 7200 rpm
NVIDIA Quadro FX 1500 – 256 MB
Ethernet-Controller Broadcom BCM5751

Here are some test results by the one and only BOINC CPU benchmark (Single threaded):

Core2Duo E6400:
Measured floating point speed:	1990.18 million ops/sec
Measured integer speed		4160.19 million ops/sec
Pentium 4 Northwood 2.4 GHz:
Measured floating point speed	964.82 million ops/sec
Measured integer speed		1139.22 million ops/sec

But first things first – Installing
Well, right now I am forced to install a 2003 Server x64 using the provided CD 1 on http://www.microsoft.com/windowsserver2003/ccs/trial/installinstruct.mspx. Let’s see if all the necessary drivers are available for the x64 environment.

Damn, there is no driver for my Broadcom NetXtreme Gigabit Ethernet. Hmm. Let’s hope the 2003 Server recognizes my USB stick. Phew. It worked, and thanks god for my second PC and the excellent online support by Fujitsu-Siemens. But isn’t it annoying that there are about 5 restarts before a complete MS Installation is completed? Well, the stuff one has to do. ;-) Urgs. There is 200Megs worth of updates available. Good thing this is a very, very fast internet conn… And downloads from the update server are around 5MB/s. Well, at least the installation of the patches already shows some of the fastness and furiousness of this piece of hardware. Truly impressive.

Updates are still running. Damn. Well, let’s talk a bit about the code I am about to run. It is a two-dimensional turbulence simulation, using a pseudo spectral approach. That means that there are both calculations in Fourier space and real space. Therefore the code highly depends on FFT. Timesteping is done using a leap frog integration.

The whole code uses MPI to distribute memory consumption and computational power. Especially the convolution benefits from parallel computing. As well as the FFT routines which use FFTW2 since the code uses memory distributed FFT.

The code is usually run on the ALTIX at the LRZ in Munich or the PSI at MPI in Greifswald. For this particular test on CCS I pre-tested it on a Pentium 4 running Debian using MPICH and the Intel Fortran compiler ifort. Obviously the tests were single threaded, since only one CPU was available. I will use a setup (1024×1024 grid points) for this test that needed about 50 minutes to complete on that particular Pentium 4 (Prescott 3.2GHz).

Well, the updates are done. Time to pop in the second CD. Since it complains that there is no Active Directory available, I use the dcpromo.exe wizard to set up a domain and DNS Server on this machine. Which takes a couple of minutes. Aaaand another restart. This is starting to get to me!

Okay, let’s try this again. CD2. And lift-off. Since is going to be the only computing node I select “Create a new compute cluster with this server as the head node” plus I check “Use this head node as a compute node”. Hmm, it complains that this computer only has one NIC. Well, since I am not going to use the net for MPI, I don’t really care and choose to continue anyway. Uhoh, some more hotfixes to install. Smells like restart spirit… Oops. Something really screwed with my network settings. Ah, faulty DNS settings… Yes, first reboot after ICS Hotfix. Wow, this time it takes really long to boot. Granted, there are some more services to start now.

All updates are install and it is time to install the Microsoft Compute Cluster Pack. Done. There is a nice to do list that pops up. Obviously I have to create a network topology. And do some user management. My topology should be “All nodes only on public network”, which is the simplest. After the change of topology is done successfully the wizard informs me that I really should switch off the internal firewall. Okay. There even is a nice wizard for that. Done. Node Management is not necessary since it already knows that I will use the head node as compute node. Nice. Well, last step: User Management. Weeeell. Done?!? Where to go next? Let’s have a look at the Compute Cluster Administrator. Hmm. Right amount of Processors are detected and all are flagged as idle. Good. But I really have no idea how to start compiling my code.

Digging around some I came across this link:

http://windowshpc.net/blogs/apps_general/archive/2006/06/04/134.aspx

That would get me started, if I would be using VS 2005. Which could also be a lot of fun – but maybe later. I really want to get Fortran up and running. While digging around some more I am downloading a trial version of the Intel Fortran Compiler 9.1 for Windows.

Hmm. Damn, it is getting way too late. It’s past pi (3:14 ;-) ). Stop.

January 5th, 2007 4:12 am | Comments (0)

Getting Debian MPICH and icc/ifort ready on Core2Duo/amd64 port

by Tobias Hertkorn on January 3rd, 2007

Sorry, today’s post won’t be pretty and certainly not for people who are not familiar with Debian and installing. It’s basically just a short summary for me. But feel free to email me or post a comment if you need assistance.

Get the Intel Fortran and CC compiler. If you use them for non-commercial purposes it is free to download them.

http://registrationcenter.cps.intel.com/irc_nas/537/l_fc_c_9.1.036.tar.gz

http://registrationcenter.cps.intel.com/irc_nas/534/l_cc_c_9.1.042.tar.gz

Installing the stuff:
Make sure the ia32-libs package is installed, when running an amd64 port.
apt-get install ia32-libs
This is not the default on Debian!

Pointers on how to install the intel compilers:

http://ubuntuforums.org/showthread.php?t=149579

http://lunatics.at/index.php?op=Articles;article=2

Get MPICH2, unpack and configure it:
CC=icc CXX=icpc F77=ifort F90=ifort FFLAGS=’-fomit-frame-pointer -xT -O3 -ip’ CFLAGS=’-fomit-frame-pointer -xT -O3 -ip’ ./configure –prefix=/opt/mpich2 –with-device=ch3:shm –enable-fast –enable-f77

Make sure to include /opt/mpich2/bin in your path.
Get the fftw2, unpack and configure it:
CC=icc CXX=icpc F77=ifort F90=ifort FFLAGS=’-fomit-frame-pointer -xT -O3 -ip’ CFLAGS=’-fomit-frame-pointer -xT -O3 -ip’ ./configure –prefix=/opt/fftw-2.1.5 –enable-mpi –enable-sse2 –enable-i386-hacks

January 3rd, 2007 1:13 pm | Comments (0)

I just learned two new things on Debian

by Tobias Hertkorn on June 14th, 2006

Hahaha, why is that such a big deal? Well, I have been using Debian for more than 10 years now and I thought I was well read, when it comes to the dpkg and apt commands. But today I actually learned not one, but two new things related to apt. Aaaaand the winners are: apt-build and volatile.

Well, as far as I am concerned there is no need for Gentoo anymore. ;) Let there be apt-build. Nice work on making recompiling performance critical packages as easy as directly installing them using apt-get. Nonono, please don’t start a flame on Debian vs. Gentoo, or Gentoo vs. rest of the world. I was just joking. :)

Second cool thing is Debain Volatile, a project aimed at getting more recent software like AV-scanners into the stable branch. So now there is less need to do some apt-pinning and/or upgrading to testing on your production servers. Nice.

June 14th, 2006 5:21 pm | Comments (1)

On the hunt for safe optimizations for md5

by Tobias Hertkorn on June 10th, 2006

Well, right now I am testing some other optimizations that use strictly safe code. One of the things I came across:

On my fast machine it helps to do as much arithmetic as possible in one go. For example:

C#:
  1. b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
  2. b = (b <<22) | (b>> 10);
  3. b += c;

This b += c is not smart to use in this situation and actually produces a performance hit. Better would be:

C#:
  1. b += (((d ^ a) & c) ^ a) + (uint)K[7] + buff[7];
  2. b = ((b <<22) | (b>> 10)) + c;

Altering the whole segment like this in the md5 algorithm adds up to a 10-15% performance increase!

Let's look at the IL:

IL:
  1. ldloc.1
  2. ldloc.3
  3. ldloc.0
  4. xor
  5. ldloc.2
  6. and
  7. ldloc.0
  8. xor
  9. ldsfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::K
  10. ldc.i4.7
  11. ldelem.u4
  12. add
  13. ldarg.0
  14. ldfld unsigned int32[] MD5CryptoServiceProviderMonoOrig::buff
  15. ldc.i4.7
  16. ldelem.u4
  17. add
  18. add
  19. stloc.1
  20. ldloc.1
  21. ldc.i4.s 22
  22. shl
  23. ldloc.1
  24. ldc.i4.s 10
  25. shr.un
  26. or
  27. stloc.1  // stores b
  28. ldloc.1  // loads b
  29. ldloc.2
  30. add
  31. stloc.1

the second version's:

IL:
  1. ldloc.1
  2. ldloc.3
  3. ldloc.0
  4. xor
  5. ldloc.2
  6. and
  7. ldloc.0
  8. xor
  9. ldsfld unsigned int32[] MD5AddOptimization3::K
  10. ldc.i4.7
  11. ldelem.u4
  12. add
  13. ldarg.0
  14. ldfld unsigned int32[] MD5AddOptimization3::buff
  15. ldc.i4.7
  16. ldelem.u4
  17. add
  18. add
  19. stloc.1
  20. ldloc.1
  21. ldc.i4.s 22
  22. shl
  23. ldloc.1
  24. ldc.i4.s 10
  25. shr.un
  26. or
  27.                // store, load eliminated
  28. ldloc.2
  29. add
  30. stloc.1

I thought this was weird, because I assumed that the compiler might be able to find these kinds of unnecessary operations, but I guess it is harder to locate them than it seems. Especially since a wrong optimization might end up with broken code.

A weird thing is: I just did some testing on my old, old i586 - and this optimization I described here actually decreases performances under some cirumstances! Obviously the lesson learned is: Never trust "optimizations" without testing them!

Well, I'll soon write about more safe optimizations I found. But this is for sure again: Optimizing is all about trial and error. There is almost no way to tell if anything might perform better or worse.

June 10th, 2006 8:07 pm | Comments (2)

Md5 – no unsafe in crypto

by Tobias Hertkorn on June 9th, 2006

I got in contact with Sebastien Pouliot in order to contribute back to mono. And I got a really nice email back. He suggested a couple of other optimizations, explained some of the performance gains I was experiencing and was looking forward to seeing more of my optimizations. But unfortunatelly he also wrote this:

For many reasons I don't want to add unsafe code into crypto. Beside being "unsafe" many projects use copies of Mono crypto source code directly into their apps and I don't want to require them to use unsafe code (and limit the condition were the code can be executed, e.g. FullTrust)

That makes it impossible to use the one optimization that gives the biggest performance boost on little endian machines. Since that optimization will not make it into mono, I just wanted to blog about it here - just in case somebody needs the extra edge.

Optimizing the md5 algo on little endian machines

This decoding step is a major part within the md5 algo. It translates a 64 byte array into a 16 uint array. The general endian-safe step looks like this:

C#:
  1. for (i = 0; i <16; i++)
  2. {
  3.   buff[i] = (uint)(inputBuffer[inputOffset + 4 * i])
  4.     | (((uint)(inputBuffer[inputOffset + 4 * i + 1])) <<8 )
  5.     | (((uint)(inputBuffer[inputOffset + 4 * i + 2])) <<16)
  6.     | (((uint)(inputBuffer[inputOffset + 4 * i + 3])) <<24);
  7. }

If this step is performed on a little endian machine, the bytes are already safed in memory in the right order and there is no need for all the bitlogic. Therefore optimizing it using unsafe code and direct copying 4 consecutive bytes as one uint gives a huge speed increase:

C#:
  1. unsafe
  2. {
  3.   fixed (byte* bFixed = inputBuffer)
  4.   {
  5.     byte* inputPointer = bFixed + inputOffset;
  6.  
  7.     for (i = 0; i <16; i++)
  8.     {
  9.       buff[i] = *(uint*)(inputPointer);
  10.       inputPointer += 4;
  11.     }
  12.   }
  13. }

As we are already in unsafe context using an additional pointer for buff gives an additional speed increase:

C#:
  1. unsafe
  2. {
  3.   fixed (byte* bFixed = inputBuffer)
  4.   {
  5.     fixed (uint* aFixed = buff)
  6.     {
  7.       byte* inputPointer = bFixed + inputOffset;
  8.       uint* bufferPointer = aFixed;
  9.  
  10.       for (i = 0; i <16; i++)
  11.       {
  12.         *bufferPointer = *(uint*)(inputPointer);
  13.         inputPointer += 4;
  14.         bufferPointer++;
  15.       }
  16.     }
  17.   }
  18. }

Be sure to still make this Md5 implementation endian safe by introducing a if:

C#:
  1. if (BitConverter.IsLittleEndian)
  2. {
  3.   // unsafe decode
  4. }
  5. else
  6. {
  7.   // bitshifting decode
  8. }

I really can understand why this is important. There should not be any unsafe code in a crypto environment, and I agree on that. Unfortunatelly I have not come up with an alternative solution for that particular decode step. Using

C#:
  1. BitConverter.ToUInt32(inputBuffer, inputOffset + 4 * i);

really is no good. Unfortunatelly this method does some extensive testing before converting the bytes. That slows it down so much, that it performs worse than the bitops.

June 9th, 2006 2:14 pm | Comments (0)

Optimized MD5CryptoServiceProvider for Mono

by Tobias Hertkorn on June 8th, 2006

Inspired by a comment here, I implemented a couple of my optimization strategies as MD5CryptoServiceProvider. Tests are running as I speak and should be done tomorrow morning, but these figures sure do give some hope:

MD5CopyOptimize (5): 0,446876320604479
MD5CopyOptimize2 (8): 0,447343112157013
MD5Bitlogic (4): 0,448018751258771
MD5Bitlogic2 (7): 0,448770640646336
MD5Inline (3): 0,457765987475873
MD5Inline2 (6): 0,458054073151199
MD5LittleEndian (2): 0,477792175150227
MD5NoArray (1): 0,526980798829037
MD5CryptoServiceProviderMonoOrig (0): 1,45861854618167

That's on my WinXP machine, running Microsoft's .NET 2.0. I do not have any data for Mono yet, but it seems to be about the same - that means, about 3 (!) times faster than the original mono Md5.

To understand which optimizations are done, please download the optimized MD5CryptoServiceProvider: Md5MonoOptimized
UPDATE: 4 and 5 (MD5Bitlogic and MD5CopyOptimize) did have a typo in them and therefore produced wrong results. This problem is fixed as of 15:00 on the 9th.

I'll post the mono test results tomorrow morning.

June 8th, 2006 11:14 pm | Comments (1)

Java and Sun in the news today

by Tobias Hertkorn on May 16th, 2006

I am still a subscriber of the Debian Java mailing list. And on Tuesday a little, innocent email came over the mailing list:
Tom Marble - Towards Java Libre.

After that there was a bit of chit-chat, until reply email from Arnaud Vandyck. After that the thread is really taking off, inspiring very interesting posts about Open Source ethics, Sun's need to adapt, etc. I encourage everyone to follow this thread!

On the same note:
Taken from two articles at heise (in German):
Wird Java doch Open Source?
Sun stellt Java Enterprise Edition 5 fertig

Sun thinking about releasing Java as Open Source
In a recent keynote Sun-CEO, Jonathan Schwartz, did publicly discuss the possibility that Sun might release Java as open source. It would be quite an agressive move to ensure predominance in the Linux world, since releasing Java as open source would create the possibility that it will be bundled with distributions. Right now for example Debian does not ship Sun's Java through its main distribution mechanisms since it violates their distribution policies.

Sun releases Java Enterprise Edition 5
Java EE 5 should make it easier and faster to develope applications. It heavily uses annotations, which make it easier to handle EJB configuration, since the required parameters can be set within the code.

May 16th, 2006 8:01 pm | Comments (0)

Hilarious story about Windows and the likes

by Tobias Hertkorn on May 10th, 2006

This story made me chuckle more than once. It is very well written and has a nice flow and pace.
But on the flipside - it does not fail to bring across a certain and in my eyes true message. Programmers depend more and more on visual designers and though the first release candidate is produced quite fast and "productive", fixing the rest of the bugs seems to take longer and longer, as Click-and-Drag-Programmers seem to no longer have the knowledge and/or the patience to go throught the designer-generated code and really understand why a bug occurs.

Be warned: It is quite long, so bring some time.

The dumbing-down of programming

Part I:
Rebelling against Microsoft, "My Computer" and easy-to-use wizards, an engineer rediscovers the joys of difficult computing.

Part II:
Returning to the source, once knowledge disappears into code, how do we retrieve it?

May 10th, 2006 11:25 am | Comments (0)

The Bazaar and the Cathedral

by Tobias Hertkorn on May 7th, 2006

Well, I just watched Revolution OS. It is a documentation about Linux done in 2001. To quote form their website:

REVOLUTION OS tells the inside story of the hackers who rebelled against the proprietary software model and Microsoft to create GNU/Linux and the Open Source movement.

I really liked the presentation, being a Linux user, hacker and fan since about 1996. Yep, I just got up, walked over to my bookshelf and there it still was. "LinuX Anwenderhandbuch und Leitfaden für die Systemverwaltung" loosly translatable to "Linux Userhandbook and Guide for System Administration". So it is a beginners book to Linux. And it was clearly written by hackers for hackers. Ah, the good-old-times. ;) Well, anyway, to get back to my topic:
The Bazaar and the Cathedral, an essay by Eric S. Raymond is mentioned in that documentation and I just thought, well that might just sound like the beginning of the agile and/or extreme programming movement, whose philosophy I started using at the webdevelopment team of experten-netzwerk about 2 years ago. And it got me quite excited, because I think it is always fun to have an unfiltered look at the unwatered inspiration that leads up to "movements".

Well, it basically is a compilation of very simple principles, but that's the beauty. I needed it written down. I did use some of the points talked about by instinct.

  • Given enough eyeballs, all bugs are shallow.
  • Constant feedback is good. Instant gratification.
  • Bugs don't drive users away.
  • Be chatty with your mailing lists. Don't just send release notes to the mailing list. Let the mailing list come up with solutions or help you choose between possible solutions. Aquire beta testers. Recognize good ideas from other people.
  • Release early, release often, listen to your customers.

As I said very basic ideas, but even going through this simple list, I do stumble over a couple of points that need or needed improvement in my current or former projects. So that got me really interested and fortunatelly I continued reading. And I would encourrage you to do so as well. It is quite a remarkable read. For those who are not willing to do that, even though I just told you to do so (I do promise you some egoboo! Ha, read on!), here is a short summary

Linus and therefore Linux uses a very different communication pattern than traditional software projects use. The major problem of large projects is scalability, known as the n-squared complexity cost. Linux has a starshapped communication pattern, with basically a lot of programmer talking to one guy. This means that there are a lot of people basically working for one guy (Linus). But how can people get motivated to put a lot of effort, work and spare time into a project?

What motivates hackers?
Basically projects don't start in bazaar mode. The code must have a plausable promise. It must promise that given a certain amount of work, it will become brilliant. Based on this promise the interest in the project must be nurtured the produce a "critical mass" of people that are working for the project, by code, testing or giving feedback. The trick is: People are playing for reputation. You don't take up the title hacker yourself. That's a title other people assign to you.
Raymond goes on and uses some definitions from anthropology: The internet culture is comparable, if not exactly the same as a gift culture: People gain status by giving things away and everybody plays for reputation. Usually that translated into gifts or food. But when it comes to the internet culture it translates to: Give away free time, energy and skill in order to look good in front of other hackers. People play this game for reputation. Knowing this, it is easy to optimize the management strategies of open source project. Make sure you motivate your users, testers and programmers by promising them knowlegde, skills and standing in front of other hackers. Raymond uses the term Egoboo used mainly in the SciFi community. The goal in a free software project is to create and nurture a gift culture. This is the ideal way to operate without force in an anarchy kind situation.

Just a tip: Listen in on the talk by Raymond. Link is on his website.

Well, is the paper a bit optimistic? Maybe. ;) Did it start the Extreme Programming or Agile Programming movement? Maybe not. I did start reading the article to get some new ideas for my managing. Did I get those? Hell yeah! Read between the lines and learn a lot about motivation. Industry projects do lean too much on pressure as far as I am concerned and should look at the OSS to find out about reputation, reward and ... fun. ;)

May 7th, 2006 6:36 pm | Comments (0)
Tobi + C# = T# - Blogged blogoscoop