



Supercomputing on a Shoestring:
A Practical Perspective

Originally published  December, 1999
by Carlo Kopp
© 1999, 2005 Carlo Kopp

One of the more interesting projects I had the pleasure to tackle over the last 12 months has been the commissioning, testing and operational use of a large Pentium cluster. In this month's feature I will endeavour to provide a practical perspective on many of the issues discussed in the 1998 "Supercomputing on a Shoestring" feature.

The subject of this discussion is the parametric processing cluster. Sadly the term cluster has been used and abused by the vendor community every which way, as a result of which the average sysadmin's first question will always be "what kind of cluster?" A parametric processing cluster is commonly referred to in the literature as a COW (Cluster of Workstations), NOW (Network of Workstations) or PoPC (Pile of PCs), and is a processing arrangement in which a large number of like or different machines are connected via a high speed network and, via the use of appropriate software tools, harnessed to work on a single problem.

The essential idea in clustering is to achieve via the use of a large number of "small" machines the same aggregate number of computing cycles, and aggregate amount of memory, as would otherwise only be found in a "classical" supercomputer such as a Cray, or a very heavily configured Unix multiprocessor.

All supercomputing applications are by their very nature sensitive to the type of computational problem to be solved. The "classical" supercomputer, typified by the Cray family, the Convex "Crayettes" or the i860 "microCray", achieves its high performance via the use of vector processing techniques, and in larger machines, the generous use of Gigabytes of extremely fast and expensive RAM.

In a vector processing machine the conventional machine instruction set is expanded to include vector instructions, which are designed to operate on arrays of operands, or vectors, rather than single operands. An example of a vector instruction would be a dot product or cross product operation, where the operands (arguments) are pointers to arrays (vectors) of data, and a vector size. When such an instruction is encountered, the processor will repetitively crunch through each and every individual operand in the vector, using dedicated integer or floating point vector processing hardware. Since such a vector operation involves a large number of like computations, it can be very efficiently pipelined for maximum speed. Moreover, since there is no need to execute an opcode fetch cycle for every individual operation, processor time and bus bandwidth are not wasted. As a result, vector processors implemented in any given hardware technology can usually perform vector intensive computations many times faster than a conventional processor would.
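
By way of illustration, consider the scalar dot product loop below, written in C purely for this discussion rather than taken from any vendor's library. On a conventional processor every pass of the loop costs an instruction fetch and decode; on a vector machine the compiler maps the whole loop onto a handful of vector load, multiply and sum instructions, with the vector length n carried along as an operand.

    /* Scalar dot product loop: a vectorising compiler maps this onto
     * vector load, vector multiply and vector sum instructions, so the
     * hardware crunches many elements per opcode fetch. */
    double dot_product(const double *a, const double *b, int n)
    {
        int i;
        double sum = 0.0;

        for (i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }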

The snag with vector processing is that its superb performance potential is confined to problems which lend themselves to being vectorised. Large scale matrix intensive work, such as the finite element analysis techniques used in engineering or science, is a typical example. Problems of this ilk can be easily remapped into a form where the vector processor has an unbeatable advantage over any other player in the market.

Difficulties will arise if the problem cannot be easily remapped into a series of large scale matrix bashing operations. Many "real world" computing problems fall into this category, and until now users needing to do such work were severely hampered by the limitations of vector processors, many of which delivered very pedestrian non-vector compute performance.

The answer for many such problems lies in the general area of parallel processing, where a large number of general purpose processors are harnessed in parallel. The "gotcha" of course lies in the issue of how to parallelise the application in the tidiest manner to get this to work.

One approach which was fashionable for many years was the parallelising compiler, which would take the source code for the program, analyse it for those components which have no mutual dependencies in data (ie the results of these computations do not depend upon each other's results), and execute these on separate CPUs in the parallel multiprocessor.

The sad reality is that there are few such compilers out in the market (I have yet to see one of a commercial standard). There are good reasons why: such a compiler has to be capable of splitting the program into multiple threads in various places, running these on individual CPUs, and then returning to a single thread of execution as appropriate. This is not an easy thing to do if it is to be completely hidden from the user. The compiler must incorporate a complete run time environment for forking off multiple processes in the "parallel" portions of the program, and then merging the results of these, all transparently. If the program has many portions of code which are easily parallelised, with bits which cannot be parallelised in between, significant overheads can be incurred at runtime. The "granularity" of the parallelisable portions of the program can create problems of its own.
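
The contrived C sketch below, my own rather than the output of any real compiler, shows the kind of distinction such a compiler must draw: the first loop has no dependencies between iterations and can be split across CPUs, while in the second each iteration consumes the result of the previous one and must run sequentially.

    /* A loop a parallelising compiler can split across CPUs: no
     * iteration depends on the result of any other. */
    void independent_loop(double *c, const double *a, const double *b, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* A loop it cannot split: every iteration reads the result computed
     * by the previous one, so the order of execution matters. */
    void dependent_loop(double *x, const double *a, double k, int n)
    {
        int i;
        for (i = 1; i < n; i++)
            x[i] = x[i - 1] * k + a[i];
    }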

For many computational problems this is not really necessary, and a particular class of such problems are the category of "parametric problems". A parametric problem is one in which a specific computation, ie program, must be repeatedly executed using different starting conditions or other parameters. A good example is any simulation which must be run for a large number of scenarios.

Consider a computation where you are trying to analyse the behaviour of a storm water drainage system in a large city. Rain can fall with varying intensities across various suburbs, at varying times. Therefore a parametric modelling approach would see us writing a simulation for the flow rates through the drains, and running it a huge number of times for different rainfall maps, and then comparing the results of each of these many compute runs to identify the conditions under which specific drains overflow and flood their neighbourhoods.
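
In its crudest form the parametric pattern can be sketched in a few lines of C, as below. The simulation binary "./drainsim" and its -r rainfall argument are invented purely for illustration; a real parametric toolset distributes such runs across a cluster rather than executing them one after another on a single machine.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Crude parametric sweep: run the same simulation once per rainfall
     * intensity. "./drainsim" and its -r option are hypothetical. */
    int main(void)
    {
        double rain;

        for (rain = 10.0; rain <= 200.0; rain += 10.0) {
            char arg[32];
            pid_t pid;

            snprintf(arg, sizeof arg, "%.1f", rain);
            pid = fork();
            if (pid == 0) {
                execl("./drainsim", "drainsim", "-r", arg, (char *)NULL);
                _exit(127);             /* exec failed */
            }
            waitpid(pid, NULL, 0);      /* one run at a time, for clarity */
        }
        return 0;
    }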

A great many practical problems fall into this category. My problem for instance involved analysing the behaviour of large mobile ad hoc networks, under varying conditions of weather, rainfall, cloud and operating frequency. Many of the computations were very non-linear, indeed so much so that a Cray would have been almost utterly useless for this purpose. However, a cluster of 60 Pentiums did the job very nicely indeed.

A great many environmental, economics, engineering and operations research problems are prime candidates for parametric computing, and we can expect to see this market progressively expand in coming years as users in these specialities come to appreciate what they are missing.

Implementing a COW, NOW or PoPC

At first glance implementing a COW, NOW or PoPC is a trivial chore: we simply hook up a large number of machines and get down to work. In practice there are many considerations involved, and much careful thought is required to get the desired end result.

The starting point, and arguably the central issue, is the choice of the software runtime environment, since this imposes constraints upon other areas. A number of clustering toolsets exist in the marketplace at this time, most of which are oriented toward the "message passing" model. This model is based upon the idea of splitting a large problem into many small chunks, and using a message passing mechanism to exchange results across the boundaries created by that partitioning. The other alternative is to employ a parametric processing toolset, if your problem falls into this category.
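
For readers unfamiliar with the message passing model, the fragment below is a minimal sketch using MPI, the most widely supported message passing library of this kind: each process owns a slice of a one dimensional problem and swaps its boundary cells with its neighbours every step. The domain size and the update rule are placeholders rather than any particular application.

    #include <mpi.h>

    /* Each process owns a slice of a 1-D domain plus one ghost cell at
     * either end, and swaps its boundary cells with its neighbours on
     * every step. Domain size and update rule are placeholders. */
    #define LOCAL_N 1000

    int main(int argc, char **argv)
    {
        double u[LOCAL_N + 2];          /* u[0] and u[LOCAL_N+1] are ghosts */
        int rank, size, left, right, i, step;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (i = 0; i < LOCAL_N + 2; i++)
            u[i] = (double)rank;        /* dummy initial data */

        for (step = 0; step < 100; step++) {
            /* exchange boundary cells across the partition boundaries */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, &st);
            MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, &st);
            for (i = 1; i <= LOCAL_N; i++)
                u[i] = 0.5 * (u[i - 1] + u[i + 1]);   /* trivial smoothing */
        }

        MPI_Finalize();
        return 0;
    }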

The choice of the runtime environment and style of parallel processing will impose constraints in other areas.

The first is the choice of the machine architecture and operating system you can use. While most modern clustering toolsets are available for a range of architectures, some thought will need to be given to what is the best choice for the application.

Applications which will generate large amounts of I/O traffic to local or network disks will be better served by Unix workstations since these have traditionally been designed for server and power user applications, and thus have the I/O bandwidth to cope well. Applications which are compute and memory bound, rather than I/O bound, are less demanding of the basic hardware, and can be run on commodity PC hardware. Where I/O bandwidth is not critical, a typical Pentium or top end Intel clone will be highly competitive against a Unix workstation chip, especially in a parametric computing environment, where a larger number of cheaper CPUs can be harnessed to do the same volume of work.

The choice of machine architecture and operating system is of course constrained by the supported runtime toolset ports, but within that range of choices we still have some play. If we expect to be running compute bound problems, each of which takes relatively little time to execute, then a Linux solution would be very attractive. On the other hand, if our individual problems take many hours if not days to execute on our CPU of choice or involve high I/O rates, a more stable and commercially tuned operating system like a BSD or SVR4 variant would be the preferable choice.

This aspect of the exercise should not be underestimated, insofar as any problems you encounter running your code on a single CPU will be magnified N-fold with an N processor cluster. If the operating system you employ, certain proprietary Intel hosted desktop products coming to mind here, likes to go belly up about once a day, on a 32 CPU cluster this is 32 crashes to deal with per day. Poor choices in this respect can cost far more trouble than would be incurred by rehosting the basic application code to a better, even if more expensive, platform.

The choice of processor type and operating system is but one portion of the hardware problem to be solved. The next issues are hardware packaging and networking.

Hardware packaging can significantly impact both the life cycle cost and the uptime of the planned installation. We assume "headless" hosts in the cluster.

If you are planning a centralised cluster installation, serious consideration should be given to the use of rackmounted rather than desktop machine packaging. Chassis designed for installation in RETMA 19" racks have two important advantages over conventional desktop chassis in rackmount trays.

The first is that they are designed for the thermal environment of a rack, and usually have inlets and exhausts located on the front and the back. A typical arrangement sucks air in through the front panel and exhausts out the back, thus precluding reingestion of heated exhaust air, or ingestion of heat dumped from machines lower down in the racking system. Since hardware failure rates increase with operating temperature, higher initial expenditure will be offset by lower through life expenditure and downtime.

The other factor favouring rack mounted chassis is access for repair or upgrades, since typical rackmount chassis are designed to slide out on rails for maintenance. A tray mounted desktop chassis has to be carried away for such work.

Needless to say, rackmounting does allow a higher packing density in the cabinet, which can be an issue for many smaller sites.

The networking problem can also be a source of much heartache if not considered carefully. The issues revolve around the behaviour of the application and its implications for performance and reliability. Since the network is a single point of failure and a potential performance bottleneck, it is an item which must be done right if it is not to become a major headache in the production phase.

With a message passing runtime environment, a variant of a hypercube or mesh topology will be the most suitable choice, since this allows localised inter-processor messaging traffic to be isolated from the rest of the cluster. Judicious choice of topology may allow the use of very cheap networking hardware, or even 10 Base 2 thinwire connections, with no serious performance penalties.
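
The appeal of the hypercube lies in its addressing: in a d-dimensional hypercube of 2^d nodes, a node's neighbours are precisely those nodes whose binary addresses differ from its own in a single bit, so each node needs only d links and most traffic stays between adjacent nodes. The little C program below, included simply to make the idea concrete, prints the neighbour map for a 16 node (d = 4) hypercube.

    #include <stdio.h>

    /* In a d-dimensional hypercube of 2^d nodes, the neighbours of a node
     * are the nodes whose binary addresses differ in exactly one bit, so
     * each node needs only d links and messaging stays local. */
    int main(void)
    {
        int d = 4;                      /* 16-node hypercube */
        int node, k;

        for (node = 0; node < (1 << d); node++) {
            printf("node %2d neighbours:", node);
            for (k = 0; k < d; k++)
                printf(" %2d", node ^ (1 << k));
            printf("\n");
        }
        return 0;
    }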

In a parametric processing runtime environment, where the basic model is essentially a star topology, the issue is one of whether to use a switch or a dumb hub, and what channel bit rate and adaptor performance is necessary. The implicit topological model creates in turn an implicit performance bottleneck.

Given the available hardware in the marketplace, the choices boil down to 100 Base T, ATM and FDDI. The latter is arguably a legacy item, but is worth mentioning since some sites may choose to build up a cluster from retired servers and desktop workstations for which the former interfaces may not be cheaply available, whereas pre-loved FDDI hardware may be.

In assessing how much to spend on the networking side, the central issue is whether the application demands a high I/O rate to a central storage device. If this is the case, an example being any application which uses a large common dataset of some kind, then it is advisable to aim for the best possible network performance. Therefore each host in the cluster should use a network I/O card with an embedded processor (if the card can do the front end TCP/IP stack processing, this is even better), and a fast switch should be employed.

Where the application can get by with a copy of the central dataset on local machine disk, or doesn't use large datasets, then embedded network interfaces on the motherboard and a hub may well suffice.

The final issue is one of sizing the hosts in the cluster and tuning their performance.

This boils down to the classical host sizing and tuning problem, and no rocket science is involved. The conventional strategy of running the application on a single testbed system is employed, to determine how much memory and how much local disk to install. Swap space sizing is also conventional. The ground rules are to ensure that swapping is avoided where possible, and that the I/O performance of the disk used is compatible with the application.
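
One convenient way of taking those measurements on a Unix testbed, sketched below under the assumption of a hypothetical application binary called "./mysim", is to run the application as a child process and read the kernel's resource accounting for it afterwards. Note that the units of ru_maxrss differ between systems, so the local manual page should be consulted.

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Run the application once on the testbed and report the resources
     * it consumed. "./mysim" is a placeholder binary name. */
    int main(void)
    {
        struct rusage ru;
        pid_t pid = fork();

        if (pid == 0) {
            execl("./mysim", "mysim", (char *)NULL);
            _exit(127);                 /* exec failed */
        }
        waitpid(pid, NULL, 0);
        getrusage(RUSAGE_CHILDREN, &ru);
        printf("max resident set: %ld (units vary by system)\n", ru.ru_maxrss);
        printf("block input ops:  %ld\n", ru.ru_inblock);
        printf("block output ops: %ld\n", ru.ru_oublock);
        return 0;
    }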

A compute bound application on an Intel architecture will mostly get by with commodity IDE disks, while an I/O intensive application will require wide Ultra SCSI disks and high performance adaptor boards.

As a final point it is worth reiterating that careful planning pays off in the building of clusters, since any problems seen in one-off installations are by the nature of a cluster magnified N-fold. If it bites you on the desktop, it will devour you on a cluster.

Activetools Clustor

As an example of a parametric processing toolset I will explore the Activetools Clustor product (http://www.activetools.com/), since I have used it extensively; much of this discussion was the subject of my AUUG 99 paper. Many of the general attributes and issues will be applicable to other parametric toolsets.

Activetools Clustor is the commercial development of the Nimrod toolset, devised and implemented by Prof David Abramson and Rok Sosic. It has been widely marketed in the US and is used by a number of academic, government and commercial clients to support a range of applications. Clustor has been ported to Linux on Intel, Solaris on SPARC, Irix on MIPS, AIX on PowerPC, OSF/1 (or whatever it is being called these days) on Alpha and HP/UX on HP-PA, as well as NT on Intel for those enamoured of proprietary operating systems. Activetools will port to any reasonable architecture at a client's request. Freebie demos for all supported architectures, with timebombed licences, are available on their website for users interested in exploring the product.

The basic model used in Clustor is that of running a gaggle of "robot users" on a central "root node" machine, each of which will execute the application using telnet or rsh on one of the cluster's many "client nodes". The toolset provides an environment in which the individual parametric "scenarios" can be centrally programmed prior to runtime, and the activity of the "robot users" managed from a single point. At runtime, each "robot user" logs into its pre-programmed client, sets up its local runtime environment, fires up the application, and upon its completion gathers the results of the computation and cleans up after itself.
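
Conceptually, and I stress this is a bare-bones illustration of the idea rather than a description of Clustor's internals, a "robot user" dispatcher amounts to something like the C fragment below: the root node forks one child per client node and each child runs the job remotely via rsh. The hostnames and the remote command are placeholders.

    #include <sys/wait.h>
    #include <unistd.h>

    /* Bare-bones "robot user" dispatch: the root node forks one child per
     * client and each child runs the job on its node via rsh. Hostnames
     * and the remote command are placeholders; a real dispatcher also
     * stages files, monitors progress and collects results. */
    int main(void)
    {
        const char *nodes[] = { "node01", "node02", "node03" };
        int i, n = sizeof nodes / sizeof nodes[0];

        for (i = 0; i < n; i++) {
            if (fork() == 0) {
                execlp("rsh", "rsh", nodes[i],
                       "cd /scratch/run && ./mysim params.rc", (char *)NULL);
                _exit(127);             /* exec failed */
            }
        }
        for (i = 0; i < n; i++)
            wait(NULL);                 /* reap all the robot users */
        return 0;
    }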

The simplicity of this basic model, analogous in concept to the remote terminal emulation model used in performance analysis, belies much of the complexity required to implement it.

At runtime Clustor will create a directory tree on the target client node and, using a configuration script called a "plan file", suffixed .pln, it will copy across the required data files and parameter files, and if necessary the executable binary of the application itself. When the application has completed, it will use the same script to save the results back to a specified place on the "root node". Should the application fail and return with an error code, Clustor will save every single file into an error directory on the "root node" to facilitate debugging.

The Clustor plan file uses a powerful scripting language. The file can be manipulated using a text editor, a technique preferred by most users with experience in Unix scripting or languages like PERL, or using a windowed front end "preparator" tool (implemented in X11 or Windoze depending on the port). A plan file is typically divided into three portions, a generic setup phase, a runtime phase, and a cleanup phase.

For Clustor execution the plan file must be preprocessed by the atcgenerator tool which creates a runtime command file, termed a "run file", suffixed .run, which is human readable and is directly interpreted at runtime by Clustor.

To execute the application on the cluster, the "atcdespatcher" tool is run, using the run file. It will fork off a gaggle of "atcjobmanager" processes, one for each client CPU. These processes each run a specific instantiation of the run file, with the programmed parameters for that specific computational run. The atcdespatcher can be run in the background, or using a GUI activity monitor display.

Clustor is capable of substituting both command line arguments, and data fields in configuration files, to parametrize the job. In the simulations I performed, I used both, to manipulate arguments and .rc file contents.
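
The data file substitution amounts, conceptually, to template expansion. The C fragment below illustrates the idea in isolation, using an invented token "@FREQ@" and invented filenames, and should not be read as Clustor's actual mechanism or syntax.

    #include <stdio.h>
    #include <string.h>

    /* Minimal illustration of parameter substitution: copy a template .rc
     * file, replacing the token "@FREQ@" with this run's value. The token
     * and filenames are invented for illustration only. */
    int main(int argc, char **argv)
    {
        const char *value = (argc > 1) ? argv[1] : "2400";
        char line[1024];
        FILE *in  = fopen("mysim.rc.template", "r");
        FILE *out = fopen("mysim.rc", "w");

        if (!in || !out)
            return 1;
        while (fgets(line, sizeof line, in)) {
            char *p = strstr(line, "@FREQ@");
            if (p) {
                *p = '\0';              /* split the line at the token */
                fprintf(out, "%s%s%s", line, value, p + strlen("@FREQ@"));
            } else {
                fputs(line, out);
            }
        }
        fclose(in);
        fclose(out);
        return 0;
    }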

Porting the application to run on a cluster, using Clustor, requires several steps.

  • Port the application, if necessary, to the client node operating system. This will typically require disabling of any GUI features and incorporation of an .rc file or command line argument interface, as well as accommodating system idiosyncrasies.

  • Validate the port by suitable regression testing.

  • Produce the Clustor plan file.

  • Test the Clustor plan file for a small number of CPUs.

  • Validate the results from the cluster test runs against the same results run on individual machines.

The time consumed will vary with the application and the complexity of running it, which is reflected in the complexity of the plan script. I expended several days moving my FreeBSD/Irix port to Linux and generating the scripts, after which I spent 2-3 weeks testing and debugging on the cluster.

It is worth noting that Clustor makes for an excellent regression testing environment; indeed my experience was that nothing beats large scale cluster compute runs for finding bugs buried in your code for months!

Clustor will allow inhomogeneous OS and architecture environments, such as the use of Unix on the root node and NT on the client nodes, or differing Unix variants across the cluster. The only caveat is to ensure that the application binary is available on each client, and that endian issues are resolved in binary data files. Inhomogeneous clusters can be advantageous insofar as the root node, a single point of failure, can be implemented using a more reliable, higher performance (and more expensive) machine and operating system than the client nodes. Soaking up spare cycles across a site by using underworked desktop machines to supplement a cluster is entirely feasible.

Space limitations preclude a more detailed discussion of issues such as load management in multi-user environments, and specific experience with particular operating system variants. Interested readers are invited to read my AUUG99 presentation slides, posted at http://www.csse.monash.edu.au/~carlo/auug99-clustor.ppt ; details of the Monash cluster installation are posted at http://hathor.csse.monash.edu.au/ .

From my perspective as a user I was very pleased with the end result. The use of a cluster allowed me to process around 1000 simulations, each of which could take days on a single Pentium CPU, over a period of several months, with a very modest management overhead. The use of the scripts to sort the resulting data proved to be a major time saver. Using the Monash cluster, which ultimately incorporated 60 Pentiums across two campuses in Melbourne, allowed me to complete a simulation project on a scale which was simply not implementable using the conventional "single box" approach.

Cheap commodity processors and tools such as Clustor have the long term potential to revolutionise many areas of engineering, scientific, and operations research computing. It is fair to say that at this time we are standing on the threshold of another new computing paradigm.



