Beowolf-NT: Commodity Supercomputing using DEC Alphas under Windows NT

Denis Nicole, Simon Cox and Kenji Takeda

High Performance Computing Centre, University of Southampton, UK, SO17 1BJ

This article is to appear in HPC Profile. Please feel free to contact ktakeda@soton.ac.uk regarding this research.

Details of the current state of this work are given at the bottom of this page.

Last updated 20 May 1998

1. INTRODUCTION

The convergence of the high-end workstation and commodity personal computers has been particularly rapid over the last few years. It is now possible to build cheap, powerful supercomputer-level machines using commodity parts at a fraction of the cost of proprietary systems.

The Beowulf initiative (http://cesdis.gsfc.nasa.gov/beowulf/) has concentrated on using Intel-based machines running Linux to provide very cost-effective production machines for a number of applications.

We have recently purchased a dedicated computational cluster of DEC Alpha workstations. These compete on a node-for-node basis with systems from IBM and SGI/Cray for many scientific and engineering applications but, because they use commodity components, their cost is lower by a factor of at least three.

SYSTEM CONFIGURATION

The current system configuration is:

Eight nodes, each with

500 MHz Alpha 21164 processor
256 Mbyte RAM
2.5 Gbyte EIDE drive
Two additional 5 Gbyte drives to support Windows NT 5.0 and Linux tests
Windows NT version 4

Server node

200 MHz Pentium
32 Mbyte RAM
4 x 5 Gbyte IDE drives
30 Gbyte DLT backup
Debian Linux

Network connectivity

100 Mbit Ethernet
100 Mbit twelve-port Ethernet switch

As this is a compute cluster, only four monitors were purchased, with switch boxes between shared monitors. The total system cost was 50,000 UK pounds. This now represents one of the biggest single computational resources at Southampton University.

The best value for money clearly lies in this commodity desktop PC technology. Here we can take advantage not only of economies of scale in the corporate market, but also of the rapidly increasing take-up of PCs in the home. Even greater leverage can be obtained by using the DEC Alpha microprocessor. These FORTRAN-optimised chips are priced to compete in the Windows NT marketplace against Intel and offer twice the price/performance of comparable Pentium-based systems.
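The price/performance argument can be made concrete with a little arithmetic on figures quoted in this article. The sketch below is illustrative only. It assumes that the whole 50,000 pound system cost (server, switch and monitors included) is spread evenly over the eight compute nodes, and it takes the gigaflop peak and 110 Mflops Linpack figures quoted for a single node in the performance section. The function names are hypothetical.

```python
# Figures taken from the article; the way they are combined is an assumption.
TOTAL_COST_POUNDS = 50_000.0
COMPUTE_NODES = 8
PEAK_GFLOPS_PER_NODE = 1.0       # "gigaflop peak" per node
LINPACK_MFLOPS_PER_NODE = 110.0  # Linpack (201) result on one node


def cost_per_node():
    """Total system cost spread evenly over the compute nodes."""
    return TOTAL_COST_POUNDS / COMPUTE_NODES


def cost_per_peak_gflop():
    """Pounds per Gflop of aggregate peak performance."""
    return TOTAL_COST_POUNDS / (COMPUTE_NODES * PEAK_GFLOPS_PER_NODE)


def cost_per_linpack_mflop():
    """Pounds per Mflop, summing eight independent single-node Linpack runs."""
    return TOTAL_COST_POUNDS / (COMPUTE_NODES * LINPACK_MFLOPS_PER_NODE)


if __name__ == "__main__":
    print(f"cost per node:          {cost_per_node():.0f} pounds")
    print(f"cost per peak Gflop:    {cost_per_peak_gflop():.0f} pounds")
    print(f"cost per Linpack Mflop: {cost_per_linpack_mflop():.2f} pounds")
```

On these assumptions the cluster works out at 6,250 pounds per node and roughly 57 pounds per Mflop of Linpack performance. The aggregate Linpack figure here is simply eight single-node results added together, not a parallel benchmark.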

The Beowulf project has focussed on Intel-Linux systems to keep costs low and provide flexibility. In order to provide cheap parallel cluster computing, a mixture of Digital UNIX compilation and Linux compute nodes is the most effective route right now. The special effects in the recent film Titanic were generated on a cluster of 200 DEC Alphas running Windows NT (as file servers) and Linux (for computations), connected by 100 Mbit/s Ethernet (http://www.ssc.com/lj/issue46/2494.html).

However, in order to run the best compilers we are forced to choose between Windows NT and Digital UNIX. The latter option is costly, though, both in terms of licensing and the need for specialist (e.g. SCSI) hardware. This largely offsets the gains made in using commodity machines. We are therefore pursuing the long-term goal of delivering an effective remote and local parallel computing service directly under Windows NT. One further reason: Windows NT is the wave of the future, whether we like it or not... and it runs Microsoft Office.

2. SINGLE NODE PERFORMANCE

The 500 MHz DEC Alpha EV56 workstations we have are gigaflop peak machines. On simple benchmarks, each one is a 100 Mflops system:

Linpack (201) delivers 110 Mflops
Linpack (200) delivers 97 Mflops
Livermore Loops (geom. mean) deliver 103 Mflops

These figures are almost identical to those obtained when the same benchmarks are run under Digital UNIX.
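The gap between the peak and sustained figures can be expressed as a fraction of peak. The sketch below assumes that the "gigaflop peak" comes from completing one floating-point add and one floating-point multiply per clock cycle at 500 MHz. That breakdown is an assumption, as are the function names; the measured values are the ones listed above.

```python
CLOCK_MHZ = 500
FLOPS_PER_CYCLE = 2  # assumed: one add plus one multiply per cycle

# Measured figures quoted above, in Mflops.
MEASURED_MFLOPS = {
    "Linpack (201)": 110.0,
    "Linpack (200)": 97.0,
    "Livermore Loops (geom. mean)": 103.0,
}


def peak_mflops(clock_mhz=CLOCK_MHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in Mflops: clock rate in MHz times flops per cycle."""
    return clock_mhz * flops_per_cycle


def fraction_of_peak(measured_mflops):
    """Sustained rate as a fraction of the theoretical peak."""
    return measured_mflops / peak_mflops()


if __name__ == "__main__":
    for name, mflops in MEASURED_MFLOPS.items():
        print(f"{name}: {100 * fraction_of_peak(mflops):.1f}% of peak")
```

On this basis the simple benchmarks sustain roughly 10% of peak, which is why the same machine can be described both as a gigaflop peak machine and as a 100 Mflops system.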

Real-world application performance is similarly impressive. We used the Alpha cluster to partition a 15 million element unstructured tetrahedral grid, a job which requires 2 Gbytes of real memory. Initially, one SP2 node with 256 Mbytes of RAM on the Southampton machine was reconfigured to page off five SCSI disks so that it could handle this job. The job took nine hours to complete and had to be run in the overnight queue, and only on the specific reconfigured node.

The same job took six hours to complete on an Alpha NT node (with 256 Mbytes of RAM) paging off a single EIDE drive. Reconfiguring the swap file took six mouse clicks and a reboot! We were able to do eight partitioning jobs in parallel overnight without having to fight through any queues.
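The two timings above imply both a single-job speedup and a much larger gain in throughput. The following sketch works these out. It assumes that only the one reconfigured SP2 node can run the job, and that each of the eight Alpha nodes runs one job independently in six hours. The function names are hypothetical.

```python
SP2_HOURS_PER_JOB = 9.0    # one reconfigured SP2 node
ALPHA_HOURS_PER_JOB = 6.0  # one Alpha NT node
SP2_NODES = 1              # assumed: only the reconfigured node is usable
ALPHA_NODES = 8


def single_job_speedup():
    """How much faster one job finishes on an Alpha node."""
    return SP2_HOURS_PER_JOB / ALPHA_HOURS_PER_JOB


def jobs_per_hour(nodes, hours_per_job):
    """Throughput when every node runs one independent job at a time."""
    return nodes / hours_per_job


def throughput_ratio():
    """Cluster throughput relative to the single SP2 node."""
    alpha = jobs_per_hour(ALPHA_NODES, ALPHA_HOURS_PER_JOB)
    sp2 = jobs_per_hour(SP2_NODES, SP2_HOURS_PER_JOB)
    return alpha / sp2


if __name__ == "__main__":
    print(f"single-job speedup: {single_job_speedup():.1f}x")
    print(f"throughput ratio:   {throughput_ratio():.0f}x")
```

Under these assumptions a single job finishes 1.5 times faster, and the cluster as a whole completes about twelve times as many partitioning jobs per hour.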

3. PARALLEL PERFORMANCE

While the single node performance of these machines under Windows NT is as good as we had hoped, message passing software for Alpha platforms has been slow in coming.

At present the only implementation of MPI available for Windows NT which runs on DEC Alphas is the Mississippi State MPICH implementation (http://www.erc.msstate.edu/mpi/mpiNT.html), which is still in Beta. We have ported this fully to Digital FORTRAN and believe that we are the only group running MPI on Alpha NT with Digital Visual FORTRAN. The source and .DLLs, including FORTRAN wrappers, will be distributed on this website soon.

MPI performance on a single machine using shared memory is reasonable: COMMS1 gives a bandwidth of 10.8 Mbytes/s between two processes. Between two machines, however, it is terrible, with a bandwidth of only 59.2 kbytes/s. This is due to the software implementation of TCP/IP communications, and both we and the team at Mississippi State are working hard to improve this performance. It should be remembered that this is still only in Beta phase!
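Bandwidth figures of this kind are usually obtained from a ping-pong test: one endpoint sends a message, the other echoes it back, and the bandwidth is the number of bytes moved divided by the elapsed time. The sketch below shows only that timing pattern. It is not COMMS1 and does not use MPI; as a stand-in it uses two Python threads joined by a local socket pair, so its numbers say nothing about the cluster. All function names are hypothetical.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket (recv may return fewer)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def bandwidth_mbytes_per_s(nbytes, round_trips, elapsed_s):
    """Each round trip moves the message twice: there and back."""
    return 2.0 * nbytes * round_trips / elapsed_s / 1.0e6


def ping_pong(nbytes=65536, round_trips=50):
    """Time round trips of an nbytes message; return Mbytes/s."""
    near, far = socket.socketpair()

    def echo():
        # The "pong" side: return every message unchanged.
        for _ in range(round_trips):
            far.sendall(recv_exact(far, nbytes))

    worker = threading.Thread(target=echo)
    worker.start()
    payload = bytes(nbytes)
    start = time.perf_counter()
    for _ in range(round_trips):
        near.sendall(payload)
        recv_exact(near, nbytes)
    elapsed = time.perf_counter() - start
    worker.join()
    near.close()
    far.close()
    return bandwidth_mbytes_per_s(nbytes, round_trips, elapsed)


if __name__ == "__main__":
    print(f"local ping-pong: {ping_pong():.1f} Mbytes/s")
```

A real measurement would repeat this over a range of message sizes, since small messages are dominated by start-up latency and only large ones approach the asymptotic bandwidth.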

Initial tests on real application performance of this system have been carried out. The DNS combustion code ANGUS was compiled to run under Windows NT with only one modification to the main source code required (replacing /dev/null with NUL for dummy file output). ANGUS is a finite-difference code which uses a regular grid and straightforward domain decomposition. The most intensive part of the program is solving the Poisson equation for the pressure. A small 40x40x40 grid case with a 2x2x1 processor decomposition was run, with the following performance:

One processor running one process: 20 s/iteration with 2.4 s of comms
One processor running four processes: 38 s/iteration with 21.3 s of comms
Two processors running four processes: 285 s/iteration with 263 s of comms
Cray T3D performance (4 nodes): 61.6 node-seconds/iteration with 14 node-seconds of communication
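The share of each iteration spent communicating follows directly from the figures listed above, as the short sketch below shows. The function names are hypothetical. The Cray T3D row is quoted in node-seconds rather than seconds, so only its ratio is comparable with the other rows.

```python
# (label, time per iteration, communication time), from the list above.
# The first three rows are in seconds; the T3D row is in node-seconds.
TIMINGS = [
    ("1 processor, 1 process", 20.0, 2.4),
    ("1 processor, 4 processes", 38.0, 21.3),
    ("2 processors, 4 processes", 285.0, 263.0),
    ("Cray T3D, 4 nodes", 61.6, 14.0),
]


def comms_fraction(total, comms):
    """Fraction of the iteration time spent in communication."""
    return comms / total


def compute_time(total, comms):
    """Time left for computation once communication is subtracted."""
    return total - comms


if __name__ == "__main__":
    for label, total, comms in TIMINGS:
        print(f"{label}: {100 * comms_fraction(total, comms):.1f}% comms, "
              f"{compute_time(total, comms):.1f} compute")
```

On these figures communication rises from 12% of the iteration time for a single process to about 56% with four processes on one processor and about 92% across two machines, while the remaining compute time stays between roughly 17 and 22 s.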

As one would expect, these timings reflect the discrepancy between shared and distributed memory performance already shown by the benchmark figures. They also show that uniprocessor MPI can be used for application development on Windows NT now. Further results are presented in Emerson et al, 1998.
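The 2x2x1 decomposition of the 40x40x40 grid used in this test can be illustrated with a small block-partitioning sketch. It is not taken from ANGUS. It assumes the simplest scheme, in which each axis is split into near-equal contiguous index ranges, and a halo one grid point deep. The function names are hypothetical.

```python
from math import prod


def local_block(global_shape, proc_grid, coords):
    """Return one (start, stop) index pair per axis for a given process.

    global_shape: grid points along each axis, e.g. (40, 40, 40)
    proc_grid:    processes along each axis, e.g. (2, 2, 1)
    coords:       this process's position in the process grid
    """
    ranges = []
    for n, p, c in zip(global_shape, proc_grid, coords):
        base, extra = divmod(n, p)
        # The first `extra` processes along an axis get one extra point.
        start = c * base + min(c, extra)
        stop = start + base + (1 if c < extra else 0)
        ranges.append((start, stop))
    return ranges


def block_points(block):
    """Number of grid points owned by one process."""
    return prod(stop - start for start, stop in block)


def face_points(block, axis):
    """Points on one face normal to `axis` (one-point-deep halo assumed)."""
    return prod(stop - start
                for i, (start, stop) in enumerate(block) if i != axis)


if __name__ == "__main__":
    block = local_block((40, 40, 40), (2, 2, 1), (1, 0, 0))
    print(block, block_points(block), face_points(block, 0))
```

With this scheme each of the four processes owns a 20x20x40 block of 16,000 points, and a face shared with a neighbour in x or y holds 800 points.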

Benchmarks of PVM 3.4 Beta4 running under NT (http://www.epm.ornl.gov/pvm/NTport.html) show more promise, sustaining 4.8 MB/s peak between two processes on a single machine, and 2.3 MB/s peak between two machines when running bwtest. Note that this port of PVM will be included in the final release of PVM 3.4.
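To put the MPI and PVM bandwidth figures on a common footing, the sketch below estimates how long a message would take to transfer at each quoted rate. It is a deliberately crude model: it ignores start-up latency, treats the PVM peak figures as sustained rates, and assumes 1 Mbyte = 10^6 bytes and 1 kbyte = 10^3 bytes. The function names are hypothetical.

```python
# Bandwidths quoted in this section, converted to bytes per second.
BANDWIDTH_BYTES_PER_S = {
    "MPI, one machine (COMMS1)": 10.8e6,
    "MPI, two machines (COMMS1)": 59.2e3,
    "PVM, one machine (bwtest)": 4.8e6,
    "PVM, two machines (bwtest)": 2.3e6,
}


def transfer_time_s(nbytes, bandwidth_bytes_per_s):
    """Time to move nbytes at a fixed bandwidth, ignoring latency."""
    return nbytes / bandwidth_bytes_per_s


def pvm_over_mpi_between_machines():
    """How many times faster PVM is than MPI between two machines."""
    return (BANDWIDTH_BYTES_PER_S["PVM, two machines (bwtest)"]
            / BANDWIDTH_BYTES_PER_S["MPI, two machines (COMMS1)"])


if __name__ == "__main__":
    message = 1.0e6  # a 1 Mbyte message
    for name, bw in BANDWIDTH_BYTES_PER_S.items():
        print(f"{name}: {transfer_time_s(message, bw):.2f} s per Mbyte")
    print(f"PVM/MPI between machines: {pvm_over_mpi_between_machines():.1f}x")
```

On this model a 1 Mbyte message takes about 17 s between two machines under MPI but under half a second under PVM, a ratio of roughly 39.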

4. WINDOWS NT ISSUES

Windows NT 4.0 is certainly quite different to UNIX in many respects. Many of its shortcomings are not surprising considering that it has only been in existence for a few years. We are currently tackling problems with stable remote logins and with running graphical applications across the network. Some security issues are being addressed in NT 5.0, which we are currently testing in Beta.

In terms of the MPI and PVM implementations there are a few serious problems. MPI runs under Administrator accounts with full system privileges, as it was originally intended to be run in shared memory on a single machine. It can also leave dead processes hanging on remote machines, which must be killed off manually. PVM currently requires pvmd3 daemons to be started manually on remote machines and fails to redirect I/O properly. As both of these software suites are still in Beta, some of these problems may yet be fixed.

5. CONCLUSIONS

Clearly, distributed-memory parallel programming on Windows NT is currently far from matching the performance of comparable UNIX/Linux-based systems. However, the underlying problem at present is the MPI implementation, which is a recent port still in early Beta. PVM shows more reasonable performance, even though it is also only at the Beta4 stage. The fact that a large HPCI consortium code could be run under Windows NT on a cluster with only one modification to the source has quelled many of our fears regarding compatibility issues.

As a development environment, DEC Alphas running Windows NT work well, with shared memory MPI providing full functionality and reasonable performance for testing.

CURRENT PROJECT STATUS

As can clearly be seen from the results above, the MPI performance between machines is awful. However, the PVM figures and other ping-pong tests we have carried out (Cox, Nicole and Takeda, 1998) suggest that we should expect 4-5 MB/s bandwidth from a more efficient MPI implementation. We are working towards this at present by removing some of the intermediate layers between MPI and the network hardware.
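It may help to see these bandwidths against the raw capacity of the 100 Mbit Ethernet that connects the nodes. The sketch below assumes a raw link rate of 100 x 10^6 bits per second, i.e. 12.5 x 10^6 bytes per second, and ignores Ethernet and TCP/IP framing overhead, so the utilisation figures are slightly pessimistic. The function names are hypothetical.

```python
LINK_BITS_PER_S = 100e6                  # 100 Mbit Ethernet, raw rate
LINK_BYTES_PER_S = LINK_BITS_PER_S / 8   # 12.5e6 bytes per second

CURRENT_MPI_BYTES_PER_S = 59.2e3         # measured between two machines
TARGET_BYTES_PER_S = (4.0e6, 5.0e6)      # the expected 4-5 MB/s range


def link_utilisation(bandwidth_bytes_per_s):
    """Achieved bandwidth as a fraction of the raw link rate."""
    return bandwidth_bytes_per_s / LINK_BYTES_PER_S


def improvement_needed(target_bytes_per_s):
    """Factor by which the current MPI bandwidth would have to improve."""
    return target_bytes_per_s / CURRENT_MPI_BYTES_PER_S


if __name__ == "__main__":
    print(f"current MPI: {100 * link_utilisation(CURRENT_MPI_BYTES_PER_S):.2f}%")
    for target in TARGET_BYTES_PER_S:
        print(f"target {target / 1e6:.0f} MB/s: "
              f"{100 * link_utilisation(target):.0f}% of the link, "
              f"{improvement_needed(target):.0f}x improvement")
```

On these assumptions the current MPI implementation uses under half a percent of the link, and the 4-5 MB/s target corresponds to 32-40% of the raw rate, an improvement of roughly 70 to 85 times.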

Another area we are improving is that of remote access. WinVNC, which allows full remote graphical access to Digital Visual FORTRAN and MS Visual C++, is being tested on our Alpha cluster. It is also free, including source code.

We are currently testing Linux single node and MPI performance. Initial results show Digital Visual FORTRAN being 60% faster than the EGCS g77 compiler. We are collaborating with Daresbury Laboratory, who have recently purchased a PII/Linux cluster with commercial Linux FORTRAN compilers.

Our primary concern is performance, but we are also addressing some of the other Windows NT niggles. We are developing environments which will allow users to migrate smoothly from wholly-UNIX systems to Windows NT, so that they can take advantage of some of the latter's features where appropriate.

REFERENCES

Nicole, D.A., Takeda, K. and Wolton, I.C., "HPC on DEC Alphas and Windows NT", Proc. HPCI Conf. 98, Manchester 12-14, 1998

Cox, S.J., Nicole, D.A. and Takeda, K., "Commodity High Performance Computing at Commodity Prices", WOTUG-21, Proc. 21st World Occam and Transputer User Group Technical Meeting, 1998

Cox, S.J., Daniell, G.J. and Nicole, D.A., "Maximum Entropy, Parallel Computation and Lotteries", To be presented at 1998 International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, 1998

Emerson, D.R., Maguire, K., Takeda, K. and Nicole, D.A., "An Evaluation of Cost Effective Parallel Computers for CFD", To be presented at the 10th International Conference on Parallel CFD, Taiwan, May 1998

Takeda, K. and Tutty, O.R., "Parallel Discrete Vortex Methods on Commodity Supercomputers; An Investigation into Bluff Body Far Wake Behaviour", To be presented at the 3rd International Workshop on Vortex Flow and Related Numerical Methods, Toulouse, August 1998

The Beowulf Project, http://cesdis.gsfc.nasa.gov/beowulf/

Linux Helps Bring Titanic to Life, Linux Journal, http://www.ssc.com/lj/issue46/2494.html

MPICH for Windows NT, http://www.erc.msstate.edu/mpi/mpiNT.html

PVM for Windows NT, http://www.epm.ornl.gov/pvm/NTport.html

WinVNC from Olivetti & Oracle Research Lab, http://www.orl.co.uk/vnc