Parallel Numerical Methods

Parallel Numerical Methods

A number of specific numerical problems arise in parallel programming, including use of halos in computation, parallel linear algebra, long range forces, and random number generation.

(To read this with no frames type this location into your web browser: training/numan_par.html)

High Performance Computing

High Performance Computing (HPC) is an engineering discipline which develops the technology (hardware, algorithms, and software) to deliver cost-effective computational results.

Parallel Numerical Methods

Selection of the appropriate algorithm often requires some knowledge of the likely platform to be used.

Bibliography

Foster, I., 1995. Designing and Building Parallel Programs. New York: Addison-Wesley. This book may be read online at http://www.cs.rdg.ac.uk/dbpp/

Masuda, N. and Zimmerman, F., 1996. PRNGLIB: A Parallel Random Number Generator Library. Technical Report CSCS-TR-96-08, Centro Svizzero di Calcolo Scientifico, CH-6928, Manno, Switzerland.

http://www.cscs.ch/Official/TechReports/1996/TR-96-08.ps.gz and http://random.mat.sbg.ac.at/news/

Press, W.H., Teulkolsky S.A., Vetterling, W.T. and Flannery B.P., 1997. Numerical Recipes in FORTRAN 77, (Cambridge: Cambridge University Press). See also the Fortran 90 and C versions. [NR]

Quinn, M.J., 1994. Parallel Computing: Theory and Practice. New York: McGraw-Hill.

Parallel Computing

A parallel computer is a set of processors that work together to solve a computational problem. But first: get to know the beast.

Reasons to use a parallel machine

Why use a large number of processors?

To combine their processing power (e.g. device modelling).
To exploit their aggregate memory (e.g. ocean modelling and lattice calculations).
To use increased disk space or parallel I/O (e.g. Databases).
Explore the phase space of a model independently on each processor (e.g. genetic algorithms).

What sort of parallel machine?

A generic parallel computer consists of a number of processors connected together.

However this covers a multitude of sins:

There may be only a few processors (e.g. a dual Pentium)
The "fast interconnection network" may simply be that the memory of the processors is shared as a central resource (e.g. Silicon Graphics).
The memory of the processors may be distributed between the nodes (e.g. IBM SP) with a "fast interconnection" network between the nodes.
The nodes may be connected with ethernet, or fast ethernet (e.g. a cluster).
The machine may be very expensive or very cheap. I have deliberately not written examples here.

Increasingly we need to focus on "cost-effective" use of computational resources. It may not be cost effective to use an expensive parallel machine with a fast interconnection network to perform independent simulations on each node if there a cluster of machines available which will perform the same task.

Scalability

The card example demonstrates:

Speedup. Not necessarily a factor of P on P processors.
Contention for a resource
Network bandwidth and latency

The choice of algorithm may depend on the platform and number of processors
Load balancing
Overlapping computation and communication

With 6 decks and 6 people, better to simply give each person a deck: "task farming".

With 1000 decks and 6 people, we would need a queue: "scheduling" (or hungry puppies). Would each person take a single deck, or several decks at each visit to the supervisor? What if the people worked at different speeds?

Speedup

(1)

In parallel processing "superlinear" effects may occur when a speedup of >P is obtained on P processors. This may be for a number of reasons

The speedup should be measured wrt to the best sequential algorithm on a single node.
The aggregate memory on several nodes may allow a different algorithm.
Increased cache on several nodes may allow more of the code and instructions to run in core.

Speedup is of academic interest. It is possible to obtain perfect speedup curves with a poor parallel algorithm if the single node time is slow.

Speed is what counts. When do I get the answer? How large a simulation can I run overnight?

Part of the job of a scheduler is to maximise the number of computational tasks to run subject to the constraints imposed by the user. This is a non-trivial task.

General Tricks
Partitioning

Divide up the work into a large number of tasks.

Communication

How and when do the tasks need to communicate with other? Do they communicate globally, or locally?

Agglomeration

Gather the tasks together. Can tasks which need to communicate often be brought together? Are there load balancing issues?

Mapping

Map the tasks onto the processors.

Data Layout

In this simple example, we have 16 tasks and 4 (and 8) processors.

Column				Row				Block				Cyclic
1	2	3	4	1	1	1	1	1	1	2	2	1	2	3	4
1	2	3	4	2	2	2	2	1	1	2	2	5	6	7	8
1	2	3	4	3	3	3	3	3	3	4	4	1	2	3	4
1	2	3	4	4	4	4	4	3	3	4	4	5	6	7	8

Halos

A number of tricks exist to reduce the communication in addition to the agglomeration. In this case some of the data is replicated in a "buffer zone" to decrease the amount of communication required.

1	1	1,2	1,2	2	2
1	1	1,2	1,2	2	2
1,3	1,3	1,2,3,4	1,2,3,4	2,4	2,4
1,3	1,3	1,2,3,4	1,2,3,4	2,4	2,4
3	3	3,4	3,4	4	4
3	3	3,4	3,4	4	4

Other Layouts

1	1	1	1	1	1
1	2	2	2	2	2
1	2	3	3	3	3
1	2	3	4	4	4
1	2	3	4	5	5
1	2	3	4	5	6

This one suffers from load balancing problems

Techniques
Parallel Linear Algebra

Good libraries exist for most common linear algebra tasks.

(see references in training/package.html )

The hardest part is that the optimal layout for one part of a complex piece of code may be the worst in another part. Compromises have to be made!

Though it is bad practice, this sometimes means a global data transfer midway through a run, or replicated data sets on each node.

Time complexity of algorithms

Scalable parallel computing requires O(N) algorithms (where N is some number of grid points, for example).

This is because the most general constraint on running a code is how long you are prepared to wait for results. Given a larger number of nodes, this usually remains the same, and so the question becomes "how large a problem can I now run given the additional nodes."

An O(N²) algorithm would only solve a problem twice as large given 4 nodes, if it were 100% efficient.

Long Range Forces

A naïve implementation of an N body problem scales with N². Techniques exist which scale with O(N log N) and O(N).

http://www.epcc.ed.ac.uk/~mario/nbody.html

Computational Geometry

Considerable effort has been put into designing efficient O(N log N) and O(N) algorithms to perform e.g. convex hull, Delaunay triangulation.

O’ Rourke, J., 1993. Computational Geometry in C. Cambridge: Cambridge University Press.

Parallel Random Number Generation

Pseudo-Random number generators exhibits small fluctuations from truly random behaviour.

A random number generator typically has a seed, from which subsequent random numbers are generated.

Generating random numbers on parallel machines requires:

High quality random numbers. Some current random number generators spectacularly fail tests of randomness
Large periods. A generator with a long period of a few years ago can now be exhausted in a few hours on a Pentium.
The simulation must be repeatable independent of the number of processors used.

We should be able to compute any number in the sequence without stepping through all of the preceding numbers.

The previous generation of random number generators does not measure up on most of these accounts.

The new generation of generators, pioneered by Marsaglia and others meet all of these requirements.

It is worth noting that the new generation of random number generators were only designed to fix the quality issue on sequential processors, but happen simultaneously to fix all the problems of parallel random number generation. Oh yes, and they are also faster by up to a factor of 2 (or more) than other generators. There may not be such a thing as a free lunch, but…

A new generation random number generator typically has an immense period, a seed vector, from which any number in the sequence may be precomputed.

As an example of the incredible periods available recall that those in Numerical Recipes in Fortran 77 have a period of 2^31-1 (~2e9). Generators now exist with a period of 2^19937 - 1, a number with about 6000 decimal digits; the 32-bit random numbers exhibit best possible equidistribution properties in dimensions up to 623. A DEC Alpha Station would require about 10^5983 millennia to exhaust the period.

When any number in the sequence can be precomputed in advance, parallelism is trivial. Simply start each processor on the random number m p / P, where p = processor number p of P in total, m = period.

Summary

Parallel programming for scientific and engineering problems is (or should be) "the norm."

The definition and goalposts of parallel programming are always changing, which results in the observation that parallel computing has always been "getting easier".

The single node performance of the code is crucial.
Using P nodes introduces some overhead. The only task is to reduce this overhead subject to the constraint of solving the problem in reasonable time.

At present the choice of algorithm may depend upon the machine to be used. Ideally this should be the other way round.
A number of special problems arise when using a parallel machine, such as how to arrange data in the code, and how to generate random numbers.

Parallel computing can be summed up by single node performance and scalability.

Last updated 10-Mar-98. Maintained by SJ Cox