ScaLAPACK is a linear algebra library for parallel computers. Routines are available to solve the linear system A*x=b, or to find the matrix eigensystem, for a variety of matrix types. ScaLAPACK is automatically available in the Cray T3E SCILIB.
ScaLAPACK implements the block-oriented LAPACK linear algebra routines, adding a special set of communication routines to copy blocks of data between processors as needed. As with LAPACK, a single subroutine call typically carries out the requested computation.
However, ScaLAPACK requires the the user to configure the processors and distribute the matrix data, before the problem can be solved. A beginning user may find these tasks daunting.
This article is a gentle introduction to ScaLAPACK, keeping the complications to a minimum.
Suppose we had the following linear system, A*x=b:
A(I,J) = min(I,J)
B(I) = 1
On a sequential computer, LAPACK can easily be used to find the
solution X ( 1, 0, 0, 0, ... ). The skeleton code to set up and
solve this system is:
do I = 1, N
B(I) = 1
do J = 1, N
A(I,J) = min(I,J)
end do
end do
call SGESV ( N, 1, A, N, IPIV, B, N, INFO )
The computed solution X overwrites the vector B on return.
If we want to use ScaLAPACK to solve the system, the heart of the code will be almost the same; the call to SGESV is replaced by a call to PSGESV, with a few more arguments. But we will also have to produce some new code, which isn't about the problem itself, but about how we are going to use parallel processors on the problem.
Before going into details, we need to think of a simple use of ScaLAPACK as involving four steps:
ScaLAPACK assumes that the linear algebra problem will be treated by a set of processors that are arranged logically into some sort of rectangular "processor grid." This arrangement will allow any processor to be identified by its row and column. For our example, we will suppose we have four processors, arranged as follows:
Column 0 Column 1
Row 0 Processor 0 Processor 1
Row 1 Processor 2 Processor 3
Now the data in the problem, in this case the matrix and the right hand side, will be distributed among the processors, so that each processor "owns" a portion of the data. For a 6 by 6 matrix, a simple distribution would be:
Processor 0 | Processor 1
|
B1 A11 A12 A13 | A14 A15 A16
B2 A21 A22 A23 | A24 A25 A26
B3 A31 A32 A33 | A34 A35 A36
--------------------+--------------
B4 A41 A42 A43 | A44 A45 A46
B5 A51 A52 A53 | A54 A55 A56
B6 A61 A62 A63 | A64 A65 A66
|
Processor 2 | Processor 3
Each processor only needs space to store its portion of the matrix. To be efficient, each processor should compute the elements it "owns". So one complication will be figuring out which global matrix entries A(I,J) belong to a given processor, and where such an entry is stored. In other words, processor number 3 must know that it owns global matrix entry A(5,4), and that it should be stored in row 2, column 1 of its local storage.
We will call the local storage arrays "APART" and "BPART", and we will compute two sets of indices:
Assuming our local array APART is of dimensions NB by NB, and that our processor's location in the processor grid is (MYROW, MYCOL), then the following code will correctly compute the portion of A that any processor needs:
do IPART = 1, NB
I = MYROW*NB + IPART ! Skip down past other processor's rows
do JPART = 1, NB
J = MYCOL*NB + JPART ! Skip over past other processor's columns
APART(IPART,JPART) = min ( I, J )
end do
end do
Processors in processor column 0 also get a portion of the right hand side:
if ( MYCOL == 0 ) then
do IPART = 1, NB
BPART(IPART) = 1.0
end do
end if
To get superior parallel performance, the matrix should actually be divided up into more and smaller blocks. The code here gets more complicated, but the idea is still simple: each processor gets a "chunk" of the data.
Before calling ScaLAPACK, we have to report how the matrix was divided up. The routine DESCINIT packs up the relevant information into the arrays DESCA and DESCB:
call DESCINIT ( DESCA, N, N, NB, NB, 0, 0, ICTXT, NB, INFO )
call DESCINIT ( DESCB, N, 1, NB, 1, 0, 0, ICTXT, NB, INFO )
Now we can call the routine that actually does the computation:
call PSGESV ( N, 1, APART, 1, 1, DESCA, IPIV, BPART, 1, 1, DESCB, INFO )
The code for managing the processors can be confusing, so it's useful to think of it in terms of what it's doing for you. Essentially, you have to choose how many processors you want, get a special identifier called the "context" that refers to this group of processors, arrange the processors in a rectangular grid, and tell each processor where it belongs in the grid. The code to do this might look as follows:
NPROW = 2
NPCOL = 2
call BLACS_GET ( -1, 0, ICTXT )
call BLACS_GRIDINIT ( ICTXT, 'Row-major', NPROW, NPCOL )
call BLACS_GRIDINFO ( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
After these calls, each processor has received a unique pair of row and
column indices in MYROW and MYCOL. Even though the same code is to be
executed by all processors, each one knows "who" it is.
Once you are done with ScaLAPACK, you release the processors with the calls
call BLACS_GRIDEXIT ( ICTXT )
call BLACS_EXIT ( 0 )
To complete the program, we have to assemble the four steps discussed above into a file, and insert the appropriate data declarations at the head of the program. To see these details, refer to the text of the sample program. Assuming our program is stored in the file "myprog.f", we can compile and run it interactively on the T3E with the commands:
f90 myprog.f
mpprun -n 4 a.out &
The output will be something like:
SCTEST1
Solve A*x=b using ScaLAPACK PSGESV.
STOP (PE 3) executed at line 104 in Fortran routine 'SCTEST'
STOP (PE 2) executed at line 104 in Fortran routine 'SCTEST'
STOP (PE 1) executed at line 104 in Fortran routine 'SCTEST'
PSGESV returned INFO = 0
STOP (PE 0) executed at line 104 in Fortran routine 'SCTEST'
[1] Done mpprun -n 4 a.out
We've kept this introduction light and superficial. We used simple declarations, and default values for many input arguments. However, it is important to note that ScaLAPACK provides a great deal of flexibility.
A program calling ScaLAPACKK, if it is suitably parameterized, can handle many different sizes and arrangements of processors. This can help you find a suitable tradeoff between communication overhead and parallel speedup.
ScaLAPACK can handle complex data as well as real. There are special routines to handle matrices that are banded or positive definite. You can also carry out eigenvector computations. You can divide your matrix into smaller blocks, rather than giving each processor a single contiguous block.
ScaLAPACK includes utility routines to allow you to easily compute row or column maxima or minima on your distributed matrix, or to explicitly copy a blocks of data from one processor to others.
It's possible to set up the full matrix on one processor, and then copy the appropriate portions to the others. However, this approach has significant costs: your program will have to allocate enough memory for the full matrix, the computation of the full matrix will take place on a single processor, you will have to set up the transfer of each matrix block to the appropriate processor, and wait until these transfers are completed before proceeding. It is much better to figure out the portion of the matrix that will "belong" to each processor, and compute it locally.
In our example, we broke the full matrix A into very large chunks. ScaLAPACK allows us to break A into smaller blocks, which can improve load balancing. For instance, as Gauss elimination proceeds, the beginning rows and columns of the matrix are no longer manipulated. If our matrix was distributed using large chunks, then many processors will fall idle, with nothing to do. Using smaller blocking sizes, we can be sure that most of the processors stay busy from beginning to end of the algorithm. If you appropriately parameterize your code, you can easily see how changing the block size affects performance.
You can return to the HTML web page.