An Introduction to ScaLAPACK


ScaLAPACK is a linear algebra library for parallel computers. Routines are available to solve the linear system A*x=b, or to find the matrix eigensystem, for a variety of matrix types. ScaLAPACK is automatically available in the Cray T3E SCILIB.

ScaLAPACK implements the block-oriented LAPACK linear algebra routines, adding a special set of communication routines to copy blocks of data between processors as needed. As with LAPACK, a single subroutine call typically carries out the requested computation.

However, ScaLAPACK requires the user to configure the processors and distribute the matrix data before the problem can be solved. A beginning user may find these tasks daunting.

This article is a gentle introduction to ScaLAPACK, keeping the complications to a minimum.

Setting Up a Sequential Solution:

Suppose we had the following linear system A*x=b:

        A(I,J) = min(I,J)
        B(I) = 1
On a sequential computer, LAPACK can easily be used to find the solution X = ( 1, 0, 0, ..., 0 ). The skeleton code to set up and solve this system is:


        do I = 1, N
          B(I) = 1
          do J = 1, N
            A(I,J) = min(I,J)
          end do
        end do

        call SGESV ( N, 1, A, N, IPIV, B, N, INFO )
      
The computed solution X overwrites the vector B on return.
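
For reference, here is one way the skeleton could be fleshed out into a complete sequential program; the fixed size N = 6 and these declarations are illustrative choices rather than part of the original skeleton:

        program SEQTEST

        integer, parameter :: N = 6
        real A(N,N)
        real B(N)
        integer I, INFO, IPIV(N), J

        !  Set up the matrix A and the right hand side B.
        do I = 1, N
          B(I) = 1.0
          do J = 1, N
            A(I,J) = real ( min ( I, J ) )
          end do
        end do

        !  Factor A and solve A*x=b; the solution overwrites B.
        call SGESV ( N, 1, A, N, IPIV, B, N, INFO )

        write ( *, * ) 'SGESV returned INFO = ', INFO
        write ( *, * ) 'Solution X = ', B(1:N)

        stop
        end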

Using a Parallel Approach:

If we want to use ScaLAPACK to solve the system, the heart of the code will be almost the same: the call to SGESV is replaced by a call to PSGESV, with a few more arguments. But we will also have to write some new code, which isn't about the problem itself, but about how the parallel processors are to be applied to it.

Before going into details, we need to think of a simple use of ScaLAPACK as involving four steps:

    1) configuring the processors into a rectangular grid;
    2) distributing the matrix data among the processors;
    3) calling the ScaLAPACK routine that solves the problem;
    4) releasing the processors.

We will first look at the data distribution question, which will allow us to see how the processors are going to cooperate.

Distributing the Data:

ScaLAPACK assumes that the linear algebra problem will be treated by a set of processors that are arranged logically into some sort of rectangular "processor grid." This arrangement will allow any processor to be identified by its row and column. For our example, we will suppose we have four processors, arranged as follows:

                 Column 0         Column 1

        Row 0    Processor 0      Processor 1

        Row 1    Processor 2      Processor 3
      

Now the data in the problem, in this case the matrix and the right hand side, will be distributed among the processors, so that each processor "owns" a portion of the data. For a 6 by 6 matrix, a simple distribution would be:

               Processor 0  |  Processor 1  
                            |
        B1   A11  A12  A13  | A14  A15  A16
        B2   A21  A22  A23  | A24  A25  A26
        B3   A31  A32  A33  | A34  A35  A36
        --------------------+--------------
        B4   A41  A42  A43  | A44  A45  A46
        B5   A51  A52  A53  | A54  A55  A56
        B6   A61  A62  A63  | A64  A65  A66
                            |
               Processor 2  |  Processor 3
      

Each processor only needs space to store its portion of the matrix. To be efficient, each processor should compute the elements it "owns". So one complication will be figuring out which global matrix entries A(I,J) belong to a given processor, and where such an entry is stored. In other words, processor number 3 must know that it owns global matrix entry A(5,4), and that it should be stored in row 2, column 1 of its local storage.
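
For the simple one-block-per-processor layout shown above, this bookkeeping is just integer arithmetic. Here is a sketch, where PROW and PCOL are illustrative names for the owning processor's grid row and column, not ScaLAPACK variables:

        PROW  = ( I - 1 ) / NB     !  processor row that owns row I
        PCOL  = ( J - 1 ) / NB     !  processor column owning column J
        IPART = I - PROW * NB      !  local row index within APART
        JPART = J - PCOL * NB      !  local column index within BPART/APART

For N = 6 and NB = 3, the entry A(5,4) yields PROW = 1 and PCOL = 1, which is processor 3, and local indices (2,1), just as claimed.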

We will call the local storage arrays "APART" and "BPART", and we will work with two sets of indices: the global indices (I, J), which locate an entry in the full matrix, and the local indices (IPART, JPART), which locate that entry in the processor's own storage.

Assuming our local array APART is of dimensions NB by NB, and that our processor's location in the processor grid is (MYROW, MYCOL), then the following code will correctly compute the portion of A that any processor needs:


        do IPART = 1, NB
          I = MYROW*NB + IPART        !  Skip down past other processors' rows
          do JPART = 1, NB
            J = MYCOL*NB + JPART      !  Skip past other processors' columns
            APART(IPART,JPART) = min ( I, J )
          end do
        end do
      

Processors in processor column 0 also get a portion of the right hand side:


        if ( MYCOL == 0 ) then
          do IPART = 1, NB
            BPART(IPART) = 1.0
          end do
        end if
      

To get superior parallel performance, the matrix should actually be divided up into more and smaller blocks. The code here gets more complicated, but the idea is still simple: each processor gets a "chunk" of the data.
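
To illustrate the idea (this is a sketch, not code from the sample program), here is how the local part of A could be generated under a block cyclic distribution with square blocks of size MB. LOCROWS and LOCCOLS, the numbers of rows and columns owned by this processor, can be computed with ScaLAPACK's NUMROC utility:

        do IPART = 1, LOCROWS
          IBLK = ( IPART - 1 ) / MB     !  local block of this row
          I = ( IBLK * NPROW + MYROW ) * MB
          I = I + mod ( IPART - 1, MB ) + 1
          do JPART = 1, LOCCOLS
            JBLK = ( JPART - 1 ) / MB   !  local block of this column
            J = ( JBLK * NPCOL + MYCOL ) * MB
            J = J + mod ( JPART - 1, MB ) + 1
            APART(IPART,JPART) = min ( I, J )
          end do
        end do

Setting MB = NB recovers the one-block-per-processor layout used above.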

Calling ScaLAPACK:

Before calling ScaLAPACK, we have to report how the matrix was divided up. The routine DESCINIT packs up the relevant information into the arrays DESCA and DESCB:


        !  DESCA: A is N by N, in NB by NB blocks, starting on processor
        !  (0,0), in context ICTXT, with local leading dimension NB.
        call DESCINIT ( DESCA, N, N, NB, NB, 0, 0, ICTXT, NB, INFO )

        !  DESCB: B is N by 1, in NB by 1 blocks.
        call DESCINIT ( DESCB, N, 1, NB, 1, 0, 0, ICTXT, NB, INFO )
      
Now we can call the routine that actually does the computation. The pairs of arguments "1, 1" give the row and column, within the global matrix A and the right hand side B, at which the submatrices to be used begin; here we operate on the whole arrays:

        call PSGESV ( N, 1, APART, 1, 1, DESCA, IPIV, BPART, 1, 1, DESCB, INFO )
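
On return, the computed solution has overwritten BPART, distributed in the same way as B. As a minimal sketch (not part of the sample program), each processor in processor column 0 could report its share of the solution like this:

        if ( INFO == 0 .and. MYCOL == 0 ) then
          write ( *, * ) 'Processor row ', MYROW, ' owns X entries:'
          write ( *, * ) BPART(1:NB)
        end if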
      

Configuring and Releasing the Processors:

The code for managing the processors can be confusing, so it's useful to think of it in terms of what it's doing for you. Essentially, you have to choose how many processors you want, get a special identifier called the "context" that refers to this group of processors, arrange the processors in a rectangular grid, and tell each processor where it belongs in the grid. The code to do this might look as follows:


      NPROW = 2
      NPCOL = 2

      !  Get the default BLACS context.
      call BLACS_GET ( -1, 0, ICTXT )

      !  Arrange the processors in an NPROW by NPCOL grid,
      !  assigning them to grid positions in row-major order.
      call BLACS_GRIDINIT ( ICTXT, 'Row-major', NPROW, NPCOL )

      !  Ask for this processor's own position (MYROW, MYCOL) in the grid.
      call BLACS_GRIDINFO ( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
      
After these calls, each processor has received a unique pair of row and column indices in MYROW and MYCOL. Even though the same code is to be executed by all processors, each one knows "who" it is.
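
For instance, one simple check (not part of the sample program) is to have each processor announce its position:

      write ( *, * ) 'Hello from grid position ', MYROW, MYCOL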

Once you are done with ScaLAPACK, you release the processors with the calls


      call BLACS_GRIDEXIT ( ICTXT )

      call BLACS_EXIT ( 0 )
      

Using the Code:

To complete the program, we have to assemble the four steps discussed above into a file, and insert the appropriate data declarations at the head of the program. To see these details, refer to the text of the sample program. Assuming our program is stored in the file "myprog.f", we can compile and run it interactively on the T3E with the commands:


        f90 myprog.f
        mpprun -n 4 a.out &
      

The output will be something like:


       SCTEST1
          Solve A*x=b using ScaLAPACK PSGESV.

        STOP (PE 3)   executed at line 104 in Fortran routine 'SCTEST'
        STOP (PE 2)   executed at line 104 in Fortran routine 'SCTEST'
        STOP (PE 1)   executed at line 104 in Fortran routine 'SCTEST'

          PSGESV returned INFO =  0

        STOP (PE 0)   executed at line 104 in Fortran routine 'SCTEST'

       [1]    Done                          mpprun -n 4 a.out
      

Twiddling the Knobs:

We've kept this introduction light and superficial. We used simple declarations, and default values for many input arguments. However, it is important to note that ScaLAPACK provides a great deal of flexibility.

A program calling ScaLAPACK, if it is suitably parameterized, can handle many different sizes and arrangements of processors. This can help you find a suitable tradeoff between communication overhead and parallel speedup.
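
For example, rather than hard-coding NPROW = NPCOL = 2, the program could ask the BLACS how many processes it was given and then choose a nearly square grid. Here is a sketch, where IAM and NPROCS are illustrative variable names returned by the standard BLACS inquiry routine BLACS_PINFO:

      !  IAM is this process's number, NPROCS the total process count.
      call BLACS_PINFO ( IAM, NPROCS )

      !  Pick a nearly square NPROW by NPCOL grid that uses all processes.
      NPROW = int ( sqrt ( real ( NPROCS ) ) )
      do while ( mod ( NPROCS, NPROW ) /= 0 )
        NPROW = NPROW - 1
      end do
      NPCOL = NPROCS / NPROW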

ScaLAPACK can handle complex data as well as real. There are special routines to handle matrices that are banded or positive definite. You can also carry out eigenvector computations. You can divide your matrix into smaller blocks, rather than giving each processor a single contiguous block.

ScaLAPACK includes utility routines that allow you to easily compute row or column maxima or minima on your distributed matrix, or to explicitly copy blocks of data from one processor to others.

Efficiency:

It's possible to set up the full matrix on one processor, and then copy the appropriate portions to the others. However, this approach has significant costs: your program must allocate enough memory for the full matrix; the computation of the full matrix takes place on a single processor; and you must set up the transfer of each matrix block to the appropriate processor, and wait until these transfers are completed before proceeding. It is much better to figure out the portion of the matrix that will "belong" to each processor, and compute it locally.

In our example, we broke the full matrix A into very large chunks. ScaLAPACK allows us to break A into smaller blocks, which can improve load balancing. For instance, as Gaussian elimination proceeds, the beginning rows and columns of the matrix are no longer manipulated. If our matrix is distributed in large chunks, then many processors will fall idle, with nothing to do. Using smaller block sizes, we can be sure that most of the processors stay busy from the beginning to the end of the algorithm. If you appropriately parameterize your code, you can easily see how changing the block size affects performance.



Last revised on 15 September 2005.