Computing on the Cray YMP

A talk given to students at Wheeling Jesuit College, 25 October 1991, by John Burkardt, of the Pittsburgh Supercomputing Center.


The National Science Foundation Supercomputing Centers

In the early 80's, powerful new computers were built to solve scientific problems too hard for standard machines. The new computers were expensive and difficult to program. Universities did not buy them; big corporations did not buy them; only national laboratories, such as Los Alamos and Lawrence Livermore, could afford to pay for them, and to invest the staff time to get them up and running.

University researchers who needed the power of the new machines had two choices: get on the visiting staff of a national laboratory, or go overseas to CERN or one of the other research centers with a supercomputer. I specifically recall professors at the University of Pittsburgh and CMU who had summertime appointments at Los Alamos; and our own research group was only able to get access to a Cray supercomputer in Huntsville, Alabama, on weekday evenings, and only because we had a contract from the Air Force.

The National Science Foundation decided that the new machines were too important for researchers to have to beg for, and so they established several centers across the country, whose sole purpose was to "give away" time, and train researchers in using the new machines. The Pittsburgh Supercomputing Center is one of those places, along with centers at San Diego, Urbana/Champaign, and Cornell.

Any researcher in the United States can apply for time. The proposal must include information about why a supercomputer is needed and how the supercomputer will be used. The proposal will be reviewed, either by the center's staff, or by a peer review board. If approved, the researcher is granted the time outright. For free. Typical requests range from 10 to 1,000 hours of time per year. When we have to sell time (to industrial users) we charge roughly $1,000 an hour, so you can see that we're giving away a lot of money!

The NSF also wants to expose university classes to supercomputing, and the centers have a liberal program for making small amounts of time (5 or 10 hours) available to a class for this purpose. In this case, the proposal will simply state the name of the class, a syllabus, and the teacher's past experience. The entire class, or selected members, can then get accounts.

The Pittsburgh Supercomputing Center provides online help and documentation, a newsletter, training classes, telephone consulting, and other services as part of its mission. Our hardware includes both a Cray YMP and a Connection Machine with 32K processors.

The Cray YMP

The Cray YMP is a supercomputer. The crucial design feature was high speed, and NOT big memory.

If we consider raw computational speed as the important measure, then the Cray YMP can carry out 2.6 billion arithmetic operations per second, during which time a typical VAX computer can carry out 1 million such operations. This means a program could run for 8 years on such a VAX, or a day on a Cray.

In order to get high speed, Seymour Cray invented new hardware, developed new software techniques, made some hardware perform impossible things, and made many sacrifices.

I'd like to tell you why high speed is necessary, how it was achieved, and what you can do with it.

Scientific problems are getting much harder

One of the first things we'll notice on the Cray YMP is that there's not a lot of memory. Instead, a lot of effort has been put into making the processor faster. If I'm interested in running "bigger" problems, what should I be looking for: a lot more memory, or a lot faster processor?

The interesting thing to note is that for most problems, at least when run on a serial machine, the processing time has a polynomial relationship to the problem size, or in very hard problems, an exponential relationship.

It is very rare to have a linear relationship between data size and execution time. And for this reason, we will see that if we're solving the problem on one processor, doubling the speed is more helpful than doubling the memory.

As an example, consider matrix multiplication. To compute each element of the product requires the pairwise multiplication of a row of the first matrix by a column of the second. If the matrices are both square, of order N, then there are N*N entries in the product, and each entry requires N multiplications and N additions, making 2*N*N*N computations in all. So the computational time goes up cubically. The memory required is 3*N*N, so it goes up less drastically.

Although it's not so obvious, the problem of solving a linear system A*x=b has similar behavior: the time required is of order N*N*N, while the space required is N*N.
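For instance, a 1000 by 1000 system involves on the order of a billion arithmetic operations, performed on only a million words of data.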

Another example is a 3D weather grid. To predict the weather, we divide the world up into 3D boxes. To some extent, the accuracy of predictions depends on the length of a side of the box. So to double accuracy, we make the boxes half as long on a side. But that makes them one eighth as large, so we have eight times as many. We have variables associated with each box (temperature, pressure and so on). We may have to solve a linear system for the variables; if it were a dense linear system (it's not!) then having eight times as many variables would mean we'd have 8*8*8=512 times as much work! Here, asking for just twice as much accuracy causes an enormous increase in work.

If there's any doubt about how fast small problems can get hard, consider a network problem, such as the traveling salesman. We are given the distances between N cities, and asked to find the shortest round trip. The only obvious algorithm is simply to compute the length of every possible round trip. There are (N-2)! such round trips. Let me set up the punchline here by pointing out that the VAX can do 1 million and the Cray roughly 3 billion floating point operations per second, and that there are roughly 30 million seconds in a year, making 30 million million operations for the VAX and 90 million billion operations for the Cray, per year. If each trip requires N-1 additions, how big a problem can the VAX do in a year? The VAX could solve a problem with 17 cities. And the Cray? The Cray could solve a problem with 20 cities. It is unbelievable how hard this problem is! Actually, there are better methods than the brute force way, but the behavior of this problem suggests that computer users will never be satisfied with a given amount of speed.

The Cray Architecture

Physical Memory

A very prominent flaw of the Cray YMP is the limited amount of memory. Our model, the YMP 8-32, has 32 Megawords of memory. That memory must be shared by the operating system, as many as seven other running programs, and your own program's instructions and data. In practice, 3 or 4 Megawords is a more realistic figure for the memory your program can claim. Note that a dense 2000 by 2000 matrix requires 4 Megawords all by itself, leaving no space for the program that is supposed to do something with it!

Our VAX/VMS system actually has less physical memory than the Cray, but allows you to run very large jobs by using "virtual memory". The computer simply stores portions of the data and program out on disk, and simulates having more memory. Of course, the computer has to transfer that data into memory from disk when it is needed. But this means that a cheap computer can solve big problems. On paper, at least, the VAX can solve bigger problems than the Cray can.

So why did the Cray designers make this choice? Well, nobody said the VAX was fast. Did you ever try to use virtual memory? If you happen to need a piece of data that's been swapped out, there's a perceptible wait while other data is written out and this data is read in. Luckily, many programs minimize these "swaps", because data items are often used consecutively, and the VAX stores data in chunks or "pages" of perhaps 512 words. But a carelessly written program can cause almost every calculation to require a page swap, so that the program runs a thousand times more slowly.
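
For instance (a made-up illustration, assuming the array is too large to fit in physical memory), the following loop touches a new 512-word page on almost every assignment, because FORTRAN stores arrays by columns while the inner loop runs across a row:

          REAL A(512,512)

          DO I=1,512
            DO J=1,512
              A(I,J)=0.0
            END DO
          END DO

Reversing the two loops makes consecutive assignments fall on the same page.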

The Cray designers wanted to be able to guarantee that every data item in the entire memory could be addressed quickly. Moreover, they wanted to guarantee that if a "vector" of data was required, that one item of that vector could be delivered to the processor every clock cycle, with no excuses accepted. If they included virtual memory, they would have had unpredictable data delivery. They chose to keep the memory entirely physical, very fast, and relatively small (since fast memory is expensive!).

New models of the Cray are already available with 128 Megawords of memory, and models with 512 Megawords are planned. But clearly, adding memory to a Cray is not simply a matter of buying more memory chips. The processor must be guaranteed that it can reach data on the added memory chips as fast as on the old ones.

Naturally, user programs immediately hit the memory ceiling. There are many routines that have been written to allow Cray users to solve large problems with small memory. Usually, this is done by opening up a scratch file and writing intermediate results there. This is similar to the VAX's virtual memory. The difference is that the author of the routine knows what data is being moved between disk and memory, and writes the routine to minimize that movement.
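
To give the flavor, here is a minimal sketch of the idea (the file handling details and the helper GETCOL are invented for this illustration, and the units of RECL are machine dependent): the program keeps just one column of a large matrix in memory, parking the rest in an unformatted, direct access scratch file.

          REAL COL(2000)

          OPEN(UNIT=1,STATUS='SCRATCH',FORM='UNFORMATTED',
     *      ACCESS='DIRECT',RECL=2000)

C  Compute the matrix one column at a time, writing column J
C  to record J of the scratch file.
          DO J=1,2000
            CALL GETCOL(J,COL)
            WRITE(1,REC=J)COL
          END DO

C  Later, read back just the column needed for the current step.
          READ(1,REC=17)COL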

Memory Banks

Even though the memory is all physical, it's still too slow for the processor. One design goal of the Cray YMP memory was that it be able to deliver one data item per clock cycle. The problem is that the clock cycle is so short that the memory can NOT do this. The chips used for Cray memory are very fast, but after delivering one data item, they need about four clock cycles to recover. That's not acceptable.

Here's where a compromise was made with the design. It was decided that it was acceptable to lower the requirements; data must still be delivered one every clock cycle, if accessed sequentially; other cases would be negotiable.

Therefore, the designers decided to build one memory out of subunits. Essentially, the memory for a running job was spread across eight separate memory banks. The successive elements of an array were in separate banks. Thus, if I asked for X(1) through X(N), memory bank 1 got enough time to "rest" after delivering X(1) before it had to deliver X(9), and so on.

The problem, of course, occurs because in many cases a vector might be accessed with a jump of 2, 4, or 8, or a two-dimensional array might be accessed across a row rather than down a column. In both cases, it is possible for a memory bank to be asked for data before it is ready. This is not catastrophic; however, this does mean that data transfer to the processor will decrease, and hence the rate of computation will drop.
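
For example, a loop like the following (a made-up illustration) asks for X(1), X(9), X(17), and so on; every one of those entries lives in memory bank 1, so each request finds the bank still recovering from the last one:

          DO I=1,N,8
            S=S+X(I)
          END DO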

Ports from Memory to Processors

Another feature of the Cray is the ports between memory and the processors.

A simple computer would have a single port, over which a single data item could be read from memory, or written to memory. Once that transfer had completed, another could begin.

The Cray has three ports for each processor. Two ports are devoted to reading from memory, and one to writing to memory. This is a reflection of the fact that in most computations, several data items are input to a formula for which a single data item is output.

Moreover, if a vector of data is being read or written, then the port can read or write one new item of data each clock cycle, even though each transfer takes significantly longer than one clock cycle.

Just as we have seen how multiple memory banks are used in order to squeeze faster performance out of memory, so the actual port over which the data travels is "pipelined" or "overloaded" or somehow made to perform more work than we would expect it to be able to do.

Registers

We've already mentioned that the transfer of information from memory to the processor is "slow". For this reason, the Cray processors have a number of registers available, some devoted to scalars, and others to vectors.

Vector data travels from memory to a vector register. There it waits to be operated upon, perhaps to be added to another vector, for instance. The result of this operation is stored in another register, which may then be written to memory. However, if the compiler is clever, the data, or even new results, may be re-used several times before being written to memory. Every time that the compiler can avoid writing the data out may represent a significant speedup in execution. Consider the following loop:

          DO I=1,N
            EX=EXP(-4.0*X(I))
            DUDT(I)=DUXX(I) + (X(I)*T-16*V(I)*U(I)) + X(I)*EX
            DVDT(I)=DVXX(I) + (U(I)-4*X(I)**2) - 10*EX
          END DO
    

First of all, the compiler is clearly going to need to fetch the vectors X, U, V, DUXX and DVXX. It will need to write out the values of two more vectors, DUDT and DVDT.

It would also be nice if it could create some "temporary" vectors, in particular, if the scalar variable EX can be treated as a vector, then the whole loop is a statement about vectors (and scalars T, 16, 4 and 10, whose values are already known, and sitting in a scalar register).

Because we have vector registers, we don't have to get a copy of the X vector for the DUDT calculation, and then get that copy again for the DVDT calculation.

We DON'T have lots of vector registers, so as the formulas get more complicated, we will find the Cray having to put stuff out in memory that it would prefer to keep in a register. If the calculation takes long enough because it's complicated, however, it may be possible that the Cray can shuffle data back and forth without missing a beat!

Multiple Processors

The Cray YMP is descended from serial computers. A serial computer has a massive amount of data, an enormous program, and lets most of the data and program sit idly while it processes a single instruction which manipulates a couple of data items. The fact that most information in a computer spends most of its lifetime waiting for the processor to get to it is known as the "Von Neumann bottleneck".

Some machines, such as the Connection Machine, deal with this problem by using thousands of simple, slow, cheap processors. The Cray YMP tries to deal with this bottleneck by having eight complicated, fast, expensive processors.

Each single processor of the Cray runs about 10 times faster than a VAX processor, disregarding all the other "tricks" we're going to look at.

Each processor can behave independently, running a separate program, or can cooperate with other processors in running a single program. Theoretically, a program running on the Cray using all the processors would run as much as 8 times faster, simply because of the 8 processors.

Just having 8 fast processors has already given us a speedup of 80 over a VAX, with roughly a further factor of 30 to explain. However, some of the items we will be discussing aren't going to add to our speedup, but will simply allow other parts of the computer to keep up with the processor.

Note that we start with 32 Megawords of physical memory, allow about 5 Megawords for the operating system, leaving 27 Megawords available. This memory is divided among the 8 processors. If one processor is running a 20 Megaword job, for example, the other seven processors are forced to share 7 Megawords. Several of the processors may sit idle, waiting for memory to be freed up. So having eight independent processors on a Cray YMP is NOT necessarily eight times as good as having one processor!

For most of our discussion, we will ignore the fact that there are eight processors.

Functional Units

The "ideal" computer has a central processing unit which picks up one program instruction and carries it out completely before moving on to the next one. An arithmetic computation like

      X = Y + 17*(Z/W)
    
would require the same CPU to divide, multiply and add. This meant the CPU had to be quite talented and versatile. It also meant that exactly one thing went on at a time.

The Cray processor has been broken up into a number of cooperating "functional units". There is still a central processor, but it is more interested in "conducting" the calculation. The actual work is parceled out among the various units. Each unit has a speciality; in particular, there are floating point add, multiply, and reciprocal units. These units do the majority of the work in a scientific calculation. Other units that make up the processor include an instruction unit, which fetches the next instruction, several logical operation units, and some integer arithmetic units.

At any time, all of the functional units could be busy computing, or perhaps only one. A program which can get more than one functional unit working at the same time will execute more quickly.

Because each functional unit has a different sort of task, the time required for a result varies from unit to unit. In particular, the add and multiply units take 5 cycles for a complete result, and the reciprocal takes 7. We will see in a moment how the functional units were designed to deliver a new result every cycle, rather than every 5 or 7 cycles.

Pipelining

We've seen how the CPU was broken up into individual units to allow multiple operations to go on at the same time. Now we will see how these individual units were further subdivided to allow more opportunity for speedup.

The goal is to design each functional unit so that it can eventually compute one new result every cycle. In order to do this, the algorithm for the operation to be carried out is broken up into individual steps that can be executed in one cycle. Then each step is carried out by a segment of the functional unit, and the unfinished data passed on to the next segment.

For instance, in order to add two quantities we have to determine which is the larger in magnitude, shift the mantissa of the smaller so that the two share the same exponent, add the mantissas, and normalize the result.
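
As a made-up decimal illustration, to add 1.5 x 2**4 and 1.25 x 2**2: compare the exponents (4 versus 2); rewrite the smaller quantity as 0.3125 x 2**4; add the mantissas to get 1.8125 x 2**4; and normalize (no shift is needed here). Each of those steps becomes one segment of the pipeline, and at any moment a different pair of numbers can occupy each segment.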

Chaining

Suppose we have an operation such as the following:

            DO I=1,N
              X(I)=X(I)+FACTOR*Y(I)
            END DO
    

The computation of the right hand side requires two arithmetic operations. The compiler is in charge of deciding how this will be done. The natural way to do this would be to compute FACTOR*Y(I) as a temporary result, and then add that result to X(I).

How long must this take? Well, if we do many multiplications, then each one will take roughly one clock cycle. If we then do the additions, those will take roughly one clock cycle, making a total of two clock cycles per result.

We've already seen that we can do the multiplications very fast, producing one result per clock cycle. These results will go into a register. You might expect that we would have to complete the entire multiplication before moving on to the addition.

But do we really have to wait? The multiplier is busy, but the adder is free. It turns out that the compiler will notice this, and essentially feed the output of the multiplication operation directly to the adder. This means that "once the pipe fills up" we can produce one result of this computation every clock cycle as well!

This kind of cooperation between functional units, which essentially creates a larger functional unit, is called "chaining". Notice that we could NOT have done this if the line read

            X(I)=X(I)+FACTOR+Y(I)
    
since the adder is busy doing (FACTOR+Y(I)), and cannot add X(I) at the same time. The key here was that the adder was available.

        Without chaining  (Time=ADD+MULTIPLY+2*(N-1))

        FACTOR*Y(1)
          FACTOR*Y(2)
            FACTOR*Y(3)
              ...
                FACTOR*Y(N)
                           X(1)+FACTOR*Y(1)
                             X(2)+FACTOR*Y(2)
                               X(3)+FACTOR*Y(3)
                                 ...
                                 X(N)+FACTOR*Y(N)
    
        With chaining:  (Time=ADD+MULTIPLY+N-1)

        FACTOR*Y(1)
          X(1)+FACTOR*Y(1)
          FACTOR*Y(2)
            X(2)+FACTOR*Y(2)
            FACTOR*Y(3)
              X(3)+FACTOR*Y(3)
              ...
                FACTOR*Y(N)
                  X(N)+FACTOR*Y(N)
    

Vectorization

Your best friend on the Cray is the compiler. It will do amazing things for you.

The extra speed factor of the Cray comes from vectorization, that is, the ability to carry out the same operations on a large set of data.

The speedup can only occur at points in your program where such an operation actually arises. There, the proper machine instructions are needed to make the data "stream" into the processor, and to orchestrate the use of the registers and the functional units.

The compiler examines your program searching for such opportunities.

If it finds a suitable DO loop, and believes it understands it, it issues special instructions telling the Cray how this computation will be "played". A DO loop which allows this kind of processing is said to be "vectorized".

How the Cray Architecture speeds things up

We've talked about a lot of features of the Cray's architecture that are intended to speed up computations. Let's go through one simple example and try to get a feeling for how some of these features cooperate to make possible and sustain a high rate of performance.

The example we will look at will be the same one we discussed a minute ago with "chaining". That is, we are going to add a multiple of one vector to another. This is a frequent operation in computation, occurring for instance during the solution of a system of linear equations.

The FORTRAN statements involved are very simple:

            DO I=1,N
              X(I)=X(I)+FACTOR*Y(I)
            END DO
    

How would a simple computer handle this computation? Well, I would expect the simple computer to get FACTOR in from memory, and keep it there. Then, I would expect that each operation, arithmetic or I/O, would happen one at a time. I'll ignore the fact that the DO loop index must be incremented and checked on each step. In that case, we have

      2 Reads
      1 Multiply
      1 Add
      1 Write
    
for a total of 5 operations per result, and a total of 5*N operations. If we assume the operations take the same amount of time T1, (they probably don't!) then we can write the cost of this computation as 5*N*T1.

What happens on the Cray?

      FACTOR is stored in the processor.  

      Y(1) is requested from memory.

      Y(2) is requested,
      X(1) is requested

      Y(3) is requested and so on.
      X(2) is requested and so on.

      ...

      Y(1) arrives in register 1 on the 17-th cycle

      Y(2) arrives in register 1 and so on.
      X(1) arrives in register 2 and so on.
      Y(1) begins step 1 of multiplication by FACTOR.

      Y(2) begins step 1 of multiplication by FACTOR.
      Y(1) begins step 2 of multiplication by FACTOR.

      Y(3) begins step 1 of multiplication by FACTOR.
      Y(2) begins step 2 of multiplication by FACTOR.
      Y(1) begins step 3 of multiplication by FACTOR.

      Y(4) begins step 1 of multiplication by FACTOR.
      Y(3) begins step 2 of multiplication by FACTOR.
      Y(2) begins step 3 of multiplication by FACTOR.
      Y(1) begins step 4 of multiplication by FACTOR.

      Y(5) begins step 1 of multiplication by FACTOR.
      Y(4) begins step 2 of multiplication by FACTOR.
      Y(3) begins step 3 of multiplication by FACTOR.
      Y(2) begins step 4 of multiplication by FACTOR.
      Y(1) begins step 5 of multiplication by FACTOR, goes to register 3.

      Y(6) begins step 1 of multiplication by FACTOR.
      Y(5) begins step 2 of multiplication by FACTOR.
      Y(4) begins step 3 of multiplication by FACTOR.
      Y(3) begins step 4 of multiplication by FACTOR.
      Y(2) begins step 5 of multiplication by FACTOR, goes to register 3 and so on.
      Y(1) begins step 1 of addition to X(1).

      Y(2) begins step 1 of addition to X(2).
      Y(1) begins step 2 of addition to X(1).

      Y(3) begins step 1 of addition to X(3).
      Y(2) begins step 2 of addition to X(2).
      Y(1) begins step 3 of addition to X(1).

      Y(4) begins step 1 of addition to X(4).
      Y(3) begins step 2 of addition to X(3).
      Y(2) begins step 3 of addition to X(2).
      Y(1) begins step 4 of addition to X(1).

      Y(5) begins step 1 of addition to X(5).
      Y(4) begins step 2 of addition to X(4).
      Y(3) begins step 3 of addition to X(3).
      Y(2) begins step 4 of addition to X(2).
      Y(1) begins step 5 of addition to X(1), and is sent to memory.

      N-1 cycles later, the last result is computed.
      17 cycles from this point, the last result arrives in memory.
    

Now, there is a startup time in this loop of about 17 cycles, during which no computation occurs. Then there are 5 cycles to get the first multiplication done, and 5 more before the first addition is complete. Then there are N cycles, during which the Cray is computing one new result per cycle. Finally, there is a shut down time of roughly 17 cycles, as the last result gets back to memory. The execution time for one computation is 17 + 5 + 5 + 17 clock cycles, and the execution time for N computations is 17 + 5 + 5 + 17 + N-1.

Notice how we needed two ports to memory, and rapid access to the entire vectors X and Y. We needed registers to store vectors of intermediate results. We were able to "feed" the adder and the multiplier every step. The results of the multiplier went straight into the adder.

Algorithms

The Cray is a general purpose computer. That means, for instance, that it will run just about any legal program in FORTRAN or C. However, the Cray will not run all such programs with equal efficiency.

As far as the Cray is concerned, an algorithm is likely to run well if: most of its work is done in DO loops whose iteration counts are large; the arithmetic in those loops mixes operations, such as multiplies and adds, that can chain; the data is accessed contiguously, or with a small constant stride; and each iteration is independent of the results of earlier ones.

Matrix Multiplication

An "ideal" algorithm for the Cray is matrix multiplication. The algorithm can be written:

          DO J=1,N
            DO I=1,N
              A(I,J)=0.0
            END DO
          END DO

          DO K=1,N
            DO J=1,N
              DO I=1,N
                A(I,J)=A(I,J)+B(I,K)*C(K,J)
              END DO
            END DO
          END DO
    

The inner loops have a "large" index of N (assuming N is large!). The multiplication loop has the same "multiply and add" combination that will allow chaining. We only need two ports to memory, since, in the inner loop, C(K,J) behaves like a constant. The code, as written above, executes at 1/2 the top speed of the Cray. (That's good!)

The only problem with this algorithm is the large amount of repeated I/O. Every entry of B and C is read into the processor N times. There are ways of reducing this memory traffic.
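
One such trick (just a sketch, assuming N is even) is to unroll the J loop, so that each element B(I,K) fetched from memory is used for two columns of the product before it is discarded, cutting the reads of B in half:

          DO K=1,N
            DO J=1,N,2
              DO I=1,N
                A(I,J)=A(I,J)+B(I,K)*C(K,J)
                A(I,J+1)=A(I,J+1)+B(I,K)*C(K,J+1)
              END DO
            END DO
          END DO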

Gaussian Elimination of a Dense Matrix

Similar good performance can be gotten from a Gaussian elimination routine. Here, the important part of the algorithm looks something like this:

          DO K=1,N-1

            DO I=K+1,N
              A(I,K)=A(I,K)/A(K,K)             <-- Compute multipliers
            END DO

            DO J=K+1,N
              DO I=K+1,N
                A(I,J)=A(I,J)-A(I,K)*A(K,J)    <-- Subtract a multiple of row K from row I
              END DO
            END DO

          END DO
    

Here, the size of both inner loops varies between 1 and N, so that we can't say that the loop size stays large. But for a large matrix, the loop size will be large "most of the time". The division will execute slowly. But the next loop is another example of adding a multiple of one vector to another, and will chain, and will execute quickly.

Tridiagonal matrix solving

Solving a banded linear system can be worse, or awful on a Cray, because the inner DO loop no longer has a varying range that is "usually" large, but instead is always small.

About the worst case is a tridiagonal system. Roughly speaking, the algorithm can be almost identical, except that the inner DO I loop now runs from I=K+1 to I=K+1, a single iteration. You actually get a slowdown if you try to "vectorize" this loop, because of the extra overhead that vectorization carries (which I won't go into here).

What's the solution?

Well, you can always try to find a machine that handles this algorithm better, but you're much more likely to benefit if you find a different algorithm. In fact, at least for positive definite tridiagonal matrices, there's an algorithm called "cyclic reduction" which executes much more quickly on the Cray.

A tridiagonal system has some very special properties that will allow us to carry this operation out. Let's suppose we start out with 7 equations:

      a11*x1 + a12*x2                                              = y1
      a21*x1 + a22*x2 + a23*x3                                     = y2
               a32*x2 + a33*x3 + a34*x4                            = y3
                        a43*x3 + a44*x4 + a45*x5                   = y4
                                 a54*x4 + a55*x5 + a56*x6          = y5
                                          a65*x5 + a66*x6 + a67*x7 = y6
                                                   a76*x6 + a77*x7 = y7
    

You should see that we can use the first equation to eliminate the coefficient a21 in the second equation. We can also use the third equation to eliminate the coefficient a23 in the second equation. This knocks out variables x1 and x3 in the second equation, but adds x4 into that equation.

By the same method, we can eliminate x3 and x5 from the equation for x4, and so on. If we do this, eliminating the odd variables from the even equations, we end up with a smaller tridiagonal system, with half the number of equations, and with half the number of variables.

If we carry out the same operation repeatedly, we eventually end up with a single equation in a single variable. In our example, if we started with 7 equations, this variable would be x4. This we can solve!

We can plug x4 into the two previous equations for x2 and x6 and get them. Then we plug those values into the four previous equations for x1, x3, x5 and x7 and we're done.

Now the interesting thing about the new algorithm is that there's actually slightly more work, but it's organized into DO loops whose length depends on how far along we are: N, N/2, N/4, and so on. But at least they are almost always LONG DO loops, rather than short ones. And simple versions of this algorithm execute 10 times faster than the standard algorithm.
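
To give the flavor, here is a sketch of one reduction sweep, with the three bands stored in vectors SUB, DIAG and SUP, and the right hand side in RHS (the names and storage scheme are invented for this illustration). Only the even equations are modified, and the odd equations are merely read, so the iterations are independent and the loop vectorizes:

C  Eliminate the odd variables from each even equation I,
C  using equations I-1 and I+1.
          DO I=2,N-1,2
            ALPHA=-SUB(I)/DIAG(I-1)
            BETA=-SUP(I)/DIAG(I+1)
            DIAG(I)=DIAG(I)+ALPHA*SUP(I-1)+BETA*SUB(I+1)
            RHS(I)=RHS(I)+ALPHA*RHS(I-1)+BETA*RHS(I+1)
            SUB(I)=ALPHA*SUB(I-1)
            SUP(I)=BETA*SUP(I+1)
          END DO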

Simple Programming Tips: Ten FORTRAN DO loops

In the bad old days, if you wanted to write in a high level language like FORTRAN, you had to work very hard to get good performance. Except for very simple cases, the FORTRAN compiler could not translate your program into machine instructions for vector processing.

These days, the compiler is actually very good. Most FORTRAN programs can be transferred to the Cray intact, and run with few changes, while getting good performance.

However, there are still some ways of writing FORTRAN programs that confuse the compiler, and keep it from generating fast code. A programmer who is aware of this can design programs that use equivalent, but "safe", features the compiler can handle.

To give you an idea of the level at which problems occur, I present a series of DO loops.

The subroutine call

Consider the following two fragments of a program:

          DO I=1,N
            CALL SCALE(X(I),XMAX,XMIN)
          END DO

          SUBROUTINE SCALE(X,XMAX,XMIN)
          X=(X-XMIN)/(XMAX-XMIN)
          RETURN
          END
    

The DO loop calls SCALE repeatedly, and assuming XMIN and XMAX are what they look like, the values of X will end up lying between 0 and 1. The Cray compiler complains about this loop. The compiler can only see one program unit at a time, so it can't tell whether it's safe to vectorize the DO loop, because the loop calls a subroutine whose text the compiler cannot examine.

There are two choices: we can include the text of the subroutine in the DO loop, or copy the DO loop into the subroutine, whichever makes more sense:

          DO I=1,N
            X(I)=(X(I)-XMIN)/(XMAX-XMIN)
          END DO
      
or
          CALL SCALE(N,X,XMAX,XMIN)

          SUBROUTINE SCALE(N,X,XMAX,XMIN)

            DO I=1,N
              X(I)=(X(I)-XMIN)/(XMAX-XMIN)
            END DO

            RETURN
          END
      

Nested DO Loops

The magic of vectorization really only occurs in "spurts", that is, when your program reaches a DO loop. Now, in fact, if you have several nested DO loops, the order of nesting has some influence on the speed of execution.

The Cray only vectorizes the innermost loop. Statements in the outer loops are not vectorized, and there is a slight "pause" each time the innermost loop is completed. For instance:

          DO I=1,M
            A(I)=SCALE*Y(I)
            DO J=1,N
              A(I)=A(I)+B(I,J)*X(J)
            END DO
          END DO
    

The first thing to note is that the initialization of A(I) does not happen in a vectorized way. If we wanted that to occur, we could do that in a separate loop. In fact, let's plan to do that, because we're going to need that flexibility shortly!

Now, let's imagine that there are two FORTRAN statements that represent the startup and cleanup activities (which take time!) for the loop:

          DO I=1,M
            A(I)=SCALE*Y(I)
          END DO

          DO I=1,M

            CALL STARTUP

            DO J=1,N
              A(I)=A(I)+B(I,J)*X(J)
            END DO

            CALL CLEANUP

          END DO
    

It should be clear that as this loop stands now, we could interchange the I and the J loops. Is there any reason to prefer one over the other? Well, the choice is between M or N calls to startup and cleanup. If we're dealing with a square matrix, fine. If we're dealing with a very rectangular matrix, of size 6000 by 8, for instance, we definitely want the 6000 value to be the limit on the inner loop.
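
With the initialization split out, the interchanged version looks like this; for the 6000 by 8 matrix, the startup and cleanup costs are now paid N=8 times instead of M=6000 times. As a bonus, the inner loop now runs down the columns of B, which (as we will see shortly) is also the friendly direction for the memory banks:

          DO I=1,M
            A(I)=SCALE*Y(I)
          END DO

          DO J=1,N
            DO I=1,M
              A(I)=A(I)+B(I,J)*X(J)
            END DO
          END DO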

The Cray can handle some logical tests

We've seen the Cray can vectorize arithmetic in the inner loop just fine. However, what happens when we need to control that arithmetic, with, for instance, a logical test?

          DO I = 1, N

            IF(X(I).NE.0.0)THEN
              Y(I) = 1.0/X(I)
            ELSE
              Y(I) = SQRT(W(I))
            END IF

          END DO
    

The compiler has to see whether it can deal with this code. It examines the IF test, which is a simple test on vector values, and the code that is executed in either case, which by itself would not cause problems.

The compiler figures that it will try to carry out this computation by doing both branches of the IF statement; that is, it will compute a temporary vector of 1/X(I), and it will compute another of SQRT(W(I)), and then it will merge them into Y(I), depending on the value of X(I).

Now you should see a potential problem here; the whole point of having the IF statement was to avoid dividing by zero! Luckily, the Cray will only abort the program if YOU divide by zero, that is, if it's clear that you want to assign 1/0 as the value of a real program variable.

On the other hand, we didn't check for W(I) being nonnegative. If X(17) is zero, and W(17) is -4, then we will get a serious error, but that would have happened on a scalar computer too.

By the way, this loop is handled in a "vectorized" way, but did you notice it contained a call to a function? Luckily, it's the built-in SQRT function, and the Cray can pipeline that computation in a similar way to addition, multiplication, and division. (It's slower than those basic operations, though!)

The Cray can handle accumulating variables

A very natural operation to perform on vectors is to find the sum of the entries. The code for this operation would look like:

          SUM=0.0
          DO I=1,N
            SUM=SUM+X(I)
          END DO
    

Now, it's clear that this operation provides a challenge not posed by a line reading:

            Y(I)=Y(I)+X(I)
    

On the first step, it's easy to begin adding X(1) to SUM. But one clock cycle later, we aren't finished with adding X(1), and we want to add X(2). And this problem just gets worse.

Logically, then, shouldn't this loop take a whole lot longer? No, that's not the case. The Cray has a clever method for setting up some temporary variables, and using them for partial sums. TEMP(1) has X(1) added to it, TEMP(2) has X(2) added, and so on up to TEMP(8). Then X(9) is added to TEMP(1), which is certainly free again, and so on. At the end of N steps, we just have to do 8 more additions and we're done.
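
Here is a sketch of the idea in FORTRAN (assuming, to keep it short, that N is a multiple of 8; the compiler's actual code uses the vector hardware, not an explicit loop like this):

          REAL TEMP(8)

          DO J=1,8
            TEMP(J)=0.0
          END DO

C  Eight independent partial sums; a new addition can start
C  every clock cycle.
          DO I=1,N,8
            DO J=1,8
              TEMP(J)=TEMP(J)+X(I+J-1)
            END DO
          END DO

C  Combine the eight partial sums.
          SUM=0.0
          DO J=1,8
            SUM=SUM+TEMP(J)
          END DO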

Just to make sure that I'm not utterly wrong, I compared these loops. I timed them for N=1000 and for N=1,000,000:

             N   SUM=SUM+X(I)  Y(I)=Y(I)+X(I)

         1,000    0.000011      0.000012

     1,000,000    0.006454      0.010444
    

We first see that the SUM operation is actually faster than the vector addition, for "small" and "large" values of N. I can't fully explain that extra speed, though one plausible factor is memory traffic: the SUM loop reads one vector and writes back almost nothing, while the vector addition reads two vectors and writes one.

Similar magic allows you to do a simple MAX or MIN operation on a vector in very fast time (0.012700 seconds for 1,000,000 entries).

I should point out that the Cray should be able to do 1,000,000 additions in 0.006 seconds, and our SUM operation is practically at that limit.

Trying to cut overhead by loop unrolling

Loop unrolling is a simple technique that is sometimes a cheap way to squeeze more performance out of a program.

Loop unrolling essentially rewrites a DO loop, making it repeat half as often, and do twice as much. (Or three times, or four times...).

For instance, here's what one of our previous loops looks like, unrolled:

          DO I=1,N,2
            Y(I)=Y(I)+X(I)
            Y(I+1)=Y(I+1)+X(I+1)
          END DO
    

Now (assuming N is even) this is the same as the previous version, and it's the same as writing a program that starts out

      Y(1)=Y(1)+X(1)
    

and just keeps going, one statement at a time. The whole purpose of DO loops is to allow us to shorten our programs, not lengthen and complicate them!

Before I explain, let's make sure that unrolling can help us in this problem. If I set up the unrolled loop, and time it for 1,000,000 entries, the time drops to 0.008380 seconds. So what is going on?

There are two things I can point to. First, the loop control overhead, the incrementing and testing of I, is paid half as often. Second, each pass through the loop now offers two independent statements, which gives the compiler more work to schedule among the functional units at any moment.

There are other reasons for using unrolling, which we haven't talked about here, involving reducing the number of times data is moved back and forth between memory and the processor.

Memory Bank Conflicts

Remember that I mentioned that the Cray uses a set of 8 memory banks to store data? The reason was that a single memory bank could not supply data quickly enough to keep the processor fed. This works well most of the time.

Well, you can see how it fails by looking at the following loop:

          REAL A(80), B(80,80), C(80)

          DO I=1,N
            A(I)=0.0
            DO J=1,N
              A(I)=A(I)+B(I,J)*C(J)
            END DO
          END DO
    

If we run this loop on the Cray, what are our first memory requests?

      A(1), B(1,1) and C(1)
      A(2), B(1,2) and C(2)
      A(3), B(1,3) and C(3)
    

and so on. Now, where is this data stored? Since A is a vector, it's easy to describe how FORTRAN arranges it, and where the Cray puts it. We can assume A(1) is in memory bank 1, and so on up to A(8), with A(9) in memory bank 1 again. That means that the request for A(2) would not interfere with the request for A(1). The same is true for C.

Alas, B is a two dimensional array, and the story is more complicated. FORTRAN arranges B as a very long list, comprising the first column, followed by the second column, and so on. So the order is

      B(1,1), B(2,1), B(3,1), ..., B(80,1), B(1,2), B(2,2), ..., B(80,2), B(1,3)
    

and so on. Now, if this is how FORTRAN arranges the data, and if the Cray stores the data across memory banks in groups of 8, can we tell what our memory requests in the above loop are doing?

      B(1,1) is the first list entry, and so is in the 1-st bank.
      B(1,2) is the 81-st list entry, and so is in the 1-st bank.
      B(1,3) is the 161-st list entry, and so is in the 1-st bank.
    

In fact, as long as the first index is 1, the entries lie in bank 1, and so we are repeatedly asking for entries in the same memory bank. This causes the Cray to slow down by a factor of 4 or more!

The warning signs are an inner loop that runs across the rows of a two-dimensional array rather than down the columns, and an array whose declared first dimension is a multiple of 8; more generally, any vector accessed with a stride that is a multiple of the number of memory banks spells trouble.

There are two possible fixes to this problem:

          REAL A(80), B(80,80), C(80)

          DO I=1,N
            A(I)=0.0
          END DO

          DO J=1,N
            DO I=1,N
              A(I)=A(I)+B(I,J)*C(J)
            END DO
          END DO
    
or
          REAL A(80), B(81,80), C(80)

          DO I=1,N
            A(I)=0.0
            DO J=1,N
              A(I)=A(I)+B(I,J)*C(J)
            END DO
          END DO
    

Can you see the change in the second example, and can you say why it helped?

The problem of recursive data

In order for a loop to vectorize, the Cray has to be able to begin some iterations before it has finished previous ones. Many algorithms are written this way, and the DO loops implementing them can be vectorized.

But it's clearly possible for an algorithm to be unsuitable for this kind of treatment. The telltale symptom is when a vector entry appears on the left hand side on one iteration, and on the right hand side on a later iteration. That means a value needed for a later calculation isn't ready when the DO loop begins.

          DO I=2,N
            A(I)=A(I)+A(I-1)
          END DO
    

When the DO loop starts, let's say that A contains (1, 2, 3, ..., 10). The computation of A(2) proceeds just fine: A(2)=A(2)+A(1). But the computation of A(3) starts before A(2) has been computed. So it's going to add A(3)+A(2), using the old value of A(2). The results of this computation will be (1,3,5,7,...,19). But, of course, if we carry out the computation correctly, the answer is (1,3,6,10,15,21,28,36,45,55).

The problem occurs because on the I-th iteration we need the value of A(I-1), which we just began to modify on the (I-1)-th iteration, and haven't finished computing yet.

This is an example of data "recursion". For our purposes, that simply means that one DO loop iteration's computations depend on the results of a previous one. Except for some simple cases (such as the SUM example, earlier), this condition makes it impossible to carry out the "simultaneous" vector computation that the Cray is so good at.

The Cray can handle scattered data

Much of the Cray's power depends on being able to handle vectors. It's not necessary that the vector involved in the DO loop have a simple indexing scheme. In the following formula, all of the elements on the right hand side will be treated as vectors.

          DO J=1,N
            DO I=1,N
              X(I)=A(I,J)+A(J,I)   <-- rows or columns of an array
              Y(I)=A(I,I)+7.0      <-- diagonals of an array
              Z(I)=W(2*I+3)        <-- a vector index that's not I
            END DO
          END DO
    

The Cray can easily treat items as a vector as long as they are either contiguous, or evenly spaced apart, which is true of the entries of a row of an array, or of vector indexing of the form m*I+n.

In fact, there's another way that the Cray can handle data as vectors. There are occasions when we want to be able to select an arbitrary group of elements from an array. This is normally done by using an "index" vector, containing a list of the entries of interest. To sum all the "interesting" entries, we might use a loop like this:

          SUM=0.0
          DO I=1,NUM
            SUM=SUM+A(INDX(I))
          END DO
    

Now the entries of A used in this loop are NOT evenly separated. There is a special feature on the Cray called the "gather/scatter" hardware, which allows it to "gather" the requested values of A from memory almost as fast as if the entries were contiguous. A similar feature allows the indexed variable to appear on the left hand side.

In case you're starting to think the Cray can do anything, consider the problems that can occur with the following loop:

          DO I=1,NUM
            A(INDX(I))=A(INDX(I))+1
          END DO
    

Here, the indexed variable appears on both sides of the equation. The Cray will normally refuse to vectorize this loop unless you "reassure" it. Why? Suppose that two values of INDX are the same. For instance, suppose we have

      NUM = 4,  INDX = (1, 3, 3, 8),  A = (1, 2, 3, 4, 5, 6, 7, 8)
    

Serially, what will be the result?

      A = (2, 2, 5, 4, 5, 6, 7, 9)
    

But if vectorized, there's actually a temporary vector that looks like this:

      ATEMP = (1, 3, 3, 8), which results in ATEMP = (2, 4, 4, 9)
    

and then we have the embarrassing problem of trying to store one of the two "4" values back into A, when neither is the right answer.

This is another example which cannot be treated as a vector process, because the operations in the DO loop are not really independent!
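
If you happen to know that the entries of INDX are all distinct, the "reassurance" takes the form of a compiler directive placed just before the loop; in Cray FORTRAN the directive is called IVDEP, and it looks like this:

CDIR$ IVDEP
          DO I=1,NUM
            A(INDX(I))=A(INDX(I))+1
          END DO

The directive is your promise to the compiler that no two iterations touch the same element; if the promise is false, you get wrong answers, exactly as in the example above.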

GO TO Statements

As the logic of the DO loop gets more complicated, it becomes more likely that the compiler can't or won't vectorize the loop.

GO TO statements are very likely to seriously disrupt the normal flow of computation, whether they are conditional or unconditional.

Here's an example that the Cray can vectorize:

          DO I=1,N
            IF(TEMP(I).LT.32.15)GO TO 10
            IF(TEMP(I).GT.212.5)GO TO 20
            PHASE(I)=2
            GO TO 30
10          PHASE(I)=1
            GO TO 30
20          PHASE(I)=3
30          CONTINUE
          END DO
    

The logic of this loop is actually very clean, and equivalent to an IF-THEN/ELSEIF-THEN/ELSE logical block, which the compiler can vectorize. A more complicated example might confuse it, so it would be wise to rewrite such a loop.
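
For the record, here is the same loop written as that logical block:

          DO I=1,N
            IF(TEMP(I).LT.32.15)THEN
              PHASE(I)=1
            ELSE IF(TEMP(I).GT.212.5)THEN
              PHASE(I)=3
            ELSE
              PHASE(I)=2
            END IF
          END DO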

Now here's an example that the compiler will not be able to handle as a vector operation:

          DO I=1,N

            J=1
10          CONTINUE
            IF(X(I).LT.Y(J))THEN
              J=J+1
              GO TO 10
            END IF

            FX(I)=FY(J) + (FY(J+1)-FY(J))*((X(I)-Y(J))/(Y(J+1)-Y(J)))

          END DO
    

The biggest problem with this loop is that the operations carried out on each vector item are not the same. In other words, X(1) might go through the "GO TO 10" loop 3 times, while X(2) goes through it once, and X(3) goes through it 17 times. There's no way that we can synchronize these operations and keep the pipeline running.

The final calculation of FX(I) could be run as a vector operation, but it is not being handled that way, because the whole loop must vectorize. The best way to deal with this problem is to split it up into two loops, so that the vectorizable part vectorizes:

          DO I=1,N
            J=1
10          CONTINUE
            IF(X(I).LT.Y(J))THEN
              J=J+1
              GO TO 10
            END IF
            INDX(I)=J
          END DO

          DO I=1,N
            FX(I)=FY(INDX(I)) +
         *  (FY(INDX(I)+1)-FY(INDX(I)))*((X(I)-Y(INDX(I)))/(Y(INDX(I)+1)-Y(INDX(I))))
          END DO
    

Reduce Subroutine Calls

We already have seen an example where the "overhead" of a DO loop can become the most significant cost of operation if the computational part of the DO loop is skimpy.

We've also already seen that calling a user subroutine from a DO loop will guarantee that the DO loop does not vectorize.

Here's a sample code that illustrates another common problem:

          DO I=1,N
            Y(I)=AVERAGE(N,A(1,I))
          END DO

          REAL FUNCTION AVERAGE(N,COLUMN)

            DIMENSION COLUMN(N)

            AVERAGE=0.0
            DO I=1,N
              AVERAGE=AVERAGE+COLUMN(I)
            END DO

            AVERAGE=AVERAGE/N

            RETURN
          END 
    

There isn't any problem with vectorization in this setup, but there is a problem with overhead. Just as it costs us some "overhead" every time we execute a DO loop, in order to update and check the DO loop index, it also costs us overhead to call a function or subroutine.

Before we call the function or subroutine, we have to save the values of the current variables, store onto the stack the subroutine arguments, then we have to "flush" the current set of instructions and read in the instructions of the subroutine. And that's just to GET there. If, after all that work, we don't actually do much there, then the overhead of jumping back and forth can be relatively costly.

In such a case, the programmer has a number of options. First, the programmer can rewrite the program, either copying the text of the subroutine into the calling program, or moving the DO loop into the subroutine. The number of "jumps" to the subroutine will drop from N to 0, or to 1.

The other option, on the Cray, is to use the "inlining" option. The compiler has an automatic feature, which allows it to identify small subroutines, and replace the "jump" code by the actual code of the routine. This may make the resulting program larger, but it will usually run more quickly.

Performance Monitoring: Two Simple Tools

Performance monitoring is the attempt to measure the speed or efficiency of a program.

The simplest measures of performance for a scientific program are:

total elapsed CPU time used; total number of floating point operations; average computational rate.

The Cray will print out the total elapsed CPU time for a FORTRAN program when it is finished executing. As you "fiddle" with a program, this value can guide you in seeing roughly whether you've improved or damaged the program. Monitoring this value for various problem sizes can help you estimate how fast your program's running time grows with the size of the problem.

The Cray provides a built in subroutine called SECOND, which returns the current value of a CPU clock, which can be used to time portions of a program:

          CALL SECOND(TIME1)
     
          DO I=1,1000000
            X(I)=EXP(Y(I))
          END DO
      
          CALL SECOND(TIME2)
          WRITE(*,*)'A million EXP values in ',TIME2-TIME1,' seconds.'

    

The value of the elapsed CPU time tells you nothing about whether you have reached the top speed of the Cray. The Cray can execute (on one processor) about 333 million floating point operations (additions and multiplies) per second; such rates are measured in millions of floating point operations per second, or MegaFLOPS. If your program is running at 12 MegaFLOPS, there is a lot of potential speedup, but at 120 MegaFLOPS it is really running very well.

The Cray has a built in monitoring program that can give you an overall MegaFLOP rate for your entire program. This is only suitable for programs that run for a fair amount of time (10 seconds, say), and do a fair amount of real number computation (not a sorting program, for instance!). In such cases, the HPM program can be used. You simply modify your command line by starting it with "hpm". For instance, whereas I would normally compile and run a program this way:

      cf77 myprog.f
      a.out
    

I would do it this way instead:

      cf77 myprog.f
      hpm a.out
    

This would cause my program to execute as usual, but once it was done, I would get a fairly extensive report on what my program did, and how fast it ran, in MegaFLOPS.



Last revised on 26 June 2007.