The Flavor of Scientific Computing


This talk was given on 08 June 1998, at the beginning of the Research Experience for Undergraduates, conducted by Professor Janet Peterson of the Mathematics Department of Iowa State University.

TABLE OF CONTENTS

Introduction

Welcome to the Research Experience for Undergraduates program! This summer program will expose you to the "flavor" of scientific programming. We're hoping, of course, that you'll discover an appetite for this kind of work, and consider pursuing it in graduate school or industry.

During most of this time, you'll be working with a mentor on one big project, designing, implementing, and debugging programs for a specific nasty research problem. Watch your mentor carefully, pay attention to how the problems are handled, and you will learn a great deal.

For this first week, we hope to give you an easy and comfortable introduction to scientific computing, the local computer environment, the mentors, and your fellow students. We'd also like to see how you handle yourself on some simple problems. So I'd like to present a few simple problems to you, talk over the process of coming up with a solution, and figuring out how we can turn over this work to a computer.

Back to TABLE OF CONTENTS.

A Simple Maze Problem

One of the simplest and oldest puzzles involves a maze. We are given some arrangement of walls and passages, and asked to thread our way through it. As a very simple example, consider the following diagram:

         Enter

     E   +---+   +---+
     x       |       |
     i       |       |
     t   +   +---+   +
         |           |
         |           |
         +---+   +   +
         |       |   |
         |       |   |
         +---+---+---+
      

This example is much too easy for us, but it's the right place to start. We are less interested in solving a maze than in understanding HOW we solve a maze, and whether we can describe the problem and its solution method in a way that will let a computer handle it automatically.

Back to TABLE OF CONTENTS.

How Can We Represent a Maze?

Before we talk about solving this problem, let's start with a very elementary difficulty. How do we represent the maze? We can't expect to hold a picture of the maze up to a camera; we're going to have to summarize the information, using a data structure that a computer can deal with.

Certainly, we can tell the computer that there are 3 rows and 3 columns of "cells". But how do we want to refer, for instance, to the entrance cell? There are two possibilities that naturally come to mind: we could call it "the cell in row 1, column 2". But we could also decide to number the cells 1 through 9, counting across first, say, in which case the entrance would be "cell number 2." Both methods have their advantanges and costs.

How do we convey the location of the walls? Note that our decision about how to represent the nodes (by number, or by row and column) will complicate our simplify this task. Here are some ideas:

All of these methods properly summarize the data, but they are all inconvenient to use. For instance, if we use the first method, the wall number is "compressed", so that the single value 11 tells us "walls to north, east and south". But that means that every time we refer to a wall number, we have to spend some time figuring out what it means. We'll see in a minute that something called "the adjacency matrix" will give us a nice representation of the matrix that is easy to understand, and can actually be used to do some interesting computations.

Back to TABLE OF CONTENTS.

The Right Hand Rule for (Simple) Mazes

Of course, with a simple maze like our example, there's little challenge in coming up with a solution; we can solve it on sight. But suppose we really want to solve enormous mazes, and we're just using this little one for inspiration. Then we need to come up with rules, suggested by the small problem, but which will apply to any problem, no matter how big or complicated. Can we find a reasonable general strategy?

One simple method of finding the solution is called the "right hand rule." Simply place your right hand on the side of the maze wall, and keep it in contact as you walk around the maze. Whenever you come to a decision spot, go to the right, so that your hand stays in contact. It should be easy to see why this rule works. Look at our simple example. The maze is made of two walls. If we keep in touch with one wall, and keep moving in one direction (relative to that wall), then we will eventually go all the way around the wall. But that means we must pass through the exit.

There are reasons that this rule isn't perfect:

Back to TABLE OF CONTENTS.

The Adjacency Matrix

Perhaps one of our problems in thinking about the maze is finding a good way to represent it. A natural way comes from the field of graph theory, and uses a table called the "adjacency matrix". If we have N rooms or cells in the maze, we need to number them from 1 to N, and then build a table of N rows and columns. If we call our matrix A, then the value Aij will be

        1, if cell I is connected to cell J, 
        0, otherwise
      

The adjacency matrix for our maze is as follows (where I've replaced the 0 values by periods, to make the 1's stand out):

           1  2  3  4  5  6  7  8  9
         ----------------------------
      1 |  .  .  .  1  .  .  .  .  .
      2 |  .  .  1  .  .  .  .  .  .
      3 |  .  1  .  .  .  1  .  .  .
      4 |  1  .  .  .  1  .  .  .  .
      5 |  .  .  .  1  .  1  .  1  .
      6 |  .  .  1  .  1  .  .  .  1
      7 |  .  .  .  .  .  .  .  1  .
      8 |  .  .  .  .  1  .  1  .  .
      9 |  .  .  .  .  .  1  .  .  .
      

It's nice how the 1's line up here, but lines of 1's don't correspond to paths in our maze. To try to see a correspondence, consider the fact that we can move from cell 6 to cell 5 to cell 8. The matrix records this by the fact that A6,5 = 1 and A5,8 = 1.

Suppose we wanted to get from cell 6 to cell 8, and we didn't have a picture of the maze. Perhaps we could figure it out from the matrix. If we could get from 6 to 8 directly, then we'd be done. But otherwise, we'd want to check whether we can get there in two steps, say by going from 6 to 1 and 1 to 8, which would require that A(6,1) and A(1,8) be 1. Notice how, for a two step path using cell J, we would look at matrix entries A(6,J) and A(J,8). Another place where we match up matrix entries this way is when we multiply two matrices. What happens when we multiply the adjacency matrix by itself, to get the square of A, which we will call "B":

           1  2  3  4  5  6  7  8  9
         ----------------------------
      1 |  1  .  .  .  1  .  .  .  .
      2 |  .  1  .  .  .  1  .  .  .
      3 |  .  .  2  .  1  .  .  .  1
      4 |  .  .  .  2  .  1  .  1  .
      5 |  1  .  1  .  3  .  1  .  1
      6 |  .  1  .  1  .  3  .  1  .
      7 |  .  .  .  .  1  .  1  .  .
      8 |  .  .  .  1  .  1  .  2  .
      9 |  .  .  1  .  1  .  .  .  1
      

What can we make of this pattern? Notice that the matrix is still symmetric, and that all the diagonal entries are nonzero now. Many entries are greater than 1, and reflect the fact that there are several ways to get from one cell to another in two steps.

Can we use the adjacency matrix to solve our problem? Well, suppose we just keep computing powers of A. Since our starting cell is number 2, and our goal cell is number 1, if we ever compute a power with a nonzero value in the (1,2) entry, then we are done, there IS a path. And in fact, we know that we never have to go higher than A to the N-1 power, because no two cells can be further apart than N-1 steps.

Unfortunately, this method is similar to an "existence" proof. It tells us there is a path, and that it takes so many steps, but it doesn't really tell us what that path is. That's not really a satisfactory solution, so let's look at another method for help.

Back to TABLE OF CONTENTS.

The Depth First Method

Another way to search a maze tries to organize our search by identifying our choices, and allowing us to back up if we realize we've made a mistake. In order to find the goal, we need to assure ourselves that we will search every possible pathway. All we need to do that is a really good memory, or a good way of keeping track of where we've been and where we've made choices. The method is called a "depth first" search.

Suppose that on the beginning position, we could go right or left. We arbitrarily pick decide to go left, but in case that turns out badly and we have to come back, we mark down on a list that at position 1 there was one other choice, namely to go right.

We walk along the path for a while, and come to a spot with three choices. We pick one, and on the list we note our location, and mark down the two directions we didn't go.

We come to a dead end. Now we march back to the last place where we made a decision. There were two choices we made there. We take one of them, removing it from the list, and proceed in that direction.

Eventually we will reach the goal, or come back to the starting position with no choices, in which case we know it is hopeless.

Now one thing you should worry about is what we do if we are walking along and we reach a place we've been before. We will treat this in the same way as if we had hit a dead end, and simply back up. (You really do have to worry about cases like this).

Back to TABLE OF CONTENTS.

OUTLINE OF THE DEPTH FIRST METHOD

The depth first method does not have the flaws of the right hand rule, and is suitable for many other applications besides mazes. Let's take a moment to sketch out the algorithm.

We assume that every position in the maze has an identifying number, which we'll call the "position", and that at each position, there is a list of the next positions we can reach.

We'll use NPATH to count the number of positions in our current path, and the array PATH to actually list the positions. When we add a new step to PATH, we then look at the neighbors of that cell. The number of neighbors that are unused and acccessible will be counted in the NCAN array, and their values will be stored in STACK. The total number of candidates in STACK will be counted by NSTACK.

  START:

    NSTACK = 0
    NPATH = 1
    PATH(NPATH) = ISTART

  BUILD:

    Build the list of candidates for the next step:

    Look at each of the neighbors of the current position.  
    There are usually four neighbors, but watch out for corners and sides.
    If there's a passage to the neighbor, and that neighbor is not
    actually part of the PATH array already, then it's a candidate for
    our next step.

      N = number of candidates found.
      NCAN(NPATH) = N

    The list of candidates is added to onto a "stack":

      STACK(NSTACK+1) = candidate 1
      ...
      STACK(NSTACK+N) = candidate N

      NSTACK = NSTACK + N

  CHECK:

    If NCAN(NPATH) = 0, there are no available, untested candidates 
    for the next step.  We'd like to back up.

      If we are at the starting point, ( NPATH = 0 ) then we can't back up, 
      so we've failed.

      Otherwise, drop the last element of the path:
        NPATH = NPATH - 1
      and go back to CHECK.

  STEP:

    Note that we are about to use one of our candidates:
      NCAN(NPATH) = NCAN(NPATH) - 1
    Prepare to take the next step:
      NPATH = NPATH + 1
    Pop a candidate I from the stack:
      PATH(NPATH) = STACK(NSTACK)
    If our candidate is the goal, we're done!
      if PATH(NPATH) = ISTOP then exit!
    Maintain the stack:
      NSTACK = NSTACK - 1
    Go to BUILD.
  

Back to TABLE OF CONTENTS.

A ROAD RALLY PROBLEM

Now let's consider a variation of our maze problem. We still have a set of locations, and connections between some of them. Again, we will have a "start" and a "goal" location, but now we can assume that we can find a path from start to goal. The new twist is that we are going to measure the distance traveled. We are in a road rally, and we want to find the shortest distance from the start to the goal, along the allowed connections.

Here's a very road map we will be looking at:

                    7
           /-----------------------------\
    start=A                               F=goal
           \-----B-----C-----D-----E-----/
              2     2     2     2     2
  

Again, we can see right away that the right thing to do is take the "northern" road directly from the start to goal, for a total of 7 miles, rather than the southern road that takes 10 miles. But let's forget for a moment that we can see this, and think about how we would go about solving such a problem in general.

The first thing that might come to mind is to look at the adjacency matrix again. If we simply write the distance in the matrix instead of a 1 or 0, then at least we have a simple way of summarizing the information, which we might use when we "talk" to the computer:

       A  B  C  D  E  F  
    A  .  2  .  .  .  7
    B  2  .  2  .  .  .
    C  .  2  .  2  .  .
    D  .  .  2  .  2  .
    E  .  .  .  2  .  2
    F  7  .  .  .  2  .
  

Back to TABLE OF CONTENTS.

A GREEDY ALGORITHM

A natural way to try to solve this problem is to do it in steps. We might try to build the best path by adding one more link to our current path. Starting at "A", we have two choices, routes of 7 and 2 miles in length. Taking the short term mentality, we will choose the shorter path, so that we have tentatively decided to go from A to B. After that, we don't really have any more choices, so that we'll end up going A-B-C-D-E-F, for a total of 10 miles.

This method of building a solution by taking one step at a time, and always taking the "best" option right away, rather than looking at a larger long-term goal, is called a "greedy" algorithm. Its advantage is simplicity, but certainly for our problem, it fails to get the best answer.

Back to TABLE OF CONTENTS.

HOW ABOUT A COMPLETE LIST?

Another possible solution method might be simply to generate all possible paths, and compare their lengths. This is essentially what we did (very efficiently) when we used powers of the adjacency matrix to count all the paths of a certain length. This idea is would certainly work for our example, and save us a lot of thought. But it's not a practical solution. For problems with just a few more cities and roads, this turns out to take an incredible amount of time. For example, a problem with 10 cities, all connected by roads of varying lengths, would require the calculation of roughly 180,000 routes (that's 9!/2). So this is not going to be an option except for small problems.

Back to TABLE OF CONTENTS.

THE DIJKSTRA ALGORITHM

In order to solve our problem, let's go back to the greedy algorithm, and see if we can patch it up. The problem seems to be that the greedy algorithm is always looking at the small picture, comparing the length of a single link to another, and taking the shortest. Let's try to give it a slightly bigger perspective.

The key to a practical algorithm for this problem is to try to only look at a few possibilities at one time, and to use what you learn to eliminate as many candidates as possible on the next step.

To focus our thinking, let's back off a moment from trying to compute the shortest distance from A to the particular city F, and instead think about how we would compile a table of the distances from A to all the cities.

Well, that's almost easy. First, we note that B and F are directly connected to A. But there are other ways to reach F that might be shorter. There can't be a shorter route from A to B, though, because all other roads out of A, which must be part of any alternate route, are longer. So we can confidently say that the distance from A to B is 2.

Now we want to add a new city to our list. We will try to add the city which is closest to A. Our list of city candidates will be:

Using this rule, we add C, and we know for sure that it is 4 units from A. Similarly, things go just as expected for D and E. But now, look how we handle F. We do NOT compare the "short" link from E to F versus the "long" link from A to F. Instead, we correctly consider the total paths, that is, the path made of the single link of length 7, and the path A-B-C-D-E-F, which is of length 10. Now, we are able to correctly notice the shortest path, and we get the true distance from A to F.

This algorithm does seem to do some extra work. Instead of focussing directly on a single city, it painfully works through the distances from A to all cities (although we can stop as soon as we encounter F).

The benefits of the algorithm are many. First, it can be proven to be correct, for any possible problem of this sort. Secondly, we don't just know the distance from A to F, but we can also get the path that is of that length. And finally, the method avoids looking at the huge number of all possible paths, sticking only to the much smaller collection of paths that can be made by adding a single link between a "known" and an "unknown" city.

This problem is easy to discuss, and the algorithm sounds straightforward, so you might think there's no point in actually trying to program it. But I think you'll be surprised if you try. It's a worthwhile exercise to try to program this method for a typical set of data!

This method was invented by Edsgar Dijkstra, a computer scientist famous for preaching against the "GO TO" statement.

Back to TABLE OF CONTENTS.

SUMMARY

I hope I've given you a feeling for the "flavor" of scientific computation, which might be thought of as a process of:

I've kept to simple examples from graph theory so that the problems we worked on weren't so difficult that we lost sight of how we solved them. In the "real world" of scientific, engineering, and biomedical computing, it can be a major accomplishment just to understand the problem being solved!

This summer should be a rich opportunity for you to discover research problems of real interest, and to learn from your mentors the craft of computational science.

As a warmup exercise for your later work, I've prepared a collection of problems for you to try. Pick one, try to sketch out a method of representing the data, and solving the problem. I'll give you some sample data when you tell me the problem you've picked. The problems are described in Problems to Program.

Back to TABLE OF CONTENTS.

You can return to the HTML web page.


Last revised on 15 September 2005.