IBM analyst Charles Grassl visited CSIT on May 18 and conducted a nonstop series of one-on-one help sessions with 12 researchers. Everyone who attended had a chance to get answers straight from the source.
We hope Charles Grassl will be back for another series of consultations, perhaps in the fall. In the meantime, we'd like to mention a few points that came up repeatedly during the consultations.
One researcher was puzzled because his program would crash at the end of execution, when he was cleaning up by deallocating memory. The error would disappear if he commented out the deallocations, but he was naturally worried that this was a symptom of a deeper problem. He also noticed that his code would sometimes fail on 8 processors but run correctly on 16.
Charles Grassl suggested he add the -q64 option to his compilation command. This cleared up both problems. Grassl explained that the default addressing scheme is 32-bit, and IBM is reluctant to change defaults. However, with the larger memory available on big machines, it is becoming necessary to move to 64-bit addressing. You can't hurt your program by going to 64-bit addressing, and it can often help.
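As a rough sketch of what adding the option looks like, assuming a Fortran 90 program compiled with xlf90 (the file names myprog.f90 and myprog are hypothetical placeholders), the command line can be assembled and inspected like this:

```shell
# Hypothetical file names; substitute your own source and executable.
SRC=myprog.f90
EXE=myprog

# -q64 requests 64-bit addressing instead of the 32-bit default.
COMPILE_CMD="xlf90 -q64 -o $EXE $SRC"

# Print the command for inspection; type it at the prompt to actually compile.
echo "$COMPILE_CMD"
```

Object files and libraries that are linked into one executable generally need to be built in the same addressing mode, so the option should appear on every compile and link step.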
Parallel programming requires that any thread of execution be able to call any subroutine at any time, even if that subroutine is already being used by another thread of execution. The most modern libraries are written in this "reentrant" fashion. You can get the best parallel performance by accessing IBM's reentrant libraries. Note that reentrant libraries are NOT the default. To access them, you need to use the appropriate versions of the compiler; typically, these have the same names as the standard versions, with "_r" appended.
Versions of the FORTRAN compiler that use reentrant libraries include:
xlf_r, xlf90_r, and xlf95_r;
mpxlf_r, mpxlf90_r and mpxlf95_r.
Versions of the C/C++ compiler that use reentrant libraries include:
xlc_r, cc_r and CC_r, mpcc_r and mpCC_r.
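The naming rule can be illustrated with a small Bourne shell sketch. The helper function reentrant and the file names are hypothetical, introduced only for this example; the compiler names come from the lists above.

```shell
# Hypothetical helper: map a standard compiler name to its
# reentrant counterpart by appending "_r".
reentrant() {
    printf '%s_r\n' "$1"
}

# Pick the reentrant version of the MPI Fortran 90 compiler.
FC=$(reentrant mpxlf90)

# Show the compile command that would be used (file names are placeholders).
echo "$FC -o myprog myprog.f90"
```

In a makefile or build script, this amounts to changing the compiler name in one place, with no change to the source code.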
When a parallel job is running, and data needs to be sent from one process to another, the default action is that the data is transferred through an external switch. On the IBM systems, however, it may be the case that the sending and receiving processes are on the same node.
In that case, there is no need for the data to be sent out and then back in again. The memory transfer can happen much more rapidly, "on the chip". However, again, this is not the default behavior. To request that, where appropriate, the data transfer be done as fast as possible, you need to issue a command like
export MP_SHARED_MEMORY=yes in the Bourne shell and its relatives, or
setenv MP_SHARED_MEMORY yes in the C or T shell.
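For Bourne shell users, the following minimal sketch sets the variable and then confirms that a child process can see it. The check with sh -c is only an illustration of environment inheritance, not something the parallel environment requires.

```shell
# Ask for shared-memory transfers between processes on the same node.
export MP_SHARED_MEMORY=yes

# An exported variable is inherited by child processes; a child shell
# started here should print the value set above.
sh -c 'echo "MP_SHARED_MEMORY=$MP_SHARED_MEMORY"'
```

Placing the export line in a job script or shell startup file saves retyping it in every session.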
Many people were interested in whether it was possible to execute a parallel program interactively; this is NOT the proper way to run a production code, but sometimes a short quick run is needed to demonstrate correctness or test out a theory.
This can be done, on Eclipse at least. When you log in, you're actually "talking to" the interactive node, which has 32 processors available. To access them, you must first make a file called host.list containing the addresses of the processors you would like to use. For interactive work, you can only access the processors on the interactive node, so the list will contain the same name up to 32 times (your list only needs as many names as the number of processors you will ask for).
Your host.list file could be:
csit212.fsu.edu
csit212.fsu.edu
csit212.fsu.edu
csit212.fsu.edu
...list up to 32 copies if you need them...
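Rather than typing the same name repeatedly, the file can be generated with a short Bourne shell loop. This is a sketch only; the variable names NPROCS and HOST are arbitrary, and the host name is the one shown above.

```shell
# Number of processors you plan to request (at most 32 on the interactive node).
NPROCS=4
HOST=csit212.fsu.edu

# Start with an empty host.list, then append one line per processor.
: > host.list
i=0
while [ "$i" -lt "$NPROCS" ]; do
    echo "$HOST" >> host.list
    i=$((i + 1))
done

# Report how many entries were written.
wc -l < host.list
```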
To run a job in parallel, you need to preface the command with the poe command, and follow it by a switch that specifies the number of processors. Thus, to simply run the ls command on 4 processors, you would type
poe ls -procs 4
Of course, to do anything interesting, you'd probably want to run an MPI or OpenMP executable, so it's more likely you'd type something like
poe a.out -procs 4
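Since host.list must contain at least as many entries as the number of processors requested, a quick check before launching can save a failed run. The following is a sketch under that assumption; the function enough_hosts is a hypothetical helper, and the poe command is only printed, not executed.

```shell
# Hypothetical helper: succeed if the host file (default host.list)
# has at least as many non-empty lines as the requested processor count.
enough_hosts() {
    nprocs=$1
    hostfile=${2:-host.list}
    [ -f "$hostfile" ] || return 1
    have=$(grep -c . "$hostfile")
    [ "$have" -ge "$nprocs" ]
}

if enough_hosts 4; then
    # Dry run: remove "echo" to actually launch the job.
    echo poe a.out -procs 4
else
    echo "host.list needs at least 4 entries" >&2
fi
```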