- Log in to the workshop machine
Workshops differ in how this is done. The instructor will go over this
beforehand.
For this workshop, we will be using an IBM POWER4 system called "newberg".
It is physically a single 32 CPU machine, which has been configured to look
like a 4 node, 8 CPU/node cluster. Newberg is actually a login alias for
berg04. The compute nodes are berg05, berg06, berg07 and berg08.
Note that unlike LC production IBM POWER systems, the workshop machine:
- Allows multiple users to run on the same node simultaneously, and permits
"over-subscription" of a node; that is, users may run more parallel tasks
than there are CPUs on a node.
- Is running the Moab batch system instead of LCRM
- Permits interactive jobs to be run in the batch pool
- Copy the example files
- In your home directory, create a subdirectory for the POE example codes
and then cd to it.
mkdir poe
cd poe
- Copy either the Fortran or the C version of the exercise files to your
poe subdirectory:
C:
cp /usr/global/docs/training/blaise/poe/C/* ~/poe
Fortran:
cp /usr/global/docs/training/blaise/poe/Fortran/* ~/poe
- List the contents of your poe subdirectory
You should see the exercise source files used in the steps that follow
(poe_hello, poe_bandwidth, smp_bandwidth and related files).
- Understand your system configuration
- Display the pool configuration for the workshop machine. Use each of
the commands below, noting their similarities and differences:
mjstat
sinfo
ju
Questions:
- What pool(s) are configured?
- How many nodes are in each pool - total, available, free?
- How many CPUs do the nodes have?
- How much memory does each node have?
- What are the names of the nodes in the pool?
- Try the command news job.lim.berg to find out additional
configuration information for the workshop machine.
- Authentication
- LC has already taken care of this step for you...you need to do nothing.
- You can verify (if you want) that LLNL has authorized you to use
these nodes. Check the /etc/hosts.equiv file. It
should contain the names of all nodes in the system.
- Find out what compilers are available
- See the LC Compiler web page at:
computing.llnl.gov/code/compilers.html
- Click on "newberg". Note also that this page shows compiler information
for all of LC's production systems.
- Compile the poe_hello program
Depending upon your language preference, use one of the IBM parallel
compilers to compile the poe_hello program. Notice that this is a
very simple compilation, explicitly requesting large pages (-blpdata),
64-bit mode (-q64) and level 2 optimization (-O2).
C:
mpxlc -blpdata -q64 -O2 -o poe_hello poe_hello.c
Fortran:
mpxlf -blpdata -q64 -O2 -o poe_hello poe_hello.f
- Set up your POE environment
- First, find out which POE environment variables have already been set for
you by LC:
setenv | grep MP
Some of the more important ones are discussed below:
MP_LABELIO yes
    Prepend task output with its task id number
MP_RESD yes
    Non-specific allocation - let the Resource Manager decide which nodes to use
MP_SHARED_MEMORY yes
    Use shared memory (not the switch) for intra-node communications
MP_COREFILE_FORMAT core.light
    Produce lightweight (small) corefiles
MP_EUILIB us
    User Space (fast) communications protocol
- Tell POE which pool to use and how many tasks to start:
setenv MP_RMPOOL name_of_pool
setenv MP_PROCS 4
- Run your poe_hello executable
- This is the simple part. Just issue the command:
poe_hello
- Provided that everything is working and set up correctly, you should
receive output that looks something like the following:
0:Total number of tasks = 4
0:Hello! From task 0 on host berg05
1:Hello! From task 1 on host berg06
2:Hello! From task 2 on host berg07
3:Hello! From task 3 on host berg08
- Maximize your use of all 8 CPUs on a node
The previous step was the most "wasteful" way to run a POE program, since by
default, POE will load only one task on a node. To make better use of the SMP nodes, try the following:
- Run 8 poe_hello tasks on each of 2 nodes. Three different ways to do
this are shown below, all using command line flags. The corresponding
environment variables could be used instead; see the POE man page for
details.
Method 1: Specify POE flags for number of nodes and number of tasks:
poe_hello -nodes 2 -procs 16
Method 2: Specify POE flags for number of tasks per node and total
number of tasks:
poe_hello -tasks_per_node 8 -procs 16
Method 3: Specify POE flags for number of nodes and number of
tasks per node:
unsetenv MP_PROCS
poe_hello -nodes 2 -tasks_per_node 8
- Try the poe_bandwidth exercise code
- Depending upon your language preference, compile the poe_bandwidth
source file as shown:
C:
mpxlc -blpdata -q64 -O2 -o poe_bandwidth poe_bandwidth.c
Fortran:
mpxlf -blpdata -q64 -O2 -o poe_bandwidth poe_bandwidth.f
- Specify using two tasks:
setenv MP_PROCS 2
- Run the executable:
poe_bandwidth
As the program runs, it will display the effective communications bandwidth
between two nodes over the switch.
Sample output from poe_bandwidth (Fortran):
0:
0: ****** MPI/POE Bandwidth Test ******
0: Message start size= 100000
0: Message finish size= 2000000
0: Incremented by 100000 bytes per iteration
0: Roundtrips per iteration= 100
0: Task 0 running on: berg06
0: Task 1 running on: berg07
0:
0: Message Size Bandwidth (bytes/sec)
0: 100000 .6623E+09
0: 200000 .9538E+09
0: 300000 .1102E+10
0: 400000 .1131E+10
0: 500000 .1152E+10
0: 600000 .1130E+10
0: 700000 .1106E+10
0: 800000 .1080E+10
0: 900000 .1060E+10
0: 1000000 .1053E+10
0: 1100000 .1068E+10
0: 1200000 .1069E+10
0: 1300000 .1078E+10
0: 1400000 .1084E+10
0: 1500000 .1092E+10
0: 1600000 .1098E+10
0: 1700000 .1105E+10
0: 1800000 .1105E+10
0: 1900000 .1108E+10
0: 2000000 .1112E+10
- Now, try running the executable again, but this time take advantage of
RDMA (Remote Direct Memory Access) communications. RDMA communications
move data directly from the memory of one task to the memory of another,
bypassing the CPUs.
setenv MP_USE_BULK_XFER yes
- Notice the output. You should see a significant increase in bandwidth.
Sample output from poe_bandwidth with RDMA (C language):
0:
0:****** MPI/POE Bandwidth Test ******
0:Message start size= 100000 bytes
0:Message finish size= 2000000 bytes
0:Incremented by 100000 bytes per iteration
0:Roundtrips per iteration= 100
0:Task 0 running on: berg06
0:Task 1 running on: berg07
0:
0:Message Size Bandwidth (bytes/sec)
0: 100000 6.533184e+08
0: 200000 3.328522e+08
0: 300000 1.768604e+09
0: 400000 1.878590e+09
0: 500000 1.991087e+09
0: 600000 2.036268e+09
0: 700000 2.092340e+09
0: 800000 2.139869e+09
0: 900000 2.131583e+09
0: 1000000 2.194638e+09
0: 1100000 2.199955e+09
0: 1200000 2.227054e+09
0: 1300000 2.238960e+09
0: 1400000 2.215867e+09
0: 1500000 2.266914e+09
0: 1600000 2.265423e+09
0: 1700000 2.282019e+09
0: 1800000 2.294700e+09
0: 1900000 2.283079e+09
0: 2000000 2.305728e+09
- Determine per-task communication bandwidth behavior
In this exercise, pairs of tasks, located on two different nodes,
will communicate with each other.
- Compile the code:
C:
mpxlc -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.c
Fortran:
mpxlf -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.f
- Then use the smp_bandwidth code to determine per-task bandwidth
characteristics on an SMP node:
smp_bandwidth -nodes 2 -procs 2
smp_bandwidth -nodes 2 -procs 4
smp_bandwidth -nodes 2 -procs 8
smp_bandwidth -nodes 2 -procs 16
What happens to the per-task bandwidth as the number of tasks increases?
- Optimize intra-node communication bandwidth
When all of the task communications occur "on-node", it is
possible to optimize the effective per-task bandwidth by utilizing
shared memory instead of the network.
- First, try it without shared memory (using the switch):
setenv MP_SHARED_MEMORY no
smp_bandwidth -nodes 1 -procs 4
smp_bandwidth -nodes 1 -procs 8
- Now use shared memory:
setenv MP_SHARED_MEMORY yes
smp_bandwidth -nodes 1 -procs 4
smp_bandwidth -nodes 1 -procs 8
What differences do you notice?
- Generate diagnostic/statistical information for your run.
- POE provides several environment variables / command flags that collect
diagnostic and statistical information about a job's run. Three of the more
useful ones are shown below. Try running a job after setting these as shown.
Direct stdout to a file so that you can easily read the output after the job
runs.
setenv MP_SAVEHOSTFILE myhosts
setenv MP_PRINTENV yes
setenv MP_STATISTICS print
poe_bandwidth -nodes 2 -procs 2 > myoutput
- After the job completes, examine both the myhosts file and
myoutput file. The MP_PRINTENV environment variable can be
particularly useful for troubleshooting since it tells you all of the POE
environment variable settings. See the POE man page if you have any
questions.
- Be sure to unset these variables when you're done to prevent cluttering
your screen with their output for the remaining exercises.
unsetenv MP_SAVEHOSTFILE
unsetenv MP_PRINTENV
unsetenv MP_STATISTICS
- Try using POE's Multiple Program Multiple Data (MPMD) mode
POE allows you to load and run different executables on different nodes.
This is controlled by the MP_PGMMODEL environment variable.
- First, set some environment variables:
setenv MP_PGMMODEL mpmd
    Specify MPMD mode
setenv MP_PROCS 4
    Use 4 tasks again
setenv MP_NODES 1
    Use one node for all four tasks
setenv MP_STDOUTMODE ordered
    Sort the output by task
- Then, simply issue the poe command. After a moment, you
will be prompted to enter your executables one
at a time. Notice that the machine name where the executable will
run is displayed as part of the prompt.
In any order you choose, enter these four program names. For example:
berg04{class01}64: poe
0031-503 Enter program name and flags for each node
0:berg05> prog1
1:berg05> prog4
2:berg05> prog3
3:berg05> prog2
Note: these four programs are just simple shell scripts used to
demonstrate how to use the MPMD programming model.
- After the last program name is entered, POE will run all four executables.
Observe their different outputs. Note that the output is sorted by task.
Which environment variable setting specifies this?
- Enter unsetenv MP_STDOUTMODE and repeat this example.
What happens to the order of output now? This is the default behavior
for MPI programs.
IGNORE PART 16 - not currently working under Moab
- Try specific node allocation using a host list file
Generally speaking, there aren't many cases where you'll need to "manually"
select which nodes should be used to run your POE job. This step will
demonstrate how to do it though, should you ever have the need.
- First, use your favorite UNIX editor to create a file named
hostfile in your POE executables directory. As its contents, enter
4 node names from the workshop node pool (berg05 through berg08) -
one node name per line.
You can use 4 different node names, or mix and match any way you like.
For example, any of the examples below would be OK.
Example 1:
berg05
berg06
berg07
berg08

Example 2:
berg06
berg06
berg08
berg08

Example 3:
berg05
berg06
berg07
berg07
- Set the appropriate POE environment variables which specify specific
node allocation:
setenv MP_HOSTFILE hostfile
    Specify the host file you created
setenv MP_SAVEHOSTFILE hosts_used
    Save the names of the hosts used to run your program
setenv MP_PGMMODEL spmd
    Reset from the mpmd mode used in the previous step
- Run the poe_hello executable again and observe the output.
You'll probably get an informational message that looks like:
ATTENTION: 0031-379 Pool setting ignored when hostfile used
which is usual when you use specific node allocation in this manner.
- Check your program output and the contents of the hosts_used file,
which was created when your program ran. Do the node names match the
ones you specified in your hostfile?
- Check out the LC online documentation (or at least know where to
find it):
computing.llnl.gov
There are many things you will find useful later when you're using
one of LC's "real" SP systems.