- Log in to the workshop machine
Workshops differ in how this is done. The instructor will go over this
beforehand.
For this workshop, we will be using an IBM POWER4 system called "newberg".
It is physically a single 32 CPU machine, which has been configured to look
like a 4 node, 8 CPU/node cluster. Newberg is actually a login alias for
berg04. The compute nodes are berg05, berg06, berg07 and berg08.
Note that unlike LC production IBM POWER systems, the workshop machine:
- Allows multiple users to run on the same node simultaneously, and permits
"over-subscription" of a node; that is, users may run more parallel tasks
than there are CPUs on a node.
- Is running the Moab batch system instead of LCRM
- Permits interactive jobs to be run in the batch pool
- Copy the example files
- In your home directory, create a subdirectory for the POE example codes
and then cd to it.
mkdir poe
cd poe
- Copy either the Fortran or the C version of the exercise files to your
poe subdirectory:
C:
cp /usr/global/docs/training/blaise/poe/C/* ~/poe
Fortran:
cp /usr/global/docs/training/blaise/poe/Fortran/* ~/poe
- List the contents of your poe subdirectory
You should see the exercise source files used in the steps that follow
(poe_hello, poe_bandwidth, smp_bandwidth and related files).
- Understand your system configuration
- Display the pool configuration for the workshop machine. Use each of
the commands below, noting their similarities and differences:
mjstat
sinfo
ju
Questions:
- What pool(s) are configured?
- How many nodes are in each pool - total, available, free?
- How many CPUs do the nodes have?
- How much memory does each node have?
- What are the names of the nodes in the pool?
- Try the command news job.lim.berg to find out additional
configuration information for the workshop machine.
- Authentication
- LC has already taken care of this step for you...you need to do nothing.
- You can verify (if you want) that LLNL has authorized you to use
these nodes. Check the /etc/hosts.equiv file. It
should contain the names of all nodes in the system.
- Find out what compilers are available
- See the LC Compiler web page at:
computing.llnl.gov/code/compilers.html
- Click on "newberg". Note also that this page shows compiler information
for all of LC's production systems.
- Compile the poe_hello program
Depending upon your language preference, use one of the IBM parallel
compilers to compile the poe_hello program. Notice that this is a
very simple compilation, explicitly requesting large pages (-blpdata),
64-bit mode (-q64) and level 2 optimization (-O2).
C:
mpxlc -blpdata -q64 -O2 -o poe_hello poe_hello.c
Fortran:
mpxlf -blpdata -q64 -O2 -o poe_hello poe_hello.f
- Set up your POE environment
- First, find out which POE environment variables have already been set for
you by LC:
setenv | grep MP
Some of the more important ones are discussed below:
MP_LABELIO yes
    Prepend task output with its task id number
MP_RESD yes
    Non-specific allocation - let the Resource Manager decide which nodes to use
MP_SHARED_MEMORY yes
    Use shared memory (not the switch) for intra-node communications
MP_COREFILE_FORMAT core.light
    Produce lightweight (small) corefiles
MP_EUILIB us
    User Space (fast) communications protocol
- Tell POE which pool to use and how many tasks to start:
setenv MP_RMPOOL name_of_pool
setenv MP_PROCS 4
- Run your poe_hello executable
- This is the simple part. Just issue the command:
poe_hello
- Provided that everything is working and set up correctly, you should
receive output that looks something like the following:
0:Total number of tasks = 4
0:Hello! From task 0 on host berg05
1:Hello! From task 1 on host berg06
2:Hello! From task 2 on host berg07
3:Hello! From task 3 on host berg08
- Maximize your use of all 8 CPUs on a node
The previous step was the most "wasteful" way to run a POE program, since by
default, POE will load only one task on a node. To make better use of the SMP nodes, try the following:
- Run 8 poe_hello tasks on each of 2 nodes. Three different ways to do
this are shown below, all using command line flags. The corresponding
environment variables could be used instead; see the POE man page for
details.
Method 1: Specify POE flags for number of nodes and number of tasks:
poe_hello -nodes 2 -procs 16
Method 2: Specify POE flags for number of tasks per node and total
number of tasks:
poe_hello -tasks_per_node 8 -procs 16
Method 3: Specify POE flags for number of nodes and number of
tasks per node:
unsetenv MP_PROCS
poe_hello -nodes 2 -tasks_per_node 8
- Try the poe_bandwidth exercise code
- Depending upon your language preference, compile the poe_bandwidth
source file as shown:
C:
mpxlc -blpdata -q64 -O2 -o poe_bandwidth poe_bandwidth.c
Fortran:
mpxlf -blpdata -q64 -O2 -o poe_bandwidth poe_bandwidth.f
- Specify using two tasks:
setenv MP_PROCS 2
- Run the executable:
poe_bandwidth
As the program runs, it will display the effective communications bandwidth
between two nodes over the switch.
Sample output from poe_bandwidth (Fortran):
0:
0: ****** MPI/POE Bandwidth Test ******
0: Message start size= 100000
0: Message finish size= 2000000
0: Incremented by 100000 bytes per iteration
0: Roundtrips per iteration= 100
0: Task 0 running on: berg06
0: Task 1 running on: berg07
0:
0: Message Size Bandwidth (bytes/sec)
0: 100000 .6623E+09
0: 200000 .9538E+09
0: 300000 .1102E+10
0: 400000 .1131E+10
0: 500000 .1152E+10
0: 600000 .1130E+10
0: 700000 .1106E+10
0: 800000 .1080E+10
0: 900000 .1060E+10
0: 1000000 .1053E+10
0: 1100000 .1068E+10
0: 1200000 .1069E+10
0: 1300000 .1078E+10
0: 1400000 .1084E+10
0: 1500000 .1092E+10
0: 1600000 .1098E+10
0: 1700000 .1105E+10
0: 1800000 .1105E+10
0: 1900000 .1108E+10
0: 2000000 .1112E+10
- Now, try running the executable again, but this time take advantage of
RDMA (Remote Direct Memory Access) communications. RDMA communications
move data directly from the memory of one task to the memory of another,
bypassing the CPUs.
setenv MP_USE_BULK_XFER yes
- Notice the output. You should see a significant increase in bandwidth.
Sample output from poe_bandwidth with RDMA (C language):
0:
0:****** MPI/POE Bandwidth Test ******
0:Message start size= 100000 bytes
0:Message finish size= 2000000 bytes
0:Incremented by 100000 bytes per iteration
0:Roundtrips per iteration= 100
0:Task 0 running on: berg06
0:Task 1 running on: berg07
0:
0:Message Size Bandwidth (bytes/sec)
0: 100000 6.533184e+08
0: 200000 3.328522e+08
0: 300000 1.768604e+09
0: 400000 1.878590e+09
0: 500000 1.991087e+09
0: 600000 2.036268e+09
0: 700000 2.092340e+09
0: 800000 2.139869e+09
0: 900000 2.131583e+09
0: 1000000 2.194638e+09
0: 1100000 2.199955e+09
0: 1200000 2.227054e+09
0: 1300000 2.238960e+09
0: 1400000 2.215867e+09
0: 1500000 2.266914e+09
0: 1600000 2.265423e+09
0: 1700000 2.282019e+09
0: 1800000 2.294700e+09
0: 1900000 2.283079e+09
0: 2000000 2.305728e+09
- Determine per-task communication bandwidth behavior
In this exercise, pairs of tasks, located on two different nodes,
will communicate with each other.
- Compile the code:
C:
mpxlc -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.c
Fortran:
mpxlf -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.f
- Then use the smp_bandwidth code to determine per-task bandwidth
characteristics on an SMP node:
smp_bandwidth -nodes 2 -procs 2
smp_bandwidth -nodes 2 -procs 4
smp_bandwidth -nodes 2 -procs 8
smp_bandwidth -nodes 2 -procs 16
What happens to the per-task bandwidth as the number of tasks increases?
- Optimize intra-node communication bandwidth
When all of the task communications occur "on-node", it is
possible to optimize the effective per-task bandwidth by utilizing
shared memory instead of the network.
- First, try it without shared memory (using the switch):
setenv MP_SHARED_MEMORY no
smp_bandwidth -nodes 1 -procs 4
smp_bandwidth -nodes 1 -procs 8
- Now use shared memory:
setenv MP_SHARED_MEMORY yes
smp_bandwidth -nodes 1 -procs 4
smp_bandwidth -nodes 1 -procs 8
What differences do you notice?
- Generate diagnostic/statistical information for your run.
- POE provides several environment variables / command flags that collect
diagnostic and statistical information about a job's run. Three of the more
useful ones are shown below. Try running a job after setting these as shown.
Direct stdout to a file so that you can easily read the output after the job
runs.
setenv MP_SAVEHOSTFILE myhosts
setenv MP_PRINTENV yes
setenv MP_STATISTICS print
poe_bandwidth -nodes 2 -procs 2 > myoutput
- After the job completes, examine both the myhosts file and
myoutput file. The MP_PRINTENV environment variable can be
particularly useful for troubleshooting since it tells you all of the POE
environment variable settings. See the POE man page if you have any
questions.
- Be sure to unset these variables when you're done to prevent cluttering
your screen with their output for the remaining exercises.
unsetenv MP_SAVEHOSTFILE
unsetenv MP_PRINTENV
unsetenv MP_STATISTICS
- Try using POE's Multiple Program Multiple Data (MPMD) mode
POE allows you to load and run different executables on different nodes.
This is controlled by the MP_PGMMODEL environment variable.
- First, set some environment variables:
setenv MP_PGMMODEL mpmd
    Specify MPMD mode
setenv MP_PROCS 4
    Use 4 tasks again
setenv MP_NODES 1
    Use one node for all four tasks
setenv MP_STDOUTMODE ordered
    Sort the output by task
- Then, simply issue the poe command. After a moment, you
will be prompted to enter your executables one
at a time. Notice that the machine name where the executable will
run is displayed as part of the prompt.
In any order you choose, enter these four program names. For example:
berg04{class01}64: poe
0031-503 Enter program name and flags for each node
0:berg05> prog1
1:berg05> prog4
2:berg05> prog3
3:berg05> prog2
Note: these four programs are just simple shell scripts used to
demonstrate how to use the MPMD programming model.
- After the last program name is entered, POE will run all four executables.
Observe their different outputs. Note that the output is sorted by task.
Which environment variable setting specifies this?
- Enter unsetenv MP_STDOUTMODE and repeat this example.
What happens to the order of output now? This is the default behavior
for MPI programs.
IGNORE PART 16 - not currently working under Moab
- Try specific node allocation using a host list file
Generally speaking, there aren't many cases where you'll need to "manually"
select which nodes should be used to run your POE job. This step will
demonstrate how to do it though, should you ever have the need.
- First, use your favorite UNIX editor to create a file named
hostfile in your POE executables directory. As its contents, enter
4 node names from the workshop node pool (berg05 through berg08) -
one node name per line.
You can use 4 different node names, or mix and match any way you like.
For example, any of the examples below would be OK.
Example 1:
berg05
berg06
berg07
berg08

Example 2:
berg06
berg06
berg08
berg08

Example 3:
berg05
berg06
berg07
berg07
- Set the appropriate POE environment variables which specify specific
node allocation:
setenv MP_HOSTFILE hostfile
    Specify the host file you created
setenv MP_SAVEHOSTFILE hosts_used
    Save the names of the hosts used to run your program
setenv MP_PGMMODEL spmd
    Reset from the mpmd mode used in the previous step
- Run the poe_hello executable again and observe the output.
You'll probably get an informational message that looks like:
ATTENTION: 0031-379 Pool setting ignored when hostfile used
which is usual when you use specific node allocation in this manner.
- Check your program output and the contents of the hosts_used file,
which was created when your program ran. Do the node names match the
ones you specified in your hostfile?
- Check out the LC online documentation (or at least know where to
find it):
computing.llnl.gov
There are many things you will find useful later when you're using
one of LC's "real" SP systems.