- Log in to the workshop machine
    Workshops differ in how this is done.  The instructor will go over this     
    beforehand.
    For this workshop, we will be using an IBM POWER4 system called "newberg".
    It is physically a single 32 CPU machine, which has been configured to look
    like a 4 node, 8 CPU/node cluster. Newberg is actually a login alias for
    berg04. The compute nodes are berg05, berg06, berg07 and berg08. 
    Note that unlike LC production IBM POWER systems, the workshop machine:
    
    - Allows multiple users to run on the same node simultaneously. 
        It also allows "over-subscription" of a node. That is, it will 
        allow users to run more parallel tasks than there are CPUs on a node.
    
    - Is running the Moab batch system instead of LCRM
    
    - Permits interactive jobs to be run in the batch pool
    
- Copy the example files
- In your home directory, create a subdirectory for the POE example codes
    and then cd to it.
mkdir poe 
cd  poe  
 
- Copy either the Fortran or the C version of the exercise files to your
    poe subdirectory:
 
| C: | cp  /usr/global/docs/training/blaise/poe/C/*   ~/poe |  
| Fortran: | cp  /usr/global/docs/training/blaise/poe/Fortran/*   ~/poe |  
 
-  List the contents of your poe subdirectory
You should have the following files:
- Understand your system configuration
- Display the pool configuration for the workshop machine. Use each of
    the commands below, noting their similarities and differences:
mjstat
sinfo
ju
Questions:
    - What pool(s) are configured?
    
    - How many nodes are in each pool - total, available, free?
    
    - How many CPUs do the nodes have?
    
    - How much memory does each node have?
    
    - What are the names of the nodes in the pool?
    
 
 
- Try the command news job.lim.berg to find additional configuration
    information for the workshop machine.
- Authentication
- LC has already taken care of this step for you...you need to do nothing.
 
- You can verify (if you want) that LLNL has authorized you to use 
    these nodes. Check the /etc/hosts.equiv file.  It 
    should contain the names of all nodes in the system. 
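The check can also be scripted. The sketch below is Bourne shell (sh) rather than the csh used for the exercise commands, so it would be saved to a file and run with sh. The function name check_nodes is invented for this illustration, and it assumes that /etc/hosts.equiv is readable by ordinary users.

```shell
#!/bin/sh
# check_nodes FILE NODE...
# Hypothetical helper: report every NODE that does not appear as a whole
# word in FILE, and return non-zero if at least one is missing.
check_nodes() {
    file=$1
    shift
    missing=0
    for node in "$@"; do
        if ! grep -qw "$node" "$file"; then
            echo "missing: $node"
            missing=1
        fi
    done
    return $missing
}

# Workshop usage (the four compute nodes named in step 1):
#   check_nodes /etc/hosts.equiv berg05 berg06 berg07 berg08
```

No output means every listed node was found. Matching on whole words means an entry written with a domain suffix after the node name is still counted.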
- Find out what compilers are available
- See the LC Compiler web page at:
    computing.llnl.gov/code/compilers.html
 
- Click on "newberg". Note also that this page shows compiler information 
    for all of LC's production systems.
- Compile the poe_hello program
Depending upon your language preference, use one of the IBM parallel 
compilers to compile the poe_hello program. Notice that we're using a 
very simple compilation and explicitly using large pages (-blpdata), 
64-bit (-q64) and level 2 optimization (-O2).
| C: | mpxlc -blpdata -q64 -O2 -o poe_hello poe_hello.c | 
| Fortran: | mpxlf -blpdata -q64 -O2 -o poe_hello poe_hello.f  | 
- Set up your POE environment
- First, find out which POE environment variables have already been set for
    you by LC:
setenv  |  grep MP 
Some of the more important ones are discussed below:
 
 
| Environment Variable Setting | Description |  
| MP_LABELIO yes | Prepend task output with its task id number |  
| MP_RESD yes | Non-specific allocation  - let the Resource Manager decide which nodes to use. |  
| MP_SHARED_MEMORY yes | Use shared memory (not the switch) for intranode communications |  
| MP_COREFILE_FORMAT core.light | Produce light weight (small) corefiles |  
| MP_EUILIB us | User Space (fast) communications protocol |  
 
 
- Tell POE which pool to use and how many tasks to start:
setenv MP_RMPOOL name_of_pool
setenv MP_PROCS 4 
- Run your poe_hello executable
- This is the simple part.  Just issue the command:
poe_hello 
 
- Provided that everything is working and set up correctly, you should
    receive output that looks something like the following:
0:Total number of tasks = 4 
0:Hello! From task 0 on host berg05
1:Hello! From task 1 on host berg06
2:Hello! From task 2 on host berg07
3:Hello! From task 3 on host berg08
 
- Maximize your use of all 8 CPUs on a node
The previous step was the most "wasteful" way to run a POE program, since by
default, POE will load only one task on a node.  To make better use of the SMP nodes, try the following:
- Run 8 poe_hello tasks on each of 2 nodes.  Three different
    ways to do this are shown below, all of which use 
    command line flags.  The corresponding environment variables could be 
    used instead.  See the 
    POE man page for details.
    
    Method 1: Specify POE flags for number of nodes and number of tasks:
     
    poe_hello -nodes 2 -procs 16
    
     
    Method 2: Specify POE flags for number of tasks per node and 
       number of tasks:
     
    poe_hello  -tasks_per_node 8 -procs 16
    
     
    Method 3: Specify POE flags for number of nodes and number of 
        tasks per node: 
     
    unsetenv MP_PROCS
    poe_hello -nodes 2 -tasks_per_node 8
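The three methods are equivalent because, for an even distribution of tasks, the quantities are tied together by procs = nodes x tasks_per_node. The short Bourne shell (sh) sketch below only prints the three command lines for a given pair of values; it does not run POE, and the variable names are chosen for this illustration.

```shell
#!/bin/sh
# For an even task distribution:
#     procs = nodes * tasks_per_node
# so fixing any two of the values determines the third.
nodes=2
tasks_per_node=8
procs=$((nodes * tasks_per_node))

echo "Method 1: poe_hello -nodes $nodes -procs $procs"
echo "Method 2: poe_hello -tasks_per_node $tasks_per_node -procs $procs"
echo "Method 3: poe_hello -nodes $nodes -tasks_per_node $tasks_per_node"
```

With 2 nodes and 8 tasks per node this gives 16 tasks, matching the commands used in the exercise.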
 
- Try the poe_bandwidth exercise code
- Depending upon your language preference, compile the poe_bandwidth
    source file as shown:
 
| C: | mpxlc -blpdata -q64 -O2 -o poe_bandwidth poe_bandwidth.c |  
| Fortran: | mpxlf -blpdata -q64 -O2 -o poe_bandwidth poe_bandwidth.f  |  
 
 
- Specify using two tasks:
setenv MP_PROCS 2 
 
- Run the executable:
    poe_bandwidth 
 As the program runs, it will display the effective communications bandwidth
   between two nodes over the switch.
 
 
|  Sample output from poe_bandwidth (Fortran) |  | 
   0:  
   0: ****** MPI/POE Bandwidth Test ****** 
   0: Message start size=  100000
   0: Message finish size=  2000000
   0: Incremented by  100000  bytes per iteration
   0: Roundtrips per iteration=  100
   0: Task 0 running on: berg06                        
   0: Task 1 running on: berg07                        
   0:  
   0: Message Size   Bandwidth (bytes/sec)
   0:   100000       .6623E+09
   0:   200000       .9538E+09
   0:   300000       .1102E+10
   0:   400000       .1131E+10
   0:   500000       .1152E+10
   0:   600000       .1130E+10
   0:   700000       .1106E+10
   0:   800000       .1080E+10
   0:   900000       .1060E+10
   0:  1000000       .1053E+10
   0:  1100000       .1068E+10
   0:  1200000       .1069E+10
   0:  1300000       .1078E+10
   0:  1400000       .1084E+10
   0:  1500000       .1092E+10
   0:  1600000       .1098E+10
   0:  1700000       .1105E+10
   0:  1800000       .1105E+10
   0:  1900000       .1108E+10
   0:  2000000       .1112E+10
 |  
 
 
- Now, try running the executable again, but this time take advantage of
    RDMA (Remote Direct Memory Access) communications. RDMA communications
    move data directly from the memory of one task to another bypassing
    the CPUs.
setenv MP_USE_BULK_XFER yes 
poe_bandwidth 
 
- Notice the output.  You should see a significant increase in bandwidth.
 
|  Sample output from poe_bandwidth with RDMA (C language) |  | 
   0:
   0:****** MPI/POE Bandwidth Test ******
   0:Message start size= 100000 bytes
   0:Message finish size= 2000000 bytes
   0:Incremented by 100000 bytes per iteration
   0:Roundtrips per iteration= 100
   0:Task 0 running on: berg06
   0:Task 1 running on: berg07
   0:
   0:Message Size   Bandwidth (bytes/sec)
   0:   100000     6.533184e+08
   0:   200000     3.328522e+08
   0:   300000     1.768604e+09
   0:   400000     1.878590e+09
   0:   500000     1.991087e+09
   0:   600000     2.036268e+09
   0:   700000     2.092340e+09
   0:   800000     2.139869e+09
   0:   900000     2.131583e+09
   0:  1000000     2.194638e+09
   0:  1100000     2.199955e+09
   0:  1200000     2.227054e+09
   0:  1300000     2.238960e+09
   0:  1400000     2.215867e+09
   0:  1500000     2.266914e+09
   0:  1600000     2.265423e+09
   0:  1700000     2.282019e+09
   0:  1800000     2.294700e+09
   0:  1900000     2.283079e+09
   0:  2000000     2.305728e+09
 |  
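To put a number on the increase, the two sample outputs can be compared row by row. The sketch below is Bourne shell (sh) with awk, not csh, and the function name speedup is invented for this illustration. Since the two samples come from different language versions of the program, the ratios are indicative only.

```shell
#!/bin/sh
# speedup: read "size  bandwidth_without  bandwidth_with" rows on stdin
# and print the with/without ratio for each message size.
speedup() {
    awk '{ printf "%d %.2fx\n", $1, $3 / $2 }'
}

# Three rows copied from the sample outputs above
# (Fortran run without RDMA, C run with RDMA).
speedup <<'EOF'
500000   .1152E+10  1.991087e+09
1000000  .1053E+10  2.194638e+09
2000000  .1112E+10  2.305728e+09
EOF
```

For these rows the ratios come out at about 1.73x, 2.08x and 2.07x, that is, roughly double the bandwidth for the larger messages.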
 
- Determine per-task communication bandwidth behavior
    In this exercise, pairs of tasks, located on two different nodes,
    will communicate with each other.
- Compile the code:
 
| C: | mpxlc -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.c  |  
| Fortran: | mpxlf -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.f   |  
 
 
- Then use the smp_bandwidth code to determine per-task bandwidth 
    characteristics on an SMP node:
    smp_bandwidth -nodes 2 -procs 2
    smp_bandwidth -nodes 2 -procs 4
    smp_bandwidth -nodes 2 -procs 8
    smp_bandwidth -nodes 2 -procs 16
 
    What happens to the per-task bandwidth as the number of tasks increases?
 
- Optimize intra-node communication bandwidth
    When all of the task communications occur "on-node", it is 
    possible to optimize the effective per-task bandwidth by utilizing
    shared memory instead of the network.
- First, try it without shared memory (using the switch):
setenv MP_SHARED_MEMORY no
smp_bandwidth -nodes 1 -procs 4
smp_bandwidth -nodes 1 -procs 8  
 
 
- Now use shared memory:
setenv MP_SHARED_MEMORY yes
smp_bandwidth -nodes 1 -procs 4
smp_bandwidth -nodes 1 -procs 8
 
    What differences do you notice?
- Generate diagnostic/statistical information for your run.
- POE provides several environment variables / command flags that collect
diagnostic and statistical information about a job's run. Three of the more
useful ones are shown below. Try running a job after setting these as shown.
Direct stdout to a file so that you can easily read the output after the job 
runs.
setenv MP_SAVEHOSTFILE myhosts
setenv MP_PRINTENV yes
setenv MP_STATISTICS print
poe_bandwidth -nodes 2 -procs 2  > myoutput
 
- After the job completes, examine both the myhosts file and 
    myoutput file. The MP_PRINTENV environment variable can be
    particularly useful for troubleshooting since it tells you all of the POE 
    environment variable settings. See the POE man page if you have any 
    questions.
 
- Be sure to unset these variables when you're done to prevent cluttering
    your screen with their output for the remaining exercises.
unsetenv MP_SAVEHOSTFILE
unsetenv MP_PRINTENV
unsetenv MP_STATISTICS
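When reading a saved output file such as myoutput, it can help to look at one task at a time. With MP_LABELIO set to yes, each line starts with the task id and a colon, so a simple filter is enough. The sketch below is Bourne shell (sh), not csh, and the function name task_lines is invented for this illustration.

```shell
#!/bin/sh
# task_lines TASKID
# Hypothetical helper: pass through only the stdin lines labelled with
# TASKID, i.e. lines that start with optional blanks, TASKID, then ":".
task_lines() {
    grep "^ *$1:"
}

# Example on labelled lines like those printed by poe_hello:
task_lines 1 <<'EOF'
0:Total number of tasks = 4
0:Hello! From task 0 on host berg05
1:Hello! From task 1 on host berg06
EOF
```

The example prints only the line from task 1. On a real run, something like `task_lines 0 < myoutput` would show what task 0 wrote.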
 
- Try using POE's Multiple Program Multiple Data (MPMD) mode
POE allows you to load and run different executables on different nodes.  
This is controlled by the MP_PGMMODEL environment variable.
- First, set some environment variables:
 
| Environment Variable Setting | Description |  
| setenv MP_PGMMODEL mpmd | Specify MPMD mode |  
| setenv MP_PROCS 4 | Use 4 tasks again |  
| setenv MP_NODES 1 | Use one node for all four tasks |  
| setenv MP_STDOUTMODE ordered | Sort the output by task |  
 
 
- Then, simply issue the poe command. After a moment, you 
    will be prompted to enter your executables one
    at a time.  Notice that the machine name where the executable will
    run is displayed as part of the prompt.
    In any order you choose, enter these four program names. For example:
berg04{class01}64: poe
0031-503  Enter program name and flags for each node
0:berg05> prog1
1:berg05> prog4
2:berg05> prog3
3:berg05> prog2
Note: these four programs are just simple shell scripts used to 
    demonstrate how to use the MPMD programming model.
 
- After the last program name is entered, POE will run all four executables.
    Observe their different outputs. Note that the output is sorted by task.
    Which environment variable setting specifies this?
 
- Enter unsetenv MP_STDOUTMODE and repeat this example.
    What happens to the order of output now? This is the default behavior
    for MPI programs.
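If a run was captured without ordered output, labelled lines can still be grouped by task afterwards. The sketch below is Bourne shell (sh), not csh, and assumes that every line carries the task-id label produced by MP_LABELIO. The function name order_by_task and the interleaved sample lines are made up for this illustration.

```shell
#!/bin/sh
# order_by_task: group labelled lines on stdin by task id while keeping
# each task's own lines in their original order.
order_by_task() {
    tab=$(printf '\t')
    # Prefix every line with "task id <TAB> line number <TAB>", sort on
    # those two numeric keys, then strip the prefix again.
    awk -F: -v OFS="$tab" '{ print $1 + 0, NR, $0 }' |
        sort -t "$tab" -k1,1n -k2,2n |
        cut -f3-
}

# Interleaved lines in the style of the poe_hello output:
order_by_task <<'EOF'
1:Hello! From task 1 on host berg06
0:Total number of tasks = 4
3:Hello! From task 3 on host berg08
0:Hello! From task 0 on host berg05
2:Hello! From task 2 on host berg07
EOF
```

Carrying the line number as a second sort key keeps the result stable without relying on a sort option that may not be available on every system.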
IGNORE PART 16 - not currently working under Moab
- Try specific node allocation using a host list file
Generally speaking, there aren't many cases where you'll need to "manually"
select which nodes should be used to run your POE job.  This step will
demonstrate how to do it though, should you ever have the need.
- First, use your favorite UNIX editor and create a file in your POE
    executables directory.  Call it
    hostfile.  As its contents, enter 4 node names from
    the workshop node pool - one node name per line.
    The compute nodes in the workshop node pool are berg05, berg06,
    berg07 and berg08.
    You can use 4 different node names, or mix and match any way you like.
    Any of the examples below would be OK.
 
 
| Example 1 | Example 2 | Example 3 |  
| berg05 | berg06 | berg05 |  
| berg06 | berg06 | berg06 |  
| berg07 | berg08 | berg07 |  
| berg08 | berg08 | berg07 |  
 
 
- Set the appropriate POE environment variables which specify specific
    node allocation:
 
| Environment Variable Setting | Description |  
| setenv MP_HOSTFILE hostfile | Specify the host file you created |  
| setenv MP_SAVEHOSTFILE hosts_used | Save the names of the hosts used to run your program |  
| setenv MP_PGMMODEL spmd | Reset from mpmd used in the previous step |  
 
 
- Run the poe_hello executable again and observe the output.  
    You'll probably get an informational message that looks like:
ATTENTION: 0031-379  Pool setting ignored when hostfile used
    This is usual when you use specific node allocation in this manner.
 
- Check your program output and the contents of the hosts_used file, which
    was created when your program ran. Do the node names match the ones
    specified in your hostfile?
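The comparison can also be done mechanically. The sketch below is Bourne shell (sh), not csh. The function names hosts_only and same_hosts are invented for this illustration, and it assumes that lines beginning with "!" or "*" in either file are comments and that the host name is the first word on every other line.

```shell
#!/bin/sh
# hosts_only FILE
# Print the first word of every non-blank line, skipping lines that
# start with "!" or "*" (assumed here to be comment lines).
hosts_only() {
    grep -v '^[!*]' "$1" | awk 'NF { print $1 }'
}

# same_hosts FILE1 FILE2
# Succeed when both files name the same hosts in the same order.
same_hosts() {
    [ "$(hosts_only "$1")" = "$(hosts_only "$2")" ]
}

# Workshop usage:
#   same_hosts hostfile hosts_used && echo "hosts match"
```

If the saved file records fully qualified host names while your host file uses short names, the comparison would need to strip the domain part first.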
- Check out the LC online documentation (or at least know where to 
    find it):
computing.llnl.gov
   There are many things you 
   will find useful later when you're using one of LC's "real" SP systems.