IBM POWER Systems Overview Exercise

  1. Log in to the workshop machine

    Workshops differ in how this is done. The instructor will go over this beforehand.

    For this workshop, we will be using an IBM POWER4 system called "newberg". It is physically a single 32-CPU machine that has been configured to look like a 4-node, 8-CPU-per-node cluster. Newberg is actually a login alias for berg04. The compute nodes are berg05, berg06, berg07 and berg08.

    Note that unlike LC production IBM POWER systems, the workshop machine:

    • Allows multiple users to run on the same node simultaneously. It also allows "over-subscription" of a node; that is, it will allow users to run more parallel tasks than there are CPUs on a node.
    • Runs the Moab batch system instead of LCRM.
    • Permits interactive jobs to be run in the batch pool.

  2. Copy the example files

    1. In your home directory, create a subdirectory for the POE example codes and then cd to it.

      mkdir poe 
      cd  poe 

    2. Copy either the Fortran or the C version of the exercise files to your poe subdirectory:

      C:
      cp  /usr/global/docs/training/blaise/poe/C/*   ~/poe
      Fortran:
      cp  /usr/global/docs/training/blaise/poe/Fortran/*   ~/poe 
      

  3. List the contents of your poe subdirectory

    You should have the following files:

    C Files            Fortran Files      Description
    poe_hello.c        poe_hello.f        Simple MPI program which prints a
                                          task's rank and hostname.
    poe_bandwidth.c    poe_bandwidth.f    An MPI communications bandwidth test
                                          between two tasks only.
    smp_bandwidth.c    smp_bandwidth.f    An MPI communications bandwidth test
                                          between any even number of tasks.
    prog1, prog2,      prog1, prog2,      Simple shell scripts used for MPMD
    prog3, prog4       prog3, prog4       mode.
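
    For orientation, below is a minimal sketch of what the poe_hello program
    might look like in C. This is an illustration only, assuming a standard
    MPI hello-world structure; the actual workshop source may differ.

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[])
      {
          int rank, ntasks, len;
          char hostname[MPI_MAX_PROCESSOR_NAME];

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this task's id */
          MPI_Comm_size(MPI_COMM_WORLD, &ntasks); /* total number of tasks */
          MPI_Get_processor_name(hostname, &len);

          if (rank == 0)
              printf("Total number of tasks = %d\n", ntasks);
          printf("Hello! From task %d on host %s\n", rank, hostname);

          MPI_Finalize();
          return 0;
      }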

  4. Understand your system configuration

    1. Display the pool configuration for the workshop machine. Use each of the commands below, noting their similarities and differences:
      mjstat
      sinfo
      ju
      Questions:
      • What pool(s) are configured?
      • How many nodes are in each pool - total, available, free?
      • How many CPUs do the nodes have?
      • How much memory does each node have?
      • What are the names of the nodes in the pool?

    2. Try the command news job.lim.berg to find additional configuration information for the workshop machine.

  5. Authentication

    1. LC has already taken care of this step for you; you need to do nothing.

    2. You can verify (if you want) that LLNL has authorized you to use these nodes. Check the /etc/hosts.equiv file. It should contain the names of all nodes in the system.

  6. Find out what compilers are available

    1. See the LC Compiler web page at: computing.llnl.gov/code/compilers.html

    2. Click on "newberg". Note also that this page shows compiler information for all of LC's production systems.

  7. Compile the poe_hello program

    Depending upon your language preference, use one of the IBM parallel compilers to compile the poe_hello program. Notice that we're using a very simple compilation that explicitly enables large pages (-blpdata), 64-bit mode (-q64) and level 2 optimization (-O2).

    C:
    mpxlc -blpdata -q64 -O2 -o poe_hello poe_hello.c
    Fortran:
    mpxlf -blpdata -q64 -O2 -o poe_hello poe_hello.f 

  8. Set up your POE environment

    1. First, find out which POE environment variables have already been set for you by LC:
      setenv  |  grep MP

      Some of the more important ones are discussed below:

      Environment Variable   Setting      Description
      MP_LABELIO             yes          Prepend task output with its task id
                                          number
      MP_RESD                yes          Non-specific allocation: let the
                                          Resource Manager decide which nodes
                                          to use
      MP_SHARED_MEMORY       yes          Use shared memory (not the switch)
                                          for intranode communications
      MP_COREFILE_FORMAT     core.light   Produce lightweight (small) corefiles
      MP_EUILIB              us           User Space (fast) communications
                                          protocol

    2. Tell POE which pool to use and how many tasks to start:
      setenv MP_RMPOOL name_of_pool
      setenv MP_PROCS 4

  9. Run your poe_hello executable

    1. This is the simple part. Just issue the command:
      poe_hello

    2. Provided that everything is working and set up correctly, you should receive output that looks something like the following:
      0:Total number of tasks = 4 
      0:Hello! From task 0 on host berg05
      1:Hello! From task 1 on host berg06
      2:Hello! From task 2 on host berg07
      3:Hello! From task 3 on host berg08
      

  10. Maximize your use of all 8 CPUs on a node

    The previous step was the most "wasteful" way to run a POE program, since by default, POE will load only one task on a node. To make better use of the SMP nodes, try the following:

    1. Run 8 poe_hello tasks on each of 2 nodes. Three different ways to do this are shown below, all of which use command line flags. The corresponding environment variables could be used instead. See the POE man page for details.

      Method 1: Specify POE flags for number of nodes and number of tasks:

      poe_hello -nodes 2 -procs 16

      Method 2: Specify POE flags for number of tasks per node and number of tasks:

      poe_hello -tasks_per_node 8 -procs 16

      Method 3: Specify POE flags for number of nodes and number of tasks per node:

      unsetenv MP_PROCS
      poe_hello -nodes 2 -tasks_per_node 8

  11. Try the poe_bandwidth exercise code

    1. Depending upon your language preference, compile the poe_bandwidth source file as shown:

      C:
      mpxlc -blpdata -q64 -O2 -o poe_bandwidth poe_bandwidth.c
      Fortran:
      mpxlf -blpdata -q64 -O2 -o poe_bandwidth poe_bandwidth.f 

    2. Specify using two tasks:
      setenv MP_PROCS 2

    3. Run the executable:

      poe_bandwidth

      As the program runs, it will display the effective communications bandwidth between two nodes over the switch.
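
      The core of poe_bandwidth is presumably a ping-pong loop: task 0 sends a
      message, task 1 echoes it back, and bandwidth is the total bytes moved
      divided by the elapsed time. A minimal sketch of that logic in C is shown
      below; the message sizes and roundtrip count match the sample output that
      follows, but the rest is an assumption, not the exact workshop source.

        #include <stdio.h>
        #include <stdlib.h>
        #include <mpi.h>

        #define ROUNDTRIPS 100    /* roundtrips per message size */

        int main(int argc, char *argv[])
        {
            int rank, size, i;
            double t0, elapsed;
            char *buf;
            MPI_Status status;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            for (size = 100000; size <= 2000000; size += 100000) {
                buf = malloc(size);
                t0 = MPI_Wtime();
                for (i = 0; i < ROUNDTRIPS; i++) {
                    if (rank == 0) {          /* send, then wait for the echo */
                        MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                        MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
                    } else if (rank == 1) {   /* echo the message back */
                        MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                        MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                    }
                }
                elapsed = MPI_Wtime() - t0;
                if (rank == 0)   /* two transfers of 'size' bytes per roundtrip */
                    printf("%9d   %e\n", size, 2.0 * size * ROUNDTRIPS / elapsed);
                free(buf);
            }
            MPI_Finalize();
            return 0;
        }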

      Sample output from poe_bandwidth (Fortran)
         0:  
         0: ****** MPI/POE Bandwidth Test ****** 
         0: Message start size=  100000
         0: Message finish size=  2000000
         0: Incremented by  100000  bytes per iteration
         0: Roundtrips per iteration=  100
         0: Task 0 running on: berg06                        
         0: Task 1 running on: berg07                        
         0:  
         0: Message Size   Bandwidth (bytes/sec)
         0:   100000       .6623E+09
         0:   200000       .9538E+09
         0:   300000       .1102E+10
         0:   400000       .1131E+10
         0:   500000       .1152E+10
         0:   600000       .1130E+10
         0:   700000       .1106E+10
         0:   800000       .1080E+10
         0:   900000       .1060E+10
         0:  1000000       .1053E+10
         0:  1100000       .1068E+10
         0:  1200000       .1069E+10
         0:  1300000       .1078E+10
         0:  1400000       .1084E+10
         0:  1500000       .1092E+10
         0:  1600000       .1098E+10
         0:  1700000       .1105E+10
         0:  1800000       .1105E+10
         0:  1900000       .1108E+10
         0:  2000000       .1112E+10
      

    4. Now, try running the executable again, but this time take advantage of RDMA (Remote Direct Memory Access) communications. RDMA communications move data directly from the memory of one task to another, bypassing the CPUs.
      setenv MP_USE_BULK_XFER yes
      poe_bandwidth

    5. Notice the output. You should see a significant increase in bandwidth.

      Sample output from poe_bandwidth with RDMA (C language)
         0:
         0:****** MPI/POE Bandwidth Test ******
         0:Message start size= 100000 bytes
         0:Message finish size= 2000000 bytes
         0:Incremented by 100000 bytes per iteration
         0:Roundtrips per iteration= 100
         0:Task 0 running on: berg06
         0:Task 1 running on: berg07
         0:
         0:Message Size   Bandwidth (bytes/sec)
         0:   100000     6.533184e+08
         0:   200000     3.328522e+08
         0:   300000     1.768604e+09
         0:   400000     1.878590e+09
         0:   500000     1.991087e+09
         0:   600000     2.036268e+09
         0:   700000     2.092340e+09
         0:   800000     2.139869e+09
         0:   900000     2.131583e+09
         0:  1000000     2.194638e+09
         0:  1100000     2.199955e+09
         0:  1200000     2.227054e+09
         0:  1300000     2.238960e+09
         0:  1400000     2.215867e+09
         0:  1500000     2.266914e+09
         0:  1600000     2.265423e+09
         0:  1700000     2.282019e+09
         0:  1800000     2.294700e+09
         0:  1900000     2.283079e+09
         0:  2000000     2.305728e+09
      

  12. Determine per-task communication bandwidth behavior

    In this exercise, pairs of tasks, located on two different nodes, will communicate with each other.
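
    How the tasks might be paired (an assumption about smp_bandwidth's
    internals, not the exact workshop source): with an even number of tasks
    split across two nodes, each task in the lower half can pair with the
    corresponding task in the upper half, as in the C sketch below.

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[])
      {
          int rank, ntasks, partner;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

          /* lower half of the tasks pairs with the upper half */
          if (rank < ntasks / 2)
              partner = rank + ntasks / 2;
          else
              partner = rank - ntasks / 2;

          printf("task %d pairs with task %d\n", rank, partner);
          /* each pair would then run a ping-pong timing loop like
             the one sketched in step 11 */
          MPI_Finalize();
          return 0;
      }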

    1. Compile the code:

      C:
      mpxlc -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.c 
      Fortran:
      mpxlf -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.f  

    2. Then use the smp_bandwidth code to determine per-task bandwidth characteristics on an SMP node:

      smp_bandwidth -nodes 2 -procs 2
      smp_bandwidth -nodes 2 -procs 4
      smp_bandwidth -nodes 2 -procs 8
      smp_bandwidth -nodes 2 -procs 16

      What happens to the per-task bandwidth as the number of tasks increases?

  13. Optimize intra-node communication bandwidth

    When all of the task communications occur "on-node", it is possible to optimize the effective per-task bandwidth by utilizing shared memory instead of the network.

    1. First, try it without shared memory (using the switch):
      setenv MP_SHARED_MEMORY no
      smp_bandwidth -nodes 1 -procs 4
      smp_bandwidth -nodes 1 -procs 8 

    2. Now use shared memory:
      setenv MP_SHARED_MEMORY yes
      smp_bandwidth -nodes 1 -procs 4
      smp_bandwidth -nodes 1 -procs 8 
      What differences do you notice?

  14. Generate diagnostic/statistical information for your run

    1. POE provides several environment variables / command flags that collect diagnostic and statistical information about a job's run. Three of the more useful ones are shown below. Try running a job after setting these as shown. Direct stdout to a file so that you can easily read the output after the job runs.
      setenv MP_SAVEHOSTFILE myhosts
      setenv MP_PRINTENV yes
      setenv MP_STATISTICS print
      poe_bandwidth -nodes 2 -procs 2  > myoutput
      
    2. After the job completes, examine both the myhosts file and myoutput file. The MP_PRINTENV environment variable can be particularly useful for troubleshooting since it tells you all of the POE environment variable settings. See the POE man page if you have any questions.

    3. Be sure to unset these variables when you're done to prevent cluttering your screen with their output for the remaining exercises.
      unsetenv MP_SAVEHOSTFILE
      unsetenv MP_PRINTENV
      unsetenv MP_STATISTICS
      

  15. Try using POE's Multiple Program Multiple Data (MPMD) mode

    POE allows you to load and run different executables on different nodes. This is controlled by the MP_PGMMODEL environment variable.

    1. First, set some environment variables:

      Environment Variable Setting    Description
      setenv MP_PGMMODEL mpmd         Specify MPMD mode
      setenv MP_PROCS 4               Use 4 tasks again
      setenv MP_NODES 1               Use one node for all four tasks
      setenv MP_STDOUTMODE ordered    Sort the output by task

    2. Then, simply issue the poe command. After a moment, you will be prompted to enter your executables one at a time. Notice that the machine name where the executable will run is displayed as part of the prompt. In any order you choose, enter these four program names. For example:
      berg04{class01}64: poe
      0031-503  Enter program name and flags for each node
      0:berg05> prog1
      1:berg05> prog4
      2:berg05> prog3
      3:berg05> prog2
      
      Note: these four programs are just simple shell scripts used to demonstrate how to use the MPMD programming model.
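
      The actual progN files are shell scripts, but a hypothetical C stand-in
      like the one below shows the same idea: under MPMD, each task runs
      whichever executable you loaded onto it, and POE exposes each task's id
      through the MP_CHILD environment variable (a script would read
      $MP_CHILD). Treat this as an illustrative sketch, not the workshop
      scripts themselves.

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            /* POE sets MP_CHILD to the task's id number for each
               spawned task */
            char *task = getenv("MP_CHILD");

            printf("prog1 here, running as task %s\n", task ? task : "?");
            return 0;
        }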

    3. After the last program name is entered, POE will run all four executables. Observe their different outputs. Note that the output is sorted by task. Which environment variable setting specifies this?

    4. Enter unsetenv MP_STDOUTMODE and repeat this example. What happens to the order of the output now? This is the default behavior for MPI programs.

    IGNORE PART 16 - not currently working under Moab

  16. Try specific node allocation using a host list file

    Generally speaking, there aren't many cases where you'll need to "manually" select which nodes should be used to run your POE job. This step will demonstrate how to do it though, should you ever have the need.

    1. First, use your favorite UNIX editor to create a file in your POE executables directory. Call it hostfile. As its contents, enter 4 node names from the workshop node pool, one node name per line. (The workshop pool nodes are berg05, berg06, berg07 and berg08.)

      You can use 4 different node names, or mix and match any way you like. For example, any of the examples below would be OK.

      Example 1     Example 2     Example 3
      berg05        berg06        berg05
      berg06        berg06        berg06
      berg07        berg08        berg07
      berg08        berg08        berg07

    2. Set the appropriate POE environment variables for specific node allocation:

      Environment Variable Setting         Description
      setenv MP_HOSTFILE hostfile          Specify the host file you created
      setenv MP_SAVEHOSTFILE hosts_used    Save the names of the hosts used to
                                           run your program
      setenv MP_PGMMODEL spmd              Reset from mpmd used in the previous
                                           step

    3. Run the poe_hello executable again and observe the output. You'll probably get an informational message that looks like:
      ATTENTION: 0031-379  Pool setting ignored when hostfile used
      This message is normal when you use specific node allocation in this manner.

      Check your program output. Do the reported hosts match what you specified in your hostfile?

    4. Check your hosts_used file, which was created when your program ran. Do the names match the ones specified in your hostfile?

  17. Check out the LC online documentation (or at least know where to find it):

    computing.llnl.gov

    There are many things there you will find useful later when you're using one of LC's "real" SP systems.


This completes the exercise.

Please complete the online evaluation form if you have not already done so for this tutorial.
