IBM POWER Systems Overview

Table of Contents

  1. Abstract
  2. Evolution of IBM's POWER Architectures
  3. Hardware Overview
    1. System Components
    2. POWER4 Processor
    3. POWER5 Processor
    4. Nodes
    5. Frames
    6. Switch Network
    7. GPFS Parallel File System
  4. LC POWER Systems
  5. Software and Development Environment
  6. Parallel Operating Environment (POE) Overview
  7. Compilers
  8. MPI
  9. Running on LC's POWER Systems
    1. Large Pages
    2. SLURM
    3. Understanding Your System Configuration
    4. Setting POE Environment Variables
    5. Invoking the Executable
    6. Monitoring Job Status
    7. Interactive Job Specifics
    8. Batch Job Specifics
    9. Optimizing CPU Usage
    10. RDMA
    11. Other Considerations
      • Simultaneous Multi-Threading (SMT)
      • POE Co-Scheduler
  10. Debugging With TotalView
  11. References and More Information
  12. Exercise


Abstract


This tutorial provides an overview of IBM POWER hardware and software components with a practical emphasis on how to develop and run parallel programs on IBM POWER systems. It does not attempt to cover the entire range of IBM POWER products, however. Instead, it focuses on the types of IBM POWER machines and their environment as implemented by Livermore Computing (LC).

As a matter of historical interest, the tutorial begins with a succinct history of IBM's POWER architectures. Each of the major hardware components of a parallel POWER system is then discussed in detail, including processor architectures, frames, nodes and the internal high-speed switch network. A description of each of LC's IBM POWER systems follows.

The remainder, and majority, of the tutorial covers how to use an IBM POWER system for parallel programming, with an emphasis on IBM's Parallel Operating Environment (POE) software. POE provides the facilities for developing and running parallel Fortran and C/C++ programs on parallel POWER systems. POE components are explained and their usage is demonstrated. The tutorial concludes with a brief discussion of LC specifics and a mention of several miscellaneous POE components and tools. A lab exercise follows the presentation.

Level/Prerequisites: Intended for those who are new to developing parallel programs in the IBM POWER environment. A basic understanding of parallel programming in C or Fortran is assumed. The material covered by EC3501 - Introduction to Livermore Computing Resources would also be useful.



Evolution of IBM's POWER Architectures

This section provides a brief history of the IBM POWER architecture.

POWER1:
  • 1990: IBM announces the RISC System/6000 (RS/6000) family of superscalar workstations and servers based upon its new POWER architecture:
    • RISC = Reduced Instruction Set Computer
    • Superscalar = Multiple chip units (floating point unit, fixed point unit, load/store unit, etc.) execute instructions simultaneously with every clock cycle
    • POWER = Performance Optimized With Enhanced RISC

  • Initial configurations had a clock speed of 25 MHz, single floating point and fixed point units, and a peak performance of 50 MFLOPS.

  • Clusters are not new: networked configurations of POWER machines became common as distributed memory parallel computing started to become popular.

SP1:

  • IBM's first SP (Scalable POWERparallel) system was the SP1, the logical evolution of clustered POWER1 computing. It was also short-lived, serving as a foot in the door to the rapidly growing market of distributed computing; the SP2, which followed shortly after, was IBM's real entry into that market.

  • The key innovations of the SP1 included:
    • Reduced footprint: all of those real-estate consuming stand-alone POWER1 machines were put into a rack
    • Reduced maintenance: new software and hardware made it possible for a system administrator to manage many machines from a single console
    • High-performance interprocessor communications over an internal switch network
    • Parallel Environment software made it much easier to develop and run distributed memory parallel programs

  • The SP1 POWER1 processor had a 62.5 MHz clock with a peak performance of 125 MFLOPS.

POWER2 and SP2:

  • 1993: Continued improvements in the POWER1 processor architecture led to the POWER2 processor. Some of the POWER2 processor improvements included:
    • Floating point and fixed point units increased to two each
    • Increased data cache size
    • Increased memory to cache bandwidth
    • Clock speed of 66.5 MHz with peak performance of 254 MFLOPS
    • Improved instruction set (quad-word load/store, zero-cycle branches, hardware square root, etc)

  • Lessons learned from the SP1 led to the SP2, which incorporated the improved POWER2 processor.

  • SP2 improvements were directed at greater scalability and included:
    • Better system and system management software
    • Improved Parallel Environment software for users
    • Higher bandwidth internal switch network
IBM RS/6000 processors





IBM SP

P2SC:

PowerPC:

POWER3:
  • 1998: The POWER3 SMP architecture is announced. POWER3 represents a merging of the POWER2 uniprocessor architecture and the PowerPC SMP architecture.

  • Key improvements:
    • 64-bit architecture
    • Increased clock speeds
    • Increased memory, cache, disk, I/O slots, memory bandwidth....
    • Increased number of SMP processors

  • Several varieties were produced with very different specs. At the time they were made available, they were known as Winterhawk-1, Winterhawk-2, Nighthawk-1 and Nighthawk-2 nodes.

  • ASC White was based upon the POWER3 (Nighthawk-2) processor. Like ASC Blue-Pacific, ASC White also ranked as the world's #1 computer at one time.

  • Additional details
IBM POWER3

POWER4:
  • In 2001 IBM introduced its 64-bit POWER4 architecture. It is very different from its POWER3 predecessor.

  • The basic building block is a two processor SMP chip with shared L2 cache. Four chips are then joined to make an 8-way SMP "module". Combining modules creates 16, 24 and 32-way SMP machines.

  • Key improvements over POWER3 include:
    • Increased CPUs - up to 32 per node
    • Faster clock speeds - over 1 GHz. Later POWER4 models reached 1.9 GHz.
    • Increased memory, L2 cache, disk, I/O slots, memory bandwidth....
    • New L3 cache - logically shared between modules
IBM pSeries POWER4

POWER5:

IBM, POWER and Linux:

To Sum It Up, an IBM POWER Timeline:

But What About the Future of POWER?

BlueGene:



Hardware

System Components



Hardware

POWER4 Processor

Architecture:

Modules:



Hardware

POWER5 Processor

Architecture:

Modules:

ASC Purple Chips and Modules:
  • ASC Purple compute nodes are p5 575 nodes, which differ from standard p5 nodes in that only one core of each dual-core chip is active.

  • With only one active CPU per chip, the entire L2 and L3 cache is dedicated to that CPU. This design benefits scientific HPC applications by providing better CPU-memory bandwidth.

  • ASC Purple nodes are built from Dual-chip Modules (DCMs). Each node has a total of eight DCMs. A photo showing these appears in the next section below.
p5 575 DCM


Hardware

Nodes

Node Characteristics:

Node Types:

p5 575 Node:



Hardware

Frames

Frame Characteristics:

ASC Purple Frames:



Hardware

Switch Network

Quick Intro:

Topology:

Switch Network Characteristics:

Switch Drawer:

Switch Board:

  • The switch board is really the heart of the switch network. The main features of the switch board are listed below.

  • There are 8 logical Switch Chips, each of which is connected to 4 other Switch Chips to form an internal 4x4 crossbar switch.

  • A total of 32 ports, controlled by Link Driver Chips on riser cards, are used to connect to nodes and/or other switch boards.

  • Depending upon how the Switch Board is used, it will be called a Node Switch Board (NSB) or Intermediate Switch Board (ISB):
    • NSB: 16 ports are configured for node connections. The other 16 ports are configured for connections to switch boards in other frames.
    • ISB: all ports are used to cascade to other switch boards.
    • Practically speaking, the distinction between an NSB and ISB is only one of topology. An ISB is just located higher up in the network hierarchy.

  • Switch-node connections are by copper cable. Switch-switch connections can be either copper or optical fiber cable.

  • Minimal hardware latency: approximately 59 nanoseconds to cross each Switch Chip.

  • Two simple configurations (96-node and 128-node systems) using both NSB and ISB switch boards are shown below. The number "4" refers to the number of ports connecting each ISB to each NSB. Nodes are not shown, but each NSB may connect to 16 nodes.
HPS Switch Board

Switch Network Interface (SNI): SNI diagram

Switch Communication Protocols:

Switch Application Performance:



Hardware

GPFS Parallel File System

Overview:

LC Configuration Details:



LC POWER Systems


General Configuration:


SCF POWER Systems

ASC PURPLE:

UM and UV:

TEMPEST:


OCF POWER Systems

UP:

BERG, NEWBERG:



Software and Development Environment


The software and development environment for the IBM SPs at LC is similar to what is described in the Introduction to LC Resources tutorial. Items specific to the IBM SPs are discussed below.

AIX Operating System:

Parallel Environment:

Compilers:

Math Libraries Specific to IBM SPs:

Batch Systems:

User Filesystems:

Software Tools:

Video and Graphics Services:



Parallel Operating Environment (POE) Overview


Most of what you'll do on any parallel IBM AIX POWER system will be under IBM's Parallel Operating Environment (POE) software. This section provides a quick overview. Other sections provide the details for actually using POE.

PE vs POE:

Types of Parallelism Supported:

Interactive and Batch:

Typical Usage Progression:

A Few Miscellaneous Words About POE:

Some POE Terminology:



Compilers


Compilers and Compiler Scripts:

Compiler Syntax:

Common Compiler Invocation Commands:
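For illustration only (the available compiler scripts and recommended options vary by system and compiler release; mpcc_r, mpCC_r and mpxlf_r are the usual thread-safe MPI wrapper scripts on AIX):

    mpcc_r   -O2 -q64 -o myprog  myprog.c    # C with MPI
    mpCC_r   -O2 -q64 -o myprog  myprog.C    # C++ with MPI
    mpxlf_r  -O2 -q64 -o myprog  myprog.f    # Fortran with MPI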

Compiler Options:

32-bit versus 64-bit:

Optimization:

Miscellaneous:

See the IBM Documentation - Really!



MPI


Implementations:

Notes:



Running on LC's POWER Systems

Large Pages

Large Page Overview:

Large Pages and Purple:

How to Enable Large Pages:
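A sketch of the usual AIX mechanisms (verify against LC documentation before relying on them): request large pages at link time, mark an already-built executable, or ask for large pages at run time.

    mpcc_r -o myprog myprog.c -blpdata     # request large-page data at link time
    ldedit -blpdata myprog                 # or mark an existing executable
    setenv LDR_CNTRL LARGE_PAGE_DATA=Y     # or request large pages at run time (csh)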

When NOT to Use Large Pages:

Miscellaneous Large Page Info:



Running on LC's POWER Systems

SLURM

LC's Batch Schedulers:

SLURM Architecture:

SLURM Commands:
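A few core SLURM user commands, shown as a generic sketch (options and any LC-specific wrappers are covered in LC's SLURM documentation):

    sinfo              # show partitions and node states
    squeue             # list queued and running jobs
    squeue -u <user>   # list only your jobs
    scancel <jobid>    # cancel/kill a job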

SLURM Environment Variables:

Additional Information:



Running on LC's POWER Systems

Understanding Your System Configuration

First Things First:

System Configuration/Status Information:

LC Configuration Commands:

IBM Configuration Commands:
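For example, these standard AIX commands report node configuration (a sketch; available options depend on the AIX level installed):

    prtconf            # summary of processors, memory, firmware, devices
    lscfg              # list installed hardware
    lsattr -El sys0    # system attributes, including real memory size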



Running on LC's POWER Systems

Setting POE Environment Variables

In General:
Note: different versions of the POE software do not support identical sets of environment variables. Things change.

How to Set POE Environment Variables:

Basic Interactive POE Environment Variables:

Example Basic Interactive Environment Variable Settings:
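As a hedged sketch (csh syntax; MP_PROCS, MP_NODES, MP_RMPOOL and MP_EUILIB are standard POE variables, but verify them against the POE version on your system):

    setenv MP_PROCS 8        # number of parallel tasks
    setenv MP_NODES 2        # number of nodes to run them on
    setenv MP_RMPOOL 0       # resource manager pool to draw nodes from
    setenv MP_EUILIB us      # User Space switch protocol (use ip for IP)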

Other Common/Useful POE Environment Variables:

LLNL Preset POE Environment Variables:



Running on LC's POWER Systems

Invoking the Executable

Syntax:
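In its most basic form, the poe command takes the executable, followed by the program's own arguments and then any POE options. A minimal sketch (myprog is a hypothetical executable; most POE command-line flags mirror the corresponding MP_ environment variables with the prefix dropped):

    poe myprog [program_args] [poe_options]
    poe myprog -procs 8 -nodes 2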

Multiple Program Multiple Data (MPMD) Programs:
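One way this is commonly done, sketched here as an assumption to be checked against your POE documentation (my_cmdfile is a hypothetical file naming the executable for each task):

    setenv MP_PGMMODEL mpmd          # run in MPMD mode
    setenv MP_CMDFILE  my_cmdfile    # command file listing one executable per task
    poe -procs 4                     # poe reads the executables from the command file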

Using POE with Serial Programs:

POE Error Messages:



Running on LC's POWER Systems

Monitoring Job Status



Running on LC's POWER Systems

Interactive Job Specifics

The pdebug Interactive Pool/Partition:

Insufficient Resources:

Killing Interactive Jobs:

Running on LC's POWER Systems

Batch Job Specifics

Things Are Changing:

Submitting Batch Jobs:

Quick Summary of Common LCRM Batch Commands:
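For instance (a sketch only; myjobscript is a hypothetical script, and current options are described in the LCRM documentation):

    psub myjobscript   # submit a batch job script
    pstat              # check the status of your batch jobs
    prm <jobid>        # remove (kill) a batch job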

Batch Jobs and POE Environment Variables:

Killing Batch Jobs:



Running on LC's POWER Systems

Optimizing CPU Usage

SMP Nodes:

Effectively Using Available CPUs:

When Not to Use All CPUs:



Running on LC's POWER Systems

RDMA

What is RDMA?

How to Use RDMA:
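A hedged sketch only; MP_USE_BULK_XFER and MP_BULK_MIN_MSG_SIZE are the usual POE controls for bulk (RDMA) transfer, but confirm them against the POE version installed on your system:

    setenv MP_USE_BULK_XFER yes           # enable RDMA / bulk transfer
    setenv MP_BULK_MIN_MSG_SIZE 65536     # smallest message size (bytes) that uses RDMA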



Running on LC's POWER Systems

Other Considerations

Simultaneous Multi-Threading (SMT)

POE Co-Scheduler



Debugging With TotalView


TotalView windows

The Very Basics:

  1. Be sure to compile your program with the -g option (a compile-and-launch example follows these steps)

  2. When starting TotalView, specify the poe process and then use TotalView's -a option for your program and any other arguments (including POE arguments). For example:
    totalview poe -a myprog -procs 4
  3. TotalView will then load the poe process and open its Root and Process windows as usual. Note that the poe process appears in the Process Window.

  4. Use the Go command in the Process Window to start poe with your executable.

  5. TotalView will then attempt to acquire your partition and load your job. When it is ready to run your job, you will be prompted about stopping your parallel job (below). In most cases, answering yes is the right thing to do.

    TotalView Prompt

  6. Your executable should then appear in the Process Window. You are now ready to begin debugging your parallel program.

  7. For debugging in batch, see Batch System Debugging in LC's TotalView tutorial.
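
Putting steps 1 and 2 together, a minimal session might look like the following (assuming an MPI C program myprog.c and the mpcc_r compiler script; adjust for your language and compiler):

    mpcc_r -g -o myprog myprog.c
    totalview poe -a myprog -procs 4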

A Couple of LC-Specific Notes:

TotalView and Large Pages:


This completes the tutorial.




References and More Information