IBM Books

Diagnosis Guide


Runtime notification methods

PSSP provides tools to monitor system status and conditions at run time, while the system administrator is actively watching the system. These tools are used when the system administrator wants to know the current status of system resources immediately, or to be notified immediately of problems and potential trouble situations.

Two sets of runtime tools are available. The choice of the tools depends on the capabilities of the system administrator's workstation and the system administrator's preferences. PSSP provides graphical tools for use on the control workstation or network-attached terminals. PSSP also provides command-line tools for those situations when only modem access or s1term access is available.

Graphical tools - SP Perspectives

PSSP provides graphical tools for system administration and monitoring through the SP Perspectives tool suite. Perspectives is engineered for ease-of-use for the system administrator. In order to be used effectively, SP Perspectives requires X11 graphics capable terminals or workstations and high-speed connections. Use Perspectives when monitoring the SP system from the control workstation or from a network-attached workstation.

The basic concepts of Perspectives and examples of its use are included in the SP Perspectives chapter of PSSP: Administration Guide. Perspectives also provides extensive online help information. To understand how to accomplish the tasks that are presented in this chapter, consult the SP Perspectives online help, using this section as a guide to the online help topics.

Individual Perspectives require that certain subsystems be operating and that the user is authorized to communicate with them. Such subsystems include the System Monitor, Event Management, System Data Repository, and Problem Management. For authorization required for each Perspective, see the discussion on using SP Perspectives in PSSP: Administration Guide.

The Perspectives launch pad is started using the perspectives command, which resides in the /usr/lpp/ssp/bin directory. Other Perspectives, such as the Event or Hardware Perspective, can be started from the launch pad. Before starting an SP Perspective, be sure that the DISPLAY environment variable is set to the machine on which you want to display the SP Perspective. Also, be sure that you are permitted to display to that machine by running the xhost command on that machine.
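The launch sequence described above can be sketched as a small shell helper. This is only an illustration, assuming a Korn or POSIX shell: the function name launch_perspectives, the PERSPECTIVES_CMD override, and the host names adminws and cws_hostname are hypothetical and are not part of PSSP.

```shell
# Refuse to launch when DISPLAY is unset, since the launch pad needs an X display.
# PERSPECTIVES_CMD defaults to the path given above; the override exists only
# so that the helper can be exercised on a machine without PSSP installed.
launch_perspectives() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set; export DISPLAY=<host>:0 first" >&2
        return 1
    fi
    "${PERSPECTIVES_CMD:-/usr/lpp/ssp/bin/perspectives}" &
}

# Example use (adminws and cws_hostname are hypothetical names):
#   on adminws:                    xhost +cws_hostname
#   on the control workstation:    export DISPLAY=adminws:0 ; launch_perspectives
```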

Two Perspectives tools are useful for monitoring the system status and detecting problem situations:

The SP Event Perspective

This tool allows the user to specify system conditions that are of concern or importance, and to indicate what actions are to be taken when the condition exists. The Perspective interfaces with the Event Management software subsystem to monitor these conditions and alert the Perspective to the presence of the condition. To effectively use this Perspective, you must understand certain terminology.

Condition
The circumstances within the system that are of interest to the system administrator. Conditions can be created, viewed, and modified through the Conditions pane in the SP Event Perspective. To specify a condition, the system administrator must provide the necessary components to form the condition, including an event expression and, optionally, a rearm expression. Their definitions follow.

The rearm expression indicates when the SP Event Perspective should consider the event to have "stopped". For example, a file system is considered "almost full" when the available space is less than 10% of its capacity. The system administrator may want to consider the condition to exist until the available space reaches 13% of the file system's capacity. The event expression would then be set to 10% and the rearm expression to 13%. As with the event expression, the system administrator can indicate an action to take when the rearm expression occurs, such as deactivating reserve resources that had been activated when the event occurred.

Event Expression
A relational expression that specifies the circumstances under which an event is generated.

Rearm Expression
A relational expression that specifies that the condition that triggered the event is no longer true. It is usually the inverse of the event expression.

Event Definition
An association made by the system administrator between a condition and a response to the presence of that condition.

Registration
The activation of an event definition. By registering an event definition, the system administrator instructs the Perspective to begin monitoring for the condition and to take the associated action if the condition should occur.

Once the user registers the event definition, the action will be run whenever the event or rearm expression occurs. This is independent of whether the Event Perspective is active at the time that the event or rearm expression occurs.

Event
A change in the state of a system resource. For the purposes of this discussion, an event is more narrowly defined as the presence of the condition within the system.
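The way an event expression and a rearm expression work together can be illustrated with a small simulation. The sketch below is not how Event Management evaluates expressions; it only mimics the 10% and 13% file system example given under Condition. The function name watch_free_space and the EVENT_BELOW and REARM_AT variables are hypothetical.

```shell
# Read one free-space percentage per line and report event/rearm transitions.
# EVENT_BELOW and REARM_AT mirror the 10% and 13% figures used in the example.
watch_free_space() {
    awk -v event_below="${EVENT_BELOW:-10}" -v rearm_at="${REARM_AT:-13}" '
        BEGIN { armed = 1 }     # armed: waiting for the event expression to become true
        armed && $1 < event_below  { print "event at " $1 "% free"; armed = 0; next }
        !armed && $1 >= rearm_at   { print "rearm at " $1 "% free"; armed = 1 }
    '
}

# Example use:
#   printf '20\n9\n8\n11\n12\n13\n9\n' | watch_free_space
```

In this sketch, readings of 11% and 12% produce no output after the event has occurred, because the condition is considered to exist until the rearm expression becomes true.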

To start the SP Event Perspective, double click on the Event Perspective icon in the SP Perspectives launch pad window.

Users can create conditions for situations that are important to them through the Conditions pane of the SP Event Perspective. A number of default conditions are provided through the SP Event Perspective. You may wish to add more or to tailor the predefined conditions to meet the specific needs of your particular SP installation. The Perspectives online help provides assistance on how to create conditions and how to modify existing conditions. To access this help, click on the Help button from the SP Event Perspective display, and select the Tasks... option. Assistance in handling conditions is available through the Working with Conditions topic.

Once a condition is defined through the SP Event Perspective, an action can then be associated with it. The action may be as simple as a visual notification that the event has occurred, or the action can be more sophisticated, including automatically invoking a command in response to the event. To associate the appropriate action with the presence of the condition (or with the absence of the condition), an event definition must be created. You can create these definitions and examine default definitions through the Event Definitions pane of the SP Event Perspective. The Perspective online help provides assistance on how to create event definitions and how to modify existing definitions. To access this help, click on the Help button from the SP Event Management Perspective display and select the Tasks... option. Assistance in handling event definitions is available through the Working with Event Definitions topic.

You can begin monitoring the condition only after both the condition and its associated event definition are defined to the Perspective. Monitoring is started by registering the event definition through the Event Definitions pane in the SP Event Perspective. To find how this is done, consult the Perspective's Working with Event Definitions online help topic.

Other basic SP Event Perspective tasks are described in the online help. To access this information, click on the Help button from the SP Event Management Perspective display, select the Tasks... option, and click on the How Do I ...? topic.

Depending on how the event definition was constructed, the SP Event Perspective reacts in one or more of the following ways when you register the event definition, and the condition that the event definition is based on occurs:

The actions performed when the event or the rearm expression occurs can be one of the following:

The SP Event Perspective is designed to be a multi-user tool. Multiple users can invoke the SP Event Perspective in parallel and monitor different conditions. Notifications are routed to those users that registered the associated event definition. The Perspective also stores event definitions created by each user in the user's $HOME/.$USER:Events file. By storing these definitions in different files, each user can tailor conditions and event definitions to best suit the user's needs. This also prevents users from accidentally modifying conditions or event definitions created or used by other SP Perspectives users.

The SP Hardware Perspective

This tool allows you to examine the current status of the SP system hardware. Through this tool, you can display a graphical representation of the system's overall structure, assess the current status of system hardware, and issue hardware control commands.

There are some characteristics of the SP Hardware Perspective that the user should keep in mind when using the tool. Unlike the SP Event Perspective, the SP Hardware Perspective does not permit the user to associate an action with the presence of a condition. Users that wish to automate a response to a specific system condition should use the SP Event Perspective. Also, the SP Hardware Perspective only monitors conditions while it is active. If the Perspective is shut down, any monitoring of hardware status is also shut down.

To be able to restart the Perspectives so that monitoring will automatically start, you will need to save the configuration to a profile. From the menu bar select Options > Save Preferences... The Save Preferences dialog will be displayed. For more information on using this dialog, select the Help button at the bottom. To start the Hardware Perspective with the saved profile:

Previous versions of PSSP offered a graphical user interface as part of the System Monitor (spmon) command. PSSP Version 3.1 and later versions incorporate this hardware control capability into the SP Hardware Perspective. While the capability of the spmon command is available through the Perspective, the "look and feel" of the control is somewhat different. The SP Hardware Perspective offers a special online help facility to acclimate former spmon graphical interface users to the new controls. To access this help, click on the Help menu bar item in the SP Hardware Perspective, select the Tasks... option, and then select the Transforming System Monitor Experience into Hardware Perspectives Skills topic from the help menu.

Each Perspective provides its own unique capabilities. For the purposes of problem monitoring and determination, this manual recommends that the SP Event Perspective be used to monitor conditions of interest for the SP system. When the SP Event Perspective indicates that a hardware failure condition exists, the SP Hardware Perspective should be used to examine the current status of the system hardware and obtain more detailed information about the hardware problem.

Command line tools

PSSP provides command-oriented tools for system administration in addition to graphical tools for system administration and monitoring. These tools require no special workstation capability or high-speed connection, making them usable by almost any terminal type in any mode of access. Use these tools when examining system status through a modem connection or through a node's S1 serial port. The tools discussed in this section are documented in greater detail in PSSP: Command and Technical Reference, PSSP: Administration Guide, and AIX Version 4 Commands Reference. These tools do not possess the same ease-of-use characteristics as their Perspectives based counterparts, although they do provide the same basic function.

Several commands are useful for monitoring the system status and detecting problem situations: spmon, hmmon, df, lsps, lssrc, and dsh. Each is described below.

The spmon and dsh commands require the user to have specific authorizations. To learn how a user can acquire these authorizations, see "Using the SP System Monitor" chapter of PSSP: Administration Guide.

The spmon command permits the user to control and monitor SP hardware resources through a command-line interface without requiring a graphics-capable terminal or high-speed connection. The spmon command does not provide the capability to examine software status (such as paging space, file system space, or software subsystem activity). The spmon command provides access to more node-specific information than the hmmon command, which is introduced next. The spmon command provides a predefined system query to check the most basic problem conditions within the SP system.

The hmmon command provides hardware monitoring functions similar to the spmon command, and gives you access to more SP hardware information for frames and switches than the spmon command does. The hmmon command provides the capability to monitor frame and switch status as well as node status. The hmmon command is intended as a general-purpose SP hardware monitor. Although it has access to more SP information than the spmon command, it does not have access to some of the node-specific information that the spmon command does. The hmmon command does not provide a predefined system query, which the spmon command does.

The df command is an AIX command that examines the current status of file systems, such as current file system size and current available space within these file systems. While this command is designed to examine the AIX system on which it is issued, it can be invoked remotely with the dsh command to acquire this information for all nodes. Three file systems are of particular importance for all SP nodes:

These space capacities can be verified using the dsh command to invoke the df command on all nodes in the SP system.
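As an illustration of combining the two commands, the output of df collected through dsh can be filtered for file systems that are nearly full. This is a minimal sketch, assuming that dsh prefixes each output line with the node name and a colon (as in the dsh example output elsewhere in this section) and that the first percentage column of df -k is %Used. The function name full_filesystems and the default limit of 90 are hypothetical.

```shell
# Filter "dsh -a df -k" output (read from stdin) down to file systems whose
# %Used is at or above a limit. The default of 90 is only a placeholder.
full_filesystems() {
    awk -v limit="${1:-90}" '
        /Filesystem/ { next }                   # skip the df header from each node
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^[0-9]+%$/) {         # first percentage column is %Used
                    used = $i; sub(/%/, "", used)
                    if (used + 0 >= limit + 0) {
                        node = $1; sub(/:$/, "", node)
                        print node, $NF, used "%"
                    }
                    break
                }
            }
        }
    '
}

# Example use on the control workstation:
#   dsh -a df -k | full_filesystems 85
```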

The lsps command provides an instant assessment of the currently available paging space for an AIX system. As with the df command, the lsps command provides information for the AIX system on which it runs. Using the lsps command with the dsh command or via a remote command, you can obtain the assessment for all nodes in the SP system.

Paging space availability by itself does not necessarily indicate a problem. Having only ten percent of 2 gigabytes of paging space available is not as significant a condition as having only ten percent of 100MB available. Also, one system's critical situation may be a tolerable situation for another system. Because of this discrepancy, this manual will not suggest a default figure for a critical paging space situation. Use your knowledge of the system setup, system workload, and any past paging space problems to determine this value.
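Once a site-specific limit has been chosen, the summary form of the lsps command can be collected through dsh and filtered. The following is a sketch only, assuming the two-column output of lsps -s (total paging space and percent used) and the usual node-name prefix added by dsh. The function name paging_over_limit is hypothetical, and the limit must be supplied by the caller because no default is appropriate for every system.

```shell
# Filter "dsh -a lsps -s" output (stdin) to nodes whose paging space use is at
# or above a site-chosen limit. No default limit is provided on purpose.
paging_over_limit() {
    if [ $# -ne 1 ]; then
        echo "usage: paging_over_limit percent_used_limit" >&2
        return 2
    fi
    awk -v limit="$1" '
        $NF ~ /^[0-9]+%$/ {                 # data lines end in the Percent Used value
            used = $NF; sub(/%/, "", used)
            if (used + 0 >= limit + 0) {
                node = $1; sub(/:$/, "", node)
                print node, $(NF - 1), used "%"
            }
        }
    '
}

# Example use on the control workstation (80 is a placeholder limit):
#   dsh -a lsps -s | paging_over_limit 80
```

Printing the total size next to the percentage keeps the point made above in view: the same percentage means different things on systems with different amounts of paging space.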

The lssrc command provides information for software services currently installed on an AIX system. Using lssrc, you can determine if a software service is active or inactive. Use this command in cases where a software service does not appear to be responding to requests for service on a specific node. To check software service status on multiple nodes, use this command through the dsh command.
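When lssrc is run on many nodes through dsh, the nodes where a subsystem is not active can be picked out with a short filter. This is a minimal sketch, assuming that the status is the last column of lssrc -s output and that dsh prefixes each line with the node name. The function name inactive_subsystems is hypothetical, and subsystem_name stands for whichever subsystem you are checking.

```shell
# Filter "dsh -a lssrc -s subsystem_name" output (stdin) to nodes where the
# subsystem is not reported as active. Status is the last column of the output.
inactive_subsystems() {
    awk '
        $NF == "Status" { next }            # skip the lssrc header from each node
        NF >= 3 && $NF != "active" {
            node = $1; sub(/:$/, "", node)
            print node, $2, $NF
        }
    '
}

# Example use on the control workstation:
#   dsh -a lssrc -s subsystem_name | inactive_subsystems
```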

The dsh command permits the user to issue a command on a remote node and to view the results on the local node. Using dsh, you can issue the commands listed previously on any SP node from a single location. This removes the need to log in to each node individually. A user must have specific authorization to use the dsh command. To learn how a user can acquire this authorization, see "Using the SP System Monitor" chapter of PSSP: Administration Guide.

The following scenarios demonstrate how these tools are used to query and monitor the status of the SP system.

Assessing the current status of the SP system

This task is accomplished through the following series of steps:

  1. Preparing to Perform the System Check. Prepare for this task by retrieving the log of the SP system structure. This log is discussed in Create a log of your SP structure and setup. This information is required to use the hmmon command effectively. The hmmon command obtains hardware information about nodes and switch devices using the frame number and slot number of the device, not the network name or IP address assigned to the device.

    This check should be performed by users authorized to invoke the spmon and dsh commands. To learn how a user can acquire this authorization, see the "Using the SP System Monitor" chapter in PSSP: Administration Guide.

  2. Perform a Preliminary Check of the SP System. To perform a basic diagnostic check of the entire SP system, issue the following command from the control workstation:
    /usr/lpp/ssp/bin/spmon -G -d | more
    

    This test verifies several items in the monitor program itself to make sure that it is running. Once the monitor verification completes, the spmon command checks the status of the SP frames and obtains information about the SP nodes. The spmon command performs these tests in a dependent order, so that if one of the early checks fails, subsequent checks are not performed. For example, if a frame cannot be queried, the frame and the nodes within that frame are not checked.

    Example output from the spmon -G -d command: 

    +--------------------------------------------------------------------------------+
    |                                                                                |
    |1.  Checking server process                                                     |
    |Process 10512 has accumulated 192 minutes and 53 seconds.                       |
    |Check ok                                                                        |
    |                                                                                |
    |2.  Opening connection to server                                                |
    |Connection opened                                                               |
    |Check ok                                                                        |
    |                                                                                |
    |3.  Querying frames (s)                                                         |
    |1 frames (s)                                                                    |
    |Check ok                                                                        |
    |                                                                                |
    |4.  Checking frames                                                             |
    |                                                                                |
    |       Controller  Slot 17  Switch    Switch    Power supplies                  |
    |Frame   Responds   Switch   Power    Clocking   A   B   C   D                   |
    |--------------------------------------------------------------                  |
    |   1       yes       yes      on        0       on  on  on  on                  |
    |                                                                                |
    |5.  Checking nodes                                                              |
    |------------------------------- Frame 1 ----------------------                  |
    |Frame  Node  Node         Host/Switch   Key   Env   Front Panel   LCD/LED is    |
    |Slot  Number Type  Power   Responds    Switch Fail    LCD/LED     Flashing      |
    |  1     1    wide   on     yes  yes    normal  no  LEDs are blank   no          |
    |  3     3    thin   on     yes  yes    normal  no  LEDs are blank   no          |
    |  4     4    thin   on     yes  yes    normal  no  LEDs are blank   no          |
    |  5     5    thin   on     yes  yes    normal  no  LEDs are blank   no          |
    |  6     6    thin   on     yes  yes    normal  no  LEDs are blank   no          |
    |  7     7    wide   on     yes  yes    normal  no  LEDs are blank   no          |
    |  9     9    wide   on     yes  yes    normal  no  LEDs are blank   no          |
    |  11    11   wide   on     yes  yes    normal  no  LEDs are blank   no          |
    |  13    13   wide   on     yes  yes    N/A     no  LCDs are blank   no          |
    |                                                                                |
    +--------------------------------------------------------------------------------+

    Note that these tests are numbered. This makes it easy to detect if a test was omitted. The results of this command indicate potential problems if any of these conditions exist:

  3. Obtaining More Information. If the spmon command mentioned previously indicates a potential problem situation, obtain more information in order to resolve the problem.
  4. Checking Basic Software Information. Once hardware failures have been eliminated, it is time to perform some basic software verifications for the SP system. These checks will use the dsh command to invoke AIX commands on multiple nodes in parallel. First verify that the dsh command itself can reach the nodes by issuing the following command from the control workstation:
    dsh -a -f32 hostname
    

    Example output of the dsh -a -f32 hostname command on a small SP system configuration:

    +--------------------------------------------------------------------------------+
    |k21n01.ppd.pok.ibm.com: k21n01.ppd.pok.ibm.com                                  |
    |k21n03.ppd.pok.ibm.com: k21n03.ppd.pok.ibm.com                                  |
    |k21n04.ppd.pok.ibm.com: k21n04.ppd.pok.ibm.com                                  |
    |k21n05.ppd.pok.ibm.com: k21n05.ppd.pok.ibm.com                                  |
    |k21n06.ppd.pok.ibm.com: k21n06.ppd.pok.ibm.com                                  |
    |k21n07.ppd.pok.ibm.com: k21n07.ppd.pok.ibm.com                                  |
    |k21n09.ppd.pok.ibm.com: k21n09.ppd.pok.ibm.com                                  |
    |k21n11.ppd.pok.ibm.com: k21n11.ppd.pok.ibm.com                                  |
    |k21n13.ppd.pok.ibm.com: k21n13.ppd.pok.ibm.com                                  |
    +--------------------------------------------------------------------------------+

    This test will verify that the dsh command can reach the nodes within the SP system. Only nodes that were previously detected as being offline in the earlier tests should fail to respond to this command. If any other nodes within the SP system fail to respond, check for problems by referring to Diagnosing remote command problems on the SP System.
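On a larger system it can be hard to see which nodes are missing from the dsh output. The sketch below compares the responses against a list of expected host names. It assumes that you keep such a list, one name per line, as part of your own log of the SP structure; the function name unreachable_nodes and the file name expected_nodes.txt are hypothetical.

```shell
# Print every expected host that did not answer "dsh -a -f32 hostname".
# $1 is a file listing one expected host name per line; the dsh output is
# read from stdin, where each line starts with "hostname: ".
unreachable_nodes() {
    awk -F': ' -v list="$1" '
        BEGIN { while ((getline host < list) > 0) expected[host] = 1 }
        { delete expected[$1] }                 # this host responded
        END { for (h in expected) print h }
    ' | sort
}

# Example use on the control workstation:
#   dsh -a -f32 hostname | unreachable_nodes expected_nodes.txt
```

Any name that is printed should either be a node already known to be offline from the earlier checks, or a candidate for remote command problem diagnosis.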

Keeping informed of status changes

Note:
To issue these commands successfully, you must have "monitor" permission for Kerberos (compat mode) or DCE mode, depending on the authentication method in use.

The previous discussion centered on obtaining the current status of SP system hardware and software. Such efforts are necessary if a problem is suspected and being actively investigated, but reissuing these commands periodically to examine the current status of the SP system can become tedious. To make the task of monitoring system status easier, the hmmon and spmon commands also provide monitoring capabilities, which report status changes as they occur. This section describes some of the more common monitor commands.

To set up a monitor to check for frame hardware failures, issue the following background command:

hmmon -G -q -s -v frPowerOff*,controllerResponds,controllerIDMismatch,\
nodefail* range_of_frame_nums:0 &

Example initial output from the hmmon command:

+--------------------------------------------------------------------------------+
|                                                                                |
|  1   0  nodefail1            FALSE    0x8802  node 01 I2C not responding       |
|  1   0  nodefail2            TRUE     0x8803  node 02 I2C not responding       |
|  1   0  nodefail3            FALSE    0x8804  node 03 I2C not responding       |
|  1   0  nodefail4            FALSE    0x8805  node 04 I2C not responding       |
|  1   0  nodefail5            FALSE    0x8806  node 05 I2C not responding       |
|  1   0  nodefail6            FALSE    0x8807  node 06 I2C not responding       |
|  1   0  nodefail7            FALSE    0x8808  node 07 I2C not responding       |
|  1   0  nodefail8            TRUE     0x8809  node 08 I2C not responding       |
|  1   0  nodefail9            FALSE    0x880a  node 09 I2C not responding       |
|  1   0  nodefail10           TRUE     0x880b  node 10 I2C not responding       |
|  1   0  nodefail11           FALSE    0x880c  node 11 I2C not responding       |
|  1   0  nodefail12           TRUE     0x880d  node 12 I2C not responding       |
|  1   0  nodefail13           FALSE    0x880e  node 13 I2C not responding       |
|  1   0  nodefail14           TRUE     0x880f  node 14 I2C not responding       |
|  1   0  nodefail15           TRUE     0x8810  node 15 I2C not responding       |
|  1   0  nodefail16           TRUE     0x8811  node 16 I2C not responding       |
|  1   0  nodefail17           FALSE    0x8812  switch I2C not responding        |
|  1   0  frPowerOff           FALSE    0x8846  SEPBU frame power off            |
|  1   0  controllerIDMismatch FALSE    0x8871  frame ID mismatch                |
|  1   0  controllerResponds   TRUE     0x88a8  frame responding to polls        |
+--------------------------------------------------------------------------------+

This command is similar to the one presented previously, except that this version continually monitors the frame condition and generates a message to the terminal if any status changes. To stop monitoring this information, terminate the background process.
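Terminating the background process is easier if its process ID is recorded when the monitor is started. The following sketch shows one way to do that in a Korn or POSIX shell, with the monitor output appended to a log file instead of the terminal. The function names start_monitor and stop_monitor, the MONITOR_PID variable, and the log path are hypothetical.

```shell
# Start a long-running monitor command in the background, append its output to
# a log file, and remember its process ID so that it can be terminated later.
start_monitor() {
    monitor_log=$1; shift
    "$@" >> "$monitor_log" 2>&1 &
    MONITOR_PID=$!
}

# Terminate the monitor started by start_monitor and wait for it to exit.
stop_monitor() {
    kill "$MONITOR_PID" 2>/dev/null || true
    wait "$MONITOR_PID" 2>/dev/null || true
}

# Example use with the frame monitor shown above (the log path is a placeholder):
#   start_monitor /tmp/hmmon.frames.log hmmon -G -q -s -v \
#       frPowerOff*,controllerResponds,controllerIDMismatch,nodefail* \
#       range_of_frame_nums:0
#   stop_monitor
```

This helper tracks a single monitor at a time; to run several monitors, save MONITOR_PID in a separate variable after each start.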

To set up a monitor to check for SP switch hardware status changes, issue the following background command:

hmmon -G -q -s -v nodePower,powerLED,envLED,\
shutdownTemp range_of_frame_nums:17 &

Example initial output from the hmmon command:

+--------------------------------------------------------------------------------+
|                                                                                |
|  1  17  powerLED                   1  0x8c47  node/switch LED 1 (green)        |
|  1  17  envLED                     0  0x8c48  node/switch LED 2 (yellow)       |
|  1  17  nodePower            TRUE     0x8c4a  DC-DC power on                   |
|  1  17  shutdownTemp         FALSE    0x8c59  temperature shutdown             |
+--------------------------------------------------------------------------------+

This command is similar to the one presented previously, except that this version continually monitors the switch condition and generates a message to the terminal if any status changes. To stop monitoring this information, terminate the background process.

To set up a monitor to check for changes in a node's LCD or LED status, issue the following background command:

hmmon -G -q -s -v LED7Seg* range_of_frame_nums:1-16 &

Example initial output from the hmmon command:

+--------------------------------------------------------------------------------+
|                                                                                |
|  1   1  LED7SegA                 255  0x909f  7 segment LED A                  |
|  1   1  LED7SegB                 255  0x90a0  7 segment LED B                  |
|  1   1  LED7SegC                 255  0x90a1  7 segment LED C                  |
|  1   3  LED7SegA                 255  0x949f  7 segment LED A                  |
|  1   3  LED7SegB                 255  0x94a0  7 segment LED B                  |
|  1   3  LED7SegC                 255  0x94a1  7 segment LED C                  |
|  1   4  LED7SegA                 255  0x949f  7 segment LED A                  |
|  1   4  LED7SegB                 255  0x94a0  7 segment LED B                  |
|  1   4  LED7SegC                 255  0x94a1  7 segment LED C                  |
|  1   5  LED7SegA                 255  0x949f  7 segment LED A                  |
|  1   5  LED7SegB                 255  0x94a0  7 segment LED B                  |
|  1   5  LED7SegC                 255  0x94a1  7 segment LED C                  |
|  1   6  LED7SegA                 255  0x949f  7 segment LED A                  |
|  1   6  LED7SegB                 255  0x94a0  7 segment LED B                  |
|  1   6  LED7SegC                 255  0x94a1  7 segment LED C                  |
|  1   7  LED7SegA                 255  0x909f  7 segment LED A                  |
|  1   7  LED7SegB                 255  0x90a0  7 segment LED B                  |
|  1   7  LED7SegC                 255  0x90a1  7 segment LED C                  |
|  1   9  LED7SegA                 255  0x909f  7 segment LED A                  |
|  1   9  LED7SegB                 255  0x90a0  7 segment LED B                  |
|  1   9  LED7SegC                 255  0x90a1  7 segment LED C                  |
|  1  11  LED7SegA                 255  0x909f  7 segment LED A                  |
|  1  11  LED7SegB                 255  0x90a0  7 segment LED B                  |
|  1  11  LED7SegC                 255  0x90a1  7 segment LED C                  |
+--------------------------------------------------------------------------------+
This command shows the initial status of these resources, and displays any status changes in these resources when they occur. All values should be 255, indicating that the associated readout element is blank. If any nodes indicate that a segment is not blank, issue the spmon -L command to obtain the current LCD or LED readout of the node.
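Because a value of 255 means that a readout segment is blank, lines with any other value are the ones worth noticing. The following filter is a sketch, assuming the column layout of the example output above (frame, slot, variable name, value). The function name nonblank_segments is hypothetical.

```shell
# Filter hmmon LED7Seg output (stdin) to readout segments that are not blank.
# Columns assumed: frame, slot, variable name, value; 255 means blank.
nonblank_segments() {
    awk '$3 ~ /^LED7Seg/ && $4 != 255 { print "frame " $1 " slot " $2 ": " $3 "=" $4 }'
}

# Example use:
#   hmmon -G -q -s -v LED7Seg* range_of_frame_nums:1-16 | nonblank_segments
```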

To set up a monitor to check for nodes suddenly losing contact with the SP Switch, issue the following command:

spmon -q -M -l -t frame*/node*/switchResponds/value

Example initial output from the spmon command:

+--------------------------------------------------------------------------------+
|                                                                                |
|/SP/frame/frame1/node1/switchResponds/value/1                                   |
|/SP/frame/frame1/node3/switchResponds/value/1                                   |
|/SP/frame/frame1/node4/switchResponds/value/1                                   |
|/SP/frame/frame1/node5/switchResponds/value/1                                   |
|/SP/frame/frame1/node6/switchResponds/value/1                                   |
|/SP/frame/frame1/node7/switchResponds/value/1                                   |
|/SP/frame/frame1/node9/switchResponds/value/0                                   |
|/SP/frame/frame1/node11/switchResponds/value/0                                  |
|/SP/frame/frame1/node13/switchResponds/value/1                                  |
+--------------------------------------------------------------------------------+

The spmon command displays the current status, and then writes a message to the terminal if any of these values change. All values should be 1. A value of 0 indicates that the node is not responding to the SP Switch. In this example, two of the nodes (node9 and node11) report 0 and should be investigated.
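The nodes that report 0 can be extracted from this output with a short filter. This is a sketch only, assuming the slash-separated line format shown in the example output above; the function name switch_not_responding is hypothetical.

```shell
# Filter spmon switchResponds output (stdin) to nodes reporting a value of 0.
# Each line is expected to look like /SP/frame/frameN/nodeM/switchResponds/value/V
switch_not_responding() {
    awk -F/ '$(NF - 2) == "switchResponds" && $NF == 0 { print $(NF - 4), $(NF - 3) }'
}

# Example use:
#   spmon -q -M -l -t frame*/node*/switchResponds/value | switch_not_responding
```

Applied to the example output above, the filter prints frame1 node9 and frame1 node11.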

Other conditions can also be monitored using the hmmon and spmon commands; these suggestions offer the most basic of tests. To learn what other conditions can be monitored with these commands, and to tailor these commands to best suit your needs, refer to the hmmon and spmon sections of PSSP: Command and Technical Reference.

All commands can be issued from the same terminal session, but this can lead to confusing output when conditions change, and initial values can scroll off the terminal screen. To keep the monitoring manageable, consider issuing these commands from separate terminals, or from separate terminal windows on an X Windows capable terminal. Issue one monitoring command per terminal or terminal window. This will associate a terminal with each condition being monitored, and simplify the understanding of the monitor output.

