Diagnosis Guide

Runtime notification methods

PSSP provides tools for monitoring system status and conditions at run time, while the system administrator is actively watching the system. These tools are used when the system administrator wants to know the current status of system resources immediately, or wants to be notified immediately of problems and potential trouble situations.

Two sets of runtime tools are available. The choice of the tools depends on the capabilities of the system administrator's workstation and the system administrator's preferences. PSSP provides graphical tools for use on the control workstation or network-attached terminals. PSSP also provides command-line tools for those situations when only modem access or s1term access is available.

Graphical tools - SP Perspectives

PSSP provides graphical tools for system administration and monitoring through the SP Perspectives tool suite. Perspectives is engineered for ease-of-use for the system administrator. In order to be used effectively, SP Perspectives requires X11 graphics capable terminals or workstations and high-speed connections. Use Perspectives when monitoring the SP system from the control workstation or from a network-attached workstation.

The basic concepts of Perspectives and examples of its use are included in the SP Perspectives chapter of PSSP: Administration Guide. Perspectives also provides extensive online help information. To understand how to accomplish the tasks that are presented in this chapter, consult the SP Perspectives online help, using this section as a guide to the online help topics.

Individual Perspectives require that certain subsystems be operating and that the user is authorized to communicate with them. Such subsystems include the System Monitor, Event Management, System Data Repository, and Problem Management. For authorization required for each Perspective, see the discussion on using SP Perspectives in PSSP: Administration Guide.

The Perspectives launch pad is started using the perspectives command, which resides in the /usr/lpp/ssp/bin directory. Other Perspectives, such as the Event or Hardware Perspective, can be started from the launch pad. Before starting an SP Perspective, be sure that the DISPLAY environment variable is set to the machine on which you want to display the SP Perspective. Also, be sure that you are permitted to display to that machine by running the xhost command on that machine.
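The DISPLAY check described above can be sketched as a small wrapper. This is only an illustration: the function name start_perspectives and the PERSPECTIVES_CMD override are hypothetical, and the default path is the one named in this section.

```shell
# Hypothetical helper: start the Perspectives launch pad only when DISPLAY is set.
# PERSPECTIVES_CMD is an illustrative override; by default the command path
# named in this section is used.
start_perspectives() {
    cmd=${PERSPECTIVES_CMD:-/usr/lpp/ssp/bin/perspectives}
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set; set it to the machine that should show the window" >&2
        return 1
    fi
    # Run the launch pad in the background so the terminal stays usable.
    "$cmd" &
}
```

The wrapper does not check xhost permissions; that step still has to be done on the displaying machine.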

Two Perspectives tools are useful for monitoring the system status and detecting problem situations:

The SP Event Perspective

This tool allows the user to specify system conditions that are of concern or importance, and to indicate what actions are to be taken when the condition exists. The Perspective interfaces with the Event Management software subsystem to monitor these conditions and alert the Perspective to the presence of the condition. To effectively use this Perspective, you must understand certain terminology.

Condition: The circumstances within the system that are of interest to the system administrator. Conditions can be created, viewed, and modified through the Conditions pane in the SP Event Perspective. To specify a condition, the system administrator must provide the necessary components to form the condition, including an event expression and, optionally, a rearm expression. Their definitions follow.
Event Expression: A relational expression that specifies the circumstances under which an event is generated.
Rearm Expression: A relational expression that specifies that the condition that triggered the event is no longer true. It is usually the inverse of the event expression.
The rearm expression indicates when the SP Event Perspective should consider the event to have "stopped". For example, a file system is considered "almost full" when the available space is less than 10% of its capacity. The system administrator may want to consider the condition to exist until the available space reaches 13% of the file system's capacity. The event expression would then be set to 10% and the rearm expression to 13%. As with the event expression, the system administrator can indicate an action to take when the rearm expression occurs, such as deactivating reserve resources that had been activated when the event occurred.
Event Definition: An association made by the system administrator between a condition and a response to the presence of that condition.
Registration: The activation of an event definition. By registering an event definition, the system administrator instructs the Perspective to begin monitoring for the condition and to take the associated action if the condition should occur.
Once the user registers the event definition, the action will be run whenever the event or rearm expression occurs. This is independent of whether the Event Perspective is active at the time that the event or rearm expression occurs.
Event: A change in the state of a system resource. For the purposes of this discussion, an event is more narrowly defined as the presence of the condition within the system.
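The interplay between an event expression and a rearm expression can be illustrated with the 10% and 13% file system example. The following sketch is not how Event Management is implemented; it is a hypothetical function, track_condition, that reads one "percent free" sample per line and assumes the event fires when free space drops below the first threshold and rearms when free space reaches the second.

```shell
# Illustrative only: mimic event/rearm behavior for a stream of
# "percent free" samples (one integer per line on stdin).
# $1 = event threshold (event when free < $1, default 10)
# $2 = rearm threshold (rearm when free >= $2, default 13)
track_condition() {
    awk -v event_below="${1:-10}" -v rearm_at="${2:-13}" '
        BEGIN { active = 0 }
        # Armed and the event expression becomes true: report the event once.
        !active && $1 < event_below { active = 1; print "EVENT " $1; next }
        # Event outstanding and the rearm expression becomes true: rearm.
        active && $1 >= rearm_at    { active = 0; print "REARM " $1 }
    '
}
```

Because the rearm threshold is higher than the event threshold, samples that hover between 10% and 13% produce no additional notifications.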

To start the SP Event Perspective, double click on the Event Perspective icon in the SP Perspectives launch pad window.

Users can create conditions for situations that are important to them through the Conditions pane of the SP Event Perspective. A number of default conditions are provided through the SP Event Perspective. You may wish to add more or to tailor the predefined conditions to meet the specific needs of your particular SP installation. The Perspectives online help provides assistance on how to create conditions and how to modify existing conditions. To access this help, click on the Help button from the SP Event Perspective display, and select the Tasks... option. Assistance in handling conditions is available through the Working with Conditions topic.

Once a condition is defined through the SP Event Perspective, an action can then be associated with it. The action may be as simple as a visual notification that the event has occurred, or the action can be more sophisticated, including automatically invoking a command in response to the event. To associate the appropriate action with the presence of the condition (or to the absence of the condition), an event definition must be created. You can create these definitions and examine default definitions through the Event Definitions pane of the SP Event Perspective. The Perspective online help provides assistance on how to create event definitions and how to modify existing definitions. To access this help, click on the Help button from the SP Event Management Perspective display and select the Tasks... option. Assistance in handling event definitions is available through the Working with Event Definitions topic.

Only after both the condition and its associated event definition are defined to the Perspective can you begin monitoring the condition. This is done by registering the event definition through the Event Definitions pane in the SP Event Perspective. To find how this is done, consult the Perspective's Working with Event Definitions online help topic.

Other basic SP Event Perspective tasks are described in the online help. To access this information, click on the Help button from the SP Event Management Perspective display, select the Tasks... option, and click on the How Do I ...? topic.

Depending on how the event definition was constructed, the SP Event Perspective reacts in one or more of the following ways when you register the event definition, and the condition that the event definition is based on occurs:

The icon representing the event definition within the SP Event Perspective's Event Definitions pane changes to an envelope. This notification can be detected only if the SP Event Perspective is running. If the SP Event Perspective is shut down after you register for the condition, this visual notification is not presented.
The action associated with the event definition is started. This action is specified when you create the event definition. Through this action, you can automate the response to the condition, such as sending e-mail to a system administrator, issuing a command to activate a pager, or issuing an administrative command to allocate reserved resources to address the condition.
Once you register the event definition, the action runs whenever the event occurs, whether or not the SP Event Perspective is active at the time the event occurs.

The actions performed when the event or the rearm expression occurs can be one of the following:

A command - The command can perform controls, enable or disable resources, or notify you by other means (like mail or online messages).
An SNMP Trap - This transmits a notification to the network using SNMP protocols, indicating that an event has occurred. This trap can be configured so that certain SNMP applications can receive the notification, or all can receive it. NetView is an example of such an application. Use an SNMP trap when you are using SNMP-based monitoring tools (such as NetView), and you want these tools to detect when events occur on the SP system.
An entry in the AIX Error Log and the BSD System log - This creates a persistent record of the event, or of the event's rearm condition. The AIX Error Log template HA_PMAN_EVENT_ON is used when the event condition occurs. The template HA_PMAN_EVENT_OFF is used when the rearm condition occurs. These templates are viewed by issuing the command:
```
errpt -at -J HA_PMAN_EVENT_ON -J HA_PMAN_EVENT_OFF
```
Notification can be sent to the system administrator whenever these templates are logged to the AIX Error Log. For instructions on setting up this notification, consult Using the AIX Error Notification Facility.

The SP Event Perspective is designed to be a multi-user tool. Multiple users can invoke the SP Event Perspective in parallel and monitor different conditions. Notifications are routed to those users that registered the associated event definition. The Perspective also stores event definitions created by each user in the user's $HOME/.$USER:Events file. By storing these definitions in different files, each user can tailor conditions and event definitions to best suit the user's needs. This also prevents users from accidentally modifying conditions or event definitions created or used by other SP Perspectives users.

The SP Hardware Perspective

This tool allows you to examine the current status of the SP system hardware. Through this tool, you can display a graphical representation of the system's overall structure, assess the current status of system hardware, and issue hardware control commands.

To examine the current status of a hardware device, select the hardware device by single-clicking on the device's icon in the particular pane. Open the device's notebook by single-clicking the notebook icon (leftmost icon) in the toolbar at the top of the Perspective window. This displays a new window (notebook) that contains the device's current status, settings, and monitored conditions. This is useful for examining a node's LED values, its responsiveness to the network and the switch, its network configuration, and other information.
For further assistance in using the notebook to view hardware status, consult the Perspective's online help. To access this help, click on the Help button from the SP Hardware Perspective display and select the Tasks... option. Assistance in viewing hardware status is available in the Viewing Hardware Attributes topic.
If you want to view the same hardware information for multiple entities, such as the responsiveness to the switch for a series of nodes, opening a notebook for each entity can be time-consuming. The SP Hardware Perspective offers an alternative method for displaying this information. Most information displayed in a notebook can also be displayed in the pane in a table format. The left column of the table contains the objects from the pane, while the columns to the right contain the information you want displayed from the notebook.
To switch from the icon to table view in a pane, select the icon on the right of the toolbar which shows a table and an icon. When you point at this icon, the descriptive text reads "Show object in the table view or the icon view". The first time you select this toolbar icon, the "Set Table Attributes" dialog is displayed. This dialog lists the attributes from the object's notebook that you can display in the table. After selecting the desired items from the list, select "OK". The pane will be updated with the items you selected. The table entries that represent variable states of the hardware entity are color coded to indicate "good", "bad", or "caution" status, as they are in the notebook.
For assistance on using the table view to examine hardware status, consult the Perspective online help. To access this help, click on the Help button from the SP Hardware Perspective display and select the Overview... option. From the new window that appears, select the Starting and Customizing SP Perspective topic, then select the Customizing SP Perspective subtopic, and finally select the Using Table View item. The Help option from the selection list window provides a fast path to the help topic.
To monitor the status of hardware devices, select the pane where the devices are contained. Looking at the top row of icons, a graph icon should be visible on the right. When you point at this icon, the descriptive text reads "Set up and begin monitoring". Click on this icon to bring up a window of items that can be monitored. Select the items to be monitored from this list. All objects in the pane will now be monitored for these conditions.
When monitoring is active, the icons of the entities give a visual indication of their status. If none of the monitored conditions indicates a problem, the icon is shown in green. If any of the monitored conditions indicates a problem, the icon appears with a red X drawn through it. Note that this occurs even if only one condition indicates a problem. For example, if five nodes are being monitored for five conditions, and one of these five conditions appears on node1a, the icon for node1a appears with a red X through it, while the remaining nodes are represented with green icons.
To determine what condition may exist on a marked entity, select the object by single-clicking on its icon or its table entry in the pane. Then open the object's notebook by clicking on the notebook icon in the upper left corner. When the notebook display comes up, page forward to the "Monitored Conditions" page. This page lists conditions being monitored for that object, along with the condition's current state. Any state listed as "Triggered" indicates that the condition is present.
If any object in a pane is presented in gray with a question mark (?) drawn over it for longer than a few seconds, a communication problem exists between the SP Hardware Perspective and the Event Management software subsystem. For assistance in resolving the problem, consult Diagnosing Perspectives problems on the SP System.
For further assistance in setting up and starting the hardware monitor, consult the Perspectives online help facility. To access this help, click on the Help button from the SP Hardware Perspective display and select the Tasks... option. From the new window that appears, select the Monitoring Hardware Objects topic.

There are some characteristics of the SP Hardware Perspective that the user should keep in mind when using the tool. Unlike the SP Event Perspective, the SP Hardware Perspective does not permit the user to associate an action with the presence of a condition. Users that wish to automate a response to a specific system condition should use the SP Event Perspective. Also, the SP Hardware Perspective only monitors conditions while it is active. If the Perspective is shut down, any monitoring of hardware status is also shut down.

To be able to restart the Perspectives so that monitoring will automatically start, you will need to save the configuration to a profile. From the menu bar select Options > Save Preferences... The Save Preferences dialog will be displayed. For more information on using this dialog, select the Help button at the bottom. To start the Hardware Perspective with the saved profile:

If you saved your preferences as a user profile, enter
```
sphardware -userProfile name_you_specified
```
If you saved your preferences as a system profile, enter
```
sphardware -systemProfile name_you_specified
```

Previous versions of PSSP offered a graphical user interface as part of the System Monitor (spmon) command. PSSP Version 3.1 and later versions incorporate this hardware control capability into the SP Hardware Perspective. While the capability of the spmon command is available through the Perspective, the "look and feel" of the control is somewhat different. The SP Hardware Perspective offers a special online help facility to acclimate former spmon graphical interface users to the new controls. To access this help, click on the Help menu bar item in the SP Hardware Perspective, select the Tasks... option, and then select the Transforming System Monitor Experience into Hardware Perspectives Skills topic from the help menu.

Each Perspective provides its own unique capabilities. For the purposes of problem monitoring and determination, this manual recommends that the SP Event Perspective be used to monitor conditions of interest for the SP system. When the SP Event Perspective indicates that a hardware failure condition exists, the SP Hardware Perspective should be used to examine the current status of the system hardware and obtain more detailed information about the hardware problem.

Command line tools

PSSP provides command-oriented tools for system administration in addition to graphical tools for system administration and monitoring. These tools require no special workstation capability or high-speed connection, making them usable by almost any terminal type in any mode of access. Use these tools when examining system status through a modem connection or through a node's S1 serial port. The tools discussed in this section are documented in greater detail in PSSP: Command and Technical Reference, PSSP: Administration Guide, and AIX Version 4 Commands Reference. These tools do not possess the same ease-of-use characteristics as their Perspectives based counterparts, although they do provide the same basic function.

Several commands are useful for monitoring the system status and detecting problem situations:

spmon
hmmon
df
dsh
lsps
lssrc

The spmon and dsh commands require the user to have specific authorizations. To learn how a user can acquire these authorizations, see the "Using the SP System Monitor" chapter of PSSP: Administration Guide.

The spmon command permits the user to control and monitor SP hardware resources through a command-line interface without requiring a graphics-capable terminal or high-speed connection. The spmon command does not provide the capability to examine software status (such as paging space, file system space, or software subsystem activity). The spmon command provides access to more node-specific information than the hmmon command, which is introduced next. The spmon command provides a predefined system query to check the most basic problem conditions within the SP system.

The hmmon command provides hardware monitoring functions similar to the spmon command, and gives you access to more SP hardware information for frames and switches than the spmon command does. The hmmon command provides the capability to monitor frame and switch status as well as node status. The hmmon command is intended as a general-purpose SP hardware monitor. Although it has access to more SP information than the spmon command, it does not have access to some of the node-specific information that the spmon command does. The hmmon command does not provide a predefined system query, which the spmon command does.

The df command is an AIX command that examines the current status of file systems, such as current file system size and current available space within these file systems. While this command is designed to examine the AIX system on which it is issued, it can be invoked remotely with the dsh command to acquire this information for all nodes. Three file systems are of particular importance for all SP nodes:

/spdata
This directory contains configuration information for PSSP software and also contains copies of information from the SDR. By default, this directory resides in the / (or root) file system. Insufficient space in this file system can result in failures in PSSP software, especially those dependent on the SDR for proper operation. As a rule of thumb, ensure that this file system has at least 5% of its capacity available at any time.
One method for avoiding space problems for the /spdata directory is to create a separate file system for this directory. Follow the instructions in PSSP: Installation and Migration Guide to create a separate volume group for this file system. Use the same rule of thumb for spotting potential trouble with this file system.
/var
This file system contains AIX system logs, such as the error log and user access logs. It also contains logs maintained by PSSP software for serviceability purposes. Some of these logs are never cleared except by explicit system administrator actions. If left unattended, they can grow to consume all available space.
As a rule of thumb, ensure that 10MB of space is available within this file system at all times. If the file system reaches this threshold, consider either extending the file system's capacity with the chfs command, or examine the file system to determine where the space is being consumed and remove unneeded files.
If /var is continually reaching the suggested threshold, this condition may indicate a chronic problem with some PSSP software or with specific hardware devices. Examine the logs listed in Error logging overview to determine if any show increased or extended activity, and perform any associated problem determination procedures if necessary.
/tmp
This file system is used by various user level applications, software products, and PSSP programs for temporary storage. Some legacy PSSP applications use this file system to store trace logs used for serviceability purposes. Some applications may inadvertently leave temporary files in the /tmp file system, or these applications may terminate before removing these files.
Insufficient space in /tmp can cause PSSP software to fail. As a rule of thumb, ensure that at least 8MB of space is available in this file system at any time. Eight MB is the amount of space a snap command will require if the system has to produce a dump to be sent to the IBM Support Center.

These space capacities can be verified using the dsh command to invoke the df command on all nodes in the SP system.
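The rules of thumb for /var (10MB) and /tmp (8MB) can be checked mechanically. The sketch below is a hypothetical filter, low_space, that reads df -k output on standard input. It assumes the AIX column order (Filesystem, 1024-blocks, Free, %Used, Iused, %Iused, Mounted on) and counts fields from the end of the line, so that a leading hostname prefix such as the one dsh adds to each output line does not shift the columns. Verify both assumptions against the output on your own system before relying on it.

```shell
# Hypothetical helper: report file systems mounted at $1 with less than
# $2 kilobytes free. Reads "df -k" output on stdin.
# Assumed columns: Filesystem 1024-blocks Free %Used Iused %Iused Mounted-on
# Free space is therefore the fifth field from the end, $(NF-4).
low_space() {
    awk -v mp="$1" -v min_kb="$2" '
        NF >= 7 && $NF == mp && $(NF-4) + 0 < min_kb + 0 {
            # With a "hostname:" prefix there are 8 fields; otherwise label it local.
            printf "%s %s: %d KB free (want %d)\n", (NF > 7 ? $1 : "local:"), mp, $(NF-4), min_kb
        }
    '
}
# Possible use on the control workstation (not run here; assumes the dsh -a
# flag selects all nodes, see the dsh reference):
#   dsh -a df -k /var | low_space /var 10240
#   dsh -a df -k /tmp | low_space /tmp 8192
```

The header line is ignored because its last field is "on" rather than a mount point.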

The lsps command provides an instant assessment of the currently available paging space for an AIX system. As with the df command, the lsps command provides information for the AIX system on which it runs. Using the lsps command with the dsh command or via a remote command, you can obtain the assessment for all nodes in the SP system.

Paging space availability by itself does not necessarily indicate a problem. Having only ten percent of 2 gigabytes of paging space available is not as significant a condition as having only ten percent of 100MB available. Also, one system's critical situation may be a tolerable situation for another system. Because of this discrepancy, this manual will not suggest a default figure for a critical paging space situation. Use your knowledge of the system setup, system workload, and any past paging space problems to determine this value.

The lssrc command provides information for software services currently installed on an AIX system. Using lssrc, you can determine if a software service is active or inactive. Use this command in cases where a software service does not appear to be responding to requests for service on a specific node. To check software service status on multiple nodes, use this command through the dsh command.
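As a sketch of how lssrc output might be scanned for inactive services, the hypothetical filter below prints the name of every subsystem whose status is inoperative. It assumes local lssrc -a style output, in which the subsystem name is the first field on each line and the status is the last.

```shell
# Hypothetical helper: list subsystems that lssrc reports as inoperative.
# Reads "lssrc -a" style output on stdin; the PID column is empty for
# inoperative subsystems, so only the first and last fields are used.
inoperative_subsystems() {
    awk '$NF == "inoperative" { print $1 }'
}
# Possible use on a node (not run here):
#   lssrc -a | inoperative_subsystems
```

When the command is run through dsh, each line carries a hostname prefix, so the first field would be the host name rather than the subsystem; the filter would need adjusting for that case.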

The dsh command permits the user to issue a command on a remote node and to view the results on the local node. Using dsh, you can issue the commands listed previously on any SP node from a single location. This removes the need to log in to each node individually. A user must have specific authorization to use the dsh command. To learn how a user can acquire this authorization, see the "Using the SP System Monitor" chapter of PSSP: Administration Guide.

The following scenarios demonstrate how these tools are used to query and monitor the status of the SP system.

Assessing the current status of the SP system

This task is accomplished through the following series of steps:

Preparing to Perform the System Check. Prepare for this task by retrieving the log of the SP system structure. This log is discussed in Create a log of your SP structure and setup. This information is required to use the hmmon command effectively. The hmmon command obtains hardware information about nodes and switch devices using the frame number and slot number of the device, not the network name or IP address assigned to the device.
This check should be performed by users authorized to invoke the spmon and dsh commands. To learn how a user can acquire this authorization, see the "Using the SP System Monitor" chapter in PSSP: Administration Guide.

Perform a Preliminary Check of the SP System. To perform a basic diagnostic check of the entire SP system, issue the following command from the control workstation:

/usr/lpp/ssp/bin/spmon -G -d | more

This test verifies several items in the monitor program itself to make sure that it is running. Once the monitor verification completes, the spmon command checks the status of the SP frames and obtains information about the SP nodes. The spmon command performs these tests in a dependent order, so that if one of the early checks fails, subsequent checks are not performed. For example, if a frame cannot be queried, the frame and the nodes within that frame are not checked.

Example output from the spmon -G -d command:

+--------------------------------------------------------------------------------+
|                                                                                |
|1.  Checking server process                                                     |
|Process 10512 has accumulated 192 minutes and 53 seconds.                       |
|Check ok                                                                        |
|                                                                                |
|2.  Opening connection to server                                                |
|Connection opened                                                               |
|Check ok                                                                        |
|                                                                                |
|3.  Querying frames (s)                                                         |
|1 frames (s)                                                                    |
|Check ok                                                                        |
|                                                                                |
|4.  Checking frames                                                             |
|                                                                                |
|       Controller  Slot 17  Switch    Switch    Power supplies                  |
|Frame   Responds   Switch   Power    Clocking   A   B   C   D                   |
|--------------------------------------------------------------                  |
|   1       yes       yes      on        0       on  on  on  on                  |
|                                                                                |
|5.  Checking nodes                                                              |
|------------------------------- Frame 1 ----------------------                  |
|Frame  Node  Node         Host/Switch   Key   Env   Front Panel   LCD/LED is    |
|Slot  Number Type  Power   Responds    Switch Fail    LCD/LED     Flashing      |
|  1     1    wide   on     yes  yes    normal  no  LEDs are blank   no          |
|  3     3    thin   on     yes  yes    normal  no  LEDs are blank   no          |
|  4     4    thin   on     yes  yes    normal  no  LEDs are blank   no          |
|  5     5    thin   on     yes  yes    normal  no  LEDs are blank   no          |
|  6     6    thin   on     yes  yes    normal  no  LEDs are blank   no          |
|  7     7    wide   on     yes  yes    normal  no  LEDs are blank   no          |
|  9     9    wide   on     yes  yes    normal  no  LEDs are blank   no          |
|  11    11   wide   on     yes  yes    normal  no  LEDs are blank   no          |
|  13    13   wide   on     yes  yes    N/A     no  LCDs are blank   no          |
|                                                                                |
+--------------------------------------------------------------------------------+

Note that these tests are numbered. This makes it easy to detect if a test was omitted. The results of this command indicate potential problems if any of these conditions exist:

The command does not run.
The command does not perform all five verification checks.
The fourth test indicates that the frame's controller is not responding, the switch power is not on, or any of the power supplies are listed as off.
The fifth test indicates any abnormal conditions: a node's power is off, the host responds does not read yes, an environment failure is indicated, or the LCD or LED of the node is not blank (but not flashing).
The fifth test indicates that the node's LCDs or LEDs are flashing.
This indicates that a system dump was attempted.
The fifth test indicates that the node is not responding to the switch device.
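Because the checks are numbered, a skipped check can be detected mechanically. The following sketch is a hypothetical filter, missing_spmon_checks, that reads saved spmon -G -d output and prints the numbers of any of the five checks that do not appear. It assumes each check is introduced by a line that starts with the check number and a period, as in the example output, and it tolerates a leading "|" border character.

```shell
# Hypothetical helper: print the numbers (1 to 5) of spmon -G -d checks that
# are missing from the output supplied on stdin.
missing_spmon_checks() {
    awk '
        # A check heading looks like "3.  Querying ..." with optional border/blanks.
        /^[| \t]*[1-5]\.[ \t]/ {
            match($0, /[1-5]/)
            seen[substr($0, RSTART, 1)] = 1
        }
        END {
            for (n = 1; n <= 5; n++)
                if (!(n in seen)) print n
        }
    '
}
# Possible use (not run here):
#   /usr/lpp/ssp/bin/spmon -G -d | missing_spmon_checks
```

No output means that all five check headings were present; it does not mean that each check passed.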

Obtaining More Information. If the spmon command mentioned previously indicates a potential problem situation, obtain more information in order to resolve the problem.
- If either the first or second test of spmon -G -d failed, consult Diagnosing System Monitor problems.
- If the third or fourth test failed, use the hmmon command to detect if there are problems with the frame itself. Issue this command to obtain this information:
```
hmmon -G -q -s -v frPowerOff*,controllerResponds,\
controllerIDMismatch,nodefail* range_of_frame_nums:0
```
  Output is similar to:
```
+--------------------------------------------------------------------------------+
|                                                                                |
|1 0 nodefail1            FALSE   0x8802  node 01 I2C not responding             |
|1 0 nodefail2            TRUE    0x8803  node 02 I2C not responding             |
|1 0 nodefail3            FALSE   0x8804  node 03 I2C not responding             |
|1 0 nodefail4            TRUE    0x8805  node 04 I2C not responding             |
|1 0 nodefail5            FALSE   0x8806  node 05 I2C not responding             |
|1 0 nodefail6            FALSE   0x8807  node 06 I2C not responding             |
|1 0 nodefail7            FALSE   0x8808  node 07 I2C not responding             |
|1 0 nodefail8            FALSE   0x8809  node 08 I2C not responding             |
|1 0 nodefail9            FALSE   0x880a  node 09 I2C not responding             |
|1 0 nodefail10           FALSE   0x880b  node 10 I2C not responding             |
|1 0 nodefail11           FALSE   0x880c  node 11 I2C not responding             |
|1 0 nodefail12           FALSE   0x880d  node 12 I2C not responding             |
|1 0 nodefail13           FALSE   0x880e  node 13 I2C not responding             |
|1 0 nodefail14           TRUE    0x880f  node 14 I2C not responding             |
|1 0 nodefail15           FALSE   0x8810  node 15 I2C not responding             |
|1 0 nodefail16           TRUE    0x8811  node 16 I2C not responding             |
|1 0 nodefail17           FALSE   0x8812  switch  I2C not responding             |
|1 0 frPowerOff           FALSE   0x8846  SEPBU   frame power off                |
|1 0 controllerIDMismatch FALSE   0x8871  frame ID mismatch                      |
|1 0 controllerResponds   TRUE    0x88a8  frame responding to polls              |
+--------------------------------------------------------------------------------+
```
  This command tests whether any of the frame's power supplies are off, whether the frame controller is experiencing problems, and whether any of the node slot connections are bad. Keep in mind the warning made earlier: because wide and high nodes occupy more than one node slot in a frame, node failures will be detected for node slots that cannot be used because a wide or high node occupies that space.
  Such a situation is demonstrated in the example output listed previously. In this example, the nodes occupying slots 1 and 3 are wide nodes, as are the nodes occupying slots 13 and 15. Node slots 2, 4, 14, and 16 are therefore unusable, but the hmmon command indicates that nodes in these unavailable slots have failed. The log of the SP structure and setup is needed to understand which slots are "supposed" to indicate node failures, and which slots are not.
  Check for any of these conditions in the hmmon command output:
  - controllerResponds reads FALSE
  - controllerIDMismatch reads TRUE
  - nodefail17 reads TRUE (Indicating a failure in the SP switch)
  - Other nodefails show TRUE, and these node failures cannot be attributed to a wide or high node occupying that slot
  If a controller ID mismatch is shown, consult the Managing a HACWS Configuration chapter in PSSP: Administration Guide. For controller responsiveness problems, perform hardware diagnostics on the frame controller. For nodefail17 failures, perform hardware diagnostics on the switch device. For other node failures, perform hardware diagnostics on the node occupying that slot.
- If the fourth test of the spmon -G -d command indicated that the switch power was off, issue the following hmmon command to determine if this was caused by a hardware condition:
```
hmmon -G -Q -s -v nodePower,powerLED,envLED,shutdownTemp frame_num:17
```
  Example output of the hmmon command, showing switch information for frame 1:
```
+--------------------------------------------------------------------------------+
|                                                                                |
|  1  17  powerLED                   1  0x8c47  node/switch LED 1 (green)        |
|  1  17  envLED                     0  0x8c48  node/switch LED 2 (yellow)       |
|  1  17  nodePower            TRUE     0x8c4a  DC-DC power on                   |
|  1  17  shutdownTemp         FALSE    0x8c59  temperature shutdown             |
+--------------------------------------------------------------------------------+
```
  This hmmon command will indicate if the switch has power, if power is available for the switch, if the switch's power was shut down automatically, and if the switch power was shut down due to high temperature. If the switch cannot obtain power, verify that the switch is correctly cabled to its power source. For other conditions, perform hardware diagnostics on the switch device.
- If the fifth test of the spmon -G -d command indicates that a node does not have power, and the node's power was not shut off manually, issue the following hmmon command to determine if the power was disabled because of a hardware condition:
```
hmmon -G -Q -s -v nodePower,powerLED,envLED frame_num:node_num
```
  Example output of the hmmon command for a single node in a single frame:
```
+--------------------------------------------------------------------------------+
|                                                                                |
|  1   1  nodePower            TRUE     0x904a  DC-DC power on                   |
|  1   1  powerLED                   1  0x9047  node/switch LED 1 (green)        |
|  1   1  envLED                     0  0x9048  node/switch LED 2 (yellow)       |
+--------------------------------------------------------------------------------+
```
  This hmmon command will indicate if the node has power, if power is available for the node, and if the node's power was shut down automatically. If the node cannot get power, verify that the node is correctly cabled to its power source. For other conditions, perform hardware diagnostics on the node.
- If the fifth test of the spmon -G -d command indicates that a node's LED/LCD display is not blank, a hardware or operating system error has occurred. The LED/LCD code contains important failure information. When this condition exists, examine the node's LED/LCD value and record the value displayed. Use the following command to examine this value:
```
/usr/lpp/ssp/bin/spmon -L framenumber/nodenumber
```
  To determine the explanation and action for the error, look up this code in SP-specific LED/LCD values. If a three-digit LED/LCD code is not listed in this table, consult Other LED/LCD codes.
- If the fifth test of the spmon -G -d command indicated that the node's LED/LCD value was flashing, and the spmon -L command in the previous bullet indicates that the LED/LCD value is 888, a system dump was initiated on this node. The flashing 888 LED/LCD value indicates that a series of values are stored in the LED/LCD display.
  Step through this list of codes and record each value shown using the following sequence of steps:
  1. Issue the command
```
/usr/lpp/ssp/bin/spmon -reset -t framenumber/nodenumber
```
    to step to the next stored LED/LCD value.
  2. Issue the command
```
/usr/lpp/ssp/bin/spmon -L framenumber/nodenumber
```
    to retrieve the new LED/LCD value.
  3. Record this LED/LCD value
  Repeat these steps until the spmon -L command displays a value of 888 again. Retain this list of codes; the IBM Support Center will require them. To determine the explanation and action for these error codes, look up the codes in SP-specific LED/LCD values. If a three-digit LED/LCD code is not listed in this table, consult Other LED/LCD codes. Finally, save and verify the system dump, following the instructions provided in Producing a system dump.
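
The frame conditions listed earlier for the hmmon output (controllerResponds, controllerIDMismatch, nodefail17, and other nodefail values) can be checked with a small filter. The following is a minimal sketch, not part of PSSP: the function name check_frame_status is hypothetical, and it assumes that the hmmon -s output arrives on standard input as plain whitespace-separated lines (frame, slot, variable, value, address, description) without the box border used in the examples. The argument is the list of node slots that are expected to report TRUE because a wide or high node covers them.

```shell
# Reduce hmmon -s frame output (stdin) to the conditions that need attention.
# $1: space-separated slot numbers covered by wide or high nodes (assumption:
#     taken from the log of the SP structure and setup).
check_frame_status() {
    awk -v skip="$1" '
        BEGIN { n = split(skip, s, " "); for (i = 1; i <= n; i++) covered[s[i]] = 1 }
        $3 == "controllerResponds"   && $4 == "FALSE" { print "frame " $1 ": controller not responding" }
        $3 == "controllerIDMismatch" && $4 == "TRUE"  { print "frame " $1 ": controller ID mismatch" }
        $3 ~ /^nodefail[0-9]+$/      && $4 == "TRUE"  {
            slot = substr($3, 9)          # text after "nodefail"
            if (slot == 17)
                print "frame " $1 ": switch failure (nodefail17)"
            else if (!(slot in covered))
                print "frame " $1 ": node slot " slot " failed"
        }
    '
}
```

Pipe the output of the frame hmmon command into check_frame_status "2 4 14 16" for the configuration in the example; no output means that none of the listed conditions was found.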

Checking Basic Software Information. Once hardware failures have been eliminated, perform some basic software verifications for the SP system. These checks use the dsh command to invoke AIX commands on multiple nodes in parallel. First verify that the dsh command itself can reach the nodes by issuing the following command from the control workstation:

dsh -a -f32 hostname

Example output of the dsh -a -f32 hostname command on a small SP system configuration:

+--------------------------------------------------------------------------------+
|k21n01.ppd.pok.ibm.com: k21n01.ppd.pok.ibm.com                                  |
|k21n03.ppd.pok.ibm.com: k21n03.ppd.pok.ibm.com                                  |
|k21n04.ppd.pok.ibm.com: k21n04.ppd.pok.ibm.com                                  |
|k21n05.ppd.pok.ibm.com: k21n05.ppd.pok.ibm.com                                  |
|k21n06.ppd.pok.ibm.com: k21n06.ppd.pok.ibm.com                                  |
|k21n07.ppd.pok.ibm.com: k21n07.ppd.pok.ibm.com                                  |
|k21n09.ppd.pok.ibm.com: k21n09.ppd.pok.ibm.com                                  |
|k21n11.ppd.pok.ibm.com: k21n11.ppd.pok.ibm.com                                  |
|k21n13.ppd.pok.ibm.com: k21n13.ppd.pok.ibm.com                                  |
+--------------------------------------------------------------------------------+

This test will verify that the dsh command can reach the nodes within the SP system. Only nodes that were previously detected as being offline in the earlier tests should fail to respond to this command. If any other nodes within the SP system fail to respond, check for problems by referring to Diagnosing remote command problems on the SP System.
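
On a larger system it can be easier to compare the responses against a list of expected nodes than to read the output by eye. The following is a minimal sketch under stated assumptions: the function name missing_nodes is hypothetical, the dsh output is read from standard input in the "hostname: output" form shown in the example (without the box border), and the expected host names are supplied by you as arguments.

```shell
# Print each expected host name that has no line in the dsh output.
# stdin: output of the dsh hostname check, one "host: output" line per node.
# $@:    host names that are expected to respond (hypothetical list).
missing_nodes() {
    awk -v expected="$*" -F': ' '
        { seen[$1] = 1 }                  # remember every host that answered
        END {
            n = split(expected, hosts, " ")
            for (i = 1; i <= n; i++)
                if (!(hosts[i] in seen)) print hosts[i]
        }
    '
}
```

Any host name printed by missing_nodes that was not already found to be offline in the earlier hardware tests is a candidate for remote command diagnosis.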

Check the paging space that is in use on all nodes by using the lsps command on the control workstation:

dsh -av -f32 lsps -s | more

Example output of the dsh -av -f32 lsps -s command on a small SP system configuration:

+--------------------------------------------------------------------------------+
|                                                                                |
|k21n01.ppd.pok.ibm.com: Total Paging Space   Percent Used                       |
|k21n01.ppd.pok.ibm.com:       768MB               8%                            |
|k21n03.ppd.pok.ibm.com: Total Paging Space   Percent Used                       |
|k21n03.ppd.pok.ibm.com:       768MB              17%                            |
|k21n04.ppd.pok.ibm.com: Total Paging Space   Percent Used                       |
|k21n04.ppd.pok.ibm.com:       768MB               8%                            |
|k21n05.ppd.pok.ibm.com: Total Paging Space   Percent Used                       |
|k21n05.ppd.pok.ibm.com:       768MB              13%                            |
|k21n06.ppd.pok.ibm.com: Total Paging Space   Percent Used                       |
|k21n06.ppd.pok.ibm.com:       768MB              12%                            |
|k21n07.ppd.pok.ibm.com: Total Paging Space   Percent Used                       |
|k21n07.ppd.pok.ibm.com:       768MB              11%                            |
|k21n09.ppd.pok.ibm.com: Total Paging Space   Percent Used                       |
|k21n09.ppd.pok.ibm.com:       768MB               9%                            |
|k21n11.ppd.pok.ibm.com: Total Paging Space   Percent Used                       |
|k21n11.ppd.pok.ibm.com:       768MB               9%                            |
|k21n13.ppd.pok.ibm.com: Total Paging Space   Percent Used                       |
|k21n13.ppd.pok.ibm.com:       768MB              15%                            |
+--------------------------------------------------------------------------------+

Lack of available paging space can lead to thrashing conditions on a node. If these nodes are running parallel applications, the entire application will be slowed to the rate of the slowest responding node. The extent to which low paging space and thrashing can be tolerated differs from one customer environment to the next. As a general rule of thumb, investigate any node indicating that 80% or more of its paging space is currently in use.
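
The 80% rule of thumb can be applied mechanically to the lsps output. The following is a minimal sketch: the function name high_paging is hypothetical, the threshold defaults to 80 and can be overridden with an argument, and the input is assumed to be the dsh output shown in the example without the box border.

```shell
# List nodes whose paging space usage is at or above a threshold.
# stdin: output of the dsh lsps -s check ("host: size pct%" data lines,
#        plus a header line per node, which is ignored).
# $1:    threshold in percent (defaults to 80).
high_paging() {
    awk -v limit="${1:-80}" '
        NF == 3 && $3 ~ /^[0-9]+%$/ {
            pct = $3 + 0                  # "17%" converts to the number 17
            if (pct >= limit) {
                host = $1
                sub(/:$/, "", host)       # drop the colon added by dsh
                print host, pct "%"
            }
        }
    '
}
```

With the example output above, high_paging prints nothing, because the highest usage shown is 17%.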

Check for file systems that are close to their capacity, concentrating on the file systems mentioned earlier in this section, by issuing the dsh command from the control workstation, to invoke the df command:

dsh -av -f32 df /spdata /var /tmp | more

Example output from the dsh -av -f32 df /spdata /var /tmp command on a small SP system configuration:

+--------------------------------------------------------------------------------+
|                                                                                |
|k21n01: Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on     |
|k21n01: /dev/hd4           32768       432   99%     1403    18%  /             |
|k21n01: /dev/hd9var       147456     45480   70%      610     4%  /var          |
|k21n01: /dev/hd3           98304     38632   61%       85     1%  /tmp          |
|k21n03: Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on     |
|k21n03: /dev/hd4           32768     16960   49%     1431    18%  /             |
|k21n03: /dev/hd9var       704512     99968   86%      595     1%  /var          |
|k21n03: /dev/hd3           98304     50424   49%      278     3%  /tmp          |
|k21n04: Filesystem    512-blocks      Free %Used    Iused %Iused Mounted on     |
|k21n04: /dev/hd4           32768     16584   50%     1512    19%  /             |
|k21n04: /dev/hd9var       147456    107312   28%      644     4%  /var          |
|k21n04: /dev/hd3           98304     91232    8%       74     1%  /tmp          |
+--------------------------------------------------------------------------------+

A file system that is more than 90% utilized is a warning sign. In the example output, the root file system on k21n01 is 99% utilized. Examine any such file system for large files that can be removed or compressed, or consider extending the file system size. Attempt to keep 10MB available in the /var file system and 8MB available in the /tmp file system, to ensure that PSSP software and service software function correctly.
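
Both checks can be applied to the df output with a small filter. The following is a minimal sketch: the function name full_filesystems is hypothetical, the input is assumed to be the dsh output shown in the example without the box border, and the free-space limits assume that 1MB is 2048 blocks of 512 bytes (so 10MB is 20480 blocks and 8MB is 16384 blocks).

```shell
# Flag file systems over 90% used, and /var or /tmp with too little free space.
# stdin: output of the dsh df check, in the column layout shown above
#        (host, device, 512-blocks, Free, %Used, Iused, %Iused, mount point).
full_filesystems() {
    awk '
        $2 == "Filesystem" { next }       # skip the header line of each node
        NF >= 8 {
            host = $1
            sub(/:$/, "", host)           # drop the colon added by dsh
            used = $5 + 0                 # "99%" converts to the number 99
            free_blocks = $4 + 0
            mount = $8
            if (used > 90)
                print host, mount, "is", used "% used"
            if (mount == "/var" && free_blocks < 20480)
                print host, "/var has less than 10MB free"
            if (mount == "/tmp" && free_blocks < 16384)
                print host, "/tmp has less than 8MB free"
        }
    '
}
```

For the example output above, this filter reports only the root file system on k21n01.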

Keeping informed of status changes

Note: In order to successfully issue these commands, you must have "monitor" permission for Kerberos (compat mode) or DCE mode, depending on the authentication method in use.

The previous discussion centered on obtaining the current status of SP system hardware and software. Such efforts are necessary when a problem is suspected and being actively investigated, but repeatedly issuing these commands to examine the current status of the SP system can become tedious. To make the task of monitoring system status easier, PSSP provides monitoring capabilities within the hmmon and spmon commands. A monitor reports status changes as they occur, so the previously discussed commands do not have to be reissued. This section describes some of the more common monitor commands.

To set up a monitor to check for frame hardware failures, issue the following background command:

hmmon -G -q -s -v frPowerOff*,controllerResponds,controllerIDMismatch,\
nodefail* range_of_frame_nums:0 &

Example initial output from the hmmon command:

+--------------------------------------------------------------------------------+
|                                                                                |
|  1   0  nodefail1            FALSE    0x8802  node 01 I2C not responding       |
|  1   0  nodefail2            TRUE     0x8803  node 02 I2C not responding       |
|  1   0  nodefail3            FALSE    0x8804  node 03 I2C not responding       |
|  1   0  nodefail4            FALSE    0x8805  node 04 I2C not responding       |
|  1   0  nodefail5            FALSE    0x8806  node 05 I2C not responding       |
|  1   0  nodefail6            FALSE    0x8807  node 06 I2C not responding       |
|  1   0  nodefail7            FALSE    0x8808  node 07 I2C not responding       |
|  1   0  nodefail8            TRUE     0x8809  node 08 I2C not responding       |
|  1   0  nodefail9            FALSE    0x880a  node 09 I2C not responding       |
|  1   0  nodefail10           TRUE     0x880b  node 10 I2C not responding       |
|  1   0  nodefail11           FALSE    0x880c  node 11 I2C not responding       |
|  1   0  nodefail12           TRUE     0x880d  node 12 I2C not responding       |
|  1   0  nodefail13           FALSE    0x880e  node 13 I2C not responding       |
|  1   0  nodefail14           TRUE     0x880f  node 14 I2C not responding       |
|  1   0  nodefail15           TRUE     0x8810  node 15 I2C not responding       |
|  1   0  nodefail16           TRUE     0x8811  node 16 I2C not responding       |
|  1   0  nodefail17           FALSE    0x8812  switch I2C not responding        |
|  1   0  frPowerOff           FALSE    0x8846  SEPBU frame power off            |
|  1   0  controllerIDMismatch FALSE    0x8871  frame ID mismatch                |
|  1   0  controllerResponds   TRUE     0x88a8  frame responding to polls        |
+--------------------------------------------------------------------------------+

This command is similar to the one presented previously, except that this version continually monitors the frame condition and writes a message to the terminal whenever any of the status values changes. To stop monitoring this information, terminate the background process.

To set up a monitor to check for SP switch hardware status changes, issue the following background command:

hmmon -G -q -s -v nodePower,powerLED,envLED,\
shutdownTemp range_of_frame_nums:17 &

Example initial output from the hmmon command:

+--------------------------------------------------------------------------------+
|                                                                                |
|  1  17  powerLED                   1  0x8c47  node/switch LED 1 (green)        |
|  1  17  envLED                     0  0x8c48  node/switch LED 2 (yellow)       |
|  1  17  nodePower            TRUE     0x8c4a  DC-DC power on                   |
|  1  17  shutdownTemp         FALSE    0x8c59  temperature shutdown             |
+--------------------------------------------------------------------------------+

This command is similar to the one presented previously, except that this version continually monitors the switch condition and writes a message to the terminal whenever any of the status values changes. To stop monitoring this information, terminate the background process.

To set up a monitor to check for changes in a node's LCD or LED status, issue the following background command:

hmmon -G -q -s -v LED7Seg* range_of_frame_nums:1-16 &

Example initial output from the hmmon command:

+--------------------------------------------------------------------------------+
|                                                                                |
|  1   1  LED7SegA                 255  0x909f  7 segment LED A                  |
|  1   1  LED7SegB                 255  0x90a0  7 segment LED B                  |
|  1   1  LED7SegC                 255  0x90a1  7 segment LED C                  |
|  1   3  LED7SegA                 255  0x949f  7 segment LED A                  |
|  1   3  LED7SegB                 255  0x94a0  7 segment LED B                  |
|  1   3  LED7SegC                 255  0x94a1  7 segment LED C                  |
|  1   4  LED7SegA                 255  0x949f  7 segment LED A                  |
|  1   4  LED7SegB                 255  0x94a0  7 segment LED B                  |
|  1   4  LED7SegC                 255  0x94a1  7 segment LED C                  |
|  1   5  LED7SegA                 255  0x949f  7 segment LED A                  |
|  1   5  LED7SegB                 255  0x94a0  7 segment LED B                  |
|  1   5  LED7SegC                 255  0x94a1  7 segment LED C                  |
|  1   6  LED7SegA                 255  0x949f  7 segment LED A                  |
|  1   6  LED7SegB                 255  0x94a0  7 segment LED B                  |
|  1   6  LED7SegC                 255  0x94a1  7 segment LED C                  |
|  1   7  LED7SegA                 255  0x909f  7 segment LED A                  |
|  1   7  LED7SegB                 255  0x90a0  7 segment LED B                  |
|  1   7  LED7SegC                 255  0x90a1  7 segment LED C                  |
|  1   9  LED7SegA                 255  0x909f  7 segment LED A                  |
|  1   9  LED7SegB                 255  0x90a0  7 segment LED B                  |
|  1   9  LED7SegC                 255  0x90a1  7 segment LED C                  |
|  1  11  LED7SegA                 255  0x909f  7 segment LED A                  |
|  1  11  LED7SegB                 255  0x90a0  7 segment LED B                  |
|  1  11  LED7SegC                 255  0x90a1  7 segment LED C                  |
+--------------------------------------------------------------------------------+

This command shows the initial status of these resources, and displays any status changes in these resources when they occur. All values should be 255, indicating that the associated readout element is blank. If any node indicates that a segment is not blank, issue the spmon -L command described previously to obtain the current LCD or LED readout of the node.
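
The check for non-blank segments can be sketched as a filter over the monitor output. This is a minimal sketch, not part of PSSP: the function name nonblank_leds is hypothetical, and the input is assumed to be the hmmon -s output shown in the example (frame, node, variable, value, address, description) without the box border.

```shell
# Print each frame/node pair that has at least one LED segment value other
# than 255 (blank). Each pair is printed only once.
# stdin: hmmon -s output for the LED7Seg* variables.
nonblank_leds() {
    awk '
        $3 ~ /^LED7Seg/ && $4 != 255 {
            key = $1 "/" $2               # frame number / node number
            if (!(key in seen)) { seen[key] = 1; print key }
        }
    '
}
```

Each pair that is printed identifies a node whose readout should be retrieved with the spmon -L command.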

To set up a monitor to check for nodes suddenly losing contact with the SP Switch, issue the following command:

spmon -q -M -l -t frame*/node*/switchResponds/value

Example initial output from the spmon command:

+--------------------------------------------------------------------------------+
|                                                                                |
|/SP/frame/frame1/node1/switchResponds/value/1                                   |
|/SP/frame/frame1/node3/switchResponds/value/1                                   |
|/SP/frame/frame1/node4/switchResponds/value/1                                   |
|/SP/frame/frame1/node5/switchResponds/value/1                                   |
|/SP/frame/frame1/node6/switchResponds/value/1                                   |
|/SP/frame/frame1/node7/switchResponds/value/1                                   |
|/SP/frame/frame1/node9/switchResponds/value/0                                   |
|/SP/frame/frame1/node11/switchResponds/value/0                                  |
|/SP/frame/frame1/node13/switchResponds/value/1                                  |
+--------------------------------------------------------------------------------+

The spmon command also displays the current status, and then writes a message to the terminal whenever any of these values changes. All values should be 1. A value of 0 indicates that the node is not responding to the SP Switch. Note that this is the case with two of the nodes in this example (node9 and node11), and these nodes should be investigated.
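
The nodes that need investigation can be picked out of this output with a short filter. The following is a minimal sketch: the function name switch_down_nodes is hypothetical, and the input is assumed to be the slash-separated spmon output shown in the example without the box border.

```shell
# Print the frame and node of every entry whose switchResponds value is 0.
# stdin: spmon output lines such as
#        /SP/frame/frame1/node9/switchResponds/value/0
switch_down_nodes() {
    # Splitting on "/" gives: $4 = frame, $5 = node, $6 = variable, $8 = value
    awk -F/ '$6 == "switchResponds" && $8 == 0 { print $4 "/" $5 }'
}
```

For the example output above, this filter prints frame1/node9 and frame1/node11.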

Other conditions can also be monitored using the hmmon and spmon commands; these suggestions offer the most basic of tests. To learn what other conditions can be monitored with these commands, and to tailor these commands to best suit your needs, refer to the hmmon and spmon sections of PSSP: Command and Technical Reference.

All commands can be issued from the same terminal session, but this can lead to confusing output when conditions change, and initial values can scroll off the terminal screen. To keep the monitoring manageable, consider issuing these commands from separate terminals, or from separate terminal windows on an X Windows capable terminal. Issue one monitoring command per terminal or terminal window. This associates a terminal with each condition being monitored, and makes the monitor output easier to understand.
