PSSP provides tools for monitoring system status and conditions at runtime, while the system administrator is actively watching the system. These tools are used when the system administrator wants to know the current status of system resources immediately, or wants to be notified immediately of problems and potential trouble situations.
Two sets of runtime tools are available. The choice of the tools depends on the capabilities of the system administrator's workstation and the system administrator's preferences. PSSP provides graphical tools for use on the control workstation or network-attached terminals. PSSP also provides command-line tools for those situations when only modem access or s1term access is available.
PSSP provides graphical tools for system administration and monitoring through the SP Perspectives tool suite. Perspectives is engineered for ease-of-use for the system administrator. In order to be used effectively, SP Perspectives requires X11 graphics capable terminals or workstations and high-speed connections. Use Perspectives when monitoring the SP system from the control workstation or from a network-attached workstation.
The basic concepts of Perspectives and examples of its use are included in the SP Perspectives chapter of PSSP: Administration Guide. Perspectives also provides extensive online help information. To understand how to accomplish the tasks that are presented in this chapter, consult the SP Perspectives online help, using this section as a guide to the online help topics.
Individual Perspectives require that certain subsystems be operating and that the user is authorized to communicate with them. Such subsystems include the System Monitor, Event Management, System Data Repository, and Problem Management. For authorization required for each Perspective, see the discussion on using SP Perspectives in PSSP: Administration Guide.
The Perspectives launch pad is started using the perspectives command, which resides in the /usr/lpp/ssp/bin directory. Other Perspectives, such as the Event or Hardware Perspective, can be started from the launch pad. Before starting an SP Perspective, be sure that the DISPLAY environment variable is set to the machine on which you want to display the SP Perspective. Also, run the xhost command on that machine to be sure that you are permitted to display to it.
Two Perspectives tools are useful for monitoring the system status and detecting problem situations:
This tool allows the user to specify system conditions that are of concern or importance, and to indicate what actions are to be taken when the condition exists. The Perspective interfaces with the Event Management software subsystem to monitor these conditions and alert the Perspective to the presence of the condition. To effectively use this Perspective, you must understand certain terminology.
The rearm expression indicates when the SP Event Perspective should consider the event to have "stopped". For example, a file system is considered "almost full" when the available space is less than 10% of its capacity. The system administrator may want to consider the condition to exist until the available space reaches 13% of the file system's capacity. The event expression would then be set to 10% and the rearm expression to 13%. As with the event expression, the system administrator can indicate an action to take when the rearm expression occurs, such as deactivating reserve resources that had been activated when the event occurred.
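The relationship between the event expression and the rearm expression can be illustrated with a small shell sketch. This is only a model of the 10% / 13% file system example above; the function name next_state and the variables EVENT_BELOW and REARM_AT are hypothetical, and the actual evaluation is performed by the Event Management subsystem, not by a script like this.

```shell
# Illustrative model of the event / rearm behavior (not PSSP code).
# next_state PREVIOUS_STATE FREE_PERCENT
#   PREVIOUS_STATE is "armed" (no event outstanding) or "triggered".
EVENT_BELOW=10    # event expression: available space is less than 10%
REARM_AT=13       # rearm expression: available space has reached 13%

next_state() {
    prev=$1
    free=$2
    if [ "$prev" = "armed" ] && [ "$free" -lt "$EVENT_BELOW" ]; then
        echo triggered        # the event occurs; the event action would run
    elif [ "$prev" = "triggered" ] && [ "$free" -ge "$REARM_AT" ]; then
        echo armed            # the rearm occurs; the rearm action would run
    else
        echo "$prev"          # no change between the two thresholds
    fi
}
```

With these values, a file system that drops to 9% free and then recovers to 11% free is still considered "almost full"; the condition ends only when free space reaches 13%.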
Once the user registers the event definition, the action will be run whenever the event or rearm expression occurs. This is independent of whether the Event Perspective is active at the time that the event or rearm expression occurs.
To start the SP Event Perspective, double click on the Event Perspective icon in the SP Perspectives launch pad window.
Users can create conditions for situations that are important to them through the Conditions pane of the SP Event Perspective. A number of default conditions are provided through the SP Event Perspective. You may wish to add more or to tailor the predefined conditions to meet the specific needs of your particular SP installation. The Perspectives online help provides assistance on how to create conditions and how to modify existing conditions. To access this help, click on the Help button from the SP Event Perspective display, and select the Tasks... option. Assistance in handling conditions is available through the Working with Conditions topic.
Once a condition is defined through the SP Event Perspective, an action can then be associated with it. The action may be as simple as a visual notification that the event has occurred, or the action can be more sophisticated, including automatically invoking a command in response to the event. To associate the appropriate action with the presence of the condition (or with the absence of the condition), an event definition must be created. You can create these definitions and examine default definitions through the Event Definitions pane of the SP Event Perspective. The Perspective online help provides assistance on how to create event definitions and how to modify existing definitions. To access this help, click on the Help button from the SP Event Perspective display and select the Tasks... option. Assistance in handling event definitions is available through the Working with Event Definitions topic.
Only after both the condition and its associated event definition are defined to the Perspective can you begin monitoring the condition. This is done by registering the event definition through the Event Definitions pane in the SP Event Perspective. To find how this is done, consult the Perspective's Working with Event Definitions online help topic.
Other basic SP Event Perspective tasks are described in the online help. To access this information, click on the Help button from the SP Event Perspective display, select the Tasks... option, and click on the How Do I ...? topic.
Depending on how the event definition was constructed, the SP Event Perspective reacts in one or more of the following ways when you register the event definition, and the condition that the event definition is based on occurs:
Once you register the event definition, the action runs whenever the event occurs, whether or not the SP Event Perspective is active at the time the event occurs.
The actions performed when the event or the rearm expression occurs can be one of the following:
errpt -at -J HA_PMAN_EVENT_ON -J HA_PMAN_EVENT_OFF
Notification can be sent to the system administrator whenever these templates are logged to the AIX Error Log. For instructions on setting up this notification, consult Using the AIX Error Notification Facility.
The SP Event Perspective is designed to be a multi-user tool. Multiple users can invoke the SP Event Perspective in parallel and monitor different conditions. Notifications are routed to those users that registered the associated event definition. The Perspective also stores event definitions created by each user in the user's $HOME/.$USER:Events file. By storing these definitions in different files, each user can tailor conditions and event definitions to best suit the user's needs. This also prevents users from accidentally modifying conditions or event definitions created or used by other SP Perspectives users.
This tool allows you to examine the current status of the SP system hardware. Through this tool, you can display a graphical representation of the system's overall structure, assess the current status of system hardware, and issue hardware control commands.
For further assistance in using the notebook to view hardware status, consult the Perspective's online help. To access this help, click on the Help button from the SP Hardware Perspective display and select the Tasks... option. Assistance in viewing hardware status is available in the Viewing Hardware Attributes topic.
If you want to view the same hardware information from multiple entities, such as the responsiveness to the switch for a series of nodes, opening a notebook for each entity can be time-consuming. The SP Hardware Perspective offers an alternative method for displaying this information. Most information displayed in a notebook can also be displayed in the pane in a table format. The left column of the table contains the objects from the pane, while the columns to the right contain the information you want displayed from the notebook.
To switch from the icon to table view in a pane, select the icon on the right of the toolbar which shows a table and an icon. When you point at this icon, the descriptive text reads "Show object in the table view or the icon view". The first time you select this toolbar icon, the "Set Table Attributes" dialog is displayed. This dialog lists the attributes from the objects notebook that you can display in the table. After selecting the desired items from the list, select "OK". The pane will be updated with the items you selected. The table entities that represent variable states of the hardware entity will be color coded to indicate "good", "bad", "caution" status as they did in the notebook.
For assistance on using the table view to examine hardware status, consult the Perspective online help. To access this help, click on the Help button from the SP Hardware Perspective display and select the Overview... option. From the new window that appears, select the Starting and Customizing SP Perspective topic, then select the Customizing SP Perspective subtopic, and finally select the Using Table View item. The Help option from the selection list window provides a fast path to the help topic.
When monitoring is active, the icons of the entities give a visual indication of the status. If none of the monitored conditions indicates a problem, the icon is presented in green. If any of the monitored conditions indicates a problem, the icon appears with a red X drawn through it. Note that this occurs even if only one condition indicates a problem. For example, if five nodes are being monitored for five conditions, and one of these five conditions appears on node1a, the icon for node1a will appear with a red X through it, while the remaining nodes will be represented with green icons.
To determine what condition may exist on a marked entity, select the object by single-clicking on its icon or its table entry in the pane. Then open the object's notebook by clicking on the notebook icon in the upper left corner. When the notebook display comes up, page forward to the "Monitored Conditions" page. This page lists conditions being monitored for that object, along with the condition's current state. Any state listed as "Triggered" indicates that the condition is present.
If any object in a pane is presented in gray with a question mark (?) drawn over it for longer than a few seconds, a communication problem exists between the SP Hardware Perspective and the Event Management software subsystem. For assistance in resolving the problem, consult Diagnosing Perspectives problems on the SP System.
For further assistance in setting up and starting the hardware monitor, consult the Perspectives online help facility. To access this help, click on the Help button from the SP Hardware Perspective display and select the Tasks... option. From the new window that appears, select the Monitoring Hardware Objects topic.
There are some characteristics of the SP Hardware Perspective that the user should keep in mind when using the tool. Unlike the SP Event Perspective, the SP Hardware Perspective does not permit the user to associate an action with the presence of a condition. Users that wish to automate a response to a specific system condition should use the SP Event Perspective. Also, the SP Hardware Perspective only monitors conditions while it is active. If the Perspective is shut down, any monitoring of hardware status is also shut down.
To be able to restart the Perspectives so that monitoring will automatically start, you will need to save the configuration to a profile. From the menu bar select Options > Save Preferences... The Save Preferences dialog will be displayed. For more information on using this dialog, select the Help button at the bottom. To start the Hardware Perspective with the saved profile:
sphardware -userProfile name_you_specified
sphardware -systemProfile name_you_specified
Previous versions of PSSP offered a graphical user interface as part of the System Monitor (spmon) command. PSSP Version 3.1 and later versions incorporate this hardware control capability into the SP Hardware Perspective. While the capability of the spmon command is available through the Perspective, the "look and feel" of the control is somewhat different. The SP Hardware Perspective offers a special online help facility to acclimate former spmon graphical interface users to the new controls. To access this help, click on the Help menu bar item in the SP Hardware Perspective, select the Tasks... option, and then select the Transforming System Monitor Experience into Hardware Perspectives Skills topic from the help menu.
Each Perspective provides its own unique capabilities. For the purposes of problem monitoring and determination, this manual recommends that the SP Event Perspective be used to monitor conditions of interest for the SP system. When the SP Event Perspective indicates that a hardware failure condition exists, the SP Hardware Perspective should be used to examine the current status of the system hardware and obtain more detailed information about the hardware problem.
PSSP provides command-oriented tools for system administration in addition to graphical tools for system administration and monitoring. These tools require no special workstation capability or high-speed connection, making them usable from almost any terminal type in any mode of access. Use these tools when examining system status through a modem connection or through a node's S1 serial port. The tools discussed in this section are documented in greater detail in PSSP: Command and Technical Reference, PSSP: Administration Guide, and AIX Version 4 Commands Reference. These tools do not possess the same ease-of-use characteristics as their Perspectives-based counterparts, although they do provide the same basic function.
Several commands are useful for monitoring the system status and detecting problem situations:
The spmon and dsh commands require the user to have specific authorizations. To learn how a user can acquire these authorizations, see the "Using the SP System Monitor" chapter of PSSP: Administration Guide.
The spmon command permits the user to control and monitor SP hardware resources through a command-line interface without requiring a graphics-capable terminal or high-speed connection. The spmon command does not provide the capability to examine software status (such as paging space, file system space, or software subsystem activity). The spmon command provides access to more node-specific information than the hmmon command, which is introduced next. The spmon command provides a predefined system query to check the most basic problem conditions within the SP system.
The hmmon command provides hardware monitoring functions similar to the spmon command, and gives you access to more SP hardware information for frames and switches than the spmon command does. The hmmon command provides the capability to monitor frame and switch status as well as node status. The hmmon command is intended as a general-purpose SP hardware monitor. Although it has access to more SP information than the spmon command, it does not have access to some of the node-specific information that the spmon command does. The hmmon command does not provide a predefined system query, which the spmon command does.
The df command is an AIX command that examines the current status of file systems, such as current file system size and current available space within these file systems. While this command is designed to examine the AIX system on which it is issued, it can be invoked remotely with the dsh command to acquire this information for all nodes. Three file systems are of particular importance for all SP nodes:
This directory contains configuration information for PSSP software and also contains copies of information from the SDR. By default, this directory resides in the / (or root) file system. Insufficient space in this file system can result in failures in PSSP software, especially those dependent on the SDR for proper operation. As a rule of thumb, ensure that this file system has at least 5% of its capacity available at any time.
One method for avoiding space problems for the /spdata directory is to create a separate file system for this directory. Follow the instructions in PSSP: Installation and Migration Guide to create a separate volume group for this file system. Use the same rule of thumb for spotting potential trouble with this file system.
This file system contains AIX system logs, such as the error log and user access logs. It also contains logs maintained by PSSP software for serviceability purposes. Some of these logs are never cleared except by explicit system administrator actions. If left unattended, they can grow to consume all available space.
As a rule of thumb, ensure that 10MB of space is available within this file system at all times. If the file system reaches this threshold, either extend the file system's capacity with the chfs command, or examine the file system to determine where the space is being consumed and remove unneeded files.
If /var is continually reaching the suggested threshold, this condition may indicate a chronic problem with some PSSP software or with specific hardware devices. Examine the logs listed in Error logging overview to determine if any show increased or extended activity, and perform any associated problem determination procedures if necessary.
This file system is used by various user level applications, software products, and PSSP programs for temporary storage. Some legacy PSSP applications use this file system to store trace logs used for serviceability purposes. Some applications may inadvertently leave temporary files in the /tmp file system, or these applications may terminate before removing these files.
Insufficient space in /tmp can cause PSSP software to fail. As a rule of thumb, ensure that at least 8MB of space is available in this file system at any time. Eight MB is the amount of space a snap command will require if the system has to produce a dump to be sent to the IBM Support Center.
These space capacities can be verified using the dsh command to invoke the df command on all nodes in the SP system.
The lsps command provides an instant assessment of the currently available paging space for an AIX system. As with the df command, the lsps command provides information for the AIX system on which it runs. Using the lsps command with the dsh command or via a remote command, you can obtain the assessment for all nodes in the SP system.
Paging space availability by itself does not necessarily indicate a problem. Having only ten percent of 2 gigabytes of paging space available is not as significant a condition as having only ten percent of 100MB available. Also, one system's critical situation may be a tolerable situation for another system. Because of this discrepancy, this manual will not suggest a default figure for a critical paging space situation. Use your knowledge of the system setup, system workload, and any past paging space problems to determine this value.
The lssrc command provides information for software services currently installed on an AIX system. Using lssrc, you can determine if a software service is active or inactive. Use this command in cases where a software service does not appear to be responding to requests for service on a specific node. To check software service status on multiple nodes, use this command through the dsh command.
The dsh command permits the user to issue a command on a remote node and to view the results on the local node. Using dsh, you can issue the commands listed previously on any SP node from a single location. This removes the need to log in to each node individually. A user must have specific authorization to use the dsh command. To learn how a user can acquire this authorization, see the "Using the SP System Monitor" chapter of PSSP: Administration Guide.
The following scenarios demonstrate how these tools are used to query and monitor the status of the SP system.
This task is accomplished through the following series of steps:
This check should be performed by users authorized to invoke the spmon and dsh commands. To learn how a user can acquire this authorization, see the "Using the SP System Monitor" chapter in PSSP: Administration Guide.
/usr/lpp/ssp/bin/spmon -G -d | more
This test verifies several items in the monitor program itself to make sure that it is running. Once the monitor verification completes, the spmon command checks the status of the SP frames and obtains information about the SP nodes. The spmon command performs these tests in a dependent order, so that if one of the early checks fails, subsequent checks are not performed. For example, if a frame cannot be queried, the frame and the nodes within that frame are not checked.
Example output from the spmon -G -d command:
    1. Checking server process
    Process 10512 has accumulated 192 minutes and 53 seconds.
    Check ok

    2. Opening connection to server
    Connection opened
    Check ok

    3. Querying frames (s)
    1 frames (s)
    Check ok

    4. Checking frames

          Controller Slot 17 Switch Switch    Power supplies
    Frame Responds   Switch  Power  Clocking  A   B   C   D
    --------------------------------------------------------------
      1   yes        yes     on     0         on  on  on  on

    5. Checking nodes
    ------------------------------- Frame 1 ----------------------
    Frame Node   Node        Host/Switch Key    Env  Front Panel     LCD/LED is
    Slot  Number Type  Power Responds    Switch Fail LCD/LED         Flashing
      1     1    wide  on    yes  yes    normal no   LEDs are blank  no
      3     3    thin  on    yes  yes    normal no   LEDs are blank  no
      4     4    thin  on    yes  yes    normal no   LEDs are blank  no
      5     5    thin  on    yes  yes    normal no   LEDs are blank  no
      6     6    thin  on    yes  yes    normal no   LEDs are blank  no
      7     7    wide  on    yes  yes    normal no   LEDs are blank  no
      9     9    wide  on    yes  yes    normal no   LEDs are blank  no
     11    11    wide  on    yes  yes    normal no   LEDs are blank  no
     13    13    wide  on    yes  yes    N/A    no   LCDs are blank  no
Note that these tests are numbered. This makes it easy to detect if a test was omitted. The results of this command indicate potential problems if any of these conditions exist:
hmmon -G -q -s -v frPowerOff*,controllerResponds,\
controllerIDMismatch,nodefail* range_of_frame_nums:0
Output is similar to:
    1 0 nodefail1            FALSE 0x8802 node 01 I2C not responding
    1 0 nodefail2            TRUE  0x8803 node 02 I2C not responding
    1 0 nodefail3            FALSE 0x8804 node 03 I2C not responding
    1 0 nodefail4            TRUE  0x8805 node 04 I2C not responding
    1 0 nodefail5            FALSE 0x8806 node 05 I2C not responding
    1 0 nodefail6            FALSE 0x8807 node 06 I2C not responding
    1 0 nodefail7            FALSE 0x8808 node 07 I2C not responding
    1 0 nodefail8            FALSE 0x8809 node 08 I2C not responding
    1 0 nodefail9            FALSE 0x880a node 09 I2C not responding
    1 0 nodefail10           FALSE 0x880b node 10 I2C not responding
    1 0 nodefail11           FALSE 0x880c node 11 I2C not responding
    1 0 nodefail12           FALSE 0x880d node 12 I2C not responding
    1 0 nodefail13           FALSE 0x880e node 13 I2C not responding
    1 0 nodefail14           TRUE  0x880f node 14 I2C not responding
    1 0 nodefail15           FALSE 0x8810 node 15 I2C not responding
    1 0 nodefail16           TRUE  0x8811 node 16 I2C not responding
    1 0 nodefail17           FALSE 0x8812 switch I2C not responding
    1 0 frPowerOff           FALSE 0x8846 SEPBU frame power off
    1 0 controllerIDMismatch FALSE 0x8871 frame ID mismatch
    1 0 controllerResponds   TRUE  0x88a8 frame responding to polls
This command tests whether any of the frame's power supplies are off, whether the frame controller is experiencing problems, and whether any of the node slot connections are bad. Keep in mind the warning made earlier: because wide and high nodes occupy more than one node slot in a frame, node failures will be detected for node slots that cannot be used because a wide or high node occupies that space.
Such a situation is demonstrated in the example output listed previously. In this example, the nodes occupying slots 1 and 3 are wide nodes, as are the nodes occupying slots 13 and 15. Node slots 2, 4, 14, and 16 are therefore unusable, but the hmmon command indicates that nodes in these unavailable slots have failed. The log of the SP structure and setup is needed to understand which slots are "supposed" to indicate node failures, and which slots are not.
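One way to separate expected nodefail indications from real ones is to filter the hmmon output against a list of slots known to be covered by wide or high nodes. The following is a minimal sketch that assumes the output format shown in the example above (frame, slot, variable name, value, in that order). The function name unexpected_nodefails and the frame:slot list format are hypothetical; the list itself must come from your own record of the SP structure.

```shell
# unexpected_nodefails "FRAME:SLOT FRAME:SLOT ..."
# Reads hmmon output on standard input and prints FRAME:SLOT for every
# nodefailN variable that is TRUE and is not in the list of slots that
# are expected to report a failure (slots covered by wide or high nodes).
# Note: nodefail17 refers to the switch and is reported as slot 17.
unexpected_nodefails() {
    awk -v expected="$1" '
        BEGIN {
            n = split(expected, list, " ")
            for (i = 1; i <= n; i++) skip[list[i]] = 1
        }
        $3 ~ /^nodefail[0-9]+$/ && $4 == "TRUE" {
            slot = substr($3, 9)      # digits that follow "nodefail"
            key = $1 ":" slot
            if (!(key in skip)) print key
        }
    '
}
```

For the example output above, piping the hmmon output into unexpected_nodefails "1:2 1:4 1:14 1:16" would print nothing, because every TRUE value falls in a slot that is expected to be unusable.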
Check for any of these conditions in the hmmon command output:
If a controller ID mismatch is shown, consult the Managing a HACWS Configuration chapter in PSSP: Administration Guide. For controller responsiveness problems, perform hardware diagnostics on the frame controller. For nodefail17 failures, perform hardware diagnostics on the switch device. For other node failures, perform hardware diagnostics on the node occupying that slot.
hmmon -G -Q -s -v nodePower,powerLED,envLED,shutdownTemp frame_num:17
Example output of the hmmon command, showing switch information for frame 1:
    1 17 powerLED     1     0x8c47 node/switch LED 1 (green)
    1 17 envLED       0     0x8c48 node/switch LED 2 (yellow)
    1 17 nodePower    TRUE  0x8c4a DC-DC power on
    1 17 shutdownTemp FALSE 0x8c59 temperature shutdown
This hmmon command will indicate if the switch has power, if power is available for the switch, if the switch's power was shut down automatically, and if the switch power was shut down due to high temperature. If the switch cannot obtain power, verify that the switch is correctly cabled to its power source. For other conditions, perform hardware diagnostics on the switch device.
hmmon -G -Q -s -v nodePower,powerLED,envLED frame_num:node_num
Example output of the hmmon command for a single node in a single frame:
    1 1 nodePower TRUE 0x904a DC-DC power on
    1 1 powerLED  1    0x9047 node/switch LED 1 (green)
    1 1 envLED    0    0x9048 node/switch LED 2 (yellow)
This hmmon command will indicate if the node has power, if power is available for the node, and if the node's power was shut down automatically. If the node cannot get power, verify that the node is correctly cabled to its power source. For other conditions, perform hardware diagnostics on the node.
/usr/lpp/ssp/bin/spmon -L framenumber/nodenumber
To determine the explanation and action for the error, look up this code in SP-specific LED/LCD values. If a three-digit LED/LCD code is not listed in this table, consult Other LED/LCD codes.
Step through this list of codes and record each value shown using the following sequence of steps:
/usr/lpp/ssp/bin/spmon -reset -t framenumber/nodenumber
to step to the next stored LED/LCD value.
/usr/lpp/ssp/bin/spmon -L framenumber/nodenumber
to retrieve the new LED/LCD value.
Repeat these steps until the spmon -L command displays a value of 888 again. Retain this list of codes; they will be required by the IBM Support Center. To determine the explanation and action for these error codes, look up the codes in SP-specific LED/LCD values. If a three-digit LED/LCD code is not listed in this table, consult Other LED/LCD codes. Finally, save and verify the system dump, following the instructions provided in Producing a system dump.
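The step-and-record sequence can be sketched as a loop. The output format of spmon -L is not shown here, so the sketch relies on two hypothetical wrapper functions that you would have to supply: step_led, which issues the spmon -reset -t command for the node, and read_led, which issues the spmon -L command and prints only the three-digit code. The function name collect_led_codes and the safety limit of 32 iterations are also assumptions.

```shell
# collect_led_codes [MAX_STEPS]
# Steps through the stored LED/LCD values and prints each one until
# the value 888 appears again. Requires two caller-supplied functions
# (hypothetical names):
#   step_led   advances to the next stored value (spmon -reset -t ...)
#   read_led   prints the current three-digit value (spmon -L ...)
collect_led_codes() {
    max=${1:-32}              # arbitrary safety limit, an assumption
    i=0
    while [ "$i" -lt "$max" ]; do
        step_led
        code=$(read_led)
        [ "$code" = "888" ] && break
        echo "$code"
        i=$((i + 1))
    done
}
```

Redirecting the output of collect_led_codes to a file gives a record of the codes that can be retained for the IBM Support Center.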
dsh -a -f32 hostname
Example output of the dsh -a -f32 hostname command on a small SP system configuration:
    k21n01.ppd.pok.ibm.com: k21n01.ppd.pok.ibm.com
    k21n03.ppd.pok.ibm.com: k21n03.ppd.pok.ibm.com
    k21n04.ppd.pok.ibm.com: k21n04.ppd.pok.ibm.com
    k21n05.ppd.pok.ibm.com: k21n05.ppd.pok.ibm.com
    k21n06.ppd.pok.ibm.com: k21n06.ppd.pok.ibm.com
    k21n07.ppd.pok.ibm.com: k21n07.ppd.pok.ibm.com
    k21n09.ppd.pok.ibm.com: k21n09.ppd.pok.ibm.com
    k21n11.ppd.pok.ibm.com: k21n11.ppd.pok.ibm.com
    k21n13.ppd.pok.ibm.com: k21n13.ppd.pok.ibm.com
This test will verify that the dsh command can reach the nodes within the SP system. Only nodes that were previously detected as being offline in the earlier tests should fail to respond to this command. If any other nodes within the SP system fail to respond, check for problems by referring to Diagnosing remote command problems on the SP System.
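On larger systems it can be easier to compare the responders against a list of nodes that are expected to answer than to read the output by eye. The following sketch assumes the "hostname: output" format shown in the example above; the function name missing_nodes is hypothetical, and the expected host names must be supplied by the caller.

```shell
# missing_nodes HOST [HOST ...]
# Reads "dsh -a -f32 hostname" output on standard input and prints each
# expected host name that did not produce a line of output.
missing_nodes() {
    # Keep only the text before the first colon on each line.
    responders=$(sed 's/:.*//' | sort -u)
    for host in "$@"; do
        printf '%s\n' "$responders" | grep -F -x -q -- "$host" ||
            printf '%s\n' "$host"
    done
}
```

A possible invocation is dsh -a -f32 hostname | missing_nodes followed by the names of the nodes that were online in the earlier tests. Any name printed identifies a node to investigate.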
dsh -av -f32 lsps -s | more
Example output of the dsh -av -f32 lsps -s command on a small SP system configuration:
    k21n01.ppd.pok.ibm.com: Total Paging Space   Percent Used
    k21n01.ppd.pok.ibm.com:       768MB                8%
    k21n03.ppd.pok.ibm.com: Total Paging Space   Percent Used
    k21n03.ppd.pok.ibm.com:       768MB               17%
    k21n04.ppd.pok.ibm.com: Total Paging Space   Percent Used
    k21n04.ppd.pok.ibm.com:       768MB                8%
    k21n05.ppd.pok.ibm.com: Total Paging Space   Percent Used
    k21n05.ppd.pok.ibm.com:       768MB               13%
    k21n06.ppd.pok.ibm.com: Total Paging Space   Percent Used
    k21n06.ppd.pok.ibm.com:       768MB               12%
    k21n07.ppd.pok.ibm.com: Total Paging Space   Percent Used
    k21n07.ppd.pok.ibm.com:       768MB               11%
    k21n09.ppd.pok.ibm.com: Total Paging Space   Percent Used
    k21n09.ppd.pok.ibm.com:       768MB                9%
    k21n11.ppd.pok.ibm.com: Total Paging Space   Percent Used
    k21n11.ppd.pok.ibm.com:       768MB                9%
    k21n13.ppd.pok.ibm.com: Total Paging Space   Percent Used
    k21n13.ppd.pok.ibm.com:       768MB               15%
Lack of available paging space can lead to thrashing conditions on a node. If these nodes are running parallel applications, the entire application will be slowed to the rate of the slowest responding node. The extent to which low paging space and thrashing can be tolerated differs from one customer environment to the next. As a general rule of thumb, investigate any node indicating that 80% or more of its paging space is currently in use.
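The 80% rule of thumb can be applied to the dsh output with a short filter. This sketch assumes the lsps -s output format shown in the example above, where the percentage is the last field of each data line. The function name paging_over is hypothetical, and the threshold is a parameter because the tolerable level differs between installations.

```shell
# paging_over [PERCENT]
# Reads "dsh -av -f32 lsps -s" output on standard input and prints the
# host name and percentage for every node whose paging space usage is
# at or above PERCENT (default 80).
paging_over() {
    awk -v limit="${1:-80}" '
        $NF ~ /^[0-9]+%$/ {           # data lines end in a percentage
            pct = $NF
            sub(/%$/, "", pct)
            host = $1
            sub(/:$/, "", host)       # drop the colon added by dsh
            if (pct + 0 >= limit + 0) print host, pct "%"
        }
    '
}
```

A possible invocation is dsh -av -f32 lsps -s | paging_over 80. For the example output above, nothing would be printed, because the highest usage shown is 17%.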
dsh -av -f32 df /spdata /var /tmp | more
Example output from the dsh -av -f32 df /spdata /var /tmp command on a small SP system configuration:
    k21n01: Filesystem    512-blocks   Free  %Used  Iused %Iused Mounted on
    k21n01: /dev/hd4           32768    432    99%   1403    18% /
    k21n01: /dev/hd9var       147456  45480    70%    610     4% /var
    k21n01: /dev/hd3           98304  38632    61%     85     1% /tmp
    k21n03: Filesystem    512-blocks   Free  %Used  Iused %Iused Mounted on
    k21n03: /dev/hd4           32768  16960    49%   1431    18% /
    k21n03: /dev/hd9var       704512  99968    86%    595     1% /var
    k21n03: /dev/hd3           98304  50424    49%    278     3% /tmp
    k21n04: Filesystem    512-blocks   Free  %Used  Iused %Iused Mounted on
    k21n04: /dev/hd4           32768  16584    50%   1512    19% /
    k21n04: /dev/hd9var       147456 107312    28%    644     4% /var
    k21n04: /dev/hd3           98304  91232     8%     74     1% /tmp
A clear warning sign is any of these file systems being more than 90% utilized. If a file system is over 90% utilized, examine it for large files that can be removed or compressed, or consider extending the file system. Attempt to keep at least 10MB available in the /var file system and at least 8MB available in the /tmp file system, to ensure that PSSP software and service software function correctly.
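These checks can be applied to the df output with a small filter. The following is a sketch only and is not part of PSSP: the function name fs_warnings is hypothetical, and it assumes the column layout of the preceding example, with sizes reported in 512-byte blocks. In those units, 10MB is 20480 blocks and 8MB is 16384 blocks.

```shell
# Hypothetical helper (not a PSSP command).
# Reads "dsh ... df" output in the format of the preceding example on stdin
# and reports file systems that appear to be short on space.
fs_warnings() {
    awk '
        $2 == "Filesystem" { next }      # skip the header line of each node
        NF >= 8 {
            node = $1;  sub(/:$/, "", node)
            used = $5;  sub(/%$/, "", used)
            free = $4
            mount = $8
            if (used + 0 > 90)
                print node, mount, "is", used "% used"
            # 10MB = 20480 512-byte blocks; 8MB = 16384 512-byte blocks
            if (mount == "/var" && free + 0 < 20480)
                print node, mount, "has less than 10MB free"
            if (mount == "/tmp" && free + 0 < 16384)
                print node, mount, "has less than 8MB free"
        }
    '
}
```

For example, issue dsh -av -f32 df /spdata /var /tmp | fs_warnings. For the preceding example output, the only line reported is for the / file system on k21n01, which is 99% used.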
The previous discussion centered on obtaining the current status of SP system hardware and software. Such efforts are necessary when a problem is suspected and is being actively investigated, but repeatedly issuing these commands to check the current status of the SP system becomes tedious. To make monitoring easier, the hmmon and spmon commands also provide monitoring capabilities, so that you do not have to reissue the previously discussed commands to stay informed of the system status. This section describes some of the more common monitor commands.
To set up a monitor to check for frame hardware failures, issue the following background command:
hmmon -G -q -s -v frPowerOff*,controllerResponds,controllerIDMismatch,\
nodefail* range_of_frame_nums:0 &
Example initial output from the hmmon command:
+--------------------------------------------------------------------------------+
1   0 nodefail1            FALSE 0x8802 node 01 I2C not responding
1   0 nodefail2            TRUE  0x8803 node 02 I2C not responding
1   0 nodefail3            FALSE 0x8804 node 03 I2C not responding
1   0 nodefail4            FALSE 0x8805 node 04 I2C not responding
1   0 nodefail5            FALSE 0x8806 node 05 I2C not responding
1   0 nodefail6            FALSE 0x8807 node 06 I2C not responding
1   0 nodefail7            FALSE 0x8808 node 07 I2C not responding
1   0 nodefail8            TRUE  0x8809 node 08 I2C not responding
1   0 nodefail9            FALSE 0x880a node 09 I2C not responding
1   0 nodefail10           TRUE  0x880b node 10 I2C not responding
1   0 nodefail11           FALSE 0x880c node 11 I2C not responding
1   0 nodefail12           TRUE  0x880d node 12 I2C not responding
1   0 nodefail13           FALSE 0x880e node 13 I2C not responding
1   0 nodefail14           TRUE  0x880f node 14 I2C not responding
1   0 nodefail15           TRUE  0x8810 node 15 I2C not responding
1   0 nodefail16           TRUE  0x8811 node 16 I2C not responding
1   0 nodefail17           FALSE 0x8812 switch I2C not responding
1   0 frPowerOff           FALSE 0x8846 SEPBU frame power off
1   0 controllerIDMismatch FALSE 0x8871 frame ID mismatch
1   0 controllerResponds   TRUE  0x88a8 frame responding to polls
+--------------------------------------------------------------------------------+
This command is similar to the one presented previously, except that this version continually monitors the frame condition and writes a message to the terminal whenever any status changes. To stop monitoring this information, terminate the background process.
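Starting a monitor in the background and terminating it later can be sketched with two small shell functions. These helpers are hypothetical and are not part of PSSP: the names start_monitor, stop_monitor, and MONITOR_PID are assumptions made for illustration, and sending the monitor output to a log file instead of the terminal is an optional choice.

```shell
# Hypothetical helpers (not PSSP commands).

# start_monitor LOGFILE COMMAND [ARGS...]
#   Runs COMMAND in the background, appends its output to LOGFILE, and
#   saves the process ID of the background process in MONITOR_PID.
start_monitor() {
    monitor_log=$1
    shift
    "$@" >>"$monitor_log" 2>&1 &
    MONITOR_PID=$!
}

# stop_monitor PID
#   Terminates a monitor that was started by start_monitor in this shell.
stop_monitor() {
    kill "$1" 2>/dev/null || return 1    # no such process
    wait "$1" 2>/dev/null                # collect the exit status
    return 0
}
```

To use the sketch, pass a log file name followed by the monitor command and its arguments (without the trailing &) to start_monitor, save the value of MONITOR_PID, and pass that value to stop_monitor when monitoring is no longer needed.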
To set up a monitor to check for SP switch hardware status changes, issue the following background command:
hmmon -G -q -s -v nodePower,powerLED,envLED,\
shutdownTemp range_of_frame_nums:17 &
Example initial output from the hmmon command:
+--------------------------------------------------------------------------------+
1  17 powerLED     1     0x8c47 node/switch LED 1 (green)
1  17 envLED       0     0x8c48 node/switch LED 2 (yellow)
1  17 nodePower    TRUE  0x8c4a DC-DC power on
1  17 shutdownTemp FALSE 0x8c59 temperature shutdown
+--------------------------------------------------------------------------------+
This command is similar to the one presented previously, except that this version continually monitors the switch condition and writes a message to the terminal whenever any status changes. To stop monitoring this information, terminate the background process.
To set up a monitor to check for changes in a node's LCD or LED status, issue the following background command:
hmmon -G -q -s -v LED7Seg* range_of_frame_nums:1-16 &
Example initial output from the hmmon command:
+--------------------------------------------------------------------------------+
1   1 LED7SegA 255 0x909f 7 segment LED A
1   1 LED7SegB 255 0x90a0 7 segment LED B
1   1 LED7SegC 255 0x90a1 7 segment LED C
1   3 LED7SegA 255 0x949f 7 segment LED A
1   3 LED7SegB 255 0x94a0 7 segment LED B
1   3 LED7SegC 255 0x94a1 7 segment LED C
1   4 LED7SegA 255 0x949f 7 segment LED A
1   4 LED7SegB 255 0x94a0 7 segment LED B
1   4 LED7SegC 255 0x94a1 7 segment LED C
1   5 LED7SegA 255 0x949f 7 segment LED A
1   5 LED7SegB 255 0x94a0 7 segment LED B
1   5 LED7SegC 255 0x94a1 7 segment LED C
1   6 LED7SegA 255 0x949f 7 segment LED A
1   6 LED7SegB 255 0x94a0 7 segment LED B
1   6 LED7SegC 255 0x94a1 7 segment LED C
1   7 LED7SegA 255 0x909f 7 segment LED A
1   7 LED7SegB 255 0x90a0 7 segment LED B
1   7 LED7SegC 255 0x90a1 7 segment LED C
1   9 LED7SegA 255 0x909f 7 segment LED A
1   9 LED7SegB 255 0x90a0 7 segment LED B
1   9 LED7SegC 255 0x90a1 7 segment LED C
1  11 LED7SegA 255 0x909f 7 segment LED A
1  11 LED7SegB 255 0x90a0 7 segment LED B
1  11 LED7SegC 255 0x90a1 7 segment LED C
+--------------------------------------------------------------------------------+
This command shows the initial status of these resources, and displays any status changes in these resources when they occur. All values should be 255, which indicates that the associated readout element is blank. If any node indicates that a segment is not blank, issue the spmon -L command discussed previously to obtain the current LCD or LED readout of the node.
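In a large system, non-blank segments are easy to miss among the 255 values. The following sketch, which is not part of PSSP, prints only the lines whose value is not 255. The function name led_not_blank is hypothetical, and the labels "frame" and "slot" assume that the first two columns of the hmmon output identify the frame and the slot, with the variable name and value in the third and fourth columns as in the preceding example.

```shell
# Hypothetical helper (not a PSSP command).
# Reads hmmon output in the format of the preceding example on stdin and
# prints every LED7Seg* entry whose value is not 255 (that is, not blank).
led_not_blank() {
    awk '$3 ~ /^LED7Seg/ && $4 != 255 {
        print "frame", $1, "slot", $2, $3 "=" $4
    }'
}
```

To use the sketch, pipe the output of the hmmon command into led_not_blank instead of sending it directly to the terminal.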
To set up a monitor to check for nodes suddenly losing contact with the SP Switch, issue the following command:
spmon -q -M -l -t frame*/node*/switchResponds/value
Example initial output from the spmon command:
+--------------------------------------------------------------------------------+
/SP/frame/frame1/node1/switchResponds/value/1
/SP/frame/frame1/node3/switchResponds/value/1
/SP/frame/frame1/node4/switchResponds/value/1
/SP/frame/frame1/node5/switchResponds/value/1
/SP/frame/frame1/node6/switchResponds/value/1
/SP/frame/frame1/node7/switchResponds/value/1
/SP/frame/frame1/node9/switchResponds/value/0
/SP/frame/frame1/node11/switchResponds/value/0
/SP/frame/frame1/node13/switchResponds/value/1
+--------------------------------------------------------------------------------+
The spmon command displays the current status, and then writes a message to the terminal if any of these values change. All values should be 1. A value of 0 indicates that the node is not responding to the SP Switch. This is the case for two of the nodes in this example (node9 and node11), and these nodes should be investigated.
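The nodes that are not responding can be picked out of this output with a short filter. The following is a sketch under the assumption that every line has the slash-separated form shown in the preceding example; the function name switch_down is hypothetical and is not part of PSSP.

```shell
# Hypothetical helper (not a PSSP command).
# Reads spmon output in the format of the preceding example on stdin and
# prints the frame and node of every entry whose switchResponds value is 0.
switch_down() {
    # With "/" as the separator, a line such as
    #   /SP/frame/frame1/node9/switchResponds/value/0
    # splits into: "", SP, frame, frame1, node9, switchResponds, value, 0
    awk -F/ '$6 == "switchResponds" && $NF == 0 { print $4, $5 }'
}
```

For the preceding example output, the filter prints "frame1 node9" and "frame1 node11".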
Other conditions can also be monitored using the hmmon and spmon commands; these suggestions offer the most basic of tests. To learn what other conditions can be monitored with these commands, and to tailor these commands to best suit your needs, refer to the hmmon and spmon sections of PSSP: Command and Technical Reference.
All of these commands can be issued from the same terminal session, but the output becomes confusing when conditions change, and the initial values can scroll off the terminal screen. To keep the monitoring manageable, consider issuing these commands from separate terminals, or from separate terminal windows on an X Windows capable terminal. Issue one monitoring command per terminal or terminal window. This associates a terminal with each condition being monitored, and makes the monitor output easier to understand.