Diagnosis Guide

Asynchronous (batch) notification methods

As system administrator, you cannot always devote your entire attention to monitoring the current status of a system, trying to detect problem conditions before they occur. For even moderately sized SP system configurations, this task can be time consuming and tedious. Other tasks require your attention, so actively monitoring the SP system for potential problem indications cannot become a task that consumes all your time and effort.

Fortunately, PSSP provides tools to monitor system status and conditions on your behalf, as well as the tools discussed previously to assess the current status of the system. Using these tools, you can indicate conditions of particular interest, request asynchronous notification of these events, and cause actions to be initiated when these conditions occur. In essence, the SP system monitors itself, takes action itself, and notifies you of the condition after it occurs. These monitoring tools can be used when you are not immediately available, such as during off-peak hours, or can be used to remove most of your monitoring burden.

Two sets of monitoring tools are available. As with the runtime notification tools mentioned previously, the choice of tool depends on the capabilities of your workstation and your preferences. PSSP provides graphical monitoring tools for use on the control workstation or network-attached terminals, and command-line monitoring tools for situations where only modem access or s1term access is available.

Graphics tools - SP Event Perspective

This tool was introduced in The SP Event Perspective. To use the SP Event Perspective effectively, you must understand certain terminology, which is also introduced in that section. Refer to that section to become familiar with these terms.

Individual SP Perspectives require that certain subsystems be operating and that the user is authorized to communicate with them. Such subsystems include the System Monitor, Event Management, System Data Repository, and Problem Management. For the authorization required for each SP Perspective, see the discussion on using SP Perspectives in PSSP: Administration Guide.

Users of the SP Event Perspective can set up the Perspective to send a notification to the system administrator when conditions of interest exist on the system (or, to use the SP Perspectives terminology, when an event occurs). This is done by associating an action with the event in the event definition. This action can be any command or script that can be run from the AIX command line, such as creating an electronic mail message, starting a process that places a telephone call to the system administrator's pager, sending a message to a specific user at a specific terminal, or issuing any other notification command. The action invoked when the event occurs is called 'the command'.

When creating or modifying the event definition, the user can specify a command to be issued when the condition exists. The following AIX command can be used to have the SP Event Perspective send an electronic mail message to a specific user when the event occurs:

/usr/bin/echo \
"event_condition has occurred `/usr/bin/date` - Location Info: $PMAN_IVECTOR" | \
/usr/bin/mail -s "event_condition Notification" \
username@address

To understand the mechanics of setting up a command within an event definition, consult the Perspectives online help. Click on the Help menu button on the SP Event Perspective display, and select the Tasks... option. Assistance on specifying event definitions is available through the Working with Event Definitions topic.

Whenever the user registers for the event definition through the SP Event Perspective, 'the command' will be issued if the condition exists in the system. The SP Event Perspective does not have to be currently active in order for 'the command' to be issued; provided the user has registered for the event definition, the system will continue to monitor itself for this condition, and issue 'the command' if the condition exists. In other words, the user can use the SP Event Perspective to set up event definitions, register for events, then shut down the SP Event Perspective, and the system will still issue the notification command when the condition occurs. The SP system continually monitors itself for the condition and issues the notification command until the user cancels the event registration.

'The command' associated with an event definition can also be used to automate a response to this condition, instead of merely notifying the system administrator or another user of the condition. This topic will be discussed in Automating your response to problems.

Command line tools - Problem Management

Note: Problem Management does not run in Restricted Root Access (RRA) mode, and it does not run when the system administrator has set Authorization for AIX Remote Commands to "none". See Special troubleshooting considerations for more details.

Problem Management is a software subsystem used in command line and script oriented environments to specify conditions that should be monitored by the SP system, and to specify actions to take when these conditions exist on the SP system. This is the same software subsystem invoked internally by the SP Event Perspective discussed in the previous section.

For users attempting to connect to the SP nodes through low-speed modems or using non-graphical terminals, Problem Management provides a command line interface that can be used in place of the SP Event Perspective. As with other command line oriented tools, the Problem Management command line interface is neither as intuitive nor as easy to use as its graphical counterpart.

Step 1. Prepare to monitor the system. To become familiar with Problem Management, especially regarding the security requirements, see the chapter on Problem Management in PSSP: Administration Guide.

Step 2. Understand what you want to monitor. Problem Management expects the user to know the conditions that are to be monitored. Unlike the SP Event Perspective, Problem Management does not provide an interactive method to query for the list of available conditions and a means to select from these conditions. The user must identify the conditions to be monitored, and provide them as a list to Problem Management. These conditions are identified by naming the associated resource variables, the internal mechanism that contains the current status of the associated resource.

PSSP provides over 300 default resource variables. Conditions to monitor on the SP system provides a suggested list of resource variables to monitor, but specific SP systems may require that additional resources also be monitored. The full list of resource variables is maintained by the Event Management subsystem, and the list can be retrieved using the haemqvar command. This command generates large amounts of information, so it is best to start with a brief report from this command to identify those resources to be monitored:

haemqvar -d | more

This command provides the names of the available resource variables, and a short description of each resource variable:

+--------------------------------------------------------------------------------+
|                                                                                |
|IBM.PSSP.aixos.Proc.swpque   Average count of processes                         |
|                             waiting to be paged in.                            |
|IBM.PSSP.aixos.Proc.runque   Average count of processes that are                |
|                             waiting for the cpu.                               |
|IBM.PSSP.aixos.pagsp.size   Size of paging space (4K pages).                    |
|IBM.PSSP.aixos.pagsp.%free   Free portion of this paging space (percent).       |
|IBM.PSSP.aixos.PagSp.totalsize Total active paging space size (4K pages).       |
|IBM.PSSP.aixos.PagSp.totalfree Total free disk paging space (4K pages).         |
|IBM.PSSP.aixos.PagSp.%totalused Total used disk paging space (percent).         |
|IBM.PSSP.aixos.PagSp.%totalfree   Total free disk space (percent).              |
|IBM.PSSP.aixos.Mem.Virt.pgspgout 4K pages written to paging space by VMM.       |
|IBM.PSSP.aixos.Mem.Virt.pgspgin 4K pages read from paging space by VMM.         |
|IBM.PSSP.aixos.Mem.Virt.pagexct   Total page faults.                            |
|IBM.PSSP.aixos.Mem.Virt.pageout   4K pages written by VMM.                      |
|IBM.PSSP.aixos.Mem.Virt.pagein    4K pages read by VMM.                         |
|IBM.PSSP.aixos.Mem.Real.size   Size of physical memory (4K pages).              |
|IBM.PSSP.aixos.Mem.Real.numfrb   Number of pages on free list.                  |
|IBM.PSSP.aixos.Mem.Real.%pinned   Percent memory which is pinned.               |
|IBM.PSSP.aixos.Mem.Real.%free   Percent memory which is free.                   |
|                          :                                                     |
|                          :                                                     |
+--------------------------------------------------------------------------------+

The haemqvar command lists resources available only on the node where the command is issued. Keep in mind that resources may exist on some nodes and not on others. PSSP: Command and Technical Reference gives a detailed description of the haemqvar command, and how it can be used to locate any resource variable available within the SP system.
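Because the brief report is long, it can help to filter it by keyword instead of paging through all of it. The following is a minimal sketch and is not part of PSSP: find_vars is a hypothetical helper name, and the filter only matches text on the same line as the keyword, so continuation lines of a description are not kept.

```shell
# find_vars KEYWORD: hypothetical helper that filters a brief resource
# variable listing (read from standard input) down to the lines whose
# name or description contains KEYWORD, ignoring case.
find_vars() {
    grep -i -- "$1"
}

# Typical use on a node where PSSP is installed:
#   haemqvar -d | find_vars pagsp | more
```

The helper reads only standard input, so it can be tried on a saved copy of the report before it is used on a live system.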

Once the resource variable to be monitored has been identified, the value type and locator for each resource variable must be identified. The locator informs Problem Management where to monitor the resource. For example, Problem Management needs to know the name of the file system and the node on which a file system resides, if it is to monitor that file system for the amount of space it has available. This information is conveyed to Problem Management through the locator value. To obtain the locator for a resource variable, issue the following haemqvar command:

haemqvar "" resource_variable_name "*"

This command provides details on the resource variable, including the locator keyword needed for Problem Management. The additional information can be helpful in constructing an effective Problem Management definition for the condition.

For example, to obtain the locator field for the IBM.PSSP.CSS.ipackets_drop variable, and to understand more about the variable, issue:

haemqvar "" IBM.PSSP.CSS.ipackets_drop "*"

which produces output similar to:

+--------------------------------------------------------------------------------+
|                                                                                |
|Variable Name:  IBM.PSSP.CSS.ipackets_drop                                      |
|Value Type:     Quantity                                                        |
|Data Type:      long                                                            |
|Initial Value:  0                                                               |
|Class:          IBM.PSSP.CSS                                                    |
|Locator:        NodeNum                                                         |
|Variable Description:                                                           |
|    Number of packets not passed up.                                            |
|                                                                                |
|    A message received by a node from the switch of the                         |
|    Communication SubSystem (CSS) is comprised of packets.                      |
|    IBM.PSSP.CSS.ipackets_drop is the count of the number of good               |
|    incoming packets at the subject node's CSS interface which                  |
|    were dropped by the adapter microcode, since that interface                 |
|    was last initialized.                                                       |
|                                                                                |
|    If a node has too heavy a general workload, it may not service its          |
|    CSS interface often enough, causing its messages to linger in the           |
|    switch network.  If this is allowed to continue, the switch can             |
|    become backed up causing other nodes to encounter poor switch               |
|    performance; in fact, this condition can cause the entire                   |
|    switch to clog. Instead, the adapter microcode drops any "excess"           |
|    packet -- a reliable protocol will eventually retry the message.            |
|                                                                                |
|    For performance reasons, counts such as this are only updated               |
|    approximately once every 2 minutes.                                         |
|                                                                                |
|    This variable is supplied by the "IBM.PSSP.harmld" resource monitor.        |
|                                                                                |
|    Example expression:                                                         |
|                                                                                |
|    To be notified when IBM.PSSP.CSS.ipackets_drop exceeds 100 on any           |
|    node, register for the following event:                                     |
|                                                                                |
|          Resource variable:  IBM.PSSP.CSS.ipackets_drop                        |
|          Resource ID:        NodeNum=*                                         |
|          Expression:         X>100                                             |
|          Re-arm expression:  X<100                                             |
|                                                                                |
|    Resource ID wildcarding:                                                    |
|                                                                                |
|    The resource variable's resource ID is used to specify the number           |
|    of the node (NodeNum) to be monitored. The NodeNum resource ID              |
|    element value may be wildcarded in order to apply a query or event          |
|    registration to all nodes in the domain.                                    |
|                                                                                |
|    Related Resource Variables:                                                 |
|                                                                                |
|      IBM.PSSP.CSS.ibadpackets    Number of bad packets received                |
|                                  by the adapter.                               |
|      IBM.PSSP.CSS.ipackets_lsw   Packets received on interface                 |
|                                  (lsw bits 30-0).                              |
|      IBM.PSSP.CSS.ipackets_msw   Packets received on interface                 |
|                                  (msw bits 61-31).                             |
|Resource ID:    NodeNum=int                                                     |
|  NodeNum: The number of the node for which the information applies.            |
+--------------------------------------------------------------------------------+

The Locator: field indicates the keyword to be used with Problem Management to identify where the resource should be monitored. Note that the haemqvar command offers advice on how to use the locator field in the output.
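When several resource variables are being examined, the name, value type, and locator can be pulled out of the detailed report instead of being read by eye. This is a sketch under assumptions: var_fields is a hypothetical helper name, and it assumes the field labels appear at the start of a line and in the order shown in the sample output above.

```shell
# var_fields: hypothetical helper that reads a detailed haemqvar report on
# standard input and prints "name value_type locator" for each variable.
# Assumes the labels "Variable Name:", "Value Type:" and "Locator:" start
# in the first column, with Locator: appearing after the other two.
var_fields() {
    awk '
        /^Variable Name:/ { name = $3 }
        /^Value Type:/    { vtype = $3 }
        /^Locator:/       { print name, vtype, $2 }
    '
}

# Typical use on a node where PSSP is installed:
#   haemqvar "" IBM.PSSP.CSS.ipackets_drop "*" | var_fields
```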

Step 3. Identify the conditions of interest. Problem Management is informed of the conditions to be monitored through the pmandef command. One pmandef command is needed for each condition to be monitored. This command is used to subscribe to the event, which is similar in concept to the SP Event Perspective's registration of an event definition. To create the subscription, the following information is needed:
- The resource variable name, obtained in Step 2.
- The resource variable locator, also obtained in Step 2.
- The event expression, which indicates the condition of interest in this resource variable
- The rearm expression, which indicates when the condition is no longer of interest
- An event handle, which is a symbolic name that the system administrator will use to refer to this definition.
The event expression indicates the value that the resource variable will have when notice should be given. This value is assigned by the system administrator. The rearm expression indicates the value of the resource variable that indicates that the condition of interest is no longer present. How these expressions are coded depends on the value type of the resource variable. The event handle is a name assigned by the system administrator, which should be descriptive of the condition being monitored.
For example, consider a case where the system administrator is interested in paging space on any SP node. If paging space reaches 90% capacity, the system administrator considers the node to be "thrashing" and wants to be notified. Once this threshold is reached, the node is still considered to be "thrashing" even if a little paging space frees up; the problem is not considered resolved until 40% of the paging space becomes available again. Using this scenario and the haemqvar commands from Step 2, the system administrator identifies these conditions of interest:
- The resource variable name is IBM.PSSP.aixos.PagSp.%totalused, which contains the percentage of used paging space on a node.
- The resource variable locator is NodeNum, meaning that a node number is needed to indicate where the resource is to be monitored. The system administrator wants to monitor the condition on all nodes, so the locator expression is NodeNum=*.
- The event expression must indicate the 90% capacity condition, so the expression X>90 is used.
- The rearm expression must indicate the condition that "turns off" the event, which is when 40% of paging space becomes available, that is, when used paging space falls below 60%. The expression X<60 is used.
- The system administrator assigns the name Node_Thrash_Monitor to this event definition.
Identify these conditions for all resource variables to be monitored. Conditions to monitor on the SP system lists some basic resource variables to monitor and the associated event expressions and rearm expressions.
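The interplay between the event expression X>90 and the rearm expression X<60 can be illustrated with a small simulation. This is only a sketch of the behavior described in the example: it assumes the event is reported once when the event expression becomes true and is not reported again until the rearm expression has been met. The function name simulate and the sample values are hypothetical and do not come from a real system.

```shell
# simulate SAMPLE...: walk through successive readings of paging space
# used (percent) and print where the event and the rearm would occur.
simulate() {
    armed=1                      # 1 = waiting for the event expression
    for x in "$@"; do
        if [ "$armed" -eq 1 ] && [ "$x" -gt 90 ]; then
            echo "$x: event"     # event expression X>90 is true
            armed=0
        elif [ "$armed" -eq 0 ] && [ "$x" -lt 60 ]; then
            echo "$x: rearm"     # rearm expression X<60 is true
            armed=1
        fi
    done
}

simulate 50 92 85 95 70 58 91
```

With these samples, the readings 85, 95, and 70 produce no output, because the event has already been reported and paging space used has not yet dropped below 60%.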
Step 4. Decide how to notify the system administrator. Problem Management associates an action with the event definition. When the condition exists within the system (or, to use the correct terminology, when the event occurs), Problem Management performs the action associated with the event. This action can be any command or script that can be issued from the AIX command line, such as creating an electronic mail message, starting a process that places a telephone call to the system administrator's pager, sending a message to a specific user at a specific terminal, or issuing any other notification command. This action is termed 'the command'.
The user can specify a command to be issued when the condition exists. The following AIX command can be used to have Problem Management send an electronic mail message to a specific user when the event occurs:
```
/usr/bin/echo \
"event_handle has occurred `/usr/bin/date` - Location Info: $PMAN_IVECTOR" | \
/usr/bin/mail -s "event_handle Notification" \
username@address
```
An action can also be run when the condition no longer exists: when the rearm expression has been met. This action, called the 'rearm command', can inform the system administrator that the condition no longer exists, so that the system administrator knows that the condition no longer needs attention. For example:
```
/usr/bin/echo \
"event_handle condition ended `/usr/bin/date` - Location Info: $PMAN_IVECTOR" | \
/usr/bin/mail -s "event_handle Condition Ended" \
username@address
```
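Because the event command and the rearm command differ only in their wording, the message text can be built in one place. The following is a minimal sketch: notify_text is a hypothetical helper name, and the fallback text "unknown" is an assumption for the case where the helper is tried by hand and PMAN_IVECTOR is not set.

```shell
# notify_text HANDLE STATE: hypothetical helper that builds the one-line
# message used by the notification commands above. PMAN_IVECTOR is the
# variable those commands use for the location information.
notify_text() {
    echo "$1 $2 $(date) - Location Info: ${PMAN_IVECTOR:-unknown}"
}

# Typical use, mirroring the commands above:
#   notify_text event_handle "has occurred" | \
#       /usr/bin/mail -s "event_handle Notification" username@address
```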
Step 5. Create an event definition file. For every resource variable to be monitored, one pmandef command must be issued. If more than a handful of resource variables are to be monitored, this can result in a lot of typing. For convenience, create a file containing the pmandef commands to define these events to Problem Management. This simplifies the procedure for informing Problem Management of the resources to monitor, and makes it easier to reissue these same commands at a later time.
The pmandef command informs Problem Management of the conditions to be monitored by subscribing for events. This concept is almost exactly the same as the SP Event Perspective's concept of registering event definitions. To subscribe for events on the chosen resource variables, create a file to contain pmandef commands in the following format:
```
pmandef -s event_handle \
-e 'resource_variable_name:locator:event_expression' \
-r "rearm_expression" \
-c event_command \
-C rearm_command \
-n 0
```
Substitute the following information for the keywords in the previous command format:
- event_handle is the event handle assigned by the system administrator in Step 3.
- resource_variable_name is the name of the resource variable, obtained in Step 2.
- locator is the locator expression, indicating where the resource is to be monitored, determined in Steps 2 and 3.
- event_expression indicates the value the resource variable will have when the condition of interest exists, determined in Step 3.
- rearm_expression indicates the "shut off" value the resource variable will have when the condition no longer exists, determined in Step 3.
- event_command indicates the notification command ('the command') to use for informing the system administrator that the condition exists, created in Step 4.
- rearm_command indicates the notification command ('rearm command') to use for informing the system administrator that the condition no longer exists, created in Step 4.
Continuing with the previous example, the pmandef command to subscribe for the node thrashing condition would be:
```
pmandef -s Node_Thrash_Monitor \
-e 'IBM.PSSP.aixos.PagSp.%totalused:NodeNum=*:X>90' \
-r "X<60" \
-c '/usr/bin/echo "Node_Thrash_Monitor Alert `/usr/bin/date` \
   - Location Info: $PMAN_IVECTOR" | /usr/bin/mail \
-s "Node_Thrash_Monitor Alert" root@adminnode.ibm.com' \
-C '/usr/bin/echo "Node_Thrash_Monitor Cancellation `/usr/bin/date` \
   - Location Info: $PMAN_IVECTOR" | /usr/bin/mail \
-s "Node_Thrash_Monitor Cancellation" root@adminnode.ibm.com' \
-n 0
-n 0
```
One pmandef command is required for each condition being monitored. Save this file and note its name for future reference.
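When many conditions follow the same notification pattern, the file can be generated rather than typed. The following is a sketch only: gen_pmandef is a hypothetical helper name, and it prints text in the pmandef format shown above without running pmandef itself. It assumes mail notification for both the event and the rearm, using the same message wording as the Node_Thrash_Monitor example.

```shell
# gen_pmandef HANDLE VARIABLE LOCATOR EVENT_EXPR REARM_EXPR MAILTO
# Hypothetical helper: prints one pmandef command, ready to be appended
# to the event definition file. Nothing is executed here.
gen_pmandef() {
    handle=$1 var=$2 loc=$3 ev=$4 re=$5 to=$6
    # In the here-document, \\ prints one backslash, and \` and \$ keep
    # the backquotes and $PMAN_IVECTOR literal for later evaluation.
    cat <<EOF
pmandef -s $handle \\
-e '$var:$loc:$ev' \\
-r "$re" \\
-c '/usr/bin/echo "$handle Alert \`/usr/bin/date\` - Location Info: \$PMAN_IVECTOR" | /usr/bin/mail -s "$handle Alert" $to' \\
-C '/usr/bin/echo "$handle Cancellation \`/usr/bin/date\` - Location Info: \$PMAN_IVECTOR" | /usr/bin/mail -s "$handle Cancellation" $to' \\
-n 0
EOF
}

gen_pmandef Node_Thrash_Monitor 'IBM.PSSP.aixos.PagSp.%totalused' \
    'NodeNum=*' 'X>90' 'X<60' root@adminnode.ibm.com
```

Redirecting the output of several gen_pmandef calls into one file produces an event definition file that can be reviewed before it is used.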
Step 6. Subscribe to the events through Problem Management. To define these events to Problem Management, run the pmandef commands saved in the file created in Step 5 by issuing the ksh filename command, where filename is the name of that file. Immediately after issuing the ksh command, issue the following command:
```
pmandef -d all
```
The pmandef -s command not only subscribes to events, it also begins monitoring the resources immediately. The pmandef -d all command disables the monitoring of these resources until you are ready to begin.
Step 7. Begin monitoring. When you are ready to begin monitoring the events that were subscribed in Step 6, issue this command:
```
pmandef -a all
```
This instructs Problem Management to begin monitoring all the conditions that were defined in Step 6. Should any of these events occur, Problem Management will issue 'the command' associated with the event, to inform the system administrator of the event.
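The subscribe, disable, and activate sequence can be rehearsed before it is run for real. The sketch below is an illustration only: the run wrapper and the DRY_RUN variable are hypothetical, and filename stands for the event definition file created earlier. By default it only prints the commands; nothing is executed unless DRY_RUN is set to 0.

```shell
# run: hypothetical wrapper that prints each command and executes it only
# when DRY_RUN is set to 0, so the sequence can be reviewed first.
DRY_RUN=${DRY_RUN:-1}
run() {
    echo "+ $*"
    if [ "$DRY_RUN" -eq 0 ]; then
        "$@"
    fi
}

# deffile is a placeholder for the event definition file created earlier.
deffile=${1:-filename}
run ksh "$deffile"     # subscribe to every event defined in the file
run pmandef -d all     # immediately disable monitoring
# Later, when ready to begin monitoring:
run pmandef -a all
```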
Step 8. Tailor the monitoring. At times, certain conditions should not be checked on certain nodes. For example, Problem Management may be monitoring the space available in the /tmp file system on all nodes, but the system administrator expects /tmp to exceed that limit on a specific node (for example: node 5 on a 32-node SP system) for a certain period of time. If the monitoring is not tailored or modified to compensate for this expected event, the system administrator will be notified that the event occurred just as if it were an unexpected event.
The system administrator can modify the subscribed event to Problem Management. To do this, the system administrator needs to know the following:
- The event handle used for the condition, assigned in Step 3.
- The new locator that excludes the location where the condition is not to be monitored.
Modification of the subscription is done in three steps:
1. The event subscription is disabled, using the pmandef command:
```
pmandef -d event_handle
```
  This deactivates the monitoring for this condition.
2. A new event expression is created, using a locator that excludes the location where the condition is not to be monitored. Using the example of /tmp monitoring, where node 5 is not to be monitored, the event expression would appear as:
```
IBM.PSSP.aixos.FS.%totused:NodeNum=1-4,6-32;VG=rootvg;LV=hd3:X>90
```
3. Issue the pmandef -s command, using the structure provided in Step 5 and the new event expression.
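Writing the node list by hand is error-prone on larger systems, so the NodeNum value can be computed. The following is a minimal sketch: nodes_except and span are hypothetical helper names, nodes are assumed to be numbered consecutively from 1, and a single node is written as a bare number, which is an assumption about the accepted syntax because the example above shows only ranges.

```shell
# nodes_except TOTAL SKIP: hypothetical helper that prints a NodeNum value
# covering nodes 1..TOTAL except node SKIP, for example "1-4,6-32".
nodes_except() {
    total=$1 skip=$2 out=
    # span LO HI: append "LO-HI" (or just "LO" for a single node) to $out.
    # An empty span (LO greater than HI) adds nothing.
    span() {
        [ "$1" -le "$2" ] || return 0
        if [ "$1" -eq "$2" ]; then s=$1; else s="$1-$2"; fi
        out=${out:+$out,}$s
    }
    span 1 $((skip - 1))
    span $((skip + 1)) "$total"
    echo "$out"
}

echo "NodeNum=$(nodes_except 32 5)"    # prints NodeNum=1-4,6-32
```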
Step 9. Stop monitoring. To stop monitoring all events previously enabled in Step 7, issue the following command:
```
pmandef -d all
```

These steps provide an overview of how Problem Management can be used to monitor system events and notify the system administrator when events occur. This is not a complete tutorial on the use of Problem Management. For greater detail on the capabilities and uses of Problem Management, especially with regard to security, consult the Problem Management chapter in PSSP: Administration Guide, the pmandef command, and the haemqvar command in PSSP: Command and Technical Reference.
