Diagnosis Guide

Automating your response to problems

Detecting potential problem conditions before they become critical is the best way to resolve SP system problems. The condition is brought to the attention of the system administrator, who can respond before it affects other hardware and software components. But what if the procedure for correcting the situation is always the same, and the system administrator always runs the same set of commands whenever the condition occurs? Is system administrator intervention really necessary in that case? The answer is no.

PSSP provides the capability to set an automated response to a system condition. This capability is provided by the SP Event Perspective, described in The SP Event Perspective, and by the Problem Management subsystem, described in Command line tools - Problem Management. Using these tools, the system administrator can assign a specific action to be run automatically when the system condition exists, and also when the condition "goes away" (when the rearm event occurs). In the previous discussions in this chapter, the action was kept simple: it sent a notification to the system administrator. The action does not need to be this simple; it can be any AIX command or script. When the event occurs, these tools run the command or script in response.

Important - WHEN actions are performed

A response to a particular system condition may not always be the same, despite initial appearances. For example, when a file system is close to reaching its capacity, the appropriate response in most cases is to increase the file system's capacity using the chfs command. However, disk space is not a limitless resource, and eventually all disk space will be consumed if this approach is used every time the file system nears its capacity. Although this is a convenient solution, it is not always the correct one. The file system should also be checked for large obsolete files that can be deleted, users who exceed their quotas, and directories that can be mounted on other file systems to save space in this file system, among other remedies.

When a system administrator associates an action with an event through Problem Management or the SP Event Perspective, the action is performed each time the event occurs (provided that Problem Management or the SP Event Perspective is monitoring the condition). Neither tool can decide that, for a specific occurrence of the event, the action should be skipped and the system administrator should do further analysis instead. This decision-making process must either be incorporated into the AIX command or script that runs in response to the event, or be left to the system administrator's discretion.
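
As a minimal sketch of building that decision into the response script itself, the fragment below caps how many times an automatic action may run before the script stops acting and leaves the condition to the system administrator. The function names, the history file, and the limit of three are hypothetical choices made for illustration; none of them is part of PSSP.

```shell
#!/bin/ksh
# Hypothetical guard: stop acting automatically after LIMIT responses.
# The file name and the limit are illustrative assumptions.
HISTORY=${HISTORY:-/tmp/var_extend.history}
LIMIT=${LIMIT:-3}

# may_act FILE LIMIT
# Returns 0 while fewer than LIMIT responses are recorded in FILE.
may_act() {
    count=0
    if [ -f "$1" ]; then
        count=$(wc -l < "$1")
    fi
    # $count is unquoted on purpose: some wc versions pad it with blanks.
    [ $count -lt "$2" ]
}

# record_action FILE TEXT
# Appends one line per automatic response so that may_act can count them.
record_action() {
    echo "$(date) $2" >> "$1"
}

# Typical use inside a response script (left commented out in this sketch):
# if may_act "$HISTORY" "$LIMIT"; then
#     /usr/sbin/chfs -a size=+1 /var && record_action "$HISTORY" "extended /var"
# else
#     echo "/var keeps filling up; manual analysis needed" | mail -s "/var" root
# fi
```

Counting every recorded response is the simplest possible rule. A site would more likely count only recent entries, or clear the history file once the system administrator has reviewed the condition.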

Two strategies are offered:

Specify an action for the event that does two things: it notifies the system administrator of the event, and it issues a command to address the event. This strategy allows the system to respond automatically and attempt to resolve the condition before it becomes critical, while still alerting the system administrator, who can then assess whether the action should have been applied in this case. If the action was appropriate, the system administrator does not need to do anything. However, if recent history shows that the event has been recurring at an unusual rate, or that the action does not really resolve the condition and the event continues to occur, the system administrator has received the notification and can respond to the condition.
Build a response command or script that incorporates a decision-making process. This response command can attempt to determine whether a particular action is appropriate for the condition based on other information, such as other events or the recent history of actions taken in response to this event.

The second alternative can involve complicated logic, making it more difficult to implement. For this reason, the first strategy is recommended.
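
The first strategy might be sketched as follows. This is an illustration under stated assumptions, not a PSSP-supplied script: the recipient, the history file, the function names, and the event label are all hypothetical. The wrapper runs the corrective command, records the attempt, and mails the outcome together with a running count so that the system administrator can spot an event that keeps recurring.

```shell
#!/bin/ksh
# Sketch of the first strategy: act on the event and notify either way.
# ADMIN and HISTORY are illustrative assumptions.
ADMIN=${ADMIN:-root}
HISTORY=${HISTORY:-/tmp/event_response.log}

# notify SUBJECT BODY
# Mails the administrator; kept separate so a site can substitute its
# own notification method.
notify() {
    echo "$2" | mail -s "$1" "$ADMIN"
}

# respond_and_notify LABEL COMMAND [ARGS...]
# Runs the corrective command, logs the attempt, and reports the result
# along with the number of responses logged so far for this label.
respond_and_notify() {
    label=$1
    shift
    if "$@"; then
        status=succeeded
    else
        status=failed
    fi
    echo "$(date) $label $status" >> "$HISTORY"
    count=$(grep -c " $label " "$HISTORY")
    notify "Event response: $label" \
        "Automatic action '$*' $status; responses logged so far: $count"
}

# Example event command body (commented out in this sketch):
# respond_and_notify var_full /usr/sbin/chfs -a size=+1 /var
```

Because the notification is sent whether or not the command succeeds, a failing or ineffective action still reaches the system administrator.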

Important - WHERE actions are performed

Actions associated with events are performed on the node that requested that the event be monitored, which is not necessarily the node where the condition exists. For example, if a system administrator uses Problem Management from the control workstation to monitor a condition on all nodes in the SP system and that condition occurs on Node 42, the action is run on the control workstation, not on Node 42. If the system administrator had associated a chfs command with the event, the chfs command would run on the control workstation and modify the control workstation's file system space, not the file system on Node 42.

When associating actions with events, keep in mind that the action will be performed by default on the node that asked for the event to be monitored. If action is to be taken on the node where the condition actually exists, the command invoked must determine where the condition occurred from the event information, and attempt to invoke remote processes on that node.

Both the SP Event Perspective and Problem Management make several environment variables available to the commands associated with the event. These variables are described in the chapter on using the Problem Management subsystem in PSSP: Administration Guide. Any command or script invoked by Problem Management or the SP Event Perspective has access to these variables. The variable PMAN_IVECTOR indicates where the condition exists. The command or script can parse the value of this environment variable, extract the node location information, and use that information to construct the appropriate remote command.
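
One hedged way to see what an event command actually receives is to register a small script that records its environment. The sketch below assumes that the variables share the PMAN_ prefix, as PMAN_IVECTOR does; the function name and the output file are hypothetical.

```shell
#!/bin/ksh
# List the Problem Management variables visible to an event command.
# Assumption: they all start with PMAN_, as PMAN_IVECTOR does.
dump_pman_env() {
    env | grep '^PMAN_' | sort
}

# Example use as an event command (hypothetical output file):
# dump_pman_env >> /tmp/pman_env.out
```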

For example, consider the case where the /var file system is being monitored to ensure that it does not reach its capacity. When the file system does reach its capacity, the chfs command is to be invoked on the node where the condition exists to extend the size of the /var file system. To perform this action, a Korn Shell script is created. This script examines the contents of the PMAN_IVECTOR value, which has the following components to identify where the condition exists:

VG=rootvg;LV=hd9var;NodeNum=node_number_where_condition_exists

Once the node number has been found in the PMAN_IVECTOR value, the script will then find the host name for that node in the SDR. The script then uses the dsh command to issue the chfs command on the remote node to extend the size of the /var file system:

+--------------------------------------------------------------------------------+
|                                                                                |
|#!/bin/ksh                                                                      |
|# Extend /var on the node identified in PMAN_IVECTOR.                           |
|OLDIFS=$IFS                                                                     |
|IFS=';'                                                                         |
|# Split the vector into fields at each semicolon.                               |
|set -- $PMAN_IVECTOR                                                            |
|for TOKEN in "$@"                                                               |
|do                                                                              |
|        if [[ $TOKEN = NodeNum* ]]                                              |
|        then                                                                    |
|                IFS='='                                                         |
|                print "$TOKEN" | read JUNK NODENUM                              |
|                # Look up the node's host name in the SDR.                      |
|                HOST=$(SDRGetObjects -x Node node_number\=\=$NODENUM | \        |
|                        awk '{print $11}')                                      |
|        fi                                                                      |
|done                                                                            |
|IFS=$OLDIFS                                                                     |
|if [[ -n "$HOST" ]]                                                             |
|then                                                                            |
|        # Run chfs on the remote node, not on the monitoring node.              |
|        dsh -w "$HOST" /usr/sbin/chfs -a size\=+1 /var                          |
|fi                                                                              |
+--------------------------------------------------------------------------------+

This script is saved to an AIX file on the node where the monitoring request is made, and the execute permission is set on the file. The full path name of the file can then be provided to either Problem Management or the SP Event Perspective as a command to be issued when the event occurs.
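
The node-number extraction in the script above can also be written without changing IFS. The helper below is an alternative sketch, not the guide's method; the function name is hypothetical, and it relies only on the VG=...;LV=...;NodeNum=... layout shown earlier.

```shell
#!/bin/ksh
# node_from_ivector VECTOR
# Prints the NodeNum value from an instance vector such as
# "VG=rootvg;LV=hd9var;NodeNum=42"; returns 1 if there is no NodeNum field.
node_from_ivector() {
    case ";$1" in
        *";NodeNum="*)
            rest=${1##*NodeNum=}    # drop everything through "NodeNum="
            echo "${rest%%;*}"      # drop any fields that follow
            ;;
        *)
            return 1
            ;;
    esac
}

# Example: NODENUM=$(node_from_ivector "$PMAN_IVECTOR")
```

Keeping the parsing in a function leaves IFS untouched for the rest of the script and lets the extraction be checked on its own, away from the SDR and dsh calls.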

Graphical tools - SP Event Perspective

Asynchronous (batch) notification methods introduced the concept of associating an action with a specific condition through the SP Event Perspective. This was done by providing a command within an event definition. In the previous section, this command was relatively simple: it issued an electronic mail message to a specific user to report the occurrence of the event.

To create an automatic response to the condition, provide an AIX command or script in the command field in addition to (or in place of) the notification command that was used in the previous discussion. Be sure to understand where the SP Event Perspective will attempt to run the command before assigning a command to this event, by referring to Important - WHERE actions are performed.

Command line tools - Problem Management

Asynchronous (batch) notification methods introduced the concept of associating an action with a specific condition through the pmandef command of Problem Management. This association was done by specifying an argument to the -c and -C options of the pmandef command. In the previous section, this argument was relatively simple: it issued an electronic mail message to a specific user to report the occurrence of the event.

To create an automatic response to the condition, provide an AIX command or script as the argument to the -c or -C option of the pmandef command, in addition to (or in place of) the notification command that was used in the previous discussion. Be sure to understand where Problem Management will attempt to run the command before assigning a command to this event, by referring to Important - WHERE actions are performed.
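
A subscription of this kind might look like the sketch below. Only the -c and -C options are taken from this section. The -s, -e, and -r options, the handle name, the resource variable, the instance vector, the thresholds, and both script paths are assumptions made for illustration and should be checked against the pmandef reference before use.

```shell
#!/bin/ksh
# Sketch: attach a response script to a "/var is filling up" event.
# Handle name, resource variable, thresholds, and script paths are
# illustrative assumptions, not values taken from this guide.
PMANDEF=${PMANDEF:-pmandef}     # set PMANDEF=echo to preview the command

subscribe_var_monitor() {
    "$PMANDEF" -s var_full_response \
        -e 'IBM.PSSP.aixos.FS.%totused:NodeNum=*;VG=rootvg;LV=hd9var:X>90' \
        -r 'X<80' \
        -c /usr/local/bin/extend_var.ksh \
        -C /usr/local/bin/var_recovered.ksh
}

# Uncomment to create the subscription:
# subscribe_var_monitor
```

Here the -c argument would be the full path of a response script such as the one shown under Important - WHERE actions are performed, and the -C argument a script to run when the rearm event occurs.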
