As system administrator, you cannot always devote your entire attention to monitoring the current status of a system, trying to detect problem conditions before they occur. For even moderately sized SP system configurations, this task can be time consuming and tedious. Other tasks require your attention, so actively monitoring the SP system for potential problem indications cannot become a task that consumes all your time and effort.
Fortunately, PSSP provides tools to monitor system status and conditions on your behalf, as well as the tools discussed previously to assess the current status of the system. Using these tools, you can indicate conditions of particular interest, request asynchronous notification of these events, and cause actions to be initiated when these conditions occur. In essence, the SP system monitors itself, takes action itself, and notifies you of the condition after it occurs. These monitoring tools can be used when you are not immediately available, such as during off-peak hours, or can be used to remove most of your monitoring burden.
Two sets of monitoring tools are available. As with the runtime notification tools mentioned previously, the choice of tool depends on the capabilities of your workstation and your preferences. PSSP provides graphical monitoring tools for use on the control workstation or a network-attached terminal, and also provides command-line monitoring tools for those situations where only modem access or s1term access is available.
The graphical monitoring tool is the SP Event Perspective, which was introduced in The SP Event Perspective. To use the SP Event Perspective effectively, you must understand certain terminology that is defined in that section. Refer to that section to become familiar with these terms.
Individual SP Perspectives require that certain subsystems be operating and that the user is authorized to communicate with them. Such subsystems include the System Monitor, Event Management, System Data Repository, and Problem Management. For the authorization required for each SP Perspective, see the discussion on using SP Perspectives in PSSP: Administration Guide.
Users of the SP Event Perspective can set up the Perspective to send a notification to the system administrator when conditions of interest exist on the system (or, to use the SP Perspectives terminology, when an event occurs). This is done by associating an action with the event in the event definition. This action can be any command or script that can be run from the AIX command line, such as creating an electronic mail message, starting a process that places a telephone call to the system administrator's pager, sending a message to a specific user at a specific terminal, or issuing any other notification command. The action invoked when the event occurs is called 'the command'.
When creating or modifying the event definition, the user can specify a command to be issued when the condition exists. The following AIX command can be used to have the SP Event Perspective send an electronic mail message to a specific user when the event occurs:
/usr/bin/echo \
"event_condition has occurred `/usr/bin/date` - Location Info: $PMAN_IVECTOR" | \
/usr/bin/mail -s "event_condition Notification" \
username@address
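As a minimal sketch, the message text of the notification command above can be previewed from a small script before the command is attached to an event definition. The function name build_event_message is hypothetical and not part of PSSP. PMAN_IVECTOR is the variable used in the command above; the sketch assumes it is set in the environment when the command runs and substitutes "unknown" when it is not.

```shell
#!/bin/sh
# Sketch only: build_event_message is a hypothetical helper, not a PSSP tool.
# It formats the same text that the echo command above produces.
build_event_message() {
    event_name=$1
    # PMAN_IVECTOR is assumed to be set when the command is run for an event;
    # fall back to "unknown" so the helper can be tried by hand.
    printf '%s has occurred %s - Location Info: %s\n' \
        "$event_name" "$(date)" "${PMAN_IVECTOR:-unknown}"
}

# Preview the message without sending mail.
build_event_message "event_condition"

# To send it, pipe the output to mail as in the command above, for example:
# build_event_message "event_condition" | \
#     /usr/bin/mail -s "event_condition Notification" username@address
```

Previewing the text this way makes it easier to catch quoting mistakes before the command is stored in an event definition.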
To understand the mechanics of setting up a command within an event definition, consult the Perspectives online help. Click on the Help menu button on the SP Event Perspective display, and select the Tasks... option. Assistance on specifying event definitions is available through the Working with Event Definitions topic.
Whenever the user registers for the event definition through the SP Event Perspective, 'the command' will be issued if the condition exists in the system. The SP Event Perspective does not have to be currently active in order for 'the command' to be issued; provided the user has registered for the event definition, the system will continue to monitor itself for this condition, and issue 'the command' if the condition exists. In other words, the user can use the SP Event Perspective to set up event definitions, register for events, then shut down the SP Event Perspective, and the system will still issue the notification command when the condition occurs. The SP system continually monitors itself for the condition and issues the notification command until the user cancels the event registration.
'The command' associated with an event definition can also be used to automate a response to this condition, instead of merely notifying the system administrator or another user of the condition. This topic will be discussed in Automating your response to problems.
Problem Management is a software subsystem used in command line and script oriented environments to specify conditions that should be monitored by the SP system, and to specify actions to take when these conditions exist on the SP system. This is the same software subsystem invoked internally by the SP Event Perspective discussed in the previous section.
For users attempting to connect to the SP nodes through low-speed modems or using non-graphical terminals, Problem Management provides a command line interface that can be used in place of the SP Event Perspective. As with other command line oriented tools, the Problem Management command line interface is less intuitive and less easy to use than its graphical counterpart.
PSSP provides over 300 default resource variables. Conditions to monitor on the SP system provides a suggested list of resource variables to monitor, but specific SP systems may require that additional resources also be monitored. The full list of resource variables is maintained by the Event Management subsystem, and the list can be retrieved using the haemqvar command. This command generates large amounts of information, so it is best to start with a brief report from this command to identify those resources to be monitored:
haemqvar -d | more
This command provides the names of the available resource variables, and a short description of each resource variable:
IBM.PSSP.aixos.Proc.swpque        Average count of processes
                                  waiting to be paged in.
IBM.PSSP.aixos.Proc.runque        Average count of processes that are
                                  waiting for the cpu.
IBM.PSSP.aixos.pagsp.size         Size of paging space (4K pages).
IBM.PSSP.aixos.pagsp.%free        Free portion of this paging space (percent).
IBM.PSSP.aixos.PagSp.totalsize    Total active paging space size (4K pages).
IBM.PSSP.aixos.PagSp.totalfree    Total free disk paging space (4K pages).
IBM.PSSP.aixos.PagSp.%totalused   Total used disk paging space (percent).
IBM.PSSP.aixos.PagSp.%totalfree   Total free disk space (percent).
IBM.PSSP.aixos.Mem.Virt.pgspgout  4K pages written to paging space by VMM.
IBM.PSSP.aixos.Mem.Virt.pgspgin   4K pages read from paging space by VMM.
IBM.PSSP.aixos.Mem.Virt.pagexct   Total page faults.
IBM.PSSP.aixos.Mem.Virt.pageout   4K pages written by VMM.
IBM.PSSP.aixos.Mem.Virt.pagein    4K pages read by VMM.
IBM.PSSP.aixos.Mem.Real.size      Size of physical memory (4K pages).
IBM.PSSP.aixos.Mem.Real.numfrb    Number of pages on free list.
IBM.PSSP.aixos.Mem.Real.%pinned   Percent memory which is pinned.
IBM.PSSP.aixos.Mem.Real.%free     Percent memory which is free.
   :
   :
The haemqvar command lists resources available only on the node where the command is issued. Keep in mind that resources may exist on some nodes and not on others. PSSP: Command and Technical Reference gives a detailed description of the haemqvar command, and how it can be used to locate any resource variable available within the SP system.
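Because the brief report is long, it can help to filter it for the resources of interest. The following is a minimal sketch, assuming the report has one resource variable name per line followed by its description, as in the listing above. The helper name find_vars is hypothetical, and the sample lines are copied from that listing so the sketch runs without an SP system.

```shell
#!/bin/sh
# Sketch only: find_vars is a hypothetical helper that filters a
# "name description" listing read from standard input.
find_vars() {
    # -i: ignore case; -F: treat the pattern as a fixed string, so the
    # dots and percent signs in variable names are matched literally.
    grep -i -F -- "$1"
}

# Sample lines copied from the listing above, used so the sketch runs
# without an SP system.
sample='IBM.PSSP.aixos.Proc.runque Average count of processes that are waiting for the cpu.
IBM.PSSP.aixos.PagSp.totalsize Total active paging space size (4K pages).
IBM.PSSP.aixos.PagSp.%totalused Total used disk paging space (percent).
IBM.PSSP.aixos.Mem.Real.%free Percent memory which is free.'

printf '%s\n' "$sample" | find_vars "PagSp."

# On an SP system, the same filter would be applied to the real report:
# haemqvar -d | find_vars "PagSp."
```

With the sample data, only the two PagSp lines are printed.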
Once the resource variable to be monitored has been identified, the value type and locator for each resource variable must be identified. The locator informs Problem Management where to monitor the resource. For example, Problem Management needs to know the name of the file system and the node on which a file system resides, if it is to monitor that file system for the amount of space it has available. This information is conveyed to Problem Management through the locator value. To obtain the locator for a resource variable, issue the following haemqvar command:
haemqvar "" resource_variable_name "*"
This command provides details on the resource variable, including the locator keyword needed for Problem Management. The additional information can be helpful in constructing an effective Problem Management definition for the condition.
For example, to obtain the locator field for the IBM.PSSP.CSS.ipackets_drop variable, and to understand more about the variable, issue:
haemqvar "" IBM.PSSP.CSS.ipackets_drop "*"
which produces output similar to:
Variable Name: IBM.PSSP.CSS.ipackets_drop
Value Type: Quantity
Data Type: long
Initial Value: 0
Class: IBM.PSSP.CSS
Locator: NodeNum
Variable Description:
    Number of packets not passed up.

    A message received by a node from the switch of the
    Communication SubSystem (CSS) is comprised of packets.
    IBM.PSSP.CSS.ipackets_drop is the count of the number of good
    incoming packets at the subject node's CSS interface which
    were dropped by the adapter microcode, since that interface
    was last initialized.

    If a node has too heavy a general workload, it may not service its
    CSS interface often enough, causing its messages to linger in the
    switch network. If this is allowed to continue, the switch can
    become backed up causing other nodes to encounter poor switch
    performance; in fact, this condition can cause the entire
    switch to clog. Instead, the adapter microcode drops any "excess"
    packet -- a reliable protocol will eventually retry the message.

    For performance reasons, counts such as this are only updated
    approximately once every 2 minutes.

    This variable is supplied by the "IBM.PSSP.harmld" resource monitor.

    Example expression:

    To be notified when IBM.PSSP.CSS.ipackets_drop exceeds 100 on any
    node, register for the following event:

        Resource variable: IBM.PSSP.CSS.ipackets_drop
        Resource ID: NodeNum=*
        Expression: X>100
        Re-arm expression: X<100

    Resource ID wildcarding:

    The resource variable's resource ID is used to specify the number
    of the node (NodeNum) to be monitored. The NodeNum resource ID
    element value may be wildcarded in order to apply a query or event
    registration to all nodes in the domain.

    Related Resource Variables:

    IBM.PSSP.CSS.ibadpackets    Number of bad packets received
                                by the adapter.
    IBM.PSSP.CSS.ipackets_lsw   Packets received on interface
                                (lsw bits 30-0).
    IBM.PSSP.CSS.ipackets_msw   Packets received on interface
                                (msw bits 61-31).
Resource ID: NodeNum=int
    NodeNum: The number of the node for which the information applies.
The Locator: field indicates the keyword to be used with Problem Management to identify where the resource should be monitored. Note that the haemqvar command offers advice on how to use the locator field in the output.
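When several resource variables are being examined, the fields needed for Problem Management can be picked out of the detailed report with a short filter. The following is a minimal sketch, assuming each field appears at the start of a line as "Label: value", as in the output above. The helper name get_field is hypothetical, and the sample text is copied from that output so the sketch runs without an SP system.

```shell
#!/bin/sh
# Sketch only: get_field is a hypothetical helper that prints the value of
# the first "Label: value" line found in a report read on standard input.
get_field() {
    awk -v label="$1" '
        # Match only lines that begin with the label followed by a colon,
        # and report the first such line only.
        !found && index($0, label ":") == 1 {
            value = substr($0, length(label) + 2)
            sub(/^[ \t]+/, "", value)   # trim leading blanks
            print value
            found = 1
        }'
}

# Sample lines copied from the output above.
sample='Variable Name: IBM.PSSP.CSS.ipackets_drop
Value Type: Quantity
Data Type: long
Initial Value: 0
Class: IBM.PSSP.CSS
Locator: NodeNum'

printf '%s\n' "$sample" | get_field "Locator"
printf '%s\n' "$sample" | get_field "Value Type"

# On an SP system, the report itself would be piped in, for example:
# haemqvar "" IBM.PSSP.CSS.ipackets_drop "*" | get_field "Locator"
```

With the sample data, the two calls print NodeNum and Quantity.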
The event expression indicates the value that the resource variable will have when notice should be given. This value is assigned by the system administrator. The rearm expression indicates the value of the resource variable that indicates that the condition of interest is no longer present. How these expressions are coded depends on the value type of the resource variable. The event handle is a name assigned by the system administrator, which should be descriptive of the condition being monitored.
For example, consider a case where the system administrator is interested in paging space on any SP node. If paging space usage exceeds 90% of capacity, the system administrator considers the node to be "thrashing" and wants to be notified. The node is still considered to be "thrashing" after this threshold is reached, even if a little paging space frees up; the problem is not considered resolved until more than 40% of the paging space is available again. Using this scenario and the haemqvar commands from Step 2, the system administrator identifies these conditions of interest:
Event handle: Node_Thrash_Monitor
Resource variable: IBM.PSSP.aixos.PagSp.%totalused
Locator: NodeNum=* (all nodes)
Event expression: X>90
Rearm expression: X<60
Identify these conditions for all resource variables to be monitored. Conditions to monitor on the SP system lists some basic resource variables to monitor and the associated event expressions and rearm expressions.
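The interplay between an event expression and a rearm expression can be illustrated with a small simulation. The following is a sketch only, using the thresholds from the paging space scenario (event expression X>90, rearm expression X<60) and invented sample values. It imitates the behavior described above and does not represent how the monitoring subsystems are implemented.

```shell
#!/bin/sh
# Sketch only: simulate how an event expression (X>90) and a rearm
# expression (X<60) alternate for a series of observed values.
simulate() {
    state=armed                # armed: waiting for the event expression
    for x in "$@"; do
        if [ "$state" = armed ] && [ "$x" -gt 90 ]; then
            echo "X=$x event: run the command"
            state=triggered    # now waiting for the rearm expression
        elif [ "$state" = triggered ] && [ "$x" -lt 60 ]; then
            echo "X=$x rearm: run the rearm command"
            state=armed
        else
            echo "X=$x no action"
        fi
    done
}

# Paging space usage samples (percent), invented for illustration.
simulate 85 92 88 95 59 91
```

In this run the values 88 and 95 cause no further notification, because the condition has already been reported and usage has not yet dropped below 60. Notification resumes at 91 only because the value 59 satisfied the rearm expression first.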
The user can specify a command to be issued when the condition exists. The following AIX command can be used to have Problem Management send an electronic mail message to a specific user when the event occurs:
/usr/bin/echo \
"event_handle has occurred `/usr/bin/date` - Location Info: $PMAN_IVECTOR" | \
/usr/bin/mail -s "event_handle Notification" \
username@address
An action can also be run when the condition no longer exists: when the rearm expression has been met. This action, called the 'rearm command', can inform the system administrator that the condition no longer exists, so that the system administrator knows that the condition no longer needs attention. For example:
/usr/bin/echo \
"event_handle condition ended `/usr/bin/date` - Location Info: $PMAN_IVECTOR" | \
/usr/bin/mail -s "event_handle Condition Ended" \
username@address
The pmandef command informs Problem Management of the conditions to be monitored by subscribing for events. This concept is almost exactly the same as the SP Event Perspective's concept of registering event definitions. To subscribe for events on the chosen resource variables, create a file to contain pmandef commands in the following format:
pmandef -s event_handle \
-e 'resource_variable_name:locator:event_expression' \
-r "rearm_expression" \
-c event_command \
-C rearm_command \
-n 0
Substitute the following information for the keywords in the previous command format:
Continuing with the previous example, the pmandef command to subscribe for the node thrashing condition would be:
pmandef -s Node_Thrash_Monitor \
-e 'IBM.PSSP.aixos.PagSp.%totalused:NodeNum=*:X>90' \
-r "X<60" \
-c '/usr/bin/echo "Node_Thrash_Monitor Alert `/usr/bin/date` \
- Location Info: $PMAN_IVECTOR" | /usr/bin/mail \
-s "Node_Thrash_Monitor Alert" root@adminnode.ibm.com' \
-C '/usr/bin/echo "Node_Thrash_Monitor Cancellation `/usr/bin/date` \
- Location Info: $PMAN_IVECTOR" | /usr/bin/mail \
-s "Node_Thrash_Monitor Cancellation" root@adminnode.ibm.com' \
-n 0
One pmandef command is required for each condition being monitored. Save this file and note its name for future reference.
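When many conditions are to be monitored, the file of pmandef commands can be generated rather than typed by hand, which keeps the nested quoting consistent. The following is a minimal sketch under stated assumptions: emit_pmandef is a hypothetical helper, not a PSSP command. It only prints a subscription in the format used above, with mail notifications modeled on the Node_Thrash_Monitor example, and it does not run pmandef.

```shell
#!/bin/sh
# Sketch only: emit_pmandef is a hypothetical helper that prints one pmandef
# subscription in the format shown above. It does not run pmandef.
emit_pmandef() {
    handle=$1; resvar=$2; locator=$3; event=$4; rearm=$5; mailto=$6
    # In the printf formats, \047 is a single quote and \\ is a backslash.
    printf 'pmandef -s %s \\\n' "$handle"
    printf '  -e \047%s:%s:%s\047 \\\n' "$resvar" "$locator" "$event"
    printf '  -r "%s" \\\n' "$rearm"
    printf '  -c \047/usr/bin/echo "%s Alert `/usr/bin/date` - Location Info: $PMAN_IVECTOR" | /usr/bin/mail -s "%s Alert" %s\047 \\\n' \
        "$handle" "$handle" "$mailto"
    printf '  -C \047/usr/bin/echo "%s Cancellation `/usr/bin/date` - Location Info: $PMAN_IVECTOR" | /usr/bin/mail -s "%s Cancellation" %s\047 \\\n' \
        "$handle" "$handle" "$mailto"
    printf '  -n 0\n'
}

# Print the subscription for the node thrashing example.
emit_pmandef Node_Thrash_Monitor 'IBM.PSSP.aixos.PagSp.%totalused' \
    'NodeNum=*' 'X>90' 'X<60' root@adminnode.ibm.com
```

Redirecting the output of one emit_pmandef call per condition into a single file produces the file of pmandef commands described above. Review the generated text before running it.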
pmandef -d all
The pmandef -s command not only subscribes to events; it also causes Problem Management to begin monitoring the resources. The pmandef -d all command disables the monitoring of these resources.
pmandef -a all
This instructs Problem Management to begin monitoring all the conditions that were defined in Step 6. Should any of these events occur, Problem Management will issue 'the command' associated with the event, to inform the system administrator of the event.
The system administrator can modify an event subscription that has been made to Problem Management. To do this, the system administrator needs to know the following:
Modification of the subscription is done in two steps:
pmandef -d event_handle
This deactivates the monitoring for this condition.
IBM.PSSP.aixos.FS.%totused:NodeNum=1-4,6-32;VG=rootvg;LV=hd3:X>90
pmandef -d all
These steps provide an overview of how Problem Management can be used to monitor system events and notify the system administrator when events occur. This is not a complete tutorial on the use of Problem Management. For greater detail on the capabilities and uses of Problem Management, especially with regard to security, consult the Problem Management chapter in PSSP: Administration Guide, the pmandef command, and the haemqvar command in PSSP: Command and Technical Reference.