IBM Books

Administration Guide


Understanding the Problem Management daemon

The pmand daemon is a client of Event Management; it can be configured to register for Event Management events and perform actions when those events occur.

Event Management provides access to events throughout an SP system partition; therefore, pmand can monitor and react to events on the node on which it is running as well as on all other nodes in the system partition and the control workstation.

When you install your system, a pmand daemon is automatically configured on each node in a system partition. Additionally, there is a pmand daemon running on the control workstation for each system partition in the SP system. When running on a node, the pmand daemon:

There are no restrictions on what a pmand daemon can monitor:

In Figure 40, pmand daemons are running on a 4-node SP system partition.

Figure 40. An example of the Problem Management daemon configuration


Each pmand daemon has access to events on the node on which it is running, as well as on the other nodes in the system partition. This access is provided by Event Management, which, because it is distributed, can monitor resource variables throughout the system partition and generate events based on their values. When an event occurs, whether it is local or remote, every pmand daemon that has subscribed to it performs the actions it is configured for, if any.

Because pmand is a daemon, its subscriptions to events are persistent. That is, a subscription remains in effect even after the process or user who created it has gone away. This allows a system administrator, for example, to set up automated operations with unattended monitoring and recovery actions.

Controlling pmand

The pmand daemon is under the System Resource Controller (SRC) and can be controlled by the following commands:

Creating Problem Management subscriptions

The pmandef command is the mechanism provided for creating Problem Management subscriptions for the pmand daemon to Event Management services. The pmandef command provides for defining:

The pmandef command also provides for:

For more information on pmandef, refer to PSSP: Command and Technical Reference.

Running a command

Use pmandef to specify a command to run when a specified event or rearm event occurs. For example:

pmandef -s Program_Monitor \
-e 'IBM.PSSP.Prog.pcount:NodeNum=12;ProgName=mycmd;UserName=bob:X@0==0' \
-r "X@0>0" -c "echo program has stopped >/tmp/myevent.out" \
-C "echo program has restarted >/tmp/myrearm.out"

Running this example on node 5 causes the command echo program has stopped >/tmp/myevent.out to run on node 5 whenever the number of processes named mycmd and owned by user bob on node 12 becomes 0 (the event). When this number rises above 0 again (the rearm event), the command echo program has restarted >/tmp/myrearm.out runs on node 5.
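Event Management reports the event once when the event expression becomes true and then waits for the rearm expression before it can report the event again. The sketch below is a simplified simulation of that alternation for the X@0==0 and X@0>0 expressions above; it is not part of PSSP, and the function name simulate_event_rearm is hypothetical. It takes a sequence of process counts and prints when each notification would be delivered.

```shell
# Illustrative simulation only (not part of PSSP): models how the event
# expression X@0==0 and the rearm expression X@0>0 alternate.
# Arguments: a sequence of observed process counts.
simulate_event_rearm() {
    armed=1   # 1: watching the event expression; 0: watching the rearm expression
    for count in "$@"; do
        if [ "$armed" -eq 1 ] && [ "$count" -eq 0 ]; then
            echo "event: program has stopped (count=$count)"
            armed=0
        elif [ "$armed" -eq 0 ] && [ "$count" -gt 0 ]; then
            echo "rearm: program has restarted (count=$count)"
            armed=1
        fi
    done
}
```

For the counts 1 0 0 2 1 0, only three notifications are printed: the second consecutive 0 produces no new event, because the rearm expression has not yet been satisfied.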

You can also specify that the commands run on nodes other than the one from which the pmandef command was issued. For example:

pmandef -s Program_Monitor \
-e 'IBM.PSSP.Prog.pcount:NodeNum=1-5,13;ProgName=mycmd;UserName=bob:X@0==0' \
-r "X@0>0" -c /usr/local/bin/start_recovery \
-C /usr/local/bin/stop_recovery -n 1-3,7

This example causes the commands to run on nodes 1, 2, 3 and 7, whenever bob's program dies or is restarted on any of nodes 1, 2, 3, 4, 5 or 13. If bob's program dies on node 4, then the command /usr/local/bin/start_recovery runs on nodes 1, 2, 3 and 7.
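The node lists in this example (NodeNum=1-5,13 and -n 1-3,7) combine single node numbers and ranges separated by commas. The following sketch expands such a list into individual node numbers. It assumes only the forms that appear in these examples (comma-separated numbers and ascending ranges); it does not claim to cover the full syntax that pmandef accepts, and the function name expand_nodes is hypothetical.

```shell
# Illustrative helper (not part of PSSP): expands a node list such as
# "1-5,13" into "1 2 3 4 5 13". Handles only single numbers and
# ascending "first-last" ranges separated by commas.
expand_nodes() {
    result=""
    old_ifs=$IFS
    IFS=','                      # split the argument on commas
    for item in $1; do
        case $item in
            *-*)                 # a range such as 1-5
                first=${item%-*}
                last=${item#*-}
                n=$first
                while [ "$n" -le "$last" ]; do
                    result="$result $n"
                    n=$((n + 1))
                done
                ;;
            *)                   # a single node number
                result="$result $item"
                ;;
        esac
    done
    IFS=$old_ifs
    echo "${result# }"           # drop the leading space
}
```

For example, expand_nodes 1-5,13 prints 1 2 3 4 5 13 (the monitored nodes) and expand_nodes 1-3,7 prints 1 2 3 7 (the nodes that run the commands).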

Any number of commands can run simultaneously.

You can specify a timeout, in seconds, for each command. The minimum timeout that can be specified is 10 seconds. If the command has not exited before the specified timeout, the command is killed.
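Because any number of commands can run simultaneously, a second notification can start another copy of a recovery command while the first copy is still running. One common way for a response script to avoid overlapping with itself is an atomic mkdir lock, sketched below. The lock path /tmp/start_recovery.lock and the function names are hypothetical; the script only reads the documented PMAN_HANDLE and PMAN_LOCATION variables.

```shell
# Illustrative guard for a response command such as
# /usr/local/bin/start_recovery (the lock path is an assumption).
# mkdir is atomic, so only one concurrent invocation can create the directory.
LOCK_DIR=${LOCK_DIR:-/tmp/start_recovery.lock}

acquire_lock() {
    mkdir "$LOCK_DIR" 2>/dev/null
}

release_lock() {
    rmdir "$LOCK_DIR" 2>/dev/null
}

run_recovery_once() {
    if ! acquire_lock; then
        echo "recovery already running; skipping" >&2
        return 1
    fi
    # Keep the real recovery work well under the pmandef timeout,
    # because the command is killed when the timeout expires.
    echo "recovering for ${PMAN_HANDLE:-unknown subscription} on node ${PMAN_LOCATION:-?}"
    release_lock
}
```

If the command is killed at its timeout, the lock directory is left behind, so a production script would also need some form of stale-lock handling.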

For information on command termination status, use the lssrc -ls command.

The command environment

The Problem Management subsystem makes all of the contents of an Event Management notification available in the command's environment when the command is run:

PMAN_HANDLE
The name that identifies this subscription to the Problem Management subsystem. This name was given as the argument of the -s flag to pmandef.

PMAN_PRINCIPAL
The name of the Kerberos V4 principal that owns this subscription, if one exists.

PMAN_DCEPRIN
The name of the DCE principal that owns this subscription, if one exists.

PMAN_RVNAME
The Event Management resource variable.

PMAN_IVECTOR
The Event Management resource identifier.

PMAN_PRED
Either the Event Management expression or rearm expression, depending on whether this is an event or rearm event.

PMAN_TIME
The time that the event was reported to the Problem Management subsystem.

PMAN_LOCATION
The node number of the node on which the event was generated, usually (but not always) the node on which the event occurred.

PMAN_RVTYPE
One of long, float or sbs, depending on whether the type of the resource variable value is a long integer, a floating point value or a Structured Byte String.

If the PMAN_RVTYPE is either long or float, then the resource variable value is stored in PMAN_RVVALUE, and PMAN_RVVALUE is to be interpreted as type PMAN_RVTYPE.

If PMAN_RVTYPE is sbs, then the resource variable value is composed of one or more structure elements. There is no PMAN_RVVALUE environment variable. Instead, there is a separate environment variable for each element, and the PMAN_RVCOUNT environment variable defines the number of elements. For example, if there are 3 structure elements within the Structured Byte String, PMAN_RVCOUNT will be 3, and there will be 3 separate environment variables for the 3 structure elements: PMAN_RVFIELD0, PMAN_RVFIELD1 and PMAN_RVFIELD2. Each of these 3 environment variables contains a name=value pair, where name is the structure element name and value is the structure element value.
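A response command can therefore decide how to read the value by looking at PMAN_RVTYPE first. The following sketch of a hypothetical helper, print_rv_value, applies the rules above: it prints PMAN_RVVALUE for long and float values, and loops over PMAN_RVFIELD0 through PMAN_RVFIELDn-1 for Structured Byte Strings.

```shell
# Illustrative helper for a response command (not part of PSSP):
# prints the resource variable value from the PMAN_* environment.
print_rv_value() {
    case "$PMAN_RVTYPE" in
        long|float)
            # Scalar value: a single PMAN_RVVALUE variable
            echo "value=$PMAN_RVVALUE"
            ;;
        sbs)
            # Structured Byte String: PMAN_RVCOUNT fields, each name=value
            i=0
            while [ "$i" -lt "${PMAN_RVCOUNT:-0}" ]; do
                eval "field=\${PMAN_RVFIELD$i}"
                echo "$field"
                i=$((i + 1))
            done
            ;;
        *)
            echo "unknown PMAN_RVTYPE: $PMAN_RVTYPE" >&2
            return 1
            ;;
    esac
}
```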

For example, the following command:

pmandef -s example \
-e 'IBM.PSSP.Prog.pcount:NodeNum=9-11;ProgName=mycmd;UserName=root:X@0==0' \
-c "/usr/local/bin/recovery_cmd" -n 12

requests the /usr/local/bin/recovery_cmd command to run on node 12, when the number of processes named mycmd and owned by root on nodes 9, 10, or 11 becomes zero. If the mycmd program terminates on node 10, the command /usr/local/bin/recovery_cmd runs on node 12, and the following environment variables are included in its environment:

This information could be used by any command. Two utilities that report this information are provided as part of the Problem Management subsystem: notify_event and log_event. (These commands are provided to get you started; you might want to write more sophisticated commands.)
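As a starting point for such a command, the sketch below appends one line per notification to a plain text file, using only the PMAN_* variables described above. The log file location and the PMAN_LOG_FILE override are assumptions of this sketch, not variables that pmand sets, and log_pman_event is a hypothetical name; unlike log_event, it does not use alog or limit the file size.

```shell
# Illustrative response command (not part of PSSP): appends one
# "time|handle|location|resource variable|expression" record per call.
# PMAN_LOG_FILE is a hypothetical override, not set by pmand.
log_pman_event() {
    logfile=${PMAN_LOG_FILE:-/tmp/pman_events.log}
    printf '%s|%s|%s|%s|%s\n' \
        "${PMAN_TIME:-}" "${PMAN_HANDLE:-}" "${PMAN_LOCATION:-}" \
        "${PMAN_RVNAME:-}" "${PMAN_PRED:-}" >> "$logfile"
}
```

Because PMAN_PRED holds either the event expression or the rearm expression, each record also shows which of the two triggered the command.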

notify_event captures event information and mails it to the user running the command on the local node.

log_event captures event information and logs it to a wraparound file. The syntax for log_event is:

/usr/lpp/ssp/bin/log_event log_filename

log_event uses the AIX alog command to write to a wraparound file. The size of the wraparound file is limited to 64K. The alog command must be used to read the file. Refer to the AIX alog man page for more information on this command.

Issuing an SNMP trap

Use the pmandef command to subscribe to an Event Management event and specify that an SNMP trap be issued for that event. For example:

pmandef -s Filesystem_Monitor \
-e 'IBM.PSSP.aixos.FS.%totused:NodeNum=10;VG=myvg;LV=mylv:X>95' \
-t 1234 -n 10

In this example, whenever the file system associated with the mylv logical volume and myvg volume group on node 10 becomes more than 95% full, an SNMP trap is generated on node 10.

For complete information on how pmand can be configured to issue an SNMP trap when an event for which it is registered occurs, refer to Chapter 27, Managing SP system events in a network environment.

Logging an event

You can specify that pmand write event notification information, along with some optional specified text, to the AIX Error Log and BSD syslog facilities. For example:

pmandef -s Filesystem_Monitor \
-e 'IBM.PSSP.aixos.FS.%totused:NodeNum=11;VG=myvg;LV=mylv:X>95' \
-l "filesystem is almost full" -h local

In this example, whenever the file system associated with the mylv logical volume and myvg volume group on node 11 becomes more than 95% full, the text filesystem is almost full is written to the AIX Error Log and BSD syslog facilities on node 11 (via the -h local option).

Obtaining problem management access

The pmandef command is built upon the Sysctl facility, which uses the SP security services to provide authorized users, both root and non-root users, with the ability to create, modify, and delete Problem Management subscriptions. For further information about the Sysctl facility see Chapter 6, Controlling remote execution by using Sysctl.

How a user is authorized to access Problem Management depends on which SP trusted services authentication methods have been enabled:

Access to Problem Management is protected on a node-by-node basis. Each node contains a copy of the /etc/sysctl.pman.acl text file. When DCE authentication is enabled, there is a separate DCE ACL for each node. Therefore, it is possible for a user to be authorized to access Problem Management on some nodes but not on others. From a security standpoint, however, there is nothing to gain by authorizing a user to access Problem Management on some nodes but not on other nodes within a single SP system partition, because Problem Management access on a single node is enough to define subscriptions that get processed on any node within the same SP system partition.

System administrators can consider granting Problem Management access to untrusted users. While an untrusted user can create a subscription that specifies that a command is to run as root whenever the event occurs, the pmand daemon will not run such a command for an untrusted user. In fact, the pmand daemon will not perform any action that the end user is not otherwise allowed to do. However, by granting Problem Management access to a user, the system administrator authorizes the pmand daemon and other PSSP subsystems to consume system resources on behalf of this user. The risks involved range from the waste of system resources by careless users to denial-of-service attacks on the SP system by malicious users. While it might not be desirable to grant Problem Management access to all users, there might be some users who are not trusted enough to have the root password but are trusted enough to use Problem Management. For further information see Authorizing event response actions.

The pmandef command requires the user to have Problem Management access on the local node to make the necessary changes to the SDR. If the user does not have Problem Management access on the local node, the command fails. The user should also have Problem Management access on the nodes that are affected by the subscription, so that the pmand daemons on those nodes can process the request dynamically. On any affected node where the user does not have Problem Management access, the pmand daemon will not process the request dynamically; instead, the request will be processed the next time that pmand daemon is restarted or refreshed.
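The rules in the preceding paragraph can be summarized as a small decision procedure. The sketch below only models those rules for illustration; it does not query any real ACL. The function name pman_processing_plan and its "node:yes" or "node:no" argument format are hypothetical.

```shell
# Illustrative model of the rules above (not part of PSSP).
# $1: "yes" or "no" for Problem Management access on the local node.
# Remaining arguments: "node:yes" or "node:no" for each affected node.
pman_processing_plan() {
    local_access=$1
    shift
    if [ "$local_access" != yes ]; then
        echo "pmandef fails: no Problem Management access on the local node"
        return 1
    fi
    for pair in "$@"; do
        node=${pair%%:*}
        access=${pair#*:}
        if [ "$access" = yes ]; then
            echo "node $node: processed dynamically"
        else
            echo "node $node: processed at next pmand restart or refresh"
        fi
    done
}
```

For example, pman_processing_plan yes 1:yes 7:no reports that node 1 picks up the subscription immediately, while node 7 picks it up only when its pmand daemon is next restarted or refreshed.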

Understanding subscription ownership

When a new Problem Management subscription is created using the pmandef -s command, the user's current security identity is used to establish ownership of the subscription. Modifications to the subscription using the pmandef command with the -u, -d and -a flags are allowed only for a user who can be identified as the subscription owner.

How subscription ownership is established depends on which SP trusted services authentication methods have been enabled:

Changing subscription ownership

If subscriptions are created while SP trusted services are enabled for one set of authentication methods, and later you reconfigure SP trusted services to use a different set of authentication methods, then it might not be possible to establish ownership of some subscriptions. For example, if subscriptions were created while DCE was the only authentication method enabled, and later Kerberos V4 becomes the only authentication method enabled, then the owners of the DCE-based subscriptions will not be able to modify their subscriptions since the subscriptions do not contain Kerberos V4 principals with which to establish ownership.

In these situations use the pmanchown command to change the ownership of the isolated subscriptions from a security identity that cannot be authenticated to something that can be authenticated. In the example above, the ownership of the DCE-based subscriptions can be changed from the user's DCE principal to the user's Kerberos V4 principal. For more information on the pmanchown command, see the book PSSP: Command and Technical Reference.

Authorizing event response actions

After the pmand daemon receives notification that an event has occurred, and before it performs the action for that event, the pmand daemon checks to see whether the subscription owner is authorized to perform the requested action on the node where it is running. If the requested action is execution of a command, the subscription owner must have AIX Remote Commands access to the node as the target user. The target user is by default the same user who issued the pmandef -s command to create the subscription. A different user can be specified to the pmandef command by using the -U flag.

The underlying principle is that the pmand daemon will execute a command in response to an event only if the subscription owner has the ability to execute the same command by other means. If the user can log in to the node as the target user and execute the command directly from the command line, or at least run the command as the target user by invoking the rsh command from a remote node, then no extra privileges can be gained from using Problem Management. The only thing the end user gains is the automation of responses to events within the SP system.

How event response authorization is done depends on the type of the subscription's ownership and which AIX remote command authentication methods have been enabled by the system administrator. In one case it also depends on which SP trusted services authentication methods have been enabled. The specific steps that pmand uses to determine whether the subscription owner is authorized to run a command as the requested target user on the local node are as follows:

Note that standard UNIX authentication is not used when either DCE or Kerberos V4 has been enabled as an SP trusted services authentication method by the system administrator. This is necessary because, by enabling DCE or Kerberos V4 as an SP trusted services authentication method, the system administrator defines a security policy that requires all PSSP subsystems to use only strong forms of user authentication. Since standard UNIX authentication is a weak form of user authentication, it cannot be used when the PSSP security policy does not allow it.

When the action to be performed is an entry in the AIX error log and BSD syslog, or the generation of an SNMP trap, all of the preceding rules apply, except that the target user is always root, regardless of which target user is contained in the subscription. This restricts these actions to system administrators. Authorization checking for AIX error log and BSD syslog actions and for SNMP trap actions is done separately from authorization checking for command execution actions. Therefore, if a subscription requests, for instance, that a command be executed as joeuser and that an SNMP trap be generated in response to an event, the command authorization checking will use joeuser as the target user, while the SNMP trap authorization checking will use root as the target user.
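Two of the rules above are simple enough to state as code: which target user is checked for each kind of action, and whether standard UNIX authentication may be considered at all. The sketch below is an illustrative model, not the pmand implementation. The action labels (command, errlog, snmp) and the method labels (dce, k4) are hypothetical names chosen for this sketch.

```shell
# Illustrative model (not part of PSSP).
# Target user checked for an action; $1 = command | errlog | snmp,
# $2 = the target user recorded in the subscription.
authz_target_user() {
    case $1 in
        command)     echo "$2" ;;   # command runs as the subscription's target user
        errlog|snmp) echo root ;;   # error log/syslog and SNMP traps always check root
        *)           echo "unknown action: $1" >&2; return 1 ;;
    esac
}

# Succeeds if standard UNIX authentication may be used, given the enabled
# SP trusted services authentication methods (hypothetical labels dce, k4).
std_unix_auth_allowed() {
    for method in "$@"; do
        case $method in
            dce|k4) return 1 ;;     # a strong method is enabled: weak auth not allowed
        esac
    done
    return 0
}
```

For the joeuser example, authz_target_user command joeuser prints joeuser, while authz_target_user snmp joeuser prints root.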

When the action to be performed is execution of a command, the pmand daemon also checks to see whether the system administrator has imposed any AIX login restrictions for the target user on the local node. The pmand daemon enforces the same login restrictions as the AIX rsh service. The login restrictions that are checked by both pmand and rsh include the following:

See the AIX security documentation for the entire list of user login restrictions which are enforced by the rsh service.

In most cases you can determine whether pmand will refuse to execute a user's command by answering the question "Can the subscription owner access the rsh service as the target user on the node where pmand is to execute the command?" Except for cases where rsh will allow authorization based on the target user's $HOME/.rhosts file and pmand will not, if the answer to this question is yes, then pmand will allow the user's command to execute. If the answer is no, then pmand will refuse to execute the command, and it will make note of this in the pmand daemon log file, which exists in the directory /var/adm/SPlogs/pman. Keep in mind that a target user's AIX login restrictions can vary from node to node, and the AIX remote command authentication methods can also vary from node to node, so the answer to this question can be yes for some nodes and no for other nodes within the same SP system.

When restricted root access is enabled, most SP-generated entries are removed from the root user's /.klogin, /.k5login or /.rhosts files on nodes throughout the SP system. (See Restricted root access for more about that.) Any Problem Management subscriptions which rely on the removed SP-generated entries for root authorization will no longer be allowed to run root-authorized actions in response to events, based on the rules already described. If you want to use Problem Management in a restricted root access environment, you can either add your own entries to the root user's /.klogin, /.k5login or /.rhosts files on the nodes where root authorization is required, or you can choose to limit your event response actions to running commands as non-root users.

Requesting Problem Management subscription information

Use the pmanquery command to query the SDR for a description of a Problem Management subscription. The pmanquery command outputs the details of the subscription information in raw format, which can then be used by other applications. The following example queries all subscriptions:

pmanquery -n all -k all -p all -U all -H all

For more information on pmanquery, refer to PSSP: Command and Technical Reference.

Monitoring default events

The Problem Management subsystem provides for a set of default events to be monitored in the /usr/lpp/ssp/install/bin/pmandefaults script. This script contains a series of pmandef commands that request an event notification to be mailed to root on the control workstation when the specified event occurs. These are events that would interest many system administrators. Events are defined for all nodes in the current system partition, and all events are monitored from the control workstation. The list of events includes:

This script is a suggested starting point for configuring the Problem Management subsystem on your SP system. You can choose to run the script as is, or you can make your own copy of the script and modify it to suit your needs, or you can choose not to run the script at all.

The script takes no arguments. It can be executed once for each system partition. Specify the system partition by setting the SP_NAME environment variable.
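To configure the default events in several system partitions, the script can be run once per partition with SP_NAME set for each run. The sketch below assumes you supply the partition names yourself (sp_a and sp_b in the usage note are hypothetical names); the PMANDEFAULTS override exists only so the loop can be tried without the real script.

```shell
# Illustrative wrapper (not part of PSSP): runs pmandefaults once per
# system partition, setting SP_NAME for each run.
# PMANDEFAULTS is a hypothetical override of the script path.
PMANDEFAULTS=${PMANDEFAULTS:-/usr/lpp/ssp/install/bin/pmandefaults}

run_pmandefaults_all() {
    for partition in "$@"; do
        echo "configuring default events for partition $partition"
        # The prefix assignment exports SP_NAME to this one command only.
        SP_NAME=$partition "$PMANDEFAULTS" || return 1
    done
}
```

For example, run_pmandefaults_all sp_a sp_b runs the script twice and stops at the first partition for which it fails.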

