Administration Guide

Understanding the Problem Management daemon

The pmand daemon is a client of Event Management; it can be configured to register for Event Management events and perform actions when those events occur.

Event Management provides access to events throughout an SP system partition; therefore, pmand can monitor and react to events on the node on which it is running as well as on all other nodes in the system partition and the control workstation.

When you install your system, a pmand daemon is automatically configured on each node in a system partition. Additionally, there is a pmand daemon running on the control workstation for each system partition in the SP system. When running on a node, the pmand daemon:

- Monitors events occurring on the node on which the daemon is running
- Monitors events on all other nodes in the system partition
- Monitors events not associated with a node, such as frame events, as supplied by Event Management

There are no restrictions on what a pmand daemon can monitor:

- Any number of pmand daemons can monitor and act on a single event
- A single pmand daemon can monitor any number of events locally or remotely
- A single pmand daemon can monitor the same event multiple times, and all the actions associated with all the event registrations are taken by the daemon when the event occurs

In Figure 40, pmand daemons are running on a four-node SP system partition.

Figure 40. An example of the Problem Management daemon configuration

Each pmand daemon has access to events on the node on which it is running, as well as to the other nodes in the system partition. The access to the events is provided by Event Management, which is able, due to its distributed nature, to monitor resource variables throughout the system partition and to generate events based on the values of the resource variables. When an event that any of the daemons has subscribed to occurs (whether that event is local or remote), all the pmand daemons registered for it will perform the actions they are configured for (if any).

Because pmand is a daemon, its subscriptions to events are persistent. That is, the daemon continues to subscribe to events even after the process or user who created the subscription has gone away. A system administrator, for example, can set up automated operations with unattended monitoring and recovery actions.

Controlling pmand

The pmand daemon is under the System Resource Controller (SRC) and can be controlled by the following commands:

To start pmand on a node, issue startsrc -s:
```
startsrc -s pman
```
To start pmand on the control workstation, issue startsrc -s with the system partition name:
```
startsrc -s pman.system_partition_name
```
To stop pmand on a node, issue stopsrc -s:
```
stopsrc -s pman
```
To stop pmand on the control workstation, issue stopsrc -s with the system partition name:
```
stopsrc -s pman.system_partition_name
```
To refresh pmand on a node, issue refresh -s:
```
refresh -s pman
```
This causes pmand to update its internal configuration from the SDR and start pairing actions with events as specified by the pmandef command. When a refresh occurs, all currently monitored events are unregistered before the configuration information is reread from the SDR. Because the SDR contains persistent information, a refresh picks up only those configuration changes that have been put into the SDR. If you have not deleted or modified the configuration record for a particular event, refreshing the daemon reregisters for the same event.

To refresh pmand on the control workstation, issue refresh -s with the system partition name:
```
refresh -s pman.system_partition_name
```
To receive status on the pmand daemon, issue lssrc -ls. To check the status of pmand running on a node, enter:
```
lssrc -ls pman
```
To check the status of pmand running on the control workstation, enter:
```
lssrc -ls pman.system_partition_name
```
This command provides the following status information:
- When pmand was started.
- When pmand was last refreshed.
- Whether tracing (debug mode) is on or off. When debug mode is on, all SRC requests and all events are logged to the /var/adm/SPlogs/pman directory.
- Events for which registrations are as yet unacknowledged.
- Events for which actions are currently being taken.
- Events currently ready to be acted on by this daemon.
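
The naming pattern used by the commands above (pman on a node, pman.system_partition_name on the control workstation) can be captured in a small helper. The following is a minimal sketch only: the function name pman_subsys and the partition name sp1part are invented for illustration and are not part of PSSP.

```shell
#!/bin/sh
# Hypothetical helper: print the SRC subsystem name for pmand.
# With no argument, assume the script runs on a node.
# With a system partition name, assume it runs on the control workstation.
pman_subsys() {
    if [ -n "$1" ]; then
        printf 'pman.%s\n' "$1"
    else
        printf 'pman\n'
    fi
}

# Print (not run) the status command for a node and for a partition
# called "sp1part" (an invented name).
echo "lssrc -ls $(pman_subsys)"
echo "lssrc -ls $(pman_subsys sp1part)"
```

The same helper could be used to build the startsrc, stopsrc and refresh command lines.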

Creating Problem Management subscriptions

The pmandef command is the mechanism provided for creating Problem Management subscriptions for the pmand daemon to Event Management services. The pmandef command provides for defining:

- Event Management events to register for
- Actions to take when those events occur:
  - Run a command
  - Issue an SNMP trap
  - Write to the AIX Error Log and BSD syslog facilities

The pmandef command also provides for:

- Activating a Problem Management subscription
- Deactivating a Problem Management subscription
- Querying a Problem Management subscription
- Removing a Problem Management subscription

For more information on pmandef, refer to PSSP: Command and Technical Reference.

Running a command

Use pmandef to specify a command to run when a specified event or rearm event occurs. For example:

pmandef -s Program_Monitor \
-e 'IBM.PSSP.Prog.pcount:NodeNum=12;ProgName=mycmd;UserName=bob:X@0==0' \
-r "X@0>0" -c "echo program has stopped >/tmp/myevent.out" \
-C "echo program has restarted >/tmp/myrearm.out"

Running this example on node 5 causes the command echo program has stopped >/tmp/myevent.out to run on node 5 whenever the number of processes named mycmd and owned by user bob on node 12 becomes 0 (the event). When this number rises above 0 again (the rearm event), the command echo program has restarted >/tmp/myrearm.out runs on node 5.
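
The argument of the -e flag in this example has three colon-separated parts: the resource variable, the resource identifier, and the expression. The following sketch splits such an argument to make the structure visible. It is an illustration only; the function name split_event_spec is hypothetical, pmandef does its own parsing, and the sketch assumes that the expression itself contains no colon.

```shell
#!/bin/sh
# Split a pmandef -e argument of the form
#   resource_variable:resource_identifier:expression
# into its three parts (illustration only).
split_event_spec() {
    spec=$1
    rvname=${spec%%:*}      # text before the first colon
    pred=${spec##*:}        # text after the last colon
    rest=${spec#*:}         # drop the resource variable
    ivector=${rest%:*}      # drop the expression
    printf 'variable=%s\nidentifier=%s\nexpression=%s\n' \
        "$rvname" "$ivector" "$pred"
}

split_event_spec 'IBM.PSSP.Prog.pcount:NodeNum=12;ProgName=mycmd;UserName=bob:X@0==0'
```

For the example above, the three parts are IBM.PSSP.Prog.pcount, NodeNum=12;ProgName=mycmd;UserName=bob and X@0==0.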

You can also specify that the command run on nodes other than the one from which the pmandef command was issued by using the -n flag. For example:

pmandef -s Program_Monitor \
-e 'IBM.PSSP.Prog.pcount:NodeNum=1-5,13;ProgName=mycmd;UserName=bob:X@0==0' \
-r "X@0>0" -c /usr/local/bin/start_recovery \
-C /usr/local/bin/stop_recovery -n 1-3,7

This example causes the commands to run on nodes 1, 2, 3 and 7, whenever bob's program dies or is restarted on any of nodes 1, 2, 3, 4, 5 or 13. If bob's program dies on node 4, then the command /usr/local/bin/start_recovery runs on nodes 1, 2, 3 and 7.
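
The node lists in this example (1-5,13 and 1-3,7) combine ranges and single node numbers. The following sketch expands such a list into individual node numbers. The function name expand_nodes is hypothetical, and the notation is inferred only from the examples in this section; pmandef might accept other forms.

```shell
#!/bin/sh
# Expand a node list such as "1-3,7" into "1 2 3 7" (illustration only).
expand_nodes() {
    result=""
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;   # a range such as 1-3
            *)   lo=$part; hi=$part ;;             # a single node number
        esac
        n=$lo
        while [ "$n" -le "$hi" ]; do
            result="$result $n"
            n=$((n + 1))
        done
    done
    IFS=$old_ifs
    printf '%s\n' "${result# }"
}

expand_nodes 1-3,7      # prints: 1 2 3 7
```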

Any number of commands can run simultaneously.

You can specify a timeout, in seconds, for each command. The minimum timeout that can be specified is 10 seconds. If the command has not exited before the specified timeout, the command is killed.

For information on command termination status, use the lssrc -ls command.

The command environment

The Problem Management subsystem makes all of the contents of an Event Management notification available in the command's environment when the command is run:

PMAN_HANDLE

The name that identifies this subscription to the Problem Management subsystem. This name was given as the argument of the -s flag to pmandef.

PMAN_PRINCIPAL

The name of the Kerberos V4 principal that owns this subscription, if one exists.

PMAN_DCEPRIN

The name of the DCE principal that owns this subscription, if one exists.

PMAN_RVNAME

The Event Management resource variable.

PMAN_IVECTOR

The Event Management resource identifier.

PMAN_PRED

Either the Event Management expression or rearm expression, depending on whether this is an event or rearm event.

PMAN_TIME

The time that the event was reported to the Problem Management subsystem.

PMAN_LOCATION

The node number of the node on which the event was generated, usually (but not always) the node on which the event occurred.

PMAN_RVTYPE

One of long, float or sbs, depending on whether the type of the resource variable value is a long integer, a floating point value or a Structured Byte String.

If the PMAN_RVTYPE is either long or float, then the resource variable value is stored in PMAN_RVVALUE, and PMAN_RVVALUE is to be interpreted as type PMAN_RVTYPE.

If PMAN_RVTYPE is sbs, then the resource variable value is composed of one or more structure elements. There is no PMAN_RVVALUE environment variable. Instead there is a separate environment variable for each element, and the PMAN_RVCOUNT environment variable defines the number of elements. For example, if there are 3 structure elements within the Structured Byte String, the PMAN_RVCOUNT will be 3, and there will be 3 separate environment variables for the 3 structure elements: PMAN_RVFIELD0, PMAN_RVFIELD1 and PMAN_RVFIELD2. Each of these 3 environment variables contains a name=value pair, where name is the structure element name, and value is the structure element value.

For example, the following command:

pmandef -s example \
-e 'IBM.PSSP.Prog.pcount:NodeNum=9-11;ProgName=mycmd;UserName=root:X@0==0' \
-c "/usr/local/bin/recovery_cmd" -n 12

requests the /usr/local/bin/recovery_cmd command to run on node 12, when the number of processes named mycmd and owned by root on nodes 9, 10, or 11 becomes zero. If the mycmd program terminates on node 10, the command /usr/local/bin/recovery_cmd runs on node 12, and the following environment variables are included in its environment:

PMAN_HANDLE (example)
PMAN_PRINCIPAL (root.admin@PPD.POK.IBM.COM)
PMAN_DCEPRIN (/.../test_dcecell/cell_admin)
PMAN_RVNAME (IBM.PSSP.Prog.pcount)
PMAN_IVECTOR (ProgName=mycmd;UserName=root;NodeNum=10)
PMAN_PRED (X@0==0)
PMAN_TIME (Thu Aug 22 00:42:08 1996)
PMAN_LOCATION (10)
PMAN_RVTYPE (sbs)
PMAN_RVCOUNT (3)
PMAN_RVFIELD0 (CurPIDCount=0)
PMAN_RVFIELD1 (PrevPIDCount=1)
PMAN_RVFIELD2 (CurPIDList=)

This information could be used by any command. Two utilities that report this information are provided as part of the Problem Management subsystem: notify_event and log_event. (These commands are provided to get you started; you might want to write more sophisticated commands.)

notify_event captures event information and mails it to the user running the command on the local node.

log_event captures event information and logs it to a wraparound file. The syntax for log_event is:

/usr/lpp/ssp/bin/log_event log_filename

log_event uses the AIX alog command to write to a wraparound file. The size of the wraparound file is limited to 64K. The alog command must be used to read the file. Refer to the AIX alog man page for more information on this command.
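
A custom command can read the same environment variables directly. The following is a minimal sketch of such a command, assuming only the PMAN_* variables described in this section. The function name summarize_event and the output format are hypothetical.

```shell
#!/bin/sh
# Sketch of a custom event-response command that reports the event
# from the PMAN_* environment variables (illustration only).
summarize_event() {
    echo "subscription: ${PMAN_HANDLE:-unknown}"
    echo "variable: ${PMAN_RVNAME:-unknown}"
    echo "location: node ${PMAN_LOCATION:-unknown}"
    if [ "$PMAN_RVTYPE" = "sbs" ]; then
        # Structured Byte String: one PMAN_RVFIELDn variable per element,
        # numbered from 0 to PMAN_RVCOUNT - 1.
        i=0
        while [ "$i" -lt "${PMAN_RVCOUNT:-0}" ]; do
            eval "field=\${PMAN_RVFIELD$i}"
            echo "field $i: $field"
            i=$((i + 1))
        done
    else
        # long or float: the value is in PMAN_RVVALUE.
        echo "value: ${PMAN_RVVALUE}"
    fi
}

summarize_event
```

A script like this could be named as the argument of the -c flag in place of /usr/local/bin/recovery_cmd in the example above.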

Issuing an SNMP trap

Use the pmandef command to subscribe to an Event Management event and specify that an SNMP trap be issued for that event. For example:

pmandef -s Filesystem_Monitor \
-e 'IBM.PSSP.aixos.FS.%totused:NodeNum=10;VG=myvg;LV=mylv:X>95' \
-t 1234 -n 10

In this example, whenever the file system associated with the mylv logical volume and myvg volume group on node 10 becomes more than 95% full, an SNMP trap will be generated on node 10.

For complete information on how pmand can be configured to issue an SNMP trap when an event for which it is registered occurs, refer to Chapter 27, Managing SP system events in a network environment.

Logging an event

You can specify that pmand write event notification information, along with some optional specified text, to the AIX Error Log and BSD syslog facilities. For example:

pmandef -s Filesystem_Monitor \
-e 'IBM.PSSP.aixos.FS.%totused:NodeNum=11;VG=myvg;LV=mylv:X>95' \
-l "filesystem is almost full" -h local

In this example, whenever the file system associated with the mylv logical volume and myvg volume group on node 11 becomes more than 95% full, the text filesystem is almost full is written to the AIX Error Log and BSD syslog facilities on node 11 (via the -h local option).

Obtaining problem management access

The pmandef command is built upon the Sysctl facility, which uses the SP security services to provide authorized users, both root and non-root users, with the ability to create, modify, and delete Problem Management subscriptions. For further information about the Sysctl facility see Chapter 6, Controlling remote execution by using Sysctl.

How a user is authorized to access Problem Management depends on which SP trusted services authentication methods have been enabled:

When DCE is the only authentication method enabled, access to Problem Management is protected by the DCE ACL for the etc/sysctl.pman.acl file. By default this DCE ACL contains an entry for the sysctl-pman DCE group. While you can modify the DCE ACL directly, IBM suggests that you authorize users to access Problem Management by adding their DCE principals to the sysctl-pman DCE group. For related information see Managing access by group membership, Managing access using ACL files, and Sysctl files.
When Kerberos V4 or compatibility is the only authentication method enabled, access to Problem Management is protected by the /etc/sysctl.pman.acl file, which is a text file that gets processed by the Sysctl subsystem. You authorize users to access Problem Management by adding their Kerberos V4 principals to the /etc/sysctl.pman.acl file. For related information see Sysctl files.
When DCE and Kerberos V4 are both enabled as authentication methods, access to Problem Management is protected by both the DCE ACL and the /etc/sysctl.pman.acl text file described above. You authorize users to access Problem Management by adding their DCE principals to the sysctl-pman DCE group, by modifying the DCE ACL directly, or by adding their Kerberos V4 principals to the /etc/sysctl.pman.acl text file.
When no authentication methods are enabled, access to Problem Management is protected by the /etc/sysctl.pman.acl text file. You cannot authorize individual users to access Problem Management. You can only authorize all unauthenticated users to access Problem Management. If you choose not to do this, then no users can access Problem Management. For related information see Sysctl files.

Access to Problem Management is protected on a node-by-node basis. Each node contains a copy of the /etc/sysctl.pman.acl text file. When DCE authentication is enabled, there is a separate DCE ACL for each node. Therefore, it is possible for a user to be authorized to access Problem Management on some nodes but not on others. From a security standpoint, however, there is nothing to gain by doing this within a single SP system partition, because Problem Management access on a single node is enough to define subscriptions that get processed on any node within the same SP system partition.

System administrators can consider granting Problem Management access to untrusted users. While an untrusted user can create a subscription that specifies that a command is to run as root whenever the event occurs, the pmand daemon will not run such a command for an untrusted user. In fact, the pmand daemon will not perform any action that the end user is not otherwise allowed to do. However, by granting Problem Management access to a user, the system administrator authorizes the pmand daemon and other PSSP subsystems to consume system resources on behalf of this user. The risks involved range from wasting of system resources by careless users to denial-of-service attacks on the SP system by malicious users. While it might not be desirable to grant Problem Management access to all users, there might be some users who are not trusted enough to have the root password but are trusted enough to use Problem Management. For further information see Authorizing event response actions.

The pmandef command requires the user to have Problem Management access on the local node to make the necessary changes to the SDR. If the user does not have Problem Management access on the local node, the command fails. The user should also have Problem Management access on the nodes that are affected by the subscription for the pmand daemons on those nodes to dynamically process the request. If the user does not have Problem Management access to any of these nodes, the pmand daemons on the unauthorized nodes will not process the request dynamically, but the request will eventually be processed the next time those pmand daemons are restarted or refreshed.

Understanding subscription ownership

When a new Problem Management subscription is created using the pmandef -s command, the user's current security identity is used to establish ownership of the subscription. Modifications to the subscription using the pmandef command with the -u, -d and -a flags are only allowed by a user who can be identified as the subscription owner.

How subscription ownership is established depends on which SP trusted services authentication methods have been enabled:

When DCE is the only authentication method enabled, the user's DCE principal is used to establish subscription ownership. When a subscription is created, the user's DCE principal is stored into the SDR along with the rest of the subscription data. Modifications to the subscription using the -u, -d and -a flags are only allowed by a user who can be authenticated as this same DCE principal.
When Kerberos V4 or compatibility is the only authentication method enabled, the user's Kerberos V4 principal is used to establish subscription ownership. When a subscription is created, the user's Kerberos V4 principal is stored into the SDR along with the rest of the subscription data. Modifications to the subscription using the -u, -d and -a flags are only allowed by a user who can be authenticated as this same Kerberos V4 principal.
When DCE and compatibility are both enabled as authentication methods, either the user's DCE principal or Kerberos V4 principal is used to establish subscription ownership. When a subscription is created, if the user is logged into both DCE and Kerberos V4, then both the DCE principal and Kerberos V4 principal are stored into the SDR along with the rest of the subscription data. If the user is not logged into both DCE and Kerberos V4, then whichever security identity is available, either the DCE principal or the Kerberos V4 principal, is stored into the SDR. Modifications to the subscription using the -u, -d and -a flags are allowed only by a user who can be authenticated as the subscription owner using either security identity. If a subscription contains both a DCE principal and a Kerberos V4 principal, then a user who can be authenticated as either the DCE principal or the Kerberos V4 principal can modify the subscription. If a subscription does not contain a DCE principal, then only the Kerberos V4 principal is used to establish subscription ownership. Similarly, if a subscription does not contain a Kerberos V4 principal, then only the DCE principal is used to establish subscription ownership.
When no authentication methods are enabled, subscription ownership is based on the combination of the user's AIX user name and source host name, the host name of the node from which the user issues the pmandef command. When a subscription is created, the user's AIX user name (like root) and the host name of the node where the pmandef command is running (like node1.xyz.com) are stored into the SDR along with the rest of the subscription data. Modifications to the subscription using the -u, -d and -a flags are only allowed by the same AIX user running on the same node. This means that you can only run the pmandef command with the -u, -d and -a flags on the same node from which you issued the pmandef command with the -s flag to create the subscription.

Changing subscription ownership

If subscriptions are created while SP trusted services are enabled for one set of authentication methods, and later you reconfigure SP trusted services to use a different set of authentication methods, then it might not be possible to establish ownership of some subscriptions. For example, if subscriptions were created while DCE was the only authentication method enabled, and later Kerberos V4 becomes the only authentication method enabled, then the owners of the DCE-based subscriptions will not be able to modify their subscriptions since the subscriptions do not contain Kerberos V4 principals with which to establish ownership.

In these situations use the pmanchown command to change the ownership of the isolated subscriptions from a security identity that cannot be authenticated to something that can be authenticated. In the example above, the ownership of the DCE-based subscriptions can be changed from the user's DCE principal to the user's Kerberos V4 principal. For more information on the pmanchown command, see the book PSSP: Command and Technical Reference.

Authorizing event response actions

After the pmand daemon receives notification that an event has occurred, and before it performs the action for that event, the pmand daemon checks to see whether the subscription owner is authorized to perform the requested action on the node where it is running. If the requested action is execution of a command, the subscription owner must have AIX Remote Commands access to the node as the target user. The target user is by default the same user who issued the pmandef -s command to create the subscription. A different user can be specified to the pmandef command by using the -U flag.

The underlying principle is that the pmand daemon will execute a command in response to an event only if the subscription owner has the ability to execute the same command by other means. If the user can log in to the node as the target user and execute the command directly from the command line, or at least run the command as the target user by invoking the rsh command from a remote node, then no extra privileges can be gained from using Problem Management. The only thing the end user gains is the automation of responses to events within the SP system.

How event response authorization is done depends on the type of the subscription's ownership and which AIX remote command authentication methods have been enabled by the system administrator. In one case it also depends on which SP trusted services authentication methods have been enabled. The specific steps that pmand uses to determine whether the subscription owner is authorized to run a command as the requested target user on the local node are as follows:

If the subscription contains a DCE principal, and if Kerberos V5 is enabled as an AIX remote command authentication method, then the subscription owner is authorized if the DCE principal's underlying Kerberos V5 principal has been listed in the target user's $HOME/.k5login file. If this step fails, authorization processing continues to the next step.
If the subscription contains a Kerberos V4 principal, and if Kerberos V4 is enabled as an AIX remote command authentication method, then the subscription owner is authorized if the Kerberos V4 principal has been listed in the target user's $HOME/.klogin file. If this step fails, authorization processing continues to the next step.
If no SP trusted services authentication methods are enabled, and if the subscription contains both a source AIX user name (the user who issued the pmandef -s command to create the subscription) and source host name (the node where the pmandef -s command was issued), and if standard UNIX is enabled as an AIX remote command authentication method, then the subscription owner is authorized if the source AIX user name and source host name combination have been listed in the target user's $HOME/.rhosts file. If this step fails, the subscription owner is not authorized, so the event response is not executed.

Note that standard UNIX authentication is not used when either DCE or Kerberos V4 has been enabled as an SP trusted services authentication method by the system administrator. This is necessary because, by enabling DCE or Kerberos V4 as an SP trusted services authentication method, the system administrator defines a security policy that requires all PSSP subsystems to use only strong forms of user authentication. Since standard UNIX authentication is a weak form of user authentication, it cannot be used when the PSSP security policy does not allow it.
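
The order of the three authorization steps can be modeled as a short script. This is a simplified sketch under stated assumptions, not pmand's implementation: all function and variable names are hypothetical, the method labels k5, k4 and std are the sketch's own shorthand, and the lookups assume one entry per line, with .rhosts entries written exactly as "host user".

```shell
#!/bin/sh
# Simplified model of the three authorization steps (illustration only).
# Inputs are shell variables; an empty value means "absent" or "none enabled":
#   SUB_K5_PRINCIPAL   Kerberos V5 principal underlying the DCE principal
#   SUB_K4_PRINCIPAL   Kerberos V4 principal stored in the subscription
#   SUB_USER, SUB_HOST source AIX user name and source host name
#   RCMD_METHODS       enabled AIX remote command methods, e.g. "k5 k4 std"
#   SP_METHODS         enabled SP trusted services methods, e.g. "dce compat"
#   TARGET_HOME        home directory of the target user

listed() {   # listed FILE ENTRY: true if ENTRY is a whole line of FILE
    [ -f "$1" ] && grep -Fxq -- "$2" "$1"
}

enabled() {  # enabled LIST NAME: true if NAME is a word in LIST
    case " $1 " in *" $2 "*) return 0 ;; esac
    return 1
}

is_authorized() {
    # Step 1: DCE principal checked against the target user's .k5login
    if [ -n "$SUB_K5_PRINCIPAL" ] && enabled "$RCMD_METHODS" k5 &&
       listed "$TARGET_HOME/.k5login" "$SUB_K5_PRINCIPAL"; then
        return 0
    fi
    # Step 2: Kerberos V4 principal checked against .klogin
    if [ -n "$SUB_K4_PRINCIPAL" ] && enabled "$RCMD_METHODS" k4 &&
       listed "$TARGET_HOME/.klogin" "$SUB_K4_PRINCIPAL"; then
        return 0
    fi
    # Step 3: .rhosts is consulted only when no SP trusted services
    # authentication method is enabled
    if [ -z "$SP_METHODS" ] && [ -n "$SUB_USER" ] && [ -n "$SUB_HOST" ] &&
       enabled "$RCMD_METHODS" std &&
       listed "$TARGET_HOME/.rhosts" "$SUB_HOST $SUB_USER"; then
        return 0
    fi
    return 1   # not authorized: the event response is not executed
}
```

Because each step falls through to the next on failure, a subscription that holds both a DCE principal and a Kerberos V4 principal gets two chances to be authorized, while the .rhosts check is reached only when SP_METHODS is empty.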

When the action to be performed is an entry in the AIX error log and BSD syslog or the generation of an SNMP trap, all of the preceding rules apply, except that the target user is always root, regardless of which target user is contained in the subscription. This restricts these actions to system administrators. Authorization checking for AIX error log and BSD syslog actions and for SNMP trap actions is done separately from authorization checking for command execution actions. Therefore, if a subscription requests both that a command run as, for instance, joeuser and that an SNMP trap be generated in response to an event, the command authorization checking uses joeuser as the target user, while the SNMP trap authorization checking uses root as the target user.

When the action to be performed is execution of a command, the pmand daemon also checks to see whether the system administrator has imposed any AIX login restrictions for the target user on the local node. The pmand daemon enforces the same login restrictions as the AIX rsh service. The login restrictions that are checked by both pmand and rsh include the following:

- Does the target user account exist?
- Has the target user's account been locked?
- Is the target user allowed to access the node at this time of day?
- Has the target user been specifically denied access to the rsh service by including the string !RSH in the ttys attribute for this user? (You can set the user's ttys attribute using the AIX mkuser or chuser commands.)

See the AIX security documentation for the entire list of user login restrictions which are enforced by the rsh service.

In most cases you can determine whether pmand will refuse to execute a user's command by answering the question "Can the subscription owner access the rsh service as the target user on the node where pmand is to execute the command?" Except for cases where rsh will allow authorization based on the target user's $HOME/.rhosts file and pmand will not, if the answer to this question is yes, then pmand will allow the user's command to execute. If the answer is no, then pmand will refuse to execute the command, and it will make note of this in the pmand daemon log file, which exists in the directory /var/adm/SPlogs/pman. Keep in mind that a target user's AIX login restrictions can vary from node to node, and the AIX remote command authentication methods can also vary from node to node, so the answer to this question can be yes for some nodes and no for other nodes within the same SP system.

When restricted root access is enabled, most SP-generated entries are removed from the root user's /.klogin, /.k5login or /.rhosts files on nodes throughout the SP system. (See Restricted root access for more about that.) Any Problem Management subscriptions which rely on the removed SP-generated entries for root authorization will no longer be allowed to run root-authorized actions in response to events, based on the rules already described. If you want to use Problem Management in a restricted root access environment, you can either add your own entries to the root user's /.klogin, /.k5login or /.rhosts files on the nodes where root authorization is required, or you can choose to limit your event response actions to running commands as non-root users.

Requesting Problem Management subscription information

Use the pmanquery command to query the SDR for a description of a Problem Management subscription. The pmanquery command outputs the details of the subscription information in raw format, which can then be used by other applications. The following example queries all subscriptions:

pmanquery -n all -k all -p all -U all -H all

For more information on pmanquery, refer to PSSP: Command and Technical Reference.

Monitoring default events

The Problem Management subsystem provides for a set of default events to be monitored in the /usr/lpp/ssp/install/bin/pmandefaults script. This script contains a series of pmandef commands that request an event notification to be mailed to root on the control workstation, when the specified event occurs. These are events that would interest many system administrators. Events are defined for all nodes in the current system partition, and all events are monitored from the control workstation. The list of events includes:

- The /var file system is more than 95 percent full.
- The /tmp file system is more than 90 percent full.
- An error log record of type PERM has been written to the AIX Error Log.
- The inetd daemon has terminated.
- The sdrd daemon has terminated (control workstation only).
- The sysctld daemon has terminated.
- The hrd daemon has terminated (control workstation only).
- The fsd daemon has terminated (nodes only).

This script is a suggested starting point for configuring the Problem Management subsystem on your SP system. You can choose to run the script as is, or you can make your own copy of the script and modify it to suit your needs, or you can choose not to run the script at all.

The script takes no arguments. It can be executed once for each system partition. Specify the system partition by setting the SP_NAME environment variable.
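
Running the script once per system partition could be wrapped in a loop that sets SP_NAME for each invocation. The following is a sketch only: the function name run_defaults and the DRY_RUN switch are hypothetical, and the partition names are supplied by the caller.

```shell
#!/bin/sh
# Run pmandefaults once for each system partition named on the command
# line, setting SP_NAME for each run. With DRY_RUN=1 the commands are
# printed instead of executed. (Illustration only.)
PMANDEFAULTS=/usr/lpp/ssp/install/bin/pmandefaults

run_defaults() {
    for part in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "SP_NAME=$part $PMANDEFAULTS"
        else
            SP_NAME=$part "$PMANDEFAULTS"
        fi
    done
}
```

Setting SP_NAME on the command line of each invocation, rather than exporting it, keeps the partition choice from leaking into later commands in the same shell.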
