IBM Books

Administration Guide


Configuring and operating Event Management

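The following sections describe how the components of the Event Management subsystem work together to provide event management services. Included are discussions of configuring Event Management, initializing the Event Manager daemon, operating the Event Manager daemon, and Event Management security.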

Configuring Event Management

The RSCT software should have been installed as part of the installation of the PSSP software with AIX 4.3.3 or as part of AIX 5L 5.1. The Event Management subsystem is contained in the ...basic.rte and ...basic.sp file sets. The EMAPI libraries are contained in the ...clients.rte and ...clients.sp file sets.

After the components are installed, the subsystem must be configured for operation. Event Management configuration is performed by the haemctrl command, which is invoked by the syspar_ctrl command.

The syspar_ctrl command configures all of the system partition-sensitive subsystems. The person who installs PSSP issues the syspar_ctrl command during installation of the control workstation. The syspar_ctrl command is executed automatically on the nodes when the nodes are installed. The syspar_ctrl command is also executed automatically when system partitions are created or destroyed. For more information on using the syspar_ctrl command, see PSSP Installation and Migration Guide and PSSP Command and Technical Reference.

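The haemctrl command provides a number of functions for controlling the operation of the Event Management subsystem. You can use it to add, start, stop, delete, clean, unconfigure, trace, and refresh the subsystem. Each function is described in the sections that follow.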

Except for the clean and unconfigure functions, haemctrl affects the Event Management subsystem in the current system partition, that is, the system partition that is specified by the SP_NAME environment variable.

Adding the subsystem

If the haemctrl command is running on the control workstation, the first step in the add function is to select an Event Manager daemon communications port number and save it in the Syspar_ports SDR class. This port number is then placed in the /etc/services file. If the haemctrl command is running on a node, the port number is fetched from the SDR and placed in the /etc/services file. Port numbers are selected from the range 10000 through 10100.
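As a rough illustration of choosing a port from this range, the following Python sketch returns a free port between 10000 and 10100. The function name pick_port and the rule of taking the lowest unused port are assumptions made for this example; the selection rule that haemctrl actually applies is not described here.

```python
from typing import Iterable, Optional

# 10000 through 10100 inclusive, as stated in the text.
PORT_RANGE = range(10000, 10101)


def pick_port(in_use: Iterable[int]) -> Optional[int]:
    """Return the lowest port in the range that is not in `in_use`.

    Returns None when every port in the range is taken. Choosing the
    lowest free port is an assumption of this sketch.
    """
    taken = set(in_use)
    for port in PORT_RANGE:
        if port not in taken:
            return port
    return None
```

In this model, the control workstation would record the chosen port in the SDR, and a node would read the recorded value instead of choosing one.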

The second step is to add the Event Management startup program to the System Resource Controller (SRC) using the mkssys command. On the control workstation, the IP address of the system partition is an argument to the haemd_SP program in the SRC subsystem specification. The third step is to add the aixos resource monitor daemon harmad to the SRC using the mkssys command.

The fourth step is to add an entry to the /etc/inittab file so the Event Manager daemon and the aixos resource monitor will be started during boot. However, if haemctrl is running on a High Availability Control Workstation (HACWS), no entry is made in the /etc/inittab file. Instead, HACWS manages starting and stopping the Event Manager daemon and the aixos resource monitor.

The remaining steps in the add function are performed only on the control workstation. The haemloadcfg program is run to load the default configuration data into the SDR. The haemcfg command is run to create the EMCDB and place it into the staging directory. Finally, if it is not already stored in the SDR, an Event Manager daemon remote client communications port number is selected from the range 10000 through 10100. This port number is then placed in the /etc/services file. This port number is used by all of the Event Manager daemons on the control workstation.

Note that if the haemctrl add function terminates with an error, the command can be rerun after the problem is fixed. The command takes into account any steps that already completed successfully.
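One way to picture this rerun behavior is a list of steps, each of which is skipped if it has already completed. The Python sketch below is illustrative only: the step names, the done set, and the run_add function are hypothetical and do not describe how haemctrl is implemented.

```python
from typing import Callable, List, Set, Tuple

# A step is a (name, action) pair. The names used by callers are
# hypothetical labels for the add steps described in the text.
Step = Tuple[str, Callable[[], None]]


def run_add(steps: List[Step], done: Set[str]) -> List[str]:
    """Run every step not yet recorded in `done`; return the names run.

    A step is recorded only after its action returns, so a rerun after
    a failure resumes at the step that failed.
    """
    executed = []
    for name, action in steps:
        if name in done:
            continue  # completed by an earlier invocation
        action()  # may raise; later steps are left for the rerun
        done.add(name)
        executed.append(name)
    return executed
```

If the second of three steps raises an error, only the first step is recorded as done. A second call with the same done set then runs the second and third steps and leaves the first untouched.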

Starting and stopping the subsystem

The start and stop functions of the haemctrl command simply run the startsrc and stopsrc commands, respectively. However, haemctrl automatically specifies the subsystem argument to these SRC commands.

Deleting the subsystem

The delete function of the haemctrl command removes the subsystem from the SRC, removes the entry from /etc/inittab, and removes the Event Manager daemon communications port number from /etc/services. It does not remove anything from the SDR, because the Event Management subsystem may still be configured on other nodes in the domain.

Cleaning the subsystem

The clean function of the haemctrl command performs the same function as the delete function, except in all system partitions. In addition, it removes the Event Manager daemon remote client communications port number from the /etc/services file.

The clean function does not remove anything from the SDR. This function is provided to support restoring the system to a known state, where the known state is the (possibly restored copy of the) SDR database.

Unconfiguring the subsystem

The unconfigure function of the haemctrl command performs the same function as the clean function and then removes all port numbers from the SDR allocated by the Event Management subystem. This function can only be performed on the control workstation and must be preceded by executing the clean function of the haemctrl command on all of the nodes.

The purpose of this function is to remove allocated port numbers from the SDR in a consistent manner.

Tracing the subsystem

The tracing function of the haemctrl command is provided to supply additional problem determination information when it is requested by the IBM Support Center. Normally, tracing should not be turned on, because it may slightly degrade Event Management subsystem performance and can consume large amounts of disk space in the /var file system.

Refreshing the subsystem

The refresh function of the haemctrl command initiates a procedure in the Event Management subsystem to refresh the subsystem's security configuration. The security configuration is modified to use the current SP Trusted Services authentication methods. See Understanding Event Management security.

Initializing Event Manager daemon

Normally, the Event Manager daemon startup program, haemd_SP, is started by an entry in the /etc/inittab file using the startsrc command. If necessary, you can start the startup program using the haemctrl command or the startsrc command directly. The startup program performs the following steps:

  1. It gets the number of the node where it is running using the /usr/lpp/ssp/install/bin/node_number command. Node 0 is the control workstation.
  2. It fetches the name of the system partition and the EMCDB version string from the Syspar SDR class. (Recall that one instance of the Event Manager daemon runs on the control workstation for each system partition to which the Event Management subsystem was added.) It also fetches the Event Manager daemon remote client communications port number from the SP_ports SDR class.
  3. Finally, the startup program invokes the Event Manager program haemd, passing the information just collected and any arguments passed to the startup program itself. Note that a new process is not started; the process image is just replaced. This permits the Event Manager daemon to be controlled by the SRC. During its initialization, the Event Manager program performs the following steps:
    1. It performs actions that are necessary to become a daemon. This includes establishing communications with the SRC subsystem so that it can return status in response to SRC commands.
    2. It removes from the registration cache all of the subdirectories for local EM clients that no longer exist. That is, if the process ID in the subdirectory name cannot be found, it removes the subdirectory.

      Note that subdirectories for remote clients cannot be removed automatically, because the Event Manager daemon cannot determine if remote processes still exist.

    3. It tries to connect to the Group Services subsystem. If the connection cannot be established because the Group Services subsystem is not running, it is scheduled to be retried in 5 seconds. This continues until the connection to Group Services is established. Meanwhile, Event Manager daemon initialization continues.
    4. It enters the main control loop.

      In this loop, the Event Manager daemon waits for requests from EM clients, messages from resource monitors and other Event Manager daemons, messages from the Group Services subsystem, and requests from the SRC for status. It also waits for internal signals that indicate a function that was previously scheduled should now be executed, for example, retrying a connection to Group Services.

      However, EM client requests, messages from resource monitors, and messages from other Event Manager daemons (called peers) are refused until the Event Manager daemon has successfully joined the daemon peer group (the ha_em_peers group) and has fetched the correct version of the EMCDB.
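The gating rule for the main control loop can be modeled in a few lines. In the Python sketch below, the source labels and the DaemonState class are hypothetical names chosen for illustration; only the rule itself (client, resource monitor, and peer traffic is refused until the daemon has joined the peer group and fetched the EMCDB) comes from the text.

```python
from dataclasses import dataclass

# Hypothetical labels for the message sources named in the text.
GATED = {"em_client", "resource_monitor", "peer"}
ALWAYS = {"group_services", "src", "internal_signal"}


@dataclass
class DaemonState:
    joined_peer_group: bool = False
    emcdb_loaded: bool = False

    def accepts(self, source: str) -> bool:
        """Return True if a message from `source` is handled now."""
        if source in ALWAYS:
            return True
        if source in GATED:
            # Refused until both conditions hold.
            return self.joined_peer_group and self.emcdb_loaded
        raise ValueError(f"unknown source: {source}")
```

A daemon in this model always answers SRC status requests, which is why lssrc can report on a daemon that has not yet joined the peer group.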

Joining the peer group

After the Event Manager daemon has successfully established a connection with the Group Services subsystem, it tries to join the daemon peer group, a Group Services group called ha_em_peers. If this is the first Event Manager daemon to come up in the domain, it establishes the peer group. Otherwise, the other daemons in the peer group either accept or reject the daemon's join request. If an existing peer group member is still recovering from a prior termination of the joining daemon, the join request is rejected. If its join request is rejected, the daemon tries to join again in 15 seconds. This continues until the daemon's join request is accepted.

When it joins the daemon peer group, the Event Manager daemon examines the group state. The group state is the EMCDB version string.

If the group state is null, the joining daemon proposes that the group state be set to the version string that the daemon has fetched from the SDR. If several daemons try to join the group at about the same time, and the group state is null, then each daemon proposes the group state. When the group is formed, Group Services selects one of the proposals and sets the group state to it. Note that each daemon is proposing the EMCDB version string that it has fetched from the SDR. Unless the haemcfg command has been run at about the same time, the proposed version strings should be identical.

If the group state is not null when it is examined by the joining daemon, a group has already formed and the daemon does not propose a new group state.

After the daemon has successfully joined the peer group, it compares the EMCDB version string contained in the group state to the version string it fetched from the SDR. If they are different, the version that was fetched from the SDR is replaced by the version in the group state.
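The version-string rules above can be summarized in a small model. The following Python sketch is an assumption-laden illustration: join_peer_group and its select parameter are hypothetical, the version strings in the usage note are placeholders, and the way Group Services chooses among proposals is not specified here, so it is passed in as a function.

```python
from typing import Callable, List, Optional, Tuple


def join_peer_group(
    group_state: Optional[str],
    sdr_versions: List[str],
    select: Callable[[List[str]], str] = lambda proposals: proposals[0],
) -> Tuple[str, List[str]]:
    """Model one or more daemons joining the peer group.

    `sdr_versions` holds the version string each joining daemon fetched
    from the SDR and must not be empty. Returns the resulting group
    state and the version each daemon uses after the join.
    """
    if group_state is None:
        # Null group state: every joining daemon proposes its own SDR
        # version, and one proposal is selected as the group state.
        group_state = select(list(sdr_versions))
    # After the join, the group state replaces any differing SDR version.
    in_use = [group_state for _ in sdr_versions]
    return group_state, in_use
```

For example, join_peer_group("v1", ["v2"]) leaves the group state at "v1", and the joining daemon uses "v1" even though it fetched "v2" from the SDR.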

An Event Manager daemon is prevented from joining the peer group as long as any other Event Manager daemon, currently in the peer group, is non-responsive to "pings" from the Group Services subsystem. (When an Event Manager daemon successfully joins the peer group, Group Services requests a response from the Event Manager daemon every two minutes. If the daemon does not respond to the request within two minutes, it is considered to be non-responsive. A daemon is also considered to be non-responsive if it does not reply to the join requests of other daemons within one minute.) The Event Manager daemon status, as displayed by the lssrc command, indicates if a daemon cannot join the peer group. If this is the case, the em.default.domain_name file of any other daemon in the peer group should be examined for errors indicating that an Event Manager daemon is non-responsive. If so, and the non-responsive daemon does not terminate itself within a few minutes, perform the User Response specified for the error.

Reading the EMCDB

Once the daemon has joined the peer group and has determined the EMCDB version, it reads the run-time EMCDB file from the /etc/ha/cfg directory. If the file does not exist, it is copied from the staging directory on the control workstation.

Once the daemon has read the file, it compares the version string in the EMCDB to the one it fetched (from the SDR or from the group state). If the two version strings do not match, and the daemon has not just copied the EMCDB from the control workstation, then it copies the run-time EMCDB from the control workstation. If the version strings still do not match, the daemon terminates with an error.
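The read, compare, and copy sequence can be sketched as follows. This Python model is hypothetical: load_emcdb, read_local, copy_from_cws, and EmcdbVersionError are names invented for the example, and the two callbacks stand in for reading the run-time file and copying it from the control workstation.

```python
from typing import Callable, Optional


class EmcdbVersionError(Exception):
    """Raised when the EMCDB version still mismatches after a copy."""


def load_emcdb(
    expected: str,
    read_local: Callable[[], Optional[str]],
    copy_from_cws: Callable[[], None],
) -> str:
    """Return the version string of the EMCDB that was read.

    `read_local` returns the version string in the run-time file, or
    None if the file does not exist. `copy_from_cws` replaces the
    run-time file with a copy from the control workstation.
    """
    copied = False
    found = read_local()
    if found is None:
        # No run-time file yet: copy it, then read it.
        copy_from_cws()
        copied = True
        found = read_local()
    if found != expected and not copied:
        # Stale local file: copy once and read again.
        copy_from_cws()
        copied = True
        found = read_local()
    if found != expected:
        raise EmcdbVersionError(f"expected {expected!r}, found {found!r}")
    return found
```

The copied flag ensures that at most one copy is made per startup, which matches the rule that a mismatch after a fresh copy is treated as an error.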

Whenever it is necessary to copy the EMCDB from the control workstation, the EMCDB version string is used to determine how the copy is done. If the EMCDB version string was obtained from the group state, then it is used to copy a back level EMCDB from the staging directory on the control workstation. Otherwise, the staging file /spdata/sys1/ha/cfg/em.domain_name.cdb is copied. Note that back level copies of the EMCDB should be removed from the staging directory only if their version string suffix indicates a time stamp older than the current version string found in the daemon peer group state. The current version string is found in the Event Manager daemon status, as displayed by the lssrc command. See Displaying the status of the Event Manager daemon.

After the daemon has read and validated the EMCDB, it enables daemon communications. This permits EM clients to send requests to the daemon, resource monitors to connect to the daemon, and peers to send messages to the daemon. At this point, the initialization of the Event Manager daemon is complete.

To copy the EMCDB from the control workstation to the /etc/ha/cfg directory, the Event Manager daemon uses the /usr/sbin/rsct/install/bin/haemrcpcdb script. This script uses the rcp command to perform the actual copy.

The way in which Event Manager daemons determine the EMCDB version has important implications for the configuration of the subsystem. To place a new version of the EMCDB into production (that is, to make it the run-time version that is used by the Event Management subsystem), you must stop each Event Manager daemon in the domain after the haemcfg command is run. Stopping the daemons dissolves the existing peer group. Once the existing peer group is dissolved, the daemons can be restarted. As they restart, the daemons form a new peer group. A new EMCDB version string can be submitted as the group state only when a peer group is formed.

Operating the Event Manager daemon

Normal operation of the Event Management subsystem requires no administrative intervention. The subsystem recovers from temporary failures automatically. However, there are some characteristics that might be of interest to administrators.

For performance reasons, the Event Manager daemon has an internal limit of 256 open file descriptors. In practice, this limits the number of EM client sessions, either local or remote, to about 225. This file descriptor limit is per daemon; it does not limit the number of EM clients in the domain.

The Event Manager daemon connects to a resource monitor of type server as the resource monitor starts. If a server resource monitor is running prior to the start of the Event Manager daemon, the daemon connects to the resource monitor after enabling daemon communications. If able, the daemon starts a server resource monitor when necessary. However, connection and start attempts are constrained under the following circumstances:

  1. If it is necessary that the daemon start the resource monitor before each connection attempt, then after three attempts within two hours the resource monitor is "locked" and no further attempts are made.
  2. If the resource monitor is not startable by the Event Manager daemon, then after about three successful connections within two hours the resource monitor is "locked" and no further attempts are made.

The rationale for locking the resource monitor is that, if it cannot be started and stay running or successful connections are frequently being lost, then a problem exists with the resource monitor. Once the problem has been determined and corrected, the haemunlkrm command can be used to unlock the resource monitor and connect to it, starting it first if necessary. Note that locking does not apply to client type resource monitors.
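The locking rule amounts to counting attempts inside a two-hour window. The Python sketch below is a minimal model under stated assumptions: the MonitorLock class is hypothetical, a sliding window is assumed (the text does not say how the two hours are measured), and the "about three" of the second case is treated as exactly three.

```python
from collections import deque

WINDOW_SECONDS = 2 * 60 * 60  # two hours
MAX_ATTEMPTS = 3


class MonitorLock:
    """Tracks attempts for one server-type resource monitor."""

    def __init__(self) -> None:
        self.attempts = deque()
        self.locked = False

    def record_attempt(self, now: float) -> bool:
        """Record an attempt at time `now` (seconds); return lock state."""
        if self.locked:
            return True  # no further attempts are made once locked
        # Forget attempts that have fallen out of the window.
        while self.attempts and now - self.attempts[0] >= WINDOW_SECONDS:
            self.attempts.popleft()
        self.attempts.append(now)
        if len(self.attempts) >= MAX_ATTEMPTS:
            self.locked = True
        return self.locked

    def unlock(self) -> None:
        """Model the effect of haemunlkrm: clear the lock and history."""
        self.attempts.clear()
        self.locked = False
```

Three attempts at 0, 100, and 200 seconds lock the monitor, while attempts at 0, 100, and 7300 seconds do not, because the first two have aged out of the window by the time of the third.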

The primary function of the Event Manager daemon is to generate events, by observing resource variable values and applying expressions to those values. However, this function is performed for a resource variable only if an EM client has registered to receive events for that resource variable. The Event Manager daemon also observes a resource variable once to satisfy a query request, if the resource variable is not already being observed. When observations are necessary, the Event Manager daemon commands the appropriate resource monitor (if it has a connection type of server) to supply resource variable values. When observations are no longer necessary, the Event Manager daemon commands the resource monitor to stop supplying values. In this way, the Event Manager daemon performs no action for resource variables that are not of interest to clients.

Even if a resource monitor that has a connection type of server is running, it does not supply data to the Event Manager daemon except by command of the daemon.

The Event Manager daemon observes a resource variable in one of two ways. If the resource variable is located in shared memory, the daemon observes it every X seconds, where X is the observation interval that is specified in the resource variable's resource class definition. Otherwise, the daemon observes the resource variable when its value is sent to the daemon by the resource monitor (transparently, via the RMAPI). The values of resource variables of value type Counter and Quantity are located in shared memory. The values of resource variables of value type State are not.

All resource variables that are located in shared memory with the same observation interval are observed on the same time boundary. This minimizes the observation overhead, no matter when the request for a resource variable is made.
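A common way to place observations on a shared boundary is to round up to the next multiple of the observation interval. The Python sketch below assumes that boundaries are whole multiples of the interval measured from a common time origin; the text states only that observations share a time boundary, so this rule and the name next_observation are illustrative assumptions.

```python
import math


def next_observation(now: float, interval: int) -> int:
    """Return the first multiple of `interval` strictly after `now`.

    Both values are in seconds from a common origin. Variables with
    the same interval get the same result regardless of when they
    were requested within the current interval.
    """
    return (math.floor(now / interval) + 1) * interval
```

With an interval of 5 seconds, requests made at 3 seconds and at 4 seconds are both first observed at 5 seconds, so the daemon makes one pass over shared memory rather than two.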

Understanding Event Management security

The Event Management subsystem is an SP trusted service. Event Management enforces the security policies as defined by the trusted services authentication methods configured in the domain (the SP system partition) in which the Event Management subsystem is running. These authentication methods can be any combination of DCE, compatibility, and none.

Authentication and authorization of EM clients

Table 18 defines the behavior of an EM client and the EM daemon to which it connects for each combination of authentication methods. Since an EM client may not be on the same node as the EM daemon to which it connects, the table includes entries for each possible combination of authentication methods configured on the nodes where the EM client and the EM daemon are running. Each table column represents a combination of authentication methods for the EM daemon and each table row represents a combination of authentication methods for the EM client.

Table 18. Behavior of the Event Management subsystem with respect to trusted service authentication methods.

EM Client         EM Daemon
                  DCE          DCE and Compat   Compat   None    No support
DCE               MutualAuth   MutualAuth       Error    Error   Error
DCE and Compat    ClientAuth   OK               OK       OK      OK
Compat            Error        OK               OK       OK      OK
None              Error        OK               OK       OK      OK
No support        Error        OK               OK       OK      Not applicable

The row and column headings have the following meanings:

DCE
Only the DCE authentication method is configured on the node.

DCE and Compat
Both the DCE and compatibility authentication methods are configured on the node.

Compat
Only the compatibility authentication method is configured on the node.

None
Neither the DCE nor the compatibility authentication methods are configured on the node.

No support
The node is installed with an earlier version than PSSP 3.2.

The table cell labels have the following meanings:

MutualAuth
For the combination of authentication methods that intersect in a table cell with this label, the EM daemon authenticates the DCE principal under which the EM client is running and then the EM client authenticates the DCE principal under which the EM daemon is running. This mutual authentication ensures that both the EM daemon and the EM client recognize the identity of the other.

ClientAuth
For the combination of authentication methods that intersect in the table cell with this label, the EM daemon authenticates the DCE principal under which the EM client is running; the EM client does no authentication of the EM daemon.

Error
For the combination of authentication methods that intersect in a table cell with this label, no communication is permitted between the EM client and the EM daemon.

OK
For the combination of authentication methods that intersect in a table cell labeled OK, communication is always permitted: the EM client is unauthenticated.

Whenever an EM client is authenticated by the EM daemon, the DCE principal that is executing the client must also be defined in the Event Management DCE group named haem-users. The access control policy of the Event Management subsystem is that an authenticated EM client must be a member of the DCE group haem-users in order to access the Event Management subsystem and the resources which it monitors. When communication between an unauthenticated EM client and an EM daemon is permitted, no additional authorization is required.

Note:
The name haem-users is the default group name. That name can be changed locally by the system administrator. If it is locally changed, remember to replace the name haem-users with your local group name wherever it appears in the documentation.

All authentication and authorization logic is implemented in the EMAPI library and the EM daemon. This logic is executed whenever an EM client application attempts to start a session with the Event Management subsystem. Errors in authentication, authorization, or an invalid combination of authentication methods between the EM client and the EM daemon, as indicated by the label Error in Table 18, result in a failure of the request to start the EM session.
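Table 18 can be read as a lookup keyed by the client and daemon configurations. The following Python sketch transcribes the table for illustration; the function names are hypothetical and the code is not part of the EMAPI.

```python
# Column order for the EM daemon configurations in Table 18.
METHODS = ["DCE", "DCE and Compat", "Compat", "None", "No support"]

# Rows are EM client configurations; values follow the column order.
TABLE_18 = {
    "DCE":            ["MutualAuth", "MutualAuth", "Error", "Error", "Error"],
    "DCE and Compat": ["ClientAuth", "OK", "OK", "OK", "OK"],
    "Compat":         ["Error", "OK", "OK", "OK", "OK"],
    "None":           ["Error", "OK", "OK", "OK", "OK"],
    "No support":     ["Error", "OK", "OK", "OK", "Not applicable"],
}


def session_behavior(client: str, daemon: str) -> str:
    """Return the Table 18 cell for a client and daemon configuration."""
    return TABLE_18[client][METHODS.index(daemon)]


def needs_group_membership(client: str, daemon: str) -> bool:
    """True when the daemon authenticates the client.

    In that case the client's DCE principal must also be a member of
    the haem-users group (or its locally chosen replacement).
    """
    return session_behavior(client, daemon) in ("MutualAuth", "ClientAuth")
```

For example, session_behavior("Compat", "DCE") returns "Error", so a client on a compatibility-only node cannot start a session with a daemon on a DCE-only node.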

Security and the peer group

Each EM daemon has a security state that matches the SP trusted services authentication methods configured on the node where the daemon is running. The states are the following:

DCE
DCE and Compatibility
Compatibility
None
No support

The EM peer group also has a security state, as maintained in the peer group state value as one of the following three keywords:

SEC
Security is enabled in the peer group.

NOSEC
Security is disabled in the peer group.

NOSECSUPPORT
No security is supported in the peer group.

The security state of each EM daemon in the peer group must match that of the peer group as defined in Table 19.

Table 19. Daemon and peer group security states

EM daemon security state    EM peer group security state
DCE                         SEC
DCE and Compat              NOSEC (or null)
Compat                      NOSEC (or null)
None                        NOSEC (or null)
No support                  NOSECSUPPORT (or null)

If the peer group contains versions of EM daemons from earlier releases, then the peer group state value might not contain any of the keywords SEC, NOSEC, or NOSECSUPPORT. The security state value might be null. The peer group state value can be observed by displaying the status of the EM daemon.

When the EM daemon starts, it determines if the node has been installed with PSSP 3.2. If not, the security state of the daemon is No support. Otherwise, the daemon obtains the currently configured SP trusted services authentication methods and sets the respective security state. If the daemon cannot obtain the authentication methods, it sets the security state to None. When the daemon then joins the peer group, if the peer group currently has no state set, it proposes a peer group security state to match. Upon completion of the join, the daemon checks if the security states match. If they do not match as in Table 19, the daemon logs an error and exits.

The result of this procedure is that the first daemon that joins the peer group sets the security state of the group, as defined on the node. If multiple daemons join the group at the same time, and the group does not currently exist, then Group Services arbitrarily picks one of the proposed group states. This can result in one or more daemons exiting with an error if their proposed state is not the one picked. This can only occur if the nodes where the daemons are executing do not all have compatible security configurations, as defined in Table 19.
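The matching rule in Table 19 can be expressed as a small lookup. The Python sketch below is illustrative: the function names are hypothetical, and a null group state is represented by None.

```python
from typing import List, Optional

# Daemon security state -> acceptable peer group states, from Table 19.
# The first entry is the state the daemon proposes; None stands for a
# null group state.
TABLE_19 = {
    "DCE": ("SEC",),
    "DCE and Compat": ("NOSEC", None),
    "Compat": ("NOSEC", None),
    "None": ("NOSEC", None),
    "No support": ("NOSECSUPPORT", None),
}


def proposed_state(daemon_state: str) -> str:
    """State a daemon proposes when the group has no state set."""
    return TABLE_19[daemon_state][0]


def states_match(daemon_state: str, group_state: Optional[str]) -> bool:
    """True if the daemon may remain in a group with this state."""
    return group_state in TABLE_19[daemon_state]


def daemons_that_exit(group_state: Optional[str],
                      daemon_states: List[str]) -> List[str]:
    """Daemon states that would log an error and exit after joining."""
    return [d for d in daemon_states if not states_match(d, group_state)]
```

If a group forms with state NOSEC, daemons_that_exit("NOSEC", ["Compat", "DCE", "None"]) returns ["DCE"], which corresponds to the case where nodes in the domain do not have compatible security configurations.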

Effect of migrating or changing security configuration

As indicated by the information in Table 18 and Table 19, the Event Management subsystem supports a mixture of authentication methods within a domain, including nodes where an EM client is executing outside of the domain. This permits the EM subsystem to support migration to PSSP 3.2 one node at a time.

After migration is complete, or any time the SP trusted services authentication methods are changed on the nodes of a domain, the Event Management subsystem needs to be refreshed. The refresh usually happens automatically as part of migration and configuration tasks. For instance, the haemctrl -r command is invoked automatically by the chauthpts command. If necessary, you can use the haemctrl -r command directly to refresh the subsystem.

The daemon that receives the command to refresh obtains a new security state by obtaining the current authentication methods. It then proposes a peer group state change that matches this new state, as defined in Table 19. When each daemon receives the proposal, it also obtains the current authentication methods on its node. If the latest methods match the proposed state, the daemon votes ACCEPT; otherwise it votes REJECT. If the protocol is approved by all daemons, each daemon uses the security state just obtained. After the security state of the daemon is updated, the daemon checks all client connections to see if they have an appropriate security state according to Table 20. If not, the connections are closed. When the client detects that its connection has terminated, it can do another start session or a restart session. In either case, authentication and authorization are once again performed according to the policy in Table 18.

Table 20. Validation of EM client connection

Connection              New EM Daemon Security Configurations
Authentication State    DCE      DCE and Compat    Compat    None
Auth                    Keep     Keep              Keep      Keep
No Auth                 Close    Keep              Keep      Keep

Table 20 shows the policy used to validate authorization of the EM client connection:

Auth
The client connection was authenticated when it was originally made (either MutualAuth or ClientAuth).

No Auth
The client connection is unauthenticated.

Keep
Keep the connection.

Close
Close the connection.
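The refresh vote and the connection check in Table 20 can be modeled together. The Python sketch below is a simplified illustration: vote, approved, and keep_connection are hypothetical names, and the mapping repeats the Table 19 keywords for the four configurations that appear in Table 20.

```python
from typing import List

# Daemon security state -> peer group keyword, from Table 19.
GROUP_STATE = {
    "DCE": "SEC",
    "DCE and Compat": "NOSEC",
    "Compat": "NOSEC",
    "None": "NOSEC",
}


def vote(proposed_group_state: str, local_daemon_state: str) -> str:
    """A daemon accepts only if its own methods match the proposal."""
    matches = GROUP_STATE[local_daemon_state] == proposed_group_state
    return "ACCEPT" if matches else "REJECT"


def approved(votes: List[str]) -> bool:
    """The state change takes effect only if every daemon accepts."""
    return all(v == "ACCEPT" for v in votes)


def keep_connection(authenticated: bool, new_daemon_state: str) -> bool:
    """Apply Table 20 to one client connection after a refresh.

    Only an unauthenticated connection to a daemon whose new
    configuration is DCE-only is closed.
    """
    if new_daemon_state not in GROUP_STATE:
        raise ValueError(f"unknown daemon state: {new_daemon_state}")
    return authenticated or new_daemon_state != "DCE"
```

A client whose connection is closed in this way can start or restart its session, at which point the Table 18 policy is applied again.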

