The following sections describe how the components of the Event Management subsystem work together to provide event management services. Included are discussions of:
The RSCT software should have been installed as part of the installation of the PSSP software with AIX 4.3.3 or as part of AIX 5L 5.1. The Event Management subsystem is contained in the ...basic.rte and ...basic.sp file sets. The EMAPI libraries are contained in the ...clients.rte and ...clients.sp file sets.
After the components are installed, the subsystem must be configured for operation. Event Management configuration is performed by the haemctrl command, which is invoked by the syspar_ctrl command.
The syspar_ctrl command configures all of the system partition-sensitive subsystems. The person who installs PSSP issues the syspar_ctrl command during installation of the control workstation. The syspar_ctrl command is executed automatically on the nodes when the nodes are installed. The syspar_ctrl command is also executed automatically when system partitions are created or destroyed. For more information on using the syspar_ctrl command, see PSSP Installation and Migration Guide and PSSP Command and Technical Reference.
The haemctrl command provides a number of functions for controlling the operation of the Event Management system. You can use it to:
Except for the clean and unconfigure functions, haemctrl affects the Event Management subsystem in the current system partition, that is, the system partition that is specified by the SP_NAME environment variable.
If the haemctrl command is running on the control workstation, the first step in the add function is to select an Event Manager daemon communications port number and save it in the Syspar_ports SDR class. This port number is then placed in the /etc/services file. If the haemctrl command is running on a node, the port number is fetched from the SDR and placed in the /etc/services file. Port numbers are selected from the range 10000 through 10100.
The second step is to add the Event Management startup program to the System Resource Controller (SRC) using the mkssys command. On the control workstation, the IP address of the system partition is an argument to the haemd_SP program in the SRC subsystem specification. The third step is to add the aixos resource monitor daemon harmad to the SRC using the mkssys command.
The fourth step is to add an entry to the /etc/inittab file so the Event Manager daemon and the aixos resource monitor will be started during boot. However, if haemctrl is running on a High Availability Control Workstation (HACWS), no entry is made in the /etc/inittab file. Instead, HACWS manages starting and stopping the Event Manager daemon and the aixos resource monitor.
The remaining steps in the add function are performed only on the control workstation. The haemloadcfg program is run to load the default configuration data into the SDR. The haemcfg command is run to create the EMCDB and place it into the staging directory. Finally, if it is not already stored in the SDR, an Event Manager daemon remote client communications port number is selected from the range 10000 through 10100. This port number is then placed in the /etc/services file. This port number is used by all of the Event Manager daemons on the control workstation.
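The port selection described for the add function can be pictured with a short sketch. This is only an illustration: the text states the range (10000 through 10100) but not how a free port is chosen, so the first-fit strategy, the function names, and the sample service entries below are all assumptions.

```python
# Hypothetical sketch: choose an unused port in the documented range.
PORT_RANGE = range(10000, 10101)  # 10000 through 10100, inclusive


def parse_services(text):
    """Return the set of port numbers found in /etc/services-style text."""
    ports = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) >= 2 and "/" in fields[1]:
            port = fields[1].split("/", 1)[0]
            if port.isdigit():
                ports.add(int(port))
    return ports


def select_port(used_ports):
    """Return the lowest free port in the range (first-fit is an assumption)."""
    for port in PORT_RANGE:
        if port not in used_ports:
            return port
    raise RuntimeError("no free port in the range 10000-10100")


# Sample entries are made up for illustration only.
SAMPLE_SERVICES = """\
svc_a   10000/udp
svc_b   10001/tcp   # trailing comment
ftp     21/tcp
"""

if __name__ == "__main__":
    print(select_port(parse_services(SAMPLE_SERVICES)))
```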
Note that if the haemctrl add function terminates with an error, the command can be rerun after the problem is fixed. The command takes into account any steps that already completed successfully.
The start and stop functions of the haemctrl command simply run the startsrc and stopsrc commands, respectively. However, haemctrl automatically specifies the subsystem argument to these SRC commands.
The delete function of the haemctrl command removes the subsystem from the SRC, removes the entry from /etc/inittab, and removes the Event Manager daemon communications port number from /etc/services. It does not remove anything from the SDR, because the Event Management subsystem may still be configured on other nodes in the domain.
The clean function of the haemctrl command performs the same function as the delete function, except in all system partitions. In addition, it removes the Event Manager daemon remote client communications port number from the /etc/services file.
The clean function does not remove anything from the SDR. This function is provided to support restoring the system to a known state, where the known state is the (possibly restored copy of the) SDR database.
The unconfigure function of the haemctrl command performs the same function as the clean function and then removes all port numbers from the SDR allocated by the Event Management subsystem. This function can only be performed on the control workstation and must be preceded by executing the clean function of the haemctrl command on all of the nodes.
The purpose of this function is to remove allocated port numbers from the SDR in a consistent manner.
The tracing function of the haemctrl command is provided to supply additional problem determination information when it is requested by the IBM Support Center. Normally, tracing should not be turned on, because it may slightly degrade Event Management subsystem performance and can consume large amounts of disk space in the /var file system.
The refresh function of the haemctrl command initiates a procedure in the Event Management subsystem to refresh the subsystem's security configuration. The security configuration is modified to use the current SP Trusted Services authentication methods. See Understanding Event Management security.
Normally, the Event Manager daemon startup program, haemd_SP, is started by an entry in the /etc/inittab file using the startsrc command. If necessary, you can start the startup program using the haemctrl command or the startsrc command directly. The startup program performs the following steps:
Note that subdirectories for remote clients cannot be removed automatically, because the Event Manager daemon cannot determine if remote processes still exist.
In this loop, the Event Manager daemon waits for requests from EM clients, messages from resource monitors and other Event Manager daemons, messages from the Group Services subsystem, and requests from the SRC for status. It also waits for internal signals that indicate a function that was previously scheduled should now be executed, for example, retrying a connection to Group Services.
However, EM client requests, messages from resource monitors, and messages from other Event Manager daemons (called peers) are refused until the Event Manager daemon has successfully joined the daemon peer group (the ha_em_peers group) and has fetched the correct version of the EMCDB.
After the Event Manager daemon has successfully established a connection with the Group Services subsystem, it tries to join the daemon peer group, a Group Services group called ha_em_peers. If this is the first Event Manager daemon to come up in the domain, it establishes the peer group. Otherwise, the other daemons in the peer group either accept or reject the daemon's join request. If an existing peer group member is still recovering from a prior termination of the joining daemon, the join request is rejected. If its join request is rejected, the daemon tries to join again in 15 seconds. This continues until the daemon's join request is accepted.
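The retry behavior can be sketched as a simple loop. The 15-second delay comes from the text; the function name and the `try_join` callback are hypothetical stand-ins for the real Group Services join request, which is not shown here.

```python
import time

RETRY_SECONDS = 15  # retry delay stated in the text


def join_peer_group(try_join, sleep=time.sleep):
    """Call try_join() until it is accepted; return the number of attempts.

    try_join is a hypothetical callback that returns True when the join
    request is accepted and False when it is rejected.
    """
    attempts = 0
    while True:
        attempts += 1
        if try_join():
            return attempts
        sleep(RETRY_SECONDS)


if __name__ == "__main__":
    # Simulate two rejections followed by an acceptance, recording the
    # requested delays instead of actually sleeping.
    responses = iter([False, False, True])
    waits = []
    print(join_peer_group(lambda: next(responses), sleep=waits.append), waits)
```

Passing the sleep function as a parameter keeps the loop easy to exercise without waiting in real time.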
When it joins the daemon peer group, the Event Manager daemon examines the group state. The group state is the EMCDB version string.
If the group state is null, the joining daemon proposes that the group state be set to the version string that the daemon has fetched from the SDR. If several daemons try to join the group at about the same time, and the group state is null, then each daemon proposes the group state. When the group is formed, Group Services selects one of the proposals and sets the group state to it. Note that each daemon is proposing the EMCDB version string that it has fetched from the SDR. Unless the haemcfg command has been run at about the same time, the proposed version strings should be identical.
If the group state is not null when it is examined by the joining daemon, a group has already formed and the daemon does not propose a new group state.
After the daemon has successfully joined the peer group, it compares the EMCDB version string contained in the group state to the version string it fetched from the SDR. If they are different, the version that was fetched from the SDR is replaced by the version in the group state.
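The version negotiation above can be modeled in a few lines. This is a sketch under stated assumptions: `None` stands for a null group state, the version strings are placeholders (the real format is not given here), and the function names are invented for illustration.

```python
def proposal_for(group_state, sdr_version):
    """Return the group state a joining daemon proposes, or None.

    A daemon proposes the version it fetched from the SDR only when the
    group state is null (modeled here as None).
    """
    return sdr_version if group_state is None else None


def version_after_join(final_group_state, sdr_version):
    """Return (version, source) once the join has completed.

    The group state wins when it differs from the SDR version; the source
    tag records where the version came from.
    """
    if final_group_state != sdr_version:
        return final_group_state, "group_state"
    return sdr_version, "sdr"


if __name__ == "__main__":
    # Placeholder version strings, not a real EMCDB version format.
    print(proposal_for(None, "v2"))
    print(version_after_join("v1", "v2"))
```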
An Event Manager daemon is prevented from joining the peer group as long as any other Event Manager daemon, currently in the peer group, is non-responsive to "pings" from the Group Services subsystem. (When an Event Manager daemon successfully joins the peer group, Group Services requests a response from the Event Manager daemon every two minutes. If the daemon does not respond to the request within two minutes, it is considered to be non-responsive. A daemon is also considered to be non-responsive if it does not reply to the join requests of other daemons within one minute.) The Event Manager daemon status, as displayed by the lssrc command, indicates if a daemon cannot join the peer group. If this is the case, the em.default.domain_name file of any other daemon in the peer group should be examined for errors indicating that an Event Manager daemon is non-responsive. If so, and the non-responsive daemon does not terminate itself within a few minutes, perform the User Response specified for the error.
Once the daemon has joined the peer group and has determined the EMCDB version, it reads the run-time EMCDB file from the /etc/ha/cfg directory. If the file does not exist, it is copied from the staging directory on the control workstation.
Once the daemon has read the file, it compares the version string in the EMCDB to the one it fetched (from the SDR or from the group state). If the two version strings do not match, and the daemon has not just copied the EMCDB from the control workstation, then it copies the run-time EMCDB from the control workstation. If the version strings still do not match, the daemon terminates with an error.
Whenever it is necessary to copy the EMCDB from the control workstation, the EMCDB version string is used to determine how the copy is done. If the EMCDB version string was obtained from the group state, then it is used to copy a back level EMCDB from the staging directory on the control workstation. Otherwise, the staging file /spdata/sys1/ha/cfg/em.domain_name.cdb is copied. Note that back level copies of the EMCDB should be removed from the staging directory only if their version string suffix indicates a time stamp older than the current version string found in the daemon peer group state. The current version string is shown in the Event Manager daemon status, as displayed by the lssrc command. See Displaying the status of the Event Manager daemon.
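The read, compare, and copy sequence for the run-time EMCDB can be summarized as follows. This is a minimal sketch, not the daemon's implementation: `read_local` and `copy_from_cws` are hypothetical callbacks standing in for reading the file in /etc/ha/cfg and copying it from the control workstation, and the exception name is invented.

```python
class EMCDBVersionError(Exception):
    """Raised when the run-time EMCDB cannot be matched to the expected version."""


def load_emcdb(expected_version, read_local, copy_from_cws):
    """Return the validated EMCDB version or raise EMCDBVersionError.

    read_local() returns the version string of the local run-time file, or
    None if the file does not exist. copy_from_cws() replaces the local file
    with a copy from the control workstation.
    """
    copied = False
    version = read_local()
    if version is None:
        # No run-time file yet: copy it from the staging directory.
        copy_from_cws()
        copied = True
        version = read_local()
    if version != expected_version and not copied:
        # Mismatch and no copy made yet: copy once, then check again.
        copy_from_cws()
        version = read_local()
    if version != expected_version:
        raise EMCDBVersionError(
            "expected %r, found %r" % (expected_version, version)
        )
    return version


if __name__ == "__main__":
    # Fake file state with placeholder version strings.
    state = {"local": "old", "staged": "new"}
    result = load_emcdb(
        "new",
        lambda: state["local"],
        lambda: state.update(local=state["staged"]),
    )
    print(result)
```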
After the daemon has read and validated the EMCDB, it enables daemon communications. This permits EM clients to send requests to the daemon, resource monitors to connect to the daemon, and peers to send messages to the daemon. At this point, the initialization of the Event Manager daemon is complete.
To copy the EMCDB from the control workstation to the /etc/ha/cfg directory, the Event Manager daemon uses the /usr/sbin/rsct/install/bin/haemrcpcdb script. This script uses the rcp command to perform the actual copy.
The way in which Event Manager daemons determine the EMCDB version has important implications for the configuration of the subsystem. To place a new version of the EMCDB into production (that is, to make it the run-time version that is used by the Event Management subsystem), you must stop each Event Manager daemon in the domain after the haemcfg command is run. Stopping the daemons dissolves the existing peer group. Once the existing peer group is dissolved, the daemons can be restarted. As they restart, the daemons form a new peer group. A new EMCDB version string can be submitted as the group state only when a peer group is formed.
Normal operation of the Event Management subsystem requires no administrative intervention. The subsystem recovers from temporary failures automatically. However, there are some characteristics that might be of interest to administrators.
For performance reasons, the Event Manager daemon has an internal limit of 256 open file descriptors. In practice, this limits the number of EM client sessions, either local or remote, to about 225. This file descriptor limit is per daemon; it does not limit the number of EM clients in the domain.
The Event Manager daemon connects to a resource monitor of type server as the resource monitor starts. If a server resource monitor is running prior to the start of the Event Manager daemon, the daemon connects to the resource monitor after enabling daemon communications. If able, the daemon starts a server resource monitor when necessary. However, connection and start attempts are constrained under the following circumstances:
The rationale for locking the resource monitor is that, if it cannot be started and stay running or successful connections are frequently being lost, then a problem exists with the resource monitor. Once the problem has been determined and corrected, the haemunlkrm command can be used to unlock the resource monitor and connect to it, starting it first if necessary. Note that locking does not apply to client type resource monitors.
The primary function of the Event Manager daemon is to generate events, by observing resource variable values and applying expressions to those values. However, this function is performed for a resource variable only if an EM client has registered to receive events for that resource variable. The Event Manager daemon also observes a resource variable once to satisfy a query request, if the resource variable is not already being observed. When observations are necessary, the Event Manager daemon commands the appropriate resource monitor (if it has a connection type of server) to supply resource variable values. When observations are no longer necessary, the Event Manager daemon commands the resource monitor to stop supplying values. In this way, the Event Manager daemon performs no action for resource variables that are not of interest to clients.
Even if a resource monitor that has a connection type of server is running, it does not supply data to the Event Manager daemon except by command of the daemon.
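One way to picture this on-demand behavior is a per-variable count of interested clients. The sketch below is an assumption about how such bookkeeping could look, not the daemon's actual data structures; the class name, the command tuples, and the variable name used in the example are all hypothetical.

```python
class ObservationTracker:
    """Track client interest and record commands sent to a server resource monitor."""

    def __init__(self):
        self._interest = {}  # resource variable name -> number of registrations
        self.commands = []   # ("start" | "stop", variable) in the order issued

    def register(self, variable):
        """A client registers for events on a resource variable."""
        count = self._interest.get(variable, 0)
        if count == 0:
            # First interested client: ask the monitor to supply values.
            self.commands.append(("start", variable))
        self._interest[variable] = count + 1

    def unregister(self, variable):
        """A client withdraws its registration for a resource variable."""
        count = self._interest.get(variable, 0)
        if count == 0:
            return  # nothing registered; ignore
        if count == 1:
            # Last interested client is gone: ask the monitor to stop.
            del self._interest[variable]
            self.commands.append(("stop", variable))
        else:
            self._interest[variable] = count - 1


if __name__ == "__main__":
    tracker = ObservationTracker()
    tracker.register("Example.Var")    # hypothetical variable name
    tracker.register("Example.Var")
    tracker.unregister("Example.Var")
    tracker.unregister("Example.Var")
    print(tracker.commands)
```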
The Event Manager daemon either observes a resource variable located in shared memory every X seconds, where X is the observation interval that is specified in the resource variable's resource class definition, or when the resource variable's value is sent to the Event Manager daemon by the resource monitor (transparently, via the RMAPI). The values of resource variables of value type Counter and Quantity are located in shared memory. The values of resource variables of value type State are not.
All resource variables that are located in shared memory with the same observation interval are observed on the same time boundary. This minimizes the observation overhead, no matter when the request for a resource variable is made.
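A common way to get this kind of alignment is to schedule each observation on the next multiple of the interval rather than at a fixed offset from the request time. The sketch below assumes that boundaries are whole multiples of the observation interval, counted in seconds from some fixed origin; the text does not say how the boundary is actually computed.

```python
def next_observation_time(now, interval):
    """Return the first interval boundary strictly after `now`.

    Both arguments are whole seconds. Boundaries are assumed to be
    multiples of `interval`, which is an illustrative assumption.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return (now // interval + 1) * interval


if __name__ == "__main__":
    # Two requests made at different times for variables with the same
    # 30-second interval land on the same boundary.
    print(next_observation_time(103, 30), next_observation_time(117, 30))
```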
The Event Management subsystem is an SP trusted service. Event Management enforces the security policies as defined by the trusted services authentication methods configured in the domain (the SP system partition) in which the Event Management subsystem is running. These authentication methods can be any combination of DCE, compatibility, and none.
Table 18 defines the behavior of an EM client and the EM daemon to
which it connects for each combination of authentication methods. Since
an EM client may not be on the same node as the EM daemon to which it
connects, the table includes entries for each possible combination of
authentication methods configured on the node where the EM client and the EM
daemon are running. Each table column represents a combination of
authentication methods for the EM daemon and each table row represents a
combination of authentication methods for the EM client.
EM Client | EM Daemon: DCE | EM Daemon: DCE and Compat | EM Daemon: Compat | EM Daemon: None | EM Daemon: No support |
---|---|---|---|---|---|
DCE | MutualAuth | MutualAuth | Error | Error | Error |
DCE and Compat | ClientAuth | OK | OK | OK | OK |
Compat | Error | OK | OK | OK | OK |
None | Error | OK | OK | OK | OK |
No support | Error | OK | OK | OK | Not applicable |
The row and column headings have the following meanings:
The table cell labels have the following meanings:
Whenever an EM client is authenticated by the EM daemon, the DCE principal that is executing the client must also be defined in the Event Management DCE group named haem-users. The access control policy of the Event Management subsystem is that an authenticated EM client must be a member of the DCE group haem-users in order to access the Event Management subsystem and the resources which it monitors. When communication between an unauthenticated EM client and an EM daemon is permitted, no additional authorization is required.
All authentication and authorization logic is implemented in the EMAPI library and the EM daemon. This logic is executed whenever an EM client application attempts to start a session with the Event Management subsystem. Errors in authentication, authorization, or an invalid combination of authentication methods between the EM client and the EM daemon, as indicated by the label Error in Table 18, result in a failure of the request to start the EM session.
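For quick reference, Table 18 can be encoded as a lookup. The cell labels are copied from the table; the function names are hypothetical, and the sketch only reports the label and whether it is Error. It does not model the authentication itself.

```python
# Column order for the EM daemon, as in Table 18.
METHODS = ["DCE", "DCE and Compat", "Compat", "None", "No support"]

# Rows are keyed by the EM client's authentication methods.
TABLE_18 = {
    "DCE":            ["MutualAuth", "MutualAuth", "Error", "Error", "Error"],
    "DCE and Compat": ["ClientAuth", "OK", "OK", "OK", "OK"],
    "Compat":         ["Error", "OK", "OK", "OK", "OK"],
    "None":           ["Error", "OK", "OK", "OK", "OK"],
    "No support":     ["Error", "OK", "OK", "OK", "Not applicable"],
}


def session_outcome(client_methods, daemon_methods):
    """Return the Table 18 label for a client/daemon combination."""
    return TABLE_18[client_methods][METHODS.index(daemon_methods)]


def start_session_fails(client_methods, daemon_methods):
    """True when the combination is labeled Error in Table 18."""
    return session_outcome(client_methods, daemon_methods) == "Error"


if __name__ == "__main__":
    print(session_outcome("DCE and Compat", "DCE"))
    print(start_session_fails("Compat", "DCE"))
```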
Each EM daemon has a security state that matches the SP trusted services authentication methods configured on the node where the daemon is running. The states are the following:
The EM peer group also has a security state, as maintained in the peer group state value as one of the following three keywords:
The security state of each EM daemon in the peer group must match that of
the peer group as defined in Table 19.
Table 19. Daemon and peer group security states
EM daemon security state | EM peer group security state |
---|---|
DCE | SEC |
DCE and Compat | NOSEC (or null) |
Compat | NOSEC (or null) |
None | NOSEC (or null) |
No support | NOSECSUPPORT (or null) |
If the peer group contains versions of EM daemons from earlier releases, then the peer group state value might not contain any of the keywords SEC, NOSEC, or NOSECSUPPORT. The security state value might be null. The peer group state value can be observed by displaying the status of the EM daemon.
When the EM daemon starts, it determines if the node has been installed with PSSP 3.2. If not, the security state of the daemon is No support. Otherwise, the daemon obtains the currently configured SP trusted services authentication methods and sets the respective security state. If the daemon cannot obtain the authentication methods, it sets the security state to None. When the daemon then joins the peer group, if the peer group currently has no state set, it proposes a peer group security state to match. Upon completion of the join, the daemon checks if the security states match. If they do not match as in Table 19, the daemon logs an error and exits.
The result of this procedure is that the first daemon that joins the peer group sets the security state of the group, as defined on the node. If multiple daemons join the group at the same time, and the group does not currently exist, then Group Services arbitrarily picks one of the proposed group states. This can result in one or more daemons exiting with an error if their proposed state is not the one picked. This can only occur if the nodes where the daemons are executing do not all have compatible security configurations, as defined in Table 19.
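The matching rule in Table 19 and the effect of a simultaneous join can be sketched as below. The mapping is taken from Table 19, with `None` standing for a null group state; the function names and the idea of selecting the winning proposal by index are illustrative assumptions, since Group Services picks arbitrarily.

```python
# Peer group states that each daemon security state accepts (Table 19).
ALLOWED_GROUP_STATES = {
    "DCE": {"SEC"},
    "DCE and Compat": {"NOSEC", None},
    "Compat": {"NOSEC", None},
    "None": {"NOSEC", None},
    "No support": {"NOSECSUPPORT", None},
}

# Keyword a daemon proposes when the group has no state yet.
PROPOSAL = {
    "DCE": "SEC",
    "DCE and Compat": "NOSEC",
    "Compat": "NOSEC",
    "None": "NOSEC",
    "No support": "NOSECSUPPORT",
}


def states_match(daemon_state, group_state):
    """True when the daemon may stay in a group with this state."""
    return group_state in ALLOWED_GROUP_STATES[daemon_state]


def simultaneous_join(daemon_states, chosen_index):
    """Model several daemons forming a new group at the same time.

    chosen_index selects whose proposal wins. Returns the resulting group
    state and the daemon states that remain; the others would log an error
    and exit.
    """
    group_state = PROPOSAL[daemon_states[chosen_index]]
    remaining = [s for s in daemon_states if states_match(s, group_state)]
    return group_state, remaining


if __name__ == "__main__":
    print(simultaneous_join(["DCE", "Compat", "None"], 1))
```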
As indicated by the information in Table 18 and Table 19, the Event Management subsystem supports a mixture of authentication methods within a domain, including nodes where an EM client is executing outside of the domain. This permits the EM subsystem to support migration to PSSP 3.2 one node at a time.
After migration is complete, or any time the SP trusted services
authentication methods are changed on the nodes of a domain, the Event
Management subsystem needs to be refreshed. The refresh usually happens
automatically by some task during migration and configuration. For
instance, the haemctrl -r command is invoked automatically by the
chauthpts command. If you ever need to, you can use the
haemctrl -r command directly to refresh the subsystem. The
daemon that receives the command to refresh obtains a new security state by
obtaining the current authentication methods. It then proposes a peer
group state change that matches this new state, as defined in Table 19. When each daemon receives the proposal, it also
obtains the current authentication methods on the node. If the latest
methods match the proposed state, then the daemon votes ACCEPT, else it votes
REJECT. If the protocol is approved by all daemons then each daemon
uses the security state just obtained. After the security state of the
daemon is updated, the daemon checks all client connections to see if they
have an appropriate security state according to Table 20. If not, the connections are closed. When the
client detects that its connection has terminated, it can do another start
session or a restart session. In either case, authentication and
authorization are once again performed according to the policy in Table 18.
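The voting step of the refresh can be sketched as follows. The keyword mapping follows Table 19 and the ACCEPT and REJECT votes come from the text; the function names are hypothetical, and the real exchange runs through Group Services rather than a local function call.

```python
# Peer group keyword for each daemon security state, from Table 19.
GROUP_STATE_FOR = {
    "DCE": "SEC",
    "DCE and Compat": "NOSEC",
    "Compat": "NOSEC",
    "None": "NOSEC",
    "No support": "NOSECSUPPORT",
}


def vote(proposed_state, local_methods):
    """Return a daemon's vote on a proposed peer group state."""
    if GROUP_STATE_FOR[local_methods] == proposed_state:
        return "ACCEPT"
    return "REJECT"


def refresh_approved(proposed_state, methods_per_daemon):
    """The state change is approved only if every daemon votes ACCEPT."""
    return all(vote(proposed_state, m) == "ACCEPT" for m in methods_per_daemon)


if __name__ == "__main__":
    print(refresh_approved("NOSEC", ["Compat", "DCE and Compat", "None"]))
    print(refresh_approved("SEC", ["DCE", "Compat"]))
```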
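Table 20. Validation of EM client connection
Connection Authentication State | New EM Daemon Security Configuration: DCE | New EM Daemon Security Configuration: DCE and Compat | New EM Daemon Security Configuration: Compat | New EM Daemon Security Configuration: None |
---|---|---|---|---|
Auth | Keep | Keep | Keep | Keep |
No Auth | Close | Keep | Keep | Keep |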
Table 20 shows the policy used to validate authorization of the EM client connection:
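As a compact restatement, the Table 20 policy reduces to a single rule: only an unauthenticated connection under a new DCE-only configuration is closed. The sketch below encodes that reading of the table; the function and constant names are hypothetical.

```python
NEW_CONFIGS = ("DCE", "DCE and Compat", "Compat", "None")


def connection_action(authenticated, new_config):
    """Return "Keep" or "Close" for an existing EM client connection.

    authenticated is True for a connection in the Auth state and False for
    No Auth. new_config is the daemon's new security configuration.
    """
    if new_config not in NEW_CONFIGS:
        raise ValueError("unknown security configuration: %r" % (new_config,))
    if not authenticated and new_config == "DCE":
        return "Close"
    return "Keep"


if __name__ == "__main__":
    for config in NEW_CONFIGS:
        print(config, connection_action(True, config), connection_action(False, config))
```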