The Event Management subsystem uses the AIX Error Log as the main repository for errors and informational messages. In addition to logging these messages and errors, the Event Management daemon can log more detailed information if tracing is activated.
Event Management does not use the AIX Error Log in the same way as Topology Services or Group Services. In PSSP 3.2, Event Management uses only two AIX error log templates. The important information in these entries consists of the informational and error messages placed in the Detail Data field. All error messages are numbered and are documented in PSSP: Messages Reference.
There are two types of error messages that Event Management logs in the AIX Error log:
See the Detail Data field for each entry.
The log file is located in the /var/ha/log directory and is named em.default.syspar. It contains any error message from the Event Management daemon that cannot be written to the AIX error log. Normally, all the daemon error messages are written to the AIX error log. This log also contains error messages that result from repetitive operational errors, for example the errors logged when the Event Management daemon cannot connect to Group Services and retries every five seconds, or when it tries to join the ha_em_peers group every 15 seconds.
The size of the em.default.syspar file is examined every two minutes. If the size exceeds 256KB, the file is renamed with a suffix of .last, and a new default file is created. No more than two copies of this file are kept.
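The rotation rule described above can be illustrated with a minimal sketch. This is not the daemon's code; the function name `rotate_if_needed` and the constants are hypothetical, and the sketch assumes that renaming over an existing .last copy is how the two-file limit is kept.

```python
import os

MAX_LOG_BYTES = 256 * 1024      # 256KB threshold from the text
CHECK_INTERVAL_SECONDS = 120    # the text says the size is examined every two minutes


def rotate_if_needed(log_path, max_bytes=MAX_LOG_BYTES):
    """Rename log_path to log_path + '.last' when it exceeds max_bytes.

    Returns True if a rotation happened. An earlier '.last' copy is
    overwritten (an assumption), so at most two files exist afterwards.
    """
    if not os.path.exists(log_path):
        return False
    if os.path.getsize(log_path) <= max_bytes:
        return False
    os.replace(log_path, log_path + ".last")  # overwrites any previous .last file
    open(log_path, "w").close()               # create a new, empty log file
    return True
```

A caller would invoke `rotate_if_needed` once per check interval. The real daemon performs this check itself, so the sketch is only useful for reasoning about which file holds the most recent messages.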
If Event Management cannot start a resource monitor, it also records additional information in the em.default.syspar log file. The error information includes the name of the resource monitor that could not be started.
The Event Management daemon errors are categorized here:
This problem may happen when the system administrator re-creates the Event Management subsystem by using the haemctrl command on the control workstation. It can also happen if the system administrator creates a new EMCDB file by using the haemcfg command. Every time a new EMCDB is created (by using the haemcfg command), its version number is stored in the Syspar SDR class. Daemons use that version number and file only if the state value of the ha_em_peers group in Group Services contains the same value.
The file is transferred from the control workstation to the nodes using a remote copy command, and it is stored in /etc/ha/cfg. If the file does not exist, an error message is logged in /var/ha/log/em.default.syspar and in the AIX error log.
Problems with resource monitors (RMs) are usually communication problems. One way to verify that the resource monitors are connected to and communicating with the Event Management daemon is to issue the command:
lssrc -ls haem.syspar
and check the Resource Monitor section. The output is similar to:
Resource Monitor Information
Name                 Inst Type FD  SHMID PID    Locked
IBM.PSSP.CSSLogMon   0    C    -1  -1    -2     00/00 No
IBM.PSSP.SDR         0    C    -1  -1    -2     00/00 No
IBM.PSSP.harmld      0    S    20  11    28954  01/01 No
IBM.PSSP.harmpd      0    S    19  -1    28684  01/01 No
IBM.PSSP.hmrmd       0    S    21  -1    21766  01/01 No
IBM.PSSP.pmanrmd     0    C    14  -1    -2     00/00 No
Membership           0    I    -1  -1    -2     00/00 No
Response             0    I    -1  -1    -2     00/00 No
aixos                0    S    12  10    -2     00/01 No
The connection type specifies how the resource monitor connects to Event Management:
The last two columns of the output, under the heading Locked, show the resource monitor's counters and whether it is locked. The Event Management daemon maintains two counters: one for start attempts and one for successful connections. If either counter reaches its limit (the start limit or the connect limit, respectively), the resource monitor is locked.
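When many nodes have to be checked, it can help to read the Resource Monitor section programmatically. The following is a minimal sketch that parses text in the layout shown in the sample output. The names `ResourceMonitor` and `parse_resource_monitors` are hypothetical, the counter pair is kept as an uninterpreted string, and treating "Yes" as the locked value is an assumption, since the sample output shows only "No".

```python
from typing import List, NamedTuple


class ResourceMonitor(NamedTuple):
    name: str
    inst: int
    conn_type: str   # the Type column, for example C, S, or I
    fd: int
    shmid: int
    pid: int
    counters: str    # the nn/nn pair, kept as printed
    locked: bool


def parse_resource_monitors(text: str) -> List[ResourceMonitor]:
    """Return one entry per data row; title and header lines are skipped."""
    monitors = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:          # data rows have eight whitespace-separated values
            continue
        name, inst, conn_type, fd, shmid, pid, counters, locked = fields
        try:
            monitors.append(ResourceMonitor(
                name, int(inst), conn_type, int(fd), int(shmid), int(pid),
                counters, locked.lower() == "yes"))  # "yes" is an assumed value
        except ValueError:
            continue                  # not a data row after all
    return monitors


SAMPLE = """\
Resource Monitor Information
Name Inst Type FD SHMID PID Locked
IBM.PSSP.harmld 0 S 20 11 28954 01/01 No
Membership 0 I -1 -1 -2 00/00 No
"""

for rm in parse_resource_monitors(SAMPLE):
    state = "locked" if rm.locked else "not locked"
    print(rm.name, rm.conn_type, rm.counters, state)
```

In practice the input text would come from the output of lssrc -ls haem.syspar, trimmed to the Resource Monitor section.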
The counters are cleared two hours after the first start or connect. For starts, the limit is three. For connects, the limit is the number of instances configured for the resource monitor (rmNum_instances in the EM_Resource_Monitor class) multiplied by three. For all resource monitors shipped with PSSP, rmNum_instances is one.
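The limits and the two-hour window can be worked through with a small model. This is a sketch for reasoning about the rules, not the daemon's implementation: the class `LockTracker` and its methods are hypothetical, and it assumes that a locked resource monitor stays locked until it is explicitly unlocked, even after the window expires.

```python
START_LIMIT = 3                    # start limit from the text
WINDOW_SECONDS = 2 * 60 * 60       # counters are cleared after two hours


def connect_limit(num_instances: int) -> int:
    """Connect limit: configured instances (rmNum_instances) times three."""
    return num_instances * 3


class LockTracker:
    def __init__(self, num_instances: int = 1):
        self.connect_limit = connect_limit(num_instances)
        self.starts = 0
        self.connects = 0
        self.window_start = None   # time of the first start or connect
        self.locked = False

    def _begin_or_clear_window(self, now: float) -> None:
        # Clear both counters once two hours have passed since the first event.
        if self.window_start is not None and now - self.window_start >= WINDOW_SECONDS:
            self.starts = self.connects = 0
            self.window_start = None
        if self.window_start is None:
            self.window_start = now

    def record_start(self, now: float) -> None:
        self._begin_or_clear_window(now)
        self.starts += 1
        if self.starts >= START_LIMIT:
            self.locked = True

    def record_connect(self, now: float) -> None:
        self._begin_or_clear_window(now)
        self.connects += 1
        if self.connects >= self.connect_limit:
            self.locked = True

    def unlock(self) -> None:
        # Models the documented reset: counters to zero and a fresh window.
        self.starts = self.connects = 0
        self.window_start = None
        self.locked = False
```

With rmNum_instances equal to one, as for the resource monitors shipped with PSSP, both limits are three, so three starts or three connects inside one two-hour window lock the resource monitor in this model.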
Once the Event Management daemon has successfully connected to a resource monitor of type server, the daemon attempts to reconnect to the resource monitor if the resource monitor terminates. Reconnection is attempted at the rate of one attempt per minute. However, reconnection attempts are limited under the following circumstances:
A resource monitor is locked because, if it cannot be started and remain running, or if successful connections are frequently lost, a problem exists with the resource monitor. Once you isolate and correct the problem, unlock and start the resource monitor by issuing the haemunlkrm command. This command resets the start and connect counters to zero and also resets the two-hour window.