Diagnosis Guide

Error information

The Event Management subsystem uses the AIX Error Log as the main repository for errors and informational messages. In addition to logging these messages, the Event Management daemon can log more detailed information when tracing is activated.

AIX Error Log for Event Management

Event Management does not use the AIX Error Log in the same way as Topology Services or Group Services. In PSSP 3.2, Event Management uses only two AIX error log templates. The important information in these entries consists of the informational and error messages placed in the Detail Data field. All error messages have numbers and are documented in PSSP: Messages Reference.

Event Management logs two types of entries in the AIX Error Log:

HA001_TR - for informational messages
HA002_ER - for error messages

See the Detail Data field for each entry.

Error log files

The log file is located in the /var/ha/log directory. The file is named em.default.syspar. It contains any error message from the Event Management daemon that cannot be written to the AIX error log. Normally, all the daemon error messages are written to the AIX error log. This log also contains error messages that result from repetitive operational errors. Examples include the errors logged when the Event Management daemon cannot connect to Group Services and retries every five seconds, and the errors logged when it tries to join the ha_em_peers group every 15 seconds.

The size of the em.default.syspar file is examined every two minutes. If the size exceeds 256KB, the file is renamed with a suffix of .last, and a new default file is created. No more than two copies of this file are kept.
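The rotation rule described above can be modeled with a short Python sketch. This is only an illustration of the stated behavior (256KB threshold, a single .last copy), not the daemon's actual code; the function name rotate_if_needed and the constant MAX_LOG_BYTES are hypothetical, and the two-minute polling interval is left to the caller.

```python
import os

# 256KB threshold, as stated in the text above.
MAX_LOG_BYTES = 256 * 1024


def rotate_if_needed(log_path, max_bytes=MAX_LOG_BYTES):
    """Rename log_path to log_path + '.last' when it exceeds max_bytes.

    Returns True if a rotation happened, False otherwise.
    Hypothetical helper: a model of the described rule only.
    """
    if not os.path.exists(log_path):
        return False
    if os.path.getsize(log_path) <= max_bytes:
        return False
    # os.replace overwrites any older .last copy, so no more than
    # two copies of the file (current and .last) ever exist.
    os.replace(log_path, log_path + ".last")
    # Create the new, empty default file.
    open(log_path, "w").close()
    return True
```

A caller that wanted to mimic the described schedule would invoke rotate_if_needed("/var/ha/log/em.default.syspar") once every two minutes.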

If Event Management cannot start a resource monitor, it also records additional information in the em.default.syspar log file. The error information includes the name of the resource monitor that could not be started.

Event Management daemon errors

The Event Management daemon errors are categorized here:

Initialization errors

Not running as root.
Cannot get group attribute for the haemrm group.
Cannot set UID.
Malloc failed.

Event Management Configuration Database (EMCDB) operations

Cannot open, read, write or checksum the EMCDB file.

Signals

Cannot ignore or set SIGPIPE, SIGALRM, SIGCHLD
sigthreadmask() failed.

Sockets

Cannot open, read, or write sockets (UDP and TCP/IP)

Register and unregister Events (EMAPI)

Missing information, for example: node_number
Syntax error, for example, an error in an instance vector or expression

Environment setting

Cannot determine environment (SP or HACMP node)
Incorrect or missing node_number

Group Services

Connection to Group Services failed.
Cannot join peers group.
Error reading group state value.

Reliable Messages Library (PRM)

Cannot initialize PRM services.
Cannot set PRM limits.
Cannot send or receive messages.

AIX SRC subsystem

Event Management daemon not started by the AIX SRC

Event Management daemon

Cannot create or access runtime directory.
Cannot create lock file in runtime directory.

Resource Monitor operations

Errors communicating with resource monitors.

SP System Security Services

Cannot load security library
Error from security routine (not equal to SPSEC_SUCCESS)

System Performance Measurement Interface (SPMI)

Cannot get statistics from SPMI.

EMCDB problems

EMCDB Version is incorrect

This problem may happen when the system administrator re-creates the Event Management subsystem by using the haemctrl command on the control workstation. It can also happen if the system administrator creates a new EMCDB file by using the haemcfg command. Every time a new EMCDB is created (by using the haemcfg command), its version number is stored in the Syspar SDR class. Daemons use that version number and file only if the state value of the ha_em_peers group in Group Services contains the same value.

The file is transferred from the control workstation to the nodes using a remote copy command, and it is stored in /etc/ha/cfg. If the file does not exist, an error message is logged in /var/ha/log/em.default.syspar and in the AIX error log.

Resource Monitor problems

Problems with resource monitors are usually communication problems. One way of verifying that the RMs are connected to and communicating with the Event Management daemon is to issue the command:

lssrc -ls haem.syspar

and check the Resource Monitor section. The output is similar to:

Resource Monitor Information
        Name          Inst     Type      FD    SHMID    PID   Locked
IBM.PSSP.CSSLogMon       0        C       -1     -1      -2  00/00  No
IBM.PSSP.SDR             0        C       -1     -1      -2  00/00  No
IBM.PSSP.harmld          0        S       20     11   28954  01/01  No
IBM.PSSP.harmpd          0        S       19     -1   28684  01/01  No
IBM.PSSP.hmrmd           0        S       21     -1   21766  01/01  No
IBM.PSSP.pmanrmd         0        C       14     -1      -2  00/00  No
Membership               0        I       -1     -1      -2  00/00  No
Response                 0        I       -1     -1      -2  00/00  No
aixos                    0        S       12     10      -2  00/01  No

The connection type specifies how the resource monitor connects to Event Management:

Type server (S) corresponds to external daemons, and their PID is in the PID column.
Type client (C) corresponds to resource monitors that are usually scripts or commands that run and send updates about resource variables to the Event Management daemon.
Type internal (I) corresponds to resource monitors internal to the Event Management daemon.
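When many resource monitors are configured, it can help to parse this section of the output instead of reading it by eye. The following Python sketch assumes that its input is only the Resource Monitor section, laid out as in the sample above (eight whitespace-separated fields per data row). The names ResourceMonitor and parse_rm_section are hypothetical, and treating any Locked value other than "No" as locked is an assumption, because the sample shows only "No".

```python
from typing import List, NamedTuple


class ResourceMonitor(NamedTuple):
    name: str
    inst: int
    conn_type: str   # "S" (server), "C" (client), or "I" (internal)
    fd: int
    shmid: int
    pid: int
    counters: str    # counter pair as printed, for example "01/01"
    locked: bool


def parse_rm_section(text: str) -> List[ResourceMonitor]:
    """Parse the data rows of the Resource Monitor section."""
    monitors = []
    for line in text.splitlines():
        fields = line.split()
        # Title and header lines do not have eight fields with a
        # connection type in the third position, so they are skipped.
        if len(fields) != 8 or fields[2] not in ("S", "C", "I"):
            continue
        name, inst, conn_type, fd, shmid, pid, counters, locked = fields
        monitors.append(ResourceMonitor(
            name, int(inst), conn_type, int(fd), int(shmid), int(pid),
            counters,
            locked != "No",  # assumption: anything but "No" means locked
        ))
    return monitors


SAMPLE = """\
Resource Monitor Information
        Name          Inst     Type      FD    SHMID    PID   Locked
IBM.PSSP.SDR             0        C       -1     -1      -2  00/00  No
IBM.PSSP.harmld          0        S       20     11   28954  01/01  No
Membership               0        I       -1     -1      -2  00/00  No
"""

if __name__ == "__main__":
    for rm in parse_rm_section(SAMPLE):
        print(rm.name, rm.conn_type, rm.pid, rm.locked)
```

Filtering the parsed rows on conn_type == "S" gives the server-type monitors, which are the ones with a meaningful value in the PID column.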

The last two columns of the output, under the Locked heading, show the counters and the lock status for each resource monitor. The Event Management daemon maintains two counters: one for start attempts and one for successful connections. If the start counter reaches the start limit, or the connect counter reaches the connect limit, the RM is locked.

The counters are cleared two hours after the first start or connect. For starts, the limit is three. For connects, the limit is the number of instances configured for the resource monitor (rmNum_instances in the EM_Resource_Monitor class) multiplied by three. For all resource monitors shipped with PSSP, rmNum_instances is one.

Once the Event Management daemon has successfully connected to a resource monitor of type server, the daemon attempts to reconnect to the resource monitor if it should terminate. The reconnection is attempted at the rate of one per minute. However, reconnection attempts are limited under the following circumstances:

If it is necessary that the daemon start the resource monitor before each reconnection attempt, then after three attempts within two hours, the resource monitor is locked, and no further attempts are made.
If the resource monitor cannot be started by the Event Management daemon, then after three unsuccessful reconnections within two hours, the resource monitor is locked, and no further reconnection attempts are made.

The reason for locking the resource monitor is that if it cannot be started and remain running, or successful connections are frequently being lost, a problem exists with the resource monitor. Once you isolate and correct the problem, unlock and start the resource monitor by issuing the haemunlkrm command. This command resets the start and connect counters to zero and also resets the two hour window.
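The locking rules described in this section can be summarized in a small Python model. This is a sketch under stated assumptions, not the daemon's implementation: the class name LockTracker and its methods are hypothetical, a single two-hour window shared by both counters is assumed, and a counter that reaches its limit is taken to lock the resource monitor immediately.

```python
START_LIMIT = 3                    # start limit stated in the text
WINDOW_SECONDS = 2 * 60 * 60       # counters are cleared after two hours


class LockTracker:
    """Hypothetical model of the start/connect counters for one RM."""

    def __init__(self, rm_num_instances=1):
        # Connect limit is rmNum_instances multiplied by three.
        self.connect_limit = rm_num_instances * 3
        self.reset()

    def reset(self):
        """Models what haemunlkrm is described as doing: zero both
        counters, restart the two-hour window, and unlock."""
        self.starts = 0
        self.connects = 0
        self.window_start = None
        self.locked = False

    def _open_window(self, now):
        # Clear the counters two hours after the first start or connect.
        if (self.window_start is not None
                and now - self.window_start >= WINDOW_SECONDS):
            self.starts = 0
            self.connects = 0
            self.window_start = None
        if self.window_start is None:
            self.window_start = now

    def record_start(self, now):
        if self.locked:
            return  # no further attempts are made once locked
        self._open_window(now)
        self.starts += 1
        if self.starts >= START_LIMIT:
            self.locked = True

    def record_connect(self, now):
        if self.locked:
            return
        self._open_window(now)
        self.connects += 1
        if self.connects >= self.connect_limit:
            self.locked = True
```

In this model, times are plain seconds supplied by the caller, which keeps the two-hour window easy to exercise without waiting in real time.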

Note: Locking does not apply to client type resource monitors.
