Diagnosis Guide


Error information

The Event Management subsystem uses the AIX Error Log as the main repository for errors and informational messages. In addition to logging informational messages and errors, the Event Management daemon can log more detailed information if tracing is activated.

AIX Error Log for Event Management

Event Management does not use the AIX Error Log in the same way as Topology Services or Group Services. In PSSP 3.2, Event Management uses only two AIX error log templates. The important information in these entries consists of the informational and error messages placed in the Detail Data field. All error messages are numbered and are documented in PSSP: Messages Reference.

Event Management logs two types of entries in the AIX Error Log, one for each template:

  1. HA001_TR - for informational messages
  2. HA002_ER - for error messages

See the Detail Data field for each entry.
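
To inspect these entries, the AIX errpt command can filter the error log by error label. The following is a minimal sketch, assuming that the two template names above are the error labels that errpt accepts with its -J flag. The function name show_haem_errors is hypothetical and is not part of PSSP.

```shell
# Hypothetical helper: print detailed AIX error log entries for the two
# Event Management templates. Assumes the template names are usable as
# errpt error labels (-J). The -a flag requests the detailed report,
# which includes the Detail Data field.
show_haem_errors() {
    errpt -a -J HA001_TR,HA002_ER
}
```

Running show_haem_errors on a node prints the matching entries, if any exist.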

Error log files

The log file is named em.default.syspar and is located in the /var/ha/log directory. It contains any error message from the Event Management daemon that cannot be written to the AIX error log. Normally, all daemon error messages are written to the AIX error log. This log also contains error messages that result from repetitive operational errors. Examples are the errors logged when the Event Management daemon cannot connect to Group Services and retries every five seconds, and the errors logged when it tries to join the ha_em_peers group every 15 seconds.

The size of the em.default.syspar file is examined every two minutes. If the size exceeds 256KB, the file is renamed with a suffix of .last, and a new default file is created. No more than two copies of this file are kept.
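
The rotation rule described above can be sketched as a small shell function. This is an illustration of the rule only, not the daemon's implementation. The function name rotate_if_large is hypothetical, and the default limit assumes that 256KB means 262144 bytes.

```shell
# Illustrative only: mimics the rotation rule described in the text.
# Usage: rotate_if_large FILE [LIMIT_BYTES]
# LIMIT_BYTES defaults to 262144 (256KB, assuming 1KB = 1024 bytes).
rotate_if_large() {
    file=$1
    limit=${2:-262144}
    [ -f "$file" ] || return 0
    # Arithmetic expansion strips any padding that wc adds to its output.
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -gt "$limit" ]; then
        # Overwrites any earlier .last copy, so at most two files remain.
        mv -f "$file" "$file.last"
        : > "$file"
    fi
}
```

The daemon performs its own check every two minutes; the sketch shows a single check and leaves out the timing.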

If Event Management cannot start a resource monitor, it also records additional information in the em.default.syspar log file. The error information includes the name of the resource monitor that could not be started.

Event Management daemon errors

Event Management daemon errors fall into the following categories:

Initialization errors

Event Management Configuration Database (EMCDB) operations

Signals

Sockets

Register and unregister Events (EMAPI)

Environment setting

Group Services

Reliable Messages Library (PRM)

AIX SRC subsystem

Event Management daemon

Resource Monitor operations

SP System Security Services

System Performance Measurement Interface (SPMI)

EMCDB problems

EMCDB Version is incorrect

This problem may happen when the system administrator re-creates the Event Management subsystem by using the haemctrl command on the control workstation. It can also happen if the system administrator creates a new EMCDB file by using the haemcfg command. Every time a new EMCDB is created (by using the haemcfg command), its version number is stored in the Syspar SDR class. Daemons use that version number and file only if the state value of the ha_em_peers group in Group Services contains the same value.

The file is transferred from the control workstation to the nodes using a remote copy command, and it is stored in /etc/ha/cfg. If the file does not exist, an error message is logged in /var/ha/log/em.default.syspar and in the AIX error log.

Resource Monitor problems

Problems with resource monitors are usually communication problems. One way of verifying that the RMs are connected to and communicating with the Event Management daemon is to issue the command:

lssrc -ls haem.syspar

and check the Resource Monitor section. The output is similar to:

Resource Monitor Information
        Name          Inst     Type      FD    SHMID    PID   Locked
IBM.PSSP.CSSLogMon       0        C       -1     -1      -2  00/00  No
IBM.PSSP.SDR             0        C       -1     -1      -2  00/00  No
IBM.PSSP.harmld          0        S       20     11   28954  01/01  No
IBM.PSSP.harmpd          0        S       19     -1   28684  01/01  No
IBM.PSSP.hmrmd           0        S       21     -1   21766  01/01  No
IBM.PSSP.pmanrmd         0        C       14     -1      -2  00/00  No
Membership               0        I       -1     -1      -2  00/00  No
Response                 0        I       -1     -1      -2  00/00  No
aixos                    0        S       12     10      -2  00/01  No
 

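The Type column shows the connection type, which specifies how the resource monitor connects to Event Management. The server (S) and client (C) types are discussed in the rest of this section.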

The last two columns of the output, under the Locked heading, show the counters that the Event Management daemon maintains for each resource monitor and whether the resource monitor is locked. The daemon maintains two counters: one for start attempts and one for successful connections. If either counter reaches its limit (the start limit or the connect limit, respectively), the RM is locked.
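
When many resource monitors are listed, the Locked columns can be extracted with awk. The following is a minimal sketch with hypothetical function names. It assumes that each data row has eight whitespace-separated fields, as in the sample output, and that a locked resource monitor shows some value other than No in the last column (the sample shows only No).

```shell
# Hypothetical helper: read `lssrc -ls haem.syspar` output on stdin and
# print each resource monitor name with its counters and lock state.
# Assumes data rows have eight whitespace-separated fields, as in the
# sample output, with the two Locked columns last.
rm_lock_report() {
    awk 'NF == 8 && $2 ~ /^[0-9]+$/ { print $1, $7, $8 }'
}

# Print only the monitors whose lock state is not "No". The value shown
# for a locked monitor is an assumption; the sample shows only "No".
rm_locked() {
    rm_lock_report | awk '$3 != "No" { print $1 }'
}
```

For example, lssrc -ls haem.syspar | rm_locked lists candidate locked resource monitors. Other sections of the lssrc output could contain rows with the same shape, so the result is worth checking against the full output.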

The counters are cleared two hours after the first start or connect. For starts, the limit is three. For connects, the limit is the number of instances configured for the resource monitor (rmNum_instances in the EM_Resource_Monitor class) multiplied by three. For all resource monitors shipped with PSSP, rmNum_instances is one.
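
The limits and the two-hour window can be expressed as simple arithmetic. The following sketch only restates the rules above; the variable and function names are hypothetical, and the daemon's actual bookkeeping may differ.

```shell
WINDOW_SECONDS=7200   # two hours
START_LIMIT=3         # limit for start attempts

# connect_limit NUM_INSTANCES
# Connect limit for a resource monitor: rmNum_instances multiplied by three.
connect_limit() {
    echo $(( $1 * 3 ))
}

# effective_count COUNT FIRST_EPOCH NOW_EPOCH
# A counter is cleared once two hours have passed since the first
# start or connect; otherwise it keeps its value.
effective_count() {
    if [ $(( $3 - $2 )) -ge "$WINDOW_SECONDS" ]; then
        echo 0
    else
        echo "$1"
    fi
}

# would_lock COUNT LIMIT
# Succeeds when the counter has reached its limit.
would_lock() {
    [ "$1" -ge "$2" ]
}
```

With rmNum_instances equal to one, connect_limit 1 yields a connect limit of three, the same value as the start limit.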

Once the Event Management daemon has successfully connected to a resource monitor of type server, the daemon attempts to reconnect to the resource monitor if the resource monitor terminates. Reconnection is attempted once per minute. However, reconnection attempts are limited under the following circumstances:

  1. If the daemon must start the resource monitor before each reconnection attempt, the resource monitor is locked after three attempts within two hours, and no further attempts are made.
  2. If the resource monitor cannot be started by the Event Management daemon, the resource monitor is locked after three unsuccessful reconnections within two hours, and no further reconnection attempts are made.

A resource monitor is locked because, if it cannot be started and remain running, or if successful connections are frequently lost, a problem exists with the resource monitor. Once you isolate and correct the problem, unlock and start the resource monitor by issuing the haemunlkrm command. This command resets the start and connect counters to zero and also resets the two-hour window.

Note:
Locking does not apply to client type resource monitors.
