Diagnosis Guide

Trace information

ATTENTION - READ THIS FIRST
Do not activate this trace facility until you have read this section completely, and understand this material. If you are not certain how to properly use this facility, or if you are not under the guidance of IBM Service, do not activate this facility. Activating this facility may result in degraded performance of your system. Activating this facility may also result in longer response times, higher processor loads, and the consumption of system disk resources. Activating this facility may also obscure or modify the symptoms of timing-related problems.

Consult these logs for debugging purposes. They all refer to a particular instance of the Topology Services daemon running on the local node.

Topology Services service log

This log contains trace information about the activities performed by the daemon. When a problem occurs, logs from multiple nodes will often be needed. These log files must be collected before they wrap or are removed.

The trace is located in:

/var/ha/log/hats.DD.hhmmss.partition_name for PSSP nodes.
/var/ha/log/topsvcs.DD.hhmmss.cluster_name for HACMP nodes.

where DD is the day of the month when the daemon was started, and hhmmss is the time when the daemon was started.
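The naming convention above can be decoded mechanically when sorting logs collected from several nodes. The following is a minimal sketch, assuming the file name has exactly the form prefix.DD.hhmmss.name; the function name and the sample file names are hypothetical. User log names, which carry an extra language suffix, also match, and the suffix then ends up in the name field.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: <hats|topsvcs>.<DD>.<hhmmss>.<partition or cluster name>
_LOG_NAME = re.compile(r"^(hats|topsvcs)\.(\d{2})\.(\d{6})\.(.+)$")


class ServiceLogName(NamedTuple):
    subsystem: str   # "hats" (PSSP) or "topsvcs" (HACMP)
    day: int         # day of the month when the daemon was started
    start_time: str  # daemon start time, formatted as hh:mm:ss
    name: str        # partition name (PSSP) or cluster name (HACMP)


def parse_service_log_name(filename: str) -> Optional[ServiceLogName]:
    """Split a service log file name into its parts, or return None if it does not fit."""
    match = _LOG_NAME.match(filename)
    if match is None:
        return None
    prefix, dd, hhmmss, name = match.groups()
    start_time = f"{hhmmss[:2]}:{hhmmss[2:4]}:{hhmmss[4:]}"
    return ServiceLogName(prefix, int(dd), start_time, name)


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(parse_service_log_name("hats.15.093012.sp_prod"))
```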

If obtaining logs from all nodes is not feasible, the following is a list of nodes from which logs should be collected:

The node where the problem was seen
The Group Leader node on each network
    The Group Leader is the node with the highest IP address on a network.
The Downstream Neighbor on each network
    This is the node whose IP address is immediately lower than the address of the node where the problem was seen. The node with the lowest IP address has the node with the highest IP address as its Downstream Neighbor.
The control workstation
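The Group Leader and Downstream Neighbor rules above can be sketched as follows. This is an illustration only: it assumes you already have the list of adapter addresses that belong to one network, and the function names and sample addresses are hypothetical. Addresses are compared numerically, because a plain string comparison would rank 192.0.2.9 above 192.0.2.125.

```python
import ipaddress


def group_leader(addresses):
    """Return the highest IP address on the network."""
    return str(max(ipaddress.ip_address(a) for a in addresses))


def downstream_neighbor(addresses, node):
    """Return the address immediately lower than `node`.

    The lowest address wraps around to the highest address.
    """
    ordered = sorted(ipaddress.ip_address(a) for a in addresses)
    index = ordered.index(ipaddress.ip_address(node))
    # For the lowest address, index - 1 is -1, which selects the highest address.
    return str(ordered[index - 1])


if __name__ == "__main__":
    network = ["192.0.2.10", "192.0.2.9", "192.0.2.125"]  # hypothetical addresses
    print("Group Leader:", group_leader(network))
    print("Downstream Neighbor of 192.0.2.10:", downstream_neighbor(network, "192.0.2.10"))
```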

Service Log long tracing

The most detailed level of tracing is Service log long tracing. It is started with the command:

traceson -l -s subsystem_name

where subsystem_name is:

hats on PSSP nodes
hats.partition_name on the PSSP control workstation
topsvcs on HACMP nodes

The long trace is stopped with the command tracesoff -s subsystem_name, which puts short (normal) tracing back into effect. When the log file reaches the maximum number of lines, the current log is saved in a file with a suffix of .bak, and the original file is truncated. When the daemon is restarted, a new log file is created. Only the last five log files are kept.

With service log long tracing, trace records are generated under the following conditions:

Each message sent or received
Each adapter that is disabled or re-enabled
Details of protocols being run
Details of node reachability information
Refresh
Client requests and notifications
Groups formed, elements added and removed

Data in the Service log is in English. Each Service log entry has this format:

date     daemon name     message

Adapters are identified by a pair:

(IP address:incarnation number)

Groups are identified by a pair:

(IP address of Group Leader:incarnation number of group)

Activate long tracing only at the request of IBM Service. Activate it while the error condition is still present, and only for about one minute, to avoid overwriting other data in the log file.
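A one-minute long-trace window can be scripted so that tracing is always switched back off, even if the wait is interrupted. The following is a minimal sketch, assuming the traceson and tracesoff commands are on the PATH of the node where it runs; the function names, the 60-second default, and the injectable runner and sleep parameters are illustrative choices, not part of the product. The cautions at the start of this section still apply.

```python
import subprocess
import time


def long_trace_commands(subsystem_name):
    """Build the start and stop command lines described in this section."""
    start = ["traceson", "-l", "-s", subsystem_name]
    stop = ["tracesoff", "-s", subsystem_name]
    return start, stop


def run_long_trace(subsystem_name, seconds=60, runner=subprocess.run, sleep=time.sleep):
    """Turn long tracing on, wait, then turn it off again.

    `runner` and `sleep` are parameters only so that the sequence can be
    exercised without touching a real subsystem.
    """
    start, stop = long_trace_commands(subsystem_name)
    runner(start, check=True)
    try:
        sleep(seconds)
    finally:
        # Always return to short tracing, even if the wait is interrupted.
        runner(stop, check=True)
```

For example, run_long_trace("hats") would be used on a PSSP node and run_long_trace("topsvcs") on an HACMP node.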

Service Log normal tracing

Service log normal tracing is the default, and is always running. There is negligible impact if no node or adapter events occur on the system. An adapter death event may produce approximately 50 lines of log information on the Group Leader and "mayor" nodes, or up to 250 lines on those nodes in systems of approximately 400 nodes. All other nodes produce fewer than 20 lines. Log file sizes can be increased as described in Changing the service log size.

With normal tracing, trace records are generated for these conditions:

Each adapter that is disabled or re-enabled
Some protocol messages sent or received
Refresh
Client requests and notifications
Groups formed, members added and removed

No entries are created when no adapter or node events are happening on the system.

With normal tracing, the log trimming rate depends heavily on the frequency of adapter or node events on the system. The location of the log file and format of the information is the same as that of the long tracing described previously.

If the Service log file, using normal tracing, keeps growing even when no events appear to be happening on the system, this may indicate a problem. Search for possible entries in the AIX error log or in the User log. See Topology Services user log.

Changing the service log size

The long trace generates approximately 10KB of data per minute of trace activity. By default, log files have a maximum of 5000 lines, which will be filled in 30 minutes or less if long tracing is requested. To change the log file size:

For PSSP, issue this command on the control workstation:
```
hatstune -l new_max_lines -r
```
The full path name of this command is: /usr/sbin/rsct/bin/hatstune.
For example, hatstune -l 10000 -r changes the maximum number of lines in a log file to 10000. The -r flag causes the Topology Services subsystem to be refreshed in all the nodes.

For HACMP, use this sequence:

smit hacmp
    Cluster Configuration
      Cluster Topology
        Configure Topology Services and Group Services
          Change / Show Topology and Group Services Configuration
 
      Cluster Topology
          Synchronize Cluster Topology
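To choose a new maximum line count, the figures given at the start of this section (5000 lines filled in 30 minutes or less of long tracing) can be scaled to the trace window you need. The following is a rough sketch: the function name is hypothetical, and because the text says "30 minutes or less", the result is a lower bound rather than a guarantee.

```python
import math

DEFAULT_MAX_LINES = 5000   # default maximum number of lines per log file
DEFAULT_FILL_MINUTES = 30  # time to fill the default file under long tracing (at most)


def lines_needed(trace_minutes):
    """Estimate the line limit needed to hold `trace_minutes` of long tracing."""
    # Multiply before dividing so that whole-number inputs stay exact.
    return math.ceil(trace_minutes * DEFAULT_MAX_LINES / DEFAULT_FILL_MINUTES)


if __name__ == "__main__":
    for minutes in (1, 30, 60):
        print(minutes, "minutes ->", lines_needed(minutes), "lines")
```

Under these assumptions, one hour of long tracing needs about 10000 lines, which matches the hatstune -l 10000 -r example.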

Topology Services user log

The Topology Services user log contains error and informational messages produced by the daemon. This trace is always running. It has negligible impact on the performance of the system, under normal circumstances.

The trace is located in:

/var/ha/log/hats.DD.hhmmss.partition_name.lang for PSSP nodes.
/var/ha/log/topsvcs.DD.hhmmss.cluster_name.lang for HACMP nodes.

where DD is the day of the month when the daemon was started, hhmmss is the time when the daemon was started, and lang is the language used by the daemon.

Data in the user log is in the language in which the daemon runs, which is the node's administrative language. Messages in the user log have a catalog message number, which can be used to obtain a translation of the message in the desired language.

The size of the log file is changed using the same commands that change the size of the service log. Truncation of the log, saving of log files, and other considerations are the same as for the service log.

Each user log entry has this format:

date     daemon name       message

Adapters are identified by a pair:

(IP address:incarnation number)

Groups are identified by a pair:

(IP address of Group Leader:incarnation number of group)

The main source for diagnostics is the AIX error log. Some of the error messages produced in the user log occur under normal circumstances, but if they occur repeatedly they indicate an error. Some error messages give additional detail for an entry in the error log. Therefore, this log file should be examined when an entry is created in the system error log.

hats or topsvcs script log

This is the Topology Services startup script log. It contains configuration data used to build the machines.lst configuration file. This log also contains error messages if the script was unable to produce a valid machines.lst file and start the daemon. The startup script is run at subsystem startup time and at refresh time.

This log refers to a particular instance of the Topology Services script running on the local node. In PSSP, the control workstation is responsible for building the machines.lst file from the adapter information in the SDR. Therefore, it is usually on the control workstation that the startup script encounters problems. In HACMP, the machines.lst file is built on every node.

The size of the file varies from 1KB to 50KB according to the size of the machine. The trace runs whenever the startup script runs. The trace is located in:

/var/ha/log/hats.partition_name.* for PSSP nodes
/var/ha/log/topsvcs.default.* for HACMP nodes

A new instance of the hats startup script log is created each time the script starts. A copy of the script log is made just before the script exits. Only the last seven instances of the log file are kept, and they are named file.1 through file.7. Therefore, the contents of the log must be saved before the subsystem is restarted or refreshed many times.

The file named file.1 is an identical copy of the current startup script log. At each startup, file.1 is renamed to file.2, file.2 is renamed to file.3, and so on, so the previous file.7 is lost.
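The rotation described above can be modeled as two steps: a shift at startup and a copy just before exit. The following sketch only imitates the naming scheme on an ordinary directory so that the effect is easy to see; it is not the startup script's own code, and the function names are hypothetical.

```python
import shutil
from pathlib import Path

KEEP = 7  # only the last seven instances are kept


def numbered(current: Path, n: int) -> Path:
    """Return the path of instance n, for example <log>.1."""
    return current.with_name(f"{current.name}.{n}")


def rotate_at_startup(current: Path) -> None:
    """Shift instance 1 to 2, 2 to 3, and so on; the previous instance 7 is lost."""
    oldest = numbered(current, KEEP)
    if oldest.exists():
        oldest.unlink()
    # Work from the highest number down so that no rename overwrites a file.
    for n in range(KEEP - 1, 0, -1):
        source = numbered(current, n)
        if source.exists():
            source.rename(numbered(current, n + 1))


def copy_at_exit(current: Path) -> None:
    """Copy the current log to instance 1 just before the script exits."""
    shutil.copy2(current, numbered(current, 1))
```

After nine runs in this model, instance 1 holds the ninth run and instance 7 holds the third, which is why older logs must be saved before the subsystem is restarted or refreshed many times.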

Entries in the startup script log are kept both in English and in the node's language (if different). Trace records are created for these conditions:

The machines.lst file is retrieved from the SDR.
The machines.lst file is built using information from the SDR.
An error is encountered that prevents the hats script from making progress.

There is no fixed format for the records of the log. The following information is in the PSSP startup script log:

The date and time when the hats script started running
The arguments passed to the script
A copy of the machines.lst file generated
Topology Services tunable parameters
The date and time when the hats script finished running
The netmasks of the adapters in the configuration
If the script fails to find an address on the control workstation to include in the configuration, the output of the netstat -in and ifconfig commands for each Ethernet adapter in the control workstation is included.
If the script was called for a refresh operation, the output of the refresh command is included in the log file.

The following information is in the HACMP startup script log:

The date when the topsvcs script finished running.
The HACMP version.
A copy of the machines.lst file generated.
A copy of the output of the cllsif command, containing the HACMP adapter configuration.
The contents of the HACMPnim and HACMPtopsvcs ODM classes.
A copy of the output of the clhandle -ac command.

The main source for diagnostics is the AIX error log. The hats script log file should be used when the error log shows that the startup script was unable to complete its tasks and start the daemon.

For a PSSP system, good results are indicated when the message:

Exec /usr/sbin/rsct/bin/hatsd -n node_number

appears towards the beginning of the file.

For an HACMP system, good results are indicated by the absence of fatal error messages.

For a PSSP system, error results are indicated by one or more error messages. For example:

hats: 2523-605 Cannot find the address of the control workstation.

and the absence of the Exec message.

For an HACMP system, error results are indicated by the presence of the message:

topsvcs: 2523-600 Exit with return code: error code

and possibly other error messages.

All error messages have the message catalog number with them.
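The good and error indications above lend themselves to a quick scripted check of a saved startup script log. The following is a minimal sketch: the function names are hypothetical, and the message-number pattern (four digits, a hyphen, three digits) is an assumption based only on the two examples shown in this section.

```python
import re

EXEC_MARKER = "Exec /usr/sbin/rsct/bin/hatsd -n"
HACMP_EXIT_NUMBER = "2523-600"
# Assumed shape of a catalog message number, inferred from 2523-605 and 2523-600.
MESSAGE_NUMBER = re.compile(r"\b\d{4}-\d{3}\b")


def check_pssp_script_log(text):
    """Return (ok, message_numbers) for the text of a PSSP hats script log.

    ok is True when the Exec message is present.
    """
    return EXEC_MARKER in text, MESSAGE_NUMBER.findall(text)


def check_hacmp_script_log(text):
    """Return True when the 2523-600 exit message is absent from the log text."""
    return HACMP_EXIT_NUMBER not in text


if __name__ == "__main__":
    sample = "hats: 2523-605 Cannot find the address of the control workstation.\n"
    print(check_pssp_script_log(sample))
```

A check like this only narrows the search; the AIX error log remains the main source for diagnostics.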

Network Interface Module (NIM) log

This log contains trace information about the activities of the Network Interface Modules (NIMs), which are processes used by the Topology Services daemon to monitor each network interface. These logs need to be collected before they wrap or are removed.

The trace is located in:

/var/ha/log/nim.hats.interface_name.partition_name.00n for PSSP nodes.
/var/ha/log/nim.topsvcs.interface_name.cluster_name.00n for HACMP nodes.

where 00n is a sequence number of 001, 002, or 003. These three logs are always kept. Log file 003 is overwritten by 002, 002 is overwritten by 001, and 001 is overwritten by 003.

Trace records are generated under the following conditions:

A connection with a given adapter is established.
A connection with a given adapter is closed.
A daemon has sent a command to start or stop heartbeating.
A daemon has sent a command to start or stop monitoring heartbeats.
A local adapter goes up or down.
A message is sent or received.
A heartbeat from the remote adapter has been missed.

Data in the NIM log is in English only. The format of each message is:

time-of-day    message

An instance of the NIM log file wraps when the file reaches around 200KB. Normally, it takes around 10 minutes to fill an instance of the log file. Because three instances are kept, the NIM log files need to be saved within 30 minutes of the time the adapter-related problem occurred.
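Because the window for saving NIM logs is short, it can help to copy all instances in one step as soon as a problem is noticed. The following is a minimal sketch, assuming the logs are ordinary files whose names start with nim. in the log directory; the function name and the destination directory are hypothetical.

```python
import shutil
from pathlib import Path


def save_nim_logs(log_dir, dest_dir):
    """Copy every nim.* log instance from log_dir to dest_dir.

    Returns the names of the files copied, in sorted order.
    """
    destination = Path(dest_dir)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(log_dir).glob("nim.*")):
        if path.is_file():
            # copy2 also preserves timestamps, which helps when correlating logs.
            shutil.copy2(path, destination / path.name)
            copied.append(path.name)
    return copied
```

For example, save_nim_logs("/var/ha/log", "/tmp/nim_saved") would copy the NIM logs of every interface on the local node; the destination path here is only an illustration.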
