IBM Books

Diagnosis Guide


Error information

AIX Error Logs and templates

The error log file is stored in /var/adm/ras/errlog by default. One entry is logged for each occurrence of the condition. The condition is logged on every node where the event occurred.

The Error Log file may wrap, since the file has a limited size. Data is stored in a circular fashion. Also, the system is shipped with a crontab file to delete hardware errors more than 90 days old and software errors and operator messages more than 30 days old.

The command:

/usr/lib/errdemon -l 

shows current settings for the error logging daemon.

The command:

/usr/lib/errdemon -s

is used to change the size of the error log file.

Both commands require root authority.

Unless otherwise noted, each entry refers to a particular instance of the Topology Services daemon on the local node. Unless otherwise noted, entries are created on each occurrence of the condition.

Table 58 lists the error log templates used by Topology Services, sorted by Error Label. An Explanation and Details are given for each error.

The Topology Services subsystem creates AIX error log entries for the conditions listed in Table 58.

When you retrieve an error log entry, look for the Detail Data section near the bottom of the entry.

Table 58. AIX Error Log templates for Topology Services

Label and Error ID Type Description
TS_ASSERT_EM

82D819EF

PEND Explanation: Topology Services daemon exited abnormally.

Details: This entry indicates that the Topology Services daemon exited with an assert statement, resulting in a core dump being generated. Standard fields indicate that the Topology Services daemon exited abnormally. Detail Data fields contain the location of the core file. This is an internal error.

Data needed for IBM Service to diagnose the problem is stored in the core file (whose location is given in the error log) and in the Topology Services daemon service log. See Topology Services service log. Since only six instances of the Topology Services daemon service log are kept, it should be copied to a safe place. Also, only three instances of the core file are kept. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_AUTHMETH_ER

C1FDC4E7

PERM Explanation: The Topology Services startup script cannot retrieve active authentication methods using command /usr/sbin/rsct/bin/lsauthpts.

Details: This entry indicates that command /usr/lpp/ssp/bin/lsauthpts, run by the Topology Services startup script on the control workstation, was unable to retrieve the active authentication methods in a system partition. This error occurs when the startup script is running on the control workstation during initial startup or refresh. When this error occurs, all Topology Services daemons in the system partition will terminate their operations and exit. Diagnosing this problem requires collecting data only on the control workstation.

Standard fields indicate that the startup script cannot retrieve active authentication methods in a system partition using command lsauthpts. The problem may be one of the following:

  • The system partition has an incorrect set of active authentication methods.
  • The current system partition cannot be identified.

Detail Data fields contain the return code of command lsauthpts and the location of the startup script log. The error message returned by command lsauthpts can be found in the startup script log. For more information about SP security services, see Diagnosing SP Security Services problems.

TS_CMDFLAG_ER

979E20DB

PERM Explanation: Topology Services cannot be started due to incorrect flags.

Details: This entry indicates that the Topology Services daemon was unable to start because incorrect command line arguments were passed to it. This entry refers to a particular instance of Topology Services on the local node.

Other nodes may have been affected by the same problem. Standard fields indicate that the daemon was unable to start because incorrect flags were passed to it. Detail Data fields show the pathname to the daemon user log, which contains more detail about the problem.

This problem may be one of the following:

  • Topology Services was started manually in an incorrect way.
  • Incompatible versions of the daemon and startup script are being used.
  • The AIX SRC definition for the subsystem was manually set to an incorrect value.

Information about the cause of the problem may not be available once the problem is cleared.

TS_CTIPDUP_ER PERM Explanation: See TS_HAIPDUP_ER.
TS_CTNODEDUP_ER PERM Explanation: See TS_HANODEDUP_ER.
TS_CTLOCAL_ER PERM Explanation: See TS_HALOCAL_ER.
TS_CPU_USE_ER

FD20FB81

PERM Explanation: The Topology Services daemon is using too much CPU. The daemon will exit.

Details: This entry indicates that the Topology Services daemon will exit because it has been using almost 100% of the CPU. Since Topology Services runs at a real-time fixed priority, exiting in this case is necessary. Otherwise, all other applications on the node would be prevented from running. Also, it is likely that the daemon is not working properly if it is using all the CPU. A core dump is created to allow debugging the cause of the problem.

This entry refers to a particular instance of Topology Services running on a node. The standard fields indicate that the Topology Services daemon is exiting because it is using too much of the CPU, and explain some of the possible causes. The detailed fields show the amount of CPU used by the daemon (in milliseconds) and the interval (in milliseconds) over which the CPU usage occurred. Collect the data described in Information to collect before contacting the IBM Support Center and contact the IBM Support Center. In particular, the daemon log file and the most recent core files should be collected.
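The two Detail Data values can be turned into a utilization figure when reviewing the entry. The following is a minimal sketch, not part of Topology Services: the function name is hypothetical, the sample numbers are illustrative, and the threshold the daemon itself applies is not stated in this guide.

```python
def cpu_utilization_percent(cpu_used_ms: int, interval_ms: int) -> float:
    """Return CPU time used over the interval as a percentage.

    Both arguments are in milliseconds, matching the units of the
    Detail Data fields described above.
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return 100.0 * cpu_used_ms / interval_ms


# Illustrative values only, not taken from a real error log entry.
print(cpu_utilization_percent(4900, 5000))  # prints 98.0
```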

TS_CWSADDR_ER

A35F9C3B

PERM Explanation: Topology Services cannot find the control workstation address.

Details: This entry indicates that the Topology Services startup script was unable to choose a suitable Ethernet adapter on the control workstation to add to the machines.lst configuration file. This error occurs when the startup script is running on the control workstation. The failure prevents the Topology Services subsystem from starting or refreshing on the control workstation and on the nodes. Diagnosing the problem requires collecting data only on the control workstation.

Standard fields indicate that the startup script was unable to find the control workstation adapter to insert in the machines.lst file. The problem could be one of the following:

  • There is no Ethernet adapter on the control workstation that is on the same subnet as the en0 adapter on one of the nodes.
  • The netmask of the Ethernet adapters on the control workstation or the nodes is incorrect.
  • The Ethernet adapter on the control workstation which should belong to the (Topology Services) SP Ethernet adapter membership group is not configured correctly.

Detailed information about the problem is stored in the startup script log file /var/ha/log/hats.partition_name on PSSP. Only a limited number (currently seven) of copies of this log file are kept. Messages in this log file are stored both in English and in the node's language.

The following can be found in the startup script's log file:

  • The address and "netmask" for all Ethernet adapters on the control workstation.
  • The address and "netmask" of the en0 adapters in all the nodes.

When examining this file, look for Ethernet adapters that are missing on the control workstation. Also check whether the control workstation adapter that is on the SP Ethernet has a different "netmask" than the en0 adapters on the nodes.
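The subnet comparison described for this entry can be reproduced by hand from the addresses and netmasks in the log file. The sketch below assumes the values have already been copied out of the log. The function names, adapter names, and addresses are hypothetical, and the startup script's real selection logic may differ.

```python
import ipaddress


def same_subnet(addr_a: str, mask_a: str, addr_b: str, mask_b: str) -> bool:
    """True when both adapters share the same netmask and network address."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{mask_a}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{mask_b}").network
    return net_a == net_b


def find_cws_adapters(cws_adapters: dict, node_en0: tuple) -> list:
    """Return the control workstation adapters on the node's en0 subnet.

    cws_adapters maps an interface name to (address, netmask);
    node_en0 is the (address, netmask) pair of a node's en0 adapter.
    """
    node_addr, node_mask = node_en0
    return [name for name, (addr, mask) in cws_adapters.items()
            if same_subnet(addr, mask, node_addr, node_mask)]


# Hypothetical control workstation adapters.
cws = {
    "en0": ("192.168.3.1", "255.255.255.0"),
    "en1": ("10.0.0.1", "255.255.0.0"),
}
print(find_cws_adapters(cws, ("192.168.3.37", "255.255.255.0")))
```

An empty result for every node corresponds to the first cause listed above. An empty result that becomes non-empty once the netmasks are made equal points to the second cause.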

TS_DCECRED_ER

81132988

PERM Explanation: Topology Services cannot obtain credentials to update the SDR.

Details: This entry indicates that the Topology Services startup script was unable to obtain the DCE credentials it needs to write data into the SDR. The hats script failed to log in as the service principal ssp/spbgroot using command dsrvtgt. This error affects the startup script on the control workstation. When this problem occurs, the daemon will not start on the control workstation. If the startup script is being run as part of a refresh operation, the refresh operation fails (does not take effect in any of the nodes).

Standard fields indicate that the startup script was unable to obtain DCE credentials to write into the SDR, and present possible causes. Detailed fields contain the return code of the /usr/lpp/ssp/bin/dsrvtgt ssp/spbgroot command and the location of the startup script log, which contains more details about the problem.

This error typically indicates problems in DCE or security services. For DCE configuration problems, see the configuration log file /opt/dcelocal/etc/cfgdce.log. For other DCE problems, see log files in the /opt/dcelocal/var/svc directory. For security services problems, see Diagnosing SP Security Services problems. Specifically, the procedures for restoring key files are discussed in Action 5 - Correct key files.

TS_DEATH_TR

A99EB4EA

UNKN Explanation: Lost contact with a neighboring adapter.

Details: This entry indicates that heartbeat messages are no longer being received from the neighboring adapter. This entry refers to a particular instance of the Topology Services daemon on the local node. The source of the problem could be either the local or remote node. Data from the remote node should also be obtained.

Standard fields indicate that a local adapter is no longer receiving packets from the remote adapter. Detail Data fields contain the node number and IP address of the remote adapter. Data about the loss of connectivity may not be available after the problem is cleared.

The local or remote adapter may have malfunctioned. Network connectivity to the remote adapter may have been lost. A remote node may have gone down. The Topology Services daemon on the remote node may have been blocked.

If the problem is with the local adapter, an error log entry of type TS_LOC_DOWN_ST should follow in a few seconds. Information on the remote node should be collected to obtain a better picture of what failure has occurred.

TS_DMS_WARNING_ST

4F35BB80

INFO Explanation: The Dead Man Switch timer is close to triggering.

Details: This entry indicates that the Dead Man Switch has been reset with a small time-to-trigger value left on the timer. This means that the system is in a state where the Dead Man Switch timer is close to triggering. This condition affects the node where the error log entry appears. If steps are not taken to correct the problem, the node may be brought down by the Dead Man Switch timer.

This entry is logged on each occurrence of the condition. Some possible causes are outlined. Detailed fields contain the amount of time remaining in the Dead Man Switch timer and also the interval to which the Dead Man Switch timer is being reset.

Program /usr/sbin/rsct/bin/hatsdmsinfo displays the latest "time-to-trigger" values and the values of "time-to-trigger" that are smaller than a given threshold. Small "time-to-trigger" values indicate that the Dead Man Switch timer is close to triggering.

TS_DUPNETNAME_ER

CE953608

PERM Explanation: Duplicated network name in machines.lst file.

Details: This entry indicates that a duplicate network name was found by the Topology Services daemon while reading the machines.lst configuration file. This entry refers to a particular instance of Topology Services on the local node. Other nodes may be affected by the same problem, since the machines.lst file is the same on all nodes. If this problem occurs at startup time, the daemon exits.

Standard fields indicate that a duplicate network name was found in the machines.lst file. Detail Data fields show the name that was duplicated.

In HACMP/ES, the command /usr/es/sbin/cluster/utilities/cllsif displays all the adapters in the HACMP configuration. Having adapters of different types belonging to the same network is the cause of this problem.

TS_FD_INVAL_ADDR_ST PERM Explanation: An adapter is not configured or has an address outside the cluster configuration.

Details: This entry indicates that a given adapter in the cluster (PSSP or HACMP) configuration is either not configured, or has an address which is outside the cluster configuration. This entry affects the local node, and causes the corresponding adapter to be considered down.

Detailed data fields show the interface name, current address of the interface, and expected boot-time address.

Probable causes for the problem are:

  • There is a mismatch between the cluster adapter configuration and the actual addresses configured on the local adapters.
  • The adapter is not correctly configured.

Save the output of the command netstat -in. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center if the source of the problem cannot be found.

TS_FD_INTFC_NAME_ST PERM Explanation: An interface name is missing from the adapter configuration.

Details: The Topology Services startup script reads information from the cluster repository, containing for each adapter its address, boot-time interface name, and node number. This error entry is created when the interface name information is missing. This usually points to a problem when generating the adapter configuration.

The detailed data fields contain the address in the Topology Services configuration and the interface name which has been "assigned" to the adapter by the Topology Services daemon.

In HACMP, the information is stored in the HACMP Global ODM. Commands /usr/es/sbin/cluster/utilities/cllsif and /usr/es/sbin/cluster/utilities/clhandle retrieve the adapter and node information used by Topology Services in HACMP.

See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

In most cases, this problem will not prevent Topology Services from correctly monitoring the adapter. However, internal problems may occur if a subsequent Topology Services refresh (which in HACMP is done via a Topology DARE) is attempted.

TS_HAIPDUP_ER

1BDC2F53

PERM Explanation: IP address duplication in Topology Services configuration file.

Details: This entry indicates that Topology Services was not able to start or refresh because the same IP address appeared twice in the configuration. This entry refers to a particular instance of Topology Services on the local node, but the problem may affect all the nodes. If this problem occurs at startup time, the daemon exits. To diagnose the problem in PSSP, retrieve data from the control workstation. To diagnose the problem in HACMP, retrieve data from any of the nodes that were affected by the problem.

Standard fields indicate that the same IP address appeared twice in the Topology Services machines.lst configuration file. Detail Data fields show the node number of one of the nodes hosting the duplicated address and the duplicated IP address. Information about the cause of the problem may not be available once the problem is cleared.

In the PSSP realm, the adapter configuration is stored in the Adapter class of the SDR. In the HACMP realm, information is stored in the HACMP Global ODM. Commands /usr/es/sbin/cluster/utilities/clhandle and /usr/es/sbin/cluster/utilities/cllsif retrieve the adapter and node information used by Topology Services in HACMP.

If the problem is caused by an incorrect adapter address specification in PSSP, refer to the "IP Address and Host Name Changes for SP Systems" Appendix in PSSP: Administration Guide for instructions on how to change IP addresses.
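Once the adapter list has been retrieved from the SDR or the Global ODM, a duplicated address can be located with a simple grouping pass. This is a minimal sketch that assumes the data has already been reduced to (node number, IP address) pairs. The function name and sample data are hypothetical, and addresses are compared as plain strings, so differently formatted spellings of the same address are not matched.

```python
from collections import defaultdict


def find_duplicate_addresses(adapters):
    """Map each IP address that appears more than once to its node numbers.

    adapters is an iterable of (node_number, ip_address) pairs.
    """
    nodes_by_addr = defaultdict(list)
    for node_number, address in adapters:
        nodes_by_addr[address].append(node_number)
    return {addr: nodes for addr, nodes in nodes_by_addr.items()
            if len(nodes) > 1}


# Hypothetical adapter list with one address configured on two nodes.
sample = [(1, "192.168.3.1"), (2, "192.168.3.2"), (5, "192.168.3.1")]
print(find_duplicate_addresses(sample))  # prints {'192.168.3.1': [1, 5]}
```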

TS_HALOCAL_ER

F8949418

PERM Explanation: Local node missing in Topology Services configuration file.

Details: Standard fields indicate that the local node was not present in the machines.lst file. This is a problem with the configuration stored in the Registry. In the PSSP realm, information is stored in the Adapter class of the SDR.

In the HACMP realm, information is stored in the HACMP Global ODM. Commands /usr/es/sbin/cluster/utilities/clhandle and /usr/es/sbin/cluster/utilities/cllsif retrieve the adapter and node information used by Topology Services in HACMP. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_HANODEDUP_ER

BCE1B994

PERM Explanation: Node number duplicated in Topology Services configuration file.

Details: This entry indicates that Topology Services was not able to start or refresh because the same node appeared twice on the same network. This entry refers to a particular instance of Topology Services on the local node, but the problem should affect all the nodes. If this problem occurs at startup time, the daemon exits. To diagnose the problem in PSSP, retrieve data from the control workstation. To diagnose the problem in HACMP, retrieve data from any of the nodes that were affected by the problem.

Standard fields indicate that the same node appeared twice in the same network in the Topology Services machines.lst configuration file. Detail Data fields show the interface name of one of the adapters and the node number that appears twice. Information about the cause of the problem may not be available once the problem is cleared.

In the PSSP realm, the adapter configuration is stored in the Adapter class of the SDR. In the HACMP realm, information is stored in the HACMP Global ODM. Commands /usr/es/sbin/cluster/utilities/clhandle and /usr/es/sbin/cluster/utilities/cllsif retrieve the adapter and node information used by Topology Services in HACMP.

If the problem is caused by an incorrect adapter address specification in PSSP, refer to the "IP Address and Host Name Changes for SP Systems" Appendix in PSSP: Administration Guide for instructions on how to change IP addresses.
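The condition for this entry, the same node appearing twice on one network, can be checked against the retrieved adapter data in the same way. The sketch below assumes each adapter has been reduced to a (network name, node number, interface name) triple. That layout, the function name, and the sample values are illustrative assumptions, not the format of machines.lst.

```python
from collections import Counter


def find_duplicate_nodes(adapters):
    """Return the (network_name, node_number) pairs that occur more than once.

    adapters is an iterable of (network_name, node_number, interface_name).
    """
    counts = Counter((network, node) for network, node, _interface in adapters)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical data: node 1 has two adapters on the same network.
sample_adapters = [
    ("SPether", 1, "en0"),
    ("SPether", 2, "en0"),
    ("SPether", 1, "en1"),
    ("SPswitch", 1, "css0"),
]
print(find_duplicate_nodes(sample_adapters))  # prints [('SPether', 1)]
```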

TS_IOCTL_ER

7C090481

PERM Explanation: An ioctl call failed.

Details: This entry indicates that an ioctl() call used by the Topology Services daemon to obtain local adapter information failed. This is a possible AIX-related problem. The Topology Services daemon issued an ioctl() call to obtain information about the network adapters currently installed on the node. If this call fails, there is a potential problem in AIX. The Topology Services daemon exits. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_IPADDR_ER

209F6175

PERM Explanation: Cannot convert IP address in dotted decimal notation to a number.

Details: This entry indicates that an IP address listed in the machines.lst configuration file was incorrectly formatted and could not be converted by the Topology Services daemon. If this problem occurs at startup time, the daemon exits.

Standard fields indicate that the daemon was unable to interpret an IP address listed in the machines.lst file. The Detail Data fields contain the given IP address in dotted decimal notation and the node number where the address was found. The problem may be that the file system where the run directory is located is corrupted, or information in the System Registry (SDR for PSSP, Global ODM for HACMP) is not correct.

The machines.lst file is kept in the daemon "run" directory (/var/ha/run/hats.partition_name for PSSP). The file is overwritten each time the subsystem is restarted. A copy of the file is kept in the startup script's log file, /var/ha/log/hats.partition_name. A number of instances (currently 7) of this log file are kept, but the information is lost if many attempts are made to start the subsystem.
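When reviewing a saved copy of the configuration, the addresses it lists can be screened for formatting problems before the daemon is restarted. This is a minimal sketch using Python's standard ipaddress module. It is not the conversion routine the daemon uses, and the function names and the (node number, address) layout are assumptions.

```python
import ipaddress


def dotted_decimal_to_int(address: str) -> int:
    """Convert an IPv4 address in dotted decimal notation to an integer.

    Raises ValueError when the address is malformed.
    """
    return int(ipaddress.IPv4Address(address))


def find_bad_addresses(entries):
    """Return the (node_number, address) pairs that cannot be converted."""
    bad = []
    for node_number, address in entries:
        try:
            dotted_decimal_to_int(address)
        except ValueError:
            bad.append((node_number, address))
    return bad


print(dotted_decimal_to_int("10.0.0.1"))  # prints 167772161
# Hypothetical entries: one missing an octet, one with an octet above 255.
print(find_bad_addresses([(1, "10.0.0.1"), (2, "10.0.0"), (3, "10.0.0.256")]))
```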

TS_KEYS_ER

A45AC96A

PERM Explanation: Topology Services startup script cannot retrieve key file information using command /usr/sbin/rsct/bin/hats_keys.

Details: This entry indicates that command /usr/sbin/rsct/bin/hats_keys, run by the Topology Services startup script on the control workstation, was unable to retrieve the Topology Services key file information. This error occurs when the startup script is running on the control workstation during initial startup or refresh. When this error occurs, all Topology Services daemons in the system partition will terminate their operations and exit.

Diagnosing this problem requires collecting data only on the control workstation. The pathname of Topology Services key file is /spdata/sys1/keyfiles/rsct/syspar_name/hats, where syspar_name is the name of the SP system partition.

Standard fields indicate that the startup script was unable to retrieve the Topology Services key file information using command hats_keys, and present possible causes. Detail Data fields contain the return code of command hats_keys and the location of the startup script log. The error message returned by command hats_keys is in the startup script log.

This error typically indicates problems in DCE or SP Security Services. For DCE configuration problems, see the configuration log file /opt/dcelocal/etc/cfgdce.log. For other DCE problems, see log files in the /opt/dcelocal/var/svc directory. For SP security services problems, see Diagnosing SP Security Services problems. Specifically, the procedures for restoring key files are discussed in Action 5 - Correct key files.

TS_LATEHB_PE

7AD2CABA

PERF Explanation: Late in sending heartbeat to neighbors.

Details: This entry indicates that the Topology Services daemon was unable to run for a period of time. This entry refers to a particular instance of the Topology Services daemon on the local node. The node that is the Downstream Neighbor may perceive the local adapter as dead and issue a TS_DEATH_TR error log entry.

A node's Downstream Neighbor is the node whose IP address is immediately lower than the address of the node where the problem was seen. The node with the lowest IP address has a Downstream Neighbor of the node with the highest IP address.
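The Downstream Neighbor rule can be worked out from the list of adapter addresses on a network. The sketch below follows the definition given above. The function name and the sample addresses are hypothetical, and addresses are compared numerically rather than as text, so that 192.168.3.10 sorts above 192.168.3.7.

```python
import ipaddress


def downstream_neighbor(local_address: str, addresses) -> str:
    """Return the address immediately lower than local_address.

    The lowest address wraps around to the highest one. Raises
    ValueError if local_address is not among addresses.
    """
    ordered = sorted(ipaddress.IPv4Address(a) for a in addresses)
    index = ordered.index(ipaddress.IPv4Address(local_address))
    # Index -1 selects the highest address when the local one is the lowest.
    return str(ordered[index - 1])


# Hypothetical adapter addresses on one network.
network = ["192.168.3.10", "192.168.3.2", "192.168.3.7"]
print(downstream_neighbor("192.168.3.7", network))  # prints 192.168.3.2
print(downstream_neighbor("192.168.3.2", network))  # prints 192.168.3.10
```

This identifies the node whose error log is most likely to contain the matching TS_DEATH_TR entry.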

Standard fields indicate that the Topology Services daemon was unable to send messages for a period of time. Detail Data fields show how many seconds late the daemon was in sending messages. This entry is created when the amount of time that the daemon was late in sending heartbeats is equal to or greater than the amount of time needed for the remote adapter to consider the local adapter as down.

Data about the reason for the Topology Services daemon being blocked is not usually kept, unless system tracing is being run on the node. The Service log file keeps information about Topology Services events happening on the node at the time the daemon was blocked. See Topology Services service log.

Look for error log entries TS_NODEDOWN_EM in other nodes that refer to this node. If these are present, it means that the daemon blockage needs to be corrected, or the Topology Services tuning parameters have to be changed. Refer to the "Node appears to go down and then up a few/several seconds later" symptom in Error symptoms, responses, and recoveries.

TS_LIBERR_EM

8C2164B7

PEND Explanation: Topology Services client library error.

Details: This entry indicates that the Topology Services library had an error. It refers to a particular instance of the Topology Services library on the local node. This problem will affect the client associated with the library (RSCT Event Manager or more likely RSCT Group Services).

Standard fields indicate that the Topology Services library had an error. Detail Data fields contain the error code returned by the Topology Services API.

Data needed for IBM Service to diagnose the problem is stored in the Topology Services daemon service log, located at /var/ha/log/hats.*.partition_name for PSSP or /var/ha/log/topsvcs.*.cluster_name for HACMP/ES. Since this file may wrap, it should be saved.

The RSCT Group Services daemon (the probable client connected to the library) is likely to have exited with an assert and to have produced an error log entry with template GS_TS_RETCODE_ER. Refer to Diagnosing Group Services problems for a list of the information to save. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_LOC_DOWN_ST

A0B80A40

INFO Explanation: Local adapter down.

Details: This entry indicates that one of the local adapters is down. This entry refers to a particular instance of the Topology Services daemon on the local node. If there are multiple Topology Services daemons running in the node (for example, the PSSP version and the HACMP version), each daemon creates its own error log entry.

Standard fields indicate that a local adapter is down. Detail Data fields show the interface name, adapter offset (index of the network in the machines.lst file), and the adapter address according to Topology Services. This address may differ from the adapter's actual address if the adapter is incorrectly configured. Information about the source of the problem may be lost after the condition is cleared.

Possible problems are:

  • The adapter may have malfunctioned.
  • The adapter may be incorrectly configured. See entry for TS_UNS_SIN_TR.
  • There is no other adapter functioning in the network.
  • Connectivity has been lost in the network.
  • The SP Switch or SP Switch2 adapter may be fenced.
  • A problem in Topology Services' adapter health logic.

Perform these steps:

  1. Verify that the address of the adapter listed in the output of
    ifconfig interface_name

    is the same as the one shown in this error log entry. If they are different, the adapter has been configured with an incorrect address.

  2. If the output of the ifconfig command does not show the UP flag, this means that the adapter has been forced down by the command:
    ifconfig interface_name down

    If the adapter is an SP Switch or SP Switch2 adapter, it may be fenced.

  3. Issue the command netstat -in to verify whether the receive and send counters are being incremented for the given adapter. The counters are the numbers below the Ipkts (receive) and Opkts (send) columns. If both counters are increasing, the adapter is likely to be working and the problem may be in Topology Services.
  4. Issue the ping command to determine whether there is connectivity to any other adapter in the same network. If ping receives responses, the adapter is likely to be working and the problem may be in Topology Services.
  5. Refer to Operational test 4 - Check address of local adapter.
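The counter check in step 3 amounts to comparing two samples of the Ipkts and Opkts columns taken some time apart. This is a minimal sketch that assumes the counter values have been copied by hand from two runs of netstat -in. The function name and numbers are hypothetical, and counter wraparound is ignored.

```python
def counters_advancing(before: dict, after: dict) -> bool:
    """True when both the Ipkts and Opkts counters grew between samples."""
    return (after["Ipkts"] > before["Ipkts"]
            and after["Opkts"] > before["Opkts"])


# Hypothetical samples for one adapter, taken a short time apart.
first = {"Ipkts": 120450, "Opkts": 98310}
second = {"Ipkts": 120512, "Opkts": 98377}
print(counters_advancing(first, second))  # prints True
```

A False result with only the send counter increasing suggests that the adapter is transmitting but not receiving, which is consistent with a connectivity problem rather than a problem in Topology Services.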
TS_LOGFILE_ER

42D96ED3

PERM Explanation: The daemon failed to open the log file.

Details: This entry indicates that the Topology Services daemon was unable to open its log file. Standard fields indicate that the daemon was unable to open its log file. Detail Data fields show the name of the log file. The situation that caused the problem may clear when the file system problem is corrected. The Topology Services daemon exits. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_LONGLINE_ER

4D0CE96E

PERM Explanation: The Topology Services daemon cannot start because the machines.lst file has a line that is too long.

Details: This entry indicates that the Topology Services daemon was unable to start because there is a line which is too long in the machines.lst configuration file. This entry refers to a particular instance of Topology Services on the local node. If this problem occurs at startup time, the daemon exits. The problem is likely to affect other nodes, since the machines.lst file should be the same at all nodes.

Standard fields indicate that the daemon was unable to start because the machines.lst configuration file has a line longer than 80 characters. Detail Data fields show the pathname of the machines.lst configuration file. It is possible that the network name is too long, or there is a problem in the /var/ha file system.
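A saved copy of the machines.lst file can be scanned for the offending line. The sketch below uses the 80-character limit stated above. Whether the daemon counts the trailing newline toward that limit is not stated in this guide, so the sketch excludes it, and the function name is hypothetical.

```python
MAX_LINE_LENGTH = 80  # limit stated for lines in machines.lst


def find_long_lines(lines, limit=MAX_LINE_LENGTH):
    """Return (line_number, length) for every line longer than limit.

    lines is any iterable of strings, such as an open text file.
    The trailing newline is not counted (an assumption).
    """
    result = []
    for number, line in enumerate(lines, start=1):
        length = len(line.rstrip("\n"))
        if length > limit:
            result.append((number, length))
    return result


# Illustrative input: the second line is one character over the limit.
sample_lines = ["a" * 80 + "\n", "b" * 81 + "\n", "short\n"]
print(find_long_lines(sample_lines))  # prints [(2, 81)]
```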

TS_LSOCK_ER

4E358F5D

PERM Explanation: The daemon failed to open a listening socket for connection requests.

Details: This entry indicates that the Topology Services daemon was unable to open a socket connection to communicate with its clients.

Standard fields indicate that the daemon was unable to open the socket. Detail Data fields show the operation being attempted at the socket (in English) and the system error value returned by the system call. The situation that caused the problem may clear with a reboot. The netstat command shows the sockets in use in the node. The Topology Services daemon exits. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_MACHLIST_ER

EDFE80F3

PERM Explanation: The Topology Services configuration file cannot be opened.

Details: This entry indicates that the Topology Services daemon was unable to read its machines.lst configuration file. Standard fields indicate that the daemon was unable to read the machines.lst file. Detail Data fields show the pathname of the file. Information about the cause of the problem is not available after the condition is cleared. If this problem occurs at startup time, the daemon exits. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_MIGRATE_ER

8535A705

PERM Explanation: Migration-refresh error.

Details: This entry indicates that the Topology Services daemon has found a problem during a migration-refresh. The migration-refresh is a refresh operation issued at the end of an HACMP node by node migration, when the last node is moved to the newer release. The problem may be caused by the information placed on the Global ODM when the migration protocol is complete.

This entry refers to a particular instance of the Topology Services daemon on the local node. It is likely that some of the other nodes have a similar problem. Standard fields indicate that the Topology Services daemon encountered problems during a migration-refresh.

HACMP may have loaded incorrect information into the Global ODM.

Data read by the Topology Services startup script is left on the Topology Services run directory and will be overwritten in the next refresh or startup operation. The data in the "run" directory should be saved. The Topology Services "Service" log file has a partial view of what was in the Global ODM at the time of the refresh operation. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_MISCFG_EM

6EA7FC9E

PEND Explanation: Local adapter incorrectly configured.

Details: This entry indicates that one local adapter is either missing or has an address that is different from the address that Topology Services expects. Standard fields indicate that a local adapter is incorrectly configured. Detail Data fields contain information about the adapter, such as the interface name, adapter offset (network index in the machines.lst file), and expected address.

Possible sources of the problem are:

  • The adapter may have been configured with a different IP address.
  • The adapter is not configured.
  • Topology Services was started after a "Force Down" in HACMP.

This entry is created on the first occurrence of the condition. No data is stored about the condition after the problem is cleared.

Use the interface name in the error report to find the adapter that is incorrectly configured. Command: ifconfig interface_name displays information about the adapter.

TS_NIM_DIED_ER PERM Explanation: One of the NIM processes terminated abnormally.

Details: This entry is created when one of the Network Interface Module (NIM) processes, which Topology Services uses to monitor the state of each adapter, terminates abnormally.

When a NIM terminates, the Topology Services daemon will restart another, but if the replacement NIM also terminates quickly then no other NIM will be started, and the adapter will be flagged as down.

Detailed data fields show:

  • Process exit value, if the process was not terminated by a signal. A value from 1 to 99 is an 'errno' value from invoking the NIM process.
  • Signal number (0: no signal).
  • Whether a core file was created (1: core file; 0: no core file).
  • Process id (PID).
  • Interface name being monitored by the NIM.
  • Pathname of NIM executable file.

See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_NIM_NETMON_ERROR_ER PERM Explanation: An error occurred in the netmon library, which the NIM (Network Interface Module, a process used by Topology Services to monitor the state of each adapter) uses to determine whether the local adapter is alive.

Details: This entry is created when there is an internal error in the netmon library. As a result, the local adapter will be flagged as down, even though the adapter may still be working properly.

A possible cause for the problem (other than a problem in the library) is the presence of some non-supported adapter in the cluster configuration.

Detailed data fields show:

  • Errno value.
  • Error code from netmon library.
  • Function name in library that presented a problem.
  • Interface name being monitored.

See Information to collect before contacting the IBM Support Center and contact the IBM Support Center. It is important to collect the information as soon as possible, since log information for the netmon library is kept in log files that may wrap within 30 minutes.

TS_NIM_OPEN_ERROR_ER PERM Explanation: The NIM (Network Interface Module, a process used by Topology Services to monitor the state of each adapter) failed to connect to the local adapter that it is supposed to monitor.

Details: This entry is created when the NIM is unable to connect to the local adapter that needs to be monitored. As a result, the adapter will be flagged as down, even though the adapter might still be working properly.

Detailed data fields show:

  • Interface name.
  • Description 1: description of the problem.
  • Description 2: description of the problem.
  • Value 1 - used by the IBM Support Center.
  • Value 2 - used by the IBM Support Center.

Some possible causes for the problem are:

  • NIM process was blocked while responding to NIM open command.
  • NIM failed to open non-IP device.
  • NIM received an unexpected error code from a system call.

See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_NODEDOWN_EM

4D9226A5

PEND Explanation: Remote nodes were seen as down by Topology Services.

Details: This is an indication that the Topology Services daemon detected one or more remote nodes as being down. This refers to a particular instance of the Topology Services daemon. Data should be collected on the remote nodes that were seen as dead. Standard fields indicate that remote nodes were seen as dead and present possible causes.

Detailed fields contain the path name of a file containing the numbers of the affected nodes. The file with the node numbers may eventually be deleted by the system. The file is located in: /var/adm/ffdc/dumps/hatsd.*

Verify that the nodes listed in the specified file actually went down and investigate why.

TS_NODENUM_ER

2033793C

PERM Explanation: The local node number is not known to Topology Services.

Details: This entry indicates that Topology Services was not able to find the local node number. Standard fields indicate that the daemon was unable to find its local node number. The Topology Services daemon exits. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_NODEUP_ST

95A9DAD0

INFO Explanation: Remote nodes that were previously down were seen as up by Topology Services. This is an indication that the Topology Services daemon detected one or more previously down nodes as being up. It refers to a particular instance of the Topology Services daemon.

Details: If the same nodes were seen as dead a short time before, data should be collected on the remote nodes. Standard fields indicate that remote nodes were seen as up and present possible causes. Detail Data fields contain, in the REFERENCE CODE section, a reference to the entry where the same nodes were seen as dead. If these nodes were seen as down before at different times, the reference code will be for one of these instances.

The Detail Data also contains the path name of a file which stores the numbers of the nodes that were seen as up, along with the error id for the error log entry where each node was seen as dead previously. The file with the node numbers may eventually be deleted by the system. The file is located in: /var/adm/ffdc/dumps/hatsd.*.

If the same nodes were recently seen as dead (follow the REFERENCE CODE), examine the remote nodes for the reason why the nodes were temporarily seen as dead. This entry is logged when a remote node is seen as alive. The same node may have been seen as dead some time ago. If so, the TS_NODEUP_ST will have, as part of the Detail Data, a location of a file whose contents are similar to:

.ZOWYB/Z5Kzr.zBI14tVQ7....................
     1 

The file contains the ERROR ID of the error log entry of the corresponding TS_NODEDOWN_EM entry (when the same node was flagged as dead).

TS_OFF_LIMIT_ER PERM Explanation: Number of network offsets exceeds Topology Services limit.

Details: This entry is created whenever the number of adapters and networks in the cluster configuration exceeds the Topology Services daemon's internal limit of 16 "heartbeat rings".

Notice that a single cluster network may map to multiple "heartbeat rings". This will happen when a node has multiple adapters in the same network, since a heartbeat ring is limited to a single adapter per node.

If this error occurs, a number of adapters and networks in the configuration may remain unmonitored by Topology Services.

The detailed data fields contain the first network in the configuration to be ignored and the maximum number of networks allowed.

When attempting to resolve the problem, initially focus on the nodes that have the most adapters in the configuration, and proceed to remove some adapters from the configuration.
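A rough way to see how close a configuration is to the limit is to estimate the ring count from a list of adapters. The sketch below is an assumption-laden illustration, not the daemon's own algorithm: the function name count_rings and the input format (one "node_name network_name" line per adapter) are hypothetical. It relies only on the rule stated above, that a heartbeat ring holds at most one adapter per node, so each network needs at least as many rings as the largest number of adapters any single node has on it.

```shell
# Hypothetical estimate of the number of heartbeat rings.
# Input on stdin: one line per adapter, "node_name network_name".
count_rings() {
    awk '
        NF >= 2 {
            # Count adapters per (node, network) pair and remember the
            # largest count seen for each network.
            n = ++per_node[$1, $2]
            if (n > rings[$2]) rings[$2] = n
        }
        END {
            total = 0
            for (net in rings) total += rings[net]
            print total
        }
    '
}

# Usage (adapters.txt is a hypothetical file prepared by hand):
#   count_rings < adapters.txt
```

A result above 16 suggests that removing adapters from the nodes that have the most adapters on one network would reduce the count fastest.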

TS_REFRESH_ER

5FB345F4

PERM Explanation: Topology Services refresh error.

Details: This entry indicates that a problem occurred during a Topology Services refresh operation. A refresh operation can be a result of a hatsctrl -r command on PSSP systems, or a Topology DARE (Dynamic Automatic Reconfiguration Event) in HACMP. Topology DARE is an HACMP feature to change the configuration of the cluster dynamically, without having to shut the cluster down. Topology DARE is invoked by customers when a configuration change, such as adding a node, is made.

This SMIT sequence performs the Topology DARE:

SMIT
     Cluster Topology
            Synchronize Cluster Topology
 

This entry refers to a particular instance of the Topology Services daemon on the local node. On HACMP, the problem may have occurred in other nodes as well. Standard fields indicate that a refresh error occurred.

The machines.lst file has some incorrect information. The problem was probably introduced during a migration-refresh in an HACMP node-by-node migration. Data used to build the machines.lst file is stored in the daemon's "run" directory and may be lost if Topology Services is restarted or a new refresh is attempted.

More details about the problem are in the User log file. See Topology Services user log. Additional details are stored in the Service log. See Topology Services service log. If this problem occurs at startup time, the Topology Services daemon may exit. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_RSOCK_ER

F11523E1

PERM Explanation: The daemon failed to open socket for peer daemon communication.

Details: This entry indicates that the Topology Services daemon was unable to open a UDP socket for communication with peer daemons in other nodes. Standard fields indicate that the daemon was unable to open the socket. Detail Data fields describe the operation being attempted at the socket (in English), the reason for the error, the system error value, and the port number.

The port number may be in use by either another subsystem or by another instance of the Topology Services daemon. If the AIX SRC subsystem loses its connection to the Topology Services daemon, the AIX SRC may erroneously allow a second instance of the daemon to be started, leading to this error. The situation that caused the problem may clear with a node reboot.

Follow the procedures described for the "Nodes or adapters leave membership after refresh" symptom in Error symptoms, responses, and recoveries to find a possible Topology Services daemon running at the node and stop it. If no process is found that is using the peer socket, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center. Also include an AIX System Dump. See Producing a system dump.
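Before following the referenced procedure, it can help to confirm that the port number from the Detail Data is actually bound. The following is a minimal sketch, not part of the documented procedure: the function name udp_port_in_use is hypothetical, and it assumes netstat -an output in which the protocol is the first field and the local address, ending in ".port" or ":port", is the fourth field.

```shell
# Hypothetical helper: succeed if the netstat -an output on stdin shows
# a UDP socket bound to the given local port. The column layout is an
# assumption about the netstat output format.
udp_port_in_use() {
    port=$1
    awk -v port="$port" '
        $1 ~ /^udp/ && $4 ~ ("[.:]" port "$") { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage (port_number comes from the Detail Data of the error log entry):
#   netstat -an | udp_port_in_use port_number && echo "port is in use"
```

This only shows that some process holds the port. Identifying the process still requires the procedure referenced above.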

TS_SDR_ER

0BD8A620

PERM Explanation: Cannot retrieve data from the SDR.

Details: This entry indicates that the Topology Services startup script hats was unable to retrieve information from the System Data Repository (SDR). This entry refers to a particular instance of Topology Services on the local node. If the SDR itself is having problems, it is likely that other nodes are also affected. Standard fields indicate that data could not be retrieved from the SDR.

Possible causes are:

  • A problem with the SDR subsystem.
  • Too much contention for the SDR when a large number of nodes are trying to access the SDR simultaneously.
  • Too much traffic on the SP Ethernet.

Information about the cause of the problem may not be available once the problem is cleared. Diagnose the SDR subsystem. See Diagnosing SDR problems.

TS_SECMODE_ER

D1BD179A

PERM Explanation: Failed to determine local DCE security mode.

Details: This entry indicates that errors have occurred while the Topology Services daemon is trying to obtain DCE security information in a partition where DCE is the sole authentication method. When this error occurs, the affected daemon will terminate its operation and exit. This error should not occur if "Compatibility" is one of the partition's authentication methods.

The following are probable causes for the problem:

  • The Topology Services daemon failed to load the security services library (library /usr/lib/libspsec.a possibly not installed).
  • Calls to security services library functions failed to either initialize the library, obtain the authentication method, or read the key file.
  • The local node is not configured in DCE-only mode.
  • Topology Services on the control workstation could not determine the partition's security state, using the /usr/lpp/ssp/bin/lsauthpts command, or it could not access the Topology Services key file.

The Detail Data fields contain the location of the Topology Services User log file, which includes more detailed information about the problem. For SP Security Services problems, see Diagnosing SP Security Services problems.

TS_SECURITY_ST

78278638

INFO Explanation: Authentication failure in Topology Services.

Details: This entry indicates that the Topology Services daemon cannot authenticate a message from one of the peer daemons running in a remote node. This entry refers to a particular instance of the Topology Services daemon on the local node. The node which is sending these messages must also be examined.

Standard fields indicate that a message cannot be authenticated. Detail Data fields show the source of the message. The possible problems are:

  • There is an attempt at a security breach.
  • The Time-Of-Day clocks in the nodes are not synchronized.
  • There are stale packets flowing through the network.
  • IP packets are being corrupted.
  • The security key file is not in sync across all nodes in the system partition.

An entry is created the first time a message cannot be authenticated. After that, entries are created less frequently. Information about the network must be collected while the messages are still being received. The command iptrace should be used to examine the packets arriving at the node.

Perform the following steps:

  1. Examine the output of the lssrc -ls hats command on the local node and on the node sending the message. Look for field "Key version" in the output and check whether the numbers are the same on both nodes.
  2. Check that the key file is the same in all the nodes in the partition.

TS_SECURITY2_ST

E486FA26

INFO Explanation: More authentication failures in Topology Services.

Details: This entry indicates that there have been additional incoming messages that could not be authenticated. For the first such message, error log entry TS_SECURITY_ST is created. If additional messages cannot be authenticated, error log entries with label TS_SECURITY2_ST are created less and less frequently.

The standard fields indicate that incoming messages cannot be authenticated. The detailed fields show an interval in seconds and the number of messages in that interval that could not be authenticated.

For more details and diagnosis steps, see the entry for the TS_SECURITY_ST label.

TS_SEMGET_ER

68547A69

PERM Explanation: Cannot get shared memory or semaphore segment. This indicates that the Topology Services daemon was unable to start because it could not obtain a shared memory or semaphore segment. This entry refers to a particular instance of the Topology Services daemon on the local node. The daemon exits.

Details: Standard fields indicate that the daemon could not start because it was unable to get a shared memory or a semaphore segment. The Detail Data fields contain the key value and the number of bytes requested for shared memory, or the system call error value for a semaphore.

The reason why this error has occurred may not be determined if the subsystem is restarted and this error no longer occurs.

TS_SERVICE_ER

F93756D2

PERM Explanation: Unable to obtain port number from the /etc/services file.

Details: This entry indicates that the Topology Services daemon was unable to obtain the port number for daemon peer communication from /etc/services. This entry refers to a particular instance of the Topology Services daemon on the local node. The daemon exits. Other nodes may be affected if their /etc/services files have contents similar to those on the local node.

Standard fields indicate that the daemon was unable to obtain the port number from /etc/services. Detail Data fields show the service name used as search key to query /etc/services.
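To check by hand whether the service name shown in the Detail Data has an entry, a lookup along the following lines may be useful. It is a sketch under stated assumptions: the function name lookup_udp_port is hypothetical, it matches only the primary service name (not aliases), and it assumes the standard services file layout of "name port/protocol".

```shell
# Hypothetical helper: print the UDP port for a service name found in a
# services-format file (default /etc/services). Returns nonzero if no
# matching entry exists. Aliases are not considered.
lookup_udp_port() {
    service=$1
    file=${2:-/etc/services}
    awk -v svc="$service" '
        { sub(/#.*/, "") }                  # strip trailing comments
        $1 == svc && $2 ~ /\/udp$/ {
            split($2, parts, "/")           # "port/udp" -> port
            print parts[1]
            found = 1
            exit
        }
        END { exit found ? 0 : 1 }
    ' "$file"
}

# Usage (service_name is the search key shown in the Detail Data):
#   lookup_udp_port service_name || echo "no UDP entry for service_name"
```

Running the same lookup on other nodes shows whether their /etc/services files are missing the entry as well.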

TS_SHMAT_ER

DA6A5149

PERM Explanation: Cannot attach to shared memory segment.

Details: This entry indicates that the Topology Services daemon was unable to start because it could not attach to a shared memory segment. Standard fields indicate that the daemon could not start because it was unable to attach to a shared memory segment. The daemon exits. The Detail Data fields contain the shared memory identifier and number of bytes requested.

The reason why the error occurred may not be found if the subsystem is restarted and the same error does not occur.

TS_SHMEMKEY_ER

41E8D858

PERM Explanation: Cannot get IPC key.

Details: This indicates that the Topology Services daemon was unable to start because it could not obtain an IPC key. This refers to a particular instance of the Topology Services daemon on the local node. The daemon exits.

Standard fields indicate that the daemon could not start because it was unable to obtain an IPC key. The Detail Data fields contain the pathname of the UNIX-domain socket used for daemon-client communication. This pathname is given to the ftok() subroutine in order to obtain an IPC key.

This entry is created when the UNIX-domain socket file has been removed. The reason why this error has occurred may not be determined if the subsystem is restarted and this error no longer occurs.

TS_SHMGET_ER

42416EB1

PERM See TS_SEMGET_ER

TS_SP_DIR_ER

596A9ABD

PERM Explanation: Cannot create directory.

Details: This entry indicates that the Topology Services startup script hats was unable to create one of the directories it needs for processing. Standard fields indicate that a directory could not be created by the startup script hats. Detail Data fields show the directory that could not be created. Information about the cause of the problem may not be available once the problem is cleared.

TS_SPIPDUP_ER

E68EB007

PERM See TS_HAIPDUP_ER

TS_SPLOCAL_ER

74B0CCF7

PERM See TS_HALOCAL_ER

TS_SPNODEDUP_ER

C82AB176

PERM See TS_HANODEDUP_ER

TS_START_ST

645637FC

INFO Explanation: The Topology Services daemon has started.

This is an indication that the Topology Services daemon has started. This entry refers to a particular instance of the Topology Services daemon on the local node, or a particular partition on the control workstation.

Details: Standard fields indicate that the daemon started. The Topology Services subsystem was started by a user or during system boot. Detail Data will be in the language where the errpt command is run. The Detail Data contains the location of the log and run directories and also which user or process started the daemon.

TS_STOP_ST

A204A4EE

INFO Explanation: The Topology Services daemon has stopped.

This is an indication that the Topology Services daemon has stopped. This entry refers to a particular instance of the Topology Services daemon on the local node, or a particular partition on the control workstation.

Details: The Topology Services subsystem shutdown was caused by a signal sent by a user or process. Standard fields indicate that the daemon stopped. The standard fields are self-explanatory.

If stopping the daemon is not desired, you must quickly understand what caused this condition. If the daemon was stopped by the AIX SRC, the word "SRC" is present in the Detail Data.

The REFERENCE CODE field in the Detail Data section refers to the error log entry for the start of Topology Services. Detail Data is in English. Detail Data fields point to the process (SRC) or signal that requested the daemon to stop.

TS_SYSPAR_ER

8C616BE5

PERM Explanation: Cannot find system partition name.

Details: This entry indicates that the Topology Services startup script hats was unable to obtain the partition name using the /usr/lpp/ssp/bin/spget_syspar command. Standard fields indicate that a problem occurred in /usr/lpp/ssp/bin/spget_syspar. Information about the cause of the problem in spget_syspar may be lost once the problem is cleared.

Issue the commands: /usr/lpp/ssp/bin/spget_syspar and /usr/lpp/ssp/bin/spget_syspar -n. If either returns an error and a nonzero exit code, perform problem determination procedures on the SDR. See Diagnosing SDR problems.
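The two checks can be run through a small wrapper that reports any nonzero exit code. This is a minimal sketch: the wrapper name check_cmd is hypothetical, and the command's own output is passed through unchanged so that any error message stays visible.

```shell
# Hypothetical wrapper: run a command, let its output through, and
# report a nonzero exit code on stderr. Returns the command's exit code.
check_cmd() {
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "exit code $rc from: $*" >&2
    fi
    return "$rc"
}

# Usage:
#   check_cmd /usr/lpp/ssp/bin/spget_syspar
#   check_cmd /usr/lpp/ssp/bin/spget_syspar -n
```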

TS_THATTR_ER

B705E4E5

PERM Explanation: Cannot create or destroy a thread attributes object.

Details: This entry indicates that Topology Services was unable to create or destroy a thread attributes object. Standard fields indicate that the daemon was unable to create or destroy a thread attributes object. Detail Data fields show which of the Topology Services threads was being handled. The Topology Services daemon exits. See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

TS_THCREATE_ER

5540C482

PERM Explanation: Cannot create a thread.

Details: This entry indicates that Topology Services was unable to create one of its threads. Standard fields indicate that the daemon was unable to create a thread. Detail Data fields show which of the Topology Services threads was being created.

TS_THREAD_STUCK_ER

47E4956B

PERM Explanation: Main thread is blocked. Daemon will exit.

Details: This entry indicates that the Topology Services daemon will exit because its main thread was blocked for longer than a pre-established time threshold. If the main thread remains blocked for too long, it is possible that the node is considered dead by the other nodes.

The main thread needs to have timely access to the CPU, otherwise it would fail to send "heartbeat" messages, run adapter membership protocols, and notify Group Services about adapter and node events. If the main thread is blocked for too long, the daemon exits with a core dump, to allow debugging of the cause of the problem.

This entry refers to a particular instance of Topology Services running on a node. The standard fields indicate that the Topology Services daemon will exit because the main thread was blocked for too long, and explain some of the possible causes. The detailed fields show the number of seconds that the main thread appeared to be blocked, the number of recent page faults involving I/O operations, and the interval in milliseconds where these page faults occurred. If the number of page faults is non-zero, the problem could be related to memory contention.

For information about diagnosing and working around the problem in case its root cause is a resource shortage, see Action 5 - Investigate hatsd problem. If a resource shortage does not seem to be a factor, the cause could be a problem in the daemon or in a service invoked by it. Contact the IBM Support Center.

TS_UNS_SIN_TR

029E523B

UNKN Explanation: Local adapter in unstable singleton state.

Details: This entry indicates that a local adapter is staying too long in an unstable singleton state. Though the adapter is able to receive some messages, there could be a problem with it, which may prevent outgoing messages from reaching their destinations.

This entry refers to a particular instance of the Topology Services daemon on the local node. Examine the Service log on other nodes to determine if other nodes are receiving messages from this adapter. See Topology Services service log.

Standard fields indicate that a local adapter is in an unstable singleton state. Detail Data fields show the interface name, adapter offset (index of the network in the machines.lst file), and the adapter address according to Topology Services, which may differ from the adapter's actual address if the adapter is incorrectly configured. The adapter may be unable to send messages. The adapter may be receiving broadcast messages but not unicast messages.

Information about the adapter must be collected while the adapter is still in this condition. Issue the commands: ifconfig interface_name and netstat -in and record the output.

Perform these steps:

  1. Check if the address displayed in the error report entry is the same as the actual adapter address, which can be obtained by issuing this command: ifconfig interface_name. If they are not the same, the adapter has been configured with the wrong address.
  2. Issue the command ping address from the local node for all the other addresses in the same network. If ping indicates that there is no reply (for example: 10 packets transmitted, 0 packets received, 100% packet loss) for all the destinations, the adapter may be incorrectly configured.
  3. Refer to Operational test 6 - Check whether the adapter can communicate with other adapters in the network.
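Step 2 can be repeated over many addresses with a short loop. The sketch below is illustrative only: the function names ping_all_lost and list_silent_addresses are hypothetical, and the check assumes that the ping summary contains the phrase "100% packet loss", as in the example in step 2.

```shell
# Hypothetical helper: succeed when the ping output on stdin reports
# total loss, as in "10 packets transmitted, 0 packets received,
# 100% packet loss".
ping_all_lost() {
    grep -q '100% packet loss'
}

# Hypothetical loop: ping each address given as an argument (10 packets
# each) and print the addresses that never replied.
list_silent_addresses() {
    for addr in "$@"; do
        if ping -c 10 "$addr" 2>/dev/null | ping_all_lost; then
            echo "$addr"
        fi
    done
}

# Usage (addresses of the other adapters in the same network):
#   list_silent_addresses address1 address2 address3
```

If every address in the network is printed, the local adapter is the likely cause. If only some are printed, the problem may lie with those remote adapters instead.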
