Diagnosis Guide

Error symptoms, responses, and recoveries

Use the following table to diagnose problems with the Topology Services component of RSCT. Locate the symptom and perform the action described in the following table.

Table 59. Topology Services symptoms

Symptom	Recovery
Adapter membership groups do not include all the nodes in the configuration.	See Operational test 1 - Verify status and adapters.
Topology Services subsystem fails to start.	See Action 1 - Investigate startup failure.
The refresh operation fails or has no effect.	See Action 2 - Investigate refresh failure.
A local adapter is notified as being down by Topology Services.	See Action 3 - Correct local adapter problem.
Adapters appear to be going up and down continuously.	See Action 4 - Investigate partial connectivity problem.
A node appears to go down and then up a few seconds later.	See Action 5 - Investigate hatsd problem.
Adapter (or host_responds) appears to go down and then up a few seconds later.	See Action 6 - Investigate IP communication problem.
Group Services exits abnormally because of a Topology Services Library error. Error log entry with template GS_TS_RETCODE_ER is present.	See Action 7 - Investigate Group Services failure.
A node running HACMP/ES crashes and produces an AIX dump. System Dump analysis reveals a "panic" in function haDMS_kex:dead_man_sw_handler.	See Action 8 - Investigate node crash.
Nodes or adapters leave membership after a refresh.	See Action 9 - Investigate problems after a refresh.
If DCE is the only active authentication method, and it is suspected that the Topology Services authentication keys has been compromised.	See Action 10 - Correct authentication keys.

Actions

Action 1 - Investigate startup failure

Some of the possible causes are:

SDR-related problems that prevent the startup script from obtaining configuration data from the SDR.
Adapter configuration problems, such as duplicated IP addresses in the configuration or no control workstation Ethernet adapter in the same subnet as the nodes.
AIX-related problems, such as a shortage of space in the /var directory or a port number already in use.
SP Security Services problems that prevent Topology Services from obtaining credentials, determining the active authentication method, or determining the authentication keys to use.

See Operational test 2 - Determine why the Topology Services subsystem is inactive. To verify the correction, see Operational test 1 - Verify status and adapters.

Action 2 - Investigate refresh failure

The most probable cause is that an incorrect adapter or network configuration was passed to Topology Services. In PSSP, the following command:

/usr/sbin/rsct/bin/hatsctrl -r

may produce a message similar to:

2523-300 Refresh operation failed because of errors in machines.lst file.

The same message will be present in the startup script log. Another message may also appear:

hatsctrl: 2523-646 Refresh operation failed.
              Details are in the AIX Error Log and in the hats script log
              (/var/ha/log/hats.partition_name).

Also, configuration errors result in AIX Error Log entries being created. Some of the template labels that may appear are:

TS_SPNODEDUP_ER
TS_HANODEDUP_ER
TS_SPIPDUP_ER
TS_HAIPDUP_ER
TS_CWSADDR_ER
TS_SDR_ER

In addition, when DCE is the only active authentication method for SP trusted services, refresh may fail if the Topology Services startup script fails to get DCE credentials to update the SDR. When this happens, an AIX error log entry with the following label may appear: TS_DCECRED_ER.

The AIX error log entries should provide enough information to determine the cause of the problem. Detailed information about the configuration and the error can be found in the startup script log and the user log.

For the problems that result in the error log entries listed here, the solution involves changing the IP address of one or more adapters. The procedure is different depending on whether the problem occurs in PSSP or HACMP. After the adapter configuration problem is fixed, a new refresh operation can be attempted. On PSSP, the command to use is the hatsctrl command described previously.

On HACMP, the following sequence results in a Topology Services refresh:

 smit hacmp
                    Cluster Configuration
                        Cluster Topology
                            Synchronize Cluster Topology

Good results are indicated by the lack of the error message 2523-300 or similar messages, and the lack of error AIX log entries listed earlier. Good results are also indicated by a change in the Configuration Instance, when checked using these steps:

Issue this command: lssrc -ls subsystem_name before the refresh.
Wait until one minute after the refresh completes.
Issue the command again: lssrc -ls subsystem_name.
Verify that the number near the Configuration Instance is different. The number remains unchanged when the refresh fails.

If the lssrc command is issued right after the refresh, text similar to the following may appear as part of the output:

 Daemon is in a refresh quiescent period.
     Next Configuration Instance = 926456205

This message indicates that the refresh operation is in progress.

Error results are indicated by messages 2523-300 and 2523-646, and error log entries listed previously.

Action 3 - Correct local adapter problem

Probable causes of this problem are:

The adapter is not working.
The network may be down.
In the case of an SP Switch or SP Switch2 adapter, the switch may be down, or the local adapter may be fenced.
The adapter may have been configured with an incorrect IP address.
Topology Services is unable to get response packets back to the adapter.
There is a problem in the subsystem's "adapter self-death" procedures.

See Operational test 4 - Check address of local adapter to analyze the problem. The repair action depends on the nature of the problem. For problems 1 through problem 4, the underlying cause for the adapter to be unable to communicate must be found and corrected.

For problem 5, Topology Services requires that at least one other adapter in the network exist, so that packets can be exchanged between the local and remote adapters. Without such an adapter, a local adapter would be unable to receive any packets. Therefore, there would be no way to confirm that the local adapter is working.

Note that for configurations having only two nodes, when a node or a node's switch adapter fails, the switch adapter on the other node will also be flagged as down. This is because the remaining adapter will have no other adapter to communicate with. The same is also true with two-node SP system partitions.

To verify the repair, issue the lssrc command as described in Operational test 1 - Verify status and adapters. If the problem is due to Topology Services being unable to obtain response packets back to the adapter (problem 5), the problem can be circumvented by adding machine names to file /usr/sbin/cluster/netmon.cf.

These machines should be routers or any machines that are external to the configuration, but are in one of the networks being monitored by the subsystem. Any entry in this file is used as a target for a probing packet when Topology Services is attempting to determine the health of a local adapter. The format of the file is as follows:

machine name or IP address 1
machine name or IP address 2
..........

where the IP addresses are in dotted decimal format. The use of this file is explained in HACMP: Planning Guide. If the file does not exist, it should be created. To remove this recovery action, remove the entries added to the file, delete the file, or rename the file.

Action 4 - Investigate partial connectivity problem

The most probable cause is a partial connectivity scenario. This means that one adapter or a group of adapters can communicate with some, but not all, remote adapters. Stable groups in Topology Services require that all adapters in a group be able to communicate with each other.

Some possible sources of partial connectivity are:

Physical connectivity
Incorrect routing at one or more nodes
Adapter or network problems which result in packets larger than a certain size being lost
Incorrect ARP setting in large machine configurations
The total number of entries in the ARP table must be a minimum of two times the number of nodes. The number of entries in the ARP table is calculated by multiplying the arptab_bsiz parameter by the arptab_nb parameter. The parameters arptab_bsiz and arptab_nb are tunable parameters controlled by the AIX no (network options) command.
High network traffic, which causes a significant portion of the packets to be lost.

To check whether there is partial connectivity on the network, run Operational test 10 - Check neighboring adapter connectivity. The underlying connectivity problem must be isolated and corrected. To verify the correction, issue the lssrc command from Operational test 1 - Verify status and adapters.

The problem can be bypassed if the connectivity test revealed that one or more nodes have only partial connectivity to the others. In this case, Topology Services can be stopped on these partial connectivity nodes. If the remaining adapters in the network have complete connectivity to each other, they should form a stable group.

Topology Services subsystem can be stopped on a node by issuing the command:

/usr/sbin/rsct/bin/hatsctrl -k

Note that the nodes where the subsystem was stopped will be marked as down by the others. Applications such as IBM Virtual Shared Disk and GPFS will be unable to use these nodes.

To test and verify this recovery, issue the lssrc command as described in Operational test 1 - Verify status and adapters. The Group ID information in the output should not change across two invocations approximately one minute apart.

Once this recovery action is no longer needed, restart Topology Services by issuing this command:

/usr/sbin/rsct/bin/hatsctrl -s

Action 5 - Investigate hatsd problem

Probable causes of this problem are:

The Topology Services daemon is temporarily blocked.
The Topology Services daemon exited on the node.
IP communication problem, such as mbuf shortage or excessive adapter traffic.

Probable cause 1 can be determined by the presence of an AIX error log entry with TS_LATEHB_PE template on the affected node. This entry indicates that the daemon was blocked and for how long. When the daemon is blocked, it cannot send messages to other adapters, and as a result other adapters may consider the adapter dead in each adapter group. This results in the node being considered dead.

The following are some of the reasons for the daemon to be blocked:

A memory shortage, which causes excessive paging and thrashing behavior; the daemon stays blocked, awaiting a page-in operation.
A memory shortage combined with excessive disk I/O traffic, which results in slow paging operations.
The presence of a fixed-priority process with higher priority than the Topology Services daemon, which prevents the daemon from running.
Excessive interrupt traffic, which prevents any process in the system from being run in a timely manner.

A memory shortage is usually detected by the vmstat command. Issue the command:

vmstat -s

to display several memory-related statistics. Large numbers for paging space page ins or paging space page outs (a significant percentage of the page ins counter) indicate excessive paging.

Issue the command: vmstat 5 7 to display some virtual memory counters over a 30-second period or time. If the number of free pages (number under the fre heading) is close to 0 (less than 100 or so), this indicates excessive paging. A nonzero value under po (pages paged out to paging space) occurring consistently also indicates heavy paging activity.

In a system which appears to have enough memory, but is doing very heavy I/O operations, it is possible that the virtual memory manager may "steal" pages from processes ("computational pages") and assign them to file I/O ("permanent pages"). In this case, to allow more computational pages to be kept in memory, the vmtune command can be used to change the proportion of computational pages and permanent pages.

The same command can also be used to increase the number of free pages in the node, below which the virtual memory manager starts stealing pages and adding them to the free list. Increasing this number should prevent the number of free pages from reaching zero, which would force page allocation requests to wait. This number is controlled by the minfree parameter of the vmtune command.

Command:

/usr/samples/kernel/vmtune -f 256 -F 264 -p 1 -P 2

can be used to increase minfree to 256 and give more preference to computational pages. More information is in the minfree parameter description of the Appendix "Summary of Tunable AIX Parameters", in AIX Versions 3.2 and 4 Performance Tuning Guide.

If the reason for the blockage cannot be readily identified, AIX tracing can be set up for when the problem recurs. The command:

/usr/bin/trace -a -l -L 16000000 -T 8000000 -o /tmp/trace_raw

should be run in all the nodes where the problem is occurring. Enough space for a 16MB file should be reserved on the file system where the trace file is stored (/tmp in this example).

The trace should be stopped with the command:

/usr/bin/trcstop

as soon as the TS_LATEHB_PE entry is seen in the AIX error log. The resulting trace file and the /unix file should be saved for use by the IBM Support Center.

The underlying problem that is causing the Topology Services daemon to be blocked must be understood and solved. Problems related to memory thrashing behavior are addressed by AIX Versions 3.2 and 4 Performance Tuning Guide. In most cases, obtaining the AIX trace for the period that includes the daemon blockage (as outlined previously) is essential to determine the source of the problem.

For problems related to memory thrashing, it has been observed that if the Topology Services daemon is unable to run in a timely manner, this indicates that the amount of paging is causing little useful activity to be accomplished on the node.

Memory contention problems in Topology Services can be reduced by using the AIX Workload Manager. See Preventing memory contention problems with the AIX Workload Manager.

For problems related to excessive disk I/O, these steps can be taken in AIX to reduce the I/O rate:

Set I/O pacing.
I/O pacing limits the number of pending write operations to file systems, thus reducing the disk I/O rate. AIX is installed with I/O pacing disabled. I/O pacing can be enabled with the command:
```
chdev -l sys0 -a maxpout='33' -a minpout='24'
```
This command sets the high-water and low-water marks for pending write-behind I/Os per file. The values can be tuned if needed.
Change the frequency of syncd.
If this daemon is run more frequently, fewer number of pending I/O operations will need to be flushed to disk. Therefore, the invocation of syncd will cause less of a peak in I/O operations.
To change the frequency of syncd, edit (as root) the /sbin/rc.boot file. Search for the following two lines:
```
echo "Starting the sync daemon" | alog -t boot
nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &
```
The period is set in seconds in the second line, immediately following the invocation of /usr/sbin/syncd. In this example, the interval is set to 60 seconds. A recommended value for the period is 10 seconds. A reboot is needed for the change to take effect.

If the problem is related to a process running with a fixed AIX priority which is higher (that is, smaller number) than that of the Topology Services daemon, the problem may be corrected by changing the daemon's priority. In PSSP, this can be done by issuing this command on the control workstation:

/usr/sbin/rsct/bin/hatstune -p new_value -r

Note:: Command hatstune was introduced in PSSP 3.2 as a replacement for making direct changes to the TS_Config SDR class using command SDRChangeAttrValues.

Probable cause 2 can be determined by the presence of an AIX error log entry that indicates that the daemon exited. See AIX Error Logs and templates for the list of possible error templates used. Look also for an error entry with a LABEL of CORE_DUMP and PROGRAM NAME of hatsd. This indicates that the daemon exited abnormally, and a core file should exist in the daemon's run directory.

If the daemon produced one of the error log entries before exiting, the error log entry itself, together with the information from AIX Error Logs and templates, should provide enough information to diagnose the problem. If the CORE_DUMP entry was created, follow instructions in Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

Probable cause 3 is the most difficult to analyze, since there may be multiple causes for packets to be lost. Some commands are useful in determining if packets are being lost or discarded at the node. Issue these commands:

netstat -D
The Idrops and Odrops headings are the number of packets dropped in each interface or device.
netstat -m
The failed heading is the number of mbuf allocation failures.
netstat -s
The socket buffer overflows text is the number of packets discarded due to lack of socket space.
The ipintrq overflows text is the number of input packets discarded because of lack of space in the packet interrupt queue.
netstat -v
This command shows several adapter statistics, including packets lost due to lack of space in the adapter transmit queue, and packets lost probably due to physical connectivity problems ("CRC Errors").
vmstat -i
This command shows the number of device interrupts for each device, and gives an idea of the incoming traffic.

There can be many causes for packets to be discarded or lost, and the problem needs to be pursued as an IP-related problem. Usually the problem is caused by one or more of the following:

Excessive IP traffic on the network or the node itself.
Inadequate IP or UDP tuning.
Physical problems in the adapter or network.

If causes 1 and 2 do not seem to be present, and cause 3 could not be determined, some of the commands listed previously should be issued in loop, so that enough IP-related information is kept in case the problem happens again.

The underlying problem that is causing packets to be lost must be understood and solved. The repair is considered effective if the node is no longer considered temporarily down under a similar workload.

In some environments (probable causes 1 and 3), the problem may be bypassed by relaxing the Topology Services tunable parameters, to allow a node not to be considered down when it cannot temporarily send network packets. Changing the tunable parameters, however, also means that it will take longer to detect a node or adapter as down.

Note:: Before the tunable parameters are changed, record the current values, so that they can be restored to their original values if needed.

This solution can only be applied when:

There seems to be an upper bound on the amount of "outage" the daemon is experiencing.
The applications running on the system can withstand the longer adapter or node down detection time.

In PSSP, the Topology Services "sensitivity" factor can be changed by issuing this command on the control workstation:

/usr/sbin/rsct/bin/hatstune -s new_value -r

The adapter and node detection time is given by the formula:

2 * Sensitivity * Frequency

(two multiplied by the value of Sensitivity multiplied by the value of Frequency), where Sensitivity is the value returned by the command:

SDRGetObjects TS_Config Sensitivity

and Frequency is the value returned by the command:

SDRGetObjects TS_Config Frequency

Both values are also returned by the command:

/usr/sbin/rsct/bin/hatstune -v

In HACMP, the "Sensitivity" and "Frequency" tunable parameters are network-specific. The tunable parameters for each network may be changed with the sequence:

smit hacmp
                  Cluster Configuration
                    Cluster Topology
                      Configure Network Modules
                        Change/Show a Cluster Network Module
                          (select a network module)
                            Select the "Failure Detection Rate"

The topology must be synchronized for the tunable parameter changes to take effect. This is achieved with the sequence:

smit hacmp
                  Cluster Configuration
                    Cluster Topology
                      Synchronize Cluster Topology

To verify that the tuning changes have taken effect, issue the command:

lssrc -ls subsystem_name

approximately one minute after making the changes. The tunable parameters in use are shown in the output in a line similar to the following:

HB Interval = 1 secs. Sensitivity = 4 missed beats

For each network, HB Interval is the Frequency parameter, and Sensitivity is the Sensitivity parameter.

For examples of tuning parameters that can be used in different environments, consult the chapter "The Topology Services Subsystem" of PSSP: Administration Guide.

Good results are indicated by the tunable parameters being set to the desired values.

Error results are indicated by the parameters having their original values or incorrect values.

To verify whether the tuning changes were effective in masking the daemon outage, the system has to undergo a similar workload to that which caused the outage.

To remove the tuning changes, follow the same tuning changes outlined previously, but this time restore the previous values of the tunable parameters.

Preventing memory contention problems with the AIX Workload Manager

Memory contention has often caused the Topology Services daemon to be blocked for significant periods of time. This results in "false node downs", and in the triggering of the Dead Man Switch timer in HACMP/ES. An AIX error log entry with label TS_LATEHB_PE may appear when running RSCT 1.2 or higher. The message "Late in sending Heartbeat by ..." will appear in the daemon log file in any release of RSCT, indicating that the Topology Services daemon was blocked. Another error log entry that could be created is TS_DMS_WARNING_ST.

In many cases, such as when the system is undergoing very heavy disk I/O, it is possible for the Topology Services daemon to be blocked in paging operations, even though it looks like the system has enough memory. Two possible causes for this phenomenon are:

In steady state, when there are no node and adapter events on the system, the Topology Services daemon uses a "working set" of pages that is substantially smaller than its entire addressing space. When node or adapter events happen, the daemon faces the situation where additional pages it needs to process the events are not present in memory.
When heavy file I/O is taking place, the operating system may reserve a larger percentage of memory pages to files, making fewer pages available to processes.
When heavy file I/O is taking place, paging I/O operations may be slowed down by contention for the disk.

The probability that the Topology Services daemon gets blocked for paging I/O may be reduced by making use of the AIX Workload Manager (WLM). WLM is an operating system feature introduced in AIX Version 4.3.3. It is designed to give the system administrator greater control over how the scheduler and Virtual Memory Manager (VMM) allocate CPU and physical memory resources to processes. WLM gives the system administrator the ability to create different classes of service, and specify attributes for those classes.

The following explains how WLM can be used to allow the Topology Services daemon to obtain favorable treatment from the VMM. There is no need to involve WLM in controlling the daemon's CPU use, because the daemon is already configured to run at a real time fixed scheduling priority. WLM will not assign priority values smaller than 40 to any thread.

These instructions are given using SMIT, but it is also possible to use WLM or AIX commands to achieve the same goals. For versions of AIX before the 4330-02 Recommended Maintenance Level (which can be ordered using APAR IY06844), HACMP/ES should not be active on the machine when WLM is started. If HACMP/ES is active on the machine when WLM is started, it will not recognize the Topology Services daemon as being in the newly created class. For the same reason, in PSSP the Topology Services subsystem must be restarted after WLM is started. Starting with 4330-02, WLM is able to classify processes that started before WLM itself is started, so restarting Topology Services is not needed.

Initially, use the sequence:

   smit wlm
      Add a Class

to add a TopologyServices class to WLM. Ensure that the class is at Tier 0 and has Minimum Memory of 20%. These values will cause processes in this class to receive favorable treatment from the VMM. Tier 0 means that the requirement from this class will be satisfied before the requirements from other classes with higher tiers. Minimum Memory should prevent the process's pages from being taken by other processes, while the process in this class is using less than 20% of the machine's memory.

Use the sequence:

   smit wlm
      Class Assignment Rules
         Create a new Rule

to create a rule for classifying the Topology Services daemon into the new class. In this screen, specify 1 as Order of the Rule, TopologyServices as Class, and /usr/sbin/rsct/bin/hatsd as Application.

To verify the rules that are defined, use the sequence:

    smit wlm
      Class Assignment Rules
          List all Rules

To start WLM, after the new class and rule are already in place, use the sequence:

    smit wlm
       Start/Stop/Update WLM
          Start Workload Management

To verify that the Topology Services daemon is indeed classified in the new class, use command:

ps -ef -o pid,class,args | grep hatsd | grep -v grep

One sample output of this command is:

15200     TopologyServices /usr/sbin/rsct/bin/hatsd -n 5

The TopologyServices text in this output indicates that the Topology Services daemon is a member of the TopologyServices class.

If WLM is already being used, the system administrator must ensure that the new class created for the Topology Services daemon does not conflict with other already defined classes. For example, the sum of all "minimum values" in a tier must be less than 100%. On the other hand, if WLM is already in use, the administrator must ensure that other applications in the system do not cause the Topology Services daemon to be deprived of memory. One way to prevent other applications from being more privileged than the Topology Services daemon in regard to memory allocation is to place other applications in tiers other than tier 0.

If WLM is already active on the system when the new classes and rules are added, WLM needs to be restarted in order to recognize the new classes and rules.

For more information on WLM, see the chapter "AIX Workload Manager" in AIX 5L Version 5.1 Differences Guide.

Action 6 - Investigate IP communication problem

Probable causes of this problem are:

The Topology Services daemon was temporarily blocked.
The Topology Services daemon exited on the node.
IP communication problem, such as mbuf shortage or excessive adapter traffic.

Probable cause 1 and probable cause 2 are usually only possible when all the monitored adapters in the node are affected. This is because these are conditions that affect the daemon as a whole, and not just one of the adapters in a node.

Probable cause 3, on the other hand, may result in a single adapter in a node being considered as down. Follow the procedures described to diagnose symptom "Node appears to go down and then up", Action 5 - Investigate hatsd problem. If probable cause 1 or probable cause 2 is identified as the source of the problem, follow the repair procedures described under the same symptom.

If these causes are ruled out, the problem is likely related to IP communication. The instructions in "Node appears to go down and then up", Action 5 - Investigate hatsd problem describe what communication parameters to monitor in order to pinpoint the problem.

To identify the network that is affected by the problem, issue command errpt -J TS_DEATH_TR | more. This is the AIX error log entry created when the local adapter stopped receiving heartbeat messages from its neighbor adapter. The neighbor's address, which is listed in the error log entry, indicates which network is affected.

A note on "local" and "remote" adapter notifications

In PSSP, when there is a communication flicker, a node may temporarily appear down on host_responds without the same node ever failing or even considering its local adapter to be down. This happens because host_responds is actually the Ethernet adapter membership as seen from the control workstation. If the Ethernet adapter membership partitions because of a temporary communication problem, only the adapters on the same group as the control workstation will be considered up by host_responds. Consequently, if the control workstation is temporarily isolated, all the nodes will be considered down.

Action 7 - Investigate Group Services failure

This is most likely a problem in the Topology Services daemon, or a problem related to the communication between the daemon and the Topology Services library, which is used by the Group Services daemon. Usually the problem occurs in HACMP during IP address takeover, when multiple adapters in the network temporarily have the same address. The problem may also happen during a Topology DARE operation in HACMP or a Topology Services refresh in PSSP. See TS_REFRESH_ER on page ***.

When this problem occurs, the Group Services daemon exits and produces an error log entry with a LABEL of GS_TS_RETCODE_ER. This entry will have the Topology Services return code in the Detail Data field. Topology Services will produce an error log entry with a LABEL of TS_LIBERR_EM. Follow the instructions in Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

Action 8 - Investigate node crash

If a node crashes, perform AIX system dump analysis. Probable causes of this problem are:

The Dead Man Switch timer was triggered, probably because the Topology Services daemon was blocked.
An AIX-related problem.

When the node restarts, perform AIX system dump analysis. Initially, issue the following sequence in SMIT to obtain information about the dump:

smit
                  Problem Determination
                    System Dump
                      Show Information About the Previous System Dump

Then, issue this SMIT sequence to obtain information about the dump device. It is the device listed after the primary text in this sequence.

smit
                  Problem Determination
                    System Dump
                      Show Current Dump Devices

Issue the command:

/usr/sbin/crash dump_device/unix

The command will present a > prompt. Issue the stat subcommand. The output is similar to the following:

sysname: AIX
        nodename: bacchus
        release: 3
        version: 4
        machine: 000022241C00
        time of crash: Thu Mar 11 10:23:34 EST 1999
        age of system: 2 hr., 2 min.
        xmalloc debug: disabled
        abend code: 700
        csa: 0x322eb0
        exception struct:
                0x00000000 0x00000000 0x00000000 0x00000000 0x00000000
        panic:

Note the panic text. If this is absent, this dump was not caused by the "Dead Man Switch" timer trigger. If panic is present in the output, issue subcommand:

t -m

The output should be similar to the following:

Skipping first MST
 
MST STACK TRACE:
0x00322eb0 (excpt=00000000:00000000:00000000:00000000:00000000) (intpri=5)
        IAR:      .panic_trap+0 (000213a4):     teq   r1,r1
        LR:       .[haDMS_kex:dead_man_sw_handler]+54 (011572bc)
        00322dc8: .clock+f0 (0001307c)
        00322e28: .i_poll+6c (0006e58c)
        00322e78: ex_flih_rs1+e8 (000ce28c)

If the lines panic_trap and haDMS_kex:dead_man_sw_handler are present, this is a dump produced because of the Dead Man Switch timer trigger. Otherwise, there is another source for the problem. For problems unrelated to the "Dead Man Switch" timer, contact the IBM Support Center. Use the quit subcommand to exit the program. For more information about producing or saving System Dumps, see Producing a system dump.

If the dump was produced by the Dead Man Switch timer, it is likely that the problem was caused by the Topology Services daemon being blocked. HACMP/ES uses this mechanism to protect data in multi-tailed disks. When the timer is triggered, other nodes are already in the process of taking over this node's resources, since Topology Services is blocked in the node. If the node was allowed to continue functioning, both this node and the node taking over this node's disk would be concurrently accessing the disk, possibly causing data corruption.

The Dead Man Switch (DMS) timer is periodically stopped and reset by the Topology Services daemon. If the daemon gets blocked and does not have a chance to reset the timer, the timer-handling function runs, causing the node to crash. Each time the daemon resets the timer, the remaining amount left in the previous timer is stored. The smaller the remaining time, the closer the system is to triggering the timer. These "time-to-trigger" values can be retrieved with command:

/usr/sbin/rsct/bin/hatsdmsinfo

The output of this command is similar to:

      Information for Topology Services -- HACMP/ES
      DMS Trigger time: 8.000 seconds.
      Last DMS Resets                          Time to Trigger (seconds)
      11/11/99 09:21:28.272                     7.500
      11/11/99 09:21:28.772                     7.500
      11/11/99 09:21:29.272                     7.500
      11/11/99 09:21:29.772                     7.500
      11/11/99 09:21:30.272                     7.500
      11/11/99 09:21:30.782                     7.490
      
      DMS Resets with small time-to-trigger    Time to Trigger (seconds)
      Threshold value: 6.000 seconds.
      11/11/99 09:18:44.316                     5.540

If small "time-to-trigger" values are seen, the HACMP tunables described in Action 5 - Investigate hatsd problem need to be changed, and the root cause for the daemon being blocked needs to be investigated. Small time-to-trigger" values also result in an AIX error log entry with template TS_DMS_WARNING_ST. Therefore, when this error log entry appears, it indicates that the system is getting close to triggering the Dead Man Switch timer. Actions should be taken to correct the system condition that leads to the timer trigger.

For additional diagnosis and repair procedures, follow the instructions for the symptom "Node appears to go down and then up a few seconds later" Action 5 - Investigate hatsd problem, which is also related to the Topology Services daemon being blocked.

Action 9 - Investigate problems after a refresh

Probable causes of this problem are:

A refresh operation fails on the node.
Adapters are configured with an incorrect address in the registry (SDR on PSSP nodes, Global ODM on HACMP nodes).
Topology Services startup script on the control workstation fails to determine the active authentication methods or fails to determine the authentication keys.
Topology Services daemon on the nodes fails to confirm the local active authentication methods or fails to obtain the authentication keys.

Verify whether all nodes were able to complete the refresh operation, by running Operational test 8 - Check if configuration instance and security status are the same across all nodes. If this test reveals that nodes are running with different Configuration Instances (from the lssrc output), at least one node was unable to complete the refresh operation successfully.

Issue the command errpt -J "TS_*" | more on all nodes. Entry TS_SDR_ER is one of the more likely candidates. It indicates a problem while trying to obtain a copy of the machines.lst file from the SDR. The startup script log provides more details about this problem.

Other error log entries that may be present are:

TS_REFRESH_ER
TS_MACHLIST_ER
TS_LONGLINE_ER
TS_SPNODEDUP_ER or TS_HANODEDUP_ER
TS_SPIPDUP_ER or TS_HAIPDUP_ER
TS_IPADDR_ER
TS_MIGRATE_ER
TS_AUTHMETH_ER
TS_KEY_ER
TS_SECMODE_ER

For information about each error log entry and how to correct the problem, see Error information.

If a node does not respond to the command: lssrc -ls subsystem, (the command hangs), this indicates a problem in the connection between Topology Services and the AIX SRC subsystem. Such problems will also cause in the Topology Services daemon to be unable to receive the refresh request.

If no TS_ error log entry is present, and all nodes are responding to the lssrc command, and lssrc is returning different Configuration Instances for different nodes, contact the IBM Support Center.

If all nodes respond to the lssrc command, and the Configuration Instances are the same across all nodes, follow Configuration verification tests to find a possible configuration problem. Error log entry TS_MISCFG_EM is present if the adapter configuration on the SDR (for PSSP) or ODM ( for HACMP) does not match the actual address configured in the adapter.

For problems caused by loss of connection with the AIX SRC, the Topology Services subsystem may be restarted. For PSSP systems, issuing the command: /usr/sbin/rsct/bin/hatsctrl -k WILL NOT WORK because the connection with the AIX SRC subsystem was lost. To recover, perform these steps:

Issue the command:
```
ps -ef | grep hats | grep -v grep
```
to find the daemon's process_ID:
The output of the command is similar to the following:
```
root 13446  8006  0  May 27   - 26:47 /usr/sbin/rsct/bin/hatsd -n 3
```
In this example, the process_ID is 13446.
Note:
If HACMP is also running on the node, there will be two lines of output in the ps command. To find out which is the PSSP version of the daemon, the following methods can be used:
1. If the number displayed after the -n text in the output (the node number) is different on the two lines, the PSSP version of the daemon is the one where the number is the SP node number. The node's number is obtained by issuing the command:
```
/usr/lpp/ssp/install/bin/node_number
```
2. If the node numbers are the same, issue the command:
```
lssrc -s topsvcs
```
  to obtain the process id of the HACMP/ES version of the daemon. This process id should match that in one of the lines in the ps output. The PSSP version of the daemon is the other one.
Issue the command:
```
kill process_ID
```
This stops the Topology Services daemon.
If the AIX SRC subsystem does not restart the Topology Services subsystem automatically, issue this command:
```
/usr/sbin/rsct/bin/hatsctrl -s
```

For HACMP, restarting the Topology Services daemon requires shutting down the HACMP cluster on the node, which can be done with the sequence:

smit hacmp
                       Cluster Services
                           Stop Cluster Services

After HACMP is stopped, follow the instructions for PSSP on page ***, to find the process id of the Topology Services daemon and stop it, using the command:

/usr/sbin/rsct/bin/topsvcsctrl

instead of the command:

/usr/sbin/rsct/bin/hatsctrl

Now restart HACMP on the node using this sequence:

smit hacmp
                       Cluster Services
                           Start Cluster Services

Follow the procedures in Operational verification tests to ensure that the subsystem is behaving as expected across all nodes.

Note:

In the HACMP/ES environment, DO NOT STOP the Topology Services daemon by issuing any of these commands.

kill
stopsrc
topsvcsctrl -k

This is because stopping the Topology Services daemon while the cluster is up on the node results in the node being stopped by the HACMP cluster manager.

Action 10 - Correct authentication keys

The cause of this problem should be investigated thoroughly. The Topology Services key file can be re-created using the procedure described in Action 5 - Correct key files.

Alternatively, a new key can be added to the Topology Services key file, and Topology Services may be notified by a refresh command to use this new key. The procedure to add a new key is described in "The Topology Services Subsystem," section "Changing the Authentication Method and Key" of PSSP: Administration Guide.

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]