Diagnosis Guide

Diagnostic procedures

These tests verify the installation, configuration, and operation of Topology Services.

Installation verification test

This test determines whether Topology Services has been successfully installed.

Verify that RSCT has been installed. Issue the command:

lslpp -l | grep rsct

Good results are indicated by an output similar to:

  rsct.basic.hacmp    1.2.0.0  COMMITTED  RS/6000 Cluster Technology
  rsct.basic.rte      1.2.0.0  COMMITTED  RS/6000 Cluster Technology
  rsct.basic.sp       1.2.0.0  COMMITTED  RS/6000 Cluster Technology
  rsct.clients.hacmp  1.2.0.0  COMMITTED  RS/6000 Cluster Technology
  rsct.clients.rte    1.2.0.0  COMMITTED  RS/6000 Cluster Technology
  rsct.clients.sp     1.2.0.0  COMMITTED  RS/6000 Cluster Technology
  rsct.core.utils     1.2.0.0  COMMITTED  RS/6000 Cluster Technology

Error results are indicated by no output from the command.

Issue the command:
```
lppchk -c "rsct*"
```
Good results are indicated by the absence of error messages and the return of a zero exit status from this command. The command produces no output if it succeeds.
Error results are indicated by a non-zero exit code and by error messages similar to these:
```
lppchk: 0504-206  File /usr/lib/nls/msg/en_US/hats.cat could not be located.
lppchk: 0504-206  File /usr/sbin/rsct/bin/hatsoptions could not be located.
lppchk: 0504-208  Size of /usr/sbin/rsct/bin/phoenix.snap is 29356,
                      expected value was 29355.
```
Some error messages may appear if an EFIX is applied to a file set. An EFIX is an emergency fix, supplied by IBM, to correct a specific problem.

If the test fails, verify the installation of RSCT. The following file sets need to be installed:

rsct.basic.rte
rsct.core.utils
rsct.clients.rte
rsct.basic.sp
rsct.clients.sp
rsct.basic.hacmp
rsct.clients.hacmp
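
The comparison between the `lslpp -l | grep rsct` output and this list can be automated. The following is a minimal sketch, not part of RSCT: the function name `missing_filesets` is hypothetical, and treating all seven file sets as mandatory is an assumption, since a PSSP-only or HACMP-only system may not need every one of them.

```python
# Required file sets, copied from the list above. Treating all seven as
# mandatory is an assumption made for this sketch.
REQUIRED_FILESETS = [
    "rsct.basic.rte",
    "rsct.core.utils",
    "rsct.clients.rte",
    "rsct.basic.sp",
    "rsct.clients.sp",
    "rsct.basic.hacmp",
    "rsct.clients.hacmp",
]


def missing_filesets(lslpp_output, required=REQUIRED_FILESETS):
    """Return the required file sets that do not appear in `lslpp -l` output."""
    installed = set()
    for line in lslpp_output.splitlines():
        fields = line.split()
        # The file set name is the first column of each output line.
        if fields and fields[0].startswith("rsct."):
            installed.add(fields[0])
    return [name for name in required if name not in installed]


# Shortened sample in the format shown earlier in this test.
sample_output = """\
  rsct.basic.hacmp    1.2.0.0  COMMITTED  RS/6000 Cluster Technology
  rsct.basic.rte      1.2.0.0  COMMITTED  RS/6000 Cluster Technology
  rsct.core.utils     1.2.0.0  COMMITTED  RS/6000 Cluster Technology
"""
print(missing_filesets(sample_output))
```

An empty command output, which this test treats as an error result, reports every file set as missing.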

If the test succeeds, proceed to Configuration verification tests. If the test fails, check whether RSCT was installed, and install it if it was not.

Configuration verification tests

These tests verify the configuration of Topology Services.

Configuration test 1 - Verify configuration data for PSSP

This test verifies that Topology Services in PSSP has the configuration data it needs. If you are in an HACMP/ES environment, skip this test and proceed to Configuration test 3 - Verify HACMP/ES configuration data.

Issue the following commands to display data from the SDR:

SDRGetObjects Syspar
SDRGetObjects SP cw_ipaddrs
SDRGetObjects TS_Config
SDRGetObjects Adapter
SDRGetObjects -G Adapter

Good results are indicated when none of these commands returns an error message and all of them produce non-null output. SDRGetObjects Adapter must show all the adapters in the current partition, and SDRGetObjects -G Adapter must show all the adapters in the machine.

Error results are indicated if these commands fail. The SDR could be experiencing problems. Diagnose the SDR subsystem. If the commands succeed but do not show the expected information, it is possible that a problem occurred in the installation of the nodes. Verify installation of the nodes.

If the test is successful, proceed to Configuration test 2 - Check control workstation Ethernet adapter.

Configuration test 2 - Check control workstation Ethernet adapter

This test determines whether the control workstation has an Ethernet adapter that can be included in the Topology Services configuration file. On the control workstation, issue the command netstat -in, followed by the command ifconfig enn for each Ethernet adapter listed by netstat.

Verify that at least one of the "en" adapters on the control workstation is on the same subnet ID as the en0 adapter of at least one of the nodes. The subnet ID and subnet mask for the control workstation adapter can be derived from the ifconfig command output. Use this calculation:

 Subnet id = inet & netmask
 Subnet mask = netmask

where inet and netmask are given in the output of the previous ifconfig command, and "&" is the bitwise "AND" operator. Ignore ifconfig output lines that begin with inet6; those are addresses in IPv6 format.

For example, if the command ifconfig en0 produced this output:

en0:  flags=e080863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,
    GROUPRT,64BIT> inet 9.114.61.125 netmask 0xffffffc0 broadcast 9.114.61.127

calculate the Subnet ID as follows:

1. Convert the inet and netmask to hexadecimal notation. Convert each octet separately, and remove the "." separators.
   - In this example, the inet is 9.114.61.125, which converts to 0x09723D7D.
   - In this example, the netmask is 0xFFFFFFC0, which is already in hexadecimal notation. The equivalent dotted decimal form is 255.255.255.192.
2. Calculate the subnet ID = inet & netmask.
   In this example, 0x09723D7D & 0xFFFFFFC0 = 0x09723D40.
3. Convert the result back to dotted decimal form: 0x09723D40 = 9.114.61.64. This is the Subnet ID.
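
The same arithmetic can be checked with a few lines of code. This is a minimal sketch under the assumption that the address is given in dotted decimal and the netmask in either dotted decimal or the hexadecimal form printed by ifconfig. The helper names `to_int`, `to_dotted`, and `subnet_id` are hypothetical.

```python
def to_int(value):
    """Convert '9.114.61.125' or '0xffffffc0' to a 32-bit integer."""
    if value.lower().startswith("0x"):
        return int(value, 16)
    octets = [int(part) for part in value.split(".")]
    if len(octets) != 4 or any(not 0 <= octet <= 255 for octet in octets):
        raise ValueError("not a dotted decimal IPv4 address: %r" % value)
    result = 0
    for octet in octets:
        # Each octet occupies one byte, most significant first.
        result = (result << 8) | octet
    return result


def to_dotted(number):
    """Convert a 32-bit integer back to dotted decimal form."""
    return ".".join(str((number >> shift) & 0xFF) for shift in (24, 16, 8, 0))


def subnet_id(inet, netmask):
    """Subnet ID = inet & netmask, returned in dotted decimal form."""
    return to_dotted(to_int(inet) & to_int(netmask))


# Values from the ifconfig en0 example above.
print(subnet_id("9.114.61.125", "0xffffffc0"))
```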

The information about the nodes' adapters can be obtained by issuing the command SDRGetObjects -G Adapter. For each node adapter, use this calculation:

 Subnet id = netaddr & netmask
 Subnet mask = netmask

where netaddr and netmask are given in the output of the SDRGetObjects command.

Good results are indicated by the existence of at least one node where the Subnet ID / Subnet mask pairs are the same as in one of the control workstation's "en" adapters.

Error results are indicated by the absence of such a pair.
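
The comparison of Subnet ID / Subnet mask pairs can be sketched as follows. This is an illustration only: it assumes the adapter addresses and netmasks have already been copied by hand from the ifconfig and SDRGetObjects output into lists of (address, netmask) pairs, and that a netmask appears either in dotted decimal or in hexadecimal form. The function names are hypothetical.

```python
import ipaddress


def subnet_pair(address, netmask):
    """Return (subnet ID, subnet mask), both in dotted decimal form."""
    if netmask.lower().startswith("0x"):
        mask = int(netmask, 16)
    else:
        mask = int(ipaddress.IPv4Address(netmask))
    subnet = int(ipaddress.IPv4Address(address)) & mask
    return (str(ipaddress.IPv4Address(subnet)), str(ipaddress.IPv4Address(mask)))


def shared_subnets(cws_adapters, node_adapters):
    """Return the subnet pairs that appear on both the control workstation and a node."""
    cws_pairs = {subnet_pair(addr, mask) for addr, mask in cws_adapters}
    node_pairs = {subnet_pair(addr, mask) for addr, mask in node_adapters}
    return sorted(cws_pairs & node_pairs)


# Control workstation adapter from the ifconfig example; the node entries
# are illustrative values, not real SDR output.
cws = [("9.114.61.125", "0xffffffc0")]
nodes = [("9.114.61.65", "255.255.255.192"), ("9.114.61.195", "255.255.255.192")]
matches = shared_subnets(cws, nodes)
print(matches if matches else "no common subnet: error result")
```

Comparing the mask as well as the subnet ID reflects the requirement that adapters in the same subnet have the same netmask.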

If the problem is in the control workstation's or nodes' netmask, the netmask problem must be corrected. Adapters that belong to the same subnet must have the same netmask. If the problem is due to a lack of an Ethernet adapter in the control workstation which is in the same subnet as one of the nodes, this adapter must be added.

If this test succeeds, proceed to Operational verification tests.

Configuration test 3 - Verify HACMP/ES configuration data

This test verifies that Topology Services in HACMP/ES has the configuration data it needs. Configuration data is stored in the HACMP/ES Global ODM. Obtain the output of the following commands:

/usr/es/sbin/cluster/utilities/cllsif
/usr/es/sbin/cluster/utilities/clhandle
/usr/es/sbin/cluster/utilities/clhandle -a
/usr/es/sbin/cluster/utilities/cllsclstr
odmget HACMPnim
odmget HACMPtopsvcs

The output of cllsif is similar to the following:

 Adapter          Type     Network  NetType Attribute Node  IP Address Hardware Interface Global
  Name                                                                           Address  Name
 
c47n07             service  SP_ether  ether  public  c47n07  9.114.61.71  en0   glob_SP_ether
c47n07_hpsboot     boot     sw_net    hps    public  c47n07  1.1.1.1      css0 
c47n15_hpsservice  service  sw_net    hps    public          1.1.1.12     css0 
c47n13_hpsservice  service  sw_net    hps    public          1.1.1.10     css0

This output should contain all the adapters defined in the HACMP configuration. The adapter names and addresses (at least the boot and standby) should correspond to what is actually configured on the machine. Those can be obtained by issuing the netstat -in command.

The output of clhandle -a should show all the nodes configured in HACMP, while the output of clhandle should contain the local node name and number.

The output of cllsclstr should show the cluster name and id.

The output of odmget HACMPtopsvcs should be similar to the following:

HACMPtopsvcs:
                                   hbInterval = 1
                                   fibrillateCount = 4
                                   runFixedPri = 1
                                   fixedPriLevel = 38
                                   tsLogLength = 5000
                                   gsLogLength = 5000
                                   instanceNum = 2

The output of odmget HACMPnim should be similar to the following:

HACMPnim:
                                       name = "ether"
                                       desc = "Ethernet Protocol"
                                       addrtype = 0
                                       path = ""
                                       para = ""
                                       grace = 30
                                       hbrate = 500000
                                       cycle = 4
                               
HACMPnim:
                                       name = "token"
                                       desc = "Token Ring Protocol"
                                       addrtype = 0
                                       path = ""
                                       para = ""
                                       grace = 90
                                       hbrate = 500000
                                       cycle = 4

Good results are indicated by the output of these commands reflecting the desired HACMP configuration, with respect to networks, network adapters, and tunable values. In this case, proceed to Operational verification tests.
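
To compare the tunable values against the desired configuration, the odmget output can be turned into a list of dictionaries. The following sketch assumes only the stanza layout shown in the samples above (a class name followed by a colon, then name = value lines). The function name `parse_odm_stanzas` is hypothetical.

```python
def parse_odm_stanzas(text):
    """Parse odmget-style stanzas into a list of dictionaries."""
    stanzas = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A line such as "HACMPnim:" starts a new stanza.
            current = {"class": line[:-1]}
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            value = value.strip()
            if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
                current[key.strip()] = value[1:-1]
            else:
                try:
                    current[key.strip()] = int(value)
                except ValueError:
                    current[key.strip()] = value  # keep unrecognized values as text
    return stanzas


sample = """\
HACMPnim:
        name = "ether"
        grace = 30
        hbrate = 500000
        cycle = 4

HACMPnim:
        name = "token"
        grace = 90
        hbrate = 500000
        cycle = 4
"""
nims = parse_odm_stanzas(sample)
for nim in nims:
    print(nim["name"], nim["grace"], nim["hbrate"], nim["cycle"])
```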

Error results are indicated if there is any inconsistency between the displayed configuration data and the desired configuration data. In this case, the HACMP configuration has to be edited, and the Cluster Topology must be synchronized.

Operational verification tests

The following names apply to the operational verification tests in this section:

Subsystem name:
- On PSSP nodes: hats
- On the PSSP control workstation: hats.partition_name
- On HACMP nodes: topsvcs
User log file:
- On PSSP: /var/ha/log/hats.dd.hhmmss.partition_name.lang
- On HACMP nodes: /var/ha/log/topsvcs.dd.hhmmss.cluster_name.lang
Service log file:
- On PSSP: /var/ha/log/hats.dd.hhmmss.partition_name
- On HACMP nodes: /var/ha/log/topsvcs.dd.hhmmss.cluster_name
Run directory:
- On PSSP: /var/ha/run/hats.partition_name
- On HACMP nodes: /var/ha/run/topsvcs.cluster_name
machines.lst file:
- On PSSP: /var/ha/run/hats.partition_name/machines.lst
- On HACMP nodes: /var/ha/run/topsvcs.cluster_name/machines.cluster_id.lst

Operational test 1 - Verify status and adapters

This test verifies that Topology Services is working and that all the adapters are up. Issue the following command:

lssrc -ls subsystem_name

Good results are indicated by an output similar to the following:

        Subsystem         Group            PID     Status 
         hats             hats             20494   active
        Network Name   Indx Defd Mbrs St Adapter ID      Group ID
        SPether        [ 0]   15   15  S 9.114.61.195    9.114.61.195   
        SPether        [ 0] en0          0x3740dd5c      0x3740dd62
        HB Interval = 1 secs. Sensitivity = 4 missed beats
        SPswitch       [ 1]   14   14  S 9.114.61.139    9.114.61.139   
        SPswitch       [ 1] css0         0x3740dd5d      0x3740dd62
        HB Interval = 1 secs. Sensitivity = 4 missed beats
          Configuration Instance = 926566126
          Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
          control workstation IP address = 9.114.61.125
          Daemon employs no security
          Data segment size: 6358 KB. Number of outstanding malloc: 588
          Number of nodes up: 15. Number of nodes down: 0.

If the number under the Mbrs heading is the same as the number under Defd, all adapters defined in the configuration are part of the adapter membership group. The numbers under the Group ID heading should remain the same over subsequent invocations of lssrc several seconds apart. This is the expected behavior of the subsystem.

Error results are indicated by outputs similar to the following:

1. 0513-036 The request could not be passed to the hats subsystem. Start the subsystem and try your command again.

   In this case, the subsystem is down. Issue the errpt command and look for an entry for the subsystem name. Proceed to Operational test 2 - Determine why the Topology Services subsystem is inactive.

2. 0513-085 The hats Subsystem is not on file.

   The subsystem is not defined to the AIX SRC. In PSSP, the partition-sensitive subsystems may have been undefined by the syspar_ctrl command. The same command may be used to add the subsystems to the node. In HACMP/ES, HACMP may not have been installed on the node. Check the HACMP subsystem.

3. This output requires investigation because the number under Mbrs is smaller than the number under Defd.

        Subsystem         Group            PID     Status
         hats             hats             20494   active
        Network Name   Indx Defd Mbrs St Adapter ID      Group ID
        SPether        [ 0]   15    8  S 9.114.61.195    9.114.61.195
        SPether        [ 0] en0          0x3740dd5c      0x3740dd62
        HB Interval = 1 secs. Sensitivity = 4 missed beats
        SPswitch       [ 1]   14    7  S 9.114.61.139    9.114.61.139
        SPswitch       [ 1] css0         0x3740dd5d      0x3740dd62
        HB Interval = 1 secs. Sensitivity = 4 missed beats
          Configuration Instance = 926566126
          Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
          control workstation IP address = 9.114.61.125
          Daemon employs no security
          Data segment size: 6358 KB. Number of outstanding malloc: 588
          Number of nodes up: 8. Number of nodes down: 7.
          Nodes down: 17-29(2)

Some remote adapters are not part of the local adapter's group. Proceed to Operational test 3 - Determine why remote adapters are not in the local adapter's membership group.

4. This output requires investigation because a local adapter is disabled.

        Subsystem         Group            PID     Status
         hats             hats             20494   active
        Network Name   Indx Defd Mbrs St Adapter ID      Group ID
        SPether        [ 0]   15   15  S 9.114.61.195    9.114.61.195
        SPether        [ 0] en0          0x3740dd5c      0x3740dd62
        HB Interval = 1 secs. Sensitivity = 4 missed beats
        SPswitch       [ 1]   14    0  D 9.114.61.139
        SPswitch       [ 1] css0
        HB Interval = 1 secs. Sensitivity = 4 missed beats
          Configuration Instance = 926566126
          Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
          control workstation IP address = 9.114.61.125
          Daemon employs no security
          Data segment size: 6358 KB. Number of outstanding malloc: 588
          Number of nodes up: 15. Number of nodes down: 0.

A local adapter is disabled. Proceed to Operational test 4 - Check address of local adapter.

5. This output requires investigation because there is a U below the St heading.
```
        Subsystem         Group            PID     Status
         hats             hats             20494   active
        Network Name   Indx Defd Mbrs St Adapter ID      Group ID
        SPether        [ 0]   15    8  S 9.114.61.195    9.114.61.195
        SPether        [ 0] en0          0x3740dd5c      0x3740dd62
        HB Interval = 1 secs. Sensitivity = 4 missed beats
        SPswitch       [ 1]   14    1  U 9.114.61.139    9.114.61.139
        SPswitch       [ 1] css0         0x3740dd5d      0x3740dd5d
        HB Interval = 1 secs. Sensitivity = 4 missed beats
          Configuration Instance = 926566126
          Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
          control workstation IP address = 9.114.61.125
          Daemon employs no security
          Data segment size: 6358 KB. Number of outstanding malloc: 588
          Number of nodes up: 8. Number of nodes down: 7.
          Nodes down: 17-29(2)
```
The last line of the output lists either the nodes that are up or the nodes that are down, whichever list is shorter. The list of nodes that are down includes only the nodes that are configured and have at least one adapter that Topology Services monitors. Nodes are specified by a list of node ranges, as follows:
```
N1-N2(I1)  N3-N4(I2) ...
```
Here, there are two ranges, N1-N2(I1) and N3-N4(I2). They are interpreted as follows:
- N1 is the first node in the first range
- N2 is the last node in the first range
- I1 is the increment for the first range
- N3 is the first node in the second range
- N4 is the last node in the second range
- I2 is the increment for the second range
If the increment is 1, it is omitted. If the range has only one node, only that node's number is displayed. Examples are:
1. Nodes down: 17-29(2) means that every second node from 17 through 29 is down. In other words, nodes 17, 19, 21, 23, 25, 27, and 29 are down.
2. Nodes up: 5-9(2) 13 means that nodes 5, 7, 9, and 13 are up.
3. Nodes up: 5-9 13-21(4) means that nodes 5, 6, 7, 8, 9, 13, 17, and 21 are up.
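
The range notation can be expanded mechanically. The sketch below follows only the rules stated above (an optional last node, an optional increment in parentheses, ranges separated by blanks). The function name `expand_node_list` is hypothetical.

```python
import re

# One range: first node, optional "-last", optional "(increment)".
NODE_RANGE = re.compile(r"^(\d+)(?:-(\d+))?(?:\((\d+)\))?$")


def expand_node_list(text):
    """Expand a list such as '5-9 13-21(4)' into individual node numbers."""
    nodes = []
    for token in text.split():
        match = NODE_RANGE.match(token)
        if match is None:
            raise ValueError("unrecognized node range: %r" % token)
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        step = int(match.group(3)) if match.group(3) else 1  # omitted means 1
        nodes.extend(range(first, last + 1, step))
    return nodes


for example in ("17-29(2)", "5-9(2) 13", "5-9 13-21(4)"):
    print(example, "->", expand_node_list(example))
```
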
The adapter stays in a singleton unstable membership group. This normally occurs for a few seconds after the daemon starts or after the adapter is re-enabled. If the situation persists for more than one minute, it may indicate a problem, usually that the local adapter is receiving some messages but is unable to obtain responses to its outgoing messages. Proceed to Operational test 7 - Check for partial connectivity.

6. The output is similar to the expected output, or similar to output 3, but the numbers under the Group ID heading (either the address of the Group Leader adapter or the "incarnation number" of the group) change every few seconds without ever becoming stable.

This kind of output indicates that there is some partial connectivity on the network. Some adapters may be able to communicate only with a subset of adapters. Some adapters may be able only to send messages or only to receive messages. The adapter membership groups are constantly reforming, causing a substantial increase in the CPU and network resources used by the subsystem.

A partial connectivity situation is preventing the adapter membership group from holding together. Proceed to Operational test 10 - Check neighboring adapter connectivity.
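
The per-network checks in this test (the St column and Mbrs compared with Defd) can be sketched in code. This is an illustration that assumes the line layout of the sample outputs above; the function name `classify_networks` and the result labels are hypothetical. It does not detect a changing Group ID, which requires comparing several invocations taken a few seconds apart.

```python
import re

# Matches the first line of each network entry, for example:
#   SPswitch       [ 1]   14    1  U 9.114.61.139    9.114.61.139
# The Group ID column is optional because it is blank for a disabled adapter.
NETWORK_LINE = re.compile(
    r"^\s*(?P<name>\S+)\s+\[\s*(?P<indx>\d+)\]\s+(?P<defd>\d+)\s+(?P<mbrs>\d+)"
    r"\s+(?P<st>\S+)\s+(?P<adapter>[\d.]+)(?:\s+(?P<group>[\d.]+))?\s*$"
)


def classify_networks(lssrc_output):
    """Return (network name, finding) pairs for each network in the output."""
    findings = []
    for line in lssrc_output.splitlines():
        match = NETWORK_LINE.match(line)
        if match is None:
            continue  # header, adapter-name, and summary lines
        defd = int(match.group("defd"))
        mbrs = int(match.group("mbrs"))
        state = match.group("st")
        if state == "D":
            finding = "disabled"          # see Operational test 4
        elif state == "U":
            finding = "unstable"          # see Operational test 7
        elif mbrs < defd:
            finding = "members missing"   # see Operational test 3
        else:
            finding = "ok"
        findings.append((match.group("name"), finding))
    return findings


sample = """\
Subsystem         Group            PID     Status
 hats             hats             20494   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [ 0]   15    8  S 9.114.61.195    9.114.61.195
SPether        [ 0] en0          0x3740dd5c      0x3740dd62
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [ 1]   14    1  U 9.114.61.139    9.114.61.139
SPswitch       [ 1] css0         0x3740dd5d      0x3740dd5d
HB Interval = 1 secs. Sensitivity = 4 missed beats
"""
results = classify_networks(sample)
print(results)
```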

If this test is successful, proceed to Operational test 11 - Verify node reachability information.

Operational test 2 - Determine why the Topology Services subsystem is inactive

This test determines why the Topology Services subsystem is not active.

For PSSP, issue the command: errpt -N "hats*" -a
For HACMP/ES, issue the command: errpt -N topsvcs -a

The AIX error log entries produced by this command, together with their description in Table 58, explain why the subsystem is inactive. If no entry explains why the subsystem went down or could not start, the daemon may have exited abnormally.

In this case, issue the errpt -a command and look for an error entry with a LABEL: of CORE_DUMP and a PROGRAM NAME of hatsd. (Issue the command: errpt -J CORE_DUMP -a.) If such an entry is found, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

Another possibility, when there is no TS_ error log entry, is that the Topology Services daemon could not be loaded. In this case, a message similar to the following may be present in the Topology Services User startup log:

 0509-036 Cannot load program hatsd because of the following errors:
 0509-023 Symbol dms_debug_tag in hatsd is not defined.
 0509-026 System error: Cannot run a file that does not have a valid format.

The message may refer to the Topology Services daemon, or to some other program invoked by the startup script hats. If such an error is found, contact the IBM Support Center.

For errors where the daemon did start up but exited during initialization, detailed information about the problem is in the Topology Services User error log.

Operational test 3 - Determine why remote adapters are not in the local adapter's membership group

Issue the command:

lssrc -ls subsystem

on all the nodes and the PSSP control workstation. The command:

dsh -a "lssrc -ls subsystem"

issued from the control workstation can be used to issue lssrc on all the nodes.

If this test follows output 3, at least one node will not have the same output as the node from where output 3 was taken.

Some of the possibilities are:

The node is down or unreachable. Diagnose that node by using Operational test 1 - Verify status and adapters.

The output is similar to output 3, but with a different Group ID, such as in this output:

        Subsystem         Group            PID     Status
         hats             hats             20494   active
        Network Name   Indx Defd Mbrs St Adapter ID      Group ID
        SPether        [ 0]   15    7  S 9.114.61.199    9.114.61.201
        SPether        [ 0] en0          0x3740dd5c      0x3740dd72
        HB Interval = 1 secs. Sensitivity = 4 missed beats
        SPswitch       [ 1]   14    7  S 9.114.61.141    9.114.61.141
        SPswitch       [ 1] css0         0x3740dd5d      0x3740dd72
        HB Interval = 1 secs. Sensitivity = 4 missed beats
          Configuration Instance = 926566126
          Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
          Control Workstation IP address = 9.114.61.125
          Daemon employs no security
          Data segment size: 6358 KB. Number of outstanding malloc: 588
          Number of nodes up: 7. Number of nodes down: 8.
          Nodes up: 17-29(2)

Compare this with output 3. Proceed to Operational test 8 - Check if configuration instance and security status are the same across all nodes.

The output is similar to output 1, 2, 4, or 5. Return to Operational test 1 - Verify status and adapters, but this time focus on this new node.

Operational test 4 - Check address of local adapter

This test verifies whether a local adapter is configured with the correct address. Assuming that this test is being run because the output of the lssrc command indicates that the adapter is disabled, there should be an entry in the AIX error log that points to the problem.

Issue the command:

errpt -J TS_LOC_DOWN_ST,TS_MISCFG_EM -a | more

Examples of the error log entries that appear in the output are:

      LABEL:          TS_LOC_DOWN_ST
      IDENTIFIER:     D17E7B06
 
      Date/Time:       Mon May 17 23:29:34 
      Sequence Number: 227
      Machine Id:      000032054C00
      Node Id:         c47n11
      Class:           S
      Type:            INFO
      Resource Name:   hats.c47s
 
      Description
      Possible malfunction on local adapter

      LABEL:          TS_MISCFG_EM
      IDENTIFIER:     6EA7FC9E
 
      Date/Time:       Mon May 17 16:28:45 
      Sequence Number: 222
      Machine Id:      000032054C00
      Node Id:         c47n11
      Class:           U
      Type:            PEND
      Resource Name:   hats.c47s
      Resource Class:  NONE
      Resource Type:   NONE
      Location:        NONE
      VPD:             
 
      Description
      Local adapter misconfiguration detected

Good results are indicated by the absence of the TS_MISCFG_EM error entry. To verify that the local adapter has the expected address, issue the command:

ifconfig interface_name

where interface_name is the interface name listed on the output of lssrc, such as:

        SPswitch       [ 1]   14    0  D 9.114.61.139
        SPswitch       [ 1] css0

For the lssrc command output above, the output of ifconfig css0 is similar to:

css0: flags=800847 <UP,BROADCAST,DEBUG,RUNNING,SIMPLEX>
        inet 9.114.61.139 netmask 0xffffffc0 broadcast 9.114.61.191

Error results are indicated by the TS_MISCFG_EM error entry and by the output of the ifconfig command not containing the address displayed in the lssrc command output.

Diagnose why the adapter is configured with an incorrect address. For PSSP, the adapter may have been incorrectly configured in the SDR, or the adapter's address may have been incorrectly set manually. For HACMP, the cluster on the node may have been stopped with the "Forced Down" option. The adapters must be configured with their boot-time addresses before the cluster can be started on a node. This can be done by issuing the command:

/etc/rc.net -boot

several times in succession. Issuing the command only once may not set all IP routes correctly.

If this test succeeds, proceed to Operational test 5 - Check if the adapter is enabled for IP.

Operational test 5 - Check if the adapter is enabled for IP

Issue the command:

ifconfig interface_name

The output is similar to the following:

 css0: flags=800847 <UP,BROADCAST,DEBUG,RUNNING,SIMPLEX>
         inet 9.114.61.139 netmask 0xffffffc0 broadcast 9.114.61.191

Good results are indicated by the presence of the UP string in the first line of the output. In this case, proceed to Operational test 6 - Check whether the adapter can communicate with other adapters in the network.

Error results are indicated by the absence of the UP string in the first line of the output.

Issue the command:

ifconfig interface_name up

to re-enable the adapter for IP.
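
The check for the UP string can be sketched as follows, assuming the flag list appears between angle brackets as in the ifconfig samples in this section. The function names are hypothetical.

```python
import re


def interface_flags(ifconfig_output):
    """Return the set of flag names listed between '<' and '>'."""
    match = re.search(r"<([^>]*)>", ifconfig_output)
    if match is None:
        return set()
    # The flag list may wrap across lines, so strip whitespace from each name.
    return {flag.strip() for flag in match.group(1).split(",") if flag.strip()}


def is_up(ifconfig_output):
    """True if UP appears as a flag of its own."""
    return "UP" in interface_flags(ifconfig_output)


sample = """\
css0: flags=800847 <UP,BROADCAST,DEBUG,RUNNING,SIMPLEX>
        inet 9.114.61.139 netmask 0xffffffc0 broadcast 9.114.61.191
"""
print(is_up(sample))
```

Matching whole flag names rather than searching for the substring "UP" matters because another flag, GROUPRT, contains the letters "UP".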

Operational test 6 - Check whether the adapter can communicate with other adapters in the network

Root authority is needed to access the contents of the machines.lst file. Display the contents of the machines.lst file. The output is similar to the following:

            *InstanceNumber=925928580
            *configId=1244520230
            *!HaTsSeCStatus=off
            *FileVersion=1
            *!TS_realm=PSSP
            TS_Frequency=1
            TS_Sensitivity=4
            TS_FixedPriority=38
            TS_LogLength=5000
            *!TS_PinText
            Network Name SPether
            Network Type ether
            *
            *Node Type Address
                0 en0 9.114.61.125
                1 en0  9.114.61.65
                3 en0  9.114.61.67
                11 en0  9.114.61.195
            ...
            Network Name SPswitch
            Network Type hps
            *
            *Node Type Address
                1 css0 9.114.61.129
                3 css0 9.114.61.131
                11 css0 9.114.61.139

Locate the network to which the adapter under investigation belongs. For example, the css0 adapter on node 11 belongs to network SPswitch. Issue the command:

ping -c 5 address

for the addresses listed in the machines.lst file.
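
Collecting the addresses to ping can be sketched as follows. The parser assumes only the machines.lst layout shown in the sample above: lines beginning with "*" are skipped, "Network Name" starts a network, and adapter lines have three fields (node number, interface, address). The function names are hypothetical.

```python
def parse_machines_lst(text):
    """Return {network name: [(node number, interface, address), ...]}."""
    networks = {}
    current = None
    for raw_line in text.splitlines():
        fields = raw_line.split()
        if not fields or fields[0].startswith("*"):
            continue
        if fields[:2] == ["Network", "Name"] and len(fields) >= 3:
            current = networks.setdefault(fields[2], [])
        elif len(fields) == 3 and fields[0].isdigit() and current is not None:
            current.append((int(fields[0]), fields[1], fields[2]))
    return networks


def peer_addresses(networks, network_name, local_node):
    """Addresses of the other adapters in the same network."""
    return [addr for node, _, addr in networks[network_name] if node != local_node]


sample = """\
*InstanceNumber=925928580
TS_Frequency=1
Network Name SPether
Network Type ether
*
*Node Type Address
    0 en0 9.114.61.125
    1 en0  9.114.61.65
Network Name SPswitch
Network Type hps
*
*Node Type Address
    1 css0 9.114.61.129
    3 css0 9.114.61.131
    11 css0 9.114.61.139
"""
networks = parse_machines_lst(sample)
print(peer_addresses(networks, "SPswitch", 11))
```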

Good results are indicated by outputs similar to the following:

            PING 9.114.61.129: (9.114.61.129): 56 data bytes
            64 bytes from 9.114.61.129: icmp_seq=0 ttl=255 time=0 ms
            64 bytes from 9.114.61.129: icmp_seq=1 ttl=255 time=0 ms
            64 bytes from 9.114.61.129: icmp_seq=2 ttl=255 time=0 ms
            64 bytes from 9.114.61.129: icmp_seq=3 ttl=255 time=0 ms
            64 bytes from 9.114.61.129: icmp_seq=4 ttl=255 time=0 ms
 
            ----9.114.61.129 PING Statistics----
            5 packets transmitted, 5 packets received, 0% packet loss
            round-trip min/avg/max = 0/0/0 ms

The number before packets received should be greater than 0.

Error results are indicated by outputs similar to the following:

            PING 9.114.61.129: (9.114.61.129): 56 data bytes
 
            ----9.114.61.129 PING Statistics----
            5 packets transmitted, 0 packets received, 100% packet loss
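
The pass/fail decision on the ping summary line can be sketched as follows. It assumes the wording of the statistics line shown in the samples above; the function names are hypothetical.

```python
import re

PING_STATS = re.compile(r"(\d+) packets transmitted, (\d+) packets received")


def packets_received(ping_output):
    """Return the number of packets received, or None if no summary line is found."""
    match = PING_STATS.search(ping_output)
    return int(match.group(2)) if match else None


def ping_succeeded(ping_output):
    """Good result: the number before 'packets received' is greater than 0."""
    received = packets_received(ping_output)
    return received is not None and received > 0


good = """\
----9.114.61.129 PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0/0/0 ms
"""
bad = """\
----9.114.61.129 PING Statistics----
5 packets transmitted, 0 packets received, 100% packet loss
"""
print(ping_succeeded(good), ping_succeeded(bad))
```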

The command should be repeated with different addresses until it succeeds or until several different attempts are made. After that, pursue the problem as an adapter or IP-related problem. If the adapter is an SP Switch adapter, refer to Diagnosing SP Switch problems. If the adapter is an SP Switch2 adapter, refer to Diagnosing SP Switch2 problems.

If this test succeeds, but the adapter is still listed as disabled in the lssrc command output, collect the data listed in Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

Operational test 7 - Check for partial connectivity

Adapters stay in a singleton unstable state when there is partial connectivity between two adapters. One reason for an adapter to stay in this state is that it keeps receiving PROCLAIM messages, to which it responds with a JOIN message, but no PTC message comes in response to the JOIN message.

Check in the User log file to see if a message similar to the following appears repeatedly:

2523-097 JOIN time has expired. PROCLAIM message was sent
                   by (10.50.190.98:0x473c6669)

If this message appears repeatedly in the User log, investigate IP connectivity between the local adapter and the adapter whose address is listed in the User log entry (10.50.190.98 in the example here). Issue the command:

ping -c 5 address

where address is 10.50.190.98 in this example.
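
Finding which adapters send the PROCLAIM messages, and how often, can be sketched as follows. The pattern assumes the 2523-097 message wording shown above, including the break before "by"; the function name `proclaimers` is hypothetical.

```python
import re
from collections import Counter

# The message may wrap before "by", so allow any whitespace there.
JOIN_EXPIRED = re.compile(
    r"2523-097 JOIN time has expired\. PROCLAIM message was sent\s+"
    r"by \((\d+\.\d+\.\d+\.\d+):0x[0-9a-fA-F]+\)"
)


def proclaimers(user_log_text):
    """Count 2523-097 messages per proclaimer address."""
    return Counter(JOIN_EXPIRED.findall(user_log_text))


message = (
    "2523-097 JOIN time has expired. PROCLAIM message was sent\n"
    "                   by (10.50.190.98:0x473c6669)\n"
)
sample_log = message * 3  # the same message appearing repeatedly
counts = proclaimers(sample_log)
for address, count in counts.most_common():
    print(address, count)
```

The addresses reported this way are the ones to use with the ping command.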

See Operational test 6 - Check whether the adapter can communicate with other adapters in the network for a description of good results for this command.

If the ping command fails, the local adapter cannot communicate with a Group Leader that is attempting to attract the local adapter into the adapter membership group. The problem may be with either the local adapter or the Group Leader adapter ("proclaimer" adapter). Pursue this as an IP connectivity problem. Focus on both the local adapter and the Group Leader adapter. See Diagnosing IP routing problems.

If the ping command succeeds, but the local adapter still stays in the singleton unstable state, contact the IBM Support Center.

In an HACMP/ES environment, it is possible that there are two adapters in different nodes both having the same service address. This can be verified by issuing:

lssrc -ls subsystem_name

and looking for two different nodes that have the same IP address portion of Adapter ID. In this case, this problem should be pursued as an HACMP/ES problem. Contact the IBM Support Center.

If this test fails, proceed to Operational test 4 - Check address of local adapter, concentrating on the local and Group Leader adapters.

Operational test 8 - Check if configuration instance and security status are the same across all nodes

This test is used when there seem to be multiple partitioned adapter membership groups across the nodes, as in output 2.

This test verifies whether all nodes are using the same configuration instance number and same security setting. The instance number changes each time the machines.lst file is generated by the startup script. In PSSP, the configuration instance always increases. In HACMP/ES, the configuration instance number normally increases, unless a snapshot of a previous configuration is applied.

Issue the command:

lssrc -ls subsystem_name

on all nodes. If this is not feasible, issue the command at least on nodes that produce an output that shows a different Group ID.

Compare the line Configuration Instance = (number) in the lssrc outputs. Also, compare the line Daemon employs in the lssrc command outputs.

Good results are indicated by the number after the Configuration Instance phrase being the same in all the lssrc outputs. This means that all nodes are working with the same version of the machines.lst file.

Error results are indicated by the configuration instance being different in the two "node partitions" (this is unrelated to the SP system partitions). In this case, the adapters in the two partitions cannot merge into a single group because the configuration instances are different across the node partitions. This situation is likely to be caused by a refresh-related problem. One of the node groups, probably that with the lower configuration instance, was unable to run a refresh. If a refresh operation was indeed attempted, consult the description of the "Nodes or adapters leave membership after refresh" problem in Error symptoms, responses, and recoveries.

The situation may be caused by a problem in the AIX SRC subsystem, which fails to notify the Topology Services daemon about the refresh. The description of the "Nodes or adapters leave membership after refresh" problem in Error symptoms, responses, and recoveries explains how to detect the situation where the Topology Services daemon has lost its connection with the AIX SRC subsystem. In this case, contact the IBM Support Center.

If the security setting is not the same on all the nodes in a partition, some of the nodes may fail to authenticate each other's messages. AIX error log entries with labels TS_SECURITY_ST and TS_SECURITY2_ST may appear on those nodes. For more information, see the descriptions of the TS_SECURITY_ST and TS_SECURITY2_ST error log entries.
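
The comparison across nodes can be sketched as follows. The sketch assumes the lssrc output of each node has already been collected into a dictionary keyed by node name, and that the "Configuration Instance =" and "Daemon employs" lines appear as in the sample outputs. The node names, the second instance number, and the function names are illustrative.

```python
import re
from collections import defaultdict


def config_signature(lssrc_output):
    """Return (configuration instance, security setting) for one node."""
    instance = re.search(r"Configuration Instance = (\d+)", lssrc_output)
    security = re.search(r"Daemon employs (.+)", lssrc_output)
    return (
        int(instance.group(1)) if instance else None,
        security.group(1).strip() if security else None,
    )


def group_by_signature(outputs_by_node):
    """Group node names by signature; more than one group is an error result."""
    groups = defaultdict(list)
    for node, output in outputs_by_node.items():
        groups[config_signature(output)].append(node)
    return dict(groups)


# Illustrative data: node names and the second instance number are made up.
outputs = {
    "c47n01": "  Configuration Instance = 926566126\n  Daemon employs no security\n",
    "c47n02": "  Configuration Instance = 926566126\n  Daemon employs no security\n",
    "c47n03": "  Configuration Instance = 926566000\n  Daemon employs no security\n",
}
groups = group_by_signature(outputs)
for signature, nodes in groups.items():
    print(signature, nodes)
```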

If this test is successful, proceed to Operational test 9 - Check connectivity among multiple node partitions.

Operational test 9 - Check connectivity among multiple node partitions

This test is used when adapters in the same Topology Services network form multiple adapter membership groups, rather than a single group encompassing all the adapters in the network.

Follow the instructions in Operational test 8 - Check if configuration instance and security status are the same across all nodes to obtain lssrc outputs for each of the node partitions.

The IP address listed in the lssrc command output under the Group ID heading is the IP address of the Group Leader. If two node partitions are unable to merge into one, this is caused by the two Group Leaders being unable to communicate with each other. Note that even if some adapters in different partitions can communicate, the group merge will not occur unless the Group Leaders are able to exchange point-to-point messages. Use ping (as described in Operational test 6 - Check whether the adapter can communicate with other adapters in the network) to determine whether the Group Leaders can communicate with each other.

For example, assume on one node the output of the lssrc -ls hats command is:

     Subsystem         Group            PID     Status 
      hats             hats             15750   active
     Network Name   Indx Defd Mbrs St Adapter ID      Group ID
     SPether        [0]   15    9   S 9.114.61.65     9.114.61.195   
     SPether        [0]              0x373897d2      0x3745968b
     HB Interval = 1 secs. Sensitivity = 4 missed beats
     SPswitch       [1]   14    14  S 9.114.61.129    9.114.61.153   
     SPswitch       [1]              0x37430634      0x374305f1
     HB Interval = 1 secs. Sensitivity = 4 missed beats

and on another node it is:

     Subsystem         Group            PID     Status 
      hats             hats             13694   active
     Network Name   Indx Defd Mbrs St Adapter ID      Group ID
     SPether        [0]  15    6  S 9.114.30.69     9.114.61.71   
     SPether        [0]              0x37441f24      0x37459754
     HB Interval = 1 secs. Sensitivity = 4 missed beats
     SPswitch       [1]   14   14  S 9.114.61.149    9.114.61.153   
     SPswitch       [1]             0x374306a4      0x374305f1

In this example, the partition is occurring in the SP Ethernet: the two nodes report different Group IDs for SPether but the same Group ID for SPswitch. The two Group Leaders are IP addresses 9.114.61.195 and 9.114.61.71. Log in to the node that hosts one of the IP addresses and issue the ping test to the other address. If the two adapters in question are in the same subnet, verify whether they have the same subnet mask. Configuration test 2 - Check control workstation Ethernet adapter describes how to obtain the subnet id and subnet mask for an adapter.
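Spotting the partitioned network by eye becomes tedious when there are many networks. The following Python sketch extracts the Group Leader address for each network from two lssrc outputs and reports the networks where the leaders differ. The function names are hypothetical, and the regular expression assumes the row layout shown in the example outputs above.

```python
import re

# A network row: name, [index], defined count, member count, state,
# adapter IP address, Group Leader IP address. The hexadecimal rows and
# the "HB Interval" rows do not match and are skipped.
ROW_RE = re.compile(
    r"^\s*(\S+)\s+\[\s*\d+\]\s+\d+\s+\d+\s+\S\s+"
    r"(\d+\.\d+\.\d+\.\d+)\s+(\d+\.\d+\.\d+\.\d+)\s*$"
)

def group_leaders(lssrc_text):
    """Return {network name: Group Leader IP} from an lssrc -ls output."""
    leaders = {}
    for line in lssrc_text.splitlines():
        match = ROW_RE.match(line)
        if match:
            leaders[match.group(1)] = match.group(3)
    return leaders

def partitioned_networks(text_a, text_b):
    """Return {network: (leader seen by A, leader seen by B)} where they differ."""
    a, b = group_leaders(text_a), group_leaders(text_b)
    return {net: (a[net], b[net]) for net in a if net in b and a[net] != b[net]}

node_a = """\
SPether        [0]   15    9   S 9.114.61.65     9.114.61.195
SPether        [0]              0x373897d2      0x3745968b
SPswitch       [1]   14    14  S 9.114.61.129    9.114.61.153
"""
node_b = """\
SPether        [0]  15    6  S 9.114.30.69     9.114.61.71
SPether        [0]              0x37441f24      0x37459754
SPswitch       [1]   14   14  S 9.114.61.149    9.114.61.153
"""
print(partitioned_networks(node_a, node_b))
# {'SPether': ('9.114.61.195', '9.114.61.71')}
```

The two addresses reported for a network are the Group Leaders to use as the source and target of the ping test.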

Good results and error results for the ping test are described in Operational test 6 - Check whether the adapter can communicate with other adapters in the network. If the ping test is not successful, a network connectivity problem between the two Group Leader nodes is preventing the groups from merging. Diagnose the network connectivity problem. See Diagnosing system connectivity problems.

Good results for the subnet mask test are indicated by the adapters that have the same subnet id also having the same subnet mask. If the subnet mask test fails, the subnet mask at one or more nodes must be corrected by issuing the command:

ifconfig interface_name address netmask netmask

All the adapters that belong to the same subnet must have the same subnet mask.
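The subnet mask check can be sketched with the Python standard library ipaddress module. The example below flags pairs of adapters whose address ranges overlap while their subnet masks differ, which is one way to detect adapters in the same subnet with inconsistent masks. The function name, the adapter labels, and the masks in the sample data are assumptions for illustration, not values from a real configuration.

```python
import ipaddress

def mask_conflicts(adapters):
    """Find adapters whose networks overlap but whose netmasks differ.

    adapters: list of (label, address, netmask) tuples (hypothetical input).
    Returns a list of (label_a, label_b) pairs that need attention.
    """
    # Derive each adapter's network from its own address and netmask.
    nets = [
        (label, ipaddress.ip_interface(f"{addr}/{mask}").network)
        for label, addr, mask in adapters
    ]
    conflicts = []
    for i, (label_a, net_a) in enumerate(nets):
        for label_b, net_b in nets[i + 1:]:
            if net_a.overlaps(net_b) and net_a.netmask != net_b.netmask:
                conflicts.append((label_a, label_b))
    return conflicts

# Illustrative masks only.
sample = [
    ("nodeA", "9.114.61.65", "255.255.255.192"),
    ("nodeB", "9.114.61.71", "255.255.255.128"),
    ("nodeC", "9.114.61.129", "255.255.255.192"),
]
print(mask_conflicts(sample))
# [('nodeA', 'nodeB')]
```

An empty list corresponds to good results for the subnet mask test.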

If the ping test is successful (the number of packets received is greater than 0), and the subnet masks match, there is some factor other than network connectivity preventing the two Group Leaders from contacting each other. The cause of the problem may be identified by entries in the Topology Services User log. If the problem persists, collect the data listed in Information to collect before contacting the IBM Support Center and contact the IBM Support Center. Include information about the two Group Leader nodes.

Operational test 10 - Check neighboring adapter connectivity

This test checks neighboring adapter connectivity, in order to investigate partial connectivity situations. Issue the command errpt -J TS_DEATH_TR | more on all the nodes. Look for recent entries with label TS_DEATH_TR. This is the entry created by the subsystem when the local adapter stops receiving heartbeat messages from the neighboring adapter. If the adapter membership groups are constantly reforming, such entries are expected to be found in the error log.

Issue the ping test on the node where the TS_DEATH_TR entry exists. The target of the ping should be the adapter whose address is listed in the Detail Data of the AIX error log entry. Operational test 6 - Check whether the adapter can communicate with other adapters in the network describes how to perform the ping test and interpret the results.

If the ping test fails, this means that the two neighboring adapters have connectivity problems, and the problem should be pursued as an IP connectivity problem.

If the ping test is successful, the problem is probably not due to lack of connectivity between the two neighboring adapters. The problem may be due to one of the two adapters not receiving the COMMIT message from the "mayor adapter" when the group is formed. The ping test should be used to probe the connectivity between the two adapters and all other adapters in the local subnet.

Operational test 11 - Verify node reachability information

Issue the following command:

lssrc -ls subsystem_name

and examine lines:

Number of nodes up: # . Number of nodes down: #.
Nodes down: [...] or Nodes up: [...]

in the command output.

Good results are indicated by the line Number of nodes down: 0. For example:

Number of nodes up: 15    Number of nodes down: 0

However, such output can only be considered correct if all nodes in the system are indeed known to be up. If a given node is indicated as being up, but the node seems unresponsive, perform problem determination on the node. Proceed to Operational test 12 - Verify the status of an unresponsive node that is shown to be up by Topology Services.
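When this check is run across many nodes, the two counts can be pulled out of the saved lssrc output programmatically. This is a minimal Python sketch. The function name is hypothetical, and the pattern tolerates the optional period between the two counts because the text above shows the line both with and without it.

```python
import re

# Matches "Number of nodes up: N" followed by "Number of nodes down: M",
# with or without a period between the two counts.
COUNTS_RE = re.compile(
    r"Number of nodes up:\s*(\d+)\s*\.?\s*Number of nodes down:\s*(\d+)",
    re.IGNORECASE,
)

def node_counts(lssrc_text):
    """Return (nodes_up, nodes_down) from an lssrc -ls output, or None if absent."""
    match = COUNTS_RE.search(lssrc_text)
    return (int(match.group(1)), int(match.group(2))) if match else None

up, down = node_counts("Number of nodes up: 15    Number of nodes down: 0")
print(f"up={up} down={down}")
# up=15 down=0
```

A zero down count still has to be checked against what is known about the system, as described above.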

Error results are indicated by Number of nodes down: being nonzero. The list of nodes that are flagged as being up or down is given in the next output line. An output such as Nodes down: 17-23(2) indicates that nodes 17, 19, 21, and 23 are considered down by Topology Services; the number in parentheses is the increment between consecutive node numbers in the range. If the nodes in the list are known to be down, this is the expected output. If, however, some of the nodes are thought to be up, it is possible that a problem exists with the Topology Services subsystem on these nodes. Proceed to Operational test 1 - Verify status and adapters, focusing on each of these nodes.
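The range notation can be expanded into an explicit node list with a short Python sketch. The start-end(step) form is taken from the example above. Support for plain ranges such as 1-5, single node numbers, and several entries separated by spaces or commas is an assumption, and the function name is hypothetical.

```python
import re

# One entry: "N", "N-M", or "N-M(S)" where S is the increment.
TOKEN_RE = re.compile(r"(\d+)(?:-(\d+)(?:\((\d+)\))?)?")

def expand_nodes(spec):
    """Expand a node list such as "17-23(2)" into [17, 19, 21, 23]."""
    nodes = []
    for token in re.split(r"[,\s]+", spec.strip()):
        if not token:
            continue
        match = TOKEN_RE.fullmatch(token)
        if not match:
            raise ValueError(f"unrecognized node entry: {token!r}")
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        step = int(match.group(3)) if match.group(3) else 1
        # The end of the range is inclusive.
        nodes.extend(range(start, end + 1, step))
    return nodes

print(expand_nodes("17-23(2)"))
# [17, 19, 21, 23]
```

The expanded list can then be compared with the set of nodes that are known to be down.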

Operational test 12 - Verify the status of an unresponsive node that is shown to be up by Topology Services

Examine the machines.lst configuration file and obtain the IP addresses for all the adapters in the given node that are in the Topology Services configuration. For example, for node 9, entries similar to the following may be found in the file:

 9 en0  9.114.61.193
 9 css0 9.114.61.137

Issue this command for each of these addresses:

ping -c5 IP_address

If there is no response to the ping packets (the output of the command shows 100% packet loss) for all the node's adapters, the node is either down or unreachable. Pursue this as a node health problem. If Topology Services still indicates the node as being up, contact the IBM Support Center because this is probably a Topology Services problem. Collect long tracing information from the Topology Services logs. See Topology Services service log. Also obtain iptrace information from the node where the test is being run. See Information to collect before contacting the IBM Support Center.

If the output of the ping command shows some response (for example, 0% packet loss), the node is still up and able to send and receive IP packets. The Topology Services daemon is likely to be running and able to send and receive heartbeat packets. This is why the node is still seen as being up. This problem should be pursued as an AIX-related problem.
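The decision between the two cases above can be sketched in Python by reading the packet loss percentage from each saved ping output. The function names are hypothetical, and the sample summary lines are illustrative rather than captured from a real node.

```python
import re

# The ping summary line reports a percentage such as "100% packet loss".
LOSS_RE = re.compile(r"(\d+)% packet loss")

def packet_loss(ping_output):
    """Return the packet loss percentage, or None if no summary line is found."""
    match = LOSS_RE.search(ping_output)
    return int(match.group(1)) if match else None

def node_responds(ping_outputs):
    """True if at least one adapter answered at least one ping packet.

    ping_outputs: dict of adapter IP address -> text of "ping -c5" output.
    """
    losses = [packet_loss(text) for text in ping_outputs.values()]
    return any(loss is not None and loss < 100 for loss in losses)

# Illustrative outputs for the two adapters of node 9.
sample = {
    "9.114.61.193": "5 packets transmitted, 0 packets received, 100% packet loss",
    "9.114.61.137": "5 packets transmitted, 5 packets received, 0% packet loss",
}
print(node_responds(sample))
# True
```

A False result means that no adapter responded, which corresponds to the node being down or unreachable. A True result means that the node can still send and receive IP packets.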

If there is a response from the ping command, and the node is considered up by remote Topology Services daemons, but the node is unresponsive and no user application is apparently able to run, a system dump must be obtained to find the cause of the problem. See Producing a system dump.

In a PSSP environment, attempt to connect to the node using the serial line interface. Issue this command, where node_number is the number of the node:

spmon -o nodenode_number

If the connection is successful, the problem is likely to be lack of IP connectivity to the node. If the connection is not successful, a system dump is needed to diagnose the problem.
