These tests verify the installation, configuration, and operation of Topology Services.
This test determines whether Topology Services has been successfully installed.
lslpp -l | grep rsct
Good results are indicated by an output similar to:
rsct.basic.hacmp     1.2.0.0  COMMITTED  RS/6000 Cluster Technology
rsct.basic.rte       1.2.0.0  COMMITTED  RS/6000 Cluster Technology
rsct.basic.sp        1.2.0.0  COMMITTED  RS/6000 Cluster Technology
rsct.clients.hacmp   1.2.0.0  COMMITTED  RS/6000 Cluster Technology
rsct.clients.rte     1.2.0.0  COMMITTED  RS/6000 Cluster Technology
rsct.clients.sp      1.2.0.0  COMMITTED  RS/6000 Cluster Technology
rsct.core.utils      1.2.0.0  COMMITTED  RS/6000 Cluster Technology
Error results are indicated by no output from the command.
lppchk -c "rsct*"
Good results are indicated by the absence of error messages and the return of a zero exit status from this command. The command produces no output if it succeeds.
Error results are indicated by a non-zero exit code and by error messages similar to these:
lppchk: 0504-206 File /usr/lib/nls/msg/en_US/hats.cat could not be located.
lppchk: 0504-206 File /usr/sbin/rsct/bin/hatsoptions could not be located.
lppchk: 0504-208 Size of /usr/sbin/rsct/bin/phoenix.snap is 29356, expected value was 29355.
Some error messages may appear if an EFIX is applied to a file set. An EFIX is an emergency fix, supplied by IBM, to correct a specific problem.
If the test failed, verify the installation of RSCT. The following file sets need to be installed:
If the test succeeds, proceed to Configuration verification tests. If the test fails, check whether RSCT was installed, and install it if it was not.
These tests verify the configuration of Topology Services.
This test verifies that Topology Services in PSSP has the configuration data it needs. Proceed to Configuration test 3 - Verify HACMP/ES configuration data if in the HACMP/ES environment.
Issue the following commands to display data from the SDR:
Good results are indicated by none of these commands giving an error message, and all commands giving non-null output. SDRGetObjects Adapter must show all the adapters in the current partition, and SDRGetObjects -G Adapter must show all the adapters in the machine.
Error results are indicated if these commands fail. The SDR could be experiencing problems. Diagnose the SDR subsystem. If the commands succeed but do not show the expected information, it is possible that a problem occurred in the installation of the nodes. Verify installation of the nodes.
If the test is successful, proceed to Configuration test 2 - Check control workstation Ethernet adapter.
This test determines whether the control workstation has an Ethernet adapter that can be included in the Topology Services configuration file. On the control workstation, issue the command netstat -in, followed by the command: ifconfig enn, for each Ethernet adapter listed by netstat.
Verify that at least one of the "en" adapters on the control workstation is on the same subnet ID as the en0 adapter of at least one of the nodes. The subnet ID and subnet mask for the control workstation adapter can be derived from the ifconfig command output. Use this calculation:
Subnet id = inet & netmask
Subnet mask = netmask
where inet and netmask are given in the output of the previous ifconfig command, and "&" is the bitwise "AND" operator. Ignore ifconfig command output that begins with inet6. Those are addresses in IPv6 format.
For example, if the command ifconfig en0 produced this output:
en0: flags=e080863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
inet 9.114.61.125 netmask 0xffffffc0 broadcast 9.114.61.127
calculate the Subnet ID as follows:
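Writing each octet of 9.114.61.125 in hexadecimal gives 0x09723D7D. In this example, 0x09723D7D & 0xFFFFFFC0 = 0x09723D40, which is subnet ID 9.114.61.64 in dotted decimal notation.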
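The calculation above can be sketched as a small shell function. This is only an illustration: the function name subnet_id is hypothetical, and it assumes the address is given in dotted decimal and the netmask in the hexadecimal form that ifconfig prints.

```sh
# subnet_id ADDRESS NETMASK   (hypothetical helper, not part of PSSP or RSCT)
# ADDRESS is the dotted decimal inet value; NETMASK is hexadecimal,
# for example 0xffffffc0. Prints the subnet ID in dotted decimal.
subnet_id() {
    addr=$1
    mask=$2
    # Split the dotted address into its four octets.
    old_ifs=$IFS
    IFS=.
    set -- $addr
    IFS=$old_ifs
    # Pack the octets into one 32-bit integer.
    ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    # Subnet id = inet & netmask
    net=$(( ip & $mask ))
    printf '%d.%d.%d.%d\n' \
        $(( (net >> 24) & 255 )) $(( (net >> 16) & 255 )) \
        $(( (net >> 8) & 255 ))  $(( net & 255 ))
}
```

For the en0 example, subnet_id 9.114.61.125 0xffffffc0 prints 9.114.61.64. Running it for the control workstation adapter and for each node adapter gives the values to compare.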
The information about the nodes' adapters can be obtained by issuing the command: SDRGetObjects -G Adapter
Use this calculation:
Subnet id = netaddr & netmask
Subnet mask = netmask
where netaddr and netmask are given in the output of the SDRGetObjects command.
Good results are indicated by the existence of at least one node where the Subnet ID / Subnet mask pairs are the same as in one of the control workstation's "en" adapters.
Error results are indicated by the absence of such a pair.
If the problem is in the control workstation's or nodes' netmask, the netmask problem must be corrected. Adapters that belong to the same subnet must have the same netmask. If the problem is due to a lack of an Ethernet adapter in the control workstation which is in the same subnet as one of the nodes, this adapter must be added.
If this test is a success, proceed to Operational verification tests.
This test verifies that Topology Services in HACMP/ES has the configuration data it needs. Configuration data is stored in the HACMP/ES Global ODM. Obtain the output of the commands discussed below: cllsif, clhandle -a, clhandle, cllsclstr, odmget HACMPtopsvcs, and odmget HACMPnim.
The output of cllsif is similar to the following:
Adapter            Type     Network   NetType  Attribute  Node    IP Address   Hardware Address  Interface Name  Global Name
c47n07             service  SP_ether  ether    public     c47n07  9.114.61.71                    en0             glob_SP_ether
c47n07_hpsboot     boot     sw_net    hps      public     c47n07  1.1.1.1                        css0
c47n15_hpsservice  service  sw_net    hps      public             1.1.1.12                       css0
c47n13_hpsservice  service  sw_net    hps      public             1.1.1.10                       css0
This output should contain all the adapters defined in the HACMP configuration. The adapter names and addresses (at least the boot and standby) should correspond to what is actually configured on the machine. Those can be obtained by issuing the netstat -in command.
The output of clhandle -a should show all the nodes configured in HACMP, while the output of clhandle should contain the local node name and number.
The output of cllsclstr should show the cluster name and id.
The output of odmget HACMPtopsvcs should be similar to the following:
HACMPtopsvcs:
hbInterval = 1
fibrillateCount = 4
runFixedPri = 1
fixedPriLevel = 38
tsLogLength = 5000
gsLogLength = 5000
instanceNum = 2
The output of odmget HACMPnim should be similar to the following:
HACMPnim:
name = "ether"
desc = "Ethernet Protocol"
addrtype = 0
path = ""
para = ""
grace = 30
hbrate = 500000
cycle = 4

HACMPnim:
name = "token"
desc = "Token Ring Protocol"
addrtype = 0
path = ""
para = ""
grace = 90
hbrate = 500000
cycle = 4
Good results are indicated by the output of these commands reflecting the desired HACMP configuration, with respect to networks, network adapters, and tunable values. In this case, proceed to Operational verification tests.
Error results are indicated if there is any inconsistency between the displayed configuration data and the desired configuration data. In this case, the HACMP configuration has to be edited, and the Cluster Topology must be synchronized.
The following names apply to the operational verification tests in this section:
This test verifies whether Topology Services is working and that all the adapters are up. Issue the following command:
lssrc -ls subsystem_name
Good results are indicated by an output similar to the following:
Subsystem         Group            PID     Status
hats              hats             20494   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [ 0]   15   15  S 9.114.61.195    9.114.61.195
SPether        [ 0] en0              0x3740dd5c      0x3740dd62
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [ 1]   14   14  S 9.114.61.139    9.114.61.139
SPswitch       [ 1] css0             0x3740dd5d      0x3740dd62
HB Interval = 1 secs. Sensitivity = 4 missed beats
Configuration Instance = 926566126
Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
control workstation IP address = 9.114.61.125
Daemon employs no security
Data segment size: 6358 KB. Number of outstanding malloc: 588
Number of nodes up: 15. Number of nodes down: 0.
If the number under the Mbrs heading is the same as the number under Defd, all adapters defined in the configuration are part of the adapter membership group. The numbers under the Group ID heading should remain the same over subsequent invocations of lssrc several seconds apart. This is the expected behavior of the subsystem.
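The comparison of the Defd and Mbrs columns can be automated roughly as follows. This is a sketch under assumptions: the function name membership_summary is hypothetical, and it assumes the lssrc output has the column layout shown in the sample above.

```sh
# membership_summary (hypothetical helper): reads "lssrc -ls" output on
# standard input and prints one line per network with the Defd and Mbrs
# counts, the state column, and whether the two counts match.
membership_summary() {
    awk '
    # Drop the bracketed index ("[ 0]" or "[0]") so the fields line up.
    { sub(/\[ *[0-9]+\]/, "") }
    # Summary lines have numeric Defd and Mbrs columns after the name.
    $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
        print $1, $2, $3, $4, ($2 == $3 ? "complete" : "incomplete")
    }'
}
```

A possible invocation is lssrc -ls subsystem_name | membership_summary. A network reported as incomplete has fewer members than defined adapters and corresponds to one of the error outputs described next.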
Error results are indicated by outputs similar to the following:
In this case, the subsystem is down. Issue the errpt command and look for an entry for the subsystem name. Proceed to Operational test 2 - Determine why the Topology Services subsystem is inactive.
The subsystem is not defined to the AIX SRC. In PSSP, the partition-sensitive subsystems may have been undefined by the syspar_ctrl command. The same command may be used to add the subsystems to the node. In HACMP/ES, HACMP may not have been installed on the node. Check the HACMP subsystem.
Subsystem         Group            PID     Status
hats              hats             20494   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [ 0]   15    8  S 9.114.61.195    9.114.61.195
SPether        [ 0] en0              0x3740dd5c      0x3740dd62
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [ 1]   14    7  S 9.114.61.139    9.114.61.139
SPswitch       [ 1] css0             0x3740dd5d      0x3740dd62
HB Interval = 1 secs. Sensitivity = 4 missed beats
Configuration Instance = 926566126
Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
control workstation IP address = 9.114.61.125
Daemon employs no security
Data segment size: 6358 KB. Number of outstanding malloc: 588
Number of nodes up: 8. Number of nodes down: 7.
Nodes down: 17-29(2)
Some remote adapters are not part of the local adapter's group. Proceed to Operational test 3 - Determine why remote adapters are not in the local adapter's membership group.
Subsystem         Group            PID     Status
hats              hats             20494   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [ 0]   15   15  S 9.114.61.195    9.114.61.195
SPether        [ 0] en0              0x3740dd5c      0x3740dd62
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [ 1]   14    0  D 9.114.61.139
SPswitch       [ 1] css0
HB Interval = 1 secs. Sensitivity = 4 missed beats
Configuration Instance = 926566126
Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
control workstation IP address = 9.114.61.125
Daemon employs no security
Data segment size: 6358 KB. Number of outstanding malloc: 588
Number of nodes up: 15. Number of nodes down: 0.
A local adapter is disabled. Proceed to Operational test 4 - Check address of local adapter.
Subsystem         Group            PID     Status
hats              hats             20494   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [ 0]   15    8  S 9.114.61.195    9.114.61.195
SPether        [ 0] en0              0x3740dd5c      0x3740dd62
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [ 1]   14    1  U 9.114.61.139    9.114.61.139
SPswitch       [ 1] css0             0x3740dd5d      0x3740dd5d
HB Interval = 1 secs. Sensitivity = 4 missed beats
Configuration Instance = 926566126
Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
control workstation IP address = 9.114.61.125
Daemon employs no security
Data segment size: 6358 KB. Number of outstanding malloc: 588
Number of nodes up: 8. Number of nodes down: 7.
Nodes down: 17-29(2)
The last line of the output shows a list of nodes that are either up or down, whichever is smaller. The list of nodes that are down includes only the nodes that are configured and have at least one adapter that Topology Services monitors. Nodes are specified by a list of node ranges, as follows:
N1-N2(I1) N3-N4(I2) ...
Here, there are two ranges, N1-N2(I1) and N3-N4(I2). They are interpreted as follows:
If the increment is 1, it is omitted. If the range has only one node, only that node's number is displayed. Examples are:
An adapter stays in a singleton unstable membership group. This normally occurs for a few seconds after the daemon starts or after the adapter is re-enabled. If the situation persists for more than one minute, this may indicate a problem. This usually indicates that the local adapter is receiving some messages, but it is unable to obtain responses for its outgoing messages. Proceed to Operational test 7 - Check for partial connectivity.
This kind of output indicates that there is some partial connectivity on the network. Some adapters may be able to communicate only with a subset of adapters. Some adapters may be able to send messages only or receive messages only. This output indicates that the adapter membership groups are constantly reforming, causing a substantial increase in the CPU and network resources used by the subsystem.
A partial connectivity situation is preventing the adapter membership group from holding together. Proceed to Operational test 10 - Check neighboring adapter connectivity.
If this test is successful, proceed to Operational test 11 - Verify node reachability information.
This test is to determine why the Topology Services subsystem is not active.
The AIX error log entries produced by this command, together with their description in Table 58, explain why the subsystem is inactive. If no entry that explains why the subsystem went down or could not start exists, it is possible that the daemon may have exited abnormally.
In this case, issue the errpt -a command and look for an error entry with a LABEL: of CORE_DUMP and a PROGRAM NAME of hatsd. (Issue the command: errpt -J CORE_DUMP -a.) If such an entry is found, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
Another possibility, when there is no TS_ error log entry, is that the Topology Services daemon could not be loaded. In this case, a message similar to the following may be present in the Topology Services User startup log:
0509-036 Cannot load program hatsd because of the following errors:
0509-023 Symbol dms_debug_tag in hatsd is not defined.
0509-026 System error: Cannot run a file that does not have a valid format.
The message may refer to the Topology Services daemon, or to some other program invoked by the startup script hats. If such an error is found, contact the IBM Support Center.
For errors where the daemon did start up but exited during initialization, detailed information about the problem is in the Topology Services User error log.
Issue the command:
lssrc -ls subsystem
on all the nodes and the PSSP control workstation. The command:
dsh -a "lssrc -ls subsystem"
issued from the control workstation can be used to issue lssrc on all the nodes.
If this test follows output 3, at least one node will not have the same output as the node from where output 3 was taken.
Some of the possibilities are:
Subsystem         Group            PID     Status
hats              hats             20494   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [ 0]   15    7  S 9.114.61.199    9.114.61.201
SPether        [ 0] en0              0x3740dd5c      0x3740dd72
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [ 1]   14    7  S 9.114.61.141    9.114.61.141
SPswitch       [ 1] css0             0x3740dd5d      0x3740dd72
HB Interval = 1 secs. Sensitivity = 4 missed beats
Configuration Instance = 926566126
Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
Control Workstation IP address = 9.114.61.125
Daemon employs no security
Data segment size: 6358 KB. Number of outstanding malloc: 588
Number of nodes up: 7. Number of nodes down: 8.
Nodes up: 17-29(2)
Compare this with the output from 3. Proceed to Operational test 8 - Check if configuration instance and security status are the same across all nodes.
This test verifies whether a local adapter is configured with the correct address. Assuming that this test is being run because the output of the lssrc command indicates that the adapter is disabled, there should be an entry in the AIX error log that points to the problem.
errpt -J TS_LOC_DOWN_ST,TS_MISCFG_EM -a | more
Examples of the error log entries that appear in the output are:
LABEL:           TS_LOC_DOWN_ST
IDENTIFIER:      D17E7B06
Date/Time:       Mon May 17 23:29:34
Sequence Number: 227
Machine Id:      000032054C00
Node Id:         c47n11
Class:           S
Type:            INFO
Resource Name:   hats.c47s
Description
Possible malfunction on local adapter

LABEL:           TS_MISCFG_EM
IDENTIFIER:      6EA7FC9E
Date/Time:       Mon May 17 16:28:45
Sequence Number: 222
Machine Id:      000032054C00
Node Id:         c47n11
Class:           U
Type:            PEND
Resource Name:   hats.c47s
Resource Class:  NONE
Resource Type:   NONE
Location:        NONE
VPD:
Description
Local adapter misconfiguration detected
Good results are indicated by the absence of the TS_MISCFG_EM error entry. To verify that the local adapter has the expected address, issue the command:
ifconfig interface_name
where interface_name is the interface name listed on the output of lssrc, such as:
SPswitch       [ 1]   14    0  D 9.114.61.139
SPswitch       [ 1] css0
For this lssrc command output, the output of ifconfig css0 is similar to:
css0: flags=800847 <UP,BROADCAST,DEBUG,RUNNING,SIMPLEX>
inet 9.114.61.139 netmask 0xffffffc0 broadcast 9.114.61.191
Error results are indicated by the TS_MISCFG_EM error entry and by the output of the ifconfig command not containing the address displayed in the lssrc command output.
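The address comparison can be made less error-prone by extracting the address from the ifconfig output. The following is a sketch only: the function name inet_address is hypothetical, and it assumes the ifconfig output format shown in the sample above.

```sh
# inet_address (hypothetical helper): reads ifconfig output on standard
# input and prints the IPv4 address that follows the "inet" keyword.
# Lines that begin with inet6 are not matched, because the pattern
# requires a blank directly after "inet".
inet_address() {
    sed -n 's/.*inet \([0-9.]*\) .*/\1/p'
}
```

For example, ifconfig css0 | inet_address prints 9.114.61.139 for the sample output, which can then be compared with the address under the Adapter ID heading in the lssrc command output.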
Diagnose the reason why the adapter is configured with an incorrect address. For PSSP, the adapter may have been incorrectly configured in the SDR, or the adapter's address was incorrectly set manually. For HACMP, the cluster on the node may have been stopped with the "Forced Down" option. The adapters must be configured with their boot-time addresses before the cluster can be started on a node. This can be done by issuing the command:
/etc/rc.net -boot
several times in a sequence. Issuing the command only once may not set all IP routes correctly.
If this test is a success, proceed to Operational test 5 - Check if the adapter is enabled for IP.
Issue the command:
ifconfig interface_name
The output is similar to the following:
css0: flags=800847 <UP,BROADCAST,DEBUG,RUNNING,SIMPLEX>
inet 9.114.61.139 netmask 0xffffffc0 broadcast 9.114.61.191
Good results are indicated by the presence of the UP string in the first line of the output. In this case, proceed to Operational test 6 - Check whether the adapter can communicate with other adapters in the network.
Error results are indicated by the absence of the UP string in the first line of the output.
Issue the command:
ifconfig interface_name up
to re-enable the adapter to IP.
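The check for the UP string can be scripted along these lines. This is a minimal sketch: the function name is_up is hypothetical, and it assumes the flags list appears on the first line of the ifconfig output, as in the sample above.

```sh
# is_up (hypothetical helper): reads ifconfig output on standard input
# and succeeds only if the flags list on the first line contains UP.
is_up() {
    # Match UP only as a whole flag, delimited by "<", "," or ">".
    head -n 1 | grep -q '[<,]UP[,>]'
}
```

A possible use is: ifconfig interface_name | is_up || echo "adapter is not enabled for IP".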
Root authority is needed to access the contents of the machines.lst file. Display the contents of the machines.lst file. The output is similar to the following:
*InstanceNumber=925928580
*configId=1244520230
*!HaTsSeCStatus=off
*FileVersion=1
*!TS_realm=PSSP
TS_Frequency=1
TS_Sensitivity=4
TS_FixedPriority=38
TS_LogLength=5000
*!TS_PinText
Network Name SPether
Network Type ether
*
*Node Type Address
0 en0 9.114.61.125
1 en0 9.114.61.65
3 en0 9.114.61.67
11 en0 9.114.61.195
...
Network Name SPswitch
Network Type hps
*
*Node Type Address
1 css0 9.114.61.129
3 css0 9.114.61.131
11 css0 9.114.61.139
Locate the network to which the adapter under investigation belongs. For example, the css0 adapter on node 11 belongs to network SPswitch. Issue the command:
ping -c 5 address
for the addresses listed in the machines.lst file.
Good results are indicated by outputs similar to the following.
PING 9.114.61.129: (9.114.61.129): 56 data bytes
64 bytes from 9.114.61.129: icmp_seq=0 ttl=255 time=0 ms
64 bytes from 9.114.61.129: icmp_seq=1 ttl=255 time=0 ms
64 bytes from 9.114.61.129: icmp_seq=2 ttl=255 time=0 ms
64 bytes from 9.114.61.129: icmp_seq=3 ttl=255 time=0 ms
64 bytes from 9.114.61.129: icmp_seq=4 ttl=255 time=0 ms
----9.114.61.129 PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip min/avg/max = 0/0/0 ms
The number before packets received should be greater than 0.
Error results are indicated by outputs similar to the following:
PING 9.114.61.129: (9.114.61.129): 56 data bytes
----9.114.61.129 PING Statistics----
5 packets transmitted, 0 packets received, 100% packet loss
The command should be repeated with different addresses until it succeeds or until several different attempts are made. After that, pursue the problem as an adapter or IP-related problem. If the adapter is an SP Switch adapter, refer to Diagnosing SP Switch problems. If the adapter is an SP Switch2 adapter, refer to Diagnosing SP Switch2 problems.
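When many addresses must be tried, the number before packets received can be extracted from the ping output instead of being read by eye. The following is a sketch: the function name packets_received is hypothetical, and it assumes the statistics line has the format shown in the sample outputs above.

```sh
# packets_received (hypothetical helper): reads ping output on standard
# input and prints the number that precedes "packets received".
packets_received() {
    sed -n 's/.* \([0-9][0-9]*\) packets received.*/\1/p'
}
```

A possible use is: [ "$(ping -c 5 address | packets_received)" -gt 0 ] && echo "address responds".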
If this test succeeds, but the adapter is still listed as disabled in the lssrc command output, collect the data listed in Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
Adapters stay in a singleton unstable state when there is partial connectivity between two adapters. One reason for an adapter to stay in this state is that it keeps receiving PROCLAIM messages, to which it responds with a JOIN message, but no PTC message comes in response to the JOIN message.
Check in the User log file to see if a message similar to the following appears repeatedly:
2523-097 JOIN time has expired. PROCLAIM message was sent by (10.50.190.98:0x473c6669)
If this message appears repeatedly in the User log, investigate IP connectivity between the local adapter and the adapter whose address is listed in the User log entry (10.50.190.98 in the example here). Issue the command:
ping -c 5 address
where address is 10.50.190.98 in this example.
See Operational test 6 - Check whether the adapter can communicate with other adapters in the network for a description of good results for this command.
The local adapter cannot communicate with a Group Leader that is attempting to attract the local adapter into the adapter membership group. The problem may be with either the local adapter or the Group Leader adapter ("proclaimer" adapter). Pursue this as an IP connectivity problem. Focus on both the local adapter and the Group Leader adapter. See Diagnosing IP routing problems.
If the ping command succeeds, but the local adapter still stays in the singleton unstable state, contact the IBM Support Center.
In an HACMP/ES environment, it is possible that there are two adapters in different nodes both having the same service address. This can be verified by issuing:
lssrc -ls subsystem_name
and looking for two different nodes that have the same IP address portion of Adapter ID. In this case, this problem should be pursued as an HACMP/ES problem. Contact the IBM Support Center.
If this test fails, proceed to Operational test 4 - Check address of local adapter, concentrating on the local and Group Leader adapters.
This test is used when there seem to be multiple partitioned adapter membership groups across the nodes, as in output 2.
This test verifies whether all nodes are using the same configuration instance number and same security setting. The instance number changes each time the machines.lst file is generated by the startup script. In PSSP, the configuration instance always increases. In HACMP/ES, the configuration instance number normally increases, unless a snapshot of a previous configuration is applied.
Issue the command:
lssrc -ls subsystem_name
on all nodes. If this is not feasible, issue the command at least on nodes that produce an output that shows a different Group ID.
Compare the line Configuration Instance = (number) in the lssrc outputs. Also, compare the line Daemon employs in the lssrc command outputs.
Good results are indicated by the number after the Configuration Instance phrase being the same in all the lssrc outputs. This means that all nodes are working with the same version of the machines.lst file.
Error results are indicated by the configuration instance being different in the two "node partitions" (this is unrelated to the SP system partitions). In this case, the adapters in the two partitions cannot merge into a single group because the configuration instances are different across the node partitions. This situation is likely to be caused by a refresh-related problem. One of the node groups, probably that with the lower configuration instance, was unable to run a refresh. If a refresh operation was indeed attempted, consult the description of the "Nodes or adapters leave membership after refresh" problem in Error symptoms, responses, and recoveries.
The situation may be caused by a problem in the AIX SRC subsystem, which fails to notify the Topology Services daemon about the refresh. The description of the "Nodes or adapters leave membership after refresh" problem in Error symptoms, responses, and recoveries explains how to detect the situation where the Topology Services daemon has lost its connection with the AIX SRC subsystem. In this case, contact the IBM Support Center.
If the security setting is not the same on all the nodes in a partition, some of the nodes may fail to authenticate each other's messages. AIX error log entries with labels TS_SECURITY_ST and TS_SECURITY2_ST may appear on those nodes. For information about these error log entries, see the descriptions of TS_SECURITY_ST and TS_SECURITY2_ST.
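Comparing the Configuration Instance line across many nodes can be sketched as a filter over the collected lssrc outputs. This is an illustration under assumptions: the function name instance_numbers is hypothetical, and it assumes the outputs from all nodes are concatenated on standard input; any text that precedes the phrase on a line is ignored.

```sh
# instance_numbers (hypothetical helper): reads the lssrc outputs of
# several nodes on standard input and prints each distinct
# configuration instance number once.
instance_numbers() {
    sed -n 's/.*Configuration Instance = *\([0-9][0-9]*\).*/\1/p' | sort -u
}
```

For example, dsh -a "lssrc -ls subsystem" | instance_numbers prints a single number when all nodes use the same version of the machines.lst file, and more than one number when they do not. The Daemon employs line could be compared in a similar way.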
If this test is successful, proceed to Operational test 9 - Check connectivity among multiple node partitions.
This test is used when adapters in the same Topology Services network form multiple adapter membership groups, rather than a single group encompassing all the adapters in the network.
Follow the instructions in Operational test 8 - Check if configuration instance and security status are the same across all nodes to obtain lssrc outputs for each of the node partitions.
The IP address listed in the lssrc command output under the Group ID heading is the IP address of the Group Leader. If two node partitions are unable to merge into one, this is caused by the two Group Leaders being unable to communicate with each other. Note that even if some adapters in different partitions can communicate, the group merge will not occur unless the Group Leaders are able to exchange point-to-point messages. Use ping (as described in Operational test 6 - Check whether the adapter can communicate with other adapters in the network) to determine whether the Group Leaders can communicate with each other.
For example, assume on one node the output of the lssrc -ls hats command is:
Subsystem         Group            PID     Status
hats              hats             15750   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [0]    15    9  S 9.114.61.65     9.114.61.195
SPether        [0]                   0x373897d2      0x3745968b
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [1]    14   14  S 9.114.61.129    9.114.61.153
SPswitch       [1]                   0x37430634      0x374305f1
HB Interval = 1 secs. Sensitivity = 4 missed beats
and on another node it is:
Subsystem         Group            PID     Status
hats              hats             13694   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [0]    15    6  S 9.114.30.69     9.114.61.71
SPether        [0]                   0x37441f24      0x37459754
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [1]    14   14  S 9.114.61.149    9.114.61.153
SPswitch       [1]                   0x374306a4      0x374305f1
In this example, the partition is occurring in the SP Ethernet. The two Group Leaders have IP addresses 9.114.61.195 and 9.114.61.71. Log in to the node that hosts one of the IP addresses and issue the ping test to the other address. If the two adapters in question are in the same subnet, verify whether they have the same subnet mask. Configuration test 2 - Check control workstation Ethernet adapter describes how to obtain the subnet ID and subnet mask for an adapter.
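Picking out the Group Leader addresses from several lssrc outputs can be sketched as follows. The function name group_leaders is hypothetical, and the sketch assumes the column layout of the sample outputs above, where the second address on a network's summary line is the Group ID.

```sh
# group_leaders (hypothetical helper): reads "lssrc -ls" output on
# standard input and prints each network name with the IP address
# listed under the Group ID heading.
group_leaders() {
    awk '
    # Drop the bracketed index ("[ 0]" or "[0]") so the fields line up.
    { sub(/\[ *[0-9]+\]/, "") }
    # Summary lines: name, Defd, Mbrs, St, Adapter ID, Group ID.
    # A disabled adapter has no Group ID and is skipped by the NF test.
    $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && NF >= 6 { print $1, $6 }'
}
```

Running this on the outputs of nodes from each partition and comparing the results shows, per network, which Group Leader addresses differ and therefore which pair of addresses to use in the ping test.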
Good results and error results for the ping test are described in Operational test 6 - Check whether the adapter can communicate with other adapters in the network. If the ping test is not successful, a network connectivity problem between the two Group Leader nodes is preventing the groups from merging. Diagnose the network connectivity problem. See Diagnosing system connectivity problems.
Good results for the subnet mask test are indicated by the adapters that have the same subnet id also having the same subnet mask. If the subnet mask test fails, the subnet mask at one or more nodes must be corrected by issuing the command:
ifconfig interface_name address netmask netmask
All the adapters that belong to the same subnet must have the same subnet mask.
If the ping test is successful (the number of packets received is greater than 0), and the subnet masks match, there is some factor other than network connectivity preventing the two Group Leaders from contacting each other. The cause of the problem may be identified by entries in the Topology Services User log. If the problem persists, collect the data listed in Information to collect before contacting the IBM Support Center and contact the IBM Support Center. Include information about the two Group Leader nodes.
This test checks neighboring adapter connectivity, in order to investigate partial connectivity situations. Issue the command errpt -J TS_DEATH_TR | more on all the nodes. Look for recent entries with label TS_DEATH_TR. This is the entry created by the subsystem when the local adapter stops receiving heartbeat messages from the neighboring adapter. For the adapter membership groups to be constantly reforming, such entries should be found in the error log.
Issue the ping test on the node where the TS_DEATH_TR entry exists. The target of the ping should be the adapter whose address is listed in the Detail Data of the AIX error log entry. Operational test 6 - Check whether the adapter can communicate with other adapters in the network describes how to perform the ping test and interpret the results.
If the ping test fails, this means that the two neighboring adapters have connectivity problems, and the problem should be pursued as an IP connectivity problem.
If the ping test is successful, the problem is probably not due to lack of connectivity between the two neighboring adapters. The problem may be due to one of the two adapters not receiving the COMMIT message from the "mayor adapter" when the group is formed. The ping test should be used to probe the connectivity between the two adapters and all other adapters in the local subnet.
Issue the following command:
lssrc -ls subsystem_name
and examine lines:
in the command output.
Good results are indicated by the line Number of nodes down: 0. For example,
Number of nodes up: 15 Number of nodes down: 0
However, such output can only be considered correct if indeed all nodes in the system are known to be up. If a given node is indicated as being up, but the node seems unresponsive, perform problem determination on the node. Proceed to Operational test 12 - Verify the status of an unresponsive node that is shown to be up by Topology Services.
Error results are indicated by Number of nodes down: being nonzero. The list of nodes that are flagged as being up or down is given in the next output line. An output such as Nodes down: 17-23(2) indicates that nodes 17, 19, 21, and 23 are considered down by Topology Services. If the nodes in the list are known to be down, this is the expected output. If, however, some of the nodes are thought to be up, it is possible that a problem exists with the Topology Services subsystem on these nodes. Proceed to Operational test 1 - Verify status and adapters, focusing on each of these nodes.
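Expanding a node range list into individual node numbers can be sketched in shell. This is an illustration only: the function name expand_node_ranges is hypothetical, and it assumes, consistent with the 17-23(2) example, that a range is written first-last(increment), that an omitted increment means 1, and that a single number stands for one node.

```sh
# expand_node_ranges RANGE...   (hypothetical helper)
# Prints one node number per line for ranges such as 17-23(2), 1-4 or 9.
expand_node_ranges() {
    for range in "$@"; do
        step=1
        case $range in
            *"("*")")
                # The text between the parentheses is the increment.
                step=${range##*"("}
                step=${step%")"}
                range=${range%%"("*}
                ;;
        esac
        # A single node number has no dash, so first and last are equal.
        first=${range%%-*}
        last=${range##*-}
        n=$first
        while [ "$n" -le "$last" ]; do
            echo "$n"
            n=$(( n + step ))
        done
    done
}
```

For example, expand_node_ranges '17-23(2)' prints 17, 19, 21, and 23 on separate lines. The range must be quoted so that the shell does not interpret the parentheses.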
Examine the machines.lst configuration file and obtain the IP addresses for all the adapters in the given node that are in the Topology Services configuration. For example, for node 9, entries similar to the following may be found in the file:
9 en0 9.114.61.193
9 css0 9.114.61.137
Issue the following command for each of these addresses:
ping -c5 IP_address
If there is no response to the ping packets (the output of the command shows 100% packet loss) for all the node's adapters, the node is either down or unreachable. Pursue this as a node health problem. If Topology Services still indicates the node as being up, contact the IBM Support Center because this is probably a Topology Services problem. Collect long tracing information from the Topology Services logs. See Topology Services service log. Also obtain iptrace information from the node where the test is being run. See Information to collect before contacting the IBM Support Center.
If the output of the ping command shows some response (for example, 0% packet loss), the node is still up and able to send and receive IP packets. The Topology Services daemon is likely to be running and able to send and receive heartbeat packets. This is why the node is still seen as being up. This problem should be pursued as an AIX-related problem.
If there is a response from the ping command, and the node is considered up by remote Topology Services daemons, but the node is unresponsive and no user application is apparently able to run, a system dump must be obtained to find the cause of the problem. See Producing a system dump.
In a PSSP environment, make an attempt to connect to the node using the serial line interface. Issue this command:
spmon -o nodenode_number
where node_number is the number of the node.
If the connection is successful, the problem is likely to be lack of IP connectivity to the node. If the connection is not successful, a system dump is needed to diagnose the problem.