Diagnosis Guide

Actions

Action 1 - Diagnose multiple nodes

If several node icons in a frame report a failure (either the nodes are not responding or several adapters are inactive), there may be a network problem.

If the failing nodes or communication adapters are on the same Local Area Network (LAN), verify the LAN hardware. If you determine that the hardware is functioning properly, call the IBM Support Center. Otherwise, follow local procedures for servicing your hardware.

If the nodes are not on the same LAN, diagnose the nodes individually as described in Action 2 - Diagnose individual nodes.

Action 2 - Diagnose individual nodes

If an individual node icon in a frame reports a failure, use the SP Hardware Perspective to display the Nodes Status page in the Node notebook for the failing node.

1. Check the node's LCD/LED indicator.
   If a three-digit code is displayed, check SP-specific LED/LCD values to see if the code is described there. If the code is not described there, refer to IBM RS/6000 Problem Solving Guide.
2. Check the hostResponds indicator for a failure.
3. Check the node's power indicator.
   If it shows that the node power is off, turn the node's power on.
   If it shows that the node power is on, or if the problem persists, call IBM hardware support.

Action 3 - Diagnose a network problem

If a node is not responding to a network command, you can access the node through its tty. To do this from the SP Hardware Perspective, select the node and perform an open tty action on it. Alternatively, issue the command:

s1term -w frame_number slot_number

where frame_number is the frame number of the node and slot_number is the slot number of the node.
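
As a minimal sketch of the command form above, the following shell function checks that both arguments are plain numbers and prints the resulting s1term command instead of running it. The function name s1term_cmd and the numeric check are illustrative assumptions; only the "s1term -w" form comes from this guide.

```shell
# Hypothetical helper: print the s1term command for a node without running it.
# Only "s1term -w frame_number slot_number" is taken from the guide; the
# function name and the argument check are assumptions for illustration.
s1term_cmd() {
    for arg in "$1" "$2"; do
        case "$arg" in
            ''|*[!0-9]*)
                # Reject missing or non-numeric frame/slot values.
                echo "usage: s1term_cmd frame_number slot_number" >&2
                return 1
                ;;
        esac
    done
    echo "s1term -w $1 $2"
}
```

For example, s1term_cmd 1 5 prints the command for the node in frame 1, slot 5, which you can then review and run on the control workstation.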

Using either method, you can log in to the node and check the hostname, network interfaces, network routes, and hostname resolution to determine why the node is not responding. The Appendix "IP Address and Host Name Changes for SP Systems" in PSSP: Administration Guide contains a procedure for changing hostnames and IP addresses.
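
The four checks named above could be run from the tty session along the following lines. This is a sketch only: the guide does not name specific commands, so the choice of hostname, ifconfig -a, netstat -rn, and host is an assumption based on common AIX and UNIX tools, and the function names are hypothetical.

```shell
# Print one diagnostic command per line for the four areas named in the text:
# hostname, network interfaces, network routes, and hostname resolution.
# The command choices are assumptions, not taken from the guide.
node_checks() {
    name=${1:-$(hostname)}      # name to resolve; defaults to this node
    echo "hostname"
    echo "ifconfig -a"
    echo "netstat -rn"
    echo "host $name"
}

# Run each check and label its output. A command that fails or is not
# installed is reported and the remaining checks still run.
run_node_checks() {
    node_checks "$@" | while read -r cmd; do
        echo "== $cmd"
        sh -c "$cmd" </dev/null 2>&1 || echo "(failed or not available: $cmd)"
    done
}
```

Running run_node_checks on the failing node gives one labelled block of output per check, which can be compared with the output from a node that is responding.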

If you are using IPv6 alias addresses, verify that each network (Ethernet or token ring) adapter on the affected system has a valid IPv4 address defined. To verify the adapter IP addresses on the control workstation and nodes, run the SYSMAN_test command. This command issues an error message if the node does not have valid IPv4 addresses for all Ethernet and token ring adapters that are used by the SP system. See Verifying System Management installation.

Action 4 - Diagnose a Topology-related problem

If the ping and telnet commands are successful, but hostResponds still shows Node Not Responding, there may be something wrong with the Topology Services (hats) subsystem. Perform these steps:

1. Examine the en0 (Ethernet adapter) and css0 (switch adapter) addresses on all nodes to see if they match the addresses in /var/ha/run/hats.partition_name/machines.lst.
2. Verify that the netmask and broadcast addresses are consistent across all nodes. Use the ifconfig en0 and ifconfig css0 commands.
3. Examine the hats log file on the failing node. It is named /var/ha/log/hats.dd.HHMMSS.partition_name, where dd.HHMMSS is the day of the month and time of day when the Topology Services daemon was started, and partition_name is the name of the node's system partition.
4. Examine the hats log file for the Group Leader nodes. Group Leader nodes are those that host the adapter whose address is listed below the line "Group ID" in the output of the lssrc -ls hats command. For more information, see Diagnosing Topology Services problems, and the Topology Services chapter in PSSP: Administration Guide.
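
The first three steps above can be partly scripted. The sketch below makes several assumptions that are not confirmed by this guide: that ifconfig prints an "inet address netmask value broadcast address" line, that each adapter address in machines.lst appears as a whitespace-separated field, and that the function names (inet_summary, addr_listed, hats_log_pattern) are hypothetical helpers.

```shell
# Read "ifconfig <adapter>" output on stdin and print
# "address netmask broadcast" for each inet line (assumed output layout).
inet_summary() {
    awk '$1 == "inet" {
        mask = "-"; bcast = "-"
        for (i = 3; i < NF; i++) {
            if ($i == "netmask")   mask  = $(i + 1)
            if ($i == "broadcast") bcast = $(i + 1)
        }
        print $2, mask, bcast
    }'
}

# Succeed if the address in $1 appears as a whole field in the file $2,
# for example /var/ha/run/hats.partition_name/machines.lst.
# The file layout is an assumption; only an exact field match is tested.
addr_listed() {
    awk -v a="$1" '{ for (i = 1; i <= NF; i++) if ($i == a) found = 1 }
                   END { exit !found }' "$2"
}

# Print a shell glob matching the hats log files for partition $1,
# following the hats.dd.HHMMSS.partition_name naming described above.
hats_log_pattern() {
    echo "/var/ha/log/hats.[0-9][0-9].[0-9][0-9][0-9][0-9][0-9][0-9].$1"
}
```

On each node, ifconfig en0 | inet_summary gives a single line whose netmask and broadcast fields can be compared across nodes, and whose address field can be passed to addr_listed. Expanding the pattern unquoted, as in ls -t $(hats_log_pattern partition_name) | head -1, lists the most recently modified log.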
