Diagnosis Guide

Diagnostic procedures

If your SP system or SP system partition shows signs of a switch failure, locate the symptom and perform the recovery action described. All the recovery actions described require that the user have root access on the specified SP Switch node.

Note:

If your system is running in restricted root access mode, the following commands must be issued from the control workstation:

CSS_test
Eclock
Efence
Eprimary
Equiesce
Estart
Eunfence
Eunpartition
mult_senders_test
switch_stress
wrap_test

SP Switch diagnostics

Verify the SP Switch topology configuration

The switch topology file is used to define the hardware configuration to the css support software. It should reflect the number of switches and nodes installed, as well as define how they are connected.

The topology file can reside in two places: in the SDR, or in the expected.top file in the /etc/SP directory of the primary node. Usually, the configuration in the SDR is used. If /etc/SP/expected.top exists on the primary node, it overrides the configuration in the SDR. The /etc/SP/expected.top file on the primary node is generally used for debugging purposes.
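The precedence rule above can be sketched as a small shell function. This is only an illustration and not part of PSSP: the name topology_source is hypothetical, and the function checks nothing more than whether the override file exists on the node where it runs, which would have to be the primary node.

```shell
# Hypothetical helper: report which topology definition is in effect.
# Run it on the primary node. The optional path argument exists only so
# the function can be tried against a file other than the real one.
topology_source() {
    override=${1:-/etc/SP/expected.top}
    if [ -e "$override" ]; then
        echo "$override"   # the file on the primary node overrides the SDR
    else
        echo "SDR"         # no override file, so the SDR copy is used
    fi
}
```

Calling topology_source with no argument on the primary node prints either /etc/SP/expected.top or SDR.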

To verify that the topology in the SDR is correct, first read it out of the SDR using the command:

Etopology -read file_name

The Etopology command reads the switch topology from the SDR and places it in the specified file. For more information on this command, see PSSP: Command and Technical Reference.

Once the file is extracted, verify that the switch topology is an accurate representation of the installed hardware.

If changes to the switch topology file are required, remember to place them back into the SDR using the Etopology command:

Etopology file_name

A default set of topology configuration files is available in the /etc/SP directory. For more information, see PSSP: Command and Technical Reference.

The SP Switch uses an annotated topology file, produced by the Eannotator command. The system administrator is responsible for running the command and creating the annotated topology file. When the file is not annotated, the fault service daemon still works and the switch functions, but the switch jack numbers are not correct. If you suspect that your topology file has a problem, you can verify it: examine the out.top file in /var/adm/SPlogs/css on each node, and examine the topology file (using the Etopology -read file_name command as described previously).

Each link line in an annotated file is marked by E as in the example:

E01-S17-BH-J7 to E01-N1

Each link line in a file that is not annotated is marked by L as in the example:

L01-S17-BH-J7 to L01-N1
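The E and L markers can be checked mechanically. The following is a minimal sketch, assuming that link labels appear as whitespace-delimited tokens of the form shown in the two examples (E01-S17-... or L01-S17-...), possibly preceded by other fields on the same line. The function name topology_kind is hypothetical.

```shell
# Hypothetical helper: classify a topology file read from standard input
# by the prefix of its link labels (E = annotated, L = not annotated).
topology_kind() {
    awk '
        /(^|[ \t])E[0-9]+-[SN][0-9]+/ { e++ }   # line carries an E label
        /(^|[ \t])L[0-9]+-[SN][0-9]+/ { l++ }   # line carries an L label
        END {
            if (e > 0 && l == 0)      print "annotated"
            else if (l > 0 && e == 0) print "not annotated"
            else if (e > 0 && l > 0)  print "mixed"
            else                      print "no link lines found"
        }'
}
```

For example, after running Etopology -read file_name, the file can be checked with topology_kind < file_name.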

Verify the System Data Repository (SDR)

To verify that the SDR is installed and operating correctly, run the SDR_test command on the control workstation. It can be run either through SMIT panels or from the command line.

To verify SDR installation and operation from the SMIT panel:

Issue the command:
```
smit SP_verify
```
The RS/6000 SP Installation/Configuration Verification menu appears.
Select: System Data Repository.
Press enter.
Review the output created.

To verify SDR installation and operation from the command line, enter:

/usr/lpp/ssp/bin/SDR_test

Review the output created.

Whenever the SDR_test command is run, a log file is created to enable the user to review the test results. The default log file created is: /var/adm/SPlogs/SDR_test.log. If the SDR_test command is run without root authority, the default log file created is: /tmp/SDR_test.log. Complete information on SDR_test can be found in PSSP: Command and Technical Reference. See also SDR verification test.
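As a small sketch of the log-file rule above, the following function prints the default log path for a given user ID. It assumes that "root authority" can be approximated by user ID 0, which is a simplification, and the name sdr_test_log is hypothetical.

```shell
# Hypothetical helper: print the default SDR_test log file for a user ID.
# With no argument, the current user's ID is used.
sdr_test_log() {
    uid=${1:-$(id -u)}
    if [ "$uid" -eq 0 ]; then
        echo /var/adm/SPlogs/SDR_test.log   # run with root authority
    else
        echo /tmp/SDR_test.log              # run without root authority
    fi
}
```

After a test run, the log could then be viewed with a command such as more "$(sdr_test_log)".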

Next, log in to the failing node and issue the command:

SDRGetObjects switch_responds

Examine the output that is returned. If the switch responds bits are returned, this indicates that the SDR is operating. You can also determine which nodes are operational on the switch by examining the value returned: A value of 1 indicates that the node is operational. A value of 0 indicates that the node is not operational.
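To pick out the nodes that are not operational, the output can be filtered. This is a minimal sketch that assumes the output consists of one header line followed by one row per node, with node_number in the first column and switch_responds in the second, as would be the case when only those two attributes are requested. The function name nodes_off_switch is hypothetical.

```shell
# Hypothetical helper: read SDRGetObjects output from standard input and
# print the node numbers whose switch_responds value is 0.
# Assumes column 1 is node_number and column 2 is switch_responds.
nodes_off_switch() {
    awk 'NR > 1 && $2 == 0 { print $1 }'   # NR > 1 skips the header line
}
```

A possible use is: SDRGetObjects switch_responds node_number switch_responds | nodes_off_switch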

Clock diagnostics

The following procedure should not be run on nodes that are operational on the switch. The utilities used for these verifications cannot coexist with normal switch operations on the node. If a clocking problem exists on all nodes in a rack, see SP rack or system clock diagnostics. Otherwise, see SP Switch external clock diagnostics.

SP Switch external clock diagnostics

Perform these steps to determine if the external clock is operational at the node:

Log in to the node in question.
Issue the following command:
```
/usr/lpp/ssp/css/diags/read_tbic -s
```
The output is similar to:
```
TBIC status register   :78XXXXXX
```
Look at bits 3 and 4 (the bits are numbered from left to right, starting with 0):
- If bits 3 and 4 are both ON (equal to 1), the external clock is operational at the node.
- If either bit 3 or bit 4 is OFF (equal to 0), the external clock is not operational at that node.
In this example, bits 3 and 4 are both ON, indicating that the external clock is operational at the node.
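The bit test can be sketched in shell. With bits numbered from the left starting at 0, bits 3 and 4 fall in the first two hexadecimal digits of the register, where they correspond to the mask 0x18. The function below is an illustration only: the name tbic_external_clock is hypothetical, and it assumes the register value is the text after the colon in the output line shown above.

```shell
# Hypothetical helper: decide from the TBIC status register whether the
# external clock is operational. Accepts either the bare register value
# or the whole output line ("TBIC status register   :78XXXXXX").
tbic_external_clock() {
    # Keep the text after the last colon and drop blanks.
    reg=$(printf '%s' "${1##*:}" | tr -d ' \t')
    # Bits 0-7 are the first two hexadecimal digits.
    byte=$(printf '%s' "$reg" | cut -c1-2)
    case $byte in
        [0-9A-Fa-f][0-9A-Fa-f]) ;;
        *) echo "unrecognized"; return 2 ;;
    esac
    # Bit 3 is mask 0x10 and bit 4 is mask 0x08 within that byte.
    if [ $(( 0x$byte & 0x18 )) -eq 24 ]; then
        echo "operational"
    else
        echo "not operational"
        return 1
    fi
}
```

For the example value 78, the masked result is 0x18, so both bits are ON.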

Perform the following steps to restore the external clock at the node:

Run rc.switch on the node with the following command:
```
/usr/lpp/ssp/css/rc.switch
```
The output is similar to:
```
adapter/mca/tb3
```
Determine if the clock is still not present using the same read_tbic command from the previous procedure.
If the clock is still not operational, try to Eclock the system.
The Eclock command affects all switch boards in the system and requires exclusive use of the switch, and therefore of every SP system partition on the switch. For more information on the Eclock command, see PSSP: Command and Technical Reference.
To Eclock the system, issue the command:
```
Eclock -d
```
Determine if the clock is present after the Eclock. Use the same read_tbic command from the previous procedure.
If the clock is still not present, run the Cable diagnostics or contact IBM Hardware Service and request that they run the Cable diagnostics.

SP rack or system clock diagnostics

The following table lists the possible clock loss problems on single racks and systems, along with their recovery actions:

Table 26. Clock problems and recovery actions

Problem	Recovery action
All nodes in a single rack will not clock.	Cause: SP Switch is powered off. Action: Power the switch on, run Eclock -d, then Estart. Cause: The switch is not Eclocked. Action: Run Eclock -d, then Estart. Cause: The clock topology file used does not match the physical system topology or the switch board is defective. Action: Contact IBM Hardware Service.
Some racks in the system will not clock.	Cause: SP Switches are powered off. Action: Power the switches on, run Eclock -d, then Estart. Cause: The system is not Eclocked. Action: Run Eclock -d, then Estart. Cause: The clock topology file used does not match the physical system topology or the master clock switch is bad. Action: Contact IBM Hardware Service.

SP Switch adapter diagnostics

The adapter diagnostics have two modes of operation: the Power-On-Self-Test (POST), and diagnostics started by issuing the diag command from the command line.

For the automatic POST tests scenario, issue the command:

diag -c -d css0

For advanced diagnostics scenarios, issue the command:

diag -A -d css0

The advanced tests check the cable wrap. You will need the card and cable wrap plug to complete these tests.

Note: The complete set of adapter diagnostics needs exclusive use of the css adapter on the node where the diagnostics are run. Any other processes that have the css device driver open must be closed (killed) before issuing the adapter diagnostics command. One of those processes is the fault service daemon: fault_service_Worm_RTG_SP. Processes such as the "switch clock reader application" make use of the fault service daemon, and therefore they should be closed as well.

The diagnostics failures are reported to the AIX error log of the failing node. To view the adapter diagnostics errors:

Log in to the failing node.
Issue the command:
```
errpt -a |grep "Switch adapter failed POST diagnostics" 
```
to view the POST adapter diagnostics AIX error log entries.
In most cases each of the error entries will contain a Service Request Number (SRN).
Use the SRN to locate your error and its recovery actions in Table 27.
Note that x may represent any value in Table 27.

Table 27. SP Switch adapter Service Request Number failures and recovery actions

SRN	Recovery action
1xx	See Verify software installation. If the verification is successful and the problem persists, contact the IBM Support Center.
28x	See Clock diagnostics. If the verification is successful and the problem persists, contact the IBM Support Center.
Axx	See Clock diagnostics. If the verification is successful and the problem persists, contact the IBM Support Center.
All other SRNs	Contact IBM Hardware Service and arrange to have the adapter or cable replaced.
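The lookup in Table 27 can be sketched as a shell case statement. This is an illustration only: the function name srn_action is hypothetical, and it assumes the argument is the three-character code in the form used in the table, with x matching any single character.

```shell
# Hypothetical helper: map an SRN code (as written in Table 27) to the
# section or service to consult. "?" plays the role of "x" in the table.
srn_action() {
    case $1 in
        1??)    echo "Verify software installation" ;;
        28?)    echo "Clock diagnostics" ;;
        [Aa]??) echo "Clock diagnostics" ;;
        *)      echo "Contact IBM Hardware Service" ;;
    esac
}
```

For the first three rows, the table also says to contact the IBM Support Center if the verification succeeds and the problem persists. The sketch does not model that follow-up.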

diag_fail condition for an SP Switch adapter

If a node is powered on before the switch boards receive power, the node's switch adapter does not receive a clock signal. This is a situation that could occur during installation. This causes a diag_fail condition for the SP Switch adapter. If the adapter status is diag_fail, the rc.switch process terminates without starting the fault service daemon.

There are two ways to correct the problem:

Reboot the node.
Reconfigure the adapter. This requires stopping any process that could hold the css0 device, such as Topology Services (hats). Issue these commands:
1. stopsrc -s hats
2. /usr/lpp/ssp/css/ucfgtb3 -l css0 -v
3. /usr/lpp/ssp/css/cfgtb3 -l css0 -v
4. startsrc -s hats

Note: If there are other processes, including application processes, holding the css0 device, these processes must be stopped and restarted as well.
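The four reconfiguration commands can be wrapped in a small script so that the sequence stops at the first failure. This is a sketch under assumptions: the function names run and reconfigure_css0 and the DRY_RUN variable are hypothetical, and the script does not handle other processes that hold the css0 device.

```shell
# Hypothetical wrapper: print the command when DRY_RUN is set,
# otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Reconfigure the css0 adapter using the commands listed above.
# Each step runs only if the previous one succeeded.
reconfigure_css0() {
    run stopsrc -s hats &&
    run /usr/lpp/ssp/css/ucfgtb3 -l css0 -v &&
    run /usr/lpp/ssp/css/cfgtb3 -l css0 -v &&
    run startsrc -s hats
}
```

Setting DRY_RUN=1 before calling reconfigure_css0 prints the commands without executing them. Note that if an intermediate step fails, hats is left stopped and must be restarted by hand.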

Cable diagnostics

Switch to switch cable diagnostics

Visually inspect the cable in question:

1. Remove the cable from the back of the switch and examine the connectors (cable and switch bulkhead jack) for bent pins or other visible damage. If everything looks OK, reconnect the cable to the switch bulkhead jack. If not, contact IBM Hardware Service and have them repair or replace the damaged components.
2. Repeat step 1 for the other end of the switch to switch cable.
3. Run the SP Switch Wrap Test and SP Switch Stress Test. See SP Switch and SP Switch2 advanced diagnostic tools.
4. If everything checks out, contact IBM Hardware Service and have them replace the cable. If the problem persists, contact the IBM Support Center.

Node to switch cable diagnostics

Visually inspect the cable in question:

Remove the cable from the back of the node and examine the connectors (cable and back of the adapter) for bent pins or other visible damage. If everything looks OK, reconnect the cable to the adapter. If not, contact IBM Hardware Service and have them repair or replace the damaged components.
Remove the cable from the back of the switch and examine the connectors (cable and switch bulkhead jack) for bent pins or other visible damage. If everything looks OK, reconnect the cable to the switch bulkhead jack. If not, contact IBM Hardware Service and have them repair or replace the damaged components.
Run the SP Switch Wrap Test and SP Switch Stress Test. See SP Switch and SP Switch2 advanced diagnostic tools.
If everything visually checks out, run advanced Adapter Diagnostics on the suspect adapter. The procedure is outlined in SP Switch adapter diagnostics. Follow the online instructions. If the diagnostics detect a failure, contact IBM Hardware Service and have them replace the failing components. If the adapter diagnostics pass and the problem persists, contact the IBM Support Center.
As a result of removing the cable, the node may be automatically fenced by the system. After reinstalling the cable, reboot the node or run the rc.switch command to reset the switch adapter. Only after this is complete, try to Eunfence the node.

SP Switch node diagnostics

Identify the failing node

Use this scenario if an application running on several nodes loses connectivity over the switch, or the switch_responds class indicates that several nodes are not on the switch. For more information on the switch_responds class, see the SDRGetObjects entry in PSSP: Command and Technical Reference.

View the summary log file, located on the control workstation.
See Summary log for SP Switch, SP Switch2, and switch adapter errors.
Locate the first AIX error log entry that indicates a node or connectivity failure.
Examine other entries to see if the first failure is the cause of subsequent failures.
On the node that experienced the first failure, examine the AIX error log to see the complete version of the record described previously.
Use this as a starting point to debug the problem on this node.

Verify software installation

Software installation is verified using the CSS_test command on the control workstation. It can be run either through SMIT panels or from the command line.

If you are using SP system partitions, CSS_test runs in the active SP system partition only. For more information on managing system partitions, see PSSP: Administration Guide.

If CSS_test is issued following a successful Estart, additional verification is done to determine whether each node in the system or system partition can be pinged.

To verify CSS installation from the SMIT panels:

Issue:
```
smit SP_verify
```
The RS/6000 SP Installation/Configuration Verification menu appears.
Select: Communications Subsystem.
Press enter.
Review the output created.

Whenever the CSS_test command is run, a log file is created to enable the user to review the test results. The file is /var/adm/SPlogs/CSS_test.log. Complete information on CSS_test can be found in PSSP: Command and Technical Reference.

To verify CSS installation from the command line:

Issue the command:
```
/usr/lpp/ssp/bin/CSS_test 
```
Review the log file.

When running CSS_test, consider the following:

The directory /usr/lpp/ssp on each of the nodes in the system partition should be accessible (execute and read permissions) to the user who performs the test.
The script file /etc/inittab on each node should contain an entry for the script rc.switch.
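The two considerations above can be checked on a node before CSS_test is run. The following is a minimal sketch, not a PSSP tool: the function name css_test_prereqs is hypothetical, the check applies only to the node and user that run it, and it looks for any line in /etc/inittab that mentions rc.switch rather than validating the entry.

```shell
# Hypothetical helper: check the CSS_test considerations on the local node.
# The optional arguments exist only so other paths can be substituted.
css_test_prereqs() {
    ssp_dir=${1:-/usr/lpp/ssp}
    inittab=${2:-/etc/inittab}
    status=0
    # The directory must be readable and searchable by the current user.
    if [ ! -r "$ssp_dir" ] || [ ! -x "$ssp_dir" ]; then
        echo "no read/execute access to $ssp_dir"
        status=1
    fi
    # /etc/inittab should mention the rc.switch script.
    if ! grep -q 'rc\.switch' "$inittab" 2>/dev/null; then
        echo "no rc.switch entry in $inittab"
        status=1
    fi
    [ "$status" -eq 0 ] && echo "prerequisites look satisfied"
    return "$status"
}
```

The function could be run on each node, for example through dsh from the control workstation, provided its definition is available there.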

Verify SP Switch node operation

Use this procedure to verify that a single SP Switch node is operating correctly. If the node you are attempting to verify is the primary node, start with Step 1. If it is a secondary node, start with Step 2.

Determine which node is the primary by issuing the Eprimary command on the control workstation. For complete information on the Eprimary command, see PSSP: Command and Technical Reference. For our purposes, consider this output:
```
1 - primary
2 - oncoming primary
26 - primary backup
26 - oncoming primary backup
1 - autounfence
```
If the command returns an oncoming primary value of none, reissue the Eprimary command, specifying the node you would like to have as the primary node. Following the completion of the Eprimary command (to change the oncoming primary) an Estart command is required to make the oncoming primary node the primary.
If the command returns a primary value of none, an Estart is required to make the oncoming primary node the primary.
The primary node on the SP Switch system can move to another node, if a primary node takeover is initiated by the backup. To determine if this has happened, look at the values of the primary and the oncoming primary backup. If they are the same value, then a takeover has occurred.
Ensure that the node is accessible from the control workstation. This is done by using the dsh command to issue the date command on the node as follows:
```
/usr/lpp/ssp/bin/dsh -w problem_hostname date
```
The output is similar to:
```
TUE Jan 25 10:24:28 EDT 2000 
```
If the current date and time are not returned, refer to Diagnosing remote command problems on the SP System.
Verify that the switch adapter (css0) is configured and is ready for operation on the node. This can be done by examining the adapter_config_status attribute in the switch_responds object of the SDR:
```
SDRGetObjects switch_responds node_number==problem_node_number \
node_number switch_responds autojoin isolated adapter_config_status
```
The output is similar to:
```
node_number switch_responds autojoin isolated adapter_config_status
 1               0               0       0        css_ready 
```
If the adapter_config_status value is anything other than css_ready, see Adapter configuration error information.
To obtain the value to use for problem_node_number, issue an SDR query of the node_number attribute of the Node object, as follows:
```
SDRGetObjects Node reliable_hostname==problem_hostname node_number 
```
The output is similar to the following:
```
node_number
   1
```
Verify that the fault_service_Worm_RTG_SP daemon is running on the node. This can be accomplished by using the dsh command on the control workstation to issue a ps command to the problem node as follows:
```
/usr/lpp/ssp/bin/dsh -w problem_hostname ps -e | grep Worm_RTG 
```
The output is similar to the following:
```
18422  -0:00 fault_service_Worm_RTG_SP
```
If the fault_service_Worm_RTG_SP daemon is running, SP Switch node verification is complete.
If the fault_service_Worm_RTG_SP daemon is not running, see AIX Error Log information. The possible reasons why the fault_service_Worm_RTG_SP daemon is not running are:
- The daemon exited due to an abnormal error condition.
- A SIGTERM, SIGBUS, or SIGDANGER signal was processed by the daemon.
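Two of the checks in this procedure lend themselves to scripting: interpreting the Eprimary output and reading the adapter_config_status value. The following is a sketch under assumptions. The function names eprimary_advice and adapter_status are hypothetical, the Eprimary output is assumed to have the "value - label" layout shown in the example, and the SDRGetObjects output is assumed to be a header line followed by one data row.

```shell
# Hypothetical helper: read Eprimary output from standard input and print
# the action suggested by the rules in this procedure.
eprimary_advice() {
    awk -F' - ' '
        {
            value = $1; label = $2
            gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim blanks
            gsub(/^[ \t]+|[ \t]+$/, "", label)
            role[label] = value
        }
        END {
            if (role["oncoming primary"] == "none")
                print "run Eprimary with a node, then Estart"
            else if (role["primary"] == "none")
                print "run Estart"
            else if (role["primary"] == role["oncoming primary backup"])
                print "primary node takeover has occurred"
            else
                print "primary is node " role["primary"]
        }'
}

# Hypothetical helper: read SDRGetObjects switch_responds output from
# standard input and print the adapter_config_status of the first row.
adapter_status() {
    awk '
        NR == 1 {
            # Locate the adapter_config_status column in the header.
            for (i = 1; i <= NF; i++)
                if ($i == "adapter_config_status") col = i
            next
        }
        col && NF >= col { print $col; exit }'
}
```

On the control workstation, Eprimary | eprimary_advice applies the first check. Piping the SDRGetObjects query shown above into adapter_status prints css_ready when the adapter is configured.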

Node crash

A node crash is generally identified by the LED/LCD display on the node flashing 888. Do not reboot the node. See Producing a system dump.

SP Switch advanced diagnostics

To examine the SP Switch fabric in more detail, see SP Switch and SP Switch2 advanced diagnostic tools.
