If your SP system or SP system partition shows signs of a switch failure, locate the symptom and perform the recovery action described. All the recovery actions described require that the user have root access on the specified SP Switch node.
The switch topology file is used to define the hardware configuration to the css support software. It should reflect the number of switches and nodes installed, as well as define how they are connected.
The topology file can reside in two places: in the SDR, or in the expected.top file in the /etc/SP directory of the primary node. Usually, the configuration in the SDR is used. If /etc/SP/expected.top exists on the primary node, it overrides the configuration in the SDR. The /etc/SP/expected.top file on the primary node is generally used for debugging purposes.
To verify that the topology in the SDR is correct, first read it out of the SDR using the command:
Etopology -read file_name
The Etopology command reads the switch topology from the SDR and places it in the specified file. For more information on this command, see PSSP: Command and Technical Reference.
Once the file is extracted, verify that the switch topology is an accurate representation of the installed hardware.
If changes to the switch topology file are required, remember to place them back into the SDR using the Etopology command:
Etopology file_name
A default set of topology configuration files is available in the /etc/SP directory. For more information, see PSSP: Command and Technical Reference.
The SP Switch uses an annotated topology file, produced by the Eannotator command. The system administrator is responsible for running the command and creating the annotated topology file. When the file is not annotated, the fault service daemon still works and the switch functions, but the switch jack numbers will not be correct. If you suspect that your topology file has a problem, you can verify it: examine the out.top file in /var/adm/SPlogs/css on each node, and examine the topology file (using the Etopology -read file_name command as described previously).
Each link line in an annotated file is marked by E as in the example:
E01-S17-BH-J7 to E01-N1
Each link line in a file that is not annotated is marked by L as in the example:
L01-S17-BH-J7 to L01-N1
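The E and L markers can be checked mechanically once the topology has been read into a file. The following is a minimal sketch, not part of PSSP: the function name is hypothetical, and the pattern assumes that every link label has the shape of the two examples above. Link lines with other label shapes are not recognized.

```python
import re

# Assumed link-label shape, taken from the two examples above:
#   <E|L><frame>-S<switch>-BH-J<jack> to ...
# The label may appear anywhere on the line, so search rather than match.
LINK_LABEL = re.compile(r"(?<!\S)([EL])\d+-S\d+-BH-J\d+\s+to\s+")

def classify_topology(text):
    """Return 'annotated', 'not annotated', 'mixed', or 'no link lines'."""
    markers = set()
    for line in text.splitlines():
        found = LINK_LABEL.search(line)
        if found:
            markers.add(found.group(1))
    if not markers:
        return "no link lines"
    if markers == {"E"}:
        return "annotated"
    if markers == {"L"}:
        return "not annotated"
    return "mixed"
```

A result of "not annotated" or "mixed" suggests that Eannotator has not been run against the file that was stored in the SDR.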
To verify that the SDR is installed and operating correctly, run the SDR_test command on the control workstation. It can be run either through SMIT panels or from the command line.
To verify SDR installation and operation from the SMIT panel:
smit SP_verify
To verify SDR installation and operation from the command line, enter:
/usr/lpp/ssp/bin/SDR_test
Review the output created.
Whenever the SDR_test command is run, a log file is created to enable the user to review the test results. The default log file created is: /var/adm/SPlogs/SDR_test.log. If the SDR_test command is run without root authority, the default log file created is: /tmp/SDR_test.log. Complete information on SDR_test can be found in PSSP: Command and Technical Reference. See also SDR verification test.
Next, log in to the failing node and issue the command:
SDRGetObjects switch_responds
Examine the output that is returned. If the switch responds bits are returned, this indicates that the SDR is operating. You can also determine which nodes are operational on the switch by examining the value returned: A value of 1 indicates that the node is operational. A value of 0 indicates that the node is not operational.
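When many nodes are listed, the 1 and 0 values can be tabulated with a short script. This is a sketch under assumptions: the function name is hypothetical, and the output is assumed to consist of a header line that names the columns, followed by one whitespace-separated row per node.

```python
def operational_nodes(output):
    """Map node_number -> True (operational) or False (not operational).

    Assumes a header line naming the columns, then one row per node.
    """
    lines = [ln.split() for ln in output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header, rows = lines[0], lines[1:]
    node_col = header.index("node_number")
    resp_col = header.index("switch_responds")
    # A switch_responds value of 1 means the node is operational.
    return {int(row[node_col]): row[resp_col] == "1" for row in rows}
```

The captured text of SDRGetObjects switch_responds would be passed in as a single string; nodes that map to False are the ones to investigate.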
The following procedure should not be run on nodes that are operational on the switch. The utilities used for these verifications cannot coexist with normal switch operations on the node. If a clocking problem exists on all nodes in a rack, see SP rack or system clock diagnostics. Otherwise, see SP Switch external clock diagnostics.
Perform these steps to determine if the external clock is operational at the node:
/usr/lpp/ssp/css/diags/read_tbic -s
TBIC status register :78XXXXXX
In this example, bits 3 and 4 are both ON, indicating that the external clock is operational at the node.
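The bit test can be written out explicitly. The sketch below assumes that bit 0 is the most significant bit of the register, which is consistent with the example (hexadecimal 78 is binary 0111 1000, so bits 1 through 4 are ON). The function name is hypothetical.

```python
def external_clock_ok(register_hex):
    """Check that bits 3 and 4 of the TBIC status register are both ON.

    Assumes bit 0 is the most significant bit, so bits 3 and 4 sit in the
    leading byte under mask 0x18. Only the first two hex digits are read,
    so the placeholder form 78XXXXXX can be passed unchanged.
    """
    leading_byte = int(register_hex.strip()[:2], 16)
    return (leading_byte & 0x18) == 0x18
```

For the value shown above, external_clock_ok("78XXXXXX") returns True. A value whose leading byte is 70 has only bit 3 ON and returns False.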
Perform the following steps to restore the external clock at the node:
/usr/lpp/ssp/css/rc.switch
adapter/mca/tb3
The Eclock command affects all switch boards in the system and requires exclusive use of the switch, and therefore the SP system partitions on the switch. For more information on the Eclock command see PSSP: Command and Technical Reference.
To Eclock the system, issue the command:
Eclock -d
The following table lists the possible clock loss problems on single racks and systems, along with their recovery actions:
Table 26. Clock problems and recovery actions
The adapter diagnostics can run in two ways: as the Power-On Self-Test (POST), or when you issue the diag command from the command line.
For the automatic POST tests scenario, issue the command:
diag -c -d css0
For advanced diagnostics scenarios, issue the command:
diag -A -d css0
The advanced tests check the cable wrap. You will need the card and cable wrap plug to complete these tests.
The diagnostics failures are reported to the AIX error log of the failing node. To view the POST adapter diagnostics AIX error log entries, issue the command:
errpt -a | grep "Switch adapter failed POST diagnostics"
Table 27. SP Switch adapter Service Request Number failures and recovery actions
SRN | Recovery action |
---|---|
1xx | See Verify software installation. If the verification is successful and the problem persists, contact the IBM Support Center. |
28x | See Clock diagnostics. If the verification is successful and the problem persists, contact the IBM Support Center. |
Axx | See Clock diagnostics. If the verification is successful and the problem persists, contact the IBM Support Center. |
All other SRNs | Contact IBM Hardware Service and arrange to have the adapter or cable replaced. |
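The table lookup can be expressed as a small function, for example when triaging SRNs collected from several nodes. This is an illustrative sketch: the function name is hypothetical, the SRN is assumed to be passed as the three-character code shown in the table, and each x is assumed to stand for any single character.

```python
def srn_recovery_action(srn):
    """Return the Table 27 recovery action for a Service Request Number.

    Assumes a three-character code; 'x' positions in the table are treated
    as wildcards. Anything else falls through to the last table row.
    """
    srn = srn.strip().upper()
    escalate = ("If the verification is successful and the problem "
                "persists, contact the IBM Support Center.")
    if len(srn) == 3:
        if srn[0] == "1":                         # 1xx
            return "See Verify software installation. " + escalate
        if srn[:2] == "28" or srn[0] == "A":      # 28x or Axx
            return "See Clock diagnostics. " + escalate
    return ("Contact IBM Hardware Service and arrange to have the adapter "
            "or cable replaced.")
```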
If a node is powered on before the switch boards receive power, the node's switch adapter does not receive a clock signal. This is a situation that could occur during installation. This causes a diag_fail condition for the SP Switch adapter. If the adapter status is diag_fail, the rc.switch process terminates without starting the fault service daemon.
There are two ways to correct the problem:
Visually inspect the cable in question:
Use this scenario if an application running on several nodes loses connectivity over the switch, or the switch_responds class indicates that several nodes are not on the switch. For more information on the switch_responds class, see the SDRGetObjects entry in PSSP: Command and Technical Reference.
See Summary log for SP Switch, SP Switch2, and switch adapter errors.
The software installation and verification are done using the CSS_test command on the control workstation. It can be run either through SMIT panels or the command line.
If you are using SP system partitions, CSS_test runs in the active SP system partition only. For more information on managing system partitions, see PSSP: Administration Guide.
If CSS_test is issued following a successful Estart, additional verification is done to determine whether each node in the system or system partition can be pinged.
To verify CSS installation from the SMIT panels:
smit SP_verify
Whenever the CSS_test command is run, a log file is created to enable the user to review the test results. The file is /var/adm/SPlogs/CSS_test.log. Complete information on CSS_test can be found in PSSP: Command and Technical Reference.
To verify CSS installation from the command line:
/usr/lpp/ssp/bin/CSS_test
When running CSS_test, consider the following:
Use this procedure to verify that a single SP Switch node is operating correctly. If the node you are attempting to verify is the primary node, start with Step 1. If it is a secondary node, start with Step 2.
1 - primary
2 - oncoming primary
26 - primary backup
26 - oncoming primary backup
1 - autounfence
If the command returns an oncoming primary value of none, reissue the Eprimary command, specifying the node that you want as the primary node. After the Eprimary command completes (changing the oncoming primary), an Estart command is required to make the oncoming primary node the primary.
If the command returns a primary value of none, an Estart is required to make the oncoming primary node the primary.
The primary node on the SP Switch system can move to another node, if a primary node takeover is initiated by the backup. To determine if this has happened, look at the values of the primary and the oncoming primary backup. If they are the same value, then a takeover has occurred.
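The takeover check described above is a comparison of two values, which can be sketched as follows. Both function names are hypothetical, and the parser assumes that each value and role pair appears on its own line in the form "value - role".

```python
def parse_eprimary(output):
    """Parse 'value - role' pairs from Eprimary output into a dict.

    Assumes one pair per line, for example '26 - primary backup'.
    """
    roles = {}
    for line in output.splitlines():
        value, sep, role = line.partition(" - ")
        if sep:
            roles[role.strip()] = value.strip()
    return roles

def takeover_occurred(roles):
    """True when primary equals oncoming primary backup (and is not none)."""
    primary = roles.get("primary")
    backup = roles.get("oncoming primary backup")
    return primary is not None and primary != "none" and primary == backup
```

If the parsed roles show the same node as both primary and oncoming primary backup, the backup has taken over as primary.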
/usr/lpp/ssp/bin/dsh -w problem_hostname date
The output is similar to:
TUE Jan 25 10:24:28 EDT 2000
If the current date and time are not returned, refer to Diagnosing remote command problems on the SP System.
SDRGetObjects switch_responds node_number==problem_node_number node_number switch_responds autojoin isolated adapter_config_status
The output is similar to:
node_number switch_responds autojoin isolated adapter_config_status
1 0 0 0 css_ready
If the adapter_config_status object is anything other than css_ready, see Adapter configuration error information.
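The adapter_config_status decision can be scripted for a single node. This is a minimal sketch with hypothetical function names. It assumes that the query output is one header line followed by one data row, and it pairs each column name with its value rather than relying on column position.

```python
def adapter_status(output):
    """Return adapter_config_status for the single node in the output.

    Assumes one header line followed by one data row.
    """
    lines = [ln.split() for ln in output.splitlines() if ln.strip()]
    header, row = lines[0], lines[1]
    return dict(zip(header, row))["adapter_config_status"]

def next_step(status):
    """Map the status to the action that this procedure prescribes."""
    if status == "css_ready":
        return "adapter configured; continue verification"
    return "see Adapter configuration error information"
```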
To obtain the value to use for problem_node_number, issue an SDR query of the node_number attribute of the Node object, as follows:
SDRGetObjects Node reliable_hostname==problem_hostname node_number
The output is similar to the following:
node_number
1
/usr/lpp/ssp/bin/dsh -w problem_hostname ps -e | grep Worm_RTG
The output is similar to the following:
18422 - 0:00 fault_service_Worm_RTG_SP
If the fault_service_Worm_RTG_SP daemon is running, SP Switch node verification is complete.
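The daemon check can be automated by scanning captured ps output. The sketch below uses a hypothetical function name and assumes that the command name is the last field on each line, which still holds if dsh prefixes each line with the host name.

```python
def fault_service_running(ps_output, daemon="fault_service_Worm_RTG_SP"):
    """True if any line of ps output has the daemon as its last field."""
    for line in ps_output.splitlines():
        fields = line.split()
        if fields and fields[-1] == daemon:
            return True
    return False
```

Comparing the whole last field, rather than searching for a substring, avoids a false match on an unrelated process whose name merely contains Worm_RTG.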
If the fault_service_Worm_RTG_SP daemon is not running, see AIX Error Log information. The possible reasons why the fault_service_Worm_RTG_SP daemon is not running are:
A node crash is generally identified by the LED/LCD display on the node flashing 888. Do not reboot the node. See Producing a system dump.
To examine the SP Switch fabric in more detail, see SP Switch and SP Switch2 advanced diagnostic tools.