If your SP system or SP system partition shows signs of a switch failure, locate the symptom and perform the recovery action described. All the recovery actions described require that the user have root access on the specified SP Switch2 node.
The switch plane topology file is used to define the hardware configuration to the css support software. It should reflect the number of switches and nodes installed, as well as define how they are connected.
The topology file can reside in two places: in the SDR, or in the expected.top file in the /etc/SP directory of the primary node. Usually, the configuration in the SDR is used. If /etc/SP/expected.top exists on the primary node, it overrides the configuration in the SDR. The /etc/SP/expected.top file on the primary node is generally used for debugging purposes.
To verify that the topology in the SDR is correct, first read it out of the SDR using the command:
Etopology -read [-p (0 | 1 | all)] file_name
The -p flag indicates the switch plane number.
The Etopology command reads the switch topology from the SDR and places it in the specified file. For more information on this command, see PSSP: Command and Technical Reference.
Once the file is extracted, verify that the switch topology is an accurate representation of the installed hardware.
If changes to the switch topology file are required, remember to place them back into the SDR by issuing the Etopology command:
Etopology [-p (0 | 1 | all)] file_name
A default set of topology configuration files is available in the /etc/SP directory. For more information, see PSSP: Command and Technical Reference.
The SP Switch2 uses an annotated topology file, produced by the Eannotator command. The system administrator is responsible for running the command and creating the annotated topology file. When the file is not annotated, the fault service daemon will still work and the switch will function, but the switch jack numbers will not be correct in the topology file, the out.top file, and the cable_miswire file. If you suspect your topology file was not annotated, verify it. Examine the out.top file in /var/adm/SPlogs/css0/p0 and /var/adm/SPlogs/css1/p0 of each node, and examine the topology file by using the Etopology -read command described earlier.
Each link line in an annotated file is marked by E, as in this example:
s 13 0 s 23 0 E01-S17-BH-J6 to E02-S17-BH-J6
Each link line in a file that is not annotated is marked by L, as in this example:
s 13 0 s 23 0 L01-S00-BH-J9 to L02-S00-BH-J9
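The E and L markers described above can be checked mechanically once the topology has been read into a file. The following is a minimal sketch, not part of PSSP: the function names are hypothetical, the label pattern is inferred only from the two sample lines in this section, and a line is treated as annotated only when both of its jack labels start with E.

```python
import re

# Matches the two jack labels that end a link line, for example
# "E01-S17-BH-J6 to E02-S17-BH-J6". This grammar is an assumption based
# only on the two sample lines shown in this section.
LABEL_PAIR = re.compile(r"\b([EL])\d+-\S+\s+to\s+([EL])\d+-\S+\s*$")


def link_annotation(line):
    """Return 'annotated', 'not annotated', or None for a non-link line."""
    match = LABEL_PAIR.search(line)
    if match is None:
        return None
    # Assumption: both labels must carry the E marker.
    if match.group(1) == "E" and match.group(2) == "E":
        return "annotated"
    return "not annotated"


def unannotated_links(lines):
    """Return the link lines that still carry an L marker."""
    return [ln for ln in lines if link_annotation(ln) == "not annotated"]
```

Passing the lines of a file written by Etopology -read to unannotated_links returns the link lines that would still need the Eannotator command; an empty list suggests the file is annotated.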
To verify that the SDR is installed and operating correctly, run the SDR_test command on the control workstation. It can be run either through SMIT panels or from the command line.
To verify SDR installation and operation from the SMIT panels:
smit SP_verify
To verify SDR installation and operation from the command line, enter:
/usr/lpp/ssp/bin/SDR_test
and review the output created.
Whenever the SDR_test command is run, a log file is created to enable the user to review the test results. The default log file created is: /var/adm/SPlogs/SDR_test.log. If the SDR_test command is run without root authority, the default log file created is: /tmp/SDR_test.log. Complete information on SDR_test can be found in PSSP: Command and Technical Reference. See also SDR verification test.
Next, log in to the failing node and issue the command:
SDRGetObjects switch_responds node_number switch_responds0
Examine the output that is returned. If the switch responds bits are returned, this indicates that the SDR is operating. You can also determine which nodes are operational on the switch by examining the value returned. A value of 1 indicates that the node is operational on the switch. A value of 0 indicates that the node is not operational on the switch.
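Interpreting the returned values can be scripted when many nodes are involved. The sketch below assumes that the command output is a whitespace-separated table whose first line names the attributes; the helper names and the sample text are hypothetical and are not taken from a real system.

```python
def parse_sdr_table(text):
    """Split SDRGetObjects-style output into a list of row dictionaries.

    Assumption: the first non-blank line is a header of attribute names
    and every following line holds one whitespace-separated row.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def nodes_off_switch(text):
    """Return node numbers whose switch_responds0 value is not 1."""
    return [
        int(row["node_number"])
        for row in parse_sdr_table(text)
        if row.get("switch_responds0") != "1"
    ]


# Hypothetical sample output, for illustration only.
SAMPLE = """node_number switch_responds0
1 1
5 0
"""
```

With the hypothetical sample above, nodes_off_switch(SAMPLE) reports node 5 as not operational on the switch.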
Verify that the SDR and the node configuration match each other.
This verification procedure can be done on a node that tried to run the fault
service daemon. On the node, view the file
/var/adm/SPlogs/css/rc.switch.log. This file
lists the configuration information found by the rc.switch
script just before it attempts to run the fault service daemon.
Table 42. SP Switch2 rc.switch.log file and SDR equivalents
Line in rc.switch.log | Value in file | Value in SDR |
---|---|---|
Line 1 | Date and time when the information was taken. | |
Lines 2, 4, 8 | Node configuration: reliable_hostname, node_number and switch_node_number | SDRGetObjects Node frame_number==tested_node_frame slot_number==tested_node_slot reliable_hostname node_number switch_node_number |
Line 5 | switch_type should be equal to 132 for the SP Switch2. | |
Line 6 | number_switch_planes should be equal to 1. | |
Line 7 | adapter_name and adapter_status should be only one device - css0 with css_ready. If the status is other than css_ready, see Table 37. | |
Lines 9, 10, 11 | IP configuration: netaddr and netmask | SDRGetObjects Adapter node_number==tested_node adapter_type==css0 netaddr netmask |
If any of the above node configuration data do not match, correct the SDR configuration and re-configure the problem node. See PSSP: Installation and Migration Guide for more details on how to do this.
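The comparison in Table 42 can be expressed as a small check. This is a sketch under stated assumptions: the layout of rc.switch.log is not described here, so the function takes values that have already been copied by hand into dictionaries, and the function and constant names are hypothetical. The fixed values (132, 1, css0, css_ready) are the ones listed in the table.

```python
# Fields that must be identical in rc.switch.log and in the SDR.
COMPARED_FIELDS = (
    "reliable_hostname",
    "node_number",
    "switch_node_number",
    "netaddr",
    "netmask",
)

# Fields that Table 42 expects to hold a fixed value in rc.switch.log.
EXPECTED_FIXED = {
    "switch_type": "132",
    "number_switch_planes": "1",
    "adapter_name": "css0",
    "adapter_status": "css_ready",
}


def config_mismatches(log_values, sdr_values):
    """Return the names of fields that fail the Table 42 checks."""
    problems = []
    for field in COMPARED_FIELDS:
        if log_values.get(field) != sdr_values.get(field):
            problems.append(field)
    for field, expected in EXPECTED_FIXED.items():
        if log_values.get(field) != expected:
            problems.append(field)
    return problems
```

An empty result means the values supplied agree; any field name returned points to the row of Table 42 to re-examine before correcting the SDR.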
The adapter diagnostics have two modes of operation: the Power-On Self-Test (POST), and online diagnostics run by issuing the diag command.
For the automatic POST tests scenario, issue the command:
diag -c -d [css0 | css1]
For advanced diagnostics scenarios, issue the command:
diag -A -d [css0 | css1]
The advanced tests check the cable wrap. You will need the card and cable wrap plug to complete these tests.
Diagnostic failures are reported to the AIX error log of the failing node. To view the POST adapter diagnostics entries in the AIX error log, issue the command:
errpt -a | grep "Switch adapter failed POST diagnostics"
Table 43. SP Switch2 adapter Service Request Number failures and recovery actions
SRN | Recovery action |
---|---|
765-x1xx | See Verify software installation. If the verification is successful and the problem persists, contact the IBM Support Center. |
765-1xx6 | Wrap test failed. This problem is caused by a bad switch cable. Contact IBM Hardware Service and arrange to have the switch cable replaced. |
765-2xx1 | RDRAM test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xx2 | RDRAM Controller test failed. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xx3 | SRAM test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xx4 | SRAM Controller test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xx5 | DMA test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xx6 | Wrap test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xx7 | Registers test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xx8 | 740 access test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xx9 | Reassembly test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xxA | Segmentation test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
765-2xxB | Interrupts test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced. |
All other SRNs | Contact IBM Hardware Service and arrange to have the adapter or cable replaced. |
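Looking up a reported SRN in Table 43 can be sketched as a pattern match. The code below assumes that a lowercase x in the table stands for any single character; that reading, the function names, and the shortened action text are assumptions made for illustration. Because an SRN can fit more than one pattern under that reading, the sketch returns every matching row rather than choosing one.

```python
# (pattern, shortened recovery action) pairs, in the order of Table 43.
SRN_TABLE = [
    ("765-x1xx", "verify software installation"),
    ("765-1xx6", "replace switch cable"),
    ("765-2xx1", "replace adapter"),
    ("765-2xx2", "replace adapter"),
    ("765-2xx3", "replace adapter"),
    ("765-2xx4", "replace adapter"),
    ("765-2xx5", "replace adapter"),
    ("765-2xx6", "replace adapter"),
    ("765-2xx7", "replace adapter"),
    ("765-2xx8", "replace adapter"),
    ("765-2xx9", "replace adapter"),
    ("765-2xxA", "replace adapter"),
    ("765-2xxB", "replace adapter"),
]

FALLBACK = ("All other SRNs", "replace adapter or cable")


def pattern_matches(pattern, srn):
    """True when srn fits pattern, treating lowercase 'x' as a wildcard."""
    if len(pattern) != len(srn):
        return False
    return all(p == "x" or p.upper() == s.upper() for p, s in zip(pattern, srn))


def matching_rows(srn):
    """Return every Table 43 row that fits, or the fallback row."""
    rows = [row for row in SRN_TABLE if pattern_matches(row[0], srn)]
    return rows or [FALLBACK]
```

For example, matching_rows("765-2003") returns the single SRAM test row, while an SRN that fits no pattern falls through to the "All other SRNs" row.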
Visually inspect the cable in question:
Use this scenario if an application running on several nodes loses connectivity over the switch, or the switch_responds class indicates that several nodes are not on the switch. For more information on the switch_responds class, see the SDRGetObjects entry in PSSP: Command and Technical Reference.
See Summary log for SP Switch, SP Switch2, and switch adapter errors.
The software installation and verification are done using the CSS_test command on the control workstation. CSS_test can be run either through SMIT panels or from the command line.
If CSS_test is issued following a successful Estart, additional verification is done to determine whether each node in the system or system partition can be pinged.
To verify CSS installation from the SMIT panels:
smit SP_verify
Whenever the CSS_test command is run, a log file is created to enable the user to review the test results. The log file is /var/adm/SPlogs/CSS_test.log. Complete information on CSS_test can be found in PSSP: Command and Technical Reference.
To verify CSS installation from the command line:
/usr/lpp/ssp/bin/CSS_test
When running CSS_test, consider the following:
Use this procedure to verify that a single SP Switch2 node is operating correctly. If the node you are attempting to verify is the primary node, start with Step 1. If it is a secondary node, start with Step 2.
plane 0: 1 - primary
plane 0: 5 - oncoming primary
plane 0: 49 - primary backup
plane 0: 45 - oncoming primary backup
plane 0: 1 - autounfence
plane 1: 1 - primary
plane 1: 5 - oncoming primary
plane 1: 49 - primary backup
plane 1: 45 - oncoming primary backup
plane 1: 1 - autounfence
If the command returns a primary value of none, an Estart is required to make the oncoming primary node the primary.
If the command returns an oncoming primary value of none, reissue the Eprimary command specifying the node that you would like to have as the primary node. Following the completion of the Eprimary command (to change the oncoming primary) an Estart is required to make the oncoming primary node the primary.
The primary node on the SP Switch2 system can move to another node, if a primary node takeover is initiated by the backup. To determine if this has happened, look at the values of the primary and the oncoming primary backup. If they are the same value, a takeover has occurred.
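The takeover rule above (primary equal to oncoming primary backup) can be applied per switch plane. The following sketch assumes output entries of the form "plane N: value - role", as in the sample shown earlier; the function names are hypothetical, and a value of none is treated as "no node assigned" rather than as a match.

```python
import re

# One entry of the role listing, e.g. "plane 0: 45 - oncoming primary backup".
# Longer role names come first so that "primary" does not match too early.
ENTRY = re.compile(
    r"plane\s+(\d+):\s*(\S+)\s+-\s+"
    r"(oncoming primary backup|oncoming primary|primary backup|primary|autounfence)"
)


def roles_by_plane(text):
    """Return {plane_number: {role: value}} parsed from the listing."""
    planes = {}
    for plane, value, role in ENTRY.findall(text):
        planes.setdefault(int(plane), {})[role] = value
    return planes


def takeover_planes(text):
    """Return the planes on which a primary node takeover has occurred."""
    result = []
    for plane, roles in sorted(roles_by_plane(text).items()):
        primary = roles.get("primary")
        backup = roles.get("oncoming primary backup")
        if primary is not None and primary != "none" and primary == backup:
            result.append(plane)
    return result
```

With the sample values shown earlier (primary 1, oncoming primary backup 45 on both planes), takeover_planes returns an empty list.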
/usr/lpp/ssp/bin/dsh -w problem_hostname date
The output is similar to:
TUE JAN 25 10:24:28 EST 2000
If the current date and time are not returned, refer to Diagnosing remote command problems on the SP System.
SDRGetObjects Adapter adapter_type==css0 node_number adapter_config_status
or
SDRGetObjects Adapter adapter_type==css1 node_number adapter_config_status
The output is similar to:
node_number adapter_config_status
1 css_ready
If the adapter_config_status object is anything other than css_ready, see Adapter configuration error information. More information on the error may be found in the AIX error log. Run the errpt -a command on this node and match the adapter error log to the error list found in AIX Error Log information.
/usr/lpp/ssp/bin/dsh -w problem_hostname ps -e | grep Worm_RTG
The output is similar to the following:
18422 - 0:00 fault_service_Worm_RTG_CS
If the fault_service_Worm_RTG_CS daemon is running, SP Switch2 node verification is complete.
If the fault_service_Worm_RTG_CS daemon is not running, see AIX Error Log information. The possible reasons why the fault_service_Worm_RTG_CS daemon is not running are:
Look in the node's AIX error log for local adapter errors and handle any that are found. If no local adapter errors are found, the adapter diagnostics are not running, and the adapter.log still shows an exit state, run the rc.switch -l [css0 | css1] command to restart the fault service daemon. Then check whether the end of the adapter.log now contains the message ------Adapter Thread Started------, which signals that the adapter has been restarted.
A node crash is generally identified by the LED/LCD display on the node flashing 888. Do not reboot the node. See Producing a system dump.
Some applications use the Switch Time-Of-Day (Switch TOD). The Switch TOD is a value that is passed only to nodes that are on the switch (their corresponding switch_responds0 flag is 1). The most significant bit (MSB) of this 64-bit value is called the 'valid bit'. When the 'valid bit' is 1, the Switch TOD is valid: the value you see is synchronized with the Switch TOD. When the 'valid bit' is 0, the Switch TOD is invalid: the value you see is propagated by the node and is not synchronized with the Switch TOD, so it is not guaranteed to be within range of the Switch TOD.
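Testing the 'valid bit' is a single mask operation on the 64-bit value. The sketch below only illustrates the bit test; how an application obtains the raw Switch TOD value is not covered here, and the function name is hypothetical.

```python
# The 'valid bit' is the most significant bit of the 64-bit Switch TOD.
VALID_BIT_MASK = 1 << 63


def switch_tod_is_valid(tod):
    """Return True when the valid bit (bit 63) of a 64-bit TOD is set."""
    if not 0 <= tod < (1 << 64):
        raise ValueError("Switch TOD must be an unsigned 64-bit value")
    return bool(tod & VALID_BIT_MASK)
```

A value such as 0x8000000000000001 has the valid bit set, while 0x7FFFFFFFFFFFFFFF does not.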
When your node begins to receive the Switch TOD, its value changes to the Switch TOD value (if necessary) and the 'valid bit' is set to 1. The Emaster command shows the node number that is responsible for the Switch TOD. The emaster daemon, which runs on the control workstation, monitors the switch and tries to recover the Switch TOD when necessary.
The sections and subsections that follow are ordered by likelihood, with the most probable cause of the problem first. After each item, check the Switch TOD again, and if the problem persists, continue to the next item.
Some of the nodes in your system have Switch TOD 'valid bit' turned ON and some have it turned OFF.
SDRGetObjects switch_responds node_number==your_node_number switch_responds0
All the nodes in your system have Switch TOD 'valid bit' turned OFF.
49 - Master switch sequencing node
Subsystem Group PID Status
emaster swt 25380 active
The Emaster command shows no node number as the Master Switch Sequencer (MSS) node.
Subsystem Group PID Status
emaster swt 25380 active
There is no Switch TOD recovery on your system. An event happened on your system that should have caused a replacement of the MSS node, but failed to replace the MSS node.
Subsystem Group PID Status
emaster swt 25380 active
Issue these commands on the control workstation, to verify the daemons for each node:
If any of these have "inoperative" status, you may experience a problem with Switch TOD recovery. To restart these you should run:
and then run:
To examine the SP Switch2 fabric in more detail, see SP Switch and SP Switch2 advanced diagnostic tools.