Diagnosis Guide

Diagnostic procedures

If your SP system or SP system partition shows signs of a switch failure, locate the symptom and perform the recovery action described. All the recovery actions described require that the user have root access on the specified SP Switch2 node.

Note:

If your system is running in Restricted Root Access mode, the following commands must be issued from the control workstation:

CSS_test
Efence
Eprimary
Equiesce
Estart
Eunfence
Eunpartition
mult_senders_test
switch_stress
wrap_test

SP Switch2 diagnostics

Verify the SP Switch2 topology configuration

The switch plane topology file is used to define the hardware configuration to the css support software. It should reflect the number of switches and nodes installed, as well as define how they are connected.

The topology file can reside in two places: in the SDR, or in the expected.top file in the /etc/SP directory of the primary node. Usually, the configuration in the SDR is used. If the configuration in /etc/SP/expected.top on the primary node exists, it overrides the configuration in the SDR. The /etc/SP/expected.top file on the primary node is generally used for debugging purposes.

To verify that the topology in the SDR is correct, first read it out of the SDR using the command:

Etopology -read [-p (0 | 1 | all)] file_name

The -p flag indicates the switch plane number.

The Etopology command reads the switch topology from the SDR and places it in the specified file. For more information on this command, see PSSP: Command and Technical Reference.

Once the file is extracted, verify that the switch topology is an accurate representation of the installed hardware.

If changes to the switch topology file are required, remember to place them back into the SDR by issuing the Etopology command:

Etopology [-p (0 | 1 | all)] file_name

A default set of topology configuration files is available in the /etc/SP directory. For more information, see PSSP: Command and Technical Reference.

The SP Switch2 uses an annotated topology file, produced by the Eannotator command. The system administrator is responsible for running the command and creating the annotated topology file. When the file is not annotated, the fault service daemon will still work and the switch will function, but the switch jack numbers will not be correct in the topology file, the out.top file, and the cable_miswire file. If you suspect your topology file was not annotated, verify it. Examine the out.top file in /var/adm/SPlogs/css0/p0 and /var/adm/SPlogs/css1/p0 of each node, and examine the topology file by using the Etopology -read command described earlier.

Each link line in an annotated file is marked by E, as in this example:

s 13 0    s 23 0    E01-S17-BH-J6 to E02-S17-BH-J6

Each link line in a file that is not annotated is marked by L, as in this example:

s 13 0    s 23 0    L01-S00-BH-J9 to L02-S00-BH-J9
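
To apply this check to a whole file rather than a single line, the link lines can be counted by the first letter of the jack label. The following is a minimal sketch and is not part of PSSP: the function name count_link_marks is hypothetical, and it assumes that every link line follows the layout of the two examples above, with the jack label as the seventh whitespace-separated field.

```shell
# count_link_marks: read a topology file on stdin and report how many
# link lines carry an annotated (E) or non-annotated (L) jack label.
# Assumption: the jack label is field 7, as in the two example lines.
count_link_marks() {
    awk '
        $7 ~ /^E[0-9]+-S[0-9]+-/ { e++ }
        $7 ~ /^L[0-9]+-S[0-9]+-/ { l++ }
        END { printf "annotated=%d not_annotated=%d\n", e, l }
    '
}
```

For example, after extracting the topology with Etopology -read into a file of your choice, run count_link_marks < file_name. A nonzero not_annotated count suggests that the file was not annotated.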

Verify the System Data Repository (SDR)

To verify that the SDR is installed and operating correctly, run the SDR_test command on the control workstation. It can be run either through SMIT panels or from the command line.

To verify SDR installation and operation from the SMIT panels:

Issue the command:
```
smit SP_verify
```
The RS/6000 SP Installation/Configuration Verification menu appears.
Select: System Data Repository.
Press enter.
Review the output created.

To verify SDR installation and operation from the command line, enter:

/usr/lpp/ssp/bin/SDR_test

and review the output created.

Whenever the SDR_test command is run, a log file is created to enable the user to review the test results. The default log file created is: /var/adm/SPlogs/SDR_test.log. If the SDR_test command is run without root authority, the default log file created is: /tmp/SDR_test.log. Complete information on SDR_test can be found in PSSP: Command and Technical Reference. See also SDR verification test.

Next, log in to the failing node and issue the command:

SDRGetObjects switch_responds node_number switch_responds0

Examine the output that is returned. If the switch responds bits are returned, this indicates that the SDR is operating. You can also determine which nodes are operational on the switch by examining the value returned. A value of 1 indicates that the node is operational on the switch. A value of 0 indicates that the node is not operational on the switch.
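
When the system has many nodes, the nodes that are not operational on the switch can be picked out of the output mechanically. This is a sketch under stated assumptions: the function name nodes_off_switch is hypothetical, and the output is assumed to consist of one header line followed by rows that hold node_number and switch_responds0 as whitespace-separated columns.

```shell
# nodes_off_switch: read SDRGetObjects output on stdin and print the
# node_number of every row whose switch_responds0 value is 0.
# Assumption: line 1 is a header; each later row is "node_number value".
nodes_off_switch() {
    awk 'NR > 1 && NF >= 2 && $2 == "0" { print $1 }'
}
```

A possible use is SDRGetObjects switch_responds node_number switch_responds0 | nodes_off_switch. Empty output means that every listed node reports a value other than 0.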

Verify node configuration

Verify that the SDR and the node configuration match each other. This verification procedure can be done on a node that tried to run the fault service daemon. On the node, view the file /var/adm/SPlogs/css/rc.switch.log. This file lists the configuration information found by the rc.switch script just before it attempts to run the fault service daemon.

Table 42. SP Switch2 rc.switch.log file and SDR equivalents

Line in rc.switch.log	Value in file	Value in SDR
Line 1	date and time when the information was taken.
Lines 2, 4, 8	node configuration: reliable_hostname, node_number and switch_node_number	SDRGetObjects Node frame_number==`tested_node_frame` slot_number==`tested_node_slot` reliable_hostname node_number switch_node_number
Line 5	switch_type should be equal to 132 for the SP Switch2.
Line 6	number_switch_planes should be equal to 1.
Line 7	adapter_name and adapter_status should be only one device - css0 with css_ready. If the status is other than css_ready, see Table 37.
Lines 9, 10, 11	IP configuration: netaddr and netmask	SDRGetObjects Adapter node_number==`tested_node` adapter_type==css0 netaddr netmask

If any of the above node configuration data do not match, correct the SDR configuration and re-configure the problem node. See PSSP: Installation and Migration Guide for more details on how to do this.

SP Switch2 adapter diagnostics

The adapter diagnostics have two modes of operation: the Power-On Self-Test (POST) and online diagnostics that you run by issuing the diag command.

For the automatic POST tests scenario, issue the command:

diag -c -d [css0 | css1]

For advanced diagnostics scenarios, issue the command:

diag -A -d [css0 | css1]

The advanced tests check the cable wrap. You will need the card and cable wrap plug to complete these tests.

Note: The complete set of adapter diagnostics needs exclusive use of the css adapter on the node where the diagnostics are run. Any other processes that have the css device driver open must be closed (killed) before issuing the adapter diagnostics command. One of those processes is the fault service daemon: fault_service_Worm_RTG_CS. Processes such as "switch clock (TOD) reader applications" make use of the css device driver, and therefore these processes should be closed as well.

The diagnostics failures are reported to the AIX error log of the failing node. To view the adapter diagnostics errors:

Log in to the failing node.
Issue the command:
```
errpt -a | grep "Switch adapter failed POST diagnostics" 
```
to view the POST adapter diagnostics AIX error log entries.
In most cases each of the error entries will contain a Service Request Number (SRN).
Use the SRN to locate your error and its recovery actions in Table 43.
Note that x may represent any value in Table 43.

Table 43. SP Switch2 adapter Service Request Number failures and recovery actions

SRN	Recovery action
765-x1xx	See Verify software installation. If the verification is successful and the problem persists, contact the IBM Support Center.
765-1xx6	Wrap test failed. This problem is caused by a bad switch cable. Contact IBM Hardware Service and arrange to have the switch cable replaced.
765-2xx1	RDRAM test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xx2	RDRAM Controller test failed. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xx3	SRAM test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xx4	SRAM Controller test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xx5	DMA test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xx6	Wrap test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xx7	Registers test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xx8	740 access test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xx9	Reassembly test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xxA	Segmentation test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
765-2xxB	Interrupts test failed. This problem is caused by a faulty adapter. Contact IBM Hardware Service and arrange to have the adapter replaced.
All other SRNs	Contact IBM Hardware Service and arrange to have the adapter or cable replaced.
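
The lookup in Table 43 can be sketched as a shell case statement in which each x becomes the single-character wildcard ?. This is an illustration only: the function name srn_action is hypothetical, the hints are abbreviations of the recovery actions in the table, and trying the rows in table order is an assumption, so an SRN that fits more than one row is reported under the first row that matches.

```shell
# srn_action: print a short recovery hint for an adapter SRN.
# Each x in Table 43 is treated as "any single character" (shell ?).
# Assumption: rows are tried in table order; the first match wins.
srn_action() {
    case "$1" in
        765-?1??)       echo "verify software installation" ;;
        765-1??6)       echo "replace switch cable" ;;
        765-2??[1-9AB]) echo "replace adapter" ;;
        *)              echo "replace adapter or cable" ;;
    esac
}
```

For example, srn_action 765-1006 prints the cable hint, while srn_action 765-2006 prints the adapter hint, which mirrors the two different wrap test rows in the table.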

Cable diagnostics

Switch to switch cable diagnostics

Visually inspect the cable in question:

1. Remove the cable from the back of the switch and examine the connectors (cable and switch bulkhead jack) for bent pins or other visible damage. If everything looks OK, reconnect the cable to the switch bulkhead jack. If not, contact IBM Hardware Service and have them repair or replace the damaged components.
2. Repeat step 1 for the other end of the switch to switch cable.
3. Run the SP Switch Wrap Test and SP Switch Stress Test. See SP Switch and SP Switch2 advanced diagnostic tools.
4. If everything visually checks out, contact IBM Hardware Service and have them replace the cable. If the problem persists, contact the IBM Support Center.

Node to switch cable diagnostics

Visually inspect the cable in question:

1. Remove the cable from the back of the node and examine the connectors (cable and back of the adapter) for bent pins or other visible damage. If everything looks OK, reconnect the cable to the adapter. If not, contact IBM Hardware Service and have them repair or replace the damaged components.
2. Remove the cable from the back of the switch and examine the connectors (cable and switch bulkhead jack) for bent pins or other visible damage. If everything looks OK, reconnect the cable to the switch bulkhead jack. If not, contact IBM Hardware Service and have them repair or replace the damaged components.
3. Run the SP Switch Wrap Test and SP Switch Stress Test. See SP Switch and SP Switch2 advanced diagnostic tools.
4. If everything visually checks out, run advanced Adapter Diagnostics on the suspect adapter. The procedure is outlined in SP Switch2 adapter diagnostics. Follow the online instructions. If the diagnostics detect a failure, contact IBM Hardware Service and have them replace the failing components. If the adapter diagnostics pass and the problem persists, contact the IBM Support Center.
As a result of removing the cable, the node may be automatically fenced by the system. After reinstalling the cable, reboot the node or run the rc.switch command to reset the switch adapter. Only after this is complete, try to Eunfence the node.

SP Switch2 node diagnostics

Identify the failing node

Use this scenario if an application running on several nodes loses connectivity over the switch, or the switch_responds class indicates that several nodes are not on the switch. For more information on the switch_responds class, see the SDRGetObjects entry in PSSP: Command and Technical Reference.

View the summary log file, located on the control workstation.
See Summary log for SP Switch, SP Switch2, and switch adapter errors.
Locate the first error log entry that indicates a node or connectivity failure.
Examine other entries to see if the first failure is the cause of subsequent failures.
On the node that experienced the first failure, examine the error log to see the complete version of the error log record described previously.
Use this as a starting point to debug the problem on this node.

Verify software installation

The software installation and verification are done using the CSS_test command on the control workstation. CSS_test can be run either through SMIT panels or from the command line.

If CSS_test is issued following a successful Estart, additional verification of the system is done to determine if each node in the system or system partition can be pinged.

To verify CSS installation from the SMIT panels:

Issue:
```
smit SP_verify
```
The RS/6000 SP Installation/Configuration Verification menu appears.
Select: Communications Subsystem.
Press enter.
Review the output created.

Whenever the CSS_test command is run, a log file is created to enable the user to review the test results. The log file is /var/adm/SPlogs/CSS_test.log. Complete information on CSS_test can be found in PSSP: Command and Technical Reference.

To verify CSS installation from the command line:

Issue the command:
```
/usr/lpp/ssp/bin/CSS_test 
```
Review the log file to determine the results.

When running CSS_test, consider the following:

The directory /usr/lpp/ssp should be accessible.
The file /etc/inittab on each node should contain an entry for the script rc.switch.

Verify SP Switch2 node operation

Use this procedure to verify that a single SP Switch2 node is operating correctly. If the node you are attempting to verify is the primary node, start with Step 1 (determine which node is the primary). If it is a secondary node, start with Step 2 (ensure that the node is accessible from the control workstation).

Determine which node is the primary by issuing the Eprimary command on the control workstation. For complete information on the Eprimary command, see PSSP: Command and Technical Reference. For our purposes, consider this output:
```
plane 0:  1     - primary
plane 0:  5     - oncoming primary
plane 0:  49    - primary backup
plane 0:  45    - oncoming primary backup
plane 0:  1     - autounfence
 
plane 1:  1     - primary
plane 1:  5     - oncoming primary
plane 1:  49    - primary backup
plane 1:  45    - oncoming primary backup
plane 1:  1     - autounfence
```
If the command returns a primary value of none, an Estart is required to make the oncoming primary node the primary.
If the command returns an oncoming primary value of none, reissue the Eprimary command specifying the node that you would like to have as the primary node. Following the completion of the Eprimary command (to change the oncoming primary) an Estart is required to make the oncoming primary node the primary.
Note:
Both the Eprimary and Estart commands have a flag (-p) which specifies the number of the switch plane that the command references. If the -p flag is omitted, the command applies to all planes.

The primary node on the SP Switch2 system can move to another node, if a primary node takeover is initiated by the backup. To determine if this has happened, look at the values of the primary and the oncoming primary backup. If they are the same value, a takeover has occurred.
Ensure that the node is accessible from the control workstation. This is done by using the dsh command to issue the date command on the node as follows:
```
/usr/lpp/ssp/bin/dsh -w problem_hostname date
```
The output is similar to:
```
TUE JAN 25 10:24:28 EST 2000 
```
If the current date and time are not returned, refer to Diagnosing remote command problems on the SP System.
Verify that the switch adapter (css0 or css1) is configured and is ready for operation on the node. This can be done by examining the adapter_config_status attribute in the Adapter object of the SDR:
```
SDRGetObjects Adapter adapter_type==css0 node_number adapter_config_status
```
or
```
SDRGetObjects Adapter adapter_type==css1 node_number adapter_config_status
```
The output is similar to:
```
node_number adapter_config_status
          1 css_ready
```
If the adapter_config_status value is anything other than css_ready, see Adapter configuration error information. More information on the error may be found in the AIX error log. Run the errpt -a command on this node and match the adapter error log to the error list found in AIX Error Log information.
Verify that the fault_service_Worm_RTG_CS daemon is running on the node. This can be accomplished by using the dsh command on the control workstation to issue a ps command to the problem node as follows:
```
/usr/lpp/ssp/bin/dsh -w problem_hostname ps -e | grep Worm_RTG 
```
The output is similar to the following:
```
18422      -  0:00 fault_service_Worm_RTG_CS
```
If the fault_service_Worm_RTG_CS daemon is running, SP Switch2 node verification is complete.
If the fault_service_Worm_RTG_CS daemon is not running, see AIX Error Log information. The possible reasons why the fault_service_Worm_RTG_CS daemon is not running are:
- The daemon exited due to an abnormal error condition.
- A SIGTERM, SIGBUS, or SIGDANGER signal was processed by the daemon.
Verify that the adapter is running on the node. This can be accomplished by examining the end of the /var/adm/SPlogs/css0/adapter.log or /var/adm/SPlogs/css1/adapter.log file. If any of the last several lines contain fsd_adapter_thread_exit, the adapter is not running. This means that the adapter has a permanent adapter error, or that the adapter diagnostics are running.
Look in the node's AIX error log for local adapter errors and handle them if they are found. If no local adapter errors are found, the adapter diagnostics are not running, and the adapter.log still shows the exit state, run the rc.switch -l [css0 | css1] command to restart the fault service daemon. Then, check whether the end of the adapter.log now contains the message ------Adapter Thread Started------, signaling that the adapter has been restarted.
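
The adapter.log check in the last two steps can be sketched as a small shell function. This is an illustration, not a PSSP tool: the function name adapter_thread_exited is hypothetical, and the default window of 10 lines is an arbitrary reading of "the last several lines".

```shell
# adapter_thread_exited: read adapter.log text on stdin; succeed (exit 0)
# if any of the last N lines (default 10) mention fsd_adapter_thread_exit.
# The window size N is an illustrative choice, not a documented value.
adapter_thread_exited() {
    tail -n "${1:-10}" | grep -q 'fsd_adapter_thread_exit'
}
```

On the node, a possible use is adapter_thread_exited < /var/adm/SPlogs/css0/adapter.log && echo "css0 adapter thread has exited". The same check applies to the css1 log.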

Node crash

A node crash is generally identified by the LED/LCD display on the node flashing 888. Do not reboot the node. See Producing a system dump.

SP Switch2 Time Of Day (TOD) diagnostics

Some applications use the Switch Time-Of-Day (Switch TOD). The Switch TOD is a value that is passed only to nodes that are on the switch (their corresponding switch_responds0 flag is 1). The Most-Significant-Bit (MSB) of this 64-bit value is called the 'valid bit'. When the 'valid bit' is 1, the Switch TOD is valid: the value you see is synchronized with the Switch TOD. When the 'valid bit' is 0, the Switch TOD is invalid: the value you see is propagated by the node and is not synchronized with the Switch TOD, so there is no assurance that it is within range of the Switch TOD.
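
Because the 'valid bit' is the MSB of a 64-bit value, it can be read from the first hexadecimal digit alone: a leading digit of 8 or higher means that the bit is 1. The sketch below illustrates only that arithmetic. The function name tod_valid_bit is hypothetical, and it assumes that the Switch TOD is already available as a string of 16 hexadecimal digits; how an application obtains the value is outside this sketch.

```shell
# tod_valid_bit: print 1 if the most significant bit of a 64-bit value,
# given as 16 hexadecimal digits, is set; print 0 otherwise.
# Only the length and the first digit are checked.
tod_valid_bit() {
    [ "${#1}" -eq 16 ] || { echo "expected 16 hex digits" >&2; return 2; }
    case "$1" in
        [89abcdefABCDEF]*) echo 1 ;;
        [01234567]*)       echo 0 ;;
        *) echo "not a hexadecimal value" >&2; return 2 ;;
    esac
}
```

Checking the leading digit avoids signed 64-bit arithmetic in the shell, where a value with the MSB set would be treated as negative.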

When your node begins to get the Switch TOD, the value of your Switch TOD will change to the Switch TOD value (if necessary) and the 'valid bit' will turn to 1. The Emaster command shows the node number that is responsible for the Switch TOD. The emaster daemon, which runs on the control workstation, monitors the switch and tries to recover the Switch TOD when necessary.

The sections and subsections that follow are ordered from the most probable cause of the problem to the least probable. After each item, check the Switch TOD again, and if the problem persists, continue to the next item.

SP Switch2 TOD on node is not valid

Some of the nodes in your system have Switch TOD 'valid bit' turned ON and some have it turned OFF.

Validate that your node is on the switch, that is, that the switch_responds0 value for your node is 1. One way to see that is by issuing:
```
SDRGetObjects switch_responds node_number==your_node_number switch_responds0
```
If your node is not on the switch, you have to unfence it, then check your node's Switch TOD value. See Unfence an SP Switch2 node.
If your node is on the switch and still only your node shows the Switch TOD 'valid bit' as OFF (all the other nodes that are on the switch have their Switch TOD 'valid bit' turned ON), contact IBM Hardware Service.

SP Switch2 TOD is not valid on all nodes

All the nodes in your system have Switch TOD 'valid bit' turned OFF.

Verify that the switch is up and running (the Eprimary command shows one node as the primary node) and that the switch_responds0 of the primary node is 1. If the switch is down, run the Estart command, then check again.
Verify that you do have a Master Switch Sequencer (MSS) node. To do this, issue the command Emaster on your control workstation. The result should look like:
```
49    - Master switch sequencing node
```
Verify that the MSS node is not fenced OFF the switch. If it is fenced, use the Eunfence command to bring the node back on the switch.
Verify that the emaster daemon is running on your control workstation. To do this, issue the command lssrc -s emaster. The result should look like:
```
Subsystem         Group            PID     Status 
 emaster           swt             25380   active
```
If the Status is "inoperative", check the AIX error log on the control workstation for the resignation cause, and follow the recommended actions. Restart the emaster daemon by issuing: startsrc -s emaster. Check that the daemon stays in active status.
If the Status is still "inoperative", call the IBM Support Center.
If the Status is "active" and you have one or more nodes on the switch, but there still is no MSS node assigned, call the IBM Support Center.
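
The Status check on the lssrc output can be scripted. This is a minimal sketch with a hypothetical function name, src_status. It assumes that the output has the layout of the sample above, with the subsystem name in the first column and the status in the last column of the same row.

```shell
# src_status: read "lssrc -s <subsystem>" output on stdin and print the
# Status column for the named subsystem (for example, active).
# Assumption: name is the first field and status the last field of a row.
src_status() {
    awk -v name="$1" '$1 == name { print $NF }'
}
```

A possible use on the control workstation is lssrc -s emaster | src_status emaster, followed by a comparison of the result with active.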

No Master Switch-Sequencer (MSS) node

The Emaster command shows no node number as the Master Switch Sequencer (MSS) node.

Verify that the emaster daemon is running on your control workstation. To do this, issue the command lssrc -s emaster. The result should look like:
```
Subsystem         Group            PID     Status 
 emaster           swt             25380   active
```
If the Status is "inoperative", check the AIX error log on the control workstation for the resignation cause, and follow the recommended actions. Restart the emaster daemon by issuing: startsrc -s emaster. Check that the daemon stays in active status.
Verify that the switch is up and running (the Eprimary command shows a valid node number as the primary node) and that the switch_responds of this node is 1. If the switch is down, issue the Estart command.
If you have successfully run the Estart command, and some of the nodes came up on the switch, and there still is no MSS, call the IBM Support Center.

No SP Switch2 TOD monitoring

There is no Switch TOD recovery on your system. An event happened on your system that should have caused a replacement of the MSS node, but failed to replace the MSS node.

Check the AIX error log for the reason for the failure. Follow the recovery action suggested in the error log entry.
Validate that the emaster daemon is running on your control workstation. To do this, issue the command lssrc -s emaster. The result should look like:
```
Subsystem         Group            PID     Status 
 emaster           swt             25380   active
```
If the Status is "inoperative", check the AIX error log on the control workstation for the resignation cause, and follow the recommended actions. Restart the emaster daemon by issuing: startsrc -s emaster. Check that the daemon stays in active status.
If the Status is still "inoperative", call the IBM Support Center.
Verify that the Event Management, Group Services and Topology Services groups are running. Issue these commands on the control workstation:
- lssrc -g hags
- lssrc -g hats
- lssrc -g haem
Issue these commands on the control workstation, to verify the daemons for each node:
- dsh -av lssrc -g hags
- dsh -av lssrc -g hats
- dsh -av lssrc -g haem
If any of these have "inoperative" status, you may experience a problem with Switch TOD recovery. To restart these you should run:
- stopsrc -g haem
- stopsrc -g hats
- stopsrc -g hags
and then run:
- startsrc -g hags
- startsrc -g hats
- startsrc -g haem
If the Status is still "inoperative", call the IBM Support Center.
You can monitor events reaching the emaster daemon by looking in the emaster.log located in the node log directory level.
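
The stop and start sequence given above stops the groups in the order haem, hats, hags and starts them in the reverse order. The sketch below reproduces that order. The function name restart_ha_groups is hypothetical, and by default it only prints the commands so that the sequence can be reviewed before anything is run.

```shell
# restart_ha_groups: emit the stopsrc and startsrc commands in the order
# given in this procedure. With no argument the commands are only echoed
# (dry run); pass "command" as the first argument to execute them.
restart_ha_groups() {
    run=${1:-echo}
    for g in haem hats hags; do $run stopsrc -g "$g"; done
    for g in hags hats haem; do $run startsrc -g "$g"; done
}
```

Running restart_ha_groups prints the six commands in order. Running restart_ha_groups command executes them, which requires root authority on the system where it is run.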

SP Switch2 advanced diagnostics

To examine the SP Switch2 fabric in more detail, see SP Switch and SP Switch2 advanced diagnostic tools.
