Diagnosis Guide

Error symptoms, responses, and recoveries

If your system shows signs of a switch failure, locate the symptom and perform the recovery action described. All the recovery actions described require that the user have root access on the specified switch node. If any of the recovery actions fail to resolve your problem, contact IBM Support Center.

Note:

If your system is running in restricted root access mode, the following commands must be issued from the control workstation:

CSS_test
Efence
Eprimary
Equiesce
Estart
Eunfence
Eunpartition
mult_senders_test
switch_stress
wrap_test

SP Switch2 symptoms and recovery actions

Table 44 lists the known symptoms of a failure in the SP Switch2, and points the user to the location of the detailed diagnostics and recovery action. You may have a symptom that does not appear in the table. In this case, view the error entry in the AIX error log and see AIX Error Log information.

Table 44. SP Switch2 symptoms and recovery actions

Symptom	Recovery action
Estart failure: System cannot find Estart command.	See Verify software installation.
Estart failure: Primary node is not reachable.	See Verify SP Switch2 node operation.
Estart failure: Estart command times out or fails.	See Estart error recovery.
Estart failure: Expected number of nodes not initialized.	See SP Switch2 device and link error information.
Estart failure: Some links do not initialize.	See SP Switch2 device and link error information.
Node drops off the switch (switch_responds is 0 for the node).	See Verify SP Switch2 node operation.
Node fails to communicate over the switch, but its switch_responds is 1 (ping or CSS_test commands fail).	See Verify SP Switch2 node operation. See AIX Error Log information.
Node crash.	See Node crash.
Node fails to Eunfence.	See Unfence an SP Switch2 node. See Eunfence error recovery.
Oncoming primary node is fenced	See Unfence an SP Switch2 node. See Eunfence error recovery.
Ecommand failure.	See Ecommands error recovery.
diag_fail condition for an SP Switch2 adapter	See SP Switch2 adapter diagnostics.
switch_responds is still 1 after node panic.	See switch_responds is still on after node panic.
The switch_plane and switch_plane_seq numbers in the Switch class of the SDR are incorrect. The SDR_config command appears to number switches incorrectly.	See plane.info file

Recover an SP Switch2 node

You can restart the fault_service_Worm_RTG_CS daemon on the node by issuing:

/usr/lpp/ssp/css/rc.switch

After running rc.switch, issue this command to determine whether the daemon is still running or has died:

ps -e | grep Worm

At this point you should be able to Eunfence the node by issuing:

Eunfence problem_node_number

The output should be similar to the following:

All nodes successfully unfenced.

If you cannot resolve the problem, contact the IBM Support Center. You should also attempt to gather all log files associated with this failure. See Information to collect before contacting the IBM Support Center.
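
The daemon check in this procedure can also be scripted. The following is a minimal sketch in Python, assuming the output of ps -e is available as text. The function name daemon_running is hypothetical and is not part of PSSP; it only mirrors the substring match that ps -e | grep Worm performs.

```python
import subprocess


def daemon_running(ps_output: str, pattern: str = "Worm") -> bool:
    """Return True if any line of `ps -e` output contains the pattern.

    This mirrors `ps -e | grep Worm`: a plain substring match per line.
    """
    return any(pattern in line for line in ps_output.splitlines())


if __name__ == "__main__":
    # Capture the process list the same way the manual step does.
    result = subprocess.run(["ps", "-e"], capture_output=True, text=True, check=False)
    if daemon_running(result.stdout):
        print("fault service daemon appears to be running")
    else:
        print("fault service daemon not found; check rc.switch.log")
```

If the function returns False right after rc.switch was run, the daemon has died and the Eunfence attempt is likely to time out.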

Worm error recovery

The following steps enable you to recover from switch initialization failures that impact the worm subsystem.

Log in to the switch primary node. See the Eprimary command to determine which node is the switch primary node.
Run the command errpt -a | pg and search for the entry with the label CS_SW_ESTRT_FAIL_RE (Estart failed). The Detail Data gives the return code of the failure:
```
switch initialization failed with xx
```
where xx is the number to look up in Table 45.
If you did not find the error log entry, view the bottom of the file /var/adm/SPlogs/css0/p0/flt or /var/adm/SPlogs/css1/p0/flt. Look for a message similar to:
- CSworm_bfs_phase1() failed with rc=xx
- CSworm_bfs_phase2() failed with rc=xx
where xx is the number to look up in Table 45.
Use the rc (return code) to retrieve the appropriate entry in Table 45.
If the return code cannot be found in the table, or the actions taken did not correct the problem, contact the IBM Support Center.
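
The search for the return code in the steps above can be sketched in Python. This is an illustration only: it assumes the three message shapes quoted in the steps appear in the text exactly as shown, with xx written as a signed integer. The function name last_worm_return_code is hypothetical.

```python
import re
from typing import Optional

# Message shapes quoted in the recovery steps; xx is a signed integer.
_PATTERNS = [
    re.compile(r"switch initialization failed with\s+(-?\d+)"),
    re.compile(r"CSworm_bfs_phase[12]\(\) failed with rc=(-?\d+)"),
]


def last_worm_return_code(log_text: str) -> Optional[int]:
    """Return the return code from the last matching line, or None."""
    # Scan from the bottom, because the steps say to view the end of the file.
    for line in reversed(log_text.splitlines()):
        for pattern in _PATTERNS:
            match = pattern.search(line)
            if match:
                return int(match.group(1))
    return None
```

On the primary node, the function could be applied to the contents of /var/adm/SPlogs/css0/p0/flt or /var/adm/SPlogs/css1/p0/flt, and the result looked up in Table 45.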

Table 45. SP Switch2 worm return codes and analysis

Return code	Analysis
-2	Explanation: Adapter sender port is not connected to the switch. (Phase 1 failure). Cause: Oncoming primary is fenced off the switch. Action: Pick another node as primary and run Estart again. See Unfence an SP Switch2 node.
-3	Explanation: Adapter receiver port is connected to switch. (Phase 1 failure). Cause: Oncoming primary is fenced off the switch. Action: Pick another node as primary and run Estart again. See Unfence an SP Switch2 node.
-4	Explanation: Unable to generate routes for the network. (Phase 1 failure). Cause: Corrupted topology file. Action: See Verify the SP Switch2 topology configuration.
-5	Explanation: Send packet from local node failed. Cause: Bad switch adapter. Action: Run the switch adapter diagnostics on the primary node. If diagnostics fail to isolate the problem, contact the IBM Support Center.
-6	Explanation: A switch miswire was detected. Cause: Switch network cabling does not match the switch topology file. Action: View the /var/adm/SPlogs/css0/p0/cable_miswire or /var/adm/SPlogs/css1/p0/cable_miswire file to determine which cables are in question. Then disconnect and check the associated cables. If the problem persists, contact IBM Hardware Service.
-7	Explanation: Unable to generate routes for the network. (Phase 1 failure). Cause: Primary node is not connected in the right switch capsule. Action: View the /var/adm/SPlogs/css0/p0/cable_miswire or /var/adm/SPlogs/css1/p0/cable_miswire file to determine which cables are in question. See PSSP: Planning, Volume 2 for wiring information. Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service.
-8	Explanation: Receive FIFO is full. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the primary node. If diagnostics fail to isolate the problem, contact IBM Hardware Service. Cause: The switch is backed up from a node or a switch chip. Action: Contact the IBM Support Center.
-9	Explanation: Unable to initialize FIFOs. (Phase 1 failure). Cause: Software problem. Action: Contact the IBM Support Center.
-10	Explanation: Node found in Switch Chip's FIFO. (Phase 1 failure). Cause: Software problem. Action: Contact the IBM Support Center.
-12	Explanation: The worm was unable to contact the oncoming primary backup node. New backup will be selected. Cause: This is not an error. Action: None.
-13	Explanation: Switch chip id mismatch from a previous connected switch chip. (Phase 1 failure). Cause: Hardware problem. Action: If the problem persists, contact IBM Hardware Service.
-23	Explanation: The switch chip that the oncoming primary node is connected to did not respond. The oncoming primary node failed to communicate with the switch. (Phase 1 failure). Cause: Oncoming primary node is fenced. Action: Pick another node as oncoming primary and run Estart again. See Unfence an SP Switch2 node. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the oncoming primary node. If they fail, try to isolate the problem. Contact IBM Hardware Service. Cause: Bad switch board. Action: If the problem persists, contact IBM Hardware Service.
-27	Explanation: The TBIC was not initialized. Cause: The switch adapter is uninitialized. Action: Run the script rc.switch on the primary node, then issue the Estart command from the control workstation. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the primary node. If diagnostics fail to isolate the problem, contact IBM Hardware Service.
-36	Explanation: This node resigned as the primary node. Cause: The node determined it could no longer control and monitor the switch. The primary backup node is now in control of the switch. Action: No action required.
-41	Explanation: Worm reached retry limit. Cause: System cables may have a problem. The system is not stable. Action: View /var/adm/SPlogs/css0/p0/out.top or /var/adm/SPlogs/css1/p0/out.top on the oncoming primary node. Check or comment out all links that are marked as fenced or faulty. Check or comment out all nodes that were not found by the worm. Run Estart again. If the problem persists, contact IBM Hardware Service.
-43	Explanation: A read or write operation to the switch adapter failed. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the primary node. If diagnostics fail to isolate the problem, contact IBM Hardware Service.
-51	Explanation: Unexpected return. The software encountered an unexpected value. Cause: This can happen for these reasons: ID mismatch, lock handling failure, unexpected SDR access, query or setting failure, null pointer that should have a value, unexpected memory updates failure, unexpected value inside a packet. Action: The software automatically creates a css.snap file. Call the IBM Support Center with this file.
-52	Explanation: No response from the switch chip connected to the oncoming primary. (Phase 1 failure). Cause: Cable between the oncoming primary node and switch board is faulty. Action: Check or reconnect the oncoming primary cable and try again. Cause: Adapter error on the oncoming primary node. Action: Run switch adapter diagnostics on the oncoming primary node. If they fail, isolate the problem. Contact IBM Hardware Service.
-54	Explanation: Unknown device id returned from the switch chip that the oncoming primary is connected to. (Phase 1 failure). Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service. Cause: System not configured properly. Action: Check your configuration on the control workstation and try again. Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service.
-55	Explanation: Switch chip signature test failed or failed to reset switch chip errors. (Phase 1 failure). Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service. Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service.
-56	Explanation: Switch chip connected to the oncoming primary reported the primary to be connected to internal switch port. (Phase 1 failure). Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service. Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service.
-57	Explanation: Oncoming primary connected to the wrong switch. (Phase 1 failure). Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service. Cause: System not configured properly. Action: Check your configuration on the control workstation and try again. Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service.
-58	Explanation: Oncoming primary connected in the wrong place. (Phase 1 failure). Cause: Cable miswire. Action: View the /var/adm/SPlogs/css0/p0/cable_miswire or /var/adm/SPlogs/css1/p0/cable_miswire file to determine which cables are in question. See PSSP: Planning, Volume 2 for wiring information.
-61	Explanation: Failed to reset the oncoming primary's switch chip's error registers. (Phase 1 failure). Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service.
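
As a companion to Table 45, the first triage decision can be sketched in Python: is the code listed at all, and if so, does the table ask for any action? The set of codes and the two no-action codes (-12 and -36) are taken from the table; the function name triage and the returned strings are hypothetical.

```python
# Return codes that have an entry in Table 45.
TABLE_45_CODES = {
    -2, -3, -4, -5, -6, -7, -8, -9, -10, -12, -13, -23,
    -27, -36, -41, -43, -51, -52, -54, -55, -56, -57, -58, -61,
}

# Codes whose Action in Table 45 is "None" or "No action required".
NO_ACTION_CODES = {-12, -36}


def triage(rc: int) -> str:
    """Classify a worm return code according to Table 45."""
    if rc not in TABLE_45_CODES:
        # The recovery steps say to contact support for unlisted codes.
        return "not in Table 45: contact the IBM Support Center"
    if rc in NO_ACTION_CODES:
        return "no action required"
    return "follow the Action for this code in Table 45"
```

The sketch deliberately does not encode the individual causes and actions, because several codes list more than one cause and the table remains the authoritative text.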

Ecommands error recovery

Error isolation for any of the Ecommands is as follows.

View the error output returned from the command. Note the error message number and text.
Find the message in PSSP: Messages Reference.
More information can be obtained from the Ecommands.log trace file. See Ecommands.log.
Perform the recommended recovery action.
If the Ecommand failed because it was unable to communicate with every node, see Diagnosing SP Security Services problems.
If the Ecommand failed because it cannot access the SDR, or the SDR is set up incorrectly, see Verify the System Data Repository (SDR).
After the recovery action is taken, if the problem persists, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

Estart error recovery

To isolate and recover from failures in the Estart command, follow these steps:

Log in to the primary node.
View the bottom of the file /var/adm/SPlogs/css0/p0/flt or /var/adm/SPlogs/css1/p0/flt.
Use the failure message as an index to Table 46.
If the failure message cannot be found in the table, or the actions taken did not correct the problem, contact the IBM Support Center.

Table 46. SP Switch2 Estart problems and analysis

Failure message	Analysis
Error in buildDeviceDatabase()	Explanation: Unable to build the device database. Cause: Missing or corrupted topology file. Action: See Verify the SP Switch2 topology configuration. Cause: malloc failures. Action: See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
Error in CSswitchInit()	Explanation: Unable to initialize the switch network. Cause: Switch initialization failed. Action: See Worm error recovery.
Error in writeDeviceDatabase()	Explanation: Unable to write /var/adm/SPlogs/css0/p0/out.top or /var/adm/SPlogs/css1/p0/out.top file. Cause: Missing or corrupted topology file. Action: See Verify the SP Switch2 topology configuration. Cause: The /var file system is not large enough to accommodate the new out.top file. Action: Increase the size of /var.
No valid backup - SDR current Backup being changed to none	Explanation: Informational message. Cause: No node available as a backup. Action: No action required.
Cannot access the SDR - SDR current Backup not changed	Explanation: SDR failure. Cause: SDR not set up properly. Action: See Verify the System Data Repository (SDR).
Error in: fopen(act.top.`PID`), fprintf(act.top.`PID`), fclose(act.top.`PID`), or rename(act.top, act.top.`PID`)	Explanation: An error occurred accessing file /var/adm/SPlogs/css0/p0/act.top.`PID` or /var/adm/SPlogs/css1/p0/act.top.`PID`. Cause: File access problems. Action: Evaluate the errno returned and take the appropriate action. If the problem persists, contact the IBM Support Center.
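
Using the failure message as an index to Table 46 can be sketched in Python as a substring lookup. This is an assumption-laden illustration: it assumes the message text appears in the flt file exactly as in the table, it records only the first pointer given for each message, and it omits the act.top row because the exact form of that message is not shown. The names TABLE_46 and lookup_estart_failure are hypothetical.

```python
from typing import Optional

# Message fragments from Table 46, paired with the first action listed.
TABLE_46 = [
    ("Error in buildDeviceDatabase()", "See Verify the SP Switch2 topology configuration."),
    ("Error in CSswitchInit()", "See Worm error recovery."),
    ("Error in writeDeviceDatabase()", "See Verify the SP Switch2 topology configuration."),
    ("No valid backup", "No action required."),
    ("Cannot access the SDR", "See Verify the System Data Repository (SDR)."),
]


def lookup_estart_failure(flt_tail: str) -> Optional[str]:
    """Return the first listed action for the last known failure message."""
    # Scan from the bottom, because the steps say to view the end of the file.
    for line in reversed(flt_tail.splitlines()):
        for fragment, action in TABLE_46:
            if fragment in line:
                return action
    # No match: the steps say to contact the IBM Support Center.
    return None
```

A result of None corresponds to the last step of the procedure, where the failure message cannot be found in the table.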

Unfence an SP Switch2 node

The recovery action to take depends on the current status of the switch and on the personality of the switch node to be unfenced. The SP Switch2 status is limited to whether it is operational or not. The personality of the node is the role it is to take on the switch: primary node, primary backup node, or secondary node. For more information on any of the commands used in this section, see PSSP: Command and Technical Reference.

To display the switch primary node and primary backup node on all switch planes, issue the command:

Eprimary

Example output is:

plane 0:  none  - primary
plane 0:  5     - oncoming primary
plane 0:  none  - primary backup
plane 0:  45    - oncoming primary backup
plane 0:  1     - autounfence
 
plane 1:  none  - primary
plane 1:  5     - oncoming primary
plane 1:  none  - primary backup
plane 1:  45    - oncoming primary backup
plane 1:  1     - autounfence

In this example, no primary node is available. Therefore, the SP Switch2 is not operational. Remember that the Efence and Eunfence commands work only when the Eprimary command shows a valid node number as the primary node.
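
The reading of the Eprimary output described above can be sketched in Python. The parser assumes every line has the shape shown in the example output (plane number, value, role); the function names parse_eprimary and operational_planes are hypothetical.

```python
import re
from typing import Dict

# One line of example output: "plane 0:  5     - oncoming primary"
_LINE = re.compile(r"^plane\s+(\d+):\s+(\S+)\s+-\s+(.+?)\s*$")


def parse_eprimary(output: str) -> Dict[int, Dict[str, str]]:
    """Map each plane number to a {role: value} dictionary."""
    planes: Dict[int, Dict[str, str]] = {}
    for raw in output.splitlines():
        match = _LINE.match(raw.strip())
        if match:  # blank separator lines are skipped
            plane, value, role = int(match.group(1)), match.group(2), match.group(3)
            planes.setdefault(plane, {})[role] = value
    return planes


def operational_planes(planes: Dict[int, Dict[str, str]]) -> Dict[int, bool]:
    """A plane is treated as operational when its primary is not 'none'."""
    return {p: roles.get("primary", "none") != "none" for p, roles in planes.items()}
```

Applied to the example output, both planes are reported as not operational, because the primary is none on each.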

Note: In the following commands, the -p flag can be used to specify the number of the switch plane. If the -p flag is not used, the command is applied to all switch planes.

SP Switch2 is operational and the node is to be the secondary. The node to unfence is not listed as the primary or the oncoming primary, and there is a primary node.
- Use the command Eunfence to unfence the node.
SP Switch2 is operational and the node is to be the primary. The node to unfence is the oncoming primary, and another node is currently the primary.
- Use the command Eunfence to unfence the node.
- Use the command Estart to set the node to its primary personality.
SP Switch2 is not operational and the node is to be the secondary. No node is listed as primary, and another node is listed as the oncoming primary.
- Use the command Estart to operate the SP Switch2.
- Use the command Eunfence to unfence the node.
SP Switch2 is not operational and the node is to be the primary. No node is listed as primary and the fenced node is listed as the oncoming primary.
- Use the command Eprimary to set another node as oncoming primary.
  Select a node that is not fenced as the oncoming primary; otherwise the Estart command will fail again.
- Use the command Estart to operate the SP Switch2.
- Use the command Eunfence to unfence the node.
- Use the command Eprimary to set the unfenced node to be the oncoming primary.
- Use the command Estart to set the node as the switch primary.
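
The four cases above reduce to two questions: is there a primary node, and is the fenced node the oncoming primary? A minimal Python sketch of that decision follows. The function name unfence_plan is hypothetical, and the returned steps are descriptions rather than literal command lines, since the exact command arguments are documented in PSSP: Command and Technical Reference.

```python
from typing import List


def unfence_plan(primary: str, oncoming_primary: str, node: str) -> List[str]:
    """Return the ordered steps for unfencing `node` on one switch plane.

    `primary` and `oncoming_primary` are the values reported by Eprimary
    ("none" when no node holds the role).
    """
    operational = primary != "none"           # a primary exists, so the switch is up
    to_be_primary = node == oncoming_primary  # the fenced node is the oncoming primary

    if operational and not to_be_primary:
        return ["Eunfence the node"]
    if operational and to_be_primary:
        return ["Eunfence the node", "Estart to give the node its primary personality"]
    if not to_be_primary:
        return ["Estart to bring up the switch", "Eunfence the node"]
    return [
        "Eprimary to set another, unfenced node as oncoming primary",
        "Estart to bring up the switch",
        "Eunfence the node",
        "Eprimary to set the unfenced node as oncoming primary",
        "Estart to make the node the switch primary",
    ]
```

For the example Eprimary output (primary none, oncoming primary 5), unfence_plan("none", "5", "5") yields the five-step sequence of the last case.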

Eunfence error recovery

Use this section if you failed to unfence a node by following the procedure described in Unfence an SP Switch2 node.

To isolate and correct most Eunfence problems, you should refer first to Ecommands error recovery.

Note: In the following commands, the -p flag can be used to specify the number of the switch plane. If the -p flag is not used, the command is applied to all switch planes.

The following list provides additional reasons for a particular node to fail to Eunfence:

The Eunfence of a node failed, but the SP Switch2 was not Estarted. You cannot attempt to Eunfence any node on an SP Switch2 that is not started. Issue the Estart command.
The node can no longer be reached through the switch network. More information can be gathered from the out.top trace file. See out.top.
The SP Switch2 node failed to Eunfence because the switch topology could not be distributed. See Diagnosing SP Security Services problems.
The node failed to respond when attempting to Eunfence it. See SP Switch2 node diagnostics to isolate and correct the problem.
If you receive the message "Cannot Unfence node xxx - timeout", the most likely cause is that the fault service daemon (fault_service_Worm_RTG_CS) is not running on the node. If this is the case, issue the /usr/lpp/ssp/css/rc.switch command to start the daemon. If the daemon is still not running, refer to the rc.switch.log trace file. See rc.switch.log.
If you receive a message similar to "Cannot Unfence node xxx - timeout" after you have replaced the switch cable, see Cable diagnostics. Even though the fault service daemon (fault_service_Worm_RTG_CS) is running, you must issue the /usr/lpp/ssp/css/rc.switch command to reload and reset the adapter before you can try to Eunfence the node.
If any of the preceding procedures fail to resolve the problem, and the node is still fenced, gather the css logs of the primary node and the fenced node. This can be accomplished by logging into those nodes and issuing the /usr/lpp/ssp/css/css.snap command. See Information to collect before contacting the IBM Support Center.

switch_responds is still on after node panic

This section addresses the case where a node panics, host_responds becomes 0, and switch_responds0 or switch_responds1 is still 1. This is a valid condition when the SP Switch2 adapter on the crashed node has no outstanding requests to or from this node. The SP Switch2 is now in a state where it can become backlogged, since the link is still marked as up. This can cause problems on other parts of the SP Switch2 network.

Each node's fault-service daemon is responsible for updating its switch_responds0 and switch_responds1. The SP Switch2 primary node detects fallen links and turns off the appropriate switch_responds. The switch_responds0 or switch_responds1 is turned on only during Estart or Eunfence command processing. A node panic with switch_responds still 1 is a legitimate occurrence. There are two cases:

With primary or backup SP Switch2 nodes running, the switch_responds0 or switch_responds1 is updated only after a packet is sent to the panicked node. Therefore, a user can change switch_responds0 or switch_responds1 by trying to ping the panicked node. Having HA (hats and hags) run on the nodes can remedy the situation, since they routinely send IP packets between the nodes in order to check the links (LAN Adapter event).
Without primary or backup SP Switch2 node running, there is no switch control. In this case, switch_responds0 or switch_responds1 is not updated. Only a new Estart command, specifying the -p flag with the correct switch plane number, can correct this.
