If your system shows signs of a switch failure, locate the symptom and perform the recovery action described. All the recovery actions described require root access on the specified switch node. If any of the recovery actions fail to resolve your problem, contact the IBM Support Center.
Table 44 lists the known symptoms of a failure in the SP Switch2, and points the user to the location of the detailed diagnostics and recovery action. You may have a symptom that does not appear in the table. In this case, view the error entry in the AIX error log and see AIX Error Log information.
Table 44. SP Switch2 symptoms and recovery actions
Symptom | Recovery action |
---|---|
Estart failure: | |
Node drops off of the switch (switch_responds is 0 for the node). | See Verify SP Switch2 node operation. |
Node fails to communicate over the switch, but its switch_responds is 1 (ping or CSS_test commands fail). | |
Node crash. | See Node crash. |
Node fails to Eunfence. | |
Oncoming primary node is fenced. | |
Ecommand failure. | See Ecommands error recovery. |
diag_fail condition for an SP Switch2 adapter. | See SP Switch2 adapter diagnostics. |
switch_responds is still 1 after node panic. | See switch_responds is still on after node panic. |
The switch_plane and switch_plane_seq numbers in the Switch class of the SDR are incorrect. The SDR_config command appears to number switches incorrectly. | See plane.info file. |
You can restart the fault_service_Worm_RTG_CS daemon on the node by issuing:
/usr/lpp/ssp/css/rc.switch
After running rc.switch, issue this command to determine whether the daemon is running or has died:
ps -e | grep Worm
At this point you should be able to Eunfence the node by issuing:
Eunfence problem_node_number
The output should be similar to the following:
All nodes successfully unfenced.
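The daemon check in the procedure above can also be scripted. The following is a minimal sketch, not part of PSSP: the function name worm_running is hypothetical, and the sketch assumes that the process name appears in the last column of the ps -e listing.

```shell
#!/bin/sh
# worm_running: hypothetical helper, not a PSSP command.
# Reads a "ps -e" listing on standard input and succeeds (exit status 0)
# if any process name in the last column contains "Worm".
# Assumption: the process name is the last field of each line.
worm_running() {
    awk '$NF ~ /Worm/ { found = 1 } END { exit !found }'
}
```

On the node, issuing ps -e | worm_running && echo running prints running only when a matching process is present. Reading the listing from standard input keeps the check separate from the ps call, so it can also be applied to a saved listing.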
If you cannot resolve the problem, contact the IBM Support Center. You should also attempt to gather all log files associated with this failure. See Information to collect before contacting the IBM Support Center.
The following steps enable you to recover from switch initialization failures that impact the worm subsystem.
switch initialization failed with xx
where xx is the number to look up in Table 45.
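When the message is buried in captured text, the return code can be pulled out before looking it up. The following is a minimal sketch under stated assumptions: the function name worm_return_codes is hypothetical, and the message text is assumed to arrive on standard input in exactly the form shown above.

```shell
#!/bin/sh
# worm_return_codes: hypothetical helper, not a PSSP command.
# Reads text on standard input and prints the numeric code from every
# line that contains "switch initialization failed with <code>".
# Lines without the message produce no output.
worm_return_codes() {
    sed -n 's/.*switch initialization failed with \(-\{0,1\}[0-9][0-9]*\).*/\1/p'
}
```

Each printed value is the number to look up in the return code table.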
Table 45. SP Switch2 worm return codes and analysis
Return code | Analysis |
---|---|
-2 | Explanation: Adapter sender port is not connected to the switch. (Phase 1 failure). Cause: Oncoming primary is fenced off the switch. Action: Pick another node as primary and run Estart again. See Unfence an SP Switch2 node. |
-3 | Explanation: Adapter receiver port is not connected to the switch. (Phase 1 failure). Cause: Oncoming primary is fenced off the switch. Action: Pick another node as primary and run Estart again. See Unfence an SP Switch2 node. |
-4 | Explanation: Unable to generate routes for the network. (Phase 1 failure). Cause: Corrupted topology file. Action: See Verify the SP Switch2 topology configuration. |
-5 | Explanation: Send packet from local node failed. Cause: Bad switch adapter. Action: Run the switch adapter diagnostics on the primary node. If diagnostics fail to isolate the problem, contact the IBM Support Center. |
-6 | Explanation: A switch miswire was detected. Cause: Switch network cabling does not match the switch topology file. Action: View the /var/adm/SPlogs/css0/p0/cable_miswire or /var/adm/SPlogs/css1/p0/cable_miswire file to determine which cables are in question. Then disconnect and check the associated cables. If the problem persists, contact IBM Hardware Service. |
-7 | Explanation: Unable to generate routes for the network. (Phase 1 failure). Cause: Primary node is not connected in the right switch capsule. Action: View the /var/adm/SPlogs/css0/p0/cable_miswire or /var/adm/SPlogs/css1/p0/cable_miswire file to determine which cables are in question. See PSSP: Planning, Volume 2 for wiring information. Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service. |
-8 | Explanation: Receive FIFO is full. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the primary node. If diagnostics fail to isolate the problem, contact IBM Hardware Service. Cause: The switch is backed up from a node or a switch chip. Action: Contact the IBM Support Center. |
-9 | Explanation: Unable to initialize FIFOs. (Phase 1 failure). Cause: Software problem. Action: Contact the IBM Support Center. |
-10 | Explanation: Node found in Switch Chip's FIFO. (Phase 1 failure). Cause: Software problem. Action: Contact the IBM Support Center. |
-12 | Explanation: The worm was unable to contact the oncoming primary backup node. New backup will be selected. Cause: This is not an error. Action: None. |
-13 | Explanation: Switch chip id mismatch from a previous connected switch chip. (Phase 1 failure). Cause: Hardware problem. Action: If the problem persists, contact IBM Hardware Service. |
-23 | Explanation: The switch chip that the oncoming primary node is connected to did not respond. The oncoming primary node failed to communicate with the switch. (Phase 1 failure). Cause: Oncoming primary node is fenced. Action: Pick another node as oncoming primary and run Estart again. See Unfence an SP Switch2 node. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the oncoming primary node. If they fail, try to isolate the problem. Contact IBM Hardware Service. Cause: Bad switch board. Action: If the problem persists, contact IBM Hardware Service. |
-27 | Explanation: The TBIC was not initialized. Cause: The switch adapter is uninitialized. Action: Run the script rc.switch on the primary node, then issue the Estart command from the control workstation. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the primary node. If diagnostics fail to isolate the problem, contact IBM Hardware Service. |
-36 | Explanation: This node resigned as the primary node. Cause: The node determined it could no longer control and monitor the switch. The primary backup node is now in control of the switch. Action: No action required. |
-41 | Explanation: Worm reached retry limit. Cause: System cables may have a problem. The system is not stable. Action: View /var/adm/SPlogs/css0/p0/out.top or /var/adm/SPlogs/css1/p0/out.top on the oncoming primary node. Check or comment out all links that are marked as fenced or faulty. Check or comment out all nodes that were not found by the worm. Run Estart again. If the problem persists, contact IBM Hardware Service. |
-43 | Explanation: A read or write operation to the switch adapter failed. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the primary node. If diagnostics fail to isolate the problem, contact IBM Hardware Service. |
-51 | Explanation: Unexpected return. The software encountered an unexpected value. Cause: This can happen for these reasons: ID mismatch, lock handling failure, unexpected SDR access, query or setting failure, null pointer that should have a value, unexpected memory update failure, unexpected value inside a packet. Action: The software automatically creates a css.snap file. Call the IBM Support Center with this file. |
-52 | Explanation: No response from the switch chip connected to the oncoming primary. (Phase 1 failure). Cause: Cable between the oncoming primary node and switch board is faulty. Action: Check or reconnect the oncoming primary cable and try again. Cause: Adapter error on the oncoming primary node. Action: Run switch adapter diagnostics on the oncoming primary node. If they fail, isolate the problem. Contact IBM Hardware Service. |
-54 | Explanation: Unknown device id returned from the switch chip that the oncoming primary is connected to. (Phase 1 failure). Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service. Cause: System not configured properly. Action: Check your configuration on the control workstation and try again. Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service. |
-55 | Explanation: Switch chip signature test failed or failed to reset switch chip errors. (Phase 1 failure). Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service. Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service. |
-56 | Explanation: Switch chip connected to the oncoming primary reported the primary to be connected to internal switch port. (Phase 1 failure). Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service. Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service. |
-57 | Explanation: Oncoming primary connected to the wrong switch. (Phase 1 failure). Cause: The frame supervisor's TTY is not cabled properly. Action: Reconnect the frame supervisor's TTY and try again. If the problem persists, contact IBM Hardware Service. Cause: System not configured properly. Action: Check your configuration on the control workstation and try again. Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service. |
-58 | Explanation: Oncoming primary connected in the wrong place. (Phase 1 failure). Cause: Cable miswire. Action: View the /var/adm/SPlogs/css0/p0/cable_miswire or /var/adm/SPlogs/css1/p0/cable_miswire file to determine which cables are in question. See PSSP: Planning, Volume 2 for wiring information. |
-61 | Explanation: Failed to reset the oncoming primary's switch chip's error registers. (Phase 1 failure). Cause: Switch board hardware failure. Action: If the problem persists, contact IBM Hardware Service. |
Error isolation for any of the Ecommands is as follows.
To isolate and recover from failures in the Estart command, follow these steps:
Table 46. SP Switch2 Estart problems and analysis
Return Code | Analysis |
---|---|
Error in buildDeviceDatabase() | Explanation: Unable to build the device database. Cause: Missing or corrupted topology file. Action: See Verify the SP Switch2 topology configuration. Cause: malloc failures. Action: See Information to collect before contacting the IBM Support Center and contact the IBM Support Center. |
Error in CSswitchInit() | Explanation: Unable to initialize the switch network. Cause: Switch initialization failed. Action: See Worm error recovery. |
Error in writeDeviceDatabase() | Explanation: Unable to write /var/adm/SPlogs/css0/p0/out.top or /var/adm/SPlogs/css1/p0/out.top file. Cause: Missing or corrupted topology file. Action: See Verify the SP Switch2 topology configuration. Cause: The /var file system is not large enough to accommodate the new out.top file. Action: Increase the size of /var. |
No valid backup - SDR current Backup being changed to none | Explanation: Informational message.
Cause: No node available as a backup. Action: No action required. |
Cannot access the SDR - SDR current Backup not changed | Explanation: SDR failures
Cause: SDR not set up properly. Action: See Verify the System Data Repository (SDR). |
Error in: | Explanation: An error occurred accessing file /var/adm/SPlogs/css0/p0/act.top.PID or /var/adm/SPlogs/css1/p0/act.top.PID. Cause: File access problems. Action: Evaluate the errno returned and take the appropriate action. If the problem persists, contact the IBM Support Center. |
The recovery action to take depends on the current status of the switch and the personality of the switch node to be unfenced. The SP Switch2 status is simply whether the switch is operational or not. The personality of the node is the role it is to take on the switch: primary node, primary backup node, or secondary node. For more information on any of the commands used in this section, see PSSP: Command and Technical Reference.
To display the switch primary node and primary backup node on all switch planes, issue the command:
Eprimary
plane 0: none - primary
plane 0: 5 - oncoming primary
plane 0: none - primary backup
plane 0: 45 - oncoming primary backup
plane 0: 1 - autounfence
plane 1: none - primary
plane 1: 5 - oncoming primary
plane 1: none - primary backup
plane 1: 45 - oncoming primary backup
plane 1: 1 - autounfence
In this example, no primary node is available. Therefore, the SP Switch2 is not operational. Remember that the Efence and Eunfence commands work only when the Eprimary command shows a valid node number as the primary node.
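The check for a missing primary node can be automated. The following is a minimal sketch, not part of PSSP: the function name planes_without_primary is hypothetical, and it assumes that the Eprimary output contains one entry per line in the form "plane N: value - role", as in the example above.

```shell
#!/bin/sh
# planes_without_primary: hypothetical helper, not a PSSP command.
# Reads Eprimary output on standard input and prints the number of each
# switch plane whose "primary" entry is "none".
# Assumption: one entry per line, formatted "plane N: value - role".
planes_without_primary() {
    awk 'NF == 5 && $1 == "plane" && $4 == "-" && $5 == "primary" && $3 == "none" {
        sub(/:$/, "", $2)   # strip the colon that follows the plane number
        print $2
    }'
}
```

Issuing Eprimary | planes_without_primary prints nothing when every plane has a primary node. Any plane number it prints identifies a plane on which Efence and Eunfence cannot work until Estart succeeds.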
Select a node that is not fenced as the oncoming primary; otherwise, the Estart command will fail again.
Use this section when you have failed to unfence a node by following the procedure described in Unfence an SP Switch2 node.
To isolate and correct most Eunfence problems, you should refer first to Ecommands error recovery.
The following list provides additional reasons for a particular node to fail to Eunfence:
This section addresses the case where a node panics, host_responds becomes 0, and switch_responds0 or switch_responds1 is still 1. This is a valid condition when the SP Switch2 adapter on the crashed node has no outstanding requests to or from this node. The SP Switch2 is now in a state where it can become backlogged, since the link is still marked as up. This can cause problems on other parts of the SP Switch2 network.
Each node's fault-service daemon is responsible for updating its switch_responds0 and switch_responds1. The SP Switch2 primary node detects fallen links and turns off the appropriate switch_responds. The switch_responds0 or switch_responds1 is turned on only during Estart or Eunfence command processing. A node panic with switch_responds still 1 is a legitimate occurrence. There are two cases: