If your system or system partition shows signs of a switch failure, locate the symptom and perform the recovery action described. All the recovery actions described require that the user have root access on the specified switch node. If any of the recovery actions fail to resolve your problem, contact the IBM Support Center.
Table 28 lists the known symptoms of a failure in the SP Switch, and points the user to the location of the detailed diagnostics and recovery action. You may have a symptom that does not appear in the table. In this case, view the error entry in the AIX error log and see AIX Error Log information.
Table 28. SP Switch symptoms and recovery actions
Symptoms | Recovery actions |
---|---|
Estart failure. | |
Node drops off the switch (switch_responds is OFF for the node). | See Verify SP Switch node operation. |
Node fails to communicate over the switch, but its switch_responds is ON (ping or CSS_test commands fail). | |
Node crash. | See Node crash. |
Node fails to Eunfence. | |
Oncoming primary node is fenced. | |
Ecommand failure. | See Ecommands error recovery. |
diag_fail condition for an SP Switch adapter. | See diag_fail condition for an SP Switch adapter. |
switch_responds is still ON after node panic. | See switch_responds is still on after node panic. |
You can restart the fault_service_Worm_RTG_SP daemon on the node by issuing:
/usr/lpp/ssp/css/rc.switch
After running rc.switch, issue this command to determine whether the daemon is running or has died:
ps -e | grep Worm
At this point you should be able to Eunfence the node by issuing:
Eunfence problem_node_number
The output is similar to the following:
All nodes successfully unfenced.
If you cannot resolve the problem, contact the IBM Support Center. You should also attempt to gather any log files that are associated with this failure. See Information to collect before contacting the IBM Support Center.
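The restart and verification steps above can be combined in a short script, run as root on the problem node. This is a minimal sketch, not part of PSSP: the function names `worm_daemon_listed` and `restart_worm_daemon` and the `RC_SWITCH` override variable are hypothetical, and it assumes that matching `Worm` in `ps -e` output, as in the text, is a sufficient check that the daemon is alive. It does not rely on the exit status of rc.switch.

```shell
# Hypothetical helper: reads `ps -e` style output on stdin and succeeds
# if any line mentions the Worm daemon (same match as `ps -e | grep Worm`).
worm_daemon_listed() {
    grep Worm >/dev/null
}

# Hypothetical wrapper: restart the daemon, then report whether it is running.
# RC_SWITCH may be overridden for testing; the default is the path from the text.
restart_worm_daemon() {
    "${RC_SWITCH:-/usr/lpp/ssp/css/rc.switch}"
    if ps -e | worm_daemon_listed; then
        echo "Worm daemon is running; Eunfence can be attempted next."
    else
        echo "Worm daemon is not running after rc.switch." >&2
        return 1
    fi
}
```

If `restart_worm_daemon` succeeds, continue with the Eunfence command as described above; if it fails, gather the log files before contacting the IBM Support Center.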
The following steps enable you to recover from switch initialization failures that affect the worm subsystem.
Table 29. SP Switch worm return codes and analysis
Return code | Analysis |
---|---|
-3 | Explanation: Local adapter receiver port is not enabled. Cause: The switch is not clocked. Action: From the control workstation (CWS), issue the command Eclock -d followed by the command Estart. Cause: Oncoming primary is fenced off the switch. Action: See Unfence an SP Switch node. |
-4 | Explanation: Unable to generate routes for the network. Cause: Corrupted topology file. Action: See Verify the SP Switch topology configuration. |
-5 | Explanation: Send packet from local node failed. Cause: Bad switch adapter. Action: Run the switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact the IBM Support Center. |
-6 | Explanation: A switch miswire was detected. Cause: Switch network cabling does not match the switch topology file. Action: View the /var/adm/SPlogs/css/cable_miswire file to determine which cables are in question. Then disconnect and check the associated cables. If the problem persists, contact IBM Hardware Service. |
-7 | Explanation: A node miswire was detected. Cause: Switch network cabling does not match the switch topology file. Action: The device is not cabled properly. There are two possible causes for this condition: the switch network is miswired or the frame supervisor's tty is not cabled properly. First view the /var/adm/SPlogs/css/cable_miswire file. Verify and correct all links listed in the file. Then issue the command Eclock -d followed by Estart. If the problem persists, contact IBM Hardware Service. |
-8 | Explanation: Receive FIFO is full. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact IBM Hardware Service. Cause: The switch is backed up from a node or a switch chip. Action: Contact the IBM Support Center. |
-9 | Explanation: Unable to initialize FIFOs. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact IBM Hardware Service. |
-27 | Explanation: The TBIC was not initialized. Cause: The switch adapter is uninitialized. Action: Run the script rc.switch on the failing node, then issue the Estart command from the control workstation. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact IBM Hardware Service. |
-36 | Explanation: This node resigned as the primary node. Cause: The node determined it could no longer control and monitor the SP Switch. The primary backup node is now in control of the SP Switch. Action: No action required. |
-43 | Explanation: A read or write operation to the switch adapter failed. Cause: Bad switch adapter. Action: Run switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact IBM Hardware Service. |
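For quick reference at the command line, the explanations in Table 29 can be held in a small lookup function. This is an illustrative sketch only: `worm_rc_explain` is a hypothetical name, not a PSSP command, and it covers only the return codes listed in the table.

```shell
# Hypothetical helper: print the Table 29 explanation for a worm return code.
# Returns nonzero for any code that the table does not list.
worm_rc_explain() {
    case "$1" in
        -3)  echo "Local adapter receiver port is not enabled." ;;
        -4)  echo "Unable to generate routes for the network." ;;
        -5)  echo "Send packet from local node failed." ;;
        -6)  echo "A switch miswire was detected." ;;
        -7)  echo "A node miswire was detected." ;;
        -8)  echo "Receive FIFO is full." ;;
        -9)  echo "Unable to initialize FIFOs." ;;
        -27) echo "The TBIC was not initialized." ;;
        -36) echo "This node resigned as the primary node." ;;
        -43) echo "A read or write operation to the switch adapter failed." ;;
        *)   echo "Return code $1 is not listed in Table 29." >&2
             return 1 ;;
    esac
}
```

For example, `worm_rc_explain -6` prints the miswire explanation, after which the Cause and Action entries in the table indicate what to check.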
Error isolation for any of the Ecommands (Eclock, Eannotator, and others) is as follows:
To isolate and recover from failures in the Estart command, follow these steps:
Table 30. SP Switch Estart problems and analysis
Error | Analysis |
---|---|
Error in buildDeviceDatabase() | Explanation: Unable to build the device database. Cause: Missing or corrupted topology file. Action: See Verify the SP Switch topology configuration. Cause: malloc failures. Action: See Information to collect before contacting the IBM Support Center and contact the IBM Support Center. |
Error in TBSswitchInit() | Explanation: Unable to initialize the switch network. Cause: Switch initialization failed. Action: See Worm error recovery. |
Error in writeDeviceDatabase() | Explanation: Unable to write /var/adm/SPlogs/css/out.top. Cause: Missing or corrupted topology file. Action: See Verify the SP Switch topology configuration. Cause: The /var file system is not large enough to accommodate the new out.top file. Action: Increase the size of /var. |
No valid backup - SDR current Backup being changed to none | Explanation: Informational message. Cause: No node available as a backup. Action: No action required. |
Cannot access SDR - SDR current Backup not changed | Explanation: SDR failure. Cause: SDR not set up properly. Action: See Verify the System Data Repository (SDR). |
Error in: | Explanation: An error occurred accessing file /var/adm/SPlogs/css/act.top.PID. Cause: File access problems. Action: Evaluate the errno returned and take the appropriate action. If the problem persists, contact the IBM Support Center. |
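When writeDeviceDatabase() fails, one of the causes in Table 30 is a /var file system that is too small for the new out.top file. The following sketch shows one way to check the free space before increasing /var. It assumes the POSIX `df -kP` output layout, in which the fourth column of the first data line is the available space in KB. The function names are hypothetical, and the threshold in the usage example is an arbitrary illustration, not a documented out.top size.

```shell
# Hypothetical helper: reads `df -kP /var` output on stdin and prints the
# "Available" column (free space in KB) from the first data line.
var_free_kb() {
    awk 'NR == 2 { print $4 }'
}

# Hypothetical check: succeeds if at least $1 KB are free.
var_has_free_kb() {
    free_kb=$(var_free_kb)
    [ -n "$free_kb" ] && [ "$free_kb" -ge "$1" ]
}
```

A possible invocation is `df -kP /var | var_has_free_kb 4096 || echo "/var may be too small for out.top"`.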
The recovery action to take depends on the current status of the SP Switch and on the personality of the switch node to be unfenced. The SP Switch status is simply whether or not the switch is operational. The personality of the node is the role it is to take: primary node, primary backup node, or secondary node. For more information on any of the commands used in this section, see PSSP: Command and Technical Reference.
To display the switch primary node and primary backup node, issue the command:
Eprimary
Example output is:
none - primary
2 - oncoming primary
none - primary backup
26 - oncoming primary backup
In this example, no primary node is available. Therefore, the SP Switch is not operational.
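The same check can be scripted. The sketch below assumes that Eprimary prints one "value - role" pair per line, as in the example, and that a value of none means the role is unassigned. The function names `eprimary_field` and `switch_has_primary` are hypothetical helpers, not PSSP commands.

```shell
# Hypothetical helper: reads Eprimary output on stdin and prints the value
# for the role named in $1 (for example "primary" or "oncoming primary backup").
eprimary_field() {
    awk -v role="$1" '
        $2 == "-" {
            r = $3
            for (i = 4; i <= NF; i++) r = r " " $i
            if (r == role) print $1
        }'
}

# Hypothetical check: succeeds only if a primary node is currently assigned.
switch_has_primary() {
    primary=$(eprimary_field "primary")
    [ -n "$primary" ] && [ "$primary" != "none" ]
}
```

A possible invocation is `Eprimary | switch_has_primary || echo "No primary node: the SP Switch is not operational"`.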
Use this section if you were unable to unfence a node by following the procedure described in Unfence an SP Switch node.
To isolate and correct most Eunfence problems, you should refer first to Ecommands error recovery.
The following list provides additional reasons for a particular node to fail to Eunfence:
This section addresses the case where a node panics, host_responds becomes OFF, and switch_responds is still ON. This is a valid condition when the SP Switch adapter on the crashed node has no outstanding requests to or from the node. The SP Switch is now in a state where it can become backlogged, since the link is still marked as up. This can cause problems on other parts of the SP Switch network.
Each node's fault-service daemon is responsible for updating its own switch_responds. The SP Switch primary node detects fallen links and turns off switch_responds for the nodes on those links. The switch_responds is turned on only during Estart or Eunfence command processing. A node panic with switch_responds ON is a legitimate occurrence. There are two cases: