
Diagnosis Guide


Error symptoms, responses, and recoveries

If your system or system partition shows signs of a switch failure, locate the symptom and perform the recovery action described. All of the recovery actions require root access on the specified switch node. If a recovery action fails to resolve your problem, contact the IBM Support Center.

Note:
If your system is running in restricted root access mode, the following commands must be issued from the control workstation:

SP Switch symptoms and recovery actions

Table 28 lists the known symptoms of a failure in the SP Switch and points to the detailed diagnostics and recovery action for each. If your symptom does not appear in the table, view the error entry in the AIX error log and see AIX Error Log information.


Table 28. SP Switch symptoms and recovery actions

Symptom: Estart failure.
  1. System cannot find the Estart command. Action: See Verify software installation.
  2. Primary node is not reachable. Action: See Verify SP Switch node operation.
  3. Estart command times out or fails. Action: See Estart error recovery.
  4. Expected number of nodes not initialized. Action: See SP Switch device and link error information.
  5. Some links do not initialize. Action: See SP Switch device and link error information.

Symptom: Node drops off the switch (switch_responds is OFF for the node).
Action: See Verify SP Switch node operation.

Symptom: Node fails to communicate over the switch, but its switch_responds is ON (ping or CSS_test commands fail).
Action:
  1. See Verify SP Switch node operation.
  2. See AIX Error Log information.

Symptom: Node crash.
Action: See Node crash.

Symptom: Node fails to Eunfence.
Action:
  1. See Unfence an SP Switch node.
  2. See Eunfence error recovery.

Symptom: Oncoming primary node is fenced.
Action:
  1. See Unfence an SP Switch node.
  2. See Eunfence error recovery.

Symptom: Ecommand failure.
Action: See Ecommands error recovery.

Symptom: diag_fail condition for an SP Switch adapter.
Action: See diag_fail condition for an SP Switch adapter.

Symptom: switch_responds is still ON after node panic.
Action: See switch_responds is still on after node panic.

Recover an SP Switch node

You can restart the fault_service_Worm_RTG_SP daemon on the node by issuing:

/usr/lpp/ssp/css/rc.switch 

After rc.switch completes, run this command to determine whether the daemon is still running or has died:

ps -e | grep Worm

At this point you should be able to Eunfence the node by issuing:

Eunfence problem_node_number

The output is similar to the following:

All nodes successfully unfenced.

If you cannot resolve the problem, contact the IBM Support Center. You should also attempt to gather any log files that are associated with this failure. See Information to collect before contacting the IBM Support Center.
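The restart-and-check sequence in this section can be sketched as a small shell function. This is a minimal sketch, not part of PSSP: the name worm_running is hypothetical, and the function reads ps -e output from standard input so that it can be tried on a machine that is not a switch node.

```shell
# worm_running: hypothetical helper, not a PSSP command.
# Reads `ps -e` output on stdin and succeeds if any line mentions
# the fault service daemon (matched on "Worm", as in the grep above).
worm_running() {
    grep -q Worm
}

# Intended use on the switch node, as root. The lines are commented
# out so that the sketch can be sourced without running rc.switch:
#   /usr/lpp/ssp/css/rc.switch
#   if ps -e | worm_running; then
#       echo "daemon is running; try: Eunfence problem_node_number"
#   else
#       echo "daemon has died; gather the log files for IBM"
#   fi
```

Reading from standard input keeps the check separate from the restart, so the same function can also be applied to saved ps output.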

Worm error recovery

The following steps enable you to recover from switch initialization failures which impact the worm subsystem.

  1. Log in to the failing node (see Eprimary command).
  2. View the bottom of the file /var/adm/SPlogs/css/worm.trace. Look for a message similar to one of the following, where xx represents any number:
  3. Use the rc (return code) number as an index into Table 29.
  4. If the return code cannot be found in the table, or the actions taken did not correct the problem, contact the IBM Support Center.


Table 29. SP Switch worm return codes and analysis

Return code Analysis
-3 Explanation: Local adapter receiver port is not enabled.

Cause: The switch is not clocked.

Action: From the control workstation, issue the command Eclock -d followed by the command Estart.

Cause: Oncoming primary is fenced off the switch.

Action: See Unfence an SP Switch node.

-4 Explanation: Unable to generate routes for the network.

Cause: Corrupted topology file.

Action: See Verify the SP Switch topology configuration.

-5 Explanation: Send packet from local node failed.

Cause: Bad switch adapter.

Action: Run the switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact the IBM Support Center.

-6 Explanation: A switch miswire was detected.

Cause: Switch network cabling does not match the switch topology file.

Action: View the /var/adm/SPlogs/css/cable_miswire file to determine which cables are in question. Then disconnect and check the associated cables. If the problem persists, contact IBM Hardware Service.

-7 Explanation: A node miswire was detected.

Cause: Switch network cabling does not match the switch topology file.

Action: The device is not cabled properly. There are two possible causes for this condition: the switch network is miswired or the frame supervisor's tty is not cabled properly.

First view the /var/adm/SPlogs/css/cable_miswire file. Verify and correct all links listed in the file. Then issue the command Eclock -d followed by Estart. If the problem persists, contact IBM Hardware Service.

-8 Explanation: Receive FIFO is full.

Cause: Bad switch adapter.

Action: Run the switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact IBM Hardware Service.

Cause: The switch is backed up from a node or a switch chip.

Action: Contact the IBM Support Center.

-9 Explanation: Unable to initialize FIFOs.

Cause: Bad switch adapter.

Action: Run the switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact IBM Hardware Service.

-27 Explanation: The TBIC was not initialized.

Cause: The switch adapter is uninitialized.

Action: Run the script rc.switch on the failing node, then issue the Estart command from the control workstation.

Cause: Bad switch adapter.

Action: Run the switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact IBM Hardware Service.

-36 Explanation: This node resigned as the primary node.

Cause: The node determined it could no longer control and monitor the SP Switch. The primary backup node is now in control of the SP Switch.

Action: No action required.

-43 Explanation: A read or write operation to the switch adapter failed.

Cause: Bad switch adapter.

Action: Run the switch adapter diagnostics on the failing node. If the diagnostics fail to isolate the problem, contact IBM Hardware Service.
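The lookup in Table 29 can be sketched as a shell case statement. This is an illustrative sketch only: the function name worm_rc_explain is hypothetical, and it returns just the Explanation text, so the Cause and Action entries in the table still need to be read.

```shell
# worm_rc_explain: hypothetical helper, not a PSSP command.
# Echoes the Table 29 explanation for a worm return code.
# Returns 1 if the code is not listed in the table.
worm_rc_explain() {
    case "$1" in
        -3)  echo "Local adapter receiver port is not enabled." ;;
        -4)  echo "Unable to generate routes for the network." ;;
        -5)  echo "Send packet from local node failed." ;;
        -6)  echo "A switch miswire was detected." ;;
        -7)  echo "A node miswire was detected." ;;
        -8)  echo "Receive FIFO is full." ;;
        -9)  echo "Unable to initialize FIFOs." ;;
        -27) echo "The TBIC was not initialized." ;;
        -36) echo "This node resigned as the primary node." ;;
        -43) echo "A read or write operation to the switch adapter failed." ;;
        *)   echo "Return code $1 is not in Table 29; contact the IBM Support Center."
             return 1 ;;
    esac
}
```

For example, worm_rc_explain -6 echoes the explanation for a switch miswire, and any code that is not in the table produces a message that points to the IBM Support Center.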

Ecommands error recovery

Error isolation for any of the Ecommands (Eclock, Eannotator, and others) is as follows:

  1. View the error output returned from the command. Note the error message number and text.
  2. Find the message in PSSP: Messages Reference.
  3. For more information, check the Ecommands.log trace file (see Ecommands.log).
  4. Perform the recommended recovery action.
  5. If the Ecommand failed because it was unable to communicate with every node, see Diagnosing SP Security Services problems.
  6. If the Ecommand failed because it cannot access the SDR, or the SDR is set up incorrectly, see Verify the System Data Repository (SDR).
  7. After the recovery action is taken, if the problem persists, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

Estart error recovery

To isolate and recover from failures in the Estart command, follow these steps:

  1. Log in to the primary node.
  2. View the bottom of the /var/adm/SPlogs/css/flt file.
  3. Use the failure message as an index to Table 30.
  4. If the failure message cannot be found in the table, or the actions taken did not correct the problem, contact the IBM Support Center.


Table 30. SP Switch Estart problems and analysis

Error Analysis
Error in buildDeviceDatabase() Explanation: Unable to build the device database.

Cause: Missing or corrupted topology file.

Action: See Verify the SP Switch topology configuration.

Cause: malloc failures.

Action: See Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

Error in TBSswitchInit() Explanation: Unable to initialize the switch network.

Cause: Switch initialization failed.

Action: See Worm error recovery.

Error in writeDeviceDatabase() Explanation: Unable to write /var/adm/SPlogs/css/out.top.

Cause: Missing or corrupted topology file.

Action: See Verify the SP Switch topology configuration.

Cause: The /var file system is not large enough to accommodate the new out.top file.

Action: Increase the size of /var.

No valid backup - SDR current Backup being changed to none Explanation: Informational message.

Cause: No node available as a backup.

Action: No action required.

Cannot access SDR - SDR current Backup not changed Explanation: SDR failure.

Cause: SDR not set up properly.

Action: See Verify the System Data Repository (SDR).

Error in:
  • fopen(act.top.PID)
  • fprintf(act.top.PID)
  • fclose(act.top.PID)
  • rename(act.top, act.top.PID)
Explanation: An error occurred accessing file /var/adm/SPlogs/css/act.top.PID.

Cause: File access problems.

Action: Evaluate the errno returned and take the appropriate action. If the problem persists, contact the IBM Support Center.
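Matching the end of the flt file against Table 30 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name estart_flt_hint is hypothetical, the failure strings are taken from the table, and the exact layout of a line in the flt file is not assumed beyond the failure string appearing somewhere in it.

```shell
# estart_flt_hint: hypothetical helper, not a PSSP command.
# Reads log text on stdin and prints a Table 30 pointer for each
# line that contains one of the failure strings from the table.
estart_flt_hint() {
    while IFS= read -r line; do
        case "$line" in
            *"Error in buildDeviceDatabase()"*)
                echo "buildDeviceDatabase: verify the topology configuration" ;;
            *"Error in TBSswitchInit()"*)
                echo "TBSswitchInit: see Worm error recovery" ;;
            *"Error in writeDeviceDatabase()"*)
                echo "writeDeviceDatabase: verify the topology configuration and the size of /var" ;;
            *"No valid backup"*)
                echo "No valid backup: informational, no action required" ;;
            *"Cannot access SDR"*)
                echo "Cannot access SDR: verify the SDR" ;;
        esac
    done
}

# Intended use on the primary node (commented out so that the
# sketch can be sourced anywhere):
#   tail -n 20 /var/adm/SPlogs/css/flt | estart_flt_hint
```

Lines that match none of the strings produce no output, which corresponds to step 4: the failure is not in the table and the IBM Support Center should be contacted.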

Unfence an SP Switch node

The recovery action to take depends on the current status of the SP Switch and on the personality of the switch node to be unfenced. The SP Switch status is either operational or not operational. The personality of the switch node is the role that the node is to take once unfenced: primary node, primary backup node, or secondary node. For more information on any of the commands used in this section, see PSSP: Command and Technical Reference.

To display the switch primary node and primary backup node, issue the command:

Eprimary

Example output is:

none - primary
2 - oncoming primary
none - primary backup
26 - oncoming primary backup

In this example, no primary node is available. Therefore, the SP Switch is not operational.
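The check made in this example can be sketched in shell. This is a sketch only: the function name switch_operational is hypothetical, and it assumes that Eprimary output always has the "node - role" layout shown in the example above.

```shell
# switch_operational: hypothetical helper, not a PSSP command.
# Reads Eprimary output on stdin, assumed to be lines of the form
# "<node> - <role>" as in the example. Succeeds only if the line
# whose role is exactly "primary" names a node other than "none".
switch_operational() {
    awk -F' - ' '
        { sub(/^[ \t]+/, ""); sub(/[ \t]+$/, "") }   # trim the line
        $2 == "primary" { found = ($1 != "none") }
        END { exit(found ? 0 : 1) }
    '
}

# Intended use (commented out so that the sketch can be sourced
# on a machine without PSSP):
#   if Eprimary | switch_operational; then
#       echo "SP Switch is operational"
#   else
#       echo "SP Switch is not operational"
#   fi
```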

  1. SP Switch is operational and the node is to be the secondary. The node to unfence is not listed as the primary or the oncoming primary, and there is a primary node.

    Use the command Eunfence to unfence the node.

  2. SP Switch is operational and the node is to be the primary. The node to unfence is the oncoming primary and another node is currently the primary.
  3. SP Switch is not operational and the node is to be the secondary. No node is listed as primary, and another node is listed as the oncoming primary.
  4. SP Switch is not operational and the node is to be the primary. No node is listed as primary, and the fenced node is listed as the oncoming primary.
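Selecting among the four cases above can be sketched as a shell function. This is an illustrative sketch: the function name unfence_case and its argument order are hypothetical, and the primary and oncoming primary values are assumed to be copied by hand from the Eprimary output.

```shell
# unfence_case: hypothetical helper, not a PSSP command.
# $1 = primary node, $2 = oncoming primary node (both as shown by
# Eprimary, "none" if unset), $3 = number of the fenced node.
# Echoes the matching case number (1-4), or 0 if no case applies.
unfence_case() {
    primary=$1
    oncoming=$2
    node=$3
    if [ "$primary" != "none" ]; then
        # A primary node exists, so the SP Switch is operational.
        if [ "$node" = "$primary" ]; then
            echo 0      # node is already the primary; not covered
        elif [ "$node" = "$oncoming" ]; then
            echo 2      # node is to be the primary
        else
            echo 1      # node is to be a secondary
        fi
    else
        # No primary node, so the SP Switch is not operational.
        if [ "$node" = "$oncoming" ]; then
            echo 4      # fenced node is the oncoming primary
        elif [ "$oncoming" != "none" ]; then
            echo 3      # another node is the oncoming primary
        else
            echo 0      # no oncoming primary either; not covered
        fi
    fi
}
```

With the example Eprimary output shown earlier, unfence_case none 2 5 selects case 3 and unfence_case none 2 2 selects case 4.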

Eunfence error recovery

Use this section if a node fails to unfence after you follow the procedure described in Unfence an SP Switch node.

To isolate and correct most Eunfence problems, you should refer first to Ecommands error recovery.

The following list provides additional reasons for a particular node to fail to Eunfence:

  1. The Eunfence of a node failed because the SP Switch was not Estarted. You cannot Eunfence a node on an SP Switch that is not started. Issue the Estart command.
  2. The node can no longer be reached through the switch network. For more information, check the out.top trace file (see out.top).
  3. The SP Switch node failed to Eunfence because the switch topology could not be distributed. See Diagnosing SP Security Services problems.
  4. The node failed to respond to the Eunfence attempt. See SP Switch node diagnostics to isolate and correct the problem.
  5. You receive the message "Cannot Unfence node xxx - timeout". The most likely cause is that the fault service daemon (fault_service_Worm_RTG_SP) is not running on the node. If this is the case, issue the /usr/lpp/ssp/css/rc.switch command to start the daemon. If the daemon is still not running, refer to the rc.switch.log trace file. See rc.switch.log.
  6. You receive the message "Cannot Unfence node xxx - timeout" after replacing the switch cable (see Cable diagnostics). Even though the fault service daemon (fault_service_Worm_RTG_SP) is running, you must issue the /usr/lpp/ssp/css/rc.switch command to reload and reset the adapter before you try again to Eunfence the node.
  7. If none of the preceding procedures resolves the problem and the node is still fenced, gather the css logs of the primary node and the fenced node. To do this, log in to each of those nodes and issue the /usr/lpp/ssp/css/css.snap command. See Information to collect before contacting the IBM Support Center.
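Picking the timed-out nodes out of saved Eunfence output can be sketched as follows. This is a sketch under stated assumptions: the function name timed_out_nodes is hypothetical, and it assumes that the xxx in the message quoted above is a decimal node number.

```shell
# timed_out_nodes: hypothetical helper, not a PSSP command.
# Reads Eunfence output on stdin and prints the node number from
# each "Cannot Unfence node <n> - timeout" message, one per line.
# Assumes <n> is a decimal node number.
timed_out_nodes() {
    sed -n 's/.*Cannot Unfence node \([0-9][0-9]*\) - timeout.*/\1/p'
}

# Intended use (commented out so that the sketch can be sourced
# on a machine without PSSP):
#   Eunfence problem_node_number 2>&1 | timed_out_nodes
```

Each node number printed is a candidate for items 5 and 6 in the list above: check whether the fault service daemon is running on that node, and run rc.switch if it is not or if the switch cable was replaced.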

switch_responds is still on after node panic

This section addresses the case where a node panics, host_responds becomes OFF, and switch_responds is still ON. This is a valid condition when the SP Switch adapter on the crashed node has no outstanding requests to or from the node. Because the link is still marked as up, the SP Switch can become backlogged, which can cause problems on other parts of the SP Switch network.

The fault service daemon on each node is responsible for updating that node's switch_responds. The SP Switch primary node detects links that have gone down and turns off switch_responds for the affected nodes. The switch_responds is turned on only during Estart or Eunfence command processing. A node panic with switch_responds ON is a legitimate occurrence. There are two cases:

  1. With a primary or backup SP Switch node running, switch_responds is updated only after a packet is sent to the panicked node. Therefore, you can cause switch_responds to be updated by trying to ping the panicked node. Running HA (hats and hags) on the nodes can also remedy the situation, because these subsystems routinely send IP packets between the nodes to check the links (LAN Adapter event).
  2. Without a primary or backup SP Switch node running, there is no switch control. In this case, switch_responds is not updated. Only a new Estart can correct this.
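The two cases above can be sketched as a shell decision function. This is an illustrative sketch only: the function name stale_responds_action and its arguments are hypothetical, and the ON or OFF values are assumed to be supplied by the caller rather than queried from the system.

```shell
# stale_responds_action: hypothetical helper, not a PSSP command.
# $1 = host_responds, $2 = switch_responds ("ON" or "OFF"),
# $3 = "yes" if a primary or backup SP Switch node is running.
# Echoes the remedy described in this section for a panicked node.
stale_responds_action() {
    if [ "$1" = "OFF" ] && [ "$2" = "ON" ]; then
        if [ "$3" = "yes" ]; then
            # Case 1: a packet sent to the node triggers the update.
            echo "ping the panicked node over the switch"
        else
            # Case 2: no switch control; only Estart corrects it.
            echo "issue Estart"
        fi
    else
        echo "not the condition described in this section"
    fi
}
```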

