Diagnosis Guide


Error information

SP Switch2 operation and recovery are initialized from the switch primary node. When the primary node fails, the primary backup node takes over and becomes the new switch primary node. For SP Switch2 related problems, the error messages that have been logged are found on the current primary node.

SP Switch2 Time Of Day (TOD) is maintained by the Master Switch Sequencer (MSS) node. The MSS node is selected and monitored from the control workstation by the emasterd daemon. For TOD related problems, the logged error messages are found on the control workstation.

If the Switch Admin daemon (cssadm2) is running on the control workstation, the logged error messages are found on the control workstation.

SP Switch2 log and temporary file hierarchy

All SP Switch2 log and temporary files are organized in a directory hierarchy. The file level is given next to each directory.

                         /var/adm/SPlogs/css  (node)
                         /                  \
   /var/adm/SPlogs/css0  (adapter)      /var/adm/SPlogs/css1  (adapter)
              |                                    |
   /var/adm/SPlogs/css0/p0  (port)      /var/adm/SPlogs/css1/p0  (port)

Relevant files are found in these directories.

In this chapter, whenever a temporary file is mentioned, the file level is given using this terminology:

  1. node level - The file resides in the /var/adm/SPlogs/css directory. There will be only one file on the node with the same name. The file contains information relevant to all the adapters with all their ports.
  2. adapter level - The file resides in the /var/adm/SPlogs/css0 or /var/adm/SPlogs/css1 directory, where 0 or 1 is the adapter ID. The file contains information relevant to a specific adapter, which affects all the ports of the adapter.
  3. port level - The file resides in the /var/adm/SPlogs/css0/p0 or /var/adm/SPlogs/css1/p0 directory, where 0 or 1 is the adapter ID and the 0 in p0 is the port ID. The file contains information relevant to that port of the adapter.
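
The three file levels can be told apart from the directory name alone. The following Python sketch illustrates that mapping. The function name file_level and the regular expressions are assumptions made for this example and are not part of PSSP; the sketch also assumes that adapter IDs 0 and 1 and port ID 0 are the only valid values, as described above.

```python
import posixpath
import re

# Directory patterns taken from the three levels described above.
# Assumption: only adapter IDs 0 and 1 and port p0 exist.
_LEVELS = [
    (re.compile(r"^/var/adm/SPlogs/css[01]/p0$"), "port"),
    (re.compile(r"^/var/adm/SPlogs/css[01]$"), "adapter"),
    (re.compile(r"^/var/adm/SPlogs/css$"), "node"),
]


def file_level(path):
    """Return 'node', 'adapter', or 'port' for a log file path, else None."""
    directory = posixpath.dirname(posixpath.normpath(path))
    for pattern, level in _LEVELS:
        if pattern.match(directory):
            return level
    return None


print(file_level("/var/adm/SPlogs/css0/p0/flt"))  # port
```

A path outside the hierarchy returns None, which makes it easy to flag files that were saved in an unexpected location.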

plane.info file

This file has a full path name of /etc/plane.info. It is optional: create it only if you want to override the switch-to-plane number calculations made by the SDR_config command. The file consists of one line for each switch, in the following format:

Frame#:Slot# Plane# Sequence#

The Sequence# is the switch number within the plane. For example, the first switch in plane 1 is sequence number 1, the second switch in plane 1 is sequence number 2, and the first switch in plane 2 is sequence number 1.

A sample file for a two-plane SP Switch2 system would be:

		1:17 0 1
		2:17 1 1
		3:17 0 2
		4:17 1 2

If something in this file is incorrect, the switch_plane and switch_plane_seq numbers in the Switch class of the SDR will reflect the errors. If one of your switches goes down or you have to disconnect it, the SDR_config command may try to renumber your switches if it does not see that switch. In this case, you can reserve the spot for that switch by creating an /etc/plane.info file consisting of what your system should look like when that broken switch is up and running.

The /etc/plane.info file should be deleted when no longer needed, or the SDR_config command will always use it to override its own calculations.
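
Before relying on an /etc/plane.info file, it can help to check that every line follows the Frame#:Slot# Plane# Sequence# format and to list the sequence numbers used in each plane. The Python sketch below shows one way to do this. The function names parse_plane_info and sequences_by_plane are hypothetical, and skipping blank lines is an assumption of the sketch, not documented SDR_config behavior.

```python
def parse_plane_info(text):
    """Parse 'Frame#:Slot# Plane# Sequence#' lines into a list of dicts."""
    switches = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split()
        if len(fields) != 3 or ":" not in fields[0]:
            raise ValueError(
                f"line {lineno}: expected 'Frame#:Slot# Plane# Sequence#'"
            )
        frame, slot = fields[0].split(":", 1)
        switches.append({
            "frame": int(frame),
            "slot": int(slot),
            "plane": int(fields[1]),
            "sequence": int(fields[2]),
        })
    return switches


def sequences_by_plane(switches):
    """Group sequence numbers by plane so gaps or duplicates are easy to spot."""
    planes = {}
    for sw in switches:
        planes.setdefault(sw["plane"], []).append(sw["sequence"])
    return {plane: sorted(seqs) for plane, seqs in planes.items()}


# The two-plane sample file shown above.
SAMPLE = """\
1:17 0 1
2:17 1 1
3:17 0 2
4:17 1 2
"""

print(sequences_by_plane(parse_plane_info(SAMPLE)))  # {0: [1, 2], 1: [1, 2]}
```

For the sample file, each plane contains sequence numbers 1 and 2, which matches the rule that the sequence number counts switches within a plane.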

AIX Error Log information

In order to isolate an adapter or SP Switch2 error, first view the AIX error log.

The Resource Name (Res Name) in the error log gives you an indication of what resource detected the failure.

Table 31. Resource Name failure indications - SP Switch2

Resource name   Indication
Worm            The information was extracted from the SP Switch2 initialization.
css             Incorrect status was detected by the adapter (css device driver).
css0            Failed adapter diagnostics on adapter 0.
css1            Failed adapter diagnostics on adapter 1.

For a more detailed description, issue the AIX command:

errpt -a [-N resource_name] | more

where the optional resource_name is one of the entries in Table 31.
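
When the error log is collected by a script, the command line can be built from the resource names in Table 31. The Python sketch below only constructs the argument list; it does not run errpt, which is available only on AIX. The helper name errpt_command is hypothetical, and the only flags used are the -a and -N flags shown above.

```python
# Resource names from Table 31.
RESOURCE_NAMES = ("Worm", "css", "css0", "css1")


def errpt_command(resource_name=None):
    """Build the argument list for 'errpt -a [-N resource_name]'."""
    argv = ["errpt", "-a"]
    if resource_name is not None:
        if resource_name not in RESOURCE_NAMES:
            raise ValueError(f"unknown resource name: {resource_name!r}")
        argv += ["-N", resource_name]
    return argv


print(" ".join(errpt_command("css0")))  # errpt -a -N css0
```

On an AIX node, the returned list could be passed to subprocess.run to capture the detailed report for later analysis.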

There are several subcomponents that write entries in the AIX error log:

  1. Adapter recovery errors, which are recognized by labels with the prefixes: CS_ADAPT_, CS_ATRANS_, CS_PTRANS_, CS_PORT_, CSS_DD_, CSS_SLIH_, and HACSSRMD_.

    Table 32. Possible causes of adapter failures - SP Switch2

    Label and Error ID Error description and analysis
    CS_PTRANS_HW_RE

    FECAAB29

    Explanation: SP Switch2 adapter port - transient hardware error.

    Cause: SP Switch2 cable failure.

    Action: Check, reconnect, unfence, and if the problem persists, replace the cable.

    Cause: SP Switch2 adapter port hardware failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call IBM Hardware Service.

    Cause: SP Switch2 failure.

    Action: Call IBM Hardware Service if the problem persists.

    CS_PORT_HW_ER

    984F7BA3

    Explanation: SP Switch2 adapter port - permanent hardware error.

    Cause: SP Switch2 cable failure.

    Action: Check, reconnect, unfence, and if the problem persists, replace the cable.

    Cause: SP Switch2 adapter port hardware failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call IBM Hardware Service.

    Cause: SP Switch2 failure.

    Action: Call IBM Hardware Service if the problem persists.

    CS_ATRANS_HW_RE

    506F0AF2

    Explanation: SP Switch2 adapter - transient hardware error.

    Cause: SP Switch2 adapter hardware failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call IBM Hardware Service.

    CS_ATRANS_MC_RE

    6561F900

    Explanation: SP Switch2 adapter - transient microcode error.

    Cause: SP Switch2 adapter microcode error.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call the IBM Support Center.

    CS_ATRANS_SFW_RE

    07E73229

    Explanation: SP Switch2 adapter - transient software error.

    Cause: SP Switch2 adapter device driver error.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call the IBM Support Center.

    CS_ATRANS_HW_MC_RE

    218342C7

    Explanation: SP Switch2 adapter - transient hardware or microcode error.

    Cause: SP Switch2 adapter failure.

    Cause: An adapter microcode failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call IBM Hardware Service.

    CS_ATRANS_HW_SFW_RE

    827413D0

    Explanation: SP Switch2 adapter - transient hardware or software error.

    Cause: SP Switch2 adapter failure.

    Cause: An adapter device driver software failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call IBM Hardware Service.

    CS_ADAPT_HW_RE

    CDECB780

    Explanation: SP Switch2 adapter - critical hardware error.

    Cause: SP Switch2 adapter failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call IBM Hardware Service.

    CS_ADAPT_MC_RE

    DF2AE96B

    Explanation: SP Switch2 adapter - critical microcode error.

    Cause: An adapter microcode failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call IBM Hardware Service.

    CS_ADAPT_SFW_RE

    958E05ED

    Explanation: SP Switch2 adapter - critical software error.

    Cause: SP Switch2 adapter device driver failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call the IBM Support Center.

    CS_ADAPT_HW_MC_RE

    D0999C1C

    Explanation: SP Switch2 adapter - critical hardware or microcode error.

    Cause: SP Switch2 adapter failure.

    Cause: An adapter microcode failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call IBM Hardware Service.

    CS_ADAPT_HW_SFW_RE

    DA7623E7

    Explanation: SP Switch2 adapter - critical hardware or software error.

    Cause: SP Switch2 adapter failure.

    Cause: An adapter device driver software failure.

    Action: If the problem persists:

    • Run adapter diagnostics.
    • Call the IBM Support Center.

    CS_ADAPT_HW_ER

    DD4FECEA

    Explanation: SP Switch2 adapter - permanent hardware error.

    Cause: SP Switch2 adapter failure.

    Action:

    • Run adapter diagnostics.
    • Call IBM Hardware Service.

    CS_ADAPT_MC_ER

    85421CF0

    Explanation: SP Switch2 adapter - permanent microcode error.

    Cause: SP Switch2 adapter microcode failure.

    Action:

    • Run adapter diagnostics.
    • Call the IBM Support Center.

    CS_ADAPT_SFW_ER

    F6E84D66

    Explanation: SP Switch2 adapter - permanent software error.

    Cause: SP Switch2 adapter device driver failure.

    Action:

    • Run adapter diagnostics.
    • Call the IBM Support Center.

    CS_ADAPT_HW_MC_ER

    405FA51A

    Explanation: SP Switch2 adapter - permanent hardware or microcode error.

    Cause: SP Switch2 adapter failure.

    Cause: SP Switch2 adapter microcode failure.

    Action:

    • Run adapter diagnostics.
    • Call the IBM Support Center.

    CS_ADAPT_HW_SFW_ER

    035E8AD9

    Explanation: SP Switch2 adapter - permanent hardware or software error.

    Cause: SP Switch2 adapter failure.

    Cause: SP Switch2 adapter device driver failure.

    Action:

    • Run adapter diagnostics.
    • Call the IBM Support Center.

    CSS_DD_DEBUG_ER

    8ED77B7A

    Explanation: SP Switch2 CSS device driver error.

    Cause: SP Switch2 device driver failure.

    Action: Call the IBM Support Center.

    CSS_DD_CFG_ER

    0195C376

    Explanation: SP Switch2 CSS device driver configuration error.

    Cause: SP Switch2 device driver failure.

    Action:

    • Run the configuration method with the verbose flag: /usr/lpp/ssp/css/cfgcol -v -l css[0 | 1]
    • Call the IBM Support Center.

    CSS_SLIH_ER

    BAB33325

    Explanation: SP Switch2 CSS device driver - an unexpected interrupt occurred.

    Cause: SP Switch2 adapter or SP Switch2 failure.

    Action:

    • See more logged information in the AIX error log.
    • Run adapter diagnostics.
    • If the problem persists, call IBM Hardware Service.

    HACSSRMD_ERR

    C3E70E5D

    Explanation: SP Switch2 CSS hacssrmd daemon terminated.

    Cause: Unknown failure.

    Action: If the problem persists, call the IBM Support Center.

  2. SP Switch2 fault service daemon errors, which are recognized by labels with the prefix: CS_SW_.

    Table 33. Possible causes of fault service daemon failures - SP Switch2

    Label and Error ID Error description and analysis
    CS_SW_ADPT_TYPE_ER

    CC1DCEED

    Explanation: The connected adapter type is not supported on SP Switch2.

    Cause: The user plugged an unsupported adapter or node into the SP Switch2 port.

    Action: Call the IBM Support Center.

    CS_SW_SEND_HANG_RE

    69AB5AEC

    Explanation: A Sender Hang was detected.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_TKNCOUNTER_RE

    3EF9DDC7

    Explanation: A Token Counter Error occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_INIT_STATE_RE

    EB8CFA87

    Explanation: Initialization State Machine error occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_TOD_ECC_RE

    35D40633

    Explanation: Receiver TOD ECC Error occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_CQ_PE_NCL_RE

    066BD301

    Explanation: Parity Error on Next Chunk Linked List occurred.

    Cause: SP Switch2 chip saw an error on a received packet.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_CQ_PE_FSL_RE

    E273ABC6

    Explanation: Parity Error on Free Space Linked List occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_CQ_SRM_EC_RE

    5A19E266

    Explanation: Source Routed Multicast ECC Error occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_CQ_MCSRDT_RE

    89316FCB

    Explanation: Multicast Source Routed Decode Table Parity error occurred.

    Cause: A switch chip saw an error on a received packet.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_CQ_MCLRTD_RE

    A974CB87

    Explanation: Multicast Lookup Table Route Decoder Parity Error occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_CQ_RCA_PE_RE

    27305E8F

    Explanation: Repeat Count Array Parity Error occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_MULTICASTR_RE

    FADF4398

    Explanation: Multicast Route Error occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_CHIP_ID_ER_RE

    43D748CF

    Explanation: Chip ID Error occurred.

    Cause: A switch chip configuration error. The hardware monitor daemon or the control workstation is down.

    Action: Check to see if the hardware monitor daemon is up.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_SVC_ARROVR_RE

    D32FD026

    Explanation: Service Array Overflow Latch occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_PE_SVCARRI_RE

    BAEF6722

    Explanation: Parity Error on input to Service Array occurred.

    Cause: SP Switch2 chip saw an error on a received packet.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_PE_SVCARRO_RE

    59D3D44A

    Explanation: Parity Error on output to Service Array occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_INV_SVCCMD_RE

    D2833C50

    Explanation: Invalid Service Command error occurred.

    Cause: SP Switch2 chip saw an error on a received packet.

    Action: If the problem persists, call IBM Hardware Service.

    CS_TOD_ERROR_RE

    A11A52E1

    Explanation: Error occurred in TOD logic.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_CSS_IF_FAIL_ER

    B454D630

    Explanation: SP Switch2 adapter service interface system call failed.

    Cause: Unable to communicate with the SP Switch2 adapter.

    Action:

    • Check the switch adapter configuration.
    • Run adapter diagnostics. See SP Switch2 adapter diagnostics.
    • If the problem persists, call IBM Hardware Service.

    CS_SW_TKN_CNT_O_RE

    E5895205

    Explanation: A Sender Token Count Overflow occurred.

    Cause: SP Switch2 chip failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_ACK_FAILED_RE

    D4E9B237

    Explanation: SP Switch2 daemon failed to acknowledge a service command.

    Cause: SP Switch2 communication failure.

    Cause: A traffic backlog on SP Switch2 adapter.

    Action: If the problem persists, call the IBM Support Center.

    CS_SW_EDC_ERROR_RE

    E6E27F0C

    Explanation: An EDC-class error was detected.

    Cause: A transient error in data occurred during transmission over switch links. The EDC error may be one of the following:

    • A receiver EDC error.
    • A parity error on route.
    • An undefined control character was received.
    • Unsolicited data was received.
    • A receiver lost end-of-packet.
    • A token count miscomparison.
    • A token sequence error.
    • A token count overflow.

    Cause: A loose, disconnected, or faulty cable.

    Action:

    • See Cable diagnostics.
    • See /var/adm/SPlogs/css[0 | 1]/p0/out.top for cable information.

    Cause: A node was shut down, reset, powered off, or disconnected.

    Action: See Verify SP Switch2 node operation.

    Cause: SP Switch2 adapter hardware failure.

    Action: For more information, see SP Switch2 device and link error information.

    CS_SW_RCVLNKSYNC_RE

    37D7841C

    Explanation: A Receiver Port Link Synch Failure occurred.

    Cause: A loose, disconnected, or faulty cable.

    Action:

    • Check, reconnect, or replace the cable.
    • See /var/adm/SPlogs/css[0 | 1]/p0/out.top for cable information.
    • See Cable diagnostics.

    Cause: A node was shut down, reset, powered off, or disconnected.

    Action:

    Cause: SP Switch2 adapter hardware failure.

    Action:

    • Run adapter diagnostics.
    • See /var/adm/SPlogs/css[0 | 1]/p0/flt for more information.

    Cause: Remote SP Switch2 adapter hardware failure.

    Action:

    • Run adapter diagnostics on the remote node.
    • See /var/adm/SPlogs/css[0 | 1]/p0/flt to identify the remote node.

    CS_SW_FIFOOVRFLW_RE

    821465C1

    Explanation: A Receiver FIFO Overflow error was detected.

    Cause: A loose, disconnected, or faulty cable.

    Action:

    Cause: A node was shut down, reset, powered off, or disconnected.

    Action:

    Cause: SP Switch2 adapter hardware failure.

    Action:

    CS_SW_EDCTHRSHLD_RE

    39FCD5B9

    Explanation: EDC Error Threshold condition occurred.

    Cause: A loose, disconnected, or faulty cable.

    Action:

    • See Cable diagnostics.
    • See /var/adm/SPlogs/css[0 | 1]/p0/flt for more information.
    • If the problem persists, call IBM Hardware Service.

    CS_SW_RECV_STATE_RE

    255F1AA2

    Explanation: SP Switch2 receiver state machine error.

    Cause: SP Switch2 adapter or switch failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_PE_ON_DATA_RE

    1F59782A

    Explanation: SP Switch2 sender parity error on data was detected.

    Cause: SP Switch2 board failure.

    Action: Call IBM Hardware Service.

    CS_SW_INVALD_RTE_RE

    02A63E85

    Explanation: SP Switch2 sender invalid route error occurred.

    Cause: SP Switch2 adapter microcode or a switch daemon software error.

    Action: Call the IBM Support Center.

    CS_SW_SNDLOSTEOP_RE

    F835CDED

    Explanation: Sender Lost EOP (end-of-packet) condition occurred.

    Cause: A loose, disconnected, or faulty cable.

    Action:

    • See Cable diagnostics.
    • See /var/adm/SPlogs/css[0 | 1]/p0/out.top for cable information.

    Cause: A node was shut down, reset, powered off, or disconnected.

    Action: See Verify SP Switch2 node operation.

    Cause: SP Switch2 adapter hardware failure.

    Action:

    CS_SW_SNDTKNTHRS_RE

    80CF3B5A

    Explanation: A Token Error Threshold error occurred.

    Cause: A loose, disconnected, or faulty cable.

    Action:

    • See Cable diagnostics.
    • If the problem persists, call IBM Hardware Service.
    • For more information, see /var/adm/SPlogs/css[0 | 1]/p0/flt.

    CS_SW_SND_STATE_RE

    74CEAB0F

    Explanation: A Sender State Machine Error occurred.

    Cause: SP Switch2 adapter or SP Switch2 failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_PE_ON_NMLL_RE

    7F704673

    Explanation: A Parity Error on the NMLL was detected.

    Cause: SP Switch2 board failure.

    Action: Call IBM Hardware Service.

    CS_SW_CRC_SVCPKT_RE

    8B091668

    Explanation: SP Switch2 service logic detected an incorrect CRC on a Service Packet.

    Cause: A transient error in data occurred during transmission over SP Switch2 links.

    Action: See /var/adm/SPlogs/css[0 | 1]/p0/flt for more information.

    Cause: A loose, disconnected, or faulty cable.

    Action:

    Cause: A node was shut down, reset, powered off, or disconnected.

    Action:

    Cause: SP Switch2 adapter hardware failure.

    Action:

    CS_SW_PE_RTE_TBL_RE

    E8F741CD

    Explanation: SP Switch2 service logic detected an incorrect Parity Error in the Route Table.

    Cause: SP Switch2 board failure.

    Action: Call IBM Hardware Service.

    CS_SW_SVC_STATE_RE

    CF66D3CC

    Explanation: SP Switch2 service logic state machine error.

    Cause: SP Switch2 board failure.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_OFFLINE_RE

    57959ED9

    Explanation: Node received a fence (Offline) request.

    Cause: The operator ran the Efence command.

    Action: Run the Eunfence command to bring the node onto the SP Switch2.

    CS_SW_PRI_TAKOVR_RE

    A8978621

    Explanation: SP Switch2 primary node takeover.

    Cause: The SP Switch2 primary node became inaccessible.

    Action: See the AIX error log on the previous SP Switch2 primary node.

    CS_SW_BCKUP_TOVR_RE

    FD2D84AD

    Explanation: SP Switch2 primary-backup node takeover.

    Cause: The SP Switch2 primary-backup node became inaccessible.

    Action: See the AIX error log on the previous switch primary-backup node.

    CS_SW_LST_BUP_CT_RE

    2196D5B4

    Explanation: SP Switch2 primary-backup node not responding.

    Cause: The SP Switch2 primary-backup node became inaccessible.

    Action: See the AIX error log on the current SP Switch2 primary-backup node.

    CS_SW_UNINI_NODE_RE

    96DD24B7

    Explanation: SP Switch2 nodes not initialized during Estart command processing.

    Cause: The listed nodes were shut down, reset, powered off, or disconnected.

    Action:

    Cause: SP Switch2 adapter problem.

    Action: Run adapter diagnostics on listed nodes.

    CS_SW_UNINI_LINK_RE

    362E5B7

    Explanation: SP Switch2 links were not initialized during Estart command processing.

    Cause: The node was fenced.

    Action: Run the Eunfence command to unfence the node.

    Cause: The switch cable is not wired correctly.

    Action: See /var/adm/SPlogs/css[0 | 1]/p0/cable_miswire to determine if cables were not wired correctly.

    Cause: A loose, disconnected, or faulty cable.

    Action: See Cable diagnostics.

    CS_PROCESS_KILLD_RE

    D250F9DB

    Explanation: User Process was killed due to link outage.

    Cause: SP Switch2 adapter failure or SP Switch2 failure.

    Cause: The operator fenced this node.

    Action: See neighboring error log entries to determine the cause of the outage.

    CS_SW_MISWIRE_ER

    933B622E

    Explanation: SP Switch2 cable miswired (not connected to the correct switch jack).

    Cause: The SP Switch2 cable was not wired correctly.

    Action: See /var/adm/SPlogs/css[0 | 1]/p0/cable_miswire to determine if cables were not wired correctly.

    CS_SW_HARDWARE_ER

    F96576C4

    Explanation: Defective SP Switch2 board.

    Cause: SP Switch2 board configuration problem.

    Action:

    • Issue the command: hmcmds -G setid frame:slot to reconfigure the SP Switch2 board.
    • If the problem persists, call the IBM Support Center.

    Cause: Faulty SP Switch2 board.

    Action: If the problem persists, call IBM Hardware Service.

    CS_SW_LOGFAILURE_RE

    5ABE7E20

    Explanation: Error writing SP Switch2 log files.

    Cause: The /var file system is full.

    Action: Obtain free space in the file system or expand the file system.

    Cause: There are too many files open in the system.

    Action: Reduce the number of open files in the system.

    CS_SW_INIT_FAIL_ER

    957E82AA

    Explanation: Switch fault-service daemon initialization failed.

    Cause: The operating environment could not be established.

    Action:

    • See Detail Data of this entry for a specific failure.
    • Correct the problem and restart the daemon by running the rc.switch command.
    • If the problem persists, call the IBM Support Center.

    CS_SW_SIGTERM_ER

    A98EF5D8

    Explanation: SP Switch2 fault service daemon received SIGTERM.

    Cause: Another process sent a SIGTERM.

    Action: Run the rc.switch command to restart the daemon.

    CS_SW_SVC_Q_FULL_RE

    172826EF

    Explanation: SP Switch2 service send queue is full.

    Cause: There is a traffic backlog on the SP Switch2 adapter.

    Action: If the problem persists, call the IBM Support Center.

    CS_SW_GET_SVCREQ_ER

    4DFEC48

    Explanation: SP Switch2 daemon could not get a service request.

    Cause: SP Switch2 device driver failure.

    Action: Call the IBM Support Center.

    CS_SW_RSGN_PRIM_RE

    585D90B2

    Explanation: The SP Switch2 Primary node resigned from the job as primary node.

    Cause: Could not communicate over the SP Switch2.

    Action: See neighboring AIX error log entries to determine the cause of the outage.

    Cause: Another node was selected as the primary node.

    Action: None.

    CS_SW_RSGN_BKUP_RE

    C32FD9D3

    Explanation: Resigning as the SP Switch2 primary-backup node.

    Cause: Could not communicate over the SP Switch2.

    Action: See neighboring error log entries to determine the cause of the outage.

    Cause: Another node was selected as the primary-backup node.

    Action: None.

    CS_SW_SDR_FAIL_RE

    3E6F3E2E

    Explanation: Switch fault service daemon failed to communicate with SDR.

    Cause: An Ethernet overload.

    Action: If the problem persists, call the IBM Support Center.

    Cause: Excessive traffic to the SDR.

    Action: If the problem persists, call the IBM Support Center.

    Cause: The SDR daemon on the control workstation is down.

    Action: Check to see if the SDR daemon is up.

    Cause: A software error.

    Action: If the problem persists, call the IBM Support Center.

    CS_SW_SCAN_FAIL_ER

    63589548

    Explanation: SP Switch2 scan failed.

    Cause: Could not communicate over the SP Switch2.

    Cause: SP Switch2 adapter failure or an SP Switch2 failure.

    Action: Issue the Estart command if primary takeover does not occur.

    CS_SW_PLANEMISW_ER

    94E99A66

    Explanation: SP Switch2 plane miswire.

    Cause: The two ends of an SP Switch2 cable are connected to switch ports or node ports that belong to different SP Switch2 planes.

    Action: See /var/adm/SPlogs/css[0 | 1]/p0/cable_miswire to determine which cables were not wired correctly.

    CS_SW_NODEMISW_RE

    A19DCA76

    Explanation: SP Switch2 node miswired.

    Cause: SP Switch2 cable was not plugged into the correct node.

    Action: See /var/adm/SPlogs/css[0 | 1]/p0/cable_miswire to determine if cables were not wired correctly.

    CS_SW_NODECONF_RE

    CEB4B5AF

    Explanation: SP Switch2 node configuration error.

    Cause: A switch node was not configured properly to the system.

    Cause: An unknown node was plugged into the system - probable miswire.

    Action:

    • Use the switch node number and plane given to find the offending node.
    • See /var/adm/SPlogs/css[0 | 1]/p0/cable_miswire to determine where the offending node is connected.
    • Reconfigure or disconnect the offending node.

    CS_SW_RTE_GEN_RE

    44D2A1B5

    Explanation: SP Switch2 daemon failed to generate routes.

    Cause: A software error.

    Action: Call the IBM Support Center.

    CS_SW_FENCE_FAIL_RE

    A6E635F9

    Explanation: Fence of node off SP Switch2 failed.

    Cause: Could not communicate over the SP Switch2.

    Action:

    • See /var/adm/SPlogs/css[0 | 1]/p0/flt for more information.
    • See the error log on the failing node.
    • Issue the Estart command to initialize the switch network.

    CS_SW_REOP_WIN_ER

    0C17D5C7

    Explanation: Switch fault service daemon reopen adapter windows failed.

    Cause: SP Switch2 adapter or daemon recovered.

    Action: If the problem persists, call the IBM Support Center.

    CS_SW_ESTRT_FAIL_RE

    4EE9669F

    Explanation: Estart command failed - switch network could not be initialized.

    Cause: Could not initialize SP Switch2 chips or nodes.

    Action:

    • See Detail Data of this entry for the specific failure.
    • If the problem persists, call the IBM Support Center.

    CS_SW_IP_RESET_ER

    A6BCABA3

    Explanation: Switch fault service daemon could not reset IP.

    Cause: SP Switch2 device driver error.

    Action: If the problem persists, call the IBM Support Center.

    CS_SW_CBCST_FAIL_RE

    31C01480

    Explanation: Switch fault service daemon command broadcast failed.

    Cause: Could not communicate over the SP Switch2.

    Cause: A traffic backlog on the SP Switch2 adapter.

    Action: If the problem persists, call the IBM Support Center.

    CS_SW_UBCST_FAIL_RE

    F7704403

    Explanation: Switch fault service daemon database updates broadcast failed.

    Cause: SP Switch2 communication failure.

    Cause: A traffic backlog on the SP Switch2 adapter.

    Action: If the problem persists, call the IBM Support Center.

    CS_SW_DNODE_FAIL_RE

    19337D09

    Explanation: Switch daemon failed to communicate with dependent nodes.

    Cause: Failed to communicate over the SP Switch2.

    Cause: A traffic backlog on the SP Switch2 adapter.

    Action: If the problem persists, call the IBM Support Center.

    CS_SW_PORT_STUCK_RE

    889BE7C3

    Explanation: SP Switch2 port cannot be disabled. Eunfence command failed.

    Cause:

    • SP Switch2 chip or adapter hardware error.
    • Cable failure.

    Action:

    • See /var/adm/SPlogs/css[0 | 1]/p0/flt on the primary node for more information.
    • Run adapter diagnostics on the node that failed to unfence.

    CS_SW_FSD_TERM_ER

    1C27CFCD

    Explanation: Switch fault service daemon process was terminated.

    Action: See preceding error log entries to determine the cause of the failure.

    Cause: Faulty system planar.

    Action: Run complete diagnostics on the node.

    Cause: Not enough free space left in the node's /var/adm/SPlogs file system.

    Action: Obtain more space.

  3. Adapter Diagnostic errors, which are recognized by labels with a prefix of: SWT_DIAG_

    Table 34. Possible causes of adapter diagnostic failures - SP Switch2

    Label and Error ID Error description and analysis
    SWT_DIAG_ERROR1_ER

    8998B96D

    Explanation: SP Switch2 adapter failed post-diagnostics. See the man page for the diag command.

    Cause: Faulty switch adapter.

    Action: See SP Switch2 adapter diagnostics.

    SWT_DIAG_ERROR2_ER

    2FFF253A

    Explanation: SP Switch2 adapter failed diagnostics.

    Cause: Faulty switch adapter.

    Action: See SP Switch2 adapter diagnostics.

  4. emasterd errors, which are recognized by labels with a prefix of: CS_EMSTR_

    Table 35. Possible causes of SP Switch2 TOD management (emasterd) failures

    Label and Error ID Error description and analysis
    CS_EMSTR_EXIT_ER

    FFDEEAA3

    Explanation: SP Switch2 TOD management daemon (emasterd) exited on the control workstation.

    Action: Look for more information in the AIX error log on the control workstation.

    Cause: Insufficient free space in /var/adm/SPlogs file system.

    Action: Free space in the file system.

    Cause: The daemon cannot communicate with the SDR.

    Action: Verify that the SDR daemon is running, and restart the emasterd daemon.

    CS_EMSTR_RESIGN_ER

    13D2008B

    Explanation: SP Switch2 TOD management daemon (emasterd) failed to resign the current MSS (Master Switch Sequencer) node.

    Cause: No communication with the MSS node via the SP Switch2 or Ethernet. No new MSS will be assigned.

    Action: Resume communication with the node.

    CS_EMSTR_MS_SRCH_ER

    EB6CBD01

    Explanation: SP Switch2 TOD management daemon (emasterd) failed to find the current MSS (Master Switch Sequencer) node. The SDR information is incorrect.

    Cause: No communication with the MSS node via the SP Switch2 or Ethernet. No new MSS will be assigned.

    Action: Resume communication with the node.

  5. SP Switch2 PCI Adapter errors, which are recognized by these error log entries.

    Table 36. Possible causes of SP Switch2 PCI Adapter failures

    Label and Error ID Error description and analysis
    CS_RCVY_START_RE

    6317E423

    Explanation: Critical error recovery is starting.

    Action: This is likely to be a switch adapter hardware or microcode failure. See neighboring error log entries for the cause of the outage.

    CRS_FENCE_ER

    1485995F

    Explanation: CSS Adapter Port-Permanent Error(Fence)

    Cause: Switch cable or switch failure

    Action: Run adapter diagnostic. See /var/adm/SPlogs/css/css[0 | 1]/la_error.log and /var/adm/SPlogs/css[0 | 1]/la_event_d.trace for details.

    Cause: Switch cable loose or disconnected

    Action: Check, reconnect, or replace the cable. Unfence the node.

    CRS_OFFLINE_ER

    4E14B222

    Explanation: CSS Adapter - Permanent Hardware or Software Error(Offline)

    Cause: Switch adapter hardware or software error.

    Action: Run adapter diagnostic. See /var/adm/SPlogs/css/css[0 | 1]/la_error.log and /var/adm/SPlogs/css[0 | 1]/la_event_d.trace for details. Record the above information and contact the IBM Support Center.

    CRS_RESTART_HWSW_ER

    4007FD6A

    Explanation: CSS Adapter - Critical Hardware or Software Error(Restart)

    Cause: Switch adapter hardware or software error.

    Action: See /var/adm/SPlogs/css/css[0 | 1]/la_error.log and /var/adm/SPlogs/css[0 | 1]/la_event_d.trace for details. If the problem persists, run adapter diagnostics and call hardware support.

    CRS_LAMSG_SEND_FAIL

    5144238C

    Explanation: Event Handler failed to send La_event_d error action

    Cause: Switch adapter hardware or software error.

    Action: See /var/adm/SPlogs/css/css[0 | 1]/la_error.log and /var/adm/SPlogs/css[0 | 1]/la_event_d.trace for details.

    CORSAIR_CONFIG1_ER

    FFCF4911

    Explanation: CSS config failed.

    Cause: Software Error, ODM error.

    Action: Run config method with verbose option for more information.
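
When scanning error log output, it can help to sort labels into the groups described above. The following Python fragment is a minimal sketch, not part of PSSP: the function name classify_label is hypothetical, and it covers only the SWT_DIAG_ and CS_EMSTR_ prefixes and the labels listed in Table 36. Labels from the other groups in this list are reported as unclassified.

```python
# Prefix-based groups, taken from list items 3 and 4.
PREFIX_GROUPS = {
    "SWT_DIAG_": "Adapter diagnostic errors (Table 34)",
    "CS_EMSTR_": "emasterd errors (Table 35)",
}

# Table 36 entries share no single prefix, so they are listed explicitly.
PCI_ADAPTER_LABELS = {
    "CS_RCVY_START_RE",
    "CRS_FENCE_ER",
    "CRS_OFFLINE_ER",
    "CRS_RESTART_HWSW_ER",
    "CRS_LAMSG_SEND_FAIL",
    "CORSAIR_CONFIG1_ER",
}


def classify_label(label):
    """Return the error group for a label, or None if it is not covered here."""
    label = label.strip()
    if label in PCI_ADAPTER_LABELS:
        return "SP Switch2 PCI Adapter errors (Table 36)"
    for prefix, group in PREFIX_GROUPS.items():
        if label.startswith(prefix):
            return group
    return None
```

For example, classify_label("CS_EMSTR_EXIT_ER") returns the Table 35 group, which indicates that the AIX error log on the control workstation is the place to look.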

Adapter configuration error information

The following table is based on the possible values of the adapter_config_status attribute of the Adapter object of the SDR. Use the following command to determine its value:

SDRGetObjects Adapter adapter_type==[css0 | css1] node_number adapter_config_status

Use the value of the adapter_config_status attribute for the node in question to index into Table 37. A correctly configured CSS adapter has the value css_ready.

Note:
The adapter_config_status table that follows uses the phrase "adapter configuration command". This refers to the SP Switch2 adapter configuration method. Use this command to invoke it:
/usr/lpp/ssp/css/cfgcol -v -l [css0 | css1] > output_file_name


Table 37. adapter_config_status values - SP Switch2

adapter_config_status Explanation and recovery
css_ready Correctly configured CSS adapter.
odm_fail

genmajor_fail

genminor_fail

getslot_fail

build_dds_fail

Explanation: An ODM failure has occurred while configuring the CSS adapter.

Action: Rerun the adapter configuration command. If the problem persists, contact the IBM Support Center and supply the command output.

getslot_fail Verify that the CSS adapter is properly seated, then rerun the adapter configuration command. If the problem persists, contact the IBM Support Center and supply them with the command output.
busresolve_fail Explanation: There are insufficient bus resources to configure the CSS adapter.

Action: Contact the IBM Support Center.

dd_load_fail See Verify software installation. If software installation verification is successful and the problem persists, contact the IBM Support Center.
make_special_fail Explanation: The CSS device special file could not be created during adapter configuration.

Action: Rerun the adapter configuration command. If the problem persists, contact the IBM Support Center and supply them with the command output.

dd_config_fail Explanation: An internal device driver error occurred during CSS adapter configuration.

Action: See Information to collect before contacting the IBM Support Center.

diag_fail Explanation: SP Switch2 diagnostics failed.

Action: See SP Switch2 adapter diagnostics.

not_configured Explanation: The CSS adapter is missing or not configured.
pdd_init_fail

load_khal_fail

See Verify software installation. If software installation verification is successful and the problem persists, contact the IBM Support Center.
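
The lookup that Table 37 describes can be sketched as a small Python dictionary. This is an illustrative aid only: the names RECOVERY and recovery_for are hypothetical, the action text is abridged from the table, and the value passed in is assumed to be the adapter_config_status string that SDRGetObjects reported. Because getslot_fail appears in two rows of the table, the sketch uses the more specific row.

```python
# Abridged recovery text, transcribed from Table 37.
RERUN = ("Rerun the adapter configuration command. If the problem persists, "
         "contact the IBM Support Center and supply the command output.")
VERIFY_SW = ("See Verify software installation. If verification succeeds and "
             "the problem persists, contact the IBM Support Center.")

RECOVERY = {
    "css_ready": "No action: the CSS adapter is correctly configured.",
    "odm_fail": RERUN,
    "genmajor_fail": RERUN,
    "genminor_fail": RERUN,
    "build_dds_fail": RERUN,
    # getslot_fail is listed twice in Table 37; the specific row is used here.
    "getslot_fail": "Verify that the CSS adapter is properly seated. " + RERUN,
    "busresolve_fail": "Contact the IBM Support Center.",
    "dd_load_fail": VERIFY_SW,
    "make_special_fail": RERUN,
    "dd_config_fail": ("See Information to collect before contacting the "
                       "IBM Support Center."),
    "diag_fail": "See SP Switch2 adapter diagnostics.",
    "not_configured": "The CSS adapter is missing or not configured.",
    "pdd_init_fail": VERIFY_SW,
    "load_khal_fail": VERIFY_SW,
}


def recovery_for(status):
    """Return the abridged Table 37 recovery text for a status value."""
    return RECOVERY.get(status.strip(), "Unknown status: not listed in Table 37.")
```

Any value other than css_ready indicates a configuration problem on that adapter. The full wording in Table 37 remains the authoritative text.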

SP Switch2 device and link error information

The current status of devices and links is gathered in the annotated switch topology file, out.top, which is created on each plane of each node that has its corresponding switch_responds set to 1. For plane 0, switch_responds0 must be 1. For plane 1, switch_responds1 must be 1. The file looks like the switch topology file, except that an additional comment is made for each device or link that differs from the operational default status. For the directory that contains the out.top file, see SP Switch2 log and temporary file hierarchy.

These additional comments are appended to the file by the fault service daemon and reflect the current connectivity status of the link or device. No comment on a link or device line means that the link or device exists and is operational. The comment format is:

ideal-topology-line device-status-no which-device:device-status-string (link-status-string) 

where:

Not all the comments reflect an error. Some may be a result of the system configuration or current system administration.

An example of a failing entry and description is in out.top. If the listed recovery actions fail to resolve your problem, contact the IBM Support Center.

The possible device status values for SP Switch2 systems, with their recovery actions, are listed in Table 38. The possible link status values for SP Switch2 systems, with their recovery actions, are listed in Table 39. Additional miswire information can be found in cable_miswire.

Table 38. SP Switch2 device status and recovery actions

Device status number Device status text Explanation and recovery actions
2 Initialized Explanation: Both devices are initialized. The port's link status is not operational.

Cause: The link is faulty.

Action: See Table 39 for link status.

0 Uninitialized Explanation: No device is connected to this port.

Cause: There is no cable connected to this port.

Action: If this is intentional, no action is needed. If not, connect a cable to the port.

-3 The device has been removed from network because of a bad signature Explanation: The device was removed from the switch network - device configuration failure.

Cause: A fault on the device.

Action: Contact IBM Hardware Service.

-4 Device has been removed from network - faulty. Explanation: The device has been removed from the switch network.

Cause: A fault on the device.

Action: If the device in question is a node, see Verify SP Switch2 node operation. Otherwise, contact IBM Hardware Service.

-5 Device has been removed from the network by the system administrator. Explanation: The device was placed offline by the systems administrator (Efence).

Cause: The switch administrator ran the Efence command.

Action: Eunfence the device.

-6 Device has been removed from network - no AUTOJOIN. Explanation: The device was removed and isolated from the switch network.

Cause: The node was fenced with the Efence command without AUTOJOIN, the node was rebooted or powered off, or the node faulted.

Action: First attempt to Eunfence the device. If the node fails to rejoin the switch network, see AIX Error Log information. If the problem persists, contact the IBM Support Center.

-7 Device has been removed from the network for not responding. Explanation: The device was removed from the switch network.

Cause: An attempt was made to contact the device, but the device did not respond.

Action: If the device in question is a node, see Verify SP Switch2 node operation. Otherwise, contact the IBM Support Center.

-8 Device has been removed from the network because of a miswire. Explanation: The device is not cabled properly.

Cause: Either the switch network is miswired, or the frame supervisor tty is not cabled properly.

Action: First view the /var/adm/SPlogs/css[0 | 1]/p0/cable_miswire file. Verify and correct all links listed in the file. Then issue the Estart command. If the problem persists, contact IBM Hardware Service.

-9 Destination not reachable. Explanation: The device was not reachable through the switch network.

Cause: This is generally due to other errors in the switch network fabric.

Action: Investigate and correct the other problems, then run the Estart command.
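
As a quick reference, the device status numbers in Table 38 can be mapped to their status text and a shortened recovery action. The Python sketch below is illustrative only: DEVICE_STATUS and describe_device_status are hypothetical names, the actions are abridged from the table, and the status number is assumed to have been read by hand from a comment in out.top.

```python
# status number -> (status text, abridged recovery action), from Table 38.
DEVICE_STATUS = {
    2: ("Initialized", "See Table 39 for link status."),
    0: ("Uninitialized",
        "If intentional, no action is needed. If not, connect a cable to the port."),
    -3: ("Removed from network because of a bad signature",
         "Contact IBM Hardware Service."),
    -4: ("Removed from network - faulty",
         "If the device is a node, see Verify SP Switch2 node operation. "
         "Otherwise, contact IBM Hardware Service."),
    -5: ("Removed from the network by the system administrator",
         "Eunfence the device."),
    -6: ("Removed from network - no AUTOJOIN",
         "Eunfence the device. If the node fails to rejoin, see AIX Error Log "
         "information. If the problem persists, contact the IBM Support Center."),
    -7: ("Removed from the network for not responding",
         "If the device is a node, see Verify SP Switch2 node operation. "
         "Otherwise, contact the IBM Support Center."),
    -8: ("Removed from the network because of a miswire",
         "Correct all links listed in the cable_miswire file, then issue Estart. "
         "If the problem persists, contact IBM Hardware Service."),
    -9: ("Destination not reachable",
         "Investigate and correct the other problems, then run Estart."),
}


def describe_device_status(number):
    """Return 'text: action' for a Table 38 status number."""
    if number not in DEVICE_STATUS:
        return "Status %d is not listed in Table 38." % number
    text, action = DEVICE_STATUS[number]
    return "%s: %s" % (text, action)
```

A device line with no comment in out.top has no status number to look up, because it exists and is operational.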


Table 39. SP Switch2 link status and recovery actions

Link Status Number Link Status Text Explanation and Recovery Actions
0 Uninitialized Explanation: The link is uninitialized.

Cause: Switch Initialization was not complete.

Action: Try to Estart the switch network again. If the problem persists, contact the IBM Support Center.

-1 The link is not operational - link re-timing Explanation: The link is in the initialization stage.

Cause: If the problem persists, the link may be faulty because of a faulty cable or interposer card.

Action: First attempt to Estart the switch again. If the link does not come up, try switching the cable or connecting a wrap plug to test the interposer card.

-2 Wrap plug is installed. Explanation: This link is connected to a wrap plug.

Cause: The wrap plug is connected to the port in order to test the port. This is not normally a problem.

Action: None.

-3 The link is not operational - link failed to time. Explanation: The link failed to initialize.

Cause: If the problem persists, the link may be faulty because of a faulty cable or interposer card.

Action: First attempt to Estart again. If the link does not come up, try switching the cable or connecting a wrap plug to test the interposer card.

-4 Link has been removed from network or miswired - faulty. Explanation: The link is not operational and was removed from the switch network.

Cause: Either the link is miswired or the link has failed.

Action: First check the /var/adm/SPlogs/css[0 | 1]/p0 directory for the existence of a cable_miswire file. If the file exists, verify and correct all links listed in the file. Then issue the Estart command.

If the cable_miswire file does not exist, examine the /var/adm/SPlogs/css[0 | 1]/p0/flt file for entries relating to this link. If entries are found, verify that the cable is seated at both ends, then run the Estart command. If the problem persists, contact the IBM Support Center.

-5 The link has been removed from network by the system administrator Explanation: The link was removed (commented out) from the switch network by the switch administrator. This is not a problem.
-6 The link has been removed from network - no AUTOJOIN Explanation: The device was removed and isolated from the switch network.

Cause: The node was fenced with the Efence command without AUTOJOIN, the node was rebooted or powered off, or the node faulted.

Action:

  • First attempt to Eunfence the device.
  • If the node fails to rejoin the switch network, see AIX Error Log information.
  • If the problem persists, contact the IBM Support Center.
-7 Link has been removed from network - fenced. Explanation: The device was placed offline by the Systems Administrator (Efence).

Action: Eunfence the associated node.

-8 Link has been removed from network - probable miswire. Explanation: The link is not cabled properly.

Action: View the /var/adm/SPlogs/css[0 | 1]/p0/cable_miswire file. Verify and correct all links listed in the file, then run the Estart command.

-9 Link has been removed from network - not connected. Explanation: The link cannot be reached by the primary node, therefore initialization of the link is not possible.

Cause: This is generally caused by other problems in the switch network, such as a switch chip being disabled.

Action: Investigate and correct the underlying problem, then run the Estart command.
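
The recovery action for link status -4 in Table 39 depends on whether a cable_miswire file exists in the port-level directory. The following Python sketch shows that decision only. The function name first_file_for_faulty_link is hypothetical, and the caller is assumed to pass the port-level directory of the adapter in question, for example /var/adm/SPlogs/css0/p0.

```python
import os


def first_file_for_faulty_link(port_dir):
    """Return the file to examine first for a link with status -4.

    If cable_miswire exists in the port-level directory, it lists the links
    to verify and correct. Otherwise the flt file is examined for entries
    relating to the link.
    """
    miswire = os.path.join(port_dir, "cable_miswire")
    if os.path.exists(miswire):
        return miswire
    return os.path.join(port_dir, "flt")
```

In either case, Table 39 calls for running the Estart command after the cabling has been verified or corrected.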

