RAID controller offline

 

Overview of controller operation

FC Ports
    WWNN and WWPN assignment convention

Drive Mapping

Failover methods
    Implicit Fail Over (Dual Controller units)
    Explicit Fail Over (Single and Dual Controller units)

Failover/Failback

Controller behavior summary

Controller Failure identification

Gathering Logs
    Support Archive
    Management Station logs
    Event Log
        Controller event logs
    Configuration Profile
    CLI Diagnostic Dump Command (Support Archive)

 

 

FC Ports

 

Each controller has two FC ports, labeled FC0 and FC1. Each of these two ports provides access to the logical drives. On the owning controller, both FC0 and FC1 will return good status for the Test Unit Ready (TUR) command. For the alternate controller, the FC ports will return "path set – passive" (04/0B).

 

You can remove access to a logical drive from any of those ports using the “logical manage” command.  See Using the CLI for additional information.
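
As a hedged sketch only (the prompt format follows the CLI examples later in this document, and the command arguments are intentionally omitted because they vary by configuration; consult Using the CLI for the actual options), the command is entered at the controller CLI prompt in the form:

DS400[A]# logical manage ...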

 

The port name follows the convention shown in Table 1.

Table 1 - WWNN and WWPN assignment convention (Node and Port Name Convention)

                          Enclosure           FC0                 FC1
Node Name                 20000000d1262472    NA                  NA
Controller A Port names   NA                  21000000d1262472    21010000d1262472
Controller B Port names   NA                  21020000d1262472    21030000d1262472

 

Drive Mapping

 

In normal operation, all storage is "visible" (shown in Figure 1 as dashed lines to the HDDs) through all four physical ports of the storage enclosure (assuming dual controllers). On the owning controller (solid lines to the HDDs), both FC0 and FC1 will return good status for the Test Unit Ready (TUR) command. For the alternate controller, the FC ports will return "path set – passive" (04/0B). These statuses indicate to the DSM which path to select as the active path. This information is in turn passed to the MPIO driver, which verifies that it can actively communicate with the logical drive. A change of path as a result of a failover condition is accomplished through a SetNewPath command to the alternate path; this can be initiated by either the initiator or the controller.

It is important to note that when setting up logical drives, a path to both controllers must be present. This can be done by using two HBAs, each with a dedicated path to one controller, or by using a single HBA with a hub or switch attached to both controllers. To ensure that conflicts do not occur, the proper mapping must be created in ServeRAID Manager. This mapping should be done after the DSM has been installed.

 

Failover methods

 

The DS400 Storage Server supports two methods of failover:

1)      Implicit failover, in which the failover process is managed by the enclosure itself.

2)      Explicit failover, in which the failover process is commanded by the host computer.

 

 

Implicit Fail Over (Dual Controller units)

 

Implicit failover occurs when one of the storage controllers crashes. It is managed by the DS400 itself, through the loss of the heartbeat between the two controllers (see Figure 1). In this case, the initiator detects a failure on the primary connections and attempts to direct traffic through the other controller's ports. The DS400 automatically transitions the secondary ports to an active state, multiplexes the back-end storage, and ensures cache coherency.

 

Explicit Fail Over (Single and Dual Controller units)

 

Explicit failover occurs when the initiator detects a connection failure (the initiator can no longer communicate with a controller) and directly commands the DS400 Storage Server to fail over to the other controller. This is done by the initiator's DSM sending a SetTargetPortGroup command to the still-accessible controller. On a single-controller unit, the initiator may instead switch the data path to the second port of the controller. Most fabric incidents (e.g., cable failures, switch failures) will result in an explicit failover.

 

Figure 1: Controller implicit failover

 

 

Note: In a dual-controller configuration, the controllers are either active/active or active/passive. In the active/passive configuration, only one controller at a time can access the physical drives. This is referred to as Asymmetric Logical Unit Access (ALUA).

 

Failover/Failback

When a failover condition occurs, the logical drives go into a transition state while array ownership changes. The DSM monitors the LUNs until they change into either a passive state (04/0B) or an active state (2A/06 or 29/01). Once this occurs, the DSM issues a SetNewPath command to select the currently active path. After a controller failure (whether real, manual, or commanded), the controller has to be brought back online via the management tools, that is, with the peer enable CLI command or through the ServeRAID Manager GUI; it is not automatic.
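
For reference, the CLI method is a single command. The example below is a hedged sketch only: the prompt format and the assumption that the command is entered on the surviving controller (controller A here) are illustrative, so see Using the CLI for the exact syntax on your firmware level.

DS400[A]# peer enable     (re-enables the failed peer controller; controller A is assumed to be the surviving controller)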

 

Controller behavior summary

 

Devices (drives)

All drives always appear in the device lists of both controller A and controller B. The side of the enclosure (A or B) on which a drive is installed does not indicate controller ownership; a drive installed on either side may be owned by either controller.

 

Arrays

Arrays do indicate ownership, both current and preferred. The preferred ownership is established at array creation, and all drives assigned to that array are "masked" from the other controller. Preferred ownership changes only if an array is moved to the other controller via management commands; it does not change on a controller failure, whether a real failure or a manual (controller pull) failure. As noted above, after a controller failure (manual, real, or commanded) the controller has to be brought back online via the management tools (for example, peer enable); it is not automatic.

 

FC failover

WWNs are not failed over. Therefore, port or controller failures require the host DSM/MPIO software to redirect traffic to a remaining port. If a single port fails, the logical drive can be accessed via the other port on the same controller. If both ports fail, the DSM/MPIO software has to issue a command to select the new active path to the surviving controller.

 

Hot-spares

Hot-spares are assigned to an array. If an array is moved to the other controller (for any reason) then the hot-spare moves with it.

Controller Failure identification

Use ServeRAID Manager to determine which controller has failed. The failing controller is overlaid with a warning or fatal icon, and an event is generated. The information recorded for an event includes a long and a short textual description of the error, along with a severity level, a time stamp, and details. Right-click on the flagged controller icon and select Properties to get the information for this controller. Figure 2 shows array SCTRLB owned by controller B (the HDD and controller icons are shaded). Figure 3 shows that controller B has failed and that controller A has assumed ownership of all arrays from controller B. The Event log displays the following information entry, indicating that the heartbeat between the controllers has been interrupted:

Network interface link lost on interface1 (Interface1 is the internal Ethernet port that carries the heartbeat)

Figure 2: Array ownership

 

Figure 3: Controller failure - Array ownership switched to controller A

Steps to identify the failure mode of the controller (DS400)

  1. Check the controller LEDs for any abnormal indications. See Indicator lights and problem indications.
  2. Analyze the Event log.
  3. Make sure that the network is not down. ServeRAID uses out-of-band management of the controllers.
  4. Save the Support Archive file at this point, in case the problem cannot be resolved locally. IBM support will request this file to analyze the failure.
  5. Use the PD maps to assist you in solving the problem.
  6. When the problem is resolved, restart the controller. Make sure the storage can now be accessed from the host.

 

Gathering logs

 

Support Archive

 

This is the most important set of logs that you can capture. It will be forwarded to IBM support for in-depth analysis. From it you can analyze the event log, the error log, and the controller configuration profile.

From ServeRAID Manager, right-click on the desired enclosure and select "Save support archive". The following files are saved in the ServeRAID Manager folder (typically C:\Program Files\IBM\ServeRAID Manager):

RaidEvt.log - The event log

RaidEvtX.log - The controller event log (where X is either A or B)

RaidErr.log - The error log

RaidCfg.log - The configuration profile for the enclosure

diagnostics.tgz - A compressed file containing binary and text files for engineering analysis

 

Note: The support archive may take 5 to 10 minutes to save.

 

Management Station logs

 

From the Management Station, collect the following files from the C:\WINDOWS\Temp directory (or C:\winnt\temp):

mgmtservice.log

mgmtservice.log.old (if it exists)

           

These two files are XML-based files that show all of the events generated as a result of communication between the controller and the Management Agent running in the ServeRAID Manager GUI. Any time communication is lost or critical alerts are generated, the information is stored in these logs.

 

Event Log

 

From ServeRAID Manager, click the Event button on the toolbar. The Event log is displayed, and you can save it from the dialog's File menu. The file Event.txt is saved in the ServeRAID install folder (typically C:\Program Files\IBM\ServeRAID Manager). This log contains all the communication and events generated between the enclosure and the management station.

 

This log contains the same information as RaidEvt.log.

 

Controller event logs

 

These logs (RaidEvtX.log, where X is either A or B) can be viewed in the folder where ServeRAID Manager is installed. They record local RAID events generated by the onboard ServeRAID controller and by the TCP/IP listening service for the management service. From an external storage perspective, they only indicate when connections were established and dropped by the listening service.

 

Configuration Profile

 

This is the configuration profile for the enclosure.

From the ServeRAID console, right-click on the host management station and select "Save Printable Configuration". The resulting file, RaidExt1.log, can be found in the ServeRAID install folder, in a sub-folder whose name is derived from the name of the host.

 

This file contains the ACL (Access Control List) and the logical drive, array, and controller information. A common cause of logical drives not being discovered is an improperly set up ACL; check the logical drive assignment information in the ACL. Another potential cause of undiscovered drives is a LUN failover to the alternate controller (a change in array ownership).

 

This log is the same as the RaidCfg.log that is saved in the Support Archive file.

 

CLI Diagnostic Dump Command

 

The support archive can also be uploaded to a host using the CLI diagnostic dump command. The examples below show the syntax of the command. See Using the CLI for additional information.

dump [1kxmodem] [xmodem]

Sends a diagnostics dump. The diagnostic.bin file is created and sent to the host.

This file is the same file that is generated from ServeRAID Manager when "Save Support Archive" is selected from the controller pop-up menu.

To save the file from a serial port connection (leftmost port) using 1kxmodem, do the following from a CLI prompt:

DS400[X] (diagnostics)# dump 1kxmodem    (where X is controller A or B)

Creating the dump file: 145.10kB

Issue the 1kxmodem command to start the transfer    (at this time, select "Receive" from the terminal application menu, e.g., HyperTerminal)

Diagnostics dump was successfully sent

From a Telnet session using xmodem:

DS400 (diag)# dump xmodem    (press Enter)

Creating the dump file: 145.10kB    (at this time, select "Receive" from the Telnet File menu)

Creating the dump file:      0 B  Diagnostics dump was successfully sent

Note: The file size shown may or may not increment during the transfer.

 
