You should be referred to this chapter from a PD map or indication. If this is not the case, see Problem determination starting points.
After you read the relevant information in this chapter, return to the PD MAPS (use the back arrow).
PD hints – Implicit Fail Over
(Controller Failure)
The DS400 appliance supports two methods of fail over:
1) Implicit fail over where the fail over process is managed by the appliance itself
2) Explicit fail over where the fail over process is commanded by the host computer.
Implicit fail over occurs when a storage controller crashes. The DS x00 manages implicit fail over by detecting the loss of the heartbeat between the two controllers. In this case, the initiator detects a failure on the primary connections and attempts to direct traffic through the secondary ports. The DS400 automatically transitions the secondary ports to an active state, multiplexes the back end storage, and ensures cache coherency.
Explicit fail over occurs when the initiator detects a connection failure and directly commands the DS x00 appliance to fail over to the other controller. On a single controller unit, the initiator may switch the data path to the second port of the controller (assuming redundant configuration and that the appropriate drivers are installed). Most fabric incidents (e.g., cable failures, switch failures) will result in an explicit fail over.
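The distinction between the two methods can be summarized in a short Python sketch. It is only an illustration of the text above: the fault labels are hypothetical names chosen for this example, not identifiers used by the DS400 or by any host driver.

```python
from enum import Enum


class FailOverMode(Enum):
    IMPLICIT = "implicit"   # managed by the appliance itself
    EXPLICIT = "explicit"   # commanded by the host computer


# Hypothetical fault labels, chosen only for this illustration.
CONTROLLER_FAULTS = {"controller_crash"}
FABRIC_FAULTS = {"cable_failure", "switch_failure"}


def expected_fail_over(fault: str) -> FailOverMode:
    """Return the fail over mode the text associates with a fault."""
    if fault in CONTROLLER_FAULTS:
        return FailOverMode.IMPLICIT
    if fault in FABRIC_FAULTS:
        # The text says "most" fabric incidents, so this is the usual case,
        # not a guarantee.
        return FailOverMode.EXPLICIT
    raise ValueError(f"unknown fault label: {fault!r}")
```

For example, expected_fail_over("cable_failure") returns FailOverMode.EXPLICIT, while expected_fail_over("controller_crash") returns FailOverMode.IMPLICIT.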
Figure 1 shows an implicit fail over in a dual controller configuration.
Figure 1: Controller implicit failover
Note: In a dual controller configuration, the controllers are either active/active or active/passive. However, the active/passive version is actually just the active/active version with all storage configured on one controller.
In normal operation, all storage is "visible" (dashed lines to the HDDs) through all four physical ports of the storage enclosure (assuming dual controllers). However, access to a physical storage volume is only available on the two ports of the respective controller (solid lines to the HDDs).
The DS400 appliance defines two port groups corresponding to the two controllers. Only three states are valid for the port groups: active/optimized, standby and unavailable.
A target port group (controller) is in an active state for a logical unit when it can process I/O requests. A target port group is in a standby state for a logical unit when the corresponding controller does not have access to the back end storage (SCSI HDDs). A target port group is in an unavailable state when the corresponding controller has crashed or is not present.
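The three port group states can be expressed as a small decision function. This is a minimal Python sketch of the rules stated above, not DS400 firmware logic. The function name and its three boolean inputs are assumptions introduced for illustration.

```python
from enum import Enum


class PortGroupState(Enum):
    ACTIVE_OPTIMIZED = "active/optimized"
    STANDBY = "standby"
    UNAVAILABLE = "unavailable"


def port_group_state(controller_present: bool,
                     controller_crashed: bool,
                     has_backend_access: bool) -> PortGroupState:
    """Derive the state of a target port group for one logical unit."""
    # A crashed or absent controller makes its port group unavailable.
    if not controller_present or controller_crashed:
        return PortGroupState.UNAVAILABLE
    # A healthy controller with access to the back end storage can
    # process I/O requests for the logical unit.
    if has_backend_access:
        return PortGroupState.ACTIVE_OPTIMIZED
    # A healthy controller without back end access is on standby.
    return PortGroupState.STANDBY
```

For a logical unit owned by controller B in normal operation, the port group of controller B evaluates to active/optimized and the port group of controller A evaluates to standby.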
Devices (drives)
All drives always appear in the lists of both controller A and controller B. Although drives are physically located on either the A or B side of the enclosure, their location does not indicate controller ownership: a drive installed on either side of the enclosure may be owned by either controller A or controller B.
Arrays
Arrays do indicate ownership, both current and preferred. The preferred ownership is established at array creation and changes only if the array is moved to the other controller via management commands. Preferred ownership does not change on a controller failure, whether a real failure or a manual (controller pull) failure. After a controller failure (real, manual, or commanded), the controller has to be brought back on-line via management tools, i.e., peer enable. This does not happen automatically.
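The difference between current and preferred ownership can be illustrated with a short Python sketch. The class and function names are hypothetical. The sketch assumes that moving an array changes both its current and its preferred owner, and it deliberately does not model what happens after peer enable, because the text does not describe it.

```python
from dataclasses import dataclass


@dataclass
class Array:
    name: str
    preferred_owner: str  # "A" or "B", established at array creation
    current_owner: str    # controller that owns the array right now


def create_array(name: str, owner: str) -> Array:
    """At creation, the preferred and current owner are the same."""
    return Array(name, owner, owner)


def other(controller: str) -> str:
    return "B" if controller == "A" else "A"


def controller_failed(arrays, failed: str) -> None:
    """The surviving controller takes over; preferred ownership is untouched."""
    for array in arrays:
        if array.current_owner == failed:
            array.current_owner = other(failed)


def move_array(array: Array) -> None:
    """A management move is the only operation that changes the preference."""
    target = other(array.current_owner)
    array.preferred_owner = target
    array.current_owner = target
```

With an array created on controller B, calling controller_failed for "B" leaves preferred_owner at "B" while current_owner becomes "A".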
FC failover
WWNs are not failed over; therefore, port or controller failures require host MPIO software to redirect traffic to a remaining port. If a single port fails, the logical drive can be accessed via the other port on the same controller. If both ports fail, the MPIO software has to issue a command to fail that controller before the logical drive can be accessed via the ports on the other controller.
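The path selection described above can be sketched as follows. This is an illustrative Python model, not the logic of any particular MPIO driver. The port names ("A0", "A1", "B0", "B1") and the action labels are assumptions made for this example.

```python
def choose_path(owner, port_up):
    """Pick a port for I/O to a logical drive owned by controller `owner`.

    `owner` is "A" or "B". `port_up` maps hypothetical port names
    ("A0", "A1", "B0", "B1") to True when the host still has a working
    connection to that port. Returns a tuple (action, port).
    """
    # A single port failure: use the other port on the same controller.
    for port in (owner + "0", owner + "1"):
        if port_up.get(port, False):
            return ("use_port", port)

    # Both ports on the owning controller are down: the MPIO software must
    # first command a fail over of that controller, then use a port on the
    # other controller.
    other = "B" if owner == "A" else "A"
    for port in (other + "0", other + "1"):
        if port_up.get(port, False):
            return ("fail_controller_then_use_port", port)

    # No working connection to either controller.
    return ("no_path", None)
```

Because WWNs are not failed over, the sketch never reuses a failed port name on the other controller; it always selects a port that is actually reachable.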
Hot-spares
Hot-spares are assigned to an array. If an array is moved to the other controller (for any reason), the hot-spare moves with it.
Use the ServeRAID Manager to determine which controller has failed. The failing controller is overlaid with a warning or fatal icon and an Event is generated. The information recorded for an event includes a long and a short textual description of the error, along with a severity level, a time stamp and details. Right-click on the flagged controller icon and select Properties to get the information for this controller. Figure 2 shows array SCTRLB owned by controller B (HDD and controller icons are shaded). Figure 3 shows that controller B has failed and that controller A has assumed ownership of all arrays from controller B. The Event log displays the following entry, indicating that the heartbeat between the two controllers has been interrupted:
Network interface link lost on interface1 (Interface1 is the internal Ethernet port that carries the heartbeat)
Figure 3: Controller failure - Array ownership switched to controller A
Steps to identify the failure mode of the controller (DS400)
This is the most important set of logs that you can capture. It will be forwarded to IBM support for in-depth analysis. You can analyze the Event log, the error log and the Controller configuration profile.
From ServeRAID Manager, right-click on the desired enclosure and select "Save support archive". The following files are saved in the ServeRaid Manager folder (typically C:\Program Files\IBM\ServeRaid Manager):
RaidEvt.log - The event log
RaidErr.log - The error log
RaidCfg.log - The configuration profile for that enclosure
diagnostics.tgz - Compressed file containing binary and text files for Engineering analysis
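Before forwarding the logs, it can be useful to confirm that all four files were actually saved. The following Python sketch checks a folder against the file names listed above. The function names are hypothetical, and the folder to check is supplied by the caller, since the install location can differ from the typical one.

```python
from pathlib import Path

# File names as listed in the text above.
EXPECTED_FILES = ("RaidEvt.log", "RaidErr.log", "RaidCfg.log", "diagnostics.tgz")


def missing_files(present_names):
    """Return the expected file names that are absent from present_names.

    The comparison ignores case, since Windows file names are
    case-insensitive.
    """
    present = {name.lower() for name in present_names}
    return [name for name in EXPECTED_FILES if name.lower() not in present]


def missing_in_folder(folder):
    """Run the same check against the files found in a folder on disk."""
    names = (p.name for p in Path(folder).iterdir() if p.is_file())
    return missing_files(names)
```

An empty list from missing_in_folder means that every file named above is present in that folder.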
From the Management Station, collect the following files from the C:\WINDOWS\Temp directory (or C:\winnt\temp):
mgmtservice.log
mgmtservice.log.old (if it exists)
From ServeRAID Manager, click on the Event button on the tool bar. The Event log displays. You can save the log from the dialog "File" menu. The file Event.txt is saved in the ServeRAID install folder (typically C:\Program Files\IBM\ServeRaid Manager). This log contains the same information as RaidEvt.log.
This is the configuration profile for the enclosure.
From the ServeRAID console, right-click on the Host management station and select "Save Printable Configuration". The resulting file, RaidExt1.log, can be found in the ServeRAID install folder, in a sub-folder whose name is derived from the name of the Host. This log is the same as the RaidCfg.log that is saved in the support archive file.
CLI Diagnostic Dump Command
The support archive can also be uploaded to a host using the diagnostic Dump command. See Using the CLI for additional information.