Servers - High availability using the F/W streaming adapter and the SCSI-2 F/W RAID adapter

Summary
This document outlines IBM recommendations on obtaining high availability of an IBM server in a RAID environment. The IBM SCSI-2 Fast/Wide PCI-Bus RAID Adapter and the IBM Fast/Wide Streaming Adapter/A are covered.

NOTE: This document is intended for use by system administrators.

Date: December 1997

Preventive Measures to Help Obtain High Availability
IBM recommends the following precautions in order to help obtain high availability of the RAID subsystem:

Define a Hot Spare
Defining a hot spare (HSP) drive minimizes the length of time a server operates with degraded performance after a drive becomes defunct. The hot spare also makes the "inconsistent" drive easy to recognize when multiple drives are marked defunct, so that recovery procedures require much less technical expertise. The section below explains this advantage in greater detail:

Hot Spare Advantages
When a drive becomes defunct (DDD), data is no longer written to it, but data continues to be written to the other drives in the array. The DDD drive therefore becomes "inconsistent" with the rest of the drives in the array. When multiple drives appear DDD, the first and most critical task is identifying the "inconsistent" drive correctly. The "inconsistent" drive must be the last drive replaced since it requires rebuilding (and, if truly defective, may need physical replacement). If the "inconsistent" drive is software replaced (see Software Replace vs. Physical Replace) first when a multiple DDD failure occurs, the "inconsistent" data will be used to rebuild another drive. This eventually corrupts the other drives (and data) on the system.
However, when an HSP is defined, you are protected from rebuilding another drive from an "inconsistent" drive. This is because of the way the RAID adapter marks the states of drives. When a system has a defined HSP, as soon as the HSP takes over for the DDD drive, the RAID Adapter marks the DDD drive in its configuration as the HSP drive. The adapter does not visually change the status of the drive to HSP. Yet if you perform a software replace or physical replace, the RAID Adapter starts the drive and changes the DDD state to HSP. The RAID Adapter does not allow this drive to be brought back to ONL status.

When the HSP takes over for the DDD drive, the HSP is rebuilt to replace the DDD drive. During the rebuilding of the HSP drive, it appears in the OFL state. The OFL state changes to ONL once this drive is completely rebuilt and fully operational for the DDD drive. The DDD drive remains DDD.

If an HSP is not defined, or if multiple drives appear DDD before the HSP is completely rebuilt, then this protection does not apply. You must read the RAID log to determine the "inconsistent" drive. Then, for the IBM SCSI-2 F/W PCI-Bus RAID Adapter and the IBM F/W Streaming RAID Adapter/A, you must ensure that the software replace option is selected on each drive bay in the correct order, so that the "inconsistent" drive is brought online last and rebuilt.

If an HSP drive was defined but did not complete the rebuild, then it is much easier to identify the "inconsistent" drive. The "inconsistent" drive will remain in OFL status.

When multiple drives appear defunct, as long as the logical drive is not in the OFL state, the user may select the Replace option to change the state of any of the DDD drives. Order does not matter with logical drives in the CRT state because the "inconsistent" drive will appear as OFL or DDD to the user. If the logical drive is in the OFL state, the user may attempt to recover by identifying the "inconsistent" drive, software replacing all drives except the "inconsistent" drive, and then rebuilding the "inconsistent" drive.
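
The decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name plan_recovery, the bay numbers, and the (action, bay) tuple format are hypothetical and are not part of any IBM utility. The state names (OFL, CRT, DDD) are the ones used in this document.

```python
def plan_recovery(logical_state, ddd_bays, inconsistent_bay=None):
    """Return an ordered list of (action, bay) steps (hypothetical helper)."""
    if logical_state != "OFL":
        # Logical drive is not offline: any DDD drive may be replaced,
        # and the order does not matter.
        return [("replace", bay) for bay in ddd_bays]
    if inconsistent_bay is None:
        raise ValueError("identify the inconsistent drive from the RAID log first")
    # Logical drive is offline: software replace every other drive first,
    # then rebuild the inconsistent drive last.
    others = [bay for bay in ddd_bays if bay != inconsistent_bay]
    return [("replace", bay) for bay in others] + [("rebuild", inconsistent_bay)]
```

The sketch refuses to produce a plan for an offline logical drive until the "inconsistent" drive has been named, which mirrors the requirement that this drive be identified before any drive is software replaced.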

Install and Use NetFinity Manager
You should install NetFinity Manager 5.0 or greater in order to monitor the RAID array remotely. NetFinity Manager can be used to schedule synchronization to occur at any time of the day, so synchronization of the RAID array can be scheduled for non-peak hours and does not require user input to start. With NetFinity services installed at the server, and the NetFinity manager installed on a workstation, the RAID array can be monitored, and even synchronized, from a remote location. The system can also be configured to send alert messages regarding the RAID subsystem over the network to the workstation. You can even set up NetFinity Manager to page someone, e.g., the network administrator or a service technician, if a certain alert condition is reached. NetFinity Manager can also perform many other functions such as monitoring processor utilization, critical file monitoring and detecting installed software across the network. NetFinity is also used to capture PFA alerts from hard files and then send system alerts to the appropriate parties.

In order to use Netfinity 5.0 to schedule data scrubbing, please download NF50RAID.EXE from http://www.us.pc.ibm.com/files.html. This file contains updated Netfinity program files which are required for scheduling data scrubbing on controllers with the write policy set to write-back cache. When installed with the NetFinity Manager code, the following operating systems are affected: OS/2, WINNT, and WIN95.

Data Scrub Drives Weekly
One of the best ways to recognize potential disk problems in advance, and correct them before a failure occurs, is Data Scrubbing. Sector media errors can be identified and corrected simply by forcing all data sectors in the array to be accessed through Data Scrubbing. Data Scrubbing checks all data sectors in the array and should be performed weekly. With the IBM F/W Streaming RAID Adapter/A and the IBM SCSI-2 F/W PCI-Bus RAID Adapter, an easy process used to accomplish Data Scrubbing is synchronization. Data Scrubbing forces all sectors of the drives contained in the array to be read in the background while allowing concurrent user disk activity. NetFinity Manager 5.0 can be obtained at no additional charge by customers that have an IBM server that ships with ServerGuide.

Apply All Updates
You should apply all updates regarding RAID. Check the IBM Server web site at: http://www.us.pc.ibm.com/server/server.html or call the HelpCenter for up-to-date information.

Install and Use RAID Administration and Monitoring Utilities
The RAID administration utility alerts the user via the speaker and display if a drive becomes DDD or if a Predictive Failure Analysis (PFA) alert occurs. PFA support on disk drives recognizes potentially bad drives, and alerts systems administrators allowing them to replace the unit before a catastrophic drive failure. The PFA alert prompts you to replace the drive before actual failure, so that a HSP is always present.

The RAID administration utility monitors RAID operations, displaying results on the RAID Administration Screen and logging these RAID events to a file. You can specify whether you want the utility to save the file to a diskette drive, local hard drive, or network hard drive; however, the recommended policy is to use the diskette drive or network drive. This practice makes it easier to recover from situations where the operating system is not accessible due to the failure. The logs themselves are required to recover data from systems when multiple DDD drives occur. The logs also provide essential RAID history for that server when troubleshooting and isolating the defective part in cases where it is not the drive that is defective.

Ensure Current Backup of RAID Configuration is Available
You should always have a current backup of the RAID configuration; anytime the array changes, you should make another backup. To create this backup, select Backup Config. to Diskette under Advanced Functions on the Main Menu of the RAID Configuration Diskette. You are prompted to enter a filename; the default is CONFIG. IBM recommends that you provide a unique name and back up to a different diskette each time. A unique name ensures that a good backup is not inadvertently overwritten, and a different diskette allows you to write-protect the diskette and keep it in a safe place. NetFinity 4.0 or above also allows you to back up the configuration under the RAID manager.

Have a RAID Configuration Utility Diskette Available
Having a copy of the RAID Configuration Utility Diskette is crucial when working on a RAID system. Ensure that you always have a RAID Configuration Utility Diskette available in close proximity to all RAID systems. Due to possible changes of drive states, the backup RAID configuration stored on the diskette may differ from the current working RAID configuration.

Recovery Procedures for DDD Drives
This section provides you with procedures for recovering from many different DDD scenarios. Topics include:

Software Replace vs. Physical Replace
For the IBM SCSI-2 F/W PCI-Bus RAID Adapter and the IBM F/W Streaming RAID Adapter/A, you perform drive replacement via the RAID Configuration Utility. To begin, select Replace Drive and Rebuild Drive options under Replace/Rebuild on the RAID Configuration Main Menu. With this action, the RAID Adapter sends a start unit command to the drive. Once the drive starts successfully, the drive state changes from defunct (DDD) to either hot-spare (HSP) or offline (OFL). The drive state is HSP if an HSP has already taken over for this DDD drive. The drive state is OFL if no HSP drive was present when the drive went DDD. The logical drive will be in a critical state and a rebuild is necessary to bring this drive into the array as online (ONL). Once the rebuild completes successfully, the logical drive indicates OKY status.
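
The state changes described above can be summarized as a small sketch. The function names and the boolean argument are hypothetical and exist only for illustration; the state abbreviations (DDD, HSP, OFL, ONL) are the ones used in this document, and the sketch covers only the transitions stated in the paragraph above.

```python
def state_after_software_replace(current_state, hsp_took_over):
    """Drive state once the start unit command succeeds (illustrative only)."""
    if current_state != "DDD":
        raise ValueError("only a DDD drive is software replaced")
    # HSP if a hot spare already took over for this drive, otherwise OFL.
    return "HSP" if hsp_took_over else "OFL"


def state_after_rebuild(current_state):
    """Drive state after a successful rebuild (illustrative only)."""
    if current_state != "OFL":
        raise ValueError("only an OFL drive is rebuilt into the array")
    return "ONL"
```

For example, a DDD drive in a system with no hot spare would pass through OFL and reach ONL only after the rebuild completes.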

Software Replace vs. Physical Replace
When the RAID Adapter communicates with the hard file and receives an unexpected response, the adapter marks the drive defunct in order to avoid any potential data loss. For example, this could occur in the event of a power loss to any of the components in the SCSI RAID subsystem. In this case, the RAID adapter errs on the side of safety and no longer writes to that drive, although the drive may not be defective in any way.

Different circumstances warrant either a software replace or a physical replace, as discussed in the following bullets:
- Using a software replace is recommended to try to recover data when multiple DDD drives occur. In this situation, you may lose data on drives that are not actually defective if you run a normal rebuild process.

WARNING: IF YOU USE THE WRONG ORDER WHEN YOU ATTEMPT A SOFTWARE REPLACE, DATA CORRUPTION RESULTS.

Using and Understanding the RAID Administration Log
Being able to read the RAID log produced by the RAID administration and monitoring utilities is a very important part of recovering an array when one or more drives are marked DDD. From the RAID log, you can determine in what order drives went DDD, and, if multiple drives are DDD, which one is the "inconsistent" drive. The RAID log is created by either running the RAID Administration program or Netfinity RAID Manager. RAID Administration Program can be obtained from the Configuration Diskette which contains the device drivers under the specific operating system subdirectory. The diskette is available on the IBM website http://www.us.pc.ibm.com/files.html. Search on "RAID." The following is an excerpt from a RAID log created by the RAID administration utility:

RAID Log

28 January 1997, 11:23:38

28 January 1997, 13:03:30

Adp 0: Drv at ch 1 bay 2 is defunct.

28 January 1997, 13:03:40

Adp 0: Drv at ch 1 bay 2 is not auto replaced.

The first two lines of the RAID log show that the drive in bay 5 was marked DDD and auto replaced by the HSP drive in bay 2. At a later point in time after the rebuild to bay 2 was successful, bay 2 was marked DDD. Because there was no HSP drive defined (bay 5 had been neither physically nor software replaced, so it was still DDD), bay 2 was not auto replaced, so the array remains in the critical state until a replacement drive is added. Using the time stamps on this RAID log, you can tell the exact times the apparent drive failures occurred. You can use this information to rebuild the array properly when multiple DDD drives occur at the same time.

In the current status interpreted by the RAID log, the drive in bay 2 is the "inconsistent" drive, and you must physically replace it. If more drives are DDD but not listed in the RAID log because the server has trapped (OS/2 or NT) or the volume was dismounted (NetWare), then you need to software replace those drives before replacing the drive in bay 2, because the other drives contain the correct information to rebuild the "inconsistent" drive, assuming no other error has arisen on those drives.
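
Reading the order of defunct events out of a log like the excerpt above can be sketched as follows. This is a minimal illustration, not an IBM tool: it assumes the log alternates time stamp lines and message lines exactly as in the excerpt, that month names are in English, and that defunct messages have the form "Adp N: Drv at ch N bay N is defunct." The function and constant names are hypothetical.

```python
import re
from datetime import datetime

# Assumed formats, taken only from the excerpt shown in this document.
TIMESTAMP_FORMAT = "%d %B %Y, %H:%M:%S"
DEFUNCT_PATTERN = re.compile(r"Adp (\d+): Drv at ch (\d+) bay (\d+) is defunct\.")


def defunct_events(log_text):
    """Return (time stamp, adapter, channel, bay) tuples in log order."""
    events = []
    stamp = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = DEFUNCT_PATTERN.fullmatch(line)
        if match:
            adapter, channel, bay = (int(group) for group in match.groups())
            events.append((stamp, adapter, channel, bay))
            continue
        try:
            # Remember the most recent time stamp for the next message.
            stamp = datetime.strptime(line, TIMESTAMP_FORMAT)
        except ValueError:
            pass  # Some other message type; ignored in this sketch.
    return events
```

Applied to the excerpt above, the sketch yields a single defunct event, for adapter 0, channel 1, bay 2. Drives that went DDD without being logged do not appear in the result and still have to be found by comparing the list against the current drive states.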

Before you perform any actions on the hardware, use NetFinity, the RAID administration program, or the RAID configuration program to fill in the attached template at the end of this document with the current status of all the drives, both internal and external. This template provides a three-channel diagram to accommodate all types of IBM RAID Adapters.

For the F/W Streaming RAID Adapter/A and SCSI-2 Fast/Wide PCI-Bus RAID Adapter, if power is lost or another drive is marked DDD during a rebuild operation, the rebuild fails and the drive being rebuilt remains in the OFL state. If you are working with systems that have these adapters, do not perform any operations on the OFL drive until all other DDD drives are changed back to either ONL or HSP. This is because the OFL drive is "inconsistent" with the rest of the array and requires a rebuild operation. If you do not rebuild the drive, then data will be corrupted. If you accidentally select an OFL drive to rebuild while other drives in the array besides the HSP are DDD, then the rebuild fails and the OFL drive becomes DDD. In a case such as this, if you have not noted which drive was OFL, then you are no longer able to tell which drive was the original OFL or "inconsistent" drive. The best way to ensure data is rebuilt successfully is to perform the following two steps:

1. Do not perform any operations on an OFL drive until all DDD drives have changed back to either ONL or HSP.

2. Write down which drive is OFL so that you have a note of the "inconsistent" drive. This ensures that you will be able to determine the "inconsistent" drive in case you inadvertently cause it to go DDD.
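
The two steps above amount to a simple check before any operation on an OFL drive. The sketch below is illustrative only; the function names and the bay-to-state dictionary are hypothetical, and the states are the abbreviations used in this document.

```python
def may_touch_ofl_drive(drive_states):
    """True only when no drive is still DDD (step 1 above)."""
    return "DDD" not in drive_states.values()


def note_inconsistent_bays(drive_states):
    """List the bays currently OFL so they can be written down (step 2 above)."""
    return sorted(bay for bay, state in drive_states.items() if state == "OFL")
```

With the states {1: "ONL", 2: "OFL", 3: "DDD"}, the first function returns False because bay 3 is still DDD, and the second reports bay 2 as the drive to record.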

First Actions to be Performed On Service Call with DDD Drives
1. Pull the RAID Administration Log created by the RAID Administration Program or Netfinity Manager. The RAID Administration Program can be obtained from the Configuration Diskette, which contains the device drivers under the specific operating system subdirectory. The diskette is available on the IBM website http://www.us.pc.ibm.com/files.html. Search on "RAID."

2. From reading the RAID Administration Log or Netfinity Manager log, determine whether an HSP drive was present in the system or not. The log will indicate that a drive in a specific bay went DDD. Then, if a hot spare was present, it will indicate that a drive in a specific bay auto-replaced the DDD drive bay.

3. View the Drive Information under Options in the RAID Administration Program or under Netfinity RAID Manager to determine if any errors were recorded against the DDD drive.

NOTE: Do not reboot the system because these error counters initialize to zero when the system is rebooted.

Hard Errors - The number of SCSI I/O processor errors that occurred on the drive since the Device Error Table was last cleared. It also indicates if the drive exceeded Predictive Failure Analysis (PFA) threshold.

Action:
Contact your support representative for further problem determination.

Soft Errors - The number of SCSI Check Condition status messages returned from the Drive (except Unit Attention) since the Device Error Table was last cleared.

Action :

Miscellaneous Errors - The number of other errors (such as selection timeout, unexpected bus free, or SCSI phase error) that occur on the drive since the Device Error Table was last cleared.

Action:
Ensure cabling and connectors are seated properly. If the drives attach to a backplane, ensure the backplane is not bowed, which causes a poor drive connection. If there are no problems with the cable, backplane, etc., determine whether an HSP drive is present or not and follow the appropriate Recovery Procedures listed below, but do not software replace the drive. Physically replace the drive.

Parity Errors - The number of parity errors that occurred on the SCSI bus since the Device Error Table was last cleared.

Action:
Check that the SCSI bus is properly terminated, with only one active terminator placed at the end of the SCSI chain. If a backplane is the last device on the chain, then the backplane terminates the bus as long as no cable is plugged into the daisy-chained connector on the backplane.

PFA Error - Predictive Failure Analysis

Action:
Determine whether an HSP drive is present or not and follow the appropriate Recovery Procedures listed below, but do not software replace the drive. Physically replace the drive.
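
The error categories and actions above can be kept as a simple lookup when reviewing Drive Information. This is a hedged sketch: the dictionary keys, the function name, and the shortened action wording are hypothetical paraphrases of the text above, not output of any IBM utility. No action is given above for Soft Errors, so the sketch records none.

```python
# Paraphrased from the descriptions above; keys are hypothetical labels.
ACTIONS = {
    "hard": "Contact your support representative for further problem determination.",
    "soft": None,  # No action is listed for Soft Errors in this document.
    "misc": "Check cabling, connectors and backplane; then physically replace the drive.",
    "parity": "Check that the SCSI bus has one active terminator at the end of the chain.",
    "pfa": "Physically replace the drive; do not software replace it.",
}


def suggested_actions(counters):
    """Return (category, action) pairs for every non-zero counter with an action."""
    return [
        (name, ACTIONS[name])
        for name, count in counters.items()
        if count > 0 and ACTIONS.get(name)
    ]
```

Because the counters are reset when the system is rebooted, a sketch like this is only useful on values read before any reboot.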

Recovery Procedures When HSP is Present at Time of Failure

The following instructions apply to the IBM SCSI-2 Fast/Wide PCI-Bus RAID Adapter and IBM Fast/Wide Streaming Adapter/A.

One DDD Drive, No OFL
Follow the steps below to bring the DDD drive back to HSP status if the following items are true:

Once you verify the conditions above through either the RAID administration log or the RAID administration utility, perform the following steps to bring the DDD drive back to HSP status.

1. Physically replace the hard drive in the DDD bay with a new one of the same capacity or greater.

2. With a RAID-1 or RAID-5 array, the operating system is still functional at this point. Use either NetFinity or the RAID administration utility to bring the drive back to HSP status. With the RAID administration utility, open the options menu and select Replace Drive.

3. When you see the prompt to select the DDD drive, highlight the drive you just replaced and press Enter.

4. The RAID adapter issues a start unit command to the drive. Once the drive successfully spins up, the RAID adapter changes the drive's status to HSP and saves the new configuration.

5. If you see an "Error in starting drive" message, reinsert cables, the hard drive, etc., to verify these are connected properly, then go to step 2. If the error persists, go to step 1.

6. If the error still occurs with a known good hard drive, then troubleshoot to determine the defective part, which may be a cable, back plane, RAID adapter, etc. Once you have replaced the defective part so that there is a good connection between RAID adapter and hard drive, go to step 2.

Two DDD Drives, No OFL
If the system has two DDD drives, and a defined hot spare existed prior to the drive failures, then the system should still be up and running as long as the logical drives are configured as RAID-5 or RAID-1. If the system is still running, then one of the DDD drives becomes HSP when you replace it. Perform the following steps to bring the logical drive back to ONL status. (Because the operating system is functional, this procedure assumes you are using the RAID administration utility within the operating system to recover.):

1. Physically replace both drives that are marked DDD.

2. Once you replace both drives, select the options menu of the RAID administration utility. Choose Replace Drive, highlight the first DDD drive, and press Enter. You receive a message confirming that the drive is starting. After that, one of two things happens:

You can check which one occurs by viewing the RAID log.

3. Repeat step 2 for the second DDD drive.

More than 2 DDD Drives, No OFL
In this scenario, the operating system is no longer functional. Therefore, you must boot to the RAID Option Diskette to recover the array. It is extremely important to confirm that either the RAID administration utility or NetFinity Manager has been running prior to the drives being marked defunct. If so, the utility or NetFinity Manager has logged the sequence of DDD events to a log file either on a diskette or on a local or network drive. With this file, you can view the log file on another machine to determine the "inconsistent" drive. When you know which drive is "inconsistent", you can attempt to recover data.

NOTE: The previous paragraph states "attempt to recover" because once you lose more than one drive in a set of RAID-5 or RAID-1 logical drives, loss of data is definitely a possibility. The steps below guide you through a recovery, if at all possible.

1. View the RAID log on another machine and write down the order in which the drives went defunct.

2. Boot to the RAID configuration diskette, and select View Configuration. Make sure that the template contains the correct information for the status of all drives, not just those listed in the RAID log.

3. Using the RAID configuration utility, select Replace Drive and choose a DDD drive that is not listed in the RAID log. Repeat this step until the only DDD drives remaining are those indicated in the RAID log file.

NOTE: The drives marked DDD that are not listed in the RAID log are the last ones to go defunct. You must recover these drives first so that the information from them can be used to rebuild the original drive that failed (the "inconsistent" drive). If you do not replace the "inconsistent" drive last, then the system uses it to rebuild the last drive that went defunct, resulting in corrupted data. Therefore, it is extremely important to perform step 3 carefully.

4. Select Replace Drive and then select the last drive to go defunct according to the log file. Repeat this step until you have replaced all drives in the correct order. One of the drives should appear as OFL and one should appear as HSP; the rest appear as ONL.

5. Select Rebuild and highlight the DDD drive.

6. If the rebuild completes successfully, reboot to the operating system. If it does not complete successfully, go to step 7.
At this point, run non-destructive RAID diagnostics individually on each drive. Run these diagnostics individually to ensure that you do not get more than one drive that goes defunct at a time. If a drive does go DDD, physically replace that drive and run a replace/rebuild procedure. This verifies that you remove all defective drives from the system, if any exist.

7. If the rebuild process fails, then perform these steps:

a. Exit to the RAID Main Menu.

b. Select Drive Information and view the error counts for each of the hard drives to determine which drive has errors.

c. If the errors occurred on the drive being rebuilt, then physically replace this drive. Select Replace. The status of the drive changes from DDD to OFL. Attempt the rebuild process again. If it completes successfully, go to Step 6.

If the drive still fails the rebuild process, then verify that the drives being rebuilt from do not have any errors. If they have no errors, then you should be able to rebuild the data. Check cable connections to the drive being rebuilt - it is possible that you replaced a defective drive with another defective drive.

When errors occur on the drives that you are rebuilding from, the adapter continues to rebuild all information except that contained in the unrecoverable defective sector. If the unrecoverable sector was in the data area of the disk, then naturally some data has been lost. There is no method at this time for determining whether the errors are in a data or non-data area of the disk. Users must inspect their personal files to determine this.

To recover the portion of the data that was rebuilt, perform the following steps after the "Rebuild Failure" message:

1. If a backup configuration is available, restore the backup configuration.

2. If a backup configuration is not available, write down the information you can retrieve by selecting the View Configuration option. Delete the array and manually create it to match this configuration information. Perform this step carefully, for if you deviate in any way from the original configuration, then you will lose all data. NOTE: Do not Initialize this logical drive.

3. Have all users verify their personal files to ensure their data is good. Keep in mind that some files may be corrupt due to rebuild errors.
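
The replace ordering required by steps 3 and 4 of the procedure above (DDD drives not listed in the RAID log first, then the logged drives starting with the last one to go defunct) can be sketched as below. The function name and arguments are hypothetical, and the sketch assumes the logged bays are supplied in the order in which they went defunct.

```python
def replace_order(ddd_bays, logged_bays_in_failure_order):
    """Return the bays in the order they should be software replaced."""
    logged = list(logged_bays_in_failure_order)
    # Drives missing from the log were the last to go defunct: replace them first.
    unlogged = [bay for bay in ddd_bays if bay not in logged]
    # Then work backwards through the log, so the first drive to fail comes last.
    logged_last_first = [bay for bay in reversed(logged) if bay in ddd_bays]
    return unlogged + logged_last_first
```

For DDD bays 1 through 4 with a log showing bay 3 failing before bay 1, the sketch returns bays 2 and 4 first, then bay 1, and bay 3 last.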

One or More DDD Drives, and One OFL Drive
Follow the same basic steps as those listed in the above section to recover your data. When a drive is marked OFL, that means that it is spinning but "inconsistent" with the rest of the array. Usually when a drive is marked OFL, the data on it is being rebuilt from the remaining drives in the array. If the server loses power, or if another drive goes DDD during a rebuild, then the drive being rebuilt remains OFL. In this case, you have to boot the machine to the RAID Configuration Diskette and then follow the procedure in the previous section. Make sure that the OFL drive is the last drive to be software replaced. The offline drive is the "inconsistent" drive, and it requires a rebuilding process.

NOTE: Data corruption occurs if the OFL drive is used to rebuild another drive.

Recovery Procedures When HSP is not Present at Time of Failure
For the IBM SCSI-2 Fast/Wide PCI-Bus RAID Adapter and IBM Fast/Wide Streaming Adapter/A, use the following instructions.

One DDD Drive, No OFL
Follow these steps to bring the DDD drive back to the ONL state if the following items are true:

Once the conditions above are verified through either the RAID administration log or the RAID administration utility, perform the following steps to bring the DDD drive back to ONL status.

1. If the drive has never been marked DDD, proceed to step 3 to software replace the drive using the RAID Administration Program or Netfinity RAID Manager.

NOTE: Refer to the "Software Replace vs. Physical Replace" section of this paper to understand the differences between software and physical replacement.

2. If the drive has been marked DDD before, proceed to step 7.

3. With a RAID-1 or RAID-5 array, the operating system will be functional. Use either NetFinity or the RAID administration utility within the operating system to bring the drive back to ONL status. With the RAID administration utility, open the Options menu and select Rebuild Drive.

4. When you see the prompt to select the DDD drive, highlight the drive you just replaced and press Enter.

5. The RAID adapter issues a start unit command to the drive. You receive a message confirming that the drive is starting. The drive then begins the rebuild process. Once the drive completes this process, the drive's status changes to ONL.

6. If you see an "Error in starting drive" message, reinsert the cables, hard drive, etc., to verify there is a good connection, then go to step 3. If the error persists, go to step 7.

7. Physically replace the hard drive in the DDD bay with a new one of the same or greater capacity and go to step 3.

8. If the error still occurs with a known good hard file, then troubleshoot to determine if the cable, back plane, RAID adapter, etc., is defective.

NOTE: RAID Adapter should not be replaced unless Hard Errors are reported under Drive Information with RAID Administration Options Menu or Netfinity RAID Manager.

Once you have replaced the defective part so that there is a good connection between the adapter and hard drive, go to step 3.

Two DDD Drives, No OFL
In this case, with no hot spare drive defined, the server has more than likely trapped (under OS/2 and NT), or the volume has been dismounted (under NetWare). To attempt to resolve this scenario, you must examine the RAID log generated by the RAID Administration Utility and follow the steps below:

1. Boot to the RAID configuration utility for your RAID adapter.

2. Select Replace Drive. Highlight the drive marked DDD last by the RAID adapter and press enter. The drive spins up and changes from DDD to ONL status.

WARNING: IF YOU USE THE WRONG ORDER WHEN YOU SELECT SET DEVICE STATE TO CHANGE DRIVE'S STATE TO ONL, DATA CORRUPTION RESULTS. SEE NOTE BELOW TO DETERMINE LAST DRIVE MARKED DDD BY THE RAID ADAPTER

NOTE: Refer to "Using and Understanding the RAID Administration Log" section of this document, for details on obtaining and interpreting the RAID log. If only one drive is recorded in the RAID log because the RAID adapter was not able to log the defunct drive before the operating system went down, then the last drive that went defunct is the drive that is not recorded in the RAID log. If two drives are recorded in the RAID log, then the last drive to go defunct is the second drive listed in the log - the drive with the most recent time stamp.

3. If the drive has been marked DDD before, proceed to step 9.

4. Proceed to step 5 to software replace the remaining DDD drive using the RAID Administration Program or Netfinity RAID Manager.

NOTE: Refer to "Software Replace vs. Physical Replace" section of this paper to understand differences between software and physical replacement

5. With a RAID-1 or RAID-5 array, the operating system will be functional. Use either NetFinity or the RAID administration utility within the operating system to bring the drive back to ONL status. With the RAID administration utility, open the Options menu and select Rebuild Drive.

6. When you see the prompt to select the DDD drive, highlight the drive you just replaced and press Enter.

7. The RAID adapter issues a start unit command to the drive. You receive a message confirming that the drive is starting. The drive then begins the rebuild process. Once the drive completes this process, the drive's status changes to ONL.

8. If you see an "Error in starting drive" message, reinsert the cables, hard drive, etc., to verify there is a good connection, then go to step 5. If the error persists, go to step 9.

9. Physically replace the hard drive in the DDD bay with a new one of the same or greater capacity and go to step 5.

10. If the error still occurs with a known good hard file, then troubleshoot to determine if the cable, back plane, RAID adapter, etc., is defective.

NOTE: RAID Adapter should not be replaced unless Hard Errors are reported under Drive Information with RAID Administration Options Menu or Netfinity RAID Manager.

Once you have replaced the defective part so that there is a good connection between the adapter and hard drive, go to step 5.

11. If software replacement brings all drives back ONL and makes the system operational, carefully inspect all cables and connections to ensure that no cable or backplane is defective. Check all backplane connectors and ensure that the backplane is not bowed. When multiple drives are marked defunct, it is often the communication channel (cable or backplane) that is the cause of the failure. If the backplane is bowed, drives and backplane connectors may not seat properly, resulting in a bad connection. Also, with hot-swap drives that are removed frequently, connectors could become damaged if too much force is exerted.

12. If the rebuild completes successfully, then perform the following steps to ensure that all drives are good:
Run non-destructive RAID diagnostics individually on each drive. Run the diagnostics individually to ensure that you do not have more than one drive that can become defunct at a time. If a drive does become DDD, physically replace that drive and run a rebuild process on the new drive. This verifies that all defective drives are removed from the system, if any exist.

If the REBUILD process fails, then perform the following steps:

a. Exit to the RAID Main Menu.
b. Select Drive Information and view the error counters for each of the hard files to find out which drive had errors. Refer to "First Actions to be Performed on Service Call With DDD Drives" for descriptions of the various errors and the appropriate action.
c. If the errors occur on the drive being rebuilt, then physically replace this drive and select Rebuild again. The drive's status changes from DDD to RBL and the rebuild process begins. If this process completes successfully, go to Step 5.

If the rebuild still fails, verify that the drives used as the source of the rebuild do not have any errors. If they have no errors, then you should be able to rebuild the data. Check the cable connections to the drive being rebuilt. It is also possible that you replaced a defective drive with another defective drive.

- When errors occur on the drives that you are rebuilding from, the adapter continues to rebuild all information except that contained in the unrecoverable defective sector. If the unrecoverable sector was in the data area of the disk, then naturally some data has been lost. There is no method at this time for determining whether the errors are in a data or non-data area of the disk. Users must inspect their personal files to determine this.

To recover the portion of the data that was rebuilt, perform the following steps after the "Rebuild Failure" message:

1. If a backup configuration is available, restore the backup configuration.

2. If a backup configuration is not available, write down the information you can retrieve by selecting the View Configuration option. Delete the array and manually create it to match this configuration information. Perform this step carefully, for if you deviate in any way from the original configuration, then you will lose all data. NOTE: Do not Initialize this logical drive.

3. Have all users verify their personal files to ensure their data is good. Keep in mind that some files may be corrupt due to rebuild errors.
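
The rebuild-failure checks described above can be summarized as a small decision routine. The following Python fragment is only a sketch: the function name, the result constants, and the dictionary of error counters (as read by hand from Drive Information) are hypothetical and are not part of any IBM utility.

```python
from typing import Dict

# Hypothetical result codes for this sketch.
REPLACE_REBUILT_DRIVE = "replace_rebuilt_drive"
PARTIAL_DATA_RECOVERY = "partial_data_recovery"
CHECK_CONNECTIONS = "check_connections"


def triage_rebuild_failure(error_counts: Dict[str, int], rebuilt_drive: str) -> str:
    """Suggest the next action after a failed rebuild.

    error_counts maps a drive label to the error count noted from
    Drive Information; rebuilt_drive is the drive that was being rebuilt.
    """
    # Errors on the drive being rebuilt: physically replace it and rebuild again.
    if error_counts.get(rebuilt_drive, 0) > 0:
        return REPLACE_REBUILT_DRIVE
    # Errors on a source drive: only part of the data can be rebuilt.
    if any(n > 0 for drive, n in error_counts.items() if drive != rebuilt_drive):
        return PARTIAL_DATA_RECOVERY
    # No errors anywhere: suspect the cabling or the replacement drive itself.
    return CHECK_CONNECTIONS
```

For example, triage_rebuild_failure({"ch1-1": 0, "ch1-2": 4}, "ch1-1") returns PARTIAL_DATA_RECOVERY, which corresponds to the recovery steps listed above.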

More than 2 DDD Drives, No OFL
To attempt to recover, perform the following:
1. View the RAID log and write down the order in which the drives went defunct.

2. Boot to the RAID Configuration Diskette and select View Configuration. Make sure that the template contains the correct information for the status of all drives, not just those listed in the RAID log.

3. Using the RAID configuration utility, select Replace Drive and choose a DDD drive not listed in the RAID log. Change the state of this drive to ONL. Perform this step until the only DDD drives remaining are those indicated in the RAID log.

WARNING: IF YOU USE THE WRONG ORDER WHEN YOU SELECT SET DEVICE STATE TO CHANGE DRIVE STATES TO ONL, DATA CORRUPTION RESULTS. ENSURE THAT YOU CHANGE TO ONL ONLY THE DEVICE STATES OF DRIVES NOT LISTED AS DDD IN THE RAID LOG. THE FIRST DRIVE THAT WENT DEFUNCT REQUIRES REBUILDING, SO IT MUST BE REPLACED LAST.

NOTE: Refer to the "Using and Understanding the RAID Administration Log" section of this document for details on obtaining and interpreting the RAID log. Refer to the "Software Replace vs. Physical Replace" section of this document to understand the differences between software and physical replacement.

4. Follow the same procedure used to recover from two DDD drives, as outlined in the previous section.
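
The ordering rule in the steps above can be illustrated with a short planning routine. This Python fragment is a sketch only: the function name, the drive labels, and the input structures are hypothetical, and the information must still be gathered by hand from the RAID log and View Configuration. The source states only that the first drive to go defunct must be handled last; processing the remaining logged drives in reverse order of failure is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def plan_multi_ddd_recovery(
    log_defunct_order: List[str],
    drive_states: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Return (set_online_first, logged_drives_in_handling_order).

    log_defunct_order: drive labels in the order the RAID log shows
        them going defunct (first defunct drive first).
    drive_states: current state of every drive, e.g. {"ch1-2": "DDD"}.
    """
    logged = set(log_defunct_order)
    # Step 3: DDD drives that the RAID log does not list are set to ONL first.
    set_online_first = sorted(
        drive for drive, state in drive_states.items()
        if state == "DDD" and drive not in logged
    )
    # The first drive that went defunct is the inconsistent one, so it is
    # placed last. Reversing the rest of the log order is an assumption.
    still_ddd = [d for d in log_defunct_order if drive_states.get(d) == "DDD"]
    return set_online_first, list(reversed(still_ddd))
```

With a log order of ["ch1-1", "ch2-0"] and four DDD drives in the configuration, the routine lists the two unlogged drives for software replacement first and places "ch1-1", the first drive to go defunct, at the end of the second list.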

Recovery from RAID Adapter Failure
When a RAID adapter failure occurs, you must replace the RAID adapter and then place the new RAID configuration onto the RAID adapter. For the IBM SCSI-2 Fast/Wide PCI-Bus RAID Adapter and IBM Fast/Wide Streaming Adapter/A, there are two ways to restore the RAID configuration:

1. If you have the most recent backup of the current RAID configuration, then perform the following steps:

2. If no backup of the current RAID configuration is available, then perform one of the following steps:

NOTE: Do not Initialize this logical drive.

NOTE: When you have a defined hot spare and the RAID log is not available, remember that the hot spare becomes part of the array as soon as the first drive is marked defunct. The initial drive that went defunct is DDD in the configuration and is no longer part of the array. However, the hot-spare drive, until it is completely rebuilt, is marked as write only in the configuration. If the configuration is lost, then you must remember that the hot spare may or may not have completed rebuilding. Therefore, take this into account when replacing RAID adapters where the NVRAM is also corrupted, the known state of the array is uncertain, the RAID log is not available, drives are DDD, or a hot spare was defined.

1. Manually define the array according to your best estimate, including the original HSP drive as part of the array. You include the HSP drive because other drives were defunct besides the HSP. Therefore, the HSP has most likely taken over for the original drive.

2. Before booting to the configuration, pull the original HSP drive and mark it as defunct. This ensures the logical drive is running in the CRT state, which in turn avoids problems if the HSP had not completed rebuilding.

NOTE: The information above is to help guide you to make the best choices when servicing RAID problems. However, there will be times when data is not recoverable.

Drive Template
As mentioned in the section titled "Using and Understanding the RAID Administration Log," you may find this template useful to record the status of drives as you begin the troubleshooting process.

Channel 1	Channel 2	Channel 3
____	____	____
____	____	____
____	____	____
____	____	____
____	____	____
____	____	____
____	____	____

Definitions

Array
In the RAID environment, data is striped across multiple physical hard drives. The array is defined as the set of hard drives included in the data striping. The largest number of physical hard drives that you can define in one array is eight.

Data Scrubbing
Data Scrubbing forces all data sectors in a logical drive to be accessed so that sector media errors are identified and corrected at the disk level using disk ECC information if possible, or at the array level using RAID parity information if necessary. For a high level of data protection, Data Scrubbing should be performed weekly.

Logical Drive
The array specifies which drives should be included in the striping of data. Each array is subdivided into one or more logical drives. The logical drives specify the following:

RAID-0
RAID level 0 stripes the data across all of the drives of the array. RAID-0 offers substantial speed enhancement, but provides for no data redundancy. Therefore, a defective hard disk within the array results in loss of data in the logical drive assigned level 0, but only in that logical drive.

RAID-1
RAID level 1 provides an enhanced feature for disk mirroring that stripes data as well as copies of the data across all the drives of the array. The first stripe is the data stripe, and the second stripe is the mirror (copy) of the first data stripe. The data in the mirror stripe is written on another drive. Because data is mirrored, the capacity of the logical drive when assigned level 1 is 50% of the physical capacity of the grouping of hard disk drives in the array.

RAID-5
RAID level 5 stripes data and parity across all drives of the array. When a disk array is assigned RAID-5, the capacity of the logical drive is reduced by one physical drive size because of parity storage. The parity is spread across all drives in the array. If one drive fails, the data can be rebuilt. If more than one drive fails, but one or none of the drives are actually defective, then data may not be lost. You can use a process called software replacement on the non-defective hard drives.
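
The capacity rules given in the RAID-0, RAID-1, and RAID-5 definitions can be worked through with a small calculation. This Python fragment is a sketch under stated assumptions: all drives in the array are the same size, the whole array is assigned to a single logical drive, and the minimum drive counts (two for RAID-1, three for RAID-5) follow common convention rather than this document. The function name is hypothetical.

```python
MAX_DRIVES_PER_ARRAY = 8  # limit stated in the Array definition


def logical_capacity_mb(raid_level: int, drive_count: int, drive_size_mb: int) -> int:
    """Usable capacity in MB of one logical drive spanning the whole array."""
    if not 1 <= drive_count <= MAX_DRIVES_PER_ARRAY:
        raise ValueError("an array holds between 1 and 8 physical drives")
    physical = drive_count * drive_size_mb
    if raid_level == 0:
        # Striping only: no capacity is given up, and there is no redundancy.
        return physical
    if raid_level == 1:
        if drive_count < 2:  # assumed minimum, not stated in this document
            raise ValueError("RAID-1 needs at least two drives")
        # Every data stripe has a mirror copy: 50% of physical capacity.
        return physical // 2
    if raid_level == 5:
        if drive_count < 3:  # assumed minimum, not stated in this document
            raise ValueError("RAID-5 needs at least three drives")
        # Parity consumes the equivalent of one physical drive.
        return physical - drive_size_mb
    raise ValueError("unsupported RAID level")
```

For four 2000 MB drives, this gives 8000 MB at RAID-0, 4000 MB at RAID-1, and 6000 MB at RAID-5.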

Software Replace
A Software Replace of a hardfile means that the hardfile is not physically replaced in the system. A drive may have been marked defunct but is brought back online using the RAID Administration program, and the drive is rebuilt without having been physically replaced. This situation can arise because the RAID Adapter marks a drive defunct whenever it communicates with the hardfile and receives an unexpected response, in order to avoid any potential data loss.

Synchronization
Synchronization reads all the data bits of the entire logical drive, calculates the parity bit for the data, compares the calculated parity with the existing parity, and updates the existing parity if inconsistent.
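
The read, calculate, compare, and update cycle described above can be sketched as follows. This Python fragment is illustrative only and does not reflect the adapter firmware: it assumes XOR parity (the usual choice for RAID-5, though this document does not name the parity function) and models each stripe as an in-memory dictionary, which is a hypothetical structure.

```python
from functools import reduce
from typing import Dict, List


def xor_parity(data_blocks: List[bytes]) -> bytes:
    """XOR equal-length data blocks byte by byte to form a parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_blocks))


def synchronize(stripes: List[Dict]) -> int:
    """Recalculate parity for every stripe and fix any that is inconsistent.

    Each stripe is {"data": [bytes, ...], "parity": bytes}.
    Returns the number of parity blocks that were updated.
    """
    updated = 0
    for stripe in stripes:
        expected = xor_parity(stripe["data"])   # calculate parity from the data
        if stripe["parity"] != expected:        # compare with the stored parity
            stripe["parity"] = expected         # update the inconsistent parity
            updated += 1
    return updated
```

Note that the routine always trusts the data blocks and rewrites the parity, which matches the description above; it does not attempt to decide whether the data or the parity was at fault.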

The following definitions describe the logical drive states for the IBM SCSI-2 F/W PCI-Bus RAID Adapter and the IBM F/W Streaming RAID Adapter/A:

CRITICAL (CRT)
This is the status for RAID-1 and RAID-5 arrays where the system is running in degraded mode because one drive is DDD. If another drive goes DDD, the array will be OFL and the operating system will not be operational.

GOOD
For the IBM SCSI-2 F/W PCI-Bus RAID Adapter and the IBM F/W Streaming RAID Adapter/A, the logical drive status is GOOD when all drives in the array are ONLINE and fully operational.
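
The CRT and GOOD definitions above imply a simple mapping from the states of the physical drives to the state of the logical drive. The Python fragment below is a sketch of that mapping, not adapter logic. The function name is hypothetical. Two points are assumptions of this sketch: a drive that is still rebuilding (not yet ONL) is counted as degraded, and a RAID-0 logical drive is treated as OFL after any single drive failure, which is inferred from the RAID-0 definition.

```python
from typing import Iterable


def logical_drive_state(device_states: Iterable[str], raid_level: int) -> str:
    """Derive GOOD, CRT, or OFL from the device states of the array's drives."""
    not_online = sum(1 for state in device_states if state != "ONL")
    if not_online == 0:
        # All drives are ONL and fully operational.
        return "GOOD"
    if raid_level in (1, 5) and not_online == 1:
        # One drive is missing but redundancy still covers it: degraded mode.
        return "CRT"
    # A second failure, or any failure without redundancy (assumed for RAID-0).
    return "OFL"
```

For example, logical_drive_state(["ONL", "DDD", "ONL"], 5) returns "CRT", and a second DDD drive in the same array returns "OFL".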

The adapters also assign device states to physical drives. The following definitions describe these device states:

DDD
The RAID adapter marks an ONL or OFL drive defunct, changing its status to DDD and removing power from the drive, when one of the following conditions occurs:
- The drive does not respond to commands by a certain timeout value.
- The drive exceeds the number of allowed busy status responses as specified by the RAID adapter firmware.
- A reassign failure or two successive failures in verification occur when the RAID adapter tries to recover from a media error reported from the drive.

NOTE: Media error recovery and conditions under which a drive is marked defunct in the recovery process vary slightly depending upon the specific RAID adapter.
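
The three conditions listed above can be expressed as a single check. This Python fragment is a sketch only: the class, the function, and especially the two limit values are hypothetical, since the actual timeout and busy-response limits are set by the RAID adapter firmware and are not given in this document.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical limits for illustration; real values come from adapter firmware.
COMMAND_TIMEOUT_SECONDS = 30.0
MAX_BUSY_RESPONSES = 10


@dataclass
class DriveObservation:
    seconds_without_response: float
    busy_responses: int
    reassign_failed: bool
    successive_verify_failures: int


def defunct_reason(obs: DriveObservation) -> Optional[str]:
    """Return why the drive would be marked DDD, or None if it stays as it is."""
    if obs.seconds_without_response > COMMAND_TIMEOUT_SECONDS:
        return "no response within the command timeout"
    if obs.busy_responses > MAX_BUSY_RESPONSES:
        return "too many busy status responses"
    if obs.reassign_failed or obs.successive_verify_failures >= 2:
        return "media error recovery failed"
    return None
```

A drive with one failed verification and no other problems returns None here; a second successive verification failure returns the media error reason.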

FMT
Format; the drive is being reformatted.

HSP
A hot-spare (HSP) drive is a drive designated to be a replacement for the first DDD drive that occurs. The state of the drive appears as HSP. When a DDD drive occurs and an HSP is defined, the hot-spare drive takes over for the drive that appears as DDD. The HSP drive is rebuilt to be identical to the DDD drive. During the rebuilding of the HSP drive, this drive changes to the OFL state. The OFL state changes to ONL once the drive is completely rebuilt and has fully taken over for the DDD drive.
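
The state changes described above (HSP to OFL while rebuilding, then OFL to ONL) can be traced with a short model. This Python fragment is a sketch, not the adapter's behavior in every detail: the function names and the dictionary of drive labels are hypothetical, and the model covers only the visible device states.

```python
from typing import Dict, Optional


def hot_spare_takeover(states: Dict[str, str], failed_drive: str) -> Optional[str]:
    """Mark failed_drive DDD and start a rebuild on the hot spare, if one exists.

    Returns the label of the hot spare that took over, or None.
    """
    states[failed_drive] = "DDD"
    for drive, state in states.items():
        if state == "HSP":
            states[drive] = "OFL"  # the spare shows OFL while it is rebuilt
            return drive
    return None


def rebuild_complete(states: Dict[str, str], spare: str) -> None:
    """Move a fully rebuilt spare from OFL to ONL."""
    if states.get(spare) != "OFL":
        raise ValueError("drive is not in the rebuild (OFL) state")
    states[spare] = "ONL"
```

Starting from {"ch1-0": "ONL", "ch1-1": "ONL", "ch1-6": "HSP"}, a failure of "ch1-1" leaves "ch1-1" as DDD and "ch1-6" as OFL; calling rebuild_complete then sets "ch1-6" to ONL.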

OFL
Offline; a good drive that replaces a defunct drive in a RAID level 1 or level 5 array. This drive is associated with the array, but does not contain any data. Drive status remains OFL during the rebuild phase.

ONL
Online; a drive that the RAID adapter detects as installed, operational, and configured into an array appears in this state.

PFA
The firmware of a hard drive uses algorithms to track the error rates on the drive. The drive alerts the user with a Predictive Failure Analysis (PFA) alert via the RAID administration utility and NetFinity when degradation of drive performance (read/write errors) is detected. When a PFA alert occurs, physical replacement of the drive is recommended.

RDY
RDY appears as the status of a drive that the RAID adapter detects as installed, spun up, but not configured in an array.

UFM
Unformatted; a drive that requires low-level formatting before it can be used in an array. You can start the low-level format by selecting Format Drive from the RAID Configuration Main Menu.

Additional Information

Web Sites
IBM maintains extensive and timely information on the world wide web. Visit the following sites for more information on IBM servers and other IBM products. These sources contain product information, performance data, and technical literature.

IBM Home Page ............................................ http://www.ibm.com
IBM PSG Home page .................................. http://www.pc.ibm.com
IBM PSG Server Home page .................... http://www.pc.ibm.com/us/server/server.html
IBM PSG Company Support ...................... http://www.pc.ibm.com/us/support.html
TechConnect Program ................................ http://www.pc.ibm.com/techconnect/
File repositories ............................................. http://www.pc.ibm.com/us/files.html or ftp://ftp.pcco.ibm.com

White Papers
The following White Papers pertain to RAID and hardfile technologies. These provide procedures for ensuring the highest protection and availability of customer data and are viewable on-line in PDF format at: http://www.pc.ibm.com/techconnect/tech/resource.html. From this site select "White Papers".

1. Using IBM RAID Adapters to Avoid Data Loss.
2. High Availability Using the IBM ServeRAID Adapter
3. Understanding Hard Disk Drive Media Defects.

Notice
International Business Machines Corporation 1997. All rights reserved.

References in this publication to IBM products, programs or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functional equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program or service.

Information in this paper was developed in conjunction with use of the equipment specified, and is limited in application to those specific hardware and software products and levels.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY 10594 USA.

The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS WITHOUT WARRANTIES OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. The information about non-IBM (VENDOR) products in this manual has been supplied by the vendor and IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk. This publication could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time.

The following terms are trademarks or registered trademarks of the International Business Machines Corporation in the United States and/or other countries.

OS/2® NetFinity®

Microsoft, Windows, Windows NT, and the Windows logo are registered trademarks of Microsoft Corporation.
UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company Limited.

Other company, product, and service names may be trademarks or service marks of others.