Recovering from Disk Drive Problems

AIX Version 4.3 System Management Guide: Operating System and Devices

This procedure describes how to recover or restore data in logical volumes if a disk drive is failing. Before proceeding with this procedure, you should try the procedure "Migrating the Contents of a Physical Volume". That procedure is the preferred way to recover data from a failing disk.

If your drive is failing and you can repair the drive without reformatting it, no data will be lost. See "Recovering a Disk Drive without Reformatting."
If the disk drive must be reformatted or replaced, you should make a backup, if possible, and remove the disk drive from its volume group and system configuration before replacing it. Some data from single-copy file systems may be lost. See "Recovering Using a Reformatted or Replacement Disk Drive".

Prerequisites

Run diagnostics on the failed disk drive. For instructions, refer to "How to Run Hardware Service Aids" in your system unit operator guide.
The following scenario will be used in the next three procedures. The volume group called myvg contains three disk drives. The disks in this scenario are called hdisk2, hdisk3, and hdisk4. Assume the hdisk3 disk drive goes bad.
The hdisk2 disk drive contains the nonmirrored logical volume lv01 and a copy of the logical volume mylv. The mylv logical volume is mirrored and has three copies, each of which takes up two physical partitions on its disk. The hdisk3 disk drive contains another copy of mylv and the nonmirrored logical volume lv00. Finally, hdisk4 contains a third copy of mylv as well as lv02. The myvg diagram shows this scenario.

Recovering a Disk Drive without Reformatting

If you fix the bad disk and place it back in the system without reformatting it, then you can simply let the system automatically activate and resynchronize the stale physical partitions on the drive at boot time. A stale physical partition is a physical partition that contains data you cannot use. To discover if a physical partition is stale, use the lspv -M command to display information about a physical volume. Stale physical partitions will be marked stale.
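
As a rough illustration of the stale-partition check described above, the following sketch counts the lines of lspv -M style output that carry the word stale. The function name count_stale is hypothetical, and the sample input is made up for illustration; the exact column layout of real output is an assumption.

```shell
#!/bin/sh
# count_stale (hypothetical helper): read `lspv -M` style lines on stdin
# and print how many of them contain "stale" as a separate field.
count_stale() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i == "stale") { n++; break }
    }
    END { print n + 0 }'
}

# Made-up sample input; on a real system the input would come from
# something like:  lspv -M hdisk3 | count_stale
count_stale <<'EOF'
hdisk3:1        mylv:1        stale
hdisk3:2        mylv:2        stale
hdisk3:3        lv00:1
hdisk3:4-50
EOF
```

A count of zero after the system has resynchronized suggests that no partitions on the drive are still marked stale.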

Recovering Using a Reformatted or Replacement Disk Drive

If you must reformat or replace the failing drive, you should remove all references to nonmirrored file systems from the failing disk and remove it from the volume group and system configuration before replacing it. If you do not do this, you will create problems in the ODM and system configuration databases.

Before Removing the Failed Drive

You should be familiar with which logical volumes are on the failing drive. To look at the contents of the failing drive, use one of the other drives. For example, use hdisk4 to look at hdisk3:
```
lspv -M -n hdisk4 hdisk3
```
The lspv command displays information about a physical volume within a volume group. The output might look something like the following:
```
hdisk3:1        mylv:1
hdisk3:2        mylv:2
hdisk3:3        lv00:1
hdisk3:4-50
```
The first column displays the physical partitions and the second column displays the logical partitions. Partitions 4 through 50 are free.
Back up all single-copy logical volumes on the failing device, if possible.
If you have single-copy file systems, unmount them from the disk. Mirrored file systems do not have to be unmounted. Single-copy file systems are those that have the same number of logical partitions as physical partitions on the output from the lspv command. In the example scenario, lv00 on the failing disk hdisk3 is a single-copy file system. Use the command:
```
unmount /Directory
```
Remove all single-copy file systems from the failed physical volume by using the rmfs command:
```
rmfs /Directory
```
Remove all mirrored logical volumes located on the failing disk by reducing the number of copies of the physical partitions to only those that are currently available. The rmlvcopy command removes copies from each logical partition. For example:
```
rmlvcopy mylv 2 hdisk3
```
By removing the copy on hdisk3, you reduce the number of copies of each logical partition belonging to the mylv logical volume from three to two (one on hdisk4 and one on hdisk2).
Note: Do not use rmlvcopy on the hd5 and hd7 logical volumes from physical volumes in the rootvg volume group. The system will not allow you to remove these logical volumes because there should be only one copy of these.
Remove the primary dump device (logical volume hd7) if the failing physical volume was a part of the rootvg volume group that contained it. For example:
```
sysdumpdev -P -p /dev/sysdumpnull
```
The sysdumpdev command changes the primary or secondary dump device location for a running system. When you reboot, the dump device will return to its original location.
Remove any paging spaces located on the disk using the rmps command. If you cannot remove paging spaces because they are currently in use, you must flag the paging space as not active and reboot before continuing with this procedure. If there are active paging spaces, the reducevg command may fail.
Remove any other logical volumes, such as those with only one copy, using the rmlv command. For example:
```
rmlv -f lv00
```
The rmlv command removes a logical volume from a volume group.
Reduce the size of the volume group to omit the failed drive using the reducevg command. For example:
```
reducevg -df myvg hdisk3
```
This example reduces the size of the myvg volume group to omit the hdisk3 drive.
You can now power off the old drive using the SMIT fast path smit rmvdsk. Change the KEEP definition in database field to no. Power off the system and allow your next level of support to add the new or reformatted disk drive.
Shut down the system:
```
shutdown -F
```
The shutdown command halts the operating system.
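
The first step of the procedure above, finding out which logical volumes live on the failing drive, can be sketched as a small filter over the lspv -M output shown earlier. The function name lvs_on_disk is hypothetical; the sample input is the example output from this section.

```shell
#!/bin/sh
# lvs_on_disk (hypothetical helper): read `lspv -M` output on stdin and
# print each logical volume name once, in order of first appearance.
# Lines with a single column (free partitions) are skipped.
lvs_on_disk() {
    awk 'NF >= 2 {
        split($2, part, ":")          # "mylv:1" -> part[1] = "mylv"
        if (!(part[1] in seen)) {
            seen[part[1]] = 1
            print part[1]
        }
    }'
}

# On a real system:  lspv -M -n hdisk4 hdisk3 | lvs_on_disk
lvs_on_disk <<'EOF'
hdisk3:1        mylv:1
hdisk3:2        mylv:2
hdisk3:3        lv00:1
hdisk3:4-50
EOF
```

Each name printed still has to be handled by hand as described above: rmlvcopy for mirrored logical volumes such as mylv, and a backup followed by rmfs or rmlv for single-copy ones such as lv00.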

After Reformatting a Drive

Because the disk has been reformatted, the volume group definition that was on the disk is gone. If you forgot, or were unable, to remove the disk from the old volume group with the reducevg command before the disk was formatted, the following procedure can help clean up the VGDA/ODM information.

If the volume group consisted of only one disk, which was reformatted, enter:
```
exportvg VGName
```
If the volume group consists of more than one disk, first run the command:
```
varyonvg VGName
```
You will receive a message about a missing or unavailable disk, and the disk you have now reformatted will be listed. Note the PVID of that disk, which is listed in the varyonvg message. It is the 16-character string between the name of the missing disk and the label PVNOTFND.
```
hdiskX PVID PVNOTFND
```
Enter:
```
varyonvg -f VGName
```
The missing disk is now displayed with the PVREMOVED label.
```
hdiskX PVID PVREMOVED
```
Then, enter the command:
```
reducevg -df VGName PVID
```

Attention: The logical volumes defined on this missing disk will be deleted from the ODM and VGDA areas of the remaining disks that make up the volume group VGName.
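
Picking the PVID out of the varyonvg message can be sketched as follows. The function name missing_pvids is hypothetical, the sample PVIDs are made up, and the sketch assumes only that each disk line ends with the PVID followed by a status label, as in the hdiskX PVID PVNOTFND line shown above.

```shell
#!/bin/sh
# missing_pvids (hypothetical helper): read varyonvg message lines on stdin
# and print the PVID (the field just before the label) of every disk
# reported as PVNOTFND.
missing_pvids() {
    awk 'NF >= 2 && $NF == "PVNOTFND" { print $(NF - 1) }'
}

# Made-up sample lines in the "hdiskX PVID LABEL" shape.
missing_pvids <<'EOF'
hdisk2 00001234abcd0002 PVACTIVE
hdisk3 00001234abcd0003 PVNOTFND
EOF
```

The printed value is what the reducevg -df VGName PVID step expects in place of a disk name, once varyonvg -f has been run.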

After Adding a Reformatted or Replacement Disk Drive

If you would prefer not to reboot the system after reformatting the disk drive, you must configure the disk and create the device entry:

```
cfgmgr
mkdev -l hdisk3
```

If you reboot the system instead, the new drive is configured automatically. After rebooting, use the following procedure:

List all the disks using the lsdev command. Then find the name of the disk you just attached. For example:
```
lsdev -C -c disk
```
In this example, the disk that was just attached will be called by the same name as before (hdisk3).
Make the disk available using the chdev command:
```
chdev -l hdisk3 -a pv=yes
```
Add the new disk drive to the volume group using the extendvg command. For example:
```
extendvg myvg hdisk3
```
The extendvg command increases the size of the volume group by adding one or more physical volumes. This example adds the hdisk3 drive to the myvg volume group.
Recreate the single-copy logical volumes on the disk drive you just attached using the mklv command. For example:
```
mklv -y lv00 myvg 1 hdisk3
```
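This example recreates the lv00 logical volume on the hdisk3 drive. The 1 is the number of logical partitions to allocate. Because the -c flag is not specified, the logical volume has only one copy; that is, it is not mirrored.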
Recreate the file systems on the logical volume using the crfs command:
```
crfs -v jfs -d LVname -m /Directory
```
Restore single-copy file system data from backup media. See "Restoring Individual User Files".
Recreate the mirrored copies of logical volumes using the mklvcopy command. For example:
```
mklvcopy mylv 3 hdisk3
```
The mklvcopy command creates copies of data within a logical volume. This example creates a third mirrored copy of the mylv logical volume on hdisk3.
Synchronize the new mirror with the data on the current mirrors (on hdisk2 and hdisk4):
```
syncvg -p hdisk3
```
The syncvg command synchronizes logical volume copies that are not current.

After performing this procedure, all mirrored file systems should be restored and up-to-date. If you were able to back up your single-copy file systems, they will also be ready to use. You should be able to proceed with normal system use.
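
One check that fits early in the procedure above, before chdev and extendvg are run, is confirming that the new disk shows up in the lsdev listing. The sketch below reads lsdev -C -c disk style output and prints the state column for one disk. The function name disk_state is hypothetical, and the location codes and descriptions in the sample are made up; the sketch assumes only that the first column is the device name and the second is its state.

```shell
#!/bin/sh
# disk_state (hypothetical helper): $1 is a disk name, stdin is
# `lsdev -C -c disk` style output. Prints the second column (the device
# state) for that disk, or "missing" if the disk is not listed.
disk_state() {
    awk -v disk="$1" '
        $1 == disk { print $2; found = 1 }
        END { if (!found) print "missing" }'
}

# Made-up sample listing; on a real system:
#   lsdev -C -c disk | disk_state hdisk3
disk_state hdisk3 <<'EOF'
hdisk2 Available 00-08-00-2,0 SCSI Disk Drive
hdisk3 Available 00-08-00-3,0 SCSI Disk Drive
hdisk4 Available 00-08-00-4,0 SCSI Disk Drive
EOF
```

If the result is anything other than Available, the disk probably still needs to be configured (for example with cfgmgr or a reboot) before it can be added to the volume group.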

Example of Recovery from a Failed Disk Drive

To recover from a failed disk drive, back out the way you came in; that is, list the steps you went through to create the volume group, and then go backwards. The following example is an illustration of this technique. It shows how a mirrored logical volume was created and then how it was altered, backing out one step at a time, when a disk failed.

Note: The following example is of a specific instance and is given for illustration only. It is not intended as a general prototype on which to base any general recovery procedures.

Create a volume group called workvg on hdisk1.
```
mkvg -y workvg hdisk1
```

Add two more disks to this volume group.
```
extendvg workvg hdisk2
extendvg workvg hdisk3
```

Create a logical volume of 40MB that has three copies. Each copy is on a different one of the three disks that make up workvg.
```
mklv -y testlv workvg 10
mklvcopy testlv 3
```
Assume that hdisk2 fails.
Reduce the number of mirrored copies for the logical volume from three to two, and inform the LVM that you aren't counting on the copy on hdisk2 anymore.
```
rmlvcopy testlv 2 hdisk2
```
Detach hdisk2 from the system in such a way that the ODM and VGDA are updated.
```
reducevg workvg hdisk2
```
Communicate to the ODM and the disk driver that you are taking hdisk2 offline for replacement.
```
rmdev -l hdisk2 -d
```
Shut down the system.
```
shutdown -F
```
Put in a new disk. It may or may not have the same SCSI ID as the former hdisk2.
Reboot the machine.
Because you have a new disk (the system sees that there is a new PVID on this disk), the system will choose the first OPEN hdisk name. Because the -d flag was used with the rmdev command earlier, the name hdisk2 was released. Thus the configurator chooses hdisk2 for the name of the new disk. If the -d flag had not been used, hdisk4 would have been chosen as the new name.
Add this disk to the workvg volume group.
```
extendvg workvg hdisk2
```
Bring the logical volume back to three copies (the original plus two mirrored copies). The Logical Volume Manager will automatically place the third logical volume copy on the new hdisk2.
```
mklvcopy testlv 3
```
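
The naming behavior described in this example (the new disk receives the first open hdisk name) can be modeled with a short sketch. This is a simplified illustration, not the configurator's actual logic; the function name first_free_hdisk is hypothetical, and the presence of an hdisk0 on the system is an assumption.

```shell
#!/bin/sh
# first_free_hdisk (hypothetical helper): read hdisk names that are still
# defined, one per line, and print the lowest-numbered name not in use.
first_free_hdisk() {
    awk '/^hdisk[0-9]+$/ { used[substr($0, 6) + 0] = 1 }
         END {
             n = 0
             while (n in used) n++
             print "hdisk" n
         }'
}

# Definition of hdisk2 deleted with rmdev -l hdisk2 -d: the name is open again.
printf '%s\n' hdisk0 hdisk1 hdisk3 | first_free_hdisk          # prints hdisk2

# Definition of hdisk2 kept: the next open name is past hdisk3.
printf '%s\n' hdisk0 hdisk1 hdisk2 hdisk3 | first_free_hdisk   # prints hdisk4
```

The two calls correspond to the two outcomes described above: hdisk2 when the -d flag was used, and hdisk4 when it was not.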