System Management Guide:
Operating System and Devices

LVM Troubleshooting Tasks

The topics in this section provide diagnostics and recovery procedures to use if you encounter disk drive problems, physical or logical volume errors, or volume group errors.

Disk Drive Problems

If your disk drive is running out of available space, see Getting More Space on a Disk Drive. If you suspect a disk drive is mechanically failing or has failed, run diagnostics on the disk using the following procedure:

1. With root authority, type the following SMIT fast path on the command line:
```
smit diag
```
2. Select Current Shell Diagnostics to enter the AIX Diagnostics tool.
3. After you read the Diagnostics Operating Instructions screen, press Enter.
4. Select Diagnostics Routines.
5. Select System Verification.
6. Scroll down through the list to find and select the drive you want to test.
7. Select Commit.

Based on the diagnostics results, you should be able to determine the condition of the disk:

If you detect the disk drive is failing or has failed, of primary importance is recovering the data from that disk. If the disk is still accessible, try completing the procedure in Migrating the Contents of a Physical Volume. Migration is the preferred way to recover data from a failing disk. The following procedures describe how to recover or restore data in logical volumes if migration cannot complete successfully.
If your drive is failing and you can repair the drive without reformatting it, no data will be lost. See Recovering a Disk Drive without Reformatting.
If the disk drive must be reformatted or replaced, make a backup, if possible, and remove the disk drive from its volume group and system configuration before replacing it. Some data from single-copy file systems might be lost. See Recovering Using a Reformatted or Replacement Disk Drive.
If your system supports the hot removability feature, see Recovering from Disk Failure while the System Remains Available.

Getting More Space on a Disk Drive

If you run out of space on a disk drive, there are several ways to remedy the problem. You can automatically track and remove unwanted files, restrict users from certain directories, or mount space from another disk drive.

You must have root user, system group, or administrative group authority to execute these tasks.

Cleaning Up File Systems Automatically

Use the skulker command to clean up file systems by removing unwanted files. Type the following from the command line:

skulker -p

The skulker command is used to periodically purge obsolete or unneeded files from file systems. Candidates include files in the /tmp directory, files older than a specified age, a.out files, core files, or ed.hup files.

The skulker command is typically run daily, as part of an accounting procedure run by the cron command during off-peak hours. For more information about using the skulker command in a cron process, see Fix Disk Overflows.

For information on typical cron entries, see Setting Up an Accounting System.
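
Before enabling an automatic cleanup, it can help to see which files of the kinds named above exist in a directory. The following is a minimal sketch, not how the skulker command itself is implemented. The function name list_cleanup_candidates is hypothetical, and the function only lists files; it deletes nothing.

```shell
# list_cleanup_candidates DIR  (hypothetical helper, not part of AIX)
# Print regular files under DIR named a.out, core, or ed.hup, which are
# some of the candidate types that the text above mentions.
# This is a dry-run listing: no file is removed.
list_cleanup_candidates() {
    find "$1" -type f \( -name a.out -o -name core -o -name ed.hup \) -print | sort
}
```

For example, `list_cleanup_candidates /home` prints one path per line so the list can be reviewed before anything is removed.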

Restricting Users from Certain Directories

Another way to release disk space and possibly to keep it free is to restrict and monitor disk usage.

Restrict users from certain directories by typing:
```
chmod 655 DirName
```
This command sets read and write permissions for the owner (root) and sets read and execute (search) permissions, but not write permission, for the group and others. DirName is the full path name of the directory you want to restrict.
Monitor the disk usage of individual users. One way to do this is to add the following line to the /var/spool/cron/crontabs/adm file:
```
0 2 * * 4 /usr/sbin/acct/dodisk
```
This line executes the dodisk command at 2 a.m. (0 2) each Thursday (4). The dodisk command initiates disk-usage accounting. This command is usually run as part of an accounting procedure run by the cron command during off-peak hours. See Setting Up an Accounting System for more information on typical cron entries.
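
To check what an octal mode such as 655 grants before applying it, each digit can be decoded into its read, write, and execute bits. The following sketch does that; the function name mode_to_string is hypothetical and the function only prints text, it does not change any permissions.

```shell
# mode_to_string MODE  (hypothetical helper)
# Convert a three-digit octal mode such as 655 into rwx notation,
# in the order owner, group, others.
mode_to_string() {
    out=""
    # Split the mode into single digits separated by spaces.
    for digit in $(printf '%s' "$1" | sed 's/./& /g'); do
        case "$digit" in
            0) out="${out}---" ;;
            1) out="${out}--x" ;;
            2) out="${out}-w-" ;;
            3) out="${out}-wx" ;;
            4) out="${out}r--" ;;
            5) out="${out}r-x" ;;
            6) out="${out}rw-" ;;
            7) out="${out}rwx" ;;
            *) echo "invalid mode: $1" >&2; return 1 ;;
        esac
    done
    printf '%s\n' "$out"
}

mode_to_string 655   # prints rw-r-xr-x
```

The output shows that with mode 655 the group and others cannot create or delete entries in the directory, because neither has write permission.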

Mounting Space from Another Disk Drive

Another way to get more space on a disk drive is to mount space from another drive. You can mount space from one disk drive to another in the following ways:

Use the smit mountfs fast path.
Use the mount command. For example:
```
mount -n nodeA -vnfs /usr/spool /usr/myspool
```
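The mount command makes a file system available for use at a specific location. In this example, the /usr/spool directory on the remote node nodeA is mounted over the local /usr/myspool directory as an NFS file system.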

For more information about mounting file systems, see Mount a JFS or JFS2.

Recovering a Disk Drive without Reformatting

If you repair a bad disk and place it back in the system without reformatting it, you can let the system automatically activate and resynchronize the stale physical partitions on the drive at boot time. A stale physical partition contains data your system cannot use.

If you suspect a stale physical partition, type the following on the command line:

lspv -M PhysVolName

Where PhysVolName is the name of your physical volume. The lspv command output will list all partitions on your physical volume. The following is an excerpt from example output:

hdisk16:112     lv01:4:2        stale
hdisk16:113     lv01:5:2        stale
hdisk16:114     lv01:6:2        stale
hdisk16:115     lv01:7:2        stale
hdisk16:116     lv01:8:2        stale
hdisk16:117     lv01:9:2        stale
hdisk16:118     lv01:10:2       stale

The first column displays the physical partitions and the second column displays the logical partitions. Any stale physical partitions are noted in the third column.
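
On a large physical volume the listing can be long. The following sketch summarizes it by counting stale physical partitions for each logical volume. It assumes the three-column layout shown in the excerpt above; the function name count_stale is hypothetical.

```shell
# count_stale  (hypothetical helper)
# Read lspv -M style lines on standard input and print, for each logical
# volume, the number of its physical partitions that are marked stale.
count_stale() {
    awk '$3 == "stale" {
             split($2, lp, ":")   # lp[1] is the logical volume name
             n[lp[1]]++
         }
         END { for (lv in n) print lv, n[lv] }' | sort
}
```

A typical use would be `lspv -M PhysVolName | count_stale`. For the excerpt above, the result is a single line for lv01 with a count of 7.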

Recovering Using a Reformatted or Replacement Disk Drive

This section describes how to recover data from a failed disk drive when you must reformat or replace the failed disk.

Attention: Before you reformat or replace a disk drive, remove all references to nonmirrored file systems from the failing disk and remove the disk from the volume group and system configuration. If you do not, you create problems in the ODM (object data manager) and system configuration databases. Instructions for these essential steps are included in the following procedure, under Before replacing or reformatting your failed or failing disk.

The following procedure uses a scenario in which the volume group called myvg contains three disk drives called hdisk2, hdisk3, and hdisk4. In this scenario, hdisk3 goes bad. The nonmirrored logical volume lv01 and a copy of the mylv logical volume are contained on hdisk2. The mylv logical volume is mirrored and has three copies, each of which takes up two physical partitions on its disk. The failing hdisk3 contains another copy of mylv, and the nonmirrored logical volume called lv00. Finally, hdisk4 contains a third copy of mylv as well as a logical volume called lv02. The following figure shows this scenario.

This procedure is divided into the following key segments:

The things you do to protect data before replacing or reformatting your failing disk
The procedure you follow to reformat or replace the disk
The things you do to recover the data after the disk is reformatted or replaced

Before replacing or reformatting your failed or failing disk:

1. Log in with root authority.
2. If you are not familiar with the logical volumes that are on the failing drive, use an operational disk to view the contents of the failing disk. For example, to use hdisk4 to look at hdisk3, type the following on the command line:
```
lspv -M -n hdisk4 hdisk3
```
  The lspv command displays information about a physical volume within a volume group. The output looks similar to the following:
```
hdisk3:1        mylv:1
hdisk3:2        mylv:2
hdisk3:3        lv00:1
hdisk3:4-50
```
  The first column displays the physical partitions, and the second column displays the logical partitions. Partitions 4 through 50 are free.
3. Back up all single-copy logical volumes on the failing device, if possible. For instructions, see Backing Up and Restoring Information.
4. If you have single-copy file systems, unmount them from the disk. (You can identify single-copy file systems from the output of the lspv command. Single-copy file systems have the same number of logical partitions as physical partitions on the output.) Mirrored file systems do not have to be unmounted.
  In the scenario, lv00 on the failing disk hdisk3 is a single-copy file system. To unmount it, type the following:
```
unmount /dev/lv00
```
  If you do not know the name of the file system, assuming the /etc/filesystems file is not solely located on the failed disk, type mount on the command line to list all mounted file systems and find the name associated with your logical volume. You can also use the grep command on the /etc/filesystems file to list only the file system names, if any, associated with your logical volume. For example:
```
grep lv00 /etc/filesystems
```
  The output looks similar to the following:
```
dev             = /dev/lv00   
log             = /dev/loglv00
```
  Notes:
  1. The unmount command fails if the file system you are trying to unmount is currently being used. The unmount command executes only if none of the file system's files are open and no user's current directory is on that device.
  2. Another name for the unmount command is umount. The names are interchangeable.
5. Remove all single-copy file systems from the failed physical volume by typing the rmfs command:
```
rmfs /FSname
```
6. Remove all mirrored logical volumes located on the failing disk.
  Note: You cannot use rmlvcopy on the hd5 and hd7 logical volumes from physical volumes in the rootvg volume group. The system does not allow you to remove these logical volumes because there is only one copy of each.
  The rmlvcopy command removes copies from each logical partition. For example, type:
```
rmlvcopy mylv 2 hdisk3
```
  By removing the copy on hdisk3, you reduce the number of copies of each logical partition belonging to the mylv logical volume from three to two (one on hdisk4 and one on hdisk2).
7. If the failing disk was part of the root volume group and contained logical volume hd7, remove the primary dump device (hd7) by typing the following on the command line:
```
sysdumpdev -P -p /dev/sysdumpnull
```
  The sysdumpdev command changes the primary or secondary dump device location for a running system. When you reboot, the dump device returns to its original location.
8. Remove any paging space located on the disk using the following command:
```
rmps PSname
```
  Where PSname is the name of the paging space to be removed, which is actually the name of the logical volume on which the paging space resides.
  If the rmps command is not successful, you must use the smit chps fast path to deactivate the primary paging space and reboot before continuing with this procedure. The reducevg command in step 10 can fail if there are active paging spaces.
9. Remove any other logical volumes from the volume group, such as those that do not contain a file system, using the rmlv command. For example, type:
```
rmlv -f lv00
```
10. Remove the failed disk from the volume group using the reducevg command. For example, type:
```
reducevg -df myvg hdisk3
```
  If you cannot execute the reducevg command or if the command is unsuccessful, the procedure in step 13 can help clean up the VGDA/ODM information after you reformat or replace the drive.

Replacing or reformatting your failed or failing disk:

11. The next step depends on whether you want to reformat or replace your disk and on what type of hardware you are using:
- If you want to reformat the disk drive, use the following procedure:
  1. With root authority, type the following SMIT fast path on the command line:
```
smit diag
```
  2. Select Current Shell Diagnostics to enter the AIX Diagnostics tool.
  3. After you read the Diagnostics Operating Instructions screen, press Enter.
  4. Select Task Selection.
  5. Scroll down through the task list to find and select Format Media.
  6. Select the disk you want to reformat. After you confirm that you want to reformat the disk, all content on the disk will be erased.
  After the disk is reformatted, continue with step 12.
- If your system supports hot swap disks, use the procedure in Recovering from Disk Failure while the System Remains Available, then continue with step 13.
- If your system does not support hot swap disks, do the following:
  - Power off the old drive using the SMIT fast path smit rmvdsk. Change the KEEP definition in database field to No.
  - Contact your next level of system support to replace the disk drive.

After replacing or reformatting your failed or failing disk:

12. Follow the instructions in Configuring a Disk and Making an Available Disk a Physical Volume.
13. If you could not use the reducevg command on the disk from the old volume group before the disk was formatted (step 10), the following procedure can help clean up the VGDA/ODM information.
  1. If the volume group consisted of only one disk that was reformatted, type:
```
exportvg VGName
```
    Where VGName is the name of your volume group.
  2. If the volume group consists of more than one disk, type the following on the command line:
```
varyonvg VGName
```
    The system displays a message about a missing or unavailable disk, and the new (or reformatted) disk is listed. Note the physical volume identifier (PVID) of the new disk, which is listed in the varyonvg message. It is the 16-character string between the name of the missing disk and the label PVNOTFND. For example:
```
hdisk3 00083772caa7896e PVNOTFND
```
    Type:
```
varyonvg -f VGName
```
    The missing disk is now displayed with the PVREMOVED label. For example:
```
hdisk3 00083772caa7896e PVREMOVED
```
    Then, type the command:
```
reducevg -df VGName PVID
```
    Where PVID is the physical volume identifier (in this scenario, 00083772caa7896e).
14. To add the new disk drive to the volume group, use the extendvg command. For example, type:
```
extendvg myvg hdisk3
```
15. To re-create the single-copy logical volumes on the new (or reformatted) disk drive, use the mklv command. For example, type:
```
mklv -y lv00 myvg 1 hdisk3
```
  This example re-creates the lv00 logical volume on the hdisk3 drive. The 1 is the number of logical partitions. Because the -c flag is not specified, the logical volume has a single copy, which means that it is not mirrored.
16. To re-create the file systems on the logical volume, use the crfs command. For example, type:
```
crfs -v jfs -d lv00 -m /FSname
```
  Where /FSname is the mount point for the re-created file system.
17. To restore single-copy file system data from backup media, see Restoring from Backup Image Individual User Files.
18. To re-create the mirrored copies of logical volumes, use the mklvcopy command. For example, type:
```
mklvcopy mylv 3 hdisk3
```
  This example creates a third mirrored copy of the mylv logical volume on hdisk3.
19. To synchronize the new mirror with the data on the other mirrors (in this example, hdisk2 and hdisk4), use the syncvg command. For example, type:
```
syncvg -p hdisk3
```

At this point, all mirrored file systems should be restored and up-to-date. If you were able to back up your single-copy file systems, they will also be ready to use. You should be able to proceed with normal system use.
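
In the VGDA/ODM cleanup part of the procedure above, the PVID has to be read from the varyonvg status message before it is passed to the reducevg command. The following sketch picks it out of the message text. It assumes that the PVID is the field immediately before the PVNOTFND or PVREMOVED label, as in the example lines shown earlier; the function name missing_pvid is hypothetical.

```shell
# missing_pvid  (hypothetical helper)
# Read varyonvg-style status lines on standard input and print the field
# that precedes a PVNOTFND or PVREMOVED label, which is the PVID in the
# example lines shown in the procedure.
missing_pvid() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "PVNOTFND" || $i == "PVREMOVED")
                print $(i - 1)
    }'
}

printf 'hdisk3 00083772caa7896e PVNOTFND\n' | missing_pvid
# prints 00083772caa7896e
```

If the message is written to standard error on your system, redirect it into the pipe, for example with `varyonvg VGName 2>&1 | missing_pvid`. Check the printed value against the message before using it with reducevg.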

Example of Recovery from a Failed Disk Drive

To recover from a failed disk drive, back out the way you came in; that is, list the steps you went through to create the volume group, and then go backwards. The following example is an illustration of this technique. It shows how a mirrored logical volume was created and then how it was altered, backing out one step at a time, when a disk failed.

Note

The following example illustrates a specific instance. It is not intended as a general prototype on which to base any general recovery procedures.

The system manager, Jane, created a volume group called workvg on hdisk1, by typing:
```
mkvg -y workvg hdisk1
```
She then added two more disks to this volume group, by typing:
```
extendvg workvg hdisk2

extendvg workvg hdisk3
```
Jane created a logical volume of 40 MB that has three copies. Each copy is on one of the three disks that make up the workvg volume group. She used the following commands:
```
mklv -y testlv workvg 10

mklvcopy testlv 3
```

After Jane created the mirrored workvg volume group, hdisk2 failed. Therefore, she took the following steps to recover:

1. She removed the logical volume copy from hdisk2 by typing:
```
rmlvcopy testlv 2 hdisk2
```
2. She detached hdisk2 from the system so that the ODM and VGDA are updated, by typing:
```
reducevg workvg hdisk2
```
3. She removed hdisk2 from the system configuration to prepare for replacement by typing:
```
rmdev -l hdisk2 -d
```
4. She chose to shut down the system, by typing:
```
shutdown -F
```
5. She replaced the disk. The new disk did not have the same SCSI ID as the former hdisk2.
6. She rebooted the system.
  Because the disk is new (the system sees that there is a new PVID on this disk), the system chooses the first open hdisk name. Because the -d flag was used in step 3, the name hdisk2 was released, so the system chose hdisk2 as the name of the new disk. If the -d flag had not been used, hdisk4 would have been chosen as the new name.
7. Jane added this disk into the workvg volume group by typing:
```
extendvg workvg hdisk2
```
8. She re-created the third mirrored copy of the logical volume by typing:
```
mklvcopy testlv 3
```
  The Logical Volume Manager automatically placed the third logical volume copy on the new hdisk2.

Recovering from Disk Failure while the System Remains Available

The procedure to recover from disk failure using the hot removability feature is, for the most part, the same as described in Recovering a Disk Drive without Reformatting, with the following exceptions:

To unmount file systems on a disk, use the procedure Mount a JFS or JFS2.
To remove the disk from its volume group and from the operating system, use the procedure Removing a Disk without Data.
To replace the failed disk with a new one, you do not need to shut down the system. Use the following sequence of procedures:
1. Adding Disks while the System Remains Available
2. Configuring a Disk
3. Continue with step 13 of Recovering Using a Reformatted or Replacement Disk Drive.

Replacing a Disk When the Volume Group Consists of One Disk

If you can access a disk that is going bad as part of a volume group, use one of the following procedures:

If the disk is bad and cannot be accessed, follow these steps:

Export the volume group.
Replace the drive.
Re-create the data from backup media that exists.

Physical or Logical Volume Errors

This section contains possible problems with and solutions for physical or logical volume errors.

Hot Spot Problems

If you notice performance degradation when accessing logical volumes, you might have hot spots in your logical volumes that are experiencing too much disk I/O. For more information, see AIX 5L Version 5.2 System Management Concepts: Operating System and Devices and Enabling and Configuring Hot Spot Reporting.

LVCB Warnings

The logical volume control block (LVCB) is the first 512 bytes of a logical volume. This area holds important information such as the creation date of the logical volume, information about mirrored copies, and possible mount points in the JFS. Certain LVM commands are required to update the LVCB, as part of the algorithms in LVM. The old LVCB is read and analyzed to see if it is valid. If the information is valid LVCB information, the LVCB is updated. If the information is not valid, the LVCB update is not performed, and you might receive the following message:

Warning, cannot write lv control block data.

Most of the time, this message results when database programs bypass the JFS and access raw logical volumes as storage media. When this occurs, the information for the database is literally written over the LVCB. For raw logical volumes, this is not fatal. After the LVCB is overwritten, the user can still:

Expand a logical volume
Create mirrored copies of the logical volume
Remove the logical volume
Create a journaled file system to mount the logical volume

There are limitations to deleting LVCBs. A logical volume with a deleted LVCB might not import successfully to other systems. During an importation, the LVM importvg command scans the LVCBs of all defined logical volumes in a volume group for information concerning the logical volumes. If the LVCB does not exist, the imported volume group still defines the logical volume to the new system that is accessing this volume group, and the user can still access the raw logical volume. However, the following typically happens:

Any JFS information is lost and the associated mount point is not imported to the new system. In this case, you must create new mount points, and the availability of previous data stored in the file system is not ensured.
Some non-JFS information concerning the logical volume cannot be found. When this occurs, the system uses default logical volume information to populate the ODM information. Thus, some output from the lslv command might be inconsistent with the real logical volume. If any logical volume copies still exist on the original disks, the information is not correctly reflected in the ODM database. Use the rmlvcopy and mklvcopy commands to rebuild any logical volume copies and synchronize the ODM.

Physical Partition Limits

In the design of the Logical Volume Manager (LVM), each logical partition maps to one physical partition (PP), and each physical partition maps to a number of disk sectors. The design of LVM limits the number of physical partitions that LVM can track per disk to 1016. In most cases, not all of the 1016 tracking partitions are used by a disk. When this limit is exceeded, you might see a message similar to the following:

0516-1162 extendvg: Warning, The Physical Partition Size of PPsize requires the
creation of TotalPPs partitions for PVname.  The limitation for volume group
VGname is LIMIT physical partitions per physical volume.  Use chvg command
with -t option to attempt to change the maximum Physical Partitions per
Physical volume for this volume group.

Where:

PPsize: Is 1 MB to 1 GB in powers of 2.
TotalPPs: Is the total number of physical partitions on this disk, given the PPsize.
PVname: Is the name of the physical volume, for example, hdisk3.
VGname: Is the name of the volume group.
LIMIT: Is 1016 or a multiple of 1016.

This limitation is enforced in the following instances:

When creating a volume group using the mkvg command, you specified a number of physical partitions on a disk in the volume group that exceeded 1016. To avoid this limitation, you can select from the physical partition size ranges of 1, 2, 4 (the default), 8, 16, 32, 64, 128, 256, 512 or 1024 MB and use the mkvg -s command to create the volume group. Alternatively, you can use a suitable factor that allows multiples of 1016 partitions per disk, and use the mkvg -t command to create the volume group.
When adding a disk to a pre-existing volume group with the extendvg command, the new disk caused the 1016 limitation violation. To resolve this situation, convert the existing volume group to hold multiples of 1016 partitions per disk using the chvg -t command. Alternatively, you can re-create the volume group with a larger partition size that allows the new disk, or you can create a standalone volume group consisting of a larger physical size for the new disk.
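
The choice between a larger partition size (mkvg -s) and a larger factor (mkvg -t or chvg -t) comes down to simple arithmetic. The following sketch finds the smallest partition size that keeps a disk within the limit. It is an estimate only: it assumes the disk size is given in megabytes, counts whole partitions, and ignores any space that LVM reserves for its own data. The function name min_pp_size is hypothetical.

```shell
# min_pp_size DISK_MB [FACTOR]  (hypothetical helper)
# Print the smallest power-of-two physical partition size (1 to 1024 MB)
# for which a disk of DISK_MB megabytes has at most 1016 * FACTOR
# physical partitions. FACTOR defaults to 1.
min_pp_size() {
    disk_mb=$1
    factor=${2:-1}
    limit=$((1016 * factor))
    pp=1
    while [ "$pp" -le 1024 ]; do
        # Whole partitions only; a leftover fragment is not counted.
        needed=$((disk_mb / pp))
        if [ "$needed" -le "$limit" ]; then
            echo "$pp"
            return 0
        fi
        pp=$((pp * 2))
    done
    echo "no partition size up to 1024 MB fits within $limit partitions" >&2
    return 1
}

min_pp_size 9216      # 9 GB disk, default limit of 1016: prints 16
min_pp_size 9216 2    # same disk with a factor of 2 (2032): prints 8
```

The second call shows the trade-off: doubling the factor lets the same disk use a partition size half as large.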

Partition Limitations and the rootvg

If the installation code detects that the rootvg drive is larger than 4 GB, it changes the mkvg -s value until the entire disk capacity can be mapped to the available 1016 physical partitions. This installation change also implies that all other disks added to rootvg, regardless of size, are also defined at that physical partition size.

Partition Limitations and RAID Systems

For systems using a redundant array of independent disks (RAID), the /dev/hdiskX name used by LVM may consist of many non-4 GB disks. In this case, the 1016 requirement still exists. LVM is unaware of the size of the individual disks that really make up /dev/hdiskX. LVM bases the 1016 limitation on the recognized size of /dev/hdiskX, and not the real physical disks that make up /dev/hdiskX.

Synchronizing the Device Configuration Database

A system malfunction can cause the device configuration database to become inconsistent with the LVM. When this happens, a logical volume command generates such error messages as:

0516-322 The Device Configuration Database is inconsistent ...

0516-306 Unable to find logical volume LVname in the Device
Configuration Database.

(where the logical volume called LVname is normally available).

Attention: Do not remove the /dev entries for volume groups or logical volumes. Do not change the database entries for volume groups or logical volumes using the Object Data Manager.

To synchronize the device configuration database with the LVM information, with root authority, type the following on the command line:

synclvodm -v VGName

Where VGName is the name of the volume group you want to synchronize.

Volume Group Errors

If the importvg command is not working correctly, try refreshing the device configuration database. See Synchronizing the Device Configuration Database.

Overriding a Vary-On Failure

Attention: Overriding a vary-on failure is an unusual operation; check all other possible problem sources such as hardware, cables, adapters, and power sources before proceeding. Overriding a quorum failure during a vary-on process is used only in an emergency and only as a last resort (for example, to salvage data from a failing disk). Data integrity cannot be guaranteed for management data contained in the chosen copies of the VGDA and the VGSA when a quorum failure is overridden.

When you choose to forcibly vary-on a volume group by overriding the absence of a quorum, the PV STATE of all physical volumes that are missing during this vary-on process will be changed to removed. This means that all the VGDA and VGSA copies are removed from these physical volumes. After this is done, these physical volumes will no longer take part in quorum checking, nor are they allowed to become active within the volume group until you return them to the volume group.

Under one or more of the following conditions, you might want to override the vary-on failure so that the data on the available disks in the volume group can be accessed:

Unavailable physical volumes appear permanently damaged.
You can confirm that at least one of the presently accessible physical volumes (which must also contain a good VGDA and VGSA copy) was online when the volume group was last varied on. Unconfigure and power off the missing physical volumes until they can be diagnosed and repaired.

Use the following procedure to avoid losing quorum when one disk is missing or might soon fail and requires repair:

To temporarily remove the volume from the volume group, type:
```
chpv -vr PVname
```
When this command completes, the physical volume PVname is no longer factored in quorum checking. However, in a two-disk volume group, this command fails if you try the chpv command on the disk that contains the two VGDA/VGSAs. The command does not allow you to cause quorum to be lost.
If you need to remove the disk for repair, power off the system, and remove the disk. (For instructions, see Disk Drive Problems.) After fixing the disk and returning the disk to the system, continue with the next step.
To make the disk available again to the volume group for quorum checking, type:
```
chpv -v PVname
```
Note

The chpv command is used only for quorum-checking alteration. The data that resides on the disk is still there and must be moved or copied to other disks if the disk is not to be returned to the system.

VGDA Warnings

In some instances, the user experiences a problem adding a new disk to an existing volume group or creating a new volume group. The message provided by LVM is:

0516-1163 extendvg: VGname already has maximum physical volumes.  With the maximum
number of physical partitions per physical volume being LIMIT, the maximum
number of physical volumes for volume group VGname is MaxDisks.

Where:

VGname: Is the name of the volume group.
LIMIT: Is 1016 or a multiple of 1016.
MaxDisks: Is the maximum number of disks in a volume group. For example, if there are 1016 physical partitions (PPs) per disk, then MaxDisks is 32; if there are 2032, then MaxDisks is 16.
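
The two examples given for MaxDisks are consistent with a fixed budget of 32 x 1016 = 32512 physical partitions for the volume group. The following sketch assumes that this relationship also holds for other multiples of 1016, which the text above does not state explicitly. The function name max_disks and the default budget are assumptions made for illustration.

```shell
# max_disks LIMIT [TOTAL]  (hypothetical helper)
# Print how many disks fit in a volume group when each disk may hold up
# to LIMIT physical partitions. TOTAL is the assumed partition budget of
# the volume group and defaults to 32 * 1016 = 32512.
max_disks() {
    limit=$1
    total=${2:-32512}
    if [ "$limit" -le 0 ] || [ $((limit % 1016)) -ne 0 ]; then
        echo "LIMIT must be a positive multiple of 1016" >&2
        return 1
    fi
    echo $((total / limit))
}

max_disks 1016   # prints 32
max_disks 2032   # prints 16
```

This makes the trade-off visible: raising the per-disk partition limit lowers the number of disks the volume group can hold.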

You can modify the image.data file and then use alternate disk installation, or restore the system using the mksysb command to re-create the volume group as a big volume group. For more information, see the AIX 5L Version 5.2 Installation Guide and Reference.

On older AIX versions when the limit was smaller than 32 disks, the exception to this description of the maximum VGDA was the rootvg. To provide users with more free disk space, when rootvg was created, the mkvg -d command used the number of disks selected in the installation menu as the reference number. This -d number is 7 for one disk and one more for each additional disk selected. For example, if two disks are selected, the number is 8 and if three disks are selected, the number is 9, and so on.
