
System Management Guide:
Operating System and Devices

LVM Troubleshooting Tasks

The topics in this section provide diagnostics and recovery procedures to use if you encounter disk drive problems, physical or logical volume errors, or volume group errors.

Disk Drive Problems

If your disk drive is running out of available space, see Getting More Space on a Disk Drive. If you suspect a disk drive is mechanically failing or has failed, run diagnostics on the disk by using the following procedure:

  1. With root authority, type the following SMIT fast path on the command line:
    smit diag
  2. Select Current Shell Diagnostics to enter the AIX Diagnostics tool.
  3. After you read the Diagnostics Operating Instructions screen, press Enter.
  4. Select Diagnostics Routines.
  5. Select System Verification.
  6. Scroll down through the list to find and select the drive you want to test.
  7. Select Commit.
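If you prefer not to use SMIT, you can start the diagnostics tool directly from the command line. The following is a minimal sketch; the device name hdisk3 is an example, and it assumes your AIX level supports running diag in non-interactive mode with the -c (command mode) and -d (device) flags:

# Start the interactive AIX diagnostics tool
diag

# Run diagnostics against a single device in non-interactive mode
diag -c -d hdisk3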

Based on the diagnostic results, you should be able to determine the condition of the disk. If the disk has failed or is failing, see the recovery procedures later in this section.

Getting More Space on a Disk Drive

If you run out of space on a disk drive, there are several ways to remedy the problem. You can automatically track and remove unwanted files, restrict users from certain directories, or mount space from another disk drive.

You must have root user, system group, or administrative group authority to execute these tasks.

Cleaning Up File Systems Automatically

Use the skulker command to clean up file systems by removing unwanted files. Type the following from the command line:

skulker -p

The skulker command is used to periodically purge obsolete or unneeded files from file systems. Candidates include files in the /tmp directory, files older than a specified age, a.out files, core files, or ed.hup files.

The skulker command is typically run daily, as part of an accounting procedure run by the cron command during off-peak hours. For more information about using the skulker command in a cron process, see Fix Disk Overflows.

For information on typical cron entries, see Setting Up an Accounting System.
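For illustration, a root crontab entry similar to the following runs the skulker command nightly. The 3:00 a.m. schedule is only an example; adjust it to your own off-peak hours:

# Run skulker every day at 3:00 a.m. (minute hour day month weekday)
0 3 * * * /usr/sbin/skulker -p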

Restricting Users from Certain Directories

Another way to release disk space and possibly to keep it free is to restrict and monitor disk usage.
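For example, before deciding where to restrict users, you can find which directories consume the most space by combining the standard du and sort commands. A minimal sketch, assuming user directories reside under /home:

# Report per-directory disk usage in 1 KB blocks, largest first
du -sk /home/* | sort -rn | head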

Mounting Space from Another Disk Drive

Another way to get more space on a disk drive is to mount space from another drive, as shown in the example below.

For more information about mounting file systems, see Mount a JFS or JFS2.
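As a hedged illustration, assuming a journaled file system already exists on logical volume lv02 on another drive and that the empty directory /bigdata is the desired mount point (both names are hypothetical):

# Mount the file system on lv02 over the /bigdata directory
mount /dev/lv02 /bigdata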

Recovering a Disk Drive without Reformatting

If you repair a bad disk and place it back in the system without reformatting it, you can let the system automatically activate and resynchronize the stale physical partitions on the drive at boot time. A stale physical partition contains data your system cannot use.

If you suspect a stale physical partition, type the following on the command line:

lspv -M PhysVolName

Where PhysVolName is the name of your physical volume. The lspv command output will list all partitions on your physical volume. The following is an excerpt from example output:

hdisk16:112     lv01:4:2        stale
hdisk16:113     lv01:5:2        stale
hdisk16:114     lv01:6:2        stale
hdisk16:115     lv01:7:2        stale
hdisk16:116     lv01:8:2        stale
hdisk16:117     lv01:9:2        stale
hdisk16:118     lv01:10:2       stale

The first column displays the physical partitions and the second column displays the logical partitions. Any stale physical partitions are noted in the third column.
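If you prefer not to wait for a reboot, you can resynchronize the stale partitions immediately with the syncvg command. A minimal sketch, using the hdisk16 physical volume from the example output above:

# Resynchronize all stale partitions residing on hdisk16
syncvg -p hdisk16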

Recovering Using a Reformatted or Replacement Disk Drive

This section describes how to recover data from a failed disk drive when you must reformat or replace the failed disk.

Attention: Before you reformat or replace a disk drive, remove all references to nonmirrored file systems from the failing disk and remove the disk from the volume group and system configuration. If you do not, you create problems in the ODM (Object Data Manager) and system configuration databases. Instructions for these essential steps are included in the following procedure, under Before replacing or reformatting your failed or failing disk.

The following procedure uses a scenario in which the volume group called myvg contains three disk drives called hdisk2, hdisk3, and hdisk4. In this scenario, hdisk3 goes bad. The nonmirrored logical volume lv01 and a copy of the mylv logical volume are contained on hdisk2. The mylv logical volume is mirrored and has three copies, each of which takes up two physical partitions on its disk. The failing hdisk3 contains another copy of mylv and the nonmirrored logical volume called lv00. Finally, hdisk4 contains a third copy of mylv as well as a logical volume called lv02.

This procedure is divided into three key segments: the steps you take before replacing or reformatting the failed or failing disk, the replacement or reformatting itself, and the steps you take afterward.

Before replacing or reformatting your failed or failing disk:

  1. Log in with root authority.
  2. If you are not familiar with the logical volumes that are on the failing drive, use an operational disk to view the contents of the failing disk. For example, to use hdisk4 to look at hdisk3, type the following on the command line:

    lspv -M -n hdisk4 hdisk3

    The lspv command displays information about a physical volume within a volume group. The output looks similar to the following:

    hdisk3:1        mylv:1
    hdisk3:2        mylv:2
    hdisk3:3        lv00:1
    hdisk3:4-50

    The first column displays the physical partitions, and the second column displays the logical partitions. Partitions 4 through 50 are free.

  3. Back up all single-copy logical volumes on the failing device, if possible. For instructions, see Backing Up and Restoring Information.
  4. If you have single-copy file systems, unmount them from the disk. (You can identify single-copy file systems from the output of the lspv command. Single-copy file systems have the same number of logical partitions as physical partitions on the output.) Mirrored file systems do not have to be unmounted.

    In the scenario, lv00 on the failing disk hdisk3 is a single-copy file system. To unmount it, type the following:

    unmount /dev/lv00

    If you do not know the name of the file system, assuming the /etc/filesystems file is not solely located on the failed disk, type mount on the command line to list all mounted file systems and find the name associated with your logical volume. You can also use the grep command on the /etc/filesystems file to list only the file system names, if any, associated with your logical volume. For example:

    grep lv00 /etc/filesystems

    The output looks similar to the following:

    dev             = /dev/lv00
    log             = /dev/loglv00

    Notes:
    1. The unmount command fails if the file system you are trying to unmount is currently being used. The unmount command executes only if none of the file system's files are open and no user's current directory is on that device.
    2. Another name for the unmount command is umount. The names are interchangeable.
  5. Remove all single-copy file systems from the failed physical volume by using the rmfs command:

    rmfs /FSname

    Where /FSname is the mount point of the file system. The rmfs command removes the file system and its underlying logical volume.
  6. Remove the copies of all mirrored logical volumes that are located on the failing disk.
    Note
    You cannot use rmlvcopy on the hd5 and hd7 logical volumes from physical volumes in the rootvg volume group. The system does not allow you to remove these logical volumes because only one copy of each exists.

    The rmlvcopy command removes copies from each logical partition. For example, type:

    rmlvcopy mylv 2 hdisk3

    By removing the copy on hdisk3, you reduce the number of copies of each logical partition belonging to the mylv logical volume from three to two (one on hdisk4 and one on hdisk2).

  7. If the failing disk was part of the root volume group and contained logical volume hd7, remove the primary dump device (hd7) by typing the following on the command line:

    sysdumpdev -P -p /dev/sysdumpnull

    The sysdumpdev command changes the primary or secondary dump device location for a running system. The -P flag makes the change permanent; without it, the dump device returns to its original location at the next reboot.

  8. Remove any paging space located on the disk using the following command:
    rmps PSname

    Where PSname is the name of the paging space to be removed, which is actually the name of the logical volume on which the paging space resides.

    If the rmps command is not successful, you must use the smit chps fast path to deactivate the primary paging space and reboot before continuing with this procedure. The reducevg command in step 10 can fail if there are active paging spaces.

  9. Remove any other logical volumes from the volume group, such as those that do not contain a file system, using the rmlv command. For example, type:

    rmlv -f lv00
  10. Remove the failed disk from the volume group using the reducevg command. For example, type:

    reducevg -df myvg hdisk3

    If you cannot execute the reducevg command or if the command is unsuccessful, the procedure in step 13 can help clean up the VGDA/ODM information after you reformat or replace the drive.
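When the reducevg command succeeds, you can confirm that the disk no longer belongs to any volume group; a removed disk shows None in the volume group column of the lspv output. The output below is abridged and illustrative:

lspv
# Example (abridged): hdisk3 now shows None in the volume group column
# hdisk3          00083772caa7896e        None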

Replacing or reformatting your failed or failing disk:

  11. The next step depends on whether you want to reformat or replace your disk and on what type of hardware you are using.

After replacing or reformatting your failed or failing disk:

  12. Follow the instructions in Configuring a Disk and Making an Available Disk a Physical Volume.
  13. If you could not use the reducevg command on the disk from the old volume group before the disk was formatted (step 10), the following procedure can help clean up the VGDA/ODM information.
    1. If the volume group consisted of only one disk that was reformatted, type:
      exportvg VGName

      Where VGName is the name of your volume group.

    2. If the volume group consists of more than one disk, type the following on the command line:
      varyonvg VGName

      The system displays a message about a missing or unavailable disk, and the new (or reformatted) disk is listed. Note the physical volume identifier (PVID) of the new disk, which is listed in the varyonvg message. It is the 16-character string between the name of the missing disk and the label PVNOTFND. For example:

      hdisk3 00083772caa7896e PVNOTFND

      Type:

      varyonvg -f VGName

      The missing disk is now displayed with the PVREMOVED label. For example:

      hdisk3 00083772caa7896e PVREMOVED

      Then, type the command:

      reducevg -df VGName PVID

      Where PVID is the physical volume identifier (in this scenario, 00083772caa7896e).

  14. To add the new disk drive to the volume group, use the extendvg command. For example, type:
    extendvg myvg hdisk3
  15. To re-create the single-copy logical volumes on the new (or reformatted) disk drive, use the mklv command. For example, type:
    mklv -y lv00 myvg 1 hdisk3

    This example re-creates the lv00 logical volume on the hdisk3 drive. The 1 allocates one logical partition; because no additional copies are requested, the logical volume is not mirrored.

  16. To re-create the file system on the logical volume, use the crfs command. For example, type:
    crfs -v jfs -d lv00 -m /FSname

    Where /FSname is the mount point of the file system (the same mount point that was removed in step 5).
  17. To restore single-copy file system data from backup media, see Restoring from Backup Image Individual User Files.
  18. To re-create the mirrored copies of logical volumes, use the mklvcopy command. For example, type:
    mklvcopy mylv 3 hdisk3

    This example creates a third copy of each logical partition of the mylv logical volume, placing the new copies on hdisk3.

  19. To synchronize the new mirror with the data on the other mirrors (in this example, hdisk2 and hdisk4), use the syncvg command. For example, type:
    syncvg -p hdisk3

At this point, all mirrored file systems should be restored and up-to-date. If you were able to back up your single-copy file systems, they will also be ready to use. You should be able to proceed with normal system use.
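As a final check, you can list the logical volumes in the volume group and confirm that the mirrored ones are synchronized; in the lsvg output, a healthy mirrored logical volume shows an LV STATE of open/syncd rather than open/stale:

# List all logical volumes in myvg with their state
lsvg -l myvg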

Example of Recovery from a Failed Disk Drive

To recover from a failed disk drive, back out the way you came in; that is, list the steps you went through to create the volume group, and then go backwards. The following example is an illustration of this technique. It shows how a mirrored logical volume was created and then how it was altered, backing out one step at a time, when a disk failed.

Note
The following example illustrates a specific instance. It is not intended as a general prototype on which to base any general recovery procedures.
  1. The system manager, Jane, created a volume group called workvg on hdisk1, by typing:
    mkvg -y workvg hdisk1
  2. She then added two more disks to this volume group by typing:
    extendvg workvg hdisk2
    
    extendvg workvg hdisk3
  3. Jane created a logical volume of 40 MB that has three copies. Each copy resides on a different one of the three disks that comprise the workvg volume group. She used the following commands:
    mklv -y testlv workvg 10
    
    mklvcopy testlv 3

After Jane created the mirrored workvg volume group, hdisk2 failed. Therefore, she took the following steps to recover:

  1. She removed the logical volume copy from hdisk2 by typing:
    rmlvcopy testlv 2 hdisk2
  2. She detached hdisk2 from the system so that the ODM and VGDA are updated, by typing:
    reducevg workvg hdisk2
  3. She removed hdisk2 from the system configuration to prepare for replacement by typing:
    rmdev -l hdisk2 -d
  4. She chose to shut down the system, by typing:
    shutdown -F
  5. She replaced the disk. The new disk did not have the same SCSI ID as the former hdisk2.
  6. She rebooted the system.

    Because the disk is new (the system detects a new PVID on it), the system chooses the first available hdisk name. Because the -d flag was used in step 3, the name hdisk2 was released, so the system chose hdisk2 as the name of the new disk. If the -d flag had not been used, hdisk4 would have been chosen as the new name.

  7. Jane added this disk into the workvg volume group by typing:
    extendvg workvg hdisk2
  8. She re-created the mirrored copies of the logical volume (for a total of three) by typing:
    mklvcopy testlv 3

    The Logical Volume Manager automatically placed the third logical volume copy on the new hdisk2.

Recovering from Disk Failure while the System Remains Available

The procedure to recover from disk failure using the hot removability feature is, for the most part, the same as described in Recovering a Disk Drive without Reformatting, with the following exceptions:

  1. To unmount file systems on a disk, use the procedure Mount a JFS or JFS2.
  2. To remove the disk from its volume group and from the operating system, use the procedure Removing a Disk without Data.
  3. To replace the failed disk with a new one, you do not need to shut down the system. Use the following sequence of procedures:
    1. Adding Disks while the System Remains Available
    2. Configuring a Disk
    3. Continue with step 13 of Recovering Using a Reformatted or Replacement Disk Drive.

Replacing a Disk When the Volume Group Consists of One Disk

If you can access a disk that is going bad as part of a volume group, use one of the disk recovery procedures described earlier in this section.

If the disk is bad and cannot be accessed, follow these steps:

  1. Export the volume group.
  2. Replace the drive.
  3. Restore the data from existing backup media.

Physical or Logical Volume Errors

This section describes common physical and logical volume problems and their solutions.

Hot Spot Problems

If you notice performance degradation when accessing logical volumes, you might have hot spots in your logical volumes that are experiencing too much disk I/O. For more information, see AIX 5L Version 5.2 System Management Concepts: Operating System and Devices and Enabling and Configuring Hot Spot Reporting.
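As an illustration, the lvmstat command can collect the per-partition I/O statistics on which hot spot analysis is based. A minimal sketch, assuming a volume group named datavg (a hypothetical name) and an AIX level that provides lvmstat:

# Enable statistics collection for the volume group
lvmstat -v datavg -e

# Report I/O activity for each logical volume in the group;
# unusually busy volumes are hot spot candidates
lvmstat -v datavg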

LVCB Warnings

The logical volume control block (LVCB) is the first 512 bytes of a logical volume. This area holds important information such as the creation date of the logical volume, information about mirrored copies, and possible mount points in the JFS. Certain LVM commands are required to update the LVCB as part of the algorithms in LVM. The old LVCB is read and analyzed to see whether it is valid. If the LVCB information is valid, the LVCB is updated. If the information is not valid, the update is not performed, and you might receive the following message:

Warning, cannot write lv control block data.

Most of the time, this message results when database programs bypass the JFS and access raw logical volumes as storage media. When this occurs, the information for the database is literally written over the LVCB. For raw logical volumes, this is not fatal. After the LVCB is overwritten, the user can still access the raw logical volume and perform normal LVM operations on it.

There are limitations to deleting LVCBs. A logical volume with a deleted LVCB might not import successfully to other systems. During an import, the LVM importvg command scans the LVCBs of all defined logical volumes in a volume group for information concerning the logical volumes. If the LVCB does not exist, the imported volume group still defines the logical volume to the new system that is accessing this volume group, and the user can still access the raw logical volume. However, information normally taken from the LVCB, such as the JFS mount point, is typically lost to the new system.
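For reference, a hedged sketch of such an import, assuming the volume group resides on hdisk5 and is to be known as datavg on the new system (both names are hypothetical):

# Import the volume group found on hdisk5 under the name datavg
importvg -y datavg hdisk5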

Physical Partition Limits

In the design of the Logical Volume Manager (LVM), each logical partition maps to one physical partition (PP), and each physical partition maps to a number of disk sectors. The design of LVM limits the number of physical partitions that LVM can track per disk to 1016. In most cases, not all of the 1016 tracking partitions are used by a disk. When this limit is exceeded, you might see a message similar to the following:

0516-1162 extendvg: Warning, The Physical Partition Size of PPsize requires the
creation of TotalPPs partitions for PVname.  The limitation for volume group
VGname is LIMIT physical partitions per physical volume.  Use chvg command
with -t option to attempt to change the maximum Physical Partitions per
Physical volume for this volume group.

Where:

PPsize
Is 1 MB to 1 GB in powers of 2.
TotalPPs
Is the total number of physical partitions on this disk, given the PPsize.
PVname
Is the name of the physical volume, for example, hdisk3.
VGname
Is the name of the volume group.
LIMIT
Is 1016 or a multiple of 1016.

This limitation is enforced in the following instances:

  1. When creating a volume group using the mkvg command, you specified a number of physical partitions on a disk in the volume group that exceeded 1016. To avoid this limitation, you can select from the physical partition size ranges of 1, 2, 4 (the default), 8, 16, 32, 64, 128, 256, 512 or 1024 MB and use the mkvg -s command to create the volume group. Alternatively, you can use a suitable factor that allows multiples of 1016 partitions per disk, and use the mkvg -t command to create the volume group.
  2. When adding a disk to a pre-existing volume group with the extendvg command, the new disk caused the 1016 limitation to be exceeded. To resolve this situation, convert the existing volume group to hold multiples of 1016 partitions per disk using the chvg -t command (see the example following this list). Alternatively, you can re-create the volume group with a larger partition size that allows the new disk, or you can create a standalone volume group with a larger physical partition size for the new disk.
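As an illustration of the second case, a conversion factor of 2 raises the per-disk limit to 2032 (2 x 1016) partitions while reducing the maximum number of disks the volume group can hold. A minimal sketch, assuming a volume group named myvg:

# Allow up to 2032 physical partitions per physical volume;
# this halves the maximum number of disks in the volume group
chvg -t 2 myvg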
Partition Limitations and the rootvg

If the installation code detects that the rootvg drive is larger than 4 GB, it increases the mkvg -s value (the physical partition size) until the entire disk capacity can be mapped with the available 1016 tracking partitions. This installation change also implies that all other disks added to rootvg, regardless of size, are defined at that same physical partition size.
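To see which physical partition size the installation chose, you can inspect the volume group attributes; the PP SIZE field of the lsvg output reports it:

# Display rootvg attributes, including the PP SIZE field
lsvg rootvg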

Partition Limitations and RAID Systems

For systems using a redundant array of independent disks (RAID), the /dev/hdiskX name used by LVM may consist of many non-4 GB disks. In this case, the 1016 requirement still exists. LVM is unaware of the size of the individual disks that really make up /dev/hdiskX. LVM bases the 1016 limitation on the recognized size of /dev/hdiskX, not on the real physical disks that make up /dev/hdiskX.

Synchronizing the Device Configuration Database

A system malfunction can cause the device configuration database to become inconsistent with the LVM. When this happens, a logical volume command generates such error messages as:

0516-322 The Device Configuration Database is inconsistent ...

OR

0516-306 Unable to find logical volume LVname in the Device
Configuration Database.

(where the logical volume called LVname is normally available).

Attention: Do not remove the /dev entries for volume groups or logical volumes. Do not change the database entries for volume groups or logical volumes using the Object Data Manager.

To synchronize the device configuration database with the LVM information, with root authority, type the following on the command line:

synclvodm -v VGName

Where VGName is the name of the volume group you want to synchronize.

Volume Group Errors

If the importvg command is not working correctly, try refreshing the device configuration database. See Synchronizing the Device Configuration Database.

Overriding a Vary-On Failure
Attention: Overriding a vary-on failure is an unusual operation; check all other possible problem sources such as hardware, cables, adapters, and power sources before proceeding. Overriding a quorum failure during a vary-on process is used only in an emergency and only as a last resort (for example, to salvage data from a failing disk). Data integrity cannot be guaranteed for management data contained in the chosen copies of the VGDA and the VGSA when a quorum failure is overridden.

When you choose to forcibly vary on a volume group by overriding the absence of a quorum, the PV STATE of all physical volumes that are missing during this vary-on process is changed to removed. This means that all the VGDA and VGSA copies are removed from these physical volumes. After this is done, these physical volumes no longer take part in quorum checking, nor are they allowed to become active within the volume group until you return them to the volume group.

Under one or more of the following conditions, you might want to override the vary-on failure so that the data on the available disks in the volume group can be accessed:

Use the following procedure to avoid losing quorum when one disk is missing or might soon fail and requires repair:

  1. To temporarily remove the volume from the volume group, type:
    chpv -vr PVname
    When this command completes, the physical volume PVname is no longer factored in quorum checking. However, in a two-disk volume group, this command fails if you try the chpv command on the disk that contains the two VGDA/VGSAs. The command does not allow you to cause quorum to be lost.
  2. If you need to remove the disk for repair, power off the system, and remove the disk. (For instructions, see Disk Drive Problems.) After fixing the disk and returning the disk to the system, continue with the next step.
  3. To make the disk available again to the volume group for quorum checking, type:
    chpv -va PVname
    Note
    The chpv command is used only for quorum-checking alteration. The data that resides on the disk is still there and must be moved or copied to other disks if the disk is not to be returned to the system.

VGDA Warnings

In some instances, the user experiences a problem adding a new disk to an existing volume group or creating a new volume group. The message provided by LVM is:

0516-1163 extendvg: VGname already has maximum physical volumes.  With the maximum
number of physical partitions per physical volume being LIMIT, the maximum
number of physical volumes for volume group VGname is MaxDisks.

Where:

VGname
Is the name of the volume group.
LIMIT
Is 1016 or a multiple of 1016.
MaxDisks
Is the maximum number of disks in a volume group. For example, if there are 1016 physical partitions (PPs) per disk, then MaxDisks is 32; if there are 2032, then MaxDisks is 16.

You can modify the image.data file and then use alternate disk installation, or restore the system using the mksysb command to re-create the volume group as a big volume group. For more information, see the AIX 5L Version 5.2 Installation Guide and Reference.
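On AIX levels that support it, an existing volume group can sometimes be converted in place to the big volume group format with the chvg command. Treat the following as a hedged sketch, not a substitute for the documented image.data or mksysb procedures; the conversion has prerequisites, such as free physical partitions on each disk for the expanded VGDA:

# Convert myvg to big volume group format, raising its maximum
# number of physical volumes (check prerequisites for your AIX level)
chvg -B myvg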

On older AIX versions, when the limit was smaller than 32 disks, the exception to this maximum was rootvg. To provide users with more free disk space, when rootvg was created, the mkvg -d command used the number of disks selected in the installation menu as the reference number. This -d number is 7 for one disk, plus one more for each additional disk selected. For example, if two disks are selected, the number is 8; if three disks are selected, the number is 9; and so on.
