AIX Version 4.3 System Management Guide: Operating System and Devices

Recovering from File System, Disk Drive, or Controller Failure

A file system can become corrupted when the i-node or superblock information that describes its directory structure is damaged. The damage can be caused by a hardware fault or by a corrupted program that accesses the i-node or superblock information directly. (Programs written in assembler and C can bypass the operating system and write directly to the hardware.) One symptom of a corrupt file system is that the system cannot locate, read, or write data in that file system.

A disk drive can suffer intermittent or permanent read/write problems. A drive that makes loud squealing or scratching noises is probably about to fail. Usually, however, you will not notice that a drive has gone bad while it is still running; the device refuses to work only when you try to restart the system. (At this point, it is usually too late to retrieve the lost data.)

A controller failure can look much like a drive failure. The difference is scope: when a drive fails, you cannot access that particular drive; when a controller fails, you lose access to all, or many, of the drives in the system. A controller fails because some electrical component on the controller board fails.
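The scope distinction above can be turned into a rough first-pass heuristic. The following is a minimal sketch only: the function name classify_failure and its output labels are hypothetical and not part of AIX. It assumes input lines of the form "name status ..." (the shape printed by lsdev -Cc disk) and assumes that a disk which cannot be addressed is listed with a status other than Available.

```shell
# classify_failure: hypothetical helper, not an AIX command.
# Reads lines of the form "<name> <status> ..." on standard input and
# prints a rough guess based on how many disks are not "Available".
classify_failure() {
    awk '
        NF >= 2 { total++; if ($2 != "Available") bad++ }
        END {
            if (total == 0)                 print "unknown"
            else if (bad == 0)              print "no-disk-fault"
            else if (bad == 1 && total > 1) print "suspect-drive"
            else if (bad > 1)               print "suspect-controller"
            else                            print "unknown"   # single disk: cannot tell
        }
    '
}
```

A possible use is lsdev -Cc disk | classify_failure. The result is only a hint for where to start looking; it does not replace the hardware diagnostics.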

Note: Hardware problems are usually the most difficult to diagnose, because no two hardware failures are exactly the same. Different components on the same kind of board can fail, and each failure can cause a completely different set of symptoms. For help in diagnosing hardware problems, refer to the AIX Version 4.3 Problem Solving Guide and Reference.

Prerequisites

You must have root user or system group authority to execute this task.

Procedure

  1. Make sure that you have backups of the data.

  2. Reboot the machine with diagnostic diskettes, and determine whether the problem is the file system, disk drive, or controller.

  3. If the file system is the problem, try using the fsck command or the smit fsck fast path to correct the problem. (For information about using the fsck command, refer to "Fixing a Damaged File System".)

  4. If the disk drive is the problem, determine whether you can still address the disk drive (if it is available). There are three ways to do this:

  5. If you can still address the disk drive, reformat it, marking the bad sectors. (For information about reformatting a disk drive, refer to "Reformatting a Disk Drive".)

  6. If the controller card or other hardware is the problem, try replacing it with another card.
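
Step 3 of the procedure can be sketched as a check-first, repair-second sequence. This is an illustrative sketch, not the documented method: the helper name check_then_repair and the device path in the usage line are hypothetical. It assumes that fsck -n answers no to every prompt (so nothing is modified), that fsck -y answers yes to every prompt, that fsck returns a nonzero exit status when it finds problems, and that the file system is unmounted.

```shell
# check_then_repair: hypothetical wrapper around fsck (assumed flags: -n, -y).
# Usage: check_then_repair /dev/lvXX   (device path is a placeholder)
check_then_repair() {
    fs="$1"
    # First pass: read-only check, no changes are written.
    if fsck -n "$fs"; then
        echo "$fs: no problems reported"
    else
        echo "$fs: problems reported, attempting repair"
        # Second pass: let fsck apply its proposed fixes.
        fsck -y "$fs"
    fi
}
```

Running the read-only pass first lets you see what fsck would change before anything is written, which matters when the backups called for in step 1 are incomplete.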
