System Management Guide:
Operating System and Devices

File Systems Troubleshooting Tasks

The topics in this section provide diagnostic and recovery procedures to use if you encounter a disk overflow or a damaged file system.

Fix Disk Overflows

A disk overflow occurs when too many files fill up the allotted space. This can be caused by a runaway process that creates many unnecessary files. You can use the following procedures to correct the problem:

Note

You must have root user authority to remove processes other than your own.
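
Before working through the procedures, it can help to confirm which file system is actually full. The following is a minimal sketch, assuming a df command that accepts the POSIX -k and -P flags. The name full_filesystems and the 90 percent threshold are hypothetical choices for illustration, not part of the operating system.

```shell
# Hypothetical helper: print mount points at or above a usage threshold.
# Reads "df -kP" style output on stdin; $1 is the threshold in percent.
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                       # skip the header line
            used = $5
            sub(/%/, "", used)         # strip the trailing percent sign
            if (used + 0 >= limit + 0)
                print $6, $5
        }'
}

# Report file systems that are at least 90% full.
df -kP | full_filesystems 90
```

Mount points that contain spaces are not handled by this sketch.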

Identify Problem Processes

Use the following procedure to isolate problem processes.

To check the process status and identify processes that might be causing the problem, type:
```
ps -ef | pg 
```
The ps command shows the process status. The -e flag writes information about all processes (except kernel processes), and the -f flag generates a full listing of processes including what the command name and parameters were when the process was created. The pg command limits output to a single page at a time, so information does not scroll too quickly off the screen.

Check for system or user processes that are using excessive amounts of a system resource, such as CPU time. System processes such as sendmail, routed, and lpd are among those most prone to becoming runaways.
To check for user processes that use more CPU than expected, type:
```
ps -u
```
Note the process ID (PID) of each problem process.
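
To make heavy CPU consumers easier to spot, the process list can be ranked by accumulated CPU time. The following is a sketch only, assuming a ps command that accepts the POSIX -o flag and reports time as [[dd-]hh:]mm:ss. The name rank_by_cpu is hypothetical.

```shell
# Hypothetical helper: read "pid time command" lines on stdin and print
# "seconds pid command", sorted with the largest CPU time first.
rank_by_cpu() {
    awk '
        function seconds(t,    d, p, n, i, s, days) {
            days = 0
            if (index(t, "-") > 0) {       # optional "dd-" prefix
                split(t, d, "-")
                days = d[1]
                t = d[2]
            }
            n = split(t, p, ":")
            s = 0
            for (i = 1; i <= n; i++)       # fold hh:mm:ss left to right
                s = s * 60 + p[i]
            return days * 86400 + s
        }
        { print seconds($2), $1, $3 }' | sort -rn
}

# Show the ten processes that have used the most CPU time.
ps -e -o pid= -o time= -o comm= | rank_by_cpu | head -n 10
```

Only the first word of each command name is kept, which is enough to identify the process before looking it up with ps -ef.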

Terminate the Process

Use the following procedure to terminate a problem process:

Terminate the process that is causing the problem by typing:
```
kill -9 PID
```
Where PID is the ID of the problem process.
Remove the files the process has been making by typing:
```
rm file1 file2 file3
```
Where file1 file2 file3 represents names of process-related files.
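
The kill -9 signal cannot be caught, so the process gets no chance to clean up. A common variation, which is not part of the procedure above, is to send signal 15 (SIGTERM) first and fall back to signal 9 only if the process is still running after a grace period. The following sketch assumes that approach; terminate_pid and the default five-second grace period are hypothetical.

```shell
# Hypothetical helper: ask a process to exit, then force it if needed.
# $1 is the PID; $2 is the grace period in seconds (default 5).
terminate_pid() {
    pid=$1
    grace=${2:-5}
    kill -15 "$pid" 2>/dev/null || return 0    # process is already gone
    while [ "$grace" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1                                # wait for a clean exit
        grace=$((grace - 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"                         # last resort
    fi
}

# Usage: terminate_pid PID [seconds]
```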

Reclaim File Space without Terminating the Process

When an active file is removed from the file system, the blocks allocated to the file remain allocated until the last open reference is removed, either as a result of the process closing the file or because of the termination of the processes that have the file open. If a runaway process is writing to a file and the file is removed, the blocks allocated to the file are not freed until the process terminates.

To reclaim the blocks allocated to the active file without terminating the process, redirect the output of another command to the file. The redirection truncates the file and frees its disk blocks. For example:

```
$ ls -l
total 1248
-rwxrwxr-x      1 web   staff   1274770 Jul 20 11:19 datafile
$ date > datafile
$ ls -l
total 4
-rwxrwxr-x      1 web   staff        29 Jul 20 11:20 datafile
```

The output of the date command replaced the previous contents of the datafile file. The blocks reported for the truncated file reflect the size difference, from 1248 to 4. If the runaway process continues to append information to this newly truncated file, the next ls command produces the following results:

```
$ ls -l
total 8
-rwxrwxr-x      1 web   staff   1278866 Jul 20 11:21 datafile
```

The size of the datafile file reflects the append done by the runaway process, but the number of blocks allocated is small. The datafile file now has a hole in it. File holes are regions of the file that do not have disk blocks allocated to them.
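
If the output of a command such as date is not wanted in the file, the same truncation can be done with an empty redirection. The following is a minimal sketch using the shell's null command (:), which is a general shell idiom rather than something this guide prescribes. The name truncate_in_place and the scratch file path are hypothetical.

```shell
# Hypothetical helper: truncate a file in place without removing it,
# so a process that still has it open keeps a valid file.
truncate_in_place() {
    : > "$1"        # ":" writes nothing, so the file becomes empty
}

# Demonstration on a scratch file.
scratch="${TMPDIR:-/tmp}/datafile.$$"
dd if=/dev/zero of="$scratch" bs=1024 count=64 2>/dev/null
truncate_in_place "$scratch"
ls -l "$scratch"    # the size column should now read 0
rm -f "$scratch"
```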

Fix a / (root) Overflow

Check the following when the root file system (/) has become full:

Use the following command to read the contents of the /etc/security/failedlogin file:
```
who /etc/security/failedlogin
```
The condition of TTYs respawning too rapidly can create failed login entries. To clear the file after reading or saving the output, execute the following command:
```
cp /dev/null /etc/security/failedlogin
```
Check the /dev directory for a device name that is typed incorrectly. If a device name is typed incorrectly, such as rmto instead of rmt0, a file will be created in /dev called rmto. The command will normally proceed until the entire root file system is filled before failing. /dev is part of the root (/) file system. Look for entries that are not devices (that do not have a major or minor number). To check for this situation, use the following command:
```
cd /dev
ls -l | pg
```
In the same location that would indicate a file size for an ordinary file, a device file has two numbers separated by a comma. For example:
```
crw-rw-rw-   1 root     system    12,0 Oct 25 10:19 rmt0
```
If the file name or size location indicates an invalid device, as shown in the following example, remove the associated file:
```
crw-rw-rw-   1 root     system   9375473 Oct 25 10:19 rmto
```
Notes:
1. Do not remove valid device names in the /dev directory. One indicator of an invalid device is an associated file size that is larger than 500 bytes.
2. If system auditing is running, the default /audit directory can rapidly fill up and require attention.
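
The /dev check described above can also be narrowed so that only ordinary files are listed, which avoids paging through every device entry. The following is a sketch, assuming a find command that supports the -type and -xdev flags. The name stray_files is hypothetical.

```shell
# Hypothetical helper: list regular files (not device nodes) under a
# directory. -type f matches ordinary files only, and -xdev keeps the
# search on one file system.
stray_files() {
    find "$1" -xdev -type f -exec ls -l {} +
}

# Usage: stray_files /dev
```

Review each file this reports before removing it; the sketch only lists candidates.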
Use the find command to check for very large files that might be removed. For example, to find all files in the root (/) file system larger than 1 MB, use the following command:
```
find / -xdev -size  +2048 -ls |sort -r  +6
```
This command finds all files greater than 1 MB and sorts them in reverse order with the largest files first. Other flags for the find command, such as -newer, might be useful in this search. For detailed information, see the command description for the find command.
Note

When checking the root directory, major and minor numbers for devices in the /dev directory will be interspersed with real files and file sizes. Major and minor numbers, which are separated by a comma, can be ignored.
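
One way to keep device entries out of the listing altogether is to restrict the search to regular files and sort on the size column numerically. The following is a sketch under those assumptions; large_files is a hypothetical name, and the sort key assumes that the size is the fifth column of ls -l output.

```shell
# Hypothetical helper: list regular files larger than 1 MB, largest first.
# -size +2048 counts 512-byte blocks; -type f skips device nodes, so no
# major,minor pairs appear in the size column.
large_files() {
    find "$1" -xdev -type f -size +2048 -exec ls -l {} + |
        sort -k 5,5nr
}

# Usage: large_files /
```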

Before removing any files, use the following command to ensure a file is not currently in use by a user process:
```
fuser filename
```
Where filename is the name of the suspect large file. If a file is open at the time of removal, it is only removed from the directory listing. The blocks allocated to that file are not freed until the process holding the file open closes it or terminates.

Fix a /var Overflow

Check the following when the /var file system has become full:

You can use the find command to look for large files in the /var directory. For example:
```
find /var -xdev -size  +2048 -ls| sort -r  +6
```
For detailed information, see the command description for the find command.
Check for obsolete or leftover files in /var/tmp.
Check the size of the /var/adm/wtmp file, which logs all logins, rlogins and telnet sessions. The log will grow indefinitely unless system accounting is running. System accounting clears it out nightly. The /var/adm/wtmp file can be cleared out or edited to remove old and unwanted information. To clear it, use the following command:
```
cp /dev/null  /var/adm/wtmp
```
To edit the /var/adm/wtmp file, first convert the binary file to a temporary ASCII copy with the following command:
```
/usr/sbin/acct/fwtmp < /var/adm/wtmp >/tmp/out
```
Edit the /tmp/out file to remove unwanted entries, then convert it back and replace the original file with the following command:
```
/usr/sbin/acct/fwtmp -ic < /tmp/out > /var/adm/wtmp
```
Clear the error log in the /var/adm/ras directory using the following procedure. The error log is never cleared unless it is manually cleared.
Note

Never use the cp /dev/null command to clear the error log. A zero-length errlog file disables the error logging functions of the operating system and must be replaced from a backup.
1. Stop the error daemon using the following command:
```
/usr/lib/errstop
```
2. Remove the error log file, or move it to a different file system, by using one of the following commands:
```
rm /var/adm/ras/errlog
```
  or
```
mv /var/adm/ras/errlog filename
```
  Where filename is the name of the moved errlog file.
  
  Note
  
  The historical error data is deleted if you remove the error log file.
3. Restart the error daemon using the following command:
```
/usr/lib/errdemon
```
Note
Consider limiting the errlog by running the following entries in cron:
```
0 11 * * * /usr/bin/errclear -d S,O 30    
0 12 * * * /usr/bin/errclear -d H 90
```
Check whether the trcfile file in the /var/adm/ras directory is large. If it is large and a trace is not currently being run, you can remove the file using the following command:
```
rm /var/adm/ras/trcfile
```
If your dump device is set to hd6 (which is the default), there might be a number of vmcore* files in the /var/adm/ras directory. If their file dates are old or you do not want to retain them, you can remove them with the rm command.
Check the /var/spool directory, which contains the queueing subsystem files. Clear the queueing subsystem using the following commands:
```
stopsrc -s qdaemon
rm /var/spool/lpd/qdir/*
rm /var/spool/lpd/stat/*
rm /var/spool/qdaemon/*
startsrc -s qdaemon
```
Check the /var/adm/acct directory, which contains accounting records. If accounting is running, this directory may contain several large files. Information on how to manage these files is in System Accounting.
Check the /var/preserve directory for terminated vi sessions. Generally, it is safe to remove these files. If a user wants to recover a session, you can use the vi -r command to list all recoverable sessions. To recover a specific session, use vi -r filename.
Modify the /var/adm/sulog file, which records each attempted use of the su command and whether it was successful. This is a flat file that can be viewed and modified with any text editor. If it is removed, it is recreated by the next attempted su command.
Modify the /var/tmp/snmpd.log file, which records events from the snmpd daemon. If the file is removed, it is recreated by the snmpd daemon.
Note

The size of the /var/tmp/snmpd.log file can be limited so that it does not grow indefinitely. Edit the /etc/snmpd.conf file to change the number (in bytes) in the appropriate section for size.

Fix a User-Defined File System Overflow

Use this procedure to fix an overflowing user-defined file system.

Remove old backup files and core files. The following example removes all *.bak, .*.bak, a.out, core, ...*, and ed.hup files that have not been accessed or modified for more than one day.

```
find / \( -name "*.bak" -o -name core -o -name a.out -o \
        -name "...*" -o -name ".*.bak" -o -name ed.hup \) \
        -atime +1 -mtime +1 -type f -print | xargs -e rm -f
```
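
Because the command above deletes whatever it matches, it can be worth previewing the matches first. The following sketch uses the same name and age tests but only prints the candidates. The name cleanup_candidates is hypothetical, and the -xdev flag is an addition that keeps the search on a single file system.

```shell
# Hypothetical helper: print cleanup candidates under a directory
# without deleting anything.
cleanup_candidates() {
    find "$1" -xdev \( -name "*.bak" -o -name core -o -name a.out -o \
        -name "...*" -o -name ".*.bak" -o -name ed.hup \) \
        -atime +1 -mtime +1 -type f -print
}

# Usage: cleanup_candidates /myfilesys
```

Here /myfilesys stands for the mount point of the user-defined file system being cleaned.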

To prevent files from regularly overflowing the disk, run the skulker command as part of the cron process and remove files that are unnecessary or temporary.
The skulker command purges files in the /tmp directory, files older than a specified age, a.out files, core files, and ed.hup files. It is run daily as part of an accounting procedure run by the cron command during off-peak periods (assuming you have turned on accounting).

The cron daemon runs shell commands at specified dates and times. Regularly scheduled commands such as skulker can be specified according to instructions contained in the crontab files. Submit crontab files with the crontab command. To edit a crontab file, you must have root user authority.

For more information about how to create a cron process or edit the crontab file, refer to Setting Up an Accounting System.

Fix Other File Systems and General Search Techniques

Use the find command with the -size flag to locate large files or, if the file system recently overflowed, use the -newer flag to find recently modified files. To create a reference file for the -newer flag to compare against, use the following touch command:

```
touch mmddhhmm filename
```

Where the first mm is the month, dd is the day of the month, hh is the hour in 24-hour format, the second mm is the minute, and filename is the name of the file you are creating with the touch command.

After you have created the touched file, you can use the following command to list files modified more recently than it (add the -size flag to restrict the list to large files):

```
find /filesystem_name -xdev -newer touch_filename -ls
```

You can also use the find command to locate files that have been changed in the last 24 hours, as shown in the following example:

```
find /filesystem_name -xdev -mtime 0 -ls
```
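
The two steps of the -newer technique, creating a timestamped reference file and then searching against it, can be combined. The following is a sketch only: it assumes a touch command that accepts the POSIX -t flag, which differs from the positional form shown earlier, and modified_since is a hypothetical name.

```shell
# Hypothetical helper: list regular files modified after a timestamp.
# $1 is the directory; $2 is a timestamp in [[CC]YY]MMDDhhmm form.
modified_since() {
    ref="${TMPDIR:-/tmp}/newer_ref.$$"
    touch -t "$2" "$ref" || return 1       # create the reference file
    find "$1" -xdev -newer "$ref" -type f -exec ls -l {} +
    rm -f "$ref"                           # remove the reference file
}

# Usage: modified_since /filesystem_name 07201100
```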

Fix a Damaged File System

A file system can become corrupted when the i-node or superblock information for its directory structure is damaged. This can be caused by a hardware fault or by a faulty program that accesses the i-node or superblock information directly. (Programs written in assembler and C can bypass the operating system and write directly to the hardware.) One symptom of a corrupt file system is that the system cannot locate, read, or write data located in that file system.

To fix a damaged file system, you must diagnose the problem and then repair it. The fsck command performs low-level diagnosis and repairs.

Procedure

With root authority, unmount the damaged file system using one of the following SMIT fast paths: smit unmountfs (for a file system on a fixed disk drive) or smit unmntdsk (for a file system on a removable disk).
Assess file system damage by running the fsck command. In the following example, the fsck command checks the unmounted file system located on the /dev/myfilelv device:
```
fsck /dev/myfilelv
```
The fsck command checks and interactively repairs inconsistent file systems. Normally, the file system is consistent, and the fsck command merely reports on the number of files, used blocks, and free blocks in the file system. If the file system is inconsistent, the fsck command displays information about the inconsistencies found and prompts you for permission to repair them. The fsck command is conservative in its repair efforts and tries to avoid actions that might result in the loss of valid data. In certain cases, however, the fsck command recommends the destruction of a damaged file. Refer to the fsck command description in AIX 5L Version 5.2 Commands Reference for a list of inconsistencies that this command checks for.
If the file system cannot be repaired, restore it from backup.
Attention: Restoring a file system from a backup destroys and replaces any file system previously stored on the disk.

To restore the file system from backup, use the SMIT fastpath smit restfilesys or the series of commands shown in the following example:
```
mkfs /dev/myfilelv
mount /dev/myfilelv /myfilesys
cd /myfilesys
restore -r
```
In this example, the mkfs command makes a new file system on the device named /dev/myfilelv and initializes the volume label, file system label, and startup block. The mount command mounts /dev/myfilelv at the /myfilesys mount point, and the restore command extracts the file system from the backup.

If your backup was made using incremental file system backups, you must restore the backups in increasing backup-level order (for example, 0, 1, 2). For more information about restoring a file system from backup, refer to "Restoring from Backup Image Individual User Files".

When using smit restfilesys to restore an entire file system, enter the target directory, restore device (other than /dev/rfd0), and number of blocks to read in a single input operation.
