System Management Concepts:
Operating System and Devices

System Hang Detection

AIX can detect system hang conditions and try to recover from them, based on user-defined actions. AIX currently supports two types of hang detection:

Priority Hang Detection
Lost I/O Hang Detection

The following sections describe these types. For more information on system hang detection, see System Hang Management in AIX 5L Version 5.2 System Management Guide: Operating System and Devices.

Priority Hang Detection

All processes (also known as threads) run at a priority. This priority is numerically inverted in the range 0-126: zero is the highest priority and 126 is the lowest. The default priority for all threads is 60. Any user can lower the priority of a process with the nice command. Anyone with root authority can also raise a process's priority.

The kernel scheduler always picks the highest-priority runnable thread to put on a CPU. It is therefore possible for a sufficient number of high-priority threads to tie up the machine completely, so that low-priority threads can never run. If the running threads are at a priority higher than the default of 60, this can lock out all normal shells and logins to the point where the system appears hung.
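The starvation effect described above can be illustrated with a minimal sketch. It assumes a single CPU, fixed priorities, and threads that are always runnable. The function name simulate and the thread names are hypothetical, and the real AIX scheduler is more elaborate than this model.

```python
import heapq

def simulate(threads, ticks):
    """Strict priority scheduling on one CPU (illustrative model only).

    threads is a list of (name, priority) pairs; a numerically lower
    priority value means a higher priority, as in the text above.
    Returns how many ticks each thread was given.
    """
    ran = {name: 0 for name, _prio in threads}
    # Heap entries are (priority, sequence, name). The sequence number
    # makes threads of equal priority take turns (round robin).
    heap = [(prio, seq, name) for seq, (name, prio) in enumerate(threads)]
    heapq.heapify(heap)
    seq = len(threads)
    for _tick in range(ticks):
        prio, _old_seq, name = heapq.heappop(heap)  # best runnable thread
        ran[name] += 1
        # The thread never blocks, so it is immediately runnable again.
        heapq.heappush(heap, (prio, seq, name))
        seq += 1
    return ran

result = simulate([("busy_a", 40), ("busy_b", 40), ("shell", 60)], 100)
print(result)
```

In this model the two priority 40 threads share every tick between them, and the shell at the default priority of 60 never runs, which is the situation that makes a system appear hung.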

The System Hang Detection feature provides a mechanism to detect this situation and gives the system administrator a means to recover. The feature is implemented as a daemon (shdaemon) that runs at the highest process priority. The daemon queries the kernel for the lowest-priority thread that ran over a specified interval. If that priority is above a configured threshold (on the inverted scale, a numerically lower value), the daemon can take any of several actions. Each action can be enabled independently, and each can be configured to trigger at any priority and over any time interval. The actions and their defaults are:

    Action                          Default    Default     Default   Default
                                    Enabled    Priority    Timeout   Device

1)  Log an error                    no         60          2
2)  Console message                 no         60          2         /dev/console
3)  High priority login shell       yes        60          2         /dev/tty0
4)  Run a command at high priority  no         60          2
5)  Crash and reboot                no         39          5
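
The decision the daemon makes for each action can be sketched as follows. This is a simplified model, not the shdaemon implementation: the Action class, the triggered function, and the per-interval samples list are hypothetical names, and the strict comparison against the threshold is an assumption. The defaults are copied from the table above, with timeouts in the same units as the table.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    enabled: bool
    priority: int  # threshold on the inverted scale (0 is highest)
    timeout: int   # number of consecutive intervals to look back over

# Defaults taken from the table above.
DEFAULT_ACTIONS = [
    Action("log an error", False, 60, 2),
    Action("console message", False, 60, 2),
    Action("high priority login shell", True, 60, 2),
    Action("run a command at high priority", False, 60, 2),
    Action("crash and reboot", False, 39, 5),
]

def triggered(actions, samples):
    """Return the names of the actions that would fire.

    samples[i] is the numerically largest priority value (that is, the
    lowest priority) of any thread that ran during interval i, with the
    most recent interval last.
    """
    fired = []
    for action in actions:
        if not action.enabled or len(samples) < action.timeout:
            continue
        window = samples[-action.timeout:]
        lowest_priority_run = max(window)
        # Assumption: the action fires only when nothing at or below the
        # threshold priority ran during the whole window.
        if lowest_priority_run < action.priority:
            fired.append(action.name)
    return fired

print(triggered(DEFAULT_ACTIONS, [40, 40]))
```

With the defaults, only the high priority login shell is enabled, so it is the only action this sketch reports when nothing below priority 40 has run for two intervals. If a thread at priority 60 ran in either interval, no action fires.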

Lost I/O Hang Detection

Because of I/O errors, the I/O path can become blocked, which affects further I/O on that path. In these circumstances, it is essential that the operating system alert the user and execute user-defined actions. As part of lost I/O detection and notification, the shdaemon, with the help of the Logical Volume Manager, monitors the I/O buffers over a period of time and checks whether any I/O has been pending for too long. If the wait time exceeds the threshold wait time defined by the shconf file, a lost I/O is detected and further actions are taken. Information about the lost I/O is recorded in the error log. Depending on the settings in the shconf file, the system might also be rebooted to recover from the lost I/O situation.

For lost I/O detection, you can set the timeout value and also enable the following actions:

Action              Default Enabled    Default Device
Console message     no                 /dev/console
Crash and reboot    no                 -
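
The lost I/O check can be sketched in the same spirit. This is a hedged illustration only: check_lost_io, the pending dictionary, and the action labels are hypothetical, and the real detection is carried out by the shdaemon together with the Logical Volume Manager rather than by code like this.

```python
def check_lost_io(pending, now, threshold, console_message=False, reboot=False):
    """Find I/O requests that have been pending longer than threshold.

    pending maps a request id to the time it was issued; now and
    threshold use the same time units. Returns (lost_ids, actions).
    """
    lost = sorted(
        io_id for io_id, issued_at in pending.items()
        if now - issued_at > threshold
    )
    if not lost:
        return lost, []
    # The text states that lost I/O information goes to the error log;
    # the other two actions are optional and disabled by default.
    actions = ["log to error log"]
    if console_message:
        actions.append("console message")
    if reboot:
        actions.append("crash and reboot")
    return lost, actions

pending = {"buf1": 0, "buf2": 90}
print(check_lost_io(pending, now=100, threshold=30))
```

Here buf1 has been waiting 100 time units against a threshold of 30, so it is reported as lost, while buf2 has waited only 10 units and is not.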
