AIX can detect system hang conditions and try to recover from such situations, based on user defined actions. Currently AIX supports two types of hang detections.
The following sections describe these types. For more information on system hang detection, see System Hang Management in AIX 5L Version 5.2 System Management Guide: Operating System and Devices.
All processes (also known as threads) run at a priority. This priority is numerically inverted in the range 0-126. Zero is highest priority and 126 is the lowest priority. The default priority for all threads is 60. The priority of a process can be lowered by any user with the nice command. Anyone with root authority can also raise a process's priority.
The kernel scheduler always picks the highest priority runnable thread to put on a CPU. It is therefore possible for a sufficient number of high priority threads to completely tie up the machine such that low priority threads can never run. If the running threads are at a priority higher than the default of 60, this can lock out all normal shells and logins to the point where the system appears hung.
The System Hang Detection feature provides a mechanism to detect this situation and allow the system administrator a means to recover. This feature is implemented as a daemon (shdaemon) that runs at the highest process priority. This daemon queries the kernel for the lowest priority thread run over a specified interval. If the priority is above a configured threshold, the daemon can take one of several actions. Each of these actions can be independently enabled, and each can be configured to trigger at any priority and over any time interval. The actions and their defaults are:
Action Default Default Default Default Enabled Priority Timeout Device 1) Log an error no 60 2 2) Console message no 60 2 /dev/console 3) High priority yes 60 2 /dev/tty0 login shell 4) Run a command at no 60 2 high priority 5) Crash and reboot no 39 5
Because of I/O errors, the I/O path can become blocked and further I/O on that path is affected. In these circumstances it is essential that the operating system alert the user and execute user defined actions. As part of the Lost I/O detection and notification, the shdaemon, with the help of the Logical Volume Manager, monitors the I/O buffers over a period of time and checks whether any I/O is pending for too long a period of time. If the wait time exceeds the threshold wait time defined by the shconf file, a lost I/O is detected and further actions are taken. The information about the lost I/O is documented in the error log. Also based on the settings in the shconf file, the system might be rebooted to recover from the lost I/O situation.
For lost I/O detection, you can set the time out value and also enable the following actions:
Action | Default Enabled | Default Device |
---|---|---|
Console message | no | /dev/console |
Crash and reboot | no | - |