Resources have identifiable attributes (also called properties in Web-based System Manager) that can be expressed so that certain conditions of interest to the system administrator can be observed. Predefined thresholds can be set for conditions, and responses can be defined and associated with these conditions. When these thresholds are met, an event is generated, and the actions associated with the condition are run. Predefined conditions and responses can be used as is or as templates for defining the conditions most appropriate for your installation.
The major components of RSCT Resource Monitoring and Control are the Resource Monitoring and Control (RMC) subsystem and certain resource managers. These are described in the following sections.
The Resource Monitoring and Control subsystem (RMC subsystem) monitors and queries resources. The RMC daemon manages an RMC session and recovers from communications failures.
The RMC subsystem is used by its clients to monitor the state of system resources and to send commands to resource managers. The RMC subsystem acts as a broker between the client processes that use it and the resource manager processes that control resources.
A resource manager is a process that maps resource and resource-class abstractions into calls and commands for one or more specific types of resources. A resource manager is a stand-alone daemon. The resource manager contains definitions of all resource classes that the resource manager supports. A resource class definition includes a description of all attributes, actions, and other characteristics of a resource class.
These resource classes are accessible and their properties can be manipulated by the user through Web-based System Manager or through the command line.
See the RMC and ERRM commands to access the resource classes and manipulate their attributes through the command line interface.
The following resource managers are provided:
Audit Log resource manager: Provides a system-wide facility for recording information about the system's operation, which is particularly useful for tracking subsystems running in the background. (See Audit Log Resource Manager for details.)
Event Response resource manager: Provides the ability to take actions in response to conditions occurring on the system. (See Event Response Resource Manager for details.)
File System resource manager: Monitors file systems. (See File System Resource Manager for details.)
Host resource manager: Monitors resources related to an individual machine. The types of values that are provided relate to load (processes, paging space, and memory usage) and status of the operating system. It also monitors program activity from initiation until termination. (See Host Resource Manager for details.)
The Audit Log subsystem is implemented as a resource manager within the RMC subsystem. It has two resource classes, IBM.AuditLog for subsystem definitions and IBM.AuditLogTemplate for audit-log-template definitions. Entries in the audit log are called records. Records can be added, retrieved, and removed through actions on a specific subsystem or on the subsystem class. The template definition class contains a description of each record type that a subsystem can add to the audit log. The template definition contains the data type, a descriptive message, and other information for each subsystem-specific field within the record.
There are typically two types of clients for the audit-log subsystem: subsystems that need to add records to the audit log, and users who extract records from the audit log through the command line or the Web-based System Manager interface.
The formatted message for each record provides a concise description of the situation and allows a user to easily see at a high level what has been happening on the system.
Each resource of the IBM.AuditLog class represents a subsystem that will be adding records to the audit log. A resource of this class must be added before the subsystem can add records to the audit log. The resource can be added as part of the installation of the subsystem or at runtime.
The following properties can be monitored for this resource class:
The IBM.AuditLogTemplate resource class holds all audit log templates. An audit log template describes the information that exists in each audit log record that is based on the template. In addition, an audit log template contains information on how to present records that use the template to an end user. Each template corresponds to a resource within this class. The attributes of this resource class are internal.
The system administrator interacts with the Event Response resource manager (ERRM) through the Web-based System Manager or through the ERRM command-line interface.
When an event occurs, ERRM runs user-configured commands, which can include scripts provided by RSCT. A command and its attributes are a type of action, and many actions can be configured for a single Event Response resource. An action consists of a name, a command to be run, and other variables. You specify the range of times when the command is run (day, start time, and end time). If the condition occurs at a time outside the specified time ranges, the command is not run, and if all of the actions within this Event Response resource have the same time ranges, none of the commands are run. If no time ranges are specified, the command is always run. There are also event and rearm event flags that specify the events for which the command is run. Three settings are allowed: only the event flag set, only the rearm event flag set, or both flags set.
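The decision described above (event and rearm flags first, then time ranges) can be sketched in Python. This is only an illustration under stated assumptions: the `TimeRange` and `Action` classes, their field names, and the Monday-is-0 weekday numbering are hypothetical and are not part of the ERRM interface.

```python
from dataclasses import dataclass, field
from datetime import datetime, time


@dataclass
class TimeRange:
    """Hypothetical time window: a set of weekdays plus start and end times."""
    days: set      # weekday numbers, Monday == 0 (datetime.weekday convention)
    start: time
    end: time

    def contains(self, when: datetime) -> bool:
        return when.weekday() in self.days and self.start <= when.time() <= self.end


@dataclass
class Action:
    """Hypothetical model of one action within an Event Response resource."""
    name: str
    command: str
    time_ranges: list = field(default_factory=list)
    on_event: bool = True
    on_rearm: bool = False

    def should_run(self, when: datetime, is_rearm: bool) -> bool:
        # The flags select which kind of event the command applies to.
        wanted = self.on_rearm if is_rearm else self.on_event
        if not wanted:
            return False
        # With no time ranges configured, the command is always run.
        if not self.time_ranges:
            return True
        # Otherwise the event must fall inside at least one configured range.
        return any(r.contains(when) for r in self.time_ranges)
```

An action limited to weekday office hours would then skip an event that arrives on a Saturday, while an action with no time ranges would run at any time.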
The Event Response Resource Manager (ERRM) is automatically started when the RMC subsystem is started.
Although performance is important, ensuring that no events are lost and that the user's commands are executed is of greater importance. Other factors outside the control of ERRM may affect performance as well (for example, network load, system load, and the performance of other required subsystems).
The only userid that can define, undefine, and modify ERRM resources is root. All other users have read access to ERRM resources. Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. No security audits are generated, and no encryption mechanisms are used. ERRM communicates only with other local subsystems on the same node.
Information is handled as follows:
There are three Event Response resource classes:
The Condition resource class contains the information that the ERRM needs to register with the RMC subsystem for notification of events that the administrator deems important. Each condition specifies the attributes of the resource to be monitored, the event expression, and the optional rearm expression.
Configuration of ERRM begins with the definition of a set of Condition resources. A Condition resource is registered with the RMC subsystem when the Condition resource is used in the definition of an active Association resource.
An Event Response resource is configured by defining one or more actions. Each action contains the name of the action, a command, and other fields within the action attribute. The Event Response resource runs any number of configured commands when an event with an active association occurs. When an event occurs, all of the actions associated with its Event Response resource are evaluated to determine whether they should be run.
Predefined responses are available to use and to serve as templates for your own responses. For a description of predefined responses and how to use them, see Predefined Responses. Scripts for notification and logging of events and for broadcasting messages to logged-in user consoles also are provided in the AIX Commands Reference.
See Getting Started with the Monitoring Application for specific task information on how to configure actions for Event Response resources and Event Response resources for Conditions.
The Association resource class joins the Condition resource class together with the Event Response resource class. It contains a flag that indicates whether the association between the condition and the event response is active. Event Responses and Conditions are separate entities, but for monitoring to take place, they need to be associated. An event cannot occur unless at least one Event Response is associated with a Condition. You can configure one or more actions for an Event Response, and one or more Event Responses for a Condition.
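The relationship among the three resource classes can be pictured with a small Python data model. This is a hedged sketch, not the ERRM schema: the class and field names (`Condition`, `EventResponse`, `Association`, `active`) are chosen for illustration, and the helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    name: str
    event_expression: str
    rearm_expression: str = ""   # the rearm expression is optional


@dataclass(frozen=True)
class EventResponse:
    name: str
    commands: tuple              # one or more commands, one per action


@dataclass
class Association:
    condition: Condition
    response: EventResponse
    active: bool = False         # flag: is this association in effect?


def monitored_conditions(associations):
    """Names of conditions used in at least one active association."""
    return {a.condition.name for a in associations if a.active}


def responses_for(condition_name, associations):
    """Names of responses that run when the named condition's event occurs."""
    return [a.response.name for a in associations
            if a.active and a.condition.name == condition_name]
```

In this model a condition that has no active association is never monitored, which mirrors the statement that an event cannot occur unless at least one Event Response is associated with the Condition.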
See Getting Started with the Monitoring Application for information on how to get started using the capabilities of the Event Response resource manager to monitor your system.
The File System resource manager (FSRM) manages file systems. It can provide the following information:
There is one File System resource manager (FSRM) on a node. It is started implicitly by the RMC subsystem.
To enforce security, only root can start the FSRM resource manager (although it is strongly recommended that the FSRM resource manager not be started manually). Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. No security audits are generated, and no encryption mechanisms are used. The FSRM communicates only with other local subsystems on the same node and with the RMC subsystem. The FSRM has no direct contact with clients.
Information is handled as follows:
These properties of a file system resource can be monitored:
The following table shows the predefined conditions and examples of expressions that are used to monitor the file system:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description | Monitored Resources |
---|---|---|---|---|---|
File system state | OpState != 1 | An event is generated when any file system goes offline. | OpState == 1 | The event is rearmed when any file system comes back online. | all |
File system i-nodes used | PercentINodeUsed > 90 | An event is generated when more than 90% of the total i-nodes in any file system are in use. | PercentINodeUsed < 85 | The event is rearmed when the percentage of i-nodes used in the file system falls below 85%. | all |
File system space used | PercentTotUsed > 90 | An event is generated when more than 90% of the total space of any file system is in use. | PercentTotUsed < 85 | The event is rearmed when the space used in the file system falls below 85%. | all |
/tmp space used | PercentTotUsed > 90 | An event is generated when more than 90% of the total space in the /tmp file system is in use. | PercentTotUsed < 85 | The event is rearmed when the space used in the /tmp file system falls below 85%. | /tmp |
/var space used | PercentTotUsed > 90 | An event is generated when more than 90% of the total space in the /var file system is in use. | PercentTotUsed < 85 | The event is rearmed when the space used in the /var file system falls below 85%. | /var |
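The pairing of an event expression with a rearm expression gives each condition hysteresis: after an event fires, no further event is reported until the rearm expression has become true. The following Python sketch illustrates that behavior for the file system space thresholds (> 90 and < 85). It is an illustration of the idea only, assuming a simple two-state model; the `monitor` function is hypothetical and is not how the RMC subsystem is implemented.

```python
def monitor(samples, event, rearm):
    """Return (index, kind) notifications for a series of observed values.

    event and rearm are predicates over one observed value. While armed,
    only the event expression is evaluated; after an event, only the
    rearm expression is evaluated until it becomes true.
    """
    armed = True
    notifications = []
    for index, value in enumerate(samples):
        if armed and event(value):
            notifications.append((index, "event"))
            armed = False
        elif not armed and rearm(value):
            notifications.append((index, "rearm"))
            armed = True
    return notifications


# Percent of space used in a file system over six observations.
percent_used = [80, 91, 95, 88, 84, 92]
result = monitor(percent_used, lambda p: p > 90, lambda p: p < 85)
print(result)  # [(1, 'event'), (4, 'rearm'), (5, 'event')]
```

The observations 95 and 88 produce no notification: 95 would satisfy the event expression again, but the condition has not yet been rearmed, and 88 is not yet below the rearm threshold.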
The Host resource manager allows system resources for an individual machine to be monitored, particularly resources related to operating-system load and status.
The Host resource manager is started implicitly by the RMC subsystem only when a property of a Host resource class is first monitored (thus cutting down on performance overhead).
Security is governed by the RMC daemon, which authenticates clients and performs authorization checks. The Host resource manager runs as root. No security audits are generated, no encryption mechanisms are used, and there is no communication outside the node. The RMC daemon detects any authentication or authorization failures. All interprocess communication is accomplished through pipes and shared memory.
Information is handled as follows:
The Host resource manager consumes minimal system resources during normal operation. This is because the following approaches have been implemented:
The Host resource manager has the following resource classes that you can use to monitor system resources:
Each type of adapter that is supported has its own resource class as follows:
The program name of this resource class is IBM.Host. It allows the following resources of a host system to be monitored:
The operating system scheduler maintains a run queue of all of the processes that are ready to be dispatched. Each second, the process table is scanned to determine which processes are ready to run. If one or more processes are ready, they are placed on the run queue, and a counter is incremented. The counter is used to compute the value of the ProcRunQueue variable as the average number of ready-to-run processes. The scheduler also scans the process table for processes that are inactive because they are waiting to be paged in. A swapped process may (or may not) have some or all of its pages moved to the swap (page) device. As with the ProcRunQueue variable, the system increments a counter for swapped processes, which is used to compute the value of the ProcSwapQueue variable as the average number of processes swapped out. A process must be paged in and marked non-swapped before it can be placed on the run queue for execution. These properties can be monitored:
The following table shows the predefined conditions that are available for
monitoring the operating system scheduler, and example expressions:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Processes in run queue | (ProcRunQueue - ProcRunQueue@P) >= (ProcRunQueue@P * 0.5) | An event is generated each time the average number of processes on the run queue has increased by 50% or more between observations. | ProcRunQueue < 50 | The event is rearmed when the run queue length drops below 50. |
Processes in swap queue | (ProcSwapQueue > 50) && (ProcSwapQueue@P > 50) | An event is generated each time two consecutive observations find more than 50 processes in the swap queue. | (ProcSwapQueue < 40) && (ProcSwapQueue@P < 40) | The event is rearmed when the number of processes in the swap queue drops below 40 for two consecutive observations. |
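In the expressions above, the `@P` suffix refers to the value of the attribute at the previous observation. A minimal Python sketch of how the two event expressions behave over a series of observations follows; the function names and the sample values are made up for illustration.

```python
def run_queue_event(current, previous):
    # (ProcRunQueue - ProcRunQueue@P) >= (ProcRunQueue@P * 0.5)
    return (current - previous) >= (previous * 0.5)


def swap_queue_event(current, previous):
    # (ProcSwapQueue > 50) && (ProcSwapQueue@P > 50)
    return current > 50 and previous > 50


def evaluate(observations, expression):
    """Apply an expression to each (previous, current) pair of observations."""
    return [expression(cur, prev)
            for prev, cur in zip(observations, observations[1:])]


run_queue = [10, 14, 21, 22]
print(evaluate(run_queue, run_queue_event))    # [False, True, False]

swap_queue = [40, 55, 60, 45]
print(evaluate(swap_queue, swap_queue_event))  # [False, True, False]
```

Only the step from 14 to 21 is an increase of at least 50%, and only the pair (55, 60) has both the previous and the current observation above 50.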
A paging space is fixed disk storage for information that is resident in virtual memory but is not currently being accessed. A paging space, or swap space, is a logical volume with the attribute type equal to paging. When the amount of free real memory in the system is low, programs or data that have not been used recently are moved from real memory to paging space to release real memory for other processes. The amount of paging space required depends upon the types of activities performed on the system. If paging space runs low, processes may be lost, and if paging space runs out, the system may panic. Paging-space shortage may cause memory performance degradation, and thrashing can occur (if VMM memory load control is turned off).
The system monitors the number of free paging-space blocks and detects when a paging-space shortage exists. When the number of free paging-space blocks falls below a threshold known as the paging-space warning level, the system informs all processes except kernel processes (kprocs) of this condition by sending the SIGDANGER signal. If the shortage continues and falls below a second threshold known as the paging-space terminate level, the system sends the SIGKILL signal to processes that are the major users of paging space and that do not have a signal handler for the SIGDANGER signal.
The warning-level and terminate-level thresholds can be obtained and altered by the command vmtune (npswarn and npskill parameters respectively). Processes executing in the early allocation environment avoid receiving the SIGKILL signal if a low paging space condition occurs. If the PSALLOC environment variable is set to early when a program starts, paging space is reserved at the time the process makes a memory request. If there is insufficient paging space, the early allocation algorithm used by the operating system causes the memory request to be unsuccessful. If the PSALLOC environment is not set, or is set to any value other than early, the operating system uses a late allocation algorithm for memory and paging-space allocation. Late allocation does not reserve paging space at the time the memory is requested but defers the reservation until the pages are touched.
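The two thresholds can be summarized as a small decision function. The Python sketch below is a simplified model under several assumptions: the threshold values passed in are arbitrary, the selection of "major users of paging space" is not modeled, and treating kernel processes as exempt from SIGKILL is an assumption (the text above only states that they are not sent SIGDANGER). The function name and parameters are hypothetical.

```python
def signal_for_process(free_blocks, npswarn, npskill, *,
                       is_kproc=False, handles_sigdanger=False,
                       early_allocation=False):
    """Return the name of the signal a process would be sent, or None.

    npswarn is the paging-space warning level and npskill the terminate
    level, both in free paging-space blocks (npskill < npswarn assumed).
    """
    if is_kproc:
        return None              # assumption: kprocs are left alone entirely
    if free_blocks < npskill:
        # Below the terminate level: processes with a SIGDANGER handler, or
        # running with early allocation, are spared SIGKILL in this model.
        if handles_sigdanger or early_allocation:
            return "SIGDANGER"
        return "SIGKILL"
    if free_blocks < npswarn:
        return "SIGDANGER"       # below the warning level only
    return None                  # no shortage
```

For example, with a warning level of 4096 blocks and a terminate level of 1024 blocks, a process observes no signal at 5000 free blocks, SIGDANGER at 3000, and SIGKILL at 500 unless it handles SIGDANGER or was started with PSALLOC set to early.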
These properties monitor the global state of all active paging spaces defined in the system (including NFS-mounted paging spaces):
The following table shows the predefined conditions that are available for
monitoring paging space, and example expressions:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Paging active space | TotalPgSpSize != TotalPgSpSize@P | An event is generated whenever the total amount of active paging space changes. | None | None |
Paging free space | TotalPgSpFree <= 2560 | An event is generated when the VMM is within 2MB (512 4KB pages) of reaching the paging space warning level. | TotalPgSpFree > 2560 | The event is rearmed when the free paging space total becomes greater than the same threshold. |
Paging percent space used | PctTotalPgSpUsed > 90 | An event is generated when more than 90% of the total paging space is in use. | PctTotalPgSpUsed < 85 | The event is rearmed when the percentage falls below 85%. |
Paging percent space free | PctTotalPgSpFree < 10 | An event is generated when the total amount of free paging space falls below 10%. | PctTotalPgSpFree > 15 | The event is rearmed when the free paging space rises above 15%. |
The values represented for this attribute reflect total processor utilization across all of the active processors in a system.
The idle and wait states of a processor are monitored, and the time spent running in protection mode is monitored. At each clock tick, an array of counters is incremented to reflect processor activity based on the state of the current running processes. The PctTotalTimeKernel, PctTotalTimeUser, PctTotalTimeWait, and PctTotalTimeIdle properties provide the approximate average percentage of time all active processors are currently spending in each state. Therefore, the sum of these values is 100 at any given observation.
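The way per-state tick counters turn into percentages that sum to 100 can be shown with a short Python sketch. The counter layout and the function name are assumptions made for illustration; only the four state names come from the properties listed above.

```python
def state_percentages(previous, current):
    """Percentage of clock ticks spent in each state between two observations.

    previous and current map a state name to a cumulative tick counter.
    """
    deltas = {state: current[state] - previous[state] for state in current}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no clock ticks elapsed between observations")
    return {state: 100.0 * ticks / total for state, ticks in deltas.items()}


before = {"kernel": 1000, "user": 4000, "wait": 500, "idle": 4500}
after = {"kernel": 1020, "user": 4050, "wait": 510, "idle": 4520}
percentages = state_percentages(before, after)
print(percentages)
# {'kernel': 20.0, 'user': 50.0, 'wait': 10.0, 'idle': 20.0}
```

Because every tick is attributed to exactly one state, the four percentages always add up to 100 for a given observation interval.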
Processes run in one of two protection modes: kernel (or system) level and user level. Processes running in kernel mode run with kernel privileges and have access to kernel data. These processes include kernel processes (kprocs) and services (such as system calls and device drivers).
Processes running in user mode are normal applications with user level privileges and run in their own unique process space. When a user level process invokes a kernel service, for example, by making a system call, a mode switch occurs that causes the process to run in kernel mode while the service is running.
When the current running process makes a request that cannot be immediately satisfied, such as an I/O operation, the process is put into wait state. A processor is considered idle when the current running process is the wait process. The wait process is a kernel process (kproc) that is dispatched when no other processes are ready to run.
These properties can be monitored:
The following table shows the predefined conditions that are available for
monitoring system-wide processor idle time, and example expressions:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Processor idle time | PctTotalTimeIdle >= 70 | An event is generated when, on average, all processors are idle at least 70% of the time. | PctTotalTimeIdle < 10 | The event is rearmed when the idle time decreases below 10%. |
Processor kernel time | PctTotalTimeKernel >= 70 | An event is generated when, on average, all processors are executing in kernel mode at least 70% of the time. | PctTotalTimeKernel < 10 | The event is rearmed when the kernel time decreases below 10%. |
Processor user time | PctTotalTimeUser >= 70 | An event is generated when, on average, all processors are executing in user mode at least 70% of the time. | PctTotalTimeUser < 10 | The event is rearmed when the user time decreases below 10%. |
Processor wait time | PctTotalTimeWait >= 50 | An event is generated when, on average, all processors are waiting on I/O at least 50% of the time. | PctTotalTimeWait < 10 | The event is rearmed when the wait time decreases below 10%. |
The VMM (Virtual Memory Manager) manages the allocation of real memory page frames, resolves references to virtual memory pages that are not currently in real memory (or do not yet exist), and manages the reading and writing of pages to disk storage.
The VMM maintains a list of free page frames that it uses to accommodate page faults. A page fault occurs when a page that is not in real memory is referenced. In most environments, the VMM must occasionally add to the free list by reassigning some page frames owned by running processes. The virtual-memory pages whose page frames are to be reassigned are selected by the VMM's page-replacement algorithm, which takes into consideration the segment type, statistics regarding rate of reoccurring page faults, and user-tunable thresholds. The number of frames reassigned to the free list is also determined by VMM thresholds.
Memory regions defined in either system or user space may be pinned. Pinning a memory region prohibits the pager from stealing pages from the pages backing the pinned memory region. After a memory region is pinned, accessing that region does not result in a page fault until the region is subsequently unpinned. While a portion of the kernel remains pinned, many regions are pageable and are only pinned while being accessed.
Thresholds used by the VMM include the minimum and maximum number of pages to be maintained on the free list (minfree and maxfree). These thresholds are used to determine when the VMM should start or stop stealing pages to replenish the free list. There is also a maximum percentage of real memory that may be pinned. The values of these thresholds may be queried or altered using the system command vmtune.
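The role of minfree and maxfree can be illustrated with a simplified Python model: stealing starts when the free list drops below minfree and continues until the list reaches maxfree. This is a sketch of the start and stop thresholds only; the function name is hypothetical and the real page-replacement algorithm weighs many more factors.

```python
def frames_to_steal(free_frames, minfree, maxfree):
    """Number of page frames to reclaim under the simplified model.

    Stealing starts only when the free list is below minfree, and it
    stops once the free list has been replenished to maxfree.
    """
    if minfree > maxfree:
        raise ValueError("minfree must not exceed maxfree")
    if free_frames < minfree:
        return maxfree - free_frames
    return 0


print(frames_to_steal(100, 120, 128))  # 28
print(frames_to_steal(125, 120, 128))  # 0
```

The gap between the two thresholds keeps the VMM from starting and stopping page stealing on every small fluctuation of the free list.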
Virtual memory is partitioned into fixed-size units called pages. Each page may be in real memory (RAM) or stored on disk until needed. Real memory is partitioned into units that are equal in size to virtual pages and are referred to as page frames. To accommodate a large virtual memory space with a limited real memory space, the system uses real memory for work space and maps inactive data and programs to disk.
Pages of a virtual address space are considered to be persistent or working. Persistent pages have permanent storage locations on disk. Data files or executable programs are mapped to persistent pages. Since persistent pages have a permanent storage location, the VMM can write a changed page back to its permanent location or simply free the page frame if it was not altered and re-read the page on a subsequent request.
Working pages are transitory and exist only during their use by a process. Examples are process stack and data regions, kernel and kernel-extension text regions, and shared-library text and data regions. Working pages also require disk storage locations when they cannot be kept in real memory. Disk paging space is used for this purpose.
The operating system provides routines used by the kernel and by services executing at system level for allocating memory in kernel space. Counters are maintained in the kernel to track requests and use of kernel memory, based on the type of data structure or service. These properties can be used to monitor the number and size and the state of requests for buffers allocated in kernel memory. The types of kernel memory available are:
The following properties are available for monitoring real and virtual memory and kernel memory. The <x> in the names below refers to the type of kernel memory allocation as shown in the preceding bulleted list (28 possible monitors).
The following table shows the predefined conditions that are available for
monitoring memory management, and example expressions:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Real memory free
|
PctRealMemFree < 5
|
An event is generated when the percentage of real page frames that are free
falls below 5%.
|
PctRealMemFree> 10
|
The event is rearmed when the percentage of free frames exceeds 10%.
|
Real memory pinned
|
PctRealMemPinned > 75
|
An event is generated when the percentage of real page frames that are
pinned exceeds 75%.
|
PctRealMemPinned < 70
|
The event is rearmed when the percentage falls below 70%.
|
Real memory free frames
|
PctMemFramesFree < 120
|
An event is generated when the number of free real page frames falls below
120.
|
PctMemFramesFree> 150
|
The event is rearmed when the number free exceeds 150.
|
Page in rate
|
VMPgInRate > 500
|
An event is generated when the rate of pages read by the VMM for both
persistent and working pages exceeds 500 per second.
|
VMPgInRate < 400
|
The event is rearmed when the rate drops below 400.
|
Page out rate
|
VMPgOutRate > 500
|
An event is generated when the rate of pages written by the VMM for both
persistent and working pages exceeds 500 per second.
|
VMPgOutRate < 400
|
The event is rearmed when the rate drops below 400.
|
Page fault rate
|
VMPgFaultRate > 500
|
An event is generated when there are more than 500 page faults per
second.
|
VMPgFaultRate < 400
|
The event is rearmed when the rate drops to less than 400 pages per
second.
|
Page space in rate
|
VMPgSpInRate > 500
|
An event is generated when more than 500 pages per second are read by the
VMM from paging space devices (working pages only).
|
VMPgSpInRate< 400
|
The event is rearmed when the rate drops to less than 400 pages per
second.
|
Page space out rate
|
VMPgSpOutRate> 500
|
An event is generated when more than 500 pages per second are written by
the VMM to paging space devices (working pages only).
|
VMPgSpOutRate < 400
|
The event is rearmed when the rate drops to less than 400 pages per
second.
|
Kernel Mbuf rate
|
KMemReqMbufRate> 5000
|
An event is generated when the number of requests for a kernel buffer of
type <Mbuf> (network data buffer) exceeds 5000 per second.
|
KMemReqMbufRate< 4000
|
The event is rearmed when the rate falls below 4000 per second.
|
Kernel socket buffer rate
|
KMemReqSockRate > 5000
|
An event is generated when the number of requests for a kernel buffer of
type <Socket> (kernel socket structure) exceeds 5000 per second.
|
KMemReqSockRate < 4000
|
The event is rearmed when the rate falls below 4000 per second.
|
Kernel protocol CB rate
|
KMemReqProtcbRate> 5000
|
An event is generated when the number of requests for a kernel buffer of
type <Protcb> (Protocol Control Block) exceeds 5000 per second.
|
KMemReqProtcbRate < 4000
|
The event is rearmed when the rate falls below 4000 per second.
|
Kernel other IP CB rate
|
KMemReqOtherIPRate > 5000
|
An event is generated when the number of requests for a kernel buffer of
type <OtherIP> (other buffers used by IP) exceeds 5000 per
second.
|
KMemReqOtherIPRate < 4000
|
The event is rearmed when the rate falls below 4000 per second.
|
Kernel Mblk rate
|
KMemReqMblkRate > 5000
|
An event is generated when the number of requests for a kernel buffer of
type <Mblk> (stream header and data) exceeds 5000 per second.
|
KMemReqMblkRate< 4000
|
The event is rearmed when the rate falls below 4000 per second.
|
Kernel streams buffer rate
|
KMemReqStreamsRate > 5000
|
An event is generated when the number of requests for a kernel buffer of
type <Streams> (other streams related memory) exceeds 5000 per
second.
|
KMemReqStreamsRate < 4000
|
The event is rearmed when the rate falls below 4000 per second.
|
Kernel other memory rate
|
KMemReqOtherRate > 5000
|
An event is generated when the number of requests for a kernel buffer of
type <Other> (other kernel memory) exceeds 5000 per second.
|
KMemReqOtherRate < 4000
|
The event is rearmed when the rate falls below 4000 per second.
|
Kernel Mbuf failed rate
|
KMemFailMbufRate > 10
|
An event is generated when the number of failures of requests for a kernel
buffer of type <Mbuf> (network data buffer) exceeds 10 per
second.
|
KMemFailMbufRate < 5
|
The event is rearmed when the rate falls below 5 per second.
|
Kernel socket buffer failed rate
|
KMemFailSockRate> 10
|
An event is generated when the number of failures of requests for a kernel
buffer of type <Socket> (kernel socket structure) exceeds 10 per
second.
|
KMemFailSockRate< 5
|
The event is rearmed when the rate falls below 5 per second.
|
Kernel protocol CB failed rate
|
KMemFailProtcbRate > 10
|
An event is generated when the number of failures of requests for a kernel
buffer of type <Protcb> (Protocol Control Block) exceeds 10 per
second.
|
KMemFailProtcbRate < 5
|
The event is rearmed when the rate falls below 5 per second.
|
Kernel other IP CB failed rate
|
KMemFailOtherIPRate> 10
|
An event is generated when the number of failures of requests for a kernel
buffer of type <OtherIP> (other buffers used by IP) exceeds 10 per
second.
|
KMemFailOtherIPRate < 5
|
The event is rearmed when the rate falls below 5 per second.
|
Kernel Mblk failed rate
|
KMemFailMblkRate> 10
|
An event is generated when the number of failures of requests for a kernel
buffer of type <Mblk> (stream header and data) exceeds 10 per
second.
|
KMemFailMblkRate< 5
|
The event is rearmed when the rate falls below 5 per second.
|
Kernel streams buffer failed rate
|
KMemFailStreamsRate> 10
|
An event is generated when the number of failures of requests for a kernel
buffer of type <Streams> (other stream related memory) exceeds 10 per
second.
|
KMemFailStreamsRate < 5
|
The event is rearmed when the rate falls below 5 per second.
|
Kernel other memory failed rate
|
KMemFailOtherRate> 10
|
An event is generated when the number of failures of requests for a kernel
buffer of type <Other> (other kernel memory) exceeds 10 per
second.
|
KMemFailOtherRate < 5
|
The event is rearmed when the rate falls below 5 per second.
|
Kernel Mbufs | KMemNumMbuf > 10000 | An event is generated when the allocated number of kernel buffers of type <Mbuf> (network data buffer) exceeds 10000. | KMemNumMbuf < 9000 | The event is rearmed when the number falls below 9000. |
Kernel socket buffers | KMemNumSock > 10000 | An event is generated when the allocated number of kernel buffers of type <Socket> (kernel socket structure) exceeds 10000. | KMemNumSock < 9000 | The event is rearmed when the number falls below 9000. |
Kernel protocol CBs | KMemNumProtcb > 10000 | An event is generated when the allocated number of kernel buffers of type <Protcb> (Protocol Control Block) exceeds 10000. | KMemNumProtcb < 9000 | The event is rearmed when the number falls below 9000. |
Kernel other IP CBs | KMemNumOtherIP > 10000 | An event is generated when the allocated number of kernel buffers of type <OtherIP> (other buffers used by IP) exceeds 10000. | KMemNumOtherIP < 9000 | The event is rearmed when the number falls below 9000. |
Kernel Mblk buffers | KMemNumMblk > 10000 | An event is generated when the allocated number of kernel buffers of type <Mblk> (stream header and data) exceeds 10000. | KMemNumMblk < 9000 | The event is rearmed when the number falls below 9000. |
Kernel stream buffers | KMemNumStreams > 10000 | An event is generated when the allocated number of kernel buffers of type <Streams> (other streams related memory) exceeds 10000. | KMemNumStreams < 9000 | The event is rearmed when the number falls below 9000. |
Kernel other memory | KMemNumOther > 10000 | An event is generated when the allocated number of kernel buffers of type <Other> (other kernel memory) exceeds 10000. | KMemNumOther < 9000 | The event is rearmed when the number falls below 9000. |
Kernel Mbufs size | KMemSizeMbuf > 0x4000000 | An event is generated when the total space occupied by kernel buffers of type <Mbuf> (network data buffer) exceeds 64MB. | KMemSizeMbuf < 0x2000000 | The event is rearmed when the allocated amount drops below 32MB. |
Kernel socket buffers size | KMemSizeSock > 0x4000000 | An event is generated when the total space occupied by kernel buffers of type <Socket> (kernel socket structure) exceeds 64MB. | KMemSizeSock < 0x2000000 | The event is rearmed when the allocated amount drops below 32MB. |
Kernel protocol CBs size | KMemSizeProtcb > 0x4000000 | An event is generated when the total space occupied by kernel buffers of type <Protcb> (Protocol Control Block) exceeds 64MB. | KMemSizeProtcb < 0x2000000 | The event is rearmed when the allocated amount drops below 32MB. |
Kernel other IP CBs size | KMemSizeOtherIP > 0x4000000 | An event is generated when the total space occupied by kernel buffers of type <OtherIP> (other buffers used by IP) exceeds 64MB. | KMemSizeOtherIP < 0x2000000 | The event is rearmed when the allocated amount drops below 32MB. |
Kernel Mblks size | KMemSizeMblk > 0x4000000 | An event is generated when the total space occupied by kernel buffers of type <Mblk> (stream header and data) exceeds 64MB. | KMemSizeMblk < 0x2000000 | The event is rearmed when the allocated amount drops below 32MB. |
Kernel streams buffers size | KMemSizeStreams > 0x4000000 | An event is generated when the total space occupied by kernel buffers of type <Streams> (other streams related memory) exceeds 64MB. | KMemSizeStreams < 0x2000000 | The event is rearmed when the allocated amount drops below 32MB. |
Kernel other memory size | KMemSizeOther > 0x4000000 | An event is generated when the total space occupied by kernel buffers of type <Other> (other kernel memory) exceeds 64MB. | KMemSizeOther < 0x2000000 | The event is rearmed when the allocated amount drops below 32MB. |
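The pairing of an event expression with a rearm expression in these rows can be pictured as a small two-state machine. The following Python sketch is only an illustration of that idea, not how the RMC subsystem is implemented; the `Condition` class and its `observe` method are hypothetical names, and the thresholds are borrowed from the Kernel Mbufs row.

```python
class Condition:
    """Hypothetical model of an event expression paired with a rearm expression."""

    def __init__(self, event, rearm):
        self.event = event    # predicate applied while the condition is armed
        self.rearm = rearm    # predicate applied after an event has been generated
        self.armed = True     # True: the event expression is being watched

    def observe(self, value):
        """Return "event", "rearm", or None for one observation."""
        if self.armed and self.event(value):
            self.armed = False
            return "event"
        if not self.armed and self.rearm(value):
            self.armed = True
            return "rearm"
        return None


# Thresholds from the "Kernel Mbufs" row: KMemNumMbuf > 10000 / KMemNumMbuf < 9000.
mbufs = Condition(event=lambda v: v > 10000, rearm=lambda v: v < 9000)
samples = [8000, 10500, 11000, 9500, 8900, 10200]
results = [mbufs.observe(v) for v in samples]
print(results)
```

In this model the gap between 9000 and 10000 keeps a value that hovers near the threshold from generating a stream of repeated events: after the first event, nothing is reported until the value has dropped below the rearm threshold.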
The program name of this resource class is IBM.PagingDevice. It can be used to monitor devices that are used by the operating system for paging. Each host may have one or more paging devices. On the operating system, the paging device is a logical volume.
These attributes can be monitored:
The following table shows the predefined conditions and examples of
expressions that are available for monitoring paging space for a specific
device:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Paging device state | OpState != 1 | An event is generated when the paging space device goes offline. | OpState == 1 | The event is rearmed when the device comes back online. |
Paging device percent free | PctFree < 20 | An event is generated when less than 20% of the paging device is free. | PctFree > 25 | The event is rearmed when the amount of free paging space on the device exceeds 25%. |
The program name of this resource class is IBM.Processor.
Because the system tracks the amount of time each processor spends idle, in wait state, and running in kernel and user modes, this resource class can be used to monitor these processor activities. At each clock tick, an array of counters is incremented to reflect the processor activity based on the state of the current running process. The processor user, kernel, wait, and idle resource properties provide the approximate percentage of time that a specific processor is currently spending in each state. Therefore, the sum of these properties is 100 at any given observation.
Processes run in one of two protection modes: kernel (or system) mode and user mode. Processes executing in kernel mode run with kernel privileges and have access to kernel data. These processes include kernel processes (kprocs) and services (such as system calls and device drivers).
Processes running in user mode are normal applications with user level privileges and run in their own unique process space. When a user level process invokes a kernel service, for example, by making a system call, a mode switch occurs that causes the process to run in kernel mode while the service is executing.
When the current running process makes a request that cannot be immediately satisfied, such as an I/O operation, the process is put into wait state.
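As a rough illustration of how per-state tick counters can be turned into percentages that sum to 100, the following Python sketch compares two snapshots of the counters. The function name, the dictionary layout, and the sample numbers are assumptions made for this example; they do not describe the operating system's actual data structures.

```python
def state_percentages(prev, curr):
    """Convert two snapshots of per-state tick counters into percentages."""
    deltas = {state: curr[state] - prev[state] for state in curr}
    total = sum(deltas.values())
    if total == 0:
        # No ticks elapsed between the snapshots; nothing to report.
        return {state: 0.0 for state in deltas}
    return {state: 100.0 * ticks / total for state, ticks in deltas.items()}


# Hypothetical counter snapshots taken 100 clock ticks apart.
prev = {"user": 1000, "kernel": 400, "wait": 100, "idle": 500}
curr = {"user": 1060, "kernel": 420, "wait": 110, "idle": 510}
pct = state_percentages(prev, curr)
print(pct)
```

Because every tick is charged to exactly one state, the four percentages computed this way always add up to 100 for any interval in which at least one tick elapsed.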
The following properties can be monitored:
This resource class represents the characteristics of the processors
within a host. There is one instance of this resource for each
processor installed in a host regardless of whether it is active or
not. The following table shows the predefined conditions and examples
of expressions that are available for monitoring a processor:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Processor state | OpState != 1 | An event is generated when the processor goes offline. | OpState == 1 | The event is rearmed when the processor returns online. |
Processor idle time | (PctTimeIdle >= 80) && (PctTimeIdle@P >= 80) | An event is generated each time the processor is idle at least 80% of the time for two consecutive observations. | (PctTimeIdle < 50) && (PctTimeIdle@P < 50) | The event is rearmed when the idle time for the processor is below 50% for two consecutive observations. |
Processor wait time | (PctTimeWait >= 50) && (PctTimeWait@P >= 50) | An event is generated when the processor is in wait state at least 50% of the time for two consecutive observations. | (PctTimeWait < 30) && (PctTimeWait@P < 30) | The event is rearmed when the processor is in wait state less than 30% of the time for two consecutive observations. |
Processor kernel time | (PctTimeKernel >= 70) && (PctTimeKernel@P >= 70) | An event is generated when the processor is in kernel mode at least 70% of the time for two consecutive observations. | (PctTimeKernel < 20) && (PctTimeKernel@P < 20) | The event is rearmed when the kernel mode time for the processor is below 20% for two consecutive observations. |
Processor user time | (PctTimeUser >= 80) && (PctTimeUser@P >= 80) | An event is generated when the processor is in user mode at least 80% of the time for two consecutive observations. | (PctTimeUser < 50) && (PctTimeUser@P < 50) | The event is rearmed when the user mode time for the processor is below 50% for two consecutive observations. |
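In the expressions above, the @P suffix refers to the value of the attribute at the previous observation, which is how a test for "two consecutive observations" is written. The Python sketch below mimics that evaluation for a list of sampled values. It is a simplified illustration only: the helper name `consecutive` is hypothetical, the sample values are made up, and the arm/rearm state is deliberately left out.

```python
def consecutive(threshold_test, observations):
    """Yield (value, fired) pairs.

    fired is True when the test holds for both the current observation
    and the previous one, mirroring (X test) && (X@P test).
    """
    previous = None
    for current in observations:
        fired = (
            previous is not None
            and threshold_test(current)
            and threshold_test(previous)
        )
        yield current, fired
        previous = current


# Hypothetical PctTimeIdle samples, checked against the >= 80 threshold.
idle_samples = [85, 70, 82, 90, 95, 40]
fired = [hit for _, hit in consecutive(lambda v: v >= 80, idle_samples)]
print(fired)
```

A single spike (the first sample, 85) does not satisfy the expression; only the fourth and fifth samples do, because each of them follows another sample that is also at or above the threshold.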
The program name of this resource class is IBM.PhysicalVolume. After a disk is added to the system, it must first be designated as a physical volume before it can be added to a volume group and used to contain a file system or paging space. A physical volume has certain configuration and identification information written on it. When a disk becomes a physical volume, it is divided into 512-byte physical blocks. Physical volumes have a unique name (typically hdiskx, where x is a unique number on the system), which is permanently associated with the disk until it is undefined.
These properties, which reflect the basic performance of a physical disk, can be monitored:
Each instance of this resource class represents a physical volume that has
been defined to the system. All resources are monitored. The
following table shows the predefined condition and examples of expressions
that are available for monitoring physical disks:
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Disk percent busy | (PctBusy >= 90) && (PctBusy@P >= 90) | An event is generated when the disk has been busy at least 90% of the time for two consecutive observations. | PctBusy < 80 | The event is rearmed when the value decreases below 80%. |
Disk read rate | RdBlkRate < 50 | An event is generated when the rate per second of 512-byte blocks read from the disk is less than 50. | RdBlkRate > 100 | The event is rearmed when the rate exceeds 100. |
Disk write rate | WrBlkRate < 50 | An event is generated when the rate per second of 512-byte blocks written to disk is less than 50. | WrBlkRate > 100 | The event is rearmed when the rate exceeds 100. |
Disk transfer rate | (XferRate > XferRate@P) && ((XferRate - XferRate@P) > (XferRate@P * 0.5)) | An event is generated each time the rate of transfer to disk has increased by more than 50% since the previous observation. | None | None |
The following adapters are supported, each by its own resource class:
See Ethernet Device for details on what can be monitored for an adapter. The other adapters have the same types of attributes. Only the adapter name is different.
The program name of this resource class is IBM.ATMDevice. The details of this class are identical to those of the IBM.EthernetDevice class except that the display name of the resource class is "ATM Device." See the description of Ethernet Device for details that also apply to this device.
The program name of this resource class is IBM.EthernetDevice. This resource class allows attributes of all Ethernet adapters that are installed in a system to be monitored. The network interfaces that may be defined on the adapters are not represented.
A network adapter card is the hardware that is physically attached to the network cabling. It is responsible for receiving and transmitting data at the physical level. The network adapter card is controlled by the network adapter device driver. A machine must have one network adapter card (or connection) for each network (not network type) to which it connects. For instance, if a host attaches to two Token-Ring networks, it must have two network adapter cards. When a new network adapter is physically installed in the system, the operating system assigns it a logical name. Some examples are: tok0 for a Token-Ring adapter, ent0 for an Ethernet adapter, or atm0 for an ATM adapter. The trailing number makes the logical name unique. For example, a second Token-Ring adapter would have the logical name tok1. The lsdev command can be used to display information about network adapters.
Messages received by a LAN adapter, referred to as frames, are encapsulated within destination, header, and trailer information added by the various network protocol layers. A counter, maintained for each adapter, tracks the number of frame-receive errors at the adapter device level that caused unsuccessful reception due to hardware or network errors. This counter is the raw value for RecErrorRate.
When frames are received by an adapter, they are transferred from the adapter into a device-managed receive queue. The number of packets accepted but dropped by the device driver level for any reason (for example, queue buffer shortage) is tracked by a counter, which provides the raw value of the RecDropRate property.
Messages and data sent by an application to a LAN adapter for transmission are broken up into packets and appended with address, header, and trailer information by the various network protocol layers. At the adapter device driver level, packets are placed in buffers on a transmit queue. The packets are appended with a network interface header, then transmitted as frames by the adapter device.
Counters are maintained for each adapter to track the number of transmission errors at the device level (due to hardware or network errors), the number of transmission queue overflows at the device driver level (due to buffer shortage), and the number of packets dropped (packets not passed to the device by the driver for any reason). These counters provide the raw values for XmitErrorRate, XmitOverflowRate, and XmitDropRate, respectively.
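The relationship between a raw counter and a per-second rate attribute can be sketched as the difference between two counter readings divided by the time between them. The Python example below is an assumption about how such a rate could be derived, not a description of the resource manager's actual code; the function name, the sampling interval, and the counter values are hypothetical, and counter wraparound is not handled.

```python
def rate_per_second(prev_count, curr_count, interval_seconds):
    """Derive a per-second rate from two readings of a raw counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_count - prev_count) / interval_seconds


# Hypothetical dropped-packet counter readings taken 5 seconds apart.
drop_rate = rate_per_second(prev_count=1200, curr_count=1260, interval_seconds=5)
print(drop_rate)

# An expression such as XmitDropRate > 10 would be true for this interval.
exceeds_threshold = drop_rate > 10
print(exceeds_threshold)
```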
The following properties can be monitored:
This resource class externalizes the characteristics of all Ethernet adapters that are installed in a system. Note that this class does not represent the network interfaces that may be defined on the adapters; it represents the actual adapters (for example, ent0).
In the first release, the characteristics are limited to a small set that is compatible with what is available through Event Management's aixos resource monitor.
The following table shows the predefined conditions and examples of
expressions that are available for monitoring device performance. All
resources are monitored.
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description |
---|---|---|---|---|
Ethernet receive error rate | RecErrorRate > 1 | An event is generated when the number of receive errors exceeds 1 per second. | (RecErrorRate == 0) && (RecErrorRate@P == 0) | The event is rearmed when the receive error rate is 0 for two consecutive observations. |
Ethernet receive drop rate | RecDropRate > 10 | An event is generated when the number of receive packets dropped exceeds 10 per second. | RecDropRate < 5 | The event is rearmed when the number of dropped packets goes below 5 per second. |
Ethernet transmit drop rate | XmitDropRate > 10 | An event is generated when the number of outbound packets dropped exceeds 10 per second. | XmitDropRate < 5 | The event is rearmed when the number of dropped packets goes below 5 per second. |
Ethernet transmit error rate | XmitErrorRate > 1 | An event is generated when the number of transmit errors exceeds 1 per second. | (XmitErrorRate == 0) && (XmitErrorRate@P == 0) | The event is rearmed when the transmit error rate is 0 for two consecutive observations. |
Ethernet transmit overflow rate | XmitOverflowRate > 10 | An event is generated when the number of transmit queue overflows exceeds 10 per second. | XmitOverflowRate < 2 | The event is rearmed when the number of overflows goes below 2 per second. |
The program name of this resource class is IBM.FDDIDevice. The details of this class are identical to those of the IBM.EthernetDevice class except that the display name of the resource class is "FDDI Device." See the description of Ethernet Device for details that also apply to this device.
The program name of this resource class is IBM.TokenRingDevice. The details of this class are identical to those of the IBM.EthernetDevice class except that the display name of the resource class is "Token-Ring Device." See the description of Ethernet Device for details that also apply to this device.
The program name of this resource class is IBM.Program. This resource class can monitor a set of processes that are running a specific program or command and whose attributes match a filter criterion. The filter criterion can include, for example, the real or effective user name of the process and the arguments that the process was started with. The primary aspect of a program resource that can be monitored is the set of processes that meet the program definition. A client can be informed when processes with the properties that meet the program definition are initiated and when they are terminated. This resource class typically is used to detect when a required subsystem fails so that recovery actions can be performed, or the administrator can be notified, or both.
A program definition requires the program name and the user name of the owner of the program. The program should be identified by user name in addition to program name to avoid confusion when two or more programs have the same name. These persistent attributes are defined as follows:
For a process to match a program definition, and thus be considered to be running the program, its executable name must match the ProgramName persistent attribute value. In addition, the expression defined by the Filter persistent attribute must evaluate to TRUE when applied to the properties of the process. The Filter attribute is a string that consists of the names of various properties of a process, comparison operators, and literal values. For example, a value of user==greg restricts the process set to those processes that run ProgramName under the user ID greg. The syntax for the Filter value is the same as for a selection string.
For more information on selection strings, see Using Expressions.
Processes must run for a minimum duration (approximately 15 seconds) to be monitored by the IBM.Program resource class. If a program runs for only a few seconds, some of the processes that run the program may not be detected.
This property can be monitored: Processes
These elements of Processes can be monitored:
    ps -e -o "ruser,pid,ppid,comm" | grep biod
    root 7786 8040 biod
    root 8040 5624 biod
    root 8300 8040 biod
    root 8558 8040 biod
    root 8816 8040 biod
    root 9074 8040 biod
To be informed when the number of processes running the specified program changes, you can define this event expression:
Processes.CurPidCount!=Processes.PrevPidCount
To be informed when no processes are running the specified program, you can define this event expression:
Processes.CurPidCount==0
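To make the two expressions above concrete, the following Python sketch counts the processes that match a program name and user in a captured process listing, then compares the current count with the previous one. It is an illustration only: `matching_pids` is a hypothetical helper, the listing reuses the biod output shown earlier, and the second observation is simulated by removing one line rather than by running ps again.

```python
PS_SNAPSHOT = """\
root 7786 8040 biod
root 8040 5624 biod
root 8300 8040 biod
root 8558 8040 biod
root 8816 8040 biod
root 9074 8040 biod
"""


def matching_pids(ps_output, program_name, user=None):
    """Return the PIDs whose command name (and, optionally, real user) match."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip lines that are not "ruser pid ppid comm"
        ruser, pid, _ppid, comm = fields
        if comm == program_name and (user is None or ruser == user):
            pids.append(int(pid))
    return pids


prev_pids = matching_pids(PS_SNAPSHOT, "biod", user="root")

# Simulate a later observation in which process 8816 has ended.
later_snapshot = "\n".join(
    line for line in PS_SNAPSHOT.splitlines() if " 8816 " not in line
)
cur_pids = matching_pids(later_snapshot, "biod", user="root")

prev_pid_count, cur_pid_count = len(prev_pids), len(cur_pids)
count_changed = cur_pid_count != prev_pid_count   # CurPidCount != PrevPidCount
none_running = cur_pid_count == 0                 # CurPidCount == 0
print(prev_pid_count, cur_pid_count, count_changed, none_running)
```

Here the first expression would be true, because the count went from 6 to 5, while the second would remain false because five matching processes are still running.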
This resource class is typically used to detect when a required subsystem
fails so that some recovery action can be performed or an administrator can be
notified. The following table shows the predefined conditions and
examples of expressions that are available for monitoring programs.
Condition Name | Event Expression | Event Description | Rearm Expression | Rearm Description | Monitored Resources |
---|---|---|---|---|---|
sendmail daemon state | Processes.CurPidCount <= 0 | An event is generated whenever the sendmail daemon is not running. | Processes.CurPidCount > 0 | The event is rearmed when the sendmail daemon is running. | sendmail |
inetd daemon state | Processes.CurPidCount <= 0 | An event is generated whenever the inetd daemon is not running. | Processes.CurPidCount > 0 | The event is rearmed when the inetd daemon is running. | inetd |
The following predefined responses are shipped as templates or as starting points for monitoring.
Use the Web-based System Manager online help or the ERRM commands (particularly, the chresponse command) to customize these predefined responses.
See Using Expressions for a summary of the data types and operators that you can
use in selection strings for a customized response.
Response Name | Action | Command |
---|---|---|
Critical notification | Name: log critical event | |
Critical notification | Name: e-mail root | |
Critical notification | Name: broadcast message | |
Warning notification | Name: log warning event | |
Warning notification | Name: e-mail root | |
Informational notification | Name: log info event | |
Informational notification | Name: e-mail root | |
Log event anytime | Name: log event | |
Send e-mail to root anytime | Name: e-mail root | |
Send e-mail to root off-shift | Name: e-mail root | |
Broadcast event anytime | Name: broadcast message | |
Display in Events plug-in | Display an event in the Events plug-in. | Available from Web-based System Manager only. This is the only response that can be used by a non-root user. |
As an alternative to the Monitoring GUI, you can use the following scripts, utilities, commands, and files to control Monitoring on your system. See the man pages or AIX Commands Reference for detailed usage information.