The RSCT Resource Monitoring and Control (RMC) application is part of Reliable Scalable Cluster Technology (RSCT). It provides consistent, comprehensive monitoring of system resources. By monitoring conditions of interest and responding automatically when those conditions occur, RSCT Resource Monitoring and Control helps maintain system availability. It is installed as part of the base operating system and is administered through the easy-to-use Web-based System Manager graphical user interface or from the command line.
You can set up monitoring through the Web-based System Manager user interface or from the command line. See Using the Monitoring Application for more details.
The Monitoring application offers a comprehensive set of monitoring and response capabilities that let you detect, and in many cases correct, system resource problems such as a critical file system becoming full. You can monitor virtually all aspects of your system resources and specify a wide range of actions to be taken when a problem occurs, from simple notification by e-mail to recovery by a user-written script. You can specify an unlimited number of actions to be taken in response to an event.
As a system administrator, you have a great deal of flexibility in responding to events. You can respond to an event in different ways based on the day of the week and the time of day. The following are some examples of how you can use monitoring:
Monitoring lets you detect conditions of interest in your machine and its associated resources, and automatically take action when those conditions occur. The key elements in monitoring are conditions and responses.
A condition identifies one or more resources you want to monitor, such as the /var file system, and the specific resource state you are interested in, such as /var > 90% full.
A response specifies one or more actions to be taken when the condition is found to be true. Actions can include notification, running commands, and logging.
To understand and use conditions, you need to know about the following:
System resources that you can monitor are organized into general categories called resource classes. Examples of resource classes include Processor, File System, Physical Volume, and Ethernet Device.
Each resource class includes individual system resources that belong to the class. For example, the File System resource class might include these resources:
When a resource is specified for use in a condition, it is called a monitored resource.
Each resource class also has a set of properties that you can monitor. For example, the File System resource class has the following properties available for monitoring:
When a resource property is specified for use in a condition, it is called a monitored property. (In the underlying subsystems, a monitored property is referred to as a dynamic attribute.)
In the condition, you specify the monitored property in a logical expression that defines the threshold or state of the monitored resources. This logical expression is the event expression of the condition. Event expressions are typically used to monitor potential problems and significant changes in the system. For example, the event expression for a /var space used condition might be PercentTotUsed > 90. When the event expression evaluates to true, an event is generated.
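To make the idea of an event expression concrete, the following is a minimal sketch, not RMC's implementation, that applies the example expression PercentTotUsed > 90 to several monitored resources. The sample readings and the extra file systems /tmp and /home are hypothetical; in practice RMC collects the property values itself.

```python
# Hypothetical PercentTotUsed readings for each monitored file system.
samples: dict[str, float] = {"/var": 93.5, "/tmp": 41.0, "/home": 90.0}


def event_expression(percent_tot_used: float) -> bool:
    """The example event expression: PercentTotUsed > 90."""
    return percent_tot_used > 90


def resources_with_events(readings: dict[str, float]) -> list[str]:
    """Return the monitored resources for which the event expression is true."""
    return [name for name, value in readings.items() if event_expression(value)]


events = resources_with_events(samples)
print(events)  # ['/var']
```

Note that /home, at exactly 90%, does not satisfy the strict comparison > 90, so only /var generates an event.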
The rearm expression of a condition is optional. A rearm expression typically indicates when the monitored resource has returned to an acceptable state. When the rearm expression becomes true, evaluation of the event expression resumes.
If a rearm expression is not specified, then once the event expression becomes true, an event is generated for certain properties every time the monitored property is evaluated and the event expression is true.
If a rearm expression is specified, evaluation of the rearm expression starts once the event expression becomes true. When the rearm expression becomes true, a rearm event is generated, and evaluation of the event expression starts again. For example, if the event expression for a /var space used condition is PercentTotUsed > 90 and the rearm expression is PercentTotUsed < 80, an event is generated when /var is more than 90% full. From then on, the condition is evaluated with the rearm expression. When /var drops below 80% full, a rearm event is generated indicating that the condition has been reset, and the event expression is used again to evaluate the condition.
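The alternation between the event expression and the rearm expression behaves like a two-state machine. The sketch below is an illustrative model only, assuming a hypothetical series of PercentTotUsed readings and a hypothetical `monitor` function; the branch without a rearm expression models the behavior described above for those properties that generate an event on every evaluation while the event expression is true.

```python
from typing import Callable, Optional

Expr = Callable[[float], bool]


def monitor(values: list[float], event_expr: Expr,
            rearm_expr: Optional[Expr] = None) -> list[tuple[int, str]]:
    """Return (sample_index, kind) pairs, where kind is 'event' or 'rearm'."""
    results: list[tuple[int, str]] = []
    evaluating_event = True  # False while waiting for the rearm expression
    for i, value in enumerate(values):
        if evaluating_event:
            if event_expr(value):
                results.append((i, "event"))
                # With a rearm expression, evaluate it from the next sample on.
                if rearm_expr is not None:
                    evaluating_event = False
        elif rearm_expr is not None and rearm_expr(value):
            results.append((i, "rearm"))
            evaluating_event = True  # resume evaluating the event expression
    return results


def var_event(v: float) -> bool:
    return v > 90  # PercentTotUsed > 90


def var_rearm(v: float) -> bool:
    return v < 80  # PercentTotUsed < 80


samples = [85.0, 92.0, 95.0, 88.0, 79.0, 91.0]  # hypothetical readings for /var
with_rearm = monitor(samples, var_event, var_rearm)
without_rearm = monitor(samples, var_event)
print(with_rearm)     # [(1, 'event'), (4, 'rearm'), (5, 'event')]
print(without_rearm)  # [(1, 'event'), (2, 'event'), (5, 'event')]
```

With the rearm expression, the reading of 95% produces no second event because the condition is waiting to be reset; without it, every reading above 90% produces an event.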
See Using Expressions for more information about data types and operators that you can use in an event expression or a rearm expression.
Predefined conditions are provided with the Monitoring application. To create a new condition you will need to set the following condition components:
Condition Component | Description | Example |
---|---|---|
Condition name | The name you want to give the condition. | /var space used |
Resource class | The resource class to be monitored. | FileSystem |
Monitored property | The property of the resource class to be monitored. | PercentTotUsed |
Monitored resources | The specific resources in the resource class that are to be monitored. | /var |
Event expression | A logical expression defining the value or state of the monitored property that is to generate an event. | PercentTotUsed > 90 |
Event description | A text description of the event expression. | An event occurs when /var is more than 90% full. |
Rearm expression | An optional logical expression that is evaluated once the event expression becomes true. When the rearm expression becomes true, a rearm event occurs and the event expression is used for evaluation again. | PercentTotUsed < 80 |
Rearm description | A text description of the rearm expression. | A rearm event occurs when /var is less than 80% full. |
Severity | The severity of the condition: Informational, Warning, or Critical. | Critical |
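The condition components in the table map naturally onto a record type. The following sketch is a hypothetical Python representation (the `Condition` and `Severity` names are illustrative, not part of RSCT) populated with the /var space used example from the table; the rearm fields default to `None` because the rearm expression is optional.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    INFORMATIONAL = "Informational"
    WARNING = "Warning"
    CRITICAL = "Critical"


@dataclass
class Condition:
    """Hypothetical record of the condition components listed in the table."""
    name: str
    resource_class: str
    monitored_property: str
    monitored_resources: list[str]
    event_expression: str
    event_description: str
    severity: Severity
    rearm_expression: Optional[str] = None  # optional component
    rearm_description: Optional[str] = None

    def has_rearm(self) -> bool:
        return self.rearm_expression is not None


var_space_used = Condition(
    name="/var space used",
    resource_class="FileSystem",
    monitored_property="PercentTotUsed",
    monitored_resources=["/var"],
    event_expression="PercentTotUsed > 90",
    event_description="An event occurs when /var is more than 90% full.",
    severity=Severity.CRITICAL,
    rearm_expression="PercentTotUsed < 80",
    rearm_description="A rearm event occurs when /var is less than 80% full.",
)
print(var_space_used.has_rearm())  # True
```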
A response consists of one or more actions to be performed by the system when an event or rearm event occurs for a condition. In the Monitoring application you can use the predefined responses or create new responses, and associate them with conditions as needed. You can associate multiple responses with one condition, and associate a single response with multiple conditions.
The responses for a condition remain deactivated until you start monitoring for that condition. When you start monitoring a condition, you must activate at least one of its responses. Responses that are not activated remain available for later use, so you can use different responses for a condition as needed without having to redefine them.
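The relationship between associated responses and active responses can be sketched as follows. This is a hypothetical model, not the RMC data model: `Response`, `MonitoredCondition`, and their methods are illustrative names. It enforces the rule that starting monitoring requires at least one activated response, and it runs only the actions of active responses when an event occurs.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    name: str
    actions: list[str] = field(default_factory=list)


class MonitoredCondition:
    """Hypothetical condition with associated responses, some of them active."""

    def __init__(self, name: str, responses: list[Response]) -> None:
        self.name = name
        self.responses = {r.name: r for r in responses}  # all associated responses
        self.active: set[str] = set()                     # names of active responses
        self.monitoring = False

    def start_monitoring(self, *response_names: str) -> None:
        if not response_names:
            raise ValueError("at least one response must be activated")
        unknown = set(response_names) - set(self.responses)
        if unknown:
            raise KeyError(f"not associated with {self.name}: {sorted(unknown)}")
        self.active.update(response_names)
        self.monitoring = True

    def on_event(self) -> list[str]:
        """Return the actions run for an event; inactive responses are skipped."""
        if not self.monitoring:
            return []
        return [action for name in sorted(self.active)
                for action in self.responses[name].actions]


cond = MonitoredCondition("/var space used", [
    Response("Workday response", ["E-mail the operator"]),
    Response("Weekend response", ["E-mail the system administrator", "Log to a file"]),
])
before = cond.on_event()  # [] because monitoring has not started
cond.start_monitoring("Weekend response")
after = cond.on_event()
print(after)  # ['E-mail the system administrator', 'Log to a file']
```

The inactive "Workday response" stays associated with the condition, so it can be activated later without being redefined.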
Predefined responses are provided with the Monitoring application. To create a new response you will need to set the following response and action components:
Response Component | Description | Example |
---|---|---|
Response name | The name you want to give the response. | Response for critical conditions |
Actions | One or more actions to be taken as part of the response. | Log events to a file |
Action Component | Description | Example |
---|---|---|
Action name | The name of an action to be taken as part of the response. | Send e-mail to the operator |
When in effect | The days and times when this action is to be used to respond to the condition. | 08:00 - 17:00 Monday-Friday |
Use for event, rearm event, or both | Whether the action is to be used to respond to an event, a rearm event, or both. | Event |
Command | The command to be run when an event or rearm event occurs. | A recovery script |
You can associate multiple responses with a condition if you want to define different responses based on when the event occurs. For example, you might have a work day response and a weekend response, each containing one or more actions. Consider how you might respond to a /var space used condition with the following responses. During working hours, you might want to e-mail the operator, run a command, and broadcast a message to users who are logged on. During weekend hours, you might want to e-mail the system administrator and log a message to a file.
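The "When in effect" and "Use for event, rearm event, or both" action components together decide which actions run at a given moment. The sketch below models the workday and weekend responses described above. The `Action` class, the `actions_to_run` function, the full-day weekend window, and the choice of which actions also handle rearm events are all assumptions made for illustration; only the 08:00 - 17:00 Monday-Friday window comes from the table.

```python
from dataclasses import dataclass
from datetime import datetime, time

WEEKDAYS = frozenset({0, 1, 2, 3, 4})  # datetime.weekday(): Monday=0
WEEKEND = frozenset({5, 6})


@dataclass(frozen=True)
class Action:
    name: str
    days: frozenset[int]
    start: time
    end: time
    use_for: str  # "event", "rearm", or "both"

    def applies(self, when: datetime, kind: str) -> bool:
        """True if this action should run for an occurrence of `kind` at `when`."""
        if self.use_for not in (kind, "both"):
            return False
        return when.weekday() in self.days and self.start <= when.time() <= self.end


workday_response = [
    Action("E-mail the operator", WEEKDAYS, time(8, 0), time(17, 0), "event"),
    Action("Run a command", WEEKDAYS, time(8, 0), time(17, 0), "event"),
    Action("Broadcast a message to logged-on users", WEEKDAYS, time(8, 0), time(17, 0), "event"),
]
weekend_response = [
    Action("E-mail the system administrator", WEEKEND, time.min, time.max, "event"),
    Action("Log a message to a file", WEEKEND, time.min, time.max, "both"),
]


def actions_to_run(responses: list[list[Action]], when: datetime, kind: str) -> list[str]:
    """Names of the actions, across all responses, that apply at `when`."""
    return [a.name for response in responses for a in response if a.applies(when, kind)]


responses = [workday_response, weekend_response]
monday_10am = datetime(2024, 1, 1, 10, 0)    # 1 January 2024 was a Monday
saturday_10am = datetime(2024, 1, 6, 10, 0)  # 6 January 2024 was a Saturday
print(actions_to_run(responses, monday_10am, "event"))
print(actions_to_run(responses, saturday_10am, "rearm"))  # ['Log a message to a file']
```

In this model, an event on a weekday evening triggers no action at all, which shows why the in-effect windows of the associated responses should be chosen to cover every period you care about.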
Once monitoring for the condition begins, the system evaluates the event expression to see if it is true. When the event expression becomes true, an event occurs that automatically notifies all of the associated event responses, which causes each event response to run its defined actions.
The event expression and the rearm expression work together as follows when a condition is monitored. First, the event expression is evaluated. When it becomes true, an event occurs, the actions defined for the event are taken, and the system begins evaluating the rearm expression. When the rearm expression becomes true, the rearm event occurs, which automatically starts the actions defined for the rearm event, and the system returns to evaluating the event expression.
The following illustration shows this cycle:
When an event occurs for the monitored condition, the actions for the event are automatically taken. If the condition has a rearm expression, a rearm event causes the rearm actions to be taken.
The following illustration shows these interactions:
A root user can perform all Monitoring tasks:
A non-root user can perform only the following Monitoring tasks:
For a complete description of the underlying components of the RSCT Resource Monitoring and Control application, see Components Provided for Monitoring.