Resource Monitoring and Control Guide and Reference

Diagnostic Information

Files containing internal trace output are created in the /var/ct/IW/log/mc/Resource Manager directory, where Resource Manager stands for the name of the resource manager. This output is useful to a software service organization for resolving problems. An internal trace utility tracks the activity of the resource manager daemon. Multiple levels of detail may be available for diagnosing problems. A minimal level of tracing is on at all times. Full tracing can be activated with the command:

traceson -s IBM.HostRM

Tracing can be returned to the minimal level with the command:

tracesoff -s IBM.HostRM

where IBM.HostRM is used as an example of a resource manager.
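The two commands can be wrapped in a small helper so that the trace level is chosen by name. The following is a minimal sketch only: set_rm_trace is an illustrative function name, not part of RSCT, and it does nothing beyond calling the traceson and tracesoff commands shown above.

```shell
# Switch a resource manager between full and minimal tracing.
# set_rm_trace is a hypothetical helper name; traceson and tracesoff are
# the SRC commands described in this section.
set_rm_trace() {
    level=$1
    rm_name=$2
    case "$level" in
        full)    traceson -s "$rm_name" ;;
        minimal) tracesoff -s "$rm_name" ;;
        *)       echo "usage: set_rm_trace full|minimal rm_name" >&2
                 return 2 ;;
    esac
}
```

For example, set_rm_trace full IBM.HostRM turns full tracing on, and set_rm_trace minimal IBM.HostRM returns to the minimal level.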

Resource Manager Diagnostic Files

All trace files are written by the trace utility to the /var/ct/IW/log/mc/Resource Manager directory. Each file in this directory that is named trace<.n> corresponds to a separate run of the resource manager. The file for the current run is named trace, with no suffix. Trace files from earlier runs have a suffix of .n, where n starts at 0 for the most recent earlier run and increases for older runs.
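The naming scheme means that a plain alphabetical listing does not show the runs in order (trace.10 sorts before trace.2). As a sketch, the following helper prints the trace files of one directory from the newest run to the oldest. The name list_trace_runs is hypothetical, and the directory is supplied by the caller.

```shell
# List the trace files in a resource manager's log directory, newest run first.
# list_trace_runs is a hypothetical helper name, not an RSCT command.
list_trace_runs() {
    dir=$1
    # The file for the current run has no suffix.
    if [ -f "$dir/trace" ]; then
        echo "$dir/trace"
    fi
    # Earlier runs: sort the numeric suffixes so that a larger n (older run)
    # comes later in the output.
    ls "$dir" | sed -n 's/^trace\.\([0-9][0-9]*\)$/\1/p' | sort -n |
    while read -r n; do
        echo "$dir/trace.$n"
    done
}
```

Files in the directory that do not match trace or trace.n are ignored.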

Use the rpttr command to view these files. To view records as they are added by an active process, add the -f option to the rpttr command.

Any core files that result from a program error are written by the trace utility to the /var/ct/IW/run/mc/Resource Manager directory. Like the trace files, older core files have a .n suffix that increases with age. Core files and trace files with the same suffix correspond to the same run instance.

The log and run directories have a default limit of 10 MB. The resource managers ensure that the total amount of disk space used stays below this limit. When a resource manager is over the limit, trace files without corresponding core files are removed first. Then pairs of core and trace files are removed, starting with the oldest. At least one pair of core and trace files is always retained.
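The removal order can be illustrated with a short sketch. It only prints the order in which files would become candidates for removal. It does not delete anything and does not measure disk usage. Several details are assumptions that do not come from this guide: core files are taken to be named core.n, the file for the current run (trace) is never listed, unpaired trace files are listed oldest first, and removal_order is a hypothetical name.

```shell
# Print removal candidates in order: unpaired trace files first, then
# trace/core pairs from oldest to newest, always keeping the newest pair.
# Assumptions (not from the guide): core files are named core.<n>, and the
# current run's "trace" file is never a candidate.
removal_order() {
    logdir=$1
    rundir=$2
    # Numeric suffixes of old trace files, oldest (largest n) first.
    suffixes=$(ls "$logdir" | sed -n 's/^trace\.\([0-9][0-9]*\)$/\1/p' | sort -rn)
    # Pass 1: trace files that have no core file with the same suffix.
    for n in $suffixes; do
        [ -f "$rundir/core.$n" ] || echo "$logdir/trace.$n"
    done
    # Pass 2: collect the suffixes that have both a trace and a core file.
    pairs=""
    for n in $suffixes; do
        if [ -f "$rundir/core.$n" ]; then
            pairs="$pairs $n"
        fi
    done
    # Emit every pair except the last (newest) one, which is retained.
    set -- $pairs
    while [ $# -gt 1 ]; do
        echo "$logdir/trace.$1 $rundir/core.$1"
        shift
    done
}
```

A resource manager would stop removing files as soon as usage falls below the limit, so in practice only the first few candidates may be affected.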

Recovering from RMC and Resource Manager Problems

This section describes the tools that you can use to recover from infrastructure problems. It tells you how to determine whether the components of the monitoring system are running and what to do if the RMC subsystem or one of the resource managers stops abnormally. Common troubleshooting problems and solutions are also described.

The Audit Log, Event Response, File System, and Host resource managers recover from most errors because they have few dependencies. In some cases, the recovery consists of terminating and restarting the appropriate daemon. These resource managers can recover from at least the following errors:

Loss of the connection to the RMC daemon, typically caused by termination of the RMC daemon or by another system problem.
Programming errors that cause the process to terminate abnormally, such as invalid memory references and memory leaks. In this case, the SRC subsystem restarts the daemon.
The /var or /tmp directories filling up. When this happens, core and trace files cannot be captured.

In addition, all parameters received from the RMC subsystem are verified to avoid impacting other clients that may be using the same resource manager.

The following tools are described:

ctsnap command
SRC-controlled commands
rmcctrl command for the RMC subsystem
Audit log

ctsnap Command

For debugging purposes, the ctsnap command can be used to tar the RSCT and resource-manager programs and send them to the software service organization. The ctsnap command gathers system configuration information and compresses the information into a tar file, which can then be downloaded to disk or tape and transmitted to a remote system. The information gathered with the ctsnap command may be required to identify and resolve system problems. See the man page for the ctsnap command for more information.

SRC-Controlled Commands

The RMC subsystem and the resource managers are controlled by the System Resource Controller (SRC). They can be viewed and manipulated by SRC commands. For example:

To see the status of all resource managers, enter:

lssrc -g rsct_rm

To see the status of an individual resource manager, enter:

lssrc -s rmname

where rmname can be:

IBM.AuditRM
IBM.ERRM
IBM.FSRM
IBM.HostRM

To see the status of all SRC-controlled subsystems on the local machine, enter:

lssrc -a

To see the status of a particular subsystem, such as the RMC subsystem (known to SRC as ctrmc), enter:

lssrc -s ctrmc
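In a script, the lssrc output can be reduced to a yes or no answer. The sketch below assumes that lssrc -s prints one header line followed by one line per subsystem, with the status (for example, active or inoperative) in the last column. That layout should be confirmed on the system in use. The name src_is_active is hypothetical.

```shell
# Return success (0) if the named SRC subsystem is reported as active.
# Assumption: lssrc -s prints a header line, then one line per subsystem
# whose last column is the status. src_is_active is a hypothetical name.
src_is_active() {
    lssrc -s "$1" 2>/dev/null |
        awk 'NR > 1 && $NF == "active" { found = 1 } END { exit !found }'
}
```

For example, src_is_active ctrmc || echo "RMC subsystem is not active" reports a stopped RMC subsystem, and the same call works with a resource manager name such as IBM.HostRM.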

The SRC has these commands:

lssrc
startsrc
stopsrc
traceson
tracesoff

For more information, see the command man pages or AIX Commands and Technical References.

To find out more about SRC, see System Management Concepts: Operating System and Devices.
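When recovery consists of terminating and restarting a daemon, the stopsrc and startsrc commands can be combined. The following is a sketch under stated assumptions: restart_subsystem is a hypothetical name, and the function does not wait for the subsystem to finish stopping, so startsrc may need to be repeated if the stop is still in progress.

```shell
# Stop and then start an SRC-controlled subsystem by name.
# restart_subsystem is a hypothetical helper name. A failing stopsrc (for
# example, because the subsystem is already stopped) is reported but does
# not prevent the start attempt.
restart_subsystem() {
    name=$1
    stopsrc -s "$name" ||
        echo "stopsrc failed for $name; attempting start anyway" >&2
    startsrc -s "$name"
}
```

For example, restart_subsystem IBM.HostRM restarts the Host resource manager.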

Recovery Support for RMC Using rmcctrl

The RMC command rmcctrl controls the operation of the RMC subsystem and the RSCT resource managers. It is not normally run from the command line, but it can be used in some diagnostic environments; for example, it can be used to add, start, stop, or delete an RMC subsystem. See the rmcctrl command in the AIX Commands Reference, which is available at http://www.ibm.com/servers/aix/library.
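For diagnostic use, the four operations named above can be mapped to rmcctrl flags as in the following sketch. The flag letters (-a to add, -s to start, -k to stop, -d to delete) are an assumption here and must be confirmed against the rmcctrl man page for the installed release before use. The name rmc_control is hypothetical.

```shell
# Map a named operation to an rmcctrl invocation.
# Assumption: -a adds, -s starts, -k stops, and -d deletes the RMC
# subsystem; verify these flags in the rmcctrl man page.
# rmc_control is a hypothetical helper name.
rmc_control() {
    case "$1" in
        add)    rmcctrl -a ;;
        start)  rmcctrl -s ;;
        stop)   rmcctrl -k ;;
        delete) rmcctrl -d ;;
        *)      echo "usage: rmc_control add|start|stop|delete" >&2
                return 2 ;;
    esac
}
```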

Tracking ERRM Events with the Audit Log

The audit log is a system-wide facility for recording information about the system's operation. It can include information about the normal operation of the system as well as system problems and errors. It is meant to augment error log functionality by conveying the relationship of the error relative to other system activities. All detailed information about system problems is still written to the operating system error log.

Records are created in the audit log by subsystems that have been instrumented to do so. For example, the Event Response subsystem runs in the background to monitor conditions defined by the administrator and invokes one or more actions when a condition becomes true. Because this subsystem runs in the background, it is difficult for the operator or administrator to see the full set of events that occurred and the results of any actions taken in response to them. Because the Event Response subsystem records its activity in the audit log, the administrator can use the lsaudrec command to view that activity, as well as the activity of other subsystems.
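To look only at Event Response activity, the lsaudrec listing can be filtered. The sketch below assumes that each Event Response record in the listing carries the subsystem name ERRM as a separate word. Check the actual output on the system before relying on this. The name errm_audit_records is hypothetical.

```shell
# Show only the audit log records that mention the ERRM subsystem.
# Assumption: records written by the Event Response subsystem contain the
# word "ERRM" in the lsaudrec listing. errm_audit_records is a
# hypothetical helper name.
errm_audit_records() {
    lsaudrec | grep -w "ERRM"
}
```

The function returns a nonzero status when no matching records are found, which a calling script can test.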

Troubleshooting Problems and Solutions

See the Web-based System Manager online help for common RMC troubleshooting problems and solutions.