IBM Books

Diagnosis Guide


Managing and monitoring the error log

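This section describes the following ways to manage and monitor the error log: viewing error log information in parallel, using the summary log for SP Switch, SP Switch2, and switch adapter errors, viewing SP Switch error log reports, and using the AIX Error Notification Facility.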

Viewing error log information in parallel

When diagnosing a system problem, it can help to view the error logs of all nodes at once, in parallel.

It is not a good idea to copy the /var/adm/ras/errlog files from the various nodes to a central place and then run errpt against the combined file. First, the copying time is added to the sequential processing time for all the nodes, so the total time is longer than viewing the logs in parallel. Second, error log analysis requires per-node information from the ODM database on each node.

Note:
A user must have specific authorization to use the dsh command. To learn how a user can acquire this authorization, see the "Using the SP System Monitor" chapter of PSSP: Administration Guide.

Use the dsh command with the errpt command and its options to view the error log. Perform the following steps:

  1. View the summary information for all nodes to determine which ones are to be examined more closely. For example:
    dsh -a errpt -s 0930020094 |pg
    

    In this example, every error entry logged after 2 a.m. on September 30, 1994 is listed for every node defined in the System Data Repository. The output is in a one-entry-per-line format and is piped to pg.

  2. Pick out the nodes that have error entries that require further examination.
  3. View the selected nodes. For example:
    dsh -w host1,host2,host3 errpt -a -s 0930020094 > /tmp/930errors
    

    This example collects all the fully expanded error log reports logged after 2 a.m. on September 30, 1994 from the nodes with hostnames host1, host2, and host3, and writes them to the file /tmp/930errors.
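
The -s value in these examples is a timestamp in mmddhhmmyy form. As a minimal sketch, a small shell function can build that value from separate date parts. The function name errpt_stamp is hypothetical and is not part of PSSP or AIX.

```shell
# Build the mmddhhmmyy value that errpt -s expects.
# Arguments: month day hour minute four-digit-year, given as plain
# decimals without leading zeros (printf would treat 08 or 09 as
# invalid octal numbers).
errpt_stamp() {
    printf '%02d%02d%02d%02d%02d\n' "$1" "$2" "$3" "$4" "$(( $5 % 100 ))"
}

# 2 a.m. on September 30, 1994, as used in the steps above.
since=$(errpt_stamp 9 30 2 0 1994)
echo "$since"
```

The printed value can then be passed to errpt -s in either of the dsh commands shown in the steps above.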

Summary log for SP Switch, SP Switch2, and switch adapter errors

For systems running PSSP 3.1 or higher, a centralized error log records information about SP Switch, SP Switch2, and switch adapter errors. Whenever a switch or adapter error is logged in the AIX error log on a node or on the control workstation, a summary record is also written to the summary log. The summary log is the file /var/adm/SPlogs/css/summlog on the control workstation. It provides a centralized location for monitoring system-wide error activity and makes the log output collected from individual nodes easier to use.

The summary log contains one summary entry for each CSS error log entry recorded on the failing node or control workstation. Entries in the log have the following fields, which are separated by blanks:

Because the summary log holds a record for each CSS error log entry produced on each node in the system, you can use it to obtain a single image of error activity across the entire SP system. Using the log, you can identify situations that involve multiple nodes and determine which nodes are affected. You can also use the timestamps to determine which node experienced a problem first, which makes it easier to identify the root cause.
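
As a rough sketch of how the summary log might be searched for one node, the following shell function prints every line that mentions a given hostname. The function name summlog_for_node and the SUMMLOG variable are hypothetical. The match is a plain fixed-string search, so it makes no assumption about the order of the fields in a record.

```shell
# Default location taken from the text; set SUMMLOG or pass a path to override.
SUMMLOG="${SUMMLOG:-/var/adm/SPlogs/css/summlog}"

# Print the summary-log lines that contain the given hostname.
# grep -F treats the name as a fixed string, not a regular expression.
# Note that a name such as host1 also matches host10.
summlog_for_node() {
    node=$1
    log=${2:-$SUMMLOG}
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 1; }
    grep -F -- "$node" "$log"
}
```

On the control workstation, calling summlog_for_node with a hostname would list that node's summary records in the order in which they were written.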

Viewing SP Switch error log reports

Enter the following command to view all the SP switch adapter error reports in parallel:

dsh -a errpt -a -N css

This command writes to stdout the fully formatted error log entries for every unusual status that the switch adapter device drivers detected and that is still in the error log. The entries can go back as far as 90 days, because a default AIX crontab entry removes all hardware error entries after 90 days and all software error entries after 30 days.

Enter the following command to view all the SP switch information in parallel:

dsh -a errpt -a -N Worm

This command writes to stdout all the fully formatted error log entries for the switch, including errors found during switch diagnostics.

Using the AIX Error Notification Facility

You can be notified of an SP error when it occurs by using the AIX Error Notification Facility.

The AIX Error Notification Facility runs a method, defined by the administrator in the ODM, when a particular error occurs or a particular process fails. IBM General Concepts and Procedures for RS/6000 (GC23-2202) explains how to use the facility, and IBM RS/6000 Problem Solving Guide (SC23-2204) explains how to use the AIX Error Log. The administrator can define notification objects for the following classes of errors. Most of these errors occur rarely, so the notification objects are practical even on large SP systems.

  1. PSSP AIX Error Log Labels that end in _EM.

    The _EM suffix signifies an emergency error. It is usually used to give the administrator information that is needed to re-IPL a node. To find these messages, issue the command:

    errpt -t |grep "_EM "
    
  2. Any AIX Error Log entries that have an Error Type of PEND.

    PEND signifies an impending loss of availability, and that action will soon be required of the administrator.

  3. Any AIX Error Log entries for the boot device of the node.

    The boot device of the node usually has a resource name of hdisk0, but the name may vary if the installation has been customized.

  4. The AIX Error Label EPOW_SUS.

    The EPOW_SUS error log entry is generated before power down when an unexpected loss of electrical power is encountered.

  5. The AIX Error Labels KERN_PANIC and DOUBLE_PANIC.

    KERN_PANIC or DOUBLE_PANIC error log entries are generated when a kernel panic occurs.

The following examples may help the administrator add Error Notification Objects on the SP system. Prefixing the ODM commands with dsh -a performs the action on all nodes of the SP system.

Example 1

Mail the error report to root@controlworkstation when a switch adapter fails online diagnostics.
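
A notification object of this kind might be sketched as the errnotify stanza below, written to a file by a short shell script. This is an illustration under stated assumptions, not the guide's own listing: the object name css_diag_mail is hypothetical, and REPLACE_WITH_LABEL is a placeholder for the error label of the diagnostic failure, which must be looked up with errpt -t. The resource name css is the one used with errpt -N earlier in this section.

```shell
# Sketch only: en_name and the en_label value are placeholders.
stanza="${TMPDIR:-/tmp}/css_diag_mail.add.$$"

# The quoted 'EOF' keeps $1 literal. When the object fires, the error
# daemon replaces $1 with the sequence number of the logged entry.
cat > "$stanza" <<'EOF'
errnotify:
        en_name = "css_diag_mail"
        en_persistenceflg = 1
        en_label = "REPLACE_WITH_LABEL"
        en_resource = "css"
        en_method = "errpt -a -l $1 | mail -s 'Switch adapter diagnostics failed' root@controlworkstation"
EOF

echo "stanza written to $stanza"
```

After the placeholder label is replaced, running odmadd with the file name on a node adds the object to the errnotify class. Setting en_persistenceflg to 1 keeps the object across reboots.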

Example 2

Send an error notification when an error with an Error Type of PEND occurs.
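
One possible form of such an object matches on the error type alone. In this sketch the object name pend_mail and the choice of mail as the notification method are assumptions. The mail address reuses the one from Example 1.

```shell
# Sketch only: the object name and the mail method are illustrative.
stanza="${TMPDIR:-/tmp}/pend_mail.add.$$"

# en_type selects every error log entry whose Error Type is PEND.
# The quoted 'EOF' keeps $1 (the entry's sequence number) literal.
cat > "$stanza" <<'EOF'
errnotify:
        en_name = "pend_mail"
        en_persistenceflg = 1
        en_type = "PEND"
        en_method = "errpt -a -l $1 | mail -s 'PEND error logged' root@controlworkstation"
EOF

echo "stanza written to $stanza"
```

Running odmadd with the file name on a node adds the object to the errnotify class.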

Example 3

Send an error notification when any error occurs on the boot device, hdisk0.
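
A sketch of this case matches on the resource name instead of a label or type. The object name bootdisk_mail and the mail method are assumptions, and hdisk0 must be changed if the boot device has a different resource name on the node.

```shell
# Sketch only: the object name and the mail method are illustrative.
stanza="${TMPDIR:-/tmp}/bootdisk_mail.add.$$"

# en_resource selects every error log entry logged against hdisk0.
# The quoted 'EOF' keeps $1 (the entry's sequence number) literal.
cat > "$stanza" <<'EOF'
errnotify:
        en_name = "bootdisk_mail"
        en_persistenceflg = 1
        en_resource = "hdisk0"
        en_method = "errpt -a -l $1 | mail -s 'Boot device error' root@controlworkstation"
EOF

echo "stanza written to $stanza"
```

Running odmadd with the file name on a node adds the object to the errnotify class.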

Example 4

Send an error notification when an unexpected power loss or a kernel panic occurs.
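
Because each notification object names a single error label, this case can be sketched as one stanza per label, generated in a loop. The labels are the ones listed earlier in this section. Using the label as the object name and mail as the method are assumptions made for illustration.

```shell
# Sketch only: object names and the mail method are illustrative.
stanza="${TMPDIR:-/tmp}/power_panic.add.$$"
: > "$stanza"

# One errnotify object per label. The backslash keeps $1 literal in the
# file, so the error daemon can later substitute the sequence number.
for label in EPOW_SUS KERN_PANIC DOUBLE_PANIC; do
    cat >> "$stanza" <<EOF
errnotify:
        en_name = "$label"
        en_persistenceflg = 1
        en_label = "$label"
        en_method = "errpt -a -l \$1 | mail -s $label root@controlworkstation"

EOF
done

echo "stanzas written to $stanza"
```

Running odmadd with the file name on a node adds all three objects to the errnotify class.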

