Diagnosis Guide

Managing and monitoring the error log

To manage and monitor the error log, you can do the following:

- View error log information in parallel.
- View SP switch error log reports.
- Use AIX error log notification.

Viewing error log information in parallel

When diagnosing a system problem, it may be helpful to look at the error logs of all nodes at once, in parallel.

Do not copy the /var/adm/ras/errlog files from the various nodes to a central place and then run errpt against the combined file. First, the copying time is added to the time needed to process the logs of all the nodes sequentially, so the total time is longer than viewing the logs in parallel. Second, error log analysis requires per-node information from the ODM database on each node.

Note: A user must have specific authorization to use the dsh command. To learn how a user can acquire this authorization, see the "Using the SP System Monitor" chapter of PSSP: Administration Guide.

Use the dsh command with the errpt command and its options to view the error log. Perform the following steps:

View the summary information for all nodes to determine which ones are to be examined more closely. For example:
```
dsh -a errpt -s 0930020094 |pg
```
This example lists all error entries logged after 2 a.m. on September 30, 1994 for every node defined in the System Data Repository (SDR). The summary output, one entry per line, is piped to pg.
Pick out the nodes that have error entries that require further examination.
View the selected nodes. For example:
```
dsh -w host1,host2,host3 errpt -a -s 0930020094 > /tmp/930errors
```
This example collects the fully expanded error log reports logged after 2 a.m. on September 30, 1994 from the nodes host1, host2, and host3, and writes them to the file /tmp/930errors.
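
The start date passed to errpt -s in the examples above has the form mmddhhmmyy. The following is a minimal sketch of a helper that builds such a value from its numeric parts. The function name errpt_start_date is hypothetical and is not part of AIX or PSSP.

```shell
#!/bin/sh
# Build an errpt -s start-date argument (mmddhhmmyy) from numeric parts.
# errpt_start_date is a hypothetical helper name, not an AIX command.
# Pass plain decimal numbers without leading zeros (9, not 09), because
# printf interprets a leading zero as an octal prefix.
errpt_start_date() {
    # $1=month  $2=day  $3=hour  $4=minute  $5=two-digit year
    printf '%02d%02d%02d%02d%02d\n' "$1" "$2" "$3" "$4" "$5"
}

# September 30, 1994 at 2:00 a.m.
errpt_start_date 9 30 2 0 94
```

The result could then be substituted into the summary command, for example dsh -a errpt -s "$(errpt_start_date 9 30 2 0 94)" | pg.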

Summary log for SP Switch, SP Switch2, and switch adapter errors

For systems running PSSP 3.1 or higher, a centralized error log records information about SP Switch, SP Switch2, and switch adapter errors. Logging of switch and adapter errors in the AIX error log on nodes and on the control workstation causes the generation of a summary record in the summary log. This log has the name: /var/adm/SPlogs/css/summlog and is located on the control workstation. The summary log provides a centralized location for monitoring system-wide error activity. It also improves the usability of log output collected from individual nodes.

The summary log contains one summary entry for each CSS error log entry recorded on the failing node or control workstation. Entries in the log have the following fields, which are separated by blanks:

Timestamp - A timestamp in the form: MMDDhhmmYYYY.
Node - The reliable_hostname as stored in the SDR, for the originating node, with the domain portion removed.
Snap indicator - A value that indicates whether a snap dump was taken:
- Y indicates that a snap dump was taken.
- N indicates that a snap dump was NOT taken.
Partition - The name of the system partition to which the node belongs. For error log entries that do not pertain to a particular system partition, this field contains global.
Index - The error log index for the entry being reported.
Label - The error log entry label field for the entry being reported.

Because every CSS error log entry from every node is summarized in this log, you can use it to obtain a single image of error activity across the entire SP system. You can identify problems that involve multiple nodes and determine which nodes are affected. You can also compare the timestamps to determine which node experienced a problem first, which makes it easier to identify the root cause.
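
Because the timestamp has the form MMDDhhmmYYYY, a plain text sort does not order entries correctly across years. The following sketch, which assumes the six blank-separated fields described above, rebuilds the key as YYYYMMDDhhmm before sorting. The function name sort_summlog and the two sample entries are hypothetical.

```shell
#!/bin/sh
# Sort summary-log entries chronologically and print key, node, and label.
# Assumes six blank-separated fields per entry:
#   timestamp(MMDDhhmmYYYY) node snap partition index label
# sort_summlog is a hypothetical helper name.
sort_summlog() {
    awk 'NF >= 6 {
        ts = $1
        # Move the year to the front so a text sort is chronological.
        key = substr(ts, 9, 4) substr(ts, 1, 8)
        print key, $2, $6
    }' | sort
}

# Hypothetical sample entries (not taken from a real system).
sample='100311251994 sp2n5 N global 18282 SWT_DIAG_ERROR2_ER
100311241994 sp2n7 N global 20011 SWT_DIAG_ERROR2_ER'

# The first line of the sorted output is the earliest entry.
printf '%s\n' "$sample" | sort_summlog | head -n 1
```

On the control workstation, the same function could read the real log with sort_summlog < /var/adm/SPlogs/css/summlog.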

Viewing SP Switch error log reports

Enter the following command to view all the SP switch adapter error reports in parallel:

```
dsh -a errpt -a -N css
```

This command sends to stdout the fully formatted error log entries for every unusual status that the switch adapter device drivers have recorded in the error log. The entries may go back as far as 90 days: AIX has a default crontab entry that removes all hardware error entries after 90 days and all software error entries after 30 days.

Enter the following command to view all the SP switch information in parallel:

```
dsh -a errpt -a -N Worm
```

This command sends to stdout all the fully formatted error log entries for the switch, including errors found during switch diagnostics.

Using the AIX Error Notification Facility

You can be notified of an SP error when it occurs by using the AIX Error Notification Facility.

IBM General Concepts and Procedures for RS/6000 (GC23-2202) explains how to use the AIX Error Notification Facility, and IBM RS/6000 Problem Solving Guide (SC23-2204) explains the use of the AIX Error Log. The Error Notification Facility runs a method that the administrator defines in the ODM when a particular error occurs or a particular process fails. The administrator can define notification objects for the following classifications of errors. Most of these errors occur rarely, so defining the notification objects is practical even on large SP systems.

PSSP AIX Error Log Labels that end in _EM.
The _EM suffix signifies an emergency error. Such an entry usually gives the administrator information that is needed to re-IPL a node. To find these messages, issue the command:
```
errpt -t |grep "_EM "
```
Any AIX Error Log entries that have an Error Type of PEND.
PEND signifies an impending loss of availability, and that action will soon be required of the administrator.
Any AIX Error Log entries for the boot device of the node.
The boot device of the node usually has a resource name of hdisk0, but the name may vary if the installation has been customized.
The AIX Error Label EPOW_SUS.
The EPOW_SUS error log entry is generated before power down when an unexpected loss of electrical power is encountered.
The AIX Error Labels KERNEL_PANIC and DOUBLE_PANIC.
KERNEL_PANIC or DOUBLE_PANIC error log entries are generated when a kernel panic occurs.

The following examples may help the administrator add Error Notification Objects on the SP system. Prefixing the ODM commands with dsh -a performs the action on all nodes of the SP system.

Example 1

Mail the error report to root@controlworkstation when a switch adapter fails online diagnostics.

Step 1. Set up directories for the Error Notification objects and methods.
```
mkdir -p /customerdefinedpath/errnotify/objects
mkdir -p /customerdefinedpath/errnotify/methods
```
Keep the methods scripts on each node so you can run them if distributed file system problems occur. File Collections is an excellent way to keep these scripts updated. The object files may be in a distributed file system since they are not used unless changes to the object are required.

Step 2. Create the Error Notification Method scripts.

Create a script or program that will be run when the error occurs. For example:

```
#!/bin/ksh
######################################################################
# Run errpt to get the fully expanded error report for the error
# that was just written, and redirect it to a unique temporary file
# named with the PID of this script.
# $1 is the sequence number of the error log entry, passed in by the
# Error Notification Facility (see en_method in Step 3).
######################################################################
errpt -a -l "$1" > /tmp/tempfile.$$
######################################################################
# Mail the fully expanded error report to root@controlworkstation.
# This could be anywhere in the network.
# root@controlworkstation is the user and hostname at which the
# administrator wants to be notified.
######################################################################
mail root@controlworkstation < /tmp/tempfile.$$
# Remove the temporary file once the report has been mailed.
rm -f /tmp/tempfile.$$
```

Step 3. Create the Error Notification Object
Create a file that contains the Error Notification Object to catch the switch diagnostic failed error.
```
errnotify:
          en_name = "tbx_diagerr.obj"
          en_persistenceflg = 1
          en_label = "SWT_DIAG_ERROR2_ER"
          en_method = "/customerdefinedpath/errnotify/methods/errnot.test.ksh $1"
```
(The en_name value can be a maximum of 16 characters long.) Enter the odmshow errnotify command to view the definition of the errnotify object class.
It is easy to modify an existing set of ODM errnotify stanzas. To do this, enter:
```
odmget errnotify > file
```
and edit the file. Include only attributes that have values.
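
An Error Notification Object file can also be generated rather than typed by hand. The following sketch writes one stanza in the format shown in Step 3 and rejects an en_name longer than the 16-character maximum noted above. The function name make_errnotify_stanza and its argument order are hypothetical.

```shell
#!/bin/sh
# Print one errnotify stanza, refusing names longer than 16 characters.
# make_errnotify_stanza is a hypothetical helper, not an AIX command.
#   $1 = en_name value
#   $2 = selection attribute (for example en_label, en_type, en_resource)
#   $3 = value for that attribute
#   $4 = full path of the notification method script
make_errnotify_stanza() {
    name=$1 attr=$2 value=$3 method=$4
    if [ "${#name}" -gt 16 ]; then
        echo "en_name '$name' is longer than 16 characters" >&2
        return 1
    fi
    printf 'errnotify:\n'
    printf '        en_name = "%s"\n' "$name"
    printf '        en_persistenceflg = 1\n'
    printf '        %s = "%s"\n' "$attr" "$value"
    # The literal $1 is kept in the stanza text.
    printf '        en_method = "%s $1"\n' "$method"
}

make_errnotify_stanza tbx_diagerr.obj en_label SWT_DIAG_ERROR2_ER \
    /customerdefinedpath/errnotify/methods/errnot.test.ksh
```

Redirecting the output to a file under the objects directory from Step 1 gives a file that can be passed to odmadd in Step 4.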
Step 4. Add the Error Notification Object to the errnotify class.
```
odmadd /customerdefinedpath/errnotify/objects/tbx_diagerr.obj
```
(The file name is the name of the file with the Error Notification Object in it.)
To delete this object, enter:
```
odmdelete -o errnotify -q "en_name = tbx_diagerr.obj"
```
To view this object in the ODM database, enter:
```
odmget  -q "en_name = tbx_diagerr.obj" errnotify
```

Step 5. The following mail will be sent to root@controlworkstation when an SP Switch MX adapter fails diagnostics:

 From root@sp2n5.kgn.ibm.com Mon Oct  3 11:25:59 1994
 Received: from sp2n5.kgn.ibm.com by ppsras.kgn.ibm.com
           (AIX 3.2/UCB 5.64/4.03)
           id AA24781; Mon, 7 May 1995 10:14:59 -0400
 Date: Mon, 3 Oct 1994 11:25:59 -0400
 From: root
 Message-Id: <9410031525.AA24781@sp2n5.kgn.ibm.com>
 To: root
 Status: RO
 
  ---------------------------------------------------------------------------
 ERROR LABEL: SWT_DIAG_ERROR2_ER
 ERROR ID: 323C48A0
 
 Date/Time:       Mon Oct  3 11:25:57
 Sequence Number: 18282
 Machine Id:      000004911800
 Node Id:         sp2n5
 Error Class:     H
 Error Type:      PERM
 Resource Name:   Worm
 Resource Class:  NONE
 Resource Type:   NONE
 Location:        NONE
 
 Error Description
 Switch adapter failed On-Line diagnostics
 
 Probable Causes
 Switch clock signal missing
 Switch adapter failure
 
 User Causes
 Switch cable loose or disconnected
 
 Recommended Actions
 Run adapter diagnostics
 
 Failure Causes
 Switch adapter hardware
 
 Recommended Actions
 Run adapter diagnostics
 
 Detail Data
 DETECTING MODULE
 LP=PSSP,Fn=dtb3mx,SID=1.35,L#=1303,
 Service Request Number
 763-942
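
When many such reports arrive, it can help to reduce each one to a single line. The following sketch extracts the node, resource name, error type, and error label from a report laid out like the one above. The function name summarize_report is hypothetical, and the field headings are assumed to match this sample exactly; other AIX levels may format the report differently.

```shell
#!/bin/sh
# Reduce a fully expanded error report to "node resource type label".
# summarize_report is a hypothetical helper name. It assumes the field
# headings used in the sample mail above.
summarize_report() {
    awk -F': *' '
        /^ *ERROR LABEL:/   { label = $2 }
        /^ *Node Id:/       { node = $2 }
        /^ *Error Type:/    { type = $2 }
        # Resource Name follows the other three fields in the report,
        # so the summary line is printed when it is seen.
        /^ *Resource Name:/ { print node, $2, type, label }
    '
}

# Header lines copied from the sample report above.
report='ERROR LABEL: SWT_DIAG_ERROR2_ER
ERROR ID: 323C48A0

Date/Time:       Mon Oct  3 11:25:57
Sequence Number: 18282
Machine Id:      000004911800
Node Id:         sp2n5
Error Class:     H
Error Type:      PERM
Resource Name:   Worm'

printf '%s\n' "$report" | summarize_report
```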

Example 2

Error Notification when any Error Type of PEND occurs.

Steps 1 and 2 are the same as defined in Example 1.

Step 3. Create the Error Notification Object

Create a file that contains the Error Notification Object to catch the pending availability problems. For example:

```
errnotify:
        en_name = "errnot.PEND.obj"
        en_persistenceflg = 1
        en_type = "PEND"
        en_method = "/tmp/errnot.test.ksh $1"

errnotify:
        en_name = "errnot.pend.obj"
        en_persistenceflg = 1
        en_type = "pend"
        en_method = "/tmp/errnot.test.ksh $1"

errnotify:
        en_name = "errnot.Pend.obj"
        en_persistenceflg = 1
        en_type = "Pend"
        en_method = "/tmp/errnot.test.ksh $1"
```

(The variations of PEND are added because upper case is not strictly adhered to by all AIX LPs and vendors.)
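
Since the three stanzas differ only in the spelling of PEND, they could be produced by a loop. The following is a sketch only; the function name pend_stanzas is hypothetical, and the method path is passed in as an argument rather than assumed.

```shell
#!/bin/sh
# Print one errnotify stanza for each spelling of the PEND error type.
# pend_stanzas is a hypothetical helper name.
#   $1 = full path of the notification method script
pend_stanzas() {
    method=$1
    for spelling in PEND pend Pend; do
        printf 'errnotify:\n'
        printf '        en_name = "errnot.%s.obj"\n' "$spelling"
        printf '        en_persistenceflg = 1\n'
        printf '        en_type = "%s"\n' "$spelling"
        # The literal $1 is kept in the stanza text.
        printf '        en_method = "%s $1"\n' "$method"
        printf '\n'
    done
}

pend_stanzas /tmp/errnot.test.ksh
```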

Step 4. Add the Error Notification Objects to the errnotify class. For example:

```
odmadd /customerdefinedpath/errnotify/objects/errnot.pend.obj
```

(The file name is the name of the file with the Error Notification Object in it.)

To delete these objects, enter:
```
odmdelete -o errnotify -q "en_name = errnot.PEND.obj"
odmdelete -o errnotify -q "en_name = errnot.pend.obj"
odmdelete -o errnotify -q "en_name = errnot.Pend.obj"
```

To view these objects in the ODM database, enter:
```
odmget -q "en_name = errnot.PEND.obj" errnotify
odmget -q "en_name = errnot.pend.obj" errnotify
odmget -q "en_name = errnot.Pend.obj" errnotify
```

Step 5. Mail is sent to the administrator when an error that has an Error Type of PEND occurs.

Example 3

Error Notification when any Error on the boot device of hdisk0 occurs.

Steps 1 and 2 are the same as defined in Example 1.
Step 3. Create the Error Notification Object.
Create a file that contains the Error Notification Object to catch the boot disk errors. Assume that hdisk0 is the boot device.
```
errnotify:
        en_name = "errnot.boot.obj"
        en_persistenceflg = 1
        en_resource = "hdisk0"
        en_method = "/tmp/errnot.test.ksh $1"
```

Step 4. Add the Error Notification Object to the errnotify class.

```
odmadd /customerdefinedpath/errnotify/objects/errnot.boot.obj
```

To delete this object, enter:

```
odmdelete -o errnotify -q "en_name = errnot.boot.obj"
```

To view this object in the ODM database, enter:

```
odmget -q "en_name = errnot.boot.obj" errnotify
```

Step 5. Mail with the fully expanded error report will be sent to the administrator when an error on hdisk0 occurs.

Example 4

Error Notification when unexpected power loss and kernel panics occur.

Steps 1 and 2 are the same as defined in Example 1.

Step 3. Create the Error Notification Object

Create a file that contains the Error Notification Object to catch the kernel panic and power loss Error Labels. For example:

```
errnotify:
        en_name = "power.obj"
        en_persistenceflg = 1
        en_label = "EPOW_SUS"
        en_method = "/customerdefinedpath/errnotify/methods/errnot.test.ksh $1"

errnotify:
        en_name = "panic.obj"
        en_persistenceflg = 1
        en_label = "KERNEL_PANIC"
        en_method = "/customerdefinedpath/errnotify/methods/errnot.test.ksh $1"

errnotify:
        en_name = "dbl_panic.obj"
        en_persistenceflg = 1
        en_label = "DOUBLE_PANIC"
        en_method = "/customerdefinedpath/errnotify/methods/errnot.test.ksh $1"
```

Step 4. Add the Error Notification Objects to the errnotify class. For example:
```
odmadd /customerdefinedpath/errnotify/objects/power.panic.obj
```
The file name is the name of the file with the Error Notification Object in it.
Step 5. Mail with the fully expanded error report will be sent to the administrator when any power loss or kernel panic occurs.
