To manage and monitor the error log, you can do the following:
It may be helpful when diagnosing a system problem to look at all of the error logs at once in parallel.
It is not a good idea to copy the /var/adm/ras/errlog files from the various nodes to a central place and then run errpt against the combined file. First, copying time is added to the sequential processing time of all the nodes and the total time required will be longer than viewing the logs in parallel. Second, error log analysis requires per node information from the ODM database (on each node).
Use the dsh command with the errpt command and its options to view the error log. Perform the following steps:
dsh -a errpt -s 0930020094 |pg
In this example, errpt lists, one entry per line, all error entries logged since 2 a.m. on September 30, 1994 on every node defined in the System Data Repository. The output is piped to pg.
dsh -w host1,host2,host3 errpt -a -s 0930020094 > /tmp/930errors
This example collects the fully expanded error reports logged since 2 a.m. on September 30, 1994 from the nodes host1, host2, and host3, and writes them to /tmp/930errors.
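The -s value in both examples reads as mmddhhmmyy (month, day, hour, minute, two-digit year), which is how 0930020094 corresponds to September 30, 1994 at 2 a.m. A minimal sketch of building that value follows; the function name errpt_start_time is hypothetical and is not part of AIX or PSSP.

```sh
# Hypothetical helper: build an errpt -s start time (mmddhhmmyy).
# Arguments: month day hour minute four-digit-year, given as plain decimal
# numbers without leading zeros (a leading zero would be read as octal).
errpt_start_time() {
    printf '%02d%02d%02d%02d%02d\n' "$1" "$2" "$3" "$4" "$(( $5 % 100 ))"
}

errpt_start_time 9 30 2 0 1994    # prints 0930020094
```

The result could then be passed to the commands above, for example dsh -a errpt -s "$(errpt_start_time 9 30 2 0 1994)".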
For systems running PSSP 3.1 or higher, a centralized error log records information about SP Switch, SP Switch2, and switch adapter errors. Logging of switch and adapter errors in the AIX error log on nodes and on the control workstation causes the generation of a summary record in the summary log. This log has the name: /var/adm/SPlogs/css/summlog and is located on the control workstation. The summary log provides a centralized location for monitoring system-wide error activity. It also improves the usability of log output collected from individual nodes.
The summary log contains one summary entry for each CSS error log entry recorded on the failing node or control workstation. Entries in the log have the following fields, which are separated by blanks:
For error log entries that do not pertain to a particular system partition, this field contains global.
The summary log contains a record for each CSS error log entry produced on each node in the system. You can use this log to obtain a single image of error activity across the entire SP system. Using the log, you can identify situations involving multiple nodes and determine the nodes that are affected. You can use the timestamps to determine which node experienced a problem first, so that you can more easily identify the root cause of a problem.
Enter the following command to view all the SP switch adapter error reports in parallel:
dsh -a errpt -a -N css
This command sends to stdout the fully formatted error log entries for every unusual status that the switch adapter device drivers have logged. The entries may cover up to the past 90 days: AIX has a default crontab entry that removes hardware error entries after 90 days and software error entries after 30 days.
Enter the following command to view all the SP switch information in parallel:
dsh -a errpt -a -N Worm
It sends to stdout all the fully-formatted error log entries for the switch. This includes errors found during switch diagnostics.
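When these commands run across many nodes, it can help to see first how many entries each node reported. The sketch below assumes that dsh prefixes every output line with the node's host name followed by a colon, and that each fully formatted entry contains one LABEL: line (it appears as ERROR LABEL: in the sample report in this section). The function name is hypothetical.

```sh
# Hypothetical helper: count fully formatted errpt entries per node.
# Reads "dsh ... errpt -a" output on stdin. Assumes each line starts with
# "hostname:" and each entry has one "LABEL:" or "ERROR LABEL:" line.
count_entries_per_node() {
    awk '
        $2 == "LABEL:" || ($2 == "ERROR" && $3 == "LABEL:") {
            node = $1
            sub(/:$/, "", node)      # strip the colon after the host name
            count[node]++
        }
        END { for (n in count) print n, count[n] }
    ' | sort
}
```

A possible use is dsh -a errpt -a -N css | count_entries_per_node, which would print one line per node with its entry count.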
You can be notified of an SP error when it occurs by using the AIX Error Notification Facility.
The AIX Error Notification Facility runs a method, defined by the administrator in the ODM, when a particular error occurs or a particular process fails. IBM General Concepts and Procedures for RS/6000 (GC23-2202) explains how to use the facility, and IBM RS/6000 Problem Solving Guide (SC23-2204) explains the use of the AIX Error Log. The administrator can define notification objects for the following classifications of errors. Many of these messages occur rarely, so the notification objects can be defined even on large SP systems.
The EM suffix signifies an emergency error. These entries usually give the administrator information needed to re-IPL a node. To find these messages, issue the command:
errpt -t |grep "_EM "
PEND signifies an impending loss of availability, and that action will soon be required of the administrator.
The boot device of the node usually has a resource name of hdisk0, but the name may vary if the installation has been customized.
The EPOW_SUS error log entry is generated before power down when an unexpected loss of electrical power is encountered.
KERNEL_PANIC or DOUBLE_PANIC error log entries are generated when a kernel panic occurs.
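The grep for the EM suffix shown in this list depends on a blank following the label. A slightly more defensive sketch matches on the label column itself. It assumes that the label is the second whitespace-separated field of errpt -t output; the function name is hypothetical.

```sh
# Hypothetical helper: list error labels that end in a given suffix.
# Reads errpt -t style lines on stdin; assumes the label is field 2.
#   $1 = suffix without the underscore, for example EM
labels_with_suffix() {
    awk -v sfx="$1" '$2 ~ ("_" sfx "$") { print $2 }'
}
```

For example, errpt -t | labels_with_suffix EM would print only the labels that end in _EM.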
The following examples may help the administrator add Error Notification Objects on the SP system. Prefixing the ODM commands with dsh -a performs the action on all nodes of the SP system.
Mail the error report to root@controlworkstation when a switch adapter fails online diagnostics.
mkdir /customerdefinedpath/errnotify/objects
mkdir /customerdefinedpath/errnotify/methods
Keep the methods scripts on each node so you can run them if distributed file system problems occur. File Collections is an excellent way to keep these scripts updated. The object files may be in a distributed file system since they are not used unless changes to the object are required.
Create a script or program that will be run when the error occurs. For example:
#!/bin/ksh
######################################################################
# Run errpt to get the fully expanded error report for the error
# that was just written and redirect to a unique tempfile with the PID
# of this script.
######################################################################
errpt -a -l $1 > /tmp/tempfile.$$
######################################################################
# Mail the fully expanded error report to root@controlworkstation
# This could be anywhere in the network.
# root@controlworkstation is the user and hostname that the
# administrator wants to be notified at.
######################################################################
mail root@controlworkstation < /tmp/tempfile.$$
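The same method can be written without the temporary file, which the script above leaves behind in /tmp. This is only a sketch, not the documented PSSP method: the function name mail_error_report is hypothetical, and it assumes, as the script above does, that the first argument is the sequence number of the entry that was just logged.

```sh
# Hypothetical helper: mail the expanded report for one error log entry.
#   $1 = sequence number passed to the notification method
#   $2 = recipient, for example root@controlworkstation
mail_error_report() {
    seq=$1
    rcpt=$2
    # Refuse to run with a missing argument instead of mailing an empty report.
    [ -n "$seq" ] && [ -n "$rcpt" ] || return 2
    errpt -a -l "$seq" | mail "$rcpt"
}
```

Inside a method script, the call would be mail_error_report "$1" root@controlworkstation.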
Create a file that contains the Error Notification Object to catch the switch diagnostic failed error.
errnotify:
        en_name = "tbx_diagerr.obj"
        en_persistenceflg = 1
        en_label = "SWT_DIAG_ERROR2_ER"
        en_method = "/customerdefinedpath/methods/errnot.test.ksh $1"
(The en_name value can be a maximum of 16 characters long.) Enter the odmshow errnotify command to view the definition of the errnotify object class.
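Stanza files like the one above can also be generated, which makes it easy to check the 16-character limit on en_name before the object is added. The following is a sketch under that assumption only; both function names are hypothetical, and the stanza uses just the attributes shown in the example above.

```sh
# Hypothetical helpers for writing an errnotify stanza file.

# Succeeds only if the en_name value is non-empty and at most 16 characters.
en_name_ok() {
    [ -n "$1" ] && [ "${#1}" -le 16 ]
}

# Print one stanza: $1 = en_name, $2 = en_label, $3 = en_method.
errnotify_stanza() {
    en_name_ok "$1" || { echo "en_name too long or empty: $1" >&2; return 1; }
    printf 'errnotify:\n'
    printf '        en_name = "%s"\n' "$1"
    printf '        en_persistenceflg = 1\n'
    printf '        en_label = "%s"\n' "$2"
    printf '        en_method = "%s"\n' "$3"
}
```

For example, errnotify_stanza tbx_diagerr.obj SWT_DIAG_ERROR2_ER '/customerdefinedpath/methods/errnot.test.ksh $1' prints a stanza that can be redirected to the object file.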
It is easy to modify an existing set of ODM errnotify stanzas. To do this, enter:
odmget errnotify > file
and edit the file. Include only attributes that have values.
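Removing the attributes that have no value can be scripted before the file is edited. The sketch below assumes that odmget prints such attributes as an empty quoted string, for example en_class = "". Whether numeric attributes set to 0 should also be dropped is left to the administrator. The function name is hypothetical.

```sh
# Hypothetical filter: drop attribute lines whose value is an empty string.
# Reads odmget output on stdin and writes the remaining lines to stdout.
strip_empty_attrs() {
    grep -v '=[[:space:]]*""[[:space:]]*$'
}
```

A possible use is odmget errnotify | strip_empty_attrs > file.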
odmadd /customerdefinedpath/object/tbx_diagerr.obj
(The file name is the name of the file with the Error Notification Object in it.)
To delete this object, enter:
odmdelete -o errnotify -q "en_name = tbx_diagerr.obj"
To view this object in the ODM database, enter:
odmget -q "en_name = tbx_diagerr.obj" errnotify
From root@sp2n5.kgn.ibm.com Mon Oct 3 11:25:59 1994
Received: from sp2n5.kgn.ibm.com by ppsras.kgn.ibm.com (AIX 3.2/UCB 5.64/4.03)
          id AA24781; Mon, 7 May 1995 10:14:59 -0400
Date: Mon, 3 Oct 1994 11:25:59 -0400
From: root
Message-Id: <9410031525.AA24781@sp2n5.kgn.ibm.com>
To: root
Status: RO

---------------------------------------------------------------------------
ERROR LABEL:     SWT_DIAG_ERROR2_ER
ERROR ID:        323C48A0

Date/Time:       Mon Oct 3 11:25:57
Sequence Number: 18282
Machine Id:      000004911800
Node Id:         sp2n5
Error Class:     H
Error Type:      PERM
Resource Name:   Worm
Resource Class:  NONE
Resource Type:   NONE
Location:        NONE

Error Description
Switch adapter failed On-Line diagnostics

Probable Causes
Switch clock signal missing
Switch adapter failure

User Causes
Switch cable loose or disconnected

        Recommended Actions
        Run adapter diagnostics

Failure Causes
Switch adapter hardware

        Recommended Actions
        Run adapter diagnostics

Detail Data
DETECTING MODULE
LP=PSSP,Fn=dtb3mx,SID=1.35,L#=1303,
Service Request Number
763-942
Error Notification when any Error Type of PEND occurs.
Create a file that contains the Error Notification Object to catch the pending availability problems. For example:
errnotify:
        en_name = "errnot.PEND.obj"
        en_persistenceflg = 1
        en_type = "PEND"
        en_method = "/tmp/errnot.test.ksh $1"

errnotify:
        en_name = "errnot.pend.obj"
        en_persistenceflg = 1
        en_type = "pend"
        en_method = "/tmp/errnot.test.ksh $1"

errnotify:
        en_name = "errnot.Pend.obj"
        en_persistenceflg = 1
        en_type = "Pend"
        en_method = "/tmp/errnot.test.ksh $1"
(The variations of PEND are added because upper case is not strictly adhered to by all AIX LPs and vendors.)
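Because the three stanzas differ only in the case of the error type, they can be generated rather than typed. The sketch below is one way to do that under the naming pattern used above (errnot.TYPE.obj); the function name is hypothetical.

```sh
# Hypothetical helper: print one errnotify stanza per case variant
# (upper, lower, capitalized) of an error type.
#   $1 = error type, for example PEND    $2 = en_method value
type_variant_stanzas() {
    upper=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    # First character from the upper-case form, the rest from the lower-case form.
    capital=$(printf '%s' "$upper" | cut -c1)$(printf '%s' "$lower" | cut -c2-)
    for variant in "$upper" "$lower" "$capital"; do
        printf 'errnotify:\n'
        printf '        en_name = "errnot.%s.obj"\n' "$variant"
        printf '        en_persistenceflg = 1\n'
        printf '        en_type = "%s"\n' "$variant"
        printf '        en_method = "%s"\n\n' "$2"
    done
}
```

For example, type_variant_stanzas PEND '/tmp/errnot.test.ksh $1' prints the PEND, pend, and Pend stanzas in that order.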
odmadd /customerdefinedpath/object/errnot.pend.obj
(The file name is the name of the file with the Error Notification Object in it.)
To delete these objects enter:
odmdelete -o errnotify -q "en_name = errnot.PEND.obj"
odmdelete -o errnotify -q "en_name = errnot.pend.obj"
odmdelete -o errnotify -q "en_name = errnot.Pend.obj"
To view these objects in the ODM database, enter:
odmget -q "en_name = errnot.PEND.obj" errnotify
odmget -q "en_name = errnot.pend.obj" errnotify
odmget -q "en_name = errnot.Pend.obj" errnotify
Error Notification when any Error on the boot device of hdisk0 occurs.
Create a file that contains the Error Notification Object to catch the boot disk errors. Assume that hdisk0 is the boot device.
errnotify:
        en_name = "errnot.boot.obj"
        en_persistenceflg = 1
        en_resource = "hdisk0"
        en_method = "/tmp/errnot.test.ksh $1"
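Because the boot device is not always hdisk0, the resource name can be looked up instead of assumed. The sketch below assumes that the AIX bootinfo -b command reports the last boot device, which should be verified on the AIX level in use. The function name and the fallback to hdisk0 are illustrative choices and are not part of PSSP.

```sh
# Hypothetical helper: print a boot-disk errnotify stanza.
#   $1 = en_method value
# Assumes "bootinfo -b" prints the last boot device (for example hdisk0);
# falls back to hdisk0 if the command fails or prints nothing.
boot_disk_stanza() {
    disk=$(bootinfo -b 2>/dev/null) || disk=
    [ -n "$disk" ] || disk=hdisk0
    printf 'errnotify:\n'
    printf '        en_name = "errnot.boot.obj"\n'
    printf '        en_persistenceflg = 1\n'
    printf '        en_resource = "%s"\n' "$disk"
    printf '        en_method = "%s"\n' "$1"
}
```

Run on each node, boot_disk_stanza '/tmp/errnot.test.ksh $1' would produce a stanza that names that node's own boot disk.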
odmadd /customerdefinedpath/object/errnot.boot.obj
To delete this object, enter:
odmdelete -o errnotify -q "en_name = errnot.boot.obj"
To view this object in the ODM database, enter:
odmget -q "en_name = errnot.boot.obj" errnotify
Error Notification when unexpected power loss and kernel panics occur.
Create a file that contains the Error Notification Object to catch the kernel panic and power loss Error Labels. For example:
errnotify:
        en_name = "power.obj"
        en_persistenceflg = 1
        en_label = "EPOW_SUS"
        en_method = "/customerdefinedpath/methods/errnot.test.ksh $1"

errnotify:
        en_name = "panic.obj"
        en_persistenceflg = 1
        en_label = "KERNEL_PANIC"
        en_method = "/customerdefinedpath/methods/errnot.test.ksh $1"

errnotify:
        en_name = "dbl_panic.obj"
        en_persistenceflg = 1
        en_label = "DOUBLE_PANIC"
        en_method = "/customerdefinedpath/methods/errnot.test.ksh $1"
odmadd /customerdefinedpath/object/power.panic.obj
The file name is the name of the file with the Error Notification Object in it.