Error Logging


AIX Version 4.3 Kernel Extensions and Device Support Programming Concepts

Error Logging

The error facility allows a device driver to have entries recorded in the system error log. These error log entries record any software or hardware failures that need to be available either for informational purposes or for fault detection and corrective action. The device driver, using the errsave kernel service, adds error records to the special file /dev/error.

The errdemon daemon then picks up the error record and creates an error log entry. When you access the error log either through SMIT (System Management Interface Tool) or with the errpt command, the error record is formatted according to the error template in the error template repository and presented in either a summary or detailed report. See the Flow of the Error Logging Facility figure for an illustration of this.

Precoding Steps to Consider

Follow three precoding steps before initiating the error logging process. It is beneficial to understand what services are available to developers, and what the customer, service personnel, and defect personnel see.

Determine the Importance of the Error

The first precoding step is to review the error-logging documentation and determine whether a particular error should be logged. Do not use system resources for logging information that is unimportant or confusing to the intended audience.

It is, however, a worse mistake not to log an error that merits logging. You should work in concert with the hardware developer, if possible, to identify detectable errors and the information that should be relayed concerning those errors.

Determine the Text of the Message

The next step is to determine the text of the message. Use the errmsg command with the -w flag to browse the system error messages file for a list of available messages. If you are developing a product for widespread general distribution and do not find a suitable system error message, you can submit a request to your supplier for a new message or follow the procedures that your organization uses to request new error messages. If your product is an in-house application, you can use the errmsg command to define a new message that meets your requirements.

Determine the Correct Level of Thresholding

Finally, determine the correct level of thresholding. Each error to be logged, regardless of whether it is a software or hardware error, can be limited by thresholding to avoid filling the error log with duplicate information.

Side effects of runaway error logging include overwriting existing error log entries and unduly alarming the end user. The error log is not unlimited in size. When its size limit is reached, the log wraps. If a particular error is repeated needlessly, existing information is overwritten, possibly causing inaccurate diagnostic analyses. The end user or service person can perceive a situation as more serious or pervasive than it is if they see hundreds of identical or nearly identical error entries.

You are responsible for implementing the proper level of thresholding in the device driver code.

The error log is currently limited to 1MB. As shipped, the system removes any entries older than 30 days. To ensure that your error log entries are informative, are noticed, and remain intact, test your driver thoroughly.
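One common way to implement thresholding is to allow only a fixed number of entries for a given error within a time window and to count the rest. The following is a minimal sketch of that idea, not a prescribed method. The err_threshold structure and both function names are hypothetical, and the caller is assumed to supply the current time in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-error threshold state; not part of any AIX header. */
typedef struct err_threshold {
    uint32_t window_secs;     /* length of one counting window, in seconds */
    uint32_t max_per_window;  /* entries allowed in each window            */
    uint32_t window_start;    /* time at which the current window began    */
    uint32_t logged;          /* entries logged in the current window      */
    uint32_t suppressed;      /* entries dropped in the current window     */
} err_threshold;

/* Start a fresh window at time `now`. */
void threshold_init(err_threshold *t, uint32_t window_secs,
                    uint32_t max_per_window, uint32_t now)
{
    assert(t != NULL && window_secs > 0);
    t->window_secs = window_secs;
    t->max_per_window = max_per_window;
    t->window_start = now;
    t->logged = 0;
    t->suppressed = 0;
}

/* Return true if this occurrence should be written to the error log. */
bool threshold_should_log(err_threshold *t, uint32_t now)
{
    assert(t != NULL);

    /* Open a new window once the current one has expired. */
    if (now - t->window_start >= t->window_secs) {
        t->window_start = now;
        t->logged = 0;
        t->suppressed = 0;
    }

    if (t->logged < t->max_per_window) {
        t->logged++;
        return true;
    }

    t->suppressed++;   /* keep a count so it can be reported later */
    return false;
}
```

A driver would keep one such structure per loggable error and call the logging routine only when threshold_should_log returns true. The suppressed count can be placed in the detail data of the next entry that is logged.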

Coding Steps

To begin error logging:

Select the error text.
Construct error record templates.
Add error logging calls into the device driver code.

Selecting the Error Text

The first task is to select the error text. After browsing the contents of the system message file, you are in one of three situations: all of the desired messages for the new errors exist in the message file, none of them exist, or only some of them exist.

If the messages required already exist in the system message file, make a note of the four-digit hexadecimal identification number, as well as the message-set identification letter. For instance, a desired error description can be:
```
SET E
E859 "The wagon wheel is broken."
```
If none of the system error messages meet your requirements, and if you are responsible for developing a product for wide-spread general distribution, you can either contact your supplier to allocate new messages or follow the procedures that your organization uses to request new messages. If you are creating an in-house product, use the errmsg command to write suitable error messages and use the errinstall command to install them. Refer to "Software Product Packaging" in AIX Version 4.3 General Programming Concepts: Writing and Debugging Programs for more information. Take care not to overwrite other error messages.
It is also possible to use a combination of existing messages and new messages within the same error record template definition.

Constructing Error Record Templates

The second step is to construct your error record templates. An error record template defines the text that appears in the error report. Each error record template has the following general form:

Error Record Template
        +LABEL:
             Comment =
             Class =
             Log =
             Report =
             Alert =
             Err_Type =
             Err_Desc =
             Prob_Causes =
             User_Causes =
             User_Actions =
             Inst_Causes =
             Inst_Actions =
             Fail_Causes =
             Fail_Actions =
             Detail_Data = <data_len>, <data_id>, <data_encoding>

Each field in this stanza has well-defined criteria for input values. See the errupdate command for more information. The fields are:

Label

Requires a unique label for each entry to be added. The label must follow C language rules for identifiers and must not exceed 16 characters in length.

Comment

Indicates that this is a comment field. Enclose the comment in double quotation marks. The comment cannot exceed 40 characters.

Class

Requires class values of H (hardware), S (software), or U (Undetermined).

Log

Requires values True or False. If failure occurs, the errors are logged only if this field value is set to True. When this value is False the Report and Alert fields are ignored.

Report

The values for this field are True or False. If the logged error is to be displayed in the error report, the value of this field must be True.

Alert

Set this field to True for errors that are alertable. For errors that are not alertable, set this field to False.

Err_Type

Describes the severity of the failure that occurred. Possible values are INFO, PEND, PERF, PERM, TEMP, and UNKN where:

INFO	The error log entry is informational and was not the result of an error.
PEND	A condition in which it is determined that the loss of availability of a device or component is imminent.
PERF	A condition in which the performance of a device or component was degraded below an acceptable level.
PERM	A permanent failure is defined as a condition that was not recoverable. For example, an operation was retried a prescribed number of times without success.
TEMP	Recovery from this temporary failure was successful, yet the number of unsuccessful recovery attempts exceeded a predetermined threshold.
UNKN	A condition in which it is not possible to assess the severity of a failure.
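Because Err_Type is fixed in the template, a driver that can hit both outcomes of a retry loop typically defines one template per outcome and picks between them at run time. The following sketch applies the PERM and TEMP definitions above. The enum, the function name, and the two limit parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Which template, if any, the driver should log after a retry loop. */
typedef enum {
    LOG_NONE,   /* recovered within the threshold: nothing to log   */
    LOG_TEMP,   /* recovered, but only after too many retries       */
    LOG_PERM    /* never recovered: the prescribed retries ran out  */
} log_choice;

/*
 * recovered      - true if the operation finally succeeded
 * retries        - number of unsuccessful attempts before the outcome
 * temp_threshold - retries tolerated before a TEMP entry is warranted
 * max_retries    - prescribed number of retries before giving up
 */
log_choice choose_err_type(bool recovered, unsigned retries,
                           unsigned temp_threshold, unsigned max_retries)
{
    assert(retries <= max_retries);

    if (!recovered)
        return LOG_PERM;
    if (retries > temp_threshold)
        return LOG_TEMP;
    return LOG_NONE;
}
```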

Err_Desc

Describes the failure that occurred. Proper input for this field is the four-digit hexadecimal identifier of the error description message to be displayed from SET E in the message file.

Prob_Causes

Describes one or more probable causes for the failure that occurred. You can specify a list of up to four Prob_Causes identifiers separated by commas. A Prob_Causes identifier displays a probable cause text message from SET P in the message file. List probable causes in the order of decreasing probability. At least one probable cause identifier is required.

User_Causes

Specifies a condition that an operator can resolve without contacting any service organization. You can specify a list of up to four User_Causes identifiers separated by commas. A User_Causes identifier displays a text message from SET U in the message file. List user causes in the order of decreasing probability. Leave this field blank if it does not apply to the failure that occurred. If this field is blank, either the Inst_Causes or the Fail_Causes field must not be blank.

User_Actions

Describes recommended actions for correcting a failure that resulted from a user cause. You can specify a list of up to four recommended User_Actions identifiers separated by commas. A recommended User_Actions identifier displays a recommended action text message from SET R in the message file. You must leave this field blank if the User_Causes field is blank.

The order in which the recommended actions are listed is determined by the expense of the action and the probability that the action corrects the failure. Actions that have little or no cost and little or no impact on system operation should always be listed first. When actions have an equal or nearly equal probability of correcting the failure, list the least expensive action first. List remaining actions in order of decreasing probability.

Inst_Causes

Describes a condition that resulted from the initial installation or setup of a resource. You can specify a list of up to four Inst_Causes identifiers separated by commas. An Inst_Causes identifier displays a text message from SET I in the message file. List the install causes in the order of decreasing probability. Leave this field blank if it is not applicable to the failure that occurred. If this field is blank, either the User_Causes or the Fail_Causes field must not be blank.

Inst_Actions

Describes recommended actions for correcting a failure that resulted from an install cause. You can specify a list of up to four recommended Inst_Actions identifiers separated by commas. A recommended Inst_Actions identifier identifies a recommended action text message from SET R in the message file. Leave this field blank if the Inst_Causes field is blank. The order in which the recommended actions are listed is determined by the expense of the action and the probability that the action corrects the failure. See the User_Actions field for the list criteria.

Fail_Causes

Describes a condition that resulted from the failure of a resource. You can specify a list of up to four Fail_Causes identifiers separated by commas. A Fail_Causes identifier displays a failure cause text message from SET F in the message file. List the failure causes in the order of decreasing probability. Leave this field blank if it is not applicable to the failure that occurred. If you leave this field blank, either the User_Causes or the Inst_Causes field must not be blank.

Fail_Actions

Describes recommended actions for correcting a failure that resulted from a failure cause. You can specify a list of up to four recommended action identifiers separated by commas. The Fail_Actions identifiers must correspond to recommended action messages found in SET R of the message file. Leave this field blank if the Fail_Causes field is blank. Refer to the description of the User_Actions field for criteria in listing these recommended actions.
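The rules for the cause and action fields interact, so it can help to see them stated together. The following sketch checks only the counting rules described above: at most four identifiers per field, at least one probable cause, at least one non-blank cause field, and no actions without matching causes. It illustrates the rules and is not a substitute for the syntax checking that the errupdate command performs. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_IDS_PER_FIELD 4   /* "up to four" identifiers per field */

/* Number of identifiers listed in each field; 0 means the field is blank. */
typedef struct template_counts {
    unsigned prob_causes;
    unsigned user_causes, user_actions;
    unsigned inst_causes, inst_actions;
    unsigned fail_causes, fail_actions;
} template_counts;

bool template_counts_valid(const template_counts *c)
{
    assert(c != NULL);

    /* At least one probable cause is required. */
    if (c->prob_causes < 1 || c->prob_causes > MAX_IDS_PER_FIELD)
        return false;

    /* No field may list more than four identifiers. */
    if (c->user_causes > MAX_IDS_PER_FIELD || c->user_actions > MAX_IDS_PER_FIELD ||
        c->inst_causes > MAX_IDS_PER_FIELD || c->inst_actions > MAX_IDS_PER_FIELD ||
        c->fail_causes > MAX_IDS_PER_FIELD || c->fail_actions > MAX_IDS_PER_FIELD)
        return false;

    /* User, install, and failure causes cannot all be blank. */
    if (c->user_causes + c->inst_causes + c->fail_causes == 0)
        return false;

    /* An actions field must be blank when its causes field is blank. */
    if (c->user_causes == 0 && c->user_actions != 0)
        return false;
    if (c->inst_causes == 0 && c->inst_actions != 0)
        return false;
    if (c->fail_causes == 0 && c->fail_actions != 0)
        return false;

    return true;
}
```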

Detail_Data

Describes the detailed data that is logged with the error when the failure occurs. The Detail_Data field can include the name of the detecting module, sense data, or return codes. Leave this field blank if no detailed data is logged with the error.

You can repeat the Detail_Data field. The amount of data logged with an error must not exceed the maximum error record length defined in the sys/err_rec.h header file. Save failure data that cannot be contained in an error log entry elsewhere, for example in a file. The detailed data in the error log entry contains information that can be used to correlate the failure data to the error log entry. Three values are required for each detail data entry:

data_len

Indicates the number of bytes of data to be associated with the data_id value. The data_len value is interpreted as a decimal value.

data_id

Identifies a text message to be printed in the error report in front of the detailed data. These identifiers refer to messages in SET D of the message file.

data_encoding

Describes how the detailed data is to be printed in the error report. Valid values for this field are:

ALPHA	The detailed data is a printable ASCII character string.
DEC	The detailed data is the binary representation of an integer value; the decimal equivalent is printed.
HEX	The detailed data is to be printed in hexadecimal.
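To make the difference between the three encodings concrete, the following sketch renders the same bytes in each of them. It is only an illustration of what the encodings mean, not the formatter used by the error report. The names are hypothetical, and the DEC case assumes an unsigned value of at most four bytes stored most significant byte first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum { ENC_ALPHA, ENC_DEC, ENC_HEX } data_encoding;

/*
 * Render `len` bytes of detail data into `out` as a NUL-terminated string.
 * Returns 0 on success, or -1 if the data does not fit or is not supported.
 */
int render_detail(data_encoding enc, const unsigned char *data, size_t len,
                  char *out, size_t out_size)
{
    unsigned long value = 0;
    size_t i;
    int n;

    assert(data != NULL && out != NULL);

    switch (enc) {
    case ENC_ALPHA:                 /* printable character string */
        if (len + 1 > out_size)
            return -1;
        memcpy(out, data, len);
        out[len] = '\0';
        return 0;

    case ENC_DEC:                   /* binary integer printed in decimal */
        if (len == 0 || len > 4)
            return -1;
        for (i = 0; i < len; i++)   /* assumed big-endian byte order */
            value = (value << 8) | data[i];
        n = snprintf(out, out_size, "%lu", value);
        return (n >= 0 && (size_t)n < out_size) ? 0 : -1;

    case ENC_HEX:                   /* two hexadecimal digits per byte */
        if (2 * len + 1 > out_size)
            return -1;
        for (i = 0; i < len; i++)
            sprintf(out + 2 * i, "%02X", data[i]);
        out[2 * len] = '\0';
        return 0;
    }
    return -1;
}
```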

Sample Error Record Template

An example of an error record template is:

+ MISC_ERR:
        Comment = "Interrupt: I/O bus timeout or channel check"
        Class = H
        Log = TRUE
        Report = TRUE
        Alert = FALSE
        Err_Type = UNKN
        Err_Desc = E856
        Prob_Causes = 3300, 6300
        User_Causes =
        User_Actions =
        Inst_Causes =
        Inst_Actions =
        Fail_Causes = 3300, 6300
        Fail_Actions = 0000
        Detail_Data = 4, 8119, HEX      *IOCC bus number
        Detail_Data = 4, 811A, HEX      *Bus Status Register
        Detail_Data = 4, 811B, HEX      *Misc. Interrupt Register

Construct the error templates for all new errors to be added in a file suitable for entry with the errupdate command. Run the errupdate command with the -h flag and the input file. The new errors are now part of the error record template repository. A new header file is also created (file.h) in the same directory in which the errupdate command was run. This header file must be included in the device driver code at compile time. Note that the errupdate command has a built-in syntax checker for the new stanza that can be called with the -c flag.
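The Detail_Data entries in the sample template describe, in order, the bytes that follow the fixed part of the error record. A record structure matching the sample template could be sketched as shown here. The err_rec0_sketch structure is a stand-in for the real err_rec0 structure, its 16-byte name field is an assumption, and all other names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NAMESIZE 16   /* assumed size of the resource name field */

/* Stand-in for the fixed part of an error record (ID plus resource name). */
struct err_rec0_sketch {
    uint32_t error_id;
    char     resource_name[SKETCH_NAMESIZE];
};

/* One member per Detail_Data line, in the order the template lists them. */
struct misc_err_rec {
    struct err_rec0_sketch err;
    uint32_t iocc_bus_number;      /* Detail_Data = 4, 8119, HEX */
    uint32_t bus_status_reg;       /* Detail_Data = 4, 811A, HEX */
    uint32_t misc_interrupt_reg;   /* Detail_Data = 4, 811B, HEX */
};

/* Sum of the data_len values in the sample template. */
#define MISC_ERR_TEMPLATE_DETAIL_BYTES (4 + 4 + 4)

/* Number of detail bytes the structure carries after the fixed part. */
size_t misc_err_detail_len(void)
{
    return sizeof(struct misc_err_rec)
         - offsetof(struct misc_err_rec, iocc_bus_number);
}

/* Abort early if the structure and the template have drifted apart. */
void misc_err_check_layout(void)
{
    assert(offsetof(struct misc_err_rec, iocc_bus_number)
           == sizeof(struct err_rec0_sketch));
    assert(misc_err_detail_len() == MISC_ERR_TEMPLATE_DETAIL_BYTES);
}
```

Keeping the structure members and the Detail_Data lines in the same order, with the same lengths, is what lets the report label each value correctly.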

Adding Error Logging Calls into the Code

The third step in coding error logging is to put the error logging calls into the device driver code. The errsave kernel service allows the kernel and kernel extensions to write to the error log. Typically, you define a routine in the device driver that can be called by other device driver routines when a loggable error is encountered. This function takes the data passed to it, puts it into the proper structure and calls the errsave kernel service. The syntax for the errsave kernel service is:

#include <sys/errids.h>
void errsave(buf, cnt)
char *buf;
unsigned int cnt;

where,

buf	Specifies a pointer to a buffer that contains an error record as described in the sys/err_rec.h header file.
cnt	Specifies the number of bytes in the error record contained in the buffer pointed to by the buf parameter.

The following sample code is an example of a device driver error logging routine. This routine takes data passed to it from some part of the main body of the device driver. This code simply fills in the structure with the pertinent information, then passes it on using the errsave kernel service.

void
errsv_ex (int err_id, unsigned int port_num,
            int line, char *file, uint data1, uint data2)
{
    dderr     log;
    char      errbuf[255];
    ddex_dds  *p_dds;

    p_dds = dds_dir[port_num];
    log.err.error_id = err_id;

    /* Name the resource: adapter name for a bad state, device name otherwise */
    if (port_num == BAD_STATE) {
        sprintf(log.err.resource_name, "%s :%d",
          p_dds->dds_vpd.adpt_name, data1);
        data1 = 0;
    }
    else
        sprintf(log.err.resource_name, "%s", p_dds->dds_vpd.devname);

    /* Record where the error was detected; assumes dderr has a char file[] member */
    sprintf(errbuf, "line: %d file: %s", line, file);
    strncpy(log.file, errbuf, sizeof(log.file));
    log.file[sizeof(log.file) - 1] = '\0';   /* strncpy may not terminate */

    log.data1 = data1;
    log.data2 = data2;

    errsave((char *)&log, (uint)sizeof(dderr));   /* run actual logging  */
}  /* end errsv_ex */

The data to be passed to the errsave kernel service is defined in the dderr structure which is defined in a local header file, dderr.h. The definition for dderr is:

typedef struct dderr {
        struct  err_rec0 err;
        int  data1;     /* use data1 and data2 to show detail  */
        int  data2;     /* data in the errlog report. Define   */
                        /* these fields in the errlog template */
                        /* These fields may not be used in all */
                        /* cases.                              */
} dderr;

The first field of the dderr structure is the err_rec0 structure, which is defined in the sys/err_rec.h header file. This structure contains the ID (or label) and a field for the resource name. The two data fields hold the detail data for the error log report. The sample routine also writes to a file member, which this definition does not show; a structure used with that routine would need such a member. As an alternative, you could simply list the fields within the function.
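The fill-and-call pattern can be tried outside the kernel by replacing the errsave kernel service with a stub that captures the record. The following user-space sketch does that. The errsave_stub function, the err_rec0_sketch and dderr_sketch structures, and the 16-byte name field are all stand-ins and assumptions, not the real definitions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_NAMESIZE 16   /* assumed size of the resource name field */

/* Stand-ins for err_rec0 and for the dderr structure shown above. */
struct err_rec0_sketch {
    unsigned int error_id;
    char         resource_name[SKETCH_NAMESIZE];
};

typedef struct dderr_sketch {
    struct err_rec0_sketch err;
    int data1;
    int data2;
} dderr_sketch;

/* Stub with the same shape as errsave(buf, cnt): keeps the last record. */
unsigned char last_record[64];
unsigned int  last_count;

void errsave_stub(char *buf, unsigned int cnt)
{
    assert(cnt <= sizeof(last_record));
    memcpy(last_record, buf, cnt);
    last_count = cnt;
}

/* Fill in a record and hand it to the logging service. */
void log_device_error(unsigned int err_id, const char *resource,
                      int data1, int data2)
{
    dderr_sketch log;

    memset(&log, 0, sizeof(log));   /* no uninitialized bytes reach the log */
    log.err.error_id = err_id;

    /* snprintf truncates long names and always terminates the string */
    snprintf(log.err.resource_name, sizeof(log.err.resource_name),
             "%s", resource);

    log.data1 = data1;
    log.data2 = data2;

    errsave_stub((char *)&log, (unsigned int)sizeof(log));
}
```

Clearing the structure first and bounding the copy of the resource name are choices made for this sketch. In a real driver, the error ID would come from the header file generated by the errupdate command.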

You can also log a message into the error log from the command line. To do this, use the errlogger command.

After you add the templates using the errupdate command, compile the device driver code along with the new header file. Simulate the error and verify that it was written to the error log correctly. Some details to check for include:

Is the errdemon daemon running? This can be verified by running the ps -ef command and checking for /usr/lib/errdemon as part of the output.
Is the error part of the error template repository? Verify this by running the errpt -at command.
Was the new header file, which was created by the errupdate command and which contains the error label and unique error identification number, included in the device driver code when it was compiled?

Writing to the /dev/error Special File

The error logging process begins when a loggable error is encountered and the device driver error logging subroutine sends the error information to the errsave kernel service. The error entry is written to the /dev/error special file. Once the information arrives at this file, it is time-stamped by the errdemon daemon and put in a buffer. The errdemon daemon constantly checks the /dev/error special file for new entries, and when new data is written, the daemon collects other information pertaining to the resource reporting the error. The errdemon daemon then creates an entry in the /var/adm/ras/errlog error logging file.
