Kernel Extensions and Device Support Programming Concepts

Error Logging

The error facility records device-driver entries in the system error log. These error log entries record any software or hardware failures that need to be available either for informational purposes or for fault detection and corrective action. The device driver, using the errsave kernel service, adds error records to the /dev/error special file.

The errdemon daemon picks up the error record and creates an error log entry. When you access the error log either through SMIT (System Management Interface Tool) or with the errpt command, the error record is formatted according to the error template in the error template repository and presented in either a summary or detailed report.

Before initiating the error logging process, determine what services are available to developers, and what services are available to the customer, service personnel, and defect personnel.

Determine the Importance of the Error: Use system resources for logging only information that is important or helpful to the intended audience. Work with the hardware developer, if possible, to identify detectable errors and the information that should be relayed concerning those errors.
Determine the Text of the Message: Use regular national language support (NLS) XPG/4 messages instead of the codepoints. For more information about NLS messages, see Message Facility in AIX 5L Version 5.2 National Language Support Guide and Reference.
Determine the Correct Level of Thresholding: Each software or hardware error to be logged can be limited by thresholding to avoid filling the error log with duplicate information. Side effects of runaway error logging include overwriting existing error log entries and unduly alarming the end user. The error log is limited in size. When its size limit is reached, the log wraps. If a particular error is repeated needlessly, existing information is overwritten, which might cause inaccurate diagnostic analyses. The end user or service person can perceive a situation as more serious or pervasive than it is if they see hundreds of identical or nearly identical error entries.
You are responsible for implementing the proper level of thresholding in the device driver code.
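
One possible thresholding scheme can be sketched as follows: log the first few occurrences of an error, then only every Nth occurrence after that. The names dd_threshold and dd_should_log and the limits chosen by the caller are hypothetical, not part of any AIX interface. A real driver would also have to serialize access to the counter in whatever way it already serializes its other state.

```c
/* Hypothetical thresholding helper; not part of any AIX interface. */
struct dd_threshold {
    unsigned int count;      /* occurrences seen so far                 */
    unsigned int initial;    /* always log the first `initial` hits     */
    unsigned int interval;   /* afterwards, log every `interval`th hit  */
};

/* Returns 1 if this occurrence should be logged, 0 if it is suppressed. */
int
dd_should_log(struct dd_threshold *t)
{
    t->count++;

    if (t->count <= t->initial)
        return 1;

    /* An interval of 0 means "never log again after the initial hits". */
    if (t->interval != 0 && (t->count - t->initial) % t->interval == 0)
        return 1;

    return 0;
}
```

A driver routine would call dd_should_log before building the error record and skip the logging call when it returns 0.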

The size of the error log is 1 MB. As shipped, it cleans up any entries older than 30 days. To ensure that your error log entries are informative, noticed, and remain intact, test your driver thoroughly.

Setting up Error Logging

To begin error logging, do the following:

Select the error text.
Construct error record templates.
Add error logging calls into the device driver code.

Step 1: Selecting the Error Text

Browse the contents of the system message file. Either all of the desired messages for the new errors already exist in the message file, none of them exist, or only some of them exist.

If the messages required already exist in the system message file, make a note of the four-digit hexadecimal identification number, as well as the message-set identification letter. For instance, an error description might be:
```
SET E
E859 "The wagon wheel is broken."
```
If none of the system error messages meet your requirements, and if you are responsible for developing a product for general distribution, you can either contact your supplier to allocate new messages or follow the procedures that your organization uses to request new messages. If you are creating an in-house product, use the errmsg command to write suitable error messages and use the errinstall command to install them. For more information, see Software Product Packaging in AIX 5L Version 5.2 General Programming Concepts: Writing and Debugging Programs. Make sure that you do not overwrite other error messages.
You can use a combination of existing messages and new messages within the same error record template definition.

Step 2: Constructing Error Record Templates

Construct your error record templates, which define the text that displays in the error report. Each error record template has the following general form:

Error Record Template
        +LABEL:
             Comment =
             Class =
             Log =
             Report =
             Alert =
             Err_Type =
             Err_Desc =
             Prob_Causes =
             User_Causes =
             User_Actions =
             Inst_Causes =
             Inst_Actions =
             Fail_Causes =
             Fail_Actions =
             Detail_Data = <data_len>, <data_id>, <data_encoding>

Each field in this stanza has well-defined criteria for input values. For more information, see the errupdate command. The fields are as follows:

Label

Requires a unique label for each entry to be added. The label must follow C language rules for identifiers and must not exceed 16 characters in length.

Comment

Indicates that this is a comment field. You must enclose the comment in double quotation marks, and it cannot exceed 40 characters.

Class

Requires class values of H (hardware), S (software), or U (Undetermined).

Log

Requires values True or False. If a failure occurs, the error is logged only if this field value is set to True. When this value is False, the Report and Alert fields are ignored.

Report

Requires values True or False. If the logged error is to be displayed in the error report, the value of this field must be True.

Alert

Requires values True or False. Set this field to True for errors that are alertable. For errors that are not alertable, set this field to False.

Err_Type

Describes the severity of the failure that occurred. Possible values for Err_Type are as follows:

INFO: The error log entry is informational and was not the result of an error.
PEND: A condition in which the loss of availability of a device or component is imminent.
PERF: A condition in which the performance of a device or component was degraded below an acceptable level.
PERM: A permanent failure is defined as a condition that was not recoverable. For example, an operation was retried a prescribed number of times without success.
TEMP: Recovery from this temporary failure was successful, yet the number of unsuccessful recovery attempts exceeded a predetermined threshold.
UNKN: A condition in which it is not possible to assess the severity of a failure.

Err_Desc

Describes the failure that occurred. Proper input for this field is the four-digit hexadecimal identifier of the error description message to be displayed from SET E in the message file.

Prob_Causes

Describes one or more probable causes for the failure that occurred. You can specify a list of up to four Prob_Causes identifiers separated by commas. A Prob_Causes identifier displays a probable cause text message from SET P in the message file. List probable causes in the order of decreasing probability. At least one probable cause identifier is required.

User_Causes

Specifies a condition that an operator can resolve without contacting any service organization. You can specify a list of up to four User_Causes identifiers separated by commas. A User_Causes identifier displays a text message from SET U in the message file. List user causes in the order of decreasing probability. Leave this field blank if it does not apply to the failure that occurred. If this field is blank, either the Inst_Causes or the Fail_Causes field must not be blank.

User_Actions

Describes recommended actions for correcting a failure that resulted from a user cause. You can specify a list of up to four recommended User_Actions identifiers separated by commas. A recommended User_Actions identifier displays a recommended action text message from SET R in the message file. You must leave this field blank if the User_Causes field is blank.

The order in which the recommended actions are listed is determined by the expense of the action and the probability that the action corrects the failure. Actions that have little or no cost and little or no impact on system operation should always be listed first. When several actions have an equal or nearly equal probability of correcting the failure, list the least expensive action first. List remaining actions in order of decreasing probability.

Inst_Causes

Describes a condition that resulted from the initial installation or setup of a resource. You can specify a list of up to four Inst_Causes identifiers separated by commas. An Inst_Causes identifier displays a text message from SET I in the message file. List the install causes in the order of decreasing probability. Leave this field blank if it is not applicable to the failure that occurred. If this field is blank, either the User_Causes or the Fail_Causes field must not be blank.

Inst_Actions

Describes recommended actions for correcting a failure that resulted from an install cause. You can specify a list of up to four recommended Inst_Actions identifiers separated by commas. A recommended Inst_Actions identifier identifies a recommended action text message from SET R in the message file. Leave this field blank if the Inst_Causes field is blank. The order in which the recommended actions are listed is determined by the expense of the action and the probability that the action corrects the failure. See the User_Actions field for the list criteria.

Fail_Causes

Describes a condition that resulted from the failure of a resource. You can specify a list of up to four Fail_Causes identifiers separated by commas. A Fail_Causes identifier displays a failure cause text message from SET F in the message file. List the failure causes in the order of decreasing probability. Leave this field blank if it is not applicable to the failure that occurred. If you leave this field blank, either the User_Causes or the Inst_Causes field must not be blank.

Fail_Actions

Describes recommended actions for correcting a failure that resulted from a failure cause. You can specify a list of up to four recommended action identifiers separated by commas. The Fail_Actions identifiers must correspond to recommended action messages found in SET R of the message file. Leave this field blank if the Fail_Causes field is blank. Refer to the description of the User_Actions field for criteria in listing these recommended actions.
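
The rules for the cause and action fields can be summarized in a small check. The following is only a sketch of those rules as stated above; the structure and function names are hypothetical, and the errupdate command performs the authoritative validation of a template.

```c
#define DD_MAX_IDS 4   /* each list holds at most four identifiers */

/* Number of identifiers in each field; 0 means the field is blank. */
struct dd_tmpl_counts {
    int prob_causes;
    int user_causes, user_actions;
    int inst_causes, inst_actions;
    int fail_causes, fail_actions;
};

/* Returns 1 if n lies between lo and DD_MAX_IDS inclusive. */
static int
dd_in_range(int n, int lo)
{
    return n >= lo && n <= DD_MAX_IDS;
}

/* Returns 1 if the counts satisfy the field rules, otherwise 0. */
int
dd_tmpl_counts_ok(const struct dd_tmpl_counts *c)
{
    /* At least one probable cause is required. */
    if (!dd_in_range(c->prob_causes, 1))
        return 0;

    /* Every other list may be blank but holds at most four entries. */
    if (!dd_in_range(c->user_causes, 0) || !dd_in_range(c->user_actions, 0) ||
        !dd_in_range(c->inst_causes, 0) || !dd_in_range(c->inst_actions, 0) ||
        !dd_in_range(c->fail_causes, 0) || !dd_in_range(c->fail_actions, 0))
        return 0;

    /* User, install, and failure causes must not all be blank. */
    if (c->user_causes == 0 && c->inst_causes == 0 && c->fail_causes == 0)
        return 0;

    /* An action list must be blank when its cause list is blank. */
    if ((c->user_causes == 0 && c->user_actions != 0) ||
        (c->inst_causes == 0 && c->inst_actions != 0) ||
        (c->fail_causes == 0 && c->fail_actions != 0))
        return 0;

    return 1;
}
```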

Detail_Data

Describes the detailed data that is logged with the error when the failure occurs. The Detail_Data field includes the name of the detecting module, sense data, or return codes. Leave this field blank if no detailed data is logged with the error.

You can repeat the Detail_Data field. The amount of data logged with an error must not exceed the maximum error record length defined in the sys/err_rec.h header file. Save failure data that cannot be contained in an error log entry elsewhere, for example in a file. The detailed data in the error log entry contains information that can be used to correlate the failure data to the error log entry. Three values are required for each detail data entry:

data_len

Indicates the number of bytes of data to be associated with the data_id value. The data_len value is interpreted as a decimal value.

data_id

Identifies a text message to be printed in the error report in front of the detailed data. These identifiers refer to messages in SET D of the message file.

data_encoding

Describes how the detailed data is to be printed in the error report. Valid values for this field are:

ALPHA: The detailed data is a printable ASCII character string.
DEC: The detailed data is the binary representation of an integer value. The decimal equivalent is to be printed.
HEX: The detailed data is to be printed in hexadecimal.
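
To illustrate what the three encodings mean for the same bytes, the following sketch renders a detail data field as text. It is not the report formatter itself: the names are hypothetical, DEC is assumed to be an unsigned big-endian integer, and the exact layout that the error report prints may differ.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical names for the three data_encoding values. */
enum dd_encoding { DD_ALPHA, DD_DEC, DD_HEX };

/*
 * Renders len bytes of detail data into out (outsz bytes available).
 * Returns 0 on success, -1 if the result does not fit or the
 * encoding is unknown.
 */
int
dd_format_detail(enum dd_encoding enc, const unsigned char *data,
                 size_t len, char *out, size_t outsz)
{
    unsigned long value = 0;
    size_t i;
    int n;

    switch (enc) {
    case DD_ALPHA:                    /* printable character string */
        if (len + 1 > outsz)
            return -1;
        memcpy(out, data, len);
        out[len] = '\0';
        return 0;

    case DD_DEC:                      /* binary integer shown in decimal */
        if (len > sizeof(value))
            return -1;
        for (i = 0; i < len; i++)     /* assumption: big-endian bytes */
            value = (value << 8) | data[i];
        n = snprintf(out, outsz, "%lu", value);
        return (n >= 0 && (size_t)n < outsz) ? 0 : -1;

    case DD_HEX:                      /* two hexadecimal digits per byte */
        if (2 * len + 1 > outsz)
            return -1;
        for (i = 0; i < len; i++)
            sprintf(out + 2 * i, "%02X", data[i]);
        out[2 * len] = '\0';
        return 0;
    }
    return -1;
}
```

With this sketch, the four bytes 00 00 01 00 render as 256 under DEC and as 00000100 under HEX.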

Sample Error Record Template

An example of an error record template is:

+ MISC_ERR:
        Comment = "Interrupt: I/O bus timeout or channel check"
        Class = H
        Log = TRUE
        Report = TRUE
        Alert = FALSE
        Err_Type = UNKN
        Err_Desc = E856
        Prob_Causes = 3300, 6300
        User_Causes =
        User_Actions =
        Inst_Causes =
        Inst_Actions =
        Fail_Causes = 3300, 6300
        Fail_Actions = 0000
        Detail_Data = 4, 8119, HEX      *IOCC bus number
        Detail_Data = 4, 811A, HEX      *Bus Status Register
        Detail_Data = 4, 811B, HEX      *Misc. Interrupt Register
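
The detail data that the driver logs has to match the Detail_Data entries in length and order. As a sketch, a record for the MISC_ERR template above could be laid out as follows. The err_rec0 definition here is a stand-in so that the example compiles on its own (in a driver it comes from the sys/err_rec.h header file), and the structure and member names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the definitions in sys/err_rec.h (assumption). */
#define ERR_NAMESIZE 16
struct err_rec0 {
    unsigned int error_id;
    char         resource_name[ERR_NAMESIZE];
};

/* One 4-byte member per Detail_Data entry, in template order. */
struct misc_err_rec {
    struct err_rec0 err;
    uint32_t iocc_bus_number;   /* Detail_Data = 4, 8119, HEX */
    uint32_t bus_status_reg;    /* Detail_Data = 4, 811A, HEX */
    uint32_t misc_intr_reg;     /* Detail_Data = 4, 811B, HEX */
};

/* Bytes of detail data that follow the err_rec0 header. */
#define MISC_ERR_DETAIL_LEN \
    (sizeof(struct misc_err_rec) - offsetof(struct misc_err_rec, iocc_bus_number))
```

The three data_len values in the template add up to 12 bytes, which is what MISC_ERR_DETAIL_LEN should evaluate to when the compiler inserts no padding.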

Construct the error templates for all new errors to be added in a file suitable for entry with the errupdate command. Run the errupdate command with the -h flag and the input file. The new errors are now part of the error record template repository. A new header file is also created (file.h) in the same directory in which the errupdate command was run. This header file must be included in the device driver code at compile time. Note that the errupdate command has a built-in syntax checker for the new stanza that can be called with the -c flag.

Step 3: Adding Error Logging Calls into the Code

The third step in coding error logging is to put the error logging calls into the device driver code. The errsave kernel service allows the kernel and kernel extensions to write to the error log. Typically, you define a routine in the device driver that can be called by other device driver routines when a loggable error is encountered. This function takes the data passed to it, puts it into the proper structure and calls the errsave kernel service. The syntax for the errsave kernel service is:

#include <sys/errids.h>
void errsave(buf, cnt)
char *buf;
unsigned int cnt;

where:

buf	Specifies a pointer to a buffer that contains an error record as described in the sys/err_rec.h header file.
cnt	Specifies a number of bytes in the error record contained in the buffer pointed to by the buf parameter.

The following sample code is an example of a device driver error logging routine. This routine takes data passed to it from some part of the main body of the device driver. This code simply fills in the structure with the pertinent information, then passes it on using the errsave kernel service.

void
errsv_ex (int err_id, unsigned int port_num,
            int line, char *file, uint data1, uint data2)
{
    dderr     log;
    char      errbuf[255];
    ddex_dds  *p_dds;

    p_dds = dds_dir[port_num];
    log.err.error_id = err_id;

    /* Choose the resource name reported with the error. */
    if (port_num == BAD_STATE) {
        sprintf(log.err.resource_name, "%s :%d",
                p_dds->dds_vpd.adpt_name, data1);
        data1 = 0;
    }
    else
        sprintf(log.err.resource_name, "%s", p_dds->dds_vpd.devname);

    /* Record where the error was detected.                      */
    /* log.file must be a character array member of dderr.       */
    sprintf(errbuf, "line: %d file: %s", line, file);
    strncpy(log.file, errbuf, sizeof(log.file) - 1);
    log.file[sizeof(log.file) - 1] = '\0';  /* strncpy may not terminate */

    log.data1 = data1;
    log.data2 = data2;

    errsave((char *)&log, (uint)sizeof(dderr));   /* run actual logging  */
}  /* end errsv_ex */

The data to be passed to the errsave kernel service is defined in the dderr structure, which is defined in a local header file, dderr.h. The definition for dderr is:

typedef struct dderr {
        struct  err_rec0 err;
        int  data1;     /* use data1 and data2 to show detail  */
        int  data2;     /* data in the errlog report. Define   */
                        /* these fields in the errlog template */
                        /* These fields may not be used in all */
                        /* cases.                              */
} dderr;

The first field of the dderr structure is the err_rec0 structure, which is defined in the sys/err_rec.h header file. This structure contains the ID (or label) and a field for the resource name. The two data fields hold the detail data for the error log report. As an alternative, you could simply list the fields within the function.
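
Because a routine such as errsv_ex takes the source line and file as arguments, callers often wrap it in a macro that supplies them automatically. The following sketch shows one way to do this. DD_ERRLOG is a hypothetical name, and the errsv_ex body here is only a stand-in that records its arguments so that the example can run outside the kernel.

```c
/* Stand-in state so the sketch can be exercised outside a driver. */
int          dd_last_line;
const char  *dd_last_file;
unsigned int dd_last_data1;
unsigned int dd_last_data2;

/* Stand-in for the driver's logging routine; it only records the call. */
void
errsv_ex(int err_id, unsigned int port_num,
         int line, char *file, unsigned int data1, unsigned int data2)
{
    (void)err_id;
    (void)port_num;
    dd_last_line  = line;
    dd_last_file  = file;
    dd_last_data1 = data1;
    dd_last_data2 = data2;
}

/* Hypothetical wrapper: fills in the call site's line and file name. */
#define DD_ERRLOG(id, port, d1, d2) \
    errsv_ex((id), (port), __LINE__, __FILE__, (d1), (d2))
```

A driver routine would then write, for example, DD_ERRLOG(err_id, port_num, status, 0), where err_id is one of the identifiers from the header file that the errupdate command generated.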

You can also log a message into the error log from the command line. To do this, use the errlogger command.

After you add the templates using the errupdate command, compile the device driver code along with the new header file. Simulate the error and verify that it was written to the error log correctly. Some details to check for include:

Is the error daemon running? This can be verified by running the ps -ef command and checking for /usr/lib/errdemon as part of the output.
Is the error part of the error template repository? Verify this by running the errpt -at command.
Was the new header file, which was created by the errupdate command and which contains the error label and unique error identification number, included in the device driver code when it was compiled?