Kernel Extensions and Device Support Programming Concepts
Error Logging
The error facility records device-driver entries
in the system error log. These error log entries record any software or hardware
failures that need to be available either for informational purposes or for
fault detection and corrective action. The device driver, using the errsave kernel service, adds error records to the /dev/error special file.
The errdemon daemon picks up the error record and creates
an error log entry. When you access the error log either through SMIT (System
Management Interface Tool) or with the errpt command, the error record is formatted according to the error
template in the error template repository and presented in either a summary
or detailed report.
Before initiating the error logging process, determine
what services are available to developers, and what services are available
to the customer, service personnel, and defect personnel.
- Determine the Importance of the Error: Use system resources for logging only information that is important or
helpful to the intended audience. Work with the hardware developer, if possible,
to identify detectable errors and the information that should be relayed concerning
those errors.
- Determine the Text of the Message: Use regular
national language support (NLS) XPG/4 messages instead of the codepoints.
For more information about NLS messages, see Message
Facility in AIX 5L Version 5.2 National Language Support Guide and Reference.
- Determine the Correct Level of Thresholding:
Each software or hardware error to be logged can be limited by thresholding
to avoid filling the error log with duplicate information. Side effects of
runaway error logging include overwriting existing error log entries and unduly
alarming the end user. The error log is limited in size. When its size limit
is reached, the log wraps. If a particular error is repeated needlessly, existing
information is overwritten, which might cause inaccurate diagnostic analyses.
The end user or service person can perceive a situation as more serious or
pervasive than it is if they see hundreds of identical or nearly identical
error entries.
You are responsible for implementing
the proper level of thresholding in the device driver code.
The size of the error log is 1 MB. As shipped, the system removes any entries older
than 30 days. To ensure that your error log entries are informative, noticed,
and remain intact, test your driver thoroughly.
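The thresholding described above can be sketched as a small counter that allows only a limited number of log entries per time window. This is a minimal illustration, not an AIX interface: the err_threshold structure, the err_threshold_allow function, and the idea of passing the current time in seconds are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-error threshold state; not part of any AIX header. */
struct err_threshold {
    unsigned int  count;        /* entries logged in the current window */
    unsigned int  limit;        /* log at most this many per window     */
    unsigned long window_start; /* start of the window, in seconds      */
    unsigned long window_len;   /* length of the window, in seconds     */
};

/*
 * Return true if this occurrence should be logged. The caller supplies
 * the current time so the sketch does not depend on a kernel time service.
 */
bool err_threshold_allow(struct err_threshold *t, unsigned long now)
{
    assert(t != NULL);

    /* Start a new window once the old one has expired. */
    if (now - t->window_start >= t->window_len) {
        t->window_start = now;
        t->count = 0;
    }

    /* Over the limit: suppress this duplicate. */
    if (t->count >= t->limit)
        return false;

    t->count++;
    return true;
}
```

A driver routine would call a check like this before building the error record, and skip the logging call when it returns false.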
Setting up Error Logging
To begin error logging, do the following:
- Select the error text.
- Construct error record templates.
- Add error logging calls into the device driver code.
Step 1: Selecting the Error Text
Browse the contents of the system message file. Either
all of the desired messages for the new errors exist in the message file,
none of the messages exist, or only some of the messages exist.
Step 2: Constructing Error Record Templates
Construct your error record templates, which
define the text that displays in the error report. Each error record template
has the following general form:
Error Record Template
+LABEL:
Comment =
Class =
Log =
Report =
Alert =
Err_Type =
Err_Desc =
Prob_Causes =
User_Causes =
User_Actions =
Inst_Causes =
Inst_Actions =
Fail_Causes =
Fail_Actions =
Detail_Data = <data_len>, <data_id>, <data_encoding>
Each field
in this stanza has well-defined criteria for input values. For more information,
see the errupdate command.
The fields are as follows:
- Label
- Requires a unique label for each entry to be added. The label must
follow C language rules for identifiers and must not exceed 16 characters
in length.
- Comment
- Indicates that this is a comment field. You must enclose the comment
in double quotation marks, and it cannot exceed 40 characters.
- Class
- Requires class values of H (hardware), S (software), or U (Undetermined).
- Log
- Requires values True or False. If failure occurs, the errors are logged
only if this field value is set to True. When this value is False the Report and Alert fields
are ignored.
- Report
- Requires values True or False. If the logged error is to be displayed
in the error report, the value of this field must be True.
- Alert
- Requires values True or False. Set this field to True for errors that
are alertable. For errors that are not alertable, set this field to False.
- Err_Type
- Describes the severity of the failure that occurred. Possible
values for Err_Type are as follows:
- INFO
- The error log entry is informational and was not the result of an
error.
- PEND
- A condition in which the loss of availability of a device or component
is imminent.
- PERF
- A condition in which the performance of a device or component was
degraded below an acceptable level.
- PERM
- A permanent failure is defined as a condition that was not recoverable.
For example, an operation was retried a prescribed number of times without
success.
- TEMP
- Recovery from this temporary failure was successful, yet the number
of unsuccessful recovery attempts exceeded a predetermined threshold.
- UNKN
- A condition in which it is not possible to assess the severity of
a failure.
- Err_Desc
- Describes the failure that occurred. Proper input for this field is
the four-digit hexadecimal identifier of the error description message to
be displayed from SET E in the message file.
- Prob_Causes
- Describes one or more probable causes for the failure that occurred.
You can specify a list of up to four Prob_Causes
identifiers separated by commas. A Prob_Causes
identifier displays a probable cause text message from SET P in the message file. List probable causes in the order of decreasing
probability. At least one probable cause identifier is required.
- User_Causes
- Specifies a condition that an operator can resolve without contacting
any service organization. You can specify a list of up to four User_Causes identifiers separated by commas. A User_Causes identifier displays a text message from SET U in the message file. List user causes in the order of decreasing
probability. Leave this field blank if it does not apply to the failure that
occurred. If this field is blank, either the Inst_Causes or the Fail_Causes field must not be blank.
- User_Actions
- Describes recommended actions for correcting a failure that resulted
from a user cause. You can specify a list of up to four recommended User_Actions identifiers separated by commas. A recommended User_Actions identifier displays a recommended action text message from SET R in the message file. You must leave this field
blank if the User_Causes field is blank.
The order in which the recommended actions are listed is determined by the
expense of the action and the probability that the action corrects the failure.
Actions that have little or no cost and little or no impact on system operation
should always be listed first. When actions are equally or nearly equally likely
to correct the failure, list the least expensive action first.
List remaining actions in order of decreasing probability.
- Inst_Causes
- Describes a condition that resulted from the initial installation
or setup of a resource. You can specify a list of up to four Inst_Causes identifiers separated by commas. An Inst_Causes identifier displays a text message from SET I in the message file. List the install causes in the order of decreasing
probability. Leave this field blank if it is not applicable to the failure
that occurred. If this field is blank, either the User_Causes or the Fail_Causes field must
not be blank.
- Inst_Actions
- Describes recommended actions for correcting a failure that resulted
from an install cause. You can specify a list of up to four recommended Inst_Actions identifiers separated by commas. A recommended Inst_Actions identifier displays a recommended action
text message from SET R in the message file. Leave
this field blank if the Inst_Causes field is blank.
The order in which the recommended actions are listed is determined by the
expense of the action and the probability that the action corrects the failure.
See the User_Actions field for the list criteria.
- Fail_Causes
- Describes a condition that resulted from the failure of a resource.
You can specify a list of up to four Fail_Causes
identifiers separated by commas. A Fail_Causes
identifier displays a failure cause text message from SET F in the message file. List the failure causes in the order of decreasing
probability. Leave this field blank if it is not applicable to the failure
that occurred. If you leave this field blank, either the User_Causes or the Inst_Causes field must
not be blank.
- Fail_Actions
- Describes recommended actions for correcting a failure that resulted
from a failure cause. You can specify a list of up to four recommended action
identifiers separated by commas. The Fail_Actions
identifiers must correspond to recommended action messages found in SET R of the message file. Leave this field blank if the Fail_Causes field is blank. Refer to the description of the User_Actions field for criteria in listing these recommended actions.
- Detail_Data
- Describes the detailed data that is logged with the error when the
failure occurs. The Detail_Data field can include, for example,
the name of the detecting module, sense data, or return codes. Leave this
field blank if no detailed data is logged with the error.
You
can repeat the Detail_Data field. The amount of
data logged with an error must not exceed the maximum error record length
defined in the sys/err_rec.h header file. Save failure
data that cannot be contained in an error log entry elsewhere, for example
in a file. The detailed data in the error log entry contains information that
can be used to correlate the failure data to the error log entry. Three values
are required for each detail data entry:
- data_len
- Indicates the number of bytes of data to be associated with the data_id value. The data_len value
is interpreted as a decimal value.
- data_id
- Identifies a text message to be printed in the error report in front
of the detailed data. These identifiers refer to messages in SET D of the message file.
- data_encoding
- Describes how the detailed data is to be printed in the error report.
Valid values for this field are:
- ALPHA
- The detailed data is a printable ASCII character string.
- DEC
- The detailed data is the binary representation of an integer value.
The decimal equivalent is printed.
- HEX
- The detailed data is to be printed in hexadecimal.
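The consistency rules stated in the field descriptions above (at most four identifiers per field, at least one probable cause, at least one of the three cause fields filled in, and no actions without matching causes) can be summarized in a short check. The following sketch is only an illustration of those rules. The tmpl_counts structure and the tmpl_counts_valid function are hypothetical names, and the check is not a substitute for the syntax checking that the errupdate command performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TMPL_MAX_IDS 4  /* each field lists at most four identifiers */

/* Hypothetical summary of one template: identifier count per field. */
struct tmpl_counts {
    size_t prob_causes;
    size_t user_causes, user_actions;
    size_t inst_causes, inst_actions;
    size_t fail_causes, fail_actions;
};

bool tmpl_counts_valid(const struct tmpl_counts *t)
{
    assert(t != NULL);

    /* No field may list more than four identifiers. */
    if (t->prob_causes > TMPL_MAX_IDS ||
        t->user_causes > TMPL_MAX_IDS || t->user_actions > TMPL_MAX_IDS ||
        t->inst_causes > TMPL_MAX_IDS || t->inst_actions > TMPL_MAX_IDS ||
        t->fail_causes > TMPL_MAX_IDS || t->fail_actions > TMPL_MAX_IDS)
        return false;

    /* At least one probable cause identifier is required. */
    if (t->prob_causes == 0)
        return false;

    /* User, install, and failure causes cannot all be blank. */
    if (t->user_causes == 0 && t->inst_causes == 0 && t->fail_causes == 0)
        return false;

    /* An actions field must be blank when its causes field is blank. */
    if ((t->user_causes == 0 && t->user_actions != 0) ||
        (t->inst_causes == 0 && t->inst_actions != 0) ||
        (t->fail_causes == 0 && t->fail_actions != 0))
        return false;

    return true;
}
```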
Sample Error Record Template
An example of an error record template is:
+ MISC_ERR:
Comment = "Interrupt: I/O bus timeout or channel check"
Class = H
Log = TRUE
Report = TRUE
Alert = FALSE
Err_Type = UNKN
Err_Desc = E856
Prob_Causes = 3300, 6300
User_Causes =
User_Actions =
Inst_Causes =
Inst_Actions =
Fail_Causes = 3300, 6300
Fail_Actions = 0000
Detail_Data = 4, 8119, HEX *IOCC bus number
Detail_Data = 4, 811A, HEX *Bus Status Register
Detail_Data = 4, 811B, HEX *Misc. Interrupt Register
Construct
the error templates for all new errors to be added in a file suitable for
entry with the errupdate command. Run the errupdate command with the -h flag and the input
file. The new errors are now part of the error record template repository.
A new header file is also created (file.h) in the same
directory in which the errupdate command was run. This
header file must be included in the device driver code at compile time. Note
that the errupdate command has a built-in syntax checker
for the new stanza that can be called with the -c flag.
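To show how a template relates to the record that the driver builds, the following sketch lays out a record for the MISC_ERR sample, assuming that the detail data bytes follow the err_rec0 header in the order of the Detail_Data lines. The err_rec0 definition and ERR_NAMESIZE value here are stand-ins so that the example compiles outside AIX, the ERRID_MISC_ERR value is a placeholder for the identifier that the generated header would supply, and misc_err_rec and misc_err_fill are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for definitions in sys/err_rec.h; the size is an assumption. */
#define ERR_NAMESIZE 16
struct err_rec0 {
    unsigned int error_id;
    char         resource_name[ERR_NAMESIZE];
};

/* Placeholder only: the real value comes from the errupdate header file. */
#define ERRID_MISC_ERR 0x00000000u

/* Detail data, in the same order as the Detail_Data lines of MISC_ERR. */
struct misc_err_rec {
    struct err_rec0 err;
    uint32_t iocc_bus_number;    /* Detail_Data = 4, 8119, HEX */
    uint32_t bus_status_reg;     /* Detail_Data = 4, 811A, HEX */
    uint32_t misc_interrupt_reg; /* Detail_Data = 4, 811B, HEX */
};

/* Fill the record and return the byte count to pass with it. */
size_t misc_err_fill(struct misc_err_rec *rec, const char *resource,
                     uint32_t bus, uint32_t status, uint32_t misc)
{
    assert(rec != NULL && resource != NULL);

    memset(rec, 0, sizeof(*rec));
    rec->err.error_id = ERRID_MISC_ERR;
    /* Copy at most ERR_NAMESIZE - 1 bytes; memset left the terminator. */
    strncpy(rec->err.resource_name, resource, ERR_NAMESIZE - 1);

    rec->iocc_bus_number    = bus;
    rec->bus_status_reg     = status;
    rec->misc_interrupt_reg = misc;

    return sizeof(*rec);
}
```

The three 4-byte members add up to the 12 bytes of detail data that the sample template declares, which is a useful cross-check when a template and a record structure are maintained separately.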
Step 3: Adding Error Logging Calls into the Code
The third step in coding error logging is to put the error logging
calls into the device driver code. The errsave kernel service allows the kernel and kernel extensions to
write to the error log. Typically, you define a routine in the device driver
that can be called by other device driver routines when a loggable error is
encountered. This function takes the data passed to it, puts it into the proper
structure and calls the errsave kernel service. The
syntax for the errsave kernel service is:
#include <sys/errids.h>
void errsave(buf, cnt)
char *buf;
unsigned int cnt;
where:
- buf
- Specifies a pointer to a buffer that contains an error record as
described in the sys/errids.h header file.
- cnt
- Specifies the number of bytes in the error record contained in the
buffer pointed to by the buf parameter.
The following sample code is an example of a device
driver error logging routine. This routine takes data passed to it from some
part of the main body of the device driver. This code simply fills in the
structure with the pertinent information, then passes it on using the errsave kernel service.
void
errsv_ex (int err_id, unsigned int port_num,
          int line, char *file, uint data1, uint data2)
{
    dderr    log;
    char     errbuf[255];
    ddex_dds *p_dds;

    p_dds = dds_dir[port_num];
    log.err.error_id = err_id;

    /* Build the resource name that is reported with the error */
    if (port_num == BAD_STATE) {
        sprintf(log.err.resource_name, "%s :%d",
                p_dds->dds_vpd.adpt_name, data1);
        data1 = 0;
    }
    else
        sprintf(log.err.resource_name, "%s", p_dds->dds_vpd.devname);

    /* Record where the error was detected; requires a char array */
    /* member named file in the dderr structure */
    sprintf(errbuf, "line: %d file: %s", line, file);
    strncpy(log.file, errbuf, sizeof(log.file) - 1);
    log.file[sizeof(log.file) - 1] = '\0';   /* always terminate */

    log.data1 = data1;
    log.data2 = data2;

    errsave((char *)&log, (uint)sizeof(dderr)); /* run actual logging */
} /* end errsv_ex */
The data to be passed to the errsave kernel service is defined in the dderr structure,
which is defined in a local header file, dderr.h. The
definition for dderr is:
typedef struct dderr {
    struct err_rec0 err;
    char file[64];  /* source location text copied by errsv_ex; */
                    /* the size shown here is illustrative only */
    int data1;      /* use data1 and data2 to show detail       */
    int data2;      /* data in the errlog report. Define        */
                    /* these fields in the errlog template      */
                    /* These fields may not be used in all      */
                    /* cases.                                   */
} dderr;
The first field of the dderr structure
is the err_rec0 structure, which is
defined in the sys/err_rec.h header file. This structure
contains the error ID (or label) and a field for the resource name. The two data
fields hold the detail data for the error log report. As an alternative, you
could simply list the fields within the function.
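Because the sample routine takes the line number and file name of the caller, one convenient way to call it is through a macro that supplies them from the standard __LINE__ and __FILE__ macros. The sketch below assumes this approach. The DDEX_LOG macro, the DDEX_ERR_ID value, and the ddex_check_status function are hypothetical, and errsv_ex is replaced by a stub that only records its arguments so that the example can be compiled outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error ID; a real one comes from the errupdate header. */
#define DDEX_ERR_ID 1

/* What the stub saw on its most recent call. */
static struct {
    int          calls;
    int          err_id;
    int          line;
    const char  *file;
    unsigned int data1;
} stub_seen;

/* Stub standing in for the driver's errsv_ex routine (same parameters). */
void errsv_ex(int err_id, unsigned int port_num,
              int line, char *file, unsigned int data1, unsigned int data2)
{
    assert(file != NULL);
    (void)port_num;   /* unused by the stub */
    (void)data2;      /* unused by the stub */

    stub_seen.calls++;
    stub_seen.err_id = err_id;
    stub_seen.line   = line;
    stub_seen.file   = file;
    stub_seen.data1  = data1;
}

/* Capture the call site automatically instead of passing it by hand. */
#define DDEX_LOG(id, port, d1, d2) \
    errsv_ex((id), (port), __LINE__, __FILE__, (d1), (d2))

/* Example caller: log only when the status word reports a problem. */
int ddex_check_status(unsigned int port, unsigned int status)
{
    if (status != 0) {
        DDEX_LOG(DDEX_ERR_ID, port, status, 0);
        return -1;
    }
    return 0;
}
```

Keeping the call site capture in one macro means that each routine in the driver needs only a single line to report a loggable error.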
You can also log
a message into the error log from the command line. To do this, use the errlogger command.
After you
add the templates using the errupdate command, compile
the device driver code along with the new header file. Simulate the error
and verify that it was written to the error log correctly. Some details to
check for include:
- Is the error daemon running? This can be verified by running the ps -ef command and checking for /usr/lib/errdemon as part of the output.
- Is the error part of the error template
repository? Verify this by running the errpt -at command.
- Was the new header file, which was created by the errupdate command and which contains the error label and unique error identification
number, included in the device driver code when it was compiled?