AIX Version 4.3 Understanding the Diagnostic Subsystem for AIX

Diagnostic Applications

Most resources in a system have a Diagnostic Application (DA) that tests the resource and is started by the Diagnostic Controller. The configuration database associates a DA with each resource that diagnostics supports.

DAs analyze the error log, display prompts and questions to the user, control which tests are run, call Application Test Units, and analyze test results.

The following topics are discussed in detail:

Device Configuration

In some cases, the DA will have to configure a device in order to test it. If the Configuration Method associated with the device does not contain the code that is required to load the device driver into the kernel and initialize it, then the DA will have to perform this function.

However, in most cases, the DA may use one of the diagnostic library functions provided to perform the configuration. The following library functions aid in the configuration/unconfiguration process:

If a DA reconfigures a resource, the DA must restore the resource to its initial state before exiting. Also, never assume that the parent resources are configured.

Determining the Level of Tests to Execute

Each DA is responsible for determining the level of tests that can be safely executed. This determination is a function of how the underlying device drivers support access to the device.

For nonshared, nonmultiplexed devices, the DA should attempt to open() the device with read/write privileges and thus determine its access privileges. For shared or multiplexed devices, a more complicated strategy is needed. Perhaps the simplest method, at least from an application standpoint, is to add support for an openx() system call to the device driver, where the ext parameter distinguishes between port-level and card-level diagnostics.
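The open() probe for a nonshared device can be sketched as follows. This is a minimal illustration, not the library's method: the `test_level` enum and the `choose_test_level` function are hypothetical names, and the mapping of EBUSY and EACCES to shared-only testing is an assumption about how a driver reports a device that is in use.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical test levels; illustrative names, not the AIX definitions. */
enum test_level { LEVEL_NONE, LEVEL_SHARED, LEVEL_FULL };

/*
 * Probe a nonshared device by opening it read/write.
 * Assumption: EBUSY or EACCES means the device is held by another user,
 * so only shared tests are safe; any other failure means no tests can run.
 */
enum test_level choose_test_level(const char *dev_path)
{
    assert(dev_path != NULL);

    int fd = open(dev_path, O_RDWR);
    if (fd >= 0) {
        close(fd);              /* probe only; testing reopens the device */
        return LEVEL_FULL;
    }
    if (errno == EBUSY || errno == EACCES)
        return LEVEL_SHARED;
    return LEVEL_NONE;
}
```

A shared or multiplexed device would need the driver-specific openx() support described above, which this sketch does not attempt.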

Drivers Used for Diagnostic Purposes

There are different scenarios for configuring a resource to test. Depending on the relationship that the resource under test has with other resources, one method may be preferable to another. For instance, unconfiguring a resource in order to load a separate diagnostic driver or kernel extension also requires unconfiguring all of the child resources connected to that resource, if any. This can cause a problem if the child resources are in use. In that case, it is desirable to use the production driver for diagnostic purposes. In all cases, it is important to restore the resource (and its child resources) to the original state after testing.

Production Driver Used for Diagnostic Purposes

If the resource is in the DEFINED state, the resource must be configured before testing. After the resource is configured, tests can be performed on the resource, and then the resource must be put back into its original state.

Separate Diagnostic Driver Used for Diagnostic Purposes

If the resource is in the DEFINED state, the diagnostic driver may be loaded for testing, then unloaded after testing. If the resource is in the AVAILABLE state because the production driver is loaded, it is necessary to unload the production driver, load the diagnostic driver, perform the tests, unload the diagnostic driver, and then reload the production driver. Any child resources must be unconfigured before the resource under test can be unconfigured.

Diagnostic Kernel Extension Used for Diagnostic Purposes

If the resource is in the DEFINED state, the resource must be put into the DIAGNOSE state for testing. If the resource is in the AVAILABLE state because the production driver is loaded, it is necessary to unconfigure the resource and all its children, reconfigure the resource into the DIAGNOSE state, test it, and then reconfigure the resource and all its children back to their original states.
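The save, test, and restore sequence for the DIAGNOSE state can be sketched as below. The `resource` structure, the state enum values, and `run_tests` are hypothetical stand-ins that only track state in memory; a real DA would use the diagnostic library's configuration functions instead of assigning states directly.

```c
#include <assert.h>
#include <stddef.h>

/* Device states named in the text; the numeric values are arbitrary. */
enum dev_state { DEFINED, AVAILABLE, DIAGNOSE };

/* Hypothetical in-memory model of a resource and its children. */
struct resource {
    enum dev_state state;
    struct resource *children;   /* array of child resources, may be NULL */
    size_t n_children;
};

/* Stand-in for the test units: succeeds only in the DIAGNOSE state. */
static int run_tests(const struct resource *r)
{
    return r->state == DIAGNOSE ? 0 : -1;
}

/*
 * Test a resource through the DIAGNOSE state, then put everything back.
 * saved_child must have room for r->n_children entries.
 */
int test_in_diagnose_state(struct resource *r, enum dev_state *saved_child)
{
    assert(r != NULL);
    assert(r->n_children == 0 || saved_child != NULL);

    enum dev_state saved = r->state;

    /* Children must be unconfigured before the parent can be. */
    for (size_t i = 0; i < r->n_children; i++) {
        saved_child[i] = r->children[i].state;
        r->children[i].state = DEFINED;
    }
    r->state = DEFINED;      /* unconfigure; no change if already DEFINED */
    r->state = DIAGNOSE;     /* reconfigure for diagnostics */

    int rc = run_tests(r);

    /* Restore the parent first, then each child, to the saved states. */
    r->state = saved;
    for (size_t i = 0; i < r->n_children; i++)
        r->children[i].state = saved_child[i];
    return rc;
}
```

Saving each child's state individually matters because some children may have been in the DEFINED state before testing and must not be left AVAILABLE afterward.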

Acquiring a Greater Share of the Resource

If further testing requires a greater share of the resource than the DA currently has, the DA should help the user decide whether to proceed with the testing.

For some devices, it may be best to ask the user to switch to another window and vary the device offline before continuing. For others, it may be best to send software-terminate signals. And for still others, it may be best to start the commands that have been specifically provided to gracefully degrade the system.

Error Log Analysis

If the dmode field in the Test Mode Input (TMInput) object class is set to either DMODE_ELA or DMODE_PD, then Error Log Analysis should be performed. Error log analysis should be considered a shared test.

The getdainput subroutine is used to get the test mode input parameters.

resource_alias Attribute

A PDiagAtt stanza must be used to define the alias between the device under test and additional resources in two cases: when a DA needs to analyze error logs from multiple resources (for example, the base system DA with the system planar, memory, and L2 cache resources), and when a DA needs to analyze error logs that are logged against hardware events, such as machine checks or environmental and power warnings (EPOW).

For example, the DA for the system planar on the RSPC platform performs error log analysis for machine checks that are logged by the RSPC Machine Check Error Handler. The following PDiagAtt stanza must be used to define the alias between the resource, sysplanar0, and the machine check event, MACHCHECK.

PDiagAtt:
  	DClass = "planar"
  	DSClass = "sys"
  	DType = "sysplanar_rspc"
  	attribute = "resource_alias"
  	value = "MACHCHECK"
  	rep = "n"
  	DApp = ""

Thus any error logged against "MACHCHECK" will be analyzed by the DA for the resource of the class, subclass and type of "planar/sys/sysplanar_rspc", which is typically "sysplanar0". Any repair action done for the resource (sysplanar0) will be associated with the error logged against "MACHCHECK".
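The effect of a resource_alias entry can be pictured as a lookup from the name an error was logged against to the class, subclass, and type of the resource whose DA analyzes it. The sketch below is illustrative only: `alias_entry`, `alias_table`, and `find_alias` are hypothetical names, and the table is an in-memory copy of the stanza above rather than a query against the PDiagAtt object class.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One resource_alias entry, mirroring the PDiagAtt fields used above. */
struct alias_entry {
    const char *dclass;
    const char *dsclass;
    const char *dtype;
    const char *value;     /* name the error was logged against */
};

/* In-memory copy of the stanza from the text; real entries live in PDiagAtt. */
static const struct alias_entry alias_table[] = {
    { "planar", "sys", "sysplanar_rspc", "MACHCHECK" },
};

/* Return the entry whose alias matches the logged name, or NULL if none. */
const struct alias_entry *find_alias(const char *logged_name)
{
    assert(logged_name != NULL);

    size_t n = sizeof alias_table / sizeof alias_table[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(alias_table[i].value, logged_name) == 0)
            return &alias_table[i];
    }
    return NULL;
}
```

With this table, an error logged against "MACHCHECK" resolves to planar/sys/sysplanar_rspc, while a name with no alias yields no match.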

Another example: the Diagnostic Application for the base system on the CHRP platform performs error log analysis for the firmware-generated error logs for the system planar, memory, and L2 cache resources. The following stanzas are used to invoke error log analysis from Problem Determination mode and to record the repair action in the error log after the system verification procedure.

PDiagAtt:
  	DClass = "planar"
  	DSClass = "sys"
  	DType = "sysplanar_rspc"
  	attribute = "resource_alias"
  	value = "mem0"
  	rep = "n"
  	DApp = ""

PDiagAtt:
  	DClass = "planar"
  	DSClass = "sys"
  	DType = "sysplanar_rspc"
  	attribute = "resource_alias"
  	value = "l2cache0"
  	rep = "n"
  	DApp = ""

Persistent Variables

DAs must store state variables in the Diagnostic Application Variables (DAVars) object class to support loop mode. A DA is executed anew for each pass of loop mode, and thus loses any state that is held only in memory.

The putdavar and getdavar subroutines are used to put or get persistent variables.
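The following sketch shows why loop mode needs persistent variables: each pass starts with fresh local variables, so a pass counter survives only if it is written to a store and read back. The store here is a hypothetical in-memory stand-in for the DAVars object class, and `store_put_int`, `store_get_int`, and `run_pass` are illustrative names; they are not the putdavar and getdavar interfaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory stand-in for the DAVars object class. */
#define MAX_VARS 8
struct var { char name[32]; int value; int used; };
static struct var store[MAX_VARS];

/* Save an integer under a name; returns 0 on success, -1 if the store is full. */
int store_put_int(const char *name, int value)
{
    assert(name != NULL && strlen(name) < sizeof store[0].name);

    int free_slot = -1;
    for (int i = 0; i < MAX_VARS; i++) {
        if (store[i].used && strcmp(store[i].name, name) == 0) {
            store[i].value = value;          /* update existing variable */
            return 0;
        }
        if (!store[i].used && free_slot < 0)
            free_slot = i;                   /* remember first free slot */
    }
    if (free_slot < 0)
        return -1;
    strcpy(store[free_slot].name, name);
    store[free_slot].value = value;
    store[free_slot].used = 1;
    return 0;
}

/* Fetch a saved integer; returns 0 on success, -1 if the name is unknown. */
int store_get_int(const char *name, int *out)
{
    assert(name != NULL && out != NULL);

    for (int i = 0; i < MAX_VARS; i++) {
        if (store[i].used && strcmp(store[i].name, name) == 0) {
            *out = store[i].value;
            return 0;
        }
    }
    return -1;
}

/* One loop-mode pass: locals start fresh, so the count is reloaded each time. */
int run_pass(void)
{
    int passes = 0;
    (void)store_get_int("passes", &passes);  /* not found on the first pass */
    passes++;
    store_put_int("passes", passes);
    return passes;
}
```

Calling `run_pass` three times returns 1, 2, and 3, which would not be possible if the count were kept only in a local variable.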

Field Replaceable Units (FRUs)

DAs report FRU Buckets to identify parts that need to be replaced. The addfrub subroutine is used to add a FRU bucket to the FRU Bucket object class in the configuration database.

Each DA should base its good or bad status on the status of its children. A resource that passes its own tests may still be labeled bad when several of its children have been labeled bad.

If a problem is detected with resource x, which has a parent called resource y and a sibling called resource z, then two FRU Buckets should be output.

The Diagnostic Controller decides which FRU Bucket to use, based on the good/bad status of the sibling. If the sibling passes its tests, then FRU Bucket 2 is named.

Specifying a Text Conclusion

DAs can also specify a menu as a conclusion. A menu should be specified if the repair action can be performed by the customer. For example, if the problem can be solved by formatting a hard disk, then a menu should be specified.

The menugoal subroutine performs this function by adding the menu goal to the Menugoal object class.

Library Restrictions for Diagnostic Programs

Library libc.a.min is the libc included in the standalone diagnostic package. Do not use any function that is not part of libc.a.min in your application. If a function is used in a diagnostic program that is not an exported symbol of libc.a.min, then an immediate software error (803-xxx) will occur when attempting to run the diagnostic program in standalone diagnostic mode.

To ensure that all symbols used by your diagnostics application are included in the standalone environment, compile and link the application code with the libc.a.min library found in the /usr/ccs/lib directory.

One method is to create a directory containing the libraries needed for linking:

  1. Copy libraries libodm.a, libcfg.a, and libcrypt.a to the new directory.
  2. Make a link from /usr/ccs/lib/libc.a.min to libc.a in the new directory.
  3. Make a link from /usr/ccs/lib/libc.a.min to libbind.a in the new directory.
  4. Set and export LIBPATH to point to the new directory.
  5. Compile and link your application.

You can ignore any unresolved symbols coming from libasl, or others that you know about.

Any other errors that indicate unresolved symbols must be fixed before the program will execute properly in standalone diagnostics mode.

Completion Status for Diagnostic Applications

DAs must issue the macro DA_EXIT() to exit.

Individual values can be set by calling the appropriate DA_SETRC_XXXXXX() macro definition.

The following values are defined:

DA_STATUS_GOOD No problems were found.
DA_STATUS_BAD A FRU Bucket or a Menu Goal was reported.
DA_USER_NOKEY No special function keys were entered.
DA_USER_EXIT The Exit key was entered by the user.
DA_USER_QUIT The Cancel key was entered by the user.
DA_ERROR_NONE No errors were encountered performing a normal operation such as displaying a menu, accessing the object repository, and allocating memory.
DA_ERROR_OPEN Could not open the device.
DA_ERROR_OTHER An error was encountered performing a normal operation.
DA_TESTS_NOTEST No tests were executed.
DA_TEST_FULL The full tests were executed.
DA_TEST_SUB The subtests were executed.
DA_TEST_SHR The shared tests were executed.
DA_MORE_NOCONT The isolation process is complete.
DA_MORE_CONT The path to the device should be tested. The next DA to be called will be either the parent or sibling, depending on the value of DNext in the Predefined Diagnostic Resources (PDiagRes) object class.
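The completion status can be thought of as five independent fields (status, user, error, tests, more), each set separately before the DA exits. The sketch below models that idea with a plain structure. It is an assumption-laden illustration: the numeric values of the constants are stand-ins, and `da_result`, `da_result_init`, `record_open_failure`, and `record_problem` are hypothetical helpers, not the DA_SETRC_XXXXXX() or DA_EXIT() macros themselves.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in values: the real constants come from the AIX diagnostic headers. */
enum { DA_STATUS_GOOD, DA_STATUS_BAD };
enum { DA_USER_NOKEY, DA_USER_EXIT, DA_USER_QUIT };
enum { DA_ERROR_NONE, DA_ERROR_OPEN, DA_ERROR_OTHER };
enum { DA_TESTS_NOTEST, DA_TEST_FULL, DA_TEST_SUB, DA_TEST_SHR };
enum { DA_MORE_NOCONT, DA_MORE_CONT };

/* Hypothetical container for the five fields reported at exit. */
struct da_result { int status, user, error, tests, more; };

/* Start from the "nothing happened" value of every field. */
struct da_result da_result_init(void)
{
    struct da_result r = { DA_STATUS_GOOD, DA_USER_NOKEY, DA_ERROR_NONE,
                           DA_TESTS_NOTEST, DA_MORE_NOCONT };
    return r;
}

/* The device could not be opened, so no tests were executed. */
void record_open_failure(struct da_result *r)
{
    assert(r != NULL);
    r->error = DA_ERROR_OPEN;
    r->tests = DA_TESTS_NOTEST;
}

/* A FRU Bucket or Menu Goal was reported after running tests_run. */
void record_problem(struct da_result *r, int tests_run, int test_path)
{
    assert(r != NULL);
    r->status = DA_STATUS_BAD;
    r->tests = tests_run;
    r->more = test_path ? DA_MORE_CONT : DA_MORE_NOCONT;
}
```

Keeping the fields separate means that an open failure changes only the error field and leaves the status as good, matching the definitions in the list above.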

Control Flow of a Diagnostic Application

The DA performs these tasks:

  1. Displays first stand-by menu.
  2. Obtains its input from the TMInput object class.
  3. References the state1 and state2 variables in the TMInput object class to determine if the child devices which were tested during the current session are defective. If so, then the DA should name the parent as being bad.
  4. Determines the level of tests to run.
  5. Calls TU_OPEN.
  6. Calls Application Test Units (TU).
  7. Calls TU_CLOSE.
  8. Restores the device to its original state if the DA caused it to be configured.
  9. Performs error log analysis if the dmode variable in the TMInput object class is equal to DMODE_PD or DMODE_ELA.
  10. Returns status to the Diagnostic Controller through the DA_EXIT() macro call.
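The ordering of these tasks can be sketched as a skeleton that records each step it performs. Everything here is a simplified stand-in: `da_input` is a hypothetical subset of the TMInput fields, the DMODE values are placeholders, and the steps are recorded as labels rather than carried out, so the sketch shows only the sequencing and the two conditional steps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { DMODE_ELA = 1, DMODE_PD = 2 };   /* stand-in values, not the real ones */

/* Hypothetical subset of the test mode input. */
struct da_input { int dmode; int child_was_bad; };

/* Records the steps taken, as labels separated by ';'. */
struct da_trace { char steps[128]; };

static void note(struct da_trace *t, const char *step)
{
    /* Append "step;" only if the label, ';' and the terminator all fit. */
    if (strlen(t->steps) + strlen(step) + 2 <= sizeof t->steps) {
        strcat(t->steps, step);
        strcat(t->steps, ";");
    }
}

/* Walk the ten tasks in order; returns 1 if the resource is named bad. */
int run_da(const struct da_input *in, int da_configured_device,
           struct da_trace *t)
{
    assert(in != NULL && t != NULL);
    t->steps[0] = '\0';

    int bad = 0;
    note(t, "standby");                  /* 1: first stand-by menu */
    note(t, "input");                    /* 2: read test mode input */
    if (in->child_was_bad)               /* 3: defective child names parent */
        bad = 1;
    note(t, "level");                    /* 4: choose level of tests */
    note(t, "tu_open");                  /* 5 */
    note(t, "tu");                       /* 6: run the test units */
    note(t, "tu_close");                 /* 7 */
    if (da_configured_device)            /* 8: undo the DA's own configuration */
        note(t, "restore");
    if (in->dmode == DMODE_PD || in->dmode == DMODE_ELA)
        note(t, "ela");                  /* 9: error log analysis */
    note(t, "exit");                     /* 10: report status and exit */
    return bad;
}
```

The two conditional steps are the ones a DA is most likely to get wrong: restoring the device only when the DA itself configured it, and running error log analysis only in the two modes that call for it.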

SRN Architecture

SRNs should be grouped so that SRNs with the same set of FRU callouts are kept together. For example, if a Diagnostic Application callout consists of:

Then the SRNs should be grouped like the following:

The guidelines for the Reason Codes for SRN Source Numbers 700 to 799 and 811 to 999 that are not decoded from some type of special information are:

000 Reserved
001 Indicates that an adapter or device could not be found
002 to 100 Reserved
101 to 199 Reserved for non-ELA callouts with a single FRU
200 to 299 Reserved for non-ELA callouts with two FRUs
300 to 399 Reserved for non-ELA callouts with three FRUs
400 to 499 Reserved for non-ELA callouts with four or more FRUs
500 to 599 Reserved for non-ELA cases that require a special action such as waiting for a thermal device to cool or checking the level of a device.
600 to 699 Reserved for ELA callouts with a single FRU
700 to 799 Reserved for ELA callouts with two or more FRUs
800 to 899 Reserved for ELA cases that require a special action, such as waiting for a thermal device to cool or checking the level of a device.
900 to 999 Reserved

This is done to group the SRNs with like FRUs into one entry in the SRN Tables.
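The reason code ranges above translate directly into a range check. The sketch below is one way to encode the table; the `reason_class` enum and both function names are hypothetical, and a value of 4 from `reason_code_min_frus` stands for "four or more".

```c
#include <assert.h>

/* Hypothetical categories for the reason code ranges in the table above. */
enum reason_class {
    RC_RESERVED,          /* 000, 002-100, 900-999 */
    RC_NOT_FOUND,         /* 001 */
    RC_NON_ELA_FRUS,      /* 101-499 */
    RC_NON_ELA_SPECIAL,   /* 500-599 */
    RC_ELA_FRUS,          /* 600-799 */
    RC_ELA_SPECIAL,       /* 800-899 */
    RC_INVALID            /* outside 000-999 */
};

/* Map a reason code to its category. */
enum reason_class classify_reason_code(int code)
{
    if (code < 0 || code > 999)   return RC_INVALID;
    if (code == 1)                return RC_NOT_FOUND;
    if (code >= 101 && code <= 499) return RC_NON_ELA_FRUS;
    if (code >= 500 && code <= 599) return RC_NON_ELA_SPECIAL;
    if (code >= 600 && code <= 799) return RC_ELA_FRUS;
    if (code >= 800 && code <= 899) return RC_ELA_SPECIAL;
    return RC_RESERVED;
}

/*
 * Minimum number of FRUs implied by a reason code, or 0 if the range
 * does not describe a FRU callout. 4 means "four or more" for non-ELA
 * codes; 2 means "two or more" for ELA codes.
 */
int reason_code_min_frus(int code)
{
    assert(code >= 0 && code <= 999);

    if (code >= 101 && code <= 199) return 1;
    if (code >= 200 && code <= 299) return 2;
    if (code >= 300 && code <= 399) return 3;
    if (code >= 400 && code <= 499) return 4;
    if (code >= 600 && code <= 699) return 1;
    if (code >= 700 && code <= 799) return 2;
    return 0;
}
```

For instance, reason code 250 is a non-ELA callout with two FRUs, and reason code 750 is an ELA callout with two or more FRUs.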

