Understanding the Diagnostic Subsystem for AIX

Diagnostic Applications

Note
The Diagnostic subsystem supports 32-bit diagnostic applications only.

Most resources in a system have a Diagnostic Application (DA) that tests the resource and is started by the Diagnostic Controller. The configuration database associates a DA with each resource supported by diagnostics.

DAs analyze the error log, display prompts and questions to the user, control which tests are run, call Application Test Units, and analyze test results.

The following topics are discussed in detail:

Device Configuration

In some cases, the DA will have to configure a device in order to test it. If the Configuration Method associated with the device does not contain the code that is required to load the device driver into the kernel and initialize it, then the DA performs this function.

However, in most cases, the DA may use one of the diagnostic library functions provided to perform the configuration. The following library functions aid in the configuration/unconfiguration process:

If a resource is reconfigured, it must be restored to its initial state before the DA exits. Also, never assume that the parent resources are already configured.
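The restore rule can be illustrated with a minimal sketch. It models resources in memory only: the device type, the state names, and every function here are hypothetical stand-ins, not part of the diagnostic library. It records the state of the device and each ancestor, configures parents before children, and puts everything back after the test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states for this sketch, named after the DEFINED and
 * AVAILABLE states used in this document. */
typedef enum { DEV_DEFINED, DEV_AVAILABLE } dev_state;

/* Hypothetical in-memory stand-in for a resource and its parent. */
typedef struct device {
    dev_state state;         /* current state */
    dev_state saved;         /* state recorded before the DA changed anything */
    struct device *parent;   /* NULL for a top-level resource */
} device;

/* Record each resource's state, then configure parents before children. */
static void save_and_configure(device *d)
{
    if (d == NULL)
        return;
    save_and_configure(d->parent);   /* do not assume the parent is configured */
    d->saved = d->state;
    d->state = DEV_AVAILABLE;
}

/* Put the device and every ancestor back, child first. */
static void restore_chain(device *d)
{
    for (; d != NULL; d = d->parent)
        d->state = d->saved;
}

/* Example test body: returns 0 only if the whole chain is configured. */
static int chain_is_available(const device *d)
{
    for (; d != NULL; d = d->parent)
        if (d->state != DEV_AVAILABLE)
            return -1;
    return 0;
}

/* Run a test with the device configured, then restore the initial states. */
static int run_with_device(device *d, int (*test)(const device *))
{
    assert(d != NULL && test != NULL);
    save_and_configure(d);
    int rc = test(d);
    restore_chain(d);
    return rc;
}
```

A real DA would perform the configuration through the Configuration Method or the diagnostic library rather than by flipping a field, but the save, test, restore ordering is the point of the sketch.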

Determining the Level of Tests to Execute

Each DA is responsible for determining the level of tests that can be safely executed. This determination is a function of how the underlying device drivers support access to the device.

For nonshared, nonmultiplexed devices, the DA should attempt to open() the device with read/write privileges and thus determine its access privileges. For shared or multiplexed devices, a more complicated strategy needs to be developed. Perhaps the simplest method - at least from an application standpoint - is to add support for an openx() system call to the device driver, where the ext parameter distinguishes between port-level and card-level diagnostics.
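For the nonshared case, the probe might look like the following sketch. It uses only the standard open() and close() calls. The test_level type and the decision to fall back to a limited level whenever the open fails are assumptions made for illustration; a real DA would inspect errno and would normally keep the descriptor open for its tests.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical test levels for this sketch. */
typedef enum { LEVEL_LIMITED, LEVEL_FULL } test_level;

/* Try to open the device read/write. Success suggests the DA has the
 * access it needs for full tests; any failure selects the limited level. */
static test_level probe_test_level(const char *dev_path)
{
    assert(dev_path != NULL);
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return LEVEL_LIMITED;
    close(fd);   /* a real DA would keep the descriptor for its tests */
    return LEVEL_FULL;
}
```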

Drivers Used for Diagnostic Purposes

There are different scenarios for configuring a resource to test. Depending on the relationship the resource to be tested has with other resources, it may be desirable to use one method over another. For instance, to unconfigure a resource in order to load a separate diagnostic driver or kernel extension, it is necessary to first unconfigure all child resources connected to that resource, if any. This can cause a problem if the child resources are in use. In that case, it is preferable to use the production driver for diagnostic purposes. In all cases, it is important to restore the resource (and its child resources) to their original state after testing.

Production Driver Used for Diagnostic Purposes

If the resource is in the DEFINED state, the resource must be configured before testing. After the resource is configured, tests can be performed on the resource, and then the resource must be put back into its original state.

Separate Diagnostic Driver Used for Diagnostic Purposes

If the resource is in the DEFINED state, the diagnostic driver may be loaded for testing, then unloaded after testing. If the resource is in the AVAILABLE state because the production driver is loaded, it is necessary to unload the production driver, load the diagnostic driver, perform the tests, unload the diagnostic driver, and then reload the production driver. Any child resources must be unconfigured before the resource under test can be unconfigured.

Diagnostic Kernel Extension Used for Diagnostic Purposes

If the resource is in the DEFINED state, the resource must be put into the DIAGNOSE state for testing. If the resource is in the AVAILABLE state because the production driver is loaded, it is necessary to unconfigure the resource and all its children, reconfigure the resource into the DIAGNOSE state, test it, and then reconfigure the resource and all its children back to their original states.
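The three scenarios above can be summarized as a lookup from (method, initial state) to an ordered list of steps. The following is a sketch for reference only: the type and function names are hypothetical, the step strings paraphrase the text, and the single-step plan for a production driver that is already AVAILABLE is an assumption, since the text describes only the DEFINED case.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { RES_DEFINED, RES_AVAILABLE } res_state;
typedef enum { USE_PRODUCTION_DRIVER, USE_DIAG_DRIVER, USE_DIAG_KEXT } diag_method;

static const char *const prod_defined[] = {
    "configure the resource", "run tests",
    "return the resource to the DEFINED state", NULL };
/* Assumption: an AVAILABLE resource is tested with the production driver as is. */
static const char *const prod_available[] = { "run tests", NULL };

static const char *const drv_defined[] = {
    "load the diagnostic driver", "run tests",
    "unload the diagnostic driver", NULL };
static const char *const drv_available[] = {
    "unconfigure child resources", "unload the production driver",
    "load the diagnostic driver", "run tests",
    "unload the diagnostic driver", "reload the production driver",
    "reconfigure child resources", NULL };

static const char *const kext_defined[] = {
    "put the resource into the DIAGNOSE state", "run tests",
    "return the resource to the DEFINED state", NULL };
static const char *const kext_available[] = {
    "unconfigure the resource and all its children",
    "reconfigure the resource into the DIAGNOSE state", "run tests",
    "reconfigure the resource and its children to their original states", NULL };

/* Indexed by [method][initial state]. */
static const char *const *const plans[3][2] = {
    { prod_defined, prod_available },
    { drv_defined,  drv_available  },
    { kext_defined, kext_available },
};

/* Return the NULL-terminated step list for a method and initial state. */
static const char *const *plan_steps(diag_method m, res_state s)
{
    return plans[m][s];
}

/* Count the steps in a NULL-terminated plan. */
static size_t count_steps(const char *const *plan)
{
    assert(plan != NULL);
    size_t n = 0;
    while (plan[n] != NULL)
        n++;
    return n;
}
```

Every plan ends by undoing what it did before the tests, in reverse order, which is the restore rule stated earlier.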

Acquiring a Greater Share of the Resource

If further testing is required, the DA should help the user decide whether to proceed with the testing.

For some devices, it may be best to ask the user to switch to another window and vary the device offline before continuing. For others, it may be best to send software-terminate signals. And for still others, it may be best to start the commands that have been specifically provided to gracefully degrade the system.

Error Log Analysis

If the dmode field in the TMInput (Test Mode Input) object class is set to either DMODE_ELA or DMODE_PD, then Error Log Analysis (ELA) should be performed. Error log analysis should be considered a shared test.

The getdainput subroutine is used to get the test mode input parameters.

resource_alias Attribute

A PDiagAtt stanza must be used to define the alias between the device under test and any additional resources in two cases: when a DA needs to analyze error logs from multiple resources (for example, the base system DA with the system planar, memory, and L2 cache resources), or when a DA needs to analyze error logs that are logged against hardware events, such as machine checks or environmental and power warnings (EPOW).

For example, the DA for the system planar on the RSPC platform performs error log analysis for machine checks that are logged by the RSPC Machine Check Error Handler. The following PDiagAtt stanza must be used to define the alias between the resource, sysplanar0, and the machine check event, MACHCHECK.

PDiagAtt:
  DClass = "planar"
  DSClass = "sys"
  DType = "sysplanar_rspc"
  attribute = "resource_alias"
  value = "MACHCHECK"
  rep = "n"
  DApp = ""

Thus, any error logged against "MACHCHECK" is analyzed by the DA for the resource of the class, subclass and type of "planar/sys/sysplanar_rspc", which is typically "sysplanar0". Any repair action done for the resource (sysplanar0) is associated with the error logged against "MACHCHECK".

Another example: the Diagnostic Application for the base system on the CHRP platform performs error log analysis for the firmware-generated error logs for the system planar, memory, and L2 cache resources. The following stanzas are used to invoke error log analysis from Problem Determination mode and to record the repair action in the error log after the system verification procedure.

PDiagAtt:
  DClass = "planar"
  DSClass = "sys"
  DType = "sysplanar_rspc"
  attribute = "resource_alias"
  value = "mem0"
  rep = "n"
  DApp = ""

PDiagAtt:
  DClass = "planar"
  DSClass = "sys"
  DType = "sysplanar_rspc"
  attribute = "resource_alias"
  value = "l2cache0"
  rep = "n"
  DApp = ""
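The effect of these stanzas can be pictured as a lookup from the name an error was logged against to the class, subclass, and type whose DA analyzes it. The sketch below is only a mental model under that assumption: the alias_entry type and find_alias function are hypothetical, and the real association is resolved from the PDiagAtt object class, not from a C table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory copy of the resource_alias stanzas shown above. */
typedef struct {
    const char *dclass;
    const char *dsclass;
    const char *dtype;
    const char *alias;    /* the "value" field of the stanza */
} alias_entry;

static const alias_entry alias_table[] = {
    { "planar", "sys", "sysplanar_rspc", "MACHCHECK" },
    { "planar", "sys", "sysplanar_rspc", "mem0" },
    { "planar", "sys", "sysplanar_rspc", "l2cache0" },
};

/* Return the entry whose alias matches the logged name, or NULL if no
 * alias is defined for it. */
static const alias_entry *find_alias(const char *logged_name)
{
    assert(logged_name != NULL);
    for (size_t i = 0; i < sizeof alias_table / sizeof alias_table[0]; i++)
        if (strcmp(alias_table[i].alias, logged_name) == 0)
            return &alias_table[i];
    return NULL;
}
```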

Enhanced Error Handling (EEH) Option

The Diagnostic Application interface includes the pdiag_set_eeh_option, pdiag_set_slot_reset, and pdiag_read_slot_reset subroutines, which give the DA the tools it needs to test the EEH option. To support this feature, the DA must perform the following steps in order:

  1. Open I/O Adapter Test Units (TU_OPEN).
  2. Call pdiag_read_slot_reset.
    Verify that the EEH option is supported.
  3. Execute full suite of Test Units (normal Test Units execution for affected component).
    If an EEH error is reported and EEH is supported:
    - Call pdiag_set_slot_reset.
    - Set the PCI slot to reset state (reset active) for the I/O adapter being tested.
    - Report EEH error.
    If an EEH error is reported and EEH is not supported:
    - Report a software error.
  4. Close I/O Adapter Test Units (TU_CLOSE).
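The decision made in step 3 can be sketched as a small pure function. The types and the function name are hypothetical, and the sketch deliberately does not call the pdiag_* subroutines, whose signatures are not given here. It only shows which action follows from the two inputs.

```c
#include <assert.h>
#include <stdbool.h>

/* What the DA reports after the Test Units have run. */
typedef enum { REPORT_NOTHING, REPORT_EEH_ERROR, REPORT_SOFTWARE_ERROR } eeh_report;

typedef struct {
    bool reset_slot;     /* true: call pdiag_set_slot_reset for the adapter's PCI slot */
    eeh_report report;
} eeh_action;

/* Map (EEH supported?, EEH error reported by the Test Units?) to the
 * action described in step 3 of the sequence. */
static eeh_action decide_eeh_action(bool eeh_supported, bool eeh_error_seen)
{
    eeh_action a = { false, REPORT_NOTHING };
    if (eeh_error_seen && eeh_supported) {
        a.reset_slot = true;
        a.report = REPORT_EEH_ERROR;
    } else if (eeh_error_seen) {
        a.report = REPORT_SOFTWARE_ERROR;
    }
    /* The slot is reset only together with an EEH error report. */
    assert(!a.reset_slot || a.report == REPORT_EEH_ERROR);
    return a;
}
```

Whatever the outcome, the DA still closes the Test Units (TU_CLOSE) as the final step.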

Persistent Variables

DAs must store state variables in the DAVars (Diagnostic Application Variables) object class to support loop mode. The DA is executed again for each pass of loop mode, so any state that is not stored there is lost between passes.

The putdavar and getdavar subroutines are used to put or get persistent variables.
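The following sketch shows why a DA needs such a store in loop mode. The put_var and get_var functions and the in-memory table are hypothetical stand-ins for putdavar, getdavar, and the DAVars object class; the real subroutines have their own signatures and persist data in the object repository. The variable name "asked_user" is likewise invented for the example.

```c
#include <assert.h>
#include <string.h>

#define MAX_VARS 8

/* Hypothetical stand-in for the DAVars object class. */
typedef struct { char name[32]; int value; int used; } davar;
static davar store[MAX_VARS];

/* Store a value under a name, overwriting any earlier value. */
static int put_var(const char *name, int value)
{
    assert(name != NULL);
    davar *slot = NULL;
    for (int i = 0; i < MAX_VARS; i++) {
        if (store[i].used && strcmp(store[i].name, name) == 0) {
            slot = &store[i];           /* existing variable wins */
            break;
        }
        if (!store[i].used && slot == NULL)
            slot = &store[i];           /* remember the first free slot */
    }
    if (slot == NULL)
        return -1;                      /* store is full */
    if (!slot->used) {
        strncpy(slot->name, name, sizeof slot->name - 1);
        slot->name[sizeof slot->name - 1] = '\0';
        slot->used = 1;
    }
    slot->value = value;
    return 0;
}

/* Fetch a value by name; returns -1 if it was never stored. */
static int get_var(const char *name, int *value)
{
    assert(name != NULL && value != NULL);
    for (int i = 0; i < MAX_VARS; i++)
        if (store[i].used && strcmp(store[i].name, name) == 0) {
            *value = store[i].value;
            return 0;
        }
    return -1;
}

/* One loop-mode pass. Locals start fresh on every pass, so the DA
 * consults the store to learn whether it already prompted the user.
 * Returns 1 if this pass prompted, 0 otherwise. */
static int da_pass(void)
{
    int asked = 0;
    if (get_var("asked_user", &asked) != 0 || !asked) {
        /* first pass: prompt the user here, then remember that we did */
        put_var("asked_user", 1);
        return 1;
    }
    return 0;
}
```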

Field Replaceable Units (FRUs)

DAs report FRU Buckets to identify parts that need to be replaced. The addfrub subroutine is used to add a FRU bucket to the FRU Bucket object class in the configuration database.

As part of the FRU information, the DA can return a FRU part number for a FRU that is not in the ODM database. The FRU part number is placed in the DAVars object class. Also, if the FRU bucket contains a sub-FRU (for example, a memory module or daughter card), the DA must return its physical or logical location code as part of the FRU bucket.

Each DA should base its good or bad status on the status of its children. A resource may pass its tests and be labeled bad when it has multiple children that have been labeled bad.

If a problem is detected with resource x, which has a parent called resource y and a sibling called resource z, then two FRU Buckets should be output.

The Diagnostic Controller decides which FRU Bucket to use, based on the good/bad status of the sibling. If the sibling passes its tests, then FRU Bucket 2 is named.

Specifying a Text Conclusion

DAs can also specify a menu as a conclusion. A menu should be specified if the repair action can be performed by the customer. For example, if the problem can be solved by formatting a hard disk, then a menu should be specified.

The menugoal subroutine performs this function by adding the menu goal to the Menugoal object class.

Library Restrictions for Diagnostic Programs

Library libc.a.min is the libc included in the standalone diagnostic package. Do not use any function that is not part of libc.a.min in your application. If a function is used in a diagnostic program that is not an exported symbol of libc.a.min, then an immediate software error (803-xxx) occurs when attempting to run the diagnostic program in standalone diagnostic mode.

To ensure that all symbols used by your diagnostics application are included in the standalone environment, compile and link the application code with the libc.a.min library found in the /usr/ccs/lib directory.

One method is to create a directory containing the libraries needed for linking:

  1. Copy libraries libodm.a, libcfg.a, and libcrypt.a to the new directory.
  2. Make a link from /usr/ccs/lib/libc.a.min to libc.a in the new directory.
  3. Make a link from /usr/ccs/lib/libc.a.min to libbind.a in the new directory.
  4. Export LIBPATH to the new directory.
  5. Compile and Link your application.

You can ignore any unresolved symbols coming from libasl, or others that you know about.

Errors found indicating unresolved symbols must be fixed before the program will properly execute in standalone diagnostics mode.

Guidelines for Writing Diagnostic Programs using C++

  1. The standard library libC.a is not supported. Do not use this library's API.
  2. All of the language support functions in libC.a need to be statically linked at compile time. Use -lCns.a and -bI:/usr/lpp/xlC/lib/libC.imp arguments to compile with xlC.
  3. Use an exception only for exceptional cases. For example, an exception should not be used for a program's normal flow of control.
  4. Never throw an exception across shared library and executable boundaries.
  5. No kernel extension shall be written in C++.

Completion Status for Diagnostic Applications

DAs must issue the macro DA_EXIT() to exit.

Individual values can be set by calling the appropriate DA_SETRC_XXXXXX() macro definition.

The following values are defined:

DA_STATUS_GOOD No problems were found.
DA_STATUS_BAD A FRU Bucket or a Menu Goal was reported.
DA_USER_NOKEY No special function keys were entered.
DA_USER_EXIT The Exit key was entered by the user.
DA_USER_QUIT The Cancel key was entered by the user.
DA_ERROR_NONE No errors were encountered performing a normal operation such as displaying a menu, accessing the object repository, and allocating memory.
DA_ERROR_OPEN Could not open the device.
DA_ERROR_OTHER An error was encountered performing a normal operation.
DA_TESTS_NOTEST No tests were executed.
DA_TEST_FULL The full tests were executed.
DA_TEST_SUB The subtests were executed.
DA_TEST_SHR The shared tests were executed.
DA_MORE_NOCONT The isolation process is complete.
DA_MORE_CONT The path to the device should be tested. The next DA to be called is either the parent or sibling, depending on the value of DNext in the Predefined Diagnostic Resources (PDiagRes) object class.

Control Flow of a Diagnostic Application

The DA performs these tasks:

  1. Displays first stand-by menu.
  2. Obtains its input from the TMInput object class.
  3. References the state1 and state2 variables in the TMInput object class to determine if the child devices which were tested during the current session are defective. If so, then the DA should name the parent as being bad.
  4. Determines the level of tests to run.
  5. Calls TU_OPEN.
  6. Calls Application Test Units (TU).
  7. Calls TU_CLOSE.
  8. Unconfigures the device if the DA caused it to be configured, so that the device is back in its initial state.
  9. Performs error log analysis if the dmode variable in the TMInput object class is set to DMODE_PD or DMODE_ELA.
  10. Returns status to the Diagnostic Controller through the DA_EXIT() macro call.
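The ten tasks above, with their two conditional steps, can be sketched as a function that lists the steps a given run would take. All names here are hypothetical. In particular, the SKETCH_DMODE_* values are placeholders for this example only; a real DA compares against the DMODE_* definitions from the diagnostic headers.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder mode values for this sketch only. */
enum { SKETCH_DMODE_OTHER, SKETCH_DMODE_ELA, SKETCH_DMODE_PD };

typedef enum {
    STEP_STANDBY_MENU,    /* 1 */
    STEP_GET_INPUT,       /* 2 */
    STEP_CHECK_CHILDREN,  /* 3 */
    STEP_PICK_LEVEL,      /* 4 */
    STEP_TU_OPEN,         /* 5 */
    STEP_RUN_TUS,         /* 6 */
    STEP_TU_CLOSE,        /* 7 */
    STEP_RESTORE_DEVICE,  /* 8, only if the DA configured the device */
    STEP_ELA,             /* 9, only in PD or ELA mode */
    STEP_DA_EXIT          /* 10 */
} da_step;

#define DA_MAX_STEPS 10

/* Hypothetical summary of the inputs that affect the flow. */
typedef struct {
    int dmode;                  /* one of the SKETCH_DMODE_* values */
    int da_configured_device;   /* nonzero if the DA had to configure the device */
} da_input;

/* Fill steps[] with the tasks this run performs, in order, and return
 * how many there are. */
static size_t da_flow(const da_input *in, da_step steps[DA_MAX_STEPS])
{
    assert(in != NULL && steps != NULL);
    size_t n = 0;
    steps[n++] = STEP_STANDBY_MENU;
    steps[n++] = STEP_GET_INPUT;
    steps[n++] = STEP_CHECK_CHILDREN;
    steps[n++] = STEP_PICK_LEVEL;
    steps[n++] = STEP_TU_OPEN;
    steps[n++] = STEP_RUN_TUS;
    steps[n++] = STEP_TU_CLOSE;
    if (in->da_configured_device)
        steps[n++] = STEP_RESTORE_DEVICE;
    if (in->dmode == SKETCH_DMODE_PD || in->dmode == SKETCH_DMODE_ELA)
        steps[n++] = STEP_ELA;
    steps[n++] = STEP_DA_EXIT;
    return n;
}
```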

SRN Architecture

Diagnostic applications report problems through SRNs (Service Request Numbers). SRNs take the following forms:

Six-digit SRNs should be grouped so that each set of FRU callouts is kept together. For example, if a Diagnostic Application callout consists of:

Then the SRNs should be grouped like the following:

The guidelines for the Reason Codes for SRN Source Numbers 700 to 799 and 811 to 999 that are not decoded from some type of special information are:

000 Reserved
001 Indicates that an adapter or device could not be found
002 to 100 Reserved
101 to 199 Reserved for non-ELA callouts with a single FRU
200 to 299 Reserved for non-ELA callouts with two FRUs
300 to 399 Reserved for non-ELA callouts with three FRUs
400 to 499 Reserved for non-ELA callouts with four or more FRUs
500 to 599 Reserved for non-ELA cases that require a special action such as waiting for a thermal device to cool or checking the level of a device.
600 to 699 Reserved for ELA callouts with a single FRU
700 to 799 Reserved for ELA callouts with two or more FRUs
800 to 899 Reserved for ELA cases that require a special action, such as waiting for a thermal device to cool or checking the level of a device.
900 to 999 Reserved

This is done to group the SRNs with like FRUs into one entry in the SRN Tables.
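The reason code ranges above translate directly into a classification function, sketched below. The enum and function names are hypothetical. The format_srn helper additionally assumes that a six-digit SRN is written as a three-digit hexadecimal source number, a hyphen, and a three-digit decimal reason code, which matches examples such as 801-101 but is not spelled out in this section.

```c
#include <assert.h>
#include <stdio.h>

typedef enum {
    RC_INVALID,            /* outside 000 to 999 */
    RC_RESERVED,           /* 000, 002 to 100, 900 to 999 */
    RC_NOT_FOUND,          /* 001 */
    RC_NON_ELA_ONE_FRU,    /* 101 to 199 */
    RC_NON_ELA_TWO_FRUS,   /* 200 to 299 */
    RC_NON_ELA_THREE_FRUS, /* 300 to 399 */
    RC_NON_ELA_FOUR_PLUS,  /* 400 to 499 */
    RC_NON_ELA_SPECIAL,    /* 500 to 599 */
    RC_ELA_ONE_FRU,        /* 600 to 699 */
    RC_ELA_TWO_PLUS,       /* 700 to 799 */
    RC_ELA_SPECIAL         /* 800 to 899 */
} reason_class;

/* Classify a reason code according to the guideline ranges. */
static reason_class classify_reason(int code)
{
    if (code < 0 || code > 999) return RC_INVALID;
    if (code == 1)   return RC_NOT_FOUND;
    if (code <= 100) return RC_RESERVED;
    if (code <= 199) return RC_NON_ELA_ONE_FRU;
    if (code <= 299) return RC_NON_ELA_TWO_FRUS;
    if (code <= 399) return RC_NON_ELA_THREE_FRUS;
    if (code <= 499) return RC_NON_ELA_FOUR_PLUS;
    if (code <= 599) return RC_NON_ELA_SPECIAL;
    if (code <= 699) return RC_ELA_ONE_FRU;
    if (code <= 799) return RC_ELA_TWO_PLUS;
    if (code <= 899) return RC_ELA_SPECIAL;
    return RC_RESERVED;
}

/* Write "sss-rrr" into buf (assumed layout: hex source, decimal reason).
 * Returns 0 on success, -1 on bad input or a buffer that is too small. */
static int format_srn(char *buf, size_t len, unsigned source, int reason)
{
    assert(buf != NULL);
    if (source > 0xfff || reason < 0 || reason > 999)
        return -1;
    int n = snprintf(buf, len, "%03x-%03d", source, reason);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```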

Diagnostic Controller Generated SRNs

The following table lists the SRNs generated by the Diagnostic Controller when the event shown in the Description column occurs.

Note
"xxx" in the following table represents the source number of the diagnostic application that executed.

SRN Description
802-xxx The diagnostic did not detect an installed device (Online Diagnostics).
803-xxx An error not related to the diagnostic tests occurred.
804-xxx A halt occurred in the diagnostic application.
801-101, 801-102 The diagnostics did not detect an installed device (Standalone Diagnostics).

Source Numbers

The following source numbers are defined for use by third party vendors.

Note
If the LED field of the PdDv object class for a particular device differs from the source number shown in the table below, the LED takes precedence. Source Numbers shown in the following table are hexadecimal values.

Source Number Description
661 IDE Tape Drive
66a USB Open Host Controller Type
66b USB Universal Host Controller Type
74b ATM Adapter
74d Sound Card
74e Fibre Channel Adapter
892 Graphics Display Adapter
893 Local Area Network (LAN) Adapter
894 Async Protocol Adapter
901 SCSI Protocol Device
902 Graphics Display
904 Parallel Port Attached Device
753 IDE CD ROM Drive
891 SCSI Device Adapter
752 IDE Disk Drive
805 CD Read/Write Drive
711 Generic Adapter (Not covered above)