Most resources in a system have a Diagnostic Application (DA), started by the Diagnostic Controller, that tests an area. DAs are associated with each resource supported by diagnostics in the configuration database.
DAs analyze the error log, display prompts and questions to the user, control which tests are run, call Application Test Units, and analyze test results.
The following topics are discussed in detail:
In some cases, the DA will have to configure a device in order to test it. If the Configuration Method associated with the device does not contain the code that is required to load the device driver into the kernel and initialize it, then the DA will have to perform this function.
However, in most cases, the DA may use one of the diagnostic library functions provided to perform the configuration. The following library functions aid in the configuration/unconfiguration process:
If a resource is reconfigured, then it must be restored to its initial state before the DA exits. Also, never assume that the parent resource(s) are always configured.
Each DA is responsible for determining the level of tests that can be safely executed. This determination is a function of how the underlying device drivers support access to the device.
For nonshared, nonmultiplexed devices, the DA should attempt to open() the device with read/write privileges and thus determine its access privileges. For shared or multiplexed devices, a more complicated strategy needs to be developed. Perhaps the simplest method - at least from an application standpoint - is to add support for an openx() system call to the device driver, where the ext parameter distinguishes between port-level and card-level diagnostics.
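The open() probe described above for nonshared, nonmultiplexed devices might look like the following minimal sketch. The `probe_access` function and the `PROBE_*` names are hypothetical and are not part of the diagnostic library. The sketch uses only the standard open() call, because openx() depends on device-driver support that is not shown here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical access levels; these names are not from the diagnostic library. */
enum probe_result { PROBE_NONE, PROBE_READ_ONLY, PROBE_READ_WRITE };

/* Probe a device special file by trying the most privileged open first. */
enum probe_result probe_access(const char *path)
{
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDWR);
    if (fd >= 0) {
        close(fd);
        return PROBE_READ_WRITE;   /* read/write access was granted */
    }

    fd = open(path, O_RDONLY);
    if (fd >= 0) {
        close(fd);
        return PROBE_READ_ONLY;    /* only read access was granted */
    }

    return PROBE_NONE;             /* the device could not be opened at all */
}
```

A DA could use the result to choose the level of tests to run. A real DA would probably keep the descriptor open for testing rather than close it immediately.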
There are different scenarios for configuring a resource to test. Depending on the relationship the resource to be tested has with other resources, it may be desirable to use one method over another. For instance, to unconfigure a resource in order to load a separate diagnostic driver or kernel extension, it is also necessary to unconfigure all of the child resources connected to that resource, if any. This could cause a problem if the child resources are in use. In that case, it is desirable to use the production driver for diagnostic purposes. In all cases, it is important to restore the resource (and its child resources) to their original state after testing.
If the resource is in the DEFINED state, the resource must be configured before testing. After the resource is configured, tests can be performed on the resource, and then the resource must be put back into its original state.
If the resource is in the DEFINED state, the diagnostic driver may be loaded for testing, then unloaded after testing. If the resource is in the AVAILABLE state because the production driver is loaded, it is necessary to unload the production driver, load the diagnostic driver, perform the tests, unload the diagnostic driver, and then reload the production driver. Any child resources must be unconfigured before the resource under test can be unconfigured.
If the resource is in the DEFINED state, the resource must be put into the DIAGNOSE state for testing. If the resource is in the AVAILABLE state because the production driver is loaded, it is necessary to unconfigure the resource and all its children, reconfigure the resource into the DIAGNOSE state, test it, and then reconfigure the resource and all its children back to their original states.
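The save, reconfigure, test, and restore sequence described above can be sketched with a simulated resource. Everything here is a stand-in: the `resource` structure, the `STATE_*` names, and `test_in_diagnose_state` are hypothetical. Real state changes go through the Configuration Methods or the diagnostic library functions, not through assignments.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated resource states, named after the states in the text. */
enum res_state { STATE_DEFINED, STATE_DIAGNOSE, STATE_AVAILABLE };

/* Hypothetical stand-in for a resource with at most one child. */
struct resource {
    enum res_state state;
    struct resource *child;   /* NULL when there is no child */
};

/* Sample test callback: succeeds only if the resource is in DIAGNOSE. */
int sample_tests(struct resource *res)
{
    return res->state == STATE_DIAGNOSE ? 0 : -1;
}

/* Test a resource in the DIAGNOSE state, then restore every saved state.
 * Returns whatever the run_tests callback returns. */
int test_in_diagnose_state(struct resource *res,
                           int (*run_tests)(struct resource *))
{
    enum res_state saved = res->state;
    enum res_state child_saved = STATE_DEFINED;
    int rc;

    /* Children must be unconfigured before the resource under test. */
    if (res->child != NULL) {
        child_saved = res->child->state;
        res->child->state = STATE_DEFINED;
    }
    res->state = STATE_DEFINED;    /* unconfigure (no change if already DEFINED) */
    res->state = STATE_DIAGNOSE;   /* reconfigure for testing */

    rc = run_tests(res);

    /* Restore the resource first, then its child. */
    res->state = saved;
    if (res->child != NULL)
        res->child->state = child_saved;

    assert(res->state == saved);
    return rc;
}
```

The point of the sketch is the ordering: the original states are recorded before anything changes, and they are restored whether the tests pass or fail.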
If further testing is required, then the DA should assist the user in determining if the user should proceed with the testing.
For some devices, it may be best to ask the user to switch to another window and vary the device offline before continuing. For others, it may be best to send software-terminate signals. And for still others, it may be best to start the commands that have been specifically provided to gracefully degrade the system.
If the dmode field in the TMInput (Test Mode Input) object class is set to either DMODE_ELA or DMODE_PD, then Error Log Analysis should be performed. Error Log Analysis should be considered a shared test.
The getdainput subroutine is used to get the test mode input parameters.
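The dmode check can be expressed as a small predicate. This is a sketch only: the numeric values given to DMODE_ELA and DMODE_PD are placeholders (the real constants come from the diagnostic headers), and `DMODE_OTHER_SKETCH`, `tm_input_sketch`, and `should_run_ela` are hypothetical names. Filling the structure through getdainput is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder values: the real DMODE_* constants are defined by the
 * diagnostic headers, and their numeric values are not given in this text. */
enum { DMODE_ELA = 1, DMODE_PD = 2, DMODE_OTHER_SKETCH = 3 };

/* Reduced stand-in for the test mode input; only dmode is modeled. */
struct tm_input_sketch {
    int dmode;
};

/* Return nonzero when Error Log Analysis should be performed. */
int should_run_ela(const struct tm_input_sketch *tmi)
{
    assert(tmi != NULL);
    return tmi->dmode == DMODE_ELA || tmi->dmode == DMODE_PD;
}
```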
A PDiagAtt stanza must be used to define the alias between the device under test and the additional resources in two cases: when a DA needs to analyze error logs from multiple resources (for example, the base system DA, which covers the system planar, memory, and L2 cache resources), and when a DA needs to analyze error logs that are logged against hardware events, such as machine checks or environmental and power warnings (EPOW).
For example, the DA for the system planar on the RSPC platform performs error log analysis for machine checks that are logged by the RSPC Machine Check Error Handler. The following PDiagAtt stanza must be used to define the alias between the resource, sysplanar0, and the machine check event, MACHCHECK.
PDiagAtt:
        DClass = "planar"
        DSClass = "sys"
        DType = "sysplanar_rspc"
        attribute = "resource_alias"
        value = "MACHCHECK"
        rep = "n"
        DApp = ""
Thus any error logged against "MACHCHECK" will be analyzed by the DA for the resource of the class, subclass and type of "planar/sys/sysplanar_rspc", which is typically "sysplanar0". Any repair action done for the resource (sysplanar0) will be associated with the error logged against "MACHCHECK".
Another example: The Diagnostic Application for the base system on the CHRP platform performs error log analysis for the firmware generated error logs for the system planar, memory and l2 cache resources. The following stanzas are used to invoke error log analysis from Problem Determination mode and to record the repair action in the error log after the system verification procedure.
PDiagAtt:
        DClass = "planar"
        DSClass = "sys"
        DType = "sysplanar_rspc"
        attribute = "resource_alias"
        value = "mem0"
        rep = "n"
        DApp = ""

PDiagAtt:
        DClass = "planar"
        DSClass = "sys"
        DType = "sysplanar_rspc"
        attribute = "resource_alias"
        value = "l2cache0"
        rep = "n"
        DApp = ""
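To illustrate what the resource_alias stanzas express, the sketch below models them as a small in-memory table and looks up which class, subclass, and type owns an error logged against a given name. The `alias_entry` structure and `find_alias` function are hypothetical. The real lookup is performed against the PDiagAtt object class, and this code does not read the database.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One resource_alias stanza, reduced to the fields used for matching. */
struct alias_entry {
    const char *dclass;
    const char *dsclass;
    const char *dtype;
    const char *value;    /* the name that errors are logged against */
};

/* The three aliases taken from the example stanzas. */
static const struct alias_entry aliases[] = {
    { "planar", "sys", "sysplanar_rspc", "MACHCHECK" },
    { "planar", "sys", "sysplanar_rspc", "mem0" },
    { "planar", "sys", "sysplanar_rspc", "l2cache0" },
};

/* Return the alias entry for a logged name, or NULL if none is defined. */
const struct alias_entry *find_alias(const char *logged_name)
{
    size_t i;

    assert(logged_name != NULL);
    for (i = 0; i < sizeof aliases / sizeof aliases[0]; i++) {
        if (strcmp(aliases[i].value, logged_name) == 0)
            return &aliases[i];
    }
    return NULL;
}
```

With this table, an error logged against "MACHCHECK" resolves to planar/sys/sysplanar_rspc, while a name with no stanza returns NULL and would not be analyzed by that DA.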
DAs are executed anew for each pass of loop mode and therefore lose their in-memory state between passes. To support loop mode, DAs must store state variables in the DAVars (Diagnostic Application Variables) object class.
The putdavar and getdavar subroutines are used to put or get persistent variables.
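The pattern of reading a saved variable at the start of a pass and writing it back at the end can be sketched as follows. The store here is an in-memory stand-in, so it illustrates only the calling pattern and not persistence across processes. `sketch_putvar`, `sketch_getvar`, and `run_pass` are hypothetical names and do not reproduce the signatures of putdavar and getdavar.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* In-memory stand-in for the DAVars object class. */
#define MAX_VARS 8
struct sketch_var {
    char name[32];
    int value;
    int used;
};
static struct sketch_var store[MAX_VARS];

/* Store an integer under name; returns 0 on success, -1 if the store is full. */
int sketch_putvar(const char *name, int value)
{
    int i, free_slot = -1;

    assert(name != NULL);
    for (i = 0; i < MAX_VARS; i++) {
        if (store[i].used && strcmp(store[i].name, name) == 0) {
            store[i].value = value;          /* overwrite existing entry */
            return 0;
        }
        if (!store[i].used && free_slot < 0)
            free_slot = i;                   /* remember first free slot */
    }
    if (free_slot < 0)
        return -1;
    strncpy(store[free_slot].name, name, sizeof store[free_slot].name - 1);
    store[free_slot].name[sizeof store[free_slot].name - 1] = '\0';
    store[free_slot].value = value;
    store[free_slot].used = 1;
    return 0;
}

/* Fetch an integer; returns 0 on success, -1 if name was never stored. */
int sketch_getvar(const char *name, int *value)
{
    int i;

    assert(name != NULL && value != NULL);
    for (i = 0; i < MAX_VARS; i++) {
        if (store[i].used && strcmp(store[i].name, name) == 0) {
            *value = store[i].value;
            return 0;
        }
    }
    return -1;
}

/* One loop-mode pass: recover the pass count, increment it, save it again. */
int run_pass(void)
{
    int passes = 0;

    if (sketch_getvar("passes", &passes) != 0)
        passes = 0;                          /* first pass: nothing saved yet */
    passes++;
    sketch_putvar("passes", passes);
    return passes;
}
```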
DAs report FRU Buckets to identify parts that need to be replaced. The addfrub subroutine is used to add a FRU bucket to the FRU Bucket object class in the configuration database.
Each DA should base its good or bad status on the status of its children. A resource that passes its own tests can still be labeled bad if several of its children have been labeled bad.
If a problem is detected with resource x, which has a parent called resource y and a sibling called resource z, then two FRU Buckets should be output.
The Diagnostic Controller decides which FRU Bucket to use, based on the good/bad status of the sibling. If the sibling passes its tests, then FRU Bucket 2 is named.
DAs can also specify a menu as a conclusion. A menu should be specified if the repair action can be performed by the customer. For example, if the problem can be solved by formatting a hard disk, then a menu should be specified.
The menugoal subroutine performs this function by adding the menu goal to the Menugoal object class.
Library libc.a.min is the libc included in the standalone diagnostic package. Do not use any function in your application that is not part of libc.a.min. If a diagnostic program uses a function that is not an exported symbol of libc.a.min, an immediate software error (803-xxx) occurs when the program is run in standalone diagnostic mode.
To ensure that all symbols used by your diagnostics application are included in the standalone environment, compile and link the application code with the libc.a.min library found in the /usr/ccs/lib directory.
One method is to create a directory containing the libraries needed for linking:
You can ignore any unresolved symbols coming from libasl, or others that you know about.
Errors found indicating unresolved symbols must be fixed before the program will properly execute in standalone diagnostics mode.
DAs must issue the macro DA_EXIT() to exit.
Individual values can be set by calling the appropriate DA_SETRC_XXXXXX() macro definition.
The following values are defined:
DA_STATUS_GOOD | No problems were found. |
DA_STATUS_BAD | A FRU Bucket or a Menu Goal was reported. |
DA_USER_NOKEY | No special function keys were entered. |
DA_USER_EXIT | The Exit key was entered by the user. |
DA_USER_QUIT | The Cancel key was entered by the user. |
DA_ERROR_NONE | No errors were encountered performing a normal operation such as displaying a menu, accessing the object repository, and allocating memory. |
DA_ERROR_OPEN | Could not open the device. |
DA_ERROR_OTHER | An error was encountered performing a normal operation. |
DA_TESTS_NOTEST | No tests were executed. |
DA_TEST_FULL | The full tests were executed. |
DA_TEST_SUB | The subtests were executed. |
DA_TEST_SHR | The shared tests were executed. |
DA_MORE_NOCONT | The isolation process is complete. |
DA_MORE_CONT | The path to the device should be tested. The next DA to be called will be either the parent or sibling, depending on the value of DNext in the PDiagRes (Predefined Diagnostic Resources) object class. |
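As an illustration of how the exit values in the table combine, the sketch below fills a stand-in structure for two common outcomes. It does not use the real DA_SETRC_XXXXXX() macros or DA_EXIT(). The `exit_sketch` structure and both functions are hypothetical, and the numeric values assigned to the constants are assumptions made so that the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder enumerators that mirror names from the table. The real
 * definitions come from the diagnostic headers; these values are assumed. */
enum { DA_STATUS_GOOD, DA_STATUS_BAD };
enum { DA_USER_NOKEY, DA_USER_EXIT, DA_USER_QUIT };
enum { DA_ERROR_NONE, DA_ERROR_OPEN, DA_ERROR_OTHER };
enum { DA_TEST_FULL = 1, DA_TEST_SUB, DA_TEST_SHR };  /* 0 stands for "no tests" */
enum { DA_MORE_NOCONT, DA_MORE_CONT };

/* Hypothetical container for the five exit fields. */
struct exit_sketch {
    int status, user, error, tests, more;
};

/* Full tests ran and reported a FRU Bucket; isolation is complete. */
void exit_for_failed_full_test(struct exit_sketch *rc)
{
    assert(rc != NULL);
    rc->status = DA_STATUS_BAD;
    rc->user   = DA_USER_NOKEY;
    rc->error  = DA_ERROR_NONE;
    rc->tests  = DA_TEST_FULL;
    rc->more   = DA_MORE_NOCONT;
}

/* Only shared tests (such as Error Log Analysis) ran and found nothing. */
void exit_for_clean_shared_test(struct exit_sketch *rc)
{
    assert(rc != NULL);
    rc->status = DA_STATUS_GOOD;
    rc->user   = DA_USER_NOKEY;
    rc->error  = DA_ERROR_NONE;
    rc->tests  = DA_TEST_SHR;
    rc->more   = DA_MORE_NOCONT;
}
```

In a real DA, each field would be set with the matching DA_SETRC_XXXXXX() macro before DA_EXIT() is issued.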
The DA performs these tasks:
SRNs should be grouped so that each set of FRU callouts is kept together. For example, if a Diagnostic Application callout consists of:
Then the SRNs should be grouped like the following:
The guidelines for the Reason Codes for SRN Source Numbers 700 to 799 and 811 to 999 that are not decoded from some type of special information are:
000 | Reserved |
001 | Indicates that an adapter or device could not be found |
002 to 100 | Reserved |
101 to 199 | Reserved for non-ELA callouts with a single FRU |
200 to 299 | Reserved for non-ELA callouts with two FRUs |
300 to 399 | Reserved for non-ELA callouts with three FRUs |
400 to 499 | Reserved for non-ELA callouts with four or more FRUs |
500 to 599 | Reserved for non-ELA cases that require a special action such as waiting for a thermal device to cool or checking the level of a device. |
600 to 699 | Reserved for ELA callouts with a single FRU |
700 to 799 | Reserved for ELA callouts with two or more FRUs |
800 to 899 | Reserved for ELA cases that require a special action, such as waiting for a thermal device to cool or checking the level of a device. |
900 to 999 | Reserved |
This is done to group the SRNs with like FRUs into one entry in the SRN Tables.
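The reason code ranges in the table can be encoded as a simple classifier, which may be useful when checking that a new SRN falls in the intended group. The `reason_group` names and the `classify_reason_code` function are hypothetical. Only the range boundaries come from the table.

```c
/* Groups taken from the reason code table; the names are illustrative. */
enum reason_group {
    RG_INVALID,            /* outside 000 to 999 */
    RG_RESERVED,           /* 000, 002 to 100, 900 to 999 */
    RG_NOT_FOUND,          /* 001 */
    RG_NON_ELA_1_FRU,      /* 101 to 199 */
    RG_NON_ELA_2_FRU,      /* 200 to 299 */
    RG_NON_ELA_3_FRU,      /* 300 to 399 */
    RG_NON_ELA_4PLUS_FRU,  /* 400 to 499 */
    RG_NON_ELA_SPECIAL,    /* 500 to 599 */
    RG_ELA_1_FRU,          /* 600 to 699 */
    RG_ELA_2PLUS_FRU,      /* 700 to 799 */
    RG_ELA_SPECIAL         /* 800 to 899 */
};

/* Map a three-digit reason code to its group. */
enum reason_group classify_reason_code(int code)
{
    if (code < 0 || code > 999) return RG_INVALID;
    if (code == 1)              return RG_NOT_FOUND;
    if (code <= 100)            return RG_RESERVED;   /* 000 and 002 to 100 */
    if (code <= 199)            return RG_NON_ELA_1_FRU;
    if (code <= 299)            return RG_NON_ELA_2_FRU;
    if (code <= 399)            return RG_NON_ELA_3_FRU;
    if (code <= 499)            return RG_NON_ELA_4PLUS_FRU;
    if (code <= 599)            return RG_NON_ELA_SPECIAL;
    if (code <= 699)            return RG_ELA_1_FRU;
    if (code <= 799)            return RG_ELA_2PLUS_FRU;
    if (code <= 899)            return RG_ELA_SPECIAL;
    return RG_RESERVED;                               /* 900 to 999 */
}
```

For example, `classify_reason_code(250)` falls in the group for non-ELA callouts with two FRUs, and `classify_reason_code(650)` falls in the group for ELA callouts with a single FRU.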