Most resources in a system have a Diagnostic Application (DA), started by the Diagnostic Controller, that tests an area. DAs are associated with each resource supported by diagnostics in the configuration database.
DAs analyze the error log, display prompts and questions to the user, control which tests are run, call Application Test Units, and analyze test results.
The following topics are discussed in detail:
In some cases, the DA will have to configure a device in order to test it. If the Configuration Method associated with the device does not contain the code that is required to load the device driver into the kernel and initialize it, then the DA performs this function.
However, in most cases, the DA may use one of the diagnostic library functions provided to perform the configuration. The following library functions aid in the configuration/unconfiguration process:
If a resource is reconfigured, then it must be restored to its initial state before the DA exits. Also, never assume that the parent resource(s) are always configured.
Each DA is responsible for determining the level of tests that can be safely executed. This determination is a function of how the underlying device drivers support access to the device.
For nonshared, nonmultiplexed devices, the DA should attempt to open() the device with read/write privileges and thus determine its access privileges. For shared or multiplexed devices, a more complicated strategy needs to be developed. Perhaps the simplest method - at least from an application standpoint - is to add support for an openx() system call to the device driver, where the ext parameter distinguishes between port-level and card-level diagnostics.
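For the nonshared case, the probe can be sketched in a few lines of C. This is only an illustration: the `access_level` enum and the `probe_access` name are hypothetical, and the fallback to a read-only open is an assumption added here, not something the diagnostic interface prescribes.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical access levels; these names are not part of the diagnostic API. */
enum access_level {
    ACCESS_NONE,       /* the device could not be opened at all */
    ACCESS_READ_ONLY,  /* the device opened, but only for reading */
    ACCESS_READ_WRITE  /* the device opened for reading and writing */
};

/*
 * Try to open the device read/write first. If that fails, fall back to a
 * read-only open (an assumption of this sketch) to see whether a reduced
 * level of testing is still possible.
 */
enum access_level probe_access(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd >= 0) {
        close(fd);
        return ACCESS_READ_WRITE;
    }

    fd = open(path, O_RDONLY);
    if (fd >= 0) {
        close(fd);
        return ACCESS_READ_ONLY;
    }

    return ACCESS_NONE;
}
```

A DA could map the returned level onto the level of tests it is willing to run, for example running nothing when the result is `ACCESS_NONE`.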
There are different scenarios for configuring a resource to test. Depending on the relationship the resource to be tested has with other resources, it may be desirable to use one method over another. For instance, to unconfigure a resource in order to load a separate diagnostic driver or kernel extension, it is necessary to unconfigure all of the children resources connected to the particular resource, if any. This could cause a problem if the child resources are in use. In this case, it is desirable to use the production driver for diagnostic purposes. In all cases, it is important to restore the resource (and child resources) to their original state after testing.
When the production driver is used for testing and the resource is in the DEFINED state, the resource must be configured before testing. After the resource is configured, tests can be performed on it, and then the resource must be put back into its original state.
If the resource is in the DEFINED state, the diagnostic driver may be loaded for testing, then unloaded after testing. If the resource is in the AVAILABLE state because the production driver is loaded, it is necessary to unload the production driver, load the diagnostic driver, perform the tests, unload the diagnostic driver, and then reload the production driver. Any child resources must be unconfigured before the resource under test can be unconfigured.
If the resource is in the DEFINED state, the resource must be put into the DIAGNOSE state for testing. If the resource is in the AVAILABLE state because the production driver is loaded, it is necessary to unconfigure the resource and all its children, reconfigure the resource into the DIAGNOSE state, test it, and then reconfigure the resource and all its children back to their original states.
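The save, test, and restore pattern for the DIAGNOSE state scenario can be modeled as follows. This is a minimal sketch over a hypothetical in-memory `struct resource`; it does not call any configuration library function, it handles only direct children, and every name in it is invented for illustration.

```c
#include <stddef.h>

#define MAX_CHILDREN 8 /* arbitrary limit chosen for this sketch */

/* The three configuration states named in the text. */
enum res_state { RES_DEFINED, RES_AVAILABLE, RES_DIAGNOSE };

/* Hypothetical in-memory model of a resource and its direct children. */
struct resource {
    enum res_state state;
    struct resource *children[MAX_CHILDREN];
    size_t nchildren;
};

typedef int (*test_fn)(struct resource *res);

/*
 * Move the resource into the DIAGNOSE state, run the tests, then put the
 * resource and its children back exactly as they were found.
 * Returns whatever the test function returns.
 */
int test_in_diagnose_state(struct resource *res, test_fn run_tests)
{
    enum res_state saved_parent = res->state;
    enum res_state saved_child[MAX_CHILDREN];
    size_t i;
    int rc;

    /* Remember every child's state, then unconfigure the children first. */
    for (i = 0; i < res->nchildren; i++) {
        saved_child[i] = res->children[i]->state;
        res->children[i]->state = RES_DEFINED;
    }

    /* Unconfigure the parent, then configure it for diagnostics. */
    res->state = RES_DEFINED;
    res->state = RES_DIAGNOSE;

    rc = run_tests(res);

    /* Restore the parent before its children, whatever the test result. */
    res->state = saved_parent;
    for (i = 0; i < res->nchildren; i++)
        res->children[i]->state = saved_child[i];

    return rc;
}

/* Stand-in test: passes (returns 0) only if the resource is in DIAGNOSE. */
int sample_test(struct resource *res)
{
    return res->state == RES_DIAGNOSE ? 0 : -1;
}
```

The point of the sketch is the ordering: children are unconfigured before the parent, and the original states are restored unconditionally so that a failing test does not leave the resource in the DIAGNOSE state.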
If further testing is required, the DA should help the user decide whether to proceed with the testing.
For some devices, it may be best to ask the user to switch to another window and vary the device offline before continuing. For others, it may be best to send software-terminate signals. And for still others, it may be best to start the commands that have been specifically provided to gracefully degrade the system.
If the dmode field in the TMInput (Test Mode Input) object class is set to either DMODE_ELA or DMODE_PD, then Error Log Analysis (ELA) should be performed. ELA should be considered a shared test.
The getdainput subroutine is used to get the test mode input parameters.
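The decision itself is a simple comparison. In the sketch below, the `SKETCH_DMODE_*` constants are placeholders with assumed values; a real DA would use the DMODE_* definitions from the diagnostic headers and read dmode from the structure filled in by getdainput.

```c
/* Placeholder values for illustration only; not taken from the real headers. */
#define SKETCH_DMODE_ELA    1
#define SKETCH_DMODE_PD     2
#define SKETCH_DMODE_REPAIR 4

/* Return nonzero when error log analysis should be performed for this mode. */
int should_run_ela(int dmode)
{
    return dmode == SKETCH_DMODE_ELA || dmode == SKETCH_DMODE_PD;
}
```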
A PDiagAtt stanza must be used to define an alias between the device under test and additional resources in two cases: when a DA needs to analyze error logs from multiple resources (for example, the base system DA analyzing the system planar, memory, and L2 cache resources), or when a DA needs to analyze error logs that are logged against hardware events, such as machine checks or environmental and power warnings (EPOW).
For example, the DA for the system planar on the RSPC platform performs error log analysis for machine checks that are logged by the RSPC Machine Check Error Handler. The following PDiagAtt stanza must be used to define the alias between the resource, sysplanar0, and the machine check event, MACHCHECK.
PDiagAtt:
        DClass = "planar"
        DSClass = "sys"
        DType = "sysplanar_rspc"
        attribute = "resource_alias"
        value = "MACHCHECK"
        rep = "n"
        DApp = ""
Thus, any error logged against "MACHCHECK" is analyzed by the DA for the resource of the class, subclass and type of "planar/sys/sysplanar_rspc", which is typically "sysplanar0". Any repair action done for the resource (sysplanar0) is associated with the error logged against "MACHCHECK".
As another example, the Diagnostic Application for the base system on the CHRP platform performs error log analysis for the firmware-generated error logs for the system planar, memory, and L2 cache resources. The following stanzas are used to invoke error log analysis from Problem Determination mode and to record the repair action in the error log after the system verification procedure.
PDiagAtt:
        DClass = "planar"
        DSClass = "sys"
        DType = "sysplanar_rspc"
        attribute = "resource_alias"
        value = "mem0"
        rep = "n"
        DApp = ""

PDiagAtt:
        DClass = "planar"
        DSClass = "sys"
        DType = "sysplanar_rspc"
        attribute = "resource_alias"
        value = "l2cache0"
        rep = "n"
        DApp = ""
The Diagnostics Application interface includes the pdiag_set_eeh_option, pdiag_set_slot_reset, and pdiag_read_slot_reset subroutines. These subroutines provide the DA with the necessary tools for adequate testing on the EEH option. The DA Support for this feature requires that the DA perform the following sequence of instructions in order:
DAs must store state variables in the DAVars (Diagnostic Application Variables) object class to support loop mode. A DA is executed anew for each pass of loop mode, and thus loses any in-memory state between passes.
The putdavar and getdavar subroutines are used to put or get persistent variables.
DAs report FRU Buckets to identify parts that need to be replaced. The addfrub subroutine is used to add a FRU bucket to the FRU Bucket object class in the configuration database.
As part of the FRU information, the DA can return a FRU part number for a FRU that is not in the ODM database. The FRU part number is placed in the DAVars object class. Also, if the FRU bucket contains a sub-FRU (for example, a memory module or daughter card), the DA must return its physical or logical location code as part of the FRU bucket.
Each DA should base its good or bad status on the status of its children. A resource may pass its tests and be labeled bad when it has multiple children that have been labeled bad.
If a problem is detected with resource x, which has a parent called resource y and a sibling called resource z, then two FRU Buckets should be output.
The Diagnostic Controller decides which FRU Bucket to use, based on the good/bad status of the sibling. If the sibling passes its tests, then FRU Bucket 2 is named.
DAs can also specify a menu as a conclusion. A menu should be specified if the repair action can be performed by the customer. For example, if the problem can be solved by formatting a hard disk, then a menu should be specified.
The menugoal subroutine performs this function by adding the menu goal to the Menugoal object class.
Library libc.a.min is the libc included in the standalone diagnostic package. Do not use any function that is not part of libc.a.min in your application. If a function is used in a diagnostic program that is not an exported symbol of libc.a.min, then an immediate software error (803-xxx) occurs when attempting to run the diagnostic program in standalone diagnostic mode.
To ensure that all symbols used by your diagnostics application are included in the standalone environment, compile and link the application code with the libc.a.min library found in the /usr/ccs/lib directory.
One method is to create a directory containing the libraries needed for linking:
You can ignore any unresolved symbols coming from libasl, or others that you know about.
Errors found indicating unresolved symbols must be fixed before the program will properly execute in standalone diagnostics mode.
DAs must issue the macro DA_EXIT() to exit.
Individual exit status values can be set by calling the appropriate DA_SETRC_XXXXXX() macro.
The following values are defined:
Value | Description |
---|---|
DA_STATUS_GOOD | No problems were found. |
DA_STATUS_BAD | A FRU Bucket or a Menu Goal was reported. |
DA_USER_NOKEY | No special function keys were entered. |
DA_USER_EXIT | The Exit key was entered by the user. |
DA_USER_QUIT | The Cancel key was entered by the user. |
DA_ERROR_NONE | No errors were encountered performing a normal operation such as displaying a menu, accessing the object repository, and allocating memory. |
DA_ERROR_OPEN | Could not open the device. |
DA_ERROR_OTHER | An error was encountered performing a normal operation. |
DA_TESTS_NOTEST | No tests were executed. |
DA_TEST_FULL | The full tests were executed. |
DA_TEST_SUB | The subtests were executed. |
DA_TEST_SHR | The shared tests were executed. |
DA_MORE_NOCONT | The isolation process is complete. |
DA_MORE_CONT | The path to the device should be tested. The next DA to be called is either the parent or sibling, depending on the value of DNext in the Predefined Diagnostic Resources PDiagRes object class. |
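The values above fall into five independent groups (status, user, error, tests, more), so an exit status can be thought of as one field per group. The sketch below models that idea with hypothetical `sk_` names and assumed numeric values; it does not reproduce the real DA_SETRC_XXXXXX() or DA_EXIT() macros, which come from the diagnostic headers.

```c
/* Illustrative stand-ins for the documented values; numeric values are assumptions. */
enum sk_status { SK_STATUS_GOOD, SK_STATUS_BAD };
enum sk_user   { SK_USER_NOKEY, SK_USER_EXIT, SK_USER_QUIT };
enum sk_error  { SK_ERROR_NONE, SK_ERROR_OPEN, SK_ERROR_OTHER };
enum sk_tests  { SK_TEST_NOTEST, SK_TEST_FULL, SK_TEST_SUB, SK_TEST_SHR };
enum sk_more   { SK_MORE_NOCONT, SK_MORE_CONT };

/* One field per group of values in the table. */
struct sk_exit_code {
    enum sk_status status;
    enum sk_user   user;
    enum sk_error  error;
    enum sk_tests  tests;
    enum sk_more   more;
};

/* Starting point: nothing found, nothing pressed, no errors, no tests run. */
struct sk_exit_code sk_exit_default(void)
{
    struct sk_exit_code rc = {
        SK_STATUS_GOOD, SK_USER_NOKEY, SK_ERROR_NONE,
        SK_TEST_NOTEST, SK_MORE_NOCONT
    };
    return rc;
}

/* The device could not be opened, so no tests were executed. */
void sk_record_open_failure(struct sk_exit_code *rc)
{
    rc->error = SK_ERROR_OPEN;
    rc->tests = SK_TEST_NOTEST;
}

/* A FRU bucket or menu goal was reported after running tests at this level. */
void sk_record_problem(struct sk_exit_code *rc, enum sk_tests level)
{
    rc->status = SK_STATUS_BAD;
    rc->tests = level;
}
```

Keeping the groups separate lets a DA report, for instance, that the full tests ran and found a problem without touching the user or error fields.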
The DA performs these tasks:
Diagnostic applications report problems through SRNs (Service Request Numbers). SRNs take the following forms:
Six-digit SRNs should be grouped so that SRNs with the same set of FRU callouts appear together. For example, if a Diagnostic Application callout consists of:
Then the SRNs should be grouped like the following:
The guidelines for the Reason Codes for SRN Source Numbers 700 to 799 and 811 to 999 that are not decoded from some type of special information are:
Reason Code | Description |
---|---|
000 | Reserved |
001 | Indicates that an adapter or device could not be found |
002 to 100 | Reserved |
101 to 199 | Reserved for non-ELA callouts with a single FRU |
200 to 299 | Reserved for non-ELA callouts with two FRUs |
300 to 399 | Reserved for non-ELA callouts with three FRUs |
400 to 499 | Reserved for non-ELA callouts with four or more FRUs |
500 to 599 | Reserved for non-ELA cases that require a special action such as waiting for a thermal device to cool or checking the level of a device. |
600 to 699 | Reserved for ELA callouts with a single FRU |
700 to 799 | Reserved for ELA callouts with two or more FRUs |
800 to 899 | Reserved for ELA cases that require a special action, such as waiting for a thermal device to cool or checking the level of a device. |
900 to 999 | Reserved |
This is done to group the SRNs with like FRUs into one entry in the SRN Tables.
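The reason code ranges in the table translate directly into a range check. The following sketch classifies a numeric reason code according to those ranges; the enum and function names are hypothetical and exist only for this illustration.

```c
/* Hypothetical categories, one per row (or group of rows) in the table. */
enum reason_class {
    RC_INVALID,            /* outside 000 to 999 */
    RC_RESERVED,           /* 000, 002 to 100, 900 to 999 */
    RC_NOT_FOUND,          /* 001 */
    RC_NON_ELA_1_FRU,      /* 101 to 199 */
    RC_NON_ELA_2_FRU,      /* 200 to 299 */
    RC_NON_ELA_3_FRU,      /* 300 to 399 */
    RC_NON_ELA_4PLUS_FRU,  /* 400 to 499 */
    RC_NON_ELA_SPECIAL,    /* 500 to 599 */
    RC_ELA_1_FRU,          /* 600 to 699 */
    RC_ELA_2PLUS_FRU,      /* 700 to 799 */
    RC_ELA_SPECIAL         /* 800 to 899 */
};

/* Map a reason code (0 to 999) onto the category the guidelines assign to it. */
enum reason_class classify_reason_code(int code)
{
    if (code < 0 || code > 999) return RC_INVALID;
    if (code == 1)   return RC_NOT_FOUND;
    if (code <= 100) return RC_RESERVED;          /* 000 and 002 to 100 */
    if (code <= 199) return RC_NON_ELA_1_FRU;
    if (code <= 299) return RC_NON_ELA_2_FRU;
    if (code <= 399) return RC_NON_ELA_3_FRU;
    if (code <= 499) return RC_NON_ELA_4PLUS_FRU;
    if (code <= 599) return RC_NON_ELA_SPECIAL;
    if (code <= 699) return RC_ELA_1_FRU;
    if (code <= 799) return RC_ELA_2PLUS_FRU;
    if (code <= 899) return RC_ELA_SPECIAL;
    return RC_RESERVED;                           /* 900 to 999 */
}
```

A check like this could be used while authoring SRN tables to confirm that a new reason code lands in the range that matches its FRU count and its ELA or non-ELA origin.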
The following table lists the SRNs generated by the Diagnostic Controller when the event shown in the Description column occurs.
SRN | Description |
---|---|
802-xxx | The diagnostic did not detect an installed device (Online Diagnostics). |
803-xxx | An error not related to the diagnostic tests occurred. |
804-xxx | A halt occurred in the diagnostic application. |
801-101, 801-102 | The diagnostics did not detect an installed device (Standalone Diagnostics). |
The following source numbers are defined for use by third party vendors.
Source Number | Description |
---|---|
661 | IDE Tape Drive |
66a | USB Open Host Controller Type |
66b | USB Universal Host Controller Type |
74b | ATM Adapter |
74d | Sound Card |
74e | Fibre Channel Adapter |
892 | Graphics Display Adapter |
893 | Local Area Network (LAN) Adapter |
894 | Async Protocol Adapter |
901 | SCSI Protocol Device |
902 | Graphics Display |
904 | Parallel Port Attached Device |
753 | IDE CD ROM Drive |
891 | SCSI Device Adapter |
752 | IDE Disk Drive |
805 | CD Read/Write Drive |
711 | Generic Adapter (Not covered above) |