Strategy


AIX Version 4.3 Understanding the Diagnostic Subsystem for AIX

Strategy for Diagnostics

The strategy for diagnostics on RS/6000 machines is founded on:

- Staging diagnostics based on underlying hardware capabilities according to three levels of testing:
  - Shared
  - Subtest
  - Full-test
- Isolating defective field replaceable units (FRUs) such that there is the least impact to the system. This is accomplished by either:
  - Option Checkout
  - System Checkout

Staging the Impact of Diagnostics

The impact of diagnostics is staged. There are three levels of tests supported by diagnostics:

Shared	The tests in this category are nondisruptive. Diagnostics does not need exclusive access to run these tests. All Diagnostic Applications (DAs) should support the shared testing category, since DAs perform error-log analysis. Other possible shared tests are error circuitry testing, cyclic redundancy checks of Loadable ROS, On Board Self Tests (provided the appropriate recovery procedures are included), and selected functional testing such as diagnostic reads and writes.
Subtest	The tests in this category apply to multiplexed resources such as Native I/O Planar and multiport async cards. The subtests are disruptive, but only to a portion of the resource. To run these tests, diagnostics needs exclusive access to the portion of the resource that is being tested.
Full-test	The tests in this category impact the entire resource. Diagnostics must have exclusive access to the entire resource to run these tests.
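
The relationship between the three levels and the exclusive access each one needs can be sketched as a small lookup. This is only an illustration of the descriptions above: the names TestLevel, EXCLUSIVE_ACCESS, and allowed_levels are hypothetical and are not part of the AIX diagnostic interfaces.

```python
from enum import Enum


class TestLevel(Enum):
    # Hypothetical names for the three levels described above.
    SHARED = "shared"
    SUBTEST = "subtest"
    FULL_TEST = "full-test"


# How much of the resource diagnostics must hold exclusively at each level.
EXCLUSIVE_ACCESS = {
    TestLevel.SHARED: "none",       # nondisruptive
    TestLevel.SUBTEST: "portion",   # disruptive to part of the resource
    TestLevel.FULL_TEST: "entire",  # impacts the whole resource
}

_SCOPES = ["none", "portion", "entire"]


def allowed_levels(free_scope):
    """Return the levels that can run when `free_scope` of the resource
    ('none', 'portion', or 'entire') is available for exclusive use."""
    limit = _SCOPES.index(free_scope)
    return [
        level
        for level in TestLevel
        if _SCOPES.index(EXCLUSIVE_ACCESS[level]) <= limit
    ]
```

For a resource that is fully in use by the system, allowed_levels("none") leaves only the shared tests available.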

Option Checkout

If the configuration is viewed as a tree structure, diagnostics starts testing at the leaves of the tree and moves across siblings (horizontally) and through parents (vertically) toward the root. The leaves represent terminal devices, and the root is the processor.

The following algorithm generally describes the isolation strategy. It starts at an arbitrary node in the tree and isolates to the correct FRU bucket based on the good or bad status of siblings and parent resources.

The steps are:

1. Test resource x. If no problems are detected, no further isolation is required.
2. Test a sibling of resource x, called resource y. If no problems are found, the fault is isolated to resource x.
3. If resource y also fails, test the parent of resources x and y. If no problems are detected, the problem has not been isolated to a single failing resource. The FRU buckets associated with resources x and y will both be reported. No further isolation is required. However, if the parent fails its tests, disregard the failures of resources x and y and continue isolating the problem for the parent.

This general process of testing siblings and parents is repeated until a resource passes its tests or until a DA indicates that no further testing is required.
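
The isolation steps above can be sketched as a short recursive function. This is a minimal illustration, not the Diagnostic Controller's implementation: the Resource class and the isolate function are hypothetical, and a boolean `passes` flag stands in for running a resource's DA. The text does not say what happens when a failing resource has no sibling or no parent, so the sketch assumes the fault is then attributed to that resource alone.

```python
class Resource:
    """Hypothetical node in the configuration tree."""

    def __init__(self, name, passes, parent=None):
        self.name = name
        self.passes = passes  # stand-in for the result of running the DA
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def isolate(x):
    """Return the names of the resources whose FRU buckets are reported."""
    # First step: test resource x.
    if x.passes:
        return []
    if x.parent is None:
        # Assumption: at the root there is nothing left to compare against.
        return [x.name]

    # Second step: test a sibling y of x.
    siblings = [c for c in x.parent.children if c is not x]
    y = siblings[0] if siblings else None
    if y is None or y.passes:
        # Assumption when y is None: treat the fault as isolated to x.
        return [x.name]

    # Third step: x and y both failed, so test their parent.
    if x.parent.passes:
        # Not isolated to a single resource: report both FRU buckets.
        return [x.name, y.name]

    # The parent failed: disregard x and y and continue with the parent.
    return isolate(x.parent)
```

For example, if two disks and the adapter they share all fail while a second adapter passes, isolate called on either disk returns only the shared adapter.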

The RS/6000 Diagnostic Subsystem attempts to isolate to a single failing device. When multiple child devices fail their tests, the fault most likely lies with the parent. Thus the DA testing the parent in step 3 should name the parent as being defective and indicate that no more devices should be tested, in which case the Diagnostic Controller would only report the parent. The status of the child devices that have been tested is identified in the DA's control block.

System Checkout

Each resource in the system that has not been deleted from the resource selection list is tested during system checkout. System Checkout selection is accomplished by selecting All Resources from the Resource Selection Menu. User interaction is not allowed unless a problem has been detected and a question needs to be asked to isolate the problem.

Configuration processing for system checkout is different from that for option checkout, which impacts the effectiveness of the FRU Callout. Option checkout is the specification of an individual resource to test. When option checkout is chosen, the chosen option is tested first, and if a problem is found, it is traced back through its siblings and parents until it has been isolated. The configuration is processed from the outside in. When system checkout is chosen, the configuration is processed from the inside out. For example, processing starts with the system planar and works its way out on a per-card basis. First a card is tested, then the devices attached to the card, then the devices attached to those devices, and so on. This process is repeated for each card attached to the system planar.
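
The two processing orders can be contrasted with a small sketch. It assumes the configuration is held as a dictionary mapping each resource to the devices attached to it. The device names and both function names are hypothetical, and a depth-first walk is used as one plausible reading of the inside-out order. The option checkout function shows only the path of parents toward the root and omits the sibling tests.

```python
# Hypothetical configuration: each key lists the devices attached to it.
tree = {
    "sysplanar": ["scsi0", "sa0"],
    "scsi0": ["hdisk0", "cd0"],
    "sa0": ["tty0"],
}


def system_checkout_order(tree, root):
    """Inside out: every resource comes before the devices attached to it."""
    order = [root]
    for child in tree.get(root, []):
        order.extend(system_checkout_order(tree, child))
    return order


def option_checkout_path(tree, target):
    """Outside in: the chosen resource first, then its parents to the root."""
    parent_of = {child: parent for parent, kids in tree.items() for child in kids}
    path = [target]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path
```

With this configuration, system checkout visits sysplanar, scsi0, hdisk0, cd0, sa0, tty0 in that order, while option checkout on hdisk0 works through hdisk0, scsi0, and then sysplanar.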

Option Checkout is more effective because the children are tested before the parent, which allows the parent to determine its own culpability above and beyond its own test results. The parent can implicate itself for no other reason than that its children are failing.
