This document discusses checkstops. A checkstop is a machine check that occurs during another machine check. This document applies to AIX Versions 3.2, 4.1, 4.2, and 4.3.
For more in-depth coverage of this subject, the following IBM publications are recommended:
The product documentation library is also available:
http://www.rs6000.ibm.com/resource/aix_resource/Pubs/index.html
A checkstop is indicated by an LED value of 185, 186, or 187 on the LED display of the main unit. If the machine does not have an LED display or the machine has been rebooted, then evidence of a checkstop should exist in the system error report. Look for an entry labeled CHECKSTOP in the error report to determine if a checkstop occurred.
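The error report check described above can be sketched in shell. This is a minimal illustration, not an official procedure: checkstop_entries is a hypothetical helper name, and the sketch assumes the error report text is supplied on standard input (on AIX, the errpt command prints the error report).

```shell
# checkstop_entries: hypothetical helper, not part of AIX.
# Reads error report text on stdin and prints every line that
# contains the label CHECKSTOP. The exit status is that of grep:
# 0 if at least one matching line was found, non-zero otherwise.
checkstop_entries() {
    grep 'CHECKSTOP'
}

# Possible usage on an AIX system:
#   errpt | checkstop_entries || echo "no CHECKSTOP entries found"
```

Reading from standard input keeps the helper independent of how the report was obtained, so it can also be run against a saved copy of the report.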
A machine check is an error logged by the machine check handler. Causes of a machine check could be:
A non-maskable interrupt (NMI) is generated. The operating system logs the machine check, including various error logging registers reporting the cause of the machine check, and a system dump initiates.
A checkstop is a machine check that occurs during another machine check. A checkstop also occurs when the machine (usually a processor, but sometimes a cache, memory, or I/O bus controller) determines that something is in an "impossible" state. For example, an error occurs that cannot be isolated to a particular bus transfer in progress, or a processor detects that no progress is being made because it has been unable to complete any instructions for some period of time.
When a system checkstops, the clocks in the machine are frozen within a few cycles after the error, and the service processor saves part of the state of the CPUs in NVRAM. It then attempts a full hardware reset and restarts the system, retrying a number of times if necessary.
When the system reboots, the data is copied to a file in the /usr/lib/ras directory (RAS stands for reliability, availability, and serviceability). Two file names are used, checkstop.A and checkstop.B, in a rotating manner. The total number of checkstops that occurred during the reboot attempts, before the system came up successfully, is logged in the error log entry along with the file name.
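Because the two file names rotate, the most recent checkstop data may be in either file. The following is a minimal sketch for finding the newer one by modification time. The function name latest_checkstop_file is hypothetical, and the sketch assumes the file modification time reflects when the checkstop data was written.

```shell
# latest_checkstop_file: hypothetical helper, not part of AIX.
# Prints the most recently modified checkstop.* file in the given
# directory (default: /usr/lib/ras). Prints nothing if none exist.
latest_checkstop_file() {
    dir="${1:-/usr/lib/ras}"
    # ls -t sorts by modification time, newest first; the error
    # message for an unmatched pattern is discarded.
    ls -t "$dir"/checkstop.* 2>/dev/null | head -n 1
}

# Possible usage:
#   latest_checkstop_file
```

The file name recorded in the error log entry remains the authoritative reference; this sketch is only a quick cross-check.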
If a second machine check occurs before the operating system completes logging the error to NVRAM and initiates a complete hardware reset or halts, the processor will checkstop.
Checkstops are inherently hardware phenomena. They do not necessarily indicate a solid failure of a component, so diagnostics will rarely determine that a problem exists. The checkstop file that is generated is required to determine the cause of the checkstop and the corrective actions needed to resolve the situation. This file would be examined by your hardware service organization. For further information, contact one of the following:
Use the following instructions to package these files for hardware service examination.
Gather system information by performing the following steps:
cp /usr/lib/ras/checkstop* /tmp/ibmsupt/testcase
tar -cvf /dev/fd0 /tmp/ibmsupt
In this command, /dev/fd0 is the floppy device.
Very important: If the person sending in this testcase is not the person who reported the problem, be sure to include the name of the person who reported it. If the proper information is not on the package, then it takes valuable time to process and delays solving your problem. The incident# will be the reference number that your hardware service organization assigns to this problem.
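The copy and tar steps above can be combined into one small shell function. This is a sketch under stated assumptions rather than an official procedure: the function name package_checkstops and its three optional arguments are hypothetical, the defaults mirror the paths used in this document, and the mkdir step assumes the /tmp/ibmsupt/testcase directory may not exist yet.

```shell
# package_checkstops: hypothetical helper, not part of AIX.
#   $1  directory holding the checkstop files (default: /usr/lib/ras)
#   $2  staging directory                     (default: /tmp/ibmsupt)
#   $3  tar target, a device or a file        (default: /dev/fd0)
package_checkstops() {
    src="${1:-/usr/lib/ras}"
    stage="${2:-/tmp/ibmsupt}"
    target="${3:-/dev/fd0}"

    # Assumption: create the testcase directory if it is missing.
    mkdir -p "$stage/testcase" || return 1

    # Copy every checkstop file into the testcase directory.
    cp "$src"/checkstop* "$stage/testcase" || return 1

    # Write the whole staging directory to the target.
    tar -cvf "$target" "$stage"
}

# Possible usage, writing to the floppy device as in the steps above:
#   package_checkstops
```

Passing a file name as the third argument writes a tar archive to disk instead of to the floppy, which can be convenient on machines without a diskette drive.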
Listed below are some possible software resolutions for checkstop conditions as of this document's last update. To check for the latest checkstop-related software fixes, go to the TechSupport Online databases; select the APAR databases and search on the keyword CHECKSTOP.
APARs are no longer being written for 3.2.x. Upgrading to the latest level of the OS will resolve any problems that can be fixed via APAR.
APAR      DESCRIPTION                               HARDWARE
IX53114   3.2.5.101 Upgrade from 3.2.5.1
IX60081   3.2.5.102 Upgrade from 3.2.5.1
APARs are no longer being written for 4.1.x. Upgrading to the latest level of the OS will resolve any problems that can be fixed via APAR.
APAR      DESCRIPTION                               HARDWARE
IX88586   Latest AIX 4.1.5 Fixes as of March 1999
APAR      DESCRIPTION                               HARDWARE
IX69143   CHECKSTOP 185/186 ON GXT500D/GXT500
           WITH X -BS OPTION                        GXT500
IX70175   NEED SW WORKAROUND FOR PEGASUS 6XX        7012-G30, 7012-G40, 7012-G50
           BUS LIVELOCK                             7013-J30, 7013-J40, 7013-J50,
                                                    7015-R30, 7015-R40, 7015-R50
IX62156   UNALIGNED TRANSFERS ON 825A CAN CAUSE
           MACHINE CHECK                            PCI F/W SCSI Adap.
IX66931   PCI SCSI ADAPTER CAUSES MACHINE CHECK
           IN WILDCAT
IX61252   GXT500 CHECKSTOPS DOING SOLID MODEL
           ROTATION IN CATIA                        GXT500
IX83745   CHECKSTOP ON SPHINX                       43P-260
IX74688   ROBUST RISC CHECKSTOP ANALYSIS            7012-G30/G40, 7013-J30/J40/J50,
                                                    7015-R30/R40/R50
IX75066   THE CHECKSTOP ERROR ISN'T ACCURATE 
          FOR PAL MACHINES
IX89142   NEED TO RENAME 'CHECKSTOP' FILE TO 
          SOMETHING BENIGN ON CHRP BOX
IX75637   ADD FUNCTIONALITY TO SNAP TO COLLECT 
          CHECKSTOP FILES
APAR      DESCRIPTION                               HARDWARE
IX72262   APACHE DEADLOCK AVOIDANCE WORKAROUNDS     7017-S70
IX83586   CHECKSTOP ON SPHINX                       43P-260
IX89790   699:SPHINX2 CHECKSTOP W/MTN&2MIR WHEN     GXT2000P,GXT3000P
           CATIA CHAINSAW GRAPER FCN
IY04927   CHECKSTOP WITH GIGABIT ETHERNET WITH      604e+GigaBit ETHERNET
           604e PROCESSOR
IX74019   ROBUST RISC CHECKSTOP ANALYSIS            7012-G30/G40, 7013-J30/J40/J50,
                                                    7015-R30/R40/R50
IX84430   NEED TO RENAME 'CHECKSTOP' FILE TO 
          SOMETHING BENIGN ON CHRP BOX
IX74302   ADD FUNCTIONALITY TO SNAP TO COLLECT 
          CHECKSTOP FILES