Starting with machine type 7044 model 270, the hardware of all systems with more than two processors is able to detect correctable errors, which are gathered by the firmware. These errors are not fatal and, as long as they remain rare occurrences, can be safely ignored. However, when a pattern of failures seems to be developing on a specific processor, this pattern might indicate that this component is likely to exhibit a fatal failure in the near future. This prediction is made by the firmware based on the failure rates and threshold analysis.
On these systems, AIX implements continuous hardware surveillance and regularly polls the firmware for hardware errors. When the number of processor errors hits a threshold and the firmware recognizes that there is a distinct probability that this system component will fail, the firmware returns an error report. In all cases, the error is logged in the system error log. In addition, on multiprocessor systems, depending on the type of failure, AIX attempts to stop using the untrustworthy processor and deallocate it. This feature is called Dynamic Processor Deallocation.
At this point, the firmware also flags the processor for persistent deallocation, so that it stays deallocated across subsequent reboots until maintenance personnel replace it.
Processor deallocation is transparent for the vast majority of applications, including drivers and kernel extensions. However, you can use the published interfaces to determine whether an application or kernel extension is running on a multiprocessor machine, find out how many processors there are, and bind threads to specific processors.
The interface for binding processes or threads to processors uses logical CPU numbers. The logical CPU numbers are in the range [0..N-1] where N is the total number of CPUs. To avoid breaking applications or kernel extensions that assume no "holes" in the CPU numbering, AIX always makes it appear to applications as if it is the "last" (highest numbered) logical CPU to be deallocated. For instance, on an 8-way SMP, the logical CPU numbers are [0..7]. If one processor is deallocated, the total number of available CPUs becomes 7, and they are numbered [0..6]. Externally, it seems as if CPU 7 has disappeared, regardless of which physical processor failed.
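The renumbering rule can be illustrated with a small model. This is only a sketch: the function name `deallocate` is hypothetical, and the remapping policy (the processor that held the last logical number takes over the failed processor's logical number) is an assumption made for illustration, since the text only states that the highest logical number is the one that disappears.

```python
def deallocate(logical_to_physical, failed_physical):
    """Return the logical-to-physical map after one processor is deallocated.

    Assumption (not stated in the text): the physical processor that held the
    last logical number takes over the failed processor's logical number.
    """
    mapping = list(logical_to_physical)
    slot = mapping.index(failed_physical)  # logical number of the ailing CPU
    mapping[slot] = mapping[-1]            # last logical CPU's processor moves in
    mapping.pop()                          # the highest logical number disappears
    return mapping


cpus = list(range(8))        # 8-way SMP: logical CPU i runs on physical CPU i
after = deallocate(cpus, 2)  # physical processor 2 fails
print(after)                 # [0, 1, 7, 3, 4, 5, 6]: logical numbers are 0..6
```

Whatever the internal policy, the externally visible result is the same: the logical numbers stay contiguous and only the last one goes away.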
Potentially, applications or kernel extensions that are binding processes or threads could be broken if AIX silently terminated their bound threads or forcefully moved them to another CPU when one of the processors needs to be deallocated. Dynamic Processor Deallocation provides programming interfaces so that such applications and kernel extensions can be notified that a processor deallocation is about to happen. When these applications and kernel extensions receive notification, they are responsible for moving their bound threads and associated resources (such as timer request blocks) away from the last logical CPU and for adapting themselves to the new CPU configuration. After notification, if some threads remain bound to the last logical CPU, the deallocation is aborted, the aborted deallocation is logged in the error log, and AIX continues using the ailing processor. When the processor ultimately fails, it causes a total system failure. Therefore, it is important that applications or kernel extensions receive notification of an impending processor deallocation and act on this notice.
Even in the rare cases where the deallocation cannot go through, Dynamic Processor Deallocation still gives advance warning to system administrators. By recording the error in the error log, it gives them a chance to schedule a maintenance operation on the system to replace the ailing component before a global system failure occurs.
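The typical flow of events for processor deallocation is as follows:
1. The firmware detects that the rate of correctable errors on a processor has reached its threshold and returns an error report when AIX polls it.
2. AIX logs the predictive failure in the system error log and, if Dynamic Processor Deallocation is enabled, starts the deallocation.
3. AIX notifies the applications and kernel extensions that use the notification interfaces.
4. The notified applications and kernel extensions move their bound threads and associated resources away from the last logical CPU.
5. If threads remain bound to the last logical CPU, or if a driver or kernel extension returns an error, the deallocation is aborted and the reason is logged. Otherwise, AIX stops using the ailing processor and logs the successful deallocation.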
If there is a failure at any point of the deallocation, the failure and its cause are logged. The system administrator can look at the error log, take corrective action (when possible) and restart the deallocation. For instance, if the deallocation was aborted because an application did not unbind its bound threads, the system administrator can stop the application, restart the deallocation, and then restart the application.
Dynamic Processor Deallocation can be enabled or disabled by changing the value of the cpuguard attribute of the ODM object sys0. The possible values for the attribute are enable and disable.
Beginning with AIX 5.2, the default is enabled (the attribute cpuguard has a value of enable). System administrators who want to disable this feature must use either the Web-based System Manager system menus, the SMIT System Environments menu, or the chdev command. (In previous AIX versions, the default was disabled.)
Sometimes the processor deallocation fails because an application did not move its bound threads away from the last logical CPU. Once this problem has been fixed, either by unbinding (when it is safe to do so) or by stopping the application, the system administrator can restart the processor deallocation process using the ha_star command.
The syntax for this command is:
ha_star -C
where -C is for a CPU predictive failure event.
Physical processors are represented in the ODM database by objects named procn where n is a decimal number that represents the physical processor number. Like any other device represented in the ODM database, processor objects have a state, such as Defined/Available, and attributes.
The state of a proc object is always Available as long as the corresponding processor is present, regardless of whether it is usable. The state attribute of a proc object indicates if the processor is used and, if not, the reason. This attribute can have three values:
enable | The processor is used. |
disable | The processor has been dynamically deallocated. |
faulty | The processor was declared defective by the firmware at startup time. |
If an ailing processor is successfully deallocated, its state goes from enable to disable. Independently of AIX, this processor is also flagged in the firmware as defective. Upon reboot, the deallocated processor will not be available and will have its state set to faulty. The ODM proc object, however, is still marked Available. You must physically remove the defective CPU from the system board or remove the CPU board (if possible) for the proc object to change to Defined.
In the following scenario, processor proc4 is working correctly and is being used by the operating system, as shown in the following output:
# lsattr -EH -l proc4
attribute value            description     user_settable

state     enable           Processor state False
type      PowerPC_RS64-III Processor type  False
#
When processor proc4 gets a predictive failure, it gets deallocated by the operating system, as shown in the following:
# lsattr -EH -l proc4
attribute value            description     user_settable

state     disable          Processor state False
type      PowerPC_RS64-III Processor type  False
#
At the next system restart, processor proc4 is reported by firmware as defective, as shown in the following:
# lsattr -EH -l proc4
attribute value            description     user_settable

state     faulty           Processor state False
type      PowerPC_RS64-III Processor type  False
#
But the status of processor proc4 remains Available, as shown in the following:
# lsdev -CH -l proc4
name  status    location description

proc4 Available 00-04    Processor
#
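A monitoring script might want to read the state attribute from captured lsattr output rather than by eye. The following is a minimal sketch, assuming the output has the column layout shown in the examples above; the helper name `proc_state` and the `STATE_MEANING` table are hypothetical and simply restate the three documented values.

```python
# Meanings of the three documented values of the state attribute.
STATE_MEANING = {
    "enable": "the processor is used",
    "disable": "the processor has been dynamically deallocated",
    "faulty": "the processor was declared defective by the firmware at startup",
}


def proc_state(lsattr_output):
    """Return the value of the 'state' attribute, or None if it is absent."""
    for line in lsattr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "state":
            return fields[1]
    return None


# Sample text copied from the deallocated-processor example.
sample = """\
attribute value            description     user_settable

state     disable          Processor state False
type      PowerPC_RS64-III Processor type  False
"""

state = proc_state(sample)
print(state, "->", STATE_MEANING[state])
```

Note that the device status reported by lsdev (Available or Defined) is a separate piece of information and is not covered by this helper.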
Three different error log messages are associated with CPU deallocation. The following are examples.
# errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME DESCRIPTION
804E987A   1008161399 I O proc4         CPU DEALLOCATED
8470267F   1008161299 T S proc4         CPU DEALLOCATION ABORTED
1B963892   1008160299 P H proc4         CPU FAILURE PREDICTED
#
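The summary lines can be split into fields for scripting. This is a sketch only: the function name `parse_errpt_line` is hypothetical, and it assumes that the TIMESTAMP column uses the MMDDhhmmYY layout and that the description is everything after the fifth field.

```python
from datetime import datetime


def parse_errpt_line(line):
    """Split one errpt summary line into its six columns.

    Assumption: TIMESTAMP is MMDDhhmmYY (month, day, hour, minute, 2-digit year).
    """
    ident, stamp, etype, eclass, resource, description = line.split(None, 5)
    return {
        "identifier": ident,
        "time": datetime.strptime(stamp, "%m%d%H%M%y"),
        "type": etype,
        "class": eclass,
        "resource": resource,
        "description": description.strip(),
    }


entry = parse_errpt_line("8470267F 1008161299 T S proc4 CPU DEALLOCATION ABORTED")
print(entry["resource"], entry["time"].isoformat(), entry["description"])
```

Filtering the parsed entries on the description or identifier is then enough to spot an aborted deallocation that needs attention.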
Error description: Predictive Processor Failure
This error indicates that the hardware detected a processor with a high probability of failing in the near future. It is always logged, whether or not processor deallocation is enabled.
DETAIL DATA: Physical processor number, location
Example error log entry - long form
LABEL:           CPU_FAIL_PREDICTED
IDENTIFIER:      1655419A

Date/Time:       Thu Sep 30 13:42:11
Sequence Number: 53
Machine Id:      00002F0E4C00
Node Id:         auntbea
Class:           H
Type:            PEND
Resource Name:   proc25
Resource Class:  processor
Resource Type:   proc_rspc
Location:        00-25

Description
CPU FAILURE PREDICTED

Probable Causes
CPU FAILURE

Failure Causes
CPU FAILURE

Recommended Actions
ENSURE CPU GARD MODE IS ENABLED
RUN SYSTEM DIAGNOSTICS.

Detail Data
PROBLEM DATA
0144 1000 0000 003A 8E00 9100 1842 1100 1999 0930 4019 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 4942 4D00 5531 2E31 2D50 312D 4332 0000
0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
... ... ... ... ...
Error Description: A processor has been successfully deallocated after detection of a predictive processor failure. This message is logged when processor deallocation is enabled, and when the CPU has been successfully deallocated.
DETAIL DATA: Logical CPU number of deallocated processor.
Example: error log entry - long form:
LABEL:           CPU_DEALLOC_SUCCESS
IDENTIFIER:      804E987A

Date/Time:       Thu Sep 30 13:44:13
Sequence Number: 63
Machine Id:      00002F0E4C00
Node Id:         auntbea
Class:           O
Type:            INFO
Resource Name:   proc24

Description
CPU DEALLOCATED

Recommended Actions
MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE

Detail Data
LOGICAL DEALLOCATED CPU NUMBER
0
In this example, proc24 was successfully deallocated and was logical CPU 0 when the failure occurred.
Error Description: A processor deallocation, due to a predictive processor failure, was not successful. This message is logged when CPU deallocation is enabled, and when the CPU has not been successfully deallocated.
DETAIL DATA: Reason code, logical CPU number, additional information depending on the type of failure.
The reason code is a numeric hexadecimal value. The possible reason codes are:
2 | One or more processes/threads remain bound to the last logical CPU. In this case, the detailed data give the PIDs of the offending processes. |
3 | A registered driver or kernel extension returned an error when notified. In this case, the detailed data field contains the name of the offending driver or kernel extension (ASCII encoded). |
4 | Deallocating the processor would leave the machine with fewer than two available CPUs. The operating system does not deallocate more than N-2 processors on an N-way machine. This avoids confusing applications or kernel extensions that use the total number of available processors to determine whether they are running on a uniprocessor (UP) system, where it is safe to skip the use of multiprocessor locks, or on a symmetric multiprocessor (SMP) system. |
200 (0xC8) | Processor deallocation is disabled (the ODM attribute cpuguard has a value of disable). You normally do not see this error unless you start ha_star manually. |
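The reason codes and the N-2 rule can be captured in a few lines. This is a sketch under stated assumptions: the names `ABORT_REASONS`, `parse_cause`, and `would_leave_too_few` are hypothetical, and the cause field is assumed to be printed as space-separated hexadecimal groups, as in the long-form error log entries.

```python
# Short summaries of the documented reason codes (keys are decimal values).
ABORT_REASONS = {
    2: "processes or threads remain bound to the last logical CPU",
    3: "a registered driver or kernel extension returned an error",
    4: "fewer than two CPUs would remain available",
    200: "processor deallocation is disabled (cpuguard=disable)",
}


def parse_cause(cause_field):
    """Convert a cause field such as '0000 0003' to an integer reason code."""
    return int(cause_field.replace(" ", ""), 16)


def would_leave_too_few(available_cpus):
    """True if removing one CPU would leave fewer than two available CPUs."""
    return available_cpus - 1 < 2


code = parse_cause("0000 0003")
print(code, "->", ABORT_REASONS.get(code, "unknown reason code"))
print(would_leave_too_few(2))  # True: a 2-CPU system never drops to one CPU
```

Because the field is hexadecimal, reason code 200 would appear as 0000 00C8 under the same assumption.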
Examples: error log entries - long format
Example 1:
LABEL:           CPU_DEALLOC_ABORTED
IDENTIFIER:      8470267F

Date/Time:       Thu Sep 30 13:41:10
Sequence Number: 50
Machine Id:      00002F0E4C00
Node Id:         auntbea
Class:           S
Type:            TEMP
Resource Name:   proc26

Description
CPU DEALLOCATION ABORTED

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
SEE USER DOCUMENTATION FOR CPU GARD

Detail Data
DEALLOCATION ABORTED CAUSE
0000 0003
DEALLOCATION ABORTED DATA
6676 6861 6568 3200
In this example, the deallocation for proc26 failed. The reason code 3 means that a kernel extension returned an error to the kernel notification routine. The DEALLOCATION ABORTED DATA above spells fvhaeh2, which is the name the extension used when registering with the kernel.
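For reason code 3, the data field can be decoded mechanically. A minimal sketch, assuming the field is a NUL-terminated ASCII string printed as hexadecimal groups; the function name `decode_name` is hypothetical.

```python
def decode_name(data_field):
    """Decode a hex-dumped, NUL-terminated ASCII name from the data field."""
    raw = bytes.fromhex(data_field.replace(" ", ""))
    return raw.split(b"\x00", 1)[0].decode("ascii")


print(decode_name("6676 6861 6568 3200"))  # fvhaeh2
```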
Example 2:
LABEL:           CPU_DEALLOC_ABORTED
IDENTIFIER:      8470267F

Date/Time:       Thu Sep 30 14:00:22
Sequence Number: 71
Machine Id:      00002F0E4C00
Node Id:         auntbea
Class:           S
Type:            TEMP
Resource Name:   proc19

Description
CPU DEALLOCATION ABORTED

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
SEE USER DOCUMENTATION FOR CPU GARD

Detail Data
DEALLOCATION ABORTED CAUSE
0000 0002
DEALLOCATION ABORTED DATA
0000 0000 0000 4F4A
In this example, the deallocation for proc19 failed. The reason code 2 indicates thread(s) were bound to the last logical processor and did not unbind after receiving the SIGCPUFAIL signal. The DEALLOCATION ABORTED DATA shows that these threads belonged to process 0x4F4A.
Options of the ps command (-o THREAD, -o BND) allow you to list all threads or processes along with the number of the CPU they are bound to, when applicable.
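Because ps reports process IDs in decimal, it can help to convert the hexadecimal data field first. A minimal sketch, assuming the field holds a single PID as in Example 2; the function name `decode_pid` is hypothetical.

```python
def decode_pid(data_field):
    """Convert a data field such as '0000 0000 0000 4F4A' to a decimal PID."""
    return int(data_field.replace(" ", ""), 16)


pid = decode_pid("0000 0000 0000 4F4A")
print(pid)  # 20298
```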
Example 3:
LABEL:           CPU_DEALLOC_ABORTED
IDENTIFIER:      8470267F

Date/Time:       Thu Sep 30 14:37:34
Sequence Number: 106
Machine Id:      00002F0E4C00
Node Id:         auntbea
Class:           S
Type:            TEMP
Resource Name:   proc2

Description
CPU DEALLOCATION ABORTED

Probable Causes
SOFTWARE PROGRAM

Failure Causes
SOFTWARE PROGRAM

Recommended Actions
MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
SEE USER DOCUMENTATION FOR CPU GARD

Detail Data
DEALLOCATION ABORTED CAUSE
0000 0004
DEALLOCATION ABORTED DATA
0000 0000 0000 0000
In this example, the deallocation of proc2 failed because there were two or fewer active processors at the time of failure (reason code 4).