Starting with machine type 7044 model 270, the hardware of all systems with more than two processors is able to detect correctable errors, which are gathered by the firmware. These errors are not fatal and, as long as they remain rare occurrences, can be safely ignored. However, when a pattern of failures seems to be developing on a specific processor, this pattern might indicate that the component is likely to exhibit a fatal failure in the near future. The firmware makes this prediction based on failure rates and threshold analysis.
This operating system, on these systems, implements continuous hardware surveillance and regularly polls the firmware for hardware errors. When the number of processor errors hits a threshold and the firmware recognizes that there is a distinct probability that this system component will fail, the firmware returns an error report. In all cases, the error is logged in the system error log. In addition, on multiprocessor systems, depending on the type of failure, this operating system attempts to stop using the untrustworthy processor and deallocate it. This feature is called Dynamic Processor Deallocation.
At this point, the firmware also flags the processor for persistent deallocation on subsequent reboots, until maintenance personnel replace the processor.
This processor deallocation is transparent to the vast majority of applications, including drivers and kernel extensions. However, you can use the published interfaces to determine whether an application or kernel extension is running on a multiprocessor machine, find out how many processors there are, and bind threads to specific processors.
The interface for binding processes or threads to processors uses logical CPU numbers. The logical CPU numbers are in the range [0..N-1], where N is the total number of CPUs. To avoid breaking applications or kernel extensions that assume no "holes" in the CPU numbering, this operating system always makes it appear to applications as if the "last" (highest-numbered) logical CPU is the one being deallocated. For instance, on an 8-way SMP, the logical CPU numbers are [0..7]. If one processor is deallocated, the total number of available CPUs becomes 7, and they are numbered [0..6]. Externally, it looks like CPU 7 has disappeared, regardless of which physical processor failed. In the rest of this description, the term CPU is used for the logical entity and the term processor for the physical entity.
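The renumbering rule can be illustrated with a small model. This is a sketch only: the text above says what applications observe, not how the kernel remaps processors internally, so the swap strategy below is an assumption made for illustration and `deallocate` is a hypothetical name.

```python
def deallocate(logical_to_physical: list[str], failed: str) -> list[str]:
    """Return the logical-to-physical map after `failed` is deallocated.

    Hypothetical model: the processor that held the highest logical number
    takes over the failed processor's logical number, so the numbering stays
    dense and only the last logical CPU disappears.
    """
    cpus = list(logical_to_physical)   # copy; do not mutate the caller's list
    slot = cpus.index(failed)          # logical number of the failing processor
    last = cpus.pop()                  # the highest logical CPU always goes away
    if slot < len(cpus):               # failing processor was not the last one
        cpus[slot] = last              # survivor inherits the vacated number
    return cpus


before = [f"proc{n}" for n in range(8)]   # 8-way SMP, logical CPUs 0..7
after = deallocate(before, "proc3")
print(list(range(len(after))))            # logical CPU numbers still visible
print(after)                              # physical processors behind them
```

Whichever physical processor fails, the caller ends up with logical CPUs 0 through 6, which is the property that keeps code assuming a gap-free numbering working.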
Applications or kernel extensions that bind processes or threads could be broken if this operating system silently terminated their bound threads or forcefully moved them to another CPU when one of the processors needs to be deallocated. Dynamic Processor Deallocation provides programming interfaces so that those applications and kernel extensions can be notified that a processor deallocation is about to happen. When these applications and kernel extensions get this notification, they are responsible for moving their bound threads and associated resources (such as timer request blocks) away from the last logical CPU and adapting themselves to the new CPU configuration.
If, after notification of applications and kernel extensions, some of the threads are still bound to the last logical CPU, the deallocation is aborted. In this case, the abort is recorded in the error log and the operating system continues using the ailing processor. When the processor ultimately fails, it causes a total system failure. Thus, it is important for applications or kernel extensions that bind threads to CPUs to receive the notification of an impending processor deallocation and to act on it.
Even in the rare cases where the deallocation cannot go through, Dynamic Processor Deallocation still gives advance warning to system administrators. By recording the error in the error log, it gives them a chance to schedule a maintenance operation on the system to replace the ailing component before a global system failure occurs.
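The abort condition can be expressed as a simple check. The sketch below is a model, not kernel code: the `Thread` record and both function names are hypothetical, and the only rule it encodes is the one stated above, namely that any thread still bound to the last logical CPU blocks the deallocation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thread:
    pid: int
    tid: int
    bound_cpu: Optional[int]   # logical CPU number, or None if unbound


def offending_pids(threads: list[Thread], n_cpus: int) -> list[int]:
    """PIDs that still own a thread bound to the last logical CPU."""
    last_cpu = n_cpus - 1
    return sorted({t.pid for t in threads if t.bound_cpu == last_cpu})


def can_deallocate(threads: list[Thread], n_cpus: int) -> bool:
    """True when no thread remains bound to the last logical CPU."""
    return not offending_pids(threads, n_cpus)


threads = [
    Thread(pid=100, tid=1, bound_cpu=None),   # unbound thread: never a problem
    Thread(pid=200, tid=2, bound_cpu=3),      # bound, but not to the last CPU
    Thread(pid=300, tid=3, bound_cpu=7),      # still bound to logical CPU 7
]
print(can_deallocate(threads, n_cpus=8))      # False
print(offending_pids(threads, n_cpus=8))      # [300]
```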
The typical flow of events for processor deallocation is as follows:
In case of failure at any point of the deallocation, the failure is logged with the reason why the deallocation was aborted. The system administrator can look at the error log, take corrective action (when possible) and restart the deallocation. For instance, if the deallocation was aborted because at least one application did not unbind its bound threads, the system administrator could stop the application(s), restart the deallocation (which should go through this time) and restart the application.
Dynamic Processor Deallocation can be enabled or disabled by changing the value of the cpuguard attribute of the ODM object sys0. The possible values for the attribute are enable and disable.
By default, Dynamic Processor Deallocation is disabled (the cpuguard attribute has a value of disable). System administrators who want to take advantage of this feature must enable it using the Web-based System Manager system menus, the SMIT System Environments menu, or the chdev command.
Note: If processor deallocation is turned off, errors are still reported in the error log. You still see the entry indicating that the operating system was notified of a problem with a CPU (CPU_FAIL_PREDICTED; see the error log entry formats later in this section).
Sometimes the processor deallocation fails because an application did not move its bound threads away from the last logical CPU. Once this problem has been fixed, by either unbinding (when it is safe to do so) or stopping the application, the system administrator can restart the processor deallocation process using the ha_star command.
The syntax for this command is:
ha_star -C
where -C is for a CPU predictive failure event.
Physical processors are represented in the ODM database by objects named procn, where n is the physical processor number (a decimal number). Like any other "device" represented in the ODM database, processor objects have a state (Defined/Available) and attributes.
The state of a proc object is always Available as long as the corresponding processor is present, regardless of whether it is usable. The state attribute of a proc object indicates if the processor is used and, if not, the reason. This attribute can have three values:
enable | The processor is used. |
disable | The processor has been dynamically deallocated. |
faulty | The processor was declared defective by the firmware at startup time. |
In the case of CPU errors, if a processor for which the firmware reports a predictive failure is successfully deallocated, its state goes from enable to disable. Independently of this operating system, this processor is also flagged as defective in the firmware. Upon reboot, it is not available and its state is set to faulty. However, the ODM proc object is still marked Available. Only if the defective CPU is physically removed from the system board or CPU board (if that is possible at all) is the proc object changed to Defined.
Examples:
Processor proc4 is working correctly and used by the operating system:
    # lsattr -EH -l proc4
    attribute value            description     user_settable
    state     enable           Processor state False
    type      PowerPC_RS64-III Processor type  False
    #
Processor proc4 gets a predictive failure and gets deallocated by the operating system:
    # lsattr -EH -l proc4
    attribute value            description     user_settable
    state     disable          Processor state False
    type      PowerPC_RS64-III Processor type  False
    #
At the next system restart, processor proc4 is reported by firmware as defective and not available to the operating system:
    # lsattr -EH -l proc4
    attribute value            description     user_settable
    state     faulty           Processor state False
    type      PowerPC_RS64-III Processor type  False
    #
But in all three cases, the status of processor proc4 is Available:
    # lsdev -CH -l proc4
    name  status    location description
    proc4 Available 00-04    Processor
    #
The following are examples, with descriptions, of error log entries:
    # errpt
    IDENTIFIER TIMESTAMP  T C RESOURCE_NAME DESCRIPTION
    804E987A   1008161399 I O proc4         CPU DEALLOCATED
    8470267F   1008161299 T S proc4         CPU DEALLOCATION ABORTED
    1B963892   1008160299 P H proc4         CPU FAILURE PREDICTED
    #
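When scanning many such summary rows, it can help to split them into named fields. The following is a minimal sketch: `ErrptRow` and `parse_errpt_row` are hypothetical names, and reading the TIMESTAMP column as MMDDhhmmyy is an assumption about the errpt summary layout that should be checked against the errpt documentation for your release.

```python
from datetime import datetime
from typing import NamedTuple


class ErrptRow(NamedTuple):
    identifier: str
    timestamp: datetime
    type: str          # T column, for example I, T, or P
    err_class: str     # C column, for example O, S, or H
    resource: str
    description: str


def parse_errpt_row(line: str) -> ErrptRow:
    """Split one summary row; the description may itself contain spaces."""
    ident, stamp, typ, cls, resource, description = line.split(maxsplit=5)
    # Assumed layout: month, day, hour, minute, two-digit year.
    when = datetime.strptime(stamp, "%m%d%H%M%y")
    return ErrptRow(ident, when, typ, cls, resource, description.strip())


row = parse_errpt_row("8470267F   1008161299 T S proc4         CPU DEALLOCATION ABORTED")
print(row.resource, row.description, row.timestamp.isoformat())
```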
Error Description: Predictive Processor Failure
This error indicates that the hardware detected that a processor has a high probability of failing in the near future. It is always logged, whether or not processor deallocation is enabled.
DETAIL DATA: Physical processor number, location
Example: error log entry - long form
    LABEL:           CPU_FAIL_PREDICTED
    IDENTIFIER:      1655419A
    Date/Time:       Thu Sep 30 13:42:11
    Sequence Number: 53
    Machine Id:      00002F0E4C00
    Node Id:         auntbea
    Class:           H
    Type:            PEND
    Resource Name:   proc25
    Resource Class:  processor
    Resource Type:   proc_rspc
    Location:        00-25

    Description
    CPU FAILURE PREDICTED

    Probable Causes
    CPU FAILURE

    Failure Causes
    CPU FAILURE

    Recommended Actions
    ENSURE CPU GARD MODE IS ENABLED
    RUN SYSTEM DIAGNOSTICS.

    Detail Data
    PROBLEM DATA
    0144 1000 0000 003A 8E00 9100 1842 1100 1999 0930 4019 0000 0000 0000 0000 0000
    0000 0000 0000 0000 0000 0000 0000 0000 4942 4D00 5531 2E31 2D50 312D 4332 0000
    0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
    0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
    ... ... ... ... ...
Error Description: A processor has been successfully deallocated upon detection of a predictive processor failure.
This message is logged when processor deallocation is enabled, and when the CPU has been successfully deallocated.
DETAIL DATA: Logical CPU number of deallocated processor.
Example: error log entry - long form:
    LABEL:           CPU_DEALLOC_SUCCESS
    IDENTIFIER:      804E987A
    Date/Time:       Thu Sep 30 13:44:13
    Sequence Number: 63
    Machine Id:      00002F0E4C00
    Node Id:         auntbea
    Class:           O
    Type:            INFO
    Resource Name:   proc24

    Description
    CPU DEALLOCATED

    Recommended Actions
    MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE

    Detail Data
    LOGICAL DEALLOCATED CPU NUMBER
    0
The preceding example shows that proc24 was successfully deallocated and was logical CPU 0 when the failure occurred.
Error Description: A processor deallocation, due to a predictive processor failure, was not successful.
This message is logged when CPU deallocation is enabled, and when the CPU has not been successfully deallocated.
DETAIL DATA: Reason code, logical CPU number, additional information depending on the type of failure.
The reason code is a numeric hexadecimal value. The possible reason codes are:
2 | One or more processes/threads remain bound to the last logical CPU. In this case, the detailed data give the PIDs of the offending processes. |
3 | A registered driver or kernel extension returned an error when notified. In this case, the detailed data field contains the name of the offending driver or kernel extension (ASCII encoded). |
4 | Deallocating a processor would leave the machine with fewer than two available CPUs. This operating system does not deallocate more than N-2 processors on an N-way machine, to avoid confusing applications or kernel extensions that use the total number of available processors to determine whether they are running on a uniprocessor (UP) system, where it is safe to skip the use of multiprocessor locks, or on a symmetric multiprocessor (SMP) system. |
200 (0xC8) | Processor deallocation is disabled (the ODM attribute cpuguard has a value of disable). You normally do not see this error unless you start ha_star manually. |
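The reason code table can be turned into a small lookup for scripts that post-process error log entries. This is an illustrative sketch: the function names are hypothetical, the descriptions are abbreviated from the table above, and the input is assumed to be the DEALLOCATION ABORTED CAUSE field exactly as printed in the long-form entry (for example `0000 0003`).

```python
# Reason codes from the table above; any other value is reported as unknown.
REASONS = {
    0x2: "threads remain bound to the last logical CPU",
    0x3: "a registered driver or kernel extension returned an error",
    0x4: "deallocation would leave fewer than two available CPUs",
    0xC8: "processor deallocation is disabled (cpuguard=disable)",
}


def parse_cause(cause_field: str) -> int:
    """Convert a cause field such as '0000 0003' to an integer."""
    return int(cause_field.replace(" ", ""), 16)


def describe_cause(cause_field: str) -> str:
    code = parse_cause(cause_field)
    return REASONS.get(code, f"unknown reason code 0x{code:X}")


def leaves_enough_cpus(available_cpus: int) -> bool:
    """Rule behind reason code 4: at least two CPUs must remain afterwards."""
    return available_cpus - 1 >= 2


print(describe_cause("0000 0003"))
print(leaves_enough_cpus(2))   # False: only one CPU would remain
```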
Examples: error log entries - long format
Example 1:
    LABEL:           CPU_DEALLOC_ABORTED
    IDENTIFIER:      8470267F
    Date/Time:       Thu Sep 30 13:41:10
    Sequence Number: 50
    Machine Id:      00002F0E4C00
    Node Id:         auntbea
    Class:           S
    Type:            TEMP
    Resource Name:   proc26

    Description
    CPU DEALLOCATION ABORTED

    Probable Causes
    SOFTWARE PROGRAM

    Failure Causes
    SOFTWARE PROGRAM

    Recommended Actions
    MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
    SEE USER DOCUMENTATION FOR CPU GARD

    Detail Data
    DEALLOCATION ABORTED CAUSE
    0000 0003
    DEALLOCATION ABORTED DATA
    6676 6861 6568 3200
The preceding example shows that the deallocation for proc26 failed. The reason code 3 means that a kernel extension returned an error to the kernel notification routine. The DEALLOCATION ABORTED DATA above spells fvhaeh2, which is the name the extension used when registering with the kernel.
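The name can be recovered from the hexadecimal data mechanically. A minimal sketch, assuming the field holds a NUL-terminated ASCII string as in Example 1 (`decode_name` is a hypothetical helper, not part of any system tool):

```python
def decode_name(data_field: str) -> str:
    """Decode a DEALLOCATION ABORTED DATA field that holds an ASCII name."""
    raw = bytes.fromhex(data_field)               # fromhex skips the spaces
    return raw.split(b"\x00", 1)[0].decode("ascii")   # stop at the first NUL


print(decode_name("6676 6861 6568 3200"))   # fvhaeh2
```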
Example 2:
    LABEL:           CPU_DEALLOC_ABORTED
    IDENTIFIER:      8470267F
    Date/Time:       Thu Sep 30 14:00:22
    Sequence Number: 71
    Machine Id:      00002F0E4C00
    Node Id:         auntbea
    Class:           S
    Type:            TEMP
    Resource Name:   proc19

    Description
    CPU DEALLOCATION ABORTED

    Probable Causes
    SOFTWARE PROGRAM

    Failure Causes
    SOFTWARE PROGRAM

    Recommended Actions
    MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE; SEE USER DOCUMENTATION FOR CPU GARD

    Detail Data
    DEALLOCATION ABORTED CAUSE
    0000 0002
    DEALLOCATION ABORTED DATA
    0000 0000 0000 4F4A
The preceding example shows that the deallocation for proc19 failed. The reason code 2 indicates thread(s) were bound to the last logical processor and did not unbind upon receiving the SIGCPUFAIL signal. The DEALLOCATION ABORTED DATA shows that these threads belonged to process 0x4F4A.
Options of the ps command (-o THREAD, -o BND) allow listings of all threads or processes, with the number of the CPU they are bound to when applicable.
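Because ps lists process IDs in decimal, it is convenient to convert the hexadecimal data from Example 2 before searching for the process. The sketch below assumes the field holds a single big-endian PID, as in that example; the layout when several processes are reported is not covered here, and `decode_pid` is a hypothetical helper.

```python
def decode_pid(data_field: str) -> int:
    """Decode a DEALLOCATION ABORTED DATA field that holds one PID."""
    return int.from_bytes(bytes.fromhex(data_field), byteorder="big")


pid = decode_pid("0000 0000 0000 4F4A")
print(pid, hex(pid))   # 20298 0x4f4a
```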
Example 3:
    LABEL:           CPU_DEALLOC_ABORTED
    IDENTIFIER:      8470267F
    Date/Time:       Thu Sep 30 14:37:34
    Sequence Number: 106
    Machine Id:      00002F0E4C00
    Node Id:         auntbea
    Class:           S
    Type:            TEMP
    Resource Name:   proc2

    Description
    CPU DEALLOCATION ABORTED

    Probable Causes
    SOFTWARE PROGRAM

    Failure Causes
    SOFTWARE PROGRAM

    Recommended Actions
    MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
    SEE USER DOCUMENTATION FOR CPU GARD

    Detail Data
    DEALLOCATION ABORTED CAUSE
    0000 0004
    DEALLOCATION ABORTED DATA
    0000 0000 0000 0000
The preceding example shows that the deallocation of proc2 failed because there were two or fewer active processors at the time of failure (reason code 4).