System Management Concepts: Operating System and Devices

Enabling Dynamic Processor Deallocation

Starting with machine type 7044 model 270, the hardware of all systems with more than two processors is able to detect correctable errors, which are gathered by the firmware. These errors are not fatal and, as long as they remain rare occurrences, can be safely ignored. However, when a pattern of failures seems to be developing on a specific processor, this pattern might indicate that the component is likely to exhibit a fatal failure in the near future. This prediction is made by the firmware based on failure rates and threshold analysis.

This operating system, on these systems, implements continuous hardware surveillance and regularly polls the firmware for hardware errors. When the number of processor errors hits a threshold and the firmware recognizes that there is a distinct probability that this system component will fail, the firmware returns an error report. In all cases, the error is logged in the system error log. In addition, on multiprocessor systems, depending on the type of failure, this operating system attempts to stop using the untrustworthy processor and deallocate it. This feature is called Dynamic Processor Deallocation.

At this point, the processor is also flagged by the firmware for persistent deallocation for subsequent reboots, until maintenance personnel replace the processor.

Potential Impact to Applications

This processor deallocation is transparent for the vast majority of applications, including drivers and kernel extensions. However, you can use the published interfaces to determine whether an application or kernel extension is running on a multiprocessor machine, find out how many processors there are, and bind threads to specific processors.

The interface for binding processes or threads to processors uses logical CPU numbers. The logical CPU numbers are in the range [0..N-1] where N is the total number of CPUs. To avoid breaking applications or kernel extensions that assume no "holes" in the CPU numbering, this operating system always makes it appear to applications as if the "last" (highest-numbered) logical CPU is the one being deallocated. For instance, on an 8-way SMP, the logical CPU numbers are [0..7]. If one processor is deallocated, the total number of available CPUs becomes 7, and they are numbered [0..6]. Externally, it looks like CPU 7 has disappeared, regardless of which physical processor failed. In the rest of this description, the term CPU is used for the logical entity and the term processor for the physical entity.
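The numbering rule can be illustrated with a minimal sketch. The function name and its parameters are hypothetical and are not part of any operating system interface; the sketch only models the externally visible result described above.

```python
def logical_cpus_after_deallocation(total_cpus, failed_physical):
    """Model the logical CPU numbers visible after one processor is deallocated.

    failed_physical is accepted only to show that it has no effect on the
    result: the highest logical number is always the one that disappears.
    """
    if total_cpus < 1:
        raise ValueError("total_cpus must be at least 1")
    if not 0 <= failed_physical < total_cpus:
        raise ValueError("failed_physical must be in [0, total_cpus)")
    # The numbering stays contiguous, so there are no "holes".
    return list(range(total_cpus - 1))


if __name__ == "__main__":
    # 8-way SMP: whichever physical processor fails, CPUs 0..6 remain.
    print(logical_cpus_after_deallocation(8, 3))
    print(logical_cpus_after_deallocation(8, 0))
```

Both calls print the same list, which is the point of the rule: an application that iterates over [0..N-1] never needs to skip a number.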

Applications or kernel extensions using process or thread binding could potentially be broken if this operating system silently terminated their bound threads or forcefully moved them to another CPU when one of the processors needs to be deallocated. Dynamic Processor Deallocation provides programming interfaces so that those applications and kernel extensions can be notified that a processor deallocation is about to happen. When these applications and kernel extensions get this notification, they are responsible for moving their bound threads and associated resources (such as timer request blocks) away from the last logical CPU and adapting themselves to the new CPU configuration.

If, after notification of applications and kernel extensions, some of the threads are still bound to the last logical CPU, the deallocation is aborted. In this case, the abort is recorded in the error log and the operating system continues using the ailing processor. When the processor ultimately fails, it causes a total system failure. Thus, it is important for applications or kernel extensions that bind threads to CPUs to get the notification of an impending processor deallocation and to act on this notice.

Even in the rare cases where the deallocation cannot go through, Dynamic Processor Deallocation still gives advance warning to system administrators. By recording the error in the error log, it gives them a chance to schedule a maintenance operation on the system to replace the ailing component before a global system failure occurs.

Processor Deallocation:

The typical flow of events for processor deallocation is as follows:

The firmware detects that a recoverable error threshold has been reached by one of the processors.
The firmware error report is logged in the system error log and, when running on a machine that supports processor deallocation, the operating system starts the deallocation process.
This operating system notifies non-kernel processes and threads bound to the last logical CPU.
This operating system waits for all the bound threads to move away from the last logical CPU. If threads remain bound, the operating system eventually times out (after ten minutes) and aborts the deallocation.
Otherwise, the previously registered High Availability Event Handlers (HAEHs) are invoked. An HAEH might return an error that aborts the deallocation.
Otherwise, the deallocation process continues and ultimately stops the failing processor.
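The decision points in this flow can be sketched as a small model. This is not the kernel's implementation: the function name, its parameters, and the returned strings are hypothetical, and only the two abort conditions listed in the steps above are modeled.

```python
def run_deallocation(still_bound_pids, handler_results):
    """Walk the documented steps and return (outcome, detail).

    still_bound_pids: PIDs with threads still bound to the last logical CPU
        once the wait has timed out (empty if every thread moved away).
    handler_results: ordered (handler_name, succeeded) pairs, one per
        registered High Availability Event Handler.
    """
    # Bound threads that never moved abort the deallocation.
    if still_bound_pids:
        pids = ", ".join(str(pid) for pid in still_bound_pids)
        return "ABORTED", "threads still bound after timeout, PIDs: " + pids
    # The handlers are invoked next; the first error aborts the deallocation.
    for name, succeeded in handler_results:
        if not succeeded:
            return "ABORTED", "handler returned an error: " + name
    # Nothing objected, so the failing processor is stopped.
    return "DEALLOCATED", "failing processor stopped"


if __name__ == "__main__":
    print(run_deallocation([], [("example_haeh", True)]))
    print(run_deallocation([20298], []))
```

In both abort cases the model returns without stopping the processor, which matches the behavior described earlier: the system keeps running on the ailing processor until the cause is fixed and the deallocation is restarted.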

In case of failure at any point of the deallocation, the failure is logged with the reason why the deallocation was aborted. The system administrator can look at the error log, take corrective action (when possible) and restart the deallocation. For instance, if the deallocation was aborted because at least one application did not unbind its bound threads, the system administrator could stop the application(s), restart the deallocation (which should go through this time) and restart the application.

System Administration

Turning Processor Deallocation On and Off

Dynamic Processor Deallocation can be enabled or disabled by changing the value of the cpuguard attribute of the ODM object sys0. The possible values for the attribute are enable and disable.

By default, Dynamic Processor Deallocation is disabled (the attribute cpuguard has a value of disable). System administrators who want to take advantage of this feature must enable it using either the Web-based System Manager system menus, the SMIT System Environments menu, or the chdev command.

Note: If processor deallocation is turned off, the errors are still reported in the error log, and you will see the entry indicating that the operating system was notified of the problem with a CPU (CPU_FAIL_PREDICTED; see the error log entry formats later in this section).
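For the chdev route, the following sketch builds the argument list for the command without running it. It assumes the usual chdev form of -l for the device name and -a for an attribute=value pair; the helper name is hypothetical, and the exact invocation should be checked against the chdev documentation for the installed release.

```python
def cpuguard_command(enable):
    """Build (but do not run) a chdev argument list that sets cpuguard on sys0.

    Assumption: chdev takes -l <device> and -a <attribute>=<value>.
    """
    value = "enable" if enable else "disable"
    return ["chdev", "-l", "sys0", "-a", "cpuguard=" + value]


if __name__ == "__main__":
    # Shows the command line that an administrator would review and run as root.
    print(" ".join(cpuguard_command(True)))
```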

Restarting an Aborted Processor Deallocation

Sometimes the processor deallocation fails because an application did not move its bound threads away from the last logical CPU. Once this problem has been fixed, by either unbinding (when it is safe to do so) or stopping the application, the system administrator can restart the processor deallocation process using the ha_star command.

The syntax for this command is:

	ha_star -C

where -C is for a CPU predictive failure event.

Processor State Considerations

Physical processors are represented in the ODM database by objects named procn, where n is the physical processor number (a decimal number). Like any other "device" represented in the ODM database, processor objects have a state (Defined/Available) and attributes.

The state of a proc object is always Available as long as the corresponding processor is present, regardless of whether it is usable. The state attribute of a proc object indicates if the processor is used and, if not, the reason. This attribute can have three values:

enable	The processor is used.
disable	The processor has been dynamically deallocated.
faulty	The processor was declared defective by the firmware at startup time.

In the case of CPU errors, if a processor for which the firmware reports a predictive failure is successfully deallocated, its state goes from enable to disable. Independently of this operating system, this processor is also flagged as defective in the firmware. Upon reboot, it is not available and has its state set to faulty. The ODM proc object, however, is still marked Available. Only if the defective CPU is physically removed from the system board or CPU board (if that is possible at all) is the proc object changed to Defined.

Examples:

Processor proc4 is working correctly and used by the operating system:

	# lsattr -EH -l proc4
	attribute	value			description		user_settable
 
	state		enable			Processor state		False
	type		PowerPC_RS64-III	Processor type		False
		#

Processor proc4 gets a predictive failure and gets deallocated by the operating system:

	# lsattr -EH -l proc4
	attribute	value			description		user_settable
 
	state		disable			Processor state		False
	type		PowerPC_RS64-III	Processor type		False
		#

At the next system restart, processor proc4 is reported by firmware as defective and not available to the operating system:

	# lsattr -EH -l proc4
	attribute	value			description		user_settable
 
	state		faulty			Processor state		False
	type		PowerPC_RS64-III	Processor type		False
		#

But in all three cases, the status of processor proc4 is Available:

	# lsdev -CH -l proc4
	name		status			location		description
	
	proc4		Available		00-04			Processor
	#
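A monitoring script might read the state attribute from captured lsattr output. The following is a minimal sketch that parses text in the layout shown in the examples above; the function name and the embedded sample are illustrative only.

```python
def proc_state(lsattr_output):
    """Return the value of the 'state' attribute from lsattr -EH output, or None."""
    for line in lsattr_output.splitlines():
        fields = line.split()
        # Attribute rows start with the attribute name followed by its value.
        if len(fields) >= 2 and fields[0] == "state":
            return fields[1]
    return None


# Sample copied from the second example above (proc4 after deallocation).
SAMPLE = """\
attribute  value             description      user_settable

state      disable           Processor state  False
type       PowerPC_RS64-III  Processor type   False
"""

if __name__ == "__main__":
    print(proc_state(SAMPLE))
```

A result of disable or faulty means the processor is present but not used, even though lsdev continues to report the object as Available.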

Error Log Entries

The following are examples, with descriptions, of error log entries:

errpt short format - summary

Three different error log messages are associated with CPU deallocation. The following is an example of entries displayed by the errpt command (without options):

        # errpt
        IDENTIFIER         TIMESTAMP          T      C      RESOURCE_NAME        DESCRIPTION
        804E987A           1008161399         I      O      proc4                CPU DEALLOCATED
        8470267F           1008161299         T      S      proc4                CPU DEALLOCATION ABORTED
        1B963892           1008160299         P      H      proc4                CPU FAILURE PREDICTED
        #

If processor deallocation is enabled, a CPU FAILURE PREDICTED message is always followed by either a CPU DEALLOCATED message or a CPU DEALLOCATION ABORTED message.
If processor deallocation is not enabled, only the CPU FAILURE PREDICTED message is logged. Enabling processor deallocation any time after one or more CPU FAILURE PREDICTED messages have been logged initiates the deallocation process and results in a success or failure error log entry, as described above, for each processor reported failing.
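The pairing rule above can be checked mechanically. The following sketch parses errpt summary text and lists the processors that have a CPU FAILURE PREDICTED entry with no later outcome entry. It assumes that errpt lists the most recent entry first, as in the sample, and that the six columns are separated by white space; the function names are hypothetical.

```python
OUTCOMES = ("CPU DEALLOCATED", "CPU DEALLOCATION ABORTED")


def parse_errpt_summary(text):
    """Return (resource, description) pairs in the order errpt printed them."""
    entries = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        # Skip the header, prompts, and anything that is not a full row.
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue
        entries.append((fields[4], fields[5].strip()))
    return entries


def pending_predictions(entries):
    """Return resources whose predicted failure has no later outcome entry."""
    pending = set()
    # Assumption: errpt prints newest first, so reverse for chronological order.
    for resource, description in reversed(entries):
        if description == "CPU FAILURE PREDICTED":
            pending.add(resource)
        elif description in OUTCOMES:
            pending.discard(resource)
    return sorted(pending)


SAMPLE = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME DESCRIPTION
804E987A   1008161399 I O proc4         CPU DEALLOCATED
8470267F   1008161299 T S proc4         CPU DEALLOCATION ABORTED
1B963892   1008160299 P H proc4         CPU FAILURE PREDICTED
"""

if __name__ == "__main__":
    print(pending_predictions(parse_errpt_summary(SAMPLE)))
```

An empty result for the sample means every predicted failure was followed by an outcome. A non-empty result would suggest that processor deallocation was disabled when the failure was reported.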

errpt long format - detailed description

The following is the form of output obtained with errpt -a:

CPU_FAIL_PREDICTED

Error description: Predictive Processor Failure

This error indicates that the hardware detected that a processor has a high probability of failing in the near future. It is always logged, whether or not processor deallocation is enabled.

DETAIL DATA: Physical processor number, location

Example: error log entry - long form

	LABEL:			CPU_FAIL_PREDICTED
	IDENTIFIER:		1655419A
 
	Date/Time:		Thu Sep 30 13:42:11
	Sequence Number:	53
	Machine Id:		00002F0E4C00
	Node Id:		auntbea
	Class:			H
	Type:			PEND
	Resource Name:		proc25
	Resource Class:		processor
	Resource Type:		proc_rspc
	Location:		00-25
 
	Description
	CPU FAILURE PREDICTED
 
	Probable Causes
	CPU FAILURE
 
	Failure Causes
	CPU FAILURE
 
		Recommended Actions
		ENSURE CPU GARD MODE IS ENABLED
		RUN SYSTEM DIAGNOSTICS.
 
	Detail Data
	PROBLEM DATA
	0144	1000	0000	003A	8E00	9100	1842	1100	1999	0930	4019
	0000	0000	0000	0000	0000
	0000	0000	0000	0000	0000	0000	0000	0000	4942	4D00	5531
	2E31	2D50	312D	4332	0000
	0002	0000	0000	0000	0000	0000	0000	0000	0000	0000	0000
	0000	0000	0000	0000	0000
	0000	0000	0000	0000	0000	0000	0000	0000	0000	0000	0000
	0000	0000	0000	0000	0000
	...	...	...	...	...

CPU_DEALLOC_SUCCESS

Error Description: A processor has been successfully deallocated upon detection of a predictive processor failure.

This message is logged when processor deallocation is enabled, and when the CPU has been successfully deallocated.

DETAIL DATA: Logical CPU number of deallocated processor.

Example: error log entry - long form:
```
	LABEL:			CPU_DEALLOC_SUCCESS
	IDENTIFIER:		804E987A
 
	Date/Time:		Thu Sep 30 13:44:13
	Sequence Number:	63
	Machine Id:		00002F0E4C00
	Node Id:		auntbea
	Class:			O
	Type:			INFO
	Resource Name:		proc24
 
	Description
	CPU DEALLOCATED
 
 
		Recommended Actions
		MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
 
	Detail Data
	LOGICAL DEALLOCATED CPU NUMBER
 
		0
```
The preceding example shows that proc24 was successfully deallocated and was logical CPU 0 when the failure occurred.

CPU_DEALLOC_ABORTED

Error Description: A processor deallocation, due to a predictive processor failure, was not successful.

This message is logged when CPU deallocation is enabled, and when the CPU has not been successfully deallocated.

DETAIL DATA: Reason code, logical CPU number, additional information depending on the type of failure.

The reason code is a numeric hexadecimal value. The possible reason codes are:

2	One or more processes or threads remain bound to the last logical CPU. In this case, the detail data gives the PIDs of the offending processes.
3	A registered driver or kernel extension returned an error when notified. In this case, the detail data field contains the name of the offending driver or kernel extension (ASCII encoded).
4	Deallocating a processor would leave the machine with fewer than two available CPUs. This operating system does not deallocate more than N-2 processors on an N-way machine, to avoid confusing applications or kernel extensions that use the total number of available processors to determine whether they are running on a uniprocessor (UP) system, where it is safe to skip the use of multiprocessor locks, or on a symmetric multiprocessor (SMP) system.
200 (0xC8)	Processor deallocation is disabled (the ODM attribute cpuguard has a value of disable). You normally do not see this error unless you start ha_star manually.
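The reason codes above can be looked up from the DEALLOCATION ABORTED CAUSE field of a long-format entry. The following is a minimal sketch; the table paraphrases the list above, and the function name is hypothetical.

```python
REASONS = {
    0x2: "processes or threads remain bound to the last logical CPU",
    0x3: "a registered driver or kernel extension returned an error",
    0x4: "deallocation would leave fewer than two available CPUs",
    0xC8: "processor deallocation is disabled (cpuguard is set to disable)",
}


def reason_from_cause(cause_field):
    """Turn a cause field such as '0000 0003' into (code, description)."""
    # The field is printed as groups of hexadecimal digits.
    code = int(cause_field.replace(" ", ""), 16)
    return code, REASONS.get(code, "unknown reason code")


if __name__ == "__main__":
    print(reason_from_cause("0000 0003"))
    print(reason_from_cause("0000 00C8"))
```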

Examples: error log entries - long format

Example 1:

	LABEL:			CPU_DEALLOC_ABORTED
	IDENTIFIER:		8470267F
	Date/Time:		Thu Sep 30 13:41:10
	Sequence Number:	50
	Machine Id:		00002F0E4C00
	Node Id:		auntbea
	Class:			S
	Type:			TEMP
	Resource Name:		proc26
 
	Description
	CPU DEALLOCATION ABORTED
 
	Probable Causes
	SOFTWARE PROGRAM
 
	Failure Causes
	SOFTWARE PROGRAM
 
		Recommended Actions
		MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
		SEE USER DOCUMENTATION FOR CPU GARD
 
	Detail Data
	DEALLOCATION ABORTED CAUSE
	0000 0003
	DEALLOCATION ABORTED DATA
	6676 6861 6568 3200

The preceding example shows that the deallocation for proc26 failed. The reason code 3 means that a kernel extension returned an error to the kernel notification routine. The DEALLOCATION ABORTED DATA above spells fvhaeh2, which is the name the extension used when registering with the kernel.
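The ASCII decoding of the data field can be reproduced with a short sketch. It assumes the field is a NUL-terminated ASCII string printed as groups of hexadecimal digits, as in Example 1; the function name is hypothetical.

```python
def decode_ascii_detail(data_field):
    """Decode a DEALLOCATION ABORTED DATA field that holds an ASCII name."""
    raw = bytes.fromhex(data_field.replace(" ", ""))
    # Keep only the bytes before the first NUL terminator.
    return raw.split(b"\x00", 1)[0].decode("ascii")


if __name__ == "__main__":
    print(decode_ascii_detail("6676 6861 6568 3200"))  # prints fvhaeh2
```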

Example 2:

	LABEL:			CPU_DEALLOC_ABORTED
	IDENTIFIER:		8470267F
	Date/Time:		Thu Sep 30 14:00:22
	Sequence Number:	71
	Machine Id:		00002F0E4C00
	Node Id:		auntbea
	Class:			S
	Type:			TEMP
	Resource Name:		proc19
 
	Description
	CPU DEALLOCATION ABORTED
 
	Probable Causes
	SOFTWARE PROGRAM
 
	Failure Causes
	SOFTWARE PROGRAM
 
		Recommended Actions
		MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
		SEE USER DOCUMENTATION FOR CPU GARD
 
	Detail Data
	DEALLOCATION ABORTED CAUSE
	0000 0002
	DEALLOCATION ABORTED DATA
	0000 0000 0000 4F4A

The preceding example shows that the deallocation for proc19 failed. The reason code 2 indicates thread(s) were bound to the last logical processor and did not unbind upon receiving the SIGCPUFAIL signal. The DEALLOCATION ABORTED DATA shows that these threads belonged to process 0x4F4A.

Options of the ps command (-o THREAD, -o BND) allow listings of all threads or processes, with the number of the CPU they are bound to when applicable.
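To match the data field of Example 2 against a ps listing, which shows PIDs in decimal, the hexadecimal value can be converted first. The sketch assumes the field holds a single PID, as in that example; the function name is hypothetical.

```python
def decode_pid_detail(data_field):
    """Convert a DEALLOCATION ABORTED DATA field holding one PID to an integer."""
    return int(data_field.replace(" ", ""), 16)


if __name__ == "__main__":
    # 0x4F4A from Example 2 is PID 20298 in decimal.
    print(decode_pid_detail("0000 0000 0000 4F4A"))
```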

Example 3:

	LABEL:			CPU_DEALLOC_ABORTED
	IDENTIFIER:		8470267F
 
	Date/Time:		Thu Sep 30 14:37:34
	Sequence Number:	106
	Machine Id:		00002F0E4C00
	Node Id:		auntbea
	Class:			S
	Type:			TEMP
	Resource Name:		proc2
 
	Description
	CPU DEALLOCATION ABORTED
 
	Probable Causes
	SOFTWARE PROGRAM
 
	Failure Causes
	SOFTWARE PROGRAM
 
		Recommended Actions
		MAINTENANCE IS REQUIRED BECAUSE OF CPU FAILURE
		SEE USER DOCUMENTATION FOR CPU GARD
 
	Detail Data
	DEALLOCATION ABORTED CAUSE
	0000 0004
	DEALLOCATION ABORTED DATA
	0000 0000 0000 0000

The preceding example shows that the deallocation of proc2 failed because there were two or fewer active processors at the time of failure (reason code 4).