
System Management Concepts: Operating System and Devices


Enabling Dynamic Processor Deallocation

Starting with machine type 7044 model 270, the hardware of all systems with more than two processors is able to detect correctable errors, which are gathered by the firmware. These errors are not fatal and, as long as they remain rare occurrences, can be safely ignored. However, when a pattern of failures seems to be developing on a specific processor, this pattern might indicate that this component is likely to exhibit a fatal failure in the near future. The firmware makes this prediction based on failure rates and threshold analysis.

This operating system, on these systems, implements continuous hardware surveillance and regularly polls the firmware for hardware errors. When the number of processor errors hits a threshold and the firmware recognizes that there is a distinct probability that this system component will fail, the firmware returns an error report. In all cases, the error is logged in the system error log. In addition, on multiprocessor systems, depending on the type of failure, this operating system attempts to stop using the untrustworthy processor and deallocate it. This feature is called Dynamic Processor Deallocation.

At this point, the processor is also flagged by the firmware for persistent deallocation for subsequent reboots, until maintenance personnel replace the processor.

Potential Impact to Applications

This processor deallocation is transparent for the vast majority of applications, including drivers and kernel extensions. However, you can use the published interfaces to determine whether an application or kernel extension is running on a multiprocessor machine, find out how many processors there are, and bind threads to specific processors.

The interface for binding processes or threads to processors uses logical CPU numbers. The logical CPU numbers are in the range [0..N-1] where N is the total number of CPUs. To avoid breaking applications or kernel extensions that assume no "holes" in the CPU numbering, this operating system always makes it appear to applications as if the "last" (highest numbered) logical CPU is the one being deallocated. For instance, on an 8-way SMP, the logical CPU numbers are [0..7]. If one processor is deallocated, the total number of available CPUs becomes 7, and they are numbered [0..6]. Externally, it looks like CPU 7 has disappeared, regardless of which physical processor failed. In the rest of this description, the term CPU is used for the logical entity and the term processor for the physical entity.

Applications or kernel extensions that bind processes or threads could potentially be broken if this operating system silently terminated their bound threads or forcefully moved them to another CPU when one of the processors needs to be deallocated. Dynamic Processor Deallocation provides programming interfaces so that those applications and kernel extensions can be notified that a processor deallocation is about to happen. When these applications and kernel extensions get this notification, they are responsible for moving their bound threads and associated resources (such as timer request blocks) away from the last logical CPU and adapting themselves to the new CPU configuration.

If, after notification of applications and kernel extensions, some of the threads are still bound to the last logical CPU, the deallocation is aborted. In this case, the fact that the deallocation has been aborted is logged in the error log, and this operating system continues using the ailing processor. When the processor ultimately fails, it causes a total system failure. Thus, it is important for applications or kernel extensions that bind threads to CPUs to receive the notification of an impending processor deallocation and to act on it.
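Which threads must move follows directly from the numbering rule: only those bound to logical CPU N-1. The following is a minimal sketch of that check, not a system interface. The function name `must_rebind` and the "thread cpu" input format are hypothetical and chosen only for illustration.

```shell
# must_rebind N
# Reads "thread-name logical-cpu" pairs on standard input (a hypothetical
# format) and prints the threads bound to the last logical CPU, N-1.
# Those are the threads an application would have to move or unbind
# before the deallocation can proceed.
must_rebind() {
    last=$(( $1 - 1 ))
    awk -v last="$last" '$2 == last { print $1 }'
}

# 8-way SMP: logical CPUs are 0..7, so only bindings to CPU 7 are affected.
printf 'worker-a 0\nworker-b 7\nworker-c 3\n' | must_rebind 8
```

Note that the check does not depend on which physical processor is failing, because the operating system always retires the highest-numbered logical CPU.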

Even in the rare cases where the deallocation cannot go through, Dynamic Processor Deallocation still gives advance warning to system administrators. By recording the error in the error log, it gives them a chance to schedule a maintenance operation on the system to replace the ailing component before a global system failure occurs.

Processor Deallocation:

The typical flow of events for processor deallocation is as follows:

  1. The firmware detects that a recoverable error threshold has been reached by one of the processors.
  2. This operating system logs the firmware error report in the system error log and, when executing on a machine supporting processor deallocation, starts the deallocation process.
  3. This operating system notifies non-kernel processes and threads bound to the last logical CPU.
  4. This operating system waits for all the bound threads to move away from the last logical CPU. If threads remain bound, the operating system eventually times out (after ten minutes) and aborts the deallocation.
  5. Otherwise, the previously registered High Availability Event Handlers (HAEHs) are invoked. An HAEH might return an error that aborts the deallocation.
  6. Otherwise, the deallocation process continues and ultimately stops the failing processor.
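
The decision points in steps 4 through 6 can be summarized in a short sketch. It models only the outcome logic described in the list. The function name `dealloc_outcome` and its two arguments are hypothetical, and nothing here calls the operating system.

```shell
# dealloc_outcome BOUND_LEFT HAEH_RC
#   BOUND_LEFT  number of threads still bound to the last logical CPU
#               when the ten-minute wait expires (hypothetical input)
#   HAEH_RC     0 if every registered HAEH accepted the deallocation,
#               non-zero if one returned an error (hypothetical input)
dealloc_outcome() {
    bound_left=$1
    haeh_rc=$2
    if [ "$bound_left" -gt 0 ]; then
        echo "aborted: threads still bound to last logical CPU"   # step 4
    elif [ "$haeh_rc" -ne 0 ]; then
        echo "aborted: HAEH returned an error"                     # step 5
    else
        echo "deallocated"                                         # step 6
    fi
}

dealloc_outcome 2 0
dealloc_outcome 0 1
dealloc_outcome 0 0
```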

In case of failure at any point of the deallocation, the failure is logged with the reason why the deallocation was aborted. The system administrator can look at the error log, take corrective action (when possible) and restart the deallocation. For instance, if the deallocation was aborted because at least one application did not unbind its bound threads, the system administrator could stop the application(s), restart the deallocation (which should go through this time) and restart the application.

System Administration

Turning Processor Deallocation On and Off

Dynamic Processor Deallocation can be enabled or disabled by changing the value of the cpuguard attribute of the ODM object sys0. The possible values for the attribute are enable and disable.

By default, Dynamic Processor Deallocation is disabled (the cpuguard attribute has a value of disable). System administrators who want to take advantage of this feature must enable it using the Web-based System Manager system menus, the SMIT System Environments menu, or the chdev command.
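
As a minimal sketch of the chdev route, the two helper functions below read the current value of cpuguard and enable it only if needed. The function names are hypothetical. The sketch assumes that lsattr accepts -F value to print just the attribute value, and that the commands are run with root authority on a system that supports the feature.

```shell
# Print the current value of the cpuguard attribute of sys0
# (expected to be "enable" or "disable").
cpuguard_status() {
    lsattr -E -l sys0 -a cpuguard -F value
}

# Enable Dynamic Processor Deallocation unless it is already enabled.
enable_cpuguard() {
    if [ "$(cpuguard_status)" = "enable" ]; then
        echo "cpuguard already enabled"
    else
        chdev -l sys0 -a cpuguard=enable
    fi
}
```

Running `enable_cpuguard` followed by `cpuguard_status` should then report enable. Setting the attribute back to disable with chdev turns the feature off again.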

Note: If processor deallocation is turned off, the errors are still reported in the error log and you will see the error indicating that the operating system was notified of the problem with a CPU (CPU_FAILURE_PREDICTED, see the following format).

Restarting an Aborted Processor Deallocation

Sometimes the processor deallocation fails because an application did not move its bound threads away from the last logical CPU. Once this problem has been fixed, by either unbinding (when it is safe to do so) or stopping the application, the system administrator can restart the processor deallocation process using the ha_star command.

The syntax for this command is:

	ha_star -C

where -C is for a CPU predictive failure event.

Processor State Considerations

Physical processors are represented in the ODM database by objects named procn where n is the physical processor number (n is a decimal number). Like any other "device" represented in the ODM database, processor objects have a state (Defined/Available) and attributes.

The state of a proc object is always Available as long as the corresponding processor is present, regardless of whether it is usable. The state attribute of a proc object indicates if the processor is used and, if not, the reason. This attribute can have three values:


enable The processor is used.
disable The processor has been dynamically deallocated.
faulty The processor was declared defective by the firmware at startup time.

In the case of CPU errors, if a processor for which the firmware reports a predictive failure is successfully deallocated, its state goes from enable to disable. Independently of this operating system, this processor is also flagged as defective in the firmware. Upon reboot, it is not available and has its state set to faulty. However, the ODM proc object is still marked Available. The proc object changes to Defined only if the defective CPU is physically removed from the system board or CPU board (where that is possible at all).

Examples:

Processor proc4 is working correctly and used by the operating system:

	# lsattr -EH -l proc4
	attribute	value			description		user_settable
 
	state		enable			Processor state		False
	type		PowerPC_RS64-III	Processor type		False
	#
 

Processor proc4 gets a predictive failure and gets deallocated by the operating system:

	# lsattr -EH -l proc4
	attribute	value			description		user_settable
 
	state		disable			Processor state		False
	type		PowerPC_RS64-III	Processor type		False
	#
 

At the next system restart, processor proc4 is reported by firmware as defective and not available to the operating system:

	# lsattr -EH -l proc4
	attribute	value			description		user_settable
 
	state		faulty			Processor state		False
	type		PowerPC_RS64-III	Processor type		False
	#

But in all three cases, the status of processor proc4 is Available:

	# lsdev -CH -l proc4
	name		status			location		description
	
	proc4		Available		00-04			Processor
	#
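
To survey every processor at once, the two commands above can be combined. This is a sketch under stated assumptions rather than a documented procedure: the function name `proc_states` is hypothetical, and it assumes that processor objects belong to the device class processor and that both lsdev and lsattr accept -F to select a single output column.

```shell
# Print "name status state" for every processor object in the ODM.
# status is the device state (Available/Defined); state is the
# attribute described above (enable, disable, or faulty).
proc_states() {
    lsdev -C -c processor -F name | while read -r p; do
        status=$(lsdev -C -l "$p" -F status)
        state=$(lsattr -E -l "$p" -a state -F value)
        echo "$p $status $state"
    done
}
```

A line whose status is Available but whose state is disable or faulty identifies a processor that is present but not in use.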

Error Log Entries

The following are examples, with descriptions, of error log entries:

errpt short format - summary
Three different error log messages are associated with CPU deallocation. The following is an example of entries displayed by the errpt command (without options):

        # errpt
        IDENTIFIER         TIMESTAMP          T      C      RESOURCE_NAME        DESCRIPTION
        804E987A           1008161399         I      O      proc4                CPU DEALLOCATED
        8470267F           1008161299         T      S      proc4                CPU DEALLOCATION ABORTED
        1B963892           1008160299         P      H      proc4                CPU FAILURE PREDICTED
        #
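
When the error log is long, the three CPU deallocation entries can be picked out by identifier. The sketch below filters short-format output on standard input. The function name `cpu_events` is hypothetical, the identifiers are the ones shown in the example above, and the column layout is assumed to match that example.

```shell
# cpu_events
# Reads errpt short-format output on standard input and prints
# "resource description" for the three CPU deallocation identifiers.
cpu_events() {
    awk '$1 == "804E987A" || $1 == "8470267F" || $1 == "1B963892" {
        # Columns 1-5 are fixed; the description is everything from column 6.
        desc = $6
        for (i = 7; i <= NF; i++) desc = desc " " $i
        print $5, desc
    }'
}

# Demonstration on the sample entries shown above.
cpu_events <<'EOF'
IDENTIFIER         TIMESTAMP          T      C      RESOURCE_NAME        DESCRIPTION
804E987A           1008161399         I      O      proc4                CPU DEALLOCATED
8470267F           1008161299         T      S      proc4                CPU DEALLOCATION ABORTED
1B963892           1008160299         P      H      proc4                CPU FAILURE PREDICTED
EOF
```

On a live system the same filter would be used as `errpt | cpu_events`.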

errpt long format - detailed description
The following is the form of output obtained with errpt -a:

