HACMP DEADMAN SWITCH TIMEOUT CONDITIONS

ITEM: RTA000065193



An SE is having problems at a customer site here in Japan that
runs HACMP V2.1.
The system consists of two RS/6000 590s with two 9333-501s as
shared disks, and the HACMP configuration is idle-standby.
The problem was that the server node crashed and a takeover
occurred once or twice a week as CPU utilization increased.
They investigated the system dump and concluded that the HACMP
deadman switch caused the crash.
                                                                                
They opened a PMR (326X6611760) and, on the advice given in the
PMR, changed the "cycles_to_fail" parameter of clstrmgr from 4
(the default value) to 12. Since then, no crash has occurred.
I also suggested that they tune the I/O pacing parameters in
addition to the "cycles_to_fail" parameter.
                                                                                
The customer, however, is worried that the crash might occur
again as the CPU load increases further toward the end of the
year.
Could you clarify the following points to address their concerns?
                                                                                
Q1. Could you explain what the "kernel lock" means?
    In the README file of PTF U432018, the chapter "AIX Kernel Lock
    when Using HACMP/6000" lists several causes that can make the
    deadman switch halt the system. In that chapter, the second
    paragraph says that "This is due to the 3.2.x AIX kernel being              
    built in a way that causes events to be threaded through a                  
    single kernel lock. During I/O intensive operations (df, find,              
    etc.), the cluster manager process may be required to wait too              
    long for a chance at the kernel lock."                                     
    Does clstrmgr use a lock function to obtain AIX kernel
    services? Or does it just mean that processes are serialized
    when accessing the I/O queue?
Q2. I understand that tuning I/O pacing and "cycles_to_fail"
    has to be done by trial and error.
    But are there any performance parameters or data that can be
    monitored to anticipate a deadman switch timeout?
    The customer wants to know how CPU utilization or I/O
    utilization relates to the probability of a dms timeout,
    so that they can estimate such parameters or data and set
    the parameters appropriately before a mistaken node takeover
    is caused by a dms timeout.
                                                                                
                                                                               
ANSWER                                                                          
                                                                                
1. AIX Kernel Lock and the Cluster Manager                                      
The kernel lock is a lock used by kernel processes to serialize access to
global data structures. Only one process can hold the kernel lock at a
time. The HACMP cluster manager needs to obtain the kernel lock on each
keepalive (KA) cycle, because it does an "ioctl" to check the status of
each of its adapters, and, in the case of "serial" non-TCP/IP networks
like RS232 and SCSI Target Mode, it needs the kernel lock in order to send
the KA packet itself. The I/O intensive operations mentioned in your
question also use the kernel lock and can delay the cluster manager in
obtaining it, especially when they get paged out while still holding the
lock.
                                                                                
Because the AIX 3.2.5 kernel is single-threaded, there is only one kernel
lock, and only one process at a time can hold it. AIX 4.1, being
multi-threaded, will allow finer granularity of kernel locking, but it is
too early, and not appropriate, to speculate right now on what effect that
will have on HACMP and a cluster manager running on AIX 4.1.
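
As a rough illustration of why a single kernel lock matters, the following
toy model (not HACMP or AIX code) estimates how long a requester such as
the cluster manager would wait if it queued behind other lock holders. The
FIFO ordering, the function names, the hold times, and the 1-second cycle
length are all assumptions made for the example.

```python
def wait_for_single_lock(hold_times_ahead):
    """Seconds a requester waits when every earlier holder must finish first.

    Toy model: one global lock, one holder at a time, FIFO order (assumed).
    hold_times_ahead lists how long each earlier process keeps the lock.
    """
    return sum(hold_times_ahead)


def misses_cycle(hold_times_ahead, cycle_seconds):
    """True if the wait alone is longer than one keepalive cycle."""
    return wait_for_single_lock(hold_times_ahead) > cycle_seconds


# Hypothetical numbers: three short holders, versus a queue in which one
# holder was paged out while still holding the lock.
light_load = [0.01, 0.02, 0.01]
paged_out_holder = [0.01, 2.5, 0.02]

print(misses_cycle(light_load, cycle_seconds=1.0))
print(misses_cycle(paged_out_holder, cycle_seconds=1.0))
```

In this model a single long hold is enough to push the wait past a whole
cycle, no matter how short the other holders are.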
                                                                                
2. Any predictive data or parameters for the deadman switch?                    
Unfortunately, after discussing this with defect support and development,
I have to say that there really is no "threshold" that I can give you. In
some cases, for instance, a cron job doing a backup with the find command,
at night with no users on the system, caused the deadman switch to go off
before the changes recommended in your PMR and in the PTF notes for
U432018 and U431606 were implemented. Here again is a list of the changes
made in PTF U432018 (also included in follow-on superseding PTFs), and of
recommendations to the user, to avoid a situation where the deadman switch
times out:
 1. The cluster manager is now pinned in memory by default, which
    prevents it from being paged out. No action is required beyond
    applying PTF U432018 or a superseding PTF.
 2. Application server processes are no longer fixed priority. This
    addresses the previous potential for a fixed-priority, CPU-heavy
    application server process to keep the cluster manager from getting
    cycles. The application server processes are now subject to the
    priority adjustment of the scheduler, and to the "nice" command if
    desired. No user action is required beyond installing PTF U432018
    or a superseding PTF.
 3. Change the interval for the syncd daemon from its default of 60
    seconds to 10 seconds. This flushes the I/O buffers more often
    (each flush uses the kernel lock), so each flush takes less time.
    Change this in the /sbin/rc.boot script, and reboot the system.
    This change is highly recommended.
 4. Implement I/O pacing as described in the documentation and in the
    PTF notes. This also prevents long I/O queues from blocking out the
    cluster manager. Use "smit chssys" to make the change, and reboot
    the system. A high water mark of 33 and a low water mark of 24 are
    recommended as a starting point.
 5. Adjustments to "cycles_to_fail" and "keepalive cycle" parameters, as        
    described in the PTF notes, and as implemented by you in your               
    situation. Adjustments to these do affect the error detection time,         
    but are necessary in some cases to avoid the deadman switch timing          
    out.                                                                        
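
To make the trade-off in item 5 concrete, here is a small sketch of the
arithmetic. It assumes that the failure detection window is roughly the
keepalive cycle length multiplied by "cycles_to_fail". The function name
and the 1-second cycle length are illustrative assumptions, not values
from the PTF notes; only the values 4 and 12 come from the question.

```python
def detection_window(cycles_to_fail, keepalive_seconds):
    """Approximate seconds of silence tolerated before a failure is declared.

    Assumption: window = cycles_to_fail * keepalive cycle length.
    """
    if cycles_to_fail < 1 or keepalive_seconds <= 0:
        raise ValueError("cycles_to_fail must be >= 1 and the cycle length > 0")
    return cycles_to_fail * keepalive_seconds


KEEPALIVE_SECONDS = 1.0  # hypothetical cycle length, for illustration only

default_window = detection_window(4, KEEPALIVE_SECONDS)   # default in the question
tuned_window = detection_window(12, KEEPALIVE_SECONDS)    # value set via the PMR

print(default_window, tuned_window, tuned_window / default_window)
```

Under this assumption, going from 4 to 12 triples the window whatever the
real cycle length is: the cluster manager can be delayed three times
longer before a timeout, and a genuine failure takes three times longer
to detect.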
                                                                                
I am sorry I cannot give you any more specific answers, but I hope this         
information helps your and your customer's understanding.                       
                                                                                
S e a r c h - k e y w o r d s:                                                  
HACMP DEADMAN SWITCH DMS TIMEOUT                                               


WWQA: ITEM: RTA000065193
Dated: 09/1995 Category: ITSAIHA6000