Error demon dies abnormally

ITEM: RTA000066564



My customer ran into the following problem while testing the error
notification function with HACMP/6000 V1.2 under AIX V3.2.4.
The error demon sometimes (not always) dies about 20 seconds after
one of the Ethernet adapters fails (in the test, they pull the Ethernet
cable out of the adapter) while HACMP/6000 is running.
The system consists of three RS/6000 570s with two Ethernet adapters
each, configured for third-party takeover with HACMP/6000 V1.2.
Usually, ENT_ERR2 and ENT_ERR6 are logged at a high rate (10 per second)
after the Ethernet adapter failure.
The questions are:
Q)  Why does the error demon sometimes die in the above situation?
    Is it caused by the high rate of error logging (ENT_ERR2 and
    ENT_ERR6)?
    Do you have any suggestions to avoid this problem?
                                                                                
The following is the error notification setting and the notification            
method which is used in this test.                                              
                                                                                
  Notification Object Name             ENT_STDBY                                
  Persistence across system restart?   Yes                                      
  Process ID for use by Notify Method  0                                        
  Select Error Class                   none                                     
  Select Error Type                    none                                     
  Match ALERTable errors?              none                                     
  Select Error ID Label                ENT_ERR2                                 
  Notify Method                        /home/sys/check_erren.sl $6              
                                                                                
                                                      
  set -x
  SYSDIR="/home/sys"
  LOGDIR="/home/sys/log"
  PROGNAME=$0
  ERR_ADAPTER=$1

  if [ $# != "1" ]
  then
      $SYSDIR/ha_log 311 "Invalid Parameter" $PROGNAME
         # "ha_log" is a utility to log the message in cluster.log
      exit 1
  fi
  case $ERR_ADAPTER in
   ent[123]) /usr/sbin/cluster/clacdNM -MD -n 'ENT_STDBY'
              # To keep this shell from being kicked off repeatedly,
              # the error template is deleted here
             if [ -f $LOGDIR/$ERR_ADAPTER.flg ]
             then
                 exit 0
             fi
             touch $LOGDIR/$ERR_ADAPTER.flg
             $SYSDIR/ha_log 312 "Adapter Error" $PROGNAME $ERR_ADAPTER
             ;;
   ent0)     ;;
   *)        $SYSDIR/ha_log 311 "Invalid Parameter" $PROGNAME
             exit 1
             ;;
  esac
  exit 1
                                                                                
                                                                                
                                                                               
ANSWER                                                                          
From my research, I don't believe at this time that your problem is
related to the error notification method that you have defined; rather,
it appears to be a condition in the errdemon and the error device driver.
There have been reported situations where a large volume of errors (of
any type) recorded very quickly killed the errdemon, and sometimes even
hung the machine. This usually, although not always, happened when the
error log became full and had to wrap around. The problem was fixed in
APARs IX35555 and IX35564. So the first thing to check is whether these
APARs have been installed on your machine. If they have not been
installed, they should be. If they have been installed, I suggest that
this problem be reported as a defect. When you report it, please provide
the answers to these questions:
                                                                               
1. Does this problem also happen without the error notification method?         
   (I don't believe that your problem is related to this, but it is             
    possible that having the notification method is causing just enough         
    extra load on the errdemon that it is dying when it is overloaded           
   by many errors.)                                                             
2. Is your /var filesystem full?                                                
3. What has been set as the maximum size of your error log file? (You           
   can get this information with "/usr/lib/errdemon -l".)                       
4. Is your error log file at or near its maximum size? (You can get this        
   information with "ls -l /usr/adm/ras/errlog".)                               
5. If the error log file is near its maximum size, and you are able to          
   reproduce the problem consistently, does it still occur after you have       
   used the errclear command to clear out part or all of the error log?         
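
The data for questions 2 to 4 could be gathered in one pass with a small
script along the following lines. This is only a sketch: the commands and
paths are the ones quoted above, the ERRLOG and ERRDEMON variable names and
the function name are made up for illustration, and each step is skipped
when the command or file is not present on the machine.

```shell
#!/bin/sh
# Hypothetical helper: collect the information asked for in questions 2-4.
# ERRLOG and ERRDEMON are illustrative names; override them if your paths differ.
ERRLOG=${ERRLOG:-/usr/adm/ras/errlog}
ERRDEMON=${ERRDEMON:-/usr/lib/errdemon}

collect_errlog_info() {
    echo "== /var filesystem usage =="
    df /var 2>/dev/null || echo "df /var failed"

    echo "== errdemon settings =="
    if [ -x "$ERRDEMON" ]; then
        "$ERRDEMON" -l          # reports the maximum error log size
    else
        echo "$ERRDEMON not found, skipped"
    fi

    echo "== error log file size =="
    if [ -f "$ERRLOG" ]; then
        ls -l "$ERRLOG"
    else
        echo "$ERRLOG not found, skipped"
    fi

    echo "== errdemon process =="
    # The [e] keeps grep from matching its own process entry.
    ps -e 2>/dev/null | grep '[e]rrdemon' || echo "errdemon is not running"
}

collect_errlog_info
```

Running it right after the demon dies, and again after a restart, gives two
snapshots that can be attached to the defect report.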
                                                                                
If, after investigating these questions, you find that the problem might
be related to the error log reaching its maximum size and wrapping, you
can work around it by occasionally running the "errclear" command, by
increasing the maximum size of your error log with
"/usr/lib/errdemon -s NEWSIZE", or both.
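
As a rough illustration of that workaround, the sketch below only prints the
two commands rather than running them, so they can be reviewed first. The
function name, the 7-day retention and the doubled size in the example call
are assumptions chosen for illustration, not recommended values; errclear
takes a number of days and removes entries older than that (0 removes all).

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the workaround commands, do not run them.
# Usage: errlog_workaround_cmds CURRENT_MAX NEW_MAX [DAYS]
errlog_workaround_cmds() {
    cur_max=$1
    new_max=$2
    days=${3:-0}    # errclear removes entries older than DAYS; 0 clears all

    # Refuse a "new" size that would not actually enlarge the log.
    if [ "$new_max" -le "$cur_max" ]; then
        echo "new size $new_max is not larger than current maximum $cur_max" >&2
        return 1
    fi

    echo "errclear $days"
    echo "/usr/lib/errdemon -s $new_max"
}

# Example: keep 7 days of entries and double a 1048576-byte log.
errlog_workaround_cmds 1048576 2097152 7
```

The printed lines can then be run by hand as root, or the errclear line can be
placed in a cron entry if periodic clearing turns out to help.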
                                                                                
If you find that this problem is occurring even if the error log is not
near its maximum size, you should report it as a defect. In any case,
please reopen this question with the answers to the questions above, if
possible, so that they can be put in the library.
                                                                                
                                                                                
QUESTION:                                                                       
The PTFs for IX35555 and IX35564 are already installed on my customer's
system and the problem still occurs. I reported this problem as a
defect to ISC.
I investigated your questions when the problem occurred as follows:             
1. This problem still happened without the error notification method.           
2. /var usage was 26%.                                                          
3. The maximum size of the error log file is 1048576 bytes.
4. The error log size right after the problem was 287728 bytes.
5. N/A                                                                          
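
The figures in answers 3 and 4 can be put in proportion with a one-line
calculation. The function name below is made up for illustration, and the
sketch assumes a shell with $(( )) arithmetic such as ksh.

```shell
#!/bin/sh
# Hypothetical helper: integer percentage of the maximum log size in use.
# Usage: errlog_usage_pct CURRENT_SIZE MAX_SIZE
errlog_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Sizes reported above: 287728 bytes used of 1048576 bytes maximum.
errlog_usage_pct 287728 1048576    # prints 27
```

At about 27% of its maximum size the log was nowhere near wrapping, which is
why question 5 did not apply.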
                                                                                
Search keywords:
ERRDEMON DIES ERRLOG                                                            
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                               


WWQA: ITEM: RTA000066564
Dated: 07/1995 Category: ITSAIHA6000