Error daemon (errdemon) dies abnormally
ITEM: RTA000066564
My customer ran into the following problem while testing the error
notification function with HACMP/6000 V1.2 under AIX V3.2.4.
The error daemon sometimes (not always) dies about 20 seconds after
one of the Ethernet adapters fails (in the test, they pull the Ethernet
cable out of the adapter) while HACMP/6000 is running.
The system consists of three RS/6000 570s with two Ethernet adapters
each, configured as third-party takeover with HACMP/6000 V1.2.
Usually, ENT_ERR2 and ENT_ERR6 are logged at a high rate (10 per second)
after the Ethernet adapter failure.
The questions are:
Q) Why does the error daemon sometimes die in the above situation?
Is it caused by the high rate of error logging (ENT_ERR2 and ENT_ERR6)?
Do you have any suggestions to avoid this problem?
The following are the error notification settings and the notification
method used in this test.
Notification Object Name               ENT_STDBY
Persistence across system restart?     Yes
Process ID for use by Notify Method    0
Select Error Class                     none
Select Error Type                      none
Match ALERTable errors?                none
Select Error ID Label                  ENT_ERR2
Notify Method                          /home/sys/check_erren.sl $6
set -x
SYSDIR="/home/sys"
LOGDIR="/home/sys/log"
PROGNAME=$0
ERR_ADAPTER=$1
if [ $# != "1" ]
then
    # "ha_log" is a utility that logs the message in cluster.log
    $SYSDIR/ha_log 311 "Invalid Parameter" $PROGNAME
    exit 1
fi
case $ERR_ADAPTER in
ent[123]) /usr/sbin/cluster/clacdNM -MD -n 'ENT_STDBY'
    # To keep this script from being kicked off repeatedly, the error
    # template is deleted here
    if [ -f $LOGDIR/$ERR_ADAPTER.flg ]
    then
        exit 0
    fi
    touch $LOGDIR/$ERR_ADAPTER.flg
    $SYSDIR/ha_log 312 "Adapter Error" $PROGNAME $ERR_ADAPTER
    ;;
ent0) ;;
*)  $SYSDIR/ha_log 311 "Invalid Parameter" $PROGNAME
    exit 1
    ;;
esac
exit 1
ANSWER:
From my research, I don't believe at this time that your problem is
related to the error notification method you have defined. Rather, it
appears to be a condition with errdemon and the error device driver.
There have been reported situations where a large volume of errors (of
any type) logged very quickly has killed errdemon, and sometimes even
hung the machine. This problem usually, although not always, occurred
when the error log became full and had to wrap around. The problem was
fixed in APARs IX35555 and IX35564, so the first thing to check is
whether these APARs have been installed on your machine. If they have
not, they should be installed. If they have, I suggest reporting this
problem as a defect. When you do, it would help to provide the answers
to these questions:
1. Does this problem also happen without the error notification method?
(I don't believe that your problem is related to this, but it is
possible that having the notification method is causing just enough
extra load on the errdemon that it is dying when it is overloaded
by many errors.)
2. Is your /var filesystem full?
3. What has been set as the maximum size of your error log file? (You
can get this information with "/usr/lib/errdemon -l".)
4. Is your error log file at or near its maximum size? (You can get this
information with "ls -l /usr/adm/ras/errlog".)
5. If the error log file is near its maximum size, and you are able to
reproduce the problem consistently, does it still occur after you have
used the errclear command to clear out part or all of the error log?
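The checks in questions 2 through 4 can be gathered in one pass. The
following is only a sketch: the function name collect_errlog_facts and
the ERRLOG and ERRDEMON variables are illustrative assumptions, while
the paths and the "-l" flag are the ones quoted in the questions above.
Each step is guarded so the script still reports something when a file
or command is missing.

```shell
#!/bin/sh
# Sketch: collect the facts asked for in questions 2-4.
# ERRLOG and ERRDEMON default to the paths quoted in the answer;
# the variable and function names are assumptions for illustration.
ERRLOG=${ERRLOG:-/usr/adm/ras/errlog}
ERRDEMON=${ERRDEMON:-/usr/lib/errdemon}

collect_errlog_facts() {
    # Question 2: is the /var filesystem full?
    echo "== /var usage =="
    df /var 2>&1 || echo "df /var failed"

    # Question 3: maximum size configured for the error log.
    echo "== errdemon settings =="
    if [ -x "$ERRDEMON" ]; then
        "$ERRDEMON" -l
    else
        echo "$ERRDEMON not found"
    fi

    # Question 4: current size of the error log file.
    echo "== error log file =="
    if [ -f "$ERRLOG" ]; then
        ls -l "$ERRLOG"
    else
        echo "$ERRLOG not found"
    fi
}

collect_errlog_facts
```

Running this right after errdemon dies, and again after a clean restart,
gives two snapshots to attach to a defect report.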
If, after investigating these questions, you find that the problem might
be related to the error log reaching its maximum size and wrapping, you
could work around it by occasionally running the "errclear" command
and/or increasing the maximum size of your error log with
"/usr/lib/errdemon -s NEWSIZE".
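The workaround above amounts to a periodic check of how close the log
is to its limit. A minimal sketch of that check follows. The helper name
errlog_near_max and the default threshold of 80 percent are assumptions
chosen for illustration and are not part of the answer. The commented
lines show how it might be wired to errclear on the affected system.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) when the error log has
# reached THRESHOLD_PERCENT of its maximum size.
# usage: errlog_near_max SIZE_BYTES MAX_BYTES [THRESHOLD_PERCENT]
errlog_near_max() {
    size=$1
    max=$2
    threshold=${3:-80}    # assumed default, not taken from the answer
    # Integer percentage of the maximum size currently in use.
    pct=`expr "$size" \* 100 / "$max"`
    [ "$pct" -ge "$threshold" ]
}

# Possible wiring on the AIX system (commented out; MAX_BYTES stands for
# the limit reported by "/usr/lib/errdemon -l"):
# size=`ls -l /usr/adm/ras/errlog | awk '{print $5}'`
# if errlog_near_max "$size" "$MAX_BYTES"; then
#     errclear 0    # deletes all entries from the error log
# fi
```

Run from cron, a check like this clears the log before it wraps instead
of relying on someone remembering to run errclear by hand.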
If you find that this problem occurs even when the error log is not
near its maximum size, you should report it as a defect. In either case,
I would ask that you reopen this question, if possible, with the answers
to the questions above, so that this can be put in the library.
QUESTION:
The PTFs for IX35555 and IX35564 are already installed on my customer's
system, and the problem still occurs. I have reported this problem as a
defect to ISC.
I checked your questions when the problem occurred, with the following
results:
1. This problem still happened without the error notification method.
2. /var usage was 26%.
3. The maximum size of the error log file is 1048576 bytes.
4. The error log size right after the problem was 287728 bytes.
5. N/A
Search-keywords:
ERRDEMON DIES ERRLOG
WWQA: ITEM: RTA000066564
Dated: 07/1995 Category: ITSAIHA6000