HACMP DEADMAN SWITCH TIMEOUT CONDITIONS
ITEM: RTA000065193
An SE is having problems at a customer site here in Japan with
HACMP V2.1 installed.
The system consists of two RS/6000 Model 590s with two 9333-501
subsystems as shared disks, and the HACMP configuration is
idle-standby.
The problem was that the system crashed on a server node, and
takeover occurred, once or twice a week as CPU utilization
increased.
They investigated the system dump and concluded that the HACMP
deadman switch caused these crashes.
They opened a PMR (326X6611760) and, on the advice given there,
changed the "cycles_to_fail" parameter of clstrmgr from its default
value of 4 to 12. Since then, no crash has occurred.
I also suggested that they tune the I/O pacing parameters in
addition to tuning the "cycles_to_fail" parameter.
The customer, however, is worried that the crash problem might
occur again as the CPU load increases further toward the end of
the year.
Could you clarify the following points to mitigate their concerns?
Q1. Could you explain what the "kernel lock" means?
    In the README file of PTF U432018, the chapter "AIX Kernel Lock
    when Using HACMP/6000" lists several causes that can make the
    deadman switch halt the system. The second paragraph of that
    chapter says: "This is due to the 3.2.x AIX kernel being
    built in a way that causes events to be threaded through a
    single kernel lock. During I/O intensive operations (df, find,
    etc.), the cluster manager process may be required to wait too
    long for a chance at the kernel lock."
    Are there any lock functions used by clstrmgr to get AIX kernel
    services? Or does it just mean that processes are serialized
    when accessing the I/O queue?
Q2. I understand that tuning I/O pacing and "cycles_to_fail" is a
    matter of trial and error.
    But are there any performance parameters or data that can be
    monitored to anticipate a tendency toward a deadman switch
    timeout?
    The customer wants to know how CPU utilization or I/O
    utilization relate to the probability of a dms timeout, so
    that they can evaluate such data and set the parameters
    appropriately before a dms timeout causes an unwanted node
    takeover.
ANSWER
1. AIX Kernel Lock and the Cluster Manager
The kernel lock is a lock used by kernel processes to serialize access
to global data structures; only one process can hold it at a time.
The HACMP cluster manager needs to obtain the kernel lock every
keepalive (KA) cycle, since it does an "ioctl" to check the status of
each of its adapters, and, in the case of "serial" non-TCP/IP networks
such as RS232 and SCSI Target Mode, it needs the kernel lock in order
to send the KA packet itself. The I/O intensive operations mentioned
in your question also use the kernel lock, and can delay the cluster
manager in obtaining it, especially when they are paged out while
still holding the lock.
In AIX 3.2.5, because the kernel is single-threaded, there is only one
kernel lock, and only one process at a time can hold it. AIX 4.1,
being multi-threaded, will allow finer granularity in kernel locking,
but it is really too early, and not appropriate, to speculate right
now on what effect that will have on HACMP and a cluster manager
running on AIX 4.1.
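To make the kernel lock contention on 3.2.x concrete, here is a
minimal, illustrative sketch (not from the PTF notes; the ksh shell,
the /home path, and the loop count are assumptions) that drives the
same kind of I/O-intensive, filesystem-walking load that df and find
generate, so the effect of the tunings listed in part 2 can be
observed under a controlled load:

   #!/bin/ksh
   # Illustrative load generator: runs the I/O-intensive operations
   # named above (find, df), which all serialize on the single
   # AIX 3.2.x kernel lock. Paths and counts are placeholders.
   find /home -print > /dev/null 2>&1 &    # long filesystem walk
   FIND_PID=$!
   i=0
   while [ $i -lt 20 ] ; do                # repeated filesystem queries
       df > /dev/null 2>&1
       i=`expr $i + 1`
   done
   kill $FIND_PID 2>/dev/null
   # While this runs, "vmstat 5" shows the CPU and wait-for-I/O
   # figures under which clstrmgr must still obtain the kernel lock
   # once per KA cycle.

This does not predict a deadman switch timeout, but it gives the SE a
repeatable load against which to verify the tuning changes below.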
2. Any predictive data or parameters for the deadman switch?
Unfortunately, after discussing this with defect support and
development, I have to say that there really is no "threshold" I can
give you. In some cases, for instance, a cron job doing a backup with
the find command, at night with no users on the system, has caused the
deadman switch to go off before the changes recommended in your PMR
and in the PTF notes for U432018 and U431606 were implemented. Here,
again, is a list of the changes made in PTF U432018 (also included in
superseding follow-on PTFs), and of recommendations to the user, to
avoid a situation where the deadman switch times out:
1. The cluster manager is now pinned in memory by default, preventing
   it from being paged out. No action is required beyond applying PTF
   U432018 or a superseding PTF.
2. Application server processes no longer run at fixed priority. This
   addresses the previous potential for a fixed-priority, CPU-heavy
   application server process keeping the cluster manager from getting
   cycles. Application server processes are now subject to the
   priority adjustment of the scheduler, and to the "nice" command if
   desired. No user action is required beyond installing PTF U432018
   or a superseding PTF.
3. Change the interval of the syncd daemon from its default of 60
   seconds to 10 seconds. This increases the frequency with which I/O
   buffers are flushed (an operation that uses the kernel lock), so
   each flush takes a shorter time. Change this in the /sbin/rc.boot
   script and reboot the system. This change is highly recommended;
   see the sketch after this list.
4. Implement I/O pacing as described in the documentation and in the
   PTF notes. This also prevents long I/O queues from blocking out the
   cluster manager. Use "smit chgsys" to change the settings, and
   reboot the system. A high water mark of 33 and a low water mark of
   24 are recommended as a starting point; see the sketch after this
   list.
5. Adjust the "cycles_to_fail" and "keepalive cycle" parameters, as
   described in the PTF notes, and as you have already implemented in
   your situation. These adjustments do affect the error detection
   time, but are necessary in some cases to avoid the deadman switch
   timing out.
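To make items 3 and 4 concrete, here is a minimal sketch. The sed
expression is an assumption (the exact syncd line in /sbin/rc.boot
varies by AIX level, so verify it before editing); maxpout and minpout
are the standard sys0 attributes behind the smit panel:

   # Item 3: shorten the syncd interval from 60 to 10 seconds.
   # /sbin/rc.boot starts syncd with the interval as an argument;
   # check the actual line on your system before editing it.
   cp /sbin/rc.boot /sbin/rc.boot.orig
   sed 's/syncd 60/syncd 10/' /sbin/rc.boot.orig > /sbin/rc.boot

   # Item 4: set I/O pacing to the recommended starting values
   # (high water mark 33, low water mark 24); this is the
   # command-line equivalent of the smit panel:
   chdev -l sys0 -a maxpout=33 -a minpout=24

   # Reboot afterward so that both changes take effect.

On item 5, note that the error detection time scales roughly as the
product of the keepalive cycle and "cycles_to_fail", so your change
from 4 to 12 lengthens the detection time by about a factor of three;
that is the trade-off mentioned above.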
I am sorry I cannot give you more specific answers, but I hope this
information helps you and your customer.
Search keywords:
HACMP DEADMAN SWITCH DMS TIMEOUT
WWQA: ITEM: RTA000065193
Dated: 09/1995 Category: ITSAIHA6000