RPC.LOCKD: 1831-084 MESSAGE NO LONGER IN WAIT
ITEM: RTA000041338
QUESTION:
I have a customer who recently started getting the following error
message on two of his RS/6000 file servers:
rpc.lockd: 1831-084 message no longer in wait queue
At the same time, one of the Ethernet interfaces on one of the systems
seemed to hang for a few minutes and then recover on its own.
What does this message mean?
---------- ---------- ---------- --------- ---------- ----------
A: The 1831-084 message usually indicates that the server believes
that a client requested a lock, but the client does not know that
it has been granted one. You may be able to resolve the problem
by executing the following commands on both server and client:
rm /etc/sm/*
rm /etc/sm.bak/*
and then reboot both server and client.
This error message can also be associated with name resolution
problems. If the above suggestions do not work, please provide
me with the following information:
1) Level of AIX on the NFS clients and servers.
2) The exact output of the following commands on an NFS server
that is generating the 1831-084 message:
host <server-hostname>
host <server-IP-address>
host <client-hostname>
host <client-IP-address>
3) The exact output of the above commands on an NFS client
which has mounted a filesystem from the same NFS server.
4) Are you using DNS, NIS, or just the local /etc/hosts files?
5) What information is logged when you start the rpc.lockd
daemon with the command:
/usr/etc/rpc.lockd -d5 >/tmp/lockd.out 2>&1 &
---------- ---------- ---------- --------- ---------- ----------
QUESTION:
We are not able to determine which client is causing this message
to be produced. Is there a way to find this out?
Also, what does the "d5" do? I don't see it documented.
---------- ---------- ---------- --------- ---------- ----------
A: To find the client which is causing the message, view the files
in the /etc/sm and /etc/sm.bak directories on the NFS server. You
should find entries in these files indicating which clients have
locks.
You are correct that the -d flag is not documented. I have opened
PMR 9X484 with IBM Software Services to report a documentation defect
with the "lockd Daemon" article in InfoExplorer.
The -d flag specifies that the lockd daemon should report diagnostic
(debugging) information. The information is sent to stdout.
If you are unable to diagnose and resolve the problem, please provide
the information requested in items 1) through 5) above.
---------- ---------- ---------- --------- ---------- ----------
QUESTION:
Here's what we've found on our server (dcs7) and two of the
clients (dcst14 and dcst20).
1. Level of AIX on the NFS clients and servers.
NFS server (dcs7) is at 3240
NFS clients are at SunOS 4.1.3
2. On the server (dcs7):
# host dcs7
aixs0-n45 is 147.145.45.8, Aliases: dcs7-e0, dcs7, aixs0
# host 147.145.45.8
aixs0-n45 is 147.145.45.8, Aliases: dcs7-e0, dcs7, aixs0
# host dcst14
dcst14 is 147.145.45.94, Aliases: dc44
# host 147.145.45.94
dcst14 is 147.145.45.94, Aliases: dc44
# host dcst20
dcst20 is 147.145.45.100
# host 147.145.45.100
dcst20 is 147.145.45.100
3. The "host" command doesn't exist on the SunOS client machines.
4. The NFS servers and all clients are NIS clients.
5. We have not restarted the rpc.lockd out of fear of causing
problems with active clients. Should we be able to restart
rpc.lockd without causing any problems? If there is any chance
that it might cause a problem, we'll have to wait until a
scheduled down time to restart it.
---------- ---------- ---------- --------- ---------- ----------
A: If you remove the files in /etc/sm and /etc/sm.bak, and then
restart the rpc.lockd daemon, all the clients who have locks
on files in the mounted filesystems will lose their locks.
Whether this will cause a problem depends on the applications
that acquired the locks.
Usually, applications use file locks to maintain data integrity
by serializing write access (first come, first served). If an
application loses its lock on a file, it will have to re-acquire
it, which allows for the possibility that another application
could lock the file before the original application can re-lock
it. In other words, restarting the rpc.lockd daemon could result
in data inconsistency.
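The risk just described can be sketched with POSIX advisory locks. This is a minimal local demonstration using fcntl() (over NFS, the same lock requests travel through rpc.lockd): process A holds an exclusive lock, process B's first non-blocking attempt fails, and as soon as A releases the lock B takes it. If A ever loses its lock unexpectedly, B may get in first.

```c
/* Sketch of the lock-loss race: local fcntl() advisory locks only.
 * The path passed in is arbitrary; any writable file will do. */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int set_lock(int fd, short type, int wait)
{
    struct flock fl = {0};
    fl.l_type = type;        /* F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            /* 0 = the whole file */
    return fcntl(fd, wait ? F_SETLKW : F_SETLK, &fl);
}

/* Returns 0 if process B saw the expected fail-then-succeed pattern. */
int demo_lock_race(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    if (set_lock(fd, F_WRLCK, 1) < 0)    /* process A takes the lock */
        return -1;

    pid_t pid = fork();
    if (pid == 0) {                      /* process B */
        int fd2 = open(path, O_RDWR);
        int first  = set_lock(fd2, F_WRLCK, 0);  /* fails: A holds it  */
        int second = set_lock(fd2, F_WRLCK, 1);  /* blocks until A     */
        _exit(first < 0 && second == 0 ? 0 : 1); /*   releases         */
    }

    sleep(1);                            /* give B time to fail once */
    set_lock(fd, F_UNLCK, 0);            /* A releases (or loses) it */

    int status;
    waitpid(pid, &status, 0);
    close(fd);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Because locks are not inherited across fork(), the child behaves exactly like an unrelated application contending for the same file.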
As an update on the -d flag issue: I have heard back from Software
Services. They said the -d flag is deliberately undocumented; the
developer added it for his own purposes and did not intend it to
be used generally.
---------- ---------- ---------- --------- ---------- ----------
QUESTION:
We have learned about a PTF that might take care of our problem. It's
U423872 (and unfortunately I don't have an APAR number). Based on
what you know from our discussion here, do you think we're seeing a
bug, and will U423872 really address it?
Thanks again.
---------- ---------- ---------- --------- ---------- ----------
A: The only reference I could find to PTF U423872 is:
PTF RQSTD: U423872 U423878 U423884
in APAR IX39607. I have appended excerpts from this APAR below.
Based on the information I have at this point, I do not believe you
are experiencing a bug.
STAT= CLOSED PER FESN0960475- CTID= TX2527 ISEV= 1
SB93/10/07 RC93/10/07 CL93/10/07 PD SEV= 1
PE= TYPE= F
RCOMP= 575603001 AIX V3 FOR RS/6 RREL= R320
FCOMP= 575603001 AIX V3 FOR RS/6 PFREL= F999 TREL= T
ACTION= SEC/INT= DUP/
USPTF= U427492 PDPTF= U427492 DUPS 0
ERROR DESCRIPTION:
In some cases, when a lockd is brought down and then back up, further
RPC interactions to or from that machine can fail. iptraces show
NLM_LOCK_RES packets being refused at the machine due to icmp_err
port unreachable.
LOCAL FIX:
restart the lockd on the machine that is sending out the packet with
the bad port in it.
PROBLEM SUMMARY:
If a lockd client was reinstalled while holding locks, the lockd
server would remember the wrong port number in a cache and lockd
communications between the two machines were messed up until the
server's lockd was stopped and restarted.
PROBLEM CONCLUSION:
Changed the lockd cache code to age cache members that consistently
timeout when trying to deliver RPC messages.
TEMPORARY FIX:
Shut down server's rpc.lockd and restart it.
CIRCUMVENTION:
Don't reinstall clients while they hold locks. Bring them down
gracefully before a reinstall.
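The PROBLEM CONCLUSION above ("age cache members that consistently timeout") can be pictured with a small sketch. This is purely illustrative, not the actual lockd source: each cache entry remembers a client's port, and after a few consecutive delivery failures the entry is dropped, forcing a fresh port lookup instead of retrying a stale port forever. The names and the threshold are my own.

```c
/* Illustrative "age out on repeated timeouts" cache entry -- a guess
 * at the shape of the fix, not IBM's code. */
#define MAX_TIMEOUTS 3       /* threshold chosen for illustration */

struct port_cache_entry {
    unsigned short port;     /* cached client port (0 = empty slot) */
    int timeouts;            /* consecutive RPC delivery failures   */
};

/* Record one delivery attempt; returns 1 if the entry was evicted. */
int note_delivery(struct port_cache_entry *e, int delivered)
{
    if (delivered) {
        e->timeouts = 0;     /* any success resets the age counter */
        return 0;
    }
    if (++e->timeouts >= MAX_TIMEOUTS) {
        e->port = 0;         /* evict: forget the (stale) port */
        e->timeouts = 0;
        return 1;
    }
    return 0;
}
```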
---------- ---------- ---------- --------- ---------- ----------
QUESTION:
Earlier you said that the clients that have locks could be found in
the /etc/sm and /etc/sm.bak directories on the server. I assume
that all clients with locks would be in these directories. Is there
a way to find out which of these clients is causing our 1831-084
message? We'd like to see if any of these clients were reinstalled
or rebooted ungracefully. If so, then the APAR description may match
our problem.
A couple of related questions . . .
Can the "bad port" information be caused simply by ungracefully
rebooting a client? Or, does something like a re-install have
to happen?
We're planning on restarting lockd with the "d5" flag this weekend.
Is it possible that simply restarting lockd might cause the problem
to go away?
And finally, when I first opened this question, I mentioned that
one of the Ethernet interfaces seemed to hang and then eventually
recover. Other interfaces on the same machine had no problem.
The "1831-084" message was the only message or clue that
we had to pursue. Even though we continue to get these messages,
the Ethernet interface no longer has any problems. Do you think
the hang could have been related to these lockd messages?
(I guess if we can determine which specific clients are causing the
messages to occur, then we can see if they were using the interface
that was hung.)
---------- ---------- ---------- --------- ---------- ----------
Q1 - "Is there a way to find out which of these clients is causing
our 1831-084 message?"
A1 - Unfortunately, there is no officially supported way of obtaining
this kind of information. However, you should be able to deduce
this by inspecting the output generated by using the rpc.lockd
"-d5" flag.
Each time rpc.lockd receives a lock/unlock request, it records
this event. Each record is 10 to 15 lines long. For example,
here are excerpts of two records:
NLM_PROG+++ version 1 proc 7 Tue Mar 15 13:35:09 1994
oh= gorf19516 lockID(cookie)= 39
enter proc_nlm_lock_msg(20070db8) enter local_lock
nlm_reply: (gorf, 12), result = 0
***** no entry in message queue *****
NLM_PROG+++ version 1 proc 9 Tue Mar 15 13:35:39 1994
oh= gorf19516 lockID(cookie)= 3a
enter proc_nlm_unlock_msg(2007bba8)
enter local_unlock: choice=MSG
nlm_reply: (gorf, 14), result = 0
***** no entry in message queue *****
These records were generated on a host named "u2e" when I ran a
short C program to lock and then unlock a file. I ran my program
on a host named "gorf", and locked a file residing on "u2e". My
program locked the file, slept for 30 seconds, and then unlocked
the file. Notice that both records indicate (nlm_reply) that the
client's name is "gorf".
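A minimal version of such a test program might look like the sketch below. This is a hypothetical reconstruction (the original source was not included): it takes an exclusive fcntl() lock on a file, holds it briefly, then releases it. Run against a file on an NFS mount, each lock/unlock pair should appear as a pair of records in the server's "rpc.lockd -d5" output.

```c
/* Hypothetical reconstruction of the lock-then-unlock test program. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Lock the file, hold the lock for `hold` seconds, then unlock.
 * Returns 0 on success, -1 on any failure. */
int lock_then_unlock(const char *path, unsigned int hold)
{
    struct flock fl;
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    fl.l_type = F_WRLCK;     /* exclusive (write) lock   */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            /* 0 = lock the whole file  */
    if (fcntl(fd, F_SETLKW, &fl) < 0) {   /* wait until granted */
        perror("fcntl(F_SETLKW)");
        close(fd);
        return -1;
    }

    sleep(hold);             /* the original reportedly slept 30 seconds */

    fl.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLK, &fl) < 0) {
        perror("fcntl(F_UNLCK)");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```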
When you receive the next 1831-084 message, note the exact time
and compare it with the rpc.lockd log.
After experimenting with setting and clearing locks, I do not
believe that a specific client is causing the 1831-084 message. I
believe that the server is attempting to notify a client that a
previously requested lock is now available, but the client is no
longer interested in having that particular lock. If this is the
case, the above-mentioned APAR may apply to you. See A3 below.
Q2 - "Can the bad port information be caused...by rebooting a client?"
A2 - Yes. When you reboot a client, the client will forget which
port it was using, but the server will still try to use that
port.
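The stale-port behavior can be illustrated locally. In the sketch below (a minimal UDP example, not lockd itself), a "client" binds an ephemeral port; after a "reboot" (closing the socket) it binds again, and the kernel hands out whatever ephemeral port is free. A server that keeps sending to the old cached port will usually hit a dead port, which is what produces the ICMP "port unreachable" errors the APAR describes.

```c
/* Bind a UDP socket on the loopback to a kernel-chosen ephemeral
 * port and return that port (host byte order), leaving the socket
 * open via *fd_out. Returns -1 on failure. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int bind_ephemeral(int *fd_out)
{
    struct sockaddr_in sa;
    socklen_t len = sizeof(sa);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sa.sin_port = 0;                      /* 0 = kernel picks a port */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        getsockname(fd, (struct sockaddr *)&sa, &len) < 0) {
        close(fd);
        return -1;
    }
    *fd_out = fd;
    return ntohs(sa.sin_port);
}
```

Binding twice (closing in between, as a reboot would) generally yields two different ports; only the client knows its new port, while the server's cache still holds the old one.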
Q3 - "(Will) restarting lockd...cause the problem to go away?"
A3 - Restarting rpc.lockd on the server should stop the messages from
being generated.
Q4 - "Do you think the hang could have been related to these lockd
messages?"
A4 - Yes. When the interface hung, it is possible that certain client
applications timed out, hung, or were otherwise unable to continue.
For example, if a client application hung, the user might have
become frustrated and killed the application or rebooted their
machine, hoping that would solve the problem.
---------- ---------- ---------- --------- ---------- ----------
This item was created from library item Q657160 CQMNS
Additional search words:
APR94 COMMUNICATIO CQMNS IX LONGER MESSAGE MSG NFS OZNEW QUEUE
RISCSYSTEM RISCTCP RPC.LOCKD SERVERS SOFTWARE TCPIP WAIT WAITING
1831-084
WWQA: ITEM: RTA000041338
Dated: 04/1996 Category: RISCTCP