-RPC.LOCKD:1831-084 MESSAGE NO LONGER IN WAIT

ITEM: RTA000041338



QUESTION:                                                                       
I have a customer who recently started getting the following error              
message on two of his RS/6000 file servers:                                     
                                                                                
     rpc.lockd: 1831-084 message no longer in wait queue                        
                                                                                
At the same time, one of the Ethernet interfaces on one of the systems          
seemed to hang for a few minutes and then recover on its own.                   
                                                                                
What does this message mean?                                                    
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
A: The 1831-084 message usually indicates that the server thinks                
   that a client requested a lock, and the client does not know                 
   that it has a lock.  You may be able to resolve the problem                 
   by executing the following commands on both server and client:               
                                                                                
        rm /etc/sm/*                                                            
        rm /etc/sm.bak/*                                                        
                                                                                
   and then reboot both server and client.                                      
                                                                                
   This error message can also be associated with name resolution               
   problems.  If the above suggestions do not work, please provide              
   me with the following information:                                           
                                                                                
   1) Level of AIX on the NFS clients and servers.                              
                                                                                
   2) The exact output of the following commands on an NFS server               
      that is generating the 1831-084 message:                                 
                                                                                
      host server_hostname
      host server_IP_address
      host client_hostname
      host client_IP_address
                                                                                
   3) The exact output of the above commands on an NFS client                   
      which has mounted a filesystem from the same NFS server.                  
                                                                                
   4) Are you using DNS, NIS, or just the local /etc/hosts files?               
                                                                                
   5) What information is logged when you start the rpc.lockd                   
      daemon with the command:                                                  
                                                                                
         /usr/etc/rpc.lockd -d5 >/tmp/lockd.out 2>&1 &                         
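
   As an illustration only (this is not an IBM-supplied tool), the
   forward and reverse lookups requested in item 2 could be compared
   with a short Python sketch.  The function names lookups_agree and
   check_host are hypothetical, and the rule that both lookups must
   return the same canonical name and the same aliases is an
   assumption about what "consistent" name resolution means here.

```python
import socket


def names_of(lookup):
    """Return the lowercase names (canonical plus aliases) in a lookup tuple."""
    canonical, aliases, _addresses = lookup
    return {canonical.lower(), *(alias.lower() for alias in aliases)}


def lookups_agree(forward, reverse):
    """True when both lookups give the same canonical name and aliases.

    Each argument is a (canonical_name, aliases, addresses) tuple, the
    shape returned by socket.gethostbyname_ex and socket.gethostbyaddr.
    """
    same_canonical = forward[0].lower() == reverse[0].lower()
    return same_canonical and names_of(forward) == names_of(reverse)


def check_host(name):
    """Resolve name, reverse-resolve each address, and report agreement.

    Returns a list of (address, agrees) pairs.  A failed reverse lookup
    is reported as a disagreement.
    """
    forward = socket.gethostbyname_ex(name)
    report = []
    for address in forward[2]:
        try:
            reverse = socket.gethostbyaddr(address)
        except OSError:  # socket.herror and socket.gaierror are OSErrors
            report.append((address, False))
            continue
        report.append((address, lookups_agree(forward, reverse)))
    return report
```

   Running check_host with the same host name on the server and on a
   client, and comparing the two reports, gives roughly the same
   information as comparing the "host" output by eye.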
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
QUESTION:                                                                       
We are not able to determine which client is causing this message               
to be produced.  Is there a way to find this out?                               
                                                                                
Also, what does the "d5" do?  I don't see it documented.                        
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
A: To find the client which is causing the message, view the files              
   in the /etc/sm and /etc/sm.bak directories on the NFS server.  You           
   should find entries in these files indicating which clients have             
   locks.                                                                       
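
   A minimal Python sketch for viewing those entries follows.  It
   assumes only that the two directories contain ordinary files; the
   function name list_sm_entries is hypothetical, and nothing is
   assumed about the format of the file contents, which are returned
   as plain text for inspection.

```python
import os


def list_sm_entries(directories=("/etc/sm", "/etc/sm.bak")):
    """Map each directory to a sorted list of (file name, file text) pairs.

    A directory that is missing or unreadable maps to an empty list.
    """
    entries = {}
    for directory in directories:
        found = []
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            names = []
        for name in names:
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue
            # Undecodable bytes are replaced rather than raising an error.
            with open(path, "r", errors="replace") as handle:
                found.append((name, handle.read()))
        entries[directory] = found
    return entries
```

   The sketch only reads the files; it does not remove or change
   anything in either directory.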
                                                                                
   You are correct that the -d flag is not documented.  I have opened          
   PMR 9X484 with IBM Software Services to report a documentation defect        
   with the "lockd Daemon" article in InfoExplorer.                             
                                                                                
   The -d flag specifies that the lockd daemon should report Diagnostic         
   (or "Debugging") information.  The information is sent to stdout.            
                                                                                
   If you are unable to diagnose and resolve the problem, please provide        
   the information requested in items 1) through 5) above.                      
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
QUESTION:                                                                       
Here's what we've found on our server (dcs7) and two of the                     
clients (dcst14 and dcst20).                                                    
                                                                                
1.  Level of AIX on the NFS clients and servers.                               
                                                                                
    NFS server (dcs7) is at >3240                                               
                                                                                
    NFS clients are at SunOS 4.1.3                                              
                                                                                
2.  On the server (dcs7):                                                       
                                                                                
    # host dcs7                                                                 
    aixs0-n45 is 147.145.45.8, Aliases: dcs7-e0, dcs7, aixs0                    
                                                                                
    # host 147.145.45.8                                                         
    aixs0-n45 is 147.145.45.8, Aliases: dcs7-e0, dcs7, aixs0                    
                                                                                
    # host dcst14                                                               
    dcst14 is 147.145.45.94,  Aliases:   dc44                                  
                                                                                
    # host 147.145.45.94                                                        
    dcst14 is 147.145.45.94,  Aliases:   dc44                                   
                                                                                
    # host dcst20                                                               
    dcst20 is 147.145.45.100                                                    
                                                                                
    # host 147.145.45.100                                                       
    dcst20 is 147.145.45.100                                                    
                                                                                
3.  The "host" command doesn't exist on the SunOS client machines.              
                                                                                
4.  The NFS servers and all clients are NIS clients.                            
                                                                                
5.  We have not restarted the rpc.lockd out of fear of causing                 
    problems with active clients.  Should we be able to restart                 
    rpc.lockd without causing any problems?  If there is any chance             
    that it might cause a problem, we'll have to wait until a                   
    scheduled down time to restart it.                                          
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
A: If you remove the files in /etc/sm and /etc/sm.bak, and then                 
   restart the rpc.lockd daemon, all the clients who have locks                 
   on files in the mounted filesystems will lose their locks.                   
   Whether this will cause a problem depends on the applications                
   that acquired the locks.                                                     
                                                                                
   Usually, applications use file locks to maintain data integrity
   by serializing write access to a file on a First Come First
   Served (FCFS) basis.  If an application loses its lock on a file,
   it will have to re-acquire it.  This allows for the possibility              
   that another application could lock the file before the original             
   application can re-lock it.  In other words, restarting the                  
   rpc.lockd daemon could result in data inconsistency.                         
                                                                                
   As an update on the -d flag issue, I have heard back from Software
   Services.  They said -d is not documented deliberately.  The                 
   developer put the -d flag in for his own purposes and did not                
   intend for it to be used generally.                                          
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
QUESTION:                                                                       
We have learned about a PTF that might take care of our problem.  It's          
U423872 (and unfortunately I don't have an APAR number).  Based on              
what you know from our discussion here, do you think we're seeing a            
bug, and will U423872 really address it?                                        
                                                                                
Thanks again.                                                                   
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
A: The only reference I could find to PTF U423872 is:                           
                                                                                
      PTF RQSTD: U423872  U423878  U423884                                      
                                                                                
   in APAR IX39607.  I have appended excerpts from this APAR below.             
   Based on the information I have at this point, I do not believe you          
   are experiencing a bug.
                                                                                
STAT= CLOSED  PER  FESN0960475-     CTID= TX2527 ISEV= 1                        
SB93/10/07  RC93/10/07  CL93/10/07  PD           SEV= 1                        
                       PE=                       TYPE= F                        
RCOMP= 575603001    AIX V3 FOR RS/6 RREL= R320                                  
FCOMP= 575603001    AIX V3 FOR RS/6 PFREL= F999  TREL= T                        
ACTION=             SEC/INT=                     DUP/                           
USPTF= U427492      PDPTF= U427492               DUPS 0                         
                                                                                
ERROR DESCRIPTION:                                                              
In some cases when a lockd is brought down and then up further RPC              
interactions to or from that machine can fail. iptraces show                    
NLM_LOCK_RES packets being refused at the machine due to icmp_err               
port unreachable.                                                               
                                                                                
LOCAL FIX:                                                                      
restart the lockd on the machine that is sending out the packet with            
the bad port in it.                                                            
                                                                                
PROBLEM SUMMARY:                                                                
If a lockd client was reinstalled while holding locks, the lockd                
server would remember the wrong port number in a cache and lockd                
communications between the two machines were messed up until the                
server's lockd was stopped and restarted.                                       
                                                                                
PROBLEM CONCLUSION:                                                             
Changed the lockd cache code to age cache members that consistently             
timeout when trying to deliver RPC messages.                                    
                                                                                
TEMPORARY FIX:                                                                  
Shut down server's rpc.lockd and restart it.                                    
                                                                                
CIRCUMVENTION:                                                                 
Don't reinstall clients while they hold locks.  Bring them down                 
gracefully before a reinstall.                                                  
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
QUESTION:                                                                       
Earlier you said that the clients that have locks could be found in             
the /etc/sm and /etc/sm.bak directories on the server.  I assume                
that all clients with locks would be in these directories.  Is there            
a way to find out which of these clients is causing our 1831-084                
message?  We'd like to see if any of these clients were reinstalled             
or rebooted ungracefully.  If so, then the APAR description may match           
our problem.                                                                    
                                                                                
A couple of related questions . . .                                             
                                                                               
Can the "bad port" information be caused simply by ungracefully                 
rebooting a client?  Or, does something like a re-install have                  
to happen?                                                                      
                                                                                
We're planning on restarting lockd with the "d5" flag this weekend.             
Is it possible that simply restarting lockd might cause the problem             
to go away?                                                                     
                                                                                
And finally, when I first opened this question, I mentioned that                
one of the Ethernet interfaces seemed to hang and then eventually               
recover.  Other interfaces on the same machine had no problem.                  
The "1831-084" message was the only message or clue that                        
we had to pursue.  Even though we continue to get these messages,               
the Ethernet interface no longer has any problems.  Do you think                
the hang could have been related to these lockd messages?                      
(I guess if we can determine which specific clients are causing the             
messages to occur, then we can see if they were using the interface             
that was hung.)                                                                 
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
                                                                                
Q1 - "Is there a way to find out which of these clients is causing              
      our 1831-084 message?"                                                    
                                                                                
A1 - Unfortunately, there is no officially supported way of obtaining           
     this kind of information.  However, you should be able to deduce           
     this by inspecting the output generated by using the rpc.lockd             
     "-d5" flag.                                                                
                                                                                
     Each time rpc.lockd receives a lock/unlock request, it records            
     this event.  Each record is 10 to 15 lines long.  For example,             
     here are excerpts of two records:                                          
                                                                                
        NLM_PROG+++ version 1 proc 7    Tue Mar 15 13:35:09 1994                
        oh= gorf19516   lockID(cookie)= 39                                      
        enter proc_nlm_lock_msg(20070db8) enter local_lock                      
        nlm_reply: (gorf, 12), result = 0                                       
        ***** no entry in message queue *****                                   
        NLM_PROG+++ version 1 proc 9    Tue Mar 15 13:35:39 1994                
        oh= gorf19516   lockID(cookie)= 3a                                      
        enter proc_nlm_unlock_msg(2007bba8)                                     
        enter local_unlock: choice=MSG                                          
        nlm_reply: (gorf, 14), result = 0                                       
        ***** no entry in message queue *****                                   
                                                                               
     These records were generated on a host named "u2e" when I ran a            
     short C program to lock and then unlock a file.  I ran my program          
     on a host named "gorf", and locked a file residing on "u2e".  My           
     program locked the file, slept for 30 seconds, and then unlocked           
     the file.  Notice that both records indicate (nlm_reply) that the          
     client's name is "gorf".                                                   
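
     The test program itself is not reproduced here.  A rough
     equivalent in Python (not the original C program) is sketched
     below.  The function name lock_sleep_unlock and the hold_seconds
     parameter are hypothetical; the file is assumed to reside on an
     NFS-mounted filesystem so that the lock request is passed to the
     server's rpc.lockd.

```python
import fcntl
import os
import time


def lock_sleep_unlock(path, hold_seconds=30):
    """Lock an existing file exclusively, hold the lock, then release it.

    Returns the number of seconds the lock was held.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        # Blocks until the exclusive lock on the whole file is granted.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        started = time.monotonic()
        time.sleep(hold_seconds)
        # Release the lock explicitly, as the test described above does.
        fcntl.lockf(fd, fcntl.LOCK_UN)
        held = time.monotonic() - started
    finally:
        os.close(fd)
    return held
```

     Calling lock_sleep_unlock on a file in the mounted filesystem
     while rpc.lockd is running with "-d5" on the server should
     produce one lock record and one unlock record similar to the
     excerpts above.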
                                                                                
     When you receive the next 1831-084 message, note the exact time            
     and compare it with the rpc.lockd log.                                     
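
     The comparison could be scripted along the following lines.
     This is a sketch in Python that assumes the log looks like the
     excerpts above: each record starts with an "NLM_PROG+++" line
     ending in a timestamp, and the client name appears in an
     "nlm_reply:" line.  Since the "-d5" output is undocumented,
     the real format may differ.  The names parse_records and
     records_near are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Record header, e.g. "NLM_PROG+++ version 1 proc 7    Tue Mar 15 13:35:09 1994"
HEADER = re.compile(
    r"^\s*NLM_PROG\+\+\+ version (\d+) proc (\d+)\s+"
    r"(\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4})\s*$"
)
# Reply line, e.g. "nlm_reply: (gorf, 12), result = 0"
REPLY = re.compile(r"nlm_reply: \((\S+), (\d+)\)")


def parse_records(lines):
    """Split log lines into records with proc number, time, and client name."""
    records = []
    current = None
    for line in lines:
        header = HEADER.match(line)
        if header:
            when = datetime.strptime(header.group(3), "%a %b %d %H:%M:%S %Y")
            current = {"proc": int(header.group(2)), "time": when, "client": None}
            records.append(current)
            continue
        reply = REPLY.search(line)
        if reply and current is not None:
            current["client"] = reply.group(1)
    return records


def records_near(records, when, window_seconds=60):
    """Return the records logged within window_seconds of the given time."""
    window = timedelta(seconds=window_seconds)
    return [rec for rec in records if abs(rec["time"] - when) <= window]
```

     For example, pass the lines of /tmp/lockd.out to parse_records,
     then call records_near with the time at which the 1831-084
     message appeared to list the clients that were active around
     that moment.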
                                                                                
     After experimenting with setting and clearing locks, I do not              
     believe that a specific client is causing the 1831-084 message.  I         
     believe that the server is attempting to notify a client that a            
     previously requested lock is now available, but the client is no           
     longer interested in having that particular lock.  If this is the         
     case, the above-mentioned APAR may apply to you.  See A3 below.            
                                                                                
Q2 - "Can the bad port information be caused...by rebooting a client?"          
                                                                                
A2 - Yes.  When you reboot a client, the client will forget which               
     port it was using, but the server will still try to use that               
     port.                                                                      
                                                                                
Q3 - "(Will) restarting lockd...cause the problem to go away?"                  
                                                                                
A3 - Restarting rpc.lockd on the server should stop the messages from           
     being generated.                                                           
                                                                                
Q4 - "Do you think the hang could have been related to these lockd              
     messages?"                                                                
                                                                                
A4 - Yes.  When the interface hung, it is possible that certain client
     applications timed out, hung, or were otherwise unable to continue.
     For example, if a client application hung, the user might have
     become frustrated and killed the application or rebooted their
     machine, hoping that would solve the problem.  Either action
     could leave the server trying to grant a lock that the client
     no longer wants, which is the situation described in A1.
                                                                                
---------- ---------- ---------- --------- ---------- ----------                
                                                                                
                                                                                
This item was created from library item Q657160      CQMNS                      
                                                                                
WWQA: ITEM: RTA000041338 ITEM: RTA000041338
Dated: 04/1996 Category: RISCTCP