ITEM: BP6949L

(central) nfs server can not communicate with work station



Question:

R20
Problem with the NFS server: it cannot communicate with a workstation,
and this is causing clients to hang or take a long time to respond.
Can he prevent one client from blocking the others? Needs a
callback in the morning.

Response:

Calling customer

Response:

Not In, Left Message

Response:

Calling customer

Response:

Customer's voicemail indicates he will be out until after Nov. 4th
Closing until customer gets back from vacation.

Response:

Calling customer

Response:

Not In, Left Message

Response:

Calling customer

Response:

Customer Contact

Response:

E: 3.2.5 NFS server with more than 500 clients; R20 with HACMP and a
RAID array. All clients NFS-mount the export /home/share. There are 128
nfsd daemons running on the server. Each client has two Ethernet
interfaces, en0 and en1. The en0 interfaces of all the clients and the
server are in one subnet, while the en1 interfaces of all the clients
and the server are in another. On the clients, en0 is the primary IP
interface and has the machine's hostname associated with it in DNS,
while en1 carries the default route and has an alias hostname attached
to it. Both interfaces are attached to the same logical network, but to
different physical networks (different routers, etc.). The effect is
this: any client sees the server as two different servers on two
different subnets. The same is true of the server looking back at the
clients: it sees two separate clients for each real client. HACMP is in
use on the server (4 interfaces: two active, two reserve). HACMP is not
in use on the clients (only the two active interfaces).

Desc.: Whenever ONE of the more than 500 clients "loses" one of its
interfaces, all other clients' actions are "stopped" by the NFS server:
the server becomes unresponsive to client requests until the condition
is cleared. Clearing the condition consists of stopping and restarting
lockd and statd on the affected client, then going to the server,
stopping lockd and statd, clearing out /etc/sm, /etc/sm.bak and
/etc/state, and restarting the daemons. The customer has become "good"
at recognizing this condition and performs exactly this sequence to
remedy the situation each time it occurs.
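
The recovery sequence described above can be written down as an ordered
plan. The following is a minimal dry-run sketch only: it assumes the
daemons are controlled through the System Resource Controller as
subsystems rpc.lockd and rpc.statd (stopsrc/startsrc), and that /etc/sm
and /etc/sm.bak are directories while /etc/state is a file. The function
name recovery_plan and the exact stop/start order are assumptions for
illustration, not something recorded in this item. Nothing is executed;
the commands are only printed.

```python
# Dry-run sketch: build the ordered recovery commands; nothing is executed.
# Assumption: lockd and statd are SRC subsystems named rpc.lockd / rpc.statd.

def recovery_plan(role):
    """Return the shell commands for one host; role is 'client' or 'server'."""
    if role not in ("client", "server"):
        raise ValueError("role must be 'client' or 'server'")
    plan = ["stopsrc -s rpc.lockd", "stopsrc -s rpc.statd"]
    if role == "server":
        # Clear the status monitor's record of which hosts hold locks.
        plan.append("rm -f /etc/sm/* /etc/sm.bak/* /etc/state")
    # Assumption: statd is started before lockd.
    plan += ["startsrc -s rpc.statd", "startsrc -s rpc.lockd"]
    return plan


if __name__ == "__main__":
    # Affected client first, then the server, as in the description above.
    for role in ("client", "server"):
        print("# on the %s:" % role)
        for cmd in recovery_plan(role):
            print(cmd)
```

Keeping the plan as data makes it easy to review the order before anyone
runs the commands by hand on a production server.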

Action: What we need now is a line trace of the traffic on the wire
when the event recurs. Specifically, I would be looking for the
following:

1.) Is the client application that locked the file still trying to
reach the server, but via the default route (the other interface)
instead of the one that went down?
2.) If #1 is true: how does the server react to what it sees as a
different client trying to continue the conversation that the first one
initiated?
3.) Does the router offer a route to either of the server's interfaces
no matter which subnet we are on? (Can a client that has lost its en0
interface, and with it its hostname for reachability purposes, still
contact the server's en0 interface via the en1 LAN?)
4.) If #3 holds true, what happens to RPC authentication when the
server does a reverse name lookup? Is the server then trying to respond
to the client's "dead" en0 hostname, and not getting through?
5.) Since file locking is designed to survive a reboot of the NFS
server, is it simply the lock on this file that is keeping all of the
other clients from accessing the file until the condition is cleared?
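
Question 4 can be partly checked ahead of time, without waiting for the
event to recur, by comparing forward and reverse name resolution for
both of a client's names as seen from the server. The following is a
rough sketch under stated assumptions: the helper names check_names and
mismatches are made up for illustration, the hostnames in the usage note
are hypothetical, and the sketch only shows what the resolver returns,
not what lockd or statd actually do with it.

```python
import socket


def check_names(names, forward=socket.gethostbyname,
                reverse=socket.gethostbyaddr):
    """Map each hostname to (address, name returned by reverse lookup)."""
    results = {}
    for name in names:
        try:
            addr = forward(name)
            back = reverse(addr)[0]   # first element is the primary hostname
        except OSError:               # gaierror and herror are OSError subclasses
            results[name] = (None, None)
            continue
        results[name] = (addr, back)
    return results


def mismatches(results):
    """Return the names whose reverse lookup does not give the same name back."""
    return sorted(name for name, (addr, back) in results.items()
                  if back != name)
```

Run on the server with a client's en0 hostname and its en1 alias, for
example check_names(["clienta", "clienta-en1"]) (both names
hypothetical). Any name listed by mismatches() is one the server would
know under a different hostname after a reverse lookup.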

If #5 is true: suggest either running HACMP on each client (4 NICs,
just like the server) so that we never have any downed interfaces,
and/or adding some network management/interface monitoring software to
alert someone immediately when an interface goes down anywhere on
either LAN.
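
As an illustration of the interface monitoring suggested above, the
following minimal sketch probes every client interface name and reports
the ones that do not answer. It is not a substitute for a network
management product. It assumes the system ping command accepts
-c <count>, and the function names and the client table layout are
hypothetical.

```python
import subprocess


def ping_once(host, timeout_s=2):
    """Return True if a single ICMP echo to host is answered."""
    # Assumption: the system ping accepts "-c 1"; the timeout is enforced
    # here by subprocess rather than by a ping option.
    try:
        rc = subprocess.run(["ping", "-c", "1", host],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL,
                            timeout=timeout_s).returncode
    except (subprocess.TimeoutExpired, OSError):
        return False
    return rc == 0


def down_interfaces(clients, probe=ping_once):
    """clients maps a client label to its interface hostnames.

    Returns (label, hostname) pairs for every interface that failed the probe.
    """
    return [(label, name)
            for label, names in sorted(clients.items())
            for name in names
            if not probe(name)]
```

Called periodically with a table such as
{"clienta": ["clienta", "clienta-en1"]} (hypothetical names), a
non-empty result would be the signal to alert someone before the lock
problem spreads to the other clients.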

N: Customer has to maintain 24x7 availability and cannot reproduce the
problem at will, but will attempt to place a sniffer on the wire the
next time this happens to confirm any of the above possibilities.

Customer wishes to close.


Support Line: (central) nfs server can not communicate with work station ITEM: BP6949L
Dated: February 1997 Category: N/A
This HTML file was generated 99/06/24~13:30:20