NFS Problem Determination

AIX Version 4.3 System Management Guide: Communications and Networks

As with other network services, problems can occur on machines that use the Network File System (NFS). Troubleshooting for these problems involves understanding the strategies for tracking NFS problems, recognizing NFS-related error messages, and selecting the appropriate solutions. When tracking down an NFS problem, isolate each of the three main points of failure to determine which is not working: the server, the client, or the network itself.

Note: See "Troubleshooting the Network Lock Manager" for file lock problems.

Identifying Hard-Mounted and Soft-Mounted File Problems

When the network or server has problems, programs that access hard-mounted remote files fail differently from those that access soft-mounted remote files.

If a server fails to respond to a hard-mount request, NFS prints the message:

NFS server hostname not responding, still trying

Hard-mounted remote file systems cause programs to hang until the server responds because the client retries the mount request until it succeeds. You should use the -bg flag with the mount command when performing a hard mount so that if the server does not respond, the client will retry the mount in the background.

If a server fails to respond to a soft-mount request, NFS prints the message:

Connection timed out

Soft-mounted remote file systems return an error after trying unsuccessfully for a while. Unfortunately, many programs do not check return conditions on file system operations, so you do not see this error message when accessing soft-mounted files. However, this NFS error message will print on the console.
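Because a soft mount reports failure through the return status of the operation, a script that touches soft-mounted files should test that status explicitly. The following is a minimal sketch, not part of NFS itself; the helper name copy_checked and its messages are hypothetical.

```shell
#!/bin/sh
# copy_checked SRC DST
# Hypothetical helper: copy a file and report a failure instead of
# silently ignoring it, as a program on a soft mount should.
copy_checked() {
    src=$1
    dst=$2
    if cp "$src" "$dst"; then
        echo "copied $src"
    else
        status=$?    # exit status of the failed cp
        echo "copy of $src failed (exit status $status)" >&2
        return 1
    fi
}
```

A caller can then branch on the result, for example: copy_checked /mnt/remote/file /tmp/file || echo "remote file unavailable". The path /mnt/remote/file is only an illustration.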

Identifying NFS Problems Checklist

If a client is having NFS trouble, do the following:

Verify that the network connections are good.
Verify that the inetd, portmap, and biod daemons are running on the client, by following the instructions in "Getting the Current Status of the NFS Daemons".
Verify that a valid mount point exists for the file system being mounted. For more information, see "Configuring an NFS Client".

Verify that the server is up and running by running the following command at the shell prompt of the client:

/usr/bin/rpcinfo -p server_name

If the server is up, a list of programs, versions, protocols, and port numbers is printed, similar to the following:

program  vers proto   port
100000    2   tcp    111  portmapper
100000    2   udp    111  portmapper
100005    1   udp   1025  mountd
100001    1   udp   1030  rstatd
100001    2   udp   1030  rstatd
100001    3   udp   1030  rstatd
100002    1   udp   1036  rusersd
100002    2   udp   1036  rusersd
100008    1   udp   1040  walld
100012    1   udp   1043  sprayd
100005    1   tcp    694  mountd
100003    2   udp   2049  nfs
100024    1   udp    713  status
100024    1   tcp    715  status
100021    1   tcp    716  nlockmgr
100021    1   udp    718  nlockmgr
100021    3   tcp    721  nlockmgr
100021    3   udp    723  nlockmgr
100020    1   udp    726  llockmgr
100020    1   tcp    728  llockmgr
100021    2   tcp    731  nlockmgr

If a similar response is not returned, log in to the server at the server console and check the status of the inetd daemon by following the instructions in "Getting the Current Status of the NFS Daemons".

Verify that the mountd, portmap and nfsd daemons are running on the NFS server by entering the following commands at the client shell prompt:
```
/usr/bin/rpcinfo -u server_name mount
/usr/bin/rpcinfo -u server_name portmap
/usr/bin/rpcinfo -u server_name nfs
```
If the daemons are running at the server, the following responses are returned:
```
program 100005 version 1 ready and waiting
program 100000 version 2 ready and waiting
program 100003 version 2 ready and waiting
```
The program numbers correspond to the commands, respectively, as shown in the example output above. If a similar response is not returned, log in to the server at the server console and check the status of the daemons by following the instructions in "Getting the Current Status of the NFS Daemons".
Verify that the /etc/exports file on the server lists the name of the file system that the client wants to mount and that the file system is exported. Do this by entering the command:
```
showmount -e server_name
```
This command lists all the file systems currently exported by server_name.
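The server-side checks in this checklist can be partly automated by filtering the command output. The sketch below assumes the rpcinfo -p and showmount -e output layouts shown in this section (service name in the last column, exported directory in the first column); the helper names missing_rpc_services and is_exported are hypothetical.

```shell
#!/bin/sh
# missing_rpc_services: read "rpcinfo -p" output on standard input and
# print the services NFS mounting depends on that are NOT registered.
missing_rpc_services() {
    awk '
        # Data rows start with a numeric program number; the service
        # name is the last column. Header lines are ignored.
        $1 ~ /^[0-9]+$/ { seen[$NF] = 1 }
        END {
            n = split("portmapper mountd nfs", want, " ")
            out = ""
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen))
                    out = out (out == "" ? "" : " ") want[i]
            print out
        }
    '
}

# is_exported DIR: read "showmount -e" output on standard input and
# succeed only if DIR appears as an exported directory.
is_exported() {
    awk -v dir="$1" '$1 == dir { found = 1 } END { exit !found }'
}
```

Typical use would be /usr/bin/rpcinfo -p server_name | missing_rpc_services, which prints an empty line when all three services are registered, and showmount -e server_name | is_exported /home/bar, where /home/bar stands for the file system the client wants to mount.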

Asynchronous Write Errors

When an application program writes data to a file in an NFS-mounted file system, the write operation is scheduled for asynchronous processing by the biod daemon. If an error occurs at the NFS server at the same time that the data is actually written to disk, the error is returned to the NFS client and the biod daemon saves the error internally in NFS data structures. The stored error is subsequently returned to the application program the next time it calls either the fsync or close functions. As a consequence of such errors, the application is not notified of the write error until the program closes the file. A typical example of this event is when a file system on the server is full, causing writes attempted by a client to fail.

NFS Error Messages

The following sections explain error messages that can be generated while using NFS.

nfs_server Error Message

Insufficient transmit buffers on your network can cause the following error message:

nfs_server: bad sendreply

To increase transmit buffers, use the Web-based System Manager fast path, wsm devices, or the System Management Interface Tool (SMIT) fast path, smit commodev. Then select your adapter type, and increase the number of transmit buffers.

mount Error Messages

A remote mounting process can fail in several ways. The error messages associated with mounting failures are as follows:

`mount: ... already mounted`
	The file system that you are trying to mount is already mounted.
`mount: ... not found in /etc/filesystems`
	The specified file system or directory name cannot be matched. If you issue the mount command with either a directory or file system name but not both, the command looks in the /etc/filesystems file for an entry whose file system or directory field matches the argument. If the mount command finds an entry such as the following:
	/dancer.src:
	        dev=/usr/src
	        nodename = d61server
	        type = nfs
	        mount = false
	then it performs the mount as if you had entered the following at the command line:
	/usr/sbin/mount -n d61server -o rw,hard /usr/src /dancer.src
`... not in hosts database`
	On a network without Network Information Service (NIS), this message indicates that the host specified to the mount command is not in the /etc/hosts file. On a network running NIS, the message indicates that NIS could not find the host name in the /etc/hosts database or that the NIS ypbind daemon on your machine has died. If the /etc/resolv.conf file exists so that the name server is being used for host name resolution, there can be a problem in the named database. See "Name Resolution on an NFS Server".
	Check the spelling and the syntax in your mount command. If the command is correct, your network does not run NIS, and you only get this message for this host name, check the entry in the /etc/hosts file.
	If your network is running NIS, make sure that the ypbind daemon is running by entering the following at the command line:
	ps -ef
	You should see the ypbind daemon in the list. Try using the rlogin command to log in remotely to another machine, or use the rcp command to remote-copy something to another machine. If this also fails, your ypbind daemon is probably stopped or hung. If you only get this message for this host name, you should check the /etc/hosts entry on the NIS server.
`mount: ... server not responding: port mapper failure - RPC timed out`
	Either the server you are trying to mount from is down or its port mapper is stopped or hung. Try rebooting the server to restart the inetd, portmap, and ypbind daemons. If you cannot log in to the server remotely with the rlogin command but the server is up, you should check the network connection by trying to log in remotely to some other machine. You should also check the server's network connection.
`mount: ... server not responding: program not registered`
	This means that the mount command got through to the port mapper, but the rpc.mountd NFS mount daemon was not registered.
`mount: access denied ...`
	Your machine name is not in the export list for the file system you are trying to mount from the server. You can get a list of the server's exported file systems by running the following command at the command line:
	showmount -e hostname
	If the file system you want is not in the list, or your machine name or netgroup name is not in the user list for the file system, log in to the server and check the /etc/exports file for the correct file system entry. A file system name that appears in the /etc/exports file, but not in the output from the showmount command, indicates a failure in the mountd daemon. Either the daemon could not parse that line in the file, it could not find the directory, or the directory name was not a locally mounted directory. If the /etc/exports file looks correct and your network runs NIS, check the server's ypbind daemon. It may be stopped or hung.
`mount: ...: Permission denied`
	This message is a generic indication that some part of authentication failed on the server. As in the previous example, you might not be in the export list. Alternatively, the server could not recognize your machine's ypbind daemon, or the server does not accept the identity you provided. Check the server's /etc/exports file and, if applicable, the ypbind daemon. If the server does not accept the identity you provided, you can change your host name with the hostname command and retry the mount command.
`mount: ...: Not a directory`
	Either the remote path or the local path is not a directory. Check the spelling in your command and confirm that both paths exist and are directories.
`mount: ...: You are not allowed`
	You must have root authority or be a member of the system group to run the mount command on your machine because it affects the file system for all users on that machine. NFS mounts and unmounts are only allowed for root users and members of the system group.
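The messages above can be folded into a small triage helper for scripts that capture the output of a failed mount. This is only a sketch: the function name mount_error_hint is hypothetical, and each hint is a condensed restatement of the explanations in this section rather than additional diagnosis.

```shell
#!/bin/sh
# mount_error_hint MESSAGE
# Hypothetical helper: map a mount error message to the first thing
# to check, following the explanations in this section.
mount_error_hint() {
    case $1 in
        *"already mounted"*)
            echo "the file system is already mounted" ;;
        *"not found in /etc/filesystems"*)
            echo "check the file system or directory name against /etc/filesystems" ;;
        *"not in hosts database"*)
            echo "check /etc/hosts, the ypbind daemon, or the named database" ;;
        *"port mapper failure"*)
            echo "server is down or its port mapper is hung; check the server and the network" ;;
        *"program not registered"*)
            echo "rpc.mountd is not registered with the server's port mapper" ;;
        *"access denied"*)
            echo "check the server export list with showmount -e" ;;
        *"Permission denied"*)
            echo "check the server's /etc/exports file and the ypbind daemon" ;;
        *"Not a directory"*)
            echo "check that the remote and local paths are directories" ;;
        *"You are not allowed"*)
            echo "run mount as root or as a member of the system group" ;;
        *)
            echo "unrecognized message"
            return 1 ;;
    esac
}
```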

Identifying the Cause of Slow Access Times for NFS

If access to remote files seems unusually slow, ensure that access time is not being inhibited by a runaway daemon, a bad tty line, or a similar problem.

Checking Processes

At the server, enter the following at the command line:

ps -ef

If the server seems fine and other users are getting timely responses, make sure your biod daemons are running. Try the following steps:

1. Run the ps -ef command and look for the biod daemons in the display. If they are not running or are hung, continue with steps 2 and 3.
2. Stop the biod daemons that are in use by issuing the following command:
```
stopsrc -x biod -c
```
3. Start the biod daemons by issuing the following command:
```
startsrc -s biod
```

To determine if the biod daemons are hung, run the ps command as above, copy a large file from a remote system, and then run the ps command again. If the biod daemons do not accumulate CPU time, they are probably hung.
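The before-and-after comparison described above can be scripted by totalling the CPU time of the biod processes in each ps snapshot. The sketch below assumes that in ps -ef output the TIME column immediately precedes the command path and is formatted as minutes:seconds or hours:minutes:seconds; the helper name biod_cpu_seconds is hypothetical.

```shell
#!/bin/sh
# biod_cpu_seconds: read "ps -ef" output on standard input and print
# the total CPU time, in seconds, accumulated by all biod processes.
biod_cpu_seconds() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                # Find the command field, with or without a path.
                if ($i == "biod" || $i ~ /\/biod$/) {
                    # The field before the command is assumed to be TIME.
                    n = split($(i - 1), t, ":")
                    secs = 0
                    for (j = 1; j <= n; j++)
                        secs = secs * 60 + t[j]
                    total += secs
                    break
                }
            }
        }
        END { print total + 0 }
    '
}
```

For example, run before=$(ps -ef | biod_cpu_seconds), copy a large file from the remote system, then run after=$(ps -ef | biod_cpu_seconds). If the two totals are equal, the biod daemons accumulated no CPU time and are probably hung.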

Checking Network Connections

If the biod daemons are working, check the network connections. The nfsstat command determines whether you are dropping packets. Use the nfsstat -c and nfsstat -s commands to determine if the client or server is retransmitting large blocks. Retransmissions are always a possibility due to lost packets or busy servers. A retransmission rate of 5% is considered high.
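To compare a system against the 5% guideline, divide the retransmission count by the call count reported by nfsstat. The sketch below takes the two counters as arguments rather than parsing nfsstat output, because the exact output layout is not shown here; the helper names retrans_percent and retrans_is_high are hypothetical.

```shell
#!/bin/sh
# retrans_percent CALLS RETRANS
# Print the retransmission rate as a percentage with two decimals.
retrans_percent() {
    awk -v calls="$1" -v retrans="$2" 'BEGIN {
        if (calls <= 0) { print "0.00"; exit }
        printf "%.2f\n", 100 * retrans / calls
    }'
}

# retrans_is_high CALLS RETRANS
# Succeed when the rate reaches the 5% level that this section
# describes as high.
retrans_is_high() {
    awk -v calls="$1" -v retrans="$2" 'BEGIN {
        exit !(calls > 0 && 100 * retrans / calls >= 5)
    }'
}
```

With counters read from nfsstat -c, a check such as retrans_is_high 1000 50 && echo "retransmission rate is high" flags a client that retransmits one call in twenty.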

The probability of retransmissions can be reduced by changing communication adapter transmit queue parameters. The System Management Interface Tool (SMIT) can be used to change these parameters.

The following values are recommended for NFS servers.

Communication Adapter Maximum Transmission Unit (MTU) and Transmit Queue Sizes
Adapter	MTU	Transmit Queue
Token Ring 4Mb	1500	50
Token Ring 4Mb	3900	40 (Increase if the nfsstat command times out.)
Token Ring 16Mb	1500	40 (Increase if the nfsstat command times out.)
Token Ring 16Mb	8500	40 (Increase if the nfsstat command times out.)
Ethernet	1500	40 (Increase if the nfsstat command times out.)

The larger MTU sizes for each token-ring speed reduce processor use and significantly improve read/write operations.

Notes:
Apply these values to NFS clients if retransmissions persist.

All nodes on a network must use the same MTU size.

Setting MTU Sizes

To set MTU size, use the Web-based System Manager fast path, wsm network, or the SMIT fast path, smit chif. Select the appropriate adapter and enter an MTU value in the Maximum IP Packet Size field.

The ifconfig command can be used to set MTU size (and must be used to set MTU size at 8500). The format for the ifconfig command is:

ifconfig trn NodeName up mtu MTUSize

where trn is your adapter name, for example, tr0.

Another method of setting MTU sizes combines the ifconfig command with SMIT.

Add the ifconfig command for token rings, as illustrated in the previous example, to the /etc/rc.bsdnet file.
Enter the smit setbootup_option fast path. Toggle the Use BSD Style field to yes.

Setting Transmit Queue Sizes

Communication adapter transmit queue sizes are set with SMIT. Enter the smit chgtok fast path, select the appropriate adapter, and enter a queue size in the Transmit field.

Fixing Hung Programs

If programs hang during file-related work, the NFS server could have stopped. In this case, the following error message may be displayed:

NFS server hostname not responding, still trying

The NFS server (hostname) is down. This indicates a problem with the NFS server, the network connection, or the NIS server.

Check the servers from which you have mounted file systems if your machine hangs completely. If one or more of them is down, do not be concerned. When the server comes back up, your programs continue automatically. No files are destroyed.

If a soft-mounted server dies, other work is not affected. Programs that time out trying to access soft-mounted remote files fail with the errno message, but you will still be able to access your other file systems.

If all servers are running, determine whether others who are using the same servers are having trouble. More than one machine having service problems indicates a problem with the server's nfsd daemons. In this case, log in to the server and run the ps command to see if the nfsd daemon is running and accumulating CPU time. If not, you may be able to stop and then restart the nfsd daemon. If this does not work, you have to reboot the server.

Check your network connection and the connection of the server if other systems seem to be up and running.

Permissions and Authentication Schemes

Sometimes, after mounts have been successfully established, there are problems in reading, writing, or creating remote files or directories. Such difficulties are usually due to permissions or authentication problems. Permission and authentication problems can vary in cause depending on whether NIS is being used and secure mounts are specified.

The simplest case occurs when nonsecure mounts are specified and NIS is not used. In this case, user IDs (UIDs) and group IDs (GIDs) are mapped solely through the server's and client's /etc/passwd and /etc/group files, respectively. In this scheme, for a user named john to be identified both on the client and on the server as john, the user john must have the same UID number in the /etc/passwd file on both machines. The following is an example of how this might cause problems:

User john is uid 200 on client foo.
User john is uid 250 on server bar.
User jane is uid 200 on server bar.

The /home/bar directory is mounted from server bar onto client foo. If user john is editing files on the /home/bar remote file system on client foo, confusion results when he saves files.

The server bar thinks the files belong to user jane, because jane is UID 200 on bar. If john logs on directly to bar by using the rlogin command, he may not be able to access the files he just created while working on the remotely mounted file system. jane, however, is able to do so because the machines arbitrate permissions by UID, not by name.

The only permanent solution to this is to reassign consistent UIDs on the two machines. For example, give john UID 200 on server bar or 250 on client foo. The files owned by john would then need to have the chown command run against them to make them match the new ID on the appropriate machine.
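Mismatches of this kind can be found before they cause trouble by comparing the two /etc/passwd files. The sketch below assumes you have a copy of the server's /etc/passwd file available on the client (how you obtain it is up to you); the helper name uid_mismatches is hypothetical.

```shell
#!/bin/sh
# uid_mismatches CLIENT_PASSWD SERVER_PASSWD
# Print every user name that exists in both files with different UIDs.
uid_mismatches() {
    awk -F: '
        # First file: remember each user name and its UID.
        NR == FNR { uid[$1] = $3; next }
        # Second file: report names whose UID differs.
        ($1 in uid) && uid[$1] != $3 {
            printf "%s: %s on client, %s on server\n", $1, uid[$1], $3
        }
    ' "$1" "$2"
}
```

For the john and jane example, the function would report john with UID 200 on the client and 250 on the server. It does not detect that jane's UID on the server collides with john's on the client; that requires comparing UIDs rather than names.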

Because of the problems with maintaining consistent UID and GID mappings on all machines in a network, NIS is often used to perform the appropriate mappings so that this type of problem is avoided. NIS maintains a database that takes care of the mappings of UID and GID identities across the network. See "Configuring NIS" for more information.

Name Resolution on an NFS Server

When an NFS server services a mount request, it looks up the name of the client making the request. The server takes the client Internet Protocol (IP) address and looks up the corresponding host name that matches that address. Once the host name has been found, the server looks at the exports list for the requested directory and checks the existence of the client's name in the access list for the directory. If an entry exists for the client and the entry matches exactly what was returned for the name resolution, then that part of the mount authentication passes.

If the server is not able to perform the IP address-to-host-name resolution, the server denies the mount request. The server must be able to find some match for the client IP address making the mount request. If the directory is exported with the access being to all clients, the server still must be able to do the reverse name lookup to allow the mount request.

The server also must be able to look up the correct name for the client. For example, suppose the /etc/exports file contains an entry like the following:

/tmp   -access=silly:funny

and the /etc/hosts file contains the following corresponding entries:

150.102.23.21    silly.domain.name.com
150.102.23.52    funny.domain.name.com

Notice that the names do not correspond exactly. When the server looks up the IP address-to-host-name matches for the hosts silly and funny, the string names do not match exactly with the entries in the access list of the export. This type of name resolution problem usually occurs when using the named daemon for name resolution. Most named daemon databases have aliases for the full domain names of hosts so that users do not have to enter full names when referring to hosts. Even though these host-name-to-IP address entries exist for the aliases, the reverse lookup may not exist. The database for reverse name lookup (IP address to host name) usually has entries containing the IP address and the full domain name (not the alias) of that host. Sometimes the export entries are created with the shorter alias name, causing problems when clients try to mount.
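The exact-match requirement can be illustrated with a small string comparison. This sketch only mimics the comparison described above, under the assumption that the access list is a colon-separated list of names as in the -access example; it does not query the server, and the helper name name_in_access_list is hypothetical.

```shell
#!/bin/sh
# name_in_access_list NAME ACCESS_LIST
# Succeed only if NAME appears exactly as one entry of the
# colon-separated ACCESS_LIST (for example "silly:funny").
name_in_access_list() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

With the entries above, name_in_access_list silly.domain.name.com silly:funny fails because the reverse lookup returns the full domain name, while name_in_access_list silly silly:funny succeeds.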

Limitations on the Number of Groups in the NFS Structure

On systems that use NFS Version 3.2, users cannot be a member of more than 16 groups without complications. (Groups are defined by the groups command.) If a user is a member of 17 or more groups, and the user tries to access files owned by the 17th (or greater) group, the system will not allow the file to be read or copied. To permit the user access to the files, rearrange the group order.

Mounting from NFS Servers That Have Earlier Versions of NFS

When mounting a file system from a pre-Version 3 NFS server onto a Version 3 NFS client, a problem occurs when the user on the client executing the mount is a member of more than eight groups. Some servers are not able to deal correctly with this situation and deny the request for the mount. The solution is to reduce the user's group membership to eight or fewer groups and then retry the mount. The following error message is characteristic of this group problem:

RPC:  Authentication error; why=Invalid client credential
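Both group limits described above (16 groups for file access, eight groups when mounting from a pre-Version 3 server) can be checked before a problem appears. The sketch below relies on the standard id -G command, which prints one group ID per group; the helper names count_groups and exceeds_group_limit are hypothetical.

```shell
#!/bin/sh
# count_groups USER
# Print the number of groups USER belongs to.
count_groups() {
    # id -G prints the group IDs separated by spaces; let the shell
    # split them into positional parameters and count them.
    set -- $(id -G "$1")
    echo $#
}

# exceeds_group_limit COUNT LIMIT
# Succeed when COUNT is greater than LIMIT.
exceeds_group_limit() {
    [ "$1" -gt "$2" ]
}
```

For example, exceeds_group_limit "$(count_groups john)" 8 && echo "john may be refused by older servers" warns about the mount problem, and a limit of 16 checks the file access limit instead.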

Problems That Occur If the NFS Kernel Extension Is Not Loaded

Some NFS commands do not execute correctly if the NFS kernel extension is not loaded. Some commands with this dependency are: nfsstat, exportfs, mountd, nfsd, and biod. When NFS is installed on the system, the kernel extension is placed in the /usr/lib/drivers/nfs.ext file. This file is then loaded as the NFS kernel extension when the system is configured. The script that loads this kernel extension is the /etc/rc.net file. Loading the NFS kernel extension is only one of many tasks that this script performs. Note that the Transmission Control Protocol/Internet Protocol (TCP/IP) kernel extension should be loaded before the NFS kernel extension is loaded.

Note: The gfsinstall command is used to load the NFS kernel extension into the kernel when the system initially starts. This command can be run more than once per system boot and it will not cause a problem. The system is currently shipped with the gfsinstall command used in both the /etc/rc.net and /etc/rc.nfs files. This is correct. There is no need to remove either of these calls.