NFS Tuning

AIX Versions 3.2 and 4 Performance Tuning Guide

NFS allows programs on one system to access files on another system transparently by mounting the remote directory. Normally, when the server is booted, directories are made available by the exportfs command, and the daemons to handle remote access (nfsds) are started. Similarly, the mounts of the remote directories and the initiation of the appropriate numbers of biods to handle remote access are performed during client system boot.

The figure "NFS Client-Server Interaction" illustrates the structure of the dialog between NFS clients and a server. When a thread in a client system attempts to read or write a file in an NFS-mounted directory, the request is redirected from the normal I/O mechanism to one of the client's NFS block I/O daemons (biods). The biod sends the request to the appropriate server, where it is assigned to one of the server's NFS daemons (nfsds). While that request is being processed, neither the biod nor the nfsd involved does any other work.

How Many biods and nfsds Are Needed for Good Performance?

Because biods and nfsds handle one request at a time, and because NFS response time is often the largest component of overall response time, it is undesirable to have threads blocked for lack of a biod or nfsd. The general considerations for configuring NFS daemons are:

Increasing the number of daemons cannot compensate for inadequate client or server processor power or memory, or inadequate server disk bandwidth. Before changing the number of daemons, you should check server and client resource-utilization levels with iostat and vmstat.
NFS daemons are comparatively cheap. A biod costs 36KB of memory (9 pages total, 4 of them pinned), while an nfsd costs 28KB (7 pages total, 2 of them pinned). Of course, the unpinned pages are only in real memory if the nfsd or biod has been active recently. Further, idle AIX nfsds do not consume CPU time.
All NFS requests go through an nfsd; only reads and writes go through a biod.
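The memory figures above can be turned into a quick estimate of what a given daemon configuration costs. The following is a minimal sketch, not an AIX tool: the function name daemon_memory_kb is hypothetical, and the 4KB page size is inferred from the figures in the text (36KB for 9 pages).

```python
PAGE_KB = 4  # inferred from the text: 36KB / 9 pages

# (total pages, pinned pages) per daemon, as quoted in the text
DAEMON_PAGES = {"biod": (9, 4), "nfsd": (7, 2)}


def daemon_memory_kb(biods, nfsds):
    """Return (worst_case_kb, pinned_kb) for the given daemon counts.

    worst_case_kb assumes every daemon has been active recently, so that
    all of its unpinned pages are also in real memory.
    """
    total_kb = 0
    pinned_kb = 0
    for name, count in (("biod", biods), ("nfsd", nfsds)):
        pages, pinned_pages = DAEMON_PAGES[name]
        total_kb += count * pages * PAGE_KB
        pinned_kb += count * pinned_pages * PAGE_KB
    return total_kb, pinned_kb


if __name__ == "__main__":
    total, pinned = daemon_memory_kb(biods=6, nfsds=8)
    print(f"worst case {total}KB, of which {pinned}KB is pinned")
```

Only the pinned figure is a fixed cost. The remainder is an upper bound that applies while the daemons are busy.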

Choosing Initial Numbers of nfsds and biods

Determining the best numbers of nfsds and biods is an iterative process. Rules of thumb can give you no more than a reasonable starting point.

By default there are six biods on a client and eight nfsds on a server. The defaults are a good starting point for small systems, but should probably be increased for client systems with more than two users or servers with more than two clients. A few guidelines are:

In each client, estimate the maximum number of files that will be written simultaneously. Configure at least two biods per file. If the files are large (more than 32KB), you may want to start with four biods per file to support read-ahead or write-behind activity. It is common for up to five biods to be busy writing to a single large file.
In each server, start by configuring as many nfsds as the sum of the numbers of biods that you have configured on the clients to handle files from that server. Add 20% to allow for non-read/write NFS requests.
If you have fast client workstations connected to a slower server, you may have to constrain the rate at which the clients generate NFS requests. The best solution is to reduce the number of biods on the clients, with due attention to the relative importance of each client's workload and response time.
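The first two guidelines above amount to simple arithmetic, sketched below. The function names are hypothetical, and the result is only a starting point for the iterative tuning described in the next section.

```python
LARGE_FILE_BYTES = 32 * 1024  # "large" threshold from the guideline


def initial_biods(concurrent_files, typical_file_bytes):
    """Starting biod count for one client.

    Two biods per simultaneously written file, or four per file when
    the files are large enough to benefit from read-ahead/write-behind.
    """
    per_file = 4 if typical_file_bytes > LARGE_FILE_BYTES else 2
    return concurrent_files * per_file


def initial_nfsds(client_biods):
    """Starting nfsd count for one server.

    client_biods lists the biods configured on each client for files
    from this server. The sum is increased by 20% (rounded up) to
    allow for non-read/write requests.
    """
    total = sum(client_biods)
    return (total * 120 + 99) // 100  # integer ceiling of total * 1.2


if __name__ == "__main__":
    clients = [initial_biods(3, 100 * 1024), initial_biods(3, 8 * 1024), 8]
    print("client biods:", clients)
    print("server nfsds:", initial_nfsds(clients))
```

The rounding is done in integer arithmetic so that an exact 20% increase is never pushed up by floating-point error.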

Tuning the Numbers of nfsds and biods

After you have arrived at an initial number of biods and nfsds, or have changed one or the other:

First, recheck the affected systems for CPU or I/O saturation with vmstat and iostat. If the server is now saturated, you need to reduce its load or increase its power, or both.
Use netstat -s to determine whether any system is experiencing UDP socket buffer overflows. If one is, use no -a to verify that the recommendations in "Tuning Other Layers to Improve NFS Performance" have been implemented. If they have, and the system is not saturated, you should increase the number of biods or nfsds.
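The order of the checks above matters: saturation is ruled out first, then network tuning, and only then are daemons added. A minimal sketch of that decision sequence follows. The function name and the returned strings are hypothetical, and the three inputs are judgments you make from the vmstat, iostat, netstat -s, and no -a output rather than values this code collects.

```python
def next_tuning_step(saturated, udp_overflows, network_tuned):
    """Return the next action suggested by the checks in the text.

    saturated      -- vmstat/iostat show CPU or I/O saturation
    udp_overflows  -- netstat -s shows UDP socket buffer overflows
    network_tuned  -- no -a confirms the network-layer recommendations
    """
    if saturated:
        return "reduce the load or increase the power of the system"
    if not udp_overflows:
        # The text gives no instruction for this case.
        return "no daemon change indicated by these checks"
    if not network_tuned:
        return "apply the network-layer tuning recommendations first"
    return "increase the number of biods or nfsds"


if __name__ == "__main__":
    print(next_tuning_step(saturated=False, udp_overflows=True,
                           network_tuned=True))
```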

The numbers of nfsds and biods are changed with the chnfs command. To change the number of nfsds on a server to 10, both immediately and at each subsequent system boot, you would use:

# chnfs -n 10

To change the number of biods on a client to 8 temporarily, with no permanent change (that is, the change happens now but is lost at the next system boot), you would use:

# chnfs -N -b 8

To change both the number of biods and the number of nfsds on a system to 9, with the change delayed until the next system boot (that is, the next IPL), you would use:

# chnfs -I -b 9 -n 9

In extreme cases of a client overrunning the server, it may be necessary to reduce the client to one biod. This can be done with:

# stopsrc -s biod

This leaves the client with the kproc biod still running.

Performance Implications of Hard or Soft NFS Mounts

One of the choices you make when configuring NFS-mounted directories is whether the mounts will be hard or soft. When, after a successful mount, an access to a soft-mounted directory encounters an error (typically, a timeout), the error is immediately reported to the program that requested the remote access. When an access to a hard-mounted directory encounters an error, NFS retries the operation.

A persistent error accessing a hard-mounted directory can escalate into a perceived performance problem because the default number of retries (1000) and the default timeout value (0.7 second), combined with an algorithm that increases the timeout value for successive retries, mean that NFS will try practically forever (subjectively) to complete the operation.
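To see why the defaults feel like "forever," it helps to add up the waits. The text does not specify how the timeout grows, so the sketch below assumes that it doubles after each retry and is capped at 60 seconds. Both the doubling factor and the cap are assumptions made for illustration, not documented AIX behavior. Only the 1000 retries and the 0.7-second initial timeout come from the text.

```python
def total_retry_wait_tenths(retries=1000, initial_tenths=7,
                            factor=2, cap_tenths=600):
    """Total time spent waiting, in tenths of a second.

    factor and cap_tenths are ASSUMED values; the text says only that
    the timeout increases for successive retries.
    """
    total = 0
    timeout = initial_tenths
    for _ in range(retries):
        total += timeout
        timeout = min(timeout * factor, cap_tenths)
    return total


if __name__ == "__main__":
    tenths = total_retry_wait_tenths()
    print(f"{tenths / 10:.1f} seconds, about {tenths / 36000:.1f} hours")
```

Under these assumptions the operation would be retried for more than 16 hours before failing, which is consistent with the subjective description above.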

It is technically possible to reduce the number of retries, or increase the timeout value, or both, using options of the mount command. Unfortunately, changing these values sufficiently to remove the perceived performance problem might lead to unnecessary reported hard errors. Instead, hard-mounted directories should be mounted with the intr option, which allows the user to interrupt from the keyboard a process that is in a retry loop.

Although soft-mounting the directories would cause the error to be detected sooner, it runs a serious risk of data corruption. In general, read/write directories should be hard mounted.

Tuning to Avoid Retransmits

Related to the hard-versus-soft mount question is the question of the appropriate timeout duration for a given network configuration. If the server is heavily loaded, is separated from the client by one or more bridges or gateways, or is connected to the client by a WAN, the default timeout criterion may be unrealistic. If so, both server and client will be burdened with unnecessary retransmits. For example, if

$ nfsstat -cr

reports a significant number (> 5% of the total) of both timeouts and badxids, you could increase the timeo parameter with:

# smit chnfsmnt

Identify the directory you want to change, and enter a new value on the line "NFS TIMEOUT. In tenths of a second." For LAN-to-LAN traffic via a bridge, try 50 (tenths of seconds). For WAN connections, try 200. Check the NFS statistics again after at least one day. If they still indicate excessive retransmits, increase timeo by 50% and try again. You will also want to look at the server workload and the loads on the intervening bridges and gateways to see if any element is being saturated by other traffic.
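The procedure above can be summarized as a threshold test followed by a 50% step. The sketch below uses hypothetical function names and assumes that "5% of the total" means 5% of the total client RPC calls reported by nfsstat -cr. You read the counters from that output yourself; the code does not parse it.

```python
def needs_longer_timeout(calls, timeouts, badxids, percent=5):
    """True if BOTH timeouts and badxids exceed percent% of calls."""
    if calls <= 0:
        return False
    limit = calls * percent
    return timeouts * 100 > limit and badxids * 100 > limit


def next_timeo(current_tenths):
    """Increase timeo (in tenths of a second) by 50%, rounded up."""
    return (current_tenths * 3 + 1) // 2


if __name__ == "__main__":
    timeo = 50  # suggested starting value for a bridged LAN
    if needs_longer_timeout(calls=1000, timeouts=80, badxids=60):
        timeo = next_timeo(timeo)
    print("timeo for the next observation period:", timeo)
```

Because both counters must be high, a large number of timeouts without matching badxids does not trigger an increase.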

Tuning the NFS File-Attribute Cache

NFS maintains a cache on each client system of the attributes of recently accessed directories and files. Five parameters that can be set in the /etc/filesystems file control how long a given entry is kept in the cache. They are:

actimeo	Absolute time for which file and directory entries are kept in the file-attribute cache after an update. If specified, this value overrides the following *min and *max values, effectively setting them all to the actimeo value.
acregmin	Minimum time after an update that file entries will be retained. The default is 3 seconds.
acregmax	Maximum time after an update that file entries will be retained. The default is 60 seconds.
acdirmin	Minimum time after an update that directory entries will be retained. The default is 30 seconds.
acdirmax	Maximum time after an update that directory entries will be retained. The default is 60 seconds.

Each time the file or directory is updated, its removal is postponed for at least acregmin or acdirmin seconds. If this is the second or subsequent update, the entry is kept at least as long as the interval between the last two updates, but not more than acregmax or acdirmax seconds.
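The retention rule above is a clamp: the interval between the last two updates, bounded below by the *min value and above by the *max value. A minimal sketch follows, with a hypothetical function name. It models only the rule as stated in the text, not the actual client implementation.

```python
def retention_seconds(update_interval, ac_min, ac_max, actimeo=None):
    """Seconds an entry is retained after an update.

    update_interval -- seconds between the last two updates, or None
                       if this is the first update
    actimeo         -- if given, overrides both ac_min and ac_max
    """
    if actimeo is not None:
        ac_min = ac_max = actimeo
    if update_interval is None:
        return ac_min
    return max(ac_min, min(update_interval, ac_max))


if __name__ == "__main__":
    # File entry with the default acregmin=3 and acregmax=60
    for interval in (None, 1, 10, 500):
        print(interval, "->", retention_seconds(interval, 3, 60))
```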

Disabling Unused NFS ACL Support

If your workload does not use the NFS ACL support on a mounted file system, you can reduce the workload on both client and server to some extent by specifying:

options = noacl

as part of the client's /etc/filesystems stanza for that file system.

Tuning for Maximum Caching of NFS Data

NFS does not have a data caching function, but the AIX Virtual Memory Manager caches pages of NFS data just as it caches pages of disk data. If a system is essentially a dedicated NFS server, it may be appropriate to permit the VMM to use as much memory as necessary for data caching. This is accomplished by setting the maxperm parameter, which controls the maximum percentage of memory occupied by file pages, to 100% with:

# vmtune -P 100

The same technique could be used on NFS clients, but would only be appropriate if the clients were running workloads that had very little need for working-segment pages.

Tuning Other Layers to Improve NFS Performance

NFS uses UDP to perform its network I/O. You should be sure that the tuning techniques described in "TCP and UDP Performance Tuning" and "mbuf Pool Performance Tuning" have been applied. In particular, you should:

Ensure that the LAN adapter transmit and receive queues are set to the maximum (150).
Increase the maximum socket buffer size (sb_max) to at least 131072. If the MTU size is not 4096 bytes or larger, set sb_max to at least 262144. Set the UDP socket buffer sizes (udp_sendspace and udp_recvspace) to 131072 also.
If possible, increase the MTU size on the LAN. On a 16Mb Token Ring, for example, an increase in MTU size from the default 1492 bytes to 8500 bytes allows a complete 8KB NFS read or write request to be transmitted without fragmentation. It also makes much more efficient use of mbuf space, reducing the probability of overruns.
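The effect of the MTU size on fragmentation can be estimated as follows. This is a rough sketch: the function name is hypothetical, the 200-byte allowance for RPC and NFS headers is an assumed round figure, and the MTU is treated as the largest IP packet the link carries. The 20-byte IP header (without options) and 8-byte UDP header are the standard sizes.

```python
IP_HEADER = 20   # bytes, IPv4 header without options
UDP_HEADER = 8   # bytes


def ip_fragments(nfs_data_bytes, mtu, rpc_overhead=200):
    """Estimate the IP fragments needed for one NFS request over UDP.

    rpc_overhead is an ASSUMED allowance for RPC/NFS headers.
    """
    datagram = nfs_data_bytes + rpc_overhead + UDP_HEADER
    if datagram <= mtu - IP_HEADER:
        return 1
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return -(-datagram // per_fragment)  # integer ceiling


if __name__ == "__main__":
    for mtu in (1492, 8500):
        print(f"MTU {mtu}: {ip_fragments(8192, mtu)} fragment(s)")
```

With these assumptions an 8KB request needs six fragments at an MTU of 1492 bytes and a single packet at 8500 bytes. Losing any one fragment forces retransmission of the whole request, which is why the larger MTU helps.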

Increasing NFS Socket Buffer Size

In the course of tuning UDP, you may find that the command:

$ netstat -s

shows a significant number of UDP socket buffer overflows. As with ordinary UDP tuning, you should increase the sb_max value. You also need to increase the value of nfs_chars, which specifies the size of the NFS socket buffer. The sequence:

# no -o sb_max=131072
# nfso -o nfs_chars=130000
# stopsrc -s nfsd
# startsrc -s nfsd

sets sb_max to a value at least 100 bytes larger than the desired value of nfs_chars, sets nfs_chars to 130000, then stops and restarts the nfsds to put the new values into effect. If you determine that this change improves performance, you should put the no and nfso commands in /etc/rc.nfs, just before the startsrc command that starts the nfsds.
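The relationship between the two values can be checked before you issue the commands. The sketch below uses hypothetical function names and encodes only the 100-byte margin stated above. It does not query or set the tunables.

```python
MARGIN = 100  # sb_max must exceed nfs_chars by at least this many bytes


def buffers_consistent(sb_max, nfs_chars, margin=MARGIN):
    """True if sb_max is at least `margin` bytes larger than nfs_chars."""
    return sb_max - nfs_chars >= margin


def largest_nfs_chars(sb_max, margin=MARGIN):
    """Largest nfs_chars value that still satisfies the margin."""
    return sb_max - margin


if __name__ == "__main__":
    print(buffers_consistent(131072, 130000))
    print(largest_nfs_chars(131072))
```

For an sb_max of 131072, any nfs_chars value up to 130972 satisfies the rule.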

NFS Server Disk Configuration

NFS servers that experience high levels of write activity can benefit from configuring the journal logical volume on a separate physical volume from the data volumes. This technique is discussed in "Disk Pre-Installation Guidelines".

Hardware Accelerators

Prestoserve

The objective of the Prestoserve product is to reduce NFS write latency by providing a faster method than disk I/O of satisfying the NFS requirement for synchronous writes. It provides nonvolatile RAM into which NFS can write data. The data is then considered "safe," and NFS can allow the client to proceed. The data is later written to disk as device availability allows. Ultimately, it is impossible to exceed the long-term bandwidth of the disk, but since much NFS traffic is in bursts, Prestoserve is able to smooth out the workload on the disk with sometimes dramatic performance effects.

Interphase Network Coprocessor

This product handles NFS protocol processing on Ethernets, reducing the load on the CPU. NFS protocol processing is particularly onerous on Ethernets because NFS blocks must be broken down to fit within Ethernet's MTU size of 1500 bytes.

Misuses of NFS That Affect Performance

Many of the misuses of NFS occur because people don't realize that the files they are accessing are at the other end of an expensive communication path. A few examples we have seen are:

A COBOL application running on one AIX system doing random updates of an NFS-mounted inventory file, in support of a real-time retail cash register application.
A development environment in which a source code directory on each system was NFS-mounted on all of the other systems in the environment, with developers logging onto arbitrary systems to do editing and compiles. This practically guaranteed that all of the compiles would be obtaining their source code from, and writing their output to, remote systems.
Running the ld command on one system to transform .o files in an NFS-mounted directory into an a.out file in the same directory.

It can be argued that these are valid uses of the transparency provided by NFS. Perhaps so, but these uses cost processor time and LAN bandwidth and degrade response time. When a system configuration involves NFS access as part of the standard pattern of operation, the configuration designers should be prepared to defend the consequent costs with offsetting technical or business advantages, such as:

Placing all of the data or source code on a server, rather than on individual workstations, will improve source-code control and simplify centralized backups.
A number of different systems access the same data, making a dedicated server more efficient than one or more systems combining client and server roles.