Introduction to NFS Performance
|Originally published October, 1995|
|© 1995, 2005 Carlo Kopp|
Performance is the most important metric of any file serving protocol, all other things being equal. The NFS protocol, in most of its implementations, employs a number of measures specifically aimed at improving performance. These we will now examine more closely.
Server Implementation - the nfsd
A central aspect of a typical Unix NFS server implementation is the design of the NFS daemon or nfsd. The NFS daemon is an RPC server process which services NFS requests coming in from clients.
In practice, an NFS daemon will listen for incoming NFS protocol requests, which once received, must be serviced by executing the appropriate series of Unix system calls required to fulfill the request. Once the service is complete, the results are returned via the RPC protocol to the requesting client.
Achieving good performance in a file serving protocol however requires a little more than simply setting up a single server process. The environment which is typical for a file server will usually see many concurrent requests coming in from a multiplicity of client platforms and processes. The simple-minded solution of a single server process would therefore require the queuing up of incoming requests, which are then serviced one by one. Subject to the dictates of queuing theory, this strategy would result in abysmal performance, as the queue of waiting requests ultimately has to wait for the server host's disks to be serviced. Disks are not fast devices (see OSR July/August 95).
It is for this reason that typical NFS daemon implementations are designed to run several copies of the daemon concurrently. This means that each incoming NFS service request can be handed over to an uncommitted daemon which may then immediately attempt its necessary file system operation, such as a read, write or directory operation.
This approach very cleverly exploits the fact that filesystem operations vary significantly in duration, as a result of host cache hit-to-miss ratios, and where a miss occurs, as a result of filesystem and disk performance. Daemons which are servicing requests which are short in duration are available to service new requests much more quickly than their peers which are servicing slower operations. The pool of NFS daemons will therefore be split into a group of active daemons, servicing RPC requests, and a pool of idling daemons, waiting for requests. Only should the situation arise, where all daemons are busy, will the incoming requests begin to experience significant queuing delays in the NFS mechanism.
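The effect of running a pool of daemons rather than a single server process can be illustrated with a toy queuing model. The arrival rate, cache hit ratio and service times below are assumptions chosen for the sketch, not measured values:

```python
import random

def mean_wait(n_daemons, n_requests=2000, seed=1):
    """Crude event model: requests arrive ~10 ms apart on average and
    are handed to whichever daemon frees up first. Service time is
    short for a buffer cache hit (1 ms) or long for a disk miss
    (30 ms). All figures are illustrative assumptions."""
    rng = random.Random(seed)
    t = 0.0
    free_at = [0.0] * n_daemons        # time at which each daemon goes idle
    total_wait = 0.0
    for _ in range(n_requests):
        t += rng.expovariate(1.0 / 0.010)               # next arrival
        service = 0.001 if rng.random() < 0.8 else 0.030
        i = min(range(n_daemons), key=free_at.__getitem__)
        start = max(t, free_at[i])                      # queue if all are busy
        total_wait += start - t
        free_at[i] = start + service
    return total_wait / n_requests

print("1 nfsd : %.2f ms mean queuing delay" % (1000 * mean_wait(1)))
print("8 nfsds: %.2f ms mean queuing delay" % (1000 * mean_wait(8)))
```

With a single daemon, every request queued behind a slow disk operation waits out that operation; with eight daemons, short requests overtake the slow ones and the mean queuing delay collapses.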
Parallel processing is not the only trick exploited in an NFS server. Unix kernels typically employ a number of caching techniques to speed up filesystem performance, and the NFS daemon relies heavily upon these caches to improve its response time.
It is worth noting here that the proper sizing of these caches will benefit local filesystem operations equally, and proper tuning of a server should include a review of cache performance.
Client Implementation - the biod
A typical NFS client implementation is built around the NFS biod daemon. The biod performs a number of important functions within an NFS client, all of which serve to improve client NFS performance.
The first function which is performed by the biod is the integration of NFS operations with the client platform's buffer cache. The Unix buffer cache on the client platform will cache blocks read from the filesystem, or written to, thereby significantly improving I/O performance (as above). The biod provides the capability to cache blocks read or written in NFS operations in the host's buffer cache, thereby speeding up performance for those operations which repeatedly access the same blocks. Because an NFS operation is typically much slower than a local disk operation, the relative gain resulting from the use of the buffer cache is significant.
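The gain from the buffer cache follows directly from the gap between a memory copy and an NFS round trip. A minimal sketch, with illustrative (assumed) timings of 0.1 ms for a cached block and 20 ms for a trip to the server:

```python
def mean_access_time(hit_rate, t_cache=0.0001, t_nfs=0.020):
    """Average time per block access for a given buffer cache hit
    rate. The two timings are assumptions for illustration: ~0.1 ms
    for a copy out of the client's buffer cache, ~20 ms for a full
    NFS round trip to the server."""
    return hit_rate * t_cache + (1.0 - hit_rate) * t_nfs

for h in (0.0, 0.5, 0.9, 0.99):
    print("hit rate %3.0f%% -> %6.2f ms/access"
          % (100 * h, 1000 * mean_access_time(h)))
```

Even a 50% hit rate halves the average access time; because the miss penalty is so much larger than a local disk's, the relative payoff of caching is greater on an NFS client than on a standalone host.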
The second function performed by the biod is that of parallel operations, much like that of the nfsd in the server. Processes making operations against multiple files in the Unix virtual file system (VFS) will generate multiple NFS requests, which may be immediately handed over to multiple instances of the biod, each of which will then attempt to service the request, either from the buffer cache or the server. Short duration operations will result in those instances of the biod becoming available for further operations much sooner than their peers executing slow operations.
A third function performed by the biod is that of multiplexing dirty pages, which occurs where the size of the buffer cache page differs from the 8 kbyte buffer size of NFS. A host with a native page size of 4096 bytes will see two buffer cache pages loaded into a single NFS buffer for transmission to the server, when being flushed. This approach provides some reduction in the overhead of flushing dirty pages on an NFS client to the server, as it incurs only one NFS write operation per several cache pages to be flushed.
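The packing arithmetic can be sketched in a few lines, assuming the 4096-byte page and 8 kbyte NFS buffer sizes discussed above:

```python
NFS_BUF = 8192   # NFS transfer buffer size in bytes
PAGE    = 4096   # assumed native VM page size of the client

def pack_dirty_pages(pages):
    """Coalesce a list of dirty cache pages into NFS-buffer-sized
    writes. With 4096-byte pages, two pages travel in each 8 kbyte
    NFS write, halving the number of write RPCs."""
    per_buf = NFS_BUF // PAGE
    return [pages[i:i + per_buf] for i in range(0, len(pages), per_buf)]

writes = pack_dirty_pages(list(range(6)))   # six dirty pages
print("%d NFS writes instead of 6" % len(writes))  # → 3 NFS writes
```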
An NFS client can, in theory, function without the biod, but in doing so loses its capacity to exploit the client platform's cache. This situation can also arise in practice, where the client generates requests at a rate significantly faster than the pool of biod processes can handle, as a result of which all become busy and further requests must be serviced synchronously, bypassing the pool.
An additional performance enhancing feature available in many biod implementations is that of read lookahead caching, whereby the biod will not only fetch the requested block from the server, but also the subsequent block. Where a CPU is executing an NFS mounted remote binary, this feature can be very helpful as it effectively pipelines multiple reads.
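The read lookahead idea can be sketched as follows; the class and fetch callback are hypothetical names invented for the illustration, not part of any real biod implementation:

```python
class ReadAheadClient:
    """Sketch of read lookahead caching: on a cache miss for block n,
    fetch block n and prefetch block n+1 in the same visit to the
    server, so a sequential reader hits the cache on alternate reads."""
    def __init__(self, fetch):
        self.fetch = fetch          # callback: block number -> data
        self.cache = {}
        self.server_trips = 0
    def read(self, n):
        if n not in self.cache:
            self.server_trips += 1
            self.cache[n] = self.fetch(n)
            self.cache[n + 1] = self.fetch(n + 1)   # lookahead block
        return self.cache[n]

c = ReadAheadClient(lambda n: b"block%d" % n)
for n in range(8):      # sequential access, e.g. paging in a remote binary
    c.read(n)
print("%d server trips for 8 sequential blocks" % c.server_trips)  # → 4
```

For the sequential access pattern of executing a remote binary, every second read is satisfied locally, which is exactly the pipelining effect described above.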
The last important performance enhancing feature of the client side NFS implementation is that of file attribute caching. File attribute manipulations are relatively frequent operations (e.g. doing an ls -l) which more often than not involve small amounts of data but significant time spent rummaging around in the filesystem. A typical NFS client implementation will therefore cache the attributes of files accessed, so that subsequent getattr and setattr operations need not be propagated to the server immediately. To maintain the consistency of the file attribute cache, it is periodically flushed to the server, or invalidated to force a reread from the server when next accessed.
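A minimal sketch of such an attribute cache, assuming a simple fixed time-to-live after which an entry is considered stale and refetched (class and callback names are invented for the illustration):

```python
class AttrCache:
    """Toy client file attribute cache: getattr results are kept for
    ttl seconds so repeated stat-style calls (an ls -l, say) need not
    each cost an RPC; an expired entry is refetched on next access."""
    def __init__(self, getattr_rpc, ttl=3.0):
        self.getattr_rpc = getattr_rpc   # callback standing in for the real RPC
        self.ttl = ttl
        self.cache = {}                  # filehandle -> (attrs, expiry time)
        self.rpc_count = 0
    def getattr(self, fh, now):
        entry = self.cache.get(fh)
        if entry is None or now >= entry[1]:
            self.rpc_count += 1          # cache miss or stale: go to server
            entry = (self.getattr_rpc(fh), now + self.ttl)
            self.cache[fh] = entry
        return entry[0]

c = AttrCache(lambda fh: {"size": 1024}, ttl=3.0)
c.getattr("fh1", now=0.0)   # miss: one RPC
c.getattr("fh1", now=1.0)   # hit: served from the cache
c.getattr("fh1", now=5.0)   # expired: refetched
print("%d getattr RPCs for 3 stat calls" % c.rpc_count)  # → 2
```

The ttl embodies the consistency trade-off described above: a longer ttl saves more RPCs but lets the client see stale attributes for longer.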
An NFS Operation - Putting It All Together
The easiest way of understanding the potential performance bottlenecks in an NFS client server pair is to trace an operation against a file which has not been previously read, as this will cause all block caching mechanisms along the path to miss.
Let us consider a read operation against file "blogs", sitting on a fileserver. Our client application will make a read call against the file descriptor associated with blogs. This will fall through to the vnode on the client, which identifies the file as an NFS filesystem object.
The virtual filesystem code will then invoke the NFS read operation against blogs. The biod is then called, and it first checks to see whether the file has been cached. As the file has already been opened, its attributes have been cached, but no pages are present in the buffer cache, so it initiates an NFSPROC_READ against the server.
The NFSPROC_READ call results in the sending of an NFS NFSPROC_READ message, encapsulated by the User Datagram Protocol (UDP). The UDP packet is enqueued by the network protocol stack handling code, which in turn encapsulates it in an IP packet and hands it to the network driver. The network driver sends the packet over the network.
At the receiving end, the network device driver receives the packet and forwards it to the protocol stack handlers, which identify the protocol type and port, eventually stripping off the headers and passing the NFS NFSPROC_READ message to one of the idling nfsd daemons.
The daemon decodes the NFS request and executes a read against the Unix filesystem. During this read, the page is cached in the server's buffer cache, and the Directory Name Lookup Cache (DNLC) is updated where appropriate. The block fetched from the filesystem is then attached to the NFS RPC response, and undergoes the same steps of encapsulation, transmission and decapsulation at the client end.
The client biod, upon receiving the block, caches it in the client's buffer cache, and updates its attribute cache appropriately. The data is then copied into the I/O buffer associated with the library read operation and the read is essentially complete.
This somewhat simplified model illustrates the lengthy path and complexity of operations which must be executed in an NFS read. From a performance perspective, we must consider that the request has had to be initially processed by the biod, then propagated through the client's network protocol stack, transmitted over the network, propagated up through the server's protocol stack and processed by the server's nfsd. The nfsd has then had to call the server's filesystem, which has then called the device driver, which has operated upon the disk to access the block. The block must then be returned via the same path, incurring similar delays in the process of doing so.
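The point is easiest to see by adding up the stages. Every figure below is an assumption for the sketch, not a measurement, but the proportions are representative: the disk dominates, with the wire a distant second:

```python
# Illustrative per-stage delays (microseconds) for one uncached NFS read.
request = {
    "client biod + RPC/UDP encode":     300,
    "client IP stack + driver":         200,
    "wire time (request + 8 kB reply)": 6600,   # ~8 kbyte at 10 Mbit/s
    "server stack + nfsd decode":       500,
    "filesystem + disk read":         15000,    # dominated by the seek
}
reply_path = 300 + 200 + 500       # return trip through both stacks, no disk
total_us = sum(request.values()) + reply_path
print("one uncached NFS read ~ %.1f ms" % (total_us / 1000.0))
```

Trimming any one software stage helps, but the totals make clear why cache hit rates and disk subsystem performance dwarf everything else.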
The performance of the network protocol stacks, network drivers, and server filesystems and storage are important, as is the performance of the client's biod and the server's nfsd. Should any of these entities incur a significant delay, the performance of the operation as a whole will suffer accordingly.
Performance Metrics for NFS Client Server Interaction
There are two basic ways in which we can quantify the performance of an NFS client-server pair. The first, and simplest method, is to look at throughput in the same fashion as the throughput of a local filesystem.
In local filesystem operations, as well as NFS operations, the upper limit of throughput performance is achieved with a 100% hit rate to the client's buffer cache. Under these conditions, all reads and writes will be done against the client's memory and thus the throughput performance will be determined by the memory bandwidth of the client platform. Figures under these conditions vary from Megabytes to tens of Megabytes per second.
As we reduce the buffer cache hit rate, the average throughput will diminish, as those operations which resulted in misses will incur the time delays discussed above. At some point the buffer cache hit rate drops to zero and we observe the raw throughput of the "physical" channel, which for a disk is the combination of filesystem/device-driver/disk, and for an NFS server the aforementioned NFS-RPC-network-RPC-NFS channel.
Whereas a filesystem/disk can typically sustain several Megabytes per second of throughput, this is not the case with an NFS channel, which is ultimately bottlenecked by the network interface (see diagram 1). Given all of the overheads previously discussed, it is not surprising to find that sustained throughput over the network is limited to about 0.9 Megabytes per second, just over 70% of the theoretical limit of a 10 Mbit/s Ethernet (1.25 Megabytes/s). This performance is typically one to one and a half orders of magnitude beneath the performance of a local disk. This is a very good reason for avoiding diskless clients like the plague.
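The arithmetic behind these figures is worth making explicit; the 4 MB/s local disk figure below is an assumption typical of the era, not a measurement:

```python
ethernet_bps = 10_000_000                   # classic 10 Mbit/s Ethernet
theoretical_MBps = ethernet_bps / 8 / 1e6   # 1.25 Mbytes/s on the wire
nfs_MBps = 0.9                              # sustained NFS throughput cited above
print("NFS sustains %.0f%% of raw Ethernet bandwidth"
      % (100 * nfs_MBps / theoretical_MBps))

disk_MBps = 4.0                             # assumed local SCSI disk throughput
print("a local disk is ~%.1fx faster than the NFS channel"
      % (disk_MBps / nfs_MBps))
```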
At some further point this throughput performance will diminish again, if we compromise the performance of the server by introducing a poor buffer cache hit rate and poor filesystem and disk response time.
The second useful method for quantifying performance is by looking at the response time for NFS operations, as a function of increasing load upon the server. This is the approach used by Legato's nhfsstone benchmark, and its successor, the LADDIS benchmark, which is now part of the SPEC SFS benchmark suite. While Legato's benchmark is in the public domain, unlike the SPEC product, it has been written around the BSD kernel and cannot be run on SVR4 as a result (whoever ports nhfsstone to SVR4 will make a lot of friends very quickly).
The nhfsstone benchmark measures performance by firing a mixture of NFS requests at a server, and measuring the time it takes for each request to be processed. The mix is representative of the typical ratios of operation types seen in operation under typical conditions. The benchmark will measure the time for a given load in operations per second, and then incrementally increase this load by spawning additional processes, each of which concurrently fire requests at the server. In this fashion it is possible to plot the server's (or rather client-server pair's) NFS operation response time as a function of increasing load (see diagram 2).
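The shape of the resulting curve follows from elementary queuing behaviour. A toy open-queue model, assuming a 5 ms mean per-operation service time (an illustrative figure, not one from the benchmark):

```python
def response_time(load_ops, service_ms=5.0):
    """Toy M/M/1-style model of an NFS server under an offered load
    of load_ops operations/second: response time grows slowly at
    light load, then sharply as the server approaches saturation."""
    capacity = 1000.0 / service_ms          # ops/s the server can sustain
    rho = load_ops / capacity               # utilisation
    if rho >= 1.0:
        return float("inf")                 # saturated: queue grows without bound
    return service_ms / (1.0 - rho)         # mean response time in ms

for ops in (50, 100, 150, 180, 195):
    print("%3d ops/s -> %7.1f ms" % (ops, response_time(ops)))
```

The knee near saturation is exactly what an nhfsstone or LADDIS plot exhibits, and the load at which it occurs is the practically useful capacity of the server.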
A typical nhfsstone or LADDIS characteristic will display the behaviour of a compound queuing system, followed by a steep drop as the server saturates and performance degrades.
Measures for Improving NFS Performance
Improving NFS performance is an interesting exercise, and given the complexity of NFS, often useful gains can be made with a very modest effort. Should however exceptional performance be sought, then more expensive means such as accelerator hardware must be employed.
The first step in improving NFS performance is to look at the state of the client and the server platforms. Should either be experiencing performance difficulties in handling their computational or local disk I/O load, then these problems must be rectified first. Should there be a shortfall of memory or compute idle time, tinkering with NFS will yield little result. Similarly should the server's disk I/O be saturated, altering NFS setup will be quite useless. Starvation of system resources impacts all activities on the host.
Assuming that the client and server are in a healthy condition, we then turn our attention to the client and take a look at its biod processes. Most Unix NFS clients as delivered will be set up with four or eight biod processes which are started at boot time. The use of ps or top will indicate whether these are coming under significant load, or not. If yes, then additional instances of the daemon should be run, and the number should be increased until one or more daemons are always idle at the required load.
Where the client buffer cache can be altered in size, this would be another area to examine. Should the buffer cache have a poor hit rate due to undersizing, enlarging it should be of some use.
Network performance can affect both client and server, and should either suffer from a saturated interface, consideration should be given to fitting the host with multiple interfaces to multiple Ethernets to spread the load. Another option is to look at a faster network, such as 100Base-T or FDDI. Should a faster network be available, then the issue of host interface throughput limits becomes relevant. An issue which has raised some discussion in the context of faster networks is the NFS/biod/nfsd internal buffer size of 8 kbyte, which is seen by some observers as a bottleneck under these conditions.
Assuming that client performance has been improved, we can now turn to the server. Again the first step is to look at daemon activity levels, and determine whether more daemons should be running. Again, buffer cache performance should be examined, and if necessary, the number of users and size of the inode table should be increased to an appropriate level.
Whether an NFS accelerator is required will depend upon whether the achieved server performance, after tuning, is acceptable or not.
A conventional filesystem, or an NFS mounted filesystem, where caching is working properly, tends to see a much higher rate of writes to disk than reads, as most reads are absorbed by the cache. This is in fact the idea behind log structured filesystems.
In an NFS environment, a server heavily burdened with write traffic will suffer a loss in overall NFS performance as its NFS daemons will spend most of their time waiting for filesystem/disk writes to complete, compared to read operations which will largely hit the server's buffer cache.
NFS accelerator hardware resolves this problem by providing a battery backed cache memory dedicated to caching NFS writes. In a system using an accelerator, the nfsd can complete its NFS write by dumping the block into the NFS write cache, and therefore avoid having to wait for the filesystem and disk I/O subsystem to complete the operation. The battery backed NFS cache is then asynchronously flushed to the disk. Should the power fail, the NFS cache retains the data until the next boot, at which time it is flushed to disk.
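The decoupling the accelerator provides can be sketched as follows; the class and callback names are invented for the illustration, and a real accelerator does this in hardware, not in the nfsd:

```python
class NVRAMWriteCache:
    """Sketch of an NFS accelerator: the nfsd 'completes' a write by
    copying the block into battery-backed memory and replying at
    once; a background flusher later commits it to disk. disk_write
    stands in for the slow filesystem/disk path."""
    def __init__(self, disk_write, capacity_blocks=1024):
        self.disk_write = disk_write
        self.capacity = capacity_blocks
        self.pending = {}                  # (filehandle, offset) -> data
    def nfs_write(self, fh, offset, data):
        if len(self.pending) >= self.capacity:
            self.flush()                   # cache full: degrade to disk speed
        self.pending[(fh, offset)] = data  # fast path: memory speed only
        return "OK"                        # client sees the write as complete
    def flush(self):
        for (fh, offset), data in self.pending.items():
            self.disk_write(fh, offset, data)
        self.pending.clear()

written = []
cache = NVRAMWriteCache(lambda fh, off, d: written.append((fh, off)))
cache.nfs_write("fh1", 0, b"x" * 8192)   # returns before any disk I/O
print("on disk after write :", written)   # nothing committed yet
cache.flush()                             # asynchronous flush, done later
print("on disk after flush :", written)
```

Note that correctness rests entirely on the memory being non-volatile: the client has been told the write is stable, so the cache contents must survive a power failure until they reach the disk.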
The usage of an NFS accelerator will decouple NFS performance from filesystem and disk write performance, and under the proper conditions, significant performance gains may be achieved.
Given its ubiquity in the marketplace, NFS is a surprisingly complex beast, which requires good insight if high performance is to be achieved. Providing that a systematic approach is followed, most sites can benefit from some measure of NFS tuning. The expense of purchasing substantially larger server platforms should only ever be sought as a last resort, where tuning and the use of an accelerator have failed to yield the desired result. The NFS environment is one situation where a little insight can certainly save a lot of expense.
Diagram 1 Text
NFS Client Throughput Performance. Produced by the Self Scaling Benchmark (P.M.Chen et al), this plot provides a good indication of file read and write performance with a range of buffer cache hit rates. With a high cache hit rate, average performance is in excess of 1.5 Mbytes/s, but declines rapidly. The plateau between 30 and 80 MB file size is where the buffer cache becomes ineffective, and performance is dominated by the Ethernet channel. Beyond 80 MB aggregate performance diminishes rapidly due to filesystem behaviour on the server. The client was an Indigo 2 R4400 with 128 MB memory, the server a 4 CPU Iris 4D/240S with 96 MB memory. The benchmark tests I/O and buffer cache performance by reading and writing files of ever increasing size; once the file size exceeds the buffer cache size, hit rates suffer accordingly (Author).
Diagram 2 Text
NFS client server pair performance benchmarked with nhfsstone shows the performance of a SPARCStation 2 server with 64 MB memory and 24 active nfsd daemons. The client was another SPARCStation 2, with 6 MB memory. Both systems were running SunOS 4.1 (BSD). Note the queuing system behaviour of the unaccelerated server (courtesy Legato).
|$Revision: 1.1 $|
|Last Updated: Sun Apr 24 11:22:45 GMT 2005|
|Artwork and text © 2005 Carlo Kopp|