



Brave Little Toasters?

Originally published June 1997
by Carlo Kopp
© 1997, 2005 Carlo Kopp

The massive growth in networked computing has produced some most interesting technological trends. We are beginning to see a gradual adaptation to this paradigm, with an increasing trend toward specialisation, rather than the use of general purpose computing equipment, for network service applications.

Whether this trend will follow through to the displacement of general purpose platforms on most sites remains to be seen. A more likely outcome is that the market will yet again split into smaller segments, as has happened in the past, with specialised platforms occupying their respective niches and absorbing the brunt of the growth in the market. The first notable example of this specialisation was the proliferation of the router, which on most sites displaced the Unix host tasked with packet routing. Whether the Network Computer (read X terminal with embedded/local clients) succeeds as well as the router remains to be seen, but certainly on many sites the NC is likely to offer significant price/performance benefits, particularly in support, against the mainstay of desktop platforms, the humble PC.

A more recent item of specialised hardware to enter the market is the dedicated file/HTTP server, typified by the Network Appliance family of servers. More commonly known as "appliances", these are NFS/CIFS (SMB) or HTTP protocol servers which run a simple, dedicated, proprietary operating system, providing user access solely through the aforementioned network protocols. Without the overheads associated with established operating systems, such devices can provide very high throughput performance for a given installed CPU and memory.

This should not come as a surprise, because any general purpose computing platform, be it Unix or a proprietary system such as NT, usually carries significant internal overheads associated with the need to support time-sharing multi-user operation and very often also a GUI. The internal management overheads associated with multi-user operation, the complexity of the required kernel support for a wide range of device drivers, the often clever and complex scheduling mechanisms, and the need for virtual memory all increase the complexity of such general purpose operating systems. Additional complexity translates into costs, directly through the development effort required, and indirectly through the greater hardware grunt required and more complex system administration.

A highly specialised appliance trades the flexibility of the general purpose system for raw performance per dollar. In an environment where the sole purpose of an installed general purpose platform is to perform a special purpose function, a specialised appliance arguably offers a better return on investment. To better appreciate the technological issues involved we shall explore the Network Appliance family of "appliances", with a particular focus on where they differ from general purpose platforms and how this reflects in achievable performance.

The Network Appliance - Hardware

The Network Appliance family of servers employs essentially generic hardware components, using a custom operating system to provide the complete system level package. Smaller appliances such as the F220 and F330 employ a generic Pentium based motherboard, third party OEM SCSI controllers and network interface cards (10Base-T, 100Base-T or FDDI), all packaged in a custom designed rack style chassis.

The largest machine in the series, the F540, employs a DEC Alpha RISC CPU board. Smaller systems employ 4 GB Seagate Hawk drives with fast/narrow/single-ended interfaces, with the larger 200 GB F540 systems using fast/wide/differential 4 GB Seagate Barracuda drives.

The disks are accessed in a RAID-4 scheme. Evidently Netapp's designers have aimed to minimise build costs and internal design overheads by using standard OEM hardware components. This is a clever strategy as it offers mature, stable hardware at highly competitive costs. It also shifts many of the design overheads and testing issues on to the card and component suppliers, and represents an excellent example of leveraging the cheaply available high performance hardware in the current marketplace.

An interesting note here is that the earliest Cisco routers were also built up from generic OEM cards, an implementation parallel to the same basic model of specialised rather than general purpose network resources. An interesting feature of Netapp hardware is the standard inclusion of Non-Volatile RAM (NVRAM) to support both filesystem integrity and NFS acceleration. Unlike the established model used by products such as Prestoserve, which employ an add-in NVRAM card in a standard Unix host, the Netapp approach integrates the NVRAM at a more fundamental level.

In summary, the Netapp hardware is in many respects boringly basic, put together from standard components selected for very high performance. This translates into lower build costs and, arguably, better product stability in comparison with custom designed and built board sets.

The Network Appliance - Kernel and Protocol Stacks

The operating system of the Appliance is the really interesting part, and indeed will be the biggest single part of this discussion piece. The simplest overview is that the Appliance uses a microkernel, running dedicated and complete top-to-bottom protocol stacks for NFS, CIFS (SMB/RFC-1001/1002) and HTTP (GET only), all interfaced to a log structured file system. The kernel was written by Dave Hitz, one of the founders of Netapp, and is closest in design to the Thoth microkernel, a design which is similar in some respects to the Walnut microkernel project which the author was previously involved with at Monash.

The Appliance microkernel employs a relatively simple scheduling mechanism and a message passing inter-process communications scheme, and is designed to be very small and frugal in terms of required hardware resources such as memory and CPU time, in comparison with monolithic kernels such as Unix, or pretend microkernels such as NT. Virtually all of the workload is performed at process level, including device driver upper halves, protocol processing and filesystem operations. The use of a microkernel strategy rather than a monolithic kernel strategy was clearly intended to minimise the time overheads of context switching, which is a critical performance factor in heavily interrupt loaded applications.
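To make the message passing model a little more concrete, the following is a minimal sketch in C of a Thoth-style synchronous send/receive rendezvous. All names and structures here are the author's own illustration, as the internals of the Netapp kernel are not published:

    /* Illustrative sketch of Thoth-style synchronous message passing,
     * as used in microkernels of this class. Names are hypothetical. */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        int  sender;            /* process id of the sender             */
        char body[64];          /* fixed-size message, typical of Thoth */
    } message_t;

    /* In a real microkernel, send() blocks the caller until the
     * receiver replies; here we model one rendezvous in miniature. */
    static message_t mailbox;
    static int       mailbox_full = 0;

    static void msg_send(int sender, const char *text)
    {
        mailbox.sender = sender;
        strncpy(mailbox.body, text, sizeof mailbox.body - 1);
        mailbox_full = 1;       /* sender now blocks until reply */
    }

    static int msg_receive(message_t *out)
    {
        if (!mailbox_full)
            return -1;          /* nothing queued */
        *out = mailbox;
        mailbox_full = 0;       /* reply unblocks the sender */
        return 0;
    }

    int main(void)
    {
        message_t m;
        msg_send(1, "NFS write request");  /* driver process -> FS process */
        if (msg_receive(&m) == 0)
            printf("process %d delivered: %s\n", m.sender, m.body);
        return 0;
    }

The point to note is that such a rendezvous is paid for only where processes genuinely must communicate; as described above, the bulk of the protocol and filesystem work proceeds within a single process.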

For instance, a conventional operating system such as Unix must context switch to the interrupt service of the device driver, then context switch to the top half of the device driver, then context switch to the inetd and rpcd daemons, then the nfsd/biod and finally to the kernel resident filesystem and its associated block drivers for disk storage devices. If RAID is implemented in software, this may also require an additional context switch to accommodate a RAID pseudo-device driver to execute the required block mappings.

Each context switch can be quite costly, as a fixed chunk of CPU time is required to save registers and process state, and in general tidy up the CPU state. If we further factor in virtual memory, paging and swapping, these incur additional time overheads. It follows therefore that at high interrupt loads, a significant amount of CPU time can be gobbled up in context switches alone. As a relevant example this author likes to recall his experience in testing an SS2 with a purely interrupt driven (ie no DMA) HDLC serial driver.

At about 8000 to 9000 sustained interrupts per second the 40 MHz SPARC running SunOS 4.1 was spending circa 70% of its CPU on the interrupt services and associated context switching. The system simply ran out of CPU cycles. The use of process resident protocol stacks (incidentally also a feature of the Walnut) eliminates this very fundamental performance bottleneck characteristic of monolithic and pseudo-microkernel designs, as instead of the expensive context switch required between stages of processing, the only overhead required is that of a function call. Indeed this is the basic idea behind multi-threading processes. Function calls are cheap as they merely incur the generation of a stack frame for each call.
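A back-of-envelope calculation from the figures above illustrates the cost: 70% of the CPU spread across roughly 8,500 interrupts per second amounts to 0.70 / 8,500, or about 82 microseconds per interrupt, which at 40 MHz is some 3,300 cycles consumed by state saving, restoring and general housekeeping rather than useful protocol work. A function call, by comparison, costs a handful of cycles.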

The weakness of process resident protocol stacks is the absence of virtual memory protection between processing stages, which can bite in complex and/or poorly testable designs. Where the design is simple this is however not an issue. The Netapp design embeds the interrupt service (bottom of driver) in the kernel, as is typical for microkernels, and then embeds the top half of the drivers, IP, TCP, UDP, RPC and NFS protocols, the log-structured file system processing, RAID management (ie block mapper) and upper half of the SCSI driver in a dedicated process. This approach means that the path of any block of data between the network and disk platters is as simple as possible.

Moreover, there is no need to compromise throughput performance at any point in order to produce standard interfaces between processing stages, or to pump data through an interprocess comms channel. A single process calling a stack of routines is all that is required. Of some technical interest here is that the same strategy as is used for NFS (both Version 2 and 3 are supported) is also applied to the CIFS/SMB stack and the HTTP stack.
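To illustrate the idea, the following C sketch walks a received packet from the driver top half down to the disk purely by function calls. The function names and the degree of simplification are the author's, not Netapp's; real code must of course parse headers, handle errors and manage buffers at each stage:

    /* Sketch of the "stack of routines" model: one process walks a
     * packet from the network driver top half down to the disk, with
     * only function calls between stages. All names are illustrative. */
    #include <stdio.h>

    typedef struct { const char *payload; } packet_t;

    static void scsi_write(const char *data) { printf("SCSI: write '%s'\n", data); }
    static void raid_map(const char *data)   { scsi_write(data); }  /* block mapper */
    static void wafl_write(const char *data) { raid_map(data); }    /* file system  */
    static void nfs_input(packet_t *p)       { wafl_write(p->payload); }
    static void rpc_input(packet_t *p)       { nfs_input(p); }
    static void udp_input(packet_t *p)       { rpc_input(p); }
    static void ip_input(packet_t *p)        { udp_input(p); }

    /* Driver top half: entered once per received packet; everything
     * below costs only a stack frame per stage, not a context switch. */
    static void ether_rx(packet_t *p)        { ip_input(p); }

    int main(void)
    {
        packet_t p = { "file data block" };
        ether_rx(&p);
        return 0;
    }

Each stage costs one stack frame rather than one context switch, which is precisely the saving discussed above.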

CIFS is of particular importance here, as it employs some very complex locking mechanisms and is essentially "stateful" in operation compared to the simpler "stateless" NFS. The most popular CIFS implementation in use on heterogeneous sites is Samba, which has a number of limitations in its implementation which are a direct result of attempting to map the stateful and highly lockable CIFS model on to a Unix filesystem. As it runs as a Unix process, emulating an embedded CIFS stack, it also incurs performance penalties.

In homogeneous sites the NT CIFS implementation suffers frequent performance limitations due to NT's rather ordinary context switching and inter-process communications performance. The Netapp strategy is clever in the sense of using specialised code for the protocol stacks, which can fully implement the features of the protocols in question without the need to make concessions in protocol implementation or performance. The principal limitation of the current Netapp design lies in its ability to implement only the HTTP GET operation, requiring a separate conventional HTTP server (eg Unix) to implement CGI scripts. Whether Netapp intend to embed CGI in the appliance model is unclear from currently published literature. Protocol support for NetBEUI and IPX/SPX is available as an alternative to the basic CIFS stack.

The Write Anywhere File Layout (WAFL)

The WAFL is Netapp's proprietary log structured file system (LFS). The LFS is a fairly recent model in filesystem design, created at UCB earlier this decade by John Ousterhout's development group, and more recently offered as an alternative file system by major Unix vendors (see earlier OSR features). The LFS model is fundamentally different from established file systems such as the Unix FFS/UFS. The limitation of the latter lies in write performance, in that the disk heads must seek out to the specific locations on the disk, chosen for optimal multi-block read operations, before each block can be written to the platters. An LFS bypasses this problem by appending all writes to a single consecutive stream of blocks termed a log. The log grows continuously, as instead of overwriting an existing file, any write operation creates a new version of the file at the end of the log. An LFS therefore continuously grows with every write operation, containing in effect not only the filesystem dataset but also its complete history.
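The append-only behaviour is simple enough to capture in a few lines of C. The structures below are purely illustrative, not the actual WAFL layout:

    /* Minimal model of log-structured writing: every write appends a
     * new copy of the block at the head of the log and repoints the
     * file's block table, so old versions remain in the log history. */
    #include <stdio.h>
    #include <string.h>

    #define LOG_BLOCKS 16
    #define BLOCK_SIZE 32

    static char log_area[LOG_BLOCKS][BLOCK_SIZE]; /* the on-disk log */
    static int  log_head = 0;                     /* next free block */

    typedef struct { int block[4]; } inode_t;     /* file -> log blocks */

    static void lfs_write(inode_t *ino, int blkno, const char *data)
    {
        /* Never overwrite in place: append, then repoint the inode. */
        strncpy(log_area[log_head], data, BLOCK_SIZE - 1);
        ino->block[blkno] = log_head++;
    }

    int main(void)
    {
        inode_t f = { { -1, -1, -1, -1 } };
        lfs_write(&f, 0, "version 1");
        lfs_write(&f, 0, "version 2");   /* old copy still in the log */
        printf("current: %s (log block %d)\n",
               log_area[f.block[0]], f.block[0]);
        printf("history: %s (log block 0)\n", log_area[0]);
        return 0;
    }

Note that the second write does not disturb the first copy; the old version remains in the log, which is exactly what makes the snapshots discussed below so cheap.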

Because head movement is minimised for write operations, any LFS will typically exhibit better write performance than a conventional equivalent. What is almost certainly unique about the WAFL is that it seamlessly integrates NVRAM (described above) into the operation of the filesystem to improve robustness against power outages and similar crashes, while also concurrently performing an NFS write cache function to accelerate NFS writes, allowing asynchronous rather than synchronous write operations (see earlier OSR feature on NFS performance).

How the WAFL merges the LFS model with the NVRAM NFS write acceleration is arguably the single most clever architectural feature of this family of designs. The model used in WAFL incorporates the concept of a "Snapshot". A snapshot is an image of the complete filesystem at a given point in time. This image is created by saving the contents of the key filesystem datastructures which associate files with blocks in the WAFL. Because the WAFL is an LFS, it accumulates all changes over time, and thus a series of saved snapshots allows a user to reconstruct what the filesystem looked like at exactly the time of the given snapshot. In operation, the WAFL produces a "consistency point" snapshot every several seconds, so that an unexpected crash or power outage causes the loss of only several seconds worth of data, at most.

Moreover, the WAFL design will not flush updates associated with NFS requests until after a consistency point snapshot has been logged. The first question the clever reader will put here is "what happens to all of the NFS/CIFS requests made between consistency point snapshots, should the power fail?" This is where the NVRAM is cleverly used. Incoming NFS or CIFS write requests are logged, data inclusive, into an NVRAM resident memory log structure. Because the NVRAM is robust storage, the NFS or CIFS write operation can return as completed immediately, in effect achieving NFS write acceleration (similar in effect but different in implementation to, say, the Prestoserve).

The model used by the WAFL is conceptually similar to a database transaction log. If the power fails, the accumulated NFS writes beyond the last consistency point can simply be replayed to bring the WAFL to a current and consistent state. After any given consistency point snapshot, the accumulated NFS writes are directly flushed to disk. This ordered technique for flushing writes to disk also provides the file system code with enough time to figure out the optimum placement for the blocks to be written, thereby improving performance.
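The following C sketch models this NVRAM log and consistency point scheme. The structures and function names are the author's illustration, as the real WAFL and NVRAM formats are proprietary:

    /* Sketch of the NVRAM request log and consistency point scheme. */
    #include <stdio.h>
    #include <string.h>

    #define NVLOG_SLOTS 8

    typedef struct {
        char op[8];      /* "write" etc.                  */
        char data[32];   /* request payload, logged whole */
    } nvlog_entry_t;

    static nvlog_entry_t nvlog[NVLOG_SLOTS]; /* battery backed in reality */
    static int nvlog_count = 0;

    /* An incoming NFS write is logged to NVRAM and acknowledged at once:
     * the client sees a completed (accelerated) synchronous write. */
    static void nfs_write_request(const char *data)
    {
        strcpy(nvlog[nvlog_count].op, "write");
        strncpy(nvlog[nvlog_count].data, data, 31);
        nvlog_count++;
        printf("ack to client: %s\n", data);
    }

    /* At a consistency point the accumulated writes are flushed to disk
     * as one optimally placed batch, and the NVRAM log is retired. */
    static void consistency_point(void)
    {
        for (int i = 0; i < nvlog_count; i++)
            printf("flush to disk: %s\n", nvlog[i].data);
        nvlog_count = 0;
    }

    /* After a crash, entries still in NVRAM are simply replayed,
     * exactly as a database replays its transaction log. */
    static void replay_after_crash(void)
    {
        for (int i = 0; i < nvlog_count; i++)
            printf("replay: %s\n", nvlog[i].data);
    }

    int main(void)
    {
        nfs_write_request("block A");
        nfs_write_request("block B");
        consistency_point();        /* normal path */
        nfs_write_request("block C");
        replay_after_crash();       /* crash before the next CP */
        return 0;
    }

The essential property is that an acknowledged write exists in at least one robust place, either the NVRAM log or the platters, at every instant.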

This is conceptually a pipelining technique, in that incoming requests are queued up and then processed in groups; the timeslice of the pipeline model is the consistency point interval. Additional features have been added to the WAFL design to enhance performance. The most important of these is a clever hashing mechanism to speed up searches for files in large directories. This facility, termed a directory hash, is in effect a directory level name cache which is designed to cache every single file in the directory, rather than a recently accessed subset as is the case in more conventional name caching models. Netapp's published data suggest this provides a five-fold performance increase for a 30,000 file directory search. The RAID-4 storage model used is conventional, employing a block level parity disk for a fixed number of data disks (see the earlier feature on RAID fundamentals).
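A directory hash of this kind is, in essence, a chained hash table keyed on the file name. The following C sketch shows the basic structure; names and sizes are illustrative only:

    /* Sketch of a directory level name cache that hashes every entry,
     * so a lookup costs one hash probe instead of a linear scan. */
    #include <stdio.h>
    #include <string.h>

    #define NBUCKETS 1024

    typedef struct dirent_s {
        char             name[32];
        int              inode;
        struct dirent_s *next;      /* chain within a hash bucket */
    } dirent_t;

    static dirent_t *bucket[NBUCKETS];
    static dirent_t  pool[4];       /* toy allocator for the example */
    static int       pool_used = 0;

    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    static void dir_add(const char *name, int inode)
    {
        dirent_t *e = &pool[pool_used++];
        strncpy(e->name, name, 31);
        e->inode = inode;
        e->next = bucket[hash(name)];
        bucket[hash(name)] = e;
    }

    static int dir_lookup(const char *name)
    {
        for (dirent_t *e = bucket[hash(name)]; e; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e->inode;    /* hit: no scan of 30,000 entries */
        return -1;
    }

    int main(void)
    {
        dir_add("report.txt", 101);
        dir_add("data.log", 102);
        printf("report.txt -> inode %d\n", dir_lookup("report.txt"));
        return 0;
    }

The payoff is that a lookup costs one hash computation and a short chain walk, regardless of whether the directory holds thirty entries or thirty thousand.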

While RAID-4 is not commonly used, in this instance it appears that it was chosen because it provided a good fit for the block allocation strategy used in the WAFL (or vice versa). The WAFL/RAID scheme used by Netapp tightly integrates the NFS/CIFS interface and the filesystem, to provide a highly robust, high performance package which would be extremely difficult to graft on to a traditional general purpose operating system without major brain surgery. In the author's opinion this is the most clever scheme of this ilk which he has encountered to date.
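The parity arithmetic underlying RAID-4 is plain XOR across the stripe, as the following self-contained C sketch demonstrates with illustrative figures:

    /* RAID-4 in miniature: one dedicated parity disk holds the XOR of
     * the corresponding blocks on every data disk, so any single failed
     * data disk can be reconstructed from the survivors. */
    #include <stdio.h>

    #define NDATA 3   /* data disks  */
    #define BLK   4   /* bytes/block */

    int main(void)
    {
        unsigned char disk[NDATA][BLK] = {
            { 0x11, 0x22, 0x33, 0x44 },
            { 0x55, 0x66, 0x77, 0x88 },
            { 0x99, 0xaa, 0xbb, 0xcc },
        };
        unsigned char parity[BLK] = { 0 };

        /* Parity disk = XOR across the stripe. */
        for (int d = 0; d < NDATA; d++)
            for (int i = 0; i < BLK; i++)
                parity[i] ^= disk[d][i];

        /* Reconstruct disk 1 as if it had failed: XOR of parity and
         * the surviving data disks gives back the lost block. */
        unsigned char rebuilt[BLK];
        for (int i = 0; i < BLK; i++)
            rebuilt[i] = parity[i] ^ disk[0][i] ^ disk[2][i];

        for (int i = 0; i < BLK; i++)
            printf("%02x ", rebuilt[i]);   /* prints 55 66 77 88 */
        printf("\n");
        return 0;
    }

The well known weakness of the scheme, the parity disk becoming a write bottleneck, is arguably masked in the WAFL because writes are batched and laid out at consistency points rather than scattered randomly across the stripes.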

Performance

Netapp have made no secret, as one would expect of a commercial vendor, of the fact that the product delivers considerably better performance than general purpose multiuser machines configured to operate as NFS, CIFS or HTTP servers. Extensive benchmarking data has been posted at the Netapp web site (http://www.netapp.com) and the SPEC web site (http://www.specbench.com).

The published performance data are LADDIS benchmarks, a synthetic benchmark derived from the earlier nhfsstone benchmark produced by the makers of Prestoserve. The benchmark generates an increasing workload of NFS requests and measures the average response time at each load level. In general, the dedicated "toaster" architecture provides at least twice the speed of conventional systems, particularly at high loads. Clearly the specialised architecture provides a distinct performance advantage over Unix based NFS servers, and an even greater margin over NT hosted CIFS servers, of comparable CPU performance and memory size. The author has yet to see a Webstone benchmark, but it is reasonable to expect that a similar advantage will exist.

Management Facilities and Backups

The Netapp "toasters" provide a HTTP/HTML based management interface for activity monitoring, control and configuration of the server. This standard interface was evidently used to simplify the design of the product, as the user interface can be an arbitrary HTML browser resident on a system admin's machine. Backups and recoveries are essentially conventional, using a variant of the Unix dump and restore toolset. Facilities are available for over the network backups using a range of standard tools. Given the high bandwidth of the RAID array a fast tape drive such as a DLT may be required.

Perspective

The Network Appliance "toaster" family of dedicated NFS, CIFS and HTTP multi-protocol file servers employ a technically impressive blend of recent operating system design techniques, file system design techniques and a very clever integrated NFS accelerator and filesystem operation log mechanism to provide a truly dedicated high performance alternative to general purpose hosts used as file servers.

The architectural model used is robust, has considerable technical merit, and is clearly a credit to its designers. We have yet to see this technology proliferate more widely, but if the experience with Cisco and Wellfleet is anything to go by, the appliance model is here to stay. From a market perspective we can expect appliance devices to nibble a certain chunk of the Unix server market, less so in Australia because really large Unix machines are typically used as compute rather than dedicated NFS/CIFS servers.

Where smaller appliance type devices promise to make a huge difference is in larger PC-centric sites, where the technology is likely in the longer term to devastate both Novell's and Microsoft's top end server products, which have, due to fundamental technology, historically been less efficient than Unix in terms of resource utilisation. As a result, the price/performance benefits of an appliance are significantly greater, and the high robustness and low management overheads promise to further enhance the value per expense relationship. Yet another paradigm to contemplate.


Artwork and text © 2005 Carlo Kopp

