|Originally published September, 1994|
|© 1994, 2005 Carlo Kopp|
Filesystems are one of the least appreciated components of an operating system, yet they can have a dramatic impact upon the performance of a host system.
Unix has seen much evolution in filesystem technology over its history, and this evolution is still taking place today. Contemporary Unix filesystems are a far cry from the technology of the seventies, and offer substantially better access time and fragmentation performance, as well as greater robustness than their predecessors.
Quantifying filesystem performance is a major issue in itself, and for the purposes of this discussion we will identify three central performance criteria: speed, storage efficiency and robustness. This discussion will focus upon speed, as it is the most important measure of filesystem performance.
Speed performance is determined by access time performance and transfer rate performance. Access time is the time it takes for a filesystem object (eg a file) to be located for the purpose of reading or writing. Transfer rate is a measure of how quickly the object can be read into memory, or flushed out of memory to disk. A central caveat in this respect is the dependency of absolute speed performance upon disk performance: a meaningful comparison between two filesystem types can be made only if both are measured running on the same type of disk, with the same application. To do otherwise could be misleading, as the behaviour of the disk and the application could adversely or positively affect performance, and we really wish to compare oranges with oranges.
Storage efficiency is a measure of disk space wastage, and this in turn impacts the cost effectiveness of a storage subsystem. Disk space which is unusable because of limitations in filesystem architecture is money flushed down /dev/null. Therefore storage efficiency is worth some serious thought.
Robustness is a measure of the filesystem's resilience to damage, or its ability to recover from corruption. Whilst many today argue that redundancy schemes such as mirroring or RAID can protect disk resident data from corruption, these schemes are expensive and few sites can afford to take advantage of them. Hence most users' data is protected between backups only by the resilience of the filesystem to externally induced damage.
Conventional Unix Filesystems
The Unix filesystem, whatever type it may be, is a collection of algorithms together with in-memory and on-disk datastructures. The datastructures store not only data, but also indexing information which tells the operating system where to find data blocks on the disk platters and in memory. Conventional flavours of Unix use monolithic kernels, where the kernel contains the filesystem code, the system calls which operate on the filesystem, and the device driver which actually reads and writes to the disk surfaces.
One of the great strengths of Unix has been its ability to hide the idiosyncrasies of the hardware from the user, and this has in turn made for excellent application portability across often very diverse hardware. This has been accomplished by the use of a file descriptor mechanism, which is the familiar FILE construct seen by C programmers. The file descriptor of an open file will point indirectly to a structure termed an inode, which in turn points to the blocks on the disk. Every file on a disk has its inode, which wholly describes the file and where to find its contents.
When a program opens a file, it does so by calling the kernel which will call the filesystem, which will in turn call the driver, to drag the file's inode off the disk and into a table in memory. All subsequent operations by the program are done against the file descriptor, and the kernel will worry about the messy little details involved in finding blocks on the disk or in memory.
Typical Unix implementations will use one or another scheme of buffer cache, where frequently accessed disk blocks are held in memory. Because memory can be accessed at least 1000 times faster than disk, this provides a major improvement in performance over operating systems which cannot do so. In judging the performance of a Unix filesystem we must never lose sight of the buffer cache: should it perform particularly well, it can do a very good job of hiding a lethargic filesystem.
Unix literature recognises the original Bell Labs filesystem as the "old" Unix filesystem. This filesystem was considered innovative in its day and was good enough to persist well into the eighties, until supplanted by the Berkeley Fast File System.
The "old" Unix filesystem was built around a set of central disk structures. The Superblock, found at the beginning of the disk, describes all of the basic parameters of the filesystem, and where other structures live on the disk. The other important structures were a list of free disk blocks, and files, which are comprised of inodes and the blocks pointed to by the inodes. Directories are implemented with files, which contain names of other files and pointers to their inodes, as well as other useful bits of information.
An "old" Unix filesystem was laid out upon the disk in a straightforward fashion: first the superblock, then a table of inodes, and finally the data blocks. Data blocks were typically 512 bytes long.
While this scheme worked well, it had serious performance problems stemming from its basic design. When accessing files on a disk, access speed is maximised by minimising the distance which a disk head must travel to find the blocks in the file. In turn, transfer rate performance is maximised when the disk blocks in a file occupy consecutive locations on a disk cylinder. Ideally the disk head should seek out to a cylinder, and then sit there through a revolution reading off or writing the data blocks.
When a filesystem of this type was created, and files loaded onto it off a tape, all was well as the disk blocks would sit in consecutive locations, and files occupy positions spanning consecutive disk cylinders. However, as the filesystem was used and files deleted and added, block positions on the disk would become increasingly randomised with time. This effect, termed fragmentation, drove performance into the mud very quickly, as accessing any file would lead to seeks all over the disk surface to find the blocks. Transfer rate performance could degrade by as much as 80% over a period of usage.
The resolution of these problems led to the Berkeley Fast File System (FFS, or in SVR4 UFS), which was developed by Berkeley's Computer Systems Research Group (CSRG) in conjunction with Melbourne Uni. The FFS is the yardstick by which modern Unix filesystem performance is judged.
The FFS involved a substantial change to filesystem block placement strategies. A scheme termed cylinder grouping was introduced, whereby the filesystem was split into cylinder groups, and blocks in files preferentially allocated to particular groups. This arrangement allowed blocks associated with a particular file to live in one cylinder group, and very often in consecutive positions. This led to a major improvement in speed performance, particularly transfer rate. In addition, a scheme of optimising rotational block positions was also adopted.
Each cylinder group contains a copy of the superblock, thereby improving robustness through redundancy, and a table of inodes. Grouping the inodes with the files provided yet another gain in performance, in that finding the blocks once the inode was found could take place with a minimum amount of disk seeking. Since Unix file accesses are characterised by a large number of accesses to relatively small files, this optimisation works very nicely.
The FFS also saw a major improvement in robustness, as superblocks were not only redundant, but also staggered in position so that they end up spread across all disk platters. Storage utilisation was improved by adopting a fragmenting scheme, whereby larger disk blocks were used, but could also be split into fragments of smaller size. Functional enhancements included support for longer file names, file locking and symbolic links which could span filesystems.
The proof of the pudding was to be found in performance tests, which in prototype implementations yielded improvements as great as sevenfold in read performance. The adoption of the FFS into SVR4, albeit late, could be said to be final acknowledgement by the traditionalists of the supremacy of BSD's FFS.
The Virtual Filesystem
The virtual filesystem emerged during the eighties, largely as a response to Sun's innovative Network File System (NFS), which allows the remote mounting of disks between hosts on TCP/IP networks. The VFS mechanism allows the operating system to hide the type of the filesystem, be it a remote NFS mount or a local disk, so that the program sees only a single type of file descriptor.
This was accomplished by introducing yet another structure in memory, termed a vnode. In the VFS, a program accesses a file descriptor, which points to a vnode, which in turn points to either a local filesystem inode, or a file in an NFS mounted filesystem.
The vnode is often described as an object oriented interface, as it contains pointers to the system calls which can be used to operate upon the file. As such it is an object, as it contains both data and functions. This approach provided the flexibility to accommodate various types of filesystem with a single interface, while also improving performance by reducing the overheads which come with hunting around for file type specific functions.
The BSD Enhanced File System
The EFS is perhaps the final incarnation of the long lived FFS/UFS, and was developed by CSRG in the early nineties to accommodate later generation cached SCSI disk drives. Cached drives contain a local memory on the disk, which will typically store a track's worth of disk blocks read in a single revolution. This generation of drives also cracked the Gigabyte barrier, as a result of which filesystem size limits in the existing FFS were breached.
The EFS therefore saw a series of minor design changes to accommodate larger file sizes, larger filenames and larger disks, but also introduced a major change, which is clustering. Clustering involves aggregating multiple contiguous blocks for reads and writes, so that the filesystem will transfer clusters of blocks rather than single blocks or fragments thereof.
For instance, with a cluster size set to 64 kbytes, eight 8192 byte blocks are transferred per I/O operation, rather than one block. As the biggest single time overhead in a read or write operation is positioning the disk head over the block positions, the clustering scheme achieves a significant economy. Tests carried out by CSRG achieved a 250% increase in raw write performance, and a 200% increase in raw read performance against the FFS. The performance gains in other benchmarks were less dramatic, but nevertheless worthwhile.
Log Structured Filesystems
The Log Structured Filesystem family emerged in the late eighties and early nineties, as researchers sought to improve filesystem write performance. The effect of declining memory prices and refinements in Unix buffer cache design led to a situation where buffer caches typically achieved read hit rates of the order of 90%.
However, database applications will usually force written disk blocks to be flushed from memory to disk immediately during commit operations, to minimise the window of vulnerability to an operating system crash. As a result, disk traffic becomes dominated by writes rather than reads. A strategy which maximises write performance was therefore argued to improve filesystem performance as a whole.
An LFS radically differs from conventional Unix filesystems, using a scheme whereby files are appended to a growing log as they are written. Every write is appended to the end of the occupied part of the filesystem, which continuously increases in length, and thus the LFS will contain as many versions of a file as there were writes to the file. Needless to say this scheme would soon fill up the disk, and thus a garbage collector daemon (ie background "cleaner" process) periodically cleans up the filesystem, deleting the oldest versions of files and reshuffling the blocks to preserve the contiguous log and free disk space area.
The LFS is characterised by very little head movement during writes, and the ability to aggregate large numbers of blocks during writes. This can provide a dramatic improvement in transaction processing performance where writes are dominant. Other interesting features of this filesystem scheme are the ability to undelete (unrm) files which haven't been hit by the cleaner daemon, and the ability to support versioning and extensible filesystems using robotic stackers.
A number of commercially available Unix implementations now use variations on the LFS scheme, however the writer has had little success in gleaning papers on the subject, therefore this discussion cannot review these filesystems.
Logical Volumes
The Logical Volume appeared in the early nineties, arguably largely to pander to the expectations of lazy system administrators. One of the ongoing problems with conventional Unix filesystems is dealing with the time when a filesystem fills up. In the conventional scheme this requires getting a bigger disk and copying the files across, or splitting the data across multiple filesystems. This can be very painful with many database products, which may not take kindly to having their data set split in this fashion.
Another issue is performance under heavy I/O loads, where accesses may be concentrated upon a single disk, queuing up and dramatically reducing performance. A logical volume can, if implemented properly, spread the accesses over multiple spindles and thus reduce queuing delays.
The logical volume scheme involves spreading a filesystem across multiple partitions or multiple disks. This is implemented by sandwiching an additional layer of code between the filesystem proper and the driver. This logical volume code is essentially a disk block address mapper, which translates a logical volume block address into a physical disk block address. Vendors claim this incurs a performance penalty of less than 5%, easily compensated by throughput improvements.
A logical volume scheme is typically used in conjunction with a Log Structured Filesystem, as conventional filesystems use block allocation strategies based on disk geometry. These strategies are likely to break once the geometry assumptions are changed.
As is clearly evident, modern Unix filesystems offer a range of performance attributes which may or may not be well suited to an application. The central caveat is therefore to test drive before buying, particularly if disk I/O performance is critical to the application. Large multiuser systems can benefit significantly from appropriate choice of filesystem, and thus suitable benchmarking techniques, for instance using remote terminal emulation schemes, are the best strategy for determining the proper choice.
As with all such technology, quality of implementation is an issue, and serious systems integrators are well advised to tread cautiously. The payoff can be well worth the effort.
Diagram 1. (UFS/FFS Inode Structure)
The inode is the on-disk datastructure which wholly describes a file in the UFS/FFS filesystem. This structure contains access modes, ownership information, timestamps, file size, file block count and four sets of block pointers. The first twelve disk block address pointers are embedded in the inode, to improve performance in operations on small files. Singly, doubly and triply indirect block pointers are held in disk blocks, allowing support for large file sizes with access time penalties which increase with file size.
Diagram 2. (UFS/FFS Cylinder Group Layout)
This diagram is a good illustration of disk geometry in relation to filesystem layout. To access an arbitrary block the heads must seek to the cylinder containing the block, and then wait for the block(s) to pass under the head. Disk blocks are written onto the platter surface serially and typically include checksumming information used by the drive to verify the integrity of the read or written data on the medium. The use of cylinder grouping was a major innovation in the FFS, as it positioned all the blocks in a given file in close geometrical proximity so as to minimise required head movement and thus improve performance.
Diagram 3. (UFS/FFS Directory Entry)
A directory block contains variable length records, each of which describes a file. Fixed length fields contain a pointer to the file inode, the size of the filename field and the size of the directory entry. The filename is a variable length field containing an ASCII string of up to 255 bytes in length.
Diagram 4. (Log Structured Filesystem Structure)
Log structured filesystems merge attributes of conventional filesystems with those of database transaction logs. An LFS will append all written files to the end of the used disk space, and a cleaner daemon will be used to discard older copies of files and compact the disk space. The LFS can provide gains in write performance by minimising head seeks during write operations, which typically involve aggregated multiple writes. LFS filesystems are often used in conjunction with logical volume schemes.
Diagram 5. (Logical Volumes)
This scheme is based upon the concept of a logical volume, which may contain several physical disks or disk partitions. Logical volume block addresses which are consecutive in the logical volume address space are remapped to physical block addresses on arbitrary spindles. This is typically implemented by creating a logical volume device in /dev, which contains entry points to the block address mapping code, which in turn calls the physical disk driver. Logical volumes can provide some performance gain by spreading traffic over multiple physical drives thereby reducing queuing delays on drive accesses, but do so at some penalty in CPU time and disk space to store mapping information.
Diagram 6. (Unix File Access Hierarchy)
The Unix file access hierarchy starts with the file descriptor or FILE, which points to a vnode, which in turn points to an inode. This scheme hides the idiosyncrasies of the file type from the programmer, allowing transparent access to various filesystem types or NFS mounted filesystems.
|$Revision: 1.1 $|
|Last Updated: Sun Apr 24 11:22:45 GMT 2005|
|Artwork and text © 2005 Carlo Kopp|