Logical Volume Managers
Originally published October 1999
by
Carlo Kopp
© 1999, 2005 Carlo Kopp
One of the more interesting technologies to have proliferated into the wider marketplace in recent years is the Logical Volume Manager (LVM), an offshoot of sixties mainframe storage management technology. Today virtually every mainstream vendor can supply an in-house LVM product for its proprietary Unix variant, and the third party Veritas LVM is widely used as a tack-on alternative to the vendor offering. In line with commercial trends, the public domain Vinum LVM is now becoming more widely seen as a bundled option for public domain Unix variants. In the last few years LVMs have progressed from being a simple means of producing "growable" virtual disks, to incorporating performance enhancements such as striping, and, in the latest generation of products, software RAID support.

In this month's feature we will explore the basic ideas in LVM technology, what advantages it offers, what pitfalls it may create, and finally take a closer look at the Vinum LVM.

Why Logical Volume Managers?

The current generation of LVMs appeared earlier in the decade, primarily as a tool for selling IBM RS6000s and DEC Alphas to commercial clients who were not sufficiently Unix literate to cope with the difficulties of managing disk space and filesystems properly. Very soon striping was incorporated, since it offered a useful performance improvement. The literal explosion in RAID products after the mid nineties became yet another feature which could be bundled into an LVM, since the effort required to do so was modest.

The basic idea behind all LVMs is that of aggregating several disks and creating the illusion of a single contiguous address space of blocks, upon which a filesystem can be built. Once this is accomplished, the remaining features can be incorporated without great difficulty, by manipulating the mapping function.

The mechanics of any LVM design require that a disk block remapping scheme be employed. In essence, a mapping algorithm or table must exist which translates the "logical" disk block address into the identifier of the disk on which the block resides, and the physical address of the block on that disk. The latter was a messy proposition until the advent of SCSI, which uses a very tidy addressing model from the system programmer's perspective.

In operation, a conventional filesystem will call the upper half of the disk block device driver, which will in turn access the physical disk to read or write the disk block in question. If the disk is a "logical disk" under an LVM, a specific logical disk device driver or pseudo-device driver must exist, which is capable of remapping between logical and physical addresses, and which in turn will access the device drivers for the physical disks. On an access by a program, the filesystem must thus call the logical (pseudo-)device driver, which remaps the logical address into a physical address and in turn accesses the physical device driver and thus the required disk. On a read operation, once the physical device returns the block, the physical device driver must return it (or appropriate addressing information) to the logical driver, which in turn places it into the appropriate place (e.g. the buffer cache).

Obviously this level of indirection on every disk access will incur some performance penalty, and some mechanism must also be employed to save the mapping information, since if it is lost, then the information which is stored is also lost.
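To make the remapping idea concrete, the following is a minimal sketch in C of the kind of lookup a logical volume driver performs for a simple concatenated volume. The structure and function names are illustrative only, and are not taken from any particular LVM implementation.

    #include <stdint.h>
    #include <stddef.h>

    /* One physical disk (spindle) that is a member of the logical volume. */
    struct member_disk {
        int      unit;        /* identifier handed to the physical driver */
        uint64_t nblocks;     /* capacity of this member in blocks        */
    };

    /* A concatenated logical volume: members laid end to end. */
    struct logical_volume {
        struct member_disk *members;
        size_t              nmembers;
    };

    /* Result of the remap: which disk, and which block on that disk. */
    struct phys_addr {
        int      unit;
        uint64_t block;
    };

    /*
     * Translate a logical block number into a physical (disk, block) pair
     * by walking the member table and subtracting each member's size until
     * the address falls inside one of the members.  Returns 0 on success,
     * -1 if the logical address lies beyond the end of the volume.
     */
    int lv_remap(const struct logical_volume *lv, uint64_t lblock,
                 struct phys_addr *out)
    {
        for (size_t i = 0; i < lv->nmembers; i++) {
            if (lblock < lv->members[i].nblocks) {
                out->unit  = lv->members[i].unit;
                out->block = lblock;
                return 0;
            }
            lblock -= lv->members[i].nblocks;
        }
        return -1;   /* address beyond end of logical volume */
    }

In a real driver this lookup sits in the strategy routine of the logical device, and the resulting (disk, block) pair is handed down to the physical driver; growing the volume then amounts to appending another entry to the member table.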
The quality of implementation therefore matters, as does the choice between a two-layer driver scheme as described, and a more direct approach in which a single driver performs both operations.

In the simplest arrangement the LVM cannot allow the disk space to be "grown" incrementally, since the filesystem at creation time assumes the partition it sits on is fixed in size. Should the filesystem be designed to accommodate resizing, however, further physical disks may be added to the virtual or logical disk as the need arises. The LVM is then reconfigured to incorporate the mapping information required to translate logical addresses into physical addresses on the added disk space. In this manner, the user can simply keep growing the logical volume as he or she fills up the disks already in place.

Is this the panacea for every system administrator with rapidly growing demands upon disk space? Not entirely, since disk blocks will more than likely end up scattered over a number of physical drives which may differ in size and performance. If the disk space is filled up incrementally, odds are that accesses will crowd into one area and thus saturate a single disk, which will cost performance while still incurring the overheads of the logical to physical mapping. Indeed, there are no free lunches! The bigger the volume, the bigger the aggregate load is likely to be, and thus the greater the potential for saturating any given disk.

Setting aside the issue of inhomogeneous mixes of disk sizes and speeds, which creates some less than tractable complications, there is an elegant technique which can bypass the saturation problem. This is the technique of striping, which involves spreading accesses across multiple drives in stripes. Striping first appeared in the sixties, and became very popular in supercomputing applications by the eighties, as a means of dramatically improving the performance of large storage arrays. It also provided the inspiration for RAID techniques.

The fundamental idea behind striping is to exploit the non-linear behaviour of queuing systems. Consider a single disk to which we have fired off, say, a couple of dozen I/O requests, reads or writes. Because the disk heads have to be moved and the platter has to rotate into position under the heads, we incur an access delay on each I/O operation which varies with the position of the accessed block on the disk, and with the previous position of the disk heads. While some tricks can be played, for instance "elevator" algorithms which reorder the accesses for the best possible efficiency, the basic behaviour this system exhibits is that of a queue. Each I/O access has to wait until the disk drive can service it.

In a single queue system, the typical behaviour we see as the rate of I/O requests approaches the maximum rate at which the drive can service them is a dramatic increase in the length of the queue of backed up requests. In turn, this means that the waiting time in the queue grows progressively longer. The mathematical model says that in the limit, when the rate of incoming requests equals the rate at which they are serviced, the waiting time is infinite. In practical terms this means that to achieve good throughput on a disk we have to keep the queues short.

Consider now what happens if we split these I/O requests across a pair of disks. Do we double the throughput, and halve the waiting time for I/O servicing? The answer is no.
We actually do better, because of the non-linearity in the behaviour of the backed up queue, and the advantage grows with the number of queues, and thus drives, in the system (a small worked example appears below). This is the fundamental idea behind striping, and behind RAID techniques. By spreading the load across multiple spindles, we push the queuing behaviour under heavy load out of the region where saturation occurs, and thus avoid the massive increase in queuing delays which saturation incurs.

The central issue now becomes what strategy to choose for block layout across the drives, in order to spread the load as evenly as possible. This is a question to which there are no simple answers; it all depends on the statistical behaviour of the accesses which are to be made. Two basic scenarios exist, the "scattered small file" scenario and the "large file" scenario. In the former we end up with a large number of I/Os, often to disparate locations in the volume address space; in the latter we frequently end up with large numbers of I/Os to consecutive locations in the volume address space. Choosing the best strategy for spreading the I/Os across drives thus depends quite critically on the behaviour of our application. Striping behaviour depends on the number of drives and on the stripe size across the drives. RAID techniques exhibit a similar dependency on the granularity of access, with the caveat that some extra overhead is incurred to produce the parity block (or word) for redundancy. A more detailed discussion of performance tuning lies outside the scope of this immediate discussion.

The bottom line is that the performance penalty incurred by an LVM can be offset, if not wholly removed, should the LVM employ striping or RAID techniques. Indeed, a significant performance advantage may be gained over the use of a set of conventional filesystems storing the same set of files. An LVM therefore offers, in principle, the following advantages over conventional filesystems on fixed drives:

- logical volumes which can be grown incrementally by adding physical disks, without reorganising the filesystems already in place;
- the ability to spread load across multiple spindles through striping, improving throughput and response times under heavy load;
- in the latest generation of products, software RAID support for redundancy against the loss of a drive.
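To put some numbers on the queuing argument, the following is a minimal sketch assuming an idealised M/M/1 queue: a drive able to service 100 requests per second, offered a load of 90 requests per second, first alone and then with the same load split across several identical drives. The figures are illustrative only, since real disks are not ideal M/M/1 servers, but the non-linear trend is the same.

    #include <stdio.h>

    /* Mean time in system (queuing plus service) for an M/M/1 queue,
     * with service rate mu and arrival rate lambda (lambda < mu). */
    static double mm1_response(double mu, double lambda)
    {
        return 1.0 / (mu - lambda);
    }

    int main(void)
    {
        double mu     = 100.0;   /* requests/sec one drive can service */
        double lambda = 90.0;    /* offered load, requests/sec         */

        /* Response time with one drive carrying the whole load. */
        double one = mm1_response(mu, lambda);

        /* Split the same load evenly across n identical drives. */
        for (int n = 1; n <= 4; n++) {
            double t = mm1_response(mu, lambda / n);
            printf("%d drive(s): mean response %.1f ms (%.1fx better)\n",
                   n, t * 1000.0, one / t);
        }
        return 0;
    }

Going from one drive to two improves the mean response time by a factor of about five and a half rather than two, which is precisely the non-linear payoff that striping exploits.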
The downside of LVMs lies in the increased complexity involved, and in the need to carefully tune the configuration of the LVM and the I/O subsystem to ensure that the end product delivers the goods. Simply installing one in its default configuration may not yield the desired outcome. In practice most LVMs impose constraints on which combinations of functions can be employed. Frequently, if the LVM is set up to allow extension of the volume with inhomogeneous disk sizes, features such as striping cannot be used. Limitations may also exist on the smallest usable stripe size. These factors should be carefully considered before setting up an LVM.

Tuning Storage Arrays

The process of tuning a storage array set up with an LVM is non-trivial and may require several iterations before we can match the configuration to extract the best performance from the application in use. Since the behaviour of the filesystem's block layout optimisation algorithm will interact with the LVM's block mapping algorithm, there are no trivial answers or simple rules of thumb which can be universally applied. Various publications recommend various stripe sizes based on their authors' respective empirical experience in tuning specific products; indeed, recommended stripe sizes across a range of publications vary between 30 kbytes and 256 kbytes, and I am sure that if I read more papers I would find more recommended values! This underscores the central issue, which is that the combination of application, filesystem type, LVM design and configuration, device driver design and physical disk behaviour produces a complex system whose performance can be virtually impossible to predict a priori.

A good starting point for configuring an LVM and storage array is to have access statistics for the application available, since this allows us to look at the typical sizes of I/O requests and gain some idea of whether we are facing a high frequency of small accesses, a low frequency of large accesses, or some particular mix. In practice this information can be extremely hard to acquire without an "instrumented" operating system, in which the device drivers and filesystem maintain logs for exactly such performance analysis. Means of cheating do of course exist, the cleanest technique being the use of a logic state analyser with a SCSI adaptor and protocol disassembler (in effect a SCSI protocol analyser). This approach has the advantage of being invisible to the operating system, and of not incurring any additional delays which could skew the measurement. In principle, the logic state analyser is set up to dump the analysed stream of access activity in a compact format out through a serial port, where it is logged on a workstation for statistical analysis. In practice few people will manage to convince management to expend the bucks to hire an analyser for three weeks simply to gather statistics. However, if you are shipping turnkey systems where you supply both hardware and application, this may be worth the trouble even in the shorter term.

Once you have gathered access statistics, it is straightforward to pinpoint the best strategy in terms of stripe size, and to configure the array accordingly. An important point is that in the absence of off-the-shelf tools for this purpose, you will have to craft your own. More than likely you will not convince management to devote the required resources, and thus the fallback is the much more labour intensive approach of iterative empirical analysis.
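As an illustration of the sort of home-grown tool involved, the following sketch tallies a crude histogram of request sizes from a trace of (offset, size) records. The trace format here is entirely hypothetical and would need to match whatever your logging arrangement or protocol analyser actually produces.

    #include <stdio.h>

    /*
     * Read a whitespace-separated trace of "offset size" pairs (in bytes)
     * from stdin and bucket the request sizes by powers of two, to show
     * whether the workload is dominated by small scattered I/Os or by
     * large sequential ones.  The offset field is parsed but unused here.
     */
    int main(void)
    {
        unsigned long long offset, size;
        unsigned long counts[32] = { 0 };

        while (scanf("%llu %llu", &offset, &size) == 2) {
            int bucket = 0;
            while ((1ULL << (bucket + 1)) <= size && bucket < 31)
                bucket++;
            counts[bucket]++;
        }

        for (int b = 0; b < 32; b++) {
            if (counts[b] != 0)
                printf("%10llu - %10llu bytes: %lu requests\n",
                       1ULL << b, (1ULL << (b + 1)) - 1, counts[b]);
        }
        return 0;
    }

The shape of such a histogram, together with some idea of access locality, is usually enough to suggest whether small or large stripes are the better starting point.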
The rationale for falling back on iterative analysis is that staff time doesn't cost anything, since it is already paid for. This approach requires that you set up an offline copy of the application, preferably with some means of exercising it under a near realistic load. The ideal approach is the use of Remote Terminal Emulation (RTE) techniques, but this can be costly depending on the scale of the exercise. Some database products include tools or facilities for synthetic load generation.

Once we have an offline system set up, we attach the disk array and take a stab at the best stripe size. We then exercise the system with the appropriate load and look at its response times. This yields our first datum point. We then trash the files on the array, reconfigure with either a larger or smaller stripe size, and repeat the exercise. If performance improves, we keep moving in the direction of larger or smaller stripes accordingly, until we hit the point where there is no further improvement. If performance degrades, we move in the opposite direction, until there is no discernible performance improvement. This is extremely tedious and time consuming, and odds are that management may after all relent and allow you to hire an analyser for the purpose. Either way, tuning a large disk array is not a trivial exercise, even if it is in theory relatively simple.

Other strategies exist which exploit periods when the live system is not in use, for instance over weekends. One approach which incurs some political risk is to manipulate the array setup on a production system over weekends, and then observe the behaviour of users in the following week. The appropriate direction for tuning increments can be gauged by their hostility, or otherwise, once the system is in operation! The common approach of using crafted shell scripts to perform I/O operations, and thus generate a wholly synthetic load, is not a reliable strategy for performance tuning, since it cannot replicate the idiosyncratic behaviour of the application, especially in the locality of file accesses.

Having explored the basic rationale of the LVM and tuning issues, we will now take a closer look at the Vinum LVM, which is available for FreeBSD systems.

The Vinum Volume Manager

The Vinum LVM was crafted by Greg Lehey and is available fully featured as a commercial product, or with a restricted feature set as a public domain extra for FreeBSD 3.1 systems. The name is a play on the commercial Veritas product, and the author openly acknowledges that the Veritas product was the inspiration for Vinum.

The Vinum LVM employs the basic strategy of a two layered driver arrangement, in which the Vinum module sits above the standard SCSI device driver and maps block addresses in accordance with configuration information. Each drive in the array contains a redundant copy of the configuration information, to allow recovery in the event of the loss of a spindle. Unlike many commercial products, the Vinum LVM employs a simple command line configuration utility, which is used to set up a configuration database. A simple syntax is employed, although a user will be required to understand the workings of Vinum to make sense of the database.

Vinum supports a number of useful modes:

- concatenation of drives into a single larger volume;
- striping (RAID 0) across drives to spread the load;
- mirroring (RAID 1) for redundancy;
- RAID 5, combining striping with parity for redundancy.
With the exception of the RAID 5 mode, which is confined to the commercial release, all modes are available in the public domain release. The tool does, however, have some limitations in comparison with the commercial offerings.
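As an illustration of the configuration syntax, a minimal striped volume over two drives might be described roughly as follows. This is a sketch based on the Vinum documentation of the period; the device names, volume name, stripe size and subdisk lengths are placeholders and would have to match the actual hardware.

    # two physical drives made available to Vinum
    drive d1 device /dev/da1s1e
    drive d2 device /dev/da2s1e

    # one volume containing a single striped plex across both drives
    volume myvol
      plex org striped 256k
        sd length 512m drive d1
        sd length 512m drive d2

The configuration is loaded with the vinum command line utility (for instance vinum create with the file above), after which the volume appears as a block device under /dev/vinum on which a filesystem can be created in the usual way.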
While the Vinum LVM lacks many of the "bells and whistles" of commercial products, it is adequate for many purposes, and since it is free of charge it is an excellent training tool for sysadmins who wish to climb the learning curve of LVM configuration and tuning before tackling a major production system. For ISPs who use FreeBSD as a production platform, Vinum provides a means of matching the basic capabilities of commercial Unix variants.

In summary, the Logical Volume Manager offers much to a competent user who can exploit its capabilities, but should be treated with some caution by the less experienced, since enough complexity exists to cause difficulties if configuration and performance issues are misunderstood, or not understood at all. Like many items of modern technology, the LVM offers important gains, but only to those who know how to exploit it.