Caches
Originally published May, 2000
by
Carlo Kopp
© 2000, 2005 Carlo Kopp
The ubiquitous cache is one of those unavoidable entities which we find in every imaginable part of a computer system. Whether we are looking at a CPU, a disk drive, an operating system or an application, odds are that we will find references to one or more caches. In this month's feature we will take a closer look at the issue of caching in general, and discuss some of the performance issues associated with the cache.

Caching is not a new idea; indeed caches have been with us for decades now, embedded in various parts of systems. The term itself is used to describe a myriad of different schemes, all of which are based upon some very simple ideas.

A problem which is frequently encountered in any computing task is that of dealing with a slow storage device. Everybody wants speed, be it in crunching an application or in interactive response times. Yet very few users could realistically afford the kind of storage performance they would want. Indeed a 512MB main memory built of 5 nsec SRAM chips is an artifact only to be found in Crays or machines of a similar ilk. Inevitably such a divergence between what is wanted and what is affordable will create pressure for alternatives, and the cache evolved to fill this niche.

In the simplest of terms a cache is a relatively small chunk of fast, if not very fast, storage which is interposed between a computing device and a storage system. The idea is that data or code which is frequently accessed finds itself in the cache, and can thereby be accessed very quickly, unlike the lumbering main storage element in question. This is a very simple and elegant idea, but for it to actually work effectively the caveat "frequently accessed" must hold. Only if the data or code is accessed frequently will the cache deliver a useful performance improvement.

The most important issue for any caching scheme is that of the "cache hit ratio", which is the proportion of total accesses to the cache during which the sought item is found in the cache. Ideally a cache delivers a very high hit ratio, better than 75%. As a result, the average time it takes to access the storage device is then dominated by the access time to the cache, rather than the slower main storage. An ideal cache would deliver a hit ratio approaching 100%, noting that a 100% hit ratio doesn't make sense since the cache itself has to be loaded at some point in time.

To elaborate further, when any access is made to storage, the first step is to look into the cache and see if the accessed item is there. If it is, then it is retrieved quickly, since the cache is by definition fast, and we say that a "hit" has occurred. On the other hand, if the item is not in the cache, it must be found in the slow main storage device, upon which the cache must be updated to hold it, and we say that a "miss" has occurred. The average access time to storage is thus a function of the very short cache access time, the much longer main storage access time, and the ratio of hits and misses to the total number of accesses. The more hits, the more frequently the fast cache gets used; the more misses, the more frequently the slow main storage gets used.

This of course leads us to the issue of "bang for buck". Ideally a cache should have as high a hit ratio and speed as possible, but also be as cheap as possible. Frequently this is a very tricky balancing act for a hardware designer, systems programmer or system architect, since hit ratios usually improve with size. The bigger and the faster the cache is, the more expensive it is.
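To put some numbers on this, consider the sketch below, which computes the average access time over a range of hit ratios. The 10 nsec cache and 100 nsec main store are assumed round figures chosen purely for illustration; only the arithmetic matters.

    /* Illustrative sketch: effective access time as a function of hit ratio.
     * The timing figures below are assumed round numbers, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        const double t_cache = 10.0;   /* assumed cache access time, nanoseconds */
        const double t_main  = 100.0;  /* assumed main storage access time, ns   */
        double hit_ratio;

        for (hit_ratio = 0.50; hit_ratio <= 1.0001; hit_ratio += 0.05) {
            /* A hit costs one cache access; a miss costs the cache lookup
             * plus the main storage access needed to service it.          */
            double t_avg = hit_ratio * t_cache
                         + (1.0 - hit_ratio) * (t_cache + t_main);
            printf("hit ratio %.2f -> average access time %6.1f ns\n",
                   hit_ratio, t_avg);
        }
        return 0;
    }

With these figures, a 95% hit ratio keeps the average access time within a factor of two of the cache itself, while at 50% the slow main storage sets the pace.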
As always, there are no free lunches in this game. Getting the best possible performance to price ratio in a cache is never trivial, and to better understand why, we will look at several different areas of caching and what particular measures can be applied.

CPU Caches

Of all of the types of caches used, the CPU caches are by far the most vital; indeed modern computing machinery would largely come to a screeching halt without them. CPU caches first appeared when CPU instruction execution times started becoming shorter than access times to main memories. Once this situation occurred, machines could lose significant amounts of performance, since the CPU would stall waiting for instructions being fetched from the memory. Early minicomputers had no caches, but by the seventies the cache became a necessity, as CPU speeds outstripped memory speeds.

For caching to work at all, the accesses a CPU makes to memory must very frequently be to the same locations. As it turns out this is the case, for a reason which may not be all that obvious at first glance: programs usually contain plenty of loops. Any loop, when unravelled into a sequence of machine level instructions, amounts to little more than the repeated execution of the very same stream of instructions. Therefore the number of times the CPU cycles through the loop is the number of times the same set of address locations are accessed to read instructions. The larger the number of loops in the code, the better the cache hit ratio you can expect, given certain assumptions.

Caches employed in CPUs need a simple and fast mechanism for testing an address reference to determine whether it is in the cache or not. The most commonly used technique is to divide the cache into two parts, one of which stores a "cache entry tag" for identification, and the other of which stores the "cache entry", which is the contents of the cached memory location. The tag is typically produced by throwing away the lower address bits and using the upper address bits as the identifier. The lower address bits are then used to address both the tag and the entry in the cache. This is a very simple and elegant technique, since the combination of upper and lower address bits, and address location contents, is always unique.

In operation the lower address bits are used to find the tag and entry. If the tag matches the upper address bits, a cache "hit" occurs and the entry can be used. If not, then a cache "miss" has occurred and the sought contents must be found in memory and then loaded into the cache entry field, while the tag is loaded with the upper address bits. This scheme is termed a "directly mapped cache" and is very commonly used in Level 2 (L2) caches. It does not distinguish between instructions and data.

The limitation of the directly mapped cache is that it usually needs to be big to achieve a high hit ratio and thus be effective. If the directly mapped cache is fairly small, and a loop in the code is bigger than the cache, then the cache will be overrun, since part-way through the loop the cache will be filled, and subsequent instructions will start to overwrite what is already in the cache. In such a situation the hit ratio drops dramatically and performance suffers very badly indeed. This problem can be alleviated by dividing the cache into two parts, and allowing two concurrent cache entries with the same lower address bits, but unique upper address bits. Such a cache is termed a "2-way set associative cache".
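The tag and index mechanics described above are simple enough to capture in a few lines of C. The sketch below assumes a 256 entry directly mapped cache, 32 bit addresses and one word per entry; the sizes, field widths and the slow_memory_read() stand-in are purely illustrative, not those of any real CPU.

    /* Minimal sketch of a directly mapped cache lookup. */
    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_ENTRIES 256          /* must be a power of two            */
    #define INDEX_MASK    (CACHE_ENTRIES - 1)

    struct cache_line {
        int      valid;                /* has this entry ever been loaded?  */
        uint32_t tag;                  /* upper address bits                */
        uint32_t data;                 /* cached contents of the location   */
    };

    static struct cache_line cache[CACHE_ENTRIES];

    /* Stand-in for main memory: slow, but always correct. */
    static uint32_t slow_memory_read(uint32_t addr) { return addr ^ 0xdeadbeef; }

    uint32_t cached_read(uint32_t addr)
    {
        uint32_t index = addr & INDEX_MASK;     /* lower bits select the entry */
        uint32_t tag   = addr >> 8;             /* upper bits identify it      */
        struct cache_line *line = &cache[index];

        if (line->valid && line->tag == tag)    /* hit: tag matches            */
            return line->data;

        /* miss: fetch from main memory and update the entry */
        line->valid = 1;
        line->tag   = tag;
        line->data  = slow_memory_read(addr);
        return line->data;
    }

    int main(void)
    {
        printf("0x1234  -> 0x%08x (miss, then loaded)\n", (unsigned)cached_read(0x1234));
        printf("0x1234  -> 0x%08x (hit)\n",               (unsigned)cached_read(0x1234));
        /* 0x11234 maps to the same index but carries a different tag, so it
         * evicts the previous entry: the classic direct-mapping conflict.   */
        printf("0x11234 -> 0x%08x (conflict miss)\n",     (unsigned)cached_read(0x11234));
        return 0;
    }

The last access in main() lands on the same index as the first but with a different tag, so it evicts the earlier entry: exactly the conflict behaviour which set associativity is designed to relieve.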
The set associative idea can be extended further, splitting a cache into a 4-way or 8-way set associative arrangement. Performance analysis however suggests that little extra in terms of hit ratios is gained beyond a 2-way set associative scheme. Another common optimisation applied to CPU caches is the use of a "Harvard architecture" or split data/instruction cache, in which separate caches are used for instructions and data. Since instructions are only ever read, and are accessed far more often than data, Harvard architecture caches are typically asymmetric, with the instruction cache bigger than the data cache, and usually with a higher order of associativity.

The move to microprocessors as mainstream machines presented many challenges to machine architects, and fitting in caches proved to be one of them. Caches are big real estate gluttons on a CPU die, and the result by the mid nineties was the widespread use of two level caching schemes. In such a scheme, a small cache, usually under 128 kbyte, and using 2, 4 or even 8-way set associativity and a Harvard architecture, is embedded in the chip die and designated a Level 1 (L1) or "primary" cache. A Level 2 (L2) or "secondary" cache is then used, outside the microprocessor, and is usually much larger and slower. L2 caches are frequently directly mapped, although more recent designs can also be set associative.

The use of a two level cache hierarchy can often significantly improve hit ratios, but how dramatic a performance gain is seen depends very much upon the speed and the size of the L2 cache. L1 caches are always fast since they are sitting on the same slab of silicon as the CPU. An L2 cache may live on the motherboard, the CPU module board, or within the same package as the CPU die. How fast the L2 cache is depends on the speed of the cache memory and the width and design of the bus between the cache and the CPU. Therefore determining the performance gain from an L2 cache of a given size may not be easy.

For the end user there are a number of issues to be addressed in matching a machine cache to an application. The first necessary observation is that the only choice the user usually has is a limited range of vendor supplied options in CPU speeds and cache sizes, and sometimes a choice between an L2 cache on the module or on the motherboard. Where performance is at a premium, the more closely the cache is coupled to the CPU, the better, and the wider the cache bus, also the better. Choosing between a 256 kbyte, 512 kbyte, 1 Mbyte, 2 Mbyte, 4 Mbyte or even 8 Mbyte L2 cache size depends mostly on the application in use. A 256 kbyte L2 cache which is tightly coupled will beat a much slower 8 Mbyte cache if the application fits well; conversely, if the application is too large for the smaller cache, then its speed advantage amounts to little.

I was recently bemused to observe that a brand new Pentium III / Celeron system running at 400 MHz delivered inferior performance to a clunky 180 MHz Pentium-Pro with a tightly coupled 256 kbyte L2 cache, clearly because the faster CPU with the inferior cache architecture was seeing a poor hit ratio on the applications being run. Extra MHz may be a delight for marketeers, but may not leverage into real performance gains if the cache architecture is not well fitted to the application.
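A back-of-envelope calculation shows why such an outcome is entirely plausible. The access times and hit ratios in the sketch below are assumptions chosen for illustration, not measurements of any particular machine; the point is how strongly the L2 hit ratio drives the average access time, regardless of CPU clock speed.

    /* Back-of-envelope sketch of a two level cache hierarchy, using
     * assumed (not measured) access times and hit ratios.           */
    #include <stdio.h>

    int main(void)
    {
        const double t_l1  = 2.0;     /* assumed L1 access time, ns       */
        const double t_l2  = 15.0;    /* assumed L2 access time, ns       */
        const double t_mem = 120.0;   /* assumed main memory access, ns   */
        const double h_l1  = 0.90;    /* assumed L1 hit ratio             */
        double h_l2;

        /* Vary the L2 hit ratio to show how strongly it drives the result. */
        for (h_l2 = 0.50; h_l2 <= 1.0001; h_l2 += 0.10) {
            double miss_l1 = 1.0 - h_l1;
            double t_avg = h_l1 * t_l1
                         + miss_l1 * (h_l2 * (t_l1 + t_l2)
                                    + (1.0 - h_l2) * (t_l1 + t_l2 + t_mem));
            printf("L2 hit ratio %.2f -> average access %5.1f ns\n", h_l2, t_avg);
        }
        return 0;
    }

With these figures the average access time swings by nearly a factor of three purely on the L2 hit ratio, which is why a modest clock speed advantage is so easily erased by a poorly fitted cache.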
In practical terms this means that a prudent buyer will benchmark the system of interest with the application to be run. Ideally this should be done with a fistful of CPU/cache modules, to find exactly which cache size/speed combination best fits the application. Since fast and large L2 cache variants of most CPU modules cost much more than their lightweight siblings, it pays to be meticulous in this game. It is worth noting that on most CPUs the only way to determine actual hit ratios involves instrumenting the machine, or even running a machine simulation in software, neither of which are practical propositions for end users. So the only real choice is benchmarking. In any such benchmark, assuming equal CPU speeds, there will be a large drop in performance once the cache size falls below a given threshold for the application in use.

Caching techniques are used in other areas of modern CPU design, such as branch target caches for speculative execution and translation look-aside buffers for virtual memory mappings, but since these are cast into silicon the end user has no choice in the matter. In conclusion on CPU caches, the principal caveat is to choose wisely, since cache performance can make or break a high performance system, regardless of the CPU clock speed.

I/O Caches

A clever idea such as caching is not one to go unnoticed, and contemporary operating systems abound with various caches, implemented in software. These caches reside in the main memory and are used in a similar manner to CPU caches, except that the storage device being cached is the disk. Disks are slow mechanical devices with access times at best of the order of milliseconds, compared to the tens of nanoseconds we see even in the cheapest DRAM technology out there. Therefore caching disk blocks or other frequently used disk resident data pays off handsomely.

The most important I/O cache is the "buffer cache", "disk block cache" or "block cache", depending on whose terminology we opt for. This cache is a pool of memory blocks which are used to cache specific blocks on disk. Therefore when these disk blocks are to be written into, the disk I/O may complete almost immediately by writing into the cached image of the block in memory, rather than the physical disk itself. A well designed and tuned buffer cache can provide a very large gain in performance, since the disparity between memory and disk speed is so large.

A typical buffer cache scheme will leave management of the buffer cache to the operating system, which determines when to flush each block to disk, and also which blocks to sacrifice whenever a new block is brought into memory. The Unix sync command explicitly forces the operating system to flush its buffer cache to disk, and is a sensible precaution before doing anything which might result in a crash.

Tuning buffer caches is a necessary task on some older Unix variants; on newer ones the system soaks up unused pages in memory and allocates them to the cache. Either way, there is an issue with memory sizing which is of relevance to a system admin. If the buffer cache is undersized for the system and its load, performance will unnecessarily suffer. This is especially true with large servers and multiuser systems which carry a heavy I/O load. The task of tuning the buffer cache usually amounts to configuring a kernel with the appropriate balance between buffer cache size and remaining memory size for applications. Since memory is cheap these days, if the appropriate buffer cache size eats up so much memory that applications begin to swap, then it is time to plug in some extra memory.
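The basic write-back mechanism is easy to sketch. The toy fragment below assumes a fixed pool of buffers and a stand-in disk_write(); the names getblk, bwrite and sync_all merely echo traditional Unix terminology, and the read path and eviction policy are omitted. It illustrates the idea, not any real kernel's implementation.

    /* Minimal sketch of a write-back buffer cache. */
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 512
    #define POOL_SIZE  8               /* tiny pool, to keep the example short */

    struct buf {
        int  in_use;
        int  dirty;                    /* modified in memory, not yet on disk  */
        long blkno;                    /* which disk block this buffer caches  */
        char data[BLOCK_SIZE];
    };

    static struct buf pool[POOL_SIZE];

    /* Stand-in for a physical disk write. */
    static void disk_write(long blkno, const char *data)
    {
        (void)data;
        printf("writing block %ld to disk\n", blkno);
    }

    /* Find the cached copy of a block, or claim a free buffer for it. */
    static struct buf *getblk(long blkno)
    {
        int i;
        for (i = 0; i < POOL_SIZE; i++)
            if (pool[i].in_use && pool[i].blkno == blkno)
                return &pool[i];                 /* cache hit */
        for (i = 0; i < POOL_SIZE; i++)
            if (!pool[i].in_use) {               /* miss: claim a free buffer */
                pool[i].in_use = 1;
                pool[i].blkno  = blkno;
                pool[i].dirty  = 0;
                return &pool[i];
            }
        return NULL;                             /* pool full; eviction omitted */
    }

    /* A write completes by dirtying the in-memory copy only. */
    static void bwrite(long blkno, const char *data)
    {
        struct buf *bp = getblk(blkno);
        if (bp) {
            memcpy(bp->data, data, BLOCK_SIZE);
            bp->dirty = 1;
        }
    }

    /* The equivalent of sync: push every dirty buffer out to the disk. */
    static void sync_all(void)
    {
        int i;
        for (i = 0; i < POOL_SIZE; i++)
            if (pool[i].in_use && pool[i].dirty) {
                disk_write(pool[i].blkno, pool[i].data);
                pool[i].dirty = 0;
            }
    }

    int main(void)
    {
        char block[BLOCK_SIZE] = "hello";
        bwrite(42, block);             /* returns immediately: memory only   */
        bwrite(42, block);             /* second write hits the same buffer  */
        sync_all();                    /* now the block actually reaches disk */
        return 0;
    }

The essential point is that bwrite() returns as soon as the in-memory copy is updated, and nothing reaches the platters until sync_all() runs, which is precisely why an untimely crash or power failure can lose recently written data.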
With newer systems the buffer cache sizing issue amounts to carefully monitoring system I/O behaviour with tools such as sar or vmstat and then sizing the memory as needed. The importance of disk caching to server performance cannot be overstated: if you get it wrong, no amount of fast disk or CPU speed will solve the problem. There is no substitute for DRAM in this game.

Another important cache commonly found in Unix systems, and first introduced in earlier BSD variants, is the filesystem name cache. It caches bindings between file names and the inodes in the filesystem. While it is hidden from the user, and not a feature which is tuned, it is worth noting that it significantly reduces disk I/O activity. Without it, every reference to a file requires looking up the filesystem information which holds the mapping between the file name and inode, before any disk blocks can be read from disk into memory.

The final I/O cache worth mentioning is the hardware cache usually found embedded in the bowels of any modern disk drive. Interestingly, these caches evolved in response to the brain-damaged I/O and filesystem architecture of DOS and early Windoze variants, which compared poorly with their Unix equivalents. Most modern disk drives have caches ranging in size from hundreds of kbytes to Megabytes, and usually employ proprietary algorithms for choosing which blocks to keep in the cache. In terms of performance, an operation to a SCSI or IDE disk with a cache, providing a cache hit occurs, usually completes in well under a millisecond, compared to the several milliseconds required to access the platter itself. Therefore the performance of the internal cache in a disk drive can have a huge effect on overall disk access time.

Since internal disk caches are not easily accessible to a user, they are mostly irrelevant as an item to fuss over. However, users of high integrity databases might wish to explore the cache flushing strategy used by a specific disk, and consider turning the cache off. While this costs performance, there is no chance of being embarrassed by a power failure and a trashed journal which was never written to the platters.

Application Caches

Just as the cache found its way into modern operating systems, so it has also become a feature of many application programs. As with caches in hardware and OS internals, application caches aim to improve performance by storing frequently used items in a faster storage medium than the one they usually reside upon. Probably the best known example is the Netscape browser, which maintains a memory cache and a disk cache for files it has accessed on the web. Both of these caches can be adjusted by the user, and both can be manually purged as well. The idea is of course to store frequently accessed URLs on a faster medium than the web, not a difficult task considering current web performance.

Browsers are of course not the only place caches are found; caching web proxy servers use them, as do many database applications. For any software developer, the use of a cache should always be considered where there is a strong likelihood of frequent repetitive access to data.

Tuning application caches revolves yet again around speed and hit ratios. Speed is maximised by caching in memory, or in a very fast storage device. Hit ratios are application dependent, but generally improve with increasing cache size, with the caveat that random web browsing is not conducive to high hit ratios!
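As a sketch of what such an application level cache can look like, the fragment below memoises an expensive lookup in a small in-memory hash table. The hash scheme, the table size and the expensive_lookup() stand-in are all assumptions made for the example; a production cache would add eviction, sizing and invalidation policies.

    /* Illustrative sketch of a small in-memory application cache placed in
     * front of an expensive lookup (a database query, file read, web fetch). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SLOTS 64

    struct slot {
        char *key;       /* NULL when the slot is empty */
        long  value;
    };

    static struct slot table[SLOTS];

    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h % SLOTS;
    }

    /* Stand-in for the slow operation being cached. */
    static long expensive_lookup(const char *key)
    {
        printf("slow path for \"%s\"\n", key);
        return (long)strlen(key);
    }

    static long cached_lookup(const char *key)
    {
        struct slot *sl = &table[hash(key)];
        if (sl->key && strcmp(sl->key, key) == 0)
            return sl->value;                    /* hit: no slow work done   */
        free(sl->key);                           /* miss: overwrite the slot */
        sl->key   = strdup(key);
        sl->value = expensive_lookup(key);
        return sl->value;
    }

    int main(void)
    {
        cached_lookup("index.html");     /* miss, slow path */
        cached_lookup("index.html");     /* hit, fast path  */
        return 0;
    }

Whether such a cache pays off depends, as always, on how often the same keys recur.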
The disadvantage of memory resident caches is that they are usually volatile, unless designed to flush into a disk cache. Also, memory is at least a factor of ten to a hundred more expensive than disk, per Megabyte. There is a recently developed means of cheating in this game, since some manufacturers now supply SCSI "solid state disks", essentially DRAM devices which are packaged and logically accessible as a SCSI disk, but much faster. Some types also include internal battery backup, and are thus persistent like magnetic media. Therefore it is entirely feasible to place a disk resident application cache on a solid state disk, rather than a conventional disk. While the solid state disk is much more expensive per Megabyte, if it is holding only a cache, the tradeoff may well be worth pursuing.

Caching is an extremely powerful technique for maximising system performance, but like most techniques in this trade, it must be well understood to be properly exploited. Beginners are well advised to tread carefully.
Artwork and text © 2005 Carlo Kopp