GigaHertz Processors - Getting Bang for the Buck
Originally published August, 2001
by
Carlo Kopp |
© 2001, 2005 Carlo Kopp
The recent emergence of processor chips with clock speeds in excess of a GigaHertz has been a major advance for the industry. There can be no doubt that the availability of commodity chips with 1.7 GHz clocks provides an unprecedented gain in the performance potential of systems, and thus applications. The massive improvement in compute performance potential is paralleled by an ongoing decline in the cost of memory, and especially disk storage. The technology of GMR disk heads has pushed the capacity of the commodity $300 drive into the tens of Gigabytes, de facto outstripping the capacity of most backup tape technologies.

Actual performance is, however, distinct from performance potential, since poor design and implementation of applications may see even GigaHertz clock speed processors yield little return in achieved performance against their siblings of half a decade ago. In this month's issue we will explore what impediments exist to the extraction of the full performance potential of this generation of processors, and discuss what strategies a developer can pursue to extract as much bang as possible from the available bucks.

GigaHertz Processors

The biggest breakthrough contributing to the breaking of the 1 GHz clock speed barrier was the introduction, last year, of copper metallisation fab technology. After many years of research, IBM cracked the problem of how to replace aluminium, until then the mainstay of on-chip wiring. More conductive copper reduces series resistance in the on-chip wiring, in turn reducing RC delay effects. The first processor to utilise this technology was AMD's Athlon, soon followed by the Pentium IV. We can expect to see most of the mainstream chip manufacturers move to copper over the next year or two.

A commodity microprocessor in this class will have tens of millions of transistors on chip, clock speeds between 1.0 and 1.7 GHz (this year) and a 6 to 9 way superscalar architecture incorporating capabilities such as speculative execution and out-of-order execution. More than likely a 4 to 8 way set associative Level 2 cache of up to 256 kbytes will be integrated on the chip die, as well as a Level 1 cache of up to 128 kbytes. The machine is likely to use a 64-bit or wider system bus to main memory, running at a clock speed between 100 and 400 MHz, depending on the chipset in use. By any measure, such machines have formidable performance potential if properly exploited.

Since performance scales in part with clock speed, but also with the degree of superscalarity in the CPU and the hit ratio of the CPU caches, for many applications such chips will yield performance gains greater than the ratio of their clock speed against the clock speed of a sub-1 GHz chip of similar architecture. Superficially, this would suggest that we can safely assume that if application X running on operating system Y delivers Z amount of performance, going from a 700 MHz CPU to a 1.4 GHz CPU should halve the achievable response time in interactive work, or double the throughput of non-interactive work. Careful examination of published benchmarks suggests otherwise, with the scaling in benchmark figures against clock speed ratios falling short of the ideal N-fold improvement in performance. To an observer not well versed in machine architecture, this seeming incongruity might appear puzzling. To understand why this arises, we must delve a little deeper.
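One way to see why the scaling falls short is a back-of-the-envelope model in C, sketched below. The 80% compute-bound figure is purely an illustrative assumption, not a measured value: only the portion of run time actually spent executing instructions shrinks with a faster clock, while time spent stalled on memory does not.

    /* Rough illustrative model (not from the original article) of why a
     * doubled clock rarely doubles delivered performance.               */
    #include <stdio.h>

    int main(void)
    {
        double cpu_fraction = 0.8;   /* assumed fraction of time that is CPU bound */
        double clock_ratio  = 2.0;   /* e.g. 1.4 GHz versus 700 MHz                */

        /* Amdahl-style speedup: stall time is unchanged, compute time shrinks */
        double speedup = 1.0 / ((1.0 - cpu_fraction) + cpu_fraction / clock_ratio);

        printf("Naive expectation: %.1fx, modelled speedup: %.2fx\n",
               clock_ratio, speedup);   /* prints 2.0x versus 1.67x */
        return 0;
    }

Under these assumptions a nominal 2x clock gain yields only about 1.67x in delivered performance, and the shortfall grows as the memory-bound fraction grows.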
Impediments to Performance

Let us assume an ideal world - perhaps a dangerous form of speculation! In this world the application and operating system are always resident in the CPU's internal on-chip caches, and the binary executable code created by the compiler is dominated by instructions which are not mutually dependent. In this ideal world, the GigaHertz processor will chew through the stream of instructions in the application at its peak achievable throughput almost all of the time. Of the N execution units in the CPU, nearly all will be active all of the time. As the application is always resident in the cache, every instruction fetch sees the nanosecond class access time of the cache, rather than the tens of nanoseconds access time of the main memory. Therefore, the CPU sees an uninterrupted stream of instructions and is never stalled for want of instructions to process.

Reality, however, might be very different. Superscalar architectures can execute at peak throughput only when the code they are executing contains few instructions with mutual dependency. Whenever an instruction is dependent upon the results of a previous instruction, there is potential for performance to be lost. Computer architects use the term Instruction Level Parallelism, or ILP, to describe the property of executable code whereby little mutual dependency exists between instructions. A particularly troublesome situation for superscalar processors is branching, since waiting for the outcome of a branch has the potential to empty the internal pipelines in the CPU, incurring significant latency to refill them.

Commodity processors contain numerous design features aimed at exploiting ILP and avoiding stalls. Speculative execution is perhaps the most popular technique in modern processors. At the cost of considerable complexity in logic, supportable due to the large transistor counts available, the CPU will prefetch and execute both outcomes of a branch instruction and discard the outcome which is not used. The difficulty with speculative execution is that the problem becomes intractable where consecutive branches are found. Consider a piece of code in which four or five branches are nested. The first branch statement results in two paths to speculatively execute. Each consecutive branch doubles the number of instruction streams, from two to four, four to eight and so on. The logic handling speculative execution must prefetch instructions for each of these streams. In practice this imposes limits on how far the CPU can look ahead into the code and speculatively execute. Once that limit is hit, the CPU stalls waiting for the pending branch operation to be resolved.

The difficulty with the ILP problem is that mutual dependency between operands in instructions is very frequently implicit in the algorithm being used. Therefore no amount of compiler optimisation or other clever trickery can beat the problem. Within any soup of instructions found in a binary module, there will be threads of instructions with mutual dependency, and these in effect form critical timing paths for the algorithms being used. The problem is fundamentally the same as the critical path problem seen in any PERT chart. The push toward Very Long Instruction Word (VLIW) architectures, such as Transmeta's Crusoe and Intel's IA-64, is an attempt to grapple with this problem by pushing the instruction scheduling problem out of the hardware and into the software.
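As a concrete illustration of what mutual dependency looks like in real code, consider the two summation loops sketched below (my own example, not drawn from any particular compiler or library):

    /* The first loop forms a single chain of dependent additions which no
     * amount of superscalar hardware can speed up.  The second keeps four
     * independent partial sums, exposing ILP a wide CPU can exploit - at
     * the price of reordering the floating point additions, which is
     * precisely why an ordinary compiler is not free to do it for you.   */
    double sum_serial(const double *a, long n)
    {
        double s = 0.0;
        long i;
        for (i = 0; i < n; i++)
            s += a[i];                  /* each iteration waits for the last */
        return s;
    }

    double sum_unrolled(const double *a, long n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        long i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i];                 /* four independent dependency chains */
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)              /* handle any leftover elements       */
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }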
The basic idea behind VLIW is that the bigger the block of instructions which can be explored, the more opportunities exist to keep the CPU's execution units busy with instructions which are not stuck along a critical scheduling path.

The issue of cache performance is no less daunting with a very fast CPU. The aim of all CPU caches, be they combined data/instruction caches, or Harvard architecture (split) caches with dedicated data and instruction paths, is to hide the poor performance of the main memory. If the instruction or data is resident in the cache, it can be accessed within a clock cycle or two; if not, the CPU has to wait for the data or instruction to be fetched from the main memory. The classical metric of cache performance is the hit ratio: the ratio of cache accesses which find the data or instruction in the cache to the total number of accesses. A cache design which matches a program well will deliver hit ratios well above 90%.

Machine architects like to use a simple metric for measuring the impact of cache hit ratio: the average memory access time. This is calculated by adding the proportion of hits times the cache access time to the proportion of misses times the main memory access time. The higher the hit ratio, the closer the average memory access time is to the desired (very short) cache access time. A good example might be a 500 MHz CPU which uses a 2 nanosecond cache and a 50 nanosecond main memory. If the cache hit ratio is 99%, the average memory access time comes in at 2.48 nanoseconds. Even a 1% miss ratio results in a 24% performance loss against the ideal 100% hit ratio situation. If the hit ratio is further degraded, say to 50%, the average memory access time becomes 26 nanoseconds. This is a 13-fold increase against the ideal 100% hit ratio situation! The interesting point then becomes what this costs in real terms. If we assume the CPU can execute an instruction in 2 nanoseconds on average, then in 50 nanoseconds that CPU did not execute 25 instructions while it was waiting for the stalled fetch to be resolved. What happens if the CPU is a genuine GHz class processor with a clock speed of 1.5 GHz, but the same order of magnitude 50 nanosecond main memory? The cache access time is much shorter, at 660 picoseconds or 0.66 nanoseconds. With every stall resulting from a cache miss, the CPU idles while roughly 76 instructions are not executed.

While the issue of ILP is important for GigaHertz processors, the performance of the cache architecture is a do-or-die issue. The problem of speed disparity between caches and main memories will only get worse over time, because the economic driver in the DRAM market is density rather than speed. Moore's Law being what it is, we see CPU clock speeds, and thus cache speeds, increasing by a factor of about ten every decade. Yet over the last decade DRAM speeds have increased only by a factor of roughly 2-3. Around 1990, a typical hot CPU ran at 50 MHz and DRAMs typically had around 70 nanosecond access times, a ratio of about 3.5 for commodity hardware. In 2001, a typical hot CPU runs at 1.5 GHz yet commodity DRAMs have typical access times of 30-50 nanoseconds, a ratio of about 45-75. The disparity between cache and DRAM speeds has therefore grown by a factor of 13 to 21 times.
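The figures quoted above follow directly from the average memory access time formula, and can be reproduced with a throwaway few lines of C using the same 2 ns cache, 50 ns DRAM and 0.66 ns cycle time as in the example:

    /* t_avg = hit_ratio * t_cache + (1 - hit_ratio) * t_memory */
    #include <stdio.h>

    static double amat(double hit_ratio, double t_cache, double t_mem)
    {
        return hit_ratio * t_cache + (1.0 - hit_ratio) * t_mem;
    }

    int main(void)
    {
        double t_mem = 50.0;                      /* DRAM access time, ns */

        /* 500 MHz class CPU with a 2 ns cache */
        printf("99%% hits: %.2f ns\n", amat(0.99, 2.0, t_mem));   /* 2.48 ns */
        printf("50%% hits: %.2f ns\n", amat(0.50, 2.0, t_mem));   /* 26 ns   */

        /* 1.5 GHz class CPU: one 0.66 ns cycle per cached access, so a
         * single 50 ns miss costs about 76 instruction issue slots      */
        printf("Slots lost per miss: %.0f\n", t_mem / 0.66);
        return 0;
    }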
Some confusion may result from contemporary marketing terminology surrounding the newer technology synchronous DRAMs. Such SDRAMs are designed for burst mode operation, where the access time for consecutive locations in the burst is typically between 7 and 12 nanoseconds for current technology. The snag, however, is the initial latency of the SDRAM access, which is still somewhere between 30 and 50 nanoseconds for commodity hardware. If the CPU has experienced a cache miss and is stalled, then it does indeed have to wait the full 30-50 nanoseconds before the main memory responds. This is the nature of a cache miss. Many tricks are used in modern architectures to avoid misses, and the whole idea of SDRAM techniques is to exploit smart CPUs which will prefetch instructions and try to perform speculative execution. The idea is of course to pipeline the main memory so that it is busy fetching instructions all of the time. However, if the prefetch logic does not know where to prefetch from, for instance because the multiple branch scenario has saturated the speculative execution control logic, then the cache cannot be prefilled and a cache miss will indeed arise. So we are back to the basic and fundamental problem of cache hit ratios and the ever growing disparity between the speeds of caches and DRAMs. The basic conclusion to be drawn here is that the cost of a cache miss will continue to increase over time, as Moore's Law continues to drive CPUs along their much steeper performance growth curve. Where does this leave the system developer?

Beating the Cache/DRAM Disparity Problem

Competitive pressures in the commodity computer market have seen significant improvements in the architecture of x86 instruction set systems in recent times, in effect nullifying much of the traditional advantage held by Unix workstations. During most of the 1990s, x86 systems could usually match the clock speeds of the flavour-of-the-month RISC Unix machines, but were typically installed in motherboards with relatively primitive memory and system bus architectures. Therefore, regardless of cache performance, the achievable bandwidth to memory was an ongoing problem for x86 systems. By the end of the 1990s, AMD had licensed the DEC/Compaq bus developed for the Alpha, and Intel have recently released their equivalent in P-IV based systems. Therefore in basic desktop or deskside systems, the performance issues will be firmly centred on the achievable cache hit ratio.

Cache architectures have also seen serious improvements over the last two years, in large part due to the availability of more transistors on the chip die. The biggest single step in the x86 market was the last generation of the P-III series, which incorporated an on-chip 256 kbyte 8-way set associative L2 cache with a full speed 256-bit wide bus between the L2 cache and the CPU/L1 cache. It is reasonable to expect similar capabilities across the market over the coming 12 months.

Cache hit ratio is a very complex function of the cache architecture and how it interacts with the program being executed. A machine architect has three basic parameters to play with:

- the total size of the cache;
- the degree of set associativity;
- the size of the cache line.
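A minimal sketch of how these three parameters fit together is given below, assuming the conventional set associative organisation; the 32 byte line size used for the P-III style L2 cache is an assumption for illustration only:

    #include <stdio.h>

    struct cache_geometry {
        unsigned size_bytes;    /* total capacity, e.g. 256 * 1024     */
        unsigned assoc;         /* ways per set, e.g. 8                */
        unsigned line_bytes;    /* bytes per cache line, e.g. 32 or 64 */
    };

    static unsigned num_sets(const struct cache_geometry *c)
    {
        return c->size_bytes / (c->assoc * c->line_bytes);
    }

    static unsigned set_index(const struct cache_geometry *c, unsigned long addr)
    {
        /* which set a given address competes for */
        return (unsigned)((addr / c->line_bytes) % num_sets(c));
    }

    int main(void)
    {
        /* 256 kbyte, 8 way, assumed 32 byte lines */
        struct cache_geometry l2 = { 256 * 1024, 8, 32 };

        printf("sets: %u\n", num_sets(&l2));                /* 1024 sets */
        printf("addr 0x12345678 -> set %u\n", set_index(&l2, 0x12345678UL));
        return 0;
    }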
What we see in commodity products is usually some balanced trade-off between these three parameters, to maximise the performance of the design against a suite of benchmark programs such as SpecMarks, WinMarks and others. A clever designer will identify which benchmarks are most representative of market needs and bias his or her design in that direction. A developer is therefore largely exposed to what the chip maker's marketeers believe to be the most suitable benchmarks. If the application being developed does not fit the market template for that CPU design, the odds are that its full performance potential will not be exploited.

There is still some choice left in the market, in that many CPUs can be bought with different cache sizes. Server or power-user optimised CPU variants may be available with 0.5, 1.0, 2.0 or even 4.0 - 8.0 Megabyte caches. The simplest approach for a developer is to benchmark the application against a range of CPU variants and identify those which deliver significantly poorer performance. Assuming all else is the same, the smallest cache size among the machines which perform well is the cache size which fits the application. The very same argument can be applied to the operating system, should this choice be available to the developer. Where the developer is producing a turnkey system, or is in a position to specify a target platform or CPU type, this model is practical and simple. For shrinkwrapped software products, especially those targeted at desktop applications, this model is almost worthless: the user may be running any one of many different variants of a commodity CPU, especially in the Intel/AMD market. In this environment, getting improvements in the cache hit ratio may require some clever manipulation of the application design, which is a non-trivial task.

Tuning Applications for Cache Size

In a given range of commodity CPUs, it is usually not difficult to visit the vendor's website and establish what the cache architectures of the respective CPUs are. The parameters of interest are:

- the size of the L1 instruction cache;
- the size of the L1 data cache;
- the size of the L2 cache.
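Where a rack of different CPU variants is not available, a crude working set sweep run on the machine at hand can give a similar reading. The sketch below is my own illustration, not from the article: it times a fixed number of strided accesses over progressively larger arrays, and throughput typically drops as the working set spills out of the L1 and then the L2 cache.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MAX_BYTES  (8 * 1024 * 1024)
    #define ACCESSES   (64 * 1024 * 1024)

    int main(void)
    {
        volatile int *buf = calloc(MAX_BYTES / sizeof(int), sizeof(int));
        size_t ws;

        if (buf == NULL)
            return 1;

        /* double the working set each pass: 4 kbyte up to 8 Mbyte */
        for (ws = 4 * 1024; ws <= MAX_BYTES; ws *= 2) {
            size_t n = ws / sizeof(int);
            size_t i, idx = 0;
            clock_t start = clock();

            for (i = 0; i < ACCESSES; i++) {
                buf[idx] += 1;                 /* touch the working set   */
                idx = (idx + 16) % n;          /* 64 byte stride          */
            }
            printf("%7lu kbytes: %.2f s\n", (unsigned long)(ws / 1024),
                   (double)(clock() - start) / CLOCKS_PER_SEC);
        }
        free((void *)buf);
        return 0;
    }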
Knowing these parameters for a range of CPUs, it is feasible to pick the lowest common denominator - the smallest L1 data and instruction cache sizes, and the smallest L2 cache size. Most machines use a combined L2 cache (although this may change as transistor counts go up). Armed with this knowledge, the developer must then explore the behaviour of the application. Most applications will spend much of their running time in a large event loop, scanning for inputs and then executing code modules which respond to these inputs. With suitable profiling tools and a bit of common sense, it is feasible to establish what proportion of its time the application spends in which specific modules. Once this is known, the developer can try to identify which modules are run most frequently, and how large their executable code, data and stack segments are.

If the application is to achieve a high cache hit ratio, the code, data and stack segments must all be resident in the L2 and L1 caches. Therefore, to get the best achievable performance, the most frequently executed modules should be cache resident. A poor cache hit ratio in an infrequently run code module may verge on the irrelevant; conversely, a library routine which is being hammered in every pass of one or more loops should have a good hit ratio. The developer's aim at this stage is exploratory - identifying which parts of the application are frequently run and likely to be mismatched to the cache. A good example might be a status table on the heap or at the beginning of the stack. If it is known to be much larger than the L1 data cache or even the L2 cache, and it is very frequently accessed, the prospects are very good that it will experience a poor hit ratio. How poor a hit ratio will depend upon the locality of accesses, since it may well be that only a small fraction of the table is accessed very frequently and the rest ignored most of the time. A similar argument can be applied to a code segment. If that code segment comprises a fairly large loop, and is clearly much larger than the L1 instruction cache or even the L2 cache, the prospects are very good that it will become a performance killer.

Once the developer has identified the problem areas in the application, the next step is to make an attempt at fixing them. Consider a piece of code which sits in a large loop, progressively working through whatever chores it needs to perform. Is there any reason why this large loop cannot be split into a series of shorter loops, each working on some part of the problem? Smaller loops which fit into the caches will do the same work faster than a large loop which continually overwrites itself in the cache. Must the data set be held in an enormous array, if it can be split into a larger number of smaller arrays, each of which fits easily into the caches? In a sense, tuning an application for best cache performance is analogous to the tedious chores performed by database developers, who must manipulate access patterns in a manner devised to minimise disk accesses.

Clearly, this can be a tedious and time consuming chore, especially if an existing application is to be hacked into something which performs better. If a new application is being written, the ground rules are simple - smaller is better, be it in the length of loops or the size of data structures. Applying such a strategy intelligently will matter - only those chunks of code and data which are executed very frequently will return a good payoff in effort expended for the performance gains seen.
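As a sketch of the kind of restructuring suggested above (illustrative only, with sizes chosen to suit a 256 kbyte L2 cache), the fragment below contrasts two full passes over a large array with a blocked version which applies both steps to one cache-sized chunk at a time while it is still resident:

    #define N      (4 * 1024 * 1024)       /* elements, ~16 Mbyte of floats */
    #define BLOCK  (32 * 1024)             /* ~128 kbyte, fits a 256k L2    */

    static void process_naive(float *a)
    {
        long i;
        for (i = 0; i < N; i++)            /* pass 1: scale                 */
            a[i] *= 2.0f;
        for (i = 0; i < N; i++)            /* pass 2: offset - the data has
                                              long since left the cache     */
            a[i] += 1.0f;
    }

    static void process_blocked(float *a)
    {
        long base, i;
        for (base = 0; base < N; base += BLOCK) {
            for (i = base; i < base + BLOCK; i++)
                a[i] *= 2.0f;              /* chunk is pulled into the cache */
            for (i = base; i < base + BLOCK; i++)
                a[i] += 1.0f;              /* and is still resident here     */
        }
    }

Both versions do the same arithmetic; the blocked one simply arranges the work so each chunk of data is fetched from main memory once rather than twice.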
A final cautionary note: shrinkwrapped libraries or operating systems with built-in cache hit ratio problems are likely to frustrate even the most intelligent application tuning effort. For those who might be sceptical of the impact of cache performance, I can cite a simple example - a (cache) simulation application which ran much faster on a 180 MHz Pentium Pro than on a 400 MHz Celeron, both of which used a virtually identical core microarchitecture. Cache performance does matter!
Artwork and text © 2005 Carlo Kopp