Code Morphing - The Shape of the Future?
Originally published August, 2000
by
Carlo Kopp |
© 2000, 2005 Carlo Kopp
The flavour of the month, or perhaps year, is Transmeta's new Crusoe family of VLIW processors, aimed at the laptop and notepad markets. Transmeta can rightfully claim the laurels for the first VLIW architecture processor to hit the mainstream market, and have implemented some very innovative ideas in the process. In this month's feature we will explore some of the ideas behind the Crusoe processors, and attempt to place them into the industry context.

Superscalar - Is the End Approaching?

The superscalar processor is today the mainstay of the computing market. Whether we are looking at an Intel or AMD i86, SPARC, HP-PA, Alpha, PowerPC or other chip, the architectural technique central to all is the superscalar model. The central idea in all superscalar processors is the exploitation of what we term Instruction Level Parallelism, or ILP for short. ILP is the property of a piece of code whereby some machine level instructions have no mutual dependencies.

Consider a program in which we add two numbers, and then multiply the sum by a third number. A compiler, or a human writing assembler, would convert this line of program code into two consecutive instructions, an ADD and a MUL. The result of the ADD is used as an operand for the MUL. This is a simple case of dependency: we cannot execute the MUL until the ADD has completed. Regardless of what CPU we have, such a piece of code cannot easily be made to run faster, other than by making the multiplication and addition hardware run more quickly.

What happens, however, if we have consecutive lines of code in which the operands have no mutual dependency? In principle, we do not really care about when we execute them - there is no need to calculate each result in the order in which it appears in the programmer's code. We can execute them in parallel, and it makes no difference to the result. This is the consequence of ILP - we can build a CPU with multiple adders, multipliers, shifters, or any other useful bits of arithmetic and logic hardware, and then execute instructions which exhibit ILP in parallel. This is the central idea behind the superscalar processor: exploitation of ILP to extract additional speed.

Superscalar processors have multiple execution units (EUs), modern-speak for slightly more elaborate examples of the humble ALU (Arithmetic Logic Unit). The more execution units, one would assume, the faster we can make the CPU run, since a greater number of instructions can be executed in parallel. The problem is of course in figuring out which instructions we can execute in parallel (out of order, to be formal), and which are mutually dependent and must be executed in the order in which they appear.

Herein lies the limitation of the superscalar processor. To separate the wolves from the sheep - mutually dependent instructions from mutually independent instructions - in the stream of code being executed, we have to analyse the behaviour of the code. For simple dependencies, such as consecutive arithmetic instructions, this can be fairly messy but is tractable. The real difficulty arises with a particular category of dependency, which machine architects call a procedural dependency. A procedural dependency occurs with a conditional branch (jump) instruction, where the address of the next instruction to be fetched depends on the outcome of an earlier computation. Until that computation is finished, the CPU cannot know where to fetch the next instruction from, so the machinery stalls until this happens.
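
To make the dependency idea concrete, here is a minimal C fragment. The variable names are arbitrary, and the comments assume a compiler that maps each statement to a single machine instruction:

    #include <stdio.h>

    int main(void)
    {
        int a = 2, b = 3, c = 4;
        int p = 5, q = 6, r = 7, s = 8;

        /* Dependent pair: the MUL needs the ADD's result, so the
           two instructions must execute in program order. */
        int sum  = a + b;      /* ADD           */
        int prod = sum * c;    /* MUL, uses sum */

        /* Independent pair: no shared operands, so a superscalar
           or VLIW machine may execute these two in parallel, or
           in either order, without changing the results. */
        int x = p + q;
        int y = r * s;

        printf("%d %d %d %d\n", sum, prod, x, y);
        return 0;
    }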
Superscalar processor designers were quick to appreciate this problem. Early processors used the technique of branch target caching, where a special register bank is used to hold the outcome of the first branch (jump) encountered. If the code is part of a loop, the odds are extremely good that the next time this branch is encountered, it will follow the same thread of execution. As a result, the much feared stall is typically encountered only at the beginning and the end of a loop.

Of course, two stalls per loop was still too much, and the preferred technique in modern CPUs is what architects term speculative execution. In such a processor, the hardware will prefetch both paths following the conditional branch instruction, putting the results into hidden temporary internal registers. Once the outcome of the branch is known, the proper thread of execution is committed and the other thrown away. This technique is used in a great many modern CPUs and is extremely effective, since no stall occurs.

Where is the gotcha in this game? The no free lunches rule asserts itself very quickly if we have several consecutive conditional branches. On the first branch, we have to prefetch two threads and speculatively execute them side by side. However, if each of those threads contains another branch, the number of threads immediately blows out to four. With yet another branch, this becomes eight. The superscalar CPU is thus presented with a problem analogous to that faced by a chess player trying to think ahead several moves - the number of speculative choices blows out very rapidly indeed, exponentially in fact for some situations.

Of course, to find ILP in the instruction stream, the further ahead we can search the better, since the code can in a sense be seen as a pool of mutually independent instructions, through which are threaded gaggles of mutually dependent calculations, and the control flow instructions. When a modern superscalar CPU hits a chunk of code which is rich in conditional branches, such as a nested switch() statement in C, it will bog itself down very quickly indeed, since it has only so many execution units and hidden registers to perform speculation with. The direct consequence is that beyond a certain point, extracting more performance results in an exponential blowout in the amount of hardware required. There are really no other tricks to play here, since the control flow of the program is fundamental to the algorithm itself, and cannot be escaped by using different languages or compilation techniques. While Moore's Law delivers exponential gains in transistor count, those gains get eaten up by the exponentially growing demands of speculative execution. As a result, the payoff from exponentially growing the chip complexity is only incremental.

Classical VLIW (Very Long Instruction Word) processors aim to overcome this problem by shifting the search for ILP from the CPU into the compiler. Rather than executing existing RISC or CISC instructions, a VLIW processor uses instruction blocks, each of which contains the equivalent of several simple RISC instructions or portions of CISC instructions. The idea is thus to move the time consuming process of finding ILP from runtime, in hardware, to compile time, in software. Instead of having to do this all the time, it is done only once.
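
What might such an instruction block look like? Here is a minimal sketch, assuming a four-slot format. The slot count, opcode set and layout are invented for illustration; real VLIW encodings, Transmeta's included, differ in their details:

    /* Illustrative only: a toy four-slot VLIW instruction word. */
    typedef enum { NOP, ADD, MUL, LOAD, STORE } Op;

    typedef struct {
        Op  op;
        int dst, src1, src2;    /* register numbers */
    } Slot;

    typedef struct {
        Slot slot[4];           /* all four issue in lockstep */
    } VliwWord;

    /* The compiler, having proven r1+r2 and r3*r4 mutually
       independent at compile time, packs both into one word;
       the hardware does no dependency checking of its own. */
    VliwWord w = {
        .slot = {
            { ADD, 5, 1, 2 },   /* r5 = r1 + r2 */
            { MUL, 6, 3, 4 },   /* r6 = r3 * r4 */
            { NOP, 0, 0, 0 },
            { NOP, 0, 0, 0 },
        }
    };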
A VLIW processor is thus freed from much of the complexity of the superscalar processor, and most of its hardware budget can be invested in execution units, registers, and a much simpler control architecture. If the compiler can extract more ILP, the VLIW processor can be given more execution units and thus run considerably faster.

What is the gotcha with a VLIW architecture? The principal issue is that the machine instruction set is quite different from that of a conventional CPU, and that a very smart compiler is needed to get the desired result. The uniqueness of a VLIW instruction set is perhaps the most painful problem of all, in a marketplace where a handful of well established CISC and RISC architectures in effect monopolise the established application and operating system software base. Indeed, since the RISC revolution we have seen almost no new entrants into the CPU marketplace, other than a handful of brave clone-makers producing alternate implementations of common architectures. AMD, recently covered in this series, are a good example. A new architecture in the conventional paradigm means porting compilers, operating systems, and then applications to the new design. Given how reluctant application developers are to port even across well established architectures and operating systems, new entrants are presented with an essentially insurmountable entry barrier to the market. Incumbency presents an advantage to the established players, one which can only be overcome with an enormous investment which, for all practical purposes, nobody can afford to make.

The obvious solution lies in some form of instruction set emulation, a technique with a long and colourful history. Hitherto instruction set emulation was done in one of two ways: either by using two sets of microcode in the one CPU, or by incorporating a direct translation mechanism in the instruction decoder, to map a single CISC instruction into multiple RISC instructions. The latter technique is basic to the Pentium family, the internal RISC instructions being hidden from the programmer. A third technique, plain emulation in software, also found some adherents, and has been a popular method for emulating Windoze on RISC machines. Enter Transmeta.

Code Morphing - the Shape of the Future?

Transmeta is a Silicon Valley startup, one which has, through clever technical thinking and clever marketing, managed to steal the limelight from the larger players in the last few months. Transmeta's claim to fame is its Crusoe processor family, and the code morphing instruction set emulation used with it. With famous industry names like Linus Torvalds consulting for them, they could hardly go wrong.

The classical approach to instruction set emulation essentially amounts to little more than mapping an instruction in one instruction set to one or more instructions in the native instruction set of the machine. At best, such emulations result in a similar class of CPU producing similar performance to the original architecture being emulated. The idiosyncrasies of the program are mapped across to the new architecture, since the emulation is relatively unintelligent. Every time an instruction is fetched, it is mapped into the new architecture, and thereafter executed on a conventional superscalar chip, like any other program. The only advantage achieved is that a common architecture, such as the x86, can be executed on a different CPU type.
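
In code, the classical scheme is essentially an interpreter loop. A minimal sketch follows, using an invented toy guest instruction set - the opcodes and structures are illustrative, not any real architecture:

    #include <stdint.h>

    /* Toy guest instruction set, invented for illustration. */
    enum { OP_ADD, OP_MUL, OP_HALT };

    typedef struct {
        uint8_t op, dst, src1, src2;
    } GuestInsn;

    /* Classical software emulation: every guest instruction is
       decoded and dispatched afresh each time it executes, so
       the decode overhead is paid on every iteration of a loop. */
    void run(const GuestInsn *code, int32_t *reg)
    {
        for (int pc = 0; ; pc++) {
            const GuestInsn *i = &code[pc];
            switch (i->op) {
            case OP_ADD: reg[i->dst] = reg[i->src1] + reg[i->src2]; break;
            case OP_MUL: reg[i->dst] = reg[i->src1] * reg[i->src2]; break;
            case OP_HALT: return;
            }
        }
    }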
The conventional strategy for emulation is certainly one feasible approach for breaking into an established market, but in itself it offers very little incentive for users to switch. Transmeta needed a decisive technological advantage to grab market share from established players, and a market niche where their product could compete effectively against the big players, like Intel and Motorola. What Transmeta opted to do was to combine emulation of the Intel x86 architecture with VLIW techniques, to produce a CPU competitive in performance with top end Intel superscalar CPUs, yet sufficiently simpler in hardware to dramatically reduce power dissipation and thus seduce the laptop and notebook industry.

The key to this commercial play is Transmeta's code morphing software. What Transmeta did was to wrap their proprietary VLIW processor within a layer of software, which translates Intel instructions on the fly into 16 byte VLIW instructions (termed molecules, each comprising four atoms), thus hiding the VLIW hardware from the operating system and the application. Where the Transmeta strategy differs from the conventional approach to emulation is that it is designed to translate blocks of several Intel instructions into blocks of several VLIW instructions, performing appropriate optimisations to exploit ILP in the algorithm.

This technique alone can, however, run up a significant overhead, since the optimisation will frequently require more compute cycles than the code to be executed itself does. So the Transmeta approach is to cache the translation from Intel instructions to VLIW instructions, so that the next time the block of Intel instructions is encountered, the cached translation is loaded instead of the translation algorithm being executed again. What this means is that a chunk of code which is executed very frequently is translated only once, and the optimised VLIW translation used thereafter. A translation cache is used to store the translations.

The Transmeta implementation stores the code morphing software, compiled to native Crusoe VLIW instructions, in Flash EEPROM chips on the motherboard. When the Crusoe comes out of a reset, it jumps to the code morphing software, loads it into the motherboard DRAM, and begins to execute its emulation of the Intel chip. The emulation, in turn, boots the Intel binary of the operating system, which in turn loads applications and executes them.

As with all caching schemes, the hit ratio will be decisive in achieving high performance. Transmeta's literature does not elaborate on their choice of cache replacement algorithm; we can speculate that some flavour of Least Recently Used (LRU) technique is used. The translation cache is a reserved area of the CPU's main memory, which the memory management hardware hides from the Intel emulation. The size of the translation cache is either set at boot time, or modifiable by the operating system at runtime.

What is of considerable interest is Transmeta's very intelligent approach to handling optimisation of the Intel to VLIW translations. The strategy adopted here is to incorporate additional code into the translation to gather access and branch outcome statistics. Every time a translation takes a hit in the translation cache, this is logged transparently by the CPU. In this manner, the code morphing software can monitor the frequency with which a particular translation is being reused, or a branch taken in the code.
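
The general shape of such a translation cache might look like the sketch below. Transmeta's real data structures are not public, so the layout, the sizing, and the translate() entry point are all guesses made for illustration:

    #include <stddef.h>
    #include <stdint.h>

    #define TCACHE_SLOTS 4096

    typedef void (*NativeBlock)(void);   /* a translated VLIW code block */

    typedef struct {
        uint32_t    guest_pc;   /* address of the x86 block        */
        NativeBlock native;     /* cached translation, or NULL     */
        uint64_t    hits;       /* reuse counter, for reoptimising */
    } TEntry;

    static TEntry tcache[TCACHE_SLOTS];

    /* Hypothetical: runs the translator, returns native code. */
    extern NativeBlock translate(uint32_t guest_pc);

    NativeBlock lookup(uint32_t guest_pc)
    {
        TEntry *e = &tcache[guest_pc % TCACHE_SLOTS];
        if (e->native == NULL || e->guest_pc != guest_pc) {
            /* Miss: translate once and cache the result. A real
               implementation also needs a replacement policy for
               collisions, perhaps the LRU flavour guessed above. */
            e->guest_pc = guest_pc;
            e->native   = translate(guest_pc);
            e->hits     = 0;
        }
        e->hits++;    /* the transparent per-hit profiling log */
        return e->native;
    }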
The more frequently a translation is being used, the greater the payoff in going back and reoptimising it more aggressively. For instance, the interrupt service routine for the hard disk would be such a candidate, whereas the code to probe hardware at boot time would not be.

Transmeta have been understandably coy about the details of their optimisation - indeed these are the crown jewels of the company's technology base. They have however stated that multiple levels of optimisation are supported. The simplest is interpretation, which directly maps Intel instructions into VLIW instruction primitives (atoms in Transmeta-speak), and which we can presume runs slower than native Intel, since the VLIW engine does not have superscalar scheduling hardware. Further up, simple code generation techniques can be used, or ultimately very clever optimisation techniques. The smarter the translation algorithm, the longer the translation takes, but the faster the runtime code in the translation cache becomes. The code morphing software is claimed to be capable of biasing code where conditional branches exist, to more highly optimise the most frequently taken branch, or to perform speculative execution in the manner of a superscalar CPU.

An example cited in the Transmeta literature illustrates these techniques. The first step in the translation sees the code morphing software remap a block of four Intel instructions into six atoms, or VLIW instruction primitives. This is performed by the frontend pass of the code optimiser, and produces a very simple block-to-block mapping with no optimisations. To improve performance, a second pass may be performed, using the optimiser. This pass employs techniques borrowed from the compiler world and impractical to implement in hardware, such as the elimination of common subexpressions and dead code, or code motion (also known as loop invariant removal). This can result in primitives being eliminated from the translation, making it much shorter. The final pass is the scheduler, which reorders the primitives in a manner similar to a superscalar CPU, to exploit ILP in the algorithm. As Transmeta point out, since this is being done in software, a much longer run of instructions can be worked upon than could be handled in superscalar hardware. As a result, more efficient scheduling algorithms can be employed than those cast into silicon on superscalar chips.

The general observation we can make at this point is that Transmeta's code morphing strategy will perform best on applications which tend to be highly localised in execution. Indeed, their target market of laptops running multimedia applications is a good fit to the design philosophy.

The approach has other interesting attributes. One is that the VLIW hardware architecture can be changed model by model, since accommodating incrementally different hardware requires only incremental changes to the code morphing software. This provides the important property of evolvability in the product, since the inevitable market pressures to tune performance can be accommodated without having to fundamentally change the product. More floating point execution units? Tack them onto the die, and tweak the code. In a more general sense, this approach has much greater potential. For instance, by changing the front end of the code morphing software, different machine architectures can be emulated. While Transmeta have not discussed this in the literature I have read, it is entirely feasible.
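
Tying the profiling statistics back to the multiple optimisation levels, a plausible control loop might look like the sketch below. This is entirely speculative - the tier names, thresholds and retranslate() function are invented, as Transmeta have not published theirs:

    #include <stdint.h>

    /* Invented optimisation tiers; Transmeta's are not public. */
    enum OptLevel { INTERPRET, SIMPLE_CODEGEN, AGGRESSIVE };

    typedef struct {
        uint64_t      hits;    /* gathered by the profiling code    */
        enum OptLevel level;   /* how hard this block was optimised */
    } Block;

    /* Hypothetical retranslation entry point. */
    extern void retranslate(Block *b, enum OptLevel level);

    void on_execute(Block *b)
    {
        b->hits++;
        /* Hot code earns a more expensive translation; cold code,
           such as boot-time hardware probing, never pays for one. */
        if (b->level == INTERPRET && b->hits > 50)
            retranslate(b, SIMPLE_CODEGEN);
        else if (b->level == SIMPLE_CODEGEN && b->hits > 10000)
            retranslate(b, AGGRESSIVE);
    }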
Indeed, if done right, it would in theory allow two or more different architectures to be emulated concurrently on the one system, operating system support of course being required. If the executable modules use a common binary format, such as ELF, it may indeed be possible to kludge a system which can run applications compiled for different architectures under the same operating system. Clearly there is much remaining potential in this interesting concept which is yet to be exploited.

At this time Transmeta are offering two CPUs. The smaller TM3200 is supplied in 366 and 400 MHz variants, with a 64 kbyte instruction cache, a 32 kbyte data cache, and integrated SDRAM and PCI controllers. The chip runs off 1.5 V and is claimed to consume as little as 15 milliwatts in operation, a trivial number compared to established superscalar chips in the same performance class. The faster TM5400 runs at 500 to 700 MHz, and uses a two level cache scheme, with a Harvard L1 cache of 64 kbytes per path, and a 256 kbyte write back L2 cache. The chip uses integrated controllers, like the TM3200, and is claimed to consume 1-2 Watts at full clock speed, running multimedia applications.

What is the future for Transmeta? Given the growth in wireless mobile applications, and the popularity of notepads and laptops, potentially very bright indeed. Much will depend, however, on convincing incumbent laptop makers to depart from their well tried Intel and PowerPC products, and that may take some doing. Nevertheless, Transmeta's development strategy gives them a significant technological lead over much of the established market. Whether they can fully capitalise upon this remains to be seen.
Artwork and text © 2005 Carlo Kopp