|Sixty Four Bit Processors - The Technical Issues|
|Originally published August, 1997|
|© 1997, 2005 Carlo Kopp|
|The sixty-four-bit CPU
is upon us, with all of the sales hype which accompanies any new
technology in the marketplace. For many users, sixty-four-bit
technology will make an important difference; for other users, less of a
difference. In this feature we will take a closer look at the technical
issues which underlie this technology, to illustrate what 64 bits have
to offer against the established 32-bit technology.
Operand Size in Processors
When people speak of 8-bit, 16-bit, 32-bit or 64-bit processors, they are referring to the size of the basic integer operand used in the machine, also termed the word size of the machine (NB it is worth noting that in newer literature the minicomputer nomenclature of a word being a 16-bit integer is commonly used, older literature uses the term in a more general sense).
The terminology has sadly been badly muddled in the marketplace over the years, particularly courtesy of the PC industry and its early commitment to a byte-sized machine word. As a result, many people other than Computer Scientists and Engineers may have a rather vague idea of what the issues are really all about.
The problem has been further complicated by a tendency for machines to acquire various performance "accelerator" architectural features with each generation. Therefore going from a 16-bit CPU to a 32-bit CPU has usually involved a lot more than simply increasing the word size of the machine. What the operand size in a machine basically determines is the number of bits which get crunched when doing a basic arithmetic operation such as an integer addition, subtraction, multiplication or division, where these are implemented.
The operand size determines the width of the processor's internal databus, the width of the processor register bank, and importantly, the width of the Arithmetic Logic Unit or ALU. An operand which is the width of the CPU word can be efficiently processed in a single, or at most a few, CPU operations. If the operand is smaller than the width of the ALU, no performance loss is incurred, as the operand simply fills a half, a quarter, or an eighth of the registers in use and the ALU.
However, should the operand be larger than the width of the ALU and internal databusses, then a performance loss will be incurred. This is because the operand will have to be split into multiple registers, and operated upon in bits and pieces to produce the desired end result. In the simplest of terms, a double width addition on a given CPU type will take at least twice as long as a single width addition, because you have to add the least significant word, produce a result and a carry if required, and then you must add the carry and the two most significant word components to produce the result.
How efficiently this can be accomplished determines how close you can get to the magic factor of two number. Assuming that all of the ancillary chores can be hidden in a clock cycle, and the carry can be also cleverly fed back into the ALU, then hopefully your double width addition will only take twice the number of clock ticks of a single width addition. This problem of course carries through to all double width operations, which you may need to perform in the machine. Address arithmetic is an important area which we will explore further.
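The double width addition described above can be sketched in C. This is my own illustrative code, not from the article: it performs a 64-bit addition the way a notional 32-bit ALU must, splitting each operand into 32-bit halves, adding the least significant words first, and feeding the resulting carry into the addition of the most significant words. The function and variable names are hypothetical.

```c
#include <stdint.h>

/* Sketch: a 64-bit addition performed via a 32-bit ALU.
 * The operands are split into 32-bit halves; the low halves are added
 * first, and the carry out of that addition is consumed by the addition
 * of the high halves -- two single-width operations at minimum. */
uint64_t add64_via_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo    = a_lo + b_lo;           /* first single-width add   */
    uint32_t carry = (lo < a_lo) ? 1 : 0;   /* carry out of the low word */
    uint32_t hi    = a_hi + b_hi + carry;   /* second add takes the carry */

    return ((uint64_t)hi << 32) | lo;
}
```

Note that even in this best case the operation costs two passes through the ALU, which is why a double width addition takes at least twice as long as a single width one.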
An Historical Perspective on Operand Sizes
Historically the trend has been to ever increasing operand sizes in machines. Traditionally the split between small machines, minis and micros, and mainframes and supercomputers was also a split between operand sizes. Large machines were built to do large floating point or integer jobs, and tended to operate with operand sizes of 18, 24, 32, 48 or 64 bits. For example, the trusty Cray-1 worked with 64-bit operands.
The then contemporary minicomputers, such as various models of the DEC PDP-11 or DG Nova, were firmly 16-bit machines, as was the LSI-11, DEC's chipset based microprocessor implementation of the PDP-11. In the days of CPUs built from sets of boards (this author recalls doing chip level debugs on VAX-11/780 CPU boards a mere decade ago) it was not uncommon to have to build a hefty cage of hardware to make a CPU from LS TTL logic chips (eg the trusty old 11/780 had no fewer than 32 large boards in the CPU).
Therefore if you needed bigger operand sizes, you needed more hardware. For any given technology doubling the operand size typically meant doubling the volume and thus at least doubling the cost of the hardware. Little machines therefore meant little operands, and big machines big operands.
Microprocessors started with 4-bit operands, Intel's 4004 being the classical example. Very soon, however, micros progressed to 8 bits, and the first wave of standalone desktop machines was built upon the 8080, 8085, their cousin the Z-80, and the 6809. These early CPUs were architecturally simple and used only real memory. Address arithmetic usually revolved around the magic number of 64k, a result of 16-bit wide address busses.
The IBM PC/PC XT is often described as an 8-bit machine of this generation. This is not true. The PC used the Intel 8088, which was a bastardised (Intel will hopefully forgive my language here) variant of the 16-bit 8086. The 8088 was essentially an 8086 which had its 16-bit datapaths cut down to 8-bits, indeed it could best be described as a hobbled 8086.
Therefore, one would properly describe the 8088 as a pseudo-8-bit machine. The minicomputer, typified by the enormously successful DEC VAX-11/780, peaked out at 32-bits. This operand size is quite sensible for general purpose work, be it integer or floating point. Only where accurate scientific floating point work is required is there a genuine case for 64-bit operands, from a basic arithmetic perspective.
The mid eighties saw a massive growth in microprocessor density, complexity and performance, which has continued unabated to this very day. What is very interesting is that most of the clever features we see touted today in modern micros, such as superscalar operation, pipelines, and clever arithmetic logic, are chip level implementations of sixties mainframe architectural features.
A late nineties micro is from an architectural perspective a collage of sixties mainframe technology, melded with eighties RISC microarchitecture. In computer architecture, ideas are seldom forgotten, they are usually resurrected years or decades later when they can be used to advantage again. Micros progressed very quickly from basic 16-bit architectures to accelerated 16-bit architectures, which used features such as pipelining to improve performance. Interestingly, the big change in the second generation of 16-bit micros was the adoption of more sophisticated memory management techniques.
Virtual memory arrived in the micro at this time, creating a whole new range of issues to deal with. By the mid to late eighties 32-bit micros were emerging, and the modern Unix workstation market was firmly built upon the combination of RISC, virtual memory and 32-bit architecture. The SPARC family, the R2000/3000, the early HP-PA and IBM's POWER architecture were all based upon variations of this basic model.
The 32-bit machine which had ruled the minicomputer market at its peak, also ruled the Unix workstation market. On a parallel track, Intel produced the 386, which was essentially an enhanced 32-bit derivative of the 286, which added paged virtual memory. The 386 was soon followed by the 486, which had further enhancements to improve performance. With clock speeds of tens of Megahertz and 32-bit arithmetic, the modern micro quickly killed off the traditional minicomputer. Indeed, propagation delays and clock skewing issues became major technical obstacles to squeezing more speed out of the mini, and thus minicomputers really became big micros with heavyweight I/O capability absent in workstations and PCs.
The early nineties saw the emergence of 64-bit micros, with DEC leading the pack with its VHSIC (the US DoD sponsored Very High Speed Integrated Circuit research program) technology based 64-bit Alpha. In a 32-bit world, why the sudden move to 64 bits? What is to be gained from a 64-bit engine? The answer to these questions will become a little clearer when we take a look at the issue of address arithmetic.
Address arithmetic is all about manipulating memory, and finding your way around in the complex maze of mappings produced by a modern memory management scheme. In the simplest instances, address arithmetic deals with pointer manipulation, and calculating absolute addresses from a base address and an offset. In more complex virtual memory systems, address arithmetic also involves interaction with the virtual memory hardware, to find your intended address through the mapping mechanisms of the virtual memory hardware.
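As a minimal sketch of the simplest case described above (my own C, with hypothetical names, not code from the article), the base-plus-offset calculation on a notional 32-bit machine is just an integer addition, wrapping modulo 2^32 exactly as a 32-bit address ALU would:

```c
#include <stdint.h>

/* Sketch: the simplest form of address arithmetic -- an absolute
 * (effective) address computed from a base address and an offset.
 * The addition wraps modulo 2^32, as a 32-bit address ALU does. */
uint32_t effective_address(uint32_t base, uint32_t offset)
{
    return base + offset;
}
```

This is precisely why address arithmetic is sensitive to operand width: if addresses are wider than the ALU, even this trivial calculation degenerates into the multi-word addition discussed earlier.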
Traditionally, computers have had asymmetrical datapath (ie databus) and address bus sizes. In early machines the available address space was usually quite modest, since memory was both bulky and expensive. Therefore it would not be surprising to find an 8 or 16-bit machine with a 16, 18 or 24-bit address bus. An interesting example would be early supercomputers, some of which had 64-bit datapaths but only 24-bit wide address busses. In such machines arithmetic accuracy determined the operand size, and memory technology limited the useful address range such that 24 bits sufficed.
In the micro marketplace, recent years have seen a massive growth in the size and complexity of application software. This has in turn placed pressure upon memory sizes, and we are now facing a reality where machines with hundreds of Megabytes of memory are commonplace, Gigabytes of memory are becoming more common, and as this trend continues, tens of Gigabytes will also emerge. Moore's Law, formulated by Intel founder Gordon Moore in the sixties, has yet to hit a fundamental barrier.
If we wish to address a Megabyte, we need an address bus of at least 20-bits of width. If we wish to address a Gigabyte, we need 30 bits, and a hardware address bus width of 32 bits allows only 4.3 Gigabytes. Since virtually all modern machines use virtual memory (no pun intended), this problem is exacerbated, since the address seen by a process in its own virtual address space must be mapped into a physical address in hardware.
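These figures follow directly from powers of two, and can be checked with a throwaway sketch (my own code, hypothetical names) which computes the smallest address width capable of spanning a given memory size:

```c
#include <stdint.h>

/* Sketch: the minimum number of address bits needed to span a given
 * number of bytes, ie the smallest n such that 2^n >= bytes.
 * Matches the figures in the text: 20 bits for a Megabyte, 30 bits
 * for a Gigabyte, and a 32-bit bus topping out at 2^32 bytes
 * (about 4.3 thousand million). */
unsigned bits_needed(uint64_t bytes)
{
    unsigned n = 0;
    while (n < 64 && ((uint64_t)1 << n) < bytes)
        n++;
    return n;
}
```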
Therefore a typical virtual address must also carry a hefty number of bits to provide the necessary mapping information which is used by the virtual memory translation hardware to find the required page in physical memory. As an example, the widely used Intel architecture provides 48 bits for addressing, of which 32 are used as an offset and 16 bits are used as a segment selector, to set up the virtual address translation hardware.
Providing that the process need not itself address a space larger than 4.3 Gigabytes, this scheme is perfectly viable. When a process (or task) is to be run, the operating system sets up the virtual memory hardware by loading segment registers, and the process then happily lives in its 32-bit virtual world.
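The 48-bit scheme just described can be sketched as a simple bitfield split. This is my own illustration of the selector/offset division stated in the text; the type and function names are hypothetical, and the layout here packs the two fields into a single 64-bit integer purely for demonstration:

```c
#include <stdint.h>

/* Sketch: a 48-bit Intel-style logical address divided into a 16-bit
 * segment selector and a 32-bit offset, per the text. Here the
 * selector occupies the bits above the 32-bit offset. */
typedef struct {
    uint16_t selector;  /* selects the segment (mapping information) */
    uint32_t offset;    /* position within the segment's 32-bit space */
} logical_addr;

logical_addr split_logical(uint64_t addr48)
{
    logical_addr la;
    la.selector = (uint16_t)(addr48 >> 32);
    la.offset   = (uint32_t)(addr48 & 0xFFFFFFFFu);
    return la;
}
```

The key point stands out in the code: the offset field, and thus the space a process can see at any instant without touching the mapping hardware, is only 32 bits wide.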
Consider however a gigantic database application, which needs to operate upon a memory address space bigger than 4.3 Gigabytes. Addressing beyond 4.3 Gigabytes becomes quite messy. On an Intel or similar engine you would have to manage your address space at an application level (as used to be done in the ugly days of 16-bit minis and memory overlay management), or provide suitable hooks in the operating system to support this.
You would then have to explicitly partition your address space into 4.3 Gigabyte sized chunks. When you need to jump between your 4.3 Gigabyte chunks, you would need to reload the memory management registers, in a similar manner to how you go about doing a process context switch. This means cache and translation buffer invalidation and reloading. It is messy, and time consuming, and if done frequently would incur similar performance penalties to context switching between many user processes on a heavily loaded machine.
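The explicit partitioning described above amounts to splitting every large address into a chunk number and an offset within the chunk. A sketch of the arithmetic involved (my own code, hypothetical names; the real bookkeeping in an application or operating system would be far messier):

```c
#include <stdint.h>

/* Sketch: explicit partitioning of a >4 Gigabyte address space on a
 * notional 32-bit engine. Each large (64-bit) address is split into a
 * chunk number -- selecting which 4 Gigabyte window must be mapped in --
 * and a 32-bit offset within that chunk. Whenever consecutive accesses
 * fall in different chunks, the memory management registers must be
 * reloaded, with the cache and translation buffer invalidation costs
 * described in the text. */
#define CHUNK_BITS 32

uint32_t chunk_of(uint64_t addr)
{
    return (uint32_t)(addr >> CHUNK_BITS);  /* which 4 GB window */
}

uint32_t offset_of(uint64_t addr)
{
    return (uint32_t)addr;                  /* position within it */
}
```

For example, byte 10 Gigabytes into such a space lands in the third 4 Gigabyte chunk, so a sequential scan through 10 Gigabytes forces at least two of these expensive remappings.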
Consider having to operate upon a 10 Gigabyte memory mapped file. With a 32-bit machine, this would require some drastic surgery to the operating system. With a 64-bit machine we can therefore allocate a larger number of bits to the address space of a process, thereby avoiding the messy memory management overheads which torment the 32-bit machine in such situations.
This is indeed the central benefit provided by 64-bit technology, and an area where it will provide drastic performance gains, and given the messiness of other alternatives, arguably important gains in reliability.
Other Performance Issues
The adoption of 64-bit architectures will indirectly provide other performance gains. Consider a notional 64-bit version of an existing 32-bit CPU, identical in every respect other than a redesigned virtual memory scheme and, of course, double width datapaths and arithmetic hardware. Assuming that you are doing mundane computing chores, which are integer arithmetic intensive, the primary performance gain you will see will be in higher instruction fetching bandwidth, as the databus into the CPU is doubled in width.
In RISC machines this is particularly important, because stalling for want of instructions can kill performance very quickly; indeed, this is the reason why large caches appeared first on RISC CPUs. Assuming instead that you are doing a lot of double precision floating point arithmetic, such as is typical for scientific and some graphics work, then you immediately achieve a doubling or better of your basic arithmetic performance, as you can execute the instruction directly upon the registers holding the operands. With additional instruction fetch bandwidth into the CPU, getting operands in and out of registers will also provide some performance benefit.
Understandably, when doubling the opcode fetching bandwidth into the CPU and doubling the bandwidth for operand accesses, it is necessary to ensure that the supporting caches, memory busses and main memory also have the bandwidth to keep up with a hungrier CPU. Dropping a 64-bit CPU into a board design sized in bandwidth for a 32-bit engine is unlikely to provide the expected performance gains. As a rule of thumb it is worth noting that multiprocessing 32-bit servers typically used 64 or 128-bit system busses to main memory, and main memory addressable in widths of 64, 128 or 256 bits. Doubling the width of the CPU datapaths suggests that to get good performance from a multiprocessing 64-bit engine, without changing the basic bus clock speed, we will need to at least double the width of the main bus to between 128 and 256 bits, and access main memory in widths of 128 to 1024 bits.
Needless to say, these are parameters in the class of supercomputers. With top of the line 64-bit scientific computing oriented micros, vendors' assertions of supercomputer class performance in some areas are not unreasonable. A purist will correctly argue that supercomputers are characterised by high performance vector processing floating point units, in addition to high scalar performance, and therefore that a 64-bit micro is not a supercomputer.
If you are looking to do supercomputer class work on a desktop, a typical 64-bit micro is unlikely to do the trick if you are doing a lot of array intensive computation, which is the forte of vector capable engines. If your application is one which is hard to vectorise properly, then such an engine may well solve your problems. If you are doing database or typical beancounting applications, then the performance gains provided in floating point are irrelevant.
What you will see is higher instruction fetching and operand access bandwidth, and the ability to address huge amounts of virtual and physical memory.
In summary it is fair to say that much of the hype around 64-bit engines overstates the case significantly. However, the doubling of bandwidth into the CPU for any given clock speed will confer a good performance gain even should the 64-bit CPU offer similar instruction cycle times to a 32-bit CPU of similar architecture. This will be achieved by doubling the complexity of the supporting hardware around the CPU. It is important to note that typically a 64-bit microprocessor will have deeper pipelines, and since most current CPUs are superscalar, also a larger number of execution units in comparison with its 32-bit predecessor. These features confer a performance gain regardless of the operand size and datapath width used.
Operating System Issues
Much has been said about the benefits of "64-bit" operating systems, versus established "32-bit" operating systems. However, unlike the transition from 16-bit MS Windows to 32-bit NT, where a fundamental change in memory management was involved, the differences between "32-bit" and "64-bit" Unix variants are relatively trivial.
The first and most important area of change is in the memory management routines in the bowels of the kernel, most often lovingly hand crafted in assembly code. These must be changed to reflect the new memory management model. In an ideal world, this would be all that is required.
In reality, most kernels and device drivers are full of naughty little hacks which make implicit assumptions about things such as pointer sizes, and their relationship to basic integer sizes. Therefore the next step in a port from 32 bits to 64 bits is to clean up these untidinesses. Since there are likely to be many, most of them well hidden, this is likely to be a painful and lengthy process.
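A classic specimen of the kind of hack meant here is code which round-trips a pointer through a plain int. The example below is my own, not taken from any particular kernel: on a machine where int and pointers are both 32 bits wide it works by accident, while on a 64-bit port (64-bit pointers, 32-bit int) the upper half of the pointer is silently truncated.

```c
#include <stdint.h>

/* Sketch: an implicit pointer-size assumption of the sort a 64-bit
 * port must hunt down. Returns 1 if a pointer value survives being
 * stored in a plain int, 0 if information was lost. On a 32-bit
 * machine this always "works"; with 64-bit pointers it fails for any
 * address that does not fit in 32 bits. */
int pointer_survives_int(void *p)
{
    int   as_int = (int)(intptr_t)p;         /* the naughty assumption */
    void *back   = (void *)(intptr_t)as_int; /* reconstruct the pointer */
    return back == p;
}
```

Because such code compiles cleanly and only misbehaves for large addresses at run time, it is exactly the sort of bug that hides until a machine is loaded with real data.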
Fortunately, good "linting" tools are readily available these days, and with a solid dose of regression testing, preferably in the lab rather than at the sites of hapless beta customers, these problems can be fixed. Needless to say, the delays in the availability of proper native 64-bit variants of existing Unix variants have most to do with the latter. Once these basic porting chores have been done, of course, an OS engineer can start tweaking other areas. The filesystem and block device drivers will be a natural target, since the bigger address space can immediately be exploited to advantage, to increase addressable file sizes in an efficient manner.
Memory mapping of files, and shared memory handling in the kernel are another area which is likely to see major changes, as available sizes are increased. Users should not expect dramatic changes from a 64-bit variant of an established operating system.
The most visible changes will be in increased limits on usable address ranges. Should other features be manipulated to increase OS performance, in most instances these would be quite independent of basic CPU operand size.
Summary
At the time of writing a number of 64-bit processors are well established in the marketplace, and the likely trend is that 64-bit machines will dominate the server and workstation markets by the turn of the decade.
Large commercial and database applications, as well as scientific applications, can benefit significantly from this technology. The benefits to desktop users running more mundane "productivity" applications are questionable. Other than basic gains in CPU performance and bus bandwidth, arguably achievable in 32-bit technology, 64-bit technology has little to offer such users.
Standardisation of hardware and operating systems will most likely mean that Unix workstations will largely become 64-bit engines, whether or not the application in question can exploit this. Of the largest vendors, all either have fielded or are in the process of fielding 64-bit CPUs. DEC were the first in the market with the Alpha, and it is unclear how well they have managed to capitalise upon the technology.
The next to follow were SGI, who have been shipping R8000 and R10000 based machines for some time now. Sun Microsystems have the UltraSPARC processor, which is their first step in this direction. HP have the PA-8000 engine, which is their response to the market.
Whether 64-bit processors will make a big difference to your application depends first and foremost upon the nature of the application. If we distance ourselves from the marketing hype, there are certainly gains to be made for many users in going to 64-bit machines; for most users, however, this next step in the brave world of computing is unlikely to be a panacea. What a system manager should contemplate is whether his or her users do or do not fall into that category.
|$Revision: 1.1 $|
|Last Updated: Sun Apr 24 21:19:58 GMT 2005|
|Artwork and text © 2005 Carlo Kopp|