



Programming with 64-bit Machines

Originally published  October, 2001
by Carlo Kopp
© 2001, 2005 Carlo Kopp

Part I Portability

The emergence of new generation 64-bit machines, be they based upon RISC technology or variants of newer VLIW technology, inevitably brings changes in programming technique.

While for many programmers the complexities of the machine architectures will be buried in the bowels of the compiler being used, the differences in technology are not always so completely transparent that the new architecture can be blithely ignored.

Architecture is relevant in two different respects. When developing new applications it is convenient if the architecture of the platform can be exploited to advantage, to improve application performance or functionality. When porting existing applications, more often than not prior limitations in performance or functionality can be engineered out.

VLIW and 64-bit Architectures

The 64-bit machine is not an entirely new item of technology, just as the Very Large Instruction Word (VLIW) architecture has a well established history. However, neither of these technologies has to date been widely used in the commodity desktop systems which are the bread and butter of the contemporary industry.

A 64-bit architecture differs from the well established 32-bit architectures, and older 16-bit and 8-bit architectures, in the size of the basic data operand, the integer word. In a 64-bit architecture, an integer word comprises 64 bits or 8 bytes, making it twice the size of a 32-bit word.

This has further consequences for the machine architecture. While the byte will remain the 8-bit entity we know and love, the odds are that a short integer type may become a 32-bit entity, and the size of a virtual address is apt to become a whole 64 bits, rather than the established 32 bits.

In floating point data types, the system is apt to use a double as its basic data type, rather than a standard IEEE 32-bit float.
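
By way of illustration, a trivial C program can be used to check what the compiler on a given platform actually does with the basic types. The sizes shown in the comment are typical of a 32-bit (ILP32) versus a 64-bit (LP64) data model, but the exact mapping is a compiler and ABI decision, not a certainty:

    #include <stdio.h>

    /* Prints the sizes the compiler actually uses. Typical results:
     * 32-bit (ILP32): int=4, long=4, pointer=4
     * 64-bit (LP64):  int=4, long=8, pointer=8
     * Other mappings (e.g. ILP64) are possible on some platforms. */
    int main(void)
    {
        printf("short:  %lu bytes\n", (unsigned long) sizeof(short));
        printf("int:    %lu bytes\n", (unsigned long) sizeof(int));
        printf("long:   %lu bytes\n", (unsigned long) sizeof(long));
        printf("void *: %lu bytes\n", (unsigned long) sizeof(void *));
        printf("float:  %lu bytes\n", (unsigned long) sizeof(float));
        printf("double: %lu bytes\n", (unsigned long) sizeof(double));
        return 0;
    }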

What are the aims of 64-bit architectures? These will depend primarily upon the applications which are targeted by the machine in question.

For engineering and scientific programmers, a 64-bit processor typically offers big gains in achievable floating point arithmetic performance. This is primarily because the floating point Execution Units in the CPU will be designed to handle 64-bit double operands in a minimal number of processor clock cycles, as compared to a machine in which the basic operand is 32-bits wide.

For programmers working in the database and commercial application environments, a 64-bit architecture breaks through the limitations inherent in a 32-bit addressing model. With 32 bits of address, the address space is limited to 4.3 Gigabytes. In practice, the architecture itself may impose additional restrictions which might make it difficult to address the full 4.3 Gigabytes.

While 4.3 Gigabytes may seem to be a gargantuan size for an address space, it is not. Commodity hardware is available today with the capacity to fit around 1 GB of main memory, and Moore's Law being what it is, we are apt to very soon cross the 4 Gigabyte barrier in very cheap machines. With the trend to create ever larger applications, the need for large virtual address spaces for stacks, heaps and data segments eats into the address space very quickly. If a database file of several Gigabytes can be wholly mapped and then held in main memory, the performance of an application can be significantly improved over repeated accesses to rotating mechanical disk storage. The difference between a 100 nanosecond memory access and a 10 millisecond disk access is a factor of 100,000.
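
As a sketch of the kind of code which benefits, the fragment below maps an entire file into memory with mmap(). The file name is hypothetical; the point is that once the file approaches the size of a 32-bit address space the mmap() call will simply fail, whereas on a 64-bit system the same code can map the whole file and let the virtual memory system do the rest:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    /* Map a (possibly multi-Gigabyte) file read-only into the address
     * space. On a 32-bit machine this fails once the file outgrows the
     * available address space; on a 64-bit machine it does not. */
    int main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "big.db"; /* hypothetical */
        struct stat st;
        void *base;
        int fd;

        fd = open(path, O_RDONLY);
        if (fd < 0 || fstat(fd, &st) < 0) {
            perror(path);
            return 1;
        }
        base = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) {
            perror("mmap");    /* the likely outcome on a 32-bit machine */
            return 1;
        }
        printf("mapped %ld bytes at %p\n", (long) st.st_size, base);
        munmap(base, (size_t) st.st_size);
        close(fd);
        return 0;
    }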

A 64-bit architecture will offer other useful gains, but these are less dramatic in visible effects. The ability to efficiently crunch 64-bit wide integers will be useful in applications performing encryption and compression, as a bigger bite of the data can be chewed at by the CPU, per instruction. Multimedia applications in which audio is processed might also benefit. Some signal processing applications may be able to exploit the greater dynamic range available.
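
A simple, admittedly contrived, illustration of the effect is a checksum routine which folds a buffer together 64 bits at a time. Each exclusive-OR then digests 8 bytes per operation, where a 32-bit machine would need two operations, or a compiler-emulated 64-bit type. The routine assumes the buffer is suitably aligned and a multiple of 8 bytes long:

    #include <stddef.h>

    /* Fold a buffer into a 64-bit checksum, 8 bytes per XOR. Assumes
     * the buffer is 8-byte aligned and len is a multiple of 8. */
    unsigned long long checksum64(const unsigned char *buf, size_t len)
    {
        const unsigned long long *p = (const unsigned long long *) buf;
        unsigned long long sum = 0;
        size_t i;

        for (i = 0; i < len / 8; i++)
            sum ^= p[i];
        return sum;
    }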

VLIW architectures, or variants thereof, such as the IA-64 model or the Transmeta model, are designed to deliver better performance than established Superscalar architectures, for a given amount of CPU real estate or transistor count.

The established Superscalar architecture machines aim to exploit an effect termed Instruction Level Parallelism, or ILP - the absence of mutual dependencies between the instructions in a program. Instructions without mutual dependencies can be processed out of order or concurrently, in as many Execution Units as the processor might have. Superscalar architecture machines often incorporate very elaborate hardware, occupying substantial proportions of the chip, to determine which instructions are free of mutual dependencies and can then be executed at convenient times, to keep all Execution Units busy for as much of the time as possible.
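
A small example makes the idea concrete. In the summation loop below, the four partial sums have no mutual dependencies, so a machine with several Execution Units can issue the four additions concurrently; a single accumulator, by contrast, forms a dependency chain in which each addition must wait for the previous one:

    /* Summation with four independent accumulators, exposing ILP which
     * a Superscalar (or VLIW) machine can exploit. */
    double sum4(const double *a, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;

        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i];        /* these four additions are independent */
            s1 += a[i + 1];    /* of one another and can be issued     */
            s2 += a[i + 2];    /* in parallel                          */
            s3 += a[i + 3];
        }
        for (; i < n; i++)     /* mop up any remaining elements */
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }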

The difficulty with Superscalar architecture machines is that to discover more instructions with the ILP property, the stream of incoming instructions must be explored in ever increasing depth. This problem is not unlike the lookahead problem in a chess game, as with every conditional branch instruction seen, the stream of instructions forks into two paths, both of which must be explored to discover exploitable ILP. This very quickly becomes unmanageable, and sets practical bounds on how much performance can be extracted from a Superscalar CPU.

VLIW aims to beat this problem by shifting the discovery of ILP and optimal scheduling of which instructions are to be executed when, into the compiler or a runtime environment (Transmeta). In this manner, chip real estate previously committed to hardware for instruction stream analysis and scheduling can be used for more execution units, and bigger caches, both of which are key drivers of achievable performance. Rather than performing instruction scheduling every time the code is executed, it is done once only during the compilation of the program.

As a result the same amount of Silicon real estate might yield twice as many execution units per CPU, with a potential doubling of performance per unit area of chip. This can be exploited to cram more performance into the same area and same power dissipation, or to match existing Superscalar chip performance with a cheaper and more frugal slab of Silicon (Transmeta).

Application Portability

In an ideal world the transition between 32-bit and 64-bit architectures would be wholly transparent. The application is simply recompiled, and all remains as it was. The application at runtime will perform better, how much better being very much a function of the application and the platform in question.

Trivial applications, and applications which make little use of the more sophisticated features in the operating system, will most likely follow this pattern with little if any deviation.

Things however may become more complicated when applications have dependencies upon the basic machine architecture, or language data typing which is bound to the architecture.

Problems may arise in a number of areas:

  • Changes in the size of basic data types following through system libraries.

  • Changes in the behaviour of arithmetic libraries.

  • Addressability and alignment of operands.

  • Alignment behaviour in shared memory.

  • Changes in address arithmetic arising from virtual memory architecture changes, following through system libraries.

  • Changes in interrupt handling and latencies, affecting real time applications.

With the emergence of VLIW based architectures, such as the IA-64, the handling of conditional branches at a machine level changes, through the use of predication techniques and software pipelining.

Many of these issues will be addressed in the compilers for the new architectures, but many of these potential headaches will leak through. This is especially true of older applications, and operating systems, which may have been in part written around the characteristic architectural idiosyncrasies of well established CISC and RISC architectures.

Changes in Basic Data Types

Well designed datastructures will in most instances move transparently between a 32-bit and 64-bit architecture. A side effect which may arise is the dependency of datastructure size upon the size of the basic integer operand. Where a compiler continues to treat an integer as a 32-bit entity, and a long integer as a 64-bit entity, the odds are the behaviour will remain as is. Where the integer is in effect promoted to a 64-bit entity, then datastructures will automatically double in size.

Changes in behaviour arising in this area are most relevant for applications which use very large arrays of operands, especially in scientific/engineering computing, but also in other applications which might need to handle such structures. If an array comprising integers or short integers is tens of Megabytes in size, doubling the size of each element doubles the required memory, in turn doubling the potential cost of the hardware if swapping, with its associated loss in performance, is to be avoided.

An application developer or maintainer will need to carefully explore the behaviour of the compiler when porting the application. It may well be that the data types in the structure must be changed to preserve existing memory demands.
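
One defensive approach, assuming a C99 style <stdint.h> or an equivalent set of local typedefs is available, is to declare size-sensitive structures in terms of explicitly sized types rather than bare int or short, so the layout no longer tracks the compiler's choice of integer width:

    #include <stdint.h>   /* or local typedefs where stdint.h is unavailable */

    /* Hypothetical record from a very large in-memory array. Declared
     * with explicitly sized members, it remains 8 bytes whether or not
     * the compiler promotes int to 64 bits. */
    struct sample {
        int16_t channel;    /* was: short */
        int16_t flags;      /* was: short */
        int32_t value;      /* was: int   */
    };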

Most C compilers, and derivative C++ compilers, will usually perform the transparent promotion of the basic float type to a double. As a result, floating point data type behaviour is likely to remain the same.

Arithmetic Libraries

Arithmetic libraries are one of the hidden but vital components of any runtime environment, being used frequently for supporting graphical computations in GUIs, but also to support engineering and scientific applications, and other miscellaneous calculations.

The principal risk which arises is that some library routines may have dependencies upon the data types used, and changes in basic data types could alter the behaviour of these routines.

While many changes, such as a doubling of integer sizes, will enhance the accuracy of integer arithmetic, subtle differences may arise, especially in argument passing and returns. Of particular concern will be binaries compiled for 32-bit versions of the same architecture, most of which are likely to break precisely in this area.
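
A trivial example of the kind of subtlety involved, assuming a port from a 32-bit long to a 64-bit long, is shown below. The same expression overflows and produces a nonsense value on the 32-bit platform, but is computed exactly on the 64-bit platform; any code, or precompiled library, which silently relied on either behaviour will differ across the port:

    #include <stdio.h>

    /* The product 10^10 does not fit in a 32-bit long, so on an ILP32
     * system this overflows (formally undefined, typically wrapping);
     * on an LP64 system it is computed exactly. */
    int main(void)
    {
        long big = 100000L;
        long product = big * big;

        printf("100000 * 100000 = %ld\n", product);
        return 0;
    }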

Addressability and Alignment of Operands

Addressability and alignment are always specific to the hardware in question. While it is reasonable to assume that most new 64-bit architectures will allow operands to be addressed in byte and integer sized chunks, other operands such as short integers may vary across architectures. Therefore existing code which uses operands other than basic integers and bytes should be very carefully reviewed and tested to ensure that it does not break.

Alignment is the relationship between smaller operands and larger operands, in terms of how they can be accessed. In most machines, the basic alignment boundary is the integer, which is in this instance a 64-bit word. This means that words (integers) can only be addressed on address space boundaries which are 8 bytes (64 bits) apart. Overlaps are not permitted. Bytes within the alignment boundary are usually individually addressable, but other operand sizes may not be.

This has important implications, insofar as data structures cannot break this rule. Again, this breaks existing precompiled code. However, it might also break data structures in which clever tricks are played to cram multiple small non-byte sized operands into integers.
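
The effect is easy to demonstrate. In the hypothetical structure below, the compiler inserts padding after the leading char so that the following long lands on its natural alignment boundary; with a 4-byte long the structure is typically 8 bytes, and with an 8-byte long it typically grows to 16. Precompiled code, or on-disk records, which assume the old layout will break:

    #include <stdio.h>

    /* Alignment padding changes the size and layout of this structure
     * when the size of long changes. */
    struct record {
        char tag;        /* 1 byte, followed by compiler inserted padding */
        long payload;    /* aligned to its natural boundary               */
    };

    int main(void)
    {
        printf("sizeof(struct record) = %lu\n",
               (unsigned long) sizeof(struct record));
        return 0;
    }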

Yet again, the pragmatic approach is careful code review and testing, to isolate problem areas and fix them.

Alignment Behaviour in Shared Memory

Shared memory is a commonly used feature in many modern applications. It is used either for interprocess communications, or in more complex applications, to provide access to shared datastructures used by more than one process.

In many 32-bit systems, shared memory is implicitly locked into integer sized boundaries, often also tied to increments of a whole 512 byte or larger page size. It is reasonable to expect that most 64-bit systems will exhibit an identical type of behaviour.

This is important with datastructures, since they must comply with the alignment rules. The same caveats thus apply as in the previous case.

Changes in Address Arithmetic

Changes in the size of the basic machine operand will inevitably result in changes in address arithmetic. These will arise not only from likely changes in the virtual memory architecture, but also from the need to change offsets within structures.

Older code is frequently written with assumed operand sizes, and indexing into arrays and structures is performed using numerical values of offsets rather than language or compiler defined values for offsets. Where an earlier developer may have chosen to be clever, or lazy, and hard coded offsets with numerical values rather than defined values, the application is almost guaranteed to break.

As with previous examples, the best strategy for dealing with this is to carefully review the code to find and isolate such instances, and rather than changing from the original offset size to a new offset size, permanently fix the code to use a proper sizeof or equivalent syntax.
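
A sketch of the before and after, using a hypothetical message structure, is shown below. The fragile version hard codes the offset and size of a field; the portable version lets the compiler supply both through offsetof() and sizeof(), so the code survives any change in operand sizes:

    #include <stddef.h>    /* offsetof() */
    #include <string.h>

    /* Hypothetical message layout, as decoded from a raw buffer. */
    struct message {
        int  type;
        int  length;
        char body[64];
    };

    /* Fragile: assumes int is 4 bytes and 'length' sits at offset 4.
     *     memcpy(&len, buf + 4, 4);
     * Portable: let the compiler supply the offset and the size.    */
    void get_length(const char *buf, int *len)
    {
        memcpy(len, buf + offsetof(struct message, length), sizeof(*len));
    }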

This may be a particular problem with applications which use significant amounts of assembly code, examples being operating system kernels and device drivers.

Interrupt Handling and Latencies

Real time applications, especially legacy applications, may have many instances in which the behaviour of the code is crafted around the idiosyncrasies of the original target platform's interrupt architecture, but also the latency behaviour of memory, I/O accesses and interrupt services.

Transitioning such code to platforms which have significantly faster busses, and possibly quite different interrupt behaviour, needs to be done with some care. While many newer applications, crafted in higher level languages, will behave well since the compilers and libraries will address most of the issues, this may not be true of older applications.

Very Large Instruction Word Architectures

VLIW architectures will add other changes, in addition to those arising from a transition to a 64-bit datapath. In theory, compiler technology will hide most of the changes arising from VLIW from the programmer.

Notable differences between VLIW and CISC/RISC, at the assembly code level, will be the use of multiple-operation, multiple-operand instructions, scheduled by the compiler to allow maximum parallelism. Speculative execution is performed using predication techniques. Significantly larger register sets will be used, compared with most established CISC/RISC architectures.
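
As a rough illustration of what predication replaces, consider the trivial clamping function below. On a conventional machine the branching form compiles to a compare and a conditional branch; a predicating compiler, such as one targeting IA-64, can instead evaluate both candidate results and commit one of them under a predicate, so the instruction stream never forks:

    /* Branching form: compiles to a compare and a conditional branch
     * on a conventional CISC/RISC machine. */
    int clamp(int x, int limit)
    {
        if (x > limit)
            x = limit;
        return x;
    }

    /* Equivalent branch-free form: an obvious candidate for predicated
     * (or conditional-move) code on a VLIW machine. */
    int clamp_branchless(int x, int limit)
    {
        return (x > limit) ? limit : x;
    }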

Legacy applications written in assembly code will present an interesting challenge in porting. The massive performance gains resulting from the transition to VLIW mean that in practical terms the justification for coding in assembler, valid when the application was crafted, becomes basically irrelevant in most instances. With the exceptions of startup code and some parts of an operating system kernel, there is little justification for retaining much if any assembler.

Two basic strategies exist for dealing with existing assembly code applications. The first is to code up in a high level language an emulation of the original CISC or RISC CPU and then execute the assembly code on a virtual machine. This approach is not practical for small applications, but may be the most economical choice for handling a very large assembly code application. In essence, this is the basic model implemented by the Transmeta architects in their dynamic execution model. Whether the assembly code module is translated at runtime or at compile time is basically an implementation issue.

The alternate approach is to reimplement the assembly code application in a higher level language such as C, and then compile it to VLIW instructions. Again, there are alternatives in how this could be implemented. The traditional approach is for a programmer to wade through the assembler and convert it line by line, into a lesser number of lines of C. Another strategy might be to write a conversion tool which does this in an automated fashion. Merging these two approaches into a first pass of automated conversion, followed by a manual cleanup and commenting pass, is also a viable technique.

What is clear is that most newer applications will port across from CISC and RISC 32-bit architectures to 64-bit/VLIW architectures without unreasonably large effort. Providing that care is taken with the porting process and rigorous testing is performed, the odds are that a relatively bug free port can be executed with reasonable effort.

Legacy applications, especially those which contain embedded architectural dependencies and copious amounts of assembly code will present a more demanding task. This may become a major issue especially in porting real time embedded applications, such as software for avionics and defence projects.

It is likely that by the middle of this decade most if not all vendors will transition from 32-bit CISC/RISC technology to 64-bit VLIW technology. If the industry is not to face another Y2K scramble, it is imperative that careful forethought be given to finding the best strategies for a painless transition.

Next month's feature will explore machine arithmetic and the resulting implications of 64-bit architectures.




$Revision: 1.1 $
Last Updated: Sun Apr 24 11:22:45 GMT 2005
Artwork and text © 2005 Carlo Kopp

