64-bit architecture

Introduction:

To get a first idea of how the 64-bit architecture works, and of how it differs from a 32-bit implementation, it is useful to consider one definition first:

"A 64-bit processor is a microprocessor with a word size of 64 bits, a requirement for memory and data intensive applications such as computer-aided design (CAD) applications, database management systems, technical and scientific applications, and high-performance servers. 64-bit computer architecture provides higher performance than 32-bit architecture by handling twice as many bits of information in the same clock cycle." (search390.com)

The most important parts of this definition give a rough idea of the scale: a 64-bit word can take not only 2^32 = 4294967296 distinct values, but 2^64 = 18446744073709551616. The numbers are quite impressive and show that the architecture level has to be updated accordingly.
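As a quick check of these two figures, a minimal Python sketch can compute the number of distinct values a word of a given width can hold. The function name distinct_values is chosen for this illustration only.

```python
def distinct_values(bits: int) -> int:
    """Number of distinct bit patterns in a word of the given width."""
    return 2 ** bits


for width in (32, 64):
    print(f"{width}-bit word: {distinct_values(width)} patterns")

# Factor by which the 64-bit range exceeds the 32-bit range.
print(distinct_values(64) // distinct_values(32))
```

The last line shows that the 64-bit range is not twice, but 2^32 times as large as the 32-bit range.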

Several companies have actually implemented 64-bit processors, but the two main ones are AMD and Intel. Other enterprises certainly have their place in the development of 64-bit processors, too, but the mainstream market is going to face the products of AMD and Intel. It is therefore reasonable to explain how these two companies designed their 64-bit processors; only details have to be considered to translate the two specific layouts and implementations into the general concept. The two companies chose quite different ways to make 32-bit programs work on the 64-bit architecture. Those differences will be outlined in the 32-bit part of this document, whereas the following part outlines the structure of a "pure" 64-bit architectural level. Not much public information is available about the physical structure of current 64-bit processors, because neither AMD nor Intel wants to hand crucial information to its rival on the processor market. It is therefore useful to focus on the instruction set architecture (ISA) and the general differences between a 32-bit processor and the new 64-bit one.

AMD - layout & features:

The basic layout of AMD's Hammer processor can be seen in this picture. The processor core contains three arithmetic logic units (ALU), three address generation units (AGU) for load/store and three floating point units (FPU) for arithmetic with floating point numbers, as the ALUs process only integers. Furthermore, the memory controller is built in to reduce the delay when a request for memory access is sent (A1-Electronics).

So the question is what exactly happens inside the processor core, and where does the 64-bit component come into play? First of all, a distinction has to be made between the size of the data bus, which is already 64 bits wide for many 32-bit processors, and the architecture of the central processing unit (CPU). The difference between, e.g., the AMD Athlon and the new AMD Hammer or Opteron is that the complete architecture is now based on 64 bits. AMD calls it x86-64, and it is activated by one special bit called Long Mode Active (LMA). If LMA is set, the processor runs in long mode: code running in its 64-bit sub-mode has all 64-bit features available, while a compatibility sub-mode remains for older 32-bit applications. The following main concepts become viable in 64-bit mode:

  • Virtual 64-bit addressing
  • 64-bit instruction pointer
  • Flat-addressing mode

But these three conceptual changes for the "pure" 64-bit mode are built upon the main structures that are continuously present (whether used or not) in the 64-bit architecture. These most basic principles are:

  • Registers extend to 64-bits
  • Addition of 8 new registers (R8-R15 / general purpose)
  • 8 new registers for SIMD (single instruction, multiple data)

The scheme by which the registers are used and instructions are transferred and executed is not a new invention, but rather a continuation of the old principles of the 8086 architecture. There is still the distinction between one part of higher bits (in this case 32) and one part of lower bits, which can be split up into smaller parts (16 bits and 2x 8 bits). This principle carries over to the ISA as well (hardwaresecrets).
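The partitioning of a register into higher and lower parts can be sketched with a few bit masks. The names RAX, EAX, AX, AH and AL are the conventional x86 names for the views of the first general purpose register; they are not taken from the text above, and "upper32" is only a label for this illustration, since the upper half has no register name of its own.

```python
def split_register(value: int) -> dict:
    """Return the conventional sub-register views of a 64-bit register value."""
    value &= 0xFFFFFFFFFFFFFFFF        # keep exactly 64 bits
    return {
        "RAX": value,                  # full 64 bits
        "upper32": value >> 32,        # bits 63..32
        "EAX": value & 0xFFFFFFFF,     # bits 31..0
        "AX": value & 0xFFFF,          # bits 15..0
        "AH": (value >> 8) & 0xFF,     # bits 15..8
        "AL": value & 0xFF,            # bits 7..0
    }


views = split_register(0x1122334455667788)
for name, part in views.items():
    print(f"{name:8s} {part:#x}")
```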

In the following part, information will be taken from the source that is seemingly consulted by anybody who publishes something about the AMD 64-bit architecture: the AMD x86-64 Programmer's Manual. It comes in several volumes and describes the implications and applications of the instruction set architecture for AMD's 64-bit processing technology. As mentioned above, one main change from 32 bits to 64 bits, if not the main change, is the size of the register files and in particular of the GPRs (general purpose registers). The following picture shows how AMD partitioned their system of register files:

This shows again that AMD's processor design can be viewed as an extension of the old 32-bit approach: all the general register concepts displayed here already existed in a 32-bit design, so GPRs, media registers, an instruction pointer and a flags register are not new, but rather enhanced by the new approach. What is even more important becomes visible when one looks closely at the ISA.

AMD - Instruction Set Architecture:

The most basic units of organization for the instructions are specified in the following way (see the AMD manual again, page 38/39):

  1. General Purpose Instructions: The basic integer instructions, which are used nearly everywhere. Also often referred to as the x86 instruction set and easily illustrated by examples like addition of integers, moving, load, store, shifts and so on.
  2. 128-Bit Media Instructions: Named after their primary application, these instructions operate on vectors of large data packages (e.g. video, scientific applications, games, etc.). Moreover, they operate in parallel, which means they are able to process multiple data elements at once. These instructions are designed for speed in one special field of applications and are therefore not suited to arbitrary tasks.
  3. 64-bit Media Instructions: Also SIMD instructions and not much different in use compared to the 128-bit instructions.
  4. x87 Floating Point Instructions: As the general purpose instructions only work on integers, these instructions provide a suitable tool for floating point operations.

LMA is usually activated by the operating system, and only then can instructions make full use of the 64-bit features. This is the stage we would like to call "pure" 64-bit mode, and such a mode can be recognized in both architectures, the one from AMD described here and the Intel IA-64 described later on this page. For the following part of the analysis we assume that LMA is activated and the processor is in "pure" 64-bit mode, which is not to be confused with legacy mode or long mode compatibility mode; these are features to support the transition from 32-bit machines and software to the new architecture. They are not considered here, but in the 32-bit section.

The default operand size in 64-bit mode is 32 bits (the older modes default to 16 or 32 bits, depending on the mode). The REX prefix, a new instruction prefix that also gives access to the 8 new GPRs R8-R15, specifies whether one would like to accept this default value or to extend the operand size to 64 bits. This means that some opcodes had to be redefined to allow the 64-bit extensions. Nevertheless, these are only minor changes and most parts of the opcode map are carried over from a 32-bit processor.

The memory is a single flat address space starting at address 0 and extending linearly over 64 bits. The operating system can specify several levels of data access/protection for the address space. The segment registers used to access memory locations are fixed to a canonical base, namely 0, and the processor no longer makes full use of all segment registers. This is a real simplification compared to 32-bit processing and all the compatibility modes offered by AMD. It is just pure memory addressing from 0 to 2^64 - 1 without any specialties (picture). This concept shows on the micro level what the goal of the complete architecture is: the search for more simplicity, more raw computing power and preparation for large amounts of data. Another cornerstone of this path is the possibility to translate the whole virtual 64-bit address space to physical memory in a one-to-one translation process. Paging can be performed on the virtual address directly. The bytes themselves are ordered little endian, and so are all the data and instructions.

The instructions do not really "change" in the sense that a structural redesign has happened. The size of the operands is the crucial factor. Consider for example this instruction: 48 B8 1234567812345678. The 48 specifies the length of the operands: 64 bits! The opcode B8 is also used in the 32-bit architecture, and the remaining part is just an 8-byte immediate value: we are computing with a 64-bit processor.
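How the three parts of this instruction fit together can be sketched with a small decoder. This is a minimal illustration, not a full x86-64 decoder: it handles only the "REX prefix, opcode B8+r, immediate" form, the function name decode_mov_imm is hypothetical, and it assumes that the hex digits 1234567812345678 in the example denote the value, which is then stored in memory in little endian order.

```python
# Register numbers 0..15 as used in x86-64 instruction encoding (64-bit names).
REGISTERS = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI",
             "R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15"]


def decode_mov_imm(code: bytes) -> dict:
    """Decode a 'REX prefix, opcode B8+r, immediate' byte sequence (sketch)."""
    if len(code) < 2:
        raise ValueError("need at least a prefix and an opcode byte")
    rex, opcode = code[0], code[1]
    if rex & 0xF0 != 0x40:
        raise ValueError("expected a REX prefix (0x40..0x4F)")
    if opcode & 0xF8 != 0xB8:
        raise ValueError("expected opcode B8+r")
    wide = bool(rex & 0x08)                      # REX.W selects 64-bit operands
    reg = (opcode & 0x07) | ((rex & 0x01) << 3)  # REX.B reaches R8..R15
    size = 8 if wide else 4                      # immediate size in bytes
    immediate = code[2:2 + size]
    if len(immediate) != size:
        raise ValueError("immediate is truncated")
    return {
        "operand_bits": size * 8,
        "register": REGISTERS[reg],
        "immediate": int.from_bytes(immediate, "little"),
    }


# Assumption: the digits in the text are the value, stored little endian.
code = bytes([0x48, 0xB8]) + (0x1234567812345678).to_bytes(8, "little")
result = decode_mov_imm(code)
print(result["register"], result["operand_bits"], hex(result["immediate"]))
```

Without the REX.W bit the same opcode B8 would take a 4-byte immediate, which is exactly the 32-bit default described above.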

There exist five addressing modes:

  • Absolute Address: given as a displacement from the base (for 64 bits just 0)
  • Instruction-Relative Address: referring to the IP (instruction pointer), i.e. the PC (program counter)
  • Stack Address: using the stack pointer
  • String Addresses
  • Mod R/M Address

And again one realizes that there are no real differences in structure compared to non-64-bit ISAs. The PC, the stack and absolute addressing just carry over with more bits. The RIP (the 64-bit instruction pointer / program counter) keeps its function, but in 64-bit mode it also provides a more efficient way to directly access segments of code with relative addressing. This is one reason why there is a significant increase in speed for the AMD 64-bit architecture: direct access to program code.
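Instruction-relative addressing can be sketched as a single addition. The sketch assumes the usual x86-64 rule that a signed 32-bit displacement is added to the address of the next instruction; the function names are chosen for this illustration only.

```python
MASK64 = (1 << 64) - 1


def sign_extend32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed two's complement number."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def rip_relative(next_rip: int, disp32: int) -> int:
    """Effective address = address of the next instruction + signed displacement."""
    return (next_rip + sign_extend32(disp32)) & MASK64


print(hex(rip_relative(0x401000, 0x10)))        # target after the instruction
print(hex(rip_relative(0x401000, 0xFFFFFFF0)))  # target before the instruction
```

Because the target is expressed relative to the instruction itself, the same code works wherever it is loaded in the flat address space.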

For absolute addressing it gets even easier due to the common standard base 0. The same holds for pointers in general. As the segment registers are no longer used for segmentation, the concept of far pointers, which store a segment address in addition to the usual address, is no longer needed: the memory is just one linear chunk. Near pointers are enough, and for 64-bit applications on the AMD architecture one can return to the general term pointer, as it is obvious that it can only point into one data segment. Immediates and displacements remain 32 bits in size but can be extended to 64 bits if needed.
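The difference between a far pointer and a plain pointer can be sketched as follows. This is a simplified model that ignores descriptor tables, limits and protection checks; the function names and the example base are hypothetical.

```python
MASK64 = (1 << 64) - 1


def linear_address(segment_base: int, offset: int) -> int:
    """Segmented model: a far pointer carries a segment base and an offset."""
    return (segment_base + offset) & MASK64


def flat_address(offset: int) -> int:
    """Flat 64-bit model: the base is always 0, so the offset is the address."""
    return linear_address(0, offset)


print(hex(linear_address(0x10000, 0x234)))  # segment base contributes
print(hex(flat_address(0x234)))             # pointer value is the address
```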

This finishes the broad outline of the instruction set architecture for AMD based on the document mentioned above. Their philosophy to keep it simple and easy becomes apparent, but this is only true for AMD, not for 64-bit processors in general. Others might demand more sophisticated instruction sets and might not focus and build upon established concepts. One has to know more technical details than are emphasized here, as the new registers must be taken into account and the number of possible combinations to address and declare correctly rises, but the level of complexity does not rise significantly for AMD. Outlining the new instructions for every new register would be tedious and cumbersome work and would only be valid for the ISA of AMD anyway, so we move on to the comparison with Intel's implementation.

Intel - layout & features:

Intel takes a quite different approach in its 64-bit architecture called Itanium. The two main catchwords for the ideas used are IA-64 and, even more important, VLIW (very long instruction word). Intel aims at even more parallel computing power and a more involved approach to implementing the possibilities of 64 bits. In this context one might even say 128-bit, as the instruction word for IA-64 is 128 bits long, which gives an impressive number of possible bit patterns: 2^128 = 3.4028236692093846346337460743177*10^38. Although this might sound superior to the approach taken by AMD and other companies (which also have 128-bit registers available), it entails a lot of problems, especially in compatibility issues (discussed in the next part) and in the complexity of the structure of registers and instructions. One instruction word encodes three basic instructions and contains a pointer:

As shown in the picture on the left hand side, each instruction is 41 bits long and the pointer takes 5 bits. The pointer is used to indicate the types of the instructions in the instruction word. Theoretically one could specify 32 such types, but only 24 are actually used. Each instruction makes use of one of the CPU elements: Integer Data, Floating-Point Operations, Memory Access or Branching. Instructions can be executed in parallel as in the AMD architecture, but Intel enables six instructions to be processed in parallel. Intel refers to this as EPIC (Explicitly Parallel Instruction Computing). For this process the compiler takes on the huge responsibility of determining which instruction to execute on which registers. To be able to interpret everything correctly for each instruction word, the compiler has to take advantage of three important concepts:

  • speculation
  • predication
  • EPIC
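The layout of one instruction word described above (three 41-bit instructions plus a 5-bit pointer, 128 bits in total) can be sketched with shifts and masks. The sketch assumes that the 5-bit field sits in the lowest bits and the three instruction slots follow in order; the function names and the example values are hypothetical and do not encode real instructions.

```python
TEMPLATE_BITS = 5   # the "pointer" that names the instruction types
SLOT_BITS = 41      # one instruction
SLOTS = 3


def pack_bundle(template: int, slots) -> int:
    """Combine a 5-bit template and three 41-bit slots into one 128-bit word."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit into 5 bits")
    if len(slots) != SLOTS:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit into 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int):
    """Split a 128-bit word back into its template and three slots."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for index in range(SLOTS)
    )
    return template, slots


word = pack_bundle(0x10, (1, 2, 3))
print(TEMPLATE_BITS + SLOTS * SLOT_BITS, unpack_bundle(word))
```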

EPIC was briefly explained above, and the concept of speculation is quite intuitive: the compiler tries to schedule data accesses and operations before the time they would normally be needed or executed. This should prevent slow operations from halting the whole process. Predication is more complex: a conditional branch, which might not even be needed for the whole operation, is prepared for execution beforehand to guarantee the maximum amount of speed. Predication is not to be confused with prediction. For predication a certain number of parallel operations are prepared by marking the conditional branches that might be taken in the next parallel instruction bundle (bundles will be explained later on). But the results of these branches only take effect when they are really needed and the condition evaluates to true. The number of different types of registers in the Intel architecture IA-64 is greater than for AMD, and the general concept is more involved, as the most basic overview below shows (data & pictures used for this part: hardwaresecrets):

The picture to the left already gives a glimpse of the complexity of the Intel approach, but it gets even more difficult when the registers are considered. Altogether 400 documented registers exist (c't 12/99 page 28). Compared with the rather straightforward approach taken by AMD, Intel seems willing to sacrifice possible simplifications just for gains in the power of its architecture. The following part of the discussion is going to outline some of the features and possibilities, but considering the various sources of information one might even get the impression that not even Intel is sure how all those special cases, conversions and huge possibilities are going to work out in practice. There seem to exist only a few really concrete pieces of data, which are going to be covered in the following two parts.

 

 

Intel - New Concepts:

The following part of this analysis will just take up some important points, as there is no way to describe Intel's approach properly and in its full context in a reasonable amount of space. The primary resource will be a brief presentation given by Gautam Doshi back in 1999, which can be found here. Again we are going to focus on the "pure" 64-bit mode of the processor.

First of all, one statement from the first part on the IA-64 architecture carries over directly into this part: It is all about parallelism! This raises the question of how the registers and the instructions have to be structured to enable this feature and use it extensively. Obviously some operations are independent or only partly dependent on each other, and the structured program pattern holds all these operations back. One simple, striking example is the following:

  • add r1 = r2,r3 ;;
  • sub r4 = r2,r5 ;;
  • add r5 = r4,r1 ;;

Certainly the second instruction can be performed without any knowledge of the first one, and executing both at once helps to prepare for the third operation, so Intel makes an interesting point here that parallelism can lead to superior performance. Increasing the size of the instruction word is therefore a natural step. The other step is to remove the stops ";;" between independent instructions, so that only the third instruction, which needs r4 and r1, still has to wait behind a stop:

  • add r1 = r2,r3
  • sub r4 = r2,r5 ;;
  • add r5 = r4,r1
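Where a stop is really required can be worked out from the register dependencies. The following sketch uses a deliberately simplified rule (an instruction must not read or overwrite a register written earlier in the same group); the real IA-64 rules contain more cases and exceptions, and the function name place_stops is hypothetical.

```python
def place_stops(instructions):
    """Group instructions so that no group contains a register dependency.

    instructions: list of (destination, sources) tuples.
    Returns a list of groups, each a list of instruction indices; a stop
    belongs between two consecutive groups.
    """
    groups = [[]]
    written = set()  # registers written so far in the current group
    for index, (dest, sources) in enumerate(instructions):
        # Reading or overwriting a register written earlier in this group
        # forces a stop before the current instruction.
        if written & (set(sources) | {dest}):
            groups.append([])
            written = set()
        groups[-1].append(index)
        written.add(dest)
    return groups


program = [
    ("r1", ("r2", "r3")),  # add r1 = r2,r3
    ("r4", ("r2", "r5")),  # sub r4 = r2,r5
    ("r5", ("r4", "r1")),  # add r5 = r4,r1
]
print(place_stops(program))
```

For the three instructions of the example, the first two end up in one group and the third, which depends on r1 and r4, in a second group.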

The goal is to remove branching as far as possible and to transform the more or less linear pattern of execution into a structure that looks in some sense more compact. This is predication, illustrated by the picture to the right (taken from the presentation). One can now break barriers in the execution cycle and address memory latency whenever needed, regardless of the execution step currently performed. This leads to a completely different architecture than the approach of extending the registers, giving more space to the flow of data and instructions and keeping the old design constant. Linear design is replaced by parallel design.
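The idea of predication can be modelled in a few lines. Python cannot express hardware predication, so this is only a model under stated assumptions: a comparison sets a pair of complementary predicates, both operations are issued, and each result is committed only if its predicate is true. The function name and the example statement ("if a > b then c = c + 1 else d = d + 1") are hypothetical.

```python
def run_predicated(a: int, b: int, c: int, d: int) -> dict:
    """Model of: if a > b then c = c + 1 else d = d + 1, without a branch."""
    p_true = a > b         # the compare sets two complementary predicates
    p_false = not p_true
    registers = {"c": c, "d": d}
    # Both operations are issued; the predicate decides whether one commits.
    operations = [
        (p_true, "c", lambda: c + 1),
        (p_false, "d", lambda: d + 1),
    ]
    for predicate, target, operation in operations:
        result = operation()                 # computed in any case
        registers[target] = result if predicate else registers[target]
    return registers


print(run_predicated(5, 3, 10, 20))
print(run_predicated(1, 3, 10, 20))
```

Both sides of the condition occupy execution slots, but no jump is needed, which is the trade the text describes.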

But how can all this be managed just with registers and simple instructions? One concept of importance for Intel's architecture is the RSE (register stack engine). It automatically saves and restores the register stack, a feature that is crucial to perform background operations, to cope properly with the results of speculation and to reduce over-/underflow. There are several other concepts, like the advanced load address table, the translation lookaside buffer or the hardware page walker, which help to achieve the primary goal of parallelism (reference so far).

Intel - Architecture:

Now we have seen the underlying principles to be considered when talking about the ISA. The resource for this part is the Software Developer's Manual for the Itanium processor series and the IA-64 architecture, respectively. All the volumes and papers can be downloaded from the Intel website for developers. The software developer's manual and the hardware developer's manual provide the best overview of the IA-64, and therefore of the Itanium architecture, with respect to the ISA and to the physical level of implementation.

The basic structural unit of the Itanium looks like the picture to the right. According to Intel, the data bus can cope with a data rate of 2.1 GB/sec. The Itanium processor contains 4 integer ALUs, 4 multimedia ALUs, 2 AGUs, 3 branching units and 4 FPUs for arithmetic with floating point numbers. The processor is theoretically capable of performing 20 operations in one clock cycle by loading 16 operands and evaluating 4 ALU operations. This should not be confused with the number of instructions possible within one clock cycle, namely six. The instructions are retrieved from memory and are bundled by a process called bundle rotation; this prepares the execution of parallel instructions on the hardware level. The instructions are fetched from the cache speculatively. All this is implemented with the help of 128 floating point registers, 128 integer registers and 8 branching registers, which all explicitly support 64 bits (reference).

But the different types of registers extend even this basic structure, and there is perhaps no other point where the distinction between the approaches of AMD and Intel can be seen more directly. Common components of both architectures are the GPRs (64-bit), floating point registers, the instruction pointer and branch registers. Nevertheless, there is a reason why the Itanium is sometimes called "The King of Registers". Besides the sheer number, additional types are also specified:

  • Current Frame Marker (state of rotation)
  • Predicate Registers
  • Application Registers (special purpose)
  • Performance and Monitoring Registers
  • Kernel Registers (link OS and applications)
  • Register Stack Configuration Register (control of RSE)
  • ...

If one views this scheme against the register distribution of the AMD x86-64 above, a contrast becomes apparent, and this continues into virtually every part of the ISA. Memory, for example, is accessed with 64-bit pointers and in sizes of 1, 2, 4, 8, 10 and 16 bytes. Data and instructions might be stored in memory either in little endian or in big endian order, and this is controlled by another special purpose register file (reference).
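The difference between the two byte orders can be made visible with a short sketch; the function name memory_image and the example value are chosen for this illustration only.

```python
def memory_image(value: int, size: int, byteorder: str) -> str:
    """Hex dump of value as it would be laid out in memory, lowest address first."""
    return value.to_bytes(size, byteorder).hex()


value = 0x11223344
print("little endian:", memory_image(value, 4, "little"))
print("big endian:   ", memory_image(value, 4, "big"))
```

In little endian order the least significant byte comes first in memory, in big endian order the most significant one.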

Some features of these constructs are certainly implemented quietly in the AMD architecture, too, but the real difference is made by instruction bundling in the IA-64. Bundling refers to the three 41-bit instructions mentioned at the beginning of this section; one instruction word is also called a bundle and needs special management. This management has only been touched on the surface in the first part on the IA-64 architecture, and covering it properly would need at least as much space as has been used so far for the brief overview of IA-64 and x86-64. Just as a short remark to remind the reader: These are only two companies, two approaches and two quite different goals. Nevertheless AMD and Intel are not alone, and other enterprises are working on 64-bit processors, too.

Perspectives:

At this point one has only an overview by which to judge the two approaches presented. But one thing should be emphasized clearly: these two cases show fundamentally different views of processing technology, the struggle for power against stability/compatibility. Certainly Intel's emphasis on parallel computing offers more power and extends computing to a new level, but whether this is the way to go can definitely not be decided yet, because we have to ask: what about our old software? Everything developed so far in 32 bits should work in the new age, too. The next section is going to address this topic in more depth.

 

And last but not least, a look at how marketing tries to explain this complex architecture and instruction set. Some specialists of the AMD marketing department have done quite a nice job of explaining the whole construction within one minute and have introduced some simplifications for the explanation to the public. It is just interesting to see how two series of manuals, containing several thousand pages each, boil down to this view.


References for this part are basically placed in the appropriate positions - this list gives an overview:

- Search390.com: http://search390.techtarget.com/sDefinition/0,,sid10_gci498697,00.html

- Hammer Review A1-Electronics: http://www.a1-electronics.co.uk/AMD_Section/CPUs/Hammer_Review_pg2.shtml

- Article X86-64 Hardwaresite: http://www.hardwaresite.net/x86-64.html

- AMD Developer's Manual X86-64: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24592.pdf

- Article IA-64 Hardwaresite: http://www.hardwaresite.net/ia64.html

- Presentation IA-64: http://www.eg.bucknell.edu/~bsprunt/comp_arch/intel/ia64_tutorial.pdf

- Software Developer's Manual Itanium: http://developer.intel.com/design/itanium/manuals/245317.pdf

- Hardware Developer's Manual Itanium: http://developer.intel.com/design/itanium/downloads/248701.htm

- AMD Opteron video: http://www.amd.com/us-en/assets/content_type/DigitalMedia/AMD_Opteron.wmv

- Article 64-bit computing: c't 12/99 page 28

- Basic notations, definitions and concepts are taken from "Computer Organization and Design", Patterson and Hennessy

 