
Opteron

Whereas Athlon established AMD as a top-tier microprocessor vendor in personal computers, Opteron established AMD in the server market. The main innovative features of the Opteron processor architecture are:

  1. Hyper-Transport replacing the shared front-side bus (FSB)
  2. Integrated memory controller
  3. 64-bit instruction set extensions
  4. Extending general purpose and FP registers from 8 to 16
  5. Fully pipelined floating point

The Hyper-Transport point-to-point protocol enables higher total bandwidth per pin and lower latency than a shared bus. The integrated memory controller reduces memory latency and improves performance, probably in the range of 15-20%.
By itself, extending the instruction set architecture from 32-bit to 64-bit is not difficult, but Intel refused to consider this, reserving the 64-bit ISA for Itanium. Only when it gradually became clear that Itanium was not going to arrive soon in a meaningful manner was the magnitude of this blunder realized. Expanding the number of general purpose (integer) and floating point registers from 8 to 16 each in 64-bit mode, however, was a very innovative step. Intel architects swore that this could not be done. It was said that Sun Microsystems provided assistance on the instruction set architecture.
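As an illustration of how 16 registers fit into an instruction format built around 3-bit register fields, the sketch below encodes a register-to-register 64-bit MOV. The text above does not describe the mechanism; this relies on the published x86-64 encoding, in which a REX prefix byte (bit pattern 0100WRXB) supplies a fourth bit for the ModRM register fields. The helper name `encode_mov_r64` and the `REGS` table are made up for this example.

```python
# Sketch: encoding "mov dst, src" (opcode 0x89, MOV r/m64, r64) in 64-bit mode.
# encode_mov_r64 and REGS are illustrative names, not part of any library.

REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def encode_mov_r64(dst: str, src: str) -> bytes:
    """Return the machine code bytes for a register-to-register 64-bit MOV."""
    d = REGS.index(dst)  # destination goes in ModRM.rm (extended by REX.B)
    s = REGS.index(src)  # source goes in ModRM.reg (extended by REX.R)
    # REX prefix: 0100 W R X B, with W=1 selecting 64-bit operand size.
    rex = 0x40 | 0x08 | ((s >> 3) << 2) | (d >> 3)
    # ModRM: mod=11 (register direct), then the low 3 bits of each register.
    modrm = 0xC0 | ((s & 7) << 3) | (d & 7)
    return bytes([rex, 0x89, modrm])


if __name__ == "__main__":
    for dst, src in [("rax", "rbx"), ("r8", "r9"), ("rdi", "r12")]:
        print(f"mov {dst}, {src}: {encode_mov_r64(dst, src).hex(' ')}")
```

The legacy 3-bit fields still select one of eight registers. The REX.R and REX.B bits choose between the original eight and the eight new ones, which is why existing 32-bit encodings did not have to change.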

At the time, the plurality of Intel's high-end CPU sales came from business desktops, for which the premium performance metric was integer performance. The workstation group at Intel was newly formed, and did not know how to present a coherent set of achievable requirements that would be taken seriously. The gaming market had not yet become the market force that it is today. So Intel was not willing to set a requirement for floating point performance to match or exceed that of the RISC processors then prominent in the professional workstation space. AMD, however, was willing to invest silicon in a fully pipelined floating point unit. This earned AMD and Opteron the respect and loyalty of the gaming market in the years that followed.

Like the Pentium Pro, the Opteron core implemented out-of-order 3-wide superscalar execution units. Beyond that, Opteron has several other micro-architectural innovations to be competitive in the Pentium 4 time frame.


Dual-Core Opteron, 90nm, 199mm2, 2x1M L2, 233M transistors (2005/6)

AMD Opteron quad-core processors and later

Below are the recent Opterons:
quad-core Barcelona 65nm, 285mm2 4x512K L2, 2M L3, 463M transistors (2007),
quad-core Shanghai 45nm, 258mm2, 6M L3, 758M transistors (2008),
six-core Istanbul 45nm, 346mm2, 6M L3, 904M transistors (2009).


Quad-Core Barcelona 65nm, QC Shanghai 45nm, six-core Istanbul and Athlon II X4?

The most recent AMD Opteron, Magny-Cours, has two Istanbul die in a single package for 12 cores. Two Istanbul die, representing Magny-Cours, are shown next to Nehalem-EX approximately to scale. As listed above, each Istanbul die is 346mm2, and Nehalem-EX is 684mm2.

Istanbul and Nehalem-EX. The scaling is obviously incorrect: the two Istanbul die combined (2 x 346 = 692mm2) should be slightly larger than Nehalem-EX (684mm2). I will recalibrate my images soon.

Below is the die layout of Magny-Cours from AMD's Hot Chips 2009 presentation.

Magny Cours Topology

AMD Bulldozer (2010 Aug)

346mm2 per die, versus 263mm2 for Shanghai

AMD released information on their upcoming Bulldozer architecture in conjunction with Hot Chips 22. A less technical version is in the Bulldozer press release on the AMD website, and the slide deck is on SlideShare. The principal new feature is the AMD solution for addressing multi-threaded performance. AMD seems determined not to implement the same multi-threading solution adopted by Intel (Hyper-Threading), IBM POWER, and Sun. The new AMD Bulldozer architecture, shown in the diagrams below, has a complete core that appears as two cores to the operating system. Each sub-core has a dedicated integer scheduler, integer execution units, and L1 cache. The fetch/decode, floating-point and L2 cache are shared by the complete core.


Additional Bulldozer architecture diagrams:


More detailed Bulldozer core architecture diagrams:


The Bulldozer chip-level view shown below has 4 complete double integer cores, with L3 cache and memory controllers shared by all cores.


The AMD press release mentions that the processor socket would have 16 cores visible to the operating system, so the socket would consist of two chips. Bulldozer then has 33% more cores than the current 12-core Opteron 6100 series (Magny-Cours). AMD stated that 50% better performance was expected, corresponding to about 13% better performance per core (1.50/1.33 = 1.125). This could indicate that the Bulldozer core has micro-architectural improvements, or it could be that AMD is able to get 15-20% better frequency on the 32nm process than Magny-Cours can on the 45nm process. Note that the single-die six-core Istanbul could reach 2.8GHz on the 45nm process.
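The core-count and per-core arithmetic can be checked with a few lines of Python. This is only a sketch of the reasoning in the paragraph above: the function names are made up for illustration, and the two-chips-per-socket figure is the inference drawn in the text, not a number stated by AMD.

```python
# Sketch of the core-count and per-core performance arithmetic.
# Function names are illustrative; inputs are the figures quoted in the text.

def os_visible_cores(modules_per_chip: int, cores_per_module: int,
                     chips_per_socket: int) -> int:
    """Cores the operating system sees in one socket."""
    return modules_per_chip * cores_per_module * chips_per_socket


def per_core_gain(old_cores: int, new_cores: int, socket_gain: float) -> float:
    """Per-core gain implied by a socket-level gain and a change in core count."""
    return (1.0 + socket_gain) / (new_cores / old_cores) - 1.0


if __name__ == "__main__":
    # 4 complete (double integer) cores per chip, 2 chips per socket (inferred).
    bulldozer = os_visible_cores(4, 2, 2)
    magny_cours = 12
    print(f"Bulldozer cores per socket: {bulldozer}")
    print(f"Extra cores: {bulldozer / magny_cours - 1.0:.1%}")
    print(f"Implied per-core gain: {per_core_gain(magny_cours, bulldozer, 0.50):.1%}")
```

The implied per-core gain works out to 12.5%, which the text rounds to 13%. It sits a little below the 15-20% frequency range suggested for the 32nm process.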

AMD claims the additional integer unit requires 12% extra die space, for a potential 70%(?) performance gain on some applications. A 70% throughput gain for 12% die size is a spectacular achievement. However, there is a curious matter of Bulldozer die size. The quad-core Shanghai on 45nm has a die size of 258mm2 with 6M L3 cache. The six-core Istanbul, also on 45nm, is 346mm2 with 6M L3. A shrink/compaction of quad-core Shanghai to 32nm should have a die size in the range of 130mm2. The 12% for the double integer execution would only increase die size to about 145mm2, which is rather small for a high-end server processor, even if the processor socket will have 2 die.
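The die-size estimate can be reproduced as follows. This is a rough sketch under an idealized assumption: area scales with the square of the process feature size. Real shrinks usually fall short of this, since I/O and analog blocks do not scale as well as logic and cache. The function name is made up for the example.

```python
# Sketch of the die-shrink estimate, assuming ideal area scaling
# with the square of the feature size (an optimistic assumption).

def shrink_area(area_mm2: float, old_nm: float, new_nm: float) -> float:
    """Estimate die area after an ideal optical shrink."""
    return area_mm2 * (new_nm / old_nm) ** 2


if __name__ == "__main__":
    shanghai_45nm = 258.0                       # mm2, quad-core with 6M L3
    shanghai_32nm = shrink_area(shanghai_45nm, 45, 32)
    with_second_int = shanghai_32nm * 1.12      # AMD's 12% for the extra integer unit
    print(f"Shanghai shrunk to 32nm: about {shanghai_32nm:.0f} mm2")
    print(f"Plus 12% for double integer: about {with_second_int:.0f} mm2")
```

Applying the 12% to the whole die, as the text does, is itself generous, since the L3 cache and memory controllers would not grow with the added integer unit. Even so, the result stays well below the die size of Istanbul.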