Whereas Athlon established AMD as a top-tier microprocessor vendor in personal computers, Opteron established AMD in the server market. The main innovative features of the Opteron processor architecture are:
The HyperTransport point-to-point protocol enables higher total bandwidth per pin
and lower latency than a shared bus.
The integrated memory controller reduces memory latency and
improves performance probably in the range of 15-20%.
By itself, extending the instruction set architecture from 32-bit to 64-bit is
not difficult, but Intel refused to consider this, reserving 64-bit ISA for Itanium.
Only when it gradually became clear that Itanium was not going to arrive soon
in a meaningful manner was the magnitude of this blunder realized.
However, expanding the number of general purpose (integer)
and floating point registers from 8 to 16 each in 64-bit mode
was very innovative.
Intel architects swore that this could not be done.
It was said that Sun Microsystems provided assistance on the instruction set architecture.
At the time, the plurality of Intel's high-end CPU sales came from business desktops, for which the most important performance metric was integer performance. The workstation group at Intel was newly formed at the time, and did not know how to present a coherent set of achievable requirements that would be taken seriously. The gaming market had not yet become the market force that it is today. So Intel was not willing to set a requirement for floating point performance to match or exceed that of the RISC processors then prominent in the professional workstation space. AMD, however, was willing to invest silicon in a fully pipelined floating point unit. This earned AMD and Opteron the respect and loyalty of the gaming market in the years that followed.
Like the Pentium Pro, the Opteron core implemented out-of-order 3-wide superscalar execution units. Beyond that, Opteron has several other micro-architectural innovations to be competitive in the Pentium 4 time frame.

Below are the recent Opterons:
quad-core Barcelona 65nm, 285mm2, 4x512K L2, 2M L3, 463M transistors (2007),
quad-core Shanghai 45nm, 258mm2, 6M L3, 758M transistors (2008),
six-core Istanbul 45nm, 346mm2, 6M L3, 904M transistors (2009).
The most recent AMD Opteron, Magny-Cours, has two Istanbul dies in a single package for 12 cores. Two Istanbul dies representing Magny-Cours are shown next to Nehalem-EX, approximately to scale. As listed above, each Istanbul die is 346mm2 and Nehalem-EX is 684mm2.
Istanbul and Nehalem-EX. The scaling is obviously incorrect:
the two Istanbul dies combined (692mm2) should be slightly larger than Nehalem-EX (684mm2).
I will recalibrate my images soon.
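For drawing the dies to scale, the relevant quantity is the linear ratio, which is the square root of the area ratio. The following is a minimal sketch of that arithmetic using only the die areas quoted above; the variable names are illustrative, and it ignores die aspect ratio, which would also be needed to draw the actual rectangles.

```python
import math

# Die areas quoted in the text, in mm^2.
istanbul_mm2 = 346.0
nehalem_ex_mm2 = 684.0

# Magny-Cours is two Istanbul dies in one package.
magny_cours_mm2 = 2 * istanbul_mm2

# Ratio of total silicon area: Magny-Cours versus Nehalem-EX.
area_ratio = magny_cours_mm2 / nehalem_ex_mm2

# Linear scale of ONE Istanbul die relative to Nehalem-EX,
# assuming both are drawn with the same aspect ratio (an assumption).
linear_ratio = math.sqrt(istanbul_mm2 / nehalem_ex_mm2)

print(f"Magny-Cours total area: {magny_cours_mm2:.0f} mm^2")
print(f"area ratio vs Nehalem-EX: {area_ratio:.3f}")
print(f"linear scale of one Istanbul die: {linear_ratio:.3f}")
```

Under these assumptions, each Istanbul die should be drawn at a little over 70% of the Nehalem-EX edge length, and the pair together should cover about 1% more area than Nehalem-EX.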
Below is the die layout of Magny-Cours from AMD's Hot Chips 2009 presentation.

346mm2 per die, versus 263mm2 for Shanghai
AMD released information on its upcoming Bulldozer architecture in conjunction with Hot Chips 22. A less technical version is in the Bulldozer press release on the AMD website, and the slide deck is on SlideShare. The principal new feature is the AMD solution for addressing multi-threaded performance. AMD seems determined not to implement the same multi-threading solution adopted by Intel (Hyper-Threading), IBM POWER, and Sun. The new AMD Bulldozer architecture, shown in the diagrams below, has a complete core that appears as two cores to the operating system. Each sub-core has a dedicated integer scheduler, execution units, and L1 cache. The fetch/decode logic, floating-point unit, and L2 cache are shared by the complete core.
Additional Bulldozer architecture diagrams:
More detailed Bulldozer core architecture diagrams:
The Bulldozer chip-level view shown below has 4 complete double integer cores, with L3 cache and memory controllers shared by all cores.
The AMD press release mentions that the processor socket would have 16 cores visible to the operating system, so the socket would consist of two chips. Bulldozer then has 33% more cores than the current 12-core Opteron 6100 series (Magny-Cours). AMD stated that 50% better performance was expected, corresponding to roughly 13% better performance per core (1.50/1.33 = 1.125). This could indicate that the Bulldozer core has micro-architectural improvements, or it could be that AMD is able to get 15-20% better frequency on the 32nm process than Magny-Cours can on the 45nm process. Note that the single-die six-core Istanbul could reach 2.8GHz on the 45nm process.
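The per-core figure comes from dividing the claimed socket-level throughput ratio by the core-count ratio. Here is a minimal sketch of that arithmetic, using only the numbers quoted above. The function name is illustrative, and the calculation assumes throughput scales linearly with core count, which is a simplification.

```python
def per_core_gain(old_cores, new_cores, throughput_gain):
    """Return the implied per-core gain as a fraction.

    throughput_gain is the claimed socket-level gain as a fraction
    (0.50 for "50% better"). Assumes linear scaling with core count.
    """
    core_ratio = new_cores / old_cores
    return (1.0 + throughput_gain) / core_ratio - 1.0


# 12-core Magny-Cours versus 16-core Bulldozer socket, 50% claimed gain.
core_increase = 16 / 12 - 1.0
gain = per_core_gain(12, 16, 0.50)

print(f"core count increase: {core_increase:.1%}")
print(f"implied per-core gain: {gain:.1%}")
```

The result is 12.5% per core, which the text rounds to 13%. It is less than the 15-20% frequency gain mentioned above, so a frequency increase alone would be enough to account for it.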
AMD claims the additional integer unit requires 12% extra die space, for a potential 70%(?) performance gain on some applications. A 70% throughput gain for 12% die size is a spectacular achievement. However, there is a curious matter of Bulldozer die size. The quad-core Shanghai die on 45nm is 258mm2 with 6M L3 cache. The six-core Istanbul, also on 45nm, is 346mm2 with 6M L3. A shrink/compaction of quad-core Shanghai to 32nm should have a die size in the range of 130mm2. The 12% for the double integer execution would only increase the die size to about 145mm2, which is rather small for a high-end server processor, even if the processor socket will have two dies.
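The die size estimate can be reproduced with an ideal-shrink calculation, in which area scales with the square of the process feature size. This sketch makes two assumptions: it treats the shrink as ideal (in practice I/O and analog blocks shrink less), and it applies the 12% to the whole die, as the estimate above does. The function name is illustrative.

```python
def shrink_area(area_mm2, old_node_nm, new_node_nm):
    """Ideal shrink: area scales with the square of the feature size."""
    return area_mm2 * (new_node_nm / old_node_nm) ** 2


shanghai_45nm_mm2 = 258.0  # quad-core Shanghai, 6M L3, from the text

# Hypothetical Shanghai shrunk from 45nm to 32nm.
shrunk_mm2 = shrink_area(shanghai_45nm_mm2, 45, 32)

# Add the 12% AMD quotes for the second integer unit,
# applied to the whole die as in the text's estimate.
with_second_int_mm2 = shrunk_mm2 * 1.12

print(f"ideal 32nm shrink: {shrunk_mm2:.1f} mm^2")
print(f"with 12% added:    {with_second_int_mm2:.1f} mm^2")
```

This gives about 130mm2 for the shrink and about 146mm2 with the extra integer unit, in line with the rounded figures above.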