Processor Architectures and History

To better understand server systems, it is necessary to understand the microprocessors on which server systems are based. In turn, it is helpful to understand the recent history of microprocessors. The Intel Pentium Pro was a landmark that brought Intel into prominence in server systems. It was the first X86 (or IA-32) processor that shook the presumption that RISC was superior and would eventually take over the market.

The RISC argument had become so broadly accepted that Intel even planned on replacing their X86/IA-32 line. Instead of fielding yet another RISC processor, Intel felt compelled to come up with an even better architecture, which became EPIC, the foundation of the Itanium processors. There is a valid technical foundation to the argument that RISC concepts were becoming outdated. RISC was conceived with the anticipation of transistor budgets in the tens of thousands. In the EPIC/Itanium time frame, the transistor budget was in the tens of millions. Broad consensus does not seem to be a reliable predictor of the future.

Intel History

The Pentium Pro also supported glueless symmetric multi-processing (SMP) for systems with up to 4 processors, enabling ordinary PC companies to enter the server market. The Pentium Pro was followed by incremental improvements of the Pentium Pro architecture in Pentium II and Pentium III. The next major new architecture was the Pentium 4.

At the time, Intel had two major concurrent design teams for mainstream personal computers, one focused on the high end, the other on the lower end. This was in addition to the design teams working on Itanium.

Eventually Intel settled into their Tick-Tock Model. Intel has two major processor architecture teams, and one (or more?) smaller design team. The large teams handle the new processor micro-architecture (the "tock" in the Intel tick-tock). A smaller team under Intel's process technology division is responsible for taking an existing processor architecture to the new manufacturing process (the "tick"). There may be other small teams that handle the variants of each basic architecture.

Each architecture team is expected to produce a new processor architecture every four years. Two architecture teams are employed so that a new architecture is ready every other year. The process team is primarily responsible for advancing the manufacturing process of the latest tock processor, but can make design improvements, which sometimes have significant impact (such as the on-die L2 cache in the Pentium III codenamed Coppermine).
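The staggering of the two architecture teams can be sketched in a few lines of Python. This is only an idealized illustration of the cadence described above: the team names and the 2006 start year are placeholders chosen for the example, and real launch dates drift.

```python
def tick_tock_schedule(start_year, years, team_cycle=4, teams=("Team A", "Team B")):
    """Return (year, kind, owner) tuples for an idealized tick-tock roadmap.

    Team names and the start year are placeholders, not Intel's own labels.
    """
    # Staggering two teams on a 4-year cycle yields a new architecture every 2 years.
    stagger = team_cycle // len(teams)
    schedule = []
    for offset in range(years):
        year = start_year + offset
        if offset % stagger == 0:
            # A new microarchitecture (tock), alternating between the large teams.
            team = teams[(offset // stagger) % len(teams)]
            schedule.append((year, "tock", team))
        else:
            # A process shrink of the latest tock (tick), handled by the process team.
            schedule.append((year, "tick", "process team"))
    return schedule


for year, kind, owner in tick_tock_schedule(2006, 6):
    print(year, kind, owner)
```

Each team's tocks land four years apart, while the roadmap as a whole alternates tock and tick every year.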

The Pentium 4 (NetBurst) and Nehalem architectures are from one major team, hence both feature Hyper-Threading. Pentium M, Core2 (Conroe) and the upcoming Sandy Bridge are from the other team. So even though Core2 was a later processor microarchitecture than NetBurst, its team elected not to employ Hyper-Threading in Core2. However, Sandy Bridge is expected to have Hyper-Threading. It is unclear whether Sandy Bridge will implement two or four threads per core. (see Wikipedia List_of_Intel_microprocessors)

AMD History

In this time period, the AMD K5 project (1995?) did not achieve the performance goals of its ambitious architecture, resulting in the K5 being set aside. AMD had purchased NexGen and employed their design as the K6, setting aside their own K6. The K6 architecture (1996) was relatively successful, and was regarded as being more advanced than the Intel Pentium, but below the performance of its contemporary, the Pentium II. Being bracketed by the Pentium II, which could occupy the higher price points, and by the Pentium (MMX), which had relatively low cost, AMD had difficulty in achieving financial success in this time period.

The AMD K7 (Athlon, 1999) architecture was more advanced than the Intel Pentium III. While the K7 was less "advanced" than the Pentium 4, the Pentium 4 had uneven performance characteristics, so AMD was finally able to establish a position in the more profitable desktop processor market segments. The K7 was soon followed by the K8 (Opteron, 2003), which incorporated several bold initiatives. The Opteron was considered equal to or better than the contemporary Intel processor, and also launched AMD into the SMP server market. Afterwards, AMD retained the basic Opteron architecture, making incremental improvements, while focusing on increasing the number of cores.

Intel Pentium Pro to Pentium III

The Pentium Pro came to market in late 1995, on a 0.5 micron process, with 5.5M transistors, at 150-166MHz. The initial version had the smaller 256K L2 cache, which was more suitable for 2-way servers. A later version with 512K L2 cache had decent 4-way system scaling.

The Pentium Pro was followed by the Pentium II (Klamath 0.35 micron, 7.5M transistors, up to 300MHz) in 1997, incorporating the MMX (vector integer) instruction set extensions first developed for the Pentium MMX, along with improvements in misaligned cache accesses for 16-bit applications. Deschutes was the codename for the 0.25 micron version of Pentium II, launched in 1998 and available at frequencies up to 450MHz. The server version was the Pentium II Xeon, with a larger and higher bandwidth L2 cache.

Pentium III followed in 1999 with new SSE (vector single precision FP) instruction set extensions. The initial version, codename Katmai, was on the same 0.25 micron (or 250nm) process as Deschutes. Transistor count was now 9.5M. The maximum desktop frequency reached 600MHz, but the Pentium III Xeon server version stopped at 550MHz.

The 180nm version of Pentium III, codenamed Coppermine, launched in early 2000. The L2 cache was brought on-die at 256K, for a total of 26M transistors. The main motivation was cost reduction, both for the processor and at the system level. Below the radar, a new very low latency 256-bit wide bus was designed to connect the L2 cache. (For Pentium II, there was a derivative with on-die L2 that retained the original 64-bit back-side bus.)

The server version, which finally carried the codename Cascades, underwent late revisions due to poor and incompetent planning (it was realized that the initial plan, a quick shrink of Katmai with the long latency off-die cache on a narrow 64-bit bus, would look pathetic compared to Coppermine). After revision, Cascades also adopted an on-die L2 cache, with 1M and 2M versions to enable adequate 4-way scaling. The initial version was 700MHz, available about one year after Coppermine (1GHz), with an even later version at 900MHz.

Pentium III was further continued to 130nm (Tualatin) with a larger 512K on-die cache for the mobile market. The performance characteristics were very positive, and 2-way servers also adopted this processor.

All Pentium Pro to Pentium III processors employed a common bus. The initial Pentium Pro bus ran at 60 and 66MHz. The Pentium III processors extended bus frequency to 133MHz. The Pentium III Xeon with four processors on a shared bus was limited to 100MHz, mostly due to the slot mounting mechanism.

First Generation Pentium 4

Willamette was the codename for the first brand new micro-architecture after Pentium Pro, a 180nm process with 42M transistors, coming out in late 2000 as the Pentium 4, with a key architectural feature called NetBurst (see The Microarchitecture of the Pentium® 4 Processor).

The Pentium 4 architecture was meant to run at very high clock rates. The very high clock rate on the Pentium 4 is achieved with a very deep pipeline, that is, each operation is divided into many very small steps. There are 20 pipeline stages in the first generation Pentium 4 (Willamette and Northwood) and ~31 for the second generation (Prescott and Cedar Mill), compared to 10-12 on the Pentium Pro to Pentium III line, 12 stages for integer and 17 for floating point on the Opteron, and 14 for the Core 2. The Itanium 2 has 8 pipeline stages, but can execute 6 instructions simultaneously.

The Pentium 4 bus was an improved version of the Pentium Pro bus. The clock rate was 100MHz, but addresses could be sent at 200MHz and data was quad-pumped for 400MHz. The APIC controller was brought from a separate bus onto the main bus. During the early Pentium 4 bus architecture time, it was too early to consider the point-to-point signaling that was gaining acceptance as the next generation technology. Once Intel did commit to the Pentium 4 bus, they decided to leverage the infrastructure built around it for a long time. All the NetBurst architecture processors (Willamette, Northwood, Prescott and Cedar Mill), all the Pentium M architecture processors (Banias, Dothan and Yonah) and the Core 2 architecture processors employed the Pentium 4 bus architecture, with the data transfer rate pushed to 1600MHz at the terminal end.
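The relationship between the base clock and the pumped address and data rates can be worked through with a small calculation. The sketch below assumes a 64-bit (8-byte) data bus, which is not stated in the text above, and the function name is made up for this example.

```python
def fsb_rates(base_clock_mhz, data_bus_bytes=8):
    """Return (address MT/s, data MT/s, peak MB/s) for a Pentium 4 style bus.

    data_bus_bytes=8 assumes a 64-bit data path; that width is an assumption
    of this sketch and does not come from the article.
    """
    address_rate = base_clock_mhz * 2   # addresses are double-pumped
    data_rate = base_clock_mhz * 4      # data is quad-pumped
    peak_bandwidth = data_rate * data_bus_bytes
    return address_rate, data_rate, peak_bandwidth


# The original 100MHz bus, and the 400MHz base clock implied by a 1600MHz data rate.
for base in (100, 400):
    addr, data, mb_per_s = fsb_rates(base)
    print(f"{base}MHz clock: address {addr}MT/s, data {data}MT/s, peak {mb_per_s}MB/s")
```

Under that assumption the original bus peaks at 3200MB/s and the final 1600MHz version at 12800MB/s, shared by every processor on the bus.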

To realize the full benefit of the very deep Pentium 4 pipeline, the code sequence must be very predictable. Otherwise the pipeline gets flushed too often, and the Pentium 4 is no better than other processors running at lower clock rate. The Pentium 4 tended to have uneven characteristics relative to other architectures across a range of applications.
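A rough model shows why predictability matters so much for a deep pipeline. The sketch below is a textbook-style approximation, not a measurement: it assumes a flush costs about one cycle per pipeline stage, that 20% of instructions are branches, and a base CPI of 1.0. The two processors compared (20 stages at 3.0GHz against 12 stages at 2.0GHz) are illustrative stand-ins only.

```python
def ns_per_instruction(clock_ghz, pipeline_stages, mispredict_rate,
                       branch_fraction=0.2, base_cpi=1.0):
    """Average time per instruction when each mispredicted branch flushes the pipeline.

    All parameters are illustrative assumptions; the flush penalty is taken
    to be roughly one cycle per pipeline stage.
    """
    flush_cycles = branch_fraction * mispredict_rate * pipeline_stages
    return (base_cpi + flush_cycles) / clock_ghz


def deep_pipeline_speedup(mispredict_rate):
    """Speedup of a 20-stage 3.0GHz design over a 12-stage 2.0GHz design."""
    deep = ns_per_instruction(3.0, 20, mispredict_rate)
    shallow = ns_per_instruction(2.0, 12, mispredict_rate)
    return shallow / deep


for rate in (0.0, 0.05, 0.10, 0.25, 0.50):
    print(f"mispredict rate {rate:.2f}: speedup {deep_pipeline_speedup(rate):.2f}x")
```

With perfectly predictable code the deep design keeps its full 1.5x clock advantage; as the misprediction rate rises, the advantage steadily erodes.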

The 4-way server variant, codename Foster, became Xeon MP in Mar 2002. Apparently Intel could not initially get a worldwide registered trademark for Xeon, hence the earlier processors were Pentium II or III Xeon. Intel was apparently able to secure the trademark worldwide eventually.

Northwood was the codename for the 130nm Pentium 4, released in Q1 2002, L2 cache increased to 512K, 55M transistors. The MP version was Gallatin, initially with 1 & 2M L3 cache in Nov 2002, and a later version with 4M L3 cache.

Opteron

Whereas Athlon established AMD as a top tier microprocessor vendor in personal computers, Opteron established AMD in the server market. The main innovative features of the Opteron processor architecture are:

  1. Hyper-Transport replacing shared front-side bus (FSB)
  2. Integrated memory controller
  3. 64-bit instruction set extensions
  4. Extending general purpose & FP registers from 8 to 16
  5. Fully pipelined floating point

The Hyper-Transport point-to-point protocol enables higher total bandwidth per pin and lower latency than shared bus. The integrated memory controller reduces memory latency and improves performance probably in the range of 15-20%.
By itself, extending the instruction set architecture from 32-bit to 64-bit is not difficult, but Intel refused to consider this, reserving the 64-bit ISA for Itanium. Only when it gradually became clear that Itanium was not going to arrive soon in a meaningful manner was the magnitude of this blunder realized. However, expanding the number of general purpose (integer) and floating point registers from 8 to 16 each in 64-bit mode was very innovative. Intel architects swore that this could not be done. It was said that Sun Microsystems provided assistance on the instruction set architecture.

At the time, the plurality of Intel's high-end CPU sales came from business desktops, for which the premium performance metric was integer performance. The workstation division/group at Intel was newly formed at the time, and did not know how to present a coherent set of achievable requirements that would be taken seriously. The gaming market had not yet achieved the market force that it has today. So Intel was not willing to set a requirement for floating point performance to match or exceed that of the RISC processors then prominent in the professional workstation space. AMD, however, was willing to invest silicon in a fully pipelined floating point unit. This earned AMD and Opteron the respect and loyalty of the gaming market in the years that followed.

Like the Pentium Pro, the Opteron core implemented out-of-order 3-wide superscalar execution units. Beyond that, Opteron has several other micro-architectural innovations to be competitive in the Pentium 4 time frame.

Second Generation Pentium 4 Architecture

The first generation Pentium 4 was reasonably competitive with the contemporary Opterons. On the 130nm process, the Pentium 4 at 3.0GHz had broadly comparable performance to Opteron at 2.0GHz. Depending on the nature of the application, Pentium 4 had an advantage on some and Opteron on others. The Intel strategy for the second generation Pentium 4 architecture (see The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology ) codename Prescott on 90nm in March 2004 was to achieve even more extraordinarily high clock rates, hence the deeper pipelining to 31 stages.

Unfortunately for Intel, the leakage current for off-state transistors in the 90nm process was much higher than anticipated, so the Prescott core could not clock anywhere near as high as it was designed to within the 130W envelope that capped the amount of heat that could be removed from a medium sized die (112 square mm) with air-cooling. At 90nm, Pentium 4 reached 3.8GHz (3.66GHz in MP server variants, 3.33GHz with 8M L3 cache) compared with 2.8-3.0GHz for Opteron. There are indications that the second generation Pentium 4 architecture was targeted to reach 5-6GHz at 90nm and 9GHz on the 65nm process.

The Prescott Pentium 4 processors were 64-bit microprocessors (2004 Q1, 125M transistors, 112mm2). Cedar Mill was the 65nm version (Q1 2006). Potomac was the 90nm server version with 8M L3.

A brief note on comparing Intel and AMD processor frequencies: Intel tends to introduce a processor line with a range of frequencies, and then sometimes provide one or two higher frequencies later, but the main focus is to launch the next generation. AMD may increment frequencies throughout the product life.

Dual Core

Around this time, it was clear that single core performance could not be pushed at the traditional 40% per year pace of Moore’s Law. There was plenty of transistor budget, as each manufacturing process step doubles the number of transistors at a given die size. For server applications, the best strategy for performance in terms of through-put was multiple cores, either on one die or multiple die in one package (socket).
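To put the 40% per year figure in perspective, compound growth at that rate roughly doubles performance every two years. The helper names below are made up for this example, and the calculation is plain compound-growth arithmetic, not data from any roadmap.

```python
import math


def doubling_time_years(annual_gain):
    """Years needed to double at a fixed compound annual gain (0.40 means 40%)."""
    return math.log(2) / math.log(1 + annual_gain)


def cumulative_gain(annual_gain, years):
    """Total improvement factor after compounding for the given number of years."""
    return (1 + annual_gain) ** years


print(f"doubling time at 40%/year: {doubling_time_years(0.40):.2f} years")
print(f"gain over a 4-year architecture cycle: {cumulative_gain(0.40, 4):.2f}x")
```

Falling off that pace for even one four-year architecture cycle leaves a gap of nearly 4x, which is the shortfall multiple cores were expected to make up in throughput terms.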

Perhaps the one and only benefit (at this point in time) of a processor using the shared bus protocol is that it is relatively simple to put two single core die into one package that fits in a single processor socket. AMD had to do substantial design work to build their single-die dual-core processor. (Paxville MP: Dec 2005)


Xeon 7000 series, two Pentium 4 90nm, 135mm2, 2M L2, 169M transistors per die (2005)
compared with Dual-Core Opteron, 90nm, 199mm2, 2x1M L2, 233M transistors (2005/6)

The downside for Intel on the dual-core Pentium 4 architecture was that frequency had to be scaled back significantly to fit within the maximum supportable thermal envelope, 165W for a dual-core 3GHz 90nm Xeon 7000 series. By contrast, the 90nm single-core Opteron model 856 (DDR) was 3GHz in a 93W thermal envelope. The dual-core model 890 (DDR) at 2.8GHz fit in a 95W envelope, and the dual-core model 8220SE (DDR2) at 2.8GHz had a 120W envelope. So AMD was able to accommodate a dual-core Opteron on 90nm without giving up much of the design frequency.

Intel had to give up substantial frequency on the Pentium 4 architecture to stay within the desired thermal envelope. At 90nm, Intel was limited to 3.0GHz for the dual-core. AMD started at 2.2GHz in 2005, and slowly incremented frequency to 2.8GHz by 2006. (Pentium D: Smithfield, 2005 Q2; Presler, 2006 Q1.)

Intel had a single die, dual core Pentium 4 with 16M shared L3 cache on the 65nm process in Aug 2006 (Tulsa). This was able to clock at 3.4-3.5GHz. Because of the large cache, and hyper-threading, it was able to produce a very impressive TPC-C result, but the dual-core Opteron at 2.8GHz was able to achieve better performance in other aspects.


Pentium 4, 65nm (Cedar Mill), 81mm2, 2M L2, 188M transistors per die (2006)


Xeon 7100, 65nm (Tulsa), 435mm2, 2x1M L2, 16M L3, 1.328B transistors (2006)

Afterwards, these became the 7000 series. The first, numbered 70xx, were the standard 90nm NetBurst with two single-core die in a single package, each die with the 2M L2 cache used in desktop and 2-way systems. In 2006, this was followed by the Xeon 7100 series, which had 2 NetBurst cores, each with 1M dedicated L2, and a 16M shared L3, all on a single die. The very large L3 cache significantly improves high-call-volume transaction-type applications. In 2007, the 7300 series comprised two dual-core 65nm Core 2 die in a single package, with 2 x 4M L2 cache. In 2008, the 7400 series on the 45nm process had six cores, arranged as 3 dual-core pairs each with 3M L2 shared by the pair, and a 16M shared L3 cache, all on a single die. As with the 7100 series, the very large cache significantly improved high-call-volume applications.

Pentium M and Core Interlude

The Intel Pentium 4 architecture was designed to be a mainstream desktop processor with server and mobile versions. The goal was best performance within the desktop platform cost structure, including a reasonable power and thermal envelope for desktop systems. It was realized early in the design phase that the power consumption of the Pentium 4 relative to its performance was too high to be suitable for the mobile platform.

Chopaka and Timna

Initially, another architecture group was tasked to design a microprocessor described as an improved P6 microarchitecture with the codename Chopaka. This project was cancelled mid-stream, with the new task being to build as quickly as possible an integrated processor with the codename Timna. Time to market was deemed so critical that this was to be the existing Pentium III, north bridge (memory controller) and graphics all bolted onto a single die, even retaining a front-side bus between the processor and memory controller. Had the normal four year development cycle been available, the memory controller would have been integrated into the microprocessor with reduced latency and better performance.

Unfortunately for Intel, the memory controller hub available was for Rambus memory. A Rambus to SDRAM Memory Translator Hub was also built, as this was intended to target the low cost segment. After the Timna design was complete, and pre-production samples were sent out, most system vendors came to the conclusion that the Timna product line did not match up with any market segments (it was cancelled in Sep 2000). In other words, enough cost had already been extracted out of the mainstream Pentium III system that a crippled version was not necessary. A parallel might have been the IBM PC junior.

Pentium M

The design group then reverted to a microprocessor optimized for performance and power efficiency suitable for the mobile market. The first product had the codename Banias on 130nm in Mar 2003 (see The Intel® Pentium® M Processor: Microarchitecture and Performance). This became Pentium M and was targeted only at mobile platforms. One might think that this would be the previously cancelled Chopaka, but the Intel Technology Journal described Banias as a clean sheet design, instead of a significantly improved Pentium III architecture. The paternity/maternity of Pentium M will not be discussed here. The Pentium M performance characteristics were appropriate to be considered one full generation improvement over Pentium III also at 130nm (Tualatin). The 90nm version of Pentium M in Jun 2004 had the codename Dothan (see Performance and Power Consumption for Mobile Platform Components Under Common Usage Models).

Yonah, first called Core Duo, then Pentium

The next design from this group was codename Yonah (see Introduction to Intel® Core™ Duo processor architecture, pdf) on 65nm in 2006 Q1, a dual-core improved Pentium M design. The brand name for Yonah became Core instead of Pentium M. The Intel Core Duo was a dual-core processor, with a single-core variant called Core Solo. The product name later reverted to Pentium. Note that the Pentium M and Core were 32-bit microprocessors, as was the Pentium III.

Even though Pentium M and Core (later Pentium) were designed with objectives balanced between performance and the more restrictive mobile platform power and thermal envelope, the performance was still respectable compared with contemporary desktop processors with less restrictive power constraints.

Core 2 (65nm)

Having established credibility in designing microprocessors with both performance and power efficiency, the next product from this group was a new 64-bit dual-core architecture on the 65nm process with the codename Conroe for desktop and Merom for mobile segments. The initial public name was Core 2 in the desktop and mobile markets. Core 2 has since been rebranded Core, with the original Core being rebranded Pentium. (Presumably extra marketing staff was added to handle all the PowerPoint slides that had to be changed for this.)

The rapid turn from Banias (Mar 2003) to Yonah (Jan 2006) and Conroe (Jul 2006) was truly impressive considering that each represented a significant improvement at the core level over its predecessor.


See the wikipedia page on Core microarchitecture.

Core 2

With Core 2, Intel discarded the previous practice of having one large design group for desktop processors plus variants, and another large group for mobile processors. The new organization established the tick-tock method, with each team producing a new microarchitecture every four years, allowing a new architecture to be introduced every two years, with a process shrink with minor improvements in the in-between years. Intel roadmaps show the tock processors in the even years, and tick processors in the odd years. However, the actual target launch dates tend to be towards the end of the year, and if an extra stepping is required, may drift to the next year.

The Core 2 on 65nm has two cores with a 4M shared L2 cache, and runs at up to 3.0GHz.

In the server market, the dual-core Core 2 became the Xeon 5100 series for 2-way systems (Jun 2006). There was also a dual-core Core 2 for single socket servers as the Xeon 3000 series. Later, a dual-die quad-core version became the Xeon 5300 series (Jan 2007). The dual-die quad-core codename Tigerton became the Xeon 7300 series in Sep 2007.


Xeon 7300 (2 Conroe die), 65nm, 143mm2 per die, dual core per die, 4M L2, 291M transistors (2007)

The Core 2 architecture at 3GHz was far more powerful than the Pentium 4 architecture at 3.4GHz or Opteron at 2.8GHz at the processor core level. This is evidenced by the SPEC CPU Integer base scores of 21 for Core 2, 12 for the Pentium 4, and 13 for Opteron. Single query tests in SQL also demonstrate the outstanding single-core performance of the Core 2 architecture.
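Dividing each score by its clock frequency makes the per-clock gap explicit. The sketch below uses only the scores and frequencies quoted above; the score-per-GHz figure is a convenience metric for this comparison and not an official SPEC measure.

```python
# (SPEC CPU Integer base score, clock in GHz) as quoted in the text above.
processors = {
    "Core 2": (21, 3.0),
    "Pentium 4": (12, 3.4),
    "Opteron": (13, 2.8),
}

# Score per GHz is a rough per-clock efficiency indicator, not a SPEC metric.
score_per_ghz = {name: score / ghz for name, (score, ghz) in processors.items()}

for name, value in sorted(score_per_ghz.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {value:.2f} per GHz")
```

By this measure Core 2 delivers roughly twice the work per clock of the Pentium 4, which is why it could lead comfortably despite the lower frequency.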

Core 2 (45nm)

The 45nm Core 2, codename Penryn was also a dual-core with the shared L2 cache increased from 4M in Conroe to 6M (107mm2 die size). (see Improvements in the Intel® Core™2 Penryn Processor Family Architecture and Microarchitecture).

For 2-way servers, the single-die dual-core 45nm Core 2 product was the Xeon 5200 series (Feb 2008) and the two-die quad-core was the Xeon 5400 series (2007 or Mar 2008?).

Note: In the beginning, the Xeon brand meant 4-way+ systems for server platforms, even though there were 2-way workstations. The 2-way servers used standard desktop Pentium II or III processors. With Pentium 4, desktop processors were now only for single-socket platforms. Xeon became the brand for 2-way servers. Xeon MP was the brand for 4-way and higher, even though the Xeon processors of this time were really not meant to scale well. Then in 2006, Xeon 7000 was used for 4-way and higher, Xeon 5000 for 2-way, and Xeon 3000 for single socket servers.

During this time, a processor architecture had different codenames for each market segment. Eventually, people got tired of this silliness, and there was a single codename, with the suffix EP denoting Enhanced Performance for 2-way servers, and the suffix EX denoting Expandable Performance for 4-way and higher.

Perhaps keeping the marketing people busy with unending PowerPoint alterations leaves them no time to make other contributions to the design effort?

For 4-way+, there was a special variant with six cores integrated into a single die with 16M L3 cache and 1.9 billion transistors, codename Dunnington, officially the Xeon 7400 series (Sep 2008). Each of the three dual-core sets had its own 3M L2 cache shared between the two cores. Below is the Dunnington die, shown next to Penryn. The two sections labeled cache are the L3 cache. The other part that looks like cache but is not labeled is the L2 cache.


Xeon 7400, 45nm, 503mm2, 3 dual core pairs, 3 x 3M L2, 16M L3

Nehalem, Core i3-7, Xeon 5500 etc

The second tock processor, from the Pentium Pro and NetBurst team, has codename Nehalem. The desktop version came out in late 2008 as Core i7, with later Core i5 and i3 versions. In 2-way servers, Nehalem-EP is the Xeon 5500 series (May 2009).

The super-deep pipeline (20-30 stages) and high core speed of Pentium 4 had to be abandoned. The power consumption to performance ratio was too high, and there was too much legacy code that did not benefit proportionately or at all from the deep pipeline. However, other concepts from Pentium 4 were kept, including the trace cache.

See the Wikipedia page on Nehalem microarchitecture.

Also see Dave Kanter's Real World Tech article Inside Nehalem: Intel's Future Processor and System

Below are
Conroe 65nm, 4M L2, 143mm2, 291M transistors,
Penryn 45nm, 6M L2, 107mm2, 420M transistors,
and Nehalem 45nm, 256K L2 per core, 8M L3, 263mm2, 731M transistors, shown approximately to scale.


Conroe 65nm, Penryn 45nm and Nehalem 45nm
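The die sizes and transistor counts listed above can be reduced to a transistor density for each die. This is a minimal sketch using only those quoted figures; the dictionary layout is just for this example.

```python
# (process, die area in mm2, transistors in millions) as listed above.
dies = {
    "Conroe": ("65nm", 143, 291),
    "Penryn": ("45nm", 107, 420),
    "Nehalem": ("45nm", 263, 731),
}

# Millions of transistors per square millimetre of die area.
density = {name: transistors / area for name, (_, area, transistors) in dies.items()}

for name, (process, area, transistors) in dies.items():
    print(f"{name:8s} {process}: {density[name]:.2f}M transistors per mm2")
```

Penryn comes out at nearly twice the density of Conroe, in line with a full process shrink. Nehalem is on the same 45nm process but less dense overall, which would be consistent with a die that devotes more area to logic and interfaces and proportionally less to cache.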

AMD Opteron quad-core processors and later

Below are the recent Opterons:
quad-core Barcelona 65nm, 285mm2, 4x512K L2, 2M L3, 463M transistors (2007),
quad-core Shanghai 45nm, 258mm2, 6M L3, 758M transistors (2008),
six-core Istanbul 45nm, 346mm2, 6M L3, 904M transistors (2009)


Quad-Core Barcelona 65nm, QC Shanghai 45nm, six-core Istanbul and Athlon II X4?

More Nehalem and related

Below, the Nehalem die is shown with an outline of functionality.

[Figure: Nehalem die, with and without functional outline]

Lynnfield (Jasper Forest?) is a Nehalem microarchitecture with integrated PCI-E.

[Figure: Lynnfield die]

A comparison of quad-core Nehalem 45nm with dual-core Westmere 32nm and the six-core Westmere (uncertain if this is scaled correctly).

Nehalem, Westmere 2P, Westmere 2P, Westmere 6P (scale?)

For some reason, I thought I saw Nehalem-EX die size listed at 540mm2. The Intel press release says the actual size is 684mm2, for which the scaling below is more appropriate. (previously, I had scaling based on 540mm2)

Nehalem and Nehalem-EX (based on 684mm2 die size)

The most recent AMD Opteron, Magny-Cours, has two Istanbul die in a single package for 12 cores. Two Istanbul die representing Magny-Cours are shown next to Nehalem-EX, approximately to scale. Per above, each Istanbul die is 346mm2 and Nehalem-EX is 684mm2.

Istanbul and Nehalem-EX