To better understand server systems, it is necessary to understand the microprocessors on which server systems are based. In turn, it is helpful to understand the recent history of microprocessors. The Intel Pentium Pro was a landmark that brought Intel into prominence in server systems. It was the first X86 (or IA-32) processor that shook the presumption that RISC was superior and would eventually take over the market.
The RISC argument had become so broadly accepted that Intel even planned on replacing their X86/IA-32 line. Instead of fielding yet another RISC processor, Intel felt compelled to come up with an even better architecture, which became EPIC, the foundation of the Itanium processors. There is a valid technical foundation to the argument that RISC concepts were becoming outdated. RISC was conceived with the anticipation of transistor budgets in the tens of thousands. In the EPIC/Itanium time frame, the transistor budget was in the tens of millions. Broad consensus does not seem to be a reliable predictor of the future.
Below is an architectural diagram of Haswell from IDF 2012 Fall.
Below is an architectural diagram of Sandy Bridge from IDF 2010 Fall.
Below is an architectural diagram of Nehalem from IDF 2009 Spring.
Below is an architectural diagram of Core from EMEA Academic Forum 2007 Spring. Core is described as 4-wide execution even though it appears to have 5 ports off the schedulers.
The Xeon E5, based on the EP and EN versions of Sandy Bridge, came out in 2012, about the same time as the Ivy Bridge desktop and mobile processors. The E5 EP supports both 2-way and 4-way systems. Supposedly there will be an EX processor coming out this year, based on Ivy Bridge (?), to succeed Westmere-EX.
I will straighten things out later. But for now:
Westmere 6-core 248mm2 and Sandy Bridge 4-core 216mm2, both 32nm
Below are Westmere-EX and Sandy Bridge EP/EN.
Westmere-EX, 10 cores, 513mm2, and Sandy Bridge EP & EN, 8 cores, 435mm2
Intel Hyper-Threading Recap: Simultaneous and Temporal
Several Intel processor architectures have Hyper-Threading, including the Pentium 4 (NetBurst), the more recent Itanium (Montecito forward) and even Atom. The original Pentium 4 implementation was simultaneous multi-threading (SMT). The Itanium implementation is time-slice or temporal multi-threading. I had taken Nehalem (and Atom?) to be temporal as well, but Intel and Wikipedia describe Nehalem as SMT, not temporal.
The general idea in microprocessor architecture has always been to make best use of the available transistors. Long ago, the objective was single-threaded performance at the processor socket level (and multi-threaded performance at the system level). For the last several years, the objective has shifted to multi-threaded performance within a reasonable power envelope, with consideration for the fact that single-threaded performance is still important.
The ideal microprocessor design has all units uniformly running at maximum load continuously. Of course, the ideal cannot be achieved across a broad range of applications, except perhaps for brief intervals.
In a single-threaded architecture, the pattern of Moore's Law has been that each doubling of the transistor budget could increase performance by about 40%. In a multi-core design, the aggregate throughput could be linear with the number of cores, or transistor budget. One constraint is the power budget of the combined cores: in many multi-core designs, the clock frequency is restricted to below the top frequency capability of a single core. So multi-core processors can achieve better scaling than the single-threaded Moore's Law pattern if an application is effectively multi-threaded.
Early in the original Pentium 4 (Willamette) design phase, it was thought that simultaneous multi-threading (SMT), as it was called before Hyper-Threading, could contribute a 30% throughput performance gain in multi-threaded server applications at a cost of approximately 10% in transistor budget. As it turned out, unanticipated complications resulted in less gain in the key TPC-C benchmark, probably on the order of 10%.
It is possible, or suspected, that all of the Pentium 4 HT performance gain can be attributed to the network round-trip portion, with no gain in the core SQL Server engine. In addition, there was erratic behavior in other applications, probably due to neither the operating system nor the application being properly designed for an HT processor architecture. This is especially evident in parallel execution plans. The Prescott-based Pentium 4 processor did show a 40-50% performance gain on Quest LiteSpeed database backup compression, probably the highest reported HT performance gain. Hyper-Threading does have potential, but there are issues that need to be worked out.
For the Nehalem generation, an informal statement puts the HT performance gain in the range of 30% for high call volume workloads, and in the range of 10% for DW, but additional details are necessary before making conclusive statements. If the original Willamette die size impact were still the case, then the 30% gain in transaction processing for 10% die size is a good trade. However, the 10% gain in DW is only marginal, about the same as spending the die area on more non-HT cores.
In a test with a custom non-transactional (no locking code) index search engine, an astounding performance gain of nearly 100% was observed with HT in a 2-way quad-core Nehalem system. The code was almost entirely a repetitive sequence of a string comparison followed by a memory fetch. If the comparison operation is shorter than a local memory fetch (50-60ns, or 150-180 CPU cycles), then a 100% gain would seem to be possible. Perhaps there are still unresolved HT contention points in the SQL Server engine.
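To see why a gain of nearly 100% is plausible when the compare is shorter than the memory fetch, here is a minimal sketch of an idealized model. It is not a measurement and not how the actual test was built: it assumes each loop iteration is a fixed block of compute followed by a fixed stall, that the second logical processor can use the core for free during the stall, and that nothing else is contended. The function names and the 3.0GHz clock (the value implied by equating 50-60ns with 150-180 cycles) are my assumptions.

```python
def ns_to_cycles(latency_ns, clock_ghz=3.0):
    """Convert a latency in nanoseconds to core clock cycles.

    The 3.0 GHz default is an assumption: it is the clock implied by
    treating 50-60ns as 150-180 cycles.
    """
    return latency_ns * clock_ghz


def two_thread_speedup(compute_cycles, stall_cycles):
    """Idealized throughput ratio of two HT threads vs one thread on a core.

    One thread completes an iteration every (compute + stall) cycles.
    With two threads, each still needs (compute + stall) per iteration,
    but the core must also supply 2 * compute cycles of execution for
    the pair, so the pair completes two iterations every
    max(compute + stall, 2 * compute) cycles.
    """
    if compute_cycles <= 0 or stall_cycles < 0:
        raise ValueError("compute must be positive, stall non-negative")
    one_thread_period = compute_cycles + stall_cycles
    two_thread_period = max(one_thread_period, 2 * compute_cycles)
    return 2 * one_thread_period / two_thread_period


if __name__ == "__main__":
    stall = ns_to_cycles(50)  # local memory fetch, low end of the range
    for compute in (50, 150, 300, 600):
        speedup = two_thread_speedup(compute, stall)
        print(f"compute={compute:4d} cycles  speedup={speedup:.2f}x")
```

In this model the speedup stays at 2.0x as long as the compute portion is no longer than the stall, and falls toward 1.0x as compute dominates. Real code will fall short of this because the two threads also share caches, load buffers and execution ports.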
With the Bulldozer architecture, AMD is asserting that the Fetch/Decode and FP units are under-utilized in previous Opteron architectures. Of course, Bulldozer is targeted at server applications, which tend to emphasize integer performance. Hence the Bulldozer design has two integer cores sharing the under-utilized units.
Any assessment at this point is speculative without actual measurement details. The various press releases mention that all three expected 2011 processors (AMD Bulldozer, Intel Westmere-EX on the high end and Intel Sandy Bridge up to the mid-point) have reached first silicon. It is unknown whether meaningful performance numbers have been generated with the pre-production samples, as some references are to estimated performance and some are to simulations. However, based on the pre-production statements that have been made, which have been reasonably reliable in the past, we can draw some expectations.
At this point in mid-2010, the three main processors are:
In 2-way (socket) systems, the Xeon 5680 3.33GHz has 25% better TPC-E (OLTP) performance than the Opteron 6176 2.3GHz, and the two are roughly comparable on TPC-H (DW). On 4-way systems, the Xeon 7560 2.26GHz has 51% better TPC-C and 38% better TPC-E than the Opteron 6176. If we attribute 30% of the difference between the Xeon 7500 and Opteron 6100 to Hyper-Threading, then the 4 x 12-core 2.3GHz Opteron would be close to the 4 x 8-core 2.26GHz Xeon.
The 4-way Opteron does have a much lower system cost than the Xeon 7500, for roughly comparable price performance. The 4-way Opteron price-point advantage would be negated by the SQL Server Enterprise Edition per-processor license cost, but it is a better fit for Standard Edition and CAL licensing. The other factor to consider is that the Xeon 5680 has nearly 2X the single core performance of the Opteron, and the Xeon 7560 is 35% better.
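The adjustment above is simple enough to sketch. This assumes one particular reading of "attribute 30% to Hyper-Threading": that HT multiplies Xeon throughput by 1.30, so dividing the published ratio by 1.30 estimates the ratio with HT off. The function name and that interpretation are mine, and the result is an estimate, not a benchmark.

```python
def ratio_without_ht(ratio_with_ht, ht_gain=0.30):
    """Estimate the Xeon/Opteron throughput ratio with HT removed.

    Assumes HT acts as a simple multiplier of (1 + ht_gain) on the
    Xeon result. ht_gain=0.30 is the informal figure from the text.
    """
    return ratio_with_ht / (1.0 + ht_gain)


# Published 4-way ratios from the text: Xeon 7560 vs Opteron 6176.
published = {"TPC-C": 1.51, "TPC-E": 1.38}

if __name__ == "__main__":
    for benchmark, ratio in published.items():
        adjusted = ratio_without_ht(ratio)
        print(f"{benchmark}: {ratio:.2f}x with HT, about {adjusted:.2f}x without")
```

Under that assumption the remaining Xeon advantage is roughly 16% on TPC-C and 6% on TPC-E, which is what I mean by "close".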
In the mid-2011 time frame, the expected processors would be:
The expectation is that Bulldozer, with a 50% throughput performance gain over Magny-Cours, will be better than the current six-core Westmere, but could be merely comparable to Sandy Bridge if Sandy Bridge has 8 cores with the same single-core performance as each Westmere core. Westmere and Sandy Bridge would continue to have nearly 2X the single core performance of Bulldozer.
In 4-way systems, Westmere-EX will have 25% more cores, and should be able to improve on the 2.26GHz frequency of Nehalem-EX. (There was a 50% increase in the number of cores from Nehalem to Westmere, both at 3.33GHz.) This should allow Westmere-EX to maintain an advantage over Bulldozer. AMD will maintain the midway price point between the 2-way and 4-way Intel systems.
I am sure there will be heated (vehement?) arguments over my expectations. But please, I would like to hear quantitative analysis rather than emotional outbursts.
Consider some of the implications of Moore's Law. Suppose we start with a baseline processor with core size 50mm2. If the goal is a high-end workstation product, we might target a total silicon budget of 400mm2, in which case we could have 8 cores total. The aggregate throughput performance of the processor socket for a trivially multi-threaded application, that is one in which there is zero overhead for coordinating threads and no resource contention, is simply the number of cores times the baseline single core performance, 8 in this case.
|Core Size|# of Cores|Core Performance|Aggregate Performance|
|50mm2|8|1.0|8.0|
|100mm2|4|1.4|5.6|
|200mm2|2|2.0|4.0|
Now suppose we desire greater single core performance. The pattern established by Moore's Law at a given manufacturing process is that doubling the transistor budget (die area, to be more precise, as logic and cache memory transistors have very different densities) should yield a 40% performance gain for a single-threaded design. The second processor, at 100mm2, should have 40% better performance relative to the baseline. However, we can only fit 4 cores on a 400mm2 die, for an aggregate throughput performance of 5.6. The third example, at 200mm2, should have twice the baseline single core performance, but only 2 will fit on the die, for an aggregate throughput performance of 4.
It is evident that throughput-oriented applications with very little multi-threading overhead or contention favor many basic cores, while applications that are not easily multi-threadable favor more powerful cores at the expense of throughput.
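The arithmetic above can be reproduced with a short sketch. It assumes the 40%-per-doubling rule holds exactly and applies continuously to core area, and that cores are the only thing on the die (no uncore, no shared cache). The function names are mine. Note that compounding 1.4 twice gives 1.96, which the text rounds to 2.

```python
import math


def core_performance(core_area_mm2, base_area_mm2=50.0, gain_per_doubling=1.4):
    """Relative single core performance under the 40%-per-doubling rule."""
    doublings = math.log2(core_area_mm2 / base_area_mm2)
    return gain_per_doubling ** doublings


def die_summary(core_area_mm2, die_area_mm2=400.0):
    """Return (number of cores, per-core performance, aggregate throughput).

    Assumes a trivially multi-threaded workload: zero coordination
    overhead and no resource contention between cores.
    """
    cores = int(die_area_mm2 // core_area_mm2)
    perf = core_performance(core_area_mm2)
    return cores, perf, cores * perf


if __name__ == "__main__":
    for area in (50, 100, 200):
        cores, perf, aggregate = die_summary(area)
        print(f"{area:3d}mm2  cores={cores}  core perf={perf:.2f}  aggregate={aggregate:.2f}")
```

Changing `gain_per_doubling` or the die budget shows how quickly the trade-off shifts: any value below 2.0 per doubling means the many-small-cores design wins on aggregate throughput.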
Now to actual processor characteristics. Using the 45nm Opteron core as the reference, the 45nm Nehalem core has nearly twice the integer performance (floating point ratio?). The six-core Istanbul die size is 346mm2, for 58mm2 per core including 1.5M cache (512K L2 and 1M L3). The 45nm quad-core Nehalem is 268mm2, or 67mm2 per core including 256K L2 and 2M L3. This is a serious problem for the Opteron core: with only a slightly smaller die size per core, its single core (integer) performance is too far below Nehalem (and Core 2 as well).
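As a rough sketch of the per-core figures, the following divides each die by its core count and compares integer performance per unit area. The die sizes and core counts are the ones quoted above; the 2.0 relative performance for Nehalem is my rounding of "nearly twice" and should be treated as an assumption, as should the function names.

```python
def area_per_core(die_area_mm2, cores):
    """Die area divided evenly across cores, including each core's cache share."""
    return die_area_mm2 / cores


def perf_per_mm2(relative_perf, die_area_mm2, cores):
    """Relative single core performance per mm2 of die area per core."""
    return relative_perf / area_per_core(die_area_mm2, cores)


if __name__ == "__main__":
    # Opteron (Istanbul) is the reference at 1.0; Nehalem at 2.0 is an
    # assumed stand-in for "nearly twice the integer performance".
    istanbul = perf_per_mm2(1.0, 346.0, 6)
    nehalem = perf_per_mm2(2.0, 268.0, 4)
    print(f"Istanbul: {area_per_core(346.0, 6):.1f}mm2 per core")
    print(f"Nehalem:  {area_per_core(268.0, 4):.1f}mm2 per core")
    print(f"Nehalem performance per mm2 relative to Istanbul: {nehalem / istanbul:.2f}x")
```

The point of the comparison is that Nehalem spends only about 16% more area per core, far less than the doubling in area that the 40% rule would suggest is needed for even a 1.4x gain.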