I will straighten things out later. But for now:
Westmere 6-core 248mm2 and Sandy Bridge 4-core 216mm2, both 32nm
Below are Westmere-EX and Sandy Bridge EP/EN.
Westmere-EX, 10 cores, 513mm2, and Sandy Bridge EP & EN, 8 cores, 435mm2
The Xeon E5, based on the EP and EN versions of Sandy Bridge, came out in 2012, about the same time as the Ivy Bridge desktop and mobile processors. The E5 EP supports both 2-way and 4-way systems. Supposedly there will be an EX processor coming out this year (based on Ivy Bridge?) to succeed Westmere-EX.
Below is an architectural diagram of Haswell from IDF 2012 Fall.
Below is an architectural diagram of Sandy Bridge from IDF 2010 Fall.
Below is an architectural diagram of Nehalem from IDF 2009 Spring.
Below is an architectural diagram of Core from EMEA Academic Forum 2007 Spring. Core is described as 4-wide execution even though it appears to have 5 ports off the schedulers.
Any assessment at this point is speculative without actual measurement details. The various press releases mention that all three processors expected in 2011 (AMD Bulldozer, Intel Westmere-EX at the high end and Intel Sandy Bridge up to the mid-point) have reached first silicon. It is unknown whether meaningful performance numbers have been generated with the pre-production samples, as some references are to estimated performance and some are to simulations. However, based on the pre-production statements that have been made, which have been reasonably reliable in the past, we can draw some expectations.
At this point in mid-2010, the three main processors are:
In 2-way (socket) systems, the Xeon 5680 3.33GHz has 25% better TPC-E (OLTP) performance than the Opteron 6176 2.3GHz, and the two are roughly comparable on TPC-H (DW). On 4-way systems, the Xeon 7560 2.26GHz has 51% better TPC-C and 38% better TPC-E than the Opteron 6176. If we attribute 30% of the difference between the Xeon 7500 and Opteron 6100 to Hyper-Threading, then the 4 x 12-core 2.3GHz Opteron would be close to the 4 x 8-core 2.26GHz Xeon.
The 4-way Opteron does have a much lower system cost than the Xeon 7500, for roughly comparable price-performance. The 4-way Opteron price-point advantage would be negated by the SQL Server Enterprise Edition per-processor license cost, but it is a better fit for Standard Edition and CAL licensing. The other factor to consider is that the Xeon 5680 has nearly 2X the single-core performance of the Opteron, and the Xeon 7560 is 35% better.
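The Hyper-Threading adjustment above can be sketched in a few lines. This is only one reading of the text: it assumes the 30% is a throughput multiplier contributed by Hyper-Threading, which is then divided out of the measured Xeon advantage. The function name `without_ht` and the `ht_gain` parameter are made up for illustration.

```python
def without_ht(xeon_advantage, ht_gain=0.30):
    """Xeon/Opteron throughput ratio after dividing out an ASSUMED HT gain.

    xeon_advantage: measured advantage as a fraction (0.51 means 51% better).
    ht_gain: assumed Hyper-Threading contribution (0.30 means a 1.30x multiplier).
    """
    return (1.0 + xeon_advantage) / (1.0 + ht_gain)


# Measured 4-way advantages of the Xeon 7560 over the Opteron 6176 (from the text).
tpcc_ratio = without_ht(0.51)
tpce_ratio = without_ht(0.38)

print(f"TPC-C ratio without HT: {tpcc_ratio:.2f}")
print(f"TPC-E ratio without HT: {tpce_ratio:.2f}")
```

Under that assumption the remaining gap is roughly 16% on TPC-C and 6% on TPC-E, which is consistent with calling the two 4-way systems "close".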
In the mid-2011 time frame, the expected processors would be:
The expectation is that Bulldozer, with a 50% throughput gain over Magny-Cours, will be better than the current six-core Westmere, but could be merely comparable to Sandy Bridge if Sandy Bridge has 8 cores with the same single-core performance as a Westmere core. Westmere and Sandy Bridge would continue to have nearly 2X the single-core performance of Bulldozer.
In 4-way systems, Westmere-EX will have 25% more cores, and should be able to improve on the 2.26GHz frequency of Nehalem-EX. (There was a 50% increase in the number of cores from Nehalem to Westmere, both at 3.33GHz.) This should allow Westmere-EX to maintain an advantage over Bulldozer. AMD will maintain the midway price point between the 2-way and 4-way Intel systems.
I am sure there will be heated (vehement?) arguments over my expectations. But please, I would like to hear quantitative analysis rather than emotional outbursts.
Consider some of the implications of Moore's Law. Suppose we start with a baseline processor with core size 50mm2. If the goal is a high-end workstation product, we might target a total silicon budget of 400mm2, in which case we could have 8 cores total. The aggregate throughput performance of the processor socket for a trivially multi-threaded application, that is one in which there is zero overhead for coordinating threads and no resource contention, is simply the number of cores times the baseline single core performance, 8 in this case.
|Core Size||# of cores||Core Performance||Aggregate Perf|
|50mm2||8||1.0||8.0|
|100mm2||4||1.4||5.6|
|200mm2||2||2.0||4.0|
Now suppose we desire greater single-core performance. The pattern established by Moore's Law at a given manufacturing process is that doubling the transistor budget (die area, to be more precise, as logic and cache memory transistors have very different densities) should yield a 40% performance gain for a single-threaded design. The second processor, at 100mm2, should have 40% better performance relative to the baseline. However, we can only fit 4 cores on a 400mm2 die, for an aggregate throughput performance of 5.6. The third example, at 200mm2, should have twice the baseline single-core performance (1.4 x 1.4 = 1.96), but only 2 will fit on the die, for an aggregate throughput performance of 4.
It is evident that throughput-oriented applications with very little multi-threading overhead or contention favor many basic cores, while applications that are not easily multi-threaded favor more powerful cores at the expense of throughput.
Now to actual processor characteristics. Using the 45nm Opteron core as the reference, the 45nm Nehalem core has nearly twice the integer performance (floating point ratio?). The six-core Istanbul die size is 346mm2, for 58mm2 per core including 1.5M cache (512K L2 and 1M L3). The 45nm quad-core Nehalem is 268mm2, or 67mm2 per core including 256K L2 and 2M L3. This is a serious problem for the Opteron core: with only a slightly smaller die size per core, its single-core (integer) performance is too far below Nehalem (and Core 2 as well).
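The core-size trade-off can be reproduced with a short calculation. This is a minimal sketch using only the figures from the text (400mm2 budget, 50mm2 baseline core, 40% gain per doubling of core area); the function name `core_tradeoff` and its parameters are hypothetical, and zero multi-threading overhead is assumed, as in the example.

```python
def core_tradeoff(die_budget_mm2=400.0, base_core_mm2=50.0,
                  gain_per_doubling=1.4, doublings=2):
    """Return (core_mm2, cores, core_perf, aggregate_perf) for each core size.

    Assumes perfect scaling across cores and that each doubling of core
    area multiplies single-core performance by gain_per_doubling.
    """
    rows = []
    for k in range(doublings + 1):
        core_mm2 = base_core_mm2 * 2 ** k
        cores = int(die_budget_mm2 // core_mm2)   # whole cores that fit
        core_perf = gain_per_doubling ** k        # relative to baseline = 1.0
        rows.append((core_mm2, cores, core_perf, cores * core_perf))
    return rows


rows = core_tradeoff()
for core_mm2, cores, core_perf, aggregate in rows:
    print(f"{core_mm2:6.0f}mm2  {cores} cores  "
          f"core perf {core_perf:.2f}  aggregate {aggregate:.2f}")
```

Two doublings give 1.4 x 1.4 = 1.96, so the calculation shows 3.92 for the largest core where the text rounds to 2X and 4.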
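The per-core area figures follow directly from the die sizes. The sketch below repeats that arithmetic and adds a rough performance-per-area comparison. The 2.0 ratio is an assumption that stands in for "nearly twice" as an upper bound, and the names `area_per_core` and `NEHALEM_INT_PERF_RATIO` are illustrative only.

```python
def area_per_core(die_mm2, cores):
    """Die area divided evenly across cores (includes cache and uncore)."""
    return die_mm2 / cores


istanbul_mm2 = area_per_core(346, 6)   # six-core Istanbul
nehalem_mm2 = area_per_core(268, 4)    # quad-core Nehalem

# ASSUMPTION: "nearly twice the integer performance" taken as 2.0 (upper bound).
NEHALEM_INT_PERF_RATIO = 2.0

area_ratio = nehalem_mm2 / istanbul_mm2
perf_per_area_ratio = NEHALEM_INT_PERF_RATIO / area_ratio

print(f"Istanbul: {istanbul_mm2:.1f}mm2 per core")
print(f"Nehalem:  {nehalem_mm2:.1f}mm2 per core")
print(f"Nehalem core is {area_ratio:.2f}x the area")
print(f"Nehalem integer perf per mm2 is about {perf_per_area_ratio:.2f}x")
```

With these inputs the Nehalem core is about 16% larger, so on the assumed 2X figure it would deliver roughly 1.7 times the integer performance per unit of die area.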