In many previous articles, I have discussed processor and system architecture, server systems, and performance benchmarks, often mixing more than one topic. Going forward, I will consolidate each topic into separate collections:
Performance Benchmarks
Server Systems
Systems Architecture
Processors (Architecture)
Benchmarks is now split into subsections:
SQL Server Benchmark Results TPC-C
SQL Server Benchmark Results TPC-E
SQL Server Benchmark Results TPC-H
Benchmark Results SPEC CPU 2006 Integer
SQL Server 2008 R2 Data Warehouse Performance Evaluation (2010-07)
Solid State Drive versus Memory, TPC-H Nehalem (Updated 2009-11)
Benchmark Omissions (2010-04)
SQL Server Benchmark Results November 2009
Below is a summary of the (not always up to date) best available TPC benchmark result for Intel Xeon and AMD Opteron server systems.
| Processor (architecture, process) | Benchmark | 2-way | 4-way | 8-way | 16-way |
|---|---|---|---|---|---|
| Core2 65nm, Xeon 5300 QC / 7300 QC | TPC-C | 251,300 | 407,079 | 841,809 | - |
| | TPC-E | 5160 only | 479.51 | 804.0 | 1,250.0 |
| | TPC-H | 17,686@100 | 34,990@100 | 46,034@300 | - |
| Barcelona 65nm QC | TPC-C | - | 471,883 | - | - |
| | TPC-E | - | - | - | - |
| | TPC-H | - | - | 52,860@300 | - |
| Core2 45nm, Xeon 5400 QC / 7400 6C | TPC-C | 275,149 | 634,825 | Linux DB2 | - |
| | TPC-E | 317.45 | 729.65 | 1,165.56 | 2,012.8 (R2) |
| | TPC-H | - | - | - | 102,778@3T |
| Shanghai 45nm QC | TPC-C | - | 579,814 | - | - |
| | TPC-E | - | 635.4 | - | - |
| | TPC-H | - | - | 57,685@300G | - |
| Istanbul 45nm 6C | TPC-C | - | - | - | - |
| | TPC-E | - | - | - | - |
| | TPC-H | - | - | 91,558@300G* | - |
| Nehalem 45nm, Xeon 5500 QC / 7500 8C | TPC-C | 661,475† | 1,807,347 | - | - |
| | TPC-E | 850.0 | 2,022.64 | 3,141.76 | - |
| | TPC-H | 51,086@100G | - | 162,601@3TB | - |
| Westmere 32nm, Xeon 5600 6C / 7600 10C‡ | TPC-C | 803,068 | future | future | future |
| | TPC-E | 1,110 | future | future | future |
| | TPC-H | 73,974.6@100G | future | future | future |
| Magny-Cours 45nm 12C | TPC-C | 705,652 | 1,193,472 | n/a | n/a |
| | TPC-E | 887.4 | 1,464 | n/a | n/a |
| | TPC-H | 71,438.3@100G | 107,561@300G | n/a | n/a |
* and 81,514QphH@1000GB
† Xeon W5580 3.2GHz, instead of X5570 2.93GHz (Linux/Oracle too)
‡ The Hot Chips 22 conference lists a paper: Westmere-EX: A 20 Thread Server CPU.
IBM Power 780 with 2 quad-core POWER7 4.14GHz TPC-C: 1,200,011 (4 logical proc/core)
HP Integrity Superdome 16 quad-core Itanium 9350 TPC-H 140,181.1 QphH@1000GB
Notice that at 2-way, the Xeon 5680 is higher than the Opteron 6176 by 14% on TPC-C and 25% on TPC-E, but only by 3.5% on TPC-H. This is expected because Hyper-Threading provides a large boost on high call-volume workloads (low average cost per call), and a smaller boost in large (DW) queries.
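The percentages can be re-derived from the 2-way entries in the summary table. The following is only a quick arithmetic sketch; the dictionary keys and the `advantage_pct` helper are names made up for illustration.

```python
# 2-way results taken from the summary table above.
xeon_5680 = {"TPC-C": 803_068, "TPC-E": 1_110.0, "TPC-H@100GB": 73_974.6}
opteron_6176 = {"TPC-C": 705_652, "TPC-E": 887.4, "TPC-H@100GB": 71_438.3}


def advantage_pct(a, b):
    """Percentage by which score a exceeds score b."""
    return (a / b - 1.0) * 100.0


# Xeon advantage over Opteron for each benchmark.
advantages = {
    name: advantage_pct(xeon_5680[name], opteron_6176[name])
    for name in xeon_5680
}

for name, pct in advantages.items():
    print(f"{name}: Xeon 5680 ahead by {pct:.1f}%")
```

The two OLTP benchmarks show a double-digit gap while the TPC-H gap is only a few percent, which is the pattern the Hyper-Threading argument predicts.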
HP has just announced the ProLiant DL580 G7 and DL980 G7 servers based on the Xeon 7500 series processors, and the DL585 G7 4-way server with the 12-core AMD Opteron 6100 series (Magny-Cours). Apparently the reason for the delay is that the 8-way DL980 G7 employs custom silicon node controllers (XNC), and possibly also so that HP could make a splash by announcing all three systems at its big annual conference, the HP Technology Forum. The DL580 G7 and DL585 G7 are available now(?), and the DL980 G7 should be available later in Q3.
While the Intel Xeon 7500 processor allows a glueless 8-way system, HP felt that the design could be improved with node controllers. The node controllers reduce snoop traffic for a majority of memory accesses, and can achieve a 30% reduction in memory latency in some circumstances. It should be considered that HP needed to build a custom silicon crossbar (and node controllers) for their Superdome 2 system and the Itanium 9300 series processors, which use the same QuickPath Interconnect (QPI) as the Nehalem processors. There are differences in the way the Itanium and Xeon processors use QPI. There are also differences between the node controllers for the Itanium and Xeon systems. (The Itanium node controller implements directory-based cache coherency and the Xeon node controller is a snoop filter.)
HP might have built a glueless 8-way Xeon 7500 system if they had not already invested the effort to build the XNC for their Itanium systems. This also means that HP should have the components to build a 16-way Xeon 7500 system, meaning that if there were market demand, such a system could be brought to market. Intel did say that there were 16-way Xeon 7500 system designs, but none have surfaced yet.
Dell has also released a 2-way TPC-E result for the Xeon 5600, and Fujitsu released a 4-way TPC-E result for the Xeon 7500.
System Pricing
Below is the bare system pricing based on HP's published TPC reports in 2010. Base system pricing only includes the system chassis and processors. There might be differences in memory pricing because of differences in memory type.
| Configuration | Vendor | System | Processor | Price |
|---|---|---|---|---|
| 4-way | Intel | DL580G7 | 4xXeon 7560 | $24,597 |
| 4-way | AMD | DL585G7 | 4x6176 | $11,235 |
| 2-way | Intel | DL380G7 | 2xXeon 5680 | $6,189 |
| 2-way | AMD | DL385G7 | 2x6176 | $5,109 |
I do not consider the TPC overall pricing and price-performance metrics to be particularly helpful in comparative assessment. The software licensing may be much higher than the server costs when per-processor licensing applies, but is inconsequential for CAL mode in DW environments. Also, storage can be a very large part of the overall system cost. Some vendors can use SSD or direct-attach storage for favorable pricing, while other vendors employ more expensive SAN storage. In any case, the customer always picks their own storage, so the storage pricing in the TPC reports is of no consequence to the actual customer. That being said, there is no meaningful difference in the TPC-E Price/Performance between the AMD and Intel based platforms at either 2-way or 4-way.
The AMD objective for the near-term (until Bulldozer?) is probably not to compete with Intel on a per-socket basis but on overall system price. The price difference between the ML/DL 380 and 385 is nominal at best, and so I would definitely prefer the 380 with the Xeon 5680, because any time I have a big single threaded operation, I want the fastest core, not aggregate core-GHz.
At the 4-way level, between the DL580 and DL585, there is a better story. If we add 256GB memory at $16K (the HP report says $32K for the DL585G7, but I think that is too high), then we are looking at $40.5K vs. $27K, which is 1.5 to 1. This roughly corresponds to the difference in performance. So it all comes down to the cost of the other components: storage and software licensing.
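As a rough check of the 1.5 to 1 figure, the base prices from the pricing table can be combined with the $16K memory estimate and set against the 4-way scores in the summary table. This is only a sketch: the variable names are illustrative, and the best published 4-way scores are not necessarily from these exact HP systems.

```python
MEMORY_256GB = 16_000  # the estimate used in the text, in dollars

# Base system prices (chassis and processors only) from the pricing table.
base_price = {"intel": 24_597, "amd": 11_235}
total_price = {vendor: price + MEMORY_256GB for vendor, price in base_price.items()}

# Best 4-way results from the summary table (Xeon 7500 vs. Magny-Cours).
tpc_c = {"intel": 1_807_347, "amd": 1_193_472}
tpc_e = {"intel": 2_022.64, "amd": 1_464.0}

price_ratio = total_price["intel"] / total_price["amd"]
tpc_c_ratio = tpc_c["intel"] / tpc_c["amd"]
tpc_e_ratio = tpc_e["intel"] / tpc_e["amd"]

print(f"price ratio : {price_ratio:.2f}")
print(f"TPC-C ratio : {tpc_c_ratio:.2f}")
print(f"TPC-E ratio : {tpc_e_ratio:.2f}")
```

The price ratio lands between the TPC-E and TPC-C performance ratios, which is consistent with the statement that price and performance differ by roughly the same factor.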
I am inclined to recommend the 2-way Xeon 5600 series for medium tasks, the 4-way Xeon 7500 series for big tasks, and the 8-way Xeon 7500 ($79K base) for really big tasks. Yes, the 4-way 12-core Opteron does fit in the gap between 2 x 5600 and 4 x 7500. But I am not sure this is a viable gap, as I would recommend the 2 x 5600 when the choice is between a middle and big system, in part because processors today are so powerful and because the Xeon 5600 is the fastest for single-threaded tasks.
ps - High-Availability requirements might steer towards the Xeon 7500 because of the MCA features etc.
Intel launched the Xeon 5600 series (Westmere-EP, 32nm) six-core processors on 16 March 2010 without any TPC benchmark results. In the performance world, no results almost always means bad or not good results. Yet there is every reason to believe that the Xeon 5600 series with six cores (X models only) will perform exactly as expected for a 50% increase in the number of cores at the same frequency (as the 5500) with no system level bottlenecks. The expectation is that a six-core Xeon 5600 should provide 30%+ improvement over the comparable quad-core Xeon 5500 in throughput oriented tests, particularly OLTP type workloads. Single stream parallel execution plans will probably show less gain, as scaling via parallelism is not a simple matter.
Then two weeks later on 30 March 2010, Intel launched the Xeon 7500 series 8-core processors for 4-way+ systems (and the Xeon 6500 for high-end 2-way systems) with TPC-E results on 4-way and 8-way systems but no TPC-H results. The TPC-E results were exactly what Intel said they were going to be last September at IDF, 2.5X over the previous generation Xeon 7400 series and 2.5X over the contemporary 2-way Xeon 5500 series.
My guess is that Intel wanted it to be clear that the 4-way Xeon 7500 achieved the stated performance objectives of 2.5X over the 2-way Xeon 5500, just in case some slide-decks did not mention which 2-way system the 2.5X claim referred to. Of course, the Intel statement of 2.5X for Xeon 7500 was most probably based on performance measurements already run on prototype systems. It was probably also felt that the Xeon 5600 series is such a natural choice to supersede the 5500 series that TPC benchmarks were not essential, as there were sufficient other benchmarks to support the claims.
In brief, the Intel Core 2 architecture processors were avoiding comparisons against AMD Opteron in TPC-H, except for the 16-way Unisys system, for which there is no comparable Opteron system.
Opteron, on the other hand, avoided comparison with the Core2 architecture in 2-way systems and in TPC-C/E OLTP benchmarks across the board. In the 2-way systems, the old Intel FSB technology was still adequate, and the powerful Core2 architecture core was enough to beat a 2-way Opteron. There were respectable 4-way TPC-C and TPC-E results for Shanghai. When AMD announced the HT-Assist feature in Istanbul, one might have thought AMD was finally going to be able to compete in 4-way OLTP. But there have been zero benchmarks published to date.
When the 2-way Intel Xeon 5500 processor, based on the Nehalem architecture, came out in early 2009, outstanding results were published for both the OLTP oriented TPC-E and the DW/DSS oriented TPC-H. In February 2010, a TPC-C result was published as well, even though Microsoft had previously said all new OLTP benchmarks were going to be TPC-E. This result was with SQL Server 2005 instead of 2008.
There was every expectation with the Xeon 7500 Nehalem-EX, that there would be both OLTP and DW/DSS benchmark results, as Xeon 7500 should produce world-class (and world-record) results in both. It is possible that performance problems were encountered in trying to achieve good scaling over 32-cores and 64-threads in a 4-way Xeon 7500 system. If this is identified as something that can be fixed in the Windows operating system or SQL Server engine, then a change request would be made. I seriously doubt that another processor stepping would be done for this, as Xeon 7500 is already D-step at release.
It is also quite possible Intel will have to face the fact that 2.5X the 2-way Xeon 5500 TPC-H SF100 result of 51,000 QphH is not going to be achieved no matter how good Xeon 7500 is at DW. This is because the TPC-H score is based on a geometric mean over the 22 queries. There are several small queries in TPC-H, two of which already run in under 1 second on the 2-way 8-core Xeon 5570 for SF100, and several that run near or under 2 seconds. There is limited opportunity to continue to improve the performance of small queries with increasing degree of parallelism, as the overhead to set up each thread becomes larger compared to the actual work done by each thread, especially if one also has to give up frequency, dropping from 2.93 to 2.26GHz. It would be helpful to know what the actual frequency is during a performance run with the turbo-boost feature.
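The effect of the geometric mean can be illustrated with a toy model. The query times below are hypothetical, not measured TPC-H timings, and the model ignores the refresh functions and the throughput component of the real metric. It only shows that if the small queries stop improving, a 2.5X gain on the big queries yields well under 2.5X on a geometric-mean score.

```python
import math


def geometric_mean(values):
    """Geometric mean computed via logs to avoid overflow on long products."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speed_up(times, factor, floor):
    """Divide by `factor` only the query times longer than `floor` seconds."""
    return [t / factor if t > floor else t for t in times]


# Hypothetical elapsed times in seconds for 22 queries:
# 6 small queries at or under 2 seconds, 16 larger ones.
baseline = [0.8, 0.9, 1.5, 1.8, 2.0, 2.0] + [10.0] * 8 + [30.0] * 8

# Assume only the 16 larger queries run 2.5X faster on the bigger system.
faster = speed_up(baseline, factor=2.5, floor=2.0)

# A score inversely proportional to the geometric mean of the times
# improves by the ratio of the two geometric means.
gain = geometric_mean(baseline) / geometric_mean(faster)
print(f"overall gain on geometric mean: {gain:.2f}X")
```

With 16 of 22 queries improving by 2.5X, the overall gain is 2.5 raised to the power 16/22, which is below 2X.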
It is possible that some marketing putz does not understand this and denied permission to publish perfectly good Xeon 7500 TPC-H results because they did not meet the 2.5X goal. (Along with making a negative ranking and review entry for the person responsible for TPC-H benchmarking due to failing to achieve the 2.5X goal. But let's not grind axes here. Besides, who said life was fair? It takes exceptional talent to accomplish the impossible. A clever person anticipates impossible problems, and transfers to another group to avoid a sticky wicket.)
Achieving 2.5X in the big queries is a more meaningful goal. Achieving 50% better than the 8-way Opteron 6-core TPC-H SF300 or SF1TB would also be a worthwhile accomplishment, if Xeon 7500 were up to the task.
Finally, a quick comment on Xeon 7500 scaling from 4-way (32 cores, 64 threads) to 8-way (64 cores, 128 threads). In the past, achieving 1.5X scaling with this number of cores would have been a triumph. Given the announcement Microsoft made on Windows Server 2008 R2, on removing thread scheduler bottlenecks and other impediments to high-end scaling, we were expecting 1.7X scaling. It could be that scaling beyond 64 threads is tricky, because of the limit of 64 logical processors per processor group. Hopefully the 4-way to 8-way to 16-way scaling will improve over time as problems are solved one at a time, while the task master whips his/her draft horses (again, I digress).
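For reference, the 4-way to 8-way scaling implied by the published TPC-E results in the summary table can be computed directly. This is a small sketch; the variable names are illustrative.

```python
# Xeon 7500 TPC-E results (tpsE) from the summary table.
tpce_4way = 2_022.64
tpce_8way = 3_141.76

scaling = tpce_8way / tpce_4way
print(f"4-way to 8-way TPC-E scaling: {scaling:.2f}X")

# Compare against the two reference points mentioned in the text.
for label, target in (("historical triumph", 1.5), ("hoped-for", 1.7)):
    status = "met" if scaling >= target else "not met"
    print(f"{label} ({target}X): {status}")
```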
Earlier, I had commented about benchmark omissions from the quad-core generation on. Below is a summary of processors and systems for which TPC results are published. (The Intel Xeon 7500 Processor Product Brief shows 3.03X relative to 7400 for OLTP Brokerage Database, which is TPC-E, but 2022 over 729 is 2.77X.)
This section is from 2010-04
To date, no 4-way or 8-way TPC-H data warehouse benchmark result has been published for the six-core Xeon X7460 and no TPC-C or TPC-E OLTP benchmark result has been published for six-core Opteron. Usually, the absence of published results means the results are not competitive, in one manner or another.
TPC-C, E and H results were published for the previous generation quad-core Intel Xeon X7350. TPC-C and TPC-E results were published for the follow-on six-core Xeon X7460, but there are no 4-way or 8-way TPC-H results. Unisys did publish a 10TB TPC-H result for their 16-way ES7000 with the Xeon 7460, but there is no simple way to compare this with 4-way or 8-way results at 100 or 300GB scale factors.
There are very impressive 4-way quad-core Opteron 8384 2.7GHz TPC-C and TPC-E results of 579,814 tpm-C and 635.43 tpsE respectively, but not for the six-core Opteron. For TPC-H, there is a series of 8-way results at scale factor 300GB for the quad-core and six-core Opteron processors, though curiously no 4-way results for the Opteron after dual-core 8220.
One suspected reason for the lack of 4-way or 8-way TPC-H results on the Intel Xeon X7460 is that it cannot achieve meaningful performance gains over the quad-core Xeon X7350. The large 16M L3 cache on the X7460 helps on high-call volume (>10,000 RPC/sec) benchmarks like TPC-C and TPC-E, but not in the high-row count TPC-H queries with parallel execution plans, where the lower 2.66GHz frequency is also a liability.
In the dual-core era, 4-way Intel Xeon (with the Pentium 4 based NetBurst core) and AMD Opteron systems were very close on TPC-H. It might be that the 4-way quad-core Opteron processors were not competitive with the Xeon 7300 series (Core2 architecture cores) so no TPC-H results were published. The quad-core Opteron was very competitive, significantly better even, than the Xeon 7350 without a large shared cache, in TPC-C and TPC-E. The Opteron architecture does not need as large a cache as the Xeon, but does benefit from the large 6M L3 cache in Shanghai compared with the 2M L3 cache in Barcelona.
TPC-H results were published for quad-core and six-core Opteron in the HP ProLiant DL785 8-way systems at scale factor 300GB. Previously, IBM had published an 8-way SF 300 TPC-H result for the Xeon X7350. The first 8-way Opteron quad-core had a better result, so it is possible that the Xeon bus architecture could not scale to 8-way for DW type workloads.
The 8-way six-core Opteron 8439 2.8GHz shows a very significant TPC-H performance gain over the quad-core Opteron 8384 2.7GHz (91,558.2 QphH@300GB versus 57,684.7). This is more than would be suggested by the 50% increase in the number of cores and the nominal frequency increase, given that scaling is normally less than linear. Presumably, this should be attributed to micro-architecture improvements between Shanghai and Istanbul.
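The comparison can be made concrete by setting the observed gain against a naive upper bound that assumes perfectly linear scaling with both core count and frequency. This is a sketch using only the figures quoted above; the variable names are illustrative.

```python
# 8-way TPC-H results at 300GB, as quoted in the text.
qphh_six_core = 91_558.2   # six-core Opteron, 2.8GHz
qphh_quad_core = 57_684.7  # quad-core Opteron 8384, 2.7GHz

observed_gain = qphh_six_core / qphh_quad_core

# Naive bound: performance scales linearly with cores and with frequency.
core_ratio = 6 / 4
freq_ratio = 2.8 / 2.7
linear_bound = core_ratio * freq_ratio

print(f"observed gain : {observed_gain:.3f}X")
print(f"linear bound  : {linear_bound:.3f}X")
print(f"beyond linear : {(observed_gain / linear_bound - 1) * 100:.1f}%")
```

The observed gain exceeds even the idealized linear bound, which is why something beyond cores and frequency has to be responsible.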
The most significant improvement cited by AMD is HT-assist, which is essentially a snoop filter for maintaining cache coherency in the Hyper-Transport architecture. Now ever since the Opteron with integrated memory controller and HT introduction, AMD crowed about how memory and inter-processor bandwidth scaled with the number of processors, unlike the Intel architecture, where memory and inter-processor bandwidth was bottlenecked by the front-side bus.
Well, AMD neglected to mention that their scalable bandwidth was also offset by increased inter-processor communication to maintain cache-coherency (see the article by Johan De Gelas on Anandtech http://it.anandtech.com/IT/showdoc.aspx?i=3571). So now that AMD has the snoop filter capability and a very good TPC-H result for the Opteron with HT-assist, why are there no published TPC-C or TPC-E OLTP benchmark results?
Note that Intel had difficulty with the Snoop Filter in the 5000P/X chipset. The snoop filter improved some benchmarks, and caused degradation in others. So it would be no surprise if it takes one or two generations to work out the issues. The expectation is that AMD will need to work out these issues if Magny-Cours is expected to compete with Nehalem-EX systems.
For Intel, the lack of competitive Xeon 7400 series DW benchmark results will be a moot point once the next-generation Nehalem-EX systems become available.
Anyways, these are my suspicions. System vendors are welcome to refute any of my opinions by publishing results. I was surprised by the 2-way Xeon 5500 Nehalem TPC-H results.
This section is older
From the information available, Nehalem-EX will become the Xeon 7500 series, superseding the Xeon 7400 series. Previously, Intel did not target 7000 series processors at 2-way systems. Now that AMD has the six-core Istanbul Opteron processor in 2-way systems, it is expected that system vendors will have 2-way systems based on Nehalem-EX. This might become a Xeon 6500 series. The Xeon 5500 series processors have 3 memory channels and 2 QPI links in a 1366-pin package. Nehalem-EX has 4 memory channels and 4 QPI links, so Xeon 5500 and 6500 would not be interchangeable.
older material
As processor performance cannot be characterized by a single metric, it is necessary to look at several benchmarks. The most useful for single core performance is SPEC CPU 2006 integer base. SPEC CPU integer results are based on 12 individual applications. Another important aspect is that the very latest compilers are used to generate SPEC CPU results. The compilers used for common applications like SQL Server do not incorporate the very latest enhancements. The more recent versions of SPEC CPU are now actually multi-threaded (?).
An alternative for single core performance metrics is to simply run certain test queries in SQL Server with all data in memory. Example tests are: 1) rows per second for a non-clustered index seek with bookmark lookup, or loop join, 2) table scan pages per sec, and 3) network round trips per second.
At the system level, the most widely recognized benchmarks are TPC-C, E, and H. TPC-C is a transaction processing benchmark that has a long history. Results are available from 1995 to present. TPC-C has the most extensive set of results, allowing a reasonable ability to compare platforms across multiple generations. TPC-E is a relatively new transaction processing benchmark that is supposed to replace TPC-C. Results are available from 2007 to present. Only SQL Server results have been published. TPC-H is a data warehouse/DSS benchmark with published results from 2004 to present. While TPC-H performance characteristics are important, unfortunately there are not anywhere near enough published results for a comprehensive comparison of system and processor architectures. Intel and the top system vendors most probably have a complete set of TPC-H performance results, electing only to publish when it suits their purpose.
While TPC-C is described as a transaction processing benchmark, it has no relation to any actual transaction processing application. The more important aspect of the TPC-C benchmark is that it is moderately network round-trip intensive and very disk IO intensive. Consider the recent TPC-C for the 4-way Xeon 7460 (24 cores) of 634,825 transactions per minute (tpm-C). This is 10,580 transactions per sec. On average, each TPC-C transaction requires 2.25 calls from the application server to the database server. So this system generates 23,806 network round-trips per second, or 991 per core per sec. This in turn implies that the average cost per call is just over 1 CPU-milli-sec.
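The chain of arithmetic above can be written out step by step. This is a minimal sketch that assumes, as the text implicitly does, that all 24 cores are fully busy; the variable names are illustrative.

```python
# Inputs quoted in the text for the 4-way Xeon 7460 TPC-C result.
tpm_c = 634_825                 # transactions per minute
calls_per_transaction = 2.25    # average application-to-database calls
cores = 24                      # 4 sockets x 6 cores

transactions_per_sec = tpm_c / 60
calls_per_sec = transactions_per_sec * calls_per_transaction
calls_per_core_per_sec = calls_per_sec / cores

# Assumption: every core is 100% utilized, so each call consumes
# 1 / calls_per_core_per_sec seconds of CPU time on average.
cpu_ms_per_call = 1_000 / calls_per_core_per_sec

print(f"transactions/sec      : {transactions_per_sec:,.0f}")
print(f"network calls/sec     : {calls_per_sec:,.0f}")
print(f"calls/sec per core    : {calls_per_core_per_sec:,.0f}")
print(f"CPU-ms per call (avg) : {cpu_ms_per_call:.3f}")
```

If actual CPU utilization during the run was below 100%, the true average cost per call would be proportionally lower than this estimate.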
The other important metric to draw is the number of disk drives used for the data files. TPC-C has both performance and price-performance metrics. The system should be optimized for full utilization. A reasonable assumption then is that the disk load is approximately 175-200 IOPS per disk (queue depth 1 operation). Certain TPC-C publications may target best price-performance, in which case the disks will be driven to much higher queue depth, and higher IOPS per disk. The result cited above was configured with 1000 disk drives, meaning the disk load is very likely to be in the 175-200K IOPS range.
The reason network round-trips are very important is that scaling SQL Server performance has a very different set of issues for driving high network round-trip volume than for executing SQL statements. The TPC-C workload is below but near the boundary where the network call volume becomes more important than SQL execution. SAP has characteristics that are almost purely a matter of driving network round-trips. The other aspect of TPC-C is that disk IO is far higher than in almost any actual transaction processing application, and might contribute as much as 20-30% of the overall load.
Some other aspects of TPC-C are as follows. TPC-C benefits from Hyper-Threading, a feature of Pentium 4, Nehalem and Itanium 2 processors (starting from the dual core 9100 series). In the Pentium 4 architecture, HT (not AMD Hyper-Transport) improves network call volume throughput, but had no consistent effect in SQL operations. HT may cause erratic SQL performance.
A large processor cache improves the fixed startup cost of SQL operations, but does not change the incremental cost per row. Large cache benefits TPC-C and TPC-E, but does not benefit TPC-H.
TPC-C uses only 5 stored procedures, meaning there is no cost expended for SQL compiles, which can be a sizeable portion of the load in a less than perfectly designed transaction database.
TPC-E results are reported in transactions per second (tpsE). There are 10 transaction types that make up the main part of TPC-E. Each type may be made up of more than one frame (or stored procedure). The transaction that is scored is the Trade-Result, which makes up 10% of the transactions. There are about 25 frames total in the 10 transaction types. On average, there are 22.3 stored procedure calls for each scored tpsE. Consider the 4-way X7460 result of 721.4 tpsE. This implies that there are approximately 16,087 RPC calls per second, or 670 per second for each of the 24 cores. So TPC-E performance characteristics are very similar to those of TPC-C.
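To put the two OLTP benchmarks side by side, the per-core call rates can be derived from the figures quoted in the text for the 24-core 4-way X7460. This is only a sketch; the `per_core_call_rate` helper is a name invented for illustration, and the CPU cost per call again assumes fully utilized cores.

```python
CORES = 24  # 4-way, six cores per socket


def per_core_call_rate(transactions_per_sec, calls_per_transaction, cores=CORES):
    """Stored procedure (RPC) calls per second handled by each core."""
    return transactions_per_sec * calls_per_transaction / cores


# TPC-E: 721.4 tpsE, about 22.3 stored procedure calls per scored transaction.
tpce_rate = per_core_call_rate(721.4, 22.3)

# TPC-C: 634,825 tpm-C, about 2.25 calls per transaction.
tpcc_rate = per_core_call_rate(634_825 / 60, 2.25)

for name, rate in (("TPC-E", tpce_rate), ("TPC-C", tpcc_rate)):
    # Assumes 100% CPU utilization when converting rate to cost per call.
    print(f"{name}: {rate:,.0f} calls/sec/core, ~{1_000 / rate:.2f} CPU-ms per call")
```

Both workloads come out at roughly 1 to 1.5 CPU-ms per call, which is the sense in which their characteristics are similar.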
The main stages of the competition between Intel Xeon and AMD Opteron are: 130nm single core, 90nm single core, 90nm dual-core, Intel 65nm dual core versus AMD 90nm dual core, Intel 65nm quad-core versus Opteron 90nm dual core and Intel 45nm six core versus Opteron 45nm quad-core.
At the time of the Opteron 4-way system introduction in 2003, it was competitive with Xeon in TPC-C. If we account for Xeon benefiting from Hyper-Threading in TPC-C by about 7%, when most environments have HT disabled for stability purposes, the Opteron had a moderate advantage. As Opteron frequencies started to climb up, AMD gained a more substantial advantage. The 90nm 3GHz Xeon MP with 8M L3 was not particularly impressive as it suffered from a poorly designed bus to the on-die L3 cache. It was found that the desktop Pentium 4 core at 3.66GHz with just 1M L2 cache in a Xeon package performed much better than the custom design server core with the large L3 cache.
The dual-core Opteron appeared in 2005 initially at 2.2GHz, against which the dual core Xeon 7041 at 3.0 GHz was competitive. Over time, Opteron dual core processor frequency increased incrementally, but Intel Xeon did not. The dual core Opteron result for 2.8GHz was actually a new improved design with DDR2 memory at 667MHz, compared with DDR at up to 333MHz before. The disproportionate performance increase from 214K to 263K (23%) for a 7.7% frequency increase might suggest that there were additional improvements beyond the increased memory bandwidth.
The Intel dual core Xeon 7140 appeared in October 2006 featuring 2 Pentium 4 architecture cores plus a large 16M L3 cache on one die at 3.4GHz with an excellent TPC-C result of 318K versus 263K for the contemporary dual core Opteron at 2.8GHz. As mentioned before, TPC-C benefits from large cache and Hyper-Threading. So while the Xeon TPC-C was better, the Opteron did very well on many ad-hoc tests. The dual-core Opteron 8220 scored better at TPC-H than the Xeon 7140.
The Intel Core 2 architecture appeared in mid-2006 for two-way systems (Xeon 5100 series) and below. The 65nm Core 2 consists of two cores with a shared 4M L2 cache. This was soon followed by the quad core Xeon 5300 series consisting of two dual core die in a single package. The two-way Xeon 5355 was almost as powerful at the system level as 4-way dual core systems. The Core 2 was significantly faster at the single core level than Opteron or Pentium 4 architecture processors.
The Core 2 architecture for 4-way systems, Xeon 7300 series (quad-core) appeared in late 2007. Only with this did Intel finally regain clear advantage over Opteron in 4-way servers in both high call volume (TPC-C and TPC-E) and large queries (TPC-H) applications. AMD did not have a quad-core until 2008.
At the single core level, the Intel Core 2 architecture is considered to be more powerful than a contemporary generation AMD Opteron. The Xeon X5470 at 3.33GHz SPEC CPU 2006 Integer base result is 26.3, compared to 16.9 for the Opteron 8384 released in late 2008 at 2.5GHz. The six-core Xeon X7460, with top frequency limited to 2.66GHz on account of the power consumption of the two extra cores and L3 cache, came in at 21.7. A recently announced Opteron 8393 at 3.1GHz is reported at 19.7, which is not far off the Xeon X7460.
AMD was late with quad-core, codename Barcelona, on the 65nm process, with a 2.3GHz TPC-C result of 402K published in March 2008, and a 2.5GHz result of 471K in July 2008. For less than a 9% delta in frequency, there was a 17% increase, which is one indication of the bugs reported in the early version of Barcelona. The 65nm Opteron 8360 quad core did score higher than the 65nm Xeon 7350 at 4-way. At 45nm, the Intel Core 2 architecture (X7460) scored higher than the Opteron (8384) on the strength of both the 16M L3 cache and the 2 extra cores per socket, even at a lower frequency than the 65nm part (2.66GHz versus 2.93GHz).
TPC-H is a data warehouse benchmark that stresses the ability to run very large queries. Unfortunately there is not a sufficiently large set of comparable TPC-H results. The following is known. TPC-H does not benefit from large processor cache at all.
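The frequency versus throughput comparison for the two Barcelona results can be checked with a few lines. This sketch uses the rounded figures quoted in the text; the variable names are illustrative.

```python
# 4-way Barcelona TPC-C results as quoted in the text (tpm-C, rounded).
early = {"ghz": 2.3, "tpm_c": 402_000}   # published March 2008
later = {"ghz": 2.5, "tpm_c": 471_000}   # published July 2008

freq_gain_pct = (later["ghz"] / early["ghz"] - 1) * 100
perf_gain_pct = (later["tpm_c"] / early["tpm_c"] - 1) * 100

print(f"frequency increase  : {freq_gain_pct:.1f}%")
print(f"throughput increase : {perf_gain_pct:.1f}%")

# Throughput normally scales at best linearly with frequency, so a gain
# about twice the frequency delta points to something other than clock speed.
print(f"gain per 1% of frequency: {perf_gain_pct / freq_gain_pct:.2f}")
```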
While the Core 2 architecture has significantly better performance than Opteron at the single core level, the expectation is that Opteron scales better to a large number of sockets. This means the performance with 2, 4, 8 or more sockets relative to the performance at a single socket.