
Intel Xeon 5600 (Westmere-EP) and 7500 (Nehalem-EX) Performance

Intel launched the Xeon 5600 series (Westmere-EP, 32nm) six-core processors on 16 March 2010 without any TPC benchmark results. In the performance world, no results almost always means bad or not-so-good results. Yet there is every reason to believe that the six-core Xeon 5600 series (X models only) will perform exactly as expected for a 50% increase in the number of cores at the same frequency as the 5500, with no system-level bottlenecks. The expectation is that a six-core Xeon 5600 should provide a 30%+ improvement over the comparable quad-core Xeon 5500 in throughput-oriented tests, particularly OLTP-type workloads. Single-stream parallel execution plans will probably show less gain, as scaling via parallelism is not a simple matter.

Then two weeks later, on 30 March 2010, Intel launched the Xeon 7500 series 8-core processors for 4-way+ systems (and the Xeon 6500 for high-end 2-way systems) with TPC-E results on 4-way and 8-way systems but no TPC-H results. The TPC-E results were exactly what Intel said they were going to be last September at IDF: 2.5X over the previous generation Xeon 7400 series and 2.5X over the contemporary 2-way Xeon 5500 series.

My guess is that Intel wanted it to be clear that the 4-way Xeon 7500 achieved the stated performance objective of 2.5X over the 2-way Xeon 5500, just in case some slide decks did not mention which 2-way system the 2.5X claim referred to. Of course, the Intel statement of 2.5X for Xeon 7500 was most probably based on performance measurements already run on prototype systems. It was probably also felt that the Xeon 5600 series is such a natural choice to supersede the 5500 series that TPC benchmarks were not essential, as there were sufficient other benchmarks to support the claims.

Benchmark Omissions

Earlier, I had commented about benchmark omissions from the quad-core generation on. Below is a summary of processors and systems for which TPC results are published. (The Intel Xeon 7500 Processor Product Brief shows 3.03X relative to 7400 for OLTP Brokerage Database, which is TPC-E, but 2022 over 729 is 2.77X. See Performance Benchmarks for updates.)

Processor       TPC     2-way          4-way          8-way          16-way

Core2 65nm      TPC-C   251,300        407,079        841,809        -
Xeon 5300 QC    TPC-E   5160 only      479.51         804.0          1,250.0
7300 QC         TPC-H   17,686@100     34,990@100     46,034@300     -

Barcelona       TPC-C   -              471,883        -              -
65nm            TPC-E   -              -              -              -
QC              TPC-H   -              -              52,860@300     -

Core2 45nm      TPC-C   275,149        634,825        Linux DB2      -
Xeon 5400 QC    TPC-E   317.45         729.65         1,165.56       2,012.8 (R2)
7400 6C         TPC-H   -              -              -              102,778@3T

Shanghai 45nm   TPC-C   -              579,814        -              -
QC              TPC-E   -              635.4          -              -
                TPC-H   -              -              57,685@300G    -

Istanbul 45nm   TPC-C   -              -              -              -
6C              TPC-E   -              -              -              -
                TPC-H   -              -              91,558@300G*   -

Nehalem 45nm    TPC-C   661,475†       -              -              -
Xeon 5500 QC    TPC-E   850.0          2,022.64       3,141.76       -
7500 8C         TPC-H   51,086@100G    -              162,601@3TB    -

Westmere 32nm   TPC-C   803,068        future         future         future
Xeon 5600 6C    TPC-E   1,110          future         future         future
7600 ?C         TPC-H   -              future         future         future

Magny-Cours     TPC-C   705,652        1,193,472      n/a            n/a
45nm            TPC-E   887.4          1,464          n/a            n/a
12C             TPC-H   -              107,561@300G   n/a            n/a

* and SF 1TB report
† Xeon W5580 3.2GHz, versus X5570 2.93GHz
Magny-Cours will not support >4 socket systems

In brief, the Intel Core 2 architecture processors were avoiding comparisons against AMD Opteron in TPC-H, except for the 16-way Unisys system, for which there is no comparable Opteron system.

Opteron, on the other hand, avoided comparison with the Core2 architecture in 2-way systems and in TPC-C/E OLTP benchmarks across the board. In the 2-way systems, the old Intel FSB technology was still adequate, and the powerful Core2 architecture core was enough to beat a 2-way Opteron. There were respectable 4-way TPC-C and TPC-E results for Shanghai. When AMD announced the HT-Assist feature in Istanbul, one might have thought AMD was finally going to be able to compete in 4-way OLTP. But zero benchmarks have been published to date.

When the 2-way Intel Xeon 5500 processor, based on the Nehalem architecture, came out in early 2009, outstanding results were published for both the OLTP-oriented TPC-E and the DW/DSS-oriented TPC-H. In February 2010, a TPC-C result was published as well, even though Microsoft had previously said all new OLTP benchmarks were going to be TPC-E. This result was with SQL Server 2005 instead of 2008.

There was every expectation with the Xeon 7500 Nehalem-EX, that there would be both OLTP and DW/DSS benchmark results, as Xeon 7500 should produce world-class (and world-record) results in both. It is possible that performance problems were encountered in trying to achieve good scaling over 32-cores and 64-threads in a 4-way Xeon 7500 system. If this is identified as something that can be fixed in the Windows operating system or SQL Server engine, then a change request would be made. I seriously doubt that another processor stepping would be done for this, as Xeon 7500 is already D-step at release.

TPC-H Scaling

It is also quite possible Intel will have to face the fact that 2.5X the 2-way Xeon 5500 TPC-H SF100 result of 51,000 QphH is not going to be achieved no matter how good Xeon 7500 is at DW. This is because the TPC-H score is a geometric mean over the 22 queries. There are several small queries in TPC-H, two of which already run in under 1 second on the 2-way 8-core Xeon 5570 for SF100, and several that run near or under 2 seconds. There is limited opportunity to continue to improve the performance of small queries with increasing degree of parallelism, as the overhead to set up each thread becomes larger compared to the actual work done by each thread, especially if one also has to give up frequency, dropping from 2.93 to 2.26GHz. It would be helpful to know what the actual frequency is during a performance run with the turbo-boost feature.
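The effect of the geometric mean can be illustrated with a small sketch. The query times below are made up for illustration (they are not from any published report), and the assumption that large queries gain a full 2.5X while small queries gain nothing is deliberately extreme.

```python
import math

def geomean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical run times in seconds for 22 queries on the smaller system:
# 6 small queries and 16 larger ones. These are NOT published numbers.
small = [0.8, 0.9, 1.5, 1.8, 2.0, 2.0]
large = [10.0] * 8 + [30.0] * 8
base = small + large

# Assumption: large queries speed up 2.5X, small queries do not improve.
scaled = small + [t / 2.5 for t in large]

# A score based on a geometric mean improves by the ratio of the two means.
speedup = geomean(base) / geomean(scaled)
print(f"geometric-mean speedup: {speedup:.2f}X")
```

Under these assumptions the overall gain works out to 2.5 raised to the power 16/22, a little under 2X, even though every large query hit the 2.5X target.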

It is possible that some marketing putz does not understand this and denied permission to publish perfectly good Xeon 7500 TPC-H results because they did not meet the 2.5X goal. (Along with making a negative ranking and review entry for the person responsible for TPC-H benchmarking due to failing to achieve the 2.5X goal. But let's not grind axes here. Besides, who said life was fair? It takes exceptional talent to accomplish the impossible. A clever person anticipates impossible problems, and transfers to another group to avoid a sticky wicket.)

Achieving 2.5X in the big queries is a more meaningful goal. Achieving 50% better than the 8-way Opteron 6-core TPC-H SF300 or SF1TB would also be a worthwhile accomplishment, if Xeon 7500 were up to the task.

TPC-E Scaling

Finally, a quick comment on Xeon 7500 scaling from 4-way (32 cores, 64 threads) to 8-way (64 cores, 128 threads). In the past, achieving 1.5X scaling with this number of cores would have been a triumph. Given the announcement Microsoft made on Windows Server 2008 R2, on removing the thread scheduler and other impediments to high-end scaling, we were expecting 1.7X scaling. It could be that scaling beyond 64 threads is tricky because of the limit of 64 logical processors per processor group. Hopefully the 4-way to 8-way to 16-way scaling will improve over time as problems are solved one at a time, while the task master whips his/her draft horses (again, I digress).
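For reference, the 4-way to 8-way scaling actually achieved can be worked out from the two Xeon 7500 TPC-E results in the summary table. A minimal sketch; the variable names are only illustrative, and the 1.7X figure is the expectation stated above, not a measured value.

```python
# Published Xeon 7500 TPC-E results (tpsE) from the summary table
tpse_4way = 2022.64
tpse_8way = 3141.76

expected_scaling = 1.7  # the expectation stated in the text

scaling = tpse_8way / tpse_4way
shortfall = 1.0 - scaling / expected_scaling

print(f"4-way to 8-way scaling: {scaling:.2f}X")
print(f"shortfall against {expected_scaling}X: {shortfall:.1%}")
```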

Intel Xeon 5600 (Westmere-EP) and 7500 (Nehalem-EX) SKUs

Let's take a look at the Xeon 5600, 7500 and 6500 SKUs. The low-voltage, low-power SKUs are omitted. These are fine products for high-density environments, web servers, and utility databases. The line-of-business and DW databases should be on the X models.

Xeon 5600 SKUs
Model   Cores  Threads  GHz   Turbo  L3   QPI GT/s  Memory  Power  Price*
X5680   6      12       3.33  3.6    12M  6.4       1333    130    $1,663
X5670   6      12       2.93  3.33   12M  6.4       1333    95     $1,440
X5660   6      12       2.80  3.2    12M  6.4       1333    95     $1,219
X5650   6      12       2.66  3.06   12M  6.4       1333    95     $996
E5640   4      8        2.66  2.93   12M  5.86      1066    80     $774
E5630   4      8        2.53  2.8    12M  5.86      1066    80     $551
E5620   4      8        2.40  2.66   12M  5.86      1066    80     $387
X5677   4      8        3.46  3.73   12M  6.4       1333    130    $1,693
X5667   4      8        3.06  3.46   12M  6.4       1333    95     $1,440

* Intel 1k pricing

Xeon 7500 SKUs
Model   Cores  Threads  GHz   Turbo  L3   QPI GT/s  Memory  Power  Price*
X7560   8      16       2.26  2.66   24M  6.4       1066?   130    $3,692
X7550   8      16       2.00  2.4    18M  6.4       ?       130    $2,729
E7540   6      12       2.00  2.26   18M  6.4       ?       105    $1,980
E7530   6      12       1.86  2.13   18M  5.86      ?       105    $1,391
E7520   4      8        1.86  1.86   18M  4.8       ?       95     $856
X7542   6      6        2.66  2.8    18M  5.86      ?       130    $1,980

Xeon 6500 SKUs
Model   Cores  Threads  GHz   Turbo  L3   QPI GT/s  Memory  Power  Price*
X6550   8      16       2.00  2.4    18M  6.4       ?       130    $2,461
E6540   6      12       2.00  2.26   18M  6.4       ?       105    $1,712
E6510   4      8        1.73  1.73   12M  4.8       ?       105    $744

Before commenting, recall the main differences between the Xeon 5600 and Xeon 7500/6500 series. The Xeon 5600 series (32nm process) has 2 QPI links and 3 memory channels. The Xeon 7500 series (45nm process) has 4 QPI links, 4 memory channels, a larger cache per core (for the 24M version, 3M vs 2M) plus extensive reliability features. The 2 QPI links on the 5600 series allow a 2-way (socket) system. The 4 QPI links on the 7500 series allow glueless 4-way and 8-way systems. My understanding is that the 6500 series is the 7500 with only 2 QPI links enabled, for 2-way systems with 16 cores and 8 memory channels total, at lower frequency than the 5600 with 12 cores and 6 memory channels total, plus the 7500 RAS features.

Intel Xeon 5600 (Westmere-EP) and 7500 (Nehalem-EX) Systems

Now let's look at system pricing for the 2-way Dell PowerEdge T710 (Xeon 5600), R810 (either 7500 or 6500) and the 4-way R910 (7500). All systems have redundant power supplies and 2x73GB 15K 2.5in drives, 6Gb/s SAS. The 4-way has 4 power supplies.

Dell PowerEdge T710 Systems with 2 Xeon 5600 processors
System  Processor  GHz   Cores  L3   QPI GT/s  Mem speed  Memory       Price
T710    X5680      3.33  6      12M  6.4       1333       72GB 18x4G   $9,974
T710    X5660      2.80  6      12M  6.4       1333       72GB 18x4G   $8,634
T710    X5650      2.66  6      12M  6.4       1333       72GB 18x4G   $8,154
T710    E5640      2.66  4      12M  5.86      1066       72GB 18x4G   $7,474
T710    E5630      2.53  4      12M  5.86      1066       72GB 18x4G   $6,934

For some reason, Dell does not offer the T710 with the second from top X5670 2.93GHz.

Dell PowerEdge R810 Systems with 2 Xeon 7500 or 6500 processors
System  Processor  GHz   Cores  L3   QPI GT/s  Mem speed  Memory       Price
R810    X7560      2.26  8      24M  6.4       1066       64GB 16x4G   $17,866
R810    X7542      2.66  6      12M  5.86      ?          64GB 16x4G   $13,366
R810    X6550      2.00  8      18M  6.4       1066       64GB 16x4G   $13,066
R810    E7540      2.00  6      18M  6.4       1066       64GB 16x4G   $12,166
R810    E6540      2.00  6      18M  6.4       1066       64GB 16x4G   $11,496

Dell PowerEdge R910 Systems with 2 out of 4 sockets populated, Xeon 7500
System  Processor  GHz   Cores  L3   QPI GT/s  Mem speed  Memory       Price
R910    X7560      2.26  8      24M  6.4       1066       64GB 16x4G   $19,246
R910    X7550      2.00  8      18M  6.4       1066       64GB 16x4G   $16,446
R910    E7540      2.00  6      18M  6.4       1066       64GB 16x4G   $13,546
R910    E7530      1.86  6      18M  5.86      980        64GB 16x4G   $12,446

Dell PowerEdge R910 Systems with 4 Xeon 7500 processors
System  Processor  GHz   Cores  L3   QPI GT/s  Mem speed  Memory        Price
R910    X7560      2.26  8      24M  6.4       1066       128GB 32x4G   $34,040
R910    X7550      2.00  8      18M  6.4       1066       128GB 32x4G   $28,440
R910    E7540      2.00  6      18M  6.4       1066       128GB 32x4G   $22,640
R910    E7530      1.86  6      18M  5.86      980        128GB 32x4G   $20,440

Previously, I had argued that processors and systems today are so powerful that the standard practice of buying 4-way systems for critical database servers by default should be changed to 2-way. What I mean by default is in lieu of a proper system sizing analysis.

It may seem strange that I suggest not doing a proper sizing analysis (one of my services as a consultant). But in the sizing analyses I have seen done by other people, the quality of the work was poor and the effort cost more than a pair of 4-way systems.

What this means is that the practical solution used to be to buy a 4-way system. Try it out. If it is not sufficient, then hire someone (there are many people who can do this) to make it work on a 4-way. If that does not work, consider pruning features until it does work.

So why not just move up to an 8-way or larger system? Because 8-way and larger are mostly NUMA systems. Technically, all Opteron systems 2-way and up are NUMA. But by NUMA, I really mean systems where there is a large discrepancy between local and remote node memory access. There are very few people who can do performance analysis on a NUMA system (as opposed to those who claim to be able to). Do a search on SQL NUMA to see who has published meaningful material on this matter.

Default System Choice: Intel Xeon 5600

Anyways, the default choice today should be a 2-way system. However, since this is a critical system, perhaps there are features from the high end that we want. I believe this is the rationale for the Xeon 6500 from Intel, and the PowerEdge R810 from Dell.

In looking over the T710, R810 and R910, I am inclined to say the effort was not entirely successful, as with many first iterations. The effort definitely deserves merit, and is the proper direction for the future. It just needs further refinement. Of course, the true measure is whether people actually buy the R810 in volume, not just one person's opinion.

The R810 with either X7560 or X6550 just gives up too much frequency for the extra 2 cores per socket and the fourth memory channel. Some environments might want the Xeon 7500/6500 RAS features despite this. And there is only about a $1,400 price difference between the R810 and the R910 with 2 sockets populated.

The amount of $1,400 is very small for having two extra sockets available, even though most people never populate sockets after system purchase. It would be nice if one could buy the R910 with 4 sockets populated, but not have to pay the per-socket software licensing until they are turned on, as in the RISC world.
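The roughly $1,400 gap can be checked against the Dell prices listed earlier for the two processors that appear in both the R810 table and the 2-socket R910 table. A small sketch; the dictionary names are only illustrative.

```python
# Dell prices quoted in the system tables, 2 sockets populated, 64GB memory.
# Only processors listed for both chassis are included.
r810 = {"X7560": 17866, "E7540": 12166}
r910 = {"X7560": 19246, "E7540": 13546}

# Extra cost of the 4-socket-capable R910 over the R810, per processor model
premium = {cpu: r910[cpu] - r810[cpu] for cpu in r810}

for cpu, dollars in premium.items():
    pct = 100.0 * dollars / r810[cpu]
    print(f"{cpu}: R910 costs ${dollars:,} more than the R810 ({pct:.1f}%)")
```

Both models show the same $1,380 difference, which suggests the premium is for the chassis rather than the processors.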

True, the R810 is a 2U form factor compared with 4U for the R910, allowing much higher density. But the assumption was that this is a critical database server, for which an extra 2U is not a show stopper. (There are people who get hung up on the latest industry jargon and fads, and forget that job one is making sure the business is running.)

Late Addition: AMD Opteron 6100, Magny-Cours

AMD Opteron 6176 (Magny-Cours) 2-way 12-core results have just been published, with the HP ProLiant DL385G7. I will add more detail later. The 2-way TPC-E result is 887.38 and the TPC-C result is 705,652. Interestingly, both the HP ProLiant DL370G6 with the Xeon W5580 and the DL385G7 Opteron TPC-C results are on SQL Server 2005. Perhaps the Microsoft mandate to use TPC-E is for SQL Server 2008, hence the TPC-C on 2005 was allowed? Also of interest is that the Opteron 6176 TPC-C result uses 125 SSDs instead of hard disks (1300 HDs in the Xeon W5580 result).

Before comparing the Opteron 12-core with Xeon 5500, let us first compare against the previous generation Xeon 5400 quad-core. The 2-way 12-core Opteron 6176 achieved OLTP results higher than the Xeon 5460 by 2.5X on TPC-C and 2.8X on TPC-E. These are very good results for a 3X increase in the number of cores. Now in comparing against the quad-core Xeon 5500 series, the 12-core Opteron is just marginally higher. I am inclined to think much of this is due to the Hyper-Threading capability in the Xeon 5500 series. HT was much maligned in the NetBurst architecture generation. Some people today still blindly regurgitate the advice to disable HT, not realizing this advice was applicable to the old NetBurst and not the new Nehalem architecture processors. At some point AMD may have to admit that implementing HT will be a necessity.
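The ratios in this comparison follow directly from the 2-way results in the summary table. A short sketch using those published figures; the dictionary layout and labels are only for illustration.

```python
# 2-way results from the summary table: (TPC-C tpmC, TPC-E tpsE)
xeon_5400 = (275149, 317.45)     # quad-core, Core2 45nm
xeon_5500 = (661475, 850.0)      # quad-core, Nehalem (TPC-C on the W5580)
opteron_6176 = (705652, 887.38)  # 12-core Magny-Cours

def ratios(new, old):
    """Return (TPC-C ratio, TPC-E ratio) of one system over another."""
    return new[0] / old[0], new[1] / old[1]

vs_5400 = ratios(opteron_6176, xeon_5400)
vs_5500 = ratios(opteron_6176, xeon_5500)

print(f"vs Xeon 5400: TPC-C {vs_5400[0]:.2f}X, TPC-E {vs_5400[1]:.2f}X")
print(f"vs Xeon 5500: TPC-C {vs_5500[0]:.2f}X, TPC-E {vs_5500[1]:.2f}X")
```

Against the Xeon 5500 the gain is single digits in percent on both benchmarks, despite the Opteron having three times as many cores.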

The price for the DL385G7 with 2x6176 processors from the TPC-H report is $1,511 for the system chassis, $1,799 for each processor, $990 for each 8GB kit, and perhaps another $1K for comparable configuration as above. This is very reasonable, except for the memory which seems high. Each 8GB kit should be around $500.

Magny-Cours is comprised of two six-core Istanbul die(?) each with 4x0.5 L2 cache and 6M L3. The Istanbul die size is 346mm2, versus 540 mm2 for Nehalem-EX with 8-cores and 24M L3. The images below were adjusted to match the die size closely, but there is no assurance that the aspect ratios are correct.

[Images: Istanbul, Istanbul, Nehalem-EX dies]

For some reason I thought Nehalem EX was 540 mm2 when in fact the Intel website says it is 684 mm2. The figure below shows the corrected scaling.

[Images: Istanbul, Istanbul, Nehalem-EX dies, corrected scaling]

2010 June 21, HP ProLiant servers, and other results

HP has just announced the ProLiant DL580 G7 and DL980 G7 servers based on the Xeon 7500 series processors, and the DL585 G7 4-way server with the 12-core AMD Opteron 6100 series (Magny-Cours).
Apparently the reason for the delay is that the 8-way DL980 G7 employs custom silicon node controllers (XNC), and possibly also so that HP could make a splash by announcing all three systems at their big annual conference, the HP Technology Forum. The DL580 and 585 G7 are available now(?), and the 980 G7 should be available later in Q3.

While the Intel Xeon 7500 processor allows a glueless 8-way system, HP felt that the design could be improved with node controllers. The node controllers reduce snoop traffic for a majority of memory accesses, and can achieve a 30% reduction in memory latency in some circumstances. It should be considered that HP needed to build a custom silicon crossbar (and node controllers) for their Superdome 2 system and the Itanium 9300 series processors, which use the same QuickPath Interconnect (QPI) as the Nehalem processors. There are differences in the way the Itanium and Xeon processors use QPI. There are also differences between the node controllers for the Itanium and Xeon systems. (The Itanium node controller implements directory-based cache coherency and the Xeon node controller is a snoop filter.)

HP might have built a glueless 8-way Xeon 7500 system if they had not already invested the effort to build the XNC for their Itanium systems. This also means that HP should have the components to build a 16-way Xeon 7500 system, meaning that if there were market demand, such a system could be brought to market. Intel did say that there were 16-way Xeon 7500 system designs, but none have surfaced yet.

Dell has also released a 2-way TPC-E result for the Xeon 5600, and Fujitsu released a 4-way TPC-E result for the Xeon 7500.

TPC-H 300GB: 4-way DL585 G7 vs 8-way ProLiant DL785 G6

A comparison of the TPC-H 300GB results for the 8-way ProLiant DL785 G6 and the 4-way DL585 G7 is interesting, with the 4-way DL585G7 having 18% better performance on the Power metric.

System    TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
DL785G6   109,067.1     76,860.0           91,558.2
DL585G7   129,198.3     89,547.7           107,561.2
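The TPC-H composite (QphH) is the geometric mean of the Power and Throughput metrics, so the three columns can be cross-checked, along with the 18% Power advantage. A minimal sketch using the figures in the table; variable names are illustrative.

```python
import math

# (Power, Throughput, Composite QphH) at SF300, from the table above
results = {
    "DL785G6": (109067.1, 76860.0, 91558.2),
    "DL585G7": (129198.3, 89547.7, 107561.2),
}

# Composite = sqrt(Power * Throughput); compare with the published value
recomputed = {}
for name, (power, throughput, composite) in results.items():
    recomputed[name] = math.sqrt(power * throughput)
    print(f"{name}: recomputed {recomputed[name]:,.1f}, published {composite:,.1f}")

power_ratio = results["DL585G7"][0] / results["DL785G6"][0]
composite_ratio = results["DL585G7"][2] / results["DL785G6"][2]
print(f"DL585G7 over DL785G6: Power {power_ratio:.2f}X, Composite {composite_ratio:.2f}X")
```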

The significant differences between the two systems are below. Both systems have the same number of total cores, the 8-way with 6-core processors and the 4-way with 12-core processors. The DL785G6 cores are 2.8GHz versus the DL585G7 at 2.3GHz, about a 20% difference. The DL585G7 has twice the memory, 512GB versus 256GB. For TPC-H at SF300, and using SQL Server 2008 page compression, 256GB is not quite sufficient to encompass the entire database tables and indexes. With 512GB, there is more than sufficient memory for data, indexes and probably most hash join intermediate results (for minimal tempdb activity).

System           DL785G6        DL585G7
Processor        Opteron 8439   Opteron 6176
Sockets-Cores    8 x 6 = 48     4 x 12 = 48
Frequency        2.8GHz         2.3GHz
Memory           256GB          512GB
Storage          194 HDD        4 SSD
Windows Server   2008 EE SP1    2008 R2 EE
SQL Server       2008 EE SP1    2008 R2 EE

That the DL585G7 employs SSD storage is not expected to impact performance, and SSD was probably used for lower cost. The 194 15K HDDs and 12 storage enclosures in the DL785 cost $110K, while the 4 320GB Fusion-io drives in the DL585 cost $55K. If the DL585 had 256GB or less memory, then the SSD storage would have moderately better performance than HDD storage. Another significant difference is the set of improvements in Windows Server 2008 R2, several of which have a major impact on scaling to a high number of processor cores.

The chart below shows the TPC-H power query run times for the DL585G7 relative to the DL785G6.

TPC-H Power query run times, DL585G7 relative to DL785G6

Overall, the DL585G7 with 4 Opteron 6176 processors is about 20% higher than the DL785G6 with 8 Opteron 8439 processors. For the individual queries, several are moderately faster, 3 are much faster, 5 are about the same, and 3 are actually significantly slower. The DL785 has faster processors, which should make all queries run faster. It is difficult to account for differences in the system architecture, as there may be differences in how the individual dies are connected. The greater memory on the DL585 is expected to make certain queries run faster. The scaling improvements in R2 (OS and SQL) might contribute significant gains in some queries, but may also have negative effects in others.

It would be very helpful to have access to the actual execution plans, along with execution statistics, to determine if the differences can be attributed to plan differences or to differences in disk IO.

TPC-H 3000GB: 8-way Xeon 7560 vs 16-way Xeon 7460

Below are the TPC-H 3000GB results for the 8-way ProLiant DL980 G7 with the Xeon 7560 processor and the 16-way ES7000 with the Xeon 7460. The 32-way dual-core IBM 5GHz Power6 result is also shown.

System           TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
16 x Xeon 7460   120,254.8     87,841.4           102,778.2
8 x Xeon 7560    185,297.7     142,685.6          162,601.7
32 x Power6      142,790.7     171,607.4          156,537.3

Additional details are below:

System            ES7000        DL980G7      Power 595
Processor         Xeon 7460     Xeon 7560    Power6
Sockets-Cores     16 x 6 = 96   8 x 8 = 64   32 x 2 = 64
Hyper-Threading   no            yes          4/core?
Frequency         2.66GHz       2.26GHz      5.0GHz
Memory            1024GB        512GB        512GB
Storage           914 HDD       660 HDD      288 HDD
OS                2008 R2 DC    2008 R2 EE   AIX 6.1
Database          2008 R2 DC    2008 R2 EE   Sybase 15.1

The Unisys system may have been over-configured in disks and memory. Many of the TPC-H queries involve large table (or range) scans. If the entire database cannot be brought into memory, then there may not be much difference in the disk IO generated with either 512GB or 1TB memory. More importantly, the Windows operating system and SQL Server versions match, so there is high confidence we are seeing mostly the difference between the two processor (and system) architectures.

The IBM system may appear to be under-configured in terms of the number of disk drives. But it does seem that other database engines are better at switching from pseudo-random to sequential scan operations, and can work fine with fewer disks.

While the Xeon 7400 series processor core was top of the line in its time, even the 4-way Xeon 7400 system had limited memory bandwidth (and channels). Scaling beyond 4-way was not a simple matter. Of course, the Xeon 7400 systems were still competitive with systems based on processors with better scalability, but weaker single core performance.

Based on the 16-way Xeon 7460 result, the expectation is that an 8-way Xeon 7460 would be in the range of 75,000 on the Power metric, i.e., doubling the number of processors should increase performance by 1.6X. In turn, there is sufficient reason to estimate that the Xeon 7560 is about 2.5X more powerful than the Xeon 7460 for data warehouse usage. This is less than the 2.77X observed in OLTP, which is in line with expectations because OLTP derives substantial benefits from Hyper-Threading (30%?) and data warehousing derives only a modest benefit from HT (10%?).
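The arithmetic behind this estimate can be sketched as follows. The 75,000 figure appears to come from the TPC-H Power metric rather than the composite; that reading is an inference from the numbers, not something stated in the published reports. The 1.6X gain per doubling of processors is the rule of thumb used above.

```python
# TPC-H Power at SF3000, from the results table
power_16way_7460 = 120254.8
power_8way_7560 = 185297.7

# Rule of thumb from the text: doubling processors yields about 1.6X
doubling_gain = 1.6

# Estimated (not measured) 8-way Xeon 7460 result
est_8way_7460 = power_16way_7460 / doubling_gain

# Socket-for-socket ratio of Xeon 7560 to the estimated Xeon 7460
ratio = power_8way_7560 / est_8way_7460

print(f"estimated 8-way Xeon 7460 Power: {est_8way_7460:,.0f}")
print(f"8-way Xeon 7560 over estimate: {ratio:.2f}X")
```

The same calculation on the composite QphH would give a lower 8-way estimate and a slightly higher ratio, so the conclusion of roughly 2.5X is not sensitive to which metric is used.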

The chart below shows the TPC-H power query run times for the 8-way Xeon 7560 relative to the 16-way Xeon 7460.

TPC-H Power query run times, 8-way Xeon 7560 relative to 16-way 7460

As with the earlier comparison, there is also wide variation in the individual queries. Many queries are 40% faster, two are about the same, two are actually slower, and one is more than 5X faster.

TPC-H 1000GB: 8-way 6-core Opteron 785G6 vs 16-way quad-core Itanium

Below are the TPC-H 1000GB results for the 8-way ProLiant DL785 G6 with the Opteron 8439 processor and the 16-way Integrity Superdome 2 with the Itanium 9350.

System              TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
8 x Opteron 8439    95,789.1      69,367.6           81,514.8
16 x Itanium 9350   139,181.0     141,188.3          140,181.1

Additional details are below:

System            DL785 G6       Superdome 2
Processor         Opteron 8439   Itanium 2 9350
Sockets-Cores     8 x 6 = 48     16 x 4 = 64
Hyper-Threading   no             yes
Frequency         2.8GHz         1.73GHz
Memory            512GB          512GB
Storage           240 HDD        576 HDD
OS                2008 R2 EE     HP-UX
Database          2008 EE        Oracle 11g R2

The operating system and database engine are both completely different, so caution is warranted in comparing the results. Also very important is that the execution plans could also be very different in certain queries.

As the expectation is that doubling the number of processors should lead to approximately a 1.6X performance gain, we can see that the six-core Opteron 8439 is in the same neighborhood as the quad-core Itanium 2 9350. The individual Opteron processor is probably a little better than the Itanium at the socket level in the TPC-H Power test, but the Itanium has the advantage in throughput-oriented usage.
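The socket-level comparison can be made concrete by scaling the 16-way Itanium results down to a hypothetical 8-way system using the 1.6X rule of thumb. This is a rough sketch only: the 8-way Itanium numbers are estimates, not measurements, and the OS and database differ between the two systems.

```python
# TPC-H SF1000 results from the table: (Power, Throughput)
opteron_8way = (95789.1, 69367.6)
itanium_16way = (139181.0, 141188.3)

doubling_gain = 1.6  # rule of thumb from the text

# Hypothetical 8-way Itanium, estimated by dividing out one doubling
itanium_8way_est = tuple(v / doubling_gain for v in itanium_16way)

power_ratio = opteron_8way[0] / itanium_8way_est[0]
throughput_ratio = opteron_8way[1] / itanium_8way_est[1]

print(f"Opteron vs estimated 8-way Itanium, Power: {power_ratio:.2f}X")
print(f"Opteron vs estimated 8-way Itanium, Throughput: {throughput_ratio:.2f}X")
```

Under this estimate the Opteron is ahead by about 10% on Power and behind by a little over 20% on Throughput.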

The chart below shows the TPC-H power query run times for the 16-way Itanium relative to the 8-way Opteron.

TPC-H Power query run times, 16-way quad-core Itanium relative to 8-way 6-core Opteron

As expected, there is wide variation in the individual queries. There are differences in almost every important area: the processor and system architecture, the operating system and the database engine. It is not just the difference in the database engine, but also the execution plans.