
2010 July 06

Nehalem, Westmere and Magny-Cours, TPC-H 100GB

HP is the most prolific contributor to TPC benchmarks, which allows us to compare systems from one generation to another, across processor architectures, and for scale-up characteristics. Below is a comparison of TPC-H 100GB results for the ProLiant DL380 G6 with 2 quad-core Xeon 5570, the DL380 G7 with 2 six-core Xeon 5680 and the DL385 G7 with 2 12-core Opteron 6176 processors. Overall, the Westmere system is 42% better in the Power test than Nehalem, and slightly better (about 5%) than the 12-core Magny-Cours.

System      TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
DL380 G6       70,048.5           37,749.1               51,422.4
DL380 G7       99,426.3           55,038.2               73,974.6
DL385 G7       94,761.5           53,855.6               71,438.3
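As a quick sanity check on these figures, the TPC-H composite metric is the geometric mean of the Power and Throughput metrics, and the percentage comparisons quoted in the text follow directly from the Power column. The sketch below is illustrative only; the function names `composite_qphh` and `percent_gain` are made up for this example and are not part of any TPC tool.

```python
import math

# Published TPC-H 100GB results: (Power, Throughput, Composite QphH)
results = {
    "DL380 G6": (70048.5, 37749.1, 51422.4),
    "DL380 G7": (99426.3, 55038.2, 73974.6),
    "DL385 G7": (94761.5, 53855.6, 71438.3),
}

def composite_qphh(power, throughput):
    """Composite QphH as the geometric mean of Power and Throughput."""
    return math.sqrt(power * throughput)

def percent_gain(new, old):
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1.0) * 100.0

for system, (power, throughput, published) in results.items():
    computed = composite_qphh(power, throughput)
    print(f"{system}: computed {computed:,.1f}, published {published:,.1f}")

# Power-test comparisons quoted in the text
westmere_vs_nehalem = percent_gain(results["DL380 G7"][0], results["DL380 G6"][0])
westmere_vs_magny_cours = percent_gain(results["DL380 G7"][0], results["DL385 G7"][0])
print(f"Westmere over Nehalem (Power): {westmere_vs_nehalem:.1f}%")
print(f"Westmere over Magny-Cours (Power): {westmere_vs_magny_cours:.1f}%")
```

The same two functions can be applied to any of the result tables on this page.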

Below are the system details. Ideally, we would like to compare differences in just one substantial component at a time, but we must work with the results that are available. Between the two Xeon systems, the processor difference is both the 50% increase in the number of cores and the 13.6% increase in frequency. There are also differences in memory, from 144GB to 192GB, and in software: Windows Server from 2008 SP2 to 2008 R2, and SQL Server from 2008 SP1 to 2008 R2.

System           DL380 G6      DL380 G7      DL385 G7
Processor        Xeon 5570     Xeon 5680     Opteron 6176
Sockets-Cores    2 x 4 = 8     2 x 6 = 12    2 x 12 = 24
Frequency        2.93GHz       3.33GHz       2.3GHz
Memory           144GB         192GB         192GB
Storage          12 SSD        4 SSD         4 SSD
Windows Server   2008 EE SP2   2008 R2 EE    2008 R2 EE
SQL Server       2008 EE SP1   2008 R2 EE    2008 R2 EE

The max server memory settings were 138,000MB and 190,000MB for the 144GB and 192GB systems respectively. The TPC-H SF100 database (with the SQL Server 2008 date data type), data and indexes combined, is just over 140,000MB. It appears that neither report enabled page compression. The expectation is that there is more disk activity in the Nehalem system with 144GB memory. The effect of the difference in memory is expected to be highly uneven. Some queries will not be affected by the difference in memory, while others will be affected even with SSD storage.
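The memory argument can be put in rough numbers. The sketch below uses only the figures quoted above and makes one loud simplification: it treats all of max server memory as data cache, which overstates what is actually available to the buffer pool. The function name `shortfall_mb` is hypothetical.

```python
# Approximate figures quoted in the text, in MB.
DATABASE_MB = 140_000  # "just over 140,000MB", so treat this as a lower bound

max_server_memory_mb = {
    "DL380 G6 (144GB)": 138_000,
    "DL380 G7 (192GB)": 190_000,
}

def shortfall_mb(max_server_memory, database_size=DATABASE_MB):
    """MB of data and indexes that cannot be cached, or 0 if everything fits.

    Simplification (assumption): all of max server memory is data cache.
    """
    return max(0, database_size - max_server_memory)

for system, limit in max_server_memory_mb.items():
    missing = shortfall_mb(limit)
    if missing == 0:
        status = "database fits in memory"
    else:
        status = f"at least {missing:,} MB must be read from disk"
    print(f"{system}: {status}")
```

The shortfall on the 144GB system is small relative to the database, which is consistent with only some queries being affected.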

Windows Server 2008 R2 made significant improvements in high-end scaling, typically the realm of 32+ cores, through the removal of certain critical locks. This still might contribute in a small way at 8-12 cores. SQL Server 2008 R2 may also contribute some differences. If the execution plan for any query were different, the results may be uneven. Unfortunately, the TPC-H full disclosure reports do not include the execution plans.

The TPC-H Power individual query runtime ratios are shown below for Westmere relative to Nehalem. Most queries run 20-50% faster on Westmere (a gain that also includes the memory and OS/SQL version differences), in line with the overall average run-time reduction of about 30% (1/1.42 is roughly 0.70).

TPC-H Power query run times, DL380G7-Xeon 5680 relative to DL380G6-Xeon 5570

Query 9 is 3 times faster, and Query 20 is 2.3 times slower. It would be interesting to see if there are differences in the execution plans.
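The link between "42% better performance" and "about 30% less run time" is easy to misread, so here is a minimal sketch of the conversion. The helper names are invented for illustration; the only inputs are the Power scores and the query 9 and query 20 factors quoted above.

```python
def runtime_ratio(performance_ratio):
    """Run time of the new system relative to the old one.

    A performance ratio of 1.42 (42% faster) means each unit of work
    takes 1/1.42 of the original time.
    """
    return 1.0 / performance_ratio

def percent_time_saved(performance_ratio):
    """Percentage reduction in run time for a given performance ratio."""
    return (1.0 - runtime_ratio(performance_ratio)) * 100.0

# Overall Power ratio, Westmere (DL380 G7) over Nehalem (DL380 G6)
overall = 99426.3 / 70048.5
print(f"performance ratio {overall:.2f}, "
      f"run-time ratio {runtime_ratio(overall):.2f}, "
      f"time saved {percent_time_saved(overall):.0f}%")

# Query 9 is 3 times faster; query 20 is 2.3 times slower
print(f"query 9 run-time ratio:  {runtime_ratio(3.0):.2f}")
print(f"query 20 run-time ratio: {runtime_ratio(1.0 / 2.3):.2f}")
```

A run-time ratio below 1 means Westmere finishes sooner; above 1 means it takes longer.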

The TPC-H Power individual query runtime ratios are shown below for Magny-Cours relative to Westmere. Most of the queries are about the same. Magny-Cours is slower in queries 11 and 19, and faster in queries 12 and 20.

TPC-H Power query run times, DL385G7-Opteron 6176 relative to DL380G7-Xeon 5680

TPC-H Summary 2010 Jun

TPC-H 300GB: 4-way DL585 G7 vs 8-way ProLiant DL785 G6

A comparison of the TPC-H 300GB results for the 8-way ProLiant DL785 G6 and the 4-way DL585 G7 is interesting, with the 4-way DL585G7 having 18% better performance on the Power metric.

System     TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
DL785G6      109,067.1           76,860.0               91,558.2
DL585G7      129,198.3           89,547.7              107,561.2

The significant differences between the two systems are below. Both systems have the same total number of cores: the 8-way with 6-core processors and the 4-way with 12-core processors. The DL785G6 cores run at 2.8GHz versus 2.3GHz for the DL585G7, about a 20% difference. The DL585G7 has twice the memory, 512GB versus 256GB. For TPC-H at SF300, using SQL Server 2008 page compression, 256GB is not quite sufficient to hold all of the database tables and indexes. With 512GB, there is more than sufficient memory for data, indexes and probably most hash join intermediate results (for minimal tempdb activity).

System           DL785G6        DL585G7
Processor        Opteron 8439   Opteron 6176
Sockets-Cores    8 x 6 = 48     4 x 12 = 48
Frequency        2.8GHz         2.3GHz
Memory           256GB          512GB
Storage          194 HDD        4 SSD
Windows Server   2008 EE SP1    2008 R2 EE
SQL Server       2008 EE SP1    2008 R2 EE

That the DL585G7 employs SSD storage is not expected to impact performance; SSD was probably used for lower cost. The 194 15K HDDs and 12 storage enclosures in the DL785 cost $110K, while the 4 320GB Fusion-io drives in the DL585 cost $55K. If the DL585 had 256GB or less memory, then the SSD storage would give moderately better performance than HDD storage. Another significant difference is the set of improvements in Windows Server 2008 R2, several of which have a major impact on scaling to a high number of processor cores.

The chart below shows the TPC-H power query run times for the DL585G7 relative to the DL785G6.

TPC-H Power query run times, DL585G7 relative to DL785G6

Overall, the DL585G7 with 4 Opteron 6176 is about 20% higher in performance than the DL785G6 with 8 Opteron 8439 processors. For the individual queries, several are moderately faster, 3 are much faster, 5 are about the same, and 3 are actually significantly slower. The DL785 has faster processors, which by itself should make all queries run faster. It is difficult to account for differences in the system architecture, as there may be differences in how the individual dies are connected. The greater memory on the DL585 is expected to make certain queries run faster. The scaling improvements in R2 (OS and SQL) might contribute significant gains in some queries, but may also have negative effects in others.

It would be very helpful to have access to the actual execution plans, along with execution statistics, to determine whether the differences can be attributed to plan differences or to differences in disk IO.

TPC-H 3000GB: 8-way Xeon 7560 vs 16-way Xeon 7460

Below are the TPC-H 3000GB results for the 8-way ProLiant DL980 G7 with the Xeon 7560 processor and the 16-way ES7000 with the Xeon 7460. The 32-way dual-core IBM 5GHz Power6 result is also shown.

System           TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
16 x Xeon 7460     120,254.8           87,841.4              102,778.2
8 x Xeon 7560      185,297.7          142,685.6              162,601.7
32 x Power6        142,790.7          171,607.4              156,537.3

Additional details are below:

System            ES7000        DL980G7      Power 595
Processor         Xeon 7460     Xeon 7560    Power6
Sockets-Cores     16 x 6 = 96   8 x 8 = 64   32 x 2 = 64
Hyper-Threading   no            yes          4/core?
Frequency         2.66GHz       2.26GHz      5.0GHz
Memory            1024GB        512GB        512GB
Storage           914 HDD       660 HDD      288 HDD
OS                2008 R2 DC    2008 R2 EE   AIX 6.1
Database          2008 R2 DC    2008 R2 EE   Sybase 15.1

The Unisys system may have been over-configured in disks and memory. Many of the TPC-H queries involve large table (or range) scans. If the entire database cannot be brought into memory, then there may not be much difference in the disk IO generated with either 512GB or 1TB of memory. More importantly, the Windows operating system and SQL Server versions match, so there is high confidence that we are seeing mostly the difference between the two processor (and system) architectures.

The IBM system may appear to be under-configured in terms of the number of disk drives. But it does seem that other database engines are better than SQL Server at switching from pseudo-random to sequential scan IO operations, and can work fine with fewer disks.

While the Xeon 7400 series processor core was top of the line in its time, even the 4-way Xeon 7400 system had limited memory bandwidth (and channels). Scaling beyond 4-way was not a simple matter. Of course, the Xeon 7400 systems were still competitive with systems based on processors with better scalability, but weaker single core performance.

Based on the 16-way Xeon 7460 result, the expectation is that an 8-way Xeon 7460 would be in the range of 75,000 on the Power metric, on the assumption that doubling the number of processors should increase performance by 1.6X. In turn, there is sufficient reason to estimate that the Xeon 7560 is about 2.5X more powerful than the Xeon 7460 for data warehouse usage. This is less than the 2.77X observed in OLTP, which is in line with expectations because OLTP derives substantial benefits from Hyper-Threading (30%?) and data warehousing derives only a modest benefit from HT (10%?).
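The arithmetic behind this estimate is short enough to write out. The sketch below assumes, as the text does, that each doubling of processor count yields a 1.6X gain; that factor is a rule of thumb, not a measured value, and the variable names are chosen for this example only.

```python
# Assumption from the text: doubling the processor count gives 1.6X performance.
SCALING_PER_DOUBLING = 1.6

# Published TPC-H 3000GB Power results
power_16way_xeon_7460 = 120254.8
power_8way_xeon_7560 = 185297.7

# Work backwards from the 16-way result to a hypothetical 8-way Xeon 7460.
estimated_8way_xeon_7460 = power_16way_xeon_7460 / SCALING_PER_DOUBLING

# Compare the two processors at the same socket count.
socket_level_ratio = power_8way_xeon_7560 / estimated_8way_xeon_7460

print(f"Estimated 8-way Xeon 7460 Power: {estimated_8way_xeon_7460:,.0f}")
print(f"Xeon 7560 relative to Xeon 7460: {socket_level_ratio:.2f}X")
```

The result is sensitive to the scaling assumption: a factor nearer 2.0 would lower the estimated 8-way score and raise the ratio.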

The chart below shows the TPC-H power query run times for the 8-way Xeon 7560 relative to the 16-way Xeon 7460.

TPC-H Power query run times, 8-way Xeon 7560 relative to 16-way 7460

As with the earlier comparison, there is also wide variation in the individual queries. Many queries are 40% faster, two are about the same, two are actually slower, and one is more than 5X faster.

The chart below shows the TPC-H power query run times for the 32-way IBM p595 with 64-cores relative to the 8-way Xeon 7560 also with 64 cores.

TPC-H Power query run times, 64-core POWER6 relative to 64-core Xeon

The 64-core Xeon 7560 system has 30% better TPC-H Power than the 64-core POWER6 system. The POWER6 in turn has 20% better TPC-H Throughput than the Xeon. Again, there is wide variation in the individual queries. In queries 18 and 19, where Sybase is faster, the SQL Server execution plan shows key lookups at SF100. It would be helpful if HP could provide execution plans at SF3000. We should not draw too many conclusions when comparing completely different system architectures and completely different database engines. But I think this is a good hint for Microsoft to re-evaluate the execution plan cost formulas.

TPC-H 1000GB: 8-way 6-core Opteron 785G6 vs 16-way quad-core Itanium

Below are the TPC-H 1000GB results for the 8-way ProLiant DL785 G6 with the Opteron 8439 processor and the 16-way Integrity Superdome 2 with the Itanium 9350.

System              TPC-H Power   TPC-H Throughput   TPC-H Composite QphH
8 x Opteron 8439       95,789.1           69,367.6               81,514.8
16 x Itanium 9350     139,181.0          141,188.3              140,181.1

Additional details are below:

System            DL785 G6       Superdome 2
Processor         Opteron 8439   Itanium 9350
Sockets-Cores     8 x 6 = 48     16 x 4 = 64
Hyper-Threading   no             yes
Frequency         2.8GHz         1.73GHz
Memory            512GB          512GB
Storage           240 HDD        576 HDD
OS                2008 R2 EE     HP-UX
Database          2008 EE        Oracle 11g R2

The operating system and database engine are both completely different, so caution is warranted in comparing the results. Just as important, the execution plans could be very different in certain queries.

As the expectation is that doubling the number of processors should lead to an approximately 1.6X performance gain, we can see that the six-core Opteron 8439 is in the same neighborhood as the quad-core Itanium 9350. The individual Opteron processor is probably a little better than the Itanium at the socket level in the TPC-H Power test, but the Itanium has the advantage in throughput-oriented usage.
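To see how the socket-level comparison falls out, one can project the 8-way Opteron scores to a hypothetical 16-way system using the same 1.6X-per-doubling rule of thumb and compare against the published Itanium numbers. This is a sketch under that assumption only; no 16-way Opteron 8439 result exists, and `project` is a hypothetical helper.

```python
import math

def project(score, from_sockets, to_sockets, gain_per_doubling=1.6):
    """Project a score to a different socket count.

    Assumes (rule of thumb from the text) a fixed gain per doubling of sockets.
    """
    doublings = math.log2(to_sockets / from_sockets)
    return score * gain_per_doubling ** doublings

# Published TPC-H 1000GB results
opteron_8way = {"power": 95789.1, "throughput": 69367.6}
itanium_16way = {"power": 139181.0, "throughput": 141188.3}

itanium_vs_projected = {}
for metric in ("power", "throughput"):
    projected = project(opteron_8way[metric], from_sockets=8, to_sockets=16)
    itanium_vs_projected[metric] = itanium_16way[metric] / projected
    print(f"{metric}: projected 16-way Opteron {projected:,.0f}, "
          f"Itanium / projected = {itanium_vs_projected[metric]:.2f}")
```

A ratio below 1 favors the Opteron at the socket level and a ratio above 1 favors the Itanium, keeping in mind that the operating systems and database engines differ.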

The chart below shows the TPC-H power query run times for the 16-way Itanium relative to the 8-way Opteron.

TPC-H Power query run times, 16-way quad-core Itanium relative to 8-way 6-core Opteron

As expected, there is wide variation in the individual queries. There are differences in almost every important area: the processor and system architecture, the operating system and the database engine. It is not just the difference in the database engine, but also in the execution plans.

TPC-H 2009

Below are recent TPC-H scale factor 100 results. Be aware that the full SF100 TPC-H database is approximately 170GB in SQL Server 2005 with the 8-byte datetime type. So the 4-way system with 128GB memory (and somewhat less SQL Server data buffer cache) cannot hold the entire database in memory.

TPC-H Scale Factor 100GB

Date      System           Cores   Frequency   Cache   Memory   Process   QphH
04/2008   4 x Xeon X7350   Quad    2.93G       2x4M    128G     65nm      34,989.9
n/a       4 x Xeon X7460   Six     2.66G       16M     n/a      45nm      no result
06/2009   2 x Xeon X5570   Quad    2.93G       8M      48G      45nm      28,772.9
08/2009   2 x Xeon X5570   Quad    2.93G       8M      144G     45nm      51,180.5

* TPC-H QphH results are scaled by size, but results at different scale factors should not be compared directly.

The two Xeon 5570 systems use SSD storage, versus HDD on the 4-way, so the IO impact is uncertain. For the 2-way X5570 system with 144GB memory, the database on SQL Server 2008 using the 4-byte date type is approximately 140GB, which may almost fit in the buffer cache.

There is some uncertainty in interpreting the results, but it is within reason to expect that the 2-way Xeon 5570 is in the same class as a 4-way Xeon 7350. No 4-way results are available for the Xeon 7460 six-core.

Direct testing between the Dell T710 and R900 in matched, like-for-like configurations is desired for a final assessment. But there is enough public data to support consideration of the 2-way Xeon 5570 over the 4-way Xeon 7300 or 7400 series.

CPU               Memory   Power     Throughput   QphH
32 Itanium2 DC    256GB     90,909       53,899      69,999
128 Core2 QC      2080GB   782,609    1,740,122   1,166,977