
TPC-E Benchmark Update (2018-Sep)

Updated tables to support recent related analysis.

Below are 2-way TPC-E reports: tpsE per thread is shown as blue bars with the scale on the left axis, and initial database size relative to system memory is on the right axis.

[Chart tpce_2Sc: 2-way TPC-E, tpsE per thread (blue bars, left axis) and initial database size relative to memory (right axis)]

Reports up to the Xeon X5680 (sixth from left) should be HDD storage, and reports afterwards should be SSD storage (I will verify later).

Below are 4-way TPC-E reports.

[Chart tpce_4Sc: 4-way TPC-E, tpsE per thread and initial database size relative to memory]

The transition point from HDD to SSD storage is probably one of the Xeon X7560 systems?

Below is the table for the 2-way systems, now with calculated values for tpsE per thread (along with memory and initial database size from each report). A brief sketch of the per-thread calculation follows the table.

System Vendor | Processor Model | Freq (GHz) | Cores per Socket | Mem (GB) | DB (GB) | tps-E | tps-E per thread | SQL ver.
IBM     | X5160      | 3.0  |  2 |   32 |    654 |    169.59* | 42.40 | 2005 SP2
Dell    | X5460      | 3.16 |  4 |   48 |  1,666 |    268.00* | 33.50 | 2005 SP2
Dell    | X5460      | 3.16 |  4 |   48 |  1,666 |    295.27* | 36.91 | 2008
Fujitsu | X5460      | 3.16 |  4 |   64 |  1,233 |    317.45* | 39.68 | 2008
IBM     | X5570      | 2.93 |  4 |   96 |  3,163 |    817.15* | 51.07 | 2008 SP1
HP      | X5680      | 3.33 |  6 |   96 |  4,761 |  1,110.11* | 46.25 | 2008R2
HP      | X5690      | 3.46 |  6 |  192 |  5,622 |  1,284.14  | 53.51 | 2008R2
IBM     | E7-2870    | 2.4  | 10 |  512 |  6,408 |  1,560.7   | 39.02 | 2008R2
IBM     | E5-2690    | 2.9  |  8 |  512 |  7,782 |  1,863.23  | 58.23 | 2012 SP1
HP      | E5-2690    | 2.9  |  8 | 256† |  7,792 |  1,881.76  | 58.81 | 2012
IBM     | E5-2697 v2 | 2.7  | 12 |  512 | 10,739 |  2,590.93  | 53.98 | 2012
Fujitsu | E5-2699 v3 | 2.3  | 18 |  512 | 15,604 |  3,772.08  | 52.39 | 2014
Lenovo  | E5-2699 v4 | 2.2  | 22 |  512 | 20,518 |  4,938.14  | 56.12 | 2016
Fujitsu | E5-2699 v4 | 2.2  | 22 | 1024 | 19,552 |  4,734.87  | 53.81 | 2016
Lenovo  | SP 8180    | 2.5  | 28 | 1536 | 29,145 |  6,598.36  | 58.91 | 2017

* HDD storage
† Edit 2018-08-27: previously I had cited this as 512GB; the IBM report was 512GB.
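
For reference, a minimal Python sketch of the per-thread calculation used in the table above (the rows are examples I picked from the table, and I assume 2 threads per core wherever HT is present):

```python
# Per-thread normalization used in the table above (sketch; rows are examples from
# the table, and 2 threads per core is assumed wherever HT is present).
SOCKETS = 2

def tps_per_thread(tps_e, cores_per_socket, threads_per_core=2):
    """tps-E divided by the total number of logical processors in the system."""
    return tps_e / (SOCKETS * cores_per_socket * threads_per_core)

rows = [
    ("X5460 (Penryn, no HT)",  317.45,   4, 1),
    ("X5570 (Nehalem)",        817.15,   4, 2),
    ("E5-2690 (Sandy Bridge)", 1881.76,  8, 2),
    ("SP 8180 (Skylake)",      6598.36, 28, 2),
]
for name, tps, cores, tpc in rows:
    print(f"{name:24s} {tps_per_thread(tps, cores, tpc):6.2f} tps-E per thread")
```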

Below is the corresponding table for the 4-way systems.

System Vendor | Processor Model | Freq (GHz) | Cores per Socket | Mem (GB) | DB (GB) | tps-E | tps-E per thread | SQL ver.
IBM     | X7350      | 2.93 |  4 |  128 |  1,774 |    419.80* | 26.24 | 2005 SP2
IBM     | X7350      | 2.93 |  4 |  128 |  1,913 |    479.51* | 29.97 | 2008
IBM     | X7460      | 2.66 |  6 |  128 |  2,821 |    729.65* | 30.40 | 2008
Dell    | X7560      | 2.27 |  8 |  512 |  8,309 |  1,933.96* | 30.22 | 2008R2
IBM     | X7560      | 2.27 |  8 | 1024 |  8,512 |  2,022.64* | 31.60 | 2008R2
Fujitsu | X7560      | 2.27 |  8 |  512 |  8,512 |  2,046.96  | 31.98 | 2008R2
HP      | E7-4870    | 2.4  | 10 | 1024 | 11,000 |  2,454.51* | 30.68 | 2008R2
IBM     | E7-4870    | 2.4  | 10 | 1024 | 11,631 |  2,862.61  | 35.78 | 2008R2
Fujitsu | E5-4650    | 2.7  |  8 |  512 | 10,904 |  2,651.27  | 41.43 | 2012
IBM     | E7-4870    | 2.4  | 10 | 2048 | 13,328 |  3,218.46  | 40.23 | 2012
IBM     | E7-4890 v2 | 2.8  | 15 | 2048 | 23,293 |  5,576.27  | 46.47 | 2014
Lenovo  | E7-8890 v3 | 2.5  | 18 | 4096 | 28,734 |  6,964.75  | 48.37 | 2014
Fujitsu | E7-8890 v3 | 2.5  | 18 | 2048 | 28,734 |  6,904.53  | 47.95 | 2014
Fujitsu | E7-8890 v4 | 2.2  | 24 | 2048 | 36,951 |  8,796.47  | 45.81 | 2016
Lenovo  | E7-8890 v4 | 2.2  | 24 | 4096 | 37,362 |  9,068.00  | 47.23 | 2016
Lenovo  | Plat 8180  | 2.5  | 28 | 3072 | 49,276 | 11,357.28  | 50.70 | 2017

* HDD storage

 

TPC-E Benchmark Review (2018-04)

2-way TPC-E results

We can look at TPC-E benchmark results to see if any significant changes in memory latency are discernible. Database transaction processing is largely an exercise in serialized pointer-chasing: fetch one memory location to determine the next operation and the subsequent memory address to fetch, then repeat.
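
A toy sketch of that dependent-load pattern follows. It only illustrates the access pattern; Python interpreter overhead swamps DRAM latency, so it is not a measurement tool (a real microbenchmark would walk a randomized linked list in C or assembly).

```python
import random

# Each load's address comes from the value returned by the previous load, so the
# accesses serialize on memory latency rather than overlapping.
N = 1 << 20
chain = list(range(N))
random.shuffle(chain)               # randomized order defeats hardware prefetch

i = 0
for _ in range(1_000_000):          # each iteration must wait for the prior load
    i = chain[i]
```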

The chart below shows TPC-E results for 2-way (socket) systems, from the Core 2 (Conroe) Xeon 5160 to Skylake SP, normalized on a per-core basis. SQL Server version changes are denoted in parentheses.

[Chart tpce_2S: 2-way TPC-E, tps-E per core, Xeon 5160 to Skylake SP]

The table below has the numerical values for the previous chart. (Technically, I should cite the specific reports whenever discussing TPC benchmark results; I will provide this upon request.)

System Vendor | Processor Model | Freq (GHz) | Cores per Socket | Mem (GB) | tps-E | tps-E per core | SQL ver. | Interconnect
IBM     | X5160      | 3.0  |  2 |   32 |   169.59 |  42.40 | 2005 SP2 |
Dell    | X5460      | 3.16 |  4 |   48 |   268.00 |  33.50 | 2005 SP2 |
Fujitsu | X5460      | 3.16 |  4 |   64 |   317.45 |  39.68 | 2008     |
IBM     | X5570      | 2.93 |  4 |   96 |   817.15 | 102.14 | 2008 SP1 |
HP      | X5690      | 3.46 |  6 |  192 | 1,284.14 | 107.01 | 2008R2   |
IBM     | E7-2870*   | 2.4  | 10 |  512 | 1,560.7  |  78.04 | 2008R2   | 1 ring
HP      | E5-2690    | 2.9  |  8 | 256* | 1,881.76 | 117.61 | 2012     | 1 ring
IBM     | E5-2697 v2 | 2.7  | 12 |  512 | 2,590.93 | 107.01 | 2012     | 2 rings
Fujitsu | E5-2699 v3 | 2.3  | 18 |  512 | 3,772.08 | 104.78 | 2014     | 2 rings
Lenovo  | E5-2699 v4 | 2.2  | 22 |  512 | 4,938.14 | 112.23 | 2016     | 2 rings
Lenovo  | SP 8180    | 2.5  | 28 | 1536 | 6,598.36 | 117.83 | 2017     | mesh

* Edit 2018-08-27: previously I had cited this as 512GB; the IBM report was 512GB.

Core 2 (Penryn) to Nehalem

The sharp increase from the Xeon X5460 (Penryn) to the X5570 (Nehalem) is impacted by the return of Hyper-Threading. To adjust for this, compare performance per core on Penryn and earlier systems without HT to performance per thread on Nehalem and later systems, which have HT enabled.

Nehalem at 51.07 tps-E per thread versus Penryn at 39.68 per core is a very impressive 29% gain. In essence, scaling with logical processors is excellent at the 1-to-2 level and would probably still be very good at the 2-to-4 level. Why Intel does not support 4 logical processors (threads) per core in its Xeon products is a mystery.
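
The 29% figure is simply the ratio of the two normalized values; a quick arithmetic check using the numbers from the table above:

```python
# Per-core (no HT) vs per-thread (HT) comparison cited above, recomputed.
penryn_per_core    = 317.45 / (2 * 4)       # X5460: 2 sockets x 4 cores      -> 39.68
nehalem_per_thread = 817.15 / (2 * 4 * 2)   # X5570: 2 sockets x 4 cores x 2T -> 51.07
print(f"{nehalem_per_thread / penryn_per_core - 1:.1%}")   # roughly 29%
```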

The Virtualization Performance Insights from TPC-VMS report says that TPC-E has 80% local memory access in a 2-way system. Local-node memory latency in a 2-way Nehalem is much lower than memory latency in the Core 2 generation systems, and much of the 29% gain can be attributed to this.
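
The average latency in such a mix is just a weighted sum. In the sketch below, the 80% local share is the TPC-VMS figure cited above, while the nanosecond values are illustrative assumptions on my part, not measurements of any specific system:

```python
# Average memory latency as a local/remote mix (assumed ns values, cited local fraction).
def avg_latency_ns(local_ns, remote_ns, local_fraction=0.80):
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

print(avg_latency_ns(local_ns=65, remote_ns=105))   # ~73 ns with these assumed inputs
```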

2S Westmere EP (Xeon 5600) versus Westmere EX (Xeon E7)

Our main interest is in the results from Nehalem forward. Although not the main topic here, there is an oddball result: the Westmere-EX Xeon E7-2870, 10 cores at 2.4GHz, in a 2-way system with 512GB memory.

Compare that with the Westmere-EP X5690 6-core 3.46GHz and 192GB memory. It might be tempting to attribute the 37% difference between Westmere EX and EP to the 44% difference in frequency from 2.4 to 3.46GHz. (Note that we do not know the actual operating frequency if turbo-boost was in effect).

However, we now know from other articles on this site that frequency has only a weak effect, as most CPU cycles are spent waiting for memory. An inadvertent test occurred on a 2-way Xeon E5-2670 system: after a BIOS/UEFI update, the system was put into a power-save mode of 135MHz. Performance was reduced by 3X versus the rated frequency of 2.7GHz. This is the expected result assuming an average memory latency of 110ns (50/50 local-remote) and 2.7% of operations requiring a serialized memory access.

Using this model, a frequency increase from 2.4GHz to 3.46GHz should yield about a 3.5% performance gain. To explain the remaining 33.5% would require a difference of 35ns in average memory latency (100.5 versus 135.5ns) between the two systems. Note that performance per core increased from the X5690 at 3.46GHz to the E5-2690 at 2.9GHz.
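
A minimal sketch of this kind of model follows. It is my own reconstruction, not the exact spreadsheet behind the numbers above, so the frequency-only estimate lands near 4% rather than exactly 3.5%:

```python
# Each operation costs one CPU cycle of work plus, for a small fraction of
# operations, a serialized memory access at the average latency (sketch model).
def time_per_op_ns(freq_ghz, mem_latency_ns=110.0, mem_fraction=0.027):
    cycle_ns = 1.0 / freq_ghz
    return cycle_ns + mem_fraction * mem_latency_ns

# 135 MHz power-save mode vs the rated 2.7 GHz: roughly a 3X slowdown.
print(time_per_op_ns(0.135) / time_per_op_ns(2.7))

# Frequency alone, 2.4 GHz -> 3.46 GHz, memory latency held at 110 ns: a few percent.
print(time_per_op_ns(2.4) / time_per_op_ns(3.46))

# Same frequencies, but 135.5 ns vs 100.5 ns average latency (the 35 ns difference
# suggested above) accounts for most of the observed gap between the two systems.
print(time_per_op_ns(2.4, 135.5) / time_per_op_ns(3.46, 100.5))
```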

I would have guessed that the presence of the SMB chip in the memory path of the Xeon E7 processors contributes 15-20ns, but I could be wrong on this? Note: in 4-way systems, Westmere-EX results were published for both SQL Server 2008R2 and 2012, which may shed some light on this.

[Figure Westmere_EX: Westmere-EX with SMB in the memory path]

Nehalem-EP (Xeon 5500) and Westmere-EP (5600)

Between Nehalem (X5570) at 2.93GHz with 96GB memory on HDD storage (384 x 15K SAS), and Westmere-EP (X5690) at 3.46GHz with 192GB memory on flash storage (PCI-E direct to Violin Memory systems), performance per core (and per thread) increased by 4.8%. We can probably attribute 1.5% to the increase in frequency. The remainder is probably the change in storage.
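
The frequency-only estimate follows from the same sketch model as above (one cycle of work per operation plus 2.7% of operations paying a 110ns memory access, my assumed constants):

```python
# Frequency-only effect for 2.93 GHz -> 3.46 GHz under the sketch model above.
t = lambda ghz: 1.0 / ghz + 0.027 * 110.0
print(f"{t(2.93) / t(3.46) - 1:.1%}")   # on the order of 1.5-2%
```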

Average response times for the key operations Trade-Lookup and Trade-Update decrease from 0.60 and 0.71 sec to 0.09 and 0.11 sec respectively. Fewer concurrent threads are necessary to drive the system to 100% CPU utilization, and hence there is less context-switching overhead.

I am disinclined to attribute too much to the change in SQL Server version from 2008 SP1 to 2008R2, considering that the benchmark consists of basic SQL operations and that a substantial change in system architecture had not occurred. SQL Server engine team members are invited to provide input on this.

Sandy Bridge (Xeon E5)

In the next step, from Westmere (X5690) with 192GB memory to the Sandy Bridge Xeon E5-2690 at 2.9GHz with 256GB memory, both on flash storage and the latter on SQL Server 2012, performance per core increased by 9.9%. If there was an increase in memory latency between Westmere and Sandy Bridge, it is not evident.

Some of the improved performance can probably be attributed to the larger memory, in absolute terms, per core/thread, and per unit of performance. Possibly a large part of the increase is from IPC improvements in Sandy Bridge; Sandy Bridge-EP also increased the L3 cache from 2MB to 2.5MB per core/slice. Other improvements in the Sandy Bridge system architecture are also possible.

Another consideration is that in Nehalem and Westmere-EP, the two processors are connected by one QPI link, while the other QPI link on each processor connects to an IOH. It could be the same IOH, or there could be two IOHs, themselves also connected by QPI.

Note: Trade-Lookup and Trade-Update average response time is now 0.07 and 0.08 sec respectively.

Ivy Bridge (Xeon E5 v2)

The first decrease in performance per core, aside from the E7 Westmere-EX, occurs from Sandy Bridge E5-2690 8-core 2.9GHz to Ivy Bridge E5-2697 v2 12-core at 2.7GHz. Both are IBM systems.

We could attribute some of this to absolute performance increasing from 1863.23 tps-E to 2590.93 tps-E, both on the same 512GB memory. Note the 32GB DDR3-1866 LRDIMM was $4,599 in the first report (Mar 2012) and $1,439 in the second report (Sep 2013).

I am inclined to attribute some of this to higher latency from the more complicated core interconnect rings in the big-die Ivy Bridge, versus the simpler single interconnect ring of Sandy Bridge.

[Die diagrams IvyBridge-e7v2, IvyBridge15c: Ivy Bridge 15-core (HCC) die]

Note: there are some Intel layout diagrams showing the large-die (HCC) Ivy Bridge with cores in 4 rows and 3 columns for 12 cores. In fact the die is 5 rows of 3 columns for 15 cores. The Xeon E5 v2 series was only offered with up to 12 cores, while the E7 v2 was offered with up to the full set of 15 cores. I seem to recall that certain special customers were given access to the full 15-core Ivy Bridge as an E5 v2.

Note: Trade-Lookup and Trade-Update average response time is now 0.05 and 0.06 sec respectively.

Haswell (Xeon E5 v3)

Performance per core decreased another 2.2% going from Ivy Bridge to the Haswell Xeon E5-2699 v3, 18 cores at 2.3GHz. About 2% could be attributed to the 17% difference in frequency. Regardless of IPC differences between Ivy Bridge and Haswell, trading frequency for thermal envelope and more cores is worthwhile.

System memory is still 512GB, though now DDR4-2133 LRDIMM ($1,357 each), and absolute performance is now 3,772.08 tps-E, double that of the Sandy Bridge system.

Broadwell (Xeon E5 v4)

Interestingly, the Broadwell Xeon E5 v4 recovers 7.1% in performance per core/thread even as the number of cores per socket increases to 22 and frequency drops slightly to 2.2GHz. Memory remains at 512GB while absolute performance is now 4,938.14 tps-E, 2.62 times that of the Sandy Bridge system.


Note: Trade-Lookup and Trade-Update average response time is now 0.07 and 0.08 sec respectively.

Skylake Xeon SP

Skylake improves performance per core/thread by 5.0% over Broadwell, bringing performance per core back to the Sandy Bridge level. Frequency is 2.5GHz. Memory finally increased, to a whopping 1536GB at that (12x64GB DDR4-2666).

In Skylake, the interconnect rings have been replaced by a mesh, the better path forward as the number of cores continues to increase. More significant is the change in the L2/L3 cache: L2 has increased from 256K to 1M, now requiring 14 cycles of latency versus 12 cycles for the smaller L2 in earlier generations.
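
Converted to nanoseconds at the nominal base frequencies, the two L2 latencies are closer than the cycle counts suggest (a rough sketch; actual operating frequency under turbo, and hence the ns figures, will differ):

```python
# Cache latency in nanoseconds at the nominal base frequencies (sketch values).
def cycles_to_ns(cycles, freq_ghz):
    return cycles / freq_ghz

print(cycles_to_ns(12, 2.2))   # 256K L2 at 12 cycles, Broadwell E5-2699 v4 at 2.2 GHz -> ~5.5 ns
print(cycles_to_ns(14, 2.5))   # 1M L2 at 14 cycles, Skylake SP 8180 at 2.5 GHz        -> ~5.6 ns
```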

L3 is no longer inclusive of the contents of L2. See the Intel Xeon Scalable Architecture Deep Dive (slide deck) and Intel Xeon Processor Scalable Family Technical Overview for more information.

4-way TPC-E results

The Intel EX processor variants under the Xeon E7 product lines are almost entirely reported for 4-way and 8-way systems. While not the primary topic of this article, it is still useful to look at the TPC-E results for 4-way systems.

The chart below shows tps-E performance normalized per core. As before, the processors before Nehalem (X7560) did not have Hyper-Threading, except the X7140, which had HT but did not have it enabled. For the newer processors, compare performance per thread against performance per core for the older products.

[Chart tpce_4S: 4-way TPC-E, tps-E per core]

The table below has more details for Nehalem and later processors. (I will fill in more later).

System Vendor | Processor Model | Freq (GHz) | Cores per Socket | Mem (GB) | tps-E | tps-E per core | SQL ver. | Interconnect
Fujitsu | X7560      | 2.27 |  8 |  512 |  2,046.96 |  63.97 | 2008R2 | 1 ring
IBM     | E7-4870    | 2.4  | 10 | 1024 |  2,862.61 |  71.57 | 2008R2 | 1 ring
IBM     | E7-4870    | 2.4  | 10 | 2048 |  3,218.46 |  80.46 | 2012   | 1 ring
Fujitsu | E5-4650    | 2.7  |  8 |  512 |  2,651.27 |  82.85 | 2012   | 1 ring
IBM     | E7-4890 v2 | 2.8  | 15 | 2048 |  5,576.27 |  92.94 | 2014   | 3 rings
Lenovo  | E7-8890 v3 | 2.5  | 18 | 4096 |  6,964.75 |  96.73 | 2014   | 2 rings
Lenovo  | E7-8890 v4 | 2.2  | 24 | 4096 |  9,068.00 |  94.46 | 2016   | 2 rings
Lenovo  | Plat 8180  | 2.5  | 28 | 3072 | 11,357.28 | 101.40 | 2017   | mesh

Westmere-EX, SQL Server 2008R2 versus 2012

Some points to note are as follows. The first result for the 4-way Westmere-EX Xeon E7-4870 shown above was in Jun 2011, with 1024GB memory (16GB quad-rank DDR3-1066) on SQL Server 2008R2. A second result was reported in Nov 2012, with 2TB memory (32GB DDR3-1066) on SQL Server 2012. Both were IBM systems and had SAS SSDs attached to RAID controllers: 90 x 200GB SSDs in the first and 126 in the second.

The performance gain from the first to the second was 12.4%. As before, I am disinclined to attribute much of this gain to the larger memory (2X). Maybe 5%, but that is just a guess.

It is possible that the driver was improved between the two reports. Sometime in this period, the RAID controller vendors offered special firmware that improved performance when used with SSDs. The full disclosure shows the ServeRAID M5000 Series Performance Accelerator key in use for both systems. I believe that the secret sauce for improving RAID controller performance on SSDs was to turn off the cache.

I would guess that a substantial part of the gain was due to improvements in SQL Server 2012, in part related to the system architecture. Nehalem-EX was a substantially different system architecture from the previous-generation Intel 4-way systems. AMD had offered processors with integrated memory controllers (IMC) since 2004, but the combination of IMC, 8-core processors, and HT doubling the number of logical processors (threads) was new.

Only after Microsoft had time to work with the new systems and learn their characteristics could changes be rolled into the next version of the database engine, and that was 2012. Another item with the 4-way Westmere-EX was that the total number of logical processors now exceeded 64, which required the relatively new mechanism of processor groups (2008R2). It also took two or even three iterations to really work out a good strategy for leveraging very complex system architectures.

In any case, we should be careful in making attributions from the TPC-E reports. There are very many changes between reports on successive generation products. While the full disclosure does require all details, it does not require an explanation for the impact of each of the changes between generations.

Sandy Bridge Xeon E5

Another item in the 4-way TPC-E results is the Sandy Bridge Xeon E5-4650 system. The Xeon E7 processors have 3 QPI links to connect to remote processors (Nehalem-EX and Westmere-EX have a fourth for the connection to the IOH). In a 4-way system, each of the 3 remote processor nodes is connected directly. The Xeon E5 processors have only 2 QPI links (PCI-E is integrated, so a discrete IOH is not needed). In a 4-way configuration, two remote processors are directly connected (1 hop) and communication to the third remote processor must go through one of the other two (2 hops away).
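
A small hop-count sketch of the two topologies just described (the link assignments are the obvious ones implied by the text, not taken from a platform spec sheet):

```python
from collections import deque

def avg_remote_hops(links, origin=0):
    """Average number of QPI hops from the origin socket to each remote socket (BFS)."""
    dist, queue = {origin: 0}, deque([origin])
    while queue:
        u = queue.popleft()
        for v in links[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    remote = [d for s, d in dist.items() if s != origin]
    return sum(remote) / len(remote)

# Xeon E7: 3 QPI links per socket -> fully connected 4-way.
e7_full = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
# Xeon E5: 2 QPI links per socket -> ring; one remote socket is 2 hops away.
e5_ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

print(avg_remote_hops(e7_full))   # 1.0
print(avg_remote_hops(e5_ring))   # ~1.33
```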

The Xeon E5 does not use the scalable memory buffer (SMB), which doubles memory capacity but also adds latency. Whether capacity or latency is more important is a question that no one wants to answer directly. In Skylake, the two separate E5 and E7 product lines have been consolidated into a single Xeon SP product line. There are still M models of Xeon SP which can use the SMB, but the TPC-E systems do not use it. My interpretation is that memory latency is now more important than capacity, in light of the already very large memory capacity without the SMB and the maturation of SSD/flash storage.

4-way Performance per Core/Thread Trends

Notice that the trend in 4-way systems is a steady improvement in the performance per core (and thread) from Nehalem (and before) all the way to Skylake. Only Broadwell Xeon E7 v4 has a small dip.

The chart below consolidates the 2-way and 4-way TPC-E reports from Nehalem to Skylake. The Westmere-EP/EX compares the 2-way EP to the 4-way EX. The EX/EX represents EX processors in both 2-way and 4-way. Both 2-way and 4-way Sandy Bridge processors are Xeon E5. In most other cases, the 2-way are EP or Xeon E5 and the 4-way are EX or Xeon E7.

[Chart tpce_4S: consolidated 2-way and 4-way TPC-E per core/thread, Nehalem to Skylake]

If there have been changes in memory latency from Nehalem to Haswell and later processors, they are not entirely evident from the TPC-E results. It is possible that the memory latency values reported by the test tools (7-cpu, Intel MLC, etc.) do not reflect the memory latency experienced by database engine code. There is also the matter of the Westmere-EP based X5690 versus the Westmere-EX based Xeon E7-2870 in 2-way systems. There is reason to believe that performance per core is more dependent on memory latency than on frequency (for values above 1.5GHz).

Summary

Interpreting TPC-E benchmark results, even with the full disclosure reports, is tricky. With few exceptions, there is insufficient data to determine whether changes between processor generations are due to the hardware itself or to significant changes in the software version.

It is apparent that TPC-E performance per core and per thread has been steady in 2-way systems from Nehalem (2009) to Skylake (2017), with system-level performance increasing by a factor of 8. I am guessing that there must be some improvement at the core level in order to sustain performance per core at 28 cores per processor. Scaling up the number of cores in a processor is very good, but it is not perfect.

In 4-way systems, there has been steady improvement in performance per core from Nehalem-EX to Skylake, with only one minor decline at Broadwell. Furthermore, the performance per core in 4-way systems has steadily narrowed the gap relative to performance per core in 2-way systems. Much of this occurred from SQL Server 2012 onward. Could this be a SQL Server engine improvement for large systems?

In the Nehalem/Westmere generation, there was a large difference in frequency, 23 and 30% respectively, between the 4-way EX and the 2-way EP. But I am inclined to say that frequency at the 2GHz-plus level has only a weak effect on performance, because most CPU cycles are spent waiting for memory.

 

TPC website notes

TPC-E vs. TPC-C: Characterizing the New TPC-E Benchmark via an I/O Comparison Study

The TPC-E trace was obtained on a system with 4 quad-core 1.8GHz Opterons, 128GB memory, and 336 15Krpm SCSI drives organized into 12 RAID-0 arrays of 28 disks each.
A similar Dell R905 result: 635.43 tpsE (2009-02-19), 4 x quad-core 2.7GHz, 64GB, 288 15K HDDs.

The TPC-E run was configured with 200,000 customers. The TPC-E trace is about 10 minutes long.

We estimate the database size by counting the number of 1MB units that see at least one I/O request: the TPC-C database is 1.5TB, and the TPC-E database is 1.7TB.

TPC-E sees an average of 2,740 I/Os per second: 90.69% read, 9.31% write, with 98.97% of I/Os at 8KB.

 

Addendum

See Hot Chips 18 (2006), Blackford: A Dual Processor Chipset for Servers and Workstations, Kai Cheng et al., on Fully Buffered DIMM (FBD). Page 12 cites latencies and bandwidth for the 2004 DP (LH) and 2006 DP (BF) platforms: memory idle latency of 85ns and 87-102ns respectively, and average loaded latency for a TPC-C mix of 180-200ns and 115-125ns respectively.