Home, Optimizer, Benchmarks, Server Systems, System Architecture, Processors, Storage, TPC-H Studies (Parent)

Hyper-Threading Performance 2013-03

(Also see Benchmarks 2014)

The Hyper-Threading technology returned to the Intel Xeon (and Core i) with processor codename Nehalem in 2009. It was first introduced with Willamette in late 2000?, and the first Xeon in 2001. But the Core 2 architecture, 2006, in the Xeon 5300 and 5400 series did not have this feature. As with many (potentially) performance enhancing technologies and features, there is a glaring absence of meaningful detailed performance data that might be of help to developers and information technology professionals. And it is not as if Intel and Microsoft do not have the budget for this.

The complex interaction between software and advanced hardware inherently implies that there is no magic that always improve performance without risk of negative consequences. I am of the impression that marketing people squash anything that could be interpreted as negative because they want victims to believe that everything will be wonderful.

On our side we should expect that it takes talent and years of experience to build up the skills necessary to command professional salaries. So we need to see both good and bad in order to determine the best strategy to realize the benefits of the positive while containing the impact of any deficiencies. If it were not, then why staff our positions with someone more expensive than necessary?

Hyper-Threading from Nehalem to Sandy Bridge

Nehalem and later Xeon Core-i based processor implement temporal multi-threading. The older Pentium 4 based processors had simultaneous multi-threading. The details of both type are covered in the references at the end and will not be covered here. I have a brief discussion on HT in Processor Architectures and History.

When the Nehalem-EX (Xeon 7500 series) was near introduction, the HP benchmark team suggested that the impact of HT was in the range of 30% improvement for the OLTP oriented benchmarks and 10% for the DW/DSS benchmarks. I presumed that this was on the 8-socket ProLiant DL980 as benchmarks for the ProLiant 300 and 500 systems were handled by a different group. My expectations are that HT should have greater impact on the 1 to 4 socket systems because it is much more difficult to achieve performance gains on the more complex systems.

A few years ago, I did an assessment of a proprietary database engine that only did parallel b-tree searches without locking. Scaling was nearly linear with the number threads over both physical cores and logical processors. This was not unexpected because the pointer chasing code means the core executes serialized memory accesses, so there should be plenty of no-op cycles for two logical processor to share. As a matter of fact, 4 threads (logical processors) per core might be even better.

That SQL Server does not realize such benefit has to do with the locking implementation. Is there is a single resource that locks must go through? Dave Campbell's post Breakthrough performance with in-memory technologies on Hekaton cites Per-Ake Larson et al article, and mentions lock free data structures. (I found Sasha Goldshtein material Practical Concurrency Patterns: Lock-Free Operations and Practical Concurrency Patterns: Spinlock. If any knows good sources, please advise.) This might indicate that the next version of SQL Server could show more benefit from HT and perhaps Intel should consider more threads per core in the next generation?

HT in Small and Large Queries

Ideally we should investigate HT for both high-volume small queries (non-parallel execution plans) and low-volume large queries with parallel execution plans. Unfortunately my multi-threaded test harness is old and I do not have time to bring it up to date.

Recently I did work on a clustered system with two nodes. For whatever reason, one had HT enabled and another had HT disabled. Periodically, the active node would be switched to allow for patching. This provide invaluable data on HT performance.

Test System

The average SQL statement on this system is 1-2 CPU-ms, comparable to TPC-E. HT was off until mid-September, on until October, off until early November, on until mid-January and off until beginning of March. Certain SQL statements are not impacted by HT, and some SQL could be more than 50% faster with HT on, even queries without a parallel execution plan. The overall steady CPU load on the system is typically 25% during the day.

For large queries, I used the TPC-H data generator at scale factor 100 and the set of 22 TPC-H queries. The LineItem table is 80GB because of the use of the 4-byte date data type instead of the 8-byte datetime. The total database is 140GB.

Test System

The test server is a Dell PowerEdge T110 II with Intel Xeon E3-1240 processor, quad-core 3.3GHz, 32GB memory, and LSI MegaRAID SAS-9260 controller connected to 8 SATA 6Gpbs SSDs. There are 2 Crucial m4, 2 OCZ Vertex 3, 3 OCZ Vertex 3 Max IOPS and 1 OCZ Vector. The T110 only has PCI-E gen 2 slots, so I will need to get a new system for gen 3 testing. Below is a representation of the test system.

Test System

The software stack is Windows Server 2012 and SQL Server 2012 Enterprise Edition. Most testing was done with sp1 + cu3, with a few tests at RTM, and sp1 base. The storage configurations evaluated were 4 x 2 disk RAID 0, 8 disk RAID 4 and 1 x 8 disk RAID 0. All results here are 1 x 8 disk RAID 0, but write activity was minimal so RAID 5 results should be only moderately less.

Hyper-Threading can be enabled or disabled in the UEFI (formerly BIOS, now Universal Extensible Firmware?). Dell also allows setting the number of cores to enable. I think is most useful in light of SQL Server per core licensing. We can buy a single model for many functions, then enable the correct number of cores to balance performance and licensing cost.

The SQL Server DATE_CORRELATION_OPTIMIZATION setting was turned on. It so happens that correlated dates within the TPC-H tables are within 120-days of each other. Ship date is within 120 days of order date and so on. The TPC-H queries are written citing a greater than condition on one date column and a less than condition on a correlated column, so this setting helps. If there is a not tight correlation, then the setting may not help. In practice, it might be better to just explicitly specify both upper and lower bounds on the cluster key column whenever possible.

Several settings were tried on the LSI SAS-9260. No read-ahead is better than read-ahead and 128K stripe size was better than 64K, but results below are with no read-ahead and 64K stripe size.

While the TPC-H scale factor 100 data set was used, the indexes were built different than TPC-H requirements as I am not presenting official TPC-H results. The significant difference is that the clustered index key on LineItem is Shipdate+OrderKey+LineNumber, specified as unique. This is per my practice of building unique clustered indexes, and not to eliminate key lookups in the execution plan.

Hyper-Threading On versus Off

In all tests, only the execution time and CPU are reported. Statistics generation and compile times were excluded. Each test of the 22 TPC-H queries are from clean buffers and clean procedure cache. From a freshly populated database, the statistics generation time is about 40 sec. The compile time with statistics previously generated is about 4-5 sec for all 22 queries.

CPU and elapsed time were collected with SET STATISTICS TIME ON. The DMV dm_exec_query_stats reports exactly the same elapsed time, but the worker time is higher, perhaps including the compile time?

Below are elapsed time in milli-sec to execute each of the 22 TPC-C queries with Hyper-Threading off and on. There was no restriction on Max degree of parallelism so HT off is DOP 4 and HT on is DOP 8 (4 physical cores).

HT off on dur

Below are CPU (worker) time in ms for the same with HT off and on.

HT off on CPU

The sum total execution elapsed time for 22 queries is 692 sec with HT off and 559 sec with HT on. Overall, the 22 queries run 24% faster with HT. On the individual queries, all but one range from 0 to 65% faster. The exception, query 15, runs 4 times faster with HT on instead of off. Below is the elapsed time for HT off relative to HT on.

HT elapsed

The sum total execution CPU time is 2253 sec with HT off and 3374 sec with HT on. Overall, CPU is 50% higher with HT on. Of course, there are twice as much available "CPU-sec" with HT on than off, so this is not a negative. With a perfect parallel execution plan and infinitely fast IO, all (logical) cores should be at 100% during the query execution.

On the individual queries, CPU on all but Q 2 and Q15 range from 22 to 80% higher with HT on. On the Q2, CPU is 137% higher. On Q15, CPU was 45% lower with HT on than off, reflecting the 4X speedup with double the logical processors. Below is the CPU time for HT on relative to HT off.

HT elapsed

There is just the slight difference in the execution plans for Q15, shown below. The first is with HT off and the second with HT on.

Q15 HT off

Q15 HT on

I do expect HT should improve the performance of this query but it is probably reasonable to believe that most of the 4X gain in elapsed time is attributed to the difference in execution plans. But I have no idea why the Query Optimizer would not use the same plan in both cases. Perhap someone from the Microsoft QO team can help?

For anyone interested, I have provided the actual execution plans for the above HT off and on sqlplan. Too bad the web browser does not render sqlplan files!

Hyper-Threading Off and On at DOP 1

Below is elapsed time with HT off and on at degree of parallelism 1. Amazingly, performance is better with HT on at DOP 1, 20% faster!

HT off on dur

How can this be? When this is the only activity, only one of two logical cores on one of the four cores is in use. The performance should be very similar to single-threaded HT off? The range is 0 to 63% faster except Query 17 which 14% slower with HT on.

Below is CPU time with HT off and on at DOP 1.

HT off on CPU

Below is the relative elapsed time with HT off to HT on.

HT elapsed

Query 1 is a simple table access aggregating 591M rows over nearly 80GB, so there can be no difference in execution plan, and yet it is 48% faster with HT on? If someone knows how this can be, please advise?

I could speculate that with HT on, there are twice as many logical processors, so it is less likely an OS task would disrupt the SQL thread handling the running query. Perhaps is so happened that thread handling the SQL query and the thread handling IO were on the same physical processor, so data was already in L2? These are just wild guesses.

In the comments below, Chris A mentions that SQL Server 2012 tries to fit batches into L2 cache. What if it estimate the amount of available L2 too high? Then with HT on, it would presume only half that to be available? But with nothing running on the other LP, now single-thread HT on runs better than HT off?

Hyper-Threading Off and On at DOP 2

Below is elapsed time for HT off and on at DOP 2. Most of the individual queries are with 12% of each other, faster or slower, except for Query 1, 28% faster with HT on. Overall HT on is 2% faster, basically noise.

HT off on dur

Below is CPU time for HT off and on at DOP 2

HT off on CPU

Hyper-Threading Off and On at DOP 4

Below is elapsed time for HT off and on at DOP 4. The range with HT on is 2.5X slower (Query 6) to 90% faster (Query 15). Overall, 13% slower with HT on. It is possible that both logical processors are used on certain cores instead of a single logical processor of each physical core. The same may have happened at DOP 2.

HT off on dur

Below is CPU time for HT off and on at DOP 4

HT off on CPU

Hyper-Threading Off, DOP 1, 2 and 4

Below is elapsed time with HT off at DOP 1, 2 and 4. Notice that scaling is sometimes better than 2X from DOP 1 to 2. I discussed this super-scaling effect in Tpch Studies at SF10

HT off on dur

Below is CPU time for HT off at DOP 1, 2 and 4

HT off on CPU

Below is the ratio of elapsed time at DOP 2 over 1, and DOP 4 over 2. Correction: this is actually speedup from DOP 1 to 2, and 2 to 4, it is the ratio of elapsed time at DOP 1 over DOP 2, and 2 over 4. The conventional expectation is the scaling should be close to linear from DOP 1 to 2, for a value of just under 2 on the graph. From 2 to 4, scaling should be still be good but less than 1 to 2. If the test system had more cores, there should be reduced gains in scaling to the higher DOP. Query 10 is an example.

HT elapsed

As it turns out, the behavior is all over the map of possibilities. One effect is super-scaling where the parallel execution plan is actually more efficient than the non-parallel plan. Four queries (2,11,13 and 18) actually exhibit super-scaling at both DOP 1-2 and DOP 2-4.

Other effects are that the parallel plan is much less efficient than the non-parallel plan, but otherwise scales with parallelism. In this case, we would expect scaling from 1 to 2 to be poor, but good at 2 to 4. Examples are queries 6 and 15.

In addition, because of the SQL Server cost formulas, we could be expect completely different execution plans and characteristics at different degrees of parallelism.

Below is the ratio of worker time at DOP 2 over 1, and DOP 4 over 2. The conventional expectation is progressively higher CPU with increasing DOP to account for the overhead of coordinating the threads.

HT elapsed

But as with elapsed time, there are radical variations in behavior. Per super-scaling effect, CPU at DOP 2 is actually lower than DOP 1 in many cases.

Hyper-Threading On, DOP 1, 2, 4 and 8

Below is elapsed time with HT on at DOP 1, 2, 4 and 8. Notice the more predictable behavior of queries running faster at progressively higher DOP, even when the number of threads exceeds the number of physical cores. The exception is query 15.

HT off on dur

Below is CPU time with HT on at DOP 1, 2, 4 and 8. Again the expected behavior of progressively higher CPU with increasing DOP except in the DOP 1 to 2 where the parallel execution plan is more efficient than the non-parallel plan.

HT off on CPU

Below is the ratio of elapsed time at DOP 2 over 1, DOP 4 over 2 and DOP 8 over 4. Correction: DOP 1 over 2 DOP 2 over 4 and DOP 4 over 8

HT elapsed

Below is the ratio of CPU time at DOP 2 over 1, DOP 4 over 2 and DOP 8 over 4.

HT elapsed

HT Summary

Overall Hyper-threading is a significant positive for performance. There are unexpected characteristics so it is important to collect sufficient performance data for HT off and on, so that anomalies can be investigated. It is also important to have data on query performance at range of DOP values. In a production system, one should also consider the strategy for max degree of parallelism (and cost threshold for parallelism) factoring in the number of concurrently running queries (with and without parallel execution plans.

The premise is that HT performance gains on SQL Server is heavily impacted by the locking mechanism. Then the next version of SQL Server, which is expected to make extensive use of lock-free data structures should see more substantial performance gains with HT. This should also prompt Intel to support more logical processor on each physical core, similar to the 4 threads on IBM POWER.

See also Queries barely over the Cost Threshold for Parallelism and Parallelism Strategy and Comments. In the first, I mentioned that a parallel execution plan with an intermediate DOP may allocate both logical processors on one core, instead of distributing threads one per physical core. Even if using both logical processors on one core produces a 30% throughput gain, that implies each processor running at 65% of one LP used and the other idle. Then the threads on the doubled up core would take 50% longer than the cores with only one active LP.

Another question is how should the threads be distributed? Consider one thread per core, all on a single socket first, before going to the next socket, so threads share a common L3 cache. Potentially operations such as the Parallelism Repartition Streams might run better. The other option is one thread on each socket first, before coming back to the first socket. If SQL Server could then distribute the work so that threads are assigned to access pages stored on local memory as much as possible, this would improve memory performance. Of course, this would be impossible to accomplish perfectly, but perhaps good enough is possible?

Wikipedia Referenes

Wikipedia has pages on both Multi-Threading and processors with MT in one form or another. MT topics include the broader topic Multi-Threading, and the specific types of MT: Simultaneos Multi-Threading, Temporal Multi-Threading, along with the Intel brand name Intel Hyper-Threading on this topic.

Processors with MT include the IBM POWER5 (2004), POWER6 (2007) and POWER7 (2010), all 4 threads per core. POWER8 in 2013?

The Sun (now Oracle) UltraSPARC T1 (2005) is 4 threads. UltraSPARC T2 (2007) and later are 8 threads. Then SPARC T3 (2010) T4 (2011). The newest Oracle SPARC T5 (2013).

Intel Itanium acquired MT with Montecito (2006), and continued with Tukwila 2010 and Poulson (2012).