Previously I had discussed SQL Server parallelism, with my thoughts on the best settings for: Cost Threshold for Parallelism (CTP) and Max Degrees of Parallelism (MAXDOP) in Parallelism Strategy and Comments. At the time, I had intended to follow up with detailed measurements. So now a mere 2 years later, here it is. The general thought was that CTP should be raised from the default value of 5, and MAXDOP should be changed from unrestricted, on modern systems with very many cores, and most especially on systems with Hyper-Threading. However reasonable each persons ideas/suggestions are, nothing is better than hard data.
To investigate this matter, we need to identify the queries with the lowest plan cost
(and lowest actual cost, as this is different)
that can have a parallel execution plan
with the default Cost Threshold of Parallelism setting.
(I will provide details on parallelism for the index seek later).
The primary test system is now a 2-socket Intel Xeon 5670 six-core 2.93GHz (32nm Westmere-EP) with Hyper-Threading enabled (a total of 24 logical processors). Some references are made to test results on an earlier system with 2 x Xeon E5430 quad-core 2.66GHz (45nm Core 2) without Hyper-Threading.
Measurements were also made on a second test system with 2 Xeon E7-46xx 10-core 2.4GHz (Westmere-EX) also with Hyper-Threading enabled. This was a production system with some live activity. The expectation is that results are low DOP are unaffected by other activity, but query execution at higher DOP may not occured at the expected DOP.
The test data is built using the TPC-H data generator, initially employing derivatives of the Part table. At scale factor 10, the Part table is 2M rows, about 285MB, 55 rows per page or 149 bytes per row. The derived tables have 2 additional 4-byte integer columns, size 301MB, 52 rows per page or 157 bytes per row. Note: the IO cost of the first page is 0.003125. Every 1350 pages additional pages contributes IO cost 1.
Below is the non-parallel execution plan for an index seek aggregating 350K rows.
At 157 bytes per row or 52 rows per page, about 6719 leaf level pages must be touched, resulting in IO cost of (6719-1)/1350 + 0.003125, roughly in-line with the IO cost of 4.979 shown below.
Below is the parallel index seek plan at DOP 2.
The details below show that IO cost is the same, and CPU cost is reduced by the degree of parallelism.
The Parallelism operator has a base cost of 0.285 (depends on the number of rows/streams?).
The smallest hash join on the modified Part table that results in a parallel execution plan occurs at around 100,000 rows. Lets first examine the non-parallel plan.
The three major components of this plan are the outer and inner source index seeks, and the hash match. The Stream Aggregate is small to the major components.
The IO cost of a 1.42 corresponds roughly to a range scan/seek of 15MB (IO cost 1 is 10.5MB). The actual logical IO is 1931, very close to the values of (1.42446 - 0.003125) * 1350, and includes a few upper level index LIOs.
Below are the Hash Match and Stream Aggregate details. Note that the Hash Match has the largest cost of all operations in this execution plan, and the entire cost is in the CPU element. The 2 index seeks operations have their cost mostly in the IO element, with only 7.1% in the CPU.
Below is the parallel Hash Join execution plan at DOP 2.
The Outer and Inner Source index seeks have identical costs so only the first is shown. As covered in Cost Based Optimizer Parallelism I the CPU cost is reduced by the degree of parallelism, but the IO cost does not change, which I call a saturated IO model.
The Hash Match details above and the Stream Aggregate details below at DOP 2. Both operations have cost reduced by the degree of parallelism as both elements have their costs entirely in the CPU portion.
The Parallelism operation, as its name implies, only occurs in parallel execution plans adding cost 0.0285. The cost structure of this operation is proportional to the number of rows, which implies that high row count queries are less likely to have a parallel execution plan as the cost of reconstituting threads may outweigh the formula benefit of parallel operations.
The second test query is a loop join. A parallel execution plan occurs at around 8000 rows for high degrees of parallelism. At DOP 2, this occurs around 10000 rows, which will be used for test purposes. The non-parallel loop join plan is shown below.
The outer and inner source operation details are shown below. Note the inner source subtree cost of 29.0264, based on number of executions 10000. The CPU component should be 10000 * 0.0001581 = 1.581. Then the IO component is 29.0264 - 1.581 = 27.4454, which is approximately equal to 8782 * 0.003125. This is an estimate of the number of physical IO for 10000 executes. The assumption is that some of the pages would have been previously loaded into memory during the execution of this query, but not yet evicted.
There are 38465 leaf level pages in both tables. The alternate plan for this query is a hash join with a scan on the inner source table. In this plan, the table scan would have IO component 28.5, CPU 2.2 and the hash match would contribute another 9. So it would take another 35% more rows before the non-parallel loop join plan to naturally shift to a hash join. In parallel execution, the hash join cost is reduced proportionate to DOP.
The Nested Loops and Stream Aggregate details are shown below. The Nested Loops is only 0.14% of the overall plan cost, and the Stream Aggregate even less.
Below is the parallel loop join query plan at DOP 2.
The parallel outer and inner details are below. Note the very minor reduction in outer source CPU from 0.155756 to 0.150178. There is no change in the inner source operation, not even the element attributed to CPU.
The Nested Loops and Parallelism details are below. In the parallel plan, there is a CPU reduction of 0.005 from the outer source, 0.0209 from the Nest Loops, and 0.003 from the Stream Aggregate, to just ever so slightly overcome the Parallelism cost of 0.0285.
The total plan cost for the non-parallel loop join is 29.23 and the parallel plan cost is 29.229. If this model was even remotely accurate, we should ask why bother to switch to a parallel plan for such a minute gain. It turns out that the SQL Server parallel execution plan cost estimate is nothing close to the true cost structure, which raises a completely different set of questions, starting with: how can SQL Server be expected to generate good parallel (or not) execution plans if the cost model is completely different than the true cost structure?
Below is the hash join performance in rows per sec (left vertical scale), and (worker or CPU) cost in micro-sec per row (right vertical scale), both versus degree of parallelism. The join involves 100K rows and the test tables are 2M row each.
Hash Join 100K rows versus Degree of Parallelism on 2M row table - Westmere EP
At DOP 1, the performance is just over 1M rows/sec, scaling well to DOP 4, levels off for DOP 6-12 at just over 4M rows/s, then jumps again at DOP 14. The peak gain from parallelism is 6.646 speed up at DOP 20 over DOP 1. The cost is just under 1 us per row at DOP 1, rising as high as 2.6 us per row at high DOP, more so if HT is involved? At DOP 1, the query execution time for 100K rows is 95ms.
It appears that at DOP 2 and 4, the individual threads are running on logical processors in different physical cores, hence the excellent scaling. At DOP 6-12, some of the logical cores are on the same physical cores. Hyper-threading does improve performance to a moderate degree in SQL Server depending on the operation.
Below is the performance graphs for the same 100K row hash join, only on 20M row tables. The performance is virtually identical. The hash join actual cost at a particular size does not depend on the table size.
Hash Join 100K rows versus Degree of Parallelism on 20M row table - Westmere EP
Below are hash join performance graphs for Westmere-EX
Hash Join 100K rows versus Degree of Parallelism on 2M row table - Westmere EX
Hash Join 100K rows versus Degree of Parallelism on 20M row table - Westmere EX
(I worked on a custom b-tree search engine with no locking protection. The scaling was essentially linear regardless of physical or logical cores from 1 to 24 threads. This may seem fantastic, but it does make sense because a b-tree search is just serialized memory accesses. Fetch a memory location, which determine the next memory location to fetch. Note the second memory access cannot begin until the first is complete. The core clock rate for 3.3GHz is 0.3ns. Memory access latency is around 50ns for local node, and 100ns for 1 hop remote node, corresponding to 150 or 300 CPU-cycles respectively.)
If this assessment is correct, then it would suggest the proper strategy for the SQL Server engine in parallel execution plans is to allocate 1 worker thread from a logical processor on each physical core, perhaps allocating from the cores on the same socket before proceeding to the next socket. Only if the desired degree of parallelism is greater than number of cores should two logical processors be allocated from any single physical core, and this should only occur in unusual circumstances.
Note that a degree of parallelism greater than the number of physical cores but less than total logical core must run unbalanced, some cores supporting one thread, other two. If each thread has equal work, then this situation will probably be worse than 1 thread per core.
The next level of sophistication would be to match thread/core assignment to memory nodes. For example, an evenly partitioned table (as in hash key partitioning) would have any individual partition entirely on a single memory node. Threads would be assigned to cores local to the memory node. and assign threads to cores such that most memory access are to the local node. This effort would be hugely complicated, and only applicable to special circumstances, but a true fanatic would not be deterred.
Below is the performance and cost structure for a hash join on the full table of 300MB and 2M rows. The cost per row is higher 1.495 us per row at DOP 1, rising to 3us at high DOP. However scaling is better with a peak speedup over DOP 1 of 11.35 and the peak rows per sec is also higher than the previous query.
Hash Join - 2M rows versus Degree of Parallelism - Westmere EP
The same full table scan query was tested on a 3GB table with 20M rows, as shown below. The DOP 1 cost and performance was about the same, slightly lower actually at 1.376 us per row. The scaling was better with peak speedup at 14.95 and peak performance at nearly 11M rows/sec.
Hash Join - 20M rows versus Degree of Parallelism - Westmere EP
I am not sure why the cost per row in this query is higher for the full table over the limited range of 100K rows. There was no tempdb activity in either case. One speculation is that the hash join intermediate results remain L3 cache (15M), and results in lower access time than off-die memory accesses.
A test with small and large merge joins, also exhibits the same behavior. Is it possible this is due to local and remote node memory locations? Life is not simple on NUMA systems with HT.
OK, in retrospect, the cost of a hash join should depend on the size of the intermediate result, as this requires more work? Or is it because the 100K row hash join specified GroupID = @P1 on both tables, hence did not have to be carried into the intermediate table. The full table hash join on the other hand joins on 2 columns both of which have to be built into the intermediate table? This would explain why the hash joins at table size 300MB and 3GB has approximately the same cost per row, both having the same number of join columns.
Hash join performance on Westmere-EX
Hash Join - 2M rows versus Degree of Parallelism - Westmere EX
Hash Join - 20M rows versus Degree of Parallelism - Westmere EX
Both full table hash joins (300MB and 3GB) demonstrate that parallel execution plan scaling is better for larger queries.
Below is the Loop Join performance in rows per sec and cost in usec per row for a query of 10000 rows on the 300MB tables. Notice that performance peaks at DOP 6 and then degrades at higher DOP. The peak speedup over DOP 1 was 4.4.
Loop Join 10K rows, 2M row table versus Degree of Parallelism - Westmere EP
The cost at DOP 1 is 1.827 usec per row. For 10000 rows, the execution time is a mere 18 ms. (I initially mis-stated this as 1.8ms) That this query benefit from parallel execution at all is amazing. Of course it also raises the serious question as why SQL Server would go to the effort of a parallel execution plan for a query that runs in 18ms.
This is due to the antiquated IO based cost model and the default CTP setting of 5.
Based on the 100K row hash join hash with plan cost just over 5 and actual query time of 95ms,
we might consider Cost Threshold for Parallelism around 25,
making the theshold at around 0.5 sec.
However, based on the 10K rows loop join with plan cost 29 and actual query time of 1.8ms,
CTP of 25 would point to a 50K rows loop join with actual execution tim 10ms.
However, the 10K row loop join has plan cost 29, and would be eligible for a parallel
plan even though actual execution time is only 18ms.
It would be necessary to increase CTP to perhaps 3000 to suppress parallel loop join
with actual cost less than 100ms?, which is far to late for the hash join plan.
The disparity between Loop and Hash join model and actual cost does allow a good strategy for the Cost Threshold for Parallelism setting without a complete overhaul of the SQL Server cost model. And this is not something the SQL Server team seems to be will to tackle. Another possiblilty is for the Cost Threshold for Parallelism setting to only consider the CPU element of the plan cost. Too bad we do not have access to source code to investigate this.
The outer and inner source tables were populated and indexed in a manner such that each row from the outer source joins to a row in a different page of the inner source table. When the rows join to consecutive rows in the inner source table, the cost per row is even lower at 1.33 usec per row.
The Index (b-tree) Depth of this table (clustered index) is 3. The index upper levels should have more than 300 entries. The leaf level density is 52 rows per page. So 3 levels should accommodate approximately 5M rows?
Below is the Loop Join performance and cost for 100K rows on the 3GB table wit 20M rows.
The index depth is now 4 versus 3 on the 2M row table.
Apparently the index depth is still 3?
The cost is now higher at 2.25 usec per row.
On the positive side, performance continues to increase for higher DOP.
The peak speedup over DOP 1 was 8.26.
Loop Join 100K rows, 20M row Table versus Degree of Parallelism - Westmere EP
In double and tripling checking the interpretation, below is the Loop Join performance for 10K rows on the 20M row index depth 4 table.
Loop Join 10K rows, 20M row Table versus Degree of Parallelism - Westmere EP
Below are loop join performance graphs for Westmere-EX
Loop Join 10K rows, 2M row Table versus Degree of Parallelism - Westmere EX
Loop Join 10K rows, 20M row Table versus Degree of Parallelism - Westmere EX
Loop Join 100K rows, 20M row Table versus Degree of Parallelism - Westmere EX
In examining the CPU usage during the tests, it was evident that in some cases, only one logical processor of a physical core was used. In other cases, both logical processors on some cores were used. This leads to the uneven scaling versus DOP.
Another point interest is throughput with parallel execution plans is supporting concurrent sessions. The figure below shows performance in rows per sec at various DOP settings versus number of concurrent sessions. The results based on the number of queries completed in a 10-sec windows. Since the single session DOP 1 query run-time is 100ms, this should be reasonably accurate.
Hash Join Performance (rows/s) by DOP versus # of Concurrent Sessions
SQL Server parallel execution plans are great for a single session, even for queries that run in so short a time that parallelism should have never been considered. However there are definitely issues with parallel execution in throughput tests.
Loop Join Performance (rows/s) by DOP versus # of Concurrent Sessions
It is stressed that the queries used in the tests above run in 95ms and 1.8ms respectively at DOP 1. So it is not surprising that there are significant throughput issues in attempting concurrent parallel execution with such minuscule queries. I present these as examples because they are among the smallest queries for which SQL Server engages parallel execution with the default settings.
This is because the default setting are seriously obsolete. The original setting that targeted a 5 sec threshold for parallel execution on a Pentium Pro 166MHz (I think this a 0.8um or 800nm design) does not work well on modern microprocessor at 3GHz and 32nm design.
That said, SQL Server does have issues with concurrent parallel execution throughput on properly sized queries. So some effort here would be helpful. But the priority is below index on hash key, and parallel write operations.
The initial release neglected the simple index seek test
Below is the performance and cost characteristics for a simple index seek aggregating 1 float type column for 350K rows. The cost per row is 0.1734 usec at DOP 1. Considering that the hash join test above does 2 index seeks and summed 2 float columns, we could speculated that the cost of the bare hash join is 0.952 - 2x0.1734 = 0.605 usec per row.
Index Seek Sum - Westmere EP
Below is the characteristics for a simple index with only COUNT(*). The cost is now only 0.0783 usec per row. This implies that the float type column sum costs 0.095 usec per row.
Index Seek Count - Westmere EP
The bare index seek cost of 0.0783usec per row also amortizes the cost of the page access. At 52 rows per page, the full page cost including rows is 4.07usec. Previously, I had assessed that the bare page access cost was 1us on the Core 2 platform. The Westmere processor core is more powerful and mostly importantly, has significantly lower memory round-trip time. So I am expecting the bare page cost on Westmere to be less than 1usec.
Below are index seek performance graphs for Westmere-EX
Index Seek Count - Westmere EX
Index Seek Sum - Westmere EX
It might seem that we a quibbling over fractions of 1usec. But we need to consider that 1us on a modern microprocessor in 2-3,000 CPU-cycles, and this on a processor core that can execute multiple intructions per cycle. Those of you with very good friends on the Microsoft SQL Server engine team might persuade them to show you the source code and compiled binary/assembly. Even without looking source code, it is clear that there are not nearly that many instructions to touch a page, row and column.
The CPU-cycles are spent in the lock mechanism and the memory-round sequences. This is why there significant differences between a traditional database engine with all data in-memory and the "in-memory" database engine that have completely different data organization. It might better to refer to traditional database engines as page-row structures and the new stuff column-oriented structures.
It is definitely clear that the default settings for Cost Threshold for Parallelism (5) and Max Degree of Parallelism (0 - unrestricted) are seriously obsolete and should be changed as standard practice on SQL Server installation.
It is not clear there is a single universal best value for Cost Threshold for Parallelism. I think the more important criteria is that there should not be a high volume of concurrently running queries with a parallel execution plans. Whatever the plan cost of the high volume queries are, the CTP should be set above it. Of course, the other important action is to reduce the plan cost of the high volume queries, assuming this correlates to improved actual query execution time.
The strategy for Max Degree of Parallelism is also unclear. Before the advent of multi-core processors, a good strategy was to disable parallelism on transaction processing systems. It was also critical in the old days to ruthlessly enforce a restriction to only pure transactions too. Today, we have 32-40 cores on transaction systems, and almost everyone runs complex queries on it. So parallelism is good.
Ideally I would like to restrict MaxDOP to the number of physical cores on a Data Warehouse system, and less than that on a transaction system. I would also like to get great scaling with parallelism. But because SQL Server does not balance the threads to one logical processor per phyical core, this strategy does not work. So this is something the SQL Server team needs to address.