TPC-H Studies at SF10

Test System

8-way Opteron 8384 2.7GHz quad-core, 128GB
Windows Server 2008 R2
SQL Server 2008 sp1, build 2789?

The Wikipedia List of Opteron processors shows the Opteron 8384 to be a Shanghai 45nm with 1GHz Hyper-Transport links. The data transfer rate is actually 2GT/s (giga-transfers per sec) with 2 bytes (16-bits) per transfer. Later Shanghai models 8389, 8393 SE, 8379 HE and 8381 HE increased the HT to 2GHz. The Istanbul Opterons 8400-series operates HT at 2.4GHz. Magny-Cours Opterons 6100-series further increased HT to 3.2GHz.

Scaling with Degree of Parallelism (DOP)

The Big Queries

Below are the plan cost for degree of parallelism (DOP) 1, 2, 4, 8 and 16 on a 32-core system with 128GB memory.

a
 Plan Costs

Only Query 18 shows substantial reduction in plan cost with increasing DOP. And this is only from DOP 1 to 4, with very little plan cost reduction at higher DOP. This is because a high portion of the DOP 1 plan cost is in a Hash Join, which gets disproptionate plan cost reductions with parallelism. The other queries have a large portion of IO cost from tables and indexes, which do not change with parallelism.

All queries above do show continued decrease in plan cost with increasing DOP, however minimal, from the reduction in the CPU portion of many operations. There does appear to be mechanism in the SQL Server plan cost model that would cause plan cuase to stop decreasing or increase with DOP (the exception being in the step from DOP 1 to 2).

The figure below shows CPU time for the same queries.

a
 CPU time

The expectation is that CPU time should increase by some amount with DOP due to the extra operations (Distribute, Repartition, and Gather Streams) that need to occur in the parallel execution plan. In actuality, there is a wide range of behavior. CPU may rise gradually, or sharply. CPU may be flat or even drop with increasing DOP.

Below is the parallel execution speedup relative to DOP 1, calculated as the elapsed time at DOP divided by elapsed time at DOP n.

a
 Speed up relative to DOP 1

Overall scaling is strong. For most of the above queries (on this test system), scaling levels off beyond DOP 24. Query 1 actually experiences negative scaling from DOP 30 to 32. I thought this was contention in coordinating thread, but T Kejser suggested another explanation is bottlenecks in the Hyper-Transport interconnect. The test system, an 8-way quad-core Opteron, was the last to be on 1GHz (2GT/s) HT. The newer models have HT operating at 2.0-3.2GHz.

Super Scaling Queries

Certain queries exhibit superscaling. This should only occur in one step, usually from DOP 1 to 2. In Q22 this is occuring from DOP 2 to 4. There after the scaling should not be better than the DOP. Notice that four of the five queries below also exhibit negative scaling at high parallelism.

a
 Speed up relative to DOP 1

The reason for superscaling is usually the Bitmap operation that can preceed a parallel Hash Join. There is no technical reason Bitmap cannot be employed in a DOP 1 hash join. I am inclined to the rational is that the Bitmap helps large hash joins and a large hash join should have a parallel plan, so it should only be necessary to consider this for parallel hash joins. I propose enabling the Bitmap operation for DOP 1 hash join if the plan cost exceeds some value, possibly the cost threshold for parallelism. There are situations in OLTP databases that might have reason to set max degree of parallelism to 1, even though I now prefer 2-4 on high core count systems. Bitmap is beneficial in moderately large hash joins and there are valid reasons keep certain queries at DOP 1 even though a parallel plan might have been considered by default.

Below is the CPU time (absolute scale). Query 21 is off the scale, but was shown above in the Big Queries section.

a
 CPU time

Superscaling is accompanied by a drop in CPU in the step exhibiting this behavior, indicating a significant change in the execution plan. At high DOP, all of these queries sharp CPU time increases for no gain. This is an example of the situation where DOP should have been restricted if the optimizer could have modeled this phenomenon.

Below is the elapsed time by DOP.

a
 Elapse time

Below is the CPU relative to DOP 1.

a
 CPU relative to DOP 1

Small Queries

Small queries are interesting in our parallelism study. Below is the plan cost versus parallelism. We should also study parallel execution for plan costs in the 5-50 range where the SQL Server default setting enables parallelism. Even though plan costs show very little change with DOP, it is monotonically decreasing, meaning the optimizer will fire all threads if available.

a
 Plan Costs

Below is speedup relative to DOP 1. Some show no speedup beyond DOP 8. It is rather astonishing that some relative low plan cost queries actually show scaling to DOP 30. I had earlier proposed that we replace the single cost threshold of parallelism setting with a multi-value setting. Something like enable DOP 2 for plan cost 5-30, DOP 4 for 30-200, DOP 8 for 200-1000, etc. It is clear that intermediate plan cost alone is not a sufficient justification for limiting the DOP.

a
 Speed up relative to DOP 1

Below is the CPU time for the small queries. Notice the erratic behavior in Q20. Query 20 has two search arguments, each on different tables, leading to very different execution plans. There are also difficulties in row count estimate errors with propagation through the joined tables.

a
 CPU time

Below is the elapsed time.

a
 Elapse time

Negative Scaling Queries

As mentioned earlier, queries can exhibit negative scaling, even if the plan costs contine to decrease with increasing DOP. Negative scaling is usually accompanied by a sharp increase in CPU.

a
 CPU time

a
 Elapse time

a
 Speed up relative to DOP 1

Other Queries

a
 CPU time

a
 Elapse time

a
 Speed up relative to DOP 1

All Queries

Below is the normalized CPU time for each query relative to the DOP 1 query time. Certain queries, including 2, 3, 17, and 20, show sharp escalation in CPU at high parallelism.

a
 CPU time

Below is the speedup by DOP each query relative to the DOP 1 query time. Query 13 has perfect scaling to DOP 32, many have good scaling. Query 2, 17 and 20 have horrible scaling characteristics.

a
 Speed up relative to DOP 1

Compression

Below is the sum of the 22 query CPU times for the database with page compression by DOP relative to the standard organization without compression. Note that official TPC-H results are based on the geometric mean, not straight sum.

a
 CPU time

Below is the sum of the 22 query elapsed times for the database with page compression by DOP relative to the standard organization without compression.

a
 Elapsed time

Below is the relative CPU time for each query to compressed tables relative to the query to uncompressed.

a
 CPU time

a
 Elapsed time

Partitioning

Overall, the Partitioned table organization actually reduces CPU at DOP 1 and 2. At higher DOP, there is a 5-12% CPU penalty.

a
 CPU time

There is a larger penalty in elapsed time in this series of tests. This was traced to tempdb activity in the partitioned table queries, while there was no tempdb activity in the standard and compressed tables. The test system had weak storage system, causing serious delays in the partitioned tables result. So the question is why does the partitioned tables generate tempdb activity while the nonpartitioned tables do not.

a
 Elapsed time

a
 CPU time

a
 Elapsed time

Summary

There are two conclusions that should be draw from the above. One is that there is still areas of parallelism scaling that could be improved, even though SQL Server 2005 and later versions have made dramatics improvements. Second, there is a great deal of complexity in parallelism scaling. A good strategey for handling parallelism cannot be handles with just the tuning parameters cost threshold of parallelism and max degree of parallelism. Alternatively, we could investigate each query and determine the appropriate degree of parallelism for each. This is not compatible with the SQL Server self-tuning objective.

What is really needed is a more sophisticated parallelism strategy. This starts with a more realistic model of query cost. It should be able to predict with some degree of success when there are diminishing returns to continued parallelism, and even negative scaling.