Cost Based Optimizer,
CBO: AdventureWorks, Index Seek, Key Lookups, Table Scan, Joins, Insert, Update, Delete,
Parallelism I, Parallelism II, Parallelism III
On strictly intuitive grounds, what should scale? Our expectation is that a trivially parallelizable operation, one which can be divided into multiple sections, that can be processed independently with nearly no coordination between individual threads until the end at which point it is necessary to collate the output of the individual threads, should have exceptional scaling.
On the other hand, a query that is processed in partitioned units of work, for which a great deal of coordination between threads is necessary, especially if the output of one thread must be sent to another thread for further processing, the expectation is that scaling might be good initially, but will become more limited at higher degrees of parallelism.
Another interesting aspect is partitioned tables. Paritioning is a great management tool. We can load data into an empty table, then join an existing partitioned table. As data becomes older, we can switch the old partition out into a separate table without incurring the substantial cost of a high row count DELETE operation.
However partitioning is not normally a performance feature. Any performance gain from possibly having shallow B-tree depth (which is logaritmic) is going be offset from the need to perform operations on multiple trees.
That being said, consider the Hash Join shown above. In the paralllel execution plan shown, it is necessary to repartition the output of both the Orders and LineItem tables before performing the hash join. Now suppose the Orders and LineItem tables are both partitioned on OrderKey, which is the primary key of Orders, and the foreign key in LineItem back to Orders. The Orders table is still clustered on OrderDate and the LineItem table is still clustered on ShipDate.
Notice that there is no Parallelism Repartition Streams operations. The parallel execution strategy now assigns a thread to each partition. Rows in Orders within a partitioned Orderkey range will only join to LineItem rows within the corresponding partition range. Hence there is no need to repartition streams.
Lets start with the non-parallel version of this query.
The Constant Scan has an Estimated Number Rows 30 coresponding to the number of partitions. The Clustered Index Scan on Orders and LineItem and the Hash Match show Estimated Number of Executions 30, also coresponding to the number of partitions. Note, the LineItem IO cost 14.4461 x 30 corresponds to a compressed table of 4.680GB or 584,969 pages.
The Nested Loops bring the results together, and are aggregated in Stream Aggregate.
Theory maybe sound, but does it holdup? Below is the CPU time in seconds for a table scan of the SF10 LineItem with 59.986M rows, 8.75GB in the standard table structure (149 bytes per row average) without compression.
The CPU time is relatively flat to DOP 16, with only very slight increases below that. There after, CPU time increases more substantially with DOP.
Below is the speedup relative to DOP, calculated as the elapsed time at DOP divided by elapsed time at DOP n.
The straight table scan with simple Stream Aggregate has the lowest CPU cost at DOP, but relatively poor scaling to high parallelism. The aggregate with a Group By employing the more expensive Hash Match operation has better scaling, but still tops out below the maximum compute capability. It is the more expensive Hash Join with aggregation that actually has very good scaling all the way to 32-cores.
This is contrary to the theory proposed above that the amount of inter-thread coordination might be a limiter to high parallelism scaling. The simple table scan with aggregation is easily partitionable requiring no interaction between threads, and yet quickly runs into a scaling bottleneck. At first I thought this was contention main coordinator thread. An author on high-scale OLTP and data loading suggested this might also be a Hyper-Transport bottelneck. The 8-way Opteron quad-core system was the last model on 1GHz (2GT/s) HT. HT at 1GHz was fine for the original Opteron and even the dual-core Opteron processors. Perhaps 8-way quad-core is just too much.
Consider that the DOP 1 8.75GB table scan ran in 16.34 seconds for 536MB/sec. This is 67K pages per sec, or 1 page every 15 micro-seconds. At DOP 32, the query of 1.478 sec corresponds to 5,920MB/sec (185MB/sec per core) or 740K pages/sec for 1 page every 1.35 micro-sec. This brings up the question of how much work does one thread do before requesting the next block of work. The Windows operating system time increment is 15.625 milli-sec (the time-slice is a multiple of this) (10ms on the non-mp kernel, does this still exist?). Does this mean our strategy should be to divide work into units that take approximately one slice or step. The 15.625 ms step would imply 1046 pages for a simple table scan or 8MB per 15.625ms step. Now consider that the Core 2 and Nehalem cores are even more powerful than the Opteron core.
Below is the CPU-time for the same 59.986M row table with simple Stream Aggregate on 1 and 2 (integer, money or even float data type) columns. The compressed table is 4.82GB, 82 bytes per row average.
There is a 32% CPU penalty for compression at DOP 1. However, the CPU is relatively flat for increasing DOP. At maximum parallelism, the compressed table scan is actually fast than without compression! This is for data in memory. There is no disk IO in these tests.
Below is speedup for the compressed table scan relative to DOP 1. Scaling on the simple Stream Aggregate is now outstanding, even perfect for summing two columns, also for summing one column.
It is interesting that scaling is perfect for DOP 1-4 and 32, but less than perfect for DOP between 8-24. This might be because simply running a query with MAXDOP n does not guarantee the most optimal selection of processor cores. I am not sure why the Group By (Hash Match) aggregate has lesser scaling. The Hash Join still have very good scaling, but less than the simple aggregate as expected.
At maximum DOP, the compressed table with simple aggregates take 0.736 sec, faster than the uncompressed table scan! The data consumption rate is 6550MB/sec for 205MB/sec per core, which are also higher than for the uncompressed table.
Below is the CPU time for a LineItem scan with 1 column summed for the standard table, a compressed table and a partitioned table (not compreseed).
The partitioned table is slightly smaller than the standard table at 8.51GB. All three database specified fill factor 95%. Creating a unique clustered index by appending the OrderKey and LineNumber columns results in an even smaller table, due to the internal uniqueness requirement of the cluster key. There is further reduction in the size of the nonclustered index on OrderKey.
Below is the Speedup for a LineItem scan with 1 column summed for the standard table, a compressed table and a partitioned table (not compreseed).
The partitioned table scan with simple aggregate is slightly slower at DOP 1, 18.17 sec versus 16.34 for standard, but slightly faster at DOP 32, 1.245 sec versus 1.478 sec. The data consumption rate is 6,835MB/sec, higher than for both the standard and compressed tables.
Below is the CPU time for the single table LineItem scan aggregation with a Group By, employing the Hash Match Operation.
Below is the speedup for the single table LineItem scan aggregation with a Group By, employing the Hash Match Operation.
This is a more CPU intensive operation, generating 272MB/sec with a single core without compression and runs 1.944 sec at DOP 32 for 4500MB/sec total, 140MB/s per core. The compressed table DOP 32 run time is 2.037, and DCR is 2366MB/sec. The partitioned table DOP 32 run time is 1.458, and DCR is 5837MB/sec.
Below is the CPU time for a Hash Join on tables LineItem and Orders (15M rows) aggregating two columns, no grouping. Notice CPU is somewhat flat versus DOP with the partitioned tables having an anomalous bump at DOP 4.
Below is the speedup for a Hash Join on tables LineItem and Orders (15M rows) aggregating two columns, no grouping. All have very good scaling. The standard organization and compressed tables both have high row count repartition streams. The partition tables plans does not have the repartition streams operation.
Below is CPU time for a forced nonclustered index seek plus Key Lookup (1.4M rows) compared to a table scan (with SARG) for standard and compressed table. The key lookup CPU is fairly flat for both.
Below is the speedup for the same. Even though key lookup has fairly flat CPU, the speedup is very limited.
Thomas Kejser presented a session at SQLBits on data loading, explaining the locking details. I am wondering if each threading is following the protocol of first acquiring a shared lock on the root level of clustered index (which is navigated for each lookup), followed by the next level and so on to the leaf level. If so, could there be contention in acquiring the root level lock? Why would not a single lock at the root level for all threads be valid?
The DOP 1 Key Look time for 1.4M rows is 9.329 sec on the standard table (11.653 sec on compressed table), corresponding to 150K rows/sec and 120K respectively. The maximum parallel Key Lookup rows per sec is 1.168M and 663K rows/sec respectively.
The Key Lookup to a heap organzied table is moderately lower, and perhaps may have better scaling? I will try to test this with equivalent tables (I should have older results somewhere).