SQL Server Cost Based Optimizer
AdventureWorks, Index Seek, Key Lookups, Table Scan, Joins, Insert, Update, Delete,
Parallelism I, Parallelism II, Parallelism III,
First we can take a quick tour of Parallel Execution Plans, including the graphical plan, the details, the information available with F4 properties, and the XML plan itself.
Below are parts of parallel execution plans. In a parallel plan, there are the common operations in a non-parallel plan such as Index Seek, Index/Table Scan, Key Lookup, Join, Aggregate and so on. The is a small gold circle with two arrows at the bottom right of the symbol to indicate the operation is executed in parallel. In addition there are three Parallelism operations that support parallelism, naturally. The first shows the Parallelism (Gather Streams) operation.
Next is the Repartition Streams operation. Also in the plan is the Bitmap operation, which accompanies parallel Hash Joins and will be discussed later.
And finally, the Distribute Streams operation.
The three parallelism operations are perform the following functions: Distribute Streams send output of scalar (non-parallel) operations to a parallel operation. Repartition Streams sends output from one parallel operation to another. Gather Streams collects data from parallel operation to combine into a single stream for subsequent scalar operations, usually near the end of a parallel execution plan.
The Parallelism detail are shown below.
At the top of the XML plan is information on parallelism under the tag QueryPlan with the attribute DegreeOfParallelism.
For a specific operation, there is run time information including row count by thread.
Some of this information is accessible from the F4 key.
Consider a simple table scan, in this case, a clustered index scan. (Sorry I do not have a large heap table handy, but its cost structure is exactly the same.)
Below are the Clustered Index Scan details for Degree of Parallelism (DOP) 1, 2, 4, 8 and 16.
DOP 1 and 2
DOP 4 and 8
The IO cost does not depend on the DOP. The CPU cost is reduced by the DOP. This is also true for DOP levels not shown up to 16: 3, 5, 7, etc. This demonstration was conducted on a 32-core system. The plan cost for higher DOP than 16 is the same as for DOP 16. So apparently the last 2X in cores does not affect the plan cost.
The non-parallel plan cost was discussed in previous chapters. For this example, it is as follows. The IO cost is the in-row-data-page count of the LineItem clustered index 1,093,729 divided by 1350 (pages/sec) plus the base cost 0.003125 (810.17275 truncated). The CPU cost is the estimated number of rows 59,986,100 times 0.0000011 per row + base cost 0.000157 = 65.9848.
Curiously the Estimated Number of Rows is 59,986,100 when all the system DMVs show row count as 59,986,052. In fact, the Actual Execution Plan does show 59,986,052 rows. This is not a matter of row count estimatation based on statistics, but the total number of rows in a table should be know considering that statistics have been updated since the last insert.
The pattern is exactly the same as for an Index Scan (NonClustered).
The Index Scan details for DOP 1, 2, 4, 8 and 16:
DOP 1 and 2
DOP 4 and 8
The Index Scan IO cost 103.831 corresponds to the in-row-data-page count of the LineItem nonclustered index L_ORDERKEY_IDX as 140168 / 1350 + 0.003125 (103.831398 truncated).
The Stream Aggregate cost is the same for both Clustered Index Scan (Subtree Cost 912.148) and Index Scan (Subtree Cost 205.807), depending only on the number of rows and not the number of columns aggregated (0.000 000 6 per row).
The Stream Aggregate cost is reduced corresponding to the DOP, as shown below at DOP 2, 4, 8, and 16.
DOP 2 and 4
DOP 8 and 16
At some point the execution plan for a SELECT query with search argument changes from an index seek with key lookup to a table scan. Note that both the Key Lookup and Loop Join use the same Nested Loops operation. This was previously discussed but here it is again.
Below are the details for the index seek + key lookup plan, starting with the index seek and key lookup.
The Index Seek IO cost is determined first by calculating the number of index pages as:
(323,655 rows in seek / 59,986,052 rows total) * 140,168 pages total = 756.277 pages in seek,
with a cost of 0.003125 + 756 /1350 = 0.563125.
The Index Seek CPU Cost corresponds to 0.000157 + 323,655 * 0.0000011 = 0.356177.
The Key Lookup CPU cost is simply the number of rows 323,655 * 0.0001581 = 51.169856.
The Key Lookup operator cost is assessed by first subtracting the CPU cost to obtain the IO cost:
926.67 (operator cost) - 323,655 * 0.0001581 (CPU cost) = 875.50 IO cost.
Divide the IO cost by 0.003125 to determine the number of rows for which IO cost is assessed,
875.50 / 0.003125 = 280160, meaning 86.56% of 323,655 are assessed the IO cost. There is actually a formula to determine, but most database admininstrator are probably not that interested in this level of detail.
The last two operations in the index seek + key lookup plan are the Nested Loops and Stream Aggregate operations, details shown below.
The Nested Loops cost is simply 0.000 004 18 * 323655 = 1.353698. The Stream Aggregate cost is 0.000 000 6 per row.
Below are the details for the Clustered Index Scan plan.
The scan operation cost structure was already discussed. Both Stream Aggregates show a CPU cost of 0.194193. However the scan Stream Aggregate shows an Operator Cost of 52.982, without explanation. It would make sense that there is overhead for the excluded rows, but it would seem implied that this occurs in the scan operation. Still the cost assign to the scan Stream Aggregate is suspiciously close to the Key Lookup CPU cost of 0.0001581 per row.
This means cross-over occurs when Key Lookup IO cost is approximately equal to the table/clustered index scan IO + CPU cost. The random to sequential IO cost ratio discussed earlier is 1350/320 = 4.21875 (random IO cost 1/320, sequential IO cost 1/1350). The Key Lookup cost is reduced to 86.56% reflecting the percentage of rows assumed to require IO (there is probably a complicated formula for this depending on table size and number of rows), and the Scan cost is increase by 65.9848 or 8.14%. This factor depends on the rows per page as each page contributes 0.00074074 and each row contributes 0.0000011, and in the case corresponds to 55 rows per page.
The net effect is that at 55 rows per page, the index seek + key lookup cross-over to table scan occurs at approximately 80% of the random to sequential IO ratio 1350/320 or 3.377. The interpretation is that the cross-over occurs when the number of rows exceeds the number of leaf level pages divided by 3.377 with adjustments as decsribed.
Below are the Select details for Index Seek + Key Lookup (Subtree Cost 929.137) and Scan (929.139) execution plans.
Now let us look at how parallelism affects the index seek + key lookup cross-over point to table scan. The same pair of queries as discussed above but now with DOP 2, 4, 8 and 16.
Notice that the plan cost for the scan becomes lower relative to the seek and lookup at progressively higher DOP. We have already discussed how the clustered index scan IO cost is fixed at 810, but the CPU cost decreases from 66 at DOP 1 to 4.1 at DOP 16 and the Stream Aggregate CPU decreases from 36 at DOP 1 to 2.5 at DOP 16.
The main cost component of the Index Seek and Key Lookup plan is the Key Lookup operation, for which the cost structure does not change with DOP.
DOP 1 & 2
The CPU cost of the Index Seek and Nested Loops does decrease with DOP
DOP 1 & 2
DOP 1 & 2
The cross-over at DOP 16 now occurs as shown below, at 280,361, down from 323,725.
The pages to rows ratio at DOP 16 cross-over is approximately 3.90, and getting closer to the pure random to sequential IO cost ratio.
When an aggregate query has a group by, the aggregation has a Hash Match operation instead of Stream Aggregate (unless there is a covered index in the group by order).
The Hash Match in this example is much expensive than the Stream Aggregate (270 versus 36). The Hash Match aggregation as used here only decreases by a factor of 2 in parallel execution plans even at higher DOP, unlike other components that decrease up to 50% of the number or processors. Of course, this might be because there are only 2 distinct values of the group by expression. The Hash Match aggregation cost does not depend on the number of columns (or bytes per row).
Hash Match Aggregate details at DOP 1 and 2 with output 2 rows
The estimated rows for DOP 1 is 2 as there are 2 distinct values of L_LINESTATUS. The estimated rows at DOP 2 is 4 because there are 2 distinct values from each of 2 threads.
The Hash Match Aggregate cost depends in both the number of incoming rows, and the number of aggregated rows.
Hash Match Aggregate details at DOP 1 for 2M (PartKey) and 15M (OrderKey) rows out
The parallel execution plan for aggregate with a high number of rows output shows some variation. At DOP 2, there is a partial aggregate before going into a repartition streams for the final aggregate. At higher DOP, the output from the Line Item table is immediately repartitioned followed by the aggregation. The Hash Match aggregate costs here do continue to decrease at high DOP.