Home, SQL Server Cost Based Optimizer
AdventureWorks, Index Seek, Key Lookups, Table Scan, Joins, Insert, Update, Delete,
Parallelism I, Parallelism II, Parallelism III,

Parallelism II Joins

Hash Joins

The Hash Join is more complicated, depending on the number of rows from each source, the number of bytes per row from each and system memory. The plan below is for a query with the narrowest possible bytes per row.

Hash Match DOP 1

Notice the Hash Join plan cost has decreased substantially even at DOP 2

Hash Match DOP 1

Below is the Hash Match operation for the Inner Join at DOP 1 and 2 with operator cost 925 and 256 respectively. Notice that the DOP 1 Hash Match has both I/O and CPU cost components, while for DOP 2 the I/O component is zero. Note the Estimated Row Size at 9 bytes.

Hash Match DOP 1 Hash Match DOP 1
DOP 1, 2

Hash Match DOP 1 Hash Match DOP 1 Hash Match DOP 1
DOP 4, 8, 16

The Parallelism Repartition streams component is the lower of the two for the LINEITEM table at DOP 2, 4, 8 and 16 respectively. Note Parallelism costs decrease corresponding the DOP.

Hash Match DOP 1 Hash Match DOP 1
DOP 2 and 4

Hash Match DOP 1 Hash Match DOP 1
DOP 8 and 16

The Parallelism Gather Streams is relatively low cost as there are few rows after the aggregate.

Hash Match DOP 1

Below are the Hash Match join for 27 bytes per row, most from the outer source. Both the IO and CPU costs are higher than for 9 bytes per. (I will try to put together a more complete formula for the hash join by system memory, byte per row etc.)

Hash Match DOP 1 Hash Match DOP 1 Hash Match DOP 1
DOP 1, 2, 4

The AUTO_CORRELATE_DATE creates a view, that apparently can be used by certain count query without conditions.

Clustered Index Scan

Merge Joins

Below is a simple count query joining ORDERS and LINEITEM. The index hint on ORDERS is to force an actual join instead of the reference to the MS_Stats view. Because no columns are specified, the query can use any index, preferably small. In fact, when the indexes are in order, the execution plan can even be a Merge Join.

Clustered Index Scan

Below, MAXDOP set to 2, the Hash Join is now parallel but the query without join hint remains a non-parallel Merge Join.

Clustered Index Scan
Clustered Index Scan

At DOP 1, the Merge Join is far more efficient. (The Merge Join may have performance quirks in smaller queries). As indicated by both the Worktable and sys.dm_io_virtual_file_stats , there is no activity to tempdb as suggested by the Hash Match IO cost assessed.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'LINEITEM'. Scan count 1, logical reads 140603
Table 'ORDERS'. Scan count 1, logical reads 35128
SQL Server Execution Times:
  CPU time = 33805 ms, elapsed time = 34503 ms.

Table 'LINEITEM'. Scan count 1, logical reads 140603
Table 'ORDERS'. Scan count 1, logical reads 35128
SQL Server Execution Times:
  CPU time = 12823 ms, elapsed time = 12824 ms.

Below are the DOP 1 Hash and Merge Join details. The difference between the two plans is solely in this operation. Both Index Seeks and the Stream Aggregate have absolute identical costs. (The cost relative to the query is different because each query has different total cost.)

Clustered Index Scan Clustered Index Scan

From DOP 1 to 2, the Index Seek CPU plan costs are reduced by a factor of 2, for a plan cost savings of 8.25 on Orders and 33 on Line Item.

Clustered Index Scan Clustered Index Scan

Clustered Index Scan Clustered Index Scan

From DOP 1 to 2, there is a very large cost reduction on the Hash Match operation from 925 to 256, mostly because no IO in the plan. The CPU cost is reduced by somewhat less than a factor of 2, from 466 to 256.

Clustered Index Scan Clustered Index Scan

On more plan cost reduction of 18 (from 36 to 18) is assessed on the Stream Aggregate.

Clustered Index Scan Clustered Index Scan

Two new significant and one minor addition operations occur in the parallel plan, namely the two Parallelism Repartition Streams,

Clustered Index Scan Clustered Index Scan

and the Parallelism Gather Streams.

Clustered Index Scan

Note that there are 4 times more rows in Line Item than in Orders, and the difference in Repartition Streams plan cost is also about 4X, 71.9 and 18 respectively.

The new component costs incurred in the Parallelism operations is just under 90. The total plan cost for the hash join at DOP 1 is 1173.63 and at DOP 2 is 535.281. If we could force a parallel Merge Join, the cost reduction on the Index Scans, and Stream Aggregate would amount to 8 + 33 + 18 = 59. It would appear the Merge Join CPU cost reduction should be on the order of 80, for a net plan cost reduction, and yet the Merge Join remains non-parallel.

The actual DOP 2 hash join consumes about 10% more CPU, with duration just over 50% of the DOP 1 elapsed time. The super-scaling suggested by the plan cost does not occur.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'ORDERS'. Scan count 3, logical reads 140907
Table 'LINEITEM'. Scan count 3, logical reads 35198
SQL Server Execution Times:
  CPU time = 38546 ms, elapsed time = 19654 ms.

At MAXDOP 4, the merge join (forced) remains non-parallel (plan cost 411.756) and the hash join plan is less at 355.569

Clustered Index Scan

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'LINEITEM'. Scan count 5, logical reads 141147
Table 'ORDERS'. Scan count 5, logical reads 35238
SQL Server Execution Times:
  CPU time = 39343 ms, elapsed time = 10073 ms.

Only until DOP 16 is there a parallel Merge Join, which actually has a higher plan cost than the hash join.

Clustered Index Scan
Clustered Index Scan

Both plans have the same costs for the index scan operations.

Clustered Index Scan Clustered Index Scan

The difference in DOP 16 hash and merge operations are shown below. The DOP 16 Hash Match is just more 1/8th the DOP 2 cost of 256. The Merge Join CPU cost is about 16X lower than the DOP 1 Merge.

Clustered Index Scan Clustered Index Scan

However, the Parallelism Repartition Streams operations have different plan cost structure between the hash and merge joins. The two Repartition Streams operations for the hash join are shown below.

Clustered Index Scan Clustered Index Scan

The Repartition Streams operations for the merge join are shown below. The Repartition Streams operations appears to be more than 17 times higher for merge joins than for hash joins. Extrapolating back to DOP 2 (multiply by 8), this cost would have been far more than the reduction in the other components.

This explains why parallel merge joins are very rare. It might occur at high degree of parallelism, if the row count were low (impacting the repartition stream row count, and hence cost) and if the query included many columns, such that the hash join cost were very high. So the question is why is the repartition streams plan cost so high for merge joins?

Clustered Index Scan Clustered Index Scan

The table below shows hash and merge join query costs in CPU time and elapsed time, both in milli-sec. The tests are based on a single run, and will show up to 2-3% variation from run to run.

 Plan CostCPU timeelapsed time
DOPHashMergeHashMergeHashMerge
1 1173.61411.75633,80512,82335,12812,824
2 535.281 38,546  19,654 
4 355.569 39,343  10,073 
8 265.713 38,328  5,315 
16220.785346.20839,20545,5742,75512,250
32  44,03752,7881,6888,278

The hash join CPU rises from DOP 1 to 2, (as there is no bitmap operation which can reduce query cost). From DOP 2-16, the hash join cost is steady, and then rises 11% from DOP 16 to 32. The Merge Join CPU cost is 12,823 at DOP 1, jumping to 45,574 at DOP 16, yet with no meaningful reduction in elapsed time (12,824 to 12,250). Only at DOP 32 does the merge join show reduced elapsed time, but not as fast as the hash join join.

More Parallelism

Remaining topics to be filled in. Loop Join is the same as key lookup, but some investigate may be helpful. Sort,