The SQL Server Cost Based Optimizer

Joe Chang jchang6@yahoo.com

Last year, I started what was to be a series of articles about the SQL Server execution plan cost model, i.e., the Cost Based Optimizer (CBO). At the time, I could not upload images and an explanation without pictures is very tedious to follow. This is the same topic, except that I can now upload images, so important points are now illustrated. The coverage here includes Index Seek, Key Lookup (formerly Bookmark Lookup),Table Scans, Nested Loops, Hash and Merge joins. In SQL Server 2005 and later, Key Lookups to a clustered index and the Nested Loops (or loop join) are now the same operation.

Long ago I wrote in detail explaining how the formulas used by SQL Server 2000 CBO can be derived using a set of tables populated to a range of specific row and page values. The cost model changed from SQL Server 2000 to 2005, but the full detail derivation of the exact SQL Server 2005 and 2008 cost model is not provided here. Rather, the Adventure Works database will be used to demonstrate basic points of the CBO model.

Since Adventure Works database is populated from the same CSV files, the row count should be the same from instance to instance. However, it is possible that there may be minor variations in size. Also, there is no assurance that the statistics generated to represent the data distribution is exactly the same. The execution plan can only use the data distribution statistics for row estimates, so it is always necessary to check the data distribution statistics to generate essentially the same results as described here.

For the Adventure Works database used in examples here, the Sales SalesOrderHeader table has 31,465 rows, and 5600KB data or 700 pages of in-row-data, as shown below.

exec sp_spaceused 'Sales.SalesOrderHeader'

SELECT in_row_data_page_count, in_row_used_page_count, in_row_reserved_page_count

FROM sys.dm_db_partition_stats p

WHERE p .object_id = OBJECT_ID ('Sales.SalesOrderHeader' ) AND index_id < 2

Below is the data distribution statistics. The default sample for low row counts should be a full sample. Not every statistics sampling will be represented by exactly the same sample set, particularly in the Range High Key values, for which the number of Equal Rows is an integer. The key values in between the Range High Key values (the Range Rows) are characterized with an Average Range Row distribution.

DBCC SHOW_STATISTICS( 'Sales.SalesOrderHeader' , IX_SalesOrderHeader_CustomerID )

Finally check the state of parameterization to ensure that the execution plan estimate rows represent the exact query, instead of using parameterized values.

SELECT is_parameterization_forced FROM sys.databases WHERE name = 'AdventureWorks'

Index Seek and Key Lookup

The query below has an execution plan with a single row (nonclustered) index seek followed by a single row key lookup. For this execution plan with exactly the same row estimates, use any CustomerID for which the distribution statistics Range Hi-Key column has one Equal Rows.

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11935

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 13497

Below are the Index Seek and Key Lookup details for the query plan shown above, using SSMS 2005.

Below are the same details shown from SSMS 2008. Notice the extra line for Estimated Number of Executions.

Observe that for both the index seek and the key lookup, the Estimated IO Cost is 0.003125 which is exactly 1/320. Per SQL Server Books On-Line documentation, stating that plan cost represents the estimated elapsed time in seconds on a specific hardware configuration, the interpretation of this is that the cost model is based on reference system performance of 320 IOPS for 8KB random IO. The Estimated CPU Cost is 0.0001581 for both operations. In both the Index Seek and Key Lookup details, the Estimated Number of Rows is 1. For the index seek, this means one row is expected from the index seek. For the Key Lookup, this means one row is expected from each row of the index seek, not the total rows from the key lookup operation. Below are two lines from the XML plan pertinent to the Key Lookup. The Key Lookup is actually a Clustered Index Seek when the table has a clustered index, as opposed to being heap organized.

<RelOp AvgRowSize="31" EstimateCPU="0.0001581" EstimateIO="0.003125" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="1" LogicalOp="Clustered Index Seek" NodeId="4" Parallel="false" PhysicalOp="Clustered Index Seek" EstimatedTotalSubtreeCost="0.0032831">

<IndexScan Lookup="true" Ordered="true" ScanDirection="FORWARD" ForcedIndex="false" NoExpandHint="false">

<ObjectDatabase="[AdventureWorks]" Schema="[Sales]" Table="[SalesOrderHeader]" Index="[PK_SalesOrderHeader_SalesOrderID]" TableReferenceId="-1" />

The Nested Loops operation brings together the data retrieved from the Index Seek and the Key Lookup. The Estimate I/O Cost is zero, and the CPU cost shown for a single row is 0.0000042 (Examining higher row count operations show that this is actually 0.00000418 rounded to 7 places beyond the decimal point). The Estimated Subtree Cost represent the full cost of the current operation and all operations the feed into it.

The query below uses a CustomerID for which the data distribution Range High Key has Equal Rows 2.

SELECT SalesOrderID, TotalDue FROM Sales.SalesOrderHeader WHERE CustomerID = 11361

Details from SSMS 2005.

Details from SSMS 2008.

Observe that the Index Seek has an estimated number of rows 2, while the key lookup shows estimated number of rows 1. The estimated IO cost for the Index Seek is still 0.003125, but the estimated CPU cost is 0.0001592, which is 0.0001581 for the first row and 0.0000011 for each additional row. The Key lookup shows the same estimated IO cost 0.003125 and CPU cost 0.0001581 as for single row query, but the Estimated Operator Cost is now 0.005002 instead of 0.0032831.

The XML plan details pertinent to the Key Lookup for 2 rows from the Index Seek are shown below. Note the Estimate Rebinds value, which is not shown in the graphical detail.

<RelOp AvgRowSize="31" EstimateCPU="0.0001581" EstimateIO="0.003125" EstimateRebinds="1" EstimateRewinds="0" EstimateRows="1" LogicalOp="Clustered Index Seek" NodeId="4" Parallel="false" PhysicalOp="Clustered Index Seek" EstimatedTotalSubtreeCost="0.00500202">

The details for 3 and 4 rows from the Index Seek are shown below.

<RelOp AvgRowSize="31" EstimateCPU="0.0001581" EstimateIO="0.003125" EstimateRebinds="2" EstimateRewinds="0" EstimateRows="1" LogicalOp="Clustered Index Seek" NodeId="4" Parallel="false" PhysicalOp="Clustered Index Seek" EstimatedTotalSubtreeCost="0.00706628">

<RelOp AvgRowSize="31" EstimateCPU="0.0001581" EstimateIO="0.003125" EstimateRebinds="3" EstimateRewinds="0" EstimateRows="1" LogicalOp="Clustered Index Seek" NodeId="4" Parallel="false" PhysicalOp="Clustered Index Seek" EstimatedTotalSubtreeCost="0.00999399">

The interpretation is that the cost assessed for Key Lookups involves the number of Rebinds, which appears to be the number of additional rows that require a key lookup after the first row. The Estimated Number of Rows represents the number of rows per key lookup (there can be only one row per key lookup, but there can be multiple rows from a loop join the inner source index seek).

The queries below are selected based on having being Range High Key values for various Equal Row values ranging from 1 to 28.

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 12375 -- 1

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11361 -- 2

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11965 -- 3

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 12655 -- 4

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11784 -- 5

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11439 -- 6

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 233 -- 8

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 191 -- 9

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 126 -- 11

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 54 -- 12

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11519 -- 16

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11019 -- 17

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11566 -- 25

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11185 -- 27

SELECT SubTotal FROM Sales.SalesOrderHeader WHERE CustomerID = 11091 -- 28

The table below shows the Estimate Executions from SHOWPLAN_ALL in the first column, the Total Subtree Cost in the second column, the net cost after subtracting 0.0001581 per row from the Total Subtree Cost, and finally the net cost divided by the IO cost for a single operation: 0.003125.

Estimate ExecutionsTotal SubtreeCostTotSubtreeCost -rows*CPUNet/0.003125
10.00328310.0031250001.0000
20.0050020260.0046858261.4995
30.0070662870.0065919872.1094
40.0099940130.0093616132.9957
50.013263740.0124732403.9914
60.016529020.0155804204.9857
80.023046270.0217814706.9701
90.026298260.0248753607.9601
110.032788970.0310498709.9360
120.036027710.03413051010.9218
160.048938740.04640914014.8509
170.052155540.04946784015.8297
250.077733170.07378067023.6098
270.084084250.07981555025.5410
280.087253330.08282653026.5045

It is apparent that the CPU cost of 0.0001581 is assessed for each execution of the key lookup (or for each row of the index seek), but only a percentage of the IO cost of 0.003125 is assessed. The figure below shows the percentage of index rows for which the IO cost is assessed.

IO percentage versus rows

The assumption is that for each subsequent key lookup, there is a chance that the required page is already in-memory, so no additional IO cost need be assessed. For a small table, the number IO assessed has an upper bound limited by the number of data pages used. For a larger table, a previously loaded page may have been evicted, so the upper bound is not limited by the number of data pages.

Table Scan

The query below has an execution plan with a table scan.

SELECT SalesOrderID, TotalDue FROM Sales.SalesOrderHeader

The Clustered Index Scan IO cost is 0.520903. This is almost exactly 0.003125 plus 0.00074071 (1/1350) per additional data page. The CPU cost is 0.0347685, which is approximately 0.000001 per row plus the base cost.

Per the definition of the SQL Server plan cost, this implies that the reference system can perform a table scan at 1350 pages per sec, or 10.54MB/s (1M = 1024KB). The most recent generation 15K disk has a sustained pure sequential transfer rate of over 150MB/sec on the out tracks. For disks in an enclosure, typically 15 disks sharing one x4 SAS port (1,100MB/sec) or one 4Gbit/s FC port (390MB/s) the sustained sequential transfer per disk is less. Of course, at the time when these formulas were set, the sequential transfer rate of a high-performance was much less, possibly in the 4-10MB/sec range.

It should also be considered that a table scan is not necessary or even frequently a pure sequential operation at the disk level. The actual blocks allocated to a table could be highly non-contiguous. So another interpretation of the 1350 pages per second might the number of 64KB extents per second (168.75x8 = 1350). This cost structure might also reflect the IO rate in the presence of a mix of 8K and 64K IO.

The bigger issue is that the SQL Server Cost Based Optimizer uses the fixed model for random and pseudo-sequential (or large block) IO performance. There is no consideration for whether data is in memory, or the actual storage system performance characteristics. The IO cost structure is based on random 8K IO at 320 IOPS and sequential at 1350 IOPS, meaning each page of a key lookup or loop join inner source index seek (with adjustment for the percentage of pages that require IO) has approximately the same IO cost of 4.2 pages in a scan operation.

Execution Plan Cross-Over

For any database engine, at some point, the execution plan for a table scan has a lower cost than an index seek followed by a key lookup. For SQL Server, the cross-over point occurs when the number of key lookups is approximately equal to the number data pages divided by 4.2 (or 4.43 based on a full key lookup of 0.003125 + 0.0001581). Some people have guessed that the cross-over point occurs when the number of rows requiring a key lookup exceeds a certain percentage of the rows in the table. In fact, a close examination of the cost structure indicates that it is actually based on the ratio of key lookup rows to data pages, not total rows.

A curious coincidence occurs when the clustered index has depth is 4. Each key lookup requires 4 logical reads. So the point at which the execution plan cross-over occurs is also approximately the point at which the number of logical reads crosses over. This is coincidence when the index depth is 4, and does not occur when the index depth is not 4.

The queries below illustrate the Index Seek + Key Lookup to Table Scan cross-over.

SELECT SalesOrderID, TotalDue FROM Sales.SalesOrderHeader

WHERE CustomerID BETWEEN 11091 AND 11151

SELECT SalesOrderID, TotalDue FROM Sales.SalesOrderHeader

WHERE CustomerID BETWEEN 11091 AND 11152

The execution plans and details are shown below.

The estimated rows for the first query parameters are 199.8 and 202.3 rows for the second query parameters. The actual rows are 199 and 200. Notice that the plan cost for the first query is actually slightly higher than for the second query (0.577118 for the Index Seek + Key and 0.555671 for the Clustered Index Scan). Notice that the plan cost for the second query is exactly the same as the unconstrained table scan for all 31,465 rows. The full Key Lookup cost is 0.572781, which works out 0.5380676 after subtracting out 0.003125 and 199.8 times the CPU cost of 0.0001581, which is 172.2 times the base IO cost of 0.003125, meaning the IO cost is assessed for 86% of rows.

Now let’s examine the actual cost versus logical reads. All tests here are done with all data in memory. The simplest way to get cost is from the DMV dm_exec_query_stats. It is should be pointed out that this measurement method exhibits significant variations from run to run. A full load test with a multi-threaded load generator driving the database server to near 100% negates the need to measure individual SQL statement level CPU and yields much more consistent results from run to run.

The examples here use the following statement to display CPU and logical reads.

SELECT SUBSTRING(t.text, statement_start_offset/2 + 1, (CASE WHEN statement_end_offset =-1 THEN LEN(CONVERT (nvarchar( max),text)) * 2 ELSE statement_end_offset END - statement_start_offset)/2) AS [Stmt]

, q.execution_count AS ExecCnt, q.total_worker_time AS WorkerTime, q.total_logical_reads AS LogicalReads

FROM sys.dm_exec_query_stats q

CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS t

The query execution statistics for the query with slightly different search arguments demonstrating execution plan cross-over are shown below along with an unconstrained table scan. The plan costs for all three are about same or exactly the same. Notice that the Logical Reads is also comparable between the two plans (63656 and 70400).

The Worker Time (CPU) in microseconds is much higher for the scan than for the index seek with key lookup. Of course, there is no physical disk IO in this test run, while the plan cost is based on IO and plan cost represents estimated elapsed time, not CPU (worker) time. Also note that the table scan returning all rows is substantially more expensive than the query returning 200 rows (while using the table scan in the execution plan).

The execution stats below are for queries with the same search parameters, except that the result set is aggregated. The plan costs are slightly higher with a Compute Scalar operation for the aggregation, but the actual worker time is now lower. When a very low volume of results are sent to the client connection, and all data is in memory, and the execution plan is not a parallel plan, and there is no contention for resource, then the worker time is essentially the same as the elapsed time.

The execution stats below are the query set except that NOLOCK or TABLOCK is specified.

Below are the execution stats for another sequence of tests. This time, the test sequence leads with a dummy query to make sure that the query order is not important. A query has been added with same search parameters as the table scan plan, with a hint to force the index seek. The query forcing the index seek with BETWEEN 11091 AND 11152 is slightly less expensive than the unforced index seek plan with BETWEEN 11091 AND 11151. One should probably not interpret too much from this due to the run to run variations of this methodology.

The main conclusion is that the SQL Server plan cost model makes fixed assumption on disk IO time, and is not a good representation of actual worker time or elapsed time when all data is in memory. Presumably, if the disk system has exactly the 4.2:1 sequential to random 8K IO ratio, and sufficient buffer cache for the index root and intermediate levels (but not the leaf level), then the plan cost model would be accurate.

Nested Loops, Hash and Merge Joins

Since SQL Server is a relational database, we are most interested in joins. The three main types of joins are Nested Loops, Hash and Merge, the details of which are described in BOL. For some reason, Hash and Merge joins are poorly understand and frequently associated with poor performing queries.

The query below joins two tables with a search argument on one table yielding exactly one row.

SELECT SalesOrderID FROM Sales.Customer c

INNER JOIN Sales.SalesOrderHeader o ON o.CustomerID = c.CustomerID WHERE c.CustomerID = 13497

The execution plan with details is shown below. The upper symbol (the Clustered Index Seek in this case) to the Nested Loops Inner Join is the outer source, and the lower symbol is the inner source.

Warning

The operations here are exactly the same as in the key lookup, except that now the inner source can have multiple rows.

In an actual production database, there are frequent joins between the Customers and the Orders tables and between Orders and Order Details. Each of these joins are one-to-many. For illustrative purposes here, it is desirable to demonstrate one-to-one joins. For this reason, a duplicate copy of the SalesOrderHeader table is created as SalesOrderHdr1 with the same primary key clustered on SalesOrderID.

The queries below involve a one-to-one join for 213 rows for the first SARG parameters and 214 rows for the second. The normal (not hinted) execution plan for the first query, with the first SARG parameters is a loop join. The next three queries all use the second SARG parameters. The second query forces a loop join with a hint, the third query does not hint, and the fourth query hints a hash join.

SELECT d.SubTotal FROM Sales.SalesOrderHeader h

INNER JOIN Sales.SalesOrderHdr1 d ON d.SalesOrderID = h.SalesOrderID

WHERE h.CustomerID BETWEEN 11091 AND 11161

SELECT d.SubTotal FROM Sales.SalesOrderHeader h

INNER LOOP JOIN Sales.SalesOrderHdr1 d ON d.SalesOrderID = h.SalesOrderID

WHERE h.CustomerID BETWEEN 11091 AND 11162

SELECT d.SubTotal FROM Sales.SalesOrderHeader h

INNER JOIN Sales.SalesOrderHdr1 d ON d.SalesOrderID = h.SalesOrderID

WHERE h.CustomerID BETWEEN 11091 AND 11162

SELECT d.SubTotal FROM Sales.SalesOrderHeader h

INNER HASH JOIN Sales .SalesOrderHdr1 d ON d.SalesOrderID = h.SalesOrderID

WHERE h.CustomerID BETWEEN 11091 AND 11162

The execution plans for the four queries are shown below. The first three queries all have approximately the same plan cost. Only the forced hash join has a moderately higher cost.

The reason the execution plan changes (without hints) from the index seek with nested loops join to a table scan with a hash or merge join is that the cost of the loop join inner source is somewhat linear with the number of rows from the outer source, with each outer source row contributing the standard random IO cost of 0.003125, while the alternative plans has the high fixed cost of the table scan operation, and a relatively low cost per row. So at some point the scan becomes faster than the repeated index seeks to the inner source.

For some reason, people see the third or fourth execution plans, with the hash or merge joins, and draw the conclusion that hash and merge joins are associated with poor performance, overlooking the fact that the main contribution in the plan cost is the scan operation.

The details for the first query, an unforced nested loops join, outer (Index Seek) and inner (Clustered Index Seek) sources are shown below.

Below are the outer and inner source details for the second query, the forced Loop Join. The first query has estimated rows 225 and the second to fourth queries have estimated rows 227.7. As stated above, the actual rows are 213 and 214 respectively.

The Nested Loops operations for the first and second queries are shown above. The total plan cost for the first query is 0.639851 and 0.646008 for the second.

The different detail operations of the third query are shown below. The index seek operation detail is the same as above.

The total plan cost of the third query is 0.645527, just slightly higher than the loop join plan for the same SARG parameters.

Below is the cost detail for the Hash Match operation of the fourth query. The index seek and clustered index scan operations are the same is in the third query.

Below are the actual execution statistics for the above four queries. As before, the first query below is a dummy. The second to fifth lines are for the four queries of interest. The second row is the query without hints, using the first SARG parameters with 225 rows estimated and 213 actual. The third row is the query with the Loop Join hint using the second SARG parameters with 227 rows estimated and 214 actual. As in the key lookup example, the true cost for one additional row is less. Again, due to the run-to-run variations, one should not drawn conclusions from this.

The execution stats below return a single row aggregate instead of multiple rows.

While the plan costs and logical reads for the four queries of interest above are similar, the actual worker time is much higher for the queries involving the clustered index scan. As before, the above tests were conducted with all data in memory. There was no physical disk IO, as assumed by the plan cost model.

More on Hash and Merge Joins

To better demonstrate the hash and merge join, another duplicate of the SalesOrderHeader table is created as SalesOrderHdr2, with the difference being the clustered index is now [CustomerID, SalesOrderID]. In an actual production database, the tables SalesOrderHeader and SalesOrderDetail would both have a cluster key leading with CustomerID.

Below are the execution plans for a single row one-to-one join with the normal Nested Loops join, along with forced Merge and Hash joins.

As before the Nested Loops joins has a cost structure of 0.00000418 per row with no base cost.

The Merge Join base cost is 0.0056022 for the first row (base cost) plus approximately 0.00000640 for each additional row. The Hash Match cost structure is a base of 0.01776509, plus 0.00001509 per additional row or 0.000015266

The four queries used to demonstrate to change in execution plan are shown below. Notice that the SARG parameters are now much contracted, between down to 11094 and 11095 from 11161 and 11162.

SELECT d.SubTotal FROM Sales.SalesOrderHeader h

INNER JOIN Sales.SalesOrderHdr2 d ON d.SalesOrderID = h.SalesOrderID

WHERE h.CustomerID BETWEEN 11091 AND 11094 AND d.CustomerID = h.CustomerID

SELECT d.SubTotal FROM Sales.SalesOrderHeader h

INNER LOOP JOIN Sales.SalesOrderHdr2 d ON d.SalesOrderID = h.SalesOrderID

WHERE h.CustomerID BETWEEN 11091 AND 11095 AND d.CustomerID = h.CustomerID

SELECT d.SubTotal FROM Sales.SalesOrderHeader h

INNER JOIN Sales.SalesOrderHdr2 d ON d.SalesOrderID = h.SalesOrderID

WHERE h.CustomerID BETWEEN 11091 AND 11095 AND d.CustomerID = h.CustomerID

SELECT d.SubTotal FROM Sales.SalesOrderHeader h

INNER HASH JOIN Sales.SalesOrderHdr2 d ON d.SalesOrderID = h.SalesOrderID

WHERE h.CustomerID BETWEEN 11091 AND 11095 AND d.CustomerID = h.CustomerID

The execution plans are shown below. The plan costs for the first three queries are very close. Only the plan with the hash join is substantially more expensive.

The outer (Index Seek) and inner (Clustered Index Seek) source details for the first query are shown below.

Below are the outer and inner source details for the second query. Note the estimated row from the Index Seek for the first query is 35.76 rows and 38.35 rows for the second query. The actual row counts are 37 and 40 respectively.

Above are the Nested Loops details for the first and second query, showing estimated row counts at 34.8 and 37.4 rows and the total plan costs as 0.0122817 and 0.0127068 respectively.

Below are the Clustered Index Seek and Index Seek details for the third and fourth queries. Each operation is the same between the two queries.

Below are the Merge Join and Hash Match details

If we were to examine a forced Merge and Hash joins for 2 rows, we would see that the fixed base cost for a Merge Join operation is approximately 0.0056 and for a Hash Match operation 0.01778. Since the inner source is executed in a single operation, the Hash and Merge operations do not incurred the 0.003125 IO cost per assessed row. If the SARG is specific on only of the two tables of a join, then a hash or merge join would require a scan on the other source. This is why joins involving few rows and one SARG are Loop Joins.

Only when the IO cost for the inner source becomes large enough to offset the full table scan cost on the inner source and the merge or hash base cost does the plan change become possible. Since we have already described how the SQL Server CBO underestimates table scan cost when all data is in memory, the true cost of a join can jump suddenly on the shift from a loop join to a scan with hash or merge, even though the plan costs are comparable. The hash and merge joins can be more efficient than a loop, especially when a selective SARG is specified on both sources and the right indexes are available. The SQL Server optimizer does not deliberately use a less efficient plan.

The result set below shows the execution stats for the dummy query and the four queries described above. The unforced and forced loop joins have approximately the same cost. The unforced Merge Join is slightly lower cost than the loop joins, and the forced hash join is substantially more expensive than loop or merge.

Let us now look at the three execution plan types with the parameters from the original join demonstrations (between 11091 and 11162) for 199 and 200 rows. A join hint is used to force the Loop and Hash joins. The query plan without hints has a merge join.

Below are the execution stats after the dummy query. The merge join has the lowest cost, the loop join the second lowest. The hash join still has the higher cost, but the difference is closing. At higher row counts, the hash join can be lower cost the than loop join. Note that the Merge Join requires that rows be in sorted order, otherwise a sort operation, which has additional cost, is required. A regular merge join also requires the join to be one-to-many. Otherwise a separate many-to-many merge join is required, which is more expensive than a one-to-many merge.


Appendix

The SQL Server Books On-Line (BOL) provides a definition for the execution plan cost numbers. Enter the term query governor cost limit option in the Index tab, or follow from the Contents page:

SQL Server Database Engine >Administering the Database Engine > Managing Servers > Setting Server Configuration Options > query governor cost limit Option to find the following:

Use the query governor cost limit option to specify an upper limit on the time period in which a query can run. Query cost refers to the estimated elapsed time, in seconds, required to complete a query on a specific hardware configuration.

BOL also provides the following information on Nested Loops, Merge and Hash Joins

Query Performance > Query Tuning > Advanced Query Tuning Concepts >

Understanding Nested Loops Joins

The nested loops join, also called nested iteration, uses one join input as the outer input table (shown as the top input in the graphical execution plan) and one as the inner (bottom) input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table.

In the simplest case, the search scans an entire table or index; this is called a naive nested loops join. If the search exploits an index, it is called an index nested loops join. If the index is built as part of the query plan (and destroyed upon completion of the query), it is called a temporary index nested loops join. All these variants are considered by the query optimizer.

A nested loops join is particularly effective if the outer input is small and the inner input is preindexed and large. In many small transactions, such as those affecting only a small set of rows, index nested loops joins are superior to both merge joins and hash joins. In large queries, however, nested loops joins are often not the optimal choice.

Understanding Merge Joins

The merge join requires both inputs to be sorted on the merge columns, which are defined by the equality (ON) clauses of the join predicate. The query optimizer typically scans an index, if one exists on the proper set of columns, or it places a sort operator below the merge join. In rare cases, there may be multiple equality clauses, but the merge columns are taken from only some of the available equality clauses.

Because each input is sorted, the Merge Join operator gets a row from each input and compares them. For example, for inner join operations, the rows are returned if they are equal. If they are not equal, the lower-value row is discarded and another row is obtained from that input. This process repeats until all rows have been processed.

The merge join operation may be either a regular or a many-to-many operation. A many-to-many merge join uses a temporary table to store rows. If there are duplicate values from each input, one of the inputs will have to rewind to the start of the duplicates as each duplicate from the other input is processed.

If a residual predicate is present, all rows that satisfy the merge predicate evaluate the residual predicate, and only those rows that satisfy it are returned.

Merge join itself is very fast, but it can be an expensive choice if sort operations are required. However, if the data volume is large and the desired data can be obtained presorted from existing B-tree indexes, merge join is often the fastest available join algorithm.

Understanding Hash Joins

The hash join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input.

Hash joins are used for many types of set-matching operations: inner join; left, right, and full outer join; left and right semi-join; intersection; union; and difference. Moreover, a variant of the hash join can do duplicate removal and grouping, such as SUM(salary) GROUP BY department. These modifications use only one input for both the build and probe roles.

The following sections describe different types of hash joins: in-memory hash join, grace hash join, and recursive hash join.

In-Memory Hash Join

The hash join first scans or computes the entire build input and then builds a hash table in memory. Each row is inserted into a hash bucket depending on the hash value computed for the hash key. If the entire build input is smaller than the available memory, all rows can be inserted into the hash table. This build phase is followed by the probe phase. The entire probe input is scanned or computed one row at a time, and for each probe row, the hash key's value is computed, the corresponding hash bucket is scanned, and the matches are produced.

Grace Hash Join

If the build input does not fit in memory, a hash join proceeds in several steps. This is known as a grace hash join. Each step has a build phase and probe phase. Initially, the entire build and probe inputs are consumed and partitioned (using a hash function on the hash keys) into multiple files. Using the hash function on the hash keys guarantees that any two joining records must be in the same pair of files. Therefore, the task of joining two large inputs has been reduced to multiple, but smaller, instances of the same tasks. The hash join is then applied to each pair of partitioned files.

Recursive Hash Join

If the build input is so large that inputs for a standard external merge would require multiple merge levels, multiple partitioning steps and multiple partitioning levels are required. If only some of the partitions are large, additional partitioning steps are used for only those specific partitions. In order to make all partitioning steps as fast as possible, large, asynchronous I/O operations are used so that a single thread can keep multiple disk drives busy.

Note: If the build input is only slightly larger than the available memory, elements of in-memory hash join and grace hash join are combined in a single step, producing a hybrid hash join.

It is not always possible during optimization to determine which hash join is used. Therefore, SQL Server starts by using an in-memory hash join and gradually transitions to grace hash join, and recursive hash join, depending on the size of the build input.

If the optimizer anticipates wrongly which of the two inputs is smaller and, therefore, should have been the build input, the build and probe roles are reversed dynamically. The hash join makes sure that it uses the smaller overflow file as build input. This technique is called role reversal. Role reversal occurs inside the hash join after at least one spill to the disk.

Note: Role reversal occurs independent of any query hints or structure. Role reversal does not display in your query plan; when it occurs, it is transparent to the user.

Hash Bailout

The term hash bailout is sometimes used to describe grace hash joins or recursive hash joins.

Note: Recursive hash joins or hash bailouts cause reduced performance in your server. If you see many Hash Warning events in a trace, update statistics on the columns that are being joined.