The SQL Server query optimizer has become very sophisticated. Some of us may remember SQL Server 2000 having SQL parse and compile times on the order of seconds if not minutes. The number of possible execution plans for a query of any complexity is astronomical, with variations on join order, access methods, index choices etc. Now SQL Server 2008 R2 routinely compiles complex queries in far less time than version 2000. I am guessing that the 2008 R2 query optimizer is very good at discarding trees that do not need investigation. If a selective index is available on a large table with certain join order, then all variations involving a scan on the large table or join orders incompatible with index can be discarded. The 2008 R2 compiler is also much better at finding a good execution plan in tricky expressions.
Still, there certain SQL expression structures for which an efficient execution plan is possible, but the SQL Server 2008 R2 query optimizer does find it. The topic today is OR and NOT combinations.
Consider the query below. Note that there are 2 EXISTS, 1 NOT EXISTS and 1 OR conditions.
The query plan is below. The index seek at the top produces 10,245 rows estimated (exact due to the values being in the histogram steps). The plan cost 74.3. At first glance, it might seem that this is a good execution plan. There is one fat arrow indicating many rows, every other table access operation uses a highly selective index, and 3 of 5 are unique.
The reason we want a better execution plan is that in practice, there will be parameter values for which the source may involve hundreds of thousands and even millions of rows. Even though there is only one fat arrow, the thin arrows underneath the top Nested Loops (or just loop join) involve more loop joins. In all there are 4 loop join operations. For each row from the top source, the inner source is executed once. Each inner source execution involves 3 loop joins.
To determine what the better execution should be, we note that this query has 2 independently executable sub-expressions, as shown below.
The execution plans for the two sub-expressions are below:
Sub-query 1 plan cost 0.56
Sub-query 2 plan cost 1.23
Note that both sub-expressions have execution plans that use hash joins. And note that each sub-query has a much lower plans cost than the original query.
The original query can be rewritten as below, with a UNION operation replace the central OR clause, and by readjusting the two separate IN clauses to support the new but equivalent logic.
The execution plan for the alternate query is below.
The plan cost for the new query is 2.02, in-line with the combined sub-query plan costs.
The more important element of the new plan is the use of hash joins, not necessarily the lower plan cost. Loop joins with highly selective indexes are good for transaction processing environments, specifically when few rows are involved. The general structure of an execution plan operation is the base cost to setup the operation (for the first row) and an incremental cost for each additional row. The loop join has very low (actual) setup cost.
For processing many rows, the hash join can be more efficient in specific circumstances. The hash join has higher setup cost, so it is not best for a processing small number of rows when indexes to support the loop join are available. However the incremental cost per row is lower than the loop, by as much as 50% in a non-parallel plan depending on the size of the row to be hashed, and the quality of the hash but not accounting for differences in the amount of data processed by the different join methods. The hash join processes each source independently, where the loop join can use an index on the join condition to access the inner source.
Finally, the hash join has excellent scaling characteristics with parallelism, while the loop join has very limited parallel execution scaling. On a modern server system, there could be 12 to 40 or even 80 physical cores. For intensive data processing, we would like to effectively use the all cores that were paid for, not to mention the SQL Server licensing costs, and ever more so for SQL Server 2012 per-core licensing. The performance difference in large queries at high parallelism between hash and loop joins can be dramatic.
Below is the execution plan for original query with different parameter values that involve far more rows, 11.4M rows to be exact. The plan still uses several loop joins, but also has the Index Spool operators. The plan cost is now 6715.
Below are the execution plans for the 2 sub-queries, with plan costs 24.9 and 25.3, respectively.
Below is the execution plan for the alternate query with the 11.4M row parameter values.
Now that we have established that simply rewriting a high row count query can shift the execution plan from mainly loop join to hash joins, with dramatic performance differences at high degrees of parallelism, we might ask why? If we are working with a third party application, it might be difficult to get then vendor to change the SQL. Would a hash join hint work?
If we were to apply the hint OPTION (hash join) on this query, we will get the following error message:
It turns out that one of the requirements for a hash join is the presence of at least one equality join condition. (see Microsoft Understanding Hash Joins). Somehow the combined set of OR and NOT EXISTS clauses prevented the SQL Server query optimized from detecting an equality join condition between required tables.
To test this hypothesis, let try a similar query, but changing the OR to AND. Below is the execution plan. Note there are hash joins in the high row count operations. When the estimated number of rows is low, the join operation is a loop join.
We can now apply the hash join hint to get the following query plans
Lets also try keeping the OR, but changing the NOT EXISTS to Exists
Addition investigation is necessary to determine exactly when the SQL Server query optimizer cannot determine that an equality join condition is present. At the moment, it appears that there can be any number of AND and EXISTS conditions. There can also be a single OR, or a single NOT EXISTS clause, but there cannot be both in the form (A and B) OR ( C and D but not E).