Home, Parent

Date Correlation Optimzation in SQL Server 2005 & 2008

How many people are familiar with the Date Correlation Optimzation feature introduced in SQL Server 2005? How many people have actually used it? and found it useful?

One of the objectives Microsoft had for SQL Server 2005 was to a world class data warehouse platform. This meant SQL Server 2005 had to have substantially better than SQL Server 2000, and it competitive with the other RDBMS products. Strong performance in actual production environments is not enough, it is also necessary to have referenceable benchmark results, for which TPC-H is the only

This was clearly achieved as detailed in Large Query Performance and other sources, and by published TPC-H benchmarks.

There is the command ALTER DATABASE tpch SET DATE_CORRELATION_OPTIMIZATION ON and

The description below is from SQL Server 2005 Books Online (November 2008)
Optimizing Queries That Access Correlated datetime Columns

SQL Server 2005 introduces the DATE_CORRELATION_OPTIMIZATION database SET option. This option improves the performance of queries that perform an equi-join between two tables whose datetime columns are correlated, and which specify a date restriction in the query predicate.

Tables whose datetime column values are correlated, and which can benefit from enabling DATE_CORRELATION_OPTIMIZATION, are typically part of a one-to-many relationship and are used primarily for decision support, reporting, or data warehousing purposes.

When the DATE_CORRELATION_OPTIMIZATION database option is set to ON, SQL Server maintains correlation statistics between any two tables in the database that have datetime columns and are linked by a one-column foreign key constraint. By default, this option is set to OFF.

SQL Server uses these correlation statistics together with the date restriction specified in the query predicate to infer that additional restrictions can be added to the query without changing the result set. The query optimizer uses these inferred conditions when it chooses a query plan. A faster query plan may result, because the added restrictions let SQL Server read less data when it is processing the query. Performance is also improved when both tables have clustered indexes defined on them, and their datetime columns for which correlation statistics are maintained are the first or only key of the clustered index.

All the following conditions must be met for two tables to benefit from enabling the DATE_CORRELATION_OPTIMIZATION database option:

Consider the following when you set the DATE_CORRELATION_OPTIMIZATION database option to ON:

Working with Correlation Statistics

For all eligible pairs of matching tables, correlation statistics are automatically created in the form of indexed views when you set the DATE_CORRELATION_OPTIMIZATION database option to ON. When the SQL Server query optimizer is able to take advantage of correlation between pairs of datetime columns, it uses these correlation statistics in its query plan. Correlation statistics are also included in the logic of INSERT, UPDATE, and DELETE statements where they are affected. The names of correlation statistics take the following form:

_MPStats_Sys_<constraint_object_id>_<GUID>_<FK_constraint_name>

SQL Server 2008 Books Online Optimizing Queries That Access Correlated datetime Columns

The DATE_CORRELATION_OPTIMIZATION database SET option improves the performance of queries that perform an equi-join between two tables whose date or datetime columns are correlated, and which specify a date restriction in the query predicate.

When the DATE_CORRELATION_OPTIMIZATION database option is set to ON, SQL Server maintains correlation statistics between any two tables in the database that have date or datetime columns and are linked by a one-column foreign key constraint. By default, this option is set to OFF.

SQL Server uses these correlation statistics together with the date restriction specified in the query predicate to infer that additional restrictions can be added to the query without changing the result set. The query optimizer uses these inferred conditions when it chooses a query plan. A faster query plan may result, because the added restrictions let SQL Server read less data when it is processing the query. Performance is also improved when both tables have clustered indexes defined on them, and their date or datetime columns for which correlation statistics are maintained are the first or only key of the clustered index.

 


/* TPC_H Query 3 - Shipping Priority */

SELECT TOP 10 L_ORDERKEY, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE, O_ORDERDATE, O_SHIPPRIORITY
FROM CUSTOMER, ORDERS, LINEITEM
WHERE C_MKTSEGMENT = 'BUILDING' AND C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY AND
O_ORDERDATE < '1995-03-15' AND L_SHIPDATE > '1995-03-15'
GROUP BY L_ORDERKEY, O_ORDERDATE, O_SHIPPRIORITY
ORDER BY REVENUE DESC, O_ORDERDATE

In the TPC-H benchmark, the maximum difference between O_ORDERDATE and SHIPDATE is 121 days, so technically for query 3 we could have appended the additional SARG below.

AND L_SHIPDATE <= dateadd(dd, 122, cast('1995-03-15' as date)

Below is TPC-H Query 3, the top plan points to the database without date correlation.

TPC-H Query 3 Plan

The plan with correlation enabled.

TPC-H Query 3 Plan

Notice the for the ORDERS and LINEITEM table, only the explicity stated SARG is applied.

TPC-H Query 3 Plan TPC-H Query 3 Plan

Below are the details for the correlation enabled database. Notice that O_ORDERDATE < '1995-03-15' implies L_SHIPDATE < '1995-09-15' and L_SHIPDATE > '1995-03-15' implies O_ORDERDATE > '1994-10-15'. In fact, the implied restriction should be 121 days or 4 months, but the correlation allowed for a 5-s6 month range.

TPC-H Query 3 Plan TPC-H Query 3 Plan

 

/* TPC_H Query 4 - Order Priority Checking */

SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT FROM ORDERS
WHERE O_ORDERDATE >= '1993-07-01' AND O_ORDERDATE < dateadd(mm,3, cast('1993-07-01' as date))
AND EXISTS (SELECT * FROM LINEITEM WHERE L_ORDERKEY = O_ORDERKEY AND L_COMMITDATE < L_RECEIPTDATE)
GROUP BY O_ORDERPRIORITY
ORDER BY O_ORDERPRIORITY

Similarly, for query 4, we could have appended the additional SARGs:

AND L_SHIPDATE >= '1993-07-01' AND L_SHIPDATE < dateadd(mm,7, cast('1993-07-01' as date))

 

Below is TPC-H Query 4, the top plan points to the database without date correlation. The second plan points to a database with correlation enabled. Notice the execution plan changes from a scan on LINEITEM to a seek

TPC-H Query 4 Plan

TPC-H Query 4 Plan

Notice the for the ORDERS the explicity stated SARGs are applied, no SARG is specified for LINEITEM, hence the table scan.

TPC-H Query 4 Plan TPC-H Query 4 Plan

Below are the details for the correlation enabled database. Notice that O_ORDERDATE < '1995-03-15' implies L_SHIPDATE < '1995-09-15' and L_SHIPDATE > '1995-03-15' implies O_ORDERDATE > '1994-10-15'. In fact, the implied restriction should be 121 days or 4 months, but the correlation allowed for a 5-s6 month range. The date range on LINEITEM below is implied from the date range in ORDERS.

TPC-H Query 4 Plan TPC-H Query 4 Plan

 

/* TPC_H Query 5 - Local Supplier Volume */

SELECT N_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION
WHERE C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY AND L_SUPPKEY = S_SUPPKEY
AND C_NATIONKEY = S_NATIONKEY AND S_NATIONKEY = N_NATIONKEY AND N_REGIONKEY = R_REGIONKEY
AND R_NAME = 'ASIA' AND O_ORDERDATE >= '1994-01-01' AND O_ORDERDATE < DATEADD(YY, 1, cast('1994-01-01' as date))
GROUP BY N_NAME
ORDER BY REVENUE DESC

 

Below is TPC-H Query 5, the top plan points to the database without date correlation. The second plan points to a database with correlation enabled. Notice the execution plan changes from a scan on LINEITEM to a seek

TPC-H Query 5 Plan

TPC-H Query 5 Plan

Notice the for the ORDERS the explicity stated SARGs are applied, no SARG is specified for LINEITEM, hence the table scan.

TPC-H Query 5 Plan TPC-H Query 5 Plan

Below are the details for the correlation enabled database. Notice that O_ORDERDATE < '1995-03-15' implies L_SHIPDATE < '1995-09-15' and L_SHIPDATE > '1995-03-15' implies O_ORDERDATE > '1994-10-15'. In fact, the implied restriction should be 121 days or 4 months, but the correlation allowed for a 5-s6 month range. The date range on LINEITEM below is implied from the date range in ORDERS.

TPC-H Query 5 Plan TPC-H Query 5 Plan

 

/* TPC_H Query 7 - Volume Shipping */

SELECT SUPP_NATION, CUST_NATION, L_YEAR, SUM(VOLUME) AS REVENUE
FROM ( SELECT N1.N_NAME AS SUPP_NATION, N2.N_NAME AS CUST_NATION, datepart(yy, L_SHIPDATE) AS L_YEAR,
 L_EXTENDEDPRICE*(1-L_DISCOUNT) AS VOLUME
 FROM SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION N1, NATION N2
 WHERE S_SUPPKEY = L_SUPPKEY AND O_ORDERKEY = L_ORDERKEY AND C_CUSTKEY = O_CUSTKEY
 AND S_NATIONKEY = N1.N_NATIONKEY AND C_NATIONKEY = N2.N_NATIONKEY AND
 ((N1.N_NAME = 'FRANCE' AND N2.N_NAME = 'GERMANY') OR
 (N1.N_NAME = 'GERMANY' AND N2.N_NAME = 'FRANCE')) AND
 L_SHIPDATE BETWEEN '1995-01-01' AND '1996-12-31' ) AS SHIPPING
GROUP BY SUPP_NATION, CUST_NATION, L_YEAR
ORDER BY SUPP_NATION, CUST_NATION, L_YEAR

TPC-H Query 7 Plan

TPC-H Query 7 Plan

TPC-H Query 7 Plan TPC-H Query 7 Plan

TPC-H Query 7 Plan TPC-H Query 7 Plan

 

/* TPC_H Query 8 - National Market Share */

SELECT O_YEAR, SUM(CASE WHEN NATION = 'BRAZIL' THEN VOLUME ELSE 0 END)/SUM(VOLUME) AS MKT_SHARE
FROM (SELECT datepart(yy,O_ORDERDATE) AS O_YEAR, L_EXTENDEDPRICE*(1-L_DISCOUNT) AS VOLUME, N2.N_NAME AS NATION
 FROM PART, SUPPLIER, LINEITEM, ORDERS, CUSTOMER, NATION N1, NATION N2, REGION
 WHERE P_PARTKEY = L_PARTKEY AND S_SUPPKEY = L_SUPPKEY AND L_ORDERKEY = O_ORDERKEY
 AND O_CUSTKEY = C_CUSTKEY AND C_NATIONKEY = N1.N_NATIONKEY AND
 N1.N_REGIONKEY = R_REGIONKEY AND R_NAME = 'AMERICA' AND S_NATIONKEY = N2.N_NATIONKEY
 AND O_ORDERDATE BETWEEN '1995-01-01' AND '1996-12-31' AND P_TYPE= 'ECONOMY ANODIZED STEEL') AS ALL_NATIONS
GROUP BY O_YEAR
ORDER BY O_YEAR

TPC-H Query 8 Plan

TPC-H Query 8 Plan

TPC-H Query 8 Plan TPC-H Query 8 Plan

TPC-H Query 8 Plan TPC-H Query 8 Plan

 

/* TPC_H Query 10 - Returned Item Reporting */

SELECT TOP 20 C_CUSTKEY, C_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE, C_ACCTBAL,
N_NAME, C_ADDRESS, C_PHONE, C_COMMENT
FROM CUSTOMER, ORDERS, LINEITEM, NATION
WHERE C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY AND O_ORDERDATE>= '1993-10-01' AND
O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND
L_RETURNFLAG = 'R' AND C_NATIONKEY = N_NATIONKEY
GROUP BY C_CUSTKEY, C_NAME, C_ACCTBAL, C_PHONE, N_NAME, C_ADDRESS, C_COMMENT
ORDER BY REVENUE DESC

TPC-H Query 10 Plan

TPC-H Query 10 Plan

TPC-H Query 10 Plan TPC-H Query 10 Plan

TPC-H Query 10 Plan TPC-H Query 10 Plan

The table below shows the TPC-H SF100 non-parallel execution plan cost without and with Date Correlation.

Queryw/o Correlationwith Correlation
Q3 15,595.31,955.7
Q4 10,731.11,449.7
Q5 14,133.83,960.2
Q7 10,988.65,516.7
Q8 13,628.56,062.1
Q1011,098.63,234.6