
Statistics that need special attention (2014-11, Updated: 2018-05)

Update 2018-05: the SQL at the bottom has been converted into a stored procedure: sp_updatestats2

Statistics are automatically created and updated in SQL Server to support the objective that it should just work without needing the special skills of a DBA. This is an admirable goal; if it were in fact achieved, special skills would not be needed from the DBA, and that would be reflected in the pay scales for our profession. So from a certain point of view, we are fortunate that CIOs understand advanced skills are still needed in the DBA, with the consequent implications. Today's topic is statistics that require special attention. Serious problems can and do occur with the default SQL Server settings, not just in when statistics are updated, but also in the sampling percentage. Examples of specific situations that could incur adverse effects from these deficiencies are discussed. Finally, a partial plan is proposed for preventing statistics-related problems, based on the ready availability of the necessary information. A full plan depends on the details of the specific environment and preferences.

Recompute Threshold

The original SQL Server strategy (7.0, 1998) was that statistics are marked for recompute when 20% of the rows have been modified. Elsewhere it was stated that temp tables incur recompiles at the first 6 rows inserted/modified, again at 500 rows modified, and every 20% thereafter. Of course a recompile would only change the execution plan if the statistics reflected the changed row estimates, so the temp table recompile set points are in fact statistics recompute thresholds.

At some point SQL Server 2008 R2 added trace flag 2371 to allow a gradual reduction in the percentage threshold for statistics recompute, to avoid the ridiculous situation of waiting for the 20% threshold even on very large tables. See Changes to automatic update statistics in SQL Server – traceflag 2371 and also Controlling Autostat (AUTO_UPDATE_STATISTICS) behaviour in SQL Server.
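
A rough sketch for comparing current modification counters against the commonly documented thresholds (500 + 20% of rows for the legacy rule, approximately SQRT(1000 × rows) under trace flag 2371 and newer compatibility levels); the formulas here are assumptions based on the public documentation, not exact internal behavior:

-- Compare each statistic's modification counter to the assumed thresholds:
-- legacy rule = 500 + 20% of rows; dynamic rule (TF 2371 / newer compatibility
-- levels) is approximately SQRT(1000 * rows).
SELECT o.name AS table_name , st.name AS stats_name
 , sp.rows , sp.modification_counter
 , legacy_threshold = 500 + 0.20 * sp.rows
 , dynamic_threshold = SQRT(1000.0 * sp.rows)
FROM sys.stats st
JOIN sys.objects o ON o.object_id = st.object_id AND o.type = 'U'
CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) sp
WHERE sp.rows > 0
ORDER BY sp.modification_counter DESC;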

Sampling

But recompute threshold set points are not the only problem in SQL Server statistics. Another very serious problem is the sampling methodology. The cornerstone of statistics is a random sample; in a database, this means sampling random rows. But this is not how SQL Server generates statistics. What SQL Server does is randomly sample pages, using all the row values in those pages. (I believe the first and last pages are always included, to try to determine boundary values?) If there is no correlation between page location and value, then our sample should be governed by sampling theory. If there is correlation, then our statistics potentially do more harm than good.
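
The histograms shown below can be retrieved with DBCC SHOW_STATISTICS; the statistics name IX_LINEITEM_SHIPDATE is an assumed placeholder for the clustered index on L_SHIPDATE:

-- Header, density vector and histogram for the shipdate index statistics;
-- the statistics name is a placeholder.
DBCC SHOW_STATISTICS ('dbo.LINEITEM', 'IX_LINEITEM_SHIPDATE');
-- Histogram only: RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS, DISTINCT_RANGE_ROWS, AVG_RANGE_ROWS
DBCC SHOW_STATISTICS ('dbo.LINEITEM', 'IX_LINEITEM_SHIPDATE') WITH HISTOGRAM;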

Below are the default statistics on the TPC-H scale factor (SF) 10 database, Lineitem table, with a clustered index leading with Shipdate and no other indexes present. The sample of 394,933 rows from random pages indicates that certain values are very common, as indicated by EQ_ROWS, while other values are uncommon, being lumped into RANGE_ROWS, with AVG_RANGE_ROWS being the proper value to compare against EQ_ROWS.

The query optimizer would then formulate the execution plan on the assumption that certain values have a high distribution while other values are rare. Consider the queries below, with the select lists having different columns to avoid the effects of auto-parameterization.

SELECT L_PARTKEY FROM LINEITEM WHERE L_SHIPDATE = '1992-04-28'
SELECT L_SUPPKEY FROM LINEITEM WHERE L_SHIPDATE = '1992-04-29'

The query plans are below. The first query is much more expensive than the second due to the difference in estimated rows, even though the execution plan structure is the same.

Below is the detail with estimated number of rows for the 4-28 query

and the detail for the 4-29 query.

The estimated number of rows exactly matches the EQ_ROWS for 28 Apr in the first query, and the average range rows for dates inside the range 28 Apr to 10 May (excluding boundary values) in the second query.

Even if our query were inside a stored procedure, with the search specifying a parameter, the value at the time of compile determines the execution plan, which is then used for subsequent executes until the next compile. If the search specifies a variable, then the generic (density-based) distribution is used.
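
A minimal sketch of the difference, with hypothetical procedure names:

-- Parameter: the value passed at the first execution is sniffed at compile time
-- and its histogram estimate shapes the plan reused by later executions.
CREATE PROCEDURE dbo.GetByShipDate_Param @d date
AS
SELECT L_PARTKEY FROM dbo.LINEITEM WHERE L_SHIPDATE = @d;
GO
-- Local variable: unknown at compile time, so the optimizer falls back to the
-- generic (density) estimate rather than the histogram.
CREATE PROCEDURE dbo.GetByShipDate_Var @d date
AS
BEGIN
 DECLARE @v date = @d;
 SELECT L_PARTKEY FROM dbo.LINEITEM WHERE L_SHIPDATE = @v;
END
GO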

Below are the true statistics assessed from a full scan. Except for the lower boundary values, the actual distribution is highly uniform with respect to date.
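
The fullscan statistics can be produced with a command along these lines (the statistics name is again an assumed placeholder):

-- Recompute the shipdate index statistics with a full scan instead of the
-- default page sample.
UPDATE STATISTICS dbo.LINEITEM IX_LINEITEM_SHIPDATE WITH FULLSCAN;
-- Or specify an explicit sampling percentage instead of FULLSCAN:
-- UPDATE STATISTICS dbo.LINEITEM IX_LINEITEM_SHIPDATE WITH SAMPLE 10 PERCENT;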

Of course, the SQL Server engine team has experts on statistical sampling theory who know of this problem. To get around it while still leveraging the lower cost of random page sampling, SQL Server tries to use a nonclustered index that has the desired column, but not as the lead key, in the hope that this is not far from a true random row sample. Below are the default statistics when there was a nonclustered index (on L_PARTKEY) present. This sample is indeed much closer to the fullscan distribution.

Note that the objective is not 1% or even 10% accuracy; it is to be within an order of magnitude. That is, in the correct ball park, not in the wrong city. The point is to achieve the correct execution plan, with the major decision being whether to use a nonclustered index and key lookup versus a scan. The secondary decision is join order, which does require more accuracy in the statistics, but has lower impact.

Examples of Problems Caused by Default Statistics

We have established two main sources of concern: 1) the statistics recompute threshold, and 2) the sampling percentage. Now we need to consider situations where either source of error could have consequential results.

Compile Parameter Out of Bounds

One situation is sequentially increasing (or decreasing) values, typically an identity column or certain datetime columns. Suppose statistics are recomputed at a specific time. Later, additional rows are added with higher identity or datetime values. Should a query (stored procedure) compile with such parameter values, the optimizer determines that the parameter values are outside the upper or lower bounds of the statistics, so there are zero rows.

Now the execution plan never shows an estimate of zero rows; instead it shows an estimated number of rows of 1. The only indication that the true estimate was zero, other than knowing the compile parameter was out of bounds, is if there are subsequent joins that are known via statistics to be 1-to-many. These will now all show 1 row in lieu of zero.

One would expect that whether the estimate is zero or 1 should not change the correct execution plan on the table with the identity column itself. The adverse effect occurs when there is a subsequent join that should have estimated many rows. One possibility is that the correct estimate would have resulted in a table scan on a later operation.
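
One way to confirm the condition, sketched below with made-up names (dbo.Orders, PK_Orders, OrderID), is to compare the histogram's highest RANGE_HI_KEY against the current maximum key value:

-- Capture the histogram into a temp table, then compare its upper bound with
-- the current maximum key value.
CREATE TABLE #hist ( RANGE_HI_KEY sql_variant , RANGE_ROWS float , EQ_ROWS float
 , DISTINCT_RANGE_ROWS float , AVG_RANGE_ROWS float );
INSERT #hist
EXEC ('DBCC SHOW_STATISTICS (''dbo.Orders'', ''PK_Orders'') WITH HISTOGRAM');
SELECT histogram_max = MAX(CONVERT(int, RANGE_HI_KEY)) FROM #hist;
SELECT current_max = MAX(OrderID) FROM dbo.Orders;
-- If current_max is well above histogram_max, a compile with a recent key
-- value estimates 1 row no matter how many rows actually qualify.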

Leading to the Disk IO at Queue Depth 1 Problem

A less obvious adverse effect of the statistics being out of bounds is the estimated number of rows being 1 on all subsequent operations. If IO is required for a loop join or key lookup, it is synchronous IO. Even if the data were distributed over 1000 HDDs, the synchronous IO rate is 150-200 IOPS on 10/15K HDDs, utilizing only a single HDD out of the entire pool at any given point in time. (The threshold for asynchronous IO is 25 rows?)

The parameter out-of-bounds issue has been a problem on identity column indexes since SQL Server version 7 in 1998. This problem is now fixed in SQL Server 2014, a mere 16 years later.

Columns with Limited Discrete Values

A similar situation occurs if there are only a limited number of discrete values. The binary structure SQL Server uses to store statistics allows for up to 200 histogram steps. When there are 200 or fewer distinct values, the statistics structure can know exactly which values exist, without having any range rows with values in between the equal values.

Now suppose we modify some rows with a value that did not exist at the time statistics were computed. The value could be in between the upper and lower bounds, but because there are no range rows, SQL Server still believes that the value does not exist.

Temporary Values

This situation typically occurs in an ETL-like process. The first step inserts or updates a set of rows, marking a flag column with a value different from the normal values. The next steps act on these rows using the flag column's special value as a search criterion. The final step clears the flag column back to the normal value.

Statistics computed before the first step or after the final step indicate that only the normal value(s) exist, not the special value.

In this situation, when the process tags a large number of rows, such that the default threshold is exceeded, statistics are recomputed, the plan is recompiled with good statistics, and the SQL completes in a reasonable time span.

If the process involves fewer rows than the threshold, statistics are not recomputed. When statistics were previously computed, the special value did not exist in this column, and so the query optimizer assumes zero (1) rows meet the search condition. Depending on the actual circumstances, the execution plan based on this assumption could be a complete disaster.

The developer or analyst perceives the paradox that a large job is not slow, but a small job is horribly slow. He/she concludes that SQL Server works in mysterious, magical ways and promulgates rules based on an erroneous concept for the next generation of developers to follow, none of whom bother to challenge the source.

Row Estimates High

As described in the first section, the random page sample (all rows in each sampled page) can lead to serious overestimates. This in turn can result in the execution plan being a scan when an index seek was the correct choice. This might occur on the source table, or in a join to a subsequent table.

High Compile Time

This is a compound problem. The random page sample causes estimates to be high. A series of joins involving affected tables results in the query having an extraordinarily high estimated number of rows, which in turn means the plan cost is also very high. The high plan cost results in the query optimizer setting a very high time-out for StatementOptmEarlyAbort.

The query optimizer now keeps trying to find a better plan even when a good plan has already been found and there is no better plan. (Perhaps there needs to be a new reason for optimizer early abort: when variations show no significant change.)
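
One way to look for such statements in the plan cache is to search the plan XML for the time-out early-abort reason; a hedged sketch (scanning the plan cache can itself be expensive on a busy instance):

-- Search cached plans where the optimizer gave up on a time-out.
WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP (20) cp.usecounts , cp.objtype , st.text , qp.query_plan
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE qp.query_plan.exist('//StmtSimple[@StatementOptmEarlyAbortReason="TimeOut"]') = 1;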

The plan in XML below is for a query with several self-joins (which could be replaced with a pivot) to a table/index with the default random page sample. The statistics are off by three full orders of magnitude (1000X) at the source, and this is then magnified via multiple joins to an estimate of 5 billion rows, with a plan cost of 44,000 and a compile CPU of nearly 5 minutes.

With fullscan statistics, the estimate is a more accurate 24 rows, the plan cost is 0.68, and the compile time is a more reasonable 51 ms. The actual query execution is on the order of several hundred milliseconds.

Summary of Statistics Requirements

Continuously increasing (or decreasing) columns can pose a problem if the compile occurs with a parameter value out of bounds on the statistics and there is a subsequent 1-to-(very)many join.

One solution is to update these statistics frequently, but obviously it is not practical to update statistics after every insert. Another method of handling this situation is to use the OPTIMIZE FOR query hint, either specifying a literal constant known to be in bounds or UNKNOWN, as sketched below.
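
A minimal sketch of both forms, inside a hypothetical procedure with parameter @d (the date literal is only illustrative):

CREATE PROCEDURE dbo.GetByShipDate @d date
AS
SELECT L_PARTKEY
FROM dbo.LINEITEM
WHERE L_SHIPDATE = @d
OPTION (OPTIMIZE FOR (@d = '1995-06-15'));   -- literal known to be in bounds
GO
-- Alternative: compile against the density vector rather than any specific value
-- OPTION (OPTIMIZE FOR (@d UNKNOWN));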

Unique columns do not need a FULLSCAN statistics sample.

More than likely, it is the indexes that need FULLSCAN statistics sampling. If the lead column also appears in one of the other nonclustered indexes, but not as the lead column there, it is possible that the default statistics sample could be adequate. Of course, when a problem occurs, your excuse for not having taken preventative action based on a "possibly adequate" argument is likely to be seen as incompetence.

We need to be especially alert for ETL-like procedures, where a column is marked with a new value, and then this new value is used in the next steps as a SARG. An independently scheduled statistics maintenance job is unlikely to help. The best solution is to place an update statistics command at the appropriate point directly in the logic stream. I also like to employ a filtered index: when there are zero rows with the filter definition value, then there are zero rows in the index. Just 6 rows modified should trigger the first statistics recompute threshold (?), and if not, the 20% threshold should also be easily triggered. A sketch follows.
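
A hypothetical sketch of the approach; the table, index, column and flag values are all made up:

-- A filtered index on the special value holds zero rows outside the ETL window,
-- so even a handful of tagged rows trips the statistics recompute threshold.
CREATE NONCLUSTERED INDEX IX_Stage_Flag ON dbo.Stage (FlagCol)
WHERE FlagCol = 9;   -- 9 = the temporary "special" value
GO
DECLARE @BatchID int = 12345;   -- illustrative batch identifier
-- Step 1: tag the batch with the special value
UPDATE dbo.Stage SET FlagCol = 9 WHERE BatchID = @BatchID;
-- Refresh statistics in the logic stream, before the flag is used as a SARG
UPDATE STATISTICS dbo.Stage IX_Stage_Flag;
-- Step 2: act on the tagged rows (processing that filters on FlagCol = 9)
-- ...
-- Final step: clear the flag back to the normal value
UPDATE dbo.Stage SET FlagCol = 0 WHERE FlagCol = 9;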

Proposal for Avoiding Statistics Related Problems in SQL Server

Now that we know some of the serious problems that can occur with SQL Server's built-in statistics handling, the next step is to determine good strategies for preventing such problems. We would like to be 100% successful, and for there to be peace on earth; failing that, we should accept a good enough strategy.

Consider that just knowing that statistics and row estimation are the source of a problem is valuable. Otherwise we would be engaged in protracted troubleshooting, wondering why good SQL and indexes alone are not sufficient. We might stumble onto a solution by forcing the query plan, but not have a rational explanation of why the SQL Server query optimizer did not come to that plan without hints.

Sources of Information

The primary source of information on statistics is sys.dm_db_stats_properties. Some information is also in sys.indexes, dm_db_partition_stats and dm_db_index_usage_stats. The key columns in dm_db_stats_properties are rows, rows_sampled, steps, modification_counter and last_updated. Note that dm_db_stats_properties is only available in SQL Server 2008 R2 SP2, and 2012 SP1 or later versions. This information is so important that anyone on prior versions and service packs should upgrade to one of the versions with this function.
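
A minimal example of the function for a single table (the table name is illustrative):

-- All statistics on one table, with row counts, sample size and modifications
SELECT st.name , sp.rows , sp.rows_sampled , sp.steps
 , sp.modification_counter , sp.last_updated
FROM sys.stats st
CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) sp
WHERE st.object_id = OBJECT_ID('dbo.LINEITEM');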

The query below is a starting point in identifying indexes that might benefit from a statistics update, along with additional information on whether to use a default or full scan sample. A separate query (in the Appendix) targets column statistics.

;WITH b AS (
 SELECT d.object_id, d.index_id, row_count = SUM(d.row_count)
 FROM sys.dm_db_partition_stats d WITH(NOLOCK)
 GROUP BY d.object_id, d.index_id
), k AS (
 SELECT object_id, index_id, COUNT(*) kct
 FROM sys.index_columns WITH(NOLOCK)
 WHERE key_ordinal > 0
 GROUP BY object_id, index_id
)
SELECT ROW_NUMBER() OVER( ORDER BY s.name, o.name, i.index_id) rn
, s.name + '.' + o.name [object] , i.name [index] , i.has_filter filt
, i.is_unique uni , k.kct
, c.name leadcol , c.system_type_id ctype , c.is_identity isid , c.is_rowguidcol rgui
, rw_delta= b.row_count - t.rows , t.rows_sampled , t.modification_counter , t.steps
, CONVERT(varchar, t.last_updated,120) updated, o.type otype
FROM sys.objects o WITH(NOLOCK)
JOIN sys.schemas s WITH(NOLOCK) ON s.schema_id = o.schema_id
JOIN sys.indexes i WITH(NOLOCK) ON i.object_id = o.object_id
JOIN sys.index_columns j WITH(NOLOCK) ON j.object_id = i.object_id AND j.index_id = i.index_id
AND j.key_ordinal = 1
JOIN sys.columns c WITH(NOLOCK) ON c.object_id = j.object_id AND c.column_id = j.column_id
JOIN b ON b.object_id = i.object_id AND b.index_id = i.index_id
JOIN k ON k.object_id = i.object_id AND k.index_id = i.index_id
LEFT JOIN sys.dm_db_index_usage_stats y WITH(NOLOCK) ON y.object_id = i.object_id AND y.index_id = i.index_id
OUTER APPLY sys.dm_db_stats_properties ( i.object_id , i.index_id) t
WHERE o.type IN ('U', 'V')
AND i.index_id > 0 AND i.type <= 2 AND i.is_disabled = 0 AND i.is_hypothetical = 0
AND b.row_count > 0
AND ( 10*t.modification_counter > b.row_count
OR t.modification_counter*t.modification_counter > 100 * b.row_count
OR 2*t.rows_sampled < b.row_count
OR ( i.is_unique = 1 AND kct = 1
  AND c.system_type_id IN (52,56,127) AND t.modification_counter > 0 )
)

The first CTE gets the row count from sys.dm_db_partition_stats, which should be relatively current. This can be compared with the rows and modification_counter values from sys.dm_db_stats_properties, which might provide some indication as to whether the writes are inserts, updates or deletes. For the time being, no effort is made for partition-level statistics. The second CTE gets the number of key columns for each index.

In the main query, the tables sys.objects and sys.schemas are strictly for names. The table sys.indexes has a column for is_unique, which encompasses primary key, unique constraint and unique index. The condition index_id > 0 excludes the heap, and type <= 2 further restricts this to conventional indexes. The columns is_disabled and is_hypothetical restrict the list to actual indexes. Finally, dm_db_partition_stats row_count > 0 excludes empty tables.

From sys.columns, there are columns for the data type and is_identity. For unique indexes with 1 key column, we are interested in whether the index column is a monotonically increasing or decreasing value, typically one of the integer types, and possibly a datetime. This indicates that there might be a need to update the statistics on the index frequently, but a full scan is not required. It should not be necessary to frequently update unique single-key-column indexes that are uniqueidentifiers, including ones that have the rowguidcol property.

The DMV sys.dm_db_index_usage_stats has a column for the number of user_updates. This counter accumulates from SQL Server instance start or index (re)build, so we do not know how many updates have occurred since statistics were last updated. If we had a complex tracking system, this DMV could be useful. Otherwise we will mostly use the dm_db_stats_properties modification_counter.

Of the OR clause conditions, 10*modification_counter > row_count means 10% or more of the rows have been modified (with a multiply instead of a divide). The condition modification_counter² > 100 * row_count is another way of saying mods > 10 * sqrt(rows). For a table of 10,000 rows, 1,000 rows modified corresponds to 10 * sqrt(10,000), or 10%. For a table of 1M rows, 10,000 rows modified corresponds to 10 * sqrt(1M), or 1%. For a table of 100M rows, this is 100,000 rows, or 0.1%.

Since this list is for indexes, we want the update statistics to be full scan (possibly with an exception for single key unique integer indexes).

The Statistics Maintenance Plan

We could use this query to populate a table variable, or a permanent table if history is desired, then loop through the table with a cursor, updating statistics with the appropriate qualifier: fullscan, default or resample.

Note that the rule of favoring set-based operations over a cursor does not apply here, because there is no set operation for updating statistics on a set of indexes. A sketch of the loop follows.
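
A minimal sketch of the loop, assuming the query above has been adapted to populate a temp table #work holding the object name, index name and a full_scan flag (all assumptions for illustration):

DECLARE @obj nvarchar(300), @stat sysname, @full bit, @sql nvarchar(max);
DECLARE stat_cur CURSOR LOCAL FAST_FORWARD FOR
 SELECT [object], [index], full_scan FROM #work ORDER BY rn;
OPEN stat_cur;
FETCH NEXT FROM stat_cur INTO @obj, @stat, @full;
WHILE @@FETCH_STATUS = 0
BEGIN
 -- Build and run the UPDATE STATISTICS command for this index
 SET @sql = N'UPDATE STATISTICS ' + @obj + N' ' + QUOTENAME(@stat)
  + CASE WHEN @full = 1 THEN N' WITH FULLSCAN' ELSE N'' END;
 EXEC (@sql);
 FETCH NEXT FROM stat_cur INTO @obj, @stat, @full;
END
CLOSE stat_cur;
DEALLOCATE stat_cur;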

The above strategy can be employed in a daily job, and then adjusted in frequency as appropriate. It is possible that the time to update all marked statistics is longer than desirable. We should also consider that update statistics is single threaded.

On a modern system with very many processor cores and adequate disk IO capability (i.e., the storage was configured without assistance from the SAN vendor/admin), several update statistics commands can be run in parallel, so a method of dividing the work into multiple threads is needed. I suggest ranking the statistics to be updated by row count, then taking a modulus, so that the large statistics are distributed over different threads, as opposed to using NTILE, which would group the largest statistics into the same thread. A sketch follows.
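
A sketch of the modulus assignment, again assuming a #work table that carries a row_count column:

-- Rank by row count descending, then take a modulus so the biggest statistics
-- spread across threads; each job/session processes only its own thread number.
DECLARE @threads int = 4;
SELECT [object], [index]
 , thread = ROW_NUMBER() OVER (ORDER BY row_count DESC) % @threads
FROM #work
ORDER BY thread, row_count DESC;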

Summary

There are two factors that we should consider: 1) the statistics recompute threshold, and 2) the sampling methodology (percentage). Prior to SQL Server 2014, an identity column could be subject to statistics-related problems if a compile or recompile occurs with an out-of-bounds value, for which there is no viable recompute threshold strategy; this must be handled in the procedure, for which a simple OPTIMIZE FOR hint is sufficient. The adjusted threshold setting via trace flag 2371 might be good, but there is an extremely serious problem with the SQL Server default sampling strategy. The resolution for this requires us to build our own statistics maintenance plan so that we can specify the sample size, unless we want to rely on the RESAMPLE option. Finally, there are special situations that cannot be resolved via independent statistics jobs, for which the resolution is a statistics update statement directly in our SQL sequence.

Appendix

The primary reference on SQL Server statistics from Microsoft is: Statistics Used by the Query Optimizer in Microsoft SQL Server 2008, writers: Eric N. Hanson and Yavor Angelov, contributor: Lubor Kollar.

Below is a query for just column statistics. Adjust as suitable.

SELECT ROW_NUMBER() OVER( ORDER BY s.name, o.name, i.stats_id) rn
, s.name + '.' + o.name [object] , i.name [stats]
, c.name leadcol , c.system_type_id ctype , c.is_identity isid , c.is_rowguidcol rgui
, rw= t.rows , t.rows_sampled , t.modification_counter , t.steps
, CONVERT(varchar, t.last_updated,120) updated, o.type otype
FROM sys.objects o WITH(NOLOCK)
JOIN sys.schemas s WITH(NOLOCK) ON s.schema_id = o.schema_id
JOIN sys.stats i WITH(NOLOCK) ON i.object_id = o.object_id
JOIN sys.stats_columns j WITH(NOLOCK) ON j.object_id = i.object_id AND j.stats_id = i.stats_id
AND j.stats_column_id = 1
JOIN sys.columns c WITH(NOLOCK) ON c.object_id = j.object_id AND c.column_id = j.column_id
OUTER APPLY sys.dm_db_stats_properties ( i.object_id , i.stats_id) t
LEFT JOIN sys.indexes x WITH(NOLOCK) ON x.object_id = i.object_id AND x.index_id = i.stats_id
WHERE o.type IN ('U', 'V') AND x.index_id IS NULL AND t.rows > 0
AND ( 10*t.modification_counter > t.rows
OR t.modification_counter*t.modification_counter > 10*t.rows
)