Parent: Optimizer

Data Distribution Statistics - Decoding Stats_Stream Update 2018-12-24

Data distribution statistics is one of the foundations of the cost-based query optimizer in all modern database engines including Microsoft SQL Server. From SQL Server 2005 on, most of the information displayed by DBCC SHOW_STATISTICS is kept in a binary field accessible with the STATS_STREAM clause.

Back in SQL Server 2000, it was possible to modify system tables directly, including the sysindexes stat_blob field. At the time, I described a decode of the stat_blob field with the purpose of influencing the execution plan, presumably on a development system and not a production system.

Starting with SQL Server 2005, it was nolonger possible to directly modify system tables. An API was provided to access data distribution statistics to allow cloning the statistics from one database to another. The presumed usage is to clone statistics from a large production database to a small development database.

In other database engines, I had heard of the idea of updating statistics on a backup system to be applied to the production system. While it was still possible to decode most of the 2005 stats_stream binary, it appears that a checksum was added so it was not possible to apply an externally generated statistics binary unless the "checksum" value could be correctly calculated.
Update 2018-12-24
It appears the field at byte 17 (counting from 1) changes periodically even when statistics have not been updated. Perhaps there is an underlying values, but what is reported with each call to DBCC SHOW_STATISTICS(Table,Stat) WITH STATS_STREAM varies?

Around 2005-08, I was working on other SQL Server tools, most prominently SQL System for performance monitoring, Exec Stats for execution plan analysis and TraceAnalysis for trace processing. Work on the SQL Server data distribution cloning tool was discontinued, and I could not continue further research into the decoding of SQL Server data distribution statistics.

Since several people have asked about the data distribution statistics decode, I am making what I know about stat_stream available. Note that organization of stats_stream changed from SQL Server version 2000 to 2005 and again to 2008? It is quite possible there are also changes in version 2012? Most of what I discuss here applies to version 2008 R2.
(Updates of 2018-12-24 are based on SQL Server 2017. There do not appear to be any changes?)

Decoding Stats Stream for SQL Server 2008R2 (also valid for SQL Server 2017?)

Here I am using 1 based reference. Byte index 1 is the first byte.
C# and most other programming languages use zero based index.

PositionLengthValue/TypePurpose
14?1unkown
54? number of vectors
940zero
1340zero
174 checksum? variable?
2140zero
254 stats stream length
2940zero
334 stats stream length - minus vector variable length
The difference [25]-[33] is 64 for 1 vector (defined as off1).
Each additional vector adds 24 bytes starting at byte position 41
3740zero
 
Start of vector information, some of the unknown fields chould be for nullable, collation, etc
40 + 11 system type id
40 + 21 unkown
40 + 32 unkown
40 + 54 user type id
40 + 92 length
40 + 111 Prec
40 + 121 Scale
40 + 134 unknown
40 + 174 unknown
40 + 212 unknown
40 + 232 unknown
Addition vectors if present, 24 bytes each
 
Offset 1 begins at 40 + 24× Vectors + 1
off1 + 1*4 Unknown, 0x07000000?
off1 + 58 Updated, 8 byte datetime
off1 + 138 Rows
off1 + 218 Rows sampled
off1 + 2944 byte realDensity, Header
off1 + 334×33=1324 byte realDensity, vector, upto 33 values
off1 + 37...44 byte real2nd density vector, etc.
 
off1 + 16544 byte intSteps (first copy), excludes step for RHK NULL
off1 + 16944 byte intSteps (second copy)
off1 + 17344 byte intnumber of vectors
off1 + 17744 byte intHistogram Step size, example: 20 bytes for 4 byte int.
off1 + 18144 byte realAverage Key length - header
off1 + 18544 byte realUnfiltered rows
off1 + 18944 byte intunknown
off1 + 1934x33=1324 byte realAverage key length, vector, upto 33 values
off1 + 197...44B real2nd Avg key len, etc.
 
off1 + 3251byte?unkown, values 0x11, 0x13 and 0x19 observed,
may determine the # of post histogram 8 bytes values starting at off1+341?
off1 + 3261byte?unkown, values 0x00, 0x04 observed,
off1 + 3276?unkown
off1 + 33388-byte 0?unknown
off1 + 34184/8 byte int?length of histogram?
off1 + 3498dittoabove + 8?
(off1 + 357)8dittomay not exist or above + 8?
 
off1 + 357
or 365?
8 per step
up to 200
8-byte intoffsets for histogram steps relative to off1 + 357 or 365,
the first offset should be 8*steps, read 8-byte value from off1+357 until this is so?
more offsets if value of [off1+325(?)] is 25 (0x19) or more?
off1 + 365...88B int2nd histogram step offset
 
Eventually, this sequence appears: 0x10001100 followed by three 4-byte real,
a value in native type of the stat, and then ending with 0x040000
off220x10 - 16length of core columns, or offset to Range Hi Key
off2+2217 or highersize of step excluding 3 byte trailer, or end of RHK
off2+444 byte realEq Rows
off2+844 byte realRange Rows
off2+1244 byte realAvg Range Rows
off2+16native lengthnative typeRange Hi Key
off2+16+x3 byte0x040000step terminator?, x is the size of the type
off3??additional info
*off1 = value of 4(8) byte int at position [25] - value of [33]
**off2 = off1 + 341 + 16 if value of [off1+325] is 0x11, or 24 if 0x13 or 0x19
***off3 = off1 + 341 + value of 4(8) byte int at [off1+341]
Note: when DBCC SHOW_STATISTICS has a step for RANGE_HI_KEY value NULL, this is not stored in Stat_Stream?
Stat_Stream for a number of data types and variations in NULL, or filter have not been investigated.

So far, for SQL Sever 2008 R2, I have only looked at fixed length not nullable statistics. Variable length statistics has different organization, particularly in the histogram part. String statistics may have extended information after the histogram, per new feature of SQL Server 2008?

Variable length (varchar) (2012-06-16)
For varchar statistics, the histogram organization appears to be as follows.
The off2 value is 0x3000 (48), off2+2 value is 0x1000 (16). I am not sure what 48 means, as it is not the length of the step. The 16 values seems to mean that the first 16 bytes are the same as for fixed length, starting with the two 2 byte values just described, followed by 4 byte Equal rows, 4 byte Range Rows, and 4 byte Avg Range Rows.
Next is the 3 byte 0x040000, and then the Range Hi Key value. The first 4 bytes might be some kind of page code or collation indicator, possibly represented by two 2 byte values. After this is the actual string value in binary.

2018-12-24
Updated Supporting SQL functions and procedures: decoding stats stream4 (2018-12-26)

Supporting SQL functions and procedures: decoding stats stream3 (2012-06-16)

Update 2012-06-12 supports varchar stats.

Update 2012-05-12 has a stored procedure that accepts table and index (or column stat) as input parameters, in addition to the original procedure that has the stats stream binary.

(Old) Supporting SQL functions and procedures: decoding stats stream (old)

Umachandar provides a SQL function for converting 4-byte binary to real or 8-byte binary float, and vice versa.

ps
An interesting fact is that it is not necessary for statistics to be highly accurate to be effective. Normally we are interested in distribution differences to shift the execution plan from one to another. The boundaries for this can be very wide. False statistics in certain circumstances might guard against catastrophically bad execution plans, example in out of bounds situations. Another is in skewed distributions, but this should be handled by other means, to ensure high and low distributions get different execution plans.