Data Distribution Statistics - Decoding Stats_Stream

Data distribution statistics are one of the foundations of the cost-based query optimizer in all modern database engines, including SQL Server. From SQL Server 2005 on, most of the information displayed by DBCC SHOW_STATISTICS is kept in a binary field accessible with the STATS_STREAM clause. Back in SQL Server 2000, it was possible to modify system tables directly, including the sysindexes stat_blob field. At the time, I described a decode of the stat_blob field for the purpose of influencing the execution plan, presumably on a development system rather than a production system.

Starting with SQL Server 2005, it was no longer possible to directly modify system tables. An API was provided to access data distribution statistics, allowing the statistics to be cloned from one database to another. The presumed usage is to clone statistics from a large production database to a small development database. In other database engines, I had heard of the idea of updating statistics on a backup system and then applying them to the production system. While it was still possible to decode most of the 2005 stats_stream binary, it appears that a checksum was added, so an externally generated statistics binary could not be applied unless the "checksum" value could be correctly calculated.

Around this time, I was working on other SQL Server tools, most prominently SQL System for performance monitoring, Exec Stats for execution plan analysis, and TraceAnalysis for trace processing. Work on the SQL Server data distribution cloning tool was discontinued, and I did not continue further research into the decoding of SQL Server data distribution statistics.

Since several people have asked about the data distribution statistics decode, I am making what I know about stats_stream available. Note that the organization of stats_stream changed from SQL Server 2000 to 2005, and again in 2008(?). It is quite possible there are further changes in 2012(?). Most of what I discuss here applies to version 2008 R2.

Decoding Stats Stream for SQL Server 2008 R2

Here I am using 1-based references: byte index 1 is the first byte.
C# and most other programming languages use zero-based indexes.
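To make the indexing convention concrete, here is a minimal Python sketch (with a made-up field value) showing how a 1-based byte position from the tables below maps to a 0-based slice:

```python
import struct

# A hypothetical 20-byte fragment of a stats_stream blob.
# Per the table below, position 17, length 4, is the checksum field;
# in 0-based terms that is the slice [16:20].
blob = bytes(16) + struct.pack("<I", 0xDEADBEEF)  # invented checksum value

position, length = 17, 4                       # 1-based position from the table
field = blob[position - 1 : position - 1 + length]
checksum = struct.unpack("<I", field)[0]       # assuming little-endian storage
print(hex(checksum))
```

Little-endian byte order is an assumption here (consistent with x86 SQL Server hosts), not something stated in the original decode.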

Position    Length     Value/Type     Purpose
1           4          1?             unknown
5           4          ?              number of vectors
9           4          0              zero
13          4          0              zero
17          4                         checksum
21          4          0              zero
25          4                         stats stream length
29          4          0              zero
33          4                         stats stream length minus vector variable length
The difference [25]-[33] is 64 for 1 vector (defined as off1).
Each additional vector adds 24 bytes, starting at byte position 41.
37          4          0              zero

Start of vector information, 24 bytes per vector:
Position    Length     Value/Type     Purpose
41          1                         system type id
42          1                         unknown
43          2                         unknown
45          4                         user type id
49          2                         length
51          1                         Prec
52          1                         Scale
53          4                         unknown
57          4                         unknown
61          2                         unknown
63          2                         unknown
Some of the unknown fields should be for nullable, collation, etc.
Additional vectors, if present, follow the same layout.

Position    Length     Value/Type     Purpose
off1+1      9                         Updated?, 9-byte datetime2?
off1+10     3                         unknown
off1+13     8                         Rows
off1+21     8                         Rows sampled
off1+29     4          4-byte real    Density - header
off1+33     4x33=132   4-byte real    Density - vector, up to 33 values
off1+165    4          4-byte int     Steps (first copy)
off1+169    4          4-byte int     Steps (second copy)
off1+173    4          4-byte int     number of vectors
off1+177    4          4-byte int     Step size (in bytes)
off1+181    4          4-byte real    Average key length - header
off1+185    4          4-byte real    Unfiltered rows
off1+189    4          4-byte int     unknown
off1+193    4x33=132   4-byte real    Average key length - vector, up to 33 values
Some fields may represent string index (bool), or filter expression.
off1+325    8          8-byte int     unknown; values 0x11, 0x13 and 0x19 observed,
                                      may determine the number of post-histogram 8-byte values starting at off1+341?
off1+333    8          0              8-byte 0?
off1+341    8          0              offset for value after histogram?
off1+349    8          0              another offset
off1+357    8          0              another offset, if value of [off1+325] is 19 (0x13) or more?
                                      more offsets, if value of [off1+325] is 25 (0x19) or more?
Eventually, this sequence appears: 0x10001100, followed by three 4-byte reals,
a value in the native type of the statistic, and then the terminator 0x040000.

Histogram step layout:
Position    Length         Value/Type     Purpose
off2        2              0x10 = 16      length of core columns
off2+2      2              17 or higher   size of step, excluding 3-byte trailer
off2+4      4              4-byte real    Eq Rows
off2+8      4              4-byte real    Range Rows
off2+12     4              4-byte real    Avg Range Rows
off2+16     native length  native type    Range Hi Key
off2+16+x   3              0x040000       step terminator? (x is the size of the type)
off3        ?              ?              additional info
*off1 = value of 4(8)-byte int at position [25] minus value at [33]
**off2 = off1 + 341 + 16 if the value at [off1+325] is 0x11, or off1 + 341 + 24 if it is 0x13 or 0x19
***off3 = off1 + 341 + value of 4(8)-byte int at [off1+341]
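As a working sketch of the header layout above, the following Python fragment pulls out the fixed-position header fields and computes off1 as defined in the footnotes. The layout is exactly the reverse-engineered guesswork of the table (little-endian assumed), so treat this as an exploratory tool rather than a specification; the synthetic blob values are invented for illustration.

```python
import struct

def parse_header(blob: bytes) -> dict:
    """Decode the fixed header fields of a 2008 R2 stats_stream blob,
    per the (partly guessed) layout in the table above. Byte positions
    are the table's 1-based positions converted to 0-based offsets."""
    u32 = lambda pos: struct.unpack_from("<I", blob, pos - 1)[0]
    hdr = {
        "num_vectors":          u32(5),   # position 5, 4 bytes
        "checksum":             u32(17),  # position 17, 4 bytes
        "stream_length":        u32(25),  # position 25, 4 bytes
        "length_minus_vectors": u32(33),  # position 33, 4 bytes
    }
    # off1 = [25] - [33]; 64 for a single vector, plus 24 per extra vector
    hdr["off1"] = hdr["stream_length"] - hdr["length_minus_vectors"]
    return hdr

# Synthetic blob with only the header fields populated (everything else zero)
blob = bytearray(64)
struct.pack_into("<I", blob, 4, 1)      # one vector
struct.pack_into("<I", blob, 24, 1000)  # stats stream length
struct.pack_into("<I", blob, 32, 936)   # stream length minus vector part
print(parse_header(bytes(blob)))        # off1 comes out as 1000 - 936 = 64
```

A real blob would come from something like SELECT ... WITH STATS_STREAM output saved to a file; the parser above only touches the offsets the table claims to understand.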

So far, for SQL Server 2008 R2, I have only looked at fixed-length, not-nullable statistics. Variable-length statistics have a different organization, particularly in the histogram part. String statistics may have extended information after the histogram, per a new feature of SQL Server 2008(?).

Variable length (varchar) (2012-06-16)
For varchar statistics, the histogram organization appears to be as follows.
The 2-byte value at off2 is 0x3000 (48 read as little-endian), and the value at off2+2 is 0x1000 (16). I am not sure what the 48 means, as it is not the length of the step. The 16 seems to mean that the first 16 bytes are the same as for fixed-length statistics: the two 2-byte values just described, followed by 4-byte Eq Rows, 4-byte Range Rows, and 4-byte Avg Range Rows.
Next is the 3-byte 0x040000, and then the Range Hi Key value. The first 4 bytes of the value might be some kind of code page or collation indicator, possibly two 2-byte values. After this is the actual string value in binary.
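The varchar step description above can be sketched in Python as follows. This is strictly a guess at the layout as described: the step length is not yet understood, so the caller must supply the string length, and the synthetic test step is invented data.

```python
import struct

def parse_varchar_step(blob: bytes, off2: int, value_len: int):
    """Sketch of one varchar histogram step per the description above.
    off2 is a 0-based offset; value_len (the byte length of the string
    value) must be supplied, since where the per-step length is stored
    has not been worked out."""
    core, sixteen = struct.unpack_from("<HH", blob, off2)       # 48?, 16
    eq_rows, range_rows, avg_range_rows = struct.unpack_from("<fff", blob, off2 + 4)
    assert blob[off2 + 16 : off2 + 19] == b"\x04\x00\x00"       # 3-byte marker
    collation = blob[off2 + 19 : off2 + 23]                     # code page / collation?
    value = blob[off2 + 23 : off2 + 23 + value_len]             # raw string bytes
    return eq_rows, range_rows, avg_range_rows, value

# Synthetic step for the string 'abc' with made-up row counts
step = (struct.pack("<HH", 48, 16)
        + struct.pack("<fff", 5.0, 2.0, 1.0)
        + b"\x04\x00\x00" + bytes(4) + b"abc")
print(parse_varchar_step(step, 0, 3))
```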

Supporting SQL functions and procedures: decoding stats stream - Updated (2012-06-16)

Update 2012-06-12 supports varchar stats.

Update 2012-05-12 has a stored procedure that accepts table and index (or column stat) as input parameters, in addition to the original procedure that has the stats stream binary.

(Old) Supporting SQL functions and procedures: decoding stats stream

Umachandar provides a SQL function for converting a 4-byte binary to a real, or an 8-byte binary to a float, and vice versa.
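For checking the 4-byte real values by hand outside T-SQL, the equivalent conversion in Python is a one-liner with struct (IEEE 754 single precision, little-endian assumed):

```python
import struct

def bin_to_real(b: bytes) -> float:
    """Interpret a 4-byte little-endian binary as a single-precision real."""
    return struct.unpack("<f", b)[0]

def real_to_bin(x: float) -> bytes:
    """Inverse: encode a float as a 4-byte single-precision binary."""
    return struct.pack("<f", x)

print(bin_to_real(bytes.fromhex("0000803f")))  # 1.0
print(real_to_bin(1.0).hex())                  # 0000803f
```

For the 8-byte case, substitute the "<d" (double precision) format character.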

PS
An interesting fact is that statistics need not be highly accurate to be effective. Normally we are interested in distribution differences large enough to shift the execution plan from one to another, and the boundaries for this can be very wide. False statistics might, in certain circumstances, guard against catastrophically bad execution plans, for example in out-of-bounds situations. Another case is skewed distributions, but that should be handled by other means, to ensure the high and low ends of the distribution get different execution plans.