Data distribution statistics is one of the foundations of the cost-based query optimizer in all modern database engines including SQL Server. From SQL Server 2005 on, most of the information displayed by DBCC SHOW_STATISTICS is kept in a binary field accessible with the STATS_STREAM clause. Back in SQL Server 2000, it was possible to modify system tables directly, including the sysindexes stat_blob field. At the time, I described a decode of the stat_blob field with the purpose of influence the execution plan, presumably on a development system and not a production system.
Starting with SQL Server 2005, it was nolonger possible to directly modify system tables. An API was provided to access data distribution statistics to allow cloning the statistics from one database to another. The presumed usage is to clone statistics from a large production database to a small development database. In other database engines, I had heard of the idea of updating statistics on a backup system to be applied to the production system. While it was still possible to decode most of the 2005 stats_stream binary, it appears that a checksum was added so it was not possible to apply an externally generated statistics binary unless the "checksum" value could be correctly calculated.
Around this time, I was working on other SQL Server tools, most prominently SQL System for performance monitoring, Exec Stats for execution plan analysis and TraceAnalysis for trace processing. Work on the SQL Server data distribution cloning tool was discontinued, and I could not continue further research into the decoding of SQL Server data distribution statistics.
Since several people have asked about the data distribution statistics decode, I am making what I know about stat_stream available. Note that organization of stats_stream changed from SQL Server version 2000 to 2005 and again to 2008? It is quite possible there are also changes in version 2012? Most of what I discuss here applies to version 2008 R2.
|5||4?||number of vectors|
|25||4||stats stream length|
|33||4||stats stream length - minus vector variable length
The difference - is 64 for 1 vector (defined as off1).
Each additional vector adds 24 bytes starting at byte position 41
|Start of vector information, 24-bytes per vector|
|41||1||system type id|
|45||4||user type id|
|Some of the unknown fields should be for nullable, collation, etc|
|Addition vectors if present|
|off1+1*||9||Updated?, 9 byte datetime2?|
|off1+29||4||4 byte real||Density - Header|
|off1+33||4x33=132||4 byte real||Density - vector, upto 33 values|
|off1+165||4||4 byte int||Steps (first copy)|
|off1+169||4||4 byte int||Steps (second copy)|
|off1+173||4||4 byte int||number of vectors|
|off1+177||4||4 byte int||Step size (in bytes)|
|off1+181||4||4 byte real||Average Key length - header|
|off1+185||4||4 byte real||Unfiltered rows|
|off1+189||4||4 byte int||unknown|
|off1+193||4x33=132||4 byte real||Average key length - vector, upto 33 values|
|Some fields may represent string index (bool), or filter expression|
|off1+325||8||8 byte int||unkown, values 0x11, 0x13 and 0x19 observed,
may determine the # of post histogram 8 bytes values starting at off1+341?
|off1+333||8||0||8 byte 0?|
|off1+341||8||0||offset for value after histogram?|
|off1+357||8||0||another offset if value of [off1+25] is 19 or more?|
|more offsets if value of [off1+25] is 25 or more?|
|Eventually, this sequence appears: 0x10001100 followed by three 4-byte real,
a value in native type of the stat, and then ending with 0x040000
|off2||2||0x10 - 16||length of core columns|
|off2+2||2||17 or higher||size of step, excluding 3 byte trailer|
|off2+4||4||4 byte real||Eq Rows|
|off2+8||4||4 byte real||Range Rows|
|off2+12||4||4 byte real||Avg Range Rows|
|off2+16||native length||native type||Range Hi Key|
|off2+16+x||3 byte||0x040000||step terminator?, x is the size of the type|
|*off1 = value of 4(8) byte int at position  - value of |
|**off2 = off1 + 341 + 16 if value of [off1+325] is 0x11, or 24 if 0x13 or 0x19|
|***off3 = off1 + 341 + value of 4(8) byte int at [off1+341]|
So far, for SQL Sever 2008 R2, I have only looked at fixed length not nullable statistics. Variable length statistics has different organization, particularly in the histogram part. String statistics may have extended information after the histogram, per new feature of SQL Server 2008?
Variable length (varchar) (2012-06-16)
For varchar statistics, the histogram organization appears to be as follows.
The off2 value is 0x3000 (48), off2+2 value is 0x1000 (16). I am not sure what 48 means, as it is not the length of the step. The 16 values seems to mean that the first 16 bytes are the same as for fixed length, starting with the two 2 byte values just described, followed by 4 byte Equal rows, 4 byte Range Rows, and 4 byte Avg Range Rows.
Next is the 3 byte 0x040000, and then the Range Hi Key value. The first 4 bytes might be some kind of page code or collation indicator, possibly represented by two 2 byte values. After this is the actual string value in binary.
Supporting SQL functions and procedures: decoding stats stream - Updated (2012-06-16)
Update 2012-06-12 supports varchar stats.
Update 2012-05-12 has a stored procedure that accepts table and index (or column stat) as input parameters, in addition to the original procedure that has the stats stream binary.
(Old) Supporting SQL functions and procedures: decoding stats stream
Umachandar provides a SQL function for converting 4-byte binary to real or 8-byte binary float, and vice versa.
An interesting fact is that it is not necessary for statistics to be highly accurate to be effective. Normally we are interested in distribution differences to shift the execution plan from one to another. The boundaries for this can be very wide. False statistics in certain circumstances might guard against catastrophically bad execution plans, example in out of bounds situations. Another is in skewed distributions, but this should be handled by other means, to ensure high and low distributions get different execution plans.