
IO Cost Structure Anticipating SSD Arrays (2010-10)

An observant person has probably noticed that SQL queries requiring disk reads not only have longer duration but also higher CPU times. It is not hard then to deduce that disk access (for both HDD and SSD), which involves the OS performing an IO call, the SQL Server process finding a place in the buffer cache for the data pages, and possibly evicting current pages from the buffer cache, has significant CPU cost. What then is the cost structure of SQL Server disk IO? Exposing this is not entirely simple, as it is necessary to generate specific IO operations in a repeatable manner.

A while back, I ran a series of tests to expose the CPU cost of various SQL operations when all data was resident in memory. At the time, examining disk IO cost was not a pressing matter. If disk IO was necessary, the duration due to disk latency was going to be much higher than the CPU time portion in most situations, the exception being the few environments with very powerful storage systems, typically with several hundred disk drives (properly configured and verified).

Now solid state disks (SSD) suitable for enterprise (or server storage) systems are on the verge of being practical for broad deployment. It is actually possible for a current generation 4-way server with a 20-40 SSD array (anyone thinking of a single SSD should let others do the thinking) to reach disk IO performance levels limited by CPU rather than by access latency. So it is now important to have an understanding of the disk IO cost structure, along with the implications for the cost based optimizer (CBO) concerning IO performance characteristics.

Methodology

SQL Server provides several methods for collecting information on CPU and duration. This includes
1) Profiler and Trace,
2) the SET STATISTICS TIME option, and for SQL Server 2005 and later
3) the DMV (dm_exec_query_stats).

IO at the SQL Server level can be obtained from:
1) the DMV dm_io_virtual_file_stats (equivalent to fn_virtualfilestats)
and 2) the SET STATISTICS IO option.
It is always advisable to compare information from SQL Server with Operating System performance counters as a sanity check. The DMV method was found to be useful for averaging multiple calls, but would not properly report parallel execution plans, so all tests here are for non-parallel plans (MAXDOP 1).
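
As a rough sketch of how numbers can be pulled from the DMVs (the column names are the standard ones; the averaging shown is just one reasonable approach, not the exact scripts used for these tests):

  -- average CPU, duration and physical reads per execution for recent statements
  SELECT TOP 20
    qs.execution_count,
    qs.total_worker_time / qs.execution_count    AS avg_cpu_microsec,
    qs.total_elapsed_time / qs.execution_count   AS avg_duration_microsec,
    qs.total_physical_reads / qs.execution_count AS avg_physical_reads,
    SUBSTRING(st.text, qs.statement_start_offset/2 + 1, 100) AS query_start
  FROM sys.dm_exec_query_stats qs
  CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
  ORDER BY qs.last_execution_time DESC;

  -- per-file IO at the SQL Server level, for the sanity check against OS counters
  SELECT DB_NAME(database_id) AS db, file_id,
    num_of_reads, num_of_bytes_read, io_stall_read_ms
  FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL);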

In any case, the reported CPU and duration cost is for the entire query, not a specific portion. So to expose the disk component, it is necessary to compare the in-memory and the not in-memory (disk) results.

The first step is to build a set of test tables and SQL queries, exposing as many of the dependencies as possible, including total row length, data types, variable length columns, lock levels and so on. The next step is to ensure that measurements are repeatable.
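
A minimal sketch of what such a test table can look like (the names, sizes and row counts are illustrative, not the exact tables used): a fixed-length filler column sets the rows per page, and GroupID is the unindexed column later used as the search argument.

  -- roughly 10 rows per 8KB page: ~800 bytes per row including overhead
  CREATE TABLE dbo.TestScan10 (
    ID      int NOT NULL,
    GroupID int NOT NULL,
    Value   float NOT NULL,
    Filler  char(760) NOT NULL,
    CONSTRAINT PK_TestScan10 PRIMARY KEY CLUSTERED (ID)
  );

  -- 1M rows; GroupID repeats every 100,000 rows, so any one value matches 10 rows
  INSERT dbo.TestScan10 (ID, GroupID, Value, Filler)
  SELECT n, n % 100000, n * 1.0, 'x'
  FROM (SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM master.sys.all_columns a CROSS JOIN master.sys.all_columns b) AS nums;

Adjusting the filler size (and row count) produces the other row densities and table sizes.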

Test System and Environment

The test system is a Dell PowerEdge 2900 with:
2 x QC E5430 2.66GHz processors,
24GB memory,
1 PCI-E PERC6/I RAID controller
and 8 15K SAS disk drives in two RAID 0 arrays.
The operating system is Windows Server 2008 64-bit, with SQL Server 2008 64-bit (RTM). The test database tables for table scans are 20GB each. The test table for random IO is 64GB of leaf level data, plus indexes.

Table Scan

The table scan on a clustered index organized table, with no fragmentation, was observed to generate large block IO, from 192-384KB per read, depending on the lock level and the state of the buffer cache. Surprisingly, at default (Read Committed) with an empty buffer cache, the non-parallel table scan generates a rather low 367MB/sec, compared with 707MB/sec at NOLOCK. The Read Committed read from disk rate was 680MB/sec with a warm cache.

The table below shows the cost per page for a table scan, in CPU-microseconds, for various row densities at NOLOCK. The test query for the first two result columns is a needle-in-haystack type, with a SARG that very few rows (10) meet but with no suitable index, so a table scan is required. For the first result column, the entire table was already in memory. For the second result column, the buffer cache was empty. In both cases, no evictions were necessary. The third column is the in-memory result for a simple count and single column aggregate of all rows in the table. By running this test on tables with a range of row lengths (8000 bytes for 1 row/page, to 25 bytes for 323 rows per page) it is possible to determine the cost associated with a page access and the cost per row.

Rows/page Memory Disk Aggregate (CPU-micro-sec per page)
1 0.94 5.28 1.18
2 1.03 5.27 1.44
5 1.36 5.65 2.38
10 1.72 5.98 3.78
20 2.59 6.87 6.14
40 3.40 7.82 8.87
80 5.08 9.68 15.01
158 8.84 13.39 28.48
323 16.91 21.27 74.00

From the above table, one can deduce that the cost per page to load data from disk to the buffer cache in large block IO (>192K/Read) is in the range of 4.3-4.6 CPU-microseconds. In another series of tests, SQL Server was restricted to 2GB memory, so in a 20GB table scan, 2GB was already in memory, 18GB must be read from disk, and 18GB must be evicted from the buffer cache. The cost was about 8 CPU-micro-sec per page at low row density. This implies the cost of evicting 1 (clean, not dirty) page is just over 3 CPU-micro-sec per page, based on the assumption of 18GB read and 18GB evicted.
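
For reference, the scan tests above were of roughly this form (a sketch, using the hypothetical table from the earlier example; DBCC DROPCLEANBUFFERS produces the empty buffer cache case):

  CHECKPOINT;
  DBCC DROPCLEANBUFFERS;  -- cold cache run; omit for the in-memory run

  SET STATISTICS TIME ON;
  -- needle-in-haystack: the SARG matches only 10 rows, no supporting index, so a scan
  SELECT COUNT(*), SUM(Value)
  FROM dbo.TestScan10 WITH (NOLOCK)
  WHERE GroupID = 12345
  OPTION (MAXDOP 1);

  -- aggregate-all-rows variant (third result column in the table above)
  SELECT COUNT(*), SUM(Value)
  FROM dbo.TestScan10 WITH (NOLOCK)
  OPTION (MAXDOP 1);
  SET STATISTICS TIME OFF;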

Key Lookup

A nonclustered index seek, followed by a key lookup (bookmark lookup in SQL Server 2000), can generate small block random IO. However, as we should know, the nonclustered key, on a table with a clustered index, really includes the cluster key. To generate essentially random IO, it is necessary to include an extra column between the leading key and the cluster key with an essentially random distribution. One aspect of key lookup generated disk IO is that when the buffer cache is empty, SQL Server will issue a 64K read instead of an 8KB read. When the buffer cache is full (it is not clear exactly where the transition point occurs) SQL Server issues 8KB IO. Another aspect of small block random IO is scatter-gather IO. This is a special feature in the Windows file IO API that allows issuing a single call for multiple blocks. To determine the random IO cost structure, it is necessary to test across a wide range of rows per query, observing the transition point from single block IO to scatter-gather IO. Tests should also start with a warm buffer cache unless the intent is to determine the cost of 64K IO, in which case it is necessary to ensure that for the duration of the test, all IOs are 64K. It is also helpful to ensure that no two rows for a given key value reside in the same 8K page or even 64K extent, for better consistency between measurements.
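
A minimal sketch of this arrangement (table, column and index names are hypothetical; RandKey is the essentially random column placed between the leading key and the cluster key):

  CREATE TABLE dbo.TestLookup (
    ID      int NOT NULL PRIMARY KEY CLUSTERED,
    GroupID int NOT NULL,
    RandKey int NOT NULL,  -- essentially random values
    Value   float NOT NULL,
    Filler  char(200) NOT NULL
  );

  -- the cluster key (ID) is carried in the nonclustered index after RandKey
  CREATE INDEX IX_TestLookup_GroupID ON dbo.TestLookup (GroupID, RandKey);

  -- index seek on GroupID, then one key lookup per row; widen the GroupID range
  -- (or the rows per GroupID value) to vary the rows per query
  DECLARE @g int = 123;
  SELECT GroupID, RandKey, Value, Filler
  FROM dbo.TestLookup
  WHERE GroupID = @g
  OPTION (MAXDOP 1);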

The table below shows the CPU-micro-sec per row for queries from 10 rows per Select to 10,000 rows per Select. The first result column is for data in memory. The higher cost per row for a 10 row query should probably be attributed to the fixed costs of issuing a Select query involving an index seek and bookmark lookup. So the actual in-memory cost per row is probably 4 CPU-micro-sec. The second result column is the average CPU cost per row when disk access is required (64GB data, plus indexes) and for a warm cache (8K per read). There may be more than one disk IO per row, depending on how much of the nonclustered index and the upper levels of the clustered index are in memory, so it is necessary to normalize to the actual number of reads. The third result column is the CPU cost per 8KB read. The fourth column is from a test starting with a cold cache, for which the average size per read is 64K for the entire test.

Rows/Query In-mem Disk (8K) 8K Read 64K Read
10 5.20 35.50 26.6 46.3
20 4.15 31.00 25.4 47.7
25 4.00 32.84 27.6 46.1
26 4.31 26.54 21.8 44.8
30 4.17 21.80 17.5 43.4
50 4.04 21.32 17.3 44.7
100 3.90 20.82 17.1 46.9
300 3.84 21.11 17.6 49.2
1,000 4.06 21.28 17.7 49.7
3,000 4.89 21.17 16.8 46.1
10,000 4.18 19.86 15.9 41.6

In some of the above tests, the single IO to scatter-gather IO transition point was between 25 and 26 rows per query. In another set, the transition point was between 24 and 25 rows. The cost per 8KB read is about 26 CPU-micro-sec per 8K page for single block IO and at the onset of scatter-gather IO. However, the average cost drops to 16-17 CPU-micro-sec by 30 rows per Select query and higher. Notice that the cost per 64K read is only 44-50 CPU-micro-sec, definitely higher than a single 8K read, but much lower than 8 separate 8K reads, even when issued with scatter-gather IO.

The query for 25 key lookups, most of which require (single IO) disk access, had an average duration of 110ms, with an average disk read latency of 4.1ms. Note that even though the data is distributed over 8 disk drives, if IO is issued one at a time synchronously, the individual query (or sequence of queries) does not benefit from the disk array. The overall system throughput, driven by many concurrent connections, does benefit from the multi-disk array. A query for 26 rows with key lookups had an average duration of 26ms, with disk latency of 8.6ms. So clearly, multiple IOs are issued concurrently, allowing the query to complete much more quickly, even though disk latency is higher. The average CPU was slightly lower at 26 rows, 690 micro-sec versus 820 at 25 rows.

This test was done with the explicit SQL query executed from SSMS, so that Profiler captures it as a TSQL event (also called a language event). Another surprise is that the multiple concurrent IO behavior did not occur at the 26 row point, or even at 10,000 rows, for queries executed with sp_executesql. Previously, I had noted a difference in hash operation behavior between TSQL (and stored procedures executed as TSQL) and RPC calls. The hash operation in a language event call would not spool to tempdb even when the intermediate results were moderately large. A query plan with the hash operation called by RPC would spool to tempdb at a much lower point. A hash operation spooling to tempdb is more expensive than an in-memory hash. The lower set point presumably is helpful in maintaining overall system throughput with many concurrent users. So apparently there are set point differences between queries executed as TSQL and as RPC. From the client side, declare a command object as type Text to execute a stored procedure as TSQL. Appending parameters to the command will result in the use of sp_executesql. The preferred option is to execute a large query as a language event if it must run fast and is allowed to disrupt transactions, and as RPC if it is not as time critical and it is more important not to disrupt transactions.
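
To make the distinction concrete (a sketch; the statement itself is a placeholder), the same query can arrive at the server either as a language event or as an RPC via sp_executesql. In Profiler, the first shows up as SQL:BatchCompleted and the second as RPC:Completed.

  -- language event (TSQL): what SSMS sends, or a client command of type Text with no parameters
  SELECT GroupID, Value FROM dbo.TestLookup WHERE GroupID = 123;

  -- RPC: what a parameterized client command produces
  EXEC sp_executesql
    N'SELECT GroupID, Value FROM dbo.TestLookup WHERE GroupID = @g',
    N'@g int',
    @g = 123;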

Below is some additional information demonstrating the single IO to scatter-gather IO transition occurring at either 25-26 rows or 24-25 rows. The first column is the rows per query. The second column is rows per sec when all data and indexes are in memory. The third column is rows per sec with a warm cache (generating 8KB per disk read) and the fourth column is the actual physical disk reads per sec. The fifth column is the rows per sec for a cold cache (generating 64K per disk read) and the sixth column is the actual physical disk reads per sec. When the rows per query are below the transition point, it does not matter how many disks or LUNs there are; the disk IO is issued serially, one at a time, and the rate is mostly determined by disk latency. Once over the scatter-gather transition point, a single query can benefit from a high count disk array.

rows/q memory disk 8K 8K Rd/s disk 64K 64K Rd/s
10 168,228 216 246.3 54 207.4
20 212,690 230 242.9 55 208.3
25 222,735 226 236.0 365 960.5
26 211,238 1,005 1,026.8 355 937.5
30 217,445 987 996.9 337 907.7
50 232,512 1,457 1,458.3 535 1,259.1
100 248,160 1,798 1,776.5 717 1,570.5
300 257,793 2,270 2,224.7 1,093 2,133.5
1,000 244,911 2,492 2,418.0 1,321 2,461.3
3,000 203,900 3,912 3,784.4 1,725 3,146.4
10,000 239,000 4,375 4,305.0 2,292 3,847.8

Storage Performance Implications

A table scan could potentially generate from 360MB/sec to 1.5GB/sec per processor core (after including the cost to read a page into memory) depending on the number of rows per page. A random load could generate 30-45K IOPS per core. There are two significant adjustments to the raw data. One, in scaling from 1 to 8 cores, there is not going to be an 8X increase in throughput; 6-7X would be very good. Second, the above costs were derived from special queries meant to expose the cost structure. Specifically, the table scan is a needle-in-haystack type query, looking for a few rows with minimal logic. When a large number of rows are aggregated, the cost is higher. The cost is even higher if the logic is complicated. Consider the case of TPC-H query 1, which aggregates 7 columns with additional arithmetic calculations and does a group by (necessitating a hash aggregate instead of a stream aggregate). This query involves 9-10GB of data, and takes around 50 CPU-sec on a 2.66GHz Intel Core2 processor, for about 200MB per CPU-sec.
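
For reference, TPC-H query 1 has roughly this shape (the standard published form, against the lineitem table):

  SELECT l_returnflag, l_linestatus,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty,
    AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
  FROM lineitem
  WHERE l_shipdate <= DATEADD(day, -90, '1998-12-01')
  GROUP BY l_returnflag, l_linestatus
  ORDER BY l_returnflag, l_linestatus;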

This is one of the major contention points with the SQL Server cost model. The purpose of the CBO is to determine the best table access strategy for a given query and data distribution statistics. In non-parallel queries, the exact split between CPU and IO cost does not really matter, so it does not matter that the cost model does not attempt to account for the CPU cost of logic operations. For example, SUM(col1) has the same model cost as SUM(col1), SUM(col2), except in sort and hash operations, where the size of the intermediate results does impact the cost model. In a parallel plan, CPU costs are reduced depending on the degree of parallelism, as a given amount of processing is divided over multiple processors, so the work is completed more quickly, while IO costs to permanent tables are not reduced. In a query with expensive logical operations, the actual CPU portion is far more than what is represented by the SQL Server cost model. Because this is not accounted for, it is possible that a specific query may not get a parallel execution plan because of the high IO cost portion, when in fact a parallel plan would be very effective.
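
One way to see this effect (a sketch, using the hypothetical scan table from earlier): the two statements below get essentially the same estimated plan cost, but SET STATISTICS TIME will report noticeably higher CPU for the second, which does several times the arithmetic per row.

  SET STATISTICS TIME ON;

  SELECT SUM(Value)
  FROM dbo.TestScan10
  OPTION (MAXDOP 1);

  SELECT SUM(Value), SUM(Value * 2.0), SUM(Value * Value), AVG(Value)
  FROM dbo.TestScan10
  OPTION (MAXDOP 1);

  SET STATISTICS TIME OFF;
  -- compare the estimated subtree costs (Include Actual Execution Plan in SSMS)
  -- against the measured CPU times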

So with the understanding that there is a wide range of variables, a reasonable target for an 8-core server system might be 2GB/sec sequential and 200K IOPS random. I will probably try for 4GB/sec sequential when I can build a SSD array.
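
These targets follow from the per-core figures above: roughly 30K IOPS per core times an effective 6-7X scaling over 8 cores works out to about 200K IOPS, and 300-360MB/sec per core with the same scaling factor (and an allowance for queries with heavier logic) works out to roughly 2GB/sec.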

Let me stress again, all of the above numbers are still very preliminary!

Also, the purpose of this for now is not to nail down the exact cost structure of SQL Server disk IO, but rather to set IO performance specifications for a server configured with an SSD array, or a really massive HDD storage system.

Tuning is still Important!

It will be really great to get SSDs. We can already hit very high sequential rates with HDD; with SSD we can also hit very high random IO. However, the CPU costs discussed above for HDD apply to SSD as well; it is the same code path. It is also clear that the CPU associated with IO is very substantial. A well designed database keeping hot data in memory is still very important. Do not let some nitwit tell you it is OK to rely on the SSD!