
IO Queue Depth Strategy

I am finally getting around to setting up an SSD array, starting with a few devices, then working up to perhaps 20 units over 2 controllers and 4 x4 SAS ports. During initial testing I observed very high disk latency, in the range of 100ms+ for reads and up to 400ms+ for writes in certain operations. This occurs during periods of very high disk queue depth. So the questions are: 1) does high queue depth (and the accompanying latency) improve performance, 2) does it cause problems for other operations, including responsiveness, and 3) can this be avoided without giving up performance?

Let's start with single disk drive performance principles, then proceed to disk arrays, caching RAID controllers, and finally the complete storage system including SAN, with consideration for SSD characteristics. Storage performance advice is commonly cited without reference to its origin and, more importantly, its scope, both of which can be traced. With this, we can understand why the simple IO queue depth model used by SQL Server leaves much on the table, and how it could be improved with a more comprehensive model of disk, SSD, and storage system performance.

Hard Disk IOPS Theory

The standard theory of disk drive random IO rate is that average access time is the sum of the rotational latency, the average seek time, the transfer time, command overhead, and propagation delays. For small block IO, only the first two items make a significant contribution. The average rotational latency for a 15K disk drive is 2ms, and the typical 15K 3.5in disk cites an average seek time of 3.4ms. These two main contributors plus the others add up to a total average access time of around 5.5ms. The two key qualifiers are:

  1) for random accesses to data distributed over the entire disk
  2) at queue depth one: one IO is issued, and the next is issued only after the completion of the first

A 15K disk is capable of about 180 IOPS at queue depth 1 for random accesses distributed across the entire disk. Too many people neglect to mention the two key qualifiers.
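
A quick sketch of that arithmetic, with the rotational latency and seek time taken from the figures above and the remaining terms lumped into an assumed 0.1ms:

```python
# Rough queue depth 1 IOPS estimate for a 15K disk, full-stroke random access.
# Rotational latency and seek time are from the text; the remaining terms
# (transfer time, command overhead, propagation) are an assumed lump sum.
rotational_latency_ms = 2.0   # half of a 15K RPM rotation
avg_seek_ms = 3.4             # typical 15K 3.5in drive specification
other_ms = 0.1                # assumed: transfer + command overhead + propagation

access_time_ms = rotational_latency_ms + avg_seek_ms + other_ms
iops_qd1 = 1000.0 / access_time_ms

print(f"access time {access_time_ms:.1f} ms -> about {iops_qd1:.0f} IOPS at queue depth 1")
# access time 5.5 ms -> about 182 IOPS at queue depth 1
```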

Short Stroke Effect

When data accesses are confined to a limited range of disk cylinders, the average seek time is lower, and queue depth 1 IOPS performance is higher.

Hard Disk Random Read IOPS versus Queue Depth

Now consider the implication of the second qualifier: random read IOPS versus queue depth. At higher queue depth, there are multiple IO outstanding to the disk, either issued simultaneously or with subsequent IO issued before previously issued IO have completed. The controller on the disk drive itself can reorder the IO, with the net effect of reducing the time between each IO, increasing IOPS at the expense of higher latency for each individual IO.

There is a small gain at queue depth 2, perhaps to 200 IOPS, a larger gain at queue depth 4 to about 240 IOPS, then roughly 40-50 additional IOPS with each doubling of queue depth up to 32, and a small further gain at queue depth 64. (The disk drives of 2005 had a 64-deep task queue; it is now 128-deep.) With each doubling of the queue depth, the latency nearly doubles.

The figure below shows both the short-stroke effect and the queue depth impact on IOPS. Increasing queue depth for IO distributed across the entire disk improves IOPS to 400 at queue depth 64. The short-stroke effect improves queue depth 1 performance to nearly 300 IOPS at 2.8% space utilization. When the two effects are combined, it is possible to drive over 600 IOPS per disk.

[Figure: IOPS versus queue depth for various disk space utilizations]

The second figure shows the short-stroke effect and queue-depth versus access latency. Increasing queue depth for data accesses distributed across the entire disk has a high price in latency. Restricting data accesses to a narrow range significantly mitigates the high queue depth latency penalty.

[Figure: Latency versus queue depth for various disk space utilizations]
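
These curves can be approximated with a simple table plus Little's law (average latency = queue depth / IOPS). The values below interpolate the full-stroke numbers cited above; the intermediate points are my estimates, not measurements:

```python
# Crude model of full-stroke random read IOPS versus queue depth for a 15K disk,
# interpolating the figures cited in the text. Average latency follows from
# Little's law: latency = queue depth / IOPS.
full_stroke_iops = {1: 180, 2: 200, 4: 240, 8: 285, 16: 330, 32: 375, 64: 400}

for qd, iops in full_stroke_iops.items():
    latency_ms = qd / iops * 1000.0
    print(f"queue depth {qd:3d}: ~{iops} IOPS, ~{latency_ms:.1f} ms average latency")
# At queue depth 64 this works out to roughly 160 ms per IO, which is why high
# queue depth is a throughput strategy, not a response time strategy.
```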

Individually, both effects improve hard disk performance, but the two combined produce more dramatic benefits. In online transaction processing, response time, and hence disk IO latency, is as important as throughput. The common rule adopted was therefore to keep the time-averaged queue depth below 2 per disk (disregarding transient spikes) for online transaction processing. (This rule was popular in the days of 5400 and 7200RPM drives.)

In batch processing, where there is not a live person waiting for each transaction to complete, the strategy is to drive queue depth high for the improved throughput. In DW/DSS, all such rules go out the window: any throughput not being utilized is throughput squandered. Note the importance of the qualifiers, and yet people feel it is acceptable to state the rule of queue depth below 2 per disk without qualification.

RAID Controllers and Disk Arrays

In the early days, we had just a bunch of disks (JBOD). The major RDBMSs handled this by supporting multiple files, and multiple filegroups, per database. And then there was light (sorry, wrong book): RAID and RAID controllers. And people saw that RAID was good, having fewer "disks" to manage at the operating system and database levels. A disk array appears to the operating system as a single disk, and performance counters were usually read from the operating system, not the storage system.

The rule of queue depth 2 per disk (and its accompanying qualifiers) does not translate directly to the queue depth reported by the operating system performance counters. So it became popular to cite a latency rule instead: data IO latency should not be higher than roughly 10-20ms, corresponding to the rule of queue depth 2 per disk (probably on 7200 and 10K RPM disks), the additional qualifiers having long since been forgotten.

In general, data access latency below 10 ms is usually an indication that transaction response time should be very good. Latency in the range 10-20ms should support acceptable transaction response time. Latency over 20ms corresponds to a heavily loaded disk system. More importantly, any transient surge would push the disk IO into the very high queue depth range with sharp spikes in response time. So even if average transaction response time is deemed acceptable, there could be a noticeable distribution tail experiencing very poor responsiveness.

RAID Group Random Read IOPS versus Queue Depth

So what should be the IO characteristics of a set of disks in a RAID group? If disk read IO were issued one at a time, i.e., at queue depth 1, then one disk in the group would get the IO while the other disks sit idle, for 180 IOPS delivered from the entire RAID group. At queue depth 2, the two IO will likely go to different disks, so two disks operate at queue depth 1 with the remaining disks idle. Only when the average queue depth reaches one per disk in the group will IOPS reach 180 per 15K disk, with some adjustments for distribution probability.

For a total queue depth at some whole multiple of the number of disks, the expected IOPS per disk follows the single disk IOPS versus queue depth curve. The important point is that at queue depth 1, a given thread will experience single disk IOPS, not disk group IOPS performance.
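
A minimal sketch of this behavior, assuming IO spreads evenly across the member disks and reusing the interpolated single-disk table from above:

```python
# Sketch: expected random read IOPS from a RAID group at a given total queue
# depth, assuming IOs spread evenly across member disks and using the
# interpolated single-disk IOPS-versus-queue-depth table from earlier.
single_disk_iops = {1: 180, 2: 200, 4: 240, 8: 285, 16: 330, 32: 375, 64: 400}

def raid_group_iops(total_queue_depth: int, disks: int) -> float:
    """Approximate group IOPS; per-disk queue depth below 1 means idle disks."""
    per_disk_qd = total_queue_depth / disks
    if per_disk_qd <= 1:
        # Roughly one active disk per outstanding IO, each at queue depth 1.
        return total_queue_depth * single_disk_iops[1]
    # Round down to the nearest modeled per-disk queue depth level.
    qd_level = max(q for q in single_disk_iops if q <= per_disk_qd)
    return disks * single_disk_iops[qd_level]

print(raid_group_iops(1, 10))    # ~180  : one disk busy, nine idle
print(raid_group_iops(10, 10))   # ~1800 : queue depth 1 per disk
print(raid_group_iops(80, 10))   # ~2850 : queue depth 8 per disk
```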

Log Write Latency

In the old direct-attach storage days, the advice was to provision one dedicated RAID 1 disk pair for each high transaction volume database log. It was rarely mentioned, but the pure sequential small block log write IO could achieve latency on the order of 0.3ms, and around 3000-5000 IOPS.

SAN vendors frequently suggest not bothering with dedicated physical disks for each high transaction volume log. Everything will be fine, trust them. Even when dedicated disks were provisioned, and perhaps even a dedicated service processor, the SAN still could not achieve very low log write latency. As SAN storage systems became pervasive, Microsoft changed SQL Server to allow more log writes to be in flight (SQL Server 2005 SP1: 8 outstanding log IO for 32-bit, 32 for 64-bit, up to 480KB outstanding; SQL Server 2008 allows 3840KB? SQL Server 2000 SP4 and 2005 RTM allowed 8 log IO outstanding per database).

RAID Small Block Random Write

RAID level write overhead is a subject covered elsewhere and is not discussed here. People like to cite RAID 5 and RAID 10 rules without qualification, but the rule commonly cited concerning RAID 5 write performance only applies to small block random writes. On a non-caching controller, we would expect write IO to have IOPS characteristics similar to read IO, adjusted for RAID level overhead, both theoretical and controller-specific.

Caching RAID Controllers and Read IO

In other discussions, I have explained why read cache on the storage system is counter-productive. In essence, the database engine itself is a data cache that is much closer and less expensive to access than cache on the storage controller. Next, in a properly configured system, the database engine should have a much larger buffer cache than the storage system, so it is very unlikely that anything in the storage controller cache will be accessed again. Finally, the overhead of read caching is significant in a storage system configured for high IOPS performance: read caching on the storage controllers incurs overhead for blocks that will almost never be accessed.

Do you write weekly TPS reports at work that never get read?
[Figure: IOPS versus queue depth chart]
Put this in your report just to see who actually comments on it, with the correct TPS cover sheet of course!

Read caching is usually disabled in TPC benchmark systems for the reasons just cited. One reputable source stated that a small 2MB (not GB!) read cache per LUN, just enough to enable read-ahead, is the preferred strategy. I recall someone asserting that a specific server system with 48GB memory showed IO performance improvement when SAN cache was increased from 80GB to 120GB. What this fact proves could be argued from more than one angle.

Caching RAID Controllers and Write IO

And now to the random write IO performance characteristics on a caching RAID controller. We have mostly skirted around write IO until now. There is a reason for this. Below is the small block random write IOPS pattern with a caching RAID controller.

[Figure: Small block random write IOPS versus queue depth with a caching RAID controller]

When SQL Server or the operating system sends one or more write IO to the RAID controller, the IO is written to the controller cache, and a completion signal is sent back to the source. The next IO is then sent. There is nearly no variation in IOPS versus queue depth, and latency is very low until the write volume reaches the IOPS limit of the underlying disks. Beyond this, the write cache fills up and latency becomes high, until the source throttles back on write IO volume.

Large Storage Systems with Large RAID Groups

As systems became more powerful, with compute performance growing 40% per year and hard disk performance averaging less than 10% per year (7.2K to 10K to 15K, then nothing until SSD), it became necessary to build storage systems with a very large number of disks. During this period, SAN systems became pervasive, especially for large storage systems.

It was soon noticed that the SAN could not deliver anywhere near the expected IOPS based on the number of disks. One cause was traced to the FC HBA default queue depth setting of 32 (per adapter then, now per target?). The reasoning behind the default was the SAN vendor doctrine of shared storage: to prevent one host from generating too much load, IO was throttled with the HBA queue depth setting so that all hosts could get a share of the IO volume.

If one were to measure IOPS versus the HBA queue depth setting on LUNs comprised of many disks, one should find that IOPS increases with higher queue depth all the way up to the maximum. This behavior was in fact already described in the RAID group IOPS versus queue depth section above.

FC HBA Queue Depth Setting

Note that in the early days, the HBA queue depth setting applied to either the HBA or each HBA FC port. On the more recent Emulex FC HBAs, the default is queue depth 32 per LUN, with the option of applying it per LUN or to the entire target. (QLogic uses the term Execution Throttle?) In one of the few TPC-C benchmark reports with a SAN storage system, I believe a reference was made to changing the HBA queue depth from 32 to 254 without additional explanation.

The TPC-C systems all have very large disk arrays, so of course it is proper to dial the HBA queue depth setting to maximum. Eventually this was noted, and the recommendation to change the HBA queue depth from 32 to 255 made its way into various Microsoft documents. The ones I have seen gave no explanation of the underlying cause and effect, and no supporting measurements.

So now what about a SAN with a small disk array? Especially if the queue depth setting is per LUN, and each LUN is comprised of 4 disks? Should the queue depth setting be increased to 254? I suggest following my guidelines for IOPS and latency versus queue depth with adjustments for the number of disks per LUN, all weighed against whether the objective is OLTP responsiveness or batch/DSS pure throughput.
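
As a purely hypothetical back-of-the-envelope heuristic (the per-disk targets below are my assumptions based on the earlier discussion, not vendor or Microsoft guidance):

```python
# Hypothetical heuristic for an HBA per-LUN queue depth setting: a low per-disk
# queue depth target for OLTP responsiveness, a high one for batch/DSS
# throughput. The targets are assumptions for illustration only.
def suggested_hba_queue_depth(disks_per_lun: int, workload: str) -> int:
    per_disk_target = 2 if workload == "oltp" else 32
    return min(254, disks_per_lun * per_disk_target)

print(suggested_hba_queue_depth(4, "oltp"))   # 8   - small LUN, responsiveness first
print(suggested_hba_queue_depth(4, "dss"))    # 128 - small LUN, throughput first
print(suggested_hba_queue_depth(60, "dss"))   # 254 - large array: dial to maximum
```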

Sequential IO

None of the IOPS versus queue depth discussion so far pertains to sequential disk IO. For large block sequential IO, a queue depth of 1 per LUN could be sufficient to generate maximum IO bandwidth, if the IO were large enough to span all disks in the LUN. I am inclined to think that the theory is that IO size x queue depth should be larger than the number of disks in the array x the RAID stripe size, the reasoning being that each disk will then have an IO to process, but I have not verified this hypothesis.
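
A worked example of that (unverified) hypothesis, assuming a 64KB per-disk stripe element size:

```python
# Worked example of the hypothesis above: IO size x queue depth should exceed
# the number of disks x the RAID stripe element size so every disk has an IO
# to work on. The 64KB stripe element size is an assumption.
disks = 8
stripe_kb = 64          # assumed per-disk stripe element size

io_size_kb = 512        # large block sequential read
print((disks * stripe_kb) / io_size_kb)   # 1.0: one 512KB IO spans all 8 disks

io_size_kb = 64
print((disks * stripe_kb) / io_size_kb)   # 8.0: need 8 x 64KB IOs in flight
```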

Increasing queue depth beyond the minimum necessary to achieve near-maximum bandwidth only serves to increase latency. In a mixed small and large block IO workload, a higher queue depth on the large block IO might improve the large block portion of the throughput, but this has not been studied. In a SAN, there are some suggestions that a higher queue depth may be necessary to reach maximum sequential bandwidth, along with multiple LUNs per RAID group; a satisfactory explanation has not been provided.

SQL Server IO Characteristics

There are several Microsoft documents that describe SQL Server IO in detail. A selection includes:
  the CSS SQL Server Engineers blog post How It Works: Bob Dorr's SQL Server I/O Presentation and the
  Microsoft-SQL-Server-IO-Internals slide deck,
  KB 917047, Microsoft SQL Server I/O subsystem requirements for the tempdb database,
  the SQL Server technical article by Emily Wilson, Mike Ruthruff, and Thomas Kejser,
  Analyzing I/O Characteristics and Sizing Storage Systems for SQL Server Database Applications,
  the SQL Server 2008 R2 Books Online coverage under Buffer Management,
  and Craig Freedman's Random Prefetching discussion of asynchronous IO.

In brief, in a table scan operation, SQL Server will issue IO to try to stay 1024 pages ahead of the scan with Enterprise Edition and 128 pages in Standard Edition.
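
With 8KB pages, the read-ahead depth works out to:

```python
# Read-ahead depth in bytes: SQL Server pages are 8KB.
page_kb = 8
enterprise_pages, standard_pages = 1024, 128
print(f"Enterprise Edition read-ahead depth: {enterprise_pages * page_kb / 1024:.0f} MB")  # 8 MB
print(f"Standard Edition read-ahead depth:   {standard_pages * page_kb / 1024:.0f} MB")    # 1 MB
```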

SQL Server Synchronous and Asynchronous IO

In the random-like 8KB accesses for key lookups and loop join inner source rows, SQL Server switches from synchronous to asynchronous IO at an estimate of 25 rows (a source is needed for this).

Now consider the situation of a transaction processing system that also handles reports. The transactions consist of several serially issued IO at queue depth 1. The report is a single query that generates several hundred IO issued asynchronously at high queue depth. Suppose that with just transactions being processed, the average queue depth per disk is 1 and the average latency is 5ms. A transaction requiring 20 synchronous IO completes in 100ms, a reasonable response time. Now the report runs, generating asynchronous IO that drives queue depth to 8 per disk and latency to 30ms. The report runs fine because the storage system is delivering 350 IOPS per disk, but the transaction with 20 serially issued IO now takes 600ms to complete. In essence, the report has the effect of having higher priority than transaction processing!
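
The arithmetic behind this example is simply 20 serial IO multiplied by the per-IO latency:

```python
# 20 serially issued synchronous IOs: transaction time ~ 20 x per-IO latency.
serial_ios = 20
for latency_ms, scenario in [(5, "transactions only"), (30, "report running")]:
    print(f"{scenario}: ~{serial_ios * latency_ms} ms transaction time")
# transactions only: ~100 ms transaction time
# report running: ~600 ms transaction time
```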

SQL Server - Tempdb

SQL Server IO to tempdb frequently occurs at high queue depth. This is because the queries involved have large hash or sort operations; if they were not large, the hash or sort would be done in memory. So tempdb activity frequently comes from large queries that generate asynchronous IO at high queue depth.

If one only knew the simple rule of IO latency below 20ms, one might conclude that the tempdb disks are overloaded, because the average latencies are very high. In fact, what is happening is that SQL Server is simply following the strategy for best performance against a throughput-oriented metric. The proper metric is whether tempdb can deliver sufficient IO volume, not whether tempdb IO latency is low.

High Queue Depth SQL on SSD

In a table scan query without lock hints, a read queue depth of over 1300 was observed. The IO size was 8KB, and read latency went above 200ms even on SSD storage. With a table lock, the IO size was around 500KB (probably mostly 512KB plus a few small block IO), disk latency was less than 50ms, and queue depth was around 40.

For key lookup 8KB IO, queue depth was around 160 with 7ms latency. With HDD storage and 20 or so disks, queue depth 160 works out to 8 per disk, a reasonable number for good IOPS without excessive latency.

On SSD, any queue depth more than 1-2 per LUN should achieve maximum IOPS, and latency stays below 0.1ms. Marc Bevand on Zorinaq's blog pointed out that IOPS at queue depth 1 is essentially a measure of latency. Suppose an SSD is rated at 100μs latency and 30K IOPS for 8KB IO (30K x 8KB = 240MB/s). Then the queue depth 1 IOPS should be 10K (1,000,000 μs/s / 100μs). So the theory is that queue depth 3 or higher may be required to reach 30K IOPS. Keeping queue depth at the bare minimum necessary for maximum IOPS does not degrade performance for the query generating the huge IO volume, while preserving good responsiveness for other concurrent queries.
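
The same reasoning in a few lines, using the rated figures from the example above:

```python
# At queue depth 1, IOPS is just the reciprocal of latency, so reaching the
# rated IOPS requires enough IOs in flight (Little's law: N = X * R).
rated_latency_us = 100       # SSD rated latency from the example
rated_iops = 30_000          # SSD rated 8KB random read IOPS from the example

qd1_iops = 1_000_000 / rated_latency_us      # 10,000 IOPS at queue depth 1
min_queue_depth = rated_iops / qd1_iops      # outstanding IOs needed for rated IOPS
print(f"QD1 IOPS ~ {qd1_iops:.0f}, queue depth ~ {min_queue_depth:.0f} needed for rated IOPS")
# QD1 IOPS ~ 10000, queue depth ~ 3 needed for rated IOPS
```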

High Write Latency in Creating Clustered Indexes

The Create Clustered Index command was observed to generate very high write latency: queue depth was 500, latency was 600ms+, and IO size averaged 100KB. Since this should not occur during the working day, it is not a cause for serious concern. Still, there is no point in issuing so many outstanding IO: with either a caching RAID controller or SSD, write IO bandwidth can be saturated even at low queue depth. Driving queue depth so high only makes the system highly unresponsive for any function requiring IO to the affected drives.

Wait Stat Tuning and Asynchronous Operations

Before concluding, I want to bring up this subject. Many people today tune solely on wait stats as the metric. Consider the following example: the storage system is comprised of 100 disks, and a query generates 1 million IO operations.

If the IO is issued synchronously at a queue depth of 1 per disk, or 100 for the entire storage system, then the IOPS is 200 per disk, or 20,000 for the storage system, and disk latency is 5ms. The query should take 50 seconds to complete. The total wait time is 5ms per IO, or 5,000 seconds for 1M IO.

Now consider asynchronous IO, driving queue depth per disk to 16, for 400 IOPS per disk and 40ms latency. The query now completes the 1M IO in 25 seconds, but the total wait time is 40,000 seconds.
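
The arithmetic for both scenarios side by side:

```python
# Elapsed time falls with queue depth, while accumulated IO wait time rises.
total_io = 1_000_000
disks = 100

for qd_per_disk, iops_per_disk, latency_ms in [(1, 200, 5), (16, 400, 40)]:
    system_iops = disks * iops_per_disk
    elapsed_s = total_io / system_iops
    total_wait_s = total_io * latency_ms / 1000
    print(f"QD {qd_per_disk:2d}/disk: {elapsed_s:.0f} s elapsed, {total_wait_s:,.0f} s total IO wait")
# QD  1/disk: 50 s elapsed, 5,000 s total IO wait
# QD 16/disk: 25 s elapsed, 40,000 s total IO wait
```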

It is important to stay focused on the true metric, and always evaluate system performance counters, not just wait time statistics.

IO Queue Depth Summary

We have explored in brief the key components that are impacted by IO queue depth strategy. The following are the main points to consider.

1) Random read IOPS on hard disks can be improved by operating at higher queue depth, at the expense of latency.

2) Sequential IO does not and should not need high queue depth beyond what is necessary to keep all disks busy. Staying 1024 pages ahead on a table scan seems reasonable for large block IO, but I would not flood the queue with 8KB IO. This strategy should be adjusted based on IO size, or perhaps we should ask why 8KB IO is sometimes issued when the table is not fragmented.

3) Random writes to a RAID controller with write cache do not need deep queue depth for best performance.

4) SSD storage systems do not need very high queue depth for maximum performance.

SQL Server appears to follow a set-piece strategy on IO queue depth, dependent only on edition, Standard or Enterprise. The number of disks behind each LUN is not considered, and neither is the usage model (OLTP versus DW/DSS).

The proposed strategy is as follows.
1) Sequential IO should not try to stay 1024 pages ahead if the IO size is 8KB.
2) Write IO to controllers with write cache should use lower queue depth.
3) It is important to adjust random read IO queue depth based on the type of storage, HDD or SSD.
4) It is helpful to adjust HDD random read IO queue depth by usage model, OLTP or DW/DSS.
5) It is helpful to adjust HDD random read IO queue depth based on the number of disks per LUN.

Some of the above could be detected automatically; others might require a parameter setting, for which sp_configure seems appropriate. As much as we would like a universal answer (42) independent of user action, having adjustments could greatly improve the usability of SQL Server. Today a number of operations can render the SQL Server system completely unresponsive for their duration due to disk queue flooding, even with SSD storage. Only very large, perfectly configured storage systems would have immunity.