SQL Blog post Data, Temp and Log
First, especially for all the people with SAN storage, drive letters are of no consequence. What matters is the actual physical disk layout behind each RAID Group or LUN. Forget capacity, pay attention to the number of spindles supporting each RAID group. While there is nothing wrong with making 2 or more LUNs from the same set of disks, I prefer to make 1 big LUN from the RAID Group, then create multiple partitions from Windows, just to be absolutely clear what storage space shares spindles.
If the RAID group is shared with other applications, make sure that the SLA guarantees read and write latency. One very large company conducted a stress test in the QA environment. The SAN admin carved the LUNs from the same pool of disks as production, but thought he had a really powerful SAN that this was not important. It was and he did not.
The general theory is to aggregate random IO loads into a common pool of disks, and to provide dedicated physical drives for each significant sequential load. For some reason, many people incorrectly interpret this as placing the log files for all databases into a common drive.
Each database log for a well design application, in theory, generates pure sequential IO, and some serious technical heavy weights say that a single pair of physical disks in RAID 1 can handle almost any SQL load (barring very limited exceptions). Well this is not true if you do transaction log backups, have transaction roll-backs, replication and database mirroring. One might do additional calculations for this situation. If one places more than one log file on a common physical disk set, then even though the IO to an individual file might be sequential, the complete set of IO to the disk is not.
So does each log file get its own set of physical disks? If there are 50 databases in the instance, does the server need 50 pairs of disks in RAID 1 for logs?
OK, let go back to the original theory. A single 15K disk can do approximately 200 small block random IOPS for data spread across the entire disk. It can over 4000-5000 pure sequential IOPS in SQL Server generated log writes. For each pair of physical disks you take away from the common pool, then that is 200 random (write) IOPS that has been lost (assuming RAID 10). If a database does not generate more than 200 log writes/sec and does not require sub-ms log write latency, it is better to leave this log in the common pool.
But this is not how SQL Server behaves. Run a query that aggregates a very large number of rows, and involves a join. To be specific, the execution plan should show a hash operation, whether for the join or for the aggregate. Check the hash operation cost details. The IO cost component should be non-zero. Otherwise your test query is too small to actually use tempdb (check with the SQL engine team to be sure on this).
Now watch data and temp IO over time. (that is not based on the difference before and after). The disk IO temporal pattern is read from data, write to temp, read from data, write to temp etc, then may at the end, read from temp (but sometime not, think about what this means). So if you had split your disks 50-50 data-temp (excluding OS and log disks), then at any given time, 50% of your disks are busy, the other 50% are idle. Is this the best use of your (gawd awful expensive) resources (if you bought a SAN)?
There are special considerations if certain data files depend on low latency response, and others benefit from high-queue depth operation, but hey, if life were simple, would your company need skilled people? If the SAN vendor tells you its ok to have one common pool for everything, then that person is a liar or incompetent, and frequently both.
In the past, the competent technical analysis was all oriented around 1 heavy database per server. Hence many of the multi-db configuration strategies lacked good substantiating analysis.
In the past, most medium to large businesses were on 4-socket systems with small business on 2. Today with quad or six-core processors, the reality is that a single socket processor could handle the compute requirements for medium and possibly even large business. However, there is not much in the way of cost difference between a single socket system and a dual-socket system. A 2-socket system typically has much higher memory capacity when both sockets are populated. Also, the two-socket systems like the HP DL/ML370G6 and Dell T710 have massive internal storage. Have you noticed the per storage bay cost of these are less than for the external storage unit? So this means the low-end systems should target a 2-way with the cheaper processors, just to get the memory capacity.
I would not blindly recommend a tempdb strategy without first looking at the app. The two-ends of the tempdb spectrum are 1) every gawd dang stored proc seems to create a temp table, making for very high activity to tempdb system table, but there is actually very file io to temp. 2) big queries with hash or sort ops that actually write and read from temp. In between, there might be a situation where there are heavy write temp, but very little read.
In a heavy system, I would recommend 48-96 15K disk drives, which can handle high random IO, sequential IO and temp.
For many active db, instead of 1 pair of disks for the logs of each heavy db, i prefer SSD for the logs.
On lower cost system, my thinking is 64-bit SQL Server + larger memory configuration + high degree of parallelism to increase the set point where intermediate results have to be written to tempdb. On a 32-bit system, this occurs at a relative low point (I worked it out once back in SQL 2000 days). On 64-bit, it is proportional to memory (32-bit is proportional only to non-AWE memory). The memory set point is also one per thread, so high-degree of parallelism also increases the amount that can be kept.
Failing this, get an inexpensive SSD. I have no recent experience with RAM disks. For $200-400, I am not going to be bothered with issues that could be solved with the cheap SSD.
I am also very leery of the number tempdb files. The original complaint was for case 1) in SQL Server 2000 pre-sp3 days. The issue was supposedly fixed, and yet people keep talking about tempdb files per core. No one seems interested in citing actual performance tests on a specific version. This is generally by people who collect rules without caring why
Comments by Alejandro Mihanovich:
Very interesting article. I work at Microsoft. I'm now an MCS guy (used to be a SQL server PFE). I'm writing a design document for SQL 2008 so I need to have good foundation on disk IO concepts. The only concept I'm missing comes from the following paragraph from the SQL predeployment I/O best practices.
"Maintain a balanced storage design.
The architecture of specific HBAs (Host Bus Adapters) and their settings influence performance and throughput. But in general, more LUNs result in more independent I/O queues and potentially more concurrent outstanding I/Os. Assuming that the Storage Area Network (SAN) has no bottlenecks, one can expect better scalability with multiple LUNs and multiple paths between server HBAs and the SAN."
I don't see your mentioning of this in your arguments, but it looks that having many different LUNS creates more queues and will allow for more outstanding IOs (and thus more IOPS assuming no SAN bottlenecks). What are your comments on this ?
sorry about the lack of detail, i recall describing more detail in other places. I would suggest caution in the above quote. Balance between the number of IO channels (ie, HBA) and physical disks is essential. Sure a LUN could be construed as a physical disk, but I have seen total idiots think having multiple LUNs (technically and originally, this just meant Logical Unit Number, but in SAN, most use this term to refer to the logical disk created by the storage system and presented to the host OS as a physical disk) is all that is required, not thinking that 10 LUNs created on 10 physical disks is not 10 LUNs on 100 physical disks.
In brief, lets distinguish between two different disk queues. One, the HBA typically defaults to "Per Target" mode, where the default queue depth is created for each target (LUN). I think the other mode is per HBA or something (look in the QLogic manual for this), meaning all LUNs on that HBA channel share one queue.
The other place a disk queue is on the disk itself. I think most SAS/FC allows for 32 outstanding IO requests. That is it can only do elevator seeking for upto 32 IO requests.
Back to the HBA, I believe the default queue depth is 32, which means queue depth 32 per LUN. Some people, including MS people, blindly say increasing queue depth to 128 or 254 improves performance. Well sure if you have a storage system with lots of physical disks. My guess is 8 disks per LUNs or more. If there are only a small number of physical disks per LUN, then whats the point of increasing the queue depth? So I reiterate: it is grossly irresponsible to suggest increasing queue depth on the HBA without understanding the physical disk to LUN mapping. Also, if multiple LUNs are created from the same physical disks, you still end up with mismatched queue depth to physical disks.
So, balance the number of IO channels, physical disks, and set queue depth appropriately, and even better, get a competent professional, not some idiot with a list of rules.
There is a perplexing lack of material on overall storage performance strategy for database servers. Too much of the content is narrowly focused on tuning techniques to improve performance over an unknown base. First, storage system IO performance is well understood from the science and engineering perspective. Second there is sufficient information on SQL Server IO patterns. It follows that the overall storage performance objective is then to achieve the proper balance between capability and cost. Storage performance is achieved by distributing load across several disk controllers, and a (very) large number of disk drives. After which, the storage capacity is likely to be many times larger than the database. Any document that only discusses sizing storage to meet the database space requirement is written by someone who should not be giving advice on database performance. A key element in this storage performance strategy is keeping the amortized cost per disk reasonable. Testing and tuning is the final step to verifying that the storage system performance meets the design specification and requirements.
OLTP and data warehouse applications have inherently incompatible objectives. A transactional database server needs sufficient storage performance to support peak volume processing while maintaining good or acceptable responsiveness. In recent years, most database servers now have sufficient memory that this no longer an issue. The remaining major issue is handling transient IO surges that disrupt transaction processing. If it is not practical to achieve complete immunity, then tolerable limits to the magnitude and duration that transaction processing is affected should be established. A data warehouse should be able to power through large queries with as much as bandwidth and IOPS as warranted. Rules on disk performance counter limits can be disregarded. The storage performance was purchased, so one may as well use it.
There are several causes for transient IO surges. Examples include: checkpoints, large queries, and other internal housekeeping activities. A log backup or transaction rollback can also severely degrade transaction processing because this breaks the normal pure sequential IO pattern. One common recommendation is offload large queries to a separate report server. But it is not possible to completely suppress IO surges from other activities. Brute force capability is necessary in the storage system regardless. Once the storage system is configured with brute force capability, it may also be possible to run large queries without degrading transactions. (Large queries on transaction processing servers should be controlled by setting the max degree of parallelism to 1 or perhaps no more than 1 out of 4 processors.)
The key to preventing transaction processing disruptions in the IO subsystem is maintaining low latency reads from data files and very low latency writes to the log file, both of which are necessary to complete transactions. Even if the entire database resides in memory, a large query requiring temporary workspace could cause data to be flushed out of the buffer cache, requiring reads for subsequent queries. Writes to data are not required to complete a query as this is handled by the lazy writer. Reads from log should only occur during transaction log backups and possibly from a very large transaction rollback, which should not occur in properly designed transaction server during busy hours.
A very brief discussion of disk drive performance characteristics is warranted. The difference between random and sequential IO is well known if not fully appreciated. Recent generation 10,000rpm (10K) disk drives can sustain sequential transfer rates of 80MB/sec on the outer tracks and 40MB/sec on the inner tracks. For 15K drives, 125MB/s on the outer and 70MB/s on the inner tracks. Random IO performance is constrained by disk rotational speed and seek-time. The disk media on a 10K drive completes a full rotation in 6ms, for an average rotational latency contribution of 3ms, and 2ms for a 15K drive. Current generation 10K drives have an average seek time of 4.6ms for reads and 5.2ms for writes. A 15K drive might have average seek times of 3.5ms for reads and 4.2ms for writes. For a small 8KB access, the transfer time is approximately 0.1ms. Other contributions are negligible, so the total latency for random reads is about 8ms on 10K and 5.6ms on 15K drives. This leads to the frequently quoted small block random IO performance of 125 IOPS for 10K and 175 IOPS for 15K drives respectively.
Implied in the above calculations are that disk IO is issued serially one at a time and data is distributed across the entire disk. When data is restricted to a narrow section, then the average seek time is lower, the short stroke effect. When multiple disk IO requests are issued simultaneously, the disk re-orders the IO sequence, command queuing, for higher throughput at the expense of longer latency. If data is distributed over the entire disk, high-queue depth operation quickly leads to very high latency, exceeding 20ms at queue depth 4 and over 100ms at queue depth 32 per disk. This is the underlying cause and effect of the maximum queue depth 2 recommendation. A very important matter rarely discussed is the short-stroke effect combined with high queue depth operation. When only a tiny fraction of the disk is used for the active database, preferably 5% or less, latency increases much more slowly with queue depth, less than 20ms at queue depth 8 and less than 40ms at queue depth 16 and random IO performance can exceed 400 IOPS per disk. This characteristic is the key in maintaining strong transaction processing resiliency. This is also counter to the arguments made by SAN vendors in achieving high disk space utilization with shared storage.
The SQL Server engine has internal mechanisms governing disk IO. The details concerning what operations and the set points are not disclosed as the strategy is probably still evolving. In the past, there were differences in IO strategy between Standard and Enterprise editions. The assumption was that Enterprise Edition would be employed on systems with high performance storage systems suitable for aggressive high queue depth IO operation.
One objective of the SQL Server IO strategy is to prevent over flooding the disk queue during checkpoints, effectively shutting down transaction processing. Some builds of SQL Server 2000 were not effective at this, while early builds of SQL Server 2005 were successful. In large queries, queue depth is throttled, but not enough to prevent excessive impact in transaction processing on weak storage systems and perhaps too much for best data warehouse performance on strong storage systems. For execution plans with loop joins or bookmark lookups, SQL Server issues single IO in a serialized manner (one IO is issued, the next IO starts after the first is completed) for estimated row counts 20 and lower. At estimated row counts 30 and higher, SQL Server issues multiple concurrent IO, but does not drive the disks for maximum throughput. In any case, a query requiring disk IO involving 30 plus estimated rows can execute faster than the similar plan involving 20 estimated rows because the disk IO is issued concurrently instead of one at a time serially. In a simple table scan operation, SQL Server issues 256K-1MB IO at low queue depth. Scans followed by a hash operation generate 8K IO, which is not the best choice. See the paper: Microsoft SQL Server I/O Basic Chapter 2 and included references for more details.
Since SQL Server cannot know whether it is being used for transaction processing, or data warehousing, it cannot automatically determine the best disk IO settings. There really should be a setting for the database usage model. In addition, it is helpful to declare the number of physical disk drives in each array. This allows the SQL Server engine to know how many concurrent IO can be issued while maintaining low latency, without relying entirely on the disk access time feedback. The SQL Server engine internal IO governor is beneficial, but this alone cannot mask transient IO surges effectively on a weak disk system. If IO is not throttled enough, disk access latency can become excessive. If IO is throttled, then the full IO performance capability of the disk system is not realized.
The major server systems today based on the Intel 5100, 7300 or AMD/nVidia chipsets all have very powerful IO capability with 28 or more PCI-E lanes, allowing for 7 or more x4 PCI-E slots. Intel reported that the 5100 chipset can sustain 3GB/s disk IO. The other two chipsets should be able to equal or exceed this. Many systems configure fewer slots with a mix of x4 and x8 widths. The issue is that the first generation PCI-E RAID controllers could not fully utilize the x8 PCI-E bandwidth (1.5-1.6GB/sec). It is possible the second generation PCI-E controllers can but there is very little information on this. A single PCI-E RAID controller is capable of sustaining 800MB/sec in a x4 PCI-E slot.
To realize the full system IO capability, it is necessary to spread IO across as many controllers or HBAs as allowed by the number of independent PCI-E slots, and across as many disk drives as warranted. Technically, 8 drives could saturate a single PCI-E controller in pure sequential IO. In practice, few SQL operations will generate pure sequential IO. Common external storage enclosures hold up to 14-16 disks. The HP disk enclosures hold 10, 12, and 25 disks. Configuring one or two fully populated enclosures with a total of 12-30 disks for each PCI-E controller is a perfectly sound strategy.
Let us examine the disk configurations employed in the TPC benchmarks. In a recent TPC-C publication, an HP ProLiant ML370G5 with 2 x Xeon X5460 quad core processors is configured with 7 P800 RAID controllers, 600 36GB 15K disks drives for data and 28 72GB 15K drives for log. There are 100 disk drives connected to each of 6 controllers for the data files. The total disk space on the 600 data drives without RAID overhead is 20TB. The size of the data files is 1.8TB. The active portion of disk space used is about 10%. Many TPC-C publications are cost-performance optimized, meaning that it is reasonable to assume each of the 600 data disks are loaded to approximately 200 IOPS, for 120K IOPS total. The TPC-C benchmark generates exceptionally high random IO activity, infrequently seen in actual transaction processing databases. To support this IO rate, a number of special steps are involved. A custom driver to bypass the Windows mini-port architecture is one. Raw partitions are used instead of the NTFS file system. Disk performance counters are disabled in both the operating system and SQL Server. In typical systems running below 10K IOPS, these special precautions have no impact. Raw partitions might be useful for disk activity on the order of 20K IOPS or higher. Disabling performance counters is not advisable in almost all cases.
The TPC-H disk configurations are probably more meaningful to actual heavy duty critical database servers than the TPC-C configurations. The large DSS type queries in the TPC-H benchmark are effectively heavy transient IO surges. There is some variation in TPC-H disk configurations depending on the size of the database and amount of system memory configured. The HP report for at 1TB scale factor (1TB data, and another 300GB indexes) on a ProLiant DL585G2 with 32GB memory is a suitable representation of a heavy duty database server. The disk configuration is 8 RAID controllers (7 PCI-E and 1 PCI-X) connected to a total of 200 disks for data, 25 disks per controller. One controller also connects 2 disks for OS and 4 disks for logs. The TPC-H workload does not generate significant log load. A heavy duty transaction server should have the logs on a dedicated controller.
Not all applications require the full performance capability of the above TPC-H storage configuration. A basic level of capability can be achieved with 2 controllers and 2 external disk enclosures with 12-16 disks each. An intermediate capability level is achieved with 4 controllers and 4 disk enclosures with a total of 48-60 disk drives. Really serious capability requirements should employ 6-8 disk controllers and 120-200 disk drives. The cost for one controller and one enclosure populated with 73GB 15K drives (Dell MD1000) is about $7500. The basic capability cost using direct attach storage is then about $15K, $30K for intermediate capability and $60-100K for heavy duty capability. The Dell MD3000 that supports clustering is more expensive. SAN storage systems are much more expensive, with a amortized cost per disk exceeding $2000.
For any given storage configuration, based on the number of controllers, number of disks, and RAID level, the theoretical random and sequential IOP performance can be calculated. The actual realized capability may be far below what should have been possible. Always test the configuration. Storage vendors like to offer “advanced” capabilities without fully disclosing the performance overhead. Also be aware of disk partition alignment. It is important to test IO performance over the anticipated size of the database.
Modern server systems have considerable IO performance capability. This requires distributing load over multiple storage controllers and very many disk drives. If the resulting configuration has far more storage capability than needed by the primary database, this is a very important performance bonus, providing the capability of handling heavy IO surges while maintaining reasonably IO latency. The IO governor in the SQL Server engine helps, but cannot hide weak storage performance. Always test to verify the actual storage performance is in line with the expected performance level.
Disk partition alignment is a problem caused by the Windows operating system in RAID stripes. For whatever reason from long ago, the OS allocates 63 sectors (of 512 bytes, or 31.5KB in all) for its own use. The first partition then starts on the 64th sector. A common stripe size on RAID arrays is 64KB. The problem is that in random 8KB disk read or writes, accesses to either the first or last 8KB of a 64KB extents will reside on different stripes of the RAID array, i.e., different physical disks. So in theory, every 8 accesses for 8KB blocks will result in 10 disk accesses, for a degradation of 20% in random 8K IO performance. There are various reports of greater performance impact. In some storage systems, unaligned accesses bypass the cache, so actual performance impact could be greater. Another possibility is that a particular storage system has become seriously fragmented over time, and the simple act reformatting the file system has some benefit. Windows Server 2008 finally corrects this situation. See the presentation by Jimmy May on the SQL Server Disk Partition Alignment. There is more information on disk partition alignment targeted for Microsoft Exchange ( msexchange.org tutorials Disk-Geometry).
For some inexplicable reason, people seem to want to believe that a large cache on the SAN can compensate for a low disk count. A separate disk cache for a database engine is fundamentally a silly idea. The database engine is a disk cache specially optimized for databases. Suppose the database has an 8GB data buffer. What do you really expect to be in a 4GB SAN cache that is not already in the database engine’s internal cache? If the SAN cache is larger than the database cache, i.e., system memory, then one should probably find another person to do system configuration. Accessing cache in the database engine is much more efficient than going out to the storage controller cache. All this having been said, a large write cache on the storage controller is beneficial in handling checkpoints and other write bursts. RAID controllers should probably be offered with 2-4GB cache, not the 256-512M typically offered.
Designing high performance I/O for SQL Server
Authors: Thomas Kejser & Mike Ruthruff
SQL Server Customer Advisory Team
Tech Reviewer: Jimmy May
Jimmy presented at the Curaco PASS users group.
Below are some storage related material I found. When I get a chance, I will comment where appropriate.
How It Works: Bob Dorr's SQL Server I/O Presentation
Solid State Disks and SQL Server Wes Browns article on Fusion-IO
Storage Top 10 Best Practices
Predeployment I/O Best Practices SQL Server Best Practices Article
Physical Database Storage Design
Deploying SQL Server 2005 with SAN #1
System and Storage Configuration for SQL Server
I will fill this out later
Seagate 7200RPM 3.5in drive
Seagate 10K 3.5in drive
Seagate 15K 3.5in drive
Seagate 7200RPM 3.5in drive                         Seagate 15K 3.5in drive
Seagate Savvio 10K 2.5in drive         Seagate Cheetah 15K 2.5in drive