Storage performance is not inherently a complicated subject. The concepts are relatively simple. In fact, scaling storage performance is far easier compared to the difficulties encounters in scaling processor performance in NUMA systems. Storage performance is achieved by properly distributing IO over:
with consideration for random and sequential IO patterns, and in certain cases possibly also separation of low-queue and high-queue patterns, but this is not always possible. It helps to know how to estimate the theoretical performance in IOPS and bandwidth for a given number of disks and IO channel, and then test to see how your configuration compares to the expected characteristics.
It is also necessary to have a basic idea of the capabilities and limitations of each component or bus in this chain. Storage performance cannot be achieved with magic/secret registry settings or other incantations.
A dozen 1TB 7200RPM supporting data, temp and log files, however impressive the capacity seems to be, will have poor performance by database standards no matter what secret settings are applied.
Nor is performance achieved with a grossly overpriced SAN storage system, with relatively few big capacity disk drives, configured in complete disregard of the principals of disk system performance.
Without getting deep into concepts, I will provide a simple example of what I consider a balanced storage system configuration. The objective for this reference configuration is the ability to sustain transaction processing throughput with no more than minor degradation during a moderately heavy reporting query. The configuration is also suitable for data warehouse workloads.
A 4-way server, that is a system with four processor sockets, on which the current generation Intel Xeon 7400 (now 7500) series and Opteron 8400 series processors have six (or 8) cores per socket, the reference storage configuration is 4-5 controllers and 120 15K disk drives as detailed below. (Intel finally announced the Xeon 7500/6500 series in the middle of writing this, so I will have make adjustments later.)
| Processors | Intel Xeon X7560 or Intel Xeon X7460 or Opteron 8439 |
| Cores | 4 x 8 = 32 (X7560) 4 x 6 = 24 |
| Memory | 64 x 4GB = 256GB (X7500) 32 x 4GB = 128GB |
| HBA Controllers | 4-5 Dual-Port 4/8 Gbit/s FC
or 4-5 6Gb/s SAS with 2x4 ports |
| IO Channels | 8-10 |
| Disk Drives | 120 x 15K |
| Disks per channel | 12-15 |
This is only a reference configuration. With direct-attach storage, eminently suitable for data warehouse, should have an amortized cost per disk of $500-600, for a total of $60-70K. In a SAN, the cost per disk might range from $1500-3000 per disk, for a total cost of $180-360K. A SAN vendor will probably attempt to substitute 600GB 15K disk instead of the low capacity models. This will push cost per disk to over $6K, usually resulting in a storage system with far too few disks.
At this time, a SAN is required for clustering. In the past, Windows supporting clustering on SCSI, with two hosts on the same SCSI bus. But this capability was removed as customers seemed anxious to buy very expensive SAN storage. The SAS protocol also supports two hosts connected to the same SAS network, so it should also be possible to enable clustering, but Microsoft does not currently support this.
A really high-end storage system could have over 1000 disk drives. This does need not be a single storage system, it could be multiple systems. Of course, for exceptional random IO needs, a serious effort should be made to determine if solid-state storage can be implemented to keep the spindle count manageable.
If your storage vendor opens with a question as to your capacity requirements, don't waste anymore time. Just throw the rep out and proceed to the next vendor.
For calculation purposes, I am going to assume 100 of 120 disks are allocated for data and temp, and the remaining 20 for other purposes including logs. In actuality, if only 4 disks are required for logs, then 116 disks would be allocated to data and temp.
| Random IOPS | |
|   low queue, full-stroke | 185 per 15K 3.5in disk, 205 per 15K 2.5in disk, 17-22K tot |
|   high queue, full-stroke | 300+ |
|   low queue, short-stroke | 250+ per disk, 25K tot |
|   high queue, short-stroke | 400+ |
| Sequential | |
|   4Gb/s FC | 330-360MB/s per port, 720MB/s per dual-port HBA, 2.6-3.0GB/s+ total |
|   2x4 3Gb/s SAS RAID controller   in x8 PCI-E Gen 1 slot | 0.8GB/s per x4 port, 1.6GB/sec per adapter 6GB/s+ or system limit* |
|   2x4 6Gb/s SAS RAID controller   in x8 PCI-E Gen 2 slot | 1.6GB/s per x4 port, 2.8GB/sec per adapter 10GB/s+ or system limit* |
The previous generation of Intel systems (and Server Systems 2009 Q3) built around the 5000P/X and 7300 chipset may have been limited to 3GB/sec in realizable IO bandwidth, regardless of the apparent bandwidth of the IO slots. There is no clear source for the realizable IO bandwidth of a 4-way Opteron system. An authoritative source indicated that the 8-way Opteron platform could achieve 9GB/sec in IO bandwidth, with approximately 7GB/sec realized from a SQL Server query. This may have been the TPC-H Pricing Summary Report, which is moderately CPU-intensive for a single table scan query, so the 9GB/sec value might be achieved in other SQL queries. It is reasonable to suppose that a 2009-10 generation 4-way Opteron should be able to achieve 4.5GB/sec or higher, but actual documentation is still desired.
The Intel Nehalem generation servers (Xeon 5500 and 5600 series, and the 6500 and 7500 series should be able to sustain phenomenal IO bandwidth, but I have yet to get my hands on a system with truly massive IO brute force capability.
For server systems built around the Intel Xeon 5500 and 5600 processors, I recommend the HP ProLiant DL/ML 370G6 and Dell PowerEdge T710, as these should have 2 5520 IOHs for extra IO bandwidth. Other systems built with dual IOHs should be suitable as well. The systems built around the new Xeon 6500/7500 should also have massive IO bandwidth. In general, Opteron system have decent system bandwidth. The 2-way system are targeted more towards web/app servers, while the 4 & 8-way systems are targeted more towards database IO requirements.
The system memory bandwidth contribution is more complicated. Consider that a read from disk is also a write to memory, possibly followed by a read from memory. A system with SDRAM or DDR-x memory, the cite memory bandwidth is frequently the read bandwidth. The write rate to SDR/DDR memory is one-half the read rate. So the IO bandwidth might be limited to one-third the memory bandwidth, regardless of the bandwidth of the PCI busses. In the past there was a system with 2 DDR memory channels (64 bit or 8 byte wide) at 266MHz has a read bandwidth of 4,264MB/sec. The maximum disk IO bandwidth possible was around 1,350MB/sec, even though the system had two PCI-X 100/133MHz busses.
The more recent Intel chipsets, including the 5000 and 7300, have FB-DIMM which uses DDR, but have a separate device on the memory module. This allows simultaneous read and write traffic at full and half-speed. The 5000P chipset has 4 memory channels. With DDR2-667, the memory bandwidth is 5.3GB/s per channel or 21GB/sec system total for read, and 10.5GB/s for write. There are reports demonstrating 10GB/sec IO bandwidth, or even 7GB/s. The PCI-E bandwidth over 28 PCI-E lanes is 7GB/s unidirectional.
The table below shows PCI-E nominal and realizable bandwidths in GB/sec. PCI-E gen 1 (or just PCI-E) signals at 2.5Gbit/s. After 8B/10B (or is it 10B/8B?) overhead, the nominal bandwidth is 250MB/sec per lane per direction. Keep in mind PCI-E has simultaneous bi-directional capability. So a PCI-E x4 slot has a nominal bandwidth of 1GB/sec in each direction. Actual test transfers show that the maximum realizable bandwidth for a PCI-E x4 slot is approximately 800MB/sec. PCI-E gen 2 signals at 5.0Gbit/s or 500MB/sec per lane per direction, or double then gen 1 bandwidth for a given bus width.
| Slot width | PCI-E Gen 1 | PCI-E Gen 2 |   |
|---|---|---|---|
| x4 | 1.0 0.8 | 2.0 1.6 | nominal realizable |
| x8 | 2.0 1.6 | 4.0 3.2 | nominal realizable |
Systems of the Intel Core 2 processor architecture generation (Xeon 5100-5400 and Xeon 7300-7400 series) are almost exclusively PCI-E gen 1, as are the accompanying chipsets: the 5000P and 7300. The Intel 5400 MCH did support PCI-E gen 2, but no tier-1 system vendor produced a server with this chipset. (Supermicro, popular with white-box builders, did have 5400-based motherboards.) Systems of the Intel Nehalem generation and later have PCI-E gen 2. If someone could advise on when AMD Opteron transitioned from PCI-E gen 1 to gen 2, I would appreciate it.
SAS started out with 3.0Gbit/sec signaling. Unlike SATA, SAS appears to be used only with a x4 wide connection. Most SAS adapters have 2 x4 ports. The HP Smart Array P800 has 4 x4 ports. The nominal bandwidth of a x4 3Gb/s SAS connection is 12Gbit/sec. The realizable bandwidth appears to be 1.0-1.1GB/sec.
Unfortunately, this is not matched with the bandwidth of a PCI-E gen 1 slot. To realize more than 800MB/sec from a single x4 SAS channel requires a x8 PCI-E gen 1 slot, which in turn, results in under-utilizing the PCI-E slot or not achieving balance between the 2 x4 SAS ports. Since most adapters have 2 x4 ports, the maximum realizable bandwidth in a x8 PCI-E gen 1 slot is 1.6GB/sec. Some of the early PCI-E SAS adapters have an internal PCI-X bus that limits realizable bandwidth over both x4 SAS ports to 1GB/sec.
Server systems usually have some combination of x16, x8 and x4 slots. No server adapter relevent to databases can use more bandwidth than that provided by a x8 slot, so each x16 slot could have been 2 x8 slots, for a waste of an otherwise perfectly good x8 slot. The x4 slots are usually a good match for network adapters. A PCI-E gen 2 x4 slot is exactly matched to 2 x 10GbE ports.
Matching the available x16 and x8 slots to storage controllers is not always possible. Sometimes it may be necessary to place one or more SAS storage controllers in the x4 slots, in which case it is important to distribute the number disks behind controllers in x8 and x4 slots proportionately as appropriate.
In the last year, 6.0Gb/s SAS adapters and disk drives became available. The same bandwidth mismatch situation between 3Gb/s SAS and 2.5 Gb/s PCI-E gen 1 also occurs with 6Gb/s SAS and 5Gb/s PCI-E gen 2. In addition, LSI Logic states that their 6Gb/s SAS controller has a maximum combined bandwidth of 2.8GB/sec over both x4 SAS ports.
For direct-attach storage, the SAS adapter is frequently also a RAID controller.
Most SAS RAID controllers are built around LSI Logic silicon,
notably the LSI SAS 1078 for 3Gb/s SAS and the new SAS 2008 for 6Gb/s SAS and 5Gb/s PCI-E gen 2.
Intel used to make a PCI-E to SAS RAID controller built around the 80333 IO Processor,
but mysteriously dropped out of the market soon after releasing the new 81348 IOP in 2007.
There might be another vendor as I am not sure who makes the controller for the HP P800.


LSI has a 4 x4 PCI-E gen 2 6Gb/s SAS RAID Controller, listing a LSI SAS 2116. It is unclear if this is a variation of the 2008 or just two die in board.
It is Fibre channel to emphasize that the media is not necessarily fiber. Or it might be that some one thought fibre was more sophisticated. For a long time FC signaling stayed put at 4Gbit/sec, which I consider to be a serious mistake. The mistake might have also been in staying with a single lane, unlike SAS which employed 4 lanes as the standard connection.
Anyways, a dual-port 4Gb/s FC HBA is a good match for a PCI-E x4 slot. To make best use of system IO bandwidth, the x8 slot should be populated with a quad-port 4Gb/s FC HBA. Some Intel 4-way Xeon systems with the 7300MCH have one or two PCI-E bridge expanders, that allow two x8 slots share the upstream bandwidth of one x8 port. In this case, it is recommended that one slot be populated with storage controller and the other with a network controller, as simultaneous heavy traffic is predominately in opposite directions.
|   | PCI-E Gen 1 | PCI-E Gen 2 | ||
|---|---|---|---|---|
|   | x4 | x8 | x4 | x8 |
| 4Gb/s FC | dual-port | quad-port | quad-port | ? |
| 8Gb/s FC | single-port | dual-port | dual-port | quad-port? |
A dual-port 8Gb/s FC HBA should be placed in a x8 PCI-E slot or a x4 PCI-E gen slot. I am not aware that there are any quad-port 8Gb/s FC HBAs for gen 2 x8 slots, much less an 8-port for the gen 2 x16 slot.
Fibre Channel HBAsThere are currently two main vendors for FC controllers and HBAs, Emulex and QLogic. Keep in mind that the SAN controller itself is just a computer system and also has HBAs, for both front-end and back-end as applicable, from these same FC controller vendors. It might be a good idea to match the HBA, firmware and driver on both the host and SAN, but this is not a hard requirement.
On the Emulex HBAs, driver settings that used to be in the registry are now set from the HBAnyware utility. Of particular note are the pairs Queue Depth and Queue Target, and CoalesceMsCnt and CoalesceRspCnt. Various Microsoft documents say that increasing Queue Depth from the default of 32 to the maximum value 254 can improve performance without qualifications.
In life, there are always qualifications. At one time, this queue depth setting was for the entire HBA. Now on Emulex, the default is per LUN, with the option being per target. The general concept is that in a SAN storage system with hundreds of disk drives, limiting the queue depth generated by one server helps prevent overloading the SAN. Well, a line-of-business SQL Server database means it runs the business, and it is the most important host. So increasing the queue depth allowed helps.
Notice that I said a storage system with hundreds of disks. What if the storage system only has 30 disks? Does increasing queue depth on the HBA help? Now that Emulex defaults to per LUN, what if each LUN only comprises 15 disks? The Microsoft Fast Track Data Warehouse papers recommend LUNs comprised of 2 disks. What should the per LUN queue depth be?
My thinking is it should be any where from 2 to 32 per physical disk in the critical LUN. The disk itself has command queuing for up to 64 tasks (128 on current Seagate enterprise drives?). Piling on the queue increases throughput at the expense of latency. In theory, restricting the queue depth to a low value might prevent one source from overloading the LUN. An attempt to test this theory showed no difference in queue depth setting over a certain range.
Note: Queue depth has meaning at multiple locations: at the operating system, on the HBA, on the SAN storage controller, possibly both front and back-end HBAs, and on the disk drive itself.
As Linchi Shea pointed out, SAN stands for Storage Area Network. A storage system that connects to a SAN is a SAN based storage system. But it is common refer to the SAN based storage system as the SAN.
Many documents state that the bandwidth achievable in 2Gb/s FC is in range of 160-170MB/sec, and 320-360MB/sec for 4Gb/s FC. Nominally, 4G bits translates to 500M bytes decimal. Lets assume that there is a protocol overhead of 20% leaving 400M. Then translate this to MB binary where 1MB = 1,048,576 bytes. So 400MB decimal is really 380MB binary. So there is still a gap between observed and nominal bandwidth. Back in 2Gb/s FC days, I investigated this matter, and found that it was possible to achieve 190MB/sec from host to SAN cache, but only 165MB/sec from host to storage controller, then over the back-end FC loop to disk and back. The disks in the back-end are in a loop, with 15-120 disks in one loop path. It is possible that the number of disks in a loop influences that maximum achievable bandwidth.
In the 4Gb/s FC generation, EMC introduced the UltraPoint DAE with star-point topology to disks with an encolsure. This might be what allows EMC to achieve 360MB/s per 4Gb/s FC port.
Most SAN storage systems today are 4Gb/s on the back-end. The front-end might be able to support 8Gb/s FC. SAN vendors are usually not quick about moving to the latest technology. On the front-end, it only involves the HBA. The back-end is more complicated, also involving the disk drives and the enclosures, which might have custom FC components. Personally, I think storage vendors should just ditch FC on the back-end for mid-range systems and go to SAS like the Hitachi AMS. Otherwise customers should ditch the mid-range and go with multiple entry-level systems.
The SAN configuration employs four dual-port HBAs and four fiber channel loop pairs on the backend. Each FC loop pair consists of just that, two FC loops, each loop connected to a different storage/service processor (SP) depending on SAN vendor specific terminology.
Some details of the EMC CLARiiON CX4 line is show below. Each Clariion system is comprised of two Service Processors (SP). The SP is simply an Intel Core 2 architecture server system.
|   | CX4 120 | CX4 240 | CX4 480 | CX4 960 |
|---|---|---|---|---|
| Processor per SP | 1 dual-core 1.2GHz | 1 dual-core 1.6GHz | 1 dual-core 2.2GHz | 2 quad-core 2.33GHz |
| Memory per SP | 3GB | 4GB | 8GB | 16GB |
| Max Cache | 600MB | 1.264GB/s | 4.5GB | 10.764GB |
| Front-end FC ports (Base/Max) | 4/8 | 4/12 | 8/16 | 8/24 |
| Back-end FC ports (Base/Max) | 2/2 | 4/4 | 8/8 | 8/16 |
The Clariion CX4 line came out in 2008. I do have some criticism on the choice of processors for each model. First, the Intel Processor price list does not even show a 1.2GHz model in the Xeon 5100 or 3000 series. This means EMC asked Intel for a special crippled version of the Core 2 processor. The Intel Xeon processors start at 1.6GHz for a dual-core with a price of $167. The quad-core X3220 2.4GHz has price of only $198, so why in the world does EMC use the 1.2GHz dual-core at the low-end? Sure, basic storage server operations does not require a huge amount of compute cycles, but all the fancy features (that really should not be used in a critical SQL system) the SAN vendors advocate do use CPU-cycles. So when the features are used, performance tanks on the crippled CPU used in the expensive SAN storage system.
Now what we really want at the mid-range 480 level is having two processor sockets populated, as this will let the system use the full memory bandwidth of the Intel 5000 (or 5400) chipset, with 4 FB-DIMM memory channels. Yes, the 960 does have two quad-core processors, but I am inclined to think that the 960 (SP pair) with up to 16 back-end FC port might be over-reaching for the capability of the Intel 5000P chipset. If the CX4 960 in fact uses the 5400 chipset, then this might be a good configuration. But I have seen no documentation that the 960 can drive 5.6GB/sec. The quad-core E5405 2.00GHz processor is a mere $209 each, and the E5410 2.33GHz used in the high-end 960 model is $256 each. In late 2008, the dual-core E5205 1.86GHz was the same price as the quad-core E5405 2.0GHz. The Dell PowerEdge 2900 with 2 E5405 quad-core processors and 16GB was $2300.
This is less than the cost of each of the quad-port FC adapters, of which there are two in each SP of the 480. Consider also the cost of the 480 and 960 base systems, and that the 16GB memory in each 960 SP has a cost of around $800 each. Why not just fill the 16 DIMM sockets allowed by the 5000P chipset with 4GB DIMMs at about $3200 for 64GB per SP, unless it is because a large cache on a storage controller is really not that useful?
My final complaint in the EMC Clariion line is the use of a slice of the first 5 disk drives for the internal operating system (which is Windows XP or version of Windows). This results in the 5 disks having slightly less performance than the other disks, which can completely undermine the load balancing strategy. Given the price that EMC charges per disk, the storage system OS really should be moved to dedicated internal disks. If it seems that I am being highly critical of the EMC Clariion line, let me say now that the other mid-range SAN storage system use even more pathetic processors. So, the Clariion CX4 is probably the best of the mid-range systems.
First, the model name and numbering system for the HP entry storage line is utterly incomprehensible. Perhaps the product manager may have been on powerful medications at the time, or there were 2 PMs who did not talk to each other. The official name seems to be StorageWorks 2000 Modular Storage Array, but the common name seems to be MSA2000 G2 (for the second generation). This name might just apply to the parent chassis family, comprised of the 2012 12-bay enclosure for 3.5in (LFF) drives and the 2024 24-bay for 2.5in (SFF) drives. The controller itself appears to be the MSA2300 with suffix for the front-end interface. There are two models of interest for database systems, the 4Gb/s fiber channel fc model and the 3Gb/s SAS sa model. Do not even think of putting a critical database server on iSCSI. The choice is between fc and sa on the front-end interface. The configured unit might be the 2312 or 2324.
Apparently there is also the StorageWorks P2000 G3 MSA. This appears to consolidate the G2 fc and i (iSCSI) models, with 8Gb/s FC. Above this, HP has the P4000 series. I am not sure how this relates to the EVA 4400 series.
The back-end interface is SAS, and allows both SAS and SATA drives. The back-end can also connect to either additional 12-bay LFF enclosures (MSA2000) or 25-bay SFF enclosures (MSA70). There is the option of having a single controller or dual-controllers. The storage expansion enclosures can have single or dual IO interfaces. My opinion is that SAS for the back-end interface is the right choice. FC incurs a large cost premium and has no real advantages over SAS. A single 4Gb/s FC port has one-third the bandwidth of a 3Gb/s x4 SAS port, and the same ration for 8Gb/s FC to 6Gb/s x4 SAS.
There are 2 FC ports per controller on the fc model, and four SAS ports on the sa model. There is a single (3Gb/s x4?) SAS port on the backend. HP initially put out a performance report showing reasonable performance numbers for the MSA2000 G2 with 96 15K drives on the fc model of 22,800 random Read IOPS and 1,200MB/sec sequential in RAID 10, but ridiculously low numbers of 10,800 IOPS and 700MB/s for the sa model. Either this was a benchmarking mistake, which seems unlikely for HP's history in this area, or there were bugs in the sa software stack. This was later corrected to 21,800 IOPS and 1,000MB/s. This configuration is essentially the maximum for the MSA2000 with 2.5in. The random reads works out to just over 225 IOPS per disk, but the sequential is 12.5MB/sec per disk. I am presuming that 1GB/sec sequential could have been reach with about 40 disks. The Microsoft Fast Track Data Warehouse Reference Architecture 2.0 document seems to indicate that 100MB/sec per disk is possible for 2-disk RAID-1 groups.
See HP MSA2000 Technical Cook Book. for additional details. If the URL is not correct, search either the title or document number 4AA2-5505ENW.
I do not know much about the Hitachi AMS line, and have never worked on one (vendors should be alert to subtle, or not, hints). I point out this SAN storage system because Hitachi did submit a SPC benchmark report for it, with a price of about $1500 per 15K disk, amortizing the controller and supporting components. Most SAN storage systems usually work out from $2,500 to $3,500 per 15K 73 or 146GB disk, and up to $6K per 450 or 600GB disk, which seems to be what SAN vendors like to push, with horrible performance consequences. The Hitachi AMS has FC on the front-end and SAS on the back-end. The HP MSA 2000 and EMC Clariion AX also have SAS back-ends, but both are entry storage systems, in having limited backend ports. The Hitachi AMS is a mid-range comparable in bandwidth capability to the CX4 line. I reiterate that FC on the back-end is a major waste of money for less performance.
The big-iron storage systems are really beyond the scope of this document,
but a couple of comments are worth noting.
EMC top of the line used to be the DMX-4, which was a cross-bar architecture connecting front-end,
memory and back-end.
Last year (2009), the new V-Max line replaced the DMX-4.
The V-Max architecture is comprised of up to 8 engines.
Each engine is a pair of directors.
Each director is a 2-way quad-core Intel Xeon 5400 system with up to 64GB memory
(compared with 16GB for the CX4-960).
Each director also has 8 back-end 4Gb/s FC ports (comprised of quad-port HBAs?)
and various options for the front-end including 8 4Gb/s FC ports.
In the full configuration of 128 4Gb/s FC ports on the front and back ends,
the expectation is that this system could deliver 40GB/s if there a no bottlenecks in the system architecture.
Of course, there is no documentation on the actual sequential capability of the V-Max system.
EMC has not submitted SPC benchmark results for any of their product line.
EMC V-Max documentation does not say what the Virtual Matrix interface is,
but I presume it is Infini-Band, as I do not think 4 or even 8Gb/s FC is a good choice.
The main point here is that even EMC has decided it is a waste of time and money to build a custom architecture in silicon, and just using the best of Intel Xeon (or AMD Opteron) architecture components. It should be possible to build even more powerful storage systems around the Intel Nehalem architecture infrastructure. Unfortunately, storage systems evolve slowly, usually lagging 1-2 generations behind server systems.
The next step in the chain of devices from the system IO bus to the disk drive is the disk enclosure (EMC uses the term DAE, which will also be used here even for non-EMC enclosures). Some years ago, a 3U enclosure for 15 3.5in disk drives was more or the less the only standard configuration.
HP may have been the first major vendor to switch to a 2U 12-disk enclosure for 3.5in drives.

The standard configuration for 2.5in drives seems to be a 2U enclosure for 24 disks (below)

or 25 drives.

And the legacy 15 LFF (3.5in) disk enclosure.

My view is that the 12-disk enclosure is better matched to a single IO channel than the 15-disk enclosure. The 24 disk SFF enclosure should be split into two separate channels.
I am not aware of any SAN vendors offering the high-density enclosures for 2.5in drives, except for HP in the Storage Works 2000 MSA line. This may indicate a serious lack of appreciation (or even understanding) of the importance of performance over capacity.
The table below shows the specifications for the recent Seagate 3.5in (LFF) and 2.5in (SFF) 15K drives. The 2.5in Savvio drive has lower average seek time. The rotational latency for 15K drives is 2.0ms. The transfer time for an 8KB block ranges from 0.04ms at 204MB/s to 0.065ms at 122MB/s. The average access time for 8K IOP randomly distributed over the entire disk is then 5.45ms for the 3.5in disk and 4.95ms for the 2.5in disk. It should also be considered that the 3.5in Cheetah 15K.7 has media density of 150GB per platter versus 73GB for the 2.5in Savvio 15K.2. If the 3.5in disk were only populated to 50% capacity, the average seek latency would probably be comparable with the 2.5in disk.
|   | Cheetah 15K.6 | Cheetah 15K.7 | Savvio 15K.2 |
|---|---|---|---|
|   Avg. Read Seek | 3.4ms | 3.4ms | 2.9ms |
|   Avg. Write Seek | 3.9ms | 3.9ms | 3.3ms |
|   Sequential Max | 171MB/s | 204MB/s | 160MB/s |
|   Sequential Min | 112MB/s | 122MB/s | 122MB/s |
The sequential transfer rates assume no errors and no relocated logical blocks. On the enterprise class disk drives, this is effectively achieved. On the high-capacity 7200RPM drives, the ability to sustain the perfect transfer rates is highly problematic, and the data sheet may not specify the design transfer rate.
The chart below shows IOMeter results for a single 10K over a range of data space utilization
and queue depth demonstrating the short-stroke effect on IOPS (vertical axis).
The charts below show latency on the vertical scale in ms for a range of data utilizations
and queue depth.
There is no point to having the big capacity SATA disks in the main storage system. We said early that the short-stroke effect was key. This meant we will have much more space than needed on the set of 15K drives. The SATA drives are good for allowing dev and QA to work with the full database. There are too many developers who cannot understand why a query works fine on a tiny 10MB dev database, but not the 10TB production database.
The short-stroke effect is absolutely essential for transaction processing systems with tight mandatory limits on responsiveness. The short-stroke effect lowers latency and improves random IO performance. Most importantly, the short-stroke effect keeps latency low during heavy IO surges when active data is kept in a very narrow band of the disk. On a fully populated disk where full strokes are required, latency can jump to several hundred milli-sec during heavy IO surges.
One of the fundamental arguments made by SAN storage vendors is that by consolidating storage, it is possible to achieve high storage utilization, i.e., guaranteeing the full-stroke criteria. A heavy IO surge is very likely to cause transaction processing volume to collapse. To benefit from the short-stroke effect, it is necessary to restrict the active data to a very narrow range. The remaining disk capacity can still be used for data not in use during busy hours. This means having far more capacity than the active database, which in turn implies that it is essential to keep amortized cost per disk low, i.e., forgoing frills.
Most SSD fall into one of two categories: One is an SSD with one of the standard disk drive interfaces such as SATA, SAS, FC, or one of the legacy interfaces. The second type connects directly into the system IO port (PCI-E), for example the Fusion-IO SSDs. TMS has a complete solid state SAN system, which might even included DRAM for storage as well as non-volatile memory.
Most of SSD devices in the news have a SATA interface and are intended for use in desktop and mobile systems. There might be (or have been) technical issues with using the SATA SSD in an SAS storage system when there are multiple SAS-SAS bridges in the chain, even though SATA drives can be used in these systems.
STEC makes the SSD for the EMC DMX line, possibly other models as well, and for several other storage vendors. The specifications for the STEC SSD is 52K random read IOPS, 17K random write IOPS, 250MB/s sequential reads, and 200MB/s sequential write.
When using SSDs with disk interfaces in a mixed HD/SSD environment, it might be a good idea to place one SSD in each of the IO channels instead having multiple SSDs on one channel. Check with the storage vendor on technical issues.
The general idea behind the Fusion-IO architecture is that the storage interfaces were not really intended for the capabilities of an SSD. The storage interface, like SAS, was designed for many drives to be connected to a single system IO port. Since Fusion-IO could build a SSD unit to match the IO capability of a PCI-E slot, it is nature to interface directly to PCI-E.
What I would like from Fusion-IO are a range of cards that can match the IO bandwidth of PCI-E gen 2 x4, x8 and x16 slots, and deliver 2, 4 and 8GB/s respectively. Even better is the ability to simultaneously read 2GB/s and write 500MB/s or so from a x4 port, and so on for x8 and x16. I do not think it is really necessary for the write bandwidth to be more than 30-50% of the read bandwidth in proper database applications. One way to do this is to have a card with a x16 PCI-E interface, but the onboard SSD only connects to a x4 slice. The main card allows daughter cards each connecting to a x4 slice, or something to this effect.
One more thing I would like from Fusion-IO is using the PCI-E to PCI-E bridge chips. In my other blog on System Architecture, I mentioned that the 4-way systems such as the Dell PowerEdge R900 and HP ProLiant DL580G5 for Xeon 7400 series with the 7300MCH use bridge chips that let two PCI-E port share one upstream port. My thought is that the Fusion-IO resides in an external enclosure, attached to the bridge chip. The other two ports connect to the host system(s). One the host would be a simple pass through adapter that sends the signals from the host PCI-E port to the bridge chip in the Fusion-IO external enclosure. This means the SSD is connected to two hosts. So now we can have a cluster? Sure it would probably involve a lot of software to make this work, who said life was easy.
At this time, I am not entirely sure what the proper role is for SSD in database servers. A properly designed disk drive storage system can already achieve phenomenally highly sequential IO, just stay away from expensive SAN storage, and do not follow their standard advice on storage configuration.
Random IO is the natural fit for SSD. Let us suppose the amortized cost of a 146GB 15K disk is $500 in direct attach, $2K in a SAN, and a similar capacity SSD is $3000. Then table below shows cost per GB, and cost per IOP between HD and SSD, using fictitious but reasonable numbers.
|   | Direct Attach | SAN | SSD |
|---|---|---|---|
| Capacity | 146GB | 146GB | 146GB |
| Amortized Unit Cost | $500 | $2000 | $3000 |
| IOPS | 200 | 200 | 20,000 |
| $/GB | 3.4 | 13.7 | 20.5 |
| $/IOP | 2.5 | 10 | 0.15 |
A database storage system should not be entirely SSD. Rather a mix of 15K HD and SSD should be employed. The key is to put the data subject to high random IO into its own filegroup on the SSD.
Some people have suggested and temp as good candidates for SSD. For a single active database, a storage system that works correctly with the hard drive sequential characteristics is fine. It is only the situation of multiple high activity databases. Ideally, the storage controller cache interprets the pattern of activity from multiple log files on one LUN, so that a single RAID group is sufficient. If not, then this is a good fit for SSD.
I am not certain that SSDs are necessary for temp. For the data warehouse queries, the temp on a large HD array seems to work fine. In the TPC-H benchmark results, the queries that showed strong advantage for SDD involve random IO, not heavy tempdb activity.
Two criteria can sometimes terminate extended discussion. If the database happens to fit in memory, then there will not be heavy disk activity, except possible for log and temp. The log activity can be handled by disk drives. In this, it may not be worth the effort to setup a 48-disk HD array for temp, so a single SSD for temp is a good choice.
The second criteria is for slightly larger databases that exceed system memory, perhaps in the 200-400GB range, but are sufficiently small to fit on one or two SSDs. Again, it may not be worth the effort to setup the HD array, making the SSD a good choice.
Solid State Storage in the future without RAID?A point I stress to people is to not blindly carry a great idea from the past into the future without understanding why. (Of course, one should also not blindly discard knowledge, most especially the reason underlying reason behind the knowledge).
So why do we have RAID? In the early days, disk drives were notoriously prone to failure. Does anyone remember was the original platter size was? I thought it was in the 12-18in range. MTBF may have been in the 1000hr range? Even today, at 1M-hr MTBF, for a 1000-disk array, the expectation is 8.8 disk failures per year (the average hours per year is 8,765.76, based on 365.24 days) Some reports show much higher failure rates, perhaps 30 per year per 1000 disks. Of course this includes all components in the storage system, not just the bare drive.
Sure, SSDs will also have a non-infinite MTBF.
But the HD is fundamentally a single device.
If the motor or certain components in the read/write mechanism fails,
the entire disk is inaccessible.
An SSD is not by necessity inherently a single device.
The figure below shows a functional diagram of an Intel SSD.
There is a controller, and there are non-volatile memory chips (NAND Flash).
In system memory, the ECC algorithm is design to correct single bit errors
and detect double-bit errors within an 8-byte channel using 72-bit.
When four channels are uniformly populated it can also detect
and correct an entire x4 or x8 dram device failure and detect double x4 chip failures.
I suppose SSDs might already have some chip failure capability
(but I have not found the documentation that actually states this detail).
There should be no fundamental reason an SSD cannot have redundant controllers as well. With proper design, the SSD may nolonger be subject to single component failure. With proper design, an SSD storage system could conceivably copy off data from a partially failed individual SSD to a standby SSD, or even a disk drive.
I stress that may not be what we have today, but what I think the future should be.Having set most of the base, we can now discuss the strategies behind configuring the storage system. The most fundamental is probably that the traffic for each LUN must travel through a specific path. So on FC, do not plan on having more than 320-360MB/sec per LUN. EMC documents say that traffic can go on both loops of the FC pair, meaning 720MB/sec. I may have not interpreted this correctly, so verify this. The traffic to LUNs must be properly distributed over the available paths and HBAs with adjustments for the PCI-E slot bandwidth.
In the past, I have seen approximately 11MB/sec per disk in running a table scan on a SAN storage system. This works out to 175 x 64KB IOs, meaning the storage is issuing 64K IOs serially instead of taking advantage of the sequential capabilities of the disk drive.
The EMC whitepaper h5548 "Deploying EMC CLARiiON CX4-960 for Data Warehouse/Decision Support System
(DSS) Workloads" states:
number of key FLARE changes have been included in release 28,
the array system software release in support of the CX4 family, to ensure that we can drive
the underlying disk drives to the considerably higher level of data delivery rate using the "thin"
RAID striping configurations.
Thin RAID is described as a RAID group with few drives.
Later the EMC paper advocates 2+1R5 or 2+2R6,
in contrast with Microsoft FTDW which advocates 1+1 RAID 1.
The Microsoft FTDW documents also statement 100MB/sec per disk is possible. It would then take only 8 disks in one DAE to saturate a FC loop pair. My assertion is that even DW is not always true sequential, so a more reasonable target is 15 disks in one DAE, 2 DAE in one 4Gb/s FC loop pair. If the 720MB/sec per loop pair can be achieved, this works out to 24MB/sec per disk.
If there ever is 8Gb/s FC on the back-end, it would be desirable to continue with 2 DAE per loop pair averaging 48MB/sec per disk.
A direct attach storage system has no problems in aggregating disk drive sequential performance. In the past, I have place 15 disks in a single enclosure on a 3Gb/s x4 SAS port for 800MB/sec. Today, perhaps 12 disks on a 6Gb/s x4 SAS port works well with the 12 disk enclosures, or splitting the 24 disk SFF enclosure into two.
Technically, the correct strategy on SAN is to create three or more LUNs on each of the main (non-log) RAID groups. The first LUN is for data, or the filegroup with the big table (usually order line-items) with the other tables and indexes in the filegroup on the second LUN. The tempdb data file is on the next LUN. The last LUN, with the short-stroke strategy, should be the largest capacity and is intended for inactive files. This could be database backups, flat files or even data files for archive tables.
If we were to be extremely aggressive in performance optimization, the data and temp LUNs would be raw partitions. Most people do not even know about this capability in the Windows operating system and SQL Server, and will probably be afraid to venture into this territory.
The following example shows a storage system with 9 RAID groups, 4 FC loop pairs, and perhaps 8 DAEs total, 2 per loop pair. With the data & temp LUNs evenly distributed across SP and loops, how should the log RAID group LUN be configured? If the SPs and loops are not heavily loaded, the log could be set to either SP. It could be in its own loop, or even be distributed among unused disks in the other loops.
| RAID Group | SP | Loop | LUN A | LUN B | LUN C | LUN D |
|---|---|---|---|---|---|---|
| 0 | A | 0 | FG1 data 1 | FG2 data | temp 1 | other |
| 0 | B | 0 | FG1 data 2 | FG2 data | temp 2 | other |
| 0 | A | 1 | FG1 data 3 | FG2 data | temp 3 | other |
| 0 | B | 1 | FG1 data 4 | FG2 data | temp 4 | other |
| 0 | A | 2 | FG1 data 5 | FG2 data | temp 5 | other |
| 0 | B | 2 | FG1 data 6 | FG2 data | temp 6 | other |
| 0 | A | 3 | FG1 data 7 | FG2 data | temp 7 | other |
| 0 | B | 3 | FG1 data 8 | FG2 data | temp 8 | other |
| 8 | ? | ? | log | |||
If one were to follow the thin RAID group strategy, then there would be multiple RAID groups in each loop.
One storage system feature I do not intend to use is dynamic volume growth or relocation. SQL Server already has features for this. If more space is needed, add another DAE, RAID group, and set of LUNs. Use the ADD FILE command. Be sure to rebuild indexes during off hours to redistribute data across all files in the filegroup. When moving data from older disks to new disks, we could use DBCC SHRINKFILE with the EMPTYFILE option, but I have found that it can be faster to rebuild the indexes (and clustered indexes) to another filegroup first, and then shrink the file.
This can also be a fundamental question of whether to use the SQL Server or storage system features to perform certain functions. A transactional database engines, be it SQL Server, Oracle, DB2 etc have carefully designed procedures to main data integrity.
One of most serious problems I have seen is that in large IT organizations, the database and storage systems are managed by completely different groups. The storage group is intent on carrying out the latest dogmatic doctrine from storage vendors, such as storage as a service. Applications request service based on capacity only. The storage service (system) will automatically relocate LUNs to avoid hot spots, including moving LUNs between SSD, 15K and 7200RPM storage.
The SAN admin will strictly follow the configuration strategies put out by the storage vendor, in complete disregard of the base of accumulated knowledge on Database storage performance. This is guaranteed to be the worst possible configuration for database performance. When the DBA raises questions regarding poor storage performance, the SAN admin looks at the tool provided by the SAN vendor and pronounces that all parameters are well under the system limits. That disk latencies are in the hundreds of milli-seconds is not a reason to be concerned. The line-of-business and data warehouse systems really need to be on dedicated storage systems.
Per other discussion, each highly active log file must have its dedicated RAID 1 pair of physical disk drives, and possibly RAID 10 sets in extreme cases. Unknown is whether really critical log traffic should be directed over a dedicated HBA. Even more extreme is if an SP under heavy load from data and temp LUNs can provide super low latency for log LUNs. If not, then in such extreme circumstances, the log might require its own dedicated SP. The middle storage systems all have 2 controllers. Enterprise storage systems can have multiple controllers, but each is incredibly expensive. An entry level storage system has one or two controllers, but the intent is the complete storage system has multiple entry storage units, which is why entry level SAN systems makes sense even for big-time databases.
The other question on logs, suppose there multiple high activity log files on the same RAID group. Without a cache, the actual IO to disk is non-sequential, hopping between the log files. Can the storage controller effectively cache the log writes? Since the writes to the individual log files are sequential, it will fill the cache line and RAID stripe for that matter.
Comments on System Memory ConfigurationSome people think I try to put formulas in areas that they think are too mysterious, like SQL Server execution plan cost structure. I put formulas were I think it is meaningful. The Microsoft FTDW puts out an arbitrary on system memory with no substantiating data. My thoughts:
Server systems today are frequently configured with 128GB of memory. A 4GB ECC DIMM costs about $150 depending on the source. So 32 x 4GB contributes on the order of $6,000 to the cost of a system. Given the cost of storage performance, filling the system with 4GB memory modules can have a relatively high value to cost ratio. In any case, an effort to determine the most correct memory configuration will likely exceed the cost of filling the DIMM sockets with the largest capacity DIMM without a price per GB premium. Today, this is the 4GB DIMM. Next year or in 2012, it might be the 8GB DIMM.
 
SQL Blog post Data, Temp and Log
First, especially for all the people with SAN storage, drive letters are of no consequence. What matters is the actual physical disk layout behind each RAID Group or LUN. Forget capacity, pay attention to the number of spindles supporting each RAID group. While there is nothing wrong with making 2 or more LUNs from the same set of disks, I prefer to make 1 big LUN from the RAID Group, then create multiple partitions from Windows, just to be absolutely clear what storage space shares spindles.
If the RAID group is shared with other applications, make sure that the SLA guarantees read and write latency. One very large company conducted a stress test in the QA environment. The SAN admin carved the LUNs from the same pool of disks as production, but thought he had a really powerful SAN that this was not important. It was and he did not.
The general theory is to aggregate random IO loads into a common pool of disks, and to provide dedicated physical drives for each significant sequential load. For some reason, many people incorrectly interpret this as placing the log files for all databases into a common drive.
Each database log for a well design application, in theory, generates pure sequential IO, and some serious technical heavy weights say that a single pair of physical disks in RAID 1 can handle almost any SQL load (barring very limited exceptions). Well this is not true if you do transaction log backups, have transaction roll-backs, replication and database mirroring. One might do additional calculations for this situation. If one places more than one log file on a common physical disk set, then even though the IO to an individual file might be sequential, the complete set of IO to the disk is not.
So does each log file get its own set of physical disks? If there are 50 databases in the instance, does the server need 50 pairs of disks in RAID 1 for logs?
OK, let go back to the original theory. A single 15K disk can do approximately 200 small block random IOPS for data spread across the entire disk. It can over 4000-5000 pure sequential IOPS in SQL Server generated log writes. For each pair of physical disks you take away from the common pool, then that is 200 random (write) IOPS that has been lost (assuming RAID 10). If a database does not generate more than 200 log writes/sec and does not require sub-ms log write latency, it is better to leave this log in the common pool.
But this is not how SQL Server behaves. Run a query that aggregates a very large number of rows, and involves a join. To be specific, the execution plan should show a hash operation, whether for the join or for the aggregate. Check the hash operation cost details. The IO cost component should be non-zero. Otherwise your test query is too small to actually use tempdb (check with the SQL engine team to be sure on this).
Now watch data and temp IO over time. (that is not based on the difference before and after). The disk IO temporal pattern is read from data, write to temp, read from data, write to temp etc, then may at the end, read from temp (but sometime not, think about what this means). So if you had split your disks 50-50 data-temp (excluding OS and log disks), then at any given time, 50% of your disks are busy, the other 50% are idle. Is this the best use of your (gawd awful expensive) resources (if you bought a SAN)?
There are special considerations if certain data files depend on low latency response, and others benefit from high-queue depth operation, but hey, if life were simple, would your company need skilled people? If the SAN vendor tells you its ok to have one common pool for everything, then that person is a liar or incompetent, and frequently both.
 
In the past, the competent technical analysis was all oriented around 1 heavy database per server. Hence many of the multi-db configuration strategies lacked good substantiating analysis.
In the past, most medium to large businesses were on 4-socket systems with small business on 2. Today with quad or six-core processors, the reality is that a single socket processor could handle the compute requirements for medium and possibly even large business. However, there is not much in the way of cost difference between a single socket system and a dual-socket system. A 2-socket system typically has much higher memory capacity when both sockets are populated. Also, the two-socket systems like the HP DL/ML370G6 and Dell T710 have massive internal storage. Have you noticed the per storage bay cost of these are less than for the external storage unit? So this means the low-end systems should target a 2-way with the cheaper processors, just to get the memory capacity.
I would not blindly recommend a tempdb strategy without first looking at the app. The two-ends of the tempdb spectrum are 1) every gawd dang stored proc seems to create a temp table, making for very high activity to tempdb system table, but there is actually very file io to temp. 2) big queries with hash or sort ops that actually write and read from temp. In between, there might be a situation where there are heavy write temp, but very little read.
In a heavy system, I would recommend 48-96 15K disk drives, which can handle high random IO, sequential IO and temp.
For many active db, instead of 1 pair of disks for the logs of each heavy db, i prefer SSD for the logs.
On lower cost system, my thinking is 64-bit SQL Server + larger memory configuration + high degree of parallelism to increase the set point where intermediate results have to be written to tempdb. On a 32-bit system, this occurs at a relative low point (I worked it out once back in SQL 2000 days). On 64-bit, it is proportional to memory (32-bit is proportional only to non-AWE memory). The memory set point is also one per thread, so high-degree of parallelism also increases the amount that can be kept.
Failing this, get an inexpensive SSD. I have no recent experience with RAM disks. For $200-400, I am not going to be bothered with issues that could be solved with the cheap SSD.
I am also very leery of the number tempdb files. The original complaint was for case 1) in SQL Server 2000 pre-sp3 days. The issue was supposedly fixed, and yet people keep talking about tempdb files per core. No one seems interested in citing actual performance tests on a specific version. This is generally by people who collect rules without caring why
 
Comments by Alejandro Mihanovich:
Very interesting article. I work at Microsoft. I'm now an MCS guy (used to be a SQL server PFE).
I'm writing a design document for SQL 2008 so I need to have good foundation on disk IO concepts.
The only concept I'm missing comes from the following paragraph from
the SQL predeployment I/O best practices.
"Maintain a balanced storage design.
The architecture of specific HBAs (Host Bus Adapters) and their settings influence performance
and throughput.
But in general, more LUNs result in more independent I/O queues and potentially
more concurrent outstanding I/Os.
Assuming that the Storage Area Network (SAN) has no bottlenecks,
one can expect better scalability with multiple LUNs and multiple paths between server HBAs and the SAN."
I don't see your mentioning of this in your arguments, but it looks that having many different LUNS creates more queues and will allow for more outstanding IOs (and thus more IOPS assuming no SAN bottlenecks). What are your comments on this ?
My reply:
sorry about the lack of detail, i recall describing more detail in other places.
I would suggest caution in the above quote. Balance between the number of IO channels (ie, HBA)
and physical disks is essential. Sure a LUN could be construed as a physical disk,
but I have seen total idiots think having multiple LUNs (technically and originally,
this just meant Logical Unit Number,
but in SAN, most use this term to refer to the logical disk created by the storage system
and presented to the host OS as a physical disk) is all that is required,
not thinking that 10 LUNs created on 10 physical disks is not 10 LUNs on 100 physical disks.
In brief, lets distinguish between two different disk queues. One, the HBA typically defaults to "Per Target" mode, where the default queue depth is created for each target (LUN). I think the other mode is per HBA or something (look in the QLogic manual for this), meaning all LUNs on that HBA channel share one queue.
The other place a disk queue is on the disk itself. I think most SAS/FC allows for 32 outstanding IO requests. That is it can only do elevator seeking for upto 32 IO requests.
Back to the HBA, I believe the default queue depth is 32, which means queue depth 32 per LUN. Some people, including MS people, blindly say increasing queue depth to 128 or 254 improves performance. Well sure if you have a storage system with lots of physical disks. My guess is 8 disks per LUNs or more. If there are only a small number of physical disks per LUN, then whats the point of increasing the queue depth? So I reiterate: it is grossly irresponsible to suggest increasing queue depth on the HBA without understanding the physical disk to LUN mapping. Also, if multiple LUNs are created from the same physical disks, you still end up with mismatched queue depth to physical disks.
So, balance the number of IO channels, physical disks, and set queue depth appropriately, and even better, get a competent professional, not some idiot with a list of rules.
QDPMA 2008 article Storage, and also on SQL Blog Storage Performance for SQL Server
There is a perplexing lack of material on overall storage performance strategy for database servers. Too much of the content is narrowly focused on tuning techniques to improve performance over an unknown base. First, storage system IO performance is well understood from the science and engineering perspective. Second there is sufficient information on SQL Server IO patterns. It follows that the overall storage performance objective is then to achieve the proper balance between capability and cost. Storage performance is achieved by distributing load across several disk controllers, and a (very) large number of disk drives. After which, the storage capacity is likely to be many times larger than the database. Any document that only discusses sizing storage to meet the database space requirement is written by someone who should not be giving advice on database performance. A key element in this storage performance strategy is keeping the amortized cost per disk reasonable. Testing and tuning is the final step to verifying that the storage system performance meets the design specification and requirements.
OLTP and data warehouse applications have inherently incompatible objectives. A transactional database server needs sufficient storage performance to support peak volume processing while maintaining good or acceptable responsiveness. In recent years, most database servers now have sufficient memory that this no longer an issue. The remaining major issue is handling transient IO surges that disrupt transaction processing. If it is not practical to achieve complete immunity, then tolerable limits to the magnitude and duration that transaction processing is affected should be established. A data warehouse should be able to power through large queries with as much as bandwidth and IOPS as warranted. Rules on disk performance counter limits can be disregarded. The storage performance was purchased, so one may as well use it.
There are several causes for transient IO surges. Examples include: checkpoints, large queries, and other internal housekeeping activities. A log backup or transaction rollback can also severely degrade transaction processing because this breaks the normal pure sequential IO pattern. One common recommendation is offload large queries to a separate report server. But it is not possible to completely suppress IO surges from other activities. Brute force capability is necessary in the storage system regardless. Once the storage system is configured with brute force capability, it may also be possible to run large queries without degrading transactions. (Large queries on transaction processing servers should be controlled by setting the max degree of parallelism to 1 or perhaps no more than 1 out of 4 processors.)
The key to preventing transaction processing disruptions in the IO subsystem is maintaining low latency reads from data files and very low latency writes to the log file, both of which are necessary to complete transactions. Even if the entire database resides in memory, a large query requiring temporary workspace could cause data to be flushed out of the buffer cache, requiring reads for subsequent queries. Writes to data are not required to complete a query as this is handled by the lazy writer. Reads from log should only occur during transaction log backups and possibly from a very large transaction rollback, which should not occur in properly designed transaction server during busy hours.
A very brief discussion of disk drive performance characteristics is warranted. The difference between random and sequential IO is well known if not fully appreciated. Recent generation 10,000rpm (10K) disk drives can sustain sequential transfer rates of 80MB/sec on the outer tracks and 40MB/sec on the inner tracks. For 15K drives, 125MB/s on the outer and 70MB/s on the inner tracks. Random IO performance is constrained by disk rotational speed and seek-time. The disk media on a 10K drive completes a full rotation in 6ms, for an average rotational latency contribution of 3ms, and 2ms for a 15K drive. Current generation 10K drives have an average seek time of 4.6ms for reads and 5.2ms for writes. A 15K drive might have average seek times of 3.5ms for reads and 4.2ms for writes. For a small 8KB access, the transfer time is approximately 0.1ms. Other contributions are negligible, so the total latency for random reads is about 8ms on 10K and 5.6ms on 15K drives. This leads to the frequently quoted small block random IO performance of 125 IOPS for 10K and 175 IOPS for 15K drives respectively.
Implied in the above calculations are that disk IO is issued serially one at a time and data is distributed across the entire disk. When data is restricted to a narrow section, then the average seek time is lower, the short stroke effect. When multiple disk IO requests are issued simultaneously, the disk re-orders the IO sequence, command queuing, for higher throughput at the expense of longer latency. If data is distributed over the entire disk, high-queue depth operation quickly leads to very high latency, exceeding 20ms at queue depth 4 and over 100ms at queue depth 32 per disk. This is the underlying cause and effect of the maximum queue depth 2 recommendation. A very important matter rarely discussed is the short-stroke effect combined with high queue depth operation. When only a tiny fraction of the disk is used for the active database, preferably 5% or less, latency increases much more slowly with queue depth, less than 20ms at queue depth 8 and less than 40ms at queue depth 16 and random IO performance can exceed 400 IOPS per disk. This characteristic is the key in maintaining strong transaction processing resiliency. This is also counter to the arguments made by SAN vendors in achieving high disk space utilization with shared storage.
The SQL Server engine has internal mechanisms governing disk IO. The details concerning what operations and the set points are not disclosed as the strategy is probably still evolving. In the past, there were differences in IO strategy between Standard and Enterprise editions. The assumption was that Enterprise Edition would be employed on systems with high performance storage systems suitable for aggressive high queue depth IO operation.
One objective of the SQL Server IO strategy is to prevent over flooding the disk queue during checkpoints, effectively shutting down transaction processing. Some builds of SQL Server 2000 were not effective at this, while early builds of SQL Server 2005 were successful. In large queries, queue depth is throttled, but not enough to prevent excessive impact in transaction processing on weak storage systems and perhaps too much for best data warehouse performance on strong storage systems. For execution plans with loop joins or bookmark lookups, SQL Server issues single IO in a serialized manner (one IO is issued, the next IO starts after the first is completed) for estimated row counts 20 and lower. At estimated row counts 30 and higher, SQL Server issues multiple concurrent IO, but does not drive the disks for maximum throughput. In any case, a query requiring disk IO involving 30 plus estimated rows can execute faster than the similar plan involving 20 estimated rows because the disk IO is issued concurrently instead of one at a time serially. In a simple table scan operation, SQL Server issues 256K-1MB IO at low queue depth. Scans followed by a hash operation generate 8K IO, which is not the best choice. See the paper: Microsoft SQL Server I/O Basic Chapter 2 and included references for more details.
Since SQL Server cannot know whether it is being used for transaction processing, or data warehousing, it cannot automatically determine the best disk IO settings. There really should be a setting for the database usage model. In addition, it is helpful to declare the number of physical disk drives in each array. This allows the SQL Server engine to know how many concurrent IO can be issued while maintaining low latency, without relying entirely on the disk access time feedback. The SQL Server engine internal IO governor is beneficial, but this alone cannot mask transient IO surges effectively on a weak disk system. If IO is not throttled enough, disk access latency can become excessive. If IO is throttled, then the full IO performance capability of the disk system is not realized.
The major server systems today based on the Intel 5100, 7300 or AMD/nVidia chipsets all have very powerful IO capability with 28 or more PCI-E lanes, allowing for 7 or more x4 PCI-E slots. Intel reported that the 5100 chipset can sustain 3GB/s disk IO. The other two chipsets should be able to equal or exceed this. Many systems configure fewer slots with a mix of x4 and x8 widths. The issue is that the first generation PCI-E RAID controllers could not fully utilize the x8 PCI-E bandwidth (1.5-1.6GB/sec). It is possible the second generation PCI-E controllers can but there is very little information on this. A single PCI-E RAID controller is capable of sustaining 800MB/sec in a x4 PCI-E slot.
To realize the full system IO capability, it is necessary to spread IO across as many controllers or HBAs as allowed by the number of independent PCI-E slots, and across as many disk drives as warranted. Technically, 8 drives could saturate a single PCI-E controller in pure sequential IO. In practice, few SQL operations will generate pure sequential IO. Common external storage enclosures hold up to 14-16 disks. The HP disk enclosures hold 10, 12, and 25 disks. Configuring one or two fully populated enclosures with a total of 12-30 disks for each PCI-E controller is a perfectly sound strategy.
Let us examine the disk configurations employed in the TPC benchmarks. In a recent TPC-C publication, an HP ProLiant ML370G5 with 2 x Xeon X5460 quad core processors is configured with 7 P800 RAID controllers, 600 36GB 15K disks drives for data and 28 72GB 15K drives for log. There are 100 disk drives connected to each of 6 controllers for the data files. The total disk space on the 600 data drives without RAID overhead is 20TB. The size of the data files is 1.8TB. The active portion of disk space used is about 10%. Many TPC-C publications are cost-performance optimized, meaning that it is reasonable to assume each of the 600 data disks are loaded to approximately 200 IOPS, for 120K IOPS total. The TPC-C benchmark generates exceptionally high random IO activity, infrequently seen in actual transaction processing databases. To support this IO rate, a number of special steps are involved. A custom driver to bypass the Windows mini-port architecture is one. Raw partitions are used instead of the NTFS file system. Disk performance counters are disabled in both the operating system and SQL Server. In typical systems running below 10K IOPS, these special precautions have no impact. Raw partitions might be useful for disk activity on the order of 20K IOPS or higher. Disabling performance counters is not advisable in almost all cases.
The TPC-H disk configurations are probably more meaningful to actual heavy duty critical database servers than the TPC-C configurations. The large DSS type queries in the TPC-H benchmark are effectively heavy transient IO surges. There is some variation in TPC-H disk configurations depending on the size of the database and amount of system memory configured. The HP report for at 1TB scale factor (1TB data, and another 300GB indexes) on a ProLiant DL585G2 with 32GB memory is a suitable representation of a heavy duty database server. The disk configuration is 8 RAID controllers (7 PCI-E and 1 PCI-X) connected to a total of 200 disks for data, 25 disks per controller. One controller also connects 2 disks for OS and 4 disks for logs. The TPC-H workload does not generate significant log load. A heavy duty transaction server should have the logs on a dedicated controller.
Not all applications require the full performance capability of the above TPC-H storage configuration. A basic level of capability can be achieved with 2 controllers and 2 external disk enclosures with 12-16 disks each. An intermediate capability level is achieved with 4 controllers and 4 disk enclosures with a total of 48-60 disk drives. Really serious capability requirements should employ 6-8 disk controllers and 120-200 disk drives. The cost for one controller and one enclosure populated with 73GB 15K drives (Dell MD1000) is about $7500. The basic capability cost using direct attach storage is then about $15K, $30K for intermediate capability and $60-100K for heavy duty capability. The Dell MD3000 that supports clustering is more expensive. SAN storage systems are much more expensive, with a amortized cost per disk exceeding $2000.
For any given storage configuration, based on the number of controllers, number of disks, and RAID level, the theoretical random and sequential IOP performance can be calculated. The actual realized capability may be far below what should have been possible. Always test the configuration. Storage vendors like to offer “advanced” capabilities without fully disclosing the performance overhead. Also be aware of disk partition alignment. It is important to test IO performance over the anticipated size of the database.
Modern server systems have considerable IO performance capability. This requires distributing load over multiple storage controllers and very many disk drives. If the resulting configuration has far more storage capability than needed by the primary database, this is a very important performance bonus, providing the capability of handling heavy IO surges while maintaining reasonably IO latency. The IO governor in the SQL Server engine helps, but cannot hide weak storage performance. Always test to verify the actual storage performance is in line with the expected performance level.
Disk partition alignment is a problem caused by the Windows operating system in RAID stripes. For whatever reason from long ago, the OS allocates 63 sectors (of 512 bytes, or 31.5KB in all) for its own use. The first partition then starts on the 64th sector. A common stripe size on RAID arrays is 64KB. The problem is that in random 8KB disk read or writes, accesses to either the first or last 8KB of a 64KB extents will reside on different stripes of the RAID array, i.e., different physical disks. So in theory, every 8 accesses for 8KB blocks will result in 10 disk accesses, for a degradation of 20% in random 8K IO performance. There are various reports of greater performance impact. In some storage systems, unaligned accesses bypass the cache, so actual performance impact could be greater. Another possibility is that a particular storage system has become seriously fragmented over time, and the simple act reformatting the file system has some benefit. Windows Server 2008 finally corrects this situation. See the presentation by Jimmy May on the SQL Server Disk Partition Alignment. There is more information on disk partition alignment targeted for Microsoft Exchange ( msexchange.org tutorials Disk-Geometry).
For some inexplicable reason, people seem to want to believe that a large cache on the SAN can compensate for a low disk count. A separate disk cache for a database engine is fundamentally a silly idea. The database engine is a disk cache specially optimized for databases. Suppose the database has an 8GB data buffer. What do you really expect to be in a 4GB SAN cache that is not already in the database engine’s internal cache? If the SAN cache is larger than the database cache, i.e., system memory, then one should probably find another person to do system configuration. Accessing cache in the database engine is much more efficient than going out to the storage controller cache. All this having been said, a large write cache on the storage controller is beneficial in handling checkpoints and other write bursts. RAID controllers should probably be offered with 2-4GB cache, not the 256-512M typically offered.
Designing high performance I/O for SQL Server
Authors: Thomas Kejser & Mike Ruthruff
SQL Server Customer Advisory Team
Tech Reviewer: Jimmy May
Jimmy presented at the Curaco PASS users group.
Below are some storage related material I found.
When I get a chance, I will comment where appropriate.
How It Works: Bob Dorr's SQL Server I/O Presentation
Solid State Disks and SQL Server
Wes Browns article on Fusion-IO
Storage Top 10 Best Practices
Predeployment I/O Best Practices
SQL Server Best Practices Article
Physical Database Storage Design
Deploying SQL Server 2005 with SAN #1
System and Storage Configuration for SQL Server
I will fill this out later
Seagate 7200RPM 3.5in drive

Seagate 10K 3.5in drive

Seagate 15K 3.5in drive
        Seagate 7200RPM 3.5in drive                         Seagate 15K 3.5in drive
        Seagate Savvio 10K 2.5in drive         Seagate Cheetah 15K 2.5in drive