
Storage Configuration

Storage performance is not inherently a complicated subject. The concepts are relatively simple. In fact, scaling storage performance is far easier than overcoming the difficulties encountered in scaling processor performance on NUMA systems. Storage performance is achieved by properly distributing IO over:

  1) multiple independent PCI-E ports (system memory and IO bandwidth is key)
  2) multiple RAID controllers or host bus adapters (HBAs)
  3) multiple storage IO channels (SAS or FC, complete path)
  and, most importantly,
  4) a large number of disk drives (15K or SSD?)
  5) with the short-stroke effect

with consideration for random and sequential IO patterns, and in certain cases also separation of low-queue and high-queue patterns, though this is not always possible. It helps to know how to estimate the theoretical performance in IOPS and bandwidth for a given number of disks and IO channels, and then test to see how your configuration compares to the expected characteristics.
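As a rough illustration, that estimate can be scripted. The sketch below is mine, not from any vendor tool; the per-disk IOPS and per-channel bandwidth figures are assumptions drawn from the performance targets later in this section, and should be replaced with measured values for your own drives and adapters.

# Back-of-envelope estimate of theoretical storage performance.
def estimate_storage(disks, io_channels,
                     iops_per_disk=185,      # 15K 3.5in disk, low queue, full-stroke
                     mb_per_disk_seq=80,     # assumed large-block sequential per disk
                     mb_per_channel=330):    # assumed per 4Gb/s FC port
    """Return (random IOPS, sequential MB/s) for the configuration."""
    random_iops = disks * iops_per_disk
    # Sequential bandwidth is capped by either the disks or the IO channels.
    sequential_mb = min(disks * mb_per_disk_seq, io_channels * mb_per_channel)
    return random_iops, sequential_mb

# The reference configuration below: 120 disks over 8 IO channels.
print(estimate_storage(disks=120, io_channels=8))   # (22200, 2640)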

It is also necessary to have a basic idea of the capabilities and limitations of each component or bus in this chain. Storage performance cannot be achieved with magic/secret registry settings or other incantations.

A dozen 1TB 7200RPM drives supporting data, temp, and log files, however impressive the capacity may seem, will have poor performance by database standards no matter what secret settings are applied.

Nor is performance achieved with a grossly overpriced SAN storage system with relatively few big-capacity disk drives, configured in complete disregard of the principles of disk system performance.

Reference Configuration

Without getting deep into concepts, I will provide a simple example of what I consider a balanced storage system configuration. The objective for this reference configuration is the ability to sustain transaction processing throughput with no more than minor degradation during a moderately heavy reporting query. The configuration is also suitable for data warehouse workloads.

For a 4-way server, that is, a system with four processor sockets, for which the current generation Intel Xeon 7400 (now 7500) series and Opteron 8400 series processors have six (or eight) cores per socket, the reference storage configuration is 4-5 controllers and 120 15K disk drives, as detailed below. (Intel finally announced the Xeon 7500/6500 series in the middle of writing this, so I will have to make adjustments later.)

Processors         Intel Xeon X7560, Intel Xeon X7460, or Opteron 8439
Cores              4 x 8 = 32 (X7560) or 4 x 6 = 24
Memory             64 x 4GB = 256GB (X7500) or 32 x 4GB = 128GB
HBA/Controllers    4-5 dual-port 4/8 Gbit/s FC, or 4-5 6Gb/s SAS with 2 x4 ports
IO Channels        8-10
Disk Drives        120 x 15K
Disks per channel  12-15

This is only a reference configuration. Direct-attach storage, eminently suitable for data warehouses, should have an amortized cost per disk of $500-600, for a total of $60-70K. In a SAN, the cost might range from $1,500-3,000 per disk, for a total of $180-360K. A SAN vendor will probably attempt to substitute 600GB 15K disks for the low-capacity models. This will push the cost per disk to over $6K, usually resulting in a storage system with far too few disks.
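For rough budgeting, the cost arithmetic can be written out as below; the per-disk figures are simply the ranges quoted above, not pricing from any particular vendor.

# Rough storage budget from amortized cost per disk (ranges quoted above).
DISKS = 120
price_ranges = {
    "direct-attach": (500, 600),
    "SAN":           (1500, 3000),
}
for label, (low, high) in price_ranges.items():
    print(f"{label}: ${DISKS * low:,} - ${DISKS * high:,}")
# direct-attach: $60,000 - $72,000
# SAN: $180,000 - $360,000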

At this time, a SAN is required for clustering. In the past, Windows supported clustering on SCSI, with two hosts on the same SCSI bus, but this capability was removed as customers seemed anxious to buy very expensive SAN storage. The SAS protocol also supports two hosts connected to the same SAS network, so it should also be possible to enable clustering, but Microsoft does not currently support this.

A really high-end storage system could have over 1000 disk drives. This need not be a single storage system; it could be multiple systems. Of course, for exceptional random IO needs, a serious effort should be made to determine whether solid-state storage can be implemented to keep the spindle count manageable.

Sizing Guidance

If your storage vendor opens with a question about your capacity requirements, don't waste any more time. Just throw the rep out and proceed to the next vendor.

Performance Targets

For calculation purposes, I am going to assume 100 of the 120 disks are allocated for data and temp, and the remaining 20 for other purposes including logs. In actuality, if only 4 disks are required for logs, then 116 disks would be allocated to data and temp.

Random IOPS
  low queue, full-stroke      185 per 15K 3.5in disk, 205 per 15K 2.5in disk, 17-22K total
  high queue, full-stroke     300+ per disk
  low queue, short-stroke     250+ per disk, 25K total
  high queue, short-stroke    400+ per disk

Sequential
  4Gb/s FC                         330-360MB/s per port, 720MB/s per dual-port HBA,
                                   2.6-3.0GB/s+ total
  2x4 3Gb/s SAS RAID controller    0.8GB/s per x4 port, 1.6GB/s per adapter,
  in x8 PCI-E Gen 1 slot           6GB/s+ total or system limit*
  2x4 6Gb/s SAS RAID controller    1.6GB/s per x4 port, 2.8GB/s per adapter,
  in x8 PCI-E Gen 2 slot           10GB/s+ total or system limit*
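As a quick sanity check, the random IOPS totals follow directly from the per-disk targets and the 100 data/temp disks assumed above. A minimal sketch (the per-disk numbers are the targets from the table, not measurements):

# Random IOPS totals for the assumed 100 data/temp disks.
DATA_TEMP_DISKS = 100
targets_per_disk = {
    "low queue, full-stroke (3.5in)": 185,
    "low queue, full-stroke (2.5in)": 205,
    "low queue, short-stroke":        250,
    "high queue, short-stroke":       400,
}
for name, iops in targets_per_disk.items():
    print(f"{name}: {DATA_TEMP_DISKS * iops:,} IOPS total")
# The full-stroke cases come out to 18,500-20,500, consistent with the 17-22K range above.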

System Memory and IO Bandwidth

The previous generation of Intel systems (see Server Systems) built around the 5000P/X and 7300 chipsets may have been limited to 3GB/sec in realizable IO bandwidth, regardless of the apparent bandwidth of the IO slots. There is no clear source for the realizable IO bandwidth of a 4-way Opteron system. An authoritative source indicated that the 8-way Opteron platform could achieve 9GB/sec in IO bandwidth, with approximately 7GB/sec realized from a SQL Server query. This may have been the TPC-H Pricing Summary Report query, which is moderately CPU-intensive for a single-table scan, so the 9GB/sec value might be achievable in other SQL queries. It is reasonable to suppose that a 2009-10 generation 4-way Opteron should be able to achieve 4.5GB/sec or higher, but actual documentation is still desired.

The Intel Nehalem generation servers (Xeon 5500 and 5600 series, and the 6500 and 7500 series) should be able to sustain phenomenal IO bandwidth, but I have yet to get my hands on a system with truly massive IO brute-force capability.

For server systems built around the Intel Xeon 5500 and 5600 processors, I recommend the HP ProLiant DL/ML 370G6 and Dell PowerEdge T710, as these should have two 5520 IOHs for extra IO bandwidth. Other systems built with dual IOHs should be suitable as well. The systems built around the new Xeon 6500/7500 should also have massive IO bandwidth. In general, Opteron systems have decent system bandwidth; the 2-way systems are targeted more towards web/app servers, while the 4- and 8-way systems are targeted more towards database IO requirements.

The system memory bandwidth contribution is more complicated. Consider that a read from disk is also a write to memory, possibly followed by a read from memory. In a system with SDRAM or DDR-x memory, the cited memory bandwidth is frequently the read bandwidth. The write rate to SDR/DDR memory is one-half the read rate, so the IO bandwidth might be limited to one-third of the memory bandwidth, regardless of the bandwidth of the PCI busses. In the past, a system with 2 DDR memory channels (64 bits or 8 bytes wide) at 266MHz had a read bandwidth of 4,264MB/sec. The maximum disk IO bandwidth possible was around 1,350MB/sec, even though the system had two PCI-X 100/133MHz busses.

The more recent Intel chipsets, including the 5000 and 7300, use FB-DIMM memory, which is built on DDR2 but has a separate buffer device on the memory module. This allows simultaneous read and write traffic, at full and half speed respectively. The 5000P chipset has 4 memory channels. With DDR2-667, the memory bandwidth is 5.3GB/s per channel, or 21GB/sec system total for read and 10.5GB/s for write. There are no reports demonstrating 10GB/sec IO bandwidth, or even 7GB/s. The PCI-E bandwidth over 28 PCI-E lanes is 7GB/s unidirectional.
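The arithmetic behind these memory-bandwidth ceilings can be sketched as follows; the one-third rule of thumb and the example figures come from the two systems described above.

# Memory-bandwidth ceiling on disk IO: a disk read is a memory write, possibly
# followed by a memory read, and SDR/DDR write bandwidth is roughly half the
# read bandwidth, so sustainable IO is on the order of one-third of read bandwidth.
def io_ceiling_mb(read_bw_mb):
    return read_bw_mb / 3

# Older 2-channel DDR-266 system: 4,264MB/s read bandwidth.
print(io_ceiling_mb(4264))   # ~1,421MB/s, near the observed ~1,350MB/s

# 5000P with 4 channels of DDR2-667 FB-DIMM: ~21,000MB/s read.
# FB-DIMM writes concurrently at half speed, so this ceiling is less severe,
# but the 28 PCI-E gen 1 lanes (~7GB/s nominal) cap IO bandwidth anyway.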

PCI-E Nominal and Realizable Bandwidth (bi-directional)

The table below shows PCI-E nominal and realizable bandwidths in GB/sec. PCI-E gen 1 (or just PCI-E) signals at 2.5Gbit/s. After 8b/10b encoding overhead, the nominal bandwidth is 250MB/sec per lane per direction. Keep in mind that PCI-E has simultaneous bi-directional capability, so a PCI-E x4 slot has a nominal bandwidth of 1GB/sec in each direction. Actual test transfers show that the maximum realizable bandwidth for a PCI-E x4 slot is approximately 800MB/sec. PCI-E gen 2 signals at 5.0Gbit/s, or 500MB/sec per lane per direction, double the gen 1 bandwidth for a given bus width.

Slot width   PCI-E Gen 1                    PCI-E Gen 2
x4           1.0 nominal, 0.8 realizable    2.0 nominal, 1.6 realizable
x8           2.0 nominal, 1.6 realizable    4.0 nominal, 3.2 realizable
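A minimal sketch of where these numbers come from; the 80% realizable factor is my own approximation of the test results quoted above, not a specification value.

# PCI-E bandwidth per direction from signaling rate and lane count.
# Gen 1: 2.5Gbit/s per lane; Gen 2: 5.0Gbit/s per lane; 8b/10b encoding in both.
def pcie_bandwidth_mb(lanes, gen):
    gbit_per_lane = 2.5 if gen == 1 else 5.0
    nominal = lanes * gbit_per_lane * 1000 * (8 / 10) / 8   # 8b/10b, then bits -> bytes
    realizable = nominal * 0.8                              # assumed ~80% efficiency
    return nominal, realizable

print(pcie_bandwidth_mb(4, 1))   # (1000.0, 800.0)  -> 1.0 / 0.8 GB/s
print(pcie_bandwidth_mb(8, 2))   # (4000.0, 3200.0) -> 4.0 / 3.2 GB/s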

Systems of the Intel Core 2 processor architecture generation (Xeon 5100-5400 and Xeon 7300-7400 series) are almost exclusively PCI-E gen 1, as are the accompanying chipsets: the 5000P and 7300. The Intel 5400 MCH did support PCI-E gen 2, but no tier-1 system vendor produced a server with this chipset. (Supermicro, popular with white-box builders, did have 5400-based motherboards.) Systems of the Intel Nehalem generation and later have PCI-E gen 2. If someone could advise on when AMD Opteron transitioned from PCI-E gen 1 to gen 2, I would appreciate it.

Serial Attached SCSI (SAS)

SAS started out with 3.0Gbit/sec signaling. Unlike SATA, SAS appears to be used only with a x4 wide connection. Most SAS adapters have 2 x4 ports. The HP Smart Array P800 has 4 x4 ports. The nominal bandwidth of a x4 3Gb/s SAS connection is 12Gbit/sec. The realizable bandwidth appears to be 1.0-1.1GB/sec.

Unfortunately, this is not matched to the bandwidth of a PCI-E gen 1 x4 slot. Realizing more than 800MB/sec from a single x4 SAS channel requires a x8 PCI-E gen 1 slot, which in turn results in either under-utilizing the PCI-E slot or not achieving balance between the 2 x4 SAS ports. Since most adapters have 2 x4 ports, the maximum realizable bandwidth in a x8 PCI-E gen 1 slot is 1.6GB/sec. Some of the early PCI-E SAS adapters have an internal PCI-X bus that limits realizable bandwidth over both x4 SAS ports to 1GB/sec.
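In other words, an adapter's effective sequential bandwidth is the minimum of the slot, the SAS ports, and any internal controller limit. A minimal sketch using the figures quoted in this section:

# Effective sequential bandwidth of a SAS adapter is capped by the weakest link.
def adapter_bandwidth_gb(slot_gb, sas_ports, gb_per_sas_port, controller_limit_gb=None):
    bw = min(slot_gb, sas_ports * gb_per_sas_port)
    if controller_limit_gb is not None:
        bw = min(bw, controller_limit_gb)
    return bw

# 2 x4 ports of 3Gb/s SAS (~1.0GB/s each) behind a x8 PCI-E gen 1 slot (~1.6GB/s realizable).
print(adapter_bandwidth_gb(slot_gb=1.6, sas_ports=2, gb_per_sas_port=1.0))   # 1.6

# 2 x4 ports of 6Gb/s SAS (~1.6GB/s each) behind a x8 gen 2 slot (~3.2GB/s),
# with the 2.8GB/s combined controller limit noted below.
print(adapter_bandwidth_gb(3.2, 2, 1.6, controller_limit_gb=2.8))            # 2.8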

Server systems usually have some combination of x16, x8, and x4 slots. No server adapter relevant to databases can use more bandwidth than a x8 slot provides, so each x16 slot could have been two x8 slots; using one for a single adapter wastes an otherwise perfectly good x8 slot's worth of lanes. The x4 slots are usually a good match for network adapters; a PCI-E gen 2 x4 slot is exactly matched to 2 x 10GbE ports.

Matching the available x16 and x8 slots to storage controllers is not always possible. Sometimes it may be necessary to place one or more SAS storage controllers in the x4 slots, in which case it is important to distribute the number of disks behind controllers in x8 and x4 slots in proportion to slot bandwidth, as sketched below.
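A minimal sketch of such a proportional split; the slot bandwidths are the realizable values from the PCI-E table above.

# Distribute disks across controllers in proportion to each slot's realizable bandwidth.
def distribute_disks(total_disks, slot_bandwidths_gb):
    total_bw = sum(slot_bandwidths_gb)
    return [round(total_disks * bw / total_bw) for bw in slot_bandwidths_gb]

# 120 disks over three controllers in x8 gen 1 slots (1.6GB/s) and one in a x4 slot (0.8GB/s).
print(distribute_disks(120, [1.6, 1.6, 1.6, 0.8]))
# [34, 34, 34, 17] -- rounding leaves one disk unassigned; add it to any x8 controller.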

In the last year, 6.0Gb/s SAS adapters and disk drives became available. The same bandwidth mismatch between 3Gb/s SAS and 2.5Gb/s PCI-E gen 1 also occurs with 6Gb/s SAS and 5Gb/s PCI-E gen 2. In addition, LSI Logic states that their 6Gb/s SAS controller has a maximum combined bandwidth of 2.8GB/sec over both x4 SAS ports.

SAS RAID Controllers

For direct-attach storage, the SAS adapter is frequently also a RAID controller. Most SAS RAID controllers are built around LSI Logic silicon, notably the LSI SAS 1078 for 3Gb/s SAS and the new SAS 2008 for 6Gb/s SAS and 5Gb/s PCI-E gen 2. Intel used to make a PCI-E to SAS RAID controller built around the 80333 IO processor, but mysteriously dropped out of the market soon after releasing the new 81348 IOP in 2007. There might be another vendor, as I am not sure who makes the controller for the HP P800.

LSI 1078

LSI also lists a 4 x4-port PCI-E gen 2 6Gb/s SAS RAID controller, the LSI SAS 2116. It is unclear whether this is a variation of the 2008 or simply two dies on one board.
