
System View of Storage

To understand storage, it is necessary to have a complete view, and this includes the server system itself. The diagram below shows a 2-way Intel Xeon 5500 system with 2 IOH devices. In this picture are the processors, memory, and the IO hubs. IO generates traffic on the PCI-E channels, the processor interconnects, and the memory channels. Effective storage requires a balance of memory, processor interconnect, and IO bandwidth.

(Diagram: 2-way Xeon 5500 system with two IOH devices connected by QPI)

Below is a representation of system-wide storage concepts. IO bandwidth is distributed across multiple channels for simultaneous access, not just for fail-over. IO is also distributed across very many disk drives.

(Diagram: IO distributed over multiple SAS channels and many disk drives)

System Memory and IO Bandwidth

The previous generation of Intel systems (see Server Systems) built around the 5000P/X and 7300 chipsets may have been limited to 3GB/sec in realizable IO bandwidth, regardless of the apparent bandwidth of the IO slots. There is no clear source for the realizable IO bandwidth of a 4-way Opteron system. An authoritative source indicated that the 8-way Opteron platform could achieve 9GB/sec in IO bandwidth, with approximately 7GB/sec realized from a SQL Server query. That query may have been the TPC-H Pricing Summary Report, which is moderately CPU-intensive for a single table scan, so the full 9GB/sec might be achievable in less CPU-intensive SQL queries. It is reasonable to suppose that a 2009-10 generation 4-way Opteron should be able to achieve 4.5GB/sec or higher, but actual documentation is still desired.

The Intel Nehalem generation servers (Xeon 5500 and 5600 series, and the Xeon 6500 and 7500 series) should be able to sustain phenomenal IO bandwidth, but I have yet to get my hands on a system with truly massive IO brute-force capability.

For server systems built around the Intel Xeon 5500 and 5600 processors, I recommend the HP ProLiant DL/ML 370G6 and Dell PowerEdge T710, as these should have two 5520 IOHs for extra IO bandwidth. Other systems built with dual IOHs should be suitable as well. The systems built around the new Xeon 6500/7500 should also have massive IO bandwidth. In general, Opteron systems have decent system bandwidth. The 2-way systems are targeted more toward web/app servers, while the 4-way and 8-way systems are targeted more toward database IO requirements.

The system memory bandwidth contribution is more complicated. Consider that a read from disk is also a write to memory, possibly followed by a read from memory. In a system with SDRAM or DDR-x memory, the cited memory bandwidth is usually the read bandwidth; the write rate to SDR/DDR memory is one-half the read rate. So the IO bandwidth might be limited to one-third of the memory bandwidth, regardless of the bandwidth of the PCI busses. In the past, a system with 2 DDR memory channels (64-bit or 8 bytes wide) at 266MHz had a read bandwidth of 4,264MB/sec. The maximum disk IO bandwidth possible was around 1,350MB/sec, even though the system had two PCI-X 100/133MHz busses.
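As a rough sketch of the arithmetic above (using the channel count and transfer rate quoted in this section, not new measurements):

    # Back-of-envelope memory vs. disk IO bandwidth (sketch only)
    channels = 2                 # DDR memory channels
    bytes_per_transfer = 8       # 64-bit wide channel
    transfers_per_sec = 266e6    # DDR-266

    read_bw = channels * bytes_per_transfer * transfers_per_sec / 1e6  # ~4,256 MB/sec
    write_bw = read_bw / 2       # SDR/DDR write rate is half the read rate

    # Each MB of disk IO is written to memory (at half speed, costing the
    # equivalent of 2MB of read bandwidth) and then typically read back once,
    # so sustainable disk IO is roughly read_bw / 3.
    io_ceiling = read_bw / 3     # ~1,400 MB/sec, in line with the ~1,350MB/sec observed

    print(read_bw, write_bw, io_ceiling)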

The more recent Intel chipsets, including the 5000 and 7300, use FB-DIMM memory, which is based on DDR2 but adds a separate buffer device on the memory module. This allows simultaneous read and write traffic, reads at full speed and writes at half speed. The 5000P chipset has 4 memory channels. With DDR2-667, the memory bandwidth is 5.3GB/s per channel, or 21GB/sec system total for read and 10.5GB/s for write. There are, however, no reports demonstrating 10GB/sec IO bandwidth, or even 7GB/s. The PCI-E bandwidth over 28 PCI-E lanes is 7GB/s unidirectional.
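The same arithmetic for the 5000P figures above, assuming PCI-E 1.x at 250MB/sec per lane per direction:

    # FB-DIMM (5000P) and PCI-E bandwidth arithmetic (sketch only)
    channels = 4
    ddr2_667_per_channel = 5.3                 # GB/sec per channel, read
    read_bw = channels * ddr2_667_per_channel  # ~21 GB/sec read
    write_bw = read_bw / 2                     # ~10.5 GB/sec write (half speed)

    pcie_lanes = 28
    pcie_gen1_per_lane = 0.25                  # GB/sec per lane, one direction
    pcie_bw = pcie_lanes * pcie_gen1_per_lane  # 7 GB/sec unidirectional

    print(read_bw, write_bw, pcie_bw)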

System and Controller IO (2009)

The major server systems today, based on the Intel 5100, 7300 or AMD/nVidia chipsets, all have very powerful IO capability with 28 or more PCI-E lanes, allowing for 7 or more x4 PCI-E slots. Intel reported that the 5100 chipset can sustain 3GB/s disk IO; the other two chipsets should be able to equal or exceed this. Many systems configure fewer slots with a mix of x4 and x8 widths. The issue is that the first-generation PCI-E RAID controllers could not fully utilize the x8 PCI-E bandwidth (1.5-1.6GB/sec). It is possible that the second-generation PCI-E controllers can, but there is very little information on this. A single PCI-E RAID controller is capable of sustaining 800MB/sec in an x4 PCI-E slot.
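A quick sanity check on the slot and controller figures above (the nominal per-lane rate is the standard PCI-E 1.x figure; the realized rates are the ones quoted in this section):

    # PCI-E slot vs. RAID controller throughput (sketch only)
    per_lane = 0.25                   # GB/sec per lane, PCI-E 1.x, one direction
    x4_nominal = 4 * per_lane         # 1.0 GB/sec
    x8_nominal = 8 * per_lane         # 2.0 GB/sec

    x4_realized = 0.8                 # GB/sec, single controller in an x4 slot
    x8_realized = 1.6                 # GB/sec, first-generation controller ceiling

    # Controllers needed to reach the ~3GB/sec the 5100 chipset can sustain
    controllers_needed = 3.0 / x4_realized    # about 4 controllers in x4 slots

    print(x4_nominal, x8_nominal, controllers_needed)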

To realize the full system IO capability, it is necessary to spread IO across as many controllers or HBAs as the number of independent PCI-E slots allows, and across as many disk drives as warranted. Technically, 8 drives could saturate a single PCI-E controller in pure sequential IO; in practice, few SQL operations generate pure sequential IO. Common external storage enclosures hold up to 14-16 disks, and the HP disk enclosures hold 10, 12, and 25 disks. Configuring one or two fully populated enclosures, with a total of 12-30 disks for each PCI-E controller, is a perfectly sound strategy.
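The "8 drives per controller" point above works out as follows, assuming roughly 100MB/sec of large-block sequential throughput per 15K drive (an assumed figure, not stated in this section):

    # Why a handful of drives can saturate one controller in pure sequential IO
    per_drive_seq = 100        # MB/sec, assumed large-block sequential per 15K drive
    controller_limit = 800     # MB/sec, the x4 PCI-E controller figure above

    drives_to_saturate = controller_limit / per_drive_seq   # ~8 drives
    print(drives_to_saturate)

    # In practice SQL Server rarely drives pure sequential IO, which is why
    # 12-30 disks per controller (one or two enclosures) remains a sound layout.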

Let us examine the disk configurations employed in the TPC benchmarks. In a recent TPC-C publication, an HP ProLiant ML370G5 with 2 Xeon X5460 quad-core processors is configured with 7 P800 RAID controllers, 600 36GB 15K disk drives for data, and 28 72GB 15K drives for log. There are 100 disk drives connected to each of 6 controllers for the data files. The total disk space on the 600 data drives, without RAID overhead, is about 20TB. The size of the data files is 1.8TB, so the active portion of disk space used is about 10%. Many TPC-C publications are cost-performance optimized, meaning it is reasonable to assume each of the 600 data disks is loaded to approximately 200 IOPS, for 120K IOPS total.

The TPC-C benchmark generates exceptionally high random IO activity, infrequently seen in actual transaction processing databases. To support this IO rate, a number of special steps are involved: a custom driver to bypass the Windows mini-port architecture, raw partitions instead of the NTFS file system, and disk performance counters disabled in both the operating system and SQL Server. In typical systems running below 10K IOPS, these special precautions have no impact. Raw partitions might be useful for disk activity on the order of 20K IOPS or higher; disabling performance counters is inadvisable in almost all cases.
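The arithmetic behind the TPC-C configuration figures above:

    # TPC-C disk configuration arithmetic (sketch of the figures quoted above)
    data_disks = 600
    disk_size_gb = 36
    raw_space_tb = data_disks * disk_size_gb / 1000.0    # ~21.6TB, quoted as ~20TB

    data_size_tb = 1.8
    active_fraction = data_size_tb / raw_space_tb        # ~0.08, quoted as about 10%

    iops_per_disk = 200          # cost/performance-optimized loading per disk
    total_iops = data_disks * iops_per_disk              # 120,000 IOPS

    print(raw_space_tb, active_fraction, total_iops)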

The TPC-H disk configurations are probably more meaningful to actual heavy-duty critical database servers than the TPC-C configurations. The large DSS-type queries in the TPC-H benchmark are effectively heavy transient IO surges. There is some variation in TPC-H disk configurations depending on the size of the database and the amount of system memory configured. The HP report at the 1TB scale factor (1TB data, plus another 300GB of indexes) on a ProLiant DL585G2 with 32GB memory is a suitable representation of a heavy-duty database server. The disk configuration is 8 RAID controllers (7 PCI-E and 1 PCI-X) connected to a total of 200 disks for data, 25 disks per controller. One controller also connects 2 disks for the OS and 4 disks for the logs. The TPC-H workload does not generate significant log load; a heavy-duty transaction server should have the logs on a dedicated controller.
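A similar sketch for the TPC-H configuration, reusing the ~800MB/sec per-controller figure from earlier and treating the one PCI-X controller the same as the PCI-E controllers for simplicity (the aggregate is an estimate, not a published result):

    # TPC-H data layout arithmetic (estimate only)
    data_disks = 200
    controllers = 8
    disks_per_controller = data_disks / controllers      # 25 disks per controller

    per_controller_bw = 0.8      # GB/sec, the x4 PCI-E controller figure above
    est_scan_bw = controllers * per_controller_bw        # ~6.4 GB/sec potential scan rate

    print(disks_per_controller, est_scan_bw)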