Home, Optimizer, Benchmarks, Server Systems, Systems Architecture, Processors, Storage,
Storage Overview, System View of Storage, SQL Server View of Storage, File Layout,

File Layout

Having set most of the base, we can now discuss the strategies behind configuring the storage system. The most fundamental is probably that the traffic for each LUN must travel through a specific path. So on FC, do not plan on having more than 320-360MB/sec per LUN. EMC documents say that traffic can go on both loops of the FC pair, meaning 720MB/sec. I may have not interpreted this correctly, so verify this. The traffic to LUNs must be properly distributed over the available paths and HBAs with adjustments for the PCI-E slot bandwidth.

In the past, I have seen approximately 11MB/sec per disk in running a table scan on a SAN storage system. This works out to 175 x 64KB IOs, meaning the storage is issuing 64K IOs serially instead of taking advantage of the sequential capabilities of the disk drive.

The EMC whitepaper h5548 "Deploying EMC CLARiiON CX4-960 for Data Warehouse/Decision Support System (DSS) Workloads" states:
number of key FLARE changes have been included in release 28, the array system software release in support of the CX4 family, to ensure that we can drive the underlying disk drives to the considerably higher level of data delivery rate using the "thin" RAID striping configurations.
Thin RAID is described as a RAID group with few drives. Later the EMC paper advocates 2+1R5 or 2+2R6, in contrast with Microsoft FTDW which advocates 1+1 RAID 1.

The Microsoft FTDW documents also statement 100MB/sec per disk is possible. It would then take only 8 disks in one DAE to saturate a FC loop pair. My assertion is that even DW is not always true sequential, so a more reasonable target is 15 disks in one DAE, 2 DAE in one 4Gb/s FC loop pair. If the 720MB/sec per loop pair can be achieved, this works out to 24MB/sec per disk.

If there ever is 8Gb/s FC on the back-end, it would be desirable to continue with 2 DAE per loop pair averaging 48MB/sec per disk.

A direct attach storage system has no problems in aggregating disk drive sequential performance. In the past, I have place 15 disks in a single enclosure on a 3Gb/s x4 SAS port for 800MB/sec. Today, perhaps 12 disks on a 6Gb/s x4 SAS port works well with the 12 disk enclosures, or splitting the 24 disk SFF enclosure into two.

Technically, the correct strategy on SAN is to create three or more LUNs on each of the main (non-log) RAID groups. The first LUN is for data, or the filegroup with the big table (usually order line-items) with the other tables and indexes in the filegroup on the second LUN. The tempdb data file is on the next LUN. The last LUN, with the short-stroke strategy, should be the largest capacity and is intended for inactive files. This could be database backups, flat files or even data files for archive tables.

If we were to be extremely aggressive in performance optimization, the data and temp LUNs would be raw partitions. Most people do not even know about this capability in the Windows operating system and SQL Server, and will probably be afraid to venture into this territory.

The following example shows a storage system with 9 RAID groups, 4 FC loop pairs, and perhaps 8 DAEs total, 2 per loop pair. With the data & temp LUNs evenly distributed across SP and loops, how should the log RAID group LUN be configured? If the SPs and loops are not heavily loaded, the log could be set to either SP. It could be in its own loop, or even be distributed among unused disks in the other loops.

0A0FG1 data 1FG2 datatemp 1other
1B0FG1 data 2FG2 datatemp 2other
2A1FG1 data 3FG2 datatemp 3other
3B1FG1 data 4FG2 datatemp 4other
4A2FG1 data 5FG2 datatemp 5other
5B2FG1 data 6FG2 datatemp 6other
6A3FG1 data 7FG2 datatemp 7other
7B3FG1 data 8FG2 datatemp 8other

If one were to follow the thin RAID group strategy, then there would be multiple RAID groups in each loop.


One storage system feature I do not intend to use is dynamic volume growth or relocation. SQL Server already has features for this. If more space is needed, add another DAE, RAID group, and set of LUNs. Use the ADD FILE command. Be sure to rebuild indexes during off hours to redistribute data across all files in the filegroup. When moving data from older disks to new disks, we could use DBCC SHRINKFILE with the EMPTYFILE option, but I have found that it can be faster to rebuild the indexes (and clustered indexes) to another filegroup first, and then shrink the file.

This can also be a fundamental question of whether to use the SQL Server or storage system features to perform certain functions. A transactional database engines, be it SQL Server, Oracle, DB2 etc have carefully designed procedures to main data integrity.

One of most serious problems I have seen is that in large IT organizations, the database and storage systems are managed by completely different groups. The storage group is intent on carrying out the latest dogmatic doctrine from storage vendors, such as storage as a service. Applications request service based on capacity only. The storage service (system) will automatically relocate LUNs to avoid hot spots, including moving LUNs between SSD, 15K and 7200RPM storage.

The SAN admin will strictly follow the configuration strategies put out by the storage vendor, in complete disregard of the base of accumulated knowledge on Database storage performance. This is guaranteed to be the worst possible configuration for database performance. When the DBA raises questions regarding poor storage performance, the SAN admin looks at the tool provided by the SAN vendor and pronounces that all parameters are well under the system limits. That disk latencies are in the hundreds of milli-seconds is not a reason to be concerned. The line-of-business and data warehouse systems really need to be on dedicated storage systems.

Per other discussion, each highly active log file must have its dedicated RAID 1 pair of physical disk drives, and possibly RAID 10 sets in extreme cases. Unknown is whether really critical log traffic should be directed over a dedicated HBA. Even more extreme is if an SP under heavy load from data and temp LUNs can provide super low latency for log LUNs. If not, then in such extreme circumstances, the log might require its own dedicated SP. The middle storage systems all have 2 controllers. Enterprise storage systems can have multiple controllers, but each is incredibly expensive. An entry level storage system has one or two controllers, but the intent is the complete storage system has multiple entry storage units, which is why entry level SAN systems makes sense even for big-time databases.

The other question on logs, suppose there multiple high activity log files on the same RAID group. Without a cache, the actual IO to disk is non-sequential, hopping between the log files. Can the storage controller effectively cache the log writes? Since the writes to the individual log files are sequential, it will fill the cache line and RAID stripe for that matter.

Comments on System Memory Configuration

Some people think I try to put formulas in areas that they think are too mysterious, like SQL Server execution plan cost structure. I put formulas were I think it is meaningful. The Microsoft FTDW puts out an arbitrary on system memory with no substantiating data. My thoughts:

Server systems today are frequently configured with 128GB of memory. A 4GB ECC DIMM costs about $150 depending on the source. So 32 x 4GB contributes on the order of $6,000 to the cost of a system. Given the cost of storage performance, filling the system with 4GB memory modules can have a relatively high value to cost ratio. In any case, an effort to determine the most correct memory configuration will likely exceed the cost of filling the DIMM sockets with the largest capacity DIMM without a price per GB premium. Today, this is the 4GB DIMM. Next year or in 2012, it might be the 8GB DIMM.


< Previous, Next >