PCI-E, SAS, FC,
RAID Controllers, Direct-Attach,
SAN, Dell MD3200, CLARiiON AX4, CX4, VNX, V-Max, HP P2000, EVA, P9000/VSP, Hitachi AMS
SSD products: SATA/SAS SSDs, PCI-E SSDs, Fusion iO, other SSD
Storage Performance 2013
Storage has changed dramatically over the last three years driven by SSD developments. Most of the key components necessary for a powerful storage system are available and the cost is highly favorable for direct placement of data files. Some additional infrastructure elements could greatly enhance the flexibility of storage systems with SSDs. There is still some discussion on whether SSD should interface directly to PCI-E or continue using the SAS/SATA interfaces originally designed for hard disks. New products coming this year include Express Bay, an ecosystem of connectors allowing both PCI-E and SAS/SATA to co-exist until a clear direction is established. Also expected in the coming year are PCI-E SSDs based on the NVM Express interface.
The Intel Xeon E5 processors, codename Sandy Bridge-EP, have 40 PCI-E gen 3 on each processor socket. Even though PCI-E gen 3 is 8GT/s, a change in the encoding means that the usable bandwidth is double that of PCI-E gen2 at 5GT/s. The net realizable bandwidth of a PCI-E gen 3 x8 slot is 6.4GB/s versus 3.2GB/s for gen 2.
The unfortunate aspect is that the major server vendors all implement a mix of x16 and x4 slots,
while the HBA vendors seem to be concentrating on products for PCI-E x8.
Only Supermicro has a system with
8 10 PCI-E gen 3 x8 slots.
Could a vendor put 2 HBA/RAID Controllers designed for x8 onto a single card for a x16 slot?
Perhaps the new Express Bay form factor will have some means to use x16 slots?
Another disappointment is that the 4-socket Xeon E5-46xx systems only connect half of the PCI-E lanes. This might be because the base system configuration is 2-socket populated. If a full set of slots are provided, there would no connection to half of the slots unless all four sockets are populated. But this is also an issue on the 2-socket systems if only 1 socket is populated.
For the most part, I will discuss direct-attach storage configuration, as we can pick and choose among the latest components available. Technically, direct-attach with SAS can support a 2-node cluster, but few system vendor promote this configuration. Dell sells the MD3200 as direct-attach storage supporting 4-node clusters, but technically it is a SAN that just happens to use all SAS interfaces.
The objective in the baseline storage configuration below is to achieve very high IO bandwidth even in the low capacity configuration. Of course it will also have very high IOPS capability because the main elements are SSD. My recommended storage system has both SSD and HDD in each IO channel.
This intent is to place the main databases on SSD and use the HDD for backups and for restore verification. For an Inmon style data warehouse, the HDD might also be used for older data. The reason for having both SSD and HDD on each IO channel is to take advantage of simultaneous bi-directional IO. On a database backup, the IO systems reads from SSD, and simultaneously writes to HDD.
4 RAID Controllers, 4GB/s per controller, 16GB/s total IO bandwidth
4 Disk enclosures (yes, I am showing 8 enclosures in the diagram above)
4 x 16 = 64 SSD
4 x 8 = 32 (10K) HDD
The standard 2U enclosure has 24 x 15mm bays. My preference is for a 16 SSD and 8 HDD mix. With the small capacity 100GB SSD, there will be 1.6TB per enclosure and 6.4TB over 4 enclosures before RAID. In 7+1 RAID 5 groups, there will be 5.6TB net capacity, and 4.8TB in 3+1 RG across 4 units. The goal is 4GB/s per controller because the SAS infrastructure is still 6Gbps, supporting 2.2GB/s on each x4 port. With 16 SSDs per controller, each SSD needs to support 250MB/s. Most of the recent enterprise class SSDs are rated for well over 300MB/s per unit, allowing for a large degree of excess capability. Another option is to configure 12 SSDs per controller, expecting each SSD to support 333MB/s.
The cost structure for the above is as follows:
RAID controller $1K
2U Enclosure $3K
Intel SSD DC 3700 100GB SSD $235
Seagate Savvio 600GB 10K HDD $400
This works out to $11K per unit or $44K for the set of 4. The set of 16 x 100GB contributes $3760. For the 800GB SSD, the R5 7+1 capacity is 44.8TB at cost $148K. At maximum expansion of 4 enclosures per RAID controller, capacity is 170TB at cost is $590K. Of course at this level, I would elect a system with more PCI-E slots for greater IO bandwidth. Another option is a RAID controller with 4 x4 SAS ports. Unfortunately none of these have 4 external ports.
While the Intel SSD DC 3700 drew reviews for pioneering consistency of IO performance
instead over peak performance, it is only available in SATA interface.
Crucial has the P410m with similar specifications but with SAS interface.
This is listed on the Micron website as in production, but probably only to OEM customers.
There are other enterprise grade high endureance MLC SSDs with SAS interface as well.
Note: I do not recommend anything less than 10K HDD even to support database backups. The 10K HDDs are not particularly expensive as direct-attach components ($400 for the 600GB model). Only SAN vendors sell $400 HDDs for $2K or more.
SAS 12Gbps Enclosures
Disk enclosures supporting SAS at 12Gbps might become available as early as this year. Each of the 12Gbps SAS x4 uplink and down link ports would then support 4GB/s. The RAID controller (HBA) can support 6GB/s+ in a PCI-E gen 3 x8. The system with 4 RAID controllers could then deliver 24GB/s instead of 16GB/s. At 16 SSDs per controller, this would require 400MB/s per SSD. While SSDs are rated as high as 550MB/s, getting 400MB/s per SSD in an array is more complicated.
We should not need 12Gbps SAS SSDs or HDDs. The internal wires in the enclosure connect through a SAS expander. The IO from each device bay can signal at 6Gbps, then uplink to the HBA at 12Gbps, assuming that packets are buffered on the expander.
The standard 2U disk enclosure today supports 24 or 25 2.5in (SFF) bays, with 15mm thickness. This is the dimension of an enterprise class 10K or 15K HDD with up to 3 platters. The older full size notebook used a 9mm HDD supporting 2 platters. Thin notebooks used a 7mm HDD restricted to a single platter. There is no particular reason for an SSD to be more than 7mm.
So It would be better if the new 12Gbps SAS enclosures support more than 24 bays. My preference is for 16 x 15mm and 16 x 7mm bays. Personally, I would like to discard the SSD case to further reduce thickness.
Another option is to employ the NGFF, perhaps a 1U stick, at 5mm or less. There could be 2 rows of 24 for SSD, and the 16 x 15mm bays.
I believe that the all-SSD idea is misguided. SSDs are wonderful, but HDD still have an important role. One example is having the HDDs available for backup and restores. I want local HDD for backups because so very few know how to configure for multiple parallel 10GbE network transmission, not to mention IO bandwidth on the backup system.
A database backup that has not been actually verified to restore (with recovery) is a potentially useless backup. Having HDDs for backup and restore verification preserves the write endurance on the SSD. This allows the use of high-endurance MLC instead of SLC. In some cases, it might even be possible to use consumer grade MLC if everything is architected to minimize wear on the SSD.
Some of the discussion on PCI-E versus SATA/SAS interface for the NAND/Flash controller incorrectly focuses on the bandwidth of a single 6Gbps lane versus 4 or 8 lanes on PCI-E. It is correct that PCI-E was designed to distribute traffic over multiple lanes and that hard drives were never expected to exceed to bandwidth of a single lane at the contemporary SATA/SAS signaling rate. The transmission delay across an extra silicon trip, from NAND controller with SATA interface to a SATA to PCI-E bridge chip, on the order of 50ns, is inconsequential compare this with the 25-50µsec access time of NAND.
The more relevant matter is matching NAND bandwidth to the upstream bandwidth. All (or almost all?) the SATA interface flash controllers have 8 NAND channels. Back when SATA was 3Gbps and NAND was 40-50MB/s, 8 channel s to the 280MB/s net bandwidth of SATA 3G was a good match. About the time SATA moved to 6Gbps, NAND at 100-133MB/s became available so 8 channels was still a good choice.
NAND is now at 200 and 333MB/s, while SATA is still 6Gpbs. The nature of silicon product cost structure is such that there is only minor cost reduction in building a 4 channel flash controller. The 8 channel controller only requires 256-pin package.
The PCI-E flash controllers have been designed with 32 NAND channels. The IDT 32-channel controller has 1517 pins, which is not excessively difficult or expensive for a high-end server product (the Intel Xeon E5 26xx processors have 2011-pins). The SandForce SF2500 8-channel controller has 256-pins (8-byte, the 16-byte is 400-pin). As noted earlier a PCI-E gen 3 x8 port supports 6.4GB/s. Over 32 channels, each channel needs to provide 200MB/s. The new 333MB/s NAND is probably a better fit to sustain the full PCI-E gen 3 x8 bandwidth after RAID (now RAIN because disks are replaced by NAND).
Based on 64Gbit die, and 8 die per package, a package has 64GB raw capacity. The 32-channel PCI-E would have 2TB raw capacity (net capacity with 0.78 for over-provisioning and 0.875 for RAIN would be 1400GB) versus 512GB on an 8-channel SATA/SAS SSD.
As is today, a PCI-E SSD can achieve maximum bandwidth at lower NAND capacity and in a more compact form factor than with SAS SSD. On the other hand, SAS infrastructure provides flexible expansion. Capacity can be increased without replacing existing devices. Some systems support hot swap PCI-E slots. However arrangement of the connector in the systems makes this a complicated matter. The implications are that PCI-E slot SSDs are highly suitable for high density requirements with limited expansion needs. One database server example is tempdb on SSD.
The new generation of PCI-E SSDs may employ the NVMe interface standard. There is a standard driver for Windows and other operating systems, which will later be incorporated into to the OS distribution media allowing boot from an NVMe device, as with SATA devices today. This is mostly a client side feature.
For the server side, the NVMe driver is designed for massive bandwidth and IOPS. There can be up to 64K queues, 64K commands (outstanding IO requests?). The driver is designed for IO to be both super-efficient in cpu-cycles and scalable on NUMA systems with very many processor cores.
To promote the growth of SSD without betting on which interface, the Express Bay standard defines a connector that can support both PCI-E and SATA or SAS. Some Dell servers today support PCI-E to SSDs in the 2.5in HDD form factor (SFF), but I am not sure if this is Express Bay. This form factor will allow PCI-E devices to hot-swapped with the same ease as SAS devices today.
As mentioned earlier, the PCI-E slot arrangement in server systems does not facilitate hot-add, even if it is supported. Existing PCI-E SSDs also do not provide a mechanism for capacity expansion, aside from adding a card to an empty slot or replacing an existing card.
Of course, there are PCI-E switches, just like the SAS expanders. A 64 lane PCI-E switch could connect 5 x8 PCI-E devices over a x8 upstream link. Other possibilities is a x16 link supporting 12.8GB/s to host with 4 ports for SSD, or 8 x4 ports to SSD for finer grain expansion. It may also be possible to support multiple hosts, as in a cluster storage arrangement?
Below is a representation of a typical configuration sold to customers by the SAN vendor. I am not joking in that it is common to find 2 ports FC or FCOE on each host. The most astounding case was a SAN with 240 x 15K disks and 2 single port FC HBAs in the server. Even though the storage system service processors had 4 FC port each (and the FC switches had 48-ports), only 1 on each SP was connected. Obviously the storage engineer understood single component and path failure fault-tolerant design. It was just too bad he built a fault-tolerant garden hose system when a fire hose was needed.
As I understand it, what happened was the SAN engineer asked how much space is needed for the databases accounting for growth, and then created one volume for it. The Windows OS does support multi-path IO. Originally the storage vendor provided the MPIO driver, but now it is managed by Microsoft. Apparently it was not considered that even with MPIO, all IO for a single volume has a primary path. The secondary path is only used when the primary is not available.
High-bandwidth SAN Configuration
A proper SAN configuration for both OLTP and DW database servers is shown below. Traditionally, an transaction processing database generates 8KB random IO, and IO bandwidth was not a requirement. This was 15 years ago. Apparently the SAN vendors read documents from this period, hence their tunnel vision on IOPS, ignoring bandwidth. For the last 10 or more years, people have been running large queries on the OLTP system. Furthermore, it is desirable to be able to backup and restore the transaction system. This means bandwidth.
Note that the system below shows 8Gbps FC, not 10Gbps FCOE. A single 10Gbps FCOE may have more bandwidth than a single 8Gbps FC port. But no serious storage system will have less than 4 or even 8 ports. Apparently FCOE currently does scale over multiple ports, due to the overhead of handling Ethernet packets? An Intel IDF 2012 topic mentions that this will be solved in the next generation.
The above diagram shows 8 x 8Gbps FC ports between host and storage system. Each 8Gbps FC port can support 700MB/s for a system IO bandwidth target of 5.6GB/s. An OLTP system that handles for very high transaction volume may benefit from a dedicated HBA and FC ports for log traffic. This would allow the log HBA to be configured for low latency, and the data HBA to be configured with interrupt moderation and high throughput.
An alternate SAN configuration for SQL Server 2012 is shown below with local SSD for tempdb.
The write cache on a SAN must be mirrored for fault tolerance. There is very little detail on the bandwidth capability of the link between controllers (or SP) on SAN systems, beyond what can be deduced from the fact that the sustained write bandwidth is much lower than the read bandwidth. So keeping tempdb off the SAN should preserve IO write bandwidth for traffic that actually needs protection.
The number of volumes for data and temp should be some multiple of 8. It would be nice to have 1 volume for each FC path. However we do need to consider how SQL Server place extents over multiple files. This favors RAID groups of 4 disks.
File Layout 2013-10
In a HDD storage system, the objective for bandwidth is to simultaneously issue large block IO to all data disks across all IO channels. A 256K block could be sufficiently large to generate 100MB/s per disk (400 IOPS, not random). If this were issued at low queue depth (2?), then the storage system would not only generate high IO bandwidth and still be perceptually responsive to other requests for small block IO.
For small block random IO, it is only necessary to distribute IO over all hard disks with reasonable uniformity.
The file layout strategy has two objectives. One is to not overwhelm any single IO channel. In direct-attach, this is not a problem as the smallest pipe is x4 SAS for 2GB/s. In a SAN, even using 8Gbps FC, this is a concern as 8Gb FC can support only 700-760MB/s. Although 10Gb FCoE seems to have higher bandwidth, this may not scale with the number of channels as well as straight FC. The new Intel Xeon E5 (Sandy-Bridge EP) processors may be able to scale 10Gb FCoE with Data Direct IO (DDIO) - but this needs to be verified.
The second is to ensure IO goes to every disk in the RAID Group (or volume). By default, SQL Server allocates a single 64K extent from each file before round-robin allocating from the next file. This might be the reason that many SAN systems generate only 10MB/s per disk (150 IOPS at 64K), along with no read-ahead.
The -E startup flag instructs SQL to allocate up to 4 consecutive extents before proceeding to the next file. See James Rowland-Jones Focus on Fast Track : Understanding the –E Startup Parameter for more on this. In a 4-disk RAID group with stripe-size 64K, a 256K IO to the file would generate a 64K IO to each disk.
It would be necessary to rebuild indexes before this scheme takes effect. Somewhere it was mentioned that it is important to build indexes with max degree of parallelism limited to either 4 or 8. It might be in the Microsoft Fast Track Data Warehouse Reference Architecture. Start with version 4 for SQL Server 2012, and work backwards?
Edit 2013-10: In SQL Server versions 2008 R2 and 2012, I am seeing 512K IO issued in table scans for both heap and clustered indexes built without the -E startup parameter. Does this mean -E is nolonger necessary? I would like to seee Microsoft documentation on this.
This would have significant implications on RAID group size for HDDs. RAID 10 might favor 4 disks with 128K stripe or 8 disk with 64K stripe size? RAID 5 might favor 5 disks with 128K or 9 disks (or perhaps 10) with 64K stripe?
There are additional nuances to SQL Server table scan IO in versions 2008 R2 and 2012. At DOP 1, the IO size might be 192K or some combination with an average IO size in that range except when there is only 1 file. Otherwise IO size appears to be 512K. The IO strategy appears to be to issue IO at low queue depth whi may be additional implications of the 512K IO.
In the Fast Track Data Warehouse configurations for SQL Server 2008 R2 and 2012 by HP and some other vendors, it is mentioned that the setting in Resource Governor: REQUEST_MAX_MEMORY_GRANT_PERCENT can "improve I/O throughput" at high degree parallelism. (oh yeah, the applicable DMV is dm_resource_governor_workload_groups) Well simply issuing IO at sufficient queue depth so that all volumes are working will do this. There is no reason to be cryptic.
Trace Flag 1117 (-T1117) causes all files in a filegroup to grow together. See SQL Server 2008 Trace Flag -T 1117.
With SSD, the second may not be important as the SQL Server read-ahead strategy (1024 pages?) should generate IO to all units. On the hard disk, generate close-to-sequential IO was important. On SSD, it is sufficient beneficial just to generate large block IO, with 64K being large.
The old concept of distributing IO over both devices and channels still apply. The price of SSD is sufficiently low to warrant serious consideration. While there is more flexibility in SSD configuration, it is still necessary to validate performance characteristics with real queries to an actual database. SQLIO or other synthetic tests are not sufficient. If the SAN vendor advised in the configuration, then chances are IO bandwidth will be not be good.
If anyone thinks I am being unfair to or overly critical of SAN vendors, do the following test.
Find the biggest table in your database, excluding LOB fields.
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT COUNT(*) FROM TableA WITH (INDEX(0))
Then compute 8 (KB/page) * (physical reads + read-ahead reads)/(elapsed time in ms)
Is this closer to 700 MB/s or 4GB/s? What did your SAN vendor tell you?
I am also not fan of SSD caching or auto-tiering on critical databases, meaning the database that runs your business, that is managed by one or more full-time DBAs. In other applications, there may not be a way to segregate the placement of hot data differently from inactive data. In SQL Server, there are filegroups and partitioning. We have all the necessary means of isolating and placing hot data whereever we want it. SSD caching or auto-tiering will probably require SLC NAND. With active management using database controls, we should be able to use HET MLC or even MLC.
I stress the importance of analyzing the complete system and how it will be used instead of over-focusing on the components. There are criteria that might be of interest when there is only a single device or even single HBA. Today it is possible to over-configure the storage performance without unwarranted expense, and this is best accomplished by watching the big picture.
Adaptec reports that their Series 7 SAS RAID Controller (72405 - PCI-E gen 3 x8 on the upstream side and 6 x4 SAS 6Gpbs) using the PMC PM8015 controller can do 500K IOPS and 6.6GB/s.
related posts on storage:
IO Queue Depth Control
IO Queue Depth Strategy
If you are using SQL Server 2012 clustering on a SAN, I do suggest placing tempdb on local SSD, making use of the new 2012 feature that does not require tempdb to be on shared storage. Keep in mind on the SAN, writes must be mirrored between two storage processors for fault recovery, and this is not a cheap thing to do. We should plan redo whatever was using tempdb at the time.