
This builds on the previous Storage Performance 2013 but does not cover the full scope, and hence does not replace the 2013 article.

Storage Update 2014 2014-06

This year we have seen, and will continue to see, more next generation SSDs, many built to the NVMe standard, some on PCI-E gen 3, and perhaps more infrastructure for PCI-E SSDs in the 2.5in form factor (SFF-8639 connector). NVM Express has been in the works for some time: it was announced in 2007, and the first specification followed in 2011. To some degree, the pace of progress in SSDs has slowed until the transition to NVMe is made, because NVMe is that important to realizing the full capability of NAND flash storage. NVMe is important for several reasons, impacting both client and server oriented systems, and includes a thorough overhaul of the driver architecture. Most of the initial NVMe products announced interfaced to PCI-E gen 2. More recently, Intel announced the DC P3700 and P3600 (and P3500?), which are PCI-E gen 3 x4 and NVMe; see the Intel DC P3700 review on Anandtech.

Xeon E5 and later processors

The Intel Xeon E5 2600 series processor (Sandy Bridge) and the E5 (2600 series) and E7 v2 processors (based on Ivy Bridge) have 40 PCI-E gen 3 lanes on each processor socket. Even though PCI-E gen 3 signals at only 8GT/s, the change in encoding from 8b/10b to 128b/130b means that the usable bandwidth is nearly double that of PCI-E gen 2 at 5GT/s.

Sandy Bridge EP 2-socket

Be aware that the PCI-E lanes are now on the processor itself. Hence the number of lanes possible depends on the number of processor sockets populated. Some 4 socket systems configure PCI-E slots from only 2 of the processor sockets, and there is no option for the full set of 4 x 40 lanes.

The nominal bandwidth of PCI-E gen 2 x8 is 8 x 5GT/s x (8/10 encoding) x (1 Byte/8 bits) = 4GB/s. The net realizable bandwidth of a PCI-E gen 2 x8 slot is 3.2GB/s. PCI-E gen 3 uses 128b/130b encoding, so a gen 3 x8 slot has very close to 8GB/s nominal bandwidth and perhaps 6.4GB/s realizable bandwidth.
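The arithmetic above can be written out as a minimal Python sketch. The function name and the 0.8 realizable-to-nominal ratio are assumptions for illustration; the ratio is simply 3.2/4.0 from the gen 2 figures, and the bandwidth actually realized depends on payload size and protocol overhead.

```python
def pcie_nominal_gb_per_s(lanes, gt_per_s, payload_bits, line_bits):
    """Nominal one-direction bandwidth in GB/s for a PCI-E link."""
    # lanes x transfer rate x encoding efficiency, then bits -> bytes
    return lanes * gt_per_s * (payload_bits / line_bits) / 8.0


# Assumed ratio of realizable to nominal bandwidth (3.2 / 4.0 from the gen 2 x8 figures)
REALIZABLE_FRACTION = 0.8

gen2_x8 = pcie_nominal_gb_per_s(8, 5.0, 8, 10)      # 8b/10b encoding
gen3_x8 = pcie_nominal_gb_per_s(8, 8.0, 128, 130)   # 128b/130b encoding

if __name__ == "__main__":
    for name, nominal in (("gen 2 x8", gen2_x8), ("gen 3 x8", gen3_x8)):
        print(f"{name}: {nominal:.2f} GB/s nominal, "
              f"{nominal * REALIZABLE_FRACTION:.2f} GB/s at the assumed 80%")
```

With 128b/130b encoding the gen 3 x8 nominal figure lands slightly under 8GB/s, which is why the realizable number is stated as "perhaps" 6.4GB/s.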

Processors PCI-E Organization

Below is the PCI-E organization of the Intel Xeon E5 2600 series (original and v2) processors, with 40 PCI-E gen 3 lanes and DMI 2, which is PCI-E gen 2 x4 plus additional protocols to support legacy South Bridge functions.


Each of the two x16 ports can be configured in almost any combination of x4 and x8 slots, in addition to the basic x16 slot.

SSD - NAND Controllers

Below are some SSD controller options for interfacing NAND. The HDD replacement option is to use a SATA or SAS interface on the upstream side.

Config_2013 SSD_SATA

For this, 8 NAND channels are common, with the option of placing 1 or 2 NAND packages on each channel. Each NAND package itself might contain up to 8 NAND chips (targets?). A package of 8 x 64Gbit chips is 64GB. An SSD of 8 packages x 64GB = 512GB. (I need to look into this. 128GB chips are available. Why is max SATA SSD capacity 960/1024GB?)

Another option is to interface NAND directly to PCI-E. Some common implementations had 16 or 32 NAND channels. When PCI-E was at generation 2, the realizable bandwidth of a x8 slot was 3.2GB/s. Presumably the decision to have 32 NAND channels was made when NAND was 100MB/s?


The high pin-count chip packages are expensive and also have cost implications on the PCB. More recently 333 and 400MB/s NAND are available, so perhaps the number of channels can be reduced?

System level connection to SSD

In a desktop or single socket server, the connection from processor to SATA SSD is the following:


The path runs from the processor to the PCH (formerly known as the south bridge) and then to the SSD. Internal to the SSD, the first connection is the NAND controller, which interfaces to the NAND. There might be an external DRAM chip attached to the controller, or this could be internal to the controller.

In a proper server system, the path would probably start from the processor, go through the PCI-E channel (x4 or x8) to an HBA/RAID controller, then out on an x4 SAS channel to an enclosure (internal or external) which has a SAS expander chip (can also be called a SAS switch), before connecting over x1 SAS to the SSD, as represented below.


The other option with PCI-E SSDs would have the following path.



There are web articles on SSDs making a big deal of the direct PCI-E to NAND architecture. Much of this is from enthusiasts with no engineering background and no appreciation of the wide range, in orders of magnitude, of the time scales involved. A 3.3GHz processor has a clock of 0.3 nano-seconds. The access latency of NAND based SSDs is typically on the order of 100 micro-seconds. (Intel is claiming they can bring this down to 20µ-sec, which would be really great.) Each interface or switch/expander chip might introduce 100-200ns delay. So the elimination of a single chip in the path is essentially irrelevant.
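To put the time scales side by side, here is a small sketch using only the figures quoted above. The function name and the choice of hop counts are assumptions for illustration.

```python
CLOCK_NS = 1.0 / 3.3          # one cycle of a 3.3GHz processor, about 0.3ns
NAND_ACCESS_NS = 100_000      # ~100 micro-seconds typical NAND SSD access
NAND_ACCESS_GOAL_NS = 20_000  # the 20 micro-second figure Intel is claiming
CHIP_HOP_NS = 200             # upper end of the 100-200ns per chip estimate


def hop_share(hops, hop_ns=CHIP_HOP_NS, nand_ns=NAND_ACCESS_NS):
    """Fraction of the total access time spent in interface/expander chips."""
    extra = hops * hop_ns
    return extra / (nand_ns + extra)


if __name__ == "__main__":
    for hops in (1, 2, 3):
        print(f"{hops} chip(s): {hop_share(hops):.2%} of a 100us access, "
              f"{hop_share(hops, nand_ns=NAND_ACCESS_GOAL_NS):.2%} of a 20us access")
```

Even at the pessimistic 200ns per chip, one extra chip is a fraction of a percent of a 100µs access, and about one percent if NAND access does come down to 20µs.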

Even the question of bandwidth is often confused. When SATA SSDs first became popular, a number of circumstances all happened to align. The right capacity range in terms of both functionality and cost fit well with an 8-channel NAND controller. It just so happened at the time that the bandwidth from 8 NAND channels matched up with first 3Gbps SATA and then later 6Gbps SATA. The realizable bandwidth of 6Gbps SATA is 600MB/s after 8b/10b encoding overhead. After command overhead, the net bandwidth is somewhat less. This would match well with 8 NAND channels at 100MB/s, as some degree of over-configuration on the downstream side is desirable to reduce the impact of other delays.

On top of this, the number of pins necessary to implement an 8 channel NAND controller is in the 250-300 range, which also happens to be very economical for a high volume, price sensitive consumer product. Now NAND is available at 333-400MB/s, but there is no inclination to reduce the number of NAND channels. Hence a great deal of bandwidth is left un-utilized when the upstream link is 6Gbps SATA.
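The mismatch between the NAND side and the SATA side can be sketched as follows. The function names are assumptions for illustration, and command overhead on the SATA link is ignored.

```python
def sata_payload_mb_per_s(line_rate_gbps):
    """Payload bandwidth after 8b/10b encoding: 10 line bits per payload byte."""
    return line_rate_gbps * 1000 / 10


def nand_aggregate_mb_per_s(channels, mb_per_s_per_channel):
    """Combined bandwidth of all NAND channels behind the controller."""
    return channels * mb_per_s_per_channel


if __name__ == "__main__":
    upstream = sata_payload_mb_per_s(6)
    for nand_speed in (100, 333, 400):
        downstream = nand_aggregate_mb_per_s(8, nand_speed)
        print(f"8 channels x {nand_speed}MB/s = {downstream}MB/s, "
              f"{downstream / upstream:.1f}x the {upstream:.0f}MB/s SATA link")
```

At 100MB/s per channel the downstream side is modestly over-configured, which is desirable. At 333-400MB/s it is several times the upstream link.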

Today, SAS has progressed to 12Gbps, but there seems to be no inclination to move SATA beyond 6Gbps? Both SAS and PCI-E are designed to operate with multiple parallel channels, so bandwidth has no relevance to the choice of SATA/SAS versus PCI-E in server systems. SATA is not designed for multi-channel? Hence the desktop market is interested in PCI-E SSD, but perhaps not at the cost structure of previous 16-channel NAND controllers that require 1000-pin packages?

The next generation SandForce SSD controller shown below can interface to either SATA or PCI-E up to x4. It can also be configured for 8 or 9 NAND channels, so that RAIN in 8+1 better aligns with existing capacity points while adding redundancy.



One of the elements of server storage is that both capacity and performance should be scalable. In addition, we would also like the ability to achieve high bandwidth even when the installed (purchased) capacity is not maximum. The problem is that most server system vendors offer a set configuration of PCI-E slots for which there is not a correct PCI-E SSD, specifically the x16 slots.

Dell servers now have the option of PCI-E in the 2.5in form factor via the SFF-8639 connector. I believe this takes 16 PCI-E lanes from one slot and routes them to 4 bays with x4 per bay. This facilitates hot-swap drives, but does not provide bandwidth-capacity flexibility.

My proposal is that PCI-E storage enclosures should have a PCI-E switch, similar to the SAS expanders in existing disk enclosures.




In the diagram above, there are 16 PCI-E lanes of connectivity upstream to the host (processors) and 16 PCI-E lanes downstream for expansion. For gen 3, each x4 has 4GB/s nominal bandwidth and 3.2GB/s realizable, for a total x16 realizable bandwidth of 12.8GB/s. This should be sufficient for most applications, but more serious requirements could have two sets of x16 in parallel. There are 8 x4 connection points (bays) with 4 populated. This would call for a 64 lane (or 16 x4 ports) PCI-E switch, which is commercially available (from IDT?).
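The lane budget for the proposed enclosure can be checked with a short sketch. The function names and defaults are assumptions that simply restate the numbers in the paragraph above.

```python
GEN3_X4_REALIZABLE_GB_PER_S = 3.2  # figure used in the text for one gen 3 x4 port


def switch_lanes(upstream=16, downstream=16, bays=8, lanes_per_bay=4):
    """Total lanes the enclosure's PCI-E switch must provide."""
    return upstream + downstream + bays * lanes_per_bay


def link_realizable_gb_per_s(lanes, per_x4=GEN3_X4_REALIZABLE_GB_PER_S):
    """Realizable bandwidth of a link, counted in x4 groups."""
    return (lanes // 4) * per_x4


if __name__ == "__main__":
    lanes = switch_lanes()
    print(f"switch lanes: {lanes} ({lanes // 4} x4 ports)")
    print(f"x16 uplink: {link_realizable_gb_per_s(16):.1f} GB/s realizable")
```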

In this arrangement, there is no pressing need for each PCI-E SSD to match the bandwidth of x4 gen 3. So potentially an 8 or 9 channel NAND controller would work?




IDT announced the first products, NVMe flash controllers with 8, 16 and 32 NAND channels (this business was later sold to PMC). Other announced products include the Samsung XS1715 and XP941, and the LSI SandForce 3700 controller. All of these are PCI-E gen 2.

Related links: Samsung XS1715, Samsung XP941, Anandtech Samsung XP941 Review, Intel DC P3700, IDT PCI-E switches, Samsung Data Center SSD SM843 SV843T


The standard 2U enclosure has 24 x 15mm bays. My preference is for a 16 SSD and 8 HDD mix. With the small capacity 100GB SSD, there will be 1.6TB per enclosure and 6.4TB over 4 enclosures before RAID. In 7+1 RAID 5 groups, there will be 5.6TB net capacity, and 4.8TB in 3+1 RG across 4 units. The goal is 4GB/s per controller because the SAS infrastructure is still 6Gbps, supporting 2.2GB/s on each x4 port. With 16 SSDs per controller, each SSD needs to support 250MB/s. Most of the recent enterprise class SSDs are rated for well over 300MB/s per unit, allowing for a large degree of excess capability. Another option is to configure 12 SSDs per controller, expecting each SSD to support 333MB/s.
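The capacity and per-SSD bandwidth figures above follow from simple arithmetic, sketched here. The function names are assumptions for illustration; capacities use decimal units (1TB = 1000GB) as in the text.

```python
def raw_tb(enclosures, ssds_per_enclosure, ssd_gb):
    """Raw SSD capacity in TB before RAID."""
    return enclosures * ssds_per_enclosure * ssd_gb / 1000


def raid5_net_tb(raw, data_disks, parity_disks=1):
    """Net capacity of RAID 5 groups of data_disks + parity_disks."""
    return raw * data_disks / (data_disks + parity_disks)


def per_ssd_mb_per_s(controller_gb_per_s, ssds):
    """Bandwidth each SSD must sustain to reach the controller target."""
    return controller_gb_per_s * 1000 / ssds


if __name__ == "__main__":
    raw = raw_tb(enclosures=4, ssds_per_enclosure=16, ssd_gb=100)
    print(f"raw: {raw:.1f}TB, 7+1: {raid5_net_tb(raw, 7):.1f}TB, "
          f"3+1: {raid5_net_tb(raw, 3):.1f}TB")
    for ssds in (16, 12):
        print(f"{ssds} SSDs at 4GB/s: {per_ssd_mb_per_s(4, ssds):.0f}MB/s each")
```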

The cost structure for the above is as follows:
  RAID controller $1K
  2U Enclosure $3K
  Intel SSD DC S3700 100GB SSD $235
  Seagate Savvio 600GB 10K HDD $400

This works out to $11K per unit or $44K for the set of 4. The set of 16 x 100GB contributes $3760. For the 800GB SSD, the R5 7+1 capacity is 44.8TB at a cost of $148K. At maximum expansion of 4 enclosures per RAID controller, capacity is 170TB at a cost of $590K. Of course at this level, I would elect a system with more PCI-E slots for greater IO bandwidth. Another option is a RAID controller with 4 x4 SAS ports. Unfortunately none of these have 4 external ports.
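The per-unit cost can be reproduced from the price list above. This is a minimal sketch: the dictionary keys and function name are assumptions, and the 16 SSD plus 8 HDD mix is the one described earlier. Only the 100GB configuration is computed, since the 800GB SSD price is not listed.

```python
# Prices in USD, taken from the cost structure list above
PRICES = {
    "raid_controller": 1000,
    "enclosure_2u": 3000,
    "ssd_100gb": 235,
    "hdd_600gb_10k": 400,
}


def unit_cost(ssds=16, hdds=8, prices=PRICES):
    """Cost of one RAID controller plus one 2U enclosure with the given drive mix."""
    return (prices["raid_controller"] + prices["enclosure_2u"]
            + ssds * prices["ssd_100gb"] + hdds * prices["hdd_600gb_10k"])


if __name__ == "__main__":
    print(f"SSD portion: ${16 * PRICES['ssd_100gb']}")
    print(f"per unit: ${unit_cost()}, set of 4: ${4 * unit_cost()}")
```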

While the Intel SSD DC S3700 drew reviews for pioneering consistency of IO performance over peak performance, it is only available with a SATA interface. Micron Crucial has the P410m with similar specifications but with a SAS interface. This is listed on the Micron website as in production, but probably only for OEM customers. There are other enterprise grade high endurance MLC SSDs with SAS interface as well.

Note: I do not recommend anything less than 10K HDD even to support database backups. The 10K HDDs are not particularly expensive as direct-attach components ($400 for the 600GB model). Only SAN vendors sell $400 HDDs for $2K or more.

SAS 12Gbps Enclosures

Disk enclosures supporting SAS at 12Gbps might become available as early as this year. Each of the 12Gbps SAS x4 uplink and downlink ports would then support 4GB/s. The RAID controller (HBA) can support 6GB/s+ in a PCI-E gen 3 x8 slot. The system with 4 RAID controllers could then deliver 24GB/s instead of 16GB/s. At 16 SSDs per controller, this would require roughly 400MB/s per SSD. While SSDs are rated as high as 550MB/s, getting 400MB/s per SSD in an array is more complicated.

We should not need 12Gbps SAS SSDs or HDDs. The internal wires in the enclosure connect through a SAS expander. The IO from each device bay can signal at 6Gbps, then uplink to the HBA at 12Gbps, assuming that packets are buffered on the expander.


The standard 2U disk enclosure today supports 24 or 25 2.5in (SFF) bays, with 15mm thickness. This is the dimension of an enterprise class 10K or 15K HDD with up to 3 platters. The older full size notebook used a 9mm HDD supporting 2 platters. Thin notebooks used a 7mm HDD restricted to a single platter. There is no particular reason for an SSD to be more than 7mm.

So it would be better if the new 12Gbps SAS enclosures supported more than 24 bays. My preference is for 16 x 15mm and 16 x 7mm bays. Personally, I would like to discard the SSD case to further reduce thickness.


Another option is to employ the NGFF, perhaps a 1U stick, at 5mm or less. There could be 2 rows of 24 for SSD, and the 16 x 15mm bays.

I believe that the all-SSD idea is misguided. SSDs are wonderful, but HDDs still have an important role. One example is having the HDDs available for backups and restores. I want local HDDs for backups because so very few know how to configure for multiple parallel 10GbE network transmission, not to mention IO bandwidth on the backup system.

A database backup that has not been actually verified to restore (with recovery) is a potentially useless backup. Having HDDs for backup and restore verification preserves the write endurance on the SSD. This allows the use of high-endurance MLC instead of SLC. In some cases, it might even be possible to use consumer grade MLC if everything is architected to minimize wear on the SSD.


Some of the discussion on PCI-E versus SATA/SAS interface for the NAND/Flash controller incorrectly focuses on the bandwidth of a single 6Gbps lane versus 4 or 8 lanes on PCI-E. It is correct that PCI-E was designed to distribute traffic over multiple lanes and that hard drives were never expected to exceed the bandwidth of a single lane at the contemporary SATA/SAS signaling rate. The transmission delay across an extra silicon trip, from a NAND controller with SATA interface to a SATA to PCI-E bridge chip, is on the order of 50ns, which is inconsequential compared with the 25-50µsec access time of NAND.

The more relevant matter is matching NAND bandwidth to the upstream bandwidth. All (or almost all?) the SATA interface flash controllers have 8 NAND channels. Back when SATA was 3Gbps and NAND was 40-50MB/s, 8 channels to the 280MB/s net bandwidth of SATA 3G was a good match. About the time SATA moved to 6Gbps, NAND at 100-133MB/s became available, so 8 channels was still a good choice.


NAND is now at 200 and 333MB/s, while SATA is still 6Gbps. The nature of silicon product cost structure is such that there is only a minor cost reduction in building a 4 channel flash controller. The 8 channel controller only requires a 256-pin package.

The PCI-E flash controllers have been designed with 32 NAND channels. The IDT 32-channel controller has 1517 pins, which is not excessively difficult or expensive for a high-end server product (the Intel Xeon E5 26xx processors have 2011-pins). The SandForce SF2500 8-channel controller has 256-pins (8-byte, the 16-byte is 400-pin). As noted earlier a PCI-E gen 3 x8 port supports 6.4GB/s. Over 32 channels, each channel needs to provide 200MB/s. The new 333MB/s NAND is probably a better fit to sustain the full PCI-E gen 3 x8 bandwidth after RAID (now RAIN because disks are replaced by NAND).


Based on 64Gbit die, and 8 die per package, a package has 64GB raw capacity. The 32-channel PCI-E would have 2TB raw capacity (net capacity with 0.78 for over-provisioning and 0.875 for RAIN would be 1400GB) versus 512GB on an 8-channel SATA/SAS SSD.
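The per-channel bandwidth and capacity figures for the 32-channel PCI-E controller versus the 8-channel SATA/SAS controller can be worked through as below. The function names are assumptions, and one package per channel is assumed, which is what the 2TB and 512GB figures imply.

```python
def per_channel_mb_per_s(upstream_gb_per_s, channels):
    """Bandwidth each NAND channel must supply to fill the upstream link."""
    return upstream_gb_per_s * 1000 / channels


def raw_capacity_gb(channels, packages_per_channel=1, die_per_package=8, die_gbit=64):
    """Raw NAND capacity in GB (8 bits per byte)."""
    return channels * packages_per_channel * die_per_package * die_gbit / 8


def net_capacity_gb(raw_gb, over_provisioning=0.78, rain=0.875):
    """Usable capacity after over-provisioning and RAIN (7/8) overhead."""
    return raw_gb * over_provisioning * rain


if __name__ == "__main__":
    print(f"gen 3 x8 over 32 channels: {per_channel_mb_per_s(6.4, 32):.0f}MB/s per channel")
    for channels in (32, 8):
        raw = raw_capacity_gb(channels)
        print(f"{channels} channels: {raw:.0f}GB raw, {net_capacity_gb(raw):.0f}GB net")
```

The 32-channel net figure comes out just under the 1400GB quoted above, which is a rounded value.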

As it stands today, a PCI-E SSD can achieve maximum bandwidth at lower NAND capacity and in a more compact form factor than a SAS SSD. On the other hand, SAS infrastructure provides flexible expansion. Capacity can be increased without replacing existing devices. Some systems support hot swap PCI-E slots. However, the arrangement of the connectors in the system makes this a complicated matter. The implication is that PCI-E slot SSDs are highly suitable for high density requirements with limited expansion needs. One database server example is tempdb on SSD.

NVM Express

The new generation of PCI-E SSDs may employ the NVMe interface standard. There is a standard driver for Windows and other operating systems, which will later be incorporated into the OS distribution media, allowing boot from an NVMe device, as with SATA devices today. This is mostly a client side feature.

For the server side, the NVMe driver is designed for massive bandwidth and IOPS. There can be up to 64K queues, each with up to 64K commands (outstanding IO requests?). The driver is designed for IO to be both super-efficient in CPU cycles and scalable on NUMA systems with very many processor cores.


Express Bay

To promote the growth of SSD without betting on one interface, the Express Bay standard defines a connector that can support both PCI-E and SATA or SAS. Some Dell servers today support PCI-E to SSDs in the 2.5in HDD form factor (SFF), but I am not sure if this is Express Bay. This form factor will allow PCI-E devices to be hot-swapped with the same ease as SAS devices today.


PCI-E Switches

As mentioned earlier, the PCI-E slot arrangement in server systems does not facilitate hot-add, even if it is supported. Existing PCI-E SSDs also do not provide a mechanism for capacity expansion, aside from adding a card to an empty slot or replacing an existing card.

Of course, there are PCI-E switches, just like the SAS expanders. A 64 lane PCI-E switch could connect 5 x8 PCI-E devices over a x8 upstream link. Another possibility is a x16 link supporting 12.8GB/s to the host with 4 ports for SSD, or 8 x4 ports to SSD for finer grain expansion. It may also be possible to support multiple hosts, as in a cluster storage arrangement?


SAN Configuration

Below is a representation of a typical configuration sold to customers by the SAN vendor. I am not joking: it is common to find 2 ports of FC or FCOE on each host. The most astounding case was a SAN with 240 x 15K disks and 2 single port FC HBAs in the server. Even though the storage system service processors had 4 FC ports each (and the FC switches had 48 ports), only 1 on each SP was connected. Obviously the storage engineer understood single component and path failure fault-tolerant design. It was just too bad he built a fault-tolerant garden hose system when a fire hose was needed.


As I understand it, what happened was the SAN engineer asked how much space is needed for the databases accounting for growth, and then created one volume for it. The Windows OS does support multi-path IO. Originally the storage vendor provided the MPIO driver, but now it is managed by Microsoft. Apparently it was not considered that even with MPIO, all IO for a single volume has a primary path. The secondary path is only used when the primary is not available.

High-bandwidth SAN Configuration

A proper SAN configuration for both OLTP and DW database servers is shown below. Traditionally, a transaction processing database generates 8KB random IO, and IO bandwidth was not a requirement. This was 15 years ago. Apparently the SAN vendors read documents from this period, hence their tunnel vision on IOPS, ignoring bandwidth. For the last 10 or more years, people have been running large queries on the OLTP system. Furthermore, it is desirable to be able to back up and restore the transaction system. This means bandwidth.

Note that the system below shows 8Gbps FC, not 10Gbps FCOE. A single 10Gbps FCOE port may have more bandwidth than a single 8Gbps FC port. But no serious storage system will have fewer than 4 or even 8 ports. Apparently FCOE currently does not scale over multiple ports, due to the overhead of handling Ethernet packets? An Intel IDF 2012 topic mentions that this will be solved in the next generation.


The above diagram shows 8 x 8Gbps FC ports between host and storage system. Each 8Gbps FC port can support 700MB/s, for a system IO bandwidth target of 5.6GB/s. An OLTP system that handles very high transaction volume may benefit from a dedicated HBA and FC ports for log traffic. This would allow the log HBA to be configured for low latency, and the data HBA to be configured with interrupt moderation and high throughput.

An alternate SAN configuration for SQL Server 2012 is shown below with local SSD for tempdb.


The write cache on a SAN must be mirrored for fault tolerance. There is very little detail on the bandwidth capability of the link between controllers (or SP) on SAN systems, beyond what can be deduced from the fact that the sustained write bandwidth is much lower than the read bandwidth. So keeping tempdb off the SAN should preserve IO write bandwidth for traffic that actually needs protection.

The number of volumes for data and temp should be some multiple of 8. It would be nice to have 1 volume for each FC path. However, we do need to consider how SQL Server places extents over multiple files. This favors RAID groups of 4 disks.

File Layout 2013-10

In an HDD storage system, the objective for bandwidth is to simultaneously issue large block IO to all data disks across all IO channels. A 256K block could be sufficiently large to generate 100MB/s per disk (400 IOPS, not random). If this were issued at low queue depth (2?), then the storage system would not only generate high IO bandwidth but also remain perceptibly responsive to other requests for small block IO.

For small block random IO, it is only necessary to distribute IO over all hard disks with reasonable uniformity.

The file layout strategy has two objectives. One is to not overwhelm any single IO channel. In direct-attach, this is not a problem as the smallest pipe is x4 SAS for 2GB/s. In a SAN, even using 8Gbps FC, this is a concern as 8Gb FC can support only 700-760MB/s. Although 10Gb FCoE seems to have higher bandwidth, this may not scale with the number of channels as well as straight FC. The new Intel Xeon E5 (Sandy-Bridge EP) processors may be able to scale 10Gb FCoE with Data Direct IO (DDIO) - but this needs to be verified.

The second is to ensure IO goes to every disk in the RAID Group (or volume). By default, SQL Server allocates a single 64K extent from each file before round-robin allocating from the next file. This might be the reason that many SAN systems generate only 10MB/s per disk (150 IOPS at 64K), along with no read-ahead.
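The two per-disk figures in this section (about 100MB/s at 256K and about 10MB/s at 64K) are just IOPS multiplied by block size. A minimal sketch, with an assumed function name and binary units (1MB = 1024KB):

```python
def disk_mb_per_s(iops, block_kb):
    """Per-disk bandwidth in MB/s for a given IO rate and block size."""
    return iops * block_kb / 1024


if __name__ == "__main__":
    print(f"400 IOPS at 256K: {disk_mb_per_s(400, 256):.1f}MB/s")
    print(f"150 IOPS at 64K:  {disk_mb_per_s(150, 64):.1f}MB/s")
```

The block size matters far more than the IO rate here: a disk doing 150 IOPS at 64K delivers roughly a tenth of what it can do at 400 near-sequential IOPS of 256K.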


The -E startup flag instructs SQL to allocate up to 4 consecutive extents before proceeding to the next file. See James Rowland-Jones Focus on Fast Track : Understanding the –E Startup Parameter for more on this. In a 4-disk RAID group with stripe-size 64K, a 256K IO to the file would generate a 64K IO to each disk.
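To illustrate how a 256K file IO spreads over a 4-disk group with a 64K stripe, here is a simplified sketch. It assumes a plain striped layout with no parity rotation or mirroring, and the function name is hypothetical; real RAID controllers map blocks in controller-specific ways.

```python
def disks_touched(offset_kb, length_kb, stripe_kb=64, disks=4):
    """Return {disk_index: KB transferred} for one contiguous IO on a simple stripe."""
    per_disk = {}
    pos = offset_kb
    end = offset_kb + length_kb
    while pos < end:
        stripe_index = pos // stripe_kb          # which stripe unit this position falls in
        disk = stripe_index % disks              # stripe units rotate across the disks
        chunk = min(end, (stripe_index + 1) * stripe_kb) - pos
        per_disk[disk] = per_disk.get(disk, 0) + chunk
        pos += chunk
    return per_disk


if __name__ == "__main__":
    print("64K IO: ", disks_touched(0, 64))
    print("256K IO:", disks_touched(0, 256))
```

A single 64K extent lands on one disk only, while an aligned 256K IO puts 64K on each of the 4 disks, which is the effect the paragraph above describes.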


It would be necessary to rebuild indexes before this scheme takes effect. Somewhere it was mentioned that it is important to build indexes with max degree of parallelism limited to either 4 or 8. It might be in the Microsoft Fast Track Data Warehouse Reference Architecture. Start with version 4 for SQL Server 2012, and work backwards?

Edit 2013-10: In SQL Server versions 2008 R2 and 2012, I am seeing 512K IO issued in table scans for both heaps and clustered indexes built without the -E startup parameter. Does this mean -E is no longer necessary? I would like to see Microsoft documentation on this.

This would have significant implications for RAID group size for HDDs. RAID 10 might favor 4 disks with 128K stripe size or 8 disks with 64K stripe size? RAID 5 might favor 5 disks with 128K or 9 disks (or perhaps 10) with 64K stripe size?

There are additional nuances to SQL Server table scan IO in versions 2008 R2 and 2012. At DOP 1, the IO size might be 192K or some combination with an average IO size in that range, except when there is only 1 file. Otherwise IO size appears to be 512K. The IO strategy appears to be to issue IO at low queue depth. There may be additional implications of the 512K IO.

In the Fast Track Data Warehouse configurations for SQL Server 2008 R2 and 2012 by HP and some other vendors, it is mentioned that the Resource Governor setting REQUEST_MAX_MEMORY_GRANT_PERCENT can "improve I/O throughput" at high degree of parallelism. (Oh yeah, the applicable DMV is dm_resource_governor_workload_groups.) Well, simply issuing IO at sufficient queue depth so that all volumes are working will do this. There is no reason to be cryptic.

Trace Flag 1117 (-T1117) causes all files in a filegroup to grow together. See SQL Server 2008 Trace Flag -T 1117.

With SSD, the second objective may not be important, as the SQL Server read-ahead strategy (1024 pages?) should generate IO to all units. On hard disks, generating close-to-sequential IO was important. On SSD, it is sufficiently beneficial just to generate large block IO, with 64K counting as large.


The old concept of distributing IO over both devices and channels still applies. The price of SSD is sufficiently low to warrant serious consideration. While there is more flexibility in SSD configuration, it is still necessary to validate performance characteristics with real queries against an actual database. SQLIO or other synthetic tests are not sufficient. If the SAN vendor advised on the configuration, then chances are IO bandwidth will not be good.


If anyone thinks I am being unfair to or overly critical of SAN vendors, do the following test.
Find the biggest table in your database, excluding LOB fields.
Run this:

Then compute 8 (KB/page) * (physical reads + read-ahead reads)/(elapsed time in ms)
Is this closer to 700 MB/s or 4GB/s? What did your SAN vendor tell you?
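The formula above can be wrapped in a small helper. This is only a sketch: the function name is an assumption, and the numbers in the example are hypothetical inputs, not measurements. Since 1KB per millisecond is 1000KB per second, the result reads directly as (decimal) MB/s.

```python
def scan_mb_per_s(physical_reads, read_ahead_reads, elapsed_ms, kb_per_page=8):
    """Table scan bandwidth: 8KB pages read from storage divided by elapsed time.

    KB per millisecond equals MB per second when 1MB is taken as 1000KB.
    """
    return kb_per_page * (physical_reads + read_ahead_reads) / elapsed_ms


if __name__ == "__main__":
    # Hypothetical example: 10 million pages (about 80GB) scanned in 20 seconds
    rate = scan_mb_per_s(physical_reads=1_000, read_ahead_reads=9_999_000,
                         elapsed_ms=20_000)
    print(f"{rate:.0f}MB/s")
```

The physical reads, read-ahead reads and elapsed time are the values SQL Server reports for the query when IO and time statistics are enabled.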

I am also not a fan of SSD caching or auto-tiering on critical databases, meaning the database that runs your business and that is managed by one or more full-time DBAs. In other applications, there may not be a way to segregate the placement of hot data from inactive data. In SQL Server, there are filegroups and partitioning. We have all the necessary means of isolating and placing hot data wherever we want it. SSD caching or auto-tiering will probably require SLC NAND. With active management using database controls, we should be able to use HET MLC or even MLC.

I stress the importance of analyzing the complete system and how it will be used instead of over-focusing on the components. There are criteria that might be of interest when there is only a single device or even single HBA. Today it is possible to over-configure the storage performance without unwarranted expense, and this is best accomplished by watching the big picture.

Adaptec reports that their Series 7 SAS RAID Controller (72405 - PCI-E gen 3 x8 on the upstream side and 6 x4 SAS 6Gbps) using the PMC PM8015 controller can do 500K IOPS and 6.6GB/s.

I will keep this topic up to date on www.qdpma.com Storage 2013

related posts on storage:
IO Queue Depth Control
IO Queue Depth Strategy
io-queue-depth-strategy (2010-08)
data-log-and-temp-file-placement (2010-03)
io-cost-structure-preparing-for-ssd-arrays (2008-09)
storage-performance-for-sql-server (2008-03)

If you are using SQL Server 2012 clustering on a SAN, I do suggest placing tempdb on local SSD, making use of the new 2012 feature that does not require tempdb to be on shared storage. Keep in mind that on the SAN, writes must be mirrored between two storage processors for fault recovery, and this is not a cheap thing to do. After a failover, we should plan to redo whatever was using tempdb at the time.