Developments in Storage - Good-bye to FC on the back-end,
good riddance to gold-bricked SAN systems with pathetic IO capability

There are several developments of great interest in storage technology. One is the Intel C5500 storage processor, codenamed Jasper Forest, a version of Nehalem with integrated PCI-E and special features for storage and embedded systems. Another is a set of new SAS components from LSI that bring SAN-like capability to SAS storage systems. The Dell PowerVault MD3200 series now appears to have Windows cluster capability. Finally, storage system vendors are beginning to abandon FC on the back-end, even at the high end.

Hopefully, all of this, together with the gradual maturing of SSD at the storage unit level and proper integration of SSD into storage systems, will let us finally break free from grossly over-priced SAN solutions that seem to unfailingly find a way to achieve spectacularly poor IO performance.

The Problem with SAN Systems To Date

The problem is not entirely with the SAN concept or technology, but part of it is. The database engine and the storage system must be tightly coupled. In fact, a key architectural aspect of the database engine (data and logs) is built around the disparate random and sequential IO performance characteristics of hard disk drives. The concept of the shortest path to metal also applies. Inserting extra layers of bureaucracy does not help; in fact, it can have severe negative consequences.

The more serious problem is that storage vendors want to sell storage systems with exceptionally high margins. To make the case for the really expensive storage, arguments were conjured up. One is shared storage. Another is management features and capabilities. The big lie is of course the big cache, which seems to be big but really is not as big or as effective as it is imagined to be. The first runs counter to the core of the database engine-storage relationship, and the second should have been handled by the database engine and good planning on the part of the DBA.

These in turn were elevated pervasively into storage system doctrine, accepted as fact without sound technical justification, even with substantiating evidence to the contrary. The evangelist acolytes of this doctrine, the SAN admins, have a vested interest in putting it into practice.

The consequence is a vicious cycle of circular arguments in which the SAN vendor doctrine runs counter to the architectural requirements of the database engine, turning a not especially difficult or expensive matter into a business-crippling problem. Because the storage system is so expensive, the customer cannot afford enough disks and IO channels. Because the system does not have adequate performance, features and capabilities are introduced to fix problems caused or aggravated by the SAN in the first place.

The bottom line of the SAN vendor proposal: instead of spending X on 4X the required capacity, with exactly the desired IOPS and bandwidth, managing the database without buggy software and a complicated user interface, they propose that you spend 2X for exactly the required capacity, one quarter the required IOPS and one-tenth the bandwidth, and they will provide the complicated, buggy software to help manage around the deficient IO performance. But they solved the problem of your excess disk capacity!
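To make the arithmetic concrete, here is a minimal sketch of the two proposals. X, the required capacity, IOPS and bandwidth are all placeholder assumptions on my part, not vendor figures.

# Illustrative comparison of the two proposals above.
# X and the "required" figures are placeholder assumptions, not vendor numbers.
X = 100_000                                          # cost of the direct-attach proposal, dollars
required = {"capacity_TB": 10, "iops": 40_000, "mb_s": 4_000}

direct = {"cost": X,     "capacity_TB": 4 * required["capacity_TB"],
          "iops": required["iops"],      "mb_s": required["mb_s"]}
san    = {"cost": 2 * X, "capacity_TB": required["capacity_TB"],
          "iops": required["iops"] // 4, "mb_s": required["mb_s"] // 10}

for name, cfg in (("direct-attach", direct), ("SAN", san)):
    print(f"{name}: ${cfg['cost']:,}, {cfg['iops']:,} IOPS, "
          f"${cfg['cost'] / cfg['iops']:.2f} per IOPS")
# The SAN proposal costs twice as much yet delivers one quarter of the IOPS,
# i.e. roughly 8x the cost per delivered IOPS.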

The amazing observation is that after buying a really expensive storage "solution," no one seems interested in asking why the really expensive storage system did not provide a solution to the performance requirement. Instead, the inclination is to look elsewhere for faults.

Intel C5500/C3500 Storage Processor

The C5500 has three features intended for storage systems that are not needed in standard server and workstation systems: 1) a Non-Transparent Bridge, 2) DMA with CRC, fill, and RAID 5/6 assist, and 3) Asynchronous DRAM Refresh.

PCI-E Non-Transparent Bridge (NTB)

NTB allows independent systems to communicate over a high-bandwidth, low-latency direct PCI-E connection. With the (normal) PCI-E transparent bridge, all devices on both sides are discovered and configured by the local host. Discovery stops at the NTB. Below are two independent UP systems connected through the NTB port. The NTB port can also be connected to a Root (non-NTB) port.

(Diagram: two independent UP systems connected through the NTB port.)

Compare this to the architecture of a SAN system, which must provide fault-tolerance and protect against data loss in the write cache.

(Diagram: SAN system write-cache fault-tolerance architecture, for comparison.)

Enhanced DMA and RAID Assist

Two other features are the enhanced Crystal Beach 3 (CB3) DMA engine and hardware RAID 5/6 assist. The DMA engine can calculate CRC outside of the processor cores. It can also do block fills, zeroing or filling pages, again outside of the processor cores. RAID 5 and 6 parity calculations are performed here as well.
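The RAID 5 parity math being offloaded is just a byte-wise XOR across the data blocks of a stripe. A minimal software sketch (purely illustrative, not the CB3 hardware implementation; RAID 6 adds a second, Reed-Solomon-style syndrome not shown here):

# RAID 5 parity is the byte-wise XOR of the data blocks in a stripe.
# Illustrative sketch of the calculation the CB3 engine offloads.
def raid5_parity(blocks):
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def rebuild_missing(surviving_blocks, parity):
    # Any single lost block is recovered by XOR-ing the parity with the survivors.
    return raid5_parity(list(surviving_blocks) + [parity])

data = [b"\x11" * 8, b"\x22" * 8, b"\x33" * 8]
p = raid5_parity(data)
assert rebuild_missing([data[0], data[2]], p) == data[1]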

Asynchronous DRAM Refresh

There is a pin on the processor that should be triggered on detection of power failure or system lockup. Once triggered, DRAM is put into self-refresh, with a battery backup, preserving the contents of memory.

LSI SAS 6Gbps

Serial Attached SCSI (SAS) is the actual successor to SCSI. FC co-existed with SCSI, absorbing the high end, but never encroached on the lower end of SCSI, which in turn sat above IDE/ATA/SATA. SAS from the beginning was announced as dual-ported, while SATA is single-ported. The intent was to support more complex storage systems, with redundant paths and multiple hosts.

LSI 1078

Apparently the first generation of SAS did not implement dual-porting across all components. This is not unusual, as the capabilities that were in the first generation were sufficiently useful, and it is important to establish both an infrastructure of component suppliers and a demand base. Dual-porting now seems to be widely available.

Switching is not part of the SAS specification, but it can be implemented in a SAS environment. The new LSI SAS6160 switch has 16 x4 SAS ports.

Zoning is a SAS 6Gbps feature, and is necessary to implement a SAN-type storage environment. While it is absolutely essential that critical transaction processing database servers and data warehouse DSS servers have dedicated storage, zoning is still useful.

The LSI switch might first be widely deployed with blade systems. But the application we are interested in is clustered OLTP database systems.

Clustering

The key functionality provided by SAN systems is Windows clustering support. Clustering is the simplest and most popular method of providing high availability. It used to be possible to cluster using direct-attach SCSI storage with two hosts on the SCSI bus, even though this is impractical for large storage systems; Microsoft later removed support for this mode. More recently, Microsoft added shared-nothing clusters, but this still seems to require the storage system to handle replication. The other alternative, database mirroring, is not simple to implement for multi-database applications.

Today we have a cluster solution with a SAS-only storage system in the Dell PowerVault MD3200. The HP P2000 can also support SAS on both the front-end and back-end. Hopefully, this will encourage small and medium sized operations to deploy databases in clustered environments with dedicated storage, not on a shared SAN system.

With a reasonably low cost structure, preferably well under $1000 per 15K disk, people can now afford proper storage configurations with IO distributed over very many disks and many SAS channels, instead of the disastrous capacity-oriented view of SAN vendors. A bare base-capacity 15K disk drive costs less than $200, so a price of $300 from a system vendor is acceptable. But there is no reason the fully amortized cost per disk of a storage system should be in the $2K range, as it is today with FC SAN systems.
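A rough sketch of what that cost difference means for IO. The per-disk prices come from the text above; the disk counts and the ~180 random IOPS per 15K drive are my assumptions.

# Rough cost/IOPS comparison. Per-disk prices are from the text above;
# the disk counts and ~180 random IOPS per 15K drive are assumptions.
def config(disks, cost_per_disk, iops_per_disk=180):
    return {"disks": disks, "cost": disks * cost_per_disk, "iops": disks * iops_per_disk}

direct_sas = config(disks=80, cost_per_disk=300)    # ~$300 per 15K disk from a system vendor
fc_san     = config(disks=12, cost_per_disk=2000)   # ~$2K fully amortized per disk on an FC SAN

for name, c in (("direct SAS", direct_sas), ("FC SAN", fc_san)):
    print(f"{name}: {c['disks']} disks, ${c['cost']:,}, ~{c['iops']:,} random IOPS")
# For the same $24,000 spend, the low-cost structure buys more than six times
# the spindles, and random IOPS scale with spindle count.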

SSD Technology and RAID Paradigm

The SSD has already matured greatly. The peculiar NAND flash write behavior issues are mostly being solved with TRIM, over-provisioning and other techniques. The other matter I have been promoting is doing away with RAID. RAID is fundamentally a hard disk concept. The hard disk is inherently a single component with moving parts; a mechanical failure could render the entire device inaccessible.

In the old days, hard disks failed frequently. And it took many hard disks to build an adequate storage system. So the storage system would experience failure much more frequently without RAID.

The SSD is not required to be a single device. There is no reason the SSD cannot have dual controllers. The SSD should already have memory component failure redundancy built in. Doing away with RAID for SSD also eliminates the issue of pushing extremely high IOPS through a RAID controller.
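A quick aggregate shows the scale of that IOPS issue. The per-device figure comes from the SSD discussion below; the device count is an assumption.

# Aggregate random IOPS from a handful of current SSDs; the device count is an
# assumption, the per-device figure comes from the SSD discussion later in the text.
ssds = 8
iops_per_ssd = 20_000          # "20K plus IOPS" per late-3Gbps-generation SSD
print(f"{ssds * iops_per_ssd:,} IOPS presented to the RAID layer")
# ~160,000 IOPS, versus the few thousand IOPS a comparable set of hard disks
# would generate - the load that hardware RAID stacks were originally sized for.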

I cannot pass up one more dig at SAN vendors. Technically, SAN systems do not support RAID. RAID is an acronym for Redundant Array of Inexpensive Disks. When the disks are sold for an amortized cost of $6K, what SAN systems support is really RAED: a Redundant Array of Expensive Disks.

SSD and Storage System Integration

Recent SSD technology in the late 3Gbps SATA/SAS period can sustain 260MB/s large block reads and writes and 20K-plus IOPS. The next generation, designed for the 6Gbps SATA/SAS interface, should be able to achieve 500MB/sec per device. Storage system disk enclosures were naturally designed with hard disks in mind. Having 24 hard disks on one x4 6Gbps SAS channel is a reasonable choice. Even though the very latest 15K 2.5in hard disks can support 150MB/sec (15K 3.5in disks can sustain 200MB/sec), not many activities are purely sequential, so 24 disks per x4 SAS channel is a reasonable balance between bandwidth and cost.
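A back-of-the-envelope check of that balance. The ~600MB/s usable per 6Gbps lane (after 8b/10b encoding) is my approximation; the per-disk rate is from the text.

# Back-of-the-envelope check on 24 HDDs per x4 6Gbps SAS port.
# ~600MB/s usable per 6Gbps lane (after 8b/10b encoding) is an approximation.
lane_mb_s = 600
channel_mb_s = 4 * lane_mb_s          # ~2400 MB/s for an x4 port

hdds, hdd_seq_mb_s = 24, 150          # latest 15K 2.5in drives, sequential
print(hdds * hdd_seq_mb_s, "MB/s if all 24 disks streamed sequentially at once")
print(channel_mb_s, "MB/s available on the x4 port")
# 3600 vs ~2400: pure sequential would oversubscribe the port, but since few
# real workloads are purely sequential, 24 disks per port is a sensible balance.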

For an SSD storage system, 5-8 devices can potentially saturate the x4 SAS channel. If there are extra drive bays, there is definite value in a hybrid storage solution mixing 4-6 SSDs and 6-8 HDDs on a single x4 SAS channel.
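The same arithmetic for SSD, again using my approximate port bandwidth and the 500MB/s per-device figure from the text; the hybrid device counts are illustrative.

# How many 6Gbps-generation SSDs fill an x4 SAS port, using the ~2400 MB/s
# approximation above and the 500MB/s per-device figure from the text.
channel_mb_s = 2400
ssd_mb_s = 500
print(channel_mb_s / ssd_mb_s, "SSDs are enough to saturate the x4 port")   # ~5 devices

# Hybrid fill of the remaining bays: e.g. 5 SSDs plus 7 HDDs on one port.
# The SSDs alone already approach the port limit; the HDDs mainly add
# capacity in otherwise empty bays rather than usable bandwidth.
print(5 * ssd_mb_s, "MB/s from the SSDs,", 7 * 150, "MB/s more offered by the HDDs")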