Storage Configuration Part III

Hard Drives

The table below shows the specifications for recent Seagate 3.5in (LFF) and 2.5in (SFF) 15K drives. The 2.5in Savvio drive has a lower average seek time. The average rotational latency for 15K drives is 2.0ms. The transfer time for an 8KB block ranges from 0.04ms at 204MB/s to 0.065ms at 122MB/s. The average access time (read seek plus rotational latency plus transfer time) for 8KB IOs randomly distributed over the entire disk is then about 5.45ms for the 3.5in disk and 4.95ms for the 2.5in disk. It should also be considered that the 3.5in Cheetah 15K.7 has a media density of 150GB per platter versus 73GB for the 2.5in Savvio 15K.2. If the 3.5in disk were populated to only 50% of capacity, its average seek time would probably be comparable to that of the 2.5in disk.

                  Cheetah 15K.6   Cheetah 15K.7   Savvio 15K.2
Avg. Read Seek    3.4ms           3.4ms           2.9ms
Avg. Write Seek   3.9ms           3.9ms           3.3ms
Sequential Max    171MB/s         204MB/s         160MB/s
Sequential Min    112MB/s         122MB/s         122MB/s
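
The access time arithmetic can be reproduced with a short sketch. The seek times and transfer rates are taken from the table; using the midpoint of the minimum and maximum sequential rates for the transfer time is an assumption made here, as are all function and variable names.

```python
# Average access time = average seek + average rotational latency + transfer time.
# Seek times and transfer rates come from the table; taking the midpoint of the
# min and max sequential rates is an assumption made for this sketch.

def rotational_latency_ms(rpm):
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm

def transfer_time_ms(block_bytes, rate_mb_s):
    """Time to move one block at the given rate (MB/s taken as 10^6 bytes/s)."""
    return block_bytes / (rate_mb_s * 1e6) * 1000.0

def avg_access_ms(seek_ms, rpm, block_bytes, rate_mb_s):
    """Sum of the three components of a random IO, in milliseconds."""
    return (seek_ms
            + rotational_latency_ms(rpm)
            + transfer_time_ms(block_bytes, rate_mb_s))

BLOCK_BYTES = 8 * 1024  # 8KB IO

# name: (average read seek ms, sequential max MB/s, sequential min MB/s)
drives = {
    "Cheetah 15K.7": (3.4, 204.0, 122.0),
    "Savvio 15K.2": (2.9, 160.0, 122.0),
}

access_ms = {}
for name, (seek_ms, rate_max, rate_min) in drives.items():
    mid_rate = (rate_max + rate_min) / 2.0
    access_ms[name] = avg_access_ms(seek_ms, 15000, BLOCK_BYTES, mid_rate)
    # At queue depth 1, random IOPS is the inverse of the access time.
    iops = 1000.0 / access_ms[name]
    print(f"{name}: {access_ms[name]:.2f} ms per IO, "
          f"about {iops:.0f} IOPS at queue depth 1")
```

The seek term dominates, which is why restricting the range of seeks matters far more for random IO than the sequential transfer rate does.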

The sequential transfer rates assume no errors and no relocated logical blocks. On the enterprise class disk drives, this is effectively achieved. On the high-capacity 7200RPM drives, the ability to sustain the perfect transfer rates is highly problematic, and the data sheet may not specify the design transfer rate.

The chart below shows IOMeter results for a single 10K drive over a range of data space utilization and queue depth, demonstrating the short-stroke effect on IOPS (vertical axis).


The charts below show latency on the vertical scale in ms for a range of data utilizations and queue depth.
[Charts: 10K drive latency versus queue depth]

There is no point in having the big-capacity SATA disks in the main storage system. We said earlier that the short-stroke effect is key. This means we will have much more space than needed on the set of 15K drives. The SATA drives are good for allowing dev and QA to work with the full database. There are too many developers who cannot understand why a query works fine on a tiny 10MB dev database but not on the 10TB production database.

Short Stroke Effect

The short-stroke effect is absolutely essential for transaction processing systems with tight mandatory limits on responsiveness. The short-stroke effect lowers latency and improves random IO performance. Most importantly, when active data is kept in a very narrow band of the disk, the short-stroke effect keeps latency low during heavy IO surges. On a fully populated disk where full strokes are required, latency can jump to several hundred milliseconds during heavy IO surges.

One of the fundamental arguments made by SAN storage vendors is that by consolidating storage, it is possible to achieve high storage utilization, which all but guarantees full-stroke operation. Under those conditions, a heavy IO surge is very likely to cause transaction processing volume to collapse. To benefit from the short-stroke effect, it is necessary to restrict the active data to a very narrow range. The remaining disk capacity can still be used for data not in use during busy hours. This means having far more capacity than the active database, which in turn implies that it is essential to keep the amortized cost per disk low, i.e., forgoing frills.

Solid State Storage

Most SSDs fall into one of two categories. The first is an SSD with one of the standard disk drive interfaces such as SATA, SAS, FC, or one of the legacy interfaces. The second type connects directly to the system IO port (PCI-E), for example the Fusion-IO SSDs. TMS has a complete solid state SAN system, which might even include DRAM for storage as well as non-volatile memory.

Most of the SSD devices in the news have a SATA interface and are intended for use in desktop and mobile systems. There might be (or have been) technical issues with using a SATA SSD in a SAS storage system when there are multiple SAS-SAS bridges in the chain, even though SATA drives can be used in these systems.

STEC makes the SSD for the EMC DMX line, possibly other models as well, and for several other storage vendors. The specifications for the STEC SSD are 52K random read IOPS, 17K random write IOPS, 250MB/s sequential read, and 200MB/s sequential write.

When using SSDs with disk interfaces in a mixed HD/SSD environment, it might be a good idea to place one SSD in each of the IO channels instead of having multiple SSDs on one channel. Check with the storage vendor on technical issues.

The general idea behind the Fusion-IO architecture is that the storage interfaces were not really intended for the capabilities of an SSD. A storage interface like SAS was designed for many drives to be connected to a single system IO port. Since Fusion-IO could build an SSD unit to match the IO capability of a PCI-E slot, it is natural to interface directly to PCI-E.

What I would like from Fusion-IO is a range of cards that can match the IO bandwidth of PCI-E gen 2 x4, x8 and x16 slots, and deliver 2, 4 and 8GB/s respectively. Even better would be the ability to simultaneously read 2GB/s and write 500MB/s or so from a x4 port, and so on for x8 and x16. I do not think it is really necessary for the write bandwidth to be more than 30-50% of the read bandwidth in proper database applications. One way to do this is to have a card with a x16 PCI-E interface, but with the onboard SSD connected to only a x4 slice. The main card then allows daughter cards, each connecting to another x4 slice, or something to this effect.
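
The 2, 4 and 8GB/s figures follow from the nominal PCI-E gen 2 lane rate. As a minimal sketch: gen 2 signals at 5GT/s per lane with 8b/10b encoding, which gives 500MB/s per lane in each direction before protocol overhead. The function name below is made up for illustration, and real cards will deliver somewhat less than the nominal figure.

```python
# Nominal PCI-E gen 2 bandwidth per direction, ignoring packet/protocol overhead.
GEN2_GT_PER_S = 5.0        # giga-transfers per second per lane
ENCODING_EFFICIENCY = 0.8  # 8b/10b: 8 data bits carried in every 10 bits

def gen2_slot_gb_s(lanes):
    """Nominal GB/s in one direction for a gen 2 slot with the given lane count."""
    gbit_per_lane = GEN2_GT_PER_S * ENCODING_EFFICIENCY  # 4 Gbit/s of data
    return lanes * gbit_per_lane / 8.0                   # bits to bytes

for lanes in (4, 8, 16):
    print(f"x{lanes}: {gen2_slot_gb_s(lanes):.0f} GB/s per direction")
```

Because PCI-E is full duplex, a simultaneous 2GB/s read and 500MB/s write on a x4 port does not exceed the slot; the limit would be on the SSD side.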

One more thing I would like from Fusion-IO is the use of PCI-E to PCI-E bridge chips. In my other blog on System Architecture, I mentioned that 4-way systems such as the Dell PowerEdge R900 and HP ProLiant DL580G5 for the Xeon 7400 series with the 7300MCH use bridge chips that let two PCI-E ports share one upstream port. My thought is that the Fusion-IO SSD resides in an external enclosure, attached to the bridge chip. The other two ports connect to the host system(s). On the host would be a simple pass-through adapter that sends the signals from the host PCI-E port to the bridge chip in the Fusion-IO external enclosure. This means the SSD is connected to two hosts. So now we can have a cluster? Sure, it would probably involve a lot of software to make this work, but who said life was easy?

At this time, I am not entirely sure what the proper role is for SSD in database servers. A properly designed disk drive storage system can already achieve phenomenally high sequential IO. Just stay away from expensive SAN storage, and do not follow the SAN vendors' standard advice on storage configuration.

Random IO is the natural fit for SSD. Let us suppose the amortized cost of a 146GB 15K disk is $500 in direct attach and $2000 in a SAN, and that a similar capacity SSD is $3000. The table below then compares cost per GB and cost per IOP between HD and SSD, using fictitious but reasonable numbers.

                      Direct Attach   SAN     SSD
Amortized Unit Cost   $500            $2000   $3000
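
The cost per GB and cost per IOP comparison can be worked through as a rough sketch. The unit costs and the 146GB capacity are the fictitious figures given above. The IOPS values are assumptions made for this sketch only: the hard disk figure is derived from the 5.45ms average access time computed earlier for a 15K drive, and the SSD figure borrows the 17K random write IOPS quoted for the STEC SSD as a conservative stand-in.

```python
CAPACITY_GB = 146  # same nominal capacity assumed for every option

# Assumed random IOPS per device (illustrative, not measured):
HD_IOPS = 1000.0 / 5.45   # one 15K disk at queue depth 1, full stroke
SSD_IOPS = 17_000.0       # STEC random write figure quoted in the text

# name: (amortized unit cost in dollars, assumed random IOPS)
options = {
    "Direct Attach": (500.0, HD_IOPS),
    "SAN": (2000.0, HD_IOPS),
    "SSD": (3000.0, SSD_IOPS),
}

def cost_per_gb(unit_cost, capacity_gb=CAPACITY_GB):
    return unit_cost / capacity_gb

def cost_per_iop(unit_cost, iops):
    return unit_cost / iops

for name, (unit_cost, iops) in options.items():
    print(f"{name:14s} ${cost_per_gb(unit_cost):6.2f}/GB   "
          f"${cost_per_iop(unit_cost, iops):6.2f}/IOP")
```

Under these assumptions the SSD is the most expensive option per GB and by far the cheapest per IOP, which is the case for using it only where random IO is heavy.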

A database storage system should not be entirely SSD. Rather a mix of 15K HD and SSD should be employed. The key is to put the data subject to high random IO into its own filegroup on the SSD.

Some people have suggested log and temp as good candidates for SSD. For a single active database, a storage system that works correctly with the hard drive sequential characteristics is fine for the log. The difficulty arises only with multiple high-activity databases. Ideally, the storage controller cache interprets the pattern of activity from multiple log files on one LUN, so that a single RAID group is sufficient. If not, then this is a good fit for SSD.

I am not certain that SSDs are necessary for temp. For data warehouse queries, temp on a large HD array seems to work fine. In the TPC-H benchmark results, the queries that showed a strong advantage for SSD involve random IO, not heavy tempdb activity.

Two criteria can sometimes terminate extended discussion. The first is that if the database happens to fit in memory, then there will not be heavy disk activity, except possibly for log and temp. The log activity can be handled by disk drives. In this case, it may not be worth the effort to set up a 48-disk HD array for temp, so a single SSD for temp is a good choice.

The second criterion applies to slightly larger databases that exceed system memory, perhaps in the 200-400GB range, but are sufficiently small to fit on one or two SSDs. Again, it may not be worth the effort to set up the HD array, making the SSD a good choice.

Solid State Storage in the future without RAID?

A point I stress to people is to not blindly carry a great idea from the past into the future without understanding why. (Of course, one should also not blindly discard knowledge, most especially the underlying reason behind the knowledge.)

So why do we have RAID? In the early days, disk drives were notoriously prone to failure. Does anyone remember what the original platter size was? I thought it was in the 12-18in range. MTBF may have been in the 1000hr range? Even today, at 1M-hr MTBF, for a 1000-disk array, the expectation is 8.8 disk failures per year (the average hours per year is 8,765.76, based on 365.24 days). Some reports show much higher failure rates, perhaps 30 per year per 1000 disks. Of course this includes all components in the storage system, not just the bare drive.
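
The failure expectation above is a simple ratio, sketched here under the assumption of a constant failure rate (so that the expected number of failures is total device-hours divided by MTBF). The function names are made up for illustration.

```python
HOURS_PER_YEAR = 365.24 * 24  # 8,765.76 hours, as used in the text

def expected_failures_per_year(n_disks, mtbf_hours):
    """Expected failures per year, assuming a constant failure rate."""
    return n_disks * HOURS_PER_YEAR / mtbf_hours

def implied_mtbf_hours(n_disks, failures_per_year):
    """Effective MTBF implied by an observed annual failure count."""
    return n_disks * HOURS_PER_YEAR / failures_per_year

print(f"{expected_failures_per_year(1000, 1_000_000):.1f} failures per year")
print(f"{implied_mtbf_hours(1000, 30):,.0f} hours implied MTBF")
```

Read the other way, 30 failures per year per 1000 disks corresponds to an effective MTBF of a little under 300,000 hours for the storage system as a whole.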

Sure, SSDs will also have a non-infinite MTBF. But the HD is fundamentally a single device. If the motor or certain components in the read/write mechanism fail, the entire disk is inaccessible. An SSD is not inherently a single device. The figure below shows a functional diagram of an Intel SSD. There is a controller, and there are non-volatile memory chips (NAND Flash).

[Figure: functional diagram of an Intel SSD]

In system memory, the ECC algorithm is designed to correct single-bit errors and detect double-bit errors within an 8-byte channel using 72 bits. When four channels are uniformly populated, it can also detect and correct an entire x4 or x8 DRAM device failure and detect double x4 chip failures. I suppose SSDs might already have some chip failure tolerance (but I have not found documentation that actually states this detail).

There should be no fundamental reason an SSD cannot have redundant controllers as well. With proper design, the SSD may no longer be subject to single component failure. With proper design, an SSD storage system could conceivably copy off data from a partially failed individual SSD to a standby SSD, or even a disk drive.

I stress that this may not be what we have today, but it is what I think the future should be.
