Wikipedia lists SSD as solid-state drive. I have also seen it referred to as solid-state disk. But there is no drive and no disk, so I am calling it solid-state device.
After years of anticipation and false starts, the SSD is finally ready to take a featured role in database server storage. There were false starts because NAND flash is very different from hard disks and cannot simply be dropped into a storage device and infrastructure built around hard disk characteristics. Too many simple(ton) people became entranced after seeing only the featured specifications of NAND-based SSD, usually random IOPS and latency. It is always the details in the small print (or omitted outright) that are critical. Now, enough of the supporting technologies needed to use NAND-based SSD in database storage systems are in place, and more are coming soon.
Random IO performance has long been the laggard in computer system performance. Processor performance has improved at the 40% per year rate of Moore's law. Memory capacity has grown at around 27% per year (memory bandwidth has kept pace, but not memory latency). Hard disk drive capacity for a while grew at 50%-plus per year. Even HDD sequential transfer rates have increased at a healthy pace, from around 5MB/s to 200MB/s over the last 15 years. However, random IOPS has only tripled over the same 15-year period, from 5400RPM to 15K drives. The wait for SSD to break the random IOPS stranglehold has been long, but it is finally taking place.
We should expect three broad lines of progress in the next few years. The first is the use of SSD to supplement or replace HDD in key functions. The second is a complete redesign of storage system architecture around SSD capabilities, with consideration that high-capacity HDD is still useful. The third is that it is time to completely rethink the role of memory and storage in server system architecture, and perhaps database architecture with respect to data and log.
A quick survey of SSD products is helpful to database professionals because of the critical dependency on storage performance. However, it quickly becomes apparent that it is also necessary to provide at minimum a brief explanation of the underlying NAND flash, including the SLC, MLC and eMLC variants. Next are the technologies necessary to implement high-performance storage from NAND flash. The Open NAND Flash Interface (ONFI) industry workgroup is important in this regard. This progresses to the integration of SSD in storage systems, including form factor and interface strategies. From here we can form a picture of the SSD products available, and develop a plan to implement SSD where appropriate.
To take the place of hard drives in a computer system, the storage technology prefers non-volatile memory, in which information is retained on power shutdown. Of the NV-memory technologies, NAND flash is most prevalent in hard-disk alternative/replacement storage devices. NOR flash has special characteristics, suitable for in-place code execution. Other non-volatile memories include Magneto-resistive RAM, Spin-Torque Transfer, and Memristor. Phase-Change Memory has promise in low granularity, and lower read latency.
The Micron NAND website is a good source of information on NAND. Wikipedia has a description of Flash Memory, explaining the fundamentals and the difference between NAND and NOR. The diagrams below from Cyferz show NOR wiring on the left and NAND wiring on the right.
A key difference is that NAND has fewer bit (signal) and ground lines, allowing for higher density, hence lower cost per bit. (Today it does not make much sense to talk about price per bit; price per Gbit helps eliminate the leading zeros.)
Sometime in 1997(?), Intel published a paper on multi-level cell (MLC) for NOR flash, called StrataFlash. At some point, MLC made its way to NAND, supporting 2 bits per cell. There is currently a 3-bit cell in development, but this may be more for low performance applications. MLC has a significantly longer program (write) time than SLC.
Intel's 3rd generation SSD, with 25nm NAND from the Intel-Micron Flash Technologies (IMFT) joint venture, will be out soon. The IMFT 34nm 2-bit per cell 4GB 172mm² and 25nm 2-bit per cell 8GB 167mm² die (from Anandtech) are shown below.
IMFT 34nm 2-bit per cell 4GB 172mm² and 25nm 2-bit per cell 8GB 167mm² die
IMFT 34nm 3-bit per cell 4GB 126mm²
A significant portion of the die is for logic?
Numonyx (now Micron) has public specification sheets for their NAND chips.
|Page type||Cell||Density (bit)||Page (x8)||Spare (x8)||Block (x8)||Block spare (x8)||Page (x16)||Spare (x16)||Block (x16)||Block spare (x16)|
|Small page||SLC||128M-1G||512 byte||16 byte||16K||512||256 words||8 words||8K word||256 word|
|Large page||SLC||1G-16G||2 Kbyte||64 byte||128K||4K||1K words||32 words||64K word||2 Kword|
|Very Large page||SLC||8G-32G||4 Kbyte||128 byte||256K||8K(?)|| || || || |
|Very Large page||MLC||16G-64G||4 Kbyte||224 byte||512K||28K|| || || || |
|Type||Density||Random Access||Page Program||Block erase||ONFI|
The time for each subsequent byte/word is cited as 25ns, corresponding to a 40MHz clock frequency. SLC is typically rated for 100K cycles, and MLC for 5,000 cycles. The (older) lower capacity SLC chips have 512-byte pages.
I am not sure about this, but I understand that the NAND chip itself could be referred to as a target, and is also a logical unit (LUN). A single package could have one or more (up to 8?) die, so is each die addressed by its LUN? The chip is divided into planes; do the die in the pictures above have 4 or 8 planes? Is a plane also a logical unit, or is the logical unit below a plane? Below a logical unit or plane(?) is a block, and then the page.
NAND organization: plane? logical unit (chip?), 2 planes (may support interleaved addressing), block, page. A target is one or more logical units.
See Micron Choosing the Right NAND
The two figures below are from the Micron document NAND 201, by Jim Cooke, September 2011. The first is a 2Gb NAND from 2006. The second is a 32Gbit NAND in 2010.
In the figure below, the 32Gb 25nm Micron SLC NAND Flash is one LUN, comprised of 2 planes. Each plane is 16Gbit, comprised of 2048 blocks. Each block is 8Mbit or 1M bytes + 56K additional, comprised of 128 pages. Each page is 8K bytes (or 64K bits) + 448 bytes additional.
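The hierarchy above can be checked with a few lines of arithmetic. This is only a sketch using the page, block and plane counts quoted in the text; the constant names are made up for illustration.

```python
# Geometry of the 32Gbit SLC NAND LUN described above (figures taken from the text).
PAGE_DATA_BYTES = 8 * 1024    # 8K bytes of data per page
PAGE_SPARE_BYTES = 448        # additional (spare) bytes per page
PAGES_PER_BLOCK = 128
BLOCKS_PER_PLANE = 2048
PLANES_PER_LUN = 2

block_data_bytes = PAGE_DATA_BYTES * PAGES_PER_BLOCK
block_spare_bytes = PAGE_SPARE_BYTES * PAGES_PER_BLOCK
plane_data_bits = block_data_bytes * 8 * BLOCKS_PER_PLANE
lun_data_bits = plane_data_bits * PLANES_PER_LUN

GBIT = 2**30
print(f"block: {block_data_bytes // 2**20} MB data + {block_spare_bytes // 1024} KB spare")
print(f"plane: {plane_data_bits // GBIT} Gbit")
print(f"LUN:   {lun_data_bits // GBIT} Gbit")
```

The spare area is not counted in the advertised density, which is why the 56K per block does not show up in the 16Gbit and 32Gbit totals.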
After NAND became the solid-state component of choice, the industry started to learn the many quirks and nuances of NAND SSD behavior. NAND must be erased an entire block at a time (2,000μs?). A write (or program) must be to an erased block.
The block erase requirement has a significant impact on write performance. Writes to MLC are far slower than writes to SLC. The write performance issues of MLC can be mitigated with over-provisioning.
The Wikipedia article on Write Amplification explains in detail the additional write overhead due to garbage collection. Write Amplification = Flash Writes/Host Writes. Small random writes increase WA. Write amplification can be kept to a minimum with over-provisioning.
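As a minimal sketch of the ratio above: the function name and the example numbers (one host page written, three valid pages relocated by garbage collection) are hypothetical, chosen only to show how relocation inflates the flash writes.

```python
def write_amplification(flash_bytes_written: float, host_bytes_written: float) -> float:
    """Write Amplification = Flash Writes / Host Writes."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written

# Hypothetical case: the host writes one 4KB page, but garbage collection
# first has to relocate 3 still-valid 4KB pages to free up a block.
host = 4096
flash = 4096 + 3 * 4096
wa = write_amplification(flash, host)
print(wa)
```

With more spare (over-provisioned) blocks, garbage collection can pick victims that hold fewer valid pages, so fewer relocations are needed and the ratio moves toward 1.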
Writes to SLC were not fast to begin with, and writes to MLC are much slower than to SLC (800μs versus 200-500μs). On top of this, the block erase requirement can result in erratic write performance, depending on the availability of free blocks. The write performance issues caused by the block erase requirement can be mitigated with over-provisioning.
Below are slides from the Intel Developer Forum 2010 "Enterprise Solid State Drive (SSD) Endurance", Scott Doyle and Ashok Narayanan.
NAND SSD may exhibit a "bathtub" effect in read-after-write performance. The intuitive expectation is that mixed read-write performance should be close to a linear interpolation between the read and write performance specifications. Without precautions, the mixed performance may be sharply lower than both the pure read and pure write performance specifications.
This example is cited in a STEC Benchmarking Enterprise SSDs report.
NAND flash also has wear limits. Originally this was 100,000 cycles for SLC and 5-10K for MLC. The write longevity issues of MLC seem to be sufficiently solved with wear leveling and other strategies. SLC SSD may become relegated to a specialty market.
The fact that NAND SSD has a write-cycle limit suggests that database administration could be adjusted to accommodate this characteristic. If there were some means of determining that an SSD is near the write-cycle limit, active data could be migrated off, and the SSD could be assigned to static data. In an OLTP database, tables could be partitioned to split active and archival data. In data warehouses, the historical data should be static.
Because of NAND flash characteristics such as block erasure and wear limits, a simple direct mapping of logical to physical pages is not feasible. Instead there is a Flash Translation Layer (FTL) in between. Numonyx provides a brief description here. The FTL is implemented in the SSD controller(?), and determines the characteristics of the SSD. Below is a block diagram of the FTL between the file system and NAND.
Another diagram is from the Micron/Numonyx NAND Flash Translation Layer (NFTL) 4.5.0 document. This document has a detailed description of the Flash Abstract Layer, or Translation Module, which incorporates functionality for bad block management, wear leveling and garbage collection.
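To make the idea concrete, here is a toy page-mapped FTL. It is a sketch under loud assumptions, not the Numonyx NFTL design: the class `ToyFTL`, its fields, and the policies (least-erased empty block first, greedy victim selection) are all invented for illustration, and bad block management is left out entirely. It shows out-of-place writes, a logical-to-physical map, garbage collection with block erase, and per-block erase counts.

```python
import random

class ToyFTL:
    """Toy page-mapped flash translation layer over an in-memory 'NAND'."""

    def __init__(self, num_blocks=8, pages_per_block=4):
        self.ppb = pages_per_block
        # A programmed, valid page holds (logical_page, data); otherwise None.
        self.blocks = [[None] * pages_per_block for _ in range(num_blocks)]
        self.next_free = [0] * num_blocks    # next erased page in each block
        self.erase_count = [0] * num_blocks  # per-block wear
        self.l2p = {}                        # logical page -> (block, page)
        self.active = 0                      # block currently accepting writes
        self.host_writes = 0
        self.flash_writes = 0

    def _pick_active(self):
        erased = [b for b in range(len(self.blocks)) if self.next_free[b] == 0]
        if erased:
            # Simple wear leveling: take the least-erased empty block.
            self.active = min(erased, key=lambda b: self.erase_count[b])
            return
        # Garbage collection: the victim is the block with the fewest valid pages.
        victim = min(range(len(self.blocks)),
                     key=lambda b: sum(p is not None for p in self.blocks[b]))
        survivors = [p for p in self.blocks[victim] if p is not None]
        if len(survivors) == self.ppb:
            raise RuntimeError("no invalid pages to reclaim; device is full")
        for lpn, _ in survivors:
            del self.l2p[lpn]                     # old locations vanish with the erase
        self.blocks[victim] = [None] * self.ppb   # erase the whole block at once
        self.next_free[victim] = 0
        self.erase_count[victim] += 1
        self.active = victim
        for lpn, data in survivors:               # relocate the still-valid pages
            self._program(lpn, data)

    def _program(self, lpn, data):
        """Program the next erased page; never overwrite a page in place."""
        if self.next_free[self.active] == self.ppb:
            self._pick_active()
        blk, pg = self.active, self.next_free[self.active]
        self.blocks[blk][pg] = (lpn, data)
        self.next_free[blk] += 1
        self.flash_writes += 1
        old = self.l2p.get(lpn)
        if old is not None:
            self.blocks[old[0]][old[1]] = None    # previous copy becomes invalid
        self.l2p[lpn] = (blk, pg)

    def write(self, lpn, data):
        self.host_writes += 1
        self._program(lpn, data)

    def read(self, lpn):
        blk, pg = self.l2p[lpn]
        return self.blocks[blk][pg][1]

random.seed(1)
ftl = ToyFTL(num_blocks=4, pages_per_block=4)  # 16 physical pages
latest = {}
for i in range(200):
    lpn = random.randrange(8)                  # only 8 logical pages are exposed
    ftl.write(lpn, i)
    latest[lpn] = i
assert all(ftl.read(l) == v for l, v in latest.items())
print("write amplification:", ftl.flash_writes / ftl.host_writes)
print("erase counts:", ftl.erase_count)
```

Exposing only 8 logical pages on 16 physical pages is the over-provisioning: it guarantees that some block always has invalid pages to reclaim. Shrinking that margin in the sketch makes garbage collection relocate more pages per erase.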
The strategy for writing to NAND somewhat resembles the database log, and the NetApp Write Anywhere File Layout (WAFL), which is an indication that perhaps a complete re-design of the database data and log architecture could be better suited to solid-state storage.
NAND density is currently at 128 or 256Gbit per die for 2-bit cells; 128Gbit means 64G cells. This is 16GB on one die! SLC is now at 128Gbit? (Never mind, apparently the Numonyx SLC 64Gbit product is 8 x 8Gbit die stacked. Still very impressive at both the die and package level.)
One aspect of such high densities is that bit error rates are high. All (high-density?) NAND storage requires sophisticated error detection and correction. The degree of EDC varies between enterprise and consumer markets.
The Micron website describes High-Endurance NAND as
"Enterprise NAND is a high-endurance NAND product family optimized for intensive enterprise applications. Breakthrough endurance, coupled with high capacity and high reliability (through low defect and high cycle rates), make Enterprise NAND an ideal storage solution for transaction-intensive data servers and enterprise SSDs.
Our MLC Enterprise NAND offers an endurance rate of 30,000 WRITE/ERASE cycles, or six times the rate of standard MLC, and SLC Enterprise NAND offers 300,000 cycles, or three times the rate of standard SLC. These parts also support the ONFI 2.1 synchronous interface, which improves data transfer rates by four to five times compared to legacy NAND interfaces."
Enterprise MLC is available up to 256Gbit, and SLC up to 128Gbit. I will try to get more information on this.
Below is an interesting combination of SLC and MLC.
The stated purpose of ONFI is to "define standardized component-level interface specifications as well as connector and module form factor specifications for NAND Flash."
The presentation by Michael Abraham of Micron, ONFI 2 Source Synchronous Interface Break the I/O Bottleneck, explains both ONFI 1.0 (2006) and the 2.x versions (if the above link does not work, try ONFI presentations). Below is a summary of the Abraham presentation.
In the original ONFI specification, the NAND array had a parallel read that could support 330MB/s of bandwidth (8KB in 25μs) with SLC(?), but the interface bandwidth was 40MB/s (the slide deck mentions a 25ns clock, corresponding to 40MHz, but the ONFI website says 1.0 is 50MB/s). Accounting for both array read and data output, a read takes 25 + 211μs for SLC and 50 + 211μs for MLC, for net bandwidths of 34 and 30MB/s. Net write bandwidth is 17MB/s and 7MB/s respectively. Below is the single channel IO.
|Type||Planes||Data||Array read||Data output||Read bandwidth||Data input||Program||Write bandwidth|
|SLC 4KB page||2||8KB||25μs||211μs||34MB/s||211μs||250μs||17MB/s|
|MLC 4KB page||2||8KB||50μs||211μs||30MB/s||211μs||900μs||7MB/s|
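The net bandwidth figures in the table above can be approximated by dividing the 8KB of data by the sum of the array time and the bus transfer time. The sketch below assumes the two steps run strictly back to back with no overlap; the function name is made up. The results come out within a MB/s or two of the slide figures, and the small differences presumably reflect rounding or a slightly different byte count in the slides.

```python
def net_bandwidth_mb_s(data_bytes, array_us, transfer_us):
    """Net MB/s when the array operation and the bus transfer run back to back."""
    return data_bytes / (array_us + transfer_us)  # bytes per microsecond == MB/s

DATA = 8 * 1024   # two planes x 4KB page
BUS_US = 211      # data output/input time over the 40MB/s interface, from the table

array_only = DATA / 25   # SLC array read alone: 8KB in 25us
cases = {
    "SLC read":  net_bandwidth_mb_s(DATA, 25, BUS_US),
    "MLC read":  net_bandwidth_mb_s(DATA, 50, BUS_US),
    "SLC write": net_bandwidth_mb_s(DATA, 250, BUS_US),
    "MLC write": net_bandwidth_mb_s(DATA, 900, BUS_US),
}
print(f"SLC array read only: {array_only:.0f} MB/s")
for name, bw in cases.items():
    print(f"{name}: {bw:.1f} MB/s")
```

The point of the arithmetic is that for reads the 211μs bus transfer dominates, while for MLC writes the 900μs program time dominates.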
Note that the write latency is very high relative to hard disk sequential writes, such as transaction log writes. I believe the purpose of the DRAM cache on the SSD controller is to hide this latency.
While the bandwidth and latency for NAND at the chip level are not spectacular, both could be substantially improved at the device level with more die per channel, more channels, or both, as illustrated below.
Note: I am puzzled by the tables below. I am thinking that the number of channels and the die per channel axes were inadvertently switched. If the signalling bandwidth is 40MB/s, then 4 channels are required for a maximum of 160MB/s, but it does take multiple die per channel to reach the channel bandwidth of 40MB/s.
SLC 2-Plane Performance (MB/s): Die per channel vs. # of channels
| ||Read|| || || ||Write|| || || |
|# of channels||1||2||4||8||1||2||4||8|
|1 die per ch||34||40||40||40||19||38||40||40|
|2 die per ch||68||80||80||80||38||76||80||80|
|4 die per ch||136||160||160||160||76||152||160||160|
MLC 2-Plane Performance (MB/s): Die per channel vs. # of channels
| ||Read|| || || ||Write|| || || |
|# of channels||1||2||4||8||1||2||4||8|
|1 die per ch||30||40||40||40||7||14||28||40|
|2 die per ch||60||80||80||80||14||28||56||80|
|4 die per ch||120||160||160||160||28||56||112||160|
SLC could achieve near peak performance with 4 channels and 2 die per channel. MLC could also achieve peak read performance with 4 channels and 2 die per channel, but peak write performance required 8 die per channel.
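The suspicion in the note above can be tested with a simple model: each channel delivers the lesser of the 40MB/s interface limit and the combined bandwidth of its die, and channels add linearly. This is my own assumed model, not something taken from the presentation, and the function name is made up. With the row and column labels exchanged (rows as channels, columns as die per channel), it reproduces the asynchronous tables.

```python
def device_bandwidth(channels, dies_per_channel, die_mb_s, channel_mb_s=40):
    """Each channel saturates at channel_mb_s; channels add up linearly."""
    per_channel = min(channel_mb_s, dies_per_channel * die_mb_s)
    return channels * per_channel

# Per-die figures read off the first column of the tables (MB/s).
per_die = [("SLC read", 34), ("SLC write", 19), ("MLC read", 30), ("MLC write", 7)]
for label, die_bw in per_die:
    print(label)
    for channels in (1, 2, 4):
        row = [device_bandwidth(channels, dies, die_bw) for dies in (1, 2, 4, 8)]
        print(f"  {channels} channel(s), 1/2/4/8 die per channel: {row}")
```

Under this reading, MLC write is the only case that needs 8 die on a channel before the 40MB/s interface becomes the limit.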
ONFI 2.0 defines a synchronous interface, improving the IO channel to 200MB/s and allowing 16 die per channel. Version 2.0 (2008) allowed speeds greater than 133MB/s. Version 2.1 (2009) increased this to 166 and 200MB/s, plus other enhancements, including in ECC. (The current Micron NAND parts catalog lists 166MT/s as available.) Read performance is improved for a single die and for multiple die. Write performance did not improve much for a single die, but did for multiple die on the same channel. Version 2.2 added other features. ONFI 2.3 adds EZ-NAND to offload ECC responsibility from the host controller.
Below are the net bandwidth calculations for ONFI 2.x.
|Type||Planes||Data||Array read||Data output||Read bandwidth||Data input||Program||Write bandwidth|
|SLC 4KB page||2||8KB||25μs||43μs||120MB/s||43μs||250μs||28MB/s|
|MLC 4KB page||2||8KB||50μs||43μs||88MB/s||43μs||900μs||8MB/s|
Synchronous SLC 2-Plane Performance (MB/s): Die per channel vs. # of channels
| ||Read|| || || ||Write|| || || |
|# of channels||1||2||4||8||1||2||4||8|
|1 die per ch||120||200||200||200||28||56||112||200|
|2 die per ch||240||400||400||400||56||112||224||400|
|4 die per ch||480||800||800||800||112||224||448||800|
Synchronous MLC 2-Plane Performance (MB/s): Die per channel vs. # of channels
| ||Read|| || || ||Write|| || || |
|# of channels||1||2||4||8||1||2||4||8|
|1 die per ch||88||176||200||200||8||16||32||64|
|2 die per ch||176||352||400||400||16||32||64||128|
|4 die per ch||352||704||800||800||32||64||128||256|
Almost all SSDs on the market in 2010 are ONFI 1.0. SSDs using ONFI 2.0 are expected soon(?) with >500MB/s capability?
The future ONFI 3.0 will increase the interface to 400MT/s.
The existing interfaces to the storage system were all designed around the characteristics of disk drives, naturally, because the storage system was comprised of disk drives. As expected, this is not the best match to the requirements and features of non-volatile memory storage. The Non-Volatile Memory Host Controller Interface (NVMHCI) specification will define "a register interface for communication with a non-volatile memory subsystem" and "also defines a standard command set for use with the NVM device." The NVMHCI specification should be complete this year, with product in 2012.
A joint Intel and IDT presentation by Amber Huffman and Peter Onufryk at Flash Memory Summit 2010 discusses Enterprise NVMHCI. In the storage system today, there is a controller on the hard drive (the chip on the hard drive PCB), with a SAS or SATA interface to the HBA.
The argument is that the HBA and controller should be integrated into a controller on the SSD, with a PCI-E interface upstream. Curiously, IDT mentions nothing about building a native PCI-E flash controller, considering that they are a specialty silicon controller vendor.
Below is the Enterprise NVMHCI view. The RAID controller now has PCI-E interfaces on both the upstream and downstream sides. I had previously proposed that RAID functionality should be pushed into the SSD itself.
Kam Eshghi, also of Integrated Device Technology, has an FMS 2010 presentation, "Enterprise SSDs with Unrivaled Performance A Case for PCIe SSDs", endorsing the PCI-E interface. The diagrams below are useful to illustrate the form factor. Below is a RAID PCI-E implementation using a standard RAID controller with PCI-E on the front-end and SATA or SAS on the back-end, a Flash controller with a SATA interface, and NAND chips.
In the next example, the host provides management services, consuming host resources. The final example is a Flash controller with a native PCI-E interface (and RAID capability?).
The desire to connect solid-state storage directly to the PCI-E interface is understandable. My issue is that the current standard PCI-E form factor is not suitable for easy access. There is the Compact PCI form factor (not yet defined for PCI-E?) where the external and PCI connections are at opposite ends of the card, instead of on two adjacent sides. This would be much more suitable for storage devices. Some provision should also be made for greater flexibility in storage capacity expansion with the available PCI-E ports.
There is a joint presentation by LSI and Seagate arguing that the SAS/SATA interface does not limit SSD performance, and has excellent infrastructure for module expansion and ease of access.
The current trend with SSD with SATA/SAS interfaces is the 2.5in HDD form factor. The standard 3.5in HDD form factor is far too large for SSD. For that matter, the 3.5in form factor has become too big for HDD as well. The standard defined heights for 2.5in drives are 14.8mm, 9.5mm, and 7mm. Only enterprise drives now use the 14.8mm height, as notebook drives are all 9.5mm or thinner.
The 7mm height drives used in thin notebooks have limited capacity (250GB?), but the form factor might be ideal for SSD. The standard 2U rack holds 24 x 14.8mm drives, but could hold perhaps 50 x 7mm SSD units?
(Update) Apparently Oracle/Sun has already implemented the high-density strategy. The F5100 implements up to 80 Flash Modules (2.5in, 7mm form factor, SATA interface) in a 1U enclosure for 1.92TB capacity. I suppose the Flash Modules are two deep. A hard drive enclosure is already heavy enough with 1 rank of disks, but 2 deep for a flash enclosure is very practical. And to think there are still storage vendors peddling 3U 3.5in enclosures!
Gary Tressler of IBM proposes that SSD should actually adopt the 1.8in form factor. Presumably there would only be a single SSD capacity. The storage enclosure will have very many slots, and we could just plug in however many we need.
I believe STEC is one of the component suppliers for Enterprise-grade SSD, especially with SAS interface, while most SSDs are SATA. EMC just announced Samsung as a second source. SandForce seems to be a popular SSD controller source for many SSD suppliers.
The Intel SSD controller is shown below.
SandForce makes SSD processors used by several SSD vendors. The client SSD processor is the SF-1200. Random write IOPS is 30K for bursts and 10K sustained, both at 4K blocks. The SF-1500 is the enterprise controller. The performance numbers are similar. Both support ONFI 50MT/s and SATA 3Gbps, and can correct 24 bytes (bits?) per 512-byte sector. The SF-1500 is listed as also supporting eMLC, has unrecoverable read errors of less than 1 in 10^17, a reliability MTTF of 10M operating hours, and supports a 5-year enterprise life cycle (100% duty). The SF-1200 has unrecoverable read errors of less than 1 in 10^16, a reliability MTTF of 2M operating hours, and supports a 5-year consumer life cycle with 3-5K cycles.
SSD vendors with the SandForce processor include Corsair and OCZ.
The new SandForce 2000 processor line became available in early 2011. The SF-2000 series supports ONFI 2 at 166MT/s. The enterprise processors are the SF-2500 and SF-2600. SATA 6Gbps and below are supported. The SF-2500 is SATA, supporting only 512B sectors? The SF-2600 also supports 4K sectors and has a SATA interface, but can work behind a SAS/SATA bridge.
| ||Sequential (128K)||Random (4K)|
The SandForce 2000 controller diagram.
Notes on the SandForce SF-2500 and SF-2600 from the SandForce web site:
"RAISE (Redundant Array of Independent Silicon Elements) technology. RAISE provides the protection and reliability of RAID on a single drive without the 2x write overhead of parity."
Max capacity: 512GB using 32Gb or 64Gb/die components.
Performance (sustained): 500MB/s at 128KB blocks, up to 60K IOPS at 4KB, read and write.
Flash type: MLC, eMLC, SLC, 3xnm, 2xnm (Asynch, Toggle, ONFi2 up to 166MT/s)
Sector size: SF-2500 512B, SF-2600 520, 524, 528 & 4K+DIF
Reliability: ECC up to 55 bits correctable per 512-byte sector (BCH). Unrecoverable read errors: less than 1 sector per 10^17 bits read.
The client SF-2200 version cites mostly the same specifications except Unrecoverable read errors: less than 1 sector per 10^16 bits read.
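To put the unrecoverable read error ratings in perspective (read here as 1 sector per 10^16 bits for the client part and per 10^17 bits for the enterprise part), the sketch below converts them into the amount of data read per error and the time that would take at the 500MB/s sustained rate cited above. Since the ratings are stated as "less than", these are lower bounds on the average, not predictions. The function name is made up.

```python
def bytes_per_unrecoverable_error(bits_per_error):
    """Average amount of data read, in bytes, per unrecoverable sector."""
    return bits_per_error / 8

PB = 10**15
ratings = [("client, 1 per 10^16 bits", 10**16),
           ("enterprise, 1 per 10^17 bits", 10**17)]
for name, bits in ratings:
    data = bytes_per_unrecoverable_error(bits)
    days = data / 500e6 / 86400   # continuous reading at 500MB/s
    print(f"{name}: {data / PB:.2f} PB read, about {days:.0f} days at 500MB/s")
```

At a full 500MB/s, the client rating corresponds to roughly a month of continuous reading and the enterprise rating to most of a year.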
Here is an article referenced on the ONFI website Advances in Nonvolatile Memory Interfaces Keep Pace with the Data Volume