
Previous versions of this topic are Storage Performance 2013 and Storage Update 2014.

Storage 2017 2017-03

Storage was once an important topic, both because it had a major impact on database performance and because the database team once had influence in system and storage configuration. Since then, the infrastructure department has asserted control. There may be one or possibly two standard system configurations to choose from. There might be a mandate to run on VMs or in the cloud. Storage may be part of infrastructure or the province of a separate SAN team.

Fortunately, in more recent years the typical system configuration has such a ridiculously enormous memory configuration that it can hide a good number of storage performance issues. Now that all-flash is a popular choice for new storage purchases, it is possible that random IO and high queue depth performance are also no longer issues.

An unaddressed issue is the matter of IO bandwidth. Astonishingly, the root source of this seems to be that SAN people do not understand the distinction between the Gbits/s that Fibre Channel uses in its specifications and the GBytes/s used almost everywhere else. Another part of this is that adequate IO bandwidth is achieved through parallelism, both in the IO channels and in the volumes at the storage end.
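
As a rough worked example of the distinction (assuming an 8Gbit/s Fibre Channel link, which has an 8.5 Gbit/s line rate with 8b/10b encoding, so the figures are approximate):

8.5 Gbit/s × 8/10 encoding ÷ 8 bits per byte ≈ 0.85 GBytes/s per link, or about 0.8 GB/s after protocol overhead.

So a single 8Gb FC port delivers well under 1 GByte/s, and reaching several GBytes/s to one host requires multiple ports working in parallel.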

For database people who are still interested in storage from the performance perspective, it is time to review and reassess the developments of recent years. The emphasis here is on direct-attach storage. It is possible to configure high-bandwidth IO on a SAN system, but some combination of vendor doctrine and SAN admin obstructionism puts up too many roadblocks, making such a discussion mostly pointless.

System Architecture

The starting point remains system architecture, as this is the upstream destination and consumer of storage. The most popular server configuration in recent years is perhaps a 2-socket system with the Xeon E5 processor, currently in version 4 (Q2 2016) based on the Broadwell core, with a v5 based on Sky Lake expected later in 2017.

E5v4 2P

The salient points of the Xeon E5 v4 are: 4-22 cores, 4 memory channels and 40 PCI-E lanes (x600-series), all per socket.
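
To put the PCI-E lanes in perspective (assuming PCI-E gen 3 at roughly 985 MB/s of usable bandwidth per lane per direction, an approximation):

40 lanes × ~1 GB/s ≈ 40 GB/s per socket, or on the order of 80 GB/s in a 2-socket system.

This is far more IO bandwidth than most storage configurations will ever deliver to a single host, which is why the bottleneck is almost always downstream of the processor.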

It should be mentioned that twenty years ago, the Intel Pentium Pro processor featured glueless multi-processing up to 4-way (now 4-socket). This was a seminal processor for more than one reason, but enabling a 2-way multi-processor at only a moderate price premium over a single processor system was immensely important at the time. For all too common reasons, people remembered the rule of 2-way (or higher) but did not bother to remember the reason why.

Today, processor cores are immensely powerful and there can be up to 22 cores in the E5 v4, along with vast memory and IO capability. The unquestioned mandate for 2-way no longer applies. Yet few people are willing to accept that a single-socket E5 v4 system is more than adequate for medium and possibly even heavy workloads.

E5v4 1P

Even the server version of the desktop processor, the latest version of which is the Xeon E3 v5 with the Sky Lake core, is sufficient for perhaps medium workloads. The key difference between the Xeon E3 and the desktop Core i7 is that ECC memory is enabled on the E3. The Xeon E3 has 4 DIMM slots, capable of 64GB memory. Even its 16 PCI-E lanes would be acceptable, except that no one makes a motherboard with 4 x4 PCI-E slots to make this practical. That, and there is only a moderate price difference between a Xeon E3 and a single-socket E5 with an entry processor. Between the two, the E5 system has far better configuration options.

For extraordinary requirements, the 4-way system is an option. The Xeon E7 v4 processor can have up to 24 cores. There are 3 QPI links, allowing a direct connection between processors in a 4-socket system. The Xeon E5 4600-series does support 4-way, but with direct connection to only two other processors and a 2-hop connection to the third processor.

E7v4 4P

Note that the E7 v4 has 32 PCI-E lanes versus 40 in the E5. The common configuration in 4-way systems seems to be that the PCI-E lanes from only two of the four sockets are exposed. Perhaps the thinking is that no more than 64 lanes should be needed?

The next generation Xeon E5 and E7 with the Sky Lake core are expected to have 6 memory channels and 48 PCI-E lanes per socket. See wccftech Xeon v5.

See Systems Architecture for a more detailed discussion on this topic.

SSDs

It is convenient to start with a picture of a desktop (or entry server) with one (or more) SATA SSDs as shown below. The upstream interface of the SSD is SATA, from which most computer systems can boot. The SATA interface is built into the PCH, also known as the south bridge.

E3 SSD

The SSD consists of a controller interfacing NAND to SATA, DRAM for storing metadata (except in very low cost products) and NAND packages with chips inside. For much of the early period of SATA SSD popularity, the controller had 8 NAND channels, each of which could have one or two packages.

In the server environment, the SAS interface is pervasive. A representation of SAS storage is shown below. In a server system with direct-attach storage, the boot device could be on the SATA interface off the PCH, or it could be on SAS storage. The main storage is on one or more RAID controllers with PCI-E on the upstream side and typically two x4 SAS ports on the downstream side. For SAN, the connection is most frequently a Fibre Channel HBA, but InfiniBand, SAS or others are possible as well.

E3 SSD

In the above diagram, the downstream side of the RAID controller has 2 SAS x4 ports. Each SAS x4 port can connect to a disk array enclosure (DAE). Inside the DAE is a pair of SAS expanders. Each expander connects to one x4 port on the upstream side, one x4 port on the downstream side for further expansion, and one SAS lane to each disk bay. The dual expanders allow for a redundant configuration, the details of which are provided by the system vendors.

SAS infrastructure was pervasive in the server storage environment when SSDs became popular. It is possible to build a controller interfacing NAND to SAS, which is both a successor to SCSI and a superset of SATA. However, some found it more convenient to use a SAS-to-SATA bridge chip to connect a SATA SSD. It could be argued that the SAS-SSD combination is not the optimal architecture for flash storage, but it was still a very good balance between performance, capacity and flexibility.

The RAID controllers were originally designed to connect up to 200 or so hard disks, and were more than capable of handling the 40K IOPS possible from that many 15K HDDs. A client SATA SSD might have a specification of over 90K IOPS in both read and write, but the write performance is possible only when previously erased pages are available. The sustained mixed read-write performance of an 8-channel SSD might be 10K IOPS, which is still far better than an HDD, and at much lower latency as well. Still, a moderate number of SSDs could easily saturate the RAID controller. At some point, it was realized that a firmware modification could support much better performance with SSDs, and this improved firmware came at an extra charge. The firmware performance gain probably came from bypassing the cache on the RAID controller.

Some argued that a controller interfacing NAND directly to PCI-E would be a superior solution. For PCI-E SSDs of 2011-14 period, a practical configuration was a controller with PCI-E x4 on the upstream side, and 16 NAND channels on the downstream side, or PCI-E x8 on the upstream and 32 NAND channels on the downstream.

PCIE SSD   PCIE SSD

Over the last two years, there has been more interest in attaching client SSDs to the PCI-E interface. In this time, NAND density was approaching 256GB at the package level, and the NAND interface bandwidth was reaching 800MB/s. As the desired capacity for client SSDs was in the 512GB-1TB range, the newer controllers may have only 4 or 8 channels. The server market is interested in both higher capacity and higher performance levels and so will continue to want controllers with more channels. However, server infrastructure based on PCI-E is not settled, so it is unclear whether there should be separate controllers for client and server.

NVM Express

It was obvious years ago that the legacy IO stack, both software and hardware, was not designed for the IOPS possible with solid-state storage. For this reason, Intel and others initiated NVM Express, a complete redo of both the software and hardware stack. The primary upstream interface of NVMe is PCI-E. For client use, newer UEFI (formerly BIOS) allows a system to boot from an NVMe SSD.

In brief, the NVMe software stack is designed for multi-core processors driving an SSD storage system comprised of very many devices, with many very deep queues. NVMe is also significantly more efficient in host CPU load per IO.

NVME

NAND

The intention here is not to cover NAND flash, but a brief overview is helpful. Below is a diagram from an article in Anandtech. The NAND die is split into two planes, which are essentially independent units. Below that are blocks and pages.

NAND die

Below is a diagram from a Micron article from 2010 on a 32Gbit product. At the time the page size was 8K plus error correction. The page size increased to 16K in a later product. I am not sure if page size has increased again beyond 16K. Page size affects the amount of DRAM necessary for metadata.

NAND 201

In cost sensitive consumer products, a larger page size reduces the amount of DRAM. In the server environment, it is more important to do the job the right way and not be driven by a moderate cost implication. For SQL Server in particular, it might be important that page size be 8KB? The latest NAND die is 256Gbit.

I am not sure where the diagram below is from, but it shows the key nature of NAND writes. The write, internally called a program, occurs on an entire page at a time. The page being programmed must be an erased page. The erase operation can only be done on an entire block. The nature of these two operations has deep implications for the characteristics of NAND, which are discussed in detail elsewhere.

Page Block

A final point about NAND is that the packages we see on a PCB are not single die, but stacks of die. Most NAND products have up to 8 die in one package. There is a Samsung specialty product with the controller, DRAM and 16 NAND chips in a single stack, but this is probably not relevant in server oriented SSDs.

NAND 201

There are a number of important sub-topics covered elsewhere and not discussed here. One is the bits per cell: SLC, MLC and TLC. It would seem that SLC is no longer necessary and all server-grade products are MLC, varying the degree of over-provisioning to achieve the desired write endurance. Also not discussed are the new 3D multi-layer NAND products and other technologies such as Intel 3D XPoint.

SSD revisited

Having briefly covered NAND, we now have a better understanding of the SSD. The basic SATA SSD is not analogous to a single HDD, but rather an array of devices. The older 8-channel controller with a single package per channel has 8 packages total. Depending on the capacity and NAND chip, it could have up to 8 NAND chips per package.

I have not looked at the NAND interface protocol. Assuming it allows overlapping commands, the SSD then behaves in a manner analogous to an array of 128 components, based on 64 NAND chips, each with 2 planes per die.
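
Spelling out the arithmetic for the 8-channel case (assuming one package per channel and 8 die per package, as above):

8 channels × 1 package × 8 die per package × 2 planes per die = 128 planes.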

NAND 201

NAND manufacturers used to make datasheets available, but this has not been the case in recent years, so it is difficult to know exactly what the situation is today. In principle, a single NAND chip should be able to saturate the NAND channel interface in read operations. Write operations to a single chip use only a small fraction of the interface bandwidth. Hence having multiple chips in one package allows both read and write operations to approach the interface bandwidth. I am not sure whether the write figures mean program operations only, or a mix of page program and block erase. It would be helpful if vendors explained this in detail.

Still, the SSD is composed of multiple devices, whether characterized by the total number of channels, chips or planes. For this reason, SSD performance should be specified at multiple queue depth levels. Queue depth 1 is important because it characterizes single-thread serialized IO performance, but it is only a fraction of maximum performance. We should be interested in performance at the highest queue depth at which performance scales linearly with queue depth, and then perhaps the queue depth for maximum performance. Latency versus queue depth information is also helpful. But apparently this is too much complexity, so the common practice seems to be to cite performance at queue depths 1 and 32.
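
A worked example of why queue depth matters so much (the latency figure is an illustrative assumption, not a specific product): by Little's Law, throughput ≈ queue depth ÷ latency.

QD 1 at 100µs per IO ≈ 10,000 IOPS
QD 32 at 100µs per IO ≈ 320,000 IOPS, provided there are enough independent channels, chips and planes to keep scaling.

Once the device stops scaling, additional queue depth only adds latency.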

The PCI-E SSD with 16-channels then could be composed of up to 128 NAND chips and 256 planes. The 32-channel SSD could have up to 256 chips and 512 planes.

NAND 201


Storage Arrays

A powerful HDD array might be comprised of 1000 HDDs. Based on about 100 HDDs per RAID controller (4 enclosures of 24 or 25 disks each), we would need 10 RAID controllers and 40 disk enclosures. For 2.5in (SFF) HDDs, the enclosure is 2U. At 16 enclosures per 42U rack, this would be 3 racks. Alternatively, we could have about 200 disks per RAID controller, 5 controllers in all, and the same rack space.

NAND 201

NAND 201

NVMe in a PCI-E form factor in the server environment has two options. The PCI-E add-in card is basically for legacy systems. Some newer systems have SFF-sized bays with a PCI-E interface.

NAND 201


IO Queue Depth Theory

One might be surprised to hear that the objective in storage performance is not necessarily achieving maximum performance, but rather the desired behavior. The reason for this has to do with the nature of small block random IO on hard disks. The access time for a hard disk is approximately the rotational latency plus the seek time. For a random IO to a block anywhere on the disk at queue depth 1, the average access time is the time for the disk to rotate 180 degrees plus the time for the arm to move one-half of the full traverse.

For a 15K HDD, rotational latency is 2ms and average seek time is typically 3ms for 200 IOPS at QD1. When IO is issued at higher queue depth, the controller can reorder IO to achieve higher throughput at the expense of higher latency. The other aspect of HDD performance is that sequential IO is impressive, at over 200MB/s for more recent models.
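
The arithmetic behind the 200 IOPS figure (nominal numbers for a 15K RPM drive; the 3ms seek is a typical assumption):

60,000 ms per minute ÷ 15,000 revolutions = 4 ms per revolution, so average rotational latency ≈ 2 ms
2 ms rotational + 3 ms average seek = 5 ms per random IO
1,000 ms ÷ 5 ms ≈ 200 IOPS at queue depth 1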

IOPS at QD1 per disk, peak random IOPS (at high QD and latency) and bandwidth, among others, are all important in various aspects of database performance. Low latency is desired in online transaction processing, in which a live user is waiting for the response; high IOPS with tolerance for long latency is of interest in batch transaction processing, reports, and DW; bandwidth is important in DW. A hard disk array can be configured and operated for any one of these three characteristics, but not more than one.

This is why there was a rule in the old days that OLTP and DW should always be on separate systems, specifically meaning separate storage systems. Period, no matter how powerful the server is.

With the practicality of all-flash storage in a database server, any reasonable performance requirement in terms of IOPS or bandwidth can be far exceeded. Average latency is also substantially improved. However, firmware optimization was required for the desired quality of service in avoiding excessively long latency on individual IOs.

Now with all-flash storage, is it possible to achieve all three simultaneously? In principle, low latency and peak IOPS are fundamentally incompatible. However, the peak IOPS of the storage system is not the criterion. Instead, it is the peak IOPS that the database engine will drive. And hence it might be possible to achieve simultaneously both low latency and the peak IOPS required by the database engine, when the storage system has IOPS capability far beyond what is necessary.
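
A rough illustration (the numbers are assumptions for the sake of the example): if the database engine drives at most 50,000 IOPS and the flash array is capable of 500,000 IOPS, the array runs at roughly 10% of its capability, queues stay shallow, and individual IO latency stays close to the unloaded service time. The same 50,000 IOPS against an array capable of only 60,000 IOPS would mean deep queues and long latencies.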

PCI-E Switches

In principle, for any particular storage system, it should be possible to employ a particular IO queue depth policy in the application program, i.e., the database engine in this case, to achieve any desired IO behavior.


There are additional nuances to SQL Server table scan IO in versions 2008 R2 and 2012. At DOP 1, the IO size might be 192K, or some combination with an average IO size in that range, except when there is only 1 file; otherwise the IO size appears to be 512K. The IO strategy appears to be to issue IO at low queue depth; there may be additional implications of the 512K IO size.

In the Fast Track Data Warehouse configurations for SQL Server 2008 R2 and 2012 by HP and some other vendors, it is mentioned that the Resource Governor setting REQUEST_MAX_MEMORY_GRANT_PERCENT can "improve I/O throughput" at high degrees of parallelism. (Oh yes, the applicable DMV is dm_resource_governor_workload_groups.) Well, simply issuing IO at sufficient queue depth so that all volumes are working will do this. There is no reason to be cryptic.
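
For reference, a minimal sketch of adjusting that setting on the default workload group (the 50 percent value is only an illustration, not a recommendation):

ALTER WORKLOAD GROUP [default]
WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 50);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO
-- verify the effective setting
SELECT name, request_max_memory_grant_percent
FROM sys.dm_resource_governor_workload_groups;
GO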

Trace Flag 1117 (-T1117) causes all files in a filegroup to grow together. See SQL Server 2008 Trace Flag -T 1117.
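
A minimal sketch of enabling it globally at runtime (it is more commonly set as the -T1117 startup parameter so that it survives a restart):

DBCC TRACEON (1117, -1);
GO
-- confirm the flag is active
DBCC TRACESTATUS (1117);
GO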

With SSD, the second point may not be important, as the SQL Server read-ahead strategy (1024 pages?) should generate IO to all units. On hard disks, generating close-to-sequential IO was important. On SSD, it is sufficient just to generate large block IO, with 64K counting as large.

Summary

The old concept of distributing IO over both devices and channels still applies. The price of SSDs is sufficiently low to warrant serious consideration. While there is more flexibility in SSD configuration, it is still necessary to validate performance characteristics with real queries against an actual database. SQLIO or other synthetic tests are not sufficient. If the SAN vendor advised on the configuration, then chances are the IO bandwidth will not be good.

Addendum

If anyone thinks I am being unfair to or overly critical of SAN vendors, do the following test.
Find the biggest table in your database, excluding LOB fields.
Run this:
-- flush clean pages from the buffer pool so the scan must read from storage
DBCC DROPCLEANBUFFERS
GO
-- report page reads and elapsed time for the query
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
-- INDEX(0) forces a full scan of the heap or clustered index
SELECT COUNT(*) FROM TableA WITH (INDEX(0))
GO

Then compute 8 (KB/page) * (physical reads + read-ahead reads) / (elapsed time in ms), which gives throughput in MB/s.
Is this closer to 700 MB/s or 4GB/s? What did your SAN vendor tell you?
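
A hypothetical worked example (the counts are made up to show the arithmetic): if STATISTICS IO reports physical reads plus read-ahead reads of 1,000,000 pages and the elapsed time is 20,000 ms, then

8 × 1,000,000 ÷ 20,000 = 400 MB/s

which is roughly where a single 4Gb FC path tops out, and a long way from what even a modest direct-attach flash configuration can deliver.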

I am also not a fan of SSD caching or auto-tiering on critical databases, meaning the database that runs your business and is managed by one or more full-time DBAs. In other applications, there may not be a way to segregate the placement of hot data from inactive data. In SQL Server, there are filegroups and partitioning; we have all the necessary means of isolating and placing hot data wherever we want it. SSD caching or auto-tiering will probably require SLC NAND. With active management using database controls, we should be able to use HET MLC or even MLC.
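
A minimal sketch of the filegroup approach (the database, file path, and table names here are hypothetical):

ALTER DATABASE SalesDB ADD FILEGROUP HotFG;
ALTER DATABASE SalesDB ADD FILE
    (NAME = N'SalesDB_Hot_1', FILENAME = N'S:\Data\SalesDB_Hot_1.ndf', SIZE = 100GB)
TO FILEGROUP HotFG;
GO
-- place the hot table (or its active partitions) on the SSD-backed filegroup
CREATE TABLE dbo.RecentOrders
(
    OrderID   bigint    NOT NULL,
    OrderDate datetime2 NOT NULL,
    Amount    money     NOT NULL
) ON HotFG;
GO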

I stress the importance of analyzing the complete system and how it will be used instead of over-focusing on the components. There are criteria that might be of interest when there is only a single device or even single HBA. Today it is possible to over-configure the storage performance without unwarranted expense, and this is best accomplished by watching the big picture.

Adaptec reports that their Series 7 SAS RAID Controller (72405, with PCI-E gen 3 x8 on the upstream side and 6 x4 SAS 6Gbps ports on the downstream side), using the PMC PM8015 controller, can do 500K IOPS and 6.6GB/s.

I will keep this topic up to date on www.qdpma.com Storage 2013

related posts on storage:
IO Queue Depth Control
IO Queue Depth Strategy
io-queue-depth-strategy (2010-08)
data-log-and-temp-file-placement (2010-03)
io-cost-structure-preparing-for-ssd-arrays (2008-09)
storage-performance-for-sql-server (2008-03)

ps,
If you are using SQL Server 2012 clustering on a SAN, I do suggest placing tempdb on local SSD, making use of the new 2012 feature that does not require tempdb to be on shared storage. Keep in mind that on the SAN, writes must be mirrored between two storage processors for fault recovery, and this is not a cheap thing to do. We should plan to redo whatever was using tempdb at the time of a failover.
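
A minimal sketch of the tempdb move (the local path is hypothetical; the change takes effect after the instance restarts):

ALTER DATABASE tempdb MODIFY FILE
    (NAME = tempdev, FILENAME = 'S:\TempDB\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE
    (NAME = templog, FILENAME = 'S:\TempDB\templog.ldf');
GO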