
Companion article: Memory Latency.

It is commonly repeated that memory and performance are practically synonymous for database systems. This was largely true a very long time ago. Beyond a certain point, passed long ago, large memory configurations continued to be useful in bringing IO volume down to manageable levels. But the impact of memory on performance was only moderate when the storage system could deliver enough IOPS that IO did not limit performance. More often, large memory was used to mask (often very serious) deficiencies in the storage system, and this was the only truth left in the belief about memory and performance.

Most people have probably already noticed that nothing dramatic happened as memory configurations grew in size and data reads dropped to noise levels. This is related to the reason that Microsoft SQL Server has column-store for DW and Hekaton for transaction processing. If the goal was a dramatic performance increase, then a completely different strategy was needed, not just more memory for a database engine whose architecture originated in the 1970's.

If systems have more memory capacity than necessary, and IO performance is no longer an issue, the problem killed from both ends with massive memory from one direction and all-flash storage from the other, then one might think that the topic of memory and IO is obsolete.

Except that the factor that is now important, memory latency, is impacted. One of the avenues to lower memory latency may involve a dramatic reduction in memory size. Before pursuing this path, we need to clearly understand the relationship between memory capacity, IO, and performance.

Memory and IO in Database Performance

Memory is obviously much faster than disk IO, even SSD IO. It would seem intuitive that memory is important, the more the better. Yet few people bother with the details, and it is in the details that the important concepts become clear. A very long time ago, memory was extremely valuable yet very expensive. Over the years, the relationship between memory and IO in the performance of database systems has evolved.

In recent years, the memory capacity of server systems has grown to ridiculous levels, rendering the old rules on memory configuration obsolete. As this was happening, SSD or All-Flash-Array storage became first viable, and then probably the default choice over HDD arrays. So now that we do not absolutely have to have it, we can have more IOPS than we ever needed, even before memory became so large.

A brief summary is given of how the impact of memory and IO on performance has changed over the years. An assessment is then made of the memory and IO situation today, particularly with regard to SSD storage and NVMe, with quantitative analysis.

One might think that if the problem of always needing more memory has been solved, then this is no longer an important topic. True in one sense, but the prioritization of memory capacity has been so deeply ingrained into server system architecture strategy that memory latency, which became the more important factor, has been neglected. To correct the distortions in system architecture, we need to understand more precisely the role of memory size. Then we can safely support undoing the memory capacity distortion.

This article examines why memory capacity has become excessive. The companion article examines why Memory Latency is what really matters now. These two topics will support the argument for a clean sheet rethinking of server system architecture, in a later article, hopefully not too much later.

The Role of Memory in Database Systems

The evolution of the role of memory in database performance can be divided into four phases. In the beginning, simply having enough memory for the executables and key metadata structures without hitting the page file (an archaic element of virtual memory) was an expensive luxury. But this is so old that we do not need to talk about it.

Once there was enough to avoid paging, some memory was actually left over for the buffer cache. The first priority was to cache the most critical pages, including the index root and upper intermediate levels (and execution plans). In this phase, buffer cache hit ratio was a meaningful performance counter.

The third phase was leaf-level data caching. The buffer cache is now retaining the hot leaf-level pages. Adding memory does reduce IO, and that helps when the storage system performance is inadequate. But it can take a doubling of memory to see impact. Watching buffer cache hit ratio sit somewhere over 99.9% became pointless. Page life expectancy was then the counter to watch.

Finally, we have the residual phase. The entire active portion of the database is in memory. There is still residual IO from infrequently accessed data, unless this too is already in memory. Of course, any write activity must still be written to disk. At this stage, the performance counter that we do not have is something to tell us how much of the buffer cache is rarely accessed.

1. Origins - INGRES and the 1970's

The grandfather of SQL Server, via Sybase, is INGRES. The development work for INGRES was done on a PDP-11/45 and a PDP-11/70, per The Design and Implementation of INGRES (see also Path to In-Memory Databases). In this era, having the 300KB necessary to get all the executables into memory was a luxury.

2. The 1980's - Memory was Important

This was the era of the VAX 11/780, introduced in 1977 but ubiquitous in the 80's. Base memory was 128K initially, then 256K, and then perhaps 1M? My recollection is that many organizations sprang for a whopping 2M or more. We were now caching the important index upper level pages. Every group of critical pages cached had a significant impact on IO. All accesses to an index must go through the root level page, and many accesses might go through a single intermediate level page.

3. The 1990's - Memory was Helpful

By the mid-1990's, 4-way SMP systems were becoming popular and the de facto standard for database servers. The proprietary 4-way 80486 systems could support 1GB memory. Then came the 32-bit address limit of 4GB, of which 3GB could be directly addressed, and we had to resort to hokey tricks to use more memory.

In this period, it was known that doubling memory from 4GB to 8GB would improve the 4-way TPC-C result by just less than 10%. A further increase to 16GB might yield another 5%.

4. When is Memory Big Enough

By the early 2000's, memory configurations of 8-16GB were practical. But on SQL Server, we were mostly waiting for the arrival of SQL Server 2005 for 64-bit on X86, unless one were willing to go the route of Itanium. In the 2007-09 timeframe, both Intel and AMD 4-way systems could support 32 DIMMs, for maximum memory of 128GB with 4GB modules, and 256GB with 8GB modules.

The big jump came in 2010, with Nehalem-EX, in which the 4-way system had 64 DIMMs, and 1TB was possible with 16GB DIMMs. The requirements for the TPC-C and E benchmarks are that the database size scales in proportion to the performance. By coincidence, the database size was typically on the order of ten times larger than memory. A very large array of hard disks was necessary to support the required IOPS.

In the real world, people may have had really large databases. But frequently much of it was old data, as the practice of archiving out past data was being neglected. For the data architects who did not employ GUIDs, most active data probably fit into memory in this period.

Memory and Performance

Scenario Walkthrough

Consider a scenario where a database system can be described with a not impossibly complicated model. Suppose the server system has 100 cores and 200 logical processors (LP), as 2-way Hyper-Threading is all that Intel supports in their mainline processors. Suppose that the baseline is a system with enough memory for the SQL Server executable, internal data structures, and the root and intermediate levels of indexes, including clustered indexes.

Suppose some representative transaction processing database for which performance on the baseline system is 10,000 transactions per second. A transaction might be comprised of 10 different calls from the client, not unlike a certain benchmark.

Performance is 100 tps per core, or 50 tps per LP. Each transaction consumes 20 CPU-milliseconds of LP time, the worker time column in the DMV dm_exec_query_stats.

Suppose that there are 100 physical disk reads per transaction under the baseline memory configuration (it doesn't really matter which specific pages are in memory) and some unspecified number of writes. If it makes you happy, suppose it's 10 writes per transaction.

Then under baseline conditions, the storage system needs to be able to support:

  10,000 tx/s × (100 reads/tx + 10 writes/tx) = 1.1 million IOPS

If the storage system has RAID 10 volumes, the physical storage units (HDDs or SSDs) will see 1.2M IOPS, as each write becomes two physical writes. In RAID 5, each write becomes two reads plus two writes, for 1.4M IOPS.
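
The arithmetic behind these numbers, as a rough sketch (assuming a write penalty of 2 for RAID 10 and 4 for RAID 5):

  # Back-of-envelope IOPS for the baseline scenario: 10,000 tx/s,
  # 100 reads and 10 writes per transaction; RAID 10 write penalty 2,
  # RAID 5 write penalty 4 (two reads plus two writes per logical write).
  tps, reads_per_tx, writes_per_tx = 10_000, 100, 10
  front_end_iops = tps * (reads_per_tx + writes_per_tx)       # 1,100,000
  raid10_iops = tps * (reads_per_tx + 2 * writes_per_tx)      # 1,200,000
  raid5_iops = tps * (reads_per_tx + 4 * writes_per_tx)       # 1,400,000
  print(front_end_iops, raid10_iops, raid5_iops)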

HDD Storage Array

Suppose the storage is comprised of 15K HDDs (these became obsolete a few years back, with SSDs taking over the high-performance role). Each 15K HDD supports 200 IOPS at queue depth 1 with 5ms latency. Higher IOPS per disk are possible at higher queue depth, but latency is also longer.

The number of disks needed is:

  5,500 15K HDDs not accounting for RAID, or
  6,000 in RAID 10, or
  7,000 in RAID 5.
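
Or in sketch form, assuming the 200 IOPS per 15K HDD figure above:

  # Drive counts, assuming 200 IOPS per 15K HDD at queue depth 1.
  iops_per_hdd = 200
  print(1_100_000 // iops_per_hdd)   # 5500 drives, ignoring RAID
  print(1_200_000 // iops_per_hdd)   # 6000 drives, RAID 10
  print(1_400_000 // iops_per_hdd)   # 7000 drives, RAID 5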

Considering the number of disks, do not bother calculating capacity. Just get the lowest capacity 15K drive. The floor space required in terms of 42U racks is:

enclosures   type          HDDs per enclosure   HDDs per 42U rack   # of racks
14           3U            15 LFF               210                 30
10           4U top load   60 LFF               600?                10
21           2U            24 SFF               504                 12

I am not actually sure that we can fill a rack with just disk enclosures. Certainly, in the case of the 4U 60-disk top-loader, the weight must be an issue?

Are there any serious problems so far? Well, yes there are. Hard disk mean-time-between-failure (MTBF) is usually cited at 1-2 million hours. Even if the manufacturer cites 2M hours MTBF, for a disk drive in a large array, with all the vibration from 100+ disks in a rack, don't expect the higher number.

There are 8,760 hours in one (non-leap) year, so at 1M hours MTBF the expected annual disk failure rate is 0.876%. Not bad, right? Not bad for your workstation with one disk drive.

In an array of 5000 HDDs, the expectation is 43.8 failures per year. How long does it take to rebuild a RAID group from a failed drive? Drive failures are handled by the RAID controller and are invisible to the database server, but performance is degraded while the rebuild is in progress.
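
A quick sketch of the failure arithmetic, assuming the lower 1M hour MTBF figure:

  # Expected annual failures, assuming 1M hours MTBF per drive.
  mtbf_hours = 1_000_000
  hours_per_year = 8_760
  annual_failure_rate = hours_per_year / mtbf_hours   # 0.876% per drive per year
  print(annual_failure_rate * 5_000)                  # 43.8 expected failures per year in 5000 drives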

Even if the overall system is sufficiently reliable for enterprise requirements, the complexity of operating a system with so many components, and the resultant rate at which individual component failures occur, is not pleasant, nor does it inspire confidence. For this reason, more memory was valuable in bringing system complexity down to a manageable level.

HDD IO Latency

Suppose that the transaction consists of several SQL statements. No single statement has an execution plan in which the estimated rows going into a key lookup or loop join is 25 rows or more. This is the point at which SQL Server will issue IO at high queue depth. Below this level, IO is issued one at a time.

The transaction incurs 100 IO × 5ms per IO = 0.5 sec in total IO latency. It does not matter if the storage system has hundreds of disk drives. IO is issued singly and sequentially, the second IO is not issued until the first completes, and so on.

There are also write IOs for the transaction, but the user does not see this latency, as writes are handled asynchronously by the lazy writer. Half of one second is noticeable in human perception, but not bad. There is also the worker time of 20ms, and whatever comes from network round-trips to the application.

The elapsed time (duration) to complete a transaction is the worker time plus the IO latency time (excluding application and network time here for simplicity). Consider the following:

  (20ms + 500ms) ÷ 20ms per thread = 26 worker threads per LP

Twenty-six threads per logical processor are needed to drive the system to maximum throughput. The interpretation is that 26 threads on one logical processor consume a combined total of 520 CPU-ms over a 520ms duration: each thread consumes 20ms of CPU and waits 500ms for IO. Ideally, we would like overall system utilization to be under 70% to avoid excessive queuing delays, but the design should be capable of sustaining maximum throughput.
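
A minimal sketch of this saturation model, using a hypothetical threads_per_lp helper with the worker time, IO count, and latency values assumed above:

  # Worker threads per logical processor needed to saturate the LP,
  # when IO within a transaction is issued serially (queue depth 1).
  def threads_per_lp(worker_ms, ios_per_tx, io_latency_ms):
      elapsed_ms = worker_ms + ios_per_tx * io_latency_ms
      return elapsed_ms / worker_ms

  print(threads_per_lp(20, 100, 5.0))    # 26.0 threads per LP on 15K HDD
  print(26 * 200)                        # 5200 threads across 200 LPs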

In the above, it was assumed that we can achieve any IOPS performance level simply by aggregating enough disks. This is not such a simple matter. SAN products themselves may be limited to 200K IOPS or so, depending on the product.

And also, the SAN vendor may have argued that with their 32GB of cache, you really don't need 2,000 HDDs; their magic will solve your IOPS problem. They ignored the fact that the database server has 256GB memory, most of which is buffer cache. So exactly what is in the 32GB cache on the SAN that is not in the 200GB buffer cache of the database server?

SSD Storage Array

Now suppose storage is on an All-Flash Array (like we need another acronym) of SSDs or whatever the vendor calls their NAND storage units. Under enterprise conditions of continuous operation, we cannot expect the 90K IOPS cited for client SSDs because that performance level only lasts for ten minutes or so until the pre-erased blocks run out.

An enterprise SSD typically has 28% or higher over-provisioning, and firmware settings tuned for consistency of performance rather than peak performance. SSDs on the SATA/SAS interface typically employ a NAND controller with 8 channels, and can probably sustain more than 10K IOPS. Now only 140 SSDs in RAID 5 are needed to support the 1.1M front-end IOPS (1.4M physical IOPS), though we might have more to meet capacity requirements.
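
Continuing the sketch, assuming 10K sustained IOPS per SSD against the 1.4M physical IOPS of the RAID 5 case above:

  # SSD count, assuming 10K sustained IOPS per enterprise SSD.
  iops_per_ssd = 10_000
  print(1_400_000 // iops_per_ssd)   # 140 SSDs for the RAID 5 physical IOPS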

The SSDs to support multi-million IOPS can fit in six 2U enclosures, fewer if enterprise storage vendors ditch the 15mm height that was driven by SFF HDDs in favor of a 7mm or even 5mm form factor. Or even less if we get rid of the case around the SSD, going with just the PCB and maybe a thin heat sink.

With luck, we might be looking at 1 or 2 failures per year. But in any case, RAID groups comprised of SSDs rebuild much faster than those of HDDs. And we probably have more than 2 or even 10 times the required IOPS capability, so the rebuild should not substantially affect performance, depending on the RAID controller settings. (Somebody needs to experiment on this!)

Suppose SSD IO latency is 0.2ms. Yes, SSD vendors cite around 60µsec, but in SQL Server we need to load 8KB pages into memory, so it is better to assume 0.2ms IO latency. The 100 IO issued serially now take 100 × 0.2ms = 20ms.

Transaction elapsed time (worker + IO, again, excluding network and app) is now 20 + 20 = 40ms. The number of threads needed for maximum throughput is now 40 ÷ 20 = 2 per logical processor.
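
Reusing the hypothetical threads_per_lp helper from the HDD sketch, now with 0.2ms latency:

  # Same model as the HDD sketch, with 0.2ms SSD latency per IO.
  print(threads_per_lp(20, 100, 0.2))    # 2.0 threads per LP
  print(2 * 200)                         # 400 threads across 200 LPs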

On the HDD storage, 26 threads per logical processor, or 5,200 threads total, were necessary to drive the system with 200 LPs to maximum theoretical throughput. Now on the SSD storage, only 400 threads are needed. There is overhead in managing a large number of threads. The dramatically lower latency improves both single thread performance and overall system performance, the latter by having far fewer threads to manage.

TPC-E 4×X7560 on HDD vs. 4×E7 on SSD

It is too bad that none of the system vendors bothered to publish TPC-E benchmark results on a common system, one with HDD and the other with SSD.

There is, however, an IBM result of 2,022.64 tpsE (2010-03-30) for a 4-way Xeon X7560 on HDD storage, to compare with an IBM result of 2,862.61 tpsE (2011-06-27) for a 4-way Xeon E7-4870 on SSD storage. Both systems had 1TB memory. The Nehalem-EX processor is 8-core, 2.26GHz, and the Westmere-EX is 10-core, 2.40GHz.

vendor   date         processor         cores   mem   storage   tpsE
IBM      2010-03-30   4×Xeon X7560      32      1TB   HDD       2,022.64
IBM      2011-06-11   4×Xeon E7-4870    40      1TB   SSD       2,862.61

These two processors should be very similar in IPC, as Westmere is a process shrink of Nehalem. The difference in core count is 25%, and in frequency, 6.19%. The TPC-E scaling between the two systems is 41.5%.

Assuming that performance scales linearly with core count (this is generous), we have an excess 13.2% performance gain. Some could be attributed to the higher CPU frequency, but the Memory Latency studies suggest that frequency scaling is weak for transaction processing.
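
The arithmetic, as a sketch:

  # TPC-E scaling between the two IBM results.
  scaling = 2862.61 / 2022.64       # ~1.415, a 41.5% gain
  cores = 40 / 32                   # 1.25
  freq = 2.40 / 2.26                # ~1.062
  print(scaling, freq, scaling / cores)   # excess over linear core scaling: ~1.132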

IO Cost Structure

On the NVMe website, there is a document by Amber Huffman of Intel showing that an AHCI IO has a cost of 25,000 CPU-cycles. This has to do with multiple uncacheable register accesses? (I am thinking these are registers on the SATA controller, not on the CPU?) In the IDF 2013 NVM-Express ... PCI-E SSD deck, slide 16, the SCSI/SAS versus NVMe overhead is:

stack      latency   cycles
SCSI/SAS   6.0 µs    19,500 CPU-cycles
NVMe       2.8 µs    9,100 CPU-cycles

A similar deck was presented at FMS, NVM-Express Overview.

The purpose of NVMe was to build an entirely new storage stack, one properly suited to the insanely high IOPS capability both of a single SSD device and of an array with a very large number of SSDs.

Ten years ago, I ran a disk test from SQL Server on a large storage array. From a single thread, SQL Server could drive 50,000 random IOPS. That was from a system with a Core 2 processor. The subsequent Nehalem and follow-on processor cores are much more powerful.

The SQL Server IO encompasses not only the bare IO, but also additional engine tasks, like picking a page to be evicted from the buffer cache. It might be possible to drive over 100,000 IOPS from the operating system driver with one thread or logical processor. But let's suppose that SQL Server IO is 50K IOPS per LP for now until someone can provide better details.

I have some old data from 2008, before Nehalem and following processor architectures, at IO Cost Structure. These tests were not done at saturation levels, so proper saturation level testing needs to be done. Roughly, the cost of an IO to disk is about 4-5 times higher than the cost of a Logical IO operation for data in memory. The ratio might be smaller when a transaction or higher level isolation is required, as that adds to the cost of the LIO.

Assume that physical IO is four times more expensive in CPU (worker) time. Consider the example where 80% of accesses are in memory and 20% require physical IO. For every 5 operations, 4 are LIO-only and 1 has a physical IO (with its accompanying LIO). The total CPU cost is then split evenly between the 4 pure LIO operations and the one PIO, for a total cost equivalent to 8 LIO. With all data in memory, the cost is 5, all LIO, versus 8 for the 80/20 case, so performance should be 1.6X the baseline.
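
A sketch of this cost model, assuming a relative cost of 1 for a logical IO and 4 for a physical IO (including its accompanying logical IO):

  # Relative CPU cost: logical IO (LIO) = 1, physical IO with its LIO = 4.
  lio, pio = 1, 4
  cost_80_20 = 4 * lio + 1 * pio     # 8, for 4 in-memory operations plus 1 disk operation
  cost_in_memory = 5 * lio           # 5, all in memory
  print(cost_80_20 / cost_in_memory) # 1.6X potential gain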

A poor performance gain from getting all data into memory might suggest that the baseline condition already had a very high percentage of frequently accessed data in memory.

All Data in Memory

Now return to our baseline scenario of 10K tps and 1M read IOPS on the 100 core, 200 LP system. The SQL Server worker time is 20µs per IO, corresponding to 50K IOPS per LP. The 1M read IOPS consume the resources of 20 LPs, or 10% of the 200 LPs.

Next, all the memory in the world is thrown into the system so that data is now entirely in memory. We still have the 100K write IOPS, but the 1M read IOPS are gone after some initial activity. Our performance should now be about 11,000 tps, from reclaiming the compute resources of the 20 LPs that were previously expended on read IO.

For better math, the 20ms per transaction is now reduced by 100 × 20µs = 2ms, so the new transaction worker time is 18ms. The theory is that performance is now 11,111 tps. Depending on the exact situation, we may have thrown terabytes of memory at the "IO problem" for an 11% performance gain? In the hard disk days, reducing IOPS to a level that could be met by a practical disk array would have been worth the money. In the relatively new era of SSDs, 1M IOPS is not a red flag. The overhead will be even less once NVMe is deployed.
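
The same result in sketch form, assuming 20µs of CPU (worker) time per read IO:

  # Throughput gain from eliminating read IO, assuming 20us CPU per read IO.
  worker_ms = 20.0
  read_io_cpu_ms = 100 * 0.020             # 2ms of the 20ms went to issuing reads
  print(10_000 * worker_ms / (worker_ms - read_io_cpu_ms))   # ~11,111 tps, an 11% gain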

This was based on an extremely simplified model of SQL Server IO. We should also consider that instead of running 26 threads per LP to drive the system to maximum CPU, we now need to run only 200 threads, one per LP, because there is no more IO latency. In actuality, we might run 2 threads per LP because there is network latency between the application and the database server, plus application rendering time.

The overhead of managing 400 threads is far lower than that of 5,200 threads. Our performance gain from putting the entire database into memory should therefore be somewhat more than 11%. Still, the performance difference between a system with bare memory and properly configured storage and one with the entire database in memory is not that much.

Where memory is extremely valuable is when there is a serious disconnect between the database requirements and the infrastructure (SAN) team's understanding of IO.

System Level Throughput versus Single Thread Performance

In the above, the argument was that overall throughput performance may not have a strong dependency on memory size, provided the storage system was capable of as many IOPS as necessary. It was always possible to configure HDD storage for high IOPS simply by aggregating an enormous number of disk drives. Of course, few people actually did this, so a situation in which performance was limited by IOPS capability was not uncommon.

A proper SSD configuration should have far more IOPS capability than necessary. On the old storage software stack, the CPU overhead of extraordinary IOPS might be significant, on the ballpark assumption of 50K IOPS per core from SQL Server. The expectation is that a full NVMe stack will lower the OS portion of the overhead, leaving just the SQL Server overhead.

Memory and IO Summary

An explanation was given as to why memory size was of critical importance a very long time ago. Those circumstances are now ancient history. Still, memory size was helpful in the more modern era of multi-GB memory to compensate for inadequacies of the storage system, whatever the reason. Memory is also useful for improving single thread performance, as hard disk latencies can be long in relation to worker time.

But the big picture is that humongous memory capacity has limited ability to generate meaningful throughput performance gains over a baseline that already has plenty of memory, assuming a proper storage system. The special circumstances, such as whether SQL Server issues IO at queue depth 1 or at high queue depth, are also a factor in the benefits of memory.

For anyone expecting very large performance gains on modern server systems, this can be achieved with various solutions. For transactions, Hekaton memory-optimized tables have potential. For DW, column-store indexes are an option. In the accompanying article, it is argued that Memory Latency is now the major factor for transaction processing performance in modern systems, over the long-standing belief in memory size.

Columnstore overcomes the memory issue by marching through memory sequentially. The memory access patterns are predictable, and the processor can be directed to pre-fetch memory. The first objective of Hekaton was to eliminate the overhead of locks, which generate their own set of serialized memory accesses. But notice that memory-optimized tables use hash indexes. The hash index pre-allocates memory based on the bucket count. This sacrifices memory efficiency when the table is partially populated, but has the benefit of requiring fewer serialized round-trip memory accesses than b-tree indexes.

Hekaton

I will not talk about Hekaton here, except for the following. If we knew the entire database would fit into memory, then we would have designed the database engine very differently, something like how it is done in Hekaton (coincidence?). Then very large performance gains can be achieved.

Appendix

The diagram below is from Bounding Worst-Case DRAM Performance on Multicore Processors, on Oak Central?
