
Too Much Memory (2018-09, in-progress)

2S_Broadwell_tpce1

"Too much memory" are words never uttered without being preceded by "There is no such thing as ..." And yet, here we are, with all evidence pointing in this direction. It was true long ago that more memory capacity was desperately desired and, as a corollary, that DRAM needed to be less expensive. In the 1970s, system memory capacity was single digit megabytes. Each doubling of memory had a large effect on IO, which in turn had ramifications on system operation. The rule of more memory was burned into our subconscious, and it was followed without question over the generations. System memory became gigabytes and now terabytes. Is it not time to re-evaluate faith in the concept of more memory?

The belief that more memory at lower cost will solve problems is widespread. Many semiconductor players are hard at work on translating one of the nonvolatile memory technologies into a product to fill the cost and density gap between DDR4 and NAND flash storage. This would have great value for applications that generate 1M IOPS when bound by current system memory capacities. Most of the IOPS could then be moved from storage to memory with a multiple orders of magnitude reduction in latency. But how many applications have this characteristic?

Simply putting in far more memory than really needed for an application should not have serious negative consequences (that cannot be corrected). The deeper problem is that setting false requirements and objectives points industry in the wrong direction for developing future products. Computer system architecture is about making a good set of compromises. Refusing to make a concession in one area, even one with little real impact, usually results in deficiencies in other areas, sometimes with significant consequences.

The large majority of memory in modern systems is used for only a tiny fraction of all memory accesses. The data touched by the very large majority of accesses could fit in a relatively compact subset of main memory. In this circumstance, lower memory latency has the greatest impact. A very different memory strategy than the ones currently employed and being pursued is needed.

Basic Assessment I: Memory Round-trip

There are two distinct memory characteristics that may require inherently different solutions. One is streaming memory access, which is a topic covered by other people elsewhere. The discussion here pertains to serialized round-trip memory access, usually originating from pointer-chasing code.

Consider the system example of a 2-socket Xeon E5 v4 (Broadwell) with 22-core processors. There are 44 cores total over both sockets. Hyper-Threading is enabled for 88 threads (or logical processors) total.

2S_Broadwell_tpce1

The terminology for this system in TPC benchmark reports is: 2 processors, 44 cores, and 88 threads. Intel uses the term "socket" in place of "processor" in some circumstances. Microsoft prefers the term "logical processor" over the Intel and TPC term "thread" in referring to each of the two apparent CPUs enabled by Hyper-Threading on a core. A processor thread is entirely different from a software thread. There is no convention that is free of potential confusion, and one must interpret each situation as best fits.

Assume for simplicity that local node memory latency is 90ns (actual is 93ns per 7-cpu) and remote node latency is 140ns. The TPC-E benchmark has been optimized for Non-Uniform Memory Access (NUMA). Memory accesses on a 2S system are 80% local and 20% remote, per Virtualization Performance Insights from TPC-VMS. Average memory latency with the 80/20 split is 100ns. Without NUMA optimizations, memory access is expected to be 50/50 local and remote, for an average latency of 115ns.

At 100ns round-trip latency, one thread can do 10M serialized round-trip memory accesses per second, in which one access must complete before the next access is issued. At a processor base frequency of 2.4GHz, a round-trip memory access is 240 cycles.
   The Xeon E5-2699 v4 22-core processor launched Q1'16 has base frequency 2.2GHz.
   Later in Q4'16, there was the E5-2699A v4 22-core at 2.4GHz.
   A mutex has approximately the same latency as a round-trip memory access, coincidence?
Assume that 5% of operations involve a round-trip memory access before the next step can be determined. For this exercise, disregard the effect of a mix of L1, L2 and L3 hits.

Out of 100 operations, 95 complete in a single cycle each. The 5 operations involving a memory access consume 240 cycles each, for a sub-total of 1,200 cycles. The full 100 operations complete in 1,295 cycles. In all, 7.7% of cycles in a given thread do actual work and 92.3% are no-ops waiting for memory.
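The arithmetic above can be expressed in a few lines of Python (a minimal sketch; the 5% memory-access ratio and 100ns average latency are the stated assumptions, not measurements):

  freq_ghz = 2.4          # processor base frequency (E5-2699A v4)
  latency_ns = 100        # average round-trip memory latency, 80/20 local/remote
  mem_cycles = latency_ns * freq_ghz                # ~240 cycles per serialized access
  ops, mem_ops = 100, 5                             # 5% of operations wait on memory
  cycles = (ops - mem_ops) + mem_ops * mem_cycles   # 95 + 1200 = 1295 cycles
  busy = ops / cycles
  print(round(busy, 3), round(1 - busy, 3))         # 0.077 0.923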

NoOps1

Intel Hyper-Threading (HT) is of degree 2. That is, there are two threads or logical processors for each core. In this example, the core is active in 15.4% of cycles, and idle for 84.6% of cycles.

NoOps1

The low core utilization per thread is the reason HT is highly effective on pointer-chasing code. The generic term for HT is simultaneous multi-threading (SMT). (Intel Nehalem and later implementations of HT decode from each thread on alternating cycles, different from that of the Pentium 4 era?) Recent IBM POWER processors implement SMT4 or SMT8. It is unclear why the Intel mainline processors, and specifically the server-oriented Xeons, remain at HT2. The Xeon Phi, based on the Atom core, is HT4?

Assessment II: Memory Bandwidth

Each thread in this example generates 9.23M serialized memory accesses/sec. The full set of 88 threads on 44 cores over 2 processor sockets generates 812M accesses/sec.

TPCE_mem_access

Each memory access fills a 64-byte processor cache line, so 812M accesses/sec translates to 52GB/s. The access size of 64 bytes also aligns with the DDR4 8-word prefetch standard on an 8-byte (64-bit) word.
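The same simple model gives the per-thread access rate and the resulting bandwidth; the small differences from the 9.23M and 812M figures above are just rounding within the model:

  freq_hz, mem_cycles, ops, mem_ops = 2.4e9, 240, 100, 5
  cycles = (ops - mem_ops) + mem_ops * mem_cycles   # 1295 cycles per 100 operations
  per_thread = mem_ops / cycles * freq_hz           # ~9.27M accesses/sec per thread
  system = per_thread * 88                          # ~815M accesses/sec over 88 threads
  print(round(system * 64 / 1e9))                   # ~52 (GB/s) at 64 bytes per access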

The Xeon E5 v4 has 4 DDR4 memory channels at up to 2400MT/s, with 8 total in the 2-way system. The 2400MT/s DDR4 channel has a nominal bandwidth of 19.2GB/s. The nominal aggregate bandwidth of 8 channels is 153.6GB/s.

The realizable bandwidth of the 2S Broadwell system, as reported in the Hot Chips 29 Intel Xeon Scalable Processor (Skylake-SP) presentation by Akhilesh Kumar, appears to be about 142GB/s.

HC29_SkylakeSP

The figure above is for 100% local memory read, sequential address, which is different from what is of interest here, random read and write. On the left, at low load, local memory latency is about 70ns. At 50GB/s load, it is 90ns.

When we speak of 1KB, we mean 1024 bytes. When we speak of data transfer rates in MT/s, we mean 1M decimal transfers per sec. It is necessary to understand when to apply the 2^10 definition of K and when the 10^3 definition. The kibibyte-KiB, mebibyte-MiB terminology is too late to the game, and irritating to old school computer people.

Memory accesses at 411M/s per processor (812M/s over two sockets) at 64 bytes per access work out to 26GB/s of memory bandwidth per processor:
  2 threads/core × 22 cores × 9.23M access/sec × 64 B/access = 26GB/s.

This is under both the processor nominal memory bandwidth of:
  4 channels × 2400MT/s × 8B = 76.8GB/s per socket
and the actual of 71GB/s, also per socket.

Some memory requests might involve one or more overlapped operations, or other variations that might result in higher actual bandwidth.

Remote memory access bandwidth is also well under the limits of the 2 QPI bi-directional links connecting the two Xeon E5 processors. The Xeon E5-2699 v4 QPI links operate at 9.6GT/s. Each QPI link is 20 bits wide per direction, for a net sustained data rate of 19.2GB/s per direction. Even with a 50/50 local/remote split, the QPI bandwidth is sufficient.

The 4-way Xeon E7 system, with each processor having one direct QPI connection to the other 3 processors, also has sufficient bandwidth for 25/75 local/remote split.

Assessment III: Database Row Access

A full model of database system operations is very complicated. An empty remote procedure call from application server to database server might be 60µsec. In the TPC-E workload, the transaction rate is in the neighborhood of 55 tpsE per thread. There are a total of 10 procedure calls per trade, for a call rate of 550 RPC/sec per thread. Each RPC then averages 1.82ms, so the network round-trip overhead is small.

An SQL statement to read a single row via index access might be on the order of 10µsec at read committed isolation level. A single statement accessing additional rows in different leaf level pages should have a lower incremental cost per row. Accessing additional rows in the same leaf level page is lower still.

Query costs at higher isolation levels (repeatable read and serializable) are significantly higher, as are write operations. A guess might be 20µsec for a single row SQL statement. As a rough order of magnitude, we might estimate that each thread drives 50,000 single row accesses per sec. The combined set of 88 threads drives on the order of 4.4M rows/sec.
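A rough sketch of the rates discussed above; the 55 tpsE per thread, 10 RPCs per trade, and 50,000 rows/sec per thread figures are the estimates from the text, not measurements:

  threads = 88
  tpsE_per_thread = 55          # approximate TPC-E rate per thread
  rpc_per_trade = 10            # procedure calls per trade
  rpc_rate = tpsE_per_thread * rpc_per_trade        # 550 RPC/sec per thread
  print(round(1000 / rpc_rate, 2))                  # ~1.82 ms average per RPC
  rows_per_thread = 50_000                          # rough single-row accesses/sec per thread
  print(rows_per_thread * threads / 1e6)            # 4.4 (million rows/sec system-wide)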

For each row in a separate leaf level page, memory accesses are required for the 96-byte page header, the slot array at the end of the page, and the row contents themselves, per Inside the Storage Engine: Anatomy of a page by Paul Randal. Presumably the page header and slot array accesses can be overlapped.

For a large table on the order of several billion rows with a compact index key, we can expect a b-tree index depth of 4. An index seek must navigate the root level, two intermediate levels and one leaf level page. In addition, several system table pages and other critical structures are accessed. Each index seek and row access touches many pages in addition to the page with the actual row data.
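A back-of-envelope count of the serialized memory touches behind one index seek, under the assumptions above (depth-4 b-tree; page header, slot array and row per page, with the header and slot array fetches overlapped), might look like the following. The counts are illustrative, not measured:

  btree_depth = 4              # root + two intermediate levels + leaf
  touches_per_page = 3         # page header, slot array, row (or index entry)
  overlapped = 1               # assume header and slot array fetches overlap
  serialized = btree_depth * (touches_per_page - overlapped)
  print(serialized)            # ~8 serialized accesses, before system table pages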

Assessment IV: IO

The last TPC-E report with HDD storage for the data files was sometime in 2011? The HP ProLiant DL380 G7 report of May 11, 2010 was 1110.11 tpsE. System configuration was 2S Xeon 5680 (Westmere-EP) 6-core processors, 96GB memory, and 528 15K HDDs over 6 controllers in RAID 1+0 for data. We can surmise that physical IO was 175 IOPS per 15K HDD, for 92,400 IOPS. Initial database size was 4,791GB.

The Lenovo System x3650 M5 TPC-E report of Mar 31, 2016 was 4938.14 tpsE for 2 × Xeon E5-2699 v4 22-core processors with 512GB memory, SSD storage, and an initial database size of 20,518GB. Given the similar data-to-memory ratio, the physical IOPS level on this system is expected to be around 415K IOPS.

From the earlier rough estimate of over 4M rows accessed per sec, we can make two deductions. One is that the really critical data structures are in memory, meaning system tables and index upper levels. Second, most of the active data leaf level pages are also in memory, perhaps 90% (based on 400K IOPS and 4M rows/sec). This is with 512GB system memory, probably 490GB for SQL Server, most of which is buffer cache, against a 20TB database.
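The ~90% figure is just the fraction of row accesses that did not require a physical read, assuming roughly one read IO per leaf page miss:

  rows_per_sec = 4.0e6     # rough row access rate from Assessment III
  read_iops = 0.4e6        # rough physical read rate inferred above
  print(1 - read_iops / rows_per_sec)    # 0.9, i.e. ~90% of leaf accesses served from memory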

Memory Access Distribution

In general, memory accesses are expected to have non-uniform distribution. If we were to organize memory regions by access frequency, then the distribution of access frequency would have a shape something like the figure shown below.

MemoryAccess

The dimensions shown above are artificial. The vertical scale labels are only meant to convey that the total access volume on the 44-core, 88-thread system is expected to be somewhat under 1B/s.

The contents of each region may be mixed. In a database system, the executable binaries could be mostly in the extreme region to the left. System tables, including locks, might be in this or the next region. Index root and intermediate levels could be in the third group. The most active data leaf level pages come next, followed by other active pages in the last region.

What is important are the relative dimensions of the regions on both axes, frequency and size. The expectation for database systems is that the very high access frequency elements are small in comparison to typical modern server system memory configurations (>512GB), but large enough to be well outside the capacity of modern processor caches (1MB L2, 30-60MB L3).

Earlier, we estimated that the 2S 44-core, 88-thread system is driving 812M memory accesses per sec or other (memory latency dependent?) operations such as mutexes. It is then obvious that only a very small percentage of memory accesses go to the leaf level data pages.

In other words, the very large majority of memory accesses are to other structures, such as executable binaries, system tables, index root and intermediate levels, etc. It is also not difficult to realize that these hot regions of memory constitute a small subset of system memory. Furthermore, many of the leaf level data page accesses are to hot data, which is also generally small compared to modern system memory configurations.

TPC-E: IO and Memory

The TPC-E benchmark requires the database initial size to scale with performance at about 3.85GB per tpsE or higher. This ratio happens to put the database size much larger than system memory, given processor capability and memory capacity limits. Many production database systems run with active data largely or entirely within memory.

There are published TPC-E reports with memory configurations that span a wide range of data to memory ratios, from as low as 6.5:1 to as high as 50:1. See TPC-E Benchmark Review. The larger memory configurations in relation to data size are in the 4S systems having the SMB memory expander.

None of these show any indication of larger memory noticeably improving performance, i.e., through a significant reduction in IO. This is a strong indication that modern systems are operating at the far right of the memory access distribution graph.

Database IO and Memory

In principle, the active data set of a database can be much larger than the system memory allocated to buffer data. When active data is larger than memory, there is IO read activity to the data files. Even if all data is in memory, there is still IO write activity to the data files.

The IO rate versus memory size characteristic of a transaction processing database should have the same general shape of the memory access frequency distribution shown earlier. The shape of the IO versus memory curve can be measured by load testing a representative (or actual) workload over a range of memory settings in which the storage system is not performance limiting.

This is a simple test that would provide important understanding of the true memory requirement. But this type of test is rarely done because it is easier to fill the memory slots. Usually, IO volume is not an issue, and no one worries about the shape of the curve.

In the 1980's, it was difficult to achieve reliable operation at the 10,000 IOPS level from an HDD array. By the early 2000's, storage arrays with one thousand HDDs could operate reliably at over 100,000 IOPS. A good all-flash array (comprised of multiple units) today can support over 1M IOPS. The more interesting number is actually IOPS at low latency, <1ms, perhaps even 100µs.

Of course, what is possible in a correctly configured storage system may be, and usually is, different from what is seen in the production system. It is also necessary to verify that the system is not bandwidth limited, particularly for Fibre Channel, where the distinction between bits and bytes is often confused.

Each database read IO for a page will be followed by several memory accesses, some of which are serialized. Assume that (almost) all IO is for leaf level data page accesses. Excluding the operations involved in loading the page into memory, assume one IO represents between 4 and 10 memory accesses. It is apparent then that what constitutes a very heavy IO load (1M IOPS) might represent only 1% of the overall memory access volume.
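Expressed as arithmetic, with the 4 to 10 memory accesses per IO being the assumption just stated:

  iops = 1.0e6                     # a very heavy IO load
  total_accesses = 812e6           # system-wide serialized accesses/sec from Assessment II
  for per_io in (4, 10):           # assumed memory accesses per page IO
      print(round(iops * per_io / total_accesses * 100, 1))   # 0.5 ... 1.2 (percent)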

Memory Configuration Practice

Consider again our 2S Xeon E5 v4 system (1S is shown for simplicity). In 2016, the typical memory configuration might have been 16 × 32GB = 512GB (8 DIMMs per socket).

1S_Skylake

At the time, the 64GB DIMM was over $3000, versus around $1000 for the 32GB DIMM. Today, the 64GB DIMM is around $1000.

A technical justification for the memory configuration would involve showing either improved performance or a memory-storage cost balance. In practice, almost everyone fills the DIMM slots with the largest practical capacity module (not having a huge per GB premium) without technical analysis.

In principle, a proper solid-state storage array, with data files distributed over multiple RAID controllers or HBAs, each comprising many SSDs, and logs on a separate controller/HBA with its own physical storage units, can support over 1M IOPS.

In practice, real-world production database systems generating 1M IOPS are very infrequent? More common are environments operating in the 10-100K IOPS range. Of these, more are at the lower end for reads from data files with proper tuning (because database people like massive memory overkill). IOPS at the higher end are typically the result of employing row GUIDs in key (or all) tables. In the old days of HDD storage, prolific use of GUIDs could have been a death penalty. Now, the IOPS load easily fits within the capability of properly configured all-flash arrays.

Regardless of what the storage system can support, there is value if IO can be reduced from 1M to 100K IOPS. There is CPU overhead in evicting a page from the buffer cache and for the IO itself. Also, if the entire database can be put into memory, then memory-optimized tables with lock-free operation becomes an option, further reducing CPU overhead.

If the amount of memory necessary to do this were 10TB, then we might be interested in a tiered memory system with some other technology having lower cost than mainstream DDR4 SDRAM. And if the access time of the second tier memory were somewhat higher, it is not an issue, because only a tiny fraction of accesses go there. This is what database people mean when they say memory capacity-cost is an important issue.

Memory - the 99% Strategy

Every piece of information we have indicates that the call for more memory capacity at the system level may have benefits in lower IO. But most probably, this only affects on the order of 1% of memory accesses. Furthermore, there is reason to believe that 99% or more of memory accesses go to a region that is but a small fraction of modern server system memory capacity. What then is the better strategy to pursue from here?

Given that the processor cores in transaction processing databases spend most of their cycles waiting for memory, we should be interested in memory with lower latency. Increasing the degree of HT from 2 to 4 is a work-around for the low core utilization issue, but increasing thread-level performance will always have the higher value.

Memory technologies with low latency already exist, in product form as well. All are more expensive, and will probably result in a system with lower capacity if only a single memory technology is employed.

Accepting a smaller memory capacity should have no consequential negative effects, even though this runs counter to the old rule that more memory is better. This was a good rule in the long ago past, when "thou shalt have more memory" was engraved on tablets at some mountain beside a bush on fire but not burnt. But there have been multiple orders of magnitude change in the underlying parameters since then, and what was once true is not true anymore.

TieredMemory

A viable system can have low latency memory at smaller capacity so long as the resulting IO is not too high, perhaps around the 300K IOPS level.

MemoryAccess

But there is no reason we cannot have multiple memory tiers. DDR4 may be expensive compared to NAND, but it is cheap compared to the new tier-1 memory.

Of the six memory channels of Xeon SP, potentially four could be employed for low latency memory and two for mainstream DDR4. Or two channels could carry a mix of DDR4 and some form of non-volatile memory, for example Optane.

TieredMemory

The tiered approach has great flexibility in achieving a balance of low average memory latency and cost, accounting for any differences in IO.

MemoryAccess

Xeon Phi - Knights Landing (KNL) Example

Another option would be something similar to that employed in Knights Landing (KNL), in which specialty memory is in the processor package itself. Below is a representation of Xeon Phi x200, aka KNL.

KNL_2

There are six DDR4 memory channels going outside of the processor package. There are also 8 channels to high bandwidth memory that is off-chip but inside the processor package. KNL has the same number of off-package signals (3647) as Xeon SP. (Note: KNL has 36 PCI-E lanes, versus 48 in Xeon SP, and no UPI. Presumably, there are more ground and power pins?) While the packaging is sophisticated, it is not excessively expensive in relation to the objective.

Note: KNL memory latency is reported by John D. McCalpin at U-Texas-Austin to be just over 150ns for MCDRAM and 130ns for DDR4. Presumably this is a consequence of KNL and MCDRAM being designed for extremely high bandwidth on streaming memory workloads. The objective of our system is very low memory latency for pointer-chasing workloads.

At Wikichips, there is a discussion of the Hot Chips 30: Intel Kaby Lake G session. The product is an Intel Kaby Lake H processor with AMD Radeon RX Vega graphics and 4GB HBM2 memory, all in one package.

hc30-kblg-emib

The connection between the GPU and HBM memory uses the Intel EMIB technology, which allows smaller wires. The connection between the CPU and GPU uses fat wires, because the PCI-E interface already exists on both devices, with supporting circuitry to drive signals both off-chip and off-package. It is possible that a future product might have an interconnect using EMIB?

Presumably, if we were to have a processor with on package memory, the use of EMIB would allow a very wide interface?

So Why Not?

If low latency memory is a good idea, then why has it not been studied and even put into production already? Well, there are complications. It is almost universally accepted that real servers are multi-processor.

2S_Skylake

Since about 2010, Intel processors have had integrated memory controllers (IMC), and AMD processors for longer still. There are good reasons for having the IMC. But it also means that multi-processor systems are inherently NUMA. Also, the Xeon E7 had the SMB to double memory capacity, but this added an extra hop in the path. In Xeon SP, the SMB is generally not used anymore?

Ideally, software should be architected to achieve a high degree of memory locality on NUMA systems. This was done for both the TPC-C and TPC-E benchmarks. The number of real-world production systems properly optimized for NUMA may be in the single digits?

In an application not optimized for NUMA, the expectation is that memory access is randomly, and somewhat evenly, distributed between memory nodes. This is 50/50 to local and remote nodes at 2S. On the 2S Xeon SP, average memory latency is expected to be 114ns, based on 89ns local and 139ns remote node access for a workload not optimized for NUMA.

Of this, perhaps 46ns (tRC) is at the DRAM interface. (Some state that random access latency is tRCD + CAS + transfer time, about 31.5ns, but sustained operations should be tRC?)

Suppose a low latency memory technology has half the density of DDR4 SDRAM, for example 4Gbit versus 8Gbit. In other words, perhaps 2X (or greater) cost. And suppose we are restricted to a single module per memory channel. Then our processor would connect to 6 low latency 32GB memory modules (or 4, plus 2 channels of 2×64GB DDR4).

If tRC were 16ns at the memory chip, then latency might be 59ns for local node and 109ns for remote node. Average memory latency for software not optimized for NUMA would then be 84ns.

The reduction in average memory latency might result in 33% performance gain (based on 5% of code being serialized memory access). This should be attractive, but apparently is not sufficiently compelling?
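The 33% figure falls out of the same simple model used in Assessment I: per-thread throughput is inversely proportional to the cycles needed per 100 operations, 5 of which are serialized memory accesses. A minimal sketch:

  def rate(latency_ns, freq_ghz=2.4, ops=100, mem_ops=5):
      # relative operations/sec for one thread at the given round-trip latency
      return ops / ((ops - mem_ops) + mem_ops * latency_ns * freq_ghz)

  print(round(rate(84) / rate(114), 2))    # ~1.33, i.e. ~33% per-thread gain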

The Single Processor System

In Multi-Processors Must Die, I argued that the multi-processor server concept is an antiquated relic of the past, and one that should be largely abandoned except in special circumstances, such as workloads properly architected for NUMA.

This means the server for most situations is a single processor system having all memory local. The curious thing is that memory latency of the 1S Broadwell system is 75ns, compared to 93ns for local node memory in the 2S system.

1S_Skylake

In Notes on NUMA Architecture, by Leonardo Borges, Intel Software Conference 2014 Brazil, it is suggested that the 2S system does not impose a penalty on local node memory access?

On the assumption that memory latency really is 76ns on 1S Xeon SP (1ns higher than Broadwell due to the L3), the expectation in stepping down from 2S to 1S is that thread-level performance increases by 45% (average memory latency reduced from 114ns in 2S to 76ns in 1S).

System level throughput decreases by only 27%, despite having half the total number of threads and cores, for an application not optimized for NUMA, which applies to most situations.

Note that this is only a projection based on a very simple model of performance largely governed by round-trip memory latency. Still, the poor scaling from 1S to 2S is generally not appreciated. Only one TPC-E report for a 1S system has been published, and that was in 2007 on an Intel processor before the IMC.

The simple model for 80/20 local/remote (the reported TPC-E mix) gives 2S scaling of 1.57×, which is not bad, and probably worthwhile. The model predicts 1.91× scaling from 2S to 4S with a 70/30 local/remote split, versus an actual of 1.72×. So obviously the simple model is overly simple and has serious limitations.
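These scaling figures follow from the same latency-driven model; a sketch is below. Exact results depend on which latency values are assumed, hence the small difference from the 1.57× quoted above:

  def rate(latency_ns, freq_ghz=2.4, ops=100, mem_ops=5):
      return ops / ((ops - mem_ops) + mem_ops * latency_ns * freq_ghz)

  # per-thread: 1S at 76ns versus 2S at 114ns (50/50 local/remote, no NUMA optimization)
  print(round(rate(76) / rate(114), 2))            # ~1.45, the 45% per-thread gain
  # system throughput: one socket of threads at 76ns versus two sockets at 114ns
  print(round(rate(76) / (2 * rate(114)), 2))      # ~0.73, i.e. ~27% lower
  # scaling from 1S to 2S with the 80/20 mix (100ns average on 2S)
  print(round(2 * rate(100) / rate(76), 2))        # ~1.56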

Even so, there is sufficient actual data to support the 1S as the preferred system for most situations. That being the case, we can now re-evaluate the impact of low-latency memory.

For a low latency memory product having 16ns latency at the die interface versus 46ns for DDR4, the system memory latency is expected to be reduced from 76ns to 46ns. In the simple model, the projected impact is a 1.56× increase over the same 1S at DDR4 latency.

That, combined with per-thread performance on 1S with DDR4 already being 45% better than on 2S with DDR4, yields system-level performance just over 10% better than the 2S DDR4 system. Now, on a 1S system, low latency memory looks very attractive.

DRAM in Brief

DRAM vendors can best speak for the high-volume production cost structure of various low latency memory technologies such as RLDRAM, eDRAM and even SRAM. In brief, one major aspect of DRAM latency has to do with the bank size. Another is the multiplexed row and column addressing. Below is a Samsung DDR3 1Gb die image from Chipworks (Tech Insights).

DDR3_Samsung1Gb_DDR3

DDR3, manufactured in 1-8Gb densities, has 8 banks. DDR4, at 4, 8 and now 16Gb densities, has 16 banks. At 8Gb and 16 banks, the bank size is 512Mb.

The Micron 8Gb 2Gx4 has banks organized as 131,072 rows x 128 columns x 8 x 4. The value 8, second from the right, is the burst length, and 4 at the end is the word size. The DRAM chip has 21 address signals, 4 for bank and 17 for the row.

The column address is 10 bits, though only 7 are passed to the decoder. The lower 3-bits of the column address may be used to determine the order of the 8-word burst. It is clear that multiplexed addressing in DRAM is a relic of the past.
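The organization can be checked with simple arithmetic; the figures below are just those quoted from the datasheet description above:

  rows, cols, burst, word_bits = 131_072, 128, 8, 4
  bank_bits = rows * cols * burst * word_bits       # bits per bank
  print(bank_bits // 2**20)                         # 512 (Mb per bank)
  print(bank_bits * 16 // 2**30)                    # 8 (Gb for 16 banks)
  print((rows - 1).bit_length(), (16 - 1).bit_length())   # 17 row bits, 4 bank bits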

Lower latency can be achieved by reducing the bank size. This means a larger portion of the DRAM die is comprised of control elements, and in turn results in higher cost. The number of banks did increase over the generations from 4 in SDRAM to DDR and early DDR2, to 8 in later DDR2 and DDR3, and then to 16 in DDR4.

But bank size also grew from 64Mb to 1Gb (16 banks at 16Gb density) over this period. DRAM vendors were told that latency was not sufficiently important to justify a cost increase. Bandwidth growth did justify some cost increase. In part, this was because so much of the latency came from elements outside of the DRAM that the benefit of lower latency at the chip was diminished.

See Onur Mutlu ACACES Summer School 2018, Lecture 5 Low-Latency Memory.

Memory Economics

Here, we will mostly focus on the equivalent value from the user perspective. Low latency memory that can produce performance on 1S comparable to that of 2S with DDR4 should have value equal to that of the second processor, if not higher.

One question is: which processor? At the low end of Xeon SP, it could be $200. Considering that we are targeting the most difficult problem, the processor in question is then the high-end 28-core Xeon SP 8180 at $10,009.

The second question is how much low latency memory is required? We could propose technical analysis based on a sequence of considerations. But the long standing practice for high value projects is to get the biggest processor, then fill the memory slots with the biggest DIMMs.

Assume that our system has 4 memory channels for low latency memory, each of which can only accept one module (also to reduce latency?). Further, we presume that the largest capacity module is 32GB, whereas 64 or 128GB is possible with DDR4. So, the presumption is that this system will be configured with 4 × 32GB modules for 128GB.

It will not matter if technical analysis says we only need 64 or just 32GB of the more expensive low latency memory. In the database world, we practice massive overkill on memory.

One more factor is this. Regardless of system level throughput at high thread-level parallelism, there is also value in better single thread performance. This is why AMD Opteron has been at a major disadvantage to Intel in server processors for the last 8-10 years. On this basis, a cost structure of up to $100/GB is not unreasonable if a 50% reduction in latency at the system level can be achieved.

We could also factor in software licensing. The major RDBMS products are licensed on a per core basis. For the Enterprise Edition, the discounted price might be $3,500 per core, working out to about $100K per 28-core processor.

From the hardware point of view, the most cost-effective strategy is a tiered structure of low latency memory sufficient to cover more than 90% of memory accesses, mainstream DRAM to cover 9% and some form of lower cost memory to cover the remaining 1%. The expectation is that low latency memory capacity less than one-quarter that of DDR4 is sufficient to this task.
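Under such a split, the blended latency is a simple weighted average. The tier latencies below are purely illustrative assumptions (a fast tier, DDR4, and a slower capacity tier), not product specifications:

  # (fraction of accesses, latency in ns) - illustrative assumptions only
  tiers = [(0.90, 60), (0.09, 90), (0.01, 350)]
  blended = sum(frac * ns for frac, ns in tiers)
  print(round(blended, 1))     # 65.6 ns blended, versus ~90 ns if all accesses went to DDR4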

For the tiered-memory system to work, a new version of software is needed to recognize tiered memory and have some strategy to use each tier accordingly. As this new system will have substantially better performance than existing systems, database vendors will probably raise software licensing to reflect the new value.

In this regard, a system with a single tier of all fast memory, around 512GB, under a budget of perhaps $50,000, running the existing software version unaware of tiered memory, might also be a viable product.

Memory Summary

Memory capacity and price were once the primary driving objectives for DRAM as main memory in computer systems. Over forty years, capacity grew from megabytes to gigabytes and now to terabytes. The requirements have been met, exceeded, exceeded with overkill and then exceeded with ridiculous overkill. The bulk of system memory today is being used to cache storage accessed at extremely low frequency by DRAM standards.

In the modern era, some applications desire streaming memory bandwidth. Others, prominently database transaction processing, are largely characterized by round-trip memory latency. There are memory products and systems with extreme memory bandwidth. Memory products with low latency also exist, but these are typically used in network switches, with nothing that can be used in server systems.

There appear to be insurmountable obstacles to implementing low latency memory for server systems. But these turn out to be obstacles of our own making, in the antiquated belief that servers are multi-processor systems and that memory capacity must be ridiculously huge. The old concepts are now not only obsolete, but also a serious liability. By abandoning the multi-processor for a single processor system, low latency memory becomes very viable. The second key is to realize that memory capacity does not need to be anywhere near as large as it is in current DDR4 systems.

 

References

qdpma.com: Too Much Memory, TPC-E Benchmark Review

Anandtech:
The Intel Xeon E5 v4 Review: Testing Broadwell-EP With Demanding Server Workloads by Johan De Gelas on March 31, 2016,
Sizing Up Servers: Intel's Skylake-SP Xeon versus AMD's EPYC 7000 - The Server CPU Battle of the Decade? by Johan De Gelas & Ian Cutress on July 11, 2017,
and Dissecting Intel's EPYC Benchmarks: Performance Through the Lens of Competitive Analysis by Johan De Gelas & Ian Cutress on November 28, 2017.

Intel Software:
Intel Xeon Processor E5-2600 V4 Product Family Technical Overview By David Mulnix (Intel), April 19, 2016
Intel Xeon Processor Scalable Family Technical Overview By David Mulnix (Intel), July 10, 2017, updated September 14, 2017

SQLskills - Paul Randal:
Inside the Storage Engine: IAM pages, IAM chains, and allocation units
Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps

blogs.msdn.microsoft.com:
Platform layer for SQL Server, Slava Oks July 20, 2005

Onur Mutlu, Professor of Computer Science at ETH Zurich
website, lecture-videos

specifically, ACACES Summer School 2018, Lecture 5 Low-Latency Memory
DRAM memory latency: temperature, row location, voltage,

Onur says in Lecture 5: Low-Latency Memory, at around the 49 minute point, that 7.5ns (for each of the three: CAS, RCD, RP) is possible? versus the common 14.5ns?

Graphics etc.
Anandtech The NVIDIA Turing GPU Architecture Deep Dive: Prelude to GeForce RTX, by Nate Oh on September 14, 2018.
nVidia Turing TU102 754mm2, TU104 545, TU106 445, Pascal 471mm2
$999-1199, 699-799, 499-599

Anandtech NVIDIA Volta Unveiled: GV100 GPU NVIDIA Volta Unveiled: GV100 GPU and Tesla V100 Accelerator Announced by Ryan Smith on May 10, 2017.
815mm2, $18K

 

Memory-Optimized Tables

SQL Server Hekaton, the codename for memory-optimized tables, better known by the misnomer in-memory database, is in large part another method of addressing the memory latency issue.

People might say it is about getting rid of disk IO, but memory-optimized tables have advantages over a traditional database that happens to fit in memory.

There are two basic elements in the SQL Server implementation of Hekaton. One is lock-free operation, with ACID ensured via multi-version concurrency control (MVCC, Per-Åke Larson). The second basic element is the use of hash indexes instead of b-trees. The hash index requires pre-allocation of memory for the estimated number of buckets, and has the advantage of needing fewer memory round-trips than a b-tree index. Both elements reduce memory round-trips.

It was suggested that use of memory-optimized tables might contribute to a 3X increase in performance. This could be interpreted as needing one-third the number of memory round-trips as the traditional database engine with b-tree indexes and locks. It is also possible that the 3X gain is only (or mostly) achieved in write operations.

More performance gain is possible from this point through the use of natively compiled procedures.

 

Which gap is more important? Are people looking at the wrong gap?

PersistentMem

 

L3 Comments

If we pursue memory latency reduction, then it must be done across the full stack. The first step is to abandon multi-processing, as in multiple processor sockets and even multi-die processors in one socket. Low latency memory attached directly to a processor with an integrated memory controller then becomes highly effective.

Then another source becomes a significant contributor, that being the L3. Given the large 1MB L2 on Xeon SP, the L3 hit contribution might be too weak to justify incurring 19.5ns of delay. So perhaps the request should be sent immediately to the memory controller?

The Xeon SP based on the XCC die already has sub-NUMA clustering, in which each half can behave as a NUMA node. In the new model, perhaps there are more tier-1 memory controllers, with each group of 4 or so cores sharing a memory controller with its own L3. And there are still 2 tier-2 memory controllers on the entire die connecting to DDR4?

 

Main Points

Memory latency is the high impact parameter in modern systems. All the technology for lower latency memory already exists. Yet there are obstacles of our own making. DRAM with low latency is simply a matter of trading off price, and potentially capacity.

One reason we are reluctant to make this trade-off is the belief that database systems need ridiculously huge memory capacity. This is in fact a deeply flawed perception, and every bit of evidence points to database systems running just fine on merely huge or even just large memory.

The second obstacle is that the impact of low latency memory is muted, because much of the system level memory latency occurs outside the DRAM chip. A large part of this occurs in multi-processor systems, in which memory access is non-uniform, and few database systems have been architected to achieve memory locality on NUMA systems.

But why are data center standard systems multi-processor? This also turns out to be a once reasonable but now obsolete concept. Almost all transaction processing workloads can run just fine on a single processor. The Intel Xeon SP now has up to 28 cores, with 2-way Hyper-Threading. Performance per thread on the single processor is better than on the multi-processor, more so for environments without effective NUMA optimization, which is almost everyone.

Now that we have established that multi-processor is not necessary, our system is single processor. On this system, low latency memory has high impact. And the system will run just fine with much less memory than possible with conventional DDR4 SDRAM. Perhaps 128GB of low latency memory is enough, and the budget for better memory is very high.

 

Rule of Thumb?