
SRAM as Main Memory (2018-02)

The accompanying article discussed the critical impact of round-trip memory latency on applications with even a small amount of pointer-chasing code, prominently database transaction processing. The issue has become so important that server sizing strategy should now center on minimizing memory latency. This in turn leads to the single processor (socket) system, eliminating not only remote node memory access but also the remote node cache coherency check. The other avenue is in the memory technology itself. SRAM as main memory has been proposed before, but presumably no one has made a successful case outside of the Cray-1S. Otherwise there would be a (current) product along this line.

Case for SRAM as Main Memory - short version

SRAM can have very fast access times in small arrays, but in a large memory array there will be delays. Let us suppose for this discussion that 8-32GB of SRAM can reduce average memory latency by 30-50% compared to a DRAM-only memory system. Suppose further that the end user cost of SRAM is $1600/GB, about 100 times more expensive than DRAM. It might seem that such a huge cost disparity could not possibly be justified by a moderate reduction in memory latency. However, from the perspective of scaling database transaction processing throughput, SRAM at this cost is viable and even attractive.

The Intel Xeon SP processors with 28 cores are priced at $8,700 and $10,000, excluding the M models. Scaling up a single processor to 2-way and 4-way multi-processor systems is expected to yield throughput gains of 1.4X and 2X respectively. This is equivalent to the gain expected from reducing average memory latency on the single processor system by 30% and 50%. In other words, a budget of $10,000 for 30% lower memory latency, and $30,000 for 50%, is cost equivalent to the accepted multi-processor scaling solution.
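As a rough check of this equivalence, the arithmetic can be laid out as a short calculation. This is only an illustrative sketch in Python; the processor price and scaling factors are the assumptions stated above.

    # Hypothetical cost-equivalence check using the figures cited above:
    # a ~$10,000 28-core Xeon SP, 1.4X scaling at 2 sockets, 2X at 4 sockets.
    cpu_price = 10000            # high-bin 28-core processor (assumption from text)
    scaling = {2: 1.4, 4: 2.0}   # expected multi-processor throughput scaling

    for sockets, gain in scaling.items():
        extra_cost = (sockets - 1) * cpu_price
        # latency reduction on one socket giving the same gain: t_new = t_old / gain
        reduction = 1 - 1 / gain
        print(f"{sockets}-way: +${extra_cost:,} for {gain}X, equivalent to "
              f"{reduction:.0%} lower memory latency on one socket")
    # prints ~29% and 50%, matching the 30% and 50% figures above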

Does 1.4X scaling from 1-socket to 2S and 2X for 1S to 4S seem low? Surely there are benchmarks that show otherwise? Published TPC-E benchmark results show that scaling from 2-way to 4-way for the Xeon 8180 28-core processors is 1.72X. There are no results for single processor to 2-socket.

But the TPC-E committee made great efforts to achieve a high degree of memory locality on a non-uniform memory access (NUMA) system. If we were to survey real world environments on how much memory locality was achieved by architecting the database and application for NUMA, the reply would be: what is NUMA? There is also a large increase in average memory latency from the single processor to the 2-way system, in both local node and remote node memory access.

If a system with SRAM as main memory only achieves throughput comparable to existing solutions at similar cost, is there a need for a new approach? Modern servers are complex entities that aggregate very many cores and threads (logical processors via Hyper-Threading, AKA SMT). Most line-of-business (LOB) software was developed more than 10 years ago, when processors had a single core or during the early days of multi-core.

Software developed blind to the intricacies of the underlying system architecture does not work well by accident, and will not be able to leverage the hardware's full capabilities. A database that seems to scale well to 4 or 8 cores most likely will not scale anywhere close to 112 cores and 224 logical processors. This is why serious performance problems still occur on what appears to be incredibly powerful hardware.

Organizations spend tens or hundreds of millions implementing their line-of-business (transaction processing) systems. Attempting to scale up is likely to cause as many problems as it solves, or more. When a performance problem is discovered late in the process, hardware that can solve or mask the deficiencies of the software architecture is beyond price.

The single processor system with SRAM as memory is less complex. Thread level performance is better. With fewer total cores and threads (logical processors), full performance is achieved with fewer concurrent threads, resulting in less contention between threads. Memory access is uniform, avoiding serious NUMA anomalies. The system with SRAM as memory is better able to provide a brute force hardware solution to problematic software than existing platforms.

Preliminary

The focus here is on a rough estimate of the impact of SRAM as memory for database transaction processing. This is just one of many applications, but it is an important one, and one that can justify almost any price for hardware that fixes problems of deficient software architecture. From this, we can establish an equivalent value for SRAM memory, and determine whether the value outweighs the cost. The details of how SRAM as main memory is actually implemented, in both the hardware and the software stack, are left open. Examples cited here only serve to facilitate the discussion.

SRAM Density

It is widely accepted that cost is the huge obstacle to SRAM as main memory. Not only is SRAM expensive, but computer systems, especially those intended for database servers, are configured with enormous DRAM capacity. That capacity is misunderstood to be a requirement, when it is really the result of years of pursuing capacity for a reason that has nearly been forgotten.

Mainstream DRAM has been optimized for cost over many generations. As such, the cost difference between DRAM and SRAM is much higher than suggested by a DRAM bit cell being a single transistor versus the typical 6 transistors of SRAM. It is said that signal routing impacts SRAM cell density more than the individual transistor size does. There are alternative technologies with cost and benefit intermediate between SRAM and conventional DRAM; an example is Reduced Latency DRAM (RLDRAM).

A visual inspection of the Intel Skylake die shown on Wikichip suggests that 1MB of L3, data and ECC, is about 1.2mm2. It is assumed that the structures above and below the ring interconnect are the L3 tags.

Skylake_2core_group2

The Intel 14nm process presentation gives the SRAM (bit) cell as 0.0588µm2. Then 1MB with ECC is 1024 × 1024 × (8+1) cells, or about 0.55mm2 of raw cell area. See the Micro 48 keynote by J. Thomas Pawlowski of Micron for more on memory density. The Micron 8Gbit DDR4 SDRAM has package dimensions of 7.5mm x 11mm, so die size must be less than the package area of 82.5mm2. DRAM density is then greater than 11MB per mm2, inclusive of ECC.
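The density figures can be reproduced with simple arithmetic. The sketch below (in Python) uses the 0.0588µm2 cell and the Micron package dimensions cited above; it counts raw cell area only, with no allowance for tags, sense amps, routing or spare banks.

    # SRAM vs DRAM density, raw cell area only (inputs from the text above)
    sram_cell_um2 = 0.0588                  # Intel 14nm SRAM bit cell
    cells_per_mb = 1024 * 1024 * (8 + 1)    # 1MB of data plus ECC bits
    sram_mm2_per_mb = cells_per_mb * sram_cell_um2 / 1e6
    print(f"SRAM: {sram_mm2_per_mb:.2f} mm2 per MB "
          f"({1 / sram_mm2_per_mb:.1f} MB per mm2)")    # ~0.55 mm2/MB, ~1.8 MB/mm2

    # Micron 8Gbit DDR4 in a 7.5mm x 11mm package; the die must be smaller
    dram_mb = 8 * 1024 / 8                  # 8Gbit = 1024MB
    pkg_mm2 = 7.5 * 11                      # 82.5 mm2, upper bound on die size
    dram_density = dram_mb / (pkg_mm2 * 9 / 8)    # scale area up to account for ECC
    print(f"DRAM: greater than {dram_density:.0f} MB per mm2, inclusive of ECC")  # ~11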

SRAM On-package?

With this rough understanding of SRAM cost and density, the plan must be to achieve maximum impact in memory latency reduction to justify the cost of SRAM. A single processor system has the advantage that all memory is local, and it eliminates the remote node cache coherency penalty. Even though SRAM is advocated as main memory, it does not need to be the only memory; it would be foolish not to leverage low cost DRAM.

One example is combining SRAM and DRAM in a fashion similar to the Xeon Phi (Knights Landing): SRAM in the processor package, with DRAM off-package in conventional DIMM slots.

SRAM3

The objective of on-package MCDRAM in Knights Landing is memory bandwidth. By keeping the high-bandwidth DRAM on-package, fewer signals need to be routed off-package. Intel's Embedded Multi-Die Interconnect Bridge (EMIB) has 55µm bumps versus >130µm bumps for signals going off package, allowing for higher signal density. It is unclear if EMIB also lowers transmission delays, as MCDRAM was not targeting latency. Intel expects to decrease EMIB bump size in following generations, possibly to as small as 10µm.

Depending on actual SRAM density, we might be able to put 5-10GB of SRAM in close proximity to a large die processor. If SRAM can be stacked, then perhaps 2-4X more is possible?

SRAM Cost

Motley Fool estimates the cost structure of Intel's 14nm process at $9,100 per 300mm wafer. The Silicon Edge die-per-wafer estimator says 86 die of dimensions 17.6 x 35mm fit on a 300mm wafer. Assuming the SRAM is made with spare banks, the expectation is that yield is high even for the large 616mm2 die. As a rough approximation, the end-user cost of SRAM is assumed to be in the $1000-1600/GB range. The current end-user cost of ECC DRAM is about $16/GB (Crucial, 2018 Jan).
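Tying the wafer cost and density estimates together gives a rough floor on manufacturing cost per GB. The sketch below uses only the figures cited above plus the raw cell density from the earlier calculation; array overhead, yield, packaging and margin are deliberately left out.

    # Rough SRAM cost floor from the estimates above (not a real cost model)
    wafer_cost = 9100            # Motley Fool estimate, Intel 14nm, per 300mm wafer
    die_per_wafer = 86           # Silicon Edge estimate for a 17.6 x 35 mm die
    die_mm2 = 17.6 * 35          # ~616 mm2
    sram_mb_per_mm2 = 1.8        # raw cell density from the earlier calculation

    cost_per_die = wafer_cost / die_per_wafer             # ~$106
    raw_gb_per_die = die_mm2 * sram_mb_per_mm2 / 1024     # ~1.1 GB of raw cells
    print(f"~${cost_per_die:.0f} per die, ~{raw_gb_per_die:.1f} GB raw cell capacity, "
          f"~${cost_per_die / raw_gb_per_die:.0f}/GB at the wafer level")
    # Tags, routing, spare banks, yield, packaging and margin push the end-user
    # price well above this; $1000-1600/GB is the working assumption here.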

The question not answered here is what latency could be achieved by SRAM as main memory. Keep in mind that the L3 sub-system today contributes 18-19ns. It would probably be necessary to rethink the entire path from L3 to fast memory for minimal latency. For this discussion, assume that 30ns access to SRAM memory, inclusive of L3, is possible.

Database System Memory

Recent generation server processors support 12 DIMMs per socket, and the number of DIMM slots scales accordingly in multi-processor systems. The Intel Xeon E5 v1 to v4 have 4 memory channels and support 3 DIMMs per channel. The Xeon SP has 6 memory channels and 2 DIMMs per channel. The price of a 64GB ECC DIMM is currently about $1,000 (it was less in 2015?). A system with 768GB (12 x 64GB) per socket is not at all cost prohibitive if extra memory solves a problem.

From this, some might think that the system memory requirement for a database server is 1.5TB, because the standard practice is a 2-socket system. There was once a valid reason for filling the system memory slots with large capacity DIMMs, justifying this as the memory capacity requirement: the purpose was to reduce IO to noise levels. The IOPS capability of a large HDD based array was one matter. Another was the impact of a RAID group rebuild after a single drive failure, during which performance was severely degraded. Massive memory overkill negated these problems.

In the last few years, SSD or all-flash became the better storage choice for important systems. The criterion here is not just hardware cost, but also the rack space required. A single 2U enclosure can accommodate sufficient SSD capacity for even very large databases, whereas one or more full 42U racks might be required for the several hundred HDDs necessary for high IOPS capability.

Another important development is NVMe, which substantially reduces the CPU overhead in the IO operation. The overhead to support 50-100K IOPS on the legacy IO software stack consumed an entire processor core. The NVMe stack can support a much higher IOPS level for the same compute resource.

As a generalization, in a system configured with 1TB memory, the contribution of the second half-TB might be that of lowering IOPS from 100K to less than 10K. When storage was on an HDD array, this was worth the cost of the extra memory. But now with storage on flash and the NVMe stack, it is not essential.

So past practice is no longer a valid basis for the system memory capacity requirement. The new principle for memory capacity should be to reduce IOPS to the point where the CPU overhead of IO is low in relation to total compute resources. As modern processors have many cores, a reasonable rule might be for IO to consume less than one core. To this end, a single socket system with 12 DIMM slots should be able to encompass all but the most extreme requirements.

Impact of SRAM and DRAM

The general characteristic of access frequency versus incremental memory is expected to be something like the following. The initial amount is accessed in almost every operation, and may include system tables and index root level pages. There is a middle range in which data is accessed frequently, possibly representing index intermediate level pages and other hot data. Any further incremental memory holds data accessed only infrequently by DRAM standards (less than 1M accesses per second).

mem_freq1

Assume that our new system has both SRAM and DRAM memory, with fast memory having 30ns latency and big memory having 75ns latency. If 50% of accesses that miss L2 and L3 go to fast memory, then average latency is 52.5ns, 30% lower than the 75ns latency of DRAM in a single socket system. At 84% of accesses going to SRAM, average memory latency is 50% lower than DRAM.
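The weighted-average arithmetic behind these figures is shown below, assuming the 30ns SRAM and 75ns DRAM latencies stated above.

    # Average memory latency for a mix of SRAM (30ns) and DRAM (75ns) accesses
    SRAM_NS, DRAM_NS = 30.0, 75.0    # assumed latencies from the text

    def avg_latency(sram_fraction):
        return sram_fraction * SRAM_NS + (1 - sram_fraction) * DRAM_NS

    for f in (0.5, 0.84):
        avg = avg_latency(f)
        print(f"{f:.0%} of accesses to SRAM: {avg:.1f}ns, "
              f"{1 - avg / DRAM_NS:.0%} lower than DRAM only")
    # 50% -> 52.5ns (30% lower), 84% -> 37.2ns (~50% lower)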

It is then reasonable to believe that based on our understanding of database memory access patterns, a significant reduction in average memory latency can be achieved with some amount of SRAM that is much less than the typical DRAM configuration.

Single Processor - DRAM

For this discussion, let us set aside the pervasive practice of the 2-way server as the baseline standard system. Consider a single processor system, which could have up to 28 cores in the current generation Xeon SP.

  XCC_1S

This system has memory latency of around 75ns. On 7-cpu, the Core i7-7820X 8-core processor is reported as RAM latency = 79 cycles (4.3GHz) + 50ns. The 79 cycles for L3 at 4.3GHz works out to 18.4ns. A slide in the Intel Xeon Scalable Architecture Deep Dive shows 19.5ns L3 latency for the Xeon SP 8180, which is the XCC die. Some difference in L3 latency between the LCC, HCC and XCC dies is expected.

Other applicable parameters in the 7-cpu Core i7-7820X report are DDR4-3400 16-18-18. The 16 clocks for CL is 9.41ns, the 18 clocks for RCD and for RP is 10.6ns each, and the sum of RCD+CL+RP is 30.6ns. In a server system, memory timings are 2666-19, which works out to 14.25ns each and 42.8ns total. Then 75ns, or perhaps 80ns, is more likely in a server with ECC memory, which has more conservative timings.
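The conversion from clocks to nanoseconds is straightforward: the DRAM command clock runs at half the data rate, so the clock period is 2000 divided by the data rate in MT/s. A minimal sketch for the two configurations cited above:

    # DDR4 timing parameters in ns; command clock period = 2000 / data rate (MT/s)
    def timings_ns(data_rate_mts, cl, rcd, rp):
        period = 2000.0 / data_rate_mts    # ns per command clock
        return cl * period, rcd * period, rp * period

    for label, cfg in [("DDR4-3400 16-18-18", (3400, 16, 18, 18)),
                       ("DDR4-2666 19-19-19", (2666, 19, 19, 19))]:
        cl, rcd, rp = timings_ns(*cfg)
        print(f"{label}: CL {cl:.2f}ns, RCD {rcd:.2f}ns, RP {rp:.2f}ns, "
              f"RCD+CL+RP {cl + rcd + rp:.1f}ns")
    # DDR4-3400: 9.41 / 10.59 / 10.59ns, sum 30.6ns
    # DDR4-2666: 14.25ns each, sum 42.8ns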

Multiprocessor Scaling

In the 2-way system, local node memory is reported as 89ns, and remote node as 139ns. Note that local node memory access on the 2-way system is significantly higher than memory access on the single socket system. This is attributed to the remote node L3 check for cache coherency.

XCC_1S

The figure below is from an Intel slide deck for Xeon SP.

LocalRemote

Local node memory on a 2-way Broadwell Xeon E5 v4 system is reported as 93ns. Remote node latency is not reported, but estimated here to be 148ns. Memory access on a single socket Xeon E5 v4 is about 75ns.

For a database not architected to achieve a high degree of memory locality on a NUMA system, the expectation is that memory accesses will split 50/50 between the local and remote nodes, for an average of 114ns (89 and 139). This is 52% higher than memory latency on the single socket system (75ns). In case anyone is curious, approximately zero percent of real-world databases are architected for NUMA; the TPC-C and TPC-E benchmark databases are exceptions.

What this means is that we should expect scaling from 1-socket (processor) to 2-socket to be about 1.35X. This is about the same gain that we would expect from reducing average memory latency in a single socket system by about 30%.
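A minimal sketch of this simple model: throughput for a memory-latency-bound workload is taken as proportional to sockets divided by average memory latency, using the 50/50 local/remote split and the latencies cited above.

    # Simple model: throughput ~ sockets / average memory latency (latency-bound work)
    L_1S, L_LOCAL, L_REMOTE = 75.0, 89.0, 139.0    # ns, from the figures above

    avg_2s = 0.5 * L_LOCAL + 0.5 * L_REMOTE        # 50/50 split, no NUMA tuning
    print(f"2-way average latency: {avg_2s:.0f}ns, "
          f"{avg_2s / L_1S - 1:.0%} higher than 1-socket")
    print(f"Estimated 1S to 2S scaling: {2 * L_1S / avg_2s:.2f}X")
    # ~114ns, ~52% higher, ~1.3X scaling, in line with the ~1.35X estimate above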

A database architected for NUMA and achieving 100% memory locality would still incur an 18.7% increase in memory latency over the single socket system. An application on a VM allocated with all cores and memory from one socket would incur this plus the VM overhead (one extra address translation = 4-5%?).

Next is scaling to a 4-way Xeon SP system in which the 3 UPI links allow a direct connection to each of the remote processors (this excludes the M models with the memory expander). The simple model suggests that there is only a moderate increase in average memory latency over the 2-way system.

XCC_4S

Using 25% local and 75% remote, average memory latency is 126.5ns, 11% higher than on the 2-way. This would suggest excellent scaling is possible from 2 to 4 sockets. However, the TPC-E benchmark shows 1.72X scaling from 2-way to 4-way on Xeon SP 8180 processors. TPC-E is reported as having 80% memory locality on a 2-way system; presumably 60% is by design and the other 40% is split 50/50 between nodes 0 and 1. Scaling for a database not architected for memory locality would be less.
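The same weighted-average model extended to the 4-way case, with the no-locality 25/75 split and, for comparison, the 80% locality attributed to TPC-E above (the remote latency is assumed to stay near the 2-way value, since each remote processor is one UPI hop away):

    # 4-way extension of the simple latency model (assumed splits from the text)
    L_LOCAL, L_REMOTE = 89.0, 139.0    # ns; remote assumed similar to the 2-way case

    def avg(local_fraction):
        return local_fraction * L_LOCAL + (1 - local_fraction) * L_REMOTE

    print(f"4-way, 25% local (no NUMA tuning): {avg(0.25):.1f}ns")    # 126.5ns
    print(f"2-way, 80% local (TPC-E style):    {avg(0.80):.1f}ns")    # 99ns
    # 126.5ns is only ~11% above the untuned 2-way average of 114ns, yet measured
    # TPC-E scaling is 1.72X, so average latency is not the only factor at 4 sockets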

The Xeon SP does support 8-way systems, but there seems to be general acknowledgment that big-iron systems are now rarely called for?

The Case for SRAM as Main Memory

The case for SRAM as main memory is then made as follows. The expected performance impact is large in software with pointer-chasing code, prominently database transaction processing. Furthermore, a throughput gain on the same order as scaling from 1 to 2, or even 1 to 4, processors is possible. An argument can then be made that the processor with SRAM as memory has value comparable to its throughput performance equivalent, plus the value of better thread level performance.

The Intel Xeon SP processor line ranges from $200 to $10,000, excluding the M models. Models with 20 or more cores, which must be made from the XCC die, start at $2,600. There are also lower core count models with 3 UPI or fabric links that must be made from the XCC die, the lowest priced of which is $1,691. The 12-18 core products range from $1,002 to $3,543, and the LCC products from $213 to $1,221.

The most economical method of scaling performance from the low end is simply to increase the number of cores while staying with a single socket system. So, the proposal for SRAM as memory is primarily based on scaling performance of the high core count processors. However, as the processors with SRAM as memory have better thread level performance, there may be interest at the medium or even low core count level regardless of throughput equivalent cost. Note: in the real world, no one bothers with detailed technical analysis, so expect some technically irrational behavior.

A 26% reduction in average memory latency translates to a 1.35X throughput gain. This could be achieved with about 42% of accesses going to fast memory at 30ns, with the remainder to DRAM at 75ns. This is the expected performance level of a 2-way system with DRAM having 89ns local and 139ns remote node memory latency, for a database not architected for memory locality on NUMA.
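A short check on this figure, using the weighted average from before: the fraction of accesses that must hit SRAM to reach a target average latency follows directly (30ns and 75ns are the assumed latencies).

    # Fraction of accesses that must hit SRAM to reach a target average latency
    SRAM_NS, DRAM_NS = 30.0, 75.0

    def sram_fraction_needed(target_ns):
        return (DRAM_NS - target_ns) / (DRAM_NS - SRAM_NS)

    # Matching the untuned 2-way system (1.35X) means a latency target of
    # 75 / 1.35 ~ 55.6ns, i.e. a ~26% reduction
    target = DRAM_NS / 1.35
    print(f"target {target:.1f}ns needs {sram_fraction_needed(target):.0%} of accesses in SRAM")
    # ~43%, in line with the ~42% above; a 50% reduction (37.5ns) needs ~83%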

SRAM_2S

A 50% reduction in average memory latency corresponds to the throughput performance of a 4-way system with conventional DRAM memory.

SRAM_4S

The amount of SRAM sufficient to achieve a 26% reduction in average memory latency has the value of a second processor, which could be as much as $10,000. The SRAM sufficient for a 50% reduction has value equivalent to 3 additional processors, or as much as $30K. The main point here is that SRAM at $1600/GB is very reasonable from the point of view of transaction processing performance.
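At the assumed $1000-1600/GB, these dollar equivalents translate into an SRAM capacity budget, sketched below. Whether that much SRAM actually captures the required fraction of memory accesses is workload dependent, as noted next.

    # SRAM capacity that the multi-processor-equivalent budget would buy
    for budget, label in [(10000, "2-way equivalent"), (30000, "4-way equivalent")]:
        for price_per_gb in (1000, 1600):    # assumed end-user $/GB range
            print(f"{label}: ${budget:,} at ${price_per_gb}/GB buys "
                  f"{budget / price_per_gb:.1f} GB of SRAM")
    # roughly 6-10GB and 19-30GB, in the range of the 8-32GB posited at the outset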

Of course, the reduction in average memory latency is workload dependent, and the value also depends on the baseline product (or core count). We do not need to worry about these variations, as it is up to the user to decide which option is best. When this is the line-of-business system, no price is too high if hardware can fix problems deeply buried in the database design.

Alternative

Before concluding the argument for SRAM as main memory, some discussion of alternatives is in order. The main issue, the large difference between round-trip memory access time and the CPU clock cycle resulting in a very high percentage of dead cycles, could be addressed with HT (generic term: SMT). IBM implements SMT at either 4 or 8, and SPARC implements SMT8. Intel only implements HT at 2, for unstated reasons, possibly because their mainline core goes into a very broad product range, not all of which benefits from HT.

While 4 or even 8 logical processors per core could achieve higher utilization of compute cycles, this does nothing for thread level performance. Furthermore, realizing the maximum throughput of high SMT requires very many concurrently running threads, and the required thread count is already high because modern high-end systems have so many cores.

Most transaction databases were designed many years ago, probably in the era of single to quad-core processors on 4-way systems. There may have been no significant measurable contention at 4-16 threads, but even a minute degree of contention becomes a serious issue at 100 threads. Problems on NUMA systems are even more severe.

In principle, we could promote the practice of re-architecting databases for NUMA systems, but people much prefer throwing large amounts of money at hardware rather than pursuing this path. Another option is to implement memory-optimized tables, also known as in-memory database, which should eliminate a high percentage of the memory round-trips. Some have done this, but it is not easy, and will most probably be very expensive and time consuming to implement.

The single socket processor with SRAM main memory has none of these issues and the further advantage of having better thread level performance, which has significant additional value. As such, SRAM as memory is the better solution when a brute force hardware fix is required, and this may be more frequent than one might think.

Summary

There is no public information on what latency could be achieved with SRAM as memory, but 30ns is not an unreasonable value. There is good reason to believe that some intermediate amount of fast memory can reduce average memory latency by a significant fraction for a database transaction processing workload.

The value of SRAM as memory could be comparable to that of scaling a single processor up to a 2-way or 4-way system. SRAM cost comparable to that of one to three processors is then reasonable, and the cost of a high core count processor is as high as $10,000. This is sufficient to dispel any notion that SRAM is too expensive to be main memory.

It is further pointed out that the system with SRAM as memory is the better solution. The conventional practice of scaling up on a multi-processor system with DRAM as memory may or may not solve transaction performance problems. Organizations did not have a problem paying for million-dollar all-flash storage in the hope that it would make problems go away.

Addendum

On the Oracle side, a few organizations went to the effort of optimizing for RAC, which requires the same type of architecture as a NUMA-optimized single compute node system.