
SRAM as Main Memory Cost Benefit

Proponents of SRAM as main memory almost never get past the huge disparity in price between DRAM, which is heavily optimized for cost, and SRAM, which is used as on-die cache optimized for performance in L1 and L2, for density in L3, and sometimes for power. It may also be difficult to imagine database transaction processing as a candidate for SRAM main memory given the huge memory configurations commonly deployed. But consider the following.

1. A system could have both SRAM and DRAM as main memory, with the SRAM being far larger than the on-die cache of modern processors, yet smaller than the DRAM. The objective is extremely low latency. This means placing the SRAM physically as close as possible to the processor die, presumably in the same package, with the lowest possible interconnect delay; Intel's EMIB might be one vehicle for this. It should be possible to put 2.5GB of SRAM around a large-die processor, more if SRAM can be stacked.
2. The large majority of memory in a database system is used to reduce IO, for which DRAM performance parameters and cost are eminently suitable. A minority of memory implemented with SRAM could potentially achieve a substantial reduction in average memory latency.
3. It so happens that database transaction performance is largely governed by round-trip memory latency, given the nature of B-tree index navigation and the disparity between the CPU clock-cycle and DRAM latency.
4. Attempting to scale transaction processing performance on a multi-socket system has moderate to limited effectiveness, as few (zero?) real-world databases have been architected for the characteristics of a system with non-uniform memory access (NUMA), and it can even be counter-productive.
Scaling from 1 to 2 sockets might be 1.6X or less, and lower still from 2S to 4S, for a cumulative 1S to 4S gain of around 2X.
5. Increasing the degree of Hyper-Threading (HT), generic term Simultaneous Multi-Threading (SMT), has the potential for outstanding throughput performance at very high concurrency so long as there is almost zero contention between threads (guess how likely this is), but does nothing for single thread performance.
6. There is an established and prevalent pattern of willingness to throw hardware at the transaction processing system in the hope that this will solve problems of poor architecture and coding. This includes employing multiple $5-10K processors. Yet doing so often aggravates existing problems and creates new ones from NUMA and high-concurrency operation.
7. A single-socket system has better single-thread performance than a multi-socket system because all memory accesses are to the local node, resulting in lower average latency. Even though a single-socket system is now capable of handling most workloads, it might be perceived as insufficient under the usual bigger-and-more-expensive-must-be-better preconceptions.
However, the single-socket system with SRAM plus DRAM memory further improves single thread performance and could match the throughput of the conventional DRAM two-socket and perhaps even that of a 4-socket system. The key is to make memory latency as low as possible, even if extraordinary measures are necessary.
The single-socket system would not have the negative aspects of NUMA. It would need fewer cores and hence operate at a lower concurrency level, avoiding contention issues. All of this makes the single-socket SRAM system a better technical solution in having fewer problematic aspects.
8. Consider that the flagship Intel Xeon SP 8180 Platinum 28-core processor carries a price of $10,000. At a 50% gain, roughly the value of a second socket, the SRAM budget would be $10K; at a 100% gain, enough to match 4-socket throughput, it would be $30K. The better single-thread performance and fewer negative aspects of the single-socket SRAM memory system also have value, which could be substantial.

All of the above creates substantial space for SRAM to have high value impact as main memory. An SRAM cost structure 100 times higher than DRAM would not be prohibitive. And knowing the database world, the real question will be: how much can I get, and I want more than that.

 

Background

The issue of round-trip memory latency being well over one hundred CPU clock-cycles is a topic of academic papers but is generally not discussed in the IT community. SRAM has lower latency, but costs 100 times more than mainstream DDR4 DRAM. Almost everyone quickly concedes that SRAM is too expensive to be viable as main memory and instead looks to specialized versions of DRAM as options. It would seem improbable that there could be some circumstance that would justify or even favor SRAM as main memory. And yet, it cannot be dismissed without a proper cost-benefit analysis.

It so happens that database transaction processing performance is largely governed by round trip memory latency. There is almost enough data available to make an analysis on the value of lower memory latency. Transaction processing is one sub-category of several major categories in computing. But it is an important one in which there is a prevalent pattern of hoping that brute force hardware can solve problems of poor architecture and coding.

The situation in processors for the last ten years has been that there are limited opportunities in pursuing substantial performance gain within a core, but overall performance can still scale by increasing the number of cores in a processor socket. The impact of round-trip memory access latency in modern processors is that the very large majority of cycles are no-ops waiting for memory.

The work-around for this is Hyper-Threading (HT), or the generic term simultaneous multi-threading (SMT), in which a single core appears as multiple logical processors to achieve higher utilization of the execution units within each core. At the system level, all multi-socket systems inherently have non-uniform memory access (NUMA) because the memory controller has been integrated into the processor die. Integrating the memory controller has overall benefits, but there are implications of NUMA as well.

Most mainstream RDBMS have the ability to support a database in making effective use of NUMA hardware. However, approximately zero percent of real-world databases have been architected for NUMA. A database not architected for NUMA has limited scaling on a multi-socket system. In any case, database transaction performance involves scaling over very many cores, and even more logical processors. For the strategy of scaling to high core and thread counts to be effective, it is necessary to reduce contention between threads to nearly zero, which few have done.

These factors add up to serious issues in database transaction processing for which the work-arounds only introduce more issues. The costs, benefits and limitations of scaling performance on current hardware architecture can be estimated. We can also speculate on the characteristics of an alternative processor and system architecture built with SRAM as one main memory element, and DDR DRAM as another.

A major benefit is that substantially lower round-trip memory latency directly addresses single-thread performance, reducing the reliance on work-arounds involving NUMA system architecture (with caveats) and extremely high thread concurrency. An argument is made that SRAM is viable as main memory, or at least warrants further analysis by entities with far more resources.

Xeon SP (Skylake) processors and systems

The latest Intel Xeon processor with multi-socket support, now using the scalable processor (SP) nomenclature, has up to 28 cores, 6 memory channels split over two memory controllers, 48 PCI-E lanes, and 2 or 3 Ultra Path Interconnect (UPI) links (formerly QPI).

  [figure: Xeon SP]

See Intel Xeon processor scalable family technical overview for additional details, particularly differences between the Xeon SP and the previous generation Broadwell Xeon processor.

  [figure: Xeon SP]

The common system configurations are 2-socket and 4-socket. The 2-way is by far the majority choice.

  [figure: XCC_2way]

The 4-way is the significant second platform choice. There is an option in the Xeon SP with the M models in which a scalable memory buffer (SMB) expands each memory channel into two channels, doubling the number of memory slots, with options for mirroring and other features. The SMB was mandatory in the Xeon E7, but optional in the Xeon SP?

  [figure: XCC_4way]

It is possible to build a glue-less 8-socket system, and other configurations with 8 or more sockets are possible with custom components. Curiously, few if any major system vendors advocate a single-socket Xeon SP, or a Xeon E5 in the previous generations.

The Xeon SP processor has 6 memory channels capable of supporting 12 memory slots, 2 per channel. The maximum economical memory configuration is 768GB per socket (non-M models) with 64GB DIMMs at $1000 each. The 128GB DIMM costs twice as much per GB, allowing for 1.5TB of memory per socket.

The flagship Intel Xeon Platinum 8180 28-core 2.5GHz processor is $10,000, so the practical maxed-out socket is $22,000 for processor and memory. The common practice in server system configuration for transaction processing databases is to fill the memory slots. There was once good reason for this. Memory reduces IO, and there were limitations in the IOPS capability of hard disk based storage systems, more so when the priorities of the infrastructure department were not aligned with the priorities of the database system.

Over the last several years, all-flash storage has become not only more affordable for critical systems, it has also become more practical than hard disk arrays when factors such as floor space footprint are taken into account. The important point here is that configuring to the maximum server system memory capacity is a convenient practice, not evidence that the true memory requirement is on the order of several hundred GBs.

DRAM, SRAM and 3D XPoint Cost

DDR4 DRAM density is currently 8Gbit on a 20nm process. A 64GB ECC DIMM is made from 2 die stacked in one package for 16Gbit or 2GB of memory per package. There are 18 packages on each side of the DIMM, for 36 packages total. This forms an 8Gig×72-bit configuration: nominally 64 bits are for data and 8 bits for ECC. The package is 9.5mm×11.5mm or smaller (see the Micron datasheet for DDR4 SDRAM MT40A2G4 etc.). The die must be smaller than the package and probably much smaller? Assuming the die is less than 8mm×10mm, this would suggest DRAM density is greater than 100Mbit per square millimeter?

The cost structure of DDR DRAM with ECC is publicly available. In 2017, it is about $15.60 per GB for modules up to 64GB. There is currently a shortage of DRAM supply with respect to demand. Previously, DRAM with ECC had been as low as $10 per GB.

There is a specialty DRAM, reduced latency DRAM (RL-DRAM) manufactured by Micron that could also be a fit. Cost is higher than regular DRAM and lower than SRAM. It is currently used in network switching.

There is not much useful information on SRAM cost structure. The commercially available SRAM products are for specialty applications and are shown with prices not in line with the cost of silicon in the PC world. However, it is possible to make some deductions based on public information.

Wikichip estimates that the Skylake core is 8.73mm². A two-core group inclusive of 4MB L3 is 25.347mm², implying 1.977mm² per MB of L3 cache. Cache includes data, ECC and tags. SRAM as memory would have only the data and ECC. It is reasonable to estimate that a 128MB SRAM chip has a 250mm² die size.

The Intel Advancing Moore's Law in 2014 presentation says the SRAM cell (1 bit) on their 14nm process is 0.0588µm². Assuming 9 bits per byte (64 + 8 for ECC), then 1MB is 0.529mm², or 1GB on a 542mm² die?
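As a quick check on this arithmetic, the sketch below redoes the density estimate in a few lines of Python; the cell size and the decimal-MB convention are taken from the figures above, and a real array would add decoders, sense amplifiers and redundancy not counted here.

```python
cell_um2 = 0.0588          # Intel 14nm SRAM cell, 1 bit (figure quoted above)
bits_per_byte = 9          # 8 data bits + 1 ECC bit per byte (64 + 8 over a 64-bit word)

um2_per_MB = cell_um2 * bits_per_byte * 1e6    # 10^6 bytes per (decimal) MB
mm2_per_MB = um2_per_MB / 1e6                  # um^2 -> mm^2
print(f"{mm2_per_MB:.3f} mm^2 per MB")             # ~0.529 mm^2 (cell array only)
print(f"{mm2_per_MB * 1024:.0f} mm^2 per GB")      # ~542 mm^2 for 1024MB

# For comparison, the L3-derived estimate (cache includes tags, ECC and periphery):
print(f"{(25.347 - 2 * 8.73) / 4:.2f} mm^2 per MB of L3")   # ~1.97 mm^2
```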

Motley Fool in The Economics of Intel Corporation's Monster Core i7 6950X estimates Intel's cost structure at $9,100 per 14nm wafer. The 10-core Broadwell has a 246mm² die. There are 223 dies per wafer. Based on 0.2 defects per square centimeter, there should be 139 good die per wafer for a cost of $64 each, excluding packaging and test? The actual effective yield should be better because Intel sells products with fewer than all cores active. A die is viable as long as some cores, the memory controller, PCI-E and possibly the QPI/UPI links work. The lowest priced Intel Broadwell product using the 10-core die is the E5-2603 v4 at $213.

The standard practice for SRAM is to build extra banks in each die. A die with one or two defective banks could still be functional. The expectation is that even large-die SRAM will have good yield, diminished only by the number of complete die that fit on a wafer. Using the Silicon Edge die-per-wafer calculator and assuming good yields, a $9,100 wafer yielding 90 good die of 500mm² works out to a rough cost of $100 per 256MB for the bare die. There are additional costs to build the die into a product.

  [figure: DiePerWafer_35x15]
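A rough sketch of the die-cost arithmetic above, using Murphy's yield model as one common approximation; the wafer cost, die areas, candidate die count and defect density are the estimates quoted in the text, not known Intel figures, and the result lands close to the ~$64 cited above.

```python
import math

def good_die_cost(wafer_cost, dies_per_wafer, die_mm2, defects_per_cm2=0.2):
    d_a = (die_mm2 / 100.0) * defects_per_cm2          # expected defects per die (D*A)
    murphy_yield = ((1 - math.exp(-d_a)) / d_a) ** 2   # Murphy's yield model
    good = dies_per_wafer * murphy_yield
    return round(good), wafer_cost / good               # good die per wafer, cost per good die

print(good_die_cost(9100, 223, 246))   # ~139 good die, ~$65 each (10-core Broadwell)

# SRAM with redundant banks is assumed to yield well; at the ~90 good 500mm^2 die
# per wafer assumed above, the bare-die cost is roughly:
print(9100 / 90)                        # ~$101 per 256MB die
```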

There are individual Intel products that have margins in excess of 90%. However, it is impractical to achieve such margins for all viable die manufactured. Furthermore, the Intel business model is to run their fabs at full capacity, building as many as necessary so long as average margin is strong. To this end, it is useful to have components that can soak up capacity.

The assumption made here is that a 256MB SRAM on a 500mm² die contributes $400 to the retail cost of a product. This works out to $1600 per GB, about 100 times higher than the current retail cost of DRAM. This alone is probably sufficient to scare off academics from proposing SRAM as main memory. But until we know how much SRAM is needed and what the impact on average memory latency is, the final answer has not been determined.

Two other technologies are mentioned here, Intel 3D XPoint and of course NAND flash. The enterprise-class Intel Optane DC P4800X, with 3D XPoint chips plus a PCI-E controller, is $3000 at 375GB net, or $8 per GB. The workstation-class Optane SSD 900P is $390 at 280GB, or $1.40 per GB. These products are comprised of 3D XPoint components and a controller to interface 3D XPoint to PCI-E. Presumably in a memory product, 3D XPoint would interface directly. It would be reasonable for the cost of the enterprise product to head toward $2 per GB.

For completeness, NAND flash is probably less than $1 per GB amortizing NAND chips and the controller, capacitors and other supporting components, but not factoring storage system level elements.

Regardless of what vendor marketing people say, for database transaction processing economics DRAM costs are not particularly high, even with current prices nearly double those of 2015/16. Intel 3D XPoint at half the cost of DRAM is probably not worth the effort, on the assumption of 1µs latency with a memory interface (as opposed to 10µs for access through the PCI-E interface and controller). The exception is that higher density can be achieved than with DRAM. For example, a 256GB 3D XPoint product with a DIMM interface at $8/GB would be of interest as a companion to the 64GB DDR4 DIMM at $16/GB. A price of $2-4/GB with the density advantage versus DRAM at $16/GB would be very interesting.

NAND flash, with lowest possible latency on the order of tens of microseconds, is not viable as memory except as a mechanism to back up memory on power failure. NAND is wonderful as storage and should remain there, but could be displaced by 3D XPoint if its price comes within a factor of 2 of NAND, as the properties of 3D XPoint are better than NAND in single-thread and mixed read/write access.

Database Memory Requirement

There is no definitive answer on the memory requirement for database servers. The major benchmarks require a correlation between reported performance and database size. In the TPC-E benchmark, a throughput score of 6,600 tps-E corresponds to a database size of 29TB, all of which is active. In the real world, many databases are not that large, and only a small percentage is active.

The general expectation is that the shape of the frequency of access versus memory capacity curve should be something like the following. This can also be interpreted as the IOPS versus memory curve.

  [figure Memory_IO2: Expected memory access frequency]

There is some minimum level of memory for bare functionality. This should be the memory required by the operating system, the database engine executable and some basic internal data structures. The initial incremental memory has huge impact. System tables, plan cache, locks and index upper levels are touched many times for every row and column.

In the intermediate range, there is strong impact as frequently accessed index and table leaf level pages become resident in the buffer cache. Finally, there is the weak phase, when additional memory enables caching infrequently accessed pages.

That we have been in the weak phase for some years is evident. Twenty years ago, high-volume systems may have generated 10,000 IOPS or more on an array of two hundred or more hard disks (10K RPM). Twenty years is ten doublings per Moore's Law, corresponding to an increase in performance by a factor of 1000. If the memory to active data set ratio were fixed, we would expect 10M IOPS now. Instead, it is common to find high volume processing systems showing on the order of 10,000 data page read IOPS (writes cannot decrease beyond some minimum amount per transaction). It is not uncommon to find situations in which the entire active database fits in memory.

For much of the initial period of the last twenty years, money spent on memory was money well spent, because there was a practical upper bound on the IOPS possible from a hard disk array. The larger an array, the more likely that one or more disks on the array will be in a failed state at any given point in time. The RAID array can be configured to rebuild automatically, and there can be hot-spare drives, but performance is degraded during the rebuild. Add to this the fact that SAN vendors advocate a configuration strategy meant to justify the extremely high margins on their products rather than meet the special requirements of database transaction processing.

Several years ago, NAND flash storage became affordable for critical line-of-business applications. More recently, all-flash has become preferable to hard disk storage systems period, whether extremely high IOPS are needed or not. One factor is the physical volume necessary to meet a specific IOPS level: a level typically far below what a flash array is capable of might require one hundred HDDs or more.

Note that even though large flash arrays are capable of incredibly high IOPS levels, there is also the CPU necessary to drive the IO. In a database, there is additional cost to evicting pages currently resident in memory to make room for pages being loaded from storage. The new NVMe protocol helps in the first regard.

Impact of Memory Latency

Modern processors and systems have scaled memory bandwidth over time, increased cache size to lower the probability of a cache miss, and added the memory pre-fetch instruction. However, consider the basic element of a database operation, B-tree index navigation. The process involves repetitively reading a memory value to determine the next memory address to access, also known as pointer chasing code. This is fundamentally dependent on round-trip memory latency. Cache size can have only limited effectiveness.
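The sketch below is a minimal illustration of pointer-chasing in index navigation: the next node address is only known after the current node has been read, so every level of the index costs at least one round-trip memory latency. The Node structure is purely hypothetical, for illustration only.

```python
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys              # sorted keys in this node
        self.children = children      # None for a leaf, else list of child nodes

def index_seek(node, key):
    while node.children is not None:                  # descend the internal levels
        i = sum(1 for k in node.keys if k <= key)     # pick the branch
        node = node.children[i]                       # dependent load: the next address
                                                      # comes from the data just read
    return key in node.keys                           # leaf-level lookup

# usage: a 2-level index with separator key 20
root = Node([20], [Node([5, 10]), Node([20, 30])])
print(index_seek(root, 10), index_seek(root, 25))     # True False
```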

The base frequency for the Xeon high core-count processors is in the range of 2.0-2.5GHz. The Skylake generation cores are designed to operate at higher than 4GHz. The quad-core desktop processor is rated for 4.0GHz base frequency at 91W TDP, and 4.2GHz in turbo when thermal conditions permit. The Xeon E3 quad-core version of Skylake is rated for a more conservative 3.7GHz base frequency at 80W TDP or 20W per core amortizing shared elements, and 4.0GHz max turbo.

The 28-core XCC die has a base frequency of 2.5GHz at 165W TDP, or 5.9W per core. In other words, the very powerful 4GHz-plus capable Skylake core is de-rated in order to bring the thermal envelope of 20-plus cores in one die to a manageable level. This strategy works for database transaction processing because there is little value in high frequency operation on account of dead cycles waiting for long latency memory accesses.

At 2.5GHz, the CPU cycle time is 0.4ns. Memory latency for local node memory may be 75ns, inclusive of L3, considering that this is a complex topic and available material mentions a range of values. At 0.4ns cycle time, the 75ns round-trip memory access is 188 CPU-cycles. It is not hard to work out that just a small fraction of the code being pointer chasing code results in memory latency being the determinant of performance.

Simple Transaction Model

Consider a transaction comprised of 3M actual instructions/operations of which 3% require a round-trip memory access, disregarding the L2 and L3 latency on cache hits. If we had a magic single-cycle access memory, then the 3M cycle transaction completes in 1.2ms, for a performance of 833 tps.

CPU Freq | Transaction Ops | Clock cycle | Tx time with 1-cycle mem | Performance
2.5GHz   | 3M              | 0.4ns       | 1.2ms                    | 833 tps

Now returning to reality, with a round-trip memory latency of 75ns: for every 97 instructions that complete in 1 cycle each, there are three that incur the 188-cycle memory round-trip, totaling 661 cycles, of which 15% are actual instructions and 85% are no-ops waiting for memory. The 3M cycle transaction incurs 90,000 round-trip memory access delays, for a combined expenditure of 19.8M cycles and a performance of 126 tps.

Transaction Ops | Memory accesses | Mem latency | Total cycles | Performance
3M              | 90,000          | 188 cycles  | 19.8M        | 126 tps
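The model can be written out in a few lines; the sketch below uses only the illustrative parameters above (3M operations, 3% memory accesses, 2.5GHz) and reproduces both rows.

```python
def tx_perf(freq_ghz, mem_latency_ns, n_ops=3e6, mem_frac=0.03):
    mem_cycles = mem_latency_ns * freq_ghz                        # latency expressed in CPU cycles
    total_cycles = n_ops * ((1 - mem_frac) + mem_frac * mem_cycles)
    return total_cycles, 1.0 / (total_cycles / (freq_ghz * 1e9))  # total cycles, transactions/sec

print(tx_perf(2.5, 0.4))   # "magic" 1-cycle memory: 3.0M cycles, ~833 tps
print(tx_perf(2.5, 75))    # 75ns DRAM round-trip:   ~19.8M cycles, ~126 tps
```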

Frequency Scaling

From this, we can derive that there is not much benefit in increasing CPU frequency, as this would only reduce the time spent on the 15% of cycles executing actual instructions and have no effect on the 85% spent waiting for memory.

The chart below shows frequency scaling in the simple model based on incurring 3% round-trip memory delays of 75ns. There is not much further scaling beyond 2GHz.

  [figure freq_DS: Simple model for frequency scaling on single-socket system]
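The same arithmetic swept over frequency, holding the assumed 75ns memory latency and 3% memory access fraction fixed, shows the plateau in the chart.

```python
# 3M operations, 3% at 75ns memory latency, remainder at 1 cycle each
for ghz in (1.0, 1.5, 2.0, 2.5, 3.0, 4.0):
    cycles = 3e6 * (0.97 + 0.03 * 75 * ghz)
    print(f"{ghz:.1f} GHz: {1.0 / (cycles / (ghz * 1e9)):5.1f} tps")
# ~104 tps at 1GHz, ~122 at 2GHz, ~126 at 2.5GHz, ~134 at 4GHz
```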

The actual observed transaction performance difference between 135MHz and 2.7GHz was a factor of 3. This was observed when a BIOS/UEFI update to an HP ProLiant DL380 left the system in power-save mode at a frequency of 135MHz.

Hyper-Threading (generic: Simultaneous Multi-Threading)

This is also why Hyper-Threading, which implements 2 logical processors on each physical core, one executing while the other is inactive until the first encounters a wait, is very effective. The generic term is simultaneous multi-threading (SMT). IBM POWER implements 4- and 8-way SMT. Intel should seriously consider increasing HT to 4- or 6-way, as 8 might be too much and there is no need for a power-of-2 value.

NUMA Scaling

Another obvious inference is that scaling is only moderate on applications or databases not architected to achieve a high degree of memory locality in systems having a NUMA architecture. For several years now, this happens to be all multi-socket systems.

Elaborating on what was mentioned earlier, the major database engines have some elements for supporting better memory locality on NUMA systems. The basic mechanism is some means of affinitizing a connection to one or more cores in a particular socket. For SQL Server, the original means was via the connection port number. In more recent versions, this can also be done with Resource Governor. From this, the reliance is on the initial transaction activity to load pages into the local node, and the hope that no other activity causes a particular set of data pages to load into the wrong node. A better approach might be to actually implement elements of Parallel Data Warehouse into a single instance/process.

In the simple model for a database not architected for NUMA, a two-socket system has a 50/50 mix of local and remote memory accesses. Assuming 75ns for local memory access and 125ns for the remote node, average memory latency in a 2-socket system is 100ns. The expected performance is then 98.4 tps per logical processor, a degradation of 22%. Throughput scaling, the combined performance with 2X the number of cores over a 1-socket system, is 1.55X.

Sockets | Avg. Mem Latency | Mem latency (cycles) | Total cycles | Performance
1S      | 75ns             | 188 cycles           | 19.8M        | 126 tps
2S      | 100ns            | 250 cycles           | 25.4M        | 98.4 tps
4S      | 112.5ns          | 281 cycles           | 28.2M        | 88.6 tps
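A sketch of the same model with average latency composed from local and remote accesses; the 75ns local and 125ns remote figures and the 50/50 (2S) and 25/75 (4S) mixes are the assumptions above, not measurements.

```python
def tps(avg_latency_ns, ghz=2.5, n_ops=3e6, mem_frac=0.03):
    cycles = n_ops * ((1 - mem_frac) + mem_frac * avg_latency_ns * ghz)
    return 1.0 / (cycles / (ghz * 1e9))

tps_1s = tps(75)                          # all local:           ~126 tps per thread
tps_2s = tps(0.5 * 75 + 0.5 * 125)        # 50/50 local/remote:  ~98 tps per thread
tps_4s = tps(0.25 * 75 + 0.75 * 125)      # 25/75 local/remote:  ~89 tps per thread
print(f"1S -> 2S throughput scaling: {2 * tps_2s / tps_1s:.2f}X")        # ~1.56X
print(f"2S -> 4S throughput scaling: {4 * tps_4s / (2 * tps_2s):.2f}X")  # ~1.80X
```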

This is probably too simple a model, because exclusive memory access requires a check of both nodes? So, the real scaling is less?

For a 4-way system having direct links to all other sockets, and not having the SMB memory expander of the Xeon E7, the simple model predicts 1.8X scaling from 2 sockets to 4. The TPC-E benchmark was noted to have 80% locality of memory access (tpc-vms-2013-1.0.pdf) and shows 1.72X scaling from 2-socket to 4-socket, so a database not architected for NUMA could be 30-40% less?

The SMT/HT Alternative

One option for the issue of memory latency and high percentage of dead cycles is increasing the degree of HT/SMT. IBM POWER has options for 4 and 8 logical processors per core. In this strategy, the CPU is capable of incredibly high multi-threaded throughput for a suitable workload. Core utilization is fairly high when running multiple logical processors simultaneously.

When throughput performance at high thread count is the most important metric, then this is a good solution. The key to high thread count performance with high level SMT is having nearly zero contention between threads. If there is even a small amount of contention, scaling concurrent threads will approach a point of diminishing returns. Few databases have been architected for this because exceptionally high core and logical processor counts have not been available in practical system until recently.

But there are situations in which single-thread transaction performance is important or even very important. In this case, memory latency is the main issue, as high processor frequency does not help. In fact, much of the frequency that modern processors are capable of is unusable in single-thread transaction processing.

Memory Optimized Tables and Natively Compiled Procedures

Another option is memory-optimized tables and natively compiled procedures. Memory-Optimized (MO) tables remove locks, which are one source of memory round-trips. MO also uses hash indexes, which need fewer memory round-trips than B-tree indexes. This requires pre-allocation of memory, which is not a big deal because memory is plentiful in modern systems.

What is difficult is porting a legacy database to use memory-optimized tables, particularly one in which developers have incorporated the many new features introduced with each successive version of the conventional engine over the years. There may or may not have been good reasons to use a specific feature, beyond it's new so it must be the wave of the future? There is value in an approach that can be dropped into an existing system without excessive code changes.

The Memory Latency Strategy

Conventional DRAM has been and continues to be designed for cost. It is possible to design DRAM for low latency, one example being Reduced Latency DRAM (RLDRAM), which has an SRAM-like interface, more and smaller banks, and other measures to reduce latency. The most aggressive option is SRAM. At 100 times more expensive than DRAM, most people do not think this is viable. But the argument is made here that it is indeed viable given the nature and economics of database transaction processing. And this market sub-segment is sufficiently important to justify a custom product.

In all likelihood, the SRAM solution is not just a straight replacement of DRAM with SRAM, accounting for differences in protocol. The assertion here is that the enormous DRAM configuration of server systems has gone beyond true memory requirements, more so when storage is all flash. As such, the existing physical memory connection strategy is probably not the best one.

  [figure: SRAM1]

The placement of SRAM on modules that connect to a socket (slot) and then to the processor socket would add too much latency, diminishing the crucial benefit of SRAM.

SRAM In-package?

SRAM is very expensive, and the objective of SRAM is low latency. The implementation strategy can allow expense commensurate with its value in reducing latency. One option, represented below, is SRAM placed within the processor package.

  [figure: SRAM2]

This is the strategy Intel followed with Knights Landing (KNL), aka Xeon Phi x200, except that KNL uses MCDRAM with the objective of exceptionally high memory bandwidth.

 

As mentioned earlier, much of the frequency capability of the Skylake core is wasted in database transaction processing. The Atom core in KNL is less than half the size of the Skylake core. KNL has 36 tiles, each tile comprised of 2 cores.

That said, even though transaction processing favors more cores at lower frequency over fewer powerful cores, there are still important tasks that require a few powerful cores. It can be speculated that a solution could be comprised of a small number of powerful cores and a large number of small cores?

Embedded Multi-Die Interconnect Bridge (EMIB)

Intel recently discussed Embedded Multi-Die Interconnect Bridge (EMIB) in which silicon is used to bridge the signals from one chip to another.

The bumps that drive signals off a chip, and off the socket as well, are currently 130µm. The bumps used to drive signals via EMIB to another die are currently 50µm, and Intel believes this can be reduced to 10µm in the future. The small size of the interconnect mechanism is not just about density, but also, perhaps more importantly, that signals do not need as much amplification to be driven off-chip. This strategy would put SRAM physically close to the processor and minimize the interconnect delays.

Multi-level Memory

Note that if we were to implement something like the above, there is every reason to still include the conventional DRAM memory controller. The system might have one processor socket with 5-10GB of SRAM in the processor package, and still have six memory channels for 768GB of DRAM with 64GB DIMMs, and possibly larger capacity 3D XPoint modules. In this case, there would be two or three classes of memory: SRAM, DRAM and possibly 3D XPoint or comparable.

[figure: memory_3_level]

In the Hot Chips 2015 Knights Landing presentation, the various modes of operation with multiple memory levels are discussed.

SRAM Latency

There is not sufficient public information to say what SRAM latency is. Intel has three types of SRAM: one for high performance, probably used in L1 and L2 cache, a second for high density, probably used for L3 cache, and a third for low power. L1 latency is 4 clock cycles, which could be 1ns in a 4GHz core. The L2 cache latency is cited as 12 cycles for the 256K variety, and 14 cycles for the larger L2 in the Skylake SP. This would indicate a relationship between cache size and latency: the longer physical distance signals need to travel in 32K, 256K and 1M caches manifests in the latency? L1 cache latency must be set in terms of CPU-cycles because it is part of the instruction execution pipeline.

It is puzzling that L2 and L3 cache latency are specified in cycles only, without qualification, considering that processor frequency can vary between 1.8GHz and 4GHz or thereabouts. If 12-cycle latency is possible at 4GHz, would it not be possible to achieve 6 or 7 cycle latency at 2GHz?
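The arithmetic behind the question is just a unit conversion, sketched below.

```python
# a latency fixed in cycles is a different latency in time at different frequencies
for ghz in (4.0, 2.0, 1.8):
    print(f"12 cycles at {ghz} GHz = {12 / ghz:.1f} ns")   # 3.0, 6.0, 6.7 ns
print(f"3 ns at 2.0 GHz = {3 * 2.0:.0f} cycles")            # the 6-cycle question above
```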

Core Interconnect

Presumably L3 is built from the dense version of SRAM. L3 cache latency is specified as about 42 cycles for the quad-core desktop processors. In this case, the internal component interconnect is a ring connecting the 4 cores, the system agent and the graphics controller.

[figure: skylakeE3ring]

In the bigger Xeon processor, the internal interconnect has more components to connect. In Haswell and Broadwell, the MCC and HCC die have two ring sets.

[figure Broadwell_HCC_ring2: Broadwell]

In the Skylake SP, the components are connected with a mesh for all die options.

[figure Skylake_XCCgs: Skylake]

L3 cache latency is cited as 50-70 cycles. The L3 latency might vary due to difference in distance, both physical and logical. Examples are local L3, L3 in an adjacent core, or in L3 at the far end from a particular core.

A question is whether L3 latency is hard set in cycles or also depends on core frequency. What if one core is in turbo boost mode, another is at base frequency and a third is in power save? There is a reference somewhere that cites remote node L3 latency at 100-300 cycles?

In any case, L3 latency is expected to be influenced by the time necessary to navigate the mesh interconnect.

Sub-NUMA Clustering

It was stated earlier that almost no real-world databases have been architected for a NUMA system architecture. Hence scaling on NUMA systems is expected to be no more than moderate, with the possibility of adverse NUMA effects. A single socket system with SRAM elements in main memory could achieve comparable performance as a multi-socket system, hence justifying the cost of SRAM. One of the arguments in favor of this strategy was that it would then not be necessary to re-architect existing database for NUMA optimizations.

But the very large die has dimensions on the order of 30×20mm, and connects 20-plus cores. There is enough complexity that better memory latency can be achieved by dividing the single-socket processor into NUMA nodes. In Skylake SP, this is called Sub-NUMA Clustering (SNC).

[figure: xeon sp]

In Broadwell, it is called Cluster On-Die (COD). There might be yet another name for this on Knights Landing.

In our case, this might apply to both the DRAM memory controllers and even the SRAM. Perhaps the SRAM is divided into several nodes, each node connected to one or a few cores for the lowest possible latency?

SRAM Latency (cont.)

Access latency to SRAM memory should be simpler than to cache because it is a straight access without the need to check tags. If we were to implement SRAM as main memory, it may be advisable to change the current L3 cache architecture, on the assumption that the internal interconnect ring or mesh contributes significantly to latency.

How much SRAM is necessary?

The proposed layout shows ten 500mm2 SRAM die around one XCC size die presumably in a manner that could allow EMIB connection. Each large die could be 256MB. The capacity for ten SRAM die around the processor die is then 2.5GB of SRAM in close proximity to the CPU.

This may not be enough to be compelling. Can SRAM be stacked? If so, how high? Two dies stacked for 5GB SRAM might be enough. Four dies stacked for 10GB would probably be more than enough for the critical memory requirement.

At a retail price of $400 per die, the 10GB configuration would contribute $16,000. Intel had encountered delays in bringing a 10nm processor to market. Perhaps 10nm SRAM, having almost double the density, might not be as difficult?
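The capacity and cost arithmetic for the stacking options, using the assumed 256MB per die, ten die positions and $400 per-die cost contribution from above.

```python
die_mb, die_positions, cost_per_die = 256, 10, 400    # assumptions from the text
for stack_height in (1, 2, 4):
    n_die = die_positions * stack_height
    print(f"{stack_height}-high stack: {die_mb * n_die / 1024:4.1f} GB, ~${n_die * cost_per_die:,}")
# 1-high: 2.5GB ~$4,000;  2-high: 5.0GB ~$8,000;  4-high: 10.0GB ~$16,000
```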

There is no clear material to suggest what SRAM main memory latency would be, both in the simple case and in the most aggressively optimized case. For the discussion here, the assumption is that in-package SRAM has a latency of 30ns compared to 75ns for local node DRAM, and that the processor has both SRAM and DRAM as memory. Depending on the mix of operations that incur the round-trip memory access latency, average memory latency could range from 75 down to 30ns.

  [figure DRAM_SRAM_latency: Average memory latency for mix of accesses to SRAM at 30ns and DRAM at 75ns]

Using our simple model of transaction in which 3% of operations incur a round-trip memory access latency, we can estimate transaction performance for a range of average memory latency values. Note that memory latency is decreasing from left to right.

  [figure tps_latency: Simple model transaction performance versus memory access latency]
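A sketch reproducing both charts: average latency for a mix of SRAM and DRAM accesses (30ns and 75ns assumed above) and the resulting gain in the simple transaction model at 2.5GHz.

```python
def tps(avg_latency_ns, ghz=2.5, n_ops=3e6, mem_frac=0.03):
    cycles = n_ops * ((1 - mem_frac) + mem_frac * avg_latency_ns * ghz)
    return 1.0 / (cycles / (ghz * 1e9))

base = tps(75)                                        # all-DRAM baseline, ~126 tps
for sram_share in (0.0, 1/3, 2/3, 1.0):               # fraction of memory accesses hitting SRAM
    avg_ns = sram_share * 30 + (1 - sram_share) * 75
    print(f"{sram_share:5.0%} SRAM hits: {avg_ns:4.0f} ns avg, gain {tps(avg_ns) / base:.2f}X")
# 60ns average gives ~1.2X, 45ns ~1.5X, 30ns roughly 2X
```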

In the simple model, a reduction of average memory latency from 75 to 60ns would contribute a performance gain of 20%. This is very good, but not earth shattering. A reduction to 45ns would correspond to a performance gain of 50%, about the expected gain going from a 1-socket system to 2 sockets. On the basis of the flagship Intel Xeon Platinum 8180 28-core processor, this would justify the cost of a second processor and perhaps the 12x64GB DIMMs attached to the second socket.

Should it be possible to get average latency close to 30ns, then the performance gain is on the order of 2X, comparable to the scaling expected from 1S all the way to 4S. Now, the price justification is on the order $30-60K on hardware alone. Does anyone now think there is a problem with SRAM at $1600 per GB, 100 times more expensive than DRAM?

The questions are: what latency can SRAM as memory achieve, even if it requires restructuring the processor mesh and introducing NUMA to the single-socket system, and how much SRAM is necessary to achieve a high percentage of memory accesses to the low latency nodes.

There is also another avenue opened by very low latency memory. Previously it was mentioned that scaling up frequency was not particularly effective with DRAM memory, leaving much of the capability of recent generation Intel cores untapped.

With very low memory latency, frequency scaling is more effective.

  [figure freq_DS: Frequency scaling, DRAM and SRAM latency levels]

Reduced Latency DRAM

It is also possible that RL-DRAM as the fast node of main memory could work. There is some information in the Micron session at Hot Chips 2016 by Pawlowski. Below is slide 9 from the Pawlowski slide deck.

[figure: MicronMemoryTut_2016]

Summary ‐ SRAM as Main Memory

Memory latency has been a serious matter for some time now in pointer chasing code. It is now possible to put a substantial amount of SRAM near the processor for maximum latency impact. Even at a cost 100 times more than DRAM, the amount necessary to have significant impact compares well against the conventional means of scaling database transaction performance. The low latency strategy does not need a multi-socket system, avoiding potentially serious issues on NUMA. This also does not rely on very high thread level parallelism, avoiding contention issues. Single-thread performance is also better which has value in its own right.

Proper analysis is necessary to substantiate the simple estimates made here. The full implementation of the SRAM main memory strategy would involve coordination between the processor, OS and RDBMS vendors, so that proper use is made of fast memory and conventional DRAM memory. This might seem to be not easy, but the effort only needs to be done by a small number of organizations with sufficient resources. Once done, the new system could be dropped into most existing environments with powerful impact. Compare this with trying to implement memory-optimized tables and natively compiled code into legacy systems, which would have to be done by each organization, probably incurring a high degree of disruption.

 

Appendix

The main intent of this article is to provide the justification for SRAM as main memory for database transaction processing systems along with some parameters and goals. There are many details not covered here that are important, and are left to those who can best further this proposal.

Ideally we would like a coordinated effort between the processor, OS and database engine teams. Even without ideal alignment, there could be fallback modes. In the KNL Xeon Phi x200, the in-package memory can be configured as cache. Here, the SRAM could be main memory exclusively, with DRAM treated as the page/swap file, as others have suggested. A database engine would evict a buffer cache page instead of swapping it to the page file, but this could be changed.