
Low Latency Memory (2018-03, work-in-progress)

For the last twenty years, the standard practice has been to scale server performance with multi-processor (MP) systems. What is not widely understood is that after processors integrated the memory controller, the initial step from single processor to 2-way has only moderate scaling due to the introduction of non-uniform memory access (NUMA) in the MP system architecture. Scaling versus cores within a single processor, however, can be excellent. The new practice should now be to employ single processor systems for the very large majority of situations, with multi-processor systems relegated to extreme boundary cases.

Once we accept that the single processor system is the correct strategy, the next conclusion is that low latency DRAM should be pursued. Over the last twenty years, the data transfer rate for memory has increased from 100-133MHz in the SDRAM generation to over 2666MT/s in DDR4. But the full access latency (or row cycle time) has barely changed from perhaps 60ns to 45ns. This is not because some fundamental limit of DRAM has been reached. Rather it is a deliberate decision to prioritize manufacturing cost. It was presumed that low latency would not have significant incremental market value to justify the higher cost.

A significant portion of the memory latency in multi-processor servers occurs in the system elements outside of DRAM. A reduction in latency on the DRAM chip may not impact total memory latency in the MP system sufficiently to outweigh the cost. However, in a single processor system with less outside contribution, the impact is sufficient to justify even a very large increase in DRAM cost. The first opportunity is for low latency DRAM that would be compatible with the existing memory interface of current or near-term next generation processors, be it DDR4 now or DDR5 in a couple of years.

The next step is for memory with a new optimized interface, which must be implemented in conjunction with the processor. The most obvious change is to demultiplex the address bus, basically RL-DRAM, but optimized for server systems. Memory latency is so important that it is likely even SRAM cost structure is viable, but the discussion here focuses on DRAM.

Memory Latency in Single and Multi-Processor Systems

There are three die options in the Skylake based Intel Xeon SP: LCC, HCC and XCC. The figure below shows a representation of the Xeon SP XCC model. Memory access latency on L2 miss consists of L3 plus the DRAM access. L3 latency may be less than 19ns on the smaller LCC and HCC die.


Intel cites L3 at 19.5ns for the XCC die. 7-cpu reports memory access for Skylake X as L3 + 50ns = 69ns for DDR4-3400 16-18-18-36. On more conservative ECC memory at 2666MT/s 18-18-18 timing, the full memory access might be 76ns, or perhaps slightly higher.

The figure below, from Anandtech Dissecting Intel's... originally sourced from Intel, shows memory latency for a 2-way system with Xeon SP 8180 28-core processors. Also see Sizing Up Servers....


Note that local node memory latency is shown as 89ns, significantly higher than latency on the single processor system. Remote node memory latency is 139ns. I am assuming that this difference is due to remote node cache coherency, but processor architects and engineers are welcome to elaborate on this. There is a slide in Notes on NUMA Architecture, Intel Software Conference 2014 Brazil stating that in local memory access, the memory request is sent concurrently with the remote node snoop, implying minimal penalty, and yet there is a difference of 13ns?

Be aware that different values are cited by various sources. The figure below is from computerbase.de.


Even if a database or other application had been architected to achieve a high degree of memory locality on a NUMA system, there is still a memory latency penalty on the multi-processor system relative to the single processor system.

Almost zero real-world databases have been architected to achieve memory locality on a NUMA system. I have heard that a handful of Oracle environments went to the effort to re-architect for RAC, which is the same type of architecture necessary to achieve memory locality on NUMA. Most environments should see average memory latency of 114ns on a 2-way Xeon SP system based on a 50/50 local-remote node mix. Average memory latency on a 2-way Xeon E5 v4 system might be 120ns.

Memory Latency and Performance

The main line Intel processor cores manufactured on the original 14nm process can operate with base frequency at the high 3GHz level. Cores manufactured on the 14nm+ process can operate at the low 4GHz level, base. In the high core Xeon models, this is typically de-rated to 2-2.5GHz base frequency with turbo boost at about 3GHz.

At 2.5GHz, the CPU clock is 0.4ns. Single processor memory latency of 76ns corresponds to 190 CPU-cycles. Local node memory access in a multi-processor system of 89ns is 222.5 cycles and remote node latency of 139ns is 347.5 cycles. Average 2-way memory access based on 50/50 local-remote node split is 114ns or 285 cycles.
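The cycle counts above follow from multiplying latency by clock frequency. A minimal sketch of that arithmetic, using only the latency figures cited in this section; the function names and the `local_fraction` parameter are illustrative and not taken from any tool:

```python
def latency_cycles(latency_ns: float, core_ghz: float) -> float:
    """Convert a latency in nanoseconds to CPU cycles (ns * cycles-per-ns)."""
    return latency_ns * core_ghz


def average_latency_ns(local_ns: float, remote_ns: float,
                       local_fraction: float = 0.5) -> float:
    """Weighted average of local and remote node latency on a 2-way system."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


CORE_GHZ = 2.5  # de-rated base frequency assumed in the text

avg_2way_ns = average_latency_ns(89, 139)  # 50/50 local-remote mix

for label, ns in [("single processor", 76),
                  ("2-way local node", 89),
                  ("2-way remote node", 139),
                  ("2-way 50/50 average", avg_2way_ns)]:
    cycles = latency_cycles(ns, CORE_GHZ)
    print(f"{label:20s} {ns:6.1f} ns = {cycles:6.1f} cycles")
```

Changing `local_fraction` shows how much memory locality a NUMA-aware application would need before the 2-way average approaches the single processor figure; even at 100% local access it stays at 89ns.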

If even a very small percentage of code involves a round-trip memory access to determine the next operation, then most CPU cycles are spent as no-ops. In this case, performance is essentially the inverse of memory latency. A 30% reduction in memory latency corresponds to 1.4X performance and 50% to 2X. See Memory Latency for more on this topic.
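The inverse relationship can be written out directly. The sketch below is a simplified model, not a measurement: the first function assumes the workload is entirely bound by memory latency, as in the text, and the second adds a hypothetical `stall_fraction` parameter (an assumption introduced here) for the share of time spent waiting on memory.

```python
def speedup_latency_bound(reduction: float) -> float:
    """Speedup when performance is the inverse of memory latency.

    reduction is the fractional cut in latency, e.g. 0.3 for 30%.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def speedup_partial(reduction: float, stall_fraction: float) -> float:
    """Amdahl-style variant: only stall_fraction of run time scales with latency."""
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be in [0, 1]")
    remaining = (1.0 - stall_fraction) + stall_fraction * (1.0 - reduction)
    return 1.0 / remaining


print(f"30% lower latency: {speedup_latency_bound(0.3):.2f}X")
print(f"50% lower latency: {speedup_latency_bound(0.5):.2f}X")
print(f"30% lower latency, half of time stalled: {speedup_partial(0.3, 0.5):.2f}X")
```

With `stall_fraction` at 1.0 the second function reduces to the first, which is the limiting case the text describes.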


The Dual In-line Memory Module has been a standard for many years now. In the figure below, a memory module is shown with 9 DRAM chip packages forming a 72-bit wide word or line.


Of this, 64 bits are data and the 8 extra bits are for ECC. There is an extra chip for registered DIMMs. Desktop systems use unbuffered DIMMs and do not have ECC.

A module can be double-sided. When there are chip packages for two separate sets, the DIMM is said to have 2 ranks. In the case above, there is one set on each side of the DIMM, hence dual.

The figure below shows two separate sets on each side, for a total of four sets.


In this case, the DIMM is rank 4. The DIMM datasheet will have the actual configuration as there are exceptions. Below is a DIMM probably of rank 4.


Each package can have stacked DRAM die. The image below is probably NAND die as it appears to be stacked 16 high. DRAM packages with stacked die are usually 2 or 4 high? See Samsung’s Colossal 128GB DIMM.



DRAM - preliminary

The DRAM chip is an enormously complicated entity for which only some aspects are briefly summarized here. See Onur Mutlu and other sources for more.

Below is a simplified rendering of the Micron 8Gb DDR4 in the 2G x 4 organization. See the Micron datasheet for a more complete and fully connected layout (DDR4 8Gb 2Gx4).


The die is sub-divided into 4 bank groups. Each group has 4 banks for a total of 16 banks. Within a bank, there are rows and columns. The row is a word-line and the column is a bit-line.

The 2G x 4 organization means there are 2G words of 4-bits. The address bus has 21 signals. Two bits are for addressing the bank groups, and two bits are for the banks. The remaining 17 signals are multiplexed for the row and column addresses. All 17 bits are used for the row (128x1024 = 131,072) address. Only 10 bits are used for the column address. The total address is then 4 + 17 + 10 = 31 bits, sufficient for 2G words.

Notice that only 7 bits of the column address go into the column decoder. Eight words are fetched from the bank on each access.
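The address breakdown can be checked with a few lines of Python. The bit widths come from the text; the field order used in `split_address` (bank group, bank, row, column from high to low bits) is an assumption made for illustration only, since real memory controllers choose their own physical address mapping.

```python
BANK_GROUP_BITS = 2   # 4 bank groups
BANK_BITS = 2         # 4 banks per group
ROW_BITS = 17         # 128K rows
COLUMN_BITS = 10      # 1K columns
BURST_BITS = 3        # low column bits select a word within the 8-word fetch
WORD_BITS = 4         # x4 organization

TOTAL_BITS = BANK_GROUP_BITS + BANK_BITS + ROW_BITS + COLUMN_BITS
WORDS = 1 << TOTAL_BITS
CAPACITY_BITS = WORDS * WORD_BITS


def split_address(addr: int) -> dict:
    """Split a word address into DRAM fields (field order is hypothetical)."""
    if not 0 <= addr < WORDS:
        raise ValueError("address out of range")
    column = addr & ((1 << COLUMN_BITS) - 1)
    addr >>= COLUMN_BITS
    row = addr & ((1 << ROW_BITS) - 1)
    addr >>= ROW_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    return {
        "bank_group": addr,
        "bank": bank,
        "row": row,
        "column": column,
        "decoder_column": column >> BURST_BITS,        # the 7 decoded bits
        "burst_offset": column & ((1 << BURST_BITS) - 1),
    }


print(TOTAL_BITS, "address bits,", WORDS // 2**30, "G words,",
      CAPACITY_BITS // 2**30, "Gb")
print(split_address(WORDS - 1))
```

The 7 decoded column bits plus the 3 burst bits account for the 10-bit column address, matching the 8-word fetch per access.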

The diagrams below are from Bounding Worst-Case DRAM Performance on Multicore Processors, on Oak Central. The original source is: "Memory Systems: Cache, DRAM, Disk" by Bruce Jacob et al. (2008). The first image shows "Command and data movement for an individual DRAM device access on a generic DRAM device."


The next figure is "A cycle of a DRAM device access to read data."


The three DDR timing parameters commonly cited are CL, tRCD, and tRP. A fourth number that may be cited after the first three is tRAS. Wikipedia Memory timings has definitions for these terms. "The time to read the first bit of memory from a DRAM with the wrong row open is tRP + tRCD + CL." The tRCD, CL and tRP elements are often identical and in the 13-14ns range. In DDR4, the data burst phase transfers 8 words. For DDR4-2666, the command-address clock is 0.75ns and the data transfer rate is one word every 0.375ns, so 8 transfers take 3ns.
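Timing values are usually quoted in clocks, so converting to nanoseconds is a recurring step. A minimal sketch, assuming the nominal DDR4-2666 rate of 2666.67MT/s and the 18-18-18 timing mentioned earlier for ECC memory; the helper names are illustrative:

```python
def clock_period_ns(data_rate_mts: float) -> float:
    """Command/address clock period; DDR moves two words per clock."""
    return 2000.0 / data_rate_mts


def timing_ns(clocks: int, data_rate_mts: float) -> float:
    """Convert a timing parameter from clocks to nanoseconds."""
    return clocks * clock_period_ns(data_rate_mts)


def wrong_row_first_bit_ns(cl: int, trcd: int, trp: int,
                           data_rate_mts: float) -> float:
    """tRP + tRCD + CL: time to first data with the wrong row open."""
    return timing_ns(trp + trcd + cl, data_rate_mts)


def burst_ns(data_rate_mts: float, burst_length: int = 8) -> float:
    """Time to move one burst of burst_length words."""
    return burst_length * 1000.0 / data_rate_mts


DDR4_2666 = 8000 / 3  # nominal 2666.67 MT/s, 1333.33MHz clock

print(f"clock period : {clock_period_ns(DDR4_2666):.3f} ns")
print(f"one timing   : {timing_ns(18, DDR4_2666):.2f} ns")
print(f"tRP+tRCD+CL  : {wrong_row_first_bit_ns(18, 18, 18, DDR4_2666):.2f} ns")
print(f"8-word burst : {burst_ns(DDR4_2666):.2f} ns")
```

Note that this sum is not tRC, which also has to cover the data restore phase.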

The data restore phase occurs shortly after tCAS (CL) and overlaps with tBurst. There is a gap between the column read and the (array) precharge periods. As mentioned earlier, DRAM is very complicated. For database transaction processing and other pointer-chasing code, the parameter of interest is tRC. This value is sometimes not cited, and it cannot be determined from the three commonly cited timing parameters.

Micron SDRAM through DDR4

The table below shows some organization details for selected Micron DRAM products from SDRAM through DDR4. The year column is the year of the datasheet, which might be a revision published after initial product launch.

Type   Density  Org     Banks  Row addr.  Col addr.  Rows x Cols  Year  MT/s  Timing   tRC
SDRAM  256M     64Mx4   4      13         11         8K x 2K      1999  133   15ns     60ns
DDR    512M     128Mx4  4      13         12         8K x 4K      2000  400   15ns     55ns
DDR2   2G       512Mx4  8      15         11         32K x 2K     2006  1066  13.13ns  54ns
DDR3   1G       256Mx4  8      14         11         16K x 2K     2008  1866  13.91ns  48ns
DDR3   4G       1Gx4    8      16         11         64K x 2K     2017  2133  13.09ns  47ns
DDR4   8G       2Gx4    16     17         10 (7)     128K x 1K    2017  3200  13.75ns  46ns

Most of the parameters above are in the DRAM chip datasheet, but tRC is usually found in the DIMM datasheet, and sometimes buried in the back pages. The timing column is the CL-tRCD-tRP amalgamated into a single value and converted from (command/address) clock cycles to nanoseconds.

The CL-tRCD-tRP values decreased only slightly from SDRAM to DDR2. From DDR2 to current DDR4, they have not changed for the registered ECC DIMMs used in server systems.

There are unbuffered DIMM products without ECC bits, made with specially screened DRAM parts having lower (better) values. These products also have heat sinks for improved cooling, as temperature affects timing. Presumably, it would not be feasible to rely on this approach for server systems.

The 256Mb SDRAM datasheet is listed as 1999. This could be a 250 or 180nm manufacturing process. The 8Gb DDR4 is probably a 2016 or 2017 product, possibly on a 22nm or 2x process. From 256Mb to 8Gb, there are 5 doublings of density. In between 180nm and 22nm are five manufacturing processes: 130, 90, 65, 45 and 32nm.

The manufacturing process used to mean the transistor gate length. However, in the last 15 or so years, it is just an artificial placeholder representing a doubling of transistor density between generations. In the Intel 22 to 14nm process transition, the density increased by 2.7X (Intel's 10nm).

It would seem that one doubling of DRAM density between 256Mb at 180nm and 8Gb at 22nm is missing, but this could be the upcoming 16Gb single die. There is a convention in which the DRAM bit cell size is specified as xF2, where F is the process linear dimension.
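The missing doubling can be counted explicitly. This sketch takes 180nm as the starting node and 22nm as the end node, which are two of the possibilities the text mentions rather than confirmed process nodes for these parts:

```python
import math

# Density doublings from the 256Mb SDRAM to the 8Gb DDR4 (both in Mb).
density_doublings = math.log2((8 * 1024) / 256)

# Process nodes named in the text, assuming a 180nm start and a 22nm end.
nodes_nm = [180, 130, 90, 65, 45, 32, 22]
process_transitions = len(nodes_nm) - 1

missing_doublings = process_transitions - density_doublings

print(f"density doublings  : {density_doublings:.0f}")
print(f"process transitions: {process_transitions}")
print(f"missing doublings  : {missing_doublings:.0f}")
```

If the SDRAM part was instead built on 250nm, there would be one more process transition and the gap would be two doublings.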

DRAM Die Images

In Under the Hood: The race is on for 50-nm DRAM, the Samsung 1Gb DDR2 is mentioned as 50nm class in 2009 as a 6F2-based cell design, die image shown below. In this case, 50nm class is 58nm.


Also in Under the Hood, "On the other hand, Hynix's 8F2 cell design showed a 16.5 percent larger cell than Samsung's. It should be noted, that despite the larger cell size, Hynix's 1-Gbit DDR2 SDRAM achieved an impressive chip size of 45.1 mm2, only 2.7 percent larger than Samsung's 1-Gbit DDR2 SDRAM."

Semiconductor Manufacturing & Design has a die image of the Samsung 1Gb DDR3 in How to Get 5 Gbps Out of a Samsung Graphics DRAM, shown below.


Embedded has a discussion on Micron SDRAM and DDR in Analysis: Micron cell innovation changes DRAM economics.

DRAM Latency

Two items are mentioned as having significant impact on DRAM latency. One is the bank size. Another is the multiplexing of row and column addresses. In principle, it is a simple matter to sub-divide the DRAM chip into more banks, as the design of each bank and its control logic is the same. This would however increase the die size as there would be more of the control logic associated with each bank. Earlier we said that DRAM manufacturers understood that the market prioritized cost over latency.

In SDRAM 256Mbit, the bank size is 8Kx2K = 16M words, with 4 banks at x4 word. (I am using the term word to mean the data path out. This could apply to a DRAM chip at 4, 8 or 16-bits, and a DIMM at 64 or 72-bits.) For DDR4 8Gbit, bank size is 128Kx1K = 128M with 16 banks at x4 word size.

There is some improvement in latency between the manufacturing process for SDRAM in 1999 and DDR4 in 2017. Long ago, the expectation was a 30% reduction in transistor switching time corresponding to a 40% increase in frequency for logic. From about the 90nm process forwards, transistor performance improved at a much diminished rate.

In DRAM, there is both the logic sub-units made of transistors and the bit cell array made of transistors and capacitors. I do not recall mention of capacitor performance versus process generation. Presumably it is minimal? Still, this allowed the main timing parameters to remain at 13-14ns each as bank size increased from 16M to 128M at 4-bit data width.

A single bank can only transfer one 8-word burst every tRC. This takes 4 clocks, as DDR transfers 2 data words every clock. The tRC is somewhat longer than tRCD + CL + tRP. For DDR4-2666, these components are 19 clocks each for a total of 57 clocks at 0.75ns per clock. The cited tRC at 3200 MT/s is 45.75ns and at 2933 MT/s is 46.32ns. Presumably then tRC at 2666 MT/s is either 61 clocks for 45.75ns or 62 clocks for 46.5ns. (61 clocks for tRC is 57 + 4?)
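The 61 and 62 clock candidates fall out of rounding the two cited tRC values up to whole clocks at 0.75ns. The sketch below assumes that minimum timings are rounded up to an integer number of clocks, which is how such parameters are normally applied; `clocks_for` is an illustrative helper name.

```python
import math

CLOCK_NS = 0.75  # DDR4-2666 command/address clock period


def clocks_for(min_ns: float, clock_ns: float = CLOCK_NS) -> int:
    """Whole clocks needed to cover a minimum time, rounded up.

    The small epsilon guards against floating point noise on exact multiples.
    """
    return math.ceil(min_ns / clock_ns - 1e-9)


component_clocks = 3 * 19  # tRCD + CL + tRP at 19 clocks each
print(f"tRCD + CL + tRP: {component_clocks} clocks = "
      f"{component_clocks * CLOCK_NS:.2f} ns")

for cited_trc_ns in (45.75, 46.32):
    n = clocks_for(cited_trc_ns)
    print(f"tRC >= {cited_trc_ns} ns -> {n} clocks = {n * CLOCK_NS:.2f} ns "
          f"({n - component_clocks} clocks beyond the three components)")
```

Which of the two values applies at 2666MT/s would have to be confirmed from the datasheet for the specific part.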

A single bank can transfer data for only 4 out of 61 or 62 clock cycles. DDR4 currently has 16 banks. So, it would seem that sustained memory bandwidth was important and warranted an increase in the number of banks. But latency was not deemed sufficiently important to further increase the number of banks.
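A rough way to see why 16 banks is a bandwidth choice is to ask how many banks are needed to keep the data bus busy when each bank delivers one burst per tRC. This is a deliberately simplified model: it assumes every access opens a new row and ignores bank group constraints, activation limits, and refresh.

```python
import math


def bank_duty_cycle(trc_clocks: int, burst_clocks: int = 4) -> float:
    """Fraction of time one bank spends transferring data."""
    return burst_clocks / trc_clocks


def banks_to_fill_bus(trc_clocks: int, burst_clocks: int = 4) -> int:
    """Banks needed so back-to-back bursts can cover a full tRC."""
    return math.ceil(trc_clocks / burst_clocks)


for trc in (61, 62):
    print(f"tRC = {trc} clocks: one bank busy "
          f"{bank_duty_cycle(trc):.1%} of the time, "
          f"{banks_to_fill_bus(trc)} banks to fill the bus")
```

Under these assumptions, about 16 banks is what it takes to sustain bandwidth with random row accesses. Adding banks beyond that point would shrink each bank, which is the latency lever discussed in this section, without adding bandwidth.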

Reduced Latency DRAM

Micron has an RLDRAM 3 product at up to 1.125Gb. One organization is 4M x18 and 16 banks. The datasheet is 2015 so this might be on the same manufacturing process as the 2Gb DDR3. The RLDRAM package is 13.5mm x 13.5mm versus 8mm x 14mm for the 2Gb in x16 form. This might be due to the difference in pins, 168 versus 96. The two products might have the same die size or the 1.125Gb RLDRAM could have a larger die size than the 2Gb DDR3, and may or may not be on a common manufacturing process.

The 1.125Gb RL-DRAM has 16 banks. Each bank is 16K rows by 256 columns = 4M words, or 72M bits.

The appropriate comparison for this is the 2Gbit DDR3 in 128M x 16 organization. The comparable DDR3 chip is 8 banks. Each bank is 16K x 1K = 16M words at 16-bit per word = 256M bits. So, roughly, the RLDRAM bank is 4 times smaller in words and 3.55 times smaller in bits per bank (256M versus 72M), at a similar word width (18 versus 16 bits).
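The bank size comparison can be checked from the row, column, and word-width figures given above. A small sketch; `bank_size` is an illustrative helper:

```python
K = 1024
M = K * K


def bank_size(rows: int, columns: int, word_bits: int) -> tuple:
    """Return (words per bank, bits per bank)."""
    words = rows * columns
    return words, words * word_bits


rldram_words, rldram_bits = bank_size(16 * K, 256, 18)   # 1.125Gb RLDRAM 3
ddr3_words, ddr3_bits = bank_size(16 * K, 1 * K, 16)     # 2Gb DDR3 x16

print(f"RLDRAM 3 bank: {rldram_words // M}M words, {rldram_bits // M}M bits")
print(f"DDR3 bank    : {ddr3_words // M}M words, {ddr3_bits // M}M bits")
print(f"ratio in words: {ddr3_words / rldram_words:.2f}")
print(f"ratio in bits : {ddr3_bits / rldram_bits:.2f}")

# Cross-check against the chip densities: 16 banks and 8 banks respectively.
print(f"RLDRAM 3 total: {16 * rldram_bits / (K * M):.3f} Gb")
print(f"DDR3 total    : {8 * ddr3_bits / (K * M):.3f} Gb")
```

The totals come out to 1.125Gb and 2Gb, confirming that the per-bank figures are consistent with the stated chip densities.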

The second aspect of RLDRAM is that the row and column addresses are not multiplexed. For the 64M x 18 organization, there are 25 address bits, even though 26 bits are required to address 64M words. There are 4 bits for the bank address. The row address is 14 bits. The column address is 7 bits, but only 5 bits go into the column decoder. The bank is shown as 16K x 32 x 8 (x18 data). The 5 bits going into the decoder correspond to 32 columns. I am presuming that the address has 2 word granularity?

Between the smaller bank size and the non-multiplexed address, Micron cites their RL-DRAM as having a minimum tRC value of 6.67ns (8 cycles of the 1200MHz clock at 2400MT/s, or 0.833ns per clock cycle). But why only cite the minimum value? We are mostly interested in the average value and possibly the min-max range, excluding refresh, which all DRAM must have.

My guess would be that by having extra banks, the tRP period can be hidden if accesses are randomly distributed between banks. If so, then the smaller banks reduce the 13-14ns timings to 6.67ns?

It is presumed that both the smaller bank size and the non-multiplexed address contribute to significantly lower latency. Some details on how much each aspect contributes would be helpful.

As we are accustomed to multi-GHz processors, it might seem strange that the multiplexed row and column address has much of a penalty. The tRCD and CAS latency are each 13-14ns in mainstream DRAM. In this regard, we should recall that processors have been at 10 pipeline stages since Pentium Pro in 1995. The post-Pentium 4 Intel processors are probably 14-16 pipeline stages, though Intel no longer shows the number of stages in their micro-architecture diagrams.

In this regard, the full start to finish time to execute an instruction is on the order of 5ns. Then factor in that DRAM has a very different manufacturing process than logic, one that is not optimized for performance. It is presumed that the logic on DRAM is not pipelined except for the data transfer sub-units? (The DDR logic core runs at 1/8th of the data transfer rate, or 1/4 of the command/address clock. On DDR4-2666, the core runs at 333MHz.)
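The core frequency figure follows from the 8-word prefetch. A short sketch of the relationship, assuming the nominal DDR4-2666 rate of 2666.67MT/s; the function names are illustrative:

```python
def command_clock_mhz(data_rate_mts: float) -> float:
    """DDR transfers two words per command/address clock."""
    return data_rate_mts / 2


def dram_core_mhz(data_rate_mts: float, prefetch: int = 8) -> float:
    """The array side delivers one prefetch-wide fetch per core cycle."""
    return data_rate_mts / prefetch


DDR4_2666 = 8000 / 3  # nominal 2666.67 MT/s

core = dram_core_mhz(DDR4_2666)
print(f"command clock: {command_clock_mhz(DDR4_2666):.0f} MHz")
print(f"core clock   : {core:.0f} MHz ({1000 / core:.1f} ns per core cycle)")
```

At roughly 3ns per core cycle, a 13-14ns timing parameter corresponds to only four or five core cycles.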

Low Latency DRAM

As mentioned earlier, we are interested in two separate approaches to low latency DRAM for server systems in handling transaction processing type workloads. The near-term approach is to retain the existing mainstream DDR4 interface and DDR5 when appropriate. Ideally, the new special memory could be compatible with now or then existing processors designed for conventional memory. But some divergence may be necessary and is acceptable.

This would preclude a non-multiplexed address bus. The number of banks would be increased from 16 in the 8Gb DDR4 presumably to 64 or more. The memory controller would have to know that there are more bits for bank addressing, which is why this new memory may or may not work (at full capability?) in existing processors. (The memory controller tries to schedule requests so that accesses do not go to a given bank within a tRC interval.)

But it is possible the next generation processor could work with both conventional and the new high bank count DRAM memory. (Before conventional memory was made with multi-banks, there was a Multibank DRAM product from Mosys.)

The low latency DRAM die would be larger than a similar density conventional DRAM chip. Either the new memory would have lower DIMM capacity options or the DIMM form factor would have to allow for a taller module. A 2X reduction in capacity would not be a serious issue. If the difference in areal size were 4X, then the larger form factor would be preferable.

In the longer term strategy, more avenues for latency reduction are desired. The RL-DRAM approach of non-multiplexed row and column addresses is the right choice. Somewhere it is mentioned that the RL-DRAM interface is not very different from SRAM. This could be an additional option that I looked at in SRAM as Main Memory.


We need to wake up to the fact that scaling performance with multi-processor systems should no longer be the standard default approach. Most situations would benefit greatly from the single processor system. On a hardware cost only evaluation, the cost-performance value could go either way. Two 20-core processors cost less than one 28-core, but one 24-core is about the same as two 16-core processors. When software per-core licensing is factored in, the single processor system advantage is huge. Once we accept the single processor system approach, it is easy to then realize that a new low latency memory provides huge additional value.

This needs to be communicated to both DRAM and processor manufacturers. A cost multiplier of 2 or even 4X is well worth it if system level memory latency could be reduced by 20ns. A near-term future product with a backward compatibility option would mitigate risk. The longer term approach is a clean break from the past to do what is right for the future.


There is manufacturing variation in DRAM latency. Most mainstream DRAM is set at a conservative rating for high yield. Micron does have parts with slightly better latency, usually denoted with an E in the speed grade. Some companies (e.g. G.Skill) offer memory made with specially screened parts. One product is DDR4-4266 (2133MHz clock) at 17-18-18 timings. CL 17 at 2133MHz is 7.97ns. RCD and RP at 18 is 8.44ns. This strategy works for extreme gaming, allowing a few deep-pocket players to get an advantage. For servers, it is probably worthwhile to offer two or three binned parts. If two, then perhaps the top one-third as the premium and the rest as ordinary.

The other factor that affects latency is temperature. Lower temperature allows the charge state of the capacitor to be measured more quickly. Many DRAM parts list 64ms, 8192-cycle refresh for up to 85C, and 32ms refresh for >85C to 95C. Onur's slides also say that lower temperature allows lower timing values. This is one reason why the special DIMMs for gaming have a heat sink. So, a question is whether we want this or even more elaborate cooling for our low latency strategy. Perhaps cooling should be used for the specially binned parts, and just normal cooling for the bulk parts.

Also see The Memory Guy
Super-Cooled DRAM for Big Power Savings,
Is Intel Adding Yet Another Memory Layer?,
A 1T SRAM? Sounds Too Good to be True!, Zeno Semi, A 1T (or 2T) SRAM Bit Cell,