Fast DDR4 SDRAM Proposal (2018-10)

Memory latency is of such paramount importance that we should largely abandon multi-processor systems in favor of single-processor (single-socket) systems, as detailed in Multi-Processors Must Die. The advantage of having all memory on the local node outweighs much of the benefit of the additional resources in a multi-socket system, which is partially negated by the burden of remote-node cache coherency penalties.

To this end, DRAM vendors are called upon to offer low latency DRAM suitable for use as main memory in server systems. The value of lower memory latency outweighs almost any cost incurred. Furthermore, we can sacrifice system memory capacity if necessary, as detailed in Too Much Memory.

[Figure: DRAM_bank2]

It is understood that this is not a trivial request. Despite its conceptual simplicity, modern DRAM is in fact incredibly complicated, with almost every detail a tightly guarded trade secret. Asking manufacturers to produce a special DRAM product for low latency could incur significant up-front costs.

Given that the value is expected to be very high, a temporary product that could be brought to market with much less effort and expense is proposed. Much of the latency in modern DRAM is attributed to bank size and organization. The supporting elements outside of the DRAM bit array comprise a significant fraction of the die area. Because cost has been viewed as one of the primary drivers, existing designs attempt to minimize the area devoted to non-bit-array elements.

Hence, the DRAM is made with as few banks as necessary to support the sustained bandwidth objective, bandwidth being the other major factor in main memory performance. The bank arrangement is also structured with cost in mind: DDR4 has 16 banks.

At 8Gb density in a ×4 configuration, there are 2G 4-bit words. Each bank is 512Mb, or 128M words. The full address is 31 bits (2^31 words), of which 4 bits are for the bank, including the bank group.
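
As a sanity check, the arithmetic above can be verified in a few lines (a minimal sketch; the figures are taken directly from the text):

```python
# Sanity check of the 8Gb x4 DDR4 address arithmetic described above.
density_bits = 8 * 2**30       # 8 Gbit die
word_bits    = 4               # x4 configuration
banks        = 16              # 4 bank groups x 4 banks each

words_total    = density_bits // word_bits     # 2G words
bits_per_bank  = density_bits // banks         # 512 Mbit
words_per_bank = bits_per_bank // word_bits    # 128M words

address_bits = words_total.bit_length() - 1    # log2(2G) = 31
bank_bits    = banks.bit_length() - 1          # log2(16) = 4

print(f"total words:    {words_total:>13,} (2^{address_bits})")
print(f"bits per bank:  {bits_per_bank:>13,}")
print(f"words per bank: {words_per_bank:>13,}")
print(f"address bits:   {address_bits} total, {bank_bits} for bank/bank-group")
```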

The original purpose of multiplexed row and column addressing was to minimize the number of pins, which contributes to system cost. The 4Kb Mostek MK4096 had 16 pins, versus 22 pins for the Intel 2107 4Kb product.

If pin count were still the driving concern, a multiplexed addressing scheme for 2G words might have 17 signals: 4 bank + 13 row address bits in the first group, followed by 14 column address bits in the second group (the bank address needs to be among the first set).

But instead, the 8Gb (2G×4) DRAM has 21 address bits. The 4 bank and 17 row address bits are sent first, followed by a 10-bit column address in the second group. In other words, the bank organization is highly asymmetric at 128K (131,072) rows × 1,024 columns (× 4 bits per word). Presumably this arrangement minimizes the area necessary for the sense amp array (and I/O gating).
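
A small sketch contrasting the two splits (the balanced scheme is the hypothetical one from the previous paragraph; signal counts follow directly from the bit widths):

```python
# Compare the hypothetical balanced split against the actual DDR4 split
# for a 2G-word (31-bit) address with 4 bank bits.
def mux_scheme(name, bank_bits, row_bits, col_bits):
    first_group = bank_bits + row_bits     # sent with the activate command
    signals = max(first_group, col_bits)   # pins sized for the wider group
    rows, cols = 2**row_bits, 2**col_bits
    print(f"{name}: {signals} address signals, "
          f"{rows:,} rows x {cols:,} columns per bank")

mux_scheme("balanced (hypothetical)", 4, 13, 14)   # 17 signals, 8K x 16K
mux_scheme("actual DDR4 8Gb x4     ", 4, 17, 10)   # 21 signals, 128K x 1K
```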

The long path from the far rows to the sense amps is said to be a major contributor to latency in existing DDR4 products. The proposal, then, is simply to disable the upper half of the rows in the bit array, those furthest away from the sense amps.

[Figure: DRAM_bank2]

Timings can now be based on the rows closest to the sense amps. This is analogous to the old practice of short-stroking hard disk drives for lower seek times.
Perhaps the term is now short-banking?
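
A quick sketch of what short-banking does to the organization, assuming exactly the upper half of the 128K rows per bank is disabled:

```python
# Effect of disabling the far half of the rows in each bank ("short-banking"),
# assuming exactly the upper 64K of the 128K rows are fused off.
rows, cols, word_bits, banks = 128 * 1024, 1024, 4, 16

short_rows = rows // 2                              # 64K rows remain
density    = short_rows * cols * word_bits * banks  # usable bits
print(f"rows per bank:  {short_rows:,} (row address shrinks 17 -> 16 bits)")
print(f"usable density: {density / 2**30:.0f} Gbit (was 8 Gbit)")
```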

There are academic papers that propose splitting the DRAM into fast and slow cells: the fast bits are the rows closer to the sense amps and the slow bits are in rows further away (see the CHARM paper in the references below). The driving requirement in these proposals is to not increase the cost of the DRAM.

The arguments made in the Multi-Processors and Memory references cited earlier are that memory latency is extremely valuable, and that system memory capacity is probably much larger than necessary.

A tiered memory model would require software support at both the operating system and application levels. The desire here is for a quick drop-in solution, regardless of cost implications.

The justification is sufficient to absorb a doubling of cost per unit capacity. Halving system memory capacity is also acceptable. Better still would be to bin production parts for the best characteristics, then use a control signal to disable the far rows.

The timing values of the three principal components, tCAS, tRCD, and tRP, in current-generation DDR4 are approximately 14ns each. This is a conservative value chosen to allow for high yield. By binning parts and employing a heat sink, memory modules with 10ns timings in the principal components are possible.

The objective for the short-stroked rows would be 7ns, with tRC perhaps less than 30ns.
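
A worked comparison of closed-page random access latency (precharge + activate + read, i.e. tRP + tRCD + tCAS) under the three timing classes; this ignores command transport, burst transfer, and controller overhead:

```python
# Rough closed-page random-access latency for the three timing classes
# discussed above, taking the three principal components as equal.
for label, t in [("stock DDR4", 14), ("binned + heat sink", 10),
                 ("short-banked objective", 7)]:
    tCAS = tRCD = tRP = t
    print(f"{label:>24}: tRP+tRCD+tCAS = {tRP + tRCD + tCAS} ns")
# prints 42, 30, and 21 ns respectively
```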

In summary, this is only the near-term proposal. Longer term, dispensing with multiplexed row-column addressing is desirable.

Special parts with higher bank counts relative to density, similar to RLDRAM (16 banks at 1.125Gb density), may be desirable. The extreme bank count of the Intel 1Gb eDRAM (128 banks?) is probably not necessary.

If binning proves to be an effective strategy, then it might be better to bin the standard part and disable half of its bit array, rather than make a special part.


References

Onur Mutlu, Professor of Computer Science at ETH Zurich
website, lecture-videos

ACACES Summer School 2018, Lecture 5 Low-Latency Memory
DRAM memory latency: temperature, row location, voltage, etc.

Low-temperature operation also contributes substantially to lower DRAM latency. See Onur Mutlu's Fall 2017 Computer Architecture course, Lecture 6, Low-Latency DRAM and Processing In Memory.

Young Hoon Son, Seongil O, Yuhwan Ro, Jae W. Lee, Jung Ho Ahn, "Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations," ISCA 2013 (isca13_charm).

 

Comments

1. If a significant reduction is made in DRAM latency, the next thing to think about is L3. The large-die, high-core-count processors now have 18-19ns L3 latency.

This reflects long signal distances, perhaps 22mm horizontal and 14.7mm vertical, on the 32.18mm × 21.56mm die.

[Figure: Skylake_XCCd]

The smaller quad-core client processor die has an L3 latency of 10-11ns, with dimensions of 13.31 × 9.19mm; the length of the ring is perhaps 6.3mm.

[Figure: Skylake_4c]

If we have very low latency to DRAM, then we might consider, on an L2 miss, issuing the memory access concurrently with the L3 lookup. If L3 comes back first, the memory access is discarded.
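
A back-of-the-envelope comparison of sequential versus concurrent issue; the L3 hit rate and latencies here are illustrative assumptions, not measurements:

```python
# Average latency after an L2 miss: sequential vs. concurrent issue.
# Hit rate and latencies are assumed values for illustration only.
l3_ns, mem_ns, l3_hit = 18, 40, 0.5   # 18ns L3, 40ns DRAM, 50% L3 hit rate

sequential = l3_hit * l3_ns + (1 - l3_hit) * (l3_ns + mem_ns)
concurrent = l3_hit * l3_ns + (1 - l3_hit) * mem_ns  # DRAM result discarded on hit

print(f"sequential L3 then DRAM: {sequential:.1f} ns average")
print(f"concurrent issue:        {concurrent:.1f} ns average")
```

The cost of the concurrent approach, of course, is the extra DRAM traffic from the discarded accesses, which may matter under bandwidth pressure.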

2. What is the net effect of DRAM refresh?
Does it increase the average memory access latency?
If so, then perhaps even SRAM as main memory is viable for critical systems?
Onur Mutlu's presentation at MemCon 2013, Memory Scaling: A System Architecture Perspective, slide 22, says refresh overhead will rise sharply beyond 8Gb density.
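
As a rough estimate, the fraction of time a device spends refreshing can be computed from the standard JEDEC values (tRFC of 350ns for an 8Gb DDR4 device, tREFI of 7.8µs at normal temperature):

```python
# Rough fraction of time an 8Gb DDR4 device is unavailable due to refresh.
# tRFC (350ns at 8Gb) and tREFI (7.8us) are the standard JEDEC 1x-mode values.
tRFC_ns  = 350       # time a refresh command occupies the device
tREFI_ns = 7800      # average interval between refresh commands

overhead = tRFC_ns / tREFI_ns
print(f"refresh overhead: {overhead:.1%} of device time")  # about 4.5%
```

The impact on average latency is smaller than this, since only accesses that collide with an in-progress refresh are delayed, but the worst case adds up to a full tRFC.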