
DRAM - List (2018-03, work-in-progress)

Micron SDRAM through DDR4

The table below shows some organization details for selected Micron DRAM products from SDRAM through DDR4. The year column is the year of the datasheet, which might be a revision published after initial product launch. The timing column is shown as a single value because the CL, tRCD and tRP elements are identical; it has been converted from clock cycles to nanoseconds.

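| Type | Den | Org | Banks | Row addr. bits | Col addr. bits | Rows x Columns | Year | MT/s | Timing | tRC |
|---|---|---|---|---|---|---|---|---|---|---|
| SDRAM | 64M | 16Mx4 | 4 | 12 | 10 | 4K x 1K | 1999 | 133 | 15ns | 60ns |
| SDRAM | 128M | 32Mx4 | 4 | 12 | 11 | 4K x 2K | 1999 | 133 | 15ns | 60ns |
| SDRAM | 256M | 64Mx4 | 4 | 13 | 11 | 8K x 2K | 1999 | 133 | 15ns | 60ns |
| DDR | 256M | 64Mx4 | 4 | 13 | 11 | 8K x 2K | 2003 | 400 | 15ns | |
| DDR | 512M | 128Mx4 | 4 | 13 | 12 | 8K x 4K | 2000 | 400 | 15ns | 55ns |
| DDR | 1G | 256Mx4 | 4 | 14 | 12 (11) | 16K x 4K | 2003 | 400 | 15ns | |
| DDR2 | 512M | 128Mx4 | 4 | 14 | 11 | 16K x 2K | 2004 | 1066 | 13.13ns | 54ns |
| DDR2 | 1G | 256Mx4 | 8 | 14 | 11 | 16K x 2K | 2014A | 1066 | 13.13ns | 54ns |
| DDR2 | 2G | 512Mx4 | 8 | 15 | 11 | 32K x 2K | 2006 | 1066 | 13.13ns | 54ns |
| DDR3 | 1G | 256Mx4 | 8 | 14 | 11 (8) | 16K x 2K | 2008 | 1866 | 13.91ns | 48ns |
| DDR3 | 2G | 512Mx4 | 8 | 15 | 11 | 32K x 2K | 2015 | 1866 | 13.91ns | |
| DDR3 | 4G | 1Gx4 | 8 | 16 | 11 | 64K x 2K | 2017 | 2133 | 13.09ns | 47ns |
| DDR3 | 8G | 2Gx4 | 8 | 16 | 12 (9) | 64K x 512 | 2015 | 1866 | 13.91ns | |
| DDR4 | 4G | 1Gx4 | 16 | 16 | 10 (7) | 64K x 1K | 2014 | 3200 | 13.75ns | 46.16? |
| DDR4 | 8G | 2Gx4 | 16 | 17 | 10 (7) | 128K x 1K | 2017 | 2666 | 14.25ns | 46ns |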

Micron0_64Mx4 SDR 256M - 13+11+2 = 26 x4 = 256M

Micron1_64Mx4 DDR 256M - 13+11+2 = 26 x4 = 256M

Micron2_128Mx4 DDR2 512M - 14+11+2 = 27 x4 = 512M

DDR3 1G - 14+11+3 = 28 x4 = 1G

Micron3_2Gx4 DDR3 8G - 16+12+3 = 31 x4 = 8G

Micron4_2Gx4 DDR4 8G - 17+10+4 = 31 x4 = 8G

Micron0_128Mx16 DDR3 2G - 14+10+3 = 27 x16 = 2G

RLDRAM3_x18 RLDRAM3 1.125Gb x18 - 14+7+4 = 25 (26)
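The address arithmetic in the captions above can be checked with a short Python sketch. The function name `density_bits` and its parameter names are illustrative only (not from any datasheet); it simply treats the capacity as 2^(row + column + bank bits) words times the data width.

```python
MBIT = 1 << 20
GBIT = 1 << 30


def density_bits(row_bits, col_bits, bank_bits, width):
    """Total capacity in bits: 2^(address bits) words times the data width."""
    words = 1 << (row_bits + col_bits + bank_bits)
    return words * width


# (label, row bits, column bits, bank bits, data width), copied from the list above
parts = [
    ("SDR 256M, 64Mx4", 13, 11, 2, 4),
    ("DDR2 512M, 128Mx4", 14, 11, 2, 4),
    ("DDR3 1G, 256Mx4", 14, 11, 3, 4),
    ("DDR3 8G, 2Gx4", 16, 12, 3, 4),
    ("DDR4 8G, 2Gx4", 17, 10, 4, 4),
    ("DDR3 2G, 128Mx16", 14, 10, 3, 16),
]

for label, row, col, bank, width in parts:
    total = density_bits(row, col, bank, width)
    print(f"{label}: {row + col + bank} address bits, {total // MBIT} Mbit")

# RLDRAM3 x18: 64M words of 18 bits need 26 address bits, hence the "(26)"
print((1 << 26) * 18 / GBIT, "Gbit")
```

The last line shows why 26 bits, not 25, are needed to reach 1.125Gb at 18 bits per word.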

The 256Mb SDRAM datasheet is listed as 1999. This could be a 250 or 180nm manufacturing process. The 8Gb DDR4 is probably a 2016 or 2017 product, possibly on a 22nm or 2x process. From 256Mb to 8Gb, there are 5 doublings of density. In between 180nm and 22nm are five manufacturing processes: 130, 90, 65, 45 and 32nm.

The manufacturing process name used to mean the transistor gate length. However, for the last 15 or so years, it has been just an artificial placeholder representing a doubling of transistor density between generations. In the Intel 22nm to 14nm process transition, the density increased by 2.5X. See the summary of Intel's 10nm process from IEDM 2017 and ISSCC 2018 on fuse.wikichip.org.

It would seem that one doubling of DRAM density between 256Mb at 180nm and 8Gb at 22nm is missing, but this could be an upcoming 16Gb single die. There is a convention in which the DRAM bit cell size is specified as xF², where F is the process linear dimension.
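The "missing doubling" is just a count of process transitions against density doublings. A minimal sketch, assuming (as the text does, tentatively) that the 256Mb SDRAM was on 180nm and the 8Gb DDR4 is on 22nm:

```python
# Process sequence cited in the text; the 180nm and 22nm endpoints are guesses
nodes_nm = [180, 130, 90, 65, 45, 32, 22]
transitions = len(nodes_nm) - 1  # number of node-to-node shrinks

start_mbit = 256        # 256Mb SDRAM
end_mbit = 8 * 1024     # 8Gb DDR4

# Integer log2 of the density ratio (the ratio is an exact power of two)
doublings = (end_mbit // start_mbit).bit_length() - 1

print("process transitions:", transitions)
print("density doublings:  ", doublings)
print("unaccounted for:    ", transitions - doublings)
```

If the SDRAM was actually on 250nm, the gap would be one larger.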

DRAM Die Images

The image below, from Under the Hood: The race is on for 50-nm DRAM, shows the Samsung 1Gb DDR2 die, a 2009 product? It is a 50nm-class part with a 6F²-based cell design. In this case, 50nm class means 58nm.

samsung_dram_1Gb2

There are 8 banks in this product. I cannot find a relatively modern DRAM die layout showing which control (and other) elements belong to the entire chip versus the non-array elements of each individual bank.

The Under the Hood article also mentions a Hynix 8F² cell design, 16.5 percent larger than Samsung's, in a 1Gbit DDR2 SDRAM whose 45.1 mm² chip is 2.7 percent larger than Samsung's 1Gbit DDR2 SDRAM.

Semiconductor Manufacturing & Design has a die image of the Samsung 1Gb DDR3 in How to Get 5 Gbps Out of a Samsung Graphics DRAM, shown below.

DDR3_Samsung1Gb_DDR3

Embedded has a discussion on Micron SDRAM and DDR in Analysis: Micron cell innovation changes DRAM economics.

SDRAM and DDR

One aspect of cost optimization is to keep the logic portions small in relation to the bit cell array. DRAM manufacturers recognized that sustained bandwidth was important enough to warrant increasing the number of banks as necessary.

Modern memory is DDR4 SDRAM, currently 8Gbit at the die level, and the same for a single chip package. The Micron 8Gb x4 and x8 parts use a 78-ball package. In the 2Gx4 organization, 31 bits are required for the full address. The address bus is 21 bits. Four bits are dedicated to the bank group and bank. The remaining 17 bits are multiplexed, with all 17 bits used for the row address. Technically, the column address comprises 10 bits, but 8 words are fetched simultaneously so only 7 bits go into the column decoder. The x16 part is in a 96-ball package. The Micron RL-DRAM at 1.125Gb density comes in a 168-ball package for either x18 or x36 organizations and has 24 signals for addressing, not multiplexed.

The DDR4 DIMM form factor has 288 pins with a 72-bit data bus, of which 8 bits are for ECC. Presumably, the cost of the extra signals for a non-multiplexed address is no longer a financial factor, and the multiplexed address is merely a matter of inertia.

A significant portion of the memory latency in multi-processor servers occurs in the system elements outside of DRAM. A reduction in latency on the DRAM chip may not impact total memory latency in the MP system sufficiently to outweigh the cost. However, in a single processor system with less outside contribution, the impact is sufficient to justify even a very large increase in DRAM cost. The first opportunity is for low latency DRAM that would be compatible with the existing memory interface of current or near-term next generation processors, be it DDR4 now or DDR5 in a couple of years.

The next step is for memory with a new optimized interface, which must be implemented in conjunction with the processor. The most obvious change is to demultiplex the address bus, basically RL-DRAM, but optimized for server systems. Memory latency is so important that it is likely even SRAM cost structure is viable, but the discussion here focuses on DRAM.

DRAM - preliminary

The DRAM chip is an enormously complicated entity for which only some aspects are briefly summarized here. See Onur Mutlu and other sources for more.

Below is a simplified rendering of the Micron 8Gb DDR4 in the 2G x 4 organization. See the Micron datasheet for a more complete and fully connected layout (DDR4 8Gb 2Gx4).

Micron_8Gb

The DIMM

The Dual In-line Memory Module has been a standard for many years now. In the figure below, a memory module is shown with 9 DRAM chip packages forming a 72-bit wide word or line.

DIMM_2R

Of this, 64 bits are data and the 8 extra bits are for ECC. There is an extra chip for registered DIMMs. Desktop systems use unbuffered DIMMs and do not have ECC.

A module can be double-sided. When there are chip packages for two separate sets, the DIMM is said to have 2 ranks. In the case above, there is one set on each side of the DIMM, hence dual rank.

The figure below shows two separate sets on each side, for a total of four sets.

DIMM_4R

In this case, the DIMM is rank 4. The DIMM datasheet will have the actual configuration as there are exceptions. Below is a DIMM probably of rank 4.

RDIMM

Each package can have stacked DRAM die. The image below is probably NAND die as it appears to be stacked 16 high. DRAM packages with stacked die are usually 2 or 4 high? See Samsung’s Colossal 128GB DIMM.

chip_stack

 

In the Micron 8Gb DDR4, the die is sub-divided into 4 bank groups. Each group has 4 banks for a total of 16 banks. Within a bank, there are rows and columns. The row is a word-line and the column is a bit-line.

The 2G x 4 organization means there are 2G words of 4 bits. The address bus has 21 signals. Two bits are for addressing the bank groups, and two bits are for the banks. The remaining 17 signals are multiplexed for the row and column addresses. All 17 bits are used for the row (128 x 1024 = 131,072) address. Only 10 bits are used for the column address. The total address is then 4 + 17 + 10 = 31 bits, sufficient for 2G words.

Notice that only 7 bits of the column address go into the column decoder. Eight words are fetched from the bank on each access, so the low 3 column bits only select a word within that fetch.
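The 31-bit breakdown can be illustrated with a small Python sketch. The function name and the field ordering (column in the low bits, then row, bank, bank group) are assumptions made for illustration; a real memory controller picks its own mapping from physical address to these fields.

```python
def split_ddr4_2gx4_address(addr):
    """Split a 31-bit word address into DDR4 2Gx4 fields.

    The bit ordering here is hypothetical, chosen only to show the widths.
    """
    if not 0 <= addr < (1 << 31):
        raise ValueError("address must fit in 31 bits")
    column = addr & 0x3FF              # 10 column bits
    row = (addr >> 10) & 0x1FFFF       # 17 row bits (128K rows)
    bank = (addr >> 27) & 0x3          # 2 bank bits
    bank_group = (addr >> 29) & 0x3    # 2 bank-group bits
    return {
        "bank_group": bank_group,
        "bank": bank,
        "row": row,
        "column": column,
        "decoder_column": column >> 3,  # 7 bits go into the column decoder
        "burst_offset": column & 0x7,   # 3 bits pick a word in the 8-word fetch
    }


# Highest address: every field at its maximum
print(split_ddr4_2gx4_address((1 << 31) - 1))
```

The widths add up as in the text: 2 + 2 + 17 + 10 = 31 bits, with the column split into 7 + 3.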

DRAM Latency

Two items are mentioned as having significant impact on DRAM latency. One is the bank size. Another is the multiplexing of row and column addresses. In principle, it is a simple matter to sub-divide the DRAM chip into more banks, as the design of each bank and its control logic is the same. This would however increase the die size as there would be more of the control logic associated with each bank. Earlier we said that DRAM manufacturers understood that the market prioritized cost over latency.

In SDRAM 256Mbit, the bank size is 8Kx2K = 16M words, with 4 banks at x4 word. (I am using the term word to mean the data path out. This could apply to a DRAM chip at 4, 8 or 16 bits, and a DIMM at 64 or 72 bits.) For DDR4 8Gbit, bank size is 128Kx1K = 128M with 16 banks at x4 word size.

There is some improvement in latency between the manufacturing process for SDRAM in 1999 and DDR4 in 2017. Long ago, the expectation was a 30% reduction in transistor switching time, corresponding to a 40% increase in frequency for logic. From about the 90nm process forwards, transistor performance improved at a much diminished rate.

In DRAM, there are both the logic sub-units made of transistors and the bit cell array made of transistors and capacitors. I do not recall mention of capacitor performance versus process generation. Presumably it is minimal? Still, this allowed the main timing parameters to remain at 13-14ns each as bank size increased from 16M to 128M at 4-bit data width.

A single bank can only transfer one 8-word burst every tRC. This takes 4 clocks, as DDR transfers 2 data words every clock. The tRC is somewhat longer than tRCD + CL + tRP. For DDR4-2666, these components are 19 clocks each for a total of 57 clocks at 0.75ns per clock. The cited tRC at 3200 MT/s is 45.75ns and at 2933 MT/s is 46.32ns. Presumably then tRC at 2666 MT/s is either 61 clocks for 45.75ns or 62 clocks for 46.5ns. (61 clocks for tRC is 57 + 4?)
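The clock arithmetic for DDR4-2666 is easy to reproduce. This is only a sketch of the numbers in the paragraph above; the helper name `clock_period_ns` is made up, and the 61-versus-62 question stays open because the datasheet value at 2666 MT/s is not quoted here.

```python
def clock_period_ns(mt_per_s):
    """Clock period in ns; DDR moves two transfers per clock."""
    return 2000.0 / mt_per_s


tck = round(clock_period_ns(2666), 2)  # 0.75ns, as rounded in the text

component_clocks = 3 * 19   # tRCD + CL + tRP at 19 clocks each
burst_clocks = 8 // 2       # 8-word burst at 2 words per clock

print("tCK:", tck, "ns")
print("tRCD + CL + tRP:", component_clocks, "clocks =", component_clocks * tck, "ns")
print("plus burst:", component_clocks + burst_clocks, "clocks")

# The two candidate tRC values at 2666 MT/s
for clocks in (61, 62):
    print(clocks, "clocks =", clocks * tck, "ns")
```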

A single bank can transfer data for only 4 out of 61 or 62 clock cycles. DDR4 currently has 16 banks. So, it would seem that sustained memory bandwidth was important and warranted an increase in the number of banks. But latency was not deemed sufficiently important to further increase the number of banks.
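A rough way to see why 16 banks fits the bandwidth goal: count how many banks must take turns to keep the data bus busy for a whole tRC. This is an idealized sketch that assumes perfectly interleaved accesses and ignores refresh, bank-group timing and other constraints; `banks_to_fill_bus` is a name invented for this example.

```python
def banks_to_fill_bus(trc_clocks, burst_clocks):
    """Smallest bank count whose bursts cover every clock of one tRC."""
    return -(-trc_clocks // burst_clocks)  # ceiling division


burst_clocks = 4  # one 8-word burst
for trc_clocks in (61, 62):
    busy_fraction = burst_clocks / trc_clocks
    needed = banks_to_fill_bus(trc_clocks, burst_clocks)
    print(f"tRC = {trc_clocks} clocks: one bank is busy "
          f"{busy_fraction:.1%} of the time, {needed} banks fill the bus")
```

Under these assumptions, both candidate tRC values need 16 banks, which matches the DDR4 bank count.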

Reduced Latency DRAM

Micron has a RLDRAM 3 product at up to 1.125Gb. One organization is 4M x18 and 16 banks. The datasheet is 2015 so this might be on the same manufacturing process as the 2Gb DDR3. The RLDRAM package is 13.5mm x 13.5mm versus 8mm x 14mm for the 2Gb in x16 form. This might be due to the difference in pins, 168 versus 96. The two products might have the same die size or the 1.125Gb RLDRAM could have a larger die size than the 2Gb DDR3, and may or may not be on a common manufacturing process.

The 1.125Gb RL-DRAM has 16 banks. Each bank is 16K rows by 256 columns = 4M words, or 72M bits.

The appropriate comparison for this is the 2Gbit DDR3 in 128M x 16 organization. The comparable DDR3 chip has 8 banks. Each bank is 16K x 1K = 16M words at 16 bits per word = 256M bits. So the RLDRAM bank is 4 times smaller in words at a similar word width (18 versus 16 bits), or roughly 3.55 times smaller in terms of bits per bank.
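Working the two bank sizes through in Python gives both ratios, in words and in bits. The figures are the ones quoted in the text for the 1.125Gb RLDRAM 3 (x18) and the 2Gb DDR3 (x16).

```python
K = 1024
M = K * K

# RLDRAM 3, 1.125Gb, x18: 16K rows by 256 columns per bank
rl_words = 16 * K * 256
rl_bits = rl_words * 18

# DDR3, 2Gb, x16: 16K rows by 1K columns per bank
ddr3_words = 16 * K * 1 * K
ddr3_bits = ddr3_words * 16

print("RLDRAM bank:", rl_words // M, "M words,", rl_bits // M, "M bits")
print("DDR3 bank:  ", ddr3_words // M, "M words,", ddr3_bits // M, "M bits")
print("ratio in words:", ddr3_words / rl_words)
print("ratio in bits: ", round(ddr3_bits / rl_bits, 2))
```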

The second aspect of RLDRAM is that the row and column addresses are not multiplexed. For the 64M x 18 organization, there are 25 address bits, even though 26 bits are required to address 64M words. There are 4 bits for the bank address. The row address is 14 bits. The column address is 7 bits, but only 5 bits go into the column decoder. The bank is shown as 16K x 32 x 8 (x18 data). The 5 bits going into the decoder correspond to 32 columns. I am presuming that the address has 2-word granularity?

Between the smaller bank size and the non-multiplexed address, Micron cites their RL-DRAM as having a tRC minimum value of 6.67ns (8 clock cycles at 2400 MT/s, or 0.833ns per clock). But why only cite the minimum value? We are mostly interested in the average value and possibly the min-max range, excluding refresh, which all DRAM must have.
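For scale, the RLDRAM figure can be set next to the DDR4-2666 candidate from earlier. This sketch reads 0.833ns as the clock period at 2400 MT/s (two transfers per clock) and takes 61 clocks at 0.75ns as the DDR4 tRC, which the text only presumes.

```python
def cycles_to_ns(cycles, mt_per_s):
    """Convert clock cycles to ns, with two transfers per clock."""
    return cycles * 2000.0 / mt_per_s


rldram_trc_ns = cycles_to_ns(8, 2400)  # Micron's cited minimum tRC
ddr4_trc_ns = 61 * 0.75                # presumed DDR4-2666 tRC

print("RLDRAM 3 tRC (min):", round(rldram_trc_ns, 2), "ns")
print("DDR4-2666 tRC:     ", ddr4_trc_ns, "ns")
print("ratio:             ", round(ddr4_trc_ns / rldram_trc_ns, 1))
```

The ratio compares a minimum against a typical value, so it overstates the gap to whatever extent the RLDRAM average is above its minimum.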

My guess would be that by having extra banks, the tRP period can be hidden if accesses are randomly distributed between banks. If so, then the smaller banks reduce the 13-14ns timings to 6.67ns?

It is presumed that both the smaller bank size and the non-multiplexed address contribute to significantly lower latency. Some details on how much each aspect contributes would be helpful.

As we are accustomed to multi-GHz processors, it might seem strange that the multiplexed row and column address has much of a penalty. The tRCD and CAS latency are each 13-14ns in mainstream DRAM. In this regard, we should recall that processors have been at 10 pipeline stages since Pentium Pro in 1995. The post-Pentium 4 Intel processors are probably 14-16 pipeline stages, though Intel no longer shows the number of stages in their micro-architecture diagrams.

In this regard, the full start to finish time to execute an instruction is on the order of 5ns. Then factor in that DRAM has a very different manufacturing process than logic, one that is not optimized for performance. It is presumed that the logic on DRAM is not pipelined except for the data transfer sub-units? (The DDR logic core runs at 1/8th of the data transfer rate, or 1/4 of the command/address clock. On DDR4-2666, the core runs at 333MHz.)
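The two figures in this paragraph can be sanity-checked in a few lines. The 3GHz clock and 15 pipeline stages are assumptions picked from within the ranges the text mentions ("multi-GHz", 14-16 stages), not measured values.

```python
# DRAM core clock: 1/8 of the data rate, equivalently 1/4 of the command clock
data_rate_mts = 2666
command_clock_mhz = data_rate_mts / 2
core_clock_mhz = data_rate_mts / 8
print("command clock:", command_clock_mhz, "MHz")
print("core clock:   ", core_clock_mhz, "MHz")
print("core cycle:   ", round(1000 / core_clock_mhz, 2), "ns")

# Processor pipeline traversal time (assumed: 15 stages at 3GHz)
stages = 15
cpu_ghz = 3.0
print("pipeline time:", stages / cpu_ghz, "ns")
```

One DRAM core cycle is about 3ns, so a 13-14ns timing parameter is only a handful of core cycles.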

Low Latency DRAM

As mentioned earlier, we are interested in two separate approaches to low latency DRAM for server systems in handling transaction processing type workloads. The near-term approach is to retain the existing mainstream DDR4 interface and DDR5 when appropriate. Ideally, the new special memory could be compatible with now or then existing processors designed for conventional memory. But some divergence may be necessary and is acceptable.

This would preclude a non-multiplexed address bus. The number of banks would be increased from 16 in the 8Gb DDR4 presumably to 64 or more. The memory controller would have to know that there are more bits for bank addressing, which is why this new memory may or may not work (at full capability?) in existing processors. (The memory controller tries to schedule requests so that accesses do not go to a given bank within a tRC interval.)

But it is possible the next generation processor could work with both conventional and the new high bank count DRAM memory. (Before conventional memory was made with multi-banks, there was a Multibank DRAM product from Mosys.)

The low latency DRAM die would be larger than a similar density conventional DRAM chip. Either the new memory would have lower DIMM capacity options or the DIMM form factor would have to allow for a taller module. A 2X reduction in capacity would not be a serious issue. If the difference in areal size were 4X, then the larger form factor would be preferable.

In the longer term strategy, more avenues for latency reduction are desired. The RL-DRAM approach of non-multiplexed row and column addresses is the right choice. Somewhere it is mentioned that the RL-DRAM interface is not very different from SRAM. This could be an additional option that I looked at in SRAM as Main Memory.

Summary

We need to wake up to the fact that scaling performance with multi-processor systems should no longer be the standard default approach. Most situations would benefit greatly from the single processor system. On a hardware cost only evaluation, the cost-performance value could go either way. Two 20-core processors cost less than one 28-core, but one 24-core is about the same as two 16-core processors. When software per-core licensing is factored in, the single processor system advantage is huge. Once we accept the single processor system approach, it is easy to then realize that a new low latency memory provides huge additional value.

This needs to be communicated to both DRAM and processor manufacturers. A cost multiplier of 2 or even 4X is well worth it if system level memory latency could be reduced by 20ns. A near-term future product with a backward compatibility option would mitigate risk. The longer term approach is a clean break from the past to do what is right for the future.

Addendum

There is manufacturing variation in DRAM latency. Most mainstream DRAM is set at a conservative rating for high yield. Micron does have parts with slightly better latency, usually denoted with an E in the speed grade. Some companies (e.g. G.Skill) offer memory made with specially screened parts. One product is DDR4-4266 (2133MHz clock) at 17-18-18 timings. CL 17 at 2133MHz is 7.97ns. RCD and RP at 18 are 8.44ns. This strategy works for extreme gaming, allowing a few deep-pocket players to get an advantage. For servers, it is probably worthwhile to offer two or three binned parts. If two, then perhaps the top one-third as the premium and the rest as ordinary.
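The conversion from timing cycles to nanoseconds for the screened DDR4-4266 parts is shown below as a short sketch; the helper name `timing_ns` is made up for this example.

```python
def timing_ns(cycles, clock_mhz):
    """Convert a timing in clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


clock_mhz = 4266 // 2  # DDR4-4266 runs a 2133MHz clock

for name, cycles in (("CL", 17), ("tRCD", 18), ("tRP", 18)):
    print(f"{name} = {cycles} clocks = {timing_ns(cycles, clock_mhz):.2f} ns")
```

Against the 13-14ns of mainstream parts, these screened timings are roughly 40% lower.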

The other factor that affects latency is temperature. Lower temperature allows the charge state of the capacitor to be measured more quickly. Many DRAM parts list 64ms, 8192-cycle refresh for up to 85C, and 32ms refresh for >85C to 95C. Onur's slides also say that lower temperature allows lower timing values. This is one reason why the special DIMMs for gaming have a heat sink. So, a question is whether we want this or even more elaborate cooling for our low latency strategy. Perhaps cooling should be used for the specially binned parts, and just normal cooling for the bulk parts.

Also see The Memory Guy
Super-Cooled DRAM for Big Power Savings,
Is Intel Adding Yet Another Memory Layer?,
A 1T SRAM? Sounds Too Good to be True!, Zeno Semi, A 1T (or 2T) SRAM Bit Cell