DRAM Summary/Update(2018-10)

The figure below from Impact of Processing Technology ... Jeffrey Carl Gealow, 1990, MIT, shows access time for a range of DRAM generations.

DRAM_accesstime

The vertical axis is access time. The Mostek MK4096 4Kb chips have access times of 250-350ns and cycle times of 375-500ns.

Below is a brief summary, with some information from the Wikipedia DDR SDRAM article. The years correspond to use in some Intel chipsets, while Wikipedia shows 2000, 2003, 2007, and 2014 for DDR, DDR2, DDR3 and DDR4 respectively.

Type    Density    Bus clock (MHz)   Internal clock (MHz)   MT/s        Prefetch (min)   Banks   Year
SDRAM   64-512M    66-167            66-167                 66-167      1n               4       1997
DDR     256M-1G    100-200           100-200                266-400     2n               4       2002
DDR2    512M-2G    200-533           100-266                400-1066    4n               4-8     2004
DDR3    1-8G       400-1066          100-266                800-1866    8n               8       2009
DDR4    4-8G       1066-2133         133-266                2133-4266   8n               16      2014

The table below shows some organization details for selected Micron DRAM products from SDRAM through DDR4. The year column is the year of the datasheet, which might be a revision published after initial product launch.

Type   Den.   Org      Banks   Row addr. bits   Col addr. bits   Rows x Columns   Year   MT/s   Timing     tRC
SDR    256M   64Mx4    4       13               11               8K x 2K          1999   133    15/20ns    60/66ns
DDR    512M   128Mx4   4       13               12               8K x 4K          2000   400    15ns       55ns
DDR2   512M   128Mx4   4       14               11               16K x 2K         2004   1066   13.13ns    54ns
DDR2   2G     512Mx4   8       15               11               32K x 2K         2006   1066   13.13ns    54ns
DDR3   1G     256Mx4   8       14               11               16K x 2K         2008   1866   13.91ns    48ns
DDR3   4G     1Gx4     8       16               11               64K x 2K         2017   2133   13.09ns    47ns
DDR4   8G     2Gx4     16      17               10 (7)           128K x 1K        2017   2666   14.25ns    46ns

The timing column is shown as a single value because the CL-tRCD-tRP elements are often identical. The values have been converted from clock cycles to nanoseconds. When there are two timing values, the higher is the regular part and the lower is the E part.

Below is an image of the Samsung 20nm 8Gb DDR4.

Samsung8Gb   5.8 mm × 9.7 mm die

Tan, Gutmann; Tan, Reif (2008) Wafer Level 3-D ICs Process Technology mentions declining array efficiency.

Type   Array utilization   Banks
SDR    78%                 4
DDR    70%                 4
DDR2   47%                 4-8
DDR3   38%                 8
DDR4   <30%?               16

The bit array has been scaling with manufacturing process, but certain analog sub-units such as the charge pumps, regulators and sense amps(?) are not scaling.

The table below shows some DRAM products by bank size?

Type      Density   Banks   Bank size   Org               tRC
DDR4      8G        16      512M        131,072×128×32    46ns
RL-DRAM   1.125G    16      72M         16,384×32×8×18    6.7ns?
eDRAM     1G        128     8M          4×8×256×1024      3ns

The Intel 1G eDRAM has 128 × 8M banks.
A bank has 4 quarters, each with 8 subarrays.
A subarray is 256 word-lines × 1024 bit-lines.

Reduced Latency DRAM

Micron has RLDRAM. The current version is RLDRAM 3 at 576Mb and 1.125Gb (64M x 18 = 1152Mb = 1.125 x 1024Mb) density. The 1.125Gb products cite "Reduced cycle time tRC (min) = 6.67-8ns". Exactly what is meant by tRC (min) is unclear. Is there a maximum?

RLDRAM has non-multiplexed (bank-)row-column addressing, but can it also accept multiplexed addresses? If so, then we could use 4 of the x18 parts to form a 72-bit rank with a capacity of 0.5GB. A quad-rank DIMM would be 2GB. Then 8 × 2GB in a Xeon E5 would be 16GB, enough for a proof-of-concept test?

Hotchips 28, Memory as We Approach a New Horizon, J. Thomas Pawlowski.

Embedded DRAM

According to A 1Gb 2GHz Embedded DRAM in 22nm Tri-Gate CMOS Technology, Hamzaoglu et al. at ISSCC 2014, the 128MB eDRAM is 77mm². The 1GB including ECC should be 693mm² (9 × 77mm²). Presumably 2GB on 14nm is possible?

Technically, eDRAM is DRAM manufactured on a logic process as opposed to a DRAM process, regardless of whether it is embedded in a logic chip. The logic process does not have the same density as a specialized DRAM process? The figures below are also from ISSCC 2014 5.9 Haswell A Family of IA 22nm Processors (slide from WCCFtech).

eDRAM1Gb

Note that this 1Gbit eDRAM chip has about 11% less capacity than the Micron 1.125Gb RL-DRAM (the RL-DRAM is 12.5% larger), yet it is divided into 128 banks versus 16 for the RLDRAM product. Random cycle time is cited as 3ns, better than the tRC (min) of 6.67ns for RLDRAM (do these terms have the same meaning between the two products?).

eDRAM1Gb eDRAM1Gb

Die size is 77mm². The Silicon Edge die-per-wafer estimator says 784 die of dimensions 10 x 7.7mm fit on a 300mm wafer. Motley Fool estimates Intel's 14nm cost for a 300mm wafer at $9,100 (and $7,000 for 22nm). At 700 good die (a guess), raw cost is $13 per 77mm² die, so perhaps a $26 price? The 8Gbit DDR4 product might have a price of $8. Presuming the 2Gb eDRAM is still 77mm² at 14nm, then the price per bit is perhaps 13 times that of DDR4?
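The arithmetic behind that estimate can be written out as a short Python sketch. Every input is one of the guesses quoted above, not measured data. The 2x price-to-cost factor is only the assumption implied by going from $13 to $26, and the helper name usd_per_gbit is made up for illustration.

```python
# Back-of-envelope eDRAM vs DDR4 price comparison.
# All inputs are the guesses quoted in the text, not measured data.
WAFER_COST_14NM_USD = 9100.0   # Motley Fool estimate for a 300mm wafer
GOOD_DIE_PER_WAFER = 700       # guess (the estimator gives 784 gross die)
PRICE_TO_COST = 2.0            # assumption: price is twice the raw die cost


def usd_per_gbit(price_usd, capacity_gbit):
    """Price per Gbit for a single die."""
    return price_usd / capacity_gbit


raw_die_cost = WAFER_COST_14NM_USD / GOOD_DIE_PER_WAFER   # about $13
edram_price = PRICE_TO_COST * raw_die_cost                # about $26
edram_per_gbit = usd_per_gbit(edram_price, 2.0)           # presumed 2Gb die at 14nm
ddr4_per_gbit = usd_per_gbit(8.0, 8.0)                    # $8 guess for an 8Gb die
price_ratio = edram_per_gbit / ddr4_per_gbit

print(f"raw die cost ${raw_die_cost:.2f}, eDRAM ${edram_per_gbit:.2f}/Gb, "
      f"DDR4 ${ddr4_per_gbit:.2f}/Gb, ratio {price_ratio:.1f}x")
```

Changing the good-die guess or the markup shifts the ratio proportionally, so the 13x figure is only as good as those two inputs.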

 


DRAM and the Memory Subsystem (2018-03)

Long ago, memory was desperately needed in database systems to reduce disk IO to an achievable level. This meant that the primary objective in DRAM was cost and other aspects could be sacrificed. As processors became more powerful, it was also necessary to have sufficient sustained bandwidth. And so bandwidth and cost were the two principal requirements that DRAM manufacturers built to.

In recent years, server systems have more than enough memory that IO has been reduced to noise levels, and this occurred as the use of NAND flash for storage was becoming pervasive. Now it is time to re-evaluate the objectives for memory in modern systems. With the slowdown between silicon process generations, we expect greater specialization for the particular requirements of each of the major application categories. In database transaction processing, characterized by pointer-chasing code, the need is for low latency memory. The technology for this already exists. Low latency products like Reduced Latency DRAM (RLDRAM) have been used in network switches for years.

Two primary reasons that memory latency has not been pursued are as follows. One is that the majority of latency occurs outside of the DRAM interface. Second, low latency in DRAM costs more, and the perception was that this would not add significant value to the product. Today, memory latency is so important that we should abandon multi-processor systems with their inherent non-uniform memory access (NUMA) implications.

It is not widely appreciated that scaling down to a single processor gives up only 30% of the overall throughput of a 2-way system for a transaction processing workload. This alone removes much of the external latency contribution. From here, a reduction in DRAM latency has a greater impact, one that is more than sufficient to justify the cost. It is reasonable to expect that a 20ns reduction in latency can be achieved with known existing measures. This is sufficient to allow the single processor system to match the performance of a 2-way system having conventional memory. The value of this capability in hardware terms alone is already high. When software licensing on a per-core basis also applies, the value of low latency memory becomes enormous.

Early DRAM

The DRAM IC originated with the Intel 1103 in 1970 as a 1Kbit device in an 18-pin package, on a 10µm process, having 300ns access time and 580ns (random) cycle time. Multiplexed addressing was introduced with the 4Kbit Mostek MK4096 having 16 pins, versus 22 pins for the Intel 2107 4Kb product. Ever since, mainstream memory products have had multiplexed addressing.

A simple representation of a basic DRAM is shown below. There is one array of bit cells, addressed by row and column. The row and column addresses are multiplexed.

DRAM_1bank

The 4Kbit part in a x1 organization has 4K words, requiring 12-bits to address. In the MK4096, this was accommodated with 6 signals multiplexed into 6-bit row and 6-bit column addresses versus 12 pins for the Intel 2107. The access time for the MK4096-6 part was 250ns and the cycle time 375ns.

Presumably, there was some cost difference between 16 and 22-pin packages. At the time, a mini-computer like the VAX-11/780 might have had a memory configuration of 1MB (2MB max with the 4Kb DRAM chip). The number of DRAM chips required for 1MB using 4Kb die is 2048 for data plus 256 for ECC, a total of 2304. This was before the SIMM, so memory came on boards of 128KB (288 chips).
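The chip counts above follow from simple division. Here is a minimal Python sketch, assuming x1 parts and one ECC bit per 8 data bits (which is what 256 ECC chips per 2048 data chips implies). The function name chips_needed is hypothetical.

```python
def chips_needed(capacity_bytes, chip_bits, data_bits_per_ecc_bit=8):
    """Count x1 DRAM chips for a given capacity.

    Assumes one ECC bit for every `data_bits_per_ecc_bit` data bits,
    as implied by the 2048 data + 256 ECC chip figures in the text.
    Returns (data_chips, ecc_chips, total_chips).
    """
    data_bits = capacity_bytes * 8
    data_chips = data_bits // chip_bits
    ecc_chips = data_chips // data_bits_per_ecc_bit
    return data_chips, ecc_chips, data_chips + ecc_chips


KB = 1024
MB = 1024 * KB
CHIP_4KBIT = 4 * 1024

print(chips_needed(1 * MB, CHIP_4KBIT))     # full 1MB configuration
print(chips_needed(128 * KB, CHIP_4KBIT))   # one 128KB memory board
```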

Is it possible that the cost implication of wiring so many chips had the larger role in favoring multiplexed addressing than the cost difference between 16 and 22-pin packages? Any old-timers who were present in this ancient era are welcome to comment.

Junji Ogawa's DRAM Design Overview at Stanford 1999 has a timeline for 1K - 64M from 1971 to 1995. Also see Sudeep Pasricha ECE 554 Computer Architecture lecture Main Memory 2013 Colorado State University. In the 1990's, FPM and EDO DRAM had an access time of 50-60ns, with a cycle time of 84-104ns (Micron EDO DRAM 16M).

SDRAM and DDR-DDR4

Since then, mainstream memory products have been some form of Synchronous DRAM. The original single data rate version was replaced by double data rate versions, currently DDR4. The SDRAM has multiple banks, allowing interleaved operation.

DRAM_1bank

The first Intel chipsets to support SDRAM were the 430TX and 440LX in 1997. Single data rate SDRAM was supported in the Intel Pentium II processor with the 440BX chipset in 1998.

Intel then made an unfortunate choice for the next generation memory before subsequently adopting DDR for Pentium 4 with the 845 chipset around 2002. DDR memory in Xeon servers was supported with the E7500 MCH also in 2002? In DDR, data is transferred at twice the command-address clock. Both SDR and DDR products are specified in terms of the data rate.

DDR2 was supported with 915-955 chipsets in 2004. Intel 5000P-5400 MCH supported the FB-DIMM version of DDR2. Servers supported DDR2 with the 5000 and 7300 chipsets. In DDR2, the internal core of the DRAM operates at one-half the command-address rate.

DDR3 was supported with the Intel Nehalem processors at 800/1066/1333 MT/s. The server products include the Xeon 5500 series. Sandy Bridge (Xeon E5) and Ivy Bridge (E5 v2) were also DDR3 800/1066/1333/1600. DDR4 was supported in Haswell (Xeon E5 v3) at 1600/1866/2133 MT/s. Broadwell (E5 v4) extended to 2400MT/s. Skylake (Xeon SP) further extended to 2666 MT/s. In DDR3, the core operates at one-quarter of the command-address clock.

The increase in data rate from SDR through the DDR versions has been made possible by increasing the prefetch. In DDR, the prefetch is 2n, increasing to 4n in DDR2. Both DDR3 and DDR4 prefetch 8n. In other words, the internal clock has been largely unchanged over the years. The bus clock represents the address and command rate. The multi-bank arrangement contributed to higher sustained bandwidth. Some improvement came from the manufacturing process but much is from increasing the number of banks.

DDR SDRAM Timing

The diagrams below are from Bounding Worst-Case DRAM Performance on Multicore Processors, on Oak Central. The original source is: "Memory Systems: Cache, DRAM, Disk" by Bruce Jacob et al. (2008). The first image shows "Command and data movement for an individual DRAM device access on a generic DRAM device."

E1EIKI_2013_v7n1_53_f003b

The next figure is "A cycle of a DRAM device access to read data."

E1EIKI_2013_v7n1_53_f004b

The three DDR timing parameters commonly cited are tCAS (or just CL), tRCD, and tRP. A fourth number that may be cited after the first three is tRAS. From left to right, the first two elements are tRCD and CL. The data burst period can be determined from the data transfer rate. In DDR3 and 4, it is 8 cycles of the data rate or 4 cycles of the command/address rate. The data restore phase is not specified and overlaps with the data burst. However, tRAS is the sum of the first two elements tRCD and CL, plus the net of the unspecified elements. The full cycle time tRC is the sum of tRAS plus tRP.

The tRCD, CL and tRP elements are often identical and in the 13-14ns range. In DDR4, the data burst phase transfers 8 words. For DDR4-2666, the command-address clock period is 0.75ns and the data transfer rate is one word every 0.375ns, so 8 transfers take 3ns.
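For reference, the DDR4-2666 figures can be reproduced with a few lines of Python. This is only a sketch. The function names are made up, and the nominal rate is taken as 8000/3 MT/s (2666.67) rather than exactly 2666.

```python
def transfer_period_ns(mt_per_s):
    """Time for one data transfer, in ns."""
    return 1000.0 / mt_per_s


def clock_period_ns(mt_per_s):
    """Command/address clock period: DDR moves two transfers per clock."""
    return 2.0 * transfer_period_ns(mt_per_s)


def burst_time_ns(mt_per_s, burst_length=8):
    """Duration of one data burst (8 transfers in DDR3 and DDR4)."""
    return burst_length * transfer_period_ns(mt_per_s)


DDR4_2666 = 8000.0 / 3.0   # nominal 2666.67 MT/s

print(f"clock {clock_period_ns(DDR4_2666):.3f}ns, "
      f"transfer {transfer_period_ns(DDR4_2666):.3f}ns, "
      f"burst {burst_time_ns(DDR4_2666):.2f}ns")
```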

Database transaction processing is largely an exercise in pointer-chasing. The key performance criterion is then sustained serialized random memory access. In this regard, CL matters no more than the other elements, unlike in sequential or localized memory access. The parameter of interest is probably tRC. This value is sometimes not cited, or it might be difficult to find.

For random memory accesses, i.e., to different rows, each bank can only be accessed once per tRC. Let us suppose that the data burst interval is 8ns (1000MT/s) and tRC is 56ns. Then only one-seventh of the nominal data transfer rate can be sustained by one bank.

With 8 banks, the full transfer rate could be sustained by a single DIMM of rank 1 if there is one access to each bank. Of course, we should also consider that a DIMM could have 1, 2 or 4 ranks and a memory channel could have 2-3 DIMMs in DDR 3 and 4, and up to 4 in SDR, DDR and DDR2(?).

For a Xeon SP 28-core processor running a transaction processing workload, the assumption is that each thread spends 90% of its cycles waiting for memory. With HT enabled, there are 56 concurrent threads, so 50 are waiting for memory. There are 2 memory controllers and 3 memory channels on each controller. In earlier generations, multiple channels might be arranged in parallel. But I would guess that in the late DDR4 generation (2666MT/s), the better approach is to treat each channel independently. Then 6 independent channels each capable of supporting 16 overlapping memory requests allows for 96 concurrent memory requests.
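The concurrency estimate in this paragraph is easy to restate as code. The sketch below just multiplies out the stated assumptions (90% of cycles waiting, 16 overlapping requests per channel). None of these are measured values, and the function names are hypothetical.

```python
def waiting_threads(cores, threads_per_core, wait_fraction):
    """Average number of hardware threads stalled on memory."""
    return cores * threads_per_core * wait_fraction


def request_slots(controllers, channels_per_controller, requests_per_channel):
    """Concurrent memory requests if every channel operates independently."""
    return controllers * channels_per_controller * requests_per_channel


waiting = waiting_threads(cores=28, threads_per_core=2, wait_fraction=0.90)
slots = request_slots(controllers=2, channels_per_controller=3,
                      requests_per_channel=16)

print(f"about {waiting:.0f} threads waiting, {slots} request slots")
```

Under these assumptions the memory system has roughly twice as many request slots as there are waiting threads.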

Micron SDRAM through DDR4

The MT/s or data transfer rate shown is one of the common values, and not necessarily the maximum value. Most of the parameters above are in the DRAM chip datasheet, but tRC is usually found in the DIMM datasheet, and sometimes buried in the back pages.

The 256Mb SDRAM datasheet is listed as 1999. This could be a 250 or 180nm manufacturing process. The 8Gb DDR4 is probably a 2016 or 2017 product, possibly 22nm or other 2x process. From 256Mb to 8Gb, there are 5 doublings of density. In between 180nm and 22nm are five manufacturing processes: 130, 90, 65, 45 and 32nm. (There is a difference between DRAM and logic processes, and DRAM process is currently behind logic?)

The manufacturing process used to mean the transistor gate length. However, in the last 15 or so years, it is just an artificial placeholder representing a doubling of transistor density between generations. In the Intel 22 to 14nm process transition, the density increased by 2.5X. See the summary of the IEDM 2017 and ISSCC 2018 on Intel's 10nm at fuse.wikichip.org. There is a convention in which the DRAM bit cell size is specified as xF², where F is the process linear dimension.

The CL-tRCD-tRP values have largely not changed over the years. There was a slight drop from DDR to DDR2, but this could be because of the coarse granularity between clock cycles. Micron usually has a regular part and an E part with slightly better timing. The 13ns values are E parts, and the 14ns values are regular parts. Lesser variations are just from how the clock cycle multiplier works out.

A major factor in the three timing values is bank size. A larger bank incurs a higher timing value. I am presuming that the timing values have held steady because manufacturing process improvements have been balanced out by increasing bank size, from 64Mb per bank in the 256Mb SDRAM (4 banks) to 512Mb per bank in the current 8Gb DDR4.

Notice that the difference between tRC and the sum tRCD+CL+tRP has shrunk from SDRAM to DDR4. This might be because the data transfer portion has decreased?

Just as a rough estimate, 4 banks is sufficient for a single DIMM of rank 1 to sustain the full transfer rate up to 533MT/s at 60ns tRC, 8 banks for 1200MT/s at 54ns tRC, and 16 banks for 2666MT/s at 48ns tRC, all assuming 8 word bursts. DDR was designed for 2n, DDR2 for 4n, and both DDR 3 and 4 are 8n.
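These rough estimates come from one relation: each bank can deliver one burst per tRC, so the sustainable rate is banks × burst length / tRC. A minimal Python sketch follows, assuming purely random accesses spread evenly over the banks of a single rank. The function names are made up for illustration.

```python
import math


def sustained_mt_per_s(banks, trc_ns, burst_length=8):
    """Peak sustainable transfer rate when each bank is hit once per tRC."""
    return banks * burst_length / trc_ns * 1000.0


def banks_to_saturate(mt_per_s, trc_ns, burst_length=8):
    """Banks needed so that bursts from different banks fill every tRC."""
    burst_ns = burst_length * 1000.0 / mt_per_s
    return math.ceil(trc_ns / burst_ns)


for banks, trc in [(4, 60.0), (8, 54.0), (16, 48.0)]:
    rate = sustained_mt_per_s(banks, trc)
    print(f"{banks:2d} banks at tRC {trc:.0f}ns: about {rate:.0f} MT/s")
```

For the earlier example of an 8ns burst (1000MT/s) and 56ns tRC, banks_to_saturate(1000, 56) gives 7, matching the one-seventh per bank figure.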

DRAM Die Images

The image below, from Under the Hood: The race is on for 50-nm DRAM, shows the Samsung 1Gb DDR2 die, a 2009 product? It is a 50nm class part having a 6F²-based cell design. In this case, 50nm class is 58nm.

samsung_dram_1Gb2

There are 8 banks in this product. I cannot find a relatively modern DRAM die layout showing which control (and other) elements belong to the entire chip versus the non-array elements of each individual bank.

The Under the Hood article also mentions a Hynix 8F² cell design, 16.5 percent larger than Samsung's, in a 1-Gbit DDR2 SDRAM with a 45.1 mm² chip, 2.7 percent larger than Samsung's 1Gbit DDR2 SDRAM.

Semiconductor Manufacturing & Design has a die image of the Samsung 1Gb DDR3 in How to Get 5 Gbps Out of a Samsung Graphics DRAM, shown below.

DDR3_Samsung1Gb_DDR3

Embedded has a discussion on Micron SDRAM and DDR in Analysis: Micron cell innovation changes DRAM economics.

One aspect of cost optimization is to keep the logic portions small in relation to the bit cell array. DRAM manufacturers recognized that sustained bandwidth was important enough to warrant increasing the number of banks as necessary.

There are unbuffered DIMM products without ECC made with specially screened DRAM parts having lower (better) timing values. These products also have heat sinks for improved cooling, as temperature affects timing.

DDR4

DDR4 has 16 banks. There does not seem to be a published die image or layout. The diagram below is just a redrawing of the 8-bank DDR3 images from earlier, now with 16 banks. I am guessing that the banks are high and narrow because there are 17 row address bits and 10 column address bits (plus 2 for x4).

DDR4_16bank

(2018-Sep) EE Times Micron’s 1x DRAMs Examined, Jeongdong Choe, TechInsights 5/15/2018. "The die size of Micron’s 1x nm DDR4 DRAM is 58.48 mm², while the LPDDR4 is 52.77 mm²."

Micron8Gb_die
(Images: TechInsights)

Comment: I don't see how this is 16 banks?

Micron currently has an 8Gb DDR4 die. There is also a twin-die 16Gb package. A 16Gb die is under development.

Samsung just announced their 16Gb DDR4 die on a 10nm class process (presumably high 10's?). The 8Gb single die product lists modes up to 3200MT/s. At 2666MT/s, the regular part lists CL-tRCD-tRP at 19 clocks each, or 14.25ns. The E part is 18 clocks and 13.5ns. At 3200MT/s, it is an E part at 22 clocks and 13.75ns.
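The clock counts and nanosecond values quoted here are related by the command/address clock period, which is 2000/(MT/s) in ns. A small Python sketch of the conversion, with hypothetical function names and the nominal DDR4-2666 rate taken as 8000/3 MT/s:

```python
def cycles_to_ns(cycles, mt_per_s):
    """Convert command clock cycles to ns (clock runs at half the MT/s rate)."""
    return cycles * 2000.0 / mt_per_s


def ns_to_cycles(ns, mt_per_s):
    """Convert a time in ns to (possibly fractional) command clock cycles."""
    return ns * mt_per_s / 2000.0


DDR4_2666 = 8000.0 / 3.0   # nominal 2666.67 MT/s
DDR4_3200 = 3200.0

print(f"19 clocks at 2666: {cycles_to_ns(19, DDR4_2666):.2f}ns")
print(f"18 clocks at 2666: {cycles_to_ns(18, DDR4_2666):.2f}ns")
print(f"22 clocks at 3200: {cycles_to_ns(22, DDR4_3200):.2f}ns")
print(f"46.16ns at 2666:   {ns_to_cycles(46.16, DDR4_2666):.1f} clocks")
```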

The Micron TwinDie 16Gb product in x4 and x8 lists modes up to 2666MT/s at 18 clocks. The other TwinDie 16Gb for x16 has modes up to 3200MT/s. The Micron 64GB LRDIMM (with ECC) is quad-rank (QR) with TwinDie 16Gb components and has modes up to 3200MT/s. Perhaps one of the data sheets is not up to date? tRC at 2666MT/s is 46.16ns or 61.5 clocks.

Below is a simplified rendering of the Micron 8Gb DDR4 in the 2G x 4 organization. See the Micron datasheet for a more complete and fully connected layout (DDR4 8Gb 2Gx4).

Micron_8Gb

There are 2G words in the x4 part, and 31 bits are required for the full address. The address bus is 21 bits. Four bits are dedicated to the bank group and bank. The remaining 17 bits are multiplexed, with all 17 bits used for the row address (128 x 1024 = 131,072). Technically, the column address comprises 10 bits, but 8 words are fetched simultaneously so only 7 bits go into the column decoder. The total address is then 4 + 17 + 10 = 31 bits, sufficient for 2G words.
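To make the bit budget concrete, here is a Python sketch that splits a 31-bit word address into those fields. The field widths are from the text (with the 4 bank bits split as 2 for the bank group and 2 for the bank). The ordering of the fields within the address is purely an assumption for illustration, since a real memory controller chooses its own mapping, and the name split_address is hypothetical.

```python
BANK_GROUP_BITS = 2   # 4 bank groups
BANK_BITS = 2         # 4 banks per group
ROW_BITS = 17         # 128K rows
COLUMN_BITS = 10      # 1K columns
BURST_BITS = 3        # 8 words per fetch, so only 7 bits reach the decoder

TOTAL_BITS = BANK_GROUP_BITS + BANK_BITS + ROW_BITS + COLUMN_BITS


def split_address(word_address):
    """Split a word address into DRAM fields.

    Assumed layout, high to low: bank group | bank | row | column.
    """
    if not 0 <= word_address < (1 << TOTAL_BITS):
        raise ValueError("address out of range for a 2G-word part")
    column = word_address & ((1 << COLUMN_BITS) - 1)
    row = (word_address >> COLUMN_BITS) & ((1 << ROW_BITS) - 1)
    bank = (word_address >> (COLUMN_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    bank_group = word_address >> (COLUMN_BITS + ROW_BITS + BANK_BITS)
    return {
        "bank_group": bank_group,
        "bank": bank,
        "row": row,
        "column": column,
        "decoder_column": column >> BURST_BITS,   # the 7 bits that are decoded
    }


print(TOTAL_BITS, 1 << TOTAL_BITS)
print(split_address((1 << TOTAL_BITS) - 1))
```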

The die is sub-divided into 4 bank groups. Each group has 4 banks for a total of 16 banks. Within a bank, there are rows and columns. The row is a word-line and the column is a bit-line.

Both Micron 8Gb x4 and x8 parts use a 78-ball package (6 columns x 13 rows). The x16 part uses a 96-ball package (6 columns x 16 rows).

78 ball   96 ball

The x16 part has 8 more data signals than the x8, but one fewer for address. The extra signals in the 96-ball package appear to be additional supply voltage or perhaps for future growth?

An ISSCC 2016 slide on pc.watch.impress.co.jp cites GDDR5 8Gb die size at 62mm2 on a 20nm process. This might be the same size as DDR4 on the same process? Both 8Gb die have 16 banks. The Micron GDDR6 is x16 with a 180-ball package.

The number of banks has been scaled up to support higher sustained bandwidth, necessitating a decrease in bit array efficiency.

It should be noted that memory capacities have become far larger than truly necessary, driven even higher to compensate for incompetent storage configuration. But it is memory latency that is holding back performance. Memory latency can be improved with smaller bank size, requiring more banks, more non-scaling sub-units, and hence higher cost. This is trivial. The value of low latency memory is worth a doubling or quadrupling of cost per unit capacity.

Reduced Latency DRAM

Specialty DRAM products designed for low latency include Micron RLDRAM. The current version is RLDRAM 3 at 576Mb and 1.125Gb (64M x 18 = 1152Mb = 1.125 x 1024Mb) density. The 1.125Gb products cite "Reduced cycle time tRC (min) = 6.67-8ns". Exactly what is meant by tRC (min) is unclear. Is there a maximum?

Two characteristics of RLDRAM are non-multiplexed addressing and a higher number of banks. First, the addressing scheme collapses the tRAS and tCAS of conventional DRAM into a single operation. Second, the small bank size further reduces that one remaining timing value. The RLDRAM can also operate with multiplexed addressing.

The RLDRAM 3 datasheet is dated 2015, so it might be a contemporary of the 2Gb DDR3 or possibly the 4Gb DDR4 products. Micron does not mention what the cost of RLDRAM is relative to conventional DDR3/4. The Son et al. paper from ISCA 2013 mentions that the die might be 40-80% larger.

The 1.125Gb version has x18 and x36 options. This is 1Gb plus 1 extra bit per 8 for parity or ECC. Four of the x18 parts or two of the x36 could form a 72-bit word allowing for 64-bit data and 8-bit ECC. The 1.125Gb RL-DRAM has 16 banks. Each bank is 16K rows by 256 columns = 4M words, or 72M bits.

Micron_8Gb

An appropriate comparison for this might be the 2Gbit DDR3 in 128M x 16 organization, which has 8 banks. Each bank is 16K x 1K = 16M words at 16 bits per word = 256M bits. So, roughly, the RLDRAM bank is 3.55 times smaller at a similar point in time?
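The bank size comparison is just rows × columns × word width. A quick Python check of the two figures, using the organizations quoted above (the helper name bank_bits is made up):

```python
def bank_bits(rows, columns, word_bits):
    """Bits in one bank: rows x columns words, each word_bits wide."""
    return rows * columns * word_bits


MBIT = 1024 * 1024

rldram3_bank = bank_bits(16 * 1024, 256, 18)    # 1.125Gb RLDRAM 3, x18
ddr3_bank = bank_bits(16 * 1024, 1024, 16)      # 2Gb DDR3, 128M x 16

print(f"RLDRAM 3 bank: {rldram3_bank // MBIT}M bits")
print(f"DDR3 bank:     {ddr3_bank // MBIT}M bits")
print(f"ratio:         {ddr3_bank / rldram3_bank:.2f}")
```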

For the 64M x 18 organization, there are 25 address bits, even though 26 bits are required to address 64M words. There are 4 bits for the bank address. The row address is 14 bits. The column address is 7 bits, but only 5 bits go into the column decoder. A bank is shown as 16K x 32 x 8 (x18 data). The 5 bits going into the decoder correspond to 32 columns. I am presuming that the address has 2 word granularity?

Both the x18 and x36 parts share a common 168-ball package. It is presumed that fewer than 130 signals would be required for just the x18 product. Compared to the DDR3/4 x16 parts, the RL-DRAM has 5 more address signals and 2 more data signals. It would then seem that RL-DRAM only needs moderately more signals than mainstream DDR3/4.

The RLDRAM package is 13.5mm x 13.5mm versus 8mm x 14mm for the 2Gb in x16 form. This might be due to the difference in pins, 168 versus 96. The 1.125Gb RLDRAM die might be similar in size to a 2Gb DDR3 die on a common manufacturing process.

Between the smaller bank size and the non-multiplexed address, Micron cites RL-DRAM 3 as having a tRC minimum value of 6.67ns (8 clock cycles at 2400MT/s, or 0.833ns per clock cycle). But why only cite the minimum value? We are mostly interested in the average value and possibly the min-max range, excluding refresh, which all DRAM must have.

My guess would be that by having extra banks, the tRP period can be hidden if accesses are randomly distributed between banks. Perhaps the smaller banks reduce the 13-14ns timings to 6.67ns?

Comments

It is presumed that both the smaller bank size and the non-multiplexed address contribute to significantly lower latency. Some details on how much each aspect contributes would be helpful.

A significant portion of the memory latency in multi-processor servers occurs in the system elements outside of DRAM. A reduction in latency on the DRAM chip may not impact total memory latency in the MP system sufficiently to outweigh the cost. However, in a single processor system with less outside contribution, the impact is sufficient to justify even a very large increase in DRAM cost. The first opportunity is for low latency DRAM that would be compatible with the existing memory interface of current or near-term next generation processors, be it DDR4 now or DDR5 in a couple of years.

The next step is for memory with a new optimized interface, which must be implemented in conjunction with the processor. The most obvious change is to demultiplex the address bus, basically RL-DRAM, but optimized for server systems. Memory latency is so important that it is likely even SRAM cost structure is viable, but the discussion here focuses on DRAM.

Latency Options - Higher Banks

Two items were mentioned as having significant impact on DRAM latency. One is the bank size. The second is the multiplexing of row and column addresses. Other methods of impacting latency include binning, which is simply screening the parts for actual characteristics, and temperature.

In principle, it is a simple matter to sub-divide the DRAM chip into more banks, as the design of each bank and its control logic is the same. This would increase the die size as a larger percentage of the floor plan is occupied by control logic associated with the bit array. In the past, DRAM manufacturers operated under the assumption that there was not much of a premium for latency. In any case, smaller banks reduce the latency of the key timing values. DDR4 currently has 16 banks for 8Gb (512M per bank) versus RLDRAM 3 at 16 banks for 1.125G (72M per bank). Wide IO2 has 32 banks in an 8Gb die?

It is presumed that there are no fundamental problems in building an 8Gb die with 64 or even 128 banks? It is also presumed that the memory controller needs to know which address signals are for the bank address. This would allow it to plan the memory access sequence so that a given bank is not accessed more than once per tRC interval.

(Before conventional memory was made with multiple banks, Mosys had a multi-bank DRAM product.)

Discard Multiplexed Addressing

Removing the multiplexed address scheme is also not an issue as this is already the case in RLDRAM. The original reason for multiplexed addressing has long been rendered moot. Making the change now does require coordination between the processor and DRAM manufacturer. It is possible that Intel is reluctant to pursue this because their effort to make a significant change of direction in the memory interface for the Pentium 4 did not go well. That was their mistake because it was for desktop systems, which were less tolerant of higher prices, and while the new memory offered higher bandwidth, it also increased latency.

Here, we are advocating a reduction in memory latency. The Micron RLDRAM product can also accept a multiplexed address. Presumably then the new memory controller could be made to work with both the new non-multiplexed memory and the existing form of memory with multiplexed addressing.

Somewhere it's mentioned that the RL-DRAM interface is not very different from SRAM. This could be an additional option, which I looked at in SRAM as Main Memory.

Binning

This is simply a matter of screening parts as there are manufacturing variations. The common timing values of mainstream DRAM are based on very conservative timing for maximum yield.

Current Micron mainstream DRAM products already have two speed rating classes, the regular part and sometimes an E part that has moderately better timing. The E parts do not seem to be available for the server oriented RDIMM and LRDIMM parts with ECC.

G.Skill has one product, DDR4-4266 (2133 clock) at 17-18-18 timings. CL 17 at 2133MHz is 7.97ns. RCD and RP at 18 is 8.44ns. This strategy works for extreme gaming, allowing a few deep-pocket players to get an advantage.

For servers, it is probably worthwhile to offer two or three binned parts. If two, then perhaps the top one-third as the premium and the rest as ordinary.

Temperature

Temperature also affects timings. Most DRAM products specify 64ms refresh intervals for Tc up to 85°C and 32ms for 85-95°C. Lower temperature allows the charge state of the capacitor to be measured more quickly. Onur Mutlu's slides say that lower temperature allows lower timing values. The gaming DIMMs typically have elaborate heat sinks.

So, a question is whether we want this or even more elaborate cooling for our low latency strategy. Perhaps cooling should be used for the specially binned parts, and just normal cooling for the bulk parts. Whether this new system would have liquid cooling for memory is another discussion.

Alternatives

There are ideas that look to achieve low latency while also preserving low cost. The Son et al. paper (full title below) discusses splitting DRAM banks into fast and slow sections based on geometry and proximity to the control logic? This solution would most preserve the existing DRAM cost structure. The memory controller, operating system and application would all work together with the knowledge of fast and slow memory node or address ranges.

The Memory Guy Is Intel Adding Yet Another Memory Layer?, 2013 Feb cites an ISSCC 2017 paper from Piecemakers Technology on 8-channel, 72-bits per channel and 32 banks. Interface is SRAM-type, meaning non-multiplexed address. Latency is 17ns.

Summary

The key to justifying low latency memory at higher cost begins with the single processor (socket) system strategy. By removing as much of the latency contribution outside of DRAM as possible, improvements in the DRAM itself have greater impact. The value of achieving, in a single processor system, performance equivalent to the multi-processor systems widely used today is enormous.

The technology to substantially reduce memory latency already exists. What is required is for the processor and DRAM manufacturers to coordinate their efforts, as product changes from both are required. Even a doubling of memory price per GB is more than acceptable. This might go against the old rules, but it is based on correct analysis.

 

Addendum

DDR4, Samsung 16Gbit

Links for Samsung products with 16Gbit (die) DDR4
Samsung Adds 32 GB UDIMMs to Memory Lineup by Anton Shilov on September 5, 2018.
Samsung Kicks Off Mass Production of 64GB RDIMMs Using 16Gbit Chips by Ryan Smith on June 11, 2018.
Samsung Demos 64 GB RDIMM Based on 16 Gb Chips, Promises 256 GB LRDIMMs by Anton Shilov on March 22, 2018

samsung_ddr4_64gb_16_gb_ics_module

The Wikipedia entry on DDR4 SDRAM says the 288-pin DDR4 DIMM is 133.35mm × 31.25mm (DDR3 240-pin). From a visual inspection, the 16Gb DRAM package might be 10×11.5mm.

The following from Wikipedia is of interest:
In 2008 concerns were raised in the book Wafer Level 3-D ICs Process Technology that non-scaling analog elements such as charge pumps and voltage regulators, and additional circuitry "have allowed significant increases in bandwidth but they consume much more die area". Examples include CRC error-detection, on-die termination, burst hardware, programmable pipelines, low impedance, and increasing need for sense amps (attributed to a decline in bits per bitline due to low voltage). The authors noted that, as a result, the amount of die used for the memory array itself has declined over time from 70–78% with SDRAM and DDR1, to 47% for DDR2, to 38% for DDR3 and potentially to less than 30% for DDR4.
The reference is: Tan, Gutmann; Tan, Reif (2008). Wafer Level 3-D ICs Process Technology. Springer. p. 278 (sections 12.3.4–12.3.5). ISBN 978-0-38776534-1.

 

Assuming the RLDRAM is 2-4 times more expensive than DDR4, it is probably the better choice as a DIMM product. For memory in the processor package with special low latency interconnect, then perhaps eDRAM is worth the price?

IBM z14 System Controller (SC),
Global Foundries 14HP process, 25.3 x 27.5 mm die,
695.75 mm² die size, 7,100,000,000 transistors,
+ 2,100,000,000 cells of eDRAM (~2.1B xTors + 2.1B capacitors)
672MB

z14 sc

 

RDIMM, LRDIMM

Anandtech LRDIMMs, RDIMMs, and Supermicro's Latest Twin gives an explanation of RDIMMs and LRDIMMs:

RvsLRDIMMscheme_575px

"RDIMMs add a register, which buffers the address and command signals. The integrated memory controller in the CPU sees the register instead of addressing the memory chips directly. As a result, the number of ranks per channel is typically higher ... "

"Load Reduced DIMMs replace the register with an Isolation Memory Buffer (iMB™ by Inphi) component. The iMB buffers the Command, Address, and data signals. The iMB isolates all electrical loading (including the data signals) of the memory chips on the (LR)DIMM from the host memory controller. "

Anandtech Intel Xeon E5 Version 3 ... has an explanation of the difference between DDR3 and DDR4 LRDIMMs.

DDR4LRDIMMshorterTrace

Simmtester What is LR-DIMM, LRDIMM Memory ? (Load-Reduce DIMM)

lrdimm_b

Microway tip on DDR4 RDIMM and LRDIMM Performance Comparison provides bandwidth numbers for 1 and 2 DIMMs per channel

 

Nantero NRAM

Hotchips 30, 2018
According to Nantero Details DRAM Alternative NRAM (Nantero)
"Nantero’s non-volatile NRAMs use electrostatic charge to activate stochastic arrays of CNT cells it claims are relatively easy to spin coat on to any CMOS process. It claims it will outstrip the DRAM roadmap starting with a 100mm2 die made in a 28nm process stacking 4-Gbit CNT layers into 8- and 16-Gbit devices."

NRAMtiming

"Theoretically, DDR4 supports up to eight-layer stacks, DDR5 handles 16 and future processes may allow denser individual layers. Nantero forecasts a 64 Gbit NRAM could be made in a 14nm process and a 256-Gbit device in 7nm, both using four layers."

 

References

The DRAM chip is an enormously complicated entity for which only some aspects are briefly summarized here. See Onur Mutlu and other sources for more.

Young Hoon Son, Seongil O, Yuhwan Ro, Jae W. Lee, Jung Ho Ahn,
"Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations," ISCA 13 (link)
has references on RLDRAM 40-80% larger die, per Brent Keeth, DRAM Circuit Design.
Also Figure 2, certain SPEC CPU2006 components show dramatic benefit from lower memory latency, presumably the ones heavy on pointer-chasing code.

Also see The Memory Guy
Super-Cooled DRAM for Big Power Savings, 2017 Jun
A 1T SRAM? Sounds Too Good to be True!, 2016 Feb Zeno Semi,
A 1T (or 2T) SRAM Bit Cell,

Samsung Semi

Anandtech Cadence and Micron DDR4 Update ..., Oct 16, 2018. Two independent 32/40-bit channels per module. JEDEC DDR5 ..., Apr 3, 2017. Expanded support for NV?

Unorganized material lists