
Memory Latency II (2018-04)

Round-trip memory latency is probably now the most important aspect of system performance. In part, this is because processors have improved dramatically over the years in almost every aspect except memory latency. The discussion here mostly covers the period from the Intel processor codenamed Nehalem to the present Skylake, in its Xeon SP form. Most of the information pertains to the codename-EP versions corresponding to the Xeon E5 class. A comparison against memory latency in the desktop processors is useful, particularly in reminding us that everything in the system is a series of trade-offs. If we want memory with ECC and large capacity, there is a price. Very little information is available on the EX version or E7 class product lines.

Preliminaries and Tools

Memory latency in modern multi-core processors is already a complex topic. The matter is further complicated by multi-processor systems inherently having non-uniform memory access (NUMA) due to the integrated memory controller. There are many scenarios, depending on whether data is in the cache of another core in the same socket or a different socket, unmodified or modified, and shared or exclusive. Recent Intel processors also have multiple cache-coherency modes, which adds further complexity. The discussion here mostly concerns only the simplified case of local and remote node memory access.

Memory latency test tools include Intel Memory Latency Checker (MLC), currently on version 3.5, and Igor Pavlov's 7-Benchmark, with results posted on the 7-cpu site (7-Zip LZMA Benchmark). Results have been reported in various places, not all of which are consistent, and different tools were used.
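
Tools of this type generally measure idle latency with a dependent-load (pointer-chasing) kernel over a buffer much larger than the L3. The sketch below is only an illustration of the technique, not the code used by MLC or 7-Benchmark; the 512MB buffer size and iteration count are arbitrary choices.

    /* Minimal pointer-chasing latency sketch (illustrative only, not MLC).
     * Build: cc -O2 chase.c -o chase
     * The buffer must be much larger than any L3 so most loads miss to DRAM. */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define BUF_BYTES (512u * 1024u * 1024u)
    #define LINE      64u                      /* cache line size, bytes */
    #define N         (BUF_BYTES / LINE)       /* number of cache lines  */
    #define ITERS     (50u * 1000u * 1000u)

    int main(void)
    {
        /* Each cache line stores the word index of the next line to visit. */
        size_t *buf  = malloc(BUF_BYTES);
        size_t *perm = malloc(N * sizeof(size_t));
        if (!buf || !perm) return 1;

        for (size_t i = 0; i < N; i++) perm[i] = i;
        srand(1);
        for (size_t i = N - 1; i > 0; i--) {             /* Fisher-Yates shuffle */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        size_t stride = LINE / sizeof(size_t);
        for (size_t i = 0; i < N; i++)                   /* one big random cycle */
            buf[perm[i] * stride] = perm[(i + 1) % N] * stride;

        struct timespec t0, t1;
        size_t idx = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < ITERS; i++)       /* each load depends on the previous */
            idx = buf[idx];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("average load-to-use latency: %.1f ns (checksum %zu)\n", ns / ITERS, idx);
        free(buf); free(perm);
        return 0;
    }

Because each load address depends on the previous load's result, the hardware prefetchers and out-of-order machinery cannot hide the latency, so the average time per iteration approximates the round-trip memory latency.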

In reviewing the available literature on this topic, there appears to be a significant change in cache coherency behavior in the Intel Xeon E5/7 class processors between the Nehalem (2009) through Ivy Bridge (2013) generations and the Haswell (2014) and later processors, impacting single versus multi-processor latency.

Nehalem

The Molka-09 paper provides the following for a 2-way Nehalem Xeon X5570 4-core, 2.93GHz system with 6 x 2GB DDR3-1333 registered ECC memory, running the Debian 5.0 operating system. The test tool is BenchIT.

    2S_Nehalem

Cache latencies are stated both in absolute time and in cycles. L1 is 4 cycles, L2 is 10 cycles, and local L3 is 38 cycles, or 13ns at 2.93GHz.

The L1 is hardwired into the CPU core pipeline and must be 4 cycles, or 5 in certain cases. I am not sure whether L2 is set to a fixed number of cycles or is frequency dependent. In any case, L2 is part of the core and cannot depend on uncore characteristics. The L3 is part of the uncore, which operates independently of the core frequency.
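
The cycle and nanosecond figures quoted throughout convert via t(ns) = cycles / f(GHz). A trivial sketch of the arithmetic, using values that appear in this article:

    /* Cycle <-> nanosecond conversions used throughout this article. */
    #include <stdio.h>

    static double cycles_to_ns(double cycles, double ghz) { return cycles / ghz; }
    static double implied_ghz(double cycles, double ns)   { return cycles / ns;  }

    int main(void)
    {
        /* Nehalem X5570: 38-cycle L3 at 2.93GHz is about 13ns */
        printf("L3: %.1f ns\n", cycles_to_ns(38, 2.93));
        /* Saini-15 Ivy Bridge: 1.3ns L1, which must be 4 cycles, implies ~3.1GHz */
        printf("implied clock: %.2f GHz\n", implied_ghz(4, 1.3));
        return 0;
    }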

Local node memory access is 65ns and remote memory is 106ns, inclusive of the L3 miss. (An L3 miss can be determined sooner than the full L3 access, but memory is presumably loaded into L3 on the return trip?) The Thomadakis thesis has 65ns local and 105ns remote.

The Molka-09 paper has much greater detail on all the possible latency circumstances. A local L3 miss that hits in the cache of a different core on the local die is 22.2ns, while a remote die cache hit is 63.4ns, both for the read, exclusive case. Other circumstances, including modified and shared cache lines, are covered with much detail on the exact conditions.

Anandtech Nehalem: The Unwritten Chapters (Nov 2008) cites L3 at 36 cycles and memory at 148 cycles for the single processor Core i7-965 at 2.66GHz, corresponding to 13.5 and 55.6ns respectively. It is possible that the desktop system could use specially screened memory with better timings, so we cannot be sure whether the lower memory latency is due to the difference in system architecture (1S versus 2S) or due to a difference in the memory specifications.

Westmere

The 7-cpu site provides the following for a 2-way Westmere Xeon X5650 6-core, 2.66GHz. Memory and operating system are not specified.

2S_Westmere

The 7-cpu Westmere cache latencies are similar to Nehalem. L1 and L2 are the same at 4 and 10 cycles respectively. L3 is 40 or 42 cycles at 2.66GHz depending on the core, corresponding to 15-15.8ns. It would be reasonable for the 6-core Westmere to have higher L3 latency than the 4-core Nehalem.

The significant difference between the Molka and Thomadakis reports for Nehalem and the 7-cpu results for Westmere is that 7-cpu reports memory access as L3 + 67ns local and L3 + 105ns remote. There should not be any fundamental differences between Nehalem and Westmere. Could this just be a difference in the test tools?

The Intel MLC website has a figure showing 67.5-68.5ns for local and 125-126ns for remote node memory access, but without system information. The time frame might be late 2013.

IMLC1
Memory Latency Checker

There is a subsequent figure showing bandwidth to local memory at 38,477MB/s and remote at 19,000MB/s. The local value matches up closely to 3 channels at 1600MT/s (38,400MB/s).

Westmere-EP has 3 memory channels but a maximum data transfer rate of 1333MT/s. The Sandy Bridge based Xeon E5-2600 series has 4 channels at 1600MT/s. It could be an E5-2400 series, which has 3 channels, but Intel usually tests with their flagship product, not a cut-down model.
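
Peak DRAM bandwidth is channels x transfer rate x 8 bytes per 64-bit transfer, so the candidate configurations work out as follows (a quick check of the figures above):

    /* Peak DRAM bandwidth = channels x MT/s x 8 bytes per 64-bit transfer. */
    #include <stdio.h>

    int main(void)
    {
        printf("Westmere-EP, 3 ch @ 1333:  %d MB/s\n", 3 * 1333 * 8);  /* 31,992 */
        printf("3 ch @ 1600:               %d MB/s\n", 3 * 1600 * 8);  /* 38,400 */
        printf("Xeon E5-2600, 4 ch @ 1600: %d MB/s\n", 4 * 1600 * 8);  /* 51,200 */
        /* The measured 38,477 MB/s is essentially 100% of 3 channels at 1600,
           but only about 75% of 4 channels at 1600. */
        return 0;
    }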

It could be that the test achieves about 75% of nominal bandwidth on the 4-channel Sandy Bridge, or Westmere may have supported 1600MT/s unofficially?

Sandy Bridge

The figure below shows a 2-way Sandy Bridge Xeon E5 system. The latency values are from the Intel MLC website, but it is not known whether those are for Sandy Bridge.

2S_SandyBridge
Sandy Bridge (uncertain attribution)

In fact, McCalpin states that local node memory for Sandy Bridge is 79ns, suggesting that the Intel MLC site values are actually for Westmere.

There is a significant point to note on the change in system architecture from Nehalem/Westmere-EP to Sandy Bridge. All three processors have two QPI links. However, in Nehalem/Westmere, one QPI link on each processor connects to the other processor, while the second QPI link connects to an IOH. One common IOH is shown, but there could also be two separate IOHs, in turn connected to each other via QPI.

Sidebar - Intel Software Conference 2014 Brazil

The local node values of 67.5 and 68.5ns would be about right for a single processor, implying no penalty in a 2-way system.

The slide shown below is from Notes on NUMA Architecture, by Leonardo Borges, Intel Software Conference 2014 Brazil. Thomadakis has a similar slide.
notes4

Slide 4, Local Memory Access Example, says: if the data is not present in any local node cache, then request it from memory and snoop the remote node. Local memory latency is the maximum of the two. The slide does not explicitly say that the DRAM request is sent simultaneously with the remote CPU snoop, nor does it say that the steps are serialized.

Slide 5, Remote Memory Access Example, is: the local node requests data from the remote node over QPI, then the remote CPU's IMC requests it from its DRAM and snoops the caches in its node. Again, it is not specified whether the DRAM access and snoop are concurrent or serial.
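
As a toy model of why the distinction matters (the component timings below are illustrative placeholders, not measurements from any slide): if the DRAM request and the remote snoop are issued concurrently, latency is roughly the larger of the two; if serialized, it is closer to their sum.

    /* Toy model: local-node latency under concurrent vs. serialized snoop.
     * The component timings are illustrative placeholders only. */
    #include <stdio.h>

    static double max2(double a, double b) { return a > b ? a : b; }

    int main(void)
    {
        double t_dram  = 60.0;  /* local DRAM access, ns (placeholder)           */
        double t_snoop = 45.0;  /* QPI round trip to snoop the remote socket, ns */
        double t_l3    = 15.0;  /* local L3 lookup before going off-chip, ns     */

        printf("concurrent: %.0f ns\n", t_l3 + max2(t_dram, t_snoop)); /*  75 */
        printf("serialized: %.0f ns\n", t_l3 + t_dram + t_snoop);      /* 120 */
        return 0;
    }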

Ivy Bridge

The Saini paper has data for both the 2-way Ivy Bridge Xeon E5-2680 v2 10-core 2.8GHz and the 2-way Haswell Xeon E5-2680 v3 12-core 2.5GHz systems. The E5 v2 system is shown below with the middle (MCC) die.

2S_IvyBridge

Memory is DDR3-1866 on the E5 v2 system. Even though the Ivy Bridge MCC die has 10 cores, there is no guarantee that any given Xeon E5 v2 10-core product is not the HCC die with the extra cores disabled. The Ivy Bridge HCC die has 15 cores with a more complex ring structure.

L1 is reported as 1.3ns, corresponding to 3.64 cycles at 2.8GHz. Since we know L1 must be 4 cycles, the actual operating frequency must be around 3GHz; similarly, L2 should be 12 cycles. L3 is cited as 12.8ns, which might be 40 cycles at around 3.1GHz. Local node memory is cited at 56.5ns, which seems unusually low. Remote node memory is not mentioned.

On the Intel Software forum, there is a comment by John McCalpin citing 79ns for the 2S Xeon E5-2680 (Sandy Bridge EP), 69ns for the 2S X5680 (Westmere), and 53.6ns for the 1S Xeon E3-1270 (Sandy Bridge desktop). In this case, the values attributed above to 2S Sandy Bridge from Intel MLC are not correct, as Sandy Bridge local node memory should be higher.

The Hotchips 2014 (HC26) Intel presentation "Ivybridge Server Architecture: A Converged Server" by Irma Esmer et al., slide 16, has charts of memory latency versus bandwidth for both Ivy Bridge and the previous generation, in E5 and E7 versions. The 2-way E5 v2 latency appears to be around 70ns. The 4-way E7 v2 latency appears to be just over 100ns (see the Appendix).

Haswell

Both Saini-15 and Molka-15 papers cite results for 2-way Haswell Xeon E5-2680 v3 2.5GHz base, 3.33GHz turbo-boost, 12-core processors. Memory is DDR4-2133. The Haswell MCC die has 12 cores, but again, there is no guarantee that any given product is not the HCC die with some cores disabled.

2S_Haswell_12c
Haswell MCC representation (clipped from HCC image)

Saini-15 cites L1 1.4ns, L2 3.9ns, L3 16.1ns, and memory at 88.6ns. Again, L1 must be 4 cycles, implying a true operating frequency around 2.8GHz, and L2 should be 12 cycles. L3 could be around 48 cycles, assuming the true frequency is 2.96GHz.

Molka-15 explicitly states: "The access times of 1.6 ns (4 cycles), 4.8 ns (12 cycles), and 21.2 ns (53 cycles) for reads from the local L1, L2, and L3 are independent of the coherence state. In contrast, core-to-core transfers show noticeable differences for modified, exclusive, and shared cache lines." The time in ns and cycles correspond exactly to 2.5GHz, meaning that turbo-boost was disabled.

Saini reports 88.6ns for local node memory, no value cited for remote node. Molka cites 96.4ns for local and 146ns for remote node accesses, all in the default Source Snoop Mode.

The Molka et al. "Cache Coherence ... Haswell-EP Architecture" 2015 ICPP paper also discusses the different cache access scenarios. In Home Snoop mode, local memory latency increases to 108ns.

7-cpu reports the Intel Haswell 2-way Xeon E5-2699 v3, 18-core (HCC die), 2.3GHz base and 3.6GHz max turbo-boost, as having L3 latency between 58-66 cycles at 3.6GHz. Memory is reported as 62 cycles at 3.6GHz (L3, about 17ns) + 100ns.

2S_Haswell

As with Nehalem and Westmere, it is not clear whether the differences between the 7-cpu and Molka results are due to differences between the test tools, the MCC versus HCC die, or other factors.

Per Saini-15, local node memory latency increased from 56.5ns in the Ivy Bridge based Xeon E5-2680 v2 (10 cores, single ring) to 88.6ns in the Haswell E5-2680 v3 (12 cores, 2 rings). This was attributed to the change from DDR3-1866 to DDR4-2133, presumably CL 15 or 14.06ns. Crucial shows ECC registered DDR3-1866 at CL 13 = 13.93ns, so this attribution is questionable.
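
The memory timing figures here and below follow from t_CAS = CL x 2000 / (transfer rate in MT/s) ns, since the command clock runs at half the transfer rate. A quick check of the values cited in this article:

    /* First-word CAS latency in ns: t_CAS = CL * 2000 / (MT/s). */
    #include <stdio.h>

    static double cas_ns(int cl, int mts) { return cl * 2000.0 / mts; }

    int main(void)
    {
        printf("DDR3-1866 CL13: %.2f ns\n", cas_ns(13, 1866)); /* 13.93 */
        printf("DDR4-2133 CL15: %.2f ns\n", cas_ns(15, 2133)); /* 14.06 */
        printf("DDR4-2666 CL19: %.2f ns\n", cas_ns(19, 2666)); /* 14.25 */
        printf("DDR4-3400 CL16: %.2f ns\n", cas_ns(16, 3400)); /*  9.41 */
        return 0;
    }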

Anandtech's Intel Xeon E5 Version 3 ... review reports the Xeon E5-2695 v3 (14-core, 2.3GHz base) at 82.8ns with RDIMMs and 85.8ns with LRDIMMs, both DDR4-2133. The Xeon E5-2697 v2 with DDR3-1866 was mentioned as 81.6ns.

Broadwell

7-cpu cites Broadwell 2-way Xeon E5-2699 v4 as 65 cycles (18ns) + 75ns.

2S_Broadwell_H

My own measurement on a single socket Xeon E5-2630 v4 10-core (LCC die?) is 38 cycles (17.3ns) + 58ns.

1S_Broadwell

Skylake SP

The figure below is based on an Intel slide shown in the Anandtech article "Dissecting Intel's EPYC Benchmarks: Performance Through the Lens of Competitive Analysis" for the Xeon SP. Also see the Anandtech article "Sizing Up Servers: Intel's Skylake-SP Xeon versus AMD's EPYC 7000 - The Server CPU Battle of the Decade?".

LocalRemote

In the 2-way Xeon SP system, local node memory is reported as 89ns, and remote node as 139ns.

7-cpu reports Skylake X, a single socket, 8-core processor on the LCC die as having memory latency of L3 + 50ns. The L3 is about 18.5ns. The memory was DDR4-3400 16-18-18 (9.4 and 10.6ns), which is better than normal registered DDR4 ECC memory at 2666 and CL 19 (14.25ns).

Intel reports L3 at 19.5ns for the XCC die. Adjusting for ECC server memory, we can estimate single processor Xeon SP memory latency at 76-79ns. The implication here is that local node memory access in a multi-processor system is significantly longer than single processor latency.
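
One plausible reconstruction of that 76-79ns estimate (my own arithmetic, not Intel's): start from the 7-cpu Skylake X figure of 18.5ns L3 + 50ns, add about 1ns for the larger XCC L3, and roughly 5-10ns for the slower registered ECC memory timings (CAS alone, or CAS and tRCD together).

    /* Rough reconstruction of the 1S Xeon SP estimate; my arithmetic only. */
    #include <stdio.h>

    int main(void)
    {
        double skylake_x = 18.5 + 50.0;   /* 7-cpu: L3 + DRAM with DDR4-3400 CL16 */
        double l3_adj    = 19.5 - 18.5;   /* XCC L3 versus LCC L3                 */
        double cas_adj   = 14.25 - 9.41;  /* DDR4-2666 CL19 versus DDR4-3400 CL16 */

        printf("low:  %.1f ns\n", skylake_x + l3_adj + cas_adj);      /* ~74 ns */
        printf("high: %.1f ns\n", skylake_x + l3_adj + 2 * cas_adj);  /* ~79 ns */
        return 0;
    }

This lands roughly in the 76-79ns range estimated above, depending on how many of the timing terms are adjusted.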

Memory Latency Comments

It is unfortunate that a consistent set of results for a broad range of processors and systems on a common test tool is not available. The disparate results that are available have some inconsistencies, and some do not cite remote node memory latency or other details.

Xeon E7 results, which would shed light on the impact of the scalable memory buffer, are difficult to find and are not discussed here.

It is evident that both local and remote node latencies have increased somewhere between Westmere and Haswell, but it is unclear exactly when this happened. Some of this is expected, because the more recent processors have a very complicated internal ring or mesh interconnect for the many cores.

One guess is that in a 2-socket system with the EP (or equivalent) processors, in which the processor connects directly to DIMMs without going through a scalable memory buffer (SMB), local node memory is 67ns in Nehalem and Westmere, around 80ns in Sandy Bridge and Ivy Bridge, and 90ns in Haswell, Broadwell, and Skylake. Another possibility is that latency is 80ns from Nehalem to Ivy Bridge (EP) and 90ns from Haswell onward.

TPC-E results

The discussion on TPC-E Benchmarks has been moved to its own article.

Summary

What information is available on memory latency for Nehalem and later Intel processors has been gathered here. There is indication that memory latency has increased between the quad-core Nehalem-EP and more recent processors with 18 or more cores. It is expected that L3 latency has increased, from 13ns in Nehalem to 18ns in the 22-core Broadwell HCC die and 19.5ns in the 28-core Skylake XCC die. However, why local node latency has increased by 10-20ns from Nehalem to Haswell and later processors is not clear.

Another curious matter is that the 2-socket local node latency is higher than latency in a pure single socket system. There is an Intel slide suggesting that the presence of a second socket should not affect local node latency. If this is not true and there is a significant difference, then that is important.

There is a significant difference between local and remote node memory latency, perhaps on the order of 50ns. An argument could be made that a NUMA-aware application could be architected to exploit the faster local node memory access, but almost no real-world databases have been architected to achieve this. Locality of memory might also be achieved with virtualization, as at least one VM host is able to assign local node memory.
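
To illustrate what NUMA-aware means at the code level, below is a minimal sketch using the Linux libnuma API, assuming a 2-socket system with nodes 0 and 1; the allocation size is arbitrary and the latency comments simply echo the Xeon SP numbers above.

    /* Sketch: binding memory to the local NUMA node with Linux libnuma.
     * Build: cc numa_local.c -lnuma. Assumes a 2-node (2-socket) system. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma not available\n");
            return 1;
        }

        int cpu  = sched_getcpu();            /* core this thread is running on */
        int node = numa_node_of_cpu(cpu);     /* the memory node local to it    */

        size_t size  = 64u * 1024u * 1024u;   /* 64MB working set (arbitrary)   */
        void *local  = numa_alloc_onnode(size, node);      /* ~89ns class access  */
        void *remote = numa_alloc_onnode(size, 1 - node);  /* ~139ns class access */

        printf("thread on cpu %d, local node %d\n", cpu, node);

        /* Place latency-sensitive structures in 'local'; pages are bound to the
           requested node when first touched. */

        numa_free(local, size);
        numa_free(remote, size);
        return 0;
    }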

Memory latency is crucially important. Given that we are dealing with applications that cannot effectively achieve memory locality, we need to ask whether multi-processor systems, inherently being NUMA, are necessary. Years ago, multi-processor was necessary and the memory latency effect was interesting but non-actionable. Now, a single processor with 18 or more cores should be able to handle most workloads in properly architected applications. Whatever was assessed to be "best practice" in the long-ago past is now obsolete and probably outright wrong. It is time to re-evaluate the strategy going forward.

References

Cache Coherence Protocol and Memory Performance of the Intel Haswell-EP Architecture
by Daniel Molka, Daniel Hackenberg, Robert Schöne, and Wolfgang E. Nagel, 44th ICPP, 2015 (Molka-15).

Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
by the same authors, at PACT 2009 (Molka-09).

Performance Evaluation of an Intel Haswell- and Ivy Bridge-Based ..., by Subhash Saini, Robert Hood, Johnny Chang, and John Baron (Saini-15; this might actually be a 2017 paper?).

The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms, Michael Thomadakis, Texas A&M, 2011, Ph.D. Thesis.

John McCalpin's blog, University of Texas, Austin, has entries for Xeon Phi x200 and some Opteron systems.

 

Appendix

The figure below is from "The Intel Xeon Processor E5 Family," by Gilbert and Rowland, presented at HotChips 24, 2012.

HC24_E5_SB

The above compares Sandy Bridge EP memory latency versus bandwidth to that of Westmere EP. At 1600MT/s, each memory channel is 12,800MB/s, and the four channels on one processor total 51,200MB/s. The system is then clearly a 2-socket system. The memory latency for both Westmere and Sandy Bridge at near zero bandwidth appears to be about 70ns.

The test methodology is not cited, hence it is unclear as to whether the increase in memory latency (linear portion) at higher bandwidth is due to queuing effects or a mix of local and remote access.

On second thought, the nonlinear portion to the right is more likely to be due to queuing effects?

The figure below is from "4th Generation Intel Core Processor codenamed Haswell," by Hammarlund, HotChips 25, 2013.

HC25_Haswell

Here we can see that memory latency in the desktop system at low bandwidth is just over 60ns, less than in the Xeon E5 class processors. About 6-8ns of the difference might be explained by the difference in L3 cache latency: about 10ns in the desktop processors versus 17ns in the E5 class processors?

A larger difference could be explained by the use of special memory in the desktop systems, but this does not appear to be the case here. Another possible difference is the use of ECC memory; there must be some time for the ECC check in the memory controller?

Of course, the main point of the Haswell chart is to show the benefit of the 128MB eDRAM in the processor package. See "A 1Gb 2GHz Embedded DRAM in 22nm Tri-Gate CMOS Technology," Hamzaoglu et al., ISSCC 2014. Per the title, this is a 1Gbit (128MByte) eDRAM on the 22nm process with a 77mm2 die size.

The possibility of 42ns latency to memory is of immense interest. In the server arena, this would probably be 50ns, which is still good enough. Of course, we are not interested in a mere 128MB. On the same 22nm process, 1GB (1024MB) plus ECC could fit on a 693mm2 die. A 2GB giant die should be possible on the 14nm process. It should be possible to put 8 such die around the similarly large Xeon HCC/XCC processor die for 16GB of eDRAM, and more memory could be stacked.
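
The die-area arithmetic behind that estimate, reconstructed from the cited 1Gb-on-77mm2 figure (my own working, ignoring any non-array overhead):

    /* Die area for 1GB of eDRAM plus ECC, scaled from 1Gb on 77mm2 (22nm). */
    #include <stdio.h>

    int main(void)
    {
        double mm2_per_gbit = 77.0;                            /* ISSCC 2014 figure */
        double one_gb_ecc   = 8.0 * mm2_per_gbit * 9.0 / 8.0;  /* 8Gb + 1/8th ECC   */

        printf("1GB + ECC on 22nm: %.0f mm2\n", one_gb_ecc);   /* ~693 mm2 */
        /* A full process shrink roughly halves cell area, hence the suggestion
           that ~2GB + ECC could fit in a similar footprint on 14nm. */
        return 0;
    }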

The Silicon Edge die-per-wafer estimator says that 77 such die fit on a 300mm wafer. Motley Fool estimates the cost structure of Intel's 14nm process at $9,100 per 300mm wafer. Assume that yield is high through a combination of spare banks and lower capacity products having disabled banks. Then an end-user average cost of $1000 per die should be equivalent to current Xeon product margin. A fully functional 2GB die might have a cost of perhaps $2000.
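
For reference, the implied raw wafer cost per die from the figures just cited (my arithmetic; it says nothing about test, packaging, or yield loss):

    /* Implied raw wafer cost per 693mm2 die, from the cited estimates. */
    #include <stdio.h>

    int main(void)
    {
        double wafer_cost    = 9100.0;  /* Motley Fool estimate, 14nm 300mm wafer  */
        double die_per_wafer = 77.0;    /* Silicon Edge estimate for a 693mm2 die  */

        printf("wafer cost per die: $%.0f\n", wafer_cost / die_per_wafer); /* ~$118 */
        return 0;
    }

On that basis, the bulk of a $1000 end-user price would be margin, which is the comparison to current Xeon pricing being made above.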

It might seem that $1000 per GB is extraordinarily high compared with ordinary DRAM at $16/GB inclusive of ECC. But the value of lower latency to memory is even higher.

The figure below is from "Ivybridge Server Architecture ...," by Esmer, HotChips 26, 2014.

HC26_Xeon_IB2

The above figure is for 2-way systems. Again, memory latency at low bandwidth appears to be 70ns.

HC26_Xeon_IB2

The above figure is for 4-way systems. Here, memory latency at low bandwidth appears to be just over 100ns, which I am attributing to the scalable memory buffer (SMB) between the processor-integrated memory controller and the DIMMs. Note that there may be another intermediary chip on Load Reduced (LR)DIMMs; I am not sure about RDIMMs.

The figure below is from "The New Intel Xeon Scalable Processor," by Kumar, HotChips 29, 2017.

HC29_SkylakeSP