
Memory Latency - Return to Single Processor (2018-02)

It is almost universally accepted without question that a server is a multi-processor system. Long ago, there were valid reasons for this. Over time, the reasons changed but remained valid. When the memory controller became integrated into the processor, the very low memory latency of a single-processor system was an interesting but non-actionable tidbit; the greater compute and memory capacity of the multi-processor was then more important. In recent years, however, the high core count and memory capacity of a single modern processor has become more than capable of handling most transaction processing workloads. The lower memory latency of the single-socket system then translates into 40-50% better performance per core or thread than a multi-processor system. It is time to rethink our concept of what a server system is. As attractive as cloud computing is for a number of reasons, none of the major cloud providers offer single-processor systems, forgoing the price-performance advantage of the 1S system.


The huge disparity between round-trip access time to DRAM and the CPU clock cycle has been a serious problem for more than ten years. It is simple to demonstrate that even a small amount of pointer-chasing code, in which the value just fetched from memory determines the next address to access, results in performance being almost entirely gated by memory latency. The most important application with this characteristic is database transaction processing: B-tree index navigation is largely an exercise in pointer chasing.
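The dependent-load pattern can be sketched as follows. This is a minimal hypothetical illustration, not any benchmark cited in this article, and a real latency test would be written in C so that interpreter overhead does not mask the memory stalls; the structure, however, is the same.

```python
import time
import random

def build_cycle(n, seed=1):
    """Build a single random cycle over n slots (Sattolo's algorithm), so a
    walk visits every slot in an unpredictable order that defeats prefetching."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)           # j < i guarantees one single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, start, steps):
    """Each load depends on the previous one; the CPU cannot overlap the
    cache misses, so elapsed time / steps approximates round-trip latency."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i

nxt = build_cycle(1 << 20)             # working set far larger than L3 cache
t0 = time.perf_counter()
chase(nxt, 0, 1 << 20)
t1 = time.perf_counter()
print(f"{(t1 - t0) / (1 << 20) * 1e9:.1f} ns per dependent load")
```

Because the chain forms one cycle, a walk of n steps returns to its starting slot, which is a convenient correctness check.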

The largest part of memory latency is incurred on the DRAM chip itself, but L3 cache, the memory controller, and signal transmission time are also significant contributors. Curiously, there are specialty (non-ECC) memory products for desktop systems, featuring "hand-screened" ICs and elaborate cooling mechanisms, with latency timings of around 10ns in each of RCD, CL, and RP. For server systems, registered ECC DIMMs are only available with standard timings of about 14ns for each of the three.
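As a rough illustration of how those three timings combine (a simplified model: in the worst case of a row miss, precharge, row activate, and column access occur in sequence, while a row hit pays only CL; controller scheduling adds further delay on top):

```python
def row_miss_latency_ns(trp, trcd, tcl):
    """Worst-case DRAM core access: close the open row (tRP), open the
    new row (tRCD), then read the column (tCL)."""
    return trp + trcd + tcl

# Screened desktop parts at ~10ns per timing vs. registered ECC at ~14ns
print(row_miss_latency_ns(10, 10, 10))   # 30 ns on the DRAM chip itself
print(row_miss_latency_ns(14, 14, 14))   # 42 ns
```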

Single and Multi-Processor Systems

There is a mechanism by which we can significantly influence memory latency in a multi-processor (socket) server system, that being memory locality. But few applications actually make use of the NUMA APIs in this regard. Some hypervisors like VMware allow VMs to be created with cores and memory from the same node. What may not be appreciated, however, is that even local node memory on a multi-processor system has significantly higher latency than memory access on a (physical) single-socket system.

That the single-processor system has low memory latency was an interesting but non-actionable bit of knowledge until recently. The widespread practice in the IT world was to treat the 2-way system as the baseline standard; single-socket systems were relegated to small business and turnkey solutions. From long ago until a few years ago there was a valid basis for this, though the reasons changed over the years. When multi-core processors began to appear, the 2-way became much more powerful than necessary for many secondary applications. But this was also the time virtualization became popular, which gave new reason to continue the 2-way as baseline practice.

Intel Xeon E5 v2, Ivy Bridge to Xeon SP, Skylake

Prior to 2012, the Intel product strategy was a desktop-type processor for 2-way systems and a big processor for 4-way+ systems. There were variations of this depending on the processor generation for various reasons. In 2012, the big processor was brought to both 2-way and 4-way systems in the Xeon E5 line. The next year, Intel employed 3 die layouts to fill their Xeon E5 v2 lineup, with the 6-core LCC, 10-core MCC and 15-core HCC.


It was possible and practical for Intel to put 15 cores on one die in their 22nm process for a high-end product. But there was also significant demand for low- and medium-priced products, better suited to the cost structure of the 6-core and 10-core die, respectively; hence the 3-die-layout strategy. The E5 v2 was publicly available only with up to 12 cores. The full 15-core die was publicly available in the E7 v2 (and in the E5 to special customers?).

In the current generation Xeon SP processors, the 3 die layout options are the LCC at 10 cores, HCC at 18 cores and XCC at 28 cores.


The inherent nature of semiconductor cost structure is that a large die is disproportionately more expensive than a small die. Some of this is the unusable wafer edge real estate, which depends on how many whole die fit on a wafer. For very large die, cost is dominated by declining yield at a given defect density (see also Motley Fool). The cost structure of Intel Xeon E5 and Xeon SP processors largely fits this pattern, with some market-driven adjustments.
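A common first-order model captures both effects: edge loss from fitting rectangular die on a round wafer, and a Poisson yield term. This is a sketch under assumed numbers (the die areas and defect density below are hypothetical, not Intel's actual data):

```python
import math

def dies_per_wafer(area_mm2, diameter_mm=300):
    """Gross die per wafer, with the standard correction for edge loss."""
    return (math.pi * (diameter_mm / 2) ** 2 / area_mm2
            - math.pi * diameter_mm / math.sqrt(2 * area_mm2))

def yield_fraction(area_mm2, defects_per_mm2=0.001):
    """Poisson yield: probability a die lands with zero defects."""
    return math.exp(-defects_per_mm2 * area_mm2)

def relative_cost_per_die(area_mm2, defects_per_mm2=0.001):
    """Cost per *good* die, in units of one wafer's cost."""
    good = dies_per_wafer(area_mm2) * yield_fraction(area_mm2, defects_per_mm2)
    return 1.0 / good

small, large = 160.0, 480.0    # hypothetical LCC-like and XCC-like areas
ratio = relative_cost_per_die(large) / relative_cost_per_die(small)
print(f"{ratio:.1f}x the cost for 3x the area")
```

With these assumptions, tripling the area more than quadruples the cost per good die, which is the "disproportionately more expensive" pattern described above.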

The chart below shows the cost per core versus cores for the range of Xeon SP models (excluding certain specialty SKUs).


In the main sequence, the cost per core increases as core count increases. As expected, two small-die processors cost less than one large-die processor with the same total number of cores. An example is the Xeon 6132 14-core at $2,111 versus the Xeon 8180 28-core at $10,009.
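The arithmetic behind that comparison, using the list prices above:

```python
xeon_6132 = {"cores": 14, "price": 2111}
xeon_8180 = {"cores": 28, "price": 10009}

# Two 14-core parts vs. one 28-core part: same total core count
two_small = 2 * xeon_6132["price"]
print(two_small, "vs", xeon_8180["price"])              # 4222 vs 10009

print(round(xeon_6132["price"] / xeon_6132["cores"]))   # ~151 per core
print(round(xeon_8180["price"] / xeon_8180["cores"]))   # ~357 per core
```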

At the system level, there is not much difference in cost between motherboards that accommodate one processor and those that accommodate two. It would seem, then, that there is logic in continuing the 2-way system as standard practice, with the two or three die layouts and a range of frequency options covering the broad range of data center requirements. This reasoning is valid only if cores are equal between single- and multi-processor systems.

Memory Latency, 1-Socket versus 2-Socket

Except that cores are not equal between single-processor and multi-processor systems, the difference being significant in memory latency. And memory latency is the key determinant of performance for a most important class of applications.

The figure below shows a representation of the single socket Xeon E5 v4 system.


On a single-socket Xeon E5-2630 v4 10-core processor, L3 latency was 17ns and memory latency was L3 + 58ns = 75ns, as measured with the 7z benchmark. Intel cites L3 latency for the E5-2699 v4 22-core processor as 18ns, hence single-socket memory latency is assumed to be 76ns for an HCC die.

In the LCC die, there is a single ring connecting the cores and other sub-components. The HCC die has two interconnect rings, in turn connected by a pair of switches. One might expect more than a 1ns difference in L3 latency between a 10-core LCC and the 22/24-core HCC, and it would be nice if Intel would provide an explanation.

The figure below shows a representation of the 2-way system with Xeon E5 v4 processors.


The 7-cpu website shows L3 at 18ns for the E5-2699 v4, consistent with Intel. Local node memory latency is L3 + 75ns = 93ns.

The 7z benchmark does not test remote node memory latency, and Intel rarely mentioned memory latency in the Xeon E5/7 v1-v4 generations. One source cites remote node memory at 1.7X local node; another cites remote node at 50-60ns higher than local node. The value of 148ns is a placeholder used in later calculations.

The difference between single-socket memory latency and 2-way local node memory latency is attributed to the remote node L3 check required to ensure cache coherency. The difference between 76 and 93ns is 22%. As memory latency is the major determinant of database transaction processing performance, 22% is huge.
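The latency figures above, taking the measured L3 and DRAM components, combine as follows:

```python
l3_hcc = 18                # ns, Intel's cited L3 latency for the E5-2699 v4
dram = 58                  # ns beyond L3, from the single-socket measurement

one_socket = l3_hcc + dram         # 76 ns
two_socket_local = l3_hcc + 75     # 93 ns: the extra 17 ns is attributed
                                   # to the remote-node L3 coherency check

penalty = two_socket_local / one_socket - 1
print(f"{penalty:.0%}")            # 22%
```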

The local node latency applies for applications that can achieve memory locality on a NUMA system. The database engine has NUMA capabilities, but if the complete application and database environment is not architected to work with those features, memory locality will not be achieved.

Xeon SP

In 2017, Intel decided that it was important to compare memory latency on their Xeon SP against that of AMD EPYC, and hence memory latency information appeared in their presentations. The figure below shows 89ns local node and 139ns remote node memory latency for the Xeon SP 8180 28-core processor.


In Xeon SP, L2 cache latency increased from 12 to 14 cycles on account of the increase in size from 256KB to 1MB. L3 latency increased from 18 to 19.5ns in the large-die models, as measured with the Intel Memory Latency Checker. It appears that Skylake Xeon SP has lower memory latency than Broadwell Xeon E5 v4; this could be real, or it could be an artifact of different measurement techniques.

The figure below is from computerbase.de. Be aware that different values are cited by various sources.


See the Intel Xeon Scalable Architecture Deep Dive (slide deck) and Intel Xeon Processor Scalable Family Technical Overview for more information.

Equivalent Transaction Performance 1S to 2S

On a full 2-socket system, a database not architected for NUMA is expected to have 50/50 local-remote node memory access. Then average memory latency is 114ns (average of 89 and 139) for Xeon SP and 120.5ns (average of 93 and 148) for Xeon E5 v4.

Average memory latency on the 2-way is then 50-59% higher than on a single-socket system. The expectation is that performance per core on the 2-way system is 0.63-0.67 times that of the single-socket system, and the throughput of the 2-way, with twice as many cores, is 1.26 to 1.34 times that of the single socket.

Based on scaling of 1.35X from 1S to 2S, a 2-way system with 14-core processors is approximately equal in throughput performance to a one-socket system with 19 cores (actual product options are 18 or 20 cores).
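The equivalence arithmetic works out as follows, under the simplifying assumption (which holds best for pointer-chasing workloads) that performance per core scales inversely with average memory latency, with 76ns taken as the single-socket baseline:

```python
one_socket = 76.0                      # ns, single-socket memory latency

def two_socket_avg(local_ns, remote_ns):
    """50/50 local-remote mix for a database not architected for NUMA."""
    return (local_ns + remote_ns) / 2

for name, local, remote in [("Xeon E5 v4", 93, 148), ("Xeon SP", 89, 139)]:
    avg = two_socket_avg(local, remote)
    per_core = one_socket / avg        # relative per-core performance
    scaling = 2 * per_core             # 2x the cores on the 2-way
    print(f"{name}: avg {avg} ns, per-core {per_core:.2f}, "
          f"2S throughput {scaling:.2f}x")

# At ~1.35X scaling, a 2S system with 14-core processors matches a
# single socket of roughly:
print(round(14 * 1.35), "cores")       # 19 cores
```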


The single socket 28-core processor is about equivalent to a 2-socket system with 20-22 core processors.


The single-socket system also has better thread-level performance, in proportion to the difference in memory latency (50-59%). This has significant value of its own. The single-socket system should also be free of the NUMA anomalies that can occur on multi-processor systems, one example being contention between threads running on cores in different sockets.

The cost difference for the processors may or may not be regarded as large. However, software is generally licensed on a per-core basis, and the licensing fee may be on the order of thousands of dollars per core versus a processor cost on the order of hundreds of dollars per core. Performance efficiency per core then becomes the more important consideration.


The argument for the single socket as the baseline standard system is compelling. In the past, there were valid reasons for wanting both the greater compute capability and the greater memory capacity of multi-processor systems. In recent years, with storage on SSD, massive memory overkill is no longer necessary, even though it may still be an involuntary muscle-reflex action for DBAs.

The reality is that, with proper design, most transaction workloads can be handled by the 16-28 cores available in a single processor, with Hyper-Threading doubling the logical processors (threads). Scaling up to a multi-processor system provides only a moderate throughput gain and may create new problems, some with serious negative effects. A NUMA-aware VM recovers about half of the efficiency difference between a multi-socket and a true single-socket system; this is helpful, but still falls short of the full potential by the 22% local-node penalty plus the VM's own overhead.


On the Oracle side, the database and application architecture strategies necessary to achieve compute node locality on RAC could be applied to achieve (memory) node locality on a (single compute node) NUMA system.

For anyone who remembers that single- to multi-processor scaling was better in the old days: it was. My recollection is that in the Pentium II to III Xeon period, it was something like the following.



Scaling from 4-way to 8-way was perhaps 1.5X. The reason scaling from 1 to 2 processors was better back then is that there was little change in memory latency going from 1S to 2S, and similarly from 2S to 4S. However, these comparisons should be made on the same processor and memory controller. The regular Pentium II and III had a slower L2 cache but a simpler memory controller; the Xeon versions had a faster and bigger L2 cache but a more complex memory controller for much larger capacity, which also entails higher latency.