
The Case for Single Processor (work in progress)

Multi-Processor Scaling Expectations

Our current understanding is that latency on a single socket high core-count Xeon SP processor is about 76ns for normal ECC DDR4 timings. On a 2-way system, the latencies are 89ns for local and 139ns for remote node. The average latency based on random 50/50 distribution is then 114ns. Scaling up from single processor to 2-way multi-processing incurs a 50% increase in memory latency. This should correspond to a decrease of 33% in performance per core for a workload characterized by pointer chasing code.
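The arithmetic behind these figures can be sketched in a few lines of Python; the latency values are the ones cited above, and the 50/50 split assumes no NUMA locality:

```python
# Sketch of the memory-latency scaling model above. Latency figures are
# those cited in the text: 76ns for 1S, 89ns/139ns local/remote on 2S.
lat_1s = 76.0
lat_2s_local, lat_2s_remote = 89.0, 139.0

# With no NUMA locality, accesses split 50/50 between local and remote.
lat_2s_avg = 0.5 * lat_2s_local + 0.5 * lat_2s_remote  # 114ns

latency_increase = lat_2s_avg / lat_1s - 1.0   # 0.50 -> 50% higher latency
per_core_relative = lat_1s / lat_2s_avg        # ~0.667 -> 33% lower per core

print(f"2S average latency: {lat_2s_avg:.1f}ns")
print(f"latency increase:   {latency_increase:.0%}")
print(f"per-core relative:  {per_core_relative:.3f}")
```

This is a latency-only model: it assumes performance per core scales inversely with average memory latency, which holds best for pointer-chasing workloads.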

Scaling_1S2Sb

The next step in scaling from 2-way to 4-way is just a guess. Obviously, the memory-latency-only model is far too simplistic. The memory latency in a 4-way based on 25% local and 75% remote node would be 126.5ns, an increase of 11%. Published TPC-E results for scaling from 2-way to 4-way on the Xeon SP 8180 28-core processors show 1.72X. The Virtualization Performance Insights from TPC-VMS paper says TPC-E has 80% local node memory access on a 2-way system.

I will guess that 60% local access is by design, and the remaining 40% is split evenly between the two nodes, resulting in 80% local. Based on this model, the expectation is 70% local access on a 4-way: 60% by design plus the remaining 40% split 10% local and 30% remote, for an average memory latency of 104ns. This is about 18% less than the expected latency for a database not architected to achieve memory locality on NUMA. The scaling for such a database will be less than that achieved by TPC-E; the 1.5X value shown represents an observed 33% increase in CPU for a specific query going from 2S to 4S.
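A minimal sketch of this locality model, using the Xeon SP local/remote figures of 89ns and 139ns (the 60% by-design locality is the guess stated above, not a measured value):

```python
# Locality model sketch, using the article's Xeon SP latency figures:
# 89ns local node, 139ns remote node.
def avg_latency(local_fraction, local=89.0, remote=139.0):
    return local_fraction * local + (1.0 - local_fraction) * remote

# No locality: accesses land on nodes at random.
naive_2s = avg_latency(0.50)   # 114ns on a 2-way
naive_4s = avg_latency(0.25)   # 126.5ns on a 4-way

# TPC-E style locality: 60% local by design, remainder spread over all nodes.
def tpce_local(sockets, by_design=0.60):
    return by_design + (1.0 - by_design) / sockets

tpce_2s = avg_latency(tpce_local(2))   # 80% local -> 99ns
tpce_4s = avg_latency(tpce_local(4))   # 70% local -> 104ns
```

The 104ns figure for the 4-way falls out of the 70% local / 30% remote weighting.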

 

This seems like a natural fit, and we could pursue this discussion from here. But there is more. When Intel introduced (the first successful) glue-less multi-processing with Pentium Pro in 1995, the server community went the whole hog adopting the 2-way as the baseline standard system, the 4-way for databases, and sometimes larger multi-processor (MP) systems as warranted. There was no place for a single processor system in the data center, except as turn-key devices. When multi-core processors arrived, the baseline server was more capable than needed for tier-2 and below requirements. This happened about the same time virtualization became sufficiently mature, so the practice of the 2-way baseline continued.

In the same general period as multi-core, the memory controller was integrated into the processor. AMD did so in 2003/04, before multi-core; Intel followed in 2009/10, after multi-core. An integrated memory controller substantially reduces latency in a single processor system, and results in all multi-processor systems having Non-Uniform Memory Access (NUMA). Naturally there is a difference between local and remote node memory latency in a NUMA system. If the full software stack is architected to achieve a high degree of memory locality, then it can benefit from local node latency. If not, the average of local and remote node latency can still be better than that of a system with a discrete memory controller. Regardless, the overall benefit of an integrated memory controller is positive.

Until recently, a single processor with an integrated memory controller and excellent transaction performance per core was interesting but non-actionable, because the greater total compute and memory capacity of a multi-processor system was necessary for heavy workloads. But in recent years, single processor core counts have grown: 18 cores in 2015, 24 in 2016, then 28 in 2017, enough to handle even very heavy workloads. Extreme memory capacity is also no longer a necessity, or even a significant benefit, with storage on all-flash, more so with a full NVMe stack.

The advantages of a single processor system are now too strong to disregard, though few have yet recognized this paradigm change. The new strategy should be to employ a single socket, increasing the core count as necessary, even though the high core-count models are more expensive on a per-core basis, before resorting to MP. If it is necessary to step up to a multi-processor system, there is a significant penalty in per-core efficiency due to higher memory latency.

 

It is almost universally accepted without question that the server is a multi-processor system. Long ago, there were valid reasons for this. Over time, the reasons changed but were still valid. When the memory controller became integrated into the processor, the very low memory latency of a single processor system was an interesting but non-actionable tidbit. The greater compute and memory of the multi-processor was then more important. In recent years, the high core count and memory capacity of a single modern processor is more than capable of handling most transaction processing workloads. The lower memory latency of the single socket system gives it 40-50% better performance per core or thread than a multi-processor system. It is time to rethink our concept of what a server system is. As attractive as cloud computing is for a number of reasons, none of the major cloud providers offer single processor systems, forgoing the price-performance advantage of the 1S system.

Single and Multi-Processor Systems

There is a mechanism by which we can significantly influence memory latency in a multi-processor (socket) server system, that being memory locality. But few applications actually make use of the NUMA APIs in this regard. Some hypervisors like VMware allow VMs to be created with cores and memory from the same node. What may not be appreciated, however, is that even local node memory on a multi-processor system has significantly higher latency than memory access on a (physical) single-socket system.

That the single processor system has low memory latency was an interesting but non-actionable bit of knowledge until recently. The widespread practice in the IT world was to have the 2-way system as the baseline standard. Single socket systems were relegated to small business and turnkey solutions. From long ago until a few years ago, there was a valid basis for this, though the reasons changed over the years. When multi-core processors began to appear, the 2-way became much more powerful than necessary for many secondary applications. But this was also the time virtualization became popular, which gave new reason to continue the 2-way baseline practice.

Intel Xeon E5 v2, Ivy Bridge to Xeon SP, Skylake

Prior to 2012, the Intel product strategy was a desktop-type processor for 2-way systems and a big processor for 4-way+ systems. There were variations of this depending on the processor generation for various reasons. In 2012, the big processor was brought to both 2-way and 4-way systems in the Xeon E5 line. The next year, Intel employed 3 die layouts to fill their Xeon E5 v2 lineup, with the 6-core LCC, 10-core MCC and 15-core HCC.

IvyBridge3die

It was possible and practical for Intel to put 15 cores on one die in their 22nm process for a high-end product. But there was also significant demand for low and medium-priced products, better suited to the cost structure of the 6-core and 10-core die, respectively. Hence the 3 die layout strategy. The E5 v2 was only available publicly with up to 12 cores. The full 15-core die was publicly available in the E7 v2 (and in E5 to special customers?).

In the current generation Xeon SP processors, the 3 die layout options are the LCC at 10 cores, HCC at 18 cores and XCC at 28 cores.

Skylake3die

The inherent nature of semiconductor cost structure is that a large die is disproportionately more expensive than a small die. Some of this is the unusable real estate due to the number of whole die that fit on a wafer. In very large die, cost is dominated by declining yield at a given defect density (see also Motley Fool). The cost structure of Intel Xeon E5 and Xeon SP processors largely fits this pattern with some market driven adjustments.
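The disproportionate cost of a large die can be illustrated with a simple Poisson yield model. The defect density, die areas, and wafer cost below are hypothetical round numbers chosen for illustration, not Intel's actual figures:

```python
import math

# Illustrative sketch of why a large die is disproportionately expensive.
# Poisson yield model: yield = exp(-D0 * area). All numbers are hypothetical.
def good_die_per_wafer(die_area_mm2, wafer_diam_mm=300.0, d0_per_cm2=0.1):
    wafer_area = math.pi * (wafer_diam_mm / 2.0) ** 2
    gross_die = wafer_area / die_area_mm2                       # ignores edge loss
    yield_frac = math.exp(-d0_per_cm2 * die_area_mm2 / 100.0)   # mm^2 -> cm^2
    return gross_die * yield_frac

wafer_cost = 5000.0              # hypothetical wafer cost
small, large = 160.0, 480.0      # hypothetical small vs large die areas, mm^2

cost_small = wafer_cost / good_die_per_wafer(small)
cost_large = wafer_cost / good_die_per_wafer(large)

# Same total silicon area, yet the one large die costs ~38% more than
# three small die, purely from yield loss:
print(cost_large / (3 * cost_small))
```

With a larger defect density or die area the gap widens rapidly, since yield falls exponentially with area in this model.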

The chart below shows the cost per core versus cores for the range of Xeon SP models (excluding certain specialty SKUs).

XeonSP_price_core

In the main sequence, the cost per core increases as core count increases. As expected, two small die processors cost less than one large die processor having the same total number of cores as the two small ones combined. An example is two Xeon 6132 14-core processors at $2,111 each ($4,222 total) versus one Xeon 8180 28-core at $10,009.
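The per-core arithmetic from the list prices cited above:

```python
# Per-core cost comparison from the list prices in the text.
cores_6132, price_6132 = 14, 2111    # Xeon 6132
cores_8180, price_8180 = 28, 10009   # Xeon 8180

per_core_6132 = price_6132 / cores_6132   # ~$151 per core
per_core_8180 = price_8180 / cores_8180   # ~$357 per core

# Two 14-core parts supply the same 28 cores for well under half the price:
two_small = 2 * price_6132                # $4,222 versus $10,009
```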

At the system level, there is not much difference between the cost of motherboards that can accommodate one or two processors. It would seem then that there is logic in continuing the 2-way system as standard practice, now having 2 or 3 core count and frequency options to fit the broad range of data center requirements. This concept is valid assuming cores are equal between single and multi-processor systems.

Memory Latency, 1-Socket versus 2-Socket

Except that cores are not equal between single processor and multi-processor systems; the difference in memory latency is significant. And memory latency is the key determinant of performance for a most important class of applications.

The figure below shows a representation of the single socket Xeon E5 v4 system.

Xeon_1S_lat2

On a single socket Xeon E5-2630 v4 10-core processor, L3 latency was 17ns and memory latency was L3 + 58ns = 75ns as measured with the 7z benchmark. Intel cites L3 latency for the E5-2699 v4 22-core processor as 18ns. Hence single socket memory latency is assumed to be 76ns for an HCC die.

In the LCC die, there is a single ring connecting the cores and other sub-components. The HCC die has two interconnect rings, in turn, connected with a pair of switches. One might think that there would be more than 1ns difference between a 10-core LCC and the 22/24-core HCC, and it would be nice if Intel would provide an explanation.

The figure below shows a representation of the 2-way system with Xeon E5 v4 processors.

Xeon_2S_lat2

The 7-cpu website shows L3 at 18ns for the E5-2699 v4, consistent with Intel. Local node memory latency is L3 + 75ns = 93ns.

The 7z benchmark does not test remote node memory latency. Intel rarely mentioned memory latency in the Xeon E5/7 v1-4 generations. One source cites remote node memory at 1.7X local node. Another cites remote node at 50-60ns higher than local node. The value of 148ns is a placeholder used in later calculations.

The difference between single socket memory latency and 2-way local node memory latency is attributed to a remote node L3 check to ensure cache coherency. The difference between 76 and 93ns is 22%. As memory latency is the major determinant in database transaction processing performance, 22% is huge.

The local node latency applies for applications that can achieve memory locality on a NUMA system. The database engine has NUMA capabilities. But if the complete application and database environment was not architected to work with the NUMA features, memory locality will not be achieved.

Xeon SP

In 2017, Intel decided that it was important to compare memory latency for their Xeon SP against that of AMD EPYC. And hence memory latency information was in their presentations. The figure below shows 89ns for local node and 139ns for remote node memory latency on the Xeon SP 8180 28-core processor.

memory_latencies

In Xeon SP, L2 cache latency increased from 12 to 14 cycles on account of the increase in size from 256KB to 1MB. L3 latency increased from 18 to 19.5ns in the large die models as measured on the Intel Memory Latency Checker. It seems that the Skylake Xeon SP has lower memory latency than Broadwell Xeon E5 v4. This could be true or it could be different measurement techniques.

The figure below is from computerbase.de. Be aware that different values are cited by various sources.

MemoryCacheLatency2

See the Intel Xeon Scalable Architecture Deep Dive (slide deck) and Intel Xeon Processor Scalable Family Technical Overview for more information.

Equivalent Transaction Performance 1S to 2S

On a full 2-socket system, a database not architected for NUMA is expected to have 50/50 local-remote node memory access. Then average memory latency is 114ns (average of 89 and 139) for Xeon SP and 120.5ns (average of 93 and 148) for Xeon E5 v4.

Average memory latency on the 2-way is 50-59% higher than on a single socket system. The expectation is then that performance per core on the 2-way system is 0.63-0.67 times that of the single socket system. The throughput of the 2-way, with twice as many cores, is then 1.26 to 1.34 times that of the single socket system.

Based on scaling of 1.35X from 1S to 2S, a 2-way system with 14-core processors is approximately equal in throughput performance to a one-socket system with 19 cores (actual product options are 18 or 20 cores).

1S2S_equiv1

The single socket 28-core processor is about equivalent to a 2-socket system with 20-22 core processors.

1S2S_equiv1
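The equivalence arithmetic behind the two comparisons above, using the 1.35X scaling estimate:

```python
# Equivalence arithmetic at the estimated 1.35X scaling from 1S to 2S
# (2 sockets x ~0.675 per-core efficiency relative to one socket).
SCALING_1S_TO_2S = 1.35

def equivalent_1s_cores(cores_per_socket_2s):
    # A 2S system with N cores per socket matches a 1S with N x 1.35 cores.
    return cores_per_socket_2s * SCALING_1S_TO_2S

eq_14 = equivalent_1s_cores(14)   # ~18.9 -> 1S with 19 cores (18 or 20 offered)
eq_28 = 28 / SCALING_1S_TO_2S     # ~20.7 -> 28-core 1S ~ 2S of 20-22 core parts
```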

The single socket system also has better thread level performance in proportion to the difference in memory latency, 50-59%. This has significant value of its own. The single socket system should not have the NUMA anomalies that can occur on multi-processor systems; one example is contention between threads running on cores in different sockets.

The cost difference for the processors may or may not be regarded as large. However, software is generally licensed on a per-core basis. The licensing fee may be on the order of thousands of dollars per core, versus a processor cost on the order of hundreds of dollars per core, so performance efficiency per core becomes all the more important.

Summary

The argument for the single socket as the baseline standard system is compelling. In the past, there were valid reasons for both the greater compute capability and memory capacity of multi-processor systems. In recent years, with storage on SSD, massive memory overkill is no longer necessary, even though it may still be an involuntary muscle-reflex action for DBAs.

The reality is that, with proper design, most transaction workloads can be handled by the 16-28 cores available in a single processor, with Hyper-Threading doubling the logical processors (threads). Scaling up to a multi-processor system provides only a moderate throughput gain and may create new problems, some with serious negative effects. A NUMA-aware VM recovers about half of the efficiency difference between a multi-socket and a true single socket system. This is helpful, but still falls short of the full potential: the 22% local-node latency penalty remains, plus the VM's own overhead.

Addendum

On the Oracle side, the database and application architecture strategies necessary to achieve compute node locality on RAC could be applied to achieve (memory) node locality on a (single compute node) NUMA system.

For anyone who remembers that single to multi-processor scaling was better in the old days: it was. My recollection is that in the Pentium II to III Xeon period, it was something like the following.

Scaling_1S2S

Scaling_4S

Scaling from 4-way to 8-way was perhaps 1.5X. The reason that scaling from 1 to 2 processors was better back then was that there was little change in memory latency in going from 1S to 2S, and similarly from 2S to 4S. However, these comparisons should be made based on the same processor and memory controller. The regular Pentium II and III had a slower L2 cache but a simpler memory controller. The Xeon versions had a faster and bigger L2 cache, but a more complex memory controller for much larger memory capacity, which also entails higher latency.