
Multi-Processors Must Die (2018-08, edit 2018-08-09)

Less dramatically, single processor systems should be the rule and multi-processor systems the exception. It is almost universal practice to employ the 2-processor (2-socket) system as the standard in data centers. One would expect there to be good and valid reasons for this, considering that everyone does it as if it were a best practice. People have been doing this for twenty years.

Twenty years ago, there were good, valid reasons for the 2-way as the baseline system. It is worth asking whether this practice has been re-evaluated periodically to determine if the underlying reasons are still valid, or whether people have simply been following what others did before them without question, which seems to be the standard practice regarding best practices.


Twenty years is ten generations of Moore's Law, over which transistor density increases by a factor of 2 in each successive generation. The cumulative change over ten generations is an increase in density by a factor of one thousand. This multiple-orders-of-magnitude change in the basic building block of the microprocessor has rendered the important issues of twenty years ago mostly irrelevant today.

Two key changes over this period have been the integration of the memory controller (IMC) into the processor, and the advent of multi-core processors. Additional important matters are the cumulative pace of change in memory capacity versus requirements, and the transition from hard disk to solid-state storage. These changes combined form the basis for rethinking server system architecture.

RISC Revisited

In the 1980's, around the 2µm process (2000nm!), a single chip die could have 40,000+ transistors. This was when the argument for RISC was made: by simplifying the central processing unit of complex processors, the entire CPU could fit on a single die. While not all instructions could be implemented in silicon, the benefits of eliminating inter-chip delays outweighed the advantages of multi-chip processors.

The Intel 80286 in 1982 was manufactured on 1.5µm with 134K transistors.
The Berkeley RISC-I in 1982 was 44,420 transistors.
The first MIPS may have been the R2000 on 2.0µm, 80mm², 110,000 transistors in 1987.
Wikipedia cites early SPARC as 1987, 1300nm, and 0.1-1.8M transistors. Presumably 100K transistors was for the core, and 1.7M was for the L1 cache?

Now, it is time to abandon the multi-chip system architectures of the multi-processor type, to eliminate their impact on memory latency.

Origins of 2-way as Baseline

The practice of the 2-way as the baseline standard server system began soon after the Intel Pentium Pro processor arrived in 1995. The Pentium Pro was the first successful implementation of glue-less multi-processing. Multiple processors could be connected with the same core logic chips used to assemble single processor systems.

The preceding Pentium processor did have glue-less 2-way protocols, but that functionality was a disaster in actual systems. Two processors shared an L2 cache, over an arbitrated bus. Out of the ashes came Pentium Pro with L2 cache on a separate backside bus, and a split-transaction frontside bus allowing overlapped command/address and data transfers.

Prior to Pentium Pro, additional silicon components were necessary to bridge multiple processors together in a workable manner. The higher cost of the specialty components led to lower demand, which in turn meant less volume to amortize the fixed development costs. The combination of the two usually resulted in a death spiral unless there was a critical requirement with cost-is-no-object parameters.

The first chipset for Pentium Pro was the very complicated 450GX, which was meant to support 4-way servers with large memory and extra PCI busses. The second chipset was the much simpler 440FX with client PC type features and cost structure. From here, most system vendors worked out that having a second processor socket added very little cost to the basic system.

The figure below is more representative of the 440BX chipset in having an AGP port.


In the Pentium II (Klamath and Deschutes) generations, the cost of the second processor could be as low as $200. This created the dual-processor PC workstation category. More or less, RISC workstations were eradicated over the next several years. For the datacenter, it was not worth the bother to also have a single processor option as one of the standard systems.

Regardless of whether there were actual benefits in having a second processor, both workstations and entry servers of this type were widely deployed. This allowed developers to have ready access to multi-processor systems, facilitating the development of multi-threaded applications. Note there are important differences between the behavior of multi-threaded applications on single and multi-processor systems.

It turned out that VB-classic components in IIS were not multi-processor safe (a bug that was fixed later), resulting in the need to recycle the process after exhausting the 2GB user address space. Recall that in this period, Microsoft had separate kernels (and HALs) for single and multi-processor systems. Eventually, Microsoft felt comfortable dropping the single processor builds of Windows NT.

Integrated Memory Controller (IMC)

The AMD Opteron with an integrated memory controller arrived in 2003. At the time, there were a number of factors in this decision. One was that making productive use of more transistors in the processor core was difficult. On the other hand, the lower memory latency in a single processor system having an IMC achieved very good performance gains, depending on the application. It could also enable a lower cost system by having fewer silicon components. This was not targeted in the initial iteration but did occur in subsequent generations.

Below is a representation of the 2-way Opteron system:


and the 4-way Opteron system.


An implication of the IMC is that any multi-processor system inherently has non-uniform memory access (NUMA) architecture, with a memory node associated with each processor. Memory attached to the processor that a thread is running on is local. Memory attached to other processors is remote, traversing one or more hops.

The Intel processor codenamed Timna in the early 2000's was to have an IMC, but this was a poorly conceived reactionary project that was then cancelled. The main line Intel processors incorporated IMC with Nehalem in the 2008-10 timeframe.

Multi-core Processors

The advent of multi-core processors was an acknowledgement that increasing performance of the core to a degree commensurate with doubling of the transistor budget in each process generation was no longer possible. The expected gain for doubling the number of transistors is 1.4X.

Multi-core processors may scale nearly linearly with the number of cores, and by proxy the number of transistors, for workloads that can be evenly distributed over many threads and run with minimal contention. Multi-core does nothing for single-threaded applications. Hence, so long as there was an avenue to increase general purpose performance, meaning the processor core, that was the path, until it was no longer feasible.

Multi-core processors arrived around 2005. For Intel, the Pentium D was simply two Prescott dies in one package. This was possible because Intel processors of that period still interfaced with a bus allowing shared access by multiple devices.

Below is a listing of some milestone Intel Multi-Core processors.

Codename         Cores        L2/L3              Year-M    Process
Tulsa            dual         2x1M L2, 16M L3    2006-8    65nm
Conroe           dual         4M L2              2006-7    65nm
Dunnington       6            3x3M L2, 16M L3    2008      45nm
Nehalem EP/EX    4/8          8M/24M L3          2009/10   45nm
Westmere EP/EX   6/10         30M L3             2010/11   32nm
Sandy Bridge     8            25M L3             2012      32nm
Ivy Bridge       6, 10 & 15   37.5M L3           2013/14   22nm
Haswell          8, 12 & 18   45M L3             2014/15   22nm
Broadwell        10, 15 & 24  60M L3             2015/16   14nm
Skylake          10, 18 & 28  38.5M L3           2017      14nm
Ref: Wikipedia Xeon and Wikichip Intel microarchitectures

On www.qdpma.com, I have collected Intel desktop and Xeon processor die images. They are shown approximately to scale, at 10 pixels per mm, based on available information. Notice how the portion of the die identifiable as L2 or L3 cache has changed from large at 65nm to fairly small at 14nm. Over five process generations, from 65nm to 45nm to 32nm to 22nm and finally to 14nm, the number of cores roughly doubled with each generation, reaching 28 cores in Skylake SP (versus 2^5 = 32).

Tulsa was a dual-core die based on Cedar Mill. This was the last of the enhanced Pentium 4 architectures. There was a dedicated 1M L2 for each core and a shared 16M L3. Tulsa arrived as the Xeon 7100 series for 4-way systems in 2006.

Conroe was a dual-core die in which both cores shared the L2 cache and the bus unit, also in 2006. The release dates of Tulsa and Conroe were very close, but Conroe was employed in multiple product lines: desktop, mobile, and multi-processor servers. A quad-core product with two Conroe dies in one processor package followed shortly after.

Dunnington was a 6-core die composed of three Penryn dual-core units, each with 3M L2 shared by two cores, and a 16M L3 shared by all three core-pairs, sold as the Xeon 7400 series.

Nehalem-EP refers to the quad-core model for 2-way systems as the Xeon 5500 series, and the EX was the 8-core die for 4-way+ servers as the 7500 series. Westmere-EP was a 6-core for 2-way as the Xeon 5600 series and the EX was 10-core for 4-way+ servers as the Xeon E7 series.

Sandy Bridge-EP was an 8-core for both 2-way and 4-way systems as the Xeon E5 series. The EP had only 2 QPI links, so in a 4-way system, two remote processors are directly connected (1-hop) and one remote processor is 2-hop, similar to the 4-way Opteron.

In the next three generations, there were Xeon E5 v2, 3, and 4 products with 2 QPI links, similar to Sandy Bridge-EP, and Xeon E7 v2, 3, and 4 products with 3 QPI links, which used Scalable Memory Buffers (SMB) to double memory capacity per channel.

Terminology in the Multi-core Era

Prior to this, the processor, its single core, and the socket that it plugged into all referred to the same object. No terminology for multi-core processors is entirely free of confusion, but the convention Intel uses is as follows. The complete processor in a package fits in a socket. Intel processors are now comprised of multiple cores. When Intel Hyper-Threading (the generic term is Simultaneous Multi-Threading) is pertinent, Microsoft uses the term Logical Processor.

The Transaction Processing Performance Council (TPC) uses the terms: processor, core and thread respectively. Be aware that certain processors are not meant to be socketed, so the term socket may only mean the complete processor and not the package mechanical connection. Because the term processor might be used in the context of its historical nature as a single logical CPU, Intel seems to use the term socket to mean the whole multi-core entity.

Scaling - Shared Bus, Uniform Memory

Scaling from one to two (single core) processors on the old shared bus architecture was very good, perhaps 1.8X for workloads characterized by memory round-trips (example: transaction processing), as opposed to those that stream memory. Scaling from 2 to 4 was good enough with a large cache.



Note: in the mid-2000s, the 4-way Intel systems had either 1) a memory controller with two independent busses, and two processors sharing each bus or 2) four independent busses on the MC and one processor per bus.

Scaling - NUMA

Opteron had exceptional performance at the single processor (again, single core) level for transaction processing, in having low memory latency as the processor connects directly to memory.


Scaling Opteron from 1 to 2 processors incurs a step increase in memory latency. Memory access is now a blend of local and remote node accesses. This is the case unless both the application and database were architected together to achieve a high degree of memory locality on NUMA systems.

TPC-C, and to a lesser degree TPC-E, benchmark configurations were done in this manner. (As a result, Oracle had good scaling in TPC-C on RAC, but less so with TPC-E. This is why Oracle did not publish TPC-E results, even though single system TPC-E performance was excellent.) Perhaps a handful of real world databases did this, probably on the Oracle side, because it was necessary for RAC scaling.

Nevertheless, even without NUMA optimizations, the 2 and 4-way Opteron systems had very good characteristics compared to the bus architecture processor systems. Exact comparisons are complicated because the core processor architectures are different as well as the differences at the system level. There did not seem to be any adverse issues related to the NUMA architecture in 2 and 4-way systems during the period of single, dual and perhaps quad-core processors. That or it was not broadly reported.

Prior to Opteron, NUMA systems generally involved nodes of 2 or 4 processors, each node with its own memory controller, as represented below.


Severe scaling issues attributed to NUMA were observed in these systems, for which there was very little publicly available technical information. In hindsight, it is possible that this was due to contention from the combination of NUMA and multiple processors on each node, the whole system having very many processors (and cores), typically 16 or more.

That there was a penalty in scaling from 1 to 2 processors, due to the shift from all local node accesses to a blend of local and remote node accesses, did not matter. High volume database servers of the time needed the compute capability of 2 or 4 processors, and their associated memory and IO capacity. In fact, even greater capability would have been desired if scaling could be realized without running into severe scaling issues in databases not architected for NUMA.

Modern 1S versus 2S, post-2014

So, what is different today, or more broadly, since about 2014? The modern two-processor (socket) ostensibly has double the memory bandwidth and capacity of the single-socket system. A bandwidth dependent application properly architected for NUMA can in fact achieve most of the nominal bandwidth of the 2S system.

Furthermore, the manufacturing cost of one very large die is much more than the cost of two medium size dies with the same total number of cores. Part of this is from the number of whole dies that fit on a wafer, but more of it is due to yield characteristics at very large die size for a given defect density. If the cores between these two systems were equivalent, then the 2-socket could have a large cost advantage.

Except that the cores between 1S and 2S systems are not equivalent when the application is sensitive to memory latency, prominently, database transaction processing. This is true even if the application can achieve a high degree of memory locality on NUMA because memory latency on a single socket system is lower than local node latency of a multi-socket system.

The figure below has values from an Intel slide showing 89ns for local node and 139ns for remote node for a 28-core Xeon SP 8180.


There does not seem to be an Intel source for the single socket 8180. See Memory Latency and Single Processor for more on this topic.

The 7-cpu website has the i7-7820X 8-core 4.3GHz (Turbo Boost) (Skylake X), which could be the LCC die, memory latency 79 cycles + 50ns with DDR4-3400 16-18-18-36 (unbuffered) memory. This works out to 18.37ns for L3.

The 8180 in a server would probably have DDR4-2666 LRDIMM at CL=19. The L3 cited for the 28-core XCC die is 19.5ns. Assuming the LRDIMM ECC server memory has 58ns latency, and L3 is 19.5ns, then the 1S Xeon SP might have 77.5ns total memory latency. For whatever reason, I had made calculations based on the single socket XCC memory latency at 75ns. There is an 18.7% difference between 89 and 75ns memory latency.
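The cycle-to-ns conversions behind these estimates can be spelled out in a few lines. This is my arithmetic, and the 58ns LRDIMM figure is the assumption stated above, not a measured value:

```python
# Cycle-to-ns conversions behind the latency estimates above.
# The 58ns LRDIMM ECC figure is an assumption, not a measured value.

def cycles_to_ns(cycles, freq_ghz):
    return cycles / freq_ghz

l3_i7 = cycles_to_ns(79, 4.3)  # i7-7820X L3, 79 cycles at 4.3GHz turbo
print(f"i7-7820X L3: {l3_i7:.2f}ns")                    # 18.37ns

xcc_l3 = 19.5    # 28-core XCC die L3, ns
lrdimm = 58.0    # assumed LRDIMM ECC DRAM latency, ns
print(f"1S Xeon SP estimate: {xcc_l3 + lrdimm:.1f}ns")  # 77.5ns
print(f"89ns vs 75ns: {89/75 - 1:.1%} apart")           # 18.7%
```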


In the previous generation Xeon E5 v4 (Broadwell) processors, the single socket E5-2630 v4 2.2GHz 10-core memory latency was measured at 38 cycles L3 + 58ns = 75ns. The 2S E5-2699 v4 22-core 2.2GHz base was reported for 3.6GHz Turbo as 65 cycles L3 + 75ns = 93ns.

Performance Model

The performance model used here was originally based on observed characteristics of a 2S Xeon E5-2680 8-core 2.70GHz (Sandy Bridge), Hyper-Threading (HT) enabled, 16 cores total, 32 logical processors or threads in TPC parlance. On a BIOS/UEFI update, the system defaulted to a power-save mode of 135MHz (one-twentieth of 2.7GHz). The worker time (CPU) for key transactional queries was observed to be approximately 3 times higher than normal.

Published memory latency values for 2S Sandy Bridge were not found, so local node latency was assumed to be 93ns (the reported 2S Broadwell value) and remote node 143ns (comparable to reported values), for an average memory latency of 118ns. From this we can calculate that 2.60% of instructions incurring a memory round-trip would be consistent with a 3× performance difference.

Note that this model has issues. At 2.7GHz, 118ns average memory latency is 318.6 CPU-clock cycles. If 2.6% of instructions involve a memory round-trip, then only 10.8% of CPU-cycles are doing work; the rest are stalled waiting for memory. In these circumstances, we would expect Hyper-Threading to scale almost linearly. This is why some RISC processors are SMT4 or SMT8, that is, 4 or 8 threads per core (SMT is the generic term, HT is Intel terminology). However, at 135MHz, 72% of cycles are not idle, hence HT scaling is at best 1.39× and probably less. It is likely that 4-5% of instructions involve a memory round-trip.
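The back-of-envelope model can be sketched in a few lines of Python. This is my reconstruction of the arithmetic, treating each non-memory instruction as one cycle and each memory round-trip as a full-latency stall; solving for the round-trip fraction consistent with a 3× slowdown lands in the neighborhood of the 2.60% figure:

```python
def ns_per_instr(freq_ghz, mem_frac, latency_ns):
    # one cycle per instruction plus the latency-weighted memory stalls
    return 1.0 / freq_ghz + mem_frac * latency_ns

def solve_mem_frac(f_hi_ghz, f_lo_ghz, latency_ns, slowdown):
    # solve slowdown = ns_per_instr(f_lo) / ns_per_instr(f_hi) for mem_frac
    r = f_hi_ghz / f_lo_ghz
    return (r - slowdown) / (latency_ns * f_hi_ghz * (slowdown - 1.0))

# 2.7GHz dropped to 135MHz (20x slower clock) but CPU time only tripled
p = solve_mem_frac(2.7, 0.135, 118.0, 3.0)
print(f"memory round-trip fraction: {p:.2%}")  # ~2.7%, near the cited 2.60%
```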

Regardless, the value of 2.46% is used for the fraction of instructions incurring memory round-trips. The model further assumes a core frequency of 2.5GHz, which is common in more recent high-core count processors. Memory latency is assumed to be 89ns local and 139ns remote, the published values for Xeon SP (Skylake).

Then the difference between 89ns local node memory latency in a 2S system and 75ns in a true 1S system is expected to be a performance difference of 15.8%. A VM with local memory would see something similar to this, except that there is an extra step in the memory translation (search VT-x, EPT, SLAT).

Suppose that we had architected our combined application plus database to achieve 80% memory locality on the 2S system. Then average memory latency is 99ns, corresponding to a performance difference of 26% per core, or a 1S to 2S scaling of 1.58X.

For a database not architected for NUMA, we would expect average memory latency to be (89+139) ÷ 2 = 114ns, for a 52% difference in latency translating to 43% in performance per core. Scaling from 1S to 2S is 1.4X (2× the number of cores, each core equivalent to 0.7× of cores in a 1S system). The disparity in performance per core is very large.
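The per-core and scaling figures above follow from the same model; a sketch under the stated assumptions (2.46% memory round-trip fraction, 2.5GHz core, 89ns local, 139ns remote, 75ns for a true 1S):

```python
MEM_FRAC = 0.0246   # fraction of instructions incurring a memory round-trip
FREQ_GHZ = 2.5
LOCAL, REMOTE, ONE_SOCKET = 89.0, 139.0, 75.0   # ns

def ns_per_instr(latency_ns):
    # one cycle per instruction plus the latency-weighted memory stalls
    return 1.0 / FREQ_GHZ + MEM_FRAC * latency_ns

def per_core_penalty(locality):
    # time per instruction on a 2S core relative to a true 1S core
    avg = locality * LOCAL + (1.0 - locality) * REMOTE
    return ns_per_instr(avg) / ns_per_instr(ONE_SOCKET)

# ~26% gap and 1.58x scaling at 80% locality; ~43% and 1.40x at 50%
for locality in (1.0, 0.8, 0.5):
    gap = per_core_penalty(locality) - 1.0
    scaling = 2.0 / per_core_penalty(locality)   # 2x cores, each slower
    print(f"locality {locality:.0%}: per-core gap {gap:.0%}, "
          f"1S->2S scaling {scaling:.2f}x")
```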

Given that software licensing costs are frequently on a per-core basis, and are often the larger cost element, more so when Enterprise Edition applies, there is a strong preference for the solution having better performance per core. It does not matter that two 14-core processors cost less than one 28-core processor; the 28-core 1S is probably a closer equivalent to the 2S × 20-core system.

Memory Bandwidth, Capacity and IO

Additional considerations in the 1S versus 2S system choice are memory bandwidth, memory capacity, and IO channels. Obviously the 2S, based on processors with IMC and integrated PCI-E as well, has twice the nominal bandwidth, twice the memory capacity, and twice the PCI-E lanes. Are these elements important?

There are applications that stream memory, in which the higher number of memory channels is important. But database transaction processing is not such an application. Nor is IO bandwidth an issue. A table scan to memory will not tax a modern processor's memory bandwidth, nor does it tax the IO channels. This leaves only the question of memory capacity.

A very long time ago, memory was indeed desperately needed to reduce disk IO to achievable levels. This was probably the era when mini-computers supported a maximum system memory of 8-64MB. The rule that DBAs lived by was: more memory is better. And perhaps: there is no such thing as too much memory. By the 1990's, server system memory was in the GB range.

In the early to mid-2000's, system memory reached 32-64GB. This was when I noticed that on properly tuned high volume transaction databases, disk IO was typically well below 10,000 IOPS, so long as the data architect did not do something really stupid like employ unique identifier (guid) columns as primary keys on the core tables. This is well below the practical upper bound of 200,000 IOPS of a very large disk array, that being around one thousand 15K HDDs, with short stroking.

By the time of 4-way Nehalem-EX, practical memory was 512GB based on 64×8GB at about $24K ($375 per 8GB), and the extreme configuration was 1TB on 64×16GB at about $75,000 ($1,200 per 16GB). In this period, it was not uncommon to see disk IO reduced to noise levels, except for the checkpoint flushes.

Solid-State / All-Flash Storage

Around this time, 2010-11, solid-state was beginning to become feasible as storage. The cost might have been $10-20 per GB for enterprise class products, which generally means a high degree of over-provisioning, perhaps 25% or higher. Technically, the justification for all-flash is when the cost of the SSD capacity is less than the cost of desired IOPS level in HDDs. In practice, people implemented flash if they had the money and felt it would solve a problem.

By 2014-15, all-flash was probably the better storage choice for line-of-business systems, regardless of IO requirements, with the price of enterprise class SSDs now at $1-4/GB.

The main point here is that the rule that more memory capacity is better has been killed on both ends. First, (economical) system memory capacity has grown at a faster rate than the need for memory, even with today's undisciplined, bloated databases.

Second, the old bugaboo of IOPS and IO latency has been vanquished by all-flash storage. Furthermore, stepping down to a single socket system with 12 DIMM slots is no issue: 64GB DIMMs are very economical now, for a total memory of 768GB, with 128GB DIMMs probably becoming economical next year or by 2020.

Memory in Multi-Processor Systems

Before closing out this topic, some discussion on DRAM is warranted. All during this extended period from the 1970's to present, DRAM manufacturers were told by database and systems experts that only price and capacity translated to market value. (It was also understood that memory bandwidth was important, hence the scaling of SDRAM from 66-100MHz to DDR4 at 2666MT/s+ today.)

Although servers are only one segment of the DRAM market, it is the segment that is not afraid to pay a 2-3X per bit premium for the latest, highest density product, while other segments bought the commodity previous generation parts. It is helpful to understand extenuating circumstances that formed the old rules for DRAM.

The server system architecture suitable for database transaction processing according to old school principles is represented below.


The system is designed for very large memory capacity. Notice the SMB multiplexers to increase the number of DIMM slots. This was the standard for 4-way systems all the way up to the Xeon E7 through v4 (Broadwell).

Note that the Xeon E5 had 2 QPI ports while the E7s had 3 for better scaling at the 4-way level (1-hop to all remote nodes). In the restructuring of the Xeon E5/7 into Xeon SP, Intel finally acknowledged that memory configurations had become ridiculously large. The highest core count models with 3 UPI links are now available without the memory expander (which is still available in the M models).

Regardless of whether the processors have an IMC or discrete memory controllers, the path to memory is long. Before IMC, it was from processor to memory controller to SMB to DRAM and then back.

After IMC, it could be processor to SMB to DRAM for local node access, or processor to another processor to SMB to DRAM and back for remote node. The processor with IMC has better average memory latency because there is some local node access and the point-to-point signaling protocols are more efficient than the old shared bus.

Average memory latency in this system might be on the order of 127.5ns or longer, based on 90ns local (25%) and 140ns remote (75%).

Memory Today

Of the 127.5ns average latency, 18-19ns is in the local node L3 cache. The DRAM in server systems has conservative ratings, in part to meet higher reliability expectations. Typical timings for tCAS, tRCD and tRP are about 14ns each. Depending on whether just the first two or all three are involved, the latency at the DRAM interface might be 28ns.

This translates to 58ns at the memory controller without the multiplexor (SMB) in between. For gaming systems, there are cherry picked parts available with 10ns timings, which achieves 50ns latency at the memory controller.
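The timing arithmetic can be sketched as follows; the DDR4-3400 CL16 case corresponds to the 16-18-18-36 gaming memory mentioned earlier, and the document's ~28ns figure for two timings is the rounded sum:

```python
def timing_ns(clocks, data_rate_mts):
    # DDR transfers twice per clock; command clock (MHz) = data rate / 2
    return clocks / (data_rate_mts / 2.0 / 1000.0)

print(f"DDR4-2666 CL19: {timing_ns(19, 2666):.2f}ns")   # ~14.25ns
print(f"DDR4-3400 CL16: {timing_ns(16, 3400):.2f}ns")   # ~9.41ns
# tRCD + tCAS, the first-two-timings case described above
print(f"tRCD + tCAS at DDR4-2666: {2 * timing_ns(19, 2666):.1f}ns")  # ~28.5ns
```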

If we were to have a different memory technology, something that could lower latency at the DRAM interface by 20ns, then overall average memory latency is reduced from 127.5 to 107.5ns. This would improve transaction processing performance by about 16% based on the model above (3% pointer-chasing code not in cache).
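Under the 3% pointer-chasing model at 2.5GHz, the gain from shaving 20ns off average latency works out as follows (my arithmetic on the model stated above):

```python
MEM_FRAC, FREQ_GHZ = 0.03, 2.5   # 3% pointer-chasing, 2.5GHz core

def ns_per_instr(latency_ns):
    # one cycle per instruction plus the latency-weighted memory stalls
    return 1.0 / FREQ_GHZ + MEM_FRAC * latency_ns

gain = ns_per_instr(127.5) / ns_per_instr(107.5) - 1.0
print(f"gain from 20ns lower average latency: {gain:.1%}")  # ~16.6%
```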

If this doubled the cost of memory, it is probably still worthwhile. If this cut the memory capacity by half, and this were back in the old days when memory was still necessary to reduce disk IO to manageable levels, we might not be agreeable.

Intel has already dropped the SMB. Most people realize that 4-way is no longer necessary for even high-volume transaction processing. Average memory latency is now a 50/50 mix of local and remote. But the final leap of quantitative analysis requires that people let go of the belief that a server is multi-processor. This belief goes back twenty years, and the original rationale has been rendered completely irrelevant. Yet people do not want to let go.

Memory - Single Processor/Socket

On stepping down from 2S to 1S, we give up only 30% of system level throughput for a 40% increase in performance per core.

Now, a better memory technology that shaves off 20ns of latency would yield a 28% gain.


This brings our 1S with new memory to within 9% of the original 2S with half the number of cores (and core licenses).

The technology and products for lower latency memory already exist, one example being RL-DRAM. One step is de-multiplexing the row and column addresses. Long ago, multiplexing the row and column addresses helped to reduce the cost of packaging and wiring a memory system. This was back in the day of the x1 organization, and even predates the SIMM, itself a predecessor of the DIMM.

As with other concepts, multiple generations of manufacturing advancement have rendered the cost benefit of multiplexed addressing moot. But no one wants to deviate from an established ecosystem without clear purpose.

Another step is simply more, smaller banks. This entails increasing the amount of die area devoted to logic, which adds cost, and DRAM manufacturers were told long ago not to do this. But now the world is different. For the single socket system, the impact of low latency DRAM is not muted as it is in a large, complex system architecture. Low latency memory, even at more than double the cost of conventional DRAM, is well worth the price.


In re-evaluating the server system, a strong argument is made for single-processor as the default choice for a very large and important category of applications. The gains in cost-efficiency are large and significant. If this calls for the use of more expensive processors on a per core basis, that is more than compensated for on the software side. The original reasons for adopting multi-processor as standard have been rendered moot over ten generations.

In addition, it is necessary to let go of the multi-processor fixation so that we can open up other avenues to reducing memory latency, further improving performance efficiency. And this might include SRAM as main memory.


This is similar to the CISC versus RISC issue from long ago. The CPUs of mainframes were built from many discrete components. The RISC adherents argued that a simpler, single-chip microprocessor was the better approach because it gets rid of the interconnect delays. This meant that complex instructions would have to be split into several RISC instructions.

The unexpected outcome was neither RISC nor CISC. It was x86, which was neither classical CISC nor RISC, that won out, in part by borrowing the technology (as opposed to architecture) elements of RISC and by staying on Moore's Law. Now, we want to get rid of the delays incurred by multiple discrete processors for a shorter path to memory.

Consider a 2.5GHz processor core, and assume 3% of instructions incur a memory round-trip. Then the percentage of active cycles versus average memory latency is:
 75.0 ns     15.2%
114.0 ns    10.5%
126.5 ns      9.6%
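The table above can be reproduced with the same stall model (my reconstruction; the figures land within a rounding step of those shown):

```python
MEM_FRAC, FREQ_GHZ = 0.03, 2.5   # 3% memory round-trips, 2.5GHz core

def active_fraction(latency_ns):
    # fraction of cycles not stalled waiting on a memory round-trip
    latency_cycles = latency_ns * FREQ_GHZ
    return 1.0 / (1.0 + MEM_FRAC * latency_cycles)

# ~15.1%, ~10.5%, ~9.5% active, respectively
for latency in (75.0, 114.0, 126.5):
    print(f"{latency:5.1f} ns -> {active_fraction(latency):.1%} active")
```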

It might seem that increasing the Hyper-Threading (generic term SMT) degree from 2 to 4 or even 8 would solve the memory latency induced CPU dead cycle issue. This does indeed solve the system throughput inefficiency.

But there are times in which we really want single thread performance to be as high as possible. Increasing CPU frequency does not help. Decreasing average memory latency does directly address this issue.