
    [Figure: RISC_CISC]

    [Figure: 2S_1Sc]

RISC vs. CISC - Modern Analogy (2018-08)

The seminal RISC versus CISC debate of the 1980's is still within the living memory of our senior citizens. Computers of the late 1970's and early 80's were evolutionary successors of the original (discrete) transistor-based systems. As such, they inherited characteristics reflecting priorities from the period of the original systems. Prominently, the complex instruction set computer (CISC) architecture allowed programs written in assembly language to be encoded as compact binaries.

The principles of Reduced Instruction Set Computing (RISC) are covered elsewhere, including Berkeley RISC. In brief, RISC was forward looking. The expectation was for continued exponential growth in transistor density per Moore's Law. Thought was then given to the best strategy for utilizing the immense number of transistors that would become available on a single chip in succeeding generations, and the conclusion was RISC.

Why is a topic from ancient computing history of interest today? Who cares that old fogies reminisce incomprehensibly on a topic of no apparent relevance to the young generation who will shape the future? Basically this: the RISC discussion (an archaic term for how issues were resolved in the past) instigated the rethinking necessary to abandon legacy concepts so that the then near- to mid-term future could be pursued without excessive baggage.

Since the original RISC concept, almost 20 process generations have elapsed. The strategies to best utilize silicon technology are very different between then and now. Just as RISC called for a rethinking of legacy concepts, it is now past time to re-examine the concepts held in contemporary systems and processors. Obsolete baggage must be discarded, and course changes are necessary to move forward.

The ball and chain holding back modern systems is the huge disparity between the CPU clock cycle and memory latency. There are seemingly insurmountable barriers to solving this problem, unless one counts the workaround of simultaneous multi-threading (SMT), or Hyper-Threading (HT) in Intel terminology, as the complete and final answer.

Addressing the issue of memory latency appears to be problematic. DRAM latency is simply a matter of cost. But, from the very beginning, the priority requirement for DRAM in computer system main memory was cost, not latency. In the long ago past, more memory was crucial in reducing disk IO. This mission has since been 1) accomplished, 2) exceeded, 3) with overkill and then 4) massive overkill. It is as if adding more memory has become an involuntary muscle action.

And then came All-Flash/SSD storage and later NVMe, obviating any further concern on IO limitations. If that were not enough, there is active development of non-volatile memory with even lower cost, and somewhat higher latency than DRAM.

If a low latency memory technology were adopted in today's prevalent multi-processor server, it would have less than dramatic impact on system level performance. This is because latency in the memory product is a small part of the overall memory latency in a multi-processor system, more so for software that does not achieve a high degree of memory locality. There is simply too much complexity in the memory access sequence of a multi-processor system having non-uniform memory access (NUMA).

The first question is then whether a multi-processor is necessary today. Or is this simply a concept that was once valid, then carried forward from generation to generation after the original purpose was no longer relevant? If we can abandon the doctrinal commitment that servers and multi-processors are inseparable, the system becomes a much less complex single-processor entity with all memory attached locally.

Next, if we can accept that memory capacity and cost are no longer the overriding priority, then lower latency memory technologies become a viable option. Low latency memory products already exist, though not intended for main memory. Examples are Reduced Latency (RL)DRAM and embedded (e)DRAM.

A significant reduction in latency on the memory product now has sufficient impact at the system level to justify substantially higher cost. In turn, higher system performance recovers anything lost in stepping down from a multi-processor to a single processor, if not more. Finally, the full potential of modern multi-GHz processor cores can be realized without the SMT/HT band-aid.

Origins of RISC

Between the 1960's and the 1970's, the manufacturing process technology for the transistor building blocks of processors had changed substantially. The late 1970's was the right time to begin planning for an entire 32-bit central processing unit on a single chip of 40-100K transistors.

In this period, it also became more common to program in more portable high-level languages like C. It was found that compilers generally did not emit many of the complex instructions.

It was argued that the evolutionary CISC systems of that time had accumulated too many outdated concepts and were not a good approach for taking full advantage of silicon technology going forward. Hence, a revolutionary rethink of the processor was called for. To this effect, John Hennessy at Stanford and David Patterson at Berkeley proposed Reduced Instruction Set Computing (RISC).

The term "Reduced" in RISC might be a misnomer. It was only necessary to have a reduced instruction set for the early proof-of-concept RISC processors. Thereafter, the important aspects were: 1) that the instructions have few and regular formats, and 2) to focus on instructions that compilers will actually use.

By the early 1990's, the initial skepticism of RISC had been overcome, and most people fully accepted that the principles were correct. Several process generations (3 → 2 → 1.5 → 1.0 → 0.8 → 0.6 µm) had elapsed since the initial RISC concept. In this period, single chip transistor count increased from 50,000 to perhaps 5 million. (Die size had also increased from 60mm2 to 400mm2.)

Intel dabbled in RISC with the i960 (80960). But they also considered whether the RISC concepts could be improved with better ideas reflecting the upcoming 0.35 µm process in 1996 and for future processes. After all, ideas invented in Happy Valley were better than ideas invented elsewhere. The outcome of this, Itanium, ultimately did not gain traction.

Intel Pentium Pro

That the Intel 8086 descendants ultimately prevailed does not invalidate the RISC concepts. The RISC arguments were that:
  1) complex instructions would be difficult to pipeline,
  2) irregular instruction format would require a large amount of logic for decoding and
  3) more general purpose (GP) registers than the original CISC set were better.

Intel achieved deep pipelining in the Pentium Pro by decomposing X86 instructions into RISC-like micro-ops. This needed significant silicon real-estate for the decoding logic. It was estimated that, between an X86 adopting some RISC technologies and a true RISC processor, the X86 would be at a 15% disadvantage in either die size or performance. Some of this was due to silicon for decoding, and some to the smaller number of registers (visible to the ISA).

As it so happened, Intel had industry-leading silicon manufacturing technology, often being one year ahead, and two years ahead on a production processor. Intel also had huge economies of scale, such that it could amortize a larger design team than was common for the RISC processors. It is somewhat ironic that the actual future processor direction emerged just as RISC had won the argument within academic circles.

The 32-bit Intel processors had already committed to the 8 GP register set established by the 80386 versus the 32 GP registers common in RISC processors. There was an internal Intel study on a 64-bit extension of the X86 ISA, also having 8 registers. One person stated that an instruction set architecture (ISA) with more than 8 registers would fundamentally not be X86. However, Intel was committed to Itanium as the 64-bit ISA, and did not pursue this.

AMD Opteron X86-64

Sun had planned on the RISC-based UltraSPARC for their flagship systems with no intention of a version for low-cost systems. I recall hearing some talk that the Solaris operating systems people worked with AMD on a 64-bit X86-64 ISA with 16 GP registers for Opteron (2003), yet backward compatible with the 8 register set of X86 in 32-bit mode.

Between the late 1990's and the early 2000's, it was clear that the RISC processor vendors were having difficulty executing to Moore's Law, with logic transistor density on the order of 5-10M. (SRAM areal density is higher than logic, so in comparing processors with different cache sizes, estimate the logic complexity separately.) While it might not be surprising that Intel, with their economies of scale, could do this, it is curious that AMD could also produce processors at this transistor density while the RISC vendors could not.

Multi-Processor Systems - 1990's

In this time, the issues with building multi-processor (MP) systems were largely worked out. Two significant elements were 1) a large L2 cache on a private backside bus, and 2) a split-transaction front-side bus enabling overlapped command/address and data transfers. Two and four-way multi-processor systems on a shared bus exhibited good scaling.

    [Figure: 4S_PIIXeonS]

There was also market interest in scaling to much higher levels than 4-way. The cost of implementing a custom line-of-business application could run to hundreds of millions of dollars. Regardless of what the true technical requirements should be, it was unseemly to deploy such a high-profile project on a mere $100,000 hardware platform when there were multi-million-dollar system options. Big-Iron also made for a better background in executive photo-ops.

    [Figure: NUMA_xbar]

Large systems typically have Non-Uniform Memory Access (NUMA). Software does not run well on NUMA systems by accident. The technical challenges in effectively employing such systems were not properly and fully communicated between all involved parties. People eventually became aware that there were serious problems. The alternative of clustering many smaller systems was investigated, but this had its own difficulties.

In any case, multi-processing was extremely popular from the Pentium Pro forward. The 2-way became the baseline standard for server systems in data center operations. The 4-way systems were typically accompanied by memory controllers designed to connect large memory arrays, and these became standard for database servers in the 1990's.

Multi-Core Processors - 2000's, Paxville to Dunnington

By the mid-2000's, further increasing the complexity of the core, which in the past meant the entire microprocessor, was no longer feasible. Moore's Law was still expected to continue delivering increases in transistor density, though frequency scaling diminished significantly with the 90nm process in 2004. The path forward from here was multi-core processors. First came dual-core, then quad-core, followed by 6, 8, 10, 15, 18 and 24 cores. The current generation Intel Xeon SP has up to 28 cores.

In the shared bus era, the Intel product strategy tried to have silicon differences between the processors for 2-way and 4-way systems. The 2-way systems used desktop processors, initially in the same form factor, and then in a separate form factor from Pentium 4 onward. The 4-way systems used processors with a large cache, which improved scaling at the 4-way level. The die images of Intel large-die processors, scaled roughly proportionate to die size, show the logic-cache split evolving over several generations.

Between 2005-07, including the first few generations of multi-core, Intel tried to continue the strategy of silicon differences between 2S and 4S+. But some products were just two desktop die in a single MP processor socket for better time to market. In this period, the desktop processor cache size had grown to 2-3MB per core, which was adequate to support MP scaling.

The 90nm single-core Potomac (1M L2 and 8M L3) was succeeded in 2005 by 90nm Paxville, comprised of two single-core die with 2M L2 each. Then came 65nm Tulsa in 2006: a dual-core die, 1M L2 per core, 16M L3 shared, the last of the NetBurst architecture.

At 65nm, Tigerton (2007) was comprised of two dual-core die, each with 4M shared L2. The 45nm Dunnington (2008) had six cores, 3M L2 shared by each pair of cores, and 16M L3 shared by all on a single die; it was the last of the Core architecture.

While Intel still used a shared bus protocol in the 2006-09 period, the 4-way system had two busses with two processors per bus in the E8500/01 MCH for Potomac through Tulsa. The 7300 MCH had four busses, each with a single processor, for Tigerton and Dunnington.

Multi-Core Processors - 2010's, Nehalem to Skylake

The separate silicon between 2S and 4S product lines continued into the Nehalem and Westmere architectures. The Nehalem-EP and Westmere-EP processors for 2-way systems had 4 and 6 cores respectively. Nehalem-EX (aka Beckton) and Westmere-EX for 4S+ systems were large 8 and 10-core dies respectively, and connected to a scalable memory buffer (SMB) for extra-large memory capacity.

    [Figure: NehalemEX_SMB]

The presumption behind this product strategy was that general-purpose applications favored 2-socket (2S) systems having moderate processor cost ($200-1600) and intermediate memory configuration. Database transaction processing desired the high core-count large-die processor, meaning cost is no object, and preferred 4S systems with large memory.

The Xeon 7500 (Nehalem-EX) was the first Intel processor to support glue-less 8-way. Opteron had glue-less 8-way from the beginning in 2004. But the time for these systems was coming to an end.

Sandy Bridge (Xeon E5), 32nm, was a transitional product: a single 8-core die served both 2S and 4S systems. There was no Sandy Bridge option for 8-way systems.

The following three generations: Ivy Bridge 22nm, Haswell, and Broadwell, employed a new product strategy. The Xeon E5 line was 2S/4S having 2 QPI links, 40 PCI-E lanes, and connecting to memory without the SMB. The Xeon E7 line was 4S+ having 3 QPI links, 32 PCI-E lanes and connecting to memory via the SMB.

In each generation, there were 3 die layouts having low, medium and high core count (LCC, MCC and HCC). Ivy Bridge LCC was 6-core, MCC 10-core, and HCC 15-core. Haswell, also 22nm, LCC/MCC/HCC were 8, 12 and 18 cores respectively. Broadwell, 14nm, layouts were 10, 15 and 24-core. Presumably, the E5 and E7 versions had the same basic silicon but used different steppings, each with its own variant of the LGA-2011 package.

The shift in strategy beginning with Sandy Bridge (Xeon E5) acknowledged that some applications, HPC for example, wanted high core count without the SMB expander. This includes some databases as well.

After Haswell with up to 18-cores, system vendors Fujitsu and Lenovo stopped publishing TPC-E results for 8-way systems.

In this period, the 4-way became recognized as overkill for most workloads. This is evident in AWS EC2 dedicated hosts not continuing the 4-socket option past Xeon E7 v3 (Haswell). Most server demand is now for 2S systems. AWS EC2 does not offer single socket as a dedicated host option.

    [Figure: Broadwell_HCC_3]
Broadwell HCC die, 24-cores, 456.1mm2, 25.20 × 18.10 mm, 2016.

In the Skylake generation, Intel conceded that the extra-large memory capacity enabled with the SMB was no longer a key requirement. The multi-socket product lines now became Xeon SP with some models for 2S, others for 4S, and 4S+. There are sub-models denoted by the M-suffix that use the SMB, but it is not clear whether any system vendors implement this.

    [Figures: SkyLakeSP, Broadwell_HCC_3]
Skylake XCC die, 28-cores, 694mm2, 32.2 × 21.56 mm, 2017.

Curiously, however, people are not willing to let go of the concept of the server as inherently a multi-processor (multi-socket) system, even with high core-count processors available. In part, the popularity of virtual machines (VM) has been an enabler in continuing to justify the MP system practice.

Multi-Processor Non-Uniform Memory Access (NUMA)

Inherited legacy concepts brought us to the current system architecture. The modern multi-processor system with multi-core processors is a complex entity. That the processor itself is complicated is an unavoidable consequence of the transistor budget being on the order of 10B at the 14nm process. The question is then whether the complexity of a multi-processor configuration is a benefit or a liability. Leaving behind the SMB in the Skylake-based Xeon SP is the first step in removing unnecessary complexity.

    [Figure: MemLat_2S]

The weak point in systems today is memory latency for one category of applications, and memory bandwidth for another. These are different requirements that cannot be addressed with a single architecture. The discussion here pertains to the first category as bandwidth has been addressed with different strategies elsewhere.

The Skylake core is designed to run at 4GHz+, though it is throttled down to around 2.5GHz in the high core-count products to stay within the thermal envelope. Memory latency is on the order of 100ns, with local node latency around 90ns and remote node around 140ns. Average latency in a 2S system with 50/50 local-remote access is then 115ns. Even a small fraction of code in which the next step depends on a memory access completing results in the very large majority of CPU cycles being no-ops.
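To put these numbers in core-level terms, here is a back-of-the-envelope sketch in C using the figures above; the 2.5GHz clock and the 50/50 local-remote mix are the same assumptions made in the text:

    /* Rough stall-cycle arithmetic from the latency figures cited above. */
    #include <stdio.h>

    int main(void)
    {
        double core_ghz  = 2.5;     /* throttled high core-count frequency */
        double local_ns  = 90.0;    /* local node memory latency           */
        double remote_ns = 140.0;   /* remote node memory latency          */
        double avg_ns    = 0.5 * local_ns + 0.5 * remote_ns;  /* 50/50 mix */
        double cycle_ns  = 1.0 / core_ghz;                    /* 0.4 ns    */

        printf("cycle time       : %.2f ns\n", cycle_ns);
        printf("local miss       : %.0f cycles\n", local_ns / cycle_ns);
        printf("remote miss      : %.0f cycles\n", remote_ns / cycle_ns);
        printf("2S 50/50 average : %.1f ns = %.0f cycles\n",
               avg_ns, avg_ns / cycle_ns);
        return 0;
    }

At 2.5GHz, a local node miss costs roughly 225 cycles and a remote miss roughly 350, so a pointer-chasing code path spends the overwhelming majority of its cycles waiting.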

Ideally, software should be architected for NUMA, achieving a high degree of memory locality. For transaction processing, both the application and database must work with a common strategy. In practice, almost no one did this, and there is no inclination to re-architect existing software for NUMA.
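For illustration, the following is a minimal sketch of what locality-aware placement looks like at the lowest level, assuming the Linux libnuma API (link with -lnuma); the node number and buffer size are arbitrary, and a real transaction-processing stack would need the application and database to agree on a partitioning strategy:

    /* Minimal NUMA-aware allocation sketch (Linux libnuma, illustrative only). */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA not available on this system\n");
            return 1;
        }

        int node = 0;                        /* keep work and data on node 0 */
        size_t size = 64UL * 1024 * 1024;    /* 64 MB working set            */

        numa_run_on_node(node);              /* restrict this thread to node 0 */
        char *buf = numa_alloc_onnode(size, node);  /* memory from node 0      */
        if (buf == NULL) {
            fprintf(stderr, "numa_alloc_onnode failed\n");
            return 1;
        }

        /* ... all accesses to buf are now local-node ... */

        numa_free(buf, size);
        return 0;
    }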

Memory technology with lower latency than conventional DDR4 SDRAM already exists. It is more expensive because DDR4 is optimized for cost. Trading cost for latency does not have a strong benefit in an MP system because much of the latency is outside of the DRAM itself. Discarding the SMB was the first step. The next step in addressing the most serious problem of modern systems is thus letting go of the multi-processor.
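An illustrative calculation of why a faster memory product by itself buys little in an MP system; the split of the 90/140ns node latencies into a DRAM-array portion and a path/coherency portion below is an assumed breakdown, not a measurement:

    /* Assumed breakdown: 40ns DRAM array + 50ns local path  =  90ns local,
     *                    40ns DRAM array + 100ns remote path = 140ns remote. */
    #include <stdio.h>

    int main(void)
    {
        double dram_ns    = 40.0;    /* assumed DRAM array portion       */
        double local_ovh  = 50.0;    /* assumed local path overhead      */
        double remote_ovh = 100.0;   /* assumed remote path overhead     */
        double fast_dram  = 20.0;    /* hypothetical low-latency memory  */

        double avg_now  = 0.5 * (dram_ns + local_ovh) + 0.5 * (dram_ns + remote_ovh);
        double avg_fast = 0.5 * (fast_dram + local_ovh) + 0.5 * (fast_dram + remote_ovh);

        printf("2S average, standard DRAM     : %.0f ns\n", avg_now);   /* 115 */
        printf("2S average, half-latency DRAM : %.0f ns\n", avg_fast);  /*  95 */
        printf("improvement                   : %.0f%%\n",
               100.0 * (avg_now - avg_fast) / avg_now);                 /* ~17 */
        return 0;
    }

With this assumed split, halving the DRAM-array latency improves the 2S average by only about 17%, because most of the path is outside the memory product.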

Single Socket - All Memory Access Local

To accept the single socket system, we must be willing to give up half of the total cores available in a 2S system, and half the memory channels, meaning both bandwidth and capacity. The Skylake based Xeon SP has up to 28-cores, which should be capable of handling most workloads.

The cost of one 28-core processor is much higher than two 14-core processors. But cores are not equal between 1S and 2S systems for software that is sensitive to memory latency and not properly optimized for NUMA. In this circumstance, fewer cores are required in the 1S system. Given that software licensing is usually on a per-core basis and much higher than the hardware cost, the savings from requiring fewer cores outweigh any hardware differences.
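A purely illustrative per-core licensing comparison; the per-core license price is an assumed figure, and the 40% per-core advantage is taken from the single-socket discussion later in this article:

    /* Illustrative per-core license cost: 2S 2x14-core vs a smaller 1S core
     * count delivering the same throughput (assumed 40% per-core advantage). */
    #include <stdio.h>

    int main(void)
    {
        double license_per_core = 7000.0;   /* assumed per-core software cost */
        double perf_1s_per_core = 1.4;      /* assumed 1S per-core advantage  */

        int    cores_2s = 28;               /* 2 x 14-core processors         */
        double work     = cores_2s * 1.0;   /* 2S per-core performance = 1.0  */

        int cores_1s = (int)(work / perf_1s_per_core + 0.999);   /* round up  */

        printf("2S: %d cores licensed = $%.0f\n", cores_2s, cores_2s * license_per_core);
        printf("1S: %d cores licensed = $%.0f\n", cores_1s, cores_1s * license_per_core);
        return 0;
    }

With these assumptions, 20 cores on the 1S system match 28 cores on the 2S system, and the license savings dwarf the premium for the larger single processor.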

On the memory side, latency sensitive software does not stress bandwidth because most CPU-cycles are no-ops waiting for memory. The old rule for memory capacity was more is better. That was set when system memory capacity was in the 2-8MB days. Think VAX 11/780. Today, memory is hundreds of GBs for a single socket (6 or 12 × 64GB) and 1.5TB for 2S systems.

Note that 2S TPC-E results from Sandy Bridge (8-core) to Broadwell (22-core) were all (but one) at 512GB memory, with no apparent effect on tpsE/core. Also, even Intel realized that the SMB for expanding memory capacity had become overkill, and dropped it from Xeon SP except for specialty sub-models.

In all, there are no serious liabilities to the 1S system, and strong advantages.

Rather interesting is that memory latency in a single socket system with Xeon E5 processors appears to be about 75ns. (It is possible memory configuration is involved. What is the impact of RDIMM versus LRDIMM? 1 DIMM per channel versus 2 DPC?)

    [Figure: MemLat_1S]

At Intel Software Conference 2014 Brazil, the presentation by Leonardo Borges, Notes on NUMA Architecture, discusses memory access in a multi-processor system of either Westmere or Sandy Bridge vintage. For local memory access, on an L3 miss, a request is sent to the memory controller and a snoop is issued to the remote processors. Latency is the maximum of the two responses.

If the remote CPU snoop is quicker than local memory, it would imply that a multi-processor configuration does not negatively impact local node memory access. Yet reported data suggests memory latency is lower in single socket systems than for local node access in MP systems. It is possible this is the effect of RDIMM versus LRDIMM, or single versus multiple DIMMs per channel. Any definitive clarification would be appreciated.
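To make the described access sequence concrete, a toy model in C; the latency values are placeholders, and the remote snoop figure in particular is an assumption:

    /* Local-node access as described above: the memory request and the remote
     * snoop proceed in parallel; the load completes on the slower of the two. */
    #include <stdio.h>

    static double max2(double a, double b) { return a > b ? a : b; }

    int main(void)
    {
        double local_dram_ns   = 90.0;   /* local DRAM response            */
        double remote_snoop_ns = 75.0;   /* assumed remote snoop response  */

        double mp_local_ns = max2(local_dram_ns, remote_snoop_ns);  /* 2S   */
        double one_socket  = local_dram_ns;      /* 1S: no socket to snoop  */

        printf("2S local-node access : %.0f ns\n", mp_local_ns);
        printf("1S access            : %.0f ns\n", one_socket);
        return 0;
    }

Under this simple model the 1S and 2S local-node figures come out equal, which is why the lower measured single socket latency noted above remains an open question, with DIMM type and DIMMs per channel as plausible factors.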

The single socket system achieves lower memory latency on its own. Giving up 50% of the cores of a 2S system gives up only 30% of the performance, as each core in the 1S system has 40% better performance than the same core in the 2S system (0.5 × 1.4 = 0.7).

Summary

Modern server system strategies carry forward many practices from the long ago past. But the world has changed, and what was best practice is now obsolete in concept and a liability in the modern environment. Just as the RISC versus CISC debate of the 1980s resulted in a redirection, now is the time to do so once more. The concept that we should first let go of is that the server and multi-processor are inseparable. The next is memory. Big memory is not the priority anymore (though tiered memory might not be a bad idea). Memory latency is more important, even though this means going smaller. And this step cannot be taken until we give up the multi-processor.

 

 

Additional References

Hot Chips 7, 1995: Optimizing the P6 Pipeline, D. Papworth

Mark of Clemson, P6, and The Pentium Chronicles; also the book The Pentium Chronicles by Robert Colwell

Mikko H. Lipasti, University of Wisconsin-Madison Pentium Pro Case Study

Opteron at Hot Chips 14 (2002) The AMD Hammer Processor Core and AMD Opteron Shared Memory MP Systems

Wikipedia List of Xeon Processors

Wikichip Penryn die image for Dunnington

Memory Latency II

Unorganized preliminary thoughts: Notes and Ramblings