System Architecture Review: 1998 to 2016
A long, long time ago ...
The relational database was perhaps one of the strongest early drivers of multi-processor servers. On the Intel processor side, Sequent had a Symmetric Multi-processor (SMP) system built around the 80386. A few system vendors built chips to connect multiple 80486 and Pentium processors together. Intel then built their own chips to connect Pentium processors into a 4-way system that system vendors could OEM.
The 4-way Pentium systems were very popular for database servers even though they were expensive relative to single socket systems on account of the specialty components. It was also possible to build a 2-way system, but these had nearly the same cost structure as the 4-way and did not sell well.
Twenty years ago (1996) was the time of the Intel Pentium Pro processor. The Pentium Pro could support glue-less multi-processing up to 4-way. This means four processors could connect directly on the processor front-side bus to the stock core logic memory and IO controllers to form a 4-way symmetric multi-processor (SMP) system.
The reduction in system cost structure helped to further grow the 4-way server market, but the 2-way systems market exploded, having a cost structure only moderately higher than single-socket. These were popular for servers, and overnight, the workstation market was created. Many RISC processors eventually fell out of the server market and RISC was soon completely gone from the workstation market.
Pentium II and III Xeon, 1998-2001
Below are representations of the Intel 440BX chipset supporting a 2-way system for Pentium II and III processors, and the 450NX chipset (as best as I can reconstruct) supporting a 4-way system for the Pentium II Xeon processor.
The Intel 440BX was primarily a desktop chipset also used in 2-way servers and workstations. It was good enough as a workstation chipset in having the AGP port for high-performance graphics. It was only marginally adequate for servers, as a proper 2-way server would have more than 4 memory sites. At some later point, there was an AGP to PCI adapter chip to provide adequate IO for the server environment.
The 450NX chipset was specifically designed to the requirements of a database server. It could support two memory cards on one bus(?). Each memory card had 16 memory sites, but these were EDO when a 1998 product should have used SDRAM. Each of the two PCI expander bridges could connect to a single 64-bit PCI bus or two 32-bit PCI busses.
Intel did not take the 2-way server segment seriously. Otherwise, a proper chipset would have been commissioned. One reason for this might have been that the processor was the same Pentium II and III as in the desktop products. Their focus was almost entirely on the 4-way system, which employed the premium Pentium II Xeon processors. The 450NX should have been followed up with a version supporting SDRAM. However, at this point in time, their efforts were on the Itanium platform, processor and chipset.
In all, there was a wide chasm between the 2-way and 4-way systems built around Intel chipsets. It did not matter if a medium workload could be handled by two processors. Back then, we needed the 4-way system for the memory capacity, and sometimes for the IO as well.
There was an unusual combination of circumstances around this time. Intel was not certain as to whether they had sufficient fab capacity for the Pentium Pro and follow-on. So they made a cross-license agreement with another fab (Fujitsu?). The shortfall never occurred. However, this enabled a third company, ServerWorks (originally RCC), to design chipsets for Intel processors to be manufactured by the fab with the Intel cross-license.
Systems based on the ServerWorks HE-SL and HE chips are shown below. I believe the ServerWorks product line for Pentium II and III processors was the HE and LE chipset first, and then the HE-SL followed, but this was a long time ago.
The ServerWorks LE chipset resembled the Intel 440BX, updated for the 133MHz front-side bus (FSB) and having a 64-bit 66MHz PCI bus on the north-bridge instead of AGP. The HE was similar in function to the 450NX in having adequate plumbing to support a 4-way system, with 16 memory sites and SDRAM in place of the EDO memory on the 450NX. I am presuming that system vendor feedback prompted ServerWorks to then follow with the HE-SL to support a proper 2-way server. The ServerWorks HE-SL did much to make the 2-way a viable system for some database requirements.
Some vendors offered high-end servers, with a proprietary chipset, and a crossbar, to connect multiple nodes into a system with 16 or more processors having non-uniform memory access, as shown below.
Yes, the Windows Server operating system and SQL Server install on the above system. And yes, a SQL Server database will work on this system. But anyone who thinks that software blind to the structure of the system architecture will run well on this system is a naive fool. It was possible with herculean effort to produce a not bad TPC-C benchmark result for the 16-way system. The emphasis is on herculean effort.
For the usual marketing reasons, the critical techniques necessary to scale beyond the standard 4-way was passed on to people who bought such systems. Some SQL Server operations actually did run well on the 16-way. Some other operations ran not bad. A few operations tanked, and tanked hard. It was enough to be comparable to a severe service outage. Many times, it would be easy enough to work around the critical problems. Yet there was no well-documented repository of knowledge for each problem, how to diagnose it, and how to work around it.
Taking the above into account, the quantitative capacity planning aspect of server sizing was somewhat irrelevant. The correct answer for all but small requirements was a 4-way. If a 4-way did not have enough power, the advice was to work really hard to reduce your requirements, because venturing into the land of NUMA was fraught with peril, like going into a minefield without the map of where the mines were.
ProFusion 8-way Pentium III
In late 1999 or early 2000, Intel acquired Corollary, which had designed a chipset for an 8-way system, connecting two processor front-side busses, memory and IO components, as shown below.
In case anyone is curious, the ProFusion system actually had three identical busses, two for processors, and one for the PCI bridge chips. The center chips and memory components were designed by Corollary. If my memory has not completely faded, the PCI bridge chipsets were designed by Compaq, who had experience in this. But in the end, Intel acquired Corollary. For Intel, the ProFusion was a point product, launched for the Pentium III processor. There was no update for the upcoming Xeon (Pentium 4). HP/Compaq did their own chipset for an 8-way Xeon however.
Pentium 4 - Xeon
There was a period of several years in which Intel did not produce good chipsets for the Xeon and Xeon MP processors. In this time, many system vendors used the ServerWorks GC-LE and GC-HE chipsets. There were several reasons. One was the fiasco with Rambus memory. A second was the diversion that was Itanium. A third reason was that Intel somehow came to the collective idea that the server chipset needed to meet a ridiculously low price.
Fundamentally, a server chipset is about having fat pipes for big data movement. Fat pipes implies many pins on the package for signals, power and ground. Furthermore, a proper multi-processor server was going to require several components to form the overall system. So several components with many pins each is inherently going to be much more expensive than a desktop chipset. Consider that the server chipset ties together four premium processors that sell for $2500 or even $4000 each. Scrimping on the chipset cost may undermine the entire system.
The AMD Opteron came out in 2003, with very many strong points. The processor core was competitive, and it was matched with an integrated memory controller, which reduced latency and further boosted performance. Not only was this a 64-bit processor, but Opteron also extended the instruction set architecture from 8 to 16 general purpose registers, which Intel said could not be done. I do not recall what the virtual address was, presumably 40-48-bits, which was good enough to avoid the virtual address space exhaustion that was beginning to be a common problem in 32-bit operating systems.
Below is a representation of the 4-way Opteron system. A 2-way system would just remove the top two processors. The Opteron also adopted point-to-point signalling, Hyper-Transport in AMD terminology, which had many advantages over the shared bus in older processors.
The Opteron architecture, with 2 memory channels per processor, matched up very well in forming single-socket, 2-way and 4-way systems. It was also possible to build an 8-way system, although had this been a major objective, 4 HT links instead of 3 would have been better.
Yet another strength of the Opteron was that it had a fully pipelined floating-point unit. High-end Intel desktop processors had traditionally been sold into business user applications, which valued integer performance more than floating point. FP was mostly the realm of RISC/UNIX workstations. However, gaming was an emerging market that needed strong FP, and Intel was slow to recognize this. Intel did have FP in their SSE instructions, but good non-vector FP was important as well.
Intel Chipsets Return
By the time of Xeon 5000 and 7000 processors around 2005, Intel finally started to make a recovery in server/workstation chipsets. The E8500 4-way system is shown below. The chips in between the MCH and memory are eXternal Memory Bridges. The interface between MCH and XMB is Independent Memory Interface (IMI).
Intel had committed in late 2000 to the new quad-pumped bus for Pentium 4. There was a great deal of infrastructure investment for this, so they did not want to replace it after only a few years. The Intel Itanium 2 processor in 2003 had a double-pumped bus, but the node controller connected to a crossbar switch over a point-to-point protocol. But the Pentium 4 and Core 2 based Xeon processors still had a front-side bus. The nature of a shared bus is that the maximum frequency it can support decreases as the number of loads increases.
The solution adopted in the E8500 chipset was two independent busses. There are three loads on each bus, one for the MCH, and one each for the two processors. This allows each bus to operate at a frequency high enough to support the bandwidth needs of two dual-core processors. For practical purposes, this is essentially the ProFusion concept, this time for a total of 4 processors.
In 2006, two of the Core 2 architecture dual-core CPUs were put into a single package to form a quad-core processor, the Xeon 5300 series for 2-way and the Xeon 7300 series (2007) for 4-way. To support quad-core, it was necessary to further push up the bus frequency, so there was now only one processor per bus, as shown in the system diagrams for the 5400 and 7300 chipsets respectively.
The 5400 MCH finally gave an Intel chipset with good memory capacity at 16 DIMMs for the 2-way system.
It is curious that there was no memory expander device for the 7300 MCH, leaving it capable of supporting only 16 DIMMs, as in the 5400.
On re-reading the 7300 datasheet, it does support 32 fully buffered FB-DIMMs. It does not have the XMB between the MCH and plain registered DIMMs. Instead, each memory channel connects directly to 8 FB-DIMMs. The FB-DIMM has an Advanced Memory Buffer (AMB) that actually connects to the DRAM chips. Some MCH to FB-DIMM internals diagrams below.
With the advent of quad-core processors, and a very good core at that, it became apparent that the old rule of 4-way for database servers by default was getting ridiculous. Having 4 x 4 cores in a system was overkill for many medium and even medium-heavy workloads.
It could have been argued that 2 dual-core processors had enough compute power. But in transaction processing, quick response is also a desirable trait, and this is achieved with high core count. So the 2-way quad-core system became very popular, with only the heavy environments needing 4-way systems with 16 or more cores.
Nehalem Xeon 5500 and 7500
With Nehalem, Intel finally integrated the memory controller and replaced the bus with a point-to-point protocol (QPI). There was a quad-core die (2009) for 2-way servers, and an 8-core die (2010) for 4-way servers. (Desktops were on a separate die with integrated PCI-E and no QPI). See Anandtech Nehalem .... The 2-way system with Xeon 5500 series processors is shown below.
The Nehalem core was better than the core of the previous generation, and the integrated memory controller provided an additional boost. Nehalem also brought back an improved Hyper-Threading, effectively doubling database transaction processing performance by itself.
And the 4-way system with Xeon 7500 series processors is shown below. Note that each processor connects directly to all three remote processors. There are no 2-hop nodes.
The 4-way Nehalem system was an even bigger jump over the previous generation 4-way than the 2-way Nehalem over its predecessor. The complete 4-way system now has 16 memory channels over 4 channels in the previous 4-way and 64 memory sites over 16.
It is possible to build a glueless 8-way system from Nehalem-EX. Having four QPI links, three of which are used to connect to remote processors, allows the 8-way system to be built with fewer 2-hop nodes and no 3-hop nodes.
Westmere was the 32-nm process improvement over Nehalem at 45nm. The 2-way Westmere Xeon 5600 series had 6 cores and the 4-way+ Westmere EX Xeon E7 had 10 cores.
Sandy Bridge EP
In a sense, the new architecture on 32nm Sandy Bridge in 2012 was a refinement of Nehalem at the system level. PCI-E was brought on die, eliminating the need for a QPI link to an IOH. There are 8 cores, 4 memory channels, 2 QPI links and 40 PCI-E lanes. Codename Sandy Bridge EP became the product Xeon E5. It is more a replacement for Westmere-EP (Xeon 5600 series), but also overlaps into the EX sphere. In a 2-way system, both QPI links connect to the remote processor as shown below.
However, it is possible to build a 4-way system with the Xeon E5-4600 series. I am guessing that the difference between the Xeon E5 2600-series which supports 2-way and the 4600-series which supports 4-way has to do with how much of the snoop filter is enabled.
The difference between the Xeon E5 and the previous generation 4-way system is that Sandy Bridge connects directly to 2 other nodes instead of 3. One remote node is 2 hops away. A lesser difference is that there is no SMB between the memory controller and memory.
This represents a significant strategy shift on Intel's part. Previously, Intel premium processors with $1K-4K price tags were meant for 4-way systems. It was always possible to buy a 4-way capable system with just 2 sockets populated. But this did not make sense because the premium processors were usually differentiated by a large cache that only made a difference in scaling beyond 2 processors.
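The hop counts described above can be checked with a small sketch. It assumes the two QPI links per socket in the 4-way Xeon E5 are wired as a ring, which is what "connects directly to 2 other nodes" with one node 2 hops away implies. The socket numbering and the helper name `hop_counts` are hypothetical, chosen only for illustration.

```python
from collections import deque

def hop_counts(links, start):
    """Breadth-first search: minimum number of hops from start to each socket."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in links[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

# Xeon 7500 style 4-way: every socket links directly to the other three.
fully_connected = {s: [t for t in range(4) if t != s] for s in range(4)}

# Xeon E5-4600 style 4-way: two links per socket, assumed to form a ring.
ring = {s: [(s - 1) % 4, (s + 1) % 4] for s in range(4)}

for name, links in (("fully connected", fully_connected), ("ring", ring)):
    hops = hop_counts(links, 0)
    # Hops from socket 0 to sockets 1, 2 and 3.
    print(name, [hops[s] for s in range(1, 4)])
```

In the fully connected case every remote socket is one hop away. In the ring, the socket diagonally opposite is two hops away, so memory accesses to that node cross an intermediate processor.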
Now with many cores on one die possible, a premium processor distinguished with high core count made sense even in 2-way systems.
What was not realized was that the same dynamic that occurred with quad-core processors back in 2006 had now occurred with Sandy Bridge. Just as the 2-way system with 8 cores total had sufficient compute, memory and IO capability, the single socket system now also had powerful compute, memory and IO.
The recent Intel processor core is probably 20 times more powerful than the Pentium Pro at 200MHz in transaction processing, 40 times with Hyper-Threading enabled. A single 8-core Sandy-Bridge is then 80 times more powerful than the 4-way Pentium Pro.
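The arithmetic behind the 80 times figure can be written out as a rough sketch. The per-core factors are the estimates from the paragraph above. Linear scaling with core count is a simplifying assumption made here, and the function name is hypothetical.

```python
# Estimates taken from the text; linear scaling is a simplifying assumption.
PER_CORE_SPEEDUP = 20      # recent core vs. one 200MHz Pentium Pro
HYPER_THREADING_GAIN = 2   # 20x becomes 40x with Hyper-Threading enabled

def pentium_pro_equivalents(cores, per_core_speedup=1, ht_gain=1):
    """Throughput in units of one Pentium Pro processor, assuming linear scaling."""
    return cores * per_core_speedup * ht_gain

sandy_bridge = pentium_pro_equivalents(8, PER_CORE_SPEEDUP, HYPER_THREADING_GAIN)
pentium_pro_4way = pentium_pro_equivalents(4)
ratio = sandy_bridge / pentium_pro_4way

print(sandy_bridge, pentium_pro_4way, ratio)
```

Eight cores at 40 times each give 320 Pentium Pro equivalents, against 4 for the 4-way Pentium Pro, hence the ratio of 80.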
It is true that a single Nehalem EX with 8-cores is almost as capable as Sandy Bridge with 8-cores, but most system vendors did not support the 4-way capable Xeon 7500 or E7 system in a single socket populated configuration. (Two sockets populated is supported). So the powerful single socket became possible with Sandy-Bridge.
And yet there does not seem to be much acceptance of the single-socket concept. It is possible to buy a system with 2 Xeon E5 sockets with 1 populated. A purpose-built single-socket system would have a slightly lower cost structure and perhaps a more compact volume.
Ivy Bridge, Haswell, Broadwell
There was no Sandy Bridge EX for the Xeon E7 series. Ivy Bridge 22nm had both a Xeon E5 v2 2600-series in late 2013 and Xeon E7 v2 in early 2014. There was also Xeon E5 v2-4600 series for 4-way in 2014. Ivy Bridge E came in three models, LCC, MCC and HCC as shown below, with 6, 10 and probably 15 cores.
Intel put out a functional diagram showing the HCC model with 3 columns by 4 rows for 12 cores. The Xeon E5 v2 series is limited to 12 cores. But the Xeon E7 v2 series have up to 15 cores, so I am inclined to believe that there is just one HCC that has 15 cores, and the 12-core just has 3 cores disabled. Presumably the corresponding E7 and E5 products are the same, with one QPI link disabled in the E5 and the number of sockets supported controlled by the snoop filter setting.
Below are the Haswell LCC, MCC and HCC layouts with 8, 12 and 18 cores respectively. The LCC and MCC die have 4 rows, as do the first three columns of the HCC die. There are 6 rows in the last column of the HCC.
WCCFTech mentions Haswell cluster on-die mode? Intel Xeon E5-2600 V3. Both Haswell and its successor came in both E5 and E7 versions, as with Ivy Bridge.
Below are the Broadwell LCC, MCC and HCC layouts with 10, 15 and 24 cores respectively.
The 4-way E7 system is represented below. It is possible to have 3 DIMMs on each channel instead of the 2 shown.
Broadwell EP/EX introduced a number of cache coherency options, each of which has a different optimization orientation. This is a deeper topic than I can address. See The Intel Xeon E5 v4 Review.
Also search the terms Snoop, Directory, Cache, Coherence. See Frank Denneman NUMA Deep Dive Part 3: Cache Coherency
Sky Lake Coming soon?
Earlier this year (2016-09-09), WCCFTech Intel Skylake Xeon V5 said 28 cores. More recently, The Motley Fool mentioned that someone listed the Xeon V5 as LGA3647 with 32 cores on Taobao, a Chinese eBay?