
System Architecture History

The architecture of Intel 4-way (socket) systems dates to around 1996 with the Pentium Pro 4-way system. From 1998 to 2003, 4-way systems looked more or less as shown below, with four processors and a memory controller (or north bridge) sharing one system bus.

Generic 4-way system

The original Pentium Pro chipset actually had a separate memory controller and one or two PCI bridge chips, for a total of 7 devices on the system bus instead of the 5 with the integrated memory and I/O hub shown above. At the time, it was a very effective way to build a 4-way system. Performance scaling was very good for transaction processing database workloads. The cost structure was competitive, with base model pricing around $15K in 1996, dropping to $6K by 1999. The individual chips in this design could all be built at a relatively economical pin count, about 500-700 pins (for the time).

It could be argued that the four processors sharing one bus would cause a performance bottleneck, but this was not necessarily the case during that era. A properly sized cache was sufficient to keep bus traffic low. Because all processors share one bus, each can see all memory traffic from the other processors, so maintaining cache coherency was relatively simple.

Point-to-Point versus Shared Bus Architecture

In the late 1990s, the technology for point-to-point signaling became sufficiently well developed. The objective in a multi-processor system is to move as much information from one location (chip) to another as quickly as possible. The main limitation is the number of pins that can be built into a socket at a given cost structure. Point-to-point can route the most data per pin with lower latency, and hence has the advantage in the key criteria that justify replacing the shared bus architecture.

The 870 Chipset

The point-to-point protocol can operate at a much higher signaling rate than the shared bus architecture. The shared bus with 5 loads (4 processors plus MCH) topped out at 400MT/s, reached 800MT/s at 3 loads (2 processors plus MCH), and 1600MT/s with 2 loads (processor and MCH). The point-to-point architecture initially operated in the 1.6GT/s range and is now at 6.4GT/s. Point-to-point not only operates at higher transfer rates, but is also a more efficient protocol than the shared bus. Because the bus is shared, a device must first arbitrate for control of the bus before sending a request. The point-to-point architecture mandates exactly two devices per link, one at each end, so a request can be sent without any preliminary handshaking.
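The pin-efficiency argument above can be sketched with rough numbers. This is a back-of-envelope comparison, not from the source: the ~150-signal FSB count and 84-signal QPI link count are taken from later in this article, and the 16-data-bit-per-direction QPI assumption is approximate.

```python
# Rough bandwidth-per-signal comparison of the shared FSB versus a
# point-to-point (QPI-style) link. Signal counts are approximate.

def bus_bandwidth_gbs(transfer_rate_mts, data_bits):
    """Peak data bandwidth in GB/s for a parallel interface."""
    return transfer_rate_mts * (data_bits / 8) / 1000

# Shared FSB at 2 loads: 1600 MT/s on a 64-bit data bus, ~150 signals total
fsb_bw = bus_bandwidth_gbs(1600, 64)        # 12.8 GB/s peak
fsb_per_signal = fsb_bw / 150

# Point-to-point: 6.4 GT/s, ~16 data bits per direction, both directions,
# over ~84 signals
qpi_bw = bus_bandwidth_gbs(6400, 16) * 2    # 25.6 GB/s both directions
qpi_per_signal = qpi_bw / 84

print(f"FSB: {fsb_bw:.1f} GB/s, {fsb_per_signal:.3f} GB/s per signal")
print(f"QPI: {qpi_bw:.1f} GB/s, {qpi_per_signal:.3f} GB/s per signal")
```

Even with these rough assumptions, the point-to-point link moves several times more data per signal pin, which is the key criterion when the socket pin count sets the cost structure.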

The first Intel system to implement point-to-point signaling was the 870 chipset for the Itanium 2, under the name Scalability Port (SP). The Itanium 2 processor itself still used a shared bus. The 870 components used SP to link the MCH, IO hub (SIOH) and SPS cross-bar (see the Intel E8870 System Chipset whitepaper, and the Hotchips presentation The Intel 870 Family of Enterprise Chipsets). The AMD Opteron was designed from the beginning for point-to-point, under the name HyperTransport (HT).

At the time of these developments, Intel had just made a commitment to the second generation of the Pentium Pro bus architecture for the Pentium 4 (NetBurst) processor. So it was decided to leverage that infrastructure for the next several years, rather than transition to a new, better technology in the early 2000s. The generation of Intel processors built around the Nehalem architecture has point-to-point signaling under the name QuickPath Interconnect (QPI), starting in late 2008 and 2009. (Side note: for the past decade, new processor architectures appeared first in desktop or two-way servers, and then appeared in 4-way servers approximately one year later.)

Pentium Pro to Core 2

The Intel Pentium Pro, Pentium 4 (NetBurst) and Core 2 processors have a 64-bit data bus and a 36-40 bit address bus. The architecture is a split transaction bus with deferred reply, meaning multiple address commands can be sent before the first request is completed. In the original Pentium Pro, the bus clock, address and data rates were all the same. From NetBurst on, the address bus is double-pumped and the data bus is quad-pumped. The quoted mega-transfer rate (MT/s) is usually for the data bus. In point-to-point, a full link is typically 16-20 bits wide, and is used to send both address and data, so the bandwidths quoted for bus and point-to-point are not exactly comparable.
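The double- and quad-pumped relationship above can be made concrete: a small illustrative calculation (not from the source) mapping base bus clocks to the quoted address rate, data rate, and peak data bandwidth of the 64-bit bus. The sample clock values are typical NetBurst/Core 2 era figures, with 266 MHz rounded from 266.67.

```python
# From NetBurst on: address bus double-pumped, data bus quad-pumped.
# The quoted MT/s figure is the data-bus rate.

def fsb_rates(base_clock_mhz):
    address_mts = base_clock_mhz * 2       # double-pumped address bus
    data_mts = base_clock_mhz * 4          # quad-pumped data bus
    data_bw_gbs = data_mts * 8 / 1000      # 64-bit (8-byte) data bus
    return address_mts, data_mts, data_bw_gbs

for clock in (100, 200, 266, 400):         # e.g. 400 MHz -> "1600 MT/s" FSB
    addr, data, bw = fsb_rates(clock)
    print(f"{clock} MHz clock -> {data} MT/s data, {addr} MT/s address, {bw:.1f} GB/s")
```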

Shared Bus Evolutionary End Point

Microprocessor performance progresses at the pace of Moore's Law, roughly a 40% increase in performance per year. The shared bus architecture was introduced when processor frequency was 133-166MHz on a 66MHz bus. When single-core processors reached 3GHz, and with dual-core processors on the horizon, the 5-load system bus had been pushed to a 400MT/s transfer rate. But even this was inadequate to handle eight cores on one bus, even with a large cache. The band-aid solution in 2005 was the E8500 chipset with its dual independent bus (DIB) design.

E8500 Architecture

The following year, the E8501 increased the FSB transfer rate to 800MHz, but left the Independent Memory Interface (IMI) at 5.3GB/sec (Intel E8501 Chipset North Bridge).

E8501 Architecture

Since the front-side bus with four processors could not be pushed beyond 400MHz with 5 (socket) loads, the E8500 chipset was designed with two front-side busses. This arrangement was previously used for the 8-way system (ProFusion chipset) for the Pentium III processor, with four processors on each of two 100MHz front-side busses. The E8500 MCH required 1432 pins to connect the two processor busses, four memory channels and 28 PCI-E lanes, which, by that time, could fit within the existing 4-way system cost structure.

The final evolution of an Intel multi-processor system based on bus architecture is the 7300 chipset. The MCH now has four front-side busses, with a single processor socket on each bus. The 7300 MCH requires 2013 pins for all the signals, power and grounds necessary to support this architecture. (Intel 7300 Chipset Memory Controller Hub (MCH) datasheet).


7300 architecture

There are approximately 150 signals for the front-side bus, with 64-bit data, a 40-bit address (excluding the lower 4 bits) and a few dozen or so control signals. Each FB-DIMM memory channel has 24 signals. Each PCI-E lane has 4 signals, one differential pair (positive and negative) each for transmit and receive, or 128 signals total for 32 lanes. In addition to the signals, a very large number of pins are required for power and ground.
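The signal budget above can be tallied against the 7300 MCH's 2013 pins. This is a rough sketch using only the approximate per-interface counts given in the text; the remainder is power, ground and miscellaneous pins.

```python
# Approximate signal-pin budget for the 7300 MCH, using the per-interface
# counts from the text. The balance of the 2013 pins is power/ground.

fsb_signals_per_bus = 150        # 64 data + 40 address + control (approx.)
fbd_signals_per_channel = 24     # FB-DIMM channel
pcie_signals_per_lane = 4        # one differential pair each for Tx and Rx

signals = (4 * fsb_signals_per_bus        # four independent front-side busses
           + 4 * fbd_signals_per_channel  # four FB-DIMM channels
           + 32 * pcie_signals_per_lane)  # 28 PCI-E lanes plus the x4 ESI

total_pins = 2013
print(signals, total_pins - signals)
```

Roughly 800 of the 2013 pins carry signals on this tally, which is consistent with the text's point that power and ground consume a very large share of the package.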

In other Intel chipsets with one processor and one MCH, it was possible to push the FSB to 1600MHz, but the 7300 FSB is set to 1066MHz. The Xeon 5000 series and below used the second generation 771 pin socket, while the Xeon 7000 series retained the original 604 pin socket.

While it may seem that the direct processor to MCH connection is similar diagrammatically to the point-to-point design, it still uses the shared bus protocol, which cannot be changed without a substantial redesign. In contrast, each 20-bit QPI link requires 84 signals, with 20 differential pairs in each direction. The number of pins is reduced by about a factor of 2, and the transfer rate increases from 1.6GT/s for the shared bus with 2 loads to 6.4GT/s for point-to-point.
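The 84-signal figure can be accounted for as follows. This breakdown is a plausible reconstruction, not from the source: 20 data lanes times two wires per differential pair, in both directions, plus a forwarded clock pair per direction.

```python
# Accounting for the 84 signals of a 20-bit QPI link (reconstruction):
# differential signaling means 2 wires per lane, links run in both
# directions, and each direction carries a forwarded clock pair.

lanes = 20
wires_per_pair = 2       # differential signaling
directions = 2
clock_pairs = 2          # assumed: one forwarded clock pair per direction

qpi_signals = lanes * wires_per_pair * directions + clock_pairs * wires_per_pair
print(qpi_signals)
```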

From the E8501 to the 7300, the front-side bus frequency increased from 800MHz to 1.066GHz, and the number of busses doubled. However, the number of memory channels remained unchanged at 4, each with 5.3GB/s bandwidth. Memory performance is characterized not just by the raw bandwidth, but also by the number of independent memory channels, which determines the ability to drive memory transactions, and of course latency.

The E8501 MCH was designed to support four dual core Xeon 7000 or 7100 series processors, which are based on the Pentium 4 micro-architecture, for a total of 8 cores (disregarding the Hyper-Threading feature, which shows 16 logical processors). The 7300 MCH was designed for four quad-core Xeon 7300 series and six-core Xeon 7400 series processors based on the Core2 micro-architecture, for a total of 24 cores.

The Core 2 architecture at 2.66-2.93GHz is substantially more powerful than the Pentium 4 architecture at 3.4GHz at the processor core level, so a strong argument could be made that the 7300 chipset would really benefit from substantially more memory bandwidth and channels. It is very likely that the 7300's 2000+ pins are the most that could be put into a single device without a very large increase in complexity and cost structure. (The IBM x3950M2 has a proprietary chipset with 8 memory channels, 4.26GB/s read and 2.13GB/s write per channel.)

Memory Controller

The E8500/1 chipset uses Independent Memory Interface (IMI) to connect the MCH to an External Memory Bridge (XMB) which in turn connects to two DDR memory channels at 266 or 333MHz or DDR2 at 400MHz. Below is a diagram from an Intel datasheet for the memory controller.

FB DIMM

The IMI bandwidth reading from memory (inbound) is 5.3GB/s, and 2.67GB/s writing to memory (outbound). The 7300 implements Fully Buffered DIMM (FBD) channels to FB-DIMM memory. FB-DIMM in turn has an Advanced Memory Buffer chip along with DDR2 memory at 667MHz.

SDRAM, including DDRx, is usually specified in terms of the read data rate. For 1066MHz DDR2 chips formed into a 64-bit (8-byte) wide DIMM, the data read rate is 8.5GB/sec. The data write rate is one-half the read rate, at 4.3GB/sec.

Both the E8501 and 7300 support simultaneous 5.3GB/s inbound and 2.67GB/s outbound per memory channel, for a combined memory bandwidth of 21.3GB/s read and 10.7GB/s write. Both IMI and FBD memory channels signal at 6X the memory chip frequency.
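The per-channel and aggregate figures above follow from the DDR2-667 data rate. A quick check, with the 6X multiplier applied to the 667MT/s data rate as an assumption about what "memory chip frequency" refers to here:

```python
# Aggregate memory bandwidth for four IMI/FBD channels: each channel is
# 5.3 GB/s inbound (read) and half that outbound (write), driven by
# DDR2-667 (667 MT/s x 8 bytes).

channels = 4
inbound_per_channel = 667e6 * 8 / 1e9       # ~5.3 GB/s read per channel
outbound_per_channel = inbound_per_channel / 2

read_bw = channels * inbound_per_channel    # ~21.3 GB/s combined read
write_bw = channels * outbound_per_channel  # ~10.7 GB/s combined write
lane_rate = 6 * 667e6                       # ~4.0 GT/s channel signaling rate

print(f"read {read_bw:.1f} GB/s, write {write_bw:.1f} GB/s, "
      f"lane rate {lane_rate/1e9:.1f} GT/s")
```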

PCI Express

The IO capability is nominally improved from 28 PCI-E lanes plus the 266MB/s HI on the E8500 to effectively 32 lanes on the 7300. The 7300 MCH has 28 PCI-E (Gen 1) lanes plus the ESI, which is effectively a x4 PCI-E port with additional functionality to support legacy south bridge devices. The normal PCI-E lanes can be arranged in any combination of x16, x8 or x4 ports. (Intel PCI Express site)

MCH PCI Express

Snoop Filter

As mentioned earlier, the single shared bus was a simple but effective design for the time because each processor could monitor all bus traffic. With multiple processor busses, some mechanism is desired or necessary to help maintain cache coherency. The Intel chipsets with multiple independent busses use a Snoop Filter. (Large systems may use a directory-based system.) The diagram below is from the Intel 7300 MCH datasheet. A snoop filter maintains a list of address tags and state for all processor caches, but not the contents of the caches themselves. The snoop filter reduces the need to snoop other busses for the contents of the processor caches.

Snoop Filter

The 5000P/X chipset for 2-way systems has dual independent busses (DIB) and a snoop filter. Some benchmarks showed a positive gain with the snoop filter enabled. Other benchmarks showed degradation, so some consideration may be given to testing with the snoop filter enabled and disabled (Dell Power Solutions Configuring Ninth-Generation Dell PowerEdge Servers). Multiple independent busses reduce the number of electrical loads on a bus, allowing the bus to operate at a much higher clock rate, but it is not at all clear that the combined effective bandwidth is a simple sum of the individual bandwidths. There are definitely questions regarding the effectiveness of a snoop filter in 2-way systems. No such concerns have been brought to light concerning 4-way systems, though the absence of clear information should not be construed as indicating whether or not there are technical issues with the snoop filter.

In summary, the shared bus was effective at the time of its introduction. It was improved and then band-aided as much as possible. The 7300 MCH still has some liabilities of the shared bus, and does not realize the full benefits of point-to-point. Part of the problem is that the MCH must support multiple processor busses, memory channels and PCI-E. And it really does not help that the processor bus does not use the most pin-efficient signaling. In the AMD Opteron architecture, the processor not only uses the more pin-efficient signaling, it also integrates the memory controller. Hence the memory signals are distributed over multiple processors. In the Intel 7300 chipset, the MCH has 2013 pins while there are only 603 on the processor. This is the end of the road for the shared bus, not only because the replacement architecture has finally arrived, but also because nothing more can be done to keep pace with each new generation of processor performance requirements.

Core 2 2-socket Systems

TBA

5000P

5000P

Itanium 2 System Architecture

Intel 870 Chipset

Intel E8870 chipset for 16-way Itanium 2 System