Systems Architecture

Intel Nehalem-based systems with QPI

The first Nehalem architecture processor release was the Core i7 for single-socket desktop systems in November 2008. The Xeon 5500 series followed in April 2009 for 2-way server systems. Both have 4 physical cores and 3 DDR3 memory channels. The Core i7 has a single QPI link and the Xeon 5500 has two. The 2-way Xeon system can be configured with a single IOH or with two IOHs as shown below. With one IOH, both processors connect directly to the IOH. Each IOH has two QPI ports and 36 PCI-E Gen 2 lanes plus the ESI. With two IOHs, each processor connects directly to one IOH, and the two IOHs are also directly connected to each other.

QPI 1 IOH

QPI 2 IOH

Notice that there are 1366 pins on the Nehalem processor versus 603 or 771 for the Core 2 based Xeon processors. The number of pins on the IOH, now without memory channels, is reduced to a more economical 1295, from the 2013 pins on the 7300 MCH for the 4-socket Core 2 system.

The QPI has a bandwidth of 12.8GB/s in each direction simultaneously, for a combined bi-directional bandwidth of 25.6GB/s. Each PCI-E Gen 2 lane operates at 5Gbit/sec for a net bandwidth of 500MB/s per lane, per direction. A x4 PCI-E Gen 2 channel is rated at 2GB/s per direction, and a x8 channel at 4GB/s per direction. So while the 36 PCI-E Gen 2 lanes on the 5520 IOH are nominally rated for 18GB/s per direction, the maximum bandwidth per QPI link is still 12.8GB/s per direction. Still, a dual-IOH system would have a nominal IO bandwidth of 25.6GB/s per direction. It would be very interesting to see what actual bandwidth, disk and network, the Xeon 5500 system can sustain.
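To keep the arithmetic above straight, here is a minimal Python sketch of the nominal numbers, assuming the 6.4 GT/s QPI data rate (2 bytes wide per direction) and 8b/10b encoding on PCI-E Gen 2; actual sustained throughput will be lower.

# Nominal bandwidth arithmetic for QPI and PCI-E Gen 2 (per direction).
# Assumes 6.4 GT/s QPI, 2 bytes per transfer per direction, and 8b/10b
# encoding on PCI-E Gen 2.

QPI_GT_PER_SEC = 6.4          # giga-transfers per second
QPI_BYTES_PER_TRANSFER = 2    # 16 data bits per direction
qpi_per_direction = QPI_GT_PER_SEC * QPI_BYTES_PER_TRANSFER   # 12.8 GB/s
qpi_bidirectional = 2 * qpi_per_direction                     # 25.6 GB/s

PCIE2_GBIT_PER_LANE = 5.0     # raw signaling rate
PCIE2_ENCODING = 8 / 10       # 8b/10b encoding overhead
lane_gb_per_sec = PCIE2_GBIT_PER_LANE * PCIE2_ENCODING / 8    # 0.5 GB/s per lane

x4_slot = 4 * lane_gb_per_sec        # 2 GB/s per direction
x8_slot = 8 * lane_gb_per_sec        # 4 GB/s per direction
ioh_36_lanes = 36 * lane_gb_per_sec  # 18 GB/s nominal, capped by one 12.8 GB/s QPI

print(qpi_per_direction, qpi_bidirectional, x4_slot, x8_slot, ioh_36_lanes)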

5520 IOH

The 5520 is one of the IOH options for the Xeon 5500 series processors. The diagram below shows a 2-way Xeon 5500 system with a single 5520 IOH. With a single 5520 IOH, it is possible to configure 4 x8 and 1 x4 PCI-E Gen 2 slots, plus one x4 PCI-E Gen 1 channel off the ICH via the ESI port.

Xeon5500_1IOH

The diagram below shows a 2-way Xeon 5500 system with two 5520 IOH devices. This system could have 8 x8 and 2 x4 PCI-E Gen 2 slots, plus 1 x4 Gen 1.

Xeon5500_2IOH
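The slot counts in the two configurations above follow directly from the 36-lane budget per 5520 IOH. A small sketch to verify the lane math; the slot mixes are the ones described in the text, not the only possible layouts.

# Sanity check of the slot configurations against the 36-lane budget
# per 5520 IOH.

def lanes_used(slots):
    """slots: list of (lane_width, count) tuples."""
    return sum(width * count for width, count in slots)

single_ioh = [(8, 4), (4, 1)]            # 4 x8 + 1 x4 PCI-E Gen 2 = 36 lanes
dual_ioh   = [(8, 8), (4, 2)]            # 8 x8 + 2 x4 PCI-E Gen 2 = 72 lanes

assert lanes_used(single_ioh) == 36      # one IOH
assert lanes_used(dual_ioh)   == 2 * 36  # two IOHs
# The x4 PCI-E Gen 1 channel hangs off the ICH via ESI and is not
# counted against the IOH lane budget.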

Below is a representation of a 2-way Xeon 5600 (Westmere) 6-core system with two IOHs.

Xeon 7500 and E7 Series

The 4-way Nehalem architecture (which might be a Dell R910) was scheduled for release in late 2009; the actual release was 2010 Q2. The 45nm Nehalem-EX was followed by the 32nm Westmere-EX, with 10 cores per socket, in 2011 Q2.

Each Xeon 7500 (Nehalem-EX) processor socket can have up to 8 physical cores (some models with 4 or 6 cores), Hyper-Threading (HT, or 2 logical processors per core), up to 24M L3 cache, 4 memory channels, and 4 QPI links. The architecture of the 4-way system has each processor (socket) directly connected to all three other processor sockets. The remaining QPI link connects to an IO hub. Another difference relative to the AMD Opteron system is that the Intel IO hub has two QPI ports, each of which connects to a processor.
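A minimal sketch of the QPI port budget in this 4-way topology; the pairing of sockets to IOHs is illustrative, not taken from a specific system design.

# 4-way Xeon 7500 QPI topology: each socket uses 3 of its 4 QPI links
# for the other sockets and the remaining link for an IO hub, so every
# remote socket is one hop away.

from itertools import combinations

sockets = ["S0", "S1", "S2", "S3"]
iohs = {"S0": "IOH0", "S1": "IOH0", "S2": "IOH1", "S3": "IOH1"}  # illustrative pairing

links = [frozenset(pair) for pair in combinations(sockets, 2)]   # full mesh: 6 links
links += [frozenset((s, ioh)) for s, ioh in iohs.items()]        # 4 socket-to-IOH links

def qpi_ports(node):
    return sum(1 for link in links if node in link)

assert all(qpi_ports(s) == 4 for s in sockets)              # 3 peer links + 1 IOH link
assert all(qpi_ports(i) == 2 for i in set(iohs.values()))   # each IOH has 2 QPI ports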

Westmere-EX adopts the new E7-xxxx naming scheme. Below is a 4-way Westmere-EX with 10 cores.

Below is an 8-way Westmere-EX with 10 cores.

Each full QPI link can be split into two half-wide links. The Nehalem-EX, with four full QPI links, can support a glue-less 8-way system architecture. Glue-less means that no other silicon is necessary to connect the processors together. Each processor connects directly to all seven other processors with a half-wide QPI link, and uses the remaining half-wide QPI link to connect to the IOH. The HyperTransport Consortium describes this arrangement for a future Opteron system with 4 full HT links per processor (HT_General_Overview). The actual 8-way system architectures described to date do employ half-wide links, for both current Opteron systems and the forthcoming Nehalem-EX system.
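The half-wide link budget works out exactly; a small sketch, assuming the 4-full/8-half link split described above.

# Link budget for the 8-way glue-less configuration: each of the 4 full
# QPI links splits into 2 half-wide links, giving 8 half-wide links per
# socket -- 7 for the other sockets plus 1 for an IOH.

FULL_QPI_LINKS = 4
HALF_LINKS_PER_FULL = 2
SOCKETS = 8

half_links = FULL_QPI_LINKS * HALF_LINKS_PER_FULL   # 8 half-wide links
peer_links_needed = SOCKETS - 1                     # direct link to every other socket
ioh_links = half_links - peer_links_needed          # what is left for IO

assert half_links == 8
assert ioh_links == 1   # exactly one half-wide link remains for the IOH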

Intel has a separate datasheet for an IOH for the Xeon 7500 and Itanium 9300 series processors. Both the 5520 and 7500 series IOHs appear to have the same functionality in terms of QPI uplinks, PCI-E IO channels and legacy IO. So any differences might have to do with internal characteristics.

Xeon7500_IOH

The diagram below shows a 4-way Xeon 7500 system with two IOH devices.

Xeon7500_topology

The diagram below shows a 2-way Xeon 6500 system with two IOH devices. Notice that there are 2 QPI links between the processors. It is unclear whether this configuration has any performance advantage over a system with a single QPI link between processors.

Xeon6500_topology

The diagram below shows an 8-way glueless Xeon 7500 system. Glueless means that there are no silicon components connecting the processors.

8-socket_glueless

EightSocketGlueless

Below is the Xeon 7500 processor block diagram.

Xeon 7500 BlockDiagram

Excerpts from the Intel Xeon 7500 Processor datasheet:

The Intel Xeon processor 7500 series consists of eight cores connected to a shared, 24-MB inclusive, 24-way set-associative Last-Level Cache (LLC) by a high-bandwidth interconnect. The cores and shared LLC are connected via caching agents (Cbox) and the system interface (Sbox) to the Intel QuickPath Interconnect router (Rbox), the on-chip Intel QuickPath Interconnect home agents and memory controllers (Bboxes + Mboxes), and the system configuration agent (Ubox).

Memory Controller (Mbox)
There are two integrated memory controllers. The Mbox provides the interface logic to the Scalable Memory Buffer over SMI (the Scalable Memory Interconnect, formerly known as the Fully Buffered DIMM 2 interface). Each memory controller supports 2 SMI channels, for 4 SMI channels per socket. For each 64-byte cache line stored in memory, there are 16 bits available to be used for directory support.
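As a rough check of the directory overhead and channel count described above; the ~3% figure is derived here, not quoted from the datasheet.

# 16 directory bits per 64-byte cache line works out to roughly 3%
# overhead relative to the data bits; 2 controllers x 2 SMI channels
# gives 4 SMI channels per socket.

CACHE_LINE_BYTES = 64
DIRECTORY_BITS = 16

data_bits = CACHE_LINE_BYTES * 8        # 512 bits of data per line
overhead = DIRECTORY_BITS / data_bits   # 0.03125, i.e. ~3.1%

CONTROLLERS_PER_SOCKET = 2
SMI_CHANNELS_PER_CONTROLLER = 2
smi_channels = CONTROLLERS_PER_SOCKET * SMI_CHANNELS_PER_CONTROLLER   # 4 per socket

print(f"{overhead:.1%} directory overhead, {smi_channels} SMI channels per socket")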

SMI Lock Stepped Channels Requirement

SMI

"Each of the two memory controllers in the Xeon 7500 manages two SMI channels operating in lock-step operation. In lock-step channel mode, each memory access spans SMI channels A and B or C and D. Lock-stepped channel mode requires that both Scalable Memory Buffers (SMB) be populated identically with regard to DIMM size and organization. "

Hemisphere mode
In hemisphere mode of operation, the processor's Caching Agent 1 will not access Home Agent 2, thereby reducing memory latencies.

Hemi