
AMD Opteron System Architecture

The Opteron microprocessor was designed from the beginning for point-to-point signaling, called HyperTransport (HT) in AMD terminology. The Opteron 800 and 8000 series processors have 3 full-width HT links. Each full-width 16-bit HT link can be split into two half-width 8-bit links. The typical 4-way Opteron system layout is shown below, with the processors on the four corners of a square.

Opteron System

Each Opteron processor connects directly to two other Opteron processors; the third processor is two HT hops away. Two of the Opteron processors also use one HT link each to connect to an IO hub, so there are two IO bridge devices. The third HT port on the other two Opteron processors is not used.

A significant feature of the Opteron processor is the integrated memory controller. In a single processor system, this substantially lowers the round-trip memory access latency, the time from when the processor issues a memory request to the time the contents arrive (which is much longer than a DRAM clock cycle). A typical round trip from processor over FSB to MCH to DRAM and back might be 95-120ns, compared with 50-60ns from processor via the integrated memory controller to DRAM and back. Consider a code sequence that fetches a memory address and then uses the contents to determine the next address. The next request cannot be issued until the first completes, and a 3GHz processor clock cycle is only 0.33ns. In a code sequence where the address of each memory request is known far in advance, so memory accesses can be deeply pipelined, latency is not as important.
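To put these numbers in perspective, here is a quick arithmetic sketch converting the round-trip latencies cited above into stalled processor cycles for a dependent (pointer-chasing) access, using the 3GHz example clock:

```python
# Convert round-trip memory latency into idle processor cycles.
# A dependent load cannot issue until the previous round trip
# completes, so latency translates directly into stalled cycles.

cpu_ghz = 3.0
cycle_ns = 1.0 / cpu_ghz  # ~0.33ns per cycle at 3GHz

def stalled_cycles(latency_ns):
    """Cycles the processor waits on one dependent memory access."""
    return round(latency_ns / cycle_ns)

# FSB -> MCH -> DRAM path vs. integrated memory controller path
for label, (lo, hi) in {"FSB/MCH": (95, 120), "integrated": (50, 60)}.items():
    print(f"{label}: {lo}-{hi}ns -> {stalled_cycles(lo)}-{stalled_cycles(hi)} cycles")
```

Roughly 285-360 cycles lost per dependent access over the FSB path versus 150-180 with the integrated controller, which is why pointer-chasing code benefits so much.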

One benefit of the integrated memory controller in multi-socket systems is that the memory controller signals are distributed with the processors, not concentrated in the MCH as with the Intel 7300 chipset. The 4-way Opteron platform above has 8 memory channels compared with 4 on the Intel 7300.

It would also seem that another advantage is that interconnect bandwidth scales with the number of processors. What is not mentioned is that much of this bandwidth is consumed by cache coherency traffic. In the shared bus architecture, each processor can monitor the memory traffic of the other processors.

The HP ProLiant DL785G5 8-way Opteron system has 16 memory channels. HP does not provide an architecture diagram for the 8-way Opteron. The Sun Fire X4600 M2 Server Architecture whitepaper for their 8-way Opteron describes an Enhanced Twisted Ladder arrangement.


The original single-core Opteron processor was updated with newer manufacturing processes, faster memory interfaces, and first dual-core, then quad-core, and finally six-core versions, all while retaining the basic architecture of 2 memory channels and 3 full-width HT links. (During this period, minor improvements to the microprocessor architecture were discussed, but there was no major architectural improvement to the Opteron core.)

The 2008(?) generation Opteron processors, codenamed Shanghai, were on the 45nm process with a dedicated 512K L2 cache per core and a 6M shared L3 cache. Initial frequencies ran up to 2.7GHz; later steppings reached 3.1GHz.

A 6-core Opteron codenamed Istanbul was announced on 1 June 2009. Istanbul is described as having an HT Assist feature, which uses 1M of the 6M L3 cache. The following excerpt is from the article by Johan De Gelas on AnandTech (AMD's Six-Core Opteron 2435).

HT Assist is a probe or snoop filter that AMD implemented. First, let us look at a quad Shanghai system. CPU 3 needs a cache line which CPU 1 has access to. The most recent data is, however, in CPU 2's L2 cache.

without HT Assist

Start at CPU 3 and follow the sequence of operations:

1. CPU 3 requests information from CPU 1 (blue “data request” arrow in diagram)
2. CPU 1 broadcasts to see if another CPU has more recent data (three red “probe request” arrows in diagram)
3. CPU 3 sits idle while these probes are resolved (four red & white “probe response” arrows in diagram)
4. The requested data is sent from CPU 2 to CPU 3 (two blue and white “data response” arrows in diagram)

There are two serious problems with this broadcasting approach. First, it wastes a lot of bandwidth, as 10 transactions are needed to perform a relatively simple action. Second, those 10 transactions add a lot of latency to the instruction on CPU 3 that needs the piece of data.
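The transaction count in the walk-through above generalizes. The following is a hypothetical tally for broadcast probing in an N-socket system; the per-step hop counts (1 request, N-1 probe broadcasts, N probe responses, 2 data-response hops) are my reading of the four-socket diagram, not an AMD formula:

```python
def broadcast_transactions(sockets, data_response_hops=2):
    """Hypothetical tally of coherency transactions for one remote
    cache-line fetch with broadcast probing (no probe filter):
    1 data request, (sockets - 1) probe requests broadcast by the
    home node, one probe response per socket, plus the hops needed
    to return the data."""
    request = 1
    probe_requests = sockets - 1
    probe_responses = sockets
    return request + probe_requests + probe_responses + data_response_hops

print(broadcast_transactions(4))  # the 10-transaction quad-socket example
```

Note how the probe traffic grows with socket count even though only one cache line moves, which is the scaling problem HT Assist targets.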

The solution is a directory-based system that AMD calls HT Assist. HT Assist reserves a 1MB portion of each CPU's L3 cache to act as a directory. This directory tracks where that CPU's cache lines are used elsewhere in the system. In other words, the L3 caches are only 5MB, but a lot of probe or snoop traffic is eliminated. To understand this, look at the picture below:

with HT Assist

Let us see what happens. Start again with CPU 3:

1. CPU 3 requests information from CPU 1 (blue line)
2. CPU 1 checks its L3 directory cache to locate the requested data (Fat red line)
3. The read from CPU 1’s L3 directory cache indicates that CPU 2 has the most recent copy and directly probes CPU 2 (Dark red line)
4. The requested data is sent from CPU 2 to CPU 3 (blue and white lines)

Instead of 10 transactions, we have only 4 this time, a considerable reduction in latency and wasted bandwidth. Probe broadcasting can be eliminated in 8 of 11 typical CPU-to-CPU transactions. Stream measurements show that 4-way memory bandwidth improves by more than 60%: 41.5GB/s with HT Assist versus 25.5GB/s without.
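Using only the figures quoted above (10 versus 4 transactions, and the Stream bandwidth with and without HT Assist), a quick comparison sketch:

```python
# Figures quoted in the text above.
broadcast_txns, directory_txns = 10, 4       # per remote cache-line fetch
with_assist_gbs, without_assist_gbs = 41.5, 25.5  # 4-way Stream bandwidth

txn_reduction = 1 - directory_txns / broadcast_txns
bw_gain = with_assist_gbs / without_assist_gbs - 1

print(f"transactions reduced by {txn_reduction:.0%}")  # 60%
print(f"Stream bandwidth improved by {bw_gain:.0%}")   # ~63%
```

The ~63% measured Stream gain tracks the 60% reduction in per-fetch transactions fairly closely, consistent with coherency traffic having consumed much of the link bandwidth.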

But it must be clear that HT Assist is only useful in quad-socket systems and is of the utmost importance in octal-CPU systems. In a dual-socket system, a broadcast is the same as a unicast, as there is only one other CPU. HT Assist also lowers the hit rate of the L3 caches (5MB instead of 6MB), so it should be disabled on 2P systems.

The following generation will have 12 or more cores in one socket. The announced plan is to place two of the six-core dies in one package, connected with one of the three HT links on each die. The interface out of the package is 4 HT links and 4 DDR2/3 memory channels.

The current Opteron is listed as supporting 2GT/s (based on a 1GHz clock) in HT version 1, and 4.4GT/s in HT version 3.0. It appears that most or all current Opteron systems still use HT-to-PCI-E bridge devices that are HT 1.0 only, and PCI-E gen 1 only. The newly announced HP ProLiant DL585G6, which supports Istanbul only, may be HT 3.0 at 4.4GT/s (or 8.8GB/s per direction). The previous DL585G5 supports only Shanghai and earlier.
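The per-direction bandwidth figure follows directly from transfer rate and link width. A sketch, assuming the full-width 16-bit link described earlier:

```python
def ht_bandwidth_gb_per_dir(transfer_rate_gt, link_width_bits=16):
    """Per-direction HT link bandwidth in GB/s:
    transfers/sec x bytes per transfer (width / 8)."""
    return transfer_rate_gt * link_width_bits / 8

print(ht_bandwidth_gb_per_dir(2.0))  # HT 1.x at 2GT/s   -> 4.0 GB/s
print(ht_bandwidth_gb_per_dir(4.4))  # HT 3.0 at 4.4GT/s -> 8.8 GB/s
```

A half-width 8-bit link at the same transfer rate carries half these figures, and total link bandwidth is twice the per-direction number since HT links are bidirectional.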

AMD Istanbul

AMD crowed loudly that Opteron, with its integrated memory controller and HyperTransport, scaled memory and inter-processor bandwidth with the number of processors, while the Intel systems up to the Xeon 7400 series were constrained by a shared processor front-side bus and the fixed memory bandwidth of a discrete memory controller. See, for example, excerpts from: SQL Server 2005 and AMD64 – a winning team.

Intel FSB Architecture

Opteron DCA Architecture

With the announcement of the six-core Istanbul, we find out that AMD previously did not have a mechanism for maintaining cache coherency comparable to the Snoop Filter in the Intel chipsets. Without this, much of the available bandwidth is consumed by cache coherency traffic, limiting scaling in systems with 4 or more processor sockets. In Istanbul, the HT Assist, or Probe Filter, feature uses 1M of the 6M L3 as a directory cache to track cache lines. AMD measured 42GB/s memory bandwidth with HT Assist versus 25.5GB/s without.

Below is another pictorial description of HT Assist. I will provide the source reference shortly.

with HT Assist

Magny-Cours

Magny-Cours was announced as a 12-core processor comprising two six-core dies in a single package. At the time, I assumed this was simply two Istanbul dies. Apparently this may not be correct, as the Magny-Cours die has 4 full-width HT links, versus 3 for all previous Opteron processors. As shown below, each die within the processor socket/package is connected to the other with one full-width and one half-width HT link. In a 2-way system, a die in one processor connects to one die in the other processor with a full-width HT link, and to the second die with a half-width HT link. The remaining full-width HT link, which has no cache-coherency support, is reserved for IO. The slides below are from the AMD Hot Chips 2009 presentation referenced at the end.

Magny Cours MCM

Magny Cours Topology

In a 4-way system, one die connects to the first die in each of the other three processors with a half-width HT link, and the other die connects to the second die in each of those processors. With this strategy, AMD no longer supports 8-socket systems, but now provides 48 cores in a relatively low-cost 4-way system.
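As a sketch of what this topology implies, the following models the 4-way die-to-die adjacency as I read the description above (the link list is my reconstruction, not an official AMD diagram) and confirms that any die can reach any other in at most two HT hops:

```python
from itertools import product
from collections import deque

# Hypothetical 4-way Magny-Cours adjacency: within each package the two
# dies are linked; across packages, die 0 of each processor links to
# die 0 of every other processor, and die 1 likewise to die 1.
dies = list(product(range(4), range(2)))   # (package, die)
links = []
for p in range(4):
    links.append(((p, 0), (p, 1)))         # intra-package link
    for q in range(p + 1, 4):
        links.append(((p, 0), (q, 0)))     # half-width, die-0 mesh
        links.append(((p, 1), (q, 1)))     # half-width, die-1 mesh

def neighbors(n):
    return [b if a == n else a for a, b in links if n in (a, b)]

def hops(src, dst):
    """Breadth-first hop count between two dies."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))

print(max(hops(a, b) for a in dies for b in dies))  # worst case: 2 hops
```

Under this reading, the 8-die system keeps a 2-hop diameter, the same as the classic 4-way square of single-die Opterons shown earlier.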

Bulldozer

Bulldozer is the next-generation AMD processor architecture. Its system architecture will be discussed as more information becomes available. A description of the processor is here.

Chipsets for AMD Processors

I have not talked much about AMD chipsets. The older 8131 and 8132 were HT-to-PCI-X tunnels, from 2004(?).

nVidia 2200 and 2050

AMD was slow in providing a PCI-E chipset, and some vendors (HP) used the nVidia 2200/2050.

nVidia 2200 and 2050 chipset

There is another nVidia chipset combination, the IO55 and MCP55 Pro, that was used by Supermicro and in the Sun x4540.

nVidia IO55 and MCP55 Pro chipset

AMD Chipsets SR5690

The current AMD chipset is the SR5600 series: 5690, 5670 and 5650 with 42, 30 and 22 PCI-E Gen 2 lanes respectively.

Supermicro provides system block diagrams for the 5600 series chipsets.

SR5600 series system block diagram

References

Hot Chips presentation on HT Assist: Performance Methodologies for Opteron Servers
revised link: Performance Methodologies for Opteron Servers

The HyperTransport Consortium: HT_General_Overview.