
System Architecture and Configuration

Before jumping into system configuration, it is useful first to understand system architecture, including how the current system architecture came to be. A discussion of system architecture inevitably involves a comparison of Intel Xeon and AMD Opteron performance characteristics, so this is briefly covered. It is important to distinguish performance characteristics at the processor and system levels where possible.

System Architecture History

The architecture of Intel 4-way (socket) systems dates to around 1996 with the Pentium Pro 4-way system. From 1998-2003, 4-way systems looked more or less as below, with four processors and a memory controller (or north bridge) sharing one system bus.

Generic 4-way system

The original Pentium Pro chipset actually had a separate memory controller and one or two PCI bridge chips, for a total of 7 devices on the system bus instead of the 5 with the integrated memory and I/O hub shown above. At the time, it was a very effective way to build a 4-way system. Performance scaling was very good for transaction processing database workloads. The cost structure was competitive, with base model pricing around $15K in 1996, dropping to $6K by 1999. The individual chips in this design could all be built at a relatively economical pin count, about 500-700 pins (for the time).

It could be argued that four processors sharing one bus would cause a performance bottleneck, but this was not necessarily the case during that era. A properly sized cache was sufficient to keep bus traffic low. Because all processors shared one bus, each could see all memory traffic from the other processors, so maintaining cache coherency was relatively simple.

Point-to-Point versus Shared Bus Architecture

In the late 1990s, the technology for point-to-point signaling became sufficiently well developed. The objective in a multi-processor system is to move as much information from one location (chip) to another as quickly as possible. The main limitation is the number of pins that can be built into a socket at a given cost structure. Point-to-point can route the most data per pin with lower latency, hence it has the advantage in the key criteria that justify replacing the shared bus architecture.

The point-to-point protocol can operate at a much higher signaling rate than the shared bus architecture. The shared bus with 5 loads (4 processors plus MCH) topped out at 400MT/s, 800MT/s with 3 loads (2 processors plus MCH), and 1600MT/s with 2 loads (processor and MCH). The point-to-point architecture initially operated in the 1.6GT/s range and is now at 6.4GT/s. Point-to-point not only operates at higher transfer rates, but is also a more efficient protocol than the shared bus. The shared bus, as its name implies, is shared among all devices on it, so the protocol requires that a device must first arbitrate for control of the bus before sending a request. The point-to-point architecture mandates that there are only two devices on a link, hence one device at the other end, so requests can be sent without any preliminary handshaking.
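To put these rates in perspective, here is a small sketch (Python, purely illustrative) working out the nominal peak bandwidth implied by the numbers above: a 64-bit shared front-side bus divides its bandwidth among all of its loads, while a point-to-point link carries its full rate between exactly two devices. The 2-byte data payload per point-to-point transfer is an assumption for illustration.

    # Nominal peak bandwidth implied by the transfer rates quoted above.
    # Shared bus: 64-bit (8-byte) data bus, divided among all loads on the bus.
    # Point-to-point: full rate per link, assumed 2-byte data payload per transfer.

    def shared_bus_gb_s(mt_per_s, width_bytes=8):
        """Peak bandwidth in GB/s for a shared front-side bus."""
        return mt_per_s * width_bytes / 1000.0

    for loads, mt_s in [(5, 400), (3, 800), (2, 1600)]:
        print(f"{loads} loads @ {mt_s} MT/s: {shared_bus_gb_s(mt_s):.1f} GB/s shared by {loads} devices")

    for gt_s in (1.6, 6.4):
        print(f"point-to-point @ {gt_s} GT/s x 2 bytes: {gt_s * 2:.1f} GB/s per direction per link")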

The first Intel system to implement point-to-point signaling was the 870 chipset for the Itanium 2, under the name scalability port (SP). The Itanium 2 processor itself still used a shared bus. The 870 components used SP to link the MCH, IO hub (SIOH) and SPS cross-bar (see the Intel E8870 System Chipset whitepaper, and the Hotchips presentations The Intel 870 Family of Enterprise Chipsets). The AMD Opteron was designed from the beginning for point-to-point, with the name Hyper-Transport (HT).

At the time of these developments, Intel had just made a commitment to the second generation of the Pentium Pro bus architecture for the Pentium 4 (NetBurst) processor. So it was decided to leverage that infrastructure for the next several years, rather than transition to a new, better technology in the early 2000s. The generation of Intel processors built around the Nehalem architecture has point-to-point signaling under the name QuickPath Interconnect (QPI), starting in late 2008 and 2009. (Side note: for the past decade, new processor architectures have appeared first in desktop or two-way servers, and then in 4-way servers approximately one year later.)

The Intel Pentium Pro, Pentium 4 (NetBurst) and Core 2 processors have a 64-bit data bus and a 36-40 bit address bus. The architecture is a split transaction bus with deferred reply, meaning multiple address commands can be sent before the first request is completed. In the original Pentium Pro, the bus clock, address and data rates were all the same. From NetBurst on, the address bus is double-pumped and the data bus is quad-pumped. The quoted mega-transfer rate (MT/s) is usually for the data bus. In point-to-point, a full link is typically 16-20 bits wide and is used to send both address and data, so the bandwidths quoted for bus and point-to-point are not exactly comparable.

Shared Bus Evolutionary End Point

Microprocessor performance progresses at the pace of Moore's Law, which works out to roughly a 40% increase in performance per year. The shared bus architecture was introduced when processor frequency was 133-166MHz on a 66MHz bus. When single-core processors reached 3GHz, with dual-core processors on the horizon, the 5-load system bus had been pushed to a 400MT/s transfer rate. Even this was inadequate to handle eight cores on one bus, even with a large cache. The band-aid solution in 2005 was the E8500 chipset with a dual independent bus (DIB) design.

E8500 Architecture

The following year, the E8501 increased the FSB transfer rate to 800MHz, but left the Independent Memory Interface (IMI) at 5.3GB/sec (Intel E8501 Chipset North Bridge).

E8501 Architecture

Since the front-side bus could not be pushed beyond 400MHz with 5 (socket) loads, the E8500 chipset was designed with two front-side busses. This arrangement was previously used in the 8-way system (ProFusion chipset) for the Pentium III processor, with four processors on each of two 100MHz front-side busses. The E8500 MCH required 1432 pins to connect the two processor busses, four memory channels and 28 PCI-E lanes, which by that time was possible to fit within the existing 4-way system cost structure.

The final evolution of an Intel multi-processor system based on the bus architecture is the 7300 chipset. The MCH now has four front-side busses, with a single processor socket on each bus. The 7300 MCH requires 2013 pins for all the signals, power and grounds necessary to support this architecture (Intel 7300 Chipset Memory Controller Hub (MCH) datasheet).

7300 architecture

There are approximately 150 signals for the front-side bus, with 64-bit data, a 40-bit address (excluding the lower 4 bits) and a few dozen or so control signals. Each FB-DIMM memory channel has 24 signals. Each PCI-E lane has 4 signals, one differential pair (positive and negative) each for transmit and receive, or 128 signals total for 32 lanes. In addition to the signals, a very large number of pins are required for power and ground.

In other Intel chipsets with one processor and one MCH, it was possible to push the FSB to 1600MHz, but the 7300 FSB is set to 1066MHz. The Xeon 5000 series and below used the second-generation 771-pin socket, while the Xeon 7000 series retained the original 604-pin socket.

While the direct processor-to-MCH connection may look diagrammatically similar to the point-to-point design, it still uses the shared bus protocol, which cannot be changed without a substantial redesign. In contrast, each 20-lane QPI link requires 84 signals: 20 differential pairs in each direction for data, plus a clock pair in each direction. The number of pins is reduced by about a factor of 2, and the transfer rate increases from 1.6GT/s for the shared bus with 2 loads to 6.4GT/s for point-to-point.
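As a rough sanity check on the pin budget described above, the sketch below (illustrative only) tallies the approximate signal counts quoted for the 7300 MCH and for a QPI link. Power and ground pins, which account for much of the 2013-pin package, are not included.

    # Approximate signal tally for the Intel 7300 MCH, using the per-interface
    # counts quoted above.  Power and ground pins are not counted here.
    fsb_signals_per_bus   = 150   # 64-bit data, address, control
    fbd_signals_per_chan  = 24    # per FB-DIMM memory channel
    pcie_signals_per_lane = 4     # one differential pair each for transmit and receive

    mch_signals = (4 * fsb_signals_per_bus       # four front-side busses
                 + 4 * fbd_signals_per_chan      # four FBD memory channels
                 + 32 * pcie_signals_per_lane)   # 28 PCI-E lanes plus the x4 ESI
    print(f"7300 MCH signal pins (approx): {mch_signals} of 2013 total pins")

    # One QPI link: 20 data lanes plus a forwarded clock in each direction,
    # all differential, giving the 84 signals mentioned above.
    qpi_signals = (20 + 1) * 2 * 2
    print(f"QPI link signals: {qpi_signals}")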

From the E8501 to the 7300, the front-side bus frequency increased from 800MHz to 1.066GHz, and the number of busses doubled. However, the number of memory channels remained unchanged at 4, each with 5.3GB/s of bandwidth. Memory performance is characterized not just by the raw bandwidth, but also by the number of independent memory channels, which determines the ability to drive concurrent memory transactions, and of course by latency.

The E8501 MCH was designed to support four dual core Xeon 7000 or 7100 series processors, which are based on the Pentium 4 micro-architecture, for a total of 8 cores (disregarding the Hyper-Threading feature, which shows 16 logical processors). The 7300 MCH was designed for four quad-core Xeon 7300 series and six-core Xeon 7400 series processors based on the Core2 micro-architecture, for a total of 24 cores.

The Core 2 architecture at 2.66-2.93GHz is substantially more powerful than the Pentium 4 architecture at 3.4GHz at the processor core level, so a strong argument could be made that the 7300 chipset would really benefit from substantially more memory bandwidth and channels. It is very likely that the 7300's 2000+ pins are the most that could be put into a single device without a very large increase in complexity and cost structure. (The IBM x3950M2 has a proprietary chipset with 8 memory channels, 4.26GB/s read and 2.13GB/s write per channel.)

The E8500/1 chipsets use the Independent Memory Interface (IMI) to connect the MCH to an External Memory Bridge (XMB), which in turn connects to two DDR memory channels at 266 or 333MHz, or DDR2 at 400MHz. The IMI bandwidth is 5.3GB/s reading from memory (inbound) and 2.67GB/s writing to memory (outbound). The 7300 implements Fully Buffered DIMM (FBD) channels to FB-DIMM memory. Each FB-DIMM in turn has an Advanced Memory Buffer chip along with DDR2 memory at 667MHz. Both the E8501 and 7300 support simultaneous 5.3GB/s inbound and 2.67GB/s outbound per memory channel, for a combined memory bandwidth of 21.3GB/s read and 10.7GB/s write. Both IMI and FBD memory channels signal at 6X the memory chip frequency.
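The per-channel figures above imply the following aggregates for a 4-channel configuration (a quick arithmetic sketch, nominal peak values only):

    # Aggregate memory bandwidth from the per-channel IMI/FBD figures above.
    # The channels themselves signal at 6x the DRAM data rate, e.g. DDR2-667
    # corresponds to roughly a 4GT/s FBD link.
    channels      = 4
    inbound_gb_s  = 5.33   # reading from memory, per channel
    outbound_gb_s = 2.67   # writing to memory, per channel

    print(f"aggregate read : {channels * inbound_gb_s:.1f} GB/s")    # ~21.3 GB/s
    print(f"aggregate write: {channels * outbound_gb_s:.1f} GB/s")   # ~10.7 GB/s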

The IO capability is nominally improved from 28 PCI-E lanes plus the 266MB/s HI (Hub Interface) on the E8500 to effectively 32 lanes on the 7300. The 7300 MCH has 28 PCI-E (Gen 1) lanes plus the ESI, which is effectively a x4 PCI-E port with additional functionality to support legacy south bridge devices. The normal PCI-E lanes can be arranged in any combination of x16, x8 or x4 ports (http://www.intel.com/technology/pciexpress/).

MCH PCI Express

As mentioned earlier, the single shared bus was a simple but effective design for its time because each processor could monitor all bus traffic. With multiple processor busses, some mechanism is necessary, or at least desirable, to help maintain cache coherency. The Intel chipsets with multiple independent busses use a Snoop Filter. (Large systems may use a directory-based system.) The diagram below is from the Intel 7300 MCH datasheet. A snoop filter maintains a list of address tags and state for all processor caches, but not the contents of the caches themselves. The snoop filter reduces the need to snoop the other busses for the contents of the processor caches.

Snoop Filter

The 5000P/X chipsets for 2-way systems have dual independent busses (DIB) and a snoop filter. Some benchmarks showed a positive gain with the snoop filter enabled. Other benchmarks showed degradation, so some consideration may be given to testing with the snoop filter enabled and disabled (Dell Power Solutions, Configuring Ninth-Generation Dell PowerEdge Servers). Multiple independent busses reduce the number of electrical loads on a bus, allowing the bus to operate at a much higher clock rate, but it is not at all clear that the combined effective bandwidth is a simple sum of the individual bandwidths. There are definitely questions regarding the effectiveness of the snoop filter in 2-way systems. No such concerns have been brought to light concerning 4-way systems, but the absence of clear information should not be construed to indicate whether or not there are technical issues with the snoop filter.

In summary, the shared bus was effective at the time of its introduction. It was improved and then band-aided as much as possible. The 7300 MCH still has some of the liabilities of the shared bus, and does not realize the full benefits of point-to-point. Part of the problem is that the MCH must support multiple processor busses, memory channels and PCI-E. And it really does not help that the processor bus does not use the most pin-efficient signaling. In the AMD Opteron architecture, the processor not only uses the more pin-efficient signaling, it also integrates the memory controller, so the memory signals are distributed over multiple processors. In the Intel 7300 chipset, the MCH has 2013 pins while there are only 603 on the processor. This is the end of the road for the shared bus, not only because the replacement architecture has finally arrived, but also because nothing more can be done to keep pace with each new generation of processor performance requirements.

Additional Images (2010-04)

FB DIMM

PCI-E

AMD Opteron System Architecture

The Opteron microprocessor was designed from the beginning for point-to-point signaling, Hyper-Transport (HT) in AMD terminology. The Opteron 800 and 8000 series processors have 3 full-width HT links. Each full-width 16-bit HT link can be split into two half-width 8-bit links. The typical 4-way Opteron system layout is shown below, with the processors on the four corners of a square.

OpteronSystem

Each Opteron processor connects directly to two other Opteron processors. The third processor is two (HT) hops away. Two of the Opteron processors also use their third HT link to connect to an IO hub, as there are two IO bridge devices. The third HT port on the other two Opteron processors is not used.

A significant feature of the Opteron processor is the integrated memory controller. In a single processor system, this substantially lowers the round-trip memory access latency, the time from when the processor issues a memory request to the time the contents arrive (which is much longer than a DRAM clock cycle). A typical round trip from processor over FSB to MCH to DRAM and back might be 95-120ns, compared with 50-60ns from processor via the integrated memory controller to DRAM and back. Consider a code sequence that fetches a memory address and then uses the contents to determine the next address. The next request cannot be issued until the first request completes. A 3GHz processor clock cycle is 0.33ns. In a code sequence where the address of a memory request is known far in advance, and memory accesses can be deeply pipelined, latency is not as important.
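To see what those latencies mean in processor terms, the sketch below converts the round-trip figures above (using the midpoints of the quoted ranges) into stalled clock cycles at 3GHz for a dependent, pointer-chasing access pattern. These are back-of-the-envelope numbers, not measurements.

    # Convert round-trip memory latency into 3GHz processor clock cycles.
    # In a pointer-chasing sequence (each load address depends on the previous
    # load's result), the core stalls roughly this long on every cache miss.
    cpu_ghz  = 3.0
    cycle_ns = 1.0 / cpu_ghz    # ~0.33 ns per cycle

    latencies = [
        ("FSB + MCH + DRAM round trip",             110),  # midpoint of 95-120 ns
        ("integrated memory controller round trip",  55),  # midpoint of 50-60 ns
    ]
    for label, ns in latencies:
        print(f"{label}: {ns} ns ~= {ns / cycle_ns:.0f} cycles stalled")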

The benefit of the integrated memory controller in multi-socket systems is that the memory signals are distributed across the processors, not concentrated in the MCH as with the Intel 7300 chipset. The Opteron platform above has 8 memory channels compared with 4 on the Intel 7300. The HP ProLiant DL785G5 8-way Opteron system has 16 memory channels. HP does not provide an architecture diagram for the 8-way Opteron. The Sun Fire X4600 M2 Server Architecture whitepaper (www.sun.com/servers/x64/x4600/arch-wp.pdf) for their 8-way Opteron describes an Enhanced Twisted Ladder arrangement.

The current generation Opteron processors, codename Shanghai, are on the 45nm process with a dedicated 256K L2 cache per core and a 6M shared L3 cache. The original frequencies ran up to 2.7GHz. Newly listed frequencies are up to 3.1GHz.

A 6-core Istanbul was announced on 1 June 2009. Istanbul is described as having an HT Assist feature, which uses 1M of the 6M L3 cache. The following is from the online article by Johan De Gelas on AnandTech (AMD's Six-Core Opteron 2435).

HT Assist is a probe or snoop filter AMD implemented. First, let us look at a quad Shanghai system. CPU 3 needs a cache line which CPU 1 has access to. The most recent data is, however, in CPU 2's L2 cache.

without HT Assist

Start at CPU 3 and follow the sequence of operations:

1. CPU 3 requests information from CPU 1 (blue “data request” arrow in diagram)
2. CPU 1 broadcasts to see if another CPU has more recent data (three red “probe request” arrows in diagram)
3. CPU 3 sits idle while these probes are resolved (four red & white “probe response” arrows in diagram)
4. The requested data is sent from CPU 2 to CPU 3 (two blue and white “data response” arrows in diagram)

There are two serious problems with this broadcasting approach. Firstly, it wastes a lot of bandwidth, as 10 transactions are needed to perform a relatively simple action. Secondly, those 10 transactions add a lot of latency to the instruction on CPU 3 that needs the piece of data (which was requested by CPU 3 from CPU 1).

The solution is a directory-based system, which AMD calls HT Assist. HT Assist reserves a 1MB portion of each CPU's L3 cache to act as a directory. This directory tracks where that CPU's cache lines are used elsewhere in the system. In other words, the L3 caches are only 5MB in size, but a lot of probe or snoop traffic is eliminated. To understand this, look at the picture below:

with HT Assist

Let us see what happens. Start again with CPU 3:

1. CPU 3 requests information from CPU 1 (blue line)
2. CPU 1 checks its L3 directory cache to locate the requested data (Fat red line)
3. The read from CPU 1’s L3 directory cache indicates that CPU 2 has the most recent copy and directly probes CPU 2 (Dark red line)
4. The requested data is sent from CPU 2 to CPU 3 (blue and white lines)

Instead of 10 transactions, we have only 4 this time. A considerable reduction in latency and wasted bandwidth is the result. Probe “broadcasting” can be eliminated in 8 of 11 typical CPU-to-CPU transactions. Stream measurements show that 4-Way memory bandwidth improves 60%: 41.5GB/s with HT Assist versus 25.5GB/s without HT Assist.

But it must be clear that HT Assist is only useful in a quad-socket system, and of the utmost importance in octal CPU systems. In a dual-socket system, a broadcast is the same as a unicast, as there is only one other CPU. HT Assist also lowers the hit rate of the L3 caches (5MB instead of 6), so it should be disabled on 2P systems.

The following generation will have 12 or more cores in one socket. The announced plan is to place two of the six-core dies in one package, connected with one of the three HT links on each die. The interface out of the package is 4 HT links and 4 DDR2/3 memory channels.

The current Opteron is listed as supporting 2GT/s (based on a 1GHz clock) in HT version 1, and 4.4GT/s in HT version 3.0. It appears that most or all current Opteron systems still use HT-to-PCI-E bridge devices that are HT 1.0 only, and PCI-E Gen 1 only. The newly announced HP ProLiant DL585G6, which supports Istanbul only, may be HT 3.0 for 4.4GT/s (or 8.8GB/sec per direction). The previous DL585G5 supports only Shanghai and earlier.

Intel Nehalem based systems with QPI

The first Nehalem architecture processor release was the Core i7 for single socket desktop systems in November 2008. The Xeon 5500 series followed in April 2009 for 2-way server systems. Both have 4 physical cores and 3 DDR3 memory channels. The Core i7 has a single QPI link and the Xeon 5500 has 2. The 2-way Xeon system can be configured with a single IOH or with two IOHs, as shown below. With one IOH, both processors connect directly to the IOH. Each IOH has two QPI links and 36 PCI-E Gen 2 lanes plus the ESI. With two IOHs, each processor connects directly to one IOH, and the two IOHs also connect directly to each other.

QPI 1 IOH

QPI 2 IOH

Notice that there are 1366 pins on the Nehalem processor versus 603 or 771 for the Core 2 based Xeon processors, and the number of pins on the IOH (now without memory channels) is reduced to a more economical 1295.

The QPI has a bandwidth of 12.8GB/s in each direction simultaneously, for a combined bi-directional bandwidth of 25.6GB/s. Each PCI-E Gen 2 lane operates at 5Gbit/sec for a net bandwidth of 500MB/s per lane, per direction. A x4 PCI-E Gen 2 port is rated 2GB/s per direction, and a x8 port 4GB/s per direction. So while the 36 PCI-E Gen 2 lanes on the 5520 IOH are nominally rated for 18GB/s per direction, the maximum bandwidth per QPI link is still 12.8GB/s per direction. Still, the dual IOH system would have a nominal IO bandwidth of 25.6GB/s per direction. It would be very interesting to see what actual disk and network bandwidth the Xeon 5500 system can sustain.
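A quick tally of the nominal figures in the preceding paragraph (illustrative arithmetic only; PCI-E Gen 1/2 use 8b/10b encoding, which is where the 500MB/s per lane comes from):

    # Nominal PCI-E Gen 2 and QPI bandwidth figures discussed above.
    # 5 Gbit/s per lane with 8b/10b encoding -> 500 MB/s usable per lane per direction.
    lane_mb_s = 5000 * 8 / 10 / 8     # = 500 MB/s

    for lanes in (4, 8, 16, 36):
        print(f"x{lanes:<2} PCI-E Gen 2: {lanes * lane_mb_s / 1000:.1f} GB/s per direction")

    qpi_gb_s = 12.8   # per link, per direction
    print(f"one QPI link: {qpi_gb_s} GB/s per direction (less than the 18 GB/s of 36 lanes)")
    print(f"two IOH system: {2 * qpi_gb_s} GB/s per direction nominal IO bandwidth")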

The 4-way Nehalem architecture (which might be a Dell R910), scheduled for release in late 2009, looks something like the diagram below.

Nehalem EX

Each processor socket has 8 physical cores, Hyper-Threading (HT, or 2 logical processors per core), 16M L3 cache, 4 memory channels, and 4 QPI links. The architecture of the 4-way system has each processor (socket) directly connected to the three other processor sockets. The remaining QPI link connects to an IO hub. Another difference relative to the AMD Opteron system is that the Intel IO hub has two QPI ports, each of which connects to a processor.

Each full QPI link can be split into two half-wide links. The Nehalem EX, with four full QPI links, can support a glue-less 8-way system architecture. Glue-less means that no other silicon is necessary to connect the processors together. Each processor connects directly to all seven other processors with a half-wide QPI link, and uses the remaining half-wide link to connect to the IOH. The HyperTransport Consortium describes this arrangement for a future Opteron system with 4 full HT links per processor (HT_General_Overview.pdf). The 8-way system architectures described to date do employ half-wide links, for both current Opteron systems and the forthcoming Nehalem EX systems.

Performance History and Characteristics

One cannot discuss comparative system architectures without bringing up performance. To discuss the relative merits of each system architecture, it is important to distinguish between performance at the processor core level and at the complete system level as much as practical. Performance at the individual core level is still relevant because some important functions are not yet multi-threaded. The two traits of interest at the system level are scalability and overall performance. Scalability is characterized by performance or throughput versus the number of processors (cores and sockets). The important metrics are single core performance and overall system performance. A system architecture with excellent scalability wins points for architecture, but if the underlying processor performance is weak, it may not be competitive at the system level against a system with lesser scalability that starts with a more powerful processor core or socket.

Next, performance cannot be characterized with a single metric, especially considering that each of the processor architectures has completely different characteristics. On any given manufacturing process, the transistors are (or were, down to 90nm) more or less comparable between Intel, AMD, IBM or other foundries. An important point is that frequency on one processor architecture has no relation to frequency on a different processor architecture. (Processor frequency is like engine RPM. The important metric is power. For an engine, power is RPM x torque. For a processor, performance is frequency x IPC, or instructions per cycle.)

Benchmarks

As processor performance cannot be characterized by a single metric, it is necessary to look at several benchmarks. The most useful for single core performance is SPEC CPU 2006 integer base. SPEC CPU results are based on 12 individual applications. Another important aspect is that the very latest compilers are used to generate SPEC CPU results, while the compilers used for common applications like SQL Server do not incorporate the very latest enhancements. (It is not clear whether the more recent versions of SPEC CPU are multi-threaded.)

An alternative for single core performance metrics is to simply run certain test queries in SQL Server with all data in memory. Example tests are: 1) rows per second for a non-clustered index seek with bookmark lookup, or a loop join, 2) table scan pages per second, and 3) network round trips per second.
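As an illustration of the third test, a minimal sketch along the following lines could time trivial round trips from a client. The pyodbc driver, connection string and query here are illustrative assumptions rather than anything prescribed above, and a real test would use the actual test queries against warm data.

    # Minimal sketch: estimate network round trips per second to SQL Server by
    # issuing a trivial query in a tight, single-threaded loop.
    # The connection string is a placeholder for an actual server.
    import time
    import pyodbc   # assumes the pyodbc driver is installed

    conn = pyodbc.connect(
        "DRIVER={SQL Server};SERVER=myserver;DATABASE=testdb;Trusted_Connection=yes",
        autocommit=True,
    )
    cursor = conn.cursor()

    calls = 10_000
    start = time.perf_counter()
    for _ in range(calls):
        cursor.execute("SELECT 1").fetchone()   # one network round trip per iteration
    elapsed = time.perf_counter() - start

    print(f"{calls / elapsed:,.0f} round trips per second (single connection, single thread)")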

At the system level, the most widely recognized benchmarks are TPC-C, TPC-E, and TPC-H. TPC-C is a transaction processing benchmark with a long history. Results are available from 1995 to the present. TPC-C has the most extensive set of results, allowing a reasonable comparison of platforms across multiple generations. TPC-E is a relatively new transaction processing benchmark that is supposed to replace TPC-C. Results are available from 2007 to the present. Only SQL Server results have been published. TPC-H is a data warehouse/DSS benchmark with published results from 2004 to the present. While TPC-H performance characteristics are important, there are unfortunately nowhere near enough published results for a comprehensive comparison of system and processor architectures. Intel and the top system vendors most probably have complete sets of TPC-H performance results, electing to publish only when it suits their purposes.

While TPC-C is described as a transaction processing benchmark, it has no relation to any actual transaction processing application. The more important aspect of the TPC-C benchmark is that it is moderately network round-trip intensive and very disk IO intensive. Consider the recent TPC-C result for the 4-way Xeon X7460 (24 cores) of 634,825 transactions per minute (tpm-C). This is 10,580 transactions per second. On average, each TPC-C transaction requires 2.25 calls from the application server to the database server. So this system generates 23,806 network round-trips per second, or 991 per core per second. This in turn implies that the average cost per call is just over 1 CPU-millisecond.

The other important metric to draw from a result is the number of disk drives used for the data files. TPC-C has both performance and price-performance metrics. For the performance metric, the system should be optimized for full utilization, and a reasonable assumption is that the disk load is approximately 175-200 IOPS per disk (queue depth 1 operation). Certain TPC-C publications may target best price-performance, in which case the disks will be driven to much higher queue depth, and higher IOPS per disk. The result cited above was configured with 1000 disk drives, meaning the disk load is very likely to be in the 175-200K IOPS range.
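The arithmetic behind the round-trip and disk figures in the two paragraphs above, as a quick sketch:

    # Round-trip and disk IO figures implied by the 4-way X7460 TPC-C result above.
    tpm_c         = 634_825
    calls_per_txn = 2.25        # application-to-database calls per TPC-C transaction
    cores         = 24

    txn_per_sec   = tpm_c / 60
    calls_per_sec = txn_per_sec * calls_per_txn
    per_core      = calls_per_sec / cores
    print(f"{txn_per_sec:,.0f} txn/s -> {calls_per_sec:,.0f} round trips/s "
          f"-> {per_core:,.0f} per core/s -> ~{1000 / per_core:.2f} CPU-ms per call")

    # Disk load estimate for the data files at roughly queue depth 1 per disk.
    disks = 1000
    for iops_per_disk in (175, 200):
        print(f"{disks} disks x {iops_per_disk} IOPS = {disks * iops_per_disk // 1000}K IOPS")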

The reason network round-trips are so important is that scaling SQL Server performance involves a very different set of issues when driving high network round-trip volume than when executing SQL statements. The TPC-C workload is below, but near, the boundary where the network call volume becomes more important than SQL execution. SAP has characteristics that are almost purely a matter of driving network round-trips. The other aspect of TPC-C is that disk IO is far higher than in almost any actual transaction processing application, and it might contribute as much as 20-30% of the overall load.

Some other aspects of TPC-C are as follows. TPC-C benefits from Hyper-Threading, a feature of the Pentium 4, Nehalem and Itanium 2 processors (starting from the dual core 9100 series). In the Pentium 4 architecture, HT (not to be confused with AMD Hyper-Transport) improved network call volume throughput, but had no consistent effect on SQL operations, and may cause erratic SQL performance.

A large processor cache reduces the fixed startup cost of SQL operations, but does not change the incremental cost per row. A large cache benefits TPC-C and TPC-E, but does not benefit TPC-H.

TPC-C uses only 5 stored procedures, meaning there is no cost expended on SQL compiles, which can be a sizeable portion of the load on a less than perfectly designed transaction database.

TPC-E results are reported in transactions per second (tpsE). There are 10 transaction types that make up the main part of TPC-E. Each type may be made up of more than one frame (or stored procedure). The transaction that is scored is the Trade-Result, which makes up 10% of the transactions. There are about 25 frames total in the 10 transaction types. On average, there are 22.3 stored procedure calls for each scored tpsE. Consider the 4-way X7460 result of 721.4 tpsE. This implies that there are approximately 16,087 RPC calls per second, or 670 per second for each of the 24 cores. So TPC-E performance characteristics are very similar to those of TPC-C.
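The same arithmetic applied to the TPC-E result just cited:

    # RPC rate implied by the 4-way X7460 TPC-E result above.
    tps_e          = 721.4
    calls_per_tpsE = 22.3       # stored procedure calls per scored Trade-Result
    cores          = 24

    rpc_per_sec = tps_e * calls_per_tpsE
    print(f"{rpc_per_sec:,.0f} RPC/s -> {rpc_per_sec / cores:,.0f} per core per second")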

With this understanding of the TPC-C benchmark, let us now examine key published results.

Date System Cores MHz Cache Process tpm-C SPEC
04/1996 4 x Pentium single 133 - - 3,641 -
12/1996 4 x Pentium Pro single 200 512K 350nm 7,521 -
03/1998 4 x Pentium Pro single 200 1M 350nm 12,106 -
11/1998 4 x Pentium II single 400 1M 250nm 20,434 -
09/2000 4 x Pentium III single 700 2M 180nm 34,265 -
09/2001 4 x Pentium III single 900 2M 180nm 39,158 -
08/2002 4 x Xeon MP (P4) single 1.6G 1M 180nm 61,565 -
11/2002 4 x Xeon MP single 2.0G 2M 130nm 77,905 -
12/2002 4 x Itanium 2 single 1.0G 3M 180nm 87,741 -
04/2003 4 x Itanium 2 single 1.5G 6M 130nm 121,065 -
07/2003 4 x Opteron single 1.8G 1M 130nm 82,226 -
09/2003 4 x Xeon MP single 2.8G 4M 130nm 84,713 -
03/2004 4 x Xeon MP single 3.0G 8M 90nm 102,667 -
04/2004 4 x Opteron single 2.2G 1M 130nm 105,687 -
11/2004 4 x Opteron single 2.4G 1M 130nm 123,027 -
04/2005 4 x Opteron dual 2.2G 2x1M 90nm 187,298 9.9
04/2005 4 x Xeon MP Single 3.66G 1M 90nm 141,504 11.4
09/2005 4 x Opteron Single 2.8G 1M 90nm 138,845 -
10/2005 4 x Xeon 7041 Dual 3.0G 2x2M 90nm 188,761 ~9.3
03/2006 4 x Opteron Dual 2.6G 2x1M 90nm 213,986 11.3
07/2006 4 x Itanium 2 9050 Dual 1.6G 18M 90nm 344,928 13.9
09/2006 4 x Opteron 8220 Dual 2.8G 2x1M 90nm 262,989 13.2
10/2006 4 x Xeon 7140 Dual 3.4G 16M 65nm 318,407 12.0
09/2007 4 x Xeon X7350 Quad 2.93G 2x4M 65nm 407,079* 21.0
07/2008 4 x Opteron 8360 Quad 2.50G 2M 65nm 471,883 14.4
08/2008 4 x Xeon X7460 Six 2.66G 16M 45nm 634,825† 21.7
11/2008 4 x Opteron 8384 Quad 2.70G 6M 45nm 579,814 16.9

* IBM x3850M2 result 516,752 on proprietary chipset, DB2

† IBM x3850M2 result 684,508 on proprietary chipset, SQL Server

The main stages of the competition between Intel Xeon and AMD Opteron are: 130nm single core, 90nm single core, 90nm dual-core, Intel 65nm dual core versus AMD 90nm dual core, Intel 65nm quad-core versus Opteron 90nm dual core and Intel 45nm six core versus Opteron 45nm quad-core.

At the time of the Opteron 4-way system introduction in 2003, it was competitive with the Xeon in TPC-C. If we account for the Xeon benefiting from Hyper-Threading in TPC-C by about 7%, while most environments have HT disabled for stability purposes, the Opteron had a moderate advantage. As Opteron frequencies started to climb, AMD gained a more substantial advantage. The 90nm 3GHz Xeon MP with 8M L3 was not particularly impressive, as it suffered from a poorly designed bus to the on-die L3 cache. It was found that the desktop Pentium 4 core at 3.66GHz with just 1M L2 cache in a Xeon package performed much better than the custom-designed server core with the large L3 cache.

The dual-core Opteron appeared in 2005, initially at 2.2GHz, against which the dual-core Xeon 7041 at 3.0GHz was competitive. Over time, Opteron dual-core processor frequency increased incrementally, but Intel Xeon frequency did not. The dual-core Opteron result at 2.8GHz was actually on a new, improved design with DDR2 memory at 667MHz, compared with DDR at up to 333MHz before. The disproportionate performance increase from 214K to 263K (23%) for a 7.7% frequency increase might suggest that there were additional improvements beyond the increased memory bandwidth.

The Intel dual-core Xeon 7140 appeared in October 2006, featuring 2 Pentium 4 architecture cores plus a large 16M L3 cache on one die at 3.4GHz, with an excellent TPC-C result of 318K versus 263K for the contemporary dual-core Opteron at 2.8GHz. As mentioned before, TPC-C benefits from a large cache and Hyper-Threading. So while the Xeon TPC-C result was better, the Opteron did very well on many ad-hoc tests. The dual-core Opteron 8220 scored better on TPC-H than the Xeon 7140.

The Intel Core 2 architecture appeared in mid-2006 for two-way systems (Xeon 5100 series) and below. The 65nm Core 2 consists of two cores with a shared 4M L2 cache. This was soon followed by the quad core Xeon 5300 series consisting of two dual core die in a single package. The two-way Xeon 5355 was almost as powerful at the system level as 4-way dual core systems. The Core 2 was significantly faster at the single core level than Opteron or Pentium 4 architecture processors.

The Core 2 architecture for 4-way systems, the Xeon 7300 series (quad-core), appeared in late 2007. Only with this did Intel finally regain a clear advantage over Opteron in 4-way servers, in both high call volume (TPC-C and TPC-E) and large query (TPC-H) applications. AMD did not have a quad-core until 2008.

At the single core level, the Intel Core 2 architecture is considered more powerful than the contemporary generation AMD Opteron. The Xeon X5470 at 3.33GHz has a SPEC CPU 2006 integer base result of 26.3, compared to 16.9 for the Opteron 8384 at 2.7GHz released in late 2008. The six-core Xeon X7460, whose top frequency was limited to 2.66GHz on account of the power consumption of the two extra cores and the L3 cache, came in at 21.7. A recently announced Opteron 8393 at 3.1GHz is reported at 19.7, which is not far off the Xeon X7460.

AMD was late with its quad-core, codename Barcelona, on the 65nm process, with a 2.3GHz benchmark result of 402K published in March 2008, and a 2.5GHz result of 471K in July 2008. For less than a 9% delta in frequency, there was a 17% increase, which is one indication of the bugs reported in early versions of Barcelona. The 65nm Opteron 8360 quad-core did score higher than the 65nm Xeon X7350 at 4-way. At 45nm, the Intel Core 2 architecture (X7460) scored higher than the Opteron (8384) on the strength of both the 16M L3 cache and the 2 extra cores per socket, even at a lower frequency than the 65nm part (2.66GHz versus 2.93GHz).

TPC-H is a data warehouse benchmark that stresses the ability to run very large queries. Unfortunately there is not a sufficiently large set of comparable TPC-H results. What is known is that TPC-H does not benefit from a large processor cache at all.

While the Core 2 architecture has significantly better performance than Opteron at the single core level, the expectation is that Opteron scales better to a large number of sockets, meaning the performance with 2, 4, 8 or more sockets relative to the performance of a single socket.

Processor

A 4-way system like the Dell R900 has several processor options. Currently the Dell R900 processor options range from the quad-core Xeon E7310 1.6GHz, 2x2M L2 at $11,423, to the quad-core Xeon X7350 2.93GHz, 2x4M L2 at $17,523, and the six-core X7460 2.67GHz, 16M L3 at $20,423. All comparisons are done with 128GB (32x4GB) memory. The Dell R905 is $9,469 with the 2.4GHz Opteron 8378 and $13,119 with the 2.7GHz Opteron 8384. It does seem that the cost of the processor options in the bare system, with processors and memory sockets populated, is reasonably proportional to processor frequency (or expected performance). However, considering the complete cost of the system with storage and software licensing, especially if per-processor licensing applies (and not even including operating costs), the cost difference between the low and high end processor options is not worth worrying about.

Memory

As discussed earlier, the system architecture has multiple memory channels. While uneven memory configurations can be supported, depending on the system, best performance is achieved with identical memory modules populated in banks of 4 or 8, matched to the number of memory channels. In past years, it might have been worthwhile to discuss how to analyze memory capacity requirements.

Today, a 4GB ECC DIMM costs around $115, and an 8GB ECC DIMM costs around $700 (Crucial, 5/20/2009). The typical 4-way system, including the Dell R900, has 32 DIMM sockets. It really is not worth the effort to determine whether a system needs less than 128GB. By default, the system should just be purchased upfront with 32x4GB DIMMs. If there is sound technical analysis to justify 256GB of memory, then this would add about $19K to the system cost.

The long term overall memory price trend has been about a 50% reduction every 3 years, with the largest capacity module having a disproportionately higher price. By 2010-11, the 8GB ECC DIMM should drop to below $300, and a new 16GB ECC DIMM will have the disproportionately higher price of $1000-2000.

An interesting point to consider is this. The HP ProLiant DL785G5 is an 8-way system with 64 DIMM sockets. The DL785G5 with 8 x 3.1GHz quad-core Opteron 8393 processors and 256GB memory (64x4GB) is priced at $62,628. By comparison, the ProLiant DL585G5 with 4 of the same 3.1GHz QC Opteron processors costs $23,001 with 128GB memory (32x4GB) (HP website, 5/20/2009). Unfortunately, the HP website does not list the 8GB DIMM module option for the DL585G5. The Dell PowerEdge R905 with 4 x 2.7GHz is priced at $13,119 with 128GB (32x4GB) and $31,388 with 256GB (32x8GB). The implication is that one should consider a larger system with more DIMM sockets as an alternative to populating the original system with the largest capacity memory module if there is a large disproportion in price per GB. In particular, the Opteron system architecture ties the number of memory channels and DIMM sockets to the number of processors, so it is not possible to configure the ProLiant DL785G5 with 4 processors and 64 DIMM sockets. There is a hard tie of 8 DIMM sockets per processor socket.