
System Architecture and Configuration

System Architecture has been split into multiple sections:
  Historical, AMD Opteron, Intel QPI, Dell PowerEdge, HP ProLiant, IBM x Series,
  NUMA (never finished), and Sandy Bridge 2011-11.
Additional related topics:
  Systems Architecture 2010Q3, Systems Architecture 2009, NEC Express5800/A1080a (2010-06),
  Big Iron Revival III (2010-09), Big Iron Revival II (2009-09), Big Iron Revival (2009-05),
  High Call Volume SQL on NUMA (pre-SQL 2005?).

Update 2011-11-29

Intel Server Strategy Shift with Sandy Bridge EN & EP

The Sandy Bridge EN and EP processors expected in early 2012 will mark the completion of a significant shift in Intel server strategy. For the longest time, 1995-2009, the strategy had been to focus on producing a premium processor designed for 4-way systems that might also be used in 8-way systems and higher. The objective for 2-way systems was to use the desktop processor (that later had a separate brand and different package & socket) to leverage its low cost structure in driving volume. The implication was that 2-way components would be constrained by desktop cost requirements.

The Sandy Bridge collection will be comprised of one group for single processor systems designed for low cost, and one premium processor. The premium processor will support both the EN and EP product lines. The EN is limited to 2-way, and the EP is for both 2-way and 4-way systems. Both EN and EP have more than adequate memory and IO.

The cost structure of both 2-way and 4-way increased from Core 2 to Nehalem due to the significant boost in CPU, memory and IO capability. With quad-core available in 1P, the more price sensitive environments should move down to single processor systems. This allows 2 & 4-way systems to have balanced compute, memory and IO all without being constrained by desktop cost requirements.

In other places, I had commented that the default choice for a database server had for a long time been a 4-way system. Since the introduction of Nehalem in mid-2009, it should now be a 2-way. Default choice means the choice made in the absence of detailed technical analysis, basically a rough guess. The Sandy Bridge EP, with 8 cores, 4 memory channels and 40 PCI-E lanes per socket (80 lanes in a 2-way system), provides even stronger support for this strategy.

The glue-less 8-way capability of the Nehalem and Westmere EX line is not continued. One possibility is that 8-way systems do not need to be glue-less. The other is that 8-way systems are being abandoned, but I am inclined to think this is not the case.

The Master Plan

The foundation of the premium processor strategy, even though it may have been forgotten in the mists of time, not to mention personnel turnover, was that a large cache improves scaling at the 4-way multi-processor level for the shared bus SMP system architectures of the Intel Pentium to Xeon MP period. The 4-way server systems typically deployed with important applications that could easily justify a far higher cost structure than that of desktop components, but also required critical capabilities not necessary in personal computers. Often systems in this category were fully configured with top line components whether needed or not.

Hence the Intel large cache strategy was an ideal match between premium processors and high budget systems for important applications. One aspect that people with an overly technical point of view have difficulty fathoming is that the non-technical VPs don't want their mission critical applications running on a cheap box. In fact, more expensive means that it must be better, and the most expensive is the best, right? From the Intel perspective, a large premium is necessary to amortize the substantial effort necessary to produce even a derivative processor in volumes small relative to desktop processors.

The low cost 2-way strategy was to explore demand for multi-processor systems in the desktop market. Servers were expected to be a natural fit for 2-way systems. Demand for 2-way servers exploded to such an extent that it was thought for a brief moment there would be no further interest for single processor servers.

Eventually, the situation sorted itself out, in part with the increasing power of processors. Server unit volume settled to a 30/60/10 split between single, dual and quad processors (this is old data, I am not sure what the split is today). The 8-way and higher unit volume is low, but potentially of importance in having a complete system lineup.

AMD followed a different strategy based on the characteristics of their Opteron platform. The Hyper-Transport (HT) interconnect and integrated memory controller architecture did not have a hard requirement for large cache to support adequate scaling at 4-way and above. So AMD elected to pursue a premium product strategy based on the number of HT links.

Single processor systems require one HT to connect the IO hub. A 2-way system requires two HT, one connecting to IO, and a second to the second processor. Three HT could support 4-way and higher with various connection arrangements. The pricing structure is based on the number of HT links enabled, on the theory that the processor has higher value in big systems than in small systems.
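The link-count rule just described is simple enough to write down as a lookup. This is only a sketch of the rule of thumb as stated here; the function name `min_ht_links` is made up for illustration, and actual Opteron topologies at 4-way and above vary.

```python
def min_ht_links(sockets: int) -> int:
    """Minimum HT links per processor for a system with this many sockets.

    Encodes the rule of thumb from the text: 1 link for 1P, 2 for 2-way,
    3 for 4-way and higher. The function name is hypothetical.
    """
    if sockets < 1:
        raise ValueError("a system needs at least one processor socket")
    if sockets == 1:
        return 1  # one link to the IO hub
    if sockets == 2:
        return 2  # one link to IO, one to the other processor
    return 3      # 4-way and higher, in various connection arrangements


for sockets in (1, 2, 4, 8):
    print(f"{sockets}-socket system: {min_ht_links(sockets)} HT link(s) per processor")
```

Pricing by enabled links then amounts to pricing by the largest system the part can go into.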

What Actually Happened

Even with the low cost structure Intel enabled in 2-way, desktop systems remained and actually became defined as single processor. Instead, the 2-way systems at the desk of users became workstations. This might have been because the RISC/UNIX system vendors sold workstations. Intel workstations quickly obliterated RISC workstations, and there have been no RISC workstations for some time? Only two RISC architectures are present today, having retreated to the very high-end server space (8-way+), where Intel does not venture.

Itanium was supposed to participate in this space, but the surviving RISC vendors optimized at 8-way and higher. Intel would not let go of the 4-way system volume and so Itanium was squeezed by Xeon from below, yet could not match IBM Power in high SMP scaling. To do so would incur a very high price burden on 4-way systems.

One other aspect of Intel server strategy of the time was the narrow minded focus on optimizing for a single platform. Most of the time, this was the 4-way server. There was so much emphasis on 4-way that there were actually 2 reference platforms, almost to the exclusion of all else. For a brief period in 1998 or so, there was an incident of group hysteria that 8-way would become the standard high volume server. But this phase wore off eventually.

SPARC was perhaps the weakest of the RISC at the processor level. Yet the Sun strategy to design for a broad range of platforms from 2-way to 30-way, (then with luck, 64-way via acquisition of one of the Cray spin-offs) was successful until their processor fell too far behind.

After the initial implementation of the high volume 2-way strategy, desktop systems became intensely price sensitive. The 2-way workstations and server systems were in fact not price sensitive even though it was thought they were. It became clear that desktops could not incur any burden to support 2-way capability. The desktop processor for 2-way systems was put into a different package and socket, and was given the Xeon brand.

Other cost reduction techniques were implemented over the next several generations, as timing and maturity allowed. The main avenue was integration of components to reduce part count. This freed 2-way systems from desktop cost constraints, but as with desktops, it would take several generations to evolve into a properly balanced architecture.

The 4-way capable processors remained on a premium derivative, given the Xeon MP brand in the early Pentium 4 architecture (or NetBurst) period. To provide job security for marketing people (PowerPoint Engineers?), 2-way processors then became the Xeon 5000 series, and 4-way the Xeon 7000 series in the late NetBurst to 2010 period. In 2011, the new branding scheme is E3 for 1P servers, E5 for 2-way and E7 for 4-way and higher. Presumably each branding adjustment requires changes to thousands of slide decks.

At first, Intel thought both 2-way and 4-way systems had high demand versus cost elasticity. If cost could be reduced, there would be substantially higher volume. Chipsets (MCH and IOH) had overly aggressive cost objectives that limited memory and IO capability. In fact, 4-way systems had probably already fallen below the boundary of demand elasticity. The same may have been true for 2-way systems, as people began to realize that single processor systems were just fine for entry server requirements.

Intel had planned to employ a desktop chipset (memory controller and IO hub, previously called north-bridge and south-bridge) with 2-way systems, and this was all that was available for Pentium II and early Pentium III systems. It was soon realized that 2-way server systems had different memory and IO requirements than desktops. Desktops desired a single high-bandwidth port for graphics, which became AGP. Servers desired two or more IO ports with greater bandwidth than 32-bit 33MHz PCI, along with more memory channels and DIMM sockets.

This 2-way server system would have a higher cost structure than Intel believed was acceptable, and hence Intel did not build the proper chipset. ServerWorks, through an arrangement that allowed them access to the Intel processor bus license, did build the right chipset for both 2-way and 4-way systems, and for a while dominated this segment. In 2005-6, Intel was finally able to produce a viable chipset for 2-way systems (E7500? or 5000P) that provided memory and IO capability beyond desktop systems.

It was also thought at the time that there was not a requirement for premium processors in 2-way server systems. The more correct interpretation was that the large (and initially faster) cache of premium processors did not contribute sufficient value for 2-way systems. A large cache does improve performance in 2-way systems, but not to the degree that it does at the 4-way level. So the better strategy by far on performance above the baseline 2-way system with standard desktop processors was to step up to a 4-way system with the low-end premium processors instead of a 2-way system with the bigger cache premium processors.

And as events turned out, the 4-way premium processors lagged desktop processors in transitions to new microarchitectures and manufacturing processes by 1 full year or more. The 2-way server on the newer technology of the latest desktop processors was better than a large cache processor of the previous generation, especially one that carried a large price premium. So the repackaged desktop processor was the better option for 2-way systems.

The advent of multi-core enabled premium processors to be a viable concept for 2-way systems. A dual-core processor has much more compute capability than a single core, and the same goes for a quad-core over a dual-core in any system, not just 4-way, provided that there is not too much difference in frequency. The power versus frequency characteristics of microprocessors clearly favor multiple cores for code that scales with threads, as in any properly architected server application.

However, multi-core at the dual and quad-core level was employed for desktop processors. So the processors for 2-way servers did not have a significant premium in capability relative to desktops. The Intel server strategy remained big cache processors. There was the exception of Tigerton, when two standard desktop dual-core processor die in the Xeon MP socket were employed for the 4-way system, until a large cache variant was readied in the next generation Dunnington processor. This also happened for the Paxville and Tulsa.

System Architecture Evolution from Core 2 to Sandy Bridge

The figure below shows 4-way and 2-way server architecture evolution relative to single processor desktops (and servers too) from 45nm Core 2 to Nehalem & Westmere and then to Sandy Bridge. Nehalem systems are not shown for space considerations, but are discussed below.


System architecture from Penryn to Westmere to Sandy Bridge, (Nehalem not shown)

The Core 2 architecture was the last Intel processor to use the shared bus, which allows multiple devices, processors and bridge chips, to share a bus with a protocol to arbitrate for control of the bus. It was called the front-side bus (FSB) because there was once a back-side bus for cache. When cache was brought on-die more than 10 years ago, the BSB was no more. By the Core 2 period, to support higher bus frequency, the number of devices was reduced to 2, but the shared bus protocol was not changed. The FSB was only pushed to 1066MHz for Xeon MP, 1333MHz for 2-way servers, and 1600MHz for 2-way workstations.

Nehalem was the first Intel processor with a true point-to-point protocol, Quick Path Interconnect (QPI), at a 6.4GT/s transfer rate, achieving much higher bandwidth to pin efficiency than possible over a shared bus. Intel had previously employed a point-to-point protocol for connecting nodes of an Itanium system back in 2002. (AMD implemented point-to-point with HT for Opteron in 2003? at an initial signaling rate of 1.6GHz?) Shared bus also has bus arbitration overhead in addition to lower frequency of operation. The other limitation of Intel processors up to Core 2 was the concentration of signals on the memory controller hub (also known as North Bridge) for processors, memory and PCI-E. The 7300 MCH for the 4-way Core 2 has 2013 pins, which is at the practical limit, and yet the memory and IO bandwidth is somewhat inadequate.
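To put rough numbers on the shared bus versus point-to-point comparison, here is a back-of-envelope peak bandwidth calculation. It is a sketch under two assumptions that are not stated in this article: that the FSB data path is 64 bits (8 bytes) wide, and that a QPI link carries 2 bytes of data per transfer in each direction. The function and constant names are mine.

```python
def peak_gb_per_s(transfers_per_second: float, bytes_per_transfer: int) -> float:
    """Peak bandwidth in GB/s (decimal) for a transfer rate and data width."""
    return transfers_per_second * bytes_per_transfer / 1e9


FSB_BYTES_PER_TRANSFER = 8  # assumption: 64-bit FSB data path
QPI_BYTES_PER_TRANSFER = 2  # assumption: 2 data bytes per transfer, per direction

# FSB rates quoted in the text, in millions of transfers per second.
fsb = {rate: peak_gb_per_s(rate * 1e6, FSB_BYTES_PER_TRANSFER)
       for rate in (1066, 1333, 1600)}

# QPI at 6.4 billion transfers per second, one direction of one link.
qpi_per_direction = peak_gb_per_s(6.4e9, QPI_BYTES_PER_TRANSFER)

for rate, gb in fsb.items():
    print(f"FSB {rate}: {gb:.1f} GB/s peak, shared by every device on the bus")
print(f"QPI 6.4: {qpi_per_direction:.1f} GB/s peak per direction, per link")
```

On these assumptions, one direction of a single QPI link already matches the fastest FSB, over a much narrower interface and without arbitration, and each processor has more than one link.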

Nehalem and Westmere implement a massive increase in memory and PCI-E bandwidth (number of channels or ports) for the 2-way and 4-way systems compared to their Core 2 counterparts. Both Nehalem 2-way and 4-way systems have significantly higher cost structure than Core 2. Previously, Intel had been mindlessly obsessed with reducing system cost to the detriment of balanced memory and IO. This shows Intel recognized that their multi-processor systems were already below the price-demand elasticity point, and it was time to rebalance memory and IO bandwidth, now possible with point-to-point interconnect and the integrated memory controller.

QPI in Nehalem required an extra chip to bridge the processor to PCI-E. This was not an issue for multi-processor systems, but was undesirable for the hyper sensitive cost structure of desktop systems. The lead quad-core 45nm Nehalem processor with 3 memory channels and 2 QPI ports in a LGA 1366 socket was followed by a quad-core, 2-memory channel derivative (Lynnfield) with 16 PCI-E plus DMI replacing QPI in a LGA 1156 socket. The previously planned dual-core Nehalem on 45nm was cancelled. Nehalem with QPI was employed in the desktop extreme line, while the quad-core without QPI was employed in the high-end of the regular desktop line.

The lead 32nm Westmere was a dual-core with the same LGA 1156 socket (memory and IO) as Lynnfield. Per the desktop and mobile objective, cost structure was reduced with integration, with 1 processor die and potentially a graphics die in the same package, and just 1 other component, the PCH.

The follow-on Westmere derivative was a six-core using the same LGA 1366 socket as Nehalem, i.e., 3 memory channels and 2 QPI. This began the separation process of desktop and other single processor systems from multi-processor server and workstation systems. Extreme desktops employ the higher tier components designed for 2-way, but are still single-socket systems. I suppose that a 2-way extreme system is a workstation. Gamers will have to settle for the mundane look of a typical workstation chassis.

With the full set of Sandy Bridge derivatives, the server strategy transition will be complete. Multi-processor products, even for 2-way, are completely separated from desktops without the requirement to meet desktop cost structure constraints. With desktops interested only in dual and quad-core, a premium product strategy can be built for 2-way and above around both the number of cores and QPI links.

The Sandy Bridge premium processor has 8 cores, 4 memory channels, 2 QPI, 40 PCI-E lanes and DMI (that can function as x4 PCI-E). The high-end EP line in a LGA 2011 socket will have full memory, QPI and PCI-E capability. The EN line in a LGA 1356 socket will have 3 memory channels, 1 QPI and 24 PCI-E lanes plus DMI to support up to 2-way systems, and will be suitable for lower priced systems. Extreme desktops will use the LGA 2011 socket, but without QPI.
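For comparing configurations, the per-socket figures can be rolled up into system totals. The sketch below uses only the numbers quoted in this article; the class and function names are mine, the EN core count of 8 is taken as the top of the range mentioned in the 2010-11-03 update, and the totals are simple multiplication, not a statement of what any particular motherboard actually wires out to slots.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketSpec:
    name: str
    cores: int
    mem_channels: int
    qpi_links: int
    pcie_lanes: int   # per socket, not counting DMI
    max_sockets: int  # largest glue-less system the line supports


# Per-socket figures as quoted in the text (EN cores: assumed top model).
SANDY_BRIDGE_EP = SocketSpec("Sandy Bridge EP", 8, 4, 2, 40, 4)
SANDY_BRIDGE_EN = SocketSpec("Sandy Bridge EN", 8, 3, 1, 24, 2)


def system_totals(spec: SocketSpec, sockets: int) -> dict:
    """Multiply per-socket resources by socket count, within the line's limit."""
    if not 1 <= sockets <= spec.max_sockets:
        raise ValueError(f"{spec.name} supports 1 to {spec.max_sockets} sockets")
    return {
        "cores": spec.cores * sockets,
        "mem_channels": spec.mem_channels * sockets,
        "pcie_lanes": spec.pcie_lanes * sockets,
    }


for spec, sockets in ((SANDY_BRIDGE_EN, 2), (SANDY_BRIDGE_EP, 2), (SANDY_BRIDGE_EP, 4)):
    print(f"{sockets}-way {spec.name}: {system_totals(spec, sockets)}")
```

The 2-way EP row is the one behind the default-choice argument: 16 cores, 8 memory channels and 80 PCI-E lanes in a 2-socket box.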

What is interesting is that the 4-way capable Sandy Bridge EP line is targeted at both 2-way and 4-way systems. This is a departure from the old Intel strategy of premium processors for 4-way and up. Since the basis of the old strategy is no longer valid, of course a new strategy should be formulated. But too often, people only remember the rules of the strategy, not the basis. And hence blindly follow the old strategy even when it is no longer valid (does this sound familiar?)

This element of a premium 2-way system actually started with the Xeon 6500 line based on Nehalem-EX. Nehalem-EX was designed for 4-way and higher with eight cores, 4 memory channels supporting 16 DIMMs per processor and 4 QPI links. A 2-way Nehalem-EX with 8 cores, 16 DIMMs per socket might be viable versus Nehalem at 4 cores, 9 DIMMs per socket, even though the EX top frequency was 2.26GHz versus 2.93GHz and higher in Nehalem. The more consequential hindrance was that Nehalem-EX did not enter production until Westmere-EP was also in production, with 6 cores per socket at 3.33GHz. So the Sandy Bridge EP line will provide a better indicator for premium 2-way systems.

The Future of 8-way and the EX line

There is no EX line with Sandy Bridge. Given the relatively low volume of 8-way systems, it is better not to burden the processor used by 4-way systems with glue-less 8-way capability. Glue-less means that the processors can be directly connected without the need for additional bridge chips. This both lowers cost and standardizes multi-processor system architecture, which is probably one of the cornerstones for the success Intel achieved in MP systems. I am expecting that 8-way systems are not being abandoned, but rather a system architecture with "glue" will be employed.

Since 8-way systems are a specialized very high-end category, this would suggest a glued system architecture is more practical in terms of effort than a subsequent 22nm Ivy Bridge EX. Below are two of my suggestions for 8-way Sandy Bridge, or perhaps Ivy Bridge, depending on when components could be available. The first has two 4-port QPI switches (or crossbars, or routers) connecting four nodes with 2 processors per node.

The second system below has two 8-port QPI switches connecting single processor nodes.

The 2 processor node architecture would be economical, but I am inclined to recommend building the 8-port QPI switch. Should the 2 processor node prove to be workable, then a 16-way system would be possible. Both are purely speculative as Intel does not solicit my advice on server system architecture and strategy, not even back in 1997-99.
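The port counts in the two proposals can be sanity checked against the 2 QPI links per Sandy Bridge EP processor. The wiring rule in this sketch is my assumption, not something spelled out above: processors inside a node are linked directly to each other, and every remaining link goes to a switch, spread evenly across the switches. The function name is made up for illustration.

```python
QPI_LINKS_PER_PROCESSOR = 2  # Sandy Bridge EP, per the text


def ports_per_switch(nodes: int, processors_per_node: int, switches: int) -> int:
    """QPI ports each switch needs under the assumed wiring rule.

    Assumption: processors within a node are connected pairwise, each such
    link using one QPI port on both processors; all other ports go to the
    switches, evenly divided.
    """
    p = processors_per_node
    link_ends = nodes * p * QPI_LINKS_PER_PROCESSOR
    intra_node_links = nodes * p * (p - 1) // 2
    free_ends = link_ends - 2 * intra_node_links
    if free_ends < 0 or free_ends % switches != 0:
        raise ValueError("topology does not fit the QPI link budget")
    return free_ends // switches


# First proposal: four 2-processor nodes, two switches.
print(ports_per_switch(nodes=4, processors_per_node=2, switches=2))  # 4
# Second proposal: eight single-processor nodes, two switches.
print(ports_per_switch(nodes=8, processors_per_node=1, switches=2))  # 8
# 16-way: eight 2-processor nodes on the same two 8-port switches.
print(ports_per_switch(nodes=8, processors_per_node=2, switches=2))  # 8
```

Under this rule the 8-port switch covers both the single-processor-node 8-way and a 16-way built from 2-processor nodes, which is one reason to prefer building it.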

In looking at the HP DL980 diagram, I am thinking that the HP node controllers would support Sandy Bridge EP in an 8-way system.

DL980

There are cache coherency implications (directory based versus snoop) that are beyond the scope of a database server oriented topic. There was an IBM or Sun discussion of transactional memory. I would really like to see some innovation on handling locks. This is critical to database performance and scaling. For example, the database engine ensures exclusive access to a row, i.e., memory, before allowing access. Then why does the system architecture need to do a complex cache coherency check when the application has already done so? I had also previously discussed SIMD instructions to improve handling of page and row based storage, in SIMD Extensions for the Database Storage Engine.

If that were not enough, I had also called for splitting the memory system. Over the period of Intel multi-processor systems, 1995 to 2011, practical system memory has increased from 2GB to 2TB. Most of the new memory capacity is used for data buffers. The exceptionally large capacity of the memory system also means that it cannot be brought very close to the processor, as in the same package/socket.

So the memory architecture should be split into a small segment that needs super low latency byte addressability. The huge data buffer portion could be changed to block access. If so, then perhaps the database page organization should also be changed to make the metadata access more efficient in terms of modern processor architecture to reduce the impact of off-die memory access by making full use of cache line organization. The NAND people are also arguing for Storage Class Memory, something along the lines of NAND used as memory.

More on QDMPA System Architecture and Sandy Bridge.

The table below outlines key processor features from Core 2 to Sandy Bridge

            Core 2     Nehalem    Westmere   Sandy Bridge  Ivy Bridge  Haswell
process     45nm       45nm       32nm       32nm          22nm        22nm

EX
  cores     6          8          10         none
  Freq      2.66GHz    2.26GHz    2.40GHz
  cache     16M L3     24M L3     30M L3
  Mem       -          4          4
  QPI       -          4          4
  PCI-E     -          -          -

EP (EN)
  cores     2          4          6          8
  Freq*     3.33GHz    2.93GHz    3.33GHz    3.1GHz
  cache     6M L2      8M L3      12M L3     20M?
  Mem       -          3          3          4 (3)
  QPI       -          2          2          2 (1)
  PCI-E     -          -          -          40 (24)

Std
  cores     same       4          2          2 & 4
  Freq                 2.93GHz    3.6GHz?    3.6GHz?
  cache                8M         4M L3      4&8M?
  Mem                  3          2          2
  QPI                  -          -          -
  PCI-E                16         16         16

*Max frequency at launch, for models with all available cores.
PCI-E + 4 lanes for DMI

More diagrams

I was going to discuss this topic around the diagrams below, but there was too much detail to properly view the overall picture. But since the diagrams are done, we can just ramble on about them.

Below is a 2-way Core 2 system with the 5000P chipset. The Core 2 processor shown represents the quad-core Xeon 5300 or 5400 series. The processor die itself is dual-core, with both cores sharing L2 cache. There are 2 die in one socket, which is possible because the FSB protocol allows multiple processor cores to share a bus. The 5000P MCH had 4 memory channels, 24 PCI-E lanes, and the ESI, which is essentially another PCI-E x4 port with the additional capability to handle legacy functions. The ESI is called DMI on desktop systems. In later generations, Intel just calls this the DMI for all platforms.

The 5000P had adequate IO capability for databases, but could run into limitations if the system PCI-E slot widths did not match the desired adapters. There was a follow-on to the 5000P, with codename Seaburg, called the 5400. The 5400 MCH had 40 PCI-E lanes plus ESI for far greater PCI-E capability. Unfortunately the 5400 was not employed by the major server system vendors. Perhaps Nehalem was not far away. The 5400 may be used in the EMC Symmetrix VMAX, which is built on Core 2 processors. The 5000P would not have adequate IO to support the VMAX engines?

Below is a 4-way system representing the Xeon 7400 six-core processor (Dunnington) and the 7300 MCH. The 7300 MCH has the same 4 memory channels as the 2-way 5000P chipset, and just 4 more PCI-E lanes, which is a limitation. The number of supported DIMM slots is high through the use of a memory channel splitter.

Below are the Nehalem/Westmere system architectures. The memory controller is on the processor itself. The processor also has a Quick Path Interconnect (QPI) to connect processors and the IO hub (IOH). Not shown are 1P variants with PCI-E implemented directly on the processor, instead of QPI.

Below is a 4-way system, with the 10-core Westmere-EX represented.

Below is a 4P Sandy Bridge system architecture.

Below are 1P system diagrams, from left to right, Core 2, Nehalem-Lynnfield, Westmere and Sandy Bridge. The Core 2 and Nehalem systems are shown with discrete graphics attached to the PCI-E gen 2 x16. If these were server systems, graphics would attach to PCI-E x4 gen 1 downstream of the PCH.

In the Core 2 generation, there may have been a version of the MCH with graphics. Intel has been making graphics silicon for some time. But until recently, performance was mediocre and hence only targeted at systems with low-graphics requirements, with the primary objective of lower cost. The Nehalem architecture system shown is Lynnfield, with 2 memory channels, integrated PCI-E, and no QPI. A dual-core version of Nehalem with a discrete graphics die in the same package was cancelled, due to the limited time interval until a Westmere 32nm version would be available.

To support the cancelled 2-core Nehalem derivatives, the first Westmere version out was a dual-core with the same interface as Lynnfield, that is, 2 memory channels and PCI-E in the same LGA 1156 socket. Note that the dual-core Westmere does have discrete graphics in the processor package, as the LGA 1156 socket included signals for graphics.

In Sandy Bridge, the graphics is now actually integrated into the processor die. The number of non-memory silicon devices decreases from 4 in Core 2 for the dual-core version and MCH without graphics (1 CPU, 1 MCH, 1 PCH and 1 graphics), and the same goes for the number of packages. By Sandy Bridge there are now 2 silicon die and 2 packages. This is most certainly how it will remain. The processor will be manufactured on the most advanced process, while the PCH will be manufactured on an older process because it is easier to drive higher voltage signals from a larger geometry process.

Update 2011-10-05

I know this section needs to be updated. In the meantime, see The Register article on Sandy Bridge servers and Real World Technologies, Sandy Bridge for Servers.

Update 2010-11-03

The Inquirer has an update on Sandy Bridge. The socket options are LGA1155 for desktop and entry server, with 16 or 20 PCI-E lanes. This line will have 4 cores and 8M last level cache (LLC), frequency up to 3.4GHz with GPU and 3.5GHz with GPU disabled.

The Xeon 5700(?) line will be LGA 1356, 6 and 8 cores, up to 20MB LLC, 3 memory channels, 24 PCI-E lanes, and a single QPI v2 at 8GT/s.

There will be an 8-core model in LGA2011, 150W TDP, 40 PCI-E lanes, 4 memory channels, and dual QPI.

Other sources: AnandTech, Real World Technologies

Before jumping into system configuration, it is useful first to understand system architecture, including how the current system architecture came to be. A discussion of system architecture inevitably involves a comparison of the Intel Xeon versus Opteron performance characteristics, so this is briefly covered. It is important to distinguish performance characteristics at the processor and system levels where possible.

Performance History and Characteristics

One cannot simply discuss comparative system architectures without bringing up performance. To discuss the relative merits of each system architecture, it is important to distinguish between performance at the processor core level and at the complete system level as much as practical. Performance at the individual core level is still relevant because some important functions are not yet multi-threaded. The two traits of interest at the system level are scalability and overall performance. Scalability is characterized by performance or throughput versus the number of processors (cores and sockets). The important metrics are single core and overall system performance. A system architecture with excellent scalability wins points for system architecture. If the underlying processor performance is weak, it may not be competitive at the system level against another system with lesser scalability that starts with a more powerful processor core or socket.
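The interplay between core performance and scalability is easy to see with made-up numbers. Everything in this sketch is hypothetical, including both systems, the relative single core performance values and the speedups; it only illustrates that system throughput is the product of the two.

```python
def system_throughput(single_core_perf: float, speedup: float) -> float:
    """System throughput = single core performance x speedup over one core."""
    return single_core_perf * speedup


def scaling_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal (linear) scaling actually achieved."""
    return speedup / cores


CORES = 16

# Hypothetical systems: relative single core performance and 16-core speedup.
systems = {
    "A (better scaling, weaker core)": {"single_core_perf": 1.0, "speedup": 13.0},
    "B (lesser scaling, stronger core)": {"single_core_perf": 1.4, "speedup": 10.0},
}

for name, s in systems.items():
    eff = scaling_efficiency(s["speedup"], CORES)
    total = system_throughput(s["single_core_perf"], s["speedup"])
    print(f"System {name}: efficiency {eff:.0%}, throughput {total:.1f}")
```

System A wins on scalability, yet system B delivers more total throughput, which is why both metrics have to be reported.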

Next, performance cannot be characterized with a single metric, especially considering that each of the processor architectures has completely different characteristics. On any given manufacturing process, the transistors are (or were down to 90nm) more or less comparable between Intel, AMD, IBM or other foundries. An important point is that the frequency on one processor architecture has no relation to frequency on a different processor architecture. (Processor frequency is like engine RPM. The important metric is power. For an engine, power is RPM x torque. For a processor, performance is frequency x IPC, or instructions per cycle.)
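As a quick illustration of frequency x IPC, here is a sketch with two hypothetical processors. The frequencies and IPC values are invented for the example and do not describe any real part; IPC in practice also varies by workload.

```python
def instructions_per_second(frequency_ghz: float, ipc: float) -> float:
    """Billions of instructions per second = frequency (GHz) x IPC."""
    return frequency_ghz * ipc


# Hypothetical parts: one chases clock rate, the other does more per cycle.
high_clock = instructions_per_second(frequency_ghz=3.6, ipc=1.0)
high_ipc = instructions_per_second(frequency_ghz=2.8, ipc=1.5)

print(f"3.6GHz at IPC 1.0: {high_clock:.1f} billion instructions/s")
print(f"2.8GHz at IPC 1.5: {high_ipc:.1f} billion instructions/s")
```

The lower frequency part comes out ahead, so comparing GHz across architectures says nothing by itself.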

See the benchmark link for more information: Performance Benchmarks

Processor

A 4-way system like the Dell R900 has several processor options. Currently the Dell R900 processor options range from the Quad-Core Xeon E7310 1.6GHz, 2x2M L2 at $11,423, QC Xeon X7350 2.93GHz 2x4M L2 at $17,523 and the six-core X7460 2.67GHz 16M L3 at $20,423. All comparisons are done with 128GB (32x4GB) memory. The Dell R905 is $9,469 with the 2.4GHz Opteron 8378 and $13,119 with the 2.7GHz Opteron 8384. It does seem that the cost of the processor options in the bare system with processors and memory sockets populated is reasonably proportional to processor frequency (or expected performance). However, in considering the complete cost of the system with storage, and software licensing, especially if per processor licensing applies, (not even including operating costs), the cost difference between the low and high is not worth considering.

Memory

As discussed earlier, the system architecture has multiple memory channels. Depending on the system, while uneven memory configurations can be supported, best performance is achieved with identical memory modules populated in banks of 4 or 8, matched with the number of memory channels. In past years, it might be worthwhile to discuss how to analyze memory capacity requirements.

Today, a 4GB ECC DIMM costs around $115, and an 8GB ECC DIMM costs around $700 (Crucial, 5/20/2009). The typical 4-way system, including the Dell R900, has 32 DIMM sockets. It really is not worth the effort to determine if a system needs less than 128GB. By default, the system should just be purchased upfront with 32x4GB DIMMs. If there is sound technical analysis to justify 256GB memory, then this would add about $19K to the system cost.
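The arithmetic behind the $19K figure, using the DIMM prices quoted above and 32 sockets. The helper name is mine, and the sketch assumes all 32 sockets are filled with one module size.

```python
DIMM_SOCKETS = 32
PRICE_4GB = 115  # USD, as quoted above
PRICE_8GB = 700  # USD, as quoted above


def memory_config(dimm_gb: int, dimm_price: int, sockets: int = DIMM_SOCKETS):
    """Return (capacity in GB, cost in USD) with every socket populated."""
    return sockets * dimm_gb, sockets * dimm_price


cap_4, cost_4 = memory_config(4, PRICE_4GB)
cap_8, cost_8 = memory_config(8, PRICE_8GB)

print(f"32 x 4GB: {cap_4}GB for ${cost_4:,} (${PRICE_4GB / 4:.2f}/GB)")
print(f"32 x 8GB: {cap_8}GB for ${cost_8:,} (${PRICE_8GB / 8:.2f}/GB)")
print(f"Added cost of going to 8GB DIMMs: ${cost_8 - cost_4:,}")
```

Doubling capacity roughly sextuples the memory bill at these prices, which is the disproportion in price per GB that matters later.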

The long term overall memory price trend has been about a 50% reduction every 3 years, with the largest capacity module having a disproportionately higher price. By 2010-11, the 8GB ECC DIMM should drop to below $300, and a new 16GB ECC DIMM will have the disproportionately higher price of $1000-2000.
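The 50% every 3 years trend can be written as a simple decay formula. This is a sketch of the trend only, with a function name of my choosing; it ignores the extra premium on the largest module, which is why it should not be read as a forecast for any one part.

```python
def trend_price(price_today: float, years_ahead: float,
                halving_years: float = 3.0) -> float:
    """Price after years_ahead if price halves every halving_years."""
    return price_today * 0.5 ** (years_ahead / halving_years)


# Starting points are the 2009 prices quoted in the text.
for label, price in (("4GB", 115.0), ("8GB", 700.0)):
    for years in (1, 2, 3):
        print(f"{label} DIMM after {years} year(s): ${trend_price(price, years):.0f}")
```

Applied on its own to the $700 8GB DIMM, the trend leaves it above $300 two years out. The lower figure in the text presumably also counts on the 8GB module losing its top-capacity premium once the 16GB module takes that spot.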

An interesting point to consider is this. The HP ProLiant DL785G5 is an 8-way system with 64 DIMM sockets. The DL785G5 with 8 x 3.1GHz Quad-Core Opteron 8393 processors and 256GB memory (64x4GB) is priced at $62,628. By comparison, the ProLiant DL585G5 with 4 of the same 3.1GHz QC Opteron processors costs $23,001 with 128GB memory (32x4GB) (HP website, 5/20/2009). Unfortunately, the HP website does not list the 8GB DIMM module option for the DL585G5. The Dell PowerEdge R905 with 4 x 2.7GHz is priced at $13,119 with 128GB (32x4GB) and $31,388 with 8GB DIMMs (presumably 32x8GB, for 256GB). The implication is that one should consider a larger system with more DIMM sockets as an alternative to populating the original system with the largest capacity memory module if there is a large disproportion in price per GB. In particular, the Opteron system architecture ties the number of memory channels and DIMM sockets to the number of processors. So it is not possible to configure the ProLiant DL785G5 with 4 processors and 64 DIMM sockets. There is a hard tie of 8 DIMM sockets per processor socket.
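The hard tie between processors and DIMM sockets turns a memory requirement into a processor count. A minimal sketch, using the 8 sockets per processor figure from the text; the function names are mine, and real systems only come in certain socket counts (4-way or 8-way here), so the result is a lower bound.

```python
import math

DIMM_SOCKETS_PER_PROCESSOR = 8  # Opteron systems discussed in the text


def dimm_sockets(processors: int) -> int:
    """DIMM sockets available with this many processors installed."""
    return processors * DIMM_SOCKETS_PER_PROCESSOR


def min_processors(capacity_gb: int, dimm_gb: int) -> int:
    """Fewest processors whose DIMM sockets can hold capacity_gb."""
    return math.ceil(capacity_gb / (DIMM_SOCKETS_PER_PROCESSOR * dimm_gb))


print(dimm_sockets(4), dimm_sockets(8))
print(min_processors(256, dimm_gb=4))  # 256GB from 4GB DIMMs
print(min_processors(256, dimm_gb=8))  # 256GB from 8GB DIMMs
```

So 256GB means either 8 processors with the cheap 4GB module or 4 processors with the expensive 8GB module. The choice is between paying for processors (and possibly per processor licenses) and paying the large-module premium.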

IBM Power

IBM Power system architecture is out of the scope here, but this link IBM POWER Systems Overview from Blaise Barney, Lawrence Livermore National Laboratory is a good source.