Server Hardware 2009 Q3 Rev A

Joe Chang jchang6@yahoo.com

Historical Best Practices

For a long time, probably as far back as 1995, the standard recommended server system for SQL Server (and other heavy applications too) was a 4-way. Part of the reason for this was that 2-way servers were built with desktop chipsets, or otherwise lacked sufficient memory and IO capability.

On the other side, 8-way and larger servers could be very expensive, typically 2-3 times more expensive per 4-way node than a standard 4-way system. In addition, larger systems typically had non-uniform memory architectures (NUMA), which required very rare deep technical skills and quite possibly significant re-architecture of an existing database to scale. (Today, even the smaller AMD Opteron systems are technically NUMA, but because they behave more like SMP systems, they are called soft-NUMA. Larger systems with a higher ratio of remote to local memory access time are called hard-NUMA.)

On top of the system cost structure, properly determining the suitable system with a load test typically required expensive software licensing ($100K) and effort (in the form of consultants) that might contribute another $50-100K.

The cost of determining whether a 2-way server could support the expected load would usually outweigh the cost savings of the 2-way over a 4-way. If the 4-way had difficulty supporting the target load, then the cost of the 8-way or larger system would be sufficient motivation to expend more effort on tuning the database or simply cutting functionality until the “requirements” fit the 4-way system.

Those not daunted by the system cost of the 8-way or larger systems, along with SQL Server Enterprise Edition’s per processor licensing, frequently found out about the unusual NUMA system performance characteristics (after incurring the financial obligation).

Scaling SQL Server 2000 performance on NUMA systems essentially calls for the database to be architected so that the query execution plans avoid operations with which the SQL Server engine has severe problems on NUMA. To reiterate, there are two elements to the previous statement: 1) knowing the severely problematic operations of the SQL Server engine on NUMA, and 2) knowing how to write SQL that generates the specific execution plan avoiding those operations, even as data distribution statistics change.

Very few people were actually explicitly aware of these facts. Far fewer people actually possess the necessary skills to accomplish this. This is why experience is earned at great cost, and good experience has measurable value.

With all of these considerations, third party software vendors frequently recommended a 4-way system as appropriate without meaningful technical data, and customers frequently accepted this recommendation without raising hard questions.

Consideration for Best Practices Today and the Near Future

It is inescapable that long established best practices were established long ago. In the computer world of Moore’s Law, the world today is very different from the world of long ago. The 2-way server systems of the last few years built around the quad-core Intel Xeon 5300 series processors and 5000P chipset were probably adequate for a substantial portion of line-of-business database applications. The current generation of server systems with the Xeon 5500 series processors and dual 5520 IOH has both large economical memory capacity and powerful IO capability, and is more than sufficient for the large majority of transaction processing applications (that have been properly designed and tuned).

The technology of 4-way systems is approximately one year behind the 2-way server systems. So the technology base of the 2-way Xeon 5500 systems launched in early 2009 will not be available in 4-way systems until early 2010, or possibly late 2009. At this point in time, 2-way systems are based on Nehalem architecture processors and 4-way systems are based on Core 2 architecture processors.

The next generation of Intel processors for 4-way and larger systems will be based on Nehalem-EX, an eight-core derivative of Nehalem, and will probably be called the Xeon 7500 series. The 4-way Xeon 7500 systems are expected to be 2.5X more capable than the current 4-way Xeon 7400, based on a combination of having 1) a more capable processor core, 2) more cores per socket, 3) much more memory bandwidth (and memory channels), 4) better inter-processor bandwidth and 5) more IO bandwidth. In other words, it will be greatly improved in every area. The Xeon 7500 is also expected to enable much better scaling in NUMA systems.

Significant scaling improvements were made with SQL Server 2005. Many of the scaling problems of SQL Server 2000 with both a large number of processors and on NUMA systems have now been corrected. Additional improvements were made with SQL Server 2008. Today, there is a reasonably good chance that an existing database could achieve good scaling on both SMP systems with multi-core processors and big NUMA systems with just minor adjustments. This might even be achieved with query plan hints, which would not require changes to the existing code.

Significant scaling improvements were also recently announced for the Windows Server 2008 R2 operating systems in areas pertaining to scaling performance over a large number of processors and in NUMA systems.

The final element is load testing to determine the proper system. A full load test modeling every variation of real behavior is still a substantially expensive undertaking in terms of both personnel and cost. However, a very skilled person can quickly determine the key elements of an application, and conduct a basic load test of good enough quality to estimate the right system all without exceeding the cost benefits of the smaller system if appropriate.

In summary, certain 2-way server systems are now a very good choice for many applications. In the next year, expect very powerful 4-way servers. Pending investigations into the characteristics of SQL Server 2008 R2 and Windows Server 2008 R2 on Nehalem-EX based NUMA systems, it is very possible that high-end scaling can be achieved without nearly-impossible-to-find technical skills (and without a virtually complete re-architecture of the existing database plus rewriting application code).

Today 2-way Nehalem versus 4-way Core 2

Let us compare available information for a 2-way server based on the new Nehalem architecture Xeon 5500 series versus a 4-way server based on the older Core 2 architecture Xeon 7400 series.

The Dell PowerEdge T710 supports 2 Xeon 5500 series quad-core processors, has 18 memory sockets, and 6 PCI-E Generation 2 slots. The T710 can support 144GB memory with 8GB DIMMs, but maximum economical memory is 72GB with 18 x 4GB DIMMs. One of the PCI-E slots is x16, four are x8 and one is x4. There are probably dedicated PCI-E slots for the internal SAS controller and network ports. An architecture diagram for the T710 is not yet available to determine whether these slots are spread over two Intel 5520 IOH devices, where each PCI-E slot has dedicated bandwidth, or a single 5520 IOH.

In comparison, the Dell R900 supports 4 Xeon 7400 series six core or 7300 series quad-core processors, has 32 memory sockets, and 7 PCI-E generation 1 slots. Of the four x8 slots, each pair shares a common x8 uplink. So the T710 has better IO capability even with PCI-E gen 1 adapters (if based on two 5520 IOH devices). The R900 can support 256GB with 8GB DIMMs or an economical memory of 128GB with 32 x4GB DIMMs.

The table below shows prices on September 29, 2009 for the Dell PowerEdge T710 and R900 with six-core and quad-core processor options, excluding common items like additional adapters, shared storage, and the Windows Server operating system. SQL Server licensing is assumed to be per processor.

| System | T710 | R900 | R900 |
|---|---|---|---|
| Processor | 2 Xeon X5570 | 4 Xeon X7460 | 4 Xeon X7350 |
| Cores/Log Proc | 4/8 | 6/6 | 4/4 |
| Frequency | 2.93GHz | 2.66GHz | 2.93GHz |
| Memory | 72GB (18x4G) | 128GB (32x4G) | 128GB (32x4G) |
| Int Disks | 2 x 146GB 15K | 2 x 146GB 15K | 2 x 146GB 15K |
| Price | $8,644 | $20,536 | $17,536 |
| SQL Server EE License | 2 x $25K | 4 x $25K | 4 x $25K |

There is about a $9K difference between the T710 and the R900 with X7350, and $12K for the R900 with X7460. The 56GB difference in memory contributes approximately $2K of the difference. In either case, the main cost component is SQL Server per processor licensing at $50K for the T710 and $100K for either R900. Even if the R900 has already been purchased, it is less expensive to purchase a new T710 with two processor licenses than to buy the four processor licenses needed to deploy the R900. The R900 could be repurposed for other applications where SQL Server per processor license costs do not apply. For a data warehouse, EE and CAL might be preferred. Where suitable, Standard Edition is $6K per processor.
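The arithmetic behind this comparison can be sketched in a few lines of Python. The prices are the ones in the table above; the function and variable names are made up for illustration, and treating the already-purchased R900 as a sunk cost is an assumption about how to read the scenario.

```python
# Prices from the table above (September 29, 2009).
SQL_EE_PER_PROC = 25_000  # SQL Server Enterprise Edition, per processor

systems = {
    "T710 2x X5570": {"hardware": 8_644, "sockets": 2},
    "R900 4x X7460": {"hardware": 20_536, "sockets": 4},
    "R900 4x X7350": {"hardware": 17_536, "sockets": 4},
}


def total_cost(hardware, sockets, license_per_proc=SQL_EE_PER_PROC):
    """Hardware price plus per-processor SQL Server licenses."""
    return hardware + sockets * license_per_proc


totals = {name: total_cost(s["hardware"], s["sockets"]) for name, s in systems.items()}

# Scenario from the text: the R900 is already owned, so its hardware price
# is treated as sunk and only the four licenses count (an assumption).
deploy_owned_r900 = 4 * SQL_EE_PER_PROC
buy_new_t710 = totals["T710 2x X5570"]

for name, cost in totals.items():
    print(f"{name}: ${cost:,}")
print(f"Deploy owned R900: ${deploy_owned_r900:,} vs new T710: ${buy_new_t710:,}")
```

With these inputs the new T710 plus two licenses comes to $58,644, against $100,000 in licenses alone for the R900.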

Below are the SPEC CPU 2006 Integer base results for recent Xeon and Opteron processors. SPEC CPU used to be a single core metric, but new compilers generate threaded code for some of the components.

SPEC CPU 2006 Integer Base

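| Date | System | Cores | Frequency | Cache | Process | SPEC |
|---|---|---|---|---|---|---|
| - | 4 x Xeon X7350 | Quad | 2.93GHz | 2x4M | 65nm | 21.0 |
| - | 4 x Opteron 8360 | Quad | 2.50GHz | 2M | 65nm | 14.4 |
| - | 4 x Xeon X7460 | Six | 2.66GHz | 16M | 45nm | 21.7 |
| - | 4 x Opteron 8384 | Quad | 2.70GHz | 6M | 45nm | 16.9 |
| - | 2 x Xeon X5570 | Quad | 2.93GHz | 8M | 45nm | 31.5 |
| - | 4 x Opteron 8439 | Six | 2.80GHz | 6M | 45nm | 19.7 |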

TPC-C

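| Date | System | Cores | Frequency | Cache | Memory | Process | tpm-C |
|---|---|---|---|---|---|---|---|
| 09/2007 | 4 x Xeon X7350 | Quad | 2.93GHz | 2x4M | 128G | 65nm | 407,079* |
| 07/2008 | 4 x Opteron 8360 | Quad | 2.50GHz | 2M | 256G | 65nm | 471,883 |
| 08/2008 | 4 x Xeon X7460 | Six | 2.66GHz | 16M | 256G | 45nm | 634,825† |
| 11/2008 | 4 x Opteron 8384 | Quad | 2.70GHz | 6M | 256G | 45nm | 579,814 |
| 03/2009 | 2 x Xeon X5570 | Quad | 2.93GHz | 8M | 144G | 45nm | 631,766‡ |
| n/a | 4 x Opteron 8439 | Six | 2.80GHz | 6M | - | 45nm | no result |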

* IBM x3850M2 result 516,752 on proprietary chipset, 256G, DB2
† IBM x3850M2 result 684,508 on proprietary chipset, SQL Server
‡ ProLiant DL370G6 on Oracle 11g (Microsoft switched to TPC-E)

TPC-E

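| Date | System | Cores | Frequency | Cache | Memory | Process | tps-E |
|---|---|---|---|---|---|---|---|
| 02/2008 | 4 x Xeon X7350 | Quad | 2.93GHz | 2x4M | 128G | 65nm | 479.51* |
| 07/2008 | 4 x Opteron 8360 | Quad | 2.50GHz | 2M | - | 65nm | no result |
| 09/2008 | 4 x Xeon X7460 | Six | 2.66GHz | 16M | 128G | 45nm | 729.65* |
| 02/2009 | 4 x Opteron 8384 | Quad | 2.70GHz | 6M | 64G | 45nm | 635.43 |
| 07/2009 | 2 x Xeon X5570 | Quad | 2.93GHz | 8M | 96G | 45nm | 817.15 |
| n/a | 4 x Opteron 8439 | Six | 2.80GHz | 6M | - | 45nm | no result |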

* IBM x3850M2 on proprietary chipset

A note of caution is in order when comparing the Xeon 5500 series with the Xeon 7300/7400 and Opteron processors. The higher-end Xeon 5500 models have the Hyper-Threading (HT) feature, which probably makes a substantial improvement to TPC-C and TPC-E results. HT does not benefit all SQL operations evenly. Still, the 2-way Xeon 5570 is very impressive compared to all the most recent 4-way systems, even the 24-core 4-way Xeon 7460.
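One rough way to see how lopsided the comparison is: normalize the published TPC-E results by socket and by physical core. The sketch below uses only the tps-E figures from the table above; the helper name is illustrative. The per-core figure counts physical cores, so the X5570 number includes whatever Hyper-Threading contributes.

```python
# (sockets, cores per socket, tpsE) taken from the TPC-E table above.
results = {
    "4 x Xeon X7350": (4, 4, 479.51),
    "4 x Xeon X7460": (4, 6, 729.65),
    "4 x Opteron 8384": (4, 4, 635.43),
    "2 x Xeon X5570": (2, 4, 817.15),
}


def normalize(sockets, cores_per_socket, tps_e):
    """Return (tpsE per socket, tpsE per physical core)."""
    return tps_e / sockets, tps_e / (sockets * cores_per_socket)


per_socket = {}
per_core = {}
for name, (sockets, cores, tps) in results.items():
    per_socket[name], per_core[name] = normalize(sockets, cores, tps)
    print(f"{name:18s} {per_socket[name]:7.1f} tpsE/socket "
          f"{per_core[name]:6.1f} tpsE/core")
```

Since TPC-E results also depend on memory, storage and tuning, these ratios are only a coarse indicator, not a processor benchmark.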

Below are recent TPC-H scale factor 100 results. Be aware that the full SF100 TPC-H database is approximately 170GB in SQL Server 2005 with the 8-byte datetime type. So the 4-way system with 128GB memory (and somewhat less SQL Server data buffer cache) cannot hold the entire database in memory.

TPC-H Scale Factor 100GB

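| Date | System | Cores | Frequency | Cache | Memory | Process | QphH |
|---|---|---|---|---|---|---|---|
| 04/2008 | 4 x Xeon X7350 | Quad | 2.93GHz | 2x4M | 128G | 65nm | 34,989.9 |
| n/a | 4 x Xeon X7460 | Six | 2.66GHz | 16M | n/a | 45nm | no result |
| 06/2009 | 2 x Xeon X5570 | Quad | 2.93GHz | 8M | 48G | 45nm | 28,772.9 |
| 08/2009 | 2 x Xeon X5570 | Quad | 2.93GHz | 8M | 144G | 45nm | 51,180.5 |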

* TPC-H QphH results are scaled by size, but results at different scale factors should not be compared directly.

The two Xeon 5570 systems use SSD storage, versus HDD on the 4-way, so the IO impact is uncertain. The 2-way X5570 system with 144GB memory runs SQL Server 2008; with the 4-byte date type, the database is approximately 140GB, which may almost fit in the buffer cache.
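A back-of-envelope check of how much of the SF100 database each system can cache, using the sizes quoted in the text (about 170GB on SQL Server 2005, about 140GB on SQL Server 2008). The 8GB reserved for the operating system and non-buffer memory is a made-up placeholder, not a measured value, and the function name is illustrative.

```python
def cache_coverage(db_gb, memory_gb, reserved_gb=8):
    """Fraction of the database that fits in the buffer cache (capped at 1.0).

    reserved_gb is a hypothetical allowance for the OS and non-buffer memory.
    """
    buffer_gb = max(memory_gb - reserved_gb, 0)
    return min(buffer_gb / db_gb, 1.0)


cases = {
    "4-way X7350, 128GB, SQL Server 2005 (~170GB db)": (170, 128),
    "2-way X5570, 144GB, SQL Server 2008 (~140GB db)": (140, 144),
}

for label, (db_gb, memory_gb) in cases.items():
    print(f"{label}: {cache_coverage(db_gb, memory_gb):.0%} cached")
```

Under that assumption the 4-way system caches roughly 70% of the data while the 144GB 2-way system caches nearly all of it, which is one reason the IO difference matters when reading the results.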

There is some uncertainty in interpreting the results, but it is within reason to expect that the 2-way Xeon 5570 is in the same class as a 4-way Xeon 7350. No 4-way results are available for the Xeon 7460 six-core.

Direct testing between the Dell T710 and R900 in matched like for like configuration is desired for final assessment. But there is enough public data to support consideration for the 2-way Xeon 5570 over the 4-way Xeon 7300 or 7400 series.

Nehalem-EX

From the information available, Nehalem-EX will become the Xeon 7500 series, superseding the Xeon 7400 series. Previously, Intel did not target 7000 series processors at 2-way systems. Now that AMD has the six-core Istanbul Opteron processor in 2-way systems, it is expected that system vendors will have 2-way systems based on Nehalem-EX. This might become a Xeon 6500 series. The Xeon 5500 series processors have 3 memory channels and 2 QPI links in a 1366-pin package. Nehalem-EX has 4 memory channels and 4 QPI links, so Xeon 5500 and 6500 would not be interchangeable.

 

Server Hardware 2009 Q3

Given the wide range of servers offered by the major system vendors, it can be confusing to decide which is best suited to SQL Server. Dell has no fewer than six 2-way Xeon 5500 server systems excluding blades, 3 in tower and 3 in rack. HP might have more. When the full configuration is considered in light of the total deployment cost, the choice is much simpler. The lower end 2-way systems have no meaningful price advantage over the high-end 2-way system, which has more DIMM sockets and more IO bandwidth. So the choice is really between the top 2-way, the 4-way, and 8-way systems or higher. The 2-way to 8-way have options for either Intel or Opteron processors. The 16-way is Intel only.

2-socket systems

The Dell PowerEdge 2900 is included for comparison with the previous generation. Dell came out first in tower chassis with the T610, which did not seem to be the replacement for the 2900, especially considering that there was an R710 in 2U.

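| Vendor | Dell | Dell | Dell | HP | HP |
|---|---|---|---|---|---|
| Model | PowerEdge 2900 | PowerEdge T610 | PowerEdge T710 | ProLiant DL370G6 | ProLiant DL385G6 |
| CPU Series | Xeon 5400 | Xeon 5500 | Xeon 5500 | Xeon 5500 | Opteron 2400 |
| Architecture | Core 2 | Nehalem | Nehalem | Nehalem | Istanbul |
| Cores/Socket | 4 | 4 | 4 | 4 | 6 |
| Hyper-Thread | No | Yes | Yes | Yes | No |
| DIMM sockets | 12 | 12 | 18 | 18 | 16 |
| IOH | 5000P? | 5520 | 2 x 5520? | 2 x 5520 | - |
| PCI-E | Gen 1 | Gen 2 | Gen 2 | Gen 2 | ? |
| x16 | - | - | 1 | 2 | - |
| x8 | 1 | 2 | 4 | 1+1 (NIC) | 2 |
| x4 | 3 | 3+1 | 1+1 | 6 | 4 |
| PCI-X | 2 | - | - | - | - |
| Int. HDD | 8+2 | 8 | 8/16 | 6+6+2 | 6/16 |
| TPC-E tpsE* | 295.27 | 766.47† | - | - | - |
| TPC-H QphH* | - | 28,773 SF100 | - | 51,180 SF100 | - |
| Configuration | - | - | - | - | - |
| Price | $4,537 | $5,546 | $5,417 | $8,809 | $6,858 |
| CPU | 2x E5440 | 2x X5550 | 2x X5550 | 2x X5550 | 2x 2435 |
| Memory | 12x4GB | 12x4GB | 12x4GB | 12x4GB | 12x4GB |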

* TPC-E and TPC-H results are generally posted for the top available frequency (the X5460 and X5570), and sometimes with maximum memory.

† Dell result. IBM posted 817.15 tpsE for the x3650M2 with 2xX5570. Both systems configure 96GB memory. Dell did not employ port affinity tuning.

Just recently, Dell released the T410 and T710, filling out the Xeon 5500 series 2-socket tower chassis lineup. There is only a very small price difference between the T610 and T710 at the base model. When configured with dual power supplies, the T710 is actually slightly less expensive than the T610.

The T410 has 1 x8 and 4 x4 PCI-E slots and 8 DIMM sockets.

The Dell R710, in 2U, has 18 DIMM sockets and 2 x8 plus 2 x4 PCI-E Gen 2 slots. The R610 is 1U, with 12 DIMM sockets and 2 x8 PCI-E slots. The R410 uses the 5500 IOH, with 8 DIMM sockets and 1 x16 PCI-E slot plus disk.

Earlier, I said I liked the ProLiant ML/DL370G6 because it implemented two 5520 IOH devices for 72 PCI-E gen 2 lanes. But I did not like the 2 x16 slots because database servers cannot really use these extra-wide slots. A combination like 7 x8 + 4 x4 slots would have been better. Also, using one of the x8 slots for the 4 included GbE ports is a waste. This should have occupied an x4 slot, or better yet, just implement the pair of GbE ports on the ICH, which attaches off the ESI port, instead of squandering a valuable PCI-E slot. We can then use the PCI-E slots for our choice of IO.

The T710 appears to have 56 PCI-E lanes configured (1x16+4x8+2x4) but there is no documentation that actually says the T710 implements two of the 5520 IOH devices. Still, 5 wide slots (1x16 + 4x8) are better than 2 on the T610, but 6 x8 would have been better. Let the workstation people have the x16 slots. A pair of available x4 slots would be nice too (one is used by the internal storage controller).
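The lane counts quoted above are easy to tally. The sketch below sums lanes per slot layout and converts to nominal one-direction bandwidth using the standard PCI-E figures of 250MB/s per lane for Gen 1 and 500MB/s for Gen 2. The 36 lanes per 5520 IOH is the figure implied by "two 5520 IOH devices for 72 lanes"; the slot layouts are read from the comparison table, and the function names are illustrative. The IOH estimate assumes no PCI-E switches or shared uplinks, which is exactly what is undocumented for the T710.

```python
import math

# Nominal per-lane bandwidth in MB/s, one direction, by PCI-E generation.
MB_PER_LANE = {1: 250, 2: 500}
LANES_PER_5520_IOH = 36  # 72 lanes across two 5520 IOH devices


def total_lanes(slots):
    """slots maps lane width -> number of slots of that width."""
    return sum(width * count for width, count in slots.items())


def nominal_bandwidth_gb(slots, gen):
    """Aggregate one-direction slot bandwidth in GB/s (nominal, not achievable)."""
    return total_lanes(slots) * MB_PER_LANE[gen] / 1000


def min_ioh_count(slots):
    """Fewest 5520 IOH devices that could give every lane dedicated bandwidth."""
    return math.ceil(total_lanes(slots) / LANES_PER_5520_IOH)


dl370g6 = {16: 2, 8: 2, 4: 6}  # 2 x16, 1+1 x8 (one used by the NIC), 6 x4
t710 = {16: 1, 8: 4, 4: 2}     # 1 x16, 4 x8, 1+1 x4 (one used by storage)

for name, slots in (("DL370G6", dl370g6), ("T710", t710)):
    print(f"{name}: {total_lanes(slots)} lanes, "
          f"{nominal_bandwidth_gb(slots, 2):.0f} GB/s nominal at Gen 2, "
          f"at least {min_ioh_count(slots)} x 5520 IOH for dedicated lanes")
```

Because 56 lanes exceed what a single 5520 provides, the T710 either has two IOH devices or some slots share bandwidth; the arithmetic alone cannot say which.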

Finally there is the ProLiant DL385G6, which supports the new six-core Istanbul. This makes the DL385G6 a really powerful web server, but I would prefer more IO bandwidth for databases. Also, I do not know if the PCI-E slots are Gen1 or Gen2.

The Dell R805 now also supports the six-core Opteron 2400 series. Price with 2 x 2435 (2.6GHz) and 32GB memory is $3,734 (probably another $460 to bring it to the 48GB reference used above, because the 2900 has 12 DIMM slots).

4-socket systems

The 4-way landscape will change when the Nehalem-EX (Intel Xeon 7500?) systems come out. For now, we have Intel Xeon 7400 series and AMD Opteron 8400, both at 6 cores.

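| Vendor | Dell | Dell | HP | HP | HP |
|---|---|---|---|---|---|
| Model | PowerEdge R900 | PowerEdge R905 | ProLiant DL580G5 | ProLiant DL585G6 | ProLiant DL785G6 |
| CPU Series | Xeon 7400 | Opteron 8400 | Xeon 7400 | Opteron 8400 | Opteron 8400 |
| Architecture | Dunnington | Istanbul | Dunnington | Istanbul | Istanbul |
| Cores/Socket | 6 | 6 | 6 | 6 | 6 |
| Hyper-Thread | No | No | No | No | No |
| DIMM sockets | 32 | 32 | 32 | 32 | 64 |
| IOH | 7300 | - | 7300 | - | - |
| PCI-E | Gen 1 | - | Gen 1 | - | - |
| x16 | - | - | - | - | 3 |
| x8 | 4 (2x2) | 2 | - | 3 | 3 |
| x4 | 3 | 5 | - | 4 | 5 |
| PCI-X | 2 | - | - | 2 | - |
| TPC-E tpsE* | 671.35† | - | - | - | - |
| TPC-H QphH* | - | - | - | - | 91,558 (SF300) |
| Configuration | - | - | - | - | - |
| Price | $20,236 | $16,437 | $26,268 | $23,570 | $57,285 |
| CPU | 4x X7460 | 4x 8435 2.6GHz | 4x X7460 | 4x 8439 | 8x 8439 2.8GHz |
| Memory | 32x4GB | 32x4GB | 32x4GB | 32x4GB | 64x4GB |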

*TPC-E and TPC-H results are generally posted for the top available frequency

† IBM result 729.65 on x3950M2.

Previously, I criticized Intel for being myopic in its obsessive focus on the 4-way platform, i.e., not making the 6-core Dunnington available in 2-way systems. In earlier generations, the 4-way processor had nothing special over the 2-way, and was usually 1 year behind. Anyways, AMD is first to reach 6-core in 2-way. Hopefully, with the powerful 8-core Nehalem-EX, there will be 2-socket systems. For a long time, Microsoft has recommended 4-way as the default choice for database servers. I think this should now be a 2-way, once we get to 6-core or more.

This system should handle most loads. And even if it turns out that a larger system is needed, the 2-way didn’t cost much and can always be used for other purposes. When consolidating small databases, a few 2-way systems are more flexible than one big system.

Big Iron

The big iron systems are shown in comparison with the 8-way ProLiant 785. Notice that there is only a slight price premium going from two 4-way Opteron systems to one 8-way system ($47K and $57K respectively). It used to be that the premium was much larger. The prices of the Unisys and NEC systems appear to be about $33-37K for each 4-socket node. It used to be that a 4-socket node in the big-iron systems was around $80-90K. Right now we cannot really use the full power of the 16-socket system with the six-core Xeon 7400, because Windows Server 2008 can only support 64 cores. Soon R2 will be out, and we can see how SQL Server scales.

However, I really think we need the better QPI interconnect technology of the Nehalem-EX to properly benefit from 128+ cores.

For a long time AMD crowed about how the Opteron HT interconnect and memory bandwidth scaled with the number of sockets, while Intel was constrained by the FSB. Yet the largest Opteron systems were the 8-socket systems from HP and Sun. The ProLiant 785 was very impressive on TPC-H, yet strangely there have been no published TPC-C or TPC-E results. Now with Istanbul, AMD mentions they have added HT-assist, similar in function to the snoop filter on Intel chipsets. Without this, there is excessive traffic to maintain cache coherency. This is not a simple matter and I expect it will take AMD an iteration or two to work out the bugs. Intel had similar difficulties with the snoop filter in their chipsets.

| Vendor | HP | Unisys | Unisys | NEC |
|---|---|---|---|---|
| Model | ProLiant DL785G6 | ES7000 7600R | ES7000 7600R | Express5800 A1160 |
| System Price | $48,997 | $66,729 | $135,003 | $145,596 |
| Memory Price | +8,288 | +19,136 | $46,376 | $50,344 |
| CPU | 8 x 8439 2.8GHz | 8 x X7460 | 16 x X7460 | 16 x X7460 |
| Memory | 64x4 (256GB) | 256GB | 512GB | 512GB |
| TPC-E tpsE* | - | 1,165.6 | 1,493.4 | 1,568.2 |
| TPC-H QphH* | 91,558 (SF300) | - | 80,173 SF10K | - |

* TPC-H QphH results are scaled by size, but results at different scale factors should not be compared directly.
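The per-node price range and the core-count limit mentioned in this section can be checked with a short calculation. It uses the system prices from the table above (excluding memory) and the 64-core Windows Server 2008 limit stated in the text; the names are illustrative.

```python
WINDOWS_2008_MAX_CORES = 64  # limit stated in the text
CORES_PER_X7460 = 6          # six-core Xeon X7460

# (system price excluding memory, sockets) from the table above.
big_iron = {
    "Unisys ES7000 7600R, 8-socket": (66_729, 8),
    "Unisys ES7000 7600R, 16-socket": (135_003, 16),
    "NEC Express5800 A1160, 16-socket": (145_596, 16),
}


def price_per_node(system_price, sockets, sockets_per_node=4):
    """System price divided across 4-socket nodes."""
    return system_price / (sockets / sockets_per_node)


node_prices = {}
for name, (price, sockets) in big_iron.items():
    node_prices[name] = price_per_node(price, sockets)
    cores = sockets * CORES_PER_X7460
    usable = min(cores, WINDOWS_2008_MAX_CORES)
    print(f"{name}: ${node_prices[name]:,.0f} per 4-socket node, "
          f"{cores} cores, {usable} usable under Windows Server 2008")
```

The 16-socket systems have 96 cores, so a third of them sit idle under the 64-core limit until Windows Server 2008 R2.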

TPC-C

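| Date | System | Cores | Frequency | Cache | Process | tpm-C | SPEC |
|---|---|---|---|---|---|---|---|
| 07/2006 | 4 x Itanium 2 9050 | Dual | 1.6GHz | 18M | 90nm | 344,928 | 13.9 |
| 08/2008 | 4 x Xeon X7460 | Six | 2.66GHz | 16M | 45nm | 634,825† | 21.7 |
| 11/2008 | 4 x Opteron 8384 | Quad | 2.70GHz | 6M | 45nm | 579,814 | 16.9 |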

† IBM x3850M2 result 684,508 on proprietary chipset, SQL Server

Storage: a brief note concerning direct attach storage for DW applications. The HP P800 is $849, the MSA70 (up to 25 SFF drives) is $3,199, and each 72GB 15K SFF drive is $349. So the set of 1 P800, 1 MSA70 and 16 HDDs ($5,584) is $9,632, and 1 P800 + 2 MSA70 + 32 drives is $18,415 ($575/disk).
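The storage totals follow directly from the three list prices. A small sketch, with an illustrative function name, that reproduces them and gives the amortized cost per disk:

```python
# HP list prices quoted in the text.
P800_CONTROLLER = 849
MSA70_ENCLOSURE = 3_199   # holds up to 25 SFF drives
DRIVE_72GB_15K = 349


def das_cost(controllers, enclosures, drives):
    """Total price of a direct attach storage set."""
    return (controllers * P800_CONTROLLER
            + enclosures * MSA70_ENCLOSURE
            + drives * DRIVE_72GB_15K)


small = das_cost(controllers=1, enclosures=1, drives=16)
large = das_cost(controllers=1, enclosures=2, drives=32)

print(f"1 P800 + 1 MSA70 + 16 drives: ${small:,} (${small / 16:,.0f}/disk)")
print(f"1 P800 + 2 MSA70 + 32 drives: ${large:,} (${large / 32:,.0f}/disk)")
```

The per-disk cost drops as the controller is shared across more drives, from about $602 with 16 drives to about $575 with 32.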

Earlier, JK blogged and ranted about buying bigger hardware.

Let me assure people that talent for highly efficient and scalable (over many processor cores) code was drained long ago, and not just by the recent emergence of multi-core processors. For that matter, I am not sure it was ever prevalent, because developers built code on small single socket (and single core in the old days) boxes without consideration for the issues that only occur in SMP or NUMA systems. Of course, it is not necessarily a bad thing that a person of modest skills can build an application, so long as it is rebuilt before moving to the big time.

But we should look at the flip side, from the point of view of both the software vendor (who pays for developers to write code) and the CIO (of the company that pays the sys admin and DBA to maintain the production servers). The high level people responsible for running a business or department are almost never (except for Bill G and a few others) proficient coders. To rise in business or administration, over the course of time, they learn to assess value on different metrics. The simplest metric is money, and related, head count.

Let’s start with the ISV that sells software that has high (perceived) value. Suppose the probable full deployment cost might be on the order of $10M (£, € or ¥). The software vendor might want to target a price of $1M for the software license, plus 10% per year for support. Now if this software were super efficient, it might run on 2-socket database servers costing $10K each (two servers in a cluster + 2 for the DR cluster, totaling $40K) plus storage, and $4K web servers (again, 2 for redundancy, plus 2 more for DR).

The CIO will then ask: why am I paying $1M for software that runs on a $10K database server and a $4K web server? Now suppose this software were so grossly inefficient that it required 8-way database servers costing $200K each (a total of 4: a 2-node cluster plus another pair for DR) and 10 web servers (plus another 10 for DR).

Well then, now it seems perfectly reasonable that the software price is $1M. If the application still did not run well, then it certainly justifies having lots of consultants on permanent assignment to make sure it works.

Now let’s look at this from the CIO point of view. Suppose your company has two or more enterprise wide database applications, each administered by a separate DBA. You, being a super proficient SQL Server performance expert, enable your database to run on a 2-way quad-core system. The other guy, who has no such skills, proposed and deployed an Oracle RAC solution spanning eight 4-socket quad-core servers. (The intent here is not to pick on Oracle; Larry is exceptional at making money and has a really nice yacht to boot.) Who will know that with proper tuning, it would have also run on a single 2-socket system (this could still be a 2-node RAC for redundancy)?

When it comes time for the annual review, which DBA will rate higher, the one that maintains a simple $200K (hardware + database licensing) environment or the one that handles the complex $2M environment? The CIO and HR know that the other DBA has a complex environment because of how long it took to get set up and all the very expensive consultants that had to be hired to do it. You did yours without expensive outside help.

From the CIO and HR point of view, what is an appropriate pay scale for each DBA? Does either the CIO or HR have the technical knowledge to put a value on what you did to save the company money? Or are they using other metrics?

If you think all this is highly irrational, take a long look at how things work in your organization and comment.