
Additional related material:
  Rethinking System Architecture (2017-Jan),
  Memory Latency, NUMA and HT (2016-Dec),  The Case for Single Socket (2016-04)

 Server Systems 2016 ,  2014 Q2,  2012 Q4,  2011 Q3,  2010 Q3,  2009 Q3 (many of these were never completed)

 NEC Express5800/A1080a (2010-06),  Server Sizing (Interim) (2010-08),
 Big Iron Revival III (2010-09),  Big Iron Revival II (2009-09),  Big Iron Revival (2009-05),
 Intel Xeon 5600 and 7500 series (2010-04, this material has been updated in the new links above)

Server Sizing 2017 (2017-Jul)

The large majority of server systems in enterprise environments, whether organic enterprises or hosting and cloud providers, are 2-socket systems. Given the extremely large scale of these operations, one would think that there are strong technical reasons for this choice. Except that there are not.

The more probable explanation is that it is generally accepted that real server systems, excluding those used by small businesses, are 2-socket. It is true that there was a period when the 2-way (now 2-socket) was a very good choice for the baseline server system. The period in question ran from roughly 1996 to about 2010-12.

Baseline means the default choice that is the better choice for most scenarios, particularly in cases where the effort to conduct a proper assessment is not warranted.

However, there are strong arguments that the baseline server from 2010-12 on should be the single-socket system. Just to be clear, the single-socket system proposed as the default choice is based on the Intel Xeon 5600 with 6 cores and the E5 (v1) processors with up to 8 cores.

The argument for switching the baseline server from 2-socket to single-socket is not driven simply by the increasing number of cores on one processor die. A more compelling motivation is the move to integrated memory controllers, Intel being a holdout for complicated reasons until 2009. This means that all multi-socket systems now have Non-Uniform Memory Access (NUMA) architecture.

It is possible to achieve good and even excellent scaling on NUMA systems, but this does not happen by accident. Fundamentally, it only happens when the hardware details are correctly incorporated into the software architecture. The deciding driver is then that software not architected for NUMA should run on single-socket systems as the default choice, unless the same effect is achieved by other means. A secondary facilitator for single-socket is all-flash storage becoming both viable and popular.
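
To make this concrete, below is a minimal sketch (written for this page, not taken from any product) of what incorporating NUMA into the software looks like at the lowest level, using the Windows NUMA APIs: the thread is confined to one node's processors and its working buffer is committed with that node as the preferred node, so its memory accesses stay local. A real server process would do this per worker thread and partition its data accordingly; the same idea applies with libnuma on Linux.

  /* Minimal NUMA-locality sketch for Windows (illustrative only).
     Pin the current thread to node 0 and allocate its buffer on node 0. */
  #include <windows.h>
  #include <stdio.h>

  int main(void)
  {
      ULONG highestNode = 0;
      if (!GetNumaHighestNodeNumber(&highestNode))
          return 1;
      printf("NUMA nodes: %lu\n", highestNode + 1);

      /* Restrict the current thread to the processors of node 0. */
      GROUP_AFFINITY affinity;
      if (GetNumaNodeProcessorMaskEx(0, &affinity))
          SetThreadGroupAffinity(GetCurrentThread(), &affinity, NULL);

      /* Commit 1GB with node 0 as the preferred node. */
      SIZE_T size = (SIZE_T)1 << 30;
      char *buf = (char *)VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                                             MEM_RESERVE | MEM_COMMIT,
                                             PAGE_READWRITE, 0);
      if (buf == NULL)
          return 1;

      /* Touch each page so it is faulted in while the thread is on node 0. */
      for (SIZE_T i = 0; i < size; i += 4096)
          buf[i] = 1;

      VirtualFree(buf, 0, MEM_RELEASE);
      return 0;
  }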

Server System History

Before 1996, multi-processor systems typically required proprietary components to adapt a processor to an interconnect to additional processors. This meant that the 2-way system was significantly more expensive than a single-socket system.

In 1995/96, the Intel Pentium Pro processor featured glue-less multi-processing: up to four processors could be directly connected without additional logic chips. This made possible a 2-way system with only a marginally higher base system cost than a single-socket system; a simplified representation is shown below.

PPro

In part, this was because the 2-way system shared many components with a desktop system, the difference being the extra socket for a second processor. There was a large separation between 2-way and 4-way because the 4-way supported a much larger memory system and the extra PCI busses required by database servers, and was also matched with large-cache processors, whose larger benefit was to support scaling at the 4-way level.

The next major development was the AMD Opteron processor, with an integrated memory controller, among many other noteworthy features.

Opteron

A consequence of the integrated memory controller is that all multi-socket systems now have NUMA architecture. At the time, it was known that large NUMA systems had complex characteristics. It was possible for software to scale on big-iron NUMA systems, but software not architected for a NUMA system could experience poor to severely negative scaling.

The 2-way and 4-way Opteron systems did not seem to have such negative characteristics. This may have been because the local memory latency was very low and even the remote memory access time might be comparable to processors without an integrated memory controller.

The integrated memory controller had significant advantages, particularly in single-socket systems. But Intel had committed to the quad-pumped front-side bus (FSB) with Pentium 4 in late 2000, and wanted to allow its supply chain to amortize their investment. The FSB was retained in subsequent P4 derivatives and the next-generation Core 2 architecture. Not until Nehalem did Intel finally adopt an integrated memory controller in their mainline processors.

Broadwell

Skylake

AWS dedicated hosts https://aws.amazon.com/ec2/dedicated-hosts/pricing/
2-socket hosts: 20, 36, or 24 cores; 4-socket hosts: 72 cores
E5-2699 v3 18-core, 2.3GHz, price n/a
12-core E5-2600 v3 models (7 in the line-up):
  2690 v3, 2.6GHz, $2,090
  2680 v3, 2.5GHz, $1,745
  2670 v3, 2.3GHz, $1,589
  2650L v3, 1.8GHz, $1,329

Instance  Sockets  Cores  $/hr     3-yr
c3        2        20
c4        2        20     $1.75    $17,962
p2        2        36     $15.84   $184,780
g3        2        36     $5.016   $58,474
m3        2        20
d2        2        24     $6.072   $45,726
r4        2        36     $4.682   $46,260
r3        2        20     $2.926   $26,965
m4        2        24     $2.42    $23,913
i2        2        20
i3        2        36     $5.491   $61,043
x1        4        72     $14.674  $107,879

Server Sizing 2016 (2016-Dec)

Recommended System 2016

Options and considerations are given further down on this page. But for now, I will jump straight into an example with pricing.

Medium System

Component    Description                    Price   Comment
Processor    Xeon E5-2630 10-core 2.2GHz    $667    feel free to adjust the specific SKU
Motherboard  Supermicro X10SRL-F            $272
Chassis      Supermicro CSE-732D4F-903B     $270
Heat Sink    Supermicro SNK-P0050AP4        $41
Memory       Crucial 64GB (4×16GB) kit      $548    adjust as necessary
SATA SSD     Any MLC                        $200?
PCI-E SSD    Intel 750 or P3600 400GB       $350    adjust as necessary

Heavy System

Component    Description                    Price    Comment
Processor    Xeon E5-2699 22-core 2.2GHz    $4,115   feel free to adjust the specific SKU
Motherboard  Supermicro X10SRL-F            $272
Chassis      Supermicro CSE-732D4F-903B     $270
Heat Sink    Supermicro SNK-P0050AP4        $41
Memory       Crucial 8×32GB                 $2,100   adjust as necessary
SATA SSD     Any MLC                        $200?
PCI-E SSD 1  Intel P3600 1.2/1.6/2.0TB      <$1/GB   ?
PCI-E SSD 2  Intel P3600 1.6/3.2/4TB        >$2/GB   ?
PCI-E SSD 3  Micron 9100 1.6/2.4/3.2TB      ?

1 x 20-core versus 2 x 10-core

Component    E5-2698 20-core   2×E5-2630 10-core   Comment
Processor    $3,226            $1,334
Motherboard  $272              $500 est.
Chassis      $270              $270

System Options 2016

Below are the Intel Xeon processor family and series choices. In the Xeon E5 2600 series, we can opt for 1 or 2 sockets. Presumably the only reason we would consider the E5 4600 is for a 4-socket system. In the E7 series, we could opt for 2, 4 or 8 sockets. There is probably not much benefit for 2-socket, and some liabilities. For the time being, 8-socket is beyond the scope here.

Codename   Process  Family      Series  Cores  DIMMs  PCI-E  Sockets
Skylake    14nm     Xeon E3 v5          4      4      16     1
Broadwell  14nm     Xeon E5 v4  2600    22     12     40     2
Broadwell  14nm     Xeon E5 v4  4600    22     12     40     4
Broadwell  14nm     Xeon E7 v4  4800    16     24     32     4
Broadwell  14nm     Xeon E7 v4  8800    24     24     32     8

I am excluding Xeon D, which is really for specialty requirements. There is one E5 1600 v4 SKU of interest. The others look to be workstation oriented. In previous generations, there were E5 series with fewer PCI-E lanes.

Xeon E3 v5 and E5 v4 2600-series

Below are the Xeon E3 v5 and Xeon E5 v4 1600 and 2600 series to be compared. Given that transaction processing favors cores and logical processors, I am only including SKUs with Hyper-Threading and generally the less expensive (lower frequency) SKU at each core count.

Family      Model  Cores  Freq (GHz)  Price   $/core
Xeon E3 v5  1230   4      3.4         $250    $62.50
Xeon E3 v5  1275   4      3.6         $339    $84.75

Xeon E5 v4  1620   4      3.5         $294    $73.50
Xeon E5 v4  2620   8      2.1         $417    $52.13
Xeon E5 v4  2630   10     2.2         $667    $66.70

Xeon E5 v4  2650   12     2.2         $1,166  $97.17
Xeon E5 v4  2660   14     2.0         $1,445  $103.21
Xeon E5 v4  2683   16     2.1         $1,846  $115.31

Xeon E5 v4  2695   18     2.1         $2,424  $134.67
Xeon E5 v4  2698   20     2.2         $3,226  $161.30
Xeon E5 v4  2699   22     2.2         $4,115  $187.05

I have always included the Xeon E3 for consideration because the fact is that a quad-core with Hyper-Threading is very powerful, even if it only supports 32GB of memory. One reservation is that there are no Xeon E3 motherboards with 4 PCI-E ×4 slots. There are some with 3 ×8 slots through the use of a PCI-E switch, but that is really for workstations and adds cost to the motherboard without real value for servers.

Now that I look at the Xeon E5 1620 v4 quad-core 3.5GHz and the E5 2620 v4 8-core 2.1GHz, I would recommend passing on the E3 and going straight to the E5 single-socket. Supermicro has a great E5 v4 UP motherboard, the X10SRL-F, with 6 PCI-E gen 3 slots at about $300.

I have split the Xeon E5 1600 and 2600 SKUs into three groups based on price per core. The first group has a really low cost per core. The second is the middle group. And the third contains the premium models. However, the delineation is not clear-cut: the E5-2695 18-core could be either the high end of the second group or the low end of the third.

There is no die size given for the Broadwell 4-core model, but the 2-core with GT2 graphics is 82mm2, 6.08 × 13.49mm. The 2-core with Iris graphics is listed at 133mm2. From the original raw image aspect ratio, I calculate the dimensions to be 6.80 × 19.55mm. Notice the blank space above the cores of the 2c Iris die.

Broadwell 4c   Broadwell 10c  

Above is the 4-core Broadwell. By matching up the cores, I am guessing a die size of 172mm2 and dimensions of 12.41 × 13.86mm. The 10-core is 246.2mm2, 15.20 × 16.20mm.
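
The arithmetic behind these estimates is straightforward: given a published die area and the aspect ratio taken off the die photo, the two edges follow from area = w × h and h = ratio × w. The sketch below reproduces the numbers above; the pixel counts are placeholders chosen to give the stated aspect ratios, not measurements from the actual images.

  /* Sketch of the die-dimension arithmetic: area plus photo aspect ratio
     (pixel height / pixel width) gives the two physical edge lengths. */
  #include <math.h>
  #include <stdio.h>

  static void die_dimensions(double area_mm2, double px_w, double px_h)
  {
      double ratio = px_h / px_w;          /* aspect ratio from the image */
      double w = sqrt(area_mm2 / ratio);   /* short edge in mm            */
      double h = ratio * w;                /* long edge in mm             */
      printf("%6.1f mm2  ->  %.2f x %.2f mm\n", area_mm2, w, h);
  }

  int main(void)
  {
      /* Broadwell 2c + Iris, 133 mm2; a ratio near 2.875 reproduces
         the 6.80 x 19.55 mm figure quoted above. */
      die_dimensions(133.0, 1000.0, 2875.0);
      /* Broadwell 2c GT2, 82 mm2, known edges 6.08 x 13.49 mm (ratio ~2.22). */
      die_dimensions(82.0, 1000.0, 2219.0);
      return 0;
  }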

The low-price group are astonishing bargains, more so considering that the Broadwell LCC die is 246mm2. The 10-core LCC die is larger than the desktop 4-core with the big graphics engine. But the Core i7-6950X 10-core 3.0GHz, 140W carries a price of $1,723, so that is probably where Intel wants to make its money on the Broadwell-E LCC die. The Xeon E5-2640 v4 10-core 2.4GHz is $939. Manufacturing yield by frequency probably plays only a minor role, as most die are probably capable of high frequency. There might be some variation in power efficiency, so the high core count SKUs are presumably cherry-picked from the most efficient die.

What is interesting is that the 16-core 2683 v4 is on par in price per core with the 14-core 2660 v4, which is notable because the 16-core is on the HCC die, much larger than the MCC die used for the 14-core.

On the one hand, we could say that the E5-2600 series 20 and 22 core models are priced with the big boys. But compared with the 4600-series, the 2600s are half-priced.

E5 2 × 10-core versus 20-core

The E5-2630 10-core and E5-2698 20-core are both 2.2GHz. This makes possible an interesting comparison between 2 sockets × 10 cores and 1 socket × 20 cores at matching frequency.

A UP Xeon E5 v3/v4 motherboard is about $300 versus $500 for a DP motherboard, but this difference is largely inconsequential.

The 10-core is $667, quantity two is $1,334 versus $3,226 for the 20-core.

Memory also comes into play. Suppose the motherboard has 8 DIMM slots per socket. The price of 4×16GB is about $550, versus 4×32GB at $1,050. Then there is very little meaningful difference at 256GB total memory between 8×32GB on the UP and 16×16GB on the DP, both costing about $2,100.

The big difference comes into play with 64GB DIMMs, currently about $1,000 each. This is double the price per GB compared to either the 16GB or 32GB DIMMs. If we need 512GB in the UP system, we would need 8×64GB at $8,000, versus 16×32GB in the DP motherboard at $4,100.
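
Rolling these figures up makes the trade-off easier to see. The sketch below uses the approximate street prices quoted above (CPU and motherboard prices from earlier in this section, per-DIMM prices derived from the 4-DIMM kit prices); it is a rough estimate, not a quote.

  /* Rough cost roll-up for 1 x 20-core (UP) versus 2 x 10-core (DP),
     using the approximate street prices quoted in the text. */
  #include <stdio.h>

  int main(void)
  {
      /* per-unit prices (USD) from the text */
      double cpu_2698 = 3226.0, cpu_2630 = 667.0;
      double mb_up = 300.0, mb_dp = 500.0;
      double dimm16 = 550.0 / 4, dimm32 = 1050.0 / 4, dimm64 = 1000.0;

      /* 256 GB total memory */
      double up_256 = cpu_2698 + mb_up + 8 * dimm32;       /* 8 x 32GB on UP  */
      double dp_256 = 2 * cpu_2630 + mb_dp + 16 * dimm16;  /* 16 x 16GB on DP */
      printf("256GB:  UP $%.0f   DP $%.0f\n", up_256, dp_256);

      /* 512 GB total memory: UP is forced onto 64GB DIMMs */
      double up_512 = cpu_2698 + mb_up + 8 * dimm64;       /* 8 x 64GB on UP  */
      double dp_512 = 2 * cpu_2630 + mb_dp + 16 * dimm32;  /* 16 x 32GB on DP */
      printf("512GB:  UP $%.0f   DP $%.0f\n", up_512, dp_512);
      return 0;
  }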

From this point of view, it would seem that DP is the better option at 20 cores total. However, if we were to factor in the effect of Memory Latency and NUMA, then perhaps we would be willing to pay the price of 64GB DIMMs should they be needed in a UP motherboard.

A good motherboard choice for UP is the Supermicro X10SRL-F, with 8 DIMMs, 4 PCI-E ×8 and 2 ×4 in gen 3, plus one gen 2 slot. For extreme IO, the Supermicro X10DRX has 10 gen 3 ×8 slots.

Xeon E5 v4 4600 series and E7 v4

Below are Xeon E5 4600 series and the E7 models.

Family      Model  Cores  Freq (GHz)  Price   $/core
Xeon E5 v4  4610   10     1.8         $1,219  $121.90

Xeon E5 v4  4640   12     2.1         $2,837  $236.42

Xeon E5 v4  4650   14     2.2         $3,838  $274.14
Xeon E5 v4  4660   16     2.2         $4,727  $295.44
Xeon E5 v4  4667   18     2.2         $5,729  $318.28
Xeon E5 v4  4669   22     2.2         $7,007  $318.50

Xeon E7 v4  4809   8      2.1         $1,223  $152.88
Xeon E7 v4  4820   10     2.0         $1,502  $150.00
Xeon E7 v4  4830   14     2.0         $2,170  $155.00
Xeon E7 v4  4850   16     2.1         $3,003  $187.69

Xeon E7 v4  8860   18     2.2         $4,061  $225.61
Xeon E7 v4  8870   20     2.1         $4,672  $233.60
Xeon E7 v4  8880   22     2.2         $5,895  $267.95
Xeon E7 v4  8887   24     2.2         $7,174  $298.92

I suppose it is even more curious that the E5 4600 series is more expensive than the E7 at equivalent core counts. I just realized the E7 has 32 PCI-E lanes versus 40 for the E5. What gives? Did Intel need to borrow wires for the extra QPI link? I am thinking that the implementations of QPI and PCI-E are different, but Intel could have simply chosen not to wire 8 of the PCI-E lanes, using the signals for QPI instead. I believe QPI uses 4 signals for each bit, and QPI is 20 lanes wide?