
Work in progress: the original articles are to be split into the following:

 Memory Latency (2018-04)

 TPC-E Benchmarks (2018-04)

 The Case for Single Processor (work in progress)

 DRAM (2018-03)

 SRAM as Main Memory (2018-03)

 Historical Systems

Feature articles (to be replaced by the articles above)
 Low Latency Memory (2018-03)

 Memory Latency (2018-02) Rethink Server Sizing, the case for single-socket
 SRAM as Main Memory (2018-02)

Posted on LinkedIn (but needs to be updated):
  SRAM as Main Memory Cost Benefit   Rethink Server Sizing 2017

Rethink Server Sizing 2017 - brief

Standardizing on 2- and 4-socket systems for servers has been established practice since 1996. There were very good reasons for this, but that was so long ago that they have been forgotten. Yet the practice continues unquestioned, almost as a reflex. Back then, 2-way was the baseline, and 4-way was for databases. Now, the large majority of standard systems are 2-way (2S), with 4S reserved for exceptional requirements.

It might seem that the 2-socket system continues to be a good choice, as two processors with an intermediate number of cores are less expensive than one processor with twice as many cores. An example is two Xeon Gold 6132 14-core processors versus one Xeon Platinum 8180 28-core processor. In addition, the two-socket system has twice the memory capacity and nominally twice the memory bandwidth.

XeonSP_2x14a

XeonSP_1x28a

So, end of argument, right? Well, no. Any multi-socket system, including the two-socket, has non-uniform memory access (NUMA), and database transaction processing performance is largely governed by round-trip memory access latency. Memory latency is lower on a one-socket system than on a multi-socket system, unless the application has been architected to achieve a high degree of memory locality on NUMA. Almost no real-world databases have been architected for NUMA. So, memory access on a 2-way is most likely split about 50/50 between the local and remote nodes.

Furthermore, even local-node memory access in a multi-socket system has higher latency than memory access on a single-socket system. This is because the remote node's L3 must be checked to maintain cache coherency.

Xeon_1S_lat2

Xeon_2S_lat2
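The latency argument above can be sketched as a weighted average. This is a minimal sketch, not a measurement: the three latency values below are hypothetical placeholders chosen for illustration (they do not come from this article), and the 50/50 local/remote split is the assumption stated in the text. The last line additionally assumes a workload that is entirely bound by memory latency.

```python
# Sketch: average memory latency on a 2-socket system versus 1-socket.
# All latency values are HYPOTHETICAL placeholders, not measured numbers.

LAT_1S_NS = 75.0          # assumed: single-socket memory latency
LAT_2S_LOCAL_NS = 90.0    # assumed: 2S local node (includes remote L3 check)
LAT_2S_REMOTE_NS = 140.0  # assumed: 2S remote node


def average_latency_ns(local_ns, remote_ns, local_fraction):
    """Average latency when local_fraction of accesses hit the local node."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# A database not architected for NUMA: roughly 50/50 local and remote.
avg_2s_ns = average_latency_ns(LAT_2S_LOCAL_NS, LAT_2S_REMOTE_NS, 0.5)

# If work is entirely latency-bound, per-thread performance scales
# inversely with average latency.
relative_single_thread = LAT_1S_NS / avg_2s_ns

print(f"2S average latency: {avg_2s_ns:.1f} ns")
print(f"2S single-thread performance relative to 1S: {relative_single_thread:.2f}")
```

With other latency values the ratio changes, but as long as both the local and remote 2S latencies exceed the 1S latency, the 2S average is necessarily worse than 1S.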

The implication of the memory latency differences is that single-thread performance decreases by more than 30% going from 1S to 2S, with throughput scaling from 1S to 2S of around 1.35X. In this regard, two 14-core processors (28 cores total) have throughput comparable to a single 18-20 core processor. Given the cost of Enterprise Edition licensing, the difference in processor cost per core does not really matter. The system that can deliver the required performance with fewer cores will win on cost. The single-socket system will also have fewer unusual (read: bad or very bad) characteristics than a multi-socket system.
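The core-equivalence arithmetic above can be worked through in a few lines. This is a sketch under stated assumptions: the 1.35X scaling and the 14-core processors come from the text, while the per-core license price is a hypothetical placeholder (not a quoted list price), and throughput is assumed to scale linearly with core count within a single socket.

```python
# Sketch: how many 1S cores match the throughput of a 2 x 14-core system,
# and what that means for per-core licensing.

CORES_PER_SOCKET = 14
SCALING_1S_TO_2S = 1.35     # throughput scaling from the text
LICENSE_PER_CORE = 7000.0   # HYPOTHETICAL per-core license price


def equivalent_1s_cores(cores_per_socket, scaling):
    """1S core count with throughput comparable to the 2S system,
    assuming linear scaling with cores inside one socket."""
    return cores_per_socket * scaling


def license_cost(cores, price_per_core):
    """Total license cost when licensing is charged per core."""
    return cores * price_per_core


eq_cores = equivalent_1s_cores(CORES_PER_SOCKET, SCALING_1S_TO_2S)
cost_2s = license_cost(2 * CORES_PER_SOCKET, LICENSE_PER_CORE)
cost_1s = license_cost(20, LICENSE_PER_CORE)  # upper end of the 18-20 range

print(f"1S core-equivalent of 2 x 14 cores: {eq_cores:.1f}")
print(f"License cost, 2S with 28 cores: {cost_2s:,.0f}")
print(f"License cost, 1S with 20 cores: {cost_1s:,.0f}")
```

Whatever the actual per-core price, the licensing difference scales with the 8 to 10 cores saved, which is why the processor price difference matters so little by comparison.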

Because of the memory latency effect, higher processor frequency does not help much in transaction processing: most cycles are spent waiting on memory. There is usually more than one frequency option at a given core count, usually at different prices. Finally, Hyper-Threading is expected to be highly effective for transaction processing, so this feature should be enabled. In addition, Intel really needs to increase the degree of HT from 2 to 4 logical processors per core.
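Why frequency helps so little can be illustrated with a very simple model: time per unit of work is compute time (which shrinks with frequency) plus memory wait time (which does not). This is only a sketch. The cycle count, latency, and frequencies below are hypothetical values chosen for illustration, and the model ignores caches, overlapping memory accesses, and Hyper-Threading.

```python
# Sketch: effect of core frequency when most time is spent waiting on memory.
# All parameter values are HYPOTHETICAL, chosen only to illustrate the shape.

COMPUTE_CYCLES = 60.0      # assumed: cycles of actual work per memory access
MEMORY_LATENCY_NS = 115.0  # assumed: round-trip memory latency


def time_per_unit_ns(freq_ghz, compute_cycles, latency_ns):
    """Compute time scales with 1/frequency; memory wait time does not."""
    return compute_cycles / freq_ghz + latency_ns


t_low = time_per_unit_ns(2.0, COMPUTE_CYCLES, MEMORY_LATENCY_NS)
t_high = time_per_unit_ns(3.0, COMPUTE_CYCLES, MEMORY_LATENCY_NS)

speedup = t_low / t_high
stall_fraction_low = MEMORY_LATENCY_NS / t_low

print(f"Time at 2.0 GHz: {t_low:.1f} ns, at 3.0 GHz: {t_high:.1f} ns")
print(f"Speedup from 50% higher frequency: {speedup:.3f}x")
print(f"Fraction of time waiting on memory at 2.0 GHz: {stall_fraction_low:.2f}")
```

In this model the gain from a higher frequency is bounded by the share of time that is not memory wait, which is also the time a second logical processor could in principle put to use.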

LocalRemote

 

 

Older Articles - these are obsolete
 Rethink Server Sizing 2017 (2017-Dec)    SRAM as Main Memory (2017-Dec)
   Memory Bandwidth - Silliness? (unfinished 2018-Jan)

Additional related material:
  Rethinking System Architecture (2017-Jan),
  Memory Latency, NUMA and HT (2016-Dec),  The Case for Single Socket (2016-04)

 

 NEC Express5800/A1080a (2010-06),  Server Sizing (Interim) (2010-08),
 Big Iron Revival III (2010-09),  Big Iron Revival II (2009-09),  Big Iron Revival (2009-05),
 Intel Xeon 5600 and 7500 series (2010-04, this material has been updated in the new links above)

Reference

Onur Mutlu, Professor of Computer Science at ETH Zurich: website, lecture videos