
Rethink Server Sizing 2017

An unquestioned article of faith in the IT community is that the server is a multi-socket system, with the exception of small business, edge devices and other turnkey applications. There were once valid reasons for this, but the world has changed. For several years now, all multi-socket systems have inherently had non-uniform memory access (NUMA) as a result of the memory controller being integrated into the processor. As it should not be difficult to rationalize that a significant group of applications is performance limited by round-trip memory access latency, NUMA has real implications.

One application example is B-tree index navigation, the foundation of database transaction processing. It then follows that a database not architected to achieve memory locality on a NUMA system will not scale well. From the hardware cost perspective, a system with two (medium) 14-core processors is less expensive than one with a single (high/extreme) 28-core processor. Except that cores in the single-socket system are significantly more productive than cores in a multi-socket system for this group of applications. A single-socket 22-core is probably performance equivalent to a two-socket system with 14-core processors in this case.

For enterprise edition database products, software licensing is on a per-core basis and far greater than the server system cost in either case. Until recently, it did not matter that the single socket had higher efficiency: important line-of-business databases needed more compute and memory than was available from a single processor. Now with 28 cores in the Xeon SP, or even the 22 cores in last year's Xeon E5 v4, a single processor is fully capable of handling all but the most extreme transaction processing workloads (or very bad indexing). It is time to stop blindly following obsolete rules and re-evaluate the server system strategy.

2017 Intel Xeon Scalable Processor (Skylake architecture)

Below is a functional diagram of the Xeon Scalable Processor based on the Skylake core. There are many interesting features, particularly the changes from the previous generation Broadwell based Xeon E5 v4 and E7 v4. See Intel Xeon processor scalable family technical overview for more details.

xeon sp

The Skylake based Xeon SP has 3 die layout options, as did the three previous Xeon E5 and E7 generations. In previous generations, the die layouts were LCC, MCC and HCC. The three Skylake Xeon die layouts are represented below.

skylake lcc LCC   skylake hcc HCC   skylake xcc   XCC

All have one UPI block with two UPI links, shown in the top row at the left. The XCC die has a second UPI block with one UPI link for a total of 3, shown second from the right. There are three PCI-E blocks on the LCC and HCC die, each with 16 lanes. The second PCI-E x16 block from the left in each of the above die layouts also has the DMI unit that connects to the southbridge; DMI is a PCI-E x4 port with additional protocols. The XCC die has one additional PCI-E x16 block that is used only in the F models to connect to an integrated Omni-Path fabric.

There are 58 different models of the Xeon SP, in Bronze, Silver, Gold, and Platinum groups, comprising regular (35), M (7), F (7) and T (9) models. (There are also 7 Xeon W models.) The number of active cores covers the entire range from 4 to 28 in steps of 2. The chart below shows the price per core for 33 of the regular models.

price

The two regular models excluded are the 8156 and 8158, which have 4 and 12 cores respectively but use the XCC die with 3 UPI links. The higher cost 4-12 core options shown above are 3.2GHz+ models. Otherwise, the general trend is that cost per core rises as core count increases. Note that the Xeon 6138 2.0GHz 20-core is only slightly higher priced than the 6140 2.3GHz 18-core ($2,612 versus $2,445), resulting in a slightly lower cost per core at 20 cores than at 18.
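As a quick arithmetic check on that comparison, a minimal calculation using the two list prices quoted above (a sketch, not a pricing tool):

# Cost per core for the two models named above (list prices from the text).
models = {
    "Xeon Gold 6138, 20 cores, 2.0GHz": (2612, 20),
    "Xeon Gold 6140, 18 cores, 2.3GHz": (2445, 18),
}
for name, (price, cores) in models.items():
    print(f"{name}: ${price / cores:,.2f} per core")
# 6138: ~$130.60 per core; 6140: ~$135.83 per core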

2-way Xeon Scalable Processor Systems 2017

From the gamut of just the Xeon SP regular models, we can build an equal number of 2-socket systems around the LCC, HCC and XCC die layouts as depicted below.

lcc2 LCC

hcc2 HCC

xcc2 XCC

From the above options, and the fact that lower core count models generally have lower cost per core, it would seem that a 2-socket system strategy, employing a range of core count (and frequency) models, would cover the majority of requirements. A 2-socket system with two (medium) 14-core HCC die, about $4,300 in processors, is substantially less expensive than a 1-socket system with the 28-core XCC die, assuming there is not much difference between a 1S and 2S motherboard. And it could be argued that the 2-socket system has twice the memory and IO of a single socket system.

Except that cores in 1-socket and 2-socket systems are not equivalent. In transaction processing, good scaling from 1 to 2 sockets requires the database to be architected to achieve a high degree of memory locality on a NUMA system. The TPC-C database was architected for exceptional scaling on NUMA (and clustered) systems, with the tables having common lead keys in the clustered index (warehouse and district IDs). The client application looks at the key value to determine the connection port number to use, which on the RDBMS has been affinitized to a specific node. The TPC-E database has a range table to manage blocks of trade ID values, so that each session gets its own block of consecutive values.

For whatever reason, the critical aspects of database transaction processing scaling are not discussed anywhere, except perhaps in the TPC-C and TPC-E full disclosure reports and accompanying supporting files. There, the few important details are buried among thousands of pages of minutiae, generally without explanation of their purpose or impact.

Memory Latency

A topic of great importance in modern server systems is the round-trip latency from a processor core to local and remote node memory. Answers to this question were difficult to find until very recently, as Intel had been very cagey with information on memory latency. With their own Memory Latency Checker utility, Intel could publish detailed information for every processor, along with the impact of various platform options.

Xeon E3

It is useful to start with Intel desktop processors, i.e., the equivalents of the Xeon E3 (soon to be Xeon E?). For several generations, the Xeon E3 (and its equivalent in the desktop Core brand) has had L3 latency around 42 cycles, and round-trip memory latency of L3 + 51 ns. A (bi-directional?) ring connects the 4 cores, the GPU and the system agent. Note that L3 latency is cited in CPU cycles, not time. The Skylake Xeon E3 base frequency can be as high as 3.7GHz, and 4.1GHz in the Kaby Lake refresh, so the 42-cycle L3 latency could be as short as 10ns (source: Wikichip Skylake client). Another source for memory latency is www.7-cpu.com.

skylakeE3ring
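A minimal sketch of that latency arithmetic, assuming the L3-cycles plus 51ns convention described above (actual values depend on the operating frequency):

# Desktop / Xeon E3 convention: L3 in core cycles, memory as L3 + 51 ns.
L3_CYCLES = 42
MEM_NS = 51.0

def desktop_memory_latency_ns(clock_ghz):
    l3_ns = L3_CYCLES / clock_ghz      # convert L3 cycles to ns at this clock
    return l3_ns, l3_ns + MEM_NS

print(desktop_memory_latency_ns(4.1))  # ~(10.2, 61.2) ns, Kaby Lake refresh
print(desktop_memory_latency_ns(3.7))  # ~(11.4, 62.4) ns, Skylake Xeon E3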

Internal Interconnect

In the Xeon E5/E7 line, the interconnect is more complicated. The Broadwell LCC die has a single ring connecting 10 cores, while the MCC and HCC die have two ring pairs, with 12 cores connected on each of the HCC rings along with the PCI-E, QPI and two memory controllers. The two ring pairs are connected with switches at the top and bottom.

Broadwell_HCC_ring2   Broadwell  

In the Xeon SP, the interconnect is now a mesh. The memory controllers sit alongside the second row of cores, at the left and right.

Skylake_XCCgs   Skylake

Memory Latency versus Bandwidth

Memory latency in the modern processor is a complicated topic. For one, it depends on the operating memory bandwidth, as shown in the slide from Intel's Skylake SP slide-deck.

memory_bandwidth

In the above slide, it would appear that local node memory latency at low bandwidth is around 70ns. Presumably, this is inclusive of L3 latency? As noted earlier, memory latency for the desktop processors is cited as L3 + memory, with L3 cited in cycles and memory cited in ns.

Memory and Cache Latency Comparison

The Intel Press Workshop June 2017 slide below (source: computerbase.de) says L3 latency is 26ns and DDR4 is 70ns; it is unclear whether the 70ns includes or excludes L3.

MemoryCacheLatency
(Memory and Cache Latency Comparisons)

CPU Cache Latency Comparison - Broadwell and Skylake

The Intel slide below says L3 latency is 19.5ns for the Xeon 8180 (2.5GHz).

CPU_Cache_Latency
 CPU Cache Latency

Wikichip says the L3 latency in the Xeon SP is 50-70 cycles (source: Skylake server); the 7-cpu site is similar (Skylake X). The 19.5ns above could mean 74 cycles if the clock is at the maximum turbo boost frequency of 3.8GHz, or 49 cycles at the base frequency of 2.5GHz. And then we do not know whether this is the local L3 slice, or the L3 of a different core at some unknown distance.

This link has a related slide-deck: Intel Xeon Scalable Architecture Deep Dive, but not the same slides?

Memory Latency - Xeon SP and EPYC

Prior to this year, Intel put out very little information on memory latency, particularly for the Xeon E5 and E7 product lines. This year, however, memory latency is important in explaining the differences between Xeon and AMD EPYC, and so Intel now decides that some information can be made available. Nextplatform shows the following Intel slide in Intel Stacks Up Xeons Against AMD EPYC Systems, though I have not found a link for the actual pdf file.

memory-latencies

Toms Hardware Intel Presents Its AMD EPYC Server Test Results discusses the same Intel slide-deck.

DDR DRAM

For unbuffered DDR, the latency at the DRAM die interface should be about 42ns. The prominent rating of memory is the data transfer rate, 2666MT/s for example. The DRAM clock rate is one-half of the data rate (hence double data rate), which in this example is 1333MHz, for a clock cycle time of 0.75ns. The DDR timing is cited as three numbers, one for each of the major components of a random access, typically 18-18-18 for 2666MT/s. The full random-access time is then the sum of all three elements, 18 x 0.75 = 13.5ns each, for a full access time of 40.5ns. See memory performance speed latency.

After including propagation delay from CPU to memory and back, the 51ns cited for memory latency in desktop processors seems reasonable. This is for unbuffered DRAM. The latency of registered DRAM may have been mentioned as being one clock cycle longer. If it is one (DRAM?) clock cycle per component, then we might expect 2ns extra in total?
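A small worked example of the DDR timing arithmetic above; the registered DIMM adder is only the assumption stated in the text (one DRAM clock per timing component), not a measured figure:

# DDR4-2666 random access time from the 18-18-18 timings quoted above.
transfer_rate_mts = 2666
clock_mhz = transfer_rate_mts / 2          # double data rate -> 1333 MHz clock
cycle_ns = 1000.0 / clock_mhz              # ~0.75 ns per DRAM clock

cl, trcd, trp = 18, 18, 18                 # CAS, RAS-to-CAS, precharge
access_ns = (cl + trcd + trp) * cycle_ns   # ~40.5 ns at the DRAM interface

# Assumption: registered DIMMs add one DRAM clock per timing component.
registered_ns = access_ns + 3 * cycle_ns   # ~2.25 ns extra

print(round(access_ns, 2), round(registered_ns, 2))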

1-Socket versus 2-Socket, Simple Model

Without clear and precise information, assume that local node memory latency, inclusive of L3, is 75ns. Based on a 2.5GHz CPU clock, the CPU cycle time is 0.4ns, making the round-trip memory access latency 187.5 cycles.

XCC_1way_m

In the two-socket system, assuming the database has not been architected for NUMA, 50% of memory accesses are assumed to go to the remote node. There had not been much information on remote node memory access latency until very recently, when disclosing this information suited Intel's purpose.

A signal must cross the UPI link to the remote socket, traverse the internal interconnect, make the memory access itself, and then return over the same path. The Intel TPC-VMS document (tpc-vms-2013-1.0.pdf) says that on the older Nehalem platform, remote node memory latency is 1.7X the local node latency, which would make it 128ns; the round number 125ns is used here.

HCC_2way_m2

The TPC-VMS document also mentioned that 80% of memory accesses on a 2-socket system were to the local node. We can deduce from this that 60% of memory accesses were to the local node by design and 40% were random. As this is a two-socket system, the random accesses should be split 50/50, or in this case 20/20, yielding 80% local and 20% remote. It would also imply that on a 4-socket system, local access is 70%, as only 10 of the 40 random percentage points now go to the local node.
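A minimal sketch of the locality model implied here: a fraction of accesses is local by design, and the random remainder is spread evenly across all sockets, including the local one.

def local_fraction(designed_local, sockets):
    # designed_local: fraction of accesses routed to the local node by design
    random_part = 1.0 - designed_local
    return designed_local + random_part / sockets

print(local_fraction(0.60, 2))   # 0.80 -> matches the 80% cited for 2-socket
print(local_fraction(0.60, 4))   # 0.70 -> the implied 4-socket locality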

Note: The Intel slide from Xeon versus EPYC says 89ns for local node and 139ns for remote node, but other slides may have different values. The values 75 and 125ns are used here. As long as the local to remote node ratio is close, the assessment should be substantially the same.

Simple Model - Round-trip Memory Access Effect

Suppose we have a fictitious transaction that comprises 3M cycles of operations if we had a magical single-cycle access memory (for simplicity, ignore the fact that L2 latency is 14 cycles and L3 is more). Now suppose 3% of operations must wait for real memory at 75ns, all accesses to the local node. We can work out that a total of 19.785M cycles are needed to complete the transaction (2.91M single-cycle operations plus 90,000 memory accesses at 187.5 cycles each), for a single-thread performance of 126.36 tps.

On a 2-socket system with a 50/50 access split at 75 and 125ns, a total of 25.41M cycles is needed, for 98.39 tps. This is a performance loss of 22% per thread, and scaling from 1S to 2S is 1.557. Also, in the single socket case, 85% of CPU cycles are dead cycles waiting for memory, and 89.2% in the 2-socket system.
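Below is a minimal sketch of this simple model, reproducing the figures above. The 3M operations, the 3% memory-access fraction and the 75/125ns latencies are the stated assumptions, not measurements.

# Simple round-trip memory model: most operations complete in one cycle,
# a small fraction instead takes the full memory latency.
CLOCK_GHZ = 2.5
CYCLE_NS = 1.0 / CLOCK_GHZ                  # 0.4 ns per cycle
OPS = 3_000_000                             # operations per transaction
MEM_FRACTION = 0.03                         # fraction of ops going to memory

def tps(avg_mem_latency_ns):
    mem_ops = OPS * MEM_FRACTION
    mem_cycles = avg_mem_latency_ns / CYCLE_NS
    total_cycles = (OPS - mem_ops) + mem_ops * mem_cycles
    return CLOCK_GHZ * 1e9 / total_cycles

tps_1s = tps(75.0)                          # all local: ~126.4 tps per thread
tps_2s = tps(0.5 * 75.0 + 0.5 * 125.0)      # 50/50 local/remote: ~98.4 tps
print(tps_1s, tps_2s, 2 * tps_2s / tps_1s)  # 1S -> 2S scaling ~1.56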

Scaling from 2S to 4S is also only moderate, as the local/remote memory access split becomes 25/75 without NUMA optimizations. In the Ivy Bridge, Haswell and Broadwell generations (Xeon E5/E7 v2, v3 and v4), the E7 also had the SMB memory expander, which added additional latency. In the Xeon SP, only the M models have this, and perhaps some vendors now realize that extra-large memory is not worth the extra latency?

The expectation is then that Hyper-Threading with 2 logical processors on one physical core should have almost linear scaling, because for a single thread perhaps 90% of cycles are dead cycles waiting for memory. In fact, Intel really needs to increase SMT from 2 to 4, or perhaps even 6, logical processors per physical core. IBM POWER7 was SMT4, and POWER8 went to SMT8.

A third implication of this model is that frequency does not have much impact on transaction processing workloads: increasing core frequency has only a minor effect because of the high percentage of cycles spent waiting for memory. In summary, there are three main implications: high core count is favored over frequency, scaling over multiple sockets has only moderate gains without a properly NUMA-architected database, and HT/SMT is expected to be good.

1-Socket versus 2-Socket, the old arguments

The historical argument for bigger systems was the combined scaling of processors and cores, memory (capacity and bandwidth) and IO. More is better if it can solve poor design and coding with brute force, until more is not better because the large system hits strange NUMA or high core count anomalies.

XCC_1way

HCC_2way

Today, we have a substantial number of cores available in a single socket (22 or 28). Memory bandwidth was never an issue in database transaction processing. When the Opteron came out in 2003, before multi-core, AMD stressed that memory bandwidth scales with the number of sockets. There may have been applications that did benefit from the memory bandwidth, but probably not transaction processing. What was beneficial back then were the extra memory slots, which helped reduce disk IO.

Today, the single socket system has 12 memory slots. When filled with 64GB DIMMs, system memory is 768GB, which should be enough for any properly indexed database, more so when storage is on flash/SSD.

The 48 PCI-E lanes in a Xeon SP support 6 PCI-E x8 devices, each of which can sustain 6.4GB/s of bandwidth. This again is more than what is necessary.

All of this combined supports the argument that the single socket system can support heavy transaction processing loads, considering that the expected scaling from 1S to 2S is about 1.5X, and another 1.5X from 2S to 4S. And this assumes that there are not any severe negative NUMA effects, which do happen.

1-Socket versus 2-Socket, Cost

The nature of silicon economics is that cost is highly nonlinear for large die sizes. The Xeon SP LCC die should not be particularly expensive, at approximately the size of an APS-C camera sensor. The HCC die at 485 mm2 is smaller than an APS-H sensor, while the XCC die is in between an APS-H and a full frame sensor (36x24mm = 864mm2). The Intel Xeon Platinum 8180 28-core 2.5GHz has a list price of $10,009.

XeonSP_1x28s

The Xeon Gold 6132 14-core 2.6GHz is $2,111, so two 14-core processors are nearly $6,000 less than one 28-core processor. There are other equal core count 1S/2S comparison options that show less price disparity; this is the most extreme case.

XeonSP_2x14s

As mentioned earlier, cores are not equal between 1 and 2 sockets in any case, more so on databases not architected for NUMA at all. The more appropriate comparison for two 14-core processors is a single socket of around 20-22 cores. The only 22-core model is the 6152 at 2.1GHz and $3,655. This should be compared against two of the 14-core 5120 2.2GHz at $1,555 each ($3,110), so the two-socket system is slightly less expensive.

XeonSP_1x22s

But the hardware price comparison hardly matters for a database when Enterprise Edition per-core licensing is in effect. In this case, the option that can meet the performance requirement with the lowest core count will win by a large margin.
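A rough sketch of why: the per-core license price below is a hypothetical placeholder, the processor prices are the list prices above, and the premise is the earlier assumption that one 22-core socket is roughly performance-equivalent to two 14-core sockets.

LICENSE_PER_CORE = 7000.0        # hypothetical per-core license price

def total_cost(cpu_list_price, sockets, cores_per_socket):
    hardware = cpu_list_price * sockets
    licenses = LICENSE_PER_CORE * sockets * cores_per_socket
    return hardware + licenses

# 1 x 22-core Xeon Gold 6152 versus 2 x 14-core Xeon Gold 5120
print(total_cost(3655, 1, 22))   # ~$157,655 with 22 core licenses
print(total_cost(1555, 2, 14))   # ~$199,110 with 28 core licenses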

TPC-E Benchmark, Xeon Platinum 8180 2-socket versus 4-socket

As mentioned earlier, the TPC-E benchmark database happens to be architected for NUMA, achieving perhaps 80% memory locality. In previous processor generations, the Intel Xeon E5 2-way result could not be cleanly compared to the 4-way E7 result because the E5 and E7 had different core counts and the E7 had a memory expander, adding latency.

For the recently published Xeon SP results in 2S and 4S, both use the same core count, and there is no difference in memory (the Xeon M models have a memory expander?). The Lenovo ThinkSystem SR650 result is 6,598.36 tps-E for a 2-way Xeon Platinum 8180 2.5GHz, 56 cores and 112 logical processors total.

The Lenovo ThinkSystem SR950 TPC-E result is 11,357.28 tps-E for a 4-way Xeon Platinum 8180 2.5GHz, 112 cores and 224 logical processors total. This works out to 1.72X scaling from 2S to 4S on a database designed for NUMA; scaling would be considerably less on a database not designed for NUMA. It is unfortunate that an equivalent 1S result is not available.

In the Broadwell generation, the 2-way Xeon E5-2699 v4 produced 4,734.87 tps-E with 22 cores per socket and the 4-way Xeon E7-8890 v4 produced 9,068.00 tps-E with 24 cores per socket, for 2S to 4S scaling of 1.915. Adjusting for the 9% difference in cores per socket, the core-adjusted scaling ratio is 1.755, in line with the more recent 2S to 4S scaling. Here, the Xeon E5 system had 512GB memory and the E7 system had 4TB, but the latter was achieved through the memory expander, which also adds latency. So it is quite possible that the 4X difference in memory per socket just offset the extra memory latency of the expander.
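The scaling arithmetic, written out from the published results quoted above:

# TPC-E results cited above (tps-E).
sr650_2s, sr950_4s = 6598.36, 11357.28   # Xeon Platinum 8180, 28 cores/socket
print(sr950_4s / sr650_2s)                # ~1.72x scaling, 2S -> 4S

e5_2s, e7_4s = 4734.87, 9068.00           # Broadwell: E5 22c, E7 24c per socket
raw_scaling = e7_4s / e5_2s               # ~1.915x
core_adjusted = raw_scaling / (24 / 22)   # ~1.755x after the ~9% core difference
print(raw_scaling, core_adjusted)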

Summary - Scale Down to Single Socket

The NUMA effect is real. Most databases were not architected for NUMA systems and this is not something that can be retrofitted easily. In this case, the single socket server will have substantially greater performance per core, requiring fewer core licenses. Furthermore, the single socket system with a high core count processor can handle all but the most extreme workloads and avoids potential NUMA anomalies that could occur in multi-socket systems.

The single socket is the best strategy for most situations, in part because retrofitting NUMA optimizations into an existing database architecture is extremely difficult. With the single socket system, it is not necessary to undertake this task. Having said that, we should evaluate NUMA optimizations in the database architecture anyway, because Intel processors have Cluster-on-Die (COD) in the Broadwell Xeons and Sub-NUMA Clustering (SNC) in the Skylake Xeon SP.

XeonSP_1x22s

Addendum

Above, I had assumed that local node memory latency is the same as single socket memory latency. It would seem that this is not true. In a multi-socket system, it is necessary to check the remote node L3 cache before going to memory? So there is a NUMA penalty even for local node access.

Intel D. Levinthal paper

freebsd Ulrich Drepper paper

Addendum - Die Sizes

In the diagrams below, processor die are scaled to 10 pixels per 1mm.

Estimated die size for Skylake 4 core + small graphics is ~122.4mm2. Assuming the aspect ratio is correct, then the linear dimensions should be 13.4×9.16 mm.

skylakeE3ring

The Skylake LCC has 10 cores (~325.44 mm2), the HCC 18 cores (~485 mm2) and the XCC 28 cores (~694 mm2), per wikichip.org. Note that the boxes below represent the Skylake core plus the Caching and Home Agent (CHA), Snoop Filter (SF) and Last Level Cache (LLC). As such, the box is larger than the core as represented for the desktop Skylake above.

skylake lcc LCC   10 cores, ~325.44 mm2, 22.5×14.5 mm
skylake hcc HCC   18 cores, ~485 mm2, 22.5×21.6 mm
skylake xcc   XCC   28 cores, ~694 mm2, 32.2×21.6 mm

Assuming the aspect ratios of the available die images are correct, the dimensions should be as shown above. The LCC & HCC versus XCC comparison may not be exact because the XCC core has a second 512-bit FMA unit?
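A minimal sketch of that back-of-the-envelope conversion: given a die area and an aspect ratio read off the die image, the linear dimensions follow directly (the aspect ratios below are simply taken from the dimensions listed above).

import math

def die_dimensions(area_mm2, aspect_ratio):
    # aspect_ratio = width / height
    height = math.sqrt(area_mm2 / aspect_ratio)
    return aspect_ratio * height, height

print(die_dimensions(325.44, 22.5 / 14.5))  # LCC -> ~22.5 x 14.5 mm
print(die_dimensions(485.0, 22.5 / 21.6))   # HCC -> ~22.4 x 21.5 mm
print(die_dimensions(694.0, 32.2 / 21.6))   # XCC -> ~32.2 x 21.6 mm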

Broadwell_HCC_ring2   Broadwell   HCC 456 mm2, 25.2mm×18.1mm

The estimated size of the Broadwell HCC is 456 mm2 for 24 cores. Assuming the aspect ratio is correct, the dimensions would be 25.2mm×18.1mm.

Haswell HCC is 622 mm2 for 18 cores. The 22nm to 14nm process shrink should correspond to roughly a 50% area reduction (this may not be true for the buffers that drive signals off chip). It might be assumed that Intel did not attempt a maximum size die on account of power, limited by the thermal envelope?

The AMD EPYC is composed of four 8-core dies of 213 mm2 each. The theory here is that manufacturing a ~200mm2 die is relatively inexpensive. There is some cost in putting the 4 dies into a multi-chip module (MCM), but much less than building one giant 600mm2+ die.