
Intel 10nm Delayed to 2019? Assessment (2018-07)

For more than 40 years, Intel has repeatedly said that the (semiconductor manufacturing) process is its business model. Every two years, a new process is developed. An existing microprocessor design is shrunk to the new process. A new processor architecture is then designed on that process for the following year, which in turn is ready to be shrunk to the next process the year after, continuing the Tick-Tock cycle. By executing to this model, it is possible to achieve industry-leading performance as well as economy in cost structure.

Obviously, the cycle has variations and more complexity than this high-level summary. To a large extent, Intel has successfully executed to this model for more than 40 years; that is, until recently, in the 14 to 10nm process transition. In the past, other manufacturers have achieved design successes with products that were competitive with or better than the contemporary Intel processors at a point in time. But because they had difficulty adhering to Moore's Law over an extended period, their subsequent generations fell behind.

Considerable difficulty was expected in continuing to push the manufacturing process on a two-year cycle. Intel announced in 2016 a modification of the Tick-Tock model to Process-Architecture-Optimization, with a 3-year process cycle. Further delays followed, with the most recent suggesting 10nm will happen in 2019. The sequence for 14 to 10nm appears to be: Broadwell (14nm process) in 2014, Skylake (14nm architecture) in 2015, Kaby Lake (14nm+, optimization) in 2016, Coffee Lake (14nm++) in 2017, and then Cannon Lake (10nm) in 2019. Given that Intel itself has declared the process to be its business model, one might expect significant repercussions for its technology advantage over competitors.

As far as data center operations are concerned, there is a complication in this picture. The Intel server product line is the Xeon SP, which employs three die layouts to span a huge range of products from 4 to 28 cores. The largest of the three dies is the XCC at 694mm2 (Skylake). There are also Xeon D and W product lines that may use the same set of dies in different packaging.

AMD has a single 8-core die (Zen). Four dies reside in a Multi-Chip Module (MCM) to offer up to 32 cores in one processor socket of the EPYC product line. AMD estimates that the cost of manufacturing four of its 8-core dies is 59% that of a hypothetical single 32-core die, estimated at 777mm2.

[Figure: AMD EPYC MCM versus monolithic die cost comparison (AMDEpycCost)]
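
A rough illustration of why this is plausible, using a simple yield model with an assumed defect density (not AMD's actual figures): with Poisson yield Y = exp(-A x D0) and an assumed D0 of 0.1 defects per cm2, a 213mm2 die yields about exp(-0.213), roughly 81%, while a 777mm2 die yields about exp(-0.777), roughly 46%. Cost per good die scales roughly as area divided by yield, so four small dies come to about 4 x 2.13/0.81, or 10.5 relative units, versus 7.77/0.46, or 16.9 units, for the monolithic die. That ratio is roughly 62%, in the same range as AMD's 59% figure, which also accounts for wafer-edge utilization and other effects.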

It is not stated whether the cost of the MCM packaging affects the cost structure relative to a single large die in a more conventional package. It may be that AMD simply does not want to tackle the issues in manufacturing very large dies. For reference, a full-frame camera sensor at 36x24mm is 864mm2, probably the largest die that can be manufactured with a single-pass exposure.

Intel reports the memory latency of the Xeon SP in a 2-way (socket) system as 89ns to the local node and 139ns to the remote node.

[Figure: 2-socket Xeon SP local and remote node memory latency (2S_XeonSP4)]

In the AMD 2-socket system, there are 4 path variations to memory: the local die; a (1-hop) remote die in the same socket (MCM); a 1-hop path to the remote socket; and a 2-hop path to the remote socket. Memory latency in the 2-way AMD EPYC system is comparable to Intel's local-node figure for the local die, and comparable to Intel's remote-node figure for a remote die in the same MCM. Latency to the remote MCM is about 200ns for the 1-hop case and 240ns for the 2-hop case.

[Figure: AMD EPYC 2-socket memory latency by path (Epyc_latency)]

The question is: what processor performance metrics and characteristics are important? As always, it depends. Some applications favor memory bandwidth. If this type of application has been properly architected for Non-Uniform Memory Access (NUMA), then it should run well on modern processors and multi-processor systems. The AMD EPYC MCM has 8 memory channels, two per die. The Intel Xeon SP has 6 memory channels per socket.
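
As a rough sizing sketch (assuming DDR4-2666 and 8 bytes per transfer per channel, at peak rates): 6 channels x 2666 MT/s x 8B is about 128GB/s per Xeon SP socket, versus 8 channels x 2666 MT/s x 8B, about 170GB/s, per EPYC socket. For bandwidth-bound applications that are properly NUMA-aware, the extra channels favor EPYC.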

Applications with pointer-chasing code, prominently database transaction processing, may be almost entirely dependent on memory latency. The effect can be dominant because of the huge disparity between the CPU clock cycle and the round-trip memory access time. Even a small percentage of code in which the next instruction cannot be determined until a memory access completes results in most cycles being no-ops waiting on memory. This is why some RISC processors implement Simultaneous Multi-Threading (SMT) with 4 or 8 threads (logical processors) per core. It is unclear why Intel Hyper-Threading remains at 2 logical processors per physical core in the mainline products.
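
A minimal sketch of the pointer-chasing pattern described above (illustrative code, not from any particular database engine): each load depends on the result of the previous load, so with a working set larger than cache the core spends most of its cycles waiting on a full memory round trip, regardless of clock frequency.

/* pointer_chase.c - latency-bound traversal sketch (illustrative only) */
#include <stdio.h>
#include <stdlib.h>

#define N (64 * 1024 * 1024 / sizeof(void *))   /* ~64MB working set, well beyond cache */

int main(void)
{
    void **chain  = malloc(N * sizeof(void *));
    size_t *order = malloc(N * sizeof(size_t));
    if (!chain || !order) return 1;

    /* Build a randomly permuted cycle so the hardware prefetcher cannot follow it. */
    for (size_t i = 0; i < N; i++) order[i] = i;
    for (size_t i = N - 1; i > 0; i--) {        /* Fisher-Yates shuffle; rand() is fine for illustration */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        chain[order[i]] = &chain[order[(i + 1) % N]];

    /* Each iteration's load address depends on the previous load: pure latency, no overlap. */
    void **p = &chain[order[0]];
    for (size_t i = 0; i < 100000000; i++)
        p = (void **)*p;

    printf("%p\n", (void *)p);                  /* keep the traversal result live */
    free(order);
    free(chain);
    return 0;
}

Timing the traversal loop and dividing by the iteration count gives the effective latency per dependent load; this is essentially what tools such as lmbench's lat_mem_rd and TinyMemBench measure.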

When memory latency is the key criterion, the application needs to be architected to achieve a high degree of locality of memory access on NUMA systems. If not, then NUMA systems should be avoided where practical, more so at the higher remote node latencies. As it turns out, very few real-world databases have been architected for NUMA. Both the application and the database must be architected to work together with a common strategy to achieve memory locality on a NUMA system. This was done for the TPC-C and TPC-E benchmark databases; apparently, no one put out the word that this was important.
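
As a minimal sketch of what locality-aware allocation looks like at the lowest level (assuming Linux with libnuma installed; the buffer size and names are illustrative): the thread determines which node it is running on, binds itself there, and allocates its working memory from that node. A NUMA-architected database would do something equivalent for each data partition and the worker threads that own it.

/* numa_local.c - allocate memory on the local node (illustrative sketch) */
/* build: gcc numa_local.c -o numa_local -lnuma */
#define _GNU_SOURCE
#include <numa.h>       /* libnuma: numa_available, numa_alloc_onnode, numa_free */
#include <sched.h>      /* sched_getcpu */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports NUMA is not available\n");
        return 1;
    }

    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);

    numa_run_on_node(node);                     /* keep this thread on its current node */

    size_t bytes = 1UL << 30;                   /* 1GB working set (illustrative) */
    void *buf = numa_alloc_onnode(bytes, node); /* memory comes from the local node */
    if (buf == NULL) {
        fprintf(stderr, "local-node allocation failed\n");
        return 1;
    }

    printf("cpu %d on node %d: %zu bytes allocated locally\n", cpu, node, bytes);
    /* ... latency-sensitive work against buf would go here ... */
    numa_free(buf, bytes);
    return 0;
}

The same effect can often be had without code changes by running a process under numactl with --cpunodebind and --membind, which is a common way to enforce single-node locality in practice.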

For database transaction processing, the Xeon SP in a single-socket configuration can have up to 28 cores with entirely local node memory on 6 channels. The multi-socket Xeon SP has remote node memory at about 139ns. In the AMD product line, local die-only memory latency is only possible with up to 8 cores. A single socket/MCM can have up to 32 cores with the 89ns local and 139ns remote die latencies. The remote MCM latency in a multi-socket EPYC system is not a good match for database transaction processing without correct NUMA optimization.

Furthermore, the mainstream databases are licensed on a per-core basis, at a cost that far exceeds the processor cost. Licensing for SQL Server Enterprise Edition is $3,300 per core (discounted?). The high-end Xeon SP 8180 is $10,009, or $358 per core. At a configuration of roughly 27GB per core (12 x 64GB DIMMs for 28 cores) and a memory cost of $16 per GB (64GB ECC LRDIMM, $1,036, Crucial), memory contributes about $444 per core. When the platform runs a mainstream RDBMS with Enterprise Edition licensing, the cost of the processor and memory is inconsequential. What matters more is the combination of performance per core and scaling. This is achieved with low latency to memory in a large single-die processor, no matter what the cost on the hardware side.
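
Putting rough numbers on it with the figures above (list prices, single socket, no discounts assumed): the processor works out to $10,009 / 28, about $358 per core, and memory to 12 x $1,036 = $12,432, about $444 per core, for a combined hardware contribution of roughly $800 per core. SQL Server Enterprise Edition at $3,300 per core is already about four times that, and at undiscounted list pricing the gap is wider still, so per-core performance dominates the total cost of the platform.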

It is also underappreciated that on a single-socket system, the memory access latency is lower than the local node latency of a multi-socket system, perhaps around 75ns. Presumably the difference is the cache-coherency protocol in the multi-socket system, even though a request to the local node memory controller is issued concurrently with the remote node cache-coherency check.

[Figure: single-socket Xeon SP memory latency (1S_XeonSP4)]

Years ago, this would have been just an interesting but non-actionable curiosity, because both the total compute and the memory capacity of a multi-socket system were desired for high-volume database transaction processing. In recent years, a single processor socket can encompass a sufficient number of cores and enough memory to handle most workloads, more so with storage on all-flash/SSDs.

Given the importance of latency, it is surprising that the memory technology has not been addressed. Mainstream DDR4 DRAM is designed for low cost at acceptable latency. RL-DRAM achieves much lower latency by sub-dividing the chip into more, smaller banks, and by de-multiplexing the row and column addresses. The first increases the area required for logic in relation to the bit-array area. The second increases the pin count, which was once a factor but is now a non-issue. RL-DRAM might have double the cost of DDR4, but the value is even greater, if one can accept that the correct system architecture is single socket.

In the modern datacenter, most deployments are or will be on (privately managed) virtual machines or with cloud providers. Many if not most VMs will have 8 or fewer cores. In this case, the hypervisor should be capable of detecting the underlying system architecture and allocating local node memory to each VM. If so, both platforms would seem suitable.

My guess is that the AMD four-dies-in-one-MCM approach might be a better fit if each 8-core die could be its own system, that is, capable of booting a separate OS image (could all four dies be managed by a single hypervisor?). Each die would have its own local memory, and remote node cache-coherency protocols would be disabled. The fabric connects the four systems/dies within one MCM, and potentially to other MCMs as well in a 2- or 4-socket system. All systems/dies share a high-bandwidth fabric, preferably 40/100Gb, to external machines. This allows the cost of both the HBA and the network switch ports to be amortized over many systems, in the manner of blade servers with a common enclosure.

In summary, the Intel Xeon SP and AMD EPYC are very different products that match up to different solutions with only some overlap. There could be very little consequence to a delay of the 10nm process. What might be more important is to recognize that the world has changed. We can stop (or perhaps relax) the mad rush of adhering to Moore's Law, and take the opportunity of an expanded process window to re-evaluate product strategy with greater specialization in each of the major market segments.

 


Supporting material and sources, AMD EPYC latency and cost

The ServeTheHome article shows latency from a given node to each of the 8 nodes in a 2-socket EPYC system, with each socket comprised of 4 dies in an MCM.

[Figures: ServeTheHome latency matrices at DDR4-2400 and DDR4-2666 (SthDDR4-2400, SthDDR4-2666)]

There are 4 nodes in one socket. Local node latency is always the fastest. Latencies to a remote node in the same MCM are comparable to Intel's 1-hop remote node. Latency to the remote MCM varies depending on whether the path to the remote socket is 1-hop or 2-hop.

Tom's Hardware, "Hot Chips 2017: AMD Outlines Threadripper And EPYC's MCM Advantage, Claims 41% Cost Reduction", says a hypothetical 32-core EPYC die would be 777mm2 and would have a die cost 69% higher than the actual implementation of 4 x 213mm2 dies in an MCM. (The slide shows the 32-core die at relative cost 1.0X and the four 8-core dies at 0.59X.)

Anandtech "Sizing Up Servers: Intel's Skylake-SP Xeon versus AMD's EPYC 7000 - The Server CPU Battle of the Decade?" by Johan De Gelas & Ian Cutress on July 11, 2017, Memory Subsystem: Latency.

[Table: Anandtech memory latency comparison, EPYC versus Xeon (AnandEpycXeonTable)]

[Figure: TinyMemBench latency, EPYC versus Xeon (latencyepyc_xeonv5_tinymembench), from the above Anandtech article]

Another source: AMD Optimizes EPYC Memory With NUMA

[Figure: AMD EPYC MCM cost comparison slide (AMDEpycCost)]

AMD Performance Tuning Guidelines for Low Latency Response