
Processor Performance Overview 2017 Jul

Processor performance is a complicated topic that cannot be reduced to a single metric. For the IT professional, there is a gamut of choices: the Intel Xeon E5 and E7 product lines, a few other Intel options, a potential revival of interest in AMD with Epyc based on the new Zen microarchitecture, the two remaining RISC processors, the possibility of an ARM-based server processor, and finally GPU-based products. Fundamentally, there is no single choice that is best for all workloads. Modern microprocessors make trade-offs between several characteristics, each of which matches particular applications.

There are several prominent characteristics of modern processors: 1) the cache hierarchy, 2) pipelined execution, 3) superscalar execution, 4) SIMD or vector processing, 5) Hyper-Threading (HT), the generic term being simultaneous multi-threading (SMT), and 6) multi-core. This collection forms a set of characteristics that must be balanced in a particular product.

Cache Hierarchy

The purpose of a cache hierarchy is to achieve the lowest weighted memory access latency. Recent Intel processors have an L1 cache of 32KB each for data and instructions, for a total of 64KB L1. The L2 is 256KB. The expectation is that the cache hit ratio increases with the size of the cache, but the incremental gain in hit ratio should diminish with each doubling of cache size. A practical aspect of cache size is that larger caches incur longer latency.

Intel Skylake i7-6700 4GHz processor latencies are reported as follows: L1 4 cycles for simple access, L2 12 cycles, L3 42 cycles (Kaby Lake is 38 cycles), and memory at L3 + 51ns. For a 4GHz CPU, the clock period is 0.25ns, so 51ns is 204 clocks, for a combined memory latency of 246 cycles.

Let's now suppose that for some fictitious workload, memory accesses are 80% to L1, 10% to L2, 5% to L3, and 5% to memory. Then the average memory access latency is 18.8 cycles.
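A minimal sketch of this calculation in Python, using the Skylake latencies above and the hypothetical hit fractions of this fictitious workload:

# Weighted average memory access latency for the fictitious workload above.
# Memory = L3 (42 cycles) + 51ns; at 4GHz (0.25ns/clock), 51ns = 204 clocks, 246 total.
clock_ns = 0.25
latency = {"L1": 4, "L2": 12, "L3": 42, "MEM": 42 + 51 / clock_ns}
hits    = {"L1": 0.80, "L2": 0.10, "L3": 0.05, "MEM": 0.05}

avg = sum(hits[lvl] * latency[lvl] for lvl in latency)
print(f"average memory access latency: {avg:.1f} cycles")   # 18.8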

Suppose that L1 latency could be reduced to 3 cycles at some smaller size. Then a 71% L1 hit rate, assuming the combined L1/L2 hit rate stays at 90% for a fixed L2 size, corresponds to the break-even point. An L1 hit rate above 71% would see a net benefit from the lower L1 latency; a lower L1 hit rate would result in higher average memory latency.
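The break-even point follows from the same arithmetic; a sketch of the solve under the stated assumptions (90% combined L1/L2 hit rate, fixed L2):

# Break-even L1 hit rate p for a smaller, 3-cycle L1:
#   p*3 + (0.90 - p)*12 = 0.80*4 + 0.10*12 = 4.4 cycles (baseline L1+L2 contribution)
baseline_l1_l2 = 0.80 * 4 + 0.10 * 12
p = (0.90 * 12 - baseline_l1_l2) / (12 - 3)
print(f"break-even L1 hit rate: {p:.1%}")   # ~71.1%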

Now assume the combined L1/L2/L3 cache hit ratio is 95% for some fixed L3 size, with L1 remaining at 80%. Suppose that a larger L2 cache has 17-cycle latency. Then a 12% L2 hit rate is the break-even point. A higher hit rate at L2 (and correspondingly lower at L3) achieves lower average memory access latency, while a lower L2 hit rate results in higher average memory latency.
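Again as a sketch, the corresponding solve for the larger L2 (combined L1/L2/L3 hit rate fixed at 95%, L1 at 80%):

# Break-even L2 hit rate q for a larger, 17-cycle L2:
#   q*17 + (0.15 - q)*42 = 0.10*12 + 0.05*42 = 3.3 cycles (baseline L2+L3 contribution)
baseline_l2_l3 = 0.10 * 12 + 0.05 * 42
q = (0.15 * 42 - baseline_l2_l3) / (42 - 17)
print(f"break-even L2 hit rate: {q:.1%}")   # 12.0%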

In principle, increasing the size of the last-level cache might increase performance. The trade-off is that a larger cache has higher cost and reduces the number of cores that can fit on a multi-core die.

Of course, every application has its own cache hit rate pattern, often dependent on workload circumstances, so the choice is a complex set of trade-offs that is hopefully not bad for any application, but possibly not best for any either.

Pipelined Execution

Superscalar Execution

SIMD - vector execution

Hyper-Threading, AKA SMT

Multi-core

It used to be that the largest practical CPU die size was about 440mm2, limited, perhaps, by the capability of the photolithography system. Sometime around 2005 or so, it became possible to manufacture larger dies. Intel Madison 9M L2 was 430mm2 in 2004, Montecito was 580mm2 in 2006, and Tukwila was 698.75mm2 in 2010.

For digital photography enthusiasts, this seems to correlate with the period when full frame camera sensors became practical; prior to this, a full frame sensor had to be built in two separate processes(?). Full frame is 36x24mm and APS-H is 28.7x19.1mm, which seems to correspond to camera sensor sizes and dates: 1D 28.7x19.1mm (2001), 1Ds 35.8x23.8mm (2002), 1D Mark II 28.7x19.1mm (2004), 5D 35.8x23.9mm (2005); Canon used APS-H from the 1D through the 1D Mark IV (to 2011). (Source: Canon Camera Museum.)

Tick-Tock Model: a new process technology (the "tick") alternating with a new microarchitecture (the "tock").

(see Wikipedia List_of_Intel_microprocessors)