Turbo Boost Alternative

The Intel main-line processors are designed to operate at very high frequency. It is not practical, however, to stay within the power, current, and thermal limits with many cores running at the design frequency, depending on the workload. The established solution is dynamic frequency switching, Turbo Boost Technology in Intel terminology. The processor has a de-rated base frequency that can be sustained under all conditions; under favorable conditions, one or more cores can run at a higher frequency than the base.


Turbo Boost is a good strategy to better realize the design capability of the processor and its cores, that is, if we take a frequency-centric view of performance. But important workloads, characterized by pointer-chasing code, have only weak frequency sensitivity above the 2GHz level. Is there a better strategy for latency-sensitive software?
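
To make the pointer-chasing pattern concrete, below is a minimal sketch in C of the kind of latency-bound loop meant here: a chain of dependent loads through a 64MB array, so nearly every access misses cache and the core waits out the full memory latency regardless of clock frequency. The array size, stride, and hop count are arbitrary illustration values, not from any particular benchmark.

```c
/* Pointer-chasing latency sketch. Compile with: cc -O2 -std=gnu11 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024 / sizeof(void *))  /* 64MB array, well past L3 */
#define HOPS 100000000UL

int main(void)
{
    void **chain = malloc(N * sizeof(void *));
    if (!chain) return 1;

    /* Link entry i to entry (i + 4097) mod N: the stride is coprime with N,
     * so the chain visits all N entries, and each hop lands ~32KB away,
     * crossing a page boundary, which hampers the hardware prefetcher. */
    for (size_t i = 0; i < N; i++)
        chain[i] = &chain[(i + 4097) % N];

    void **p = &chain[0];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < HOPS; i++)
        p = (void **)*p;                 /* each load depends on the last */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Printing p keeps the compiler from discarding the loop. */
    printf("final %p, avg load-to-load latency %.1f ns\n",
           (void *)p, ns / (double)HOPS);
    free(chain);
    return 0;
}
```

Raising the clock shortens the handful of cycles between loads but not the memory round trip, which is why this class of code is largely frequency-insensitive.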

There might be. For this, we can compare the Intel Core processors to the now-neglected Itanium processors, which were not designed for exceptionally high frequency.


Intel Core processors now have three levels of cache (with caveats). The L1 and L2 are part of the core; the L3 is outside, in the uncore. The L1 and L2 access latencies are fixed in CPU cycles regardless of operating frequency. L1 access is 4-5 cycles. Second-level cache access is 12 cycles for the 256K L2 in client processors and 14 cycles for the 1M L2 in Xeon SP processors.

Given that Intel 14nm processors can operate at over 4GHz, we might then interpret the L1 as being capable of under 1ns access time and the 256K L2 of 3ns access time. But we should also keep in mind that Intel processors can be significantly overclocked, meaning that the cache might be capable of much better latency.

The L3 access appears to be determined in time, at around 10-11ns on the quad-core die. On the big-die processors, L3 is 18ns for the 24-core Broadwell with dual-ring interconnect and 19.5ns for the 28-core Skylake with mesh interconnect. A quad-core processor operating at 4GHz might have 42-cycle L3 latency, corresponding to 10.5ns. A big-die processor operating at 2.5GHz might have 45-cycle L3, corresponding to 18ns latency.
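
The cycle-versus-time arithmetic above is simply cycles divided by frequency in GHz; a quick restatement, using only the figures quoted in the text:

```c
#include <stdio.h>

/* Cache latency arithmetic from the figures quoted above:
 * latency_ns = cycles / frequency_GHz. */
int main(void)
{
    struct { const char *cache; double cycles, ghz; } lat[] = {
        { "L1, client, 4GHz",        4.0, 4.0 },  /* ~1ns, less above 4GHz */
        { "L2 256K, client, 4GHz",  12.0, 4.0 },  /* 3ns    */
        { "L3 quad-core, 4GHz",     42.0, 4.0 },  /* 10.5ns */
        { "L3 big die, 2.5GHz",     45.0, 2.5 },  /* 18ns   */
    };
    for (int i = 0; i < 4; i++)
        printf("%-24s %4.1f cycles = %5.2f ns\n",
               lat[i].cache, lat[i].cycles, lat[i].cycles / lat[i].ghz);
    return 0;
}
```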

Compare the Itanium 9500 processor (Poulson) on 32nm at 2.53GHz with the main-line processor on 32nm, Sandy Bridge, capable of 3.9GHz in Turbo Boost. Poulson is listed as having Turbo Boost with 2.53GHz as the base frequency, but the maximum frequency does not appear to be listed.

The Itanium 9500 L1 (FLI) is cited as 1 cycle and the 512K L2 (MLI) as 9 cycles. (Note: later-generation Itanium processors have separate L2 instruction and data caches, whereas other Intel processors have a unified L2.)

Given that an important class of applications does not benefit much from higher operating frequency, an alternative to Turbo Boost might be to cap the processor frequency at some reasonable value, perhaps 2.0-2.5GHz, and instead lower the L1 and L2 cache latency?

This could probably be done for the L2 without much difficulty. The L1 cache, however, is hardwired into the processor-core pipeline and may be much more difficult to change. So, degree of difficulty versus benefit is a factor. Of course, if CPUs were simple, then anyone could build them.

For example, the 1M L2 in Xeon SP has 14-cycle latency to accommodate operation at 4.2GHz or perhaps higher, implying access in under 3.33ns. Then for a core capped at 2.5GHz, 9-cycle latency should be possible, as 9 cycles at 2.5GHz is 3.6ns, a looser timing budget than the 3.33ns the circuit already meets. If the 32nm Poulson can have a 1-cycle L1 (16KB) at 2.53GHz, is a 2-cycle 32KB L1 possible on 14nm?
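
The timing-budget argument reduces to a one-line feasibility check; this sketch only restates the numbers in the paragraph above:

```c
#include <stdio.h>

/* Timing-budget check for the proposed 9-cycle L2: the existing 1M L2
 * already delivers in 14 cycles at 4.2GHz, i.e. within ~3.33ns, while
 * 9 cycles at a 2.5GHz cap allows 3.6ns, a looser budget. */
int main(void)
{
    double current  = 14.0 / 4.2;   /* ns the circuit is known to meet  */
    double proposed =  9.0 / 2.5;   /* ns available at the capped clock */
    printf("14 cycles @ 4.2GHz = %.2f ns\n", current);
    printf(" 9 cycles @ 2.5GHz = %.2f ns (%s)\n", proposed,
           proposed >= current ? "within what the circuit already meets"
                               : "tighter than today");
    return 0;
}
```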

Note: WikiChip cites Sandy Bridge L3 as 26-31 cycles.

Modern processors are very complicated, and many old concepts are only valid with qualifications. The L1I on Sandy Bridge is not the true first level in that there is an L0 µOP cache of 1,536 µOPs, organized as 32 sets, 8-way, with 6 µOPs per entry.

Sandy Bridge has a 14-stage pipeline from the µOP cache (1-cycle access?). Instruction fetch from the 32KB L1I is 4 cycles or pipeline stages, incorporating pre-decode, and decode itself may be 2 stages. Hence the pipeline is perhaps 19 stages from the L1 and 14 from the L0?

 

Poulson: 32nm, 8 cores, 32M L3, 544mm², 29.9 × 18.2mm
(same layout as Nehalem-EX and Westmere-EX)

Westmere-EX: 32nm, 513mm², 10 × 3M L3, 24.90 × 20.61mm?

 

Below is estimated frequency scaling for various average memory latencies.

[Figure: estimated frequency scaling for various average memory latencies]

An average memory latency of 114ns is representative of a 2S system with a 50/50 local-remote node memory access mix. In a 1S system, 76ns is the expected memory latency.
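
A rough model behind such scaling curves splits per-operation time into core work that scales with frequency plus a fixed memory-latency stall. The 200-cycle work figure and 2GHz baseline here are invented for illustration; only the memory latencies come from the figures discussed in this section:

```c
#include <stdio.h>

/* Toy frequency-scaling model: per-miss time = core_cycles/f + mem_ns.
 * core_cycles (work between misses) is an assumed illustration value;
 * the memory latencies are the ones discussed in this section. */
int main(void)
{
    const double core_cycles = 200.0;
    const double mem[] = { 76.0, 93.0, 114.0 };
    for (int m = 0; m < 3; m++) {
        double base = core_cycles / 2.0 + mem[m];   /* 2GHz baseline */
        printf("mem %3.0fns:", mem[m]);
        for (double f = 2.0; f <= 4.01; f += 0.5)
            printf("  %.1fGHz=%.2fx",
                   f, base / (core_cycles / f + mem[m]));
        printf("\n");
    }
    return 0;
}
```

At 114ns average latency, doubling the clock from 2GHz to 4GHz yields only about 1.3x in this model, which is the weak frequency sensitivity argued above.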

Conventional DDR4 SDRAM has a tRC value of 46ns. In the 1S system, L3 is 18ns for Broadwell and 19.5ns for Skylake. Latency from memory controller to DRAM and back is 58ns for standard RDIMM timing, implying about 12ns for transmission between the memory controller and DRAM.
The total latency is then L3 + memory = 18 + 58 = 76ns.

In a 2S Broadwell system, L3 is 18ns and local-node memory is 75ns, for a total (L3 + memory) of 93ns. Presumably the difference between 1S and 2S local node is related to cache coherency?

In a 2S Skylake, L3 is 19.5ns. Total latency to local-node memory is cited as 89ns, meaning memory with 2S cache coherency is 69.5ns? Remote-node latency in 2S Skylake is cited as 139ns. The average latency with a 50/50 local-remote mix is then 114ns.
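
The decomposition above can be summarized in a few lines; all numbers are the cited values, and the coherency overheads are inferred by subtraction:

```c
#include <stdio.h>

/* Memory latency decomposition using the cited figures:
 * total = L3 + DRAM tRC + controller<->DIMM transmission (+ coherency). */
int main(void)
{
    double tRC = 46.0, xmit = 12.0;
    double mem_1s = tRC + xmit;                       /* 58ns round trip */

    printf("1S Broadwell: 18.0 L3 + %.0f mem = %.0f ns\n",
           mem_1s, 18.0 + mem_1s);
    printf("2S Broadwell local:  18.0 + 75.0 = %.0f ns\n", 18.0 + 75.0);
    printf("2S Skylake  local:   cited 89 ns -> mem part %.1f ns\n",
           89.0 - 19.5);
    printf("2S Skylake  remote:  cited 139 ns\n");
    printf("2S 50/50 average:    %.0f ns\n", (89.0 + 139.0) / 2.0);
    return 0;
}
```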

There are no low-latency products for a server system with ECC memory, either unbuffered or registered. But it should be possible to produce such a product. A tRC of 36ns could perhaps be achieved simply with binning (selecting the better parts) and a heat sink on the DIMM, in the manner of memory for gaming systems. A more substantial alteration, for example a smaller bank size, should be able to bring tRC down to 26ns, resulting in 56ns L3 + memory?
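
Plugging the lower tRC values into the same 1S decomposition (18ns L3 plus about 12ns transmission) gives the totals suggested above; the binned and small-bank tRC figures are speculative:

```c
#include <stdio.h>

/* Effect of a lower-tRC DIMM on the 1S total, using the decomposition
 * above: L3 18ns + tRC + ~12ns controller/transmission. */
int main(void)
{
    double l3 = 18.0, xmit = 12.0;
    double trc[] = { 46.0, 36.0, 26.0 };   /* stock, binned, smaller banks */
    for (int i = 0; i < 3; i++)
        printf("tRC %2.0f ns -> L3 + memory = %2.0f ns\n",
               trc[i], l3 + trc[i] + xmit);
    return 0;
}
```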

 
