
System Architecture

Update 2017-01

Knights Landing

Can the Intel Xeon Phi x200, aka Knights Landing, run SQL Server? It does run Windows Server 2016, so is there anything in SQL Server 2016 that would stop it from installing? Xeon Phi is designed for HPC, so it would not have been tested with SQL Server, but that does not confirm whether it will or will not work. If it does work, then it could be used to prove out some important theories.

The main line Intel processors used in the Core i7/i5/i3 and Xeon product lines, with recent generation codenames Haswell, Broadwell, Sky Lake and soon Kaby Lake, are heavily overbuilt at the core level. The Broadwell processor core is designed to operate at 4GHz, if not higher.

The Haswell 22nm processor was rated for up to 3.5GHz base, 3.9GHz turbo at launch in Q2'13. In Q2'14, a new model was rated 4.0/4.4GHz base/turbo at 88W TDP. Both are C-0 step, so was the higher frequency achieved through manufacturing process maturity, or cherry-picking? The Broadwell 14nm processor had a top frequency of 3.3/3.8GHz base/turbo at 65W for desktop, but perhaps this is because it was focused more on mobile than desktop? (Curiously, there is also a Xeon E3 v4 at 3.5/3.8GHz, 95W, with Iris Pro graphics.) The top Sky Lake 14nm processor was 4.0/4.2GHz base/turbo at 91W.

With a single core under load, the processor is probably running at the turbo boost frequency. When all four cores are under load, it should be able to maintain the rated base frequency while staying within design thermal specifications, and it might be able to run at a boosted frequency depending on which execution units are active.

The latest Intel Xeons (regular, not Phi) are the E5 and E7 v4, based on the Broadwell core. There are 3 die versions, LCC, MCC, and HCC, with 10, 15, and 24 cores respectively. All of these should be able to operate at the same frequency as the desktop Broadwell or better, considering that the Xeon E5/7 v4 Broadwells came out one year after the desktop processors. But Xeons need to be more conservative in their ratings, so a somewhat lower frequency is understandable.

The top Xeon 4-core model, E5-1630 v4, using the LCC die, is 3.7/4.0GHz at 140W TDP. The top 8-core, E5-1680 v4, is 3.4/4.0GHz, also at 140W TDP.

The top 14-core (MCC die) is 2.2/2.8GHz 105W. The top 24-core (HCC die) is 2.2/3.4GHz 140W. So the Xeon E5 and E7 v4 processors are built using cores designed to operate electrically at over 4GHz, but are constrained by heat dissipation when all cores are active to a much lower value, as low as one-half the design frequency in the high core count parts.
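The tradeoff can be put in rough numbers. A minimal sketch, using aggregate core-GHz within the same 140W TDP budget as the metric (an assumed simplification that ignores IPC, turbo, and memory effects):

```python
# Rough all-cores throughput in core-GHz for two 140W TDP Xeon E5 v4 parts,
# assuming base frequency is sustained on all cores.
parts = {
    "E5-1630 v4 (4-core LCC)": 4 * 3.7,    # 14.8 core-GHz
    "24-core HCC":             24 * 2.2,   # 52.8 core-GHz
}
for name, core_ghz in parts.items():
    print(f"{name}: {core_ghz:.1f} core-GHz")
```

The 24-core part gives roughly 3.6X the aggregate cycles of the 4-core part in the same power envelope, at the cost of running each core at well below its design frequency.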

The transistor density half of Moore's law, combined with Pollack's rule, says that doubling the number of transistors on the same manufacturing process should enable a 40% increase in general purpose performance. The implication here is that if a particular Intel (full-size) processor core is designed with the transistor budget to operate at 4GHz, then in theory, a core with one-quarter of that transistor budget should be comparable to the full-size core operated at one-half the design frequency, whatever the actual operating frequency of the quarter-size core is. (Doubling the quarter-size core to half-size yields a 1.4X gain. Doubling again to full-size yields another 1.4X, for approximately 2X performance going from quarter to full size.)

So the theory is that it might be possible to fit 100 cores of one-quarter the complexity of the Broadwell core on a die of comparable size to Broadwell-EX (456mm2), with adjustments for L2/L3 variations and differences in the memory and PCI elements.

This is just what Xeon Phi, aka Knights Landing, appears to be. There are 72 cores in 36 tiles, operating at 1.3-1.5GHz base, 1.7GHz turbo. The Xeon Phi x200 core is based on the Silvermont Atom, but at 14nm. A tile is composed of 2 Atom cores, with 4-way simultaneous multi-threading (SMT) and 1M of L2 cache shared between the 2 cores. (There is no shared L3; cache coherency is handled by a distributed tag directory over the on-die mesh interconnect.) The Xeon Phi has 16GB of MCDRAM and 6 memory channels capable of 115GB/s and 384GB max capacity (6x64GB). The MCDRAM can be used in one of three modes: Cache, Flat, or Hybrid.
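The DDR4 figures above can be sanity-checked. A minimal sketch, assuming DDR4-2400 and standard 64-bit (8 bytes per transfer) channels:

```python
# Peak DDR4 bandwidth across Knights Landing's 6 memory channels,
# assuming DDR4-2400 (2400 MT/s) and 8 bytes per transfer per channel.
channels = 6
mt_per_s = 2400e6
bytes_per_transfer = 8
ddr4_bandwidth_gbs = channels * mt_per_s * bytes_per_transfer / 1e9
print(ddr4_bandwidth_gbs)   # 115.2, matching the 115GB/s figure

# Max capacity with one 64GB load per channel:
max_capacity_gb = channels * 64
print(max_capacity_gb)      # 384
```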

There is no mention of the MCDRAM latency, only the phenomenal combined bandwidth of 400-500GB/s. My expectation is that the off-die memory round-trip latency should be lower when the memory is in the same package as the processor, compared to the common arrangement where memory is outside the processor package. This is because it should be possible to use really narrow wires to connect the processor to memory in a common package, so there should be fewer buffering circuits to amplify the signal current? (Can some circuit designer speak to this please?)

This higher core count with more SMT threads per core is more or less comparable to IBM POWER, SPARC, and even AMD Zen. Transactional queries are essentially pointer-chasing code: fetch a memory location, use its value to determine the next location to fetch. This should run fine on a simpler core than the 6/8-port superscalar Broadwell, and it has many dead cycles during the round-trip memory access latency, implying SMT will work well (beyond the two threads per core in the main line Intel cores).
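The pointer-chasing pattern can be illustrated with a short sketch (a hypothetical example, not SQL Server code): each load's address depends on the previous load's value, so the accesses cannot be overlapped and the full round-trip latency of each one is exposed, which is exactly the dead time SMT can fill with other threads.

```python
import random

def build_chain(n, seed=42):
    # Build a random single-cycle permutation: nxt[i] is the "pointer"
    # stored at slot i. Random order defeats hardware prefetching.
    order = list(range(n))
    random.Random(seed).shuffle(order)
    nxt = [0] * n
    for i in range(n):
        nxt[order[i]] = order[(i + 1) % n]
    return nxt

def chase(nxt, steps):
    # Each iteration is a dependent load: the next index is unknown
    # until the current load completes, so nothing can be overlapped.
    pos = 0
    for _ in range(steps):
        pos = nxt[pos]
    return pos

nxt = build_chain(1 << 20)    # ~1M-entry chain, far larger than cache
print(chase(nxt, 1 << 20))    # a full cycle returns to the start: 0
```

On real hardware, timing this loop against an independent-loads loop of the same length shows the memory latency wall the simpler cores would face, and why 4-way SMT per core is a sensible design choice here.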

However, this may not be the best general purpose computing solution, in that there are important single-threaded tasks, or tasks that are not massively parallelizable, for which the existing powerful Intel core is best. My thinking is that a mix of a few powerful cores and many smaller cores is the right solution, and that there should be a few smaller cores dedicated to special OS functions (interrupt handling and polling), in a blended asymmetric-symmetric arrangement.





Knights Landing 14nm, Xeon Phi x200


The above is based on a die size of 683mm2, and if the image aspect ratio is correct, then the dimensions are 32.2 × 21.2 mm.

Anandtech: the Hybrid Memory Cube currently stacks 4 to 8 DRAM dies. At the Hot Chips 23 conference in 2011, the first-generation HMC demonstration cube, with four 50nm DRAM memory dies and one 90nm logic die, a total capacity of 512MB, and a size of 27x27mm, had a power consumption of 11W at 1.2V. The second version of the HMC specification was published on 18 November 2014 by the HMCC. HMC2 offers a variety of SerDes rates ranging from 12.5Gbit/s to 30Gbit/s, yielding an aggregate link bandwidth of 480GB/s (240GB/s each direction), though promising only a total DRAM bandwidth of 320GB/s. A package may have either 2 or 4 links (down from the 4 or 8 in HMC1), and a quarter-width option is added using 4 lanes. Micron: 2GB, 160GB/s.

Intel Knights Landing for Developers