Rethinking System Architecture (2017-01, Home)
Server system memory capacities have grown to ridiculously large levels far beyond what is necessary now that solid-state storage is practical. Why is this a problem? Because the requirement that memory capacity trumps other criteria has driven system architecture to be focused exclusively on low cost DRAM. DDR DRAM, currently in its fourth version, has low cost, acceptable bandwidth, but poor latency. Round-trip memory access latency is now far more important for database transaction processing. The inverse of latency is practically a direct measure of performance.
In the hard disk era, the importance of memory in keeping IO volume manageable did trump all other criteria. In theory, any level of IO performance could be achieved by aggregating enough HDDs. But large arrays have proportionately more frequent component failures. Although RAID provides fault tolerance, the failed drive rebuild process causes operational problems. The practical upper bound for HDD storage is on the order of 1000 disks, or 200K IOPS. If it was necessary to purchase 1TB of memory to accomplish this, then it was money well spent because the storage was even more expensive.
Since then, storage with performance requirements have or should be transitioning to all-flash, and not some tiered mixed of flash and HDD. Modern solid-state arrays can handle up to 1M IOPS, far more and easily when on a full NVMe stack. Now is the time to re-evaluate just how much memory is really needed when storage is solid-state.
There are existing memory technologies, RLDRAM and SRAM for example, with different degrees of lower latency, higher cost and lower density at both the chip and system level. There is potential to reduce memory latency by a factor of two or more. The performance impact for database transaction processing is expected to be equal. A single-socket system with RLDRAM or SRAM memory could replace a 4-socket system with DDR4. Practically any IO volume, even several million IOPS, can be handled by the storage system. The key is that the CPU expended by the database engine on IO be kept to a reasonable level.
An argument is made for a radical overhaul of current system architecture, replacing DDR4 DRAM as main memory with a different technology that has much lower latency, possibly sacrificing an order of magnitude in capacity. As the memory controller is integrated into the processor, this also calls for a processor architecture change. A proper assessment for such a proposal must examine not only its own merits, but also other technologies with the potential for order of magnitude impact.
The discussion here mostly pertains to Intel processors and Microsoft SQL Server. However, the concepts are valid on any processor and database engines built around page organization, row-store with b-tree indexes. The argument for low latency memory is valid even in the presence of Hekaton memory-optimized tables, or other MVCC implementations because it is a drop-in hardware solution instead of a database re-architecture project. Of course, the heavy lifting is on Intel or other to re-architect the system via processor and memory.
For more than forty years, DRAM has been the standard for main memory in almost every computer system architecture. A very long time ago, having sufficient memory to avoid paging was a major accomplishment. And so, the driving force in memory was the semiconductor technology with low cost. This lead to DRAM. And DRAM was optimized to 1 transistor plus 1 capacitor (1T1C). DRAM addressing was multiplexed because even the package and signal count impact on cost mattered.
The page file has since become a relic, that for some reason has yet to be removed. Over the years, the mainstream form of DRAM, from SDRAM to DDR4, did evolve to keep pace with progress on bandwidth. The criteria that has changed little is latency, and it is round-trip latency that has become critical in applications characterized by pointer chasing code resulting serialized memory accesses.
Even as memory capacity reached enormous proportions, the database community routinely continued to configure systems with the memory slots filled, often with the maximum capacity DIMMs, despite the premium over the next lower capacity. There was a valid reason for this. Extravagant memory could mask (serious) deficiencies in storage system performance.
Take the above in conjunction with SAN vendors providing helpful advice such as log volumes do not need dedicated physical disks, and the SAN has 32GB cache that will solve all IO performance problems. Then there is the doctrine to implement their vision of storage as a service. But now all-flash is the better technology for IO intensive storage. Solid-state storage on an NVMe stack is even better (nvmexpress overview). Massive memory for the sole purpose of driving IO down to noise levels is no longer necessary.
Now free of the need to cover-up for weakness in storage, it is time to rethink system architecture, particularly the memory strategy. Not all application desire low latency memory at lower capacity and higher cost. Some work just fine with the characteristics of DDR4. Others prefer a different direction entirely. The HPC community is going in the direction of extreme memory bandwidth, implemented with MCDRAM in the latest Xeon Phi.
In the past, when Moore's Law was in full effect, Intel operated on: "the Process is the Business Model". It was more important to push the manufacturing process, which meant having a new architecture every two years with twice the complexity of the previous generation and then be ready to shrink the new architecture to a new process the next year. Specialty products divert resources from the main priority and were difficult to justify. But now process technology has slowed to a three-year cycle and perhaps four years in the next cycle. It is time to address the major profitable product outside of general purpose computing.
The principle topic here is the impact of memory latency on database transaction performance, and the directions forward for significant advancement. As the matter has complex entanglements, a number of different aspects come into play. One is scaling on non-uniform memory access (NUMA) system architecture, in which the impact of memory access can be examined. Memory technologies with lower latency are mentioned, but these topics are left to experts in their respective fields. Hyper-Threading is a work-around to the issue of memory latency, so it is mentioned.
The general strategy for server system performance in recent years is multi-core. But it is also important to consider which is the right core as the foundation. Another question to answer is whether a hardware and/or software solution should be employed. On software, Multi-version Concurrency Control (MVCC) is the technology employed by memory-optimized databases (the term in-memory, has misleading connotations) that can promise large performance gains, and even more extreme gain when coupled with natively compiled procedures. Architecting the database for scaling on NUMA architecture is also a good idea.
A mixed item is SIMD, the SSE/AVX registers and instructions introduced over the last 18 years. From the database transaction processing point of view, this is not something that would have been pursued. But it has come to occupy a significant chunk of real estate on the processor core. So, find a way to use it! Or give it back.
The list is as follows:
- Scale up on NUMA versus single socket
- RLDRAM/SRAM versus DDR4+3D XPoint
- Hyper-Threading to 4-way
- Core versus Atom and all comers?
- Hekaton, memory-optimized, MVCC
- Database architected for NUMA
- SSE/AVX Vector Instructions
The above topics will be address, perhaps only partially, often out of order, and sometimes mixed as appropriate. Factors to consider are potential impact, difficulty of implementation and overall business justification.
L3 and Memory Latency
The 7-cpu benchmark website lists L3 and memory latency for some recent processors, shown below.
|Processor||Base Freq||L3||Local Memory||Remote Memory|
|Westmere EP (Xeon X5650)||2.67GHz||40-42 cycles||L3+67ns||L3+105ns|
|Ivy Bridge (i7-3770)||3.4GHz||30 cycles||L3+53ns|
|Haswell (i7-4770)||3.4GHz||36 cycles||L3+57ns|
|Haswell (E5-2603 v3)||1.6GHz||42-43 cycles|
|Skylake (i7-6700)||4.0GHz||42 cycles||L3+51ns|
There are a number of questions. Is L3 latency determined by absolute time or cycles? In the Intel Xeon processors from Ivy Bridge EP/EX on, this is particularly complicated because there are 3 die layout models; LCC, MCC, and HCC, each with different structure and/or composition. The Intel material says that L3 latency is impacted by cache coherency snoop modes, ring structure and composition. See Intel Dev Conference 2015 Processor Architecture Update, and similar Intel HPC Software Workshop 2016 Barcelona For simplicity, L3 is assumed to be 15ns here without consideration for other details.
DRAM, SQL Server 8KB Page
The Crucial web site and Micron datasheet cites DDR4-2133 latency at 15-15-15, working out to 14ns each for CL, RCD and RP, so random row latency is 42ns the DRAM interface. The Intel web page for their Memory Latency Checker utility shows an example having local node latency as 67.5 or 68.5 ns and remote node as 125.2 or 126.5 ns. L3 and memory latency on modern processors is a complicated matter. The Intel values above includes the L3 latency. So, by implication, the 10ns difference is the transmission time from memory controller to DRAM and back?
One of the modes of DRAM operation is to open a row, then access different columns, all on the same row, with only the CL time between columns. I do not recall this being in the memory access API? Is it set in the memory controller? Systems designed for database servers might force close a row immediately?
In accessing a database page, first, the header is read. Next, the row offsets at the end of the 8KB page. Then the contents of a row, all or specific columns. This should constitute a sequence of successive reads all to the same DRAM row?
Memory Architecture - Xeon E5 and E7
The most recent Core i3/5/7 processors run in the mid-4GHz range. There are Xeon EP and EX processors with base frequency in the 3GHz+ range. But for the high core count models, 2.2GHz is the common base frequency. Below is a representation of the Xeon E5 memory subsystem.
In the Xeon E7, the memory controller connects to a scalable memory buffer (SMB) or Memory Extension Buffer (MXB), depending on which document, and the interface between MC and SMB is SMI. The SMB doubles the number of DIMMs that can connect to each memory channel.
There is no mention of the extra latency for the SMB. It cannot be free, "All magic comes with a price". The value of 15ns is used here for illustrative purposes.
Anyone with access to both 2-way Xeon E5 and 4-way E7 systems of the same generation and core count model (LCC, MCC or HCC) is requested to run the 7-cpu benchmark or Intel Memory Latency Checker utility, and make the results known. In principle, if Intel were feeling helpful, they would do this.
Below is a single socket Xeon E5 v4. The high core count (HCC) model has 24 cores. But the Xeon E5 v4 series does not offer a 24-core model. Two cores are shown as disabled as indicated, though it could any two? There is only one memory node, and all memory accesses are local.
Below is a 2-socket system, representing the Xeon E5 either HCC or MCC models in having two double rings, and the interconnect between rings. The details shown are to illustrate the nature of NUMA architecture.
NUMA is complicated topic, with heavy discussion on cache coherency, the details of which impacts L3 and memory latency. This discussion here will only consider a very simple model only looking a physical distance impact on memory latency. See NUMA Deep Dive Series by Frank Denneman.
Each socket is its own memory node (not sure what happens in Cluster-on-Die mode). To a core in socket 0, memory in node 0 is local, memory in node 1 is remote, and vice versa to a core in the other socket (more diagrams at System Architecture Review 2016).
Below is a representation of a 4-socket system based the Xeon E7 v4 memory. To a core in socket 0, memory in node 0 is local, memory in nodes 1, 2, and 3 are remote. And repeat for other nodes.
It is assumed that memory accesses are 15ns longer than on the E5 due the extra hop through the SMB outbound and inbound. This applies to both local and remote node memory accesses.
On system and database startup, threads should allocate memory on the local node. After the buffer-cache has been warmed up, there is no guarantee that threads running in socket 0 will primarily access pages located in memory node 0, or threads on socket 1 to memory node 1. That is, unless the database has been purposefully architected with a plan in mind to achieve higher than random memory locality.
Database Architecture for NUMA (TBD)
The application also needs to be built to the same memory locality plan. A connection to SQL Server will use a specific TCP/IP port number based on the key value range. SQL Server will have TCP/IP ports mapped to specified NUMA nodes. Assuming the application does the initial buffer-cache warm up, and not some other source that is not aware of the NUMA tuning, there should be alignment of threads to pages on the local node. (It might be better if an explicit statement specifies the preferred NUMA node by key value range?)
An example of how this is done in the TPC-C and E benchmark full disclosures and supplemental files. Better yet is to get the actual kit from the database vendor. For SQL Server, the POC is JR Curiously, among the TPC-E full disclosure and supplemental files' thousands of pages of inane details, there is nothing that says that the NUMA tuning is of pivotal importance. It is assumed in the examples here that threads access random pages, and memory accesses are evenly distributed over the memory nodes.
Simple Model for Memory Latency
In Memory Latency, NUMA and HT, a highly simplified model for the role of memory latency in database transaction processing is used to demonstrate the differences between a single-socket system having uniform memory and multi-socket systems having NUMA architecture.
Using the example of a hypothetical transaction that could execute in 10M cycles with "single-cycle" memory, the "actual" time is modeled for a 2.2GHz core in which 5% of instructions involve a non-open page memory access. The model assumes no IO, but IO could be factored in if desired. The Xeon E5 memory model is used for 1 and 2 sockets, and the E7 for the 4-socket system.
|GHz||L3+mem||remote||skt||avg. mem||mem cycles||fraction||tot cycles||tps/core||tot tx/sec|
If there were such a thing as single-cycle memory, the performance would be 220 transactions per second based 2,200M CPU-cycles per second and 10M cycles per transaction.
Based on 67ns round-trip memory access, accounting for a full CL+tRCD+tRP, the transmission time between processor and DRAM, and L3 latency, incurred in 0.5M of the 10M "instructions", the transaction now completes in 83.2M cycles.
The balance of 73.2M cycles are spent waiting for memory accesses to complete. This circumstance arises primarily in point-chasing code, where the contents of one memory access determines the next action. Until the access completes, there is nothing else for the thread to do. The general advice is to avoid this type of coding, except that this is what happens in searching a b-tree index.
TBD (to be part of DB Architecture for NUMA)
If the impact of NUMA on database transaction processing performance were understood and clearly communicated, databases could have been architected from the beginning to work with the SQL Server NUMA and TCP/IP port mapping features. Then threads running on a given node primarily access pages local to that node.
If this forethought had been neglected, then one option is to re-architect both the database and application, which will probably involve changing the primary key of the core tables. Otherwise, accept that scaling on multi-socket systems is not going to be what might have been expected.
Furthermore, the Xeon E7 processor, commonly used in 4-socket systems, has the SMB feature for doubling memory capacity. As mentioned earlier, this must incur some penalty in memory latency. In the model above, scaling is:
1P -> 2P = 1.45X, 2P -> 4P = 1.56X and
1P -> 4P = 2.26X
The estimate here is that the SMB has an 11% performance penalty. If the doubling of memory capacity (or other functionality) was not needed, then it might have been better to leave off the SMB. There is 4-waay Xeon E5 4600-series, but one processor is 2-hops away, which introduces its own issues.
There is a paucity of comparable benchmark results to support meaningful quantitative analysis. In fact, it would seem that the few benchmarks available employ configuration variations with the intent to prevent insight. Below are TPC-E results from Lenovo on Xeon E5 and 7 v4, at 2 and 4-sockets respectively.
|E5-2699 v4||2||44||88||512GB (16x32)||3x17 R5||4,938.14||-|
|E7-8890 v4||4||96||192||4TB (64x64)||5x16 R5||9,068.00||-|
It would seem that scaling from 2P to 4P is outstanding at 1.915X. But there is a 9% increase in cores per socket from 22 to 24. Factoring this in, the scaling is 1.756X, although scaling versus core count should be moderately less than linear. Then there is the difference in memory, from 256GB per socket to 1TB per socket. Impressive, but how much did it actually contribute? Or did it just make up for the SMB extra latency? Note that TPC-E does have an intermediate level of memory locality, to a lesser extent than TPC-C.
The details of the current Intel Hyper-Threading implementation are discussed elsewhere. The purpose of Hyper-Threading is to make use of the dead cycles that occur during memory access or other long latency operations. Hyper-threading is an alternative solution to the memory latency problem. Work arounds are good, but there are times when it is necessary to attack the problem directly. Single thread performance is important, perhaps second after overall system throughput.
Given that the CPU clock is more than 150 times that of memory round-trip access time, it surprising that Intel only implements 2-way HT. The generic term is simultaneous multi-threading (SMT). IBM POWER8 is at 8-way, up from 4-way in POWER7. SPARC has been 8-way for a few generations?
There is a paper on one of the two RISC processors stating that there were technical challenges in SMT at 8-way. So, a 4-way HT is reasonable. This can nearly double transaction performance. The effort to increase HT from 2-way to 4-way should not be particularly difficult. Given the already impossibly complex complexion of the processor, "difficult should be a walk in the park".
It might help if there were API directives in the operating system to processor on code that runs well with HT and code that does not.
One other implication of the memory latency effect is that scaling versus frequency is poor. The originator of the memory latency investigation was an incident (Amdahl Revisited) in which a system UEFI/BIOS update reset processors to power-save mode, changing base frequency from 2.7GHz to 135MHz. There was a 3X increase in worker (CPU) time on key SQL statements. A 20X change in frequency for 3X performance.
In other words, do not worry about the lower frequency of the high core count processors. They work just fine. But check the processor SKUs carefully, certain models do not have Hyper-Threading which is important. It also appears that turbo-boost might be more of a problem than benefit. It might be better to lock processors to the base frequency.
Hekaton - MVCC
Contrary to popular sentiment, putting the entire database in to memory on a traditional engine having page organization and row-store does not make much of a direct contribution in performance over a far more modest buffer cache size. This is why database vendors have separate engines for memory-optimized operation, Hekaton in the case of Microsoft SQL Server. To achieve order of magnitude performance gain, it was necessary to completely rethink the architecture of the database engine.
A database entirely in memory can experience substantial performance improvement when the storage system is seriously inadequate to meet the IOPS needed. When people talk about in-memory being "ten times" faster, what meant was that if the database engine have been designed around all data residing in memory, it would be built in a very different way than the page and row structured implemented by INGRES in the 1970's.
Now that the memory-optimized tables feature, aka Hekaton for Microsoft SQL Server, is available, are there still performance requirements that have not been met? Memory-optimized tables, and its accompanying natively compiled procedures are capable of unbelievable performance levels. All we have to do is re-architect the database to use memory-optimized tables. And then rewrite the stored procedures for native compilation. This can be done! And it should be done, when practical.
In many organizations, the original architects have long retired, departed, or gone the way of Dilbert's Wally (1999 Y2K episode). The current staff developers know that if they touch something, it breaks, they own it. So, if there were a way to achieve significant performance gain, with no code changes, just by throwing money at the problem, then there would be interest, and money.
There is not a TPC-E result for Hekaton. This will not happen without a rule change. The TPC-E requirement is that database size scales with the performance reported. The 4-way Xeon E7 v4 result of 9,068 tpsE corresponds to a database size of 37,362GB. The minimum expectation from Hekaton is 3X, pointing to a 100TB database. A rule change for this should be proposed. Allow a "memory-optimized" option to run with a database smaller than memory capacity of the system.
About as difficult as MVCC, less upside? Potentially the benefits of NUMA scaling and Hekaton could be combined if Microsoft exposed a mechanism for how key values map to a NUMA node. It would be necessary for a collection of tables with compound primary key to have the same lead column and that the hash use the lead column in determining NUMA node?
The major DRAM companies are producing DDR4 at the 4Gb die level. Samsung has an 8Gb bit die. Micron has a catalog entry for 2×4Gb die in one package as an 8Gb product. There can be up to 36 packages on a double-sided DIMM, 18 on each side. The multiple chips/packages form a 64-bit word plus 8-bits for ECC, capable of 1-bit error correction and 2-bit error detection. A memory controller might aggregate multiple-channels into a larger word and combine the ECC bits to allow for more sophisticated error correction and detection scheme.
A 16GB non-ECC DDR4 DIMM sells for $100 or $6.25 per GB. The DIMM is comprised of 32×4Gb die, be it 32 single-die packages or 16 two-die packages. The 16GB ECC U or RDIMM consisting of 36, ×4Gb die is $128, for $8/GB net (data+ECC) or $7.11/GB raw. There is a slight premium for ECC parts, but much less than it was in the past, especially with fully buffered DIMMs that had an XMB chip on the module. The 4Gb die + package sells for less than $3.
Jim Handy, the memory guy, estimates that 4Gbit DRAM needs to be 70mm2 to support a price in this range. The mainstream DRAM is a ruthlessly competitive environment. The 8Gbit DDR4 package with 2×4Gb die allows 32GB ECC DDR4 to sell for $250, no premium over the 16GB part.
The 64GB ECC RDIMM (72GB raw) is priced around $1000. This might indicate that it is difficult to put 4×4Gb die in one package, or that the 8Gb die sells for $12 compared to $3 for the 4Gb die. Regardless, it possible to charge a substantial premium in the big capacity realm.
One consequence of the price competitiveness in the mainstream DRAM market is that cost cutting is an imperative. Multiplexed row and column address lines originated in 1973, allowing for lower cost package and module product. Twenty years ago, there was discussion on going back to a full width address, but no one was willing the pull the trigger on this.
The only concession for performance in mainstream DRAM was increasing bandwidth by employing multiple sequential word accesses, starting with DDR to the present DDR4.
Reduced latency DRAM appeared in 1999 for applications that needed lower latency than mainstream DRAM, but at lower cost than SRAM. One application is in high-speed network switches. The address lines on RLDRAM are not multiplexed. The entire address is sent in one group. RLDRAM allows low latency access to a particular bank. That bank cannot be accessed again for the normal DRAM period? But the other banks can, so the strategy is to access banks in a round-robin fashion if possible.
A 2012 paper, lead author Nilandrish Chatterjee (micro12 and micro 45) has a discussion on RLDRAM. The Chatterjee paper mentions: RLDRAM employs many small arrays that sacrifices density for latency. Bank-turnaround time (tRC) is 10-15ns compared to 50ns for DDR3. The first version of RLDRAM had 8 banks, while the contemporary DDR (just DDR then) had 2 banks. Both RLDRAM3 and DDR4 are currently 16 banks, but the banks are organized differently?
Micron currently has a 1.125Gb RLDRAM 3 product in x18 and x36. Presumably the extra bits are for ECC, 4 x18 or 2 x36 forming a 72-bit path to support 64-bit data plus 8-bit ECC. The mainstream DDR4 8Gbit 2-die package from Micron comes in a 78-ball package for x4 and x8 organization, and 96-ball for x16. The RLDRAM comes in a 168-ball package for both x18 and x36. By comparison, GDDR5 8Gb at 32-wide comes in a 170-ball BGA, yet has multiplexed address? The package pin count factors into cost, and also in the die size because each signal needs to be boosted before it can go off chip?
Digi-Key lists a Micron 576M RLRAM3 part at $34.62, or $554/GB w/ECC, compared with DDR4 at $8 or 14/GB also with ECC, depending the module capacity. At this level, RLDRAM is 40-70 times more expensive than DDR4 by capacity. A large part for is probably because the RLDRAM is quoted as a specialty low volume product at high margins, while DDR4 is a quoted on razor thin margins. The top RLDRAM at 1.125Gb capacity might reflect the size needed for high-speed network switches or it might have comparable die area to a 4Gb DDR?
There are different types of SRAM. High-performance SRAM has 6 transistors, 6T. Intel may use 8T Intel Labs at ISSCC 2012 or even 10T for low power? (see real world tech NTV). It would seem that SRAM should be six or eight times less dense than DRAM, depending on the number of transistors in SRAM, and the size of the capacitor in DRAM.
There is a Micron slide in Micro 48 Keynote III that says SRAM does not scale on manufacturing process as well as DRAM. Instead of 6:1, or 0.67Gbit SRAM at the same die size as 4Gbit DRAM, it might be 40:1, implying 100Mbit in equal area? Another source says 100:1 might be appropriate.
Eye-balling the Intel Broadwell 10-core (LCC) die, the L3 cache is 50mm2, listed as 25MB. It includes tags and ECC on both data and tags? There could be 240Mb or more in the 25MB L3? Then 1G could fit in a 250mm2 die, plus area for the signals going off-die.
Digi-Key lists Cypress QDR IV 144M (8M×18, 361 pins) in the $235-276 range. This $15K per GB w/ECC. It is reasonable to assume that prices for both RLDRAM and QDR SRAM are much lower when purchased in volume?
The lowest price for an Intel processor on the Broadwell LCC die of 246mm2 is $213 in a 2011-pin package. This would suggest SRAM at south of $1800 per GB. While the ultra-high margins in high-end processors is desirable, it is just as important to fill the fab to capacity. So, SRAM at 50% margin is justified. We could also estimate SRAM at 40X that of DRAM, per the Micron assertion of relative density, pointing to $160-320 per GB.
Graphics and High-Bandwidth Memory
Many years ago, graphics processors diverged from mainstream DRAM. Their requirement was for very high bandwidth at a smaller capacity than main memory, plus other features to support the memory access patterns in graphics. GDR is currently on version 5, at density up to 8Gbit, with a x32 wide path (170-ball package) versus x4, x8 and x16 for mainstream DDR4. More recently, High Bandwidth Memory (HBM) is promoted by AMD, Hybrid Cube Memory by Micron.
High bandwidth memory is not pertinent to databases,
but it does provide scope on when there is need to
go a separate road from mainstream memory.
Databases on the page-row type engine does not come close to testing the limits
of DDR4 bandwidth.
This is true for both transaction processing and DW large table scans.
For that matter, neither does column-store, probably because of the CPU-cycles
I may have to take this back. DDR4-2133 bandwidth is 17GB/s per channel, and 68GB/s over 4 channels. (GB is always decimal by default for rates, but normally binary for size.) A table scan with simple aggregation from memory is what now? It was 200MB/s per core in Core 2 days, 350MB/s in Westmere. Is it 500 or 800MB/s per core now? It is probably more likely to be 500, but let's assume 800MB/s here.
Then 24 cores (Xeon E7 v4 only, not E5) consume 19.2GB/s (HT does not contribute in table scans). This is still well inside the Xeon E5/7 memory bandwidth. But what if this were read from storage? A table scan from disk is a write to memory, followed by a read. DDR writes to memory at the clock rate, i.e., one-half the MT/s rate. So the realized table scan rate effectively consumes 3X of the MT/s value, which is 57.6.
To pursue the path of low latency memory, it is necessary to justify the cost and capacity structure of alternative technologies It is also necessary that the opportunity be worthwhile to justify building one more specialty processor with a different memory controller. And it may be necessary to work with operating system and database engine vendors to all be aligned in doing what is necessary.
The Chatterjee et al micro12 paper for Micro45 (microarch.org) shows LRDRAM3 improving throughput by 30% averaged across 27 of the 29 components of the SPEC CPU 2006 suite, integer and fp. MCF shows greatest gain at 2.2X. Navigating the B-tree index should show very high gain as well.
The cost can be justified as follows. An all-in 4-way server has the following processor and memory cost.
|Processor||4×E7-8890 v4||$7,174 ea.||$28,700|
|Memory||4TB, 64×64GB||$1,000 ea.||$64,000|
If the above seems excessive, recall that there was a time when some organizations where not afraid to spend $1M on the processor and memory complex, or sometimes just the for 60+ processors (sales of 1000 systems per year?). That was in the hope of having amazing performance. Except that vendors neglected to stress the importance NUMA implications. SAN vendors continued to sell multi-million-dollar storage without stressing the importance of dedicated disks for logs.
If a more expensive low latency memory were to be implemented, the transmission time between the memory controller and DIMM, estimated to be 10ns earlier, should be revisited. A RLDRAM system might still have the DIMM slot arrangement currently in use, but other option should be considered.
An SRAM main memory should probably be in an MCM module, or some other In-Package Interconnect (TSMC Hot Chips 28). This is if enough SRAM can be stuffed into a module or package. It would also require that the processor and memory be sold as a single unit and instead of memory being configured later. In the case of SRAM, the nature of the processor L3 probably needs to be re-examined.
Before SSDs, the high-end 15K HDDs were popular with storage performance experts who understood that IOPS was more important than capacity. In a "short-path to bare metal" disk array, the 15K HDD could support 200 IOPS at queue depth 1 per HDD with low latency (5ms). It should be possible to assemble a very large array of 1,000 disks, capable of 200,000 IOPS.
It is necessary to consider the mean-time-between-failure (MTBF), typically cited as over 1M-hours. There are 8,760 hours in a 365-day year. At 1M-hr MTBF, the individual disk failure rate is 0.876%. An array of 1000 disks is expected to see 9 failures per year. Hard disks in RAID groups will continue to operate with a single or sometimes multiple disk failures. However, rebuilding a RAID group from the failed drive could take several hours, and performance is degraded in this period. It is not operationally practical to run on a very large disk array. The recommendation was to fill the memory slots with big DIMMs, and damn the cost.
The common convention used to be a NAND controller for SATA on the upstream side would have 8-channels on the NAND side. The PCI-E controller would 16 or 32 NAND channels for x4 and x8 respectively. On the downstream side, a NAND channel could have 1 or 2 packages. There could be up to 8 chips in a package. A NAND chip may be divided in to 2 planes, and each plane is functionally an independent entity. An SSD with 8 packages could have 64 NAND chips comprised of 128 planes. The random IOPS performance at the plane level is better than a 15K HDD, so even a modest collection of 24 SSDs could have a very large array (3,072) of base units.
At the component level, having sufficient units for 1M IOPS is not difficult. Achieving 1M IOPS at the system level is more involved. NVMe builds a new stack, software and hardware, for driving extraordinarily high IOPS possible with a large array of SSDs, while making more efficient use of CPU than the SAS. PCI-E NVMe SSDs have been around since 2014, so it is possible to build a direct-attach SSD array with the full NVMe stack. NVMe over fabric was recently finalized, so SAN products might be not too far in the near future.
From the host operating system, it is possible to drive 1M IOPS on the NVMe stack without consuming too much CPU. At the SQL Server level, there are additional steps, such as determining which page to evict from the buffer cache. Microsoft has reworked IO code to support the bandwidth made practical with SSD for DW usage. But given the enormous memory configuration of typical transaction processing systems, there may not have been much call for the ability to do random IOPS with a full buffer cache. But if the need arose, it could probably be done.
All Flash Array
When SSDs were still very expensive as components, storage system vendors promoted the idea of a SSD cache and/or a tiering structure of SSD, 10K and 7.2K HDDs. In the last few years, new upstarts are promoting all flash. HDD storage should not go away, but its role is backup and anything not random IO intensive.
Rethinking System Architecture
The justification for rethinking system architecture to low latency memory at far higher cost is shown below. The scaling achieved in 4-socket system is less than exceptional except for the very few NUMA architected databases, which is probably just the TPC-C and TPC-E database. It might be 2.2X better than single socket.
At lower latency, 40ns L3+memory, the single socket system could match the performance of a 2-socket system with DDR4 DRAM. If 25ns were possible, then it could even match up with the 4-socket system. The mission of massive memory made possible in the 4-way to reduce IO is no longer a mandatory requirement. The fact that a single-socket system with RLDRAM or SRAM could match a 4-socket with massive memory allows very wide latitude in cost.
RLDRAM may reside inside or outside of the processor. If outside, thought should be given on how to reduce the transmission delay. SRAM should most probably be placed inside the processor package, so the challenge is how much could be done. Should there still be an L3? Any latency from processor core to memory must be minimized as much as possible as warranted by the cost of SRAM.
Below are the memory latency simple model calculations for a single socket with L3+memory latency of 43 and 25ns. There are the values necessary for the single-socket system to match 2 and 4-socket systems respectively.
|GHz||L3+mem||remote||skt||avg. mem||mem cycles||fraction||tot cycles||tps/core||tot tx/sec|
In the examples above, Hyper-Threading should still have good scaling to 4-way at 43ns, and some scaling to 4-way at 25ns memory latency.
The new memory architecture does not mean that DDR4 DRAM becomes obsolete. It is an established and moderately inexpensive technology. There could still be DRAM memory channels. Whether this is a two-class memory system or perhaps DDR memory is accessed like a memory-mapped file can be debated elsewhere. Xeon Phi has off-package DDR4 as memory node 0 and on-package MCDRAM as memory node 1, all to the same processor.
It is acknowledged that the proposed system architecture is not a new idea. The Cray-1 used SRAM as memory, and DRAM as storage? For those on a budget, the Cray-1M has MOS memory. Circumstances of the intervening years favored processor with SRAM cache and DRAM is main memory. But the time has come to revisit this thinking.
While working on this, I came across the slide below in J Pawlowski, Micron, Memory as We Approach a New Horizon. The outline of the Pawlowski paper includes high bandwidth, and persistent memory. Deeper in, RL3 Row Cycle Time (tRC) is 6.67 to 8ns, versus 45-50ns for DDR4.
I am guessing that the large number of double-ended arrows between processor and near memory means high bandwidth. And even bandwidth to DIMMs is substantial.
On the devices to the right seems to be storage. Does ASIC mean logic? Instead of just accessing blocks, it would be useful to say: read the pointer at address A, then fetch the memory that A points to.
Below is roughly Intel's vision of next-generation system architecture, featuring 3D XPoint. The new Intel and Micron joint non-volatile technology is promoted as having performance characteristics almost as good as DRAM, higher density than DRAM, and cost somewhere in between DRAM and NAND.
The full potential of 3D XPoint cannot be realized as PCI-E attached storage. The idea is then to have 3D XPoint DIMMs devices on the memory interface along with DRAM. The argument is that memory configurations in recent years have become ridiculously enormous. That much of it is used to cache tepid or even cool data. In this case, DRAM is overkill. The use of 3D XPoint is almost as good, it costs less, consumes less power, is persistent, and will allow even larger capacity.
In essence, the Intel vision acknowledges the fact that much of main memory is being used for less than hot data. The function of storing not so hot data can be accomplished with 3D XPoint at lower cost. But this also implies that the most critical functions of memory require far less capacity than that of recent generation systems.
In the system architecture with a small SRAM or RLDRAM main memory, there will be more IO. To a degree, IO at 100µs to NAND is not bad, but the potential for 10µs or less IO to 3D XPoint further validates the concept and is too good to pass up.
Below is a my less fancy representation of the Micron System Concept.
The Right Core
The latest Intel mainline Core-i processor has incredibly powerful cores, with 8-wide superscalar execution. A desktop version of Kaby Lake, 7th generation, has 4.2GHz base frequency and 4.5GHz turbo. This means the individual core can run at 4.5GHz if not significantly higher, but must throttle down to 4.2GHz so that four cores plus graphics and the system agent stays under 91W. A reasonable guess might be that the power consumption is 20w per core at 4.2GHz? Sky Lake top frequency was 4.0GHz base and 4.2GHz turbo. Broadwell is probably 3.5GHz base and 3.8GHz turbo, but Intel did not deploy this product as widely as normal.
In transaction processing, this blazing frequency is squandered on memory latency. The server strategy is in high core count. At 24 cores, frequency is throttled down to 2.2GHz to stay under 165W. The Broadwell HCC products do allow turbo mode in which a few cores can run at up to 3.5 or 3.6GHz.
Every student of processor architecture knows the foundations of Moore's Law. One of the elements is that on doubling the silicon real estate at a fixed process, our expectation is to achieve 1.4X increase in performance. (The other element is a new process with 0.71X linear shrink yields 50% frequency, also providing 1.4X performance.) The Intel mainline core is two times more powerful than the performance that can be utilized in a high core count processor.
In theory, it should be possibly to design a processor core with performance equivalent to the mainline core at 2.2GHz (see SQL on Xeon Phi). In theory, this new light core should be one-quarter the size of the mainline core. (Double the light core complexity for 1.4X performance. Double again for another 1.4X, for a cumulative 2X over baseline.)
The new light core would be running at maximum design frequency to match the 2.2GHz mainline, whatever frequency that might be. We can pretend it is 2.2GHz if that helps. This new core would have no turbo capability. What is the power consumption of this core? Perhaps one-quarter of the mainline, because it is one-quarter the size? Or more because it is running at a slightly higher voltage? (this is covered elsewhere).
It might be possible to fit four times as many of the light cores on the same die size, assuming cache sizes are reduced. But maybe only 3 times as many cores can be supported to stay within power limits? This a much more powerful multi-core processor for multi-threaded server workloads. How valuable is the turbo capability?
The turbo boost has value because not everything can be made heavily multi-threaded. Single or low-threaded code might not be pointer chasing code, and would then be fully capable of benefitting from the full power of the Intel mainline core. A major reason that Intel is in such a strong position is that they have had the most powerful core for several years running, and much of the time prior to the interlude period. (AMD does have a new core coming out and some people think highly of it.)
Intel has two versions of each manufacturing process, one for high performance, and another for low power. Could the mainline core be built on the lower power process? In principle this should reduce power to a greater degree than scaling voltage down. Would it also make the core more compact?
We could speculate on theory, applying the general principles of Moore's Law. But there is a real product along these lines, just targeted towards a different function. The Xeon Phi 200, aka Knights Landing has 72 Atom cores, albeit at 245W (260 with fabric). The current Phi is based on the Airmont Atom core. (the latest Atom is actually Goldmont).
The recent Atom cores have a 14-stage pipeline versus 14-19 for Core? Airmont is 3-wide superscalar, with out-of-order, but does not have a µOP cache? The true top frequency for Airmont is unclear, some products based on are Airmont are listed as 2.6GHz in turbo.
On Xeon Phi, the frequency is 1.5GHz base, 1.7GHz turbo. It might be that the low core count processors will always be able to operate at a higher voltage for maximum frequency while high core count products set a lower voltage resulting lower frequency, regardless of intent.
Below is a diagram of Knights Landing, or Xeon Phi 200 from Intel's Hot Chips 27 (2015). The processor and 8 MCDRAM devices are on a single multi-chip-module (package) SVLC LGA 3647.
The MCDRAM is a version of Hybrid Memory Cube? (Intel must have their own private acronyms.) Each device is a stack of die, 2GB, for a total of 16GB with over 400GB/s bandwidth. There are also 3 memory controllers driving a total of 6 DDR4 memory channels for another 90GB/s bandwidth (at 2133 MT/s). Only 1 DIMM per channel is supported, presumably the applications are fine with 384GB but wants extreme bandwidth.
The Xeon Phi is designed for HPC. As is, it might be able deliver impressive transaction processing performance. But perhaps not without tuning at many levels. The question is, how does Knights Landing perform on transaction processing had the memory been designed for latency instead of bandwidth?
I suppose this could be tested simply by comparing an Airmont Atom against a Broadwell or Skylake? The theory is that the memory round-trip latency dominates, so the 8-wide superscalar of Haswell and later has little benefit. Even if there is some code that can used wide superscalar, the benefit is drowned out by code that wait for memory accesses.
Even in transaction processing databases, not everything is transactions, i.e., amenable to wide parallelism. Some code, with important functions, do benefit from the full capability of mainline Intel core. Perhaps the long-term solution is asymmetric multi-core, two or four high-end cores, and very many mini-cores.
SSE/AVX Vector Unit (SIMD)
The vector (SSE/AVX) unit is a large portion of the core area. This are not used in transaction processing but are used in the more recent column-store engine. Microsoft once evaluated the use of the Intel SSE registers, but did not find a compelling case. It might have been on the assumption of the existing SSE instructions?
Perhaps what is needed is to redesign the page structure so that the vector registers can be used effectively. The SQL Server 8KB page, 8,192 bytes, has a row header of 96 bytes, leaving 8096 bytes. Row offsets (slot array of 2 byte values) are filled in from the end of the page. See Paul Randall, SQL Skills Anatomy of a Record.
Within a page, each row has a header (16-bytes?) with several values. The goal of redesigning the page architecture is so that the slot and header arrays can be loaded into the vector registers in an efficient manner. This might mean moving the slot array and other headers up front. SQL Server would continue to recognize the old page structure. On index rebuild, the new page structure employed.
The necessary instructions to do row-column byte offset calculations directly from the vector register would have to be devised. This needs to be worked out between Intel and various database vendors. Perhaps the load into the vector registers bypasses L1 and/or L2? It would be in L3 for cache coherency?
The Chatterjee paper mentioned putting the most critical word in RLDRAM and the rest in DRAM. The concept could apply to the slot and header, but the database engine people would have a cow on the suggestion to split the page contents.
The current Xeon E5/7 processors, with the latest on the Broadwell core, have 16 vector registers of 256-bits (32 bytes) totaling 512 bytes. The Skylake has 32 registers of 512-bits, 2KB of registers. This is too much to waste. If they cannot be used, then the processor with special memory controller should discard the vector unit.
The main purpose of this article was to argue for a new system architecture, having low latency memory, implying a processor architecture change, with a major focus on the feasibility for transaction processing databases, and mostly as pertinent to Intel processors. However, all avenues for significant transaction performance improvement are considered.
- Hyper-Threading to 4-way
- Memory-optimized tables with natively compiled procedures
- Database architected for NUMA
- Multi-core - the right size core?
- 3D XPoint
- SSE/AVX Vector Instructions
Increasing HT from 2-way to 4-way has the potential to nearly doubly transaction processing performance. Other options have greater upside, but this is a drop-in option.
Memory-optimized tables and natively compiled procedures combined has the greatest upside potential.
People do not want to hear that the database should be re-architected for NUMA scaling. If it runs fine on single-socket or Hekaton, then fine. But Intel mentions that scaling to the Phi core count levels requires NUMA architecture even on one socket.
Higher core count using a smaller core will have best transaction throughput, but an asymmetric model might be more practical.
The capabilities of the modern processor core are being squandered in long latency for capacity that is not needed. Figure out what is the right low latency memory.
There is definitely potential for 3D Point. But it goes beyond displacing some DDR DRAM and NAND. The true potential is to enable a smaller lower latency true main memory, then have DDR DRAM and 3D XPoint as something in-between memory and IO.
SSE/AVX: use it or lose it.
The growing gap between processor clock cycle time to memory latency is not a new topic. There have been many other papers on the advantage of various memory technologies with lower latency. Most of these originate either from universities or semiconductor companies. Everyone acknowledged that cost relative to mainstream DRAM was a serious obstacle. A number of strategies were conceived to narrow the gap.
Here, the topic is approached from the point of view of database transaction processing. TP is one of several database applications. Database is one of many computer applications. However, transaction processing is a significant portion of the market for high-end processors, and systems with maximum memory configuration. And the extremely large memory capacity requirement is shown to now be a red herring.
The database community regularly spends $100K on a processor-memory complex for performance levels that could be achieved with a single socket, if it were matched with the right memory. There is valid justification from the database side to pursue the memory strategy. There is justification to the processor-memory vendor that this one market has the dollar volume to make this effort worthwhile.
In all, there are several worthwhile actions for the next generation of server system architecture. Probably none are mutually exclusive. There are trade-offs between impact, cost and who does the heavy lifting. No single factor wins in cases, so a multi-prong attack is the more correct approach.
Processor and system architecture has been an exercise in trading off speed versus size for a very long time now. It was first with cache, then multi-level cache. Then the database engine largely uses memory to cache data stored permanently on disk.
The diagram below might be helpful in discussion. The elapsed time for a transaction is the weight sum of operations that incur a wait at each level. Single thread performance is the inverse of elapsed time. For throughput, memory access latency can be partially hidden with hyper-threading. IO latency is hidden by asynchronous IO, but there is an overhead to that too.
For this discussion, assumed that L1-3 cache is fixed, though I did suggest rethinking L3 if memory is SRAM. Why give up the very large memory possible with DDR4. Technically this is 768GB per core based on 12 DIMMs at 64GB each. The E7 is double that at 24 DIMMs per core. For simplicity, assume 8 DIMMs per socket in Xeon E5, 16 in E7 and DDR4 DIMM at 64GB.
Consider the scenarios where the entire database fits in memory of 512GB for single-socket, 1TB for two-socket and 4TB at 4-way. The 1 and 2-way are E5, while the 4-way is E7. The maximum memory for SRAM might be as low as 8GB (8 packages, 8×1Gb die stack) and perhaps 64GB for RLDRAM on 8 DIMMs?
In the direct DDR4 (E5, no SMB), local node memory access is L3 + 52ns, and there is no IO, except for writes. In the smaller low latency memory, access time is much shorter, but now there are buffer cache misses. Now it all depends on where the miss goes to? DDR4 or 3D XPoint on the memory bus, to PCI-E and 3D XPoint or NAND. The calculations now depend on what fraction goes to memory in each case, and what fraction goes to IO in the second case and what t
Suppose D is the latency for DDR memory, and S it the latency of some low latency memory, both inclusive of transmission time and possibly L3. There is no read IO in the DDR system. Suppose x is the fraction of memory accesses that are outside of the fast memory, and the I is the latency. The term I might represent accesses to DDR or 3D XPoint on the memory interface via a memory access protocol so it is really not IO. Or it could be to 3D XPoint or NAND attached to PCI-E via an IO protocol.
The criteria for the small faster memory being an advantage in elapse time delta, exclusive of operations that occur inside L3, is as follows.
(1-x)×S + x×I < D
x < (D-S)/(I-S)
The objective is to achieve a large gain in performance via elapsed time. Only so much can be gained on the numerator D-S, so much depends on the latency of I. If the secondary device were DDR or 3X XPoint on the memory interface, then a very high value of accesses (x) could be allowed while still achieving good performance gain. If it were on PCI-E, then 3D XPoint might have a strong advantage over NAND.
In the discussion on Knights Landing, I suggested that the Atom core might not be bad for transaction processing. The cheapest Xeon Phi is the 7210 at $2438. About $4700 in a system. What is the difference between Atom C2750 and 2758? Both are Silvermont 8-cores, no HT. Use ECC SODIMM.
Atom has changed since its original inception, not using out-of-order execution for simplicity and power-efficiency. Silvermont added OOO. Not sure about Goldmont. Is Atom to be a slimmed down Core? with 3-wide superscalar and manufactured on the SoC version of the process?
The quality of analysis here is only meant to be good enough to raise questions. It would take far more resources than I have to do a proper analysis. Such resources exists in a few places.
I will try to sort out the material and redistribute
over several articles as appropriate.
Knights Landing 2016-08, Memory-IO Performance (), Memory Latency, NUMA and HT 2016-12, System Architecture Review 2016, The Case for Single Socket (2016-04) Amdahl Revisited and also Cost-Based Optimizer