2013-10 More explanations of Hekaton at PASS 2013. The primary foci in Hekaton are 1) an alternative concurrency mechanism, doing away with (most?) locks and latches, and 2) true compiled binaries for stored procedures to leverage the gains. Hekaton will retain the row organization of data, so column organization to improve the memory access sequence is a completely separate entity.
2013-08 Technet blogs architectural-overview-of-sql-server-2014-s-in-memory-oltp-technology 2013-07-22 and getting-started-with-sql-server-2014-in-memory-oltp 2013-06-26
The term in-memory database can be subject to misinterpretation. It was originally used to describe a storage engine designed for the memory access characteristics of modern microprocessors, not simply a database stored in memory. Today it is common for a database to reside almost entirely in the buffer cache, i.e., the memory of a traditional relational DBMS, but this is very different from an in-memory database as just defined. As Microsoft recently announced that the next version of SQL Server will incorporate in-memory database technology under the Hekaton codename, it is worthwhile now to revisit in more detail the difference between the original disk-storage-oriented and in-memory databases, along with the differences in computer system architecture between then and now that drove the change in the database storage engine.
Relational databases originated from the papers of Edgar Codd published from 1970 on. Oracle may have had the first commercial product. A group at UC Berkeley (Stonebraker and Wong) built INGRES, from which Sybase and later SQL Server descended. Ingres was developed on a DEC PDP-11, a popular mini-computer system at the time (16-bit integer/register). The Design and Implementation of INGRES paper by Stonebraker, Wong and Held, ACM 1976, mentions support for UNIX on the PDP-11/40, 45 and 70 models. The 11/40 could support a process address space of 64K; the 11/45 and 11/70 models supported 128K.
The main design goal was for the database engine to make the best use of limited memory to complete a query with minimal disk IO.
The DEC PDP-11 came out in 1970 at a relatively low price-point such that it was a very popular system in university environments. The Spring Joint Computer Conference 1970 paper A new architecture for mini-computers - The DEC PDP-11 cites a $5000-10K price target. This may have been why one happened to be available for the original Ingres development project. The PDP 11 Handbook lists the PDP-11/10 as having 1,024 words of 16-bit read-only memory and 128 words of read-write memory. The PDP-11/20 model has 4,096 words of 16-bit read-write (Magnetic) core memory. The max data transfer rate on the Unibus was one word every 750ns. Core memory had a 1.2 µs cycle time and 500 ns access time.
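As a quick illustration (a Python back-of-envelope from the handbook figures above), one 16-bit word per 750ns works out to roughly 2.7 MB/s peak on the Unibus:

```python
# Rough Unibus peak bandwidth, assuming the cited maximum of one
# 16-bit word transferred every 750 ns.
WORD_BYTES = 2       # 16-bit word
CYCLE_S = 750e-9     # 750 ns per transfer

words_per_sec = 1 / CYCLE_S
bytes_per_sec = words_per_sec * WORD_BYTES
print(f"{bytes_per_sec / 1e6:.2f} MB/s")  # 2.67 MB/s
```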
Wikipedia lists the history of the PDP-11 Unibus models as:
Microsoft Research now seems to be the repository of DEC material under the Gordon Bell section, including DEC Museum and 16-bit timeline. The timeline information between Wikipedia and Microsoft Research does not appear to be entirely consistent. Either it is difficult to interpret surviving documents or people's recollections of this era are fading.
There is more information on the next generation DEC VAX 11/780, the first 32-bit mini-computer. This system came out in 1977. See the VAX-11/780 Hardware User's Guide and VAX Product Sales Guide for details. Also search for the VAX-11/780 Architecture Handbook from Carnegie Mellon ECE. The CPU was built with TTL, had a 200ns clock and 8KB cache. No transistor count is cited? The VAX-11/780 pricing was between $210K and 350K?
The system was described as 1MIPS, but that was because the performance was roughly comparable to an IBM system (370/158-3?) that was accepted as 1MIPS. It turned out the VAX 11/780 executed 0.5M native instructions per sec to deliver equivalent performance to the IBM 1MIPS. John McCallum (jcmit) cites the IBM 370/158-3 as 0.73MIPS and the VAX-11/780 as 1MIPS.
The CPUs of this time were very limited in the number of transistors, and could implement only basic operations directly. It would not have been feasible for compiled binaries to be built from basic instructions alone. The native VAX (or PDP-11) instruction set was comprised of complex instructions, which were translated by a set of microprogrammed instructions (microcode) into the basic operations. The presumption, based on 0.5 VAX MIPS and the 5MHz clock, is then that the average VAX instruction decomposes into 10 basic operations or rather clock cycles, accounting for memory access time?
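That presumption is just the ratio of the two cited figures:

```python
# 5 MHz clock delivering 0.5 million native VAX instructions/s
# implies 10 clock cycles per average (microcoded) instruction.
clock_hz = 5e6
native_ips = 0.5e6
cycles_per_instruction = clock_hz / native_ips
print(cycles_per_instruction)  # 10.0
```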
The memory system contains one or two memory controllers. Each controller can handle 1 to 16 arrays. The memory array has a cycle time of 600ns. A memory controller buffers one command while processing another. The memory controllers can be interleaved.
Cache access time is 200ns, basically 1-cycle access. Memory cycle time is 600ns. Read access time at the processor is 1800ns. Effective average operand access time is 290ns.
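The cited figures imply a cache hit rate. Back-solving effective = hit x cache + (1 - hit) x memory from the numbers above (a rough model that ignores the memory controller's command buffering):

```python
# Implied cache hit rate from the VAX-11/780 timing figures:
# 200 ns cache access, 1800 ns read access at the processor,
# 290 ns effective average operand access.
cache_ns, memory_ns, effective_ns = 200, 1800, 290
hit = (memory_ns - effective_ns) / (memory_ns - cache_ns)
print(f"implied hit rate = {hit:.1%}")  # about 94%
```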
The first systems used 4Kbit DRAM supporting a maximum system memory of 2MB, in increments of 128K.
Later systems used 16Kbit DRAM, supporting up to 8MB memory, in 256K increments. Minimum memory was cited as 128K and 256K in the 1977 and 1979 handbooks, but later documents cited minimum memory of 1M?
If we do the math, we can work out that, excluding overhead for ECC, 4096 chips are required for 8MB at 16Kbit per DRAM. The VAX 11/780 has a 72-bit memory path comprised of a 64-bit word plus 8 bits for ECC.
By comparison, a modern server system supports 1TB memory over 64 DIMM sockets with 16GB DIMMs. There are 36 chips on each 16GB DIMM (32 for data, 4 for ECC) at 4Gbit per chip. The DRAM package could be single or double die package (DDP). So the system could have up to 2048 chips plus 256 for ECC.
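The chip-count arithmetic for both eras, using the figures above, can be sketched as:

```python
MB, TB = 2**20, 2**40
Kbit, Gbit = 2**10, 2**30

# VAX-11/780: 8 MB built from 16 Kbit DRAMs (data bits only, ECC excluded)
vax_chips = (8 * MB * 8) // (16 * Kbit)

# Modern server: 64 DIMMs, each with 32 data chips of 4 Gbit (plus 4 ECC)
dimms, data_chips, ecc_chips = 64, 32, 4
capacity_bits = dimms * data_chips * 4 * Gbit

print(vax_chips)                               # 4096
print(capacity_bits // 8 // TB)                # 1 (TB)
print(dimms * data_chips, dimms * ecc_chips)   # 2048 256
```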
Over the course of time, computer systems transitioned to single chip microprocessors. The low-end systems transitioned first to realize the cost benefits of lower part count. Eventually high-end systems transitioned to microprocessors as well, because chip-to-chip signal delays did not scale with improving transistor performance within a chip.
The next step in microprocessor architecture was pipelined execution. A complete single instruction is comprised of a sequence of many operations. By dividing the sequence into many steps, the clock rate for completing a single step can be higher than for the whole instruction. By allowing a sequence of instructions to overlap, one instruction could be completed each clock cycle with pipelining. Wikibooks Microprocessor Design/Pipelined Processors has excellent illustrations of pipelining. The time to execute a single full instruction is several clock cycles.
The Intel 80486 (1989) has a 5-stage pipeline: fetch, decode1, decode2 (effective address), execute, and write-back.
The Pentium (1993) pipeline stages are: Prefetch, Fetch (MMX only?), D1 Instruction Decode, D2 Address Generate, Execute, Writeback. So that makes 5 stages for the original Pentium and 6 for the MMX?
Intel is curiously vague on the exact number of pipeline stages for the Pentium Pro to Pentium III, collectively known as the P6 line. It might be because the actual number of stages varies with the instruction? The Pentium III has been cited as 10 stages, and the Pentium Pro (P6) could be the same. The later Pentium III processors may have added a (prefetch) stage purely to account for the time to access L1 cache as core frequency increased with process generation and maturity.
The Pentium M (130nm Banias) could be an improved P6, but is also called a new design, on the quad-pumped Pentium 4 bus. This was followed by Dothan on 90nm, and then the dual-core Yonah on 65nm, branded as Core Duo and Solo?
The first Pentium 4 processors (Willamette and Northwood) are 20 stages, and the second generation Prescott is 31 stages.
The diagram below is from "The Microarchitecture of the Pentium 4 Processor", Intel Technology Journal Q1 2001 showing 10 stages for a basic Pentium III, and 20 stages for the 180nm and 130nm Pentium 4s, Willamette and Northwood.
In other documents, I have P6 as:
IFU1, IFU2, IFU3, Dec1, Dec2, RAT, ROB, Dis, EX, Ret1, Ret2.
The Core 2 (Conroe 65nm, Penryn 45nm; there were other codenames for server and mobile) has 14 stages. The Core 2 brand name was later changed to Core, even though pre-Core 2 processors had already been sold under the Core brand name.
The next significant innovation was super-scalar execution, where a microprocessor could complete several instructions in parallel each clock cycle. The Intel Pentium has a limited 2-wide super-scalar capability. The Pentium Pro had a more broadly usable 3-wide design. The super-scalar execution units typically have special uses, so it is not always possible to complete an instruction on all units in each cycle.
The Intel Pentium 4 is shown with 4 ports, 2 for execution, 1 Load and 1 Store port. I recall the Pentium 4 as 3-wide, which might be the maximum throughput of the ports.
The Intel Core microarchitecture (Conroe/Penryn) is described as 4 instructions per clock cycle versus 3 in previous architectures. The diagram shows 5 units, 3 for different aspects of ALU, FP and vector, 1 Load and 1 Store. Also mentions 14-stage pipeline.
The Intel Nehalem is shown in IDF 2008 with 6 execution units, 3 for integer, FP and vector, 1 Load, 1 Store Address and 1 Store Data. The Intel Sandy-Bridge is shown with 6 execution units, 3 for integer, FP and vector, 2 for Load/Store Address and 1 Store Data.
The Intel IDF Fall 2012 presentation on Haswell shows 8 units: 4 integer, of which 3 can do vector and 2 can do FP, 2 Load/Store Address, 1 Store Data, and 1 Store Address.
Technically, an instruction on one system architecture has no inherent correlation to an instruction on a different system architecture, so there should be no correlation between MIPS on one system and another. But people need or want to compare systems, and MIPS had already become popular, so the MIPS rating was standardized against benchmarks, first Whetstone (which contains floating-point) and later Dhrystone (no floating-point). One Dhrystone MIPS (DMIPS) is defined as the performance of the VAX-11/780, rather than 1 million specific actual IPS.
There are minor inconsistencies between the MIPS figures from various sources. The table below is mostly from Wikipedia Instructions per second. The last two items are multi-core processors; the MIPS rating is for all cores, but the IPS/clock is per core. Another broad compilation is jcmit.
|Processor||Clock||Dhrystone MIPS||IPS/clock||Year|
|IBM 370 158-3||8.69MHz||1 MIPS (reference)||0.1||1972|
|Intel 8080||2||0.33 (not Dhrystone)||0.165||1974|
|Intel 80486 DX2||66||54||0.8||1992|
|Intel Pentium Pro||200||541||2.7||1996|
|Intel Pentium III||600||2054||3.4||1999|
|Intel Pentium 4EE||3200||9726||3.0||2003|
|Intel Core 2 (2c)||2930||27079||4.6||2006|
|Intel Core i7 920 (4c)||2660||82300||7.7||2008|
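The IPS/clock column is simply the Dhrystone MIPS divided by the clock in MHz (and by core count for the multi-core entries); a quick check against the table:

```python
# (name, MHz, DMIPS, cores) taken from the table rows above
entries = [
    ("80486 DX2",   66,    54,    1),
    ("Pentium Pro", 200,   541,   1),
    ("Pentium III", 600,   2054,  1),
    ("Pentium 4EE", 3200,  9726,  1),
    ("Core 2",      2930,  27079, 2),
    ("Core i7 920", 2660,  82300, 4),
]
for name, mhz, dmips, cores in entries:
    ipc = dmips / mhz / cores
    print(f"{name}: {ipc:.1f} IPS/clock per core")
```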
Notice the sharp rise in IPS per clock between the Intel 80386 (non-pipelined) and the 80486DX2 (pipelined), to nearly 1 per clock. Presumably the main contributor is the 8K (unified) on-die cache for the 80486, and the 8K data + 8K instruction caches for the Pentium. The high-end 486 and Pentium systems of this period also had off-die L2 cache. I do not recall if off-die cache was common for 386 systems.
Thereafter, IPS/clock is greater than 1 with the advent of super-scalar execution. Both the Pentium Pro and Pentium III are 3-wide, so the increase in IPC might be due to the SIMD capability of the Pentium III. The Pentium 4 gave up a degree of IPC with its very deep pipeline to achieve extraordinarily high clock rates. The Core 2 was 5-wide? The Core i7 is 5-wide but also has Hyper-Threading. The latest Sandy Bridge is 6-wide?
Intel provides MIPS ratings of their earlier processors, up to the Pentium, in List of Intel microprocessors.
|Intel Pentium (P5)||60|
|Intel Pentium (P54)||90|
|Intel Pentium (P54CS)||133|
A complete DRAM history is more difficult to trace, with the primary manufacturers changing over time. Wikipedia is generally a good starting point. Dynamic random access memory cites the DRAM Design Overview from Stanford University by Junji Ogawa.
DRAM timing is particularly difficult to understand, more so with the radical change from asynchronous (FPM and EDO) DRAM to synchronous SDRAM, and DDR timings. Wikipedia SDRAM latency provides the diagram below.
Other references are Anandtech's everything-you-always-wanted-to-know-about-sdram-memory-but-were-afraid-to-ask and Ulrich Drepper's What every programmer should know about memory.
The aspect of focus here is memory access latency. This element was generally quoted for asynchronous DRAM products. After the change to synchronous DRAM, the industry emphasis changed to bandwidth timings.
The last of the FPM and EDO DRAM products were available with 50ns access times, but 60ns products were more common. Perhaps the 50ns access time required cherry picking from a normal production run?
Today, the best DDR3 may have an access time of 25-30ns at the DRAM chip. Local memory access time at the processors (with integrated memory controller) is on the order of 50ns? The difference is due to signal transmission from processor to memory and back. On server systems using registered memory, there may be a buffer chip between processor and DRAM? On multi-processor (socket) systems, access to remote node memory may be over 95ns? to an adjacent node and 150ns+ for 2-hop distances?
DDR transfers data on both edges of the clock, i.e., at double the clock rate. Internally, DRAM is now organized into multiple banks in order to sustain data transfers at a very high rate.
The entire discussion above pertains to mainstream DRAM, which emphasizes cost relative to capacity first, followed by bandwidth, with the expectation that a computer system will be comprised of many DRAM chips. For example, a recent generation personal computer has 2 memory channels, each 64-bits wide. The DRAM components are organized as x8, providing an 8-bit data path, so 8 chips form a 64-bit channel, and the minimum system memory has 16 chips.
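The chip count for that example works out as:

```python
# Two 64-bit channels built from x8 DRAM parts.
channels, channel_width, chip_width = 2, 64, 8
chips_per_channel = channel_width // chip_width
min_chips = channels * chips_per_channel
print(chips_per_channel, min_chips)  # 8 16
```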
There are specialty DRAM products designed around different requirements. Graphics DRAM is designed for high bandwidth on the assumption that the memory system will be comprised of few chips. Consider a graphics subsystem that needs only 1GB comprised of 1Gbit chips. The desired bandwidth might require a 256-bit path. So GDDR DRAM is often organized wider, x32 being popular.
Another specialty is reduced latency DRAM for network systems. These systems do not require monstrous system memory capacity, but do need super low latency to support fast turn-around time for high-speed networking, in the 10-40Gbit/s range. A Micron RLDRAM document mentions tRC of 8ns versus 46-52ns for DDR3?
It has been my opinion that server system memory has long since become out of balance with the original concept of system main memory. The latency has become too long for memory. Today most memory is used for caching of one form or another, including the database buffer cache. The time is right to split computer system memory. There should be a smaller memory subsystem emphasizing very low latency, not just with specialty DRAM, but also with physical proximity, perhaps in the same package as the processor. A separate larger subsystem can continue to implement bulk DRAM, tolerating longer latency.
It has long been known that memory access latency cannot keep up with the microprocessor. Of course, Intel server microprocessor clock rates have settled into the 3GHz range, with further progress emphasizing the number of superscalar execution ports and the number of cores on a single die (or socket). For a 3GHz processor and 50ns local node access, memory latency is now 150 CPU clock cycles away, and 300+ for remote node memory access.
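In clock cycles, at the ~3GHz figure used here:

```python
# Memory latency expressed in CPU clock cycles at 3 GHz.
clock_ghz = 3.0
local_ns, remote_ns = 50, 100
print(local_ns * clock_ghz)   # 150.0 cycles, local node
print(remote_ns * clock_ghz)  # 300.0 cycles, remote (1-hop) node
```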
Micron and other memory companies have formed the Hybrid Memory Cube consortium, proposing a radical re-architecture of the memory system. See Hot Chips HC23 Hybrid Memory Cube (HMC). by J. Thomas Pawlowski, Micron and High-Performance Memories for Packet Processing.
On the VAX-11/780, the CPU clock was 5MHz or 200ns cycle time, but a complete instruction averaged 10 cycles. DRAM access time was 250ns, 600ns to the memory controller and 1800ns to the processor. This was before the advent of SIMM and DIMM technology. The processor, memory controller and memory were all on separate boards, with long signal delays. So essentially, memory access time was comparable to the time to complete one VAX (complex?) instruction.
A single Intel Sandy-Bridge core can complete 6 instructions per clock cycle if there are no memory stalls. The key to modern microprocessor performance is an effective cache strategy to hide memory latency. This can be successful if there is locality or if memory can be prefetched, ideally 150+ cycles before it is needed. An alternative strategy is sequential memory access, to make use of the high memory bandwidth of modern systems.
|Year||CPU clock||Inst/clock||Effective ns/Inst||Memory Access|
|2012||3GHz (0.33ns)||up to 6||0.055||50ns (local), 100ns (1 hop)|
Summarizing, the CPU clock was faster than memory access even back in 1978. However, the CPU was also a very simple device that required 10 full cycles to complete a (microprogrammed) instruction. So the net result was that instruction time was comparable to memory access. Today, a single core is capable of completing 6 instructions per clock. This is on top of the 150:1 ratio between local memory access latency and the CPU clock. The decisions made thirty years ago for good reasons no longer hold today. The current nature of computer system architecture points to a completely different strategy given the long latency for memory access.
The modern microprocessor is designed to operate with pipelining and superscalar execution. There should be multiple independent instructions that can be executed on each clock. Furthermore, instructions executed in one clock should not have intractable dependencies on instructions in the immediately preceding clock cycles.
The most difficult code for modern microprocessors is pointer chasing. This is where a memory access retrieves the next memory location to be accessed. If the memory address is not in cache, then the access time to DRAM is over 150 cpu-cycles, during which the processor core has nothing to do. Once the memory is accessed, this provides the address of the next memory fetch. Unfortunately, this code sequence just happens to describe a b-tree search.
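A minimal sketch of why a b-tree search is inherently pointer chasing. The node layout here, a (keys, children, values) tuple, is purely illustrative and not any real engine's format; the point is the last line of the loop, where the content of one load determines the address of the next load, so the accesses cannot overlap or be prefetched:

```python
import bisect

def btree_search(node, key):
    """Descend a toy b-tree; each iteration is a dependent memory access."""
    while node is not None:
        keys, children, values = node
        i = bisect.bisect_left(keys, key)
        if children is None:                       # leaf node
            if i < len(keys) and keys[i] == key:
                return values[i]
            return None
        node = children[i]   # dependent load: this value IS the next address
    return None

# A tiny two-level tree for illustration.
leaf1 = (["a", "b"], None, [1, 2])
leaf2 = (["d", "e"], None, [4, 5])
root  = (["c"], [leaf1, leaf2], None)
print(btree_search(root, "d"))  # 4
```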
Having covered the computer system architecture transitions from 1970 to 2012, including the processor core and the memory system, it is appropriate to return to the original relational database implementation. The following diagram is from The Design and Implementation of INGRES by Stonebraker, Wong and Held, ACM 1976.
The page and row storage organization from Ingres in the 1970s is still in use today. The diagrams below are from Microsoft MSDN Understanding Pages and Extents
and SQLSkills Inside the Storage Engine: Anatomy of a page
Now examine the sequence of operations to access rows and columns with page-row storage, with consideration for whether memory access operations are in cache, or can be prefetched.
Assume that we have already found the sequence of rows required by a query from an index. The information we have for each row is the file_id, page_id, and row
1) The cost of the check for whether the page is in cache could be as high as 1000 CPU-cycles? This is based on a series of table scan tests I did for a varying number of rows per page, with the lock pages in memory permission on. The OS check could be equally expensive. One of Thomas Kejser's slides from SQLBits mentions that the lock pages in memory performance impact could be significant.
Note to storage vendors: this is why the claim that caching solves IO performance problems is totally stupid.
2) It is necessary to place some kind of lock or latch on the page even if nolock or tablock is applied on the query, so that the SQL Server storage engine knows the page cannot be evicted from the buffer cache while being read.
4) The reason that the row offsets and the actual row data are filled in from opposite directions of the page is to improve storage efficiency. With nullable or variable length data, it is not known how many rows will fit in any given page. This requires non-sequential memory access patterns.
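To make that layout concrete, here is a toy sketch in Python: rows packed forward from a 96-byte header (as in SQL Server), 2-byte row offsets packed backward from the end of the 8KB page. The layout is a simplification for illustration, not the actual on-disk page format:

```python
import struct

PAGE_SIZE = 8192

# Build a toy page: rows grow from the front, slot offsets from the back.
header = bytes(96)
rows = [b"row-one", b"row-two-longer", b"r3"]
body, offsets, pos = b"", [], len(header)
for r in rows:
    offsets.append(pos)   # byte offset where this row starts
    body += r
    pos += len(r)
# Slot array at the end of the page, last slot first.
slot_array = b"".join(struct.pack("<H", o) for o in reversed(offsets))
page = (header + body).ljust(PAGE_SIZE - len(slot_array), b"\x00") + slot_array

def slot_offset(page, slot):
    """Read slot n's 2-byte row offset, counting backward from page end."""
    start = len(page) - 2 * (slot + 1)
    return struct.unpack_from("<H", page, start)[0]

print([slot_offset(page, i) for i in range(3)])  # [96, 103, 117]
```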
5) One might think in a SELECT COUNT(*) query that we could just read the m_slotCnt value in the page header, or read the number of 2-byte row offset values at the end of page, but apparently SQL Server actually reads the row header for each row.
6) Fixed length non-nullable columns are the least effort because the column offset is known ahead of time and is the same for each row. One of the recent SQL Server versions improved the handling of nullable columns by having a bitmask for all columns in each row, which simplifies the process of determining the offset?
Variable length columns are more difficult? I think we have to go to the first variable length column, read its length to get to the next length value, and so on until we find the desired column. It would be nice to see the source code for this. Perhaps someone would be willing to examine the code of one of the open source databases?
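A sketch of the sequential walk just described, assuming a hypothetical row format in which each variable-length column carries a 2-byte length prefix (SQL Server actually stores an array of column end offsets, but the serial, dependent nature of the walk is the same):

```python
import struct

def nth_varcol(row, n):
    """Walk length-prefixed columns serially to reach column n (0-based)."""
    pos = 0
    for _ in range(n + 1):
        (length,) = struct.unpack_from("<H", row, pos)  # 2-byte length prefix
        start = pos + 2
        pos = start + length   # cannot know where col i+1 starts until here
    return row[start:start + length]

# A toy row with three variable-length columns.
row = b"".join(struct.pack("<H", len(c)) + c for c in [b"abc", b"de", b"fghij"])
print(nth_varcol(row, 2))  # b'fghij'
```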
There are also cycles expended to handle conversion from raw bytes to the correct data type and special SQL Server rules.
A couple of years ago, I proposed an extension to the Intel vector instructions in SIMD Extensions for the Database Storage Engine in order to facilitate database page-row-column address calculation. This would require working out the details of the new instructions, and getting Intel to implement them in a next processor still early in the design stage. I suspect that it would also be necessary to change the way metadata is stored to facilitate loading into the vector registers. It would take 2-3 years for the new processor to enter production, and another 2-3 years before the new technology is broadly deployed. Of course all of this should have been started ten years ago, when CPU frequency went over 1GHz.
I ran a test on a series of tables with a range of rows per page from 1 to 476. The queries consisted of a table scans, first getting just a count of rows and then aggregating successive columns. The first three systems are 2-socket, running SQL Server 2008R2 on Windows Server 2008R2. The last system is single socket running SQL Server 2012 on Windows Server 2012.
The table below shows, in CPU-nanoseconds, the cost per page (including 1 row) of the count query, the cost for each additional row, and then the additional cost for the first and each additional column aggregated.
Given that all processor cores were around 3GHz, the CPU-cycles for each page is in the range of 2000 cycles, with each row and column contributing another 150 or so cycles. When the first row is accessed, i.e., reading the last 2 bytes of an 8KB page, the entire 64-byte cache line comprising 32 row offset values would be read into cache. The approximately 150-cycle (about 50ns) cost per row corresponds to a single memory access for the row header, with the row offset most likely already in cache.
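Converting between the measured CPU-nanoseconds and cycles at roughly 3GHz:

```python
# ns <-> cycles at the ~3 GHz core frequency of the test systems.
clock_ghz = 3.0
page_cycles, row_cycles = 2000, 150
page_ns = page_cycles / clock_ghz
row_ns = row_cycles / clock_ghz
print(round(page_ns))  # 667 ns per page
print(round(row_ns))   # 50 ns per row, about one local memory access
```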
The tests compared column accesses in sequence. The single column aggregate is on the second column. The two column aggregate is on the second and third columns, which should be stored in adjacent bytes. There is some indication that pairs of columns cost marginally more than a single column, but the cost of 100+ cycles per successive column seems high. Is this type conversion, or due to the interpreted code?
My standard SQL Server configuration has the lock pages in memory right assigned, as this is required for Trace flag 834: use large-page allocations for the buffer pool. I was not aware of Thomas Kejser's report that lock pages in memory by itself could have a significant performance impact. If possible, I will re-run the above tests with and without lock pages in memory.
Another major topic in database performance is scaling to a very high number of cores, both over the cores in a single processor socket and over all the cores of a multi-socket system. Apparently the locking mechanism is a serious obstacle to scaling.
A few years ago, I did a study of a non-transactional b-tree search engine, i.e., without locking. Not only did it scale perfectly over the physical cores, it also scaled perfectly over the Hyper-Threading logical cores. This was possible because the b-tree search is a series of pointer-chasing memory accesses, resulting in many no-op cycles within a single thread. With no lock contention, the scaling was perfect.
I also looked at compression and parallelism. At DOP 1, queries to compressed tables consumed about 40% more CPU than to uncompressed tables, depending on the operation. The uncompressed tables would scale with increasing degree of parallelism up to a point, before scaling fell off and performance saturated. The compressed tables scaled perfectly until the performance was equal to the uncompressed tables. The interpretation was that contention for locks was limiting scaling with parallelism; the compression added enough CPU on each thread to relieve the contention. At a high degree of parallelism, 16-32 in some examples, the compression essentially became free.
Transactional memory is currently a topic of discussion. See the Intel Developer Forum 2012 session ARCS0004, Intel Transaction Synchronization Extensions. The objective is a lower overhead alternative to locking that can be used in most cases? The Microsoft paper also discusses lockless or lock-free strategies?
As soon as it was evident that CPU cycle time and memory access latency were on diverging paths, perhaps around 1990, it was realized that the page-row storage system, with pointer-chasing code to retrieve scattered metadata, would not realize the full capability of modern processors. Hence the term in-memory database for describing a storage engine optimized for processor-to-memory access characteristics.
Another option is columnar data storage. Sequential data access could then take advantage of the memory bandwidth of systems, which was improving at an adequate pace. Furthermore, the data type within each column would be known, except for the (hopefully) rarely used variant.
By the time of the later Intel 486 or early Pentium processors, the CPU cycle time to memory access latency ratio had exceeded ten. So there was talk of 10X or greater performance with in-memory and columnar database technology.
At that time, system memory had not become ridiculously huge as it is today, so in-memory databases were not really practical to achieve broad adoption.
Today server memory capacity is both huge and cheap, with 16GB DIMM pricing below $400. Of course the mainstream database systems have progressed far beyond their original base with a deep infrastructure of tools and features that migrating to a different DBMS would involve huge effort and risk.
The natural solution is to incorporate in-memory database technology into an existing DBMS. Microsoft SQL Server has already incorporated columnar storage in version 2012.
Breakthrough performance with in-memory technologies (Nov 8, 2012) on the Microsoft Technet SQL Server Blog by Dave Campbell, and The coming in-memory database tipping point (Apr 9, 2012), describe the rationale behind Hekaton. The first (Nov 8) post cites Per-Ake Larson et al, High-Performance Concurrency Control Mechanisms for Main-Memory Databases, which describes methods to reduce locking and other concurrency overhead.
The diagram below is from Oracle TimesTen In-Memory Database Architectural Overview