
SIMD Extensions for the Database Storage Engine

For the last 15 years, Intel and AMD have been progressively adding special purpose extensions to their processor architectures. The extensions mostly pertain to vector operations based on the Single Instruction, Multiple Data (SIMD) concept. The reasoning was that achieving significant performance improvement over each successive generation for the general purpose elements had become extraordinarily difficult. SIMD performance, on the other hand, could be improved significantly with special purpose registers and instructions. The very large performance gains achieved enabled new applications (mainly in multimedia, graphics and video processing) to become viable over the span of a single processor generation, instead of the typical several generations.

All of this is interesting, but here we are interested in the database engine. What can SIMD and special instruction set architecture extensions do for database performance, in particular the storage engine? The fact that there have been no database-related extensions would seem to indicate that SIMD cannot be used in this area. The core of database storage is the page organization of data. A cursory examination of accessing rows in a page would show a series of scalar instructions; there does not seem to be a fit for SIMD processing. If there were special instructions to enable SIMD-style coding, along with use of the vector registers, for database functions, the benefits might not be obvious without a detailed examination of where CPU clock cycles are expended.

Modern Microprocessors
Microprocessors since the Intel Pentium Pro era in the 1990s are essentially factory assembly lines. The throughput, in terms of instructions completed, is phenomenal. The time span from an instruction entering the pipeline to its completion, on the other hand, is much longer. Access to memory is longer still, hundreds of CPU cycles away. The key to performance on modern microprocessors is keeping the pipeline running and issuing requests for memory long in advance. A very poor coding practice is a sequence of instructions where each depends on the previous instruction to complete before the next can start. Even worse is a sequence that involves a fetch from memory to determine the next memory access.

Intel started with Multi-Media Extensions (MMX) in the Pentium MMX and Pentium II processors for vector integer processing, using 8 MMX registers aliased to the existing floating point registers. The next extension was Streaming SIMD Extensions (SSE) in the Pentium III, with 8 dedicated 128-bit registers supporting packed operations on 4 single precision floating point values. The Pentium 4 brought SSE2, which extended the special instructions to double precision floating point and 1-8 byte integers using the 128-bit SSE registers. Additional instructions were added for specific applications in SSE3 and Supplemental SSE3. Later, SSE4 added string and text processing, and AES-NI added encryption. The latest Intel microprocessor, Sandy Bridge, adds AVX with 256-bit registers.

SIMD extensions can improve performance in two ways. One is a reduction in the number of instructions. The second is the addition of special purpose registers. Consider adding two vectors. Not only are multiple additions consolidated into a single instruction, the loads preceding the add instruction and the following store operations are now a single vector instruction each, instead of multiple scalar instructions. A single load instruction fills the vector register with multiple elements. By using special registers for vector operands, contention for the very limited number of general purpose registers (in the x86 architecture, there are 8 in 32-bit mode and 16 in 64-bit mode) is reduced. The expectation is that there are fewer temporary memory copies to free up a register. The memory access pattern should be more predictable as well.

SQL Server Table Scan
Consider TPC-H Query 1, which aggregates several columns with a group by clause. (The group by clause, in light of the existing indexes, necessitates a Hash Aggregate operation that is more expensive than the simpler Stream Aggregate.) On the Xeon 5680 processor, the performance is approximately 350MB/s per core, corresponding to 44,800 pages per second. The Line Item table row size is 138 bytes (with the 3-byte date data type), so the row rate is 2,659,431 per second. This corresponds to roughly 1,240 CPU-cycles per row (1,400 CPU-cycles per row may be a better estimate). The cost is much lower for an aggregate without the group by. Even then, the cost per row aggregated is still several hundred CPU-cycles.

In a 2008 blog, I had reported a cost of approximately 1 CPU micro-second for the page handling overhead based on a Core 2 processor at 2.66GHz (or about 2,660 CPU-cycles), using a table with 1 row per page and the entire table in memory. The cost for a SELECT COUNT was approximately 0.05 micro-seconds for each additional row per page, or about 133 cycles. Aggregating columns incurs additional cost (I will provide these numbers later).

Hypothetical Page Access Instruction Sequence
Consider what the code sequence might look like after the page is in memory, appropriate locks and latches have been taken, and the page header has been read. (See Paul Randal and the SQL Server Storage Engine blog for information on SQL Server page structure.) It is assumed that the column offsets, sizes and types have been read from the system table. It is also assumed that we are only accessing not-null, fixed-length columns.

1) Load (2 byte) offset for first row at end of page
2) Load column offset
3) Compute column address (page address + row offset + column offset)
4) Load column from memory location
5) Consume column (plus data validity tests), write results

Repeat steps 2, 3, 4 and 5 for each column accessed.
Repeat the entire sequence for each row (assuming row locks are not required).

This is of course a gross simplification. In response to questions raised on one of my earlier blogs, Mario of sqlinternals.com provided a description of the actual instructions using the debugger. If anyone would be so kind as to do so again, I would appreciate it.

In addition, a SQL query is not compiled to binary code as a C/C++ program is, but is interpreted. It is presumed that SQL Server and other RDBMSs have efficient interpreters. (It might help if Microsoft would allow us to look at the source code, including the assembly output, for this.)

It is expected that some memory accesses will be found in L1 or L2 cache, especially for accesses to adjacent column values. The L2 cache latency on recent Intel processors is around 10 CPU clocks, relatively close. Any access to L3 incurs about 36 cycles of latency (Westmere). Local memory access should be around 50ns, and remote node memory access could be 100ns or more. Some documents indicate that even though local memory access is quick, a remote node snoop is still required for cache coherency.

In any case, there is a wide gap between a reasonable estimate of the CPU instructions necessary to process each row and column in a page and the actual total CPU clock cycles observed. It is suspected that many of the clock cycles are lost to memory latency. Due to the dependent nature of the instruction sequence, neither the compiler nor the processor can determine far in advance which memory addresses need to be prefetched.

SIMD Page Operation
The question is: what special purpose extensions could significantly improve database page access performance? Consider the previous scenario of a table scan with this special purpose register strategy. The row offsets are loaded into one special purpose register; a 128-bit (16-byte) register can hold 8 two-byte row offsets, a 256-bit register 16. The column offsets, for just the columns actually being accessed, are loaded into another register. A third register holds the size of each column, corresponding to the column offsets in the previous register. If more than 8 columns are accessed, additional register pairs (for the offsets and sizes) could be employed.

XMM1:    Row offsets:      1, 2, 3, 4, 5, 6, 7, 8
XMM2:    Column offsets:   1, 2, 3, 4, 5, 6, 7, 8
XMM3:    Column sizes:     1, 2, 3, 4, 5, 6, 7, 8

The special instructions for database operations could fetch row x, column y, at the address calculated as page base address + row offset + column offset. It is not necessary to access multiple columns or multiple rows in a single instruction, because this might require more memory accesses than the processor has load/store units for. Simply achieving a throughput of one column per clock cycle (or even one per several clock cycles) would already be a very large performance gain.

Note that this only works for fixed-length, not-null columns. A more complicated method could be contrived to handle nullable or variable-length columns. (Or one could hire a competent database architect.) Potentially, a data structure could be allocated for the columns accessed, and the instruction would copy each column in succession into the data structure. I am assuming the processor is able to recognize when more than one column falls within a single cache line and not issue unnecessary memory accesses.

This SIMD strategy is not just about reducing the number of instructions necessary to access the required rows and columns in a page. The more important aspect is minimizing memory latency stalls. Using vector registers to hold multiple offset values eliminates the waste of dedicating a large 8-byte (64-bit) general purpose register to a single 2-byte offset. The memory access pattern should now be predictable enough to substantially reduce memory latency delays.

There is no pretense that the rambling (does JK have a trademark on this term?) here is sufficient to make the case for SIMD extensions in the CPU architecture and for retro-fitting the SQL Server storage engine. But given the importance of the database in the server world, should there not be a thorough investigation (with source code access)? It is well understood that transactional database engines incur significant overhead because of their critical requirements. An alternative question: would database SIMD extensions be more valuable than flirtations with columnar database storage?

There are some additional questions with regard to micro-architecture optimizations. Should the row start positions be 4, 8, 16, 32 or 64-byte aligned, as appropriate for the table's average row size? Should the leading fixed-length columns be the 4- and 8-byte values, followed by pairs of 2-byte values or quads of 1-byte values? Does the SQL Server 3-byte Date data type incur an alignment performance penalty?

Below is a Microsoft Research / Columbia University paper on database SIMD, though our strategy appears to be different.

Per John in the comments:
From what I gather on John's blog, VectorWise is its own company, but is partnering with Ingres for really fast data analytics. My discussion focused on retaining the row organization in a general purpose, transactional DB. In the analytics world, there is no problem with implementing column storage, and of course, using SIMD instructions effectively is the mark of a serious programmer. (John: did you think I was the department head of statistics? The two of us had been fighting for top search ranking until that movie star from Taiwan blew us both away.)