Home, Query Optimizer, Benchmarks, Server Systems, Processors, Storage, TPC-H Studies

Column Store, Parallelism and Decimal

SQL Server performance has the interesting nature in that no matter how sound and logical an idea is on how it might behave in one scenario compared to another, what actually happens could be very different. Presumably the reason is that what happens inside the SQL Server engine is very complex with very many elements to support even a basic query, including steps that we are not even aware of. On top of this, the modern microprocessor is also very complicated, with radically different characteristics depending on whether an address is in (the processor L2) cache, or a memory round-trip necessary, and then whether it is a local or remote node memory access, not to mention the implications of cache-coherency checks.

With this in mind, some aspects of the Decimal/Numeric data type are of interest. There have been previous discussions on the fact that the Decimal data type is more expensive than integer or float, with impact that could be on the order of 2-3X, enough to become a serious consideration in queries that aggregate very many rows and columns with the Decimal data type. It is easy enough to understand the integer and floating point data types can be executed directly by the microprocessor while decimal must be handled in software, which should mean that the overhead is much higher than 2-4X. The explanation is that the even simple matter of accessing a column in the page-row storage organization of traditional database engines involves a series of address offset calculations for which the modern microprocessor cannot pre-fetch from memory sufficiently far in advance to keep its execution pipeline filled.

If this is indeed the case, then one would expect that the difference between integer and float compared to decimal would have far larger impact in column store indexes introduced in SQL Server 2012 for nonclustered and clustered in the upcoming 2014. The whole point of column store is to access memory sequentially to fully utilize the capability of modern microprocessors emphasizing bandwidth oriented over serialized round-trip memory accesses. In any performance investigation, it is always very helpful first to build baseline with non-parallel execution plans. This is because parallel execution introduces a whole new set of variability's that can complicate the assessment procedure. Of course, with such sound and logical reasoning, the outcome is inevitably the unexpected, hence the opening paragraph.

It would seem that the SQL Server engine follows completely different code path on operations to column store indexes depending on whether the execution plan is non-parallel or parallel. But it turns out that is occurs in SQL Server 2014 CTP2, and not SQL Server 2012 SP1, so it is possible the unexpected behavior will not occur in the 2014 release version?

Test System

The test system is a Dell PowerEdge T110II with 1 Xeon E3-1240 (Sandy Bridge) 4-core, 3.3GHz nominal (3.7GHz in Turbo) processor with Hyper-Threading enabled, 32GB memory, and storage on 8 SATA SSDs in RAID 0 attached to a LSI 9260 RAID controller. The operating system is Windows Server 2012 R2, and SQL Server version 2014 CTP 2.

The database was populated using the TPC-H data generator (dbgen) to produce a SF10 data set. This puts 59.986M rows in the Lineitem table which would have been 10GB using the 8-byte datetime data type but is 8GB with the 4-byte date data type. The index keys are different from the TPC-H kit, but the test here are not represented as conforming to TPC rules for official results.

Four Lineitem tables were created, all at SF 10. Two use the conventional page/row storage, and the other two use Clustered Columnstore indexes. The conventional tables were not compressed, while Columnstore indexes are compressed without option. For each type of storage, one table has 4 columns of type float (8 byte), and the other has 4 columns declared as decimal(18,6) at 9 bytes, NOT NULL in both cases.

The conventional Lineitem table average 139 bytes per row or 59 row per page with 8 byte float and 143 bytes per row, 57.25 rows per page for the 9 byte decimal. The table with clustered column store index averaged 44.1 and 45.67 bytes per row for float and decimal respectively. The columnstore indexes where about 2.5GB versus 8GB for the conventional tables.

Previous test have shown that there is no difference between int/bigint, money and float, as all are natively supported on the processor hardware. From the table definition, the four float/decimal columns are adjacent, and should be within 64 bytes of the row header? Meaning all values are in the same cache line?

One additional note is that this report is a quick response to a question concerning decimal overhead. I did not have time to setup rigorous measurements averaged over 100 executions and follow-up with an examination of anomalies. All measurements here are based on a few runs.

Page/Row and ColumnStore Performance with Parallelism

The basic test case a simple aggregate of 1-4 of the float or numeric columns along with a count, in reference to a count only query. For the page/row table, a clustered index (table) scan is forced. There is no point to forcing a scan on columnstore index, due to the nature of column storage. Performance data is collected from sys.dm_exec_query_stats. An attempt was made to ensure data in memory prior to each measurement, but some columnstore accesses generated a small amount of disk IO.

Below is the CPU in nanoseconds per row for the four cases at DOP 1. The DMV reports worker time in micro-seconds, so that value was multiplied by 1000 to get nanoseconds.

The CPU nominal frequency is 3.3GHz but for single thread operations, could very be running at the turbo frequency of 3.7GHz, somewhat more than 3 cycles per ns. The cost of the Count only operation, forcing a scan on the entire 8GB table (but does not touch either the float or decimal columns) is about the same for both conventional tables at 58.8 and 60.15 ns per row respectively, probably reflecting the slight difference in table size (3%).

The true cost structure of a SQL Server table (clustered index) scan consists of a cost for the page access, each row within a page, and each column, typically with the first column access having high cost than the subsequent columns, and perhaps higher cost if a subsequent columns is not on a previously accessed cache line, and perhaps higher for non-fixed length columns that involve a more complicated address calculation.

In previous reports, I have cited the page access cost as in the 650-750 CPU-ns range for Sandy Bridge generation processors. So about 10ns of the average row cost cited above is amortizing the page access cost (for just under 60 rows per page).

Below are the same test data, but showing incremental cost of each additional column accessed and aggregated. The Count value is the same as above because it is the baseline operation.

Notice that the incremental cost for the first column aggregated (1SUM) is higher than the subsequent columns. It is strongly evident that decimal aggregation is much more expensive than the float type (and other tests show float to be the same as int and money).

The reason that we cannot put a specific value on the difference is because of the cost structure of complete operation has page, row and columns components of which the int/float/decimal difference only involves the last component. In addition, the number of columns of each type also impacts the differential.

Below is the cost per row in CPU-ns of the count query, with the two conventional tables on the left and the two columnstore indexes on the right at DOP from 1 to 8. The system under test has 4 physical cores with HT enabled. SQL Server correctly places threads on separate physical cores when DOP allows, but the DOP 8 test forces both logical processors on each core to be used. It is also clear in the Columnstore tests that there is something very peculiar. CPU put unit work is not supposed to decrease from DOP 1 to 2. There are certain cases when this does happen, example being a hash join in where the parallel plan has a bitmap filter, which is not employed (per rule) in non-parallel plans. This not the case here, and a test on SQL Server 2012 shows the expected performance advantage for columnstore at all DOP levels.

Below is the rows per second for the Count query. This is the inverse of elapsed time and is better for demonstrating scaling with DOP. The vertical axis is log scale in base 2 to better distinguish 2X. The scaling with DOP is not particularly good in the DOP 1-4 range, with each thread on separate physical cores. This is believed to be the case with as CPU/row only rises moderately with parallelism to DOP 4. This query does almost no work other than page access, so it is possible there is contention somewhere in the buffer pool management?

Perfect scaling would be doubling performance for each doubling of DOP (each thread on separate physical cores), an example being from 16 to 32 rows/µs on the vertical scale from DOP 1 to 2. An indicator of quality of the measurement the ratio of worker time to elapsed time. In a perfect situation, this would be equal to the DOP. At DOP 4, the ratio is unexpectedly low at 2.8. Very good scaling is normally expected when parallelism is over separate physical cores. Here the scaling in that case is poor, but appears to be great when both logical processors on each core are active at DOP 8. The sharp rise in CPU per row from DOP 4 to 8 is indicative of this aspect of HT. Had the DOP 4 measurement indicated the correct worker/elapsed ratio closer to 4, there would have been only a moderate increase in performance from DOP 4 to 8.

Below is the single column SUM query cost per row (in ns) for the two conventional tables on the left and the two columnstore tables on the right. The cost difference between float and decimal in the conventional tables is now evident though not large. It is much more significant in the columnstore tables, as expected.

Below is the rows per second for the single column SUM query.

Below is the two column SUM query cost per row for the two conventional tables on the left and the two columnstore tables on the right. There is a larger difference between float and decimal in the conventional tables compared to the single column query. This is expected as there is more work is in column operations relative to the page and row access. The difference on the columnstore table is than in the single column and this was not expected.

Below is the rows per second for the two column SUM query.

Below is the 3 column SUM query cost per row for the two conventional tables on the left and the two columnstore tables on the right.

Below is the rows per second for the 3 SUM test.

The graphs below show the query cost per row for Columnstore access with 1, 2 and 3 columns, the same as before, except with the DOP 1 value truncated.


Aspects of the Decimal data type cost relative to float have been examined, noting the float has elsewhere been observed to be the same as int and money. The overhead can be as large as 2-3X in conventional page/row tables, depending on the ratio of work between page and row access, relative to column aggregation. In queries that access small to even moderately large number of decimal values (rows*columns) the higher cost of decimal should not be significant. In large data warehouse queries that access ten million values, it will be noticeable and probably start to get painful in the hundred million scale.

It was expected that the Decimal data type cost would be much high in columnstore indexes, as the row access costs are greatly reduced, magnifying the column contribution. First, the problem in SQL Server 2014 CTP2 with columnstore clustered index for non-parallel execution plans was observed, and it is hoped that this problem will be resolved in RTM. The expected large impact was observed on queries aggregating a single decimal type columns, but the trend decreased unexpectedly for 2 and 3 columns aggregated. SQL Server performance characteristics are not always in-line with expectations, no matter how sound and reasonable the logic is.