Benchmark Summary 2014-07-05

TPC-H and Columnstore

Earlier I had commented on the TPC-H results published in April of this year for SQL Server 2014 using clustered column store storage, noting that two of the 22 TPC-H queries did not perform well in column store. I had speculated on the reason without investigation (I should have learned by now not to do this), that perhaps the cause was that the row store result benefited from date correlation optimization. Thomas suggested otherwise (see below) pointing out that column store has an alternative mechanism of greater general usefulness in the keeping min/max on each columns, along with citing the join to Customers as a more likely explanation, evident in the query plan (which is why one should always provide the plan).

Thomas Kejser Comments
I am not sure your theory is correct in the case of Q10. It is noteworthy that the column store requires that the join with CUSTOMER is performance before the sort on revenue. The row store on the other hand can do a loop join (so can the column store, but that is not the plan you get it seem).

This must mean that the sort buffer is significantly larger for the column store (as reflected in the plan estimates) - which in turn can cause a rather significant memory consumption. It is also noteworthy that the column store does not seem to push the return flag predicate into the storage engine.

With column storage segments storing the min/max of all values contain in each column, it is unclear if the date correlation provides any benefit that isn't already gain from the segment header.

Another odd thing about the column store plan of Q10 is that the join of LINEITEM/ORDER is hashed, while the probe happens on CUSTOMER. Unless the predicate on RETURNFLAG is very selective (I don't recall) this is the wrong way around and may cause further spilling

This is easy enough to test. First strategy is to remove the CUSTOMER and NATION tables from the query, making it a pure Below is the test version of Query 10.

SELECT TOP 20 O_CUSTKEY, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM ORDERS, LINEITEM
WHERE L_ORDERKEY = O_ORDERKEY AND O_ORDERDATE>= '1993-10-01' AND
O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND
L_RETURNFLAG = 'R'
GROUP BY O_CUSTKEY
ORDER BY REVENUE DESC

The execution plan for the test query is below with row-store. (Top operation to the left not shown for compactness).

tpcE

The execution plan for the test query is below with column-store. (Top operation also not shown).

tpcE

The two plans are essentially the same, with the difference being that the column-store plans applies the ReturnFlag as a filter operation instead of as a predicate in the LINEITEM access. I suppose this is because each column is stored separately, or perhaps this is just the way the column-store plan is shown. About 25% of rows in LINEITEM meet the ReturnFlag = R condition.

On my 1 socket 4 core, HT enabled test system, at Scale Factor 10 (SF10), the SQL Server query execution statistics for the original version of Q10 are:
Row Store CPU time =  5876 ms, elapsed time =  814ms
Col Store  CPU time = 10826 ms, elapsed time = 1758ms

This is somewhat in-line with the 3 official TPC-H reports at SF 1000, 3000 and 10000 (1, 3 and 10TB) compared against different systems and SQL Server 2012 or earlier.

For just the core ORDER - LINEITEM query
Row Store CPU time = 5248 ms, elapsed time = 769ms
Col Store   CPU time = 4030 ms, elapsed time = 565ms

So it is clear that TK had the correct explanation for the poor column store performance relative to row store in the case of Q10. The counter test for my original suggestion, is to explicitly apply the date range on LINEITEM discovered by the date correlation optimization in row-store, L_SHIPDATE between 1993-09-20 and 1994-06-17. As pointed out earlier, the actual ship date range 1993-10-02 and 1994-05-02. This further improved the Columnstore result to
Col Store   CPU time = 2686 ms, elapsed time = 403ms

This is a small improvement over the existing Columnstore min/max feature. My thinking is that the row store date correlation feature is not particularly useful in real world databases with highly irregular date correlation, and that if such date correlation did exist, the analyst should spell it out rather than depend on a database engine feature. I am tempted to speculate that it might be better to partition on join columns instead of date range, but perhaps I should not do so without investigation? unless of course, this prompts someone else to do the investigation.

Now that we know were the problem occurred in Q10, we can attempt to rewrite the query to avoid the error, as shown below.

SELECT C_CUSTKEY, C_NAME, REVENUE, C_ACCTBAL, N_NAME, C_ADDRESS, C_PHONE, C_COMMENT
FROM (
 SELECT TOP 20 O_CUSTKEY, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
 FROM CUSTOMER, ORDERS, LINEITEM, NATION
 WHERE L_ORDERKEY = O_ORDERKEY AND O_ORDERDATE>= '1993-10-01' AND
 O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND
 L_RETURNFLAG = 'R'
 GROUP BY O_CUSTKEY
 ORDER BY REVENUE DESC
) x
JOIN CUSTOMER ON C_CUSTKEY = O_CUSTKEY
JOIN NATION ON C_NATIONKEY = N_NATIONKEY
ORDER BY REVENUE DESC

The alternate query improved both the row and column-store query plans in pushing out the join to Customers and Nation to after the Top clause. The row-store plans is:

tpcE tpcE

The column-store plan is below.

tpcE tpcE

The impact is minimal in the row-store plan because the reduced number of index seeks for Customers from 115 to 20 is small in the overall query. For the column store plan, the performance retains most of the gains achieved in the Order-Lineitem only test.
Col Store   CPU time = 4218 ms, elapsed time = 598ms

In my test system, I have Q4 as 3 times faster with column-store over row-store, so I do not know why the published reports have it as comparable or slower.

Benchmark Summary 2014-05-05

TPC-H

Three TPC-H benchmark results were published in April of this year at SQL Server 2014 launch, where the new updateable columnstore feature was used. SQL Server 2012 had non-updateable columnstore that required the base table to exist in rowstore form. This was not used in the one published TPC-H benchmark result on SQL Server 2012, which includes two refresh stored procedures, one inserting rows, the second deleting rows. It is possible that the TPC-H rules do not allow a view to union two tables? and perhaps a delete via the partitioning feature? (meaning the delete range must match the partition boundaries). Another possibility is that SQL Server 2012 columnstore was considered to be a multi-column index which is also prohibited to reflect the principle of being ad-hoc queries.

 

SQL Server Columnstore

First a few quick words on SQL Server columnstore. Columnstore is not actually an index of the b-tree index form. The MSDN Columnstore Indexes Described states that columnstore index is "a technology for storing, retrieving and managing data by using a columnar data format, called a columnstore." In SQL Server 2012, it is called a nonclustered columnstore index not because it is nonclustered or an index, but because the base table must exist in traditional rowstore form. In SQL Server 2014, there is a clustered columnstore index not because data is stored in order of the index key, as there is no key, but rather that there is no rowstore table, just the columnstore.

 

TPC-H details Date Columns and Query SARGs

The full details of the TPC-H Decision Support benchmark are described on the TPC.org website. There are a few details of relevance to the use columnstore. The largest table in TPC-H is LINEITEM, which has 3 date columns, ShipDate, ReceiptDate and CommitDate, and is clustered on ShipDate. The second largest table is ORDERS, clustered on the one date column OrderDate. These two tables are joined on OrderKey. There is correlation between values in the date columns in these two tables, some natural, and others based on reasonable business conditions. ShipDate must be greater than OrderDate obviously, and is also no more than 121 days greater than OrderDate per benchmark specification built into the data generator. CommitDate is between -89 and 91 days of ShipDate. ReceiptDate is between 1 to 30 days after ShipDate. The date values ranges from Jan 1992 to Dec 1998.

There are 22 Select queries in the TPC-H benchmark, along with the 2 refresh stored procedures. Many of the Select queries specify a date range on one of the date columns or a lower bound one date column and an upper bound on a different column. Ideally, for queries that target rows in a limited date range, we would like to have upper and lower bounds on for the cluster keys on both the ORDERS and LINEITEM tables, OrderDate and ShipDate. However the TPC-H rules do not permit re-writing the query SARGs based on inferable knowledge.

That said, apparently the rules do not preclude the query optimizer from discovering such knowledge. One of the other RDBMSs was probably first to do this, and Microsoft followed suit in order to be competitive in the TPC-H benchmark with the Date Correlation Optimization feature in 2005. Personally, I am not aware of any production server using this feature. Realistically, any organization that was having query performance issues related to date range bounds would probably have directed the analyst to apply appropriate date bounds on the cluster key. This is most probably a benchmark specific optimization feature.

The date correlation optimization statistics do not exist when using clustered columnstore, because there is no underlying rowstore table with foreign key relations? The date correlation statistics do exist when using rowstore tables with foreign keys and are used by nonclustered columnstore indexes?

 

TPC-H on SQL Server 2014 with Columnstore

That said, let us now look at the 3 new SQL Server 2014 TPC-H results published making use of the new clustered columnstore indexes. One is from IBM at Scale Factor 1000 (1TB) and two from HP at 3TB and 10TB respectively. The new results are compared to prior results with traditional rowstore on previous versions of SQL Server and previous generation server systems.

Because Columnstore is not really an index in the b-tree sense, given that queries frequently involve date ranges, it is presumed to be important to use partitioning with Columnstore. The three new TPC-H reports on SQL Server 2014 partition both ORDERS and LINEITEM by OrderDate and ShipDate respectively (the cluster key in previous versions) with a partition interval of 1 week (7 years x 52 weeks per year = 364 partitions). Perhaps of interest, the scripts show that a rowstore partitioned index is first build before building the partitioned clustered columnstore index.

 

TPC-H at SF 1000 (1TB)

The new TPC-H 1TB result on SQL Server 2014 using columnstore is compared with 3 previous results on SQL Server 2008 R2. There is a difference in memory configurations between the four systems below. For SQL Server 2014 with clustered columnstore indexes, the TPC-H SF1000 total database size is just under 430GB, so with 1TB memory, the benchmark is running entirely in memory after the initial data load, with the exception of hash operation spills to tempdb.

SFVendor
System
Processor
GHz
SocketsCores
Threads
MemSQLQphHPowerThrough
put
Date
1TBIBM
x3850 X6
E7-4890 v2
2.8
460/12015362014519.976695,445388,7794/16/14
1TBHP
DL980 G7
E7-4870
2.4
880/16020482008R2219,888233,119207,4078/30/11
1TBIBM
x3850 X5
E7-8870
2.4
880/8020482008R2173,961200,889150,6355/20/11
1TBCisco
UCS C460
E7-4870
2.4
440/?10242008R2134,117156,157115,18812/7/11

For rowstore, the TPC-H SF1000 total database size is 1420GB, so the two systems with 2TB memory are mostly running with data in memory, again except for the initial load and spills to tempdb. There is definitely disk IO for data in the Cisco system at 1TB physical memory. The performance impact is noticeable, but probably not as severe as one might think based on how people talk of memory. The reason is that all of these systems make correct use of massively parallel IO channels to storage capable of 10GB/s plus table scans, and many of these also use SSD storage capable of more random IOPS than SQL Server can consume even at very high parallelism.

The new SQL Server 2014 result is 2.36X higher on composite score (QphH) and 3X higher in the Power test than previous versions with conventional rowstore.

The 22 individual query run times from the Power test at 1TB are shown below.

tpcE

Query 1, a single table aggregation, is more than 10 times faster on SQL Server 2014 using columnstore on 60 cores (Ivy-Bridge, 2.8GHz) than 2008R2 using rowstore, on 80 cores(Westmere-EX). Per TPC-H benchmark procedure, the test is run immediately after data load and index creation? The second largest speed-up is Query 16, joining 3 tables, at 6.4X.

Query 10 is 40% slower with columnstore, and Query 4 about the same between columnstore and conventional. This query is listed near the end of this section on TPC-H.

Notice that in the 3 SQL Server 2008R2 results, Query 2 becomes slower as the degree of parallelism increases from 40 cores (threads unspecified) to 80 cores/80 threads and then to 80 cores/160 threads. Elsewhere I had commented that SQL Server really needs a graduated approach to parallelism instead of the all or nothing approach.

TPC-H at SF 3000 (3TB)

The new TPC-H 3TB result on SQL Server 2014 is compared with a previous result on SQL Server 2008R2. Here, the difference in memory is a significant contributor. The SQL Server 2014 system has more memory than the columnstore database, while 2008R2 systems has much less memory than the 3TB rowstore database (4.5TB).

SFVendor
System
Processor
GHz
SocketsCores
Threads
MemSQLQphHPowerThroughputDate
3TBHP
DL580 G8
E7-4890 v2
2.8
460
120
30722014461,837631,309337,8594/15/14
3TBHP
DL980 G7
X7560
2.27
864
128
5122008R2162,601185,297142,6856/21/10

I am supposing that the reason the HP 2010 report only configured 512GB (128 x 4GB priced $29,440 in 2010) was that there would not be a significant performance improvement for the TPC-H 3TB result at either 1TB or even 2TB memory in relation to the higher price (128 x 16GB priced $115K in 2011).

Several of the TPC-H queries involve nearly full table scans of the large tables. If there is not sufficient memory for the entire database, then the next objective is to have sufficient memory for reducing the spill to disk in hash operations? HP may have elected for the better price-performance? Or perhaps someone just wanted to make a point. The point being that it is important for the SQL Server engine to function correctly when heavy IO is required.

In the SQL Server 2014 result, the system has 3TB memory (96 x 32GB priced $96K in 2014) which is sufficient to hold the entire data set for TPC-H 3TB in columnstore.

The overall composite score is 2.8X higher with columnstore and 3.4X higher on the Power test.

The 22 individual query run times from the Power test at 3TB are shown below.

tpcE

The largest gain with column-store is Query 19 at 19.7X. Query 4 and 10 show degradation, similar to the case at SF1000.

TPC-H at SF 10000 (10TB)

The new TPC-H 10TB result on SQL Server 2014 is compared with a previous result on SQL Server 2012. Strangely, supporting documentation for the HP 2013 report on SQL Server 2012 is missing so there is no indication as to whether nonclustered columnstore is used? I am guessing that columnstore was not used because the results are in line with expectations on rowstore.

SFVendor
System
Processor
GHz
SocketsCores
Threads
MemSQLQphHPowerThroughputDate
10TBHP
DL580 G8
E7-4890 v2
2.8
460
120
3072 2014404,005631,309337,8594/15/14
10TBHP
DL980 G7
E7-4870
2.4
880
160
4096 2012158,108185,297142,6856/21/11

The full data size in columnstore at SF 10000 should be 5TB, and 14TB in rowstore, so there should have been heavy disk IO in both results.

The overall composite score is 2.55X higher with column-store and 3.1X higher on the Power test.

The 22 individual query run times from the Power test at 10TB are shown below.

tpcE

The largest gain with column-store is Query 6 at 23.2X. Query 4 and 10 show degradation as in the two previous cases.

 

TPC-H Query 10

Below is Query 10. This query is consistently slower in columnstore relative to rowstore.

 
/* TPC_H Query 10 - Returned Item Reporting */

SELECT TOP 20 C_CUSTKEY, C_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE, C_ACCTBAL,
N_NAME, C_ADDRESS, C_PHONE, C_COMMENT
FROM CUSTOMER, ORDERS, LINEITEM, NATION
WHERE C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY AND O_ORDERDATE>= '1993-10-01' AND
O_ORDERDATE < dateadd(mm, 3, cast('1993-10-01' as date)) AND
L_RETURNFLAG = 'R' AND C_NATIONKEY = N_NATIONKEY
GROUP BY C_CUSTKEY, C_NAME, C_ACCTBAL, C_PHONE, N_NAME, C_ADDRESS, C_COMMENT
ORDER BY REVENUE DESC

The execution plan for rowstore at SF 10 is shown below.

tpcE
tpcE

The execution plan for columnstore at SF 10 is shown below.

tpcE
tpcE

Below are the details on ORDERS and LINEITEM from the rowstore plan.

tpcE tpcE

Notice that there are seek predicates on LINEITEM for 1993-09-20 to 1994-06-17. The actual range should be 1993-10-02 to 1994-05-02, for 1 day after the OrderDate lower bound and 121 days after the OrderDate upper bound.

Below are the details on ORDERS and LINEITEM from the columnstore plan.

tpcE tpcE

In columnstore, every operation is a scan. There is a predicate for the ORDERS table but not on the LINEITEM table. Presumably storage engine must scan entire set of LINEITEM partitions while only scanning the ORDERS partitions encompassing the SARG date range

I am thinking the reason is that with date correlation in conventional row-storage, the SQL Server query optimizer knows that the data range in LINEITEM ShipDate is also restricted by the lower OrderDate and the upper OrderDate plus 121 days, corresponding to 1 day after the lower bound on OrderDate to 121 days after the upper bound on OrderDate.

TPC-H Query 4

TPC-H Query 4 below is slower than row storage in the 3 and 10TB results. I am thinking that the reason is the same?

 

/* TPC_H Query 4 - Order Priority Checking */

SELECT O_ORDERPRIORITY, COUNT(*) AS ORDER_COUNT FROM ORDERS
WHERE O_ORDERDATE >= '1993-07-01' AND O_ORDERDATE < dateadd(mm,3, cast('1993-07-01' as date))
AND EXISTS (SELECT * FROM LINEITEM WHERE L_ORDERKEY = O_ORDERKEY AND L_COMMITDATE < L_RECEIPTDATE)
GROUP BY O_ORDERPRIORITY

 

See TPCH Query Plans for the TPC-H reference queries and execution plans at SF1 on SQL Server 2005. The parent page TPCH Interim has links for the SF1000 query plans with and without parallelism.

TPC-H Columnstore Summary

As with every other new feature, Columnstore is a really interesting new technology. But think hard about what is really happening, experiment, and remember to get good execution statistics and plans prior to making changes, then get the new execution statistics and plans after the change.

 

TPC-E

The charts below show the progression of performance over time for the selected TPC-E results spanning Core 2, Nehalem, Sandy Bridge and Ivy processors at 2, 4 and 8 sockets.

tpcE

For the 2-socket systems, West-1 is from the first set of TPC-E results reported for Westmere X5680 with HDD storage and West-2 is the later X5690 report with SSD storage. Both are 6-core Westmere-EP processors. The West-3 is the E7-2870 10-core (Westmere-EX) on SSD storage.

For the 4-socket systems, West-1 is on HDD storage, and West-2 on SSD, both 2K8R2 and 1TB memory. The West-3 is on Win/SQL 2012, 2TB memory and SSD storage.

The same data is shown below with reverse organization showing scaling with sockets for each of the processor architectures.

tpcE

Notes: Nehalem
2-socket 4-core, 2.93GHz (11.72 core x GHz), 4 & 8-socket 8-core 2.26GHz (18.08 core x GHz)
Westmere
2-socket 6-core, 3.46GHz (20.76 core x GHz/socket), 4 & 8-socket is 10-core 2.4GHz (24 core x GHz)
Sandy Bridge
2-socket 8-core, 2.9GHz (23.2 core x GHz/socket), 4-socket is 8-core 2.7GHz (21.6 core x GHz)
Ivy Bridge
2-socket 12-core, 2.7GHz (32.4 core x GHz/socket), 4 & 8-socket is 15-core 2.8GHz (42 core x GHz)