Large Query Performance from SQL Server 2000 to 2008, 32 & 64-bit - Updated 2009-12

SQL Blog reference: Large Query Performance from SQL Server 2000 to 2008, 32 & 64-bit

Joe Chang jchang6@yahoo.com

This is an update of my original discussion on this topic in 2008.

I had been meaning to do a somewhat comprehensive review of SQL Server performance from versions 2000 to 2008, for both 32 and 64-bit, on data warehouse type queries, with an in-depth examination of scaling in parallel execution plans. For now, I can provide a short summary.

Test Platform

The test platform is a Dell PowerEdge 2900 with 2 quad-core Xeon E5330 2.66GHz processors, and 24GB memory. The operating system is Windows Server 2008 64-bit for both 32 and 64-bit SQL Server versions. Technically SQL Server 2000 is not supported, but this is just a performance comparison, not a production environment. The database is generated using the TPC-H dbgen kit for scale factor 10, meaning the Lineitem table is approximately 10GB, and the entire database is approximately 17GB, which fits entirely in memory. There was some tempdb activity, which is spread across 10 15K drives.

All (original) tests are run twice: the first run loads data into memory and compiles the execution plan, and all results shown are for the second run. For SQL Server 2008, the tables use the new date data type in place of datetime, and queries are modified to avoid conversion anomalies as noted below. The ALTER DATABASE tpch SET DATE_CORRELATION_OPTIMIZATION ON optimization was not employed here. This will be discussed separately. The newer results are based on ten test runs, discarding the first sequence (which loads data into memory and absorbs the query compile time) and averaging the results of the last nine.
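The run procedure can be sketched in T-SQL roughly as follows. This is a minimal sketch and not the actual harness behind these results. It assumes the database is named tpch (as in the ALTER DATABASE statement above), uses a query in the style of TPC-H Query 6 with illustrative parameter values as a stand-in for the full 22-query sequence, and controls parallelism with a per-query MAXDOP hint, which is only one of the ways the DOP could have been set.

```sql
-- Sketch only: database name, query choice and parameter values are assumptions.
USE tpch;
GO
SET STATISTICS TIME ON;   -- reports CPU time and elapsed time for each statement
GO
-- The first execution warms the buffer cache and compiles the plan;
-- discard its timings and record the second (or the average of later runs).
SELECT SUM(L_EXTENDEDPRICE * L_DISCOUNT) AS REVENUE
FROM LINEITEM
WHERE L_SHIPDATE >= '1994-01-01'
  AND L_SHIPDATE < DATEADD(YY, 1, CAST('1994-01-01' AS DATE))  -- DATE requires SQL Server 2008
  AND L_DISCOUNT BETWEEN 0.05 AND 0.07
  AND L_QUANTITY < 24
OPTION (MAXDOP 4);        -- repeat the test with 1, 2, 4 and 8
GO 2
```

The GO 2 batch repeat is a feature of SSMS and sqlcmd, not of the engine, so another client would need its own loop.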

Overall Results

Below is the total (sum) CPU time in milliseconds to execute the 22 queries in sequence for max degree of parallelism 1, 2, 4, and 8. Note that official TPC-H scores are based on a geometric mean, scaled to the size of the database.
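The difference between a sum and a geometric mean matters because a sum is dominated by the few longest queries, while a geometric mean weights every query equally. A minimal sketch of both calculations, assuming a hypothetical table variable @timing and using only three of the SQL Server 2008 DOP 1 durations listed later in this article as sample rows:

```sql
-- Hypothetical table variable; sample rows are Q1, Q6 and Q14 (2008 RTM, DOP 1).
-- The multi-row VALUES syntax requires SQL Server 2008 or later.
DECLARE @timing TABLE (query_no INT, duration_ms FLOAT);
INSERT INTO @timing VALUES (1, 50013), (6, 1845), (14, 2146);

SELECT SUM(duration_ms)           AS total_ms,           -- what this article reports
       EXP(AVG(LOG(duration_ms))) AS geometric_mean_ms   -- LOG is the natural logarithm in T-SQL
FROM @timing;
```

This illustrates only the averaging step, not the full TPC-H metric definition.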

CPU by Build, DOP  DOP 1    DOP 2    DOP 4    DOP 8
2000 RTM           534,912  663,848  656,232  697,794
2000 bld 2187      514,881  589,245  657,543  770,272
2005 RTM 32        463,526  444,479  456,567  498,623
2005 SP2 32        464,478  403,668  413,685  452,134
2005 RTM 64        379,363  377,570  394,962  474,200
2005 SP2 64        370,206  327,149  345,155  436,491
2008 RTM           375,136  324,264  343,250  410,220

Duration in milliseconds to run the 22 queries, by max DOP.

Dur by Build, DOP  DOP 1    DOP 2    DOP 4    DOP 8
2000 RTM           553,900  293,411  191,552  149,568
2000 bld 2187      566,333  276,085  188,497  164,677
2005 RTM 32        480,839  237,933  134,644  84,721
2005 SP2 32        483,842  214,804  119,525  72,515
2005 RTM 64        379,563  194,199  107,409  65,094
2005 SP2 64        370,374  166,579  94,844   59,388
2008 RTM           375,135  171,390  94,028   56,795

SQL Server 2000

On SQL Server 2000 build 2187, notice that CPU increases from 514.8 to 589.2 seconds going from degree of parallelism (DOP) 1 to 2 and so on to Max DOP 8. This is expected because there is overhead to employing a parallel execution plan, and the overhead increases with the number of threads involved. Between SQL Server 2000 RTM and build 2187, there was a sharp jump in the CPU required at DOP 8. I will disregard this as there were significant changes and code fixes between the two builds concerning correctness of parallel execution plan results. Still, there is an overall performance gain from DOP 4 to 8.

Several years ago, I mentioned that SQL Server 2000 performance is very problematic beyond DOP 4. That was before multi-core processors, and there were at most 4 cores per NUMA node. So the more correct interpretation is that SQL Server 2000 is very problematic on NUMA systems. An earlier look at SQL Server 2005 RTM showed no such problems on NUMA.

SQL Server 2005 32-bit

In SQL Server 2005 (and SQL Server 2008), there is actually a decrease in CPU going from DOP 1 to 2. This is mostly attributed to the bitmap filter in hash operations. Some queries show a significant drop in CPU from DOP 1 to 2, others no change, and some an increase. From DOP 2 to 4 there is a slight increase in CPU, and a more significant increase going from DOP 4 to 8. This might indicate that DOP 2 and 4 are very good for overall efficiency, benefitting from bitmap filters in hash join operations, yet without incurring excessive parallelism overhead. (This is unrelated to the recommendation of Max DOP 4 on Itanium systems based on cores per NUMA node.) Unrestricted parallelism on the 8 core system yields the best single stream completion times, although this should really be tested on 16 or more cores before setting any rules.
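For anyone who wants to try a restricted DOP server-wide, the setting can be changed as sketched below. The value 4 is purely illustrative and is not a recommendation drawn from these results; a per-query OPTION (MAXDOP n) hint overrides the server-wide value for that query.

```sql
-- Sketch: cap parallelism server-wide. The value 4 is an illustrative assumption.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;   -- 0 means unrestricted
RECONFIGURE;
```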

In the transition from SQL Server 2000 to 2005 RTM, both 32-bit, there is a modest 15% reduction in the duration to run the 22 TPC-H queries using non-parallel execution plans. The improvement is similar at DOP 2, but then improves to 29% at DOP 4. At DOP 8 using all 8 processor cores, the reduction is a very substantial 49%, almost twice as fast.
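The percentages above can be reproduced from the duration totals in the Overall Results table. The sketch below assumes the SQL Server 2000 baseline is build 2187 (the row whose totals match the per-query table at the end of this article); the derived-table VALUES syntax requires SQL Server 2008 or later. It should return roughly 15%, 14%, 29% and 49% for DOP 1, 2, 4 and 8.

```sql
-- Duration totals in ms: SQL Server 2000 build 2187 versus 2005 RTM 32-bit.
SELECT dop,
       CAST(100.0 * (d2000 - d2005) / d2000 AS DECIMAL(5,1)) AS pct_reduction
FROM (VALUES (1, 566333, 480839),
             (2, 276085, 237933),
             (4, 188497, 134644),
             (8, 164677,  84721)) AS t(dop, d2000, d2005);
```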

Service Pack 2 does not change results at DOP 1, but yields a 10-14% improvement in parallel plans at DOP 2, 4 and 8.

SQL Server 2005 64-bit

From SQL Server 2005 32-bit to 64-bit, both RTM builds, the performance gain in terms of reduced duration was a solid 20% across all DOP from 1 to 8. The CPU efficiency improvement was a little less, so the tempdb configuration affects the results. Even though the entire data and indexes fit in memory, a query with large intermediate results is less likely to spool to tempdb at 64-bit than 32-bit. From SQL Server 2005 64-bit RTM to Service Pack 2, an additional 10% was realized at DOP 2 and higher.

SQL Server 2008 64-bit

SQL Server 2008 RTM is marginally better than SQL 2005 SP2. There is significant variation from query to query, so improvements should be expected over time, hopefully correcting the query plans that are slower while maintaining the performance advantage of the plans that are better. One of the big disasters in the SQL Server 2008 parallel execution plans occurs on Query 5, Local Supplier Volume. The query is:

/* TPC_H Query 5 - Local Supplier Volume */
SELECT N_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM CUSTOMER, ORDERS, LINEITEM, SUPPLIER, NATION, REGION
WHERE C_CUSTKEY = O_CUSTKEY AND L_ORDERKEY = O_ORDERKEY AND L_SUPPKEY = S_SUPPKEY AND C_NATIONKEY = S_NATIONKEY
AND S_NATIONKEY = N_NATIONKEY AND N_REGIONKEY = R_REGIONKEY AND R_NAME = 'ASIA'
AND O_ORDERDATE >='1994-01-01' AND O_ORDERDATE < CONVERT(DATE,DATEADD(YY,1, '1994-01-01'))
GROUP BY N_NAME
ORDER BY REVENUE DESC

Compare the CPU-ms and duration-ms for Query 5 between SQL Server 2005 SP2 and 2008 RTM, both 64-bit, by DOP.

CPU-ms       DOP 1   DOP 2   DOP 4   DOP 8
2005 SP2     19,703  14,134  15,665  21,949
2008 RTM     21,653  30,390  31,434  38,267

Duration-ms  DOP 1   DOP 2   DOP 4   DOP 8
2005 SP2     19,716  7,376   4,654   3,159
2008 RTM     21,648  16,150  8,371   5,120

The non-parallel plan is shown below (plan cost 1412.87).

TPC-H Query 5 Plan

The non-parallel plan starts with the Nation and Region tables to identify which customers are of interest, then joins to Orders and Lineitem, and finally joins as the inner source to Supplier.

The parallel plan is below (plan cost 1142.17).

TPC-H Query 5 Plan

The parallel plan starts with the date range on the Orders table, joins Lineitem, then joins successively as the inner source to Customer, Nation, Region and Supplier. The Nation and Region tables, which identify the customers of interest, are therefore joined late instead of first.

In both cases, the estimated number of rows involved is 1.378M, but because the join to Region occurs late in the parallel plan, the estimated 9M rows in Lineitem that meet the Orders date range are carried through three hash joins before being eliminated.

The non-parallel and parallel plan summary details are shown below. The parallel plan does indeed have a lower plan cost.

TPC-H Query 5 Plan

The main elements of the non-parallel plan

TPC-H Query 5 Plan

The main elements of the parallel plan

TPC-H Query 5 Plan

The IO costs for explicit tables and indexes are the same for the non-parallel and parallel plans. The CPU cost in a parallel plan is reduced by a factor of two in a DOP 2 plan. The DOP 4 plan has the same CPU cost as DOP 2. Each successive doubling of DOP reduces CPU cost by a factor of 2, at least up to 16. Anyone who can provide a system with a much larger number of cores can observe the CPU cost pattern.

The MaxDOP 1 plan is essentially:

SELECT N_NAME, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS REVENUE
FROM SUPPLIER
INNER JOIN (
 SELECT N_NATIONKEY, N_NAME, L_EXTENDEDPRICE, L_DISCOUNT, L_SUPPKEY
 FROM NATION
 INNER JOIN REGION ON N_REGIONKEY = R_REGIONKEY
 INNER JOIN CUSTOMER ON C_NATIONKEY = N_NATIONKEY
 INNER JOIN ORDERS ON C_CUSTKEY = O_CUSTKEY
 INNER JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
 WHERE R_NAME = 'ASIA'
 AND O_ORDERDATE >= '1994-01-01'  AND O_ORDERDATE < CONVERT(DATE, DATEADD(YY,1, '1994-01-01'))
) x ON L_SUPPKEY = S_SUPPKEY AND S_NATIONKEY = N_NATIONKEY
GROUP BY N_NAME
ORDER BY REVENUE DESC
OPTION (FORCE ORDER)

For the original query, the actual CPU is 21,965 ms at MaxDOP 1 and 28,721 ms at MaxDOP 2. The CPU for the forced query is 13,323 ms.

So this one query added 15.4 CPU-sec to the 22-query total of 324.3 CPU-sec, close to 5%, and about 8.0 sec to the duration.
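The CPU arithmetic can be checked directly from the figures quoted above (28,721 ms and 13,323 ms for Query 5, and the 324,264 ms DOP 2 total for SQL Server 2008 RTM from the Overall Results table). This is just a calculator-style sketch and should give roughly 15.4 CPU-sec and a little under 5%.

```sql
-- Extra CPU attributable to the Query 5 parallel plan versus the forced plan.
SELECT 28721 - 13323 AS extra_cpu_ms,
       CAST(100.0 * (28721 - 13323) / 324264 AS DECIMAL(4,1)) AS pct_of_22_query_total;
```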

Query 8 was also bad news on the parallel plans, with about 5 CPU-sec lost on the MaxDOP 2 parallel plan compared with a forced parallel plan modeled on the non-parallel plan. One might think that MS should have caught these anomalies. I think the reason they do not is that MS does not look at SF1-30 TPC-H results. The minimum for publication is 100GB, and that will probably increase to 300GB soon, because 30GB is not a real data warehouse. I do think MS should look very carefully at SF1-30. The queries are at the onset of eligibility for parallelism. The really big queries in SF100 and higher are less likely to encounter plan problems. While not strictly a data warehouse, most transactional databases I have seen do not remotely resemble TPC-C or E. I would say most have TPC-H SF1-10 sized queries mixed in with smaller transactions. So a bad execution plan can be really bad news.

I am sufficiently satisfied that SQL Server 2008 has a very powerful engine, and a decent optimizer. However, I have complained in the past about the rigid assumption that all query costs factor in IO time, the use of a fixed random-to-sequential IO performance model (320 IOPS to 10.5MB/sec), and out-of-balance IO-CPU ratios. If a proper calibration of the true cost formulas were to be done, there would probably be fewer silly mistakes resulting in goofy execution plans. Given that many people do not know how to diagnose this type of problem, a simple comparison test of 2000 or 2005 against 2008 can run into this matter, leading to a decision to stay with 2000/2005, when a few simple adjustments would have corrected the 2008 results.

SQL Server Settings

Generally I follow the HP TPC-H publications on optimization settings, particularly -E and -T834. Neither changed results by more than 1% either way. I had also looked at -T2301 in the past finding no apparent differences. I really would like MS to provide more details on T2301. Are there set points below which it has no effect?

SQL Server 2008 new Date data type changes

The 3 datetime columns in the LineItem table from 2005 become date columns, for an apparent savings of 12 bytes. The 2005 tpch SF10 database was 13.77GB (or rather, million KB) of data and 3.68GB of indexes, for a total of 17.46GB. In 2008, using the date data type in place of datetime, the size is 12.77GB of data and 2.96GB of indexes, for a total of 15.74GB. The average bytes per row of LineItem drops from 169 to 153, because one of the datetime/date columns was the cluster key.

Normally a simple reduction in size from column width, not row count, does not improve performance unless it impacts fit in memory. I always try to exclude this factor because one can generate any difference in performance by adjusting the amount of disk IO.

The original TPC-H queries may have SARG of the form

AND O_ORDERDATE >= '1994-01-01'

AND O_ORDERDATE < DATEADD(YY, 1, '1994-01-01')

Even before SQL 2008, the date functions would return a datetime or smalldatetime result as appropriate. In SQL 2008, the natural extension is to return a date type when the comparison is to a date column. I made this request in Connect and was told to bugger off. So SQL 2008 will convert the column to datetime to match the function result, losing the benefit of a proper SARG. Anyone upgrading to SQL 2008 with the date type and not changing code as below may get a nasty surprise.

AND O_ORDERDATE >= '1994-01-01'

AND O_ORDERDATE < CONVERT(DATE,DATEADD(YY,1, '1994-01-01' ))

 

[Update 2009-11-27]

The TPC-H reports from 2009 on use the following:

AND O_ORDERDATE < DATEADD(YY,1, CAST('1994-01-01' AS DATE))

which also has the desired effect.
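The type returned by each form can be inspected with SQL_VARIANT_PROPERTY. This is a small sketch for SQL Server 2008 or later; the column aliases are made up for illustration. The plain DATEADD on a string literal should report datetime, while both rewritten forms should report date, so the comparison against a date column needs no conversion of the column.

```sql
-- Inspect the result type of the three forms of the upper date bound.
SELECT SQL_VARIANT_PROPERTY(DATEADD(YY, 1, '1994-01-01'), 'BaseType')                AS plain_dateadd,
       SQL_VARIANT_PROPERTY(CONVERT(DATE, DATEADD(YY, 1, '1994-01-01')), 'BaseType') AS converted_result,
       SQL_VARIANT_PROPERTY(DATEADD(YY, 1, CAST('1994-01-01' AS DATE)), 'BaseType')  AS cast_input;
```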

Little things like this can cause people to refuse to budge from SQL 2000, which really needs to be retired.

Duration for SQL 2008 64-bit

SQL Server 2005 SP2 64-bit

Query  DOP 1    DOP 2    DOP 4    DOP 8
Q1     64,761   32,553   16,428   8,344
Q2     504      295      158      106
Q3     14,733   4,782    3,003    1,989
Q4     17,506   5,338    3,747    2,519
Q5     19,716   7,376    4,654    3,159
Q6     1,609    893      471      309
Q7     15,855   5,472    3,306    2,403
Q8     5,225    2,391    1,333    2,147
Q9     44,611   23,291   12,213   7,222
Q10    13,989   6,384    3,934    2,754
Q11    4,093    1,192    669      495
Q12    8,166    4,497    4,022    1,714
Q13    25,830   13,566   7,521    4,260
Q14    2,060    1,020    526      352
Q15    1,358    1,931    1,139    235
Q16    6,476    3,476    2,429    1,215
Q17    1,012    524      291      199
Q18    46,954   26,156   13,896   9,209
Q19    2,133    1,172    623      450
Q20    830      446      253      172
Q21    64,231   20,850   12,536   9,087
Q22    8,722    2,972    1,692    1,049
Total  370,374  166,579  94,844   59,388

SQL Server 2008 RTM 64-bit

Query  DOP 1    DOP 2    DOP 4    DOP 8
Q1     50,013   26,317   12,591   7,159
Q2     504      268      150      107
Q3     16,296   5,186    3,158    1,902
Q4     19,232   5,288    3,452    2,340
Q5     21,648   16,150   8,371    5,120
Q6     1,845    929      496      312
Q7     17,397   4,369    2,388    1,376
Q8     5,734    6,765    3,628    1,849
Q9     48,361   22,034   11,335   6,372
Q10    15,281   5,822    3,595    2,425
Q11    4,423    1,238    657      600
Q12    9,363    4,828    4,356    2,365
Q13    21,699   11,310   5,751    2,967
Q14    2,146    1,033    547      334
Q15    1,368    970      521      249
Q16    6,599    3,615    2,018    1,848
Q17    1,243    521      294      213
Q18    50,909   27,945   15,439   9,365
Q19    2,096    1,093    607      378
Q20    841      430      255      165
Q21    69,191   22,064   12,826   8,337
Q22    8,946    3,213    1,592    1,010
Total  375,135  171,390  94,028   56,795

Both are RTM

SQL Server 2005 RTM 32-bit

Query  DOP 1    DOP 2    DOP 4    DOP 8
Q1     77,014   38,249   19,330   10,270
Q2     606      372      191      123
Q3     19,231   7,891    4,334    2,923
Q4     26,271   11,238   6,741    4,191
Q5     29,658   11,692   6,719    4,023
Q6     2,496    1,328    689      428
Q7     18,398   7,851    4,489    3,912
Q8     6,486    3,222    1,767    2,857
Q9     52,476   30,084   16,660   9,193
Q10    20,168   10,183   5,704    3,530
Q11    4,964    1,814    841      500
Q12    10,543   5,973    3,836    2,716
Q13    29,274   15,893   7,949    4,762
Q14    2,270    1,240    634      367
Q15    1,645    2,243    1,160    332
Q16    6,305    4,193    2,508    1,466
Q17    1,206    641      341      227
Q18    67,641   41,470   27,456   18,106
Q19    2,562    1,358    726      498
Q20    1,015    534      313      205
Q21    91,422   36,763   20,249   12,584
Q22    9,188    3,701    2,007    1,508
Total  480,839  237,933  134,644  84,721

SQL Server 2005 RTM 64-bit

Query  DOP 1    DOP 2    DOP 4    DOP 8
Q1     81,464   40,777   20,415   10,351
Q2     478      284      154      107
Q3     14,059   6,620    4,129    2,641
Q4     16,782   7,965    4,632    3,120
Q5     19,054   9,838    5,676    3,481
Q6     1,597    888      484      304
Q7     15,256   6,508    3,973    2,496
Q8     5,184    2,566    1,413    2,177
Q9     43,179   24,957   13,070   7,384
Q10    13,568   7,161    4,014    2,771
Q11    3,953    1,296    704      512
Q12    8,259    4,436    4,128    1,720
Q13    27,088   13,831   6,789    3,720
Q14    2,036    1,143    589      344
Q15    1,475    1,951    1,051    259
Q16    6,292    3,681    2,509    1,260
Q17    975      519      290      198
Q18    45,735   27,197   14,874   9,570
Q19    2,109    1,141    613      456
Q20    808      427      247      180
Q21    61,625   27,604   15,794   10,901
Q22    8,587    3,409    1,860    1,141
Total  379,563  194,199  107,409  65,094

SQL Server 2000 sp4 + hf 2187 32-bit

Query  DOP 1    DOP 2    DOP 4    DOP 8
Q1     91,353   47,846   24,820   14,086
Q2     593      750      996      1,543
Q3     18,643   10,916   7,393    6,210
Q4     26,380   12,513   9,266    6,940
Q5     26,533   14,336   10,076   9,500
Q6     2,400    1,606    1,060    860
Q7     20,390   11,900   6,240    6,926
Q8     19,236   4,243    3,730    4,353
Q9     64,880   34,196   21,623   20,623
Q10    23,150   12,226   9,000    7,176
Q11    4,710    4,726    4,740    3,683
Q12    22,200   13,243   9,376    7,380
Q13    53,086   14,320   14,320   8,580
Q14    2,120    996      1,013    1,076
Q15    4,680    1,076    996      1,530
Q16    7,880    5,240    4,913    5,320
Q17    1,110    793      1,000    2,513
Q18    79,296   37,410   20,920   20,733
Q19    2,730    1,530    856      563
Q20    1,000    1,153    1,933    5,383
Q21    86,270   40,683   30,763   26,876
Q22    7,693    4,383    3,463    2,823
Total  566,333  276,085  188,497  164,677

My apologies: Linchi posted SQL 2005 64-bit results, so my duration results for SQL 2005 64-bit SP2 (no CU) are below.