Several TPC benchmark results for SQL Server published recently (and some not so recently) are of interest in the high-end performance sphere. Unisys published new TPC-E and TPC-H results for their 16-way ES7000 with the Intel Xeon X7460 six-core processor (96 cores total). HP published TPC-H results for their 8-way ProLiant DL785G6 with the AMD Opteron six-core (Istanbul) processor at scale factor 300GB back in July, and now at 1TB. (There are still no TPC-E results for the Opteron Istanbul processors).
The most recent Unisys benchmark results employ Windows Server 2008 R2 together with SQL Server 2008 R2, which support more than 64 logical processors; the 16-socket Unisys system with six-core Intel Xeon processors has 96 cores total. Windows Server 2008 R2 also features significant kernel-level changes, most prominently the removal of the dispatcher lock, which is described as enabling much better scaling at higher processor core counts.
The main feature in the latest Unisys TPC-E benchmark result is Windows Server 2008 R2, which now allows the operating system to utilize all 96 processor cores in the fully populated ES7000. Compare the range of TPC-E performance results for the Intel Xeon X7460 processor from 4 to 16 sockets. The IBM x3850 M2, a 4-way Xeon 7460 with 24 cores and 128GB memory, achieved 729.65 tpsE. The 8-way Unisys ES7000, Xeon 7460 with 48 cores and 256GB memory, achieved 1,165.65 tpsE.
Last year, Unisys published a result for their ES7000 7600R with 16 Xeon 7460 2.66GHz six-core processors on Windows Server 2008 Datacenter Edition and SQL Server 2008 EE, both 64-bit, which was limited to using 64 of the 96 processor cores in the 16-way ES7000. This provided only a 28% performance increase over the ES7000 with 8 processors and 48 cores. While the performance gain per core is in line with expectations, the performance relative to the 8-socket 48-core is less than desired.
|Processor|GHz|Memory|Disks|OS|SQL Server|tpsE|Pub Date|
|---|---|---|---|---|---|---|---|
|16S/96C/96T|2.66|1024GB|882|W2K8 R2|2008 R2|2,012.77|11-2009|
The new Unisys ES7000 result of Nov 2009 is on Windows Server 2008 R2 and SQL Server 2008 R2, both 64-bit Datacenter Edition, which now allows full use of all 96 cores. There is a 34.8% increase in the TPC-E score from 64 to 96 cores. If this were the scaling from increasing the processor cores only, then it would be excellent. However, some non-negligible gain is expected from the operating system enhancements alone, considering the high number of cores.
The scaling for each doubling of the number of processor sockets and cores from 4 sockets/24 cores to 8 sockets/48 cores is 1.60. The scaling from 8/48 to 16/96 is 1.73, including the software gains from Windows Server 2008 RTM to R2. The average scaling per doubling from 4 to 16 sockets (24 to 96 cores) is 1.66. This is approximately in line with Microsoft slide decks citing a 1.7X gain for each doubling of processor cores.
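The scaling ratios cited above can be reproduced directly from the three published tpsE scores; a minimal sketch in Python:

```python
import math

# Published tpsE scores for the 4S/24C, 8S/48C, and 16S/96C Xeon X7460 systems
tpse = {24: 729.65, 48: 1165.65, 96: 2012.77}

scale_4s_to_8s = tpse[48] / tpse[24]    # ~1.60 for the first doubling
scale_8s_to_16s = tpse[96] / tpse[48]   # ~1.73 (includes the RTM -> R2 software gains)

# Average scaling per doubling is the geometric mean of the two ratios
avg_per_doubling = math.sqrt(scale_4s_to_8s * scale_8s_to_16s)  # ~1.66
```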
Unisys TPC-H Results for the ES7000 16-way
Unisys recently published a TPC-H scale factor 3TB result for their 16-way ES7000 with Windows Server and SQL Server 2008 R2 (2009-11), enabling the use of all 96 cores. Unisys had previously published a SF 10TB result for 16-way on Windows Server 2008 RTM, restricted to using 64 of the 96 cores (2009-02). A TPC-H scale factor 1TB means the Lineitem table data without compression is approximately 1TB (or was 1TB with the 8 byte date time data type). The full size of the SF 1TB database uncompressed can be from 1.4 to 1.7TB. Technically, we are not supposed to compare TPC-H results at different scale factors. But there are very few published results, so sometimes this cannot be avoided in comparative analysis. Also for comparison, there is a SF 3TB result for a 32-socket 64-core IBM system with the POWER6 processor, the AIX operating system and Sybase database engine.
The Unisys SF 10TB report mentions that older model 15K 73GB disks were used. The newer current-generation disks have better track-to-track seek times. This is not expected to influence scan-oriented queries, but the newer disks may have moderately better pseudo-random IO, particularly for queries that have a key lookup or nested loops join.
The chart below shows, for the two Unisys results, the ratio of individual query run times for the 10TB versus 3TB power run (stream 0). One reason to be careful in comparing results from different scale factors is that the execution plans can be different, so there is no guarantee that query cost is directly proportional to database size. Setting aside, for the moment, the possibility of fundamentally different execution plans: the SF 10TB database is 3.3 times larger than the SF 3TB database, and the 10TB result uses only 64 cores, while the 3TB result uses all 96.
TPC-H SF 10TB 64-core to SF 3TB 96-core query run time ratio
Without the difference in cores, the 10TB query times should be about 3.3X longer than the 3TB baseline. There should be an additional 20-30% for the difference in number of cores. The expectation then is that queries on the 10TB with 64 cores should take approximately 4 times longer than on the 3TB with 96 cores, not accounting for differences between RTM and R2, and differences in the execution plans.
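This back-of-envelope estimate can be written out explicitly (the 1.2-1.3 factor for 64 versus 96 cores is the 20-30% assumption from the text):

```python
# Expected 10TB/64-core vs 3TB/96-core run-time ratio, per the reasoning above
size_ratio = 10.0 / 3.0                       # SF 10TB data is ~3.33x the SF 3TB data
core_factor_low, core_factor_high = 1.2, 1.3  # assumed 20-30% penalty for 64 vs 96 cores

low = size_ratio * core_factor_low    # ~4.0
high = size_ratio * core_factor_high  # ~4.3
print(f"expected run-time ratio: {low:.1f} to {high:.1f}")
```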
Query 9, below, runs slower than expected on the 10TB 64-core system, compared with the 3TB 96-core result.
/* TPC_H Query 9 - Product Type Profit Measure */
SELECT NATION, O_YEAR, SUM(AMOUNT) AS SUM_PROFIT
FROM (SELECT N_NAME AS NATION, datepart(yy, O_ORDERDATE) AS O_YEAR,
L_EXTENDEDPRICE*(1-L_DISCOUNT)-PS_SUPPLYCOST*L_QUANTITY AS AMOUNT
FROM PART, SUPPLIER, LINEITEM, PARTSUPP, ORDERS, NATION
WHERE S_SUPPKEY = L_SUPPKEY AND PS_SUPPKEY = L_SUPPKEY AND PS_PARTKEY = L_PARTKEY AND
P_PARTKEY = L_PARTKEY AND O_ORDERKEY = L_ORDERKEY AND S_NATIONKEY = N_NATIONKEY AND
P_NAME LIKE '%green%') AS PROFIT
GROUP BY NATION, O_YEAR
ORDER BY NATION, O_YEAR DESC
While the execution plan, below, is complicated, there is nothing that seems to indicate that a disproportionate gain should be expected from either an increase in the memory-to-data ratio or a simple increase in the number of cores. It is possible the higher memory-to-data ratio reduces the need to spool intermediate hash join results to tempdb, but this should not be a drastic impact.
Notice that this execution plan has a merge join between PART and PARTSUPP. Both Unisys TPC-H reports mention the use of -T8743, which turns off merge joins, but that was only used during the load phase. I have not seen the impact of -T2301 (a trace flag that enables more accurate query run-time behavior modeling in the SQL Server query optimizer, typically only needed for large data set decision support processing), but perhaps my systems are not sufficiently large to activate whatever this flag actually does.
Queries 12 and 22 run slower on the 3TB 96-core system relative to the 10TB 64-core result. As a matter of fact, the query times between SF 3TB and 10TB are curiously close to each other. The execution plan at 10GB for query 12 shows a key lookup, and for query 22 a nested loops join. If this occurs in the execution plan at 3 or 10TB, then the query time might be constrained by the disk random IO performance.
/* TPC_H Query 12 - Shipping Modes and Order Priority */
SELECT L_SHIPMODE,
SUM(CASE WHEN O_ORDERPRIORITY = '1-URGENT' OR O_ORDERPRIORITY = '2-HIGH' THEN 1 ELSE 0 END) AS HIGH_LINE_COUNT,
SUM(CASE WHEN O_ORDERPRIORITY <> '1-URGENT' AND O_ORDERPRIORITY <> '2-HIGH' THEN 1 ELSE 0 END) AS LOW_LINE_COUNT
FROM ORDERS, LINEITEM
WHERE O_ORDERKEY = L_ORDERKEY AND L_SHIPMODE IN ('MAIL','SHIP')
AND L_COMMITDATE < L_RECEIPTDATE AND L_SHIPDATE < L_COMMITDATE AND L_RECEIPTDATE >= '1994-01-01'
AND L_RECEIPTDATE < dateadd(mm, 1, cast('1995-09-01' as date))
GROUP BY L_SHIPMODE
ORDER BY L_SHIPMODE
The execution plans below are from a SF 100GB TPC-H database (forced non-parallel).
Both the Unisys 3TB and 10TB results are on systems with about 900 disks. The expectation is that the number of rows from the LINEITEM table that meet the SARG conditions is proportional to the scale factor of the database, so the 3TB result should be proportionally faster on the same disk system.
Let's see if the numbers are reasonable. The actual row count from LINEITEM meeting the SARG at SF 1GB is 54,077 (37,715 estimated). The projected actual row counts are then 540M for SF 10TB and 162M for SF 3TB. The run times were 257.8 sec at SF 10TB and 191.0 sec at SF 3TB, or about 2M loop-join (pair) rows per sec on the larger system. The 900 disks at 500 IOPS per disk (this is not pure random IO, but rather a skip-seek pattern) translate to 450K IO/sec. The ORDERS table is 197MB at SF 1, meaning 1,970GB at SF 10TB.
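These projections can be laid out as a quick sanity check (all inputs are the figures quoted above; the 500 IOPS per disk is the rough estimate used in the text):

```python
# Q12 sanity check: extrapolate the SF 1 (1GB) SARG row count by scale factor
rows_sf1 = 54_077                    # actual rows at SF 1 (37,715 estimated)

rows_sf10000 = rows_sf1 * 10_000     # ~540M rows projected at SF 10TB
rows_sf3000 = rows_sf1 * 3_000       # ~162M rows projected at SF 3TB

# Published power-run (stream 0) times
rate_sf10000 = rows_sf10000 / 257.8  # ~2.1M loop-join (pair) rows/sec
rate_sf3000 = rows_sf3000 / 191.0    # ~0.85M rows/sec

# ~900 disks at an assumed 500 IOPS each (skip-seek, not pure random IO)
io_per_sec = 900 * 500               # 450K IO/sec aggregate
```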
/* TPC_H Query 22 - Global Sales Opportunity */
SELECT CNTRYCODE, COUNT(*) AS NUMCUST, SUM(C_ACCTBAL) AS TOTACCTBAL
FROM (SELECT SUBSTRING(C_PHONE,1,2) AS CNTRYCODE, C_ACCTBAL
FROM CUSTOMER WHERE SUBSTRING(C_PHONE,1,2) IN ('13', '31', '23', '29', '30', '18', '17') AND
C_ACCTBAL > (SELECT AVG(C_ACCTBAL) FROM CUSTOMER WHERE C_ACCTBAL > 0.00 AND
SUBSTRING(C_PHONE,1,2) IN ('13', '31', '23', '29', '30', '18', '17')) AND
NOT EXISTS ( SELECT * FROM ORDERS WHERE O_CUSTKEY = C_CUSTKEY)) AS CUSTSALE
GROUP BY CNTRYCODE
ORDER BY CNTRYCODE
Query 22 SF 1GB TPC-H database (forced non-parallel)
The chart below compares the Unisys with 96 cores (Core 2 architecture) running SQL Server relative to the IBM with 64 cores (POWER6 architecture) running Sybase. The IBM POWER6 system has a 50% better overall TPC-H score in QphH than the Unisys with the Xeon 7460 processor. The advantage is less than 20% on the TPC-H power component, which is a single stream that tests the ability to use all available compute resources for each query.
TPC-H SF 3TB, 96-core Xeon to 64-core IBM Power 6 query run time ratio
The IBM result is on a completely different processor architecture as well as a completely different database engine, so large query-to-query differences are not unexpected, as evident from the chart. The Unisys system with Xeon processors and SQL Server is faster on some queries; the IBM system with POWER6 processors and Sybase is faster on others. The IBM system has only 282 disks compared with over 900 disks on the Unisys. Both systems should have sufficient sequential disk IO performance, but the Unisys will have much better random IO capability. The IBM system is almost twice as fast on the TPC-H throughput metric, which is not surprising because the POWER6 processor is designed for massive bandwidth, while Intel processors up to the Core 2 architecture are designed for much more price-constrained systems. (Even though the Xeon 7400 series is targeted at expandable systems, it has the same bus architecture as the Core 2 processors for desktop and mobile systems.)
Also important: these SF 3TB results with 1TB of system memory should not be compared with TPC-H SF 100-300 results, where the data essentially resides in memory, without considering that disk IO involves significant CPU consumption. The in-memory results have zero random IO, and even the SSD results have nearly no limits on random IO. I am still inclined to speculate that Nehalem EP/EX has much better TPC-H performance per socket than Dunnington, possibly because of much better memory bandwidth.
The results below are on the 8-way HP ProLiant DL785. The first result, with the quad-core Opteron 8384, is on Windows Server 2008 RTM and SQL Server 2008 RTM. The second result, with the six-core Opteron 8439, is on Windows Server 2008 SP1 and SQL Server 2008 SP1.
The third result, also with the six-core Opteron 8439, is on Windows Server 2008 R2 and SQL Server 2008 SP1. It is possible that the (locking) performance improvements in Windows Server 2008 R2 were desired, but because the system has fewer than 64 cores, SQL Server 2008 R2 was not required. This allows using CAL instead of per-processor licensing.
Between the first two systems, the number of cores increased by 50%, from 4 per socket to 6 per socket, and the core frequency increased from 2.7GHz to 2.8GHz. The AMD Opteron Istanbul six-core also introduced HT-Assist, which helps maintain cache coherency. HT-Assist is supposed to help transaction processing applications more than DW, but no TPC-C or TPC-E benchmark had been published for Istanbul as of 17 November 2009. It is possible HT-Assist is beneficial, but not all the bugs have been worked out yet. The difference between SQL Server 2008 RTM and SP1 is unknown.
|System|Cores|GHz|Memory|SF|QphH|Power|Throughput|
|---|---|---|---|---|---|---|---|
|8 Opteron 8384 quad-core|32|2.7|256GB|300|57,684.7|75,161.2|44,271.9|
|8 Opteron 8439 six-core|48|2.8|256GB|300|91,558.2|109,067.1|76,860.0|
|8 Opteron 8439 six-core|48|2.8|512GB|1000|81,514.8|95,789.1|69,367.6|
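As a consistency check on the three results above: the TPC-H composite metric QphH is defined as the geometric mean of the power and throughput metrics, so the power/throughput pairs should reproduce the composite scores. A minimal sketch:

```python
import math

# (Power, Throughput) metric pairs for the three DL785 results above
results = [(75_161.2, 44_271.9), (109_067.1, 76_860.0), (95_789.1, 69_367.6)]

# QphH is the geometric mean of the power and throughput metrics
qphh = [math.sqrt(power * throughput) for power, throughput in results]
# reproduces the published composites: ~57,684.7 / ~91,558.2 / ~81,514.8
```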
The chart below compares the query run times at SF 300GB for the quad-core relative to the six-core. The expectation is that the six-core should be 20-30% faster than the quad-core, as scaling the performance of a single query with a parallel execution plan from DOP 32 to 48 is not a simple matter. Most queries appear to be about 40% faster; the gain beyond expectation might be attributed to HT-Assist.
8-way Opteron, quad-core to six-core query run time ratio, TPC-H SF 300
Queries 2, 6 and 17 show very little gain from 32 to 48 cores. This might be because these are relatively small queries which are difficult to scale by increasing the degree of parallelism.
The chart below shows the individual query times for the DL785G6 8-way Opteron six core scale factor 1TB relative to 300GB. The difference in the size of the database is 3.3X. There should be some improvement for Windows Server 2008 R2 over RTM.
8-way Opteron six-core, TPC-H SF 1TB to 300GB query run time ratio
However, the SF 300GB database with compression fits mostly in memory, and the 1TB database does not. So any query plans that generate random IO might run disproportionately longer than the difference in size. Query plans that require a table scan from disk may encounter only a minor penalty, as the storage system has sufficient sequential IO bandwidth.
/* TPC_H Query 17 - Small-Quantity-Order Revenue */
SELECT SUM(L_EXTENDEDPRICE)/7.0 AS AVG_YEARLY
FROM LINEITEM, PART
WHERE P_PARTKEY = L_PARTKEY AND P_BRAND = 'Brand#23' AND P_CONTAINER = 'MED BOX'
AND L_QUANTITY < (SELECT 0.2*AVG(L_QUANTITY) FROM LINEITEM WHERE L_PARTKEY = P_PARTKEY)
Query 17 SF 100GB TPC-H database (forced non-parallel)
In Query 17, the estimate is that 200 parts meet the Brand and Container SARG at SF 1GB. There are 30 Lineitems for each part, for 6,000 rows total at SF 1. This extrapolates to 1.8M rows at SF 300 and 6M rows at SF 1000. The query run time at SF 1000 was 39.5 sec (stream 0), so the performance is approximately 152K rows/sec. The total database size is over 1.5TB (recalculate for compression?), and the system memory is 512GB, so expect about 1/3 of the data to be in memory. Then perhaps 100K rows/sec require disk IO. There were 240 disks for data in the HP DL785G6 system, so the IO would be about 400 IOPS per disk, not unreasonable because the 800GB Lineitem table occupies a small fraction of the 16TB total disk space and is expected to be running at high queue depth.
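The Q17 arithmetic can be made explicit (the in-memory fraction and the per-disk IO rate are rough estimates, as in the text):

```python
# Q17 extrapolation from SF 1: ~200 parts match the SARG, ~30 lineitems each
rows_sf1 = 200 * 30               # 6,000 rows at SF 1
rows_sf300 = rows_sf1 * 300       # 1.8M rows at SF 300
rows_sf1000 = rows_sf1 * 1000     # 6M rows at SF 1000

rate_sf1000 = rows_sf1000 / 39.5  # ~152K rows/sec from the stream 0 run time

# Assume ~2/3 of data misses the 512GB memory, so ~100K rows/sec need disk IO
iops_per_disk = 100_000 / 240     # ~417 IOPS over the 240 data disks
```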
At SF 300, query 17 ran in 2.6 sec, or 620K rows/sec, about 13K rows per core per sec; again, not an unreasonable rate if most of the data is in memory.