Benchmarks 2016

Earlier this year, HPE and Microsoft sponsored the article The Importance of Benchmarking in SQL Server Pro to revive interest in the TPC and SAP benchmarks, citing the need to verify that new hardware and software outperform previous versions. The reason cited for the decline in interest was that people believe benchmarks do not reflect the real world or, more importantly, their specific environment.

This topic is more complex than can be covered in a short article (of 400 words) and will be expanded upon here. Benchmarks were more important in the past for many reasons. Many system vendors had their own RISC processor architecture, which, though relatively new, everyone thought was the future, and for which processor frequency was not a basis for comparison. The course of the future turned out to be different. The decisive competition in the server processor arena turned out to be between Intel Xeon and AMD Opteron, both with the common x86 (non-RISC) instruction set architecture, with Intel ultimately gaining the upper hand from 2010 on. For the last several years, the Intel Xeon has been on a stable evolutionary path at both the processor and system architecture levels.

The matter of whether benchmarks are relevant to real-world applications should also be addressed.

Over ten full cycles of Moore's Law, compute capability has increased one thousand-fold. Long ago, it was necessary to keep the main line-of-business system to just the core transaction processing, which had the effect that benchmarks bore some resemblance to the real world. As processors became more powerful, more functionality was piled onto the main LOB system, and databases were used for many uncharted applications, such that benchmarks had less relevance.

Twenty Years Ago

Benchmarks were once more important than they have become in recent years. Twenty years ago, the world was a very different place. Almost every vendor had their own RISC processor (IBM had two, RS64 and POWER, not counting PowerPC), each built around some variation of the core RISC concepts (SPARC implemented the sliding register windows of the Berkeley version). There were large symmetric multiprocessor (SMP) systems built around unproven architectural concepts, before the issues of scaling on SMP were understood, much less worked out.

Relational database management systems (RDBMS) were new in terms of pervasive deployment. Software was in the process of changing from the old mainframe-generation terminal model to the new client-server concept. In short, almost the entire stack, in both hardware and software, was unproven, and yet people were full of hope for the brave new world of RISC, SMP, RDBMS, and client-server. Technically, only mainframes were proven technology, and it was believed that they would become a thing of the past. If that were not enough change to digest in a short period, the second half of the 1990s was also the period of transitioning from 32-bit to 64-bit. Of course, getting approval for IT projects in this period was easy. All that was needed was to say, in a solemn tone of voice, the magic acronym: Y2K, and money flowed effusively.

Benchmarks were important then because processor frequency had no meaning from one architecture to another. Vendors talked about linear scaling on SMP with the conviction and sincerity of a used car salesman, with the distinction that this Edsel could have a million-dollar price tag. There were some single-point benchmark results published for the maximum configuration, but without results for base and intermediate configurations to substantiate the SMP scaling claim, much less linearity. While these statements are critical of the vendors of high-end SMP systems of the late 1990s, without building these systems we would not have discovered what the problem areas were, and it would not have been possible to solve those problems in succeeding generations. This took multiple iterations.

RISC and Itanium

At no point in time were the RISC processors an order of magnitude (2X or more) better than contemporary non-RISC processors when adjusted for design generation, manufacturing process, and transistor count complexity. RISC did have the advantage of being a recent architecture with many registers, and it made more effective use of the available transistors. There was a widespread belief in the manifest destiny of RISC displacing CISC over time as transistor density doubled every two years with Moore's Law. The rationale was that RISC was the best platform to leverage pipelining and superscalar execution, something that would be extraordinarily difficult, if not impossible, on CISC.

Even Intel bought into this thinking. But instead of accepting a role as a latecomer to RISC, they wanted an even better idea. The original RISC concept emphasized a reduced instruction set in order to fit the processor onto a single chip, eliminating the long-latency signaling between chips; this was feasible at a transistor budget of about 50,000. Several cycles of Moore's Law had elapsed since then, and it had become more appropriate to amend the R in RISC from reduced to regular. The Pentium Pro processor had 5.5M transistors on the 0.60µm process with a die size of 306mm². At the time, a 400mm² die would have been very expensive but maybe possible, implying that 15M transistors might have been possible on 0.35µm. Cache has higher density than logic: the Pentium Pro 256K cache had 15.5M transistors at 202mm² on the same process, and the 512K cache was 242mm² on 0.35µm. Merced was designed with tools for 0.35µm, but manufactured with a double shrink to 0.18µm.
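
As a rough sanity check on that 15M figure (my arithmetic, an assumed reconstruction of the reasoning, not stated in the original): a full process node roughly doubles transistor density, so scaling the 5.5M-transistor, 306mm² logic die up to a 400mm² die on the next node gives

5.5M x (400 / 306) x 2 ≈ 14.4M

which is consistent with the ~15M estimate.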

The future was about being able to effectively utilize tens of millions of transistors in the near term and billions in the long term. The joint Intel-HP post-RISC paradigm was Explicitly Parallel Instruction Computing (EPIC). Basically, it would be more efficient for the compiler to encode parallelism than to expend transistors extracting it on the fly.

The Path to Today

But a funny thing happened on the way to the future. A few system vendors without their own RISC processor architecture built SMP systems using Intel processors, starting with the 80386, followed by the 486 and later the Pentium. The performance was respectable, and there was sufficient market interest that Intel put out their own 4-way system based on the Pentium processor (1995). Each processor connects to a cache controller and cache, and the cache controller connects to a memory bus that also interconnects the other processors. I seem to recall there being a 2-socket Pentium system in which both processors shared a bus to the cache controller and cache, but I cannot find any documentation on this, other than mentions that the 430NX supports 2-way.

It is valuable to have a long record of benchmark results going back deep into history. Curiously, the TPC started to cull, then outright remove, the TPC-C v3 results. I downloaded the results spreadsheet in early Jul 2016, but these results have since been removed. In selecting an important sequence of results, it gradually became clear that this was not the original list. TPC-C version 3 results are no longer even on the TPC website.

Compaq published a series of TPC-C results for their ProLiant 4500 with 4 Pentium processors, the first at 100MHz, then others at 133MHz. The 1,516.8 tpm-C result surprised some, as this was comparable to contemporary RISC processors in 4-way systems. But this was quickly improved upon, first with SQL Server on Windows NT, then with other RDBMS and OS combinations as well. There were also 3 results for the 4-way Pentium 133 with Oracle 7.3.3 on UnixWare, Solaris, and Windows, in the range of 3,066 to 3,516 tpm-C.

DB | OS | Processor | MHz | # Proc | tpmC | Date | Mem | Notes
Sybase 10.0.2 | UnixWare 2.0 | Pentium | 100 | 4 | 1,516.8 | 1995-05-18 | - | -
SQL Server 6.0 | NT 3.51 | Pentium | 133 | 4 | 2,455.0 | 1995-10-09 | - | -
Sybase 11.0 | NT 3.51 | Pentium | 133 | 4 | 3,112.4 | 1996-03-28 | - | -
SQL Server 6.5 | NT 4.0 | Pentium | 133 | 4 | 3,641.2 | 1996-04-05 | - | -

Below are some RISC results from 1995.

DB | OS | Processor | MHz | # Proc | tpmC | Date | Mem | Notes
Informix 7.1 | HP-UX 10.0 | PA-7200 | 100 | 4 | 2,616.2 | 1995-03-27 | - | -
Informix 7.1 | AIX 4.1.2 | PowerPC 601 | 75 | 4 | 1,562.9 | 1995-05-09 | - | -
Informix 5.0.1 | NEC UX | MIPS R4400 | 200 | 4 | 1,245.2 | 1995-07-28 | - | -
Oracle 7.3 | HP-UX 10.01 | PA-7200 | 120 | 4 | 3,809.5 | 1995-09-26 | - | -

An older TPC-C results document listed a 4-way Pentium 166MHz result of 3,849.2 tpm-C on 1996-01-31. The last list from the TPC before version 3 results were removed says it was a Pentium Pro. I believe the 1/31/96 result is for the Pentium 166, because the system model is a ProLiant 4500.

The Pentium Pro was glueless 4-way, meaning that an intermediary chip was not necessary to adapt the processor to the main bus, which was a split-transaction bus. The first 4-way Pentium Pro (166MHz) result was 5,676 tpm-C on a Compaq ProLiant 5000 with SQL Server 6.5 and Windows NT 4.0 in May 1996. Neither the executive summary nor the full disclosure report is now available on the TPC website. I am inclined to think that this was before the 3GB switch, so process virtual address space would have been limited to 2GB, and system memory may have been 2GB as well. HP pushed the 4-way PPro 166 to 5,949 in Aug 96. Compaq published a result of 6,750 for the PPro 200MHz in Sep 96. By Apr 97, this was just over 8,000 tpm-C on SQL Server 6.5 SP3 and NT 4.0.

DB | OS | Processor | MHz | # Proc | tpmC | Date | Mem | Notes
SQL Server 6.5 | NT 4.0 | Pentium Pro | 166 | 4 | 5,676.9 | 1996-05-28 | - | -
SQL Server 6.5 | NT 4.0 | Pentium Pro | 200 | 4 | 6,750.5 | 1996-09-06 | 2G? | 512K L2?
SQL Server EE 6.5 | NT EE 4.0 | Pentium Pro | 200 | 4 | 9,116.0 | 1997-06-02 | 3-4G? | -
SQL Server EE 6.5 | NT EE 4.0 | Pentium Pro | 200 | 4 | 10,537.5 | 1997-10-13 | - | 1M L2?

In Jun 97, 9,116 tpm-C was listed for the PPro 200 on SQL Server 6.5 Enterprise Edition and Windows NT Enterprise Edition 4.0, so presumably this was when system memory was increased to either 3 or 4GB, with SQL Server being able to use 3GB. The last Pentium Pro 4-way result in the more recent list was a Dell at 10,537 tpm-C in Oct 1997. But the older list has Pentium Pro results reaching 12,000. The higher results are for the 1M L2 cache model (late 97 or early 98).

Below are some RISC results from 1996-97, plus early 98.

DB | OS | Processor | MHz | # Proc | tpmC | Date | Mem | Notes
Oracle 7.3 | HP-UX 10.01 | PA-7200 | 120 | 4 | 4,939.1 | 1996-02-05 | - | -
Rdb 7.0 | VMS 7.1 | Alpha 21164 | 400 | 4 | 7,985.2 | 1996-08-27 | - | -
Sybase 11.0 | HP-UX 10.10 | PA-8000 | 180 | 4 | 12,321.9 | 1996-09-17 | - | -
Sybase 11.0 | HP-UX 10.30 | PA-8000 | 180 | 4 | 14,739.0 | 1997-02-27 | - | -
Sybase 11.0 | DEC UX 4.0 | Alpha 21164 | 466 | 4 | 10,350.2 | 1997-05-13 | - | -
Sybase 11.5 | Solaris 2.6 | Ultra-SPARC | 300 | 4 | 11,559.7 | 1997-08-18 | - | -
Sybase 11.5 | DEC UX 4.0 | Alpha 21164 | 533 | 4 | 12,350.2 | 1997-10-27 | - | -
Sybase 11.5 | DEC UX 4.2 | Alpha 21164 | 600 | 4 | 15,100.7 | 1998-01-13 | - | -

It is evident that substantial progress was being made in software, at both the OS and RDBMS levels, in this period, in addition to the performance gains at the hardware level. This is true for both Intel and RISC processors. The Intel Pentium Pro was comparable to the better RISC processors in the server environment. It is also evident that frequency has no value in comparing one processor architecture to another.

From 1998 to 2002, the Pentium II and III Xeon, Alpha, and PA-RISC traded the 4-way lead. The first Itanium (Merced) was a disaster and quickly forgotten. But the second, Itanium 2, took over the 4-way lead with 78,455 tpm-C in Jul 2002 (available in Dec 02). About this same time, the Xeon MP (Foster or Gallatin, Pentium 4 architecture) overtook Alpha and PA-RISC.

DB | OS | Processor | MHz | # Proc | tpmC | Date | Mem | Notes
SQL Server EE 7 | NT EE 4.0 | Pentium II Xeon | 400 | 4 | 18,893.4 | 1998-09-14 | - | 1M L2
SQL Server EE 7 | NT EE 4.0 | Pentium II Xeon | 400 | 4 | 20,433.9 | 1998-11-17 | 8G? | 1M L2
SQL Server EE 7 | NT EE 4.0 | Pentium III Xeon | 500 | 4 | 25,065.3 | 1999-06-08 | - | 2M L2
SQL Server 2000 EE | Win 2000 AS | Pentium III Xeon | 700 | 4 | 30,231.4 | 2000-10-23 | 8G | 2M on-die
SQL Server 2000 EE | Win 2000 AS | Pentium III Xeon | 900 | 4 | 39,158.1 | 2001-09-27 | 8G | 2M on-die

Intel Xeon MP (Pentium 4 architecture)

DB | OS | Processor | MHz | Sockets/Cores/Threads | tpmC | Date | Mem | Notes
SQL S2K EE sp2 | Win 2000 AS sp2 | Xeon MP | 1600 | 4/4/8? | 48,906.2 | 2002-03-12 | 8G | 1M L3
SQL S2K EE | Win .NET Ent | Xeon MP | 1600 | 4/4/8 | 61,564.5 | 2002-08-23 | 32G | 512K
SQL S2K EE | Win .NET Ent | Xeon MP | 2000 | 4/4/8 | 68,739.2 | 2002-11-04 | 32G | 2M
SQL S2K EE sp3 | Win ES 2003 | Xeon MP | 2000 | 4/4/4 | 74,206.3 | 2002-11-04 | 32G | 2M

IBM also published a result for a 4-socket system with the Xeon DP model at 2.4GHz.

Enter Opteron

In July 2003, the AMD Opteron, with a 64-bit version of x86, entered the picture, beginning a period of intense competition with Xeon, overtaking it in Jul 2004 and widening the lead in 2005. In Jul 2004, the IBM POWER5 reported 194,391 tpm-C for a 4-processor configuration, better than the 136,111 reported for a 4-way Itanium the previous year. This is more astonishing because the POWER5 was a dual-core chip, and IBM counted the 2-socket, four-cores-total configuration as a 4-way system. Over the next three years, it would remain competitive with 4-socket dual-core systems having 8 cores total.

In late 2006, the Intel Xeon 7140 (Tulsa, 2-core, 16M L3) finally recovered the 4-way lead from Opteron on SQL Server (there was an early 2006 lead with DB2). This was the last of the P4/NetBurst/Prescott derivatives with hyper-threading, which benefits transaction processing workloads. In other aspects common in the real world, the Opteron was better, leading to bitter feelings about the unfairness of benchmarks.

Xeon Triumphant

In Sep 2007, the quad-core Xeon X7350 (Tigerton) overtook Itanium and POWER5 with 407,079 tpm-C. IBM was able to reclaim the lead with POWER6 in Mar 2008 at 629,159 tpm-C. Tigerton has two Core 2 dies in one package; the Core 2 die itself was dual-core, making Tigerton quad-core. The Core 2 processor core was broadly more powerful than its contemporary Opteron processor core, with the Opteron having an advantage in memory latency at the single-core level and retaining an advantage in memory bandwidth at the 4-socket level. The 4-way TPC-C lead changed hands between Xeon and Opteron in this period, 2006-2009. (Core 2 should have had a strong advantage in 2-way?)

But in Aug 2010, Intel reported 1,807,347 tpm-C for the Xeon X7560, an eight-core processor based on Nehalem. Now Intel had a much more powerful core than Opteron, and an integrated memory controller with memory latency and bandwidth characteristics comparable to Opteron's. From here, Opteron faded out of server systems.

The Schism

From 2007, everyone was supposed to switch over to the newly developed TPC-E benchmark, because TPC-C was getting old at 13 years of age. IBM, Microsoft, Oracle, and all the hardware vendors participated in the development of TPC-E and signed off on it. But only Microsoft, with SQL Server, published TPC-E results.

IBM continued to publish TPC-C until 2010, achieving 1.2M tpm-C on 2 quad-core POWER7 processors (4 threads per core) and 10M tpm-C on 24 8-core POWER7 processors (768 logical processors total). Oracle continued to publish TPC-C for SPARC until 2013, achieving 8.5M tpm-C on the SPARC T5 (16 cores per processor, 8 threads per core, 1,024 logical processors total). There is also an 8-way Xeon E7 10-core result of 5M tpm-C on Oracle 11g R2.

Microsoft permitted hardware vendors to continue publishing TPC-C results, but only for SQL Server 2005, not 2008, with a high score of 1.8M tpm-C for an 8-way Xeon X7560 8-core system.

There is a peculiarity in the TPC-C new-order transaction's UPDATE statement to the stock table, in that it touches ten static char(24) columns, returning one of them in the OUTPUT clause. This greatly increases the size of the log write relative to what it would be without this artificial construct. Database mirroring was a highly valued feature in SQL Server 2005. But for this to work over a wide-area network connection, it was necessary to compress the log writes, as was done in SQL Server 2008, with no option to disable compression.
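
A minimal sketch of the pattern described above (my illustration, not the actual benchmark kit code; the s_dist_01 through s_dist_10 char(24) columns are from the TPC-C stock table, while the variable names are hypothetical procedure parameters):

-- New-order stock update: the ten static char(24) district columns are
-- written back unchanged, inflating the log record, with one of them
-- returned through the OUTPUT clause.
UPDATE stock
SET    s_quantity  = @s_quantity,
       s_ytd       = s_ytd + @ol_quantity,
       s_order_cnt = s_order_cnt + 1,
       s_dist_01 = s_dist_01, s_dist_02 = s_dist_02,
       s_dist_03 = s_dist_03, s_dist_04 = s_dist_04,
       s_dist_05 = s_dist_05, s_dist_06 = s_dist_06,
       s_dist_07 = s_dist_07, s_dist_08 = s_dist_08,
       s_dist_09 = s_dist_09, s_dist_10 = s_dist_10
OUTPUT inserted.s_dist_01   -- one of the ten, returned to the caller
WHERE  s_i_id = @ol_i_id AND s_w_id = @ol_supply_w_id;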

Prior to 2016, SQL Server had only one thread to handle log writes, which greatly simplifies the process of serialization. The single log-write thread did not pose any limitation until the advent of compression in 2008. It would appear that no real-world SQL Server environment had a log-write inflator, so the single-threaded log-write compression was not a serious limitation to any Microsoft customer of major importance, for a while.

Over the years, Intel kept pushing up the number of cores in a processor socket. In 2010, there could be 32 cores total in a 4-socket system with the Xeon 7500 8-core processor. In 2011, this was 40 cores total with the Xeon E7 10-core processor. In 2014, 60 cores total with the E7 v2 15-core processor. In 2015, 72 cores total with the E7 v3 18-core processor. And now 96 cores total in a 4-way system with the E7 v4 24-core processor. At some point in this sequence, someone brought up the issue that the single thread doing compression on the log writes was a serious bottleneck, this time a customer of importance to Microsoft rather than one of those inconsequential customers. The importance of benchmarks is in paying close attention to the details, and investigating the curiosities.

One might ask why TPC-E benchmarks were not published for Oracle, either the RDBMS or their SPARC processor. Both the Oracle RDBMS and the Oracle SPARC processors, with 8 threads per core, should run TPC-E very well. There is strong evidence that the Oracle RDBMS on a Unix or Linux OS scales better at high SMP than SQL Server on Windows.

It is possible that the reason has to do with Oracle Real Application Clusters (RAC). Oracle RAC scales reasonably well on the TPC-C benchmark with all nodes active. The TPC-C database is organized around warehouses and districts, with all the tables having a clustered index leading with the warehouse and district keys. By directing traffic from the application server to the database node based on the warehouse key value, a high degree of locality can be achieved, as sketched below. This might be the key element in node scaling. If so, then the absence of a TPC-E benchmark result for RAC might be suspicious. Oracle did admit that their first attempt at all-active clustering, Oracle Parallel Server, had serious issues. And it is necessary to try, even if the result is failure, in order to get it right eventually. If this is the reason, then it is important for people to know it, to know whether their environment is a good candidate for RAC or not. This is why benchmarks are important.
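
A sketch of the locality argument (illustrative DDL under my assumptions; the table name and leading key columns follow the TPC-C schema, but the column list is abbreviated and this is not a vendor kit):

-- Every TPC-C table clusters on the warehouse (and district) key, so all
-- rows for one warehouse are physically adjacent.
CREATE TABLE order_line (
    ol_w_id     int      NOT NULL,  -- warehouse
    ol_d_id     tinyint  NOT NULL,  -- district
    ol_o_id     int      NOT NULL,  -- order
    ol_number   tinyint  NOT NULL,  -- order line
    ol_i_id     int      NOT NULL,  -- item
    ol_quantity smallint NOT NULL,
    ol_amount   money    NOT NULL,
    CONSTRAINT pk_order_line PRIMARY KEY CLUSTERED
        (ol_w_id, ol_d_id, ol_o_id, ol_number)
);
-- If the application tier routes each request by warehouse, for example
-- node = ol_w_id % node_count, each cluster node touches a nearly
-- disjoint slice of the database, minimizing cross-node block transfers.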

Benchmarks Today - Processors

Of the key reasons that benchmarks were important 20 years ago, let's review the situation now. For all but highly unusual circumstances, the processor of the last several years is the Intel Xeon, currently in E5 and E7 v4 variants. All but two of the RISC architectures have faded from the server scene, as has the post-RISC Itanium (MIPS continues in embedded applications). Both of the remaining RISC processors, the Oracle SPARC M7 (32 cores, 256 threads) and the IBM POWER8 (12 cores, 6 per die, 8 threads per core), have been pushed into a niche market segment.

Fujitsu and Lenovo continue to publish TPC-E results. Cisco, HPE, and Lenovo publish TPC-H. The lack of results from other vendors might indicate that personnel with the specialized skills and experience in benchmarks have been laid off. Oracle last published SPARC T5 TPC-C and TPC-H results in 2013 (Fujitsu SPARC in 2014), and IBM last published POWER7 results in 2011. Results have not been published for their latest generations, the SPARC M7 and POWER8.

SMP

In the past, scaling relative to the number of processor cores and scaling relative to sockets were the same thing. Today we can have 22-24 cores on a single socket. 4-way systems are still common, with a few vendors offering 8-way and larger systems. In any case, many of the issues in scaling have been worked out. Scaling versus cores on a single socket is relatively problem free. There are still some issues in scaling with NUMA, and all systems today with more than one socket are NUMA.

Given the power of modern processor cores, and that it is possible to have a 1-socket system with 22 cores and reasonable memory, backed by SSDs with more IOPS than we need, my recommendation is to give serious consideration to a single socket. But people feel compelled to abide by the past practice of opting for the 4-way, although many are willing to step down to 2-way. In any case, there does appear to be an interest in benchmarks for systems larger than 8 sockets, even though specialized skills are necessary to properly utilize even 4-way systems.

RDBMS

Today, most organizations have deployed many applications on one RDBMS or another, and possibly more than one. As such, they have personnel with years of experience in that RDBMS. It is also highly probable that important applications utilize the full suite of functionality in modern RDBMS environments, which goes far beyond standardized SQL.

It would require extreme circumstances for someone to consider moving an application from one RDBMS to another, such as not liking the licensing terms. Even then, the decision would not be made based on benchmarks covering a very limited scope of functionality, but much more on whether it is possible to migrate the complete ecosystem.

Benchmarks to Verify New Hardware and Software Versions

The SQL Server Pro article cited the importance of benchmarks in verifying that new versions are better than previous versions. Yes, this is important. But in general, results are not published for old and new hardware on the same software version, or for old and new software versions on the same hardware. Mostly it is new hardware running new software compared against old software on old hardware, diminishing the value of this verification.

There were a number of significant events over the last twenty years in which it was important to verify the situation. Transitions involving combined drastic changes in both processor and system architecture were: Intel Pentium to Pentium Pro in 1996, Pentium III Xeon to Xeon MP (Pentium 4 architecture) in 2002, and the Nehalem-based Xeon 7500 in 2010. The Pentium 4 to Core 2 transition in 2006-07 was significant in processor architecture but kept a common system architecture.

The advent of Itanium was profound, involving a brand new processor with a completely new instruction set architecture that also required a change from 32-bit to 64-bit all at once. The second, and first successful, Itanium processor had a peculiar system architecture that was traditional bus-based SMP at the 4-way node level, and new point-to-point bi-directional signaling in the crossbar connecting multiple nodes.

Opteron in 2003 was a processor from a different vendor initially running common x86 software, with innovations in the integrated memory controller and the point-to-point protocol at the processor level. Opteron not only implemented a 64-bit extension to x86, but was extraordinarily innovative in expanding the set of general-purpose registers from 8 to 16. This could not be used until 2005 (thank you, Microsoft?).

Since 2010, the hardware side has seen significant evolutionary progress with stable processor and system architecture. While it was important for Intel and Microsoft to validate their work, clients have had confidence in it.

What is Needed Today

With such immense compute power, more memory than we can use, now more IO than we need, and RDBMS software with such a rich set of functionality, a naïve person might hope that performance problems are a thing of the past, and hence benchmarks are not needed. But everyone knows this is not true, because the world today is very complicated.

Twenty years is 10 full cycles of Moore's Law, so as a rough estimate, we have one thousand times more compute power in 2016 than in 1996 (2^10 = 1024). Coincidentally, memory capacity is about 1,000 times greater as well, from several GB to several TB. Back then, it was critically important to restrict functionality in the main line-of-business system to just the essential transactions. In one example, users were locked out on the first day of each month so accounting could run reports.

For many years now, application architects have become much freer in allowing complex functions to reside on the transaction processing system. It might seem logical, given the 1000X increase in compute capability, that performance issues are a thing of the past, and hence benchmarks are not important anymore.