Benchmarks 2016

Earlier this year, HPE and Microsoft sponsored the article The Importance of Benchmarking in SQL Server Pro to revive interest in the TPC and SAP benchmarks, citing the need to verify performance improvements in new hardware and software over previous versions. It suggested that the reason for the decline in interest is that people believe benchmarks do not reflect the real world or, more importantly, their specific environment.

There is more to this topic than can be addressed in a short article of 400 words. The IT world has changed. Everyone now expects hardware and software to have extraordinary benchmark results compared to 5, 10 and 20 years ago. And there continue to be major IT project failures, sometimes on performance, albeit less frequently than before. There is deep expertise that goes into conducting benchmarks, yet very little of it is made available to the public in a digestible fashion. The challenge is to make expert-level performance analysis more broadly available and to expand its coverage beyond the narrow range of elements exercised in benchmarks.

Background

In the past, benchmarks were regarded as more important than is the case today. There were many reasons for this. One is that many system vendors had their own RISC processors along with a customized Unix operating system. Frequency on one processor architecture has zero meaning relative to frequency on a different processor architecture.

In the early days, symmetric multi-processor (SMP) systems were new. Very large systems had non-uniform memory access (NUMA). Vendors spoke as though scaling were linear, without being clear on exactly what was or should be meant by that term, and avoided the matter of providing good data to substantiate such claims. For both the RISC processors and the SMP system architectures, there were enough TPC-C benchmark results to warrant caution in any assumptions about the characteristics of large systems.

It should come as no surprise to IT professionals that software does not just magically run well on multi-processor systems when the developers are working on single processor workstations and testing against small data sets. In fact, it takes many cycles of close cooperation between the processor, system architecture, operating system and database engine teams to make the combined environment work to reasonably good effect (let’s not kid ourselves that the current environment is anywhere near being even remotely close to optimal).

Over the course of years, the situation has changed dramatically. Many of the difficulties in scaling performance on large systems have been resolved, including issues specific to NUMA, which is good because all multi-socket systems today are NUMA. In the 1990s, it was manifest destiny that the future belonged to RISC processors. There was very logical and sound technical reasoning to support this belief. Intel believed it as well. They came up with an even more advanced concept than RISC, called EPIC, which took form in the Itanium brand.

But in fact it was the RISC processors that could not sustain the rapid pace of architecture and manufacturing development needed to keep up with the x86 processors (a generic term; Intel once used IA-32, then later Intel 64 to reflect 64-bit capability while distinguishing it from IA-64), of which the Intel server version is Xeon. Even more surprising was that the second strongest competitor in the server arena was AMD Opteron, ahead of the RISC players and Itanium.

Both the Intel Xeon processors and the system architecture have been on a stable evolutionary path since the precedent set by the Nehalem(-EX) processors and QPI in 2009-10. Because of this, people have confidence in the hardware whether or not there are benchmarks on the latest model.

To a reasonable degree, the Microsoft Windows Server operating system and the SQL Server database engine take sufficient account of the recent generations of Intel multi-core processors and system architectures, given that the overriding objective is to avoid having more complex tuning options than the typical (accidental) DBA can handle.

(One thing missing in the Windows OS is a mechanism to prevent processes that are not NUMA aware from allocating a large block of memory from one node, unbalancing the entire system. HP-UX had Locality Optimized Resource Alignment (LORA). I do not know whether HPE brought this into Linux.)

Between the stability and maturity of the processor, system, OS and database engine, and the passage of many cycles of Moore's Law, the world today is very different from the past. In twenty years, performance has doubled ten times; we have about one thousand times more compute capability. System memory capacity has also increased by a factor of one thousand in the same period, from several GB to several TB.

A naive person might think that given such circumstances, performance issues might be a thing of the past. But most of us should know that this is not true. Major software/IT projects still fail in spectacular fashion. Occasionally such failures are reported in the mainstream news channels.

The reason is that the compute capability we have today is not in the form of a single infinitely fast processor, but through very many processor cores with complicated interconnection. An analogy for a modern server system is several factories, each in different corners of a metropolitan area. A factory (the processor socket) is comprised of tens of assembly lines. An assembly line (the core) has parallelism of its own and can output several product units each cycle. It takes 14-19 cycles for a unit to traverse the assembly line from beginning to end.

All of this requires a carefully choreographed supply system to feed the factories using multi-lane highways, where an imbalance in the usage of routes could cause traffic jams that choke the factories. It is possible today to build an information system to meet almost any conceivable enterprise requirement. But careful planning and testing is required, as always, and this does not happen by accident.

Benchmarks

It is evident there has not been much interest in benchmarks for years now. Some note that benchmark system configurations are excessive. Others that the benchmarks do not resemble real applications, and more importantly, their specific application. Between the fact that hardware is powerful and that the full stack of processor, server system, operating system and RDBMS has matured together, the benchmark result does not have much value by itself.

There might be an interest in the specific components used in the benchmark system, as the result would be a good indicator that those components have been tested to withstand a high-stress load. But this potential area of value is lost because benchmark configurations almost always use direct-attach storage for low cost, while most people are buying much more expensive SAN storage with completely different components. This also means that the benchmark price-performance metric is not relevant.

What might have value is knowing what problems were encountered in achieving the result, under what specific circumstances an application might encounter the same problem, and the steps taken to address each issue. The problem is that vendors persist in stressing the metric that does not matter.

The TPC benchmark rules require full disclosure, but the important knowledge is buried in hundreds of pages of chaff. There is little or no explanation provided to the reader of why certain things were done, although all the benchmark teams know the tuning techniques. Some items of note are discussed below.

In database benchmarks, the collation is usually Latin1_General_BIN. This makes sense: a binary string comparison should be more efficient than the case-insensitive, accent-sensitive (and implied kanatype- and width-insensitive) comparison of the common default collation. In the real world, we probably do not want a binary sort order on all strings. Under what circumstances is it worth the effort to change specific columns in a pre-existing database with tables already populated? This might be something people want to know.
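
As a rough sketch of what such a column-level change looks like (the table and column names here are hypothetical, and any index, constraint or statistic referencing the column must be dropped first and recreated afterward):

  -- Hypothetical example: switch one heavily searched column to a binary collation.
  -- Nullability must be restated, otherwise the column reverts to nullable.
  ALTER TABLE dbo.Customers
      ALTER COLUMN SearchName nvarchar(100)
      COLLATE Latin1_General_BIN2 NOT NULL;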

The typical benchmark report shows the database created with multiple files in each filegroup. This is because the system has direct-attach storage, with multiple storage controllers and multiple volumes on each controller. Hence it is necessary to distribute the files over the volumes. Doing so also spreads the IO activity over multiple controllers.
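
A minimal sketch of that file layout, with hypothetical drive letters standing in for volumes on different controllers:

  -- Four data files in one filegroup, each on a volume attached to a
  -- different RAID controller, plus the log on its own volume.
  CREATE DATABASE bench
  ON PRIMARY
      (NAME = bench_1, FILENAME = 'E:\data\bench_1.mdf', SIZE = 100GB),
      (NAME = bench_2, FILENAME = 'F:\data\bench_2.ndf', SIZE = 100GB),
      (NAME = bench_3, FILENAME = 'G:\data\bench_3.ndf', SIZE = 100GB),
      (NAME = bench_4, FILENAME = 'H:\data\bench_4.ndf', SIZE = 100GB)
  LOG ON
      (NAME = bench_log, FILENAME = 'L:\log\bench_log.ldf', SIZE = 50GB);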

Some of the RAID controllers were designed in the days of hard disks to support a certain level of IOPS, perhaps 200K, as it is unlikely the maximum number of HDDs that can be attached to one controller will generate more than that number of IOPS. Then SSDs came about, and it was then possible that performance could be limited at the RAID controllers.

One controller vendor offered a special high-performance firmware option for use with SSDs at additional cost. I understand that this added-cost firmware disables caching on the controller, because there is overhead in performing the cache function. And besides, the many SSDs attached to the controller can drive more IOPS than the controller cache can handle in the first place.

In TPC-H, the desired total IO bandwidth is far higher than any single controller can support, higher than the bandwidth of a PCI-E x8 slot as well. Spreading volumes over multiple controllers is necessary to meet the bandwidth requirement.

Some (many? most?) SAN vendors like to tell people to create one giant volume, because it is so much easier to manage, or because it might be required to use the volume snapshot feature. Of course there is no performance difference, because their SAN is magically all-powerful. Oh yes, and multi-path IO automatic load balancing works wonderfully. Unfortunately, no one publishes benchmark performance data to shed light on this.

There are a number of trace flags set in TPC-E reports; examples are listed below, with a sketch of how they are typically enabled following the list:
  T652 - Disable page pre-fetching for scans
  T661 - Disable the ghost record removal process
  T834 - Use large-page allocations for the buffer pool
  T3502 - Write checkpoint information to the error log
  T3505 - Disable automatic checkpoints
  T8744 - Disable pre-fetch for ranges
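
These are normally set as startup parameters in the full disclosure reports. As a sketch (flag numbers are from the list above; 834 in particular only takes effect at startup):

  -- As startup parameters: sqlservr.exe -T834 -T652 -T661 -T3502 -T3505 -T8744
  -- The flags other than 834 can also be enabled globally at runtime:
  DBCC TRACEON (652, 661, 3502, 3505, 8744, -1);
  DBCC TRACESTATUS (-1);   -- verify which flags are active globally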

The presence of T834 in recent reports is curious. In more recent versions of SQL Server (2008 R2 on?), with the lock pages in memory privilege granted to the SQL Server service account, is the use of large pages automatic under the appropriate conditions, making T834 obsolete? Or is it there just because someone forgot to remove it?
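
Rather than guessing, one way to check what a given instance is actually doing is the process memory DMV (a sketch; non-zero values in these columns indicate large pages and locked pages are in use):

  SELECT large_page_allocations_kb, locked_page_allocations_kb
  FROM sys.dm_os_process_memory;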

The SQL Server service is started at the command prompt with -x, which disables the keeping of CPU time and cache-hit ratio statistics, among other monitoring. This would not be done on a production server, but what is the impact?

Apparently there is a separate setting made in the registry under Image File Execution Options for the sqlservr.exe and associated DLLs.

There are affinity settings on the TCP/IP connection ports, along with settings defining nodes by CPU mask. There is documentation on how to make these settings, but not a proper explanation.
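
For reference, the documented pieces look roughly like the following (the node range and port mask are made up for illustration):

  -- Restrict SQL Server schedulers to specific NUMA nodes:
  ALTER SERVER CONFIGURATION SET PROCESS AFFINITY NUMANODE = 0 TO 1;
  -- Port-to-node affinity is set in the TCP port value itself, e.g. 1433[0x3],
  -- where the bracketed bitmap maps the port to nodes 0 and 1; this is entered
  -- in the TCP/IP properties in Configuration Manager or in the registry.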

Changing the Windows power plan from the default Balanced to High Performance is important. The default Balanced setting is ridiculous for server workloads because of the inefficiencies incurred from constantly switching between low and high frequency. Microsoft really needs to recalibrate the circumstances under which frequency is changed so that the default is not a penalty.

A number of services are stopped. It used to be that leaving the Windows Print Spooler running would incur a 1% penalty, though this is less now with multi-core processors. It could help to know the overhead of each service. It would probably help more if it were easier to assess which services on the actual production server have a significant negative impact, or whether the cumulative collection of running services becomes an excessive burden.

The TPC-C benchmark had a great run from 1992 on. In all that time, no one was able to break the benchmark, as in generating an absurdly high result. By 2005 or so, people knew it was getting dated. All parties involved got together to put out a new and better transaction processing benchmark, TPC-E, in 2007. After contributing to its development, Oracle and IBM (DB2) decided not to publish TPC-E results. My understanding is that Oracle can run TPC-E just fine, better than SQL Server on large systems. However, TPC-E does not have inherent partitioning boundaries with a high degree of locality, and as such is not a good vehicle for showing off RAC all-nodes-active clustering. TPC-E results have only been published for SQL Server.

Since TPC-E is now not a benchmark for competition among database vendors, perhaps Microsoft could be persuaded to adopt a stock-car version, in which no tuning techniques that are not generally applicable are applied?

Real Applications

The TPC-C benchmark consists of 9 tables and 5-6 stored procedures. The TPC-E benchmark consists of 37 or so tables and 23 stored procedures, but no functions. Real-world databases are typically far more complicated in both the number of tables and stored procedures. So it could be argued that TPC-C is too simple to be meaningful and that TPC-E is only marginally more complicated. In fact, it does not matter to the RDBMS engine whether there are 10 or 1,000 tables, 10 or 10,000 stored procedures. (It might matter if there are millions of objects in the plan cache.) What does matter are the characteristics of usage, such as the different execution plan operators involved, and more generally the different code paths that are exercised.

The modern RDBMS has a very rich set of features and even more code paths. A feature, such as a user-defined function, might go through certain code paths. There are also one or more code paths for each of the many different operators in an execution plan.

(There is a configuration setting, lightweight pooling, that is minimally helpful in TPC-C but more beneficial for extremely high volumes of network round-trips, as with SAP. This uses a completely different set of code paths, and not all features of SQL Server work with it. It is probably only used in certain benchmarks. Some Microsoft people wish this setting would go away, and cringe whenever someone uses it in production.)
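
For completeness, this is the setting in question, shown only as a sketch; as noted, it is rarely advisable outside a benchmark:

  -- Lightweight pooling (fiber mode); advanced option, requires a restart.
  EXEC sp_configure 'show advanced options', 1;
  RECONFIGURE;
  EXEC sp_configure 'lightweight pooling', 1;
  RECONFIGURE;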

Possibly no single application uses all features, but a complex database probably uses more than are exercised in benchmarks. What we need to know is whether there are any circumstances that might cause a specific code path to break or result in severe degradation of the database engine.

We might think that a major RDBMS vendor has the resources to thoroughly test all features. If that were true, there would not be bugs: found and fixed, found and not fixed, and unknown. Each feature is certainly unit tested, but that is very different from load tested.

It might seem that a good solution is to contrive a benchmark that exercises as many different code paths as possible. This would not work if certain operations are broken only under heavy load. If the benchmark comprises a mix of very many different code paths, testing the system under heavy load does not mean that there is heavy volume for each specific path. What this means is that the single-number type of benchmark result has little value. A meaningful test needs to comprise many numbers. But major decisions are often made by people with a poor understanding of numbers with complicated meanings, who therefore insist on a single number.

Some examples of salient characteristics of an application are: network round-trips, compiles and recompiles, insert-update-delete write volume, inserts into tables clustered on identity columns, temp tables or table variables, functions, execution plans with a sort or spool operation, blocking and deadlocks, lock level, lock escalations, activation of asynchronous IO, and so on.
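
Many of these characteristics can be measured on a live system. A sketch of one such check against the performance counter DMV (counter names are padded character values in this view, hence the LIKE patterns; the values are cumulative, so sample twice and take the difference for a rate):

  SELECT RTRIM(counter_name) AS counter_name, cntr_value
  FROM sys.dm_os_performance_counters
  WHERE object_name LIKE '%SQL Statistics%'
    AND (counter_name LIKE 'Batch Requests%'
      OR counter_name LIKE 'SQL Compilations%'
      OR counter_name LIKE 'SQL Re-Compilations%');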

We can rest assured that any code path touched by the TPC-C or TPC-E benchmarks works just fine, if not by default, then with one or more of the tuning steps in the full disclosure report. Too bad we do not know which ones address which specific issue.

My assertion is that many problems can be detected early with simple full-load unit tests of each of the component operations important to a given real-world application. The more common belief is that only a full load test simulating the real environment, touching all the functionality in mixed fashion, including the random timing of users, is meaningful. A stupid, conceited person wants to believe that a rare, unforeseeable combination of circumstances is the cause of a project's failure, when he has not even tested the quality of the concrete foundations.
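
A full-load unit test of a single operation does not require elaborate tooling. A minimal sketch, assuming a hypothetical procedure dbo.InsertOrder, to be run concurrently from many sessions (ostress or a similar driver can supply the sessions):

  -- Hammer one code path in isolation; vary the parameter so the test does
  -- not hit only a single page or a single plan-cache entry.
  DECLARE @i int = 0;
  WHILE @i < 100000
  BEGIN
      EXEC dbo.InsertOrder @CustomerId = 1 + (@i % 50000), @Quantity = 1;
      SET @i += 1;
  END;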

That said, there are some interesting combination events, though not unforeseeable, that can cause serious problems. One example is the stored procedure compile parameter values in relation to the relevant data distribution statistics and the common or possible actual execution values. Another is updates that cause a key statistic to be updated, possibly with the default sample, when the lead key column is not unique. One interesting event is when SQL Server does a mass dump from the procedure cache. Does it take a lock on new inserts to the procedure cache, blocking compiles while doing the dump?
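
The compile-parameter issue at least has documented mitigations. A sketch with a hypothetical procedure and table, showing one way to take the sniffed value out of the equation:

  -- Compile the plan for an "average" value rather than whatever parameter
  -- value happens to arrive first after a recompile.
  CREATE PROCEDURE dbo.GetOrdersByCustomer
      @CustomerId int
  AS
  SELECT OrderId, OrderDate, Amount
  FROM dbo.Orders
  WHERE CustomerId = @CustomerId
  OPTION (OPTIMIZE FOR (@CustomerId UNKNOWN));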

One more combination will be mentioned here. Software behaves differently on NUMA systems. Before Opteron (2003) and Nehalem (2009-10), NUMA systems were those with more than 4 processor sockets. Technically the Intel ProFusion 8-way in 1999 was not NUMA, but the 16-socket systems were definitely NUMA. All AMD Opteron systems, and all Intel systems from Nehalem on with more than one socket, are NUMA. SQL Server 2000 had very many eccentricities on NUMA. Perhaps later versions of SQL Server have progressively fewer.

Xeon Processors and Systems

In the early days, there were only 3 Xeon processor options, all the same frequency, with small, medium and large cache. Most people chose the large cache for databases on recommendation, without relevant performance versus cache size data being made available. But no matter, cache sold itself without effort.

What is confusing today is that there are 11 choices for the Xeon E7 v4 processor, with core count options of 4, 8, 10, 14, 16, 18, 20, 22 and 24, of which two are special SKUs for HPC (4 and 10 cores, 60M L3). Seven of these are 8800 series, which can be used in systems with up to 8 sockets. Four are 4800 series, which can be used in systems with up to 4 sockets.

The Xeon E5 v4 4600 series has 9 options, including 8, 10, 12, 14, 16, 18 and 22 cores, for 4-socket systems. The Xeon E5 v4 2600 series has 22 options, including 4, 6, 8, 10, 12, 14, 16, 18, 20 and 22 cores, for 2-socket systems. This group has a wide range of base frequencies, from 4-core 3.5GHz to 22-core 2.2GHz. The two lowest models have HT disabled. There are also 5 Xeon E5 v4 1600 series parts, but these are probably not interesting.

(I am of the opinion that there should just be 2 configurable SKUs for each of the LCC, MCC and HCC dies in E5, and the same for E7. One die SKU is for parts with all cores functional. The second SKU is for dies with some non-functional cores. The desired actual core count and frequency combination can be set in the UEFI. There might be a third SKU in the MCC/HCC, cherry-picked for low power, if needed. Just my opinion. If in fact the E5 and E7 are the same die, then only the high SKU is needed for E7.)

In all, we have the following choices for 1, 2 and 4 sockets. The 4-socket system can be E5 4600 series or E7. My recommendation for 4-socket systems is the E7, because there is a direct QPI connection from each processor to every other processor, whereas each E5 4600 directly connects to only two other processors, requiring an extra hop to reach the third processor.

A system with two sockets populated can be either a four-socket E7 system with 2 sockets populated or an E5 2600 series system. A one-socket system can be either the E5 1600 series, which is not interesting, or the E5 2600 series. One other option is the Xeon E3, which is not discussed here.

What is the difference between core count and frequency? I have seen some people compute a total core-frequency as core count times frequency. Here it is important to know that certain code patterns are highly insensitive to frequency, typically serial memory access: fetch memory, use the contents to determine the next memory address to request. Other code patterns, ones that reside inside L2 cache, or that march through memory sequentially or otherwise in a pattern predictable far in advance, benefit almost directly from higher frequency.

Suppose the core count were determined to be some intermediate value, say 16 cores. Two practical options are one 16-core processor or two 8-core processors. The 4-socket E7 (four 4-core processors) would be an extraordinarily extravagant option, not only because of the 4-socket system level costs, but also because the 4-core E7-8893 v4 3.2GHz is a special SKU using the HCC die with the full 60M L3 enabled, carrying a $6,841 price tag per processor. The 8-core E5-2667 v4 3.2GHz is $2,057 and the 16-core Xeon E5-2697A v4 2.6GHz is $2,891. There is also the 8-core E5-2620 v4 2.1GHz at a very economical $417. (These are Intel suggested prices; Dell prices are slightly lower.)

For the moment, let's set aside that the E7-8893 is a special SKU, or suppose we were using the 8-core E7-4809 v4 2.1GHz at $1,223 each, disabling 4 cores per socket in UEFI (formerly BIOS) and pretending that this were not a violation of Microsoft per-core licensing terms. Then what are the meaningful differences between these options, aside from processor frequency, and assuming that the difference in the number of DIMM sockets is immaterial?

Too many people might first point to memory bandwidth, because there are four memory channels per processor socket: 4 channels in the single socket, 8 in the dual and 16 in the quad. With DDR4 at 2133MT/s, that is 17GB/s per channel. Except that memory bandwidth is not important in some database functions. Memory latency is all-important, and a single socket means all memory is on the local node. Dual socket might mean that memory access is 50/50 local versus remote, and 4-socket might be 25/75.

If SQL Server had knobs to align network connection, thread affinity and memory node, then a high degree of memory locality might be achieved. But Microsoft does not like to engage in this type of tuning. There is the ability to map TCP/IP ports to NUMA nodes, but other controls are needed?

My recollection is that SQL Server parallel execution plans have great scaling when all threads are within one processor socket, and poor scaling when threads span multiple sockets. I have also noticed that transactions (single or few-row index seeks and key lookups) have the best CPU efficiency on a single socket, then less as the number of sockets increases to two and then four.
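
One practical consequence, if this recollection holds, is to cap parallelism at the core count of a single socket, which is also the usual Microsoft guidance for MAXDOP on NUMA systems. A sketch assuming 16 cores per socket:

  -- Keep a parallel plan's threads within one NUMA node (socket).
  EXEC sp_configure 'show advanced options', 1;
  RECONFIGURE;
  EXEC sp_configure 'max degree of parallelism', 16;
  RECONFIGURE;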

Microsoft claims excellent scaling for the product formerly known as PDW (now APS) versus cluster nodes, which is a more difficult feat than scaling a single system across memory nodes (processor sockets). This is accomplished by distributing fact tables across nodes and replicating dimension tables. Certain types of queries can then be processed by each node with a minimum of internode communication. If SQL Server (non-APS) implemented the same mechanism for distributing fact tables, only by memory node when the data is in the buffer cache (storage is still common), and allowed the threads processing that part of the fact table to run on cores in the same node, then scaling should be even better than APS/PDW? Dimension tables would not be replicated, but the OS should have a striped memory pool for this?

This type of information is important, pertaining to real decision options with significant impact, and yet good information on this topic is nearly non-existent.

Benchmark versus Real-world Storage System

One reason people might be suspicious of benchmarks is that the storage system seems unreal. The TPC-C and TPC-E benchmarks have a linear performance-versus-database-size requirement: as higher performance is achieved, the database under test must be scaled proportionately in size. For TPC-C it is about 6MB per tpm-C and for TPC-E about 4.2GB per tps-E. The exact parameters seem to have been calibrated so that the data size is about ten times larger than memory, or this could just be a coincidence.

The required database size for TPC-C and TPC-E is probably larger than in most contemporary real-world environments, but not so excessively that it seems unreal. The key difference is that the entire data set is active in the benchmark workloads. In the real-world databases I have seen, much of the data is past history, because people neglect to archive old data out of the transaction processing system. So the significant difference is that while a real-world database might be several times larger than memory, the active data could mostly fit in memory with proper tuning, resulting in fairly low IO to data. I have seen large enterprise systems run at a few thousand data file reads/sec, with checkpoints driving write bursts to several tens of thousands of IOPS. For a benchmark system, if the storage has 1,000 15K HDDs for data files, then the steady-state IO load is probably 200K IOPS, because it was sized for 200 IOPS per 15K HDD.

As the TPC benchmarks have a requirement to list the total system price for the price-performance metric, system vendors use low-cost direct-attach storage. For whatever reason, in the real world, storage vendors have convinced IT departments to buy incredibly expensive SAN storage systems. SAN vendor reps often have no understanding of RDBMS data-log architecture, frequently do not understand that FC is specified in gigabits per second while IO requirements are specified in bytes per second, and think that 8Gbps FC (roughly 800MB/s per port after encoding overhead) is more than enough to meet a 2GB/s data transfer objective. Most astonishing, SAN people have an unquestioned, mystical belief in the power of their large cache. At a time when large SAN systems had 32GB cache (64GB physical memory, mirrored), large database systems had 256GB memory, of which 80% was for the buffer cache. Presented with this information, they still believed that the storage system cache provided immense value.

Fortunately, SSD storage became viable even under the TPC price-performance metric. This occurred in the 2010-11 time frame, when there was a mix of published benchmarks, some still on HDD and others on SSD. A correctly designed SSD storage system can support far more IOPS and bandwidth than any benchmark might require, so hopefully there will no longer be a disconnect in understanding the configuration strategy of the storage system. Hopefully people also do not make the stupid mistake of using 8Gbps FC or 10GbE iSCSI as the fabric.