Home, Optimizer, Benchmarks, Server Systems, Processors, Storage,

SQL Server Scaling Project Overview

Two years ago, I joined a SQL Server scaling project. A reinsurance company planned to consolidate two separate entities each with their own data processing application into a single system in 18 months. The smaller of the two had a newly built system on SQL Server that went live in early 2013, having a rich functionality set, but also serious performance issues. The second entity ran on an older system with less functionality, processing three times the volume of the first entity. The combined system would process four times the volume of the smaller entity and is expected to be four times larger in size.

An overview of the scaling project is presented here. The emphasis is on the reasoning supporting the major decisions, and a description of the segments of the project. Most of the technical details are standard SQL Server performance repertoire and documented elsewhere. The plan was to first assess whether the newer system could be fixed in place or if a more complex re-architecture would be necessary and then to do the work. A separate team would handle data migration from the other data system.

It was evaluated that the immediate objectives could be achieved with performance tuning involving a combination of single thread efficiency and multi-thread concurrency improvements. There were deeper issues in the architecture that could limit future scaling. However, there was risk that a re-architecture effort might not reach a stable schema by the time at which the data migration team needed table definitions to be frozen. The schedule was very generous for a performance tuning effort, so it was proposed that re-architecture would be investigated, but its implementation would be deferred to after the consolidation.

Over the course of this project, two important related matters became apparent. First, scaling on the Hyper-Threading (HT) enabled logical processors was almost linear for a transaction work load. Second, scaling over multiple processor sockets was less than highly effective without the database and application being architected together to work with the SQL Server NUMA tuning features, again for a transaction processing workload. The converse of this is that a single socket system may be almost as good as a two-socket system in a database not architected for NUMA.

A final item for anyone not planning on reading the entire article. There were indications that this application, with very extensive logic, would have benefited more from natively compiled procedures than any other performance work. Unfortunately, the SQL Server 2014 implementation of Hekaton memory-optimized tables was not viable, and version 2016 was past the timeline of this project.


About 2011, the smaller entity contracted with a consulting firm to build a new system to a replace a legacy mainframe application. The firm represented their consultants as SQL Server experts, whereas the client's in-house staff had experience in mainframe. Every table in the database was to have a Row GUID column as the primary key clustered. This included the dimension-type tables with 3 to 100,000 rows. Even though this is a transaction processing application, the DW terms dimension and fact tables are still appropriate descriptors.

The large tables quickly grew to hundreds of millions of rows. There could be several to several tens of foreign key columns of the uniqueidentifier type. The natural key of some dimension table were char-3 values. The in-house developers expressed serious concerns on the pervasive use of guids, but the consultants declared this to be best practice. I do not recall a SQL Server author making this representation. But some of the Microsoft developer tools were GUID happy or punch-drunk if you prefer. It is unfortunate that many books on programming use poor examples of database architecture with all tables having an identity column.

The nature of the application is batch mode transaction processing. There were large queries and application driven code with light transactions, but the focus here is on the batch driven transaction elements. A batch processing system can have transactions very much like a regular transaction processing system. The difference is that transactions are known ahead of time, are driven from a batch job, and live users are not waiting for each individual transaction to complete.

At the end of each month, there are 5-6 million accounts of 5 types to be processed. The reinsurance industry term is Retrocession, but account is a more generic description. These are divided into batches, ten batches for each of the two common types, and between 1-4 batches for each of the three infrequent types. The batches are called from SSIS, which has a parallel execution capability. There are five top-level procedures, one for each type.

The top-level procedures loop around the individual accounts in a specified batch. Inside the loop, there is a nested two-deep BEGIN/COMMIT TRAN block. The interior transaction wraps around an individual account. This is the true transaction, either the entire sequence for the account completes or the transaction is rolled back. Within the true transaction, there calls for up to 7 different sub-procedures, 2 table-valued functions, and a dozen or so scalar functions at different nesting levels.

One of the sub-procedures has a loop which can run for up to 240 iterations. Each iteration representing one month, having an upper bound of 20 years. Accounts most frequently loop once, generating 4 or so updates, 5 or more inserts to permanent tables, and a few rows inserted to temp tables. Accounts that loop for more than a few months are infrequent, but tend to be concentrated into certain batches by a combination of necessity and circumstances.

The outer transaction block wraps around 500 accounts. This is an artificial transaction, not serving a business requirement. We might ask why there is an outer transaction as only the inner transaction is necessary for transactional integrity. One of the supposed SQL Server best practices is to group insert, update and delete statements into a transaction even if they are not part of a true transaction. The reason is that the entire transaction then generates one log write. Otherwise each IUD statement is an implicit transaction with its own log write.

Reducing the number of log writes is beneficial, but there is a point of diminishing returns. The negative is that locks are held for the duration of the transaction. The interior transaction already has ten or more write statements. The benefit of the exterior transaction around more than one account is already weak, and setting the size at 500 accounts is courting serious trouble, but we are getting ahead of ourselves.

Employing a GUID as the primary key for a dimension table with few distinct values that would be the foreign key in very large fact-type tables is monumentally foolish. If anyone knows of a valid justification, please say so. The GUID serves a valid purpose in meeting a specific requirement that arises in certain situations. One is when the key value is generated elsewhere, and is later brought into a central database as a unique key.

For this to work, only certain tables need to have a GUID key. Dimension-type tables should have centrally coordinated keys, not larger than necessary, be it tiny, small, regular or big integer, or other data types. Tables that are unique on the parent table foreign key plus another column may not need an identity or row GUID column.

This application went online in early 2013 on a 4-way system with 6-core Opteron processors, 64GB memory and a hard disk storage system. The database was on SQL Server 2008 R2 and operating system was Windows Server 2008 R2. Performance was 3-5 transactions per sec per thread. Furthermore, deadlocks were encountered with a high number of concurrently running threads for the batch processing procedures.

When the number of batch processing threads running in parallel was reduced to 6, deadlocks were rare. The combined throughput with 6 concurrent threads was 30 account transactions per sec or less. Doing the math, 6M transactions divided by 30 tps is 200,000 seconds or 55 hours. This is if the long running transactions are evenly distributed between batches. But of course, they are not, so some threads take much longer to complete.

There is also preparatory work before the transaction processing, and post processing after the transactions are completed. Each of the pre-processing and post-processing stages consumed up to one full day at the beginning of the project. After transactions are processed, analysts examine the results to determine whether the intended business logic was correctly implemented. If not, then the results are discarded, code fixes are made, and the transactions are processed again.

Ideally, processing is started on the evening of the last day of the month, and completes before the next morning. If the results are accepted, then they can be made available promptly. A second system (UAT) runs this process on a copy of the production database before the official end-of-month run. This allows with new code to be tested two or three times before the official run.

Processing that takes multiple days was an unsatisfactory situation. Among other reasons, the original consultants departed. In late 2013, the production server was replaced with a new 4-way system with Xeon E7-8837 8-core processors, 32 cores total, 256GB memory. The storage was replaced with a Violin Memory All-Flash Array (AFA). The specific E7 SKU did not have hyper-threading.

The new server and storage system combination increased processing to 30 transactions per second per thread for a combined throughput of 180 tps. Projected run time was down to 9 hours plus extra to account for uneven distribution of the long running transactions. This was now tolerable if less than desirable. Some of the performance gain can be attributed to the Xeon E7 (Westmere) having 2-3 times better compute than Opteron on a per core basis, and having more memory. But the larger contribution under the circumstances was from lower IO latency times on the flash storage compared to hard disks.

The Scaling Project

When the need to consolidate data processing with a second entity arose, two new consulting firms were asked to evaluate the situation and make a pitch. The contract was awarded to a very large firm with a well-known three letter acronym, and a nickname for size and color, whose identity shall not be disclosed.

They immediately started working on a data model for the current system. The plan was to create scripts to expand the existing database to four times the size to run the data processing as the baseline for the 4X scaling assessment. The data model was significant undertaking. For whatever reason, the purpose of this was not clearly communicated. The client became concerned with billable hours being expended without tangible progress towards objectives. And so, a new consultant was brought in.

When I arrived on this project in June 2014, the 4X data growth scripts had been built, but the test environment for the scaling project (POC) was still being set up. Information including execution stats could be gathered from both the production and UAT systems.

Database Size
The database size at the beginning of the project was 3.4TB reserved, 1.6TB data and 1.8TB nonclustered indexes. There were too many nonclustered indexes on the main tables, many with similar keys, some infrequently or not used. All objects were in a single filegroup having a single data file.

Some of the contents of dm_exec_query_stats could be traced to the stored procedures belonging to the transaction processing code that was the objective of this project.

There were other SQL high on the list of top resource consumers but not related to transaction processing. These originated from simple one or two statement stored procedures. Efficiency fixes were quickly put in solely for the purpose of getting rid of clutter from the top entries in dm_exec_query_stats. This did improve user experience, but that was not the objective of this project. It was also clear that dm_exec_query_stats only retained fragments of the transaction stored procedures as information was lost from recompiles, or plan cache eviction.

Test environment for the scaling project (POC)
The test environment for the scaling project was setup in a managed data center. The POC system was also managed by the data center instead of in-house staff. This was part of another objective to out-source infrastructure management. The test server was a 2-socket Xeon E5 v2 blade with 8-core processors (2650v2?), hyper-threading enabled, 16 total physical cores, 32 logical processors, 192GB memory initially, later 384GB, and a 3PAR HDD-based storage system. A pre-baseline month-end transaction processing run was conducted from a backup of the then current production database.

The scripts developed by the first consulting team for the scaling project were then applied to artificially grow the database. At this point the scripts were found to have defects in generating workable data for transaction processing. It was felt that there was no time in the schedule to rework the scripts.

Database Size and Scaling
I argued that in a correctly configured server system and correctly designed database, size has only a weak or moderate impact on performance for transaction processing workloads. Furthermore, a difference in size could be accounted for simply by testing at the expected memory to data size ratio. If the database is expected to be 4 times larger than it is currently, simply test at one-quarter of the expected memory configuration. It was unclear if management believed this argument, or that we proceeded without the scaled-up database because there was no other choice. As much as I would like to believe that my arguments were persuasive, it is hard to beat the second item. Later developments made this a non-issue.

The Plan
The plan from here was to spend about three months conducting a comprehensive survey of the existing environment. Then make an assessment as to whether the 4X scaling could be achieved with fixes in-place to the current database, or whether more intrusive architecture work was necessary.

The goal was to support consolidated data processing in 18 months. The schema would need to be stable several beforehand for the data migration team to do the final phase of their work. In the remaining time, performance tuning without table changes could continue. Re-architecture, if necessary, would have to be done early in the project.

The expectation was that the assessment would consist of making a list of items to be fixed, along with an estimate for the effort level of each item. The standard project management methodology is to then sum the list for the total effort. The total divided by the working days from start to completion date could be interpreted as the number of people required. Or just as likely, the number of items requiring work would be reduced until it met the budget. By tracking items accomplished to date against the schedule, this could be interpreted as to whether the project was on schedule.

Data Migration Servers
New servers were purchased at the start of this project for the data migration team. At the completion of the project, the data migration servers would replace the production system. At that point, there had not been time to acquire a deep insight into the nature of the database and its performance characteristics. I suggested the Xeon E7-8857 v2 12-core processors, which was higher frequency than the other 12-core options, and was significantly lower priced than the 15-core options.

I overlooked the fact that the E7-8857 v2 did not have hyper-threading, nor did I know at the time that HT was so immensely valuable in this workload. The new systems were configured with 1.5TB memory via 48 x 32GB DIMMs. A configuration of 96 x 16GB DIMMs would have been less expensive, but the client preferred having the option to upgrade to 3TB if necessary. There was a moderate premium in 2014 for one 32GB DIMM over two 16GB DIMMs.

Assessment Process
Broadly, avenues for achieving performance gain fall into the following categories.

1. Single-threaded efficiency, i.e., index and query tuning
2. Multi-thread concurrency improvements
3. Architecture
4. Software, either general improvements or new features
5. Hardware

Architecture is not just a matter of replacing GUID keys with integer, but also determining the data model that best supports the business logic. Hardware could be augmenting or replacing the production system, and could involve systems yet to be released.

Tackling the above list out of order, the hardware had already been set by the new servers for the data migration project. In principle, we would have liked to do much performance work as possible beforehand, then determine the most appropriate hardware configuration. But circumstances dictated otherwise. The number of processor sockets and cores would impact software licensing.

Between SQL Server versions 2008 to 2014, most of the improvements have been in the new features with each version, rather than a significant improvement in existing components. This means we need to re-write code to use the new features. Given that this a transaction processing type of application, the feature of interest is Hekaton memory-optimized tables and natively compiled procedures. A quick look showed that there were too many limitations in the 2014 Hekaton for serious consideration. As it turned out, the greatly expanded Hekaton support in version 2016 probably would have made this possible, and would have made for an interesting follow-on project.

The database was running SQL Server 2008 R2 at the start of the project, with per socket licensing in effect. Eventually, SQL Server would to be upgraded to a later version, in which per core licensing would be in effect. When this occurred, having the correct hardware configuration to meet performance requirements with some headroom would optimize licensing costs, but this was not a major concern. (Furthermore, with the SQL Server 2016 SP1 features repositioned to Standard Edition, we can stop complaining about per core licensing.)

We should consider that estimating performance gain on an individual item is just a guess. Estimating the effort to accomplish a particular task is also just a guess. An estimate for the cumulative effect of many individual changes is no more than a wild guess, and probably much less than that.

Still, management needs to make a decision, with some but not too many options. Decisions are based on cost-benefit calculations, even if the basis numbers lack firmness. This is why it is worthwhile to push through some immediate actions, to narrow the gap between current state and the objective, reducing the uncertainty. That said, should some of the estimates prove too optimistic, there is no point making excuses. Just find a way to get the job done.

Physical Architecture and Indexes
While the assessment was still ongoing, two immediate actions were taken, not waiting for the completion of the assessment. One was to restructure the physical database architecture from a single filegroup with one data file to two filegroups, each with 8 data files, and for the storage system to have 9 volumes, 8 for data and 1 for logs. The second was to restructure the indexes, changing the cluster key in several tables, and greatly reducing the number of nonclustered indexes.

The physical architecture and indexes changes were deployed together because it is more effective to redistribute data from the original single file to multiple files by rebuilding indexes instead of via DBCC SHRINKFILE. If indexes are rebuilt to redistribute data, then we may as well take this opportunity to put in good indexes, making clustered key changes if necessary, and get rid of excess indexes.

The purpose of the second filegroup was so that large tables could be moved to the second file group on index rebuild, emptying out the original large data file. The original data file could then be shrunk without much effort. One valid reason for having two filegroups is to place static tables in one FG, and growing tables in the second FG. This would facilitate filegroup backups, but filegroup backups were not used in this environment. There was no purpose for having two permanent filegroups here other than to facilitate routine maintenance. Tempdb was also repositioned to have one data file on each of the 8 data volumes. All logs were on the volume assigned for logs.

The purpose of the database physical architecture was to enable high IO bandwidth in database backup and recovery. Backups also specified multiple files, one to each volume. The Violin Memory representatives said that bandwidth did not scale beyond 4 fiber channel links. This is true for a single data file, when the MPIO driver must dynamically balance load. They did not know that with multiple files, the path should be preplanned? With no need to make inflight decisions for every 2KB fiber channel frame, scaling should be nearly linear to a very high number of channels. It would have been nice if the network had multiple 10GbE channels, which was my plan, but I never got around to it.

Backup and Recovery
In a previous project, the objectives appeared to be possible. But the necessary number of test cycles could not be accomplished within the scheduled timeframe in large part due to the interval required to reset the test environment. Much of the turnaround time was on the restore from backup and other IO bandwidth intensive operations.

New servers were acquired with local storage correctly configured for high IO bandwidth. It was then possible to turn 2-3 test cycles per day instead of just one. The client liked the new test system so much that they moved it to production, even though there were non-enterprise parts to save money on the assumption that it was just for the test environment.

For tables having clustered index changes, the new key was usually the foreign key to a parent table followed by the row GUID. In one case, the new cluster key was just two foreign keys, and the row GUID column was discarded. In some cases, a nonclustered index on the row GUID was not needed, as all accesses came through the parent table key, now the lead column of the clustered index.

The clustered indexes were left in the original state without compression. Some of the nonclustered indexes had page level compression applied. Row level compression was not evaluated. The data size was reduced from 1.8 to 1.5TB, as a result of reclaiming unused space on the rebuild of the clustered indexes. The nonclustered indexes were reduced to 0.4TB.

With the rebuilt and reduced indexes, performance improved by 33% from 30 to 40 transactions per sec per thread in the production environment. Later testing showed that rebuilding fragmented indexes improved performance by 10%. The performance gain that attributed to the new and reduced indexes is then 21% (1.1 x 1.21 = 1.33). Still, the reduced index set made periodic index rebuilds quicker. Performance in the POC environment with hard disk storage was about 20 tps per thread.

SQL Tuning First Steps
In following the transaction code from the 5 top-level procedures down, it was found that the use of SET options was inconsistent. SET NOCOUNT ON was not uniformly in effect. Many of the procedures and TVFs were several hundred lines long, with 20 or more SQL statements, even more variable assignment statements, and many IF blocks. So, the impact of NOCOUNT may be non-trivial. Changes for this and other simple items were applied at this stage.

The Assessment

This was rather easy. The transactions were being processed with 6 concurrently running threads on the current production system with 32 cores. The production at the completion of the project would have 48 cores. Whatever caused the problems that led the original developers to step down to 6 threads would be found and fixed.

The cause was not identified in the assessment phase, but solving deadlock issues is a standard skill for database experts. There is no doubt that it can be done. The question is what performance gain could be expected from increasing the number of concurrently running threads from 6 to 32 and then further to 48. The expectation is that performance is less than linear. In a well-designed database and system, scaling is expected to be only slightly less than linear. If scaling were much less than linear, then that might be an architectural issue.

Single Thread Efficiency
Some efficiency gains from index tuning were already achieved before the completion of the assessment phase. On the SQL side, there were 6 statements with high CPU, another dozen or so with medium CPU, over 40 statements comparable to a single row index seek on the primary key, and over 100 SET statements doing variable assignment. Assuming that significant improvement could be made on the large statements, and moderate improvement could be made on the medium statements, what would the overall impact be?

The performance prior to re-indexing was about 30 tps, so the average elapsed time for one transaction is 33ms. Half of the transaction elapsed time could be accounted for in dm_exec_query_stats, and only half of this was in the large statements. Until the missing time could be accounted for, it did not seem wise to be making promises for large overall gains in SQL efficiency.

We might consider that the overhead for logging execution stats, including for all the variable assignments, might be non-trivial. By the way, TPC benchmarks are run with -x, disabling the collection of much performance related information. Would you like to guess why? That said, as general practice, we can set a goal of 3-4X improvement considering that the time scale of this project was 18 months, giving us plenty of time to dig into the root causes.

Assessment Summary
What this means was that the objectives could be met without re-architecture, without waiting for an enhanced Hekaton in SQL Server 2016, and without waiting for a super-high core count processor like the Xeon v4, which had been announced at the time as having 24 cores, expected to be available in 2016. In fact, there was no need to have the maximum of 15 cores per socket in the 4-way Xeon v2 being purchased for the data migration system.

GUID versus Integer Keys
It would still have been desirable to remove the GUID keys from at least the dimension tables. And possibly some of the child fact tables, leaving GUIDs in just certain parent tables. The client's staff developers were of the same opinion. However, this would have been a complex undertaking. The main tables were referenced in several thousand stored procedures, several hundred functions, and a few hundred views. The objects scripted out totaled 32MB and 881,000 lines. Furthermore, there were SQL in the application software outside of stored procedures, itself a really bad idea.

Database Architecture
If we were to go to the effort of replacing the GUID keys with more compact data types, we should also take the opportunity to put in the correct keys to best support the business logic. It was apparent in the transaction procedures that the code was doing work that should have been expressed in the key structure. Given the timeframe, the performance tuning work could be done by one person. A second person could work in parallel on either a full re-architecture or just convert some GUID columns to a more compact data type. This matter was settled when the second consultant was not available for this project.

SQL Server Version
One other question in this project was whether or when SQL Server and the Windows Server operating system should be upgraded from version 2008 R2, and to which version: 2012, 2014 or later. Seeing as how performance could be improved to meet objectives on the current version, and that the more interesting later version was SQL Server 2016, which would be second generation Hekaton, it was recommended that version upgrade would occur after the consolidation project.


Some preliminary changes, database physical architecture, indexes and SET options, had already been deployed to production. The practical approach in performance tuning is to make a series of incremental improvements. Find the top queries, either by CPU, elapsed time or perhaps physical IO, optimize as practical, test, debug and deploy. Then repeat the process. Some may prefer to deploy, then debug. To each, his or her own.

It is important to conduct a reassessment after significant changes, as the system level performance characteristics may change. What might have seemed important earlier may now be not so important, and vice versa. Performance tuning may ultimately be limited by architectural flaws, in which case it is important to make this assessment as soon as possible. Some last points, good execution statistics are very helpful. Comprehensive performance tuning will involve heavy SQL re-coding, in which case, good debugging capability is important.

Individual Statements
In consideration for the complexity of the transactions, the approach was adjusted to first address localized changes, that is, tuning individual SQL statements. A couple of the top resource queries and a number of medium CPU queries were improved. These fixes contributed some performance benefit, but more importantly, familiarity with the code was gained.

Statistics No Recompute and Explicit Statistics Update
In the course of processing 6M transactions, the number of rows written (inserts and updates) for even the large tables was sufficient to trigger the automatic statistics re-compute set points. This in turn, is followed by a procedure or statement recompile. Periodical recompiles pose no operational problems. However, being new to the environment, it is helpful if dm_exec_query_stats could show as complete a picture as possible.

In many tables, the index statistics do not change in a manner that would impact the execution plans. Here, the index statistics were updated with FULLSCAN and NORECOMPUTE prior to the start of transaction processing. For some indexes, statistics were explicitly recomputed after a specific step in which rows were modified in a manner that would affect the execution plan.

The transaction procedures had code blocks for logging the execution time of certain steps. These were controlled by a variable, with the value set to bypass logging. Presumably the value was set to enable logging for performance investigations. In general, using a row insert to log execution times for steps in a transaction processing system is probably incurring too much overhead. After the index statistics strategy was implemented, dm_exec_query_stats captured the entire run. The logging code structures were then completely removed.

Page Compression
Out of curiosity, or euphemistically, in being thorough, page level compression was applied to the data, that is, the clustered indexes. Data size was reduced to 0.5TB. The total database size was then down to 0.9TB. There had been 25% growth in the number of rows since the beginning of the project. The combined effect of index rebuilds (reclaiming unused space), compression in the clustered and some nonclustered indexes, and reduced set of nonclustered indexes achieved a 4X reduction in size.

There was no apparent performance impact from compression in the clustered indexes as measured in elapsed time. In previous tests, I had reported that the CPU overhead for page compression in single threaded execution plans was 40% when all data was in memory. The interpretation of the outcome observed here is that the higher CPU overhead for compression just happened to be about equally offset by the reduction in IO time.

Had the active data been entirely in-memory in both cases, then we expect to see the full overhead for compression. When data is not entirely in memory, and compression lowers memory footprint, reducing IO time, then there could be a benefit from compression in the elapsed time. In this case, storage was on flash for the production system, with queue-depth 1 IO latency on the order of a few tenths of a milli-second. Had the storage been on hard disk with 5-10ms IO latency, then compression might have had a large positive impact.

After data compression was applied, no one particularly cared that we did not have a 4X scaled up database for proof-of-concept testing. That and fact that we were already half way to the objective.

Concurrency 10 threads
As the main avenue of gain was expected to be from increasing the degree of concurrency, steps were now taken in this area, while continuing to work on efficiency of other SQL statements. The first step was to bypass the outer BEGIN/COMMIT TRAN around every 500 individual transactions. This was done by reducing the outer transaction block to 1 row, effectively negating it while leaving the code in-place.

The inner transaction block already encompassed several insert and update statements, achieving most of the benefit of log write consolidation, while also serving its primary purpose of transactional integrity. The outer transaction block then served no positive purpose, with the negative impact of holding locks much longer than necessary.

By this time, single thread performance was 50 tps on flash storage, for an average transaction elapsed time of 20ms. The time to complete 500 average transactions in one thread would then be 10 seconds, over which locks are held for 4500 rows inserted or modified. Because the long transactions tend to be grouped together in certain batches, the worst-case scenario is that locks are held by the outer block for several minutes on several tens of thousands of rows. This is not good.

As expected, there was no impact from negating the outer transaction block. In the next round, the outer BEGIN/COMMIT code was removed outright. Presumably the original consultants had read a best practice item on explicit transactions without bothering to understand the parameters for employing this practice.

Next was a baby step of increasing the number of concurrent batches from 6 to 10. This also ran without problems, and without loss of efficiency at the individual thread level. The system now had a combined throughput of over 500 tps, except towards the end when some threads finished, but stragglers continued running.

Complex SQL
Efficiency work progressively moved to more complex scenarios. Blocks of SQL statements were examined, first within a procedure. Then, statements were restructured, moving between nested procedures if beneficial. This may necessitate changing procedure parameter lists, which meant that dependencies needed to be checked carefully.

All the high CPU statements were examined in detailed. Opportunities for efficiency in the intermediate and even in the simple statements were not ignored. Some blocks of multiple SET variable assessments were consolidated to a single statement. The execution statistics for these are near zero, but does that include the effort to log the execution statistics?

A commonly cited performance tuning methodology is to make only one change at a time. This is like riding a bicycle with training wheels. Given the sophisticated and fine grain metrics in the DMVs today, multiple simultaneous complex change might be workable. This does require skilled interpretation of metrics, and good debugging tools.

The client's in-house expert on the business logic provided SQL to examine which rows should match up exactly. Still it was a difficult process to trace the source of a discrepancy (a different way of saying bug).

All the tables in this database had six common columns for administrative housekeeping. Two were create and modify date. The other four were GUID columns for create, modify, (code) module and status. It was not necessary to have administrative columns in all of the transaction tables, perhaps just one or two. The GUID values were not particularly useful in this matter.

What was needed was a mechanism to show the code path followed by each transaction. This could have been done with a string column in one or two of the core tables, where encoded values are appended to indicate where the row originated and each place it was modified. However, there were too many places outside the transaction code that accessed these tables to attempt this change.

There were some temp tables created and drop with each transaction. These were converted to permanent tables, truncated at the beginning of month-end processing, but retained at its completion. The permanent working tables were constructed with additional columns to record administrative information so that the code path followed by each transaction could be deduced. The complications in modifying official permanent tables was then avoided.

The data distribution statistics at the completion of month-end processing are extracted via DBCC SHOW_STATISTICS WITH STATS_STREAM. At the beginning of the next month-end, the temp tables are truncated, statistics auto-update is disabled to prevent recompiles, and the statistics from the previous month are applied. Again, this is not necessary for normal operation. It is only used to preserve full execution statistics in dm_exec_query_stats for my benefit.

It was discovered that there were minor variations in the amount field for a small number of transactions from run to run even with identical code and initial data. The magnitude of variations was small, and would only impact whether the fourth significant figure was assessed in one period or the next, with the total being the same.

The client felt that this was not a serious problem on this business side, because the end total was correct. But run-to-run variations made performance work difficult because we had to assess whether a difference between the baseline run and a run with new code was due to the existing variation, or a new error. Later, the variation was traced to a scalar function in which one of the code paths that computed the partial month fraction did so at a lower precision than what should have implemented. After fixing this, results from one run matched another run exactly.

Concurrency 10 to 20
The combined efficiency tuning efforts pushed single thread performance to 75 tps at 10 concurrent threads. The number of concurrent threads was increased from 10 to 20. Again, there were no blocking or deadlock problems. There was a slight decline in single thread efficiency to 60 tps at 20 threads.

Table Valued Functions
The two table valued functions showed high CPU of about 1,900ms and 800ms respectively, with elapsed time about equal. When the execution plan is not parallel, having CPU and elapsed time about equal usually indicates that all data is in memory. It should be possible to confirm this with the physical IO stats in dm_exec_query_stats. Yet the statements within these two TVFs did not show anything that could consume so much CPU, not singly or cumulatively.

There was some physical IO. Not huge, but enough that there should have been separation between CPU and elapsed time. If this had been a stored procedure with SQL statements that incurred physical IO, then there would be a gap between CPU and elapsed time. The IO should be issued asynchronously, allowing the SQL Server engine to switch to another thread. The thread can resume when the IO completes. The implication is that within a TVF, IOs are synchronous? After converting the TVFs to stored procedures, most of the CPU disappeared but the elapsed time was about the same as before.

More efficiency
SQL efficiency work continues. A few logical inconsistencies were found, in which the query logic expected only one row. But without some combination of unique indexes or foreign keys, the query optimizer believed otherwise. Because there was no official keeper of the data model, each developer interpreted the usage for tables differently. The result is unclear data definitions. Nor was there an official document of definitions and rules for each table.

Data Model
In this regard, the first consulting firm hired for the scaling project had justification in wanting a data model. However, deducing the correct data model from the existing tables, unique indexes and foreign keys is not a simple matter, more so if the keys are not correctly designed. It is necessary to also examine the core business logic.

This is probably why the script they built to artificially increase the data size did not work. If a more coherent unique key and foreign key set had been in place, it should have been possible to write the data multiplier script.

Consulting Practices
The take away here is that an outside consultant should plan the sequence of work not just from the technical correctness point of view, but also so that the client perceives steady and visible progress towards objectives. When there is a degree of confidence in the consultant and the project is deemed to be at low risk of failure, there is greater freedom of action.

Partitioned Tables and Indexes
Partitioning is a powerful feature that can be used either to concentrated active rows into a single partition or to distribute row as the situation calls for. Both purposes were employed here. Concentrating active row has the benefit that indexes can be rebuilt for a partition once after the data is set, and not need rebuilding thereafter. Distributing rows allows for indexes to be rebuilt one partition at a time, although online partition rebuilds were not possible until version 2016?

It may be possible, depending on the usage, to have in effect, a free index when the leading index key is not the partitioning key. One example is a table partitioned on a date having not too many distinct values, but the lead key is perhaps an identity or other highly selective value. A query for a date only scans a specific partition. A query on the lead key seeks for that value in each partition.

Concurrency 20 to 30
Concurrency was further increased from 20 to 30 threads. To support this, the number of batches was also increased. It was kept in mind that the ultimate production system, currently in use by the data migration team, would have 48 cores.

As single thread performance and concurrency increased, the effect of uneven runtime for batches having many multi-period transactions began to have a greater proportional impact. A larger number of batches were created and a mechanism was implemented to even out the distribution of work between threads until the stragglers lagged by less than a few minutes.

The concurrency increases from 6 to 10 then to 20 threads did not result in deadlocks or other observed blocking behavior. On the initial increase to 30 threads, I was distracted with other code changes and neglected to check the error logs.

Only after several test sequences was it noticed that there were now entries in the error log for deadlock victims. This was presumably the problem that led the original consultant to back down to just 6 concurrent threads. There could be several reasons why deadlocks were not seen at 10 or 20 threads, appearing only after the increase to 30.

Back then, there were poor indexes and too many indexes, so writes must modify many indexes. The original storage was on hard disks, and had since been upgraded to all flash, meaning transactions now completed more quickly. Perhaps, the larger factor was the exterior BEGIN/COMMIT TRAN around 500 individual transactions, that resulted in holding too many locks for much longer than locks should have been held.

The original code did have error handlers, which did detect and log the error messages, along with procedure name and line number. There were two procedures each with one statement showing occurrences of deadlocks, both statements being updates to the same table. From here, it was not difficult to conclude that one of these statements specified the primary key as the SARG, and hence was most likely the victim of a deadlock. The other statement had a multi-part SARG, some of which were equality expressions, but others were inequality expressions.

The equality arguments reduced the number of rows to an average of 10. The remaining rows must then be tested against the inequality arguments. However, the upper bound on the equality columns was several thousand rows. Over 6M transactions touching an average of 10 rows each time, only a total of 10-20 rows actually meet the full of criteria for the update.

The first attempt at this problem was to use an IF EXISTS test on the equivalent SELECT statement before attempting the UPDATE. There were still deadlocks. Next, a temp (later converted to a permanent) table was created with the necessary key values. An INSERT/SELECT statement populated the table. If the row count was greater than zero, the key values from the rows inserted were used in the update statement. This time, the full unique key columns are specified.

The problem might be that update statements took exclusive locks on all the rows meeting the seek predicates? Even if the row did not meet the full criteria. In the Insert/Select statement, only a read committed lock was in effect. The update statement is then called only when rows are found and now the unique key is specified. There were no more deadlocks, problem solved with a good coding.

Soon after deploying this code change, one of the in-house business logic experts reviewed the code sequence. After discussion with other staff, it was assessed that this code corrected infrequent data errors, and should not be done at each individual transaction. Instead a single query for all accounts at the completion of processing was the right place.

In all probability, if this code had not been there, then original code probably would not have had deadlock problems with 20 or so concurrent threads. After purchasing the Xeon E7 server and flash storage array, performance at the projected 4X level would not have been great but might have been deemed tolerable.

Performance Summary
The assessment given to the client early in the project was that 4X volume and perhaps 10X throughput performance could be achieved without re-architecture. There were reasons not to expect too much from single threaded efficiency. The bulk of the gain was expected to come from increased concurrency in transaction processing, from 6 threads to how ever many there were physical cores or logical processors as appropriate.

It was not necessary to first identify the cause of the thread contention, and then assess whether the issue could be fixed, before making the assessment. Whatever that problem was, we were going to fix it, so that the server would have all processors working. In fact, the scenario for maximum throughput would be to have two or more threads on each logical processor, but this was not pursued. The nature of modern systems is that compute capability is distributed over very many processor cores. Competent software architects must know how employ the full power of the system or there is no point pretending to be an expert.

That said, it was also apparent there was contention between threads. The highest measured single thread performance was 75 tps at 10 threads. There were additional efficiency measures deployed after threads increased to 20, and then 30. At 30+ threads, single thread performance fell to just over 50 tps. The combined throughput was 1500 tps on the POC test system and 1600 on production. An effective 10X plus had been achieved because transactions were more uniformly distributed among threads.

On the new system with 48 core totals, the 50+ tps per thread was also maintained for a throughput of over 2600 tps. The loss of efficiency at high concurrency is presumably due to contention between threads. The source of contention at high parallelism was not identified. One possibility is that this is a generic high concurrency effect. Another possibility is that this more due to the effect of contention between threads on different processor sockets.

From the technical perspective, there was no doubt that the performance objectives could be achieved under the allowed schedule. The client had concerns at the beginning of the project, given that the then current performance level had only been achieved at great cost, including the all-flash storage. After significant progress was achieved quickly, concerns on achieving the remainder disappeared.

Re-Architecture 1

GUID Replacement
At the beginning of this project, it was thought that it would be necessary or at least beneficial to replace the GUID keys with integer keys. What gain would be achieved? The database would be smaller. Logic would be more efficient because integers can use native CPU instructions.

As it was, applying page compression to the large tables as is achieved a 4X reduction in size. The use of 1, 2, 4 and 8-byte integer keys would achieve 3X reduction without incurring the overhead of compression. The overhead for compression was estimated to be 40% in previous tests. There were also many date-time columns representing date values, so it is possible 4X reduction could be achieved with a combination of integer keys and date data types.

Sequential integer keys could also concentrate active rows in a limited range of pages at the end instead of throughout each table. If the size of the tables and indexes involved in transactions was 2TB with GUID keys, then it would probably take 2TB of buffer memory to reduce reads to nearly zero. With appropriate integer keys, then the core tables might be 0.7TB. Of this, the active pages might reside in only 200-300GB (just a guess).

Several of the tables used NEWSEQUENTIALID, which does not cause as much fragmentation as plain NEWID. There is factor concerning the use of a sequentially increasing keys that will discussed shortly.

Unique Indexes and Foreign Keys
Database architecture is sometimes thought to mean mostly normalization. In fact, the structure of unique indexes (or constraints) used in foreign keys ensures that foreign keys point not simply to a valid row, but also to the correct row in the parent tables. Correct design of the key plays an important role in supporting the business logic. The SQL logic should not have to do work because there are deficiencies in the data model.

So, if we were going to the extraordinary effort to replace the GUID column keys with integer data type, then we should also put in the correct key structures. The fact is, not every table should have an identity or its GUID equivalent as a primary key. In some cases, the key might be a parent table key followed a local sequence number or perhaps a combination of foreign keys. In the primary key a table were a compound of several columns, and it had child tables of its own, then we might not want to foreign key to the child with several columns, in which case it might then have an identity type key in addition to the unique combination of other columns.

In particular, a table should not foreign key directly to its grand-parent table, the parent of its parent. In the example table A is the parent of table B, and B is the parent of C. Table B has a key leading with the key of A, followed by a local column. Table C foreign keys back to B on the compound key of B, which has a key to A. The point here is that it would have taken time to learn what the right keys.

Memory Latency

Before concluding the database architecture topic, a brief digression on the matter of memory latency and Non-Uniform Memory Access (NUMA) system architecture. It should be intuitively obvious to anyone looking at a system architecture diagram along with pertinent specifications that round-trip memory access latency is the most important aspect of modern server performance in transaction processing applications.

Round-trip memory access latency comes into play when a memory address is not in cache. An access off the CPU-die to DRAM is required, and the contents of the memory being accessed is needed for the next operation. As this is so very important, most vendors avoid the topic and mentioning the important specification values, while happily talking about unimportant specifications. As memory latency is important, this also means that the NUMA architecture of a multi-socket system also factors into memory latency.

There were three occurrences in the course of the work on this project that shed light on the effect of memory latency. First, it was apparent that the POC test system, with HT enabled, continued to scale as the number of threads increased beyond the number cores to both logical processors per core. It was not practical to do a test with hyper-threading disabled though.

In brief, Hyper-Threading involves a processor core appearing as two or more, though Intel processor only support 2-way SMT, the generic term for HT. When a task on one of the logical processors accesses memory, the processor switches to a task on another logical processor. The only way HT will show nearly linear scaling is if more than 50% of cycles are idle for any given logical processor.

The second occurrence, on two or three occasions, the data center managing the POC system applied firmware updates, that required rebooting the system. For whatever reason, the system was set to a lower power mode of 135MHz instead of the 2.7GHz rated frequency. The managed data center IT team did not notice this, nor did they notice other items of significant importance. This setting resulted in transactions running noticeably slower. The dm_exec_query_stats showed worker time (CPU) about 3 times higher than normal. Elapsed time was less than 3 times higher because latencies from the occasional IO did not change appreciably. A 20X reduction in CPU frequency only contributed to a 3X reduction in CPU efficiency on this transaction processing workload.

NUMA, 2-way versus 4-way
The third item, is that the 2-way POC system showed 20-30% lower worker time than the 4-way production system. True, the production system was Xeon E7, a Westmere architecture, while the POC system was Ivy-Bridge. There was a two-year gap separating the processors, a tock and a tick in Intel terminology representing a new architecture followed by a process shrink. This could be in line with expected improvement for the new architecture, but perhaps not.

What can explain all three is memory latency. The round-trip time from a processor to access local memory is about 55ns. This is in addition to the L3 access time, which might be 45 cycles or approximately 12 ns. Round-trip time to memory attached to an adjacent processor socket is perhaps 125 ns. A rough estimate of 5% for the fraction of "instructions" requiring a round-trip access to memory before proceeding to the next step would explain all of the above.

The key assumption here is that memory is randomly located in one of the NUMA nodes. This would mean that logical processors enabled by Hyper-Threading would scale linearly, a 20X reduction in CPU frequency would only reduce CPU efficiency by 3X, and that in going from a 2-socket Xeon E5 to a 4-socket E7, there is a 20-30% loss of CPU efficiency. A further implication is that the CPU efficiency penalty going from 1-socket to 2 is 40%.

TPC Benchmarks
One might question that if the above were true, then would not the TPC-C and TPC-E transaction processing benchmarks show a scaling issue? And does not SQL Server have NUMA aware features? There are no performance-oriented TPC benchmark results at single socket. But published results do show excellent scaling from 2-socket Xeon E5 to 4-socket Xeon E7 which are similar but not exactly the same, (1.915 raw, 1.755 adjusted for core count). Scaling from 4 to 8 sockets is reasonable, 1.59 for the E7 v3.

An examination of the full disclosure and supplemental files shows that the TPC-C and E databases implement measures to achieve data locality. Both the database and application are architected in a manner to work together with the SQL Server NUMA tuning features to achieve higher than random probability for data pages to be loaded into the local memory node. If both the database and application were not architected to work with the SQL Server NUMA tuning features, including TCP port mapping, then there is no expectation of achieving better than random distribution of memory access.

Server Sizing
This has deep implications on server sizing strategies. If the database and application were not architected for NUMA, then the expectation is that scaling versus NUMA nodes will be mediocre, perhaps 1.4X. A very practical consideration is then that we should seriously examined whether application performance requirements can be met on a single socket system.

Before laughing, consider that a server with a single Xeon E5 v4 can have 22 cores, and 44 logical processors with Hyper-Threading enabled. Transaction processing applications scale very well on HT. Each core today is about 20 times more powerful than the Pentium III processors from 1998. The single socket Xeon E5 can have 12 DIMM slots. The 64GB DIMM today has a moderate premium over the 32GB DIMM. So 768GB is possible, though 384GB is more economical in a single socket Xeon E5 system.

Insert Performance for Index on Sequentially Increasing (Identity) Column
Indexes on an identity column has been popular in SQL Server. Long ago, someone probably showed that there was no negative performance impact on a 4-way server having multiple threads inserting into a table with a clustered index on the identity column. This would have occurred when the system architecture was 4 processors on one bus and when processor meant single core.

Much has changed in almost 20 years. Today, there is contention when threads running on many cores are inserting into the last page of the same table. This is significant on a single socket system. Going from 4 cores to 10 cores, there is a reduction in insert performance from 40K rows per sec to 30K/s. It has been reported that on the a 4-way system, the impact can be more than a 10X reduction when many threads running on different cores on different sockets try to insert into a table with an index on the identity column (Thomas Grohser presentation).

Database Architecture 2

It is now time to complete the database architecture topic. The traditional topics of database architecture include: normalization to ensure data correctness, and unique keys and foreign keys working together to support the business logic. Both of these are hardware independent, pertaining to database theory.

Some software people adhere to a purist approach, refusing to consider any hardware details. In the modern world, we cannot ignore the nature of hardware. For one, systems can now have a very high number of cores, and two, multi-socket systems have non-uniform memory architecture.

To support high insert volume into a single table, the identity column should not be used. There needs to be a strategy for writes to be distributed over many pages. The identity column is acceptable for tables without excessively high insert volume or on a single socket system with not too many cores. But insert performance becomes progressively worse as core count increases, and then falls dramatically in multi-socket systems.

Memory Node Locality
The general idea is to enable memory node locality and avoid contention between threads. To achieve memory node locality starting from an empty buffer cache, we would like as much as practical for a group of data pages to be accessed from threads running on the cores of only one socket. This relies on SQL Server to load the pages accessed to the local memory node. One mechanism for doing so is if a group of tables have a common lead key. The application calls certain stored procedures using a connection with a port number determined by the lead key value. SQL Server in-turn uses TCP port number mappings to the Soft-NUMA nodes. This requires coordination between database architecture, application architecture and the SQL Server NUMA tuning features.

Thread Contention
To avoid contention between threads, the same approach could work, only with finer grain isolation to the thread level. Instead of grouping by processor socket (same as memory node on recent generation systems), grouping is now by thread or core. We can rely on the SQL Server SPID to logical processor mapping. This used to be fixed, SPID 51 was on logical processor 0, etc. It is not fixed anymore, but the default is still round-robin? Because logical processor enumeration is by core, then socket, both thread isolation and memory node locality is achieved.

The first priority is to avoid contention in concurrent threads from different sockets writing to the same page. The second priority is probably to avoid contention between threads writing to the same page even if the threads are on the same socket. Third is probably memory node locality for reads.


The main decision point in this project was whether a re-architecture was necessary. It was quickly assessed that it was not for the data consolidation. While a re-architecture to clean up inherent flaws would have long term benefits, it would pose a schedule risk for the immediate objective and could be deferred to a follow-on project.

A subordinate item was that testing at scaled size was not necessary. Simply testing at the expected data to memory ratio will account for the IO characteristics. With the very large memory configurations typical in recent years, a properly tuned database should be able to run well on a good hard disk storage system. But going forward, the cost of all-flash storage could be less than HDD storage system with a high number of drives.

In the course of this project, it became apparent that there were deeper flaws. Some were in traditional database architecture, and others due to the nature of modern multi-processor servers being NUMA. There were also reasons to believe that the combination of memory-optimized tables and natively compiled procedures would have had the larger impact than anything else.

Two hardware items are of high significance. Scaling with Hyper-threading is excellent for transaction processing. Avoid the Intel Xeon SKUs without HT. Scaling to multiple processor sockets is poor if the database architecture was not designed to enable memory locality with the SQL Server NUMA tuning options and the application was also architected with this in mind. Serious consideration should be given to a single-socket system, more so if the database was not architected for NUMA.

Testing and debugging methodology were important elements of this project. In addition to supporting the performance effort, some existing bugs were corrected along the way.

Another factor in this project was the client's confidence in the consultants and progress towards objectives. Establish both items quickly. If it is necessary to work on a long-term effort, then also do something else in parallel that has visible near term benefit.