Home, Query Optimizer, Benchmarks, Server Systems, System Architecture, Processors, Storage, TPC-H Studies

Load Test Manifesto

Load testing used to be a standard part of the software development, but not anymore. Now people express a preference for assessing performance on the production system. There is a lack of confidence that a load test reflects what will actually happen in production. In essence, it has become accepted that the value of load testing is not worth the cost and time, and perhaps whether there is any value at all.

The main problem is the load test plan criteria being excessively focused on the perceived importance of “real behavior”, instead of the factors that actually matter. Real behavior is frequently believed to be the random intervals between user actions and sometimes raw database size. The execution plan produced by the query optimizer is what is important. There are three factors in this:
  1) the formulas used for the cost of component SQL operations,
  2) the data distribution statistics, and
  3) the procedure/SQL compile parameters.

A second deficiency is treating the load test as a formality to validate that the application designed to “such and such” principles will meet arbitrary performance expectations instead of using the load test harness to investigate performance.

A third problem is neglecting to conduct a performance unit test, i.e., single queries without other load.

With an understanding of the database storage engine including the query optimizer, a load test can correctly reflect actual production system performance characteristics, and more importantly identify potential serious issues.

The software development disciple has established a number of principles (significantly under the umbrella term Agile) that are very important for many reasons, almost none of which have more than the nebulous connection to database performance.

Finally, while the actual business requirement might be for a specific load capability, the purpose of load testing is also to investigate. A load test may indicate that performance does not meet the objective. To assess the nature of the problem, it is necessary to determine whether the problem is at the component level, or in the ability to achieve throughput due to concurrency issues.

Query Optimizer - Cost Based Optimization

Almost all (?) modern database engines use cost based optimization, as opposed to rule based optimization. The details of the SQL Server query optimizer formulas are discussed in QDPMA Query Optimizer and by others elsewhere (Paul White on SQLBlog). The most significant elements in the SQL Server query optimizer formulas is relation between the key lookup (following an index seek) and the table (or index) scan cost.

The cost of a key lookup is dominated by the IO component of 0.003125 (=1/320). The incremental cost of a scan operation is also mostly in the IO component at 0.00074 (=1/1350) per page. The SQL Server query optimizer uses a model of the IO system based on a capability of 320 IOPS for non-sequential page access and 1,350 pages/sec (=10,800KB/sec) for scan operations. Of course, the absolute numbers are not important, only the ratio between IOPS and bandwidth. The query optimizer always assumes there is an IO element in the plan cost regardless of the amount of memory or whether the object is already in memory. There is no consideration for the actual random and sequential IO capability of the storage system.

The decision between using a nonclustered index seek followed by key lookup versus a scan operation is determined by the ratio of number of pages in the scan to the number of rows in the key lookup. For a non-parallel execution plan, when the CPU components of the plan are included, the effective pages to rows ratio may be about 3.5:1. In parallel plan at high DOP, the CPU components are reduced and the cost is entirely determined by the IO components, so the pages to rows ratio approaches 4.22 (1350/320).

Others have speculated that have speculated that the conditions for using an index could be based on the percentage of rows selected or perhaps the comparing the logical IO count between the key lookup and scan. Even a cursory examination of the SQL Server query optimizer shows that neither is true. It is only a coincidence that for a b-tree depth of 3 or 4, the true cross-over from index + key lookup to scan occurs near the logical IO cross-over point. A simple test on a table with b-tree depth 5 or higher shows that logical is not a consideration. It can also be shown that the key lookup to scan cross-over occurs at the same point for both clustered index and heap tables. The key lookup to a heap is only 1 logical IO per row.

It also so happens that the loop join operation is essentially the same as a key lookup such that the SQL Server query optimizer uses the same cost formula and may use the same (loop join) symbol depending on the version. At some point, the execution plan shifts from a loop join with index seek on the inner source to a hash join with a scan. This is governed by the same set of formulas as for key lookup versus scan, with the difference being the hash join operation. (There is a nuance in the loop join inner source formula that impacts database architecture.)

The main implication of this is that cardinality is important along with the average row size of key tables in the test database. Both meanings of cardinality apply; the uniqueness of a column within a table and the number of rows in the relation between tables.

The second element is the average row size. This is to ensure that the table has a reasonably correct number of pages relative to the number of rows.

The absolute size of the database or an individual table is not particularly important in a transaction processing system where the expectation is that the critical queries are properly indexed. The important criteria are the cardinality and the page to row ratio.

Data Distribution Statistics

The previous reference to cardinality did not elaborate on distribution. Obviously we do not expect uniform cardinality in a database. An example being every customer has exactly 10 orders, and each order having exactly 10 line items. Not only does the actual data distribution matter, but also the data structured used by SQL Server to represent data distribution statistics. Other important aspects are the technical details of random page sampling, when statistics are resampled.

The data structure used to represent data distribution is shown below. There are three parts, a header with scalar values, the density vector for overall distribution with successively increasing number of keys, and finally the histogram at the end. There can be up to 200 steps for keys for which there is an equal rows value. The distribution between keys is handled with a distinct and total range row values from which the average range rows can be calculated.

The purpose of the histogram is an attempt to handle skewed data distribution. Presumably one would try to use the key values for the skewed distributions that are not successive. Obviously there are limitations to how much skew this can handle, but a set data structure with size limits can only do so much.

If the impact of data distribution were understood before the database became populated, then there are strategies that could be adopted to ensure good statistics are possible and help in ensuring consistently good execution plans. Otherwise, we can only rely on the inherent capabilities of SQL Server.

Next, the other aspects: sampling and percentage. From sampling theory, we know that based on a true random sample, the accuracy is expect to be the square root of the number samples at a given value, with relative accuracy as the inverse of the square root. There are some indications that SQL Server follows this principle in determining the sampling percentage.

The difference is that SQL Server samples all rows from randomly selected pages (including first and last?). The reason for this is obvious, it is much less expensive. A large chunk of the work is getting the page, so there only a little more work to sample the column value for all the rows in the page. To reduce the impact of correlation between page and value, it tries to use a nonclustered index for which the column is not the lead key, see Elisabeth Redei.

When an index is built or rebuilt, the Rows sampled is always 100%. On an update statistics call without specifying FullScan, whether automatic or manual, the sampling can be seriously off on the high side. Below it the fullscan statistics for the TPC-H Lineitem table, nonclustered index on Partkey.

Below is the distribution after UPDATE STATISTICS at default (re-)sampling.

The full scan has correct values for EQ_Rows, while the partial sample is too high. Interestingly, in both cases, the Avg Range Rows is correct. This occurs in both SQL Server 2008R2 and 2012. Earlier version did not? Below is the execution plan detail with both estimated and actual rows counts based on fullscan statistics and default partial sample for one of the equal rows.

A significantly incorrect row estimate may not have negative impact on the execution plan operation at the source. This is because it would take a much larger error to change itself to change the plan from an index seek to a scan. However, it is more likely to have negative consequences further on in the execution plan of a more complex query with row estimate propagation errors.

The final aspect of statistics is resampling. From a brand new database (with no statistics), statistics are automatically created when a column is referenced in the search or join condition. Indexes have statistics, and there can also be statistics on columns. Thereafter, statistics are marked for recompute when certain thresholds are exceeded. The recompute occurs on the next query referencing impacted columns. The statistics recompute thresholds used to on the first 6 and 500 rows modified, and every 20% thereafter. After a recompute, any affected execution plans are also marked for recompile. (I am not sure if these thresholds have since changed or on the exact recompute and recompile sequence, so refer to other documents on this matter.)

As a side note, permanent tables and temp tables are subject to statistics create and recompute. Table variables do not have statistics, and the execution plan assumes 1 row and 1 page for the table variable. In SQL Server 7 and 2000, the execution plan recompile following a statistics recompute could be more expensive than the actual execute for complex queries involving few rows.

This may have been the reason table variables were introduced. But the explanation given for table variables was convoluted and probably resulted in incorrect strategies being adopted over whether to use temp tables or table variables. From SQL Server 2008 on, it does seem that the execution plan generation (query optimization) is far more efficient that this is less of a problem. I am of the opinion that the 6 rows statistics recompute and plan recompile never helps, but the 500 row and subsequent threshold points are useful.

The statistics recompute points introduce interesting effects on SQL Server performance. Consider the following scenario. A particular column has few distinct values (200 or less, the step limit of the statistics data structure), perhaps all 0. Statistics are currently accurate. A query modifies a number of rows to a new value that did not previously exist. The next query specifies this new value on the column as a search argument with joins to other tables.

If a large number of rows are modified in the first query, then statistics are recomputed and a good execution plan is produced for the second query. If a smaller number of rows are modified (below the recompute threshold), then the second query generates an execution plan based on statistics saying with certainty that zero rows meet the search criteria. The execution plan will then start with a table access applies the search criteria. The subsequent operations are then based on the assumption that zero rows come from the initial operation.

The execution plan always shows 1 row when statistics says 0 rows. Say that 1 row from the first table joins to many rows in the next table. At this point the execution plan will still show an estimate of 1 row. So when the estimate is 0, 1 is always shown, but the propagation estimates are based on 0. This may or may not lead to a poor plan, but it could lead to a disastrous plan.

The intuitive reasoning may have been that more rows modified in the first query requires more work in the second, but the internal working of SQL Server renders such intuitive reasoning completely irrelevant. There is nothing fundamentally wrong with the fact that SQL Server has statistics recompute thresholds, or what the threshold values are. The point is that we must consider this in our database.

There are many more interesting affects (if there is not a problem, it would not be interesting) dues to the nature of statistics. This is only an overview on points most related to load testing. The full details of statistics in SQL Server are important for many reasons. Search for the Technical Article “Statistics Used by the Query Optimizer in Microsoft SQL Server 2008” by Eric Hanson and Yavor Angelov, with Lubor Kollar as contributor. Other SQL Server statistics material is found from SQL CAT and Benjamin Naverez.

Compile Parameters and Variables

The final link in the chain from the query optimizer and data distribution statistics are the execution plan compile parameters. In a stored procedure, the actual parameter values are used when the stored procedure is compiled. Towards the end of the sqlplan file, there is a node for ParameterList. A stored procedure may also have internally declared variables. These values are assumed to be unknown. The estimate for unknown is based on the density value in the density vector section. The estimate for known parameter values is taken from the histogram, either the equal value or range rows as appropriate.

This has deep implications. Just because the test database was populated with a good choice of cardinality, page to row ratio and even good data distribution skew, this does not mean a load test will generate self-consistent performance throughput results or results consistent with what will happen in production, which may also be inconsistent.

Obviously we expect that there could be differences in the execution plans for procedures between compile parameters with known low estimated row count compared to compile parameters with known high estimated row count. As stated earlier the impact of consequence may not be in the immediate operation, but rather further on in the execution plan as result of differences between estimated and actual row counts.

There are some measures for handling this in SQL Server. One is the WITH RECOMPILE, which directs a recompile on execute. This means the current parameters are used for optimization, which leads to as good a plan as the query optimizer can produce given its capability, at the expense of requiring compile for each execute. Another measure is to direct the optimization based on a specified parameter value instead of the actual parameter value. A variation on this is to direct optimization for unknown, which should be the same as declaring a variable, and using the variable in the query, set equal to the parameter.

The Load Test Plan

With knowledge of the significant factors that affect database performance, it becomes clear how the load test plan should be constructed. First cardinality on columns within a table and between tables should be defined. Next a reasonable estimate is needed for the average row size in bytes. In a well design database, there should be few nullable columns in the critical tables. So the main variable is the average number of characters in the variable length string (or binary) columns. With this, the populated test tables should have the correct page-row ratio that would influence the execution plan between key lookup and scan or loop versus hash join.

In the old days, there might have been valid reasons why it would be impractical to populate the test database to the full anticipated production size. Today we could put two 1TB SSDs in a large laptop if we wanted to. If the developer complains on space, then he/she probably has too much non-project content (hint).

Still, if it were desired to anticipate future problems (most of which will be due to an execution plan change) without a full size database, this could be accomplished by scaling cardinality in relation to size . For example, if it is anticipated on the production server at some point a particular index would select 100,000 rows from an index, and the table size would be 2.8GB (350K pages), the execution plan could be modeled by scaling the selectivity to 10K rows and 280MB (35K pages). For good measure, we could scale the buffer cache size with the database size as well. There might still be discrepancies due to differences in the number of logical processors between test and production unless MAXDOP is restricted.

After defining cardinality and the table page-row ratio, the next step is data distribution. This point is interesting because of the data structure that SQL Server maintains to model data distribution, described earlier with header, vector and histogram portions. Just because two sets of data have the same distribution profile does not mean that SQL Server will generate the same distribution statistics for both.

The critical aspect is whether skewed data distribution is accounted in the same manner between test and production. If the skewed key values are isolated, then the SQL Server statistics could try to account for up to 200 values with the equal steps. If the skewed values are adjacent, SQL Server could try to lump these into the range rows. In any case, the SQL Server statistics data structure can only do so much to accurately model arbitrarily populated heavily skewed data distributions.

We could go to a great deal of effort so that both the test and production databases have not just identical data distribution, but also identical data distribution statistics as in the output of DBCC SHOW_STATISTICS. Alternatively, with our understanding of the SQL Server data distribution statistics data structure, we could come up with a strategy so that the SQL Server statistics will be mostly correct in both systems.

One is to directly manage the assignment of key values. For example, assign large customers an id from a low range, medium customers to an id from a middle range and small customers to an id from the high range, then the statistics data structure can capture this reasonably well. Other strategies are possible, such as using filtered indexes, along with writing the SQL so that the optimizer knows that it can use the filtered index, which happens to provide more accurate statistics. It is even possible to falsify statistics, but that is another discussion.

Finally, the load test plan must account for differences in execution plans due to compile parameters, along with variations in data skew on the actual parameters. The first question is how many different execution plans are possible depending on the compile parameters. If there is more than one possible plan, then how many execution plans are necessary? If only one plan is necessary, then we can just force whatever compile parameter is necessary to generate the one good plan.

Let us assume that with proper database design, thoughtfully constructed SQL and indexes, that the query optimizer produces a good execution plan when the compile parameter has low estimated row count when there are in fact few rows and also produces a good plan when the compile parameter has high estimated count when the actual is in fact high. If either the plan for low estimate rows is not suitable for high actual rows or the plan for high estimate rows is not suitable for low actual rows, then more than one plan is necessary. If the cost of compile is low relative to the cost of execution, then an acceptable solution is to apply the WITH RECOMPILE hint. Otherwise, we need devise a strategy so there are the necessary distinct execution plan that can be reused and such that the correct plan is used for the appropriate situation.

A final point, no data is provided here that demonstrate that the typical delays in user actions along with random variation do not affect system performance. In several tests, it has been shown that the impact of reasonable user delays between calls is that there are more concurrent open connections. This may impact application server requirements such as memory, but the impact on the database server is believed to be minor. So there has been little interest in investigating this matter further.

Unit Testing for Performance

Not many software development projects specify a unit test for performance. The requirement is only for throughput and response time under load. Of course, this is a rather trivial exercise so one could argue that it is not necessary list such minor items. If the presumption is that the application coming from development as is will meet performance objectives without rework, then the load test is a mere formality and a unit test for performance would have no purpose.

If on the other hand the load test shows that there are issues, then what is the cause? Are all the individual elements good and the problem only a matter of scaling throughput with multiple concurrent streams of traffic? A comparison of the unit test with the load test would provide this information transparently. The performance unit test can be accomplished simply by running the complete sequence of calls using the load test harness set to a single stream.


The point of interest in this discussion is not simply an explanation of the key factors that cause apparently inexplicable widely differing results in load tests as well as in the production environment. Instead of just attempting to reproduce what might happen in production, we see that it is actually possible to avert problems in production before it happens with the right load test strategy.

This is unlikely to be accomplished by attempting to build the test plan on general intuitive principles that would appear to reflect the real world. The SQL Server query optimizer is not built on an intuitive perception of the real world, but rather on cost based optimization with data distribution statistics. By targeting the critical aspects of the query optimizer, we can replicate any possible execution plan, which in turn implies almost any possible problem can be replicated as well.

But it is also possible to work strategies for averting problem execution plans into the SQL code as well. This might be controlling the stored procedure compile parameters and identifying critical (index) statistics that need non-default (probably fullscan) update maintenance. More sophisticated strategies might involve explicitly directing primary key assignment or possibly even using false statistics (can a lie correct a wrong?).


Some additional items.
This applies to transactional databases in which much of the active data is not in memory, and particularly more important when residing on hard disk storage systems. For an estimated row count of the key lookup or loop join inner source access is less than 25 rows, SQL Server issues synchronous disk IO. For over 25 rows (I am not sure if the cutoff is equal or over 25 rows) the disk IO is asynchronous. Asynchronous IO at higher queue depth will yield better performance on hard disk storage. Flash/SSD storage should have sufficiently low access latency such that this effect is not as critical

It used to be that updates using nonclustered indexes generated synchronous IO instead of asynchronous regardless of the estimated row count. I am not sure if this has been fixed in more recent versions?

In tables with varchar(max) fields that are stored outside of normal pages, a pointer to the actual page is stored instead (could someone provide a reference for this). Disk IO for lob pages is necessarily synchronous, because the in-row page and row must be accessed to determine the pointer? (Its too bad SQL Server does not have a feature for an included column on just the pointer and not the full column?)