
ZCodeArtifact and SQL Server Statistics

The system that SQL Server employs for automatically creating and updating data distribution statistics is meant to "work transparently" for a "common transaction processing" application. The problem is that SQL Server is used in many very different types of applications that bear no resemblance to common transaction processing, such that the details of how statistics are implemented are absolutely critical in avoiding serious performance problems.

An example of this can be seen in the queries involving the ZCodeArtifact tables in Relativity version 8.1. In previous versions of Relativity (7.x), there was a CodeArtifact table with three columns: CodeTypeID, CodeArtifactID, AssociatedArtifactID. In version 8.x(?) this functionality is now implemented with many ZCodeArtifact_xxxxxxx tables, each with two columns: CodeArtifactID, AssociatedArtifactID. (Presumably there is one table for each of the former CodeTypeID values?)
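As a sketch, the assumed schema change is along the following lines (column types are assumptions, and the table-name suffix is just the example used throughout this article):

 -- Relativity 7.x: one table for all code types
 CREATE TABLE EDDSDBO.CodeArtifact
 (CodeTypeID int, CodeArtifactID int, AssociatedArtifactID int)
 -- Relativity 8.x: one narrower table per code type
 CREATE TABLE EDDSDBO.ZCodeArtifact_1000170
 (CodeArtifactID int, AssociatedArtifactID int)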

In the query below, the usage pattern of the ZCodeArtifact tables is that each CodeArtifactID value has many AssociatedArtifactID values, and the search query specifies either inclusive or exclusive sets of rows for various combinations of ZCodeArtifact table and CodeArtifactID value.

 SELECT DISTINCT [Document].[RI_1003671]
 FROM   EDDSDBO.[Document] (NOLOCK)
 LEFT JOIN [SEARCHTABLE_1] ON [SEARCHTABLE_1].[Key] = [Document].[ArtifactID]
 WHERE (EXISTS(SELECT CodeArtifactID FROM EDDSDBO.ZCodeArtifact_1000170 (NOLOCK)
    WHERE AssociatedArtifactID = [Document].[ArtifactID] AND CodeArtifactID IN (1588779))
   AND NOT EXISTS(SELECT CodeArtifactID FROM EDDSDBO.ZCodeArtifact_1000171 (NOLOCK)
    WHERE AssociatedArtifactID = [Document].[ArtifactID] AND CodeArtifactID IN (1181439))
   AND  EXISTS(SELECT CodeArtifactID FROM EDDSDBO.ZCodeArtifact_1000057 (NOLOCK)
    WHERE AssociatedArtifactID = [Document].[ArtifactID] AND CodeArtifactID IN (1061045)))
 AND [Document].[AccessControlListID_D] IN (1)

 SELECT DISTINCT [Document].[RI_1003671]
 FROM   EDDSDBO.[Document] (NOLOCK)
 LEFT JOIN [SEARCHTABLE_1] ON [SEARCHTABLE_1].[Key] = [Document].[ArtifactID]
 LEFT JOIN [GROUPIDENTIFIER_1] ON [GROUPIDENTIFIER_1].[RI_1003671] = [Document].[RI_1003671]
 WHERE (EXISTS(SELECT CodeArtifactID FROM EDDSDBO.ZCodeArtifact_1000170 (NOLOCK)
    WHERE AssociatedArtifactID = [Document].[ArtifactID] AND CodeArtifactID IN (1588779))
   AND NOT EXISTS(SELECT CodeArtifactID FROM EDDSDBO.ZCodeArtifact_1000171 (NOLOCK)
    WHERE AssociatedArtifactID = [Document].[ArtifactID] AND CodeArtifactID IN (1181439)))
 AND [Document].[AccessControlListID_D] IN (1)

 SELECT TOP 1000 [Document].[ArtifactID]
 FROM   EDDSDBO.[Document] (NOLOCK)
 LEFT JOIN [GROUPIDENTIFIER_2] ON [GROUPIDENTIFIER_2].[RI_1003671] = [Document].[RI_1003671]
 WHERE [Document].[AccessControlListID_D] IN (1)
 ORDER BY [Document].[ControlNumber], [Document].[ArtifactID]

Below are the up-to-date data distribution statistics for the ZCodeArtifact_1000170 table. There are two important points in the data distribution of the CodeArtifactID values with regard to the technical details of how SQL Server manages and uses statistics. One is that there is a limited number of distinct values; when the number of distinct values is less than 200 (the maximum number of histogram steps), the SQL Server statistics may know exactly which values exist, so it also knows that other values do not exist in the table. The second, while not apparent, is that new CodeArtifactID values are ever-increasing. The mechanics of SQL Server data distribution statistics are such that it tries to discover the lower and upper bounds; this is easy when the statistics are on an index. The implication is that when a new CodeArtifactID is inserted but statistics are not updated, SQL Server believes that the new value does not exist.
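For reference, the histogram can be examined with DBCC SHOW_STATISTICS; the statistics name below is an assumption (use sp_helpstats 'EDDSDBO.ZCodeArtifact_1000170', 'ALL' to list the actual names on this table):

 -- show only the histogram steps for the CodeArtifactID statistics
 DBCC SHOW_STATISTICS ('EDDSDBO.ZCodeArtifact_1000170', IX_CodeArtifactID)
 WITH HISTOGRAM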

The problem occurs in the following sequence of events. Statistics were updated before the rows for the last CodeArtifactID value (1588779) were inserted, either via the SQL Server automatic criteria or the standard Relativity nightly index fragmentation and statistics maintenance job. During the day (or whatever normal user working hours are), 45,489 rows with the new CodeArtifactID value 1588779 are inserted. This does not trigger the SQL Server automatic statistics update criteria (the first 6 and 500 rows modified, and every 20% thereafter; newer versions have a lower dynamic threshold for large tables?). And then a query is made referencing the new value.
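To put rough numbers on the update criteria, the sketch below compares the classic threshold (500 modifications plus 20% of table rows) against the dynamic threshold used by newer versions and trace flag 2371 (approximately SQRT(1000 × rows)); the exact formulas are stated here as assumptions:

 SELECT n AS table_rows,
   500 + 0.20 * n AS classic_threshold,
   SQRT(1000.0 * n) AS dynamic_threshold
 FROM (VALUES (500000), (5000000), (50000000)) AS t(n)

For a 5M-row table, the classic rule requires roughly one million modified rows before an automatic update, so an insert of 45,489 rows falls far below the threshold.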

In the more "typical" SQL Server usage, queries are inside stored procedures, and either the stored procedure was compiled with parameter values that map to statistics values that existed at the time, or the DBA implemented some alternative control, either by copying parameters to variables or by use of the OPTIMIZE FOR hint.
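The two controls mentioned above might look like the following sketch (the procedure and parameter names are hypothetical):

 CREATE PROCEDURE EDDSDBO.SearchByCode @CodeID int AS
 -- option 1: copy the parameter to a local variable, so the optimizer
 -- uses average density rather than the compile-time parameter value
 DECLARE @c int = @CodeID
 SELECT AssociatedArtifactID FROM EDDSDBO.ZCodeArtifact_1000170
 WHERE CodeArtifactID = @c
 -- option 2: the OPTIMIZE FOR hint, here requesting a density-based estimate
 SELECT AssociatedArtifactID FROM EDDSDBO.ZCodeArtifact_1000170
 WHERE CodeArtifactID = @CodeID
 OPTION (OPTIMIZE FOR (@CodeID UNKNOWN))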

In Relativity, search queries are issued as language events (SQL:Batch). This is the correct approach, because Relativity search queries are expected to be complicated and data distributions are expected to be skewed, so it is desirable to generate a new execution plan for the same logic with different search (parameter) values. The problem is that the SQL is generated in the application code, and the DBA has no means of providing tuning hints.

Estimated Execution Plans

Now let's examine the execution plan that the SQL Server query optimizer generates for the query in question, first with existing statistics in which the ZCodeArtifact_1000170 table CodeArtifactID column value 1588779 does not exist, and then with correct statistics. The estimated execution plan for the first case (full plan) is shown below in two portions. The upper and lower portions feed into a join to the left in a loop join operation (not shown).

Note that both portions start with an index seek on ZCodeArtifact_1000170. This is because the general theme of the most efficient execution plan is to start with the table with the fewest rows, or otherwise the one most likely to achieve the greatest reduction in the number of rows that need to be evaluated, as quickly as possible when proper indexes are present. The SQL Server query optimizer believes that there are zero rows in ZCodeArtifact_1000170 (though the execution plan shows the estimated number of rows as 1 when statistics indicate zero). The join type is Nested Loops, or loop join, which has a low fixed/setup cost (in processing the first row).

Below is the estimated execution plan for the query involving ZCodeArtifact with correct statistics (full plan). The upper and lower portions also join in a Nested Loops operation, as in the above plan.

Below are the Index Seek details for ZCodeArtifact_1000170, the first operation in the upper portion (upper right). The detail on the left is for incorrect statistics, showing Estimated Number of Rows at 1, when the statistics actually indicate the value does not exist (zero rows). The detail on the right has the correct value of 45,489.

It might seem that the significant difference between the two estimated execution plans is the fat arrows in the upper portions representing a large number of rows, which in turn lead to a Hash Match for the first two joins. The hash join has a higher fixed (setup) cost (for the first row) than a loop join, but a lower cost per incremental row processed. But of course it is more complicated than this.

Even with correct knowledge that there are many rows at the source tables in the upper portion of the execution plan, estimating rows through joins is not simple. In this particular case, the query optimizer believes that after the second Hash Match (Left Anti Semi Join), the number of rows remaining becomes small. This is because there are 270,422 rows from ZCodeArtifact_1000171, and the search condition is IN Set A (45K rows from 170) and NOT IN Set B (270K rows from 171). Hence the reasonable assumption is that few rows meet both of these conditions, so both the join to the lower portion and the joins within the lower portion are loop joins.

The important difference between these two plans is that the starting point for the lower portion is no longer ZCodeArtifact_1000170, but another table (Document). This is because it is now known that ZCodeArtifact_1000170 has 45K matching rows.

Actual Execution Plans

The distinction between estimated and actual execution plans can be somewhat misleading, because the estimated plan for a single statement is the actual plan; the difference is that the actual plan has additional information, including the actual number of rows at each step. (In SQL batches with multiple statements, write statements may trigger a recompile, altering the plan for subsequent statements.)

Below is the actual execution plan for the query with correct table statistics.

In fact, Set A and Set B are completely disjoint, so all 45K rows from ZCodeArtifact_1000170 in the upper portion pass the Left Anti Semi Join. This means that the lower portion (of the leftmost Nested Loops) is executed once per row from the outer source.

Now if the query optimizer had known that there would be a large number of rows from the outer source, it might have generated a different plan for the lower portion. But that is not the critical element here, because the SQL Server engine can still brute-force its way through this plan in 2 seconds (this is a lot of CPU cycles, but we have a lot of CPU cycles).

Below is the lower portion of the actual execution plan with incorrect statistics. This portion is actually executed 45,489 times instead of the estimated 1 (really 0). The difference is that the lower portion of the incorrect-statistics plan starts with ZCodeArtifact_1000170, which now also produces 45,489 rows for each row from the outer source.

So what happens here is that each of the 45,489 rows from the upper portion requires an execution of the lower portion, for which the first (rightmost) operation also generates 45,489 executes of the loop join to Document (producing 119K rows), for a total of 45,489 × 45,489 ≈ 2.07 billion loop join executes.

Even with a very powerful processor core (note that this is a non-parallel execution plan), the execution time for this is 4,884 seconds (1hr 21min).

Proposed Solution

The simplest solution for this situation is to explicitly update statistics on the specific ZCodeArtifact table whenever a new CodeArtifactID value is added (or increased?). We cannot rely on SQL Server Agent jobs to update ZCodeArtifact statistics frequently enough, because it is likely that the search query is issued very soon after the new CodeArtifactID values are added. This is good enough to reduce query execution time from 4,884 seconds to 2 seconds.
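In T-SQL, the explicit update issued after the batch insert of a new CodeArtifactID value would be along these lines (the statistics name in the second form is an assumption):

 -- refresh all statistics on the affected table, with a full scan
 UPDATE STATISTICS EDDSDBO.ZCodeArtifact_1000170 WITH FULLSCAN
 -- or refresh only the statistics object covering CodeArtifactID:
 UPDATE STATISTICS EDDSDBO.ZCodeArtifact_1000170 (IX_CodeArtifactID) WITH FULLSCAN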

In this example, it is not necessary to implement a full strategy of avoiding the estimate error on the Anti Semi Join with an alternative SQL expression methodology, which might perhaps improve the 2 sec to less than 1 sec. There are situations where this strategy might be necessary, to be documented elsewhere.


Note: a Test database was created with just these tables and limited columns for in-depth investigation.