Home, Optimizer, Benchmarks, Server Systems, System Architecture, Processors, Storage,

Relativity Part 1, Part 2, Part 3, Part 4, Part 5, Addendum

5) Full-Text Search

Full Text Search queries can cause problems for the same inherent reasons discussed earlier. One is row estimate error. Another is the AND-OR combination that prevents SQL Server from deducing that an equality clause exists.

Full-Text Search on Loop Join Inner Source - Difference

Previous examples illustrated the difficulties in estimating rows for a query with sub-expressions that take the difference of 2 or more sets. When there are estimation errors either way, a bad execution plan could result. When the estimated rows is high and the actual rows is low, the Top clause is one more source of trouble.

Consider the execution plan below, evaluating the difference of two expressions.

Estimated Plan - Difference
Diff_Est

By itself, the above execution plan is not a problem. SQL Server can execute loop, hash and merge joins just fine.

The problem occurs when the difference result feeds into another operation, as shown below going into a Full Text Search.

Estimated Plan - Difference, Full-Text
Diff_FT_Est

Because the estimated rows from the difference show as 1 (SQL Server substitutes 1 row whenever the estimate is less than 1, so 1 might really mean 0), the subsequent join operator is a loop join. The loop join is most efficient if there are zero rows from the outer source because the the inner source is not evaluated?

The loop join is still more efficient when there is one row from the outer source. For one row from the outer source, the inner source is evaluated once regardless of the join type, and the loop join has lower fixed overhead than hash or merge joins. In addition, a loop join can use the join predicate as a search argument.

The problem can occur when there are more than one row from the outer source. It would not be much of a problem if the inner source was a highly selective index seek. But this is a serious problem if the inner source operation was substantial: a full scan, a non-selective index seek or a Full-Text Search.

If the loop join inner source takes 1 CPU-sec to evaluate, and there are 1M rows from the outer source, the query will not complete (OK, it will in 11 days 13 hours at DOP 1). At high DOP, one query could put a significant load on the system. Even worse is when the users believes something is wrong, opens another session to re-issue the same query while the first is still runnning, and then a third session until the system is fully saturated. This really did happen.

It is rather unfortunate that SQL Server rigidly elects the lowest cost plan based on estimated rows. There is no consideration for the impact if the row estimate is wrong. The benefit of the loop join for a correct estimate is only slight over a hash or merge join. The consequence of a wrong estimate can be catastrophic with the loop join, but is negligible with the hash or merge joins.

Full Text Search on Loop Join Inner Source - AND/OR

The first example of the FTS on Loop Join inner source due to the well-known problem of row estimate error. The query plan below is probably due to a defect in the SQL Server query optimizer for FTS operations.

FullText_LoopJoinIS_2

The query is:

  SELECT COUNT([Document].[ArtifactID])
  FROM [Document] (NOLOCK)
  WHERE [Document].[AccessControlListID_D] IN (1,1000064)
  AND ((
   [Document].[RI_1003671] > 10
   OR
   CONTAINS(( [Document].[Filename]), '"File1" AND "File2"')
  ))

The original Relativity query involved a varchar(max) column in the first OR expression. Here it is replaced with an in-row column to avoid the LOB IO issue discussed earlier.

Here, the query optimizer clearly knowns there are 12M rows coming from the Loop Join outer source. The presence of the OR condition causes the Full Text Search to be in the Loop Join inner source. Here it is executed once per row from the outer source even though each execute of the FTS produces the same set of rows.

The reason for this plan is probably the same as the AND - OR combination example. The query optimizer is not capable of converting this combination into a join with one equality clause.

The fix for this situation is to write the SQL as separate queries consolidated with UNION.

SELECT COUNT([Document].[ArtifactID])
FROM (
  SELECT [Document].[ArtifactID]
  FROM [Document] (NOLOCK)
  WHERE [Document].[AccessControlListID_D] IN (1,1000064)
  AND [Document].[RI_1003671] > 10
  UNION
  SELECT [Document].[ArtifactID]
  FROM [Document] (NOLOCK)
  WHERE [Document].[AccessControlListID_D] IN (1,1000064)
  AND CONTAINS(( [Document].[Filename]), '"File1" AND "File2"')
) x

If the query required the Full Text rank we would do a UNION ALL and a follow-on GROUP BY to extract the Rank if required.

6) Additional Items

Before concluding, some additional items can be brought up. One is stored procedures and another is the Count followed by Top policy.

Stored Procedure Option

Previously, it was mentioned that the search queries defined by users from the UI. It might seem that there is no value in stored procedures as is the case in transaction processing systems. Now in the previous sections, problematic SQL were identified and fixes were proposed. Some are easy, such as not using the TOP clause when the actual row count is known to be low. It may be possible to incorporate fixes into the Relativity query generator. But it will not be possible to handle every situation.

Should a query time-out, there could be an option for Relativity to generate a stored procedure instead of calling the SQL directly. This would allow an expert to examine the SQL and make adjustments as necessary until a permanent fix can be deployed. A note could be made into AuditRecord that the stored procedure was manually tuned, hence Relativity cannot guarantee that the results are as intended.

Relativity search is not also entirely ad-hoc. Once a good search has been established, a user may execute it many times, perhaps going to source document, and then returning to the search.

Count - TOP Strategy

Full Text Search also brings into question the Relativity strategy of Count followed by rowset using the TOP clause. I did not bring up this matter earlier because most of the regular SQL queries could be made to be efficient. Some FTS can be inherently expensive. Of course, regular queries can be expensive as well. The purpose of the Count followed by Top is to let the user know the total number of records, while not overwhelming the application server with 1M rows. We should consider that the rowset query only sends back the ArtifactID, a 4-byte integer value. So it is not exactly as if 1M rows would overwhelm a good application server.

I do not know what the Relativity application server code looks like, but there should be 2 options. One is the code sequence is
  SqlDataReader = SqlCommand.ExecuteReader();
  DataTable.Load(SqlDataReader);

The other possibility
  SqlDataReader = SqlCommand.ExecuteReader();
then loop around
  Array[index++] = SqlDataReader.GetInt32();

The first has simpler code. This would need the Top clause to avoid overwhelming the app server with data. Furthermore, the DataTable structure (class) is probably not simple, so the virtual address space allocation is much more than 4 byte per row for just the ArtifactID.

The second option could be accompanied by declaring an array of 4-byte Int32, pre-size to perhaps a length of 1000-10000 values? As we loop through the reader, the array is populated up to the size of the array. Once past the pre-allocated size, we can continue to loop to get a count. Perhaps we do not need to loop to the end, because the user may not be interested in an exact count but rather just that there are very many rows.

My inclination is to implement this strategy for search queries that can be expected to be expensive, perhaps compound Full Text Searches. Given that users may re-execute certain searches, Relativity could track the execution time for both the count and rowset queries on each search. If the count query time is more than a preset value, then on the next execute it would use the alternative strategy.

Summary

A number of problems can occur with the SQL generated by Relativity. None have been encountered that could not be fixed. But most require intervention by kCura, and they have not acknowledged all as issues that warrant action. The reasons for the performance problems are understood. For time being, users can be instructed to work around these issues instead of repeatedly trying to run the impossible queries. And most definitely do not open multiple sessions all executing the same impossible query, none of which will return results.

On the hardware and system level side, there are special considerations for IO to LOB pages. A filegroup and partitioning strategy should be considered to avoid fragmentation of the Document LOB pages and to avoid table locks on AuditRecord during defragmentation.

 

 

Relativity Part 1, Part 2, Part 3, Part 4, Part 5, Addendum