
Parallelism Strategy and Comments (2010-10)

Parallelism is already very important today with 4- to 12-core processors and 2- to 8-way servers. Going forward, parallelism will become even more important. Naturally, we expect SQL Server parallel execution performance to improve over time. We also expect parallel SQL write capability in the near future, if not with SQL Server 2011 (11.0), then in a hot-fix soon thereafter.

Automated parallel write operations (Insert, Update, Delete, and Load) are so important that the usefulness of large machines is seriously limited without this capability. It is like the exotic Italian supercars that really required an on-premises mechanic. On this matter, Adam Machanic has taken the initiative with a Query Parallelizer CLR library. Another aspect not frequently discussed is parallelization strategy, which is the topic of discussion here.

Max Right Degree of Parallelism

Parallelization strategy concerns assessing the right number of processors to assign to a query, both when assuming full uncontested access to the entire system's resources and in the presence of other concurrent, dynamically varying activity. The fact is that we expect some queries to scale excellently to very high DOP. Others will scale to a certain degree of parallelism and then experience diminishing gains, level off, or even scale negatively.

We would like a sufficiently accurate plan cost model to make this assessment, in order to bring the full power of the server to bear when there are worthwhile benefits, and not squander system resources when there is little, zero, or outright negative impact. Potentially, this could even be a configurable setting, perhaps with a recommendation to limit further increases in parallelism when the expected performance gain drops below 50% for each doubling of the number of processors. Limiting the degree of parallelism is especially important for moderate cost queries that definitely benefit from parallelism, but do not require or even benefit from high parallelism.
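
To make the 50%-per-doubling rule concrete, here is a minimal sketch with hypothetical timings (the numbers are assumptions purely for illustration, not measurements): doubling DOP should only proceed if it buys at least a 1.5x speedup.

-- Hypothetical modeled elapsed times for one query at DOP 4 and DOP 8:
DECLARE @sec_dop4 float = 10.0;   -- assumed elapsed time at DOP 4
DECLARE @sec_dop8 float = 8.0;    -- assumed elapsed time at DOP 8
SELECT CASE WHEN @sec_dop4 / @sec_dop8 >= 1.5
            THEN 'worthwhile: raise DOP to 8'
            ELSE 'diminishing returns: stay at DOP 4'
       END AS recommendation;     -- 10/8 = 1.25x, so stay at DOP 4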

SQL Good Citizenship

In the presence of other concurrent and dynamically varying activity, we would like good cooperative system behavior. A high cost, compute intensive query should yield threads/processors to other activity, especially to what should be short duration transactions, but then resume full use of the system when it becomes available. The SQL Server query optimizer will almost always elect maximum parallelism, even for moderate plan costs. Parallel execution plans generated by SQL Server are based on the costing for the maximum allowed DOP, but the actual number of threads assigned is a run-time decision.

On frequently called queries where the plan cost marginally qualifies for parallel execution, firing off too many threads too often seems to cause serious contention issues. It also seems that SQL Server locks each thread of a parallel execution plan to a specific processor (I am not sure about the spawning thread). If a processor core is fully utilized by other threads, the parallel plan may stall on that thread, even after the other threads have completed their assigned tasks.

Dynamic DOP during Execution

I suppose there is a good reason we cannot dynamically adjust the number of threads assigned to a parallel plan for a query execution in flight. If that is the case, then I think it is best to start with the proper number of threads based on there being no other activity, but to dynamically move threads as necessary. Naturally, there is benefit to keeping each thread on a fixed core when practical. However, to be a good (SQL) citizen, big queries should vacate threads for queries of short duration, and piggyback threads if some cores are unexpectedly fully loaded. It is understood that moving a thread is not an inexpensive operation, and it should not be done lightly. But not moving a thread can have more severe consequences.

Can all of the above be incorporated into SQL Server? Something kind-of good enough, or better than what we have, delivered sooner is always better than perfection a very long time from now.

SQL Server Plan Cost Model

Since version 7.0, the SQL Server query optimizer has implemented a cost-based approach built on fixed IO and CPU costs. For parallel execution plans, the assumption is that the IO system is saturated at a single thread, degree of parallelism (DOP) 1. There were some changes in SQL Server 2005, but the basic cost model, in which 1350 pages in a scan operation (sequential or large-block IO) equal 320 key lookup or loop join (8KB pseudo-random IO) leaf-level page accesses, was retained.
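
For reference, a minimal sketch of the arithmetic behind that ratio, using the commonly cited per-page IO cost constants (these values come from published analyses of the plan cost formulas, not official documentation, so treat them as assumptions):

-- Assumed per-page IO cost constants:
--   sequential (scan) page after the first:  0.000740741
--   pseudo-random (lookup/join) page:        0.003125
SELECT 1350 * 0.000740741 AS scan_io_cost,    -- ~1.0
       320  * 0.003125    AS lookup_io_cost;  -- exactly 1.0, hence 1350 : 320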

On parallelization, the SQL Server query optimizer employs the execution plan with the lowest plan cost. The plan cost is described in SQL Server Books Online as follows: "The cost refers to an estimated elapsed time in seconds required to run the serial plan on a specific hardware configuration."

SQL Server Parallelism Options

There are three qualifiers. The cost threshold for parallelism option sets the minimum plan cost for which parallelization is considered, with a default value of 5. From BOL: "Longer queries usually benefit from parallel plans; the performance advantage negates the additional time required to initialize, synchronize, and terminate parallel plans." The max degree of parallelism system option sets the upper bound on the number of processors to employ for a single query. And finally, the MAXDOP query hint lets a statement (an individual query, a create index statement, etc.) override the system-wide max degree of parallelism setting.
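
For illustration, the three qualifiers in T-SQL (the values shown are the defaults or placeholders, not recommendations; the query and table are hypothetical):

-- System-wide settings (both are advanced options):
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 5;  -- default is 5
EXEC sp_configure 'max degree of parallelism', 0;       -- 0 = all processors
RECONFIGURE;

-- Per-statement override via query hint:
SELECT COUNT(*) FROM dbo.LINEITEM OPTION (MAXDOP 4);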

Other options that can also impact parallelism are the affinity mask, port affinity, and the Resource Governor.
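
The Resource Governor route (SQL Server 2008 and later) can cap DOP per workload group; a minimal sketch, with the group name and cap being my assumptions:

-- Hypothetical workload group capping DOP for report-style sessions:
CREATE WORKLOAD GROUP ReportGroup WITH (MAX_DOP = 4);
ALTER RESOURCE GOVERNOR RECONFIGURE;
-- A classifier function (not shown) would route sessions into ReportGroup.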

Saturated IO Cost Model

Earlier, I said the cost model assumes a saturated IO system. This means that in execution plans at higher degrees of parallelism, the IO cost for index and table access does not change. The CPU portion of the plan cost is reduced by the degree of parallelism, up to the higher of 2 or one-half the number of logical (?) processors. The IO cost of Hash Match (and possibly other intermediate operations) is reduced with parallelism. In reality, this may or may not be the case. A weak storage system will definitely saturate very quickly. A powerful, properly configured storage system (which need not be horribly expensive) can potentially supply as much IO as each and every processor can consume.
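
A rough sketch of the consequence, with the cost components assumed for illustration: the CPU term divides by DOP, the table-access IO term does not, so the modeled cost flattens out almost immediately.

-- Assumed cost components for a hypothetical scan-heavy query
-- (the DOP cap on the CPU division is ignored for brevity):
DECLARE @io  float = 10.0;  -- IO cost: unchanged at any DOP (saturation assumption)
DECLARE @cpu float = 4.0;   -- CPU cost: divided by DOP
SELECT dop, @io + @cpu / dop AS modeled_cost
FROM (VALUES (1), (2), (4), (8)) AS d(dop);
-- modeled_cost: 14.0, 12.0, 11.0, 10.5 -- barely moves past DOP 2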

In summary, there are two deficiencies in the IO portion of the SQL Server cost model. One, the cost model assumes a saturated IO subsystem: parallelism does not reduce the IO cost (or estimated duration). Two, the sequential to random IO ratio is fixed. The sequential-random IO ratio can be off by a full order of magnitude in a storage system configured for data warehousing, and possibly by a factor of 3 in the other direction on SSD storage. Because of these elements, it is impossible for the current SQL Server execution plan cost formulas to model parallelism scaling, and hence they have no useful capability to predict the point of diminishing returns.

Compute Scalar Costs

There are other limitations of the SQL Server cost model. The cost does not account for logic cost, typically in the Compute Scalar operation, for columns aggregated, string and date functions, etc. This is perfectly fine for transaction processing systems, where the main requirement is to determine the best table-row access strategy. In a well designed, pure OLTP system (stress that both qualifiers apply; the author is not responsible for poorly designed OLTP systems), all table access should be by index seek, so it should only be necessary to have conceptually approximate cost formulas to determine a good index seek sequence.

It is primarily in the medium to large queries typical of Reporting and DW that more accurate and comprehensive cost models can make a difference. Logic operations contribute to query cost, but do not impact table-row access strategy (beyond whether a key lookup is required). However, Compute Scalar operations usually soak up compute cycles without putting further strain on IO and locking, and can actually improve parallelism scaling, even though the query is more expensive.
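
A hypothetical example (TPC-H-style table and columns assumed): the expression work below shows up as a Compute Scalar operator, but adds essentially nothing to the plan cost, even though it consumes real cycles per row at run time.

-- Plan cost is nearly identical to selecting the raw columns,
-- despite the extra per-row string and date work:
SELECT UPPER(O_COMMENT) AS comment_upper,
       DATEADD(day, 30, O_ORDERDATE) AS due_date
FROM dbo.ORDERS;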

OPTION (RIGHT DOP)

Yet one more aspect: the SQL Server parallel plan models the overhead of parallel execution, but does not take into account the cycles to set up multiple threads. This was offset by setting the cost threshold for parallelism default to 5, representing an estimate of 5 seconds on some system from long, long ago. The thread setup overhead should be well below 5 seconds, so this consideration was properly handled. Today, a processor core is so powerful it can blast through 54MB in a scan or range seek in less than 100 milliseconds (540MB/sec), or 2000 key lookups (or loop join inner source rows) in less than 10ms (200K lookups per sec), both of which could be eligible for a parallel plan. A live user will probably not notice the faster response of a parallel query at that level. And certainly there is no benefit in elevating the degree of parallelism for a 100 CPU-ms query into the 32-64 range, or even beyond 2-4 for that matter.
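
To see why those examples marginally qualify, apply the same assumed per-page constants from earlier (54MB is 6,912 8KB pages); this is inference from those assumptions, not an official calculation:

-- Both land just over the default cost threshold of 5:
SELECT 6912 * 0.000740741 AS scan_plan_io_cost,    -- ~5.1 for the 54MB scan
       2000 * 0.003125    AS lookup_plan_io_cost;  -- 6.25 for 2000 key lookups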

Tuning Parallelism Today

Right now, SQL Server has only the very basic parallelism handling features: the cost threshold for parallelism and max degree of parallelism system options, and the MAXDOP query hint. The common advice is to increase cost threshold for parallelism to somewhere in the 20-500 range, and to set max degree of parallelism to perhaps 4, with individual queries hinted to override as suitable. We could be satisfied with this manual intervention, but the SQL Server product principle has long been to be as self-tuning as possible. The message to users is to focus on their business strategy, not SQL Server tuning secrets and tricks. Manual intervention should be the infrequent exception, not a common practice.

Mission Very Difficult

The parallelism strategies I described will not be easy, but this is not mission impossible, only mission very difficult, which should be a walk in the park (darn! did someone already say that?). Even so, it should not be impossible to do better than what we currently have.

Parallel SQL - Mostly for Write Operations

For some time, Intel C/C++ compiler add-ons have had the ability to automatically generate parallel code (as in, the coder does not have to do it). I also recall hearing something about Parallel LINQ, but that is outside the scope of my narrow view of the world. I was just browsing a Sybase script (for the TPC-H benchmark) and noticed the following, which is their version of our BULK INSERT.

LOAD TABLE PARTSUPP (PS_PARTKEY '|',PS_SUPPKEY '|',PS_AVAILQTY '|',PS_SUPPLYCOST '|',PS_COMMENT '|')
FROM '/rawdata/partsupp.tbl.1',
    '/rawdata/partsupp.tbl.2',
    '/rawdata/partsupp.tbl.3',
    '/rawdata/partsupp.tbl.4',
    '/rawdata/partsupp.tbl.5',
    '/rawdata/partsupp.tbl.6',
etc

Now, I have no inclination to read the Sybase documentation. I am inclined to interpret this as: from a single client session, Sybase will fire off multiple threads, one per file, to do the bulk insert into the PARTSUPP table. (Paul NZ looked this up in the Sybase doc, and it is in fact not a parallel command.) No matter; even if it does not do parallelism, it should, and the syntax just cries: do me in parallel!

SQL CAT has put out a great white paper showing that SQL Server can be tuned for truly amazing parallel bulk load performance. But from what I recall, some degree of "spoon bending" (Itzik copyright?) was required. I think they used SSIS. Sure, we could open multiple SSMS windows, with each one loading a different file. But I redefine what it means to be a lazy @$$, so implement the Sybase syntax.

SSMS 2008 already allows Multiple Server Query Execution, but what I want is to execute BULK INSERT against one server, with each thread assigned a different file. The simplicity of the Sybase statement syntax is really nice, so we don't have to write parameter substitution code.
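
This is what the manual approach looks like today: one BULK INSERT per file, each run from its own session or SSMS window (file names follow the Sybase script above; the WITH options shown are typical choices, not prescriptive):

-- Session 1:
BULK INSERT PARTSUPP FROM '/rawdata/partsupp.tbl.1'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);
-- Session 2:
BULK INSERT PARTSUPP FROM '/rawdata/partsupp.tbl.2'
WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n', TABLOCK);
-- ... and so on, one session per file.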

Potentially, this could also enable parallel INSERT/UPDATE on a partitioned table, with each thread handling a single partition. It would be nice to do parallel INSERT/UPDATE on a non-partitioned table, but if there are technical reasons this can't be done, I would be happy with just parallel write operations to a partitioned table.
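
A manual approximation of per-partition write parallelism that works today (the table, column, and partition function name here are hypothetical): run one statement per partition in separate sessions, each filtered with $PARTITION.

-- Session n handles partition n only (pfOrders is an assumed partition function):
UPDATE dbo.ORDERS
SET O_COMMENT = UPPER(O_COMMENT)
WHERE $PARTITION.pfOrders(O_ORDERDATE) = 1;  -- use 2, 3, ... in the other sessions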

So will this be in Denali? If not, perhaps in a soon-to-follow hot-fix (let's set aside the feature-fix delineation for this). Long ago, the MS x86 OS people were really anxious to have a means of handling larger than 32-bit virtual addresses (until full 64-bit was available) and advised: don't wait for Willamette, put it in Katmai. When that did not produce a favorable response, a team from another company did.

Also of note: the SqlBulk Tool from SQLBI, a parallel tool to create a mirror copy of a database.