
Server Sizing Comments: the case for 1-socket in OLTP (2016-04)

Long ago, almost 20 years, there was legitimate concern about whether servers had sufficient capability to drive transaction processing requirements for large businesses. Then for a long period, perhaps from the late 1990s to 2012, the better choice for most mid-to-large businesses was to simply buy a 4-way (socket) system. Larger systems were based on a non-uniform memory access (NUMA) architecture with unusual characteristics, and vendors provided almost no information on how to work around issues that could cripple such systems.

Towards the end of this second era of server sizing, processors (at the socket level) became ridiculously powerful, such that properly tuned systems ran at very low overall CPU utilization. However, 2-socket systems in this period did not have adequate memory or IO capability, unless it was a Xeon 7500 or later E7 series system with 2 sockets populated. And even then, the structure of SQL Server per-processor licensing meant that many people opted for the 4-way system regardless of requirements.

In 2012, Intel finally put powerful processors (Sandy Bridge, up to 8 cores) with larger memory and IO capability into the Xeon E5 line (in both 2- and 4-socket versions). This coincided with SQL Server 2012 per-core licensing. Now there was both a stronger technical justification and a financial incentive not to automatically opt for a 4-socket system with the maximum number of cores.

It would seem reasonable to make some effort to first determine the total number of processor cores that will meet requirements with some headroom. Then determine the number of processor sockets, since more processor sockets mean more memory sockets (also considering the difference in DIMM sockets per processor between the Intel Xeon E5 and E7 lines). It is unclear whether anyone actually engaged in this technical analysis versus just making the decision strictly on budget goals.

Today, it is time to consider the astonishing next step: that a single-socket system is the best choice for a transaction processing system. First, with proper database architecture and tuning, 12 or so physical cores should be more than sufficient for the very large majority of requirements.

We should also factor in that second-generation hyper-threading (two logical processors per physical core), from Nehalem on, shows almost linear scaling in transactional workloads (heavy on index seeks involving few rows). This is very different from the first-generation HT in Willamette and Northwood, which was problematic, and the improved version in Prescott, which was somewhat better on the positive aspects and had fewer negatives.

A single-socket Intel Xeon E5 system supports 12 DIMM sockets, for 384GB with 32GB DIMMs, or 768GB with 64GB DIMMs, the latter now at a moderate (50%) premium per GB over the 32GB DIMM. In the old days, DBAs were taught that memory was precious for reducing disk IO to a tolerable level. That was back when a total system memory of 1GB was considered very large.

I noticed that once common server memory configurations reached 32GB, disk IO could be brought to a very manageable level, well under the IOPS possible with 100-200 15K HDDs.

Today, all-flash storage is practical: definitely in direct-attach storage, and maybe in SAN-based storage, depending on the degree of price gouging from your SAN vendor. So while it might be possible to show that TB-plus memory reduces IOPS from 10,000 to nil, the reality is that a modern storage system correctly configured for database applications can easily support 100,000 IOPS, if not 1M (which would require NVMe to be practical). In this case, we should find that memory requirements are far lower than the memory capacity of recent-generation 2- and 4-socket systems.
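To put rough numbers on the comparison above (the ~200 IOPS per 15K-RPM HDD figure is my planning assumption, not stated in the post):

```python
# Rough IOPS sanity check. ~200 random IOPS per 15K-RPM HDD is a
# common planning number (an assumption here, not from the post).
IOPS_PER_15K_HDD = 200

hdd_low = 100 * IOPS_PER_15K_HDD    # 100-drive array
hdd_high = 200 * IOPS_PER_15K_HDD   # 200-drive array
print(f"100-200 15K HDDs: {hdd_low:,} - {hdd_high:,} IOPS")
print("all-flash (per the post): ~100,000 IOPS, ~1M with NVMe")
```

Even a large short-stroked HDD array tops out well below what a modest flash configuration delivers, which is the point: memory no longer has to absorb nearly all the IO.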

Why not continue to spectacularly over-configure the server system, when the hardware, even with SQL Server per-core licensing, is only a small portion of the cost of deploying a critical line-of-business application? The big bosses may not question the cost of a 2-socket or perhaps even a 4-socket system, even factoring in that there may be several environments across production, DR, QA and development. But if performance is inadequate, it may cost you your job.

The answer is performance. Today, transaction processing performance depends heavily on serialized round-trip memory access latency. The term is pointer chasing: a CPU instruction accesses memory to determine the next memory location to access. No amount of L2/L3 cache can hide this, nor is it possible to prefetch the memory. (What Intel needs to do is increase Hyper-Threading beyond two logical processors per physical core; IBM POWER is at 4 and SPARC at 8?)
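A minimal sketch of the pointer-chasing pattern: each load's address is the value of the previous load, so the accesses are fully serialized and neither caches nor hardware prefetchers can help when the chain is random.

```python
import random

n = 1 << 16

# Build one big random cycle: nxt[x] is the element that follows x,
# so the traversal below visits every slot exactly once.
perm = list(range(n))
random.shuffle(perm)
nxt = [0] * n
for i in range(n):
    nxt[perm[i]] = perm[(i + 1) % n]

# Pointer chase: the address of each load depends on the value just
# loaded. The n loads cannot overlap or be prefetched -- each one pays
# the full memory round-trip latency. This is the access pattern that
# dominates transactional index-seek work.
steps, pos = 0, perm[0]
while True:
    pos = nxt[pos]
    steps += 1
    if pos == perm[0]:
        break
assert steps == n
```

On real hardware, timing this loop against a sequential traversal of the same array is a standard way to measure raw memory latency versus bandwidth.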

For a processor with integrated memory controllers (AMD from Opteron on, Intel from Nehalem on), local-node memory access round-trip latency is probably 50ns. For memory on an adjacent processor, a 1-hop remote node, it is perhaps 100ns. This information is absolutely crucial to understanding system-level performance for transactional databases, and yet vendors try not to mention it, instead stressing metrics that are mostly irrelevant.

In a single-socket system, 100% of memory accesses are local node, because there is only 1 node. In a two-socket system, the expectation is that memory accesses are split 50-50 between local and remote node. In principle, it is possible to devise some strategy such that a higher percentage is local, but SQL Server does not make it practical to implement such a scheme. In the Intel Xeon E7 4-socket systems, 1 of 4 memory nodes is local and 3 memory nodes are 1 hop away. In an 8-socket system, 4 nodes are 2 hops away.
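Combining the node counts above with the earlier round-number latencies (50ns local, 100ns 1-hop; the 150ns 2-hop figure is my guess, not a vendor number), the expected average latency for uniformly distributed accesses works out as follows:

```python
# Expected average memory latency if accesses are spread uniformly
# across all nodes. 50ns/100ns are the post's round numbers; 150ns for
# a 2-hop access is an assumed value.
LOCAL, HOP1, HOP2 = 50, 100, 150   # ns

topologies = {
    # sockets: (local nodes, 1-hop nodes, 2-hop nodes)
    1: (1, 0, 0),
    2: (1, 1, 0),
    4: (1, 3, 0),   # E7 4-socket: the 3 remote nodes are all 1 hop
    8: (1, 3, 4),   # per the post: 4 nodes are 2 hops away
}

avg_ns = {}
for sockets, (n0, n1, n2) in topologies.items():
    avg_ns[sockets] = (n0 * LOCAL + n1 * HOP1 + n2 * HOP2) / sockets
    print(f"{sockets}-socket: average ~{avg_ns[sockets]:.1f}ns")
```

So going from 1 socket to 2 costs roughly 50% in average latency for pointer-chasing work, before any other scaling effects.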

An argument in favor of multi-socket systems is that the number of memory channels scales with the number of sockets. However, in transaction processing, it is memory round-trip latency that dominates, not memory bandwidth. (It is a factor that more memory channels allow more memory transactions in flight, but it is unclear whether this offsets the increase in memory latency.)

If someone has reasonably comparable systems with 1, 2 and 4 processor sockets, all of the same processor generation, the impact of the number of sockets can be tested.

I am also of the opinion that SQL Server cannot make meaningful use of all the processor cores in a top-of-the-line 2-socket system (36 cores for Xeon E5 v3), let alone a 4-socket (72 cores), because most write operations, data and log, are single-threaded. But I would like to hear opinions on this. Is this addressed in SQL Server 2016 to a degree?

The above applies mostly to transaction processing applications.
It is reasonable to suppose that decision support applications could use both the greater number of processor cores and the greater number of memory channels in a multi-socket system. However, it is unclear whether SQL Server is effective in implementing parallel execution plans when the degree of parallelism is greater than the number of cores in one socket.

I would think that the strategy of partitioning and aligning compute to local node memory from (the former) PDW, applied to a single SQL Server instance, treating each processor socket as though it were a node, which it is, could dramatically improve high-DOP parallel execution plans. But it does not seem that Microsoft is interested in this approach.

I am presuming that column store is better able to utilize multiple sockets, as it is definitely not encumbered by serialized round-trip memory latency.

Addendum 2
The Dell PowerEdge R730 can be purchased with 1 or 2 of its 2 sockets populated. The R930 must have 2 or 4 of its 4 sockets populated. It appears that the premium from 32GB to 64GB memory modules might be because the 32GB is an RDIMM and the 64GB is an LRDIMM. The 2400MT/s 32GB RDIMM is $558, and the 64GB LRDIMM is $1483. At this differential, I would stay single-socket, opting for the more expensive 64GB modules (if necessary) over 2 sockets using the less expensive 32GB modules, even though I am giving up the 4 extra memory channels.
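The per-GB arithmetic at these particular Dell prices:

```python
# Price-per-GB comparison using the Dell prices quoted above.
rdimm_32 = 558 / 32      # $/GB for the 32GB RDIMM
lrdimm_64 = 1483 / 64    # $/GB for the 64GB LRDIMM
premium = lrdimm_64 / rdimm_32 - 1

print(f"32GB RDIMM:  ${rdimm_32:.2f}/GB")
print(f"64GB LRDIMM: ${lrdimm_64:.2f}/GB ({premium:.0%} premium)")

# Capacity at 12 DIMM sockets on a single E5 socket:
print(f"1-socket max: {12 * 32}GB (RDIMM) or {12 * 64}GB (LRDIMM)")
```

At roughly a one-third per-GB premium, doubling capacity within one socket is much cheaper than the incremental cost of a second socket plus its per-core licenses.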

If the plan is to implement memory-optimized tables in 2016 (which supports more than 256GB? and has fewer other restrictions) and more than 768GB of system memory is needed, then go ahead with 2 sockets. But I would try to get my transaction processing DB under the single-socket limit. In the old days, we moved old data out to the DW system.

Originally posted on SQLblog.com  
comments and replies below

Thomas LaRock: Thanks for the article Joe. Just wondering if you have seen this article regarding soft NUMA in SQL 2016: Memory Latency Checker  

No, I had not read it until you pointed it out. I do know that in older versions of SQL Server, 2008, not sure about R2, a parallel execution plan might get some threads on one socket and some on another when MAXDOP is less than the total number of cores. But back then, perhaps MS was not fully up to speed on such strategies. It does seem that 2008R2, with MAXDOP set to the number of cores on one socket, does try to put all threads on one socket.

So if HT is enabled, my first thought is that a parallel execution plan should only use 1 logical processor per physical core. One of my previous tests showed mixed results for a parallel plan using both logical processors vs just 1. So the question: is there some throughput gain in running two queries with parallel execution plans, one on the first logical processor of each core, and the other on the second?

I suppose this soft-NUMA in SQL Server is nice, but I have pestered MS about fixing their stupid parallelism methodology for a while.

First, the query optimizer cost formulas with respect to parallelism are totally stupid (being a saturated-IO model), meaning that parallel execution benefit estimation is also stupid. If we cannot estimate the benefit of varying DOP versus DOP 1, then we cannot have an intelligent strategy for parallelism.

But if this were fixed, then the next step is to include the cost of starting up a parallel plan (creating threads), then have an intelligent strategy to determine the appropriate DOP based on the plan cost, limiting the DOP when there are no substantial further gains, and possibly going negative on the cost of starting up threads. It is also high time that pre-pooled parallel threads are maintained, so there is almost no cost to start a parallel plan.

Regarding NUMA, SOS first attempted to schedule all threads on the same node (when DOP < node size) as early as SQL Server 2005.

Regarding HT, in my tests I've seen very decent throughput gains up to 1.5x threads/physical core. Beyond that it seems to level out. (So for example, on a 10 physical core socket, I can exhibit and repro gains up to DOP 15. Beyond that any benefit drops off and at DOP 19 or 20 things may even be a bit slower than at DOP 15 -- but still faster than DOP 10.)
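A crude way to summarize that observation (a model fit by eye to the comment's numbers, not a measurement): throughput scales roughly linearly up to the physical core count, then at about half the per-thread rate up to ~1.5 threads per core, then flattens.

```python
# Toy model of the HT scaling described in the comment above
# (assumed shape, not measured data): linear to the physical core
# count, half-rate gains up to ht_knee threads/core, flat beyond.
def relative_throughput(dop, physical_cores, ht_knee=1.5, ht_gain=0.5):
    base = min(dop, physical_cores)
    extra = max(0, min(dop, ht_knee * physical_cores) - physical_cores)
    return base + ht_gain * extra

# The 10-physical-core socket from the comment:
for dop in (10, 15, 20):
    print(f"DOP {dop}: {relative_throughput(dop, 10):.1f}x")
```

Per the comment, real systems may even dip slightly past the knee (DOP 19-20 a bit slower than DOP 15), which this flat-top model deliberately ignores.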

Also not so sure about your final point, WRT startup cost. SQL Server already has a thread pool, and those threads aren't regularly shut down if the server remains active. On the flip side, if the server is not very busy, then thread startup cost is basically negligible. Why do you feel that the existing implementation doesn't work?

I think that if your parallel plan run time is being influenced that much by thread startup cost, then those queries are probably not great candidates for heavy parallelism. Startup is going to add a small number of milliseconds in most cases. IMO parallelism is to be used for larger queries that run on the order of at least seconds.

SQL Server 2005 forward may try to schedule all threads on the same node but does not guarantee it. I do remember situations where 7 threads were on one node and the 8th was on another.

I am of the opinion that when MAXDOP is set to N, N being the number of physical cores on one node or less, the DOP strategy should be to give me up to N with all threads on the same node. If it cannot, I will live with what is available on the one node rather than split the threads. Perhaps this could be an optional setting, but I would prefer it be the default.

In the main post, I was specifically saying that HT gives almost linear scaling in throughput for light (transactional) queries. In other places, I have said that HT yields mixed results for parallel execution plans (did I say this above somewhere? I might be getting senile, or drunk).

The last time I investigated this was on Xeon E5 v1 Sandy Bridge, mostly with SQL Server 2008R2, and perhaps some 2012. It is possible that newer processors and newer SQL Server might have better results in parallel execution plans with HT. If so, I would like to see a good study on this.

In principle, HT should gain performance only if there are dead cycles in one thread, primarily from round-trip memory access, and possibly IO (SQL Server already does asynchronous IO for estimated rows > 25). In a plan with hash joins, I would expect that the code mostly streams through memory, except for the hokey pattern of reading the row offsets at the end of the page and then going back to the row, and possibly jumping around for the specific columns.

There would be plenty of round-trip memory accesses in a loop join. I had previously said that large loop-join queries did not scale with parallelism, but perhaps this was because my tests were not large enough, and this was several years ago, so someone please update me on this. If so, then HT would definitely help. But does HT help in a parallel hash join too?

OK, so in my comment above, bitching about thread startup: the specific case is intermediate queries that could benefit from a parallel plan, but only at limited DOP, and when the query is called in volume. If it is called in volume, then you might argue DOP should be 1 for maximum efficiency.

Suppose we had a query of 1 CPU-sec and 1 sec elapsed time at DOP 1. Then we might expect something like 1.1 CPU-sec and 0.55 sec elapsed at DOP 2, and 1.2 CPU-sec and 0.3 sec elapsed at DOP 4, etc., i.e., scaling degrades slowly. But there is a surprise. A parallel execution plan can and might use different operators, like the bitmap, in which case the query that was 1 CPU-sec at DOP 1 might be 0.6 CPU-sec and 0.3 sec elapsed at DOP 2. There are some examples of this in the TPC-H queries. So we can no longer argue that high volume should mean DOP 1; parallelism helps. However, in these 0.1-2 sec queries, the correct DOP is key.
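The hypothetical figures above, worked through:

```python
# The paragraph's hypothetical numbers: DOP -> (CPU-sec, elapsed-sec).
plain = {1: (1.0, 1.0), 2: (1.1, 0.55), 4: (1.2, 0.30)}

speedup = {dop: plain[1][1] / el for dop, (cpu, el) in plain.items()}
for dop, (cpu, el) in plain.items():
    print(f"DOP {dop}: {cpu} CPU-s, {el} s -> {speedup[dop]:.2f}x speedup")

# The bitmap-operator surprise: the parallel plan may use LESS total
# CPU than the serial plan (0.6 CPU-s at DOP 2 vs 1.0 at DOP 1) --
# cheaper AND faster, i.e. super-linear.
bitmap_dop2 = (0.6, 0.30)
assert bitmap_dop2[0] < plain[1][0]
```

This is why the "high volume means DOP 1" rule breaks down: when the parallel plan switches operators, parallelism can reduce total CPU consumed, not just elapsed time.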

This is why I have bitched at MS to have a flexible DOP strategy based on an intelligent cost model, not the crude one designed in the SQL Server 7 (1997) days, which is really stupid on recent multi-core, big-memory, SSD systems. For these small queries, thread management time is meaningful, so I think there should be pre-set DOP 2 and DOP 4 collections.

So I am specifically not talking about heavy parallelism. Since I am in a bitchy mood, which is frequently the case, I will go on. Look at the published TPC-H results for SQL Server and other DBs. Not the single composite score, but rather the separate Power and Throughput scores. On SQL Server, the Power score is higher than Throughput. In the Lenovo result this is about 2:1, less on other SQL Server results.

In other DBs, Throughput is higher than Power, which is the reasonable expectation. The higher the DOP, the more high-parallelism inefficiencies are incurred. So in theory, running two separate streams at half the DOP should yield higher throughput than 1 stream at full DOP. The interpretation is that SQL Server is not being smart about dynamically adjusting DOP depending on the circumstances.
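The composite masks this split: TPC-H's QphH is the geometric mean of the Power and Throughput metrics, so a 2:1 Power:Throughput ratio produces the same composite as the reverse. The numbers below are made up to illustrate the point, not taken from any published result.

```python
from math import sqrt

# TPC-H composite: QphH = geometric mean of Power and Throughput.
def qphh(power, throughput):
    return sqrt(power * throughput)

a = qphh(power=200_000, throughput=100_000)  # Power-heavy (SQL Server-like)
b = qphh(power=100_000, throughput=200_000)  # Throughput-heavy (other DBs)
print(a, b)  # identical composites, hiding which side is weak
```

So two systems with opposite Power/Throughput profiles can publish identical composite scores; only the separate metrics reveal the DOP-management difference.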