Originally from High Call Volume SQL Server Applications on NUMA Systems
Oct 07, 2005
One of the most difficult database operations to scale-up on high-end server systems is the network call. In fact, without special precautions, a high call volume database application can have severe negative scaling on large systems. It is suspected that this problem is more pronounced on NUMA systems than basic SMP systems. A NUMA system with 16 processors can have significantly lower throughput capability than an otherwise comparable 4-way SMP system. Many SQL Server performance experts have been advocating for some time the importance of consolidating calls to SQL Server because of the high cost of the call overhead relative to simple SQL operations, such as a single row index seek. There seems to be little awareness that the call overhead cost can escalate dramatically relative to the cost of SQL operations on NUMA systems with high processor count. Given the number of important applications that exhibit the high call volume characteristic, it is rather curious that there is essentially no meaningful public explanation of the special precautions (read: undocumented features) to avoid negative scaling on large systems for SQL Server 2000.
Three possible explanations for failing to consolidate calls in current generation applications include ignorance, ineptitude, and mainframe heritage. The reason the third item is exempt from the first two classifications is that mainframe systems may have been designed to sustain high call volume operations. One example is SAP, which appears to select a block of rows from one table, then retrieves one row from a given table per call. The queries do not even employ a join, as if the application architecture pre-dates relational databases.
Microsoft has announced features in SQL Server 2005 to handle the situation of high call volumes. Some of this is documented under sections describing NUMA features. This includes the ability to better utilize NUMA system memory organization, and the ability to map client connections to NUMA nodes by network port. The second item may be critical in achieving high call volume capability. There is some documentation mentioning the ability of SQL Server 2000 to optimize for NUMA memory organization, but no apparent documentation regarding client connection and CPU group affinity.
One might conclude that this capability is not in SQL Server 2000. However, it is noted that it would be very difficult for SQL Server 2000 to reach the high scores achieved in the published TPC-C benchmarks for NUMA systems with 16 or more processors. All of these systems use the VI protocol and VI capable fiber channel adapters for client connections. This by itself may help performance relative to using the TCP/IP protocol over gigabit Ethernet, but is unlikely to be sufficient. An examination of the TPC-C full disclosure reports shows configuration settings that appear to be connection affinity masks by network port. Now if this were only important for the TPC-C benchmark, then additional documentation may not be of much interest. But there are important actual customer applications that may depend on this capability, so proper documentation from Microsoft or NUMA system vendors would be very helpful.
Current generation systems with more the four processors are frequently NUMA systems (Non-Uniform Memory Architecture). The path from processor to memory has different latencies, depending on whether the processor and memory reside in a common node or a different node. Some older 8-way Pentium III Xeon systems are not NUMA. There are three NUMA chipsets, on which the server system is based, for the Itanium 2 processor relevant to Windows environments. One is based on the Intel E8870 chipset, another from HP for the Integrity and Superdome lines, and a third chipset for NEC systems. Unisys and IBM each have a NUMA chipset for the Intel Xeon MP processors. Unisys uses the Intel E8870 chipset for their Itanium systems. Each of the above NUMA systems supports four processors per cell.
All multi-processor Opteron systems are technically NUMA, in that the processor has a direct connection to memory and to other processors. So there is a difference in latency from processor to directly attached local memory and non-local memory, requiring one or more hops to other processors. However, the other NUMA chipsets have a large difference between local and non-local memory latency, typically in the range of 120-150ns for local memory and 250-300ns for non-local memory. The Opteron 2-way system has a direct attach memory latency on the order of 60ns and a one hop non-local memory latency of 90ns (See AMD documentation for accurate values).
Going forward with trend to multi-core processors, the term processors should be changed to sockets or cores as appropriate to distinguish between the package that plugs into a motherboard, and the entity that appears as a single physical processor to the operating system.
There are some references to the way the Windows operating system organizes memory. The common organization is a simple interleave. Line 1 is spread across the 4 DMH memory hubs of the first SNC. The second line is on the second SNC and so on to the fourth SNC. Line 5 returns back to the first SNC. An alternative option is for processors located in one SNC to be able to allocate memory from the local SNC. There is a SQL Server startup flag T825 referenced as the NUMA option, but there does not appear to be public documentation on this.
The Microsoft Platform SDK documentation mentions two aspects of NUMA support. One involves scheduling threads on processors in the same cell as the memory required. The other allows a thread to allocate memory from the same node.
Figure 1 Intel E870 chipset for 16-way Itanium 2 System
A single call from an application (on a different system) to the SQL Server system may consist of one or more network packet transmissions and one or more receive packets, depending on the amount of data sent in each direction per call, the network packet size, and window size. For this reason, a call to SQL Server should not be referred to as a network round-trip. It is not possible to determine the number of round-trips from the network interface performance counters, which only tracks packet and byte count. SQL Server calls can be measured in Performance Monitor by the object: ‘SQL Server: SQL Statistics’, the counter: ‘Batch Request/sec’. In SQL Profiler, this is measured by the Stored Procedure event ‘RPC:Completed’ and the TSQL event ‘SQL:BatchCompleted’. It is implied that in order to generate a high call volume, the vast majority of calls must be relatively inexpensive and quick to execute.
To explore the characteristics of SQL Server behavior in high call volume applications, a series of tests were conducted on a range of platforms from mid-2004 to present. The test stored procedure generates a single random number, performs another arithmetic operation, and returns the SPID. There are no table operations. An otherwise equivalent stored procedure that performs a single row index seek from a table entirely in the buffer cache is only slightly more expensive than the no-op call (in the range of 15%).
The load generator system is either a 2-way Xeon or a 4 x 2.7GHz Xeon MP. All network traffic runs over a single Gigabit Ethernet network. The network controllers employed include the Broadcom BCM570x, the Intel PRO/1000MT and the Intel PRO/1000XT. There were some performance improvements from the early Windows 2000 drivers to the more recent drivers, but no significant performance differences between driver versions released after 2004 have been observed. The operating system is Windows Server 2003, with hot-fixes current at the time of the test. The SQL Server 2000 versions ranged from SP3, hot-fix build 911, 1027 and SP4.
Figure 2 RPC performance for various 2-way platforms
Figure 2 shows the calls per second performance for various 2-way platforms with between 1 – 20 client connections. The first system is a 2-way 3.06GHz Xeon, 512K cache, 533MHz FSB on the ServerWorks GC-LE chipset. The second system is a 2x3.2GHz Xeon, 1M L2 cache, 800MHz FSB on the Intel E7520 chipset. The third system is a 2x2.2GHz Opteron system. The network controller is either the Broadcom or Intel PRO/1000MT. Data is listed in Appendix A.
Figure 3 compares the 4-way 2.7GHz Xeon MP and 16-way 3.0GHz Xeon MP call volume performance. The 4-way system can drive more than 2 times more calls per second than the 16-way system. There are good examples of SQL Server queries emphasizing data intensive operations rather than network round-trips that show excellent scaling on the 16-way system. So there are no fundamental problems with NUMA systems, only the problem of matching the software stack to the hardware characteristics.
Figure 3 Client Connections
The Microsoft KB article (Interrupt Affinity tool) on the Interrupt Affinity tool describes Windows 2000 as assigning interrupts to any available processor, and that performance improvement may be possible by assigning each network adapter to a specific processor. It is possible that Windows Server 2003 changed the default behavior as suggested and assigns each interrupt to specific processor. Excerpts from this KB article are in Appendix B.
Figure 4 shows the individual CPU utilization from Windows Task Manager on a Unisys 16-way Xeon MP system running Windows Server 2003 while sustaining 17K calls per second. Note that CPU 10 (counting up from 0) is at near 100% utilization. It is suspected that this is the processor handling the network interrupt, but the necessary steps to prove this were not conducted. There was no disk activity in this test. There were no other processes running and nothing else generating network activity. If this interpretation is correct, then the call handling capability of the 16-way system is saturated even though the other processors are not even close to fully loaded. An actual production server (16-way Itanium 2) running SAP exhibited essentially the same characteristics shown in Figure 4. Applying any addition network traffic to the connection handling SQL Server calls resulted in call volume performance degradation, but generating traffic on a different network connection not used by the active SQL Server clients did not degrade performance.
It is possible that distributing the network interrupt over more processors could improve call volume performance. It could also be speculated that excluding the SQL Server process affinity from one or more processors and binding the network interrupt to excluded processor(s) might help, but the net gain is not clear. Another point to note is that the CPU cost per call on the 16-way system is much higher than that of the 4-way system. So even if the CPU load could be evenly distributed, the performance with all 16 processors saturated may be no better than the 4-way call volume performance. It could be that there is substantial cost in having one processor handle the interrupt, then hand off the call to a SQL Server thread running on a processor in a different node.
Figure 5 shows the call volume scaling characteristics on an 8-way Itanium 2 system (HP rx8620, 1.5GHz processors). There are 4 processors in each of 2 cells. The call volume test was conducted with the system booted to 1, 2, 4, and 8 processors using the NUMPROC option in the EFI OS loader (equivalent to the boot.ini file in 32-bit systems).
Figure 5 Call volume performance for HP rx8620 booted to 1, 2, 4 and 8 processors.
It was not determined in the 2 & 4 CPU test whether all processors were in a common cell. Call volume scaling shows only marginal improvement from 1 to 2 processors (13.5K to 16.5K), no gain from 2 to 4 processors, and some degradation from 4 to 8 processors. It is possible that the one or both of the 2 & 4 processor tests.
The TPC-C benchmark is not as call volume intensive as some applications, but on current very high-end systems, would require techniques to avoid the problem described above. So it is worth examining the call volumes reached and any special configurations employed. The TPC-C benchmark requires a minimum of 55% calls for the four transactions other than the new order transaction, implying that new orders comprise at most 45% of calls. This implies that there are at least 2.22 calls per transaction. So score of 600,000 tpm-C constitutes 10,000 new order transactions per second for a total of 22.2K calls/sec. The table below shows the published TPC-C results for SQL Server 2000 on NUMA systems. The 64-way HP Superdome generates over 29,000 calls per second. Even the 16-way systems require over 12,000 calls per second.
All the above reports specify VI connections between the application server and the database server. The excerpts below are from the Oct-2003 NEC full disclosure report. The first excerpt are the boot options (Itanium system settings may be in the EFI instead of the boot.ini file), one of which explicitly states to be a directive to distribute interrupts. The second excerpt describes connection affinity bindings.
OsLoadOptions = /MAXPROCSPERCLUSTER=1 /CONFIGFLAG=8224
CONFIGFLAG=8224 : Interrupt logging disable
MAXPROCSPERCLUSTER=1: distribute interrupts to all CPUs equally.
We bind each client/server connection (i.e., 8 QLA2350 VIA NICs) onto the distinct CPU sets exclusively.
To do this, apply the following settings by using the SQLServer "Server Network Utility".
Enabled protocols: "VIA" only
Vendor: Select "QLogic"
Listen Info: Enter the following string
Both the Unisys Itanium 2 and Xeon MP reports used the following settings (with some differences).
There is no documentation from Microsoft detailing the values around the port number. The NEC report is most clear in explaining that this is for binding connections to a specific set of CPUs. Appendix C is a collection of excerpts from Microsoft documentation for SQL Server 2005 on NUMA and I/O affinity. The general strategy seems to be as follows. Each group of clients uses a specific port number. The interrupt for that port number is handled by a specific processor, and the user connection in turn will only use processors in the same node. Note that this feature in SQL Server 2005 applies to both TCP/IP and VI. The TPC-C reports for SQL Server 2000 only show the undocumented settings for the VI protocol, but not explicitly state whether there is or is not support for TCP/IP.
It is also unclear as whether the ListenInfo settings only sets affinity for only the network connection itself or also the SQL Server thread and SPID handling any connections open from that port. A look the registry setting for published SQL Server 2005 TPC-C results (Appendix D) indicates additional registry settings for CPU affinity by node.
Additional tuning parameters from the TPC-C full disclosures are as follows. The SQL Server Startup parameters commonly used are:
-T825 Enable NUMA support
-T826 don’t use large pages for buffer array
-T827 enable Super Latches
-T888 Use locked pages for buffers
-E increase the number of consecutive extents allocated per file to x
Below is a setting change in permissions.
Local Policies - User Rights Assignments - policy 'Lock pages in memory'
added group 'Administrators'
The following was used in the HP Superdome report of Oct 2003. Environment variable LOGAFFINITY specifies which processor the log write will be affinitized to.
The performance implications of high call volume SQL Server applications on large NUMA based server systems are clearly demonstrated. Microsoft has documented features in SQL Server 2005 involving NUMA memory optimizations and client connection affinity tuning for both the TCP/IP and VI protocols. There are indications that NUMA memory optimization and the client affinity tuning for the VI protocol already exist in SQL Server 2000, but there does not appear to be public documentation. It would be important to understand to what degree the high call volume issue on large systems can be alleviated on TCP/IP over Gigabit Ethernet and when it is necessary to use VI. Anyone with good connections to the Microsoft SQL Server team should make direct inquires if there is reason to believe this subject is important to their environment.