I am in the process of updating the NUMA page. For the time being, it is a rough draft; I will collect available links to SQL Server resources pertaining to NUMA.
In the early days, NUMA was synonymous with very large systems having 8 or more processors. The NUMA architecture allowed nodes of typically four processors to be connected together. Since each node comprised both processors and memory, the architecture was inherently non-uniform in terms of memory access. Some years ago, the AMD Opteron processor integrated the memory controller. This meant that any multi-processor Opteron system was technically a non-uniform memory access architecture, so today NUMA is no longer synonymous with very large systems. For lack of a better term, I am extending the term big-iron, historically associated with mainframe systems, to large Intel Xeon systems.
Symmetric Multi-Processing (SMP) mostly means that any processor can work on any task. No processors are dedicated to specific tasks, which would be an Asymmetric Multi-Processing (ASMP) architecture. While SMP systems are usually built with multiple identical processors, the only firm requirement is that all processors share a common instruction set architecture (ISA). In ASMP, there is no requirement for a common ISA, particularly for the special-purpose processors. Other aspects of SMP systems are that each processor can access all memory, and that the system runs a single operating system image. There is no specific definition as to interconnect or memory architecture. In the early days, shared-bus SMP, shown below, was popular, being the foundation of Intel 4-way system architecture from 1995 to 2005.
In the Non-Uniform Memory Access (NUMA) architecture, the path from processor to memory is non-uniform. This organization enables the construction of systems with a large number of processors, and hence the association with very large systems. A NUMA system with cache-coherent memory running a single OS image is still an SMP system. A general representation of the NUMA system is shown below.
The system comprises multiple nodes, each with 2-4 processors, a memory controller, memory, and perhaps IO. There might be a separate node controller (NC), or the memory controller (MC) and NC could be integrated. The nodes could be connected by a shared bus, or may implement a cross-bar.
The most apparent aspect of this architecture is the non-uniform distance from processor to memory, as implied by its name. The access time to local memory is frequently in the range of 150-200ns, versus 300-400ns for remote node memory, without accounting for cache coherency. It would seem that if the operating system were aware of this aspect of memory organization, it could take advantage of the lower access time for local memory. However, with cache coherency, this is more complicated.
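The stakes can be illustrated with simple blended-latency arithmetic. The sketch below uses hypothetical figures of 180ns local and 350ns remote, within the ranges cited above.

```python
# Blended memory latency as a function of the local-access fraction.
# Latency figures are hypothetical, within the ranges cited above.
LOCAL_NS = 180   # access to memory on the local node
REMOTE_NS = 350  # access to memory on a remote node

def blended_latency(local_fraction):
    """Average latency when local_fraction of accesses hit local memory."""
    return local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS

# On a 4-node system with no locality, only 1/4 of accesses are local.
print(blended_latency(0.25))  # 307.5
# With 90% locality, the average approaches the local access time.
print(blended_latency(0.90))  # 197.0
```

The point of NUMA-aware tuning is to move the local fraction well above the 1/N baseline of random placement.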
There are two common methods for maintaining cache coherency: snoop and directory based. Of the two, snoop is simpler to implement, but directory is more effective for large systems. As the cross-over point is usually beyond that of a 4-way system, a snoop filter was built into many Intel memory controllers.
In the 2003-4 time frame, AMD introduced the Opteron processor with an integrated memory controller and three point-to-point serial protocol links, called Hyper-Transport (HT), instead of a shared bus. Technically, any multi-processor Opteron system is thus a NUMA architecture, comprised of single-socket nodes. However, the difference in memory access times between local and remote memory is low, and does not appear to introduce the performance or behavioral anomalies that are common in traditional NUMA systems. It is not necessary to make special considerations for 2 or 4-way Opteron systems that might be essential on large NUMA systems.
The diagram below explains the benefits of HT Assist, introduced with the six-core Istanbul.
The Enhanced Twisted Ladder architecture shown below allows the implementation of a glue-less 8-way Opteron system. Opteron processors before Magny-Cours have three HT links. In the system below, Processors 1 and 7 use one HT link to connect to the IO hub, leaving only two HT links to connect to other processors. The path from Processor 1 to Processor 7 is 3 hops; all other processor-to-processor paths are 1 or 2 hops.
As shown below, each die within the processor socket/package is connected to the other with one full-width and one half-width HT. In a 2-way system, a die in one processor connects to one die in the other processor with one full-width HT, and to the second die with a half-width HT. The remaining full-width HT is reserved for IO, having no cache-coherency support. The slides below are from the AMD Hotchips 2009 presentation referenced at the end.
In a 4-way system, one die connects to the first die in each of the other three processors with a half-width HT, and the other die connects to the second die in each. With this strategy, AMD no longer supports 8-socket systems, but now provides 48 cores in a relatively low-cost 4-way system.
Intel finally transitioned off the shared-bus architecture to point-to-point signalling with their protocol being called Quick-Path Interconnect (QPI). The Intel Xeon 7500 series processors (Nehalem-EX) have 4 QPI links. This allows a 4-way system with all processors directly connected (1-hop) and there are redundant links to the IOH devices.
A diagram on Kagai showed that in the Sandy Bridge generation, the 4-way system will just be connected from corner to corner, that is, there will not be the diagonal connections.
The 8-way glue-less Xeon 7500 system connects all processors with 2 hops or less, as shown below.
Intel QPI implements something equivalent to the Snoop Filter from the previous Intel multi-processor MCH devices. There was some discussion that a directory was better for large systems, but Intel did not have time to get this into Westmere-EX (to be the Xeon 7600 series). Presumably this will be built into the Sandy Bridge-EX line?
Historically, Big-Iron meant mainframes. Since even standard high-volume 2-way and 4-way servers are now technically NUMA architectures, I will use the term big-iron to describe the large systems that were formerly called NUMA. AMD will no longer support 8-way systems with the Magny-Cours generation and later.
It appears that HP alone did not implement a glueless 8-way Xeon 7500. The white paper HP ProLiant DL980 G7 server with HP PREMA Architecture (document 4AA3-0643ENW) describes HP's 8-way Xeon 7500 system. It is not clear what the acronym PREMA stands for, but the PREMA architecture is distinguished by the use of node controllers, as shown below.
A node comprises four processors, memory, IO and a pair of node controllers. There are three IOH devices in the system. The processors and memory on the CPU board are connected to the XNC board with the node controllers.
The question is what advantage HP gains with the node controllers over a glueless 8-way system, which already has redundant fabric. The main advantage turns out to be memory latency. The Intel Xeon 7500 processor uses snoop to maintain cache coherency. On a large system, much of the interconnect bandwidth is consumed by cache coherency traffic; of course, that is also why the system has so much interconnect bandwidth. However, even for a local memory access, a processor must still snoop the other processors' caches to maintain cache coherency. This means that while local memory access time seems shorter, after accounting for the snoop it is not that short.
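The effect of the snoop on nominally local accesses can be sketched as follows. All figures here are hypothetical, for illustration only; the model is that a local DRAM read cannot be returned until all remote caches have answered the snoop, whereas a directory lookup is local to the node.

```python
# Why snoop-based coherency inflates "local" latency: a local DRAM read
# cannot complete until all remote caches have answered the snoop.
# Figures are hypothetical, chosen only to illustrate the effect.
DRAM_NS = 90        # raw local DRAM access
SNOOP_HOP_NS = 60   # one-way hop across the interconnect

def local_latency(hops_to_farthest_node, has_directory):
    """Effective local memory latency under snoop or directory coherency."""
    if has_directory:
        # The directory lookup is local and overlaps the DRAM access,
        # so the snoop round-trip to remote nodes is usually avoided.
        return DRAM_NS
    snoop_round_trip = 2 * hops_to_farthest_node * SNOOP_HOP_NS
    return max(DRAM_NS, snoop_round_trip)

# In an 8-way glueless system (2 hops max), the snoop dominates.
print(local_latency(2, has_directory=False))  # 240
print(local_latency(2, has_directory=True))   # 90
```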
The HP node controller maintains a directory of the contents of each processor's cache (not the cache data itself, just the tags). This improves local memory access time.
This is a great deal of effort to gain a moderate performance advantage in a single platform, especially considering that Intel will probably implement a directory-based cache coherency protocol with Sandy Bridge-EX. The real purpose of the node controllers is perhaps to lay the foundation for larger systems. The HP PREMA whitepaper says that future systems could have up to 32 processors. Given that Westmere-EX will have 10 cores, this would comprise 320 cores, plus Hyper-Threading for 640 logical processors.
Even though the Intel Itanium processor architecture family has fallen from consideration as a Windows Server platform, it is still instructive to look at HP Itanium system architecture. The white paper HP Superdome 2: the Ultimate Mission-critical Platform describes the new Superdome with the quad-core Itanium processor codename Tukwila, and the new HP sx3000 chipset. The system architecture diagram is shown below.
Side note: Windows Server 2008 R2 is the last version to support the Itanium instruction set architecture. Microsoft used the availability of 64+ core systems on Itanium to develop the 64+ logical processor capability in Windows Server 2008 R2. The largest Xeon system at the beginning of this effort was a 16-way quad-core Xeon 7300 with 64 cores. Later, the six-core Xeon 7400 did become available.
The three components of the sx3000 chipset are the agent, the crossbar and the PCIe bus adapter. Two processors, memory, two agents and an IOH comprise a blade. Eight blades connected to the crossbar fabric comprise a node. There can be up to four nodes in a Superdome 2 system. While each blade has IO, additional IO Expansion (IOX) units can be connected. A localized view of the blade is shown below.
The sx3000 maintains system cache coherence by using a Remote Ownership Tag Cache realized in on-chip SRAM in the Agent chips. The Level 4 cache, implemented with eDRAM, stores lines fetched from memory on remote blades.
The figure below shows the Superdome 2 complete system topology.
The main difference between the sx3000 agent for Itanium and the PREMA node controller for Xeon 7500, aside from the fact that the sx3000 blade has 2 processors, is that the sx3000 agent has both an on-die directory (to reduce local memory access time) and a L4 cache to reduce average remote blade memory access. The PREMA node controller only maintains a directory to reduce local memory access time.
The Windows operating system can recognize differences in memory organization to better support NUMA systems. This capability actually derives from the memory controller. In the early days of systems with multi-channel memory, memory was interleaved across all channels. It was necessary to populate all channels with identical-capacity memory modules; if a system has four memory channels, then memory is incremented in sets of four identical-capacity modules.
For greater flexibility at the low end, the more recent Intel memory controllers allow memory to be interleaved or separate, or even a mix of the two. In the example below, the system has three memory channels (Nehalem). Channel 0 has a 2GB and a 1GB DIMM, Channel 1 has a 512MB and a 1GB DIMM, and Channel 2 has one 2GB DIMM.
The memory controller reorganizes this to the logical organization below.
Of course, for the best interleaved performance, each memory channel should be populated with identical modules.
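The idea behind channel interleaving can be sketched as a simple address mapping: consecutive cache lines rotate through the channels, so a sequential scan draws bandwidth from all channels at once. This is a simplified illustration; real memory controllers use more elaborate mappings.

```python
# Simplified cache-line interleaving across memory channels:
# consecutive 64-byte lines rotate through the channels, so
# sequential reads engage all channels in parallel.
LINE_BYTES = 64
NUM_CHANNELS = 3  # e.g. the three channels of Nehalem

def channel_for_address(addr):
    """Channel serving the cache line containing this physical address."""
    return (addr // LINE_BYTES) % NUM_CHANNELS

# Four consecutive cache lines land on channels 0, 1, 2, then wrap to 0.
print([channel_for_address(line * 64) for line in range(4)])  # [0, 1, 2, 0]
```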
The operating system can recognize memory in two organizations as shown below. The first shows memory interleaved across all memory channels and all nodes. The second shows memory interleaved across memory channels within a node, but concatenated between nodes.
In a large system, with nodes comprised of more than one processor, potentially memory could be interleaved across all processors within a node, and concatenated between nodes.
It would seem that there could be an advantage to the NUMA model, but there are a host of issues. One is that the operating system might allocate a large block of contiguous memory, which by necessity comes from one node, because the block is contiguous. HP-UX has a solution for this called Locality Optimized Resource Alignment (LORA), where 12.5% of system memory is interleaved and the remainder is localized by NUMA node.
Next is the matter of actually achieving locality. Suppose for example that we are on a 4 node system not using the NUMA memory model. Then 25% of memory accesses go to each node. Now suppose we want to switch to the NUMA memory model. The objective is to achieve better than 25% local memory access. It is not realistic to think about achieving anything close to 100% memory locality, unless of course we run 4 separate instances of SQL Server with each affinitized to a particular node.
To achieve locality, we must go to considerable effort. Users (or equivalent) must connect to SQL Server using a port specific to their group. SQL Server port affinity then localizes the node on which that particular connection will run.
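The client-side half of this scheme might look like the sketch below. The port numbers and the customer-range rule are hypothetical examples; the server-side port-to-node mapping is configured separately (see the BOL topic on mapping TCP/IP ports to NUMA nodes in the links at the end).

```python
# Client-side port selection so that connections for a given customer-ID
# range land on the SQL Server port affinitized to a particular NUMA node.
# Port numbers and the range rule are hypothetical examples.
NODE_PORTS = [1501, 1502, 1503, 1504]   # one listener port per NUMA node
CUSTOMERS_PER_NODE = 250_000            # contiguous customer-ID ranges

def port_for_customer(customer_id):
    """Pick the connection port for this customer's affinity group."""
    node = (customer_id // CUSTOMERS_PER_NODE) % len(NODE_PORTS)
    return NODE_PORTS[node]

print(port_for_customer(100_000))   # 1501
print(port_for_customer(300_000))   # 1502
```

The application then builds its connection string with the selected port, so all work for a customer range executes on, and caches data in, the same node.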
Now that we have localized execution, next we must localize the data structures. If the Customers, Orders, and Line Items tables all have cluster keys leading with Customer ID, and the port affinity is assigned by Customer ID range, then there should be locality of the data structures. The trick is to also balance locality with load distribution: if the low-numbered customers are no longer active, then we do not have a uniform load.
It is also not a good idea to partition by geographic region, especially if each region were in a different time zone. The worst idea (for affinity and load balancing purposes) would be a geographical partition of Europe/Africa, Americas, and Asia. Of course, it would be a good idea to partition a database for an international operation with this scheme for the purpose of enabling a maintenance window.
There should be two broad categorizations of NUMA on SQL Server. One aspect is high-call-volume, or chatty, applications. The other is Data Warehouse style parallel execution plans. High-call-volume applications require port affinity tuning.
Srik Raghavan and Kevin Cox, SQL PASS (not sure what year, but 2008 R2 is discussed), DAT206: Scaling OLTP Applications: Application Design and Hardware Considerations
WinHEC2008 Alex Verbitski and Pravin Mittal Scaling More Than 64 Logical Processors: A SQL Perspective
Stuart Ozer, SQL CAT, at SEAS 2006 DB401 64-bit Solutions on SQL Server
Bob Dorr, SQL PASS 2006, Interviewing SQLOS Developers about NUMA Design Considerations
SQL Server Technical Article We Loaded 1TB in 30 Minutes with SSIS, and So Can You Writers: Len Wyatt, Tim Shea, David Powell Published: March 2009
SQL Server Technical Article Scaling Up Your Data Warehouse with SQL Server 2008 Writers: Eric N. Hanson, Kevin Cox, Alejandro Hernandez Saenz, Boris Baryshnikov, Joachim Hammer, Roman Schindlauer, and Grant Dickinson of Microsoft Corporation. Gunter Zink of HP. Technical Reviewer: Eric N. Hanson, Microsoft Corporation Published: July 2008
thecoderblogs msdnrss (this link is now dead?) sql-2008 and performance tuning material
SQL CAT Top 10 Hidden Gems
Thomas Grohser SQL Server on HP Integrity Failure is not an option HP Connection 2008, www.sqlserver-hwguide.com
From MSDN Library, SQL Server 2008 Books Online Understanding Non-uniform Memory Access
CSS SQL Server Engineers (formerly PSS) blog How It Works: SQL Server 2005 NUMA Basics
MSDN library How to: Map TCP/IP Ports to NUMA Nodes
Slava Oks's WebLog: Configuring SQL Server 2005 for Soft NUMA
ENT-T554 Windows Support For Greater Than 64 Logical Processors by Arie van der Hoeven
SVR-T332 NUMA I/O Optimizations by Bruce Worthington
Mark Friedman's blog on NUMA and TCP/IP Mainstream NUMA and the TCP/IP stack, Part IV: Parallelizing TCP/IP
Hotchips presentation on HT Assist Performance Methodologies for Opteron Servers
Intel Press QPI by Gurbir Singh
Intel Software Network blog Learning Experience of NUMA and Intel's Next Generation Xeon Processor I
HP LORA paper Locality Optimized Resource Alignment
Ray Engler, IBM System x Server Performance Optimizing SQL Server 2005/2008 Performance on Multi-Node Servers IBM excerpt
See Cisco Data Center Virtualization Server Architectures, by Roger Andersson, Silvano Gai, and Tommi Salli, July 2010