I should write an updated SQL Server on NUMA, but for the time being, I will collect available links to SQL Server resources pertaining to NUMA.
There should be two broad categorizations of NUMA on SQL Server. One aspect is high-call volume or chatty applications. The other is Data Warehouse style parallel execution plans. High-call volume applications require port affinity tuning.
From MSDN Library, SQL Server 2008 Books Online Understanding Non-uniform Memory Access
From blogs.msdn.com/psssql CSS SQL Server Engineers: How It Works: SQL Server 2005 NUMA Basics
How to: Map TCP/IP Ports to NUMA Nodes
Slava Oks's WebLog: Configuring SQL Server 2005 for Soft NUMA
Ray Engler, IBM System x Server Performance Optimizing SQL Server 2005/2008 Performance on Multi-Node Servers IBM excerpt
Tuning IBM System x Servers for Performance
Tuning SQL Server 2005 on HP Integrity Servers
HP Integrity server solutions for Business Intelligence
Intel Software Network blog Learning Experience of NUMA and Intel's Next Generation Xeon Processor I
Emulex MSI and MSI-X: New Interrupt Handling Improves System Performance ...
Thomas Grohser SQL Server on HP IntegrityFailure HP Connection 2008, www.sqlserver-hwguide.com
Dean Bethke Technology Architect Microsoft Technology Team:
64-bit Solutions on SQL Server 2005(pdf) at www.sdsqlug.org
Same in PowerPoint from SEAS 2006
64-bit Solutions on SQL Server 2005(ppt)
SQL Server Technical Article We Loaded 1TB in 30 Minutes with SSIS, and So Can You Writers: Len Wyatt, Tim Shea, David Powell Published: March 2009
SQL Server Technical Article Scaling Up Your Data Warehouse with SQL Server 2008 Writers: Eric N. Hanson, Kevin Cox, Alejandro Hernandez Saenz, Boris Baryshnikov, Joachim Hammer, Roman Schindlauer, and Grant Dickinson of Microsoft Corporation. Gunter Zink of HP. Technical Reviewer: Eric N. Hanson, Microsoft Corporation Published: July 2008
SQL Server 2005 Memory Internals
www.sqlstress.com/Overview.aspx David Kubelka http://www.kubelka.com/ http://msdnrss.thecoderblogs.com/2009/08/27/sql-2008-and-performance-tuning-material/ http://sqlcat.com/top10lists/archive/2007/11/21/top-10-hidden-gems-in-sql-server-2005.aspx
 
Joe Chang jchang6@yahoo.com
Originally from High Call Volume SQL Server Applications on NUMA Systems
Oct 07, 2005
One of the most difficult database operations to scale-up on high-end server systems is the network call. In fact, without special precautions, a high call volume database application can have severe negative scaling on large systems. It is suspected that this problem is more pronounced on NUMA systems than basic SMP systems. A NUMA system with 16 processors can have significantly lower throughput capability than an otherwise comparable 4-way SMP system. Many SQL Server performance experts have been advocating for some time the importance of consolidating calls to SQL Server because of the high cost of the call overhead relative to simple SQL operations, such as a single row index seek. There seems to be little awareness that the call overhead cost can escalate dramatically relative to the cost of SQL operations on NUMA systems with high processor count. Given the number of important applications that exhibit the high call volume characteristic, it is rather curious that there is essentially no meaningful public explanation of the special precautions (read: undocumented features) to avoid negative scaling on large systems for SQL Server 2000.
Three possible explanations for failing to consolidate calls in current generation applications include ignorance, ineptitude, and mainframe heritage. The reason the third item is exempt from the first two classifications is that mainframe systems may have been designed to sustain high call volume operations. One example is SAP, which appears to select a block of rows from one table, then retrieves one row from a given table per call. The queries do not even employ a join, as if the application architecture pre-dates relational databases.
Microsoft has announced features in SQL Server 2005 to handle the situation of high call volumes. Some of this is documented under sections describing NUMA features. This includes the ability to better utilize NUMA system memory organization, and the ability to map client connections to NUMA nodes by network port. The second item may be critical in achieving high call volume capability. There is some documentation mentioning the ability of SQL Server 2000 to optimize for NUMA memory organization, but no apparent documentation regarding client connection and CPU group affinity.
One might conclude that this capability is not in SQL Server 2000. However, it is noted that it would be very difficult for SQL Server 2000 to reach the high scores achieved in the published TPC-C benchmarks for NUMA systems with 16 or more processors. All of these systems use the VI protocol and VI capable fiber channel adapters for client connections. This by itself may help performance relative to using the TCP/IP protocol over gigabit Ethernet, but is unlikely to be sufficient. An examination of the TPC-C full disclosure reports shows configuration settings that appear to be connection affinity masks by network port. Now if this were only important for the TPC-C benchmark, then additional documentation may not be of much interest. But there are important actual customer applications that may depend on this capability, so proper documentation from Microsoft or NUMA system vendors would be very helpful.
Current generation systems with more the four processors are frequently NUMA systems (Non-Uniform Memory Architecture). The path from processor to memory has different latencies, depending on whether the processor and memory reside in a common node or a different node. Some older 8-way Pentium III Xeon systems are not NUMA. There are three NUMA chipsets, on which the server system is based, for the Itanium 2 processor relevant to Windows environments. One is based on the Intel E8870 chipset, another from HP for the Integrity and Superdome lines, and a third chipset for NEC systems. Unisys and IBM each have a NUMA chipset for the Intel Xeon MP processors. Unisys uses the Intel E8870 chipset for their Itanium systems. Each of the above NUMA systems supports four processors per cell.
All multi-processor Opteron systems are technically NUMA, in that the processor has a direct connection to memory and to other processors. So there is a difference in latency from processor to directly attached local memory and non-local memory, requiring one or more hops to other processors. However, the other NUMA chipsets have a large difference between local and non-local memory latency, typically in the range of 120-150ns for local memory and 250-300ns for non-local memory. The Opteron 2-way system has a direct attach memory latency on the order of 60ns and a one hop non-local memory latency of 90ns (See AMD documentation for accurate values).
Going forward with trend to multi-core processors, the term processors should be changed to sockets or cores as appropriate to distinguish between the package that plugs into a motherboard, and the entity that appears as a single physical processor to the operating system.
There are some references to the way the Windows operating system organizes memory. The common organization is a simple interleave. Line 1 is spread across the 4 DMH memory hubs of the first SNC. The second line is on the second SNC and so on to the fourth SNC. Line 5 returns back to the first SNC. An alternative option is for processors located in one SNC to be able to allocate memory from the local SNC. There is a SQL Server startup flag T825 referenced as the NUMA option, but there does not appear to be public documentation on this.
The Microsoft Platform SDK documentation mentions two aspects of NUMA support. One involves scheduling threads on processors in the same cell as the memory required. The other allows a thread to allocate memory from the same node.

Figure 1 Intel E870 chipset for 16-way Itanium 2 System
A single call from an application (on a different system) to the SQL Server system may consist of one or more network packet transmissions and one or more receive packets, depending on the amount of data sent in each direction per call, the network packet size, and window size. For this reason, a call to SQL Server should not be referred to as a network round-trip. It is not possible to determine the number of round-trips from the network interface performance counters, which only tracks packet and byte count. SQL Server calls can be measured in Performance Monitor by the object: ‘SQL Server: SQL Statistics’, the counter: ‘Batch Request/sec’. In SQL Profiler, this is measured by the Stored Procedure event ‘RPC:Completed’ and the TSQL event ‘SQL:BatchCompleted’. It is implied that in order to generate a high call volume, the vast majority of calls must be relatively inexpensive and quick to execute.
To explore the characteristics of SQL Server behavior in high call volume applications, a series of tests were conducted on a range of platforms from mid-2004 to present. The test stored procedure generates a single random number, performs another arithmetic operation, and returns the SPID. There are no table operations. An otherwise equivalent stored procedure that performs a single row index seek from a table entirely in the buffer cache is only slightly more expensive than the no-op call (in the range of 15%).
The load generator system is either a 2-way Xeon or a 4 x 2.7GHz Xeon MP. All network traffic runs over a single Gigabit Ethernet network. The network controllers employed include the Broadcom BCM570x, the Intel PRO/1000MT and the Intel PRO/1000XT. There were some performance improvements from the early Windows 2000 drivers to the more recent drivers, but no significant performance differences between driver versions released after 2004 have been observed. The operating system is Windows Server 2003, with hot-fixes current at the time of the test. The SQL Server 2000 versions ranged from SP3, hot-fix build 911, 1027 and SP4.

Figure 2 RPC performance for various 2-way platforms
Figure 2 shows the calls per second performance for various 2-way platforms with between 1 – 20 client connections. The first system is a 2-way 3.06GHz Xeon, 512K cache, 533MHz FSB on the ServerWorks GC-LE chipset. The second system is a 2x3.2GHz Xeon, 1M L2 cache, 800MHz FSB on the Intel E7520 chipset. The third system is a 2x2.2GHz Opteron system. The network controller is either the Broadcom or Intel PRO/1000MT. Data is listed in Appendix A.
Figure 3 compares the 4-way 2.7GHz Xeon MP and 16-way 3.0GHz Xeon MP call volume performance. The 4-way system can drive more than 2 times more calls per second than the 16-way system. There are good examples of SQL Server queries emphasizing data intensive operations rather than network round-trips that show excellent scaling on the 16-way system. So there are no fundamental problems with NUMA systems, only the problem of matching the software stack to the hardware characteristics.

Figure 3 Client Connections
The Microsoft KB article (Interrupt Affinity tool) on the Interrupt Affinity tool describes Windows 2000 as assigning interrupts to any available processor, and that performance improvement may be possible by assigning each network adapter to a specific processor. It is possible that Windows Server 2003 changed the default behavior as suggested and assigns each interrupt to specific processor. Excerpts from this KB article are in Appendix B.
Figure 4 shows the individual CPU utilization from Windows Task Manager on a Unisys 16-way Xeon MP system running Windows Server 2003 while sustaining 17K calls per second. Note that CPU 10 (counting up from 0) is at near 100% utilization. It is suspected that this is the processor handling the network interrupt, but the necessary steps to prove this were not conducted. There was no disk activity in this test. There were no other processes running and nothing else generating network activity. If this interpretation is correct, then the call handling capability of the 16-way system is saturated even though the other processors are not even close to fully loaded. An actual production server (16-way Itanium 2) running SAP exhibited essentially the same characteristics shown in Figure 4. Applying any addition network traffic to the connection handling SQL Server calls resulted in call volume performance degradation, but generating traffic on a different network connection not used by the active SQL Server clients did not degrade performance.

It is possible that distributing the network interrupt over more processors could improve call volume performance. It could also be speculated that excluding the SQL Server process affinity from one or more processors and binding the network interrupt to excluded processor(s) might help, but the net gain is not clear. Another point to note is that the CPU cost per call on the 16-way system is much higher than that of the 4-way system. So even if the CPU load could be evenly distributed, the performance with all 16 processors saturated may be no better than the 4-way call volume performance. It could be that there is substantial cost in having one processor handle the interrupt, then hand off the call to a SQL Server thread running on a processor in a different node.
Figure 5 shows the call volume scaling characteristics on an 8-way Itanium 2 system (HP rx8620, 1.5GHz processors). There are 4 processors in each of 2 cells. The call volume test was conducted with the system booted to 1, 2, 4, and 8 processors using the NUMPROC option in the EFI OS loader (equivalent to the boot.ini file in 32-bit systems).

Figure 5 Call volume performance for HP rx8620 booted to 1, 2, 4 and 8 processors.
It was not determined in the 2 & 4 CPU test whether all processors were in a common cell. Call volume scaling shows only marginal improvement from 1 to 2 processors (13.5K to 16.5K), no gain from 2 to 4 processors, and some degradation from 4 to 8 processors. It is possible that the one or both of the 2 & 4 processor tests.
The TPC-C benchmark is not as call volume intensive as some applications, but on current very high-end systems, would require techniques to avoid the problem described above. So it is worth examining the call volumes reached and any special configurations employed. The TPC-C benchmark requires a minimum of 55% calls for the four transactions other than the new order transaction, implying that new orders comprise at most 45% of calls. This implies that there are at least 2.22 calls per transaction. So score of 600,000 tpm-C constitutes 10,000 new order transactions per second for a total of 22.2K calls/sec. The table below shows the published TPC-C results for SQL Server 2000 on NUMA systems. The 64-way HP Superdome generates over 29,000 calls per second. Even the 16-way systems require over 12,000 calls per second.
All the above reports specify VI connections between the application server and the database server. The excerpts below are from the Oct-2003 NEC full disclosure report. The first excerpt are the boot options (Itanium system settings may be in the EFI instead of the boot.ini file), one of which explicitly states to be a directive to distribute interrupts. The second excerpt describes connection affinity bindings.
OsLoadOptions = /MAXPROCSPERCLUSTER=1 /CONFIGFLAG=8224 CONFIGFLAG=8224 : Interrupt logging disable MAXPROCSPERCLUSTER=1: distribute interrupts to all CPUs equally. We bind each client/server connection (i.e., 8 QLA2350 VIA NICs) onto the distinct CPU sets exclusively. To do this, apply the following settings by using the SQLServer "Server Network Utility". Enabled protocols: "VIA" only Vendor: Select "QLogic" Listen Info: Enter the following string "0:1433[0xf],1:1433[0xf0],2:1433[0xf00],3:1433[0xf000],4:1433[0xf0000],5:1433[0xf00000],6:1433[0x7f000000],7:1433[0x80000000]" Both the Unisys Itanium 2 and Xeon MP reports used the following settings (with some differences). [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSSQLServer\MSSQLServer\SuperSocketNetLib\VIA] "Vendor"="QLogic" "RecognizedVendors"="QLogic" "ListenInfo"="0:1421[0x00000003],1:1422[0x0000000c],2:1423[0x00000030],3:1424[0x000000c0],4:1425[0x00000300], 5:1426[0x00000c00],6:1427[0x00003000],7:1428[0x0000c000]"
There is no documentation from Microsoft detailing the values around the port number. The NEC report is most clear in explaining that this is for binding connections to a specific set of CPUs. Appendix C is a collection of excerpts from Microsoft documentation for SQL Server 2005 on NUMA and I/O affinity. The general strategy seems to be as follows. Each group of clients uses a specific port number. The interrupt for that port number is handled by a specific processor, and the user connection in turn will only use processors in the same node. Note that this feature in SQL Server 2005 applies to both TCP/IP and VI. The TPC-C reports for SQL Server 2000 only show the undocumented settings for the VI protocol, but not explicitly state whether there is or is not support for TCP/IP.
It is also unclear as whether the ListenInfo settings only sets affinity for only the network connection itself or also the SQL Server thread and SPID handling any connections open from that port. A look the registry setting for published SQL Server 2005 TPC-C results (Appendix D) indicates additional registry settings for CPU affinity by node.
It is also unclear as whether the ListenInfo settings only sets affinity for only the network connection itself or also the SQL Server thread and SPID handling any connections open from that port. A look the registry setting for published SQL Server 2005 TPC-C results (Appendix D) indicates additional registry settings for CPU affinity by node. Additional tuning parameters from the TPC-C full disclosures are as follows. The SQL Server Startup parameters commonly used are: -T825 Enable NUMA support -T826 don’t use large pages for buffer array -T827 enable Super Latches -T888 Use locked pages for buffers -E increase the number of consecutive extents allocated per file to x Below is a setting change in permissions. Local Policies - User Rights Assignments - policy 'Lock pages in memory' added group 'Administrators' The following was used in the HP Superdome report of Oct 2003. Environment variable LOGAFFINITY specifies which processor the log write will be affinitized to.
The performance implications of high call volume SQL Server applications on large NUMA based server systems are clearly demonstrated. Microsoft has documented features in SQL Server 2005 involving NUMA memory optimizations and client connection affinity tuning for both the TCP/IP and VI protocols. There are indications that NUMA memory optimization and the client affinity tuning for the VI protocol already exist in SQL Server 2000, but there does not appear to be public documentation. It would be important to understand to what degree the high call volume issue on large systems can be alleviated on TCP/IP over Gigabit Ethernet and when it is necessary to use VI. Anyone with good connections to the Microsoft SQL Server team should make direct inquires if there is reason to believe this subject is important to their environment.
 
At Intel Developer Forum 2009 last week, Microsoft disclosed significant advances in Windows Server 2008 R2 with the elimination of many locks, most prominently the Dispatch Scheduler lock, that impact the ability scale performance up to and beyond 64 cores. Look for the presentation: Microsoft & Intel Innovations in Hardware and Software to Deliver New Technology Experiences by Mark Russinovich, and Shiv Kaushik, Intel Developer Forum session SPCS003 (http://www.intel.com/idf/training-sessions/ or https://intel.wingateweb.com/us09/scheduler/catalog/catalog.jsp), also available as a webcast (http://www.intel.com/idf/pressroom/video.htm)
Earlier I talked about Big-Iron revival with the upcoming Intel Nehalem-EX eight-core processor. Intel will finally have a Xeon processor with high-end scaling potential. The Intel Server Update slide deck by Boyd Davis state that 8 OEMs have 15 system designs for 8-way and larger in the works. Several system vendors, including IBM, NEC and Unisys, have had 8-way to 32-way systems for several Xeon processor generations, but it was always apparent that performance scaling trailed off after 4 sockets. See the IDF session SPCS002 Technology Insight: Intelligent and Expandable High-End Intel Server Platform, Codenamed Nehalem-EX by Stephen Pawlowski for details.
Scaling performance on big iron NUMA systems involves a complex combination of matching the database architecture to the capabilities of the underlying SQL Server engine, the operating system and hardware architecture. For major improvements in two pillars of the foundation to arrive together is bound to generate excitement and anticipation.
 
SQL Server 2000 and later
In the old days, it was possible but very difficult to scale SQL Server 2000 on the contemporary hardware and operating system of that time. Many of the difficulties were in the SQL Server engine, but there were limitations with the Windows operating system and hardware as well. Certain operations in SQL Server 2000 could scale well beyond 4 cores or over multiple NUMA nodes. Other operations did not, and some operations even had severe negative scaling beyond 4 cores.
In order to scale on NUMA systems, it was necessary to design the database cluster keys, non-clustered indexes and write SQL queries so that the execution plan avoided problematic operations. Almost none of the details for this have ever been published (mine were originally published else where, but I will try to collect them on my website www.qdpma.com). Apparently vendors do not like to tell customers they should completely re-architect. As a database architect consultant, I do not see why this subject is taboo; my completely unbiased opinion!
Many developers and even data architects do not have adequate skills for generating efficient execution plans on ordinary SMP systems, let alone manipulating the execution plan to match a specific NUMA system architecture.
People have just tried to put an existing application on a big-iron system without any consideration for redesigning database architecture, or even using execution plan hints. Usually this leads to an uneven outcome, and frequently severe problems specific to NUMA systems. I believe enough people encountered such problems that a general awareness developed to avoid big-iron systems. In the last few years, I have encountered few people contemplating the purchase of the big-iron system, and these were usually the ProLiant DL785.
With SQL Server 2005, many scaling problems in the database engine were resolved. Service pack 2 provided another round of significant fixes. SQL Server 2008 introduced data warehouse performance enhancement, but I found these to be problematic and would sometimes rewrite a query to force the 2005 execution plans.
Windows Server Operating System
Microsoft makes ongoing enhancements to the core server operating system to improve performance scaling over many processors. Improvements were made from Windows 2000 to 2003, but a change in the handling of interrupts caused performance issues in large systems. This was resolve in a post SP1 hot-fix. At Microsoft WinHEC 2007, disk I/O handling in NUMA systems was discussed, but is was unclear whether this was a Windows Server 2008 RTM or R2 feature.
See the follow WinHEC presentations for more details:
ENT-T554 Windows Support For Greater Than 64 Logical Processors by Arie van der Hoeven
ENT-T555 Scaling More Than 64 Logical Processors: A SQL Perspective by Alex Verbitski and Pravin Mittal
WinHEC 2007
SVR-T332 NUMA I/O Optimizations by Bruce Worthington
The WinHEC ENT-T555 presentation mentions OLTP scaling of 1.7X from 64 to 128 Logical Processors. The IDF presentations states 1.7X scaling from 128 to 256 LP, but it is very possible these two presentations do not reference the same baseline.
Even though the core elements will soon be in place to enable broadly scalable performance on big-iron systems, the expectation is that it will take time for the SQL Server engine team, and the Windows operating system team to build enough experience on the Nehalem-EX NUMA platforms to make all of this work together out of box. In the meantime, there are a handful of consultants with deep NUMA performance tuning experience that can make this happen as is (not to be construed as a solicitation for services).
Itanium
The Itanium processor and system architecture was designed for big system scaling, but the processor has languished at the 90nm dual-core Montvale. The 65nm quad-core Tukwila has encountered multiple delays to 2010? Itanium is now mostly positioned as having extensible reliability and availability features (Machine Check Architecture).
 
Later on, there will be Magny-Cours, which would be 2 Istanbul die in one package. Istanbul has six cores, 3 Hyper Transport links and 2 memory channels. In Magny-Cours, the two six core chips are linked by one HT link, so the external package will have 12 cores, 4 HT links and 4 memory channels. After this, a new improved micro-architecture would arrive?
Now there have been big iron Windows systems for many years. The HP Superdome supports up to 64 Itanium 2 sockets. The problem has been that Intel has not kept pace with Itanium. The current Itanium 9100 series, Montvale, is a 90nm dual core, while the Xeon line is at 45nm and six+ cores. Tukwila, the 65nm quad-core Itanium that should have been launched in 2008, was recently delayed until 2010. Supposedly Itanium should finally be caught up on process technology in 2011 with the 32nm Poulson. Unisys (ES7000 7600R), NEC (Express5800/A1160) and IBM (x3950M2) all have had 16-socket capable Xeon systems for a while. HP has the 8-way ProLiant DL785G5 for Opteron processors (I really would like to get the architectural diagram for how HP connects the 8 sockets). I have not followed Sun since I focus on Windows/SQL Server. (Sun has the 8-way x4600 for Opteron. see Sun Fire x4600 M2 Server Architecture whitepaper for an architectural diagram on how 8 Opterons are connected in a twisted ladder)
Still, I consider this to be a revival or perhaps true arrival of big iron because of the issues in the past on scaling beyond 4-sockets, both in terms of performance and price-performance.
Previously, there were technical challenges in scaling the Intel Xeon beyond 4 sockets, both for the system vendors in designing such a system, and the DBA/developer in getting their application to scale beyond 4-sockets. For an OEM to build an 8-way+ system, it required the effort to built custom chips, the market volume was low, and Intel kept changing the FSB. All of this meant there was a big step up in price per socket going from a 4-socket system to 8, 16 or 32.
Dell is the question. Their attitude might be that they do not expect to sell many big-iron systems, considering the technical difficulties they had in the past on this. To sell big iron, it is absolutely necessary to have top technical expertise to go into customer shops to find out if it is the right solution and what changes need to be made to deploy successfully. (OEM reps are invited to drop hints, even if its still a company secret, we will keep it just between us)
[OK, I forgot about SGI, they have big iron Itanium, which means if they do a chipset for the next gen Itanium with QPI, they can do a Nehalem-EX too. plus they just blogged this http://ceoblog.sgi.com/]
Up to Windows Server 2008 RTM, the OS does not support more than 64 cores, physical or logical. This limit will be lifted with Windows Server 2008 R2, accompanied by SQL Server 2008 R2(?). Both the Unisys 7600R and NEC A1160 posted TPC-E benchmark results for 16-sockets, but only 4 of the 6 cores in the Intel X7460 processor enabled, to stay under the current 64-core limit. Scaling was decent, but not spectacular, going from 721tps-E@4-sockets/24 cores, to 1156 tps-E@8S/48c, to 1400tps-E@12S/64c and 1568tps-E@16S/64c.
Note that scaling large/(hard) NUMA systems require proper use of port affinity settings, and how interrupts are handled. Windows 2008 R2 supposedly has a much improved disk I/O handling on NUMA systems.
The Intel announcement mentioned that 4-way Nehalem EX will have 2.5X+ performance over 4-way Xeon 7460, based on a very recent internal measurement using OLTP workload, i.e., TPC-C or TPC-E. This is also inline with the huge TPC-C and TPC-E gains posted by 2-way Xeon 5500 over Xeon 5400. Previously I discussed this matter. Each Nehalem core should have moderately better performance than a Core 2 micro-architecture core. Nehalem systems have more memory channels to better support multi-core scaling. The Nehalem EX 4-way system has 16 memory channels supporting 32 cores, versus the Xeon 7400 (7300 MCH) 4 memory channels supporting 24 cores. Nehalem EX will have 8 physical cores compared with 6 on Xeon 7460. Finally, both TPC-C and TPC-E benefit from Hyper-Threading, a feature from the Pentium 4 (NetBurst) micro-architecture (designed in Oregon), but not implemented by Core 2 (designed in Israel). Anyways, 2.5X over X7460 means 1.6M tpm-C or 1700 tps-E.
Now both TPC-C and TPC-E are OLTP benchmarks (workloads). The interpretation should not be that HT (and large cache) benefit OLTP workloads as in any one else's OLTP workload. Each TPC-C transaction involves on average 2.25 or so RPC calls (network roundtrip) and each TPC-E transaction involves approximately 22.3 RPCs. By looking at the recent results on Xeon 7460 or Opteron Quad-core, one can figure out that the average cost per RPC in both TPC-C and E is on the order of 1 CPU-millisecond (the duration of the complete RPC might be longer, say 80-400ms)
The correct interpretation should be that HT and large cache benefits high call volume applications, transaction processing or not. HT benefits mostly in the network round-trip. This was based on tests done on the previous version of HT, i.e., Pentium 4 architecture. I did not find one SQL operation that benefited from HT except in handling just the RPC overhead. The Quest LiteSpeed compression engine did show huge gains with HT, 40%. This indicates the theory behind HT is valid. One just needs to figure what in the SQL Server engine does not like HT. It is possible that the HT in Nehalem now works better with SQL Server.
The large cache reduces the (fixed) startup cost of an SQL operation, but not the incremental cost per additional rows. So if someone else's OLTP application average 10 CPU-ms per call, then it might not show as much gain going from Core 2 to Nehalem.
I suspect this is the reason Intel has not posted any TPC-H benchmark results. It should show some gain over Core 2, just not the spectacular gains in C & E. I am inclined to think that the 4-way Xeon 7460 is memory bandwidth constraint in TPC-H, and that is alleviated in Nehalem, but there are no published TPC-H results to substantiate this matter.
Dunnington and Nehalem EX are both 45nm. Dunnington has 1.9 billion transistors, 6 cores, there is a 3M L2 cache shared by each pair of cores, and a 16M L3 cache shared by all cores for a L2+L3 total of 25M. Nehalem EX has 2.3B transistors, 256K L2 cache dedicated for each core and 24M L3 cache for 26M L2+L3 cache. Granted there is a big increase in latency from L2 to L3. I would interest to see the supporting data (estimates made before the design work) for the big L2 caches in Dunnington.
Even with all of the improvements over time, on the hardware with Nehalem, integrated memory controllers, QPI, on the software stack, w2k8r2 and s2k8r8, scaling on NUMA systems is not trivial. What SQL execution plan operations scale?, what does not?, what might have negative scaling? etc, what problems can be fixed with code changes etc. All of this should be done with proper expertise. (Not to be construed as an advertisement or solicitation for services, this will not be cheap either)
PS -
I am neither advocating nor criticizing big-iron systems. The important point is that new systems coming every year are approximately 40% more powerful at comparable price ranges. That means the value of compute power depreciates at 30% per year (1/1.4 = 0.71). So it does not make sense to buy now for what you do not expect to need for 2+ years. Buy what need for the next year, and buy a new system after that, rotating the existing system to a less important task. Of course, if you work for an inflexible government agency that mandates replacement at 5 year intervals, or if buying the $1M system makes you more important than the other group that runs on a $30K system, well then go for it! On the flip side, one should not argue for the minimum system that meets requirements, but rather think about how massive compute power can be used to generate value.
I used have many complaints about Intel, particularly on the chipsets. Most have been addressed. The remaining complaint is that Intel has a twisted view that 4-way systems are special, ie, compared to 2-way systems. This is why the 6-core Dunnington is only used in the Xeon 7400 series and not the 5400 series, even though there is no reason it cannot be used in the 5400. The same applies to the upcoming 8-core Nehalem EX being only positioned in the 7000 line and not the 5000 line. AMD has no issues offering 6-core Istanbul in a 2-way. Hopefully, hardware vendors will have a better picture of customer interests, and offer a 2-way for Nehalem EX. Sure I know it is not cheap, this is why the different between men a boys is the size and price of their toys.
| CPU | memory | Power | Throughput | QphH |
|---|---|---|---|---|
| 32 Itanium2 DC | 256GB | 90,909 | 53,899 | 69,999 |
| 128 Core2 QC | 2080GB | 782,609 | 1,740,122 | 1,166,977 |