Joe Chang jchang6@yahoo.com
At Intel Developer Forum 2009 last week, Microsoft disclosed significant advances in Windows Server 2008 R2 with the elimination of many locks, most prominently the Dispatch Scheduler lock, that impact the ability to scale performance up to and beyond 64 cores. Look for the presentation: Microsoft & Intel Innovations in Hardware and Software to Deliver New Technology Experiences by Mark Russinovich and Shiv Kaushik, Intel Developer Forum session SPCS003 ( http://www.intel.com/idf/training-sessions/ or https://intel.wingateweb.com/us09/scheduler/catalog/catalog.jsp), also available as a webcast ( http://www.intel.com/idf/pressroom/video.htm).
Earlier I talked about Big-Iron revival with the upcoming Intel Nehalem-EX eight-core processor. Intel will finally have a Xeon processor with high-end scaling potential. The Intel Server Update slide deck by Boyd Davis states that 8 OEMs have 15 system designs for 8-way and larger in the works. Several system vendors, including IBM, NEC and Unisys, have had 8-way to 32-way systems for several Xeon processor generations, but it was always apparent that performance scaling trailed off after 4 sockets. See the IDF session SPCS002 Technology Insight: Intelligent and Expandable High-End Intel Server Platform, Codenamed Nehalem-EX by Stephen Pawlowski for details.
Scaling performance on big iron NUMA systems involves a complex combination of matching the database architecture to the capabilities of the underlying SQL Server engine, the operating system and hardware architecture. For major improvements in two pillars of the foundation to arrive together is bound to generate excitement and anticipation.
 
SQL Server 2000 and later
In the old days, it was possible but very difficult to scale SQL Server 2000 on the contemporary hardware and operating system of that time. Many of the difficulties were in the SQL Server engine, but there were limitations with the Windows operating system and hardware as well. Certain operations in SQL Server 2000 could scale well beyond 4 cores or over multiple NUMA nodes. Other operations did not, and some operations even had severe negative scaling beyond 4 cores.
In order to scale on NUMA systems, it was necessary to design the database clustered keys and non-clustered indexes, and to write SQL queries, so that the execution plan avoided problematic operations. Almost none of the details for this have ever been published (mine were originally published elsewhere, but I will try to collect them on my website www.qdpma.com). Apparently vendors do not like to tell customers they should completely re-architect. As a database architect consultant, I do not see why this subject is taboo; my completely unbiased opinion!
Many developers and even data architects do not have adequate skills for generating efficient execution plans on ordinary SMP systems, let alone manipulating the execution plan to match a specific NUMA system architecture.
People have just tried to put an existing application on a big-iron system without any consideration for redesigning the database architecture, or even using execution plan hints. Usually this leads to an uneven outcome, and frequently to severe problems specific to NUMA systems. I believe enough people encountered such problems that a general awareness developed to avoid big-iron systems. In the last few years, I have encountered few people contemplating the purchase of a big-iron system, and these were usually the ProLiant DL785.
With SQL Server 2005, many scaling problems in the database engine were resolved. Service pack 2 provided another round of significant fixes. SQL Server 2008 introduced data warehouse performance enhancements, but I found these to be problematic and would sometimes rewrite a query to force the 2005 execution plans.
Windows Server Operating System
Microsoft makes ongoing enhancements to the core server operating system to improve performance scaling over many processors. Improvements were made from Windows 2000 to 2003, but a change in the handling of interrupts caused performance issues in large systems. This was resolved in a post-SP1 hot-fix. At Microsoft WinHEC 2007, disk I/O handling in NUMA systems was discussed, but it was unclear whether this was a Windows Server 2008 RTM or R2 feature.
See the following WinHEC presentations for more details:
ENT-T554 Windows Support For Greater Than 64 Logical Processors by Arie van der Hoeven
ENT-T555 Scaling More Than 64 Logical Processors: A SQL Perspective by Alex Verbitski and Pravin Mittal
WinHEC 2007
SVR-T332 NUMA I/O Optimizations by Bruce Worthington
The WinHEC ENT-T555 presentation mentions OLTP scaling of 1.7X from 64 to 128 Logical Processors. The IDF presentation states 1.7X scaling from 128 to 256 LP, but it is very possible these two presentations do not reference the same baseline.
Even though the core elements will soon be in place to enable broadly scalable performance on big-iron systems, the expectation is that it will take time for the SQL Server engine team and the Windows operating system team to build enough experience on the Nehalem-EX NUMA platforms to make all of this work together out of the box. In the meantime, there are a handful of consultants with deep NUMA performance tuning experience who can make this happen as is (not to be construed as a solicitation for services).
AMD Istanbul
AMD crowed loudly that Opteron, with its integrated memory controller and Hyper-Transport, scaled memory and inter-processor bandwidth with the number of processors, while the Intel systems up to the Xeon 7400 series were constrained by a shared processor front-side bus and the fixed memory bandwidth of a discrete memory controller. See for example excerpts from: SQL Server 2005 and AMD64 –a winning team.

With the announcement of the six-core Opteron, codename Istanbul, we find out that AMD previously did not have a mechanism for maintaining cache-coherency comparable to the Snoop Filter in the Intel chipsets. Without this, much of the available bandwidth is consumed by cache coherency traffic, limiting scaling in systems with 4 or more processor sockets. In Istanbul, the HT Assist, or Probe Filter, feature uses 1M of the 6M L3 as a directory cache to track cache lines. AMD measured 42GB/s memory bandwidth with HT Assist versus 25.5GB/s without HT Assist.

So there is now an expectation that Opteron systems should have improved scaling in 4-way and larger systems. While HP and Sun have had 8-way Opteron systems since the quad-core Barcelona, only TPC-H data warehouse benchmarks have been published. To date no TPC-C or TPC-E OLTP benchmarks have been published for 8-way Opteron systems, even with the six-core Istanbul with HT Assist. For that matter, no TPC-C or TPC-E benchmarks have been published for 4-way systems with the six-core Istanbul, even though 4-way quad-core Opteron systems have posted respectable results on both TPC-C and TPC-E. It is possible that this is not a simple feature to implement and the first attempt has issues. Hopefully a fixed version will be available before too long and we can see OLTP benchmark results for subsequent generation Opteron systems.
 
Itanium
The Itanium processor and system architecture was designed for big system scaling, but the processor has languished at the 90nm dual-core Montvale. The 65nm quad-core Tukwila has encountered multiple delays, now to 2010(?). Itanium is now mostly positioned as having extensive reliability and availability features (Machine Check Architecture).
 
Yesterday Intel held a product announcement press event for the upcoming Nehalem EX, which will succeed the current Xeon 7400 series based on the Core 2 micro-architecture for "expandable system", i.e., 4-way and higher, in late 2009 or early 2010. The current Xeon 5500 series (also Nehalem architecture) has 4 cores, 8M shared L3, 2 QPI links, and 3 DDR3 memory channels. Nehalem EX has 8 cores, 24M shared L3 cache, 4 QPI links and 4 FBD memory channels (there is now a Scalable Memory Buffer between the memory interface and memory, did Intel just move the AMB from the DIMM to the motherboard?).
AMD has also recently discussed their plans. The current quad-core Shanghai gets a frequency bump from 2.7GHz to 3.1GHz, and a six-core Istanbul should be released very soon (June, announced at 2.6GHz). See the Johan de Gelas Anandtech article on Istanbul. It describes HT Assist (essentially a snoop filter for HT) as using 1M of the L3 cache. The HP ProLiant DL585G6 for Istanbul also appears to be HT version 3.0 or HT3, upping the HT transfer rate from 2GT/s to 4.4GT/s.
Later on, there will be Magny-Cours, which would be 2 Istanbul dies in one package. Istanbul has six cores, 3 Hyper-Transport links and 2 memory channels. In Magny-Cours, the two six-core chips are linked by one HT link, so the external package will have 12 cores, 4 HT links and 4 memory channels. After this, a new improved micro-architecture would arrive?
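The package-level counts are simple bookkeeping on the per-die figures. A minimal sketch, using only the numbers quoted above and assuming that exactly one HT link on each die is consumed by the die-to-die connection (the function name is hypothetical):

```python
def mcm_package(dies, cores_per_die, ht_links_per_die, mem_channels_per_die,
                internal_links_per_die=1):
    """Return (cores, external HT links, memory channels) for a multi-die package.

    Assumption: each die gives up `internal_links_per_die` HT links to the
    die-to-die connection inside the package.
    """
    cores = dies * cores_per_die
    external_ht = dies * (ht_links_per_die - internal_links_per_die)
    mem_channels = dies * mem_channels_per_die
    return cores, external_ht, mem_channels


# Two Istanbul-style dies: 6 cores, 3 HT links, 2 memory channels each.
magny_cours = mcm_package(2, 6, 3, 2)
print(magny_cours)
```

With the figures as given here, this reproduces the 12 cores, 4 HT links and 4 memory channels stated for the package.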
Now there have been big iron Windows systems for many years. The HP Superdome supports up to 64 Itanium 2 sockets. The problem has been that Intel has not kept pace with Itanium. The current Itanium 9100 series, Montvale, is a 90nm dual core, while the Xeon line is at 45nm and six+ cores. Tukwila, the 65nm quad-core Itanium that should have been launched in 2008, was recently delayed until 2010. Supposedly Itanium should finally be caught up on process technology in 2011 with the 32nm Poulson. Unisys (ES7000 7600R), NEC (Express5800/A1160) and IBM (x3950M2) all have had 16-socket capable Xeon systems for a while. HP has the 8-way ProLiant DL785G5 for Opteron processors (I really would like to get the architectural diagram for how HP connects the 8 sockets). I have not followed Sun since I focus on Windows/SQL Server. (Sun has the 8-way x4600 for Opteron. See the Sun Fire x4600 M2 Server Architecture whitepaper for an architectural diagram of how 8 Opterons are connected in a twisted ladder.)
Still, I consider this to be a revival or perhaps true arrival of big iron because of the issues in the past on scaling beyond 4-sockets, both in terms of performance and price-performance.
Previously, there were technical challenges in scaling the Intel Xeon beyond 4 sockets, both for the system vendors in designing such a system, and the DBA/developer in getting their application to scale beyond 4 sockets. For an OEM to build an 8-way+ system, it required the effort to build custom chips, the market volume was low, and Intel kept changing the FSB. All of this meant there was a big step up in price per socket going from a 4-socket system to 8, 16 or 32.
This was the rationale for Oracle RAC. Instead of buying really expensive big-iron hardware, one can buy lower cost high volume hardware and really expensive software licenses. Think about it. Scaling up on big iron or a RAC-type technology depends on interconnect bandwidth and latency. For either the Intel QPI or AMD HT, it should be possible to achieve far better bandwidth and latency in big-iron than a RAC-type solution. The best InfiniBand can do now in a x4 link is 40Gbit/s (5GB/s) at approx 1us latency.
Now that there is the prospect of stability in the Intel processor interconnect, my expectation is that we should now see 8-way+ systems at a less severe price premium over 4-way systems. (There will always be a premium because validating and supporting big systems requires deeper technical skills.) On AMD Opteron, having the 4 HT ports from one package enables 8-way glue-less systems (with fewer hops) and helps in building 8-way+ (with glue?).
The Intel announcement stated that 8 OEMs have 15 or so 8-way+ (including 16 and 32-way) Nehalem EX systems in the works. IBM, NEC and Unisys are obviously 3 of the OEMs, given their recent commitment to big-iron Xeon. Fujitsu and Hitachi might be another 2, as the Japanese players love big-iron. Sun should be one, for 6 of the 8 OEMs. I am guessing this means HP and Dell are the two remaining OEMs. HP is no surprise. They already have the 8-way Opteron. Their commitment to Itanium means that HP would have built a chipset around QPI for the next generation, which is the same processor interconnect on Nehalem.
Dell is the question. Their attitude might be that they do not expect to sell many big-iron systems, considering the technical difficulties they had in the past on this. To sell big iron, it is absolutely necessary to have top technical expertise to go into customer shops to find out if it is the right solution and what changes need to be made to deploy successfully. (OEM reps are invited to drop hints, even if it's still a company secret, we will keep it just between us.)
[OK, I forgot about SGI, they have big iron Itanium, which means if they do a chipset for the next gen Itanium with QPI, they can do a Nehalem-EX too. Plus they just blogged this http://ceoblog.sgi.com/]
Up to Windows Server 2008 RTM, the OS does not support more than 64 cores, physical or logical. This limit will be lifted with Windows Server 2008 R2, accompanied by SQL Server 2008 R2(?). Both the Unisys 7600R and NEC A1160 posted TPC-E benchmark results for 16 sockets, but with only 4 of the 6 cores in the Intel X7460 processor enabled, to stay under the current 64-core limit. Scaling was decent, but not spectacular, going from 721 tps-E@4-sockets/24 cores, to 1156 tps-E@8S/48c, to 1400 tps-E@12S/64c and 1568 tps-E@16S/64c.
Note that scaling large (hard) NUMA systems requires proper use of port affinity settings and attention to how interrupts are handled. Windows 2008 R2 supposedly has much improved disk I/O handling on NUMA systems.
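To put "decent, but not spectacular" into numbers, here is a small sketch that computes speedup, throughput per core and scaling efficiency relative to the 4-socket result. It uses only the four data points quoted above. The efficiency definition (speedup divided by the increase in enabled cores) is an assumption chosen for illustration, not an official TPC metric.

```python
# (sockets, enabled cores, tps-E) as quoted in the text above
results = [(4, 24, 721), (8, 48, 1156), (12, 64, 1400), (16, 64, 1568)]


def scaling_table(results):
    """Return rows of (sockets, cores, speedup, tps-E per core, efficiency).

    Efficiency here is speedup over the first row divided by the ratio of
    enabled cores to the first row's cores (an illustrative definition).
    """
    _, base_cores, base_tps = results[0]
    rows = []
    for sockets, cores, tps in results:
        speedup = tps / base_tps
        per_core = tps / cores
        efficiency = speedup / (cores / base_cores)
        rows.append((sockets, cores, speedup, per_core, efficiency))
    return rows


for sockets, cores, speedup, per_core, eff in scaling_table(results):
    print(f"{sockets:>2}S/{cores}c  speedup {speedup:.2f}  "
          f"tps-E/core {per_core:.1f}  efficiency {eff:.2f}")
```

On these figures, doubling from 24 to 48 cores yields about 1.6X, and the 16-socket result is a little under 2.2X the 4-socket result on 2.67X the cores.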
The Intel announcement mentioned that 4-way Nehalem EX will have 2.5X+ performance over 4-way Xeon 7460, based on a very recent internal measurement using an OLTP workload, i.e., TPC-C or TPC-E. This is also in line with the huge TPC-C and TPC-E gains posted by 2-way Xeon 5500 over Xeon 5400. Previously I discussed this matter. Each Nehalem core should have moderately better performance than a Core 2 micro-architecture core. Nehalem systems have more memory channels to better support multi-core scaling. The Nehalem EX 4-way system has 16 memory channels supporting 32 cores, versus the Xeon 7400 (7300 MCH) with 4 memory channels supporting 24 cores. Nehalem EX will have 8 physical cores compared with 6 on Xeon 7460. Finally, both TPC-C and TPC-E benefit from Hyper-Threading, a feature from the Pentium 4 (NetBurst) micro-architecture (designed in Oregon), but not implemented by Core 2 (designed in Israel). Anyways, 2.5X over X7460 means 1.6M tpm-C or 1700 tps-E.
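The memory channel argument can be made concrete with a quick ratio. This is only a sketch on the channel and core counts quoted above. It ignores per-channel bandwidth differences between the two memory technologies, which the text does not quantify.

```python
def channels_per_core(memory_channels, cores):
    """Memory channels available per core in a system."""
    return memory_channels / cores


nehalem_ex_4way = channels_per_core(16, 32)   # 4 sockets x 4 channels, 32 cores
xeon_7400_4way = channels_per_core(4, 24)     # 7300 MCH: 4 channels, 24 cores
ratio = nehalem_ex_4way / xeon_7400_4way

print(f"Nehalem EX: {nehalem_ex_4way:.3f} channels/core")
print(f"Xeon 7400:  {xeon_7400_4way:.3f} channels/core")
print(f"ratio:      {ratio:.1f}x")
```

By channel count alone, each Nehalem EX core has about three times the share of a memory channel that a Xeon 7400 core has.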
Now both TPC-C and TPC-E are OLTP benchmarks (workloads). The interpretation should not be that HT (and large cache) benefit OLTP workloads as in anyone else's OLTP workload. Each TPC-C transaction involves on average 2.25 or so RPC calls (network round-trips) and each TPC-E transaction involves approximately 22.3 RPCs. By looking at the recent results on Xeon 7460 or Opteron quad-core, one can figure out that the average cost per RPC in both TPC-C and E is on the order of 1 CPU-millisecond (the duration of the complete RPC might be longer, say 80-400ms).
The correct interpretation should be that HT and large cache benefit high call volume applications, transaction processing or not. HT benefits mostly in the network round-trip. This was based on tests done on the previous version of HT, i.e., the Pentium 4 architecture. I did not find one SQL operation that benefited from HT except in handling just the RPC overhead. The Quest LiteSpeed compression engine did show huge gains with HT, 40%. This indicates the theory behind HT is valid. One just needs to figure out what in the SQL Server engine does not like HT. It is possible that the HT in Nehalem now works better with SQL Server.
The large cache reduces the (fixed) startup cost of an SQL operation, but not the incremental cost per additional row. So if someone else's OLTP application averages 10 CPU-ms per call, then it might not show as much gain going from Core 2 to Nehalem.
I suspect this is the reason Intel has not posted any TPC-H benchmark results. It should show some gain over Core 2, just not the spectacular gains in C & E. I am inclined to think that the 4-way Xeon 7460 is memory bandwidth constrained in TPC-H, and that this is alleviated in Nehalem, but there are no published TPC-H results to substantiate this.
Dunnington and Nehalem EX are both 45nm. Dunnington has 1.9 billion transistors and 6 cores, with a 3M L2 cache shared by each pair of cores and a 16M L3 cache shared by all cores, for an L2+L3 total of 25M. Nehalem EX has 2.3B transistors, a 256K L2 cache dedicated to each core and a 24M L3 cache, for 26M of L2+L3 cache. Granted there is a big increase in latency from L2 to L3. I would be interested to see the supporting data (estimates made before the design work) for the big L2 caches in Dunnington.
Even with all of the improvements over time, on the hardware side with Nehalem, integrated memory controllers and QPI, and on the software stack with w2k8r2 and s2k8r2, scaling on NUMA systems is not trivial. Which SQL execution plan operations scale? Which do not? Which might have negative scaling? What problems can be fixed with code changes? All of this should be done with proper expertise. (Not to be construed as an advertisement or solicitation for services; this will not be cheap either.)
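The per-RPC CPU cost can be estimated from a published throughput figure. The sketch below uses the 4-socket X7460 TPC-E result quoted earlier in this post (721 tps-E on 24 cores) and the 22.3 RPCs per transaction above. It assumes all cores are fully busy at the reported throughput, so the result is an upper bound. The function name and the `utilization` parameter are hypothetical.

```python
def cpu_ms_per_rpc(tps, rpcs_per_txn, cores, utilization=1.0):
    """Average CPU-milliseconds consumed per RPC.

    Assumption: `cores` are busy for `utilization` of every second at the
    reported throughput, so CPU-ms available per second is cores * 1000.
    """
    rpcs_per_second = tps * rpcs_per_txn
    cpu_ms_per_second = cores * 1000.0 * utilization
    return cpu_ms_per_second / rpcs_per_second


tpce_cost = cpu_ms_per_rpc(tps=721, rpcs_per_txn=22.3, cores=24)
print(f"TPC-E: about {tpce_cost:.2f} CPU-ms per RPC")
```

This comes out at roughly 1.5 CPU-ms per RPC, consistent with "on the order of 1 CPU-millisecond". TPC-C is left out because tpm-C counts only one transaction type in the mix, so the same one-line estimate does not apply directly.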
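The cache totals can be checked with a few lines of arithmetic. This sketch uses only the sizes quoted above and treats 1M as 1024K. The function names are hypothetical.

```python
def dunnington_cache_mb(cores=6, l2_mb_per_pair=3, l3_mb=16):
    """L2 is shared by each pair of cores; L3 is shared by all cores."""
    l2_total = (cores // 2) * l2_mb_per_pair
    return l2_total + l3_mb


def nehalem_ex_cache_mb(cores=8, l2_kb_per_core=256, l3_mb=24):
    """Each core has a dedicated L2; L3 is shared by all cores."""
    l2_total = cores * l2_kb_per_core / 1024
    return l2_total + l3_mb


print(dunnington_cache_mb())   # 3 pairs x 3M + 16M
print(nehalem_ex_cache_mb())   # 8 x 256K + 24M
```

The totals are nearly the same (25M versus 26M), but Dunnington puts 9M of it in L2 while Nehalem EX puts only 2M there.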
PS -
I am neither advocating nor criticizing big-iron systems. The important point is that new systems coming every year are approximately 40% more powerful at comparable price ranges. That means the value of compute power depreciates at roughly 30% per year (1/1.4 = 0.71). So it does not make sense to buy now for what you do not expect to need for 2+ years. Buy what you need for the next year, and buy a new system after that, rotating the existing system to a less important task. Of course, if you work for an inflexible government agency that mandates replacement at 5 year intervals, or if buying the $1M system makes you more important than the other group that runs on a $30K system, well then go for it! On the flip side, one should not argue for the minimum system that meets requirements, but rather think about how massive compute power can be used to generate value.
I used to have many complaints about Intel, particularly on the chipsets. Most have been addressed. The remaining complaint is that Intel has a twisted view that 4-way systems are special, i.e., compared to 2-way systems. This is why the 6-core Dunnington is only used in the Xeon 7400 series and not the 5400 series, even though there is no reason it cannot be used in the 5400. The same applies to the upcoming 8-core Nehalem EX being positioned only in the 7000 line and not the 5000 line. AMD has no issues offering 6-core Istanbul in a 2-way. Hopefully, hardware vendors will have a better picture of customer interests, and offer a 2-way for Nehalem EX. Sure I know it is not cheap, this is why the difference between men and boys is the size and price of their toys.
HP/Oracle just published a RAC TPC-H result with 64 BL640c blade servers at 1000GB. This system comprised 128 quad-core Xeon 5450 processors (512 cores), with 32GB memory per node (64GB on one node). The total memory was 2080GB. The full database size should be around 1700GB. The 1000GB description is for just the LineItem table, not including the two non-clustered indexes and the other tables.
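The depreciation argument compounds quickly. A small sketch, assuming the 40% annual improvement stated above holds steady every year (the function name is hypothetical):

```python
def relative_value(years, annual_gain=0.40):
    """Value of today's compute power after `years`, relative to what the
    same money buys new at that time, assuming a constant annual gain."""
    return (1.0 / (1.0 + annual_gain)) ** years


for years in (1, 2, 5):
    print(f"after {years} year(s): {relative_value(years):.2f} of original value")
```

Under that assumption a system retains about 0.71 of its value after one year, about half after two, and under a fifth after five, which is the case against buying capacity you will not need for 2+ years.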
| CPU | memory | Power | Throughput | QphH |
|---|---|---|---|---|
| 32 Itanium2 DC | 256GB | 90,909 | 53,899 | 69,999 |
| 128 Core2 QC | 2080GB | 782,609 | 1,740,122 | 1,166,977 |
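The QphH column is the TPC-H composite metric, defined as the geometric mean of the Power and Throughput metrics. A short sketch confirming that the rows in the table above are consistent with that definition (the function name is hypothetical):

```python
import math


def qphh(power, throughput):
    """TPC-H composite: geometric mean of the Power and Throughput metrics."""
    return math.sqrt(power * throughput)


itanium = qphh(90_909, 53_899)
rac = qphh(782_609, 1_740_122)

print(f"32 Itanium2 DC: {itanium:,.0f}")
print(f"128 Core2 QC:   {rac:,.0f}")
print(f"ratio:          {rac / itanium:.1f}x")
```

Both computed values match the table to the nearest whole number, and the RAC result is roughly 16.7 times the Itanium result on the composite metric.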
Based on the published Oracle RAC results, I should point out that RAC scaling on TPC-H does look good. The almost total lack of TPC-C results (1 published?) may indicate an issue in scaling high call volume applications. In the above mentioned Oracle RAC result, the blade server hardware costs were about $700K, $500K for storage, $3M for Oracle, $1.5M for RAC, $700K for partitioning, $700K for compression, $400K for support ($100K for unbreakable Linux support, if it's unbreakable, why the support?) for about $6M in software, minus $1.8M in Oracle discounts. If I could charge that much, I would get myself a 400ft yacht. Never mind, Larry already did.
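For reference, the cost figures above add up as follows. This is a rough tally in thousands of dollars, using only the rounded amounts quoted in the text. It assumes the $100K Linux support item is part of the $400K support line rather than an addition to it, which the text leaves ambiguous.

```python
# All amounts in thousands of US dollars, as rounded in the text above.
hardware = {"blade servers": 700, "storage": 500}
software = {
    "Oracle": 3000,
    "RAC": 1500,
    "partitioning": 700,
    "compression": 700,
    "support": 400,   # assumed to include the $100K Linux support
}
discount = 1800

hardware_total = sum(hardware.values())
software_list = sum(software.values())
software_net = software_list - discount

print(f"hardware:        ${hardware_total}K")
print(f"software (list): ${software_list}K")
print(f"software (net):  ${software_net}K")
print(f"software/hardware after discount: {software_net / hardware_total:.2f}x")
```

On these rounded figures, software lists at about $6.3M against $1.2M of hardware and storage, and still costs several times the hardware after the discount.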