Home, System Architecture, Server Systems,
Big Iron Revival (2009-05), Big Iron Revival II (2009-09), Big Iron Revival III (2010-09).

Big Iron Revival I, Intel Nehalem EX and AMD Magny-Cours 2009-05

Yesterday Intel held a product announcement press event for the upcoming Nehalem EX, which will succeed the current Xeon 7400 series based on the Core 2 micro-architecture for "expandable system", i.e., 4-way and higher, in late 2009 or early 2010. The current Xeon 5500 series (also Nehalem architecture) has 4 cores, 8M shared L3, 2 QPI links, and 3 DDR3 memory channels. Nehalem EX has 8 cores, 24M shared L3 cache, 4 QPI links and 4 FBD memory channels (there is now a Scalable Memory Buffer between the memory interface and memory, did Intel just move the AMB from the DIMM to the motherboard?).

AMD has also recently discussed their plans. The current quad-core Shanghai gets a frequency bump from 2.7GHz to 3.1GHz, and a six core Istanbul should be released very soon (June, announced at 2.6GHz). See the Johan de Gelas Anandtech article on Istanbul. It describes HT assist, (essentially a snoop filter for HT) as using 1M of the L3 cache. The HP ProLiant DL585G6 for Istanbul also appears to be HT version 3.0 or HT3, upping the HT transfer rate from 2GT/s to 4.4GT/s.

Later on, there will be Magny-Cours, which would be 2 Istanbul die in one package. Istanbul has six cores, 3 Hyper Transport links and 2 memory channels. In Magny-Cours, the two six core chips are linked by one HT link, so the external package will have 12 cores, 4 HT links and 4 memory channels. After this, a new improved micro-architecture would arrive?

Now there have been big iron Windows systems for many years. The HP Superdome supports up to 64 Itanium 2 sockets. The problem has been that Intel has not kept pace with Itanium. The current Itanium 9100 series, Montvale, is a 90nm dual core, while the Xeon line is at 45nm and six+ cores. Tukwila, the 65nm quad-core Itanium that should have been launched in 2008, was recently delayed until 2010. Supposedly Itanium should finally be caught up on process technology in 2011 with the 32nm Poulson. Unisys (ES7000 7600R), NEC (Express5800/A1160) and IBM (x3950M2) all have had 16-socket capable Xeon systems for a while. HP has the 8-way ProLiant DL785G5 for Opteron processors (I really would like to get the architectural diagram for how HP connects the 8 sockets). I have not followed Sun since I focus on Windows/SQL Server. (Sun has the 8-way x4600 for Opteron. see Sun Fire x4600 M2 Server Architecture whitepaper for an architectural diagram on how 8 Opterons are connected in a twisted ladder)

Still, I consider this to be a revival or perhaps true arrival of big iron because of the issues in the past on scaling beyond 4-sockets, both in terms of performance and price-performance.

Previously, there were technical challenges in scaling the Intel Xeon beyond 4 sockets, both for the system vendors in designing such a system, and the DBA/developer in getting their application to scale beyond 4-sockets. For an OEM to build an 8-way+ system, it required the effort to built custom chips, the market volume was low, and Intel kept changing the FSB. All of this meant there was a big step up in price per socket going from a 4-socket system to 8, 16 or 32.

This was the rational for Oracle RAC. Instead of buying really expensive big-iron hardware, one can buy lower cost high volume hardware and really expensive software licenses. Think about it. Scaling up on big iron or a RAC-type technology depends on interconnect bandwidth and latency. For either the Intel QPI or AMD HT, it should be possible to achieve far better bandwidth and latency in big-iron than a RAC-type solution. The best Infini-band can do now in a x4 link is 40Gbit/s (5GB/s) at approx 1us latency.

Now that there is prospect of stability in the Intel processor interconnect, my expectation is that we should now see 8-way+ systems at a less severe price premium over 4-way systems. (there will always be a premium because validating and supporting big systems requires deeper technical skills). On AMD Opteron, having the 4 HT ports from one package enables 8-way glue-less systems (with fewer hops) and helps in building 8-way+ (with glue?).

In the Intel announcement was that 8 OEMS have 15 or so 8-way+ (including 16 and 32-way) Nehalem EX systems in the works. IBM, NEC and Unisys are obviously 3 of the OEMs, given their recent commitment to big-iron Xeon. Fujitsu and Hitachi might be another 2, as the Japanese players love big-iron. Sun should be one for 6 of the 8 OEMs. I am guessing this means HP and Dell are the two remaining OEMs. HP is no surprise. They already have the 8-way Opteron. Their commitment to Itanium means that HP would have built a chipset around QPI for the next generation, which is the same processor interconnect on Nehalem.

Dell is the question. Their attitude might be that they do not expect to sell many big-iron systems, considering the technical difficulties they had in the past on this. To sell big iron, it is absolutely necessary to have top technical expertise to go into customer shops to find out if it is the right solution and what changes need to be made to deploy successfully. (OEM reps are invited to drop hints, even if its still a company secret, we will keep it just between us)

[OK, I forgot about SGI, they have big iron Itanium, which means if they do a chipset for the next gen Itanium with QPI, they can do a Nehalem-EX too. plus they just blogged this ceo blog sgi]

Up to Windows Server 2008 RTM, the OS does not support more than 64 cores, physical or logical. This limit will be lifted with Windows Server 2008 R2, accompanied by SQL Server 2008 R2(?). Both the Unisys 7600R and NEC A1160 posted TPC-E benchmark results for 16-sockets, but only 4 of the 6 cores in the Intel X7460 processor enabled, to stay under the current 64-core limit. Scaling was decent, but not spectacular, going from 721tps-E@4-sockets/24 cores, to 1156 tps-E@8S/48c, to 1400tps-E@12S/64c and 1568tps-E@16S/64c.

Note that scaling large/(hard) NUMA systems require proper use of port affinity settings, and how interrupts are handled. Windows 2008 R2 supposedly has a much improved disk I/O handling on NUMA systems.

The Intel announcement mentioned that 4-way Nehalem EX will have 2.5X+ performance over 4-way Xeon 7460, based on a very recent internal measurement using OLTP workload, i.e., TPC-C or TPC-E. This is also inline with the huge TPC-C and TPC-E gains posted by 2-way Xeon 5500 over Xeon 5400. Previously I discussed this matter. Each Nehalem core should have moderately better performance than a Core 2 micro-architecture core. Nehalem systems have more memory channels to better support multi-core scaling. The Nehalem EX 4-way system has 16 memory channels supporting 32 cores, versus the Xeon 7400 (7300 MCH) 4 memory channels supporting 24 cores. Nehalem EX will have 8 physical cores compared with 6 on Xeon 7460. Finally, both TPC-C and TPC-E benefit from Hyper-Threading, a feature from the Pentium 4 (NetBurst) micro-architecture (designed in Oregon), but not implemented by Core 2 (designed in Israel). Anyways, 2.5X over X7460 means 1.6M tpm-C or 1700 tps-E.

Now both TPC-C and TPC-E are OLTP benchmarks (workloads). The interpretation should not be that HT (and large cache) benefit OLTP workloads as in any one else's OLTP workload. Each TPC-C transaction involves on average 2.25 or so RPC calls (network roundtrip) and each TPC-E transaction involves approximately 22.3 RPCs. By looking at the recent results on Xeon 7460 or Opteron Quad-core, one can figure out that the average cost per RPC in both TPC-C and E is on the order of 1 CPU-millisecond (the duration of the complete RPC might be longer, say 80-400ms)

The correct interpretation should be that HT and large cache benefits high call volume applications, transaction processing or not. HT benefits mostly in the network round-trip. This was based on tests done on the previous version of HT, i.e., Pentium 4 architecture. I did not find one SQL operation that benefited from HT except in handling just the RPC overhead. The Quest LiteSpeed compression engine did show huge gains with HT, 40%. This indicates the theory behind HT is valid. One just needs to figure what in the SQL Server engine does not like HT. It is possible that the HT in Nehalem now works better with SQL Server.

The large cache reduces the (fixed) startup cost of an SQL operation, but not the incremental cost per additional rows. So if someone else's OLTP application average 10 CPU-ms per call, then it might not show as much gain going from Core 2 to Nehalem.

I suspect this is the reason Intel has not posted any TPC-H benchmark results. It should show some gain over Core 2, just not the spectacular gains in C & E. I am inclined to think that the 4-way Xeon 7460 is memory bandwidth constraint in TPC-H, and that is alleviated in Nehalem, but there are no published TPC-H results to substantiate this matter.

Dunnington and Nehalem EX are both 45nm. Dunnington has 1.9 billion transistors, 6 cores, there is a 3M L2 cache shared by each pair of cores, and a 16M L3 cache shared by all cores for a L2+L3 total of 25M. Nehalem EX has 2.3B transistors, 256K L2 cache dedicated for each core and 24M L3 cache for 26M L2+L3 cache. Granted there is a big increase in latency from L2 to L3. I would interest to see the supporting data (estimates made before the design work) for the big L2 caches in Dunnington.

Even with all of the improvements over time, on the hardware with Nehalem, integrated memory controllers, QPI, on the software stack, w2k8r2 and s2k8r8, scaling on NUMA systems is not trivial. What SQL execution plan operations scale?, what does not?, what might have negative scaling? etc, what problems can be fixed with code changes etc. All of this should be done with proper expertise. (Not to be construed as an advertisement or solicitation for services, this will not be cheap either)

PS -

I am neither advocating nor criticizing big-iron systems. The important point is that new systems coming every year are approximately 40% more powerful at comparable price ranges. That means the value of compute power depreciates at 30% per year (1/1.4 = 0.71). So it does not make sense to buy now for what you do not expect to need for 2+ years. Buy what need for the next year, and buy a new system after that, rotating the existing system to a less important task. Of course, if you work for an inflexible government agency that mandates replacement at 5 year intervals, or if buying the $1M system makes you more important than the other group that runs on a $30K system, well then go for it! On the flip side, one should not argue for the minimum system that meets requirements, but rather think about how massive compute power can be used to generate value.

I used have many complaints about Intel, particularly on the chipsets. Most have been addressed. The remaining complaint is that Intel has a twisted view that 4-way systems are special, ie, compared to 2-way systems. This is why the 6-core Dunnington is only used in the Xeon 7400 series and not the 5400 series, even though there is no reason it cannot be used in the 5400. The same applies to the upcoming 8-core Nehalem EX being only positioned in the 7000 line and not the 5000 line. AMD has no issues offering 6-core Istanbul in a 2-way. Hopefully, hardware vendors will have a better picture of customer interests, and offer a 2-way for Nehalem EX. Sure I know it is not cheap, this is why the different between men a boys is the size and price of their toys.

HP/Oracle just published a RAC TPC-H result with 64 BL640c blade servers at1000GB. This system comprised 128 quad core Xeon 5450 processors (512 cores), 32GB memory per node (64GB on one node). The total memory was 2080GB. The full database size should be around 1700GB. The 1000GB description is for just the LineItem table, not including the two non-clustered indexes and the other tables.

32 Itanium2 DC256GB90,90953,89969,999
128 Core2 QC2080GB782,6091,740,1221,166,977

Based on the published Oracle RAC results, I should point out that RAC scaling on TPC-H does look good. The almost total lack of TPC-C (1 published?) may indicate an issue in scaling high-call volume applications. In the above mentioned Oracle RAC result, the blade server hardware costs were about $700K, $500K for storage, $3M for Oracle, $1.5M for RAC, $700K for partitioning, $700K for compression, $400K for support ($100K for unbreakable Linux support, if its unbreakable, why the support?) for about $6M in software, minus $1.8M in Oracle discounts. If I could charge that much, I would get myself a 400ft yacht. Never mind, Larry already did.