
Big Iron Revival III: Revenge/Return of Big Iron (2010-09)

In the old days, standard server systems did not have the power to run large enterprises, hence there were vendors that built really big servers. However, it became apparent, if not widely publicized, that there were serious technical challenges in scaling up on big iron systems (many of these difficulties have since been resolved, or will be). Furthermore, the press could not distinguish between there being big systems on the market and actually scaling effectively on big systems. But IT departments did figure out that it was often better to buy the standard 4-way SMP server offered by almost all system vendors than a proprietary big NUMA system, even if this meant scaling back on features. More recently, microprocessors have become so powerful that the default system choice should now be a 2-way, meaning it is usually safe to pick this system without any technical sizing effort. Still, there was a niche demand for truly immense compute power, if such capability could be harnessed effectively in a production environment.

Years ago, Oracle recognized the limited realizable scaling and serious technical anomalies that occurred in big systems, and elected to pursue the distributed computing solution. The first iteration was Oracle Parallel Server, followed later by Real Application Clusters (RAC). A full RAC system is very complicated, and it was impossible to provide good support because of all the variations in each specific customer implementation. Furthermore, Oracle probably got tired of hearing customer complaints that were repeatedly traced to really expensive storage systems with absolutely pathetic performance with regard to the special characteristics and requirements of database IO.

Two years ago, Oracle came out with the Oracle Database Machine (ODM), composed of several pre-built RAC nodes (2-way quad-core Xeon 5400 systems) coupled with their Exadata storage system. Each Exadata storage unit was itself a 2-way quad-core Xeon system with 8GB (24GB in gen2) memory and 12 SATA or SAS disks, running a special version of the Oracle database capable of off-loading certain processing tasks from the main database engine.

The first generation ODM/Exadata was only targeted at data warehouse environments, as the storage system had excellent sequential IO bandwidth, which SAN systems are really bad at, but only marginal random IO. (In fact, SAN systems often have special features meant to prevent one host from drawing too much load, the better to simultaneously support many hosts.) The second iteration last year broadened the scope to also support random IO from OLTP environments, with 384GB of Flash storage to supplement the hard disks, and made extensive use of compression.

For the 2010 ODM refresh, the RAC base unit is now not a 2-way or 4-way, but an 8-way system with the Xeon 7500 processor. There are good reasons for choosing the Xeon 7500 series, including 4 duplex memory channels supporting 16 DIMMs per processor versus 9 for the Xeon 5600, and the Machine Check Architecture for enhanced reliability. All of this could have been done for 2-way or 4-way base units. The Exadata units remain 2-way Xeon 5500(?).

For years, Larry has ridiculed IBM and their POWERx system architecture, saying big iron belonged in a museum; the future was clustered small systems. Last year I commented that scale-out clustering with small systems (not a Microsoft feature until PDW is released) was a mistake. The propagation latency between systems, even over an InfiniBand connection, was far higher than the inter-node latency of most NUMA systems. The main deficiencies of the NUMA systems of years ago were limited interconnect bandwidth (which was still better than IB) and the lack of an integrated directory for cache coherence (I did not discuss this at the time). But all of this was on the public roadmaps of Intel. AMD already had the interconnect technology, and only needed HT-Assist (introduced with Istanbul), but AMD has decided to withdraw from >4-way, probably due to severe financial pressure.

So the Oracle decision to move ODM RAC nodes not just to the Xeon 7500, but also to 8-way nodes, is essentially a concession that scale-up first is the best strategy. The fact that ODM now has only 2 nodes with 8-way systems versus previously 8 nodes with 2-way systems may be hinting that the two-node strategy is more for availability than scale-out.

Side note on Exadata Storage Systems
In general, I think this is a good concept in:
1) putting the database engine on the storage system to off-load processing when practical,
2) having substantial compute power in the storage system to do this with 2 quad-core processors over relatively few disks,
3) using the compute power to also handle compression,
4) choosing InfiniBand based on the best technology, not fear of leaving Ethernet.
The goal of 2GB/sec sequential per unit is reasonable: considering each Nehalem core can probably handle 250MB/sec compressed, 8 cores matched to 2GB/sec works out.
I do disagree with having only 12 3.5in SAS disks: supplying the 2GB/s means 167MB/sec per disk, which is possible, but requires pure sequential IO. That is not always what happens in DW, hence I think 24 SFF (2.5in) disks would be better, potentially 20 x 73GB 15K disks plus 4 x 640GB 7200RPM SATA disks (never mind, 4 x 600GB 10K SAS disks would be better).
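For what it is worth, here is a quick sketch of that sizing arithmetic, assuming the 250MB/sec per-core scan rate I estimated above (my figure, not an Oracle one) and the 2GB/sec per-unit goal:

```python
# Back-of-envelope check of the Exadata storage unit sizing discussed above.
# The 250 MB/sec per-core scan rate is my assumption, not a published Oracle figure.

CORES_PER_UNIT = 8          # 2 sockets x 4 cores in the storage server
MB_PER_CORE = 250           # assumed per-core rate on compressed data
TARGET_MB_PER_SEC = 2000    # ~2GB/sec sequential goal per unit

cpu_ceiling = CORES_PER_UNIT * MB_PER_CORE
print(f"CPU-side ceiling: {cpu_ceiling} MB/sec vs target {TARGET_MB_PER_SEC} MB/sec")

for disks in (12, 24):
    per_disk = TARGET_MB_PER_SEC / disks
    print(f"{disks} disks -> {per_disk:.0f} MB/sec required per disk")
# 12 x 3.5in disks -> ~167 MB/sec per disk, only achievable with pure sequential IO
# 24 x SFF disks   -> ~83 MB/sec per disk, a much easier target for real DW access patterns
```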

Kevin Closson from the Oracle side provided details on the options for the X2, Nehalem-EX Exadata:
The X2-8 has two 8-way 8-core Xeon 7500 nodes (presumably the X7560).
The X2-2 has either 2, 4, or 8 six-core Xeon X5670 nodes (quarter, half and full rack).
The Exadata Storage Server is now on the Xeon 5600, but there is some confusion as to whether this is the 4-core E5640 or one of the 6-core X-model processors.

Kevin Closson:
"The fact that ODM now has only 2-nodes with 8-way versus previously 8 nodes with 2-way sysem may be hinting that the two nodes strategy is more for availability than scale-out."
One small clarification regarding options for the database grid of the Exadata Database Machine. The latest hardware refresh consists of either 2 8-socket Nehalem EX servers in a 2-node RAC cluster, or quarter/half/full rack configurations based on 2-socket Westmere EP.
The Nehalem EX option is called X2-8 and the Westmere EP option is X2-2. The X2-8 is only available in a full-rack configuration. The X2-2 is available in quarter/half/full rack configurations.
The storage grid in all cases is based on 2-socket Westmere EP Exadata Storage Servers with either 12 15K SAS drives or 12 7200 RPM 2TB SAS drives (Seagate Constellation).
Regarding which Westmere EPs, the database hosts are fitted with 5670s and the storage servers have 5640s.

Joe Chang:
The main Oracle information only showed the X2-8 with the 8-way Xeon 7500. After Kevin's reminder, I searched for the X2-2 and found it in the Oracle press release. I do think maintaining the 2-way option makes sense; the EP line should be out 1 year ahead of the EX line on a consistent basis. I am surprised the storage server is moving to the 5600, and the upper-end 6-core at that. Of course, Oracle should have sufficient information to assess whether the additional compute capability is required. I would have guessed that there would have been some evidence that the quad-core was sufficient to support 1.5GB/sec and 75 KIOPS.
I am curious as to which 5600 Oracle elected for each, as the 6-core parts are X-models only. The previous generation Xeon 5500 was the E-model, which was a good choice for thermal and density reasons.
Next, while RAC is not perfect by any means, nor as close to it as it is made out to be, it is useful. So we should admire that Oracle has the guts to trail-blaze on the bleeding edge, while somehow not suffering the consequences of others that have done so.
Now the curious matter of benchmarks. Oracle has not published a TPC-C for ODM even though there is every reason to believe it should do well. There is a result for SPARC RAC. There is an older TPC-H(?) result on RAC. I am inclined to think ODM can do some of the TPC-H queries well, but not others; this might depend on the rules and partitioning strategies. Oracle has not published any TPC-E results. There is every reason to believe Oracle does fine for TPC-E on a single system, perhaps better than SQL Server on big systems. Is the reason Oracle is not publishing TPC-E only because TPC-E does not scale well on RAC?

I think the X5670 is a good choice for the database engine, being the top frequency at 95W, with the X5680 at 130W. But the E5640 (80W) is a 2.66GHz 4-core part, and the X5650 (95W), also 2.66GHz, is the lowest 6-core. The Oracle press release clearly said that the full rack of 14 storage servers had 168 (= 14 x 2 x 6) cores. I can see moving to the 5600, but staying at the lower 80W and 4 cores, as there should be good data on how much compute power is required to support compression (and SQL offload) at the IO load. I expect that hyper-threading helps too.
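A minimal sketch of that sanity check, reusing my assumed 250MB/sec-per-core figure from the side note above; the 1.5GB/sec and 75 KIOPS per storage server targets and the 14-server full rack are the numbers already mentioned:

```python
# Rough check on how hard each storage-server core would work at the stated targets.
# The 250 MB/sec per-core assumption is mine; 1.5 GB/sec and 75 KIOPS are the per-unit
# figures cited above, and 14 storage servers is the full-rack count.

SERVERS_PER_RACK = 14
SOCKETS = 2
TARGET_MB = 1500      # per storage server, sequential
TARGET_IOPS = 75000   # per storage server, random

for name, cores_per_socket in (("quad-core (E5640-class)", 4),
                               ("six-core (X-model)", 6)):
    server_cores = SOCKETS * cores_per_socket
    rack_cores = SERVERS_PER_RACK * server_cores
    print(f"{name}: {rack_cores} cores per rack, "
          f"{TARGET_MB / server_cores:.0f} MB/sec and "
          f"{TARGET_IOPS / server_cores:.0f} IOPS per core")
# quad-core: 112 cores/rack, ~188 MB/sec per core (under the assumed 250 MB/sec ceiling)
# six-core : 168 cores/rack, ~125 MB/sec per core (168 matches the press release count)
```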
Oh yeah, the point of the blog is not whether ODM/RAC/Exa is good or bad, or whether SQL Server needs to follow via PDW. The point is that another company with deep database expertise started out promoting scale-out on small boxes, but then returned to big iron (when suitable big iron became available). I am thinking very sound technical analysis went into this decision (or Larry reads my blog).
Each of the major database products is very complex, so DBAs tend to specialize in just one. Sure, there are people who do two, but experts tend to do one. So while it is seriously impractical to be expert in two, one should at least pay attention to what is happening on the other side of the pond.