
Historical Systems (work in progress)

Pentium II and III era

Below is a representation of a 2-way system from the Pentium II and III era. Two processors share a bus and connect to a memory controller hub, which in turn connects directly to memory.

4S_PentII

In this era, the processor cache line size was 32 bytes. Filling the cache line requires 4 successive words from memory. Memory was transitioning from EDO to SDRAM in the Pentium II era. The 0.35µm Klamath Pentium IIs still had a 66MHz front-side bus (FSB). The high-end 0.25µm Deschutes Pentium IIs ran the FSB and memory at 100MHz. Some Katmai and Coppermine Pentium IIIs had a 133MHz FSB.

Less than one-half of the memory latency seen by the processor is due to the initial memory access, tRAC = 50-60ns, at the DRAM interface. If the memory access at the CPU includes the time to fill a 32-byte cache line, then 3 additional memory clock cycles are required after the first word. This would be 30ns for SDRAM at 100MHz, and 22.5ns at 133MHz.
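As a rough check of that arithmetic, a minimal sketch in Python (the 8-byte bus width is inferred from the 32-byte line being 4 words; the function name is mine):

    # Cache line fill: SDRAM delivers one bus-width word per memory clock,
    # so (transfers - 1) additional clocks follow the first word.
    def burst_fill_ns(line_bytes, bus_bytes, bus_mhz):
        cycle_ns = 1000.0 / bus_mhz
        transfers = line_bytes // bus_bytes
        return (transfers - 1) * cycle_ns

    print(burst_fill_ns(32, 8, 100.0))     # 30.0ns at 100MHz
    print(burst_fill_ns(32, 8, 133.33))    # ~22.5ns at 133MHz (7.5ns clock)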

Below is a representation of a 4-way Pentium II or III Xeon system with 450NX chipset. Here, the path to memory is more complicated, traversing the RCG on the outbound and the MUX on the inbound path in addition to the memory controller. I do not recall memory latency for this system. A highly optimistic guess would be 160ns, but higher is likely.

4S_PentII

For a server system with maximum practical memory (using the second-largest capacity memory module if there was a substantial premium for the largest), the cost was about $5K for each processor, and memory may have been about $5K per GB ($1,250 per 128MB module). In this period, the price of memory decreased rapidly due to supply and demand imbalances, so any values cited may vary wildly depending on the exact time.

There were complications for Intel systems because 32-bit operating systems had limited support for PAE to use memory above 4GB. But potentially, we could be looking at $20K for four processors and $40K for 8GB memory.

A better example would be the Compaq (formerly DEC) AlphaServer ES40 with four Alpha 21264 processors and 16GB of memory total (see the TPC-C report submitted Feb 9, 2000). The processors were $8,750 each and memory was $19,354 per 2GB, so four processors were $35,000 and 16GB of memory was $154,832. Note that the base system with 1 processor and 2GB memory was $48,803, so essentially $20K for the chassis.

The point is that at a formative period in the past, the memory in a database server could be several times more expensive than the processors. Of course, this was also the era of proprietary RISC processors, and system vendors usually handled this matter with breathtaking prices for processors used in large systems.

In that era, suppose a better DRAM technology could reduce latency at the DRAM interface from 60 to 30ns. The memory latency at the CPU is then reduced from 160 to 130ns. The performance impact on code sections waiting for memory is a 23% gain (160÷130).
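The 23% figure follows from treating time spent in memory-bound code as proportional to memory latency; a minimal sketch of that arithmetic (the function name is mine):

    def gain_from_latency(old_ns, new_ns):
        # Speedup of code that is waiting on memory, if its time scales with latency.
        return old_ns / new_ns - 1.0

    # 60ns -> 30ns at the DRAM interface takes CPU-level latency from 160ns to 130ns
    print(f"{gain_from_latency(160, 130):.0%}")  # ~23%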

Back then, 23% would have been really valuable. But if it came at the cost of doubling the memory price from $40K to $80K, then perhaps the 23% was not that interesting. The option that seemed more interesting at the time was scaling via multi-processor. The baseline for database servers was a 4-way server. Scaling to 8-way and higher was thought to be the answer.

If memory capacity were doubled at the original latency, that would also contribute to reducing disk IO. Curiously, it was known in the 1998 period that doubling memory in the 4-way Pentium II Xeon system from 4GB to 8GB would improve TPC-C performance by about 10%, and another 5% from 8GB to 16GB. This was with a proper storage system that could handle the IOPS at the 4GB memory configuration.

Addendum

Core 2 showed significantly better performance than the preceding NetBurst architecture processors, though without Hyper-Threading. I had assumed that was almost entirely due to a better core, but the Hot Chips 18 slide deck says the memory controller was a significant contributor.

See Hot Chips 18, 2006: "Blackford: A Dual Processor Chipset for Servers and Workstations," Kai Cheng et al., on Fully Buffered DIMM (FBD). Page 12 cites latencies and bandwidth for the 2004 DP (LH) and 2006 DP (BF) platforms: memory idle latency of 85ns and 87-102ns respectively, and average loaded latency for a TPC-C mix of 180-200ns and 115-125ns respectively.

 

stuff below belongs elsewhere

2017 - Skylake Xeon SP

Below is the current-generation standard system with two Xeon SP processors, each with up to 28 cores.

2S_XeonSP

The cost of this system is $10K per processor and $1K per 64GB DIMM. With two processors and 24 DIMMs (1.5TB), the total processor plus memory cost is then $20K + $24K = $44K.

Average memory latency, based on 50/50 access to local and remote nodes, is (89+139)/2 = 114ns. In the DDR4-2666 generation, tRCD + tCAS + tBurst is 14.25 + 14.25 + 3ns = 31.5ns, and tRC is 45ns. A reduction of 20ns in average latency would yield a 21% gain (114÷94).
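Restating that arithmetic as a sketch (the 0.75ns clock and 19-clock tRCD/tCAS are assumptions consistent with the 14.25ns figures above):

    local_ns, remote_ns = 89, 139
    avg_ns = (local_ns + remote_ns) / 2           # 114ns with a 50/50 local/remote mix

    # DDR4-2666: 0.75ns clock; tRCD = tCAS = 19 clocks, tBurst = 4 clocks
    t_rcd = t_cas = 19 * 0.75                     # 14.25ns each
    t_burst = 4 * 0.75                            # 3ns
    print(t_rcd + t_cas + t_burst)                # 31.5ns

    reduction_ns = 20
    print(f"{avg_ns / (avg_ns - reduction_ns) - 1:.0%}")  # ~21% gain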

The performance gain available via reduced memory latency is comparable to what it was 20 years ago, but it is more interesting now for two reasons. One, the cost of memory relative to the full system cost is lower. Two, the alternative of adding more conventional memory to reduce disk IO is no longer of particular benefit, as IO levels are already down in the noise. And the storage is an all-flash system that can handle incredibly high IOPS, even though we no longer have a critical need for this.

However, we need to accept that the right solution today is to abandon the multi-processor system, stepping back to a single processor, as shown below.

2S_XeonSP

For unclear reasons, memory latency in the single socket system is 76ns, versus 89ns for the local node access in the 2-way system. A slide from Notes on NUMA architecture, at Intel Software Conference 2014, citing data on Nehalem, suggests that the MP system should not negatively impact local node memory latency.

notes4

Regardless of whether this is true, memory latency in the single socket system is lower because all memory access is local. Very few databases have been architected to achieve memory locality on a NUMA system. If in fact average memory latency is 76ns on the single processor system versus an average of 114ns in the 2-way system, then the performance per core in the 1S system is 50% better.

If memory latency can be further reduced by 20ns, then a further gain of 35.7% is possible (76÷56).
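A sketch of the per-core comparison above, using the same assumption that memory-bound time scales with average latency:

    avg_2s = (89 + 139) / 2        # 114ns, 50/50 local/remote in the 2-socket system
    avg_1s = 76                    # all accesses local in the single-socket system

    print(f"{avg_2s / avg_1s - 1:.0%}")         # 50% better per core in the 1S system
    print(f"{avg_1s / (avg_1s - 20) - 1:.1%}")  # 35.7% from a further 20ns reduction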

notes4