
Of GHz and MHz: Processor Information, Processor Frequency, and Amdahl Revisited

Last week, a routine performance test ran about twice as long as expected. A check of dm_exec_query_stats showed that CPU-bound statements (worker time roughly equal to elapsed time) were taking approximately 3 times longer than in previous tests for matching SQL statements. Almost all the SQL statements were single- or few-row index seeks. The server system is a 2-socket Xeon E5-2680 (8 cores, 2.7GHz nominal, Sandy Bridge) in a managed data center. The data center had sent out notice that there would be system restarts the previous weekend, which could mean either OS patches or BIOS/UEFI updates. So naturally the next thing to do was check the Processor Information object for the Processor Frequency and % of Maximum Frequency counters (or equivalent). This showed 135, as in 135MHz, and 5% of maximum. Another system of the same model, also rebooted, showed 1188 and 44%.

This issue has occurred previously in this environment and in other HP systems that I am aware of. The BIOS or UEFI update puts the system into one of the energy-efficient configurations. It could also be an operating system setting, but most instances that I have seen were BIOS settings. One can imagine a firmware engineer being so committed to green activism that this was made the default on BIOS updates without discussion with other parties. One can also imagine a lab in a facility (in Houston?) with inadequate air conditioning for the number of systems having put this in to prevent the lab from overheating, then forgetting to exclude this step from the production code, not that I have ever done such a thing (and no further questions on this should be asked).

Another question might be why the data center monitoring team did not check for this, as it has happened before. The whole argument for going to a managed data center instead of a simple hosted data center was that the managed data center could provide a broad range of expertise that would not be economical for a mid-size IT department. Obviously this managed data center did not monitor for the performance/power configuration.

This matter is of serious concern to production DBAs and IT staff handling operations. As the Processor Information performance object with extended information was only introduced in Windows Server 2008 R2, many software monitoring tools may not alert on changes of Processor Frequency, especially after a reboot. Imagine the IT staff or DBA encountering this for the first time, in production, with users complaining, your boss watching over your shoulder, and his/her boss over your boss, offering their helpful insights in a non-judgmental manner, as bosses do.
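For what it is worth, a do-it-yourself check is not hard to script. Below is a minimal C sketch using the Windows PDH API to read the two counters (error handling omitted for brevity; the counter paths assume Windows Server 2008 R2 or later):

  /* Minimal sketch: read Processor Frequency and % of Maximum Frequency
     via the Windows PDH API. Compile with: cl freqcheck.c pdh.lib
     Error handling omitted for brevity. */
  #include <windows.h>
  #include <pdh.h>
  #include <stdio.h>

  int main(void)
  {
      PDH_HQUERY query;
      PDH_HCOUNTER freq, pct;
      PDH_FMT_COUNTERVALUE val;

      PdhOpenQuery(NULL, 0, &query);
      PdhAddEnglishCounterA(query,
          "\\Processor Information(_Total)\\Processor Frequency", 0, &freq);
      PdhAddEnglishCounterA(query,
          "\\Processor Information(_Total)\\% of Maximum Frequency", 0, &pct);

      /* collect two samples; formatted values may need successive collections */
      PdhCollectQueryData(query);
      Sleep(1000);
      PdhCollectQueryData(query);

      PdhGetFormattedCounterValue(freq, PDH_FMT_DOUBLE, NULL, &val);
      printf("Processor Frequency: %.0f MHz\n", val.doubleValue);
      PdhGetFormattedCounterValue(pct, PDH_FMT_DOUBLE, NULL, &val);
      printf("%% of Maximum Frequency: %.0f\n", val.doubleValue);

      PdhCloseQuery(query);
      return 0;
  }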

Performance Insight

However, I am more interested in a different aspect of this incident. When there are two sets of data, one for the processor cores at 2.7GHz and another at presumably 135MHz, we can extrapolate parameters of interest. Does it seem stunning that the drop from 2.7GHz to 135MHz, a factor of 20, only decreases CPU efficiency (increases CPU-seconds, or worker time) by 3X? Perhaps, but this actually should have been expected.

The salient aspect of modern computer system architecture is the difference between the CPU clock cycle and memory access time. A young person might not know this, but old timers would. Up to about 20 years ago, the primary memory performance specification was access time, with 80, 70 and 60 ns being common in the fast page mode and extended data out eras. Then, with the switch to synchronous DRAM (SDRAM), the key specification changed to data rate. In the Xeon E5 (v1) generation, DDR3 at 1333MHz was common. This means a memory channel can deliver one line every 0.75ns, or 1.333B times per second, with a line being 64 bits (excluding ECC bits). The Xeon E5 26xx series has four DDR3 channels. (The Intel processor internally is shown as having 2 memory controllers, each controller driving 2 DDR channels, so channel can have different meanings depending on the context.)
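For reference, the per-socket bandwidth arithmetic works out as:

  1333 MT/s * 8 bytes (64-bit line) = ~10.7 GB/s per channel
  4 channels * 10.7 GB/s = ~42.7 GB/s per socket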

What is less commonly cited is the round-trip latency, from a processor issuing a request to memory, through the internal memory access, and finally the transmit time back to the processor. (The L1, L2 and L3 cache sequence is also involved in memory access timing.) For local memory (attached directly to the processor) this is around 50ns. For memory on an adjacent processor, the round-trip time might be 95ns or so.

On a 2.7GHz processor, the CPU cycle time is 0.37ns, so a 50ns local memory round-trip access is 135 CPU cycles. This particular system has 2 sockets, so one might expect that half of memory accesses are local at 50ns round-trip latency, and half are remote at 95ns.
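The same arithmetic for the remote and blended cases:

  95ns / 0.37ns = ~256 cycles (1-hop remote node)
  (50ns + 95ns) / 2 = 72.5ns = ~196 cycles (blended, 2-socket)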

This is a well understood issue. Two methods of addressing the disparity between CPU cycle time and memory access time are 1) large caches on the processor and 2) pre-fetching memory. Current Intel processors have dedicated 32KB I+D L1 and 256KB L2 caches, both per core, and an additional shared L3 cache sized at 2.5MB per core. From Pentium 4 on, the processor pre-fetches 64 bytes (the cache line size) with an option to prefetch the adjacent cache line. Prefetching is also exposed in the instruction set architecture (the PREFETCHh instructions, introduced with SSE), and there should also be a BIOS/UEFI setting for hardware prefetch.
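In C, software prefetch is available through a compiler intrinsic for the SSE PREFETCHh instructions. A minimal sketch (the stride and prefetch distance here are hypothetical, just to show the idiom):

  /* Software prefetch via the SSE PREFETCHh intrinsic. The stride and
     prefetch distance are hypothetical, for illustration only. */
  #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

  long long sum_with_prefetch(const int *data, int n)
  {
      long long sum = 0;
      for (int i = 0; i < n; i++) {
          /* hint: fetch the line 16 ints (one 64-byte line) ahead;
             prefetch past the end of the array does not fault */
          _mm_prefetch((const char *)&data[i + 16], _MM_HINT_T0);
          sum += data[i];
      }
      return sum;
  }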

Simple Model

Now let's visualize the (simplified) code sequence in a relational database engine with traditional page-row data structures. There is a memory access for the index root-level page. Read the page to find the pointer to the second-level page. Memory access, and repeat. It is a sequence of serialized memory accesses with poor locality (so cache can only help so much), and the next location is not known until the current memory request completes, so pre-fetching is not possible.
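A minimal sketch of this dependent-load pattern (a hypothetical index node layout, not SQL Server's actual B-tree structure):

  /* Pointer chasing: the address of each load depends on the result of
     the previous load, so the round-trip latency cannot be overlapped.
     Hypothetical node layout for illustration only. */
  #include <stddef.h>

  struct node {
      int          key;
      struct node *child[2];   /* lower levels, effectively random addresses */
  };

  const struct node *seek(const struct node *root, int key)
  {
      const struct node *p = root;
      while (p != NULL && p->key != key)
          p = p->child[key > p->key];  /* next address known only after
                                          the previous load completes */
      return p;
  }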

Modern processor performance characteristics are very complicated, but we will attempt to build a very simple model focusing on the impact of round-trip memory access latency. Start with an imaginary processor whose cycle time is equal to the full round-trip memory access time. In this scenario, one instruction completes every cycle, be it an arithmetic, logic or memory access instruction.

Such a system may never have existed, so now consider a system where the round-trip memory access latency is some multiple of the CPU cycle time. The average time to complete an instruction, where time is in units of the memory access latency (50ns, or 20MHz, for the local node), a is the fraction of instructions that involve a (non-local, non-prefetch-able) memory access, and n is the processor frequency, is:

  (1-a)/n + a

The term (1-a) is the fraction of instructions that are either not memory access, or memory access to cache (from previous access or pre-fetched). 1/n is the processor cycle time (in units where memory access time is 1).

Performance (instructions per unit time), the inverse of the average time per instruction, is:

  P = 1 / ( (1-a)/n + a )

    = n / ( 1 + (n-1)*a )

We can see that the above equation has the characteristic that, as processor frequency increases, the upper bound on performance is:

  as n -> infinity, P -> 1/a

Also, if the fraction of instructions that require memory access, a, is zero, then P = n.
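A quick numeric check of the limiting behavior, with a hypothetical a = 0.06:

  /* Evaluate P = n / (1 + (n-1)*a) for increasing n, with a = 0.06
     (hypothetical); P should approach 1/a = 16.7 as n grows. */
  #include <stdio.h>

  int main(void)
  {
      double a = 0.06;
      double n_values[] = { 1, 10, 100, 1000, 10000 };
      for (int i = 0; i < 5; i++) {
          double n = n_values[i];
          printf("n = %7.0f  P = %6.2f\n", n, n / (1.0 + (n - 1.0) * a));
      }
      printf("limit 1/a = %.2f\n", 1.0 / a);   /* 16.67 */
      return 0;
  }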

Does the above look familiar? It is just Amdahl's Law, which was formulated in the old days to demonstrate the limits of vectorization in supercomputers. I have just recast it to express the limits of increasing processor frequency relative to round-trip memory access time.

If someone would like to check my math, please do so. It has been a long time. Try tricking your teenage son/daughter into doing this as a practical math exercise?

OK, anybody still reading is obviously not deterred by math, or knows the trick of skipping such things. What am I really after? In the above equation, what is known is processor frequency relative to memory access latency. While we know the performance or worker time of certain queries, we do not know it in terms of instructions per CPU-cycle. And the second item we do not know is the fraction of instructions that incur a round-trip memory access latency that cannot be hidden with cache or pre-fetching.

But, we have data points at 2 frequencies, 2.7GHz and reportedly 135MHz. Express the relative performance between the two points as a ratio.

  P2/P1 = R

Then from the two equations

  P1 = 1 / ( (1-a)/n1 + a )

  P2 = 1 / ( (1-a)/n2 + a )

we can solve for a in terms of the known values n1, n2 and R.

  a = (n2 - n1*R) / ( n1*n2*(R-1) + n2 - n1*R )

Assuming a memory access latency of 50ns, the base frequency of 20MHz corresponds to a memory access completing in 1 cycle. Plugging in the values n1 = 135MHz / 20MHz = 6.75, n2 = 2700/20 = 135 and R = 3, we get a = 0.059. That is, 5.9% of instructions incurring a non-cached, non-prefetched round-trip memory access latency would result in a 3:1 performance ratio between 135MHz and 2700MHz. (Perhaps it would be more correct to estimate the round-trip memory access latency as the average of the local and 1-hop remote nodes, around 75ns?)
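A small program to reproduce the arithmetic (the 50ns latency and 20MHz base frequency are the local-node assumptions stated above):

  /* Solve a = (n2 - n1*R) / (n1*n2*(R-1) + n2 - n1*R) for the two
     observed frequencies and the observed 3X worker time ratio. */
  #include <stdio.h>

  int main(void)
  {
      double base = 20.0;        /* MHz at which 50ns latency = 1 cycle */
      double n1 = 135.0 / base;  /* 6.75: throttled 135MHz              */
      double n2 = 2700.0 / base; /* 135 : nominal 2.7GHz                */
      double R  = 3.0;           /* observed performance ratio          */

      double a = (n2 - n1 * R) / (n1 * n2 * (R - 1.0) + n2 - n1 * R);
      printf("a = %.4f\n", a);   /* prints a = 0.0592                   */
      return 0;
  }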

So while it might seem astonishing that the difference between 135MHz and 2700MHz translates to only 3X performance, the database transaction processing workload is an extreme (but important) case. There are many workloads which exhibit better locality or have memory access patterns that are amenable to prefetch, and these scale better with frequency.

Hyper-Threading

Earlier, two methods of hiding round-trip memory access latency were mentioned. There is another: Hyper-Threading. The processor core appears as two (or more) logical processors to the operating system. Presumably, there is an extra set of program counters, and resources to determine which physical registers (different from the registers specified in the instruction set architecture) are assigned to each logical processor.

In the earlier example, say that the round-trip memory access time is 135 CPU-cycles and the fraction of instructions that incur the full round-trip latency is 6%. Then for 100 instructions, 94 execute in 1 cycle each (excluding consideration of superscalar), as either not involving memory access or having data already in cache, and the 6 others incur the round-trip memory latency of 135 cycles. The total time in terms of CPU-cycles is 94*1 + 6*135 = 904. In other words, only 100 cycles out of 904 are used; the rest are stalls.
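The same accounting in code form; the utilization figure is the headroom that a second logical processor could potentially use:

  /* Cycle accounting: 100 instructions, 6% incurring a 135-cycle
     round-trip memory access, the rest completing in 1 cycle each. */
  #include <stdio.h>

  int main(void)
  {
      int insts = 100, mem = 6, latency = 135;
      int total = (insts - mem) * 1 + mem * latency;   /* 94 + 810 = 904 */

      printf("total cycles : %d\n", total);
      printf("utilization  : %.1f%%\n", 100.0 * insts / total); /* ~11% */
      return 0;
  }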

The Intel Xeon processors from Nehalem on implement Hyper-Threading with 2 logical processors on each physical core. (This can be disabled in BIOS/UEFI. Some models have HT disabled. The earlier Intel Pentium 4 based processors implemented a more complex form of HT.)

Considering the nature of the database transaction processing workload, being one memory access to determine the next memory access, it is perhaps time for Intel to increase the degree of HT, especially considering that the server oriented Xeon E5 and E7 models are already a full year or more behind the smaller desktop/mobile processor variants. I seem to recall IBM POWER as having 4 logical processors per physical core, and one of the SPARC processor lines as having 8. It would also be necessary to have a good strategy for using HT based on workload. The option to enable or disable HT in the BIOS/UEFI is not what I mean. HT should be visible to the OS, but the application itself should detect the presence and degree of HT, and make its own decision on whether HT should be used and how it should be used.

Xeon Phi, Many Integrated Core

Another item worth mentioning here is the Intel many integrated core (MIC) architecture, codename Knights something, now Xeon Phi. The processor packs many smaller cores, 61 in the 22nm Knights Corner, versus 12-18 in the 22nm mainline Xeon processors. The theory behind many smaller cores stems from one of the two main elements of Moore's Law: doubling the number of logic transistors (complexity) in a single core should translate to about a 40% performance gain.

(This was the case up to several generations ago. Since then, Intel no longer tries to double the logic from one process to the next. There might be a 10-20% performance gain in general instructions. Some effort goes to expanding the functionality of the special/vector instructions. And most effort has gone into increasing the number of cores.)

One manifestation of this (more logic transistors) could be increased frequency (which Intel stopped pursuing years ago). Another might be more execution ports (8 in Haswell) or other features that improve instructions per cycle (IPC). Following the rule of 2X transistors per 1.4X (square root of 2) performance backwards, the expectation is that a processor of 1/4th the size would have 1/2 the performance. But potentially there could be 4X as many cores, depending on interconnect and power limitations. So in workloads that are amenable to vectorization, or can otherwise be parallelized, many smaller cores could be the better strategy.
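Stated explicitly, assuming a workload that parallelizes perfectly:

  per-core performance ~ sqrt(transistors): 1/4 the transistors -> 1/2 the performance
  4X cores * 1/2 per-core performance = 2X aggregate throughput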

The Xeon Phi is targeted at HPC workloads, as reflected in the 512-bit SIMD instructions. If we were thinking about a transaction processing database engine on the MIC architecture, we would probably consider a very basic ALU without SIMD (not sure on FP). I am thinking that an asymmetric processor architecture might be the objective: perhaps two powerful cores from the current main line, and many simpler cores (without SIMD), perhaps even simpler than Atom? (The Intel Xeon Phi line implements Hyper-Threading with 4 logical processors per physical core.)

Column-Store

As said earlier, the nature of database page storage along rows makes serialized memory access (also called pointer-chasing code) its hallmark. This is why there is interest in column storage architecture. Now, all of a sudden, for certain database workloads, the next memory access is 4 bytes over, already in cache. The work a little further down touches memory in the next 64-byte line or two away. Both the software and the hardware know this, and either is capable of issuing a pre-fetch. It does not matter that a columnstore must touch more data; the processor can stream huge volumes of data much more effectively than it can chase pointers to only the necessary rows.
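A sketch of the contrast with the earlier pointer-chasing example (hypothetical column layout): the scan walks a contiguous array, so every future address is known up front and the prefetcher can stream ahead:

  /* Column scan: sequential access over a contiguous array. No dependent
     loads; hardware (or software) prefetch can run ahead of the consumer.
     Hypothetical column layout for illustration only. */
  long long sum_column(const int *column, int n)
  {
      long long sum = 0;
      for (int i = 0; i < n; i++)
          sum += column[i];      /* the next access is 4 bytes over */
      return sum;
  }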

Summary

I should probably say something here.

Notes

As I said earlier, modern microprocessors are very complex. Pipelined execution was introduced (in Intel processors) with the 80486 (1989) and superscalar execution with the Pentium (1993). Pipelined means that while the processor can complete an instruction in each cycle, the actual start-to-finish time of a specific instruction spans several cycles. Intel does not talk about pipeline stages any more, but there are occasional references to Core2 and later processors having a 14+ stage pipeline. (Wikipedia says Core2 is a 12-14 stage pipeline, Nehalem and later 20-24?, Sandy Bridge 14-19.)

Superscalar means that there is more than one execution unit, with the goal of completing more than one instruction per cycle. Haswell has 8 execution ports; processor generations several prior were 6-port superscalar. We could apply the Amdahl's Law principle of scaling performance to any and all of pipelining, superscalar, and round-trip memory latency, and probably other things too.

Rethinking Computer System Architecture

I have said this elsewhere: it is long past due to do a clean-sheet system architecture with matching changes to OS architecture. Current system architecture stems from the 1970's, with a processor, physical memory (8MB was big), and a page file on disk. Why do we still have a page file on disk? In the old days, there was not enough physical memory, so it was tolerable to have a page file on disk to support a larger virtual address space.

Today, more than 1TB of physical memory is possible and affordable (compared to the cost of SQL Server per-core licensing). But the key factor is in how memory is used. Back then, it was mostly for executable code and internal data structures. The assumption was that very few database data pages would actually be in memory at any given point in time. Today, a very large percentage of memory is used for caching data pages. Of the memory used for executable code and internal data structures, most is junk.

The CPU-cycle time to memory access time discrepancy dictates that the more urgent strategy is to get memory closer to the processor, even if it means drastically reducing the size of true memory to perhaps a few GB per socket. Given that DRAM is so cheap, we would still have systems with multi-TB DRAM capacity, except that this would now be the page file. Of course the operating system (and applications) would have to be designed around this new architecture. Given how well the Intel Itanium software coordination went, I guess this might be too much to expect.