The Problem with Standard SAN Configuration IO Performance 2015-02
The charts provided here illustrates my complaints against SAN vendor doctrine, obstinately adhering to the concept of one large pool of disks from which all volumes are created for any purpose (data, log, and junk). There is no consideration for the radically different characteristics of hard disks in random versus sequential IO (low and high queue depth IO behavior should also be an element of IO strategy). The architecture of all traditional relational database engines are built on the premise that high volume log writes are possible at very low latency in order to provide durability of transactions. And yet SAN vendors blithely disregard this to justify their mission to sell inexpensive commodity hardware at extraordinarily prices.
Transaction Performance Data
The chart below is CPU. The horizontal axis is time. One major division marked by the vertical line is 1 minute, and the small tick is 12 sec. The data points are 5 sec. There are 12 steps between each major division. The vertical axis is overall (system) CPU utilization in percent. Each of the stacked green lines represents an individual processor. There are 16 physical cores and 32 logical. A single logical core at 100% utilization would show a separation of 3.125% between lines.
On the second chart, the red line is the performance monitor object: SQL Server:Databases, counter: Transactions/sec. Note that the vertical axis is log-scale, base 10. One major division is a factor of 10. Each minor tick is an integer. The first small tick up from a major tick is 2, the next is 3 and so on to 9 for the last small tick.
The third chart is IOPS. Green is data reads, blue is data writes, and red is log writes. The vertical axis is log scale.
The fourth chart is IO latency, milli-sec per IO. The same color codes applies as for IOPS. Again the vertical axis is log scale.
The SQL activity is batch driven transaction processing. There are 14 or so threads concurrently looping through a set of items to be processed. Each item involves about 20 rows of insert or update activity, hence 1000 log writes per sec corresponds to approximately 20,000 transaction/sec on the SQL counter.
Most of the active data is in memory. There are probably 30-40 SELECT rows for each transaction or twice as many reads as writes. The data IO reads necessary to support the 20,000 inserts and updates/sec is about 2,000/sec, which the storage system is capable of supporting at about 4ms latency. This is because the data resides a small part of each disk. The actual latency for random IO is less than the expected value of 7.5 ms for data randomly accessed over an entire (10K) disk at queue depth 1.
For approximately 20 seconds out of each minute, the transaction rate drops from the peak value of 20,000 all the way down to about 8,000 per sec (noting the log scale). These are the check points when the data write IO surges to 20-50K IOPS, (which demonstrates that thee storage system is actually pretty decent) even though write latency is driven up to 50-90ms.
The checkpoint surge also pushes log write latency up from 1ms to 20-30ms. This dataset occurred during the day, when presumably there was activity for other hosts on different volumes but on the same SAN. At night, log write latency away from checkpoints could be under 0.3ms even at high volume.
The Storage System
I had not previously discussed the storage configuration in detail. The storage system consists of 240 x 10K HDDs only, with the standard system level caching. The SQL Server host is connected to the SAN over 4 FC ports (8Gb from host to switch, 4Gb from switch to SAN, and presumably 4Gb on the SAN backend?). The data is distributed over 8 volumes. The log is on a separate volume as seen by the host OS.
The problem is that on the SAN, all disks are aggregated into a single pool, from which volumes are created. This was done per standard SAN vendor doctrine. Their magically great and powerful "value-add" intelligence would solve all your performance problems. We cannot ask for dedicated physical disks for the log because the SAN was already configured, with the SAN admin getting assistance from the SAN vendor's field engineer who followed the SAN vendor's doctrine.
Input from the DBA team was not solicited and would have ignored in any case. Besides, there are no unallocated disks. And no, the SAN team will buy more disks because there are no empty bays in the disk enclosures. And there is no room for more enclosures in the storage rack. So the DBA request is denied.
Even if we put up the money to get an extra cabinet for one more disk enclosure, the SAN admin will still refuse to configure dedicated physical disks for the log volume because the SAN vendor said that their great and powerful SAN will solve all performance problems. Any problems must be with their application and not the SAN. As can be seen from the charts above, this is a load of crap.
The SAN Vendor Solution
As I said above, this particular SAN is comprised of 240 or so 10K HDDs.
Naturally the SAN vendor's proposed solution is that we should buy more of their value-add products in the form of auto-tiering SSD-HDD, and perhaps additional SSDs as flash-cache. The particular SAN with base features probably has an amortized cost of $4,000 per HDD. So SAN with 240 disks would cost just under $1M (while still failing to provide desired database performance characteristics). A mid-range SAN might have amortized cost per disk of $1,500-2K. Enterprise SAN could be $4-6K per disk.
The additional value-add features would substantially increase the already high cost, while providing only minor improvement, because the checkpoint IO surge will still drive up log write latency.
The sad thing is that the SAN group might buy into this totally stupid idea, and refuse to acknowledge that the correct solution is to simply have dedicated physical disks for the logs. If there were dedicated physical disks, the checkpoint data IO surge goes to completely different physical disks than the log disks.
In the specific example, it is not necessary to have separate FC ports for the logs. The 50K IOPS at 8K per IO generates 400MB/sec, which is only 25-33% of the realizable IO bandwidth of 4 x 4Gbit/s FC ports. The checkpoint IO surge would increase latency on data reads, but the primary reason for the performance drop is the increase (degradation) in log write latency.
Another angle is changing the checkpoint parameters in SQL Server, but the real problem is because we are prevented from leveraging the pure sequential IO characteristics of HDDs by allocating data and log volumes from a common pool.
One more item. In the old days before we had immense memory, typical transactional database data read/write mix was 50/50. This is because a read forces a dirty page to be written to disk. In this situation, a data write IO surge would also depress the data reads necessary to support transactions. So the standard practice those days was to set the checkpoint interval to infinity to prevent data write IO surges. In our case, very little data reads are necessary to support transactions, so the checkpoint surge might depress data reads should have lesser impact on transactions. It is the increase in log write latency that is depressing transaction volume.
Solutions that work
A simple solution that would work is to have separate dedicated physical disks for the log volume. It is that simple! And yet this not possible, and the SAN people would refuse, as it is not in their agenda.
It is unfortunate that the only practical solution is to get the critical database off the corporate SAN. I have done this by going to all flash in the form of PCI-E SSDs. That is, SSDs installed internal to SQL Server system. This is not because the exceptional performance characteristics of SSDs were required.
It was because I needed to get away from the SAN admin and his dogmatic adherence of SAN vendor doctrine. The IO performance requirements could have been meet with a direct-attach HDD array (or on a SAN). But anything with HDD enclosures would have been under the authority of the SAN admin, who would have nixed any storage components that was not a SAN, and then configured it according the SAN vendor principles.
I have used the excuse that PCI-E SSD "accelerators" are needed for tempdb, which are not really "storage", hence there is no IT department mandate that it be on the SAN, under the absolute control of the SAN admin. In fact there were no special requirements for tempdb different from that of data. Then for unrelated reasons, there was enough capacity to put the entire DB on local SSD. Perhaps a file group with non-critical objects could reside on the SAN to carry the pretense of the local SSD not really being storage.