The general idea behind storage performance for data warehouses is relatively simple. Think many fat pipes, and very many spindles (disk drives). Over time, this may be adjusted to many SSD devices for the main data and temp databases, and many hard drives for extra spaceCaution: Do not put the DW database on shared storage
This warning cannot be overstated. Storage system vendors and their acolytes have blind faith that their storage system is immensely powerful, suitable to centralize storage hosting for the entire enterprise. This concept is elevated to unquestioned doctrine.
The problem is that is not entirely true, and sometimes not even remotely true. This is just the story they tell to justify (rationalize) immensely expensive storage systems. Considering that a bare 15K disk drive costs $200-600 each, how is it that the storage system has an amortize cost per disk in the $2000-6000 range?
A storage system is no more powerful than the performance provided by the number of disk drives in the system, and sometime less that because of the storage system overhead. Another fantasy is that the storage system cache will solve all performance problems. This is a really stupid idea as far as databases are concerned because the database engine itself is a big disk cache which knows better how the data will be used. In fact, the database engine is built around the peculiar nature of disk drive random and sequential IO performance characteristics. Hence to separation of data and log.
There is a fundamental incompatibility between the objectives and requirements for data warehouse and online transaction processing systems (and other storage users). An OLTP system desires the ability to support high volume transactions, with the equally important criteria of maintaining low response times. A data warehouse system should be saturating both the processors and storage for maximum through-put. Furthermore, OLTP disk IO should be mostly random to data for which disk queue depth should be kept low, with sequential log writes. A DW may generate massive sequential IO, along with large block IO, and also high-queue pseudo-random IO.
Many storage systems are kind of designed for random IO, usually with a mechanism in place to prevent any one host from saturating the system in order to provide balanced services to multiple hosts. The general expectation is that mainstream storage systems have absolutely pathetic price-performance in terms of the requirements for data warehousing. It is my suspicion that the real reason Oracle entered the storage system business is because they just got tired of listening to customers complain the DW performance is unacceptable even on their multi-million storage system. There is nothing wrong with a properly designed Oracle data warehouse. It is just that mainstream storage systems are designed completely in contradiction to the objectives of DW storage performance characteristics and requirements.
Below are brief notes on the difference between nominal, or signaling bandwidth and net realizable bandwidth for various IO channels in the path to storage.
PCI-E Gen 2 – 5Gbit/sec signaling
x8 = 5GB/s, net BW after 8b/10b encoding = 4GB/s, net = 3.2GB/s(?)
x4 = 2GB/s, net = 1.6GB/s(?)
SAS 6Gbit/s – 6 Gbit/s
after 8b/10b = 600MB/sec per lane, x4 port: 2.4GB/s nominal,
net after protocol overhead 88.3% = 2.16GB/sec
Fibre Channel 8 Gbit/s nominal
after 8b/10b = 800MB/s, net after protocol 780GB/s point-to-point,
680MB/s from host to SAN to back-end FC-AL (loop)
SAS RAID Controller, x8 PCI-E G2, 2 x4 6Gpbs
2.8GB/s (this depends on the controller and should improve over new generations.
PCI-E, FC and apparently as well SAS all use 8b/10b encoding, so the bandwidth after encoding overhead is 80% of the quoted value. All three are also bi-directional. Some vendors feel compelled to cited the combined bi-directional bandwidth as if that would be more clear. There is additional protocol overhead. For PCI-E, this might be another 20%, but additional investigation might be needed. LSI provided the protocol overhead values for SAS.
My recommendation is that the data warehouse be configured with dedicated direct-attach storage using 15K SAS 2.5in (SSF) drives, or possible a SSD/HDD hybrid. There is really no requirement for the capabilities provide with SAN based storage systems, especially considering the extravagant expense. The diagram below emphasizes the point of multiple fat pipes and very many spindles with direct-attach SAS storage.
The purpose of this article is to discuss storage performance concepts in vendor neutral manner. But it may be helpful to understand cost structure. Without endorsing one vendor over other vendors, the Dell Power Vault MD1220 with 24 73GB 15K drives is as of 2010-Oct-06 is $11,049. For 24 146GB 15K drives, the price is $13,689. The PERC H800 RAID controller with 1GB NV cache is $799. The individual disk drive price for 73GB 15K 2.5in is $329 each and the 146G 15K is $439 each. Other options include a 500GB 7200RPM at $319 each and a 600GB 10K for $1009 each. The set of 8 enclosures, 24 disks each with 4 controllers is $89-112K, a very reasonably price for 10GB/sec bandwidth and 40K low queue, or 80K high queue 8K random IOPS.
The diagram below shows an example with Hybrid SSD-HDD storage. Recent SSD models with 6Gb/s SATA/SAS interfaces can provide 350MB/sec on large block reads. Six or so SSD devices should be able to saturate a x4 SAS 6Gbps channel (around 2GB/sec net capability). The typical disk enclosure can accommodate 24-25 2.5in disks in a 2U form factor. The enclosure should be able to be split into 2 x4 SAS sections (verify this from the vendor for the specific model!). This would leave 6 extra 2.5in disk bays. Since the enclsoure does cost money, I suggest filling the empty slots with standard HDD drives. Either 146GB 15K drive or possibly 300-600GB 10K drives are suitable. I have some concerns about the incidence of write errors in 7200RPM drives even if the interface is SAS.
Dell does not list their 50GB and 100GB SSD as options for the MD1220. These options on their server systems are $1100 for the 50GB 3Gbps SAS interface (?), and twice that for the 100GB model. There is a 149GB SSD (also 3Gbps SAS) option for the MD 1220 at $4,099. The SSD price from Crucial for a 64GB SATA 6Gbps interface unit is $149, with other options at 128GB and 256GB. The Crucial 64GB is probably equivalent capacity to the 50GB server model, which should have more capacity set aside for improved write performance and longevity.
It is expected that a enterprise/server grade product is more expensive than a consumer product. Double or triple is not unreasonable. However, consumer product prices are subject to continuous and aggressive price reductions for competitiveness. Server products tend to be introduced at a given price and then stay flat. This is just the nature of the business. I am hoping the spread between the server and consumer SSD will close somewhat with the 6Gbps SAS generation.
If one really wants SAN based FC storage,
then the diagram below shows just the distribution of IO over many FC IO channels.
Other SAN components like switches, redundant paths and service processors (controllers) are not shown.
The diagram shows a quad-port 8Gbps FC HBA in a x8 PCI-E gen2 slot,
which is a reasonable bandwidth match at 4 x 8Gbps on the FC side,
8 x 5Gbps on the PCI-E side.
Dual-Port 8Gbps FC HBAs are in PCI-E gen2 x4 slots.
I am not aware that there actually any quad-port 8Gbps FC HBA on the market.
There are quad-port 4Gbps HBA.
Both Emulex and QLogic have recently released quad-port 8Gbps FC HBAs (2010-Oct).
My diagram shows 30 disks in 2 x 15 disk enclosures per 8Gbps FC port. The expectation is about 27MB/sec per disk in sequential or large block IO for a combined bandwidth of 700MB/sec per 8Gbps FC port. This should be possible, but many production SAN systems seem to deliver only 10MB/sec per disk. Standard SAN FC configuration has a pair of FC connection to each enclosure for redundancy. In this case, each 8Gbps FC loop-pair should connect to 60 disks. I would recommend 2.5in disks for maximum disk density. If the SAN vendor offers 2U enclosures for 24-25 2.5in disks, then I suggest 2 enclosures for each 8Gbps loop pair. Many SAN vendors have stubbornly clung to the 3.5in disk, typically a 15 disk enclosure in 3U.
I interpret this as complete disregard and absence of knowledge of the concept of storage performance. There is a popular fantasy, despite the complete absence of supporting evidence-schmevidence, that the SAN with its large (8-32GB?) cache will solve all storage performance problems, so well in fact that the storage system can be shared, and that it also perform all sorts of management functionality. But thats just my opinion.
The typical enterprise SAN cost structure is an amortized cost per disk in the range of $3000 for the 146GB 15 drives to $6000 for the 450-600GB 15K drives. A large SAN capable of sustaining 10-20GB/sec might be comprised of 1000 disks over 20-40 FC ports with a cost north of $3M.
There are entry and mid-range SAN systems with FC on the front-end and SAS on the back-end. As an example, the Hitachi AMS line can support an amortized cost per disk of around $1500 for the low capacity 15K model.
I have said else where that enterprise SAN systems are just not designed for data warehouse usage. The Microsoft Fast Track Data Warehouse Reference Architecture whitepaper has demonstrated that multiple entry level SAN systems can be configured for data warehouse IO characteristics. Note: my diagram is not in conformance with FTDW.
Over the last year or so, Microsoft put out their Fast Track Data Warehouse (FTDW) Reference Architecture in a collection of papers. I have expressed my complaints elsewhere, and will also elaborate later, but the main point to highlight now is that the 200MB/sec memory (data) consumption rate per processor core cited in the FTDW paper was for the Intel Core 2 architecture processor. It was pointed out that this will change from processor to processor.
So it is now necessary to re-evaluate the data consumption rate. The Microsoft FTDW paper used the a modified TPC-H query (6) to do a simple aggregate of the Line Item table. As that was a modified query, we would have to actually run the test for each system. The TPC-H benchmark Query 1 on the otherhand is essentially a table scan of the Line Item table (actually a clustered index seek of over 95% of the table). Query 1 is more complicated, aggregating several columns, each of which contributes more cost, and aggregates witha Group By, which involves a Hash Match that is more expensive as a simple Stream Aggregate. But because it is a standard query, we can use published reports to assess memory consumption rate by system and processor architecture.
The tables below show calculations for Xeon and Opteron processors. Some additional information: the TPC-H SF100 Line Item table is approximately 100GB with the old 8 byte datetime data typem and 8.75GB with the new 4 byte date data type.
|Processors||GHz||total cores||Mem GB||SQL||SF||Q1 sec||Tot MB/s||MB/s per core|
|2 Xeon 5355||2.66||8||64||2k5 sp2||100||85.4||1,165.5||145.7|
|2 Xeon 5570||2.93||8||144||2k8 sp1||100||42.2||2,073.5||259.2|
|2 Xeon 5680||3.33||12||192||2k8 R2||100||21.0||4,166.7||347.2|
|4 Xeon 7560||2.26||32||640||2k8 R2||300||37.2||7056.5||220.5|
|8 Xeon 7560||2.26||64||512||2k8 R2||3000||183.8||14,282||223.2|
Note: at the SQL Bits conference, I had the 4-way Xeon 7560 data consumption rate at 250MB/sec per core. That was an incorrect calculation based on SF1 = 1000MB instead of the correct 875MB with 3 Date columns replacing of Datetime columns for SQL Server 2008.
For the Xeon 5355 (Core 2 architecture processor), the data consumption rate is 145.7MB/sec per core, as expected somewhat lower than the 200MB/s for the simpler baseline query in FTDW. However, note the capabilities of the newer Xeon 5570 (Nehalem-EP) and especially the 5680 (Westmere-EP) cores at 259 and 347MB/sec per core respectively. The slower 2.26GHz core in the Xeon 7560 can still consume 223MB/sec per core. The data consumption per processor socket is an impressive 2.1GB/sec for Xeon 5680 and 1.8GB/sec for Xeon 7560. It is possible some of the gain may be attributed to SQL Server 2008 R2 improvements.
The table below show TPCH-H Query 1 based data consumption rates for various Opteron processors.
|Processors||GHz||total cores||Mem GB||SQL||SF||Q1 sec||Tot MB/s||MB/s per core|
|4 Opteron 8220||2.8||8||128||2k5 rtm||300||309.7||868.7||121.1|
|8 Opteron 8360||2.5||32||256||2k8 rtm||300||91.4||2,872.0||89.7|
|8 Opteorn 8384||2.7||32||256||2k8 rtm||300||72.5||3,620.7||113.2|
|8 Opteron 8439||2.8||48||256||2k8 sp1||300||49.0||5,357.1||111.6|
|8 Opteron 8439||2.8||48||512||2k8 sp1||1000||166.9||5,242.7||109.2|
|2 Opteron 6176||2.3||24||192||2k8 R2||100||20.2||4,331.7||180.5|
|4 Opteron 6176||2.3||48||512||2k8 R2||300||31.8||8,254.7||172.0|
The Opteron 8360 is Barcelona, the 8384 is Shanghai, and the 8439 is Istanbul. All are relatively comparable in the range of 100MB/sec per core. One might have expected an bump in performance per core in Istanbul because of HT Assist, but it is not evident. The steep performance boost from Magny-Cours, Opteron 6176, is somewhat unexpected. It is unclear how much of this is from SQL Server 2008 R2, or from the improved topology and interconnect paths of the 4-socket, 12-core system relative to the 8-socket 6-core predecessor.
|Processors||total cores||BW per core||Target MB/s||PCI-E x8-x4||SAS HBA||Storage Units/Disks||Storage Units/Disks||Actual BW|
|2 Xeon 5680||12||350||4,200||5 - 1||2||2 - 48||4 - 96||5 GB/s|
|4 Opt 6176||48||175||8,400||5 - 1||4||4 - 96||8 - 192||10 GB/s|
|4 Xeon 7560||32||250||8,000||6 - 4||6||6 - 144||12 - 288||15 GB/s|
|8 Xeon 7560||64||225||14,000||9 - 5||11||10 - 240||20 - 480||26 GB/s|
The table below shows the SQL Server Query Optimizer IO Model along with "actual" IO capability of several storage configurations. The last columns is the Sequential to Random IO ratio both in 8K pages.
|IO Capability||Disks||BW KB/s||Sequential IOPS||"Random" IOPS||Sequential/ Random IO|
As discussed else where (Cost-Based Optimizer) the optimizer model weights sequential IO cost as 1350 pages per second, and the "random" IO in key lookups and loop join inner source at 320 per second. When the CPU cost is factored in, along with the percentage of key lookups or loop joins requiring IO, the typical index seek+key lookup to table scan cross-over occurs at 3.5.
Since the metric of interest is the Sequential to Random IO ratio, we only need to the consider the metric for a single HBA/controller, not multiple controllers, as long as the bandwidth and disks ratio is fixed. As is evident, a proper direct-attach storage configuration can achieved nearly 10X the sequential-random ratio assumed by the optimizer. For a 4Gbps FC with 30 15K disks, the sequential-random ratio is a close match. This is purely a coincidence and not by design, as the SQL Server Query Optimizer cost formulae were actually matched to the performance capabilities of 4 7200RPM disks from the 1994-5 era.
When the storage system is not properly configured, the reference query may register something similar to the following on Task Manager.
A proper HDD configuration with brute force bandwidth capability should register with nearly saturated CPU on single table scan operations, and possibly also on 2 table scan joins.
Only a truly massive number of disk drives can support the pseudo random IOPS generated by lookups or loop joins. This scenario is best matched to the hybrid SSD solution.
In the example below, there are 12 disks in a RAID group, and each RAID group is presented as a single LUN. For direct-attach storage RAID groups can be upto 16 disks. Smaller RAID groups can be rebuilt from a disk failure more quickly. Larger RAID groups results in fewer LUNs to manage.
On SAN systems, the RAID group could be limited to 14-16 disks (8+8 R10, 14+2 R5). The SAN operating system may also group several LUNs into a large meta-LUN comprised of more disks. This is not recommended for sequential performance.
The Microsoft FTDW reference specificies entry SAN systems with 11-disks, and 5 two-disk RAID 1 groups. This is due to the peculiar characteristics of the SAN where it is possible to achieve sequential IO of 100MB/sec per disk with small RAID group. FTDW proposes 2-disk RAID 1, an EMC document suggests 3-disk RAID 5. For some unkown reason SAN systems cannot achieve high sequential transfer rates with larger RAID groups. Direct-Attach RAID controllers seem to have no problems in this matter.
The operating system will see the LUNs (or RAID groups) as individual disks. One should always be absolutely clear on the actual OS (host) to storage system physical disk mapping. A group of physical disks could be presented as multiple LUNs sharing the same disks. Each LUN might be able to support 600MB/sec with no activity on the other LUNs, but the combined set might end up supporting less because the physical disks are nolonger in sequentially IO.
The strategy for file layout that most people seem to arrive at for a partitioned table is something like the following. A large storage system present multiple LUNs to the OS. A partition is created on each LUN (seen by the OS as a phyical disk). At the SQL Server level, a file group is created for each partition. One file is assigned to each file group. Each file is placed on its own LUN/disk group.
Partitioning was introduced for SQL Server in version 2005. Originally, a table scan would start with one partition, then continue on to the next partition and so on to the last partition. With the file layout show above, one set of disks would be active while each partition is being scanned, and the other disks would be idle. SQL Server 2008 (and possibly a later sp to 2005?) changed the parallel table scan behavior to simultaneously scan all partitions (?). It is suggested that the actual table scan disk IO behavior be verified before adopting the final file layout. The general concept in table scan for performance is to simultaneously access all disks with one stream each. The MS FTDT Reference Architecture says to create 2 LUNs from each RAID 1 group, meaning two streams will be accessing each disk group. It would seem that this would disrupt the pure sequential transfer behavior of hard disk drives, but it does deliver 100MB/sec per disk, which seems to be a paradox.