return

Joe Chang, SQL Server Performance Engineer - Consultant

jchang6 @ yahoo.com
South Deerfield, MA
US Cell: (407) 791-1126

Website: www.qdpma.com
Blog: sqlblog.com/blogs/joe_chang/default.aspx
LinkedIn: www.linkedin.com/pub/joe-chang/1/a45/174
Other: sql-server-performance.com authors

Summary

20+ years database architecture, design and performance.
Reverse engineered the SQL Server Query Optimizer cost model and formulas.
Quantified the true cost of SQL Server execution plan operations by processor and system architecture.
Exposed unusual behavior in SQL Server scaling at high parallelism and NUMA issues.
Originator of SQL Clone, a shell database having objects and data distribution statistics, but no data,
now a standard feature in SQL Server as of version 2014 SP2
Decoded SQL Server statistics binary structure for write-your-own statistics.
Publisher SQL ExecStats for cross referencing execution plans to index usage.
Security Clearance: 2009

Employment

Independent Consultant, QDPMA, 2000 to present (SolidQ 2007)

Database Architect and Performance Consultant
Major Projects:

SCOR (Reinsurance) 2014-16:

Support a planned 4X increase in volume from a corporate merger. Initially brought in as infrastructure (server and storage) expert. Rewrote critical SQL code sequences for correctness, efficiency, repeatability. Eliminated unnecessary recompiles. Identified and resolved blocking to achieve scaling at high concurrency. Restructured primary/clustered keys. Complete overhaul of nonclustered indexes with a drastic reduction. Worked around tables with computed columns defined by scalar functions at first. Then worked out plan to remove compute columns, touching several hundred procedures.

Mercury Payment Systems 2012-13

Database re-architecture to support 10X volume increase. On investigation, determined that objectives could be achieved by tuning key SQL code. Identified the true critical root source bottleneck being in the latency for remote server connection initialization, proving objectives could be achieved without re-architecture.

CRM – 4th Gen Programming 2011

The client had built a CRM app with a 4GL product achieving OK performance, but a subsequent patch resulted in daily system seize-ups. Vendor engineers came on site but could not find anything wrong. I was recommended by an Oracle expert. A CPU filtered profiler trace showed some SQL called with API Server cursor. A short-term unfiltered trace was then run, showing that cursors were opened but never closed. A nuance was that the client app (with permanently open connections) generating API Server cursor calls did not have symptoms, while another application using connection pooling did. The victim of unclosed cursors was the sp_reset_connection call. Vendor was not able to provide a quick fix and probably did not understand why this was a problem either. Moving from 32-bit to 64-bit would alleviate the problem. Just resetting the client app to force close connections and cursors was a workaround.

Evolver (kCura Relativity Project) 2011

Client had encountered serious performance problems with kCura Relativity in supporting a very large legal document discovery case with a large number of active users. A new high bandwidth storage system was specified and procured. Developed a plan for migrating the 2TB database off an iSCSI storage system on 1GbE within a limited maintenance windows. SQL Server did not produce efficient parallel execution plans for the style of SQL that Relativity generated. Demonstrated that an alternative style did allow SQL Server to generate parallel execution plans 100-1000 times faster. The kCura SQL expert instead directed developers to tackle the problem symptoms which did not help.

FBI CODIS (Unisys) 2010-11

Corrected serious flaws in the original server and storage configurations. Identified memory round-trip latency, not bandwidth as the key determinant. Rewrote critical SQL sequences to meet project requirements to support processing 50M DNA specimen in 8 hours. Notable code was the double pivot with false aggregation. Eliminated abusively frequent user defined function calls. Recognized that even though the contracted performance requirements could be met, the 10-hour turn-around time for QA cycles posed serious problems. Either the project schedule was not going to be met or the code would have an excessive number of bugs, and probably both. A moderate cost hardware configuration allowed QA cycles to be turned inside of 2 hours.

Customs and Border Patrol (Agilex) 2009

Typical government IT contracting, spending money generating useless paperwork instead of directly tackling the important applications.

US Army Funds Control Management (ATT/Ingenuity) 2007-09.

Mandate was to substantially improve performance as quickly as possible in the new US Army Funds Control management system. Instructions were to do so without changing the business logic. The project was facing cancellation in as little as six weeks if corrective action was not achieved. Performance gains far above requirements were achieved, with the necessary gains in two weeks by rewriting key business logic SQL. In the critical period of the first two weeks, coding rework was done on the production system.

Imceda/Quest LiteSpeed 2004-06

Worked with developers in identify Windows API and C code strategies for backup and restore performance on high-bandwidth IO sub-systems and scaling on (HP Integrity) NUMA system architecture. Wrote papers on system and storage configurations for high bandwidth IO. Advised on tuning strategies in buffer size and count, balancing performance against available 32-bit VAS, as opposed to work by others on peak performance, allocating excessively large and overly numerous buffers when contiguous address space was not likely to be available in a production system.

Cendant Hotel Reservation System Proof of concept (Microsoft Consulting Services 2004)

Existing system on 3 Sun E10000 Starfire server (each with 60 processors) and Informix. Each server could support 300 searches per second. The goal was to determine if 2-socket Opteron systems could support 150 searches per second. A team of Cendant internal people and Oracle engineers ported code from Informix to Oracle. MCS was told that this was to be an unbiased apple-to-apples comparison between Oracle and SQL Server. I was to port the code to SQL Server accordingly. In the actual course of the project, the Oracle team kept talking about how they would achieve an advantage over SQL Server using native 64-bit code able to use 8GB of memory. SQL Server 2000 Enterprise Edition was precluded, and hence limited to 2GB in standard edition. That being the case, I felt no obligation to abide by the original instructions, and rewrote key code structures, achieving 3000 tps, versus 50 for Oracle on their port. It took the Oracle team 3 weeks to figure out my re-coding logic before declaring both implementations to be equal, and the Cendant team selected Oracle because that’s what they wanted in the first place.

Public Storage 2002-3

A local consulting firm built a new line-of-business, specifying deployment on a Unisys E7000 with 16 processors and a storage system with 12 HDDs. In the critical first day of month payment processing, the system would seize up for reasons not understood. The was quickly determined to be caused by a sub-procedure that was called once at night to reconcile an entire day or month of transactions, and then called in volume during the day to process mostly single transactions. The night time call activated a temp table statistics re-compute, producing a parallel execution plan, scanning moderately large tables. The day time calls retained this plan, resulting in the system being unresponsive until another re-compute occurred. Immediate action was to put in a force recompile each morning after the night run. There continued to be problems with the payment procedure but having different symptoms. The execution plan for the procedure looked to be efficient. A cursor was used to process line items even though most day time transactions had just 1 item. It was shown that SQL Server 2000 could run this procedure efficiently with up to 4 concurrently running threads, but not more on a NUMA system. This problem would not have occurred on a 4-way SMP system. The cursor was removed. Also, SQL Server 2005 fixed the NUMA cursor issue, as cursors were a part of the TPC-E benchmark.

S1/Edify 2000-1

Unisys had sold 12 ProFusion 8-way Xeon servers to a large foreign bank, bringing in S1/Edify for software and implementation. One pair of systems was for the database and the others for application servers. There were some index deficiencies in the database, easily fixed. The app servers were expected to support 1000 users, in part to justify the platform cost. Each user required 2 threads. Problems were traced to the thread stack size (1M), and a multi-threaded heap allocation method (MpHeap on NT 4.0) with poor choice of caching, and another critical segment that made avoidable large memory allocations. Fixes allowed the 8-way application servers to meet objectives even though more 2-way servers would have been a better solution.

Others: National Guard G1 Lifecycle, Kaiser Permanente

Intel Corporation, Portland, OR, 1997-1999

Senior Software Engineer
Microprocessor Products Group, ISV Performance Lab
- Proved that scaling Exchange 5.5 on the 8-way ProFusion was possible, after everyone else gave up hope. Compaq request assistance from the Intel benchmark team for the first time in a long while.

Performance Microprocessor Division, Strategic Marketing
- Coordinate efforts of architects, design team managers and platform architects in formalizing product strategy.
- Made the successful argument for one more year of IA-32 in 4-way servers when the official POR had just decided to terminate IA-32 processor after Tanner/Cascade (Pentium III Xeon on 180nm), rejecting the ESG argument for 2 years overlap with Itanium in 4-way. Xeon was subsequently continued indefinitely.
- Contributed the idea of a bus expander for PCI at 66MHz to get around the single slot limit.
- Initiated the effort to relabel the upcoming Pentium III 180nm product (Coppermine) for 133.3MHz bus at multiplier 5 to 667MHz before any official document was distributed to OEMs showing the frequency rounded to the integer down from the numeric value.

Insignia Solutions, Mountain View, CA, 1995-1997

System Engineer

Acer America, San Jose, CA, 1994-1995

Product Manager, Server Systems

US Navy, 1989-1994

Unrestricted Line Officer, Nuclear Propulsion, USS Arkansas CGN-41

Education

California Institute of Technology

BS Physics

Professional Certifications

Microsoft Certified System Engineer, 1995
Novell Certified NetWare Engineer, 1994

 

Speaking Engagements and Publications

SQL Saturday Bratislava, Jun 2015
SQL Saturday New York, May 2015

SQL Saturday Redmond, May 2014
SQL Saturday Atlanta, May 2014
SQL Saturday Budapest, Mar 2014

2013 Oct 6 at Silicon Valley Code Camp
SQL Saturday Riyadh, 23 May 2013
SQL Rally Nordic, Copenhagen, 2-3 Oct 2012
SQL Bits 7, 2010
SQL Server Connections 2003, 2004, 2005
HP World 2004,
CMG 2004

Presentations at SQL Server Users Groups

Munich, Germany, 10 Jan 2013
Execution Plan Cost Formulas
Stockholm, Sweden, 24 Jan 2011
Malmo, Sweden, 25 Jan 2011
Copenhagen, Denmark, 26 Jan 2011
Munich, Germany, 16 Dec 2010
SQL Performance for Developers
Melbourne, Australia, 9 Aug, 2010
Curacao, 22 Apr 2010

Selected Articles

SQL Server Cost Based Optimizer
Queries barely over the Cost Threshold for Parallelism
Hyper-Threading Performance

In-Memory Database Technology
Big Iron Systems and NUMA System Architecture
Intel Server Strategy Shift with Sandy Bridge

Storage Performance 2013
Storage Performance for Data Warehouses
IO Queue Depth Strategy
Solid-State Devices