Generating High-Performance SQL Server Query Results

SQL Server does a fantastic job of interpreting commands in order to generate high-performance query results. Occasionally though, we need to change how we talk to SQL Server in order to squeeze out optimized performance when it really counts. In this article I'll look at some examples of when that occurs and how to change the way we tell SQL Server what we want in order to optimize performance.

A SQL Server Performance Scenario

To set the stage for this performance tip, consider a fictional website that lets members sign up and contribute recipes. Site members and visitors can then search and browse recipes by keywords, category, text searches, and so on. With a properly tuned and indexed database, users should be able to enjoy split-second responses to navigation requests and queries even with up to 12 million recipes logged in the system.

Performance will become an issue when this recipe site needs to kick out reports, or statistics, about which members have contributed the most recipes to the site. The reason for this, of course, is that rather than being able to use indexes to quickly seek out data (as with normal searches or navigation requests), SQL Server will have to scan entire indexes in order to properly aggregate, or report on, results. And with a lot of data (such as 12 million recipes) this is going to take some time–even on beefy hardware. There are, however, some general rules and techniques that you can use to mitigate those response times.

Sample Environment

To help make the techniques I'm about to outline a bit more clear, let's take our fictional recipes site to the next level, and stub-out two fictional tables–one for recipes and the other for members:

CREATE TABLE dbo.Members (
      MemberId int IDENTITY(1,1) NOT NULL,
      UserName nvarchar(30) NOT NULL,
      DateJoined smalldatetime NOT NULL
            CONSTRAINT DF_Members_Joined DEFAULT GETDATE(),
      --Bio,
      --OtherStuff,
      --etc,
      CONSTRAINT PK_Members
            PRIMARY KEY CLUSTERED (MemberId)
)
GO
 
CREATE TABLE dbo.Recipes (
      RecipeId int IDENTITY(1,1) NOT NULL,
      MemberId int NOT NULL,
      RecipeName nvarchar(200) NOT NULL,
      --Instructions,
      --RecipeCategoryId,
      --etc
      CONSTRAINT PK_Recipes
            PRIMARY KEY NONCLUSTERED(RecipeId)
)
GO

By way of example, I'll put a clustered index on the Recipes table–against the member or owner of the supplied recipe. This will make the most sense in terms of fetching recipes by member –such as in cases where individual members want to look at their own, contributed recipes (for example, on their own home page on the site). This index choice, however, might conflict with some of the reporting needs we'll look into though–so we'll cover that as applicable. But for now, I'll add that index, and throw in a Foreign Key constraint as well:

CREATE CLUSTERED INDEX CLIX_Recipes ON
dbo.Recipes (MemberId)
GO
 
ALTER TABLE Recipes
      ADD CONSTRAINT FK_Recipes_ByMember
      FOREIGN KEY (MemberId)
      REFERENCES dbo.Members (MemberId)
GO

Then, using the magic of your imagination, we'll pump around 1 million rows into the Members table, and around 12 million rows into the Recipes table. (Or, if you don't want to use your imagination, I highly recommend Red Gate's SQL Data Generator if you're interested in quickly spoofing some data into play for your own tests–which you can then follow along with in order to view performance costs differences as we proceed through this article.)

With a sample structure in place, we're now ready to look at a basic query:

-- this is our baseline query with a cost of n
SELECT TOP 100
      r.MemberId,
      COUNT(r.MemberId) \\[ContributedRecipeCount\\]
FROM
      Recipes r
GROUP BY
      r.MemberId
ORDER BY
      COUNT(r.MemberId) DESC

I've kept this query simple for two reasons: first, to showcase what we're after (the top 100 members who have contributed the largest number of recipes), and because keeping this query as simple as possible helps highlight some performance benefits that we'll see as we progress through the rest of this article. This simplified query represents our performance baseline.

It’s important to point out though, that if you have 12 million recipes, then SQL Server is going to have to iterate over each and every one of them in order to calculate which member has contributed the largest numbers of recipes. In our case, putting a clustered index on the Members column has helped move this operation from being a table scan to being an index scan, but the result is an operation that iterates over 12 million rows. Likewise, if our clustered index were on a different column (say the RecipeId–or some other, more beneficial, column) then we would NEED to put an index over the MemberId like so:

CREATE NONCLUSTERED INDEX IX_Recipes_ByMemberId
ON dbo.Recipes (MemberId)
GO

But even with that index in place, SQL Server won't be able to seek against the index; it will have to do a scan against it. In the case of a non-clustered index, SQL Server would actually end up needing to pull LESS data into memory–so this operation would actually execute much more quickly. Translation: In a real-world environment, you would want to contrast the need to be able to pull back recipes by member id (where a clustered index would make a lot of sense) versus the need or impact of generating aggregate reports.

But, no matter what you do, you're still iterating over 12 million rows.

A Performance Snafu

With those basics now covered, let's go over some ways to boost the performance of your actual reports or queries. Specifically, let's look at a more useful query, in terms of user experience, based upon our previous, TOP 100 Recipe Submitters, query listed above:

-- this query is 1.5x more expensive than previous query
SELECT TOP 100
      m.UserName,
      m.DateJoined,
      COUNT(r.MemberId) \\[ContributedRecipeCount\\]
FROM
      Recipes r
      INNER JOIN dbo.Members m ON r.MemberId = m.MemberId
GROUP BY
      m.UserName,
      m.DateJoined
ORDER BY
      COUNT(r.MemberId) DESC

As you can see in this query, rather than pulling back just a non-friendly MemberId, we're now pulling back a bit of meta data about the members who have contributed the largest numbers of recipes. By supplying a UserName and DateJoined info for these Top 100 contributors, our output becomes much more beneficial to site visitors.

The problem, however, is that our query has just taken a very expensive performance hit.

In fact, in my test environment where I'm running this, I see that this query now takes about 1.55 times the amount of resources that the previous query took. Yet, if we think about it, logically, all we've done is JOIN the member details for 100 members against our previous result. So what's going on?

Sadly, the Query Optimizer is just doing what we told it to do, and we just need to do a better job of explaining how we want this query satisfied. Because, ultimately, what we want is a list of the TOP 100 MemberIds contributing the most recipes and information about those members.

In our case, however, what SQL Server is providing is: the Top 100 members joined to Recipes–sorted by the highest number of contributors–and the process of joining EACH and EVERY member to EACH and EVERY recipe is adding serious overhead.

Optimizing Performance on Aggregated Reporting

This brings us to what I call the hourglass principle for maximizing SQL Server performance. In its most basic form, this principle dictates that you try to filter as much data out of your queries as possible before you go back and do any JOINS or add any metadata needed to format your query or report for output.

I've named this principle because it mimics the way an hourglass works: it's wide at the top, funnels down into a narrow, constricted, aperture, and then becomes wide again at the base. (And, just so we're clear, this technique is well understood by plenty of SQL Server performance gurus. I just happen to describe the technique using this metaphor.)

The point of this approach, however, is to try to narrow down search criteria as much as possible to avoid the overhead of incurring JOINS or other operations that you don't want or need. Commonly, this technique can be accomplished through the judicious use of a WHERE clause (though sometimes you might need to be more explicit in telling the query optimizer how you want to go about that). Likewise, this technique becomes a MUST when working with linked or remote servers because the last thing you want to do is try to force a JOIN (for example) across servers. Instead, you're better off pulling back all the data you need from a remote query (think hourglass and try to narrow that data as much as possible), then you can join it again on your server to expand the final outcome as needed.

Implementing the Hourglass Principle

In our case we can implement an hourglass approach in one of two ways. Either we can use a temp table/table variable or we can use a Common Table Expression (or CTE). Typically, I prefer table variables over temp tables, and CTEs over both temp tables and variables. But when performance is a must, I commonly take all three approaches and then use SQL Server Profiler to tell me which one works best.

Here's an example of how you could implement this same query using a table variable (which also has the benefit of working with SQL Server 2000, whereas a CTE won't be an option):

DECLARE @tops TABLE (
      MemberId int NOT NULL,
      RecipeCount int NOT NULL
)
 
-- keep query criteria as 'narrow' as possible
INSERT INTO @tops
SELECT TOP 100
      r.MemberId,
      COUNT(r.MemberId) \\[RecipeCount\\]
FROM
      Recipes r
GROUP BY
      r.MemberId
ORDER BY
      COUNT(r.MemberId) DESC
     
-- now we're safe to 'expand' the result set   
SELECT
      m.UserName,
      m.DateJoined,
      t.RecipeCount
FROM
      @tops t
      INNER JOIN dbo.Members m 
             ON t.MemberId = m.MemberId

In this query, you can see that I'm being more deliberate in terms of informing the optimizer what I want do to. Instead of joining all Recipes against their members or owners, I start by grabbing a list of the Top 100 contributors (to narrow my query as much as possible), then I'm widening that query out by joining JUST those 100 MemberIds back against the members table.

The same cost benefits pan out if you use a CTE as well:

-- this CTE narrows our results
WITH tops (MemberId, RecipeCount)
AS (
      SELECT TOP 100
            r.MemberId,
            COUNT(r.MemberId)\\[RecipeCount\\]
      FROM
            Recipes r
      GROUP BY
            r.MemberId
      ORDER BY
            COUNT(r.MemberId) DESC
)
 
-- now we're safe to 'expand' the result set
SELECT
      m.UserName,
      m.DateJoined,
      t.RecipeCount
FROM
      tops t
      INNER JOIN dbo.Members m
            ON t.MemberId = m.MemberId

In both cases (using a table variable or using a CTE), the net result is 11.999 million fewer joins–which, obviously, translates into improved performance.

The aggregate operation of scanning all 12 million recipes still takes some time, but by using the hour-glass approach, this query is back down to being 1.001x the cost (or even 1.0) as the first query listed above, which is a much better cost than the 1.5x cost we were seeing with the straight JOIN approach.

Conclusion

Aggregate operations in SQL Server over large amounts of data can be expensive. Consequently, if your site/application relies heavily upon them I recommend good indexes and then either caching results for a while, or creating a scheduled job that drops off these pre-calculated aggregates into a stats table that your application can query as needed without incurring a massive performance penalty.

But using the approach of restricting data as much as possible to satisfy your queries before joining against any metadata that you want to include with your output is a best practice that can generate some big performance gain–whether you're dealing with aggregates or not. So, the next time you're struggling to boost the performance of a big or ugly query, make sure you've analyzed it in regard to how well it complies with the hourglass principle.

Comments

Plain text