Link to original article: http://msdn.microsoft.com/en-us/library/aa175774(v=sql.80).aspx
For years, you’ve seen Tom Moreau solve many T-SQL problems, eschewing such ugly approaches as temp tables, their table variable cousins, andworse yetcursors. This month, Tom makes a concession, showing that the exception proves the rule. Along the way, he shows you how to create what amounts to an index on a table variable. Very cool.
As you know, I enjoy receiving e-mail from you, particularly when you provide me with a challenging T-SQL problem or one that makes me dig in and learn something new. This month’s “best question” award comes from UK-based Dr. Graham Smith.
Graham has a table that looks and feels a lot like the Orders table in Northwind, so we’ll use it for our discussion. Graham needs to build a stored proc that takes the EmployeeID as an input parameter and returns a rowset with the OrderID, RequiredDate, and ShippedDate of all orders for that EmployeeID. However, the orders need to be sorted by RequiredDateand he also needs to generate a sequence number. Graham tried using a SELECT INTO in conjunction with an IDENTITY column, but that approach errored out, since the OrderID itself is also an IDENTITY, and you can’t have two identities in the same table. More about that later.
Doing running counts or running totals via a correlated subquery in the SELECT list is fairly straightforward in T-SQL, as I’ve shown in a number of columns, so let’s start there. Take a look at Listing 1and Table 1.
Listing 1. Using a correlated subquery to generate sequence numbers.
select o1.OrderID, (select count (*) from Orders o2 where (o2.RequiredDate < o1.RequiredDate or (o2.RequiredDate = o1.RequiredDate and o2.OrderID <= o1.OrderID)) and o2.EmployeeID = 9) as SequenceNo, o1.RequiredDate, o1.ShippedDate from Orders o1 where o1.EmployeeID = 9 order by o1.RequiredDate
Table 1. Results of Listing 1.
|10255||1||1996-08-09 00:00:00.000||1996-07-15 00:00:00.000|
|10263||2||1996-08-20 00:00:00.000||1996-07-31 00:00:00.000|
|10324||3||1996-11-05 00:00:00.000||1996-10-10 00:00:00.000|
|10331||4||1996-11-27 00:00:00.000||1996-10-21 00:00:00.000|
|10386||5||1997-01-01 00:00:00.000||1996-12-25 00:00:00.000|
This does a scan count of 44once through the table to pick up the rows for EmployeeID 9, and then 43 more times to calculate the sequence number, once for every row in the result set. But it gets worse: 4,007 logical reads. Wouldn’t it be nice to ease the burden on the Orders table somewhat?
If you had many orders and many employees, then the percentage of orders for a given employee would be quite low. With an index on EmployeeID, you should be able to get the rows you need for your result set fairly quickly. It’s the sequence number thing that’s dragging us down.
Longtime subscribers know that I’m not a huge fan of temp tables, but let me clarify my position: I only don’t like them when they’re not necessary. They do, however, earn their keep when your query has to do the same thing more than once. In those cases, it’s better to take the hit once and just refer to yourmuch smallertemp table as many times as you need. One more thingSQL Server 2000 allows you to use table variables, and they have proven in many cases to outperform your basic temp table. [BOL points out that table variables, which behave like local variables, offer several advantages: They have well-defined scopes, they're cleaned up automatically at the end of the function, stored proc, or batch in which they're defined, and they tend to result in fewer recompilations and less locking in stored procs and transactions where they're used.Ed.]
Listing 2 shows you an alternative solution, using a table variable. It gets populated with the required rows and columns (sans sequence number). The correlated subquery then gets applied to it. Check out the code.
Listing 2. Using a table variable.
declare @t table (OrderID int primary key, RequiredDate datetime not null, ShippedDate datetime null) insert @t select o1.OrderID, o1.RequiredDate, o1.ShippedDate from Orders o1 where o1.EmployeeID = 9 select o1.OrderID, (select count (*) from @t o2 where (o2.RequiredDate < o1.RequiredDate or (o2.RequiredDate = o1.RequiredDate and o2.OrderID <= o1.OrderID))) as SequenceNo, o1.RequiredDate, o1.ShippedDate from @t o1 order by o1.RequiredDate
The INSERT made 99 logical reads against Orders, with a scan count of just one, but it also caused 88 logical reads against the table variable @t. And it didn’t even return a result set, so there’s still more work to do. If we do a SELECT (which gives you a scan count of 44no surprises there), it costs us 88 logical reads. Thus, the total number of logical reads is 99 + 88 = 187, way lower than the original query in Listing 1. Not only that, but the query cost went from 0.145 down to 0.1117. Looks like a win-win.
You know me well enough by now to know that after that bit of work, I wouldn’t just raise the flag of victory, grab a cold one, and look for the remote. Let’s think about that table variable. The correlated subquery looks at RequiredDate and OrderID, so it would be very cool if we could add an index to that table variable. Although you can’t do a CREATE INDEX against a table variable, you can have an index created behind the scenes when you declare the table variable with a PRIMARY KEY constraint (the same goes for UNIQUE constraints). Since a table can have at most one PRIMARY KEY constraint, any other index you add has to be done through a UNIQUE constraint. As its name implies, the combination of columns that make up the UNIQUE constraint’s key must be unique. So, any combinations of columns that also include the PRIMARY KEY constraint’s columns are de facto unique, too. Since we need an index on RequiredDate and OrderIDin order to get index coverage for the inner part of the subqueryadding a UNIQUE constraint on (RequiredDate, OrderID) does the trick.
This doesn’t just help with respect to the instance of @t (o2) used inside the subquery, it also helps with the instance (o1) outside the subquery. The subquery can tap the UNIQUE constraint twice and then do a Bookmark Lookup to pick up the remaining columns for o1. Listing 3 shows the revised DECLARE for @t. (The SELECT is the same as Listing 2 and isn’t shown.)
Listing 3. Adding a UNIQUE constraint.
declare @t table (OrderID int primary key, RequiredDate datetime not null, ShippedDate datetime null, unique (RequiredDate, OrderID))
Populating this version of table variable @t yields the same scan count and logical reads as before. However, the logical reads jump up to 132 for @t, since it has to populate not only the clustered index on the PRIMARY KEY constraint, but also the nonclustered index on the UNIQUE constraint. The final SELECT yielded a scan count of 87, with 173 logical reads. That looks like bad news. However, the query cost of Listing 3 dropped to 0.0693, suggesting that the saving was in CPU.
We’re not done yet. By default, PRIMARY KEY constraints are clustered, while UNIQUE are nonclustered. What would happen if we explicitly made the PRIMARY KEY nonclustered and the UNIQUE constraint clustered? See Listing 4.
Listing 4. Clustering on the UNIQUE constraint.
declare @t table (OrderID int primary key nonclustered, RequiredDate datetime not null, ShippedDate datetime null, unique clustered (RequiredDate, OrderID))
This time, you get one more logical read and a very slight drop in query cost to 0.0693. Of course, I decided to time the solutions and see who the real winner would besee Table 2.
Table 2. Query durations (ms) and relative query cost (percent).
|Duration||Std Dev||Relative Query Cost|
Each test ran over 100 iterations, purging the data cache via DBCC DROPCLEANBUFFERS each time. All three table variable solutions were 3X faster than the original query. Indeed, the average durations rivaled their own standard deviations. It appears that clustering on the UNIQUE constraint cost a bit relative to the other two table variable techniques.
Let’s get back to Graham’s original error. He wanted to do a SELECT INTO and got blindsided by the apparent double identities and didn’t really want OrderID to remain an identity in his temp table anyway. Listing 5 shows how to counteract that problem by using CAST().
Listing 5. Eliminating the identity property in a SELECT INTO.
select cast (o1.OrderID as int) OrderID, identity (int, 1, 1) as SequenceNo, o1.RequiredDate, o1.ShippedDate into #t from Orders o1 where o1.EmployeeID = 9
A word of caution, though: There’s no guarantee that the IDENTITY() function will coordinate with the other columnseven if you specify an ORDER BY. Therefore, this isn’t a reliable solution to Graham’s problem. See KB 273586, “INF: How the IDENTITY Function Behaves When It Is Used in SELECT INTO Queries That Have an ORDER BY Clause and a TOP Operator or a SET ROWCOUNT Statement,” for the details.
With SQL Server 2005 coming up on the horizon, I’d be remiss if I didn’t let you in on a new T-SQL feature that would make this type of problem far less work for you, the coder. Check out Listing 6and make sure you’re sitting down.
Listing 6. Using the ROW_NUMBER() function in SQL Server 2005.
select o1.OrderID, row_number () over (order by o1.RequiredDate) as SequenceNo, o1.RequiredDate, o1.ShippedDate from Orders o1 where o1.EmployeeID = 9 order by o1.RequiredDate
How long have you been waiting for that one, eh? You’ll also find that it will outperform all of the approaches you saw earlier.
Let me conclude by saying that, although this was a fun exercise, sequence numbers really belong in the client codespecifically, the presentation layer. Doing the work there takes the load off of SQL Server and spreads it over hundreds, thousands, or perhaps even millions of machinesand that’s the way it ought to be. Thanks for this one, Graham. Now, I think I’ll look for that remote. See you next month.
Tom Moreau, B.Sc., Ph.D., MCSE, and MCDBA, is an independent consultant specializing in Microsoft SQL Server database administration, design, and implementation and is based in the Toronto area. Tom’s a SQL Server MVP and co-author—with fellow MVP Itzik Ben-Gan—of Advanced Transact-SQL for SQL Server 2000. firstname.lastname@example.org.
To find out more about SQL Server Professional and Pinnacle Publishing, visit their Web site at http://www.pinpub.com/
Note: This is not a Microsoft Corporation Web site. Microsoft is not responsible for its content.
This article is reproduced from the February 2005 issue of SQL Server Professional. Copyright 2005, by Pinnacle Publishing, Inc., unless otherwise noted. All rights are reserved. SQL Server Professional is an independently produced publication of Pinnacle Publishing, Inc. No part of this article may be used or reproduced in any fashion (except in brief quotations used in critical articles and reviews) without prior consent of Pinnacle Publishing, Inc. To contact Pinnacle Publishing, Inc., please call 1-800-788-1900.