No Table? No Problem

Downloads
27051.zip

In many previous articles, I've emphasized the strength of the set-based approach to solving T-SQL problems. For most of the problems I discussed, I provided one or more tables and asked you to write code to produce a certain output. There are two possible approaches to solving problems involving data that's stored in tables: iterative and set-based. The iterative approach uses cursors, temporary tables, and loops to iterate through rows. This approach usually requires lots of code and focuses more on how to get the data than on what data you want.

Conversely, with the set-based approach, you write a query that logically specifies the request, letting SQL Server handle the query implementation. In other words, you specify what you want, and SQL Server decides how to get it. Usually, set-based solutions perform better than the iterative alternatives because SQL Server's query optimizer can choose among several possible execution plans for the same query.

Some T-SQL problems don't necessarily involve data that's stored in tables—rather, just data supplied as arguments. These problems require a third type of solution—one that uses one expression that's based on pure logic. In terms of performance, a single-expression approach is usually superior to both the set-based and the iterative approaches.

Let's look at a problem I call the working-days problem and explore the three types of solutions to it. I'd like to thank SQL Server trainers Dejan Sarka, Bruno Reiter, Luca Bianchi, David Lundell, and Dieter Nöth, who provided different solutions to a similar puzzle I presented in a private trainers forum. I've incorporated their ideas into the solutions I discuss here.

The working-days problem is simply this: Write a function that accepts two dates as input arguments and calculates the number of non-weekend dates between them. (In the United States, for example, weekend dates fall on Saturday or Sunday.) The range is inclusive—that is, it includes the two given input dates. For example, if a job started on Monday, January 6, 2003, and ended on Friday, January 17, 2003, it took 10 working days in total. Before reading the following solutions, try to come up with the most efficient solution you can.

The Iterative Approach

The iterative approach is the simplest to implement but the worst in terms of performance. You create a simple loop that iterates through all dates between the two given dates, incrementing a counter only if the current date doesn't fall on a weekend. Run the script that Listing 1 shows to create the fn_workdays() user-defined function (UDF), which implements an iterative approach.

Note that the fn_workdays() function assumes that DATEFIRST is set to 7 (US English default), in which case the DATEPART() function that takes the weekday argument returns 1 for Sunday and 7 for Saturday. If DATEFIRST is set to a different value in your environment, you need to specify the appropriate values instead of 1 and 7 or neutralize the effect of the DATEFIRST setting by adding @@DATEFIRST to the given date. Setting DATEFIRST to 7 and invoking

DATEPART(weekday, <some_date>)

gives you the same result as invoking

DATEPART(weekday, <some_date> + @@DATEFIRST)

regardless of the DATEFIRST setting.

To check the performance of the fn_workdays() UDF, I provided a range of more than half a million days:

SELECT dbo.fn_workdays
('20000101', '39991231')

This function ran for 4 seconds on my laptop. Usually, you'd invoke the function with much smaller ranges, but providing a large range is somewhat similar to invoking the function with smaller ranges many times. Let's look at an approach that gives better performance.

The Set-Based Approach

Using a set-based approach to solving the working-days problem means writing a query against a table. A common practice among T-SQL programmers is to create an auxiliary Calendar table that stores all possible dates within a certain range and that has a column that specifies the type of day. Usually, such a column stores a certain value (such as w) for a working day and another value (such as h) for a non-working day. In addition, you might want to be able to track different day properties (e.g., weekend, annual holiday, other holiday, strike). If you want to express many combinations of day properties (e.g., weekend plus annual holiday, weekend plus other holiday, non-weekend, strike), you can have the day_type column store a bitmap in which each bit represents a different day property. The smallint data type is sufficient for up to eight types, which is usually more than enough. Here's the code to create a sample Calendar table:

CREATE TABLE Calendar
(
  cday datetime NOT NULL PRIMARY KEY,
-- bit 0(1)=weekend,
-- bit 1(2)=annual holiday, 
-- bit 2(4)=other holiday, 
-- bit 3(8)=strike
  day_type tinyint NOT NULL
)

To populate the Calendar table with sample data, run the code that Listing 2 shows.

You can use the following code to modify the UDF that calculates working days. Here's how the body of the first set-based version should look:

ALTER FUNCTION dbo.fn_workdays
(@d1 AS datetime, @d2 AS datetime)
  RETURNS int
AS
BEGIN
RETURN(SELECT COUNT(*) FROM Calendar
WHERE cday BETWEEN @d1 AND @d2 AND day_type & 1 = 0)
END

To test this version, replace the function from Listing 1 with the one above, specifying ALTER FUNCTION instead of CREATE FUNCTION. The first set-based UDF version runs for 1 second on my laptop given the arguments '20000101', '39991231'.

If you have to work with a Calendar table and you don't have a way to distinguish between the different types of non-working days, you can use the DATEPART() function to determine which dates fall on weekends. Here's the body of a second set-based version of the UDF:

RETURN(SELECT COUNT(*) FROM Calendar
  WHERE cday BETWEEN @d1 AND @d2
    AND DATEPART(weekday, cday + @@DATEFIRST) NOT IN(1, 7))

Using the same arguments as in the previous tests, the second set-based UDF version runs for 2 seconds on my laptop.

If you don't have a Calendar table in your database but you do have an auxiliary table with a large sequence of integers such as the one that Listing 3 creates, you can use the following set-based UDF, which performs about as well as the second one:

-- Assuming an auxiliary table 
-- called Nums exists with a 
-- column called n
RETURN(SELECT COUNT(*) FROM Nums
  WHERE n <= DATEDIFF(day, @d1, @d2) + 1
    AND DATEPART(weekday,
(@d1+n-1) + @@DATEFIRST) NOT IN(1,7))

Now let's examine the single-expression approach to solving the problem.

The Single-Expression Approach

For the single-expression approach, you write one expression that doesn't access any tables to calculate the working days in the given range. You could reasonably assume that, if successful, this approach would perform better than the other two because it doesn't involve iterations or physical I/O (because no data needs to be read from disk). Let's look at two different solutions that use the single-expression approach. The first solution uses the following algorithm to calculate the number of working days between the two given dates @d1 and @d2:

(Number of whole weeks in the range @d1, @d2) * 5 +
  (Number of remaining days in last non-whole week
   that don't fall on Saturday or Sunday)

Note that for the purpose of this algorithm, a whole week starts on the weekday @d1. To calculate the number of whole weeks in the range @d1 to @d2, use the DATEDIFF() function to calculate the number of days in the range and divide the result by 7. Because SQL Server is using integer division, the fraction is truncated. Multiply the resulting number of whole weeks by 5, as the following code shows:

(DATEDIFF(day, @d1, @d2) + 1) /
7 * 5

and you get the number of working days in the whole weeks in the range.

Now you have to calculate the number of working days in the remaining days in the non-whole week. I'm going to cheat a little here and use a set-based operation against a small derived table that I create from scratch. See if you can work out what the following query calculates:

SELECT COUNT(*)
FROM (SELECT 0 AS n UNION ALL
SELECT 1 UNION ALL SELECT 2
UNION ALL SELECT 3 
UNION ALL SELECT 4
  UNION ALL SELECT 5) AS RW
WHERE n < (DATEDIFF(day, @d1, 
    @d2) + 1) % 7
  AND DATEPART(weekday, @d2 - n + @@DATEFIRST) NOT IN(1, 7))

Of the 6 values (0-5) in the RW derived table, the first logical expression in the WHERE clause uses the modulo operator (%) as a filter to give you only the values that are less than the number of days in the remaining non-whole week. For example, given the range '20030101', '20030110', which spans 10 days, the first logical expression in the WHERE clause can be simplified to n < 10 % 7, or n < 3. Only three rows qualify (n = 0, 1, 2). The second logical expression further filters the result by subtracting n from @d2 and checking that the result doesn't fall on Saturday or Sunday. Add the number of working days in the remaining non-whole week to the number of working days in the whole weeks, and you get the desired result. Listing 4 shows the first UDF version that uses the single-expression approach.

The second version of the UDF that uses the single-expression approach is even simpler. It relies on the fact that the DATEDIFF(week, @d1, @d2) function returns the number of week boundaries crossed between @d1 and @d2, considering Saturday as the last weekday and Sunday as the first, regardless of the DATEFIRST setting. Let wb equal the number of week boundaries crossed between @d1 and @d2. Logic says that the range @d1 to @d2 has at least 2*wb days that fall on Saturday and Sunday, plus another one if @d1 falls on Sunday and another one if @d2 falls on Saturday.

If you translated this logic to T-SQL, the body of the second UDF version that uses the single-expression approach would look like this:

RETURN(SELECT
  DATEDIFF(day, @d1, @d2) + 1 -
  2 * DATEDIFF(week, @d1, @d2) -
  CASE WHEN 
DATEPART(weekday,@d1+@@DATEFIRST)=1 THEN 1 ELSE 0 END -
  CASE WHEN 
DATEPART(weekday,@d2+@@DATEFIRST)=7 THEN 1 ELSE 0 END)

Both single-expression versions of the UDF complete in a fraction of a second regardless of the date range's size, demonstrating that this approach is superior to all others.

Test Your Skills

When you need to perform calculations that don't necessarily involve accessing data that's stored in tables, take the logical path to a solution that doesn't use iterations. To test your skills, see if you can devise a single-expression approach to solving the following problem. Given the table

CREATE TABLE Strnums
( sn VARCHAR(100) NOT NULL CHECK'
(/* your expression goes here */) )

write a logical expression for the CHECK constraint that allows only strings made up of digits in the sn column. Note that users might attempt to enter strings containing characters other than digits and the Latin characters a through z. You need to take all possible characters, including special ones, into consideration. For example, of the values 1234567890, 1a, $123, 12*34, 12,345, 1E3, and 12.34, only the first should be able to enter the table. The Web-exclusive sidebar "Solution to CHECK Constraint Puzzle" holds the answer to this challenge; just enter InstantDoc ID 27136 to check your solution against mine.

Comments

Plain text