Programming SQL in a Set-Based Way

As T-SQL programmers, we always hear that the SQL language is optimized for set-based solutions rather than procedural solutions, but we seldom see examples from that perspective. Consequently, many beginning SQL programmers don’t have a clear understanding of what set-based means in terms of the code they need to write to solve a specific problem.

Even for those who understand the concept, there are many programming problems for which a set-based solution seems impossible. Sometimes that's true. It's not always possible to find a set-based solution, but most of the time we can find one by using a little creative thinking. A good SQL programmer must develop the mental discipline to explore set-based possibilities thoroughly before falling back on the intuitive procedural solution.

In this article, I provide a relatively simple example that illustrates how to think in a set-based way about a common type of problem that also has an intuitive procedural solution.

The Business Case

When you visit the doctor’s office, the first thing the nurse does is put you on a scale, record your weight, and check your height. Checking your weight makes sense from a medical point of view, but have you ever wondered why the nurse records your height each time? Unless you're very young, your height hasn’t changed since your last visit and isn't likely to change again.

The reason the nurse checks your height is to guard against identity theft. Health care providers want to make sure that the services they provide are going to the person who gets the bill—not to an imposter with a forged identity card.

This kind of identity theft happens more frequently than you might think. HIPAA regulations now require an audit of changes in permanent physical characteristics in a patient’s history that might suggest identity theft.

Querying this kind of information provides a good example for comparing procedural thinking and set-based thinking when programming in SQL.

The Problem Statement

The general programming problem is one in which the solution depends on the order of the rows and requires comparing values in the current row with values in previous rows. This is a type of problem in which the procedural solution is intuitive, but the set-based solution isn't so obvious.

In this particular problem, we're looking for rows where the height differs from the height recorded at the same patient’s previous visit. We want to return the patient’s unique medical record number, the date the change occurred, what the height was changed from, and what the height was changed to. We don't want to return any records that don't mark a change in height.

Listing 1 gives you the code to create and populate the tables in this example, if you'd like to run the example yourself.

Listing 1: Creating and populating the tables

We use the AdventureWorks sample database to create the tables for our test, but you can use another database by changing the USE statements in all three listings.

USE AdventureWorks;

SET NOCOUNT ON;

CREATE TABLE Dates
(ID int, VisitDate datetime);

--populate table with 20 visit dates
DECLARE @i int, @startdate datetime;
SET @i = 1;
SET @startdate = GETDATE();

WHILE @i <= 20
BEGIN
    INSERT Dates
    (ID, VisitDate)
    VALUES (@i, @startdate);
    SET @startdate = DATEADD(dd, 7, @startdate);
    SET @i = @i + 1;
END

CREATE TABLE PatientHeight
(PatientID  int not null
,Height int);

-- populate table with 10,000 patientids with heights between 59 and 74 inches
SET @i = 1;

WHILE @i <= 10000
BEGIN
    INSERT PatientHeight
    (PatientID, Height)
    VALUES (@i, @i % 16 + 59);
    SET @i = @i + 1;
END

ALTER TABLE PatientHeight ADD CONSTRAINT PK_PatientHeight
    PRIMARY KEY(PatientID);

-- cartesian join produces 200,000 PatientVisit records

SELECT 
    ISNULL(PatientID, -1) AS PatientID, 
    ISNULL(VisitDate, '19000101') AS VisitDate,
    Height
INTO PatientVisit
FROM PatientHeight
CROSS JOIN Dates;

ALTER TABLE PatientVisit ADD CONSTRAINT PK_PatientVisit
    PRIMARY KEY(PatientID, VisitDate);

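-- create changes of height: for every seventh patient, add 2 inches
-- to the height recorded on one visit date
SET @i = 3;

WHILE @i < 10000
BEGIN
    UPDATE pv
    SET Height = Height + 2
    FROM PatientVisit pv
    WHERE PatientID = @i
    AND pv.VisitDate =
        -- the modulo yields 0 through 18; a result of 0 matches no row
        -- in Dates, so no update occurs for that patient
        (SELECT TOP 1 VisitDate
         FROM Dates
         WHERE ID = ABS(CHECKSUM(@i)) % 19);
    SET @i = @i + 7;
END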

/*
-- return AdventureWorks to its previous state when you are finished
-- with this example.

DROP TABLE Dates;
DROP TABLE PatientHeight;
DROP TABLE PatientVisit;
*/
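
After running Listing 1, a quick check such as the following sketch can confirm that the tables were populated as intended. The column aliases are illustrative, and the expected counts in the comments follow from the loops in Listing 1 (10,000 patients and 20 visit dates), assuming the script ran once against tables that did not already exist.

```sql
-- Sanity check for the test data created by Listing 1.
-- Aliases (VisitRows, Patients, VisitDates) are illustrative names only.
SELECT COUNT(*) AS VisitRows                 -- 10,000 patients x 20 dates = 200,000
,COUNT(DISTINCT PatientID) AS Patients       -- 10,000
,COUNT(DISTINCT VisitDate) AS VisitDates     -- 20
FROM PatientVisit;
```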

A Procedural Approach

The intuitive, procedural way to attack this problem is to order the records by patient and visit date, then loop through the records for each patient one row at a time. We query the first record for the patient and save the patient’s original height in a variable. Then, we loop through subsequent records for the patient, comparing height values. If we find that the height is different on a subsequent record, we write an audit record, update the height variable with the current value, and continue looping through the rows. Then we move to the next patient.

Listing 2 contains the code for the cursor-based solution. The cursor method works, but it's very inefficient. It could pose a serious performance problem when working with a large number of rows. How can we do this in a set-based and presumably more efficient way?

Listing 2: The cursor-based solution (USE AdventureWorks)

CREATE TABLE #Changes
(PatientID     int
,VisitDate     datetime
,BeginHeight   smallint
,CurrentHeight smallint);

DECLARE @PatientID     int
,       @CurrentID     int
,       @BeginHeight   smallint
,       @CurrentHeight smallint
,       @VisitDate     datetime;

SET @PatientID = 0;

DECLARE Patient_cur CURSOR FAST_FORWARD FOR
SELECT PatientID
, VisitDate
, Height
FROM PatientVisit
ORDER BY PatientID
,VisitDate;

OPEN Patient_cur;

FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- first record for this patient
    IF @PatientID <> @CurrentID
    BEGIN
        SET @PatientID = @CurrentID;
        SET @BeginHeight = @CurrentHeight;
    END

    IF @BeginHeight <> @CurrentHeight
    BEGIN
        INSERT #Changes (PatientID, VisitDate, BeginHeight, CurrentHeight)
        VALUES (@PatientID, @VisitDate, @BeginHeight, @CurrentHeight);

        SET @BeginHeight = @CurrentHeight;
    END

    FETCH NEXT FROM Patient_cur INTO @CurrentID, @VisitDate, @CurrentHeight;
END

CLOSE Patient_cur;
DEALLOCATE Patient_cur; 

SELECT * FROM #Changes;

DROP TABLE #Changes;

A Set-Based Approach

The difference between a procedural and set-based solution boils down to the way you define the problem. Stated in its simplest form, the change we're interested in involves only two records: two consecutive visits by the same patient. Everything else is irrelevant.

We start by ordering the data by the patient’s ID number and then by visit date. In that way, the records of consecutive visits by the same patient are adjacent to each other. The problem is then reduced to finding a way to join consecutive records from this set.

When we understand the problem in that way, the solution isn't so difficult to discover. We need to create a sequence number for the sorted rows that can be used to join one record with the next in a self-join.

We can create a common table expression (CTE) populated with patient data sorted by PatientID and VisitDate, adding a sequential ID using the ROW_NUMBER() function.

We can then self-join the CTE like this:

… from CTE t1
join CTE t2 on t2.ROWID = t1.ROWID + 1…

This produces a set of records that represents every possible opportunity for the value of the patient’s height to change: each row in the result combines the data from one pair of consecutive records in the original data set.
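
To make the pairing concrete, here is a small sketch that runs the same self-join against a hypothetical six-row table variable (two patients, three visits each). The table variable @Visit, the CTE name V_RN, the dates, the heights, and the column aliases are all invented for illustration and are not part of the test data in Listing 1.

```sql
-- Hypothetical miniature data set: two patients, three visits each.
DECLARE @Visit TABLE
(PatientID int      NOT NULL
,VisitDate datetime NOT NULL
,Height    int      NOT NULL
,PRIMARY KEY (PatientID, VisitDate));

INSERT @Visit (PatientID, VisitDate, Height) VALUES (1, '20240101', 70);
INSERT @Visit (PatientID, VisitDate, Height) VALUES (1, '20240108', 72);
INSERT @Visit (PatientID, VisitDate, Height) VALUES (1, '20240115', 70);
INSERT @Visit (PatientID, VisitDate, Height) VALUES (2, '20240101', 65);
INSERT @Visit (PatientID, VisitDate, Height) VALUES (2, '20240108', 65);
INSERT @Visit (PatientID, VisitDate, Height) VALUES (2, '20240115', 65);

-- Number the rows in patient/date order, then pair each row with the next one.
WITH V_RN AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID
    ,PatientID
    ,VisitDate
    ,Height
    FROM @Visit
)
SELECT t1.ROWID   AS PrevRow
,t2.ROWID         AS NextRow
,t1.PatientID     AS PrevPatient
,t2.PatientID     AS NextPatient
,t1.Height        AS PrevHeight
,t2.Height        AS NextHeight
FROM V_RN t1
JOIN V_RN t2 ON t2.ROWID = t1.ROWID + 1
ORDER BY t1.ROWID;
```

Six numbered rows yield five consecutive pairs. One of those pairs joins the last visit of patient 1 to the first visit of patient 2, which is why the final query must also require that both rows belong to the same patient.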

At this point, filtering out the records that don't represent a change is trivial. We simply review our statement of the problem: To qualify as a record of interest, the patient must be the same in consecutive visits but the two heights must be different. Listing 3 contains the code that implements this set-based method.

Listing 3: The set-based solution (USE AdventureWorks)

WITH PV_RN AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID, *
    FROM PatientVisit
)
SELECT t1.PatientID
,t2.VisitDate AS DateChanged
,t1.Height AS HeightChangedFrom
,t2.Height AS HeightChangedTo
FROM PV_RN t1
JOIN PV_RN t2 ON t2.ROWID = t1.ROWID + 1
WHERE t1.PatientID = t2.PatientID
AND t1.Height <> t2.Height
ORDER BY t1.PatientID, t2.VisitDate;
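
As an aside, later versions of SQL Server offer another way to express the same consecutive-row comparison. The following sketch is an alternative to Listing 3, not the method measured in this article. It assumes SQL Server 2012 or later, where the LAG window function is available, and the names PV_Prev and PrevHeight are invented for illustration.

```sql
-- Alternative sketch, assuming SQL Server 2012 or later (LAG is unavailable before that).
-- PV_Prev and PrevHeight are illustrative names.
WITH PV_Prev AS
(
    SELECT PatientID
    ,VisitDate
    ,Height
    -- height from the same patient's previous visit; NULL on the first visit
    ,LAG(Height) OVER (PARTITION BY PatientID ORDER BY VisitDate) AS PrevHeight
    FROM PatientVisit
)
SELECT PatientID
,VisitDate  AS DateChanged
,PrevHeight AS HeightChangedFrom
,Height     AS HeightChangedTo
FROM PV_Prev
WHERE Height <> PrevHeight   -- a NULL PrevHeight never satisfies <>, so first visits drop out
ORDER BY PatientID, VisitDate;
```

Because the window is partitioned by PatientID, no separate same-patient test is needed. Whether this form performs better than the self-join on a given system is something to measure rather than assume.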

Relative Performance of the Two Methods

In Listing 1, we created the PatientVisit table and populated it with 200,000 records containing the PatientID, VisitDate, and the Height recorded for that visit. The table contains about 2,600 records that represent a change in height for a patient.

We used SQL Profiler to capture execution statistics for the two methods. First, we flushed the buffers to get cold execution statistics; then we re-ran each query to get hot execution statistics after the data was in cache. Both the cursor and the set-based code returned identical results. Table 1 shows the execution statistics for each. Notice the huge difference in logical reads: the cursor solution performs roughly 160 times as many reads as the set-based solution, a difference that can be a showstopper in many situations. CPU and Duration are roughly eight times as high in the cursor solution.

Method      Execution   Duration   Reads     CPU
Set-Based   Cold        503        1298      515
Cursor      Cold        4090       203646    3931
Set-Based   Hot         476        1248      484
Cursor      Hot         3958       203728    3713

Table 1: Execution Statistics
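
Readers without SQL Profiler can approximate the cold and hot measurements from a query window. The following is a sketch, not the procedure used to produce Table 1. It assumes a test server where clearing the buffer cache is acceptable and that the tables from Listing 1 exist; the figures reported by SET STATISTICS might not match Profiler's exactly.

```sql
-- Run only on a test server: these commands empty the buffer cache for the whole instance.
CHECKPOINT;               -- write dirty pages to disk so they can be dropped from cache
DBCC DROPCLEANBUFFERS;    -- remove clean pages from the buffer pool (cold cache)

SET STATISTICS IO ON;     -- report logical and physical reads for each statement
SET STATISTICS TIME ON;   -- report CPU time and elapsed time for each statement

-- Cold run of the set-based query from Listing 3.
-- For the hot run, execute this statement again without clearing the cache.
WITH PV_RN AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY PatientID, VisitDate) AS ROWID, *
    FROM PatientVisit
)
SELECT t1.PatientID
,t2.VisitDate AS DateChanged
,t1.Height AS HeightChangedFrom
,t2.Height AS HeightChangedTo
FROM PV_RN t1
JOIN PV_RN t2 ON t2.ROWID = t1.ROWID + 1
WHERE t1.PatientID = t2.PatientID
AND t1.Height <> t2.Height
ORDER BY t1.PatientID, t2.VisitDate;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The same wrapper can be placed around Listing 2. In that case the statistics are reported separately for each statement the cursor loop executes, so the per-statement figures have to be added up before comparing them with the set-based query.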

The auditing requirements for a large healthcare provider can easily generate a million rows per day in the audit table. So, even if you run your audit reports for only a single day’s data, you'll have a lot of rows to process—far too many for a cursor or other looping mechanism to handle efficiently.

Set-Based Thinking

Note that the more efficient solution operates on whole sets of data, not on the individual rows. Compare this with the cursor solution, in which operations are repeated for each row in a set.

Nothing in this simple example is rocket science. You'll encounter SQL problems that are much more difficult to solve in a set-based way and some that are impossible. However, even this example requires a significant mental adjustment for programmers new to SQL programming. It requires a conscious effort to pull yourself out of your comfort zone and think in a new way. Even in the most difficult situations, don’t give up on a set-based solution until you've given it a fair amount of thought.
