User-defined functions (UDFs) are powerful tools when used properly, but they can introduce inefficient row-by-row processing into an otherwise elegant set-based solution when used improperly. Unfortunately, many people don't realize the potential pitfalls of using UDFs.
Most experienced SQL Server professionals know that ANSI SQL cursors create slow, inefficient T-SQL code. Cursors have valid uses, but they're row-by-row operations, which are inefficient compared with set-based operations. Most of us try to avoid T-SQL cursors and instead use set-based operations. However, few people I speak with understand the subtle way that UDFs might cause a set-based operation to take on row-by-row processing characteristics—including the associated row-by-row performance problems.
For example, imagine a scenario in which you have
- an Employee table containing 100,000 rows
- a Department table containing 50 distinct values
- a ranking system that assigns an employee "annual review grade" that's derived from data in other database tables
Imagine that your boss wants you to write a query that returns the average annual review grade for each department. Writing the query would be simple if AnnualReviewGrade were a column in the table. The query might look something like this:
SELECT DepartmentId ,avg(AnnualReviewGrade) AvgGrade FROM employee GROUP BY DepartmentId
But in our example, the annual review grade information isn't stored as a column. The AnnualReviewGrade calculation is a task that developers might need to perform in multiple pieces of code. Writing a join to get the information would be complicated, so the lead developer decides to write a UDF called GetAnnualReviewGrade that accepts an EmployeeId and returns the grade. You can now write the query for the average annual review grade as
SELECT DepartmentId ,avg(dbo.AnnualReviewGrade(EmployeeId) AvgGrade FROM employee GROUP BY DepartmentId
Now, let's think through the row-by-row processing implications of the UDF GetAnnualReviewGrade. Imagine that the UDF requires 15 logical I/Os to process, which might not seem bad. But remember that the UDF will be executed once for each row that needs to be evaluated. In this case, we'll be running the UDF once for each employee—100,000 times. That means the UDF alone adds 1.5 million logical I/Os to the processing cost of the query. In contrast, deriving the AnnualReviewGrade value for each employee by using a join or subquery might add just 5000 logical I/Os to the query. The UDF suddenly seems expensive. I've seen similar cases in which a query's processing time dropped from 15 or 20 seconds to less than 500ms when a developer replaced a complex UDF with join processing. Yes, the queries became more complex, and developers might have to code the business logic in more than one place. But dropping 15 to 20 seconds from a query's execution time might be worth the cost.
The problem with this UDF seems obvious. However, real-world problems are typically more difficult to spot, and you can usually see them only after you move code from development to production. The UDF that worked for a 1000-row result set in development might become a performance pig on a 1 million-row production result set. Replacing UDF logic with joins (and other set-based techniques) after the code is in production can be difficult and costly if the development team has used UDFs extensively.
I'm not saying that UDFs are necessarily bad. They're powerful T-SQL tools that I use regularly. However, I encourage you to think through how your code will use the UDF, paying close attention to the number of rows that might run through it in a query. UDFs might seem like a simple way to write set-based T-SQL code. However, you could open a row-by-row can of worms if you're not careful.