Custom Aggregations: User-Defined Aggregates

Downloads
49983.zip

In this series of articles, I'm discussing various classes of custom-aggregation solutions. SQL Server's built-in aggregate functions (e.g.,AVG, COUNT, MAX, MIN, SUM) are admittedly useful, but what if you need other types of aggregations that SQL Server doesn't provide? In the series' first two articles—"Custom Aggregations: The CursorTechnique" (http://www.sqlmag.com, March 2006, InstantDoc ID 49038) and "Custom Aggregations: The Pivoting Technique" (May 2006, InstantDoc ID 49675), I discussed cursor-based and pivot-based solutions.This month, I discuss solutions that utilize a new capability in SQL Server 2005: using .NET code to develop your own user-defined aggregates (UDAs).

As with the rest of the columns in this series, I'm using string concatenation as my example of a custom aggregation task. Here's a quick summary of the two sample concatenation tasks featured in the March and May columns:The first task is to use the Northwind database to return each customer and, for each, a concatenated list of order IDs expressed as a comma-separated list of values and sorted by order ID, as Table 1 shows.The second task is to aggregate distinct, concatenated employee IDs for each customer, as Table 2 shows.

Remember that we're evaluating the usefulness of each solution based on several factors: Is the solution generic? Does the solution support an unlimited number of elements in a group, and does it have any special limitations? Can you apply the aggregation to the elements in a desired order? How well does it perform? Finally, is the solution set-based or is it iterative/procedural? Let's first examine custom aggregate solutions based on UDAs, then evaluate the solution based on the aforementioned criteria, as I did with the cursor-based and pivot-based solutions.

UDA-Based Custom Aggregates

SQL Server 2005 introduces .NET integration. You can now develop various components in SQL Server, based on the Common Language Runtime (CLR). These include UDAs, user-defined types (UDTs), user-defined functions (UDFs), stored procedures, and triggers.The focus of this article is UDAs. Note that you can develop UDAs only with the CLR; SQL Server doesn't support the development of UDAs with TSQL code. However, after you deploy the UDAs in SQL Server, you can use them just like any other built-in aggregate function in T-SQL.

You can use any development environment you want to develop UDA code—even Notepad. However, using Visual Studio 2005 Professional Edition or later is more convenient because it's a rich development environment that lets you debug your code. This product also lets you easily deploy your code in SQL Server. In this article, I assume that you'll use Visual Studio 2005 Professional Edition or later. In doing so, I'll point out the steps that differ in the development and deployment process if you're using an edition of Visual Studio 2005 other than the Professional Edition (e.g., Standard).

Listing 1 contains the definition of the StringConcat function in C# code. In Web Listing 1 (http://www.sqlmag.com, InstantDoc ID 49983), you'll find the definition of the function in Visual Basic code.The code in both functions is similar; the only difference is the language of choice.As I describe the functions' logic, you can examine either of the listings, based on your language of preference.

First, I'll provide a brief description of the code, and second, I'll provide instructions for how to deploy and use the function in SQL Server.The code starts with declarations of namespaces (i.e., with the using statements). The purpose of these declarations is merely to allow for shorter naming of elements later in the code. Following the namespaces declaration, you'll find attribute definitions. Attributes help Visual Studio in the deployment process and help SQL Server optimize your code.

The Serializable attribute, which is mandatory for UDAs, means that the class field values will be serialized (persisted); thus, SQL Server will be able to access intermediate results. If you're using exclusively .NET value types, you can let SQL Server implement its own serialization (i.e., native serialization). However, if you use reference types (e.g., the string type, as in our example), you'll have to define your own serialization (i.e., user-defined serialization). To do so, you implement the IBinarySerialize interface, providing your own definitions of the interface's Read and Write methods, as I'll describe later. The SqlUserDefinedAggregate attribute defines various UDA properties, including the following:

Format.UserDefined—the function uses user-defined serializatiom
IsInvariantToDuplicates—the false value means that duplicates matter
IsInvariantToNulls—the true value means the code doesn't care about NULLs
IsInvariantToOrder—the false value means that order matters; SQL Server 2005 currently doesn't let you control the order of the computation and therefore ignores this attribute
IsNullIfEmpty—the false value tells the code not to yield NULL if the input is an empty set of values
MaxByteSize—the 8000 value sets the maximum size in bytes to 8000

Following the attribute definitions is the header of the structure (or class), which defines the UDA. This sample code names the structure StringConcat and specifies that the structure implement the IBinarySerialize interface, for the aforementioned reasons.

After the structure's header is its body. It starts with a definition of the sb variable (i.e., the StringBuilder object). This variable will contain the intermediate value of the concatenated string.The rest of the code in the structure's body implements different methods, including the following:

Init—This method initializes the computation and is invoked once per group that the query processor aggregates. In our case, the Init method initializes the variable sb with a new StringBuilder object, which in turn is initialized with an empty string.
Accumulate—The query processor uses this method to accumulate the aggregate values.The method is invoked once per value in the group that's being aggregated. In the case of this code, the method ignores (skips) NULL values, and appends known values to sb.
Merge—SQL Server might decide to perform multiple partial aggregations; each partial aggregation will get a separate subset of the input set of values from the group.This method merges another instance of the aggregate value with the current aggregate value.You use it to merge multiple partial computations of an aggregation. In this example, the Merge method simply appends the input string to the current string.
Terminate—This method completes the computation and returns the result of the aggregation. In this example, it returns a new SqlString object, which is initialized with the final concatenated string (sb.ToString()).
Read and Write—Remember that you need to define these methods because our function uses user-defined serialization and must therefore implement the IBinarySerialize interface.The implementation of the methods is simple in this example.The Read method returns a StringBuilder object that is initialized with the ReadString method (applied to the input value r).The Write method truncates the string to 8000 bytes (4000 characters) in case it requires truncation. Unfortunately, UDAs don't support strings larger than 8000 bytes.

The first four of the preceding methods—Init, Accumulate, Merge, and Terminate—are mandatory for all UDAs. The last two—Read and Write—are required when using user-defined serialization, as in the case of our example.

Putting the UDA to Use

So far, I've briefly described the code for defining the UDA. Next, I provide you with the instructions for deploying and using the function. First, run the following code in SQL Server to create a test database calledTestUDA, in which you'll later deploy the function:

USE master; 
GO 
IF DB_ID('TestUDA') IS NOT NULL 
  DROP DATABASE TestUDA; 
GO 
CREATE DATABASE TestUDA; 
GO 
USE TestUDA;

You implement the next steps through Visual Studio 2005.

Assuming you're using Visual Studio 2005 Professional Edition or later, create a new C# (or Visual Basic) project, using the Database, SQL Server Project template. Note that if you're using a version of Visual Studio other than the Professional Edition (e.g., Standard), you'll need to use the Class Library project template.
In the New Project dialog box, name the project ConcatString, specify C:\CLR as the target location (or another folder of your choice), and confirm.
If you used the SQL Server Project template, the Add Database Reference dialog box will appear. Create a new database reference (by clicking Add New Reference) to the TestUDA database, and choose it. If a dialog box asking whether you want to enable SQL/CLR debugging appears, click No. (I won't be discussing debugging in this article.)
Add an Aggregate item (by clicking Project, Add Aggregate), and name it StringConcat.cs if you used C# code or StringConcat.vb if you used Visual Basic code. If you used the Class Library project template, simply rename the file Class1.cs (or Class1.vb) to StringConcat.cs (or StringConcat.vb).
Overwrite the new file's code with the code from Listing 1 or Web Listing 1 (depending on your language of choice).
Deploy the solution to SQL Server by choosing Build, Deploy ConcatString. If you used the Class Library project template, you'll need to first build the solution (through the Build, Build StringConcat menu item), then use the appropriate TSQL code (i.e., the CREATE ASSEMBLY and CREATE AGGREGATE commands) to deploy it manually in SQL Server.

The deployment process creates an assembly in a .dll file on disk, loads the intermediate language (IL) code from the file to the Test-UDA database, and registers the StringConcat function in the database.The .dll file is no longer necessary because the IL code is now an integral part of the database. At this point, the function is ready for use. Now, you can switch to SQL Server and implement the solution.

Remember that you use a UDA just as you would use any built-in aggregate function, so I don't really have much to show you.You can simply start using it.As a simple example, the following query returns a concatenated list of customer IDs for each employee, using a semicolon as a separator:

USE Northwind; 
SELECT EmployeeID, 
   TestUDA.dbo.StringConcat 
   (CustomerID + N';') AS ArrCusts 
FROM dbo.Orders 
GROUP BY EmployeeID;

Table 3 shows the code's output.The code uses the fully qualified function name Test-UDA.dbo.StringConcat because the function was created in TestUDA; the database context of this code is Northwind.

Note that if an employee has handled more than one order for a customer, the customer ID will appear more than once in the employee's result string. If you want to concatenate only distinct customer IDs for each employee, use StringConcat( DISTINCT CustomerID + N';'), as you would with built-in aggregate functions. Also, remember that UDAs don't let you control the order of computation; therefore, you can't guarantee the order of concatenation.

Going back to the two tasks that I presented earlier:The first task is to concatenate all order IDs per customer, separated by commas, and present the output that Table 1 shows. Because OrderID is an integer, you'll need to convert it to a character string.Also, you don't want a leading or terminating comma, so you can add a comma in front of every OrderID, and use the STUFF function to get rid of the first comma, as follows:

SELECT CustomerID, 
   STUFF( 
TestUDA.dbo.StringConcat( 
   N',' + CAST(OrderID AS 
      NVARCHAR(10))), 
1, 1, N') AS ArrOrders 
FROM dbo.Orders 
GROUP BY CustomerID;

The second task is to concatenate the distinct employee IDs per customer, separated by commas, and produce the output that Table 2 shows.You use a similar solution to the previous one, with the addition of the DISTINCT keyword in the aggregate function to take only distinct employee IDs into consideration:

SELECT CustomerID, 
  STUFF( 
  TestUDA.dbo.StringConcat(
   DISTINCT N',' + CAST(EmployeeID 
   AS NVARCHAR(10))),
  1, 1, N') AS ArrEmps 
FROM dbo.Orders 
GROUP BY CustomerID;

Evaluating the UDA-Based Solution

Finally, to evaluate the UDA-based solution, let's go over the factors that I'm using to evaluate all custom aggregate solutions. One advantage of this solution is that it's generic. You can treat it as a template and use similar code when you need to implement other custom aggregates.The solution is fast because it doesn't involve a lot of I/O; the source data is scanned only once. And even though the UDA code is developed with a .NET language, you use it like any other aggregate function in a simple, set-based manner.

A disadvantage of this solution is that UDAs are limited to returning character strings of up to 8000 bytes long. Consequentially,-your solution is limited in terms of the number of elements that can be concatenated. Still, 8000 bytes might be sufficient in many cases, and it's not as limited as we saw with the pivoting techniques that I discussed last month. Another disadvantage is that you can't apply the aggregation to the elements in a desired order. Even though UDAs have an attribute called IsInvariantToOrder, this attribute is currently ignored.Also, implementing UDAs requires that you gain a new skill-set: developing in .NET.

Join me next month for the conclusion of this four-part series. I'll discuss specialized solutions, the fourth and final class of solutions to custom aggregates. Meanwhile, I'd like to thank SQL Server MVP Dejan Sarka for his contributions to this article. The CLR code herein is based on Dejan's code, with some revisions.

Comments

Plain text