Several years ago, when we were developing an application for a customer, as we designed the database schema we kept encountering the same problem: hierarchical data structures. Forty percent of our data types were hierarchical (e.g., product structure, customers, geographical territories, sales organization structure, folders). Because of the hierarchical data structures, we experienced "table creep": We needed so many tables that we couldn't fit them all on our drawing board. Our solution was to devise a pattern for a hierarchical data table and use it throughout the application. This pattern not only helped reduce the number of tables and therefore reduce complexity but also made our application more flexible. You can use our method to simplify the database structure of your own SQL Server 2000 database applications.
Think of storing products and product lines in a relational database. Typically, you'd use a two-tier table hierarchy similar to the one that Figure 1 shows. Let's say management asks you to modify the database so that salespeople can enter information about competitors' products as well as your company's.
To fulfill this request, you'd change the database schema so it looks the one that Figure 2 shows. Soon after you make this change, management asks you to add a Brand level between the Manufacturer and ProductLines levels so that you can see how each of your brands is faring. You can see where this is leading: table creep.
The problem with this method is its inflexibility: It's difficult to adapt the database structure to changing needs. The method's advantage is that it lets you easily group similar types of data in the same level of the hierarchy.
A more flexible method would be to store hierarchical data in a self-referencing table, like the one that Figure 3 shows. Now you can add new hierarchy levels anytime. In fact, you can store any "depth" of data in this structure. The catch to this method is that selecting records at a level within the hierarchy is cumbersome. For example, to select all the records at the Products level, you must select the top-level records (Manufacturer level), select its children (ProductLines level), and select the previous table's children (Products level). How can you determine whether a given record is from the Products or ProductLines table? Painful, isn't it? But we found a better way to manage the huge number of tables.
A Better Idea
Our better idea combines the first method's ease of use with the second method's flexibility. You simply store the level of the node (i.e., its distance from the root) in each record. In this way, you can easily select all records on the same level (Products or ProductLines) and also maintain the flexibility in terms of depth. Figure 4 shows this type of structure.
Assuming you have a Manufacturer-ProductLines-Products hierarchy, to select all Products, for example, you'd simply use the statement
SELECT * FROM Products WHERE HierarchyLevel = 3
If you create views (for Manufacturers, ProductLines, and Products) that retrieve all the records on a given level, you can reproduce the same behavior as the original three-table version.
For this method to work, you have to keep the HierarchyLevel column updated. Whenever a new record is inserted, you must set its HierarchyLevel based on its parent. We took another approach, though. In our case, the Products table was populated by a rather complex batch import, which we didn't want to pollute with the update. Also, the table was small, so we updated the HierarchyLevel when the import was complete. Here's how we did it.
Assume that we created the Products table by using the following statement (we've omitted integrity constraints to keep the example simple):
CREATE TABLE Products( Id int identity(1,1), ParentId int, HierarchyLevel tinyint, Name nvarchar(50) )
Because we can't assume any given depth, the update of HierarchyLevel has to be recursive. SQL Server 2000 doesn't handle recursions well, so we'll implement the HierarchyLevel update as an iteration, as the code in Listing 1 shows. (SQL Server 2005 will support recursive queries; we'll provide examples of using recursion shortly.)
The code in Listing 1 first sets the HierarchyLevel of all records to null, then sets the HierarchyLevel of the top-level rows (the ones without a parent) to 1. After that, the code sets the Hierarchy Level of the direct descendants of the top-level records to 2, and so on. Again, HierarchyLevel values help keep track of what's happening: For the rows whose levels are already set, level information is available, and we can always filter the records for the lowest known level by filtering records whose Hierarchy Level equals the value stored in @Current HierarchyLevel. Next, the code increments @CurrentHierarchyLevel and updates the next set of rows. The iteration stops when no further records can be updated.
You can apply this pattern to any and all your hierarchical tables. Doing so offers two main benefits:
- A programmer can easily remember the tables' usage.
- You can implement HierarchyLevel updating just once as a stored procedure.
The code in Listing 2 shows a stored procedure for updating the HierarchyLevel that works on any table. The stored procedure takes as parameters the table name and ID column name. You can specify additional parameters to designate the names of other fields.
Using SQL 2005 Recursive Queries
SQL Server 2005 will support recursive queries through Common Table Expressions (CTEs). The code in Listing 3 uses a CTE to query a hierarchical table. The code in Listing 4 uses CTEs to update the original table.
Listings 1 and 4 use different methods to do the same thing: Update a hierarchical table's records. We'll soon compare them side by side, but first let's look at Listing 3 and see how to structure a SQL Server 2005 recursive query. The SELECT statement at the end of the script is deceptively simple because of the CTE that it uses. The first part of the CTE describes the "anchor": an initial set of records to start the recursion with. This part executes only once. The second is the recursive part of the CTE, which references the CTE itself by joining the Customers table to HCustomersCTE. We already have records in the CTE result set in one round of the recursion, and we add all their children in the next round. Behind the scenes, SQL Server acts the same way it does in Listing 1: It keeps iterating the CTE as long as new records are added.
Now let's look at Listing 4. Here we implement the same update behavior as in Listing 1, but this time, we use a recursive query. Note the corresponding parts in both scripts: the anchor definition and the iteration/recursion steps. They're all simple SELECT statements. However, the UPDATE statement is more complex than the SELECT in Listing 3. In effect, the CTE is a recursive SELECT that provides HierarchyLevel values for each record in the Customers table, then joins the CTE to the original Customers table. Updating happens outside of the recursive part of the code.
Tips and Tricks
If your tables include a parent-child relationship that's described with multiple fields, you can slightly modify the stored procedure in Listing 2 to accommodate this situation—as long as all your tables use the same number of fields to describe this relationship. If your tables use different numbers of fields to describe the parent-child relationship, you'll have to implement the update procedure table by table.
As a best practice, you should have an ID field in a hierarchical table. In fact, we used an ID field in all our tables that other tables could reference. Doing so makes query writing easier. Additionally, if someday you have to transfer this data to OLAP and these tables become dimensions, SQL Server Analysis Services will easily import this type of hierarchy as a parent-child relationship.
You can use the hierarchical table pattern to handle challenges such as unbalanced table hierarchies, ambiguous levels, and having to store different attributes at different levels of the hierarchy. Of the three table structures we described earlier, our hierarchical table pattern is the only one you can use to accommodate an unbalanced hierarchy because you can't create separate tables to store specific levels.
Unbalanced hierarchy. Say that you have a table called Positions, which describes an organization's structure. The Positions table hierarchy includes a CEO, directors, managers, and secretaries. As you can see in Figure 5, the Positions hierarchy is unbalanced because secretaries report to various levels. You must either put the secretaries on the same level and have your references "jump" levels, or, as Figure 6 shows, you must put the secretaries on different levels—which changes the hierarchy.
In this example, the Secretaries level is actually outside your hierarchy. You'll never want to assign sales figures to secretaries, for instance, nor do you want to include them when you calculate sales figures. Because secretaries aren't part of the organizational hierarchy of the Positions table, you must represent "secretary" in a separate attribute; otherwise, the hierarchy levels are nonexistent.
Ambiguous levels. What happens when you have to create a new level between existing levels—say a Brand level that's typically between ProductLines and Products but in some cases should be above Product Lines? In this case, as in the Positions example, the Brand level is outside your hierarchy, and you should store it as a separate attribute.
Different attributes at different levels. You might have to store some different attributes at different levels of your hierarchy. For example, perhaps your customer has a Tax
Identifier attribute at the company level, but this attribute isn't needed at the shop level. Storing different attributes at different levels is easy when you have two different tables. But if you're using only one table, you have to store a union of all the attributes, so you'll have an attribute that's always empty for the shop level. In the end, it's a design question: Many times the similarity (expressed in the number of matching columns) is significant, so it makes sense to join different tables into one. If it doesn't, go ahead and create separate tables. Here, you have to sacrifice storage space for flexibility or vice versa.
A Flexible Approach
We've shown you the advantages of the hierarchy pattern that have made it an extremely useful database technique for us. Of course, this approach has some downsides, such as the necessity of storing a union of attributes for each level. You'll need to decide when creating or not creating this type of structure is appropriate. If you start using this technique, you might discover—as we did—that it will hone your database-design skills by forcing you to generalize the structure of your tables.