Revisiting How to Avoid Referential Integrity Errors

Editor's Note: This article is an updated version of "Avoid Referential Integrity Errors When Deleting Records from Databases." In it, Chandra Sekhar Pathivada goes more in-depth into the concepts behind his PR_HIERARCHIAL_DATA stored procedure.

Most of the time, DBAs need to deal with archiving, deleting, inserting, and masking data in relational tables. One major challenge that's common to all these activities is to mitigate the problems with references (i.e., foreign keys), where one table refers to other tables and the other tables, in turn, refer to many other tables as a hierarchy.

When you delete a database's records, you need to be careful about referential integrity errors. These errors will occur if you delete the records in tables with foreign keys before deleting the records in the referenced tables. The same holds true for inserting and updating foreign keys (and primary keys as well). One way to deal with this problem is to disable the foreign keys, then re-enable them after you're done with your activity related to the referential data.

If you need to perform the activity in a read-only database created for analytical purposes, you can disable all the foreign keys, work with the data, then re-enable the foreign keys. This process is not only straightforward but also improves the performance of the extraction, transformation, and loading (ETL) load.

However, if you need to perform the activity in production database, disabling the foreign keys isn't recommended because it breaks the referential integrity for the online transactions. To avoid referential integrity errors without disabling the foreign keys in a production database, you need to do things differently. To explain what that is, let's take a high-level look at the first few steps required for some of the activities previously mentioned.

Archiving. When archiving old records that contain references, you need to follow these steps:

Identify the dependency structure (primary key and foreign key) of the current level.
Copy the old records that contain references (foreign keys) first to the archive database.
Delete the records from the table that are now archived in the archive database.
Identify the next level and repeat steps 1 through 3 until you reach the root table (master entity) in the relational hierarchy.

Data masking. To mask primary keys, you need to follow these steps:

Identify those tables that contain the masked primary keys.
Identify the relationships, then mask the referencing foreign keys involved in those relationships.

ETL. To load data into a relational schema, you need to follow these steps:

Populate the master table.
Identify the references to the master table and populate the related data in sequence.

What's most common across all these activities is identifying the references.

In production, if you need to delete the reference data, you might be able to perform a cascade delete on the foreign keys. However, there are some drawbacks to this approach:

Cascade deletes won't work if a table has more than one foreign key.
If you specifically want to delete data only after you archive it (i.e., set it), you shouldn't perform a cascade delete, as you must make sure that you archive the dependent data first.

A better approach is to identify the dependencies in a table when designing the ETL process to avoid referential integrity errors. Then, when you need to delete or insert data, you should list the references in sequence (i.e., in hierarchical order) and follow that hierarchical order.

The PR_HIERARCHIAL_DATA stored procedure determines the hierarchical order of all the tables in a database, then produces a list of SQL statements with DELETE commands in reverse hierarchical order. Identifying the order is bit difficult because many types of relationships can exist in a database, including one-to-one, one-to-many, and many-to-many relationships. But this stored procedure does all the work.

To identify the hierarchical order, the stored procedure uses information in the sysseferences systems table and sys.objects system view. In sysseferences, it uses the fkeyid column, which contains the IDs of the referencing tables (i.e., the tables that contain foreign keys), and the rkeyid column, which contains the IDs of the referenced tables. In sys.objects, it uses the object_id column, which contains object IDs.

PR_HIERARCHIAL_DATA uses a recursive common table expression (CTE) query to join the sysreferences table multiple times based on a join between the rkeyid and fkeyid columns. I prefer using CTEs because they offer a more readable form of the derived table that can be declared once and referenced multiple times in a query. This CTE query provides a hierarchical list that contains the tables with foreign keys and the tables those foreign keys are referencing. However, the list doesn't include tables that don't have any relationships (i.e., independent entities). Another CTE query provides that information. The results from both CTE queries are put into a temporary table. Duplicate table names (which result when a table references to two or more tables) are removed with the MAX function.

The stored procedure then uses the information in the temporary table to delete the data in the tables in an order that avoids referential integrity errors. If PR_HIERARCHIAL_DATA finds conflicted tables (i.e., tables that reference each other), it will print the names of those tables. You'll then need to manually delete the data from those tables.

You can use the PR_HIERARCHIAL_DATA stored procedure three ways. To simply display a hierarchical list of the tables with foreign keys and the tables those foreign keys are referencing, you execute the stored procedure without passing in any parameter values:

EXEC PR_HIERARCHIAL_DATA

If you want to output a list of the tables in hierarchical sequence so you can design a SSIS package or write an insert script that will load data into the tables, you use the INSERT parameter when you call the stored procedure:

EXEC PR_HIERARCHIAL_DATA 'INSERT'

If you want to delete the data from the database tables, you use the DELETE parameter:

EXEC PR_HIERARCHIAL_DATA 'DELETE'

You can download the PR_HIERARCHIAL_DATA stored procedure by clicking the Download the Code button near the top of the page. Although I wrote this stored procedure for SQL Server 2005, I've successfully used it with SQL Server 2008 R2 and SQL Server 2008. (I haven't tried it with SQL Server 2012.) You don't need to customize any code before running the PR_HIERARCHIAL_DATA stored procedure.

Comments

Plain text