SQL By Design: The Foreign Key

In a previous column, I talked about primary keys ("How to Choose a Primary Key," April 1999). This month, I cover the foreign key. A foreign key is an integral part of a relational database design. It establishes relationships between tables, and it makes possible the procedures that cross-reference data stored in separate tables in the database. It enforces data integrity rules and prohibits modifications that might compromise data integrity. The foreign key is as important to the database design as the primary key, and together they control updates to the data. So let's look at what a foreign key is, what you use it for, and what value it brings to your database design.

What Is a Foreign Key?

A foreign key is a table column that establishes a link from the table it resides in to the primary key or a candidate key in another, related table. The foreign key is the anchor on the many side of a one-to-many (1:M) relationship, much as the primary or candidate key is the anchor on the one side of this relationship.

A foreign key is the linchpin that makes sure you don't enter invalid data into a table. It also prohibits you from doing a delete or update operation that might leave orphan rows. The database development community refers to this feature as declarative referential integrity, and it can't function without foreign keys.

Figure 1 is an entity relationship diagram (ERD) representing three tables (Project, Employee, and Paycheck) and two 1:M relationships. Project contains attributes describing the kinds of workplace projects an employee might be assigned to, Employee contains attributes describing individual employees, and Paycheck contains attributes describing employee compensation. To demonstrate a point, I've simplified the relationships and restricted the business rules. Each Project can have many Employees assigned to it, but an Employee is associated with only one Project at a time. Each Employee gets many Paychecks, but each Paycheck goes to only one Employee.

In Paycheck, the attribute EmpNo is a foreign key (FK). EmpNo in Paycheck has the same meaning and is the same data type and length as EmpNo in Employee. In fact, the set of real data values for EmpNo must be the same in both Employee and Paycheck to ensure that you never cut a paycheck for a nonexistent employee. This relationship creates a referencing link between Paycheck and Employee that you can enforce and exploit when you write SQL queries, as you'll see later in this article.

Figure 2 is an implementation of the Project-Employee-Paycheck ERD, in which I've converted each entity to a table, and each attribute has become a column in a table. I populated the tables with test data to demonstrate the concept of foreign keys.

Column ProjNo in the Employee table anchors the 1:M relationship between Project and Employee on the many side. In this model, the arrow points from the foreign key to the primary key, implying a foreign key reference from Employee.ProjNo to Project.ProjNo. This arrangement means that the project number in the Employee table references the project number in the Project table. So in Figure 2, both Abby and Don are assigned to work on the Y2K Remediation Project.

Dependent Relationships

Not all 1:M relationships are created equal. Notice that in both Figures 1 and 2, some relationship lines are dotted and some are solid. The solid connector line represents the dependent, or identifying, 1:M relationship. A dependent relationship means each record on the many side of the relationship depends for its existence on a related record on the one side.

The dotted connector line represents the independent, or nonidentifying, relationship, which is the condition between Project and Employee. An independent 1:M relationship means that a record may exist on the many side of the relationship without a related record on the one side. An Employee doesn't need to be working on a Project in order to be an employee. In Figure 2, Ethyl Etheridge is not connected to a project, but she still gets a paycheck.

The terms identifying relationship and non-identifying relationship refer to how the primary key of the table on the many side is supposed to be constructed. The theory is that for identifying relationships, the primary key of the one table cascades down to the many table, and is concatenated to the identifier of the many table to form the primary key. Many CASE tools that you can use to draw ERDs implement this theory as a rule that you can't easily circumvent.

You can enforce the identifying relationship two ways. You can declare the relationship when you initially create the table. Or you can declare this relationship after you create the table, with an ALTER TABLE statement, as in:

ALTER TABLE paycheck
  ADD CONSTRAINT fk_Paycheck2Emp FOREIGN
    KEY (EmpNo) 
    REFERENCES Employee(EmpNo)
GO

The nonidentifying relationship doesn't have a restriction about cascading the primary key. Each table has its own unique identifier; neither table inherits part of its primary key from the other. The Project table primary key, ProjNo, is a foreign key in the Employee table. ProjNo in the Employee table can be null, as in the case of Ethyl Etheridge, who is not assigned to a project.

Using the Foreign Key

You always need a foreign key in the many table when you have a 1:M relationship. The database must include this relationship somewhere. A nonrelational (object-oriented) database might store the relationship as pointers, or addresses in a one table record that point to associated records in the many table.

However, in a relational database management system (RDBMS), pointers aren't necessary and might not be used to define the 1:M relationships. Instead, when you specify two related tables in a JOIN query, the values in the foreign key column of the many table are compared to the values in the primary key column of the one table. If the two values match, the query returns that record in the result set.

Although I've been talking about a foreign key and its corresponding primary key, a foreign key can also be compared to a candidate key. In the April issue, I talked about what criteria you use to promote a candidate to primary key. A candidate key can be a primary key, but for business reasons, which you determine, it isn't always. Such is the case in the Employee table, where both EmpNo and EmpSSN are candidates, but for privacy reasons, I've promoted EmpNo to primary key.

Make sure that a foreign key and its corresponding primary or candidate key have the same data type and length when you create tables. (You don't have to make the attribute names the same.) Then you can use the following query, for example, to compare the values of EmpSSN in the Employee table to the values of EmpSSN in the Paycheck table and generate a report listing the total amount of pay each employee has received. The report appears in Table 1.

SELECT EmpLname, EmpFname, SUM(PayAmt) AS
  "Total Pay Week 15-17"
FROM Employee INNER JOIN PayCheck ON
Employee.EmpSSN = Paycheck.EmpSSN
GROUP BY EmpLname, EmpFname

You can use EmpNo to do the comparison in the previous query and generate the same result set. However, a reasonable assumption is that the payroll department will reference employees by Social Security Number (SSN).

In much the same way, by comparing ProjNo from the Project table to ProjNo from the Employee table, you can use the following query to generate a report that lists employees and their associated projects. The output report appears in Table 2.

SELECT ProjName, EmpFname + " " + EmpLname
  AS "Project Team Member"
FROM Project LEFT OUTER JOIN Employee ON
  Project.ProjNo = Employee.ProjNo
ORDER BY ProjName

The SQL Server query optimizer uses an index on the foreign key column, if one is available, to resolve a JOIN query if the tables are large (multipage). (For more information on the SQL Server query processor, see Petkovic and Unterreitmeier, "New Features for Query Processing," July 1999.)

When you create the foreign key reference, SQL Server doesn't automatically build an index for the foreign key column. Create the foreign key index in a separate step:

CREATE INDEX idx_PaycheckEmpNo ON
paycheck(EmpNo)

After you declare the foreign key relationship to the DBMS, the DBMS always enforces it; you cannot program around the constraint. For example, if you try to insert a record containing an employee ID of 6 into the Paycheck table and no employee number 6 exists, the DBMS will reject your INSERT query. If you try to delete the record for Becky Brown in the Employee table, the DBMS will reject that operation with a warning: DELETE statement conflicted with COLUMN REFERENCE constraint 'fk_Paycheck2Emp.' The conflict occurred in database 'SQLmag,' table 'Employee,' column 'EmpNo.' This message means that if you want to get rid of Becky Brown, you first have to remove from the Paycheck table all records that refer to her. Then you can remove her employee record from the Employee table. DRI, or the foreign key constraint, prevents you from leaving behind orphan records for Becky Brown in the Paycheck table.

Value-Added Design

A foreign key creates a bridge between two tables in a database. Database programmers use the foreign key as one of the critical components to dynamically join data from two or more tables. Without a foreign key, if you want to create a report that uses data from more than one table, you have to resort to exhaustive coding in a non-SQL programming language.

Extracting information from multiple tables is significantly complicated beyond the simple joins shown earlier, and end-user access to the data diminishes as the complexity of retrieval increases. You might be tempted to denormalize the tables to make data access easier for the end user. But table denormalization compromises data integrity because you have to manually synchronize duplicate data (the result of denormalization) on each data update.

The foreign key and the primary key control updates to a database. Used correctly, the foreign key can ensure database table integrity.

Comments

Plain text