The Global Data Protection Regulation, or GDPR, was passed as a regulation in 2016 by the Council of the European Union. The regulation provides rights to European Citizens over their personally identifiable information (PII).
The GDPR frames how PII is to be safeguarded, stored, processed and shared by entities that process or control this data--without setting firm guidelines on how to operationally engineer and retrofit systems to allow them to meet the rights afforded by the regulation. That was left up to architects, engineers, database administrators, managers, system admins and, of course, lawyers.
GDPR in a Nutshell
As of May 25, 2018, the GDPR is in full effect, and requests--as well as lawsuits--can commence. GDPR grants certain rights to EU citizens (and, by proxy of having data stored alongside EU data, to the global population) over PII. Three key areas of the regulation have been garnering the most attention:
- The right to have personally identifiable data points deleted or obfuscated, also referred to as the “right to be forgotten”
- The right to no longer have data included in any processing
- The right to receive a secure, digital copy of all of PII upon request.
It is the first area--the right to be forgotten--that gives rise to the discussion of natural keys.
Natural Keys and Surrogate Keys
A core understanding in relational database design is that each record in a table should be unique as a path to normalization. This is accomplished by creating a primary key constraint on a table and assigning a column or columns to the key constraint. This can be accomplished “naturally” through a column(s) that exists to describe the data itself or by creating a surrogate key that has no meaning at all in the data itself. Uniqueness is not the only mandate for a proper primary key. Other guiding principles include:
- Primary key values should be static, unchanging.
A primary key's value should never change once it's assigned. Primary keys exist to provide uniqueness within the set of rows that comprise a table, but they also exist to allow for relational mapping between tables within the domain of a database. Foreign keys in other tables will be employed to provide referential integrity and relationships between
- A primary key must have a value when the record is created.
A candidate for a primary key can’t be NULL, nor can it be calculated after a record is created in a table. Since it must uniquely describe the record, it has to have a value and meet all other conditions upon creation.
- Primary keys should be a small as possible.
Surrogate keys are an additional column added to the record to artificially provide uniqueness. By its definition, a surrogate key violates the guideline of keeping primary keys as small as possible because a surrogate key adds to the length of the data row it’s describing (particularly if there is a combination of existing rows that already uniquely describe the record.) Primary key values end up being stored in multiple locations physically on disk (and in memory when read from disk for processing as part of performance improvements derived from indexing methodologies).
In relational databases, clustered indexes are frequently built upon the primary key column(s). A table can have a single clustered index, and the data is logically and physically sorted on disk by the clustering key(s). Any number of non-clustered indexes built on the same table include the column(s) the non-clustered index is built on, but also include the clustering key so that an action that transverses the non-clustered index can then reference back to the clustered index--and the primary key--to perform a lookup for the row(s) that uniquely provide any additional column values not included in the non-clustered index. This means that when used as part of a clustered index on a table with one or more non-clustered indexes, the primary key is stored in more than a single location. Even if there is a column or multiple columns that can be used to create a primary key that satisfies all these rules governing what makes a column suitable for a primary key, when you add repetition into the equation an integer value's 4-byte surrogate key may end up being smaller in the long run than a natural
- And yes, finally, the primary key value must be unique.
Surrogate keys are typically added in a fashion that meets all of the aforementioned requirements. Switching to Microsoft Transact SQL language for one second because of my familiarity, they're typically an integer family datatype:
- Tiny integer (1 byte, values from 0-255)
- Small integer (2 bytes, values between -32,767 and 32,767
- Integer (4 bytes, values between -2,147,483,648 and 2,147,483,648)
- Big integer (8 bytes, values between -9,223,372,036,854,775,807 and 9,223,372,036,854,775,807)
The data type should depend on how many records one expects to have in a table over the lifetime of the database. In T-SQL they're usually assigned the IDENTITY property, which means that the value is auto-assigned and always incrementing at a consistent rate from the last value added to the table. Unlike natural keys, there is no reason I can conceive of where it needs to be exposed to an end user. That is not usually the case with columns used as natural keys.
I've seen interesting uses of natural keys over my 20 years as a data professional. While some meet the guidelines provided previously, none stands the test of time. A few
- Either medical record number (MRN) or Social Security Number (SSN) for a table storing patient data in a medical system
- Combination of first name, last name, middle initial and date of birth for a table identifying employees
- Mailing address and last name for storing addresses
Each of these fail multiple tests to be suitable for primary key consideration: Both SSN and MRN are considered PII because they can be used either solely or along with other columns in combination to personally identify an individual. SSN is not a valid column because not everyone has a Social Security Number. Even if the database in question is for a United States-based medical facility, one has to expect the facility will be visited by non-U.S. citizens on occasion due to medical emergencies. To support that need, the SSN column would need to either allow NULL values or a "dummy" value that would then not be unique (such as 00-000-0000.)
A combination of first name, last name, middle initial and date of birth falls victim to the PII concerns just as SSN and MRN do. Each of those columns is considered PII depending on the set size of the data being described. In combination, they're extremely descriptive of a single individual. "Extremely descriptive" does not equate to "unique," though. It's entirely possible that the combination is not unique even in a smaller data set.
Mailing address and last name was one of the oddest combinations I've seen to craft a primary key. It seemed as though the designer of the database was grasping for any way to avoid using a surrogate key by storing last name in a table it didn't belong in. Not to mention that most people sharing the same address have the same last name. Both mailing address and last name are also considered PII under multiple definitions and regulations, GDPR non-withstanding.
Natural Keys Are Inconsistent with the Right to be Forgotten
This brings us back to GDPR and, more specifically, the right to be forgotten. Honoring this right means deleting PII data where possible and obfuscating the remainder. Either operation has its inconsistencies with the existence of primary keys.
Obfuscating is incompatible with two of the requirements for primary keys--that any columns in a primary key are unique and that they have value--while deleting a primary key violates referential integrity rules. Any entity that has structured its database schemas upon singular or composite keys formed from columns considered PII already are seeing this and realizing the issues with meeting GDPR requirements while also addressing redesign of databases and access methods.
This should be a warning to future projects and those in flight that surrogate keys are the best option for structuring primary keys and enforcing referential integrity in the era of increased data privacy concerns. It should additionally warrant considering what data is necessary to support your business and to try to avoid collecting data that is irrelevant or possibly intrusive. This is my opinion, and natural versus surrogate keys is a sensitive subject for many data professionals. Do you feel the same way? What do you think? I would love to hear your opinion on this subject.