The General Data Protection Regulation, or GDPR, is creating a great deal of stress as the May 25 deadline for compliance nears. Chief among the concerns for database administrators, chief data protection officers, CIOs and CEOs alike is the security of customers’ data in the context of GDPR, as well as the ability of their relational and NoSQL database systems to accommodate GDPR requirements.
GDPR includes provisions that challenge how personally identifiable data is controlled and processed. It requires Controllers (the entities that decide why and how data that can identify a living human being is processed) and Processors (the entities that process that same data on a controller's behalf) to be accountable to the individuals whose personal data they control or process. Among other things, the GDPR requires that Controllers and Processors honor three kinds of request from individuals: that their data be “forgotten,” that their personally identifiable data be provided to them in a reasonable timeframe, and that their personally identifiable data not be processed. There are many other GDPR requirements, but these three are the most widely discussed.
Also widely discussed is the ability—or inability, as the case may be—for database systems to enable GDPR compliance. NoSQL database systems, in particular, may present a challenge.
Just what are NoSQL databases, and what makes them the “flipside” of relational databases like Microsoft SQL Server or Oracle?
Providing answers to those questions starts in the 1960s.
A Brief History of NoSQL Databases
NoSQL originally referred to “non SQL” or “non-relational” databases, but the term is now associated with the phrasing “not only SQL.”
Today’s NoSQL databases handle storage and retrieval differently than the tables, and joins between tables, of traditional relational models.
While NoSQL databases have existed for more than 50 years, they didn’t really gain popularity until the rise of Web 2.0 in the 2000s and of internet giants like Facebook, Google and Amazon. These organizations, and others like them, wanted to find ways to bypass some of the concerns that come from traditional RDBMSs, mainly related to speed. NoSQL databases are by design focused on performance, but at the sacrifice of some of the benefits that RDBMSs provide.
NoSQL databases originally shunned the constraints of relational math, but they have since developed, at least in part, a more centrist approach to managing data near-relationally. Many still sacrifice various aspects of ACID transactions (Atomicity, Consistency, Isolation and Durability) for availability and speed. This means that problems relational databases are designed to prevent, such as reading data that has not been committed or losing a submitted transaction that was never hardened, can occur in a NoSQL variant.
It’s important to note, however, that not all NoSQL databases are alike. Three of the most common categories are:
1. Key-Value Databases
Key-value databases use an associative array (that is, a “map” or “dictionary”) in place of the typical tables and joins found in a relational database. In key-value databases, data is represented as a collection of key-value pairs, with the rule that each possible key appears, at most, once in the collection.
The fundamental issue with using a pure key-value pair is that natural keys are employed, meaning that the actual real-world value is stored as the key. Think movie rentals: Only one person can rent a given copy of a movie at any given time, so the key-value pair may be something like “Ready Player One” and “Trevor Ford.”
An alternative to natural keys is a hash table, which applies a hashing algorithm to the natural values and uses the result as the key. There are issues if the algorithm generates the same hash for different natural values (a collision), but there are work-arounds for such matters. With that said, there is a major challenge with key-value pairs that use natural keys, just as there is for a relational database using natural keys: when those keys are personally identifiable in nature. Think about the example above, identifying “Trevor Ford.” This will be an issue under GDPR requirements if Trevor Ford requests that his data be forgotten. Forgetting that data would break the key-value pair, which is the inherent method for identifying a record in a key-value database. If an organization was using a hash table as an alternative to natural keys when the request came in, the hash would need to be recalculated, which would add overhead and could lead to further problems.
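The erasure problem can be made concrete with a minimal sketch. It assumes an in-memory Python dictionary standing in for a key-value store; the function names `hashed_key` and `forget`, the choice of SHA-256, and the second rental record are illustrative assumptions, not any product's API.

```python
import hashlib


def hashed_key(natural_key: str) -> str:
    """Derive a store key by hashing the natural key (SHA-256 is an assumption)."""
    return hashlib.sha256(natural_key.encode("utf-8")).hexdigest()


def forget(store: dict, person: str) -> int:
    """Delete every pair whose value names `person`; return how many were removed.

    A key-value store is indexed by key only, so finding a person who
    appears as a value means scanning every pair.
    """
    doomed = [key for key, value in store.items() if value == person]
    for key in doomed:
        del store[key]
    return len(doomed)


# Natural keys: the movie title is the key, the renter's name is the value.
by_title = {"Ready Player One": "Trevor Ford", "Black Panther": "Ada Smith"}

# Hashed keys: the same pairs, but each key is a digest of the title.
by_hash = {hashed_key(title): renter for title, renter in by_title.items()}

print(forget(by_title, "Trevor Ford"))  # the whole rental record disappears
print(forget(by_hash, "Trevor Ford"))   # hashing the key did not hide the name
```

In both layouts the only way to honor the request in this sketch is to remove the entire pair, so the rental record goes with it. Hashing the key changes how the record is located but does nothing for the name held in the value.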
Examples of key-value NoSQL databases include, but are not limited to:
- Apache Ignite
- Oracle NoSQL Database
2. Document Store Databases
The central concept of a document store is the “document.” Though each offering in this category is different in its approach, they all rest on the idea that documents encode data in a standard format such as JSON or XML. Each document has a unique key that identifies it, and each document stores the values related to a record (using relational lingo). Document store databases are still a form of key-value store, but they offer an API or query language that allows retrieval based on a document's contents. Unlike RDBMSs, however, each document may have a different structure, which is a dramatic shift from the rigors of a relational table schema. The different providers of document store databases handle organization of documents differently, be it through collections of documents, tagging, hierarchies or other metadata. The same concerns arise with document stores as with key-value databases when it comes to the GDPR right to be forgotten, and even the requirement to export a user’s data in a secure fashion. Identifying that personally identifiable data across countless documents could prove difficult.
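To illustrate why locating one person's data is hard when every document can have its own shape, here is a minimal sketch. It assumes documents are plain nested Python dicts and lists, as parsed JSON would be; the helper names `find_paths` and `locate_subject` and the sample documents are hypothetical.

```python
def find_paths(doc, needle, path=""):
    """Yield the path of every value in a nested document equal to `needle`."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from find_paths(value, needle, f"{path}.{key}" if path else key)
    elif isinstance(doc, list):
        for index, value in enumerate(doc):
            yield from find_paths(value, needle, f"{path}[{index}]")
    elif doc == needle:
        yield path


def locate_subject(collection, name):
    """Map each document id to the paths where `name` appears (exact match only)."""
    hits = {}
    for doc_id, doc in collection.items():
        paths = list(find_paths(doc, name))
        if paths:
            hits[doc_id] = paths
    return hits


# Three documents, three different structures, one data subject.
collection = {
    "order-1": {"customer": {"name": "Trevor Ford"}, "items": ["Ready Player One"]},
    "review-7": {"author": "Trevor Ford", "stars": 5},
    "order-2": {"customer": {"name": "Ada Smith"},
                "shipping": {"contacts": ["Trevor Ford"]}},
}

print(locate_subject(collection, "Trevor Ford"))
```

Because no schema says where a name may live, the sketch has to walk every field of every document. It also matches exact strings only, so a real search would need to cope with variants such as initials or misspellings.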
Examples of document databases include, but are not limited to:
- IBM’s Domino
3. Graph Databases
With the rise in data science (and data scientists), graph databases are all the rage. They are designed for data whose relationships are best represented as a graph: a set of elements connected by a finite number of relations. Graph databases have an advantage over RDBMSs in that they can model complex hierarchical structures much better than their relational counterparts. Many graph databases utilize a relational-type model on the back end, and it’s their query language that differs, since relational queries struggle with the complexities of traversing hierarchical graph structures. As a result, their challenges around GDPR tend to align more closely with those of RDBMSs.
Examples of graph databases include, but are not limited to:
- Apache Giraph
Dealing with Relational Data
Most NoSQL databases are unable to handle joins, so their schemas tend to be designed differently to account for this, using techniques such as:
Multiple trips: Rather than retrieving all the data with a single query, multiple queries can be employed. Since NoSQL queries are often faster than RDBMS queries, the cost of having to do additional queries may be acceptable.
Nesting and non-normalized data: Storing more data, and non-normalized data at that, is a method that is frequently employed. Storing both a surrogate key such as a user_id and the user's first name, last name, etc. in the same record allows everything to be retrieved in a single trip to the data. The trade-off for speed is increased storage consumption. This can also lead to multiple copies of personally identifiable data, which means increased concern for GDPR violations and more overhead to remain compliant.
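The two approaches above, and what each means for an erasure request, can be sketched as follows. This is an illustration under stated assumptions: plain Python dicts and lists stand in for the database, and the field names, sample records and the functions `read_with_two_trips` and `erase_user` are hypothetical.

```python
PII_FIELDS = ("first_name", "last_name")  # hypothetical field names

users = {"u1": {"first_name": "Trevor", "last_name": "Ford"}}

# Multiple trips: each rental holds only the surrogate key.
rentals_by_ref = [{"rental_id": 1, "user_id": "u1", "title": "Ready Player One"}]

# Nesting: each rental also carries its own copy of the user's name.
rentals_nested = [
    {"rental_id": 1, "user_id": "u1", "first_name": "Trevor",
     "last_name": "Ford", "title": "Ready Player One"},
    {"rental_id": 2, "user_id": "u1", "first_name": "Trevor",
     "last_name": "Ford", "title": "Black Panther"},
]


def read_with_two_trips(rental, users):
    """A second lookup resolves the surrogate key into the user's details."""
    return {**rental, **users[rental["user_id"]]}


def erase_user(user_id, users, nested_rows):
    """Remove the user record, then scrub every non-normalized copy.

    Returns the number of nested rows that had to be rewritten.
    """
    users.pop(user_id, None)
    rewritten = 0
    for row in nested_rows:
        if row.get("user_id") == user_id and any(f in row for f in PII_FIELDS):
            for field in PII_FIELDS:
                row.pop(field, None)
            rewritten += 1
    return rewritten


print(read_with_two_trips(rentals_by_ref[0], users))
print(erase_user("u1", users, rentals_nested))  # one rewrite per nested copy
```

With references only, the name lives in one place and a single delete removes it. With nesting, every copy has to be found and rewritten, which is the compliance overhead described above.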
There are many issues to consider, but NoSQL databases' non-normalized nature and use of soft deletes (marking a record as deleted rather than physically removing it) can make GDPR compliance more difficult than it is with their relational counterparts.