You’re getting ready to design a database from scratch. What do you really need to include? What steps do you really need to take? Are there shortcuts that you can take or steps that you can skip? I can’t really speak to this last question because I don’t know what your process is when creating a database, but I can speak to the components that are absolutely necessary for a successful database project. There are 10 essentials:
- Understand the database’s purpose.
- Get the right tool.
- Gather the requirements for the database.
- Be conscientious when modeling the data.
- Enforce relationships.
- Use the appropriate data types.
- Include indexes when modeling.
- Standardize the naming convention.
- Store the code that touches the data in SQL Server.
- Document the work.
1. Understand the Purpose
Despite the popularity of Scrum, rapid application development (RAD), and numerous other rapid-development techniques, you still need to know why you’re creating the database. The first essential is that you discover and understand the database’s purpose before doing anything else. Saying that the database’s purpose is “to store records” isn’t good enough. You need to understand the business reason for this database’s existence before it goes into development, much less production. The business reason needs to map not only to the technical specifications but also to the business purpose and the business processes that this database will be supporting. For example, will it be a retail point of sale (POS) database, an HR database, or part of a customer relationship management (CRM) package that will help you track customers and manage the sales cycle? You need to know this and many more details about why you’re creating the database before you start the design process.
2. Get the Right Tool
You can’t create a viable database without software tools, no more than you could build a house without construction tools. So, you’re going to need data modeling software. Data modeling software, also called a CASE tool, is the data modeler’s and data designer’s best friend. A good CASE tool can help you quickly create different types of data models—such as functional decompositions, system and context diagrams, business process and data flow diagrams, and entity relationship models—each of which will help you communicate to your colleagues the visual components of the proposed system you’re building. Some CASE tools have features that enable sharing, coordination, merging, and version control for design teams. Last, but certainly not least, these tools effectively document what you’re doing and, by implication, why you’re doing it.
The cost of entry into the CASE tool market is not insignificant. However, the initial investment will be paid back in terms of shortened time-to-market for database projects and increased knowledge of corporate data and processes. You can read about six different CASE tools in "Comparative Review: Sizing Up Data Modeling Software." The comparison is based on capability and price, with some indication of how long it might take you to learn the package.
There is one thing to remember, though: No modeling tool will ever replace a person who understands how the business works. The CASE tool will only assist in creating visual representations of the business processes and structure.
3. Gather the Requirements
Once you understand the overarching reason why you’re doing the database project and you’ve selected a tool that will help you visualize the as-is and to-be environments, you need to do a deep dive into requirements gathering. You need to understand the requirements well enough to be able to create visual representations of data, processes, and systems.
When gathering the requirements, don’t limit yourself to disk drives, virtual machines (VMs), and other technical requirements. Although you need to know the technical requirements, they aren’t pivotal to good database design. The database that you’re creating has to support business operations, so you need to interview company staff members and managers to understand how the business runs. Only then can you get a handle on the requirements that you’ll need to meet in order to create a viable database.
When you’re gathering requirements, you’re going to encounter conflicting business needs—count on it. You’ll have to wear your diplomat hat to get the parties involved to agree on some sort of compromise so that the project can move forward.
4. Be Conscientious When Modeling the Data
My favorite part of database design is modeling the data—that is, creating structures that will hold distinctly different data sets (entities or tables) and representing the relationships between pairs of these data sets. When you’re modeling the data, you’ll have a chance to reaffirm or correct what you found when gathering the requirements.
There’s not enough space here to do a deep dive into the details of how to model data, so I’ll point out the highlights and some constructs that you shouldn’t avoid simply because they seem a little complicated.
Point 1. Based on the answer to why you’re building the database, you need to use either transactional modeling techniques (for an OLTP database) or dimensional modeling techniques (for a relational data warehouse). In the SQL Server Magazine archive, you can find a wealth of information about how to do both types of data modeling. The Learning Path box lists some of those articles.
Point 2. You need to perform data modeling at multiple levels. You should create the following three models:
- Conceptual data model (CDM). The CDM contains the semantics of the information domain that you’re modeling. (Taken from the Greek word semantiká, semantics refers to the study of meaning.) The CDM defines the scope of the database. Entity classes—the distinct data sets I mentioned earlier—represent things of interest to the organization, and the relationships between the entity pairs assert business rules and regulations. The CDM expresses both the as-is and to-be states and is often created during the Discovery stage.
- Logical data model (LDM). The LDM describes the structure of the information domain. It includes criteria such as attribute set details, data types, key status, and individual attribute nullability.
- Physical data model (PDM).The PDM defines how the data will be physically stored. You need to map the LDM to a specific database management system (DBMS) platform.
Point 3. This point is relevant for transactional data models only: You need to normalize the data and model to third normal form (3NF). Failure to model to 3NF in the early stages of design restricts your understanding of how data items interact with each other and most likely will inhibit a full understanding of the business requirements that the database must support. Once you move to later levels of design (e.g., the PDM), you can denormalize back to second normal form (2NF) if you must. Denormalizing a transactional data model isn’t recommended, however. It results in data duplication, which causes insert, update, and deletion anomalies and data synchronization problems. But if you must denormalize (for production efficiencies, perhaps), you’ll at least know exactly where the data duplication will occur and where to expect data synchronization problems. For more information about 2NF, 3NF, and data normalization, see "SQL By Design: Why You Need Database Normalization."
Point 4. Every transactional database contains four integrities: primary key integrity, referential integrity, domain integrity, and business rules integrity. Whether these integrities are maintained will determine if your data integrity will last beyond the first data load into the database. It will also determine whether the database can support the business rules and requirements that you gathered. For more information about the four integrities, see the Learning Path box.
Point 5. Never use a printed report as a template for a table. This is a common mistake that database design novices often make. Take, for example, an invoice. An invoice contains three distinct data sets:
- Sales data. This data set includes information such as the date, time, and location of the sale, and the total invoice amount (with or without discounts, taxes, and shipping and handling).
- Items sold. This data set includes information such as quantity, per-item cost, total cost of items, and references to items’ descriptions.
- Product in inventory. This data set includes a complete description of each product and other information that’s necessary to maintaining an inventory.
Yet, in production databases, I’ve seen tables named Invoice that mix components of all three data sets. This is detrimental to data integrity and confusing for the office staff. A printed report is a business requirement and one of the outcomes of your database design. You can create a view, indexed or otherwise, to model a business report, but a business report shouldn’t be used as a template for a database table.
Point 6. Don’t be afraid to include supertype and subtype entities in your design in the CDM and onward. The subtypes represent classifications or categories of the supertype, such as employees who are staff members and employees who are authors. They’re both employees, and they’re both entitled to the employee benefits package. Yet some staff members are authors and some aren’t; authors have additional relationships and attributes that the staff members don’t have. Entities are represented as subtypes when it takes more than a single word or phrase to categorize the entity.
If a category has a life of its own, with separate attributes that describe how the category looks and behaves and separate relationships with other entities, then it’s time to invoke the supertype/subtype structure. Failure to do so will inhibit a complete understanding of the data and the business rules that drive data collection.
When it comes to implementing the supertype/subtype structure, the way you’ve modeled the business requirements and the relationships will determine whether you create one or many tables in the PDM. In this example, if you create an Employee table and an Author table, you could be duplicating data for the individual who is both a staff member and an author. This duplication would lead to nonkey data redundancy within the database, which inevitably causes insert, update, and deletion anomalies, unsynchronized data, and loss of data integrity. For more information about how to properly use supertypes and subtypes, see "Implementing Supertypes and Subtypes."
Point 7. You must look out for multivalued attributes (MVAs), both expressed and implied. In its latest rendering, Microsoft Access has introduced the MVA structure. In doing so, it has veered away from the principles of relational data modeling and has truly become a nonrelational database. I recently reviewed a real estate system for a business that wanted to upgrade from Access to SQL Server. The database schema was riddled with MVAs. For instance, the building entity had an array of attributes named Picture1, Picture2, Deed of Record, Title, and so on—all within the same table. The owner wasn’t too surprised when he was told that SQL Server wouldn’t support the MVAs because his programmers had been having problems trying to access the attributes that were stored as MVA structures.
An MVA contains a collection of values for an attribute, as shown by the list of disks (e.g., Disk1Capacity, Disk2Capacity) in the SERVER table in Figure 1. This figure shows the schema for a server room inventory database. The solution to resolving this MVA is to flatten the structure so that each disk has its own row in a newly created table named DISK_DEVICES, as Figure 2 shows. The only constraint on the number of disks that the server can have is the number of bays in the server.
MVAs make retrieving summary data extremely difficult. They also cause problems when inserting, updating, and deleting data in a database. You can read more about MVAs in "Multivalued Attributes."
5. Enforce Relationships
The whole idea of a relational database is to identify and enforce relationships between entity pairs so that data integrity, as defined in the business rules and regulations and represented in the data models, is maintained. If you have the rule "each order must be associated with a current customer," you don’t want someone to write a program that lets a person place an order but not get a customer number. No application-level code can circumvent the rule to maintain data integrity.
The three types of relationships—one-to-one (1:1), one-to-many (1:M), and many-to-many (M:N)—are easily transitioned from the data model to the implemented database. For information about how to do this, see "Logical Modeling."
6. Use the Appropriate Data Types
There are a myriad of reasons why you want to use the proper data type for an attribute. Reduced disk storage space, data accuracy, the ability to join or union two tables together, attribute synchronization with the application layer, and internationalization capability are just a few of the reasons why you want to make sure that you’re defining each attribute’s data type correctly.
Technically, using the correct data type for an attribute speaks to domain integrity, which is one of the four database integrities I mentioned previously. When enforced, domain integrity ensures that an attribute contains only the set of valid values that are relevant and meaningful to that attribute. Many data modeling software packages let you create data domains and use them throughout the model. For example, you can create a Zip+4 domain that’s defined as char(5)+'-'+char(4). As a result of using the Zip+4 domain, every zip code instance in the database is defined with the same data type, instead of some instances defined as char(), some as varchar(), and some as int. Adding a NOT NULL characteristic to the Zip+4 domain further enforces the domain integrity rule that states each occurrence of a zip code in the database must contain a value.
Implementing domain integrity in SQL Server is a manual effort if you don’t have a data modeling package that lets you create and enforce data domains. The tools at your disposal include
- Defining appropriate data types and lengths, nullability, defaults, and check constraints when creating tables
- Using user-defined types, user-defined functions, and stored procedures that evaluate data when it’s being entered or used
- Using Declarative Referential Integrity (DRI) or triggers to enforce relationships between pairs of tables
- Using lookup tables to restrict data values when they’re inputted or edited
For additional ideas on how to enforce domain integrity, see "SQL By Design: The Four Integrities."
7. Include Indexes When Modeling
When you’re building the PDM, you should include indexes. This is a bit of a chicken-and-egg situation because until you know how the data is going to be used, you can’t predict with certainty which columns will need indexing. But you’ve already gathered the requirements, so you should be able to make an intelligent guess as to which data should be indexed and how to index it. SQL Server automatically indexes (and clusters, by default) primary key constraints, but it doesn’t automatically index foreign key constraints. Therefore, the first set of indexes you want to create are those for the foreign keys in child tables. The next set would be for those columns on which you’re going to sort. In other words, you want to create indexes for columns that would be in ORDER BY clauses. From the gathered requirements, you should be able to determine which columns will be used for filtering (i.e., in the WHERE clause), so based on your understanding of how many rows are in each table, you can make an intelligent decision regarding which columns to index. For additional information about indexing, see “Indexing Dos and Don’ts”.
8. Standardize the Naming Convention
If you don’t already have one, adopt a naming convention for all database objects, including tables, columns, indexes, keys, named constraints and relationships, and code. Not having a naming convention can lead to errors in data retrieval and can even inhibit productivity. A database schema is rarely documented and evolves over time; an established naming convention for the objects helps modulate schema confusion. Here are a few rules for creating useful database object names:
- Use names that are meaningful to the entire organization and clearly identify the database object.
- Don’t get too wordy. Use the minimum number of characters necessary to identify the database object.
- Don’t use names that implicitly or explicitly refer to more than one topic (for tables) or characteristic (for columns).
- Use the singular form of a name wherever possible, especially for entities and tables. This helps to correctly distinguish the entity-to-entity relationship (i.e., 1:1, 1:M, or M:N).
- Don’t use spaces in database object names (e.g., Employee ID). Other database management systems might not support the use of spaces in object names, thereby limiting transportability and cross-database integration.
When you choose a naming convention, remember that no one convention is perfect, but almost any naming convention is better than none at all. For additional information about naming conventions, see “Seven Deadly Sins”.
9. Store the Code in SQL Server
Repeat after me, “I will not embed dynamic SQL code in my applications.” Doing so puts a database at risk for inconsistent and possibly inaccurate application of business rules and regulations. It also puts a database at risk for a SQL injection attack. In case you haven’t heard the term SQL injection before, it’s a security vulnerability that occurs whenever one programming or scripting language (in this case, T-SQL) is embedded inside another. A hacker can literally hijack the embedded T-SQL code, modify it on the fly, and gain access to the data in the database. The technique to do this is widely known and publicly available. For additional information about SQL injection, see the web-exclusive article "SQL Injection Attacks on the Rise."
The best way to avoid inconsistent rule application and SQL injection is to store code that touches a database’s data as a stored procedure or CLR object. When you need to add or modify data, you then call the appropriate stored procedure; the operation will execute the same way every time it’s called. For an introduction to the CLR, see the web-exclusive article "What’s the CLR and Should I Care?"
10. Document, Document, Document
Documenting the work you do is probably you're least favorite activity. But if you're following through with the essentials that I've discussed, you're already documenting. Now, that wasn't so bad, was it?
Reap the Benefits
If you did a good job discovering the database’s purpose and its requirements, and if you were conscientious while you worked through the myriad steps of the modeling process, then your database design will likely stand up to everything that the organization can throw at it. The database will support all the business requirements and comply with all the integrity, security, and privacy rules and regulations that govern your industry. It’ll perform well and be easy to use and maintain. And as the organization evolves into different lines of business, you’ll be able to easily extend the database’s functionality and easily scale it up and out to store and serve up many times more data than it was originally designed for.
SQL SERVER MAGAZINE RESOURCES
To learn the basics of transactional data modeling, see:
To learn the basics of dimensional modeling, see:
For more information about the four transactional database integrities, see: