Data Modeling

Data modeling is essential to building a well-functioning database. For a database to support the activities of a business, it needs a good blueprint and foundation: the data model. A data model represents a business' data. If the data model is flawed, the database and all programs that use the database will be flawed. You need to design the data model, and subsequently the database, to be extensible and expandable. To do so, you need to understand the business environment and the initial reason for the database. Knowing how to construct a data model and what some important data modeling issues are can help you build a more effective database.

You can create data models with nothing more than paper, pencil, and a large eraser, but you'll find that many CASE tools are available to assist in your data modeling tasks. Some CASE tools are extensive, offering templates for multiple modeling methodologies, meta data repositories for sharing models among the project team, and subsystems that generate data definition language (DDL) to create the physical database. Generally, the larger the CASE feature set, the higher the price. If you're working on a design of more than 8 to 10 entities, a CASE tool can save you many hours of labor by facilitating the drawing process and by keeping the various parts of the data model organized for you. For this discussion, I'll use the common Crow's Foot methodology, which most CASE software packages support.

Data Modeling Phases

Data modeling usually occurs in three phases: conceptual design, logical design, and physical design. The conceptual design phase uses an entity relationship diagram (ERD) to graphically represent the business's data and information requirements. The ERD is a concept or picture of what the database will eventually look like, what data it can store, and what information you can retrieve from it. The ERD shows what a system can do, not how it does it; generally, the ERD captures no processes or activities. Furthermore, the ERD needs to be technology-independent. In other words, design the ERD so that you can implement it on any relational database vendor's product.

Begin the logical design phase by mapping the ERD to a set of tables and testing whether these tables are in (at least) third normal form. (See "Why You Need Data Normalization," premiere issue, for a discussion of normalization.) A set of recipe-like rules, which I'll cover in an upcoming article, governs how to do the mapping. The logical design also needs to be technology-independent.

The physical design phase involves adapting the logical model to a specific product platform. Some steps you might take in this phase are ensuring that the table and column names conform to any naming standards your company might have and assigning synonyms for the tables, if necessary. A synonym is a name for a table that might be easier to reference than the name it was created with. For instance, a table you created with the name X86-EMPNW might be known to users by the synonym EmployeesNorth-West. (SQL Server 7.0 and 2000 don't use synonyms, but other database platforms might.)
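To make the synonym idea concrete, here is a minimal sketch in Python using the standard sqlite3 module. SQLite has no synonym object, so this example substitutes a view as a stand-in; that substitution is an assumption for illustration, not how a platform with true synonyms works. The columns emp_id and emp_name and the sample row are hypothetical.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- The table as it was created, with a name that is awkward for users.
-- The hyphen forces the name to be quoted everywhere it is used.
CREATE TABLE "X86-EMPNW" (
    emp_id   INTEGER PRIMARY KEY,   -- hypothetical column
    emp_name TEXT NOT NULL          -- hypothetical column
);

-- A view standing in for a synonym: a friendlier name for the same rows.
CREATE VIEW "EmployeesNorth-West" AS
    SELECT emp_id, emp_name FROM "X86-EMPNW";
""")

conn.execute(
    'INSERT INTO "X86-EMPNW" (emp_id, emp_name) VALUES (1, ?)',
    ("Sample Employee",),
)

# Users query the friendly name and never see the original table name.
rows = conn.execute('SELECT emp_name FROM "EmployeesNorth-West"').fetchall()
print(rows)
```

A view used this way is read-oriented; a true synonym is simply an alternate name for the table itself.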

Another important step is noting which columns are candidates for indexing. Primary keys are indexed automatically, but plan on indexing foreign keys, candidate keys, and any other columns that users might often sort or search by.

Next, decide how to implement any supertype-subtype structures, and create subtype discriminator fields for rolling up the subtypes into the supertype. You'll use these discriminator fields later to categorize the records into the various subtypes. (See "Supertypes and Subtypes," May 1999, for a complete description of supertype/subtype entity development.) Include in the tables any other flag fields needed for production or programming, such as date_rec_added, last_update, by_whom, archive, or include_in_list.

Determine the file organization, if your platform gives you a choice (some database management systems let you choose between various indexed sequential and hashed file schemes), and plan whether and how you'll partition the tables. Do preliminary planning for file placement on disk, based on anticipated activity and use patterns, and do capacity estimates so that you can requisition the type of processors and amount of hard disk space you'll need to implement the database you're modeling. Also in this phase, plan and test any distribution or replication schemes that you might need to employ. If you're not going to do the implementation yourself, pass this information to the production DBA who will implement your design.
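Several of these physical design steps can be sketched in a few lines of Python with the standard sqlite3 module. The example below is only an illustration under stated assumptions: the imprint and publication tables, the publication_type discriminator values, and the index names are all hypothetical, and a production design on another platform would differ in data types and syntax. It shows a foreign key that gets its own index, a subtype discriminator field, and a few of the flag fields named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
CREATE TABLE imprint (
    imprint_id   INTEGER PRIMARY KEY,
    imprint_name TEXT NOT NULL UNIQUE
);

-- Subtypes rolled up into one supertype table (hypothetical layout).
CREATE TABLE publication (
    publication_id   INTEGER PRIMARY KEY,
    imprint_id       INTEGER NOT NULL REFERENCES imprint (imprint_id),
    title            TEXT NOT NULL,
    -- Subtype discriminator: categorizes each row into one subtype.
    publication_type TEXT NOT NULL
        CHECK (publication_type IN ('BOOK', 'MAGAZINE', 'NEWSLETTER')),
    -- Flag fields for production and programming use.
    date_rec_added   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_update      TEXT,
    by_whom          TEXT,
    archive          INTEGER NOT NULL DEFAULT 0
);

-- Index the foreign key and a column users are likely to search by.
CREATE INDEX ix_publication_imprint_id ON publication (imprint_id);
CREATE INDEX ix_publication_title ON publication (title);
""")

conn.execute(
    "INSERT INTO imprint (imprint_id, imprint_name) VALUES (1, 'New Moon Books')"
)
conn.executemany(
    "INSERT INTO publication (imprint_id, title, publication_type) VALUES (?, ?, ?)",
    [(1, "Sample Book", "BOOK"), (1, "Sample Monthly", "MAGAZINE")],
)

# The discriminator lets you pull one subtype out of the rolled-up table.
books = conn.execute(
    "SELECT title FROM publication WHERE publication_type = 'BOOK'"
).fetchall()

# List the indexes that were created explicitly on the publication table.
index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list(publication)")
)
print(books, index_names)
```

Rolling the subtypes into one table is only one of the implementation options; separate subtype tables that share the supertype's key are an equally valid choice.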

Some people combine the physical design phase with the physical implementation, which is the result of all the data modeling, culminating in a live database. Many CASE tools don't differentiate between the phases. But don't skip any of the phases; each gives you an opportunity to check whether you've met the database project's requirements.

How Do You Create an ERD?

A preliminary step in creating an ERD is to analyze requirements thoroughly. You need to gather requirements before you can begin to model the data. Typically, you get this information by interviewing the clients. Ask questions. Make it clear that you need to understand the situation. If you're confused about how a process works or what a term means, you need to ask questions until you understand clearly. Your confusion might be echoing confusion in the organization, and unless the organization understands its processes and procedures, no one can develop a good data model.

After the analysis, you need to identify the major entities, define the entity properties (the attributes), and specify the relationships among the entities. If you begin to construct an ERD but you still aren't sure how one entity relates to another, review the requirements and do a deeper analysis. The ERD (and the resulting database) won't work right if you don't understand the project's purpose.

When you're constructing the ERD, keep in mind future developments and direction. A copy of the strategic plans for the corporation, the departments, and the groups you're working with can tell you where they want to be in 3, 5, and 10 years, and you can design accordingly. These plans can help you make the design and the database extensible and expandable. If you don't understand future requirements and directions, your database might be obsolete before you load the first row of data. For instance, if you're designing a database for a company that currently sells only on the wholesale market and you fail to discover that the company is planning to move into the retail market, you'll later need to modify the database to accommodate a retail sales model.

You construct the ERD from graphical components, mostly rectangles (which represent entities and their properties) and lines (which represent relationships and connect entities to entities), as Figure 1 illustrates. Some methodologies use diamond-and-line combinations to represent the relationships between entities, and ellipses to represent attributes. As a data modeling example, let's infer requirements for the Pubs database and create an ERD for it.

From the Client's Perspective

For this exercise, I've summarized a case study for Pubs, as follows: A large publishing house needs to better track its resources. First, the company wants to track sales by publisher imprint—what the outside world knows as the publisher of a book or magazine. This publishing company has eight imprints. Some imprints produce periodicals (e.g., monthly magazines and newsletters); others produce books and technical manuals. The company also wants to track which wholesale and retail outlets, by address, sold which titles. The periodicals are available on newsstands and by subscription, so the outlet for subscriptions is the subscription service that the company manages in-house. Based on the quantity of each title purchased, the company can discount the cost of each title to the outlets.

A sales-tracking mechanism would help the company determine how many of each title to print and the real cost per copy. The tracking system would also help the company's decision-makers decide whether to keep periodical subscription services in-house or to outsource to a third-party subscription service.

The publishing company also needs to keep track of its employees and who performs which jobs for which imprint. The company wants to be able to generate a list of names and job titles for any imprint on a moment's notice. This capability would let managers optimize work assignments and better fit each person's skills to the tasks at hand. And the system would make evaluating a person's current work skills and future areas for development easier.

With so many imprints and with authors writing for multiple imprints, the company needs to automate its system for managing authors and royalties. The system would also manage columnists who write ongoing articles for the periodicals, and track the timing of events in the production of periodicals and books.

Identify the Entities

The Pubs case study is typical of a situation you might encounter when you take on a data modeling job. The large questions you need to answer before you begin work are:

Purpose—what reason does the client give for wanting this new system?
Functionality—what does the client want this system to be able to do?
Events—what does the client plan to do with this system after it's delivered?
Outcomes—what are the client's long-term expectations about this system?

Remember that entity modeling doesn't record methods or activities, but knowing how people will use the data helps you to better describe and define the static data store: the database. You can create process-flow and data-flow diagrams, each of which will give you a clearer picture of how the data will be used. For any large data modeling project or even for small, complex projects, use these diagrams to help you understand the requirements long before you begin entity modeling. I'll create these diagrams in the next article in this series.

To identify the entities for your ERD, read the case study, looking for descriptive nouns and verbs. A descriptive noun describes an object (entity) that you want to capture and store in the database. A descriptive verb describes activities and interactions between nouns. A collaborative session with two or more data modelers and subject matter experts can help you better understand the situation and identify the entities.

When you're defining the entities, remember that you're identifying meta data—descriptions of data. For example, the Pubs database includes a list of publishers: New Moon Books, Binnet & Hardley, and so on. These could be imprints, all from the same publishing house. This list contains data values. What you want to identify is the meta data—publisher_name, in this case. It's easy to confuse real data with meta data, so always make sure you're not calling lists of values meta data.
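The difference between data and meta data is easy to see in a running database. The following sketch uses Python's standard sqlite3 module; the publisher table is a hypothetical, one-column stand-in rather than the actual Pubs schema. The column name comes back as a description of the result set (meta data), and the publisher names come back as rows (data).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE publisher (publisher_name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO publisher (publisher_name) VALUES (?)",
    [("New Moon Books",), ("Binnet & Hardley",)],
)

cursor = conn.execute(
    "SELECT publisher_name FROM publisher ORDER BY publisher_name"
)

# Meta data: the description of each column in the result set.
metadata = [column[0] for column in cursor.description]

# Data: the values stored under that column.
data_values = [row[0] for row in cursor.fetchall()]

print(metadata)
print(data_values)
```

When you list entities and attributes for the ERD, you're writing down items that belong in the first list, never the second.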

The case study contains the nouns and verbs that appear in Table 1. Now you can analyze each entity to make sure you understand what each is and how it relates to the entities around it. For instance, Periodicals, Magazines, Newsletters, and Books are closely related. They're all publications that are sold to the public through subscriptions or retail outlets. Each goes through a creation process, subject to editorial and publication tasks. Are they the same type of thing? This technique is called generalization: comparing different entities to see whether they are variants of the same thing. If they are, you designate an entity as the master or supertype entity. Because none of the four items (Periodicals, Magazines, Newsletters, Books) can describe all the others, create a supertype entity called Publication and designate the four as subtypes of Publication.
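If you think in code, generalization resembles pulling shared properties up into a base class. The sketch below uses Python dataclasses to mirror the Publication supertype and its four subtypes. It is an analogy only, not a database design, and every attribute shown (name, imprint, page_count, issues_per_year) is a hypothetical placeholder, because the attributes haven't been defined yet at this stage of the modeling.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    """Supertype: properties that every kind of publication shares."""
    name: str      # hypothetical attribute
    imprint: str   # hypothetical attribute


@dataclass
class Book(Publication):
    page_count: int        # hypothetical subtype-specific attribute


@dataclass
class Periodical(Publication):
    issues_per_year: int   # hypothetical subtype-specific attribute


@dataclass
class Magazine(Publication):
    issues_per_year: int   # hypothetical subtype-specific attribute


@dataclass
class Newsletter(Publication):
    issues_per_year: int   # hypothetical subtype-specific attribute


catalog = [
    Book(name="Sample Book", imprint="New Moon Books", page_count=320),
    Magazine(name="Sample Monthly", imprint="Binnet & Hardley", issues_per_year=12),
]

# Every subtype instance is also a Publication, so it can be handled as one.
all_are_publications = all(isinstance(item, Publication) for item in catalog)

# The four subtypes hang directly off the supertype.
subtype_names = sorted(cls.__name__ for cls in Publication.__subclasses__())
print(all_are_publications, subtype_names)
```

The four subtypes sit side by side under Publication because, as noted above, none of them can describe all the others.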

Further down the list is an entry called Titles. Does Titles relate to Publications? What exactly is a Title? Is it the same thing as a Book, or is it more like an article in a periodical? In the case study, you can see some ambiguity surrounding the word "title." The text implies that Title refers to the title of a book or the name of a magazine. You might need another conversation with the client to clarify the point. Title might be synonymous with book title or magazine name (Publication), and in many cases (magazine, compendium, collection), a Title might be composed of Articles. Let's assume that Titles is analogous to Publication and that a Title can consist of one or more Articles.
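The working assumption that a Title can consist of one or more Articles is a one-to-many relationship, which you can sketch as two tables joined by a foreign key. The example below uses Python's standard sqlite3 module; the table and column names (title, article, title_name, article_name) and the sample rows are hypothetical, and the final design might change after that follow-up conversation with the client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE title (
    title_id   INTEGER PRIMARY KEY,
    title_name TEXT NOT NULL
);

-- Each article belongs to exactly one title; a title can have many articles.
CREATE TABLE article (
    article_id   INTEGER PRIMARY KEY,
    title_id     INTEGER NOT NULL REFERENCES title (title_id),
    article_name TEXT NOT NULL
);
""")

conn.execute("INSERT INTO title (title_id, title_name) VALUES (1, 'Sample Monthly')")
conn.executemany(
    "INSERT INTO article (title_id, article_name) VALUES (?, ?)",
    [(1, "First Sample Article"), (1, "Second Sample Article")],
)

# Count the articles that make up each title.
articles_per_title = conn.execute("""
    SELECT t.title_name, COUNT(a.article_id)
    FROM title AS t
    JOIN article AS a ON a.title_id = t.title_id
    GROUP BY t.title_id, t.title_name
""").fetchall()

# An article that points at a nonexistent title should be rejected.
orphan_rejected = False
try:
    conn.execute(
        "INSERT INTO article (title_id, article_name) VALUES (99, 'Orphan Article')"
    )
except sqlite3.IntegrityError:
    orphan_rejected = True

print(articles_per_title, orphan_rejected)
```

Placing the foreign key on the article side is what makes the relationship one-to-many: many article rows can carry the same title_id, but each article row carries only one.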

In my next article, I'll describe how to determine the attributes for the entities we've defined here. Then I'll look at formalizing the relationships between these entities. Data modeling is as much an art as a science, and each situation might have many solutions. If you construct an ERD correctly, no solution is right or wrong, just better or worse. Ultimately, personal (or corporate) preference plus experience determines the final model.
