Data Modeling Q&A

Quirky field lengths, recommended reading, and more

In this issue, I'm letting my readers step into the spotlight. Mostly as a result of my six-part series on data modeling in the April through September issues, I've had wonderful feedback from you. Some readers asked how and others asked why. Still others wanted to share their way of doing things, which I deeply appreciate. Thank you one and all for your participation, and keep those questions and comments coming!

What is the reason for some of the field lengths you specify in your articles? In particular, I'm wondering about the EmpSSN Text(12) and Phone Text(30) in "The Physical Design," August 2000. Are you storing edit characters, and if so, why? It seems that edit characters would consume space unnecessarily and potentially contribute to data-quality problems.

The reason I use a char(12) for Social Security number and a varchar(30) for phone number is experience. Have you ever seen a Social Security number like 123-45-6789b? They exist. Your comment about potentially contributing to data-quality problems is right on (misplaced hyphens, for instance, could cause a comparison routine to assume that the same Social Security number listed with different hyphenation is two numbers), but the need to accommodate the suffixes is absolute in this case. I made EmpSSN a candidate key, so the data set must be unique, with no duplicates.

Also, although hard-disk storage requirements increase when you store the formatting characters, CPU use and memory requirements increase when you don't. If you don't store the format bytes, you have to insert them before every presentation and remove them before every process or write to disk. So you have to choose whether to put the load on disk storage or on the CPU and memory. I want to keep the load off the CPU and memory resources as much as I can, so I choose to increase the hard-disk capacity. I've done some calculations, and leaving in the format bytes for the EmpSSN field affects the blocking factor only slightly.

The reason for the phone number field length isn't quite the same. US and Canadian phone numbers might look like 303-123-4567, but European phone numbers look like 011-44-01234-123456, for example. (Don't believe a corporate chief who tells you that your company will never do business with a person or organization outside North America!) Building edit masks for varying formatting on phone numbers is tricky, and implementing them is time-consuming. The process entails storing the phone number's country of origin, making that field NOT NULL, then loading up the client application with code that checks the country code, then selects the proper phone number format mask based on that code. The result is a fat client application, and that application type isn't popular (or practical). And for application maintenance, storing the format bytes in the database is much simpler.

I'm an MCSD, MCDBA, and former MCSD instructor specializing in SQL Server who recently became an IT consultant. I feel capable of handling big projects despite my lack of real-world experience, but I'm always looking to learn more. Your article series intrigued me; I'm interested in the intricacies of database design and the process of moving from business requirements to the physical implementation. What books on this subject would you recommend I read?

You might find the following books helpful; I keep them on my desk and pull them out to reference when I need to consult the oracles (with a lowercase "o"):

Database Design for Mere Mortals by Michael J. Hernandez (Addison-Wesley, 1997) takes a hands-on, by-example approach to data modeling and database design. It's a beginner-level book that will initiate you gently into the field.
Modern Database Management, 5th edition, by Fred R. McFadden, Jeffrey A. Hoffer, and Mary B. Prescott (Addison-Wesley, 1999) is the text I use for my first-year graduate students. Not too heavy to be understandable, this book contains plenty of examples and good explanations of fairly complex concepts. In my classes, I use an academic version that comes packaged with Oracle8i Personal edition; I believe you can order this book only through a college bookstore.
Database Modeling and Design, 3rd Edition, by Toby J. Teorey (Morgan Kaufmann, 1999) is one of my favorites. If you're a data modeler with an engineering bent, this book will tickle your fancy.
The Data Model Resource Book by Len Silverston, W. H. Inmon, and Kent Graziano (Wiley, 1997) is invaluable for understanding why you should model a database to fifth normal form and what kind of trouble you can get into if you don't.
Systems Analysis and Design Methods, 4th edition, by Jeffrey L. Whitten, Lonnie D. Bentley, and Kevin C. Dittman (Irwin Professional Publishing, 1998) is the basic text on systems analysis and design concepts and techniques. The book presents a superset of database modeling and design and will give you a good view of how the database fits into the universe of a system.
The Data Warehouse Lifecycle Toolkit by Ralph Kimball, Laura Reeves, Margy Ross, and Warren Thornthwaite (Wiley, 1998) is a standard for dimensional modeling. It covers data warehousing, soup to nuts, and is a must-have if you're interested in the subject.

Do you know of any tools that help quantify the time you need to develop data models and the underlying physical database design and implementation? I'm currently defining the scope of a project that requires moving an application from a proprietary indexed sequential access method (ISAM) data format to a Web-based SQL Server application. The customer wants to convert the underlying database to a truly relational database while retaining most of the existing functionality. We're trying to generate time estimates so we can give the customer a reasonable, fixed-cost estimate. Our company has worked on similar projects, but those projects mostly involved converting from Microsoft Access to SQL Server. Can you give me any guidelines that might help make my estimate more accurate?

I don't know of any tools that will do exactly what you're looking for, although I just got a copy of Omni-Vista OnYourMark Pro from Omni-Vista (http:// www.omni-vista.com/products/index.htm), which looks promising. This package seems to be project-management software designed specifically for software and database development projects, and it has a pricing module that produces estimates based on time and resources. The results still depend on the user who enters the numbers, but the software has lots of checks and balances to make sure that your project is healthy (on time, on budget, on deliverables).

Software packages such as this can assist you and let you play out what-if scenarios, but software will never replace the knowledge gained from experience. I recommend that you enlist the aid of a grizzled veteran who knows systems integration and upgrade tactics and strategies.

How do composite primary keys slow down client/server applications? My understanding has been that if a composite has a clustered index on it and isn't excessively long, it's suitable for a primary key, even in client/server applications.

SQL Server can process numbers faster than it can process character data. Therefore, from a performance perspective, having keys, both primary and foreign, that are integer data types is better than having character or datetime data types. I also recommend that as many of these primary keys as possible be automatically incrementing values. The identity property ensures that you'll have a meaningless, unchanging, utterly simple primary key.

When you're creating the physical model, place yourself in a production environment and salt your model with a healthy dose of pragmatism. For example, in a perfect world, each salesperson would represent one of your vendor companies. But in the real world, you might meet a salesperson and want to record that salesperson's name and phone number in your database before you start buying from (and consequently add data about) the represented company. Also, salespeople change jobs; if there's little difference between the companies, you might prefer to maintain your business relationship with the salesperson. For example, if you constrain the relationship between Vendor and SalesPerson with a composite primary key of VendorID + PersonID, you can't enter the salesperson's information in the SalesPerson table because the associated company isn't in your database. You can't add a record to a table until you have the entire primary key. So, rather than restrict what you and your data users can do, you model your database to accommodate the world's imperfections.

Comments

Plain text