Data Types Revealed

The data type concept is one of the concepts least understood by database practitioners. This misunderstanding is both a cause and a consequence of the failure of SQL and the commercial database management systems (DBMSs) based on it to implement relational domains, which are nothing but data types that can be as complex as you want. As a result, object DBMS proponents mistakenly claim that the relational approach doesn't support so-called complex types, and therefore object DBMSs are superior to relational DBMSs. In this article, I explain the data-type concept, the meaning of DBMS' support of data types, and the distinction between simple and complex types.

Data Type Basics

Databases are structured collections of facts about real-world entities. Consider, for example, the following facts about a company's employee: "Employee 160, named Pianka, works in department D111, was hired on 10/11/1972, earns a salary of $22,250." In natural language, this statement of facts is a logical proposition asserted to be true. (Table 1, page 54, defines some common database terms.)

The general form of propositions about employees is: "Employee with employee number (EMPNO) and name (ENAME), works in department (DEPTNO), was hired on date (HIREDATE), earns a salary of (SALARY)." This generalized form of proposition is called a multivalued predicate, and the terms in parentheses are value placeholders. Substituting specific values for the placeholders turns the predicate into specific propositions. In a relational database, each table has a predicate; the sets of values (representing propositions) are rows, and placeholders map to columns.

For propositions to be true, values substituted for placeholders must be valid according to business rules. Data types constrain database values to the valid ranges that the business rules define. Data types serve as named conceptual pools of permissible values for columns: To be valid, every database value must come from such a pool and, therefore, be of a type.

Two aspects of data types are poorly understood. One aspect is the important distinction between the type and its representation —how a DBMS encodes the data type's values. This misunderstanding stems in large part from the fact that SQL's DBMS types have only one user representation, which implicitly bears the same name as the data type (e.g., the DECIMAL representation of the DECIMAL data type). But data types can have multiple possible representations; for example, you can represent a temperature data type as degrees Fahrenheit, degrees Celsius, degrees Kelvin, or simply hot, warm, cool.

The second aspect involves the fact that, aside from representations, a data type also includes operators for every representation exposed to users. With the operators defined for a given type, users can manipulate the values of that type. Without operators, they can't. For example, exposing the Fahrenheit representation to users means providing operators for manipulating degrees Fahrenheit. At least two basic operators must be defined for every data type (although more operators are required for usefulness):

read: retrieves (specifies) a specific database value of that type
=: compares two values of that type

Thus, data types constrain values and operators.

Simple Data Types

Numbers and character strings are the most basic kinds of values, simple values. A numeric value has multiple digits and a character string consists of multiple letters, but these internal structures are usually of no interest to users. The operators for these data types insulate users from the internal structure. Users can apply operators, such as comparison (=, >, <), Boolean (NOT, AND, OR), arithmetic (+, -, *, /), or concatenate (|), to values represented as decimals or strings without having to know anything about the internal structure of values so represented. Furthermore, simple data types have generally agreed-on, well-defined representations and operators.

All DBMSs come with one or more system-defined (built-in) simple data types. Aside from the two basic operators, DBMSs support additional operators for simple data types. SQL Server's data types, for example, include DECIMAL, NUMERIC, FLOAT, REAL, INT, SMALLINT, TINYINT, CHAR, VARCHAR for which SQL Server provides a variety of func-tions (operators). The seven numeric types have the same functions but different value ranges, so they are distinct (otherwise, they would all be possible representations of one data type; for example, numbers). Likewise, the two character types are distinct.

User-Defined Data Types

Suppose that employee ID numbers consist of one character and two digits. An EMPNO data type would have a CHAR(3) representation and a permissible value range of A00-Z99. Although some functions applicable to character strings (=, <>) are also meaningful for EMPNO values, others (+, REVERSE) make no sense.

Thus, EMPNO has the same representation as, and some of the operators of, the system-defined CHAR data type, but because its valid value range and operator set are narrower, it is distinct from CHAR. Thus, all possible employee numbers (EMPNO) doesn't mean the same as all possible character strings of length three (CHAR(3)).

Unless users can define such a data type, the valid range of employee numbers in the database and the operators applicable to them will be those of the system-defined CHAR data type. Furthermore, all data types represented to users as strings of length three (e.g., DEPTNO) will be the same as the system-defined data type CHAR(3), even though they have different value ranges, operator sets, and meanings. This situation would impair the DBMS' ability to protect the integrity of both database content and manipulation in accordance with business rules. Because a large number of arbitrary rules is possible, and they all can't be system-defined, a DBMS needs to support user-defined data types.

The value ranges of user-defined data types must derive, in some way, from those of the system-defined data types. For simple user-defined data types such as EMPNO, the derivation is straightforward—it is a narrowed-down version of the ranges of simple system data types (i.e., CHAR). SQL Server lets users define their own simple user types by imposing constraints on system-type value ranges. Users can create the EMPNO data type, for example, with the ADDTYPE system stored procedure, and then bind a pattern rule to it as follows:

SP_ADDTYPE empno, CHAR(3), NOT NULL 
CREATE RULE empno_range
AS @VALUE LIKE '%\[A-Z\]\[0-9\]\[0-9\]'
SP_BINDRULE empno_range, empno

However, SQL Server doesn't let users define operator sets for user-defined types. The consequences are that (1) users can still apply system-defined functions that don't make sense to user types, and (2) functions that are meaningful for user-defined types but are not provided for the system types are not available to users. Users must implement such functions with application code.

Data-Type Support

Data-type support means that for each operator, the DBMS also supports the types for its meaningful results. For n data types, there are up to n-type possible operators and, thus, as many potentially meaningful results. This possibility is true for user-defined data types too, which also might require user-defined functions. Therefore, support of even simple user-defined data types isn't trivial for either DBMS implementation or database design. The implication of user-defined data type and user-defined function support is that DBMSs must be user-extensible.

Consider a proposition about the company's projects of the general form Project (PROJ#) has staff size (STAFF), where STAFF has an INT representation exposed to users, constrained to the range 0-450. The operators + and - are meaningful, but what about the / (division) operator? Project staffing ratios might make sense, but a ratio's results are of a type different from STAFF, possibly with a DECIMAL representation. Because all values (including results of operators) have types, the DBMS must also support this data type DBMS for the STAFF and the operator to work. If results have distinct value ranges and operator sets (e.g., a not-to-exceed business rule on the staffing ratios of certain projects), they are of a user-defined data type.

Complex Types

A data type is a set of values without any restriction on how the values are represented. Any representation is permissible, with an internal structure of arbitrary complexity, including text, images, video clips, sound recordings, geometric forms, and maps. But proper operators must be defined for these representations for the values to be accessible to users.

Suppose that a company uses employee fingerprints for security. The predicate of the EMPLOYEES table would be Employee with employee number (EMPNO) and name (ENAME) has fingerprint (EFP), works in department (DEPTNO), was hired on (HIREDATE), earns a salary of (SALARY), where values in the EFP column are fingerprint images (i.e., values of type FINGERPRINT), with graphic representation. Consider the basic = comparison operator for this data type. Checking an employee's fingerprint against the database would invoke this operator. Many graphic representations are possible; for example, a representation of two-dimensional arrays of pixels, where each pixel is represented by three numbers representing amounts of red, green, and blue.

In this context, users aren't interested in these details. Therefore, like its numeric counterparts, the = operator for FINGERPRINT would have to insulate users from the internal structure of FINGERPRINT values. It's clear that even though the operator is derived from simple type operators, the derivation would be much more complex than for the = operators for simple data types such as EMPNO.

Furthermore, agreement on operators other than the two basic operators for complex data types is not nearly as general as for simple types. Would users of fingerprints for different purposes (e.g., scientists studying fingerprints) agree on the operators other than comparison and read? How about a HOUSE data type with image representations for real estate applications? Would users in different segments of the real estate business or users of house images for different purposes agree?

And there's the rub: Operators insulate users from internal complexity by absorbing, or internalizing, it. Although any value can be designated, designations for complex types are usually long and complicated. Moreover, the problem with complex types is that different instances of such types of data require widely different types of processing (operators)—another way of saying that complex types are user-defined data types. Not only do complex types have complex operators and implementations, but there is no general agreement on the representations and operators across databases and users. These problems are a characteristic of complex data types and not unique to relational databases. It is largely because of these problems that SQL Server comes with the system-defined complex types TEXT, BINARY, IMAGE, VARBINARY but provides practically no operators for them and no way to derive user-defined types from them. You can't bind rules to the complex system data types.

Complex Data-Type Support

To support complex user-defined types, a DBMS must permit users to define value ranges and user-defined functions (operators). This requirement means that

DBMSs must be user-extensible.
User-defined functions, whether for simple or complex data types, usually require programming.
Users have access to user-defined data types only via the operators programmed into such types.
Complex user-defined functions can make things easy for end users but can be difficult for programmers to build.

Comments

Plain text