Semantic heterogeneity might sound like a meaningless bit of tech jargon, but it actually describes a subtle set of problems confronting many developers and DBAs. Anyone who's ever had to combine data sets from two different databases (and what DBA or database developer hasn't?) has probably encountered semantic heterogeneity: data that doesn't match in format or meaning and is therefore difficult to merge. Combining databases when temperature is recorded in Celsius in one database and Fahrenheit in the other, for example, can result in significant data corruption unless you deal with it in advance. Recovering from such data corruption is at best extremely time-consuming—at worst, it's a logical impossibility. This article examines the four types of semantic heterogeneity, looks at where they occur in database comparisons, and shows different ways to handle them.
Semantic Heterogeneity Types
Semantic heterogeneity is a complicated term for the phenomenon of disagreement about the meaning or interpretation of the same or related data. Sources for semantic heterogeneity can include differences in data-definition constructs, differences in object representations, and system-level differences in the way that atomic data (e.g., byte order for multibyte data, such as an integer) is stored in two systems.
Dr. Stefano Ceri and Dr. Jennifer Widom, two leading database scientists, identify four kinds of semantic heterogeneity, as Table 1 shows: naming conflicts, domain conflicts, metadata conflicts, and structural conflicts. Let's look at each briefly, using example databases A and B in Figure 1 for illustration.
Naming conflicts occur when two databases use different names to identify the same real-world entities. For example, one database schema might store an employee identifier in a field named Employee ID, but in another database schema, a field containing the same data could be called Employee_Number.
Domain conflicts occur when different databases use different values to represent the same concepts. In the central company database, an employee's identifier might be as123, for example, but the department database might list her as number 17. The temperature representation example, in which one database uses Fahrenheit and another uses Celsius, falls into this heterogeneity category. For some examples of the common types of domain conflicts, see Table 2.
Metadata conflicts occur when the same concepts are represented in one database at the schema level and in another database at the data level. Historical salary information about an employee, for example, can be stored either as columns of a table (at the schema level) or as rows of a separate table (at the data level). In database A in Figure 1, the last two salary values are stored as attributes; database B shows them as rows in a separate table.
Structural conflicts occur when different data organizations are used. In database B, a name is represented as one character string, whereas database A stores the name as two strings, one for the surname and one for all the other names. This kind of semantic heterogeneity also includes type conflicts, when different data types represent the same concept. For example, you can represent temperature as either a floating point or an integer type.
Where the Problems Occur
So where do such semantic heterogeneities occur when you're comparing databases? To answer this question, you need to consider the two levels at which databases are compared—the schema level and the data level.
Schema-level comparison. A schema-level comparison matches table names of one database with table names of the other. The matching tables are checked to ascertain whether their schemata are the same—that is, do they contain the same number of columns, and do the column names and domains match? This process also compares the constraints on the databases. The result of the schema-level comparison shows the DBA any differences between the compared databases at a structural level. It can show that a particular table that exists in two databases includes a certain column in one and not the other, as Figure 2 shows, or that the data type of a certain table column has changed. You can use these schema differences to synchronize the schemata of the two databases.
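You can perform a rough version of such a comparison yourself by querying the system catalog. The following sketch lists columns that exist in one database but not in the table of the same name in the other; it assumes two hypothetical databases named DatabaseA and DatabaseB on the same SQL Server instance:

SELECT a.TABLE_NAME, a.COLUMN_NAME, a.DATA_TYPE
FROM DatabaseA.INFORMATION_SCHEMA.COLUMNS a
LEFT JOIN DatabaseB.INFORMATION_SCHEMA.COLUMNS b
  ON a.TABLE_NAME = b.TABLE_NAME
  AND a.COLUMN_NAME = b.COLUMN_NAME
WHERE b.COLUMN_NAME IS NULL
ORDER BY a.TABLE_NAME, a.COLUMN_NAME

Running the query a second time with the database names swapped reveals the columns missing in the other direction. Note that, like the commercial tools, this query matches tables purely by name.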
Comparison tools—such as Embarcadero's Change Manager, SQL Compare from Red Gate Software, DB SynchroComp from E-Dule, or AdeptSQL's Diff—can quickly identify the tables whose equivalents are missing from one of the databases you want to examine. But comparison tools generally consider only the name of the table, so they can miss renamed tables. The reason a comparison tool doesn't try to match renamed tables automatically is that, although it's sometimes possible to guess the name of the equivalent table, the guess might not be reliable. Tables that are structurally identical yet represent different concepts might be synchronized, resulting in corrupted data. The last thing a DBA wants is to automatically synchronize data between two random tables whose schemata happen to be highly similar.
During schema-level comparison, your comparison tool can identify many semantic heterogeneity problems from the categories of naming conflicts, metadata conflicts, and structural conflicts. However, most of the domain conflicts can remain undetected. You can address these conflicts at the next database-comparison level.
Data-level comparison. Data-level comparison for one table in each database classifies rows into one of three categories:
- The row exists in the first database and has no match in the second database.
- The row exists only in the second database.
- The row exists in both databases and either it's identical or parts of the row are different—that is, some columns contain different values.
Note that this third classification is based on the assumption that the compared rows can be uniquely identified within a table, usually by a primary key. Generally, commercial data-comparison tools require the presence of a primary key on tables whose data you want to compare.
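When both tables share a primary key, you can produce this three-way classification yourself with a FULL OUTER JOIN. The following sketch assumes hypothetical Employee tables in DatabaseA and DatabaseB, keyed by EmployeeID and carrying a Salary column:

SELECT
  COALESCE(a.EmployeeID, b.EmployeeID) AS EmployeeID,
  CASE
    WHEN b.EmployeeID IS NULL THEN 'only in first database'
    WHEN a.EmployeeID IS NULL THEN 'only in second database'
    WHEN a.Salary = b.Salary THEN 'identical'
    ELSE 'different'
  END AS RowStatus
FROM DatabaseA.dbo.Employee a
FULL OUTER JOIN DatabaseB.dbo.Employee b
  ON a.EmployeeID = b.EmployeeID

A real comparison would compare every non-key column in the CASE expression, but the structure of the query is the same.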
You can use the differences the comparison tool identified to synchronize the databases at the data level. Figure 3 shows an example of such identified differences. But is the data or the database schemata correct after such a synchronization?
You can represent data values differently in two ways. The first is a structural difference. For example, you can represent a name as either one string or two strings (containing first and last names). Luckily, comparison tools can identify many of these structural representation differences during the schema-level comparison. Merging such data requires the DBA to specify ad-hoc conversion routines—for example, one that splits a name into two parts (first and last name) or joins the first and last names to form a full name.
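For the name example, such conversion routines are straightforward in T-SQL. The following sketch assumes hypothetical Employee tables: one storing FirstName and LastName separately, the other storing a single FullName in which the first space separates the first name from the rest:

-- Joining the two-string representation into one string
SELECT FirstName + ' ' + LastName AS FullName
FROM DatabaseA.dbo.Employee

-- Splitting the one-string representation into two strings
SELECT
  LEFT(FullName, CHARINDEX(' ', FullName) - 1) AS FirstName,
  SUBSTRING(FullName, CHARINDEX(' ', FullName) + 1,
    LEN(FullName)) AS LastName
FROM DatabaseB.dbo.Employee

The split direction already illustrates the risk discussed below: it silently assumes that the first space is the boundary, which fails for names such as "Mary Ann Smith."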
The second type of difference occurs when the interpretation of the values is different, as in Fahrenheit or Celsius. You can store such a value by using a floating-point data type such as real. In this case, automatic tools can't detect any differences between the schemata of two tables that contain such data. If you manually identify this difference, you need to provide a conversion function that changes the temperature value from one interpretation to the other. Because automatic comparison tools can't detect these semantic-heterogeneity problems, you need to create the conversion functions manually and add them to the synchronization or migration scripts.
Unfortunately, you can't always create a conversion function that works in both directions without data loss. If your database stores temperature values as integers, for example, converting a Fahrenheit value to Celsius and back can result in a different temperature value than the original. For example, 101F = 38.3333C, which rounds to 38C; however, 38C = 100.4F, which rounds to 100F, not 101F.
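You can see this loss directly in T-SQL. The following sketch converts an integer Fahrenheit value to Celsius and back, rounding at each step as an integer-typed column would:

DECLARE @f int, @c int, @f2 int
SET @f = 101
SET @c = ROUND((@f - 32) * 5.0 / 9.0, 0)   -- 38.33 rounds to 38
SET @f2 = ROUND(@c * 9.0 / 5.0 + 32, 0)    -- 100.4 rounds to 100
SELECT @f AS Original, @c AS Celsius, @f2 AS RoundTripped

The round trip returns 100, not the original 101, so a synchronization script that converts in both directions will slowly corrupt the data.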
Sometimes you can't provide such conversion functions at all. A string concatenation with unknown boundaries doesn't have an inverse. You can join the two strings "ab" and "cdef" to make "abcdef," but you can't reverse-engineer that to get the value of the original two strings if you have no information about where to split "abcdef."
Solutions to Common Problems
One way of handling semantic heterogeneity is to use strong data typing and associate the semantics of the data with the data type. Most database management systems (DBMSs) allow subtyping, making it possible to define a new data type for a temperature in Celsius, for example.
In SQL Server 2000, you can create user-defined data types (UDTs) by using the sp_addtype stored procedure:
EXEC sp_addtype CtemperatureInt, int, 'NOT NULL'
EXEC sp_addtype FtemperatureInt, int, 'NOT NULL'
Unfortunately, SQL Server automatically converts values of the above two data types when necessary; thus, you don't achieve strong typing by creating a UDT. However, the difference in the data type shows up during schema comparison. And SQL Server 2005 promises more flexible data-type support that could help achieve strong typing.
Using UDTs adds implicit documentation to your schemata by providing a more descriptive and specific name for the data. It also helps to achieve consistency in whether a field is nullable. A disadvantage of using such derived types is that you might have difficulty figuring out the base data type when you're looking at the name of a derived type. You can solve this problem by suffixing the type's name with the base type, as in the UDT code example above.
You can also identify some interpretation differences by looking at the data. Body temperature, for example, should have a range between certain sensible limits. A body temperature value of 100°C is certainly unacceptable in a database of medical records. In such cases, you can set up database integrity constraints. Constraints can help to detect when incorrect data is inserted into a table as a result of a table merge.
To set up a check constraint for a temperature column whose type is CtemperatureInt, you can include the following code in the table declaration:
temperature CtemperatureInt CHECK (temperature > 10 AND temperature < 43)
Inserting an incorrect value would result in an error, such as the one Figure 4 shows. Often, a schema is already given, such as during migration from legacy applications. Adding temporary check constraints can help to find problems in the migration scripts.
An alternative to check constraints is a trigger that considers the value of inserted or modified fields, then aborts the transaction when it encounters incorrect values. You could replace the preceding check constraint for single-row inserts with a trigger like the one Listing 1 shows. Check constraints generally perform better when most of the inserts are successful, and triggers perform better when most of the inserts contain invalid values.
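Such a trigger might look like the following sketch (an illustrative reconstruction, not the published Listing 1; the table name BodyTemperature is assumed):

CREATE TRIGGER trg_CheckTemperature
ON BodyTemperature
FOR INSERT, UPDATE
AS
IF EXISTS (SELECT * FROM inserted
  WHERE temperature <= 10 OR temperature >= 43)
BEGIN
  RAISERROR ('Temperature value out of range.', 16, 1)
  ROLLBACK TRANSACTION
END

Because the trigger inspects the inserted pseudo-table, this version also handles multirow inserts, rejecting the whole batch if any row is out of range.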
SQL Server 2005 will introduce data-definition language (DDL) triggers. SQL Server will be able to execute such DDL triggers automatically when database schemata change. This will give DBAs the means to enforce basic schema-related rules on the database. Such rules might include mandatory documentation, the use of certain data types, and the mandatory use of check constraints.
In many cases, not even check constraints or triggers help, because the values of two interpretations can have overlapping ranges. Two examples are the difference in floor numbering between the UK and the United States and the representation of money in different currencies. If you don't use derived data types to capture this mismatch in data semantics, good documentation of the database schemata is a DBA's last resort. Automatic tools don't detect these conflicts, so unless you know that two databases use the same schema and are maintained in exactly the same way, you should double-check the merging script your database-comparison tool provides and compare it with the databases' documentation.
No Easy Solutions
Semantic heterogeneity leads to complex problems that often don't have simple solutions. DBAs and database developers need to recognize when semantic heterogeneity occurs and what actions they can take to prevent, detect, and fix resulting problems. A good resource for more information about this topic is "Managing Semantic Heterogeneity with Production Rules and Persistent Queues" by Dr. Stefano Ceri and Dr. Jennifer Widom.