
Data Mining in SQL Server 2000

Downloads
16175.zip

There could be gold in that database

Data mining is a term for the computer implementation of a timeless human activity: It is the process of using automated methods to uncover trends, patterns, and relationships from accumulated electronic traces of data. Data mining—or knowledge discovery, as it is sometimes called—lets you exploit an enterprise data store by examining the data for patterns that suggest better ways to produce profit, savings, higher quality products, and greater customer satisfaction.

With the release of SQL Server 7.0 in fall 1998, Microsoft stepped squarely into the maturing area of decision support and Business Intelligence (BI). SQL Server 7.0's OLAP Services provides a widely accessible, functional, and flexible approach to OLAP and multidimensional cube data query and data manipulation. SQL Server 2000, released in fall 2000, extends OLAP Services' capabilities, incorporating data-mining algorithms in its renamed Analysis Services.

Origins of Data Mining


To mine data, you need access to data, so it's no coincidence that data mining developed at the same time as data warehousing did. As computer power and database capability grew through the 1980s and 1990s, people began to see that data wasn't simply a passive receptacle, useful only for billing or order-entry functions. Data could also play a more proactive role, providing predictive value to guide a business forward. This notion led to the development of a new breed of computer systems that went beyond running the business (as early computer applications did) to informing and analyzing the business. These new systems were sometimes called decision-support systems or executive information systems (EISs). They were designed to harness growing computing power and improved GUIs to provide ad hoc analytical reports that could slice and dice data in novel ways, going well beyond earlier notions of static reporting. Slicing and dicing data—drilling down into detailed reports or zooming up to a 10,000-foot "big picture" view—required special ways of organizing data for decision making. This need gave rise to the data warehouse.

The term data warehousing was virtually unknown in 1990. Ten years later, data warehousing has become a multibillion-dollar business of capturing and organizing data to provide a proactive analytical (versus operational) environment that uses data in defining and guiding business activity. As data warehousing matured, decision support and EIS gave way to the more general concepts of BI and data mining.

BI involves organizing data along various potential dimensions of analysis so that you can cross-reference and display any view of data—say sales results—from within any number of other potential dimensions—say region or product line. The ability to move up and down dimensions lets you drill down into detail or zoom up for a more general view. The ability to show variations in data along various dimensions—often, many dimensions simultaneously—provides multidimensional reporting capability in realtime. This general approach to manipulating data became known as online analytical processing (OLAP)—that is, processing data for analytical purposes instead of operational purposes. The term online refers to having the analytical data continuously available. OLAP takes advantage of a data warehouse by making data continuously available in a form that supports analytical decision-support tasks. The distinguishing characteristic of OLAP is the preprocessing, indexing, and storage of data in various dimensional representations to quickly deliver the various dimensional views BI requires.

However, BI OLAP tools might not find all the patterns and dependencies that exist in data. OLAP cubes are appropriate for a limited amount of data exploration, involving major variations according to critical and known business dimensions. But when the dimensions change as the business changes or when you're exploring novel situations, data mining can be an extremely flexible and powerful complement to OLAP. Data-mining solutions are perfectly suited for sifting through hundreds of competing and potentially useful dimensions of analysis and associated combinations. All data-mining algorithms have built-in mechanisms that can examine huge numbers of potential patterns in data and reduce the results to a simple summary report. The BI OLAP and data-mining approaches to reporting on data belong together and are synergistic when deployed together. Microsoft recognized this synergy after it released SQL Server 7.0 and began a development program to migrate data-mining capabilities into the SQL Server 2000 release.

The most common data-mining techniques are decision trees, neural networks, cluster analysis, and regression. In preparing to release SQL Server 2000 and Commerce Server, Microsoft developed a substantial data-mining infrastructure and core data-mining algorithms to carry out decision-tree and cluster-analysis data-mining tasks. As part of the data-mining infrastructure, Microsoft created the OLE DB for Data Mining specification, an extension of OLE DB for OLAP that defines the data-mining infrastructure and COM interfaces that expose data-mining models and algorithms to data-mining consumers. OLE DB for Data Mining serves as a standard that external product vendors can use for delivering their data-mining functionality in the Microsoft environment.

The Data-Mining Development Approach


The Data Mining and Exploration group at Microsoft, which developed the data-mining algorithms in SQL Server 2000, describes the goal of data mining as finding "structure within data." As defined by the group, structures are revealed through patterns, which are relationships or correlations (co-relations) in data. So the Data Mining and Exploration group has captured the essence of data mining: Correlations produce patterns or associations that show the structure of data. Structure, when placed in a business context, can drive a business model—or even drive the business and improve its effectiveness in the marketplace.

The Data Mining and Exploration group's model of data mining is to deliver indicators of data structure through extensions of the data query process. Traditionally, you construct a query to retrieve particular information fields from a database and to summarize the fields in a particular fashion. A data-mining query is different from a traditional query in the same way that a data-mining model is different from a traditional database table. In a data-mining query, you specify the question that you want to examine (e.g., gross sales or likelihood of responding to a targeted marketing offer), and the data-mining query processor returns the results in the form of a structural model that answers the question.

The central object of Microsoft's data-mining implementation in SQL Server 2000 is the data-mining model. The Data Mining and Exploration group built several query wizards to facilitate the process of creating and interacting with the data-mining model so that end users need no query syntax. The OLE DB for Data Mining specification provides COM interfaces that can be accessed directly from a client application, however, so both end users and third-party applications can access data mining directly through query processing. The query in Listing 1 shows how to create a data-mining model to predict or classify age based on other attributes in the data set, such as gender, product name, or product type and quantity. Here, the client executes a CREATE statement that is similar to a CREATE TABLE statement. A full description of the language for creating and manipulating a data-mining model is contained in the OLE DB for Data Mining specification (http://www.microsoft.com/data/oledb).
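For readers who don't have the download at hand, here is a minimal sketch of what such a statement looks like under the OLE DB for Data Mining syntax. This is not Listing 1 itself: the model name, column names, and data types are illustrative, and the algorithm name follows the convention that SQL Server 2000 Analysis Services uses for its decision-tree provider.

    -- Define a mining model that predicts Age from the other attributes.
    -- KEY identifies the case; PREDICT marks the outcome column.
    CREATE MINING MODEL [Age Prediction]
    (
        [Customer ID]  LONG  KEY,          -- case identifier
        [Gender]       TEXT  DISCRETE,     -- input attribute
        [Product Name] TEXT  DISCRETE,     -- input attribute
        [Quantity]     LONG  CONTINUOUS,   -- input attribute
        [Age]          DOUBLE DISCRETIZED() PREDICT  -- outcome; the tree
                       -- provider buckets the continuous target into ranges
    )
    USING Microsoft_Decision_Trees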

Although the wizard-driven interface is the primary mechanism for accessing SQL Server 2000's data-mining query engine, clients and third-party applications can access data-mining models by using an OLE DB command object. After a data-mining model structure is built (either by wizard or directly), it is stored as part of an object hierarchy in the Analysis Services directory. The patterns—or structure within the data—are stored in summary form with dimensions, patterns, and relationships so that the predictive or classification power of the data will persist regardless of what happens to the original row-level data that the model is based on.

The Data Mining and Exploration group employed three broad strategies in developing SQL Server 2000 Analysis Services: self-service, integration of OLAP and data mining, and Universal Data Access (UDA). The group implemented the strategy of self-service primarily through task wizards that take you through the various steps involved in developing and using data-mining models. To integrate the OLAP and data-mining metaphors, the group decided to have the query-delivery mechanism deliver query results, whether OLAP-based or data mining-based, to the user through the same interface.

The OLE DB for Data Mining standard encapsulates the idea of a universal data-access mechanism that allows the sharing of data and data-mining results through heterogeneous environments with multiple applications. Microsoft developed OLE DB for tabular data sources but extended it for OLAP through OLE DB for OLAP in SQL Server 7.0 and for data mining through OLE DB for Data Mining in SQL Server 2000. With OLE DB for Data Mining, you can mine OLAP or relational data sources because the same OLE DB infrastructure supports both OLAP and data mining. Thus, the OLE DB for OLAP and OLE DB for Data Mining approaches reflect heterogeneous data access, a shared mining and multidimensional query storage medium, and a common interface for OLAP queries and data-mining queries.

The addition of data-mining capability in SQL Server 2000 represents a major new functional extension to SQL Server's capabilities. You can develop a data-mining model from either relational (standard tables) or dimensional (cube structures) sources. Microsoft extended the OLE DB data-access standard to provide generalized data-mining access mechanisms that include support for relational and dimensional sources as well as for third-party vendors' data-mining functionality. Analysis Manager contains several new wizards that facilitate interaction with data-mining models. (The OLE DB for Data Mining specification allows programmatic interaction.) And because SQL Server stores data-mining models as Decision Support Objects (DSO) in the same way that it stores dimensional cubes, you can use the same interface to service OLAP-style queries and data-mining queries.

OLE DB for Data Mining


OLE DB for Data Mining supports most popular data-mining algorithms. Using OLE DB for Data Mining, data-mining applications can tap into any tabular data source through an OLE DB provider, and you can perform data-mining analysis directly against a relational database. To bridge the gap between traditional data-mining techniques and modern relational database management systems (RDBMSs), OLE DB for Data Mining defines new concepts and features, including:

Data-mining model. The data-mining model is like a relational table, except that it contains special columns that you can use to derive the patterns and relationships that characterize the kinds of discoveries data mining reveals, such as which offers drive sales or the characteristics of people who respond to a targeted marketing offer. You can also use these columns to make predictions; the data-mining model serves as the core functionality that both creates a prediction model and generates predictions. Unlike a standard relational table, which stores raw data, the data-mining model stores the patterns discovered by your data-mining algorithm. To create a data-mining model, you use a CREATE statement that is very similar to the SQL CREATE TABLE statement. You populate a data-mining model by using the INSERT INTO statement, just as you would populate a table, and the client application issues a SELECT statement to make predictions through the model (the sketch after this list shows all three steps). A prediction identifies the fields that matter for a given outcome, such as sales or probability of response. The process of identifying the important fields that form a prediction's pattern is called training; the trained pattern, or structure, is what the data-mining model saves. After the mining engine defines the important fields and stores them in the model, the model can apply the same pattern to classify new data in which the outcome is unknown.

OLE DB for Data Mining is an extension of OLE DB that lets data-mining client applications use data-mining services from a broad variety of providers. OLE DB for Data Mining treats data-mining models as a special type of table. When you insert the data into the table, a data-mining algorithm processes the data and the data-mining model query processor saves the resulting data-mining model instead of the data itself. You can then browse the saved data-mining model, refine it, or use it to make predictions.

OLE DB for Data Mining schema rowsets. These special-purpose schema rowsets let consumer applications discover crucial information, such as available mining services, mining models, mining columns, and model contents. SQL Server 2000 Analysis Services' Analysis Manager and third-party data-mining providers populate schema rowsets during the model-creation stage, during which the data is examined for patterns. This stage is called learning or training: the algorithm examines the data to discern patterns so that the trained data-mining model can later recognize those patterns in new data.

Prediction join operation. To facilitate deployment, this operation, which is similar to the join operation in SQL syntax, maps to a join query between a data-mining model (which contains the trained pattern from the original data) and the designated new input data. This mapping lets you easily generate a prediction result tailored to the business requirements of the analysis (the sketch after this list shows the syntax).

Predictive Model Markup Language (PMML). The OLE DB for Data Mining specification incorporates the PMML standards of the Data Mining Group (DMG), a data-mining consortium (http://www.oasis-open.org/cover/pmml.html). This specification gives developers an open interface to more effectively integrate data-mining tools and capabilities into line-of-business and e-commerce applications.
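Pulling these concepts together, a minimal sketch of the full life cycle looks like the following. The table names, column names, and connection string are hypothetical; the statement forms (INSERT INTO against a model, then SELECT with PREDICTION JOIN) come from the OLE DB for Data Mining specification.

    -- Train the model: the source rows are consumed by the mining
    -- algorithm, and only the discovered patterns are stored.
    INSERT INTO [Age Prediction]
        ([Customer ID], [Gender], [Product Name], [Quantity], [Age])
    OPENROWSET('SQLOLEDB',
        'Data Source=MyServer;Initial Catalog=Sales;Integrated Security=SSPI',
        'SELECT CustomerID, Gender, ProductName, Quantity, Age
         FROM Customers')

    -- Predict: join the trained model to new cases whose Age is unknown.
    SELECT t.CustomerID, [Age Prediction].[Age] AS PredictedAge
    FROM [Age Prediction]
    PREDICTION JOIN
        OPENROWSET('SQLOLEDB',
            'Data Source=MyServer;Initial Catalog=Sales;Integrated Security=SSPI',
            'SELECT CustomerID, Gender, ProductName, Quantity
             FROM NewCustomers') AS t
    ON  [Age Prediction].[Gender]       = t.Gender
    AND [Age Prediction].[Product Name] = t.ProductName
    AND [Age Prediction].[Quantity]     = t.Quantity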

How the Data-Mining Process Looks


Data to be mined is a collection of tables. In an example I discuss later, you have a data object that contains a customer table that relates to a promotions table—both of which relate to a conference attendance table. This is a typical data-mining analysis scenario in which you use customer response to previous promotions to train a data-mining model to determine the characteristics of customers who are most likely to respond to new promotions. Through data mining, you first use the training process to identify historical patterns of behavior, then use these patterns to predict future behavior. Data mining accomplishes this prediction through a new data-mining operator, the prediction join, which you can implement through Data Transformation Services (DTS). DTS provides a simple query tool that lets you build a prediction package, which contains the trained data-mining model and points to a new data source that you want predicted outcomes for. For example, if you had trained a model to recognize a pattern that predicts likely customer response to a conference invitation, you could use DTS to apply this pattern to a new data source to see how many customers in the new data will likely respond. DTS's ready-made mechanism for deploying data-mining patterns provides a valuable synergy among data mining, BI, and data warehousing in the Microsoft environment.

The collection of data that makes up a single entity (such as a customer) is a case. The set of all associated cases (customers, promotions, conferences) is the case set. OLE DB for Data Mining uses nested tables—tables stored within other tables—as defined by the Data Shaping Service, which is part of Microsoft Data Access Components (MDAC). For example, you can store product purchases within the customer case. The OLE DB for Data Mining specification uses the SHAPE statement to perform this nesting.
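As a sketch of how that nesting looks in a training statement (the table and column names here are hypothetical), SHAPE builds the nested purchases table inside each customer case; the SKIP keyword tells the provider to ignore the foreign-key column, which is used only to relate the two rowsets:

    INSERT INTO [Customer Model]
        ([Customer ID], [Gender], [Age],
         [Product Purchases](SKIP, [Product Name], [Quantity]))
    SHAPE
        { SELECT CustomerID, Gender, Age
          FROM Customers ORDER BY CustomerID }
    APPEND
        ( { SELECT CustomerID, ProductName, Quantity
            FROM Sales ORDER BY CustomerID }
          RELATE CustomerID TO CustomerID )
        AS [Product Purchases]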

A significant feature of SQL Server 2000's data-mining functionality is ease of deployment. With DTS, you can easily apply the results of previously trained models against new data sources. The strategy is to make data-mining products similar to classic data-processing products so that you can manipulate, examine, extract data from, and deploy data-mining models in the same way as you would any table in a typical database. This approach recognizes that data mining, as organizations usually practice it, requires the data-mining analyst to work outside the standard relational database. When you mine outside the database, you create a new database, which leads to redundancy, leaves room for error, takes time, and defeats the purpose of the database. So, a major objective of SQL Server 2000 is to embed the data-mining capability directly in the database so that a mining model is as much a database object as a data table is. If this approach is widely adopted in the industry, it will eliminate significant duplication of effort in the creation of data warehouses that are built especially for data-mining projects. This approach will also eliminate the time needed to produce specialized data-mining tables and the potential threats to data quality and data integrity that the creation of a separate data-mining database implies. Finally, in-place data mining (as direct data-mining access to the native database is called) will eliminate the time lag that the creation of a specialized data table inevitably entails. As the demand for data-mining products and enhancements increases, this time factor may prove to be the element that finally leads to the universal adoption of data-mining functionality as an intrinsic component of a core database management system (DBMS).

Data-Mining Tasks That Analysis Services Supports


You can apply data mining to many different tasks, which fall into three basic categories: outcome models, cluster models, and affinity models. Outcome models (which Microsoft calls classification) can help you predict or classify an outcome based on one or more fields, or variables, in the data set. You use cluster models, sometimes called segmentation, to group similar cases together based on the shared values of many fields in a data set. Affinity models—including association, sequence, and deviation analysis—and dependency modeling typically show the relationship or sequencing between one field and another field. SQL Server 2000 Analysis Services provides two basic data-mining algorithms to support classification and clustering: decision trees and cluster analysis.

Outcome models with decision trees. Outcome modeling uses a set of input variables to predict or classify the value of a target, or response, variable (the outcome). The target variable can be categorical (having discrete values such as reply/did not reply) or continuous (having values such as dollar amount purchased). When the target is categorical, the task is called classification—a model that shows which combinations of the input variables you can use to reliably classify the target. When the target variable is continuous, the model is typically described as a regression model; regression is the most common type of analysis for predicting the values of a continuous target variable from the combined values of the input variables. For simplicity, Microsoft uses classification to mean both classification and regression trees. If this is confusing, remember that decision trees can predict or classify both discrete (a few distinct values) and continuous (a range of numeric values) outcomes.
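In model-definition terms, the difference shows up only in how you declare the target column. The following is a hedged sketch with illustrative names; note that whether a provider accepts a CONTINUOUS target directly or requires it to be DISCRETIZED() into ranges depends on the algorithm.

    -- Classification: the target takes a few discrete values.
    CREATE MINING MODEL [Invite Response]
    (
        [Invitee ID] LONG KEY,
        [Job Title]  TEXT DISCRETE,
        [EmploySize] TEXT DISCRETE,
        [Reply]      TEXT DISCRETE PREDICT   -- reply / no reply
    )
    USING Microsoft_Decision_Trees

    -- Regression-style: the target is a continuous dollar amount,
    -- declared DISCRETIZED() here because the SQL Server 2000 tree
    -- provider buckets continuous outcomes into ranges.
    CREATE MINING MODEL [Purchase Amount]
    (
        [Customer ID] LONG   KEY,
        [EmploySize]  TEXT   DISCRETE,
        [Amount]      DOUBLE DISCRETIZED() PREDICT
    )
    USING Microsoft_Decision_Trees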

Decision trees are a common and robust technique for carrying out predictive modeling tasks that have an outcome field to train on. Decision trees are easy to work with, produce a highly readable graphic display, and work well with both categorical and continuous data. Table 1 shows how you might arrange data to measure the response to an invitation to an IT conference. Although this data set is small, you would have a difficult time determining through visual inspection alone which of the attributes (columns) in the data set, if any, predict people's likelihood of accepting (replying to) the invitation. Imagine, for example, trying to determine what influences the probability of response in a response database of more than 10,000 records: Is it job title? Gender? Number of employees or size of sales? Spotting two-variable predictive relationships is difficult; spotting the combinations of relationships that produce a strong classification of the likelihood of replying is all but impossible.

Figure 1 shows a decision tree that reveals the predictive structure of the data: Company size (as measured by number of employees) seems to be the strongest predictor of attendance. The overall attendance rate (the number who responded to the invitation) is 40 percent. You can see that 75 percent of the invitees from large companies attended the conference, whereas only 27 percent of the invitees from small companies attended. So, employees of large companies are approximately three times as likely to attend the conference as those of small companies. However, this characteristic of small companies reverses when you consider sales income: In the two cases where sales income for small companies was less than $1 million, attendance is 100 percent. This number reveals that, in this sample, all employees from companies with less than $1 million in annual revenue attended the conference. I show these results as an example; you would never base results on such a small number of records unless you had completed substantial testing with other data sets to verify that this pattern repeats reliably in the target consumer population.

As Figure 1 shows, a decision tree works by collecting the overall data set (which is usually presented as the origin or root node of a decision tree at the top of the figure), and finding ways of partitioning the records or cases that occur in the root node to form branches. The branches are in the form of an upside-down tree, and the nodes at the ends of the branches are usually called leaves.

In a real-life targeted marketing task, you'd have many more attributes (columns) for each potential conference attendee and many more potential attendees. When the scale of the problem increases, manually assessing predictive characteristics is difficult, so automatic techniques become necessary. Even with sophisticated OLAP tools, assessing the predictive power of multiple attributes in combination remains difficult. The role of data mining and decision trees is to assess the combined predictive power of multiple attributes. You can get a sense of how this assessment works by examining Figure 1. Overall, 40 percent of the data set members responded to the offer. However, when you consider only small firms (EmploySize: Small) that have large sales (SalesSize: $1M+), the response rate drops to about 11 percent. This example shows how each attribute introduced into the analysis changes the estimated response rate, and the size of that change serves as an indicator of the model's predictive power. Decision trees scale well to fields with many values and to data sets with many records. Because of this scalability, decision trees can be extremely useful for a wide range of predictive modeling and classification tasks.
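One way to inspect a trained tree's structure outside the graphical viewer, according to the OLE DB for Data Mining specification, is to query the model's content rowset: each row describes a node of the tree, with its case count and outcome distribution. A sketch, assuming the hypothetical response model defined earlier:

    -- Browse the trained tree: one row per node (root, branches, leaves).
    SELECT NODE_CAPTION, NODE_SUPPORT, NODE_DISTRIBUTION
    FROM [Invite Response].CONTENT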

Segmentation (cluster analysis). Segmentation is the process of grouping, or clustering, cases based on their shared similarity across a set of attributes. Decision trees also find segments, but they determine the segments based on a particular outcome variable, such as conference attendance. The cases that fall on one branch of a decision tree form a cluster: the leaf at the end of the branch holds cases that share similar values (shown as numeric codes or string values) of the attributes that define the branch. For example, in Figure 1, the small, high-dollar-volume customers form a segment that has the lowest response rate of any segment in the decision tree. The branch identifies the shared similarity of the cases in the leaf and shows the choice that was made between the two outcomes in the node—in this example, Reply or No Reply. If no outcome variable exists, or if you want to see how observations group together in terms of their shared values across multiple variables, then cluster analysis is the technique to choose.

Cluster analysis forms groups of cases that are as homogeneous as possible on several shared attributes—such as height, weight, and age—yet are as different as possible when compared with any other clusters that are themselves homogeneous. For example, a cluster analysis might identify all tall, heavy, younger cases in one cluster and all short, light, older cases in another cluster. And cases that have similar purchasing or spending patterns form easily identified market segments toward which you can target different products. In terms of personalized interaction, different clusters can provide strong cues to suggest different treatments.

Several techniques have evolved over time to carry out cluster analysis; one of the oldest is K-means cluster analysis. In K-means cluster analysis, the user specifies the number of means (K) that will serve as the centers of the bins, or clusters, that hold the observations in the data set. The algorithm then allocates each case to the bin whose members it most resembles. Analysis Services employs a K-means, nearest-neighbor cluster analysis approach with randomly assigned initial means.
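A clustering model is declared the same way as a tree model; you simply name a different algorithm. The sketch below uses hypothetical names: CLUSTER_COUNT, where the provider supports it, fixes the number of clusters (K), and the Cluster() function defined in the OLE DB for Data Mining specification returns the segment a new case falls into.

    CREATE MINING MODEL [Customer Segments]
    (
        [Customer ID] LONG   KEY,
        [Height]      DOUBLE CONTINUOUS,
        [Weight]      DOUBLE CONTINUOUS,
        [Age]         LONG   CONTINUOUS
    )
    USING Microsoft_Clustering (CLUSTER_COUNT = 4)

    -- After training, ask which segment each new case belongs to.
    SELECT t.CustomerID, Cluster() AS Segment
    FROM [Customer Segments]
    PREDICTION JOIN
        OPENROWSET('SQLOLEDB',
            'Data Source=MyServer;Initial Catalog=Sales;Integrated Security=SSPI',
            'SELECT CustomerID, Height, Weight, Age FROM NewCustomers') AS t
    ON  [Customer Segments].[Height] = t.Height
    AND [Customer Segments].[Weight] = t.Weight
    AND [Customer Segments].[Age]    = t.Age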

Implementing OLE DB for Data Mining


OLE DB for Data Mining addresses both how applications use the data-mining interface and how the user interface (UI) is managed. The solution Microsoft provides in SQL Server 2000 allows several data-mining extensions to this interface and supports data-mining wizards, which guide the user through the data-mining activity. The OLE DB for Data Mining extension enables OLAP applications, user applications, and system services such as Commerce Server, as well as a variety of third-party tools and applications, to plug in to an interconnectivity capability. Figure 2 shows a general implementation scenario for OLE DB for Data Mining.

The OLE DB for OLAP and OLE DB for Data Mining specifications give wizards and third-party applications access to data-mining services. In addition to the OLE DB for Data Mining specification, key components of the data-mining environment include the DSO data-mining model and the Data Mining Engine, which comprises both the decision-tree and cluster-analysis algorithms. Analysis Services and any OLE DB for Data Mining-compliant third-party process can plug in to this environment to define, create, and manipulate data-mining models. Third-party functionality that uses the OLE DB for Data Mining interface can be published for use in this environment.

At a system level, Microsoft has extended the DSO model to support the addition of the data-mining model object type. Microsoft created server components to provide a built-in capability to use both OLAP and data-mining capabilities. This capability is a core defining feature of the new Analysis Services. On the client side, the implementation provides client OLAP and Data Mining Engine capabilities to exploit Analysis Services running on the server. The client provides complete access to both OLAP and data-mining models through the OLE DB for Data Mining specification.

Finally, by issuing the OLE DB for Data Mining specification, Microsoft provides a facility for third parties on both the server and client sides to use OLE DB for Data Mining COM interfaces to provide data-mining capabilities as plugins. This plug-in capability lets you add data-mining functionality in environments that conform to the OLE DB for Data Mining specification. Currently, several third-party tool and application vendors, notably the members of the Microsoft Data Warehousing Alliance, provide this kind of extensibility. Three data-mining providers are members of the Data Warehousing Alliance: Angoss Software, DBMiner Technology, and Megaputer Intelligence.

Microsoft designed its Data Warehousing Framework to unify BI needs and solution matching in one fast, flexible, and low-cost foundation. The Data Warehousing Alliance provides solutions in data extraction, transformation, and loading (ETL) tools; analytical applications; querying, reporting, and analysis tools; and data-mining providers. For more information about the Data Warehousing Framework and BI, see http://www.microsoft.com/industry/bi.

The Bottom Line


SQL Server 2000 will have a significant impact on promoting BI awareness and capability, just as data warehousing, decision-support systems, and OLAP did before it. As enterprises move to higher levels of capability in the capture, storage, and manipulation of data, they increasingly use data not only for standard reporting but also to analyze the direction of the business. OLAP approaches, introduced with SQL Server 7.0, and data-mining approaches, introduced in SQL Server 2000, are complementary and synergistic data-analysis capabilities. These technologies are now available in a common format, through a common interface, and with a common application interface specification to guide you in creating, using, and deploying data-analysis products. This groundbreaking development will lead to a generalized and unified view of dimensional reporting.
