Data Mining Reloaded

The two main functions of data mining are classification and prediction (or forecasting). Data mining helps you make sense of those countless gigabytes of raw data stored in databases by finding important patterns and rules present in the data or derived from it. Analysts then use this knowledge to make predictions and recommendations about new or future data. The main business applications of data mining are learning who your customers are and what they need, understanding where the sales are coming from and what factors affect them, fashioning marketing strategies, and predicting future business indicators.

With the release of SQL Server 2000, Microsoft rebranded OLAP Services as Analysis Services to reflect the addition of newly developed data-mining capabilities. The data-mining toolset in SQL Server 2000 included only two classical analysis algorithms (Clustering and Decision Trees), a special-purpose data-mining management and query-expression language named DMX, and limited client-side controls, viewers, and development tools.

SQL Server 2005 Analysis Services comes with a greatly expanded set of data-mining methods and an array of completely new client-side analysis and development tools designed to cover most common business intelligence (BI) needs. The Business Intelligence Framework in SQL Server 2005 offers a new data-mining experience for analysts and developers alike.

Let's quickly review the data-mining process. Then we'll explore the seven data-mining algorithms available in the SQL Server 2005 Analysis Services framework and briefly look at the "plug-in" technology that can help you add new and custom algorithms to that framework. Although we couldn't specifically address the data-mining UI design here, the snapshots included in several examples will give you a good first look at the power and usability of the new client-side tools.

Mining the Data

The design and deployment of a data-mining application consists of seven logical steps. First, you prepare the data sources: Identify the databases and connection protocols you want to use. Next, you describe the data-source views—that is, list tables that contain data for analysis. Third, define the mining structure by describing which columns you want to use in the models. The fourth step is to build mining models. SQL Server 2005 gives you seven data-mining algorithms to choose from—you can even apply several methods in parallel to each mining structure, as Figure 1 shows. The fifth step is called processing—that's where you get the mining models to "extract knowledge" from the data arriving from the data sources. Sixth, you evaluate the results. Using client-side viewers and accuracy charts, you can present the patterns and predictions to analysts and decision makers, then make necessary adjustments. Finally, incorporate data mining into your overall data-management routine—having identified the methods that work best, you'll have to reprocess the models periodically in order to track new data patterns. For instance, if your data source is email and your models predict spam, you'll need to retrain the models often to keep up with evolving spammer tactics.

Here's a quick example of a useful mining model. Let's say you're interested in identifying major groups of potential customers based on census data that includes occupational, demographic, and income profiles of the population. A good method for identifying large, characteristic census groups is to use the Clustering algorithm. This algorithm segments the population into clusters so that people in one cluster are similar and people in different clusters are dissimilar in one or more ways. To examine those clusters, you can use a tool called Microsoft Cluster Viewer (a standard Analysis Services 2005 component). Figure 2 shows one of the four views, giving you a side-by-side comparison of all the clusters. For instance, Clusters 6 and 7 correspond to persons not on active military duty. But Cluster 7 represents people who work longer hours for more income; the top row also suggests that people in Cluster 7 are mostly married.

Prediction and Mutual Prediction

Suppose you've selected just one column (e.g., Income) in a huge data table, designated that column as Prediction target, and now you're trying to make some predictions. But you probably won't get far by looking at just one column. You can compute the statistical mean and the variance range, but that's about it.

Instead, select specific values for one or more other columns (e.g., Age, Years of Experience, Education, Workload in census data tables) and focus only on those data rows that correspond to the selected values. You'll likely find within this subset of rows that the values of the target column fall into a relatively narrow range—now you can predict the values in the target column with some degree of certainty. In data-mining terms, we say that those other columns predict the target column.

Figure 3 shows a snapshot the Dependency Network (DepNet) data-mining control. This DepNet is a diagram where arrows show which of the census columns predict which others. Some of the edges (nodes) have arrows pointing both ways; this is called mutual prediction. Mutual prediction between A and B means that setting values of A reduces the uncertainty in column B, but also the other way around—picking a value of B would reduce the uncertainty of A.

All Microsoft data-mining techniques can track prediction, but different algorithms make predictions in different ways. As we examine the other data-mining methods, we point out the prediction specifics of each method.

Decision Trees

Prediction is the main idea behind the Microsoft Decision Trees (DT) algorithm. The knowledge that a DT model contains can be represented graphically in tree form, but it could also appear in the form of "node rules." For example, in a census decision tree for Income, a rule such as (Gender = Male and 1 < YearsWorked < 2) could describe a tree node containing the income statistics for males in their second year on the job. This node corresponds to a well-defined subpopulation of workers, and you should be able to make fairly specific predictions with regards to their income. Indeed, one of the census models gave the following formula under the condition of (Gender = Male and 1 < YearsWorked < 2):

INCOME = 24804.38+425.99*( YRSSRV -1.2)
+392.8*(HOURS-40.2) + 4165.82*(WORKLWK-1.022) 
± 24897

According to this formula, INCOME is defined mostly by YRSSRV and weekly overtime. (Note that this is just an example and not based on representative census data.) To obtain this equation in a visually simple way, you could use the Decision Tree viewer to view the Income tree and zoom in on a node corresponding to the gender and yearsworked values of interest, as the typical snapshot in Figure 4 shows.

The rule and the formula we've discovered identify gender, years of service, years worked, weekly hours, and workload as predictors for income. Because YRSSRV, HOURS, and WORKLWK appear in the above formula for INCOME, they're also called regressors. A decision tree that hosts such predictive formulas is called a regression tree.

Time Series

The Time Series algorithm introduces the concept of past, present, and future into the prediction business. This algorithm not only selects the best predictors for a prediction target but also identifies the most likely time periods during which you can expect to notice the effect of each predicting factor. For example, having built a model involving monthly primary economic indices, you might learn that the expected Yen-to-USD currency conversion rate today depends most strongly on the mortgage rate of 2 months ago and is related to the industrial production index of 7 months ago and per capita income of 6 to 7 months ago.

Figure 5 shows a data-mining control called Node Legend that gives a graphical view of these dependencies. The long left-side blue bar next to Mort30 Yr (-2) indicates a negative correlation between Yen to USD and the mortgage rate 2 months ago—meaning that with time, as one value goes up, the other value goes down. The purple curve (for Yen to USD) and the yellow curve (for the mortgage rate) in Figure 6 offer a nice graphical representation of this opposing movement of rates. Smaller blue bars in Figure 5 indicate that the exchange rate is to some extent self-sustaining; indeed, they highlight the fact that the rate today correlates well with the Yen-to-USD rate a month ago (coefficient 0.656) and somewhat with the rate 2 months ago (coefficient -0.117). So, when refinancing to a lower rate, you might consider cashing out and investing in Yen-backed securities—but first, you need to look at the prediction variances (and of course keep mum about the entire scheme).

Clustering and Sequence Clustering

A new feature of Microsoft Clustering algorithms is their ability to find a good cluster count for your model based on the properties of the training data. The number of clusters should be manageably small, but a cluster model should have a reasonably high predictive power. You can request either of the clustering algorithms to pick a suitable cluster count based on a balance between these two objectives.

Microsoft Sequence Clustering is a new algorithm that you can think of as order-sensitive clustering. Often, the order of items in a data record doesn't matter (think of a shopping basket), but sometimes it's crucial (think of flights on an itinerary or letters in a DNA code). When data contains ordered sequences of items, the overall frequencies of these items don't matter as much as what each sequence starts and ends with, as well as all the transitions in between.

Our favorite example that shows the benefits of Sequence Clustering is the analysis of Web click-stream data. Figure 7 shows an example of a browsing graph of a certain group of visitors to a Web site. An arrow into a Web page node is labeled with the probability of a viewer transitioning to that Web page from the arrow's starting point. In the example cluster, news and home are the viewer's most likely starting pages (note the incoming arrow with a probability of 0.40 into the news node and the probability 0.37 arrow into the home node). There's a 62 percent probability that a news browser will still be browsing news at the next click (note the 0.62 probability arrow from the news node into itself), but the browsers starting at home are likely to jump to either local, sport, or weather. A transition graph such as the one in Figure 7 is the main component of each sequence cluster, plus a sequence cluster can contain everything an ordinary cluster would.

Naive Bayes Models and Neural Networks

These algorithms build two kinds of predictive models. The Microsoft Naïve Bayes (NB) algorithm is the quickest, although somewhat limited, method of sorting out relationships between data columns. It's based on the simplifying hypothesis that, when you evaluate column A as a predictor for target columns B1, B2, and so on, you can disregard dependencies between those target columns. Thus, in order to build an NB model, you only need to learn dependencies in each (predictor, target) pair. To do so, the Naïve Bayes algorithm computes a set of conditional probabilities, such as this one, drawn from census data:

Probability( Marital = "Single" | 
Military = "On Active Duty" ) = 0.921

This formula shows that the probability of a person being single while on active duty is quite different from the overall, population-wide probability of being single (which is approximately 0.4), so you can conclude that military status is a good predictor of marital status.

The Neural Networks (NN) methodology is probably the oldest kind of prediction modeling and possibly the hardest to describe in a few words. Imagine that the values in the data columns you want to predict are outputs of a "black box" and the values in the potential predictor data columns are inputs to the same black box. Inside the box are several layers of virtual "neurons" that are connected to each other as well as to input and output wires.

The NN algorithm is designed to figure out what's inside the box, given the inputs and the corresponding outputs that are already recorded in your data tables. Once you've learned the internal structure from the data, you can predict the output values (i.e., values in target columns) when you have the input values.

Association Rules

The Association Rules algorithm is geared toward analyzing transactional data, also known as market-basket data. Its main use is for high-performance prediction in cross-sell data-mining applications. This algorithm operates in terms of itemsets. It takes in raw transaction records, such as the one that Figure 8 shows, and builds a sophisticated data structure for keeping track of counts of items (e.g., products) in the dataset. The algorithm creates groups of items (the itemsets) and gathers statistical counts for them. For Figure 8's tiny sample record, the statistics would look like Figure 9.

One of the most important parameters of a model is a threshold for excluding unpopular items and itemsets. This parameter is called the minimum support. In the preceding example, if you set the minimum support to 2, the only itemsets retained will be <Bread>, <Milk>, and <Bread, Milk>.

The result of the algorithm is the collection of itemsets and rules derived from the data. Each rule comes with a score called a lift score and a certain support value larger than or equal to the minimum support. The lift score measures how well the rule predicts the target item. Once the algorithm finds the interesting rules, you can easily use them to get product recommendations for your cross-sell Web sites or direct-mail materials.

Third-Party Algorithms (Plug-Ins)

The seven Microsoft algorithms pack a lot of power, but they might not give you the kind of knowledge or prediction patterns you need. If this is the case, you can develop a custom algorithm and host it on the Analysis Server. To fit into the data-mining framework, your algorithm needs to implement five main COM interfaces:

The algorithm-factory interface is responsible for the creation and disposal of the algorithm instances.
The metadata interface ensures access to the algorithm's parameters.
The algorithm interface is responsible for learning the mining models and making predictions based on these models.
The persistence interface supports the saving and loading of the mining models.
The navigation interface ensures access to the contents of these models.

Some of these interfaces are elaborate and take getting used to, but implementation templates are available in the Tutorials and Samples part of the SQL Server 2005 documentation. After you implement and register your algorithm as a COM object, hooking it up to the Analysis Server is as easy as adding a few lines to the server configuration.

When the algorithm is ready and hooked up, its functionality immediately becomes available through the tools in the Business Intelligence Development Studio and SQL Server Management Studio. Analysis Server treats the new algorithm as its own and takes care of all object access and query support.

Dig In

Analysis Services 2005 represents a complete redesign of Microsoft's BI platform. Embracing .NET, XML for Analysis, and ADOMD.NET, it offers an array of powerful new algorithms, full-featured designers, and viewers. Even bigger news is how open and transparent the platform has become. With Analysis Services 2005's new client APIs, plug-in algorithm capabilities, server object model, managed user-defined functions (UDFs), and complete Microsoft Visual Studio integration, there's virtually no limit to what a motivated BI developer can do.

Comments

Plain text