
Getting Started with Data Mining in SQL Server

How to use SSDT's Data Mining Model Designer and Excel 2013's data mining add-ins

As database professionals, we typically work in a field of exact science. For example, a common practice in business intelligence (BI) solutions is creating duplicate copies of data sets, then comparing the results from the different sources to make sure they're the same. If you extract five years' worth of data from an application's database and put it into a data mart, the results in the data mart must be the same as the results in the application's database, even if the table structures were changed and older records were archived. You might build a cube or semantic model and again check to make sure the results are exactly the same as the source system. If the numbers don't add up, the results are rejected because you know that something is wrong and must be corrected. I have to confess that not getting a conclusive result when working on a tough data problem sometimes keeps me up at night.

Additional Resources
Microsoft's official data mining resource site:
Microsoft SQL Server Data Mining
Free tutorials, video demonstrations, and presentations about data mining:
MarkTab Consulting
Free articles and white papers about data mining and related BI technologies:
SolidQ Journal

Data mining is a database technology that, by its very nature, might not yield conclusive, reliable results. I'm going to show you that taking the time to analyze the anomalies can be extremely useful and valuable in business applications. By putting aside the prerequisite of always needing to have a conclusive, correct answer, you might find some golden nuggets amid those uncertain predictions. That outcome is the nature of data mining.

A term commonly used to describe what most data mining applications do is predictive analytics. It's essentially a way of using sophisticated software to make a computer take a guess in the absence of a complete set of facts.

As a kid, I watched old movies and TV shows where computers were portrayed as human-looking robots or consoles with flashing lights that had electronic voices. When a person asked it a complex question, it would say something like "does not compute" or wave its arms around and then blow a fuse. In one case, it processed information for thousands of years, then simply said that the answer was 42.

Nowadays, computers use mining models that apply algorithms to supply missing variables that don't exist (or don't yet exist) as facts. Every model and algorithm offers a certain degree of statistical probability, and with recursive learning, seemingly random guesses can be replaced with more accurate variables that approach the actual results with higher confidence. As Dr. George Box, a statistician best known for pioneering time-series predictions, once said, "Essentially, all models are wrong but some are useful." (Sadly, Dr. Box passed away earlier this year after a long life dedicated to his craft of time-series predictions.)

Data Mining Architecture

Data mining is a core component of SQL Server Analysis Services (SSAS) 2012, baked into SSAS's multidimensional designer and delivery architecture. Mining structures are stored in the same database as SSAS analytical cubes, but the two share only a few of the project assets.

To define a data mining model in SQL Server Data Tools (SSDT), you need to create an SSAS multidimensional project, but you don't need to define any cubes or dimensions. A mining model can get its data directly from any data source or database table defined in the project's data source view, as Figure 1 shows.

Figure 1: Examining the Data Mining Architecture

A mining model relies on the same core data connectivity and Unified Dimensional Model definition objects as a multidimensional SSAS project. Within a project, data sources and data source views are foundational objects used to define tables and queries for data mining structures. In turn, a structure defines one or more tables as well as the columns that are used as keys, input attributes, and prediction outputs for each mining model within the structure.

Data Mining Tools

When data mining was first introduced, the only way to create and use a model was through the Business Intelligence Development Studio (BIDS), which was a database development tool rather than an application suited for data analysts. Several data-mining viewers were also developed so that a mining model could be viewed graphically, but all these viewers were baked into the development environment and not accessible to business users. Programmers could integrate some of these viewers into custom applications, but that wasn't done very often.

When Microsoft introduced two data mining add-ins (Data Mining Client and Table Analysis Tools) for Microsoft Excel 2007, data mining was brought to the business community. Many of the model viewers used in the development environment were integrated into the Excel add-ins, along with several features that use Excel's native charts, pivot tables, filters, slicers, and conditional formatting capabilities.

Since then, Microsoft has continued to provide tools that let business users do their own analyses. Data mining remains a core component of SSAS 2012, but the target audience for the design and delivery tools has shifted from IT developers to business users, with Excel as the delivery vehicle. The latest data mining add-ins for Excel 2013, introduced with SQL Server 2012 SP1, bring several enhancements. Business users can use them to create and consume data mining models and to perform advanced predictive analyses.

A Guided Tour

In the following short tour, I'll introduce you to the Data Mining Model Designer in SSDT and the data mining add-ins for Excel 2013. If you want to follow along, I provided a sample database that I derived from real data obtained from the National Oceanic and Atmospheric Administration (NOAA). The database contains weather observations and climatic events—including tornados, hurricanes, tsunamis, earthquakes, and volcanoes—that have occurred over the past 40 years. It's more interesting to work with real information, but I make no guarantee about the accuracy or reliability of this data, so you shouldn't use it as the basis for making any decisions.

To follow along, you need to have:

  • The Developer or Enterprise edition of SQL Server 2012 SP1, with the relational database engine, SSAS in multidimensional storage mode, and the client tools installed either locally on a single development machine or on a server to which you have administrative access
  • An SSAS instance (installed locally on a single development machine or a server) on which you have permission to create databases and objects
  • Access to a SQL Server relational instance that can read and process data for the mining structures
  • Excel 2013 (32 bit or 64 bit) installed

In addition, you need to:

  • Download and install the Microsoft SQL Server 2012 SP1 Data Mining Add-ins for Microsoft Office from the Microsoft Download Center.
  • Download and restore the sample Weather and Events database by clicking the Download the Code button near the top of the page.

My instructions assume that you know how to restore a database, create database objects, and set permissions using SQL Server Management Studio (SSMS).

Using SSDT's Data Mining Model Designer

In the following example, I'll walk you through creating a data mining project in SSDT. The data mining structure and data mining model that you'll create and explore will deal with tornado data from the U.S. states that are in "Tornado Alley," a region known for a high number of seasonal tornados. Those states are:

  • Kansas (KS)
  • Missouri (MO)
  • Nebraska (NE)
  • Oklahoma (OK)
  • South Dakota (SD)
  • Texas (TX)

Step 1: Create a New Data Mining Project

The first step is to create a new data mining project. To do so, open SSDT, select New on the File menu, and choose Analysis Services Multidimensional and Data Mining Project. Name both the project and the solution Weather and Events.

Step 2: Prepare the Data

The next step is to prepare the source data by simplifying, grouping, aggregating, and cleansing it. Don't underestimate the importance of this step. Data preparation is usually an iterative process. Start with small and simple sets of data. Create views or transform source data into separate tables, and don't be afraid to create multiple sets of data in different structures. Some mining models work best with values in separate columns, whereas other mining models work better with different attribute values in the same column. For ongoing analyses and complex data sources, your solution might need to include an extraction, transformation, and loading (ETL) process using SQL Server Integration Services (SSIS) packages.
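To make this concrete, here's a minimal T-SQL sketch of the kind of pivoting view this step calls for: one row per year, one column per state. The dbo.Tornado source table and its columns are hypothetical illustrations, not the sample database's actual schema.

  -- Sketch only: assumes a hypothetical dbo.Tornado(EventDate, State) table
  -- with one row per tornado; the sample database's real schema may differ.
  CREATE VIEW dbo.vw_Sketch_TornadosByYearByState
  AS
  SELECT YEAR(EventDate) AS [Year],
         COUNT(CASE WHEN State = 'KS' THEN 1 END) AS KS,
         COUNT(CASE WHEN State = 'MO' THEN 1 END) AS MO,
         COUNT(CASE WHEN State = 'NE' THEN 1 END) AS NE,
         COUNT(CASE WHEN State = 'OK' THEN 1 END) AS OK,
         COUNT(CASE WHEN State = 'SD' THEN 1 END) AS SD,
         COUNT(CASE WHEN State = 'TX' THEN 1 END) AS TX
  FROM dbo.Tornado
  GROUP BY YEAR(EventDate);

A shape like this, with each attribute in its own column, suits the time series model you'll build later; other model types, such as clustering, might instead want the state values left in a single column.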

The data preparation for this sample project has been completed for you. The views I've created in the Weather and Events database include data transformation logic, so this data is in the correct format for the analyses you'll perform.

Step 3: Add the Data Source to the Project

At this point, you need to add the Weather and Events database as a data source in your project. In SSDT's Solution Explorer, right-click the Data Sources folder and select New Data Source to start the Data Source Wizard.

In the Data Source Wizard, click Next, then New to add a new data source. In the Connection Manager dialog box, connect to the relational database server and select the Weather and Events database, as Figure 2 shows. Click OK in the Connection Manager dialog box, then click the Next button.

Figure 2: Adding the Weather and Events Database as a Data Source

In the Impersonation Information page in Figure 3, you need to specify how authentication should occur when a mining structure is processed. If you're using a local instance of SQL Server in a development environment and you're a local administrator, choose the Use the service account option. When you select this option, the service account credentials for the Analysis Services service are used to authenticate the SQL Server connection. If you have experience with service data connections, you've probably encountered this type of authentication before. If not, take the time to understand how it works, because it's used in many types of SQL Server database solutions not specifically related to data mining.

Figure 3: Selecting the Type of Authentication to Use

Note that a successful test connection in the Connection Manager dialog box doesn't guarantee a successful connection when the structure is deployed, because the test connection is running on the developer machine and the processing actually runs on the server. If the processing fails, you'll want to return to the data source impersonation settings and try one of the other authentication options or grant permission to the database in SSMS.
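If you do need to grant that permission, a minimal T-SQL sketch follows. The service account name is an assumption (the built-in account for a default SSAS instance); substitute the account your SSAS service actually runs under.

  -- Assumption: SSAS runs as the built-in account for a default instance.
  CREATE LOGIN [NT Service\MSSQLServerOLAPService] FROM WINDOWS;
  GO
  USE [Weather and Events];
  GO
  -- Map the login to a database user and grant read access.
  CREATE USER [NT Service\MSSQLServerOLAPService]
      FOR LOGIN [NT Service\MSSQLServerOLAPService];
  ALTER ROLE db_datareader ADD MEMBER [NT Service\MSSQLServerOLAPService];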

After you select your authentication method in the Impersonation Information page, click the Finish button. In the next page, accept the default data source name and click Finish to add the data source and close the Data Source Wizard.

Step 4: Add the Views

As I mentioned previously, the Weather and Events database already includes the views for this sample project. To add the views to your project, right-click the Data Source Views node in Solution Explorer and choose New Data Source View.

When the Data Source View Wizard appears, click the Next button three times so that you're on the Select Tables and Views page. In the Available objects list on this page, select the six objects highlighted in Figure 4, then click the top-most button between the two list boxes to move the selected views to the Included objects list. Click Next, then click Finish on the following page to add the views and close the wizard.

Figure 4: Adding the Views

Each table in the data source view must have a logical primary key. Two of the views used to define the tables in the data source view don't have these keys, so you need to configure them in the Data Source View window. For the vw_TornadosByYearByState view, right-click the Year field and choose Set Logical Primary Key, as shown in Figure 5. Do the same thing to set the YearMonth field as the logical primary key for the vw_TornadosByMonthByState view.

Figure 5: Setting the Logical Primary Key for One of the Data Source Views

Step 5: Create a Data Mining Structure

You're now ready to create a data mining structure that will have one new mining model. Right-click the Mining Structures node in Solution Explorer and select New Mining Structure. When the Data Mining Structure Wizard appears, click Next twice so that you're on the Create the Data Mining Structure page. As Figure 6 shows, there are nine mining model algorithms included in the Microsoft data mining framework. Each algorithm applies a unique set of mathematical formulas, logic, and rules to analyze data in the mining structure. Think of each as a separate black box, capable of analyzing a set of data and making predictions in different ways. This sample project uses the Microsoft Time Series algorithm, so select that algorithm from the drop-down list, then click Next twice to go to the Specify Table Types page. In the Input tables list on this page, select the Case check box for the vw_TornadosByYearByState view and click Next.

Figure 6: Selecting the Mining Model Algorithm

In the Specify the Training Data page in Figure 7, you need to select the columns you want to use to train the model. When you train a model, an initial analysis is performed with a set of data, and the information about the discovered patterns and predictions is stored. This information can then be used to test the patterns in another set of data. In a production-scale solution, you would normally split the data into training and testing sets. This is an important concept in data mining, but it warrants a longer discussion, so put this on your list of things to look into after you learn the basics.

Figure 7: Specifying the Training Data
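As a pointer for that later study: DMX (Data Mining Extensions), the query language for SSAS data mining, can express the split directly when a structure is created in code. Here's a sketch with hypothetical names, separate from this walkthrough:

  // DMX sketch: reserve 30 percent of the cases for testing trained models.
  CREATE MINING STRUCTURE [Example Structure] (
      [Id] LONG KEY,
      [Amount] DOUBLE CONTINUOUS
  ) WITH HOLDOUT (30 PERCENT);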

In this case, you're going to train the model using tornado information from the states in Tornado Alley. So, in the Mining model structure table, select both the Input and Predict check boxes for each of the following columns: KS, MO, NE, OK, SD, and TX. Be sure to leave the Year column set as the Key. Click Next.

In the Specify Columns' Content and Data Type page, change the data type for the KS, MO, NE, OK, SD, and TX columns from Long to Double, because the time series algorithm works best with a floating-point data type. (It might return errors with long integer values.) Leave the Year column set to Long. Click Next.

In the Completing the Wizard page, you need to give the mining structure and mining model appropriate names. The mining structure will become the container for multiple models, and each model uses a specific model algorithm that should be incorporated into the name. The name of the structure should also reflect the name of the table or view on which it's based.

For this example, modify the default names so that the mining structure is named Tornados By Year By State and the mining model is named Time Series - Tornados By Year By State. Click Finish to create the data mining structure.
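If you're curious what the wizard just built, here's roughly the equivalent DMX, written as a sketch (the wizard doesn't emit this script, but the structure, model, columns, and algorithm match the walkthrough):

  // The structure holds the data definition; Year is the time key.
  CREATE MINING STRUCTURE [Tornados By Year By State] (
      [Year] LONG KEY TIME,
      [KS] DOUBLE CONTINUOUS,
      [MO] DOUBLE CONTINUOUS,
      [NE] DOUBLE CONTINUOUS,
      [OK] DOUBLE CONTINUOUS,
      [SD] DOUBLE CONTINUOUS,
      [TX] DOUBLE CONTINUOUS
  );

  // The model lives inside the structure; each state column is both
  // an input and a prediction target.
  ALTER MINING STRUCTURE [Tornados By Year By State]
  ADD MINING MODEL [Time Series - Tornados By Year By State] (
      [Year],
      [KS] PREDICT,
      [MO] PREDICT,
      [NE] PREDICT,
      [OK] PREDICT,
      [SD] PREDICT,
      [TX] PREDICT
  ) USING Microsoft_Time_Series;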

Step 6: Process and Explore the Mining Structure

With the mining structure created, it's time to process and explore it. On the Mining Models tab in the Data Mining Model Designer, right-click the Microsoft_Time_Series box and select Process Mining Structure and All Models, as Figure 8 shows.

Figure 8: Choosing the Option to Process the Mining Structure and Its Model

You'll be prompted to update and process the model. Accept all the prompts. When the Process Mining Structure dialog box opens, click the Run button. In the Process Progress dialog box, you can expand the nodes in the tree view to see the details while the structure and model are being processed, as shown in Figure 9. When the Status box displays Process succeeded, click the Close buttons in the Process Progress and Process Mining Structure dialog boxes.

Figure 9: Watching the Progress in the Processing of the Mining Structure and Its Model

Right-click the Microsoft_Time_Series box and select Browse Model. Click Yes to build and deploy the model and to update any objects.

Figure 10: Displaying Historical and Predicted Tornado Data by Year for All the States in Tornado Alley

When the Mining Model Viewer is displayed, you'll see a line chart like the one in Figure 10, which shows historical and predicted tornado data by year for the states in Tornado Alley. Specifically, it shows the number of tornados (as a percentage of deviation from a baseline value) in each state from 1973 through 2011, with predictions for five more years. The first thing you're likely to notice is a rather tall spike prediction for Kansas. We know this prediction is wrong: the model was forecasting the future from 2011, and there wasn't roughly a 5,000 percent increase in tornados (i.e., nearly 500 tornados) in Kansas in 2012. This brings us back to Dr. Box's statement that "all models are wrong but some are useful." This one is neither correct nor useful. I'll deal with it a little later. For now, clear the check box next to KS. As you can see in Figure 11, the projected trend is much better.

Figure 11: Displaying Historical and Predicted Tornado Data by Year for the States in Tornado Alley When Kansas Isn't Included

Next, clear all the check boxes, except for SD, which will isolate the results for South Dakota. Use the Prediction steps option to increase the prediction steps to 25. Notice that you're now projecting future tornado patterns 25 years into the future, to the year 2036. It's important to note that unless there's a very strong and regular pattern in the historical data, the time series algorithm might not be accurate beyond a few periods. However, looking at several periods will help you spot a predicted pattern and verify that the time series algorithm is doing its job.
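You can also request those prediction steps yourself. Here's a DMX sketch you could run in an SSMS DMX query window against the deployed model; it asks for the same 25 future values for South Dakota that the viewer charts:

  // Returns 25 forecast periods for the SD series; FLATTENED turns the
  // nested result into a simple rowset.
  SELECT FLATTENED
      PredictTimeSeries([SD], 25) AS [Forecast SD]
  FROM [Time Series - Tornados By Year By State];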

Check the Show Deviations box to display the range of confidence in the accuracy of the predicted values. Figure 12 shows the results. South Dakota has had a fairly regular pattern of tornado activity from 1973 to 2011, which gives the time series algorithm a lot to work with. Even if you were to move the line to the upper or lower end of the deviation range, you could still see the predicted pattern.

Figure 12: Displaying the Range of Confidence in the Accuracy of the Predicted Values for South Dakota

Now, back to Kansas. Remember the big spike predicted for 2012? Clearly, the time series algorithm is having problems making a prediction with this data when using the default settings. This scenario is actually very common, and you just need to offer some guidance to get it on the right track.

Every one of the nine Microsoft data mining algorithms has its own set of parameters. These are the knobs and switches that control the behavior of the complex mathematical processes and rules used to make predictions. The details warrant a deeper discussion than I can give here; many of these settings are covered in depth in the book Data Mining with Microsoft SQL Server 2008 (Wiley Publishing, 2009) by Jamie MacLennan, ZhaoHui Tang, and Bogdan Crivat. Adjusting these settings can make a model work well or make it go crazy. I encourage you to experiment with different settings by making a change and reprocessing the model. It can be time consuming, but it's an important part of creating a useful data mining solution.

For this project, switch to the Mining Models tab, right-click the Microsoft_Time_Series box, and select Set Algorithm Parameters. Note that the default settings for the MAXIMUM_SERIES_VALUE and MINIMUM_SERIES_VALUE parameters are huge numbers. When these limits are left unconstrained, the algorithm blows a fuse and gives crazy results. Change MAXIMUM_SERIES_VALUE to 200 and MINIMUM_SERIES_VALUE to 0, then click the OK button to save the settings.
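In DMX, the same parameters would be supplied when a model is added to its structure. A sketch, with a hypothetical model name so it doesn't collide with the one you already created:

  // Sketch: the capped model name is hypothetical.
  ALTER MINING STRUCTURE [Tornados By Year By State]
  ADD MINING MODEL [Time Series - Tornados Capped] (
      [Year],
      [KS] PREDICT,
      [MO] PREDICT,
      [NE] PREDICT,
      [OK] PREDICT,
      [SD] PREDICT,
      [TX] PREDICT
  ) USING Microsoft_Time_Series (MAXIMUM_SERIES_VALUE = 200, MINIMUM_SERIES_VALUE = 0);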

Reprocess and browse the model. This time the prediction results for KS are in a moderate range. If you increase the number of prediction steps, you'll see that the model seems to be making a reasonable set of predictions for annual tornado counts for the next 25 years. However, if you select the Show Deviations check box, you'll see that the algorithm has very little confidence in its ability to make a prediction with the information provided, as Figure 13 shows.

Figure 13: Displaying the Range of Confidence in the Accuracy of the Predicted Values for Kansas

Why can't this model predict the future of tornado activity in Kansas? I posed this question to Mark Tabladillo, who does a lot of work with predictive modeling and statistical analysis. He said, "Typically, we do not get 'whys' in data mining." It's often necessary to create multiple models with different filters and variables to validate a pattern and a reliable prediction. The desire to explain "why" is human nature, but a scientific explanation might not always be possible. According to Tabladillo, "Correlation and causality are different, and most data mining results are correlation alone. Through time and patience, we can make a case for causality, though people, from academics to news reporters, are tempted to jump to a causal conclusion, either to project that they have done that requisite homework or simply to be the first mover-of-record."

In this case, it might be that Kansas doesn't have a strong fluctuating pattern of annual tornado counts like South Dakota does. Keep in mind that, so far, you're considering only the absolute count of all tornados in each state, aggregated over a year. You're not considering other attributes such as each tornado's category, strength, or duration or the damage caused by each tornado. This information is in the data and can be used to create more targeted models.

Using the Data Mining Add-Ins for Excel 2013

I'm looking out my office window at Mount St. Helens, here in Washington State. Thirty-three years ago I watched it erupt, and I remember the events leading up to the eruption. I've had a fascination with volcanos and earthquakes ever since. During the evening news, before and shortly after the eruption, the United States Geological Survey (USGS) would report the location and characteristics of the earthquakes that it studied in its effort to learn more about what was going on with the mountain and perhaps other volcanos in the region.

In the Weather and Events database, a view named vw_EarthquakesAndVolcanos contains information about volcanic eruptions that have occurred since 1973 and every earthquake that occurred up to 30 days prior and within 500 kilometers of each eruption. I'll show you how to use the Excel data mining add-ins to analyze the potential association between volcanos and earthquakes by looking at how many days each earthquake occurred before each volcano eruption, as well as its depth, magnitude, and distance from the volcano.

Step 1: Enable the Excel Data Mining Add-Ins

Before the Excel data mining add-ins can be used to generate mining models, a feature must be enabled on the SSAS server. Open SSMS and connect to the SSAS instance you're using. (If you're running SQL Server locally on a single development machine, you can enter LocalHost for the Server Name in the connection dialog box.) Right-click the name of the SSAS server or instance in Object Explorer and choose Properties. On the General page of the Analysis Server Properties dialog box, find the property named DataMining\AllowSessionMiningModels. As Figure 14 shows, change this property to true, then click OK to save the setting.

Figure 14: Enabling the Excel Data Mining Add-Ins

Open Excel 2013 and verify that the data mining add-ins have been installed and enabled by making sure that there's a tab labeled DATA MINING. As Figure 15 shows, this tab includes many ribbon buttons organized into groups.

Figure 15: Exploring the DATA MINING Tab

Figure 16: Connection Group with No Default Connection Set

Note that when I created Figure 15, I had already set up a default connection. When a default connection isn't configured, the Connection group shows a button like the one in Figure 16. You can specify a default connection by clicking that button and entering the requested information in the dialog box that appears.

Step 2: Add the Data Source

The next step is to add the data source, which is the Weather and Events database in this case. In Excel, place your cursor in the top-left cell of a blank sheet. On the Data tab, click From Other Sources and select From SQL Server, which launches the Data Connection Wizard. On the first page of the wizard, provide the name of the SQL Server instance or server. (If you're working on a local development machine, enter LocalHost.) Click Next. On the second page, select Weather and Events as the source database, choose the vw_EarthquakesAndVolcanos view, and click Next. On the last page, click Finish to save the connection and close the wizard.

In the Import Data dialog box that appears, accept the default option to import data into the existing worksheet starting at cell $A$1. Click OK to import this data into the worksheet.

Step 3: Create and Explore a Cluster Model

At this point, you can create a cluster model. Place the cursor anywhere in the table you imported. In the Data Modeling group in the DATA MINING ribbon, click the Cluster button to open the Cluster Wizard. Click Next twice to accept the current range as the table for the model. Click Next again and set the Percentage of data for testing value to 0. In a production solution, it would be best to use the default setting or to manually create separate training and testing sets. However, for this example, you need to analyze all the available data, which is why you just set the value to 0. Click the Finish button to complete the wizard.

After the Cluster Wizard creates the cluster model, it opens a Browse window that contains the results. As you can see in the Cluster Diagram tab in Figure 17, the cluster algorithm found six different clusters with similar attribute profiles. The more densely populated clusters have darker backgrounds.

Figure 17: Displaying the Diagram for the Cluster Model

The Cluster Profiles tab shows the characteristics of each cluster. As you can see in Figure 18, Cluster 1 includes several volcanos and 79 related earthquakes. As the turquoise diamond in the DaysBeforeEruption row shows, those earthquakes occurred several days before the eruption. Each turquoise diamond displays the range of values for a particular variable, with the mean value at the midpoint of the diamond. A short diamond represents a very narrow range of values, whereas a tall diamond indicates that the values are spread across a wide range. The depth and magnitude of the earthquakes in Cluster 1 were consistently shallow and low, but the distance from the mountain was large—in the 400 to 500 kilometer range. Other clusters of volcanos and earthquakes had very different characteristics, which a geologist, seismologist, or volcanologist might find useful for categorizing future volcanic eruptions and predicting their relative behavior. Note that you can give the clusters more descriptive names. To do so, simply right-click the heading and choose Rename Cluster.

Figure 18: Examining the Characteristics of the Clusters
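If you'd rather script cluster assignments than read them from the viewer, DMX exposes Cluster() and ClusterProbability() functions. The following is a sketch with two loud assumptions: [Cluster Model] stands in for whatever name the add-in generated for your model, and the OPENQUERY data source name must match one defined in the SSAS database.

  // Sketch: both the model name and the data source name are assumptions.
  SELECT
      t.[Volcano_Name],
      Cluster() AS [Assigned Cluster],
      ClusterProbability() AS [Probability]
  FROM [Cluster Model]
  NATURAL PREDICTION JOIN
  OPENQUERY([Weather and Events],
      'SELECT * FROM dbo.vw_EarthquakesAndVolcanos') AS t;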

Clusters can be compared to one another on the Cluster Discrimination tab, which Figure 19 shows. The blue bars show the degree to which the variables differ in favor of one cluster or another.

Figure 19: Comparing the Clusters' Variables

Step 4: Create and Explore an Analyze Key Influencers Model

In addition to the standard data mining options that you can access from the DATA MINING ribbon, the Excel data mining add-ins also add several features to the TABLE TOOLS group when a table is selected in Excel. To begin, put your cursor in the table you imported from the Weather and Events database. Notice that Excel then shows the ANALYZE tab in the TABLE TOOLS group, as Figure 20 shows.

Figure 20: Accessing the ANALYZE Tab in the TABLE TOOLS Group

Click the Analyze Key Influencers button. In the Analyze Key Influencers dialog box that appears, choose Volcano_Name from the drop-down list and click the blue link labeled Choose columns to be used for analysis. This opens another dialog box named Advanced Column Selections, which contains a list of column names. Clear all the check boxes, except those for the Magnitude, Depth, DaysBeforeEruption, and DistKMFromVolcanoToQuake columns. Click the OK button to close the Advanced Column Selections dialog box, then click the Run button in the Analyze Key Influencers dialog box to build the structure and model.

Rather than being displayed in one of the standard mining model viewers, the results of this analysis are reported in an Excel pivot table with colored data bars. This approach is convenient because users work in a tool they already know. In Figure 21, I used the column filter to show only those volcanos where earthquakes occurred within 4 to 11 kilometers of the eruption.

Figure 21: Using an Excel Pivot Table to Display Analysis Results

Every model created with the Excel add-ins is stored in the SSAS database. You can view and modify these models in Excel or view them in SSMS. To view them in SSMS, you just need to expand the database and mining structure in Object Explorer, right-click the model, and choose Browse.
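For example, a DMX query against the data mining schema rowset lists the models in the current SSAS database, including any created from Excel; run this sketch in an SSMS DMX query window:

  // Lists each model, its algorithm, and whether it has been processed.
  SELECT MODEL_NAME, SERVICE_NAME, IS_POPULATED
  FROM $system.DMSCHEMA_MINING_MODELS;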

A Powerful But Different Feature

Data mining is a powerful feature in SQL Server, but it's based on a different kind of science compared to most other SQL Server features. Predictive models can find sophisticated patterns and make predictions that can have immense value. The information provided by these models can be used to improve services, save lives, reduce costs, and grow businesses. However, there's no guarantee that the predictions will always be accurate or even possible in some situations. You should keep your first models simple and be prepared to try and try again. Also, be prepared to discover that perhaps what you were looking for isn't in the data. Don't start with a conclusion and work backward to support it. If it's not there, it either doesn't exist or you need better data. Understanding the nature of predictive analytics and setting appropriate expectations with users and stakeholders will improve rates of success and create an environment in which you can find those nuggets of insight that can make a big difference.
