Most large companies these days have adopted some form of business intelligence (BI) platform,incorporating data warehouses,data analytics, and mining to transform information into profitable decisions. I’ve gained much experience working with SQL Server 2005 Analysis Services’new BI tools, and I recently came to the greedy realization that I could use these same BI applications to find predictable patterns in data for which the profit margin would be off the chart—for myself. Taking a rich (pun intended) data set such as, oh, state lottery draws over eight years and analyzing the patterns—even though most rational people would argue that such data is purely random—seemed like a straightforward and fun first project, so I decided to tackle it. The results were interesting.
Want to know the steps I took to build my Lottery Prediction Analysis Services project and train two data-mining models, comparing the different algorithms and results? Look no further. Or perhaps you think this kind of experimental exercise is merely frivolous? If so, think again. Although using SQL Server 2005 data-mining techniques to try to win the lottery might seem no more than a fun diversion, you will be able to directly translate these techniques into your business projects, as I have in several healthcare scenarios. Often, fun exercises are the best way to learn otherwise intimidating tools and technologies—and, as a side benefit, this article might just make you disgustingly wealthy.
The first step of data analysis is to load data into tables. Depending on the level of data transformation that will be necessary, populating database tables with source data can either be as painless as using the built-in import features of SQL Server 2005’s SQL Server Management Studio or as arduous as slogging through SQL Server Integration Services (SSIS). For my project, I took the painless route of importing lottery data into a database I named WinBig, in a single table of seven fields.The table, which I called lottery_load,consisted simply of a field to store the date of the drawing (DrawDate) and six number fields (N1 through N6) to hold the individual lottery numbers. I was able to easily find the entire dataset of all Florida Lotto drawings from May 7, 1998, to May 27, 2006, with a total of 1287 individual drawings. (I’m sure most states with lotteries have similar lists on their Web sites.)
After some coaxing to get the HTML list of drawings into a delimited dataset of the seven fields I needed, I stepped through the SQL Server Import and Export Wizard by right-clicking the database,then selecting Tasks and Import Data.When I was done, I had my data source to use for the Analysis Services project and was ready to discover hidden riches buried deep in the digits. (I should note that the process of extracting, transforming, and loading data from source to destination, as in a data-warehouse scenario, is typically the most time-consuming task of an Analysis Services project,especially when you’re training mining models, because you want your data to be as accurate as possible.)
I would now perform all data analysis in Business Intelligence Management Studio (BIDS), a Visual Studio 2005 shell application that ships free with SQL Server 2005. I created a new Analysis Services project by selecting File, New, Project. One of the project types you can select is Analysis Services Project. I named the project Lottery Prediction. In the Solution Explorer window that Figure 1 shows, you can see all the objects available to create. Before I could create a Mining Structure, I needed to create two of these objects: Data Sources and Data Source Views.
Data Sources. Creating the data source is as easy as right-clicking the Data Sources folder in BIDS and selecting New Data Source. The New Data Source wizard walks you through the process of choosing a server, authentication method, and database. In my case, I chose SRS01 as the server and WinBig as the database that contains the lottery_load table that I pre-populated with the lottery-drawing history.
Data Source Views. The Data Source Views step is a little more involved, primarily because simply choosing the table wouldn’t be sufficient for analysis. What I needed was a custom query that would provide not only the historical lottery numbers but also values that could be used as input to train the subsequent mining models. For example, suppose you’re trying to predict what type of product a customer will purchase based on certain criteria gathered about that customer. Knowing a customer’s demographics (e.g., income, marital status, geographic region) is a key component of determining the purchasing decisions he or she might make.All this associated data is fed into the mining model as inputs for the predictable items—such as the probability that he or she will purchase a particular item. This kind of analysis matters little in the random nature of a lottery, but providing the model with this type of associative data might reveal hidden patterns. In the case of the lottery prediction, the draw date is one data element, in addition to the actual numbers, that can be used as valuable input for predictable patterns. I can create a Data Source View based on one or more source tables,as in the case of the lottery_load table, but I can also use a Named Query instead of a table to achieve more flexibility in my schema: Using a Named Query lets me join underlying tables into one entity, filtering which columns are used and filtering data based on criteria.
The easiest way to create a Data Source View based on a Named Query is to right-click the Data Source Views folder in Solution Explorer and select New Data Source View to initiate the wizard. I’ll select the table from the WinBig data source that I created in the previous step, and I’ll walk through the wizard, selecting the lottery_load table and clicking Finish to create the Data Source View named WinBig.dsv. However,after I create the Data Source View, I have the freedom to right-click the table header and select Replace Table, With New Named Query. Named Queries let me more easily define new columns that aren’t in the base table.
With a Named Query, I added several new fields to be used as input items to help the mining model algorithms predict lottery numbers.The fields I added to the base table are all derivatives of the DrawDate field. Knowing which day of the week,for example, might reveal important trends in the resulting query. Just as important could be the month and year of the drawing. Knowing the moon phase, tidal shifts, local weather fore-casts, and barometric pressure might be just as reasonable in this exercise, but we have to work with the data that’s available.
Listing 1, shows the Named Query that will drive the mining models; notice the new columns derived from the DrawDate field.The WeekPart field might prove to be important while training the models because the lottery is drawn two days per week—Wednesday and Saturday, the fourth and seventh day of the week for the WeekPart function.
Creating Mining Models
It’s time to create and process the mining models. SQL Server 2005 offers many new data-mining algorithms. I chose Decision Trees and Microsoft Clustering algorithms, partly because their graphical mining-model viewers are superior to those of other algorithms, such as Linear Regression, but also because these two algorithms are available in all editions of SQL Server Analysis Services. Decision Trees is a Classification-type algorithm that can predict values based on other attributes. Microsoft Clustering is a Segmentation-type algorithm that can find natural groupings in data and predict outcomes based on those groupings.
The best way to create the initial Mining Structure that will contain the mining models based on the two algorithms is to walk through the Data Mining Wizard by right-clicking the Mining Structures folder in Solution Explorer and selecting New Mining Structure. The Data Mining Wizard guides you through several screens to gather input. The first question it asks is whether the structure will use an existing relational database or warehouse or be based on an existing cube. I’ll choose an existing relational database because I intend to use the lottery_load table. The next question asks for the type of algorithm (or technique, in the wizard’s terminology). I’ll have two models in one structure, but I’ll start with the Decision Trees algorithm for the wizard. On the next screen, I’ll select the WinBig data source view that I created and use the lottery_load table derived from the Named Query. It’s the only table available to select, and I’ll leave it set as a Case table instead of a Nested table.
On the next wizard screen, I’ll select and define the data that will train the model. As you can see in Figure 2, all the fields from the Data SourceView are listed. You can define each field as Key, Input, or Predictable. I’ll select the DrawDate field as the Key field because I know each drawing will have a unique date value. For Input, I’ll choose every field except DrawDate, and for the Predictable fields, I’ll choose N1 through N6. After clicking Next, I’ll specify content and data types for the fields.
There are several different content types, and each plays a significant role in how the model will work with the input data. The three content types that I have worked with while testing the output of the lottery-prediction models are Discrete, Continuous, and Discretized.
Discrete values contain a finite number of items that shouldn’t be used as values in calculations. In other words, even in my case—with fields N1 through N6 containing numerical data—the preference would be to make these fields Discrete and, further, to make their data type Text so that they won’t be additive to one another. The possible lottery numbers range from 1 to 53—a finite set of values. Think of the gender attributes, which have two values, Male or Female. If I had a gender value to use in my predictions, I wouldn’t want to add Male to Female, just as I don’t want to add lottery numbers 1 to 12 to 43 to 6, and so on.
Continuous content types are measurements of numerical data and are additive. I’ve noticed that the results of the model change when I alter the content types. The Discrete types seem to work best for this project. Discretized value types, which place Continuous numerical values in discrete groups,can be used with algorithms,such as Naïve Bayes, that don’t otherwise permit Continuous values. I could let the wizard detect the content types, but as Figure 3 shows, I set content types manually as Discrete or Key and used Text or Date for the data types. Now, I can finish the wizard and see the new structure.
Next, I’ll process the mining model on the Mining Model tab in BIDS by clicking the Process button in the upper left corner. When the Process window opens, I’ll simply click Run, and when the processing is finished, I’ll close the window and view the results.On the Mining Model Viewer,I’ll see the derived Decision Tree. This is a simple result because the Discrete and Text values have prevented the algorithm from performing many calculations with the numbers. My simple goal for the model was to deliver six numbers. In Figure 4, you can see the tree view for the predictable value N1 and the highest to lowest cases for each number 1 to 53 for each. The probability values also appear, along with a histogram.
At this point, I decided to build a DMX query to select my six numbers before moving on to add in the Microsoft Clustering model to the same structure and compare the differences.
On the Mining Model Prediction tab, I found a tiny button that led me to the Singleton Query. Singleton Queries are nice when you want to manually pass in input for the predictable values without having to pull them from an input table that stores values.The goal is to see what numbers the model will return for the May 31, 2006, drawing date.That date also happens to be a Wednesday. Figure 5 shows the graphical Singleton Query. Because N1 through N6 are the only values I used as Predictable, they’re the only ones I can select; however, I added a Prediction Function for the probability for N1. Listing 2 shows the resulting DMX query. According to the Decision Trees algorithm, 15, 11, 26, 35, 41, and 48 will be the winning numbers for May 31, 2006.
On the second tab of the project, I clicked Add Related Model, specified Microsoft Clustering, and named the new model Lottery_Cluster. New models added to the base mining structure inherit the content and data types of that structure. Therefore, all I had to do after adding the Microsoft Clustering model was to reprocess. After doing so,I could analyze the new model results on the Mining Model Viewer tab, which Figure 6 shows, and I could use the same type of DMX query to predict the six numbers based on the date input variables. The results of the Lottery_Cluster model were 2, 13, 22, 30, 34, and 53, with a probability for N1 of 5.71—not the greatest odds, but the number does suggest that the model found a pattern marginally better than a random guess.
Time to Play
Now, it was time to go buy some tickets. I purchased tickets for both series of numbers, and I also added a random drawing that the lottery machine chose for me.The randomly generated numbers were 11, 16, 20, 24, 26, and 49.
The winning numbers were 14, 24, 30, 34, 43, and 44. Is it luck that my Microsoft Clustering algorithm hit two of the six numbers (30 and 34) whereas the random number hit only one (24)? With further analysis and fine-tuning of the models, I would hope to say that it wasn’t all luck. I grant that the random nature of the drawing makes it nearly impossible to predict, which is why no one else hit all six numbers this time. So, the total jackpot is now at $10 million and I have only three days left to retrain the models and try new algorithms. I’ll let you know when I win.
Author’s Note: Special thanks to Jeff Dyktra, who helped me build an elaborate normalized lottery database with many possible input values, such as prime, odd, and even numbers. I chose a simple single table, but his Integration Services package is worthy of an article in itself.