From time to time, SQL Server professionals need to provide estimates of future values, such as revenue projections or sales forecasts. Organizations sometimes rely on data mining technology to build forecasting models to provide such estimates. I'll explain the key concepts necessary for understanding how these data mining technologies work. I'll also introduce you to a few of the underlying details so that they won't seem so foreign and formidable the first time you encounter them. With an understanding of the key concepts and exposure to some of the details, you'll be in a better position to start using the forecasting capabilities in SQL Server Analysis Services (SSAS).
Why Data Mining Isn't More Popular
There are different approaches to forecasting. For example, the Forecasting Methods website classifies forecasting methods into various categories, including casual (aka econometric), judgmental, time series, artificial intelligence, prediction market, probabilistic forecasting, forecasting simulations, and reference class forecasting. The Forecasting Principles website has a methodology tree that classifies the methods, starting with a split between judgmental methods (i.e., methods used when the available data is inadequate for quantitative analysis) and statistical methods (i.e., methods used when relevant numerical data is available). In this article, I'll focus on time series forecasting, a type of statistical approach in which historical data is available for the values to be forecasted.
Time series forecasting assumes past data can help explain future values. It's important to know that in some situations, there might be circumstances not reflected in the historical data. For example, there might be a new competitor that could adversely affect future revenues or a rapid change in the workforce composition (e.g., a rise in dual-income family structures that emerged in the 1960s) that could affect future unemployment rates. In these types of situations, a time series forecast might not be the best approach or shouldn't be the only approach considered. Often, different forecasting approaches are combined to provide the most accurate predictions.
Understanding the Basics of Time Series Forecasting
A time series is a set of values observed over a period of time, typically at regular intervals. Common examples include weekly sales amounts, quarterly expenses, and monthly unemployment rates. Time series data is often presented in a graphical format, with the time interval along the x-axis of a chart and the values along the y-axis, as Figure 1 shows.
In terms of understanding how a value changes from one period to the next and how to forecast future values, time series data has several major characteristics:
- Base level. The base level is typically defined as the average value of the series. In some forecasting models, the base level is defined as the starting value of the series data.
Trend. A trend is typically defined as how the series is changing from one period to the next. For example, in Figure 1, the number of unemployed tended to trend upward from early 2008 to January 2010, after which it seems to exhibit a downward trend. (For information about the sample data set used to create the charts in this article, see the sidebar "Calculating Unemployment.")
In "Understanding Time Series Forecasting Concepts," the data set for the charts comes from employment data published by the U.S. Bureau of Labor Statistics. The BLS publishes the unemployment rate based on a monthly U.S. Census Bureau survey that extrapolates the total number of employed and unemployed people. Specifically, the BLS uses the formula:
unemployment rate = unemployed / (unemployed + employed)
Interestingly, the unemployment rate typically cited in the news is a seasonally adjusted rate. The seasonal adjustment is accomplished with the publicly available autoregressive integrated moving average (ARIMA) implementation. This is essentially the same algorithm used by many data mining packages for time series forecasting, including SQL Server Analysis Services (SSAS). For more information about the ARIMA implementation used by the Census Bureau, see the X-12-ARIMA Seasonal Adjustment Program web page. Note that in the sample project for this article, I included both the seasonally adjusted and non-seasonally adjusted values.
Seasonality. Certain values will tend to rise and fall based on certain time periods, such as the day of the week or month of the year. Examples include retail sales, which often spike during the Christmas season. In the case of unemployment, there's a seasonal trend, with higher numbers of unemployed in January and July and lower numbers in May and October, as Figure 2 shows.
- Noise. Some forecasting models include a fourth characteristic, noise, which refers to random variations and irregular movements in the data. Noise won't be covered here.
So, if you can identify a trend, apply that trend to the base level, and account for any seasonality that might exist in the data, you have a forecasting model that can be used to predict future values:
Forecasted Value = Base Level + Trend + Seasonality
Identifying a Base Value and Trend
One way to identify a base value and trend is to apply a regression technique. The term regression means to study the relationship among variables. In this case, the relationship is between the independent variable of time and the dependent variable of the number of unemployed people. Note that the independent variable is sometimes referred to as the predictor.
You can use a tool such as Microsoft Excel to apply the regression technique. For example, you can have Excel automatically compute and add a trendline to a time series chart using the Trendline menu on the Chart Tools Layout tab or PivotChart Tools Layout tab in the Excel 2010 or Excel 2007 ribbon. In Figure 1, I added a straight-line trendline by selecting the Linear trendline option on the Trendline menu. Afterward, I chose More Trendline Options on the Trendline menu and selected theDisplay Equation on chart and Display R-squared value on chart options, which are shown in Figure 3.
This process of fitting a trendline to the historical data is called linear regression. As you can see in Figure 1, the trendline is calculated with an equation that identifies the base level (8248.8) and trend (104.67x):
y = 104.67x + 8248.8
You can think of the trendline as a series of connected x-y coordinates in which you can plug in a time period (i.e., the x-axis) to arrive at a value (the y-axis). Excel determines the "best" trendline using something called the least-squared method (identified as R² in Figure 1). The least-squared line is the line that minimizes the squared vertical distance from each trendline point to its corresponding line point. Square values are used so that deviations above and below the actual line don't cancel each other out. In Figure 1, R² = 0.5039, which indicates the linear relationship explains 50.39 percent of the changes in the unemployed statistics over time.
Identifying an accurate trendline in Excel often involves trial and error, along with visual inspection. In Figure 1, the linear trendline isn't a great fit. Excel provides other trendline options, which can be seen in Figure 3. For example, in Figure 4, I added a four-period moving average trendline, which plots points based on an average of the current and last specified periods in the time series.
I also added a polynomial trendline, which uses an algebraic equation to build a line. Note the polynomial trendline has a R² value of 0.9318, indicating a better fit in explaining the relationship between the independent and dependent variable. However, a higher R² doesn't necessarily indicate whether the trendline will provide accurate forecasts. There are other methods to calculate forecast accuracy, which I'll discuss shortly.
Some of the trendline options in Excel (e.g., linear, polynomial) let you forecast ahead and backward by a number of periods, with the resulting forecast values plotted on the graph. Saying that you can "forecast backward" might seem odd. The best way to explain it is with an example. Suppose that a new factor—a rapid growth in government jobs (e.g., Homeland Defense jobs in the early 2000s, temporary U.S. Census Bureau workers)—causes a fast drop in unemployment. You want to forecast the growth rate of this new job sector backward for several months, then recalculate unemployment to arrive at a smoothed rate of change.
You can also manually use the trendline equation to calculate forecasted values. In Figure 5, I added a polynomial trendline with a six-month forecast, having first removed the last six months of data (i.e., data from April through September 2012) from the original time series.
If you compare Figure 5 with Figure 1, you can see the polynomial forecasts are trending upward, which doesn't match the downward trend of the actual time series.
It's important to note two additional points about regression:
- As I mentioned previously, linear regression involves one independent and one dependent variable. If you want to understand how additional independent variables can explain the change in a dependent variable, you can build a multiple regression model. In the context of forecasting the number of unemployed people in the United States, you might be able to increase the R² (and the forecasting accuracy) by also factoring in the growth of the economy, the U.S. population, and the growth in the number of employed people. SSAS can accommodate multiple variables (i.e., regressors) in a time series forecasting model.
- Time series forecasting algorithms, including those in SSAS, might compute autocorrelation, which is the correlation between neighboring values in a time series. Forecasting models that directly incorporate autocorrelation are called autoregressive (AR)models. In an AR model, the predictors are the past values of the series rather than some factor of the independent variable(s). For example, a linear regression model provides a trend equation based on the period (e.g., 104.67 * x), whereas an AR model's trend is based on past values (e.g., -0.417 * Unemployed(-1) + 0.549 * Employed(-1)). AR models have the potential to improve forecast accuracy by including additional information beyond trend and seasonality.
Accounting for Seasonality
It's common for seasonality to show up in a time series, whether it's by day of the week, day of the month, or month of the year. As previously mentioned, the number of unemployed people in the United States typically rises and falls in a given calendar year. (This is true even when the economy is doing very well, as Figure 2 illustrates.) In terms of forecasting, you need to account for this seasonality to make accurate predictions. One common approach is to smooth out the seasonality. In Practical Time Series Forecasting: A Hands-On Guide, Second Edition (CreateSpace Independent Publishing Platform, 2012), Galit Shmueli recommends using one of three methods:
- Calculate a moving average
- Aggregate the time series at a less granular level (e.g., look at the unemployed numbers by quarter instead of by month)
- Generate separate time series (and forecasts) by season
A base level and trend is then derived to generate forecasts against the smoothed time series. Optionally, seasonality or granular adjustments can be reapplied to the forecasted values by factoring back in the original seasonality using the Holt-Winters method. If you want to see how you could factor in seasonality using Excel, perform an Internet search using the phrase Winters method in Excel. You can also find a thorough explanation of the Holt-Winters method in Wayne L. Winston's Microsoft Office Excel 2007: Data Analysis and Business Modeling, Second Edition (Microsoft Press, 2007).
In many of the data mining packages like SSAS, time series forecasting algorithms automatically account for seasonality by quantifying the seasonal relationships and including them in the forecasting model. However, you'll often want to provide hints about the actual seasonal pattern.
Measuring Forecasting Model Accuracy
As I mentioned previously, the initial fit of a model (as measured by the least-squared method) doesn't necessarily equate to accurate forecasts. The best way to test for predictive accuracy is to split the time series into two data sets: one for building (i.e., training) the model and the other for validating it. The validation data set will be the most recent portion of the original data set and should ideally span a length of time equal to the future forecasting timeline. To validate the model, the forecasted values are compared against actual values. Note that after validation has occurred, the model should be rebuilt using the entire time series so that future forecasts can benefit from the most recent actual values.
When measuring the accuracy of a forecast model, there are two commonly asked questions.
How should I define predictive accuracy? In some scenarios, predicted values that are higher than the actual value might be detrimental (e.g., predictions about investment performance). In other situations, predicted values that are lower than the actual value could be damaging (e.g., predicting the lowest winning bid on an auction item). But in cases where you want to calculate some kind of weighted score for all the predictions (not caring if the predictions are higher or lower than the actual value), you can start by quantifying the error in a single forecast using the definition:
error = predicted value – actual value
With this definition of an error, two of the most popular methods for measuring accuracy are mean absolute error (MAE) and mean absolute percentage error (MAPE). With the MAE method, the absolute values of forecast errors are summed, then divided by the total number of forecasts. With the MAPE method, the average deviations of the forecasts are calculated as a percentage. If you want to see an example of these and several other methods for measuring accuracy, an Excel template (with sample forecast data and accuracy scores) is available on the Demand Metrics Diagnostics Template web page.
How much historical data should I use to train my model? When working with a time series that goes far back into the past, you might be inclined to include all the historical data in the model. At some point, however, additional history might not improve forecasting accuracy. Older data can even skew the forecast if past conditions are vastly different than current conditions (e.g., there's a different workforce composition now than in the past). I haven't seen any specific formula or rule of thumb that tells how much historical data to include, so my suggestion is to start with a time series that's several times larger than the forecast time frame, then test for accuracy. Next, try adjusting the amount of history up or down and retest.
Working with Time Series Forecasting in SSAS
Time series forecasting first appeared in SSAS 2005. Its Microsoft Time Series algorithm used a single algorithm named autoregressive tree with cross prediction (ARTXP) to generate forecasts. ARTXP blends AR techniques with a data mining "decision tree" so that the forecasting equation can change (i.e., split) based on certain criteria. For example, a forecasting model might produce a closer fit (and better forecast accuracy) if it's first split by the date, then split by the value of an independent variable, as shown in Figure 6.
In SSAS 2008, the Microsoft Time Series algorithm started using a second algorithm named autoregressive integrated moving average (ARIMA) in addition to ARTXP to improve long-term forecasts. ARIMA is considered an industry standard and can be thought of as a combination of the AR and moving average techniques. It also evaluates historical forecast errors to improve the model.
The default behavior of the Microsoft Time Series algorithm is to blend the results of the ARIMA and ARTXP algorithms to achieve optimal forecasts. (You can override this default behavior if desired.) According to SQL Server Books Online (BOL):
"The algorithm trains two separate models on the same data: one model uses the ARTXP algorithm and one model uses the ARIMA algorithm. The algorithm then blends the results of the two models to yield the best prediction over a variable number of time slices. Because ARTXP is best for short-term predictions, it is weighted more heavily at the beginning of a series of predictions. However, as the time slices that you are predicting move further into the future, ARIMA is weighted more heavily."
When working with time series forecasting in SSAS, you need to keep the following in mind:
- Although there's a tab named Mining Accuracy Chart in SSAS, this tab doesn't work with time series data-mining models. As a result, you'll need to manually measure the accuracy with one of the methods I mentioned (e.g., MAE, MAPE) using a tool such as Excel to assist with the calculations.
- The Enterprise Edition of SSAS lets you segment a single time series model into multiple "historical models," so you don't have to manually split the data into training and validation data sets when testing for predictive accuracy. From an end user's point of view, there's still only one time series model, but you can compare actual results against the predicted results within the model, as shown in Figure 7. If you aren't using the Enterprise Edition or if you don't want to leverage this feature, you'll need to first manually split the data.
The Next Step
In this article, I introduced you to the concepts necessary for understanding the basics of time series forecasting. I also introduced you to some of the details of the underlying algorithms so that they don't become a barrier to implementing time series. As a next step, I invite you to walk through an implementation of time series forecasting with SSAS. I included a sample project that uses the unemployment data referenced in this article. (To obtain this project, click the Download the Code icon at the top of the page.) Afterward, you might want to check out the TechNet tutorial "Intermediate Data Mining Tutorial (Analysis Services – Data Mining)."