|Executive Summary: Microsoft provides an end-to-end business intelligence (BI) solution based on its SQL Server data platform. Read about how Premier Bankcard deployed Microsoft's SQL Server 2005 BI platform, including how long it took to get the data warehouse up and running and the challenges faced during implementation.|
In the past few years, Microsoft has released several business intelligence (BI) products, such as SQL Server 2005, the Microsoft Office 2007 suite, Microsoft Office SharePoint Server (MOSS) 2007, and Microsoft Office PerformancePoint Server 2007. However, it wasn’t until recently that companies began to feel business pressure to think about how their bottom line is affected by the quality and timeliness of the information that their employees use to make decisions. Microsoft and its competitors offer BI products to fill this need, and Microsoft touts its ability to provide an end-to-end BI solution based on its SQL Server data platform. With SQL Server 2008’s enhanced BI features, many companies are considering implementing Microsoft’s BI platform to improve access to meaningful data. To gain insight into how organizations decide whether to deploy Microsoft’s BI solutions and to see how these solutions can be implemented, I spoke with Ron Van Zanten, the directing officer of BI at Premier Bankcard, to get an inside look at how he implemented Microsoft’s SQL Server 2005 BI platform in his organization, including how long it took to get the data warehouse up and running and the challenges that he faced during the implementation.
Bearly: What made your company decide to implement Microsoft’s BI solution?
Van Zanten: We were like a lot of companies are when they’re starting out. We had reporting, and individual divisions had systems on which they would keep their data. We did have a data warehouse at the time. It was good except for two flaws—the information wasn’t timely or accurate. We needed to break down the silos of information and get one version of the truth, so that everybody was working off the same set of books. That’s why we chose a centralized data repository.
Bearly: Did your company look at alternative BI platforms before deciding to use Microsoft’s?
Van Zanten: Yes, we looked at Oracle and Teradata. Teradata had only a limited number of installations, it’s extremely expensive, and they wanted to have their consultant come in and set up the data warehouse, set up the schema to one of the schemas they have for financial institutions, and then build reports off of it. The downside of that is that we wouldn’t have the ownership for the implementation. Oracle didn’t have a BI stack and told us to use Analysis Services for the cubes. So it looked like we were going to use the top part of the Microsoft stack anyway. Since Analysis Services was in all three proposals, we decided to use Microsoft’s BI platform. I didn’t have any reservations with SQL Server. I knew we could build a schema and leverage SQL Server’s strengths and make it perform like we needed it to. As long as we started from the ground up, built it and designed it with performance in mind, and put a presentation layer on the top of it, we were certain SQL Server would be able to perform.
Bearly: Which Microsoft BI products are you using?
Van Zanten: SQL Server 2005 SP2. We’re running the latest version of Reporting Services \[2005 SP2\], and Analysis Services has the latest service pack \[2005 SP2\] on it, too. We’re running PerformancePoint 1.0, which is the first build. We can’t wait for Performance-Point 2.0. We use SharePoint 2007 and we did use ProClarity 6.1, but we’re migrating over to PerformancePoint for that presentation layer.
Bearly: What features made you use those products?
Van Zanten: The partitioned tables were something that we really wanted, and we really wanted our database platform to be NUMA \[Non-Uniform Memory Access\] aware. NUMA is a hardware feature where we have four cells on our Itanium box, and those cells have four processors and 16GB of memory on each one. SQL Server 2005 is aware of that, so we don’t allow people to take more than one cell of resources when they’re running their queries. We have one fact table that’s 2TB; it’s got almost three billion records in it. You could write a query to our data warehouse and its hardware, and SQL Server keeps a query like that from overrunning everything else. The query will go through, and we’ll eventually find it and be able to stop it before it wastes everybody’s time. But in the meantime, the SQL Server instance is still available and still has a decent response time.
We monitor our data warehouse by response time. We run a query against our Account dimension; we query the dimension table, returning one record back. We track the time in milliseconds that it takes to log in, return the data, and log out. We do this every 10 minutes. I use that metric above all others because that really tells me how SQL Server feels about things. If it has CPU pressure, that query will be slower; if there’s a lot of memory pressure it’s slower; or if there’s disk pressure it’s slower. If the response time starts to degrade, then I can see where the bottleneck is because it’s usually one of those three things that the SQL Server system is unhappy about. Our standard is that a query needs to happen in 250 milliseconds on the SQL Server system.
Bearly: You mentioned that you have a centralized data repository. Can you tell me more about your data warehouse?
Van Zanten: We created a centralized data warehouse, where everything is stored and loaded through a standardized ETL \[extraction, transformation, and loading\] process. The ETL process is where the data is validated and cleaned. Our data warehouse is a relational star—it’s actually a snowflake schema—where the different fact tables and different core areas of our business can be related to each other correctly. We look at letters that are sent to the customer, phone calls the customer makes to us, and the problems that they have, in the same context as other things that are happening in our company, so that we can get a 360-degree view of the customer. But that really only works well when you have a centralized data store. It does cause new problems because you’re inviting new challenges. One of those challenges is size, by having everything in one area—but we have a 12TB SQL Server instance that runs this data store.
Bearly: Can you tell me more about that SQL Server instance?
Van Zanten: It’s 12TB, and it contains over 40 billion records. If you think about it in the big picture, it actually makes our data stores smaller because we aren’t keeping separate copies of data; everything is located in one spot. If you want to get monetary transactional history, such as payments, purchases, or fees, you can go to one place that’s been audited, balanced, and built on a repeatable process that’s gone through change control. You have a place where you can trust the information. Reports can then be built on the information, and our business users can build data marts and subject areas to drill down further and do further analyses, but it’s always based on the original version of the truth. It gives everybody a point of reference for reports, and analysts can drill back and verify such information as “Where’s the baseline? Do I balance? Is my report correct?”
Bearly: How difficult was it to implement
Microsoft’s BI platform?
Van Zanten: The most important thing was understanding the platform, in order to leverage it. We had to have the right information but it also had to be timely. ETL performance was definitely a consideration. Our ETL process runs through SSIS on a separate server, and we bulk INSERT into the actual database instance all of the data.
Having our data warehouse and SSIS on the same SQL Server instance probably wouldn’t have been the best scenario, because both of those products love memory, and we didn’t want to have that kind of contention. So having a dedicated ETL box was important for us, although it’s something that companies sometimes don’t do because they think that if they go with the same vendor, they can put it all on one box. But if we had purchased a separate ETL tool from a third-party vendor, we would have had it installed on a separate server anyway. So we took best practices for the industry and applied them to our stack.
We have a separate SSIS server whose dedicated job is to manage the data movement, the data quality, and the checks, which is a lot of work because we actually take in data and put it in a dirty table, and then we start to normalize it to get it ready to put into our data warehouse. We put the keys in there for our dimension tables on the fact tables, and we standardize data types. We’re looking for things that are out of the ordinary. We put 20 to 25 million records in the data warehouse per day, but we inhale into our ETL server well over 30 million. Then we get the data distilled down, cleaned, and standardized. The data warehouse then accepts this new information and the ETL box will kick off Reporting Services schedules.
Of 180 loads, about 120 of them are just daily incrementations that run overnight. They contain the activity for the previous day. We only post transactions to credit cards at night, so that load only has to happen once every evening and then it’s done for the day. The data isn’t going to change during the day in the data warehouse. Then we kick off Reporting Services scheduled reports and they run against that data, using snapshot reports, so that when people look at those reports during the day in Reporting Services, the query has already been run against the data warehouse. Users think performance is extremely fast because the only thing Reporting Services has left to do is just to render the data. Reporting Services can render that data to 500 people quickly because it’s doing some web pages without having to touch the data warehouse itself—because we know in these particular reports that the data hasn’t changed, which saves the data warehouse from having repeating hits from all those reports.
Analysis Services also improves BI performance. Our Analysis Services server is separate, which provides some relief for the data warehouse. All the hits are done by the Reporting Services and Analysis Services servers. There are also some loads to the data warehouse that occur every 15 minutes or every half hour for load frequencies, and the data warehouse has to service them.
We use and leverage Microsoft’s stack pretty heavily. Every database engine has things it’s good at and things it’s not good at. A good DBA, a good schema, and a good process should leverage a database engine’s strengths. That’s what we do with SQL Server. It performs very well—queries take less than a minute on average, and we never have a query that takes over the SQL Server box. The data warehouse takes abuse constantly and it performs very well.
Bearly: Did you implement your BI platform
inhouse? Or did you contract with someone or work with Microsoft directly to set it up?
Van Zanten: We did it ourselves.
Bearly: What was the biggest technical hurdle that you had to overcome?
Van Zanten: When we were implementing it, our ETL process was done in DTS (we used DTS for order of precedence and dependencies), but almost all of the heavy lifting was done by SQL Server and stored procedures. In that space, the biggest technical hurdle was data quality and getting things married up. DTS wasn’t nearly as robust as SSIS is at sorting and weeding things out, so we had to do it by hand. You had to run SQL queries when cleaning queries, and look for inconsistencies.
Bearly: Did you encounter any problems other than with DTS?
Van Zanten: The quality of the data and the timeliness of our data sources is our biggest challenge. The other big challenge is having people understand the data warehouse—what’s in there, what’s not in there—as well as understand our process, such as knowing how to use Reporting Services and understanding how to pivot a cube. Cubes seem to be a leap for people. We could have the best data warehouse ever, but if nobody ever used it for their work, then it would be a failure.
Bearly: How long was it from when you started building your data warehouse to when it was functional for your users?
Van Zanten: Four months. The first high-value test reports were published in six months.
Bearly: Were you able to stay within the budget you were given?
Van Zanten: We were definitely within budget. Our first major purchase was the HP Itanium 8620. And we had a very small staff; it was actually just me the first year. I was the whole development staff on this project. Even now my staff includes only eight people who are dedicated to the BI data warehouse infrastructure, so it’s still fairly economical to run if you set it up correctly. The really giant purchase we made was the HP StorageWorks XP24000, because it was a seven-figure purchase. But we just bought that last year, so by that time we were a billion-dollar company, and all this corporate data was giving us a 700 percent return on investment; we could afford it. We were able to deliver, which is huge, and our data repository was delivered on time and correctly. It’s really important to think about how you’re going to show your return on investment.
Bearly: Are you planning to migrate to SQL Server 2008 when it’s released?
Van Zanten: We have SQL Server 2008 on a development box and for our test data warehouse. Our largest fact table is 2TB; but 700GB would likely go away if we could take advantage of SQL Server 2008’s compression.When SQL Server 2008 is released, we plan to upgrade our production data warehouse.