This is a continuation of our series on SQL Server BI for DBAs. In this posting, I’m going to keep it short: five common areas where I see DBAs focus most effectively when they are responsible for a SQL Server data warehouse instead of a classic OLTP database application. This post will not cover Microsoft’s new data warehouse appliance and reference-architecture offerings, Fast Track and Parallel Data Warehouse. Instead, we’ll assume classic SQL Server 2005, 2008, or 2008 R2 as the database engine for your DW.
Here they are, in no particular order:
1. Understand the different landscape
When managing OLTP application databases, your focus as a DBA is on keeping the applications that drive your business running effectively and efficiently. Data integrity is maintained and duplication eliminated through normalized (third normal form) schemas, workload patterns are predictable and weighted toward inserts and updates, and the system is optimized to validate user input. When you are responsible for a data warehouse, by contrast, expect a lot of ad-hoc queries, denormalized schemas, and lots of worrying about data quality.
2. Usage Patterns: Bursts, Bulks and Reads
The data that gets loaded into the data warehouse tables will likely arrive in large bulk (ETL) operations. Data warehouses can usually benefit from more indexes, but there is a heavy price to pay for all of those indexes during the data loading ETL phase. A common practice is to drop (or disable) the indexes on the target tables before the bulk operations and then recreate (or rebuild) them after the process. Also remember to look for missing indexes in the missing-index DMVs.
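The drop-then-rebuild pattern described above can be sketched in T-SQL like this. The table, index, and file names here are hypothetical placeholders, and your load method may be `BULK INSERT`, `bcp`, or an SSIS data flow:

```sql
-- Disable the nonclustered index before the bulk load
-- (disabling keeps the index definition but drops its data).
ALTER INDEX IX_FactSales_DateKey ON dbo.FactSales DISABLE;

-- Perform the bulk load; TABLOCK and a batch size help it
-- qualify for minimal logging where the recovery model allows.
BULK INSERT dbo.FactSales
FROM '\\etlserver\staging\factsales.dat'
WITH (TABLOCK, BATCHSIZE = 100000);

-- Rebuild the index once the load completes.
ALTER INDEX IX_FactSales_DateKey ON dbo.FactSales REBUILD;
```

Note that disabling a clustered index makes the table inaccessible, so this technique applies to the nonclustered indexes on your fact tables.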
3. What to keep your eyes on
Here is a common list of DMVs and counters to monitor that are specific to SQL Server data warehouses: missing indexes (sys.dm_db_missing_index_groups and the related missing-index DMVs); fragmentation due to large I/O operations (sys.dm_db_index_physical_stats); disk I/O latency via the Avg. Disk sec/Read counter in Perfmon; and, also in Perfmon, Page Life Expectancy. When you see low page life expectancy, go back and look for missing indexes. Remember that the ad-hoc nature of querying in a data warehouse environment can be a challenge to keep up with.
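As a starting point, here is one way to query two of the DMVs mentioned above. The DMV and column names are real; the ranking expression and the 30% fragmentation threshold are just reasonable rules of thumb, not official guidance:

```sql
-- Top missing-index suggestions, ranked by a rough impact score.
SELECT TOP 10
    mid.statement          AS table_name,
    mid.equality_columns,
    mid.inequality_columns,
    mid.included_columns,
    migs.user_seeks,
    migs.avg_user_impact
FROM sys.dm_db_missing_index_groups mig
JOIN sys.dm_db_missing_index_group_stats migs
    ON mig.index_group_handle = migs.group_handle
JOIN sys.dm_db_missing_index_details mid
    ON mig.index_handle = mid.index_handle
ORDER BY migs.user_seeks * migs.avg_user_impact DESC;

-- Indexes fragmented by large load operations in the current database.
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       ips.index_id,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
WHERE ips.avg_fragmentation_in_percent > 30;
```

Keep in mind the missing-index DMVs are cleared on instance restart, so capture their output periodically rather than relying on a single snapshot.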
4. Manage large data movement with partitions
When I discussed implementing a scalable SSAS cube last week, I left partitioning the fact table and the cube out of the discussion. With large data warehouses, managing partitions needs to be part of the DBA role: you manage the lifecycle of data as it enters the warehouse, moves out of the active partitions, and is archived. We won’t have time in this blog series to go into the details of archiving strategies for data warehouses, but managing partitions is a necessary part of the data lifecycle.
A common partitioning practice in data warehouses is to partition fact tables by date, choosing a weekly or monthly grain based on the date ranges most commonly queried by users. To manage those partitions, old data can be switched out of the fact table into an archive table. This is where SQL Server 2008’s partitioning improvements and the sliding-window technique will come in handy.
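A minimal sketch of the sliding-window idea looks like the following. The object names and boundary dates are hypothetical, and `SWITCH` has real prerequisites this sketch glosses over: the archive table must be empty, identically structured, and on the same filegroup as the source partition:

```sql
-- Monthly partition function and scheme (RANGE RIGHT: each boundary
-- date starts a new partition).
CREATE PARTITION FUNCTION pf_MonthlyDate (date)
AS RANGE RIGHT FOR VALUES ('2011-01-01', '2011-02-01', '2011-03-01');

CREATE PARTITION SCHEME ps_MonthlyDate
AS PARTITION pf_MonthlyDate ALL TO ([PRIMARY]);

-- Slide the window: switch the oldest populated partition out to an
-- empty archive table (a metadata-only operation, nearly instant),
-- then merge away the now-empty range.
ALTER TABLE dbo.FactSales SWITCH PARTITION 2 TO dbo.FactSales_Archive;
ALTER PARTITION FUNCTION pf_MonthlyDate() MERGE RANGE ('2011-01-01');
```

The appeal of `SWITCH` is that it moves data out of the fact table without physically copying rows, which is exactly what you want when the fact table holds hundreds of millions of them.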
One of my favorite resources for DBAs new to a data warehouse role is the Project REAL website, a joint project between Microsoft, Barnes & Noble, Unisys, and a few other partners. They designed an excellent reference architecture for building a basic data warehouse in SQL Server 2005, and it is fully transferable to SQL Server 2008. Deeper treatments of partitioning strategies in ETL and data warehouses are explained in detail there.
5. Manage the ETL (Extract, Transform and Load) operations
One aspect of managing a data warehouse that differs from an OLTP system in nearly all cases is that there will inevitably be a process that moves data from one or more sources of transactional data (usually more, and usually of many different source types) into the warehouse, where business users consume it through reports run directly against the data warehouse or through an SSAS cube. In SQL Server, the ETL is very likely going to be implemented as a series of SSIS packages that the ETL developers created and that you, as a DBA, will need to monitor for success or failure, index impact, and scheduling. You can connect to the SSIS service from SSMS by selecting Integration Services in Object Explorer, and then manage the packages stored either in the MSDB database or on the file system. You can either manually execute a package from SSMS or schedule it for a periodic run with SQL Server Agent jobs.
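Scheduling a package through SQL Server Agent, as described above, can be scripted rather than clicked through in SSMS. A minimal sketch follows; the job name, package path, and server name are placeholders for your environment:

```sql
-- Create an Agent job that runs an SSIS package stored in MSDB
-- every night at 2:00 AM.
EXEC msdb.dbo.sp_add_job
    @job_name = N'DW Nightly ETL';

EXEC msdb.dbo.sp_add_jobstep
    @job_name  = N'DW Nightly ETL',
    @step_name = N'Run fact load package',
    @subsystem = N'SSIS',
    @command   = N'/SQL "\LoadFactSales" /SERVER "ETLSERVER"';

EXEC msdb.dbo.sp_add_jobschedule
    @job_name = N'DW Nightly ETL',
    @name     = N'Nightly 2am',
    @freq_type = 4,             -- daily
    @freq_interval = 1,         -- every 1 day
    @active_start_time = 020000;

-- Target the job at the local server so Agent will pick it up.
EXEC msdb.dbo.sp_add_jobserver
    @job_name = N'DW Nightly ETL';
```

The `@command` string for the SSIS subsystem uses the same argument syntax as the dtexec utility, which is also what you would use to run the package manually from the command line.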