On November 9, 2010, Ted Kummert announced to the world the immediate availability of SQL Server Parallel Data Warehouse (PDW). Since then, my colleagues at our firm BI Voyage (BIV) and I have worked on multiple Parallel Data Warehouse customer deployments, including Hadoop clusters. Our firm's work with the appliance began with Direct Edge, the world's fourth-largest equities exchange by notional value. Working together with Direct Edge, we determined that Parallel Data Warehouse was the best platform for the large data volumes the firm needed to analyze, and Direct Edge became the first non-Microsoft buyer of Parallel Data Warehouse. Today, several leading enterprises have purchased Parallel Data Warehouse appliances, including our own firm's half-rack PDW for training and development. In the remainder of this post I will discuss our firm's experiences with Parallel Data Warehouse and append a link to the Direct Edge PDW Case Study.
Designing Customer Solutions
When you are the lead architect for a Parallel Data Warehouse project, the most important thing you can do is read the technical materials (including SQL Server Parallel Data Warehouse Books Online, which ships with the unit) and gain as much hands-on experience as you can prior to the engagement. I see both sides of the training discussion. On the one hand, PDW is drastically different from SQL Server SMP and thus requires substantial training for existing SQL Server professionals: MPP technology presents a new paradigm in which you design for performance by distributing (and replicating) your tables. There is also the notion of Distributed-SQL (D-SQL) and distributed query plans. D-SQL queries and plans involve two levels of operations: compute node-level SMP plans and appliance-level data movement plans. Understanding how the distributed plan interfaces with the traditional SMP compute node plan is important, as is understanding the key data movement operations such as shuffle and partition moves. On the other hand, beyond those key deltas, if you are experienced with SQL Server SMP for data warehousing today you should be just fine. Without question, one of Microsoft's core design goals for the appliance is to enable a seamless transition of technical expertise from SQL Server SMP to MPP (as easy as they can make it).
Once you are confident designing solutions for the appliance, you should begin a detailed design phase (even if you are using Scrum project management methodologies). The good news is that it takes a few weeks from the time of order for you or your customer to receive delivery of the appliance from the hardware vendor, so you have some time to design and plan. You need to design, at least at a high level, what the distributed schema is going to look like. Which tables will be distributed and which will be replicated? Do you want data management capabilities via table partitioning? How will you partition the tables? Should you employ non-clustered indexes (NCIs are used much less in PDW environments)?
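To make the distributed-versus-replicated decision concrete, here is a minimal sketch of PDW V1 DDL for a hash-distributed, partitioned fact table and a replicated dimension table. All table and column names are illustrative, not from the Direct Edge project; verify the exact syntax against Books Online for your appliance update.

```sql
-- Hypothetical fact table: hash-distributed on a high-cardinality key
-- and range-partitioned by date key for data management.
CREATE TABLE dbo.FactTrade
(
    TradeDateKey  INT           NOT NULL,
    SecurityKey   INT           NOT NULL,
    TradePrice    DECIMAL(18,4) NOT NULL,
    TradeQuantity BIGINT        NOT NULL
)
WITH
(
    CLUSTERED INDEX (TradeDateKey),
    DISTRIBUTION = HASH (SecurityKey),
    PARTITION (TradeDateKey RANGE RIGHT
               FOR VALUES (20120101, 20120201, 20120301))
);

-- Small dimension tables are typically replicated to every compute
-- node so joins against them avoid appliance data movement.
CREATE TABLE dbo.DimSecurity
(
    SecurityKey  INT          NOT NULL,
    Ticker       VARCHAR(12)  NOT NULL,
    SecurityName VARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);
```

The rule of thumb is to distribute large fact tables on a column with high cardinality and even value spread (to avoid skew), and to replicate small, frequently joined dimensions.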
The data loading infrastructure and frameworks require design: how will your solution accommodate one-time historical as well as ongoing incremental data loading? Are you going to leverage ETL or ELT processes? And do not forget about the security architecture of the solution. Who will connect to the appliance? Under what security context will these people (or services) connect, and what permissions should they have? Do you want a self-service report author to have the ability to drop the data warehouse?
There is also the notion of creating MPP views which face the BI tools and components such as OLAP cubes and in-memory tabular models. It is common practice to create views to mask some of the data warehouse's internal complexity from the BI components and self-service report authors. And for those self-service-to-PDW scenarios, trust me when I say you want to restrict how many rows folks can download into those self-service tools; in PDW environments we are generally working with hundreds of millions and even billions of records (on the BIV half-rack unit we have a table with over 70 billion records).
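As a sketch of the practice above, a BI-facing view can both hide join logic and cap the rows a self-service tool can pull back. The object names and the TOP value here are illustrative only:

```sql
-- BI-facing view: hides the fact/dimension join from report authors
-- and caps the result set a self-service tool can download.
CREATE VIEW dbo.vTradeSummary AS
SELECT TOP 1000000
       t.TradeDateKey,
       s.Ticker,
       t.TradePrice,
       t.TradeQuantity
FROM dbo.FactTrade   AS t
JOIN dbo.DimSecurity AS s
  ON s.SecurityKey = t.SecurityKey;
```

Report authors are then granted access to the view rather than the underlying tables.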
Before I leave the design realm, there is OLAP cube and Tabular model-specific design. If you are going to deploy these artifacts with PDW, make sure you read up on the early best practices and lessons learned, as many inherent tuning options have already been discovered. Some of these items apply to the design of OLAP cubes, for example the selection of optimal storage modes.
Developing Customer Solutions
Probably the most important thing to mention under development is to 'pre-bake' your sample datasets. We want to work with small datasets so our development tasks are that much faster and easier to accomplish. Consider using a representative dataset, such as a fixed percentage of production data volumes.
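One hedged way to pre-bake such a dataset is a CTAS statement that carves off a slice of a production table by a modulus filter on the distribution key (table and column names are illustrative; the 1% figure is just an example):

```sql
-- Carve off roughly 1% of the production fact table into a
-- development copy, keeping the same distribution column.
CREATE TABLE dbo.FactTrade_Dev
WITH (DISTRIBUTION = HASH (SecurityKey))
AS
SELECT *
FROM dbo.FactTrade
WHERE SecurityKey % 100 = 0;   -- keeps ~1% of the key values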
It is a good time to mention appliance updates (AUs) as well. With each new AU Microsoft releases, we gain new features. The new features span a variety of appliance aspects; however, the development domain obtains many new updates with each AU, including collations, data types, and new D-SQL constructs. So it is a good idea to double-check that your PDW development unit is on the latest appliance update available.
The primary development tools you will use in developing PDW solutions are Nexus Query Chameleon (the PDW V1 query tool), the PDW Admin Console (to monitor and diagnose query performance), and the SQL Server SMP BI developer tools (BIDS or SSDT). You can also leverage SQL Server SMP's T-SQL with PDW by creating a linked server to the Parallel Data Warehouse appliance. The primary development objects available to you in PDW V1 are tables, views, and stored procedures.
Next up in PDW development is CREATE TABLE AS SELECT (CTAS). Get to know this syntax; you're going to be using it...a lot. CTAS is essentially the PDW equivalent of SQL Server SMP's SELECT INTO: a fully parallelized SELECT that results in the creation of a new user table. You can write CTAS "UPDATEs" (matching on conditions) and CTAS "DELETEs" (filtering out records via a SELECT); just about any set-based operation you can think of can be accomplished using CTAS. CTAS is also often used to redistribute an existing distributed table on a new distribution column.
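The CTAS "UPDATE" pattern deserves a sketch, since it is the least obvious to SMP developers: instead of updating rows in place, you project the changed values into a new table and swap it in. Names and the price adjustment below are purely illustrative:

```sql
-- CTAS-based "UPDATE": build a new table with the changed values
-- applied as a projection, then swap names.
CREATE TABLE dbo.FactTrade_New
WITH (DISTRIBUTION = HASH (SecurityKey))
AS
SELECT TradeDateKey,
       SecurityKey,
       CASE WHEN TradeDateKey < 20120101
            THEN TradePrice * 1.1      -- the "update" condition
            ELSE TradePrice
       END AS TradePrice,
       TradeQuantity
FROM dbo.FactTrade;

DROP TABLE dbo.FactTrade;
RENAME OBJECT dbo.FactTrade_New TO FactTrade;
```

Because the SELECT is fully parallelized across the compute nodes, this pattern is typically far faster at PDW data volumes than row-by-row modification.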
Deploying Customer Solutions
When you are ready to deploy a Parallel Data Warehouse database, and potentially accompanying Microsoft business intelligence artifacts, you will want to consider a few things. First and foremost, are the development and production appliances physically the same? Some of our customers have purchased the optional Parallel Data Warehouse development unit, which is a smaller appliance configuration designed for development and test. In other customer scenarios we build and productionize on the same physical appliance when a test unit is not available. All of the changes related to physical appliance settings apply here, including ensuring the production appliance is on the network, applying the latest appliance updates, and updating your BI artifacts' connection strings to point to the correct appliance control node clustered IP address for connectivity using either the DataDirect or Tabular Data Stream (TDS) protocols.
Second is source control. You can use almost any source control system, such as Visual Studio Team System or Subversion, to check in your PDW .sql files. These files may contain Distributed-SQL DDL, DML, or both. Also, do not forget about your BI artifacts such as Reporting Services reports and OLAP cubes or Tabular models.
Third is historical data loading. As part of your ELT design and development phases, you should have a data loading solution in place that facilitates the massive historical load that must take place first. Generally speaking, you will want to leverage dwloader to obtain fast parallel bulk load performance (on the order of 1 TB/hour). dwloader requires incoming data files that match the target schema, and it supports gzip-compressed files. You will want to ensure all of these files are ready for load; ideally they will be located in a directory on the appliance's Landing Zone.
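As a rough sketch, a dwloader invocation for pipe-delimited files staged on the Landing Zone might look like the following. The server name, login, paths, and table are illustrative, and you should verify the exact parameter set against the dwloader help on your appliance update level before relying on it:

```
dwloader.exe -S MyAppliance -U loader_login -P ******** ^
    -i \\landingzone\loads\trades\*.txt ^
    -T SalesDW.dbo.FactTrade ^
    -R \\landingzone\loads\trades\rejects.txt ^
    -t "|" -r \r\n -M append
```

Here -i points at the source files, -T names the target database table, -R collects rejected rows for inspection, -t and -r set the field and row delimiters, and -M selects the load mode.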
Fourth are optional subscribing SQL Server SMP spokes. You can optionally choose to design a hub-and-spoke architecture (for example, to reuse an existing SQL Server Fast Track investment) whereby traditional SQL Server relational instances receive fresh copies of datasets via the appliance's Remote Table Copy (RTC) feature. RTC provides a high-speed export of appliance data, but you need to ensure the spokes are attached to the appliance's internal InfiniBand network. The logic behind choosing such a design is that you may want to offload some of the query workload to downstream spokes. One example is an accounting department that constantly queries a certain dataset throughout the business day: you offload those queries to a separate server, thus freeing up the Parallel Data Warehouse for more ad-hoc (and possibly more important) requests.
Fifth is MPP backup plans. Once you have the customer's solution deployed and are actively loading data, you will want to design and put into place a regular backup plan, just as you would with regular SQL Server SMP. PDW supports full and differential backups. A typical configuration we use is a weekly full with daily differentials. PDW backups are automatically compressed and back up the entire appliance database (across all nodes). You can (and we recommend you do) back up the PDW's master database every so often as well, as it contains appliance metadata such as user logins. The biggest thing to be aware of when planning for PDW backups is that in order to restore a backup, you must restore it to an appliance with at least as many compute nodes. For example, if you elect to purchase two Parallel Data Warehouse units for disaster recovery purposes, you will need to ensure that both appliances have a matching number of compute nodes for cross-backup/restore operations.
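The weekly-full-plus-daily-differential pattern can be sketched as follows. The database name and backup share are illustrative; check the BACKUP DATABASE syntax in Books Online for your appliance update:

```sql
-- Weekly full backup of the appliance database (all nodes,
-- automatically compressed) to a network share.
BACKUP DATABASE SalesDW
TO DISK = '\\backupserver\pdwbackups\SalesDW_full';

-- Daily differential: only the changes since the last full backup.
BACKUP DATABASE SalesDW
TO DISK = '\\backupserver\pdwbackups\SalesDW_diff'
WITH DIFFERENTIAL;
```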
The sixth item under MPP deployment is monitoring the appliance. Out of the box you have access to the PDW Admin Console as well as the underlying system and dynamic management views, which are accessible via a SQL prompt. You can optionally (and we recommend to our customers) download the SQL Server Parallel Data Warehouse Management Pack and install it into a System Center Operations Manager (SCOM) 2007 R2 instance. You can also (and we often do) develop additional custom monitoring reports with Reporting Services that query the appliance's system and dynamic management views, and then deploy those items to a SharePoint farm for regular consumption.
The seventh and final item is ensuring proper appliance security. The appliance supports SQL Server authentication, which entails logins and roles (server and database). Ideally you designed the security infrastructure during the design phase of work and just need to instantiate it at this point. Be sure to set up service accounts for appliance access as well, such as one for Reporting Services.
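Instantiating that design boils down to creating logins, mapping them to database users, and granting only the permissions each account needs. The sketch below uses illustrative names and follows familiar SQL Server syntax; confirm which security statements your appliance update supports before adopting it:

```sql
-- SQL authentication login for the Reporting Services service account.
CREATE LOGIN ReportSvc WITH PASSWORD = 'Str0ng!Passw0rd';

-- Map the login to a database user in the data warehouse.
USE SalesDW;
CREATE USER ReportSvc FOR LOGIN ReportSvc;

-- Grant read-only access to the BI-facing view, nothing more --
-- a report author should never be able to drop the warehouse.
GRANT SELECT ON OBJECT::dbo.vTradeSummary TO ReportSvc;
```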
Migrating a SQL Server SMP Database
There are a few methods available to you for migrating (forklifting) a traditional SQL Server SMP database to SQL Server Parallel Data Warehouse. There is the old reliable method of manually creating PDW DDL and then using a variety of tools, including SSIS and PDW's dwloader, to load the historical data into the new MPP schema. This manual migration process involves converting incompatible data types, using PDW's specific DDL syntax, choosing distributed versus replicated table storage, and selecting distribution, index, and partition columns and values. Our firm has created a utility to automatically migrate databases from SQL Server SMP to SQL Server Parallel Data Warehouse; you can find more information on the tool, as well as a short video clip of its usage, at http://www.bivoyage.com/pdw_migration.html (we use the utility to assist our customers' projects and do not sell the tool directly).
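A tiny example of the manual DDL conversion step, with illustrative names: SMP constructs that PDW V1 does not support (such as IDENTITY columns) must be removed, data types mapped, and a distribution column chosen.

```sql
-- SMP original (simplified):
--   CREATE TABLE dbo.FactSales
--   (
--       SalesKey    BIGINT IDENTITY(1,1) PRIMARY KEY,
--       CustomerKey INT           NOT NULL,
--       SalesAmount DECIMAL(18,2) NOT NULL
--   );

-- PDW V1 translation: IDENTITY is dropped (the key is populated by
-- the load process instead) and a distribution column is selected.
CREATE TABLE dbo.FactSales
(
    SalesKey    BIGINT        NOT NULL,
    CustomerKey INT           NOT NULL,
    SalesAmount DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH (CustomerKey));
```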
The Future of Parallel Data Warehouse: Microsoft's Big Data Appliance
With the announcements made by Microsoft at PASS Summit 2012, the future of Parallel Data Warehouse is clear. Dr. David DeWitt of the Microsoft Gray Systems Lab provided invaluable details behind Project Polybase; his detailed presentation can be found here. Microsoft intends to unify the two diverse worlds of Hadoop and massively parallel processing (MPP) database management systems. Project Polybase will facilitate parallel query execution over data persisted in Parallel Data Warehouse as well as in the Hadoop Distributed File System (HDFS). In his presentation, Dr. DeWitt discusses three phases of increasingly deep PDW and HDFS parallel query integration.
Project Polybase will make its debut (Phase 1) in Parallel Data Warehouse's initial V2 release. In addition, Microsoft (in partnership with Hortonworks) is actively building HDInsight, an Apache-compatible Hadoop distribution that Microsoft is making available on the Windows Server and Windows Azure platforms. Project Polybase is slated to work with all Apache-compatible Hadoop distributions. While Project Polybase is getting much-warranted attention, don't forget that Parallel Data Warehouse V2 will also include native support for columnstore indexes, which can dramatically increase data warehouse query performance.
Parallel Data Warehouse's hardware vendors (HP and Dell) also continue to ship faster hardware. Key hardware components for Parallel Data Warehouse appliances include the InfiniBand fabric, CPU cores (per compute node), and memory (per compute node). With faster hardware, columnstore index support, and parallel query execution across HDFS- and MPP-persisted datasets, Parallel Data Warehouse V2 is slated to become Microsoft's Big Data Appliance.
Download the Case Study
You can download the Direct Edge Parallel Data Warehouse Case Study at http://www.bivoyage.com/pdw_contact.html.