Getting Started with Parallel Data Warehouse

This summer Microsoft will release the SQL Server 2008 R2 Parallel Data Warehouse (PDW) edition, its first product in the Massively Parallel Processor (MPP) data warehouse space. PDW uniquely combines MPP software acquired from DATAllegro, parallel compute nodes, commodity servers, and disk storage. PDW lets you scale out enterprise data warehouse solutions into the hundreds of terabytes and even petabytes of data for the most demanding customer scenarios. In addition, because the parallel compute nodes work concurrently, it often takes only seconds to get the results of queries run against tables containing trillions of rows. For many customers, the large data sets and the fast query response times against those data sets are game-changing opportunities for competitive advantage.

The simplest way to think of PDW is a layer of integrated software that logically forms an umbrella over the parallel compute nodes. Each compute node is a single physical server that runs its own instance of the SQL Server 2008 relational engine in a shared-nothing architecture. In other words, compute node 1 doesn't share CPU, memory, or storage with compute node 2.

Figure 1 shows the architecture for a PDW data rack.

Figure 1: Examining the architecture of a PDW data rack (click to enlarge)

The smallest PDW will take up two full racks of space in a data center, and you can add storage and compute capacity to PDW one data rack at a time. A data rack contains 8 to 10 compute servers from vendors such as Bull, Dell, HP, and IBM, and Fibre Channel storage arrays from vendors such as EMC, HP, and IBM. The sale of PDW includes preconfigured and pretested software and hardware that's tightly configured to achieve balanced throughput and I/O for very large databases. Microsoft and these hardware vendors provide full planning, implementation, and configuration support for PDW.

The collection of physical servers and disk storage arrays that make up the MPP data warehouse is often referred to as an appliance. Although the appliance is often thought of as a single box or single database server, a typical PDW appliance is comprised of dozens of physical servers and disk storage arrays all working together, often in parallel and under the orchestration of a single server called the control node. The control node accepts client query requests, then creates an MPP execution plan that can call upon one or more compute nodes to execute different parts of the query, often in parallel. The retrieved results are sent back to the client as a single result set.

Taking a Closer Look

Let's dive deeper into PDW's architecture in Figure 1. As I mentioned previously, PDW has a control node that clients connect to in order to query a PDW database. The control node has an instance of the SQL Server 2008 relational engine for storing PDW metadata. It also uses this engine for storing intermediate query results in TempDB for some query types. The control node can receive the results of intermediate query results from multiple compute nodes for a single query, store those results in SQL Server temporary tables, then merge those results into a single result set for final delivery to the client.

The control node is an active/passive cluster server. Plus, there's a spare compute node for redundancy and failover capability.

A PDW data rack contains 8 to 10 compute nodes and related storage nodes, depending on the hardware vendor. Each compute node is a physical server that runs a standalone SQL Server 2008 relational engine instance. The storage nodes are Fibre Channel-connected storage arrays containing 10 to 12 disk drives.

You can add more capacity by adding data racks. Depending on disk sizes, a data rack can contain in the neighborhood of 30TB to 40TB of useable disk space. (These numbers can grow considerably if 750GB or larger disk drives are used by the hardware vendor.) The useable disk space is all RAID 1 at the hardware level and uses SQL Server 2008 page compression. So, if your PDW appliance has 10 full data racks (these are full data-center rack units), you have 300TB to 400TB of useable disk space and 80 to 100 parallel compute nodes. As of this writing, each compute node is a two-socket server with each CPU having at least four cores. In our example, that's 640 to 800 CPU cores and lots of Fibre Channel disk storage. I'm not sure how many organizations currently need that much CPU and storage capacity for their enterprise data warehouses. However, in the words of my big brother, "It's coming!"

Besides the control node, PDW has several additional nodes:

Landing zone node. This node is used to run dwloader, a key utility for high-speed parallel loading of large data files into databases, with minimal impact to concurrent queries executing on PDW. With this utility, data from a disk or SQL Server Integration Services (SSIS) pipeline can be loaded, in parallel, to all compute nodes. A new high-speed destination adapter was developed for SSIS. Because the destination adapter is an in-memory process, SSIS data doesn't have be staged on the landing zone prior to loading.
Backup node. This node is used for backing up user databases, which are physically spread across all compute nodes and their related storage nodes. When backing up a single user database, each compute node backs up, in parallel, its portion of the user database. To perform the backups, the user databases leverage the standard SQL Server 2008 backup functionality that's provided by the SQL Server 2008 relational engine on each compute node.
Management node. This node runs the Windows Active Directory (AD) domain controller (DC) for the appliance. It's also used to deploy patches to all nodes in the appliance and hold images in case a node needs reimaging.

Understanding the Table Types

PDW has two primary types of tables: replicated and distributed. Replicated tables exist on every compute node. This type of table is most often used for dimension tables. Dimension tables are often small, so keeping a copy of them on each compute node often improves queries' join performance because the data doesn't have to be shuffled between compute nodes to handle certain types of parallel queries or dimension-only queries. Very large dimension tables might be candidates for distributed tables.

Distributed tables are typically used for large fact or transaction tables that contain billions or even trillions of rows. PDW automatically creates distributions for a distributed table. Distributions are separate physical tables at the SQL Server instance level on a compute node. Metadata on the control node keeps track of the mapping of a single distributed table and all its constituent distributions on each compute node.

PDW automatically creates eight distributions per compute node for a distributed table. (As of this writing, the number of distributions isn't configurable.) Therefore, a PDW appliance with 10 compute nodes has 80 total distributions per distributed table. Loading 100 billion rows into a distributed table will cause that table to be distributed across all 80 distributions, providing a suitable distribution key is chosen.

A single-attribute column from a distributed table is used as the distribution key. PDW hashes distributed rows from a distributed table across the distributions as evenly as possible. Choosing the right column in a distributed table is a big part of the design process and will likely be the topic of many future articles and best practices. Suffice it to say, some trial and error is inevitable. Fortunately, the dwloader utility can quickly reload 5TB or more of data, which is often enough data to test a new design.

Creating Databases and Tables

To create databases and tables in PDW, you use code that is aligned with ANSI SQL 92 but has elements unique to PDW. To create a database, you use the CREATE DATABASE command. This command has four arguments:

AUTOGROW, which you use to specify whether you want to allow data and log files to automatically grow when needed.
REPLICATED_SIZE, which you use to specify how much space to initially reserve for replicated tables on each compute node.
DISTRIBUTED_SIZE, which you use to specify how much space to initially reserve for distributed tables. This space is equally divided among all the compute nodes.
LOG_SIZE, which you use to specify how much space to initially reserve for the transaction log. This space is equally divided among all the compute nodes.

For example, the following CREATE DATABASE command will create a user database named my_DB that has 16TB of distributed data space, 1TB of replicated table space, and 800GB of log file space on a PDW appliance with eight compute nodes:

CREATE DATABASE my_DB
WITH ( AUTOGROW = ON
  ,REPLICATED_SIZE = 1024 GB
  ,DISTRIBUTED_SIZE = 16,384 GB
  ,LOG_SIZE = 800 GB )

A total of 8TB of usable disk space (8 compute nodes x 1024GB) will be consumed by replicated tables because each compute node needs enough disk space to contain a copy of each replicated table. Two terabytes of usable disk space will be consumed by each of the 8 compute nodes (16,384GB / 8 compute nodes) for distributed tables. Each compute node will also consume 100GB of usable disk space (800GB / 8 compute nodes) for log files. As a general rule of thumb, the overall log-file space for a user database should be estimated at two times the size of the largest data file being loaded.

When creating a new user database, you won't be able to create file groups. PDW does this automatically during database creation because file group design is tightly configured with the storage to achieve overall performance and I/O balance across all compute nodes.

After the database is created, you use the CREATE TABLE command to create both replicated and distributed tables. PDW's CREATE TABLE command is very similar to a typical SQL Server CREATE TABLE command and even includes the ability to partition distributed tables as well as replicated tables. The most visible difference in this command on PDW is the ability to create a table as replicated or to create a table as distributed.

As a general rule of thumb, replicated tables should be 1GB or smaller in size. Listing 1 contains a sample CREATE TABLE statement that creates a replicated table named DimAccount. As you can see, the DISTRIBUTION argument is set to REPLICATE.

Listing 1

CREATE TABLE DimAccount
(
  AccountKey int NOT NULL,
  ParentAccountKey int NULL,
  AccountCodeAlternateKey int NULL,
  ParentAccountCodeAlternateKey int NULL,
  AccountDescription nvarchar(50) ,
  AccountType nvarchar(50) ,
  Operator nvarchar(50) ,
  CustomMembers nvarchar(300) ,
  ValueType nvarchar(50) ,
  CustomMemberOptions nvarchar(200) 
)
WITH (CLUSTERED INDEX(AccountKey),
  DISTRIBUTION = REPLICATE);

Generally speaking, distributed tables are used for transaction or fact tables that are often much larger than 1GB in size. In some cases, large dimension tables—for example, a 500-million row customer account table—is a better candidate for a distributed table. Listing 2 contains code that creates a distributed table named FactSales. (You can download the code in Listings 1 and Listing 2 by clicking the Download the Code Here button near the top of the page.)

Listing 2: Code That Creates a Distributed Table

CREATE TABLE FactSales
(
  StoreIDKey int NOT NULL,
  ProductKey int NOT NULL,
  DateKey int NOT NULL,
  SalesQty int NOT NULL,
  SalesAmount decimal(18,2) NOT NULL
)
WITH (CLUSTERED INDEX(DateKey),
DISTRIBUTION = HASH(StoreIDKey));

As I mentioned previously, a single-attribute column must be chosen as the distribution key so that data loading can be hash distributed as evenly as possible across all the compute nodes and their distributions. For a retailer with a large point of sale (POS) fact table and a large store-inventory fact table, a good candidate for the distribution key might be the column that contains the store ID. By hash distributing both fact tables on the store ID, you might create a fairly even distribution of the rows across all compute nodes. Also, PDW will co-locate on the same distribution (i.e., rows from the POS fact table and rows from the store-inventory fact table for the same store ID). Co-located data is related, so queries that access POS data and related store inventory data should perform very well.

To take full advantage of PDW, designing databases and tables for the highest-priority queries is crucial. PDW excels at scanning and joining large distributed tables, and often queries against these large tables are mission critical. A good database design on PDW often takes a lot of trial and error. What you learned in the single server database world isn't always the same in the MPP data warehouse world. For instance, clustered indexes can work well for large distributed tables, but nonclustered indexes can degrade query performance in some cases because of the random I/O patterns they create on the disk storage. PDW is tuned and configured to achieve high rates of sequential I/O against large tables. For many queries, sequential I/O against a distributed table can be faster than using nonclustered indexes, especially under concurrent workloads. In the MPP data warehouse world, this is known as an index-light design.

Querying Tables

After the tables are loaded with data, clients can connect to the control node and use SQL statements to query PDW tables. For example, the following query runs against the FactSales table created with Listing 2, leveraging all the parallel compute nodes and the clustered index for this distributed table:

SELECT StoreIDKey, SUM(SalesQty)
FROM dbo.FactSales
WHERE DateKey >= 20090401
  AND DateKey <= 20090408
  AND ProductKey = 2501
GROUP BY StoreIDKey

This query performs exceedingly well against very large distributed tables. It's distribution-compatible and aggregation-compatible because each compute node can answer its part of the parallel query in its entirety without shuffling data among compute nodes or merging intermediate result sets on the control node.

Once the control node receives that query, PDW's MPP engine performs its magic by taking the following steps:

It parses the SQL text.
It validates and authorizes all objects (authentication and authorization).
It builds an MPP execution plan.
It runs the MPP execution plan by executing SQL SELECT commands in parallel on each compute node.
It gathers and merges all the parallel result sets from the compute nodes.
It returns a single result set to the client.

As you can see, although queries appear to be run against a single table, in reality, they're run against a multitude of tables.

The MPP engine is responsible for a variety of features and functions in PDW. They include appliance configuration management, authentication, authorization, schema management, MPP query optimization, MPP query execution control, client interaction, metadata management, and collection of hardware status information.

The Power of PDW

PDW's power lies in large distributed tables and parallel compute nodes that scan those distributed tables to answer queries. Thus, PDW is well-suited for vertical industries (e.g., retail, telecom, logistics, hospitality), where large amounts of transactional data exist. It doesn't take too much of a leap to consider PDW well-suited for mission-critical applications (e.g., for the military or law enforcement), where lives depend on such capabilities. I hope you're as excited as I am to work on real-world and mission-critical solutions with this new technology.

Comments

Plain text