What's New in DTS?

In early 2000, the Microsoft Data Transformation Services (DTS) development team I work on started revising DTS with the goals of building on previous success and of improving the product to support user requests and to provide a richer extraction, transformation, and loading (ETL) platform. We evaluated every aspect of DTS and decided to totally rewrite it. DTS in the upcoming SQL Server 2005 release, formerly code-named Yukon, sports many brand-new features as well as enhanced ones. Because so much of DTS is new in SQL Server 2005, I want to show you some of the most important changes and the new look of the DTS Designer. When I wrote this article, I was working with Beta 1 of SQL Server 2005 DTS, so some features might change in upcoming betas or in the final release. But if you're already familiar with SQL Server 2000 and 7.0 DTS releases, you'll be able to appreciate the coming improvements.

SQL Server 2005 DTS Design Goals

Because comprehending everything about DTS at a glance is difficult, let's just take a quick look at the most important goals and how the goals drove the design and feature decisions the DTS team made in SQL Server 2005. Although these descriptions are brief, they should help you grasp the magnitude of the changes.

Provide true ETL capabilities. Although the data pump in pre-SQL Server 2005 DTS is useful and flexible, most users recognize that it has its limitations and needs to be revamped. For example, the data pump supports only one source and one destination per pump. True enterprise ETL requires fast, flexible, extensible, and dependable data movement. SQL Server 2005 DTS provides this capability through the Data Flow Task—or, as our team calls it, the pipeline. The pipeline supports multiple sources, multiple transforms, and multiple destinations in one fast, flexible data flow. As of Beta 1, SQL Server 2005 DTS includes 26 transforms. The Conditional Split and Derived Column transforms use an expression evaluator to support operations that provide virtually limitless combinations of functionality for processing data. Other transforms such as the Slowly Changing Dimension, Fuzzy Match, Aggregate, File Extractor, File Inserter, Partition Processing, Data Mining Query, Dimension Processing, Lookup, Sort, Unpivot, and Data Conversion transforms provide powerful data-manipulation capabilities that don't require scripting. This change is a real benefit because users can develop transformation solutions faster and manage them easier than hand-coded solutions.

Distinguish between data flow, control flow, and event handling. SQL Server 2005 DTS emphasizes the differences between various kinds of data processing. In current DTS releases, users are sometimes confused when they try to distinguish between data flow and control flow because both appear on the DTS Designer surface. In SQL Server 2005 DTS, the concept of data flow includes all the activities users perform to extract, transform, and load data. Control flow comprises all the processes that set up a given environment to support ETL, including executing the data flow. SQL Server 2005 DTS also has event handlers that allow nonsequential control flow execution based on events that tasks and other objects generate inside a package. SQL Server 2005 DTS clearly distinguishes between data flow, control flow, and event handling in the UI by showing them in separate Designer surfaces.

Minimize disk usage. To make DTS into a screaming fast ETL tool, we needed to eliminate unnecessary disk writes, disk reads, and memory movement. Because ETL solutions can be quite complex, they typically involve some sort of disk caching and lots of memory movement and allocations. In some cases, you can't avoid disk usage—for example, during data extraction, data loading, or aggregation or sorting of data sets that are larger than available memory. But in many cases, moving memory and caching aren't necessary. The pipeline helps eliminate the avoidable cases by optimizing memory usage and being smart about moving memory only when absolutely necessary.

Improve scalability. To be accepted as an enterprise ETL platform, SQL Server 2005 DTS needed the ability to scale. Users in smaller shops might need to run DTS on less-powerful, affordable commodity hardware, and users in enterprise environments want it to scale up to SMP production machines. SQL Server 2005 DTS solves this scalability problem by using multiple threads in one process. This approach is more efficient and uses less memory than using multiple processes. SQL Server uses this scaling approach successfully, so we decided to use the same method for DTS.

Recognize the development-programming connection. Experienced DTS users know that developing packages is much like writing code, but DTS in SQL Server 2000 doesn't support that connection very well. However, SQL Server 2005 DTS provides a professional development environment that includes projects, deployment, configuration, debugging, source control, and sample code. Package writers will have the tools they need to effectively write, troubleshoot, maintain, deploy, configure, and update packages in a fully supported development environment.

Improve package organization. As packages grow in size and complexity, they can sometimes become cluttered and unintelligible. To address users' concerns about managing larger packages, our team added more structure for packages and provided ways to better manage the objects in each package. For example, the DTS runtime, which houses the DTS control flow, now has containers that isolate parts of a package into smaller, easy-to-organize parts. Containers can hold other containers and tasks, so users can create a hierarchy of parts within the package.

SQL Server 2005 DTS variables are now scoped, which means that variables in a container are visible only to the container where the variable is defined and to the container's children. Containers also help users define transaction scope. In SQL Server 2005 DTS, users can define transaction scope by configuring the transaction in a container. Because a package can have multiple containers, one package can support the creation of multiple independent transactions. Users can also enable and disable execution of a container and all its children, which is especially useful when you attempt to isolate parts of the package for debugging or for developing new packages. On SQL Server 2005's DTS Designer surface, users can collapse containers to simplify the visible package and view a package as a collection of constituent compound parts. Variables support namespaces, which simplify identification and eliminate ambiguity in variable names. All these features let users simplify complex packages.

Eliminate promiscuous package access. In SQL Server 2005 DTS, the package pointer is no longer passed in to tasks, so tasks have no way to peruse the package and its contents. This design change discourages promiscuous access and profoundly affects the way users create DTS packages in SQL Server 2005 because it enforces declarative package creation, a process similar to coding. The change also simplifies package maintenance, troubleshooting, debugging, upgrading, and editing because the package logic is exposed in the Designer, not hidden inside a task.

In SQL Server 2000 DTS, tasks sometimes use the package pointer to promiscuously access the internals of the package in which they're running. This practice is a common use for the ActiveX Script Task. Scripting is desirable, but using the ActiveX Script Task this way creates packages that are difficult to understand and troubleshoot. It also makes updating packages difficult. For example, automatically upgrading a package that uses an ActiveX Script Task to loop inside the package is difficult because an upgrade utility would have to parse the script and modify it to work against the new object model. Continued support for tasks accessing the package object model would make upgrading packages to future DTS releases difficult as well. Also, this kind of promiscuous package access isn't advisable because in SQL Server 2005 DTS, tasks would interfere with all the services the DTS runtime provides, causing unpredictable results.

Removing promiscuous access has affected the set of runtime features. Many of the new DTS runtime features provide alternative ways of performing the functions that, in earlier DTS releases, the ActiveX Script and Dynamic Properties tasks provided. SQL Server 2005 DTS includes new loop containers, configurations, property mappings, and expressions that directly target the functional void that this change creates. These new features are better supported and more consistent and manageable than solutions you code yourself.

So what's happened to the ActiveX Script Task? Although it has a new, more powerful UI with integrated debugging, integrated Help, autocomplete, and Intellisense, the ActiveX Script Task, like all other tasks, is limited to providing only task behavior and can no longer modify the package during execution.

Isolate tasks in control flow. The focus of control-flow functionality in DTS has shifted from tasks to the runtime. Although tasks still define package behavior, in SQL Server 2005 DTS, the runtime directs all aspects of control-flow execution order, looping, package termination, configuration, event handling, connections, and transactions. Tasks in SQL Server 2005 DTS are relatively isolated objects that have no direct access to the package, other tasks, or external resources. Tasks work only in their problem domain and only with the parameters the DTS runtime passes to them. A special container called a Taskhost imposes most of these limits. Taskhosts are generally transparent to the package writer and perform default behavior on behalf of tasks. Some of the Taskhost's benefits are subtle, but one important benefit is that it simplifies writing a task that supports the new features such as breakpoints and logging.

Connection managers are another feature that extends the runtime's control over the environment in which tasks run. Connection managers are similar to connections in DTS in SQL Server 2000 but more extensive and more important. In SQL Server 2005 DTS, tasks and other objects use connection managers for accessing all external resources, including data from databases, flat files, Web pages, and FTP servers. Using connection managers lets the DTS runtime validate, detect, and report when a connection is using an invalid source or destination. The use of connection managers also lets users more easily discover what resources a package is accessing. Because resource access is confined to connection managers and not spread throughout the package in perhaps unknown or hard-to-find properties on tasks, use of connection managers simplifies package configuration, maintenance, and deployment.

Improve extensibility. The Microsoft DTS development team wrote SQL Server 2005 DTS with the understanding that it was to be a true platform. By this, I mean that users can embed DTS in their applications, write custom components and plug them into DTS, write management UIs for DTS, or use it for its original purpose—as a utility for moving data. Extensibility is a big part of what makes the new DTS a platform instead of a simple utility. Customers can still write custom tasks and custom transforms in SQL Server 2005 DTS, but the product contains new options that let customers write tasks and transforms by using managed code written in C#, Visual Basic .NET, and other .NET languages. And SQL Server 2005 DTS still supports writing custom components with Visual Basic (VB) 6.0, C++, and other native development languages.

The new release also includes more types of extensible components. Previous DTS releases provide connectivity only through OLE DB connections. SQL Server 2005 DTS includes HTTP, FTP, OLE DB, Windows Management Interface (WMI), Flat File, File, and other connections, and users can write their own connections if the ones they want aren't available. If users want to support new protocols or even new data-access technologies, they can create new connection types to support them without modifying other DTS components. The extensible connection feature benefits Microsoft and customers; it makes adding new connections simpler for Microsoft, and customers aren't limited to what Microsoft provides.

In SQL Server 2005 DTS, the runtime intrinsically supports looping through two new looping constructs in the form of containers. The Forloop container evaluates a user-defined expression at the beginning of each iteration, and the Foreachloop container iterates once for each item in a user-provided collection by using a new type of object called a Foreachenumerator. Because these loop constructs are containers, users can place tasks and other containers inside them and execute their contents multiple times. SQL Server 2005 DTS ships with several Foreachenumerators including SQL Server Management Objects (SMO), generic collection, XML nodelist, ADO, file, and multi-element enumerators. Foreachenumerators are also extensible, so if users want to support custom collections, they can write their own enumerators.

SQL Server 2005 DTS supports another new type of object called a log provider. Log providers handle all the details of creating log entries for a given destination and format. SQL Server 2005 DTS lets users easily write their own log providers if the ones that ship in the box don't meet their needs. The new product will ship with several log provider types, including Text, XML, event log, SQL Server Profiler, and SQL Server.

The pipeline is also extensible. Users can write custom data adapters and transformations that plug into the pipeline. Users can also write pipeline data-source adapters to support a particular source's format, parse the data, and put it into the pipeline. Likewise, pipeline data-destination adapters support removing data from the pipeline and loading it to the destination. Pipeline transforms are components that modify data as it flows through the pipeline. SQL Server 2005 DTS provides several options for writing pipeline data adapters and transforms, including using native code, managed code, or the Managed Script Transform.

Redesigning the Designer

SQL Server 2005's DTS Designer is more capable and powerful than those in earlier DTS releases. The new DTS Designer is hosted in the Visual Studio shell to take advantage of all the features Visual Studio provides such as integrated debugging, Intellisense, source control, deployment utilities, property grids, solution management, and editing support. These features simplify building, managing, and updating packages. As I mentioned, data flow, control flow, and event handling are separated into dedicated panes in the DTS Designer. This separation makes it easier for users to see what the package is doing and to isolate parts of the package. SQL Server 2005 DTS supports debugging with features such as breakpoints, watches, errors, warnings, informational messages, and progress notifications. Packages now return more targeted and informative error messages that are visible in various locations in the Designer. Improvements to many UI features make the entire workspace better. Experienced Visual Studio users will quickly feel at home in the new DTS Designer because it's so similar to other Visual Studio applications. But regardless of whether users are familiar with the environment, it's intuitive enough that they'll be comfortable working with it in no time. Let's look at a few important new features of the SQL Server 2005 DTS Designer.

Designer control flow. Figure 1 shows the Control Flow view in the SQL Server 2005 DTS Designer's Business Intelligence Workbench. In the left pane of the window is a toolbox containing all the available tasks. Double-clicking a control-flow item or dragging it onto the Designer surface adds a new instance of the selected task or container to the control flow in the package.

On the Control Flow tab, you can see a model of the sequence container. In the model, the Send Mail Task has an arrow beneath it. To create a precedence constraint, a user needs only to drag the arrow to another task. The Connections tab, which lists a package's data source connections, is in the pane below the design surface. The information in the Connections tab makes connections easier to find and clarifies the control flow. In SQL Server 2000 DTS, connections and tasks are combined on one Designer page and are easy to confuse. Our team eliminated this confusion by visually separating connections from tasks. The list in the Variables pane at the bottom of the window includes each variable's scope and data type.

The right two panes in Figure 1 are the Solution Explorer and the Properties grid. SQL Server 2005's DTS Designer supports Visual Studio projects, which keep track of files and settings related to the environment and the project files. The Solution Explorer provides a central location for managing projects. In this DTS Designer pane, you can manage Analysis Services and Reporting Services projects so that you can work with your cubes, reports, and packages in one solution. The Properties grid is a powerful tool for modifying packages. With it, you can view and edit the properties of any object visible within the DTS Designer, including tasks, precedence constraints, variables, breakpoints, and connections.

The sample package on the Control Flow tab in Figure 1 shows how you can embed containers inside each other. The package has a Foreachloop container that holds an XML Task and a Sequence Container that holds a set of tasks. In SQL Server 2005 DTS, when you delete a container, you also delete all the tasks and containers it holds because variables and transactions are created on containers. So, a transaction on the Sequence Container would be scoped only to that container's tasks and containers and wouldn't be visible outside the Sequence Container to tasks or containers such as the Foreachloop or XML Task. This change makes SQL Server 2005 DTS more flexible than DTS in SQL Server 2000, in which users can create transactions only at the package level.

Designer data flow. Figure 2, shows the DTS Designer's Data Flow tab, which you can access by clicking the tab or double-clicking a Data Flow Task. This view is similar to the Control Flow view, with a few differences. When the Data Flow view is active, the toolbox in the left pane shows Data Flow Items, including data-source adapters, transforms, and data-destination adapters. To use these tools, users double-click them or drag them to the Designer surface. Figure 2 also shows the Output pane. DTS requires validation, which means that a component must confirm that it can successfully run when the package calls its Execute() function. If a component can't run, it must explain why—DTS components communicate warnings, errors, or other information by raising events during package validation and execution. The SQL Server 2005 DTS Designer captures such events in the output window.

The Properties grid in Figure 2 shows a couple of interesting links. The Show Editor link, like the link of the same name on the Control Flow view's Properties grid, opens the editor for the currently selected transform. The Show Advanced Editor link shows a generic editor that lets users edit transforms that have no custom UI. Because transforms in a Data Flow Task don't execute in sequence, the DataView instead provides Data Viewers, UI elements that let users view data while it's passing between transforms. Data Viewers are a powerful debugging feature that helps package writers understand what's happening inside the pipeline.

Migration Pain

After reading about all the improvements, changes, and new features in DTS, you might wonder how the new product will work with legacy DTS packages. You might even anticipate problems with upgrading pre-SQL Server 2005 packages—and you'd be right. Early in the redesign of DTS, when we realized that we had to change the object model drastically, we also realized that the upgrade path from SQL Server 2000 DTS to SQL Server 2005 DTS would be difficult. After a lot of sometimes-heated discussion, we decided that our customers would benefit most if the product was free from the limitation of strict backward compatibility so that the next generation of DTS would be based on a more flexible design. Customers we spoke to told us that this choice was acceptable as long as we didn't break their existing DTS packages.

By now you've probably guessed that some of your packages won't upgrade completely. However, we've provided some upgrade options that you can use to help ensure a smooth migration to SQL Server 2005 DTS. The first option is to run your existing packages as you always have. The SQL Server 2000 DTS bits will ship with SQL Server 2005, so you'll still be able to execute your SQL Server 2000 DTS packages. The second option is to run SQL Server 2000 DTS packages inside SQL Server 2005 packages. You can do this by using the new ExecuteDTS2000Package Task, which wraps the SQL Server 2000 package in a SQL Server 2000 environment inside the SQL Server 2005 package. The ExecuteDTS2000Package Task will successfully execute your legacy packages and is useful in partial-migration scenarios while you're transitioning between SQL Server 2000 and SQL Server 2005 DTS.

If you want to upgrade your packages, you have a third option. SQL Server 2005 DTS will ship with a "best effort" upgrade wizard called the Migration Wizard that will move most of the packages that you generated by using the SQL Server 2000 DTS Import/Export wizard. If you have an ActiveX Script Task or a Dynamic Properties Task in your package, it probably does something that SQL Server 2005 DTS no longer allows, such as modifying other tasks or modifying the package. The migration wizard won't be able to migrate those parts of the package. However, you can migrate packages a little at a time because SQL Server 2005 DTS will support SQL Server 2000 DTS side-by-side execution.

Fresh Faces, SDK, and Other Support

Except for some new names, the Import/ Export Wizard and command-line utilities remain largely unchanged. DTSRun.exe is called DTExec.exe in SQL Server 2005 DTS. DTSRunUI.exe is called DTExecUI in SQL Server 2005 DTS, and it features a face-lift. We added a new command-line utility called DTUtil.exe that you can use for performing common administrative tasks such as moving, deleting, and copying packages. The utility also performs other tasks such as checking for the existence of packages. In addition, we included a new configuration wizard called the Package Configurations Organizer for creating package configurations, project-development capabilities that bundle a package with its configuration, and a self-installing executable for deploying packages to other machines.

SQL Server 2005 DTS might ship with a software development kit (SDK). However, as of Beta 1, the plan for the SDK is still undefined. Some features you might expect are a task wizard, a transform wizard, and other component-creation wizards. More information about the SDK should be available as SQL Server 2005 gets closer to shipping.

That's the whirlwind tour of SQL Server 2005 DTS. As you can see, most of the concepts remain the same, but the product is brand-new. Indeed, by the time you read this, DTS might even have a new name that reflects that fact.

Comments

Plain text