Writing Distributed Heterogeneous Queries in SQL Server 7.0

Shortcuts to data from multiple sources

Microsoft's catch phrase in the early days of Windows was "Information at your fingertips." That phrase comes to mind when I think about SQL Server 7.0's support for distributed heterogeneous queries.

What Is a Distributed Heterogeneous Query?

A distributed heterogeneous query lets you access data from multiple sources in one step. You can dissect distributed heterogeneous query into distributed, referring to data disbursed on multiple machines, and heterogeneous, referring to diverse data sources. Say you have data stored in databases on SQL Server, Microsoft Access, and Oracle. With a distributed query, you can create two linked servers in your SQL Server system that represent the Oracle and Access data sources. Now you can reference tables within those two distributed heterogeneous data sources as if they were local tables within SQL Server. You can issue a single query to SQL Server, which can then retrieve the Access and Oracle data.

SQL Server packages all the data into a single result set and returns the answer to the end user at a client PC. This approach is similar to using attached tables with Access.

Data Sources

SQL Server supports distributed queries through OLE DB, Microsoft's new API standard for Universal Data Access (UDA). You can understand OLE DB by comparing it with Open Database Connectivity (ODBC), which has become the primary universal data-access mechanism for relational data sources.

ODBC has simplified access to heterogeneous databases—particularly when you're accessing a relational database. However, much useful data, including audio, video, and full text, isn't in relational databases—or any database management system (DBMS). OLE DB lets you access this non-relational data.

SQL Server speaks OLE DB natively. OLE DB is the primary means for accessing SQL Server data from an application, and components of SQL Server use OLE DB to speak to each other. Microsoft designed distributed queries to work with any OLE DB provider. However, the initial release has been tested only with selected OLE DB providers including SQL Server, ODBC, Access (i.e., the Jet database engine), and Oracle.

Why would Microsoft release a native OLE DB API for Oracle before fully supporting Microsoft products? Microsoft knows that Oracle is numero uno in the database world. Microsoft must ensure that a stable and fast OLE DB interface exists for Oracle data sources; the development community is more likely, then, to accept the new API standard.

Also, OLE DB provides an alternative to the use of object-relational databases (ORDBs): universal servers. The sidebar "An ORDB Hybrid," page 29, explains Microsoft's ORDB management system (ORDBMS) strategy, which includes universal server functionality.

The examples in this article focus on distributed queries running against Access. (This focus simplifies the article for readers who don't have multiple SQL Server 7.0 boxes.) SQL Server-to-Access queries are useful, but I stress that SQL Server-to-SQL Server queries constitute a powerful way to use SQL Server 7.0's new distributed-queries technology. Microsoft will keep pre-SQL Server 7.0 remote server capabilities for backward compatibility, but you should use the new query features for data spread across multiple SQL Server systems.

EXAMPLE 1

Let's get our hands dirty with a simple example that walks you through creating a linked server connected to the sample Northwinds Access database. Next, you'll join a table from the Northwinds Access server to a local table in SQL Server. (By the way, Northwinds ships with SQL Server 7.0 as a sample database.)

The following command creates a distributed server entry called Northwinds. The command uses the Microsoft.Jet.OLEDB.3.51 OLE DB provider linking to an Access database located at c:\brian\writing\sqlservermag\nov98\nwind.mdb:

sp_addlinkedserver
	@server = "Northwinds",
	@srvproduct = "Access 97",
	@Provider = 
		"Microsoft.Jet.OLEDB.3.51",
		@datasrc =
		"c:\brian\writing\sqlservermag\nov98\nwind.mdb"

Example 1 uses data from this Access .mdb file. However, linking to other distributed data stores is just as simple. The sp_addlinkedserver reference in Microsoft's Books Online (BOL) shows the complete syntax for identifying a linked server to Access, SQL Server 7.0, Oracle, and the generic OLE DB for ODBC provider. The OLE DB for ODBC provider lets you create a linked server to any ODBC-compliant data store, including SQL Server 4.x and SQL Server 6.x.

When you set up a linked server, you're registering the connection information and data source information with SQL Server so that the server knows how to find the data when you need it. In ODBC terms, think of this process as creating a Data Source Name (DSN) for the client to use.

Referencing remote tables in a linked server is simple; it's similar to referencing a local SQL Server table. Each local table in SQL Server has a four-part, fully qualified name consisting of server, database, object owner, and object. Distributed queries use a similar four-part naming scheme, but the terms are slightly different. You specify distributed query four-part names in the following form: server_name.catalog.schema.object_name. Server_name is the logical name you specified in the sp_addlinkedserver call. Catalog and schema are ANSI terms that mean database and object owner in SQL Server terms, and object_name refers to a table or whatever object type works for the OLE DB provider you're using. When using the Access OLE DB provider, you don't need to specify all the pieces of the four-part naming scheme. If you specify all four parts, the system will send you the following error: Server: Msg 7312, Level 16, State 1: Illegal use of schema and/or catalog for OLE DB provider 'Microsoft.Jet.OLEDB.3.51'. A four-part name was supplied, but the provider does not expose the necessary interfaces to use a catalog and/or schema.

Providers, which expose differing levels of OLE DB functionality, have different rules governing the way you access data and name data stores. You'll probably discover them through trial and error. Microsoft didn't adequately explain the differences in Beta 3's BOL.

One step remains before you can run a distributed query against the new Northwinds linked server. By default, distributed queries use current user credentials when connecting to the remote server. You logged in to your server as systems administrator (sa); so SQL Server connects to the Access database as a user called sa. However, the Access database doesn't include an sa user, so the connection fails. The following command solves this problem by telling SQL Server that all local users logged in as sa should connect to the Northwinds server as the remote user called admin, which is the only default user in an Access database:

exec sp_addlinkedsrvlogin
@rmtsrvname = "Northwinds",
@useself = "false",
@locallogin = "sa",
@rmtuser = "Admin",
@rmtpassword = NULL

You're ready to run your first distributed query. Execute the following query, and you'll receive data from the Northwinds database:

SELECT * FROM
 Northwinds...customers

Congratulations, you've just written your first distributed heterogeneous query!

Access doesn't support the four-part name, but you must include the periods as placeholders. Otherwise, SQL Server won't realize you're trying to run a distributed query and will think you're trying to access a local table called customers owned by someone named Northwinds.

You can join tables from Access and SQL Server as if the Access table were a local SQL Server object. Suppose you've created a mobile sales-force automation system and need a query to insert new orders into the master order table when salespeople dock their laptops. The following query uses a left outer join to find remote orders not yet posted to the MasterOrders table. This query inserts those rows into the central MasterOrders table. You'll find little difference between referencing a local table and referencing a remote table:

INSERT into MasterOrders
SELECT
  RemoteOrders.*
FROM
  Northwinds...orders RemoteOrders left join 
  MasterOrders on RemoteOrders.OrderId = 
  MasterOrders.OrderId
  WHERE 
MasterOrders.OrderID = null

Distributed Queries Alternatives

Let's look at OpenRowset and OpenQuery, two Transact-SQL (T-SQL—for an introduction to T-SQL, see Michael D. Reilly, "Managing Data with Query Windows," page 73) functions that you use to execute distributed queries. OpenRowset and OpenQuery are similar; they both support the passthrough of queries to remote data sources. The primary difference between these operations is that OpenRowset requires definition of connection information on the fly, and OpenQuery works with pre-existing linked-server connections.

Distributed Queries Alternatives

The first OpenRowset example, Example 2, issues a simple passthrough using the connection information provided at run time:

SELECT * FROM
OpenRowset("Microsoft.Jet.OLEDB.3.51",
"c:\brian\writing\sqlservermag\nov98\nwind.mdb"; "admin";, "select * from orders")

Figure 1 shows the syntax for an OpenRowset query.

EXAMPLE 3

The second OpenRowset example, Example 3, re-creates the query in Example 1, which you issued using a linked server:

SELECT
RemoteOrders.*
FROM
OpenRowset("Microsoft.Jet.OLEDB.3.51",
"c:\brian\writing\sqlservermag\nov98\nwind.mdb"; "admin"; , 
"select * from orders") RemoteOrders
left join MasterOrders on 
RemoteOrders.OrderId = MasterOrders.OrderId
WHERE
MasterOrders.OrderID = null

OpenQuery doesn't force you to specify connection information, because the linked server already contains all the information you need. You can reference the OpenQuery function, as you can with OpenRowset, in the FROM clause of a query as if the open query function were a table name. You can reference the function as the target table of an INSERT, UPDATE, or DELETE statement, depending on what capabilities the OLE DB provider has exposed. (Like ODBC, the OLE DB specification is large, and different providers will furnish different levels of functionality. Microsoft has not published conformance levels for OLE DB as it has done with ODBC.)

EXAMPLE 4

Example 4 uses OpenQuery to re-create the distributed query in Example 1: As you can see, OpenQuery is easier to read than OpenRowset. OpenQuery runs a little faster because you're using the connection information that you already supplied using sp_addlinkedserver. Figure 2, page 28, shows the syntax for an OpenQuery.

SELECT
RemoteOrders.*
FROM 
OpenQuery(Northwinds, "select * from orders")
RemoteOrders left join MasterOrders on
RemoteOrders.OrderId = MasterOrders.OrderId 
WHERE 
MasterOrders.OrderID = null

You have three choices when deciding how to execute a distributed query. You can create a linked server with sp_addlinkedserver, which makes the underlying tables available. For the ultimate in dynamic flexibility, you can use OpenRowset to execute a passthrough query against a remote OLE DB data source, specifying connection information dynamically. Another choice is to use OpenQuery and pass through a query to an existing linked server.

Stored Procedures and UI

These examples use stored procedure interfaces for managing and executing distributed queries. Table 1 lists stored procedures and their functions.

Has Microsoft tired of easy-to-use database products and scrapped a user interface (UI)? Of course not! Details about the UI are beyond the scope of this article, and being familiar with the underlying stored procedures is valuable knowledge. Stored procedures are invaluable when you're writing scripts and automating tasks, but the UI does exist and will be especially helpful to the novice user.

Security

When you connect to almost any data store, whether it's a database, mail engine, or voice mail system, you must log in and provide user and security information. When using distributed queries, you're connected to the SQL Server system you're issuing queries from, but you must also connect to the remote data, which is often protected. SQL Server has three ways to authenticate you as a downstream distributed query source.

First, SQL Server can leverage the concept of integrated security and pass through Windows NT domain account credentials (for details, see Vicky Launders, "SQL Server 7.0 Security," page 54). This solution lets you use your NT credentials to first connect to SQL Server A, then use the same set of NT credentials to issue a distributed query against SQL Server B. This approach is the easiest and most powerful way to implement security for distributed queries, but you must wait for Windows 2000 (Win2K—formerly NT 5.0) for this support.

Second, SQL Server can authenticate you with the standard security credentials you supplied when logging in to SQL Server. If you log in to SQL Server A as JoeUser and run a query against SQL Server B, you're mapped as JoeUser on that server as well. The first problem with this security type is the user accounts and passwords must exist on both machines. Second, this approach won't work under NT 4.0 if JoeUser is logged in to SQL Server using integrated security.

The third security option to authenticate users is mapped logins that you create and manage with sp_addsrylogin. Such logins are the most practical approach for managing distributed query environments. Until NT 5.0 ships, this approach is the only option available if your users are connecting through integrated security. When a user logs in to the local server and executes a distributed query that accesses a table on the linked server, the local server will log in to the linked server on behalf of the user to access that table. You use sp_addlinkedsrvlogin to map the login credentials that the local server uses to log in to the linked server. You can execute sp_addlinkedserver to automatically create a default mapping between all logins on the local server and remote logins on the linked server. This default mapping states that SQL Server uses the local login's user credentials when connecting to the linked server on behalf of the login (equivalent to executing sp_addlinkedsrvlogin with @useself set to TRUE for the linked server).

Most of the sp_addlinkedsrvlogin arguments are intuitive except for @useself. (Figure 3 shows the syntax for @useself.) Setting @useself to TRUE tells SQL Server to use the current user credentials (username and password) when connecting to the remote server. If this value is TRUE, the system ignores the @rmtuser and @rmtpassword values. BOL says that a value of TRUE for @useself is invalid for an NT authenticated logon unless the NT environment supports security account delegation and the provider supports NT authentication. This approach won't work until Win2K is available, because NT 4.0 does not support security account delegation. Distributed queries and data partitioning will be relatively seamless after NT supports this feature.

Limitations and Other Gotchas

BOL has more than 40 pages of relevant distributed query information; this article covers the basics. I don't have space to discuss all the gotchas, but I want to share some key points.

Don't expect blazingly fast speed when performing cross-server joins. Distributed queries are powerful tools for extending the functionality and reach of SQL Server, but flexibility comes at a price. Current support for distributed queries does not include a full-featured, cost-based, distributed query processor. This situation means SQL Server might need to work a lot harder to join tables on remote servers than when joining the same data sets locally. Network overhead is affected, but the bigger problem is that SQL Server won't always have the same optimization options available when joining large data sets. SQL Server will try to execute as much of the query on the remote server as possible, but the system will inevitably suck large data sets across the network and store them in local temp tables so that SQL Server can process the join. I'd love to provide more details about the potential pitfalls, but the technology is so new that I haven't uncovered most of them yet. I'll explore this topic in a future article.

Portions of a query might execute on both the local and remote servers. Because of this situation, BOL says that distributed queries are always executed as if SET ANSI_DEFAULTS were set to TRUE. Read a little further in BOL and you'll see that the OLE DB provider for SQL Server sets CURSOR_CLOSE_ ON_COMMIT and SET IMPLICIT_TRANSACTIONS to FALSE after the connection is made. This information is spread throughout BOL.

OLE DB providers accessed from a distributed query can support different sort orders and character sets from those the local server supports, and SQL Server will always use the local sort order when processing data sets across servers. Consider the following example from BOL:

SELECT s.OrderId, s.OrderDate
FROM SalesSvr.pubs.dbo.Sales s
ORDER BY s.OrderDate

SQL Server processes the order by using the sorting order installed on the local server; so you need to move the entire data set across the network and sort it locally. Using sp_serveroptions to set the collation compatible option to TRUE tells SQL Server that the remote server sort order and character sets are compatible with the local ones. This information will give the SQL Server query processor more freedom to choose what pieces of a query can be executed on the remote server because comparisons of character data can pass directly to the remote server.

You must upgrade the catalog stored procedures on SQL Server 6.x before it can participate in a distributed query with SQL Server 7.0. See Upgrading the Catalog Stored Procedures (OLE DB) in BOL for information about how to upgrade the stored procedures.

A Little Fun

Distributed queries offer cool functionality in the initial release of SQL Server 7.0, but they have limitations. Performance needs to increase with heterogeneous joins, and I'd love to see more OLE DB providers supplied for Microsoft and third-party vendor products. The current inability to fully leverage NT single logon through integrated security will create headaches when people deploy this technology.

I hope this article helps you understand how important distributed queries and OLE DB technology are to the future of SQL Server. More important, I hope you now have enough information to start writing distributed queries—today.

Comments

Plain text