MERGE Statement Tips

When a new T-SQL feature is introduced in SQL Server, it usually takes some time before users realize certain aspects of working with the feature. In time, people develop best practices for using the feature and discover interesting, nontrivial things about it. In this article, I discuss some tips concerning the MERGE statement that I discovered after several years of working with the statement. I cover four main tips: preventing MERGE conflicts, understanding that the MERGE ON clause isn't a filter, realizing that the MERGE USING clause is like the FROM clause in a SELECT statement, and, finally, referring to source table elements in the OUTPUT clause.

For the purposes of this article, I use a sample database called TSQL2012. You can download the source code to create the database and populate it with sample data. Some of my examples use the Sales.Customers table as the source table for the MERGE statement. All of my examples use a table called Sales.MyCustomers as the target table. Use the code in Listing 1 to create the table in the TSQL2012 database.

USE TSQL2012; -- download from http://tsql.solidq.com/books/source_code/TSQL2012.zip

IF OBJECT_ID('Sales.MyCustomers') IS NOT NULL DROP TABLE Sales.MyCustomers;

CREATE TABLE Sales.MyCustomers
(
  custid       INT          NOT NULL,
  companyname  NVARCHAR(40) NOT NULL,
  country      NVARCHAR(15) NOT NULL,
  phone        NVARCHAR(24) NOT NULL,
  CONSTRAINT PK_MyCustomers PRIMARY KEY(custid)
);

Preventing MERGE Conflicts

One of the classic tasks that's handled with the MERGE statement is update where exists and insert where not exists. Some people refer to this task as upsert. For example, suppose you have a stored procedure that accepts as input parameters the attributes of a customer: @custid, @companyname, @country, @phone. The procedure is supposed to check whether the customer ID already exists in the target table Sales.MyCustomers; if it does exist, update the target row, overwriting the nonkey columns with the new values, and if it doesn't exist, add a new row. This task can be accomplished with the following MERGE statement:

MERGE INTO Sales.MyCustomers AS TGT
USING ( VALUES( @custid, @companyname, @country, @phone ) )
AS SRC( custid, companyname, country, phone )
ON SRC.custid = TGT.custid
WHEN MATCHED THEN UPDATE
SET TGT.companyname = SRC.companyname,
TGT.country = SRC.country,
TGT.phone = SRC.phone
WHEN NOT MATCHED THEN INSERT
VALUES( SRC.custid, SRC.companyname, SRC.country, SRC.phone );

The problem is that under the default isolation level, read committed, conflicts can occur if two processes execute the procedure at the same time with the same customer ID as input. If the input customer ID doesn't exist in the target table, the MERGE predicate in both cases will return a false, activating in both cases the WHEN NOT MATCHED action: INSERT. In such a case, one of the inserts will result in a primary key violation. To prove this, run the code in Listing 2 from two sessions.

SET NOCOUNT ON;
USE TSQL2012;

BEGIN TRY

  WHILE 1 = 1
  BEGIN

    DECLARE
      @custid       INT          = CHECKSUM(SYSDATETIME()),
      @companyname  NVARCHAR(40) = N'A',
      @country      NVARCHAR(15) = N'B',
      @phone        NVARCHAR(24) = N'C'; 

    MERGE INTO Sales.MyCustomers AS TGT
    USING ( VALUES( @custid, @companyname, @country, @phone ) )
          AS SRC( custid, companyname, country, phone )
      ON SRC.custid = TGT.custid
    WHEN MATCHED THEN UPDATE
      SET TGT.companyname = SRC.companyname,
          TGT.country     = SRC.country,
          TGT.phone       = SRC.phone
    WHEN NOT MATCHED THEN INSERT
      VALUES( SRC.custid, SRC.companyname, SRC.country, SRC.phone );
  END;

END TRY
BEGIN CATCH

  THROW;

END CATCH;
SET NOCOUNT OFF;

The code uses an infinite loop that invokes the aforementioned MERGE statement. It uses the expression CHECKSUM(SYSDATETIME()) to generate the customer ID. There's a high likelihood that the resulting value doesn't already exist as a customer ID in the target table and that after some iterations both sessions will produce the same value. When I ran this code from two sessions, I got a conflict after a few seconds, and one of the sessions generated the error message in Figure 1.

Note that this isn't considered a bug. The atomicity of the MERGE transaction isn't violated -- it's still an all-or-nothing behavior. According to the read committed isolation level, such conflicts aren't supposed to be prevented. If you want to serialize access to the object throughout the transaction (both the check whether the customer exists and the insertion), you need to use the serializable isolation level. You can achieve this by either setting the session's isolation level to serializable (SET TRANSACTION ISOLATION LEVEL SERIALIZABLE), or by specifying the table hint HOLDLOCK or SERIALIZABLE (both have the same meaning) against the target table.

To test using the serializable isolation level alter the MERGE statement in Listing 2, replace the first line in the statement with the following:

MERGE INTO Sales.MyCustomers WITH (HOLDLOCK) AS TGT

Run the code from two sessions, and this time conflicts shouldn't occur.

The MERGE ON Clause Isn't a Filter

The MERGE statement uses an ON clause to define the MERGE predicate. However, this clause has a different meaning than the ON clause in a join -- which some people don't realize. In a join, the ON clause is used for both matching and filtering purposes. In a MERGE statement, the ON clause is used to identify matches and nonmatches and to accordingly determine which WHEN clause to activate. If you don't realize this, and you think of ON as a filter, you can end up writing MERGE statements with bugs. I'll demonstrate this problem with an example, and I'll provide a solution.

In my example, I'll use the Sales.Customers table as the source for the MERGE statement and the Sales.MyCustomers table as the target. First, run the following code to populate the Sales.MyCustomers table with some rows representing customers from Sweden and Italy:

TRUNCATE TABLE Sales.MyCustomers;
INSERT INTO Sales.MyCustomers(custid, companyname, country, phone)
SELECT custid, companyname, country, phone
FROM Sales.Customers
WHERE country IN (N'Sweden', N'Italy');

Next, suppose you're given a task to write a MERGE statement that merges customers from Italy from the Sales.Customers table into the Sales.MyCustomers table. You're supposed to update customers that exist, and insert customers that don't exist. But you need to work with only source customers from Italy. If you think of the MERGE ON clause as a filter, you might write a statement such as the following:

MERGE INTO Sales.MyCustomers AS TGT
USING Sales.Customers SRC
ON SRC.custid = TGT.custid
AND SRC.country = N'Italy'
WHEN MATCHED THEN UPDATE
SET TGT.companyname = SRC.companyname,
TGT.country = SRC.country,
TGT.phone = SRC.phone
WHEN NOT MATCHED THEN INSERT
VALUES( SRC.custid, SRC.companyname, SRC.country, SRC.phone );

Try running this code, and you'll get the error message in Figure 2.

The reason for the error is that the ON clause isn't a filter; it only determines which of the WHEN clauses to activate. This means that if the source has a customer that isn't from Italy, the ON predicate produces a false, and the row is directed to the WHEN NOT MATCHED clause. This clause attempts to insert the row into the target table. If the customer already exists in the target, you get a primary key violation error. If it doesn't exist, you end up inserting a row that you're not supposed to into the target.

The solution is to do all the filtering that you need ahead of time in a table expression and use the table expression as the source for the MERGE statement. Here's an example of how you can achieve this with a CTE:

WITH SRC AS
(
SELECT custid, companyname, country, phone
FROM Sales.Customers
WHERE country = N'Italy'
)
MERGE INTO Sales.MyCustomers AS TGT
USING SRC
ON SRC.custid = TGT.custid
WHEN MATCHED THEN UPDATE
SET TGT.companyname = SRC.companyname,
TGT.country = SRC.country,
TGT.phone = SRC.phone
WHEN NOT MATCHED THEN INSERT
VALUES( SRC.custid, SRC.companyname, SRC.country, SRC.phone );
The MERGE USING Clause Is Like the FROM Clause

There's an interesting aspect of the MERGE statement that some people aren't aware of. The USING clause of the MERGE statement is designed very similar to the FROM clause in the SELECT statement. This means that you aren't limited to specifying only one table in the USING clause as the source. Rather, you can refer to table expressions, table functions, and even use table operators such as JOIN, APPLY, PIVOT, and UNPIVOT. Eventually, the outcome of the USING clause is a virtual table that's considered the source for the merge operation.

I'll first demonstrate using table operators such as joins, and then I'll demonstrate using table functions. Run the following code to create three tables, each holding a different attribute of the customer:

IF OBJECT_ID('Sales.CustCompany') IS NOT NULL DROP TABLE Sales.CustCompany;
IF OBJECT_ID('Sales.CustCountry') IS NOT NULL DROP TABLE Sales.CustCountry;
IF OBJECT_ID('Sales.CustPhone') IS NOT NULL DROP TABLE Sales.CustPhone;
SELECT custid, companyname INTO Sales.CustCompany FROM Sales.Customers;
SELECT custid, country INTO Sales.CustCountry FROM Sales.Customers;
SELECT custid, phone INTO Sales.CustPhone FROM Sales.Customers;
ALTER TABLE Sales.CustCompany ADD CONSTRAINT PK_CustCompany PRIMARY KEY(custid);
ALTER TABLE Sales.CustCountry ADD CONSTRAINT PK_CustCountry PRIMARY KEY(custid);
ALTER TABLE Sales.CustPhone ADD CONSTRAINT PK_CustPhone PRIMARY KEY(custid);

You're given three tables, each holding different customer attributes: Sales.CustCompany, Sales.CustCountry, and Sales.CustPhone. Never mind why the customer attributes were originally separated into three tables -- this isn't your concern. You're tasked with developing a MERGE statement that uses the data from the combined tables as the source and the Sales.MyCustomers table as the target. You need to implement simple upsert logic -- update where exists and insert where not exists.

Of course, you could handle the joins between the three tables in a table expression and use the table expression as the source for the operation. However, because the USING clause is designed very similar to the FROM clause, you can actually perform the joins directly in this clause. Eventually, the USING clause returns a virtual table representing the result of all table operators, and the result table is used as the source for the merge operation. Here's such a MERGE statement implementing the joins in the USING clause:

MERGE INTO Sales.MyCustomers AS TGT
USING Sales.CustCompany
INNER JOIN Sales.CustCountry
ON CustCompany.custid = CustCountry.custid -- join ON
INNER JOIN Sales.CustPhone
ON CustCompany.custid = CustPhone.custid -- join ON
ON CustCompany.custid = TGT.custid -- MERGE ON
WHEN MATCHED THEN UPDATE
SET TGT.companyname = CustCompany.companyname,
TGT.country = CustCountry.country,
TGT.phone = CustPhone.phone
WHEN NOT MATCHED THEN INSERT
VALUES( CustCompany.custid, CustCompany.companyname, CustCountry.country, CustPhone.phone );

It's important to note that the first two ON clauses are used by the two instances of the JOIN table operator, and the last ON clause is used by the MERGE statement. As I discussed in the previous section, the two cases have different meanings. The last one defining the MERGE predicate identifies matches and nonmatches between rows in the source (the result of all table operators) and the target.

Another example demonstrating the similarities between the USING and FROM clauses is referring to a table function as the source. For example, suppose that the source for the merge operation is data in a file called Customers.txt, and that you have a format file called CustomersFmt.xml. You could first bulk load the data into a staging table and then use the staging table as the source for the merge operation. But you can avoid the first step with the staging table by querying the file data directly, using the OPENROWSET function with the BULK provider.

Assuming the files are located in the folder C:\temp, here's a query retrieving the data from the source file using the BULK rowset provider:

SELECT *
FROM OPENROWSET(BULK 'c:\temp\Customers.txt',
FORMATFILE = 'c:\temp\CustomersFmt.xml') AS SRC;
Similarly, the OPENROWSET function can be used directly in the USING clause of the MERGE statement, like so:
MERGE INTO Sales.MyCustomers AS TGT
USING OPENROWSET(BULK 'c:\temp\Customers.txt',
FORMATFILE = 'c:\temp\CustomersFmt.xml') AS SRC
ON SRC.custid = TGT.custid
WHEN MATCHED THEN UPDATE
SET TGT.companyname = SRC.companyname,
TGT.country = SRC.country,
TGT.phone = SRC.phone
WHEN NOT MATCHED THEN INSERT
VALUES( SRC.custid, SRC.companyname, SRC.country, SRC.phone );

When you're done, run the following code to drop the source tables used in this section:

IF OBJECT_ID('Sales.CustCompany') IS NOT NULL DROP TABLE Sales.CustCompany;
IF OBJECT_ID('Sales.CustCountry') IS NOT NULL DROP TABLE Sales.CustCountry;
IF OBJECT_ID('Sales.CustPhone') IS NOT NULL DROP TABLE Sales.CustPhone;
MERGE with OUTPUT Allows Referring to Source Values

The MERGE statement provides you with more flexibility than other statements when it comes to the OUTPUT clause. With other statements, the OUTPUT clause lets you refer to only elements from the modified rows. With the MERGE statement, the OUTPUT clause lets you refer to elements from the modified rows, as well as elements from the source table. Before I demonstrate this advantage, run the following code to clear the Sales.MyCustomers table:

TRUNCATE TABLE Sales.MyCustomers;

Suppose that you need to query the Sales.MyCustomers table and insert the result rows into the Sales.MyCustomers table. You want to add an OUTPUT clause that returns the columns custid, companyname, country, and phone from the inserted rows. But you also want the OUTPUT clause to return the source column contactname even though it's not part of the target row. You attempt to issue the following INSERT statement:

INSERT INTO Sales.MyCustomers(custid, companyname, country, phone)
OUTPUT
inserted.custid, inserted.companyname,
inserted.country, inserted.phone, c.contactname
SELECT custid, companyname, country, phone
FROM Sales.Customers AS C;

But the OUTPUT clause of an INSERT statement doesn't let you refer to the source table columns. You get the error message in Figure 3.

Interestingly, the OUTPUT clause of the MERGE statement is more flexible in the sense that it lets you refer to source table columns. With this in mind, you can mimic the previous INSERT statement by using a MERGE statement with a predicate that's always false. This will activate the INSERT action of the WHEN NOT MATCHED clause. Here's the solution code:

MERGE INTO Sales.MyCustomers AS TGT
USING Sales.Customers AS SRC
ON 1 = 2
WHEN NOT MATCHED THEN INSERT
VALUES( SRC.custid, SRC.companyname, SRC.country, SRC.phone )
OUTPUT
inserted.custid, inserted.companyname,
inserted.country, inserted.phone, SRC.contactname;

This time, the code runs successfully.

When you're done, run the following code for cleanup:

IF OBJECT_ID('Sales.MyCustomers') IS NOT NULL DROP TABLE Sales.MyCustomers;
Use MERGE Effectively

The MERGE statement was introduced in SQL Server 2008. In this article, I covered aspects of the MERGE statement that I've discovered over time. I explained that in the default isolation level, read committed, MERGE statements implementing upsert logic can conflict with one another and that you can avoid such conflicts by using the serializable isolation level. I explained the difference between the JOIN ON clause and the MERGE ON clause, and I provided a solution for filtering the source data used for the operation. I described the similarities between the MERGE USING and the SELECT FROM clauses, demonstrating the use of joins and table functions in the USING clause. Finally, I explained that the OUTPUT clause in the MERGE statement lets you refer to the source table columns, unlike with other statements.

Comments

Plain text