Parsing Denormalized Strings with the STRING_SPLIT Function in SQL Server 2016

This Was Not the Article I Intended to Write

This article developed as I was in the process of writing another one on string functions. I had not run across the existence of the new STRING_SPLIT() function that was part of SQL Server 2016 until my research on what will be a couple of future articles. So long as your target database is in COMPATIBILITY_MODE = 130 you can take advantage of this function.

“What advantage is that?” You say? STRING_SPLIT() is a table-valued function which allows you to parse a delimited string passed in as a parameter (along with a separation character as a second parameter) returning the output column [value] as a table valued result set that presents as a row for each delimited value between each separator character.

STRING_SPLIT Syntax

The syntax for returning results from a function call to STRING_SPLIT is reminiscent of any call to a table-valued function:

SELECT value
FROM STRING_SPLIT( input string varchar(max), separator character varchar(1));

The input string parameter can be one of any of the following:

Delimited string value
varchar() typed variable assignment
Table column when using a CROSS APPLY

Examples for Returning Results from STRING_SPLIT()

Let’s take a look at how each of the aforementioned options for passing an input string parameter into STRING_SPLIT yields results:

Example One: Passing in a String Value:

This is by far the most straight-forward approach requiring simply the input string and the separator character. Using the following example should provide some clarity. Let’s see what happens when we input a list of airports I’ve visited last year as a sample:

SELECT value 
FROM STRING_SPLIT ('AZO,YYZ,SEA,PDX,SFO,RKV,AMS', ',');

value
---------------------------
AZO
YYZ
SEA
PDX
SFO
RKV
AMS

Of course the results are able to be filtered with predicates in the WHERE clause subject to sorting via the ORDER BY clause:

SELECT value 
FROM STRING_SPLIT ('AZO,YYZ,SEA,PDX,SFO,RKV,AMS', ',')
WHERE value LIKE 'A%'
ORDER BY value;

value
---------------------------
AMS
AZO

Example Two: Normalizing Data using a Variable for the Delimited String… and a Cursor:

We can also declare a variable as varchar(n) or varchar(max) and pass that in as the first parameter and see similar results. This example uses the practical case of cleaning denormalized data from one table through parsing and inserting into a normalized table. I’ll make use of three tables:

[events] – storing event data related to SQL Cruise for up-coming events through 2018
[event_ports_denormalized] – storing an id column referencing the id column in [events] and a ports_list_csv column which stores a comma-separated list of each port for that event.
[event_ports] which will be a normalized version of the [event_ports_denormalized] table.

CREATE TABLE [events]
	( 
		id INT IDENTITY(1,1),
		event_name VARCHAR(100)
	);

CREATE TABLE [event_ports_denormalized]
	(
		id INT NOT NULL, 
		ports_list_csv VARCHAR(500) NOT NULL
	);

CREATE TABLE [event_ports]
	(
		id INT NOT NULL, 
		port_name VARCHAR(50) NOT NULL
	);

We also need to populate these tables to make sense of the results:

INSERT INTO [events](event_name)
VALUES ('SQL Cruise Alaska 2017')
	, ('SQL Cruise Caribbean 2018')
	, ('SQL Cruise Alaska 2018') 

INSERT INTO [event_ports_denormalized](id, ports_list_csv) 
VALUES (1,'Seattle,Ketchikan,Juneau,Skagway,Victoria')
	, (2,'Miami,St. Thomas,Tortola,Nassau')
	, (3,'Seattle,Ketchikan,Juneau,Skagway,Victoria');

The tables’ contents looks like this:

[events]:

id          event_name
----------- --------------------------
1           SQL Cruise Alaska 2017
2           SQL Cruise Caribbean 2018
3           SQL Cruise Alaska 2018

[event_ports_denormalized]:

id          ports_list_csv
----------- --------------------------------------------
1           Seattle,Ketchikan,Juneau,Skagway,Victoria
2           Miami,St. Thomas,Tortola,Nassau
3           Seattle,Ketchikan,Juneau,Skagway,Victoria

If we wanted to look at a result set of ports and their associated events the resulting data doesn’t look conducive to further processing and joins with other tables in this state:

SELECT E.event_name, EPD.ports_list_csv
FROM [events] AS E
	INNER JOIN [event_ports_denormalized] AS EPD
		ON E.id = EPD.id
ORDER BY E.id;

event_name                    ports_list_csv
----------------------------- ---------------------------------------------
SQL Cruise Alaska 2017        Seattle,Ketchikan,Juneau,Skagway,Victoria
SQL Cruise Caribbean 2018     Miami,St. Thomas,Totola,Nassau
SQL Cruise Alaska 2018        Seattle,Ketchikan,Juneau,Skagway,Victoria

So it’s fair time we clean this data up so we can use it with other tables in the database. Let’s get to normalizing!

The approach we will take here is to use a cursor to assign each csv value and its associated id to variables to use STRING_SPLIT() to parse the values and insert into the [event_ports] table:

DECLARE @id INT
DECLARE @csv VARCHAR(500)

DECLARE splitting_cursor CURSOR FOR 
	SELECT id, ports_list_csv
	FROM [event_ports_denormalized];

OPEN splitting_cursor

FETCH NEXT FROM splitting_cursor 
INTO @id, @csv

WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO [event_ports] (id, port_name)
	SELECT @id, SS.value
	FROM STRING_SPLIT(@csv,',') AS SS

	FETCH NEXT FROM splitting_cursor 
	INTO @id, @csv
END 

CLOSE splitting_cursor;
DEALLOCATE splitting_cursor;
Using the following query we can now see a normalized listing of events and the associated ports:
SELECT E.event_name, EP.port_name
FROM [events] AS E
	INNER JOIN [event_ports] AS EP
		ON E.id = EP.id
ORDER BY E.id;

event_name                    port_name
----------------------------- ------------
SQL Cruise Alaska 2017        Seattle
SQL Cruise Alaska 2017        Ketchikan
SQL Cruise Alaska 2017        Juneau
SQL Cruise Alaska 2017        Skagway
SQL Cruise Alaska 2017        Victoria
SQL Cruise Caribbean 2018     Miami
SQL Cruise Caribbean 2018     St. Thomas
SQL Cruise Caribbean 2018     Totola
SQL Cruise Caribbean 2018     Nassau
SQL Cruise Alaska 2018        Seattle
SQL Cruise Alaska 2018        Ketchikan
SQL Cruise Alaska 2018        Juneau
SQL Cruise Alaska 2018        Skagway
SQL Cruise Alaska 2018        Victoria

Example Three: Using table values through a CROSS APPLY (Cursor-less):

I showed you the last process in order to show you how much simpler and lightweight it could be by using a CROSS APPLY. Let’s take the final step to using STRING_SPLIT() to normalize delimited data by a straight CROSS APPLY bypassing the cursor all together. I’ve truncated the [event_ports] table (take my word for it) and am ready to use the following process to reload the table:

INSERT INTO [event_ports] (id, port_name)
SELECT id, SS.value
FROM [event_ports_denormalized] AS EPD
	CROSS APPLY STRING_SPLIT(EPD.ports_list_csv,',') AS SS;

The results are the same and run lighter with only 4 lines of code:

SELECT E.event_name, EP.port_name
FROM [events] AS E
 INNER JOIN [event_ports] AS EP
  ON E.id = EP.id
ORDER BY E.id;

event_name                    port_name
----------------------------- ------------
SQL Cruise Alaska 2017        Seattle
SQL Cruise Alaska 2017        Ketchikan
SQL Cruise Alaska 2017        Juneau
SQL Cruise Alaska 2017        Skagway
SQL Cruise Alaska 2017        Victoria
SQL Cruise Caribbean 2018     Miami
SQL Cruise Caribbean 2018     St. Thomas
SQL Cruise Caribbean 2018     Totola
SQL Cruise Caribbean 2018     Nassau
SQL Cruise Alaska 2018        Seattle
SQL Cruise Alaska 2018        Ketchikan
SQL Cruise Alaska 2018        Juneau
SQL Cruise Alaska 2018        Skagway
SQL Cruise Alaska 2018        Victoria

NULL or Empty String?

One last thing I want to cover involves what happens when you have two delimiter characters positioned sequentially in the input string for the function. Take the following parameter as an example (noting the double comma sequence between Ketchikan and Juneau):

SELECT value
FROM STRING_SPLIT('Ketchikan,,Juneau,Skagway,Victoria',',');

value
--------------
Ketchikan

Juneau
Skagway
Victoria

(5 row(s) affected)

The value for the row returned for the sequential comma delimiter characters in treated as an empty string and not NULL.

Conclusion

Whether due to age or poorly-coded, many applications and nosql solutions often store comma-separated values (CSVs) in a single column/field. In many cases Microsoft SQL Server becomes involved in this denormalized data through either migration, ETL efforts, or some form of coordinated merge of system data that could benefit from using the STRING_SPLIT() function. The function is, essentially, a string parsing function that normalizes that which isn’t. This opens up new possibilities for easily normalizing data – whether internal to your SQL Server database or through interaction with an outside data source.

Comments

Plain text