This Was Not the Article I Intended to Write
This article developed as I was in the process of writing another one on string functions. I had not run across the existence of the new STRING_SPLIT() function that was part of SQL Server 2016 until my research on what will be a couple of future articles. So long as your target database is in COMPATIBILITY_MODE = 130 you can take advantage of this function.
“What advantage is that?” You say? STRING_SPLIT() is a table-valued function which allows you to parse a delimited string passed in as a parameter (along with a separation character as a second parameter) returning the output column [value] as a table valued result set that presents as a row for each delimited value between each separator character.
STRING_SPLIT Syntax
The syntax for returning results from a function call to STRING_SPLIT is reminiscent of any call to a table-valued function:
SELECT value
FROM STRING_SPLIT( input string varchar(max), separator character varchar(1));
The input string parameter can be one of any of the following:
- Delimited string value
- varchar() typed variable assignment
- Table column when using a CROSS APPLY
Examples for Returning Results from STRING_SPLIT()
Let’s take a look at how each of the aforementioned options for passing an input string parameter into STRING_SPLIT yields results:
Example One: Passing in a String Value:
This is by far the most straight-forward approach requiring simply the input string and the separator character. Using the following example should provide some clarity. Let’s see what happens when we input a list of airports I’ve visited last year as a sample:
SELECT value
FROM STRING_SPLIT ('AZO,YYZ,SEA,PDX,SFO,RKV,AMS', ',');
value
---------------------------
AZO
YYZ
SEA
PDX
SFO
RKV
AMS
Of course the results are able to be filtered with predicates in the WHERE clause subject to sorting via the ORDER BY clause:
SELECT value
FROM STRING_SPLIT ('AZO,YYZ,SEA,PDX,SFO,RKV,AMS', ',')
WHERE value LIKE 'A%'
ORDER BY value;
value
---------------------------
AMS
AZO
Example Two: Normalizing Data using a Variable for the Delimited String… and a Cursor:
We can also declare a variable as varchar(n) or varchar(max) and pass that in as the first parameter and see similar results. This example uses the practical case of cleaning denormalized data from one table through parsing and inserting into a normalized table. I’ll make use of three tables:
- [events] – storing event data related to SQL Cruise for up-coming events through 2018
- [event_ports_denormalized] – storing an id column referencing the id column in [events] and a ports_list_csv column which stores a comma-separated list of each port for that event.
- [event_ports] which will be a normalized version of the [event_ports_denormalized] table.
CREATE TABLE [events]
(
id INT IDENTITY(1,1),
event_name VARCHAR(100)
);
CREATE TABLE [event_ports_denormalized]
(
id INT NOT NULL,
ports_list_csv VARCHAR(500) NOT NULL
);
CREATE TABLE [event_ports]
(
id INT NOT NULL,
port_name VARCHAR(50) NOT NULL
);
We also need to populate these tables to make sense of the results:
INSERT INTO [events](event_name)
VALUES ('SQL Cruise Alaska 2017')
, ('SQL Cruise Caribbean 2018')
, ('SQL Cruise Alaska 2018')
INSERT INTO [event_ports_denormalized](id, ports_list_csv)
VALUES (1,'Seattle,Ketchikan,Juneau,Skagway,Victoria')
, (2,'Miami,St. Thomas,Tortola,Nassau')
, (3,'Seattle,Ketchikan,Juneau,Skagway,Victoria');
The tables’ contents looks like this:
[events]:
id event_name
----------- --------------------------
1 SQL Cruise Alaska 2017
2 SQL Cruise Caribbean 2018
3 SQL Cruise Alaska 2018
[event_ports_denormalized]:
id ports_list_csv
----------- --------------------------------------------
1 Seattle,Ketchikan,Juneau,Skagway,Victoria
2 Miami,St. Thomas,Tortola,Nassau
3 Seattle,Ketchikan,Juneau,Skagway,Victoria
If we wanted to look at a result set of ports and their associated events the resulting data doesn’t look conducive to further processing and joins with other tables in this state:
SELECT E.event_name, EPD.ports_list_csv
FROM [events] AS E
INNER JOIN [event_ports_denormalized] AS EPD
ON E.id = EPD.id
ORDER BY E.id;
event_name ports_list_csv
----------------------------- ---------------------------------------------
SQL Cruise Alaska 2017 Seattle,Ketchikan,Juneau,Skagway,Victoria
SQL Cruise Caribbean 2018 Miami,St. Thomas,Totola,Nassau
SQL Cruise Alaska 2018 Seattle,Ketchikan,Juneau,Skagway,Victoria
So it’s fair time we clean this data up so we can use it with other tables in the database. Let’s get to normalizing!
The approach we will take here is to use a cursor to assign each csv value and its associated id to variables to use STRING_SPLIT() to parse the values and insert into the [event_ports] table:
DECLARE @id INT
DECLARE @csv VARCHAR(500)
DECLARE splitting_cursor CURSOR FOR
SELECT id, ports_list_csv
FROM [event_ports_denormalized];
OPEN splitting_cursor
FETCH NEXT FROM splitting_cursor
INTO @id, @csv
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO [event_ports] (id, port_name)
SELECT @id, SS.value
FROM STRING_SPLIT(@csv,',') AS SS
FETCH NEXT FROM splitting_cursor
INTO @id, @csv
END
CLOSE splitting_cursor;
DEALLOCATE splitting_cursor;
Using the following query we can now see a normalized listing of events and the associated ports:
SELECT E.event_name, EP.port_name
FROM [events] AS E
INNER JOIN [event_ports] AS EP
ON E.id = EP.id
ORDER BY E.id;
event_name port_name
----------------------------- ------------
SQL Cruise Alaska 2017 Seattle
SQL Cruise Alaska 2017 Ketchikan
SQL Cruise Alaska 2017 Juneau
SQL Cruise Alaska 2017 Skagway
SQL Cruise Alaska 2017 Victoria
SQL Cruise Caribbean 2018 Miami
SQL Cruise Caribbean 2018 St. Thomas
SQL Cruise Caribbean 2018 Totola
SQL Cruise Caribbean 2018 Nassau
SQL Cruise Alaska 2018 Seattle
SQL Cruise Alaska 2018 Ketchikan
SQL Cruise Alaska 2018 Juneau
SQL Cruise Alaska 2018 Skagway
SQL Cruise Alaska 2018 Victoria
Example Three: Using table values through a CROSS APPLY (Cursor-less):
I showed you the last process in order to show you how much simpler and lightweight it could be by using a CROSS APPLY. Let’s take the final step to using STRING_SPLIT() to normalize delimited data by a straight CROSS APPLY bypassing the cursor all together. I’ve truncated the [event_ports] table (take my word for it) and am ready to use the following process to reload the table:
INSERT INTO [event_ports] (id, port_name)
SELECT id, SS.value
FROM [event_ports_denormalized] AS EPD
CROSS APPLY STRING_SPLIT(EPD.ports_list_csv,',') AS SS;
The results are the same and run lighter with only 4 lines of code:
SELECT E.event_name, EP.port_name
FROM [events] AS E
INNER JOIN [event_ports] AS EP
ON E.id = EP.id
ORDER BY E.id;
event_name port_name
----------------------------- ------------
SQL Cruise Alaska 2017 Seattle
SQL Cruise Alaska 2017 Ketchikan
SQL Cruise Alaska 2017 Juneau
SQL Cruise Alaska 2017 Skagway
SQL Cruise Alaska 2017 Victoria
SQL Cruise Caribbean 2018 Miami
SQL Cruise Caribbean 2018 St. Thomas
SQL Cruise Caribbean 2018 Totola
SQL Cruise Caribbean 2018 Nassau
SQL Cruise Alaska 2018 Seattle
SQL Cruise Alaska 2018 Ketchikan
SQL Cruise Alaska 2018 Juneau
SQL Cruise Alaska 2018 Skagway
SQL Cruise Alaska 2018 Victoria
NULL or Empty String?
One last thing I want to cover involves what happens when you have two delimiter characters positioned sequentially in the input string for the function. Take the following parameter as an example (noting the double comma sequence between Ketchikan and Juneau):
SELECT value
FROM STRING_SPLIT('Ketchikan,,Juneau,Skagway,Victoria',',');
value
--------------
Ketchikan
Juneau
Skagway
Victoria
(5 row(s) affected)
The value for the row returned for the sequential comma delimiter characters in treated as an empty string and not NULL.
Conclusion
Whether due to age or poorly-coded, many applications and nosql solutions often store comma-separated values (CSVs) in a single column/field. In many cases Microsoft SQL Server becomes involved in this denormalized data through either migration, ETL efforts, or some form of coordinated merge of system data that could benefit from using the STRING_SPLIT() function. The function is, essentially, a string parsing function that normalizes that which isn’t. This opens up new possibilities for easily normalizing data – whether internal to your SQL Server database or through interaction with an outside data source.