Skip navigation
colorful balls of string

T-SQL String Manipulation Tips and Techniques, Part 1

Make the most of T-SQL’s limited capabilities

T-SQL is a language that was mainly designed to handle data manipulation tasks. Not much effort and attention were given to other kinds of tasks, such as string manipulation. Therefore, when you do need to manipulate strings in T-SQL, it can sometimes be quite challenging even for seemingly simple tasks. This article is the first of a two-part series in which I cover several common string manipulation needs. I’d like to thank the following people who provided input regarding this topic: Ruben Garrigos, Kevin Boles, Fabiano Amorim, Milos Radivojevic, Peter Larsson, and Davide Mauri.

Related: 8 T-SQL String Functions

Counting Occurrences of a Substring Within a String

The first technique that I discuss is an old one, yet it’s one of my favorites—especially because it’s interesting to see the expression of surprise on people’s faces when they learn it for the first time. The task at hand involves counting the number of occurrences of a substring (@substr) within an input string (@str). Here’s the solution applied to local variables:

DECLARE
  @str    AS VARCHAR(1000) = 'abchellodehellofhello',
  @substr AS VARCHAR(1000) = 'hello';

SELECT (LEN(@str) - LEN(REPLACE(@str, @substr, ''))) / LEN(@substr);

This code returns the value 3, indicating that the substring 'hello' appears three times in the string 'abchellodehellofhello'.

The steps in this technique are:

  1. Replace all occurrences of @substr in @str with an empty string.
  2. Measure the length of the new string.
  3. Subtract this length from the length of the original string.
  4. Divide the result of the subtraction by the length of @substr.

Because the solution is in the form of a single expression, it can be applied as part of a query using the underlying table’s or view’s attributes as inputs.

Exactly N Occurrences of a Substring Within a String

The next technique that I cover involves checking whether an input string (@str) contains exactly a given number of occurrences (@N) of an input substring (@substr). An example of where you might need this capability is part of a validation that an input string is an IP address. For IPV4, the input string needs to have exactly three dots, and for IPV6 exactly five.

One obvious way to achieve this is to use the previous technique to count occurrences of a substring in a string, like so:

DECLARE
  @str    AS VARCHAR(1000) = 'abchellodehellofhello',
  @substr AS VARCHAR(1000) = 'hello',
  @N      AS INT           = 3;

SELECT
  CASE
    WHEN (LEN(@str) - LEN(REPLACE(@str, @substr, ''))) / LEN(@substr) = @N
      THEN 'True'
    ELSE 'False or Unknown'
  END;

Another way to handle the task is to use the LIKE predicate, like so:

SELECT
  CASE
    WHEN @str LIKE '%' + REPLICATE(@substr + '%', @N)
         AND @str NOT LIKE '%' + REPLICATE(@substr + '%', @N+1)
      THEN 'True'
    ELSE 'False or Unknown'
  END;

The idea is to check that you can find @N occurrences but not @N+1 or more.

With both techniques, the validation is handled by a single expression. Therefore, you can easily embed the expression in a query filter or a CHECK constraint in a table definition.

Replacing Multiple Contiguous Spaces With a Single Space

Another common task is related to data cleansing. You need to get rid of redundant spaces from an input string (@str); that is, you need to replace each segment of multiple contiguous spaces with one space. Peter Larsson, a SQL Server MVP, came up with a beautiful technique to handle this task. First, you need to pick a token that you know can’t appear in the input data. The token can be a single character (e.g., ~) or even multiple characters (e.g., ~!@#). Then, the solution involves three steps (assuming the token is ~):

  1. Replace in @str each occurrence of ' ' (space) with '~ ' (token plus space).
  2. Replace in the result of the previous step each occurrence of ' ~' (space plus token) with '' (an empty string).
  3. Replace in the result of the previous step each occurrence of '~ ' (token plus space) with ' ' (space).

All this translates to the following T-SQL expression:

DECLARE @str AS VARCHAR(1000) =
  'this   is     a   string    with     lots   of   spaces';

SELECT REPLACE(REPLACE(REPLACE(@str, ' ', '~ '), ' ~', ''), '~ ', ' ');

The output of this code is: this is a string with lots of spaces.

Replacing Overlapping Occurrences

The next challenge was brought up by Davide Mauri; he initially found it in the Italian SQL Server forums. It involves a certain misconception that some people have regarding the way the REPLACE function works. Consider the following code. Before executing it, see if you can guess what the output will be:

DECLARE @str AS VARCHAR(1000) = '.x.x.x.x.';
SELECT REPLACE(@str, '.x.', '.y.');

Some people intuitively think that the output should be .y.y.y.y.—however, it’s actually .y.x.y.x.. The reasoning behind the actual result is that REPLACE considers nonoverlapping occurrences from left to right. If we represent the string .x. with the symbol A, you can express the source string as AxAx.; then, replacing each occurrence of A with .y. gives you .y.x.y.x..

Suppose you want to handle the replacement task while considering overlapping occurrences. One way to achieve this is to nest two calls to the REPLACE function, like so:

SELECT REPLACE(REPLACE(@str, '.x.', '.y.'), '.x.', '.y.');

Another option is replacing each occurrence of the separator with two, then applying the originally intended replacement, then replacing each two occurrences of the separator with one. Here’s the expression in T-SQL:

SELECT REPLACE(REPLACE(REPLACE(@str, '.', '..'), '.x.', '.y.'), '..', '.');

String Formatting Numbers With Leading Zeros

There are several tasks related to formatting numbers as strings that people often inquire about. T-SQL perhaps isn’t the best place to handle those, but I’ll still present solutions—mainly for their technical value.

One typical number formatting task is adding leading zeros to an input number (@num INT) that needs to be formatted as a character string with a fixed number of digits (e.g., 10). The task might or might not also require you to deal with negative values, in which case you need to add a - (minus) sign in front. For the sake of our exercise, let’s assume that negative values are possible as input.

One solution involves using the STR function to format the absolute value of @num as a 10-character string with leading spaces, then replacing each space with 0, then adding a minus sign in front of the string. This translates to the following T-SQL expression:

DECLARE @num AS INT = -1759;

SELECT CASE SIGN(@num) WHEN -1 THEN '-' ELSE '' END
  + REPLACE(STR(ABS(@num), 10), ' ', '0');

The output of this code is -0000001759.

Another solution involves converting the absolute input value to a character string, concatenating a string with nine zeros (000000000) with the result of the previous step, extracting the 10 rightmost characters from the result, and finally adding the minus sign in front if the input is negative. Here’s the expression that implements this logic:

SELECT CASE SIGN(@num) WHEN -1 THEN '-' ELSE '' END
  + RIGHT('000000000' + CAST(ABS(@num) AS VARCHAR(10)), 10);

SQL Server 2012 (formerly code-named "Denali"), is planned to support a function called FORMAT that will make your life really easy when it comes to formatting strings representing numbers and other types of values. The FORMAT function accepts a format string compatible with .NET format strings for similar purposes, indicating how you want to format the value. For our specific task, the expression is very simple:

SELECT FORMAT(@num, '0000000000');

Left Trimming Leading Occurrences of a Character

The next task I cover involves trimming all left occurrences of a given character from a string. One common case where this is needed is in stripping off leading zeros from a number formatted as a character string (@str).

The technique to achieve this is quite straightforward when the source string doesn’t contain spaces to begin with. You first replace each occurrence of a zero with a space, then apply LTRIM to remove leading spaces, then replace each occurrence of a space with a zero, like so:

DECLARE @str AS VARCHAR(100) = '0000001709';
SELECT REPLACE(LTRIM(REPLACE(@str, '0', ' ')), ' ', '0');

This code returns 1709.

If spaces are allowed in the input, you want to first replace each existing space with a token that you know can’t appear in the source data (e.g., ~), then apply the previous technique, then replace each occurrence of the token with a space, like so:

DECLARE @str AS VARCHAR(100) = '0000001709 ABC';
SELECT REPLACE(REPLACE(LTRIM(REPLACE(REPLACE(@str, ' ', '~'), '0', ' ')), ' ', '0'), '~', ' ');

This code returns 1709 ABC.

This just gives you an idea of dealing with one complication. Of course there can be several additional complications, such as support for negative values, spaces at the beginning, and so on.

Checking That a String Is Made of Only Digits

Our next task involves operating on an input string (@str) and verifying that it contains only digits. I’ve seen attempts to handle this need that involved using the ISNUMERIC function. The problem with this function is that it returns 1 if the string is convertible to any numeric data type, including FLOAT, REAL, MONEY, and so on. This means that you need to consider a lot of special cases, including currency symbols and others.

One solution in which you don’t need to worry about special cases is to generate a string pattern in which you replicate the character pattern [0-9] (basically, a digit) as many times as the number of characters in the input string, like so:

DECLARE @str AS VARCHAR(1000) = '1759';

SELECT
  CASE
    WHEN @str LIKE REPLICATE('[0-9]', LEN(@str))
      THEN 'True'
    ELSE 'False or Unknown'
  END;

The problem with this approach is that if the input strings are very long, the pattern generated will be five times as long as the input.

There’s another solution that also doesn’t require you to deal with special cases, and it uses a very short pattern. You use two levels of negation, verifying that the string doesn’t contain any character that’s not a digit, like so:

SELECT
  CASE
    WHEN @str NOT LIKE '%[^0-9]%' THEN 'True'
    ELSE 'False or Unknown'
  END;

Just like with the previous techniques I described, the last two validation techniques implement the logic using a single expression. This means that you can embed the predicate wherever predicates are supported in T-SQL, operating on a table or view attribute. For example, if you need to enforce this rule in a CHECK constraint as part of a table definition, you can, like so:

CHECK (col1 NOT LIKE '%[^0-9]%')

Make the Most of T-SQL’s Tools

I covered several common string manipulation tasks: counting occurrences of a substring within a string, verifying that there are an exact number of occurrences of a substring within a string, replacing multiple contiguous spaces with a single space, replacing overlapping occurrences, formatting strings representing numbers with leading zeros, left trimming leading character occurrences, and verifying that a string is made of only digits. As you saw, T-SQL doesn’t provide very rich capabilities in this area. You therefore need to be a bit creative with the limited set of tools that you do get. Fortunately, SQL Server Denali improves support for string manipulation by adding new functions, among them the FORMAT function that lets you format an input based on a format string.

Next month, I’ll continue the discussion about string manipulation, and I’ll cover additional challenges related to string manipulation involving tricky type conversions. If you have your own cool and creative techniques to handle common string manipulation needs, I’d love to hear about them.

For Part 2 go to T-SQL String Manipulation Tips and Techniques, Part 2

TAGS: SQL
Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish