T-SQL for Starters: Working with Character Data

In last month's article, I looked at simple Select statements. This month, I use SQL Server 6.5 examples to show how to handle character data in queries. These examples also work in SQL Server 7.0, with a few differences, which I'll point out as they come up. To run the queries in this article, use the ISQ/W window or the Query Window in the Enterprise Manager in SQL Server 6.5. In SQL Server 7.0, use the Query Analyzer from the program group or call it from within Enterprise Manager.

What Is Character Data?

Loosely defined, character data is any data that's stored as a string of ASCII characters rather than as a numeric value. Many database fields (columns) are character data—names, addresses, job titles, etc. Some fields that appear to be numeric are character data, such as telephone numbers, zip codes, and equipment serial numbers. This datatype makes sense because you don't perform mathematical calculations on such fields. In addition, you might want to mix letters and numbers, for example, to accommodate Canadian postal codes. Also, a zip code such as 00217 is easier to handle as a character field. The integer type drops leading zeros and therefore would truncate a zip code of 00217 to 217.

You specify character data in SQL Server by enclosing the data in single quotes. (SQL Server 7.0 is more forgiving if you make formatting errors than SQL Server 6.5.) For example, the following query for all authors from California in the Pubs database is correct because of the quotes around 'ca':

USE PUBS
SELECT au_fname, au_lname, phone, au_id
   FROM authors
	WHERE state = 'ca'

If you omit the quotes, you'll see an error message, "Invalid column name ca." Both SQL Server 6.5 and 7.0 will interpret the missing quotes as a request to compare two columns rather than comparing a column to a text string.

A similar error occurs if you try to select data that appears to be numeric, such as zip codes:

USE PUBS
SELECT au_fname, au_lname, phone, au_id
   FROM authors
   WHERE zip > 90000

In SQL Server 6.5, this comparison fails because, as the error message will tell you, you asked for an implicit conversion. In other words, you asked SQL Server to compare two dissimilar values (a character field, zip code and numeric value, 90000). Although SQL Server 7.0 will run the query if you omit the quotes, I recommend following the ANSI SQL standard and putting 90000 in quotes.

Variable-Length Character Fields

In SQL Server 6.5, you can choose fixed-length columns or variable-length columns for character data. You designate such columns as char(n) or varchar(n), respectively. SQL Server 7.0 offers these choices plus support for Unicode characters, which you designate as Nchar(n) or Nvarchar(n). SQL Server stores character data as a column in a table, and it stores text data in the database, but not as part of the row. The row includes only a pointer to the text data's location. SQL Server 6.5 limits character data to 255 characters and stores additional characters as text data. In SQL Server 7.0, you can use character data for columns up to 8000 bytes long, which is an entire page.

Some character data is fixed-length, such as zip codes or state codes. But for character data such as names, addresses, and product descriptions, you might consider variable-length columns. This approach can save storage space by storing only the bytes required to hold the data, plus one extra byte that tells how many bytes of data are in the column. The down side of this method is that SQL Server first reads the byte that describes how long the data is in the row, then reads that many bytes of data. Thus, reading variable-length data is a two-step process. Data storage is cheap, but users expect better performance than variable-length columns allow. Therefore, many developers use fixed-length columns for speed rather than choosing the more efficient storage option. However, SQL Server 7.0 might change such thinking.

The sort order and, to some extent, the character set you choose when you install SQL Server will affect how data is returned when you query the database. For example, if you choose a binary sort order and list customers by last name, SQL Server returns them in the order Adams, Baker, Cole, Edwards, . . . Zahn, de Lucia, van Zandt. The sort uses the ASCII order of A-Z, then a-z. So customers de Lucia or van Zandt show up after Zahn. The default sort order in SQL Server 7.0 is dictionary sort order, case-insensitive, so SQL Server 7.0 lists the names as you expect to see them: Adams, Baker, Cole, de Lucia, Edwards, . . . van Zandt, . . . Zahn. If your queries show data in a different order from what you expect, run the system stored procedure SP_HELPSORT to show the settings for sort order and character set.

SQL Server provides many functions for manipulating character data, including the ability to concatenate character data to combine multiple data columns into one output column. For example, you can combine the first and last name fields in a variety of ways. In these examples, I added formatting such as spaces and commas. This example shows first and last names as two separate columns:

SELECT au_fname, au_lname 
FROM authors

This example shows the author name as one column. You can add a space, enclosed in single quotes, to separate the names.

SELECT au_fname + ' ' + au_lname 
FROM authors

SELECT au_lname + ',  ' + au_fname 
FROM authors

You can also add alphanumeric strings to your output for clarity, as Screen 1 shows. In Screen 1, note the spaces after the text in 'The author ' and before and after the text ' lives in '. This approach is one you can use to format your output. Another technique is to use the SPACE() function to insert spaces:

SELECT 'The author' +
SPACE(1) + au_fname +
SPACE(1) + au_lname +
SPACE(2) + 'lives in' +
SPACE(2) + city
FROM authors
WHERE au_lname LIKE
'Green'

Character Functions

SQL Server provides many character functions, some of which you'll recognize from programming languages. For example, SUBSTRING selects a subset of the character data, starting at a specified character and continuing for as many characters as you specify. You use SUBSTRING, for example, to display the author name in the format lname, firstinitial, as in Greene, M.

SELECT au_lname + ', ' + 
   SUBSTRING (au_fname,1,1) + '.',  au_id
   FROM authors
WHERE au_lname LIKE 'Greene'

If you want to take characters from the right of the column, for example, the last four digits of a Social Security number, use the RIGHT function.

SELECT au_lname, au_fname, RIGHT (au_id, 4)
FROM authors

There is no LEFT function. Instead, use SUBSTRING starting at character 1. Expected functions such as removing leading and trailing blanks (LTRIM, RTRIM) and converting to upper- or lowercase are available. I find the STUFF function useful. This function replaces a string anywhere in the data with another string. In the Pubs database, the author IDs look suspiciously like Social Security numbers. Suppose that I want the receptionist to verify the authors' identities when they call the office, but I don't want to give the Social Security numbers to anyone but the accounting department. Screen 2 shows a query I can use to generate the author ID list for the receptionist.

This example contains a search on the author name, which is character data. SQL Server supports various ways to search for character data. You can use an equality to search, as in:

WHERE au_lname = 'Green'

or LIKE, which imposes an additional filter when you use it in Query Analyzer.

WHERE au_lname LIKE 'Green'

Suppose that you want to find all authors whose last names begin with B. This query would use the % symbol, to mean "any characters":

SELECT au_fname + ' ' +
   au_lname 
FROM authors
WHERE au_lname LIKE 'B%'

If you choose a case-insensitive sort order, your queries are case-insensitive. Thus, the above query would work if you use:

WHERE au_lname LIKE  'b%'

If you try this query with WHERE au_lname LIKE 'd%', you'll see that the results include two authors whose names begin with uppercase letters and one whose last name, del Castillo, begins with a lowercase letter. The query doesn't differentiate between these names. In fact, if you use a case-insensitive sort order, you need to use a trick to find authors whose names begin with lowercase letters. If you want a list of authors whose names begin with A through M, for example, you specify the range as:

SELECT au_fname + ' ' + au_lname 
FROM authors
WHERE au_lname LIKE  '\[A-M\]%'

Include a dash in the square brackets because LIKE '\[AM\]%' gives you names that begin with either A or M. You need to use square brackets, or the query will return names that begin with the three-character string 'A-M,' of which there are few.

To find names begining with Mc, modify the query to

SELECT au_fname + ' ' + au_lname 
FROM authors
WHERE au_lname LIKE  'Mc%'

If you want to exclude the author McBadden, and anyone else whose name begins with Mc, you can use LIKE 'M\[^c\]%'. The caret symbol excludes the following character. In this case, you are asking for names that begin with M but don't have a c as the second character. Be careful to use square brackets only around the character you're excluding. LIKE '\[M^c\]%' would return names that begin with M, the caret character, or C.

Because the index on the au_lname column is sorted alphabetically, you can use the index to quickly search for a name that begins with a certain letter or range of letters. But if you want a list of last names that end with "son," such as Anderson or Carson, you use the WHERE clause

WHERE au_lname LIKE  '%son'

and you can't use an index. SQL Server must scan the entire table for a match. An index search would be faster than a table scan for just a few names, but you can't avoid the table scan in this case.

To combine columns that have character and non-character data, you need to convert all the data to character data. For example, to find the price for a book, and output it as a complete sentence, you can use the query in Screen 3. This example contains two CONVERT functions. The first function converts the overly long title column from a variable column of up to 80 characters to a fixed 30 characters for display purposes. The second function converts the price, which is a money data type, to character mode.

SQL Server gives you two options for handling apostrophes and quotes in character data. One option is to repeat the single quote. When SQL Server sees two single quotes together, it assumes that you intend it to recognize the second one. For example, suppose you want to find the authors whose names begin with O'. The query in Screen 4 will show that Michael O'Leary is the only author whose name satisfies the WHERE clause in the query. SQL Server prefers single quotes. But you can mix single and double quotes to write the same example.

SELECT au_fname + ' ' + au_lname 
FROM authors
WHERE au_lname LIKE  "O'%"

The double quotes delimit the text string, and the single quote is treated as part of the text.

Comments

Plain text