With SQL Server, you can easily change the collation of a database. Before you do so, though, you should be familiar with the purpose and structure of collations, know what will be altered when the collation changes, and understand the consequences if a collation change isn't done correctly. It's also a good idea to have a methodology for changing the collation of a SQL Server database. I use a seven-step process, which I'll share with you.
Related: Forcing Collation in the WHERE Clause
The Purpose and Structure of Collations
Collation serves two important roles:
- It defines the rules that SQL Server applies when sorting and comparing characters in a database.
- It defines the underlying code pages used to physically store non-Unicode data (e.g., column types of char, varchar, and text).
Let's take a close look at each of these roles.
Defining the sort and comparison rules. The rules that SQL Server applies when sorting and comparing characters are strongly related to the language associated with the data. They can also specify other sorting or comparison behaviors, such as whether the sort or comparison operation is case-sensitive (_CS), case-insensitive (_CI), accent-sensitive (_AS), accent-insensitive (_AI), or width-sensitive (_WS). In SQL Server 2000 and later, you can use two types of collations:
- SQL Server collations, which start with the prefix SQL_. These collations were used in SQL Server versions prior to SQL Server 2000. With these collations, the rules for Unicode and non-Unicode data are different; therefore, sorting and comparison operations on Unicode data will have different results than on non-Unicode data.
- Windows collations, which are the recommended collations. In particular, SQL Server 2008 introduced many Windows collations to support the Windows system locales. Because these collations are centralized at the OS level, they're common to all the applications installed on a Windows computer. As a result, they can provide useful and consistent sorting and comparison behaviors. In addition, the rules for Unicode and non-Unicode data are the same, ensuring data consistency.
Defining the underlying code pages. A code page determines the way values are assigned to characters when data is stored on a disk or network. SQL Server supports 16 different code pages. They use single bytes to store each character. This limits the space and memory used and leads to better performance compared with using multi-bytes.
The drawback of the single-byte approach is the small number of available values: 28 or 256. As TechNet's Code Page Architecture web page explains:
There are 8 bits in a byte, and the 8 bits can be turned ON and OFF in 256 different patterns. A program that uses 1 byte to store each character can therefore represent up to 256 different characters by assigning a character to each of the bit patterns.
The values from b0000000 to b01111111 are used for all code pages equally. Their corresponding characters are encoded in a manner described by the ASCII base table. The other 128 bytes (represented by values b10000000 through b11111111) are individually mapped for each code page. The mapping between a value and a represented character can vary widely between code pages, so changing the collation might change the underlying code page.
If a code page change occurs, any stored data that's not defined in the ASCII base table will be converted. In some cases, characters that aren't defined in the new code page can't be converted and will be automatically changed to an "undefined" character. This means that some characters can get lost during a conversion. For this reason, if a code page change is needed, you need to analyze your data to avoid wrong conversions.
Related: 7 Considerations for Server Consolidations
Let's take a look at an example of an incorrect conversion that results in a character being lost. Suppose that the collation Latin1_General_CI_AS is current. All the € characters stored in the char, varchar, and text columns are represented as the ASCII value 128. For some reason, the collation has to be changed to SQL_Latin1_General_CP850_CI_AS. An analysis shows that this character isn't defined in the new collation. If the change is performed without any checks and corrections, all the € characters will be converted to the undefined character ? with the ASCII character code of 63. As a result, users won't be able to tell the difference between the old € character and the real ? character anymore. All of this information is lost. A possible solution to this problem is to change all of the € characters to Euro prior to the collation change.
To help you find the code pages used by different collations, you can use the T-SQL command
SELECT COLLATIONPROPERTY('', 'CodePage')
where
SELECT COLLATIONPROPERTY('Traditional_Spanish_CS_AS_KS_WS', 'CodePage') SELECT COLLATIONPROPERTY('Thai_CS_AS_KS_WS', 'CodePage')
return 1252 and 874, respectively.
Note that Unicode data stored in the column types of nchar, nvarchar, and ntext uses a UCS-2 code page (Universal Character Set, using only 2 bytes per character). This is identical for all collations in SQL Server 7.0 and later. This data doesn't need to be converted when the collation is changed because all Unicode data uses the same Unicode code page. Collations don't control the code page used for Unicode columns; they only control the attributes.
Collation Recommendations
When working with collations, there are a few recommendations to keep in mind:
- Use Windows collations to avoid having to use different rules for Unicode and non-Unicode data.
- Choose a collation that's a good match for not only the data to be stored but also the sorting and comparison operations you want to perform on that data.
- Don't mix collations at the instance level, database level, and object level (e.g., tables, store procedures), because doing so can strongly degrade the overall performance of the instance.
- Use a more general collation (e.g., latin1) when a database must contain data that's in different languages.
- Make sure that all characters to be stored in the database are defined in the underlining code page.
Collation Change Methodology
Before changing a database collation on a production system, you should test the change (including testing the applications that use the database) on a development, test, or quality assurance (QA) system. That way, you can avoid database problems and any surprises in applications' behavior. In addition, you should let the owners or developers of the applications know about the collation change beforehand.
I follow a seven-step process when changing a database collation:
- Check the database with DBCC CHECKDB WITH DATA_PURITY.
- Perform a backup with the copy-only option.
- Change the collation at the database level.
- Find all the table columns whose collation must be changed and validate the findings with the owners or developers of the applications using the database.
- Change the collation at the table level.
- Change the collation of the views.
- Rebuild the indexes.
Figure 1 shows the checklist I use to make sure I complete all the steps.
Let's look at each step in more detail.
Step 1: Check the Database
Although performing a complete integrity check is a good idea, it's not required. In many cases, simply using the DATA_PURITY option can be enough:
DBCC CHECKDB WITH DATA_PURITY
The DATA_PURITY option will check the database for column values that are invalid or out-of-range (i.e., not defined in the code page). For example, it will tell you if value 128 is found, but the current collation doesn't define it.
Step 2: Perform a Backup
The next step is to make a backup of your database. You can perform a copy-only backup so that it won't affect your backup strategy. Here's a sample script you can use to perform the copy-only backup:
BACKUP DATABASE [DBxxx] TO DISK = N'B:\Backup\DBxxx_BeforeCollationChange.bak' WITH COPY_ONLY, NOFORMAT, NOINIT, NAME = DBxxx_BeforeCollationChange', SKIP, NOREWIND, NOUNLOAD, STATS = 10
Before you run this script, you need to change each instance of DBxxx to the name of your database.
Step 3: Change the Collation at the Database Level
After you've backed up your database, you can change the database collation. To do so, use the command
ALTER DATABASE [DBxxx] COLLATE New_Collation
where DBxxx is the name of your database and New_Collation is the name of the collation you want to use.
In many cases, this command will run successfully. However, the command might fail and generate error messages concerning functions, primary keys, constraints, or indexes. These objects don't contain primary data, so they can be re-created.
If errors do occur, you need to:
- Drop all functions, primary keys, constraints, and indexes.
- Change the database collation.
- Re-create all the dropped objects. (Don't forget to reassign the rights granted to these objects.)
However, I recommend that you re-create all the dropped objects later in step 7. That way, you might avoid having to redo some work.
If this step goes well, the database properties will show the new collation. Many DBAs will stop here, but as you'll see shortly, this step only changes the default collation for the new objects that will be created in the database. All existing objects still contain data values referring to their origin mapping.
Step 4: Find All the Table Columns Whose Collation Must Be Changed
The collation of all table columns of types char, varchar, text, nchar, nvarchar, and ntext must be changed. To list all the table columns with their current collations, you can run the query:
SELECT t.name, c.name, c.collation_name FROM sys.columns c INNER JOIN sys.tables t ON t.object_id = c.object_id WHERE c.object_id IN (SELECT object_id FROM sys.objects WHERE type = 'U') AND c.collation_name != 'NULL' ORDER BY t.name,c.name
In the generated list, you need to look for these three cases:
- Case 1: Columns defined with the old default collation
- Case 2: Columns defined with a collation that differs from the old default collation
- Case 3: Columns defined with a user data type (UDT)
Before making any changes to these columns' collations, you should check with the application owners or developers to see whether the changes will generate bad behaviors in the applications. In addition, if a column falls under case 2, you should ask them why its collation differs from the old default collation. This information will help you determine if the column can be changed to the new default collation.
Step 5: Change the Collation at the Table Level
After the application owners or developers have given you the okay to change the collation of the identified columns, you can make the changes. How you make those changes depends on the three cases identified in step 4.
Case 1. When a column is defined with the old collation, you can change the old collation to the new collation with the command:
ALTER TABLE [