This week, I'd like to offer advice about the common need to "munge a lot of data." To help set the stage, here's a question a colleague recently asked:
"I'm having trouble updating a large number of rows in a table that contains 35 million rows. For example, say that the table, MyTable, has a column called UserId. A typical value for UserId looks like '374771&something,' and I need to trim everything to the right of the & symbol. The UPDATE query I'm running looks like this:
UPDATE MyTable SET UserID = SUBSTRING(UserID, 1, CHARINDEX('&', UserID) - 1) WHERE UserID LIKE '%&%'
However, the query has been running for several hours, the transaction log is currently 10GB and growing, and I wonder if there's a more efficient way to tackle this problem."
The need to update large amounts of data is common in the real world, but many SQL Server professionals don't know how to tackle the problem efficiently. Fully addressing all the scenarios and solutions that this question raises could fill an issue of SQL Server Magazine, and I don't have that much space. However, the basic rule of thumb I follow in data-modification cases is similar to a rule of thumb you should follow when eating a nice juicy steak: You might choke and die if you try to swallow the entire steak in one bite, so don't do it. Eating the steak in multiple, smaller bites is much easier, safer, and tastier. The same advice holds true when modifying massive amounts of data.
My colleague was trying to update 35 million rows. Conceptually, you can update the rows with a single UPDATE statement, or you can do it in multiple UPDATE statements, updating a smaller "bite" of data with each one. Chunking the data—for example, issuing 70 statements that each update 500,000 rows instead of one statement that updates all 35 million—is beneficial for many reasons, including the following (I'm assuming you don't have a business requirement to perform the update in a single all-or-nothing transaction):
- You can manage the size of the transaction log.
- You won't experience a horrible roll-back situation if the UPDATE fails 99 percent of the way through updating 35 million rows.
- The server won't be as sluggish for other people working with the database at the same time.
- You probably won't lock the entire table, so you'll greatly improve concurrency, which allows other actions against the database during the update process.
- You'll have a much more predictable way to gauge how long the update process will take.
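To make the idea concrete, here's a minimal sketch of the chunked approach applied to my colleague's query. It assumes SQL Server 2005 or later (where the TOP clause is allowed on UPDATE); the table and column names come from the question above, and the batch size of 500,000 is just the figure from my example—tune it to your own hardware and log capacity:

```sql
-- Update 35 million rows in bites of 500,000 instead of one giant transaction.
DECLARE @BatchSize int;
SET @BatchSize = 500000;

WHILE 1 = 1
BEGIN
    -- Each pass updates at most @BatchSize qualifying rows in its own
    -- (much smaller) transaction.
    UPDATE TOP (@BatchSize) MyTable
    SET UserID = SUBSTRING(UserID, 1, CHARINDEX('&', UserID) - 1)
    WHERE UserID LIKE '%&%';

    -- When no rows qualified, the work is done.
    IF @@ROWCOUNT = 0
        BREAK;
END
```

Because each batch commits on its own, the transaction log can be reused between bites: under the simple recovery model a CHECKPOINT between batches lets the log truncate, and under the full recovery model you'd schedule log backups while the loop runs. (On SQL Server 2000, you'd get the same effect by replacing the TOP clause with SET ROWCOUNT @BatchSize before the UPDATE and SET ROWCOUNT 0 afterward.)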
I could go on and on. In fact, I will when I address the same question in my SQL Server Savvy Q&A column in an upcoming issue of SQL Server Magazine. Trust me, your mom was right when she told you to take smaller bites. Taking smaller bites when updating your data might just prevent a great big SQL tummyache.