I have two tables, Account and AccountTransaction, that have more than 20 million records each. There's a one-to-many (1:M) relationship between Account and AccountTransaction. Data Transformation Services (DTS) constantly loads new records into the tables by using a text file. My company's DBA insists that we must use a cursor to insert each record one by one to avoid massive locking and blocking conflicts on the tables when DTS performs large inserts. I recommended using a batch-insert process, but the DBA says batch processes can cause major read-concurrency problems. How can I achieve higher insert throughput without seriously damaging SELECT-query performance or read concurrency?
I see this scenario often at customer sites. Most of my customers have solutions that heavily favor either insert speed or select speed. Your DBA is right. Large inserts into production tables can significantly decrease performance. However, if you're careful, you can have your cake and eat it, too. I can't provide sample code because I don't know what your data sets look like, but I can outline the standard design pattern I use in cases like yours. I've found a way to achieve high insert throughput rates without significantly damaging my ability to handle read queries at the same time.
Say you have 100,000 new rows to insert into a table that has 20 million existing rows. The benefit of a batch-insert approach is that you'll probably get much better throughput than if you try to insert 100,000 rows by executing an INSERT statement for each row. The downside of a batch approach is that the response time for the 100,000-row data load might be so long that it noticeably affects concurrent read queries' response times. Here's how I attack the problem.
First, I batch load the new data into a staging table. I can load the data using the bulk copy program (BCP) utility, the BULK INSERT statement, or DTS. Once I have the file's raw data in a staging table, I don't use a cursor to move the data one row at a time into the production tables. Instead, I find a way to partition the staged data so that I can insert batches of records by using an INSERT INTO/SELECT FROM command. I'll explain what I mean by "partition the data" in a moment. First, I want to stress that using a single INSERT INTO/SELECT FROM command to insert multiple rows is usually much faster than using one INSERT statement for every row you need to insert.
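I can't know your exact schema, but here's a sketch of the staging step. The table, column, and file names are assumptions for illustration only; substitute your own.

```sql
-- Hypothetical staging table; adjust columns to match your load file.
CREATE TABLE StageAccountTransaction (
    AccountId INT      NOT NULL,
    TranDate  DATETIME NOT NULL,
    Amount    MONEY    NOT NULL
)

-- Bulk load the raw text file into the staging table.
-- (The file path and terminators are assumptions.)
BULK INSERT StageAccountTransaction
    FROM 'C:\loads\acct_tran.txt'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)

-- Then move the staged rows set-based, not row by row:
INSERT INTO AccountTransaction (AccountId, TranDate, Amount)
SELECT AccountId, TranDate, Amount
FROM StageAccountTransaction
```

Because the staging table isn't a production table, you can use TABLOCK on the bulk load for maximum speed without worrying about read concurrency there.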
Now, what do I mean by "partition the data"? Here's a simple example. Say I have a 1 million-row data set of new orders, and I know the OrderId values range from 1 to 1 million. Testing shows that I can insert 50,000 rows per batch without significantly affecting read performance. In this case, I have 20 batches, and the INSERT INTO command for each batch includes a WHERE clause that specifies the particular range of rows to insert. For example, the INSERT INTO command for the second batch specifies rows 50,001 through 100,000.

Knowing how to partition the data and build the WHERE clause for each batch is the challenging part of this technique. To partition correctly, you need to understand your data set. If your data set doesn't have a good natural partitioning key that uniquely identifies each row, you can add an IDENTITY column to the staging table and use that column to partition the data. In other words, if you can't figure out how to divide the data into batches of 50,000 by using your existing keys, adding a unique IDENTITY column makes that task easy.

How do I know how big the INSERT INTO batch-partition size should be? Testing. I find the largest INSERT batch that completes quickly enough to have close to zero effect on read-operation concurrency. The INSERT batch blocks reads while it's running, but the blocking is trivial when the INSERT statement runs quickly. I usually set up a loop that processes one batch (or partition) of data at a time until the entire data set is inserted, and I include a slight delay of a few milliseconds between batches to ensure that I never start to back up SELECT queries that might be running against the same table I'm inserting into. The trick to picking a batch size is to pick the largest batch you can insert fast enough to avoid creating blocking problems for other queries that are reading from the table.
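The loop I describe might look something like the following sketch. It assumes the staging table has an IDENTITY column named RowId and that testing showed 50,000 rows per batch is safe; both names and the batch size are assumptions you'd replace with your own.

```sql
-- Batch-partitioned insert loop (sketch).
-- Assumes StageAccountTransaction has an IDENTITY column RowId.
DECLARE @BatchSize INT, @StartId INT, @MaxId INT

SET @BatchSize = 50000  -- determined by testing on your system
SELECT @StartId = MIN(RowId), @MaxId = MAX(RowId)
FROM StageAccountTransaction

WHILE @StartId <= @MaxId
BEGIN
    -- Insert one partition of the staged data.
    INSERT INTO AccountTransaction (AccountId, TranDate, Amount)
    SELECT AccountId, TranDate, Amount
    FROM StageAccountTransaction
    WHERE RowId BETWEEN @StartId AND @StartId + @BatchSize - 1

    SET @StartId = @StartId + @BatchSize

    -- Slight pause between batches so concurrent SELECTs never back up.
    WAITFOR DELAY '00:00:00.010'
END
```

Each pass through the loop holds locks only briefly, so readers see at most a momentary block, and the WAITFOR DELAY gives queued read queries a chance to run between batches.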
This batch-partitioning approach lets you have your insert throughput and your query performance, too. Tune the batch size properly, and you'll be able to handle a high volume of inserts in a short period of time without noticeably affecting other queries running on your system.