
Adventures with Big Data: How to Import 16 Billion Rows into a Single Table

If you have a lot of data to import into or export out of a SQL Server table, here are some tips that can save you a lot of time and aggravation.

So here I was, writing some SQL Server 2005 Reporting Services (SSRS) reports for a user when a coworker asked if I could process some data for him. I’m the SQL Server guy at work, so I immediately volunteered to help.


The data arrived several days later on a 1TB external hard drive that contained 20,000 flat files taking up 400GB of space. The data consisted of latitude and longitude coordinates and three extra informational fields. I had no idea how many records the raw data represented because not all the latitude and longitude coordinates were to be included. For latitude and longitude coordinates to be included, their extra fields had to match certain criteria.

The first step was to reduce the record size to the absolute bare minimum. I discovered that the latitude and longitude coordinates were in 1/120-degree increments, so I converted those floating-point values to small integers by multiplying them by 120 and truncating the fractional part. The remaining fields were also reduced to integers after multiplication, bringing my final record size down to 9 bytes.
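
To make the layout concrete, the destination table looked something like the following sketch. The column names and the exact split of the three informational fields are illustrative, not the actual schema:

  -- Illustrative 9-byte record layout (names and the split of the three
  -- informational fields are assumptions)
  CREATE TABLE dbo.Coordinates
  (
      LatIndex smallint NOT NULL,  -- latitude  * 120, fraction truncated
      LonIndex smallint NOT NULL,  -- longitude * 120, fraction truncated
      Field1   smallint NOT NULL,  -- the three informational fields,
      Field2   smallint NOT NULL,  -- also reduced to integers, account
      Field3   tinyint  NOT NULL   -- for the remaining 5 bytes
  );
  -- 2 + 2 + 2 + 2 + 1 = 9 bytes per record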

Unfortunately, there was no way of directly importing the data into SQL Server because the flat files were in a proprietary data format. So, I wrote a Visual Basic .NET 2.0 application that iterated through each file and created a pipe-delimited output file that I could subsequently import into SQL Server using the BULK INSERT statement. I chose to bulk insert the records because submitting them through ADO.NET would be much slower.
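
The import statement itself was along these lines; the path is a placeholder, and the table is the sketch above:

  BULK INSERT dbo.Coordinates
  FROM 'E:\staging\coordinates.txt'   -- placeholder path for a converted file
  WITH
  (
      FIELDTERMINATOR = '|',          -- the converter wrote pipe-delimited rows
      ROWTERMINATOR   = '\n',         -- one record per line
      TABLOCK                         -- table lock for faster bulk loading
  );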

The final piece of the puzzle was to define a clustered index on the table consisting of the latitude and longitude fields. I hoped this would give me instant access to all related data once lookups began. I knew that inserting unsorted records into a table with a defined clustered index would be slow, but I knew of no way to presort the records, so I went for it.
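
Using the column names from the earlier sketch, the index definition amounts to this:

  CREATE CLUSTERED INDEX IX_Coordinates_LatLon
      ON dbo.Coordinates (LatIndex, LonIndex);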

The import operation started off well. A half million records were being imported in a single shot every 10 seconds. So, I let the import operation run a few hours while I attended to some other tasks. When I came back, I discovered that my upload speed had drastically decreased. As my table of imported values grew, the cost of inserting more records into the clustered index became prohibitive to the point where my ADO.NET connection was timing out after 10 minutes—a far cry from the 10-second imports at the beginning. Clearly, I couldn’t import the data into a table with a defined clustered index.

Plan B was simple: I would remove the clustered index and run the import operation again. My thinking was that applying the index after importing the data would take much less time. After three days of importing, my data had been migrated from thousands of flat files into a single SQL Server table. It consisted of 16 billion rows and took up 250GB of space. The final step was to apply the clustered index. I wrote a script and set it running with fingers crossed.

About 28 hours later, SQL Server had sucked up the 750GB of disk space left and began rolling back the index change. After another 24 hours, the rollback had completed and I was back to square one. As it turns out, applying a clustered index takes a lot of disk space. Plus, the entire operation is logged. Not only did my database balloon to 500GB but so did my log file.

I considered several options at this point. The first was to define a partitioned table that would make subsequent indexing easier. Another was to physically split the database between multiple tables. Then, while doing some research on BULK INSERT, I came across an interesting detail: When bulk inserting data into a table with a defined clustered index, you can greatly speed things up by supplying already sorted data. All you have to do is use BULK INSERT’s ORDER parameter to specify an order hint telling SQL Server that the incoming data is ordered like the index. So, I came up with a new plan: Export the entire table of 16 billion rows to disk, sort the data, and use BULK INSERT with the ORDER parameter to re-import the sorted data into a table whose clustered index is already defined.
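
In T-SQL terms, the plan was a statement along these lines (paths and names are placeholders; the data file is the sorted binary export described below):

  BULK INSERT dbo.Coordinates
  FROM 'E:\staging\coordinates_sorted.dat'   -- placeholder path for the sorted export
  WITH
  (
      DATAFILETYPE = 'native',               -- binary export rather than character data
      ORDER (LatIndex ASC, LonIndex ASC),    -- promise that the file matches the clustered index order
      TABLOCK
  );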

To export the data from SQL Server to a flat file, I tried using BCP because it’s fast. Unfortunately, BCP maintains a count of the records exported, which it prints in the DOS shell. I say “unfortunately” because BCP’s counter variable is a 4-byte integer that stops at row 2,147,483,647. With only an eighth of my database exported, I had to look for other options.

I ended up writing a SQL CLR procedure in C# that exported my table, row by row, to a flat file on the server. I defined my counter variable as a long (a 64-bit integer) to avoid the problem I encountered with BCP. I ran this procedure, and 19 hours later I had a 141GB flat file in binary format, ready to be sorted.
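
For completeness, a CLR procedure like that is wired into SQL Server roughly as follows; the assembly, class, and procedure names here are placeholders, not the ones I used:

  -- CLR integration must be enabled once per instance
  EXEC sp_configure 'clr enabled', 1;
  RECONFIGURE;

  -- Register the compiled C# assembly; EXTERNAL_ACCESS is needed to write files on the server
  CREATE ASSEMBLY CoordinateExport
  FROM 'C:\clr\CoordinateExport.dll'         -- placeholder path to the compiled DLL
  WITH PERMISSION_SET = EXTERNAL_ACCESS;

  -- Expose the exporting method as a stored procedure
  CREATE PROCEDURE dbo.ExportCoordinates @outputPath nvarchar(260)
  AS EXTERNAL NAME CoordinateExport.[StoredProcedures].ExportCoordinates;

  EXEC dbo.ExportCoordinates @outputPath = N'E:\staging\coordinates.dat';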

On the Internet, I discovered Ordinal Technologies’ NSort (www.ordinal.com), which I could use for sorting. It promises to sort a terabyte of data in 33 minutes, given sufficient God-like hardware. NSort was a bit finicky to set up, but with help from its developer, I managed to sort my data file in 3.5 hours. The developer later told me that had I used a larger memory parameter, it could have taken only 2.5 hours.

I was finally ready to import the sorted data back into SQL Server. Luckily, I first ran a BULK INSERT test on 10 million rows to make sure my log wouldn’t explode. It did. As it turns out, by default, BULK INSERT loads the entire input file into the destination table as a single transaction, which means the entire load is fully logged. In my case, the log space required was 10 times the size of my data.

With a bit more reading, I discovered I could use BULK INSERT’s BATCHSIZE parameter to fix the log-space problem. This parameter tells BULK INSERT how many records to commit at a time. I picked 100,000 and ran my test again. This time the log never exceeded 39MB, a distinct improvement over the 2GB it had ballooned to the first time.
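
Putting it all together, the final import statement needed only that one extra parameter:

  BULK INSERT dbo.Coordinates
  FROM 'E:\staging\coordinates_sorted.dat'
  WITH
  (
      DATAFILETYPE = 'native',
      ORDER (LatIndex ASC, LonIndex ASC),
      BATCHSIZE = 100000,                    -- commit every 100,000 rows so log space can be reused
      TABLOCK
  );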

Satisfied that everything was in place and fully tested, I started the import operation again and settled in for the expected 50-hour wait. You can imagine my excitement when I returned from a long weekend to discover the import had finished successfully. The resulting coordinate table was the same physical size as before: because the leaf level of a clustered index is the table data itself, the index adds virtually no extra space.

Next, I created a test location table that consisted of 200,000 latitude and longitude coordinates. I added new LatIndex and LonIndex fields in the location table to represent the integer versions of the original floating-point latitude and longitude coordinates. All that remained was to join the location table to the coordinate table on the indexed fields and return the results. Forty-five seconds later I was staring at 181,458 beautiful rows of data representing a join between 200,000 locations and 16 billion coordinates. I had watched the CPU graph on the server while the SELECT statement ran and was gratified to see it pinned at 100 percent for the entire time, which means that disk I/O was clearly not a bottleneck. SQL Server is one lean, mean row-slinging machine.
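
The lookup itself was essentially a straight join on the two integer columns; table and column names follow the earlier sketch, and LocationId is a placeholder key:

  SELECT  l.LocationId,
          c.LatIndex, c.LonIndex,
          c.Field1, c.Field2, c.Field3
  FROM    dbo.Locations   AS l
  JOIN    dbo.Coordinates AS c
          ON  c.LatIndex = l.LatIndex
          AND c.LonIndex = l.LonIndex;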

Here are some distilled tips from my experience:

  • The more data you have in a table with a defined clustered index, the slower it becomes to import unsorted records into it. At some point, it becomes too slow to be practical.
  • If you want to export your table to the smallest possible file, use native format. This works best with tables containing mostly numeric columns because they’re represented more compactly as binary fields than as character data. If all your data is alphanumeric, you won’t gain much by exporting it in native format. Not allowing nulls in the numeric fields can further compact the data. If you allow a field to be nullable, the field’s binary representation will contain a 1-byte prefix indicating how many bytes of data will follow.
  • You can’t use BCP for more than 2,147,483,647 records because the BCP counter variable is a 4-byte integer. I wasn’t able to find any reference to this on MSDN or the Internet. If your table consists of more than 2,147,483,647 records, you’ll have to export it in chunks or write your own export routine.
  • Defining a clustered index on a prepopulated table takes a lot of disk space. In my test, my log exploded to 10 times the original table size before completion.
  • When importing a large number of records using the BULK INSERT statement, include the BATCHSIZE parameter and specify how many records to commit at a time. If you don’t include this parameter, your entire file is imported as a single transaction, which requires a lot of log space.
  • The fastest way of getting data into a table with a clustered index is to presort the data first. You can then import it using the BULK INSERT statement with the ORDER parameter.

—Dmitry Mnushkin, application developer, Renaissance Reinsurance
