Final Revelations from the Recent TPC-H Benchmark

I hope you’ve enjoyed my multipart write up on HP and Microsoft’s most recent TPC-H Benchmark disclosure report. I’m a huge proponent of benchmarking and of reading the vendor disclosure reports. So I hope I was able to convey the highlights from the report as well as give you a few reasons as to why it’s important to review their findings. I’ve also greatly enjoyed your feedback and learned from it. Keep it up!

The only other major concept used in the construction of the benchmark environment worth noting at this point, imo, is the astronomically high number of files per filegroup and the use of multiple filegroups per database. Here’s an excerpt of the CREATE DATABASE script:

CREATE DATABASE tpch3000g ON PRIMARY

(NAME = tpch3000g_root,

FILENAME = "G:\TPCH_Roots\tpch3000g.mdf",

SIZE = 10MB,

FILEGROWTH = 10MB),

FILEGROUP STAGING_FG

(NAME = tpch3000g_stage1, FILENAME = 'G:\mnt\ntfs\1\load\stage.mdf', SIZE = 110000MB),

(NAME = tpch3000g_stage32, FILENAME = 'G:\mnt\ntfs\32\load\stage.mdf', SIZE = 110000MB),

FILEGROUP LINEITEM_FG

(NAME = tpch3000g_li1, FILENAME = 'G:\mnt\li\1\', SIZE = 22500MB),

…

(NAME = tpch3000g_li192, FILENAME = 'G:\mnt\li\192\', SIZE = 22500MB),

FILEGROUP GENERAL_FG

(NAME = tpch3000g_gen1, FILENAME = 'G:\mnt\gen\1\', SIZE = 7800MB),

…

(NAME = tpch3000g_gen192, FILENAME = 'G:\mnt\gen\192\', SIZE = 7800MB)

LOG ON

(NAME = tpch3000g_log1, FILENAME = "L:\\", SIZE = 297000MB)

As you can see, each filegroup had as many as 192 files of hundreds of GB in size. I’ve heard a number of IO tuning recommendations related to the use and configuration of files and filegroups, the most common advice being to configure N-1 filegroups per database (where N is the number of CPUs on the server). Since the benchmarked server had 32 dual-core Itanium 2 9050 processors, I’m not sure how the specific use of 192 files and 4 major filegroups reconciled with this advice. I’d appreciate any insight you might have.

There are a lot of other interesting but very minor tidbits to glean from reviewing the disclosure report. For example, the benchmarking team substituted their own user-defined function called “big_count” in place of the system-supplied function COUNT. Perhaps this new UDF was better at counting very large numbers of records?

Thanks again for your comments. I hope this was of some help to you.

Cheers,

-Kevin

Comments

Plain text