Skip navigation

Query Data Samples in a Table

SQL Server 2005 introduces a new TABLESAMPLE clause that lets you query a random sample of data from a table (either an exact number of rows or a percentage of rows), instead of querying the entire table’s data. This clause is especially useful when you want to test your code against a random subset of data that you copy from a production environment, or when you just want to test the validity of your solutions against a subset of data as opposed to the entire data set. Of course, there are other programmatic techniques that you can use to request random portions of data, but such techniques usually require scanning the entire table, and with large tables, this can take a while. One benefit of using the TABLESAMPLE clause is that SQL Server physically scans only those pages that SQL Server selected randomly. I'll explore several ways to use the TABLESAMPLE clause. In the sidebar, "Using SQL Server 2005 Features with Earlier SQL Server Versions," I demonstrate how you can use new SQL Server 2005 features such as TABLESAMPLE with databases based on earlier versions of SQL Server.

Using TABLESAMPLE


You specify TABLESAMPLE in a query's FROM clause, right after the table name or table alias, if one was provided. The TABLESAMPLE clause has the following syntax:

FROM 
  TABLESAMPLE \[SYSTEM\] (sample_number \[PERCENT | ROWS\])
\[REPEATABLE(repeat_seed)\]

I'll provide some sample queries that illustrate how to use the various TABLESAMPLE options. In these examples, I use the Sales.SalesOrderDetail table from the AdventureWorks database as the main source table. SalesOrderDetail has 121,317 rows stored on about 1240 pages, which is a reasonable amount of source data for demonstration purposes. I'll also use the Employees table from the Northwind database to demonstrate querying a small input table. The Employees table has only nine rows, all stored on a single page.

Let's start with a simple example. The following query requests 1000 sampled rows from the SalesOrderDetail table:

SET NOCOUNT ON;
USE AdventureWorks;

SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID
FROM Sales.SalesOrderDetail TABLESAMPLE (1000 ROWS);

Notice in the previous TABLESAMPLE syntax example that you can specify the SYSTEM sampling method. Currently, in SQL Server 2005 supports only the SYSTEM sampling method, which is also the default. The following query is logically the same as the previous query, except the SYSTEM method is specified explicitly:

SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID
FROM Sales.SalesOrderDetail TABLESAMPLE SYSTEM (1000 ROWS);

Per ANSI SQL, SYSTEM specifies an implementation-dependent sampling method. Thus, individual database management system (DBMS) products can implement this method differently. SQL Server uses the same sampling method that it uses to sample data to create statistics, which the query optimizer uses to optimize queries. (Other database platforms, such as DB2, implement additional sampling methods—for example, BERNOULLI, which is based on the Bernoulli sampling algorithm.)

When I ran the previous query for the first time, I got 861 rows; you might get a different number of rows when you run this code. If you run the query repeatedly, you'll probably get a different number of rows every time. SQL Server uses randomization to determine whether or not a page should be returned. Data is considered to be returned (or not) at the page level, not at the row level. SQL Server generates a random value per page, based on the table size and the requested number or percentage of rows. Thus, whether a page is chosen is a matter of probability. Each page is selected or skipped according to probability (i.e., requested number of rows/number of rows in the table). The larger the table, and the greater the number of rows you request, the higher the probability that you'll get a close percentage of rows to the requested one.

Sampling a Percentage of Rows


You can also specify a percentage of rows, instead of a number of rows. For example, the following query requests SQL Server to return a sample containing 1 percent of rows in the table:

SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID
FROM Sales.SalesOrderDetail TABLESAMPLE (1 PERCENT);

In fact, when you specify a number of rows, SQL Server internally converts it to a percentage.

If you want to ensure that you'll get a data sample that's repeatable (assuming the source table hasn't changed), you can do so by using the REPEATABLE clause and specifying an integer seed value, like this:

SELECT SalesOrderID, SalesOrderDetailID, OrderQty, ProductID
FROM Sales.SalesOrderDetail TABLESAMPLE (1000 ROWS) REPEATABLE(42);

If you invoke this query with the same seed value (42 in our example) repeatedly, you’re guaranteed to get the same sample data provided that the table wasn't changed. This option can be handy, for example, when you want people who review your performance analysis or benchmarks to be able to repeat result sets that you obtained from your queries.

Remember that you aren't guaranteed to get the exact same number or percentage of rows that you requested. You might get more, or you might get less. To increase the probability of receiving the number of rows that you requested, you can use the TOP option. You do so by specifying a higher number of rows than you actually need in the TABLESAMPLE clause, then in the TOP option specify the number of rows that you actually need, like this:

SELECT TOP(1000) SalesOrderID, SalesOrderDetailID, OrderQty, ProductID
FROM Sales.SalesOrderDetail TABLESAMPLE (2000 ROWS);

This query requests to sample 2000 rows and uses the TOP option to limit the result set to 1000 rows. You still might get fewer than 1000 rows, but the likelihood of that occurring is lower than if you'd actually requested 1000 rows in the TABLESAMPLE clause. By specifying TOP(1000), you're guaranteed, of course, not to get more than 1000 rows. By using a combination of TABLESAMPLE and TOP, the data that you obtain will better reflect the distribution of data in your production environment than if you use TOP alone, and you'll usually get the data faster.

The smaller the table is, and the smaller the number or percentage of rows that you request, the less likely that you'll get a percentage of rows close to the percentage that you requested. With very small tables, you might not even get any rows. For example, if you request one row from the Employees table in the Northwind database, which contains only nine rows per page, you'll get an empty set in most invocations and the entire table in others, as this query shows:

USE Northwind;

SELECT EmployeeID, FirstName, LastName
FROM dbo.Employees TABLESAMPLE (1 ROWS);

With such small tables, you’re better off using other techniques to sample data, for example:

SELECT TOP(1) EmployeeID, FirstName, LastName
FROM dbo.Employees
ORDER BY CHECKSUM(NEWID());

Although this query requires SQL Server to scan the entire table, returning the results for a small table shouldn't take too long.

Efficient Sampling


As you've seen, SQL Server 2005's TABLESAMPLE clause lets you efficiently sample pages from a table. Although you aren't guaranteed to obtain the exact number or percentage of rows you requested, you can use techniques such as those I've demonstrated to increase the probability of getting a percentage of rows close to the percentage you requested. TABLESAMPLE is optimal for sampling data from large tables; for small tables, you're better off using other techniques for sampling data.

TAGS: SQL
Hide comments

Comments

  • Allowed HTML tags: <em> <strong> <blockquote> <br> <p>

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.
Publish