Simplifying Spatial Data

It has always been possible to work with geospatial data in SQL Server using built-in geometric and trigonometric functions such as SIN(), COS(), and ACOS(). However, SQL Server 2008’s spatial data types made complex spatial queries a lot more accessible to those of us who don’t remember what a cosine actually is. In this article, I will show you how to use these built-in functions within a user-defined function (UDF) to illustrate the traditional method of calculating distances between geographic points. I will use both traditional T-SQL methods and the new spatial data types in SQL Server 2008 as examples. I will also compare the performance of the two methods.

Prerequisites

For the purpose of this article, I will assume you recognize the terms latitude and longitude and know that the earth is sort of round. I will also assume you have a basic familiarity with reading and writing T-SQL code. To follow the examples, you will need SQL Server 2008 or later.

Complete coverage of all the complex issues surrounding geospatial data calculations would require a much longer article. Although a deep understanding of geospatial data is required for some types of applications, many business applications have relatively simple geospatial needs, such as calculating shipping charges based on the distance between two ZIP codes or displaying the location of your nearest retail outlets on a map on your website based on a ZIP code entered by a visitor to the site. These types of queries have a common goal: to find the distance between geographical points. In some cases, they return the distance between two specific points, and in other cases they return the identity of all points within a certain distance of a center point. By the end of this article, even the most geometry-challenged programmer will be able write queries like these, using both traditional methods and the new spatial data types.

Using Legacy T-SQL for Geospatial Calculations

Listing 1 contains the code for a conventional UDF that calculates linear distances between two points on the surface of a sphere. Functions like this one represent a method of working with geographical data that was typical before spatial data types were introduced in SQL Server 2008.

As you can see, this function takes the latitude and longitude values of two geographic points and returns the distance between them in miles. It’s a little easier to understand the calculation if you know the following things:

The latitude and longitude values we will work with are stated in decimal degrees. The T-SQL SIN() and COS() functions expect those values to be in radians, so we convert them using the RADIANS() function.
The subtraction in this function is necessary because lines of longitude are not parallel to each other as are the lines of latitude. Lines of longitude converge at the poles and are at their widest separation at the equator. The distance between them is a function of the latitude of the points.
The distance between degrees of latitude varies slightly from pole to equator, but the difference is small enough that it can be ignored for most purposes. A good compromise is 111,200 meters for latitudes in temperate zones. You don’t see this figure in this function, but it is used later in our optimization example.

We will use this function to compare the performance of legacy T-SQL methods to the new geospatial methods. You will notice some slight differences in results because of the differing precision of conversion factors in the two methods. However, the differences are trivial for most business applications. At any rate, you should not mix the computational methods in your applications, so your results should remain consistent.

Using SQL Server 2008’s Spatial Data Types

SQL Server 2008 introduced two spatial data types: Geometry and Geography. Both Geometry and Geography support a subset of the objects and functions prescribed by the Open Geospatial Consortium (OGC). I will not discuss the OGC standard here, but you can visit OGC website.

Geometry and Geography are complex data types, implemented as CLR assemblies. Because the assemblies are registered by default, you don’t need to do anything special to enable them. You will declare spatial types as you would any other data type.

By complex data types, I mean not only that these data types can contain multiple discrete values but also that they have methods and properties that can be accessed by your T-SQL code. In T-SQL, these methods are commonly referred to as functions, but they are accessed differently than normal T-SQL functions. Also, the function names are case-sensitive no matter what collation the database uses. The following is an example of a spatial function call:

SELECT Point1.STDistance(Point2)

This statement returns the distance between two points on the earth’s surface. Point1 is a value of the Geography data type that holds latitude and longitude coordinates. This value is followed by a period and the case-sensitive name of the method. The method STDistance() takes a single parameter that represents another point value of the Geography data type. This seems to be a simple way to obtain the same result as the traditional T-SQL function in Listing 1.

The following are a couple of additional frequently used functions of the Geography data type:

Point1.Lat returns the latitude of Point1. Point1.Long returns the longitude.
Object1.STIntersection(Object2) returns the point at which lines or polygons represented by Object1 and Object2 intersect.

The basic difference between the Geometry and Geography types is that Geography deals with round-earth coordinates—latitude and longitude. Geometry deals with flat surfaces using arbitrary Cartesian coordinate systems. In this article and the examples, I will work only with Geography data, as that is the type most commonly used for business applications. My examples use point data, although the spatial data types also support complex lines and polygons.

Spatial data types let you code solutions such as those previously mentioned without wading through cosine and arctangent functions to calculate linear distances on the surface of a sphere. The methods contained in the data type itself can make this calculation automatically and very accurately, as well as a variety of other complex calculations.

In addition to the spatial data types, Microsoft introduced a spatial index. We will look at the performance of spatial indexes in the upcoming examples.

Loading the Sample Data

I used the ZIP code data from the US Census Bureau for the examples because it is a body of spatial data that is a reasonable size and free and simple to download. You can create a test database to run this project or download one of the Microsoft sample databases.

There are many ways to load this flat file into SQL Server. The code in Listing 2 begins by creating and populating the ZipCodes table using OPENROWSET with the bulk load option.

Listing 3 provides the code of the format file used by the OPENROWSET function. Save this file as a .xml file before you try to run the OPENROWSET function. The path and filename used in Listing 2 is C:\GEO\zcta5format.xml. If you use that name and location to save the file, be sure to change the file path in the OPENROWSET function.

After making some necessary changes to the table, the script then updates the table to load the same latitude and longitude values into the spatial column GEO. Note that the UPDATE statement imports the data from the latitude and longitude columns into the Geography data type. It uses the STPointFromText() spatial function to insert point data as a string formatted in the Well-Known Text (WKT) format that is supported by OGC. As you can see, the WKT string starts with the type of geography to be created, followed in this case by the longitude and latitude of the point. This string is followed by a comma and a numeric Spatial Reference ID (SRID). The SRID identifies the type of coordinates, the units of measurement, and the particular model of the earth represented by your geography data. (That is a gross over-simplification of a complex topic, but it will do for our purposes). All your spatial data should have the same SRID.

Until you get to the point in which you know why you might want to use another SRID, stick with SRID 4326. This is the SRID used by GPS devices and is the SRID you will encounter most often. Because SRID 4326 uses the meter as its distance unit, US users will typically need to use the following conversion values to convert function results into miles, and vice-versa:

Meters to miles: *0.000621371192
Miles to meters: *1609.344

Finally, Listing 2 creates three indexes: a clustered primary key on the latitude and longitude columns, a nonclustered index on the ZIP code, and a spatial index on GEO. At this point you should use Listing 1 to create the UDF that we previously examined.

Going Postal

Since our data is all about ZIP codes, our examples will be postal in nature. Let’s assume that the geographic coordinates of the ZIP code describe the physical location of the post office itself (which is not strictly true of this data).

Example 1: Finding the distance between ZIP codes. In our first example, an eCommerce site needs to find the distance between its warehouse and the delivery ZIP code to calculate shipping charges for an order. Listing 4 contains the code for two queries that accomplish this task. The first query uses spatial data types and the second query uses a more traditional T-SQL approach by calling the UDF to perform the calculations.

You might notice in Listing 4 is that there is substantially less code to write when you use spatial types. And both statements consume a very small amount of resources in using the traditional indexes. The resource usage, as captured by SQL Server Profiler, is essentially equal for both forms of the query.

Example 2: Finding all locations within a specified radius. Let’s assume that the US Postal Service wants to display a map on its website showing the location of all the nearby post offices based on the ZIP code you enter. Listing 5 contains two queries that accomplish this task. As in the previous example, the first query uses spatial data types, and the second uses our T-SQL UDF.

Both of the queries in Listing 5 return the location of the same 149 post offices. However, when you look at the cost of the two queries, you find that the query that uses spatial data types is far more expensive in terms of disk I/O. The legacy T-SQL method finished with 119 page reads on a partial scan of the clustered index. There were 3,300 page reads required for the spatial version—far more than the 476 pages that exist in the table itself. The spatial query performed slightly better than the UDF version in terms of duration and CPU usage.

The problem with spatial data is that there is often a lot of it. Multimillion-row tables are the rule rather than the exception when dealing with geographic points. Because our sample data contains only about 33,000 rows, we do not see the full effect of the potentially critical problem that can be posed by a table of several million rows.

Many common data retrieval operations, such as the previous radius query, involve calculations to filter the data. Calculations or function calls in a WHERE clause often prevent indexes from being used. The problem is that the database engine doesn’t know the distance to any ZIP code in the table until it takes that row and runs the function, using the row values. This means that the database engine must run the function for every row in the table before it can determine which rows to return. The result is a full table scan on a huge table.

Example 3: Squaring the circle. A practical way to optimize a radius query is to limit the number of rows on which the distance calculation must be performed. Consider our radius query. It is obvious that all the points within that circle with a 50-mile radius must also lie within a 100-mile square that has the same center point.

You can use the coordinates of the center point and the length of the radius to calculate the latitude and longitude of four points that define the corners of the square. Ninety-nine percent of the rows in the table can be eliminated by a preliminary index seek that passes only the few points in the square to the distance function. This makes the query orders of magnitude faster on large data sets. However, you will see in this example that spatial data types and spatial functions do not take advantage of this kind of optimization. Listing 6 contains both forms of the query.

After running both versions of the query, you will find that the query using spatial data types to filter the latitude and longitude columns is significantly more expensive than the filter using normal indexes. Traditional T-SQL filtering on the latitude and longitude columns performs 26 page reads to return the data. But the spatial query requires 3,274 page reads because it performs a full scan of the full spatial index and also a full scan of the clustered index. The execution time of the two versions is not significantly different in this relatively small table, but full table scans on much larger collections of spatial data could be a show-stopper.

As a further optimization, I included a distance function in both forms of the query to exclude all points outside the radius to confirm that we would get the same results as the previous radius query. However, if you are placing the retrieved points on a square map on a website, it might make more sense to eliminate the distance function altogether and display all points found within the square.

Query Performance

Logical I/O is arguably the most dependable measurement of query performance. By that metric, traditional methods seem to provide much better performance than spatial data types. However, the two forms of the query are much closer in terms of other measures such as CPU usage and duration. In some cases, the spatial function actually performs better than the UDF by those metrics.

In searching for a reason for the big disparity in page reads, I speculated that indexing on the table, particularly the clustered index on latitude and longitude, might promote artificially efficient range queries on those values. In the real world, that might not be the most efficient overall indexing plan. Perhaps spatial data types might compare more favorably in a different indexing scheme. I tested this theory with a couple of different indexing schemes that I thought might be appropriate for a table like this. The numbers changed somewhat, but the huge disparity in logical reads remained in all the indexing schemes I tried.

Still, you must remember that this was not an exhaustive performance analysis. I tested performance on only a single aspect of spatial data types. It would be wrong to conclude that spatial data types and functions are inherently less efficient.

The primary benefit of spatial data types is their ease of use in manipulating geospatial data in various complex ways. The simple examples in this article did not fully display this enhanced functionality. If spatial types provide adequate performance, their functionality should make up for a slight degree of inefficiency (if, in fact, there is one).

Other Spatial Data Functionality

I explored only a small part of the functionality available through spatial data types. Querying geographical point data might be the most common application, but spatial data types are capable of doing much more. Spatial data types support points, linestrings, and polygons, as well as collections of those types. SQL Server supplies about two dozen functions to work with those shapes in various ways. For example, you can define a polygon of virtually any shape and then calculate its area with a simple function call.

It is this additional functionality that differentiates spatial data types from traditional T-SQL. Although a good programmer or geometrician can reproduce the same complex functionality using pure T-SQL, this requires a huge amount of work, and the number of programmers with the necessary mathematical background is relatively small. Spatial functions give you the capability to do extraordinarily complex things with geographic or geometric data, without coding complex formulas. If you would like to dive a little deeper into working with spatial data, see Alistair Aitchison’s book Beginning Spatial with SQL Server 2008 (Apress).

CREATE FUNCTION dbo.DistanceInMiles
            (@LatA             FLOAT = NULL,
             @LongA          FLOAT = NULL,
             @LatB             FLOAT = NULL,
             @LongB          FLOAT = NULL
             )
RETURNS decimal(9,1)
AS
BEGIN
            DECLARE @Distance FLOAT
            SET @Distance = (SIN(RADIANS(@LatA)) *
              SIN(RADIANS(@LatB)) +
              COS(RADIANS(@LatA)) *
              COS(RADIANS(@LatB)) *
              COS(RADIANS(@LongA - @LongB)))
            -- Convert to miles
            SET @Distance = CAST( (DEGREES(ACOS(@Distance))) * 69.09 AS decimal(9,1))
            RETURN @Distance

END
GO

-- Load data and prepare table

USE MYDB
GO

IF OBJECT_ID('ZipCodes', 'U') IS NOT NULL
DROP TABLE ZipCodes
GO

-- openrowset creates table and loads data
SELECT *
INTO Zipcodes
FROM OPENROWSET(BULK 'C:\GEO\zcta5.txt',FORMATFILE = 'C:\GEO\zcta5format.xml') AS Z

ALTER TABLE ZipCodes DROP COLUMN deleteme;
ALTER TABLE ZipCodes add GEO geography;
ALTER TABLE ZipCodes ALTER COLUMN state VARCHAR(2) NOT NULL;
ALTER TABLE ZipCodes ALTER COLUMN ZIP VARCHAR(5) NOT NULL;
ALTER TABLE ZipCodes ALTER COLUMN latitude FLOAT NOT NULL;
ALTER TABLE ZipCodes ALTER COLUMN longitude FLOAT NOT NULL;

-- copy data into spatial data type
UPDATE \\[dbo\\].\\[ZipCodes\\]
SET \\[GEO\\] =
geography::STPointFromText('POINT('
+ CAST(\\[Longitude\\] AS VARCHAR(20)) + ' ' +
CAST(\\[Latitude\\] AS VARCHAR(20)) + ')', 4326)
GO

-- add indexes
ALTER TABLE ZipCodes ADD PRIMARY KEY (latitude, longitude);
CREATE NONCLUSTERED INDEX nc_zip ON ZipCodes(ZIP);
CREATE SPATIAL INDEX SIndxGEO ON \\[ZipCodes\\](GEO);
GO
-- create udf for legacy examples

CREATE FUNCTION dbo.DistanceInMiles
(@LatA             FLOAT = NULL,
@LongA           FLOAT = NULL,
@LatB FLOAT = NULL,
@LongB           FLOAT = NULL
)
RETURNS decimal(9,1)
AS
BEGIN

DECLARE @Distance FLOAT

SET @Distance = (SIN(RADIANS(@LatA)) *
SIN(RADIANS(@LatB)) +
COS(RADIANS(@LatA)) *
COS(RADIANS(@LatB)) *
COS(RADIANS(@LongA - @LongB)))

-- Convert to miles
SET @Distance =
CAST( (DEGREES(ACOS(@Distance))) * 69.09 AS decimal(9,1))

RETURN @Distance

END
GO

  <?xml version="1.0" ?>
- <BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
- <RECORD>
  <FIELD ID="f01" xsi:type="CharFixed" LENGTH="2" />
  <FIELD ID="f02" xsi:type="CharFixed" LENGTH="5" />
  <FIELD ID="f03" xsi:type="CharFixed" LENGTH="59" />
  <FIELD ID="f04" xsi:type="CharFixed" LENGTH="9" />
  <FIELD ID="f05" xsi:type="CharFixed" LENGTH="9" />
  <FIELD ID="f06" xsi:type="CharFixed" LENGTH="14" />
  <FIELD ID="f07" xsi:type="CharFixed" LENGTH="14" />
  <FIELD ID="f08" xsi:type="CharFixed" LENGTH="12" />
  <FIELD ID="f09" xsi:type="CharFixed" LENGTH="12" />
  <FIELD ID="f10" xsi:type="CharFixed" LENGTH="10" />
  <FIELD ID="f11" xsi:type="CharFixed" LENGTH="11" />
  <FIELD ID="f99" xsi:type="CharTerm" TERMINATOR="\r\n" />
  </RECORD>
- <ROW>
  <COLUMN SOURCE="f01" xsi:type="SQLCHAR" LENGTH="2" NAME="state" />
  <COLUMN SOURCE="f02" xsi:type="SQLCHAR" LENGTH="5" NAME="ZIP" />
  <COLUMN SOURCE="f03" xsi:type="SQLVARYCHAR" LENGTH="100" NAME="deleteme" />
  <COLUMN SOURCE="f04" xsi:type="SQLBIGINT" NAME="population_tot" />
  <COLUMN SOURCE="f05" xsi:type="SQLBIGINT" NAME="housing_tot" />
  <COLUMN SOURCE="f06" xsi:type="SQLFLT8" NAME="water_area_meter" />
  <COLUMN SOURCE="f07" xsi:type="SQLFLT8" NAME="land_area_meter" />
  <COLUMN SOURCE="f08" xsi:type="SQLFLT8" NAME="water_area_mile" />
  <COLUMN SOURCE="f09" xsi:type="SQLFLT8" />
  <COLUMN SOURCE="f10" xsi:type="SQLFLT8" />
  <COLUMN SOURCE="f11" xsi:type="SQLFLT8" />
  </ROW>
  </BCPFORMAT>

-- COMPUTE DISTANCE BETWEEN TWO POINTS USING SPATIAL FUNCTION

DECLARE @Geo1 geography; 
DECLARE @Geo2 geography;

-- get coordinates for warehouse
SELECT @Geo1 = GEO

FROM Zipcodes
WHERE ZIP = '97219';
-- get coordinates for delivery location
SELECT @Geo2 = GEO
FROM Zipcodes
WHERE ZIP = '55437';
-- return result as miles
SELECT CAST((@Geo1.STDistance(@Geo2) * 0.000621371192) AS decimal(9,1));

-- COMPUTE DISTANCE BETWEEN TWO POINTS USING TSQL UDF

DECLARE @WarehouseZip       varchar(5);
DECLARE @DeliveryZip            varchar(5);
DECLARE @LatWhs                             FLOAT;
DECLARE @LongWhs              FLOAT;
DECLARE @LatDel                               FLOAT;
DECLARE @LongDel                FLOAT;

SET @WarehouseZip = '97219';
SET @DeliveryZip = '55437';

-- get coordinates for warehouse
SELECT            @LatWhs = latitude,
                        @LongWhs = longitude
FROM Zipcodes
where ZIP = @WarehouseZip;
-- get coordinates for delivery location
SELECT            @LatDel = latitude,
                        @LongDel = longitude
FROM Zipcodes
where ZIP = @DeliveryZip;

-- return result as miles
Select  dbo.DistanceInMiles(@LatWhs, @LongWhs, @LatDel, @LongDel);

-- RETURN ALL POINTS WITHIN RADIUS USING SPATIAL FUNCTION

DECLARE @geo1 geography
,                       @RadMiles int 
,                       @MaxMeters  decimal (12,2);

SET @RadMiles = 50; -- radius of area searched
SET @MaxMeters = @RadMiles *1609.344;

-- get coordinates of entered zipcode
SELECT @geo1 = GEO
FROM \\[Zipcodes\\]
WHERE ZIP = '97219';

SELECT  ZIP, GEO.Lat, GEO.Long
FROM \\[ZipCodes\\]
WHERE @geo1.STDistance(GEO) <= @MaxMeters;

-- RETURN ALL POINTS WITHIN RADIUS USING TSQL UDF

DECLARE @RadMiles float
,                       @Lat_in float
,                       @Long_in float;

SET @RadMiles = 50; -- radius of area searched

-- get coordinates of entered zipcode
SELECT            @Lat_in = latitude,
                        @Long_in = longitude
FROM Zipcodes
where ZIP = '97219';

SELECT ZIP, latitude, longitude
FROM ZipCodes
WHERE dbo.DistanceinMiles
(@Lat_in
, @Long_in
, latitude
, longitude) <= @RadMiles;

-- SQUARING THE CIRCLE WITH SPATIAL FUNCTION

DECLARE @Center geography
,           @RadiusMile float
,           @RadiusMeter float
,           @OffsetLat float
,           @OffsetLong float
,           @latN float
,           @latS float
,           @longE float
,           @longW float;

-- input the radius
SET @RadiusMile = 50;

SELECT @Center= GEO
FROM Zipcodes
WHERE ZIP = '97219';

-- convert miles to meters
SELECT @RadiusMeter = (@RadiusMile * 1609.344);

-- Get offset of corners from center  in degrees
SELECT @OffsetLat = @RadiusMeter/111200;
SELECT @OffsetLong =  @RadiusMeter/(111200 * COS(radians(@Center.Lat)));

-- calculate lat and long of corners
SELECT @longW = (@OffsetLong - @Center.Long) * -1;
SELECT @longE = @OffsetLong + @Center.Long;
SELECT @latN = (@OffsetLat + @Center.Lat);
SELECT @latS = (@Center.Lat - @OffsetLat);

-- get all points within the square
SELECT ZIP, GEO.Lat, GEO.Long
FROM Zipcodes
WHERE geo.Long < @longE and geo.Long > @longW
and Geo.Lat < @latN and Geo.Lat > @latS
and @Center.STDistance(GEO) <@RadiusMeter;

-- SQUARING THE CIRCLE USING TSQL UDF

DECLARE @CntrLat float
,           @CntrLong float
,           @RadiusMile float
,           @RadiusMeter float
,           @OffsetLat float
,           @OffsetLong float
,           @latN float
,           @latS float
,           @longE float
,           @longW float;

-- input the radius
SET @RadiusMile = 50;

SELECT @CntrLat= latitude
,@CntrLong = longitude
FROM Zipcodes
WHERE ZIP = '97219';

-- convert miles to meters
SELECT @RadiusMeter = (@RadiusMile * 1609.344);

-- Get coordinates of corners in degrees
SELECT @OffsetLat = @RadiusMeter/111200;
SELECT @OffsetLong =  @RadiusMeter/(111200 * COS(radians(@CntrLat)));

-- calculate lat and long of corners
SELECT @longW = (@OffsetLong - @CntrLong) * -1;
SELECT @longE = @OffsetLong + @CntrLong;
SELECT @latN = (@OffsetLat + @CntrLat);
SELECT @latS = (@CntrLat - @OffsetLat);

-- get all points within the square
SELECT ZIP, GEO.Lat, GEO.Long
FROM Zipcodes
-- this filter uses comparisons and standard indexes
WHERE longitude < @longE and longitude > @longW
and latitude < @latN and latitude > @latS
and dbo.DistanceinMiles
(@CntrLat
, @CntrLong
, latitude
, longitude) <= @RadiusMile;

Comments

Plain text