For our master's degree program in Management of Information Systems, we had to complete an analysis project that required us to learn the nuances of data mining with "real" data. Our previous coursework used data that had been cleaned and curated so that every student would get the same results. Our school had chosen SQL Server 2000 because it offers the functionality of a corporate database coupled with a low price and easy maintenance. We were given great freedom in choosing our analysis topic. Many other students chose marketing or sales analyses, but because we have a natural science background, we chose geological and biological data. By creating a data warehouse and mining this data, we found that SQL Server 2000 provides a fast way to statistically survey geological data.
Scientists around the world are continuously gathering large amounts of geological data; it was from these sources that we obtained our sample data. Because SQL Server is easier to use than traditional statistical methodology, we decided to pool existing data sources into a SQL Server data warehouse. We also used SQL Server to verify the original analyses of the geological data, which had been performed with traditional statistical methods. We built two data warehouses from two different geological data sources. These sources were comparatively small because we were doing an academic project, but we believe that the findings would hold true no matter the size of the data set. One data set, consisting of more than 20 files, was a study of the Eastern Continental Shelf that the United States Geological Survey (USGS) had collected and analyzed between 1955 and 1970. The second database contained the results from a study of the decomposition effects of burial on six different wood species in the Bahamas and Gulf of Mexico. The amount of wood standing or decaying in water throughout the world is phenomenal. The decomposition of wood produces greenhouse gases, but this process stops when the wood is buried. This study is significant because it provides scientists with additional valuable information about the sources of hydrocarbon emissions into our atmosphere.
Cleaning the Original USGS Data
The USGS results, a data set consisting of 20 text files that include both analytical and descriptive data types, were previously published by the National Geophysical Data Center (NGDC). (See Hathaway, John C., 1971 Data File, Continental Margin Program, Atlantic Coast of the United States: WHOI Reference No. 71-15.) We created a composite primary key based on existing columns that held the station number, sample ID, and sub-sample letter, and we set the granularity at the sub-sample level (the sub-sample code identifies samples that were divided from a larger sample). Figure 1 shows sample data from one of the files.
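A composite key of this kind can be sketched in T-SQL as follows; the table and column names are our illustration, not the actual USGS file layout:

```sql
-- Hypothetical sketch of the sub-sample table. Column names are
-- illustrative; the USGS files use coded headers.
CREATE TABLE SubSample (
    StationNo     int          NOT NULL,  -- sampling station number
    SampleID      varchar(10)  NOT NULL,  -- alphanumeric sample identifier
    SubSampleCode char(1)      NOT NULL,  -- letter identifying a split of a larger sample
    CONSTRAINT PK_SubSample
        PRIMARY KEY (StationNo, SampleID, SubSampleCode)
);
```

Setting the granularity at the sub-sample level means every analytical measurement hangs off this three-column key.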
To clean the original USGS data, we imported it first into Microsoft Excel, then into Microsoft Access, and finally into SQL Server 2000. We took this circuitous route because of the large variation in data format and definition within the text files. The import functions in SQL Server and Access weren't as robust as Excel's. Using the Excel Text Import Wizard, we manually set the field widths and formats for each column, creating new columns for amalgamated data and splitting data fields. This process eliminated much of the data variation. We then loaded the data into Access, using Access's import external data option. During the import, Access found more data-formatting errors; it stored the row numbers of incompatible data in an import error file for reference. Figure 2 shows one problem we encountered: Data formats within two columns changed from numeric to alphabetic, then back to numeric. To correct the errors, we used a combination of editing techniques in Excel and wrote SQL code to reformat the data within the Access table. We then manually edited some of the data and used UPDATE queries to change other data to its numerical equivalent.
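The UPDATE queries followed a simple find-and-replace pattern. This is a sketch only; the table name, column name, and code values are hypothetical stand-ins for the actual file contents:

```sql
-- Hypothetical cleanup query: replace a leftover alphabetic code
-- with its numeric equivalent. 'TR' and '0.1' are invented values
-- used here only to show the pattern.
UPDATE SedimentData
SET    GrainSizePct = '0.1'
WHERE  GrainSizePct = 'TR';   -- alphabetic code left in a numeric column
```

Running one such query per stray code was faster and less error-prone than hand-editing the affected rows.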
We couldn't import some files into Access because of data-type incompatibility errors in the Sample ID field. This field is alphanumeric, but Access was trying to format it as an integer. We tried to use the Excel Text Import Wizard to modify the data format; however, when we imported the data into Access, Access didn't recognize the Excel character format. We then tried to design the table in Access and import external data, but Access still overrode the character format. As a workaround, we loaded the data into SQL Server, declaring the field as a character type, then re-imported it into Access for field identification and data cleanup. We found table manipulations easier to manage in Access 2000 than in SQL Server 2000.
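Declaring the field as character data in SQL Server looked roughly like the following; the table and column names are illustrative, not the names we actually used:

```sql
-- Hypothetical staging table: declaring Sample ID as character data
-- up front prevents the integer coercion we hit in Access.
CREATE TABLE StagingSamples (
    SampleID varchar(12)  NOT NULL,  -- alphanumeric, so int won't do
    RawLine  varchar(255) NULL       -- unparsed remainder of the record
);
```

Because SQL Server honored the declared varchar type, the round trip back into Access preserved the alphanumeric IDs intact.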
Cleaning Up the Wood Study Data
Figure 3 shows the data from the wood study, which had been published in Elizabeth Heise's dissertation (A comparison of the fossilization potential and recycling of wood by wood-boring bivalves and isopods on the shelf and slope of the Bahamas and Gulf of Mexico, Texas A&M University, 2001) and was therefore quite clean. The most difficult task was dealing with hidden nulls in a text column that we were using as a primary key field. The null fields were invisible in the Excel worksheet and didn't appear when the tables were printed, so we didn't know they were a problem until Data Transformation Services (DTS) refused to transform the Excel spreadsheet into a SQL Server table. We removed the primary key designation from that column so that the DTS Import Wizard would allow nulls in it, then used DTS to create the table in SQL Server format. When we opened the table with all rows visible in SQL Server 2000, three rows of nulls showed up at the bottom. As an afterthought, we used the DTS preview option to view the files that contained the bad records; the nulls were then visible. Based on this experience, we recommend using the DTS preview option on all files before attempting to load them into SQL Server.
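Once the rows are in a staging table, a query like the following (our sketch, with a hypothetical table and column name) flags the hidden blanks before you reapply the primary key:

```sql
-- Hypothetical check: find rows where the intended key column is
-- null or contains only whitespace, which hides as easily as a null.
SELECT *
FROM   WoodStudyStaging
WHERE  SiteName IS NULL
   OR  LTRIM(RTRIM(SiteName)) = '';
```

Deleting whatever this query returns lets the PRIMARY KEY constraint go back on without errors.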
We edited the Excel spreadsheet, pasting into a new Excel spreadsheet only the rows that contained data. We deleted all the data from the SQL Server table and ran the DTS Import Wizard again to produce eight relational tables in SQL Server format. Figure 4 shows the five tables that represent one of the two locations in the wood study.
Building the Data Warehouses
For the geological data warehouse, we used a star schema design with a fact table containing all analytical results and dimension tables containing time, location, sample description, sampling method, and cruise information. Figure 5 shows the design we used. Initially, we tried to populate the fact table from Access by using the DTS Lookup option. We wanted to use DTS Lookup because we were populating the fact table with analysis data based on the multifield primary keys from many different table sources. After checking various sources on the Web and in SQL Server Books Online (BOL) and asking questions on MSDN user groups, we concluded that it was probably impossible to look up a value in the table you're writing data to. We also learned that DTS Lookup slows package execution significantly because it forces row-by-row data loading.
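A heavily simplified sketch of the star schema follows; our actual tables had many more columns, and all names here are illustrative:

```sql
-- Hypothetical, simplified star schema: one dimension table and
-- the fact table that references it.
CREATE TABLE DimLocation (
    LocationKey int IDENTITY PRIMARY KEY,  -- surrogate key
    Latitude    decimal(8,4),
    Longitude   decimal(8,4)
);

CREATE TABLE FactAnalysis (
    StationNo     int         NOT NULL,
    SampleID      varchar(10) NOT NULL,
    SubSampleCode char(1)     NOT NULL,
    LocationKey   int         NOT NULL
        REFERENCES DimLocation (LocationKey),
    SandPct       decimal(5,2),            -- analytical results
    ClayPct       decimal(5,2),
    CONSTRAINT PK_FactAnalysis
        PRIMARY KEY (StationNo, SampleID, SubSampleCode)
);
```

Each of the other dimensions (time, sample description, sampling method, cruise) hangs off the fact table the same way.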
Using SQL Server 2000 DTS, we created a local package that held each of the data transformations. We typed in the commands that Listing 1 shows to update each of the tables inside an execute SQL task in the local package. We found executing each transformation task separately to be more practical than simultaneously because we could find the coding errors more quickly.
Listing 2 shows one command that we used to import data into the dimension tables. After the data was properly transferred, we used a T-SQL query similar to the query that Listing 1 shows to load data into the dimension tables. However, this second query contained a WHERE clause that equated the fact table's appropriate foreign key with the primary key of the dimension table it was being linked to.
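The key-matching pattern can be sketched like this; it is not the actual Listing 1 query, and every name in it is hypothetical:

```sql
-- Hypothetical sketch: set the fact table's foreign key wherever
-- a natural key carried over from the source files matches the
-- dimension row.
UPDATE f
SET    f.CruiseKey = c.CruiseKey
FROM   FactAnalysis AS f
JOIN   DimCruise    AS c
  ON   f.CruiseName = c.CruiseName;  -- natural key from the source data
```

A set-based UPDATE like this replaces the row-by-row lookups we originally hoped to do with DTS Lookup.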
The wood study data resided in eight tables that used the site name as the primary key. The site name was assigned when the original science study began and was unique for each sample. For the wood study data warehouse, we again chose a star schema design, using the site information table as the fact table. The granularity for the wood study was the wood species at each site name.
Mining the Data
The SQL Server 2000 data-mining package can build two types of models—decision tree and clustering—that you can run against OLAP databases or relational databases. The clustering model lets you find general groupings of data. For example, when we used this model to mine the wood study relational tables, Analysis Manager grouped buried and partially buried wood separately. The geological data set was more complex, so we were able to see relationships using both models. However, we could find potentially significant results in the geological data only with the clustering model. When we clustered on two distinct grain-size categories, we found the pattern that Figure 6 shows. Along the US Eastern seaboard, the sand fraction clustered in the north between New York City and Washington, D.C. The clay fraction didn't have a regional cluster, but it had two predictive areas: one to the north between New York and Washington and the other off Charleston, farther south. The USGS clustering model showed a high probability that there would be significantly more organic and inorganic constituents in both the clays and the sands than in the other location profiles and grain sizes.
Elizabeth Heise had run the wood study data through statistical analyses several years earlier, but we compared our interpretations of our statistical results to the original statistical results only after we had mined the data in SQL Server 2000. Originally, running this data through traditional statistical analyses (analysis of variance (ANOVA), t-tests, and linear regression) required 16 months and 234 analyses to obtain statistically significant patterns. In 3 days, most of which we spent getting the data into SQL Server tables, we produced the same statistically significant patterns, thereby verifying the validity of SQL Server 2000's modeling and analysis tools.
For the scientific researcher in need of analysis tools, the difference in time required for the old and new analyses shows that you can gain tremendous analytical power by placing data into a relational model. Researchers can use the power and speed of SQL Server 2000 to statistically model their data. Initially, scientists might use both SQL Server 2000 and traditional statistical methods, but as their confidence in the results grows, they'll realize the usefulness of SQL Server functionality alone. Despite the limitations—the complex data cleanup we had to do with DTS, Access, and Excel and the impracticality of DTS Lookup—we've found that SQL Server 2000's clustering model is a fast, easy way to examine data in relational databases.