I recently worked with a development team on a proof-of-concept effort to adapt an application to present data for analysis to Internet subscribers. The application's developers used Microsoft Access for data storage and the GUI. User, business, and data services were in an Access .mdb file: User services were implemented as Access forms, business services were implemented in Visual Basic for Applications (VBA) code in Access, and data services were implemented in the Access Jet database engine. Subscribers periodically downloaded the .mdb file from an FTP site to get new data. This process worked well for download subscribers; they had to deal with a fairly lengthy download to refresh their data, but then they could analyze the data locally. It was an effective data-mart solution.
The client, Vector Vest of Williamsville, New York, builds software for stock analysis and stock portfolio management. The company's database had grown to 800MB and was beginning to stretch the Access platform. In addition, Vector Vest wanted to create an Internet application to offer services to a wider audience, so it needed to move to a more scalable platform. "We were looking for more stability for our database as it grows, and scalability for Web applications," said Steve Idziur, Vector Vest's vice president for product development. "Moving to SQL Server was a way for us to minimize risk in maintaining our data and supporting thousands of users."
When Vector Vest's managers realized that the company's service had a huge potential market on the Internet, they investigated ways to adapt the application by building on the existing system. This new application would ultimately replace Vector Vest's Access 97 subscriber download application and master stock-analysis database. The Access application got its data from other Access systems, which the company's stock market analysts used. These analysts had developed sophisticated and proprietary algorithms for producing end-user data to help subscribers make stock-purchase decisions. So the company wanted to continue using Access for the analysts and to produce the downloadable application for existing subscribers. The challenge was to make the same data, with similar client-side analysis capability, available to thousands of potential Internet subscribers. To accomplish this goal, the proof-of-concept team prototyped a multitiered solution that uses SQL Server 7.0 for the data-services layer.
During the first stage of the project, the development team reviewed the Access database design. After discussing the nature of the data and how current subscribers were using it, we concluded that the database had some dimensional characteristics. That is, users accessed the data by specifying a few basic attributes, one of which was time. The Access database had a Time table that stored the date for each day and flags for each day, such as end-of-week to mark when the business week ended. The data to analyze was divided into three levels of granularity: sector (the highest level), industry, and stock (the lowest level of granularity). These attributes seemed to be good candidates for fact tables—tables that hold the numbers that help users make decisions. The problem with using these tables as fact tables was that the calculations by the analysts aren't additive. Simply adding up the values for individual stocks doesn't result in the correct values for an industry, nor does adding up industries result in the correct values for a sector. So we ended up with three fact tables and four dimension tables for our SQL Server database, as you see in Screen 1. The fact tables use a common time-dimension table, but the hierarchy of the stock, industry, and sector dimensions were broken out in the same way as the fact tables, for simplicity.
The next development stage included moving data into SQL Server daily from Access, while letting the client's analysts continue to operate in the Access environment and producing the .mdb for existing subscribers to download. So in SQL Server, we created a linked server (as Screen 2 shows) to the Access .mdb to download nightly to a network server, and used Data Transformation Services (DTS) to refresh the SQL Server data for the Web application. Listing 1 shows the Transact SQL (T-SQL) code we used to link the server to the Access database. Linking the server to the Access .mdb simplified updating the SQL Server database because specific queries could join tables from Access with SQL Server.
Because we could now reference Access and SQL Server tables in the same query, the data refresh was simpler. The application now downloads a new Access .mdb every evening and copies it to a network drive, using the same file name each day so that the linked server can recognize the file. After the new .mdb is copied, the DTS package executes. The DTS package is a series of T-SQL tasks that bring in the Access data in steps, starting by adding a new row to the day's Time table. The primary key of the SQL Server Time table is called Time_ID, as you see in Screen 1. This column is an integer column with an identity property that assigns the new row a new Time_ID. The prototype used the Time_ID value to load new data or update existing data in the fact tables by joining the Access tables to the SQL Server Time table on the date value. Listing 2 shows the T-SQL script for applying stock splits to the Stock_Fact table by joining Access tables to SQL Server tables.
We built the DTS package on a series of Execute SQL tasks that update the SQL Server data depending on what data is in the Access database. Not all the Access tables show up every day: If a table has no data, it doesn't exist in the Access database, so the existing tables vary from day to day. To deal with this discrepancy, we queried the linked server for a list of the tables in the downloaded Access .mdb and checked for the existence of each table before attempting to process data from it. Using the SP_TABLES_EX system stored procedure, which lists table information for a linked server, we populated a table in SQL Server that each T-SQL task in the DTS package can query as it executes, to check for data to load in certain steps. Screen 3 shows the DTS task that loads the Access table list into a SQL Server table. For each task, simple IF EXISTS statements, such as the one at the beginning of Listing 2, can check for tables in the current day's version of the Access database.
Building Business Services
At this point, we had a SQL Server database and an extract/transform/load (ETL) process to load the Access data periodically. The team then used Visual Basic 6.0, Microsoft Internet Information Server (IIS) 4.0, and Active Server Pages (ASP) to focus on the business and user services. We used a tool called Homer, which is available on the Microsoft Web site (http:// homer.rte.microsoft.com), to facilitate load testing. Although Microsoft doesn't support Homer, this tool was useful in load-testing the application and evaluating the SQL Server solution's scalability.
In the proof-of-concept effort, we also demonstrated Access 2000 and the Microsoft Data Engine (MSDE). Although the documentation on Access 2000/MSDE is sparse, we found that you can easily attach a copy of the SQL Server files (.mdb, .ldf) to Access 2000.
An interesting problem arose when we joined Access tables to SQL Server tables. In one case, an insert to a SQL Server table from a three-table join in Access took more than five minutes when we tried the join on the linked server from the SQL Server side. Because we were trying to keep the load process short, we tried to use an Access Query in the source to pre-join the tables, then join the query to SQL Server. This trick reduced the time for the join operation to 10 to 15 seconds. Listing 3 shows a query that joins a SQL Server table to an Access Query object.
The textual data stored in the Access database's Memo fields presented another challenge during this project. The freeflow text data in the Memo fields generally consumed less than 1000 bytes per column. You might think this amount would pose no problem in SQL Server 7.0 with its 8KB data pages. But that wasn't the case. One table had only one Memo field, so a large varchar column in SQL Server was adequate. But another table had 13 Memo fields, so long varchar columns wouldn't work because in SQL Server, a row can't span data pages. First, we tried to use text data type columns in SQL Server. Not a bad solution: SQL Server 7.0 can store more than one row's text or image data on a page, and the pages are mapped, rather than linked, as in SQL Server 6.5. However, binary large object (BLOB) data is a bit cumbersome to deal with when you're using such programming interfaces as ADO, so we didn't implement this solution in the Web prototype. I guess I'm still afraid of the big, bad BLOB. (See Michael Otey, "Who's Afraid of the Big, Bad BLOB?" April 1999, for information on BLOBs.)
As an alternative to using text data in the SQL Server database, we experimented with writing the Memo field data from the Access database to HTML files, then storing the Universal Naming Convention (UNC) path to the HTML files in the SQL Server database. This way, if Internet users want to see newsletter data, they can simply click on a link in their browsers to bring up the appropriate HTML document. We didn't capture specific performance data during the proof-of-concept, but as the application is developed, we'll investigate the performance differences between using BLOBs and using HTML files.
Microsoft has worked to make SQL Server interoperable with other database management systems (DBMSs) and has made setting up this interoperability fairly easy. I've used linked servers to Oracle and Informix databases to load data to SQL Server with good results. In this case, the team designed a solution that leverages both the client's investment in its Access applications and the power of a SQL Server to provide a stable environment for making data available to a new set of customers on the Internet.