I want to load Microsoft Word documents into SQL Server, then index the documents so I can use them in relational queries. How can I import and index the documents?
A SQL Server lets you import Word documents in several ways. Let's look at the most common methods. Note that before you load the documents into SQL Server, you need to create an image data-type column to store the data. You can then import the documents by using the textcopy.exe command-line utility to read the image files into the database. To obtain basic documentation about this tool, at a command prompt, type textcopy /?. Another approach for getting the Word documents into SQL Server is to write import code by using the ADO Stream interface. You can find sample code for this interface in the Microsoft article "How To: Access and Modify SQL Server BLOB Data by Using the ADO Stream Object."
Alternatively, you can move the binary data to SQL Server. The Microsoft article "How To: Retrieve and Update a SQL Server Text Field Using ADO" explains this approach. Moving the binary data lets you store parts of the data in the database and is useful when you need to control the data format. For example, if you want only between 1000 and 1010 bytes of the data, importing the binary data can be much faster than using the ADO Stream interface because SQL Server doesn't need to retrieve as much data from disk. People often use this technique to store bit masks that represent application on and off switches.
SQL Server 2000 comes with sample code that demonstrates how you can move the binary data. To view this code, just follow the ?\Program Files\Microsoft SQL Server\80\Tools\DevTools\Samples\ado path on the drive where you installed the code samples from your SQL Server 2000 CD-ROM. Expand the executable, then look in the Visual Basic (VB) directory to find the Samples subdirectory. In the Employee sample, notice how the code uses the FillDataFields() function.
To index Word documents, SQL Server 7.0 and later releases provide the full-text search component, which uses a mix of technologies to index large text and image columns. When you perform a full-text search, you need to specify which file type the image column contains and which filter you need to extract meaning from the binary data. For more information about using full-text search, review the topic in SQL Server Books Online (BOL) and read David Jones's article "Build a Better Search Engine," July 2000. Note that indexing Word documents doesn't magically produce a set of relational tables that contain keywords from your documents. However, indexing the files lets you include these Word documents in your searches. Possible ways of extracting keywords from the data include
- Using OLE automation to read user-defined keywords from the document. Save these keywords in relational tables at the same time you load the document.
- Using OLE automation to open the document and save it in Text (.txt) format. To extract the words that are important to you, work through the text file with your own "word breaker"--a program that looks at each word in the document, discards noise words such as a or the, and stores each unique word together with a count of each word.
- Searching newly full-textindexed documents for specific words, then entering the words in relational tables.
SQL Server 2000 provides several powerful tools and interfaces that allow fast loading, searching, and retrieval of binary Microsoft Office documents.