Exchange Search Indexing and the Problem with PDFs

Or, “Why I Hate Adobe with the Burning Passion of 10,000 Suns”

Full-text indexing is an under-appreciated but very valuable feature of Microsoft Exchange Server, and it has been part of the product since Exchange Server 2003. Because the server maintains a full-text catalog of message and attachment contents and handles all the indexing and re-indexing, users can search their mailboxes from a variety of clients. As a user, I love being able to search my mailbox from Outlook, Windows Desktop Search, OWA, and my iPhone or iPad. Sure, there are some clients (I'm looking at you, Microsoft Entourage) that don't support this functionality, but overall it's a very useful feature. (Mobile clients can use Exchange Web Services to search the catalog, but not all vendors have implemented this feature.)

Recently, I took time to do some search-related maintenance on my Exchange 2010 servers. Here's how it went.

First, I downloaded the Office 2010 search filters. Search filters, you'll recall, are the components also known as IFilters, a name that comes from the COM interface they implement. Of course, the first results you get when searching for "office 2010 ifilter" with either Google or Bing are for the Office 2007 Filter Pack, which is already included with Exchange 2010 RTM. The actual Office 2010 filter pack is here.

Next, I downloaded the latest version of the Adobe PDF IFilter. Don't be fooled by imitations: You need the 64-bit version. Adobe has a document describing a tedious manual installation process (and, as a bonus, the document has a couple of mistakes in it!), but I took the easy route instead and used the excellent script that Exchange MVP Pat Richard wrote.

Then I tested my changes to verify that they worked and found that PDF attachments weren't being searched. Uh oh.

The first clue to figuring out the problem was the documentation for the Get-FailedContentIndexDocuments cmdlet, which says that installing a new filter doesn't force re-indexing of existing attachments of that type. Moving a mailbox from one database to another is the only way to force these formerly lost attachments to be re-indexed. Deleting the search catalog seemed like it would do the trick as well, except that it didn't.

Unfortunately, Get-FailedContentIndexDocuments didn't really tell me anything useful. All the PDFs in my mailbox showed as "filter not found," with an error code of 0x80040d16. Naturally, that error code is hard to find on the Internet, but eventually I learned that its associated error is GTHR_E_FILTER_NOT_FOUND. That information seemed reasonable, so I took the next step: cranking up diagnostic logging on the search indexer.
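Part of the reason a code like this is hard to look up becomes clearer once you split it into the standard HRESULT bit fields. Here is a minimal Python sketch of that arithmetic; the function name `decode_hresult` is my own invention, not part of any Exchange or Windows tool.

```python
def decode_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into severity, facility, and code fields."""
    hr &= 0xFFFFFFFF  # normalize values that were printed as negative signed integers
    return {
        "failure": bool(hr >> 31),       # bit 31: 1 means failure
        "facility": (hr >> 16) & 0x7FF,  # bits 16-26: which component defined the code
        "code": hr & 0xFFFF,             # bits 0-15: the component-specific code
    }


if __name__ == "__main__":
    print(decode_hresult(0x80040D16))
```

For 0x80040d16 the facility field works out to 4, which is FACILITY_ITF: the meaning of the low 16 bits (0x0d16) is defined by the specific interface that returned it rather than by a system-wide table. That would explain why a generic error lookup is not much help here.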

That action didn't turn up anything, either. I was baffled until I heard from a helpful support engineer at Microsoft. He suggested that the problem was with the filters themselves: sometimes, he claimed, a filter will reject a file passed to it for indexing for no good reason.

I was skeptical at first, but then I decided to put his suggestion to the test. I gathered about a dozen PDF files, including two commercially produced eBooks; some random, downloaded PDFs (I used Google to search for "filetype:pdf", limited to the past 24 hours); and some documents I already had on my local computer. I found a unique term or phrase in each document, added the document as an attachment to an email message, and sent it to my Exchange 2010 mailbox. After the message appeared, I searched for the term through OWA.
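If you want to repeat this test with more than a handful of files, building the messages by hand gets old quickly. The following is a rough sketch of how one test message could be assembled with Python's standard `email` package. The function name, the sender address, and the subject format are all placeholders of mine, and the sketch only builds the message; it does not send it.

```python
from email.message import EmailMessage


def build_test_message(pdf_bytes: bytes, filename: str, recipient: str) -> EmailMessage:
    """Wrap one PDF in a plain message so the attachment is the only searchable payload."""
    msg = EmailMessage()
    msg["From"] = "search-test@example.com"  # placeholder sender address
    msg["To"] = recipient
    # Keep the unique search term OUT of the subject and body; otherwise a hit
    # could come from the message text instead of the attachment.
    msg["Subject"] = f"PDF search test: {filename}"
    msg.set_content("Attachment indexing test. See the attached file.")
    msg.add_attachment(pdf_bytes, maintype="application", subtype="pdf", filename=filename)
    return msg


if __name__ == "__main__":
    demo = build_test_message(b"%PDF-1.4\n", "sample.pdf", "me@example.com")
    print(demo["Subject"])
```

Sending is left out on purpose because it depends on your environment; something like `smtplib.SMTP(host).send_message(msg)` against a relay you are allowed to use would be one way to do it. The file name should also avoid the unique term, for the same reason as the subject line.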

The results? Sure enough, the Adobe PDF IFilter is just flat-out ignoring some PDFs . . . including genuine PDFs produced by Adobe software! For example, I tried searching for terms from two eBooks produced with Adobe InDesign. No hits. Likewise, I couldn't get any hits for terms included in PDF files generated by Word 2007 or by the VTeX/W math typesetting package. However, files generated by the free PDFCreator package, the free OpenOffice package, PowerPoint 2007, and Mac OS X all worked fine.
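Since the pattern seems to follow the software that generated each file, it helps to record the producer of every PDF you test. Below is a rough Python sketch that pulls the `/Producer` entry out of a PDF's raw bytes. It is a heuristic of my own, not a real PDF parser, and it assumes the document information dictionary is stored uncompressed as a literal string. Files that keep their metadata in compressed object streams, in hex strings, or only in XMP will return `None`.

```python
import re
from typing import Optional

# Matches a literal-string entry such as: /Producer (Some Tool 1.0)
# Backslash escapes like \) are allowed; nested unescaped parentheses are not handled.
_PRODUCER = re.compile(rb"/Producer\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def guess_producer(pdf_bytes: bytes) -> Optional[str]:
    """Return the raw /Producer string if it is visible in the file, else None."""
    match = _PRODUCER.search(pdf_bytes)
    if match is None:
        return None
    # Latin-1 never fails to decode; escape sequences are left exactly as stored.
    return match.group(1).decode("latin-1")


if __name__ == "__main__":
    sample = b"<< /Title (Demo) /Producer (Example Writer 1.0) >>"
    print(guess_producer(sample))
```

Grouping the hits and misses by this value makes it easier to see whether a failure tracks a particular generator, which is the kind of pattern I ended up finding by hand.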

This test provided an unsatisfying result. I don't feel like I found or fixed the problem; I just identified it more closely. Telling my users, "Sure, you can search attachments in Exchange, unless they happen to be PDFs, but then again maybe not," isn't what I had in mind. I hope that Adobe fixes its IFilter to work properly; it's a shame that Adobe's poor implementation is making Exchange search look bad.

In the meantime, I guess if you have to generate PDFs, you should use something other than Adobe products. Pity, that.
