Needed: A Search Engine Tune-Up

I love the Web. I can't remember how long it's been since I actually drove to a library to do research. The Web is so cool. But up until about a year ago, it was a lot cooler. Have you noticed that searching for technical topics has become less useful? This month, I want to explain a bit about how search engines work--or better yet, don't work.

From a technical point of view, all Internet search engines have to do three things: They scour the Web to find all its indexable pages; they index the text on those pages; and--here's the hard part--they rank the relevance of those pages. That way, when someone searches for "chocolate chips," the first page of search results doesn't include links to Intel and AMD--companies that probably make nearly as many chips as Nestle.

The first two steps are easy, though laborious and resource-intensive. The search engines start with a list of publicly available DNS domains and look for Web servers on those domains. They examine those Web sites' pages to find hyperlinks to even more Web pages, then download all those pages to a truly monstrous database. Once the pages are in the database, the search engines index them. Then it's time to rank the Web pages' usefulness.
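To make the first two steps concrete, here is a minimal Python sketch of crawling and indexing. It's an illustration under loud assumptions, not how any real search engine is built: the "Web" is a made-up in-memory dictionary (TOY_WEB) rather than real servers, and the names crawl and build_index are hypothetical helpers invented for this example.

```python
from collections import deque

# Hypothetical toy "Web": URL -> (page text, outgoing hyperlinks).
# These are made-up pages, not real sites.
TOY_WEB = {
    "a.example/": ("chocolate chips recipe", ["a.example/cookies", "b.example/"]),
    "a.example/cookies": ("cookies with chocolate chips", []),
    "b.example/": ("silicon chips for computers", ["a.example/"]),
}

def crawl(seeds, web):
    """Follow hyperlinks breadth-first from the seed URLs; return {url: text}."""
    store = {}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if url in store or url not in web:
            continue  # already downloaded, or the link is dead
        text, links = web[url]
        store[url] = text      # "download" the page into the database
        queue.extend(links)    # newly found links lead to more pages
    return store

def build_index(store):
    """Map each word to the set of URLs whose text contains it."""
    index = {}
    for url, text in store.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

pages = crawl(["a.example/"], TOY_WEB)
index = build_index(pages)
```

In this toy data, looking up "chips" in the index returns all three pages, the chip maker's included, which is exactly why the third step, ranking, is the hard part.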

Clearly, Web pages don't come pre-stamped with a "usefulness" or "reliability" rank, so search engines have to guess a page's usefulness in answering some query. In a perfect world, information digitally signed into the page would explain characteristics about the page, such as who wrote the content, how reliable the author is, who reviewed the page, and whether the author and reviewer were paid for the content. Instead, every search engine uses a variation on the "usefulness" scale used in the academic and professional research worlds.

Research institutions compete for research dollars (I'm simplifying), so benefactors use various measurements to decide the value of those institutions' work and which of those institutions deserve research money. One such measurement is the total number of scholarly articles an institution's researchers publish. Because some research is more useful than other research, another index counts how many of these scholarly articles have been cited as references in articles written by other researchers. Thus, if Professor Smith cranks out reams of articles about insignificant minutiae that no one ever reads or cites, and Professor Einstein writes one article about matter and energy that's cited by thousands of other researchers, then Professor Einstein is considered to have contributed more to the field.

Search engines use a similar method (again, I'm simplifying) by ranking a Web page according to the number of other Web pages that contain hyperlinks pointing to it. If a given Web page on your site includes text about Queen of the Night tulips, and a zillion other Web pages point to that Web page, then anyone searching for "Queen of the Night tulip" will see your page fairly high up in the search rankings.
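The link-counting idea can be sketched in a few lines of Python. This is a deliberately simplified illustration of ranking by raw inbound-link count, not any particular engine's algorithm; the link graph (LINKS), the page names, and the helpers inbound_counts and rank are all hypothetical.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to (made-up names).
LINKS = {
    "tulips.example/queen-of-the-night": [],
    "garden.example/": ["tulips.example/queen-of-the-night"],
    "bulbs.example/": ["tulips.example/queen-of-the-night", "garden.example/"],
    "blog.example/": ["tulips.example/queen-of-the-night"],
}

def inbound_counts(links):
    """Count how many distinct other pages link to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):   # count each linking page only once
            if target != source:      # assumption: self-links don't count
                counts[target] += 1
    return counts

def rank(matching_pages, counts):
    """Order pages by inbound links, most-linked first; ties sort by name."""
    return sorted(matching_pages, key=lambda page: (-counts[page], page))

counts = inbound_counts(LINKS)
ranking = rank(list(LINKS), counts)
```

Here the tulip page is linked from three other pages, so it lands at the top of the ranking. Note that the count says nothing about who created those links or why.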

Once upon a time, this method served as a moderately decent way for a search engine to present a pretty useful set of links, at least in the realm of technical topics. But nowadays the quest for a buck and a little fame has led to a Web that, well, needs a little work. Next month, we'll see why this has rendered search engines far less useful and how they might improve.
