Bayesian Spam Filters

One of the most promising antidotes to spam is so-called Bayesian filtering, which calculates the probability that a given message is spam, based on analysis of messages previously identified as being spam or not being spam. The Bayesian approach demands less maintenance than keyword-based spam filters that require constant updating of word and phrase lists.

Much of the buzz around this technique started with Paul Graham's August 2002 article "A Plan for Spam" (see the first URL below). Although there is some debate about whether Graham's approach is precisely Bayesian, organizations have been exploring Bayesian methods and applying them to spam for several years. Microsoft Research's antispam effort, spearheaded by a group of Bayesian researchers, began in 1997 and has resulted in a patent. If you want to keep up with spam-fighting techniques, some understanding of the Bayesian technique is in order.

The Bayes in Bayesian was an 18th-century British clergyman and amateur mathematician, Thomas Bayes, who suggested in a posthumously published paper that the probability of some event occurring in the future is related to the proportion of times that event occurred in the past under the same circumstances. Later, mathematicians refined Bayes's ideas and, in the 20th century, built a formal system of classification and decision-making and began applying it to many tasks in science and engineering. (I first encountered Bayesian inference in the context of economics.) A key element of the Bayesian approach is that it depends on having some prior information about the problem at hand.

To some extent, the Bayesian approach models our everyday experience of using probability to try to determine the possible outcome of an action and make decisions. The Bayesian interpretation of probability is different from the coin-flipping experiments that most of us did in school (and which, I'm convinced, are largely an effort to convince students of the futility of gambling). Life isn't a series of random experiments from which we calculate frequency distributions. We must make decisions taking into account the likelihood of different consequences arising from those decisions and whether those consequences are good or bad.

In the case of spam, Bayesian inference suggests that if a new message contains text that appeared often in spam in the past but rarely in legitimate messages, then the new message is likely to be spam. The formal methods of calculating such a probability can also take into account the fact that a single false positive (a legitimate message quarantined as spam) is far more costly than many false negatives (spam messages left untouched in your Inbox).
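
To make the idea concrete, here is a minimal sketch of a word-based naive Bayes score in Python. It is not Graham's formula or any product's implementation. The training messages, the add-one smoothing, the function names, and the 0.9 threshold are all assumptions chosen for illustration. The high threshold is one simple way to reflect the point that a false positive costs more than a false negative.

```python
import math
from collections import Counter

# Tiny hand-made training set (illustrative only).
TRAINING = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

def train(messages):
    """Count, per class, how many messages contain each word."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        totals[label] += 1
        counts[label].update(set(text.lower().split()))
    return counts, totals

def spam_probability(text, counts, totals):
    """Estimate P(spam | words in text) with a naive Bayes model.

    Only words present in the message are considered, and add-one
    smoothing keeps unseen words from producing a zero probability.
    """
    log_odds = math.log(totals["spam"] / totals["ham"])  # prior odds
    for word in set(text.lower().split()):
        p_word_spam = (counts["spam"][word] + 1) / (totals["spam"] + 2)
        p_word_ham = (counts["ham"][word] + 1) / (totals["ham"] + 2)
        log_odds += math.log(p_word_spam / p_word_ham)
    # Numerically stable conversion from log-odds to a probability.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)

def is_spam(text, counts, totals, threshold=0.9):
    """Quarantine only when the score is high, to limit false positives."""
    return spam_probability(text, counts, totals) >= threshold

counts, totals = train(TRAINING)
```

With this toy data, `spam_probability("buy cheap pills", counts, totals)` comes out at about 0.95 and `spam_probability("monday meeting", counts, totals)` at about 0.14, so only the first message would cross the 0.9 threshold.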

Graham's method analyzes not just the message body but also the message header, which might contain information about the sender's mail server, foreign character sets, and attachments. He claims that his filter catches 99.5 percent of spam with fewer than one false positive for every 1000 messages received.

Graham presented an update at an antispam conference at MIT last month. He has expanded his list of "tokens"--telltale words and phrases to look for in incoming mail--to about 187,000 items. And his method can now handle a word differently depending on whether it appears in the subject, in a URL, or in an address field.
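
One way to treat a word differently depending on where it appears is to tag each token with its location before counting it. The Python sketch below illustrates that idea under stated assumptions: the `field*word` prefix format, the regular expressions, and the `tokenize` function are hypothetical choices for this example and are not taken from Graham's code.

```python
import re

# Assumed token pattern: letters, digits, and a few punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")
# Assumed (deliberately simple) URL pattern.
URL_RE = re.compile(r"https?://\S+")

def tokenize(headers, body):
    """Return tokens tagged by where they appeared in the message.

    headers is a dict of header names to values; body is plain text.
    """
    tokens = []
    # Words in selected header fields get a field prefix.
    for name in ("Subject", "From", "To"):
        for word in TOKEN_RE.findall(headers.get(name, "")):
            tokens.append(f"{name.lower()}*{word.lower()}")
    # Words inside URLs get a "url" prefix.
    for url in URL_RE.findall(body):
        for word in TOKEN_RE.findall(url):
            tokens.append(f"url*{word.lower()}")
    # Remaining body words are left untagged.
    body_without_urls = URL_RE.sub(" ", body)
    tokens.extend(w.lower() for w in TOKEN_RE.findall(body_without_urls))
    return tokens

example = tokenize({"Subject": "Free offer"},
                   "Visit http://example.com/free now")
```

Here "free" in the subject becomes `subject*free` and "free" in the URL becomes `url*free`, so a filter can learn a separate spam probability for each.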

Others following Graham's lead are experimenting with variations that calculate the "spamminess" of messages differently. The open-source SpamBayes effort has produced an Outlook add-in (see the second URL below). Another free Outlook spam filter using a Bayesian technique is Spammunition, currently in beta. Spam Bully provides a commercial solution.

John Graham-Cumming, the author of POPFile, another open-source project (this one a mail proxy server using a Bayesian filter), reported to the MIT conference that, however well statistical filters might work, parsing email messages so that such filters can analyze them will continue to be a hard job. Technically savvy spammers constantly devise new ways to make their messages easy for a user to read but difficult for a program to analyze.
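
As a small illustration of why parsing is hard, consider a message that hides a word by splitting it with HTML comments or empty tags. A mail client shows the word intact, but a naive tokenizer sees fragments. The Python sketch below recovers the visible text for that one trick. It is an assumed, simplified example and not POPFile's parser. The `visible_text` function is hypothetical, and real messages use many more obfuscation methods than this handles.

```python
import html
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def visible_text(raw_html):
    """Approximate what a reader would see in an HTML message body."""
    text = COMMENT_RE.sub("", raw_html)  # drop comments hidden inside words
    # Drop tags without adding a space, so words split by empty tags rejoin.
    # This also joins words separated only by a tag such as <br>.
    text = TAG_RE.sub("", text)
    text = html.unescape(text)           # decode entities such as &amp;
    return re.sub(r"\s+", " ", text).strip()

cleaned = visible_text("Fr<!-- junk -->ee m<b></b>oney &amp; more")
```

For the sample input, `cleaned` is `"Free money & more"`, which a statistical filter can then tokenize normally.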

"A Plan for Spam" http://www.paulgraham.com/spam.html

SpamBayes http://spambayes.sourceforge.net

Spammunition http://www.upserve.com/spammunition/default.asp

Spam Bully http://spambully.com

POPFile http://popfile.sourceforge.net
