Naïve Bayesian Classification Using C# and ASP.NET

Machine learning is a fascinating area of computer science in which we study the techniques and implementations of algorithms that allow computers to learn. Its implementations include, but are not limited to, automatic text classification, search engines, stock market analysis, financial threshold alerts, speech and handwriting recognition, medical diagnostics, data mining, and autonomous land vehicles. For more information on ASP.NET development, see "HTML5 for the ASP.NET Developer " and " ASP.NET MVC Tutorial: Handling Errors and Exceptions ." In this article, we ll discuss naive Bayesian text classification techniques and its implementation.

Text classification can be used to determine some sort of category for a document or block of text. There are three commonly known classification techniques: naive Bayesian classification, vector machines, and semantic indexing. The naive Bayesian technique is the most popular and is used in contemporary spam filtration, document categorization, and help desk systems because of its accuracy, robustness, and ease of implementation. The Bayesian classifier is attributed to a British mathematician, Thomas Bayes, who defined probability as:

The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the chance of the thing expected upon its happening.

So what s the big deal about this? Why are naive Bayesian classifiers being used by industry leaders like Microsoft in its notification platform to filter unwanted messages, Google to provide probable results, and Autonomy to build document classifiers? Because it provides an efficient way to attain reason from uncertainty. Let s consider an example: Have you ever wondered what happens behind the scenes when you receive a spam message in your e-mail inbox and you click on the Mark as spam or Send to Junk mail button? The message gets passed to an indexer that tokenizes the contents, calculates the probability, and adds it to a spam repository memorizing that the particular message had junk mail attributes. Also, for good messages, it learns from the environment and classifies the text as being not junk. When the next message comes in, the system automatically calculates the prior probability on the past experiences and can recommend if the new message falls under the category of ham or spam. This learning is based on the category of the word and its frequency of occurrence.

The Bayes Rule

To understand and code for the naive Bayesian algorithm, we will do some math to understand the procedure. The primary equation for the Bayes rule is:

 P(A|B) = (P(B|A) * P(A)) / P(B)

This states mathematically that the posterior probability or probability of future occurrence can be calculated by the product of previous belief P(A) and the likelihood of B if A is true; i.e., P(B|A). P(A|B) is called posterior probability, P(A) is called prior probability, and P(B) is normalization constant. This equation enables us to calculate the probability that A would occur providing that B has happened.

Following are some examples to further clarify what was just said. Let s take a scenario where we have three teams of developers working on different parts of a problem and generating a percentage of KLOC (thousands of lines of code). Team A is working on a 50% portion of the modules and on average produces 3% buggy code. Team B is responsible for 30% of the class libraries written and has a 4% chance of having failure assertions and test case flaws in their code. Team C is working on the remaining 20% of the source but, because the technology used is in beta, the percentage of their errors is raised to 5%. Now we come to the classification problem: What is the probability that a randomly selected line of code is faulty?

This is the normalization constant we ve seen in the previous equation, which can be calculated as the sum of the product of probabilities and chances of erroneous code:

P (X) = P(A). P(D|A) + P (B). P(D|B) + P (C) . P(D|C)

P(X) = (0.5). (0.03) + (0.3) (0.04) + (0.2) (0.05) = 0.037 = 3.7%

This implies there is a 3.7% chance that a randomly selected line of code would not pass a unit test. However, the power of uncertainty classification demands us to calculate the probability that a randomly selected line came from team X. This is when the Bayes rule comes to the rescue. The Bayes rule tells us that the posterior probability can be found as follows:

P(A|X) = P(X|A) * P (A) / P(X)

P(A|X) = (0.03) (0.5) / 0.037 = 0.405405... 40.5%

Similarly, the probability that the erroneous code belongs to team B or C is 32.5% and 27%, respectively.

This gives you an idea of how the posterior probabilities are calculated or weighted (normalization constant). If the word free and poker occurs in an e-mail and it s marked as junk, there is a big chance that the next e-mail coming in my inbox with the similar contents is also a junk e-mail. It may also mean that I m being invited to a friend s house for a movie night with free pizza and games, but this leads us to a discussion of training domain, false positives, and reduction techniques, which are beyond the scope of this article. So let s get coding.

The Implementation

To mimic the two major operations of a text classifier, learning and classifying, we can create two methods for implementing the classification: AddWords and Classify.

The AddWords method will work as a simple indexer using a uses hashtable to tokenize the text and assign it to a category with a count (see Figure 1). The count will increase every time we add the same word to the same category. The way a token (word) is defined is very simple in this approach.

public int AddWords(string sCategory, string sWords)

{

 Hashtable htWordsInFile = StringtoHashtable(sWords);

 foreach (string sWord in htWordsInFile.Keys)

   {

    _htWords[sCategory + "-" + sWord] =

 SafeKeyNum(_htWords, sCategory + "-" + sWord) +

 Double.Parse(htWordsInFile[sWord].ToString());    }

    bDirtyFlag = true;

    //The index count

    return _htWords.Count;

}

Figure 1: The AddWords method.

The Classify method is the one that performs the matching for provided text and available data in the words.bayes file (see Figure 2). It creates a hashtable of the input for traversing, then calculates the score (prior probability) of every individual token and assigns it to another hashtable, then performs the sorting operation on the probabilities. If the probability of multiple items is the same, this implies that it s a neutral category (the classifier can t really determine which category it belongs to because of a lack of information).

public string Classify(Hashtable htWordsInFile)

{

   double iTotal = 0;

   string Response = string.Empty;

   Hashtable htCategoryCount = new Hashtable();

   foreach (string sEntry in _htWords.Keys)

   {

     string sCategory = (Regex.Split(sEntry, "-"))[0];

      htCategoryCount[sCategory] = (double)(SafeKeyNum(

     htCategoryCount, sCategory) + Double.Parse

      (_htWords[sEntry].ToString ()));

     iTotal += Double.Parse(_htWords[sEntry].ToString());

   }

// This is a ht of doubles.

   _htScore = new Hashtable();

    foreach (string sWord in htWordsInFile.Keys)

   {

     foreach (string sCategory in htCategoryCount.Keys)

     {

       if (_htWords.ContainsKey(sCategory + "-" + sWord))

       {

           _htScore[sCategory] = SafeKeyDouble(_htScore,

            sCategory + "-" + sWord) +

            (Math.Log(Convert.ToDouble(_htWords[sCategory +

           "-" + sWord]) / Convert.ToDouble(

           htCategoryCount[sCategory])));

       }

         else

         {

           _htScore[sCategory] = SafeKeyDouble(_htScore,

           sCategory + "-" + sWord) + Math.Log(0.01 /

           Convert.ToDouble(htCategoryCount[sCategory]));

           }

     }

   }

foreach (string sCategory in htCategoryCount.Keys)

   {

     _htScore[sCategory] = SafeKeyDouble(_htScore,

     sCategory) + Math.Log(Convert.ToDouble(

     htCategoryCount[sCategory]) / Convert.ToDouble(iTotal));

   }

//The sorter

   bool firstLine = true;

   string firstElement = string.Empty; string secondElement

   = string.Empty;

   SortedList sorter = new SortedList(_htScore, this);

   foreach (string sCategory in sorter.Keys)

   {

     if (firstLine)

     {

        Response += (sCategory + "\t" + _htScore[sCategory]

        + " (most likely candidate)");

        firstLine = false;

        firstElement = _htScore[sCategory].ToString();

     }

     else

     {

        Response += (sCategory + "\t" +

         _htScore[sCategory]);

        secondElement = _htScore[sCategory].ToString ();

     }

   }    if (firstElement.Equals(secondElement))

     return "Neutral Category:" + firstElement;

   else

       return Response;

 }

Figure 2: The Classify method.

Now let s use these disjointed pieces in a solution. To test the library functions defined in NaiveBayesClassifier, I've created a Web service called ClassifierService that exposes these methods. Figure 3 shows the class diagram for NaiveBayesClassifier and ClassifierService.

Figure 3: Class diagrams for the ClassifierService and NaiveBayesClassifer class.

The Web service methods are used to access our written functionality. Note the extra method named ClearIndex (see Figure 4). This method renames the existing words.bayes file, if there is one, and therefore clears the data already there. The next time the Learn method is called a new data file will be created.

[WebMethod]

 public string Classify(String words)

 {

   return new NaiveBayesClassifier().Classify(

   NaiveBayesClassifier.StringtoHashtable (words));

 }

  [WebMethod]

 public string Learn(String Category, String Keywords)

 {

   return new NaiveBayesClassifier().AddWords(

   Category, Keywords).ToString ();

 }

  [WebMethod]

 public bool ClearIndex()

 {

   return new NaiveBayesClassifier().FlushIndex();

 }

Figure 4: The Classifier Web service methods.

In a simple example to test the classifier s capabilities, let s create two basic categories, Cats and Dogs, with the attributes outlined in Figure 5.

Cats

Dogs

cat

kitten

meow

barf

furry

hairball

playful

dog

puppy

bark

woof

drool

smell

stink

wretch

hairy

loyal

loud

stupid

Figure 5: Two simple categories and their attributes.

To make a system learn what are the attributes associated with each category, we d have to invoke the Web service method Learn, with the given parameters as shown in Figure 6.

Figure 6A: Invoking the Learn Web method for category creation.

Figure 6B: Invoking the Learn Web method for category creation.

As you can see in Figure 6, the Web service response provides the number of keys added in the hashtable and, as mentioned earlier, if the file words.bayes doesn't exist in the Web services / APP_Data folder, the program will create it and add these keys to the corresponding categories.

Now it s time to see if this really worked. When you call the Classify method, providing the text meow , ClassiferService returns the following response:





Cats -3.04452243772342 (most likely candidate)

Dogs -7.64969262371151

However, when you call it passing an unknown keyword, for example Turtle , it interprets it as a neutral category because the keyword Turtle can t be found in the attributes of either Cats or Dogs. Therefore, the probability of its matching either of these two categories is the same:



 

  Neutral Category:-7.64969262371151

What if we look for the keyword loyal ? Yes, you got it. It will return the Dogs category. Now you might be wondering why one should use naive Bayes? This is so trivial that it can be simply achieved by making a list of known keywords assigned to a category, then matching it up; but what if both the categories have the same keyword? Naive Bayes just doesn't do a string match, it calculates the weight of a particular keyword with its frequency of occurrence in a category and then decides (i.e., assigns it a weight). You ll see this in the next example.

This example shows a personal blog where the subject posts book reviews. In this implementation of a naive Bayesian classifier, the goal is to categorize text; i.e., making the system learn what is junk and what is really useful. Spam comments are a big headache for bloggers when spammers, via automated bots using the blogging engine APIs or just by mere post request, post comments on an unwary blogger s Web site, hence making it a zombie spam link farm. CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) has proven to be a very effective measure against these bots. However, for actual human posters this approach doesn't work. People have moved to moderated comments, but wouldn't it be better if you are notified only about a spammish comment on your Web site instead of you approving comments every time one is posted? This can also provide a resolution to Web sites that provide customers a facility to write product reviews because now the moderator will get notified only if the score of a particular message is higher than spam threshold. Having said that, it s very important to keep systems updated with the newest spam vs. ham feeds so the classifier can understand the difference. Otherwise, you ll end up seeing either too many false positives or spam comments making their way through the classifier.

One of the new features of the Visual Studio .NET 2005 IDE is that you aren't required to have a Web server in order to run a Web service/Web site. The IDE comes with a Web server (Cassini), which is only accessible from localhost and provides development flexibility in case you don t have IIS installed on your machine or you have restrictions to run a Web server locally.

To provide a real-world problem simulation, I've used the design illustrated in Figure 7, wherein the service resides at a different tier along with data and implementation while the facade utilizes it. (For extensibility, security, and various other good reasons, the data layer should also be introduced and separated from the business layer.) In the diagram in Figure 7, boundaries are represented as Blog (front-end facade) and Processing Engine, which has the ClassifierService and NaiveBayesClassifier assemblies with words.bayes, the classifier file.

Figure 7: Classifier s service design.

I've provided a collection of spam messages and proper comments in two files named comments.xml and SpamComments.xml (available for download; see end of article for details). These files are used to train the classifier. An excerpt from these files is shown in the table in Figure 8.

Comments

Spam Comments

Pearl's "Probabilistic Reasoning in Intelligent Systems" is elegantly done seminal work on uncertainty, probabilistic reasoning and all things related inference. As the author says, "This book is a culmination of an investigation into the applicability of probabilistic methods to task requiring ...

2005-05-26T06:00:44.4062500-07:00

723a656a-830c-4ba6-98a1-f831dab5ece2 Adventures in GAC

D463106F-84DA-4C28-A0AB-6A1E95B2A72A

AdnanMasood

[email protected] www.axisebusiness.com/adnano

Please visit some helpful info in the field of online poker http://www.williamchibbard.com/online-poker.html texas hold em http://www.texas-hold-em-world.com/texas-hold-em.html online poker http://www.texas-hold-em-world.com/online-poker.html texas hold em http://www.e-online-poker-777.com/texas-hold-em.html online poker http://www.e-online-poker-777.com/online-poker.html...2005-05-26T06:00:44.4062500-07:002005-05-26T06:00:44.4062500-07:00 723a656a-830c-4ba6-98a1-f831dab5ece2>Adventures in GAC<D463106F-84DA-4C28-A0AB-6A1E95B2A72Aonlinepoker[email protected][JJW7]

For engine training purposes use SpamArchive s repository of spam e-mails: http://www.spamarchive.org.

Figure 8: Excerpts from comments.xml and SpamComments.xml.

There are two ways to create the spam index file: by using the command line executable NaiveBayesFileClassifier.exe or via the Web service s Learn method; you can use either. Here is an important point for the reader to note; if I only train the classifier using SpamComments.xml and post a spam comment like texas and poker , I ll get the correct response from the Web service stating the comment is a spam message. However, for a legitimate comment like the one shown in Figure 9, the Web service will still regard it as a spam response the reason being the amount of training data; Bayesian classifier can t automatically know the difference between categories unless it s trained on different datasets. This is well summed up in a sentence by Martin Overton of IBM Global Services. As with most things in life, he wrote, Bayesian filtering only works well when it is properly trained.


Positive Response	False Positive due to lack of training data

Figure 9: Bayesian filtering is only effective when it is properly trained.

To avoid this insufficient training data problem, we'd further tutor the classifier with comments.xml, which will provide it a prospective of what is spam and what is a comment. With new definitions recorded in words.bayes, when I try to post the correct comment again, the results are different. This time, upon entering spammish keywords, the program classifies them correctly as seen in the table shown in Figure 10.

spam -4.01443287321891 (most likely candidate)

comment -15.1548443302207

comment -10.5496741442326 (most likely candidate)

spam -15.1548443302207

Figure 10: Tutor the classifier to get different results.

The classifier used in these examples can be modified and enhanced for optimization. For instance, the classifier currently is case in-sensitive; however, case sensitivity can provide better approximation. Also, classifier loads the file every time it s called. For optimization and better performance, IO caching could be used to preserve the hashtable in memory and invalidate it upon new entries. Also, multiple file support to store different categories and adding strong typed parameters to make use of the spam threshold on the front-end are some useful features to add in this basic implementation.

Conclusion

Bayesian classification is an excellent technique for text classification and categorization; it s simple, effective, and easy to implement. It s also not language-dependent; i.e., it can support other languages, such as Spanish, Chinese, Japanese, French, Portuguese, Russian, Arabic, etc.

Also, what we covered is only the tip of the iceberg; Bayesian classification can be used for a variety of purposes. DIGIMIMIR is a classic example of its diverse uses. It s subtitled as a Tool for Rapid Situation Analysis of Helpdesk and Support E-mail .

However, this classification is not a silver bullet, either; bad guys are already out there trying to fool Bayesian spam filters by using Wordlists (or Dictionary Salads), passages from books, and by training classifiers with bad data. However, with a large amount of correct data (ham), a Bayes algorithm is capable of nullifying their probabilities and, hence, thwarting their plans.

Academia and implementation have never been as close as they are now. In this rapidly evolving world of IT, the success of a software enterprise is inversely proportional to the time it takes for its products to come out of the lab and get to the masses. The shorter the time, the greater the chance of success providing that the product delivers an innovative solution the industry wants. The implementation of complex algorithms is now an indigenous part of our everyday lives. From MapQuest (shortest path finding) and Google search to advanced traffic routing algorithms in the cellular industry, complex algorithms are out of the books and are living and breathing among us.

Special thanks to Neal Hardesty for his support and the classifier s port from Perl.

The sample code accompanying this article is available for download.

Adnan Masood works as a Sr. Software Engineer in a Monrovia-based financial institution where he develops middle-tier architectures, decoupled designs, and Web services using Microsoft s .NET Framework. He holds various professional memberships (ACM, BCS, and ACS) and several technical certifications (MCSD.NET, MCAD.NET, and[JJW1] SCJP-II). Adnan is attributed and published in print media and on the Web and is currently pursuing his MS (Computer Science) from Nova Southeastern University, FL. He is actively involved in the .NET community as co-founder of The San Gabriel Valley .NET Developers Group. His blog can be found at http://www.axisebusiness.com/adnano and he can be reached via e-mail at mailto:[email protected].