Two models


There are two dominant Bayesian models for text classification. Both are called naive Bayes models because they assume conditional independence among features.

With the Multivariate Bernoulli Model, each essay is viewed as a special case of all the calibrated features. As in the illustrated example, the presence or absence of every calibrated feature is examined. A typical Bayesian network application, this approach has been used in text classification by Lewis (1992), Kalt and Croft (1996), and others.

Under the Bernoulli model, the conditional probability of the presence of each term is estimated by the proportion of documents within each category that contain the term. The frequencies are seeded with 1 to prevent zero probabilities, which are a) biased and b) would dominate the calculations. This is a Laplacian correction. The conditional probability of the absence of a term is 1 minus the probability of its presence. Because every term in the vocabulary must be examined for every essay, this model can take a long time to compute.
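The following is a minimal sketch of the Bernoulli estimate with the add-one (Laplacian) correction described above. The function name, the input layout (documents represented as sets of terms), and the choice to add 2 to the denominator so that probabilities stay strictly between 0 and 1 are illustrative assumptions, not details taken from this page.

    def bernoulli_estimates(docs_by_category, vocabulary):
        """docs_by_category: {category: [set of terms present in each document, ...]}."""
        estimates = {}
        for category, docs in docs_by_category.items():
            n_docs = len(docs)
            probs = {}
            for term in vocabulary:
                containing = sum(1 for doc in docs if term in doc)
                # Seed the count with 1 (and the denominator with 2) so no
                # probability is exactly 0 or 1 -- the Laplacian correction.
                probs[term] = (containing + 1) / (n_docs + 2)
            estimates[category] = probs
        return estimates

    # P(term absent | category) is simply 1 - P(term present | category).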

With the Multinomial Model, each essay is viewed as a sample of the calibrated features. The probability of each score for a given essay is computed as the product of the probabilities of the features contained in the essay. Often used in speech recognition, where it is called a "unigram language model," this approach has been used in text classification by Mitchell (1997), McCallum, Rosenfeld & Mitchell (1998), and others.

Under the Multinomial model, the conditional probability of each term is estimated by the frequency of the term within each category divided by the frequency of all terms within that category. Again, the Laplacian correction is used and the frequencies are seeded with 1.
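Here is a small sketch of the Multinomial estimate and the product-of-probabilities scoring described above, done in log space to avoid numerical underflow on long essays. The function names, the flat token-list input, the prior argument, and the addition of the vocabulary size to the denominator are illustrative assumptions rather than a description of any particular scoring system.

    import math
    from collections import Counter

    def multinomial_estimates(tokens_by_category, vocabulary):
        """tokens_by_category: {category: flat list of calibrated tokens from all essays in that category}."""
        estimates = {}
        v = len(vocabulary)
        for category, tokens in tokens_by_category.items():
            counts = Counter(tokens)
            total = len(tokens)
            # Laplacian correction: seed each term frequency with 1.
            estimates[category] = {t: (counts[t] + 1) / (total + v) for t in vocabulary}
        return estimates

    def classify(essay_tokens, estimates, priors):
        # The posterior for each category is proportional to the prior times the
        # product of the probabilities of the features in the essay.
        scores = {}
        for category, probs in estimates.items():
            log_p = math.log(priors[category])
            for token in essay_tokens:
                if token in probs:  # skip terms that were not calibrated
                    log_p += math.log(probs[token])
            scores[category] = log_p
        return max(scores, key=scores.get)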

McCallum and Nigam (1998) have shown that for several datasets, the Multinomial model is as accurate as or more accurate than the Bernoulli model. Since essays are often scored based on the presence or absence of features, research is needed before any conclusions can be drawn with regard to essay scoring. 

Conditional Independence - the Naive Bayes Assumption

The naive Bayes assumption is that word order is irrelevant and, consequently, that the presence of one word does not affect the presence or absence of another word. This assumption is obviously severely violated in the English language. The effect is that the posterior classification probabilities are extreme, often very close to zero or one. Domingos and Pazzani (1997) have shown, however, that classification accuracy is not seriously affected by violations of this assumption.
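A quick illustration of how extreme these posteriors become: the per-word likelihoods below (0.6 versus 0.4) and the 200-word essay length are made-up numbers chosen only to show the effect of multiplying many "independent" terms.

    import math

    # Two categories whose per-word likelihoods differ only modestly.
    log_like_a = 200 * math.log(0.6)
    log_like_b = 200 * math.log(0.4)

    # Normalized posterior for category A.
    posterior_a = 1.0 / (1.0 + math.exp(log_like_b - log_like_a))
    print(posterior_a)  # ~1.0 to dozens of decimal places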