Computer Grading using Bayesian Networks - Overview

Translated to Czech by Barbora Lebedová

Lawrence M. Rudner

There is a sizeable and promising literature in the information science field on using Bayes' Theorem as the underlying model for text classification (cf. McCallum and Nigam, 2000; Mitchell, 1997; Kalt and Croft, 1996). Bayesian methods have been applied, for example, to sorting the resumes of new job applicants into multiple job categories based on their similarity to already classified resumes. One interesting application is sorting newsfeed articles into those likely to interest an individual based on previous articles the individual rated as relevant. Another is filtering spam and other non-relevant e-mail based on Bayesian similarity to previously classified e-mail. We view essay scores as just a different type of bin. In fact, essays offer computational advantages because they are scored on ordinal scales.

Our approach to computer grading of essays can be viewed as an extension of Bayesian computer adaptive testing, which has been described by Frick (1992); Madigan, Hunt, Levidow, and Donnell (1995); and Rudner (2001, in press). With Bayesian CAT, the goal is to determine the most likely classification for the examinee, typically master/non-master, based on optimally selected items. With Bayesian essay scoring, we extend the desired classification to a three- or four-point categorical or nominal scale (e.g. Extensive, Essential, Partial, Unsatisfactory) and use a large set of "items". The "items" in Bayesian essay scoring are a broad set of essay features, which can include surface features such as those employed by PEG (number of words, number of commas, average sentence length, number of verbs, frequency of articles such as "the"), content features such as those employed by LSA (specific words, phrases, frequency of certain content words), and other essay characteristics such as those captured by ArgContent, e.g. the order in which certain concepts are presented and the occurrence of specific noun-verb pairs.

To explain Bayesian essay scoring, we will provide a simple example where the goal is to classify an examinee's response as Appropriate, Partially Appropriate, or Inappropriate. As givens, we will have a collection of essay features for which we have determined the following three probabilities: 1) the probability that the feature is included in the essay given that the examinee has provided an Appropriate response, 2) the probability that the feature is included in the essay given that the examinee has provided a Partially Appropriate response, and 3) the probability that the feature is included in the essay given that the examinee has provided an Inappropriate response.

We will denote these as Pi(ui=1|A), Pi(ui=1|R), and Pi(ui=1|I), respectively; the subscript i denotes that we have different probabilities for each feature i, ui=1 denotes that the essay includes feature i, and A, R, and I denote an essay scored as Appropriate, Partial, or Inappropriate. Here, these conditional probabilities will be determined from a large (> 500) collection of essays scored by trained expert human raters.
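In practice, each conditional probability is simply the proportion of essays in a score group that contain the feature. A minimal Python sketch of this calibration step (the corpus and feature identifiers below are hypothetical):

```python
def estimate_conditionals(essays):
    """Estimate P(feature present | score) as the fraction of essays
    in each score group that contain the feature.
    essays: list of (score, set_of_features) pairs."""
    counts = {}    # score -> number of essays with that score
    present = {}   # (score, feature) -> number of those essays containing feature
    features = set()
    for score, feats in essays:
        counts[score] = counts.get(score, 0) + 1
        features |= feats
        for f in feats:
            present[(score, f)] = present.get((score, f), 0) + 1
    return {(s, f): present.get((s, f), 0) / counts[s]
            for s in counts for f in features}

# Hypothetical scored corpus: (human score, features found in the essay)
corpus = [("A", {1, 2}), ("A", {1}), ("R", {1, 3}), ("R", {3, 4}),
          ("I", {4}), ("I", {3, 4})]
probs = estimate_conditionals(corpus)
# probs[("A", 1)] is the estimated Pi(ui=1|A) for feature 1
```

With a real calibration set of more than 500 scored essays, these proportions would populate a table like Table 1 below.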

As an example, we will use the following database of four items:

Table 1: Conditional probabilities of an essay containing the hypothetical features

Feature (i)   Appropriate Pi(ui=1|A)   Partial Pi(ui=1|R)   Inappropriate Pi(ui=1|I)
1             .7                       .6                   .1
2             .8                       .5                   .4
3             .4                       .8                   .6
4             .2                       .8                   .9

Note that examinees with higher essay scores are more likely to have included features 1 and 2. The inclusion of feature 3 is most typical of examinees with Partially Appropriate essays. Feature 4 represents misinformation that is not typically included in the essays of those with Appropriate answers. Further note that the features are dichotomously scored: either an essay contains a given feature or it does not.

The goal is to classify the examinee essay as most likely being Appropriate, Partial, or Inappropriate based on essay features. Lacking any other information about the examinee's ability, we will assume equal prior probabilities, i.e. P(A) = .333, P(R) = .333, and P(I) = .333. After examining each feature, we will update P(A), P(R), and P(I) based on whether the feature was included in the student's essay. The updated values for P(A), P(R), and P(I) are referred to as the posterior probabilities. The process for computing these updated probabilities is referred to as Bayesian updating, belief updating (probabilities being a statement of belief), or evaluating the Bayesian network. The updating algorithm comes directly from a theorem published posthumously by Rev. Thomas Bayes (1764) and known as Bayes' Theorem: P(A|B) * P(B) = P(B|A) * P(A).

Let us suppose our examinee essay contains feature 1. By Bayes' Theorem, the new probability that the essay is Appropriate is P(A|ui=1) = P(ui=1|A) * P(A) / P(ui=1).

Since we will normalize at the end, the denominator P(ui=1) can be treated as a constant and ignored for the moment. Thus P(A|ui=1) ∝ P(ui=1|A) * P(A) = .7 * .333 = .233. Similarly, P(R|ui=1) ∝ P(ui=1|R) * P(R) = .6 * .333 = .200, and P(I|ui=1) ∝ P(ui=1|I) * P(I) = .1 * .333 = .033. We can then divide by the sum of these joint probabilities to obtain posterior probabilities, i.e. P'(A) = .233 / (.233 + .200 + .033) = .500, P'(R) = .200 / (.233 + .200 + .033) = .429, and P'(I) = .033 / (.233 + .200 + .033) = .071.
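The arithmetic above can be checked in a few lines of Python (a sketch of this single updating step, not production code):

```python
# One Bayesian update for feature 1 (present), using the Table 1 values
priors = {"A": 1/3, "R": 1/3, "I": 1/3}
likelihood = {"A": 0.7, "R": 0.6, "I": 0.1}   # Pi(ui=1 | score) for feature 1

# Joint probabilities, then normalize so the posteriors sum to 1
joint = {s: likelihood[s] * priors[s] for s in priors}
total = sum(joint.values())
posterior = {s: joint[s] / total for s in joint}
# posterior is approximately {"A": .500, "R": .429, "I": .071}
```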

At this point, it appears unlikely that the essay is Inappropriate. We next use these posterior probabilities as the new prior probabilities, examine the next feature, and again update our estimates of P(A), P(R), and P(I) by computing new posterior probabilities. We iterate the process until all features have been examined or some pre-specified stopping criterion, such as Wald's (1947) Sequential Probability Ratio Test or a minimum posterior probability, is reached.

To continue the example, assume that the examinee essay does not contain feature 2 but contains features 3 and 4. Table 2 shows the resultant probabilities for all four items. The values of P(ui|S) come from Table 1. Since the features are dichotomously scored, the probability that a feature is not present, as with feature 2, is P(ui=0|S) = 1 - P(ui=1|S).

Table 2: Scoring an essay

Feature (i)   Present (ui)   Score (S)        Prior Prob   P(ui|S)   Joint Prob   Posterior Prob
1             1              Appropriate      .333         .7        .233         .500
                             Partial          .333         .6        .200         .429
                             Inappropriate    .333         .1        .033         .071
2             0              Appropriate      .500         .2        .100         .280
                             Partial          .429         .5        .215         .600
                             Inappropriate    .071         .6        .043         .120
3             1              Appropriate      .280         .4        .112         .169
                             Partial          .600         .8        .480         .723
                             Inappropriate    .120         .6        .072         .108
4             1              Appropriate      .169         .2        .034         .048
                             Partial          .723         .8        .578         .815
                             Inappropriate    .108         .9        .097         .137

After administering these four items, the probability that our essay is Partially Appropriate given the presence or absence of these four features is .815. Using the maximum a posteriori (MAP) approach, this is the most likely score category and the essay would be scored as Partial.
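The full sequence in Table 2 can be reproduced by iterating the update over all four features. The sketch below carries full precision rather than the table's three-decimal rounding, so its final values agree with Table 2 to about .001:

```python
# Conditional probabilities P(feature present | score) from Table 1
p_present = {
    1: {"A": 0.7, "R": 0.6, "I": 0.1},
    2: {"A": 0.8, "R": 0.5, "I": 0.4},
    3: {"A": 0.4, "R": 0.8, "I": 0.6},
    4: {"A": 0.2, "R": 0.8, "I": 0.9},
}

def score_essay(observed, priors=None):
    """Sequentially update P(score) over the observed features.
    observed: dict mapping feature -> True (present) or False (absent)."""
    post = dict(priors or {"A": 1/3, "R": 1/3, "I": 1/3})
    for feat, present in observed.items():
        # P(ui=0|S) = 1 - P(ui=1|S) for an absent feature
        joint = {s: (p_present[feat][s] if present else 1 - p_present[feat][s]) * post[s]
                 for s in post}
        total = sum(joint.values())
        post = {s: joint[s] / total for s in joint}   # normalize
    return post

# The example essay: feature 2 absent, features 1, 3, and 4 present
posterior = score_essay({1: True, 2: False, 3: True, 4: True})
map_score = max(posterior, key=posterior.get)   # MAP classification -> "R" (Partial)
```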

Our example is contrived. In practice, one would expect lower prior probabilities for each feature, and the software would examine the presence of a large number of features, typically thousands if not tens of thousands. A more efficient alternative to examining the presence of all calibrated features would perhaps be to select, at each iteration, the feature that maximizes the expected change in the posterior probability.
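That selection rule might be sketched as follows; the criterion used here (expected total absolute change in the posterior, with the "present" and "absent" outcomes weighted by their predictive probabilities) is an illustrative assumption rather than a formula from the text:

```python
def predictive(post, cond):
    """P(feature present) under the current posterior over scores."""
    return sum(cond[s] * post[s] for s in post)

def update(post, cond, present):
    """One Bayesian update, given whether the feature is present."""
    joint = {s: (cond[s] if present else 1 - cond[s]) * post[s] for s in post}
    total = sum(joint.values())
    return {s: joint[s] / total for s in joint}

def expected_change(post, cond):
    """Expected total absolute change in the posterior if this feature is examined."""
    p1 = predictive(post, cond)
    change = lambda new: sum(abs(new[s] - post[s]) for s in post)
    return (p1 * change(update(post, cond, True))
            + (1 - p1) * change(update(post, cond, False)))

# Pick the most informative Table 1 feature under uniform priors
conds = {1: {"A": 0.7, "R": 0.6, "I": 0.1},
         2: {"A": 0.8, "R": 0.5, "I": 0.4},
         3: {"A": 0.4, "R": 0.8, "I": 0.6},
         4: {"A": 0.2, "R": 0.8, "I": 0.9}}
post = {"A": 1/3, "R": 1/3, "I": 1/3}
best = max(conds, key=lambda f: expected_change(post, conds[f]))
```

Under uniform priors this criterion favors feature 4, whose conditional probabilities differ most sharply across the score categories.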

In theory, this approach to computer grading combines the best features of PEG, LSA, and e-rater with several crucial advantages of its own. It can be employed on short essays, is simple to implement, can be applied to a wide range of content areas, can be used to yield diagnostic results, can be adapted to yield classifications on multiple skills, and is easy to explain to non-statisticians. The system is similar to LSA and e-rater in that more specific terms are given higher weight, and it is similar to e-rater in its identification of the most likely score category.