Three prominent writing assessment programs


The most prominent writing assessment programs are Project Essay Grade (PEG), introduced by Ellis Page in 1966; latent semantic analysis (LSA), first applied to essay grading in 1997 by Thomas Landauer and Peter Foltz; and e-rater, developed by Jill Burstein and used by the Educational Testing Service (ETS). Descriptions of these approaches can be found in Whittington and Hunt (1999) and Wresch (1993). Other software projects are briefly mentioned in Breland and Lytle (1990), Vetterli and Furedy (1997), and Whissel (1994).

PEG grades essays predominantly on the basis of writing quality (Page, 1966, 1994). The underlying theory is that a person’s writing style has intrinsic qualities, called trins, that need to be measured, analogous to true scores in measurement theory. Because trins cannot be observed directly, PEG measures approximations of them, called proxes. Specific attributes of writing style, such as average word length, number of semicolons, and word rarity, are examples of proxes that PEG can measure directly to generate a grade. For a given sample of essays, human raters grade a large subset (100 to 400 essays), and values for up to 30 proxes are computed for each of those essays. The human grades are then entered as the criterion variable in a regression equation with all of the proxes as predictors, and beta weights are computed for each predictor. For each remaining unscored essay, the prox values are computed and weighted by the betas from the initial analysis to calculate a score.
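
This two-stage procedure can be sketched in a few lines of code. The example below only illustrates the logic and is not Page’s software: the prox values and grades are synthetic, and an off-the-shelf regression routine stands in for PEG’s analysis.

    # Sketch of PEG-style scoring: regress human grades on prox values for a
    # calibration sample, then apply the fitted beta weights to unscored essays.
    # All data here are synthetic, purely for illustration.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Calibration sample: e.g., 300 human-graded essays with 30 proxes each
    # (average word length, number of semicolons, word rarity, ...).
    n_calibration, n_proxes = 300, 30
    proxes = rng.normal(size=(n_calibration, n_proxes))
    human_grades = proxes @ rng.normal(size=n_proxes) + rng.normal(scale=0.5, size=n_calibration)

    # Step 1: regress the human grades on the proxes to obtain beta weights.
    model = LinearRegression().fit(proxes, human_grades)

    # Step 2: score an unrated essay from its prox values alone.
    new_essay_proxes = rng.normal(size=(1, n_proxes))
    predicted_grade = model.predict(new_essay_proxes)[0]
    print(f"Predicted grade: {predicted_grade:.2f}")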

Page has over 30 years of research consistently showing exceptionally high correlations between PEG scores and human ratings. In one especially relevant study, Page (1994) analyzed samples of 495 and 599 senior essays from the 1988 and 1990 National Assessment of Educational Progress, using responses to a question about a recreation opportunity: whether a city government should spend its recreation money fixing up some abandoned railroad tracks or converting an old warehouse to new uses. With 20 variables, PEG reached multiple Rs as high as .87, close to the apparent reliability of the targeted judge groups.

First patented in 1989, LSA was designed for indexing documents for information retrieval. The underlying idea is to identify which of several calibration documents are most similar to a new document on the basis of the most specific (i.e., least frequent) index terms. For essays, the average grade of the most similar calibration documents is assigned as the computer-generated score (Landauer, Foltz, & Laham, 1998).

With LSA, each calibration document is arranged as a column in a matrix. A list of every relevant content term, defined at the word, sentence, or paragraph level, that appears in any of the calibration documents is compiled, and these terms become the matrix rows. The value in a given cell of the matrix reflects both the presence of the term in the source and the weight assigned to that term. Terms not present in a source are assigned a cell value of 0 for that column. If a term is present, it may be weighted in a variety of ways: a 1 to indicate that it is present, a tally of the number of times the term appears in the source, or some other weighting criterion representing the term’s importance to the document in which it appears or to the content domain overall.
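
As a rough illustration of that construction, the sketch below builds a small term-by-document matrix from a few toy calibration essays, using raw counts as the cell weights. The essays and the weighting choice are assumptions made for the example, not part of any published LSA calibration set.

    # Sketch of the term-by-document matrix LSA starts from: calibration essays
    # become columns, content terms become rows, and each cell holds a weight
    # (raw counts here; presence/absence or other weightings are alternatives).
    from collections import Counter
    import numpy as np

    calibration_essays = [
        "the city should restore the railroad tracks",
        "the warehouse could become a community center",
        "restoring the tracks would cost the city less",
    ]

    # Row index: every term appearing in any calibration essay.
    terms = sorted({word for essay in calibration_essays for word in essay.split()})
    term_index = {term: row for row, term in enumerate(terms)}

    # Fill the matrix: rows are terms, columns are calibration essays.
    matrix = np.zeros((len(terms), len(calibration_essays)))
    for col, essay in enumerate(calibration_essays):
        for term, count in Counter(essay.split()).items():
            matrix[term_index[term], col] = count  # absent terms keep the value 0

    print(matrix.shape)  # (number of terms, number of calibration essays)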

Each essay to be graded is converted into a column vector, treated as a new source with cell values based on the terms (rows) from the original calibration matrix. A similarity score is then calculated between the essay’s column vector and each column of the calibration matrix. The essay’s grade is the average of the grades of a predetermined number of calibration sources with which it is most similar. The system also provides a great deal of diagnostic and evaluative feedback. As with PEG, Foltz, Kintsch, and Landauer (1998) report remarkably high correlations between LSA scores and human-scored essays.
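
Continuing the toy example above, the sketch below shows this scoring step, with cosine similarity standing in as the (assumed) similarity measure. A full LSA implementation would first reduce the calibration matrix with singular value decomposition; that step is omitted here for brevity.

    # Sketch of LSA-style scoring: turn the new essay into a column vector over
    # the same term rows, compute its similarity to each calibration column, and
    # average the human grades of the k most similar calibration essays.
    import numpy as np

    def essay_vector(essay, term_index):
        """Column vector of term counts, restricted to terms in the calibration matrix."""
        vec = np.zeros(len(term_index))
        for word in essay.split():
            if word in term_index:
                vec[term_index[word]] += 1
        return vec

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def lsa_style_grade(essay, matrix, term_index, calibration_grades, k=2):
        vec = essay_vector(essay, term_index)
        sims = [cosine(vec, matrix[:, col]) for col in range(matrix.shape[1])]
        nearest = np.argsort(sims)[-k:]  # indices of the k most similar calibration essays
        return float(np.mean(np.array(calibration_grades)[nearest]))

    # Usage with the matrix and term_index from the previous sketch:
    # lsa_style_grade("the city should fix the tracks", matrix, term_index, [3, 2, 4])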

The Educational Testing Service’s Electronic Essay Rater (e-rater) is a sophisticated "Hybrid Feature Technology" that uses syntactic variety, discourse structure (like PEG), and content analysis (like LSA). To measure syntactic variety, e-rater counts complement, subordinate, infinitive, and relative clauses, as well as occurrences of modal verbs (would, could), and calculates ratios of these syntactic features per sentence and per essay. For structure analysis, e-rater uses 60 different features, similar to PEG’s proxes. Two indices are created to evaluate the similarity of the target essay’s content to the content of calibrated essays. As described by Burstein et al. (1998), in the EssayContent analysis module the vocabulary of each score category is converted to a single vector whose elements represent the total frequency of each word in the training essays for that holistic score category. The system computes correlations between the vector for a given test essay and the vectors representing the trained categories, and the score category most similar to the test essay is assigned as the evaluation of its content.
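
A minimal sketch of an EssayContent-style comparison is shown below. It is not ETS’s code: the toy training essays, the use of Pearson correlation as the similarity measure, and the function names are all assumptions made for illustration.

    # Sketch of content scoring by score category: pool the training essays for
    # each holistic score into one word-frequency vector, correlate the test
    # essay's frequencies with each category vector, and assign the score of
    # the best-matching category.
    from collections import Counter
    import numpy as np

    # Toy training essays grouped by holistic score category.
    training = {
        2: ["trains are nice the warehouse is old", "the city has money"],
        4: ["the city should fix the tracks because residents want recreation",
            "converting the warehouse would give the city new recreation space"],
    }

    # Shared vocabulary across all training essays.
    vocab = sorted({w for essays in training.values() for e in essays for w in e.split()})

    def freq_vector(texts):
        counts = Counter(w for text in texts for w in text.split())
        return np.array([counts[w] for w in vocab], dtype=float)

    category_vectors = {score: freq_vector(essays) for score, essays in training.items()}

    def content_score(test_essay):
        test_vec = freq_vector([test_essay])
        # Pearson correlation with each category's vector; return the best-matching score.
        return max(category_vectors, key=lambda s: np.corrcoef(test_vec, category_vectors[s])[0, 1])

    print(content_score("the city should spend its recreation money on the tracks"))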

E-rater’s ArgContent analysis module is based on the inverse document frequency, like LSA. The word-frequency vectors for the score categories are converted to vectors of word weights. The first part of the weight formula represents the prominence of word i in the score category, and the second part is the log of the word’s inverse document frequency. For each argument in the test essay, a vector of word weights is also constructed and correlated with the score-category vectors. Like PEG, the analyzed features are then regressed to build a model that predicts human graders’ scores.
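
The weighting can be sketched as follows. The exact form, dividing a word’s frequency in the category by the frequency of the category’s most common word and multiplying by the log of its inverse document frequency, is an assumption pieced together from the verbal description above, not ETS’s published formula.

    # Sketch of ArgContent-style word weights for one score category: a word's
    # prominence in the category times the log of its inverse document frequency
    # across all training essays.
    import math
    from collections import Counter

    def word_weights(category_essays, all_essays):
        """Map each word in the category's essays to a weight."""
        category_counts = Counter(w for e in category_essays for w in e.split())
        max_count = max(category_counts.values())
        n_essays = len(all_essays)
        weights = {}
        for word, count in category_counts.items():
            docs_with_word = sum(1 for e in all_essays if word in e.split())
            prominence = count / max_count             # prominence of word i in the category
            idf = math.log(n_essays / docs_with_word)  # inverse document frequency
            weights[word] = prominence * idf
        return weights

    # Usage: word_weights(training_essays_for_score_4, all_training_essays)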

Several studies have reported favorably on PEG, LSA, and e-rater. The programs have returned grades that correlated significantly and meaningfully with human raters. A review of the research on LSA found that its scores typically correlate as well with human raters as the raters do with each other, occasionally correlating less well but occasionally correlating better (Chung & O’Neil, 1997). Research on PEG consistently reports superior correlations between PEG and human graders relative to correlations between human graders (e.g., Page, Poggio, & Keith, 1997). E-rater was deemed so impressive that it is now operational and used to score the Graduate Management Admission Test (GMAT).