An On-line, Interactive, Computer Adaptive Testing Tutorial, 11/98 (http://EdRes.org/scripts/cat)

Lawrence M. Rudner

Welcome to our online computer adaptive testing (CAT) tutorial. Here you will have the opportunity to learn the logic of CAT and see the calculations that go on behind the scenes. You can play with an actual CAT. We provide the items and the correct answers. You can try different scenarios and see what happens. You can pretend you are a high ability, average or low ability examinee. You can intentionally miss easy items. You can get items right that should be very hard for you.

This tutorial assumes some background in statistics or measurement. However, I hope even the novice will be able to follow along.

How the demo works

Suggested activities
This item set

Let's get started (This is where the CAT starts)

Background

What and why CAT
Some terms
A little about Item Response Theory

Logic of CAT
Reliability and Standard Error
Potential & Limitations of CAT
Key Technical and Procedural Issues

Interactive IRT mini-tutorial (Generate graphs of item response functions.)
Acknowledgements

Suggested activities

This system allows you to play with different scenarios. You can vary

• your true score. This is the real ability that the test is trying to estimate. The tutorial uses this value to help explain what is happening behind the scenes. This value is not used in item selection or calculating the standard errors.
• whether you respond as expected. The software will provide a probability of a correct response given your true score. You can get easy items wrong and hard items right if you want. You can also pick when during the test session you respond unexpectedly. You can pick the frequency of unexpected responses.
• the number of items that you take. This simulates different stopping rules.
• the scale used to define ability.

During the testing session, the system will show you

• the Information Functions for the 5 items the computer thinks will provide the most information about your ability level. Because the items are picked based on the computer's estimate of your ability, the selected items may not be the best given your true ability.
• the Item Response Function for the selected item.
• the Standard Error of your current ability estimate.

When you push the "Done" button, the system will show you the history of your testing session, including the item difficulties, the probabilities of a correct response, whether you responded correctly, ability estimates and standard errors.

The first time you use the system to take a CAT test, I suggest you pick a true score that is slightly above average, respond to the items as suggested by the software, and take about 10 items. This item bank is optimal for examinees with slightly above average ability. If you respond completely as expected by the IRT model, the ability estimate will quickly converge near the given true score. On this first run, you should notice that

• the software starts with an ability estimate of the 50th percentile. In the absence of any other information, this is the most reasonable estimate.
• correct responses are followed by more difficult items.
• incorrect responses are followed by easier items.
• after only a few items, the ability estimate is near your true ability. The estimate improves as the number of items increases, assuming of course that you are consistent in your responses.
• after only a few items, the item difficulties are close to your true ability and probabilities of a correct response are close to .50.
• the standard error improves as the number of items increases.

Once you are familiar with the software, try taking the test under different scenarios:

• try different true scores. Notice how many items you need to get a reasonable estimate of ability.
• try a few unexpected responses. Get a very difficult item correct (i.e. respond correctly when the probability of a correct response is < .50), as would be the case with a lucky guess; or get an easy item wrong, as would be the case of a careless mistake. Notice that the ability estimate starts to deviate from your true score, but then quickly starts to converge again. You can also experiment with unexpected responses in the beginning versus the middle versus after 15 items.
• try many unexpected responses. Notice that the ability estimate starts to deviate from your true score, but then is eventually able to converge (assuming you again respond as expected.)
• pick a different threshold for responding correctly. For example, rather than responding correctly when the probability of a correct response is > .50, respond correctly when the probability is > .40 or > .60. Notice that the software will then under- or over-estimate your true score.

Are you ready to try an actual computer adaptive test? You will first be asked to pick a true score and the true score scale (z, SAT, or percentiles; you can change the scale at any time.) The CAT will then start with the first item. To help you see what is going on behind the scenes, graphs of Information Functions, the Item Response Function for the selected item, and standard error will be presented. If any of the graphs are not clear, push the Explain button and detailed, tailored information will appear. The information presented by the Explain button varies as the CAT progresses, so you may want to push the button several times. After you have responded to about 10 items, you may want to push the Done button. Your item history will be presented. If you respond to more than 5 items, summary graphs will also be presented. Have fun. If you have suggested activities or suggested improvements, please let me (Lawrence M. Rudner, LRudner@edres.org) know.

This item set

The items presented in this tutorial are released, public-use items from the National Assessment of Educational Progress, 8th grade Mathematics test. Our item bank consists of 64 multiple choice items. The images were redrawn to improve quality. (Publicly released items are available at the NAEP site in Adobe format.)

The item bank is too small for a real CAT. Two or three times as many items would allow the software to choose from more high quality items at each ability estimate level. To compensate for the small size of the item bank, we improved the statistical quality of the items by adding .4 to the item discrimination parameter. With more discriminating items, desired levels of precision in ability estimates can be achieved more quickly, that is, with the administration of fewer test items. This allows the CAT software to converge faster.

The test information for this item bank is shown in the following graph.

Notice that this item bank is best for examinees whose ability is between 0 and 2 z-score units (the 50th to 98th percentile). The demo will rapidly converge for examinees whose true ability is within that range. This CAT is not well suited for examinees in the low end of the ability spectrum.

When an examinee is administered a test via the computer, the computer can update the estimate of the examinee's ability after each item and then that ability estimate can be used in the selection of subsequent items. With the right item bank and a high examinee ability variance, CAT can be much more efficient than a traditional paper-and-pencil test.

Paper-and-pencil tests are typically "fixed-item" tests in which the examinees answer the same questions within a given test booklet. Since everyone takes every item, all examinees are administered some items that are very easy and some that are very difficult for them. These easy and hard items are like adding constants to someone's score. They provide relatively little information about the examinee's ability level. Consequently, large numbers of items and examinees are needed to obtain a modest degree of precision.

With computer adaptive tests, the examinee's ability level relative to a norm group can be iteratively estimated during the testing process and items can be selected based on the current ability estimate. Examinees can be given the items that maximize the information (within constraints) about their ability levels from the item responses. Thus, examinees will receive few items that are very easy or very hard for them. This tailored item selection can result in reduced standard errors and greater precision with only a handful of properly selected items.

Some terms

In order to explain the logic of computer adaptive testing, I need to first define some measurement terms and concepts.

Some terms:
true score - The score an examinee would receive on a perfectly reliable test. Since all tests contain error, true scores are a theoretical concept; in an actual testing program, we will never know an individual's true score. We can, however, compute an estimate of an examinee's true score and we can estimate the amount of error in that estimate. True ability is denoted as theta (θ); the true score for examinee j is denoted theta-j (θj).
ability estimate - The score you might receive based on taking a real test. Also called a true-score estimate. The estimate of true ability is denoted theta-hat (θ̂).
standard error - In any testing situation, there are multiple sources of error. Some of this error may be due to variations in the examinee's manifest behavior (i.e. having a bad day). The single largest source of error is usually the sampling of test content. Standard error is an estimate of the standard deviation of the estimates of ability that might be expected for a given candidate, if that candidate were to take multiple equivalent forms of the test with no learning or instruction between the administrations.

Some concepts:
Probability of a correct response - One usually thinks of an individual either getting an item right or wrong. Suppose we have the universe of people with some common true score (θ) take the item. Some of those people will get the item right, others will get it wrong. The percent getting the item right will be the probability of getting the item right given a specific value for theta. For any given item, we would expect the probability of a correct response to be low (approach guessing) for low ability examinees and high (approach certainty) for high ability examinees.
item response function - When we have the probabilities of a correct response across a range of ability (i.e. theta values), then we have a function that relates ability to the probability of a correct response. Also called an item characteristic curve. One such function is defined in the next paragraph.
item response theory - A theoretical framework that yields item response functions and extensions of those functions. The Rasch model, the three parameter logistic model, and the two parameter Birnbaum model are all IRT models. These IRT models relate the probability of correctly answering an item to characteristics of the item and the examinee's true ability.

A little bit about Item Response Theory

Item Response Theory (IRT) is a statistical framework in which examinees are described by a set of one or more ability scores; mathematical models then link examinee abilities, item statistics, and actual performance on test items. See van der Linden and Hambleton (1997), Hambleton, Swaminathan, and Rogers (1991), and Hambleton and Jones (1993) for fuller explanations of IRT.

This tutorial employs the widely accepted three parameter item response theory model first described by Birnbaum (1968). Under the 3 parameter IRT model, the probability of a correct response to a given item i is a function of an examinee's true ability and three item parameters

1. the ai, or item discrimination, parameter,
2. the bi, or item difficulty, parameter, and
3. the ci, or guessing, parameter.

Each item i has a different set of these three parameters. These parameters are usually calculated based on prior administrations of the item.

The model states that probability of a correct response to item i for examinee j is a function of the three item parameters and examinee j's true ability.

P(ui = 1 | θ, ai, bi, ci) = ci + (1 - ci) / [1 + exp(-1.7 ai (θ - bi))]

This function is plotted below with ai = 2.0, bi = 0.0, ci = .25, and θ varying from -3.0 to 3.0.

The horizontal axis is the ability scale, ranging from very low (-3.0) to very high (+3.0). When ability follows the normal curve, 68% of the examinees will have an ability between -1.0 and +1.0; 95% will be between -2.0 and +2.0. The vertical axis is the probability of responding correctly to this item (defined by the three item parameters) given ability θ.

The lower asymptote is at ci = .25. This is the probability of a correct response for examinees with very little ability (e.g. θ = -2.0 or -2.6). The curve has an upper asymptote at 1.0; high ability examinees are very likely to respond correctly.

The bi parameter defines the location of the curve's inflection point along the theta scale. Lower values of bi will shift the curve to the left; higher, to the right. The bi parameter does not affect the shape of the curve.

The ai parameter defines the slope of the curve at its inflection point. The curve would be flatter with a lower value of ai; steeper with a higher value. Note that when the curve is steep, there is a large difference between the probabilities of a correct response for a) examinees whose ability is slightly below (to the left of) the inflection point and b) examinees whose ability is slightly above it. Thus ai denotes how well the item is able to discriminate between examinees of slightly different ability (within a narrow effective range).
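The response function above, and the effect of each parameter, can be computed directly. A minimal sketch in Python, using the plotted item's values ai = 2.0, bi = 0.0, ci = .25 (the function name is mine, not part of the tutorial's software):

```python
import math

def p_correct(theta, a, b, c):
    """Three-parameter logistic model: probability of a correct
    response given ability theta and item parameters a, b, c."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# The item from the plot: a = 2.0, b = 0.0, c = .25.
# At theta = b the probability is c + (1 - c)/2 = .625 for this item.
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(theta, round(p_correct(theta, 2.0, 0.0, 0.25), 3))
```

Varying a, b, and c in the call reproduces the behaviors described above: b slides the curve left or right, a steepens it, and c raises the lower asymptote.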

One of the positive features of IRT is that distributional assumptions about ability do not need to be made and the scaling of ability is arbitrary. It is common, and convenient, to place ability scores on a scale with a mean of zero and a standard deviation of one. When this scaling is done, it is common to find ability scores mainly between -3.0 and +3.0; in other words, regardless of the ability score distribution, nearly all of the scores fall within three standard deviations of the mean ability score.

If you are new to Item Response Theory, I encourage you to play with varying the item parameter values for this function at http://edres.org/scripts/cat/genicc.htm.

The Logic of Computer Adaptive Testing

Computer adaptive testing can begin when an item bank exists with IRT item statistics available on all items, when a procedure has been selected for obtaining ability estimates based upon candidate item performance, and when there is an algorithm chosen for sequencing the set of test items to be administered to candidates.

The CAT algorithm is usually an iterative process with the following steps

1. All the items that have not yet been administered are evaluated to determine which will be the best one to administer next given the currently estimated ability level.
2. The "best" next item is administered and the examinee responds.
3. A new ability estimate is computed based on the responses to all of the administered items.
4. Steps 1 through 3 are repeated until a stopping criterion is met.
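The steps above can be sketched as a short simulation. This is a minimal illustration, not the tutorial's actual software: the item bank, the 3PL model with scaling constant 1.7, and especially the crude step-size ability update in Step 3 are all simplifying assumptions (a real CAT would re-estimate ability by maximum likelihood, as described below).

```python
import math
import random

def p_correct(theta, item):
    """3PL probability of a correct response (scaling constant 1.7)."""
    a, b, c = item
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def information(theta, item):
    """3PL item information at theta (Birnbaum's model)."""
    a, b, c = item
    p = p_correct(theta, item)
    return (1.7 * a) ** 2 * ((p - c) / (1 - c)) ** 2 * (1 - p) / p

def cat_session(bank, true_theta, n_items, rng):
    """Simulate one adaptive session; returns the final ability estimate."""
    theta = 0.0                      # start at the mean ability
    administered = []
    for _ in range(n_items):
        # Step 1: evaluate every unused item at the current estimate
        remaining = [i for i in range(len(bank)) if i not in administered]
        best = max(remaining, key=lambda i: information(theta, bank[i]))
        # Step 2: administer it; the response is simulated from the
        # examinee's true ability
        u = 1 if rng.random() < p_correct(true_theta, bank[best]) else 0
        administered.append(best)
        # Step 3: crude illustrative update -- move up after a correct
        # response, down after an incorrect one, in shrinking steps
        theta += (0.5 if u else -0.5) / len(administered)
    return theta                     # Step 4: stop after n_items
```

With a reasonable item bank, the estimates for simulated high-ability and low-ability examinees separate after only a handful of items.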

Several different methods can be used to compute the statistics needed in each of these steps. Hambleton, Swaminathan, and Rogers (1991); Lord (1980); Wainer, Dorans, Flaugher, Green, Mislevy, Steinberg, and Thissen (1990); and others have shown how this can be accomplished using Item Response Theory.

Treating item parameters as givens, the ability estimate is the value of theta that best fits the model. When the examinee is given a sufficient number of items, the initial estimate of ability should not have a major effect on the final estimate of ability. The tailoring process will quickly result in the administration of reasonably targeted items. The stopping criterion could be time, number of items administered, change in ability estimate, content coverage, a precision indicator such as the standard error, or a combination of factors.

Step 1 references selecting the "best" next item. Little information about an examinee's ability level is gained when the examinee responds to an item that is much too easy or much too hard. Rather one wants to administer an item whose difficulty is closely targeted to the examinee's ability. Furthermore, one wants to give an item that does a good job of discriminating between examinees whose ability levels are close to the target level.

Using item response theory, we can quantify the amount of information provided by an item at a given ability level. Under the maximum information approach to CAT, the approach used in this tutorial, the "best" next item is the one that provides the most information (in practice, constraints are incorporated in the selection process). With IRT, maximum information can be quantified as the standardized slope of Pi(θ) at θ. In other words,

Ii(θ) = [P'i(θ)]² / [Pi(θ) (1 - Pi(θ))]

where Pi(θ) is the probability of a correct response to item i, P'i(θ) is the first derivative of Pi(θ), and Ii(θ) is the information function for item i.

Thus, for Step 1, Ii(θ) for each item can be evaluated using the current ability estimate. While maximizing information is perhaps the best known approach to selecting items, Kingsbury and Zara (1989) describe several alternative item selection procedures.
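As a concrete sketch, the item information function for the 3PL model can be computed from its closed-form derivative. The function names and the parameter values in the usage lines are illustrative, not taken from the tutorial's software:

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b, c):
    """Information = P'(theta)^2 / [P(theta) (1 - P(theta))]."""
    p = p_correct(theta, a, b, c)
    # Derivative of the 3PL: P' = 1.7 a (1 - c) L (1 - L), where L is
    # the plain logistic term without the guessing floor.
    logistic = 1 / (1 + math.exp(-1.7 * a * (theta - b)))
    p_prime = 1.7 * a * (1 - c) * logistic * (1 - logistic)
    return p_prime ** 2 / (p * (1 - p))

# An item informs most near its difficulty and little far from it:
print(item_information(0.0, 2.0, 0.0, 0.25))
print(item_information(2.0, 2.0, 0.0, 0.25))
```

Evaluating this function for every unused item at the current ability estimate is exactly the Step 1 computation.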

In Step 3, a new ability estimate is computed. The approach used in this tutorial is a modification of the Newton-Raphson iterative method for solving equations, outlined in Lord (1980, p. 181). The estimation starts with an initial estimate θs, computes the probability of a correct response to each item using θs, and then adjusts the ability estimate to obtain improved agreement between the probabilities and the observed response vector. The process is repeated until the adjustment is extremely small. Thus:

θs+1 = θs + [ Σi (ui - Pi(θs)) P'i(θs) / (Pi(θs)(1 - Pi(θs))) ] / [ Σi Ii(θs) ]

where ui is 1 if item i was answered correctly and 0 otherwise.

The right-hand side of the above equation is the adjustment. θs+1 denotes the adjusted ability estimate. The denominator of the adjustment is the sum of the item information functions evaluated at θs. When θs is the maximum likelihood estimate of the examinee's ability, the sum of the item information functions is the test information function, I(θ).

The standard error associated with the ability estimate is calculated by first determining the amount of information the set of items administered to the candidate provides at the candidate's ability level--this is easily obtained by summing the values of the item information functions at the candidate's ability level to obtain the test information. Second, the test information is inserted in the formula below to obtain the standard error:

SE(θ̂) = 1 / √I(θ̂)

Thus, the standard error for individuals can be obtained as a by-product of computing an estimate of an examinee's ability.
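The Step 3 update and the standard error can be combined in one routine. This is a sketch under the tutorial's 3PL model, not its actual code: the tolerance, iteration cap, and clamping of extreme estimates are my own safeguards, and a mixed response vector (at least one right and one wrong answer) is assumed, since the maximum likelihood estimate diverges otherwise.

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def estimate_ability(items, responses, theta=0.0, tol=1e-5, max_iter=50):
    """Newton-Raphson / Fisher-scoring ML estimate of ability, in the
    spirit of Lord (1980, p. 181); returns (theta_hat, standard_error)."""
    for _ in range(max_iter):
        numer = 0.0   # sum of scaled residuals (the adjustment's numerator)
        denom = 0.0   # test information evaluated at the current theta
        for (a, b, c), u in zip(items, responses):
            p = p_correct(theta, a, b, c)
            logistic = 1 / (1 + math.exp(-1.7 * a * (theta - b)))
            p_prime = 1.7 * a * (1 - c) * logistic * (1 - logistic)
            numer += (u - p) * p_prime / (p * (1 - p))
            denom += p_prime ** 2 / (p * (1 - p))
        adjustment = numer / denom
        theta = max(-6.0, min(6.0, theta + adjustment))  # keep estimate sane
        if abs(adjustment) < tol:
            break
    # At convergence, denom is the test information, so the standard
    # error falls out as its reciprocal square root.
    return theta, 1.0 / math.sqrt(denom)
```

The returned standard error is exactly the by-product described above: one over the square root of the test information.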

Reliability and standard error

In classical measurement, the standard error of measurement is a key concept and is used in describing the level of precision of true score estimates. With a test reliability of 0.90, the standard error of measurement for the test is about .33 of the standard deviation of examinee test scores. In item response theory-based measurement, when ability scores are scaled to a mean of zero and a standard deviation of one (which is common), this level of reliability corresponds to a standard error of about .33 and test information of about 10. Thus, it is common in practice to design CATs so that the standard errors are about .33 or smaller (or, correspondingly, so that test information exceeds 10). [This paragraph was contributed by Ron Hambleton, University of Massachusetts.]
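The correspondence in the paragraph above is easy to verify: with ability scaled to a standard deviation of one, the classical standard error of measurement is sd × sqrt(1 - reliability), and test information is the reciprocal of the squared standard error.

```python
import math

reliability = 0.90
sd = 1.0                              # ability scaled to mean 0, sd 1

se = sd * math.sqrt(1 - reliability)  # classical standard error of measurement
info = 1 / se ** 2                    # IRT: information = 1 / SE^2

print(round(se, 3), round(info, 1))   # 0.316 10.0
```

So a reliability of .90 corresponds to a standard error of about .33 and test information of about 10, as stated.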

In general, computerized testing greatly increases the flexibility of test management. Its potential is often described (e.g. Urry, 1977; Grist, Rudner, and Wise, 1989; Kreitzberg, Stocking, and Swanson, 1978; Olsen, Maynes, Slawson, and Ho, 1989; Weiss and Kingsbury, 1984; Green, 1983). Some of the benefits are:

• Tests are given "on demand" and scores are available immediately.
• Neither answer sheets nor trained test administrators are needed. Test administrator differences are eliminated as a factor in measurement error.
• Tests are individually paced so that an examinee does not have to wait for others to finish before going on to the next section. Self-paced administration also offers extra time for examinees who need it, potentially reducing one source of test anxiety.
• Test security may be increased because hard copy test booklets are never compromised.
• Computerized testing offers a number of options for timing and formatting. Therefore it has the potential to accommodate a wider range of item types.
• Significantly less time is needed to administer CATs than fixed-item tests since fewer items are needed to achieve acceptable accuracy. CATs can reduce testing time by more than 50% while maintaining the same level of reliability. Shorter testing times also reduce fatigue, a factor that can significantly affect an examinee's test results.
• CATs can provide accurate scores over a wide range of abilities while traditional tests are usually most accurate for average examinees.

Limitations

Despite the above advantages, computer adaptive tests have numerous limitations, and they raise several technical and procedural issues:

• CATs are not applicable for all subjects and skills. Most CATs are based on an item-response theory model, yet item response theory is not applicable to all skills and item types.
• Hardware limitations may restrict the types of items that can be administered by computer. Items involving detailed art work and graphs or extensive reading passages, for example, may be hard to present.
• CATs require careful item calibration. The item parameters used in a paper and pencil testing may not hold with a computer adaptive test.
• CATs are only manageable if a facility has enough computers for a large number of examinees and the examinees are at least partially computer-literate. This can be a big limitation.
• The test administration procedures are different. This may cause problems for some examinees.
• With each examinee receiving a different set of questions, there can be perceived inequities.
• Examinees are not usually permitted to go back and change answers. A clever examinee could intentionally miss initial questions. The CAT program would then assume low ability and select a series of easy questions. The examinee could then go back and change the answers, getting them all right. The result could be 100% correct answers, which would place the examinee's estimated ability at the highest level.

Key Technical and Procedural Issues

There is a fair amount of guidance in the literature with regard to technical, procedural, and equity issues when using CAT with large scale or high stakes tests (Mills and Stocking, 1995; Stocking 1994; Green, Bock, Humphreys, and Reckase, 1984; FairTest, 1998). In this section, I outline the following issues:

Balancing content - Most CATs seek to quickly provide a single point estimate for an individual's ability. How can CAT, therefore, accommodate content balance?

The item selection process used in this tutorial depends solely on item information for choosing the next item to administer. While this procedure may be optimal for determining an individual's overall ability level, it doesn't assure content balance and doesn't guarantee that one could obtain subtest scores. Often one wants to balance the content of a test. The test specifications for a mathematics computation test, for example, may call for certain percentages of items to be drawn from addition, subtraction, multiplication and division.

If one is just interested in obtaining subtest scores, then each subtest can be treated as an independent measure and the items within the measure adaptively administered. When subtests are highly correlated, the subtest ability estimates can be used effectively in starting the adaptive process for subsequent subtests.

Kingsbury and Zara (1989, 1991) outline a constrained computer adaptive testing (C-CAT) that provides content balancing:

1. The examinee's provisional achievement level is calculated following the administration of an item.
2. The percentage of items already administered in each subgoal in the current test is calculated.
3. The empirical percentages are compared to the preprescribed desired percentages, and the subgoal with the largest discrepancy is identified.
4. Within the subgoal with the largest discrepancy, the item providing the most information at the examinee's momentary achievement level estimate is selected and administered to the examinee.
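The C-CAT steps above can be sketched as a selection routine. The bank layout, target shares, and helper names here are illustrative assumptions, not Kingsbury and Zara's implementation:

```python
import math

def p_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def information(theta, a, b, c):
    """3PL item information at theta."""
    p = p_correct(theta, a, b, c)
    return (1.7 * a) ** 2 * ((p - c) / (1 - c)) ** 2 * (1 - p) / p

def select_ccat_item(bank, administered, theta, targets):
    """Constrained CAT selection: pick the subgoal whose administered
    share falls farthest below its target, then the most informative
    unused item within that subgoal.
    bank: {subgoal: [(a, b, c), ...]}; administered: {subgoal: [indices]};
    targets: {subgoal: desired proportion of the test}."""
    n_given = sum(len(v) for v in administered.values())

    def discrepancy(goal):
        # Target share minus the share actually administered so far
        share = len(administered[goal]) / n_given if n_given else 0.0
        return targets[goal] - share

    goal = max(bank, key=discrepancy)
    remaining = [i for i in range(len(bank[goal]))
                 if i not in administered[goal]]
    item = max(remaining, key=lambda i: information(theta, *bank[goal][i]))
    return goal, item
```

Each administration updates the empirical percentages, so content areas that lag their targets get the next pick.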

A major disadvantage to this approach is that the item groups must be mutually exclusive. When the number of item features of interest becomes large, the number of items per partition will become small. Further, it may not always be desirable to group items into mutually exclusive subsets.

Wainer and Kiely (1987) proposed the use of testlets as the basis for tailored branching. Items are grouped into small testlets developed following desired test specifications. The examinee responds to all the items within a testlet. The results on that testlet and previously administered testlets are then used to determine the next testlet. Wainer, Kaplan, and Lewis (1992) have shown that when the size of the testlets is small, the gain from making the testlets themselves adaptive is modest.

Swanson and Stocking (1993) and Stocking and Swanson (1993) describe a weighted deviations model (WDM) which selects items using linear programming based on numerous simultaneous constraints involving statistical and content considerations. One constraint would be to maximize item information. Other constraints might be mathematical representations of the test specifications or a model to control for item overlap. The traditional linear programming model is not always applicable as some "constraints cannot be satisfied simultaneously with some other (often non-mutually exclusive) constraint. WDM resolves the problem by treating the constraints as desired properties and moving them to the objective function" (Stocking and Swanson, 1993, p. 280).

While WDM can often be solved using linear programming techniques, Swanson and Stocking (1993) provide a heuristic for solving the WDM:

• For every item not already administered, compute the deviation for each of the constraints if the item were to be added to the test.
• Sum the weighted deviations across all constraints.
• Select the item with the smallest weighted sum of deviations.

The preferred choice would depend on the number and nature of the desired constraints.
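The three-bullet heuristic can be written almost verbatim. The item representation, deviation functions, and weights below are placeholders of my own; in Swanson and Stocking's model they would encode the test specifications and statistical targets:

```python
def select_wdm_item(bank, administered, deviation_fns, weights):
    """Swanson-Stocking heuristic for the weighted deviations model:
    for each item not yet administered, compute how far each constraint
    would deviate from its target if the item were added, sum the
    weighted deviations, and pick the item with the smallest sum.
    deviation_fns: list of functions (item, administered) -> deviation."""
    remaining = [i for i in range(len(bank)) if i not in administered]

    def weighted_deviation(i):
        return sum(w * f(bank[i], administered)
                   for w, f in zip(weights, deviation_fns))

    return min(remaining, key=weighted_deviation)
```

A maximize-information goal is expressed here as a deviation to minimize, alongside any content constraints, which is how WDM folds hard constraints into the objective function.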

Administering items belonging to sets - Can CAT accommodate items that belong to sets?

In the typical reading assessment, the examinee reads a passage and responds to several questions concerning that passage. The stimulus material is presented just once. One would not want a computer adaptive test that treats each item independently: an examinee might be presented with the same stimulus multiple times and have to work through a passage just to answer one question at a time.

Each passage could be treated like a testlet as described above. An alternative approach, described by Mills and Stocking (1996) would be to present the most targeted items within a passage. Depending on the test specifications, an examinee might receive 3 of the 10 questions associated with a given passage.

Examinee Considerations - What are some examinee issues with regard to CAT?

Wise (1997) raised several issues from the examinee's perspective, including item review and equity. He noted that research consistently reports that examinees want a chance to review their answers. He also noted that when examinees change their answers, they are more likely to legitimately improve their scores. Most CATs cannot accommodate an option for examinees to review their answers (Wainer's (1987) testlet approach is a notable exception). If review and answer changing were possible, a clever examinee could intentionally miss initial questions. The CAT program would then assume low ability and select a series of easy questions. The clever examinee would then go back and change the answers, getting them all right. The result could be a high percentage of correct answers, which would result in an artificially high estimate of the examinee's ability.

While poor and minority children have less access to computers, Wise noted that the research on equity and CAT is mixed. He noted that there are racial and ethnic differences on the use of and desired amount of testing time. Yet some research has found that Blacks fare better on computer tests than on conventional tests. Wise concluded that, since the research is inconclusive, the issue should be investigated with regard to each test being developed.

Item exposure - How can CAT be modified to ensure that certain items are not over-used?

Without constraints, the item selection process will select the statistically best items. This will result in some items being more likely to be presented than others, especially in the beginning of the adaptive process. Yet one would be interested in assuring that some items are not over-used. Overriding the item selection process to limit exposure will better assure the availability of item level information and enhance test security. However, it also degrades the quality of the adaptive test. Thus, a longer test would be needed.

One approach to controlling exposure is to randomly select the item to be administered from a small group of best fitting items. For example, McBride and Martin (1983) suggest randomly selecting the first item from the five best fitting items, the second item from the four best fitting items, the third from a group of three, and the fourth from a group of two. The fifth and subsequent items would be selected optimally. After the initial items, the examinees would be sufficiently differentiated and would optimally receive different items. Kingsbury and Zara (1989, p. 369) report adding an option to Zara's CAT software to randomly select from the two to ten best items.
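The McBride-Martin rule reduces to a shrinking candidate pool. A minimal sketch (the function name is mine; the ranked list stands in for items sorted by information at the current ability estimate):

```python
import random

def select_randomesque(ranked_items, n_administered, rng):
    """McBride-Martin exposure control: choose at random from the five
    best items for the 1st administration, the four best for the 2nd,
    three for the 3rd, two for the 4th, and optimally thereafter.
    ranked_items must be sorted best-first by information."""
    group_size = max(5 - n_administered, 1)
    return rng.choice(ranked_items[:group_size])
```

After four items the pool collapses to a single candidate, so selection becomes purely optimal from the fifth item on.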

Sympson and Hetter (1985) developed an approach which controls item exposure using a probability model. The approach seeks to assure that the probability the item is administered, P(A), is less than some value r - the expected, but not observed, maximum rate of item usage. If P(S) denotes the probability an item is selected as optimal, and P(A|S) denotes the probability the item is administered given that it was selected as optimal, then P(A) = P(A|S) * P(S). The values for P(A|S), the exposure control parameters for each item, can be determined through simulation studies.

Item pool characteristics - Can any test be used for CAT?

Lord (1980, p. 152) pointed out that an item provides the most information near θ = bi and that the most information the item can provide is

Imax = [(1.7 ai)² / (8 (1 - ci)²)] [1 - 20ci - 8ci² + (1 + 8ci)^(3/2)]

Maximum information is thus a function of both the ai and ci parameters and is proportional to the square of ai: an item whose ai = 1.00 will be 16 times more effective than an item whose ai = .25. An item whose ci = 0.00 (i.e. a free response item) will be about 1.6 times as effective as an item with ci = .25.

Thus, the ideal item pool for a computer adaptive test would be one with a large number of highly discriminating items at each ability level. The information functions for these items would appear as a series of peaked distributions across all levels of theta.

The item pool used in this tutorial is not ideal for computer adaptive testing. There are large numbers of low discriminating items and most item difficulties are between -1.0 and 1.0.

Another way to look at an item bank is to look at the sum of the item information functions. This Test Information Function shows the maximum amount of information the item bank can provide at each level of θ by either traditional or CAT administration. The test information function for the item pool used in this tutorial is shown in the next figure:

[Figure: Test information function for the tutorial's item pool]

The information that can be provided by this item bank peaks at theta = 1.4. The item bank is strongest when 0 < theta < 2. Again, these curves define upper bounds. In practice, the amount of information will be lower at all levels of theta because an examinee only takes a sample of the items in the item bank. In terms of CAT, fewer items will need to be administered to examinees in the 0 < theta < 2 range in order to achieve a given level of precision.

Item pool size - How big does an item pool need to be?

The size of the item pool needed depends on the intended purpose and characteristics of the tests being constructed. Weiss (1985) points out that satisfactory implementations of CAT have been obtained with an item pool of 100 high quality, well distributed items. He also notes that properly constructed item pools with 150-200 items are preferred. If one is going to incorporate a realistic set of constraints (e.g. random selection from among the most informative items to minimize item exposure; selection from within subskills to provide content balance) or administer a very high stakes examination, then a much larger item bank would be needed.

Shifting parameter estimates - Can one expect the item response theory item parameters to be stable under computer adaptive item administration?

Numerous studies using live examinees have documented the equivalence of paper-and-pencil and computer adaptive tests by demonstrating equal ability estimates, equal variances, and high correlations (see Bergstrom, 1992 for a synthesis of 20 such studies). This equivalence implies that the underlying assumptions for CAT were met and that CAT was robust.

Two key assumptions for IRT-based computer adaptive testing are unidimensional item pools and fixed item parameters. The dimensionality of the item pool should not be a major concern since it is routinely investigated as part of quality test development. Of concern, however, is whether the IRT parameters change due to mode of administration or change over time. When examining person-fit by test booklet, Rudner, Bracey and Skaggs (1996) noted that the fit for calculator items was much worse than the fit for the same items when administered without a calculator. Thus, there is a very real possibility that the item parameters under CAT may not be the same as under paper-and-pencil administration. Also of concern is the possibility that the IRT parameters will shift due to changes in curriculum or population characteristics. The issue of shifting parameters, however, could easily be addressed by recalculating IRT parameters after a CAT administration and comparing values.

Stopping rules - How does one determine when to stop administering items? What are the implications of different stopping rules?

One of the theoretical advantages of computer adaptive tests is that testing can be continued until a satisfactory level of precision is achieved. Thus, if the item pool is weak at some section of the ability continuum, additional items can be administered to reduce the standard error for the individual. Stocking (1987), however, showed that such variable-length testing can result in biased estimates of ability, especially if the test is short. Further, the nuances of a precision-based (and hence variable-length) stopping rule would be hard to explain to a lay audience.
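A precision-based stopping rule of the sort described above can be sketched as: after each response, recompute the standard error from the accumulated test information and stop once it falls below a target, or once a maximum test length is reached. The SE target, per-item information, and maximum length below are hypothetical numbers chosen for illustration.

```python
import math

def should_stop(total_info, n_items, se_target=0.30, max_items=30):
    """Variable-length stopping rule: stop when the standard error of
    the ability estimate, SE = 1/sqrt(I), reaches the target precision,
    or when the maximum test length is hit."""
    se = 1 / math.sqrt(total_info) if total_info > 0 else float("inf")
    return se <= se_target or n_items >= max_items

# Each administered item adds its information at the current theta
# estimate; suppose each contributes about 0.45.
total, n = 0.0, 0
while not should_stop(total, n):
    total += 0.45
    n += 1
# Reaching SE <= 0.30 requires I >= 1/0.30^2, about 11.1, so roughly
# 25 such items are needed before the rule fires.
```

The max_items cap is what keeps the rule defensible in practice: without it, an examinee sitting in a weak region of the pool could face an unreasonably long test.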

References

Baker, F. (1985) The basics of item response theory. Portsmouth, NH: Heinemann Educational Books (out of print).

Bergstrom, B. (1992) Ability measure equivalents of computer adaptive and pencil and paper tests: A research synthesis. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. Part 5 in F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley (out of print).

Cleary, T.A. and R. Linn (1969). Error of measurement and the power of a statistical test. British Journal of Mathematical and Statistical Psychology, 22, 49-55.

Eignor, D., Stocking, M, Way, W., & Steffen, M. (1993). Case studies in computer adaptive test design through simulation. Princeton, NJ: Educational Testing Service Research Report RR-93-66.

Fairtest (1998). Computerized Testing: More Questions Than Answers. FairTest Fact Sheet. http://www.fairtest.org/facts/computer.htm

Green, B.F., Bock, R.D., Humphreys, L., Linn, R.L., & Reckase, M.D. (1984). Technical Guidelines for Assessing Computerized Adaptive Tests. Journal of Educational Measurement, 21(4), 347-360.

Hambleton, R.K. & R.W. Jones (Fall, 1993). An NCME Instructional Module on Comparison of Classical Test Theory and Item Response Theory and Their Applications to Test Development. Educational Measurement: Issues and Practice, 12(3), 38-47.

Hambleton, R.K., H. Swaminathan, & H.J. Rogers (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage.

Kingsbury, G., Zara, A. (1989). Procedures for Selecting Items for Computerized Adaptive Tests. Applied Measurement in Education, 2(4), 359-75.

Kingsbury, G., Zara, A. (1991). A Comparison of Procedures for Content-Sensitive Item Selection in Computerized Adaptive Tests. Applied Measurement in Education, 4 (3) 241-61

Kreitzberg, C., Stocking, M.L. & Swanson, L. (1978). Computerized Adaptive Testing: Principles and Directions. Computers and Education, 2(4), 319-329.

Lord, F.M. and M. Novick (1968) Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley (out of print).

Lord, F.M. (1980). Application of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Mills, C. & Stocking, M. (1996) Practical issues in Large Scale Computerized Adaptive Testing. Applied Measurement in Education, 9(4), 287-304.

Rist, S., Rudner, L. & Wise, L. (1989). Computer Adaptive Tests. ERIC Digest Series.

Rudner, L. (1990). Computer Testing: Research Needs Based on Practice. Educational Measurement: Issues and Practice, 2, 19-21.

Rudner, L., Bracey, G. & Skaggs, G. (1996). Use of person fit statistics in one high quality large scale assessment. Applied Measurement in Education, January.

Stocking, M.L. & Swanson, L. (1993). A Method for Severely Constrained Item Selection in Adaptive Testing. Applied Psychological Measurement, 17(3), 277-292.

Swanson, L. & Stocking, M.L. (1993). A Model and Heuristic for Solving Very Large Item Selection Problems. Applied Psychological Measurement, 17(2), 151-166.

Sympson, J.B. & Hetter, R.D. (1985) Controlling item exposure rates in computerized adaptive testing. Proceedings of the 27th annual meeting of the Military Testing Association. San Diego, CA: Navy Personnel Research and Development Center.

van der Linden, W.J. & R.K. Hambleton (Editors) (1997). Handbook of Modern Item Response Theory. New York: Springer.

Wainer, H. (1983). On Item Response Theory and Computerized Adaptive Tests: The Coming Technological Revolution in Testing. Journal of College Admissions, 28(4), 9-16.

Wainer, H. (1993) Some practical considerations when converting a linearly administered test to an adaptive format. Educational Measurement: Issues and Practice, 12,1,15-21.

Wainer, H & Kiely, G. (1987). Item clusters and computerized adaptive testing: the case for testlets. Journal of Educational Measurement, 24, 189-205.

Wainer, H., Dorans, N., Flaugher, R., Green, B., Mislevy, R., Steinberg, L. & Thissen, D. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.

Wainer, H., Kaplan, B., & Lewis, C. (1990). A comparison of the performance of simulated hierarchical and linear testlets. Journal of Educational Measurement, 27, 1-14.

Weiss, D.J. (1985). Adaptive Testing by Computer. Journal of Consulting and Clinical Psychology, 53(6), 774-789.

Wise, S. (1997). Examinee issues in CAT. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.

Acknowledgements

This tutorial was developed with funds awarded by the National Library of Education, Office of Educational Research and Improvement, US Department of Education to the ERIC Clearinghouse on Assessment and Evaluation (RR #93002002). The views and opinions expressed herein are those of the author and do not necessarily reflect the views of any sponsoring agency. The software incorporated in this tutorial is proprietary and belongs to Lawrence Rudner.

Special thanks go to Ronald K. Hambleton, University of Massachusetts; Dennis Roberts, Pennsylvania State University; and Pamela R. Getson, National Institutes for Health, for their helpful comments on earlier versions of this program. My appreciation also goes to Kristen Starret for redrawing the graphics that accompany each item and Scott Hertzberg for proofreading the text.

This tutorial was developed using Active Server Pages, the scripting language of Microsoft's Windows NT Internet Information Server.

From: Rudner, Lawrence M. (1998). An On-line, Interactive, Computer Adaptive Testing Tutorial, http://edres.org/scripts/cat