Constructing and Grading Multiple-Choice Exams: An Interview with Anthony Marini, Spring 2006


By Mike Atkinson
Faculty Associate, Teaching Support Centre
Number 54, March 2006

Anthony Marini, 3M Teaching Fellow and professor of Educational Psychology at the University of Calgary, is an expert in measurement and assessment. Recently, we had an opportunity to chat about several issues related to setting and grading multiple-choice exams.

MA: I’m often asked about the number of alternatives one should use for a multiple-choice item. What’s the current thinking on this?

AM: We used to suggest four or five alternatives per item, but recently the move has been toward three well-constructed alternatives.

MA: Are three alternatives really enough?

AM: Absolutely, but they must be well-written, meaningful alternatives. It turns out that five alternatives are not that effective … it’s simply too hard to construct good distracters (the alternatives that are not correct).

MA: What about the guessing rate … is 33% acceptable?

AM: Sure, the critical factor is that the distracters should reflect common errors in understanding or reasoning. Too often, the distracters are easily dismissed, so the nominal guessing rate means less than you think anyway. You should always run the item analysis and remove those alternatives that are not working.
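(Editor's note: a small sketch of why the nominal guessing rate can mislead; the scenario of a student dismissing two weak distracters is our own illustration, not something from the interview.)

```python
# Blind-guessing probabilities for a multiple-choice item.
# The "two easily dismissed distracters" scenario is illustrative only.

def guess_rate(n_alternatives: int) -> float:
    """Chance of a correct blind guess among n equally plausible alternatives."""
    return 1 / n_alternatives

print(f"3 alternatives, all plausible:      {guess_rate(3):.0%}")  # 33%
print(f"5 alternatives, all plausible:      {guess_rate(5):.0%}")  # 20%
print(f"5 alternatives, 2 easily dismissed: {guess_rate(3):.0%}")  # back to 33%
```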

MA: Before we talk about the item analysis, how many items would you suggest an instructor should use on an exam?

AM: There’s no hard rule about this; you need enough items to generate a valid test. As a rule of thumb, I usually allow 65 seconds per item. Speed should not be a factor, and that is typically adequate time for most students.
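(Editor's note: a quick illustration of the 65-seconds-per-item rule of thumb; the item counts below are our own examples.)

```python
# Rough exam-length arithmetic based on the 65-seconds-per-item rule of thumb.
SECONDS_PER_ITEM = 65

def exam_minutes(n_items: int, seconds_per_item: int = SECONDS_PER_ITEM) -> float:
    """Approximate time students need for an exam with n_items questions."""
    return n_items * seconds_per_item / 60

for n in (40, 50, 60):
    print(f"{n} items -> about {exam_minutes(n):.0f} minutes")
# 40 items -> about 43 minutes
# 50 items -> about 54 minutes
# 60 items -> about 65 minutes
```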

MA: What about the use of multiple-multiples (e.g., “A and B, but never D or E”) … many people in the professional schools like this kind of item.

AM: The literature is pretty clear on this. Do not use them. The American Medical Association, once a fan of these items, has now abandoned them.

MA: Are they considered too hard?

AM: Initially, multiple-multiples appear harder, but they are essentially an exercise in logical analysis. Once you learn the logic “trick”, the item actually becomes easier and the test loses content validity.

MA: Interesting. Is there a good way to maintain content validity (ensuring that the test accurately reflects the content to be learned) in your test?

AM: The best way is to use a Test Blueprint. Essentially, this is a two-way table where the rows represent the content actually covered in the course (topics, chapters, etc.) and the columns reflect Bloom’s taxonomy (http://faculty.washington.edu/krumme/guides/bloom1.html). In this way, you can gauge how many items you have included for each topic area and the level of cognitive complexity assessed. Your test should mirror the content actually covered and the weight placed on each of the topics.
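(Editor's note: to make the blueprint idea concrete, here is a minimal sketch of one as a table of planned item counts. The topics, Bloom levels, and counts are invented for illustration, and a spreadsheet works just as well as code.)

```python
# A minimal test-blueprint sketch: rows are course topics, columns are Bloom levels.
# Topics, levels, and planned item counts are invented for illustration only.

levels = ["Knowledge", "Comprehension", "Application", "Analysis"]

blueprint = {
    "Chapter 1": {"Knowledge": 4, "Comprehension": 3, "Application": 2, "Analysis": 1},
    "Chapter 2": {"Knowledge": 3, "Comprehension": 3, "Application": 3, "Analysis": 1},
    "Chapter 3": {"Knowledge": 2, "Comprehension": 2, "Application": 4, "Analysis": 2},
}

# Items planned per topic: the totals should mirror the weight each topic
# received in the course.
for topic, cells in blueprint.items():
    print(f"{topic}: {sum(cells.values())} items")

# Items planned per cognitive level, to check the mix of complexity assessed.
for level in levels:
    print(f"{level}: {sum(cells[level] for cells in blueprint.values())} items")
```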

MA: Let’s turn to the issue of item analysis. You’ve constructed your test, given the exam and graded it. What do we need to do next?

AM: Your multiple-choice test is not complete until you look at the item analysis. No one writes perfect items. You must determine the flaws and correct them.

MA: Where do we start?

AM: Most multiple-choice grading programs will automatically generate an item analysis. Start by looking at the difficulty score for each item. (Editor’s note: Scanexam generates a complete set of item analysis statistics). I first delete any item with a difficulty rating of 80% or higher (that is, an item that 80% or more of the class answered incorrectly) and then re-score the exam.
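(Editor's note: if your grading program does not label this figure clearly, it can be computed from the scored responses. The sketch below is ours, with invented data: scores[s][i] is 1 if student s answered item i correctly, and the 80%-incorrect cut-off follows the interview. Some programs report "difficulty" as the proportion answering correctly instead.)

```python
# Sketch of an item-difficulty check on a matrix of scored responses.
# scores[s][i] is 1 if student s answered item i correctly, 0 otherwise (toy data).

scores = [          # 5 students x 4 items
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
]

n_students = len(scores)
n_items = len(scores[0])

for item in range(n_items):
    proportion_wrong = sum(1 - row[item] for row in scores) / n_students
    # Flag items that 80% or more of the class answered incorrectly.
    flag = "  <- candidate for deletion" if proportion_wrong >= 0.80 else ""
    print(f"Item {item + 1}: {proportion_wrong:.0%} answered incorrectly{flag}")
```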

MA: You delete the most difficult items! Aren’t you just increasing the average by getting rid of the hard questions?

AM: Not really. When 80% of the class gets an item wrong, there’s probably something wrong with the way the question was phrased. When you look at such an item more closely, you usually find that it was not only the best students who got it right, which points to a flaw rather than genuine difficulty.

MA: Would you consider re-scoring the item? For example, keeping the item, still accepting the answer you coded as correct, and also accepting another alternative as correct?

AM: No. The item is flawed. Scoring another alternative as “right” does not make the item any better. Delete the item and re-work it for use on another exam.

MA: Would you eliminate the easy items too?

AM: Remember that the goal is to examine the items for flaws. If an item is too easy, change it on the next exam. But the exam is not flawed or unfair simply because an item is easy. You should not penalize the students for your work.

MA: O.K., what’s the next step?

AM: Look at the point-biserial correlations (the correlation between getting an item right and the total test score) and delete any item with a negative correlation. A negative correlation tells you that the students who got the item right tended to be in the bottom portion of the class, while most of the top students got it wrong. That item does not belong on the exam.
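(Editor's note: for instructors who want to compute this themselves, the point-biserial for an item is the Pearson correlation between the 0/1 scores on that item and students' total scores. Below is a minimal sketch using NumPy and invented data; some packages report an "item-rest" variant that excludes the item from the total, which this sketch does not do.)

```python
import numpy as np

# scores[s][i] = 1 if student s answered item i correctly, 0 otherwise (toy data).
scores = np.array([
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
])

totals = scores.sum(axis=1)  # each student's total test score

for item in range(scores.shape[1]):
    item_scores = scores[:, item]
    # Point-biserial = Pearson correlation between the 0/1 item score and the total.
    r = np.corrcoef(item_scores, totals)[0, 1]
    note = "  <- negative: delete this item" if r < 0 else ""
    print(f"Item {item + 1}: point-biserial = {r:+.2f}{note}")
```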

MA: Is there anything else we should look at?

AM: Check the distracter analysis and evaluate the utility of the distracters. Is anyone choosing this alternative? Who? You want the exam to be a fair test of knowledge, and the distracters should draw some attention. If they do not, change them for the next exam.
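(Editor's note: a distracter analysis can be as simple as counting how often each alternative is chosen, split by stronger and weaker students. The sketch below uses invented letter responses and total scores; splitting the class at the median is our own choice.)

```python
from collections import Counter

# Toy data: each student's chosen letter for one three-alternative item,
# paired with that student's total test score. Data and answer key are invented.
key = "B"
responses = [("A", 12), ("B", 25), ("B", 22), ("C", 9), ("B", 18),
             ("A", 20), ("B", 24), ("C", 8), ("B", 21), ("A", 10)]

counts = Counter(choice for choice, _ in responses)
print("Overall choice counts:", dict(counts))

# Split the class at (roughly) the median total score to see WHO picks each distracter.
sorted_scores = sorted(score for _, score in responses)
median = sorted_scores[len(sorted_scores) // 2]
top = Counter(c for c, s in responses if s >= median)
bottom = Counter(c for c, s in responses if s < median)

for option in ("A", "B", "C"):
    tag = " (key)" if option == key else ""
    print(f"{option}{tag}: top half {top.get(option, 0)}, bottom half {bottom.get(option, 0)}")
```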

MA: We’ve given the exam, checked the item analysis, deleted the flawed items and made notes for the next exam. Now it’s time to turn in the final grades. Let’s say that you, or your chair, think that the grades are too high or too low. What’s your advice on altering the final grade distribution?

AM: Do not get into these situations. If the final distribution is “too high or too low”, the problem is not with the distribution; the assessment instruments (the tests) were flawed in some fundamental fashion. If you have been using a blueprint, running item analyses, etc., then you should be able to demonstrate the validity of the exams. Consequently, the final distribution is valid as well. The cure for these problems is to produce valid exams in the first place. Making adjustments to the final distribution is not only poor testing practice, but also reinforces the idea that assessment is trivial.

MA: Anything new on the testing horizon?

AM: I’m advocating the use of scoring rubrics for all exams. They are very rich in content and a well-designed rubric can be used over and over.

MA: What exactly is a scoring rubric?

AM: Essentially, it is a scoring guide for questions that not only gives the right answer, but also gives examples of an excellent, a satisfactory, and an unacceptable answer. These can be shared with students for self-assessment as well as for information purposes.

MA: It sounds like rubrics are best suited to essay questions. Can you use them with multiple choice?

AM: Sure. In this case, you would explain why the keyed alternative is correct and why each of the other alternatives is wrong. This informs both the student and the instructor and keeps us focused on learning.
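(Editor's note: one hypothetical way to record such a rubric is to store a rationale for every alternative alongside the key; the item, options, and wording below are invented for illustration.)

```python
# A hypothetical multiple-choice scoring rubric: the key plus a rationale for
# every alternative, so students can see why each distracter is wrong.

item_rubric = {
    "question": "Which statistic flags an item that the top students tend to miss?",
    "key": "B",
    "rationale": {
        "A": "Incorrect: the difficulty index shows how many students missed the item, not who.",
        "B": "Correct: a negative point-biserial means high scorers tended to get the item wrong.",
        "C": "Incorrect: the class average says nothing about individual items.",
    },
}

def feedback(choice: str, rubric: dict) -> str:
    """Return the rubric's explanation for whichever alternative a student chose."""
    return rubric["rationale"][choice]

print(feedback("A", item_rubric))
```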