What makes a good exam question? Not surprisingly, I try to write exams that most students who are keeping up with the course should do well on — almost by definition, the exam should be evaluating what I’m teaching. But I also want the exam to reveal and assess different levels of understanding; it would be useless to have an exam that everyone aced, or that everyone failed. Also not surprisingly, I’m not perfect at coming up with questions that achieve these aims. For years, however, I’ve been using the data from the exam scores themselves to tell me about the exam. Here’s an illustration:
I recently gave a midterm exam in my Physics of Energy and the Environment course. It consisted of 26 multiple choice questions and 8 short answer questions. For the multiple choice questions, I can calculate (i) the fraction of students who got a question correct, and (ii) the correlation between student scores on that question and scores on the exam as a whole. The first number tells us how easy or hard the question is, and the second tells us how well the question discriminates among different levels of understanding. (Roughly speaking, it also tells us whether the question is assessing the same things the exam as a whole is aiming for.) These are both standard things to look at, and I'll note for completeness that there is a large literature on the mechanics of testing that I tend not to read and so can't adequately cite.
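For the curious, here's a minimal sketch of the calculation in Python. The data are entirely hypothetical (the number of students, the pass rate, and the random seed are all made up for illustration); in the testing literature these two statistics are commonly called item difficulty and item discrimination.

```python
import numpy as np

# Hypothetical data: one row per student, one column per multiple choice
# question; 1 = correct, 0 = incorrect. 120 students is a made-up number.
rng = np.random.default_rng(0)
scores = (rng.random((120, 26)) < 0.7).astype(float)

totals = scores.sum(axis=1)          # each student's total exam score
frac_correct = scores.mean(axis=0)   # (i) item difficulty: fraction correct

# (ii) item discrimination: correlation of each question with the exam total.
# (A common refinement correlates against the total *minus* the question
# itself -- the "item-rest" correlation -- so a question isn't compared
# partly with itself; the plain version matches the description above.)
discrimination = np.array(
    [np.corrcoef(scores[:, q], totals)[0, 1] for q in range(scores.shape[1])]
)

for q, (p, r) in enumerate(zip(frac_correct, discrimination), start=1):
    print(f"Q{q:2d}: fraction correct = {p:.2f}, correlation with total = {r:+.2f}")
```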
Here’s the graph of correlation coefficient vs. fraction correct for each of the multiple choice questions from my exam:
We notice first of all a nice spread: there are questions in the lower right that lots of people get right. These don’t really help distinguish between students, but they probably make everyone feel better! The upper left shows questions that are more difficult, and that correlate strongly with overall performance. In the lower left are my mistakes (questions 6 and 15): questions that are difficult and that don’t correlate with overall performance. These might be unclear or irrelevant questions. Of course I didn’t intend them to be like this, and now after the fact I can discard them from my overall scoring. (Which, in fact, I do.)
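A plot like this is easy to make, for what it's worth. Continuing with the hypothetical arrays from the sketch above, something like the following (using matplotlib) reproduces the layout, with each point labeled by its question number:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(frac_correct, discrimination)
for q, (p, r) in enumerate(zip(frac_correct, discrimination), start=1):
    ax.annotate(str(q), (p, r), textcoords="offset points", xytext=(4, 4))
ax.axhline(0, color="gray", linewidth=0.5)  # below this line: negative discriminators
ax.set_xlabel("Fraction of students answering correctly")
ax.set_ylabel("Correlation with total exam score")
ax.set_title("Item discrimination vs. difficulty")
plt.show()
```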
I can also include the short answer questions, now plotting mean score rather than fraction correct (since the scoring isn't binary for these). We see a similar pattern: in general the correlation coefficients are higher, as we'd expect, since short answer questions give more insight into how students are thinking.
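The sketch above extends directly to graded questions: replace the 0/1 entries with points earned, and take the mean score as a fraction of the maximum. Hypothetically (again with made-up sizes and point values):

```python
# Hypothetical short answer scores: 8 questions, each graded out of 5 points.
sa = rng.integers(0, 6, size=(120, 8)).astype(float)

sa_mean = sa.mean(axis=0) / 5.0        # mean score as a fraction of the maximum
exam_total = totals + sa.sum(axis=1)   # combine with the multiple choice total
sa_disc = np.array(
    [np.corrcoef(sa[:, q], exam_total)[0, 1] for q in range(sa.shape[1])]
)
```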
It’s fascinating, I think, to plot and ponder these data, and it serves the important goal of assessing whether my exam is really doing what I want. I’m rather happy to note that only a few of my questions fall into the lower-left corner of mediocrity. I was spurred to post this because we’re doing a somewhat similar exercise with my department’s Ph.D. qualifier exam. One might think, given the enormous effect of such an exam on students’ lives, and the fact that it is created by a building full of quantitative scientists, that (i) we routinely analyze the exam’s properties, and (ii) it passes any metric of quality one could think of. Sadly, neither is the case. Only recently, thanks to a diligent colleague, do we have a similar analysis of response accuracy and question discrimination. Frighteningly, we have given exams in which a remarkable fraction of the questions are poor discriminators, correlating weakly or even negatively with overall performance! I am cautiously optimistic that we will do something about this. Of course, it is very difficult to write good questions. But rather than telling ourselves we can do it flawlessly, we should let the results inform the process.