The AP Calculus Grading System is Badly Broken

The following is a report on my experiences and observations at the 2001 AP calculus reading. My basic conclusion is: The AP Calculus scoring system is one which is fundamentally unfair to the students who take the exam. The scoring rubrics in many ways fail to reflect the knowledge of calculus displayed by the students taking the exam, and in some instances egregiously so. Furthermore, the reading system is deliberately designed to restrict the exercise of the readers' judgment. I spent the entire week frustrated by my inability to apply my judgment in scoring books, and infuriated by being forced to score books in ways that I felt were simply wrong. All in all, the AP Calculus reading was one of the worst professional experiences I have ever had. Unless and until there are major changes, I would never participate in it again, and I would strongly advise colleagues not to do so either.

Let me define a technical term. I shall refer to all those responsible for developing the scoring rubrics, from the Office of the Chief Reader down, as the reading leadership (RL). I wish to include in the RL not only the relevant persons from this year's readings, but also from previous years' readings. But I want to emphasize that by the RL I do not mean specific individuals, and this letter is not intended to be a criticism of specific individuals. Rather, by the RL I mean the corporate entity responsible for the readings, and we all know that such entities take on lives of their own.

Before coming to the reading, it occurred to me that one of the greatest challenges faced by the RL was to ensure that books were read in a uniform way. With hundreds of readers, each with their own teaching and grading philosophies, it is clearly necessary to develop a method to achieve consistent reading. At the start, I was impressed by the briefing system, but my reaction quickly turned to one of dismay at the incredible minutiae of the scoring rules presented there. The only explanation for this that makes sense is that the RL does not trust the readers' judgment (despite the periodic remarks about our exercising our professional judgment), and so feels compelled to develop a rule to handle every circumstance. But it is a futile exercise to attempt to anticipate every circumstance, and the consequence of this attempt is that many of the rules, taken literally (as we readers were instructed to do), wind up being applied in ways that are unfair to the students. It seems to me that in its pursuit of consistency, the RL has lost sight of the goal of fairness. It was remarked to me that ETS is pleased that the Calculus exam scoring is the most consistent, statistically speaking, of all the AP exams. But there is another way to interpret this fact, and I am afraid that this interpretation is the correct one. It is that the RL has taken the goal of consistency too far. I must ask, what is the point of scoring exams consistently, if they are scored consistently wrong? I must also report a common remark I have heard other readers make at meal times, one which I doubt is made within earshot of RL members. It is, in so many words, that we are trying to be consistent, not fair.

Of course, the main problem with this is its effect on the students, but the RL should also consider its effect on the readers. Does the RL really want to encourage the cynical attitude reported above? Also, this mistrust is demeaning to the readers. I would never volunteer to serve in a capacity where my judgment was not trusted, and if I had known this was going to be the case, I never would have participated in the reading. There are many repeat readers, but there are also many readers who do not serve out their terms, and I suspect this mistrust is a contributing factor. I suspect this is particularly true of college faculty, who are used to operating with a certain degree of autonomy, and find it unpleasant to serve in a situation where this autonomy is taken away from them.

Let me now turn to a detailed discussion of the scoring rubrics themselves. The most important point I want to make is that they were simply unfair and wrong. They compelled us not to grant credit for correct answers, and to grant credit for nonsense. Both of these are out and out indefensible. But there were many other errors of judgment in the rubrics as well.

Note: To follow the discussion below more easily, the reader should consult the AP's web page giving the questions and scoring rubrics for the 2001 calculus exam.

Point 1. Question AB-2/BC-2 part (c): P(t)=20+10te^(-t/3). Find P'(12). Using appropriate units, explain the meaning of your answer in terms of water temperature.

Consider the following three answers:
Student A. P'(12)=-.549. Thus the water temperature on day 12 is changing at a rate of -.549 degrees/day.
Student B. P'(12)=-.549. Thus the water temperature on day 12 is decreasing at a rate of .549 degrees/day.
Student C. P'(12)=-.549. Thus the water temperature on day 12 is decreasing at a rate of -.549 degrees/day.

The scoring rubric for this question did not test students' mathematical abilities, but rather their psychic powers. To be sure of credit for this problem, a student had to be both clairvoyant and telepathic, and read the minds of the people who would be composing the scoring rubric several months after the exam in order to divine that they wanted him/her to mention the word "decreasing" in the answer. Thus student A, who gave a correct answer, correctly reasoned, was marked wrong. Student B, who mentioned the magic word "decreasing" was marked right. Student C, whose answer is technically wrong, was also marked right, for mentioning the magic word. This is a joke, except that it's not funny.
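The value of P'(12) that all three students report can be checked numerically. The following is a minimal Python sketch, assuming the model P(t)=20+10te^(-t/3) stated in the question; the function names and the step size h are illustrative choices, not part of the exam or the rubric.

```python
import math

def P(t):
    """Pond temperature model from the question: P(t) = 20 + 10*t*e^(-t/3)."""
    return 20 + 10 * t * math.exp(-t / 3)

def P_prime(t):
    """Derivative by the product rule, factored: 10*e^(-t/3) * (1 - t/3)."""
    return 10 * math.exp(-t / 3) * (1 - t / 3)

# Central-difference estimate as an independent check on the formula.
h = 1e-6
numeric = (P(12 + h) - P(12 - h)) / (2 * h)

rate = P_prime(12)
print(f"P'(12) from the product rule:   {rate:.3f} degrees/day")
print(f"P'(12) from central difference: {numeric:.3f} degrees/day")
```

A rate of change of -.549 degrees/day and a decrease of .549 degrees/day describe the same quantity, which is why the statements of Students A and B are mathematically equivalent.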

Point 2. Question AB-2/BC-2 part (c): P(t)=20+10te^(-t/3). Find P'(12).

Consider the following two answers:
Student D. P'(t) = 10e^(-t/3)+10t(-1/3)e^(-t/3), so P'(12)=-.54.
Student E. P'(t) = 10(-1/3)e^(-t/3), so P'(12)=-.183.

Note that student D has correctly found the derivative, but has only reported the answer to two decimal places. Student E believes that the derivative of a product is the product of the derivatives, and has reported a wrong answer to three decimal places. Clearly student D has a much better knowledge of calculus than student E, yet the rubric grades both students exactly the same. (In fact the rubric grades both exactly the same, no matter how badly the student is mistaken about how to find the derivative.) Thus the rubric clearly fails to measure the student's knowledge here.

Point 3. Question AB-3/BC-3 part (d): At what times in the interval 0<=t<=18, if any, is the car's velocity equal to zero? Justify your answer.

The scoring rubric gave a point for the correct answer (that the car's velocity is never zero) and a point for the correct reason. To be precise, it gave a point for the correct answer regardless of the reason. This meant that students who had no idea what they were talking about got a point for guessing the right answer, or for arriving at the right answer by completely wrong reasoning. Anyone who actually read students' books (and this includes the people who drew up the rubric for this question) was flooded with answers that read "The car's velocity is never zero" followed by a reason that showed they did not understand velocity, or acceleration, or the connection between the two. A typical reason was something like "because it always has an acceleration and if it had a zero velocity, it couldn't have an acceleration" (except that my formulation is relatively clear compared to what I often read). Thus we were forced to award a point for complete nonsense, also a bad joke.

Point 4. Question BC-1 part (d): Find the position of the object at time t=3.

The correct answer is x(3)=4+the integral from 2 to 3 of cos(t^3)dt. The rubric assigned 2 points for finding x, and two points for finding y, which was similar. However, the points were not 1 point for the above correct expression for x, and another point for evaluating it correctly (to 3.953), but rather 1 point for writing "the integral from 2 to 3 of cos(t^3)dt" (of course in mathematical notation) and 1 point for proceeding further. Thus a student who wrote x(3)=the integral from 2 to 3 of cos(t^3)dt got a point, even though that answer is incorrect, as did one who wrote x(3)=the integral from 2 to 3 of cos(t^3)dt-4, for example. I don't understand this. Why are students awarded credit for work that is simply wrong?

Point 5. Question BC-1 part (d):

The question gave dx/dt=cos(t^3) and x(2)=4, and similarly for y. The question was deliberately formulated with expressions for dx/dt and dy/dt that don't have closed-form primitives. Yet the rubric stated that any student who "solves" for x and plugs in for the initial condition, and also for y, gets 1 point (out of a possible 4). In other words, the student who writes x(t)=sin(t^3)/(3t^2)+c (a popular choice for the antiderivative of cos(t^3)) and then sets 4=sin(8)/12+c, and does the same for y, gets a point. Clearly part of the point of the original problem was that students need to recognize that they need to do numerical integration on their calculator here ("appropriate technology"), yet students were again awarded (partial) credit for totally incorrect answers. (I can assure you that during the grading of this problem, the words "nonsense" and "garbage" were frequently muttered by several readers in my room.)
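That the "popular" antiderivative is wrong can be confirmed by differentiating it. The sketch below, in Python, compares a central-difference derivative of sin(t^3)/(3t^2) with cos(t^3) at the sample point t=2; the helper names, the sample point, and the step size are assumptions made only for illustration.

```python
import math

def candidate(t):
    """The 'popular' but incorrect antiderivative sin(t^3)/(3t^2)."""
    return math.sin(t**3) / (3 * t**2)

def integrand(t):
    """dx/dt as given in the question."""
    return math.cos(t**3)

def derivative(f, t, h=1e-6):
    """Central-difference approximation to f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)

t = 2.0
gap = derivative(candidate, t) - integrand(t)

# By the quotient rule the gap is -2*sin(t^3)/(3*t^3), which is not zero,
# so the candidate does not differentiate back to cos(t^3).
expected_gap = -2 * math.sin(t**3) / (3 * t**3)

print(f"candidate'(2) - cos(8) = {gap:.5f}")
print(f"quotient-rule residual  = {expected_gap:.5f}")
```

The nonzero residual comes from differentiating the factor 1/(3t^2), which the incorrect "reverse chain rule" ignores.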

Point 6. Question BC-1 part (d):

A popular method of solving this problem, for which full credit was properly awarded, was to do it in two steps:

Step 1. Write x(2)=c+the integral from 0 to 2 of cos(t^3)dt, numerically integrate, and solve to get c=3.145 (=x(0)), and similarly for y.
Step 2. Write x(3)=3.145+the integral from 0 to 3 of cos(t^3)dt, numerically integrate to get .809, and then add to get an answer of 3.954, and similarly for y.

The briefing procedure is that after the scoring rubric is explained, we are given sample answers to score, to make sure we do it right. One of these was a sample which I graded a 9, as it was a perfect paper. To my astonishment, the briefer told us that was a 7, and the two points the student lost were on part (d). Here was the student's answer:

(integral from 0 to 3 of cos(t^3), integral from 0 to 3 of 3sin(t^2))

(.809+3.145, 2.321+2.586)

with an arrow pointing to 3.145 in the expression for x followed by the remark "using initial condition t=2 x=4" and with an arrow pointing to 2.586 in the expression for y followed by the remark "using initial condition t=2 y=5"

(3.954, 4.907)

The explanation for this deduction of two points was that the student had not written down the intermediate work for the numbers 3.145 and 2.586. When I heard this, I sat there shaking my head. I then asked the briefer "Can you suggest a plausible method, other than the correct one, by which the student arrived at 3.145 and 2.586?" to which the reply was simply that the student hadn't shown all work, so lost 2 points out of 4. It is clear to anybody in their right mind that this student knew exactly what (s)he was doing and solved the problem perfectly, and whether the student showed all work as the directions instructed or not, it is a complete travesty of justice not to award full credit for this answer. (Personally, this was the single most infuriating thing for me all week. I was pretty fed up by the time we got to this problem, but being instructed to award only 7 out of 9 points for a perfect solution that absolutely everybody knew full well was a perfect solution made me totally disgusted.)
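The equivalence of the two-step method and the direct computation x(3)=4+the integral from 2 to 3 of cos(t^3)dt can be checked numerically. The following Python sketch uses a hand-written composite Simpson's rule as a stand-in for a calculator's numerical integrator; the choice of Simpson's rule, the number of subintervals, and all names are assumptions for illustration, not what any particular calculator does.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd interior nodes get weight 4, even interior nodes weight 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def dxdt(t):
    """dx/dt as given in the question."""
    return math.cos(t**3)

# Direct method: x(3) = x(2) + integral from 2 to 3.
direct = 4 + simpson(dxdt, 2, 3)

# Two-step method: recover c = x(0) from x(2) = 4, then integrate from 0 to 3.
c = 4 - simpson(dxdt, 0, 2)
two_step = c + simpson(dxdt, 0, 3)

print(f"direct:   x(3) = {direct:.4f}")
print(f"two-step: x(3) = {two_step:.4f}  (with c = x(0) = {c:.4f})")
```

Both routes compute the same quantity, since the integral from 0 to 3 minus the integral from 0 to 2 is the integral from 2 to 3; any difference in the last reported digit comes from rounding the intermediate values.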

Point 7. Question AB-4/BC-4 (a): Given h'(x)=(x^2-2)/x, find all values of x for which the graph has a horizontal tangent,...

As part of the solution for this problem, students were given 1 point for a correct analysis, which often included a "sign chart". Here is a correct sign chart:

           -       +       -       +
h'(x)  --------+-------+-------+--------
           -sqrt(2)    0    sqrt(2)

We were instructed to deduct this point if the student mislabelled the sign chart, i.e., wrote f'(x) instead of h'(x), or failed to label the sign chart. The only function in this entire problem is h. There is no f. Also, at this point, the student is clearly looking at h', not h'' (which has yet to be computed) or h (which is never even given). Thus it is clear what the student meant. What is the point of deducting credit? (Let me remark that at the "Conversation with the Test Development Committee", somebody suggested that all functions be labelled f so that this accidental mislabelling (from force of habit) doesn't occur. Thus this point clearly bothered a lot of readers.)
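The signs in the chart can be reproduced by evaluating h'(x)=(x^2-2)/x at one test point in each of the four intervals. A minimal Python sketch follows; the particular test points are an arbitrary choice made for illustration.

```python
import math

def h_prime(x):
    """h'(x) = (x^2 - 2)/x from the question; undefined at x = 0."""
    return (x**2 - 2) / x

# The points -sqrt(2), 0, sqrt(2) split the line into four intervals;
# pick one test point inside each interval.
r = math.sqrt(2)
test_points = [-2 * r, -r / 2, r / 2, 2 * r]
signs = ["+" if h_prime(x) > 0 else "-" for x in test_points]

print(signs)
```

Since h is the only function in the problem, the chart carries the same information whatever label the student attaches to it.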

Point 8. Question BC-6 (a): Find the interval of convergence of this power series.

This part was worth 4 points, one of which was awarded for properly checking both endpoints x=3 and x=-3. At x=3 the series is the sum, n running from 0 to infinity, of (n+1)/3, and at x=-3 the series is the sum, n running from 0 to infinity, of ((-1)^n)(n+1)/3. Here the problem with the rubric occurred in one of the samples we were given to grade after the briefer had left the room. (Actually, one of my fellow readers had raised this question with the briefer, but this point was overlooked then.) On one sample, checking the endpoint x=-3, the student wrote "Because the series does not converge because a_(n+1) is not less than a_n, the series will not converge" (admittedly not a model of clarity). Let's remember that many calculus texts write alternating series in the form ((-1)^n)a_n, so the student was undoubtedly referring to (n+1)/3 when (s)he wrote a_n. We were told to mark this answer wrong, and deduct a point. For clarification, I asked "What if the student had written because a_(n+1) is not less than a_n for each n?" and was told to still mark it wrong. But this is a correct argument, for if a_(n+1) >= a_n >= 0 for each n, and the terms are not identically zero, then a_n does not approach 0, so the series diverges. After some discussion, in our room we were told simply to defer this point unless and until it came up on someone's paper. This was hardly a full resolution of the matter.

Point 9. I was told that the RL realized that there was a big problem with the question of accuracy on this exam, with the directions on the exam about this point being unclear. In my over 20 years of teaching, I have occasionally made mistakes in posing questions on exams. When I do so, I grade the students generously on those questions, on the grounds that they should not have to suffer for my mistakes. Thus, what would have been reasonable for the RL to do would have been to relax the standard of accuracy in scoring this year's exam. Instead the RL adopted the strategy of trying to figure out all possible answers consistent with the directions, and giving credit only for those. There were over 180,000 students taking the exam, so, predictably, the RL missed some possibilities. This resulted, for example, in the readers being instructed to accept only the answer 25.757 to three decimal places for AB-2/BC-2 part (d), until late in the week when the RL found out that some calculators would give 25.758, at which point the rubric was changed. Also, the rubric specified that the answer to part (c) of this question had to be in the interval [-.550,-.549]. In one of the books I graded, a student, correctly following directions, which allowed intermediate round-offs, arrived at an answer of -.554, having obtained the answer by an unusual method. Fortunately, this student had shown all the intermediate steps in the computation, which was not required by the instructions, so in this case I marked it right even though it did not conform to the rubric. I cannot understand why the RL was so hard-nosed on this matter, but it is, sadly, consistent with the grading of (c) on what the RL wanted to see, rather than on what the question asked.

Also I would note, as I am sure the RL realized, but for some reason never decided to adopt, that a standard of n significant figures makes sense, while one of n digits to the right of the decimal point does not.
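The difference between the two standards can be illustrated with a short Python sketch. The values 25.757 and -0.549 are answers quoted in this report; 0.000549 is a hypothetical small value added only for contrast, and the helper names are illustrative.

```python
def to_decimals(x, places=3):
    """Format x with a fixed number of digits after the decimal point."""
    return f"{x:.{places}f}"

def to_sig_figs(x, figures=3):
    """Format x with a fixed number of significant figures."""
    return f"{x:.{figures}g}"

# Two answers from the exam plus one hypothetical small value.
for value in (25.757, -0.549, 0.000549):
    print(f"{value!r:>10}  3 decimals: {to_decimals(value):>8}"
          f"  3 sig figs: {to_sig_figs(value):>9}")
```

A fixed number of decimal places demands five significant figures of 25.757 but keeps only one of the hypothetical 0.000549, whereas a fixed number of significant figures asks for the same relative precision in each case.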

I note that in question AB-2/BC-2 the Test Development Committee quite reasonably specified the temperature in the pond to the nearest degree. (It would make physical sense to measure the temperature to the nearest tenth of a degree, but there were six measurements, so if that were the case it would literally be a 1 in 1,000,000 chance that all measurements came out to be integers.) Thus, in this case, forming a model and then using that model to conclude something about the average temperature, or rate of change of temperature, to the nearest thousandth of a degree, or thousandth of a degree per day, as specified in the RL's rubric for this question, is meaningless. Furthermore, it is a priori meaningless to talk about the temperature in a pond (as opposed to in a physics lab, say) to within a thousandth of a degree. I would say that someone who reports their model as giving an average temperature of "around 25.75 degrees" or "between 25.7 and 25.8 degrees" actually understands the problem better than someone who reports an average temperature of "25.757 degrees". I am not proposing that calculus classes spend much time on accuracy of measurements--there are many more mathematically important things to do--but the RL is not composed of beginning calculus students, and I would expect the RL to show more sense.

So far I've addressed the problems. Now let me propose some first steps towards their solution. I would like to suggest three axioms for AP Calculus scoring.

Axiom 1. An answer which is mathematically correct, correctly expressed, and supported by correct reasoning, shall be given full credit.

Axiom 2. An answer, correct or incorrect, the reasoning for which is totally specious, shall be given no credit.

Frankly, I find it unbelievable that the AP Calculus program could have existed for more than 45 years without having adopted these axioms, but that is the case. Note that Axiom 1 would have made point 1 above impossible, and Axiom 2 would have made point 3 above impossible.

Axiom 3. Readers who, using their best professional judgment, are convinced beyond a reasonable doubt of the correctness of a student's solution may overlook harmless omissions and inconsequential errors and assign full credit for the solution.

Of course Axiom 3 directly contradicts the current RL philosophy of not allowing readers to exercise their judgment. (There is room for discussion of the proper standard here. I have given the standard of "beyond a reasonable doubt", which seems right to me. "Beyond the shadow of a doubt" seems too strong, and "by a preponderance of the evidence" seems too weak. This is a point on which reasonable people may differ.) Adoption of this axiom would make grading fairer, and show some respect for the readers. Note that this axiom implicitly leaves the determination of what constitutes "harmless omissions and inconsequential errors" to the individual readers. (Admittedly, there would be some problems, both for the RL and for long-time readers, in transitioning from pre-Axiom 3 to post-Axiom 3 grading, but the existence of transition problems is not a valid reason to stay with a bad system.) Note that applying Axiom 3 would have made point 6 above impossible (since in this instance there is not even the shadow of a doubt that the student got it right) and also point 7 impossible, as the error here is certainly inconsequential.

As the scoring rubric stands, students are not so much graded on their knowledge of Calculus as on the way they play the AP Calculus game. (To be sure, the two are correlated, but they are nevertheless distinct.) As many points may be deducted for not following the game's rules exactly or not jumping through the game's hoops, as for a lack of mathematical knowledge or understanding. Surely students' scores should reflect their mathematical capabilities rather than their game-playing abilities.

As things stand now, students may lose as many points for relatively trivial mistakes as for relatively serious ones. This is both unfair to the student and an inaccurate measurement of the student's mathematical attainment. Thus, as another, much needed, modification to the scoring system, I would suggest establishing a category of "minor mistakes". This category might include such items as copying errors and specifying answers to an insufficient number of decimal places. I would further suggest adopting one of the following two scoring systems:

a) Deduct 1/2 point for each minor mistake, and round the score up.
b) Deduct nothing for the first minor mistake, and 1 point for each minor mistake thereafter.

(Actually, these two systems would produce similar results, differing only in the cases of problems with at least 3 minor mistakes.) On a 9 point problem, each point is worth 11% of the grade, so the deduction of a point for the most minor mistakes is far out of proportion, and does not allow discrimination between such minor mistakes and serious mathematical errors. (Consider the following: If this were a 100 point exam, and a student made one such error, would the appropriate grade be at least 95? If so, it should be rounded to a grade of 100, not 89.)
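The comparison of the two proposed systems can be tabulated directly. The following is a minimal Python sketch, assuming a 9 point problem with no other deductions and assuming "round the score up" means rounding up to the next whole point; the function names are illustrative.

```python
import math

def score_half_point(raw, minor_mistakes):
    """System (a): deduct 1/2 point per minor mistake, then round up."""
    return math.ceil(raw - 0.5 * minor_mistakes)

def score_first_free(raw, minor_mistakes):
    """System (b): first minor mistake is free, 1 point for each one after."""
    return raw - max(0, minor_mistakes - 1)

RAW = 9  # a 9 point problem with no other deductions (assumption)
table = [(k, score_half_point(RAW, k), score_first_free(RAW, k))
         for k in range(5)]

for k, a, b in table:
    print(f"{k} minor mistakes: system (a) = {a}, system (b) = {b}")
```

Under these assumptions the two systems agree for zero, one, or two minor mistakes and first differ at three, where system (a) is the more lenient.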

The AP Calculus program is an important program. The free-response questions on the 2001 AP Calculus exam were good questions. The scoring rubrics for these questions were terrible.

In conclusion: The AP Calculus scoring system is badly broken. It needs to be fixed--not tinkered with, but fixed.