News that DCPS isn’t renewing Wilson High Principal Pete Cahall’s contract at the end of the school year launched cries of protest from parents, and discussions about how DC and other school systems evaluate teachers, schools and principals. One such discussion on the Chevy Chase listserv resulted in this post by Walter Rosenkrantz. He gave us permission to republish it here.
Using Students’ Performance on High Stakes Tests to Evaluate the Teacher, the Principal and the School: A Mathematician’s Perspective
This note was prompted by an earlier [Chevy Chase listserv] post, “DC’s testing and evaluation system of teachers and principals is a mess,” and in particular the story of Lynn Main, the former principal of Lafayette, who was feted at the Kennedy Center two years ago for being one of the top principals in the city and who was rated “Ineffective” the following year.
How is this possible? Short answer: it is not only possible but inevitable, with mathematical certainty, as I will now explain.
Reading the story about Lafayette principal Lynn Main, I was reminded of that curious phenomenon in modern finance known as the “performance chasing investor.” These are the investors who pile into hot stocks, mutual funds, and IPOs, while ignoring the well-meaning and accurate advice that “past performance is no guarantee of future results.” Not surprisingly, this strategy frequently fails, because there is a strong element of chance in the performance of the stocks and bonds a mutual fund manager selects, over which he has no control. This explains why only 10% of mutual fund managers ranked in the top quarter of their peers maintain that rank two years later.
“The overwhelming majority of fund managers,” writes Jack Hough [Barron’s Business and Financial Weekly, Feb. 4, 2013], “fail to beat their benchmarks’ returns in most years, studies show, and picking ones with good past performance doesn’t necessarily help. Of 707 funds that were in the top 25% for returns in September 2010, just 10% remained there by September 2012, according to Standard & Poor’s. The financial performance of hedge fund managers is equally inconsistent. For example, eleven of the 25 top hedge fund managers on the so-called Alpha’s Rich List in 2007 are no longer on it five years later. That is, 44% of them drop out of the top 25, and eight of them no longer run hedge funds.”
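A back-of-the-envelope simulation makes the point. If each manager’s yearly return is a small fixed skill plus a large dose of luck (the skill-to-luck ratio below is an assumption for illustration, not an estimate from the S&P data), then membership in the top quartile barely persists from one period to the next:

```python
import random

random.seed(1)

N = 707          # number of funds, as in the S&P figure quoted above
SKILL_SD = 0.3   # assumed spread of true manager skill (illustrative)
LUCK_SD = 1.0    # assumed spread of period-to-period luck (illustrative)

# each manager's fixed skill, drawn once
skill = [random.gauss(0, SKILL_SD) for _ in range(N)]

def period_returns():
    # one period's return = fixed skill + fresh, independent luck
    return [s + random.gauss(0, LUCK_SD) for s in skill]

def top_quartile(returns):
    # indices of the N//4 managers with the highest returns
    cutoff = sorted(returns, reverse=True)[N // 4 - 1]
    return {i for i, r in enumerate(returns) if r >= cutoff}

top1 = top_quartile(period_returns())
top2 = top_quartile(period_returns())

stayed = len(top1 & top2) / len(top1)
print(f"fraction of top-quartile managers still top-quartile: {stayed:.0%}")
```

Under this model, persistence hovers near the pure-chance level of 25%, and modest skill raises it only slightly: a past top-quartile rank carries almost no information about future rank.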
The unreliability, inconsistency, and surprising variability of such evaluations, whether of mutual fund managers, hedge fund managers, or teachers rated mainly on their students’ performance on high-stakes tests, would not have surprised W. Edwards Deming (1900–1993), who was originally trained as a physicist but employed as a statistician, and who became a highly respected, if not always welcome, consultant to America’s largest corporations.
“The basic cause of sickness in American industry and resulting unemployment,” he argued, “is failure of top management to manage.”
Among the poor management practices he criticizes most sharply is performance evaluation of employees.
“Fair rating is impossible,” writes Deming. “A common fallacy,” he continues, “is the supposition that it is possible to rate people; to put them in rank order of performance next year, based on performance last year.” [Out of the Crisis, MIT Press, 1982, pp. 109–112]
Let us now apply Deming’s insights to the high-stakes testing regime in grades 3–12, required by the No Child Left Behind (NCLB) law, which measures not only the performance of the students but that of their teachers as well. Currently, one of the most popular methods for evaluating teachers is the family of so-called value-added models (VAM), a methodology for measuring teacher effectiveness based on complex statistical techniques originally developed to analyze data sets arising in agriculture and in industrial quality control.
The problems with VAM as a teacher evaluation tool have been skillfully summarized by John Ewing, who writes, “…it is essential to ask whether the results [from VAM] are consistent year to year. Are the computed teacher effects comparable over successive years for individual teachers? Are value-added models consistent?” [John Ewing, “Mathematical Intimidation: Driven by the Data,” Notices of the American Mathematical Society, vol. 58, number 5, May 2011, pp.667-673]
Empirical data in a paper published by the Economic Policy Institute (EPI), “Problems with the Use of Student Test Scores to Evaluate Teachers,” suggests that the answer is no.
“For a variety of reasons, analyses of VAM results have led researchers to doubt whether the methodology can accurately identify more and less effective teachers. VAM estimates have proven to be unstable across statistical models, years, and classes that teachers teach. One study found that across five large urban districts, among teachers who were ranked in the top 20% of effectiveness in the first year, fewer than a third were in that top group the next year, and another third moved all the way down to the bottom 40%. Another found that teachers’ effectiveness ratings in one year could only predict from 4% to 16% of the variation in such ratings in the following year.

“Thus, a teacher who appears to be very ineffective in one year might have a dramatically different result the following year. The same dramatic fluctuations were found for teachers ranked at the bottom in the first year of analysis. This runs counter to most people’s notions that the true quality of a teacher is likely to change very little over time and raises questions about whether what is measured is largely a ‘teacher effect’ or the effect of a wide variety of other factors.” [Eva L. Baker et al., “Problems with the Use of Student Test Scores to Evaluate Teachers,” Economic Policy Institute Briefing Paper #278, August 29, 2010, Washington, DC.]
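The EPI findings fit a simple measurement model: each teacher has a perfectly stable true effect, but each year’s score adds substantial classroom-level noise. The variances below are assumptions chosen only so that the year-to-year predictive power lands in the 4%–16% range the paper reports; they are not estimates from real data:

```python
import random

random.seed(2)

N = 5000          # teachers
TRUE_SD = 0.55    # assumed spread of stable teacher quality
NOISE_SD = 1.0    # assumed spread of yearly classroom-level noise

quality = [random.gauss(0, TRUE_SD) for _ in range(N)]
year1 = [q + random.gauss(0, NOISE_SD) for q in quality]
year2 = [q + random.gauss(0, NOISE_SD) for q in quality]

def corr(x, y):
    # Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov / (vx * vy) ** 0.5

r = corr(year1, year2)
explained = r * r  # share of year-2 variation predicted by year-1 ratings

k = N // 5  # the top 20%
top1 = set(sorted(range(N), key=lambda i: year1[i], reverse=True)[:k])
top2 = set(sorted(range(N), key=lambda i: year2[i], reverse=True)[:k])
retained = len(top1 & top2) / k

print(f"variance in ratings explained year-to-year: {explained:.0%}")
print(f"top-20% teachers still top-20% next year: {retained:.0%}")
```

Note that teacher quality never changes in this model; the churn in the rankings comes entirely from the noise the rating fails to separate out.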
In other words, the predictive power of annual rankings of mutual fund managers, hedge fund managers, and teachers is quite poor. In particular, these methods of evaluating teachers’ effectiveness give little or no weight to socioeconomic factors. The crucial question, emphasized by Deming, is how much of the variation in a teacher’s performance is due to the teacher, and how much to the system they are working in.
Consider, for example, the 2011 test score data of the Washington DC Comprehensive Assessment System exams given annually to students in grades 3 through 8. In Ward 3, the city’s wealthiest area, where the median household income is $97,960, the reading proficiency pass rate for the elementary schools is 84%, while in Ward 8, the city’s poorest area, where the median household income is $31,188, the pass rate is 28%. Is it any surprise, then, that 35% of the teachers in Ward 3 were rated highly effective, and only 5% in Ward 8?
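Deming’s question, how much of the variation is due to the teacher and how much to the system, can be made concrete with a toy simulation. Here two wards contain statistically identical pools of teachers, but a ward-level effect drives the pass rates on which the rating is based. All numbers below are invented for illustration; they are not the actual DC CAS figures:

```python
import random

random.seed(3)

N = 200  # teachers per ward
# assumed ward-level effects on student pass rates (invented, not DC data)
WARD_EFFECT = {"Ward 3": 0.30, "Ward 8": -0.25}
BASE = 0.55              # assumed citywide baseline pass rate
TEACHER_SD = 0.05        # spread of genuine teacher effects
HIGHLY_EFFECTIVE = 0.80  # assumed rating threshold on the pass rate

shares = {}
for ward, effect in WARD_EFFECT.items():
    # the same distribution of teacher quality in both wards
    rates = [BASE + effect + random.gauss(0, TEACHER_SD) for _ in range(N)]
    shares[ward] = sum(r >= HIGHLY_EFFECTIVE for r in rates) / N
    print(f"{ward}: {shares[ward]:.0%} of teachers rated highly effective")
```

Identical pools of teachers end up with wildly different shares of “highly effective” ratings, because the rating tracks the ward, not the teacher.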
Walter Rosenkrantz is an Emeritus Professor of Mathematics (Univ. of Massachusetts-Amherst). He and his wife moved to DC – and this neighborhood – in 2004.