An Article by Chris Carter
The Case Against Standardized Tests
 

"The not-for-profit are different from you and me. Tennis courts, a swimming pool, a baseball diamond, a croquet lawn, a private hotel, 400 acres of woods and rolling hills, cavorting deer, a resident flock of Canada geese – I’m loving every minute here at the Educational Testing Service, the great untaxed, unregulated, unblinking eye of the American meritocracy."

David Owen - Chapter 1 of None of the Above, 1999

Standardized testing is big business. Every year Americans spend hundreds of millions of dollars on the tests they are required to write in order to be evaluated for admission to undergraduate and graduate programs, and many millions more on coaching schools in an attempt to raise scores. The testing companies, especially ETS, play a major role as gatekeepers to American higher education.

How valid are test scores as predictors of grades? Do they have any validity as predictors of actual accomplishment? Are the tests biased against certain members of society? This essay will review the extensive critical literature on the subject of standardized tests in an attempt to answer these questions.

The testing companies claim that test scores are useful for predicting grades. However, as the companies themselves admit, a substantial body of research indicates that previously earned grades are the best single predictor of future grades. Standardized tests such as the GMAT and the SAT are designed only to predict first-year grades, and their predictive power is not impressive.

The degree of correlation between two variables, such as test scores and grades, is measured by a statistic called the correlation coefficient, which ranges in value from -1 to 1. A value of 1 indicates perfect positive correlation, and a value of zero indicates no correlation. The proportion of variation in one variable that is explained by variation in the other is given by the correlation coefficient squared, called "r-squared." Another interpretation of r-squared is the degree of improvement in prediction over pure guesswork that we gain by using one variable to predict the other.
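
For readers who want to see these statistics concretely, here is a minimal sketch in Python. The data are invented for illustration only; the point is simply how a correlation coefficient and its square are computed and read.

    # Illustrative only: made-up test scores and first-year GPAs for ten students.
    import numpy as np

    scores = np.array([450, 520, 580, 610, 640, 660, 690, 720, 760, 790])  # hypothetical SAT-style scores
    grades = np.array([2.1, 3.0, 2.4, 2.9, 3.3, 2.7, 3.1, 3.6, 3.0, 3.5])  # hypothetical first-year GPAs

    # Pearson correlation coefficient: 1 is perfect positive correlation, 0 is none.
    r = np.corrcoef(scores, grades)[0, 1]

    # r-squared: the proportion of the variation in grades explained by variation in scores.
    r_squared = r ** 2

    print(f"r = {r:.2f}, r-squared = {r_squared:.2f}")

A correlation of .5, for example, yields an r-squared of only .25: the scores account for a quarter of the variation in grades and leave the other three quarters unexplained.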

The SAT has the most predictive validity of the tests,1 with correlation coefficients ranging from .2 to .5 at most (r-squared ranging from .04 to .25). The correlation between weight and height is about .5. What sort of basketball team do you think you would have if all members were chosen only on the basis of their weight?

But of course no school admits applicants on test scores alone. Previously earned grades are usually combined with scores, and no school is going to stop requiring applicants to submit grades. So the real question is: how much do predictions improve when test scores are added to grades?

Crouse and Trusheim, authors of The Case Against the SAT, argue that the improvement is so small as to be meaningless. If first-year grades are used as the measure of success, their figures show that using both class rank and SAT scores produces only 1 to 3% fewer errors in prediction than using class rank alone. If graduation from university is the standard, adding the test scores makes less than a 1% difference. 2
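
The kind of comparison Crouse and Trusheim make can be sketched in a few lines of code. The toy example below, written against invented data, fits one least-squares model that predicts first-year grades from class rank alone and a second that adds an SAT-style score, then compares the prediction errors. The data, the variable names, and the size of the improvement are all fabricated for illustration; only the method mirrors the question posed above.

    # Toy incremental-validity check on synthetic data (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000

    # Invented, standardized variables: class rank, an SAT-style score correlated
    # with rank, and first-year grades driven mostly by rank.
    rank = rng.normal(0, 1, n)
    sat = 0.7 * rank + rng.normal(0, 0.7, n)
    grades = 0.6 * rank + 0.1 * sat + rng.normal(0, 0.8, n)

    def prediction_error(predictors, outcome):
        """Fit a least-squares model (with intercept) and return its root-mean-square error."""
        X = np.column_stack([np.ones(len(outcome))] + predictors)
        beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
        return np.sqrt(np.mean((outcome - X @ beta) ** 2))

    print("error, rank alone:    ", round(prediction_error([rank], grades), 3))
    print("error, rank plus SAT: ", round(prediction_error([rank, sat], grades), 3))

With these made-up parameters the added score buys only a marginal reduction in prediction error, which is the shape of Crouse and Trusheim’s argument, not a reproduction of their figures.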

Balance this tiny improvement in accuracy against the enormous cost imposed on students who write the SAT (currently $24), the GRE (currently $99), and the GMAT (currently $190), and the cost of test preparation (coaching schools charge between $400 and $1200). If admissions offices had to pay the fee required to write the test, would they show greater interest in test validity?

And do the tests have any validity as predictors of actual accomplishment? Harvard psychologist David McClelland writes:

"Thorndike and Hagen (1959), for instance, obtained 12,000 correlations between aptitude test scores and various measures of later occupational success on over 10,000 respondents and concluded that the number of significant correlations did not exceed what would be expected by chance. In other words, the tests were invalid... Holland and Richards (1965) and Elton and Shevel (1969) have shown that no consistent relationships exist between scholastic aptitude test scores in college students and their actual accomplishments in social leadership, the arts, science, music, writing, and speech and drama." 3

A more recent review of the literature appeared in a 1985 issue of the journal Research in Higher Education. Over eighty pages in length, it is one of the most exhaustive literature reviews on the question of test validity.

The author, Leonard Baird, focused on studies completed between 1966 and 1984 and reported in any of nineteen highly regarded scholarly journals. In study after study, many of the reported correlation coefficients were zero or near zero, and some studies even showed significant negative coefficients. Most striking, many of these negative correlations appear in studies of the relationship between test scores and the number of publications and citations for graduates of PhD programs. For instance:

"Clark and Centra studied two samples of doctoral recipients… The resulting sample consisted of 239 chemists, 142 historians, and 221 psychologists, all of whom had at least one GRE score. In chemistry, the correlation of number of articles and book chapters with GRE-verbal was -.02; with GRE-quantitative it was -.01; and with GRE-advanced it was .15… For all historians, these correlations were -.24, -.14, and .00. For all psychologists, the correlations were -.05, -.02, and .02.

Clark and Centra also examined the distribution of number of publications by GRE scores. The distributions were essentially flat, with no particular trend. In fact, the largest number of publications was reported by the lowest scoring groups in all three fields (emphasis added)." 4

Another study mentioned in the review used the number of citations to each sample member’s works as the criterion for 6,300 doctorate recipients in mathematics/statistics, physics, chemistry, biochemistry, and psychology. Only the correlation in physics was significantly different from zero, at an impressive .10. 5 A small study of 47 PhD alumni of the industrial relations program at Carnegie-Mellon University used an index of research publications as the criterion (essentially the number of publications adjusted for quality). Baird writes that "Scores on standardized tests (GRE and the Admission Test for Graduate Study) did not discriminate within the range covered by the sample." 6

Correlations of zero and near zero may not surprise those of us who have always been skeptical of the value of multiple-choice tests. But what could explain the negative correlations? Research in cognitive psychology provides some intriguing suggestions. Some researchers have divided the thinking process into at least two levels: a surface level concerned mostly with retrieving information, and a deep level involving the synthesis and analysis of a variety of sources of information in order to interpret that information, solve a complicated problem, or create something new. A 1994 study that examined the thinking styles of 530 students and their performance on the SAT suggests that standardized tests may penalize students who tend to favor deeper approaches to problem solving.

The researchers found that the group that scored highest on the SAT tended to use more superficial thinking strategies than those who scored in the low and moderate ranges, and that the lowest-scoring students employed the deep approach more often than the higher-scoring students.7 Of course some high-scoring individuals may be extraordinarily capable, and may possess some of the important qualities the tests fail to detect. But these studies strongly suggest that standardized tests fail to measure the qualities that are truly important, reward the ability to adopt a superficial style of thinking, and may in fact penalize many of the candidates with the deepest minds.

This criticism of standardized tests is not new. Banesh Hoffmann, professor of mathematics and former collaborator with Albert Einstein, made exactly this point in his 1962 book The Tyranny of Testing. According to Dr. Hoffmann, it is the multiple-choice format that is to blame. "Multiple choice tests penalize the deep student, dampen creativity, foster intellectual dishonesty, and undermine the very foundations of education," he remarked in a 1977 interview. 8

What is it about multiple-choice tests that penalizes the finer mind? Occasionally, individual questions are defective, with the wanted answer, or all of the answers, being incorrect. More frequently, questions are ambiguous, so that more than one answer may be defended as plausibly being 'the best', and only those candidates with deep minds are likely to notice the ambiguity and be troubled by it. However, according to Dr. Hoffmann:

"It is not the presence of defective questions that makes multiple-choice tests bad. Such questions merely make them worse. Even if all the questions were impeccable, the deep student would see more in a question than his more superficial competitors would ever dream was in it, and would expend more time and mental energy than they in answering it. That is the way his mind works. That is, indeed, his special merit. But the multiple-choice tests are concerned solely with the candidates choice of answer, and not with the reasons for his choice. Thus they ignore that elusive yet crucial thing we call quality." 9

The Myth of Objectivity

The test makers call their multiple-choice tests 'objective' and would have us regard objectivity as a virtue. But the term 'objective', when applied to the tests, is really a misnomer. The objectivity resides not in the tests as a whole but merely in the fact that no subjective element enters the grading process once the key has been decided upon. Yet the choice of questions to ask, the topics to cover, and the choice of format (multiple-choice as opposed to essay-answer) are all subjective decisions. All 'objective' means, in the narrow technical sense, is that the same mark will be awarded no matter who grades the test: the chosen answer is simply judged 'correct' or 'incorrect' according to the key, no argument or rationale is permitted, and the grading can be done by computer. In this sense, all multiple-choice tests are "objective."

But it is important to realise that saying a test is "objective" does not mean that the questions are relevant or unambiguous; nor does it mean that the required answers are correct or even "the best." Even more important, calling the tests "objective" does not mean that the tests are not biased. As discussed above, standardized tests may discriminate against many of the best candidates. It is more generally accepted that these tests are biased against women, minorities, and the poor.

Bias can take many different forms. For women, test scores underpredict grades: although women tend to score lower on standardized tests, they tend to earn higher grades in college. 10 At least one study has found that scores also underpredict grades for Hispanic students. 11 Bias against black students takes a different form. Although there is no clear evidence that test scores consistently underpredict the grades of black students, the scores appear to be far less reliable predictors for black students. In other words, even more errors in prediction will be made for black students than for white students. This form of bias is known as differential validity.

Differential validity means that the tests do a better job of predicting grades for some groups than for others. Attorney Andrew Strenio, in his 1981 book The Testing Trap, mentions the case of Larry P. v. Riles, which was fought because IQ test scores were being used to place a disproportionate number of black school children in remedial classes. Judge R.F. Peckham of the Northern District of California issued his ruling in October 1979. Strenio writes:

"Judge Peckham cited two studies of the relation of IQ scores to grades. The studies found a correlation (known as the r-value) of IQ scores to grades for white children of .25 in one case and .46 in the other. Those are low r figures to start with. But the r-values for the same test for blacks were even smaller: .14 and .20 in the two instances. In other words, to the limited extent these tests were able to predict, they did a better job on white children than black children. Judge Peckham wrote, ‘Differential validity means that more errors will be made for black children than for whites, and that is unacceptable.’" 12

Authors William Bowen and Derek Bok, in their recent book The Shape of the River, also find that grades and test scores have less predictive validity for blacks than for whites. They find that for all students, an additional 100 points of combined SAT score is associated, on average, with a modest improvement of only 5.9 percentile points in class rank. However,

"The relationship between SAT scores and predicted rank in class is , however, even ‘flatter’ for black students than it is for all students: an additional 100 points of combined SAT score is associated with a class rank improvement of only 5.0 points for black students." 13

Are the tests biased against the poor? Well, it depends on what you mean by "bias." The poor certainly do not score as highly on average as wealthy students. Over the last forty years SAT scores have been positively correlated with family income. Here is the relationship as of 1994: 14

Family Income        Average SAT Score
$30 - $40K           885
$50 - $60K           929
$70K +               1000

So the SAT appears biased against the poor in the sense that the poor tend to score lower and therefore will be less likely to be admitted to the college of their choice. But, as is the case with black applicants, the test scores may have less predictive validity for the poor.

Chuck Stone, former director of minority affairs at ETS, has testified that the value of the predictions about college performance varies according to the level of the score itself. Stone illustrated this point by saying that while the SAT-Verbal validity coefficient is .48 for test takers scoring in the 90th percentile, the coefficient is only .17 for students in the 10th percentile.15 So it seems reasonable to conclude that the test is a less reliable predictor for the poor.

And the situation may be even worse. Some coaching schools claim that they can raise scores by as much as 250 points, and their claims that coaching works have been verified by a number of independent studies.16 Given that the cost of coaching runs from $500 to $1200, the existence of effective coaching schools puts the poor at an even greater disadvantage.

Not surprisingly, the testing companies deny that coaching can be more than marginally effective. ETS has said in official statements that "particular groups of students or particular programs have achieved average score gains as high as 25 – 30 points"17 (emphasis added). This figure bears no relation whatsoever to the impressive gains from coaching reported in several independent studies. For instance, J.P. Zuman presented a paper at the annual meeting of the American Educational Research Association in April 1988. Using research from a Harvard University doctoral dissertation, he substantiated a 110-point average increase in SAT scores after coaching by the Princeton Review.18 Another study, by the Federal Trade Commission, found an average 50-point increase in scores from coaching and concluded that ETS and College Board materials for students did not accurately describe the real possibility of meaningful score gains from coaching.19

So coaching adds to testing bias against the poor. John Katzman, founder of The Princeton Review, puts it this way: "Most of our kids are wealthy. Those are the kids who have an advantage to begin with. And we’re moving them up another level."20

The Truth about ETS

ETS rightly recognizes that if coaching is effective, the value and validity of its tests are compromised. If short-term coaching works, then the wealthy have an even greater advantage over the poor, and the predictive validity of the tests becomes even more questionable. So why should ETS be interested in obscuring the truth?

ETS calls itself a "testing service" instead of a company, and describes itself as a "non-profit" organization. So it would seem that ETS has no economic incentive to promote the use of its tests. But nothing could be farther from the truth.

ETS is a revenue-hungry monopoly, probably the most powerful unregulated monopoly in America. Applicants to almost all undergraduate and graduate programs have no choice but to take its tests. ETS is indeed "non-profit" in the accounting sense that it has no shareholders. It was founded in 1947 with a grant from the Carnegie Foundation, and it pays no taxes. But ETS had revenues of $432 million in 1997, and these tax-free revenues support a very comfortable life and generous salaries for its over 2,000 employees. The current president, Nancy Cole, made $339,000 in 1996 and had the use of a manor house in Lawrenceville. Vice President Robert Altman received $358,000 that year; three other employees had salaries over a quarter of a million dollars, and 749 employees earned more than $50,000.

Forbes magazine called ETS "one of the hottest little growth companies in U.S. business" in 1976. In 1982 then-president Gregory Anrig started his term in office by commissioning a $500,000 strategic plan from the management consultants Booz, Allen & Hamilton. For the study, Anrig divided ETS employees into a dozen "revenue growth teams" charged with identifying new opportunities for profits. Later, Anrig issued a Corporate Plan calling for "corporate intelligence gathering, external relations and government relations focused to provide a positive climate and receptive clients for ETS marketing initiatives." Anrig’s plans for revenue growth seem to have come true: total sales increased 256% from 1980 to 1995, from $106 million to $378 million. By June 1997, ETS was sitting on cash reserves of $42 million, even after spending millions on new property, buildings, and equipment over the previous few years. 21

Incidentally, despite having a mailing address in Princeton, New Jersey, ETS has no connection with Princeton University. Its luxurious headquarters, including tennis courts, a swimming pool and a private hotel, are in Lawrence Township, not Princeton. The Princeton mailing address is merely for public relations.

And by the way, official denials that coaching is effective have not stopped ETS from running a nice line of business selling coaching material. ETS and the College Board now sell over 218 books and manuals on test preparation, such as 10 Real SATs ("the only book with real SATs!"), The Official Guide for GMAT Review ("The Official Guide for GMAT Review is the starting point if you are serious about being a competitive MBA candidate," the back cover reads), GRE: Practicing to Take the General Test, and many others.

Cost of Testing

Although some schools have stopped using standardized tests, the fact remains that the vast majority still require test scores. There are several reasons for this. First of all, from the schools’ perspective, the scores cost nothing and provide an easy way of sorting applicants. Examining evidence of actual accomplishment, such as samples of work, projects, and extracurricular activities, is much more time-consuming, and the use of cut-off scores to reduce the number of applications even considered allows colleges to make less expensive admissions decisions. Since colleges do not pay for these tests, they have little incentive to examine test validity.

Can this marginal benefit be balanced against the cost of testing? Test takers pay ETS over $300 million per year for the privilege of taking its tests, and those who can afford it spend another $100 million on coaching courses. And this may be only the tip of the iceberg. These figures do not include the cost of testing at all levels of the educational system, nor the opportunity costs borne when teachers spend class time drilling students in preparation for tests. When such costs are factored in, America’s annual expenditure on state and local testing programs is staggering. In a 1993 study, Walter Haney, George Madaus, and Robert Lyons estimated that American taxpayers spend as much as $20 billion annually in direct payments to testing companies and in indirect expenditures of time and resources devoted to taking and preparing for standardized tests. 22

Defenders of standardized tests often remark that the scores provide a common measure for applicants who come from widely different backgrounds. This is nonsense. Admissions officers already study individual high schools and colleges and adjust grades and class ranks accordingly. And an SAT score of 1100, for instance, does not mean the same thing for candidates from different backgrounds. It means something different depending on the applicant’s sex, race, and whether or not the applicant’s high school offers test preparation classes.

Some admissions officers will tell you that they are aware that the same score means different things for different applicants, and that they adjust scores accordingly. This is indeed a curious state of affairs; test scores that are meant to provide a common standard for applicants from different backgrounds are adjusted for differences in applicants’ backgrounds.

As Bok and Bowen conclude,23 admissions committees need to abandon their narrow preoccupation with predicting first-year grades and focus on admitting those applicants who are likely to contribute the most to their field and to society. Samples of work, references, statements of purpose, and extracurricular activities are all better indicators of future behavior than test scores.

How can we justify the continued emphasis on standardized test scores as a criterion for admission to any program? Should admissions offices be more concerned with intellectual curiosity, demonstrated ability to do research, and the ability to write and think critically?

 

Footnotes:

1. ETS has published studies showing that the SAT-GPA correlation is higher than the average correlation between scores on the GMAT and grades in business school. See ETS, Test Use and Validity (Princeton, N.J., ETS, 1980), page 16.

Also, according to ETS’s own data, the various GRE sub-tests (verbal, quantitative, analytical) do predict first-year grades, but the relationship is feeble. In studies of 1,000 graduate departments nationwide and 12,000 test takers, the GRE could account for just 9% of the variation in first year grades. In engineering departments, the GRE quantitative test explained 4% of the variation in grades. In graduate business schools, the GRE analytical test explained 6% of the variation in grades. See ETS, GRE Guide to the Use of Scores, 1998-1999 (Princeton, N.J., ETS).

In 1995 Todd Morrison and Melanie Morrison wrote an article in the journal Educational and Psychological Measurement based on their meta-analysis of twenty-two studies covering more than 5,000 test takers from 1955 through 1992. They found that the combined GRE verbal and quantitative score could explain just 6 percent of the variation of grades of graduate students. They wrote:

"The average amount of variance (in graduate grade point average) accounted for by performance on these dimensions of the GRE was of such little magnitude that it appears they are virtually useless from a prediction standpoint. When this finding is coupled with studies suggesting that performance on the GRE is age, gender, and race-specific … the use of this test as a determinant of graduate admission becomes even more questionable."

The above quote is from Todd Morrison and Melanie Morrison, "A Meta-Analytic Assessment of the Predictive Validity of the Quantitative and Verbal Components of the Graduate Record Examination with Graduate Grade Point Averages Representing the Criterion of Graduate Success," Educational and Psychological Measurement 55, no. 2, April 1995, pages 309-316.

2. James Crouse and Dale Trusheim, The Case Against the SAT (University of Chicago Press, 1988), pages 53-71.

Crouse and Trusheim write:

"When we use bachelor’s degree attainment as the yardstick, the results are even less impressive than when freshman grade success is the criterion. Indeed, correct forecasts increase only 0.1 per 100 by using the SAT with the 2.5 predicted GPA admissions standard and by 0.2 per 100 using the 3.0 predicted GPA admissions standard." (page 58)

And it is important to note here that these findings do not result from a restricted range in test scores. Crouse and Trusheim write:

"Our results do not, however, arise because of restricted ranges. Recently, ETS searched its Validity Study Service records for the College Board and found twenty-one colleges where the distributions of SAT scores and high school records are virtually identical to those for the over-all SAT taking population. In these carefully chosen colleges with unrestricted range for high school records and SAT scores, the optimal equation for predicting freshman grades using high school records and SAT scores is among the best we have seen…. If any data should show large benefits of the SAT, it should be these.

Yet they do not. … the gains in freshman grades for the students selected with the SAT only average 0.03 on a four-point scale, again almost identical to the gains we report above." (Ibid, page 67)

 

3. David McClelland, "Testing for Competence rather than Intelligence," in The IQ Controversy, edited by Block and Dworkin (Pantheon Books, 1976), page 49.

4. Leonard L. Baird, "Do Grades and Tests Predict Adult Accomplishment?" Research in Higher Education 23, no. 1, 1985, page 25.

5. Ibid, page 22.
6. Ibid, page 22.

7. Melissa Hargett et al., "Difference in Learning Strategies for High, Middle, and Low Ability Students Measured by the Study Process Questionnaire," presented at the Annual Meeting of the National Association of School Psychologists, Seattle, March 1994, ERIC Document 376 402.

8. The Myth of Measurability, edited by Paul Houts (Hart Publishing Company, 1977), page 202.
9. Banesh Hoffmann, The Tyranny of Testing (Collier Books, 1962), page 92.
10. Bridgeman, B., & Wendler, C., "Gender Differences in Predictors of College Mathematics Performance and in College Mathematics Course Grades," Journal of Educational Psychology, vol. 83, no. 2, 1991.

Clark, M.J., & Grandy, J., "Sex Differences in the Academic Performance of Scholastic Aptitude Test Takers," College Board Report 84-88, New York: College Entrance Examination Board, 1984.

11. Kanarek, E.A., "Gender Differences in Freshman Performance and their Relationship to Use of the SAT in Admissions," paper presented at the Northeast Association for Institutional Research Forum, Providence, RI, October 1988.

Rosser, P., Sex Bias in College Admissions Tests: Why Women Lose Out, 4th ed., Cambridge, MA: National Center for Fair & Open Testing, 1992.

Pearson, B., "Predictive Validity of the Scholastic Aptitude Test (SAT) for Hispanic Bilingual Students," Hispanic Journal of Behavioral Sciences, vol. 15, no. 3, August 1993.

12. Strenio, Andrew, The Testing Trap (Rawson Wade Publishers, 1981), page 203.
13. Bowen, William and Bok, Derek, The Shape of the River (Princeton University Press, 1998), page 75.
14. Owen, David, None of the Above, page 227.
15. Strenio, Andrew, ibid, page 135.
16. Federal Trade Commission. Staff Report on the Federal Trade Commission Investigation of Coaching for Standardized Admission Tests. Boston Regional Office, April 1981.
17. ETS, Taking the SAT, 1983, page 6.
18. Zuman, J.P. The Effectiveness of Special Preparation for the SAT: An Evaluation of a Commercial Coaching School. Paper presented at the annual meeting of the American Educational Research Association, April 1988.
19. Federal Trade Commission. Staff Report on the Federal Trade Commission Investigation of Coaching for Standardized Admission Tests. Boston Regional Office, April 1981.
20. Owen, David, None of the Above, page 133.
21. Sacks, Peter, Standardized Minds, page 228.
22. Walter Haney, George Madaus, and Robert Lyons, The Fractured Marketplace for Standardized Testing, (Boston: Kluwer Academic Publishers 1993), page 95.
23. Bowen, William and Bok, Derek, ibid, pages 276-286.

 

References:

The IQ Controversy, edited by Block and Dworkin (Pantheon Books, 1976).

Test Use and Validity, ETS (Princeton, N.J.: ETS, 1980).

The Testing Trap, Andrew Strenio (Rawson Wade Publishers, 1981).

"Sex Differences in the Academic Performance of Scholastic Aptitude Test Takers," M.J. Clark and J. Grandy, College Board Report 84-88 (New York: College Entrance Examination Board, 1984).

"Do Grades and Tests Predict Adult Accomplishment?" Leonard L. Baird, Research in Higher Education, vol. 23, no. 1, 1985.

The Case Against the SAT, James Crouse and Dale Trusheim (University of Chicago Press, 1988).

"Gender Differences in Freshman Performance and their Relationship to Use of the SAT in Admissions," E.A. Kanarek, paper presented at the Northeast Association for Institutional Research Forum, Providence, RI, October 1988.

"Gender Differences in Predictors of College Mathematics Performance and in College Mathematics Course Grades," B. Bridgeman and C. Wendler, Journal of Educational Psychology, vol. 83, no. 2, 1991.

Sex Bias in College Admissions Tests: Why Women Lose Out, 4th ed., P. Rosser (Cambridge, MA: National Center for Fair & Open Testing, 1992).

"Predictive Validity of the Scholastic Aptitude Test (SAT) for Hispanic Bilingual Students," B. Pearson, Hispanic Journal of Behavioral Sciences, vol. 15, no. 3, August 1993.

The Fractured Marketplace for Standardized Testing, Walter Haney, George Madaus, and Robert Lyons (Boston: Kluwer Academic Publishers, 1993).

"Difference in Learning Strategies for High, Middle, and Low Ability Students Measured by the Study Process Questionnaire," Melissa Hargett et al., presented at the Annual Meeting of the National Association of School Psychologists, Seattle, March 1994, ERIC Document 376 402.

None of the Above, David Owen (Rowman & Littlefield, 1999).

Standardized Minds, Peter Sacks (Perseus Books, 1999).

Chris Carter
