Debunking Intelligence Experts: Walter Lippmann Speaks Out

"There is nothing about an individual as important as his IQ," declared psychologist Lewis M. Terman in 1922. To the extent that this is true, it is in large measure because of Terman himself and the opportunity that World War I afforded for the first widespread use of intelligence testing. The army’s use of intelligence tests lent new credibility to the emerging profession of psychology, even as it sparked public debate about the validity of the tests and their implications for American democracy. The idea that experts could confidently assign a man to his proper place in the army—and by extension his place in life—suggested a kind of determinism that some found profoundly at odds with American democracy and its credo of upward mobility through hard work. Walter Lippmann, an influential political commentator and journalist, skewered the army intelligence tests in a series of six essays that appeared in the New Republic in 1922. He denounced as “nonsense” the claim that the average mental age of an American adult was fourteen years, and forcefully warned his readers of the danger of uncritical acceptance of IQ as destiny. He addressed the conditions of IQ testing, the possible biases of army intelligence tests, and the larger social problems raised by such classifications.

This is the first of a series of six articles, an analysis and estimate of intelligence tests. It is a critical inquiry into the claim, now widely made and accepted, that the psychologists have invented a method of measuring the inborn intelligence of all people. The series will discuss what sort of measure the tests furnish, beginning with a description of how the tests are made up. It goes on to discuss the claim that psychologists test intelligence which is fixed by heredity, and is therefore more or less impervious to education and environment. It concludes with some comment on the future usefulness of the tests.—The Editors

****

A startling bit of news has recently been unearthed and is now being retailed by the credulous to the gullible. “The average mental age of Americans,” says Mr. Lothrop Stoddard in The Revolt Against Civilization, “is only about fourteen.”

Mr. Stoddard did not invent this astonishing conclusion. He found it ready-made in the writings of a number of other writers. They in their turn got the conclusion by misreading the data collected in the army intelligence tests. For the data themselves lead to no such conclusion. It is impossible that they should. It is quite impossible for honest statistics to show that the average adult intelligence of a representative sample of the nation is that of an immature child in that same nation. The average adult intelligence cannot be less than the average adult intelligence, and to anyone who knows what the words “mental age” mean, Mr. Stoddard’s remark is precisely as silly as if he had written that the average mile was three quarters of a mile long.

The trouble is that Mr. Stoddard uses the words “mental age” without explaining either to himself or to his readers how the conception of “mental age” is derived. He was in such an enormous hurry to predict the downfall of civilization that he could not pause long enough to straighten out a few simple ideas. The result is that he snatches a few scarifying statistics and uses them as a base upon which to erect a glittering tower of generalities. For the statement that the average mental age of Americans is only about fourteen is not inaccurate. It is not incorrect. It is nonsense.

Mental age is a yard stick invented by a school of psychologists to measure “intelligence.” It is not easy, however, to make a measure of intelligence and the psychologists have never agreed on a definition. This quandary presented itself to Alfred Binet. For years he had tried to reach a definition of intelligence and always he had failed. Finally he gave up the attempt, and started on another tack. He then turned his attention to the practical problem of distinguishing the “backward” child from the “normal” child in the Paris schools. To do this he had to know what was a normal child. Difficult as this promised to be, it was a good deal easier than the attempt to define intelligence. For Binet concluded, quite logically, that the standard of a normal child of any particular age was something or other which an arbitrary percentage of children of that age could do. Binet therefore decided to consider “normal” those abilities which were common to between 65 and 75 percent of the children of a particular age. In deciding these percentages he thus decided to consider at least twenty-five percent of the children as backward. He might just as easily have fixed a percentage which would have classified ten percent of the children as backward, or fifty percent.

Having fixed a percentage which he would henceforth regard as “normal” he devoted himself to collecting questions, stunts and puzzles of various sorts, hard ones and easy ones. At the end he settled upon fifty-four tests, each of which he guessed and hoped would test some element of intelligence; all of which together would test intelligence as a whole. Binet then gave these tests in Paris to two hundred school children who ranged from three to fifteen years of age. Whenever he found a test that about sixty-five percent of the children of the same age could pass he called that a Binet test of intelligence for that age. Thus a mental age of seven years was the ability to do all the tests which sixty-five percent of a small group of seven year old Paris school children had shown themselves able to do.

This was a promising method, but of course the actual tests rested on a very weak foundation indeed. Binet himself died before he could carry his idea much further, and the task of revision and improvement was then transferred to Stanford University. The Binet scale worked badly in California. The same puzzles did not give the same results in California as in Paris. So about 1910 Professor L. M. Terman undertook to revise them. He followed Binet’s method. Like Binet he would guess at a stunt which might indicate intelligence, and then try it out on about 2,300 people of various ages, including 1,700 children “in a community of average social status.” By editing, rearranging and supplementing the original Binet tests he finally worked out a series of tests for each age which the average child of that age in about one hundred Californian children could pass.

The puzzles which this average child among a hundred Californian children of the same age about the year 1913 could answer are the yardstick by which “mental age” is measured in what is known as the Stanford Revision of the Binet-Simon Scale. Each correct answer gives a credit of two months’ mental age. So if a child of seven can answer all tests up to the seven-year-old tests perfectly, and cannot answer any of the eight-year-old tests, his total score is seven years. He is said to test “at age,” and his “intelligence quotient” or “I.Q.” is unity or 100 percent. Anybody’s I.Q. can be figured, therefore, by dividing his mental age by his actual age. A child of five who tests at four years’ mental age has an I.Q. of 80 (4/5=.80). A child of five who tests at six years’ mental age has an I.Q. of 120 (6/5=1.20).
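The division described above can be written out in a few lines. This is only an illustrative sketch: the function name `intelligence_quotient` is hypothetical, and the rounding to a whole number is an assumption made for display.

```python
from fractions import Fraction


def intelligence_quotient(mental_age_years, actual_age_years) -> int:
    """I.Q. as mental age divided by actual age, expressed as a percentage.

    Hypothetical helper for illustration; ages are given in years.
    """
    if actual_age_years <= 0:
        raise ValueError("actual age must be positive")
    # Exact rational arithmetic avoids floating-point surprises in the ratio.
    ratio = Fraction(mental_age_years) / Fraction(actual_age_years)
    return round(100 * ratio)


print(intelligence_quotient(4, 5))  # 80, the first example in the text
print(intelligence_quotient(6, 5))  # 120, the second example in the text
print(intelligence_quotient(7, 7))  # 100, the child who tests "at age"
```

The quotient says nothing about how the mental age itself was arrived at; that depends entirely on the group of children used to set the age levels.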

The aspect of all this which matters is that “mental age” is simply the average performance with certain rather arbitrary problems. The thing to keep in mind is that all the talk about “a mental age of fourteen” goes back to the performance of eighty-two California school children in 1913–14. Their success and failures on the days they happened to be tested have become embalmed and consecrated as the measure of human intelligence. By means of that measure writers like Mr. Stoddard fix the relative values of all the peoples of the earth and of all social classes within the nations. They don’t know they are doing this, however, because Mr. Stoddard at least is quite plainly taking everything at second hand.

However, I am willing for just a moment to grant that Mr. Terman in California has worked out a test for the different ages of a growing child. But I insist that anyone who uses the words “mental age” should remember that Mr. Terman reached his test by seeing what the average child of an age group could do. If his group is too small or is untypical his test is in the same measure inaccurate.

Remembering this, we come to the army tests. Here we are dealing at once with men all of whom are over the age of the mental scale. For the Stanford-Binet scale ends at “sixteen years.” It assumes that intelligence stops developing at sixteen and everybody sixteen and over is therefore treated as “adult” or as “superior adult.” Now the adult Stanford-Binet tests were “standardized chiefly on the basis of results from 400 adults” (Terman p. 13) “of moderate success and of very limited educational advantages” and also thirty-two high school pupils from sixteen to twenty years of age. Among these adults those who tested close together have the honor of being considered the standard of average adult intelligence.

Before the army tests came along, when anyone talked about the average adult he was talking about a few hundred Californians. The army tested about 1,700,000 adult men. But it did not use the Binet system of scoring the mental ages. It scored by a system of points which we need not stop to describe. Naturally enough everyone interested in mental testing wanted to know whether the army tests agreed in any way with the Stanford-Binet mental age standard. So, by another process, which need also not be described, the results of the army tests were translated into Binet terms. The result of this translation is the table which has so badly misled poor Mr. Stoddard. This table showed that the average of the army did not agree at all with the average of Mr. Terman’s Californians. There were then two things to do. One was to say that the average intelligence of 1,700,000 men was a more representative average than that of four hundred men. The other was to pin your faith to the four hundred men and insist they gave the true average.

Mr. Stoddard chose the average of four hundred rather than the average of 1,700,000 because he was in such haste to write his own book that he never reached page 785 of Psychological Examining in the United States Army, the volume of the data edited by Major [Robert] Yerkes.1 He would have found there a clear warning against the blunder he was about to commit, the blunder of treating the average of a small number of instances as more valid than the average of a large number.

But instead of pausing to realize that the army tests had knocked the Stanford-Binet measure of adult intelligence into a cocked hat, he wrote his book in the belief that the Stanford measure is as good as it ever was. This is not intelligent. It leads one to suspect that Mr. Stoddard is a propagandist with a tendency to put truth not in the first place but in the second. It leads one to suspect, after such a beginning, that the real promise and value of the investigation which Binet started is in danger of gross perversion by muddleheaded and prejudiced men.

****

II. The Mystery of the "A" Men

Because the results are expressed in numbers, it is easy to make the mistake of thinking that the intelligence test is a measure like a foot rule or a pair of scales. It is, of course, a quite different sort of measure. For length and weight are qualities which men have learned how to isolate no matter whether they are found in an army of soldiers, a heap of bricks, or a collection of chlorine molecules. Provided the footrule and the scales agree with the arbitrarily accepted standard foot and standard pound in the Bureau of Standards at Washington they can be used with confidence. But “intelligence” is not an abstraction like length and weight; it is an exceedingly complicated notion which nobody has as yet succeeded in defining.

When we measure the weight of a schoolchild we mean a very definite thing. We mean that if you put the child on one side of an evenly balanced scale, you will have to put a certain number of standard pounds in the other scale in order to cancel the pull of the child’s body towards the center of the earth. But when you come to measure intelligence you have nothing like this to guide you. You know in a general way that intelligence is the capacity to deal successfully with the problems that confront human beings, but if you try to say what those problems are, or what you mean by “dealing” with them or by “success,” you will soon lose yourself in a fog of controversy. This fundamental difficulty confronts the intelligence tester at all times. The way in which he deals with it is the most important thing to understand about the intelligence test, for otherwise you are certain to misinterpret the results.

The intelligence tester starts with no clear idea of what intelligence means. He then proceeds by drawing upon his common sense and experience to imagine the different kinds of problems men face which might in a general way be said to call for the exercise of intelligence. But these problems are much too complicated and too vague to be reproduced in the classroom. The intelligence tester cannot confront each child with the thousand and one situations arising in a home, a workshop, a farm, an office or in politics, that call for the exercise of those capacities which in a summary fashion we call intelligence. He proceeds, therefore, to guess at the more abstract mental abilities which come into play again and again. By this rough process the intelligence tester gradually makes up his mind that situations in real life call for memory, definition, ingenuity and so on.

He then invents puzzles which can be employed quickly and with little apparatus, that will according to his best guess test memory, ingenuity, definition and the rest. He gives these puzzles to a mixed group of children and sees how children of different ages answer them. Whenever he finds a puzzle that, say, sixty percent of the twelve year old children can do, and twenty percent of the eleven year olds, he adopts that test for the twelve year olds. By a great deal of fitting he gradually works out a series of problems for each age group which sixty percent of his children can pass, twenty percent cannot pass and, say, twenty percent of the children one year younger can also pass. By this method he has arrived under the Stanford-Binet system at a conclusion of this sort: Sixty percent of children twelve years old should be able to define three out of the five words: pity, revenge, charity, envy, justice. According to Professor Terman’s instructions, a child passes this test if he says that “pity” is “to be sorry for some one”; the child fails if he says “to help” or “mercy.” A correct definition of “justice” is as follows: “It’s what you get when you go to court”; an incorrect definition is “to be honest.”

A mental test, then, is established in this way: The tester himself guesses at a large number of tests which he hopes and believes are tests of intelligence. Among these tests those finally are adopted by him which sixty percent of the children under his observation can pass. The children whom the tester is studying select his tests.
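The selection rule just described can be sketched as a small procedure. The pass rates below are invented for illustration and come from no study; the function name `age_level_for_puzzle` and the choice of "youngest age that reaches the target" as the rule are assumptions.

```python
def age_level_for_puzzle(pass_rate_by_age, target=0.60):
    """Return the youngest age whose pass rate reaches the target, or None.

    pass_rate_by_age maps an age in years to the share of children of that
    age who passed the puzzle (hypothetical data structure).
    """
    for age in sorted(pass_rate_by_age):
        if pass_rate_by_age[age] >= target:
            return age
    return None


# Hypothetical pass rates for a single puzzle, invented for illustration.
rates = {10: 0.05, 11: 0.20, 12: 0.60, 13: 0.85}

print(age_level_for_puzzle(rates))               # 12
print(age_level_for_puzzle(rates, target=0.75))  # 13
print(age_level_for_puzzle(rates, target=0.20))  # 11
```

The same puzzle lands at a different age level as soon as the tester picks a different percentage, which is why the choice of percentage matters so much to what a "mental age" means.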

There are, consequently, two uncertain elements. The first is whether the tests really test intelligence. The second is whether the children under observation are a large enough group to be typical. The answer to the first question—whether the tests are tests of intelligence—can be determined only by seeing whether the results agree with other tests of intelligence, whatever they may be. The answer to the second question can be had only by making a very much larger number of observations than have yet been made. We know that the largest test made, the army examinations, showed enormous error in the Stanford test of adult intelligence. These elements of doubt are, I think, radical enough to prohibit anyone from using the results of these tests for large generalization about the quality of human beings. For when people generalize about the quality of human beings they assume an objective criterion. These puzzles may test intelligence, but they may not. They may test an aspect of intelligence. Nobody knows.

What then do the tests accomplish? I think we can answer this question best by starting with an illustration. Suppose you wished to judge all the pebbles in a large pile of gravel for the purpose of separating them into three piles, the first to contain the extraordinary pebbles, the second normal pebbles, and the third the insignificant pebbles. You have no scales. You first separate from the pile a much smaller pile and pick out one pebble which you guess is the average. You hold it in your left hand and pick up another pebble in your right hand. The right pebble feels heavier. You pick up another pebble. It feels lighter. You pick up a third. It feels still lighter. A fourth feels heavier than the first. By this method you can arrange all the pebbles from the smaller pile in a series running from the lightest to the heaviest. You thereupon call the middle pebble the standard pebble, and with it as a measure you determine whether any pebble in the larger pile is a subnormal, a normal or a supernormal pebble.

This is just about what the intelligence test does. It does not weigh or measure intelligence by any objective standard. It simply arranges a group of people in a series from best to worst by balancing their capacity to do certain arbitrarily selected puzzles, against the capacity of all the others. The intelligence test, in other words, is fundamentally an instrument for classifying a group of people. It may also be an instrument for measuring their intelligence, but of that we cannot be at all sure unless we believe that M. Binet and Mr. Terman and a few other psychologists have guessed correctly. They may have, but, as we shall see later, the proof is not yet at hand.
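The pebble illustration amounts to ranking by comparison alone and then judging everything against the middle item of a small sample. A minimal sketch follows, with hypothetical names throughout; the numbers stand in for weights that the sorter can feel but never reads off a scale.

```python
from functools import cmp_to_key


def compare_by_hand(a, b):
    """Stand-in for hefting two pebbles: negative if a feels lighter than b."""
    return (a > b) - (a < b)


def standard_pebble(sample):
    """Order the small pile by pairwise comparison and return the middle pebble."""
    ordered = sorted(sample, key=cmp_to_key(compare_by_hand))
    return ordered[len(ordered) // 2]


def classify(pebble, standard):
    """Label a pebble only in relation to the chosen standard pebble."""
    verdict = compare_by_hand(pebble, standard)
    if verdict < 0:
        return "subnormal"
    if verdict > 0:
        return "supernormal"
    return "normal"


# Two hypothetical small piles drawn from different parts of the gravel heap.
first_standard = standard_pebble([31, 12, 27, 18, 22])
second_standard = standard_pebble([40, 50, 45])

print(first_standard, classify(40, first_standard))    # 22 supernormal
print(second_standard, classify(40, second_standard))  # 45 subnormal
```

The same pebble is "supernormal" against one small pile and "subnormal" against another. Nothing has been weighed; the pebble has only been placed in a series.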

The intelligence test, then, is an instrument for classifying a group of people, rather than “a measure of intelligence.” People are classified within a group according to their success in solving problems which may or may not be tests of intelligence. They are classified according to the performance of some Californians in the years 1910 to about 1916 with Mr. Terman’s notion of the problems that reveal intelligence. They are not classified according to their ability in dealing with the problems of real life that call for intelligence.

With this in mind let us look at the army results, as they are dished up by writers like Mr. Lothrop Stoddard and Professor [William] McDougall of Harvard. The following table is given:

4 1/2% of the army were A men

9% [of the army were] B [men]

16 1/2% [of the army were] C+ [men]

25% [of the army were] C [men]

20% [of the army were] C- [men]

15% [of the army were] D [men]

10% [of the army were] D- [men]

But how, you ask, did the army determine the qualities of an “A” man? For an “A” man is supposed to have “very superior intelligence,” and of course mankind has wondered for at least two thousand years what were the earmarks of very superior intelligence. McDougall and Stoddard are quite content to take the army’s word for it, or at least they never stop to explain, before they exploit the figures, what the army meant by “very superior intelligence.” The army, of course, had no intention whatever of committing itself to a definition of very superior intelligence. The army was interested in classifying recruits. It therefore asked a committee of psychologists to assemble from all the different systems, Binet and otherwise, a series of tests. The committee took this series and tried it out in a few camps. They timed the tests. “The number of items and the time limits were so fixed that five percent or less in any average group would be able to finish the entire series of items in the time allowed.”2 It is not surprising that tests devised to pass five percent or less “A” men should have passed four and a half percent “A” men.

The army was quite justified in doing this because it was in a hurry and was looking for about five percent of the recruits to put into officers’ training camps. I quarrel only with the Stoddards and McDougalls who solemnly talk about the 4 1/2 percent “A” men in the American nation without understanding how these 4 1/2 percent were picked. They do not seem to realize that if the army had wanted half the number of officers, it could by shortening the time have made the scarcity of “A” men seem even more alarming. If the army had wanted to double the “A” men, it could have done that by lengthening the time. Somewhere, of course, in the whole group would have been found men who could not have answered all the questions correctly in any length of time. But we do not know how many men of the kind there were because the tests were never made that way.3
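The dependence of the “A” share on the time limit can be sketched with a toy trial group. Everything here is hypothetical: the finishing times are invented, the function names are illustrative, and the sketch assumes for simplicity that every man would finish eventually, which the text itself notes is not true of the real group.

```python
def time_limit_for_share(completion_minutes, share):
    """Pick the limit that lets about `share` of the trial group finish."""
    ordered = sorted(completion_minutes)
    finishers = max(1, round(share * len(ordered)))
    return ordered[finishers - 1]


def share_finishing(completion_minutes, limit):
    """Fraction of the group that finishes within the limit."""
    done = sum(1 for minutes in completion_minutes if minutes <= limit)
    return done / len(completion_minutes)


# Hypothetical trial group: 200 men whose finishing times run 21..220 minutes.
trial_group = list(range(21, 221))

for wanted in (0.025, 0.05, 0.10):
    limit = time_limit_for_share(trial_group, wanted)
    print(wanted, limit, share_finishing(trial_group, limit))
```

Under these assumptions the share of men who finish is whatever the committee chose it to be when it fixed the clock. It is a property of the limit, not a discovery about the men.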

The army was interested in discovering officers and in eliminating the feeble-minded. It had no time to waste, and so it adopted a rough test which would give a quick classification. In that it succeeded on the whole very well. But the army did not measure the intelligence of the American nation, and only very loose-minded writers imagine that it did. When men write as Mr. Stoddard does that “only four and a half millions (of the whole population) can be considered ‘talented,’” the only possible comment is that the statement has no foundation whatsoever. We do not know how many talented people there are: first, because we have no measure of talent, and second, because we have never made the attempt to devise one or apply one. But when we see how men like Stoddard and McDougall have exploited the army tests, we realize how necessary, but how unheeded, is the warning of Messrs. [Clarence S.] Yoakum and [Robert M.] Yerkes that “the ease with which the army group test can be given and scored makes it a dangerous method in the hands of the inexpert. It was not prepared for civilian use, and is applicable only within certain limits to other uses than that for which it was prepared.”

****

III. The Reliability of Intelligence Tests

Suppose, for example, that our aim was to test athletic rather than intellectual ability. We appoint a committee consisting of Walter Camp, Percy Haughton, Tex Rickard and Bernard Darwin, and we tell them to work out tests which will take no longer than an hour and can be given to large numbers of men at once. These tests are to measure the true athletic capacity of all men anywhere for the whole of their athletic careers. The order would be a large one, but it would certainly be no larger than the pretensions of many well known intelligence testers.

Our committee of athletic testers scratch their heads. What shall be the hour’s test, they wonder, which will “measure” the athletic “capacity” of [Jack] Dempsey, [Bill] Tilden, [golfer Jess] Sweetser, [boxer “Battling”] Siki, Suzanne Lenglen and Babe Ruth, of all sprinters, Marathon runners, broad jumpers, high divers, wrestlers, billiard players, marksmen, cricketers and pogo bouncers? The committee has courage. After much guessing and some experimenting the committee works out a sort of condensed Olympic games which can be held in any empty lot. These games consist of a short sprint, one or two jumps, throwing a ball at a bull’s eye, hitting a punching machine, tackling a dummy and a short game of clock golf. They try out these tests on a mixed assortment of champions and duffers and find that on the whole the champions do all the tests better than the duffers. They score the result and compute statistically what is the average score for all the tests. This average score then constitutes normal athletic ability.

Now it is clear that such tests might really give some clue to athletic ability. But the fact that in any large group of people sixty percent made an average score would be no proof that you had actually tested their athletic ability. To prove that, you would have to show that success in the athletic tests correlated closely with success in athletics. The same conclusion applies to the intelligence tests. Their statistical uniformity is one thing; their reliability another. The tests might be a fair guess at intelligence, but the statistical result does not show whether they are or not. You could get a statistical curve very much like the curve of “intelligence” distribution if instead of giving each child from ten to thirty problems to do you had flipped a coin the same number of times for each child and had credited him with the heads. I do not mean, of course, that the results are as chancy as all that. They are not, as we shall soon see. But I do mean that there is no evidence for the reliability of the tests as tests of intelligence in the claim, made by Terman,4 that the distribution of intelligence quotients corresponds closely to “the theoretical normal curve of distribution (the Gaussian curve).” He would in a large enough number of cases get an even more perfect curve if these tests were tests not of intelligence but of the flip of a coin.
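The coin-flip comparison is easy to try. The sketch below is an illustration only: the number of children, the twenty "problems" (a figure chosen from within the ten-to-thirty range mentioned above), the seed and the function names are all assumptions.

```python
import random
from collections import Counter


def coin_flip_scores(children, problems, seed=0):
    """Credit each imaginary child with the heads from `problems` fair flips."""
    rng = random.Random(seed)  # seeded so the run can be repeated
    return [
        sum(1 for _ in range(problems) if rng.random() < 0.5)
        for _ in range(children)
    ]


def text_histogram(scores, width=50):
    """Draw one row of '#' marks per score, scaled to the tallest row."""
    counts = Counter(scores)
    tallest = max(counts.values())
    rows = []
    for score in range(min(counts), max(counts) + 1):
        bar = "#" * round(width * counts[score] / tallest)
        rows.append(f"{score:2d} {bar}")
    return "\n".join(rows)


scores = coin_flip_scores(children=10_000, problems=20)
print(text_histogram(scores))
```

The printed rows pile up around the middle score and thin out toward both ends, the familiar bell shape, although no child's ability entered into the scores at all. A bell-shaped distribution of results is therefore no evidence about what the results measure.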

Such statistical check has its uses, of course. It tends to show, for example, that in a large group the bias and errors of the tester have been canceled out. It tends to show that the gross result is reached in the mass by statistically impartial methods, however wrong the judgment about any particular child may be. But the fairness in giving the tests and the reliability of the tests themselves must not be confused. The tests may be quite fair applied in the mass, and yet be poor tests of individual intelligence.

We come then to the question of the reliability of the tests. There are many different systems of intelligence testing and, therefore, it is important to find out how the results agree if the same group of people take a number of different tests. The figures given by Yoakum and Yerkes5 indicate that people who do well or badly in one are likely to do more or less equally well or badly in the other tests. Thus the army test for English-speaking literates, known as Alpha, correlates with Beta, the test for non-English speakers or illiterates at .80. Alpha with a composite test of Alpha, Beta and Stanford-Binet gives .94. Alpha with Trabue B and C completion-tests combined gives .72. On the other hand, as we noted in the first article of this series, the Stanford-Binet system of calculating “mental ages” is in violent disagreement with the results obtained by the army tests.
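For readers unfamiliar with figures such as .80 or .94, a correlation coefficient summarizes how closely two sets of scores for the same people rise and fall together. The text does not say which coefficient Yoakum and Yerkes computed; the sketch below assumes the ordinary Pearson product-moment coefficient, and the function name `pearson_r` and the small score lists are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list never varies")
    return covariation / math.sqrt(spread_x * spread_y)


# Invented scores: the second test simply doubles the first, so agreement is perfect.
print(pearson_r([1, 2, 3], [2, 4, 6]))  # 1.0
# Invented scores in exactly reversed order give perfect disagreement.
print(pearson_r([1, 2, 3], [3, 2, 1]))  # -1.0
```

A coefficient near 1 shows only that two tests rank people in much the same order. It cannot show what either test measures.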

Nevertheless, in a rough way the evidence shows that the various tests in the mass are testing the same capacities. Whether these capacities can fairly be called intelligence, however, is not yet proved. The tests are all a good deal alike. They all derive from a common stock, and it is entirely possible that they measure only a certain kind of ability. The type of mind which is very apt in solving Sunday newspaper puzzles, or even in playing chess, may be specially favored by these tests. The fact that the same people always do well with puzzles would in itself be no evidence that the solving of puzzles was a general test of intelligence. We must remember, too, that the emotional setting plays a large role in any examination. To some temperaments the atmosphere of the examination room is highly stimulating. Such people “outdo themselves” when they feel they are being tested; other people “cannot do themselves justice” under the same conditions. Now in a large group these differences of temperament may neutralize each other in the statistical result. But they do not neutralize each other in the individual case.

The correlation between the various systems enables us to say only that the tests are not mere chance, and that they do seem to seize upon a certain kind of ability. But whether this ability is a sign of general intelligence or not, we have no means of knowing from such evidence alone. The same conclusion holds true of the fact that when the tests are repeated at intervals on the same group of people they give much the same results. Data of this sort are as yet meager, for intelligence testing has not been practiced long enough to give results over long periods of time. Yet the fact that the same child makes much the same score year after year is significant. It permits us to believe that some genuine capacity is being tested. But whether this is the capacity to pass tests or the capacity to deal with life, which we call intelligence, we do not know.

This is the crucial question, and in the nature of things there can as yet be little evidence one way or another. The Stanford-Binet tests were set in order about the year 1914. The oldest children of the group tested at that time were 142 children ranging from fourteen to sixteen years of age. Those children are now between twenty-two and twenty-four. The returns are not in. The main question of whether the children who ranked high in the Stanford-Binet tests will rank high in real life is now unanswerable, and will remain unanswered for a generation. We are thrown back, therefore, for a test of the tests on the success of these children in school. We ask whether the results of the intelligence test correspond with the quality of work, with school grades and with school progress.

The crude figures at first glance show a poor correspondence. In Terman’s studies6 the intelligence quotient correlated with school work, as judged by teachers, only .45 and with intelligence as judged by teachers, only .48. But that in itself proves nothing against the reliability of the intelligence tests. For after all the test of school marks, of promotion or the teacher’s judgments, is not necessarily more reliable. There is no reason certainly for thinking that the way public school teachers classify children is any final criterion of intelligence. The teachers may be mistaken. In a definite number of cases Terman has shown that they are mistaken, especially when they judge a child’s intelligence by his grade in school and not by his age. A retarded child may be doing excellent work, an advanced child poorer work. Terman has shown also that teachers make their largest mistakes in judging children who are above or below the average. The teachers become confused by the fact that the school system is graded according to age.

A fair reading of the evidence will, I think, convince anyone that as a system of grading the intelligence test may prove superior in the end to the system now prevailing in the public schools. The intelligence test, as we noted in an earlier article, is an instrument of classification. When it comes into competition with the method of classifying that prevails in school it exhibits many signs of superiority. If you have to classify children for the convenience of school administration, you are likely to get a more coherent classification with the tests than without them. I should like to emphasize this point especially, because it is important that in denying the larger pretensions and misunderstanding we should not lose sight of the positive value of the tests. We say, then, that none of the evidence thus far considered shows whether they are reliable tests of the capacity to deal intelligently with the problems of real life. But as gauges of the capacity to deal intelligently with the problems of the classroom, the evidence justifies us in thinking that the tests will grade the pupils more accurately than do the traditional school examinations.

If school success were a reliable index of human capacity, we should be able to go a step further and say that the intelligence test is a general measure of human capacity. But of course no such claim can be made for school success, for that would be to say that the purpose of the schools is to measure capacity. It is impossible to admit this. The child’s success with school work cannot be a measure of a child’s success in life. On the contrary, his success in life must be a significant measure of the school’s success in developing the capacities of the child. If a child fails in school and then fails in life, the school cannot sit back and say: you see how accurately I predicted this. Unless we are to admit that education is essentially impotent, we have to throw back the child’s failure at the school, and describe it as a failure not by the child but by the school.

For this reason, the fact that the intelligence test may turn out to be an excellent administrative device for grading children in school cannot be accepted as evidence that it is a reliable test of intelligence. We shall see in the succeeding articles that the whole claim of the intelligence testers to have found a reliable measure of the human capacity rests on an assumption, imported into the argument, that education is essentially impotent because intelligence is hereditary and unchangeable. This belief is the ultimate foundation of the claim that the tests are not merely an instrument of classification but a true measure of intelligence. It is this belief which has been seized upon eagerly by writers like Stoddard and McDougall. It is a belief which is, I am convinced, wholly unproved, and it is this belief which is obstructing and perverting the practical development of the tests.

(A number of letters have been received, commenting on the two articles of Mr. Lippmann’s series already printed. We have thought it best not to print any of these letters until the completion of the series, when it will be possible to classify and present the points brought up by our correspondents more intelligently.—The Editors.)

****

IV. The Abuse of the Tests

We have found reason for thinking that the intelligence test may prove to be a considerable help in sorting out children into school classes. If it is true, as Professor Terman says,7 that between a third and a half of the school-children fail to progress through the grades at the expected rate, then there is clearly something wrong with the present system of examinations and promotions. No one doubts that there is something wrong, and that in consequence both the retarded and the advanced child suffer.

The intelligence test promises to be more successful in grading the children. This means that the tendency of the tests in the average is to give a fairly correct sample of the child’s capacity to do school work. In a wholesale system of education, such as we have in our public schools, the intelligence test is likely to become a useful device for fitting the child into the school. This is, of course, better than not fitting the child into the school, and under a more correct system of grading, such as the intelligence test promises to furnish, it should become possible even where education is conducted in large classrooms to specialize the teaching, because the classes will be composed of pupils whose capacity for school work is fairly homogeneous.

Excellent as this seems, it is of the first importance that school authorities and parents realize exactly what this administrative improvement signifies. For great mischief will follow if there is confusion about the spiritual meaning of this reform. If, for example, the impression takes root that these tests really measure intelligence, that they constitute a sort of last judgment on the child’s capacity, that they reveal “scientifically” his predestined ability, then it would be a thousand times better if all the intelligence testers and all their questionnaires were sunk without warning into the Sargasso Sea. One has only to read around in the literature of the subject, but more especially in the work of popularizers like McDougall and Stoddard, to see how easily the intelligence test can be turned into an engine of cruelty, how easily in the hands of blundering or prejudiced men it could turn into a method of stamping a permanent sense of inferiority upon the soul of a child.

It is not possible, I think, to imagine a more contemptible proceeding than to confront a child with a set of puzzles, and after an hour’s monkeying with them, proclaim to the child, or to his parents, that here is a C-individual. It would not only be a contemptible thing to do. It would be a crazy thing to do, because there is nothing in these tests to warrant a judgment of this kind. All that can be claimed for the tests is that they can be used to classify into a homogeneous group the children whose capacities for school work are at a particular moment fairly similar. The intelligence test shows nothing as to why those capacities at any moment are what they are, and nothing as to the individual treatment which a temporarily retarded child may require.

I do not mean to say that the intelligence test is certain to be abused. I do mean to say it lends itself so easily to abuse that the temptation will be enormous. Suppose you have a school in which there are fifty ten year old children in the seventh grade and fifty eleven year olds in the eighth. In each class you find children who would jump ahead if they could and others who lag behind. You then regrade them according to mental age. Some of the ten year olds go into the eighth grade, some of the elevens into the seventh grade. That is an improvement. But if you are satisfied to leave the matter there, you are doing a grave injustice to the retarded children and ultimately to the community in which they are going to live. You cannot, in other words, be satisfied to put retarded eleven year olds and average ten year olds together. The retarded eleven year olds need something besides proper classification according to mental age. They need special analysis and special training to overcome their retardation. The leading intelligence testers recognize this, of course. But the danger of the intelligence tests is that in a wholesale system of education, the less sophisticated or the more prejudiced will stop when they have classified and forget that their duty is to educate. They will grade the retarded child instead of fighting the causes of his backwardness. For the whole drift of the propaganda based on intelligence testing is to treat people with low intelligence quotients as congenitally and hopelessly inferior.

Readers who have not examined the literature of mental testing may wonder why there is reason to fear such an abuse of an invention which has many practical uses. The answer, I think, is that most of the more prominent testers have committed themselves to a dogma which must lead to such abuse. They claim not only that they are really measuring intelligence, but that intelligence is innate, hereditary, and predetermined. They believe that they are measuring the capacity of a human being for all time and that his capacity is fatally fixed by the child’s heredity. Intelligence testing in the hands of men who hold this dogma could not but lead to an intellectual caste system in which the task of education had given way to the doctrine of predestination and infant damnation. If the intelligence test really measured the unchangeable hereditary capacity of human beings, as so many assert, it would inevitably evolve from an administrative convenience into a basis for hereditary caste.

In the next article we shall examine the evidence for the claim that the intelligence tests reveal the fixed hereditary endowment.

****

V. Tests of Hereditary Intelligence

The first argument in favor of the view that the capacity for intelligence is hereditary is an argument by analogy. There is a good deal of evidence that idiocy and certain forms of degeneracy are transmitted from parents to offspring. There are, for example, a number of notorious families—the Kallikaks, the Jukes, the Hill Folk, the Nams, the Zeros and the Ishmaelites, who have a long and persistent record of degeneracy. Whether these bad family histories are the result of a bad social start or of defective germplasm is not entirely clear, but the weight of evidence is in favor of the view that there is a taint in the blood. Yet even in these sensational cases, in fact just because they are so sensational and exceptional, it is important to remember that the proof is not conclusive.

There is, for example, some doubt as to the Kallikaks. It will be recalled that during the Revolutionary War a young soldier, known under the pseudonym of Martin Kallikak, had an illegitimate feeble-minded son by a feeble-minded girl. The descendants of this union have been criminals and degenerates. But after the war was over Martin married respectably. The descendants of this union have been successful people. This is a powerful evidence, but it would, as Professor [James McKeen] Cattell8 points out, be more powerful, and more interesting scientifically, if the wife of the respectable marriage had been feeble-minded, and the girl in the tavern had been a healthy, normal person. Then only would it have been possible to say with complete confidence that this was a pure case of biological rather than of social heredity.

Assuming, however, that the inheritance of degeneracy is established, we may turn to the other end of the scale. Here we find studies of the persistence of talent in superior families. Sir Francis Galton, for example, found “that the son of a distinguished judge had about one chance in four of becoming himself distinguished, while the son of a man picked out at random from the general population had only about one chance in four thousand of becoming similarly distinguished.”9 Professor Cattell in a study of families of one thousand leading American scientists remarks in this connection: “Galton finds in the judges of England a notable proof of hereditary genius. It would be found to be much less in the judges of the United States. It could probably be shown by the same methods to be even stronger in the families conducting the leading publishing and banking houses of England and Germany.” And in another place he remarks that “my data show that a boy born in Massachusetts or Connecticut has been fifty times as likely to become a scientific man as a boy born along the Southeastern seaboard from Georgia to Louisiana.”

It is not necessary for our purpose to come to any conclusion as to the inheritance of capacity. The evidence is altogether insufficient for any conclusion, and the only possible attitude is an open mind. We are, moreover, not concerned with the question of whether intelligence is hereditary. We are concerned only with the claim of the intelligence tester that he reveals and measures hereditary intelligence. These are quite separate propositions, but they are constantly confused by the testers. For these gentlemen seem to think that if Galton’s conclusion about judges and the tale of the Kallikaks are accepted, then two things follow: first, that by analogy10 all the gradations of intelligence are fixed in heredity, and second that the tests measure these different grades of hereditary intelligence. Neither conclusion follows necessarily. The facts of heredity cannot be proved by analogy; the facts of heredity are what they are. The question of whether the intelligence test measures heredity is a wholly different matter. It is the only question which concerns us here.

We may start then with the admitted fact that children of favored classes test higher on the whole than other children. Binet tests made in Paris, Berlin, Brussels, Breslau, Rome, Petrograd, Moscow, in England and in America agree on this point. In California Professor Terman11 divided 492 children into five social classes and obtained the following correlation between the median intelligence quotient and social status:

Social group      Median IQ
Very Inferior     85
Inferior          93
Average           99.5
Superior          107
Very Superior     106

On the face of it this table would seem to indicate, if it indicates anything, a considerable connection between intelligence and environment. Mr. Terman denies this, and argues that “if home environment really has any considerable effect upon the IQ we should expect this effect to become more marked, the longer the influence has continued. That is, the correlation of IQ with social status should increase with age.” But since his data show that at three age levels (5–8 years) and (9–11 years) and (12–14 years) the coefficient of correlation with social status declines (it is .43, .41, and .29 respectively), Mr. Terman concludes that “in the main, native qualities of intellect and character, rather than chance (sic) determine the social class to which a family belongs.” He even pleads with us to accept this conclusion: “After all does not common observation teach us that etc. etc.” and “from what is already known about heredity should we not naturally expect” and so forth and so forth.

Now I propose to put aside entirely all that Mr. Terman’s common observation and natural expectations teach him. I should like only to examine his argument that if home environment counted much its effect ought to become more and more marked as the child grew older.

It is difficult to see why Mr. Terman should expect this to happen. To the infant the home environment is the whole environment. When the child goes to school the influences of the home are merged in the larger environment of school and playground. Gradually the child’s environment expands until it takes in a city, and the larger invisible environment of books and talk and movies and newspapers. Surely Mr. Terman is making a very strange assumption when he argues that as the child spends less and less time at home the influence of home environment ought to become more and more marked. His figures, showing that the correlation between social status and intelligence declines from .43 before eight years of age to .29 at twelve years of age, are hardly an argument for hereditary differences in the endowment of social classes. They are a rather strong argument on the contrary for the traditional American theory that the public school is an agency for equalizing the opportunities of the privileged and the unprivileged.

But Mr. Terman could by a shrewder use of his own data have made a better case. It is not necessary for him to use an argument which comes down to saying that the less contact the child has with the home the more influential the home ought to be. This is simply the gross logical fallacy of expecting increasing effects from a diminishing cause. Mr. Terman would have made a more interesting point if he had asked why the influence of social status on intelligence persists so long after the parents and the home have usually ceased to play a significant part in the child’s intellectual development. Instead of being surprised that the correlation has declined from .43 at eight to .29 at twelve, he should have asked why there is any correlation left at twelve. That would have posed a question which the traditional eulogist of the little red schoolhouse could not answer offhand. If the question had been put that way, no one could dogmatically have denied that differences of heredity in social classes may be a contributing factor. But curiously, it is the mental tester himself who incidentally furnishes the most powerful defence of the orthodox belief that in the mass differences of ability are the result of education rather than of heredity.

The intelligence tester has found that the rate of mental growth declines as the child matures. It is faster in infancy than in adolescence, and the adult intelligence is supposed to be fully developed somewhere between sixteen and nineteen years of age. The growth of intelligence slows up gradually until it stops entirely. I do not know whether this is true or not, but the intelligence testers believe it. From this belief it follows that there is “a decreasing significance of a given amount of retardation in the upper years.”12 Binet, in fact, suggested the rough rule that under ten years of age a retardation of two years usually means feeble-mindedness, while for older children feeble-mindedness is not indicated unless there is a retardation of at least three years.

This being the case the earlier the influence the more potent it would be, the later the influence, the less significant. The influences which bore upon the child when his intelligence was making its greatest growth would leave a profounder impression than those which bore upon him when his growth was more nearly completed. Now in early childhood you have both the period of the greatest growth and the most inclusive and direct influence of the home environment. Is it surprising that the effects of superior and inferior environments persist, though in diminishing degree, as the child emerges from the home?

It is possible, of course, to deny that the early environment has any important influence on the growth of intelligence. Men like Stoddard and McDougall do deny it, and so does Mr. Terman. But on the basis of the mental tests they have no right to an opinion. Mr. Terman’s observations begin at four years of age. He publishes no data on infancy and he is, therefore, generalizing about the heredity factor after four years of immensely significant development have already taken place. On his own showing as to the high importance of the earlier years, he is hardly justified in ignoring them. He cannot simply lump together the net result of natural endowment and infantile education and ascribe it to the germplasm.

In doing just that he is obeying the will to believe, not the methods of science. How far he is carried may be judged from this instance which Mr. Terman cites13 as showing the negligible influence of environment. He tested twenty children in an orphanage and found only three who were fully normal. “The orphanage in question,” he then remarks, “is a reasonably good one and affords an environment which is about as stimulating as average home life among the middle classes.” Think of it. Mr. Terman first discovers what a “normal mental development” is by testing children who are growing up in the abnormal environment of an institution and finds that they are not normal. He then puts the blame for abnormality on the germplasm of the orphans.

****

VI. A Future for the Tests

How does it happen that men of science can presume to dogmatize about the mental qualities of the germplasm when their own observations begin at four years of age? Yet this is what the chief intelligence testers, led by Professor Terman, are doing. Without offering any data on all that occurs between conception and the age of kindergarten, they announce on the basis of what they have got out of a few thousand questionnaires that they are measuring the hereditary mental endowment of human beings. Obviously this is not a conclusion obtained by research. It is a conclusion planted by the will to believe. It is, I think, for the most part unconsciously planted. The scoring of the tests itself favors an uncritical belief that intelligence is a fixed quantity in the germplasm and that, no matter what the environment, only a predetermined increment of intelligence can develop from year to year. For the result of a test is not stated in terms of intelligence, but as a percentage of the average for that age level. These percentages remain more or less constant. Therefore, if a child shows an IQ of 102, it is easy to argue that he was born with an IQ of 102.

There is here, I am convinced, a purely statistical illusion, which breaks down when we remember what IQ means. A child’s IQ is his percentage of passes in the test which the average child of a large group of his own age has passed. The IQ measures his place in respect to the average at any year. But it does not show the rate of his growth from year to year. In fact it tends rather to conceal the fact that the creative opportunities in education are greatest in early childhood. It conceals the fact, which is of such far-reaching importance, that because the capacity to form intellectual habits decreases as the child matures, the earliest education has a cumulative effect on the child’s future. All this the static percentages of the IQ iron out. They are meant to iron it out. It is the boast of the inventors of the IQ that “the distribution of intelligence maintains a certain constancy from five to thirteen or fourteen years of age, when the degree of intelligence is expressed in terms of the intelligence quotient.”14 The intention is to eliminate the factor of uneven and cumulative growth, so that there shall be always a constant measure by which to classify children in class rooms.

This, as I have pointed out, may be useful in school administration, but it can turn out to be very misleading for an unwary theorist. If instead of saying that Johnny gained thirty pounds one year, twenty-five the next and twenty the third, you said that measured by the average gain for children of his age, Johnny’s weight quotients were 101, 102, 101, you might, unless you were careful, begin to think that Johnny’s germplasm weighed as much as he does today. And if you dodged that mistake, you might, nevertheless come to think that since Johnny classified year after year in the same position, Johnny’s diet had no influence on his weight.
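The arithmetic behind this analogy can be sketched in a few lines of Python. Johnny's yearly gains are taken from the paragraph above; the average gains for children of his age are not given in the text and are assumed here purely for illustration, as are the names `quotient` and `assumed_average_gains`.

```python
# Illustrative sketch of the "weight quotient" analogy.
# Johnny's gains (30, 25, 20 pounds) come from the text. The group averages
# below are ASSUMED values, chosen only so that the quotients land near the
# 101, 102, 101 mentioned in the text.

def quotient(individual_gain, average_gain):
    """Express an individual's gain as a rounded percentage of the group average."""
    return round(100 * individual_gain / average_gain)

johnny_gains = [30.0, 25.0, 20.0]            # pounds gained in years 1, 2, 3
assumed_average_gains = [29.7, 24.5, 19.8]   # hypothetical averages for his age group

quotients = [quotient(j, a) for j, a in zip(johnny_gains, assumed_average_gains)]

# The quotients barely move, although the absolute gain falls by a third.
print("yearly gains:", johnny_gains)
print("quotients:   ", quotients)
```

On these assumed averages the quotient stays within a point of 101 in every year while the yearly gain drops from thirty pounds to twenty. The near-constant figure records only Johnny's position relative to the group; it says nothing about how fast he is growing or what is feeding that growth.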

The effect of the intelligence quotient on a tester’s mind may be to make it seem as if intelligence were constant, whereas it is only the statistical position in large groups which is constant. This illusion of constancy has, I believe, helped seriously to prevent men like Terman from appreciating the variability of early childhood. Because in the mass the percentages remain fixed, they tend to forget how in each individual case there were offered creative opportunities which the parents and nurse girls improved or missed or bungled. The whole more or less blind drama of childhood, where the habits of intelligence are formed, is concealed in the mental test. The testers themselves become callous to it. What their footrule does not measure soon ceases to exist for them, and so they discuss heredity in school children before they have studied the education of infants.

But of course no student of human motives will believe that this revival of predestination is due to a purely statistical illusion. He will say with Nietzsche that “every impulse is imperious, and, as such, attempts to philosophize.” And so behind the will to believe he will expect to find some manifestation of the will to power. He will not have to read far in the literature of mental testing to discover it. He will soon see that the intelligence test is being sold to the public on the basis of the claim that it is a device which will measure pure intelligence, whatever that may be, as distinguished from knowledge and acquired skill.

This advertisement is impressive. If it were true, the emotional and the worldly satisfactions in store for the intelligence tester would be very great. If he were really measuring intelligence, and if intelligence were a fixed hereditary quantity, it would be for him to say not only where to place each child in school, but also which children should go to high school, which to college, which into the professions, which into the manual trades and common labor. If the tester could make good his claim, he would soon occupy a position of power which no intellectual has held since the collapse of theocracy. The vista is enchanting, and even a little of the vista is intoxicating enough. If only it could be proved, or at least believed, that intelligence is fixed by heredity, and that the tester can measure it, what a future to dream about! The unconscious temptation is too strong for the ordinary critical defenses of the scientific methods. With the help of a subtle statistical illusion, intricate logical fallacies and a few smuggled obiter dicta, self-deception as the preliminary to public deception is almost automatic.

The claim that we have learned how to measure hereditary intelligence has no scientific foundation. We cannot measure intelligence when we have never defined it, and we cannot speak of its hereditary basis after it has been indistinguishably fused with a thousand educational and environmental influences from the time of conception to the school age. The claim that Mr. Terman or anyone else is measuring hereditary intelligence has no more scientific foundation than a hundred other fads, vitamins and glands and amateur psychoanalysis and correspondence courses in will power, and it will pass with them into that limbo where phrenology and palmistry and characterology and the other Babu sciences are to be found. In all of these there was some admixture of primitive truth which the conscientious scientist retains long after the wave of popular credulity has spent itself.

So, I believe, it will be with mental testing. Gradually under the impact of criticism the claim will be abandoned that a device has been invented for measuring native intelligence. Suddenly it will dawn upon the testers that this is just another form of examination, differing in degree rather than in kind from Mr. Edison’s questionnaire or a college entrance examination. It may be a better form of examination than these, but it is the same sort of thing. It tests, as they do, an unanalyzed mixture of native capacity, acquired habits and stored-up knowledge, and no tester knows at any moment which factor he is testing. He is testing the complex result of a long and unknown history, and the assumption that his questions and his puzzles can in fifty minutes isolate abstract intelligence is, therefore, vanity. The ability of a twelve-year-old child to define pity or justice and to say what lesson the story of the fox and crow “teaches” may be a measure of his total education, but it is no measure of the value or capacity of his germplasm.

Once the pretensions of this new science are thoroughly defeated by the realization that these are not “intelligence tests” at all nor “measurements of intelligence,” but simply a somewhat more abstract kind of examination, their real usefulness can be established and developed. As examinations they can be adapted to the purposes in view, whether it be to indicate the feeble-minded for segregation, or to classify children in school, or to select recruits from the army for officers' training camps, or to pick bank clerks. Once the notion is abandoned that the tests reveal pure intelligence, specific tests for specific purposes can be worked out.

A general measure of intelligence valid for all people everywhere at all times may be an interesting toy for the psychologist in his laboratory. But just because the tests are so general, just because they are made so abstract in the vain effort to discount training and knowledge, the tests are by that much less useful for the practical needs of school administration and industry. Instead, therefore, of trying to find a test which will with equal success discover artillery officers, Methodist ministers, and branch managers for the rubber business, the psychologists would far better work out special and specific examinations for artillery officers, divinity school candidates and branch managers in the rubber business. On that line they may ultimately make a serious contribution to a civilization which is constantly searching for more successful ways of classifying people for specialized jobs. And in the meantime the psychologists will save themselves from the reproach of having opened up a new chance for quackery in a field where quacks breed like rabbits, and they will save themselves from the humiliation of having furnished doped evidence to the exponents of the New Snobbery.

Notes:

1 “For norms of adult intelligence the results of the Army examinations are undoubtedly the most representative. It is customary to say that the mental age of the average adult is about sixteen years. This figure is based, however, upon examinations of only 62 persons. . . . This group is too small to give very reliable results and is further more probably not typical.” Psychological Examining in the United States Army, p. 785.

The reader will note that Major Yerkes and his colleagues assert that the Stanford standard of adult intelligence is based on only sixty-two cases. This is a reference to page 49 of Mr. Terman’s book on the Stanford Revision of the Binet-Simon Scale. But page 13 of the same book speaks of 400 adults being the basis on which the adult tests were standardized. I have used this larger figure because it is more favorable to the Stanford-Binet scale.

It should also be remarked that the army figures are not the absolute figures but the results of a “sample of white draft” consisting of nearly 100,000 recruits. In strictest accuracy we ought to say then that the disagreement between army and Stanford-Binet results derives from conclusions drawn from 100,000 cases as against 400.

If these 100,000 recruits are not a fair sample of the nation, as they probably are not, then in addition to saying that the army tests contradict the Stanford-Binet scale, we ought to add that the army tests are themselves no reliable basis for measuring the average American mentality.

2 Yoakum and Yerkes, Army Mental Tests, p. 3.

3 Psychological Examining in the United States Army, p. 419. “The high frequencies of persons gaining at the upper levels (often 100%) indicate for the people making high scores on single time the ‘speed’ element is predominant.”

4 Stanford Revision Binet-Simon Scale, p. 42.

5 Army Mental Tests, p. 20.

6 Stanford Revision of Binet-Simon Scale, Chapter VI.

7 The Measurement of Intelligence, p. 3.

8 Popular Science Monthly, May, 1915.

9 Galton, Hereditary Genius (1869) cited by Stoddard, Revolt Against Civilization, p. 49.

10 cf. McDougall, p. 40.

11 Revision, p. 89.

12 Revision, p. 51.

13 Revision, p. 99.

14 Revision, p. 50.

Source: Walter Lippmann, “The Mental Age of Americans,” New Republic 32, no. 412 (October 25, 1922): 213–215; no. 413 (November 1, 1922): 246–248; no. 414 (November 8, 1922): 275–277; no. 415 (November 15, 1922): 297–298; no. 416 (November 22, 1922): 328–330; no. 417 (November 29, 1922): 9–11.
