

Standards and Criteria Redux

Susan Notes:

I am posting only the beginning of this paper because I can't reproduce the figures. It is definitely worth your while to go to the url below and read the entire paper. This observation comes near the end of the paper:


To my knowledge, every attempt to derive a criterion score is either blatantly arbitrary or derives from a set of arbitrary premises. But arbitrariness is no bogeyman, and one ought not to shrink from a necessary task because it involves arbitrary decisions. However, arbitrary decisions often entail substantial risks of disruption and dislocation. Less arbitrariness is safer.


There's a great term in this paper: Pseudoquantification. There are plenty of places we can put it to good use.


by Gene V. Glass

[Note: I published the paper "Standards and criteria" in the Journal of Educational Measurement in 1978 (Vol. 15, 237-261). I am now revisiting it because the message is more urgent now than it was then. This paper is essentially a reprinting of the 1978 paper, to which I plan to add a prologue.]

A common expression of wishful thinking is to base a grand scheme on a fundamental, unsolved problem. Politicians outline energy policy under the assumption that physicists will soon be able to control the intense heat generated by nuclear fusion. Planners chart the future course of cancer research with faith that basic discoveries will be made at an expenditure of $2 billion plus or minus. Those who think on exalted levels are prone to underrate the complexity of what seem lesser problems. Utilitarianism in ethics is an example. "The greatest good for the greatest number" is not only logically inconsistent (since one can't maximize two functions simultaneously) but, as a social policy, it falls at the final hurdle: there exists no social calculus by which one can compute the amount of good eventuating from a social policy.

Contemporary educational movements present a similar situation: accountability, mastery learning, assessment, competency-based education, minimal competence graduation requirements. A literature search under any one of these categories brings a deluge of reports, speeches, and position papers. The movements have spawned laws, jobs, conferences, and distinguished commissions. And much of the language and thinking rests at bottom on a common notion: that a minimal acceptable level of performance on a task can be specified. Whether it goes by the name "mastery," "competence," or "proficiency," it is the same fundamental notion. A judge (technician, professional, and the like) inspects an exercise or task or test and somehow determines that the score Cx represents mastery, minimal competence, proficiency, etc. A recent incident in New England could be a bellwether for school districts across the country:

By a vote of 6 to 2, the board of education in Stamford, Conn., has adopted a resolution requiring applicants for teaching jobs to "demonstrate mastery of written and spoken English as a prerequisite to being hired." The resolution also stipulated that teachers now employed in the Stamford schools would be tested in English and those found "deficient in communication" would receive remedial instruction.

I have read the writings of those who claim the ability to make the determination of mastery or competence in statistical or psychological ways. They can't. At least, they cannot determine "criterion levels" or standards other than arbitrarily. The consequences of the arbitrary decisions are so varied that it is necessary either to reduce the arbitrariness, and hence the unpredictability of the consequences of applying the standards, or to abandon the search for criterion levels altogether in favor of ways of using test data that are less arbitrary and, hence, safer.

This monograph has grown out of a series of discussions and a six-month period of reading and reflecting on the literature which were initiated by Fritz Mosher's suggestions to the National Assessment of Educational Progress (NAEP) to examine the "standards" question. Conversations with Mosher himself and the staff of NAEP have been most influential. The Analysis Advisory Committee of NAEP, under Fred Mosteller's chairmanship, proved a rigorous testing ground for many of the ideas.

In the following pages, I shall (a) examine the ordinary usage of the words "standards" and "criteria" in the measurement literature; (b) trace the evolution of the notion of performance standards in the criterion-referenced testing movement; (c) analyze and critique six methods of setting performance standards on criterion-referenced tests; and (d) reflect briefly on the political forces which have become focused on the standards issue.

"Standards" In Common Parlance

Setting standards or mastery levels is frequently written about as though it is a well-established and routine phase of instructional development. In conversations with measurement specialists and instructional development experts over the past few years, I have been literally dumbfounded by the nonchalance with which they handle the standards problem. One will report that he always sets a standard of two-thirds of the items correct for mastery because he's a sort of "liberal guy." Another expert will report that he holds learners to 70% mastery, and a third advances his 90% standard with an air of tough-mindedness and respect for excellence. None of them bothers with such apparently extraneous considerations as how the test items are to be composed and whether they will be abstruse or obvious. In one of the sacred writings of the instructional objectives movement, Robert F. Mager (1962) identified standard setting as an integral part of stating an objective properly:

If we can specify at least the minimum acceptable performance for each objective, we will have a performance standard against which to test our instructional programs; we will have a means for determining whether our programs are successful in achieving our instructional intent. What we must try to do, then, is indicate in our statement of objectives what the acceptable performance will be, by adding words that describe the criterion of success. (p. 44)

Mager went on to illustrate what he meant by a behavioral objective and its associated standard:

  • The student must be able to correctly solve at least seven simple linear equations within a period of thirty minutes.

  • Given a human skeleton, the student must be able to correctly identify by labeling at least 40 of the. . . bones; there will be no penalty for guessing.

  • The student must be able to spell correctly at least 80 percent of the words called out to him during an examination period. (p. 44)


This language of performance standards is pseudoquantification, a meaningless application of numbers to a question not prepared for quantitative analysis. A teacher, psychologist, or linguist simply cannot set meaningful standards of performance for activities as imprecisely defined as "spelling correctly words called out during an examination period." And little headway is made toward a solution to the problem by specifying in greater detail how the questions, tasks, or exercises will be constructed.

    Can a more meaningful performance standard be stated for an objective as molecular as "the pupil will be able to discriminate the grapheme combination 'vowel + r' spelled 'ir' from other graphemes"? Can it be asserted confidently about this narrow objective that a pupil should be able to make 9 out of 10 correct discriminations? In point of fact, this objective appears on the Stanford Reading Test where it is assessed by two different items:

    a) Mark the word "firm" (Read by proctor)

    firm

    form

    farm

    b) Mark the word "girl" (Read by proctor)

    goal

    girl

    grill

    The percentages of second-grade pupils in the norm population answering items a) and b) correctly were 56% and 88%, respectively. Any performance standards (e.g., "8 out of 10 correct") for a group of items like item a would be quite inappropriate for a group of items like item b, since they are so different in difficulty. Results from a grade seven assessment by the Department of Education in New Jersey illustrate the same point. Pupils averaged 86% on vertical addition, but only 46% on horizontal addition. The vagaries of teaching and measurement are so poorly understood that the a priori statement of performance standards is foolhardy.
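The difficulty gap between items a) and b) can be made concrete. Under a simple binomial model (an illustrative assumption, not part of the original analysis), a fixed standard of "8 out of 10 correct" yields wildly different pass rates for a ten-item test built from items like a) versus one built from items like b):

```python
from math import comb

def pass_rate(p_item: float, n_items: int = 10, cutoff: int = 8) -> float:
    """Probability of reaching the cutoff, assuming each item is
    answered correctly independently with probability p_item."""
    return sum(comb(n_items, k) * p_item**k * (1 - p_item)**(n_items - k)
               for k in range(cutoff, n_items + 1))

# Item difficulties taken from the Stanford Reading Test example above.
print(f"items like a) (56% correct): {pass_rate(0.56):.1%} meet the standard")
print(f"items like b) (88% correct): {pass_rate(0.88):.1%} meet the standard")
```

Under these assumptions, roughly 11 percent of pupils meet the standard on an a)-style test but about 89 percent on a b)-style test: the same cut score encodes entirely different expectations depending on item difficulty.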

    Benjamin S. Bloom (1968), whose name has become closely associated with the notion of "mastery learning," has written of instructional psychology in ways that depend fundamentally on notions of performance standards:

    Most students (perhaps over 90 percent) can master what we have to teach them. (p. 1)

    There is little question that the schools now do provide successful learning experiences for some students, perhaps as high as one third of the students. If the schools are to provide successful and satisfying learning experiences for at least 90 percent of the students, major changes must take place in the attitudes of students, teachers, and administrators... (p. 2)

    Thus, we are expressing the view that, given sufficient time (and appropriate types of help), 95 percent of students...can learn a subject up to a high level of mastery. We are convinced that the grade of A as an index of mastery of a subject can, under appropriate conditions, be achieved by up to 95 percent of the students in the class. (p. 4)

    Popham (1973), writing on instructional objectives for teachers in training, reaffirmed the centrality of performance standards:

    There is, however, another dimension to objective writing, a dimension that further aids the teacher in planning and evaluating his instruction. It involves establishing performance standards, that is, specifying prior to instruction the minimal levels of pupil achievement. (p. 3)

    The notion of performance standards is repeatedly illustrated in Popham's teachers' manual:

    In a math class, the student will be able to solve ten of fifteen perimeter problems. (p. 3)

    The student will be able to identify correctly, through chemical analysis procedures, at least five unknown substances. (p. 6)

    Wiersma and Jurs (1976), in outlining the instructional evaluation component of Individually Guided Education (the University of Wisconsin R & D Center instructional plan), gave the following description of criterion-referenced testing:


    When an individual's performance score is interpreted with reference to an established criterion and without reference to the level of the performance of a group, we have a criterion referenced interpretation. The criterion is usually established prior to any actual measurement being done. The criterion or criteria are usually stated in the instructional objectives or in supplements to the stated objectives. For example, a list of objectives may have an accompanying statement indicating that when students score 90 percent correct on the related test, they should be considered as having attained the objectives. (p. 14)

    In detailing the role of testing in assessment programs, Ralph W. Tyler (1973) illustrated a performance standard for determining mastery:

    For example, in primary reading, the children who enter without having learned to distinguish letters and sound might be tested by the end of the year on letter recognition, association of letters with sounds, and word-recognition of one hundred most common words. For each of these specified "things to be learned," the child would be presented with a large enough sample of examples to furnish reliable evidence that he could recognize the letters of the alphabet, he could associate the appropriate sounds with each letter, alone and in words, and he could recognize the one hundred most common words. A child has demonstrated mastery of specified knowledge, ability, or skill when he performs correctly 85 percent of the time. (Some small allowance, like 15 percent, is needed for lapses common to all people.) (p. 105)
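Tyler's 85 percent rule leaves open how large a sample of exercises is "large enough." As a minimal sketch (assuming independent binomial item responses, an assumption not in Tyler's text), one can compute how often a child whose true proficiency falls below the standard would nonetheless be classified a master at various test lengths:

```python
from math import comb

def p_pass(p_true: float, n_items: int, standard_pct: int = 85) -> float:
    """Probability of scoring at or above the percentage standard on an
    n_items test, assuming independent items each answered correctly
    with probability p_true."""
    cutoff = -(-standard_pct * n_items // 100)  # exact ceiling division
    return sum(comb(n_items, k) * p_true**k * (1 - p_true)**(n_items - k)
               for k in range(cutoff, n_items + 1))

# Chance that a child whose true proficiency is 75% (below the 85%
# standard) is nevertheless classified a master, at various test lengths.
for n in (10, 20, 50, 100):
    print(f"{n:3d} items: false-mastery probability = {p_pass(0.75, n):.1%}")
```

With only 10 items such a child passes about a quarter of the time; with 100 items, only rarely. The "small allowance" Tyler mentions thus interacts with test length in ways the bare 85 percent figure conceals.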

    The staff of the National Assessment of Educational Progress have grappled with the performance standards problem for years to almost no one's satisfaction. Though they have never adopted an official position on the matter, they did cooperate with the National Council for the Social Studies in an effort to apply performance standards to the assessment results in citizenship and social studies (Fair, 1975). A fully representative panel of nine judges (3 minorities, 5 women, 3 under the age of 30) was formed. Each judge was shown an assessment item and then asked, "Realistically, what level of performance nationally for the age level being considered would satisfy you for this exercise? (1) less than 20% correct, (2) 20-40%, (3) 41-60%, (4) 61-80%, or (5) more than 80%?" The panel rendered over 5,000 judgments in a three-day sitting, and it has been reported that "...panel members agreed more often than not, but at times spread their responses across all the available categories" (Fair, 1975, p. 45). About half of the exercises were given a "satisfactory performance level" of "more than 80%." About 35% of the exercises would satisfy the panel if between 60% and 80% of the examinees answered correctly. The desired performance levels were generally above the actual rates of correct response. What is to be made of the gap? Ought it to be read as evidence of the deficiency of the educational system, or is it testament to the panel's aspirations, American hustle, and the indomitable human spirit ("a man's reach should exceed his grasp," etc.)?

    The reader can justifiably ask, "What manner of discourse is being engaged in by these experts?" How is one to regard such statements as "the student must be able to correctly solve at least seven simple linear equations in thirty minutes" or "90 percent of all students can master what we have to teach them"? If such statements are to be challenged, should they be challenged as claims emanating from psychology, statistics, or philosophy? Do they maintain something about learning or something about measurement? Are they disconfirmable empirical claims or are they merely educational rhetoric spoken more for effect than for substance? . . .

    Please go to the link below to read the rest of the paper.

    — Gene V. Glass

    2003-02-
    http://glass.ed.asu.edu/gene/papers/standards/



