Chapter 4
FORMAL ASSESSMENTS
A. Understanding the Basics
1. Graphic Organizer
----------------------------------------------------
Insert Figure 4-1 about here: Chapter 4 Graphic Organizer
----------------------------------------------------
2. Chapter Overview
This chapter describes formal assessments
of reading and literacy, the use of standardized and other highly structured
tests. Emphasis is placed upon wise and
knowledgeable testing policies, including establishing purposes for assessment,
criteria for selecting instruments, and uses and misuses of data obtained from
such instruments. Attention is given to
the basic statistical information necessary for understanding standardized test
scores. Several general categories of
norm-referenced reading/literacy tests, each of which serves a different
function, are described and examples provided.
These general categories include the following: Reading survey tests, general achievement
tests, and reading diagnostic tests.
Some guidelines for the implementation of standardized tests are
provided, with a discussion of advantages and disadvantages of this approach to
assessment. The chapter provides
several opportunities for the reader to engage in analysis and interpretation
of test data.
3. Understanding Formal Assessment
Formal assessment generally refers to the use of standardized
tests or other tests designed for administration under specified, controlled
conditions to measure students' reading ability or general school achievement
and aptitude. Standardized tests are
often called norm-referenced because norms are provided from a reference
population. That is, the test publisher
has administered the test to a sample of children (the reference population) and
provided statistical information about the resulting range of scores (the
norms).
Some standardized tests are developed for
individual administration, but most are developed for group
administration. One purpose of such
tests is to provide a standard to which individuals' and groups' performance
can be compared. The scores obtained
from standardized tests are intended to be quantitative, non-biased, and
non-subjective. Examiner judgment and
other qualitative factors that might influence a pupil's score, such as
attitude and interest, are minimized in these assessments. Thus, these tests are perceived to be
objective devices (Calfee & Hiebert, 1991; Pearson & Dunning, 1985).
Standardized testing has its roots in the
works of educational and psychological researchers of the early 1900's such as
E. L. Thorndike and C. Spearman.
Scientific objectivity in testing became an important issue in
psychological and educational assessments during this period when researchers
focused on eliminating the subjective judgment of an examiner in an attempt to
be more impartial and fair to the person being examined. Thus, fairness in testing has long been
equated with objective tests in which test scores become the primary source of data used to judge a person's ability.
To ensure their objectivity, standardized
tests must be administered according to the instructions in the examiner's
manual that accompanies each test.
Examiners are required to read the directions for taking the test
exactly as they are stated in the examiner's manual and to adhere to time
limitations for completing the test.
This procedure ensures that all test takers will receive the same
instructions, participate in completing the same sample items, and have the
same amount of time to complete the test.
Some individually administered
standardized tests have guidelines for establishing a basal level and a ceiling
level. This procedure shortens the time
needed for testing by eliminating some items that are too easy or too difficult
for the child. These basal and ceiling
levels can only be established when the items on a test are arranged from easy
to difficult. Basal levels are
based on data from the norming population which suggests that students of
certain ages can successfully complete items of that difficulty level. For example, the standard basal level, or
starting point, of a test for children at a certain age might not be item #1
but might be item #15 or higher. The
examiner is instructed by the manual to make sure that the child is able to
successfully answer the first few questions.
Most children at that age will be able to do that, but some lower
ability students will not be able to.
Then, the examiner must proceed backwards through the test items until
the student responds correctly to a designated number of items (usually
5-7). This point is considered the
basal level for that child and an assumption is made that all other items below
this level would be answered correctly.
Determination of a ceiling level allows
the test to be terminated at a point at which the student is making too many
errors within a given number of items.
For example, the examiner's manual may instruct that when a student
makes five errors out of eight items, testing will be stopped because a ceiling
for that student has been reached. The
assumption is made that the remaining items on the test are too difficult for
the student.
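For readers who find a procedural summary helpful, the basal/ceiling logic described above can be sketched as a short routine. The Python example below is purely illustrative; the specific rules assumed here (a basal of five consecutive correct responses, a ceiling of five errors within eight consecutive items) stand in for whatever rules a particular test's examiner's manual actually specifies.

def find_basal_and_ceiling(responses, start_item, basal_run=5,
                           ceiling_errors=5, ceiling_window=8):
    """Illustrative basal/ceiling logic for an individually administered test.

    responses  -- hypothetical list of booleans, one per item in order of
                  difficulty (True = correct answer)
    start_item -- index of the age-based starting point from the manual
    Returns (basal, ceiling); items below the basal are credited as correct,
    and testing stops at the ceiling.
    """
    # Work backwards from the starting point until the child has answered
    # `basal_run` consecutive items correctly; that run marks the basal.
    basal, run = start_item, 0
    for i in range(start_item, -1, -1):
        run = run + 1 if responses[i] else 0
        if run >= basal_run:
            basal = i
            break
    else:
        basal = 0  # no such run; testing would begin with the first item

    # Move forward until `ceiling_errors` errors occur within any
    # `ceiling_window` consecutive items; that point is the ceiling.
    ceiling = len(responses) - 1
    for i in range(basal, len(responses)):
        window = responses[max(basal, i - ceiling_window + 1):i + 1]
        if window.count(False) >= ceiling_errors:
            ceiling = i
            break
    return basal, ceiling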
Objective
tests have been widely used from the early 1900's to the present time and play
a major part in the assessment of students' academic growth and placement in
particular school programs (Pikulski, 1990;
Valencia & Pearson, 1988).
Assignment of students to a reading group, for example, is often based
on test results. Standardized tests are
also used as a yardstick in measuring the effectiveness of teachers and school
programs (Smith, 1991). Neill &
Medina (1989) estimate that over 105 million standardized tests are
administered each year to over 39.8 million students in schools in the U.S.
Characteristics of standardized tests
Most of the standardized tests used in
schools today are group tests with multiple choice answer formats. They are developed, published, and marketed
by commercial companies that determine the test content, construct the items,
and gather the norming data from the targeted population.
Norming.
In the norming process, the test is given to a large group of students
who are representative of students who will later use the test. The publisher determines the population that
will serve as the norming group. Those
involved in the norming group, or reference group, set the standards by which
other individuals' performance will be measured in future administrations. The norming sample should represent people
from various geographic areas, ages and grade levels, socio-economic status,
racial and cultural groups, as well as
a variety of community groups, such as urban, rural, and suburban. If the norming procedure is inadequate, as
when the norming sample is too small or not representative of the target
population, the test scores may be misleading.
To help ensure that test results will be
dependable, statisticians have developed several criteria that the tests should
satisfy. Among these criteria are two
that all norm-referenced tests should meet:
reliability and validity.
Reliability.
Reliability is a measure of consistency and generalizability. Salvia and
Ysseldyke (1991) suggest that three questions be asked when evaluating
the reliability of a test:
1. If the test were re-administered or
scored by another examiner, would
the pupil's performance yield the same or very similar score? This
is often called rater reliability.
2.
Are the scores stable over time? Would
the student's score be essentially
the same if he or she took the test again next week, next month or two months from now? This is often called test-retest reliability.
3.
Can we generalize similar behaviors to other test items? If the student were tested on a representative sample from a subset of
items from the test -- such as a sample of letter names and sounds rather than
all of the names and sounds on the test-- would that student get a similar
score if he or she were tested on a different sample from that same set of
items? For example, would the student
get the same or similar score on the odd-numbered items as on the even-numbered items? This is often called split-half
reliability. A similar reliability
measure, often called alternate form reliability, is carried out using two
versions of the same test. Publishers
often provide different versions, called forms, of the same test for use with
the same children at different times.
For example, Form A of the test might be administered in September and
Form B in June.
Reliability is often reported by the test
publisher in the form of a reliability coefficient. This is a statistical measure of the test's reliability. It is usually in the form of a correlation
coefficient which measures consistency:
The closer to 1.0 the measure, the greater the reliability. In general, the reliability of a test or
subtest score is highly related to the number of items used to assess that
score. The more items, the higher the
reliability.
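Conceptually, a split-half or alternate-form reliability coefficient is simply a correlation between two sets of scores earned by the same students. The sketch below, using made-up half-test scores, computes that correlation and then applies the Spearman-Brown formula conventionally used to estimate the reliability of the full-length test; both the data and the assumption that no other corrections are applied are hypothetical.

from statistics import correlation  # available in Python 3.10+

# Made-up half-test scores for ten students (odd-numbered vs. even-numbered items).
odd_half = [12, 15, 9, 18, 14, 11, 16, 13, 10, 17]
even_half = [11, 16, 10, 17, 13, 12, 15, 14, 9, 18]

# Split-half reliability begins as the correlation between the two halves...
r_half = correlation(odd_half, even_half)

# ...and is stepped up to full-test length with the Spearman-Brown formula,
# reflecting the point that more items generally mean higher reliability.
r_full = (2 * r_half) / (1 + r_half)

print(f"half-test correlation: {r_half:.2f}; estimated full-test reliability: {r_full:.2f}")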
Validity.
A test has validity if it actually measures what it claims to
measure. Test publishers often report
their efforts to substantiate the validity of a test in terms of a validity
coefficient, in similar fashion to the reliability coefficient reported
above. A validity coefficient is
usually in the form of a correlation coefficient, measuring the consistency
between the test and some other appropriate criterion measure. Common types of validity include content
validity, criterion-related validity, and construct validity.
Content
validity is determined by evaluating factors such as the appropriateness
of the questions (the questions should represent the content to be measured) and the completeness of the items sampled (the test items should cover the range of behaviors associated with the topic).
The content of the test should represent the common curricular
requirements of schools where the test is being used. A frequent complaint of teachers about standardized tests is that
the school curriculum and the test content do not match well; that is, the test
does not measure what students have been taught.
Criterion-related validity is based on how
accurately a test score can estimate a current criterion score or how
accurately a test score can predict a criterion score at a later time (Salvia
& Ysseldyke, 1991). A criterion
score is one that is generally recognized as representing a person's ability in
a particular domain. For example, if
the test publisher wanted to demonstrate the criterion-related validity of the
reading comprehension passages of a new standardized test, the publisher
would find another measure of reading
comprehension that is generally agreed on as being an accurate measure of reading comprehension. Then both assessments, the new test and the
existing accepted measure, would be administered to the same students and the
results compared.
Two
major types of criterion-related validity are known as concurrent and
predictive validity. Concurrent
validity refers to how well the test compares to other established measures
administered at the same time.
Predictive validity refers to how well the scores on the test correlate
to established measures at a later time.
For example, the most useful test of validity for college entrance
examinations would be the student's actual later performance in college. The most useful test of validity for a
traditional reading readiness test would be the child's actual later
achievement in reading. In practice,
concurrent validity is usually obtained by comparing how well a new test
correlates to other established tests and/or to teacher judgment. Predictive validity is usually determined by
measuring how well the scores on the new test compare to later assessments.
Construct validity refers to how well the
test measures its theoretical construct--that is, to the trait it supposedly is
assessing. For example, consider a test
of vocabulary knowledge. Of course,
such a test should be strongly related to a student's ability in
vocabulary. However, general
intelligence might also predict to some degree how many words a student knows,
so it is possible that what appears to be a test of vocabulary is in fact a
test of general intelligence. To
document the construct validity of the vocabulary test, the test designer might
show that the test predicts actual knowledge better than does a test of general
intelligence.
Test format can also hinder accurate
assessment of an individual's performance by being inappropriate to measure the
desired trait. Suppose students must
indicate their knowledge of higher level reading by marking tiny spaces in
columns on a computer-scored answer sheet.
It is up to the test publisher to demonstrate the construct validity of its test: to show that this artificially designed multiple choice device, which bears little actual resemblance to the performance of higher level reading tasks in real life, actually does measure the construct of higher level reading.
Understanding Standardized Test Scores
Teachers, counselors, and administrators
often use and report on the results of standardized test scores. They must be able to interpret these test
scores in ways that are meaningful not only to themselves, but to those to whom
the results are being reported. They
also must use the results in ways that are consistent with the purposes and
limitations of the testing instruments.
For test scores to be useful, an educator must be able to tell how a student's score compares to
the educational performance of other children.
A raw score is the number of points earned on the test, usually measured
in terms of number of correct responses.
The raw score of a standardized test is usually unhelpful for the
purpose of comparing performance among children.
For example, a teacher may want to know
how each of her third grade readers compares to all the third graders in the
school district who took the test, a task that could involve hundreds of raw
scores. There is a need for a more
economical way of comparison than by trying to make sense of all the raw
scores. The teacher must also be able
to determine how the score on one test compares to the scores on similar tests. These different tests will have been scored
on different scales due to different numbers of questions and different levels
of question difficulty: The raw score
on one test could be 35 and on another 500 and on another 72. In the raw score form, the test results will
not be comparable. For comparisons to
be possible, a method must be employed that will allow teachers to make sense
of many raw scores resulting from many testing instruments. This method, called standardization, is
accomplished through statistical procedures.
The following sections provide a brief overview of basic statistical
tools and standardized scores.
Measures of central tendency. The scores in a set of scores, regardless of
the scoring scale used, will represent a range from high to low, called a
distribution. Descriptive statistics
can be used to economically summarize the data represented in this
distribution. Among the most common
descriptive statistics used for reporting test results are the measures of
central tendency, which describe the way scores cluster around the middle of
the distribution. Among these measures
of central tendency are the mean, the median, and the mode.
The mean is the arithmetic average of the
scores in a set of scores. To obtain
the mean, add all the scores and divide the sum by the number of scores in the
set. The median is the middle score in
a set. To obtain the median score, list
all the scores in the set from highest to lowest. The median score will be the score in the middle of the list: Half the scores will be at or above the
median, and half will be at or below it.
The mode, or modal score, is the score that appears most often in a set. To obtain the mode, count how many times
each score appears in the set. The
score that appears most frequently is the mode.
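These three measures can be computed directly from a list of raw scores. The short Python sketch below uses a made-up classroom set simply to show the arithmetic.

from statistics import mean, median, mode

# A made-up set of nine raw scores from one classroom.
scores = [42, 55, 55, 60, 61, 63, 70, 72, 75]

print("mean:  ", mean(scores))    # arithmetic average: sum of scores / number of scores
print("median:", median(scores))  # middle score of the ordered list (61)
print("mode:  ", mode(scores))    # most frequently occurring score (55)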
Distribution and Standard Deviation. Another common descriptive statistic used
for reporting test results is the standard deviation, which is related to the
normal curve (sometimes called the bell or
bell-shaped curve). One
assumption common to standardized tests is that among large groups of people
taking the same test, the distribution of scores will include a few very high
scores, some high scores, many average scores, some low scores, and a few very
low scores. This symmetrical type of
distribution is called a normal curve, or normal distribution. When it is represented graphically, it
resembles a bell shape (see Figure 4-2).
----------------------------------------------------
Insert Figure 4-2 about here: Normal Distribution
----------------------------------------------------
In a graphical depiction of the normal
distribution, the mean runs vertically through the center of the bell. The two portions of the bell on either side
of the mean can be divided through statistical procedures into standard
deviation units, which show the spread
(or variability) of scores around the mean.
In a normal curve, corresponding portions of the curve, or standard
deviation units, contain the same percentage of scores. That is, the portion of the normal curve up
to one standard deviation immediately to the left of the mean (-1) and the
portion up to one standard deviation immediately to the right of the mean (+1)
will each have the same percentage of scores.
As you move farther away from the mean on each side of this normal
curve, the pattern of equal percentages of scores will hold for the remaining
pairs of mirror-image standard deviation units.
In a normal curve, approximately 68% of the scores will fall between +1 and -1
standard deviations of the mean, with roughly 34% of the cases in each standard
deviation (see Figure 4-2). Another
approximately 28% of the scores will
fall between +1 and +2 and between -1 and -2 standard deviations from the mean,
with approximately 14% of the cases in each of these standard deviations. Another approximately 4% of the scores will
fall in the third outermost pairs of standard deviations, with about 2% of the
cases in each of these standard deviations.
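The percentages quoted above can be checked against the cumulative normal distribution. The sketch below does so with Python's standard library; the exact values are roughly 34.1%, 13.6%, and 2.1% per band, which this chapter rounds to 34%, 14%, and 2%.

from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def band(lo, hi):
    """Percentage of scores expected between lo and hi standard deviations."""
    return 100 * (std_normal.cdf(hi) - std_normal.cdf(lo))

print(f"-1 to +1 SD: {band(-1, 1):.1f}%")  # about 68% in total (34% per side)
print(f"+1 to +2 SD: {band(1, 2):.1f}%")   # about 14% (13.6%) per side
print(f"+2 to +3 SD: {band(2, 3):.1f}%")   # about 2% per side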
In an administration of a given test, the
larger the testing population, the closer the distribution of results will be
to the normal curve described above.
The samples on which standardized tests are normed are almost always
large enough to produce nearly perfect normal curves. Also, most standardized tests are designed so that very few
students will achieve a perfect score--a further assurance of a normal curve
for the test results.
The standard deviation is valuable in that
it shows how the scores in any test will cluster around the mean. That is, the standard deviation shows how
close the test scores are to the average score and to one another. For example, let us suppose that the same
standardized reading test was given in two different schools. In both schools, the two sets of test scores
happened to yield the same mean. With
only the mean for use, an administrator might conclude that the schools were
basically equivalent in achievement.
But let us further suppose that, for some
reason, the distribution of the scores in the two schools was very
different. By having access to the
standard deviation, as well as the mean, the administrator could recognize this
difference. Suppose School A and B both
had a mean score of 65. But School A
had a standard deviation of 4, while School B had a standard deviation of
14. The standard deviation of 14 would
indicate that the range of raw scores was very broad for the scores in that
set, with a large number of low and very low raw scores and a large number of
high and very high raw scores.
Conversely, the small standard deviation of 4 would indicate that the
raw scores in School A had a much narrower range, with a much smaller interval
between the higher and lower sets of scores.
This difference in variability, as
indicated by the difference in standard deviation, would lead an administrator to draw different conclusions about school performance. As stated above, in any normal distribution,
approximately 68% of the scores will fall within the two standard deviations
closest to the mean--the ones on either side of the mean (from -1 to +1). A student in School B whose raw score is 56
will be within the range in which 68%, over two-thirds, of the students in the
school scored. We can verify this
statement by subtracting the student's raw score of 56 from the school mean of
65. The result, 9, is less than School
B's standard deviation of 14. Thus
this student's score is within one standard deviation below the mean, between
the mean and -1 standard deviation. Any
student scoring in that same middle range as 68% of the other students in the
school is probably performing at an average level of achievement.
On the other hand, any student from the
other school, School A, whose raw score is 56 is much more seriously at risk
of inadequate performance. Even though
such a student would be 9 raw score points below the mean, just as in School B,
the smaller standard deviation at School A would indicate that this student's
score was much further outside the mainstream of School A student performance. With School A's standard deviation of 4, a
score 9 raw score points below the mean would place this student more than 2
standard deviations below the mean.
This student would be performing among the 2% lowest achievers in the school.
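The School A/School B comparison amounts to expressing the same raw score as a number of standard deviations from each school's mean, a quantity usually called a z-score. The sketch below reproduces the arithmetic for the hypothetical raw score of 56; the percentages assume the scores are approximately normally distributed.

from statistics import NormalDist

def locate(raw, mean, sd):
    z = (raw - mean) / sd                      # distance from the mean in SD units
    pct_below = NormalDist(mean, sd).cdf(raw)  # expected share of scores below raw
    return z, pct_below

# The same raw score of 56 and mean of 65, but very different standard deviations.
for school, sd in [("School B", 14), ("School A", 4)]:
    z, pct = locate(56, 65, sd)
    print(f"{school}: z = {z:+.2f}; roughly {pct:.0%} of scores fall below 56")

A run of this sketch shows the raw score of 56 sitting well within one standard deviation of School B's mean but more than two standard deviations below School A's mean, which is the basis of the interpretation given above.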
In actual practice, two schools' standard
deviations would infrequently differ by so much, but the point should be taken
that mere knowledge of the test mean is inadequate for appropriate
interpretation of the test results. One
school might serve a very homogeneous population and obtain smaller variance,
while another school might serve a very heterogeneous population and obtain a
larger variance.
Knowledge of test variance as shown by the
standard deviation is also quite important when comparing results of different
tests. Tests usually differ
significantly in their variance. A five
point difference in raw scores between two students might be insignificant on a
test with a large standard deviation, but be quite important on a test with a smaller
variance.
Standard error of measurement. All standardized tests are subject to
error. A person taking a test today
would probably have a different score if he or she took the test again next
week simply because changing conditions affect performance. To provide for adequate understanding of
this possibility of error, test publishers use statistical procedures to
determine a range of scores around any given student's score. This range is called a standard error of
measurement (SEM). The standard error
of measurement is defined as the amount of error that is expected for a score
on a particular standardized test. It
is viewed as the difference between the score obtained by an actual student and
that student's hypothetical "true" score.
For
example, if a student obtains a raw score of 20 and the standardized test
standard error of measurement is 5, the student's hypothetical "true"
score would be between a possible low of 15 and a possible high of 25. Tests with high reliability have low SEMs,
and tests with poor reliability have high SEMs.
The standard error of measurement has very
practical implications for teachers, especially when using a test to
differentiate student performances. On
the test described above, where the SEM is 5, if George obtains a raw score of
59 (indicating that his "true" score lies somewhere between 54 and
64) and Susan a raw score of 57 (indicating that her "true" score
lies within a range of 52 and 62, which largely overlaps that of George), the
teacher would be ill-advised to differentiate instruction for the two students
based on only that testing information.
The odds are fairly high that Susan's "true" score is the same as, or even higher than, George's.
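Put another way, the question is whether the two students' SEM bands overlap. A minimal sketch, assuming the SEM of 5 used in the example:

def sem_band(raw_score, sem=5):
    """Range in which the hypothetical 'true' score is presumed to lie."""
    return raw_score - sem, raw_score + sem

george = sem_band(59)  # (54, 64)
susan = sem_band(57)   # (52, 62)

# The bands overlap when each band's lower bound lies below the other's upper bound.
overlap = george[0] <= susan[1] and susan[0] <= george[1]
print(f"George: {george}, Susan: {susan}, bands overlap: {overlap}")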
Out-of-level testing. Out-of-level testing involves the
administration of a test that is intended for students at another grade level
(Baumann, 1988). Out-of-level testing
is usually reserved for those students who are reading at a level far below
their peers. Administering an
out-of-level test can provide a more accurate picture of such a pupil's reading
performance.
For example, a fifth grade student who
reads at the second grade level would quickly become frustrated and discouraged
by the standardized reading test designed to be administered to fifth
graders. The difficulties with the test
would be so great that the student's answers would show little beyond almost
random choice of answers. Any multiple
choice test that has, for example, five answer choices per question practically
guarantees students that they will get at least about 20% of the questions
correct, even if all answers are randomly chosen. The closer a student's obtained raw score comes to that chance
level, the less value the results have for the teacher, since it is difficult to determine how much of the score reflects random chance and how much reflects actual performance.
Standardized test scores are most stable
when pupils answer between 1/3 and 3/4 of the items correctly (Roberts,
1988). Less able students who answer
fewer items than this may have unstable scores that credit them with higher reading
levels than they actually possess.
Some tests, such as the Gates-MacGinitie
Reading Tests (MacGinitie, 1989), provide teachers with a booklet of
out-of-level norms for selected levels of the test. If Sam, a seventh grader with serious reading problems, was administered
the test level designed for third graders, it would be possible to use the
out-of-level norms booklet to find scores that compare him to other seventh
graders taking that lower level test.
For many other standardized tests, no
out-of-level norming has been carried out by the publisher. In those cases, to ensure accuracy of
interpretation, reporting out-of-level scores requires that the pupil's actual
grade level be reported as well as the intended grade level of the test
taken. If Sam, a seventh grader, took
the third grade standardized test and scored in the 43rd percentile, his
results might be reported as follows:
"Sam, a seventh grader, scored in the 43rd percentile on the test
intended for third graders. His score
indicates that he did as well as or better than 43% of the third graders who
took the same test, placing him in the average range of third grade
readers."
Teachers should be aware that the value of
administering a standardized test to students who are reading well below or
well above their grade level is often questioned because the norms do not apply
to such students. Most standardized
tests have not been designed for or normed for seriously at-risk students
(Fuchs, Fuchs, Benowitz, & Barringer, 1987).
Derived scores
When students take standardized tests, their answer sheets are often
sent to a district central office or to the test publisher for scoring and
analysis. Occasionally, the answer sheets
are scored by hand by the teacher, who uses the test manual or microcomputer
software to analyze the scores. In the
scoring process, the students' raw scores are transformed into derived scores,
scores based upon the raw scores but changed into different formats to
facilitate interpretation. There are
several types of derived scores, including percentiles, grade equivalents, and stanines.
Percentiles (%iles). Percentile ranks are derived scores that
show a student's position relative to the norming sample, using a system in
which the distribution of scores is divided into sections, each of which has
1/100 of the scores in the total distribution.
Students with a percentile rank of 50 have scored at the median of the
norming group, since 50/100 (that is, 50 percent, or 1/2) of the scores are
above and 50/100 (that is, 50 percent or 1/2) of the scores are below. A student with a percentile rank of 40
scored as well as or better than 40% of the sample, and poorer than 60%.
One difficulty with interpreting
percentiles is that the intervals between percentile scores are not equal,
unlike the standard deviations described above. Because most of the raw scores fall near the mean, percentile ranks change rapidly there. As a result, a small change in a raw score near the mean can
produce a large change in percentile.
The same change in raw score at the extreme ends of the range will
produce little change in percentile ranking, as the following illustration
shows:
Raw Score:   55  60  65  70  75  80  85
Percentile:   1   2  16  50  84  98  99
A
change in a student's raw score from 65 to 70, which is not much of a real
difference on a test with 100 questions, will result in a shift from the 16th
to the 50th percentile. The same
five-point increase in raw score from 80 to 85 will move the student only from
the 98th to the 99th percentile.
Thus use of percentiles in interpreting
test results can easily lead to inaccurate conclusions about average students,
in which educationally insignificant differences seem to be blown out of
proportion. On the other hand,
educationally significant differences at the extreme high and low ends of the
ability scale seem to be minimized.
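The illustration above behaves this way because of the shape of the normal curve. Assuming, purely for the sake of the sketch, a test whose raw scores are normally distributed with a mean of 70 and a standard deviation of 5 (values chosen to reproduce the table, not taken from any real test), the percentile ranks can be regenerated as follows.

from statistics import NormalDist

# Assumed distribution chosen to match the illustration above.
test_dist = NormalDist(mu=70, sigma=5)

for raw in (55, 60, 65, 70, 75, 80, 85):
    percentile = round(100 * test_dist.cdf(raw))
    percentile = min(max(percentile, 1), 99)  # percentile ranks are reported as 1-99
    print(f"raw score {raw} -> {percentile}th percentile")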
Grade equivalent scores. Grade equivalent scores are derived scores
that show a student's position relative to the norming sample, using a system
which is based on the average scores obtained by students at given grade levels
in the sample population on which the
test was standardized. Usually the
grade levels are divided into tenths, such as 2.1, 2.2, 2.3 and so forth. Grade equivalent scores are widely used among
teachers and administrators, in part because they appear to provide direct
comparisons of student achievement with difficulty levels of materials. For example, it would seem that a third
grade child who scores at the 6.5 grade equivalent should be assigned reading
materials with readability at the sixth grade.
These comparisons are misleading, however, and as a result of the almost
insurmountable confusions arising from use of grade equivalent scores, some professional
organizations, including the International Reading Association, have criticized
their use (Harris & Hodges, 1995).
The confusions are particularly serious
when test results are reported to parents, who are for the most part unfamiliar
with test standardization procedures.
For example, many do not understand that in deriving grade equivalents
from a statistically normal standardization sample, a significant portion of
children's scores will almost always be in grade levels lower than that of
their actual classroom assignment (An
exception might be, for example, in testing initial consonant recognition with
tenth graders. Virtually all will score
perfect scores, placing the grade equivalent of a perfect raw score at the
tenth grade level. All those students
will be on grade level.). Having
one-third of your school's students scoring below grade level--and being
utterly unable to do much about that because of the inherent structure of the
grade equivalent scoring system--is frustrating to educators and politically
damaging.
Grade equivalent scores are not only
confusing to parents, but they can be confusing to educators, as well. As mentioned above, grade equivalent scores
are based upon the average score of pupils at different grade placements. The grade equivalent score represents how
the raw score of one pupil compares with the scores obtained by the average
pupils at each grade level in the standardization sample. Why should not the third-grade student
mentioned above, who scored 6.5 grade equivalent on a reading test, be assigned
sixth grade level reading assignments?
The answer is simple: A 6.5
grade equivalent score does not mean that the student is capable of doing sixth
grade work. It is much more likely that
this student has been very successful in mastering the third grade work
provided in the test items, and so has scored as well as the average sixth
grader would on that third grade test.
To use the test score to simply skip the content of the reading
curriculum in fourth and fifth grades for this child would be a gross misuse of
grade equivalent scores.
Grade level equivalent scores are
established by testing pupils at several grade placements with the same test
and finding the average score for each group.
The raw scores are plotted and mathematically extrapolated to find
scores above and below the averages found until grade equivalent scores are
determined for the level of each published test (Lyman, 1986). For example, if a reading test was given to
a fourth-grade norming group, which produced an average raw score of 40, this
would be set as the grade equivalent for grade four. Similarly, the average achieved by a fifth-grade group might be
50. Points between are assigned
decimals and reported as "months."
A score halfway between 40 and 50 would be 4.5, or equivalent to the fifth month of fourth grade.
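The interpolation is linear between the norming-group averages. The sketch below carries out the arithmetic for the hypothetical averages used above (a fourth-grade average of 40 and a fifth-grade average of 50); real publishers also extrapolate beyond the tested grades, which is not shown here.

def grade_equivalent(raw, lower_grade=4.0, lower_avg=40, upper_grade=5.0, upper_avg=50):
    """Linearly interpolate a grade equivalent between two norming-group averages."""
    fraction = (raw - lower_avg) / (upper_avg - lower_avg)
    return round(lower_grade + fraction * (upper_grade - lower_grade), 1)

print(grade_equivalent(40))  # 4.0 -- the fourth-grade average itself
print(grade_equivalent(45))  # 4.5 -- halfway between the two averages
print(grade_equivalent(50))  # 5.0 -- the fifth-grade average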
The concepts underlying grade equivalent scores have to be well
understood by both those who report and those who receive the scores. Since
such understanding is rare, these scores should not be used by teachers to
report a pupil's progress or reading level (Berk, 1981). Several other caveats about using grade
level scores include:
1. The use of grade level scores assumes
that learning progresses in a uniform way during the school year, that the
increase from level 2.1 to 2.2 is the same as from 2.9 to 3.0. Learning does not occur in such a
mathematically precise way.
2.
Since grade level scores are based on average scores, some (in fact, most)
children perform better or worse than the average. This is a difficult concept for many to grasp and is made
politically volatile by the grade equivalent reporting system.
3.
Grade equivalent scores from different test publishers will not
necessarily be the same, nor will grade equivalent readability scores of reading
materials necessarily match the scores of test publishers.
Standard scores. Standard
scores are the results of a statistical procedure in which raw scores are
transformed so that they have a given mean and a given standard deviation. They express how far a child's results lie
from the mean of the distribution in terms of that given standard deviation
(Sattler, 1988, p. 32). A common
standard score is the T score. It has a
mean of 50 and a standard deviation of 10.
Another is the z-score, which has a mean of zero and a standard
deviation of 1. Z-scores typically range from about -3 to +3.
Stanines. Stanines are also standard scores and are more widely used than
T scores or z-scores. The term is
derived from "standard nine."
Stanines range from 1 (the lowest ability level) to 9 (the highest),
with a mean of 5 and a standard deviation of 2. When raw scores are converted to stanines, each stanine
represents half of a standard deviation.
Stanine scores, however, are always reported as whole numbers. Therefore, a student who has a stanine score
of 7 (that is, one standard deviation above the mean) actually fell somewhere
from .75 to 1.25 standard deviations above the mean.
Normal curve equivalents.
Normal Curve Equivalents (NCEs) are based on the basic concept underlying
percentile ranks, of dividing the scale of scores into 1/100 units. But NCEs offer some advantages over
percentiles in terms of clarity of interpretation. As mentioned above, use of percentile scores can lead to
misinterpretation due to the unequal size of the 1/100 units in their
scale. NCEs have been transformed into
equal units across their scale. Like
percentiles, they range from 1 to 99, with a mean of 50. However, all units in the scale, whether
near the mean or at either end, are equivalent. Thus, unlike with percentiles, if two poorer ability students
differ by 5 NCEs, and two average students differ by 5 NCEs, the educational significance of those differences is about the same. Generally, NCEs are "more spread out at the ends and less spread out in the middle" (Gronlund & Linn, 1990).
Many schools now use NCEs for interpretation and reporting of test
scores. They have become popular
because they avoid some of the difficulties inherent with use of grade equivalents
and percentile ranks.
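All of the standard scores described in this section are transformations of the same underlying z-score, so the conversions can be written compactly. The sketch below follows the conventions given above (T scores with a mean of 50 and SD of 10; stanines with a mean of 5 and SD of 2, reported as whole numbers from 1 to 9; NCEs with a mean of 50, a range of 1 to 99, and a standard deviation of about 21.06); the raw score, mean, and standard deviation in the example are made up.

from statistics import NormalDist

def z_score(raw, mean, sd):
    return (raw - mean) / sd

def t_score(z):
    return 50 + 10 * z                              # mean 50, SD 10

def stanine(z):
    return max(1, min(9, round(5 + 2 * z)))         # mean 5, SD 2, whole numbers 1-9

def nce(z):
    return max(1, min(99, round(50 + 21.06 * z)))   # mean 50, reported 1-99

def percentile(z):
    return max(1, min(99, round(100 * NormalDist().cdf(z))))

# A made-up raw score one standard deviation above the mean (z = +1.0).
z = z_score(75, 65, 10)
print(f"z = {z:+.1f}, T = {t_score(z):.0f}, stanine = {stanine(z)}, "
      f"NCE = {nce(z)}, percentile = {percentile(z)}")

Note that the stanine of 7 produced for z = +1.0 matches the statement above that a stanine score of 7 corresponds to roughly one standard deviation above the mean.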
B. Examining Specific Factors: Formal Assessment Devices for Reading and Literacy
A
wide variety of standardized, norm-referenced tests are used for assessment of
reading and literacy. Because of the
variety, categories often overlap. The
following categorization system will describe norm-referenced tests in two
major categories: Achievement survey
tests of reading and literacy provide general information of most value when
assessing groups of students. They
include both reading survey tests and the general achievement batteries used in
many schools. Diagnostic reading tests
provide detailed information about reading skills appropriate for use in
designing literacy curricula for individual students.
There are three additional varieties of
formal assessments, all of which are designed to fulfill either a survey or a
diagnostic purpose or both, that will be described: Criterion-referenced tests, the Degrees of Reading Power (DRP)
Tests (College Entrance Examination Board, 1983), and performance-based
assessment tests. Each type of test has
a different purpose and format and taps different aspects of reading. Formal tests of reading readiness and of
emergent literacy are described in Chapter 5--"Assessing Emergent
Literacy".
Reading Achievement Survey Tests
Reading survey tests are screening devices
used to evaluate reading
development
in a general rather than a specific way.
The tests are designed for group administration. A reading teacher might use such a test for
preliminary screening of children for possible inclusion in Chapter 1 remedial
programs or for identification of students who might be making slow progress in
reading. Survey tests are efficient in
terms of amount of teacher preparation
and effort needed to administer the test and student time needed to complete
it.
Description. Reading survey tests provide information about a student's basic
vocabulary and comprehension. One
example of a reading survey test that uses a unique cloze approach to assessing
literacy is the Degrees of Reading Power Test (DRP) (College Entrance
Examination Board, 1983), which is described in detail later in this
chapter. An example of a more
traditional reading survey which has long been used in reading assessment is
the Gates-MacGinitie Reading Test (MacGinitie & MacGinitie, 1989). These multiple-choice survey tests consist
of several test levels designed for preschool through grade 12. On the vocabulary subtest for third grade,
for example, students are given a target word and must choose an appropriate
match from a list. On the comprehension
subtest, students read a short passage and answer questions about it. As with many norm-referenced tests, student
scores are reported in terms of stanines, NCEs, percentile ranks, and grade equivalents. Standard scores called Extended Scale Scores
are also provided.
The Nelson-Denny Reading Test (Brown,
Fishco, & Hanna, 1993) is a survey test designed specifically for high
school and college students. Both the
Stanford Reading Tests (Psychological Corporation, 1989) and the Metropolitan
Achievement Tests Reading Survey Tests (Psychological Corporation, 1985) are
actually the reading-related subtests of broader scale general achievement
tests (described below), printed separately from the larger test battery for
purchase by those who are interested specifically in reading assessment.
Advantages and disadvantages. Reading survey tests are effective in
screening large numbers of pupils in a short time period and in identifying
those whose reading behaviors may fall above or below certain levels. They are best used for preliminary
screening. In preliminary screening,
poorer students are tentatively identified for further assessment, which will
be designed to verify the preliminary results and provide more detailed
information appropriate for instructional decisionmaking.
Survey tests are not diagnostic in form or
content. The norm-referenced scores
that are obtained alert the examiner to those pupils who score very high, at
the average, or very low. It is
important to remember that survey tests are global measures of reading,
intended to measure group achievement
or a student's general rank within a group
rather than an individual's achievement.
Reading survey tests are often chosen by
reading specialists for use in school reading programs in preference to general
achievement tests (see below). The
reading survey tests offer attention directly targeted to the needs of reading
programs. In addition, reading survey
tests such as the Gates-MacGinitie (MacGinitie & MacGinitie, 1989) offer
out-of-level norms for better interpretation of the reading performance of
disabled readers.
General achievement tests
General achievement tests (sometimes
called batteries instead of tests--indicating that a large group of subtests
are being administered) are used by schools to evaluate achievement in a
variety of subject areas, such as mathematics, English usage, spelling,
reading, study skills, social studies, and science.
Achievement tests are general rather than
specific and, in the assessment of reading and literacy, function in similar
fashion to the reading survey tests described above. In fact, the reading subtests of such tests have the same
function as reading survey tests; the general achievement tests simply include
additional subtests on other skill and subject areas. Like survey tests, the results of general achievement tests are
used to assess growth of individuals and to compare the performance of students
within and across classes in the same school and in different schools across
the nation.
These tests are commonly administered by
schools as end of the year tests in May.
Due to the time, effort, and expense needed for their administration,
many schools administer them every second or even third year. They are used for screening purposes, to
identify those pupils who demonstrate very high or very low scores in
comparison with their peers. They are
also used for general progress evaluation of given classes or schools. Teacher and school accountability for
adequate performance is an important factor when the overall results of general
achievement tests are released to the public.
Description. The test content is developed by test authors based on common
curricula found at all grade levels throughout the country. Group achievement tests have several
different levels which represent the curricula content common to specific
grades. Students are assigned different
levels of the test based on their grade in school.
Some
common achievement tests include: The
California Achievement Test (CAT) (Macmillan/McGraw-Hill School Publishing
Company, 1992), the Iowa Tests of Basic Skills (ITBS) (Riverside Publishing
Company, 1993), the Metropolitan Achievement Tests (MAT) (Psychological Corporation,
1992); SRA Achievement Series, (Science Research Associates, 1978); the
Stanford Achievement Test Series (Psychological Corporation, 1991) and the
Comprehensive Test of Basic Skills (Macmillan/McGraw-Hill Publishing Company,
1990). In addition, some states have
developed their own general achievement tests for periodic evaluation of their
students.
Advantages and disadvantages. As with reading survey tests, scores can be
used to compare an individual's overall achievement to his or her peers. A group or class's overall achievement can
be compared to other groups or classes in the same school. School performance can also be compared with
other schools. In tests designed to
obtain survey information, however, the teacher does not learn much information
of value in dealing with the instructional needs of specific children: Seldom is information obtained about how a
particular pupil answered specific items on the test, why the student didn't
respond correctly, and what thinking processes or reading difficulties might
have interfered with his or her response.
In interpreting scores from these tests,
it is important to be aware of the nature of each subtest. Subtests on different achievement test
batteries might have similar names, such as inferential comprehension or
paragraph comprehension, but the way the skill is assessed may vary from test
to test. It is important to examine
the content of the individual tests before making a judgment about students'
achievement in the skills associated with reading.
Test publishers have expanded their
reporting services to provide schools with computer printouts showing the
scores of all the pupils within a class.
They also provide individual profiles of pupils and include a computerized
analysis of individuals. These
analyses identify general strengths and
weaknesses in the various skills and content areas. While these can be helpful, instructional decisions and
diagnostic statements about individual students must be interpreted with
caution, since the tests are more appropriate for decisionmaking about groups
than about individuals.
Individual achievement tests. As noted above, most general achievement
tests are designed for large scale group administration to entire classes or
schools. Individual achievement tests,
however, are designed to be administered one-on-one to a student. These achievement batteries cover a broad
range of skill and content areas and are often administered to students who
have special schooling needs. If
teachers are concerned for students who might not follow the directions or who
might not perform to the best of their ability on a group administered
achievement test, an individually administered achievement test is
appropriate. These tests are often used
by special education teachers and reading teachers as a means of obtaining a
more accurate estimate of at-risk students' growth in reading and other subject
areas.
As with general achievement tests,
individual achievement batteries include a variety of subtests such as word
recognition, word analysis, reading comprehension, spelling, mathematics and
written language usage. Commonly used
individual achievement test batteries include:
The Basic Achievement Skills Individual Screener (BASIS) (Psychological
Corporation, 1983); the Diagnostic Achievement Battery (DAB) (Newcomer, 1990),
the Kaufman Test of Educational Achievement(KTEA), (American Guidance Service, 1985); the Peabody Individual
Achievement Test-Revised (PIAT-R, (Markwardt, 1989), and the Wide Range Achievement
Test (WRAT) (Wilkinson, 1993). The
Diagnostic Achievement Test for Adolescents (DATA) (Newcomer & Bryant,
1993) is designed for older students.
Teacher-student interaction is an
advantage of individually administered tests.
Behaviors of the student on certain subtests and items within tests can
be noted so that the test is not only used to measure growth but to measure
strengths and weaknesses in functioning.
This diagnostic aspect, however, does not play as key a role as on
actual diagnostic tests, described below.
Another advantage to individually administered tests is that the teacher
can draw conclusions as to how seriously the student worked on the test and
whether its results are valid, a serious problem with at-risk students taking a
group test. In fact, the one-on-one
nature of individual testing usually ensures that students do take them
seriously.
Diagnostic tests
Diagnostic reading tests are designed to
provide a detailed profile of individuals' reading strengths and
weaknesses. They are more specific than
survey and general achievement tests; they assess specific reading skills that
should be mastered at certain grade levels.
Diagnostic reading tests differ from survey tests in that the former
have a larger number of subtests to evaluate a wider array of reading skills
and more items within each subtest than survey tests to provide greater
reliability.
Diagnostic
tests can be designed for either group or individual administration. While all diagnostic reading tests provide a
profile of individuals' skill development in reading, their specific skills
strengths and weaknesses, group tests can also help identify skill weaknesses
characteristic of the group, such as a class or school, that might be
contributing to that group's poor reading performance.
Description. Diagnostic reading tests include several subtests that tap
specific skills thought to be important in reading. Diagnostic reading tests are based on the assumption that reading
is a skills based process and that teaching to skill weaknesses will ultimately improve the student's
reading ability. Gronlund & Linn
(1990) have warned that the difficulty level of diagnostic tests appears to be
lower than that of survey tests, because the former are intended for students who are
experiencing difficulty in acquiring the skills and abilities needed for
reading.
Like all norm referenced tests, the scores
obtained will reflect how well a particular pupil scores in relation to others
who took the same test and are of the same age or grade level. They provide information about a pupil's
reading in a number of different skill areas, such as: Auditory discrimination, visual
discrimination, letter identification, listening, oral reading and fluency,
spelling, blending, phoneme/grapheme identification at a variety of levels (such
as, initial consonants, final consonants, short vowels, long vowels, and so
forth), sight words, context analysis, structural analysis, syllabication,
vocabulary, comprehension (both oral and silent) at a variety of levels (such
as, literal, inferential and evaluative), and reading rate. Group tests do not include oral reading
tasks and have a multiple-choice format.
Individual reading diagnostic tests are
often composed largely of tests of oral and silent reading in which graded
passages are read with accompanying comprehension assessment. In this sense, the tests are similar to
Informal Reading Inventories (see Chapter 6--"Informal Reading
Assessment"), but IRI's are not normed against a standardization
sample. In addition to the oral and
silent reading, the diagnostic batteries include subtests on additional
skills. The Diagnostic Reading Scales
(Revised ed.) (DRS), (Spache, 1982), the Durrell Analysis of Reading Difficulty
(DARD), (Durrell & Catterson,1980), and the Gates-McKillop-Horowitz Reading
Diagnostic Tests (Gates, McKillop & Horowitz, 1981) are examples of formal
reading diagnostic batteries.
The Woodcock Reading Mastery Tests-Revised
(Woodcock, 1987) is another example of a diagnostic reading test. This test includes a battery of six
subtests: Visual-auditory learning,
letter identification, word identification, word attack, word comprehension,
and passage comprehension. Three
cluster scores are developed from the six subtests: Readiness, basic skills and reading comprehension. Conversion of raw scores to standard scores
is a complex process for the examiner, involving several steps. Scores are reported in terms of a Relative
Performance Index, which is the range or band of performance indicated by the
test's Standard Error of Measurement.
Scores are also converted to percentiles, grade equivalents, and age
equivalents (based on the average scores obtained by students at given ages in
the sample population on which the test
was standardized).
Advantages and disadvantages.
Standardized reading diagnostic tests tend to focus on assessing
discrete reading skills rather than reading as a holistic process. Thus, remedial
procedures are often based on skill weaknesses, while other reading processes
are neglected.
In creating group diagnostic tests such as
the Stanford Diagnostic Reading Tests (Karlsen & Gardner, 1984), test
publishers have sacrificed two important means of obtaining information, namely
oral reading and teacher-student interaction.
Group reading diagnostic tests lack oral reading tasks. They offer the teacher few indications of
the processes a child has used in selecting answers. Teachers may sometimes be able to infer the thinking processes
used by children, based on their own experiences with reading. However, students who have difficulty
acquiring reading don't necessarily use the same processes as normal readers or
adults. Manzo (1994), for example,
argued that disabled readers tend to use reader-based, top-down skills and
rely more on context, because many of their problems involve word
recognition.
Another difficulty with group diagnostic
reading tests involves administration time.
These tests attempt to provide a thorough, reliable analysis of skill
areas associated with reading by providing a substantial number of subtests,
each of which takes a significant amount of time. They are tedious for students
to complete and require a long administration time.
Publishers of group diagnostic tests may
state in their test manuals that the results can be used for individual
instructional and program planning. But
care should be taken in making such judgments based on derived scores, without
the insights available from one-on-one sessions with the student (Salvia &
Ysseldyke, 1991). Careful analysis of
individual patterns of response is necessary for the tests to be used in
program planning.
Individual diagnostic reading tests
require that the examiner have closely studied and practiced administration of
the test. Like group diagnostic tests,
they are time-consuming, but at any point in the administration the examiner
can decide to administer only those subtests which are considered necessary for
understanding the child's reading behaviors.
Thus, time of testing can be shortened without jeopardizing the results
of the test. The grading of the test
after administration is also time consuming for the examiner.
Although individual diagnostic reading
tests are considered formal tests, in that administration occurs under closely
controlled conditions, the standardization procedures are sometimes not of the
magnitude of most norm-referenced, standardized tests designed for group
administration. Norming often involves
a small, limited standardization sample, which does not meet the criteria of a
test that has been rigorously standardized.
In fact, the tests may not have been normed in any formal sense at all,
but rather simply field-tested with populations.
Specific diagnostic tests. Some standardized reading tests are designed
to closely assess a specific aspect of reading and literacy, rather than
provide a battery of subtests for assessment of the range of reading and
literacy skills. Silent reading
comprehension tests are such tests, in that only a student's comprehension
under silent reading conditions is assessed.
Since they are norm-referenced and standardized, a student's score can
be compared to other students of the same age or grade level.
The Test of Reading Comprehension (TORC),
(Brown, Hammill, & Wiederholt, 1987) is a group silent reading
comprehension test. Eight reading
comprehension subtests make up this silent reading battery. Three subtests are in the General Reading Comprehension
Core: General Vocabulary, Syntactic
Similarities, and Paragraph Reading.
Diagnostic supplements include:
Mathematics Vocabulary, Social Studies Vocabulary, and Science
Vocabulary. Other subtests include
Sentence Sequencing and Reading the Directions of Schoolwork. An overall Reading Comprehension Quotient
can also be obtained.
The TORC test manual indicates that it was
constructed according to psycholinguistic theory, with its emphasis on the
syntactic and semantic components of comprehension (Smith, 1978). But most of the subtests are traditional in
format and yield little more information than a survey reading test. The General Vocabulary subtest and the
content area vocabulary subtests, for example, have students choose an answer
that best matches a list of target words, and the Paragraph Reading subtest has
students answer questions based on reading a short paragraph. In the Syntactic Similarities subtest,
however, students examine several sentences, all of which have similar
vocabulary in varied syntactic arrangements.
They must choose the two sentences which mean most nearly the same
thing. In a sample exercise, for
example, "Sam plays" and "Sam is playing" would be chosen,
not "Sam is going to play."
In the Sentence Sequencing subtest, students read several sentences
which represent a story, but the sentences are out of order. Students must
decide in what order the sentences should be rearranged.
The Gray Oral Reading Test (GORT)
(Wiederholt & Bryant, 1986) is an individualized diagnostic test of oral
reading and comprehension. Students
read paragraphs at different difficulty levels. A Passage Score is generated from a combined measure of reading rate and number of miscues (that is, oral reading errors, called deviations from print in the GORT test manual). A Comprehension Score is based on multiple-choice answers to comprehension questions. The examiner
can also carry out a categorization activity to determine patterns of miscues
in the student's oral reading.
Criterion-referenced tests
Criterion-referenced tests assess learning
in terms of the kinds of behaviors or skills that have been mastered at a given
level. As noted earlier in the chapter,
norm-referenced tests provide information about an individual's performance in
relationship to the norming sample.
Criterion-referenced tests, on the other hand, provide information about
an individual's performance in relationship to his or her ability to perform a
given task. For example, a first grade
child might have scored "at mastery level" (that is, above a
specified criterion score selected by the test publisher) on a given task such
as letter identification, but "below mastery level" on another task
such as sight word identification.
Criterion-referenced tests of reading are
used primarily to measure individual students' ability to perform in reading
and literacy skill tasks, not to compare students' reading behavior with a
norming group. Reading is viewed as an
accumulation of skills that can be taught and measured. Determining whether students have
"mastered" certain skills associated with reading is an important
function of criterion-referenced tests.
Description. Generally criterion-referenced tests have components that assess
specific instructional objectives of importance to the curriculum. A first grade test, for example, might have
components assessing such skills as letter recognition, sight word
identification, initial consonants, final consonants, initial consonant blends,
digraphs, short vowel sounds and others.
All of these are key skills learned at the first grade level.
Assessment of each objective typically
occurs in terms of whether the child has achieved mastery. Mastery does not mean perfect performance of
the objective. Rather, a predetermined
criterion score is used to determine whether a child has mastered the
objective. Often that score is between
75% and 80% correct on items pertaining to the specific objective (Lyman,
1980), but this can vary depending upon the constraints of the test publisher,
the school district, or even the classroom teacher.
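To make the mastery decision concrete, the short sketch below (in Python, with hypothetical objective names, item counts, and an illustrative 80% criterion; no particular published test is implied) shows how each objective's proportion correct is compared to a criterion score.

# Hypothetical sketch of criterion-referenced mastery scoring; the 80%
# criterion and the objectives shown are illustrative only.
def mastery_report(scores, criterion=0.80):
    """Return 'mastery' or 'below mastery' for each objective.

    scores    -- dict mapping an objective to (items correct, items attempted)
    criterion -- proportion correct required to declare mastery
    """
    report = {}
    for objective, (correct, attempted) in scores.items():
        proportion = correct / attempted
        report[objective] = "mastery" if proportion >= criterion else "below mastery"
    return report

first_grade_scores = {
    "letter recognition": (19, 20),        # 95% correct -> mastery
    "sight word identification": (12, 20), # 60% correct -> below mastery
}
print(mastery_report(first_grade_scores))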
Criterion-referenced tests can be either
formal or informal, depending upon the efforts made in their construction. At the one extreme, classroom teachers can
and do easily devise informal devices which test the skills they are
teaching. At the other extreme, a test publisher can devise a formal criterion-referenced test and norm it against a standardization sample, or add criterion-referenced skill reports to a norm-referenced test, thereby providing users of the test both with a list of skills which the children have and have not yet mastered and with derived scores such as percentile ranks, grade equivalents, and NCEs. The reading subtests of the Stanford
Achievement Test (Psychological Corporation, 1989), for example, are traditional norm-referenced tests and
provide teachers with a report on derived scores, such as percentile ranks,
stanines, and grade equivalents. Items
on the tests have been further classified according to a variety of skills to
yield a criterion-referenced report, in which students are reported to be above,
at, or below mastery level for each specified skill.
Some teachers, unhappy with the limited information yielded by survey reading and general achievement tests, have developed similar item classification systems of their own, based on normed tests (Fantauzzo, 1995). The results
can yield information similar to that of a formal criterion-referenced
test. For example, errors on
individual items in a test of word recognition might be analyzed and classified
as to student strategies (see Figure 4-3).
----------------------------------------------------------
Insert
Figure 4-3: Informal Classification of
Student Strategies
on Errors in a Formal Word Recognition Test
----------------------------------------------------------
Most published criterion-referenced tests
lie somewhere in the middle of these two extremes: They have been field-tested, but they have not been normed. Because of the relatively few questions for
each skill on such tests, the reliability would be very low, too low for use in instructional decision making without significant corroboration from other
assessment sources (Calfee & Hiebert, 1990).
One example of a standardized criterion-referenced test is the Prescriptive Reading Inventory Reading System (PRI/RS) (CTB/McGraw-Hill, 1980). This
test measures 171 objectives in four major reading skill areas: 1) Oral
Language and Oral Comprehension, 2) Word Attack and Usage, 3) Comprehension,
and 4) Reading Applications.
Advantages and disadvantages. As mentioned above, reliability for scores
on the different skills assessed can be very low. On some tests, for example, only 3 or 4 items are used to assess
each skill. Since reliability of a
score is highly related to the number of items used to assess that score, the
reports on mastery and non-mastery of skills would be highly unreliable on such
a test. Important instructional
decisions should not be made on the basis of unreliable information.
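One way to see why a handful of items cannot support a reliable mastery decision is the Spearman-Brown formula from classical test theory, which estimates how reliability changes with test length. The sketch below is an illustration only; the reliability and item counts are hypothetical, not figures from any published test.

# Hypothetical illustration: estimated reliability of a short skill cluster,
# computed with the Spearman-Brown formula from classical test theory.
def spearman_brown(reliability_full, items_full, items_subset):
    """Estimate the reliability of a test shortened (or lengthened) to items_subset items."""
    n = items_subset / items_full
    return (n * reliability_full) / (1 + (n - 1) * reliability_full)

# If a 40-item reading test has an overall reliability of 0.90, a 4-item
# skill cluster drawn from it would be expected to have a reliability of
# only about 0.47 -- far too low for mastery/non-mastery decisions.
print(round(spearman_brown(0.90, 40, 4), 2))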
The major advantage of
criterion-referenced assessment is the close match between the test and the
classroom curriculum. A norm-referenced
report that states that Mary, a second grader, is in the 60th percentile for
second graders on word recognition skills is of limited use in planning
instruction for Mary. A reliable
criterion-referenced report, on the other hand, might suggest that Mary has
demonstrated mastery at the second grade level for recognition of consonant
blends, digraphs and for syllabication skills, but not for structural analysis,
short and long vowel sounds, and diphthongs.
Such a report would be extremely useful in planning her future
instruction.
Some criterion-referenced tests include
complex management systems in which students' skills that are in need of
improvement are matched to published resources for teaching these skills. This can aid teachers in selection of
materials.
Degrees of Reading Power Tests (DRP)
Purpose.
The Degrees of Reading Power Test (College Entrance Examination Board, 1983) was developed to assess students' comprehension of short selections, approaching the task from a substantially different psychological construct than that underlying most standardized tests. These tests were designed to assess
comprehension by requiring readers to integrate content knowledge with their
semantic and syntactic uses of language.
Description. The DRP is a reading achievement survey test. As such, it is used to evaluate reading
development in a general rather than a specific way.
The format consists of short
paragraphs. Selected words have been
deleted and replaced with blank spaces.
For each blank, students are given a multiple choice item with five
target words from which to choose. The
students are required to replace each deleted word in the selection with one
that fits the content of the selection and meets the syntactic and semantic
requirements of the passage.
Degrees of Reading Power tests (DRP) are described by Kibby (1981) as "highly sophisticated, highly developed, formalized, informal reading inventories." Results are reported in DRP units, a derived score that is a unique construct of the test publisher
based upon field-testing of the test. The publisher also provides a
comprehensive list of readability measures of content texts, basal readers, and
literature, all reported in DRP units.
According to the publisher, students' DRP unit scores from the test can
be used to locate reading material that is a match with the student's
instructional level.
Advantages and disadvantages. A major advantage of the Degrees of Reading Power Test is that it does not yield typical norm-based scores, such as
grade equivalent and percentile rank scores, that are so readily
misinterpreted. Instead, the test
yields raw scores that are converted to DRP scores, which are uninterpretable
except in terms of making matches with the list of reading and instructional
materials. This makes it essential that the list be available to teachers and to anyone to whom the results are reported.
Another
advantage of the DRP is that the format of the test requires readers to use
their knowledge of language (grammar, word meanings, sentence meanings, and
passage meanings) to guide their responses.
Also, teachers can easily examine patterns of student word choice
selections for closer evaluation of the students' strengths and weaknesses.
Performance-based assessments
Traditional formal testing of reading and
literacy attempts to test student learning through indirect means. To test reading, for example, children fill
in dots on a computer-scorable answer sheet after reading short passages and
questions that have limited similarity to real-life reading tasks. To test writing, children do not write. Instead, again they fill in dots in answer
to questions someone else has written.
Slowly over the past few decades,
educators have increasingly recognized the limitations of such indirect
testing, as well as negative effects involved in putting pressure on teachers
to "teach to the test" when the test is not representative of desired
reading and writing tasks. This
increasing awareness has led to adoption of a variety of alternative assessment
strategies, such as teacher observation, portfolio assessment, and
performance-based assessment.
Description. Performance-based assessment involves the direct assessment (as
opposed to indirect assessment in traditional testing) of student knowledge and
ability (Berk, 1986; Stiggins, Conklin & Bridgeford, 1986). It is based on student performance which
integrates processes, skills and concepts and allows the teacher to gauge the
depth of student knowledge (DeLain, 1995). Performance-based assessments
require students to construct their own answers rather than to identify a
correct response on a test (Valencia, 1992).
In reading, for example, the assessment
task is designed to simulate a real-life reading task as closely as
possible. At first, most
performance-based assessment was, in essence, instructional assessment or
diagnostic teaching carried out by the classroom teacher (see Chapter
8--"Instructional Assessment").
The teacher would design a task especially for the purpose of
evaluation, though it would often have important teaching and learning value,
as well, since it would be closely related to the classroom curriculum. Then the teacher would implement the assessment with the children and evaluate their performance on it.
For example, perhaps the class was
studying a unit on spiders, and the teacher wanted to evaluate the children's
ability to find important information from science content readings. The teacher could design a task in which
students read a selection about the characteristics of spiders, then listed the
important characteristics in a semantic web.
Evaluation would be based upon how well each child succeeded in
including the important characteristics in the web. In observing the children at work during the task, additional information could be gleaned by the teacher pertaining to such issues as
independence of student functioning, word recognition ability, vocabulary
ability, and ability to bring background knowledge to bear on science
reading. The teacher might use an
observational checklist to carry out such observational assessment.
With the rise of interest in
performance-based assessment in the late 1980's, test publishers developed
their own formal versions of such devices.
Published performance-based assessments provide the teacher with a
script for dialogue and guidelines for interacting with pupils throughout the
assessment. Scoring is designed to be
carried out by the teacher and usually involves considerably more effort than
traditional multiple choice assessments.
Some, such as the California Achievement
Test/5 K/1 Assessment Activities (Macmillan-McGraw Hill, 1992) and the
Performance Assessments for the Iowa Test of Basic Skills (Riverside Publishing
Company, 1993) are designed as add-ons to traditional tests. Others, such as GOALS: A Performance-Based
Measure of Achievement (Psychological Corporation, 1992) and The Riverside
Performance Assessment Series (Riverside Publishing Company, 1993) are designed
as stand-alone tests.
Only a few performance-based assessments
have been normed, including the add-on Performance Assessment series by
Riverside Publishing and GOALS from the Psychological Corporation. Raw scores from these assessments can be
transformed to scaled scores, percentile ranks, stanines, grade equivalents and
normal curve equivalents.
Advantages and disadvantages. Performance-based assessments provide a
bridge between formal and informal testing.
These tests offer a unique format in which students can demonstrate
their competence by applying what they have learned in a practical
situation. But publishers face a
serious problem in developing instruments that are simultaneously sufficiently
generic for use across the country and sufficiently specific to fit an
individual teacher's curriculum. In
fact, much of the advantage of informal performance-based assessment is lost in
the formalization process. Readings,
for example, are no longer directly tied into classroom curriculum, and the
teacher's role becomes less fluid and responsive to observed behaviors and more
structured by the assessment routines (Wiggins, 1993).
Performance-based assessments offer
teachers a choice in how to assess their students, but choice requires
reflection about one's purpose for testing.
Teachers should ask some questions before using performance-based
assessments: 1) Does the assessment reflect your curriculum? 2) Does the assessment reflect your beliefs about what should be learned? 3) Does the assessment have an impact on your instructional decisions? (Valencia, 1992).
In addition, careful attention should be paid to reliability of
assessment results, which remains a serious problem with performance-based
testing.
Performance-based
testing is not a replacement for the information gleaned from more traditional
formal tests. One's choice of
assessment instrument should be based on the information that is needed about a
student's learning.
Implementation of Standardized Testing
Advantages and disadvantages
Much of the controversy about standardized
tests in our schools stems from three sources:
The misuse and overuse of standardized tests, the misinterpretation of
test results, and the varying purposes for assessment among people with differing relationships to the schools.
First, there is no doubt that standardized tests have been misused and overused. Children and teachers spend a very substantial amount of school
time on standardized testing. Yet in
many cases, there is little or no direct educational benefit to the children
from those tests. Test results are
filed away without any impact on the planning of instruction. Sometimes the results are irrelevant to
classroom curricula, and sometimes the delay in reporting results makes them
out-of-date. In addition, tests are
often used as weapons against students and teachers under the guise of
accountability--holding students and teachers accountable for the educational
progress in the classroom. The test
becomes the enemy, a hurdle to be overcome by any means
necessary--educationally valid or not--even if that means teaching to the test
by means of rote drill and practice multiple choice exercises instead of
authentic reading and writing experiences.
Second, as noted repeatedly in this
chapter, the reporting of test results to students, teachers, parents,
administrators, and governmental and educational policy makers is filled with
peril. Numbers and statistics based on
supposedly objective assessment can appear deceivingly important--conclusive
and scientific-- when in fact those numbers and statistics are very limited in
their ability to summarize the sum total of what goes on in a child's mind or
in a classroom. Standardized testing is
a starting point in assessment, not the final conclusion. In addition, appropriate interpretation of
those numbers and statistics requires a clear understanding not only of
statistics and educational measurement, but also of the relationship between
standardized testing and classroom practice.
Third, different people have different
uses for assessment. These differences,
especially the differences in purpose between teachers and educational policy
makers, have led to some of the most heated controversies in recent years
(Calfee & Hiebert, 1993).
For teachers, the major concerns are with:
1. Validity (Does the assessment match what we are teaching in the classroom, and the way we are teaching it?)
2. Suitability (Do the methods used to carry out the assessment fit our purposes?)
3. Availability (Will the information I need to know about my children be there when I need it?)
For policy makers, on the other hand, the major concerns are with:
1. Reliability (Does the evidence have adequate statistical, scientific backing?)
2. Efficiency (How inexpensively can the assessment be carried out?)
3. Aggregability (Can the information be expressed simply, in a few numbers that can be easily understood and interpreted?)
Standardized tests have a place in our
schools. Poor use of the tests,
however, does far more harm than good.
Standardized reading tests should never be used as the sole criterion
for assessing an educational program, whether for a child, a class, a school,
or a district. They should not be
overused, nor over-interpreted. They
should not be misused nor misinterpreted (Farr, 1987).
Testing reform movement. Dissatisfaction with the use of standardized tests as the primary tool for assessing student growth, together with a growing movement that seeks to identify the uniqueness and individuality of learning, has led to efforts to reform testing and assessment. Some test publishers have developed
instruments that seek to compare individuals' performance as well as highlight
their unique development.
Performance-based assessments and Degrees of Reading Power tests, as
mentioned earlier in this chapter, represent some of the changes in testing due
to the reform movement. In an attempt
to add authenticity to the assessment experience, some standardized tests now
use literature selections as the basis of the content.
There has also been a widespread movement
for the use of portfolio assessment in schools (See chapter 7 for a complete
discussion). In their broadest form,
student portfolios include samples of several types of assignments, informal
tests, and written expressions. This
accumulation of student work provides an additional means to assess individual
growth in a holistic way.
Special concerns for formal reading assessment
Item construction. As mentioned above, different publishers
assess the same theoretical construct (such as reading comprehension or word
recognition) in different ways. Reading
teachers should be especially attuned to these differences, as each approach to
assessment may tap into different aspects of the theoretical construct. For example, in Matthew's case, the
classroom teacher and reading specialist were puzzled about the disparate
reading comprehension scores this remedial third grade student had received. On the Gates-MacGinitie Reading Test
(MacGinitie & MacGinitie, 1989), he had done fairly well, but on an
individually administered test, where he had been asked to read a selection and
answer questions about it, he had done poorly.
After observation and instructional assessment, the teachers concluded
that he had developed comprehension strategies that allowed him to increase his
success despite the constraints of his reading difficulties: He used key words in printed questions to
look back in the text in order to answer the questions. On the Gates-MacGinitie, he was able to
employ these strategies fairly successfully.
On the individualized test, he had not had that opportunity. Without multiple assessments, these teachers
would not have gained such insights into Matthew's comprehension strengths and
weaknesses.
Student behaviors during testing. Group tests perform best when assessing
groups of students. A small number of
students exhibiting inappropriate behaviors during the testing session will not
necessarily invalidate the results, since the effect on overall group scores
may be insignificant and since some degree of similar inappropriate behaviors
probably occurred during administration to the original norming sample. However, such behaviors could very well
completely invalidate the assessment of the students involved. If a teacher were to examine test results of
such a student without knowing about the inappropriate behavior, an incorrect
assessment would almost certainly result.
Examiners should pay close attention to
student behaviors during a testing situation and make reports as
necessary. Inappropriate testing
behavior is most likely among the very population whose scores might be
individually examined. Examiners should
be concerned about issues such as:
Was there any indication that a student
was sharing answers?
Did a student significantly delay start of
the test or take a break in the middle to sharpen a pencil or look around the
room?
Did a student exhibit signs of anxiety, such as frequent erasing, sounds of frustration, or general unease?
Did a student complete the test too
quickly, indicating lack of attention to the questions or answers even at the
easy level?
Did a student exhibit signs of sleepiness?
C. Demonstrating by Example: Case Studies
Norm-referenced and criterion-referenced testing: Sid
Sid is 8 years, 10 months of age. He attended kindergarten, pre-first grade, and first grade, and is currently enrolled in second grade. While a perusal of his scores suggests that he is developing at
an average rate, the fact that he attended pre-first grade suggests that he was
not progressing at a normal rate in kindergarten. Reports from his kindergarten teacher indicated that he was
immature, did not socialize with his classmates, did not engage in reading and
writing activities such as shared reading and journal writing, and could not
identify alphabet letters. His teacher was concerned about his oral language
development and considered him to be a high risk pupil.
Pre-first grade provided Sid opportunities
to engage in non-threatening oral and written language activities. He progressed nicely and was promoted to
first grade. His first grade teacher
stated that he made slow but steady progress in learning sight words and sound/symbol
relationships of consonants. He wrote
daily in his journal and used invented spelling.
His second grade teacher is concerned
about his slow progress in reading, especially his difficulty in recalling
sight words and inability to use medial vowels to decode. She decided to examine the scores from the Stanford
Achievement Test (Psychological Corporation, 1989) administered in May of Sid's
first grade year, to determine if she could identify areas of strengths and
weaknesses.
Examine the scores from the Stanford Achievement Test (see Figure 4-4). Be sure to note the number of items within each test and Sid's
raw scores (RS).
Read the information for the first derived
score, percentile rank. Recall that the
percentile rank represents how well Sid performed compared to the norming group
that was used for this test. For
example, the first item, Total Reading, indicates that Sid is in the 30th percentile in this category, which is a composite score from the three subtests listed immediately below it. This can be interpreted in the following way: Sid did as well as or better than 30% of the students of the same age and grade level in the norming group that took this test. Continue by reading the rest of the percentile information.
Recall that stanines consist of nine
bands. Those whose scores fall within
stanines 1, 2, and 3 are at the low end of the scale; those whose scores fall
within 4, 5, and 6 are average; and those whose scores fall within 7, 8, and 9
are above average. Sid's stanine score
for Total Reading is 4, which places him at the lower end of the average
range. Continue examining Sid's stanine
scores for the rest of the subtests on the Stanford Achievement Test. What information about Sid's reading
development can be found from the percentile and stanines? What would you say about Sid's reading in
relation to his classmates?
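As a rough aid to the stanine questions above, the sketch below converts a percentile rank to a stanine using the conventional stanine percentages (4-7-12-17-20-17-12-7-4). A publisher's own conversion table may differ slightly, so treat this as an approximation rather than the Stanford Achievement Test's actual procedure.

# Approximate percentile-to-stanine conversion using the conventional
# stanine cut points; an individual publisher's table may vary.
STANINE_UPPER_PERCENTILES = [4, 11, 23, 40, 60, 77, 89, 96]

def percentile_to_stanine(percentile_rank):
    for stanine, upper in enumerate(STANINE_UPPER_PERCENTILES, start=1):
        if percentile_rank <= upper:
            return stanine
    return 9

# Sid's Total Reading score: the 30th percentile falls in stanine 4,
# the lower end of the average band.
print(percentile_to_stanine(30))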
Compare Sid's scores on his reading subtests to those on his Math, Language, Spelling, Environment, and Listening subtests. What can you say about Sid's
performance in these areas? What area
of development seems to be strongest for him, and which is the weakest?
-------------------------------------------
Insert
Figure 4-4: Test Profile for Sid
-------------------------------------------
Now examine the criterion-referenced test
scores that are listed as Content Clusters.
On the Stanford Achievement Test, the items in the norm-referenced
subtests are further classified according to skill, allowing the provision of a
criterion-referenced analysis. In this
section, Sid's raw scores on specific skill categories are listed and given a rating of Below Average, Average, or Above Average.
Interpretation. An examination of Sid's grade equivalent scores in reading suggests that Sid was performing about 3-5 months below grade level at the end of first grade, and his stanine scores placed him in the low-average to average range. Sid fell below the 50th
percentile on all reading measures.
Sid's scores on language and spelling
tests fall within these same parameters.
It can be tentatively concluded from these scores that Sid's language functioning is in the low-average range for a pupil at his grade level, with one major exception in the area of listening. Note his high-average percentile and stanine scores in this area.
Sid's math scores show stanines and
percentile ranks to be above average for national norms. These math scores are much higher than his
language scores.
Further analysis of specific skill scores
on the criterion-referenced report shows that structural analysis appears to be
a weak area for Sid. His understanding
of vowels may be weak. Even though the
test ranks him at the "average" level in this skill, he only answered
6 of 12 items correctly. Punctuation
also is identified as a weak area. Most
of Sid's skill ratings in the language area are in the average range, even
those in the area of listening. The
criterion-referenced report does not provide enough information about specific
skills to differentiate low-average from middle-average, as might be done in
reporting stanines. Perhaps the
criterion-referenced measures do not have sufficient reliability for such
differentiation in reporting.
Sid does better in mathematics tasks than
he does in reading. At this grade level
he probably doesn't have to read story problems in verbal form. His lower performance in reading may affect
his mathematics performance in later years, if it is not remediated.
The results of this test battery indicate
that Sid is functioning at the low-average to average range for reading at his
grade level. This data might be
misleading, given his advanced age and his year in pre-first grade. Sid is a pupil who should be carefully
monitored to be sure that learning is progressing. Keep in mind that, from the information provided here, we do not
know how skills were tested in this battery, nor do we know the publisher's
operational definition for reading, math, language, and so forth. Until we examine the test, we cannot be sure
what the publisher means by structural analysis, reading comprehension, and
word reading, nor do we know how they were measured. We have very little information regarding how and what we should
teach Sid. We are only dealing with
comparisons to norming groups. With
only this formal testing information available, the teacher should conduct
informal assessments to form an instructional strategy.
Norm-referenced and criterion-referenced testing: Aileen
Examine the scores of Aileen, a fifth
grade pupil, on the Comprehensive Tests of Basic Skills (CTBS)
(Macmillan/McGraw-Hill School Publishing Company, 1990) in Figure 4-5.
-------------------------------------------
Insert
Figure 4-5: Test Profile for Aileen
-------------------------------------------
Aileen's teacher has been concerned with
her general achievement in the fifth grade classroom. It is October of the school year and Aileen does not seem to be
working at the same level as her peers.
Her teacher decided to examine her standardized test scores from the
CTBS/4 given in May of fourth grade.
The standardized test report also indicated that when compared to the
national norm group, her scores in the total battery were below the fiftieth
percentile, the national average. In
Total Reading her scores were as good as or better than approximately 15% of
the norming group. Her Total Language
scores were as good as or better than 13% of the norming group. Her Total Math scores were as good as or
better than 34% of the norming group.
An accompanying printed report, provided
by the test scoring service, suggested a need to develop her skills in the
following: Using paragraph context to
infer word meaning, spelling consonant sounds in words, analyzing and
interpreting passages, and interpreting written forms and techniques. These suggestions are based on a criterion
scoring system derived from an analysis of the test items according to
subskills.
The scores labeled "range"
indicate the band of percentile rank score reliability, based on the test's
standard error of measurement. For
example, her national percentile on reading vocabulary is 27, indicating that
she performed as well or better than 27% of those in the norming population at
the fourth grade level. The 27th percentile is her obtained score. When the test's statistical error factor is
taken into account through the standard error of measurement (SEM), her
hypothetical true score should lie within a band from the 19th percentile to
the 38th percentile.
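The sketch below illustrates, in approximate form, how such a band can arise: the obtained percentile is placed on the normal curve, one standard error of measurement is added and subtracted, and the endpoints are converted back to percentile ranks. The SEM value used here is hypothetical, chosen only to show why a band like the published one (19th to 38th) is not symmetric around the obtained 27th percentile; this is not the publisher's actual computation.

# Illustrative only: building a percentile-rank band from a hypothetical SEM.
from scipy.stats import norm

obtained_percentile = 27
sem_in_z_units = 0.30   # hypothetical standard error, in standard-deviation units

z = norm.ppf(obtained_percentile / 100)      # obtained score on the normal curve
low = norm.cdf(z - sem_in_z_units) * 100     # lower end of the band
high = norm.cdf(z + sem_in_z_units) * 100    # upper end of the band
print(f"band: about the {low:.0f}th to the {high:.0f}th percentile")

Because percentile ranks are not an equal-interval scale, equal distances on the underlying score scale translate into unequal distances in percentile units, which is why the band extends farther above the obtained score than below it.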
Examine Aileen's scores and determine what you
can conclude about her reading, language and mathematics abilities from these
scores. Compare areas of skill
strengths and weaknesses. Look at the
normal curve diagram of derived scores in Figure 4-. Where do Aileen's scores place her on the NCE scales, the
percentile rank scales, and stanines?
How would you describe her performance as a fourth grader, and how would
you project she will perform in fifth grade?
What other information would you like to have before planning
instruction or remediation?
Interpretation: Aileen performed well below the 50th percentile in the reading and language subtests. Examining the normal curve, it appears that she falls from 1 to 1 1/2 standard deviations below the mean on most reading and language measures. This suggests that she is well below her peer group on these
measures. Aileen's scores suggest that
she is a remedial reader who has serious difficulty handling written
language. Her mathematics scores are
much higher than her reading/language scores.
Her high score in Mathematics Computation suggests that she can do basic
work with numbers. But the lower scores in Mathematical Concepts and Applications indicate she might have difficulty when she reads math problems or works with higher level mathematical concepts.
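To make the normal-curve reasoning above concrete, the sketch below converts Aileen's total percentile ranks into standard-deviation units (z scores) and normal curve equivalents, using the standard NCE scaling of 50 + 21.06z. The percentile values are the totals quoted in the case study; the conversion assumes a normal distribution of scores.

# Converting percentile ranks to z scores and normal curve equivalents (NCEs).
from scipy.stats import norm

percentiles = {"Total Reading": 15, "Total Language": 13, "Total Math": 34}

for area, pr in percentiles.items():
    z = norm.ppf(pr / 100)      # position in standard-deviation units
    nce = 50 + 21.06 * z        # standard NCE scaling
    print(f"{area}: {pr}th percentile -> z = {z:.2f}, NCE = {nce:.0f}")

# Total Reading and Total Language fall about one standard deviation below
# the mean, while Total Math sits much closer to the average range.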
Her teacher can use these test scores to
confirm her observations of Aileen's difficulties in the fifth grade
classroom. However, the scores on this
test, as on all general achievement tests, are general rather than
specific. From the printed, verbal
skills report from the scoring service, we can tentatively conclude that Aileen
has reading and written language difficulties in the areas of higher level
reading comprehension, use of context, interpreting written forms and
techniques, and spelling consonant sounds.
But we are unsure as to the reliability of these reports, and the
vaguely described "interpreting written forms and techniques"
statement is unhelpful.
We
do not know if Aileen has a word recognition problem involving grapheme-phoneme
correspondence, but we might infer that she does, based on the low spelling
score. We do not have enough information
to decide whether Aileen has difficulty understanding what she reads because
she can't decode the words, because she doesn't understand the author's
message, or because of some other reason.
We know, however, that her reading difficulty did not start in fifth
grade. We also know that her teacher
should continue to assess Aileen's reading to determine specific strengths and
weaknesses.
D. Applying Through Practice
Selecting tests. Read the following descriptions and decide what types of assessment you would use to best meet your purposes, and why you would choose them.
1.
Mr. Slater. Mr. Slater is the
new reading coordinator in a K-12 school with a total population of 1500. The school had been without a reading
coordinator for several years and the principal noted that the standardized
test scores in reading had been dropping a few points each year. Now, he was concerned about provision for
improved reading instruction in both the developmental reading/language arts
and the remedial reading programs. As a
first step in this process, the principal asked Mr. Slater to identify those
pupils who needed extra help in reading.
Given the discussion of standardized tests in this chapter, as well as your own background knowledge about such tests, what type of testing would you institute to determine which pupils were in need of extra reading instruction?
2. Ms. Kornell. Ms. Kornell, a reading coordinator, has been concerned about a
group of 20 third graders in her school who scored well below grade level on
the reading and language sections of the Iowa Test of Basic Skills (Riverside Publishing Company, 1993), administered in May.
Although she has general vocabulary and comprehension information, she
doesn't have information about their specific skill strengths and weaknesses.
She wants to have a plan in place for these pupils when they return to school
in the Fall. What standardized tests
would be appropriate for these purposes?
3. Mark.
Mark, a third grade student, is not making progress in his classroom
reading program. It is January of the
school year and his teacher, Mrs. Ives, wants to know his level of
reading. She feels that he might need
special reading instruction with the remedial reading teacher or that he might
be learning disabled and need a special education placement. Mrs. Ives asks the reading teacher to
administer a reading test to Mark that will provide her with some of the
information she needs to proceed with a referral for remedial reading or
special education help. What standardized
tests might be appropriate for Mark?
4.
Barry. Barry is a fifth grade pupil
who has been receiving special reading help for the past two years. His classroom teacher has recently noticed a
positive change in his reading, in that word recognition no longer seems to be
a problem. Barry's classroom teacher
asked the reading teacher to administer a test to determine Barry's present
reading level and areas of skill development that were still needed. What test(s) would you administer to Barry?
Reporting scores to parents
Stewart, a third grade student in the
school at which you are a reading specialist, took the Metropolitan Achievement
Tests (Psychological Corporation, 1992) in May of the school year. His scores are listed in Figure 4-6.
--------------------------------------------------
Insert
Figure 4-6
Test
Profile for Stewart
---------------------------------------------------
Stewart's
mother is concerned about his reading and has planned to conference with you
this week. She received this printout but doesn't know how to interpret the
information. She is well educated and
is insistent on being provided a clear explanation of the various scores. How will you explain the various scores,
and what will you say about Stewart's reading ability based on the test
information? In addition, consider how
you will report the scores to Stewart's classroom teacher and to the building
principal.
E. Reviewing What You've Learned
Formal assessment involves the use of
standardized tests or other tests designed for administration under specified,
controlled conditions. Standardized
tests have been administered to a norming population to provide statistical
information on the test's reliability and validity, as well as on derived scores: Percentile ranks, normal curve equivalents, grade equivalents, standard scores, and stanines are described in detail.
Reading/literacy tests are categorized
most broadly as reading survey, general achievement, and reading
diagnostic. For students with special
needs, some general achievement and diagnostic tests are designed for
individual administration. Some formal
tests of reading focus on specific aspects of the process, such as silent or
oral reading.
Criterion-referenced tests have attempted
to better match testing to curriculum by providing measurement of student
performance on specific skills as compared to a cut-off criterion score that
indicates mastery of the skills.
Performance-based assessment is a more recent development that also
attempts to better match testing to curriculum by providing tasks that better
simulate classroom activities than the traditional multiple choice format.
While
standardized tests provide the user with information about groups' and
individuals' performances, they do not provide information about how readers
process print or select answers. Data
from standardized tests is mostly quantitative. Teachers have limited ability to make qualitative inferences
about pupils' strengths and weaknesses in strategies or in specific skill
areas.
F. Further Reading: For your information
Annotated bibliography of major studies
Farr, R., & Carey, R. (1986). Reading: What can be measured? Newark, DE: International Reading Association.
This monograph has seven chapters related
to assessing reading comprehension, word recognition, vocabulary, study skills
and rate. The authors include a chapter
on validity and reliability in reading assessment, as well as accountability in
assessment.
Baumann, J. (1988). Reading assessment: An instructional decision-making perspective. Columbus, OH: Merrill Publishing Co.
This is an excellent reference for both
classroom and clinic. The author
provides interesting chapters on interpreting standardized reading tests,
evaluating reading tests, and using test data to make instructional
decisions. He includes a chapter about
informal reading inventories and a chapter on making instructional decisions
from the data. There are many examples
in this book that help a teacher understand the use of tests for reading
assessment.
Kibby, M. (1981). The degrees of reading power. Journal of
Reading, 24, 416-427.
The author describes the development of
the DRP and its relationship to cloze testing.
A detailed report of the scoring and interpretation, administration, and
the reliability and validity of the instrument can be found in this article.
Baumann, J., & Stevenson, J. (1982). Understanding standardized reading test scores. The Reading Teacher, 35, 648-655.
This article explains how to interpret
scores obtained from standardized reading tests: grade equivalent, stanines,
and percentiles. The authors provide a
detailed explanation of grade equivalent scores, how they are developed and the
use of interpolation and extrapolation in obtaining such scores. Percentiles and stanines are also explained
in terms of their construction and how they can be interpreted. The advantages and limitations of each score
are also discussed.
Ysseldyke, J., & Marston, D. (1982). A critical analysis of standardized reading tests. School Psychology Review, 11, 257-266.
Although this article was written for
school psychologists, it is also relevant for the reading teacher or classroom
teacher who is responsible for test selection. The test analysis is based on a
bottom-up model of reading: units, skills, and knowledge which the authors use
as a framework to assess the content of commonly used standardized tests of
reading. Charts are provided with coefficients of criterion validity,
reliability, and ratings of norming procedures. Standardized tests are also rated on their usefulness for screening, placement, instructional planning, pupil evaluation, and program evaluation. Even
if the reader is not in agreement with the reading model or the uses for such
tests, it is important to understand that data from standardized tests are
often used in such a straightforward manner.
Valencia, S., & Pearson, P. D. (1988). Principles for classroom comprehension assessment. Remedial and Special Education, 9, 26-35.
The authors base this paper on an
interactive model of reading. They
state positive and negative aspects of several types of assessment instruments
such as standardized tests, criterion-referenced tests, and teacher made tests. They recommend appropriate uses for different
types of tests. They provide five principles of reading comprehension assessment: 1) reading assessment must acknowledge the complexity of the process, 2) reading assessment should focus on the orchestration of many kinds of knowledge and skills, 3) reading assessment must allow teachers to assess the dynamic quality of the comprehension process, 4) the teacher is the advocate for students in their progress toward expert reading, and 5) teachers must employ a variety of measures for making instructional decisions.
Farr, R. (1987). New trends in reading assessment: Better tests, better uses. Curriculum Review, pp. 21-23.
Farr presents an interesting article about
the dilemma of testing in our schools.
He raises the issue of tests as accountability instruments and discusses
the widespread misinterpretation of test scores. He suggests the need to develop better tests, to use caution in
interpreting scores, to develop additional procedures to collect information,
and to assess the process of reading rather than the result.