James McMillan, Editor

A fundamental assumption of the standards-based accountability
reform movement is that schools and students demonstrate improvement
in performance. This implies change in student achievement
that leads policy makers to conclude that adequate yearly
progress has occurred. While this tenet appears straightforward,
there are technical, practical, and value issues that have
important implications for determining progress over time.
How much progress is required to be considered “adequate”
or “more than adequate?” How much yearly progress is “exceptional?”
How many years of progress are needed to suggest a clear trend?
Many state testing programs are now at a point where several
years of data are available, and accurate interpretation of
these trends is needed.
In the past few decades the most common approach to evaluating
school progress has been based on student achievement
on norm-referenced, standardized tests. These tests allow
year-to-year comparisons by using standard scores and providing
a national norm group to show achievement relative to the
nation as a whole. State assessment systems are now using
tests specific to state standards, in addition to norm-referenced
national tests. In Virginia, for example, school accountability
is based primarily on reaching a specified standard in terms
of the percentage of students passing tests. This type of
data is what is used for examining yearly progress, with norm-referenced
testing playing a supplemental or secondary role.
However they are made, year-to-year determinations of progress are essential
in making personnel, staff development, curricular, and resource-allocation
decisions. Progress over time is also needed to evaluate
the effectiveness of different educational reforms and programs
so that effective ones can be highlighted and shared with
others, and ineffective ones revised or discarded. Measuring
progress or change is also a foundation of national testing
policy. President Bush has signaled a need for annual testing
in English and mathematics. The primary reason for yearly
testing is to be able to document progress over several years
of schooling for accountability.
There are two approaches to examining progress over time.
One approach uses unadjusted pass rates each year. This is
called a current status score method. A second approach makes
statistical adjustments to account for student background
factors, such as socioeconomic status and pretest scores
of students. This method is termed value-added.
Current Status Scores
Many states, including Virginia, base all measures of progress
on the performance of students as related to whether or not
each student “passed” each test. Typically, the percentage
of students passing one year is compared to the percentage
of students passing the next year in the same grade, and overall
judgments of quality are made solely on the basis of what
percentage of students in a school obtain passing scores.
The level of achievement required to “pass” is the same for
all schools and students, and the same benchmarks are used.
This approach sets the same level of expectation for all students
and schools and does not require testing in each grade. Factors
such as student socioeconomic status and entering student
capabilities are not used. Rather, progress is defined by
how students in each grade perform for several years. That
is, for example, the third grade performance of students in
1999 may be compared to the third grade scores from 2000 and
2001.
The disadvantage of using current status scores to examine
progress is that different students are compared each year.
This results in test score variation that is due to having
students with different capabilities in a given year. This
occurs commonly in schools that have high student mobility,
in areas that are experiencing population growth or decline
(e.g., rural divisions becoming more suburban), and in small
schools where there can be significant differences from one
year to the next just because of chance fluctuations in student
ability.
Value-Added
The value-added approach to measuring school progress is
essentially a longitudinal analysis of achievement of the
same students over time. Value-added models begin with an
assumption that clearly differentiates them from current status
score models. This assumption is that the most reasonable
estimate of yearly progress is what is actually gained over
a specified period of time, rather than comparisons to pre-established
levels of performance that are the same for all schools.
As argued by Wheat (2000), “school accreditation should not
depend on the students a school has. Rather, it should depend
on what a school does with the students it has.” (p. 4)
The best known value-added model is one developed by William
Sanders and implemented in Tennessee in 1992. In the Tennessee
Value-Added Assessment System (TVAAS) student achievement
data from several previous years are used to establish an
“input” level for a particular year at each school. Thus,
each school has a unique input “index” for the year. Based
on this index, the amount of gain is calculated to estimate
the “value” that is added by the school. Only test scores
for students in each of the previous years are used in the
Tennessee model.
Variations of the Sanders model are used in several states.
In North Carolina, the average rate of growth observed across
the state as a whole from one grade to the next is used to
establish a benchmark of “expected” change. In California
and Pennsylvania, targeted improvement is dependent at least
in part on how schools with similar student socioeconomic
status levels perform.
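The growth-benchmark idea described above can be sketched in a few lines. This is a minimal illustration only, not the TVAAS procedure or any state's actual model: the function names, the student scores, and the expected gain of 15 points are all hypothetical, and the sketch assumes the same students can be matched across two consecutive years.

```python
from statistics import mean


def mean_gain(prior_scores, current_scores):
    """Average gain for the SAME students, matched by list position."""
    if len(prior_scores) != len(current_scores):
        raise ValueError("scores must be matched student by student")
    return mean(c - p for p, c in zip(prior_scores, current_scores))


def value_added(prior_scores, current_scores, expected_gain):
    """Observed mean gain minus a benchmark ("expected") gain."""
    return mean_gain(prior_scores, current_scores) - expected_gain


# Hypothetical scale scores for five students tested in consecutive years.
prior = [380, 395, 410, 420, 445]
current = [400, 410, 430, 435, 465]
EXPECTED_GAIN = 15  # hypothetical benchmark, e.g. a statewide average gain

print(mean_gain(prior, current))                   # 18
print(value_added(prior, current, EXPECTED_GAIN))  # 3
```

A positive result means the matched students gained more than the benchmark; a negative result means they gained less. Operational value-added models use several years of data and far more elaborate statistics than this simple difference.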
The basic tenet of value-added models is that schools should
only be held accountable for what they can control, which
is taking a group of students who enter at the beginning of
the year with certain levels of knowledge and skills, to a
higher level of performance. While this idea has a compelling
logic from the standpoint of what schools can do to impact
change, there are also disadvantages. First and foremost
is that value-added approaches essentially set different levels
of expectations for different groups of students, typically
tied closely to socioeconomic status. Schools with primarily
low socioeconomic status students have lower expectations
than do schools with mostly high socioeconomic status students.
Many believe that this is unfair to low socioeconomic status
students and perpetuates low achievement. Another important
limitation is that the statistical methods that are used in
value-added models are complicated and difficult to explain.
Small differences in the complex statistical procedures can
result in very different numbers of students meeting standards
(Linn and Baker, 1999). A continuing criticism of the Sanders
model is that the statistical procedures have not been publicly
available for verification. This results in possible mistrust
since the data appear to be transformed somehow in a “black
box.” Finally, value-added models work best when there is
testing each year. This requirement adds even more testing
time, taking time away from instruction, and increases the
cost of the program.

- 24 states, including Virginia, evaluate schools by comparing
current to past performance; only 7 states use school-to-school
comparisons.
- One state, Texas, requires subgroups of students to reach
the same standards.
- 19 states (not Virginia) use student progress as an indicator
to reward schools.
- 15 states test students in reading and math every year in
grades 3-8, including North and South Carolina, Texas, Tennessee,
Maryland, and Florida.
- In Texas, the percentage of Black and Hispanic students retained
in grade 9 has risen to nearly 30%, and the percentage of
students “in special education” nearly doubled from 1994 to
1998. These exclusions account for some of the increase in
10th grade TAAS test scores (Haney, 2000).

When implementing policies that examine school progress there
is a need to consider several factors. Some of these are
technical issues while others are matters of perspective or
value. The first is concerned with how progress may be
misinterpreted to mean that students are learning more about
the standards being tested.
Why Have Scores Increased?
As has been demonstrated repeatedly with all large-scale
testing programs, gains in scores the first few years may
or may not indicate a real improvement in the broader achievement
goals that have been identified. It is clear that some improvement
in high-stakes test results will occur simply because of curriculum
narrowing and special test preparation, particularly for mathematics
skills and content areas such as history and science. For
example, a national survey of teachers in 2000 found that
45% instructed their classes in test-taking skills a great
deal, while another 34% did so somewhat (A better balance:
Standards, tests, and the tools to succeed, 2001, p. 21).
English and language arts skills are more difficult to impact
by changing the curriculum or teaching test-taking skills.
This principle is illustrated fairly well in Table 1, which
shows the percentages of Virginia students passing grade 5
Standards of Learning Tests for the first four years of the
program. Percentage passing scores improve much more in mathematics,
writing, and history than in English.
Table 1
Percentages of Grade 5 Students Receiving Passing Scores on SOL Tests, by Year*
| Subject | 1998 | 1999 | 2000 | 2001 |
| English | 68 | 70 | 68 | 73 |
| Mathematics | 47 | 51 | 63 | 67 |
| Writing | 65 | 81 | 81 | 84 |
| History | 33 | 46 | 52 | 63 |
*Source: Virginia Department of Education.
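The pattern in Table 1 can be checked with simple arithmetic. The sketch below uses only the pass rates reported in the table; the variable and function names are illustrative.

```python
# Pass rates (percent) from Table 1 for 1998, 1999, 2000, and 2001.
pass_rates = {
    "English": [68, 70, 68, 73],
    "Mathematics": [47, 51, 63, 67],
    "Writing": [65, 81, 81, 84],
    "History": [33, 46, 52, 63],
}


def yearly_changes(rates):
    """Change in percentage points from each year to the next."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


def total_change(rates):
    """Change in percentage points from the first year to the last."""
    return rates[-1] - rates[0]


for subject, rates in pass_rates.items():
    print(subject, yearly_changes(rates), total_change(rates))
# English [2, -2, 5] 5
# Mathematics [4, 12, 4] 20
# Writing [16, 0, 3] 19
# History [13, 6, 11] 30
```

Over the four years English rose 5 percentage points, compared with 19 to 30 points in the other three subjects.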
The Effect of Using New Tests
Whenever a newly normed form of a standardized achievement
test is released, the scores of students in the first year
of the new form are almost always lower than the previous
year scores. This tendency is illustrated in Figure 1 (adapted
from Linn, 1998).

This effect can be expected with any new testing program, including state high-stakes assessments. This trend is important for state programs that seek to revise standards every few years. It would be unusual not to see a drop in test scores the first year or two after new standards, and new tests based on those standards, are implemented. In Virginia, for example, as new standards are adopted in social studies and history, it would be expected that tests covering these new standards would show lower pass rates than the current pass rates.
Similarly, test scores will fluctuate if the nature of the population being tested changes substantially from one year to the next because of policy changes or unintended consequences, resulting in a more or less capable group in any specific year. For example, the inclusion or exclusion of students with disabilities, changing grade promotion policies, or increasing dropout rates could change the population being tested.
Sampling From Domains
Standards-based assessments are classified as domain-referenced tests. This means that the test items used in any given year are selected to represent a domain of knowledge. The domain is defined by goals, objectives, or standards, and usually a single domain represents content and skills covered in several lessons or even years of curriculum. Typically, a small number of items are used to represent an entire domain. This means that not all objectives or standards can be assessed in any single administration of the test. One year an item covering one standard could be included, but the next year there may be no items from that standard. This principle of domain-referenced test development is illustrated in Figure 2 with the 5th grade Virginia SOL test in Social Studies:

Figure 2. Illustration of Domain-Referenced Test Items
This sampling process complicates year to year comparisons because, while the same general goals are tested, specific standards or objectives are likely to have a different emphasis from one year to the next. In other words, there will be fluctuation in test scores from year to year simply on the basis of how the items were selected to represent the larger domains of knowledge. This suggests that only changes that are consistent for several years should be interpreted to mean that an actual improvement in student performance has been demonstrated.
How Much Change Is Enough?
A continuing policy issue when school progress is evaluated is knowing how much of a change is needed to make valid conclusions that use value labels such as adequate, inadequate, and exemplary. How much change should there be from year to year to suggest that a truly significant improvement has occurred? This is a difficult question to answer because “significant” and “adequate” can be defined in different ways. Suppose very small increases are noted from one year to the next. It could be as small as moving from 57.8% passing to 58.2% passing. Some may interpret this small increase positively, concluding that adequate school progress is indicated by the test scores. Others interpret such a small increase as potentially being caused by chance fluctuation or changing demographics.
What about schools that have very high scores one year and slightly lower scores the next year? Is it reasonable to conclude that the school is not making adequate progress because of this small decline? In fact, because of what is called a ceiling effect (scoring at the top) it is usually much more difficult for initially high-scoring schools to show improvement. Because of chance fluctuations, initially high scores may lessen slightly in subsequent years. Likewise, initially very low scores are likely to increase because of chance fluctuation.
To give all students room to show improvement, test items need to be fairly difficult. This leads to some confusion since the percentage correct that is typically used in classroom grading (e.g., 94% for A, 88% for B) cannot be duplicated in high-stakes tests because of the lack of opportunity for high-scoring students to show improvement. Thus, most high-stakes tests have difficult items, even for the best students. This means that even though the percentage correct may seem low for a school as a whole (e.g., 70%) if classroom grading standards are used, it may represent adequate knowledge on the standards being tested.
One approach to determining whether the amount of change is adequate is to use measures of substantive or practical significance. One such indicator of practical significance that has become popular is effect size or magnitude-of-effect. Effect size is a number that provides guidance when making conclusions about the substantive significance of differences. It is calculated by dividing the difference observed across years by a measure of the variability of the scores. This results in a number that is typically between 0 and 2. The current convention is to evaluate an effect size of .3 or higher as important or substantive, while effect size less than .3 tends to be minimal.
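The calculation described above can be written out as a short sketch. It assumes that the "measure of variability" is the standard deviation of the scores, which is one common choice but is not specified here; the function names and the example scale scores are hypothetical.

```python
def effect_size(mean_before, mean_after, std_dev):
    """Difference between two yearly means divided by a measure of spread.

    Assumption: std_dev is the standard deviation of the scores.
    """
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (mean_after - mean_before) / std_dev


def is_substantive(es, threshold=0.3):
    """Apply the .3 convention described in the text."""
    return abs(es) >= threshold


# Hypothetical average scale scores with a standard deviation of 50.
small_change = effect_size(410, 415, 50)  # 5 / 50
large_change = effect_size(410, 430, 50)  # 20 / 50

print(is_substantive(small_change))  # False
print(is_substantive(large_change))  # True
```

Because the difference is divided by the spread of scores, the same 5-point gain would count for more on a test whose scores are tightly clustered than on one whose scores vary widely.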
Effect size is illustrated in Figure 3, which uses Virginia SOL test scale scores as an example. Here the calculated effect size is 1. The figure shows how much change is needed to obtain this effect size, which would be evaluated as meaningful or significant progress for a single student. The amount of change that a school would need to demonstrate to obtain a significant effect size index would be much smaller.

Figure 3. Effect Size Calculation for Individual Student Scores on the Virginia SOL Algebra I Test.
Which Scores Are Used?
Graphs used to represent the results of the test may suggest different conclusions about improvement, depending on which scores are used and the nature of the scale. In Virginia, many policy-makers focus on the percentage of students passing. Another index of test score performance is the average scale score achieved each year. The average reflects every student's score, while the percentage of students passing takes into account only two categories: at or above the passing score, and below it. These indicators are graphed in Figures 4 and 5 for Virginia SOL test data from 1998 to 2000 for grade 5 reading and mathematics.
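A small example shows how the two indicators can tell different stories. The scores and the cut score of 400 below are hypothetical and chosen only to make the contrast visible.

```python
from statistics import mean

CUT_SCORE = 400  # hypothetical passing score


def percent_passing(scores, cut=CUT_SCORE):
    """Percentage of scores at or above the cut score."""
    return 100 * sum(s >= cut for s in scores) / len(scores)


# Hypothetical scale scores for two years of four students each.
year_1 = [380, 390, 410, 420]
year_2 = [395, 399, 401, 480]

print(percent_passing(year_1), mean(year_1))  # 50.0 400
print(percent_passing(year_2), mean(year_2))  # 50.0 418.75
```

The pass rate is 50% in both years, so a graph of the percentage passing would show no progress, while a graph of average scale scores would show a gain of almost 19 points.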

Both of these graphs show actual change over three years. Which provides the clearest, most accurate indication of school progress? The graphs vary somewhat, which could lead to different conclusions. Figure 4 suggests that improvement for math has been significant, with little progress in English. Figure 5 looks as though there was little progress for math and English. The most appropriate way of presenting the results would be the format that communicates conclusions that can be verified by other data, though this is not usually done.
Progress also looks different if scores of various subgroups of students are examined in addition to scores of all students together. The subgroups could be defined by school, by teacher, by overall level of performance (e.g., students in the bottom quartile the first year), by race, and by gender. This allows more specific evaluations that can target resources for improvement. It is likely, in fact, that relative progress over several years will not be the same for all subgroups of students. The more disaggregated the data, the more accurately they can be used to understand all aspects of school progress and to target resources for improvement.
Compare Progress With Other Indicators
One procedure to help ensure valid conclusions about student progress is to match progress scores with other indicators of student achievement. These other indicators could be division tests, placement tests such as the SAT and ACT, international tests, and National Assessment of Educational Progress (NAEP) tests. More informal indicators may also be useful. Teacher perceptions, student performance on teacher-made tests, and grades students receive can also provide a valuable check. When different measures point to the same trend, the progress reported is well supported. One caution with using additional indicators is that it is essential to know the degree of overlap between what each one measures. Other standardized tests will have the best overlap with state standards-based tests, while teacher perceptions, grades, and international tests will not be aligned nearly as well. It is also important to include some estimate of student motivation, which is a key element in demonstrating the greatest improvement. Some national and international testing programs are not taken seriously by students, particularly students in higher grade levels.

How progress is reported has a direct bearing on school report cards and the consequences of high-stakes testing. The challenge for school report cards is to strike a balance between presenting sets of scores that show improvement over time and scores on a sufficient number of indicators to provide a balanced interpretation of school performance. For example, a school report card may include only high-stakes testing results, but show several years of progress. While this may indicate improvement on student performance on the tests, a complete evaluation would need to consider additional indicators, such as dropout rate, graduation rate, attendance, and percentage of teachers with appropriate credentials.
The consequences of high-stakes testing are directly affected by the approach taken to signify progress. Decisions concerning which scores to report, and whether and how student background characteristics are accounted for, will impact overall determinations of school quality. Clearly, judgments about teachers or administrators will depend on the approach taken to indicate whether progress is being made. One of the advantages of stressing progress over time, i.e., several years, is that the approach recognizes that change does not occur swiftly, especially for veteran teachers. There is a reasonable assumption when this approach is used that it takes time, and judgments based on only a year or two provide only a tentative indication of improvements in teaching.

Tracking school progress with meaningful data, and deriving valid conclusions, is a process that requires attention to some technical issues related to test score changes and test development, as well as to values and perspectives about what constitutes progress. Policy-makers will make more accurate conclusions by following principles that have been demonstrated to be helpful:
- Consider using both current status scores and some type
of value-added index. This will focus attention on improvement
over time by taking into account input factors that the
school has little or no control over. Input changes can be
charted along with test score data. It may be beneficial
to use status score results to trigger value-added scores
(e.g., if status scores are low, then look at trends over
time or value-added scores).
- Understand that progress on domain-referenced tests may
be incremental due to changes each year in what is sampled
to be assessed.
- Consider magnitude of effect and ignore very small changes
from year to year. Test score error and changes in samples
from year to year will affect score variation. Place most
emphasis on relatively large and sustained changes, and
report the margin of error, including, if possible, the
probability that a school has been misclassified.
- Use several years of progress to draw conclusions about
whether students are learning more and whether teachers
or specific programs have been effective. It takes time
for change in curriculum and teaching to have an effect
on student achievement.
- Validate progress trends with data from other indicators
of student achievement such as a nationally normed achievement
test, NAEP, or placement tests such as the SAT, Advanced
Placement tests, and the ACT.
- Disaggregate data as much as possible to make better links
between progress and instructional factors responsible for
the changes.
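The advice above to report a margin of error can be illustrated with the 57.8% to 58.2% example discussed earlier. The sketch below is a rough approximation only: it assumes that each year's test takers behave like an independent random sample, uses a normal approximation with a 95% multiplier of 1.96, and supposes a hypothetical school that tests 100 students each year. It does not estimate the probability that a school has been misclassified.

```python
from math import sqrt


def change_margin_of_error(p1, p2, n1, n2, z=1.96):
    """Approximate margin of error for the change between two pass rates.

    p1, p2 are proportions passing (0 to 1); n1, n2 are numbers tested.
    Assumption: independent samples and a normal approximation.
    """
    standard_error = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return z * standard_error


def exceeds_margin(p1, p2, n1, n2, z=1.96):
    """True when the observed change is larger than its margin of error."""
    return abs(p2 - p1) > change_margin_of_error(p1, p2, n1, n2, z)


# Pass rates from the example in the text; 100 students per year is hypothetical.
margin = change_margin_of_error(0.578, 0.582, 100, 100)
print(round(100 * margin, 1))                  # margin in percentage points
print(exceeds_margin(0.578, 0.582, 100, 100))  # False
```

Under these assumptions the margin of error is roughly 14 percentage points, so a change of 0.4 points cannot be distinguished from chance fluctuation.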
Sources
A better balance: Standards, tests, and the tools to succeed. (2001). Education
Week, 20(17)
A closer look: State policy trends in three key areas of the Bush education plan -- testing, accountability and school choice. (2001). Denver: Education Commission of the States.
AERA position statement concerning high-stakes testing in
prek-12 education. (2000). www.aera.net.about.policy.stakes
Haney, W. (2000). The myth of the Texas miracle in education. Education
Policy Analysis Archives, 8 (41).
Klein, S. P., & Hamilton, L. (1999). Large-scale testing: Current practices
and new directions. Santa Monica, CA: RAND Education.
Linn, R. L. (1998). Assessments and accountability. CSE Technical Report 490. Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.
Linn, R. L. (2001). Reporting school quality in standards-based accountability systems. CRESST Policy Brief 3. Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.
Linn, R. L., & Baker, E. L. (1999). Standard-Based Accountability Systems’
Adequate Yearly Progress: Absolutes, Wishful Thinking, and
Norms. The CRESST Line. Los Angeles: National Center
for Research on Evaluation, Standards, and Student Testing.
Linn, R. L., & Herman, J. L. (1997). A policymaker’s guide to standards-led
assessment. Denver, CO: Education Commission of the
States.
Standards for educational and psychological testing (3rd Ed.). (2000).
Washington, DC: American Educational Research Association.
Wheat, D. (2000). Value-Added Accountability: A Systems Solution to the School Accreditation Problem. Springfield, VA: The Thomas Jefferson Institute for Public Policy.
Organizations
Achieve, Inc., web site: www.Achieve.org
American Educational Research Association, web site: www.aera.net
CCSSO State Collaborative on Assessment and Student Standards
(SCASS). web site: www.ccsso.org.
National Center for Research on Evaluation, Standards, and
Student Testing (CRESST). web site: www.cse.ucla.edu
Education Commission of the States, web site: www.ecs.org
FairTest (National Center for Fair & Open Testing). web
site: www.fairtest.org
Fordham Foundation, web site: www.edexcellence.net
National Council on Measurement in Education, web site: www.ncme.org
Rand Corporation. web site: www.Rand.Org

Send comments or additional information to cepi@vcu.edu.
Please indicate in an e-mail the copyright source and contact
information for new inclusions.
Copyright © CEPI 2000
CEPI grants permission to reproduce this paper for noncommercial purposes if
CEPI is credited.