
CEPI - Commonwealth Educational Policy Institute
Policy Issues - Standards / Assessment / Accountability

James McMillan, Editor

Measuring Yearly Progress

Descriptive Context

A fundamental assumption of the standards-based accountability reform movement is that schools and students demonstrate improvement in performance.  This implies change in student achievement that leads policy makers to conclude that adequate yearly progress has occurred.  While this tenet appears straightforward, there are technical, practical, and value issues that have important implications for determining progress over time.  How much progress is required to be considered “adequate” or “more than adequate”?  How much yearly progress is “exceptional”?  How many years of progress are needed to suggest a clear trend?  Many state testing programs are now at a point where several years of data are available, and accurate interpretation of these trends is needed.

In the past few decades the most common approach to evaluating school progress has been based on student achievement on norm-referenced, standardized tests.  These tests allow year-to-year comparisons by using standard scores and providing a national norm group to show achievement relative to the nation as a whole.  State assessment systems are now using tests specific to state standards, in addition to norm-referenced national tests.  In Virginia, for example, school accountability is based primarily on reaching a specified standard in terms of the percentage of students passing tests.  These pass-rate data are what is used for examining yearly progress, with norm-referenced testing playing a supplemental or secondary role.

However they are made, year-to-year determinations of progress are essential in making personnel, staff development, curricular, and resource-allocation decisions.  Measures of progress over time are also needed to evaluate the effectiveness of different educational reforms and programs so that effective ones can be highlighted and shared with others, and ineffective ones revised or discarded.  Measuring progress or change is also a foundation of national testing policy.  President Bush has signaled a need for annual testing in English and mathematics.  The primary reason for yearly testing is to be able to document progress over several years of schooling for accountability.


Differing Perspectives

There are two approaches to examining progress over time.  One approach uses unadjusted pass rates each year.  This is called a current status score method.  A second approach makes statistical adjustments to account for student background factors, such as socioeconomic status and “pretest” scores of students.  This method is termed value-added.

Current Status Scores

Many states, including Virginia, base all measures of progress on the performance of students as related to whether or not each student “passed” each test.  Typically, the percentage of students passing one year is compared to the percentage of students passing the next year in the same grade, and overall judgments of quality are made solely on the basis of what percentage of students in a school obtain passing scores.  The level of achievement required to “pass” is the same for all schools and students, and the same benchmarks are used.  This approach sets the same level of expectation for all students and schools and does not require testing in each grade.  Factors such as student socioeconomic status and entering student capabilities are not used.  Rather, progress is defined by how students in each grade perform for several years.  That is, for example, the third grade performance of students in 1999 may be compared to the third grade scores from 2000 and 2001.

The disadvantage of using current status scores to examine progress is that different students are compared each year.  This results in test score variation that is due to having students with different capabilities in a given year.  This occurs commonly in schools that have high student mobility, in areas that are experiencing population growth or decline (e.g., rural divisions becoming more suburban), and in small schools where there can be significant differences from one year to the next just because of chance fluctuations in student ability.
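The effect of cohort size on chance fluctuation can be sketched with a simple calculation.  The example below assumes, purely for illustration, that each student's pass/fail result is an independent draw with the same underlying probability (a binomial simplification that is not part of the state reporting methods described here).  The function name, the 70% pass rate, and the cohort sizes are hypothetical.

```python
import math


def pass_rate_standard_error(true_pass_rate: float, cohort_size: int) -> float:
    """Standard error of an observed pass rate under a simple binomial model.

    Assumption (illustrative only): every student passes independently
    with the same probability `true_pass_rate`.
    """
    return math.sqrt(true_pass_rate * (1.0 - true_pass_rate) / cohort_size)


# Hypothetical cohorts: a small school, a mid-sized school, a large school.
for cohort_size in (25, 100, 400):
    se = pass_rate_standard_error(0.70, cohort_size)
    print(f"cohort of {cohort_size:>3}: about +/- {100 * se:.1f} percentage points")
```

Under these assumptions a cohort of 25 students has a standard error of roughly 9 percentage points, against roughly 2 points for a cohort of 400, so a swing of several points in a small school may reflect nothing more than which students happened to be enrolled that year.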

Value-Added

The value-added approach to measuring school progress is essentially a longitudinal analysis of achievement of the same students over time.  Value-added models begin with an assumption that clearly differentiates them from current status score models.  This assumption is that the most reasonable estimate of yearly progress is what is actually gained over a specified period of time, rather than comparisons to pre-established levels of performance that are the same for all schools.  As argued by Wheat (2000), “school accreditation should not depend on the students a school has.  Rather, it should depend on what a school does with the students it has.” (p. 4)

The best known value-added model is one developed by William Sanders and implemented in Tennessee in 1992.  In the Tennessee Value-Added Assessment System (TVAAS), student achievement data from several previous years are used to establish an “input” level for a particular year at each school.  Thus, each school has a unique input “index” for the year.  Based on this index, the amount of gain is calculated to estimate the “value” that is added by the school.  Only test scores for students in each of the previous years are used in the Tennessee model.

Variations of the Sanders model are used in several states.  In North Carolina, the average rate of growth observed across the state as a whole from one grade to the next is used to establish a benchmark of “expected” change.  In California and Pennsylvania, targeted improvement is dependent at least in part on how schools with similar student socioeconomic status levels perform.
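The general idea of comparing actual gains with an “expected” gain can be sketched as follows.  This is a deliberately simplified illustration in the spirit of the statewide-average benchmark described for North Carolina; it is not the Sanders/TVAAS procedure, whose statistical details are considerably more complex.  All function names, scores, and the expected gain of 15 points are hypothetical.

```python
from statistics import mean


def mean_gain(prior_scores, current_scores):
    """Average score gain for the same students, matched by position."""
    if len(prior_scores) != len(current_scores):
        raise ValueError("each student needs both a prior and a current score")
    return mean(cur - pri for pri, cur in zip(prior_scores, current_scores))


def gain_relative_to_expected(prior_scores, current_scores, expected_gain):
    """Positive values mean the school's students gained more than expected."""
    return mean_gain(prior_scores, current_scores) - expected_gain


# Hypothetical scale scores for four students tested in consecutive years.
school_prior = [380, 395, 410, 420]
school_current = [400, 410, 432, 438]

# Hypothetical statewide average gain from one grade to the next.
statewide_expected_gain = 15.0

print(gain_relative_to_expected(school_prior, school_current, statewide_expected_gain))
```

Because the comparison follows the same students, the result depends on how much they gained rather than on where they started, which is the feature that distinguishes this family of methods from current status scores.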

The basic tenet of value-added models is that schools should only be held accountable for what they can control, which is taking a group of students who enter at the beginning of the year with certain levels of knowledge and skills to a higher level of performance.  While this idea has a compelling logic from the standpoint of what schools can do to effect change, there are also disadvantages.  First and foremost is that value-added approaches essentially set different levels of expectations for different groups of students, typically tied closely to socioeconomic status.  Schools with primarily low socioeconomic status students have lower expectations than do schools with mostly high socioeconomic status students.  Many believe that this is unfair to low socioeconomic status students and perpetuates low achievement.  Another important limitation is that the statistical methods that are used in value-added models are complicated and difficult to explain.  Small differences in the complex statistical procedures can result in very different numbers of students meeting standards (Linn and Baker, 1999).  A continuing criticism of the Sanders model is that the statistical procedures have not been publicly available for verification.  This results in possible mistrust since the data appear to be transformed somehow in a “black box.”  Finally, value-added models work best when there is testing each year.  This requirement adds even more testing time, taking time away from instruction, and increases the cost of the program.

 

Snapshots of Research and Court Decisions

24 states, including Virginia, evaluate schools by comparing current to past performance; only 7 states use school-to-school comparisons.

One state, Texas, requires subgroups of students to reach the same standards.

19 states (not Virginia) use student progress as an indicator to reward schools.

15 states test students in reading and math every year in grades 3-8, including North and South Carolina, Texas, Tennessee, Maryland, and Florida.

In Texas, the percentage of Black and Hispanic students retained in grade 9 has risen to nearly 30%, and the percentage of students “in special education” nearly doubled from 1994 to 1998.  These exclusions account for some of the increase in 10th grade TAAS test results (Haney, 2000).

 

The Issue in Practice

When implementing policies that examine school progress there is a need to consider several factors.  Some of these are technical issues while others are matters of perspective or value.  The first is concerned with how “progress” may be misinterpreted to mean that students are learning more about the standards being tested.

Why Have Scores Increased?

As has been demonstrated repeatedly with all large-scale testing programs, gains in scores the first few years may or may not indicate a real improvement in the broader achievement goals that have been identified.  It is clear that some improvement in high-stakes test results will occur simply because of curriculum narrowing and special test preparation, particularly for mathematics skills and content areas such as history and science.  For example, a national survey of teachers in 2000 found that 45% instructed their classes in test-taking skills “a great deal,” while another 34% indicated “somewhat.” (A better balance:  Standards, tests, and the tools to succeed, 2001, p.21)

English and language arts skills are more difficult to impact by changing the curriculum or teaching test-taking skills.  This principle is illustrated fairly well in Table 1, which shows the percentages of Virginia students passing grade 5 Standards of Learning Tests for the first four years of the program.  The percentage of students passing improved much more in mathematics, writing, and history than in English.

Table 1

Percentages of Grade 5 Students Receiving Passing Scores on SOL Tests*

                       Year
Subject        1998   1999   2000   2001
English          68     70     68     73
Mathematics      47     51     63     67
Writing          65     81     81     84
History          33     46     52     63

*Source:  Virginia Department of Education.
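The subject-by-subject changes in Table 1 can be tallied directly.  The short sketch below uses only the pass rates reported in the table; the variable and function names are illustrative.

```python
# Grade 5 SOL pass rates (percent passing) from Table 1, 1998 through 2001.
years = [1998, 1999, 2000, 2001]
pass_rates = {
    "English": [68, 70, 68, 73],
    "Mathematics": [47, 51, 63, 67],
    "Writing": [65, 81, 81, 84],
    "History": [33, 46, 52, 63],
}


def yearly_changes(rates):
    """Change in percentage points from each year to the next."""
    return [later - earlier for earlier, later in zip(rates, rates[1:])]


def total_change(rates):
    """Change in percentage points from the first year to the last."""
    return rates[-1] - rates[0]


for subject, rates in pass_rates.items():
    print(f"{subject:<12} year-to-year: {yearly_changes(rates)}  "
          f"{years[0]}-{years[-1]}: {total_change(rates):+d}")
```

Over the four years English rises 5 percentage points, compared with 20 in mathematics, 19 in writing, and 30 in history.  English also dips between 1999 and 2000 before rising again, a reminder that a single year-to-year change can point in the opposite direction from the longer trend.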

 

The Effect of Using New Tests

Whenever a newly normed form of a standardized achievement test is released, the scores of students in the first year of the new form are almost always lower than the previous year scores.  This tendency is illustrated in Figure 1 (adapted from Linn, 1998).

 

This effect can be expected with any new testing program, including state high-stakes assessments.  This trend is important for state programs that seek to revise standards every few years.  It would be unusual not to see a drop in test scores the first year or two after new standards, and new tests based on those standards, are implemented.  In Virginia, for example, as new standards are adopted in social studies and history, it would be expected that tests covering these new standards would show lower pass rates than the current pass rates.

Similarly, test scores will fluctuate if the nature of the population being tested changes substantially from one year to the next because of policy changes or unintended consequences, resulting in a more or less capable group in any specific year.  For example, the inclusion or exclusion of students with disabilities, changing grade promotion policies, or increasing dropout rates could change the population being tested.

Sampling From Domains

Standards-based assessments are classified as domain-referenced tests.  This means that the test items used in any given year are selected to represent a domain of knowledge.  The domain is defined by goals, objectives, or standards, and usually a single domain represents content and skills covered in several lessons or even years of curriculum.  Typically, a small number of items are used to represent an entire domain.  This means that not all objectives or standards can be assessed in any single administration of the test.  One year an item covering one standard could be included, but the next year there may be no items from that standard.    This principle of domain-referenced test development is illustrated in Figure 2 with the 5th grade Virginia SOL test in Social Studies:

Figure 2.  Illustration of Domain-Referenced Test Items

This sampling process complicates year to year comparisons because, while the same general goals are tested, specific standards or objectives are likely to have a different emphasis from one year to the next.  In other words, there will be fluctuation in test scores from year to year simply on the basis of how the items were selected to represent the larger domains of knowledge.  This suggests that only changes that are consistent for several years should be interpreted to mean that an actual improvement in student performance has been demonstrated.

How Much Change Is Enough?

A continuing policy issue when school progress is evaluated is knowing how much of a change is needed to make valid conclusions that use value labels such as “adequate,” “inadequate,” and “exemplary.”  How much change should there be from year to year to suggest that a truly significant improvement has occurred?  This is a difficult question to answer because “significant” and “adequate” can be defined in different ways.  Suppose very small increases are noted from one year to the next.  It could be as small as moving from 57.8% passing to 58.2% passing.  Some may interpret this small increase as a positive conclusion, such as:  “adequate school progress is indicated by the test scores.”  Others interpret such a small increase as potentially being caused by chance fluctuation or changing demographics.

What about schools that have very high scores one year and slightly lower scores the next year?  Is it reasonable to conclude that the school is not making adequate progress because of this small decline?  In fact, because of what is called a ceiling effect (scoring at the top) it is usually much more difficult for initially high-scoring schools to show improvement.  Because of chance fluctuations, initially high scores may lessen slightly in subsequent years.  Likewise, initially very low scores are likely to increase because of chance fluctuation.  To provide for improvement for all students, test items need to be fairly difficult.  This leads to some confusion since the percentage correct that is typically used in classroom grading (e.g., 94% for A, 88% for B) cannot be duplicated in high-stakes tests because of the lack of opportunity for high-scoring students to show improvement.  Thus, most high-stakes tests have difficult items, even for the best students.  This means that even though the percentage correct may seem “low” for a school as a whole (e.g., 70%) if classroom grading standards are used, it may represent adequate knowledge of the standards being tested.

One approach to determining whether the amount of change is adequate is to use measures of substantive or practical significance.  One such indicator of practical significance that has become popular is effect size or magnitude-of-effect.  Effect size is a number that provides guidance when making conclusions about the substantive significance of differences.  It is calculated by dividing the difference observed across years by a measure of the variability of the scores.  This results in a number that is typically between 0 and 2.  The current convention is to evaluate an effect size of .3 or higher as important or substantive, while effect size less than .3 tends to be minimal.
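The calculation described above can be sketched in a few lines.  The paragraph specifies only “a measure of the variability of the scores”; the sketch below assumes a pooled standard deviation of the two years' scores, which is one common choice but not necessarily the one used for the Virginia data.  The scores themselves are hypothetical.

```python
import math
from statistics import mean, variance


def effect_size(earlier_scores, later_scores):
    """Difference in means divided by the pooled standard deviation.

    Assumption: pooled SD is used as the "measure of variability";
    other choices (e.g., the earlier year's SD alone) are also possible.
    """
    n1, n2 = len(earlier_scores), len(later_scores)
    pooled_variance = (
        (n1 - 1) * variance(earlier_scores) + (n2 - 1) * variance(later_scores)
    ) / (n1 + n2 - 2)
    return (mean(later_scores) - mean(earlier_scores)) / math.sqrt(pooled_variance)


# Hypothetical scale scores for two consecutive years.
year_one = [390, 400, 410, 420, 430]
year_two = [400, 410, 420, 430, 440]

d = effect_size(year_one, year_two)
print(f"effect size = {d:.2f}; substantive by the .3 convention: {d >= 0.3}")
```

In this made-up example a 10-point rise in the mean against a standard deviation of about 16 points gives an effect size of about 0.63, which would count as substantive under the .3 convention.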

Effect size is illustrated in Figure 3, which uses Virginia SOL test scale scores as an example.  Here the calculated effect size is 1.  The figure shows how much change is needed to obtain this effect size, which would be evaluated as meaningful or significant progress for a single student.  The amount of change a school would need to demonstrate to obtain a significant effect size index would be much smaller.

Figure 3. Effect Size Calculation for Individual Student Scores on the Virginia SOL Algebra I Test.

Which Scores Are Used?

Graphs used to represent the results of the test may suggest different conclusions about improvement, depending on which scores are used and the nature of the scale.  In Virginia, many policy-makers focus on the percentage of students passing.  Another index of test score performance is the average scale score achieved each year.  The average takes into account the spread of scores, while percentage of students passing only takes into account two values, “at or above” and “below.”  These indicators are graphed in Figures 4 and 5 for Virginia SOL test data from 1998 to 2000 for grade 5 reading and mathematics.

Both of these graphs show actual change over three years. Which provides the clearest, most accurate indication of school progress?  The graphs vary somewhat, which could lead to different conclusions.  Figure 4 suggests that improvement for math has been significant, with little progress in English.  Figure 5 looks as though there was little progress for math and English.  The most appropriate way of presenting the results would be the format that communicates conclusions that can be verified by other data, though this is not usually done.
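A small made-up example shows how the two indicators can diverge.  The scores and the passing cut score of 400 below are hypothetical and chosen only to make the contrast visible; they are not the data behind Figures 4 and 5.

```python
from statistics import mean


def percent_passing(scores, cut_score):
    """Percentage of students scoring at or above the cut score."""
    return 100 * sum(score >= cut_score for score in scores) / len(scores)


CUT_SCORE = 400  # hypothetical passing score

# Hypothetical scale scores for the same grade in two consecutive years.
year_one = [380, 390, 395, 450, 500]
year_two = [400, 401, 402, 440, 470]

for label, scores in (("Year 1", year_one), ("Year 2", year_two)):
    print(f"{label}: {percent_passing(scores, CUT_SCORE):.0f}% passing, "
          f"mean scale score {mean(scores):.1f}")
```

Here the pass rate climbs from 40% to 100% because three students moved just across the cut score, while the mean scale score is essentially flat (423.0 versus 422.6).  A graph of the first indicator would suggest dramatic progress; a graph of the second would suggest none.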

Progress also looks different if scores of various subgroups of students are examined in addition to scores of all students together.  The subgroups could be defined by school, by teacher, by overall level of performance (e.g., students in the bottom quartile the first year), by race, and by gender.  This allows more specific evaluations that can target resources for improvement.  It is likely, in fact, that relative progress over several years will not be the same for all subgroups of students.  The more disaggregated the data, the better they support an accurate understanding of all aspects of school progress.

Compare Progress With Other Indicators

One procedure to help ensure valid conclusions about student progress is to match progress scores with other indicators of student achievement.  These other indicators could be division tests, placement tests such as the SAT and ACT, international tests, and National Assessment of Educational Progress (NAEP) tests.  Additionally, more informal indicators may also be useful.  Teacher perceptions, student performance on teacher-made tests, and grades students receive can also provide a valuable check.  When different measures point to the same trend, the progress reported is well supported.  One caution with using additional indicators is that it is essential to know the degree of overlap between what is being measured.  Other standardized tests will have the best overlap with state standards-based tests, while teacher perceptions, grades, and international tests will not be aligned nearly as well.  It is also important to include some estimate of student motivation, which is a key element in demonstrating highest or best improvement.  Some national and international testing programs are not taken seriously by students, particularly students in higher grade levels.

 

Related Issues

How progress is reported has a direct bearing on school report cards and the consequences of high-stakes testing.  The challenge for school report cards is to strike a balance between presenting sets of scores that show improvement over time and scores on a sufficient number of indicators to provide a balanced interpretation of school performance.  For example, a school report card may include only high-stakes testing results, but show several years of progress.  While this may indicate improvement on student performance on the tests, a complete evaluation would need to consider additional indicators, such as dropout rate, graduation rate, attendance, and percentage of teachers with appropriate credentials.

The consequences of high-stakes testing are directly affected by the approach taken to signify progress.  Decisions concerning which scores to report, and whether and how student background characteristics are accounted for, will impact overall determinations of school quality.  Clearly, judgments about teachers or administrators will depend on the approach taken to indicate whether progress is being made.  One of the advantages of stressing progress over time, i.e., several years, is that the approach recognizes that change does not occur swiftly, especially for veteran teachers.  There is a reasonable assumption when this approach is used that it takes time, and judgments based on only a year or two provide only a tentative indication of improvements in teaching.

 

CEPI Summary

Tracking school progress with meaningful data, and deriving valid conclusions, is a process that requires attention to some technical issues related to test score changes and test development, as well as to values and perspectives about what constitutes “progress.”  Policy-makers will make more accurate conclusions by following principles that have been demonstrated to be helpful:

  • Consider using both current status scores and some type of value-added index.  This will focus attention on improvement over time by taking into account input factors that the school has little or no control over.  Input changes can be charted along with test score data.  It may be beneficial to use status score results to trigger value-added scores (e.g., if status scores are low, then look at trends over time or value-added scores).
  • Understand that progress on domain-referenced tests may be incremental due to changes each year in what is sampled to be assessed.
  • Consider magnitude of effect and ignore very small changes from year to year.  Test score error and changes in samples from year to year will affect score variation.  Place most emphasis on relatively large and sustained changes, and report the margin of error, including, if possible, the probability that a school has been misclassified.
  • Use several years of progress to draw conclusions about whether students are learning more and whether teachers or specific programs have been effective.  It takes time for change in curriculum and teaching to have an effect on student achievement.
  • Validate progress trends with data from other indicators of student achievement such as a nationally normed achievement test, NAEP, or placement tests such as the SAT, Advanced Placement tests, and the ACT.
  • Disaggregate data as much as possible to make better links between progress and instructional factors responsible for the changes.
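The recommendation to report a margin of error and a misclassification probability can be illustrated with a rough calculation.  The sketch assumes a simple binomial model with a normal approximation, which ignores measurement error in the test itself and changes in who is tested; it is one possible way to frame the question, not a method prescribed by any state program.  The benchmark of 70%, the school's underlying rate of 72%, and the cohort sizes are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def margin_of_error(pass_rate: float, cohort_size: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed pass rate."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / cohort_size)


def prob_observed_below(true_rate: float, benchmark: float, cohort_size: int) -> float:
    """Chance that a school whose underlying pass rate is `true_rate`
    is observed below `benchmark` in a given year (normal approximation)."""
    se = math.sqrt(true_rate * (1.0 - true_rate) / cohort_size)
    return normal_cdf((benchmark - true_rate) / se)


# Hypothetical school: underlying pass rate 72%, benchmark 70%.
for cohort_size in (100, 400):
    moe = margin_of_error(0.72, cohort_size)
    risk = prob_observed_below(0.72, 0.70, cohort_size)
    print(f"n={cohort_size}: margin of error +/- {100 * moe:.1f} points, "
          f"chance of appearing below benchmark {100 * risk:.0f}%")
```

Under these assumptions, a school of 100 tested students that is truly just above the benchmark would still appear to fall short in roughly one year out of three, which is why sustained changes deserve more weight than any single year's classification.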

 

Sources, Cites, Links

Sources

A better balance:  Standards, tests, and the tools to succeed. (2001). Education Week, 20(17)

A closer look:  State policy trends in three key areas of the Bush education plan -- testing, accountability and school choice. (2001).  Denver:  Education Commission of the States.

AERA position statement concerning high-stakes testing in prek-12 education. (2000). www.aera.net.about.policy.stakes

Haney, W. (2000).  The myth of the Texas miracle in education.  Education Policy Analysis Archives, 8 (41).

Klein, S. P., & Hamilton, L. (1999).  Large-scale testing:  Current practices and new directions.  Santa Monica, CA: RAND Education.

Linn, R. L. (1998).  Assessments and accountability.  CSE Technical Report 490.  Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing.

Linn, R. L. (2001).  Reporting school quality in standards-based accountability systems.  CRESST Policy Brief 3.  Los Angeles:  National Center for Research on Evaluation, Standards, and Student Testing.

Linn, R. L., & Baker, E. L. (1999).  Standard-Based Accountability Systems’ Adequate Yearly Progress:  Absolutes, Wishful Thinking, and Norms.  The CRESST Line.  Los Angeles:  National Center for Research on Evaluation, Standards, and Student Testing.

Linn, R. L., & Herman, J. L. (1997).  A policymaker’s guide to standards-led assessment.  Denver, CO:  Education Commission of the States.

Standards for educational and psychological testing (3rd Ed.). (2000).  Washington, DC: American Educational Research Association.

Wheat, D. (2000).  Value-Added Accountability:  A Systems Solution to the School Accreditation Problem.  Springfield, VA:  The Thomas Jefferson Institute for Public Policy.

Organizations

Achieve, Inc., web site: www.Achieve.org

American Educational Research Association, web site:  www.aera.net

CCSSO State Collaborative on Assessment and Student Standards (SCASS).  web site: www.ccsso.org.

National Center for Research on Evaluation, Standards, and Student Testing (CRESST).  web site:  www.cse.ucla.edu

Education Commission of the States, web site: www.ecs.org

FairTest (National Center for Fair & Open Testing).  web site: www.fairtest.org

Fordham Foundation, web site:  www.edexcellence.net

National Council on Measurement in Education, web site: www.ncme.org

Rand Corporation.  web site:  www.Rand.Org

 

E-mail Response

Click cepi@vcu.edu to provide comments or additional information. Please indicate in an e-mail the copyright source and contact information for new inclusions.


Copyright © CEPI 2000
CEPI grants permission to reproduce this paper for noncommercial purposes if CEPI is credited.

 

 
