The Vermont Portfolio Assessment Program is one of only a few large-scale performance assessments in the United States. Daniel Koretz, Brian Stecher, Stephen Klein and Daniel McCaffrey, researchers at the RAND Institute on Education and Training, have evaluated the experiences of educators in the program as well as the achievement data it has generated.
Beginning in 1991-1992, portfolios for math and writing were made mandatory in grades four and eight throughout the state. The program has two purposes: to improve instruction and to provide high-quality data about student achievement that permits comparisons of schools and districts.
The Vermont program was designed to allow local schools and districts to make curricular decisions while “encouraging a high common standard of achievement for all students.” These goals, according to Koretz et al., exceed current assessment technologies and therefore require a long period of development.
How it works
The children’s portfolios are created throughout the year and graded by classroom teachers. Teachers are free to select the work they wish to include in portfolios. Committees consisting mostly of teachers have primary responsibility for developing operational plans, suggesting guidelines for teacher and student choices and designing the rubrics that are used to score the portfolios.
Portfolios are used in conjunction with standardized “uniform tests.” For writing, the uniform test consists of a single, standardized prompt. The writing portfolio consists of a single best piece and a number of writing samples of prescribed types. In mathematics, the uniform test consists mainly of multiple-choice problems. For the mathematics portfolio, teachers and students choose five to seven of the best papers for scoring.
Teachers initially score portfolios for all their own students. However, portfolios used for state comparisons are re-scored externally in regional meetings by teachers other than the students’ own. Because of financial constraints, only a small portion of the portfolios were re-scored for state reporting purposes in the first two years.
Feedback from schools
Teachers and principals were interviewed in a random sample of 80 schools. Overall, they agree that the program is a “worthwhile burden.” They report that it demands a lot of time and resources and imposes considerable stress. For example, excluding training, math teachers report spending about 30 hours a month working on portfolios.
More than half the teachers found it difficult to choose what to include in the portfolios. Many had misgivings about the way portfolio scores would be used, the rapid implementation of the program and the inadequate and inconsistent information provided by the state.
On the other hand, teachers generally found the program to have a strong, positive influence on instruction. Mathematics teachers devoted more class time to problem solving and communication as well as to applying knowledge to new situations. Children made charts, graphs and diagrams and wrote reports about mathematics. They spent more time working in pairs and small groups. In English classes, students spent considerable time writing in a variety of genres. Half the teachers interviewed report that their schools had expanded the use of portfolios beyond the mandates of the state program.
The amount of revision teachers allow for writing samples varies widely. Sixty-five percent of fourth-grade teachers and 44 percent of eighth-grade teachers reported placing limits on the amount and type of adult assistance students may receive on the writing pieces for their portfolios. However, one-quarter of the teachers reported that portfolio pieces were typically not revised at all. Such differences greatly affect the quality of the products as well as the grades students receive.
Quality of achievement data
Koretz et al. found the reliability of the scoring by teachers to be very low in both subjects, although considerable improvement was seen in mathematics scoring in the second year. Disagreement among scorers alone accounts for much of the variance in scores and therefore invalidates any comparisons of scores.
Scores on the individual pieces within each writing portfolio were consistent, because there was usually little difference between the “best piece” and other writing samples. There was far less consistency in math portfolios, however, indicating that many samples of math performance will be needed to obtain a valid measure of a student’s achievement.
Also, the freedom teachers have in choosing what to include in portfolios, as well as the wide variation in the amount of revision and assistance students are allowed, adds greatly to the variability in performance between classes. Although this is consistent with the instructional goals of the program, researchers fear that the wide variation in teachers’ use of portfolios threatens the validity of comparisons based on these scores.
Careful monitoring is key
The Vermont experience demonstrates the need to set realistic expectations for the short-term success of performance-assessment programs and to acknowledge the large costs of these programs. The Vermont experience also shows the necessity for careful monitoring, not only of the implementation and impact of such programs, but also of the data they yield.
Koretz et al. conclude that the Vermont portfolio program has been largely unsuccessful in meeting its goal of providing high-quality data about student performance. The writing assessment, in particular, is still hampered by unreliable scoring. This, they speculate, is at least partly due to the fact that genres were not well defined and specific scoring rubrics were not designed for each type of writing, preventing teachers from interpreting or applying the rubrics in a uniform manner.
These problems led researchers to recommend that Vermont publish only statewide average total scores for 1992. Because of greater reliability among scorers on math portfolios in 1993, the state was justified in reporting average mathematics scores for supervisory districts. However, these math scores were not judged to be conclusive because the number of students for whom external scoring was available in each supervisory district was small.
Although it has been suggested that educators must be willing to accept lower levels of reliability as a price for using performance assessments, Koretz et al. warn that it is essential to understand the ramifications of lower reliability for published scores.
Interviews and questionnaires suggest that Vermont was much more effective in meeting its goal of improving instruction. Although there are no firm baseline data, the program appeared to be a powerful catalyst for encouraging teachers to change their teaching in positive ways. Nevertheless, despite extensive training with sample tasks, the program has not been fully successful in fostering a consistent understanding of what constitutes outstanding instruction. Teachers understood, for example, that they should put more emphasis on problem solving in math and on writing in English, but it was more difficult to communicate how to incorporate particular skills in the curriculum. And again, despite its positive effects, the program has been slowed by the time and money needed for its implementation.
Koretz et al. believe that the tension between instructional and measurement goals is fundamental and cannot be fully resolved by refinements in design. Greater standardization in tasks, revision rules and test preparation can increase reliability and validity. But standardization is contrary to the instructional goals of performance assessments because it isolates assessment tasks from ongoing instruction and thus decreases teachers’ incentives to reform instruction on a daily basis.
“The Vermont Portfolio Assessment Program: Findings and Implications,” Educational Measurement: Issues and Practice, Volume 13, Number 3, Fall 1994, pp. 5-16.
Published in ERN, November/December 1994, Volume 7, Number 5.