As more states implement writing assessments as part of their standardized testing, there appear to be some problems with the scoring of these assessments. Since students may be required to pass a writing exam in order to graduate, reliability and consistency of scoring are essential.
A recent study conducted in Virginia revealed that the type of training scorers receive affects test results, specifically the percentage of students who receive passing grades. Because the usefulness of a performance assessment relies on the comparability of scores across years, researchers sought to determine what affects these tests’ reliability.
Two different scorer-training models were studied. Results reveal that writing scores vary according to the method of training the scorers received.
According to Tonya R. Moon, University of Virginia, and Kevin R. Hughes, Albemarle County (Virginia) Public Schools, past research indicates that training scorers to search for those aspects of work that are most indicative of good quality increases the consistency of performance scores.
Research also shows that training all the scorers at the same time and having them rate student essays during the same scoring session can achieve better agreement between two sets of scorers on the same essay. The purpose of the current study was to determine whether other variables in training methods affect scores.
Moon and Hughes compared two training sequences: in the first, teachers score several responses to a single writing prompt (the short paragraph that describes the topic to write about); in the second, they score responses to different writing prompts in the same training session.
Data was collected in a statewide writing assessment of sixth-grade students. Ten different writing prompts were randomly distributed to students in the study. About 350 students responded to each prompt. Their essays were scored in five areas: composition, style, sentence formation, usage, and mechanics. The scorers were randomly assigned to one of the two training models.
Training for both groups was conducted over a two-day period, and for every prompt there were sample papers that exemplified each score point on the rating scale. Each scoring area was explained using a standardized script illustrated with samples of student responses. Scorers then scored sample papers for a variety of prompts. To qualify to score assessments during the study, a scorer's ratings had to be in exact agreement with the established score at least 60 percent of the time and within one point of the established score at least 95 percent of the time.
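The qualifying rule above is a simple pair of agreement rates. A minimal sketch of that check, assuming a numeric rating scale and comparing a trainee's practice ratings against the established scores for the same papers (the function name and the example data are illustrative, not from the study):

```python
def qualifies(trainee_scores, established_scores,
              exact_min=0.60, adjacent_min=0.95):
    """Return True if the trainee meets both agreement thresholds
    described in the study: exact agreement at least 60% of the time,
    and agreement within one point at least 95% of the time."""
    assert len(trainee_scores) == len(established_scores)
    n = len(trainee_scores)
    pairs = list(zip(trainee_scores, established_scores))
    exact = sum(t == e for t, e in pairs) / n        # share of exact matches
    adjacent = sum(abs(t - e) <= 1 for t, e in pairs) / n  # within one point
    return exact >= exact_min and adjacent >= adjacent_min

# Hypothetical example: 8 of 10 ratings match exactly (80%),
# and all 10 are within one point (100%), so the trainee qualifies.
trainee     = [3, 2, 4, 3, 1, 2, 4, 3, 2, 4]
established = [3, 2, 4, 2, 1, 2, 4, 3, 3, 4]
print(qualifies(trainee, established))  # True
```

Note that both thresholds must be met: a scorer who is often within one point but rarely exact still fails the 60 percent criterion.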
Training scorers to score multiple writing prompts
Following the initial explanation of scoring and study of sample essays, the training for the two groups diverged. The first group was trained sequentially on a single prompt at a time and then practiced scoring actual student essays created from that prompt. They proceeded sequentially through all 10 writing prompts.
The second group was trained on all the prompts together and then practiced scoring a mixed set of student essays drawn from all 10 prompts. This second training method appeared to be more cognitively complex because it required scorers to reorient themselves to a different scoring scale with each essay. It took longer for scorers in this second group to meet the qualifying criteria than it did for the sequentially trained scorers.
When data from these two groups was analyzed, the percentage of essays receiving failing grades was calculated for each group of trainees. In addition, differences in the scores students received were examined by writing prompt and by training method. Results revealed that scorers trained to score essays from different prompts at the same time produced consistently higher and more accurate scores than those trained on one prompt at a time.
Moon and Hughes speculate that having to score different writing prompts at the same time forced scorers to refer continually to the standards for each prompt, so their scores were better grounded in the scoring scale. They also believed that because the scorers were working with different prompts, their attention to the task was greater. By contrast, teachers who practiced scoring one type of essay at a time were more likely to be influenced by the other student essays they had just scored on the same topic than by the scoring standards for that prompt.
The results of this study indicate that the cognitively more complex training, in which scorers rate essays from different prompts, produces more accurate scores. Moon and Hughes recommend that educators use caution when evaluating changes in students' writing scores from year to year. They conclude that issues of scorer training and the interpretation of assessment scores are more complex than originally thought. Since the type of training affects scoring, these researchers warn educators that changing training methods could result in scores that are not comparable across years.
“Training and Scoring Issues Involved in Large-Scale Writing Assessments,” Educational Measurement: Issues and Practice, Volume 21, Number 2, June 2002, pp. 15–19.
Published in ERN, November 2002, Volume 15, Number 8.