Olatokunbo S. Fashola, a senior scientist at the Center for Social Organization of Schools at Johns Hopkins University and the American Institutes for Research, recommends that educators study their own institution’s needs and goals as well as existing research when making decisions about educational reforms. It is not enough, writes Fashola, to identify educational programs that have been effective in other schools. Educators must determine whether a given program will meet the particular needs of their own school.
To use research effectively, educators must understand their capacity, their needs and goals, and any barriers to attaining those goals. Educators should select programs that have demonstrated effectiveness with students with similar needs and in similar contexts. They need to consider such factors as the turnover rate of teachers and students, students’ socioeconomic status and other demographic characteristics, and the motivation and capacity of their staff to implement reforms effectively. After diagnosing their needs, educators should evaluate potential programs to determine whether they have been widely used and effectively replicated and whether they fit the identified needs.
Requirements of Research Design
To ensure that a program’s results can be replicated in various school settings, evaluation must be rigorous and robust. Rigor refers to the extent to which the evaluation adheres to strict standards of research methodology. Robustness refers to the degree to which changes in outcomes are due to the program rather than to other factors such as student or school characteristics. The ideal evaluation uses a true experimental design: Students or schools are randomly assigned to treatment and control groups, and both groups are tested with identical assessments before and after the treatment. It is not always possible to achieve this ideal because of financial, time or ethical constraints. The alternative is a quasi-experimental design in which students are assigned to control and treatment groups and are matched on characteristics such as race, gender, socioeconomic status, grade, standardized test scores and English-language proficiency. The idea is that if students in the treatment and control groups are matched on these characteristics, any difference in outcomes can more confidently be attributed to the program rather than to those other factors.
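The matching step described above can be sketched in code. This is an illustrative toy, not the article’s procedure; the field names and student records are entirely hypothetical:

```python
# Sketch of quasi-experimental matching: pair each treatment student
# with an unused control student who shares every matching
# characteristic. All data below is invented for illustration.

def match_key(student):
    """The characteristics on which students are matched."""
    return (student["race"], student["gender"], student["ses"],
            student["grade"], student["english_proficient"])

def pair_groups(treatment, control):
    """Greedily pair treatment students with matching control students."""
    pairs, used = [], set()
    for t in treatment:
        for i, c in enumerate(control):
            if i not in used and match_key(t) == match_key(c):
                pairs.append((t["id"], c["id"]))
                used.add(i)
                break
    return pairs

treatment = [{"id": "T1", "race": "B", "gender": "F", "ses": "low",
              "grade": 5, "english_proficient": True}]
control = [{"id": "C1", "race": "B", "gender": "F", "ses": "low",
            "grade": 5, "english_proficient": True},
           {"id": "C2", "race": "W", "gender": "M", "ses": "high",
            "grade": 5, "english_proficient": True}]

print(pair_groups(treatment, control))  # [('T1', 'C1')]
```

Real evaluations use more sophisticated techniques (for example, matching on propensity scores rather than exact characteristics), but the principle is the same: comparison students should resemble treated students on every measured characteristic.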
Also, there should be evidence of fairness. Measures of effectiveness should be externally developed (not by the program’s designers), independent, reliable and valid. Standardized tests are considered the most objective measures of student achievement and are the most frequently used. A program’s ability to repeatedly improve its participants’ scores on standardized tests that meet these criteria indicates that the program is effective and replicable. To be unbiased, the measures used must be based on an objective set of goals rather than on treatment-specific goals determined by the program’s creators. Standardized tests vary in their level of difficulty; to produce reliable results, they must be of the same level of difficulty as those administered by the school. It is important that both pre- and posttest data are available on the two groups to ensure that they were not significantly different prior to treatment, especially if students were not randomly assigned to treatment groups. Additionally, pre- and posttest information must be based on similar or identical measures so that any difference between pre- and posttest scores can be attributed to the program and not to differences in the tests themselves.
Educators should examine the characteristics of the evaluation sites to determine whether the research sites faced the same kinds of problems that their own school is experiencing. The research sites’ characteristics should be similar to those of schools considering adopting the program. Student populations should be similar in terms of turnover, race, gender and socioeconomic levels.
Information about pre- and posttest scores and the gain scores for both groups tells educators where the groups started, what progress they made, and whether there are significant differences between the two groups’ achievement that may be due to the treatment. However, different types of tests produce scores that tell us different things. For example, norm-referenced standardized test scores measure the progress of both groups in comparison to other students in the nation. Many schools adopt reform programs to improve their average test scores to at least the 50th percentile — meaning that half the nation’s students scored higher and half scored lower than those in their school. Fashola stresses that it takes time to raise the average scores of an entire school or district, and notes that some remedial programs are designed to raise students to a functional level, but not necessarily to competence. Therefore, it may be appropriate for a school to begin reform efforts with a remedial program and then replace it with another curricular program once students’ test scores demonstrate that they have reached functional levels.
Fashola also reminds educators that a particular score on one test does not necessarily translate into the same score on another. It is possible to standardize scores from different tests by turning them into normal-curve equivalents. With this technique, the results of different tests become uniformly interpretable.
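The conversion described above can be sketched as follows. The definition of a normal-curve equivalent (a scale with mean 50 and standard deviation 21.06, chosen so that NCEs of 1, 50 and 99 coincide with the same percentile ranks) is standard, but the code itself is only an illustration:

```python
from statistics import NormalDist

def percentile_to_nce(percentile):
    """Convert a national percentile rank (between 1 and 99) to a
    normal-curve equivalent: 50 plus 21.06 times the normal z score
    corresponding to that percentile."""
    z = NormalDist().inv_cdf(percentile / 100)
    return 50 + 21.06 * z

print(round(percentile_to_nce(50)))  # 50
print(round(percentile_to_nce(90), 1))
```

Unlike percentile ranks, NCEs form an equal-interval scale, so gains from different standardized tests can be averaged and compared directly.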
Another consideration for consumers of research is statistical significance, or the extent to which differences are a result of chance rather than treatment. Statisticians use levels of significance to distinguish between the two possibilities. These can be expressed as p < .05 or p < .01, meaning there is less than a 5 percent (or 1 percent) probability that the observed difference is due to chance rather than to the treatment.
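As a rough sketch of how such a comparison might be computed (the scores below are hypothetical, and the article itself presents no code), a two-sample t statistic can be calculated with the Python standard library:

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(group_a, group_b):
    """Welch's t statistic for two independent samples:
    the difference in means over the combined standard error."""
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    se = sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical posttest scores for matched treatment and control groups.
treatment_scores = [55, 61, 58, 64, 60, 59, 63, 62]
control_scores = [50, 52, 49, 55, 51, 53, 48, 54]

t = t_statistic(treatment_scores, control_scores)
# For large samples, |t| > 1.96 corresponds roughly to p < .05.
print(round(t, 2))
```

A real evaluation would use the exact degrees of freedom of the t distribution rather than the 1.96 large-sample cutoff, but the logic is the same: the larger the statistic, the less plausible it is that the difference arose by chance.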
Another important measure of a program’s success is the effect size. An effect size measures the magnitude of the difference in scores between the two groups being compared, expressed in standard-deviation units, which allows gains and losses to be compared across studies and across different tests. An effect size of 1.00 indicates that the experimental group outperformed the control group by one full standard deviation. To give a sense of this difference, it is roughly equal to 100 points on the SAT scale or 15 points of IQ: enough, for example, to move a student from the 16th percentile to the 50th. In statistics, +.30 is considered a small effect size, +.50 a medium effect size, and +.80 a large effect size.
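A minimal sketch of the underlying arithmetic, using invented scores rather than anything from the article, is Cohen’s d: the difference in group means divided by the pooled standard deviation of the two groups.

```python
from math import sqrt
from statistics import mean, stdev

def effect_size(treatment, control):
    """Cohen's d: difference in means divided by the pooled
    standard deviation of the two groups."""
    nt, nc = len(treatment), len(control)
    pooled_sd = sqrt(((nt - 1) * stdev(treatment) ** 2 +
                      (nc - 1) * stdev(control) ** 2) / (nt + nc - 2))
    return (mean(treatment) - mean(control)) / pooled_sd

# Hypothetical posttest scores on a 0-100 scale.
treatment_scores = [72, 75, 70, 78, 74, 76]
control_scores = [65, 68, 64, 70, 66, 67]

print(round(effect_size(treatment_scores, control_scores), 2))
```

Because d is expressed in standard-deviation units rather than raw points, an effect of +.50 means the same thing whether the test is scored out of 100 or 800, which is what makes effect sizes comparable across studies.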
Fashola concludes that for a program to show effectiveness, it must have comparison groups, use fair measures, show significant differences in test scores between groups, and be able to produce significant effect sizes. Several factors contribute to the likelihood that a program can meet these criteria. Programs that are designed to achieve a set of clearly defined goals and that develop specialized, research-based methods to achieve their goals and assess students’ progress on a regular basis are more likely to prove their effectiveness. However, no program will succeed if it is not implemented correctly. This is why program developers create manuals, books and other curricular materials to guide use in the classroom. If educators adopt a program, they must implement it thoroughly if they expect to achieve the same results as demonstrated in the research.
Training and technical assistance are essential components of school-wide reform programs. Extensive, ongoing professional training is often necessary for reform programs to be successfully implemented; one-day workshops are not enough. In particular, peer coaching, role-playing and modeling are effective techniques for developing the skills necessary to implement reforms effectively. Trainers realize that schools have a variety of needs, and they are willing to make appropriate modifications to training as long as doing so does not change the program provided to students. Any reform must have sufficient support from all the stakeholders — administrators, teachers and parents. All must be involved from the beginning in the selection and evaluation of programs. Many reform programs require the support of 80 percent of the staff before working with a school. Surveying teachers by secret ballot is the most common and accurate measure of support.
“Being an Informed Consumer of Quantitative Educational Research,” Phi Delta Kappan, Volume 85, Number 7, March 2004, pp. 532-538.
Published in ERN April 2004 Volume 17 Number 4