Wednesday, June 30, 2010

Abandoning the Pejorative

Psychometricians often speak of error in educational and psychological measurement. But error seems like such a pejorative term. According to the Merriam-Webster online dictionary, an error is "an act involving an unintentional deviation from truth or accuracy." In ordinary language, error carries unflattering connotations and invites alarming inferences.

Under classical test theory, error is the difference between observed student performance and an underlying "true score." But who has special insight into what is true? As Nichols, Twing, Mueller, and O'Malley point out in their recent article on standard setting,* human judgment is involved in every step of test development. Typically, experts construct a description of what we are trying to test in the form of a construct or content framework. But the experts routinely disagree amongst themselves and certainly disagree with outside experts. The same expert may even later disagree with a description he or she constructed earlier in his or her career! So what psychometricians call error is just the difference between how one group of experts expects students to perform and how students actually perform.
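
In the standard notation of classical test theory (my shorthand here; the post does not spell out the model), the decomposition is simply

X = T + E,

where X is the observed score, T is the underlying true score, and E is the error. The argument above amounts to saying that T is whatever the experts' description leads us to expect, so E is nothing more than the gap between that expectation and the performance we actually observe.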

Psychometricians might even be able to control the amount of "error" by careful selection of what they declare to be true. After all, the size of the difference between the experts' description and student performance, or error, depends on which experts you listen to, or when you catch them. A strategy for decreasing error might be to gather a set of experts' descriptions and declare as "true" the description that shows the smallest difference between how the experts expect students to perform and how students actually perform. Who is to say this group of experts is right and that group of experts is wrong?
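
As a purely illustrative sketch (the scores, panels, and scoring scale below are invented for this post, not drawn from the article), that "shop around for the truth" strategy might look something like this in Python:

# Hypothetical illustration: the size of the "error" depends on which
# experts' description is declared "true." All numbers are invented.

observed = [72, 65, 80, 58, 90]  # how students actually performed

# Each panel of experts implies a different expected performance profile.
expert_panels = {
    "panel_A": [75, 70, 85, 60, 88],
    "panel_B": [70, 64, 79, 59, 91],
    "panel_C": [60, 55, 70, 50, 80],
}

def discrepancy(expected, actual):
    """Mean squared difference between expected and observed scores."""
    return sum((e - a) ** 2 for e, a in zip(expected, actual)) / len(actual)

for panel, expected in expert_panels.items():
    print(panel, round(discrepancy(expected, observed), 1))

# Declare "true" the description closest to what students actually did,
# and the "error" shrinks accordingly.
best = min(expert_panels, key=lambda p: discrepancy(expert_panels[p], observed))
print("smallest 'error' comes from:", best)

The point, of course, is not that anyone should do this; it is that the quantity we call error is only defined relative to whichever description we have privileged as true.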

So let's agree to abandon the pejorative label "error variance." The variance between how one group of experts expects students to perform and how students actually perform might be more appropriately referred to as "unexpected" or "irrelevant." Let's recognize that this variance is irrelevant to the description constructed by the experts, i.e., construct-irrelevant variance. Certainly this unexpected variance threatens the interpretation of student performance in terms of the experts' description. But does the unexpected deserve the pejorative label "error"?

*Nichols, P. D., Twing, J., Mueller, C. D., & O'Malley, K. (2010). Standard setting methods as measurement processes. Educational Measurement: Issues and Practice, 29, 14-24.


Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson

Monday, June 14, 2010

Pearson at CCSSO - June 20-23, 2010

I'm looking forward to seeing folks at the upcoming Council of Chief State School Officers (CCSSO) National Conference on Student Assessment (in Detroit, June 20-23, 2010). Pearson employees will be making the following presentations. We hope to see you there.

Multi-State American Diploma Project Assessment Consortium: The Lessons We’ve Learned and What Lies Ahead
Sunday, June 20, 2010: 1:30 PM-3:00 PM
Marquette (Detroit Marriott at the Renaissance Center)
Shilpi Niyogi

Theory and Research On Item Response Demands: What Makes Items Difficult? Construct-Relevant?
Sunday, June 20, 2010: 1:30 PM-3:00 PM
Mackinac Ballroom West (Detroit Marriott at the Renaissance Center)
Michael J. Young

Multiple Perspectives On Computer Adaptive Testing for K-12 Assessments
Sunday, June 20, 2010: 3:30 PM-5:00 PM
Nicolet (Detroit Marriott at the Renaissance Center)
Denny Way

The Evolution of Assessment: How We Can Use Technology to Fulfill the Promise of RTI
Monday, June 21, 2010: 3:30 PM-4:30 PM
Cadillac B (Detroit Marriott at the Renaissance Center)
Christopher Camacho
Laura Kramer

Comparability: What, Why, When and the Changing Landscape of Computer-Based Testing
Tuesday, June 22, 2010: 8:30 AM-10:00 AM
Duluth (Detroit Marriott at the Renaissance Center)
Kelly Burling

Measuring College Readiness: Validity, Cut Scores, and Looking to the Future
Tuesday, June 22, 2010: 8:30 AM-10:00 AM
LaSalle (Detroit Marriott at the Renaissance Center)
Jon Twing & Denny Way

Best Assessment Practices
Tuesday, June 22, 2010: 10:30 AM-11:30 AM
Mackinac Ballroom West (Detroit Marriott at the Renaissance Center)
Jon Twing

Distributed Rater Training and Scoring
Tuesday, June 22, 2010: 4:00 PM-5:00 PM
Richard (Detroit Marriott at the Renaissance Center)
Laurie Davis, Kath Thomas, Daisy Vickers, Edward W. Wolfe

Identifying Extraneous Threats to Test Validity for Improving Tests and Uses of Tests
Tuesday, June 22, 2010: 4:00 PM-5:00 PM
Joliet (Detroit Marriott at the Renaissance Center)
Allen Lau


--------------------------
Edward W. Wolfe, Ph.D.
Senior Research Scientist

Wednesday, June 09, 2010

Innovative Testing

Innovative testing refers to the use of novel methods to test students in richer ways than can be accomplished using traditional testing approaches. This generally means the use of technology like computers to deliver test questions that require students to watch or listen to multimedia stimuli, manipulate virtual objects in interactive situations, and/or construct rather than select their responses. The goal of innovative testing is to measure students’ knowledge and skills at deeper levels and measure constructs not easily assessed, such as problem solving, critical analysis, and collaboration. This will help us better understand what students have and haven’t learned, and what misconceptions they might hold—and thus support decisions such as those related to accountability as well as what instructional interventions might be appropriate for individual students.

Educational testing has always involved innovative approaches. As hidden properties of students, knowledge and skill are generally impossible to measure directly and very difficult to measure indirectly, often requiring the use of complex validity arguments. However, to the extent that newer technologies may allow us to assess students' knowledge and skills more directly, by asking students to accomplish tasks that more faithfully represent the underlying constructs they're designed to measure, innovative testing holds the promise of more authentic methods of testing based upon simpler validity arguments. And as such, measurement of constructs that sit "higher up" on taxonomies of depth of understanding, such as Bloom's and Webb's, should become more attainable.

Consider assessing a high school student's ability to design an experimental study. Is this the same as his or her ability to identify the one written description of a well-designed experiment among three written descriptions of poorly designed experiments? Certainly there will be a correlation between the two; the question is how strong that correlation is, or, more bluntly, how artificial the context. And further, to what extent is such a correlation a self-fulfilling prophecy, in which students who might be good at thinking and doing science, but not at narrowly defined assessment tasks, are likely to do poorly in school as a result of poor test scores and the compounding impact of negative feedback?

Many will recall the promise of performance assessment in the 1990s to test students more authentically. Performance testing didn't live up to its potential, in large part because of the challenges of standardized administration and accurate scoring. Enter innovative questions: performance testing riding the back of digital technologies and new media. Richer assessment scenarios and opportunities for response can be administered equitably and at scale. Comprehensive student interaction data can be collected and scored by humans in efficient, distributed settings, automatically by computer, or both. In short, the opportunity for both large-scale and small-scale testing of students, using tasks that more closely resemble real-world application of learning standards, is now available.

Without question, creating innovative test questions presents challenges beyond those of simpler, traditional ones. As with any performance task, validity arguments become more complex and reliability of scoring becomes a larger concern. Fortunately, there has been some excellent initial work in this area, including the development of taxonomies and rich descriptions for understanding innovative questions (e.g., Scalise; Zenisky). Most notable are two approaches that directly address validity. The first is evidence-centered design, an approach to creating educational assessments in terms of evidentiary arguments built upon intended constructs. The second is a preliminary set of guidelines for the appropriate use of technology in developing innovative questions through the application of universal design principles, taking into account how students interact with those questions as a function of their perceptual, linguistic, cognitive, motoric, executive, and affective skills and challenges. Approaches such as these are especially important if we are to ensure that the needs of students with disabilities and English language learners are considered from the beginning in designing our tests.

Will innovative questions indeed allow us to test students to greater depths of knowledge and skill than traditional ones, and will they do so in a valid, reliable, and fair manner? Will the purported cost effectiveness be realized? These are all questions that need ongoing research.

As we solve the challenges of implementing innovative questions in technically sound ways, perhaps the most exciting aspect of innovative testing is the opportunity to integrate it with evolving innovative instructional approaches. Is it putting the cart before the horse to focus so much on innovation in assessment before we figure it out in instruction? I believe not. Improvements to instructional and assessment technologies must co-evolve. Our tests must be designed to pick up the types of learning gains our students will be making, especially when we consider 21st century skills, which will increasingly rely on innovative, technology-based learning tools. Moreover, our tests have a direct opportunity to impact instruction: despite all our efforts, "teaching to the test" will occur, so why not have those tests become models of good learning? And even if an emphasis on assessment is the cart, at least the whole jalopy is going the correct way down the road. Speaking of roads, consider the co-evolution of automobiles and improved paving technologies: improvement in one couldn't have progressed without improvement in the other.

Bob Dolan, Ph.D.
Senior Research Scientist

Wednesday, June 02, 2010

Where has gone the luxury of contemplation?

I ran across an Excel spreadsheet from some years ago that I had used to plan my trip to attend the 2004 NCME conference in San Diego. The weather was memorable that year. But I also attended a number of sessions during which interesting papers were presented and discussants and audience members made compelling comments.

My memories of the 2010 NCME conference are different. I am grateful that the weather was pleasant. But mostly I have memories of rushing from one responsibility to the next. I am sure the 2010 NCME conference included interesting papers and compelling commentary, but my memories of them are overshadowed by a sense of haste and a feeling of urgency. This impression was of my own doing. First, I arrived several days late because of my already crowded travel schedule. Second, I participated in the conference in several roles: as presenter, discussant, and co-author.

What I missed this year in Denver was the luxury of contemplation. I missed the luxury of sitting in the audience and reacting to the words and ideas as they rolled from the tongues of the presenters. I missed the luxury of mentally inspecting each comment from the discussants or the audience members and comparing it with my own reactions. I missed the luxury of chewing over the last session with a colleague as we walked through the hotel hallway and maybe grabbed lunch before the next session.

I can and did benefit from attending the NCME conference without the luxury of contemplation. But I missed the pleasure and comfort of indulging in a calm and thoughtful appreciation of the labors of my colleagues. These days we rarely indulge in the luxury of contemplation, and we are often impoverished because of it.

Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson