Performance assessments. Performance-based assessments. Authentic assessments. Constructed response, open-ended, performance tasks. Our field has devised many terms to describe assessments in which examinees demonstrate some type of performance or create some type of product. Whatever you call them, performance-based assessments (PBAs) have a long history in educational measurement with cycles of ups and downs. And once again, PBAs are currently in vogue. Why?
To address the federal government’s requirements for assessment systems that represent “the full performance continuum,” the two consortia formed in response to Race to the Top funding have both publicized assessment plans that involve a heavy dose of performance-based tasks. Thus, PBAs are relevant to any discussion about the future of testing in America.
The old arguments in favor of PBAs are still appealing to today’s educators, parents, and policy-makers. Proponents claim these types of tests are more motivating to students. They provide a model for what teachers should be teaching and students should be learning. They serve as professional development opportunities for teachers involved in developing and scoring them. They constitute complex, extended performances that allow for evaluation of both process and product. Moreover, performance-based tasks provide more direct measures of student abilities than multiple-choice items. They are able to assess students’ knowledge and skills at deeper levels than traditional assessment approaches and are better suited to measuring certain skill types, such as writing and critical thinking. They are more meaningful because they are closer to criterion performances. (Or so the story goes. To be fair, these are all claims requiring empirical validation.)
Despite their recent renaissance, PBAs have well-known limitations: lower reliability and generalizability than selected-response items, primarily because of differences in efficiency between the two task types (one hour of testing time buys you many fewer performance-based tasks than multiple-choice items). But these limitations also arise because PBAs are frequently scored by humans—a process that introduces a certain amount of rater error. In exchange for greater depth of content coverage, PBAs compromise breadth of coverage. Generalizability studies of PBAs have found that significant proportions of measurement error are attributable to task sampling, manifested in both person-by-task interactions and person-by-task-by-occasion interactions (in designs incorporating occasion). Again, this is largely because there are many fewer performance-based tasks on any given test.
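To make the task-sampling point concrete, here is a minimal sketch of the relative generalizability coefficient for a fully crossed person-by-task-by-occasion design. The function name and the variance components below are hypothetical values chosen only for illustration, not estimates from any study; the formula is the standard one in which each error component is divided by the number of conditions sampled.

```python
def g_coefficient(var_p, var_pt, var_po, var_pto_e, n_tasks, n_occasions):
    """Relative G coefficient for a crossed person x task x occasion design.

    var_p      : person (universe-score) variance
    var_pt     : person-by-task interaction variance
    var_po     : person-by-occasion interaction variance
    var_pto_e  : person-by-task-by-occasion interaction, confounded with residual error
    """
    # Each error component shrinks as more tasks and/or occasions are sampled.
    relative_error = (
        var_pt / n_tasks
        + var_po / n_occasions
        + var_pto_e / (n_tasks * n_occasions)
    )
    return var_p / (var_p + relative_error)


if __name__ == "__main__":
    # Hypothetical variance components (illustrative only).
    components = dict(var_p=0.30, var_pt=0.40, var_po=0.05, var_pto_e=0.25)
    for n_tasks in (1, 2, 5, 10, 40):
        coef = g_coefficient(**components, n_tasks=n_tasks, n_occasions=1)
        print(f"{n_tasks:>2} task(s): G = {coef:.2f}")
```

Under these assumed components, the coefficient climbs as tasks are added, which is why a test that can fit only a handful of performance tasks into an hour starts at a disadvantage relative to one that can fit dozens of multiple-choice items.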
PBAs are used in a variety of contexts, including summative, high-stakes contexts, such as certification and licensure, as well as employment and educational selection. PBAs are also used for formative or instructional purposes. When well-designed PBAs are administered and scored in the classroom, they can provide information for diagnosing student misconceptions, evaluating the success of instruction, and planning for differentiation because of their instructional alignment and because student performances offer windows into students’ thinking processes. In high-stakes contexts, strict standardization of task development, administration, and scoring is critical for promoting comparability, reliability, and generalizability. In classroom assessment contexts, such rigid standardization may be relaxed. Clearly, what makes a particular PBA useful for one context will make it less so for the other. For example, standardization of task development, administration, and scoring (which is impractical in classroom settings anyway) moves assessment further from instruction and makes it less amenable to organic adjustment by the teacher to meet student needs. In turn, the unstandardized procedures typically favored in classroom settings—extended administration time, student choice of tasks, student collaboration—introduce construct-irrelevant variance and diminish the comparability of tasks that is necessary for high-stakes contexts.
Bottom line: PBAs are here for the foreseeable future. The measurement community needs to revise its expectations about the reliability of individual assessment components. PBAs will prove less reliable than traditional assessment approaches! However, we should move forward with the expectation that this compromise in reliability buys greater construct validity for skills not easily assessed using traditional approaches. In addition, I suggest we focus on the reliability, comparability, and generalizability of scores and decisions emanating from assessment systems that incorporate multiple measures representing multiple assessment contexts taken over multiple testing occasions (e.g., through-course assessments).
Moreover, there are ways of making PBAs, even those administered in the classroom, more reliable and comparable. Although pre-service teacher training in measurement and assessment is notoriously weak (e.g., see Rick Stiggins’ work), teachers can be taught to design assessments according to measurement principles. For example, carefully crafted test specifications can go a long way in creating comparable tasks. Although the measurement field has traditionally avoided classroom assessment, I suggest we consider participating in collaborative initiatives to create curricula with psychometrically sound, embedded PBAs (e.g., see Shavelson and company’s SEAL project).
Doing this well will require new assessment development models that incorporate close collaboration between curriculum designers and assessment developers to ensure tight alignment and seamless integration of assessment and instruction. Such models will also require closer collaboration between the content specialists who write the tasks and the psychometricians charged with collecting evidence to support overall assessment validity and reliability. Finally, such embedded assessments will need to be piloted—not only to investigate task performance, but also to obtain feedback from teachers regarding assessment functionality and usefulness.
It’s a brave, new world of assessment. To truly advance and sustain these developments, we need to start thinking in brave, new ways. Such an approach will help ensure that the current wave of performance assessment has more staying power than the last.
Emily Lai
Associate Research Scientist
Test, Measurement & Research Services
Assessment & Information
Pearson