Monday, November 22, 2010
Where Will Innovation Come From?
I have just come from a learning trajectories conference associated with the DELTA Project at North Carolina State University. This conference was supported by the Pearson Foundation. At this event, there was a lot of talk about innovation in how and what we assess, including talk about so-called “innovative” items. Those are items that might use simulation, multimedia and figural response item formats. The Race to the Top consortia want to hire companies to construct these innovative items that will be freely shared across states. But where will innovations in assessment items and tasks come from?
There are people who do research on innovations and where innovations come from. Consider Eric von Hippel, a professor of technological innovation at MIT. In his book, The Sources of Innovation, von Hippel points out that companies that manufacture in a field tended to innovate in that field when they could expect the innovation to become a commercially successful product with an attractive return. These innovative companies gained an enhanced position in the market. But von Hippel’s research raises a question: What motivates companies to invest in developing innovative items if those innovations will then be given away by the consortia to the states? Where is the enhanced market position? Are the consortia taking the best approach to ensuring innovation in how and what we assess?
Other researchers on innovation, Kevin Boudreau and Karim Lakhani, who write about outside innovation, might have an answer. They say that when the technology and the customers’ needs are well understood, a company can do internal research and development. But opening up innovation to a collaborative community can work better when the technology and design approaches have not been established and the innovation involves cumulative knowledge, continually building on past advances.
These collaborative communities tend to develop knowledge-sharing and dissemination mechanisms, and converge on common norms with a culture of sharing and cooperation, broad agreement on a technology paradigm, and common technical jargon to support productive collaboration. A good example of this is Apple’s iPhone. Thousands of outside software developers have written complementary applications (the apps we all love so much) that have made the iPhone the center of a thriving business ecosystem. Another but somewhat different example is the Semiconductor Research Corp., a Durham, North Carolina-based nonprofit consortium that includes members from industry, the government, and academia. The Semiconductor Research Corp. was established in 1982 to accumulate fundamental knowledge in silicon technology and semiconductor manufacturing.
As the Rolling Stones sang, “You can't always get what you want. But if you try sometimes, yeah, you just might find you get what you need!” I don’t know the best way to guarantee we get the innovation we want in how and what we assess. But after meeting and listening to the researchers and policymakers at the learning trajectories conference, a way to get what we need might be a partnership between companies like Pearson, the government, and academia.
Paul Nichols, Ph.D.
Vice President
Psychometric & Research Services
Assessment & Information
Pearson
Tuesday, November 02, 2010
TrueScores Redux
Welcome to the first blog entry under our new editorial format! In the past, different staff members from Research Services have taken turns writing blog entries. The result was some interesting blogs but not really a consistent voice. So we decided to try something different.
First, we will have a blog writer appointed to a six-month term (or sentence). Obviously, I’m the first blogger. If you don’t like my blogs, just wait six months and someone else will take over. But use the comments function and let me know what you think, one way or the other.
Second, readers can send comments back to me. But I get to decide if I want to publicly respond to those comments. Some rules about comments:
1. Be polite;
2. Be relevant;
3. And be polite.
Third, I will periodically invite people I believe are interesting to write a blog entry. I will invite guest bloggers for two reasons: One, I want a break once in a while. Two, I’m not that interesting and the readers need a break once in a while. For those of you who follow the TMRS Research Newsletter, I will coerce the Newsletter editors to write a blog entry when they begin their one-year term of service and tell us their vision for the coming year. I will plead with the leaders at the Pearson Global Psychometric Centers at Oxford, the University of Western Australia, and The University of Texas at Austin to write blog entries. And I will ask a graduate student participating in Pearson's Summer Research Fellowship Program to write a blog about their experience over the summer with Pearson. So readers who are thinking about applying for Pearson's Summer Research Fellowship, you’ve been warned.
Finally, my picture is posted with the blog. I tried to slip in Brad Pitt’s picture but I was caught. Well, some changes are positive and others …
So, I promise to provide new blog entries on a regular basis. I promise each entry will address something that I care about. I cannot promise more.
"So long as you write what you wish to write, that is all that matters; and whether it matters for ages or only for hours, nobody can say.”
Virginia Woolf
Paul Nichols, Ph.D.
Vice President
Psychometric & Research Services
Assessment & Information
Pearson
Thursday, October 14, 2010
Success on the Largest Scale
This summer I had the pleasure of working with Heather Klesch, Senior Area Director, and Tracey Magda, Psychometrician, at the Evaluation Systems group of Pearson in Hadley, Massachusetts. As mentors, Heather and Tracey designed an internship comprising three major components: (1) drafting a technical manual, (2) researching contact modes, and (3) assisting at a national benchmark-setting conference.
The general public was the intended audience for the technical manual. Wording needed to be precise and non-technical while describing topics like equating and reported statistics. Many of the technical staff in Hadley kindly discussed test construction and score validation with me, offering me insight into their fields. During this project, I gained a greater understanding of both the practice and the description of operational psychometrics.
My research on survey contact modes was directed by operational questions regarding content validation surveys. After attending meetings and reading relevant literature, I presented a one-page bulleted synopsis of literature highlights, addressing comments and concerns. On an abstract level, both the technical manual and the survey mode research required retaining meaning while summarizing.
At the St. Louis benchmark-setting conference, I saw the positive effects of a well-planned event. From all 50 states, roughly 700 educators converged on St. Louis to set benchmarks for the 31 tests of the National Evaluation Series (NES). Subject matter expert training was standardized, and there was a protocol for all confidential materials. In addition to planned standardized procedures, we were able to demonstrate appropriate protocol during an unexpected fire alarm. I enjoyed the responsibility of assisting at the conference, knowing that our process would become part of the validity evidence for the NES.
On a personal note, and as a baseball fan, I was thrilled about the St. Louis trip. There has been a new iteration of Busch Stadium since 2004, but on my birthday I got to stand in the air space of what I consider to be the most life-changing stadium of all time. In 2004, the Red Sox pulled off a complex coordinated effort. After they won, the Red Sox belonged to the set of teams that accomplished their goals on the largest stage. Red Sox fans could believe that years of suffering had finally paid off. In my summer of 2010, the same could be said of Pearson; I saw Pearson succeeding at a complex coordinated effort that also took considerable talent. After the conference, I saw that Pearson can accomplish goals on the largest scale, and I can believe that my years of school will also have been well-spent.
It was a special event in my life to be at Pearson this summer with Heather and Tracey. I was given responsibility to do exciting work that mattered. Now, in addition to an excellent graduate school experience, I have an empirically grounded confidence that I will enjoy my future in psychometrics.
Amy Semerjian
Psychometric Intern
Test, Measurement & Research Services
Pearson
Friday, October 08, 2010
Professionals Sharing Knowledge
My summer internship at Pearson in San Antonio, Texas, has been a rewarding experience. I worked alongside exceptional individuals and gained valuable experience. In addition, I developed new friendships.
Over the eight weeks, I worked primarily on a research project that dealt with the selection of common items in creating a vertical scale. The study investigated how decisions such as the structure of a common-item design, the composition of the common-item set, and the procedure used for selecting stable common items impact the nature of students’ growth from grade to grade. I learned a great deal from my mentors, Michael J. Young and Qing Yi. They taught me how to conceptualize research ideas while using available data. We submitted a proposal based on this study for the 2011 National Council on Measurement in Education annual conference. This study also laid the foundation for what I hope will become my dissertation.
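As a rough illustration of the last of those decisions, here is a minimal sketch in Python that flags potentially drifting common items by comparing their difficulty estimates across two administrations with a robust z-statistic. The data, the flagging threshold, and the fallback rule are assumptions for illustration only, not the procedure used in the study.

    import numpy as np

    def flag_unstable_common_items(b_old, b_new, threshold=1.645):
        # b_old, b_new: difficulty estimates for the same common items from two
        # administrations (or adjacent grades), already on a common metric.
        b_old = np.asarray(b_old, dtype=float)
        b_new = np.asarray(b_new, dtype=float)
        shift = b_new - b_old
        # Robust center and spread, so one aberrant item does not mask itself.
        center = np.median(shift)
        mad = np.median(np.abs(shift - center))
        spread = 1.4826 * mad if mad > 0 else np.std(shift, ddof=1)
        robust_z = (shift - center) / spread
        # True = candidate for removal from the common-item set before linking.
        return np.abs(robust_z) > threshold

    # Hypothetical difficulties for five common items; the third item drifts
    # noticeably more than the others and gets flagged.
    old = [-1.2, -0.4, 0.1, 0.6, 1.3]
    new = [-1.1, -0.5, 0.9, 0.5, 1.2]
    print(flag_unstable_common_items(old, new))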
I attended regular training seminars and skill-enrichment meetings. Attending these seminars and meetings helped me to further develop my skills with psychometric software (e.g., WINSTEPS and SAS) and to recognize how those skills are applied in practice. I also broadened my understanding of current issues in the field of educational measurement (e.g., consequential validity). As I listened and observed, I came to appreciate how serious Pearson research scientists are about their work.
From time to time, I participated in discussions with research scientists and received instruction from them about relevant topics in psychometrics (e.g., test construction, simulation, equating, dimensionality). Given their busy schedules, they were very accommodating. I would like to express my gratitude to those individuals who took the time to share their knowledge with me, namely Allen Lau, Marc Johnson, Tim O’Neil, Kwang-lee Chu, Thanh Nguyen, Hua Wei, and Daeryong Seo.
Some of the more memorable moments were simply conversing with everyone in the department. Thanks to Agnes Stephenson, Mark Robeck, Serena Lin, Toby Parker, and Stephen Jirka for making me feel welcome and a very special thank-you to Dee Heiligmann and David Shin for looking after me!
For future fellows, my suggestion would be to take advantage of the vast knowledge and resources available to you; your internship will surely surpass your expectations.
Assunta Hardy
Psychometric Intern
Test, Measurement & Research Services
Pearson
Tuesday, September 28, 2010
Trends in Alternate Assessments
While working for Pearson we have seen a shift in the assessment of students receiving special education services that, for the most part, seems to be in the right direction. The movement away from off-grade level testing was at first met with resistance and disbelief from state departments, contractors, and teachers but now seems to be reluctantly accepted. The introduction of the 1% (alternate achievement standards) and 2% (modified achievement standards) assessments forced states to look at the number of students taking special education tests. Federal accountability only allows states to count 3% of their student population as proficient if they test outside of the general assessment (more than 3% of students may take the 1% and 2% assessments, but if more than 3% are proficient they will not count as proficient for school, district, and state Adequate Yearly Progress reporting). Some states had a larger percentage of students being assessed with special education tests, and the development of participation requirements or guidelines became necessary to decrease the number of students taking these assessments and to try to make sure that the students were being assessed appropriately. There is still considerable controversy about whether students are being assessed with the right test.
The 1% assessment (designed for students with the most significant cognitive disabilities) allows the assessment of academic content standards through content that has been linked to grade-level standards. Common approaches to the 1% assessment have included checklists, portfolios, and structured observation, all requiring substantial teacher involvement. Assessment of these students using pre-requisite skills linked to the grade-level curriculum was not popular with teachers. Prior to the No Child Left Behind Act (NCLB), many of these students had been taught life skills, and assessing academic content was criticized as being unimportant for this student population. However, once teachers started teaching the pre-requisite skills associated with the grade-level curriculum, we heard many positive reports. Teachers were surprised to find that their students could handle the academic content. One of the most common things heard at educator meetings was teachers saying that they never knew their students could do some of the things they were now teaching.
Providing psychometric evidence to support the validity and reliability of the 1% assessment has been challenging. The student population taking the 1% assessment is a unique and varied group. Creating a standardized test to meet the needs of this population calls for assessment techniques (checklists, portfolios, and structured observations) that require a great deal of teacher involvement and fall outside the more traditional types of psychometric analyses used for multiple-choice assessments. The role of the teacher in the assessment of the 1% population is a source of controversy because of that high level of involvement. In order to develop a standardized test that depends so heavily on teacher input, additional research studies to evaluate reliability and validity should be included as part of the test development process. Approaches that we have seen used include interrater reliability studies and validity audits. These types of studies provide evidence that the assessments are being administered as intended and that teachers are appropriately evaluating student performance. They help show that the results of the 1% assessments are a true indication of what the student can do rather than what the teacher says the student can do.
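To give a flavor of what the interrater reliability piece involves, here is a minimal sketch in Python: two raters independently score the same set of portfolio entries, and we compute exact agreement and Cohen's kappa. The rubric, the scores, and the sample size are invented for illustration; an operational study would also involve sampling plans, adjudication rules, and audit procedures well beyond this.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical scores on a 0-3 rubric, assigned independently by two raters
    # to the same 12 portfolio entries.
    rater_1 = np.array([0, 1, 2, 2, 3, 1, 0, 2, 3, 3, 1, 2])
    rater_2 = np.array([0, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2])

    exact_agreement = np.mean(rater_1 == rater_2)   # proportion of identical scores
    kappa = cohen_kappa_score(rater_1, rater_2)     # chance-corrected agreement
    weighted = cohen_kappa_score(rater_1, rater_2, weights="quadratic")  # credits near-agreement

    print(f"exact agreement = {exact_agreement:.2f}")
    print(f"Cohen's kappa   = {kappa:.2f}")
    print(f"weighted kappa  = {weighted:.2f}")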
The federal government’s legislation to allow an optional 2% assessment (for students who are not able to progress at the same rate as their non-disabled peers) has been met with varying levels of acceptance. At this time only two states have received approval of their 2% assessments, and 15 states are in the process of developing one. However, there is some talk that with the reauthorization of the Elementary and Secondary Education Act (ESEA) and the movement toward the Common Core, the 2% assessment will go away. The communication from the Department of Education in the context of Race to the Top and the Common Core has been that the students participating in 2% assessments should be able to participate in the new general assessments developed for the state consortia. However, actual changes to the NCLB assessment and reporting requirements will have to be legislated, most likely during ESEA reauthorization.
Many states simply did not bother to develop a 2% test. It’s an expensive endeavor for an optional assessment. Those states that have developed (or are in the process of developing) a 2% test have struggled to find a cost-effective approach to setting modified achievement standards and to modifying or developing grade-level items that are accessible to this group of students.
There seems to be a need for an assessment that falls between the 1% test and the general test offered with accommodations. But there are differences of opinion about how students should perform on the 2% test. If students perform poorly, is that to be expected because they shouldn’t be able to perform well on grade-level material, or does it indicate that the test has not been modified enough for these students to show what they know? If students perform well on the assessment, does that mean that the modifications have been done well, or that the wrong students are taking the test? We would like to think that the intent of the legislation was for states to develop a test that assesses grade-level content in such a way that students could be successful. Even so, we have heard the argument that even if students taking the 2% test are not doing well, they are still performing better than they would have if they had taken the general test.
Most 2% assessments developed or in development use a multiple-choice format, and traditional psychometric analyses associated with multiple-choice items work well here. But there have been discussions about what the data for a 2% test should look like. Of particular interest is whether a vertical scale should or could be developed for a 2% assessment. Recent investigations show that the vertical scale data do not look like vertical scale data seen on general assessments, but it is unclear whether this is a problem or whether this is to be expected. Our initial recommendation to one state was not to develop a vertical scale since the vertical scale focuses on a year’s worth of student growth and a year’s worth of growth for a 2% student may be very different from what we see in the general population. But after collecting vertical scale data for that 2% assessment, the data looked better than expected though not close enough to a general assessment vertical scale to recommend its implementation. Further research is being conducted.
Growth models for both the 1% and 2% assessments are also being developed. Again, the type of growth to expect from the students taking these assessments is an open question, especially for the 1% population. The question is how to capture the types of growth these students do show. Models are being implemented now, and we are curious to see what the evaluation of these models will show. Are we able to capture growth for these students?
As the lifespan of alternate assessments under NCLB has increased, they have received increasing scrutiny under peer review and by assessment experts. The tension between creating flexible, individualized assessments and meeting technical requirements for validity and reliability has led to increased structure, and often increased standardization, in the development of alternate assessments. Yet, for myriad reasons, alternate assessments do not, and should not, look like the current, primarily multiple-choice, general assessments. The unique nature of alternate assessments has pushed psychometric and other research toward a better understanding of non-traditional measurement. Providing reliability and validity evidence for assessments with alternate and modified achievement standards has required innovative thinking; thinking that has already been informing assessment design ideas for the common core assessment systems, which are expected to be innovative, flexible, and, to some extent, performance based.
Natasha J. Williams, Ph.D.
Senior Research Scientist
Psychometric and Research Services
Pearson
Thursday, September 16, 2010
Under Pressure
I cannot seem to get the rhythmic refrain of the famous Queen/David Bowie song “Under Pressure” out of my head when thinking about the Common Core and the Race to the Top (RTTT) Assessment Consortia these days. Yes, this is a remarkable time in education, presenting us with opportunities to reform teaching, learning, assessment, and the role of data in educational decision making, and all of those opportunities come with pressure. But when Bowie’s voice starts ringing in my head, I am not thinking about those issues. I am instead worried about a relatively small source of pressure the assessment systems must bear: a source of pressure that is about 2% of the target population of test takers in size.
Currently, under the Elementary and Secondary Education Act (ESEA), states are allowed to develop three categories of achievement standards: general achievement standards, alternate achievement standards, and modified achievement standards. These standards all refer to criteria students must meet on ESEA assessments to reach different proficiency levels. Modified achievement standards only became part of the No Child Left Behind Act (NCLB) reporting options in 2007,* after years of pressure from states. It was felt that the general assessment and alternate assessments did not fully meet states’ needs for accurate and appropriate measurement of all students. There were many students for whom the alternate assessment was not appropriate, but neither was the general assessment. These kids were often referred to as the “grey area” or “gap” kids.
I do not think anyone would have argued that the modified achievement standards legislation fully addressed the needs of this group of kids, but it did provide several benefits. States that opted to create assessments with modified achievement standards were able to explicitly focus on developing appropriate and targeted assessments for a group of students with identifiably different needs. The legislation also drew national attention in academics, teaching, and assessment to the issue of “gap” students. This raised important questions, including:
-- Which students are not being served by our current instructional and assessment systems?
-- Is it because of the system, the students, or both?
-- What is the best way to move underperforming students forward?
In the relatively short time since legislative sanction of modified assessments, significant amounts of research and development have been undertaken. However, just as I asserted that the legislation did not fully meet the needs of “gap” kids, I also assert that the research and development efforts have yet to unequivocally answer any of the questions that the legislation raised. Though research has not yet answered those questions, this does not mean that the research has not improved our understanding of the 2% population and how they learn. And it does not mean that we should stop pursuing this research agenda.
Now, in the context of the RTTT Assessment competition, the 2% population seems to be disappearing, or at least is being re-subsumed into the general assessment population. I do not think that the Education Department means to decrease attention to the needs of students with disabilities or to negatively impact these students. There is still significant emphasis given to meeting the needs of students with disabilities and consistently underperforming students in the RTTT Assessment RFP and in the proposals submitted by the two K-12 consortia. However, the proposals do seem to indicate that the general assessment will need to meet the needs of these populations, offering both appropriate and accurate measurement of students’ Knowledge, Skills, and Abilities (KSAs) and individual growth. I wonder how much attention these students will receive in test development, research, and validation efforts when the test developers are also taxed with creating innovative assessments, designing technologically enhanced interactive items, moving all test takers online, and myriad other issues. The test development effort was already under significant pressure before the needs of students previously served by assessments with modified achievement standards were lumped in.
I applaud the idea of creating an assessment system that is accessible, appropriate, and accurate for the widest population of students possible. I also hope that the needs of all students will truly inform the development process from the start. However, I cannot help worrying. We are far from finished with the research agenda designed to help us better understand students who have not traditionally performed well on the general assessment. With so many questions left unanswered, and with so many new test development issues to consider, I hope that students with disabilities and under-performing students are not, once again, left in a “gap” in the comprehensive assessment system.
* April 19, 2007 Federal Register (34 C.F.R. Part 200) officially sanctioned the development of modified achievement standards.
Kelly Burling, Ph.D.
Senior Research Scientist
Psychometric and Research Services
Pearson
Tuesday, August 31, 2010
Lessons Learned from a Psychometric Internship
“Meetings. Meetings. Meetings. How do these people get any work done?” This was one of my first impressions of Pearson. Well, as I quickly learned, much of the work gets done through the meetings. By including everyone with a need to know, and by meeting frequently, projects are kept moving forward smoothly with no items overlooked or persons left out. Not only does this contribute to successful projects, but it helps build an esprit de corps in which everyone realizes the value of their own and each other’s contributions to getting the job done.
Most images of a psychometric internship center on doing nothing but research. Mine was not like that. It was balanced between production work and research. As a result, I developed a better (and more favorable) understanding of a large-scale psychometric testing process than I had ever imagined. My first week I started coding the open-ended answers from a survey administered to teachers to get feedback on how to improve the testing process for students with limited English who have language accommodations on tests. This was followed in later weeks by observing meetings with the Texas Technical Advisory Committee (invited scholars who provide insight into questions that Pearson and the Texas Education Agency have about the testing process), with panels of teachers invited to review new questions before they are put into the test bank, and with other panels that performed peer review of alternate assessments. Some of what goes into creating these panels and getting them together to do their work I picked up from listening to the people who sat in adjacent cubes. There was a continuous conversation with school administrators about panelist selection, and with panelists about the mechanics of their coming to Austin to work for a few days. It was really amazing.
I was not able to complete the research I started. (I needed another month.) A disappointment, but I came away with topics to work on in the future, an appreciation of the breadth and depth of psychometrics that I did not have before, and professional and personal contacts with some of the sharpest people in the business.
C. Vincent Hunter
Psychometric Intern
Test, Measurement & Research Services
Pearson
Tuesday, August 24, 2010
Reflecting on a Psychometric Internship
The summer of 2010 was an unusual one in my career: I was fortunate to be selected as a summer psychometric intern working in Iowa City. The short eight weeks turned out to be my most productive and memorable period during my years of graduate study. Not only did I develop two AERA proposals out of the projects I participated in, but I got to know a group of wonderful researchers in the educational measurement field. The hearty hospitality, thoughtful consideration, and gracious support that I received from the group exceeded what I had expected.
I very much appreciate what I learned from the weekly seminars, which covered a broad range of current hot topics and techniques that leading testing companies continue to pursue. One-on-one meetings with experienced psychometricians about hands-on practice in K-12 testing projects gradually broadened my view of both academic research and industry operations. By seeing the practical needs that emerge from testing operations, I came to realize how much potential there is in educational measurement research, and that was the most valuable gain for me.
Jessica Yue
Psychometric Intern
Test, Measurement & Research Services
Pearson
Wednesday, August 04, 2010
Educational Data Mining Conference -- Part Two
Exploratory learning environments, by their nature, allow students to freely explore a target domain. Nonetheless, the art of teaching compels us to provide appropriate, timely supports for students, especially for those who don’t learn well in such independent situations. While intelligent tutoring systems are designed to identify and deliver pedagogical and cognitive supports adaptively for students during learning, they have largely been geared to more structured learning environments, where it is fairly straightforward to identify student behaviors that likely are or aren’t indicative of learning. But what about providing adaptive supports during exploratory learning?
One of the keynote lectures at the Third International Conference on Educational Data Mining, held in Pittsburgh in June (and discussed in a previous blog posting), covered this topic. Cristina Conati, Associate Professor of Computer Science at the University of British Columbia, described her research using data-based approaches to identify student interaction behaviors that are conducive to learning vs. behaviors that indicate student confusion while using online exploratory learning environments (ELEs). The long-term goal of her work is to enable ELEs to monitor a student’s progress and provide adaptive support when needed, while maintaining the student’s sense of freedom and control. The big challenge here is that there are no objective definitions of either correct or effective student behaviors. The initial effort of her team’s work is therefore to uncover student interaction patterns that correlate with, and thus can be used to distinguish, effective vs. ineffective learning.
Core to this “bootstrapping” process is the technique of k-means clustering employed by Conati and her team. K-means clustering is a cluster analysis method that defines groups (e.g., of students) exhibiting similar characteristics (e.g., interaction behaviors), and it is commonly used in educational data mining research. Data from student use of two different college-level ELEs were used in the study: AIspace, a tool for teaching and learning artificial intelligence (AI), and the Adaptive Coach for Exploration (ACE), an interactive open learning environment for exploring math functions. The data sets consisted of either interface actions alone or interface actions combined with student eye-tracking. Identification of groups as high learners (and therefore presumably exhibiting largely effective behaviors) vs. low learners (and therefore presumably exhibiting largely ineffective behaviors) was determined either by comparing students’ pre- and post-test scores or through expert judgment. Formal statistical tests were used to compare clusters in terms of learning and feature similarity.
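For readers who want a feel for the mechanics, here is a minimal sketch in Python (with scikit-learn) of the general approach: cluster students on standardized interaction features, then see how the clusters line up with pre-/post-test gains. The feature names, the data, and the number of clusters are hypothetical; the actual feature engineering and statistical comparisons in these studies are far more involved.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Hypothetical per-student interaction features: mean pause after an action
    # (seconds), actions per minute, and proportion of coach suggestions followed.
    features = np.array([
        [6.2, 3.1, 0.8],
        [5.8, 2.9, 0.7],
        [1.4, 9.5, 0.2],
        [1.9, 8.7, 0.3],
        [6.5, 2.4, 0.9],
        [2.1, 9.9, 0.1],
    ])
    gains = np.array([0.42, 0.38, 0.05, 0.11, 0.47, 0.02])  # post-test minus pre-test

    X = StandardScaler().fit_transform(features)  # put features on a common scale
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Interpret the clusters by comparing average learning gains across them.
    for k in np.unique(labels):
        print(f"cluster {k}: n = {np.sum(labels == k)}, mean gain = {gains[labels == k].mean():.2f}")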
In the end, the data supported a distinction between either two groups (one high-learner and one low-learner) or three groups (one high-learner and two low-learner), i.e., k=2 and k=3, as a function of student behaviors. Differential behavior patterns include:
-- Low learners moved more quickly through exercises, apparently allowing less time for understanding to emerge.
-- High learners paused longer, with more eye-gaze movements, during some activities.
-- Low learners paused longer after navigating to a help page.
-- Low learners ignored the coach’s suggestions to continue exploring the current exercise more frequently than high learners did.
-- Low learners appeared to move impulsively back and forth through the curriculum.
In summary, the research shows promise for k-means clustering as a technique for distinguishing effective from ineffective learning behaviors, even during unstructured, exploratory learning. Of course, this work is just a start. For example, additional research with larger numbers of students (24 and 36 students were used in the current studies) might support distinguishing additional groups, should such groups exist. In the end, the hope is that by identifying patterns of behaviors that can serve as indicators of effective vs. ineffective learning, targeted, adaptive interventions can be applied in real time to support students' productive learning while maintaining the freedom that defines ELE learning.
Bob Dolan, Ph.D.
Senior Research Scientist, Assessment & Information
Pearson
Monday, August 02, 2010
Educational Data Mining Conference -- Part One
The Third International Conference on Educational Data Mining was held in Pittsburgh in June. The conference began in Montreal two years ago as an offshoot of the Intelligent Tutoring Systems Conference (held this year in Pittsburgh immediately following). It is a small (approximately 100 participants), single-track conference whose participants are mostly academicians in the fields of cognitive science, computer science, and artificial intelligence (AI), most or all of whom have dedicated their efforts to education research.
Educational data mining (EDM) is the process of analyzing student (and in some cases even educator) behaviors for the purpose of understanding their learning processes and improving instructional approaches. It is a critical component of intelligent tutoring systems, since there is an implicit realization in this field that unidimensional models of student knowledge and skills are generally insufficient for providing adaptive supports. That said, the results from EDM go beyond informing Intelligent Tutoring Systems (ITS) on how to do their job. For example, they can be a cornerstone of formative assessment practices, in which we provide teachers with actionable data on which to shape instructional decisions. In fact, few would argue that the most successful ITS is one that not only provides individualized opportunities and supports to students in real-time but also keeps the teacher actively in the loop.
Examples of the types of data used in EDM include the following (a sketch of how such data might be logged appears after the list):
Correctness of student responses (of course!)
Types of errors made by students
Number of incorrect attempts made
Use of hints and scaffolds
Level of engagement / frequency of off-task behaviors (as measured through eye-tracking, student/computer interaction analysis, etc.)
Student affect (as measured through physiological sensors, student self-reports, etc.)
The list goes on ...
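As a concrete, purely hypothetical illustration of the kind of record these data might produce, here is a minimal sketch of an interaction-log entry in Python. The field names are invented for illustration and are not drawn from any particular tutoring system.

from dataclasses import dataclass
from typing import Optional

@dataclass
class InteractionEvent:
    student_id: str
    timestamp: float          # seconds since the session started
    item_id: str
    action: str               # e.g., "submit", "hint_request", "navigate"
    correct: Optional[bool]   # None for actions that are not scored responses
    attempt_number: int
    hint_level: int           # 0 means no hint was used
    seconds_off_task: float   # e.g., estimated from interaction gaps or gaze data

event = InteractionEvent("s042", 312.5, "item07", "submit",
                         correct=False, attempt_number=2,
                         hint_level=1, seconds_off_task=0.0)
print(event)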
Much of EDM research focuses on identifying how students cluster into groups based upon their behaviors (k-means is particularly popular, though by no means exclusive). For example, it might be found that a population of students working on an online tutoring system seems to divide into three groups -- high-performing, low-performing/high-motivated, and low-performing/low-motivated -- with each group exhibiting distinct patterns of interaction and hence learning. The types of supports offered to students in each of these groups can, and should, vary as a function of this clustering.
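To make that last point concrete, here is a small, hypothetical sketch of how a fitted clustering model might be used to route a new student to a type of support. The cluster-to-support mapping is invented; in practice, each cluster would first be inspected and tied to learning outcomes before deciding what it should trigger.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
behavior_features = rng.normal(size=(60, 4))  # 60 students, 4 behavior features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(behavior_features)

# Hypothetical mapping from (already inspected) clusters to supports.
supports = {0: "no change", 1: "offer a worked example", 2: "prompt a teacher check-in"}

new_student = rng.normal(size=(1, 4))          # features logged for a new student
cluster = int(kmeans.predict(new_student)[0])  # assign to the nearest cluster
print(supports[cluster])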
As efforts to bridge the divide between instruction and assessment get underway, such as the federal Race to the Top Assessment program, it is important that the educational testing research community stay on top of developments in the EDM research community, to best understand the types of data to collect, the techniques for analyzing them, and their potential for improving educational opportunities for students.
Bob Dolan, Ph.D.
Senior Research Scientist, Assessment & Information
Pearson
Monday, July 26, 2010
Journal of Technology, Learning, and Assessment Article: The Effectiveness of Distributed Training for Writing Assessment Raters
Read the latest published article from several of our research colleagues:
The Journal of Technology, Learning, and Assessment (JTLA)
Volume 10, Number 1, July 2010
The Effectiveness and Efficiency of Distributed Online, Regional Online, and Regional Face-to-Face Training for Writing Assessment Raters
Edward W. Wolfe, Staci Matthews, and Daisy Vickers
This study examined the influence of rater training and scoring context on training time, scoring time, qualifying rate, quality of ratings, and rater perceptions. 120 raters participated in the study and experienced one of three training contexts: (a) online training in a distributed scoring context, (b) online training in a regional scoring context, and (c) stand-up training in a regional context. After training, raters assigned scores to qualification sets, scored 400 student essays, and responded to a questionnaire that measured their perceptions of the effectiveness of, and satisfaction with, the training and scoring process, materials, and staff. The results suggest that the only clear difference on the outcomes for these three groups of raters concerned training time—online training was considerably faster. There were no clear differences between groups concerning qualification rate, rating quality, or rater perceptions.
Download the article (Acrobat PDF, 239 KB).
About The Journal of Technology, Learning and Assessment
The Journal of Technology, Learning and Assessment (JTLA) is a peer-reviewed, scholarly on-line journal addressing the intersection of computer-based technology, learning, and assessment. The JTLA promotes transparency in research and encourages authors to make research as open, understandable, and clearly replicable as possible while making the research process – including data collection, coding, and analysis – plainly visible to all readers.
Wednesday, July 07, 2010
Impressions of 2010 CCSSO Conference
This year, the Council of Chief State School Officers (CCSSO) Large-Scale Assessment conference took place in Detroit, the world's automotive center. The weather was a pleasant contrast compared to the heat in Texas, where I live and work. As the conference overlapped with River Days, it was nice to share in the festive atmosphere. The 30-minute fireworks display was splendid.
The participants in the CCSSO conference were quite different from those in NCME or in AERA Division D. The psychometrician/research scientist was a “rare animal.” However, the conference provided a good opportunity to meet assessment users from different states’ Departments of Education.
Most of the presentations were neither technical nor psychometric in orientation, but they provided information about trends in large-scale assessment in the United States. This year, topics like Race to the Top, the Common Core Standards, and best assessment practices were especially “hot.”
Not many people showed up for my presentation. It might have been the title, “…Threats to Test Validity….” Was it too technical?
Pearson proudly sponsored a reception buffet dinner with live music and several vice presidents greeted guests at the door. The dinner provided an enjoyable situation in which to meet and build relationships with others in the assessment field. I met a number of colleagues from global Pearson and I felt proud to be a member of this big family.
C. Allen Lau, Ph.D.
Senior Research Scientist
Psychometrics & Research Services
Assessment & Information
Pearson
Wednesday, June 30, 2010
Abandoning the Pejorative
Psychometricians often speak of error in educational and psychological measurement. But error seems like such a pejorative term. According to the Merriam-Webster online dictionary, an error is "an act involving an unintentional deviation from truth or accuracy." In ordinary language, error carries unflattering connotations and invites alarming inferences.
Under classical test theory, error is the difference between observed student performance and an underlying "true score." But who has special insight into what is true? As Nichols, Twing, Mueller, and O’Malley point out in their recent standard-setting article,* human judgment is involved in every step of test development. Typically, a construct or content framework describing what we are trying to test is constructed by experts. But the experts routinely disagree amongst themselves and certainly disagree with outside experts. The same expert may even come to disagree with a description he or she constructed earlier in the same career! So what psychometricians call error is just the difference between how one group of experts expects students to perform and how students actually perform.
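For readers who want the notation behind the term, the standard classical test theory decomposition is shown below; the argument here is about how the true score T gets defined, not about this algebra.

X = T + E, \qquad \operatorname{Cov}(T, E) = 0
\sigma^2_X = \sigma^2_T + \sigma^2_E, \qquad \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_E}{\sigma^2_X}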
Psychometricians might even be able to control the amount of "error" by careful selection of what they declare to be true. After all, the size of the difference between the experts' description and student performance, or error, depends on which experts you listen to or when you catch them. A strategy for decreasing error might be to gather a set of experts' descriptions and declare as “true” the description that shows the smallest difference between how experts expect students to perform and how students actually perform. Who is to say this group of experts is right and that group of experts is wrong?
So let's agree to abandon the pejorative label "error" variance. The variance between how one group of experts expects students to perform and how students actually perform might be more appropriately referred to as "unexpected" or "irrelevant." Let's recognize that this variance is irrelevant to the description constructed by the experts, i.e., construct irrelevant variance. Certainly this unexpected variance threatens the interpretation of student performance using the experts' description. But does the unexpected deserve the pejorative label "error?"
*Nichols, P. D., Twing, J., Mueller, C. D., & O’Malley, K. (2010). Standard setting methods as measurement processes. Educational Measurement: Issues and Practice, 29, 14-24.
Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson
Monday, June 14, 2010
Pearson at CCSSO - June 20-23, 2010
I'm looking forward to seeing folks at the upcoming Council of Chief State School Officers (CCSSO) National Conference on Student Assessment (in Detroit, June 20-23, 2010). Pearson employees will be making the following presentations. We hope to see you there.
Multi-State American Diploma Project Assessment Consortium: The Lessons We’ve Learned and What Lies Ahead
Sunday, June 20, 2010: 1:30 PM-3:00 PM
Marquette (Detroit Marriott at the Renaissance Center)
Shilpi Niyogi
Theory and Research On Item Response Demands: What Makes Items Difficult? Construct-Relevant?
Sunday, June 20, 2010: 1:30 PM-3:00 PM
Mackinac Ballroom West (Detroit Marriott at the Renaissance Center)
Michael J. Young
Multiple Perspectives On Computer Adaptive Testing for K-12 Assessments
Sunday, June 20, 2010: 3:30 PM-5:00 PM
Nicolet (Detroit Marriott at the Renaissance Center)
Denny Way
The Evolution of Assessment: How We Can Use Technology to Fulfill the Promise of RTI
Monday, June 21, 2010: 3:30-4:30 PM
Cadillac B (Detroit Marriott at the Renaissance Center)
Christopher Camacho
Laura Kramer
Comparability: What, Why, When and the Changing Landscape of Computer-Based Testing
Tuesday, June 22, 2010: 8:30 AM-10:00 AM
Duluth (Detroit Marriott at the Renaissance Center)
Kelly Burling
Measuring College Readiness: Validity, Cut Scores, and Looking to the Future
Tuesday, June 22, 2010: 8:30 AM-10:00 AM
LaSalle (Detroit Marriott at the Renaissance Center)
Jon Twing & Denny Way
Best Assessment Practices
Tuesday, June 22, 2010: 10:30 AM-11:30 AM
Mackinac Ballroom West (Detroit Marriott at the Renaissance Center)
Jon Twing
Distributed Rater Training and Scoring
Tuesday, June 22, 2010: 4:00 PM-5:00 PM
Richard (Detroit Marriott at the Renaissance Center)
Laurie Davis, Kath Thomas, Daisy Vickers, Edward W. Wolfe
Identifying Extraneous Threats to Test Validity for Improving Tests and Using of Tests
Tuesday, June 22, 2010: 4:00 PM-5:00 PM
Joliet (Detroit Marriott at the Renaissance Center)
Allen Lau
--------------------------
Edward W. Wolfe, Ph.D.
Senior Research Scientist
Wednesday, June 09, 2010
Innovative Testing
Innovative testing refers to the use of novel methods to test students in richer ways than can be accomplished using traditional testing approaches. This generally means the use of technology like computers to deliver test questions that require students to watch or listen to multimedia stimuli, manipulate virtual objects in interactive situations, and/or construct rather than select their responses. The goal of innovative testing is to measure students’ knowledge and skills at deeper levels and measure constructs not easily assessed, such as problem solving, critical analysis, and collaboration. This will help us better understand what students have and haven’t learned, and what misconceptions they might hold—and thus support decisions such as those related to accountability as well as what instructional interventions might be appropriate for individual students.
Educational testing has always involved innovative approaches. As hidden properties of students, knowledge and skill are generally impossible to measure directly and very difficult to measure indirectly, often requiring the use of complex validity arguments. However, to the extent that newer technologies may allow us to more directly assess students’ knowledge and skills—by asking students to accomplish tasks that more faithfully represent the underlying constructs they’re designed to measure—innovative testing holds the promise of more authentic methods of testing based upon simpler validity arguments. And as such, measurement of constructs that are “higher up” on taxonomies of depth of understanding, such as Bloom’s and Webb’s, should become more attainable.
Consider assessing a high school student’s ability to design an experimental study. Is this the same as his or her ability to identify one written description of a well-designed experiment amongst three written descriptions of poorly designed experiments? Certainly there will be a correlation between the two; the question is how good a correlation, or more bluntly, how artificial the context. And further, to what extent is such a correlation a self-fulfilling prophecy, such that students who might be good at thinking and doing science but not at narrowly defined assessment tasks are likely to do poorly in school as a result of poor test scores due to the compounding impact of negative feedback?
Many will recall the promise of performance assessment in the 1990s to test students more authentically. Performance testing didn’t live up to its potential, in large part because of the challenges of standardized administration and accurate scoring. Enter innovative questions—performance testing riding the back of digital technologies and new media. Richer assessment scenarios and opportunities for response can be administered equitably and at scale. Comprehensive student interaction data can be collected and scored by humans in efficient, distributed settings, automatically by computer, or both. In short, the opportunity for both large-scale and small-scale testing of students using tasks that more closely resemble real-world application of learning standards is now available.
Without question, creating innovative test questions presents additional challenges beyond those of simpler, traditional ones. As with any performance task, validity arguments become more complex and reliability of scoring becomes a larger concern. Fortunately, there has been some excellent initial work in this area, including the development of taxonomies and rich descriptions for understanding innovative questions (e.g., Scalise; Zenisky). Most notable are two approaches that directly address validity. The first is evidence-centered design, an approach to creating educational assessments in terms of evidentiary arguments built upon intended constructs. The second is a preliminary set of guidelines for the appropriate use of technology in developing innovative questions through application of universal design principles that take into account how students interact with those questions as a function of their perceptual, linguistic, cognitive, motoric, executive, and affective skills and challenges. Approaches such as these are especially important if we are to help ensure the needs of students with disabilities and English language learners are considered from the beginning in designing our tests.
Do we know that innovative questions will indeed allow us to test students to greater depths of knowledge and skill than traditional ones, and will they do so in a valid, reliable, and fair manner? And will the purported cost effectiveness be realized? These are all questions that need ongoing research.
As we solve the challenges of implementing innovative questions in technically sound ways, perhaps the most exciting aspect of innovative testing is the opportunity of integration with evolving innovative instructional approaches. Is this putting the cart before the horse to focus so much on innovation in assessment before we figure it out in instruction? I believe not. Improvements to instructional and assessment technologies must co-evolve. Our tests must be designed to pick up the types of learning gains our students will be making, especially when we consider 21st century skills, which will increasingly rely on innovative, technology-based learning tools. Plus our tests have a direct opportunity to impact instruction: despite all our efforts, “teaching to the test” will occur, so why not have those tests become models of good learning? And even if an emphasis on assessment is the cart, at least the whole jalopy is going the correct way down the road. Speaking of roads, consider the co-evolution of automobiles and the development of improved paving technologies: improvement in each couldn’t progress without improvement in the other.
Bob Dolan, Ph.D.
Senior Research Scientist
Wednesday, June 02, 2010
Where has gone the luxury of contemplation?
I ran across an Excel spreadsheet from some years ago that I had used to plan my trip to attend the 2004 NCME conference in San Diego. The weather was memorable that year. But I also attended a number of sessions during which interesting papers were presented and discussants and audience members made compelling comments.
My new memories of the 2010 NCME conference are different. I am grateful that the weather was pleasant. But I have memories of rushing from one responsibility to the next. I am sure the 2010 NCME conference included interesting papers and compelling commentary, but my memories of them were overshadowed by a sense of haste and a feeling of urgency. This impression was of my own making. First, I arrived several days late because of my already crowded travel schedule. Second, I participated in the conference in several roles: as presenter, discussant and co-author.
What I missed this year in Denver was the luxury of contemplation. I missed the luxury of sitting in the audience and reacting to the words and ideas as they rolled from the tongues of the presenters. I missed the luxury of mentally inspecting each comment from the discussants or the audience members and comparing them with my own reactions. I missed the luxury of chewing over the last session with a colleague as we walked through the hotel hallway and maybe grabbed lunch before the next session.
I can and did benefit from attending the NCME conference without the luxury of contemplation. But I missed the pleasure and comfort from indulging in the calm and thoughtful appreciation of the labors of my colleagues. These days we rarely indulge in the luxury of contemplation and we are often impoverished because of it.
Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson
Thursday, April 29, 2010
Pearson’s participation at AERA and NCME
As I read the latest TMRS Newsletter, I was reminded of my first days at Pearson. Back then, the department, led by Jon Twing, consisted of five other staff members. We were called psychometricians and we worked closely together across operational testing programs.
The department grew rapidly after that. As part of that growth, Jon and I created the Researcher-Practitioner model as an ideal. Under the Researcher-Practitioner Model, practice informs research and research supports practice. The role of the psychometrician combines research and fulfillment. Our department would not have separate groups of staff to perform research and fulfillment functions. Instead, our department would use the same staff members to perform both activities. Each psychometrician would dedicate the majority of their hours to contract fulfillment and the remaining hours to research.
Many things have changed since those first days but some things remain the same. With more than 50 research scientists, we are still a close-knit group. But the label “psychometrician” has been replaced with “research scientist.” And we are still working toward the ideal of the Researcher-Practitioner. While we may not have achieved the ideal, the list of Pearson staff participating in the annual conference of the American Educational Research Association (AERA) and the annual conference of the National Council on Measurement in Education (NCME) in the latest TMRS Newsletter is proof that we are still active researchers. In Denver, Pearson staff will give 22 presentations at AERA and 15 presentations at NCME. In addition, Pearson research scientists will make three presentations at the Council of Chief State School Officers’ (CCSSO) National Conference on Student Assessment and will present at the Society for Industrial & Organizational Psychology (SIOP) conference and the International Objective Measurement Workshop (IOMW).
Please review the research that Pearson research scientists will be presenting at these meetings, as listed in the newsletter. If you are interested in reading the conference papers, several are listed on the conference reports tab of the Research & Resources page of the Pearson Assessment & Information website.
Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson
Wednesday, April 07, 2010
Performance-based Assessment Redux
Cycles in educational testing continue to repeat. The promotion and use of performance-based assessments is one such cycle. Performance-based assessment involves the observation of students performing authentic tasks in a domain. The assessments may be conducted in a more- or less-formal context. The performance may be live or may be reflected in artifacts such as essays or drawings. Generally, an explicit rubric is used to judge the quality of the performance.
An early phase of the performance-based assessment cycle was the move from the use of performance-based assessment to the use of multiple-choice tests, as documented in Charles Odell’s 1928 book, Traditional Examinations and New-type Tests. The “traditional examinations” Odell referred to were performance-based assessments. The “new-type tests” Odell referred to were multiple-choice tests that were beginning to be widely adopted in education. These “new-type tests” were promoted as an improvement over the old performance-based examinations in efficiency and objectivity. However, Odell had doubts.
I am not old enough to remember the original movement from the use of performance-based assessment to the use of multiple-choice tests but I am old enough to remember the performance-based assessment movement of the 1990s. As I remember it, performance-based assessment was promoted in reaction to the perceived impact of multiple-choice accountability tests on teaching. Critics worried that the use of multiple-choice tests in high-stakes accountability testing programs was influencing teachers to teach to the test, e.g., focus on teaching the content of the test rather than a broader curriculum. Teaching to the test would then lead to inflation of test scores that reflected rote memorization rather than learning in the broader curriculum domain. In contrast, performance-based testing was promoted as a solution that would lead to authentic student learning. Teachers that engage in teaching to a performance-based test would be teaching the actual performances that were the goals of the curriculum. An example of a testing program that attempted to incorporate performance-based assessment on a large scale was the Kentucky Instructional Results Information System.
It’s déjà vu all over again, as Yogi said, and I am living through another phase of the cycle. Currently, performance-based assessments are being promoted as a component of a balanced assessment system (Bulletin #11). Proponents claim that performance-based assessments administered by teachers in the classroom can provide both formative and summative information. As a source of formative information (Bulletin #5), the rich picture of student knowledge, skills and abilities provided by performance-based assessment can be used by teachers to tailor instruction to address individual students’ needs. As a source of summative information, the scores collected by teachers using performance-based assessment can be combined with scores from large-scale standardized tests to provide a more balanced view of student achievement. In addition, proponents claim that performance-based assessments are able to assess 21st Century Skills whereas other assessment formats may not.
But current performance-based assessments still face the same technical challenges that they faced in the 1990s. A major technical challenge facing performance-based assessments is adequate reliability of scores. Variance in both teachers’ ratings and task sampling may contribute to unacceptably low score reliability for scores used for summative purposes.
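To make the reliability concern concrete, generalizability theory offers a standard way to express it. For a fully crossed persons (p) by tasks (t) by raters (r) design, one textbook form of the generalizability coefficient for relative decisions is shown below; this is standard notation, not an analysis of any particular program.

\mathrm{E}\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_{pt}}{n_t} + \dfrac{\sigma^2_{pr}}{n_r} + \dfrac{\sigma^2_{ptr,e}}{n_t\, n_r}}

Small numbers of tasks (n_t) and raters (n_r), or large person-by-task and person-by-rater variance components, push the coefficient down, which is exactly the problem described above.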
A second major challenge facing performance-based assessments is adequate evidence of validity. Remember that performance-based assessment scores are being asked to provide both formative and summative information. But validity evidence for formative assessment stresses consequences of test score use whereas validity evidence for summative assessment stresses more traditional sources of validity evidence.
A third major challenge facing performance-based assessments is the need for comparability of scores across administrations. In the past, the use of complex tasks and teacher judgments has made equating difficult.
Technology to the rescue! Technology can help address the many technical challenges facing performance-based assessment in the following ways:
- Complex tasks and simulations can be presented in standardized formats using technology to improve standardization of administration and broaden task sampling;
- Student responses can be objectively scored using artificial intelligence and computer algorithms to minimize unwanted variance in student scores (a minimal sketch follows this list);
- Teacher training can be detailed and sustained using online tutorials so that teachers’ ratings are consistent within teachers (across students and occasions) and across teachers; and,
- Computers and hand-held devices can be used to collect teachers’ ratings across classrooms and across time so that scores can be collected without interrupting teaching and learning.
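To illustrate the second bullet, here is a deliberately minimal, hypothetical sketch of feature-based automated scoring in Python with scikit-learn: a model learns to predict human ratings from simple text features. Operational automated scoring engines are far more sophisticated; the essays, scores, and model choice here are invented for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Essays and human-assigned rubric scores (1-4), invented for illustration.
essays = [
    "The experiment controls one variable at a time and repeats each trial.",
    "I think plants like sun so they grow.",
    "A control group and random assignment let us isolate the treatment effect.",
    "Plants grow with water.",
]
human_scores = [4, 1, 4, 2]

# Learn to predict the human scores from simple text features.
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(essays, human_scores)

# Score a new, unseen response.
print(model.predict(["Repeated trials and a control group make the design stronger."]))

The key design point is that the machine is trained to reproduce human judgments, so the quality of the human ratings still sets the ceiling on the quality of the automated scores.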
Save your dire prediction for others, George Santayana. We may not be doomed to repeat history, after all. Technology offers not just a response to our lessons from the past but a way to alter the future.
Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson
Wednesday, March 10, 2010
Some Thoughts About Ratings…
I spend a lot of time thinking about ratings. One reason I spend so much time thinking about ratings is that I’ve either assigned or been subjected to ratings many times during my life. For example, I review numerous research proposals and journal manuscripts each year, and I assign ratings that help determine whether the proposed project is funded or manuscript is published. I have entered ratings for over 1,000 movies into my Netflix database, and in return, I receive recommendations for other movies that I might enjoy. My wife is a photographer, and one of my sons is an artist, and they enter competitions and receive ratings through that process with hopes of winning a prize. My family uses rating scales to help us decide what activities we’ll do together—so much so that my sons always ask me to define a one and a ten when I ask them to rate their preferences on a scale of one to ten.
In large-scale assessment contexts, the potential consequences associated with ratings are much more serious than these examples, so I’m surprised at the relatively limited amount of research that has been dedicated to studying the process and quality of those ratings over the last 20 years. While writing this, I leafed through a recent year of two measurement journals, and I found only three articles (out of over 60 published articles) relating to the analysis of ratings. I’ve tried to conduct literature reviews on some topics relating to large-scale assessment ratings for which I have found few, if any, journal articles. This dearth of research relating to ratings troubles me when I think about the gravity of some of the decisions that are made based on ratings in large-scale assessment contexts and the difficulty of obtaining highly reliable measures from ratings (not to mention the fact that scoring performance-based items is an expensive undertaking).
Even more troubling is the abandonment, by some, of the entire notion of using assessment formats that require ratings because of these difficulties. This is an unfortunate trend in large-scale assessment, because there are many areas of human performance that simply cannot be adequately measured with objectively scored items. The idea of evaluating written composition skills, speaking skills, artistic abilities, and athletic performance with a multiple-choice test seems downright silly. Yet, that’s what we would be doing if the objective of the measurement process were to obtain the most reliable measures. Clearly, in contexts like this, the authenticity of the measurement process is an important consideration—arguably as important as the reliability of the measures.
So, what kinds of research need to be done relating to the analysis of ratings in large-scale assessment contexts? There are numerous studies of psychometric models and statistical indices that can be utilized to scale ratings data and to identify rater effects. In fact, all three of the articles that I mentioned above focused on such applications. However, studies such as those do little to contribute to the basic problems associated with ratings. For example, very few studies exist that examine the decision-making process that raters utilize when making rating decisions. There are also very few studies of the effectiveness of various processes for training raters in large-scale assessment projects—see these three Pearson research reports for examples of what I mean: Effects of Different Training and Scoring Approaches on Human Constructed Response Scoring, A Comparison of Training & Scoring in Distributed & Regional Contexts - Writing, A Comparison of Training & Scoring in Distributed & Regional Contexts - Reading. Finally, there are almost no studies of the characteristics of raters that make them good candidates for large-scale assessment scoring projects. Yet, most of the decisions made by those who run scoring projects hinge on three issues: Who should score, how should they be trained, and how should they score? It sure would be nice to make better progress toward answering these three questions over the next 20 years than we have during the past 20.
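To give a flavor of the simplest analyses involved, here is a small Python sketch of two basic rater checks on a sample scored by multiple raters: exact agreement and relative rater severity. The data are invented, and operational programs rely on far richer models than this (e.g., the rater-effect models mentioned above).

import numpy as np

# rows = responses scored by all three raters; columns = raters; 1-4 rubric
scores = np.array([
    [3, 3, 2],
    [4, 4, 4],
    [2, 3, 2],
    [1, 1, 1],
    [3, 4, 3],
])

grand_mean = scores.mean()
severity = scores.mean(axis=0) - grand_mean              # below zero = harsher than average
exact_agreement = (scores[:, 0] == scores[:, 1]).mean()  # raters 1 and 2 agree exactly

print("rater means relative to overall mean:", np.round(severity, 2))
print("exact agreement, rater 1 vs. rater 2:", exact_agreement)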
Edward W. Wolfe, Ph.D.
Senior Research Scientist
Assessment & Information
Pearson
Wednesday, March 03, 2010
An ATP Newbie Reflects…
I walked into the opening shindig of the annual meeting of the Association of Test Publishers (ATP) (appropriately Super Bowl-themed on 2/7/10 – congrats, Saints!) and was struck by déjà vu. I felt the same eerie trepidation and bemusement as at my first educational conference back in 2000. Despite many years in assessment, I knew very few people. It was only later that I realized who the players were: influential industry leaders, professors whose work I had studied in college, textbook authors I had been required to read, and people I had watched give presentations across the country; competitors and colleagues alike. It occurred to me that they had much in common with me, and I began to relax.
The Opening Session introduced Scott Berkun, author of "The Myths of Innovation", who challenged attendees: What is innovation, and how does it REALLY happen? I thought of Edison's "Genius is 1% inspiration and 99% perspiration." Berkun's message (chapter 7 of his book) was that history has held few "epiphany" ("aha") moments; far more often, innovators tried ideas and doggedly pursued them until success was achieved. The failures are buried in the annals of time, like all the Roman architecture that wasn't the Coliseum. Keep asking: What is innovation, and is this it?
I enjoyed sessions on innovative items in assessment. "Assessing the Hard Stuff with Innovative Items" covered approaches from Medical Examiners, Certified Public Accountants, Medical Sonographers, and Architects. The simulation-rich examples and expanded item types (e.g., interactive tasks; expanded response options like drop-down lists, forms/notes/orders, and drawing/annotation tools; and interactive response options like hotspots and drag-and-drops) were interesting to consider. "Are You Ready for Innovative Items" was a how-to on considerations for implementing innovative items and really outlined the potential pitfalls in innovation. The first was more intellectually interesting, but the second was a good overview for those of you new to innovative item formats.
The Education division meeting was another interesting event. As the newly appointed Secretary, I was unexpectedly asked to step into the Vice Chair role. WOW, nothing like a promotion when you attend your first conference -- or a foreshadowing of how much work we need to do as a group. Steve Lazer from ETS accepted the Chair role, and Jim Brinton of Certification Management Services volunteered for Secretary. Now we have a full slate of officers ready to serve!
Despite our commitment to service, the Education division appears to suffer from an identity crisis. We discussed how to increase ATP membership and conference attendance, but I failed to see the value proposition of membership for all groups. This is a trade association that should be working for us -- its members. I am puzzled and concerned by the discussion about state government entities (acting as publishers) being unable to join, since this is, after all, a trade organization. Moving forward, I hope to better understand the mission and goals of the Education division so I can help resolve this identity crisis!
Respectfully submitted,
Karen Squires Foelsch
VP, Content Support Services
(ATP neophyte and new TrueScores blogger)