Tuesday, April 27, 2010

This blog has moved


This blog is now located at http://www.truescores.com/.

For feed subscribers, please update your feed subscriptions to
http://www.truescores.com/feeds/posts/default.

Wednesday, April 07, 2010

Performance-based Assessment Redux

Cycles in educational testing continue to repeat. The promotion and use of performance-based assessments is one such cycle. Performance-based assessment involves the observation of students performing authentic tasks in a domain. The assessments may be conducted in a more- or less-formal context. The performance may be live or may be reflected in artifacts such as essays or drawings. Generally, an explicit rubric is used to judge the quality of the performance.

An early phase of the performance-based assessment cycle was the move from performance-based assessment to multiple-choice tests, as documented in Charles Odell’s 1928 book, Traditional Examinations and New-type Tests. The “traditional examinations” Odell referred to were performance-based assessments. The “new-type tests” Odell referred to were multiple-choice tests that were beginning to be widely adopted in education. These “new-type tests” were promoted as an improvement over the old performance-based examinations in efficiency and objectivity. However, Odell had doubts.

I am not old enough to remember the original movement from performance-based assessment to multiple-choice tests, but I am old enough to remember the performance-based assessment movement of the 1990s. As I remember it, performance-based assessment was promoted in reaction to the perceived impact of multiple-choice accountability tests on teaching. Critics worried that the use of multiple-choice tests in high-stakes accountability testing programs was influencing teachers to teach to the test, i.e., to focus on teaching the content of the test rather than a broader curriculum. Teaching to the test would then lead to inflated test scores that reflected rote memorization rather than learning in the broader curriculum domain. In contrast, performance-based testing was promoted as a solution that would lead to authentic student learning. Teachers who engage in teaching to a performance-based test would be teaching the actual performances that were the goals of the curriculum. An example of a testing program that attempted to incorporate performance-based assessment on a large scale was the Kentucky Instructional Results Information System.

It’s déjà vu all over again, as Yogi said, and I am living through another phase of the cycle. Currently, performance-based assessments are being promoted as a component of a balanced assessment system (Bulletin #11). Proponents claim that performance-based assessments administered by teachers in the classroom can provide both formative and summative information. As a source of formative information (Bulletin #5), the rich picture of student knowledge, skills and abilities provided by performance-based assessment can be used by teachers to tailor instruction to address individual students’ needs. As a source of summative information, the scores collected by teachers using performance-based assessment can be combined with scores from large-scale standardized tests to provide a more balanced view of student achievement. In addition, proponents claim that performance-based assessments are able to assess 21st Century Skills whereas other assessment formats may not.

But current performance-based assessments still face the same technical challenges that they faced in the 1990s. A major technical challenge facing performance-based assessments is adequate reliability of scores. Variance in both teachers’ ratings and task sampling may contribute to unacceptably low score reliability for scores used for summative purposes.

A second major challenge facing performance-based assessments is adequate evidence of validity. Remember that performance-based assessment scores are being asked to provide both formative and summative information. But validity evidence for formative assessment stresses consequences of test score use whereas validity evidence for summative assessment stresses more traditional sources of validity evidence.

A third major challenge facing performance-based assessments is the need for comparability of scores across administrations. In the past, the use of complex tasks and teacher judgments has made equating difficult.

Technology to the rescue! Technology can help address the many technical challenges facing performance-based assessment in the following ways:
  • Complex tasks and simulations can be presented in standardized formats using technology to improve standardization of administration and broaden task sampling;
  • Student responses can be objectively scored using artificial intelligence and computer algorithms to minimize unwanted variance in student scores;
  • Teacher training can be detailed and sustained using online tutorials so that ratings are consistent both within teachers (across students and occasions) and across teachers; and,
  • Computers and hand-held devices can be used to collect teachers’ ratings across classrooms and across time so that scores can be collected without interrupting teaching and learning.

Save your dire prediction for others, George Santayana. We may not be doomed to repeat history, after all. Technology offers not just a response to our lessons from the past but a way to alter the future.

Paul Nichols, PhD
Vice President
Psychometric & Research Services
Assessment & Information
Pearson

Wednesday, March 10, 2010

Some Thoughts About Ratings…

I spend a lot of time thinking about ratings. One reason is that I’ve either assigned or been subjected to ratings many times during my life. For example, I review numerous research proposals and journal manuscripts each year, and I assign ratings that help determine whether the proposed project is funded or the manuscript is published. I have entered ratings for over 1,000 movies into my Netflix database, and in return, I receive recommendations for other movies that I might enjoy. My wife is a photographer, and one of my sons is an artist, and they enter competitions and receive ratings through that process with hopes of winning a prize. My family uses rating scales to help us decide what activities we’ll do together, so much so that my sons always ask me to define a one and a ten when I ask them to rate their preferences on a scale of one to ten.

In large-scale assessment contexts, the potential consequences associated with ratings are much more serious than these examples, so I’m surprised at the relatively limited amount of research that has been dedicated to studying the process and quality of those ratings over the last 20 years. While writing this, I leafed through a recent year of two measurement journals, and I found only three articles (out of over 60 published articles) relating to the analysis of ratings. I’ve tried to conduct literature reviews on some topics relating to large-scale assessment ratings for which I have found few, if any, journal articles. This dearth of research relating to ratings troubles me when I think about the gravity of some of the decisions that are made based on ratings in large-scale assessment contexts and the difficulty of obtaining highly reliable measures from ratings (not to mention the fact that scoring performance-based items is an expensive undertaking).

Even more troubling is the abandonment, by some, of the entire notion of using assessment formats that require ratings because of these difficulties. This is an unfortunate trend in large-scale assessment, because there are many areas of human performance that simply cannot be adequately measured with objectively scored items. The idea of evaluating written composition skills, speaking skills, artistic abilities, and athletic performance with a multiple-choice test seems downright silly. Yet, that’s what we would be doing if the objective of the measurement process were to obtain the most reliable measures. Clearly, in contexts like this, the authenticity of the measurement process is an important consideration, arguably as important as the reliability of the measures.

So, what kinds of research need to be done relating to the analysis of ratings in large-scale assessment contexts? There are numerous studies of psychometric models and statistical indices that can be utilized to scale ratings data and to identify rater effects. In fact, all three of the articles that I mentioned above focused on such applications. However, studies such as those do little to address the basic problems associated with ratings. For example, very few studies examine the decision-making process that raters use when assigning ratings. There are also very few studies of the effectiveness of various processes for training raters in large-scale assessment projects; see these three Pearson research reports for examples of what I mean: Effects of Different Training and Scoring Approaches on Human Constructed Response Scoring; A Comparison of Training & Scoring in Distributed & Regional Contexts - Writing; and A Comparison of Training & Scoring in Distributed & Regional Contexts - Reading. Finally, there are almost no studies of the characteristics of raters that make them good candidates for large-scale assessment scoring projects. Yet most of the decisions made by those who run scoring projects hinge on these three issues: Who should score, how should they be trained, and how should they score? It sure would be nice to make better progress toward answering these three questions over the next 20 years than we have during the past 20.

Edward W. Wolfe, Ph.D.
Senior Research Scientist
Assessment & Information
Pearson

Wednesday, March 03, 2010

An ATP Newbie Reflects…

I walked into the opening shindig of the annual meeting of the Association of Test Publishers (ATP) (appropriately Super Bowl-themed on 2/7/10 – congrats Saints!) and was struck by déjà vu. I eerily felt the same trepidation and bemusement as at my first educational conference back in 2000. Despite many years in assessment, I knew very few people. It was only later that I realized who the players were and that these were influential industry leaders: professors I had studied under in college, textbook authors I was required to read, people I had observed giving presentations across the country, competitors and colleagues alike. It occurred to me that they had much in common with me and I began to relax.

The Opening Session introduced Scott Berkun, author of “The Myths of Innovation”, who challenged attendees -- What is innovation and how does it REALLY happen? I thought of Edison’s “Genius is 1% inspiration and 99% perspiration”. Berkun’s message (chapter 7 of his book) was: throughout history there were few “epiphany” (“ah ha”) moments -- more often, people tried ideas and doggedly pursued them until success was achieved. Failures are buried in the annals of time, like Roman architecture other than the Coliseum... Keep asking -- What is innovation and is this it?

I enjoyed sessions on innovative items in assessment. “Assessing the Hard Stuff with Innovative Items” covered approaches from Medical Examiners, Certified Public Accountants, Medical Sonographers and Architects. The simulation-rich examples and expanded item types (e.g., interactive tasks; expanded response options like drop down lists, forms/notes/orders, drawing/annotation tools; and interactive response options like hotspots and drag-and-drops) were interesting to consider. “Are You Ready for Innovative Items” was a how-to on considerations for implementing innovative items and really outlined the potential pitfalls in innovation. The first was more intellectually interesting but the second was a good overview for those of you new to innovative item formats.

The Education division meeting was another interesting event. As the newly appointed Secretary, I was, to my surprise, asked to step into the Vice Chair role. WOW, nothing like a promotion when you attend your first conference -- or a foreshadowing of how much work we need to do as a group. Steve Lazer from ETS accepted the Chair role and Jim Brinton of Certification Management Services volunteered for Secretary. Now we have a full slate of officers ready to serve!

Despite our commitment to service, the Education division appears to suffer from an identity crisis. We discussed how to increase ATP membership and conference attendance but I failed to see the value proposition of membership for all groups. This is a trade association that should be working for us -- its members. I am puzzled and concerned by the discussion about the inability of state government entities (acting as publishers) to join -- since this is a trade organization. However, moving forward I hope to better understand the mission and goals of the Education division so I can help resolve this identity crisis!

Respectfully submitted,
Karen Squires Foelsch
VP, Content Support Services
(ATP neophyte and new TrueScores blogger)

Monday, June 29, 2009

Pearson is Fulfilling the Goal to be the Nation’s Thought Leader in Assessment

One of the primary objectives of Pearson as the leading provider of educational measurement research is to lead the effort on effective educational policy discussion. Sometimes these efforts are clearly articulated in customer-facing actions (like legally defensible setting of student performance standards), academic research publications or conference presentations. Other times, policy and/or position papers are prepared to inform our customers and others regarding the direction in which Pearson is steering education. I was recently involved in the development of such a paper and wanted to share it with you in this post.

“Using Assessments to Improve Student Learning and Progress” is a very interesting paper that clarifies the roles of large-scale, high-stakes assessments as contrasted with classroom assessments. While I have made such comparisons in other TrueScores posts, this paper is much more comprehensive.

Here is a brief excerpt of the distinctions made in the paper:
“Assessments for learning provide the continuous feedback in the teach-and-learn cycle, which is not the intended mission of summative assessment systems. Teachers teach and often worry if they connected with their students. Students learn, but often misunderstand subtle points in the text or in the material presented. Without ongoing feedback, teachers lack qualitative insight to personalize learning for both advanced and struggling students, in some cases with students left to ponder whether they have or haven’t mastered the assigned content.”
This paper also contains links to other Pearson-related efforts to inform and shape public policy and opinion, as evidenced by the following excerpt:

“Assessments for learning are part of formative systems, where they not only provide information on gaps in learning, but inform actions that can be taken to personalize learning or differentiate instruction to help close those gaps. The feedback loop continues by assessing student progress after an instructional unit or intervention to verify that learning has taken place, and to guide next steps. As described by (Pearson authors) Nichols, Meyers, and Burling:
‘Assessments labeled as formative have been offered as a means to customize instruction to narrow the gap between students’ current state of achievement and the targeted state of achievement. The label formative is applied incorrectly when used as a label for an assessment instrument; reference to an assessment as formative is shorthand for the particular use of assessment information, whether coming from a formal assessment or teachers’ observations, to improve student achievement. As Wiliam and Black (1996) note: “To sum up, in order to serve a formative function, an assessment must yield evidence that…indicates the existence of a gap between actual and desired levels of performance, and suggests actions that are in fact successful in closing the gap.”’”

This quote also shows how the Pearson themes are indeed consistent in that personalized learning is supported through the Pearson "teach and learn" cycle as informed by assessment, one of Pearson's primary goals. So, go check it out!