Tuesday, July 26, 2011

My Cat Is Dead! -- A Physics Problem


Let me start by saying that I am a physics nut who has been known to worship at the altar of Richard Feynman on occasion. So, you will not likely be surprised to discover that I was mesmerized by a recent article in the New Yorker by Rivka Galchen regarding quantum computing. You might be surprised that an old psychometrician reads the New Yorker (or even that I can read), but let me assure you I have had a subscription in one form or another since 1974.

Before I address quantum mechanics as it applies to computing and discuss Galchen’s article, it is best to present a little background on quantum physics so that we have a context in which to evaluate its application. The world of quantum physics is bizarre if not unbelievable. For example, if you modify the traditional physics thought experiment commonly known as “Schrödinger's Cat” along the lines of Niels Bohr or Hugh Everett, you will understand its complexity.

Think of a box containing a cat. If I asked you “Is the cat alive or dead?” you would probably say it is one or the other but certainly not both. Well, with one interpretation of quantum physics (the “many-worlds interpretation of quantum mechanics”), the cat is both alive and dead simultaneously. The cat is alive in one universe while dead in another (or many) based on the cumulative probabilities of such events across all universes. Hence, depending upon what universe you are in when you observe the cat (read “universe” as “dimension” if you are struggling—like in the Twilight Zone television shows), you will see the cat either alive or dead.

But wait! There is a catch according to the quantum physicists—as soon as you look at the cat you become “entangled” and the cat’s state (dimension or world) is no longer separate from yours. Hence, if you see a dead cat, there is another universe (or dimension) where you are looking at a live cat! Makes great sense, right?

Before you dismiss this quantum stuff as all nonsense, let’s look at a separate but analogous concept in probability theory that is not so controversial: the Let’s Make a Deal problem, or “the Monty Hall Problem.” Suppose there are three curtains; behind one waits a prize and behind each of the other two a rock. You don’t have to be a statistical genius to know that the probability of selecting the correct curtain is one out of three or 1/3 (0.3333…), at least at the outset of the experiment. However, once you pick a curtain (even before it is revealed to you), you have either won (1.0) or lost (0.0). Only over the long run or accumulation of probabilities will your “sampling distribution” average out to 1/3, and only if the game is played fairly.

However, when Monty eliminates a curtain and gives you a chance to keep your selection or switch, your probability of winning goes up substantially if you switch to the other curtain: from 1/3 to 2/3, provided Monty always opens a losing curtain that you did not pick. Hence, even in current probability theory there seem to be separate dimensions: you have one probability of winning when you selected the first curtain and a different one if you switch. Notice also that once you make a final choice (i.e., once you become “entangled” with the probability) you either win or lose.
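If the benefit of switching sounds too good to be true, it is easy to check by simulation. The following is only a minimal sketch, assuming the standard rules (Monty knows where the prize is and always opens a losing curtain you did not pick); the function names, the number of rounds, and the random seed are my own choices for illustration.

```python
import random

def play_round(switch, rng):
    """Play one three-curtain game; return True if the player wins the prize."""
    curtains = [0, 1, 2]
    prize = rng.choice(curtains)
    pick = rng.choice(curtains)
    # Monty opens a curtain that is neither the player's pick nor the prize.
    opened = rng.choice([c for c in curtains if c != pick and c != prize])
    if switch:
        # Move to the one curtain that is neither picked nor opened.
        pick = next(c for c in curtains if c != pick and c != opened)
    return pick == prize

def win_rate(switch, rounds=100_000, seed=0):
    """Estimate the long-run share of wins for a stay or switch strategy."""
    rng = random.Random(seed)
    wins = sum(play_round(switch, rng) for _ in range(rounds))
    return wins / rounds

stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}  switch: {switch_rate:.3f}")
```

Over many rounds the "stay" strategy should settle near 1/3 and the "switch" strategy near 2/3, which is the long-run averaging described above.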

If you are still reading :), a logical question you might be asking is: “What does this quantum stuff and simultaneous probabilities have to do with computing and technology?” Galchen introduces us to a professor from Oxford by the name of David Deutsch. David is an odd professor because he hates to teach (or rather, he hates to teach people who do not want to learn, and sees this as one of the biggest problems of our educational systems), but I digress. Professor Deutsch “works” (to the extent you define just what it is a professor does) at the Oxford Centre for Quantum Computation, which is part of Oxford’s Clarendon Physics Laboratory (this is the same university hosting the Oxford University Centre for Educational Assessment established with a grant from Pearson).

As Galchen describes it, quantum computing is based on quantum mechanics, which simply states that particles exist in two places at once—a quality called superposition. Furthermore, these particles are related in a “spooky” way, or are entangled, such that they can coordinate their properties regardless of their distance apart in space or time. Finally, when we actually observe these particles, we obliterate them! This last case is analogous to looking at the cat in the box and seeing it is dead—because the sheer fact of looking at it caused the result to be a specific outcome!

Unlike a traditional computer that uses bits (“0” and “1”) to represent states, the quantum computer uses qubits (“0,” “1” and “0 and 1”). Hence, these entangled particles can share information, resulting in the ultimate “distributed computing” model. Conceptually, if we set a qubit to “0” in this state, we know that it has another particle in “another world” that can carry a “1” because it is entangled with this world. Hence qubits perform “double duty” across dimensions. One way to understand this is to think about what Everett says about this concept: in quantum computing when there is more than one possible outcome, ALL of the outcomes exist simultaneously, each in different universes. As such, all we have to do is figure out which one is the “correct one” and look only at that one.
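For readers who like to see the arithmetic, here is a rough sketch of the “0 and 1” idea using the standard textbook picture of a single qubit as a pair of amplitudes. This is my own illustration, not something taken from Galchen’s article or from the Oxford lab; the function names are hypothetical, and the Hadamard operation is simply the usual way to turn a definite “0” into an equal mix of “0” and “1.”

```python
import math

def measurement_probabilities(amp0, amp1):
    """Return (P(0), P(1)) for a qubit with amplitudes amp0 and amp1."""
    norm = abs(amp0) ** 2 + abs(amp1) ** 2
    return abs(amp0) ** 2 / norm, abs(amp1) ** 2 / norm

def hadamard(amp0, amp1):
    """Apply the Hadamard gate, which mixes the two amplitudes equally."""
    s = 1 / math.sqrt(2)
    return s * (amp0 + amp1), s * (amp0 - amp1)

zero = (1.0, 0.0)        # a definite "0", like a classical bit
plus = hadamard(*zero)   # the "0 and 1" superposition

p0, p1 = measurement_probabilities(*plus)
print(f"P(0) = {p0:.2f}, P(1) = {p1:.2f}")
```

In this picture the superposed qubit gives “0” or “1” with equal probability when observed, and observing it forces one specific outcome, much like opening the box with the cat.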

If it works, we will unleash computing power never realized before. For example, it would take a current computer a lifetime to factor a 200-digit number, whereas a quantum computer, depending upon its number of qubits, would take a fraction of a second. This is not as fantastic as it sounds. The Oxford Centre has an eight-qubit computer that looks like an “air hockey table,” according to Galchen, but with added lasers, optics and magnetic fields to control for contamination, as do other centers in Canada, Singapore and New Haven.

Outside of the power such enhanced computing would provide to us, why would we care? Well, to begin with, one of the organizations paying for and experimenting with quantum computing is Google. If Google invents and/or holds a patent on quantum computing, I am sure they can bring it to market and we could use it. I think it would be much better though, if we could see the same future Google sees, get in on the ground floor (particularly with a place as prestigious as Oxford) and allow them to help us open our eyes to the types of innovation the future will require.

Let me know what you think.

Jon S. Twing, Ph.D.
Executive Vice President & Chief Measurement Officer
Test, Measurement & Research Services
Assessment & Information
Pearson



Monday, May 09, 2011

What is “Out-of-Level” Content for the Digital Learner?

The influential Digital Learning Now! report provides a roadmap that lawmakers and policymakers can use to integrate digital learning into education. Among other elements, this report calls out the importance of personalized learning: the idea that all students can customize their education using digital content through an approved provider. A related concept called out by the report is that student learning is the metric for evaluating the quality of content and instruction.

I strongly support personalized learning, but I have long puzzled over how personalized learning and accurate measurement of student learning can be reconciled with standards-based assessment and accountability. The fundamental purpose of the No Child Left Behind (NCLB) legislation
“is to ensure that all children have a fair, equal, and significant opportunity to obtain a high-quality education and reach, at a minimum, proficiency on challenging State academic achievement standards and state academic assessments.”
There have been positive aspects to NCLB, to be sure, but the strict requirement for on-grade-level assessment of state-established content has narrowed the curriculum and stifled innovative assessment approaches. For example, because of this grade-level requirement the United States Department of Education (USED) discouraged the idea of computerized adaptive testing until only recently.

The on-grade-level requirement is also reflected in the 2007 regulations allowing students with disabilities to take modified assessments with modified achievement standards. These assessments apply to
“a limited group of students with disabilities who may not be able to reach grade-level achievement standards within the same time frame as other students.”
To satisfy these regulations, states developed assessments that reportedly measured regular grade-level content standards by using items that were simplified almost to the point of being caricatures of the items on the regular assessments. For example, some states created the modified items by removing the most attractive incorrect option, leaving three choices for students—the correct response and the two most obvious incorrect options. The alternate and quite logical idea of instructing this targeted group of students in prerequisite content commonly taught in earlier grades and assessing students accordingly is decidedly out of compliance with the regulations.

We are now embarking on a new era of educational reform, and President Obama has outlined his blueprint to reauthorize the Elementary and Secondary Education Act (ESEA). It focuses on better preparing students for college and the workplace. It also emphasizes diversity of learners, innovation, and improving capacity at the state and district levels to support the effective use of technology. If done right, the new legislation will embrace the digital learner and the personalization of instruction and assessment. This can and should include measuring all learners against rigorous college and career-ready standards, but at the same time, it should encourage the use of technology to adapt instruction for the learner and adapt assessments to measure growth in learning, that is, growth from the most appropriate (and possibly off-grade-level) starting point for the gifted student, the struggling student, or those in between.

In this brave new digital world, it seems to me that “on-grade-level content” may serve only as a milestone along the path towards the ultimate goal of college and career readiness. As long as instruction and assessment can be offered within a sequence supported by pedagogy or learning progressions, no content should be considered out-of-level for the digital learner.

Walter (Denny) Way, Ph.D.
Senior Vice President, Psychometric & Research Services
Assessment & Information
Pearson

Friday, March 25, 2011

Performance-based assessments: A brave, new world

Performance assessments. Performance-based assessments. Authentic assessments. Constructed response, open-ended, performance tasks. Our field has devised many terms to describe assessments in which examinees demonstrate some type of performance or create some type of product. Whatever you call them, performance-based assessments (PBAs) have a long history in educational measurement with cycles of ups and downs. And once again, PBAs are currently in vogue. Why?

To address the federal government’s requirements for assessment systems that represent “the full performance continuum,” the two consortia formed in response to Race to the Top funding have both publicized assessment plans that involve a heavy dose of performance-based tasks. Thus, PBAs are relevant to any discussion about the future of testing in America.

The old arguments in favor of PBAs are still appealing to today’s educators, parents, and policy-makers. Proponents claim these types of tests are more motivating to students. They provide a model for what teachers should be teaching and students should be learning. They serve as professional development opportunities for teachers involved in developing and scoring them. They constitute complex, extended performances that allow for evaluation of both process and product. Moreover, performance-based tasks provide more direct measures of student abilities than multiple-choice items. They are able to assess students’ knowledge and skills at deeper levels than traditional assessment approaches and are better suited to measuring certain skill types, such as writing and critical thinking. They are more meaningful because they are closer to criterion performances. (Or so the story goes. To be fair, these are all claims requiring empirical validation.)

Despite their recent renaissance, PBAs have well-known limitations: lower reliability and generalizability than selected-response items, primarily because of differences in efficiency between the two task types (one hour of testing time buys you many fewer performance-based tasks than multiple-choice items). But these limitations also arise because PBAs are frequently scored by humans—a process that introduces a certain amount of rater error. In exchange for greater depth of content coverage, PBAs compromise breadth of coverage. Generalizability studies of PBAs have found that significant proportions of measurement error are attributable to task sampling, manifested in both person-by-task interactions and person-by-task-by-occasion interactions (in designs incorporating occasion). Again, this is largely because there are many fewer performance-based tasks on any given test.
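The efficiency point can be made concrete with the Spearman-Brown prophecy formula, which predicts how reliability changes as comparable tasks are added to a test. The sketch below is illustrative only: the single-task reliability and the tasks-per-hour counts are hypothetical numbers I picked, and I assume (unrealistically) that one performance task and one multiple-choice item are equally reliable so that only the count differs.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor
    using comparable (parallel) tasks."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical numbers, chosen only for illustration.
single_task_reliability = 0.10   # assumed reliability of one task or item
pba_tasks_per_hour = 4           # assumed number of performance tasks in an hour
mc_items_per_hour = 40           # assumed number of multiple-choice items in an hour

pba_hour = spearman_brown(single_task_reliability, pba_tasks_per_hour)
mc_hour = spearman_brown(single_task_reliability, mc_items_per_hour)
print(f"4 tasks: {pba_hour:.2f}, 40 items: {mc_hour:.2f}")
```

Under these assumptions the hour of multiple-choice items comes out far more reliable than the hour of performance tasks, simply because there are ten times as many observations. Rater error, which the formula does not model, would widen the gap further.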

PBAs are used in a variety of contexts, including summative, high-stakes contexts, such as certification and licensure, as well as employment and educational selection. PBAs are also used for formative or instructional purposes. When well-designed PBAs are administered and scored in the classroom, they can provide information for diagnosing student misconceptions, evaluating the success of instruction, and planning for differentiation because of their instructional alignment and because student performances offer windows into students’ thinking processes. In high-stakes contexts, strict standardization of task development, administration, and scoring is critical for promoting comparability, reliability, and generalizability. In classroom assessment contexts, such rigid standardization may be relaxed. Clearly, what makes a particular PBA useful for one context will make it less so for the other. For example, standardization of task development, administration, and scoring (which is impractical in classroom settings anyway) moves assessment further from instruction and makes it less amenable to organic adjustment by the teacher to meet student needs. In turn, the unstandardized procedures typically favored in classroom settings—extended administration time, student choice of tasks, student collaboration—introduce construct-irrelevant variance and diminish the comparability of tasks that is necessary for high-stakes contexts.

Bottom line: PBAs are here for the foreseeable future. The measurement community needs to revise its expectations about the reliability of individual assessment components. PBAs will prove less reliable than traditional assessment approaches! However, we should move forward with the expectation that this compromise in reliability means an upgrade in terms of greater construct validity for skills not easily assessed using traditional approaches. In addition, I suggest we focus on the reliability, comparability, and generalizability of scores and decisions emanating from assessment systems that incorporate multiple measures representing multiple assessment contexts taken over multiple testing occasions (e.g., through-course assessments).

Moreover, there are ways of making PBAs, even those administered in the classroom, more reliable and comparable. Although pre-service teacher training in measurement and assessment is notoriously weak (e.g., see Rick Stiggins’ work), teachers can be taught to design assessments according to measurement principles. For example, carefully crafted test specifications can go a long way in creating comparable tasks. Although the measurement field has traditionally avoided classroom assessment, I suggest we consider participating in collaborative initiatives to create curricula with psychometrically sound, embedded PBAs (e.g., see Shavelson and company’s SEAL project).

Doing this well will require new assessment development models that incorporate close collaboration between curriculum designers and assessment developers to ensure tight alignment and seamless integration of assessment and instruction. Such models will also require closer collaboration between the content specialists who write the tasks and the psychometricians charged with collecting evidence to support overall assessment validity and reliability. Finally, such embedded assessments will need to be piloted—not only to investigate task performance, but also to obtain feedback from teachers regarding assessment functionality and usefulness.

It’s a brave, new world of assessment. To truly advance and sustain these developments, we need to start thinking in brave, new ways. Such an approach will help ensure that the current wave of performance assessment has more staying power than the last.

Emily Lai
Associate Research Scientist
Test, Measurement & Research Services
Assessment & Information
Pearson

Monday, March 14, 2011

Teacher Effectiveness Measures: The Tortoise and the Hare



Several recent initiatives have fueled a firestorm of debate around measuring the effectiveness of our teachers. In the competition for billions of dollars through both Race to the Top and the Teacher Incentive Fund, responding states and districts were required to propose measures of teacher effectiveness that incorporate student growth data. Most of the applicants that were awarded these funds proposed weighting student data up to 50% within the measure. In an effort to increase teacher accountability, high stakes have been proposed for these effectiveness measures. They may be used to make decisions related to employment, like promotions and dismissals, as well as monetary bonuses.

In my opinion, there have been two kinds of responses to these teacher effectiveness measures. First, there’s been what I classify as the “sidelines” response where some researchers and teachers simply don’t want to take part in the development of these measures. Researchers claim that the value-added models that have been proposed to estimate teacher effects based on student data are flawed, whereas teachers claim that their work cannot be accurately and fairly reflected by student test scores alone. While these points are legitimate, those on the sidelines don’t tend to suggest alternative solutions.

In stark contrast to the sidelines response, there is also the response that I liken to the hare from the famous fable. Some policy-makers and vendors have ignored the debates and taken off in the race to provide a solution to teacher effectiveness. By moving directly to implementation of such measures, the policy-makers claim they will encourage reform within schools, and the vendors are happy to accept the new business.

I have followed both responses for nearly a year. Initially, I fell in line with many colleagues and wanted no part of using statistical models and student data to produce estimates of teacher effectiveness. I am now resigned to the fact that the sidelines response is unproductive. Refusing to participate does not impact the inevitability of teacher effectiveness measures and can be seen as a refusal to contribute to a solution.

So now that I’m ready to step off the sidelines and engage in teacher effectiveness measures, I realize that those who chose the hare response are so far ahead that they are no longer visible on the horizon. But perhaps they have paused to take a nap. By producing a solution to teacher effectiveness so quickly, perhaps this group has failed to provide the research and documentation needed to support and sustain such measures. Likely, they did not engage stakeholders to elicit feedback about what effective teaching really means. There has not been time to conduct validity studies that empirically link their measure to results in the classroom. While the hares in our industry may have gotten off to the fastest start when it comes to teacher effectiveness, I suspect that their lead will not last.

I suggest that neither the sidelines nor the sprint is advisable. Rather, I advocate for what could be called the tortoise approach. It’s not a quick-fix solution, but with more time, a valuable and defensible measure of teacher effectiveness can be defined and established.


I suggest we tortoises follow steps and standards comparable to those used for assessment development:

  • The first step could be similar to a content review, where stakeholders convene to discuss and identify the essential components of teacher effectiveness.

  • Then content experts can partner with psychometricians to design and refine measures of these essential components.

  • Next, the measures could be piloted and validity studies could be performed. Standard setting could be used to establish the line that divides effective from ineffective teaching.

  • Custom reports could compare individual teacher performance to that of the schools’ teachers or teachers with similar student populations.

  • Professional development activities could be offered to help schools improve the skills of lower performing teachers.
Each of these steps takes time but also provides essential information needed to develop a valid and useful measure of teacher effectiveness.

Without sufficient evidence to support the defensibility and validity of the hares’ quick-fix solutions to teacher effectiveness, I wonder if they will even reach the halfway point, let alone the finish line. In contrast, we tortoises can continue to move forward building measures with known procedures, valuable stakeholder input, and informative data analysis. Let’s heed the advice of The Tortoise and the Hare and work to establish a quality measure of teacher effectiveness rather than the fastest solution.

Tracey Magda
Psychometrician
Evaluation Systems
Assessment & Information
Pearson

Friday, February 25, 2011

Old Dogs Need New Tricks


As soon as I finished graduate school my husband and I got two puppies, small Italian Greyhounds named Cyclone and Thunder. We named them after the mascots from our undergraduate universities. We took them everywhere with us and spoiled them rotten until the kids arrived. Once we had a toddler walking around the house, it was obvious that these old dogs needed to learn some new tricks. We had to teach them not to jump up on the kids or lick their faces. We had gotten small dogs because we weren’t going to have a lot of time to really train them, but now we knew we were going to have to be creative in teaching them that it really wasn’t okay to treat a toddler like a lollipop. It was going to be a challenge, but we were motivated.

Recently I was able to attend an event related to one of my challenges at work. The Center for K-12 Assessment and Performance Management at ETS hosted a research symposium on Through-Course Summative Assessments. Attendance at the conference was by invitation only, and each organization was able to send only one or two participants. I was excited to represent Pearson among such a distinguished group of researchers and thinkers. Through-course summative assessment poses some incredible challenges for traditional psychometrics, and I was eager to hear the recommendations from leaders in the field on issues such as how to handle reliability, validity, scoring, scaling, and growth measures in these types of assessments. Instead, the eight papers generally focused on the identification of issues and challenges, resulting in many recommendations for further research. Very few solutions were proposed, and many of the solutions that were proposed did not seem very viable for large-scale testing. It was clear to me that, just like my Italian Greyhounds, we old psychometricians will need some new tricks.

Although most of the papers focused on somewhat technical measurement topics, the audience at the symposium was really a mix of technical and policy experts. The tension between those viewpoints was evident throughout the conference. As set out in the presentations of the initial policy context, the next-generation assessment designs proposed by the Partnership for Assessment of Readiness for College and Careers (PARCC) and by the SMARTER Balanced Assessment Consortium (SBAC), complete with formative, interim, through-course, summative, and performance components, will be used to:

  • signal and model good instruction,
  • evaluate teachers and schools,
  • show student growth and status toward college and career readiness,
  • diagnose student strengths and weaknesses to aid instructional decision making,
  • be accessible to all student populations regardless of language proficiency or disability status, and
  • allow the United States to compete with other nations in a global economy.
That’s a tall order! It’s like trying to teach a puppy to sit, stay, roll over, and fetch at the same time.

The policy goals and several of the desired policy uses of the assessments are clear. What is not as clear is what psychometric models can be used to support these claims. It was mentioned more than once at the conference that if a test has too many purposes, it is unlikely that any purpose will be well met. I think it’s clear, however, that the new assessments will be used for all those purposes, and the assessment community must find a way to support them.

Too often the psychometric mantra has been “Just say no.” If you recall, that was the advertising campaign for the war on drugs in the 1980s and 1990s. It’s time to move to the 21st century. Assessments will be used for more than identifying how much grade-level content a student has mastered. We may not have originally developed assessments to be used for evaluating teachers, but they are used for this and will continue to be. In the same way, high school assessments will be used to predict readiness for college and careers. Policy makers are asking for our help to design and provide validity evidence for assessments that will serve a variety of purposes. No, the assessments may not have the same level of standardization and tight controls, but they still can be better than an alternative design that excludes psychometrics entirely.

There is already a mistrust of testing and an overload of data. Moving forward, we need to work with teachers, campus leaders, parents, and the community to better involve them in the testing process and particularly in the processes for reporting and interpreting test results. Tests should not be administered simply to check off a requirement. The data produced from assessments should inform instruction, student progress, campus development, and more. The assessments are not isolated events, but rather part of a larger educational system of instruction and assessment with the goal of preparing students for college and careers. This is a worthy goal. As a trained psychometrician, I also struggle with determining how far we can push the boundaries in meeting this goal before we’ve stepped over the line. If I bathe my kids in lemon juice to keep the dogs from licking them, have I gone too far? It may seem like a crazy idea, but I can’t ignore the need to think differently.

Indeed, the next-generation assessments, including through-course summative assessments, will provide new challenges and opportunities for psychometrics and research. The research, however, must be focused on solving the practical challenges that the assessment consortia will face. States are looking to us to be creative and propose solutions, not develop a laundry list of problems. There is no perfect solution. Instead, psychometrics must take steps forward to present innovative assessment solutions that balance the competing priorities and bring us closer to the goal of improving education for all students. We must continue to research and use that research to refine and update the assessment systems.

As Stan Heffner, Associate Superintendent for the Center for Curriculum and Assessment at the Ohio Department of Education, discussed in his presentation, “This is a time to be propelled by ‘what should be’ instead of limited by ‘what is’.”

He was too polite to really say it, but I think he meant that old dogs need new tricks.

Katie McClarty
Senior Research Scientist
Test, Measurement & Research Services
Assessment & Information
Pearson