Thursday, February 01, 2007

5-point, 7-point, 9-point, oh my!

One of the most common questions that comes up in assessment is: What type of rating scale should I use? (Followed closely by: What is it exactly that you do again?)

Usually when people ask this question, they're putting together an interview and (bless their hearts) are taking the time to think about how to grade candidate responses. But it doesn't have to be for an interview--rating scales are used for other types of tests, most notably work sample/performance tests. And then there's that whole "performance evaluation" thing.

So what does the research say?

Well, that depends on who you ask. Let's take a look at what some folks much smarter than me have said:

Nunnally (1978): "[the increase in reliability] tends to level off at about 7, and after about 11 steps there is little gain in reliability..."

Landy & Farr (1980): "...the format should require the individual to use one of a limited number of response categories, probably less than nine..."

Cascio (1998): "...4- to 9-point scales [are statistically optimal]..."

Pulakos, in Whetzel and Wheaton (1997): "A generally accepted guideline is somewhere between four and nine."

So what do we take from this? That generally we should shoot for between 4 and 11 scale points.

Does it really matter?

Probably not. Some of the most comprehensive studies of the topic have determined that when all is said and done, the number of points probably doesn't make that big of a difference. For example:

Wigdor & Green (1991): "...the consensus of professional opinion is that variations in scale type and rating format do not have a consistent, demonstrable effect on halo, leniency, reliability, or other sources of error or bias in performance ratings..."

Landy & Farr (1980), again: "...about 4%-8% of the variance in ratings can be explained on the basis of format."

Guion (1998): "The 5-point scale is so widely used that it seems as if it had been ordained on tablets of stone...there is little evidence that the number of scale units matters much..."

I suspect that if the problem is with the rating format, in most cases it's not that there are too few or too many categories, but that the categories aren't anchored very well. Have you ever had to rate an answer with a scale like this: Excellent - Satisfactory - Poor? What the heck does "Satisfactory" mean? A scale like that doesn't help the rater, doesn't lend itself to reliable and valid measurement, and certainly won't look good in court.
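To make the anchoring point concrete, here is a minimal Python sketch of what an anchored scale might look like as data. The interview question, the anchor wording, and the names `ANCHORS` and `describe` are all made up for illustration; none of them come from the sources quoted above.

```python
# Hypothetical 5-point scale for an interview question about handling
# an upset customer. The anchor wording is illustrative only.
ANCHORS = {
    1: "Blames the customer or offers no resolution.",
    3: "Apologizes and resolves the issue, but only after prompting.",
    5: "Listens, restates the problem, resolves it, and follows up.",
}


def describe(score: int) -> str:
    """Return the anchor for a score, or the two anchors it falls between."""
    if score not in range(1, 6):
        raise ValueError("score must be between 1 and 5")
    if score in ANCHORS:
        return ANCHORS[score]
    # Unanchored points (2 and 4) sit between the neighboring anchors.
    return f"Between: {ANCHORS[score - 1]!r} and {ANCHORS[score + 1]!r}"
```

The idea is simply that a rater looking at a "3" sees a described behavior rather than a bare word like "Satisfactory."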

The other big problem is rater training. Some organizations do a great job of training raters. Many don't. Without extensive rater training, you're just asking for all kinds of errors to enter into the equation. Panel members should pre-test the interview, try to poke holes in it, and discuss what they find.
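One rough way to check whether training is working is to have two raters score the same answers and compare. The sketch below is only an illustration under my own assumptions: the function name `agreement` and the ratings are invented, and exact agreement and average gap are simple stand-ins, not the reliability statistics used in the studies quoted above.

```python
from statistics import mean


def agreement(rater_a, rater_b):
    """Return (exact agreement rate, mean absolute difference) for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    pairs = list(zip(rater_a, rater_b))
    # Share of answers where both raters gave the same score.
    exact = sum(a == b for a, b in pairs) / len(pairs)
    # Average size of the disagreement, in scale points.
    gap = mean(abs(a - b) for a, b in pairs)
    return exact, gap


# Made-up ratings of six candidates on a 5-point scale.
exact, gap = agreement([3, 4, 2, 5, 3, 4], [3, 5, 2, 4, 1, 4])
```

If the raters disagree a lot on a pre-test, that is a cue to go back and talk through the anchors before using the scale for real.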

Bottom line

Back in 1956 a little article was published that you may have heard of. It was titled, "The magical number seven, plus or minus two: Some limits on our capacity for processing information." In this article, George Miller argued forcefully that humans seem to have a natural limit of around 7 (+/- 2) pieces of information that they can deal with simultaneously.

51 years later, we don't seem to have changed our minds much.
