Saturday, June 26, 2010

June '10 IOP Part II: "Test bias"

In Part I of this two-part post, I described the first focal article in the June 2010 issue of Industrial and Organizational Psychology (IOP), devoted to emotional intelligence. In this post, I will describe the second focal article, which focuses on test bias.

The focal article is written by Meade & Tonidandel and has at its core one primary argument. In their words:

"...the most commonly used technique for evaluating whether a test is biased, the regression-based approach outlined by Cleary (1968), is itself flawed and fails to tell us what we really want to know."

The Cleary method of detecting test bias is conducted by regressing a criterion (e.g., job performance) on test scores, a dichotomous demographic grouping variable such as sex or race, and their interaction. If prediction differs by group (e.g., different slopes or different intercepts), this suggests the test may be "biased". This situation is called differential prediction or predictive bias. The authors contrast this with "internal methods" of examining tests for bias, such as differential item and test functioning and confirmatory factor analysis, which do not rely on any external criterion data.
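
To make the regression-based approach concrete, here is a minimal sketch in Python using statsmodels. The DataFrame df and its column names (performance, test, group) are hypothetical illustrations, not anything from the focal article:

```python
import pandas as pd
import statsmodels.formula.api as smf

# df is a hypothetical applicant dataset with columns:
#   performance - criterion score (e.g., job performance rating)
#   test        - selection test score
#   group       - dichotomous demographic variable coded 0/1

def cleary_regression(df: pd.DataFrame):
    """Regress the criterion on test score, group, and their interaction.

    In this framework, a significant 'test:group' coefficient suggests a
    slope difference and a significant 'group' coefficient suggests an
    intercept difference across groups.
    """
    return smf.ols("performance ~ test + group + test:group", data=df).fit()

# Example usage (with a real df):
# results = cleary_regression(df)
# print(results.summary())
```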

Meade & Tonidandel state that while the Cleary method has become the de facto method for evaluating bias, there are several significant flaws with the approach. Most critically, the presence of differential prediction does not necessarily mean the test itself is the cause (which we would call measurement bias). Other potential causes are:

- bias in the criterion
- imperfect reliability of the test (measurement error)
- omitted variables (i.e., important predictors are left out of the regression)

Another important limitation of the Cleary method is that tests for slope differences suffer from low statistical power--a finding of no slope differences may reflect a small sample rather than a true absence of bias. In addition, because the intercept test has an inflated Type I error rate, one is likely to conclude that intercept differences are present when none truly exist.

Because of these limitations, the authors recommend the following general steps when investigating a test:

1. Conduct internal analyses examining differential functioning of items and the test.
2. Examine both test and criterion scores for significant group mean differences before conducting regression analyses.
3. Compute d effect size estimates for group mean differences for both the test and the criterion (a minimal computation is sketched after this list).
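
As a sketch of step 3, the standardized mean difference (Cohen's d) for each measure could be computed as below; the variable names are illustrative only, assuming scores have already been split by group:

```python
import numpy as np

def cohens_d(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Standardized mean difference between two groups, using the pooled SD."""
    n_a, n_b = len(scores_a), len(scores_b)
    var_a, var_b = scores_a.var(ddof=1), scores_b.var(ddof=1)
    pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (scores_a.mean() - scores_b.mean()) / pooled_sd

# Compute d separately for the test and the criterion:
# d_test = cohens_d(test_scores_group1, test_scores_group2)
# d_criterion = cohens_d(criterion_group1, criterion_group2)
```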

The authors present several scenarios where tests may be "biased" in the traditional sense but may or may not be "fair"--an important distinction. For example, one group may have higher performance scores while there is no group difference in test scores. Use of the predictor may result in one group being hired at a greater or lesser rate than it "should be", but judging the fairness of this requires consideration of organizational and societal goals (e.g., affirmative action, maximum efficiency) rather than simply an analysis of the tests.

The authors close with several overall recommendations:

1. Stop using the term "test bias" (it has too many interpretations and confounds different concepts).
2. Always examine both measurement bias and differential prediction.
3. Do not assume a test is unusable if differential prediction is indicated.

There are several commentary articles that follow the focal one. The authors of these pieces make several points, ranging from criticisms of techniques and specific statements to suggestions for further analyzing the issue. Some of the better comments/questions include:

- How likely is it that the average practitioner is aware of these issues and further is able to conduct these analyses? (a point the authors agree with in part)

- The approaches advocated mainly work with multi-item tests; things get more complicated when we're using several tests or an overall rating.

- It may not be helpful to create a "recipe" of recommendations (as listed above); rather we should acknowledge that each selection scenario is a different context.

- We are still ignoring the (probably more important) issue of why a test may or may not be "biased". An important consideration is the nature of group membership in demographic categories (including multiple group memberships).

Meade & Tonidandel provide a response to the commentaries and acknowledge several of the valid points raised, but they end with the same proposition with which they began the focal article:

"For 30 years, the Cleary model has served as the dominant guidelines for analyses of test bias in our field. We believe that these should be revisited."
