HR Tests - Recruitment, assessment, and personnel selection: Validity

Sunday, September 22, 2013

Research update: September, 2013

Okay, it's mega research update time!

First off, the September IJSA; lots of good stuff, including:

- a constructed response multimedia test for entry-level police resulted in minor ethnic group differences

- panel interviews once again prove their superiority (also: more on interview reliability)

- further analysis of the Hogan Personality Inventory with a Spanish sample

- how do applicants form impressions of person-organization fit? This study suggests contextual factors may be more important than interview content

- circumplex traits (combinations of personality factors) may predict counterproductive work behaviors better than simple FFM scores

- speaking of CWBs, conditional reasoning tests may not be the best predictor of them

- last but not least, what looks to be a good overview of competency modeling

Next up, the September JAP:

- an interesting, large study of the impact of candidate reactions on test scores, organizational perception, and criterion-related validity

- a study of the dynamics of the job search process and the impact of efficacy and focus

- highlighting certain factors during an interview may reduce discrimination toward pregnant applicants

Next, the Autumn 2013 Personnel Psychology:

- first, an important study of self-efficacy that suggests it is a product of past performance and not necessarily a predictor of future performance (free right now!)

- second, a study indirectly on selection that suggests that age diversity in work groups leads to more emotion regulation

Let's move on to the September JASP:

- okay, this may be a bit of a stretch, but if you're considering interviewing for a position as a dentist or a lawyer, make sure you suit up

- knowledge of service encounters predicts service effectiveness (and is related to conscientiousness)

- can use of biodata instruments result in adverse impact? This study suggests so, but also suggests that removal of problematic items has no impact on validity

Starting to wrap up, let's move to the October JOB:

- the perceived fairness of promotion practices is one of those "bubbling beneath the surface" issues in most organizations. This study found that perceptions are impacted by having been promoted in the past, organizational commitment, and ego defensiveness. Good stuff.

- do more creative sales agents produce higher sales? Perhaps only when there is a high quality of leader-member exchange.

- is validity generalization overgeneralized? (say that five times fast) These folks seem to think so.

In the home stretch, from the September Psychological Science:

- older employees may have lower average cognitive performance, but it's more consistent

- spatial ability has a valuable role to play in the development of creativity, and can predict things like patents and publications

Second to last, for you stats geeks out there, a study that suggests that t-tests can be used reliably with small samples, thank you very much

Finally, something that has nothing to do with selection but is a nominee for the 2013 HR Tests Coolest Study Award, and something we all are very familiar with: time bandits (no, not the movie).

Saturday, April 27, 2013

Test validity: A fairy tale

Once upon a time there was a field called Industrial and Organizational Psychology. Its researchers and practitioners dealt with a myriad of magical issues ranging from individual differences and behavior to organizational structures.

Within this field, there was a specialty called Personnel Psychology. It dealt with narrower--but no less mysterious--issues such as defining and designing jobs and, most relevant for us, finding and hiring the right people. The I/O psychologists and HR professionals that quested for these answers often found themselves on dangerous missions like battling Monsters of Doubt (i.e., first-line supervisors).

These adventurers had two main weapons at their disposal when fighting these monsters: the Carrot of Truth and the Stick of Pain.

When invoked, the Carrot of Truth, fashioned deep in the Mines of Correlation, caused monsters to realize that hiring the right people was the right thing to do for their realm. It increased productivity and morale, customer satisfaction, and organizational flexibility. It also allowed supervisors to spend more time leading and less time dealing with gremlins (i.e., employees with performance problems).

The Stick of Pain used an opposing form of magic but was sometimes equally effective. It attempted to slay the monsters using a peculiar power called The Law. The Law frightened monsters because it meant they could experience emotional pain and suffering, and--more importantly--fewer bags of gold.

For a long time, Personnel adventurers used both of these weapons to slay all kinds of monsters, on high mountains and in dungeons. But over time, the adventurers discovered something: the Stick of Pain was becoming less and less effective.

It wasn't that the Stick was powerless. It's just that its magic didn't seem to frighten the monsters as much. The monsters saw their gold piling up and didn't feel the sting of suffering as they once had. And they started developing an addiction to carrots all on their own.

So there came a day when the adventurers and the monsters met on the battlefield and came to an agreement. No longer would the adventurers wield the Stick of Pain. And in return, the monsters pledged to respect the Carrot of Truth. They forged an eternal partnership and lived happily ever after.

The End.

Okay, so I've taken a little artistic license with my blog post today. But hopefully you see where I'm going.

Back in the old days (ya know, like the 80s), employers were faced with a foreboding world of testing, with the Civil Rights Act and cases like Griggs vs. Duke Power looming large over their assessment programs. I/O psychologists were brought in to help organizations navigate the complicated world of employment testing, which required an appreciation of statistics and the law alike. Large awards and settlements brought C-level attention, and regulatory agencies like the EEOC and DOJ served in ongoing oversight roles, requiring employers to clean up their act with procedural requirements that could be burdensome.

Nowadays, I/O psychologists are as likely to be valued for their ability to crunch "big data" to detect employee behavior trends as their ability to conduct thorough job analyses (not that the two are mutually exclusive). Lawsuits regarding testing are infrequent compared to issues like wages and hours, harassment, and terminations. The selection cases that do come up are as likely to involve disabilities as adverse impact due to cognitive loading.

Sure, we have the occasional big case that gets attention. But the bottom line is that over the years the "stick" has become much less effective as an argument for sound assessment than the "carrot."

Smart employers like Google have started crunching the numbers and realized the true business value of defining the right competencies for jobs. They're doing so not because they're afraid of litigation, but because they see more clearly the direct line between best practices in selection that we've been preaching for years--i.e., focusing on valid assessment results--and the bottom line.

So where does that leave the stick (i.e., fear of lawsuits)? Is it time to put it away along with phrenology and T&Es (woops, that slipped in)?

I don't think so. Organizations will always be subject to legal scrutiny when their selection processes have adverse impact and the right person talks to the right attorney. Personnel psychologists and HR professionals should always have a healthy respect for the legal climate we operate in, and not forget that "job related and consistent with business necessity" isn't fictional gibberish.

But what it does mean is that because organizations are paying attention to their assessments, those assessments are more likely to yield valid results and be freer of illegal bias. That means management's quest and the selection professional's quest are more likely to converge, with a lot more cooperation.

And hopefully a lot fewer monsters.

Saturday, October 16, 2010

Q&A with Piers Steel: Part 2

Last time I posted the first part of my Q&A with Piers Steel, co-author of a recent piece in Industrial and Organizational Psychology (that I wrote about here) on synthetic validity and a fascinating proposition to create a system that would greatly benefit both employers and candidates. Read on for the conclusion.

Q4) Describe the system/product--what does it look like? For applicants? Employers? Governments?

A4) How do we do it? Well, that’s what our focal article in Perspectives on Science and Practice was about. Essentially, we break overall performance into the smallest piece people can reliably discern, like people, data, things (note: our ability to do this got some push back from one reviewer – that is, he was arguing we can’t tell the difference if people are good at math but not good at sales and vice-versa – it is a viewpoint that became popular because researchers assumed that “if it ain’t trait, it’s error”). We get a good job analysis tool that assesses every relevant aspect of the job, such as job complexity. We get a good performance battery, naturally including GMA and personality. We then have lots of people in about 300 different jobs take the performance battery, have their performance on every dimension as well as overall assessed to a gold standard (i.e., train those managers!), and have their jobs analyzed with equal care with that job analysis tool. From that, we can create validity coefficients for any new job simply by running the math. It is basically like validity generalization plus a moderator search, where once we know the work, we can figure out the worker. Again, read the article for more details, but this was basically it.

Once built, all employers need to do to get top-notch selection is describe their job using the job analysis tool and then, as fast as electrons fly through a CPU, you get your selection system, essentially instantly. It is several orders of magnitude better than what we have now in almost every way and by almost every criterion.

Q5) What are the benefits--to candidates, employers, and society?

A5) Everyone had a friend who struggled through life before finding out what they should have been doing in the first place. Or changed a job for a new company only to find they hated it there. Or never found anything they truly excelled at and just tried to live their lives through recreational activities. Everyone has experienced lousy service or botched jobs because the employee wasn’t in a profession that they were capable of excelling in. Everyone has heard of talented people who were down and out because no one recognized how good they really were.

Synthetic validity is all about making this happen less. How much less is the real question. If we match people to jobs and jobs to people wonderfully now, then perhaps not at all. But of course, we know that presently it's pretty terrible.

Now synthetic validity won’t be able to predict people’s work future perfectly, but it will do a damn sight better job than what we have now. Also, the best thing about synthetic validity is that it is going to start off good and then get better every year. Because it is a consolidated system, incremental improvements, “technical economies,” are cost effective to pursue and once discovered and developed, they are locked in every time synthetic validity is used.

Right now, we have a system that can only detect the largest and most obvious of predictors (e.g., GMA) because of sample size issues, but can’t pursue other incremental predictors because they aren’t cost effective for just one job site. By the very nature of selection today, we are never going to get much better. As I mentioned, nothing major has changed in 50 years and nothing major will change in the next 50 if we continue with the same methodology. Synthetic validity is a way forward. With synthetic validity, the costs are dispersed across all users, potentially tens of millions, making every inch of progress matter.

So, what will we get? Higher productivity. If synthetic validity results in just a few thousand dollars of extra productivity per employee each year, multiply that by 130 million, the US work force. Take a second to work out the number – it’s a big number.

Also, people should be happier in their jobs too, creating greater life satisfaction. They will stay in their jobs longer, creating real value and expertise. Similarly, unemployment will go down as people more rapidly find new work appropriate to their skills. In fact, I can only think of one group that won’t like this – the truly bad performer. They are the only group that wouldn’t want better selection.
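For readers who want to take that second and work out the productivity number, here is a minimal sketch of the arithmetic. The 130 million work force figure comes from the answer above; the per-employee dollar amounts are illustrative stand-ins for "a few thousand dollars," not figures from Dr. Steel or the article.

```python
US_WORKFORCE = 130_000_000  # work force figure quoted in the interview


def total_gain(per_employee_gain: int, workforce: int = US_WORKFORCE) -> int:
    """Aggregate annual productivity gain in dollars."""
    return per_employee_gain * workforce


# Hypothetical values standing in for "a few thousand dollars" per employee.
for gain in (1_000, 2_000, 3_000, 5_000):
    billions = total_gain(gain) / 1e9
    print(f"${gain:,} per employee -> ${billions:,.0f} billion per year")
```

Even at the low end of these assumed values the total runs to hundreds of billions of dollars a year, which is the point being made.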

Q6) Finally, what do you need to move forward?

A6) So far, no one I know is doing this. There are some organizations who think they are doing synthetic validity, though it is really just transportability and they aren’t interested in pursuing the real thing. Partly, I think it is because the real decision makers don’t know about synthetic validity or don’t understand it. I could do more to communicate synthetic validity, though I have done quite a bit already. I have sent a few press releases, received a dozen or two newspaper interviews (Google it), contacted a few government officials on both sides of our border, and pursued a dozen or so private organizations. Part of my reason to do this interview here is to try to get the word out. So far, all I got back was a few “interesting” but no actual action.

I used to think this lack of pursuit was because synthetic validity was so hard to build, requiring 30,000 people -- but we know a lot more now. In the Perspectives article, McCloy pointed out we could allow ourselves to use subject matter experts to estimate some of the relationships. That won’t be as good as if we gathered the data ourselves, but we could get something running real quick, though later we would upgrade with empirical figures. Consequently, the reason why this isn’t built isn’t because it is too difficult. Also, the payoff would eventually be cornering the worldwide selection and vocational counseling market. I am not sure what that is worth but I imagine you could buy Facebook with change left over for MySpace if you wanted to. The value of it then isn’t the problem either.

I am coming to the conclusion that despite the evidence, to most people it is just my word as an individual. I’m a good scientist, winner of the Killam award for best professor at my entire University, but it still isn’t enough. You need the backing of a professional association and so far ours [SIOP] hasn’t yet taken a stand. As a professional organization, we should be promoting this, using the full resources of our association. I admit that I am “a true believer,” but this seems to be one of the bigger breakthroughs in all the social sciences in the last 100 years. Alternatively to the backing of a professional association, we need a groundswell where hundreds of voices repeat the message. I will do my bit but hopefully I will have a lot of company.

If you think I am overstating the case regarding synthetic validity, show me where I’m wrong. We handled all the technical critiques and issues in the Perspectives article. Right now, you have to make the argument that “human capital” doesn’t matter, that being good or bad at your work doesn’t matter. And if you try to make that case, I don’t think you are the type of person who would be even worth arguing with.

I'd like to thank Dr. Steel for his time and energy. I truly hope this idea sees the light of day. If you are interested in moving this forward, leave a comment and I can put you in touch with him.

Monday, October 11, 2010

Q&A with Piers Steel: Part 1

A few weeks ago I wrote about a research article that I think proposes a revolutionary idea: the creation of a synthetic validity database that would generate ready-made selection systems that would rival or exceed the results generated through a traditional criterion validation study.

I had the opportunity to connect with one of the article's authors, Piers Steel, Associate Professor of Human Resources and Organizational Dynamics at the University of Calgary. Piers is passionate about the proposal and believes strongly that the science of selection has reached a ceiling. I wanted to dig deeper and get some details, so I posed some follow-up questions to him. Read on for the first set of questions, and I'll post the rest next time:

Q1) What is the typical state of today's selection system--what do we do well, and what don't we?

A1) Here is a quote from a well-respected selection journal, Personnel Psychology: “Psychological services are being offered for sale in all parts of the United States. Some of these are bona fide services by competent, well-trained people. Others are marketing nothing but glittering generalities having no practical value.... The old Roman saying runs, Caveat emptor--let the buyer beware. This holds for personnel testing devices, especially as regard to personality tests.”

Care to try and date it? It is from the article “The Gullibility of Personnel Managers,” published in 1958. Did you guess a different date? You might have, as the observation is as relevant today as yesterday -- nothing fundamental has changed. Just compare that with this more recent 2005 excerpt from HR Magazine, Personality Counts: “Personality has a long, rich tradition in business assessment,” says David Pfenninger, CEO of the Performance Assessment Network Inc. “It’s safe, logical and time-honored. But there has been a proliferation of pseudo tests on the market: Caveat emptor.”

Selection is typically terrible with good being the exception. The biggest reason is that top-notch selection systems are financially viable only for large companies with a high-volume position. Large companies can justify the $75,000 cost and months to develop and validate and perhaps, if they are lucky, have the in-house expertise to identify a good product. Most other employers don’t have the skill to differentiate the good from the bad as both look the same when confronted with nearly identical glossy brochures and slick websites. And then the majority of hires are done with a regular unstructured job interview – it is the only thing employers have the time and resources to implement. Interviews alone are better than nothing but not much better – candidates are typically better at deceiving the interviewer than the interviewer is at revealing the candidate.

The system we have right now can’t even be described as being broken. That implies it once worked or could be fixed. Though ideally we could do good selection, typically, it is next to useless, right up there with graphology, which about a fifth of professional recruiters still use during their selection process. For example, Nick Corcodilos reviews how effective internet job sites are at getting people a position. He asks us to consider “is it a fraud?”

Q2) What's keeping us from getting better?

A2) Well, there are a lot of things. First, sales and marketing works, even if the product doesn’t. When you have a technical product and untechnical employer or HR office, you have a lot of room for abuse. I keep hearing calls for more education and that management should care more. You are right they should care more and know more. People should also care and know more about their retirement funds as well. Neither is going to change much.

Second, the unstructured job interview has a lot of “truthiness” to it. Every professional selection expert I know includes a job interview component to the process even when it doesn’t do much, as the employer simply won’t accept the results of the selection system without it. There are some cases where people “have the touch” and are value added but this is the exception. Still, everyone thinks they are gifted, discerning, and thorough. This is the classic competition between clinical and statistical prediction, with evidence massively favoring the superiority of the latter over the former but people still preferring the former over the latter (here are a few cites to show I’m not lying, as if you are like everyone else, you won’t believe me: Grove, 2005; Kuncel, Klieger, Connelly, & Ones, 2008).

Third, it just costs too much and takes too much time to do it right. Also, most jobs aren’t really large enough to do any criterion validation.

Q3) What might the future look like if we used the promise of synthetic validity?

A3) Well, to quote an article John Kammeyer-Mueller and I wrote, our selection systems would be “inexpensive, fast, high-quality, legally defensible, and easily administered.” Furthermore, every year they would noticeably improve, just like computers and cars. A person would have their profile taken and updated whenever they want, with initial assessments done online and more involved ones conducted in assessment centers. Once they have the profile, they would get a list of jobs they would likely be good at, ones that they would be likely good at and enjoy, and ones they would be likely good at, enjoy and that are in demand.

Furthermore, using the magic of person-organization fit, you inform them what type of organization they would like to work for. If someone submitted their profile to a job database, every day job positions would come to them automatically, with the likelihood of them succeeding at it. These jobs would come in their morning email if they wanted it. Organizations would also automatically receive appropriate job applicants and a ready built selection system to confirm that the profile submitted by the applicant was accurate.

Essentially, we would efficiently match people to jobs and jobs to people. I would recommend people update their profile as they get older or go through a major life change to improve the accuracy of the system, but even initially it would be far more accurate than anything available today -- a true game changer.

Follow-up: Some might see a contradiction here. You cite an article that bashes internet-based job matching, yet this is what you're suggesting. Would your system be more effective or simply supplement traditional recruiting methods (e.g., referrals)?

A: Yup, we can do better. The internet is just a delivery mechanism and no matter how high-speed and video enabled, it is just delivering the same crap. This would provide any attempt to match people to jobs or jobs to people with the highest possible predictiveness.

Next time: Q&A Part 2

References:
Grove, W. M. (2005). Clinical versus statistical prediction: The contribution of Paul E. Meehl. Journal of Clinical Psychology, 61(10), 1233-1243. doi: 10.1002/jclp.20179

Kuncel, N. R., Klieger, D., Connelly, B., & Ones, D. S. (2008, April). Mechanical versus clinical data combination in I/O psychology. In I. H. Kwaske (Chair), Individual Assessment: Does the research support the practice? Symposium conducted at the annual meeting of the Society for Industrial and Organizational Psychology, San Francisco, CA.

Stagner, R. (1958). The Gullibility of Personnel Managers. Personnel Psychology, 11(3), 347-352.

Saturday, September 18, 2010

Every once in a while, an idea comes along...

Once in a while a research article comes along that revolutionizes or galvanizes the field of personnel assessment. Barrick & Mount's 1991 meta-analysis of personality testing. Schmidt & Hunter's 1998 meta-analysis of selection methods. Sometimes a publication is immediately recognized for its importance. Sometimes the impact of the study or article isn't recognized until years after its publication.

The September 2010 issue of Industrial and Organizational Psychology contains an article that I believe has the potential to have a resounding, critical impact for years to come. Will it? Only time will tell.

The article in question is by Johnson, et al. and is on its face a summary of the concept of synthetic validation and champions its use. As a refresher, synthetic validation is the process of inferring or estimating validity based on the relationship between components of a job and tests of the KSAs needed to perform those components. It differs from traditional criterion-related validation in that the statistics generated are not based on a local study of the relationship between test scores and job performance. Studies have shown that estimates based on synthetic validity closely correspond to local validation studies as well as meta-analytic VG estimates. Hence it has the potential to be as useful as criterion-related validation in generating estimates of, for example, cost savings, without requiring the organization to actually gather sometimes elusive data.
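To make the "components of a job" idea concrete, here is a deliberately simplified sketch. It is not the method from the Johnson et al. article: the component names, importance weights, and test-by-component validities are all made up for illustration, and the estimate is a plain importance-weighted average. Published approaches are more involved (for example, they account for correlations among predictors and among performance dimensions).

```python
# Hypothetical importance weights for job components, as a job analysis
# questionnaire might produce them (made-up numbers).
job_weights = {"people": 0.5, "data": 0.3, "things": 0.2}

# Hypothetical validity of each test for each job component (made-up numbers).
component_validity = {
    "cognitive_test": {"people": 0.20, "data": 0.45, "things": 0.30},
    "conscientiousness": {"people": 0.25, "data": 0.20, "things": 0.15},
}


def synthetic_validity(weights: dict, validities: dict) -> float:
    """Importance-weighted average of a test's component-level validities."""
    total_weight = sum(weights.values())
    weighted_sum = sum(w * validities[comp] for comp, w in weights.items())
    return weighted_sum / total_weight


for test_name, validities in component_validity.items():
    estimate = synthetic_validity(job_weights, validities)
    print(f"{test_name}: estimated validity {estimate:.3f}")
```

The point of the sketch is only that once the job is described in terms of components, an estimate for a new job falls out of arithmetic on existing data rather than requiring a new local study.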

But the impact of the article, if I'm right, will not be felt based on its summary of the concept, but on what it proposes: a giant database containing performance ratings, scores from selection tests, and job analysis information. This database has the potential to radically change how tests are developed and administered. I'll let the authors explain:

"Once the synthetic validity system is fully operational, new selection systems will be significantly easier to create than with a traditional validation approach. It would take approximately 1-2 hours in total; employers or trained job analysts just need to describe the target job using the job analysis questionnaire. After this point, the synthetic validity algorithms take over and automatically generate a ready-made full selection system, more accurately than can be achieved with most traditional criterion-related validation studies."

Sound like a mission to Mars? Maybe. But the authors are incredibly optimistic about the chances for such a system, and it appears that it is already in the beginning stages of development. The commentaries following this focal article are generally very positive about the idea, some authors even committing resources to the project. The authors respond by suggesting that SIOP initiate the database and link it to O*NET. They point out, correctly, that this project has the potential to radically improve the macro-level efficiency of matching jobs to people; imagine how much more productive a society would be if the people with the right skills were systematically matched with jobs requiring those skills.

So as you can probably tell, I think this is pretty exciting, and I'm looking forward to seeing where it goes.

I should mention there is another focal article and subsequent commentaries in this issue of IOP, but it's (in my humble opinion) not nearly as significant. Ryan & Ford provide an interesting discussion of the ongoing identity crisis being experienced by the field of I/O psychology, demonstrated most recently by the practically tied vote over SIOP's name. I found two things of particular interest: first, the fact that they come out of the gate using the term "organizational psychology" which deserves only a footnote (a fact pointed out by several commentary authors). Second, they take an interesting approach to presenting several possible futures for the field, from the strengthening of historic identity to "identicide."

Finally, I want to make sure everyone knows about the published results of a major task force that looked at adverse impact. It too has the potential to have a significant impact on the study and legal judgment of this sticky (and persistent) issue.

Friday, September 10, 2010

Personnel Psychology, August 2010

The August, 2010 issue of Personnel Psychology came out a while ago, so I'm overdue in taking a look at some of the content:

Greguras and Diefendorff write about their study of how "proactive personality" predicts work and life outcomes. Using data from 165 employees and their supervisors across three time periods, the authors found that proactive individuals were more likely to set and attain goals, which itself predicted psychological need satisfaction. It was the latter that then predicted job performance and OCBs as well as life satisfaction.

Speaking of personality, next is an interesting study by Ferris et al. that attempts to clarify the relationship between self-esteem and job performance. Using multisource ratings across two samples of working adults, the authors found that the importance participants placed on work performance to their self-esteem moderated this relationship. In other words, this suggests that whether self-esteem predicts job performance depends on the extent to which people's self-esteem exists outside of their performance. Interesting.

Lang et al. describe the results of a relative importance analysis of GMA compared to seven narrower cognitive abilities (using Thurstone's primary mental abilities). Using meta-analysis data, the authors found that while GMA accounted for between 10 and 28% of the variance in job performance, it was not consistently the strongest predictor. Add this study to a number of previous ones suggesting that one solution to the validity-adverse impact dilemma may be in part to use narrower cognitive abilities (e.g., verbal comprehension, reasoning).

Last but definitely not least, Johnson and Carter write about a large study of synthetic validity (a topic Johnson writes more about in the August issue of IOP). For those that need a reminder, synthetic validity is the process of inferring validity rather than directly analyzing predictor-criteria relationships. After analyzing a fairly large sample, the authors found that synthetic validity coefficients were very close to traditional validity coefficients--in fact within the bounds of sampling error for all eleven job families studied. Validity coefficients were highest when both predictors and criterion measures were weighted appropriately.

So what the heck does that mean? Essentially this provides support for employers (or researchers) who lack the resources to conduct a full-blown criterion validation study but are looking for either (a) a logical way to create selection processes that do a good job predicting performance, or (b) support for said tests. Good stuff.
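For the stats-inclined, "within the bounds of sampling error" can be illustrated with a confidence interval around a traditional validity coefficient. The sketch below uses the standard Fisher z approximation; it is one common way to frame the check, not necessarily the procedure Johnson and Carter used, and the coefficients and sample size in the usage lines are hypothetical.

```python
import math


def correlation_ci(r: float, n: int, z_crit: float = 1.96) -> tuple:
    """Approximate 95% confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)             # Fisher z-transform of the correlation
    se = 1 / math.sqrt(n - 3)     # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def within_sampling_error(r_traditional: float, n: int, r_synthetic: float) -> bool:
    """True if the synthetic estimate falls inside the traditional estimate's CI."""
    lower, upper = correlation_ci(r_traditional, n)
    return lower <= r_synthetic <= upper


# Hypothetical example: traditional r = .30 from 200 people, synthetic r = .34.
lower, upper = correlation_ci(0.30, 200)
print(f"95% CI for r = .30, n = 200: [{lower:.2f}, {upper:.2f}]")
print(within_sampling_error(0.30, 200, 0.34))
```

With smaller samples the interval widens quickly, which is part of why a single local validation study is a noisier yardstick than it might appear.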

Saturday, August 28, 2010

September 2010 IJSA (those considering SHRM certification, read on)

The September issue of the International Journal of Selection and Assessment (IJSA) is out with a boatload of content. Let's check out some of the highlights:

First up, a piece by Gentry, et al. that has implications for self-rating instruments. The authors studied self-observer ratings among managers in Southern Asia and Confucian Asia and found an important difference: the discrepancy between the ratings was greater in Southern Asia. Specifically, the difference appears in self-ratings rather than observer ratings, indicating differences in how managers in the different areas perceived themselves. Implication? Differences in self ratings may be due to cultural differences in addition to things like personality and instrument type.

The second article is a fascinating one by Saul Fine in which the author analyzed differences in integrity test scores across 27 countries. Fine found two important things: first, there are significant differences in test scores across countries. Second, test results were significantly correlated (r = -.48) with country-level measures of corruption as well as several aspects of Hofstede's cultural dimensions.

Next, an article by De Corte, et al. that describes a method for creating Pareto-optimal selection systems that balance validity, adverse impact, and predictor constraints. This article continues the quest for balancing utility and subgroup differences. A link to the article is here but it wasn't functional at the time I wrote this; hopefully it will be soon.
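If "Pareto-optimal" is unfamiliar, the idea is that a selection system stays on the table only if no alternative is at least as good on every goal and strictly better on one. The sketch below shows just that filtering step for two goals, validity and the adverse impact ratio (where values closer to 1.0 mean less adverse impact). The candidate systems and their numbers are invented for illustration, and this is not De Corte et al.'s procedure, which works on predictor weights rather than a fixed list of options.

```python
from typing import List, NamedTuple


class SelectionSystem(NamedTuple):
    name: str
    validity: float      # higher is better
    impact_ratio: float  # adverse impact ratio; higher (closer to 1.0) is better


def dominates(a: SelectionSystem, b: SelectionSystem) -> bool:
    """True if a is at least as good as b on both goals and better on one."""
    at_least_as_good = a.validity >= b.validity and a.impact_ratio >= b.impact_ratio
    strictly_better = a.validity > b.validity or a.impact_ratio > b.impact_ratio
    return at_least_as_good and strictly_better


def pareto_front(systems: List[SelectionSystem]) -> List[SelectionSystem]:
    """Keep only the systems that no other system dominates."""
    return [s for s in systems if not any(dominates(o, s) for o in systems)]


# Hypothetical candidate systems with made-up numbers.
candidates = [
    SelectionSystem("cognitive only", 0.50, 0.60),
    SelectionSystem("cognitive + personality", 0.48, 0.75),
    SelectionSystem("interview only", 0.45, 0.85),
    SelectionSystem("personality only", 0.30, 0.95),
    SelectionSystem("unweighted mix", 0.40, 0.70),  # beaten on both goals
]

for system in pareto_front(candidates):
    print(system.name, system.validity, system.impact_ratio)
```

Everything left on the front represents a genuine trade-off, so choosing among those options is a judgment call about priorities rather than a statistical question.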

Next, in an article that SHRM will probably place on their homepage if they haven't already, Lester et al. studied alumni from three U.S. universities to analyze the relationship between attainment of the Professional in Human Resources (PHR) certification offered by SHRM and early career success. Results? Those with a PHR were significantly more likely to obtain a job in HR (versus another field) BUT possession was not associated with starting salary or early career promotions. I'll let you decide if you think it's worth the time (and expense).

If you need another reason to focus on work samples and structured interviews, here ya go. Anderson, et al. provide us with the results of a meta-analysis of applicant reactions to selection instruments. Drawing from data from 17 countries, the authors found results similar to what we've seen in the past: work samples and interviews were most preferred, while honesty testing, personal contacts, and graphology were the least preferred. In the middle (favorably evaluated) were resumes, cognitive tests, references, biodata, and personality inventories.

Fans of biodata and personality testing may find the article by Sisco & Reilly reassuring. Using results from over 700 participants, the authors found that the factor structures of a personality inventory and biodata measure were not significantly impacted by social desirability at the item level. Implication? The measures seemed to hold together and retain at least an aspect of their construct validity even in the face of items that beg inflation.

Speaking of personality tests, Whetzel et al. investigated the linearity of the relationship between the OPQ and job performance. Results? Very little departure from linearity and where present the departure was small. This suggests that utility gains may be obtained across the spectrum of personality test results.

Are you overloading your assessment center raters? Melchers et al. present the results of a study that strongly suggests that if you are using group discussions as an assessment tool, you need to be sensitive to the number of participants that raters are simultaneously observing.

There are other articles in here you may be interested in, including ones on organizational attractiveness, range shrinkage in cognitive ability test scores, and staffing services related to innovation.

Wednesday, March 31, 2010

This and that

I follow a number of journals, several of which aren't specifically devoted to recruitment and selection. But if you believe, as I do, that organizational structure and behavior have implications for what we usually talk about on this blog, I think you might find the following recently published articles interesting. I've also included a couple directly on point that you may have missed:

Got meetings? Turns out they're a key aspect of job satisfaction.

Thinking about work-life balancing measures? Consider the type of employee.

GLBT nondiscrimination policies may impact overall organizational performance.

Wrap your mind around this one: The ability to recognize opportunities may have a genetic component, similar to the personality aspect of openness to experience.

Are formal HR policies bad for morale? This study certainly suggests so. It also suggests that we need to "think small" when it comes to organizational units.

What makes someone "employable"? Willingness to change jobs--yes. Willingness to develop new competencies--not so much.

Interested in presenteeism (people coming to work sick)? Here's a good overview.

Maybe the New London police department wasn't so wacky. Turns out being overeducated negatively impacts job satisfaction--the good news is experience appears to moderate the relationship.

Bothered by the "criterion problem" in measuring the utility of assessments? This study won't make you feel any better, but it does help explain our challenge.

Want to do better on a test? Think positive.

Need more evidence that off-list checks are important? Check this out.

Sunday, January 31, 2010

Lessons from the NYC Fire case - part 1

Part 1 of 2

New York City, like the cities of New Haven and Chicago, has a long history of employment discrimination litigation related to its firefighter testing.

Since the 1970s and cases like Guardians, the city has been under scrutiny for its woefully low number of black firefighters.

In 2007 the city found itself faced with another lawsuit over its firefighter hiring practices, and in July of 2009, a U.S. District Court judge found that the city had violated Title VII by administering written exams from 1999-2007 that had high levels of adverse impact. The city marshaled an inadequate defense. In January of 2010, the same judge (Nicholas Garaufis) found the city liable for a pattern and practice of disparate treatment for those same exams. An adverse impact finding, particularly for written exams, and especially for public safety tests, is not earth-shattering. But a finding of disparate treatment in this situation is less common.

This case, while only one example and limited in its impact, has some valuable lessons for test users and sheds some light on how judges look at our field. In particular, I describe below nine points the judge specifically made and what lessons we can draw from them:

1) While the city conducted a job analysis with an "extensive" list of tasks and surveyed incumbents, the city offered "no evidence of 'the relationship of abilities to tasks.'" They conducted a linkage, but the judge found that the SMEs were confused about what they were supposed to do and didn't understand several of the abilities they were rating.

Lesson: simply having subject matter experts (SMEs) link essential tasks and knowledge, skills, and abilities (KSAs) is not sufficient. You need to ensure they understand the statements they are linking as well as how exactly they are supposed to be linking them.

2) In conducting the job analysis, the city inappropriately retained tasks and KSAs that could be learned on the job. It is quite clear (e.g., per the Uniform Guidelines) that only tasks and KSAs that are required upon entry to the job should be identified as critical in terms of exam development.

Lesson: make sure that when you are developing exams based on job analysis results that you focus only on those tasks and KSAs that are required upon entry to the job. This should be determined by your SMEs.

3) The city relied to some extent upon the work of a previous test developer, Dr. Frank Landy (who sadly recently passed away). In addition to a tenuous link between Dr. Landy's work and the current exams, the judge makes it clear that "reliance on the stature of a test-maker cannot stand in for a proper showing of validity." At the same time, the judge emphasizes that exams should be constructed by "testing professionals."

Lesson: tests should be developed by people who know what they're doing. This means HR professionals with the requisite background in test validation and construction in conjunction with job experts. Do not rely solely on previous efforts, particularly when (as in this case) the results of those efforts were either incomplete or not fully relevant to your current situation.

4) The city performed no "sample testing" to ensure that the questions were reliable as well as "comprehensible and unambiguous."

Lesson: few steps in the test development process are as easy--or as valuable--as pilot testing. I have yet to see an exam that didn't benefit from a "trial run" with a group of incumbents. Not only will you catch unintended flaws, you will verify that the exam is doing what you claim it is.

5) There was insufficient evidence that the exams actually measured the (nine cognitive) KSAs the city claimed they intended to measure. Plaintiffs were able to suggest the opposite through analyzing convergent and discriminant validity as well as by conducting a factor analysis.

Lesson: there are two linkages of primary importance in test development. The first was described in #1. The second is the link between critical KSAs and the exam(s). At the very least, you must be able to show evidence that there is a logical link between the two. When you claim to be measuring cognitive abilities, you incur an additional responsibility, which is gathering statistical evidence that supports this claim.

Next time: more lessons and the relief order.

Monday, January 18, 2010

How to get r = 1.0

Recruiters have a variety of measures of their success, often including process outcomes (time-to-fill, number of requisitions filled, etc.).

And although assessment professionals have a variety of success measures, some in common with recruiters (e.g., tenure), there is one measure that stands above all others: job performance.

The "gold standard" of this measurement is to correlate test scores with job performance measures (called criterion-related validation evidence). A correlation of, say, .50 between these two is considered outstanding. Square that and you have the proportion of variance in job performance that is explained. So in other words, when we can explain 25% of job performance with assessments, we call that success (and with good reason, because it's a heck of a lot better than 0%).

Why not higher than 25%? What would it take to get r = 1.0, in other words a perfect correlation between test scores and performance? Here is a somewhat tongue-in-cheek recipe for achieving this impossible dream:

1. An accurate identification of the top competencies/KSAs required for the job. Qualified subject matter experts reach consensus on a handful of far and away the most important qualities that impact job performance.

2. Perfectly constructed and administered, perfectly reliable and accurate measures of the top KSAs.

3. Variability among applicants in terms of amount of the relevant KSAs possessed.

4. Test scores combined and weighted appropriately given the job analysis results.

5. Variability in scores for those hired.

6. A clear description of the work to be performed and competencies to be demonstrated so the individuals understand expectations.

7. Perfectly reliable, accurate measures of job performance that capture behaviors one would logically relate to the critical KSAs.

8. A supportive work environment (e.g., high quality supervision, adequate resources) so this doesn't interfere with work performance.

9. Variability in job performance among those hired using the assessments.

10. Elimination of outside factors that may contribute to lower job performance (e.g., family emergencies, medical/psychological changes).

As you can see, some of these are achievable (1, 4, 6), others are challenging and depend on circumstances, but are not impossible to achieve (3, 5, 8, 9) and some are practically impossible (2, 7, 10). I said earlier this was tongue-in-cheek because obviously we'll never have a situation where all of these conditions (as well as ones I'm sure I forgot) are true.
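Ingredients 2 and 7 are worth a quick illustration. Under classical test theory, the correlation you observe is roughly the "true" correlation shrunk by the square root of the product of the two reliabilities. Here's a minimal Python sketch of that attenuation formula; the function name and the reliability values are hypothetical, chosen only to show the effect.

```python
import math

def observed_r(true_r, predictor_reliability, criterion_reliability):
    """Classical attenuation formula: r_observed = r_true * sqrt(rxx * ryy)."""
    return true_r * math.sqrt(predictor_reliability * criterion_reliability)

# Even a perfect underlying relationship looks smaller once measurement
# error in the test and in the performance ratings is taken into account.
perfect = observed_r(1.0, 1.0, 1.0)      # perfectly reliable measures
imperfect = observed_r(1.0, 0.81, 0.64)  # illustrative reliabilities only

print(f"perfectly reliable measures: r = {perfect:.2f}")
print(f"imperfect measures:          r = {imperfect:.2f}")
```

In other words, unless both the test and the performance measure are perfectly reliable, r = 1.0 is off the table before anything else goes wrong.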

Does this mean we should abandon the correlation between test score(s) and job performance? Absolutely not. It should continue to be one of our "gold standards" for measuring our success as assessment professionals. But we--and our customers--should have our eyes wide open before pressing "compute."

Sunday, December 20, 2009

Validity: An elusive (unitary?) concept

What makes a test "valid"? What is the best way to develop a selection system? These are two of the most fundamental questions we try to answer as personnel assessment professionals, yet the answers are strangely elusive.

First of all, let's get two myths out of the way: (1) a test is valid or invalid, and (2) there is a single approach to "validating" a test. It is the conclusions drawn from test results that are ultimately judged on their validity, not simply the instruments themselves. You may have the best test of color vision in the world--that doesn't mean it's useful for hiring computer programmers. And many sources of evidence can be used when making the validity determination; this is the so-called "unitary" view of validity described in references like the APA Standards and the SIOP Principles. Unitary in this case refers to validity being a single, multi-faceted concept, not that psychologists agree on the concept of validity--a point we'll come back to shortly.

Although we can debate test validation concepts ad infinitum, the bottom line is we create tests to do one primary thing: help us determine who will perform the best on the job. The validation concept that most closely matches this goal is criterion-related validity: statistical evidence that test scores predict job performance. So we should gather this evidence to show our tests work, right? Here's where things get complicated.

It's likely that many organizations can't, for various reasons, conduct criterion-related validity studies (although baseline evidence of this would be helpful). Most of the time, it's because they lack the statistical know-how or high quality criterion measures (a 3-point appraisal scale won't do it). So in a strange twist of fate, the evidence we are most interested in is the evidence we are least likely to obtain.

So what are organizations to do? Historically the answer is to study the requirements of the job and select/create exams that target the KSAs/competencies required; this matching of test and job is often referred to as "content validity" evidence. But Kevin Murphy, in a recent article in SIOP's journal Industrial and Organizational Psychology, makes an important point: this is good practice, but not a guarantee that our tests will be predictive of job performance. Why not? For a number of reasons, including poor item writing and applicant frame of reference. Murphy makes a passionate argument that we rely way too heavily on unproven content validation approaches when we should focus more on criterion-related validation evidence. Instead of focusing on job-test match, we should focus on selecting proven, high quality exams.

Not surprisingly, the article is accompanied by 12 separate commentaries that argue with various points he makes. It's also interesting to compare this piece with Charley Sproule's recent IPAC monograph where he makes an impassioned defense of content validity.

A complete discussion of the pros and cons of different forms of validation evidence is obviously beyond a simple blog post. My main issues with Murphy's emphasis on criterion-related validation are threefold. First, as stated above, most organizations likely don't have the expertise to gather criterion-related validation evidence for every selection decision (maybe this is his way of creating a need for more I/O psychologists?). Perhaps "insufficient resources" is a poor excuse, particularly for an issue as important as employment, but it is a reality we face.

Second, even if we were to shift our focus to individual test performance, following a content validation approach for development enhances job relatedness (which Murphy acknowledges). Should your selection system face an adverse impact challenge, the ability to show job relatedness will be essential.

Finally, let's not forget that high test-job match gives candidates a realistic job preview--hardly an unimportant consideration. RJPs help candidates decide whether the job would be a good match for their skills and interests. And no employer that I know of enjoys answering this question from candidates: "What does this have to do with the job?"

The approach advocated by Murphy, taken to its extreme, would result in employers focusing exclusively on the performance of particular exams rather than on their content in relation to the job. This seems unwise from a legal as well as face validity perspective.

In the end, as a practitioner, my concern is more with answering the second question I posed at the beginning of this post: What is the best way to develop a selection system? Given everything we know--technically, legally, psychologically--I return to the same advice I've been giving for years: know the job, select or create good tests that relate to KSAs/competencies required on the job, and base your selection decision on the accumulation of test score evidence.

Should researchers work harder to show that job-test content "works" in terms of predicting job performance? Sure. Should employers take criterion-related validation evidence into consideration and work to collect it whenever possible? Absolutely. Will job-test match guarantee a perfect match between test score and job performance? No. But I would argue this approach will work for the vast majority of organizations.

By the way, if you are interested in learning more about the different ways to conceptualize validity--"content validity" in particular--Murphy's focal article as well as the accompanying commentaries are highly recommended. He acknowledges that he is purposely being provocative, and it certainly worked. It's also obvious that our profession has a ways to go before we all agree on what content validity means.

Last point: the first focal article in this issue--about identifying potential--looks to be good as well. Hopefully I'll get around to posting about it, but if not, check it out.

Wednesday, August 26, 2009

Latest IJT and PP

The latest issue of the International Journal of Testing has some good stuff in it, particularly if you're in education. Here's a sample of what's available:

Correcting fallacies in (construct) validity, reliability and classification

Then head on over to Personnel Psychology to check out the Autumn issue. Here's some of what it covers:

Conscientiousness and KSAs predict leadership in the military

Job component validation, meta-analysis, and DOT ratings

Personality variables and job search behavior and success

The importance of specificity and observability in job analysis ratings

Sunday, July 26, 2009

July 2009 J.A.P.: SJTs and more

Situational judgment tests (SJTs) have a long tradition of successfully being used in employment tests. These types of (typically multiple-choice) items describe a job-related scenario then ask the test-taker to endorse the proper response. The question itself usually takes one of two forms:

1) What SHOULD be done in this situation? ("knowledge instruction")

2) What WOULD you do in this situation? ("behavioral tendency instruction")

What are the practical differences between the two? Previous meta-analytic research, specifically McDaniel et al.'s 2007 study, revealed that knowledge instruction items tend to be more highly correlated with cognitive ability, while behavioral tendency items show higher correlations with personality constructs. In terms of criterion-related validity, there appeared to be no significant difference between the two.

But there were limitations to that study, and two of them are addressed in a study found in the July 2009 issue of the Journal of Applied Psychology. Specifically, Lievens et al. addressed the inconsistency in stem content by keeping it the same while altering the response instruction, and also looked at a large population of applicants, rather than incumbents, which tended to dominate McDaniel et al.'s 2007 sample.

Results? Consistent with the 2007 study, knowledge instructions were again more highly correlated with cognitive ability, and there was no meaningful difference in criterion-related validity (the criterion being grades in interpersonally-oriented courses in medical school). Contrary to some research in low-stakes settings, there was no mean score difference between the two response instructions.

Practical implications? The authors suggest knowledge instruction items may be superior due to their resistance to faking. My only concern is that these items are likely to result in adverse impact in many applied settings. Like all assessment situations, the decision will involve a variety of factors, including the KSAs required on the job, the size and nature of the applicant pool, the legal environment, etc. But at least this type of research supports the fact that both response instructions seem to WORK. By the way, you can see an in-press version of this article here.

Other content in this journal? There's quite a bit, but here's a sample:

Content validity <> criterion-related validity

More evidence that selection procedures can impact unit as well as organizational performance

Self-ratings appear to be culturally bound

Wednesday, July 01, 2009

Ricci case: Full of sound and fury...

There's been a lot of hoopla over the last several days over the U.S. Supreme Court's decision in Ricci v. DeStefano. It's been described as a win for "reverse discrimination" cases, a rebuke of written tests, and judicial activism. The way I read it, the decision is completely unsurprising and will likely change absolutely nothing about employment testing.

For anyone who isn't familiar with the case, here's a very brief rundown: the City of New Haven, CT gave promotional tests for Lieutenant and Captain firefighter positions using written multiple choice tests and interviews. When they crunched the results it turned out--not surprisingly--that there was statistical evidence of adverse impact against the Black candidates. The City decided not to use the list, and the White and Hispanic candidates sued, claiming disparate treatment. The Supreme Court ruled in their favor.

It's a somewhat unusual case in terms of who's on what side, and there's a lot of good reading in the decision for anyone wanting to know more about test validation. But the decision itself is totally consistent with three main themes from previous decisions:

(1) There really isn't "reverse discrimination"--there's just discrimination based on a protected classification, such as race, color, or sex. Majority groups are protected just like minority groups.

(2) Employers do not have to go to irrational lengths to validate their selection methods. Although the tests had flaws, the court continued to demonstrate that employers simply need to follow a logical process for developing the exam to show job relatedness; the exams don't have to win any awards.

(3) Disparate treatment by a government entity in order to avoid liability for adverse impact is legal only in certain very specific instances (when there is a "strong basis in evidence"). The court has been trending for years toward "color-blind" selection decisions.

About the only thing this case really points out is employers need to be ready to use the results from whatever test they administer, barring some enormous irregularities. That, and part of a defense against an adverse impact case might be that choosing not to use the exam would have been evidence of disparate treatment (I'll grant you that one's a little confusing).

All in all--and I'm certainly not the only one who feels this way--it doesn't appear to be anything to get excited about.

Want to know more? Check out the scotuswiki page.

Wednesday, May 13, 2009

Free monograph on test validation

IPAC (the International Personnel Assessment Council) is making available, free of charge, a monograph by Dr. Charles Sproule titled Rationale and research evidence supporting the use of content validation in personnel assessment.

Having seen a copy, I can tell you it's chock-full of great content, spanning much of the field of personnel assessment, including updated information on validity coefficients and special sections for several different types of tests (e.g., interviews, training and experience exams).

It's a great primer for anyone who wants to learn more about what it means to "validate" an exam, and it's a worthy addition to the library of any seasoned professional.

The monograph can be accessed here if you are a member, or you can request a copy here if you are not. Check it out!

Tuesday, March 17, 2009

March '09 Journal of Applied Psychology

The March 2009 issue of J.A.P. is out with a lot of great content; for example:

To maximize diversity and validity, try recruiting on cognitive ability and selecting on conscientiousness

In a multistage assessment scenario (which exists for practically every hire), there are several options for maximizing diversity and validity

What determines the success of word of mouth recruitment strategies? Turns out, several things...
(read the draft version here and see Van Hoye's IPAC presentation here)

Interested in improving your T&Es? Check out what this study has to say about self and supervisory agreement of performance ratings.

More clues about what GMA is all about

How P-E fit is related to performance dimensions such as OCB

The importance of job offer negotiation: Subjective value predicts long-term employee outcomes

Sunday, March 08, 2009

New newsletter

There's a new newsletter in town. It's called EEO Insight, it's published by Biddle Consulting Group, and it focuses on EEO/AA issues, including employment testing.

Check out some of the topics from the first issue (December '08):

The EEOC, OFCCP, and “Systemic Discrimination”: The Rules Have Changed

Where are the Courts Today? Proving and Defending Against an “Adverse Impact” Claim: OFCCP’S New Approach to Employer Selection Systems

Five Steps to Successful AAP Goal Development

Diversifying Your Organization: How to Actually Make it Happen

Claims of Employment Test Validity: Who Can You Trust?

Good stuff. You can subscribe here.

Wednesday, December 03, 2008

New evidence of the power of GMA

One of the biggest areas of focus for personnel psychologists is uncovering which selection mechanisms do the best job of predicting job performance.

Different researchers have focused on various tests, but perhaps no tests have received as much attention as those that measure general mental ability (GMA). GMA has consistently been shown to produce the highest criterion-related validity (CRV) values and has some very strong proponents. (For those of you not up on your statistics, CRV refers to the statistical relationship between test scores and subsequent job or training performance; with a maximum value of 1.0, the bigger, the better.)

One of the most strident advocates of ability testing is Frank Schmidt, who has studied and written extensively on the topic. You may have heard of the widely cited article he co-authored with John Hunter in 1998. In that article, they present a CRV value of .51 for cognitive ability tests, which is considered excellent. Only work samples received a higher score, but this value has been subsequently questioned.

In the latest issue of Personnel Psychology (v61, #4), Schmidt and his colleagues present an updated CRV value, and it's even higher. Using what they claim is a more accurate way of correcting for range restriction, the authors present an overall value of .734 for job performance and .760 for training performance. This value is the highest I've seen reported in a major study such as this and further solidifies GMA as "the construct to beat" when predicting performance.

The article also uses this same updated statistical approach to looking at the CRV of two personality variables that have been generally supported--Conscientiousness (Con) and Emotional Stability (ES). The values presented for these unfortunately were not that much larger than previously reported: for Con: .332 (.367) and for ES: -.100 (-.106) for job (training) performance.
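For anyone wondering what a range restriction correction looks like, here's a minimal Python sketch of the classic textbook formula for direct range restriction (often called Thorndike's Case II). To be clear, this is not the newer procedure the authors used, and the numbers are hypothetical. It's only meant to show why a correlation computed on people who were actually hired understates the relationship in the full applicant pool.

```python
import math

def correct_for_direct_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    r_restricted: correlation observed among those hired
    sd_ratio: SD of test scores among all applicants divided by the SD
              among those hired (greater than 1 when the range is restricted)
    """
    u = sd_ratio
    return (r_restricted * u) / math.sqrt(1 + r_restricted ** 2 * (u ** 2 - 1))

# Hypothetical example: r = .30 among hires, applicant SD twice the SD of hires
corrected = correct_for_direct_range_restriction(0.30, 2.0)
print(f"observed r = 0.30, corrected r = {corrected:.2f}")
```

The more the hiring process narrows the spread of scores, the larger the correction, which is part of why different correction methods can produce noticeably different CRV values.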

That all being said, there are some things to note:

1) Use of GMA tests for selection is likely to produce substantial adverse impact with most applicant samples of any substantial size, potentially limiting their usage in many cases.

2) CRV coefficients are just one "type" of validity evidence. The calculation is far from perfect and depends greatly on the criterion being used. The authors admit that they were unable to measure the prediction of contextual performance, which could have resulted in substantially higher values for the personality variables.

3) On a related note, some of the largest CRV values for personality tests I've seen were reported in Hogan & Holland (2003), where they aligned predictor and criterion constructs. This study was excluded from the current study because "the performance criteria they employed were specific dimensions of job performance rather than overall job performance."

4) The lower values reported in this study for personality measures may also reflect the way personality is measured, which the authors acknowledge. They suggest using outside raters as well as multiple scales for the same constructs may yield higher CRV values. Interestingly, they also suggest that personality may not be as important because with sufficient GMA, individuals can make up for any weaknesses--such as forcing yourself to frequently speak with others even if you're an introvert.

5) CRV values for GMA continued to vary substantially depending on the complexity of the job, yielding values that ranged .20-.30 apart from one another. This is a key point and is related to the fact that the type of job--and job performance--matters when generating these numbers.

Last but not least, there's another great article in this issue, devoted to (coincidentally) conducting CRV studies by Van Iddekinge and Ployhart--check it out. They go into detail about many issues directly relevant to the study above.

Wednesday, October 29, 2008

Upcoming webinar on defending your tests

Tests aren't valid or invalid per se--it depends on what you use them for.

But if your tests are challenged legally (say, because they have a discriminatory impact against a protected group), one of the things you'll want to defend yourself with is a test validation report--a documentation of why the test was developed, how it was developed, and the purposes for which the test should be used.

This is one of the topics that will be covered in an upcoming webinar sponsored by Talent Management and presented by some well known folks over at APT. Taking place on November 11th at 11am PST, the webinar is titled "Testing the Test: What You Need to Know about Test Validation, Litigation, and Risk Management."

Should be worth a watch/listen.

On a related note, the latest issue of Talent Management had some good articles in it, including ones on "role based" assessment (which just sounds like good ol' fashioned position-based assessment) and employee surveys. It's actually not a bad little magazine, and it's free. You can subscribe here.

Tuesday, September 02, 2008

Power v. Group Differences

In a recent post I wrote about a chart my co-workers and I created to help us communicate with hiring supervisors about the pros and cons of various testing instruments. That graph mapped power (validity) on one axis, and speed of administration on the other.

One of the comments on that post mentioned it would be nice to see power vs. group differences. I agreed. So here it is!

The bottom line on this graph (no pun intended): if you're looking for the best combination of both, look in the upper left quadrant.

A few notes of caution before interpreting the graph:

- this graph charts only Black-White differences, which is the largest data set we have. It's important to remember that combinations of other groups (including gender) will yield slightly different results.

- the evidence on group differences for T&Es is rather scant. Not much has been found, but that doesn't mean it couldn't in the future, depending on what specific training or experience is being measured.

- finally, as the excellent recent article by Roth, et al. reminds us, adverse impact in your selection process depends on several factors, including the specific test or construct, the selection ratio, your applicant pool, and the order you place your assessments in.
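The selection ratio point is easy to see with a little arithmetic. Below is a minimal Python sketch that assumes, purely for illustration, two normally distributed groups with equal spread whose means differ by d standard deviations, and a single cutoff set by the higher-scoring group's pass rate. The function name and the value of d are hypothetical; the 0.80 benchmark is the four-fifths rule of thumb from the Uniform Guidelines.

```python
from statistics import NormalDist

def impact_ratio(d, reference_pass_rate):
    """Ratio of pass rates (lower-scoring group / higher-scoring group).

    Assumes both groups are normal with SD = 1, the higher-scoring group
    has mean 0, and the lower-scoring group has mean -d.
    """
    std_normal = NormalDist()
    # Cutoff that passes the stated share of the higher-scoring group
    cutoff = std_normal.inv_cdf(1 - reference_pass_rate)
    # Share of the lower-scoring group (mean -d) scoring above that cutoff
    focal_pass_rate = 1 - std_normal.cdf(cutoff + d)
    return focal_pass_rate / reference_pass_rate

# Same hypothetical group difference, three different levels of selectivity
for pass_rate in (0.9, 0.5, 0.1):
    ratio = impact_ratio(d=0.5, reference_pass_rate=pass_rate)
    status = "below" if ratio < 0.80 else "at or above"
    print(f"pass rate {pass_rate:.0%}: impact ratio = {ratio:.2f} ({status} four-fifths)")
```

Under these assumptions, the same group difference produces a lower impact ratio as the cutoff gets more selective, which is why the same test can look very different from one hiring situation to the next.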