In 1998, the industrial-organizational psychologists Frank Schmidt and John Hunter published the most comprehensive analysis of hiring methods in the history of the field. They had eighty-five years of data — millions of hires, thousands of studies — and one question: which selection methods actually predict job performance?
The findings were so blunt they're still uncomfortable to read.
Work samples — having a candidate actually do a slice of the job — predicted future job performance at a validity coefficient of about 0.54. General mental ability tests came in at 0.51. Structured interviews, also 0.51. Job-knowledge tests, 0.48.
Unstructured interviews — the kind most companies still rely on — scored 0.38. Years of experience: 0.18. Years of education: 0.10.
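The credentials most companies screen on first, years of experience and years of education, are statistically barely better than guessing. The unstructured interview does better, but it still trails every structured alternative on the list.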
What "validity coefficient" actually means
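Validity coefficients can feel abstract, so here's the practical translation. A validity coefficient is the correlation between a selection score and later job performance: 0 means no relationship, 1 means perfect prediction. Under a simple bell-curve model, if you hire the top quarter of applicants on a 0.54 predictor such as work samples, about half of them turn out to be top-quartile performers, against one in four if you picked at random. At 0.10 (years of education), the figure is about 29%, which barely moves the needle from chance.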
Or put more bluntly: hiring on a résumé is closer to a coin flip than to a process.
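For readers who want to check this kind of translation themselves, here is a minimal sketch. It assumes predictor scores and job performance are jointly normal, with a correlation equal to the validity coefficient, and that you hire the top quarter of applicants. Both are simplifying assumptions for illustration, not something the Schmidt and Hunter paper prescribes, and the function name and defaults are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def top_fraction_hit_rate(validity: float, top_fraction: float = 0.25,
                          steps: int = 4000) -> float:
    """Share of selected candidates who end up in the top `top_fraction`
    of performers, when you select the top `top_fraction` on the predictor.

    Assumption: predictor and performance are bivariate normal with
    correlation `validity`.
    """
    if not 0.0 <= validity < 1.0:
        raise ValueError("validity must be in [0, 1)")
    if not 0.0 < top_fraction < 1.0:
        raise ValueError("top_fraction must be in (0, 1)")

    cutoff = STD_NORMAL.inv_cdf(1.0 - top_fraction)
    residual_sd = sqrt(1.0 - validity ** 2)

    # Midpoint-rule integration over predictor scores above the cutoff.
    upper = 8.0  # the normal density beyond 8 SD is negligible
    width = (upper - cutoff) / steps
    joint = 0.0
    for i in range(steps):
        x = cutoff + (i + 0.5) * width
        # Chance that performance clears the cutoff, given predictor score x.
        p_top_performer = 1.0 - STD_NORMAL.cdf((cutoff - validity * x) / residual_sd)
        joint += STD_NORMAL.pdf(x) * p_top_performer * width

    # Condition on having been selected.
    return joint / top_fraction


if __name__ == "__main__":
    for label, r in [("random", 0.0), ("education", 0.10),
                     ("unstructured interview", 0.38), ("work sample", 0.54)]:
        print(f"{label:>24}: {top_fraction_hit_rate(r):.2f}")
```

Under these assumptions, a validity of 0.54 gives a hit rate of about 0.50, compared with 0.25 at zero validity and about 0.29 at 0.10.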
Most of the difference comes from a simple fact: past credentials measure what someone has done, not what they can do. A candidate's degree tells you they finished a curriculum. Their years of experience tell you they didn't get fired. Neither tells you whether they can handle the work in front of them on Tuesday.
Past credentials measure what someone has done. Work samples measure what they can do. Most companies still hire on the first.
The 2022 reanalysis: the numbers shifted, the ordering didn't
In 2022, a team led by Paul Sackett at the University of Minnesota published a reanalysis of the Schmidt and Hunter data. Their argument: the original analysis over-corrected for "range restriction" (a statistical artifact that happens when you only have data on the people you hired, not the ones you rejected). When that correction is dialed back, the absolute validity numbers drop somewhat.
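To see why the size of this correction matters, here is a minimal sketch of the classic textbook correction for direct range restriction (often called Thorndike's Case II). It illustrates the general mechanism only. It is not a reproduction of the procedure used in either paper, and the input numbers are made up for the example.

```python
from math import sqrt


def correct_for_range_restriction(observed_r: float, sd_ratio: float) -> float:
    """Thorndike Case II correction for direct range restriction.

    observed_r: correlation measured among the people who were hired.
    sd_ratio:   predictor SD among all applicants divided by predictor SD
                among hires (1.0 means no restriction).
    """
    if not -1.0 <= observed_r <= 1.0:
        raise ValueError("observed_r must be a correlation in [-1, 1]")
    if sd_ratio <= 0:
        raise ValueError("sd_ratio must be positive")
    u = sd_ratio
    return observed_r * u / sqrt(1.0 + observed_r ** 2 * (u ** 2 - 1.0))


if __name__ == "__main__":
    observed = 0.30  # hypothetical correlation in the hired group
    for ratio in (1.0, 1.2, 1.5, 2.0):
        corrected = correct_for_range_restriction(observed, ratio)
        print(f"assumed SD ratio {ratio:.1f} -> corrected validity {corrected:.2f}")
```

With an observed correlation of 0.30, assuming applicants are 1.5 times as variable as hires lifts the corrected value to about 0.43, while assuming no restriction leaves it at 0.30. The assumed ratio drives the result, which is why the question of how much to correct is worth arguing about.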
The headlines briefly suggested the famous Schmidt and Hunter findings had been overturned. Read the actual paper and you find something more subtle. Most coefficients came down. Structured interviews moved to the top of the ranking, at roughly 0.42. GMA fell further than most, from 0.51 to roughly 0.31. Many methods clustered closer together.
But the broad ordering, meaning which methods predict performance and which don't, held up better than the headlines implied. Structured and sample-based methods stayed in the upper part of the ranking, with cognitive ability close behind. Years of experience and unstructured interviews remained near the bottom. The picture that emerged from decades of meta-analysis is still substantially the picture you have to plan around today.
The intellectually honest summary: the exact coefficients are contested. The ordering is not.
Why work samples win
The reason work samples top the list isn't mysterious. Three things make them work:
1. They measure the actual capability, not a proxy for it
A résumé says someone is a senior care coordinator. A work sample shows whether they can read a chart, spot what's missing, and prioritize three competing patient needs in the next ten minutes. A degree says someone studied accounting. A work sample shows whether they can reconcile a messy ledger without losing track of what they noticed five rows ago.
The distance between credential and capability is enormous, and it varies wildly by person. Work samples collapse it.
2. They're harder to game
You can polish a résumé. You can rehearse interview answers. You can buy a course on "how to ace cognitive ability tests" and learn the question types. What you can't easily fake is whether, when handed a real piece of work, you can think your way through it under realistic constraints.
The catch: as work samples have become more popular, the test items themselves get gamed too. Posted online. Memorized. Refreshed by candidates trading examples. The methodology survives only when the items refresh on a cadence — something to bake into how the test is maintained.
3. They predict the work because they are the work
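Two technical terms apply here. "Content validity" means the test samples the job itself; "criterion validity" means scores track later performance. The simpler version: a test that looks like the job predicts the job. A test that looks like a generic aptitude battery predicts generic aptitude.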
What this means for hiring practice
If you're a hiring leader and the data above is news to you, take a beat before changing anything. There are real practical considerations behind why companies don't already hire on work samples.
Work samples are expensive to build well. They require a real job analysis — someone who understands the work observing it, identifying what predicts success, and constructing items that fairly probe those capabilities. They require scoring rubrics, calibration, and ongoing maintenance. They take longer for candidates to complete than a multiple-choice battery, which can hurt your top-of-funnel completion rate.
And, critically, they have to be legally defensible. The Uniform Guidelines on Employee Selection Procedures (1978) require that any screen with disparate impact on protected groups be job-related and validated through one of three accepted methods: content validity, criterion-related validity, or construct validity. On this front, a poorly designed work sample can be worse than a generic battery, because the homemade version may not survive an EEOC challenge.
This is why the historical compromise has been generic batteries. They're cheap per candidate, defensible based on the vendor's validation studies, and easy to deploy at scale. The cost is the one Schmidt and Hunter identified back in 1998: they don't predict performance in your specific roles as well as a properly built work sample would.
How to think about the trade-off
The right answer isn't "all work samples for everyone." It's a portfolio decision based on the role.
For high-volume, low-stakes hiring — the kind where you're filling thousands of seats and each individual mis-hire is a small cost — a generic battery is usually the right call. The economics don't justify a custom build.
For high-volume, high-stakes hiring — roles where mis-hires are expensive, ramp times are long, and the work has distinctive shape (healthcare operations, complex billing, judgment-heavy customer service, regulated work) — the math flips. The cost of a single mis-hire is high enough that a custom work sample built around the actual role pays back in months, not years.
The heuristic HireGauge tends to use: roughly twenty or more hires per year in the same target role, with a defensible "cost of a bad hire" estimate in the high four figures or above. Below that volume, the unit economics of a custom build typically don't pencil out. Above it, the question stops being whether and starts being how soon.
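The heuristic and the payback logic can be sketched in a few lines. This is an illustration under stated assumptions, not HireGauge's actual model: the 7,000 cutoff is one reading of "high four figures," and the build cost and mis-hire reduction are hypothetical inputs you would need to estimate for your own roles.

```python
from dataclasses import dataclass

MIN_HIRES_PER_YEAR = 20        # from the heuristic in the text
MIN_COST_OF_BAD_HIRE = 7_000   # assumption: one reading of "high four figures"


@dataclass(frozen=True)
class RoleEconomics:
    hires_per_year: int
    cost_of_bad_hire: float         # estimated cost of one mis-hire
    build_cost: float               # hypothetical one-time cost of a custom work sample
    mis_hire_rate_reduction: float  # hypothetical, e.g. 0.05 = 5 points fewer mis-hires


def clears_heuristic(role: RoleEconomics) -> bool:
    """True when the role meets both volume and cost thresholds."""
    return (role.hires_per_year >= MIN_HIRES_PER_YEAR
            and role.cost_of_bad_hire >= MIN_COST_OF_BAD_HIRE)


def payback_months(role: RoleEconomics) -> float:
    """Months until avoided mis-hire costs cover the build cost."""
    annual_savings = (role.hires_per_year
                      * role.mis_hire_rate_reduction
                      * role.cost_of_bad_hire)
    if annual_savings <= 0:
        return float("inf")
    return 12.0 * role.build_cost / annual_savings


if __name__ == "__main__":
    # Made-up numbers for a high-volume, high-stakes role.
    role = RoleEconomics(hires_per_year=200, cost_of_bad_hire=15_000,
                         build_cost=60_000, mis_hire_rate_reduction=0.05)
    print(clears_heuristic(role), round(payback_months(role), 1))
```

With these made-up inputs the role clears the heuristic and the build pays back in under five months. The result is only as good as the mis-hire cost and reduction estimates you feed in.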
The path forward
If you read this and recognized your own hiring process at the bottom of the ranking (years of experience, unstructured interviews, a generic battery you bought from a vendor), you're not behind. You're where most companies still are. The research has been clear for decades; the practical solutions have been hard to build well, which is why adoption has lagged the evidence.
The two things worth doing first, regardless of who builds your assessment:
- Add a structured interview component to whatever you already do. Not "tell me about a time" questions answered however the interviewer feels like scoring them. Real structured scoring against a defined rubric. This alone moves you from a 0.38 method to a 0.51 method.
- Add a real work-sample task for any role where mis-hires are expensive. Even a single, well-designed task — built from the actual work — moves the needle further than years of credential screening ever will.
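As a rough picture of what "structured scoring against a defined rubric" can look like in practice, here is a minimal sketch. The rubric questions, anchor wording, and 1-to-5 scale are hypothetical placeholders. The point is only that every candidate gets the same questions, every interviewer scores against the same written anchors, and scores are combined by rule rather than by feel.

```python
from statistics import mean

# Hypothetical rubric: anchors are written down before any interview happens.
RUBRIC = {
    "prioritization": {
        1: "Cannot rank competing needs",
        3: "Ranks needs correctly with prompting",
        5: "Ranks needs correctly and explains the trade-offs unprompted",
    },
    "chart_review": {
        1: "Misses most gaps in the record",
        3: "Finds the obvious gaps",
        5: "Finds obvious and subtle gaps and flags follow-ups",
    },
}
SCALE = range(1, 6)  # allowed scores: 1 through 5


def score_candidate(ratings: dict[str, dict[str, int]]) -> tuple[dict[str, float], float]:
    """Combine interviewer ratings into per-question and overall scores.

    ratings maps interviewer name -> {question: score}. Every interviewer
    must score every rubric question on the shared scale.
    """
    if not ratings:
        raise ValueError("at least one interviewer rating is required")
    for interviewer, scores in ratings.items():
        if set(scores) != set(RUBRIC):
            raise ValueError(f"{interviewer} did not score exactly the rubric questions")
        for question, score in scores.items():
            if score not in SCALE:
                raise ValueError(f"{interviewer} gave {score} on {question}; expected 1-5")

    # Average across interviewers per question, then across questions.
    per_question = {q: mean(r[q] for r in ratings.values()) for q in RUBRIC}
    overall = mean(per_question.values())
    return per_question, overall


if __name__ == "__main__":
    example = {
        "interviewer_a": {"prioritization": 4, "chart_review": 3},
        "interviewer_b": {"prioritization": 5, "chart_review": 3},
    }
    print(score_candidate(example))
```

Rejecting incomplete or off-scale ratings is a deliberate choice in this sketch: it stops an interviewer from skipping a question or improvising a scale.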
If you want help building either — properly, with the legal validation work done right — that's what we do.