Why experts writing AI evals is creating the fastest-growing companies in history | Brendan Foody (CEO of Mercor)
The era of evals: how human expertise became the scaffolding for smarter AI
When the conversation about artificial intelligence turns technical, it often centers on models and compute. Yet another axis has quietly asserted itself as decisive: the labor and craft of measuring what success looks like. Brendan Foody, the founder and CEO behind a company that rocketed from a million-dollar run rate to hundreds of millions in months, argues that evals—structured tests, rubrics, and verification systems—are the product requirement documents for AI. They determine what researchers should optimize for, how engineers hill-climb capabilities, and ultimately what a model will be rewarded to become.
From crowdsourcing to curated expert markets
There was an earlier era in the human-data economy dominated by high-volume crowdsourcing: cheap, plentiful annotators who helped early models learn grammar, surface-level facts, and basic tasks. As models matured, a new need emerged: professionals who could assess high-stakes capabilities in narrow domains. The marketplace shifted from quantity to caliber. What used to be solved by thousands of low-skill contributors is now solved by dozens of lawyers, doctors, investment bankers, software engineers, and comedians who can write the rubrics that define correctness.
Why expertise matters after pretraining
Large-scale pretraining floods a model with token-level knowledge, but it does not reliably teach the model how to reason, apply domain judgment, or prioritize accurate outcomes under complex constraints. Post-training—especially reinforcement learning driven by carefully defined rewards or AI-feedback loops—lets model behavior be shaped by concrete success criteria. That means someone has to craft those criteria: lawyers must write what a good contract redline actually looks like; radiologists must define diagnostic rewards; screenwriters must teach humor to a generative agent. The work is not glamorous, but it is crucial.
Mercor’s strategic play: a labor marketplace for model capability
What happens when a company decides to specialize in sourcing, vetting, and retaining the top decile of domain experts for labs? If executed well, it becomes a chokepoint for model progress. Brendan describes a striking dynamic: within a cohort of a hundred contributors, the top ten percent often drive most of the measurable model improvement. The strategic payoff is therefore not just volume but the privileged access to a persistent group of high-impact evaluators who can be matched to the problem at hand.
Speed, scale, and customer obsession
The story of rapid growth—moving from a small startup to a near-billion-dollar business trajectory in under two years—was not accidental. It depended on three overlapping principles: relentless ambition, a recruiting bar that prioritized former founders and rare operators, and an intensity of execution that favored customer feedback loops over early sales theatrics. Instead of polishing marketing, the early focus was product and experience: build something that flagship customers cannot live without, and they will pull you forward.
What a day in the life of an evaluator looks like
Contrary to the image of repetitive labeling, much of the most valuable work is design and judgment: crafting rubrics, scoring model outputs against professional standards, curating counterexamples, and sometimes producing supervised pairs to show what a correct output should be. These materials then serve double duty: they act as the benchmark researchers use to measure progress and as the reward signals that guide reinforcement learning from AI feedback.
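The rubric-to-reward pipeline described above can be made concrete with a minimal sketch. The `Criterion` structure, the weights, and the contract-redline checks below are all hypothetical illustrations, not anything from Mercor's actual tooling; the point is only that a professionally written rubric reduces to a weighted checklist a machine can score against.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One line of a rubric: a description, a weight, and a check."""
    description: str
    weight: float
    check: Callable[[str], bool]  # True if the model output satisfies it

def score(output: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria the output satisfies (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total

# Hypothetical rubric a lawyer might write for a contract-redline task
rubric = [
    Criterion("Flags the missing indemnification clause", 2.0,
              lambda out: "indemnif" in out.lower()),
    Criterion("Addresses governing law", 1.0,
              lambda out: "governing law" in out.lower()),
]

print(score("Redline: add an indemnification clause.", rubric))  # 2/3 ≈ 0.667
```

The same score can then serve both roles the passage names: reported as a benchmark number, or fed back as a reward signal during post-training.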
Examples and data types
- Unit tests for code: precise, programmatic checks that scale evaluation.
- Legal rubrics: lists of clauses and negotiation priorities that define redlines.
- Clinical verifiers: diagnostic criteria used to weigh model recommendations.
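The first item, unit tests for code, is the most mechanically scalable of the three: a model's output either passes the tests or it does not. A minimal sketch of such a harness follows; the `solution` entry-point name is an assumption for illustration, and a real eval harness would sandbox execution rather than call `exec` directly.

```python
def run_unit_test_eval(candidate_code: str, test_cases: list[tuple]) -> bool:
    """Execute model-generated code and grade it against programmatic checks.
    Sketch only: exec() on untrusted code is unsafe outside a sandbox."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        fn = namespace["solution"]       # assumed entry-point name
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # any crash or missing function counts as a failure

# A correct model output for the prompt "return the sum of a list"
candidate = "def solution(xs):\n    return sum(xs)\n"
tests = [(([1, 2, 3],), 6), (([],), 0)]
print(run_unit_test_eval(candidate, tests))  # True
```

Legal rubrics and clinical verifiers are harder to make this programmatic, which is exactly why they require the expert judgment the article describes.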
Which jobs will last—and which will expand—when AI frees productivity
One of the sharper distinctions in the conversation is between elastic and inelastic domains. Elastic industries—software engineering, product, certain creative fields—can scale nearly without bound as productivity multiplies; when engineers become ten times as productive, software ecosystems and feature sets can expand dramatically. Inelastic areas, where demand is tied to finite processes (each person files only one tax return, however cheap filing becomes), will not necessarily multiply with productivity enhancements. Brendan's practical counsel for students and early-career professionals leans toward domains where increased productivity creates additional demand.
A longer view on intelligence and the role of humans
There are two competing myths to be wary of: the panic that superintelligence is imminent and the complacency that human expertise will soon be redundant. Both miss a third reality—this is a long road of many incremental capability improvements, each of which depends on human judgment. The path forward will be paved by thousands of finely tuned evals and domain-specific post-training datasets, not merely by another wave of pretraining.
Culture at breakneck pace
Scaling a business that sits at the intersection of talent markets and AI requires choices about hiring speed and standards. Early patience to seed exceptional talent paid enormous dividends, but there comes a moment when velocity matters more than perfect selectivity. Brendan’s playbook calibrated both: a careful initial talent density followed by aggressive scale once market pull was unmistakable.
Conclusion: expertise as infrastructure
What feels like a niche—expert evaluators writing rubrics and grading model outputs—reads differently when seen as infrastructure. It is the scaffolding on which reliable, capable systems will be built. As models become more central to medicine, law, education, and product design, the demand for people who can articulate what excellent looks like will grow, not vanish. The work is a peculiar mixture of craft, pedagogy, and measurement; it is the labor of translating human standards into machine incentives. The quiet truth that emerges is a paradox of our moment: automation will reshape the world, but it will also elevate a new class of work whose currency is precise judgment and the ability to make machines measure up to human expectations.
Insights
- Design a rubric before automating a workflow so you can objectively measure AI performance.
- Focus hiring early on a dense cluster of exceptional talent to shape long-term company culture.
- Prioritize building evals for the most economically valuable capabilities your customers care about.
- Treat model evaluation artifacts as reusable infrastructure that supports training, verification, and sales.
- Lean into elastic domains where increased individual productivity will create additional demand.