Why experts writing AI evals is creating the fastest-growing companies in history | Brendan Foody (CEO of Mercor)
The era of evals: how human expertise became the scaffolding for smarter AI
When the conversation about artificial intelligence turns technical, it often centers on models and compute. Yet another axis has quietly asserted itself as decisive: the labor and craft of measuring what success looks like. Brendan Foody, the founder and CEO behind a company that rocketed from a million-dollar run rate to hundreds of millions in months, argues that evals—structured tests, rubrics, and verification systems—are the product requirement documents for AI. They determine what researchers should optimize for, how engineers hill-climb capabilities, and ultimately what a model will be rewarded to become.
From crowdsourcing to curated expert markets
There was an earlier era in the human-data economy dominated by high-volume crowdsourcing: cheap, plentiful annotators who helped early models learn grammar, surface-level facts, and basic tasks. As models matured, a new need emerged: professionals who could assess high-stakes capabilities in narrow domains. The marketplace shifted from quantity to caliber. What used to be solved by thousands of low-skill contributors is now solved by dozens of lawyers, doctors, investment bankers, software engineers, and comedians who can write the rubrics that define correctness.
Why expertise matters after pretraining
Large-scale pretraining floods a model with token-level knowledge, but it does not reliably teach the model how to reason, apply domain judgment, or prioritize accurate outcomes under complex constraints. Post-training—especially reinforcement learning driven by carefully defined rewards or AI-feedback loops—lets model behavior be shaped by concrete success criteria. That means someone has to craft those criteria: lawyers must write what a good contract redline actually looks like; radiologists must define diagnostic rewards; screenwriters must teach humor to a generative agent. The work is not glamorous, but it is crucial.
Mercor’s strategic play: a labor marketplace for model capability
What happens when a company decides to specialize in sourcing, vetting, and retaining the top decile of domain experts for labs? If executed well, it becomes a chokepoint for model progress. Brendan describes a striking dynamic: within a cohort of a hundred contributors, the top ten percent often drive most of the measurable model improvement. The strategic payoff is therefore not just volume but the privileged access to a persistent group of high-impact evaluators who can be matched to the problem at hand.
Speed, scale, and customer obsession
The story of rapid growth—moving from a small startup to a near-billion-dollar business trajectory in under two years—was not accidental. It depended on three overlapping principles: relentless ambition, a recruiting bar that prioritized former founders and rare operators, and an intensity of execution that favored customer feedback loops over early sales theatrics. Instead of polishing marketing, the early focus was product and experience: build something that flagship customers cannot live without, and they will pull you forward.
What a day in the life of an evaluator looks like
Contrary to the image of repetitive labeling, much of the most valuable work is design and judgment: crafting rubrics, scoring model outputs against professional standards, curating counterexamples, and sometimes producing supervised pairs to show what a correct output should be. These materials then serve double duty: they act as the benchmark researchers use to measure progress and as the reward signals that guide reinforcement learning from AI feedback.
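The rubric-to-reward pipeline described above can be made concrete with a minimal sketch. The `Criterion` structure, the weights, and the contract-redline checks below are all hypothetical illustrations, not anything from Mercor's actual tooling; the point is only that a professionally written rubric reduces to a weighted checklist a machine can score against.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """One line of a rubric: a description, a weight, and a check."""
    description: str
    weight: float
    check: Callable[[str], bool]  # True if the model output satisfies it

def score(output: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria the output satisfies (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total

# Hypothetical rubric a lawyer might write for a contract-redline task
rubric = [
    Criterion("Flags the missing indemnification clause", 2.0,
              lambda out: "indemnif" in out.lower()),
    Criterion("Addresses governing law", 1.0,
              lambda out: "governing law" in out.lower()),
]

print(score("Redline: add an indemnification clause.", rubric))  # 2/3 ≈ 0.667
```

The same score can then serve both roles the passage names: reported as a benchmark number, or fed back as a reward signal during post-training.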
Examples and data types
- Unit tests for code: precise, programmatic checks that scale evaluation.
- Legal rubrics: lists of clauses and negotiation priorities that define redlines.
- Clinical verifiers: diagnostic criteria used to weigh model recommendations.
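The first item, unit tests for code, is the most mechanically scalable of the three: a model's output either passes the tests or it does not. A minimal sketch of such a harness follows; the `solution` entry-point name is an assumption for illustration, and a real eval harness would sandbox execution rather than call `exec` directly.

```python
def run_unit_test_eval(candidate_code: str, test_cases: list[tuple]) -> bool:
    """Execute model-generated code and grade it against programmatic checks.
    Sketch only: exec() on untrusted code is unsafe outside a sandbox."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        fn = namespace["solution"]       # assumed entry-point name
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:
        return False  # any crash or missing function counts as a failure

# A correct model output for the prompt "return the sum of a list"
candidate = "def solution(xs):\n    return sum(xs)\n"
tests = [(([1, 2, 3],), 6), (([],), 0)]
print(run_unit_test_eval(candidate, tests))  # True
```

Legal rubrics and clinical verifiers are harder to make this programmatic, which is exactly why they require the expert judgment the article describes.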
Which jobs will last—and which will expand—when AI frees productivity
One of the sharper distinctions in the conversation is between elastic and inelastic domains. Elastic industries—software engineering, product, certain creative fields—can scale nearly without bound as productivity multiplies; when engineers become ten times as productive, software ecosystems and feature sets can expand dramatically. Inelastic areas, where demand is tied to finite processes (each person files only one tax return, however cheap filing becomes), will not necessarily multiply with productivity enhancements. Brendan's practical counsel for students and early-career professionals leans toward domains where increased productivity creates additional demand.
A longer view on intelligence and the role of humans
There are two competing myths to be wary of: the panic that superintelligence is imminent and the complacency that human expertise will soon be redundant. Both miss a third reality—this is a long road of many incremental capability improvements, each of which depends on human judgment. The path forward will be paved by thousands of finely tuned evals and domain-specific post-training datasets, not merely by another wave of pretraining.
Culture at breakneck pace
Scaling a business that sits at the intersection of talent markets and AI requires choices about hiring speed and standards. Early patience to seed exceptional talent paid enormous dividends, but there comes a moment when velocity matters more than perfect selectivity. Brendan’s playbook calibrated both: a careful initial talent density followed by aggressive scale once market pull was unmistakable.
Conclusion: expertise as infrastructure
What feels like a niche—expert evaluators writing rubrics and grading model outputs—reads differently when seen as infrastructure. It is the scaffolding on which reliable, capable systems will be built. As models become more central to medicine, law, education, and product design, the demand for people who can articulate what excellent looks like will grow, not vanish. The work is a peculiar mixture of craft, pedagogy, and measurement; it is the labor of translating human standards into machine incentives. The quiet truth that emerges is a paradox of our moment: automation will reshape the world, but it will also elevate a new class of work whose currency is precise judgment and the ability to make machines measure up to human expectations.
Insights
- Design a rubric before automating a workflow so you can objectively measure AI performance.
- Focus hiring early on a dense cluster of exceptional talent to shape long-term company culture.
- Prioritize building evals for the most economically valuable capabilities your customers care about.
- Treat model evaluation artifacts as reusable infrastructure that supports training, verification, and sales.
- Lean into elastic domains where increased individual productivity will create additional demand.