Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)
The Quiet Craft Behind Reliable AI: Why 'Evals' Matter
There is a practical ritual emerging inside teams that build conversational AIs: a disciplined, iterative practice of looking at what the model actually does for users and turning those observations into tests that guard against regressions. The discipline goes by a shorthand—"evals"—and it moves product work away from guesswork and toward measurable improvement. At its heart, this is not a new argument about better models; it is an argument about better measurement.
From logs to learning: the anatomy of error analysis for LLM applications
It begins with traces: the raw sequences of system prompts, tool calls, retrieved documents and model outputs recorded during real user interactions. These traces are messy and human-shaped—short texts, half-formed questions, tool failures, and odd context switches that reveal how a product actually behaves in the wild. The simplest, highest-leverage step is to open those logs and write notes: the researchers call it open coding. You do not need perfection. You do not need to automate. You need curiosity and a product hat.
Open coding is intentionally informal. The point is to capture the first upstream error you see in each trace—because that first error often cascades into others. Do this across a representative sample; the practitioners recommend starting with roughly a hundred traces. That sample is enough to develop a nose for recurring failure modes and, crucially, to reach theoretical saturation: the moment when new examples stop producing new categories of problems.
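As a minimal sketch, drawing a reproducible sample of traces to open-code might look like the following. The trace shape and the `sample_traces` helper are assumptions for illustration, not a standard schema:

```python
import random

def sample_traces(traces, k=100, seed=0):
    """Draw a reproducible sample of roughly k traces for open coding."""
    rng = random.Random(seed)
    return rng.sample(traces, k=min(k, len(traces)))

# Hypothetical traces; real ones would come from your logging system
# and contain the full conversation, tool calls, and retrievals.
traces = [{"id": i, "messages": []} for i in range(250)]
sample = sample_traces(traces)

# Open codes: one free-form note per trace, capturing the FIRST
# upstream error you see while reading the trace top to bottom.
open_codes = {t["id"]: "" for t in sample}  # fill in by hand
```

Fixing the random seed matters less for statistics than for collaboration: two reviewers open-coding the same sample can compare notes trace by trace.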
Turning observations into categories: axial coding and the pivot table
Once a set of open codes exists, the next move is to synthesize: group similar notes into axial codes, meaningful failure-mode categories like "human handoff failures," "hallucinated tool capabilities," or "conversational flow breakdowns." This is where simple counting and pivot tables become powerful analytic tools. A pivot table can turn a hundred qualitative notes into a ranked list of problems, exposing the few failure modes that matter most to the product's experience.
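A spreadsheet pivot table works well for this step, but even the standard library can produce the ranked list. The axial codes and counts below are illustrative, not real data:

```python
from collections import Counter

# Hypothetical axial codes assigned to 100 open-coded traces.
axial_codes = (
    ["human handoff failure"] * 31
    + ["hallucinated tool capability"] * 24
    + ["conversational flow breakdown"] * 18
    + ["formatting error"] * 15
    + ["other"] * 12
)

# The "pivot table": count each failure mode and rank by prevalence.
ranked = Counter(axial_codes).most_common()
for code, count in ranked:
    print(f"{code:35s} {count:3d}  ({count / len(axial_codes):.0%})")
```

The ranked output is the decision artifact: the top two or three rows usually account for most of the user-visible pain, and those are the failure modes worth formalizing first.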
Those ranked problems inform choices about what to fix directly, what to monitor in production, and what merits a formal evaluator—or "eval"—that runs automatically.
Three pragmatic evaluator types and when to use them
- Code-based checks: deterministic, cheap, and ideal when the condition can be checked by string or schema rules (e.g., required JSON structure, presence of a link).
- LLM-as-judge prompts: narrow, binary judgments used when a failure mode is complex and hard to express in code—set them up as pass/fail, not 1-to-5 scales.
- Human review and monitoring: sampling real production traces and combining human notes with automated classification to catch long-tail problems.
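A check of the first kind, verifying required JSON structure, might look like this sketch; the field names are assumptions for illustration:

```python
import json

def has_required_fields(output: str, required=("listing_url", "price")) -> bool:
    """Code-based eval: pass only if the model emitted valid JSON
    containing every required field. Deterministic and cheap to run
    on every trace."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required)

assert has_required_fields('{"listing_url": "https://example.com/12", "price": 1200}')
assert not has_required_fields("Sure! Here is the listing you asked for.")
```

Because the check is deterministic, it can run on every trace rather than a sample, which is exactly why code-based checks should be the first resort.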
The secret is scope: make each LLM judge evaluate one tightly defined behavior and return a binary outcome. Then validate that judge against human-labeled data to measure alignment and avoid surprising drift.
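Validating a judge can be as simple as comparing its binary labels against a human-labeled set. This sketch (the `judge_alignment` helper is made up here) reports per-class rates alongside raw agreement, so a judge that passes everything cannot hide behind accuracy:

```python
def judge_alignment(judge_labels, human_labels):
    """Compare a binary LLM judge against human gold labels.
    True = pass, False = fail."""
    pairs = list(zip(judge_labels, human_labels))
    agree = sum(j == h for j, h in pairs) / len(pairs)
    tpr = (sum(j and h for j, h in pairs)
           / max(1, sum(h for _, h in pairs)))          # recovers real passes
    tnr = (sum((not j) and (not h) for j, h in pairs)
           / max(1, sum(not h for _, h in pairs)))      # recovers real failures
    return {"agreement": agree, "true_pass_rate": tpr, "true_fail_rate": tnr}

# Hypothetical labels on six human-reviewed traces.
judge = [True, True, False, True, False, True]
human = [True, False, False, True, False, True]
print(judge_alignment(judge, human))
```

If the true-fail rate is low, the judge is missing exactly the cases the eval exists to catch; tighten the judge prompt and re-measure before trusting it in CI.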
The benevolent dictator and the politics of evaluation
Teams often stall in the social processes around evaluation. One practical prescription is appointing a "benevolent dictator"—a domain-expert owner who does the open coding and makes early judgments. That role keeps the process affordable and decisive, while preserving domain expertise for subtle judgments: leasing behavior should be judged by leasing experts, medical advice by clinicians.
Decisions about who owns eval definitions and what counts as a failure are inherently product-level judgments. The trick is to be explicit and iterative about those definitions rather than hoping a single rubric will be perfect upfront.
Where AI helps and where it misleads
Large language models accelerate synthesis: they can cluster open codes into axial codes, categorize notes against a taxonomy, and speed up the creation of judge prompts. But they cannot replace the initial exploratory phase. LLMs lack context about what a product actually supports; when an assistant says "virtual tour available," an LLM without product context may treat that as fine while a human with domain knowledge sees a hallucination. Use AI to organize and scale work, not to skip human discovery.
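One low-risk way to use an LLM for synthesis is to template the categorization task against your human-written taxonomy. The prompt wording below is illustrative, and the actual call to your LLM provider is deliberately omitted; the point is to scale labeling, not to let the model invent the categories:

```python
def categorize_prompt(note, taxonomy):
    """Build a prompt asking an LLM to assign exactly ONE axial code
    (from a human-written taxonomy) to an open-coding note."""
    options = "\n".join(f"- {c}" for c in taxonomy)
    return (
        "Assign exactly one failure-mode category to this note.\n"
        f"Categories:\n{options}\n"
        f"Note: {note}\n"
        "Answer with the category name only."
    )

print(categorize_prompt(
    "bot ignored the user's request to speak to a person",
    ["human handoff failure", "hallucinated tool capability", "other"],
))
```

Constraining the model to a closed set of human-defined categories keeps the exploratory judgment with the domain expert while the LLM does the bulk labeling.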
Operationalizing evals: from CI to production monitoring
Good evals are not just pre-release gates. They belong in continuous integration and in production monitoring: sample daily traces, run the judges, and surface rising error rates to product dashboards. A small suite—four to seven well-chosen evaluators—is often enough to keep a conversational product healthy and safe without drowning engineering capacity.
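A daily monitoring job can be little more than sample, run, report. This sketch assumes binary judges expressed as plain functions over a trace; wiring the report into a dashboard is left out:

```python
import random

def daily_eval_report(traces, judges, sample_size=50, seed=None):
    """Sample production traces, run each binary judge, and report
    per-judge failure rates. 'judges' maps a name to a function that
    returns True when the trace passes."""
    rng = random.Random(seed)
    sample = rng.sample(traces, k=min(sample_size, len(traces)))
    return {
        name: sum(not judge(t) for t in sample) / len(sample)
        for name, judge in judges.items()
    }

# Hypothetical traces and a trivial code-based judge.
traces = [{"reply": "ok"} for _ in range(40)] + [{"reply": ""} for _ in range(10)]
judges = {"non_empty_reply": lambda t: bool(t["reply"])}
print(daily_eval_report(traces, judges, seed=1))
```

A rising failure rate on any one judge is the production signal to go back and open-code a fresh sample around that failure mode.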
Resolving controversies: why the debate matters
Arguments in public forums often collapse into an either-or: "vibes and dogfooding" versus "formal evaluations." The more useful framing is that rigorous evaluation is product-minded data science. Different products need different mixes of methods; coding assistants can rely more on developer dogfooding, while domain-sensitive systems require explicit human-in-the-loop validation and domain experts. The debate reflects misunderstanding more than contradiction.
Concluding thought: the craft of building dependable AI is less about a single algorithm than about creating a repeatable measurement practice—that combination of curiosity, simple counting, and tight, testable definitions that turns surprising model behavior into actionable product improvements.
Insights
- Sample real user traces and write brief open codes before building any automated tests.
- Aggregate open codes into axial codes to find the most prevalent, actionable failure modes.
- Choose evaluators conservatively: build code checks when possible and LLM judges only when necessary.
- Make every LLM-as-judge a binary decision and benchmark it against human judgments.
- Run evals regularly on production samples to monitor degradation and long-tail issues.
- Keep the eval suite small (four to seven key evaluators) and expand only when necessary.




