Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)
The Quiet Craft Behind Reliable AI: Why 'Evals' Matter
There is a practical ritual emerging inside teams that build conversational AIs: a disciplined, iterative practice of looking at what the model actually does for users and turning those observations into tests that guard against regressions. The discipline goes by a shorthand—"evals"—and it moves product work away from guesswork and toward measurable improvement. At its heart, this is not a new argument about better models; it is an argument about better measurement.
From logs to learning: the anatomy of error analysis for LLM applications
It begins with traces: the raw sequences of system prompts, tool calls, retrieved documents and model outputs recorded during real user interactions. These traces are messy and human-shaped—short texts, half-formed questions, tool failures, and odd context switches that reveal how a product actually behaves in the wild. The simplest, highest-leverage step is to open those logs and write notes: the researchers call it open coding. You do not need perfection. You do not need to automate. You need curiosity and a product hat.
Open coding is intentionally informal. The point is to capture the first upstream error you see in each trace—because that first error often cascades into others. Do this across a representative sample; the practitioners recommend starting with roughly a hundred traces. That sample is enough to develop a nose for recurring failure modes and, crucially, to reach theoretical saturation: the moment when new examples stop producing new categories of problems.
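As a minimal sketch, drawing a reproducible sample of traces to open-code might look like the following. The trace shape and the `sample_traces` helper are assumptions for illustration, not a standard schema:

```python
import random

def sample_traces(traces, k=100, seed=0):
    """Draw a reproducible sample of roughly k traces for open coding."""
    rng = random.Random(seed)
    return rng.sample(traces, k=min(k, len(traces)))

# Hypothetical traces; real ones would come from your logging system
# and contain the full conversation, tool calls, and retrievals.
traces = [{"id": i, "messages": []} for i in range(250)]
sample = sample_traces(traces)

# Open codes: one free-form note per trace, capturing the FIRST
# upstream error you see while reading the trace top to bottom.
open_codes = {t["id"]: "" for t in sample}  # fill in by hand
```

Fixing the random seed matters less for statistics than for collaboration: two reviewers open-coding the same sample can compare notes trace by trace.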
Turning observations into categories: axial coding and the pivot table
Once a set of open codes exists, the next move is to synthesize: group similar notes into axial codes, meaningful failure-mode categories like "human handoff failures," "hallucinated tool capabilities," or "conversational flow breakdowns." This is where simple counting and pivot tables become powerful analytic tools. A pivot table can turn a hundred qualitative notes into a ranked list of problems, exposing the few failure modes that matter most to the product's experience.
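A spreadsheet pivot table works well for this step, but even the standard library can produce the ranked list. The axial codes and counts below are illustrative, not real data:

```python
from collections import Counter

# Hypothetical axial codes assigned to 100 open-coded traces.
axial_codes = (
    ["human handoff failure"] * 31
    + ["hallucinated tool capability"] * 24
    + ["conversational flow breakdown"] * 18
    + ["formatting error"] * 15
    + ["other"] * 12
)

# The "pivot table": count each failure mode and rank by prevalence.
ranked = Counter(axial_codes).most_common()
for code, count in ranked:
    print(f"{code:35s} {count:3d}  ({count / len(axial_codes):.0%})")
```

The ranked output is the decision artifact: the top two or three rows usually account for most of the user-visible pain, and those are the failure modes worth formalizing first.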
Those ranked problems inform choices about what to fix directly, what to monitor in production, and what merits a formal evaluator—or "eval"—that runs automatically.
Three pragmatic evaluator types and when to use them
- Code-based checks: deterministic, cheap, and ideal when the condition can be checked by string or schema rules (e.g., required JSON structure, presence of a link).
- LLM-as-judge prompts: narrow, binary judgments used when a failure mode is complex and hard to express in code—set them up as pass/fail, not 1-to-5 scales.
- Human review and monitoring: sampling real production traces and combining human notes with automated classification to catch long-tail problems.
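A check of the first kind, verifying required JSON structure, might look like this sketch; the field names are assumptions for illustration:

```python
import json

def has_required_fields(output: str, required=("listing_url", "price")) -> bool:
    """Code-based eval: pass only if the model emitted valid JSON
    containing every required field. Deterministic and cheap to run
    on every trace."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required)

assert has_required_fields('{"listing_url": "https://example.com/12", "price": 1200}')
assert not has_required_fields("Sure! Here is the listing you asked for.")
```

Because the check is deterministic, it can run on every trace rather than a sample, which is exactly why code-based checks should be the first resort.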
The secret is scope: make each LLM judge evaluate one tightly defined behavior and return a binary outcome. Then validate that judge against human-labeled data to measure alignment and avoid surprising drift.
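Validating a judge can be as simple as comparing its binary labels against a human-labeled set. This sketch (the `judge_alignment` helper is made up here) reports per-class rates alongside raw agreement, so a judge that passes everything cannot hide behind accuracy:

```python
def judge_alignment(judge_labels, human_labels):
    """Compare a binary LLM judge against human gold labels.
    True = pass, False = fail."""
    pairs = list(zip(judge_labels, human_labels))
    agree = sum(j == h for j, h in pairs) / len(pairs)
    tpr = (sum(j and h for j, h in pairs)
           / max(1, sum(h for _, h in pairs)))          # recovers real passes
    tnr = (sum((not j) and (not h) for j, h in pairs)
           / max(1, sum(not h for _, h in pairs)))      # recovers real failures
    return {"agreement": agree, "true_pass_rate": tpr, "true_fail_rate": tnr}

# Hypothetical labels on six human-reviewed traces.
judge = [True, True, False, True, False, True]
human = [True, False, False, True, False, True]
print(judge_alignment(judge, human))
```

If the true-fail rate is low, the judge is missing exactly the cases the eval exists to catch; tighten the judge prompt and re-measure before trusting it in CI.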
The benevolent dictator and the politics of evaluation
Teams often stall in the social processes around evaluation. One practical prescription is appointing a "benevolent dictator"—a domain-expert owner who does the open coding and makes early judgments. That role keeps the process affordable and decisive, while preserving domain expertise for subtle judgments: leasing behavior should be judged by leasing experts, medical advice by clinicians.
Decisions about who owns eval definitions and what counts as a failure are inherently product-level judgments. The trick is to be explicit and iterative about those definitions rather than hoping a single rubric will be perfect upfront.
Where AI helps and where it misleads
Large language models accelerate synthesis: they can cluster open codes into axial codes, categorize notes against a taxonomy, and speed up the creation of judge prompts. But they cannot replace the initial exploratory phase. LLMs lack context about what a product actually supports; when an assistant says "virtual tour available," an LLM without product context may treat that as fine while a human with domain knowledge sees a hallucination. Use AI to organize and scale work, not to skip human discovery.
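One low-risk way to use an LLM for synthesis is to template the categorization task against your human-written taxonomy. The prompt wording below is illustrative, and the actual call to your LLM provider is deliberately omitted; the point is to scale labeling, not to let the model invent the categories:

```python
def categorize_prompt(note, taxonomy):
    """Build a prompt asking an LLM to assign exactly ONE axial code
    (from a human-written taxonomy) to an open-coding note."""
    options = "\n".join(f"- {c}" for c in taxonomy)
    return (
        "Assign exactly one failure-mode category to this note.\n"
        f"Categories:\n{options}\n"
        f"Note: {note}\n"
        "Answer with the category name only."
    )

print(categorize_prompt(
    "bot ignored the user's request to speak to a person",
    ["human handoff failure", "hallucinated tool capability", "other"],
))
```

Constraining the model to a closed set of human-defined categories keeps the exploratory judgment with the domain expert while the LLM does the bulk labeling.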
Operationalizing evals: from CI to production monitoring
Good evals are not just pre-release gates. They belong in continuous integration and in production monitoring: sample daily traces, run the judges, and surface rising error rates to product dashboards. A small suite—four to seven well-chosen evaluators—is often enough to keep a conversational product healthy and safe without drowning engineering capacity.
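A daily monitoring job can be little more than sample, run, report. This sketch assumes binary judges expressed as plain functions over a trace; wiring the report into a dashboard is left out:

```python
import random

def daily_eval_report(traces, judges, sample_size=50, seed=None):
    """Sample production traces, run each binary judge, and report
    per-judge failure rates. 'judges' maps a name to a function that
    returns True when the trace passes."""
    rng = random.Random(seed)
    sample = rng.sample(traces, k=min(sample_size, len(traces)))
    return {
        name: sum(not judge(t) for t in sample) / len(sample)
        for name, judge in judges.items()
    }

# Hypothetical traces and a trivial code-based judge.
traces = [{"reply": "ok"} for _ in range(40)] + [{"reply": ""} for _ in range(10)]
judges = {"non_empty_reply": lambda t: bool(t["reply"])}
print(daily_eval_report(traces, judges, seed=1))
```

A rising failure rate on any one judge is the production signal to go back and open-code a fresh sample around that failure mode.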
Resolving controversies: why the debate matters
Arguments in public forums often collapse into an either-or: "vibes and dogfooding" versus "formal evaluations." The more useful framing is that rigorous evaluation is product-minded data science. Different products need different mixes of methods; coding assistants can rely more on developer dogfooding, while domain-sensitive systems require explicit human-in-the-loop validation and domain experts. The debate reflects misunderstanding more than contradiction.
Concluding thought: the craft of building dependable AI is less about a single algorithm than about creating a repeatable measurement practice—that combination of curiosity, simple counting, and tight, testable definitions that turns surprising model behavior into actionable product improvements.
Insights
- Sample real user traces and write brief open codes before building any automated tests.
- Aggregate open codes into axial codes to find the most prevalent, actionable failure modes.
- Choose evaluators conservatively: build code checks when possible and LLM judges only when necessary.
- Make every LLM-as-judge a binary decision and benchmark it against human judgments.
- Run evals regularly on production samples to monitor degradation and long-tail issues.
- Keep the eval suite small (four to seven key evaluators) and expand only when necessary.




