LLM evals explained: how to test agent behavior before it reaches production (July 2026)

LLM failures don't throw exceptions. A regression in faithfulness or relevance produces no stack trace, no alert, and no signal in your error logs; the output gets quietly worse. Evals are structured tests that measure model, agent, and workflow performance against defined criteria. Some check binary properties: schema validation, correct tool calls, and required fields. Others assess subjective qualities using LLM-as-judge methods, including tone, factual accuracy, and relevance. In 2026, evals are the production gate. You run them before every deployment, after every prompt change, and continuously on live traffic. Without them, the first person to find the regression is a user.
TLDR:
LLM evals test agent behavior against defined criteria before production, catching regressions that don't trigger errors.
Traditional unit tests miss semantic failures in which responses pass structural checks but drift from context.
Deterministic evals check schema and tool calls; LLM-as-judge scores subjective qualities at up to 90% human alignment.
Action-taking agents need four additional metrics (tool use/selection accuracy, task completion rate, step economy, and reasoning coherence) to catch failures that output quality scores miss.
RAG systems need separate metrics for retrieval (e.g., contextual recall, precision) and generation (e.g., faithfulness).
Logic generates test cases from your spec and blocks deploys when regressions surface.
What LLM evals are and why they matter
You tweak a prompt, the output looks fine in your terminal, and you ship it. Two days later, a customer hits an edge case that returns garbage.
An LLM eval is a structured test that measures how well your model or agent performs against defined criteria. Some check binary properties: whether the output contains required fields. Others score fuzzier qualities, including tone, factual accuracy, and relevance. Manual spot-checking doesn't scale alongside the agents it's meant to verify.
In 2026, evals are the production gate. You run them before every deployment and after every prompt change. Skip them, and your users become your QA team.
The core problem: why traditional testing breaks for AI
The moment you sample from a model at non-zero temperature, traditional test assertions stop working reliably. Unit tests verify that given input A, the function returns output B, a guarantee that holds for deterministic code: run it a thousand times, get the same result. LLMs don't give you that.
At non-zero temperature, ask the same question twice and you may get two different phrasings, two different structures, or two subtly different conclusions. Unit tests still verify the deterministic shell around your agent: schema compliance and routing correctness. But they are blind to the semantic layer, which is where the actual reasoning happens. A response can pass every structural check and still be wrong in ways that matter to your users.
The failure mode that kills you is the quiet one. A 10 percent regression on faithfulness, per FutureAGI, produces no exceptions, failed assertions, or alerts in your monitoring dashboard. Your logs look clean. Your agent is confidently generating answers that drift further from the retrieved context, and you won't know until a user reads the response and realizes it's fabricating details. Error logs cost you engineering time. Trust erosion costs you customers.
Two approaches to evaluation: deterministic vs LLM-as-judge
Once you accept that traditional assertions can't cover everything, the question becomes which evaluation method to reach for and when. Two approaches dominate, and most production pipelines need both.
Deterministic evals
These are code-based checks with binary outcomes. Does the JSON match the expected schema? Is the classification field one of the allowed enum values? Did the agent call the right tool? You write them the same way you'd write any unit test, and they run fast. Where your agent's output has invariant properties, deterministic evals are the right tool.
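A minimal sketch of such checks in Python may help make this concrete. The field names, allowed labels, and tool names below are hypothetical placeholders, not part of any particular framework; substitute your agent's own schema.

```python
import json

# Hypothetical schema for illustration only.
ALLOWED_LABELS = {"billing", "technical", "other"}
REQUIRED_FIELDS = {"label": str, "summary": str}


def check_output(raw: str, expected_tool: str, called_tool: str) -> list:
    """Return a list of failure messages; an empty list means every check passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []
    # Schema check: every required field is present with the expected type.
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            failures.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            failures.append(f"wrong type for field: {field}")

    # Enum check: the classification must be one of the allowed values.
    label = data.get("label")
    if isinstance(label, str) and label not in ALLOWED_LABELS:
        failures.append(f"label not in allowed values: {label}")

    # Tool check: the agent called the tool the test case expects.
    if called_tool != expected_tool:
        failures.append(f"expected tool {expected_tool}, got {called_tool}")
    return failures
```

Returning every failure message, instead of stopping at the first, makes it easier to see at a glance whether a regression is isolated or structural.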
LLM-as-judge
Subjective qualities like relevance, coherence, and tone resist hard-coded assertions. LLM-as-judge uses a second model to score the primary agent's output against criteria you define. The economics are stark: human evaluation runs $5 to $50 per instance and tops out at dozens of reviews per day, per LLM Evaluation in 2026, while an LLM judge costs fractions of a cent and handles thousands per minute.
The concern is obvious: can you trust a model to grade a model? Benchmark evaluations on frontier thinking models put human-judge alignment at up to 90 percent across both pairwise and single-output scoring. That's not perfect, but it's repeatable and cheap enough to run on every commit.
Use deterministic evals for structure, LLM-as-judge for meaning.
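The judge pattern described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the prompt wording, the 1 to 5 scale, and the `call_model` callable are all hypothetical, and `fake_judge` is a stub standing in for whatever model client you use.

```python
import re
from typing import Callable, Optional

# Hypothetical judge prompt; tune the wording and scale to your own rubric.
JUDGE_PROMPT = """You are grading an assistant's answer.
Criterion: {criterion}
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (fails) to 5 (fully satisfies)."""


def judge_score(question: str, answer: str, criterion: str,
                call_model: Callable[[str], str]) -> Optional[int]:
    """Ask a judge model for a 1-5 score; return None if the reply has no such score."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, question=question, answer=answer)
    reply = call_model(prompt)
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def fake_judge(prompt: str) -> str:
    """Stub judge so the sketch runs offline; replace with a real model call."""
    return "Score: 4"


score = judge_score("What is the refund window?", "30 days from delivery.",
                    "relevance", fake_judge)
print(score)  # 4
```

Passing the model client in as a function keeps the scoring logic testable without network access, and treating an unparseable reply as `None` avoids silently counting a malformed judgment as a pass or a fail.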
Core evaluation metrics you need to track
The metrics you track depend on what your agent does and where it can go wrong. General output metrics cover the baseline qualities any agent should maintain. RAG systems layer on a second set because retrieval quality directly shapes generation quality, and you need to diagnose which stage failed.
| Category | Metric | What it measures |
|---|---|---|
| General | Factuality | Whether claims in the output are verifiable and accurate |
| General | Relevance | Whether the response answers the user's actual question |
| General | Coherence | Whether the output reads as logically structured and internally consistent |
| General | Safety | Whether the output avoids harmful, biased, or policy-violating content |
| General | Semantic similarity | Whether the output's meaning aligns with a reference answer, regardless of phrasing |
| RAG | Answer relevancy | How directly the generated response maps to the input query (called Response Relevancy in RAGAS) |
| RAG | Faithfulness | Whether the response stays grounded in the retrieved context or introduces hallucinated claims |
| RAG | Contextual relevancy | Whether the retrieved documents are actually relevant to the query |
| RAG | Contextual recall | Whether the retrieval step surfaced all the information needed to answer correctly |
| RAG | Contextual precision | Whether the most relevant retrieved chunks are ranked highest |
| Agent | Tool use/selection accuracy | Whether the agent calls the correct tool with the right arguments at each step |
| Agent | Task completion rate | Whether the agent reached the user's goal end-to-end, regardless of the path taken |
| Agent | Step economy | Whether the agent completes tasks without looping or burning excess tokens |
| Agent | Reasoning coherence | Whether the agent's sequence of decision steps is logically consistent with its goal and spec |
A bad answer from a retrieval-augmented agent can fail at two different points. If contextual recall is low, the retrieval step didn't surface the right documents. The model lacks the information it needs to answer correctly. If faithfulness is low, the model has the right context but ignores it. Treating these as a single "accuracy" score hides where the fix belongs. Many off-the-shelf eval dashboards report a combined accuracy score by default; that number won't tell you whether the failure happened at retrieval or generation.
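To illustrate why the two scores should stay separate, here is a small sketch that routes a failing example to the stage worth investigating first. The `RagScores` type and the 0.8 thresholds are assumptions made for the example, not recommended values; the scores themselves would come from whatever metric implementation you already use.

```python
from dataclasses import dataclass


@dataclass
class RagScores:
    contextual_recall: float  # 0.0-1.0: share of needed facts present in the retrieved context
    faithfulness: float       # 0.0-1.0: share of response claims supported by that context


def diagnose(scores: RagScores, recall_floor: float = 0.8,
             faithfulness_floor: float = 0.8) -> str:
    """Name the stage to investigate first. Thresholds are illustrative only."""
    # Check retrieval first: if the right documents never arrived,
    # the generation step had nothing correct to be faithful to.
    if scores.contextual_recall < recall_floor:
        return "retrieval"
    if scores.faithfulness < faithfulness_floor:
        return "generation"
    return "ok"


print(diagnose(RagScores(contextual_recall=0.55, faithfulness=0.95)))  # retrieval
print(diagnose(RagScores(contextual_recall=0.92, faithfulness=0.60)))  # generation
```

A single blended accuracy number would report both of these examples as roughly equally bad, even though one calls for a retriever fix and the other for a prompt or model fix.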
If your agent takes actions (calling tools, executing code, or reasoning across multiple steps), output quality metrics alone won't catch the failures that matter. A faithfulness score won't tell you whether the agent called the right tool or whether it reached the user's goal in a reasonable number of steps. Track tool use/selection accuracy for correctness at each action step, task completion rate for end-to-end success, step economy to catch agents that loop or burn tokens without making progress, and reasoning coherence to verify that the agent's sequence of decision steps is logically consistent with its goal.
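Three of these four metrics can be computed directly from execution traces, as the sketch below suggests. The `Step` and `Trace` shapes and the per-run `step_budget` are hypothetical; adapt them to whatever your tracing layer records. Reasoning coherence is left out because it typically needs a judge model, not arithmetic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    expected_tool: str
    called_tool: str
    args_correct: bool


@dataclass
class Trace:
    steps: List[Step]
    goal_reached: bool
    step_budget: int  # hypothetical: the most steps a reasonable run should need


def tool_accuracy(traces: List[Trace]) -> float:
    """Share of steps where the agent called the expected tool with correct arguments."""
    steps = [s for t in traces for s in t.steps]
    correct = sum(s.called_tool == s.expected_tool and s.args_correct for s in steps)
    return correct / len(steps) if steps else 0.0


def task_completion_rate(traces: List[Trace]) -> float:
    """Share of runs that reached the user's goal, regardless of path."""
    return sum(t.goal_reached for t in traces) / len(traces) if traces else 0.0


def step_economy(traces: List[Trace]) -> float:
    """Share of runs that stayed within their step budget."""
    return sum(len(t.steps) <= t.step_budget for t in traces) / len(traces) if traces else 0.0


traces = [
    Trace(steps=[Step("search", "search", True), Step("send_email", "send_email", True)],
          goal_reached=True, step_budget=3),
    Trace(steps=[Step("search", "lookup", True), Step("search", "search", True),
                 Step("send_email", "send_email", True)],
          goal_reached=False, step_budget=2),
]
print(tool_accuracy(traces), task_completion_rate(traces), step_economy(traces))
# 0.8 0.5 0.5
```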
Building your first eval: the golden dataset
Your golden dataset needs 200 to 500 examples sourced from real production failures, not synthetic data, per Golden Datasets for AI Evaluation. Pull from three places:
Production logs where the agent returned incorrect or low-quality output
Edge cases that broke previous deployments
A representative distribution of actual user queries, weighted toward the patterns your users hit most often
Synthetic examples fill gaps in input distributions you haven't yet seen in production: rare languages, adversarial inputs, or edge cases your users haven't hit yet. A dataset built primarily from real failures catches the regressions that matter.
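One way to assemble such a dataset is sketched below: tag each example with where it came from, drop duplicate inputs, and serialize to JSONL. The `GoldenExample` fields and the source labels are assumptions made for illustration, not a standard format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class GoldenExample:
    input: str
    expected: str
    source: str  # e.g. "production_failure", "past_regression", "representative", "synthetic"


def build_golden_set(candidates: List[GoldenExample]) -> List[GoldenExample]:
    """Drop duplicate inputs (ignoring case and extra whitespace), keeping the first seen."""
    seen = set()
    kept = []
    for example in candidates:
        key = " ".join(example.input.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(example)
    return kept


def to_jsonl(examples: List[GoldenExample]) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(asdict(e)) for e in examples)


candidates = [
    GoldenExample("Where is my order?", "Ask for the order number.", "production_failure"),
    GoldenExample("where is  my ORDER?", "Ask for the order number.", "representative"),
    GoldenExample("Cancel my subscription", "Confirm, then cancel.", "past_regression"),
]
golden = build_golden_set(candidates)
print(len(golden))  # 2
```

Listing production failures first means that when an input appears in more than one source, the copy tied to a real failure is the one that survives deduplication.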
OpenAI evals framework: how it works
The hosted OpenAI Evals Dashboard is being deprecated: existing eval content becomes read-only on October 31, 2026, and the hosted UI shuts down entirely on November 30, 2026, per the OpenAI API deprecations documentation. If you have existing evals there, export them before October 31. Going forward, run evals programmatically via the OpenAI Evals API or the open-source GitHub repository, which gives you full control over custom grading functions in Python. The framework structures each test around three pieces: a data source of input-output pairs, testing criteria that define what counts as correct, and a model configuration to test against.
For qualitative outputs that resist hard-coded checks, the framework supports LLM-as-judge: a judge model grades the primary model's responses against criteria you define. This handles tone, relevance, and factual accuracy where binary pass/fail assertions go blind.
The framework is also not limited to OpenAI endpoints. You can configure the programmatic tools to run evals against any supported model or custom completion function, which matters when you're comparing providers or need to verify behavior after switching models.
Setting up a production eval pipeline
Without a structured pipeline, eval coverage is whatever you remember to check, and the gaps are exactly where regressions hide. Production evaluation closes that with four stages, each with its own automated quality gate.
Local development: run your agent against a curated golden dataset during rapid iteration. Catch obvious regressions before code leaves your machine.
Pull request: an LLM judge scores the full dataset on every PR. Calibrate the judge against human-annotated references first; agreement of up to 90 percent is achievable, per LLM Evaluation in 2026. Any regression below your prior deployment's score blocks the merge.
Deployment gate: threshold-based checks on accuracy, safety, and faithfulness act as hard gates. RAG metrics feed into this stage alongside safety evals. If scores drop below your production baseline, the deploy halts.
Production monitoring: sample live traffic continuously and feed interesting cases back into the golden dataset, keeping it current as user behavior changes.
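The deployment gate in particular reduces to a comparison against a stored baseline. The sketch below is one possible shape, assuming scores arrive as plain dictionaries keyed by metric name; the metric names, the numbers, and the `tolerance` parameter are illustrative.

```python
from typing import Dict, List


def gate(current: Dict[str, float], baseline: Dict[str, float],
         tolerance: float = 0.0) -> List[str]:
    """Return the metrics that regressed below baseline; an empty list means the gate passes."""
    regressions = []
    for metric, base in baseline.items():
        score = current.get(metric)
        # A metric missing from the current run is treated as a regression,
        # so a silently skipped eval cannot pass the gate.
        if score is None or score < base - tolerance:
            regressions.append(metric)
    return regressions


baseline = {"accuracy": 0.92, "safety": 0.99, "faithfulness": 0.90}
current = {"accuracy": 0.93, "safety": 0.99, "faithfulness": 0.84}

failed = gate(current, baseline)
print(failed)  # ['faithfulness']
```

In a CI job, a non-empty result would translate into a non-zero exit code so the deploy halts. A small `tolerance` can absorb run-to-run noise from sampling, at the cost of letting very small regressions through.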
Common eval mistakes and how to avoid them
Even a well-designed eval pipeline misleads you when its underlying assumptions are wrong. Five anti-patterns show up repeatedly.
Relying only on offline datasets. Your golden set captures past failures. If you never score live traffic, you miss distribution changes as user behavior evolves.
Optimizing for a single metric. An agent can score high on faithfulness while producing incoherent responses. Composite scoring across multiple dimensions gives you an honest picture because each dimension stays visible. A drop in coherence isn't masked by a high faithfulness score.
Insufficient test data diversity. A dataset skewed toward happy-path queries won't surface the edge cases that break production agents.
Using the wrong metric for the task. Semantic similarity is useful for open-ended generation but misleading for classification, where exact-match accuracy tells you more.
Not versioning eval criteria. When you update a prompt, snapshot the eval rubric alongside it. Otherwise, you're grading a new agent against outdated expectations, and regressions slip through undetected.
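For the last anti-pattern, one lightweight approach is to derive a single identifier from the prompt and rubric together and store it with every eval result. This is a sketch of that idea, not a prescribed method; the rubric structure shown is hypothetical.

```python
import hashlib
import json


def snapshot_id(prompt: str, rubric: dict) -> str:
    """Derive a stable short ID from the prompt and rubric together."""
    # sort_keys makes the ID independent of dictionary key order.
    payload = json.dumps({"prompt": prompt, "rubric": rubric}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


prompt = "Classify the support ticket and summarize it in one sentence."
rubric_v1 = {"relevance": "Answers the user's question", "tone": "Polite and concise"}
rubric_v2 = {"relevance": "Answers the user's question", "tone": "Polite, concise, no jargon"}

id_v1 = snapshot_id(prompt, rubric_v1)
id_v2 = snapshot_id(prompt, rubric_v2)
print(id_v1 != id_v2)  # True
```

If two eval runs carry different IDs, their scores were produced under different expectations and should not be compared directly.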
Testing agents before production with Logic
Building eval coverage by hand is where you stall. Logic generates test cases automatically from your spec, probing edge cases and boundary conditions that your manual tests miss. When you update the spec, Logic automatically generates test suites before deployment. If any test fails, deployment is blocked until you fix the issue or explicitly acknowledge the failure.
For action-taking agents, Logic mocks tool calls during test execution, so test scenarios run without hitting external APIs or sending real emails. Step-level traces in the execution history show every tool call and its arguments, so when a test fails, you can see exactly which action went wrong and where the decision sequence broke down.
Every agent version is immutable and carries its full test suite. Logic treats each published spec as a sealed artifact, so versions can't be edited in place, and the full history stays intact. You can compare behavior across versions side by side and roll back in one click if a regression ships.
Your eval coverage also grows organically from production. Any execution can be promoted to a permanent test case, so the agents you run in the real world feed directly into the suite that guards future deployments. Logic runs more than 250,000 production jobs a month, and any of those executions can seed your test suite directly. Testing ships with the agent, not after it.
Final thoughts on reliable LLM evaluation
Evals are the only honest signal you have before an agent reaches users. Deterministic checks catch schema and routing errors. LLM-as-judge scores relevance, faithfulness, and coherence at scale. For agents that take actions, tool-use accuracy and task completion rate tell you what output-quality scores can't. Real production resilience comes from coverage that updates when your spec changes and grows from live traffic. If you want test generation that ships with the agent instead of lagging behind it, book a call to see how Logic handles eval automation and version control without writing a single test by hand.
Frequently Asked Questions
What's the best way to test LLM agent behavior before production?
Run a combination of deterministic evals for structural properties (schema validation, tool calls, field requirements) and LLM-as-judge scoring for semantic qualities (relevance, factual accuracy, tone). Deterministic evals catch hard failures fast, while LLM-as-judge handles the fuzzy reasoning layer where traditional assertions go blind. For agents that take actions, also track tool use/selection accuracy (did the agent call the right tool with the right arguments?), task completion rate (did the agent reach the user's goal end-to-end?), step economy (did it get there without looping or burning excess tokens?), and reasoning coherence (did the agent's sequence of decision steps follow logically from its goal?).
How does LLM-as-judge compare to human evaluation for evals?
LLM-as-judge costs fractions of a cent per instance and handles thousands of evaluations per minute, while human evaluation runs $5 to $50 per instance and tops out at dozens per day. State-of-the-art models agree with human judgment up to 90 percent of the time for both pairwise and single-output scoring, making automated evaluation practical for continuous testing.
How do I build a golden dataset for LLM evals?
Source 200 to 500 examples from production logs where your agent failed, edge cases that broke previous deployments, and a representative distribution of actual user queries weighted toward common patterns. Examples drawn from real production failures catch the regressions that matter more reliably than synthetic examples do.
What are evals in LLM testing?
Evals are structured tests that measure how well your model or agent performs against defined criteria. Some check binary properties like required fields, others score subjective qualities like tone or factual accuracy. They run before every deployment and after every prompt change to catch regressions before users do.
Can I run OpenAI evals on models other than GPT?
Yes. The OpenAI evals framework structures tests around data sources, testing criteria, and model configurations, and you can run evals through the API or CLI against any supported model endpoint, including models beyond OpenAI.