LLM monitoring and logging: what to track when your agent goes live (June 2026)

You deploy an agent. It responds fast, costs stay flat, and every request returns 200 OK. Then users start reporting wrong answers. LLM monitoring closes that gap by tracking what traditional APM ignores: whether the model followed instructions, whether retrieval pulled relevant context, and whether the output is correct. Open-source tracers like Langfuse cover hierarchical spans and prompt versioning. OpenTelemetry extensions like OpenLLMetry route LLM telemetry into the observability backend you already run. Managed tools like Logic log every execution, generate regression tests without manual setup, and give you one-click rollback when something goes wrong. Whichever layer you instrument, track seven production metrics that span computational performance and semantic correctness. Then feed production traces back into your development cycle so regressions are caught before the next deployment.

TLDR:

  • LLM monitoring closes the determinism gap: your agent can return a 200 status code while delivering a wrong answer that traditional APM tools never catch.

  • Four observability pillars cover distinct failure modes: tracing, logging, quality evaluation, and system health metrics.

  • Track seven production metrics split between computational (latency, token usage, cost, errors) and semantic (hallucination rate, retrieval accuracy, quality score).

  • Open-source options: Langfuse (hierarchical tracing, prompt versioning, free cloud or self-hosted), OpenLLMetry (routes LLM telemetry into your existing observability backend), and Helicone (in maintenance mode since Mintlify's March 2026 acquisition).

  • Logic logs every execution with typed inputs and outputs, generates regression tests from your spec, and provides one-click rollback to any version.

What does LLM monitoring measure in production?

Your agent returns a 200 status code, responds in 140 milliseconds, and gives the customer a confidently wrong answer. Traditional APM tools see a healthy service. You see a support ticket, and nothing in your existing dashboards indicates that something went wrong.

LLM monitoring tracks what traditional infrastructure metrics miss: the semantic quality of outputs, whether the model followed its instructions, how much each call costs, and whether retrieval pulled the right context. A web server either returns the correct page or throws an error. An LLM can fail without any signal that your existing dashboards would catch.

Conventional software is predictable: the same input yields the same output and the same resource profile. LLMs produce different outputs across runs, silently degrade when prompts change, and hallucinate in ways that appear structurally valid, with none of it showing up as an error. That is the determinism gap, and it is why monitoring has to move from "is the service up?" to "is the answer right?" Helicone logs at the network layer with a URL swap. Langfuse instruments through SDK decorators. Logic tracks executions during the agent's runtime.

The four pillars: tracing, logging, quality evaluation, and system health

Four capabilities cover different failure modes when an agent goes wrong.

  • Tracing captures the execution path across multi-step agent workflows: which tools were called, in what order, and how long each step took. When an agent picks the wrong tool or loops unnecessarily, the trace tells you where the reasoning diverged.

  • Logging preserves the raw content: every prompt sent, every response received, every retrieval result injected into context, keeping failures reproducible and inspectable.

  • Quality evaluation answers whether the output was semantically correct. Structure can be valid even when the meaning is wrong. Evaluation layers catch hallucinations, instruction violations, and retrieval-relevance failures that pass all structural checks.

  • System health metrics aggregate what individual traces and logs cannot: fleet-wide success rates, latency distributions, run volume trends, and active error counts across all agents. A single trace tells you what happened in one execution. System health tells you whether the system is degrading across thousands of them.

Tracing without logging shows you the path but not the payload. Logging without evaluation gives you data you cannot act on at scale. Evaluation without system health metrics leaves you diagnosing individual failures while missing fleet-wide degradation. Miss any one layer and a failure becomes unreproducible or invisible.
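
To make the relationship between the pillars concrete, here is a minimal sketch of a span tree that carries timing (tracing), raw payloads (logging), and an optional score (quality evaluation) in one record. The `Span` and `Tracer` names are hypothetical and use only the Python standard library; this is not the API of any tool named in this article. System health would be an aggregate computed over many such trees.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    payload: dict = field(default_factory=dict)    # logging: raw prompts, responses, retrieved docs
    children: list = field(default_factory=list)   # tracing: nested steps in call order
    duration_ms: float = 0.0                       # tracing: how long the step took
    score: Optional[float] = None                  # quality evaluation: filled in by an evaluator


class Tracer:
    """Builds a tree of spans for a single execution."""

    def __init__(self):
        self.root = None
        self._stack = []

    @contextmanager
    def span(self, name, **payload):
        node = Span(name=name, payload=dict(payload))
        if self._stack:
            self._stack[-1].children.append(node)  # nest under the active span
        else:
            self.root = node
        self._stack.append(node)
        start = time.perf_counter()
        try:
            yield node
        finally:
            node.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()


def flatten(span, depth=0):
    """Return (depth, name) pairs in execution order."""
    rows = [(depth, span.name)]
    for child in span.children:
        rows.extend(flatten(child, depth + 1))
    return rows


tracer = Tracer()
with tracer.span("agent_run", user_input="Where is my order?"):
    with tracer.span("retrieve", query="order status") as step:
        step.payload["documents"] = ["Order 123 shipped on Monday."]
    with tracer.span("generate") as step:
        step.payload["response"] = "Your order shipped on Monday."
        step.score = 1.0  # placeholder verdict from a hypothetical evaluator
```

Keeping the payload on the span is what makes a failure reproducible: the path and the content that flowed along it live in the same record.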

Key metrics to track when agents go live

Seven metrics cover both how the system performs and whether the output is correct.

| Metric | What it reveals |
|---|---|
| Latency (P50/P90/P99) | Response time distribution and UX-breaking outliers |
| Token usage per request | Cost driver: spikes often mean retrieval is injecting too much context |
| Cost per execution | Total spend per call across model, tokens, and retries |
| Error rate | Structural failures like timeouts, malformed outputs, and rate limits |
| Hallucination rate | How often the model fabricates ungrounded information |
| Retrieval accuracy | Whether RAG pulled the right documents before generation |
| Quality score | LLM-as-judge or human-labeled correctness against a reference |

The first four answer performance and budget questions. The last three answer the harder one: is the output correct? A system can run fast and cheaply while consistently returning wrong answers, and without semantic metrics, you will not know until your users tell you.
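
As a rough illustration, all seven metrics can be computed from a list of logged executions. The sketch below assumes a hypothetical `Execution` record whose `grounded`, `retrieval_hit`, and `quality` fields were already filled in by an evaluator. The per-token prices are made-up placeholders, the percentile uses the nearest-rank method, and retries are left out of the cost for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class Execution:
    latency_ms: float
    input_tokens: int
    output_tokens: int
    ok: bool             # False for timeouts, malformed outputs, rate limits
    grounded: bool       # evaluator verdict: output is supported by the context
    retrieval_hit: bool  # evaluator verdict: the right documents were retrieved
    quality: float       # 0.0 to 1.0 score from a judge model or a human label


# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_IN_PER_1K = 0.001
PRICE_OUT_PER_1K = 0.002


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs):
    n = len(runs)
    latencies = [r.latency_ms for r in runs]
    total_cost = sum(
        r.input_tokens / 1000 * PRICE_IN_PER_1K
        + r.output_tokens / 1000 * PRICE_OUT_PER_1K
        for r in runs
    )
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p90_ms": percentile(latencies, 90),
        "latency_p99_ms": percentile(latencies, 99),
        "avg_tokens": sum(r.input_tokens + r.output_tokens for r in runs) / n,
        "avg_cost_usd": total_cost / n,
        "error_rate": sum(not r.ok for r in runs) / n,
        "hallucination_rate": sum(not r.grounded for r in runs) / n,
        "retrieval_accuracy": sum(r.retrieval_hit for r in runs) / n,
        "avg_quality": sum(r.quality for r in runs) / n,
    }


# Ten sample runs: the slowest one errors out, and one answer is ungrounded.
runs = [
    Execution(latency_ms=100 * i, input_tokens=1000, output_tokens=500,
              ok=(i != 10), grounded=(i != 3), retrieval_hit=True, quality=0.8)
    for i in range(1, 11)
]
report = summarize(runs)
```

In this sample the error rate and the hallucination rate are both 10%, but only the first would show up in a conventional dashboard.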

Langfuse: Open source tracing and prompt management

Trace data, prompt version history, and quality scores are scattered across separate tools by default, making it hard to pinpoint which prompt version caused a regression or to reproduce a failure from last week. Langfuse is an open-source tool that keeps all three in one place, with both a managed cloud option and a self-hosted deployment you can run via Docker.

Core capabilities that matter for production monitoring:

  • Hierarchical tracing that nests spans within traces, so you see each step of a multi-agent workflow as a structured tree

  • Prompt management with built-in versioning, letting you track which prompt version produced which outputs

  • Evaluation support through model-based scoring, human annotation, and custom eval functions

  • Integration with major LLM providers and frameworks via native SDKs for both Python and JavaScript

  • OpenTelemetry compatibility, so you can pipe LLM telemetry into the same observability stack your infrastructure team already runs

Langfuse has a free cloud tier and a self-hosted Docker option. It covers hierarchical spans, prompt versioning, and evaluation support with no vendor dependency. The tradeoffs: SDK integration requires code changes across your application, the cloud tier means sending production trace data to a third party, and there is no built-in test generation, rollback, or pre-deployment quality gate. Observability is what it does.

Helicone: AI gateway with built-in observability

Most observability tools require SDK integration and code changes spread across your application. Helicone skips that by tracking latency, token usage, and cost at the network layer. Swap one base URL, and it logs every call with no SDK changes. Every request flows through Helicone's gateway, adding a network hop to each call.

Mintlify acquired Helicone in March 2026, and the project entered maintenance mode. Security patches, bug fixes, and new model support continue shipping. No new features are in development, so adopting it now means committing to a tool unlikely to keep pace with the rest of your stack.

OpenLLMetry: OpenTelemetry extensions for LLMs

If your observability stack already runs on OpenTelemetry, you don't need a separate dashboard to track LLM metrics. OpenLLMetry, built by Traceloop, extends OpenTelemetry with instrumentation for LLMs like OpenAI, Anthropic, and Cohere, vector databases like Pinecone, and frameworks like LangChain and Haystack. You get traces for prompts, responses, token usage, and model performance, all routed through the standard OpenTelemetry protocol.

Because OpenLLMetry speaks OTLP, you can send LLM telemetry to Datadog, New Relic, Sentry, Honeycomb, or whatever your ops team already runs. Python and TypeScript SDKs cover the full auto-instrumentation surface. Go and Ruby SDKs exist with narrower coverage; Ruby remains in beta per Traceloop's own documentation. Traceloop, the creator of OpenLLMetry, was acquired by ServiceNow in March 2026. The library remains open-source under Apache 2.0, with the enterprise roadmap now part of ServiceNow Cloud Observability. If avoiding vendor lock-in in your observability layer matters more than having a purpose-built UI, OpenLLMetry is worth considering.

Free and open source options worth considering

At low traffic volumes, Langfuse, OpenLLMetry, and Arize Phoenix all cover the computational metrics without a cost commitment. Langfuse offers a free cloud tier with capped observation volume and limited retention. OpenLLMetry has no usage restrictions since it is a library you run yourself, not a hosted service. Arize Phoenix is an open-source tracing and evaluation tool you can run locally.

Self-hosting any of these removes cost constraints. The burden moves to your ops team. You hit the upgrade threshold when you need longer log retention, role-based access controls, or uptime guarantees that a side project deployment cannot provide.

Commercial tools: when free options hit their ceiling

Once your agent handles real traffic, free tiers become a constraint. Trace storage caps at days or weeks, and self-hosted instances put retention, uptime, and security patching on your team. LangSmith, Braintrust, and Datadog cover the same computational and semantic metrics and add longer retention, role-based access, compliance certifications, and support SLAs. They absorb the operational burden with managed infrastructure, audit logs, and guaranteed response times when something breaks at 2 a.m.

How to choose based on your stack and deployment

Your architecture narrows the field quickly. Proxy-based tools slot in with a URL swap. They add a network hop to every call. SDK-based tools like Langfuse give you deeper trace granularity at the cost of integration work in your application code. OpenTelemetry-based approaches like OpenLLMetry let you reuse your existing observability backend. You own the collector and pipeline setup.

  • If you run on-prem or have data residency constraints, self-host Langfuse or run OpenLLMetry into a backend you control.

  • If your team already operates Datadog or Grafana, OpenLLMetry avoids adding another dashboard.

  • If you need the fastest path to visibility with minimal code changes, Langfuse Cloud removes ops overhead in exchange for sending trace data to a third party.

  • If you don't want to build or maintain a monitoring stack, a spec-driven tool like Logic covers execution history, regression tests, and rollback as part of the agent runtime.

Common debugging workflows: from trace to root cause

When an agent hallucinates, open the trace and inspect the retrieval span. If the retrieved documents were irrelevant or missing, the model filled in gaps on its own. Fix retrieval before touching the prompt.

For latency spikes, the timeline view shows which step ballooned. A single tool call waiting on a slow API can dominate the entire request duration, even when the model responds in milliseconds.

Cost overruns follow a similar pattern. Sort executions by token usage and look for outliers. One bloated retrieval step, injecting entire documents instead of relevant passages, can triple your per-call spend overnight.
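
A minimal sketch of that outlier search, assuming executions are exported as dictionaries with hypothetical `id` and `tokens` fields. The threshold of three times the median is an arbitrary starting point, not a recommendation from any tool.

```python
from statistics import median


def token_outliers(executions, factor=3.0):
    """Return executions using more than factor x the median token count, largest first."""
    typical = median(e["tokens"] for e in executions)
    flagged = [e for e in executions if e["tokens"] > factor * typical]
    return sorted(flagged, key=lambda e: e["tokens"], reverse=True)


executions = [
    {"id": "run-1", "tokens": 1200},
    {"id": "run-2", "tokens": 1100},
    {"id": "run-3", "tokens": 9800},  # retrieval injected whole documents
    {"id": "run-4", "tokens": 1300},
    {"id": "run-5", "tokens": 1250},
]
flagged = token_outliers(executions)
```

The median is used instead of the mean so that a single bloated run does not raise the baseline it is measured against.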

Quality regressions surface when you compare evaluation scores across prompt or model versions. If your average quality score dropped after a deployment, the version diff tells you which change caused it. Pin the previous version, roll back, and investigate before the regression compounds.

Beyond monitoring: testing, evaluation, and continuous improvement

Monitoring tells you something broke. Testing prevents it from breaking in the first place. Without that loop, every deployment is a fresh guess about whether your agent regressed.

Promote real executions into permanent test cases. When a trace reveals a hallucination or a missed edge case, that input-output pair becomes a regression test you run before every deployment. Over time, your test suite reflects the actual distribution of failures your agent encounters, not hypothetical scenarios you imagined during development.
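
One way to sketch the promotion step, assuming a logged trace is a dictionary with hypothetical `trace_id`, `input`, and `output` fields. Because the logged output of a failing trace is wrong by definition, this sketch stores a human-corrected answer as the expectation. Exact string matching is a simplification; real suites usually need a semantic comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionCase:
    case_id: str
    agent_input: str
    expected_output: str
    note: str


def promote(trace, corrected_output, note):
    """Turn a logged execution into a permanent test case."""
    return RegressionCase(
        case_id=trace["trace_id"],
        agent_input=trace["input"],
        expected_output=corrected_output,
        note=note,
    )


def run_suite(agent, cases):
    """Return the ids of cases the agent currently fails."""
    return [
        case.case_id
        for case in cases
        if agent(case.agent_input).strip() != case.expected_output.strip()
    ]


# A production trace where the agent invented a refund window.
trace = {"trace_id": "t-42", "input": "What is the refund window?", "output": "90 days"}
suite = [promote(trace, corrected_output="30 days", note="hallucinated policy")]


def broken_agent(question):
    return "90 days"  # stand-in for the version that shipped the bug


def fixed_agent(question):
    return "30 days"  # stand-in for the version under test
```

Running `run_suite` against each stand-in shows the intended behavior: the broken version fails case `t-42`, the fixed version passes it.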

Layer in LLM-as-judge evaluations that score new prompt or model versions against your accumulated dataset before those versions reach users. If the new version scores lower than the current one, block the release. This quality gate turns monitoring data into a pre-deployment safety net instead of a post-incident forensic tool.
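
The gate itself can be as small as a comparison of mean scores. This sketch assumes the judge has already scored both versions on the same dataset and takes those scores as plain lists. The function name and the `tolerance` parameter are hypothetical.

```python
from statistics import mean


def release_allowed(current_scores, candidate_scores, tolerance=0.0):
    """Allow a release only if the candidate's mean score is not below the current mean.

    tolerance lets a team accept a small drop, for example to absorb judge noise.
    """
    if not current_scores or not candidate_scores:
        raise ValueError("both versions need at least one score")
    return mean(candidate_scores) >= mean(current_scores) - tolerance


current = [0.9, 0.8, 0.85]
worse_candidate = [0.7, 0.8, 0.75]
better_candidate = [0.9, 0.9, 0.9]

blocked = not release_allowed(current, worse_candidate)
shipped = release_allowed(current, better_candidate)
```

A mean hides per-case regressions, so a stricter gate might also require that no previously passing case drops below a threshold.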

How Logic handles monitoring and versioning for production agents

Logic covers three of the four observability pillars at runtime out of the box: system health metrics, execution logs, and step-level traces. The fourth pillar, quality evaluation, is covered in two ways. Before deployment, Logic generates tests from your spec, and a failed test blocks the publish until you resolve the issue or acknowledge the failure. At runtime, the monitoring dashboard surfaces degradation as it appears across thousands of runs. No collector, pipeline, or dashboard wiring is required.

Logic's monitoring dashboard covers every agent in your organization with no extra instrumentation required. Fleet-level health metrics surface total runs, success rate, average latency, and active issues across all agents in a single view. Per-agent activity shows hourly run volume over a rolling 24-hour window, latency distribution at P50, P90, and P99, and runs by source. Logic backs that visibility with a 99.9%+ uptime guarantee, load balancing, and failover redundancy.

Click into any execution, and Logic shows the inputs, outputs, model used, latency, and full trace. Step-level traces, shipped April 2026, expose every tool call, intermediate result, and timing for each step. When an agent calls the wrong tool, the trace shows exactly where the reasoning diverged, with no extra instrumentation needed.

External tool calls are mocked during test execution, so tests stay reproducible without hitting third-party APIs or sending real emails. Any historical execution can be promoted to a permanent test case in one click, so real production failures become part of your regression suite.

Every spec change produces an immutable version. One-click rollback returns you to any prior version. Version comparison shows a diff between versions, so you know exactly what changed. Every execution is also indexed for adaptive learning. Logic retrieves semantically similar historical runs as few-shot examples during inference, so production data improves agent consistency without manual prompt engineering.

Final thoughts on LLM observability in production

Tracing shows the path, logging preserves the payload, evaluation tells you if the answer was right, and system health metrics show whether the fleet is degrading in aggregate. You need all four. The best monitoring setup is the one your team will actually maintain when things break at 2 a.m., which usually means picking tools that fit your existing stack instead of the most feature-rich dashboard. If you want to compare what Logic tracks out of the box versus what you'd build yourself, request a demo.

Frequently Asked Questions

Langfuse vs OpenLLMetry for production LLM monitoring?

Langfuse gives you a purpose-built dashboard with hierarchical tracing, prompt versioning, and evaluation support out of the box. OpenLLMetry extends your existing OpenTelemetry stack with LLM instrumentation, so you can pipe traces to Datadog or Honeycomb instead of adopting a new tool. If you already run an observability backend and want to avoid another dashboard, OpenLLMetry fits your ops workflow with less friction.

What's the fastest way to add LLM monitoring to a production agent?

Install the Langfuse Python SDK, set your API keys as environment variables, and add the @observe() decorator to your agent's entry function. You get full trace logging and prompt tracking in under an hour, with no infrastructure setup required.
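
A minimal sketch of those steps. The `observe` import path differs between SDK major versions, so the example tries both and falls back to a no-op decorator when the SDK is not installed. The fallback, the function names, and the returned strings are assumptions for illustration only. Langfuse reads credentials from the `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` environment variables, which should be set outside the code.

```python
try:
    from langfuse import observe  # import path in Python SDK v3
except ImportError:
    try:
        from langfuse.decorators import observe  # import path in Python SDK v2
    except ImportError:
        # Fallback so this sketch runs without the SDK installed (no tracing happens).
        def observe(*_args, **_kwargs):
            def decorator(fn):
                return fn
            return decorator


@observe()
def retrieve(question: str) -> list:
    # Hypothetical retrieval step; replace with your own search.
    return ["Refunds are accepted within 30 days of purchase."]


@observe()
def answer(question: str) -> str:
    # Hypothetical entry function; decorated calls made inside it nest under its trace.
    context = retrieve(question)
    return f"Based on policy: {context[0]}"


reply = answer("What is the refund window?")
```

Decorating both the entry function and the steps it calls is what produces the hierarchical view: the inner call is recorded as a child of the outer one.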

Can I self-host Langfuse to avoid sending production data to a third party?

Yes. Langfuse offers both a managed cloud option and a self-hosted deployment you can run via Docker. Self-hosting removes the data sharing concern. Retention, uptime, and security patching move to your ops team.

What is Langfuse used for in production AI systems?

Langfuse provides tracing, prompt versioning, and evaluation for LLM applications. It logs every execution with hierarchical spans, so you can debug multi-step agent workflows, track which prompt version produced which outputs, and score quality using model-based or human annotations.

When should you switch from free LLM monitoring tools to a commercial solution?

When you need longer trace retention, role-based access controls, compliance certifications, or uptime SLAs that self-hosted deployments can't guarantee. Free tiers typically cap log storage at days or weeks, and self-hosted instances put retention, availability, and patching on your team.
