Published 2026-07-01 · NextModel Research
Direct answer
LLM observability tracks latency, cost, errors, and traces across every model call. This guide covers the key metrics and how a gateway captures them without custom logging code at each call site. It is written for product and platform teams comparing model quality, price, routing policy, and rollout risk.
What LLM observability means
LLM observability is not the same as general application monitoring. Standard APM tools track HTTP status codes and response times. LLM observability adds a layer specific to model traffic: token counts, per-model cost, prompt and completion metadata, cache behavior, and refusal or safety-filter outcomes. Because a single app might call several different models across several providers, LLM observability also needs to break results down by model and by route, not just by endpoint. Good LLM observability answers questions like the ones below.
- Which model is driving the most spend this week?
- Did latency spike after a provider changed something upstream?
- What fraction of requests are being served from cache instead of billed as a fresh call?
- Which route has the highest error or refusal rate, and since when?
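To make these questions concrete, here is a minimal sketch of how three of them could be answered from per-request records. The field names (`model`, `route`, `cost_usd`, `cached`, `status`) and the sample values are assumptions made up for this example, not a NextModel schema.

```python
from collections import defaultdict

# Hypothetical request records; field names and values are made up for illustration.
records = [
    {"model": "model-a", "route": "/chat", "cost_usd": 0.012, "cached": False, "status": "ok"},
    {"model": "model-a", "route": "/chat", "cost_usd": 0.0, "cached": True, "status": "ok"},
    {"model": "model-b", "route": "/summarize", "cost_usd": 0.030, "cached": False, "status": "refusal"},
    {"model": "model-b", "route": "/summarize", "cost_usd": 0.028, "cached": False, "status": "ok"},
]


def spend_by_model(rows):
    """Total cost per model: which model is driving the most spend?"""
    totals = defaultdict(float)
    for row in rows:
        totals[row["model"]] += row["cost_usd"]
    return dict(totals)


def cache_hit_rate(rows):
    """Fraction of requests served from cache rather than billed as fresh calls."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row["cached"]) / len(rows)


def failure_rate_by_route(rows):
    """Share of requests per route that ended in an error or a refusal."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for row in rows:
        totals[row["route"]] += 1
        if row["status"] in ("error", "refusal"):
            failures[row["route"]] += 1
    return {route: failures[route] / totals[route] for route in totals}


spend = spend_by_model(records)
top_model = max(spend, key=spend.get)
print(top_model, cache_hit_rate(records), failure_rate_by_route(records))
```

Questions about change over time, such as a latency spike or "since when", would also need a timestamp on each record, which this sketch leaves out.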
Key metrics to track
A working LLM observability setup should capture the following, at minimum. Tracking these together, rather than in separate dashboards per provider, is what turns raw logs into usable LLM observability.
- Latency: time to first token and total completion time, per model and per route.
- Token usage: input and output tokens per request, aggregated by model, user, or API key.
- Cost: dollar cost per request, rolled up by model, team, or project.
- Error and refusal rates: failed calls, timeouts, and model refusals as a share of total traffic.
- Cache hit rate: the percentage of requests served from cache rather than sent to a provider.
- Per-model and per-route breakdowns: the metrics above, segmented so you can compare models or endpoints directly.
- Traces: a request-level record that links a single call to its model, provider, latency, token count, and outcome, so any individual request can be inspected after the fact.
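The metrics above can all be derived from one request-level record. The sketch below shows one possible shape for such a trace and a per-model, per-route roll-up. The `Trace` fields, the price table, and the rule that cache hits cost nothing are assumptions for illustration; real prices and billing rules vary by provider.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens: (input, output).
PRICES_PER_MILLION = {"model-a": (1.00, 3.00), "model-b": (0.50, 1.50)}


@dataclass(frozen=True)
class Trace:
    """One request-level record (illustrative field names)."""
    request_id: str
    model: str
    provider: str
    route: str
    ttft_ms: float   # time to first token
    total_ms: float  # total completion time
    input_tokens: int
    output_tokens: int
    outcome: str     # "ok", "error", "timeout", or "refusal"
    cached: bool = False


def cost_usd(trace, prices=PRICES_PER_MILLION):
    """Dollar cost of one request; assumes cache hits are not billed."""
    if trace.cached:
        return 0.0
    input_price, output_price = prices[trace.model]
    return (trace.input_tokens * input_price
            + trace.output_tokens * output_price) / 1_000_000


def summarize(traces):
    """Roll traces up by (model, route) so models and routes can be compared."""
    groups = defaultdict(list)
    for trace in traces:
        groups[(trace.model, trace.route)].append(trace)
    summary = {}
    for key, group in groups.items():
        n = len(group)
        summary[key] = {
            "requests": n,
            "avg_ttft_ms": sum(t.ttft_ms for t in group) / n,
            "avg_total_ms": sum(t.total_ms for t in group) / n,
            "input_tokens": sum(t.input_tokens for t in group),
            "output_tokens": sum(t.output_tokens for t in group),
            "cost_usd": sum(cost_usd(t) for t in group),
            "failure_rate": sum(t.outcome != "ok" for t in group) / n,
            "cache_hit_rate": sum(t.cached for t in group) / n,
        }
    return summary


traces = [
    Trace("req-1", "model-a", "provider-x", "/chat", 120.0, 900.0, 1000, 500, "ok"),
    Trace("req-2", "model-a", "provider-x", "/chat", 5.0, 20.0, 1000, 500, "ok", cached=True),
    Trace("req-3", "model-b", "provider-y", "/summarize", 200.0, 1500.0, 4000, 0, "refusal"),
]
summary = summarize(traces)
```

Averages are used here to keep the example short. For latency in particular, percentiles such as p95 are usually more informative than the mean.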
How a gateway makes observability possible
The hard part of LLM observability is usually not deciding what to measure. It is collecting consistent data across multiple providers and models without adding logging code to every call site. A gateway solves this by sitting between your application and every model provider. Instead of calling OpenAI, Anthropic, or another provider directly from each service, your application sends every request to one OpenAI-compatible base_url. See /docs/openai-compatible for the exact configuration. The gateway then routes the call to the right provider and model behind the scenes. Because every request passes through that single point, the gateway can record latency, token counts, cost, and outcome for each call in one consistent format, regardless of which provider served it. That consistency is what makes per-model and per-route breakdowns possible without per-provider integration work. NextModel's gateway captures this data for every request that goes through it, so LLM observability is available by default rather than something you build on top of application code.
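The single-choke-point idea can be sketched in a few lines. The toy `Gateway` class below is an in-process stand-in, not NextModel's implementation: the provider adapters are fakes, the token counts are word counts, and every name is hypothetical. It only illustrates why routing all calls through one place yields records in one format.

```python
import time
from typing import Callable, Dict, List, Tuple

# A provider adapter takes a prompt and returns (text, input_tokens, output_tokens).
ProviderFn = Callable[[str], Tuple[str, int, int]]


class Gateway:
    """Toy stand-in for a gateway: routes by model name and logs every call."""

    def __init__(self, providers: Dict[str, ProviderFn], routes: Dict[str, str]):
        self.providers = providers  # provider name -> adapter function
        self.routes = routes        # model name -> provider name
        self.log: List[dict] = []

    def complete(self, model: str, prompt: str) -> str:
        provider = self.routes[model]
        text, input_tokens, output_tokens, outcome = "", 0, 0, "ok"
        start = time.perf_counter()
        try:
            text, input_tokens, output_tokens = self.providers[provider](prompt)
        except Exception:
            # Simplification: record the failure and return an empty string.
            # A real gateway would also surface the error to the caller.
            outcome = "error"
        # Same record shape for every call, whichever provider served it.
        self.log.append({
            "model": model,
            "provider": provider,
            "latency_ms": (time.perf_counter() - start) * 1000,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "outcome": outcome,
        })
        return text


def provider_x(prompt: str) -> Tuple[str, int, int]:
    """Fake provider; word counts stand in for real token counts."""
    words = prompt.split()
    return "echo: " + prompt, len(words), len(words) + 1


def provider_y(prompt: str) -> Tuple[str, int, int]:
    """Fake provider that always fails, to show how errors are recorded."""
    raise TimeoutError("simulated upstream timeout")


gateway = Gateway(
    providers={"provider-x": provider_x, "provider-y": provider_y},
    routes={"model-a": "provider-x", "model-b": "provider-y"},
)
gateway.complete("model-a", "hello world")
gateway.complete("model-b", "hello world")
for entry in gateway.log:
    print(entry)
```

The application code calls only `complete`, so no call site contains logging logic. In a hosted setup the equivalent step is pointing the client at the gateway's base_url.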
Observability and provenance
LLM observability tells you what happened: latency, cost, and outcome for each request. Provenance goes a step further and establishes that a record of a call is authentic and has not been altered after the fact. NextModel pairs observability with auditable receipts: because all traffic passes through one gateway, each call can produce a verifiable record tied to the model, provider, and result. For details on how these receipts work, see /ai-provenance. Together, observability and provenance answer two different but related questions. Observability tells a team what is happening across their LLM traffic right now. Provenance lets anyone verify, later, that a specific call happened exactly as recorded.
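One generic way to make a record tamper-evident is to attach a keyed hash (HMAC) of its contents. The sketch below illustrates that general idea only. It does not describe how NextModel's receipts are built, and the key, field names, and function names are all hypothetical.

```python
import hashlib
import hmac
import json

# Demo key for illustration only; a real key would come from a secret store.
SECRET_KEY = b"demo-key-for-illustration-only"


def canonical(record: dict) -> bytes:
    """Serialize deterministically so the same record always hashes the same."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def issue_receipt(record: dict, key: bytes = SECRET_KEY) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical form of the record."""
    return hmac.new(key, canonical(record), hashlib.sha256).hexdigest()


def verify_receipt(record: dict, receipt: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(issue_receipt(record, key), receipt)


record = {
    "request_id": "req-1",
    "model": "model-a",
    "provider": "provider-x",
    "output_tokens": 500,
    "outcome": "ok",
}
receipt = issue_receipt(record)

# Any later change to the record invalidates the receipt.
tampered = dict(record, output_tokens=5)
print(verify_receipt(record, receipt), verify_receipt(tampered, receipt))
```

An HMAC can only be checked by someone who holds the secret key. Verification by any third party would need a public-key signature scheme instead, which is outside the Python standard library.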
FAQ
Is LLM observability the same as logging? No. Logging captures raw request and response data. LLM observability structures that data into metrics like latency, cost, and error rate, broken down by model and route, so it can be monitored and queried over time.
Do I need a gateway for LLM observability? Not strictly, but without one you need to add token, cost, and latency tracking separately for each provider and model you call. A gateway centralizes that work at a single base_url.
What is the difference between LLM observability and AI observability? AI observability is the broader term covering any AI system, including traditional ML models. LLM observability is the subset focused specifically on large language model traffic: prompts, completions, tokens, and model-specific costs.
Can LLM observability data double as an audit trail? Observability data on its own is a record, not proof. Pairing it with provenance, as NextModel does through its receipts, adds verifiability so the record can be trusted after the fact, not just used for internal monitoring.