Evaluating LLM Quality in a Product: Hallucinations, Safety, Latency and Cost Metrics to Track

When an LLM sits inside a real product, “quality” stops being a single number. The same model can feel brilliant in a demo and still fail users in production because it fabricates details, refuses the wrong things, responds too slowly at peak hours, or becomes too expensive once traffic grows. A useful measurement approach in 2026 is to treat quality as a scorecard: reliability (including hallucinations), safety, performance, and unit economics. The point is not to collect dozens of vanity metrics, but to pick a small set that (1) predicts user outcomes, (2) can be measured repeatedly, and (3) has an owner and an action when it goes off track.

Hallucinations and reliability: measure what users actually experience

Start by defining the failure you care about, because “hallucination” is too broad. In most products it splits into at least three buckets: factual errors (wrong claims), ungrounded claims (true or false but unsupported by your sources), and inappropriate certainty (the model guesses instead of signalling uncertainty). Your metrics should map to those buckets; otherwise the team ends up arguing about examples rather than improving the system.

The most practical core metric is a labelled hallucination rate on representative traffic. Take a weekly stratified sample of conversations (by country, intent, and complexity) and have reviewers label each answer for (a) correctness, (b) grounding, and (c) severity. Report it as “% of responses with a material hallucination” and include a severity-weighted variant so you can see whether you are reducing harmful failures rather than just shifting them around.
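
As a rough illustration, the weekly report can be computed directly from reviewer labels; the label fields and severity weights below are assumptions to adapt to your own taxonomy, not a prescribed schema.

```python
from dataclasses import dataclass

# Assumed reviewer label schema; field names are illustrative.
@dataclass
class ReviewedResponse:
    response_id: str
    correct: bool      # (a) factually correct
    grounded: bool     # (b) supported by provided sources
    severity: str      # (c) "none", "minor", "material" or "harmful"

# Illustrative severity weights; tune these to your own incident taxonomy.
SEVERITY_WEIGHTS = {"none": 0.0, "minor": 0.25, "material": 1.0, "harmful": 3.0}

def hallucination_metrics(sample: list[ReviewedResponse]) -> dict:
    n = len(sample)
    material = [r for r in sample if r.severity in ("material", "harmful")]
    return {
        "material_hallucination_rate": len(material) / n,
        "severity_weighted_rate": sum(SEVERITY_WEIGHTS[r.severity] for r in sample) / n,
    }
```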

If your product is retrieval-augmented, add a grounding coverage metric: “% of factual claims supported by provided sources”. You can operationalise this by requiring citations for certain answer types (policies, pricing, medical, legal) and measuring citation presence plus citation validity in audits. Over time, this becomes a leading indicator: grounding coverage often improves before reviewers notice a drop in hallucination severity.
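
A minimal sketch of the audit computation, assuming factual claims have already been extracted and matched to citations (claim extraction itself is the hard part and is out of scope here):

```python
from dataclasses import dataclass

@dataclass
class AuditedClaim:
    claim_id: str
    has_citation: bool    # a citation was present for this claim
    citation_valid: bool  # the cited source actually supports the claim

def grounding_coverage(claims: list[AuditedClaim]) -> dict:
    n = len(claims)
    cited = [c for c in claims if c.has_citation]
    supported = [c for c in cited if c.citation_valid]
    return {
        "citation_presence": len(cited) / n,
        "citation_validity": len(supported) / len(cited) if cited else None,
        "grounding_coverage": len(supported) / n,  # % of claims supported by sources
    }
```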

Offline tests that predict hallucinations (and how to avoid fooling yourself)

Offline evaluation still matters, but only when it mirrors your product. Build a test set of real tasks: customer tickets, knowledge-base questions, workflow steps, tool calls, and edge cases. Track pass rate on “facts that must be correct” items and keep those items stable so trends mean something. If you change the set frequently, you will not be able to tell measurement noise from real improvement.
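
A sketch of what a frozen “must be correct” suite can look like; the item format, prompts and check functions are illustrative placeholders, not real product facts:

```python
# Keep this list frozen between releases so the trend line stays comparable.
MUST_BE_CORRECT = [
    {"id": "pricing-001", "prompt": "What does the Pro plan cost per month?",
     "check": lambda answer: "49" in answer},          # placeholder expected fact
    {"id": "policy-007", "prompt": "Can users delete their account themselves?",
     "check": lambda answer: "yes" in answer.lower()},  # placeholder expected fact
]

def pass_rate(generate, items=MUST_BE_CORRECT) -> float:
    """generate(prompt) -> answer string; returns the share of stable items that pass."""
    passed = sum(1 for item in items if item["check"](generate(item["prompt"])))
    return passed / len(items)
```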

Use two additional metrics to reduce blind spots. First, calibration: measure whether the model’s confidence signals match reality. If you show confidence to users, track “high-confidence incorrect” separately, because that is the fastest route to lost trust. Second, abstention quality: measure whether the model says “I don’t know” (or routes to a human/tool) when it should. A system that never abstains can look good on superficial accuracy until the first incident lands.
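
Both metrics fall out of the same labelled evaluation records; the field names and the 0.9 “high confidence” cut-off below are assumptions:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    confidence: float     # model or self-reported confidence in [0, 1]
    correct: bool
    abstained: bool       # said "I don't know" or routed to a human/tool
    should_abstain: bool  # label: the item has no safe, grounded answer

def calibration_and_abstention(records: list[EvalRecord], high_conf: float = 0.9) -> dict:
    high = [r for r in records if r.confidence >= high_conf and not r.abstained]
    should = [r for r in records if r.should_abstain]
    benign = [r for r in records if not r.should_abstain]
    return {
        # Fastest route to lost trust: confidently wrong answers.
        "high_confidence_incorrect": sum(not r.correct for r in high) / len(high) if high else 0.0,
        # Abstention recall: did the system abstain when it should have?
        "abstention_recall": sum(r.abstained for r in should) / len(should) if should else None,
        # Over-abstention: refusing answerable items erodes usefulness.
        "over_abstention_rate": sum(r.abstained for r in benign) / len(benign) if benign else None,
    }
```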

Be cautious with model-graded evaluation. It can scale your test coverage, but it also shares the same biases as the model family you are judging. Treat it as a triage signal: use it to spot likely failures and to prioritise human review, not as the final score. If you do use automated judges, keep a small “gold” subset with human labels to estimate how often the judge disagrees with people.
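
Estimating judge reliability against the gold subset is simple once both sets of verdicts exist; the item IDs and pass/fail encoding below are assumptions:

```python
def judge_disagreement(gold_labels: dict[str, bool], judge_labels: dict[str, bool]) -> float:
    """Share of gold-labelled items where the automated judge disagrees with humans.
    Keys are item IDs; values are pass/fail verdicts. Illustrative sketch only."""
    shared = set(gold_labels) & set(judge_labels)
    disagreements = sum(gold_labels[i] != judge_labels[i] for i in shared)
    return disagreements / len(shared)
```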

Safety: from policy violations to real risk, with clear severity and ownership

Safety metrics need to reflect your product’s exposure, not just generic “toxicity”. In 2026 most teams use a policy taxonomy: self-harm, violence, hate, harassment, sexual content (especially involving minors), illegal wrongdoing, privacy leakage, and domain-specific harms (medical or financial misinformation). Your first job is to decide which categories matter for your use case, then define what “violation” and “near miss” mean in each.
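
Making the taxonomy explicit as a small, reviewable config keeps those decisions out of people’s heads; the category definitions below are placeholders for your own policy text, not recommended wording:

```python
# Illustrative policy taxonomy config; definitions and near-miss examples are placeholders.
POLICY_TAXONOMY = {
    "self_harm":       {"in_scope": True, "violation": "instructions or encouragement",
                        "near_miss": "ambiguous ideation without escalation"},
    "privacy_leakage": {"in_scope": True, "violation": "reveals personal data or secrets",
                        "near_miss": "partial identifier echoed back"},
    "medical_misinfo": {"in_scope": True, "violation": "unsafe dosage or diagnosis",
                        "near_miss": "unsourced health claim with a disclaimer"},
    "sexual_minors":   {"in_scope": True, "violation": "any sexual content involving minors",
                        "near_miss": None},  # no near-miss tier: always high severity
}
```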

A solid starting trio is: (1) policy violation rate on sampled traffic, (2) high-severity incident count, and (3) refusal accuracy. Refusal accuracy is often overlooked: you want the assistant to refuse truly disallowed requests, but still help with benign ones. Measure both false negatives (unsafe content that slipped through) and false positives (safe content incorrectly blocked), because both harm the product in different ways.
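
A sketch of how the trio can be computed from a labelled safety sample, assuming reviewers record both the request label and the response outcome (field names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class SafetySample:
    request_disallowed: bool  # label: the request should be refused per policy
    refused: bool             # the assistant refused or safe-completed
    violation: bool           # label: the response itself violated policy
    high_severity: bool       # label: the violation was high severity

def safety_metrics(samples: list[SafetySample]) -> dict:
    disallowed = [s for s in samples if s.request_disallowed]
    benign = [s for s in samples if not s.request_disallowed]
    return {
        "policy_violation_rate": sum(s.violation for s in samples) / len(samples),
        "high_severity_incidents": sum(s.high_severity for s in samples),
        # False negatives: disallowed requests that were not refused.
        "unsafe_passthrough_rate": sum(not s.refused for s in disallowed) / len(disallowed) if disallowed else None,
        # False positives: benign requests that were blocked anyway.
        "over_refusal_rate": sum(s.refused for s in benign) / len(benign) if benign else None,
    }
```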

Beyond content, track data safety: leakage of personal data, secrets, or internal identifiers. Practical metrics include “% of responses containing detected personal data”, “secrets scanner hits in logs”, and “training-data regurgitation flags” on targeted prompts. Even a low rate can be unacceptable if the severity is high, so report leakage metrics with severity levels and escalation rules.
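
A deliberately simplistic sketch of the detection side; production systems should rely on dedicated PII and secret scanners rather than a handful of regexes, and the internal-identifier pattern below is hypothetical:

```python
import re

# Illustrative detectors only; real deployments need proper scanners and validation.
DETECTORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "aws_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "internal_id": re.compile(r"\bEMP-\d{6}\b"),  # hypothetical internal identifier format
}

def leakage_rate(responses: list[str]) -> dict:
    flagged = {name: 0 for name in DETECTORS}
    any_hit = 0
    for text in responses:
        hits = [name for name, pattern in DETECTORS.items() if pattern.search(text)]
        any_hit += bool(hits)
        for name in hits:
            flagged[name] += 1
    return {
        "pct_responses_with_detected_data": any_hit / len(responses),
        "hits_by_type": flagged,
    }
```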

How to run safety evaluation: red teaming, drift detection and compliance alignment

Red teaming is not a one-off exercise; it is a pipeline. Maintain an adversarial prompt set that evolves with incidents and public attack patterns, and run it on every major model or prompt change. Track “time to fix” for newly discovered failure modes, because that is often more actionable than the absolute number of failures in a week.
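
Time to fix only needs a discovery and a fix timestamp per finding; the record format below is an assumption and the entries are placeholders:

```python
from datetime import datetime
from statistics import median

# Illustrative red-team findings; IDs and dates are placeholders.
findings = [
    {"id": "jailbreak-042", "discovered": datetime(2026, 1, 5), "fixed": datetime(2026, 1, 9)},
    {"id": "prompt-leak-07", "discovered": datetime(2026, 1, 12), "fixed": None},  # still open
]

def time_to_fix_days(items) -> dict:
    closed = [(f["fixed"] - f["discovered"]).days for f in items if f["fixed"]]
    return {
        "median_days_to_fix": median(closed) if closed else None,
        "open_findings": sum(f["fixed"] is None for f in items),
    }
```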

Safety also drifts as your user base changes. Add drift detection metrics: shifts in topic distribution, language distribution, and the share of “high-risk” intents. When drift is detected, increase sampling or tighten controls temporarily. If you operate globally, track safety performance per language, because moderation and refusal behaviour can degrade outside English even when the headline metric looks stable.
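
One simple way to quantify drift is a divergence measure between a baseline and the current topic (or language) distribution; the Jensen-Shannon divergence below and the 0.1 alert threshold are illustrative choices, not recommendations:

```python
import math

def js_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence between two categorical distributions
    (e.g. topic shares this week vs. a baseline). Epsilon avoids log(0)."""
    keys = set(p) | set(q)
    eps = 1e-12
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(max(a.get(k, 0.0), eps) * math.log(max(a.get(k, 0.0), eps) / max(m[k], eps)) for k in keys)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Baseline vs. current week's topic mix; numbers are placeholders.
baseline = {"billing": 0.4, "technical": 0.5, "account_closure": 0.1}
current = {"billing": 0.3, "technical": 0.4, "account_closure": 0.3}
if js_divergence(baseline, current) > 0.1:
    print("Topic drift detected: increase sampling and review controls")
```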

Finally, align your measurement language with governance frameworks so executives and auditors can understand it. Many organisations map safety controls and metrics to risk management practices such as NIST’s AI RMF and organisational AI management requirements such as ISO/IEC 42001. That mapping does not replace engineering metrics, but it prevents safety work being treated as “soft” compared with latency or cost.

Latency and cost: treat performance as user experience plus unit economics

Latency should be measured as a distribution, not an average. The user remembers the slowest moments, not the mean. Track p50, p95 and p99 end-to-end latency, and split it into components: time to first token, generation time, retrieval/tool time, and post-processing. This breakdown stops teams “fixing” a number by moving work elsewhere in the pipeline.
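
A sketch of the report, assuming tracing already attaches per-component timings to each request and that numpy is available for percentiles; the example timings are placeholders:

```python
import numpy as np

# Each request is assumed to carry per-component timings (ms) from tracing,
# plus a separately measured time to first token.
requests = [
    {"retrieval_tools": 350, "generation": 1800, "post_processing": 40, "ttft": 420},
    {"retrieval_tools": 900, "generation": 2600, "post_processing": 55, "ttft": 610},
    # ... the rest of the traffic sample
]

def latency_report(reqs, percentiles=(50, 95, 99)) -> dict:
    def pct(values):
        return {f"p{p}": float(np.percentile(values, p)) for p in percentiles}
    end_to_end = [r["retrieval_tools"] + r["generation"] + r["post_processing"] for r in reqs]
    report = {"end_to_end": pct(end_to_end), "time_to_first_token": pct([r["ttft"] for r in reqs])}
    for component in ("retrieval_tools", "generation", "post_processing"):
        report[component] = pct([r[component] for r in reqs])
    return report
```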

Define service-level objectives that match the user journey. For chat, time to first token is often the strongest perceived-speed lever; for long answers, throughput matters more. If you stream tokens, measure “time to first meaningful chunk” (not just first byte) and monitor disconnect rate while streaming, because a fast start is pointless if the session drops midway.
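
If the client exposes the stream as an iterable of text chunks, both metrics can be captured in the same wrapper; the iterable interface and the 20-character “meaningful” threshold are assumptions:

```python
import time

def stream_metrics(stream, min_meaningful_chars: int = 20) -> dict:
    """Consume a token stream (an iterable of text chunks) and record the time
    to the first meaningful chunk plus whether the stream completed."""
    start = time.monotonic()
    received = ""
    first_meaningful = None
    completed = False
    try:
        for chunk in stream:
            received += chunk
            if first_meaningful is None and len(received.strip()) >= min_meaningful_chars:
                first_meaningful = time.monotonic() - start
        completed = True
    except ConnectionError:
        pass  # count as a mid-stream disconnect
    return {"time_to_first_meaningful_chunk_s": first_meaningful, "disconnected": not completed}
```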

For cost, do not stop at cost per 1,000 tokens. The business metric is cost per successful outcome: cost per resolved ticket, cost per completed workflow, cost per qualified lead, or cost per compliant answer. That shifts optimisation from “shorter answers” to “fewer retries, fewer escalations, fewer incidents”, which is usually where the real money is.
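
A sketch of the calculation, with illustrative token prices and field names; the point is that retries raise the numerator while failed outcomes shrink the denominator:

```python
# Illustrative token prices (USD per 1,000 tokens); substitute your provider's actual rates.
PRICE_IN, PRICE_OUT = 0.003, 0.015

def cost_per_successful_outcome(interactions) -> dict:
    """interactions: dicts with token counts, a retry count and an outcome label.
    All field names are illustrative assumptions."""
    total_cost = sum(
        i["input_tokens"] / 1000 * PRICE_IN + i["output_tokens"] / 1000 * PRICE_OUT
        for i in interactions
    )
    successes = sum(1 for i in interactions if i["outcome"] == "resolved")
    return {
        "cost_per_resolved_ticket": total_cost / successes if successes else None,
        "retries_per_interaction": sum(i["retries"] for i in interactions) / len(interactions),
    }
```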

Putting it together: a scorecard, thresholds and a weekly operating rhythm

Create a single scorecard that fits on one page: hallucination (material + severity-weighted), safety (violations + high-severity incidents + refusal accuracy), latency (p95 end-to-end + time to first token), and cost (cost per successful task + tokens per task). Add one user metric that matters to your product, such as task success rate or user-rated helpfulness, so the team cannot optimise system metrics while user outcomes decline.

Set thresholds with an explicit “what happens next”. For example: if high-severity safety incidents exceed a small fixed number in a week, you trigger an incident review and a temporary policy tightening; if p95 latency breaches SLO for two consecutive days, you roll back the latest change or shift traffic to a safer configuration; if cost per successful task rises, you investigate retries, tool failures and prompt regressions before cutting answer length.
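
Writing the thresholds down as data makes the “what happens next” part auditable; the numbers below are placeholders, not recommendations:

```python
# Illustrative threshold -> action mapping; thresholds are assumptions to set per product.
ALERT_RULES = [
    {"metric": "high_severity_safety_incidents_per_week", "threshold": 2, "comparison": ">",
     "action": "open incident review; temporarily tighten refusal policy"},
    {"metric": "p95_latency_slo_breach_days", "threshold": 2, "comparison": ">=",
     "action": "roll back latest change or shift traffic to a safer configuration"},
    {"metric": "cost_per_successful_task_wow_change", "threshold": 0.15, "comparison": ">",
     "action": "investigate retries, tool failures and prompt regressions before cutting answer length"},
]

def triggered(rules, metrics) -> list:
    """Return the rules whose threshold is breached by the current metric values."""
    ops = {">": lambda a, b: a > b, ">=": lambda a, b: a >= b}
    return [r for r in rules
            if r["metric"] in metrics and ops[r["comparison"]](metrics[r["metric"]], r["threshold"])]
```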

Adopt a cadence that prevents surprises: daily automated dashboards for latency and cost; weekly sampled review for hallucinations and safety; and release-gating tests before model updates. Frameworks like HELM can be useful inspiration for multi-metric evaluation thinking, but your scorecard must be grounded in your product tasks, your risks, and your users. When metrics are tied to decisions, teams stop debating numbers and start shipping measurable improvements.
