Day 24

You can’t debug what you can’t see — build the observability stack for production AI.

Context

AI observability makes LLM application behavior visible, debuggable, and improvable in production. Unlike traditional software where stack traces tell you what happened, LLM applications fail subtly: technically valid but incorrect outputs, wrong tool selections, poor retrieval, unexpected agent paths. Without observability, debugging is guessing. With it, every failure has a traceable root cause.

The four-layer approach: (1) Instrumentation — capture inputs, outputs, tokens, latency, and errors for every LLM call as spans in a trace. (2) Monitoring — track quality metrics, cost, and latency trends over time. (3) Alerting — trigger when quality degrades (the thing that actually hurts the product). (4) Quality sampling — regularly review production outputs for issues automated metrics miss.
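
To make the instrumentation layer concrete, here is a minimal sketch of recording each step of a request as a span within one trace. It uses only the Python standard library and an in-memory list; it is a stand-in for a real OpenTelemetry setup, not a replacement for one. The names `SPANS`, `span`, `fake_llm`, and `answer` are hypothetical, and the token counts are a crude word-count proxy.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink; a real setup would export spans to a backend.
SPANS = []


@contextmanager
def span(name, trace_id, **attributes):
    """Record one step of a request: name, attributes, latency, and any error."""
    record = {
        "trace_id": trace_id,
        "name": name,
        "attributes": dict(attributes),
        "error": None,
    }
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Keep the error on the span, then let it propagate to the caller.
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)


def fake_llm(prompt):
    """Stand-in for a real model call; word counts are not real token counts."""
    return {
        "text": "stub answer",
        "input_tokens": len(prompt.split()),
        "output_tokens": 2,
    }


def answer(question):
    """Handle one request; every span shares the same trace_id."""
    trace_id = uuid.uuid4().hex
    with span("retrieve", trace_id, query=question) as s:
        chunks = ["chunk a", "chunk b"]  # placeholder retrieval result
        s["attributes"]["num_chunks"] = len(chunks)
    with span("llm_call", trace_id, model="example-model") as s:
        prompt = question + "\n" + "\n".join(chunks)
        result = fake_llm(prompt)
        s["attributes"].update(
            input=prompt,
            output=result["text"],
            input_tokens=result["input_tokens"],
            output_tokens=result["output_tokens"],
        )
    return result["text"]
```

After calling `answer("What is the notice period?")`, `SPANS` holds a `retrieve` span and an `llm_call` span with the same `trace_id`. That shared id is what lets you follow one user request across steps.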

Platform landscape (2026): LangSmith is LangChain’s product but works beyond LangChain apps, with strong tracing and eval integration. Braintrust is growing fast and has excellent eval and experimentation features. Langfuse is the leading open-source option; v3 brings an improved UI and production monitoring. Arize Phoenix is open-source and OpenTelemetry-native. Helicone is proxy-based and needs zero code changes (just change the API base URL). OpenTelemetry is now the de-facto instrumentation standard; LlamaIndex, LangChain, AutoGen, and CrewAI all support it natively. When writing requirements, specify "use OpenTelemetry instrumentation" rather than framework-specific tracing, so traces stay portable across these platforms.

Cost attribution by feature: Add metadata tags to each API call ({"feature": "contract_review", "user_tier": "enterprise"}) and track cost per feature in your observability layer. This enables targeting the highest-spend features for optimization first. Most platforms (LangSmith, Langfuse, Helicone) support this.
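
As a rough illustration of the tagging idea above, the sketch below sums cost per feature from a list of logged calls. The record shape, the `PRICE_PER_MTOK` table, and the prices in it are assumptions made for the example; substitute your provider's real rates and your platform's actual export format.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1M tokens; replace with real provider rates.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def call_cost(call):
    """Cost in USD of one logged LLM call, from its token counts."""
    price = PRICE_PER_MTOK[call["model"]]
    return (
        call["input_tokens"] * price["input"]
        + call["output_tokens"] * price["output"]
    ) / 1_000_000


def cost_by_feature(calls):
    """Total cost in USD grouped by the 'feature' metadata tag."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["metadata"]["feature"]] += call_cost(call)
    return dict(totals)


# Example log records (assumed shape, made-up numbers).
calls = [
    {"model": "example-model", "input_tokens": 2000, "output_tokens": 500,
     "metadata": {"feature": "contract_review", "user_tier": "enterprise"}},
    {"model": "example-model", "input_tokens": 1000, "output_tokens": 200,
     "metadata": {"feature": "contract_review", "user_tier": "free"}},
    {"model": "example-model", "input_tokens": 500, "output_tokens": 100,
     "metadata": {"feature": "summarize", "user_tier": "free"}},
]
```

Sorting the result of `cost_by_feature(calls)` by value gives the optimization order: the most expensive feature comes first.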

Anomaly detection: Beyond static thresholds, production AI monitoring uses statistical anomaly detection. For example, alert when acceptance rate falls more than 2 standard deviations below its rolling average. Because the baseline tracks normal variation, this can flag a drop that never crosses a fixed threshold, which helps catch quality regressions before customers notice.
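
A minimal sketch of the 2-standard-deviation check, assuming you already have one acceptance-rate value per day for the rolling window. The function name `is_anomalous_drop` and the sample numbers are illustrative only.

```python
from statistics import mean, stdev


def is_anomalous_drop(history, today, threshold=2.0):
    """True when `today` is more than `threshold` sample standard
    deviations below the mean of the `history` window."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today < mu  # flat history: any drop is unusual
    return (mu - today) / sigma > threshold


# Seven days of acceptance rate (made-up values, mean 0.81).
last_week = [0.82, 0.80, 0.81, 0.83, 0.79, 0.80, 0.82]
```

With this window, `is_anomalous_drop(last_week, 0.74)` flags the day, while a static alert set at 0.70 would stay silent. A value of 0.80 is within normal variation and is not flagged.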

Tasks (4)

Design your observability plan (25 min)
Define: 5 spans to trace (embed, retrieve, LLM call, tool call, output validation), 3 metrics to monitor (latency p99, acceptance rate, cost per request), and 2 alerts (quality drop, cost spike). Save as /day-24/observability_plan.md.
Explore Langfuse or Helicone (25 min)
Set up Langfuse (open-source) or Helicone (proxy-based, zero code changes). Instrument a simple LLM call. Inspect the trace: what does it show about token usage and latency that logs alone don’t? Save as /day-24/trace_analysis.md.
Design cost attribution (25 min)
Your product has 5 features using LLM calls. Design the metadata tagging scheme and the dashboard that shows cost per feature per day. How do you identify which feature is driving a sudden cost spike? Save as /day-24/cost_attribution_design.md.
Debug a production failure from a trace (25 min)
A user reports a wrong answer. You open the trace and see: retrieval returned 3 chunks, 2 were irrelevant. Write the root cause analysis and the fix (improve retrieval, not the prompt). Save as /day-24/trace_debug_exercise.md.
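
For the trace-debugging task, one way to make the root cause explicit is to compute retrieval precision from the chunks in the trace. This is a sketch under assumptions: the `relevant` labels come from a human reviewing the trace, and the `min_precision` cutoff of 0.67 is an arbitrary illustrative value, not a recommended standard.

```python
def retrieval_precision(chunks):
    """Fraction of retrieved chunks that a reviewer judged relevant."""
    if not chunks:
        return 0.0
    return sum(1 for c in chunks if c["relevant"]) / len(chunks)


def likely_root_cause(chunks, min_precision=0.67):
    """Point at retrieval when most chunks are irrelevant; otherwise
    look downstream at the prompt or the model."""
    if retrieval_precision(chunks) < min_precision:
        return "retrieval"
    return "generation"


# Chunks copied from the trace, labeled by hand (hypothetical ids).
trace_chunks = [
    {"id": "doc-12#3", "relevant": True},
    {"id": "doc-40#1", "relevant": False},
    {"id": "doc-07#9", "relevant": False},
]
```

Here precision is 1 of 3, so `likely_root_cause(trace_chunks)` returns `"retrieval"`, matching the scenario in the task: fix retrieval before touching the prompt.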

Interview question

How would you set up observability for an AI product in production?

Four layers, in order of implementation priority.

Instrumentation (week 1): Capture every LLM call as a span with input prompt, output, model, tokens (input/output/cached), latency, and custom metadata (feature name, user tier). Use OpenTelemetry, the de-facto standard supported by the major frameworks. Send traces to Langfuse (open-source) or route calls through Helicone (proxy-based, zero code changes).

Monitoring (week 2): Dashboard tracking: p99 latency by feature, acceptance rate (users accepting vs rejecting AI outputs), cost per request and per feature, and error rate. The acceptance rate is the most undervalued metric — it’s the leading indicator of product quality.
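
The dashboard metrics above can be sketched from raw request logs as follows. The record fields (`feature`, `latency_ms`, `accepted`) are an assumed log shape, and the percentile uses the simple nearest-rank method; observability platforms may compute percentiles differently.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_latency_by_feature(requests):
    """p99 latency in milliseconds, grouped by feature."""
    by_feature = defaultdict(list)
    for r in requests:
        by_feature[r["feature"]].append(r["latency_ms"])
    return {f: percentile(v, 99) for f, v in by_feature.items()}


def acceptance_rate(requests):
    """Share of rated outputs the user accepted; unrated requests
    (accepted is None) are excluded. Returns None if nothing was rated."""
    rated = [r for r in requests if r.get("accepted") is not None]
    if not rated:
        return None
    return sum(1 for r in rated if r["accepted"]) / len(rated)


# Made-up request log.
requests = [
    {"feature": "contract_review", "latency_ms": 100, "accepted": True},
    {"feature": "contract_review", "latency_ms": 200, "accepted": False},
    {"feature": "contract_review", "latency_ms": 300, "accepted": True},
    {"feature": "contract_review", "latency_ms": 4000, "accepted": None},
    {"feature": "summarize", "latency_ms": 150, "accepted": True},
]
```

Note how the single 4000 ms request dominates p99 for `contract_review` even though the other three requests are fast. That is why the plan tracks p99 rather than the average.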

Alerting (week 3): Statistical anomaly detection on acceptance rate (alert on a drop of more than 2 standard deviations below the rolling 7-day average), plus a static threshold on p99 latency. Prioritize the quality alert, because quality degradation is what hurts the product.

Quality sampling (ongoing): Weekly manual review of 50 random production outputs. Automated evals miss edge cases that a PM reviewing real outputs catches. This is the PM’s direct connection to product quality.
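
Drawing the weekly review set can be as simple as a uniform random sample, which avoids reviewing only the outputs that happen to be complained about. A minimal sketch, assuming production outputs are available as a list; `weekly_review_sample` is a hypothetical helper name.

```python
import random


def weekly_review_sample(outputs, k=50, seed=None):
    """Pick k outputs uniformly at random, without replacement.
    Pass a seed to make the selection reproducible for audit."""
    outputs = list(outputs)
    if len(outputs) <= k:
        return outputs  # small week: review everything
    return random.Random(seed).sample(outputs, k)
```

For example, `weekly_review_sample(trace_ids, k=50, seed=24)` returns the same 50 trace ids every time it runs on the same input, so two reviewers can work from an identical set.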

PM angle

Observability is a PM responsibility. You should be able to open a trace, see why the product gave a wrong answer, and diagnose whether the fix is retrieval quality, prompt engineering, or model selection. The PM who reviews production traces weekly builds better products than one who waits for complaint tickets.

Resources

TOOL Langfuse — Leading open-source LLM observability. V3 with production monitoring.
TOOL Helicone — Proxy-based: change API base URL, get full observability. Zero code changes.
TOOL Braintrust — Eval + observability platform. Strong experimentation features.
TOOL LangSmith — LangChain’s observability product. Works beyond LangChain apps.
TOOL Arize Phoenix — Open-source, OpenTelemetry-native observability.
DOCS OpenTelemetry for LLMs — De-facto instrumentation standard. All major AI frameworks support it.