Day 6

Reasoning models change the product calculus — learn when to pay the premium and when it backfires.

Context

Reasoning models, such as OpenAI’s o3 and o4-mini and Claude’s extended thinking mode, produce significantly better answers on hard problems by spending more inference time "thinking" before responding. The mechanism: these models generate an internal chain of thought ("thinking tokens") that lets them work through and check their reasoning before outputting a final answer. On math competitions, coding benchmarks, and scientific reasoning tasks, reasoning models consistently outperform standard models by meaningful margins. The improvement is real and measurable on hard tasks, but it comes with a cost and latency premium, so deciding which requests to route to a reasoning model matters.

The 2025-2026 reasoning model landscape: o3 is OpenAI’s most capable reasoning model, at premium pricing. o4-mini, the most important competitive development of 2025, provides strong reasoning at a fraction of o3’s cost, making the "reasoning vs. non-reasoning" routing decision far less economically punitive. Claude’s extended thinking (introduced with Claude 3.7 Sonnet and available across the Claude 4.x series) follows a similar approach: the model generates thinking tokens before the final response. The key difference: Claude’s thinking can be streamed separately from the answer, enabling a "thinking in progress" UX.

The product-critical nuance: reasoning models are worse for easy tasks. Asking a reasoning model to extract a name from a document, format a table, or classify sentiment is expensive overkill. Worse, reasoning models sometimes over-think simple problems — producing more verbose, less direct answers than standard models. For customer support deflection or simple Q&A, standard models produce tighter, faster outputs. This is counterintuitive and important: better reasoning capability doesn’t mean better performance on all tasks. You need routing logic that matches task complexity to model capability.

Claude’s extended thinking exposes a budget_tokens parameter that caps how many tokens the model can spend on thinking. This is a product dial, not a binary switch: a larger budget generally means better reasoning quality on hard problems, at the price of higher cost and latency. A PM needs to understand this tradeoff and specify the right budget for each use case. For a legal analysis feature, you might set a high thinking budget (more thorough reasoning). For a quick extraction feature, you would turn extended thinking off entirely.
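
As a minimal sketch of the dial described above, the snippet below turns a per-use-case budget into a request body for Claude. The names THINKING_BUDGETS and buildClaudeRequest, the budget numbers, and the 4000-token answer allowance are illustrative assumptions. The 1024-token minimum and the rule that budget_tokens must stay below max_tokens reflect my reading of Anthropic’s documentation and should be verified there before use.

```javascript
// Sketch: build a Claude request body with a per-use-case thinking budget.
// THINKING_BUDGETS and buildClaudeRequest are hypothetical names for illustration.
const MIN_THINKING_BUDGET = 1024; // assumed API minimum; verify in the docs

const THINKING_BUDGETS = {
  legal_analysis: 16000,  // high budget: thoroughness matters more than speed
  code_review: 10000,     // moderate budget
  quick_extraction: 0     // 0 means extended thinking is off
};

function buildClaudeRequest(useCase, prompt, maxAnswerTokens = 4000) {
  const budget = THINKING_BUDGETS[useCase];
  if (budget === undefined) {
    throw new Error('Unknown use case: ' + useCase);
  }

  const request = {
    model: 'claude-sonnet-4-6', // model ID as used elsewhere in this lesson
    max_tokens: maxAnswerTokens,
    messages: [{ role: 'user', content: prompt }]
  };

  if (budget > 0) {
    if (budget < MIN_THINKING_BUDGET) {
      throw new Error('Thinking budget below assumed minimum: ' + budget);
    }
    request.thinking = { type: 'enabled', budget_tokens: budget };
    // Assumption: thinking tokens count toward max_tokens, so leave room
    // for the answer on top of the thinking budget.
    request.max_tokens = budget + maxAnswerTokens;
  }
  return request;
}

console.log(JSON.stringify(buildClaudeRequest('legal_analysis', 'Review this clause.'), null, 2));
```

Keeping the budgets in one table makes the dial a configuration decision that a PM can review per feature, rather than a value buried in call sites.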

Streaming with extended thinking changes the UX equation. When using extended thinking with streaming, Claude streams thinking tokens separately from response tokens. You can show a "thinking..." indicator with actual progress (the thinking tokens stream in real time) before the response begins. This creates a fundamentally different UX than just "waiting": users see the model working. Whether to expose the thinking to users is a genuine product decision: showing it can increase trust and explainability, but it can also make the wait feel longer and may expose the model’s uncertainty. Reasoning models typically take 3-10x longer to respond than standard models on comparable queries; always benchmark in your own production environment and use case rather than relying on generic estimates.
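
To make the show-or-hide decision concrete, here is a small sketch that separates thinking text from answer text in a stream of events. The event shapes (content_block_delta carrying either a thinking_delta or a text_delta) follow my understanding of Anthropic’s streaming format; the splitThinkingStream helper and the sample events are invented for illustration and no network call is made.

```javascript
// Sketch: route streamed deltas into a "thinking" pane and an "answer" pane.
// splitThinkingStream is a hypothetical helper; sampleEvents is made-up data.
function splitThinkingStream(events, { showThinking = true } = {}) {
  const ui = { thinking: '', answer: '', phases: [] };

  const enterPhase = (phase) => {
    // Record each phase change once so the UI can swap indicators.
    if (ui.phases[ui.phases.length - 1] !== phase) ui.phases.push(phase);
  };

  for (const event of events) {
    if (event.type !== 'content_block_delta') continue; // ignore start/stop events
    const delta = event.delta;
    if (delta.type === 'thinking_delta') {
      enterPhase('thinking');
      // With showThinking off, the UI still knows the model is thinking
      // but never receives the thinking text itself.
      if (showThinking) ui.thinking += delta.thinking;
    } else if (delta.type === 'text_delta') {
      enterPhase('answering');
      ui.answer += delta.text;
    }
  }
  return ui;
}

const sampleEvents = [
  { type: 'content_block_start', index: 0, content_block: { type: 'thinking', thinking: '' } },
  { type: 'content_block_delta', index: 0, delta: { type: 'thinking_delta', thinking: 'Check the indemnity clause. ' } },
  { type: 'content_block_delta', index: 0, delta: { type: 'thinking_delta', thinking: 'Compare with the liability cap.' } },
  { type: 'content_block_stop', index: 0 },
  { type: 'content_block_delta', index: 1, delta: { type: 'text_delta', text: 'The clause is ' } },
  { type: 'content_block_delta', index: 1, delta: { type: 'text_delta', text: 'enforceable.' } }
];

console.log(splitThinkingStream(sampleEvents, { showThinking: false }));
```

Both UX variants share the same phase tracking; the only difference is whether the thinking text reaches the interface, which keeps the product decision in a single flag.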

Tasks (4)

Build a routing decision guide (25 min)
Define criteria for when to route to a reasoning model vs standard model. Criteria should be specific enough to implement as code. Consider: task type, expected response complexity, user-facing latency tolerance, cost per query budget. Include the o4-mini tier — routing is no longer just "expensive reasoning vs cheap standard." Save as /day-06/routing_decision_guide.md.
Benchmark reasoning on a real task (25 min)
Find a complex SQL query, algorithmic problem, or multi-step analysis task. Run it through Claude extended thinking (or o3) AND Claude Sonnet 4.6 (standard). Compare quality, latency, and cost. Write up findings as a product recommendation: does the quality improvement justify the cost? Save as /day-06/reasoning_benchmark.md.
Design the "thinking" UX (25 min)
Your product uses extended thinking for complex queries. Design two UX approaches: one that shows the thinking stream (real-time progress), one that hides it (just shows "analyzing..."). What are the tradeoffs in trust, latency perception, and user confidence? Which would you ship for (a) a legal analysis tool, (b) a consumer chatbot? Save as /day-06/thinking_ux_design.md.
Cost-model the reasoning premium with o4-mini (25 min)
Your product handles 10,000 requests/day. 20% are complex enough to benefit from reasoning. Calculate monthly cost for three strategies: (a) all standard model, (b) 20% routed to o3, (c) 20% routed to o4-mini. Use current pricing from openai.com/pricing. How does o4-mini change the routing economics? Save as /day-06/cost_model_reasoning.md.
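
For the cost-model task, the arithmetic can be sketched as below. The per-request costs in PLACEHOLDER_COSTS are invented placeholders, not real prices; replace them with figures derived from current pricing pages and your own measured token counts. The monthlyCost function and the 30-day month are also assumptions made for illustration.

```javascript
// Sketch: monthly cost of a routing strategy for the cost-model task.
// Costs are dollars per request; every number in PLACEHOLDER_COSTS is made up.
function monthlyCost({ requestsPerDay, complexShare, days, standardCost, reasoningCost }) {
  const complexPerDay = Math.round(requestsPerDay * complexShare);
  const simplePerDay = requestsPerDay - complexPerDay;
  return days * (simplePerDay * standardCost + complexPerDay * reasoningCost);
}

const PLACEHOLDER_COSTS = { standard: 0.002, o4mini: 0.004, o3: 0.02 };
const traffic = { requestsPerDay: 10000, complexShare: 0.2, days: 30 };

// Strategy (a) is modeled by pricing the complex share at the standard rate.
const strategies = {
  'all standard': PLACEHOLDER_COSTS.standard,
  '20% routed to o3': PLACEHOLDER_COSTS.o3,
  '20% routed to o4-mini': PLACEHOLDER_COSTS.o4mini
};

for (const [name, reasoningCost] of Object.entries(strategies)) {
  const cost = monthlyCost({
    ...traffic,
    standardCost: PLACEHOLDER_COSTS.standard,
    reasoningCost
  });
  console.log(name + ': $' + cost.toFixed(2) + ' per month');
}
```

When you derive real per-request costs, remember that thinking or reasoning tokens are generally billed as output tokens, so a reasoning request costs more than its visible answer length suggests.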

Reasoning model routing with budget_tokens — JavaScript

// Reasoning Model Router — including extended thinking budget management
// Updated March 2026: includes o4-mini tier and Claude budget_tokens

const ROUTING_SIGNALS = {
  reasoning: [
    'calculate', 'prove', 'debug', 'analyze', 'optimize', 'compare tradeoffs',
    'write tests for', 'architecture decision', 'why does this fail', 'step by step',
    'what are the implications', 'evaluate' // also matches "evaluate the legal risk"
  ],
  standard: [
    'summarize', 'translate', 'extract', 'format', 'classify', 'list',
    'rewrite', 'what is', 'explain briefly', 'draft email', 'schedule'
  ]
};

function routeQuery(query) {
  const q = query.toLowerCase();
  let reasoningScore = 0;
  let standardScore = 0;

  ROUTING_SIGNALS.reasoning.forEach(signal => {
    if (q.includes(signal)) reasoningScore++;
  });
  ROUTING_SIGNALS.standard.forEach(signal => {
    if (q.includes(signal)) standardScore++;
  });

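  // Length heuristic: longer queries tend to be more complex.
  // Split on any whitespace so repeated spaces do not inflate the word count.
  const queryLength = q.trim().split(/\s+/).length;
  if (queryLength > 50) reasoningScore += 2;
  // Only favor the standard tier for short queries with no reasoning signal;
  // otherwise a short "Debug this..." query ties and falls through to standard.
  if (queryLength < 15 && reasoningScore === 0) standardScore += 1;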

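  // Three-tier routing: standard, affordable reasoning, premium reasoning.
  // Note: thinkingBudget is a Claude-style token budget (budget_tokens).
  // OpenAI's reasoning models are configured with a reasoning-effort setting
  // instead, so treat the number as a relative effort level for o3 / o4-mini.
  let model, tier, thinkingBudget;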
  if (reasoningScore > standardScore + 2) {
    model = 'o3'; tier = 'premium-reasoning';
    thinkingBudget = 16000; // high budget for complex tasks
  } else if (reasoningScore > standardScore) {
    model = 'o4-mini'; tier = 'affordable-reasoning';
    thinkingBudget = 8000;  // moderate budget
  } else {
    model = 'claude-sonnet-4-6'; tier = 'standard';
    thinkingBudget = 0;     // no thinking needed
  }

  return { model, tier, thinkingBudget, reasoningScore, standardScore };
}

// Claude extended thinking configuration
const CLAUDE_EXTENDED_CONFIGS = {
  legal_analysis: {
    model: 'claude-sonnet-4-6',
    thinking: { type: 'enabled', budget_tokens: 16000 },
    note: 'High budget — thorough reasoning worth cost for legal accuracy'
  },
  quick_extraction: {
    model: 'claude-haiku-4-5-20251001',
    thinking: null, // no extended thinking
    note: 'Standard model, no thinking — speed and cost priority'
  },
  code_review: {
    model: 'claude-sonnet-4-6',
    thinking: { type: 'enabled', budget_tokens: 10000 },
    note: 'Moderate budget — helps catch logic errors without excessive cost'
  }
};

// Demo routing
const queries = [
  "What is the capital of France?",
  "Debug this recursive function and explain why it fails on edge cases",
  "Summarize this email in 2 sentences",
  "Analyze the tradeoffs between microservices and monolith for 8 engineers",
  "Translate this to Spanish: Good morning",
  "Evaluate the legal risk of this contract clause and suggest alternatives"
];

console.log('THREE-TIER ROUTING — Standard / Affordable Reasoning / Premium');
console.log('='.repeat(65));
queries.forEach(q => {
  const result = routeQuery(q);
  const preview = q.length > 55 ? q.slice(0, 52) + '...' : q.padEnd(55);
  console.log(preview + ' -> ' + result.model + ' [' + result.tier + ']');
  if (result.thinkingBudget > 0) {
    console.log('  ' + ' '.repeat(55) + '   thinking budget: ' + result.thinkingBudget);
  }
});

console.log('\nCLAUDE EXTENDED THINKING CONFIGS:');
console.log('-'.repeat(65));
Object.entries(CLAUDE_EXTENDED_CONFIGS).forEach(([name, cfg]) => {
  console.log(name + ': ' + cfg.model + (cfg.thinking ? ' | budget: ' + cfg.thinking.budget_tokens : ' | no thinking'));
  console.log('  ' + cfg.note);
});

console.log('\nKEY INSIGHT: Extended thinking is a dial, not a switch.');
console.log('budget_tokens controls quality/cost/latency tradeoff per use case.');
console.log('Reasoning models HURT on simple tasks — route carefully.');
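
A routing result still has to be turned into provider-specific request parameters. The sketch below shows one way to do that, under stated assumptions: toProviderParams is a hypothetical helper, the effort thresholds are arbitrary, and the reasoning_effort field reflects my understanding of OpenAI’s Chat Completions API for reasoning models, so check the current API reference before relying on it.

```javascript
// Sketch: translate a routing result into provider-specific parameters.
// toProviderParams and the effort thresholds are illustrative assumptions.
function toProviderParams(route) {
  if (route.model.startsWith('claude')) {
    // Claude: a token budget enables extended thinking; no budget means off.
    return route.thinkingBudget > 0
      ? { model: route.model, thinking: { type: 'enabled', budget_tokens: route.thinkingBudget } }
      : { model: route.model };
  }

  // OpenAI reasoning models: map the relative budget to an effort level.
  let effort = 'low';
  if (route.thinkingBudget >= 16000) effort = 'high';
  else if (route.thinkingBudget >= 8000) effort = 'medium';
  return { model: route.model, reasoning_effort: effort };
}

console.log(toProviderParams({ model: 'o3', thinkingBudget: 16000 }));
console.log(toProviderParams({ model: 'o4-mini', thinkingBudget: 8000 }));
console.log(toProviderParams({ model: 'claude-sonnet-4-6', thinkingBudget: 0 }));
```

Keeping this translation in one function means the router can stay provider-agnostic and reason only about relative effort.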

Interview question

When would you choose a reasoning model over a standard model for an AI product feature?

I use a reasoning model when three conditions are true: the task has an objectively correct or clearly better answer that extra thinking would find, the latency increase (3-10x longer than standard) is acceptable for the UX, and the quality improvement justifies the cost premium.

Good candidates: complex code generation (not autocomplete, but writing algorithms), financial analysis with cascading calculations, scientific reasoning, advanced SQL against complex schemas, and legal clause evaluation where missing a nuance has real consequences.

Bad candidates: content extraction, summarization, translation, simple Q&A, classification. Reasoning models sometimes over-think simple problems — producing verbose, indirect answers when the user wants three words. Standard models are tighter and faster here.

In practice, I’d build a three-tier routing strategy. The economics changed dramatically in 2025: o4-mini provides strong reasoning at much lower cost than o3, so the penalty for over-routing is less severe than it was. For Claude, extended thinking with a budget_tokens parameter gives even more granular control: you can dial the thinking budget per use case rather than routing to a completely different model. A legal feature gets a high thinking budget; a quick extraction feature gets none.

The routing classifier itself should be extremely cheap — don’t use a reasoning model to decide whether to use a reasoning model. A simple keyword + query length heuristic works surprisingly well as a baseline, then refine with actual usage data.
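
"Refine with actual usage data" can start as simply as scoring the heuristic against a handful of labeled queries. The sketch below counts over-routing (paying for reasoning that was not needed) and under-routing (sending a hard query to the standard tier). The evaluateRouter function, the stand-in baselineRouter, and the four labeled samples are all invented for illustration.

```javascript
// Sketch: measure a router against labeled queries.
// evaluateRouter, baselineRouter, and labeledQueries are hypothetical examples.
function evaluateRouter(router, labeledQueries) {
  let correct = 0;
  let overRouted = 0;   // routed to reasoning, but the query did not need it
  let underRouted = 0;  // routed to standard, but the query needed reasoning

  for (const { query, needsReasoning } of labeledQueries) {
    const routedToReasoning = router(query);
    if (routedToReasoning === needsReasoning) correct++;
    else if (routedToReasoning) overRouted++;
    else underRouted++;
  }

  const total = labeledQueries.length;
  return { total, correct, overRouted, underRouted, accuracy: total ? correct / total : 0 };
}

// Stand-in keyword router: returns true when a reasoning keyword is present.
const baselineRouter = (query) => /debug|prove|analyze/i.test(query);

const labeledQueries = [
  { query: 'Debug this recursive function', needsReasoning: true },
  { query: 'Summarize this email', needsReasoning: false },
  { query: 'Analyze this sentence for sentiment', needsReasoning: false },
  { query: 'Evaluate the legal risk of this clause', needsReasoning: true }
];

console.log(evaluateRouter(baselineRouter, labeledQueries));
```

Tracking the two error types separately matters because they cost different things: over-routing wastes money and adds latency, while under-routing degrades answer quality.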

PM angle

Reasoning models mark the first time that "spend more compute at inference time" has been a product lever you can pull. Extended thinking’s budget_tokens parameter makes it a dial, not a switch. Understanding when to turn that dial up (hard problems where accuracy matters), when to turn it down (simple tasks where speed matters), and when to turn it off entirely (high-volume extraction) is a core AI PM competency.

Resources

DOCS Claude Extended Thinking — How extended thinking works: budget_tokens, streaming, and when to use it.
BLOG OpenAI o3 / o4-mini — OpenAI’s reasoning model series. Compare pricing: o4-mini vs o3.
PAPER Scaling LLM Test-Time Compute — Research paper on the test-time compute approach behind reasoning models.
DOCS Streaming Extended Thinking — How to stream thinking tokens separately — enables "thinking in progress" UX.
PRICING OpenAI Pricing — Compare o3 vs o4-mini vs GPT-4o pricing for routing cost models.
PRICING Anthropic Pricing — Extended thinking token costs. Verify before any cost calculation.