← All days

Day 14

When to adapt a model vs when to prompt better — the decision tree now includes open-weight alternatives.

Context

Fine-tuning updates a model’s weights on a curated training dataset to improve performance on a specific task. It’s appropriate when: format consistency that’s hard to achieve through prompting alone, domain-specific vocabulary that general models handle poorly, high-volume inference where a smaller fine-tuned model replaces a larger general one, or proprietary behavioral patterns that can’t be exposed in a system prompt. Start with prompt engineering — the majority of production AI products are still prompt-engineered because it’s faster, cheaper, and more maintainable. Fine-tune only when prompting hits a quality ceiling.

The decision tree has expanded. In 2024, the choice was binary: fine-tune a proprietary model or improve your prompts. In 2026, there are three options: (1) Prompt engineering first — fastest, cheapest, most maintainable; (2) Fine-tune Claude via AWS Bedrock if you need Anthropic’s safety guarantees and enterprise SLAs (verify current fine-tuning availability at docs.anthropic.com); (3) Fine-tune an open-weight model with LoRA (Llama 4, Phi-4, Mistral) if data privacy is the primary constraint — no data leaves your environment. All three options are legitimate. Know all three.

LoRA (Low-Rank Adaptation) is the dominant fine-tuning technique for open-source models. It adds small trainable adapter layers to a frozen base model, making fine-tuning dramatically cheaper and faster: Phi-4 (14B parameters) can be LoRA fine-tuned on a single A100 GPU in hours. This is now a real alternative to proprietary API solutions for some use cases. When an enterprise asks "can we fine-tune Llama 4 on our proprietary data instead of paying for Claude API?", LoRA makes that not only possible but cheap. The PM answer: "Yes, and here’s when to choose each option" — not "that’s not possible."

Distillation trains a smaller model to mimic a larger one using synthetic training data from the teacher’s outputs. OpenAI explicitly supports distillation: GPT-4o outputs → fine-tune GPT-4o-mini. This gives large-model quality at small-model inference cost. Synthetic data generation — using Claude or GPT-4o to generate high-quality training examples — is now a standard production pattern, more cost-effective than human labeling for many use cases.

Data quality matters enormously more than quantity. Threshold guidelines (50 examples minimum, 500-1000 for reliable improvement) are rough approximations that vary significantly by task type and model. Don’t present them as universal rules — always benchmark your specific task and model combination. Budget 80% of effort on data curation and quality, 20% on the actual fine-tuning process.

Tasks (4)

  1. Fine-tuning decision audit: 3-option framework (25 min)
    Identify 3 AI product features where better model performance is needed. For each, evaluate all three options: (1) improved prompt engineering, (2) fine-tune Claude via Bedrock, (3) LoRA fine-tune Llama 4 or Phi-4. Which wins and why? Consider: data sensitivity, required safety guarantees, maintenance burden, and cost. Save as /day-14/finetune_decision_audit.md.
  2. Design a data collection workflow (25 min)
    You want to fine-tune a model for customer support at a SaaS company. Design the workflow: how do you identify good examples from existing tickets? Who reviews quality? What JSONL format do you use? Also: design a synthetic data generation approach using Claude to create additional training examples. Save as /day-14/data_collection_workflow.md.
  3. Fine-tune vs prompt comparison (25 min)
    Take a task where format consistency matters (e.g., writing product changelog entries in a specific style). Write a detailed system prompt with 3-5 few-shot examples. Then design what a fine-tuning dataset for the same task would look like (10 example pairs). Which approach gives better results with less ongoing maintenance? Save as /day-14/finetune_vs_prompt_comparison.md.
  4. Prepare the open-weight fine-tuning interview answer (25 min)
    Write a 200-word answer to: "We have 10,000 proprietary customer service transcripts. Should we fine-tune Claude, fine-tune Llama 4, or just prompt engineer?" Your answer must address all three options with specific prerequisites for each (data volume, privacy requirements, safety needs, cost). Save as /day-14/open_weight_finetune_answer.md.

Interview question

A customer wants their AI to always respond in their brand voice. Should you fine-tune or prompt engineer?

The 2026 answer has three options, not two. Start with prompting, escalate to fine-tuning only if prompting fails, and choose the right fine-tuning target.

Option 1 — Prompt engineering (try first): A well-crafted system prompt with 3-5 brand voice examples (few-shot) achieves remarkable consistency for most use cases. You can iterate in minutes, the style guide is directly visible, and brand voice changes don’t require retraining. Try this for 2 weeks before considering fine-tuning. If you achieve 90%+ brand consistency, you’re done.

Option 2 — Fine-tune Claude via Bedrock: If prompting hits a quality ceiling on highly distinctive voice (unusual tone, specific jargon, strict formatting), fine-tune on 200-500 brand-aligned examples. This gives better consistency and removes style instructions from the system prompt (reducing per-call cost at high volume). Choose this when you need Anthropic’s safety guarantees and compliance (SOC 2, HIPAA eligibility).

Option 3 — LoRA fine-tune Llama 4 or Phi-4: Choose this when data privacy is the primary constraint — the customer’s brand content never leaves their environment. LoRA on Phi-4 (14B params) runs on a single GPU. Trade-off: you lose Anthropic’s safety training and enterprise SLAs.

My recommendation: prompt engineering first (90% of the time it’s enough), then choose fine-tuning target based on the customer’s primary constraint — safety/compliance or data privacy.

PM angle

The fine-tune vs prompt decision is asked in every AI PM interview. In 2026, the complete answer includes three options: prompt engineering (default), proprietary fine-tune (compliance-first), and open-weight LoRA (privacy-first). The PM who only knows the first two is missing the option that many enterprise buyers will ask about. Know all three.

Resources