Day 14
When to adapt a model vs when to prompt better — the decision tree now includes open-weight alternatives.
Context
Fine-tuning updates a model’s weights on a curated training dataset to improve performance on a specific task. It’s appropriate when: format consistency that’s hard to achieve through prompting alone, domain-specific vocabulary that general models handle poorly, high-volume inference where a smaller fine-tuned model replaces a larger general one, or proprietary behavioral patterns that can’t be exposed in a system prompt. Start with prompt engineering — the majority of production AI products are still prompt-engineered because it’s faster, cheaper, and more maintainable. Fine-tune only when prompting hits a quality ceiling.
The decision tree has expanded. In 2024, the choice was binary: fine-tune a proprietary model or improve your prompts. In 2026, there are three options: (1) Prompt engineering first — fastest, cheapest, most maintainable; (2) Fine-tune Claude via AWS Bedrock if you need Anthropic’s safety guarantees and enterprise SLAs (verify current fine-tuning availability at docs.anthropic.com); (3) Fine-tune an open-weight model with LoRA (Llama 4, Phi-4, Mistral) if data privacy is the primary constraint — no data leaves your environment. All three options are legitimate. Know all three.
LoRA (Low-Rank Adaptation) is the dominant fine-tuning technique for open-source models. It adds small trainable adapter layers to a frozen base model, making fine-tuning dramatically cheaper and faster: Phi-4 (14B parameters) can be LoRA fine-tuned on a single A100 GPU in hours. This is now a real alternative to proprietary API solutions for some use cases. When an enterprise asks "can we fine-tune Llama 4 on our proprietary data instead of paying for Claude API?", LoRA makes that not only possible but cheap. The PM answer: "Yes, and here’s when to choose each option" — not "that’s not possible."
Distillation trains a smaller model to mimic a larger one using synthetic training data from the teacher’s outputs. OpenAI explicitly supports distillation: GPT-4o outputs → fine-tune GPT-4o-mini. This gives large-model quality at small-model inference cost. Synthetic data generation — using Claude or GPT-4o to generate high-quality training examples — is now a standard production pattern, more cost-effective than human labeling for many use cases.
Data quality matters enormously more than quantity. Threshold guidelines (50 examples minimum, 500-1000 for reliable improvement) are rough approximations that vary significantly by task type and model. Don’t present them as universal rules — always benchmark your specific task and model combination. Budget 80% of effort on data curation and quality, 20% on the actual fine-tuning process.
Tasks (4)
- Fine-tuning decision audit: 3-option framework (25 min) Identify 3 AI product features where better model performance is needed. For each, evaluate all three options: (1) improved prompt engineering, (2) fine-tune Claude via Bedrock, (3) LoRA fine-tune Llama 4 or Phi-4. Which wins and why? Consider: data sensitivity, required safety guarantees, maintenance burden, and cost. Save as /day-14/finetune_decision_audit.md.
- Design a data collection workflow (25 min) You want to fine-tune a model for customer support at a SaaS company. Design the workflow: how do you identify good examples from existing tickets? Who reviews quality? What JSONL format do you use? Also: design a synthetic data generation approach using Claude to create additional training examples. Save as /day-14/data_collection_workflow.md.
- Fine-tune vs prompt comparison (25 min) Take a task where format consistency matters (e.g., writing product changelog entries in a specific style). Write a detailed system prompt with 3-5 few-shot examples. Then design what a fine-tuning dataset for the same task would look like (10 example pairs). Which approach gives better results with less ongoing maintenance? Save as /day-14/finetune_vs_prompt_comparison.md.
- Prepare the open-weight fine-tuning interview answer (25 min) Write a 200-word answer to: "We have 10,000 proprietary customer service transcripts. Should we fine-tune Claude, fine-tune Llama 4, or just prompt engineer?" Your answer must address all three options with specific prerequisites for each (data volume, privacy requirements, safety needs, cost). Save as /day-14/open_weight_finetune_answer.md.
Interview question
A customer wants their AI to always respond in their brand voice. Should you fine-tune or prompt engineer?
Option 1 — Prompt engineering (try first): A well-crafted system prompt with 3-5 brand voice examples (few-shot) achieves remarkable consistency for most use cases. You can iterate in minutes, the style guide is directly visible, and brand voice changes don’t require retraining. Try this for 2 weeks before considering fine-tuning. If you achieve 90%+ brand consistency, you’re done.
Option 2 — Fine-tune Claude via Bedrock: If prompting hits a quality ceiling on highly distinctive voice (unusual tone, specific jargon, strict formatting), fine-tune on 200-500 brand-aligned examples. This gives better consistency and removes style instructions from the system prompt (reducing per-call cost at high volume). Choose this when you need Anthropic’s safety guarantees and compliance (SOC 2, HIPAA eligibility).
Option 3 — LoRA fine-tune Llama 4 or Phi-4: Choose this when data privacy is the primary constraint — the customer’s brand content never leaves their environment. LoRA on Phi-4 (14B params) runs on a single GPU. Trade-off: you lose Anthropic’s safety training and enterprise SLAs.
My recommendation: prompt engineering first (90% of the time it’s enough), then choose fine-tuning target based on the customer’s primary constraint — safety/compliance or data privacy.
PM angle
Resources
- DOCS OpenAI Fine-tuning Guide — Most comprehensive public fine-tuning guide. Includes distillation workflow.
- DOCS Anthropic Fine-tuning (verify availability) — Check current fine-tuning availability. Status changes — verify before claiming.
- PAPER LoRA Paper — Low-Rank Adaptation: the dominant open-source fine-tuning technique.
- PAPER Phi-4 Technical Report — Microsoft’s 14B model. LoRA fine-tunable on single GPU. Enterprise alternative.
- DOCS AWS Bedrock Fine-tuning — Enterprise Claude fine-tuning via Bedrock. Different options than direct API.
- PAPER Knowledge Distillation Survey — Comprehensive distillation techniques survey.