
Day 18

You can’t improve what you don’t measure — build the evaluation infrastructure every AI product needs.

Context

Evaluations (evals) are automated tests that measure AI output quality: the AI equivalent of unit tests. Without evals, every model update is a gamble. Public benchmarks measure model capability in the abstract: MMLU-Pro and GPQA for reasoning (Claude models perform well on these), Humanity’s Last Exam (HLE; released January 2025 and among the hardest public benchmarks), SWE-bench Verified (a widely cited coding agent benchmark; Claude Code’s performance here is relevant PM knowledge), and Chatbot Arena (LMSYS) for human preference. But the only measure that matters for your product is your internal evals on your specific tasks.

Building internal evals requires three components: (1) A golden dataset — 50-500 representative queries with expected outputs, curated by domain experts who understand quality. Diversity matters more than volume: include the hard edge cases, not just the easy ones. (2) An evaluation function — how you score quality. Options: exact match, reference comparison, LLM-as-judge (using Claude or GPT-4 to rate outputs), or human review. (3) A regression harness — run evals on every model change and alert when metrics drop.
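
As a rough illustration of how the three components fit together, here is a minimal sketch that assumes an exact-match evaluation function and stubbed model outputs. The names `goldenSet`, `exactMatch`, `runEvals`, and `checkRegression`, the sample rows, and the 0.05 tolerance are all hypothetical choices for the example, not part of any particular platform.

```javascript
// (1) Golden dataset: hypothetical rows; a real set would hold 50-500 examples.
const goldenSet = [
  { id: "q1", input: "What is your refund window?", expected: "30 days" },
  { id: "q2", input: "Which plan includes SSO?", expected: "Enterprise" },
  { id: "q3", input: "Do you support CSV export?", expected: "Yes" },
];

// (2) Evaluation function: exact match after trimming and lowercasing.
function exactMatch(output, expected) {
  return output.trim().toLowerCase() === expected.trim().toLowerCase() ? 1 : 0;
}

// Run every golden example through a model function and return the pass rate (0-1).
function runEvals(modelFn, dataset) {
  const total = dataset.reduce(
    (sum, row) => sum + exactMatch(modelFn(row.input), row.expected), 0);
  return total / dataset.length;
}

// (3) Regression harness: flag a drop larger than the tolerance (assumed value).
function checkRegression(baselineScore, candidateScore, tolerance = 0.05) {
  const drop = baselineScore - candidateScore;
  return { drop, regressed: drop > tolerance };
}

// Stub "models" standing in for real API calls (assumption for this sketch).
const answersV1 = { q1: "30 days", q2: "Enterprise", q3: "Yes" };
const answersV2 = { q1: "30 days", q2: "Business", q3: "yes " };
const byInput = (answers) => (input) =>
  answers[goldenSet.find((row) => row.input === input).id];

const baseline = runEvals(byInput(answersV1), goldenSet);  // 3 of 3 correct
const candidate = runEvals(byInput(answersV2), goldenSet); // 2 of 3 correct
const report = checkRegression(baseline, candidate);
console.log(report.regressed ? "ALERT: quality regression" : "OK");
```

In a real harness the stub functions would be replaced by calls to the model under test, and exact match would usually be one scorer among several.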

The LLM-as-judge pattern scales well and correlates reasonably with human judgments. But it has documented biases: verbosity bias (preferring longer responses), self-preference (favoring outputs from the judge’s own model), position bias (preferring the first response in A/B comparisons), and sycophancy bias (preferring responses that agree with the evaluator’s framing). Mitigate by randomizing presentation order, using a different model as judge than the one being evaluated, and calibrating judge scores against human ratings on a subset.
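
One way to handle position bias, sketched below, is a stricter variant of order randomization: run each A/B comparison in both orders and count a win only when the two runs agree. The `judge(prompt, first, second)` signature and both stub judges are hypothetical stand-ins for a real judge-model call.

```javascript
// Pairwise comparison that runs the judge in both orders to cancel position bias.
// `judge(prompt, first, second)` is a hypothetical function returning "first" or "second".
function compareBothOrders(judge, prompt, responseA, responseB) {
  const forward = judge(prompt, responseA, responseB);  // A shown first
  const reversed = judge(prompt, responseB, responseA); // B shown first
  const forwardWinner = forward === "first" ? "A" : "B";
  const reversedWinner = reversed === "first" ? "B" : "A";
  // Count a win only when both orderings agree; otherwise treat it as a tie.
  return forwardWinner === reversedWinner ? forwardWinner : "tie";
}

// Stub judge with pure position bias: always prefers whatever is shown first.
const positionBiasedJudge = () => "first";
// Stub judge that prefers the shorter response regardless of position.
const brevityJudge = (_prompt, first, second) =>
  first.length <= second.length ? "first" : "second";

console.log(compareBothOrders(positionBiasedJudge, "q", "long answer here", "short")); // tie
console.log(compareBothOrders(brevityJudge, "q", "long answer here", "short"));        // B
```

The purely position-biased judge produces only ties, so its bias never turns into a recorded win. The cost is two judge calls per comparison.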

Eval platforms in 2025-2026: LangSmith (LangChain’s product — works beyond LangChain apps), Braintrust (strong eval + experimentation platform, growing fast), Langfuse (open-source, good for teams not in the LangChain ecosystem), and Arize Phoenix (open-source, OpenTelemetry-native). For independent model benchmarking, Artificial Analysis (artificialanalysis.ai) provides real-time quality, price, and speed comparisons across models.

The PM’s role in evals is defining what "good" looks like. Engineering builds the infrastructure, but the PM defines the success criteria that become golden dataset labels and evaluation dimensions. Unclear criteria produce useless evals. The hardest problems are subjective: measuring "helpfulness," "brand voice compliance," or "appropriate caution" requires careful rubric design and a mix of automated and human evaluation. PMs own the definition of good.

Tasks (4)

  1. Define 5 eval criteria for your product (25 min)
    Choose an AI product (customer support, document Q&A, or one you’d build). Define 5 evaluation dimensions, each with a name, description, measurement method (exact match / LLM-as-judge / human review), scoring rubric (1-5), and the score that counts as a "pass." Save as /day-18/eval_criteria_definition.md. This is the most important PM artifact in your eval system.
  2. Build a golden dataset (20 examples) (25 min)
    For your chosen product, write 20 input/output pairs representing high-quality responses. Cover: 5 easy cases, 10 typical cases, and 5 hard edge cases. Format as JSONL with input, expected_output, difficulty, and eval_dimensions. Save as /day-18/golden_dataset_20examples.jsonl.
  3. Design CI/CD eval automation (25 min)
    Your team ships model updates monthly. Design the GitHub Actions workflow: trigger on PR, run eval suite against golden dataset, compare scores to baseline, auto-pass if above threshold, auto-block if below, human review for marginal cases. Write the workflow YAML skeleton. Save as /day-18/eval_ci_cd_design.md (or .github/workflows/eval.yml).
  4. Benchmark research: know the landscape (25 min)
    Research current model benchmarks: MMLU-Pro, HLE, SWE-bench Verified, Chatbot Arena, and Artificial Analysis. Where does Claude Sonnet 4.6 rank on each? Where does it trail? This is competitive intelligence you need for interviews. Save as /day-18/benchmark_landscape.md.
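
For task 2, a small validator can catch malformed rows before the dataset reaches an eval run. The sketch below checks the four fields the task names and tallies rows by difficulty. The function name `validateGoldenDataset`, the sample rows, and the difficulty labels ("easy", "medium", "hard") are assumptions for illustration.

```javascript
// Validate golden-dataset JSONL text against the fields named in task 2.
const REQUIRED_FIELDS = ["input", "expected_output", "difficulty", "eval_dimensions"];

function validateGoldenDataset(jsonlText) {
  const errors = [];
  const counts = {};
  // Rows are numbered over non-empty lines only.
  const rows = jsonlText.split("\n").filter((line) => line.trim() !== "");
  rows.forEach((line, index) => {
    let row;
    try {
      row = JSON.parse(line);
    } catch (err) {
      errors.push(`row ${index + 1}: invalid JSON`);
      return;
    }
    if (row === null || typeof row !== "object" || Array.isArray(row)) {
      errors.push(`row ${index + 1}: not a JSON object`);
      return;
    }
    for (const field of REQUIRED_FIELDS) {
      if (!(field in row)) errors.push(`row ${index + 1}: missing ${field}`);
    }
    if (row.difficulty) counts[row.difficulty] = (counts[row.difficulty] || 0) + 1;
  });
  return { total: rows.length, counts, errors };
}

// Hypothetical sample: one valid row, one missing a field, one that is not JSON.
const sample = [
  '{"input":"Reset my password","expected_output":"Send reset link","difficulty":"easy","eval_dimensions":["accuracy"]}',
  '{"input":"Cancel and refund","difficulty":"medium","eval_dimensions":["accuracy","tone"]}',
  "not json",
].join("\n");

const validation = validateGoldenDataset(sample);
console.log(validation.errors);
```

Comparing `validation.counts` with the 5 easy / 10 typical / 5 hard target shows at a glance whether the difficulty mix is on track.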

LLM-as-judge eval framework — JavaScript

// Evaluation Framework — LLM-as-judge with bias mitigation
// Production: scoreResponse would call a judge model (Claude or GPT-4) via its API

const EVAL_RUBRIC = {
  dimensions: [
    { name: "accuracy", weight: 0.3, description: "Is the information correct and complete?" },
    { name: "tone", weight: 0.2, description: "Professional, empathetic, on-brand?" },
    { name: "actionability", weight: 0.3, description: "Clear next steps for the user?" },
    { name: "conciseness", weight: 0.2, description: "Appropriately brief, not verbose?" },
  ],
  passThreshold: 3.5,
  judgeModel: "claude-sonnet-4-6",
  biasMitigations: [
    "Randomize response order in A/B comparisons",
    "Use different model as judge than model being evaluated",
    "Calibrate judge scores against human ratings on 20% sample",
    "Watch for position bias (judge prefers first response)",
    "Watch for sycophancy bias (judge prefers agreeable responses)"
  ]
};

// Golden dataset (20 examples in production)
const goldenDataset = [
  {
    id: "test_001", difficulty: "easy",
    input: "I can't log in and my password reset isn't working",
    expected_min_score: 4.0,
    model_output: "I'm sorry you're having trouble. Please try clearing your browser cache, then attempt the reset again. If that doesn't work, I'll send a manual reset link to your email."
  },
  {
    id: "test_002", difficulty: "medium",
    input: "How do I cancel? Also I want a refund for this month.",
    expected_min_score: 3.5,
    model_output: "I can help with both. For cancellation: Settings > Billing > Cancel. Your access continues until the billing period ends. For the refund, I've flagged your account — our billing team will review within 24 hours."
  },
  {
    id: "test_003", difficulty: "hard",
    input: "Your product is terrible and I'm going to post negative reviews everywhere",
    expected_min_score: 3.0,
    model_output: "I hear your frustration and I want to help resolve this. Can you tell me specifically what's not working? I'll either fix it directly or escalate to our engineering team immediately."
  }
];

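// Simulated eval scoring. In production, replace the random scores with judge-model ratings.
function scoreResponse(testCase) {
  const scores = {};
  EVAL_RUBRIC.dimensions.forEach(d => {
    scores[d.name] = 3 + Math.random() * 2; // Simulated 3-5 range
  });
  const rawAvg = EVAL_RUBRIC.dimensions.reduce((sum, d) => sum + scores[d.name] * d.weight, 0);
  // Round before comparing so the printed score and the PASS/FAIL status always agree.
  const weightedAvg = Math.round(rawAvg * 100) / 100;
  // Use the per-case minimum when present; otherwise fall back to the rubric-wide threshold.
  const minScore = testCase.expected_min_score ?? EVAL_RUBRIC.passThreshold;
  return { scores, weightedAvg, passed: weightedAvg >= minScore };
}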

console.log("EVAL FRAMEWORK — LLM-as-Judge with Bias Mitigation");
console.log("=".repeat(60));
console.log("Judge model: " + EVAL_RUBRIC.judgeModel);
console.log("Pass threshold: " + EVAL_RUBRIC.passThreshold + "/5.0");
console.log("Dimensions: " + EVAL_RUBRIC.dimensions.map(d => d.name + " (" + (d.weight * 100) + "%)").join(", "));

console.log("\nBias mitigations:");
EVAL_RUBRIC.biasMitigations.forEach(m => console.log("  - " + m));

console.log("\nRunning eval on golden dataset...");
console.log("-".repeat(60));
let passed = 0;
goldenDataset.forEach(tc => {
  const result = scoreResponse(tc);
  const status = result.passed ? "PASS" : "FAIL";
  console.log(tc.id + " [" + tc.difficulty + "] | Score: " + result.weightedAvg + "/5.0 | " + status);
  if (result.passed) passed++;
});
console.log("\nResults: " + passed + "/" + goldenDataset.length + " passed");

console.log("\n" + "=".repeat(60));
console.log("PUBLIC BENCHMARK LANDSCAPE (2026)");
console.log("MMLU-Pro: Advanced reasoning benchmark (supersedes MMLU)");
console.log("GPQA: Expert-level science questions (Claude strong)");
console.log("HLE: Humanity's Last Exam (among the hardest public benchmarks, Jan 2025)");
console.log("SWE-bench: Coding agent benchmark (Claude Code tracked)");
console.log("Chatbot Arena: Human preference (LMSYS)");
console.log("Artificial Analysis: Real-time quality/price/speed comparisons");

Interview question

How would you set up an evaluation framework for an AI feature before launch?

Three stages: definition, implementation, automation.

Definition: Work with domain experts to define 3-5 quality dimensions with explicit rubrics (1-5 scoring). Then build a golden dataset: 50-100 queries representing the real input distribution, including 20% hard edge cases. Have 2-3 experts independently label each. Where they disagree, resolve the disagreement — that’s your most valuable eval data. The PM owns this definition. If the rubric is vague, the eval is useless.
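
To surface the cases where experts disagree, one simple approach is to rank items by the spread between their highest and lowest labels. This is a sketch only: the `labels` shape, the function name `findDisagreements`, the sample scores, and the `maxSpread` default of 1 are assumptions.

```javascript
// Find golden-dataset items where independent expert labels diverge.
// `labels` maps item id -> array of 1-5 scores, one per expert (hypothetical shape).
function findDisagreements(labels, maxSpread = 1) {
  return Object.entries(labels)
    .map(([id, scores]) => ({ id, spread: Math.max(...scores) - Math.min(...scores) }))
    .filter((item) => item.spread > maxSpread)
    .sort((a, b) => b.spread - a.spread); // widest disagreement first
}

const expertLabels = {
  case_01: [5, 5, 4],
  case_02: [2, 4, 5],
  case_03: [3, 3, 3],
  case_04: [1, 3, 2],
};

console.log(findDisagreements(expertLabels));
// [ { id: 'case_02', spread: 3 }, { id: 'case_04', spread: 2 } ]
```

The flagged items are the ones to bring to a rubric review session, since they usually expose ambiguity in the scoring criteria.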

Implementation: Build the eval runner: for each test case, call the model, score with LLM-as-judge using your rubric, compare to threshold. Mitigate judge biases: randomize presentation order, use a different model as judge than the one being evaluated, and calibrate against human ratings on a 20% sample. Run the eval suite once before any changes to establish a baseline.
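
Calibration against human ratings can start with a few summary statistics over the sampled cases. The sketch below assumes each sampled case carries a judge score and a human score on the same 1-5 scale. The function name `calibrate`, the sample data, and the one-point agreement tolerance are hypothetical.

```javascript
// Compare judge scores with human ratings on the calibration sample.
function calibrate(pairs, tolerance = 1) {
  const diffs = pairs.map((p) => p.judge - p.human);
  const n = diffs.length;
  const meanAbsError = diffs.reduce((sum, d) => sum + Math.abs(d), 0) / n;
  const meanBias = diffs.reduce((sum, d) => sum + d, 0) / n; // > 0 means the judge is lenient
  const agreementRate = diffs.filter((d) => Math.abs(d) <= tolerance).length / n;
  return { meanAbsError, meanBias, agreementRate };
}

// Hypothetical calibration sample: judge vs. human score per test case.
const calibrationSample = [
  { id: "t1", judge: 4, human: 4 },
  { id: "t2", judge: 5, human: 3 },
  { id: "t3", judge: 4, human: 3 },
  { id: "t4", judge: 3, human: 4 },
];

console.log(calibrate(calibrationSample));
// { meanAbsError: 1, meanBias: 0.5, agreementRate: 0.75 }
```

A positive `meanBias` suggests the judge scores more generously than humans do, which is a signal to tighten the rubric wording or adjust the pass threshold.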

Automation: Integrate into CI/CD via GitHub Actions. Define three thresholds: auto-pass (ship), auto-fail (block), and human review (marginal). Set a 24-hour SLA for human review of marginal cases. Run on every model version change, every system prompt update, and every data pipeline change.
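
The three-threshold gate can be sketched as a small decision function that a CI step would call after the eval run. The cutoff values in `GATES` and the name `gateDecision` are illustrative assumptions; real cutoffs should come from your baseline scores.

```javascript
// Map an eval score (1-5 rubric scale) to one of three CI outcomes.
// Threshold values are illustrative assumptions, not recommendations.
const GATES = { autoPass: 4.0, autoFail: 3.5 };

function gateDecision(score, gates = GATES) {
  if (score >= gates.autoPass) return "auto-pass";  // ship
  if (score < gates.autoFail) return "auto-fail";   // block the PR
  return "human-review";                            // marginal band in between
}

console.log([4.2, 3.7, 3.1].map((score) => gateDecision(score)));
// [ 'auto-pass', 'human-review', 'auto-fail' ]
```

In a workflow, the "auto-fail" outcome would typically map to a non-zero exit code so the pipeline blocks the merge, while "human-review" would notify a reviewer and start the SLA clock.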

Eval platforms: Braintrust for experimentation-focused teams, LangSmith for LangChain-heavy stacks, Langfuse for teams that prefer open source. For competitive benchmarking, track the Artificial Analysis leaderboards weekly.

PM angle

Evals are the most underinvested capability in most AI teams. Engineers build infrastructure; PMs own "what does good look like." If you don’t write the rubric and curate the golden dataset, no one will, and you’ll discover quality problems in production. The PM who builds eval criteria before launch — not after complaints — ships better products.

Resources