
Day 13

The most-deployed AI architecture — build the retrieval layer that makes models know your data.

Context

Retrieval-Augmented Generation (RAG) solves the fundamental problem that language models don’t know your private data. At query time, retrieve relevant chunks from a vector database, inject them into the model’s context, and generate a response grounded in retrieved content. RAG is often preferred over fine-tuning for knowledge because it’s updateable (add new documents without retraining), auditable (trace which chunks grounded the answer), and cheaper to maintain. However, hybrid approaches — combining RAG with fine-tuning for domain-specific reasoning patterns — are increasingly common in 2025-2026 and can outperform either alone.

The core pipeline has four stages: Chunking (how you split documents: fixed-size, sentence-boundary, recursive with overlap, or structural based on document headings), Embedding (dense vector representations from a model such as OpenAI’s text-embedding-3-small, or models from Voyage AI or Cohere), Retrieval (find the top-k most similar chunks), and Generation (inject the chunks into the prompt and generate). The most common failure mode is poor retrieval, not poor generation: the model can only answer well if the right chunks are found. Hybrid search (semantic similarity plus BM25 keyword search) often outperforms pure semantic search, especially for domain-specific terminology.
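To make the "fixed-size with overlap" chunking option concrete, here is a minimal sketch. It assumes whitespace-separated words as a stand-in for tokens, and the function name chunkWithOverlap is hypothetical; a real pipeline would count tokens with the embedding model’s own tokenizer.

```javascript
// Fixed-size chunking with overlap. Words stand in for tokens here (an assumption);
// a production pipeline would count tokens with the embedding model's tokenizer.
function chunkWithOverlap(text, chunkSize, overlap) {
  if (overlap >= chunkSize) throw new Error("overlap must be smaller than chunkSize");
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = chunkSize - overlap; // how far each window advances
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // this window reached the end
  }
  return chunks;
}

// Ten placeholder words, windows of 4 with 1 word shared between neighbours.
const sample = Array.from({ length: 10 }, (_, i) => "w" + i).join(" ");
const chunks = chunkWithOverlap(sample, 4, 1);
chunks.forEach((c, i) => console.log(i + ": " + c));
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing and embedding some text twice.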

RAG architectures have evolved well beyond "basic RAG" by 2025-2026. Agentic RAG uses an agent that decides what to retrieve, how many times to search, and when it has enough context. GraphRAG (Microsoft, 2024) builds knowledge graphs from documents to support multi-hop reasoning on complex analytical questions. Self-RAG lets models decide when retrieval is needed and critique their own retrieval quality. Contextual retrieval, Anthropic’s own contribution, prepends a chunk-specific context summary before embedding. In Anthropic’s published research, contextual embeddings alone cut the top-20 retrieval failure rate by 35%, adding contextual BM25 raised the reduction to 49%, and adding a reranking step raised it to 67%. This last technique is directly implementable and citable.

Contextual retrieval in practice: Before embedding each chunk, generate a brief summary explaining the chunk’s context within the full document. Embed the summary + chunk together. At query time, the embedding captures document-level context that a bare chunk would miss. A chunk that says "The penalty is 5% per annum" is ambiguous. With context: "Section 4.2 of the Master Services Agreement, covering late payment terms: The penalty is 5% per annum" — now the embedding knows what "penalty" refers to. Source: Anthropic contextual retrieval research.

Tasks (4)

  1. Sketch a production RAG architecture (25 min)
    Build the full data pipeline for a knowledge base Q&A product on a company’s internal Confluence. Include: document ingestion, chunking strategy (with rationale), embedding model choice, vector DB choice, retrieval (include hybrid search), and generation (model + system prompt). What does the system do when no relevant content is found? Save as /day-13/rag_architecture_diagram.md.
  2. Chunk size experiment (25 min)
    Take a 2,000-word article or spec. Create chunks at 3 sizes: 256, 512, and 1024 tokens. For each: count the chunks, write 3 questions about the document, and assess which chunk size retrieves the most relevant content. What do you lose at smaller sizes? What do you gain? Save as /day-13/chunk_size_experiment.md.
  3. Implement contextual retrieval (25 min)
    Take 5 chunks from a document. For each, write a 1-2 sentence context summary that explains what the chunk contains in the context of the full document. Compare: would a search for "payment terms" find the right chunk with vs without context prepended? This is Anthropic’s published technique — cite it. Save as /day-13/contextual_retrieval_test.md.
  4. RAG evaluation design (25 min)
    Define 4 evaluation metrics for your RAG system: context precision (relevant chunks / total retrieved), context recall (relevant chunks retrieved / relevant chunks available), answer faithfulness (answer reflects retrieved context), and answer correctness. How would you measure each? Research the RAGAS framework as an open-source tool for this. Save as /day-13/rag_evaluation_design.md.
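For task 4, the two retrieval metrics can be computed by hand before reaching for a framework. The sketch below uses the simple ratio definitions given in the task; the function name contextMetrics and the labeled chunk ids are hypothetical, and a framework such as RAGAS may define its metrics differently (for example, by taking rank into account).

```javascript
// Context precision and recall for one query, given hand-labeled relevant chunk ids.
// Assumes retrievedIds contains no duplicates.
function contextMetrics(retrievedIds, relevantIds) {
  const relevant = new Set(relevantIds);
  const hits = retrievedIds.filter(id => relevant.has(id)).length;
  return {
    precision: retrievedIds.length ? hits / retrievedIds.length : 0,
    recall: relevant.size ? hits / relevant.size : 0,
  };
}

// Hypothetical labels: chunks 1 and 4 answer the question; the retriever returned 1, 3, 4, 5.
const result = contextMetrics([1, 3, 4, 5], [1, 4]);
console.log("precision=" + result.precision.toFixed(2) + " recall=" + result.recall.toFixed(2));
```

Here recall is perfect but half of the injected context is noise, which is the kind of trade-off a larger top-k tends to produce.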

RAG pipeline with contextual retrieval — JavaScript

// RAG Pipeline — including Anthropic's contextual retrieval technique
// Production: use real embeddings API (Voyage AI, OpenAI, or Cohere)

// === CONTEXTUAL RETRIEVAL (Anthropic's technique) ===
// Before embedding, prepend context to each chunk
const rawChunks = [
  { id: 1, text: "The penalty is 5% per annum on outstanding balances.", source: "contract.pdf", section: "4.2", topic: "late payment terms" },
  { id: 2, text: "Either party may terminate with 90 days written notice.", source: "contract.pdf", section: "7.1" },
  { id: 3, text: "The service tier includes 24/7 support with 4-hour SLA.", source: "contract.pdf", section: "3.3" },
  { id: 4, text: "Payment terms are Net 30 from invoice date.", source: "contract.pdf", section: "4.1" },
  { id: 5, text: "The provider maintains SOC 2 Type II certification.", source: "contract.pdf", section: "5.2" },
];

// Contextual retrieval: add document-level context to each chunk.
// Only section 4.2 carries a topic here (from the lesson's example). Without topic
// words, the prefix adds no query-relevant terms and the mock ranking cannot change.
function addContext(chunk) {
  const topic = chunk.topic ? ", covering " + chunk.topic : "";
  return {
    ...chunk,
    contextualText: "Section " + chunk.section + " of Master Services Agreement (" + chunk.source + ")" + topic + ": " + chunk.text,
    // In production: use Claude to generate richer context summaries
  };
}

const contextualChunks = rawChunks.map(addContext);

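// Mock embedding + similarity (production: use real embeddings API)
// Tokenize on letters, digits, and "%" so trailing punctuation ("terms?") does not block a match.
function tokenize(text) {
  return new Set(text.toLowerCase().match(/[a-z0-9%]+/g) || []);
}

// Fraction of distinct query words that also appear in the chunk text
function mockSimilarity(query, chunkText) {
  const queryWords = tokenize(query);
  const chunkWords = tokenize(chunkText);
  let overlap = 0;
  queryWords.forEach(w => { if (chunkWords.has(w)) overlap++; });
  return overlap / Math.max(queryWords.size, 1);
}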

function retrieve(query, chunks, topK) {
  return chunks
    .map(c => ({ ...c, score: mockSimilarity(query, c.contextualText || c.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Compare retrieval: with vs without context
const query = "What are the payment penalty terms?";

console.log("CONTEXTUAL RETRIEVAL COMPARISON");
console.log("Query: " + query);
console.log("=".repeat(55));

console.log("\nWITHOUT context (raw chunks):");
retrieve(query, rawChunks, 2).forEach((r, i) => {
  console.log("  " + (i+1) + ". [S" + r.section + "] " + r.text.slice(0, 60) + "... (score: " + r.score.toFixed(2) + ")");
});

console.log("\nWITH contextual retrieval:");
retrieve(query, contextualChunks, 2).forEach((r, i) => {
  console.log("  " + (i+1) + ". [S" + r.section + "] " + r.text.slice(0, 60) + "... (score: " + r.score.toFixed(2) + ")");
});

// RAG evaluation metrics
console.log("\n" + "=".repeat(55));
console.log("RAG EVALUATION METRICS (RAGAS framework)");
console.log("=".repeat(55));
const metrics = [
  { name: "Context Precision", desc: "Relevant chunks / Total retrieved chunks" },
  { name: "Context Recall", desc: "Relevant chunks found / All relevant chunks in corpus" },
  { name: "Answer Faithfulness", desc: "Answer claims supported by retrieved context" },
  { name: "Answer Correctness", desc: "Answer matches ground truth" },
];
metrics.forEach(m => console.log("  " + m.name + ": " + m.desc));

console.log("\nAdvanced RAG patterns (2025-2026):");
console.log("  Agentic RAG: Agent decides what/when/how many times to retrieve");
console.log("  GraphRAG: Knowledge graphs for multi-hop reasoning");
console.log("  Self-RAG: Model critiques its own retrieval quality");
console.log("  Contextual retrieval: Anthropic technique, up to 67% fewer retrieval failures (with contextual BM25 + reranking)");

Interview question

Design a RAG system for a legal firm’s document Q&A product. What are the key architectural decisions?

For legal RAG, four decisions differ from general implementations.

Chunking: Legal documents have logical sections (clauses, exhibits) that must stay together. I’d use structural chunking that respects heading boundaries, not fixed token sizes, because a clause split across chunks loses its legal meaning. Then apply Anthropic’s contextual retrieval: prepend each chunk with a context summary (e.g., "Section 4.2, Late Payment Terms of Master Services Agreement"). Their research reports up to 67% fewer retrieval failures when contextual embeddings are combined with contextual BM25 and reranking.

Retrieval: Legal queries use precise terminology ("non-compete clause," "force majeure"). Hybrid search — BM25 for exact legal terms plus dense semantic search — outperforms pure semantic search here. Add a reranking layer (Cohere Rerank or Voyage AI) to filter top-20 down to top-3 before injection.
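One common way to merge the BM25 and dense result lists is reciprocal rank fusion (RRF), sketched below. This is an illustrative choice, not the only fusion method; the chunk ids are hypothetical, and k = 60 is a widely used default that should be treated as tunable.

```javascript
// Reciprocal rank fusion (RRF): combine ranked id lists from different retrievers.
// Each list contributes 1 / (k + rank) to an id's score; k = 60 is a common default.
function reciprocalRankFusion(rankedLists, k = 60) {
  const scores = new Map();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

// Hypothetical rankings for one query (chunk ids, best first).
const bm25Ranking = ["c12", "c7", "c3"];
const denseRanking = ["c7", "c9", "c12"];
const fused = reciprocalRankFusion([bm25Ranking, denseRanking]);
fused.forEach(r => console.log(r.id + " " + r.score.toFixed(4)));
```

RRF works on ranks rather than raw scores, so BM25 and cosine-similarity values never need to be put on the same scale. The fused list would then go to the reranker.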

Citation: Lawyers must know exactly which clause grounded each answer. Build citation extraction into the system prompt and verify citations against retrieved chunks (don’t let the model hallucinate sources). Every answer should reference a specific document, section, and page.
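Citation verification can be a plain post-processing check. The sketch below assumes the system prompt asks the model to cite sections in a "[S4.2]" format; that format, the function name verifyCitations, and the sample answer are all hypothetical.

```javascript
// Check that every section the answer cites was actually retrieved.
// The "[S4.2]" citation format is an assumption made for this sketch.
function verifyCitations(answer, retrievedChunks) {
  const allowed = new Set(retrievedChunks.map(c => c.section));
  const cited = [...answer.matchAll(/\[S(\d+(?:\.\d+)*)\]/g)].map(m => m[1]);
  const unsupported = cited.filter(s => !allowed.has(s));
  // An answer with no citations at all also fails the check.
  return { cited, unsupported, ok: cited.length > 0 && unsupported.length === 0 };
}

const retrieved = [
  { section: "4.1", text: "Payment terms are Net 30 from invoice date." },
  { section: "4.2", text: "The penalty is 5% per annum on outstanding balances." },
];
const answer = "Invoices are due Net 30 [S4.1]; late balances accrue 5% per annum [S4.2]. Termination requires notice [S7.1].";
const check = verifyCitations(answer, retrieved);
console.log(check.ok, check.unsupported);
```

Section 7.1 was never retrieved, so the check fails and the answer can be regenerated or flagged instead of being shown to a lawyer. This only confirms the cited section was in context; whether the claim is actually supported by that section is a separate faithfulness check.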

Evaluation: Use the RAGAS framework to measure context precision, recall, faithfulness, and correctness. The golden dataset should include adversarial cases: questions where the answer spans multiple sections, questions with no answer in the corpus, and questions requiring cross-document reasoning.
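The "no answer in the corpus" case can be tested directly. The sketch below assumes a simple abstention rule: answer only if at least one chunk clears a relevance threshold. The function selectContext, the 0.3 threshold, and the golden cases with their scores are hypothetical placeholders; a real threshold would be calibrated on labeled queries.

```javascript
// Abstain when no retrieved chunk clears a relevance threshold.
// minScore = 0.3 is a placeholder assumption; calibrate it on labeled queries.
function selectContext(scoredChunks, minScore = 0.3, topK = 3) {
  const usable = scoredChunks
    .filter(c => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
  return { answerable: usable.length > 0, chunks: usable };
}

// Hypothetical golden cases: retrieval scores plus whether an answer exists in the corpus.
const goldenCases = [
  { question: "What is the late payment penalty?", scores: [0.67, 0.5, 0.17], hasAnswer: true },
  { question: "What is the office dress code?", scores: [0.17, 0.0, 0.0], hasAnswer: false },
];

const abstentionResults = goldenCases.map(g => {
  const scored = g.scores.map((score, i) => ({ id: i + 1, score }));
  const { answerable } = selectContext(scored);
  return { question: g.question, pass: answerable === g.hasAnswer };
});
abstentionResults.forEach(r => console.log((r.pass ? "PASS " : "FAIL ") + r.question));
```

Tracking the pass rate on unanswerable questions separately from answer correctness keeps a system that confidently answers everything from looking better than it is.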

PM angle

RAG is the most-deployed AI architecture in enterprise software. The failure mode is almost always poor retrieval, not poor generation. Anthropic’s contextual retrieval technique (prepend context summaries to chunks before embedding) is a directly implementable improvement: their research reports 35% fewer retrieval failures from contextual embeddings alone and up to 67% fewer when combined with contextual BM25 and reranking. Know it, cite it, recommend it.

Resources