Day 7

AI is no longer text-only — learn where vision, voice, and video create the highest-value product opportunities.

Context

Vision capabilities allow models to analyze images, diagrams, screenshots, and documents alongside text. Claude Vision and GPT-4o Vision can read screenshots, parse complex tables from PDFs, identify objects, and understand charts. For product managers, vision unlocks two categories: document understanding (process scanned PDFs, handwritten forms, complex layouts that resist text parsing) and visual analysis (quality inspection, content moderation, UI testing, accessibility review). The highest-value, highest-deployment multimodal use case in enterprise AI today is structured form extraction — W2s, invoices, medical forms, insurance claims. Claude Vision combined with structured JSON output handles this extremely well and is now a core production pattern.

A critical 2024 addition: native PDF support in the Claude API. You can now pass PDF files directly to the API rather than converting them to images first. This significantly simplifies document processing architecture — no more PDF-to-image conversion pipelines, better text extraction quality, and support for complex layouts with tables, charts, and mixed content. For any PM designing a document processing product, this capability removes an entire layer of infrastructure complexity. Source: Anthropic PDF support docs.

Voice has two distinct components: speech-to-text (Whisper, Deepgram, AssemblyAI) and text-to-speech (OpenAI TTS, ElevenLabs, Amazon Polly). The GPT-4o Realtime API is now production-available, combining speech understanding and generation with a conversational model in a single low-latency stream for real-time voice conversations. ElevenLabs has become a major player in voice generation with highly realistic voice cloning. The product challenge with voice remains error recovery: voice input is noisier than text, and users are less tolerant of misunderstandings than in chat UX.

Video understanding is now production-ready, not just a research frontier. Google’s Gemini 2.5 Pro can analyze hour-long videos natively within its 1M+ token context. Claude and GPT-4o process video as sequences of frames. Current production use cases: meeting summarization, manufacturing defect detection, sports analysis, security monitoring, and content moderation at scale. The "frontier" framing from 2024 is now outdated — video AI products are shipping.

Vision plus action = computer use. The natural evolution of vision capabilities leads to computer use (covered in depth on Day 25): vision enables an AI to read UIs and screens; computer use enables it to interact with them. This progression — from passive understanding to active interaction — is the trajectory that defines the agentic product landscape. Day 7 builds the foundation; Day 25 adds the action layer.

Tasks (4)

Process an invoice or form with Claude Vision (25 min)
Upload a real invoice, receipt, or form (redact any personal info) to Claude Vision. Ask it to: (1) extract all text, (2) return structured JSON with specific fields (vendor, amount, date, line items). Test with 3 different document formats. How accurate is the extraction? Where does it fail? What product would this enable? Save results as /day-07/vision_extraction_exercise.md.
Voice interface design (25 min)
Choose a product you know. Design a voice interface for one of its features. Include: system prompt, happy path conversation flow, misrecognition recovery, out-of-scope handling, and escalation to human. What does error recovery look like in voice vs text? Save as /day-07/voice_interface_spec.md.
Multimodal capability assessment (25 min)
For a product you would build: rank vision, voice, and video by (a) user value delivered, (b) current technical maturity and reliability, (c) competitive differentiation. Which capability would you ship first and why? Also consider: would native PDF support change your document processing architecture? Save as /day-07/multimodal_capability_ranking.md.
Vision use case research (25 min)
Find 3 products that have shipped vision-based AI features (not image generation — image understanding). Document each: problem solved, how vision is used, failure modes, and whether native PDF support would improve the architecture. Save as /day-07/vision_product_research.md.

Interview question

How would you decide whether to add voice, vision, or text capabilities to an AI product first?

I start with three questions: Where is the user’s primary input modality? What’s the latency tolerance? And what’s the failure recovery cost?

Vision first when the existing workflow involves documents or images. If users photograph forms and manually type data, vision-based extraction is high-value and low-risk. Anthropic’s native PDF support (no image conversion needed) simplifies this significantly. Structured form extraction — W2s, invoices, medical forms — is now the highest-deployment enterprise multimodal use case, and Claude Vision plus structured JSON output handles it well. The failure mode is localized: one document fails to parse, fallback to manual entry.

Voice first when users are mobile or hands-occupied and the interaction is conversational. The GPT-4o Realtime API makes sub-300ms voice conversations production-viable. But voice has higher failure recovery cost — a misheard command causes real errors, and users are less tolerant of voice errors than text errors.

Text first is almost always the right MVP. Cheapest, fastest to iterate, easiest to evaluate. Add vision or voice when you’ve validated the core product and identified a specific workflow where multimodal genuinely unlocks more value than improved text UX.

Video is now production-ready for specific use cases (meeting summarization, manufacturing QA, content moderation). It’s no longer "emerging" — the question is whether your specific use case has enough video content to justify the compute cost.

PM angle

The most undervalued multimodal capability in enterprise AI is document vision — specifically, structured form extraction from complex PDFs, invoices, and forms. Claude’s native PDF support eliminates the image conversion pipeline entirely. If your target market handles any paper-based or document-heavy workflows, this is your highest-ROI AI feature — and it’s production-ready today.

Resources

DOCS Claude Vision Guide — Image formats, size limits, and vision prompting best practices.
DOCS Claude PDF Support — Native PDF processing — no image conversion needed. Major architecture simplification.
DOCS OpenAI Realtime API — Production-available low-latency voice conversations. The reference for voice AI products.
DOCS Whisper API — Best general-purpose speech-to-text API. Also available open-source.
TOOL ElevenLabs — Leading voice generation and cloning. Major player in voice AI since 2024.
DOCS Gemini 2.5 Pro — Multimodal — 1M+ token context supports hour-long video analysis natively.