Apple ML System Design: Multi-modal RAG for Image+Text
Question Description
Design a multi-modal Retrieval-Augmented Generation (RAG) system that accepts hybrid queries containing text and images and returns grounded, concise answers. You’ll be expected to describe an end-to-end architecture that covers multi-modal input parsing, embedding generation, retrieval from a large knowledge base, and answer synthesis with an LLM.
Start by outlining a staged flow: 1) front-end ingestion and lightweight validation (image quality checks, text normalization), 2) modality-specific encoders (e.g., CLIP or a visual encoder for images, Transformer/BERT-family for text), 3) vectorization and indexing into a vector DB (FAISS/Milvus/Pinecone) with metadata, 4) multi-stage retrieval (coarse ANN search + hybrid scoring combining text heuristics/BM25 and cross-modal cosine similarity), 5) optional re-ranking with a cross-encoder, and 6) LLM-based synthesis that cites sources and returns provenance.
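The staged flow above can be sketched end to end in miniature. This is a toy illustration, not an implementation: `encode_text` is a hypothetical stand-in for a real encoder (e.g. CLIP or a BERT-family model), and the "index" is a brute-force list rather than a real ANN index such as FAISS or Milvus. The synthesis step just returns provenance where a production system would call an LLM grounded in the retrieved documents.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors (stage 4's coarse scoring).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def encode_text(text):
    # Stage 2 stand-in: hash characters into a fixed-size vector.
    # A real system would run a learned text or image encoder here.
    vec = [0.0] * 8
    for i, ch in enumerate(text.lower()):
        vec[i % 8] += ord(ch) % 13
    return vec

def retrieve(query_vec, index, k=3):
    # Stage 4: similarity search over indexed vectors (brute force here;
    # sharded ANN search in production).
    scored = [(cosine(query_vec, doc["vec"]), doc) for doc in index]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]

def synthesize(query, hits):
    # Stage 6 stand-in: return the answer scaffold with source provenance;
    # a real system would prompt an LLM with the retrieved passages.
    return {"query": query, "sources": [doc["id"] for _, doc in hits]}

# Stage 3: vectorize and index documents with metadata (ids).
index = [
    {"id": "doc-1", "vec": encode_text("golden retriever playing fetch")},
    {"id": "doc-2", "vec": encode_text("quarterly earnings report summary")},
]
hits = retrieve(encode_text("dog playing fetch"), index, k=1)
answer = synthesize("dog playing fetch", hits)
```

The point of the sketch is the separation of stages: each function maps to one numbered step, so encoders, the index backend, and the synthesis model can be swapped independently.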
You should demonstrate knowledge of low-latency techniques (sharded ANN, caching, asynchronous image preprocessing, batching, GPU/CPU separation), scalability (horizontal index partitioning, autoscaling, backpressure), and accuracy strategies (re-ranker, calibration, grounding to retrieved docs). Discuss robustness (fallbacks for blurry images, input sanitization), maintainability (modular model registry, CI for model updates), and cost-efficiency (quantized embeddings, dynamic model selection). Finally, explain monitoring and metrics you’d track (p95 latency, recall@k, precision, hallucination rate), and how you’d support KB updates (incremental indexing, versioning, and soft deletes).
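Two of the metrics named above (recall@k and p95 latency) are concrete enough to compute directly. A minimal sketch, with illustrative function names rather than any particular monitoring library's API:

```python
import math

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def p95_latency(latencies_ms):
    """95th-percentile latency using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]
```

In an interview it helps to note that recall@k is computed offline against labeled query sets, while p95 latency is tracked live per pipeline stage so a regression can be localized to retrieval, re-ranking, or synthesis.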
Common Follow-up Questions
- How would you design the hybrid scoring function to combine visual similarity and text relevance? Describe normalization, weighting, and tuning strategies.
- Explain a low-latency retrieval pipeline that meets a 2s end-to-end budget with <200ms retrieval steps under load. Where would you use caching, sharding, and async preprocessing?
- How would you evaluate and reduce hallucinations when the LLM synthesizes answers from retrieved multimodal documents? Describe metrics and mitigation techniques.
- Design the KB update workflow for millions of documents with minimal downtime: how do you handle re-indexing, versioning, and consistency across shards?
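For the first follow-up, one common answer shape is: min-max normalize each signal onto [0, 1] per query (BM25 scores and cosine similarities live on different scales), then blend with a tunable weight. A minimal sketch; `alpha` is an assumed knob that would be tuned offline against labeled relevance judgments:

```python
def min_max_normalize(scores):
    # Rescale a list of scores onto [0, 1] per query, so signals with
    # different ranges (BM25 vs. cosine) become comparable.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25_scores, visual_sims, alpha=0.6):
    """Blend text relevance and visual similarity per candidate.

    alpha weights the text signal; (1 - alpha) weights the visual signal.
    """
    text_n = min_max_normalize(bm25_scores)
    vis_n = min_max_normalize(visual_sims)
    return [alpha * t + (1 - alpha) * v for t, v in zip(text_n, vis_n)]
```

Per-query min-max normalization is simple but sensitive to outliers; z-score normalization or learned score calibration are the usual alternatives to mention, along with tuning `alpha` via grid search against recall@k or NDCG on a held-out query set.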