
Apple ML System Design: Multi-modal RAG for Image+Text

Topics:
Multimodal Retrieval
Retrieval-Augmented Generation
Vector Search & Embeddings
Roles:
Machine Learning Engineer
ML Systems Engineer
Research Engineer
Experience:
Mid Level
Senior
Staff

Question Description

Design a multi-modal Retrieval-Augmented Generation (RAG) system that accepts hybrid queries containing text and images and returns grounded, concise answers. You’ll be expected to describe an end-to-end architecture that covers multi-modal input parsing, embedding generation, retrieval from a large knowledge base, and answer synthesis with an LLM.

Start by outlining a staged flow:

1. Front-end ingestion and lightweight validation (image quality checks, text normalization)
2. Modality-specific encoders (e.g., CLIP or another visual encoder for images, a Transformer/BERT-family model for text)
3. Vectorization and indexing into a vector DB (FAISS/Milvus/Pinecone) with metadata
4. Multi-stage retrieval (coarse ANN search plus hybrid scoring that combines text heuristics/BM25 with cross-modal cosine similarity)
5. Optional re-ranking with a cross-encoder
6. LLM-based synthesis that cites sources and returns provenance
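As a rough illustration of stages 2–4 (not any particular production stack), the flow can be sketched with deterministic stand-in encoders and a brute-force cosine index in place of a real ANN service. `encode`, `VectorIndex`, and the document IDs are all hypothetical:

```python
import hashlib
import numpy as np

DIM = 256

def encode(item: str) -> np.ndarray:
    """Deterministic random projection standing in for a learned encoder
    (e.g., a CLIP image tower or a BERT-family text encoder)."""
    seed = int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

class VectorIndex:
    """Brute-force cosine index standing in for FAISS/Milvus ANN search."""
    def __init__(self):
        self.vectors, self.metadata = [], []

    def add(self, vec: np.ndarray, meta: dict) -> None:
        self.vectors.append(vec)
        self.metadata.append(meta)

    def search(self, query_vec: np.ndarray, k: int = 3):
        mat = np.stack(self.vectors)          # (N, DIM), rows unit-norm
        scores = mat @ query_vec              # cosine similarity
        top = np.argsort(-scores)[:k]
        return [(self.metadata[i], float(scores[i])) for i in top]

# Stage 3: index documents with provenance metadata.
index = VectorIndex()
for doc_id, text in [("kb-1", "battery life tips"),
                     ("kb-2", "camera night mode"),
                     ("kb-3", "display repair guide")]:
    index.add(encode(text), {"id": doc_id, "text": text})

# Stage 4: coarse retrieval for a hybrid (text + image) query.
# A real system would fuse the two query embeddings more carefully;
# here we simply average and renormalize.
q = encode("camera night mode") + encode("image:low-light-photo.jpg")
q /= np.linalg.norm(q)
hits = index.search(q, k=2)
print([h[0]["id"] for h in hits])
```

In an interview, the key point to call out is that the metadata attached at indexing time is what makes source citation and provenance possible at the synthesis stage.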

You should demonstrate knowledge of low-latency techniques (sharded ANN, caching, asynchronous image preprocessing, batching, GPU/CPU separation), scalability (horizontal index partitioning, autoscaling, backpressure), and accuracy strategies (re-ranker, calibration, grounding to retrieved docs). Discuss robustness (fallbacks for blurry images, input sanitization), maintainability (modular model registry, CI for model updates), and cost-efficiency (quantized embeddings, dynamic model selection). Finally, explain monitoring and metrics you’d track (p95 latency, recall@k, precision, hallucination rate), and how you’d support KB updates (incremental indexing, versioning, and soft deletes).
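Two of the metrics named above are easy to state precisely, which interviewers often ask for. A minimal sketch, assuming the hit-rate flavour of recall@k and numpy's default linear-interpolation percentile; the sample data is made up:

```python
import numpy as np

def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of queries whose top-k retrieved list contains at least
    one relevant document (hit-rate flavour of recall@k)."""
    hits = sum(1 for got, rel in zip(retrieved, relevant) if set(got[:k]) & rel)
    return hits / len(retrieved)

def p95_latency_ms(samples_ms: list) -> float:
    """p95 latency over a window of per-request timings."""
    return float(np.percentile(samples_ms, 95))

# Toy evaluation data: per-query ranked results and relevance labels.
retrieved = [["d1", "d7", "d3"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant  = [{"d3"}, {"d1"}, {"d5"}]
r3 = recall_at_k(retrieved, relevant, k=3)   # 2 of 3 queries have a hit
p95 = p95_latency_ms(list(range(1, 101)))
```

Hallucination rate has no closed-form counterpart; it is typically estimated by checking sampled answers against the retrieved evidence, either with human raters or an LLM judge.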

Common Follow-up Questions

  • How would you design the hybrid scoring function to combine visual similarity and text relevance? Describe normalization, weighting, and tuning strategies.
  • Explain a low-latency retrieval pipeline that meets a 2 s end-to-end budget with <200 ms for the retrieval steps under load. Where would you use caching, sharding, and async preprocessing?
  • How would you evaluate and reduce hallucinations when the LLM synthesizes answers from retrieved multimodal documents? Describe metrics and mitigation techniques.
  • Design the KB update workflow for millions of documents with minimal downtime: how do you handle re-indexing, versioning, and consistency across shards?
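For the hybrid-scoring follow-up above, one common baseline is min-max normalization of each signal followed by a weighted sum. A minimal sketch; the weight `alpha` and the sample scores are assumptions, and in practice `alpha` would be tuned on a validation set (grid search or learned per query type):

```python
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Map raw scores to [0, 1]. BM25 is unbounded and cosine similarity
    lives in [-1, 1], so both must be normalized before mixing."""
    lo, hi = x.min(), x.max()
    return np.zeros_like(x) if hi == lo else (x - lo) / (hi - lo)

def hybrid_score(bm25: np.ndarray, visual_cos: np.ndarray,
                 alpha: float = 0.6) -> np.ndarray:
    """alpha weights text relevance against cross-modal visual similarity."""
    return alpha * minmax(bm25) + (1 - alpha) * minmax(visual_cos)

bm25 = np.array([12.0, 3.5, 7.1])    # unbounded text-relevance scores
vis  = np.array([0.82, 0.40, 0.91])  # cross-modal cosine similarities
scores = hybrid_score(bm25, vis, alpha=0.6)
best = int(np.argmax(scores))
```

Min-max is rank-preserving per signal but sensitive to outliers in the candidate set; z-score normalization or learned calibration are common alternatives worth mentioning.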

Related Questions

1. Design a text-only RAG system for product QA: architecture, retrieval, and grounding techniques
2. How to build a scalable image search engine using vector embeddings and ANN indexes
3. Compare late fusion vs. joint cross-modal encoders for multimodal retrieval tasks
4. How to architect a low-cost embedding service for high-throughput image and text encoding

Multi-modal RAG System Design — Apple ML Interview | Voker