
Amazon ML System Design: Scalable RAG Q&A for Support

Topics: RAG Systems, Conversational AI, LLM Serving
Roles: Machine Learning Engineer, NLP Engineer
Experience: Mid Level, Senior, Staff

Question Description

You are asked to design a Retrieval-Augmented Generation (RAG) Q&A system for a large e-commerce customer support platform that answers natural-language queries by retrieving and grounding responses in a large, dynamic knowledge base (product docs, policies, FAQs).

Start by outlining an end-to-end pipeline: query preprocessing (spell-correction, intent detection, normalization), dense+keyword retrieval (embeddings + inverted index), optional reranking, and LLM-based response synthesis with citations. Include a session/context manager to maintain multi-turn state for both direct customer chat and agent-assist flows, and connectors to backend services (product catalog, order status) for real-time facts.
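The dense+keyword stage above needs a way to merge the two ranked result lists. Reciprocal rank fusion is one common, tuning-free option; a minimal sketch, where the doc ids and the k=60 damping constant are illustrative:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids (best first) into one
    ordering; k damps the contribution of lower-ranked hits."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]    # e.g. from FAISS/Milvus
keyword = ["doc_b", "doc_d", "doc_a"]  # e.g. from an inverted index
fused = reciprocal_rank_fusion([dense, keyword])
```

Because fusion works on ranks rather than raw scores, it sidesteps the problem that embedding similarities and BM25 scores live on incompatible scales.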

Design for the non-functional constraints up front: horizontal scaling (K8s, autoscaling groups), vector stores (FAISS/Milvus or managed vector DB), caching hot results, batching and async inference to control GPU utilization, and failover strategies to meet 99.9% uptime and 95th-percentile <200 ms latency targets. Plan cost controls via model routing (small models for simple queries, LLMs for complex ones), quantization, and token-level streaming.
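Model routing can be as simple as a heuristic gate in front of the serving layer that sends short, confidently-classified queries to a cheap model. The thresholds and model names below are illustrative assumptions, not tuned values:

```python
def route_model(query: str, intent_confidence: float) -> str:
    """Toy router: cheap model for short queries whose intent was
    classified with high confidence; large model for everything else."""
    if len(query.split()) <= 8 and intent_confidence >= 0.9:
        return "small-model"
    return "large-model"
```

In production the gate would typically be a learned classifier with an escalation path (cascade): if the small model's answer fails a confidence or grounding check, the request is retried on the large model.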

Flow/stages you should discuss: ingestion & embedding refresh, retrieval & reranking, prompt construction with provenance, LLM serving & fallback policies, response formatting & citation, logging/metrics and feedback loop for continual improvement.
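The "prompt construction with provenance" stage can be sketched as numbering each retrieved chunk so that citations in the generated answer map back to document ids; the instruction wording below is an illustrative assumption:

```python
def build_prompt(question, chunks):
    """chunks: list of (doc_id, text) pairs. Each passage is numbered
    so the model can cite [1], [2], ... and the numbers resolve to
    doc ids for citation rendering downstream."""
    sources = "\n".join(
        f"[{i}] ({doc_id}) {text}"
        for i, (doc_id, text) in enumerate(chunks, 1)
    )
    return (
        "Answer using ONLY the sources below and cite them as [n]. "
        "If the sources are insufficient, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```

Keeping the (number, doc_id) mapping outside the prompt lets the response formatter turn `[2]` into a link to the underlying policy or FAQ page.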

Skill signals: demonstrate knowledge of vector search and hybrid retrieval, prompt engineering and hallucination mitigation, scalable inference architectures, monitoring and SLOs, security/data privacy, and trade-offs (latency vs. accuracy vs. cost).

Common Follow-up Questions

  • How would you design the embedding refresh and re-indexing pipeline to support frequently changing documents while avoiding staleness and maintaining low latency?
  • What strategies would you use to reduce hallucinations and ensure factual grounding (e.g., retrieval augmentation, rerankers, verification, citation generation)?
  • How would you architect multi-model routing and model cascades to reduce cost while preserving accuracy under high QPS?
  • Describe approaches to secure PII and customer data in transit and at rest, and how you would implement redaction or masking in generated answers.

Related Questions

1. Design a high-throughput vector search service for millions of documents at 10k QPS
2. Design a low-latency LLM inference platform with batching, autoscaling, and model quantization
3. Build a feedback loop and active learning pipeline that improves retrieval and generation quality over time
4. Design a multi-turn conversational context manager that handles long histories and context window limits
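The last related question, a context manager under context window limits, reduces in its simplest form to keeping the most recent turns that fit a token budget. A sketch, with whitespace word count standing in for a real tokenizer:

```python
def trim_history(turns, budget, count_tokens=lambda s: len(s.split())):
    """Return the longest suffix of turns whose total token cost fits
    within budget; older turns are dropped first."""
    kept, used = [], 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```

Real systems usually combine this recency window with a running summary of the dropped turns, so long sessions keep their salient facts without consuming the whole context window.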


Amazon ML Interview: Scalable RAG Q&A System Design | Voker