Amazon ML System Design: Scalable RAG Q&A for Support
Question Description
You are asked to design a Retrieval-Augmented Generation (RAG) Q&A system for a large e-commerce customer support platform that answers natural-language queries by retrieving and grounding responses in a large, dynamic knowledge base (product docs, policies, FAQs).
Start by outlining an end-to-end pipeline: query preprocessing (spell-correction, intent detection, normalization), dense+keyword retrieval (embeddings + inverted index), optional reranking, and LLM-based response synthesis with citations. Include a session/context manager to maintain multi-turn state for both direct customer chat and agent-assist flows, and connectors to backend services (product catalog, order status) for real-time facts.
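The dense + keyword retrieval step above needs a way to merge two ranked lists into one. A minimal sketch, assuming Reciprocal Rank Fusion (RRF) as the fusion method; the document IDs and rankings are made-up illustrations:

```python
# Sketch: fuse a dense (embedding) ranking and a keyword (BM25/inverted-index)
# ranking with Reciprocal Rank Fusion. Doc IDs here are hypothetical.

def rrf_fuse(rankings, k=60):
    """Combine ranked lists of doc IDs into one fused ranking.

    rankings: list of lists, each ordered best-first.
    k: RRF smoothing constant (60 is the commonly cited default).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Earlier ranks contribute larger reciprocal scores.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: both retrievers agree the returns FAQ is most relevant.
dense = ["faq_returns", "policy_refund", "doc_shipping"]
bm25 = ["faq_returns", "doc_warranty", "policy_refund"]
fused = rrf_fuse([dense, bm25])
```

RRF is attractive here because it needs only ranks, not comparable scores, so dense cosine similarities and BM25 scores never have to be normalized against each other.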
Design for the non-functional constraints up front: horizontal scaling (K8s, autoscaling groups), vector stores (FAISS/Milvus or a managed vector DB), caching of hot results, batching and async inference to control GPU utilization, and failover strategies to meet a 99.9% uptime target and a 95th-percentile latency under 200 ms. Plan cost controls via model routing (small models for simple queries, larger LLMs for complex ones), quantization, and token-level streaming.
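The model-routing idea can be sketched with a simple heuristic router; the thresholds, marker words, and model names below are assumptions for illustration, not a prescribed policy (production routers often use a trained classifier instead):

```python
# Hypothetical cost-control sketch: send short, lookup-style queries to a
# small model and longer or reasoning-heavy queries to a large LLM.

COMPLEX_MARKERS = {"why", "compare", "explain", "difference", "troubleshoot"}

def route_model(query: str, max_simple_tokens: int = 12) -> str:
    tokens = query.lower().split()
    if len(tokens) > max_simple_tokens or COMPLEX_MARKERS & set(tokens):
        return "large-llm"      # more accurate, more expensive
    return "small-model"        # cheap, adequate for simple lookups

route_model("Where is my order?")                                   # simple lookup
route_model("Explain the difference between refund and replacement")  # complex
```

A cascade variant of the same idea runs the small model first and escalates to the large one only when the small model's confidence (or a verifier) flags the answer as weak.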
Flow/stages you should discuss: ingestion & embedding refresh, retrieval & reranking, prompt construction with provenance, LLM serving & fallback policies, response formatting & citation, logging/metrics and feedback loop for continual improvement.
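The "prompt construction with provenance" stage can be sketched as follows; the field names (`source`, `text`) and the instruction wording are assumptions, but the pattern of numbering chunks so the LLM can emit `[n]` citations is the standard one:

```python
# Sketch: build a grounded prompt where each retrieved chunk is numbered
# so the LLM can cite it inline as [1], [2], ...

def build_prompt(question, chunks):
    """chunks: list of dicts with 'source' and 'text' keys (assumed schema)."""
    context = "\n".join(
        f"[{i}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer using ONLY the sources below and cite them as [n].\n"
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(
    "What is the return window?",
    [{"source": "returns-policy.md",
      "text": "Items may be returned within 30 days of delivery."}],
)
```

Keeping the source identifier next to each chunk also makes the logging stage easier: the `[n]` markers in the generated answer can be mapped back to document IDs for citation rendering and for auditing hallucinations.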
Skill signals: demonstrate knowledge of vector search and hybrid retrieval, prompt engineering and hallucination mitigation, scalable inference architectures, monitoring and SLOs, security/data privacy, and trade-offs (latency vs. accuracy vs. cost).
Common Follow-up Questions
- How would you design the embedding refresh and re-indexing pipeline to support frequently changing documents while avoiding staleness and maintaining low latency?
- What strategies would you use to reduce hallucinations and ensure factual grounding (e.g., retrieval augmentation, rerankers, verification, citation generation)?
- How would you architect multi-model routing and model cascades to reduce cost while preserving accuracy under high QPS?
- Describe approaches to secure PII and customer data in transit and at rest, and how you would implement redaction or masking in generated answers.
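For the redaction follow-up, a minimal masking sketch is shown below; the regex patterns (an email matcher and an assumed order-ID format) are illustrative only and nowhere near exhaustive for real PII:

```python
import re

# Hypothetical redaction sketch: mask emails and order-number-like tokens
# in a generated answer before it is returned to the user.

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
ORDER_ID = re.compile(r"\b\d{3}-\d{7}-\d{7}\b")  # assumed order-ID format

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return ORDER_ID.sub("[ORDER_ID]", text)

redact("Contact jane.doe@example.com about order 123-4567890-1234567.")
```

In practice this would sit on both sides of the LLM: inputs are masked before they reach the prompt (so PII never enters logs or model context), and outputs are scanned again as a safety net.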