
Amazon ML System Design: Scalable RAG Q&A for Support

Topics: RAG Systems, Conversational AI, LLM Serving
Roles: Machine Learning Engineer, NLP Engineer
Experience: Mid Level, Senior, Staff

Question Description

You are asked to design a Retrieval-Augmented Generation (RAG) Q&A system for a large e-commerce customer support platform that answers natural-language queries by retrieving and grounding responses in a large, dynamic knowledge base (product docs, policies, FAQs).

Start by outlining an end-to-end pipeline: query preprocessing (spell-correction, intent detection, normalization), dense+keyword retrieval (embeddings + inverted index), optional reranking, and LLM-based response synthesis with citations. Include a session/context manager to maintain multi-turn state for both direct customer chat and agent-assist flows, and connectors to backend services (product catalog, order status) for real-time facts.
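The dense+keyword stage above needs a way to merge the two ranked result lists. Reciprocal rank fusion is one common, tuning-free option; a minimal sketch, where the doc ids and the k=60 damping constant are illustrative:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc ids (best first) into one
    ordering; k damps the contribution of lower-ranked hits."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]    # e.g. from FAISS/Milvus
keyword = ["doc_b", "doc_d", "doc_a"]  # e.g. from an inverted index
fused = reciprocal_rank_fusion([dense, keyword])
```

Because fusion works on ranks rather than raw scores, it sidesteps the problem that embedding similarities and BM25 scores live on incompatible scales.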

Design for the non-functional constraints up front: horizontal scaling (K8s, autoscaling groups), vector stores (FAISS/Milvus or managed vector DB), caching hot results, batching and async inference to control GPU utilization, and failover strategies to meet 99.9% uptime and 95th-percentile <200 ms latency targets. Plan cost controls via model routing (small models for simple queries, LLMs for complex ones), quantization, and token-level streaming.
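Model routing can be as simple as a heuristic gate in front of the serving layer that sends short, confidently-classified queries to a cheap model. The thresholds and model names below are illustrative assumptions, not tuned values:

```python
def route_model(query: str, intent_confidence: float) -> str:
    """Toy router: cheap model for short queries whose intent was
    classified with high confidence; large model for everything else."""
    if len(query.split()) <= 8 and intent_confidence >= 0.9:
        return "small-model"
    return "large-model"
```

In production the gate would typically be a learned classifier with an escalation path (cascade): if the small model's answer fails a confidence or grounding check, the request is retried on the large model.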

Flow/stages you should discuss: ingestion & embedding refresh, retrieval & reranking, prompt construction with provenance, LLM serving & fallback policies, response formatting & citation, logging/metrics and feedback loop for continual improvement.
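The "prompt construction with provenance" stage can be sketched as numbering each retrieved chunk so that citations in the generated answer map back to document ids; the instruction wording below is an illustrative assumption:

```python
def build_prompt(question, chunks):
    """chunks: list of (doc_id, text) pairs. Each passage is numbered
    so the model can cite [1], [2], ... and the numbers resolve to
    doc ids for citation rendering downstream."""
    sources = "\n".join(
        f"[{i}] ({doc_id}) {text}"
        for i, (doc_id, text) in enumerate(chunks, 1)
    )
    return (
        "Answer using ONLY the sources below and cite them as [n]. "
        "If the sources are insufficient, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```

Keeping the (number, doc_id) mapping outside the prompt lets the response formatter turn `[2]` into a link to the underlying policy or FAQ page.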

Skill signals: demonstrate knowledge of vector search and hybrid retrieval, prompt engineering and hallucination mitigation, scalable inference architectures, monitoring and SLOs, security/data privacy, and trade-offs (latency vs. accuracy vs. cost).

Common Follow-up Questions

  • How would you design the embedding refresh and re-indexing pipeline to support frequently changing documents while avoiding staleness and maintaining low latency?
  • What strategies would you use to reduce hallucinations and ensure factual grounding (e.g., retrieval augmentation, rerankers, verification, citation generation)?
  • How would you architect multi-model routing and model cascades to reduce cost while preserving accuracy under high QPS?
  • Describe approaches to secure PII and customer data in transit and at rest, and how you would implement redaction or masking in generated answers.

Related Questions

1. Design a high-throughput vector search service for millions of documents at 10k QPS
2. Design a low-latency LLM inference platform with batching, autoscaling, and model quantization
3. Build a feedback loop and active learning pipeline that improves retrieval and generation quality over time
4. Design a multi-turn conversational context manager that handles long histories and context window limits
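The last related question, a context manager under context window limits, reduces in its simplest form to keeping the most recent turns that fit a token budget. A sketch, with whitespace word count standing in for a real tokenizer:

```python
def trim_history(turns, budget, count_tokens=lambda s: len(s.split())):
    """Return the longest suffix of turns whose total token cost fits
    within budget; older turns are dropped first."""
    kept, used = [], 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))
```

Real systems usually combine this recency window with a running summary of the dropped turns, so long sessions keep their salient facts without consuming the whole context window.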


Amazon ML Interview: Scalable RAG Q&A System Design | Voker