OpenAI ML System Design: Scalable Enterprise RAG
Question Description
You are asked to design a Retrieval-Augmented Generation (RAG) system for a large enterprise to support internal document Q&A and a customer support chatbot. The system should ingest millions of confidential documents, perform semantic search over embeddings, and synthesize accurate, context-aware answers in real time while enforcing privacy and access controls.
Core content — what you'll be asked
- Document ingestion & indexing: automated parsing for PDFs, DOCX, TXT; text cleaning; chunking strategies; metadata extraction and embeddings generation; storing vectors and metadata in a scalable vector database.
- Query processing & retrieval: natural-language preprocessing, intent parsing, hybrid retrieval (semantic embeddings + keyword filters), ANN search (HNSW/IVF), re-ranking and top-k passage selection with >90% retrieval precision targets.
- Context & generation: maintaining conversational state for multi-turn queries, prompt construction with retrieved passages, LLM answer synthesis with source citations and hallucination mitigation.
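The chunking step above is a common point of discussion. A minimal sketch of fixed-size chunking with overlap (the function name and the `chunk_size`/`overlap` defaults are illustrative, not from the source; production systems often chunk on sentence or section boundaries instead):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    Overlap preserves context that would otherwise be lost at chunk
    boundaries; tune chunk_size/overlap per corpus and embedding model.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        # Stop once the window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk would then be embedded and stored in the vector database alongside its metadata (source document, section, access-control tags).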
High-level flow/stages
1. Ingest -> normalize -> chunk -> embed -> index.
2. Query -> preprocess -> retrieve top-k -> re-rank -> assemble context.
3. LLM synthesizes answer -> add citations & redact -> return.
4. Fall back/escalate to human agents if confidence is low.
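The "retrieve top-k" stage can be sketched as a cosine-similarity scan over the embedding index (toy version, pure Python; a real deployment would substitute an ANN index such as HNSW or IVF rather than this exhaustive loop, and the `retrieve_top_k` name is an assumption for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector) pairs.

    Returns the k chunks most similar to the query, highest first.
    """
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in index]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]
```

In the full flow, the returned chunk IDs would be re-ranked (e.g. by a cross-encoder) before being assembled into the LLM prompt.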
Skill signals you should demonstrate
You should show knowledge of:
- vector search engines and ANN algorithms;
- embedding/model selection;
- latency optimization (caching, sharding, batching);
- security (encryption, RBAC, audit logging);
- training pipelines for retrievers/generators (contrastive losses, supervised reranking, offline & online evaluation);
- metrics (precision@k, MRR, latency percentiles, user feedback loops).
Also discuss deployment concerns: autoscaling, cost trade-offs, model update strategies, and monitoring for drift and hallucinations.
Use concrete trade-offs and design choices; explain how they meet the non-functional requirements (scalability, low latency, accuracy, security, reliability, maintainability, and cost efficiency).
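Two of the retrieval metrics mentioned above, precision@k and MRR, are simple enough to define inline (a sketch; the function names are illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are in the relevant set."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mean_reciprocal_rank(results):
    """results: list of (retrieved_list, relevant_set) pairs, one per query.

    For each query, take 1/rank of the first relevant hit (0 if none),
    then average across queries.
    """
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(results)
```

A ">90% precision" target would typically be stated as precision@k over a labeled offline evaluation set, tracked continuously as the corpus and models change.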
Common Follow-up Questions
- How would you design the vector index and sharding strategy to support low-latency search across terabytes of embeddings and thousands of queries per second?
- Describe your approach to minimizing hallucinations: what retrieval- and generation-level techniques and evaluation metrics would you apply?
- How do you enforce fine-grained access control and data privacy when the vector DB contains confidential passages tied to user roles?
- What monitoring, A/B testing, and continuous training pipelines would you build to detect retrieval drift and maintain >90% precision over time?
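For the access-control question, one common pattern is to attach role metadata to each indexed chunk and filter candidates before ranking, so confidential text never reaches the prompt. A minimal sketch (field names `chunk_id`/`allowed_roles` and the function name are assumptions for illustration; many vector databases can push this filter into the index query itself):

```python
def filter_by_acl(candidates, user_roles):
    """candidates: list of dicts, each with 'chunk_id' and 'allowed_roles'.

    Keep only chunks whose allowed_roles intersect the user's roles.
    Applying this BEFORE re-ranking and prompt assembly ensures the LLM
    never sees passages the user is not entitled to.
    """
    user_roles = set(user_roles)
    return [c for c in candidates if user_roles & set(c["allowed_roles"])]
```

Pre-filtering can shrink the candidate pool and hurt recall for narrowly-permissioned users, so retrieval depth (the initial k) is often increased to compensate.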