
OpenAI ML System Design: Scalable Enterprise RAG

Topics:
Conversational AI
Retrieval-Augmented Generation
Online Inference
Roles:
Machine Learning Engineer
ML Systems Engineer
ML Research Engineer
Experience:
Mid Level
Senior
Staff

Question Description

You are asked to design a Retrieval-Augmented Generation (RAG) system for a large enterprise to support internal document Q&A and a customer support chatbot. The system should ingest millions of confidential documents, perform semantic search over embeddings, and synthesize accurate, context-aware answers in real time while enforcing privacy and access controls.

Core content — what you'll be asked

  • Document ingestion & indexing: automated parsing for PDFs, DOCX, TXT; text cleaning; chunking strategies; metadata extraction and embedding generation; storing vectors and metadata in a scalable vector database.
  • Query processing & retrieval: natural-language preprocessing, intent parsing, hybrid retrieval (semantic embeddings + keyword filters), ANN search (HNSW/IVF), re-ranking, and top-k passage selection targeting >90% retrieval precision.
  • Context & generation: maintaining conversational state for multi-turn queries, prompt construction with retrieved passages, and LLM answer synthesis with source citations and hallucination mitigation.
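The ingestion stage typically splits each document into overlapping windows before embedding, so that passage boundaries don't cut off relevant context. A minimal sketch of one common approach, fixed-size character chunking with overlap (the function name and default sizes here are illustrative assumptions, not a prescribed configuration):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows for embedding.

    Each chunk records its start offset so the original passage can be
    cited later. Real systems often chunk on sentence or token
    boundaries instead of raw characters.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append({"text": piece, "start": start})
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk would then be passed to an embedding model and written to the vector store along with its document ID, offset, and access-control metadata.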

High-level flow/stages

  1. Ingest -> normalize -> chunk -> embed -> index.
  2. Query -> preprocess -> retrieve top-k -> re-rank -> assemble context.
  3. LLM synthesizes answer -> cite sources & redact -> return.
  4. Fall back / escalate to human agents if confidence is low.
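The retrieval step in stage 2 can be sketched as a brute-force cosine-similarity scan over the index; a production system would replace the linear scan with an ANN structure (HNSW or IVF, as noted above), but the scoring and top-k logic are the same. All names here are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query_vec, index, k=3):
    """Score every indexed chunk against the query and keep the top k.

    `index` maps chunk IDs to embedding vectors. Brute force is O(N);
    ANN search trades a little recall for sublinear query time.
    """
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

The top-k candidates would then go to a cross-encoder re-ranker before the surviving passages are assembled into the LLM prompt.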

Skill signals you should demonstrate

You should show knowledge of vector search engines and ANN algorithms, embedding/model selection, latency optimization (caching, sharding, batching), security (encryption, RBAC, audit logging), training pipelines for retrievers/generators (contrastive losses, supervised reranking, offline & online eval), and metrics (precision@k, MRR, latency percentiles, user feedback loops). Also discuss deployment concerns: autoscaling, cost trade-offs, model update strategies, and monitoring for drift and hallucinations.
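Two of the offline retrieval metrics mentioned above, precision@k and MRR, are simple to compute once you have ranked results and ground-truth relevance labels. A minimal sketch (function names are illustrative):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are truly relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank over a batch of queries.

    For each query, take 1/rank of the first relevant hit (0 if none),
    then average across queries.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

Tracking these on a held-out query set, alongside latency percentiles and user-feedback signals, is how you would verify the >90% precision target over time.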

Use concrete trade-offs and design choices; explain how they meet the non-functional requirements (scalability, low latency, accuracy, security, reliability, maintainability, and cost efficiency).

Common Follow-up Questions

  • How would you design the vector index and sharding strategy to support low-latency search across terabytes of embeddings and thousands of queries per second?
  • Describe your approach to minimizing hallucinations: what retrieval- and generation-level techniques and evaluation metrics would you apply?
  • How do you enforce fine-grained access control and data privacy when the vector DB contains confidential passages tied to user roles?
  • What monitoring, A/B testing, and continuous training pipelines would you build to detect retrieval drift and maintain >90% precision over time?
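For the first follow-up, a common answer shards the index by a stable hash of the document ID and fans each query out to all shards, merging per-shard top-k lists into a global top-k. A hedged sketch of that routing and scatter-gather logic (the `search` method and `score` attribute on shard objects are assumptions about a hypothetical shard client, not a real API):

```python
import hashlib

def shard_for(doc_id, num_shards):
    """Stable hash-based routing of a document to a vector-index shard."""
    digest = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16)
    return digest % num_shards

def fan_out_query(query, shards, k):
    """Scatter the query to every shard, then merge per-shard top-k hits.

    In a real deployment the shard searches run in parallel; here they
    run sequentially for clarity. Each hit is assumed to expose a
    `score` attribute for the global merge.
    """
    partial = [hit for shard in shards for hit in shard.search(query, k)]
    partial.sort(key=lambda hit: hit.score, reverse=True)
    return partial[:k]
```

Hash routing keeps writes balanced without a central directory; the trade-off is that every query touches every shard, so per-shard latency (and tail latency in particular) bounds end-to-end search time.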

Related Questions

1. Design a scalable semantic search service for enterprise documents using embeddings and hybrid retrieval
2. Build a customer support chatbot that integrates real-time product data and escalates to agents
3. Design a training and evaluation pipeline for a dense retriever and a reranker at scale
4. How to optimize LLM inference latency and cost for real-time multi-turn conversational agents


RAG System Design Interview - OpenAI ML Engineer Guide | Voker