Uber ML System Design: Scalable AI Chatbot with History
Question Description
You are asked to design the backend for a scalable AI-powered chatbot that serves millions of daily users, supports multiple sessions per user, persists full chat histories, and returns model-generated replies with low latency.
Core content:
- Build a real-time messaging stack (API Gateway → Auth → Message Ingest) that accepts user messages over HTTP/WebSocket and enqueues them for processing. Use a durable message queue (Kafka, Pulsar) to decouple ingestion from inference and to persist events for reliability.
- Design an inference layer for online AI responses: a model-serving cluster (KFServing, Triton, or managed model endpoints) with autoscaling, request batching, and GPU/CPU tiering. Implement context retrieval (the last N messages, or embeddings retrieved from a vector DB) to supply the model with session history while keeping token usage and latency in check.
- Persist chat histories in a partitioned, highly available store (e.g., DynamoDB/Cassandra, with S3 for long-term archives) keyed per session with timestamps and metadata. Use strong or configurable consistency for reads/writes, idempotent writes, and optimistic concurrency for session updates.
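The idempotent-write and optimistic-concurrency points above can be sketched as a version-checked conditional append, analogous to a DynamoDB conditional put. The in-memory `SessionStore` here is a hypothetical stand-in for the real partitioned store:

```python
import time
from dataclasses import dataclass, field

@dataclass
class SessionRecord:
    version: int = 0
    messages: list = field(default_factory=list)

class SessionStore:
    """In-memory stand-in for a partitioned store (e.g. DynamoDB/Cassandra)."""

    def __init__(self):
        self._sessions = {}

    def append_message(self, session_id, message_id, text, expected_version):
        """Conditional append: succeeds only if the caller saw the latest
        version (optimistic concurrency) and the message_id is new
        (idempotency under producer retries)."""
        rec = self._sessions.setdefault(session_id, SessionRecord())
        if any(m["message_id"] == message_id for m in rec.messages):
            return True   # duplicate retry: already applied, report success
        if rec.version != expected_version:
            return False  # concurrent writer won; caller must re-read and retry
        rec.messages.append({"message_id": message_id, "text": text,
                             "ts": time.time()})
        rec.version += 1
        return True

store = SessionStore()
assert store.append_message("s1", "m1", "hi", expected_version=0)
assert store.append_message("s1", "m1", "hi", expected_version=0)       # retried write is a no-op
assert not store.append_message("s1", "m2", "bye", expected_version=0)  # stale version is rejected
assert store.append_message("s1", "m2", "bye", expected_version=1)
```

The same version check is what makes retries from the queue safe: a redelivered message either matches an existing `message_id` (absorbed) or fails the version condition (forcing a fresh read).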
Flow & stages:
- Client sends message → API gateway/auth → enqueue message.
- Worker retrieves message, reads session context, fetches embeddings if needed.
- Dispatch to model-serving endpoint (streaming responses supported) → write AI reply back to history store and publish to user via WebSocket/SSE.
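The worker stage in the flow above can be sketched end to end. The whitespace token counter, the 512-token budget, and `model_fn` are illustrative stubs, not the real tokenizer, limit, or serving endpoint:

```python
from collections import deque

MAX_CONTEXT_TOKENS = 512  # illustrative budget; real limits are model-specific

def count_tokens(text):
    # Crude stand-in for a real tokenizer: whitespace-delimited words.
    return len(text.split())

def build_context(history, budget=MAX_CONTEXT_TOKENS):
    """Walk the session history newest-first, keeping messages until the
    token budget is exhausted, then restore chronological order."""
    picked, used = deque(), 0
    for msg in reversed(history):
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        picked.appendleft(msg)
        used += cost
    return list(picked)

def handle_message(history, user_msg, model_fn):
    """One worker iteration: read session context, call the model, append
    the reply. In the real system the append is a durable write to the
    history store and the reply is also published over WebSocket/SSE."""
    context = build_context(history + [user_msg])
    reply = model_fn(context)           # model_fn stands in for the serving endpoint
    history.extend([user_msg, reply])
    return reply
```

Trimming newest-first keeps the most recent turns inside the budget, which is the simple "last N messages" strategy; swapping `build_context` for a vector-DB retrieval is the embedding variant.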
Skill signals:
You should demonstrate distributed-systems design (sharding, partitioning, replication), low-latency model deployment (batching vs. streaming), persistence strategies for chat history (consistency, compaction, archival), session management, monitoring/SLAs, and security (encryption, access control, privacy). Be prepared to justify trade-offs (cost, latency, consistency) and propose metrics, autoscaling knobs, and failure recovery strategies.
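As one concrete instance of the metrics mentioned above, a minimal sketch of a nearest-rank p99 latency check against a 2 s SLO, assuming per-request latency samples are collected in seconds:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (seconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def check_slo(latencies_s, p99_target_s=2.0):
    """Return (p99, breached) so an alerting pipeline can page on breach
    and an autoscaler can use the same signal as a scale-up trigger."""
    p99 = percentile(latencies_s, 99)
    return p99, p99 > p99_target_s
```

In practice this computation lives in the metrics backend over a sliding window; the point is that tail latency, not the mean, is the number to alert on.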
Common Follow-up Questions
- How would you design context retrieval for long conversations (10k+ tokens) while keeping inference latency under 2s?
- Describe caching and batching strategies that increase model-serving throughput without violating per-user session isolation.
- How would you guarantee exactly-once delivery and a consistent chat history under partial failures and retries?
- What monitoring, SLA, and alerting metrics would you instrument to ensure 99.9% availability and detect inference regressions?
- How would you modify the design to support multimodal inputs (images/audio) and multimodal model inference?
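For the exactly-once question, a standard answer is at-least-once delivery from the queue plus consumer-side deduplication ("effectively once"). A minimal sketch, with an in-memory seen-set standing in for dedup state that would really live in the durable history store:

```python
class DedupingConsumer:
    """Effectively-once processing: the queue redelivers on failure
    (at-least-once), and a dedup set keyed by message_id suppresses
    reprocessing. In production the seen-set is committed in the same
    durable write as the chat-history append, so dedup survives
    worker restarts."""

    def __init__(self, handler):
        self.handler = handler
        self._seen = set()

    def consume(self, message_id, payload):
        if message_id in self._seen:
            return False  # redelivery after a retry: skip side effects
        self.handler(payload)
        self._seen.add(message_id)  # atomic with the history write in practice
        return True
```

The key design choice is that deduplication and the side effect must commit together; a separate cache can lose the dedup record and reintroduce duplicates.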