Oracle System Design: Scalable Real-Time Chat System
Question Description
Problem overview
You are asked to design the backend for a scalable, low-latency real-time chat feature for a social platform with millions of daily active users. The system must support one-to-one direct messages and group chats (up to ~100–256 participants), provide message history, presence, read receipts, and push notifications while ensuring high availability, durability, and strong ordering guarantees.
What you'll be asked to do
Start by clarifying traffic and latency targets (messages/sec, peak concurrency, acceptable tail latency). Propose a high-level architecture that covers clients (WebSocket/gRPC), chat gateway servers, a message service, persistent storage, presence and read-receipt services, and push notification integration (APNs/FCM). Explain how messages flow from client → gateway → durable message queue/pub-sub → storage, how they fan out to recipients, and how acknowledgements and retries prevent message loss. Because retries give at-least-once delivery, also say where duplicates are detected and dropped.
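The acknowledgement-and-retry part of that flow can be sketched as follows. This is a minimal in-memory illustration, not a production design: `DurableLog`, `Gateway`, `send_with_retry`, and the `drop_acks` knob are all hypothetical names invented for the example, and a real system would put a replicated queue or log behind the gateway. The idea shown is that the gateway acknowledges only after the durable append succeeds, and the client retries with the same message ID so the log can drop duplicates.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class DurableLog:
    """In-memory stand-in for a durable queue or log (hypothetical)."""
    entries: list = field(default_factory=list)
    seen_ids: set = field(default_factory=set)

    def append(self, message_id: str, payload: str) -> bool:
        # Idempotent append: a retried message ID is not stored twice.
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.entries.append((message_id, payload))
        return True


class Gateway:
    """Acks a send only after the durable append has returned."""

    def __init__(self, log: DurableLog, drop_acks: int = 0):
        self.log = log
        self.drop_acks = drop_acks  # demo knob: number of acks to "lose"

    def handle_send(self, message_id: str, payload: str) -> str:
        self.log.append(message_id, payload)
        if self.drop_acks > 0:
            # Simulate an ack lost on the network after a successful write.
            self.drop_acks -= 1
            raise TimeoutError("ack lost")
        return "ACK"


def send_with_retry(gateway: Gateway, payload: str, max_attempts: int = 3):
    """Client side: reuse one message ID across retries until acked."""
    message_id = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        try:
            gateway.handle_send(message_id, payload)
            return message_id, attempt
        except (ConnectionError, TimeoutError):
            continue
    raise RuntimeError("message was never acknowledged")


log = DurableLog()
gateway = Gateway(log, drop_acks=1)
message_id, attempts = send_with_retry(gateway, "hello")
print(attempts, len(log.entries))
```

In this run the first ack is lost, so the client sends twice, yet the log holds the message once. In an interview you would also want to mention backoff between retries and how long the deduplication window is kept.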
Key design choices & trade-offs
You should discuss partitioning (sharding conversations by conversationID), ordering (per-conversation partitioning or per-group leader sequence numbers), fan-out strategies (push vs. pull), storage choices (an append-optimized log such as Kafka, plus a NoSQL store such as Cassandra or a Dynamo-style database for message history), caching (Redis for latest messages and presence), and cross-region replication for HA. Explain the failure modes, how you keep message ordering consistent within a group, and how you handle offline delivery, deduplication, and backpressure.
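As an illustration of the partitioning and ordering choices, the sketch below maps each conversation to a partition with a stable hash and assigns per-conversation sequence numbers from a single writer. The names `partition_for`, `PartitionSequencer`, and the partition count of 16 are assumptions made for this example; a real deployment would persist the counters and handle writer failover.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 16  # assumed value, chosen only for the example


def partition_for(conversation_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash so every message of a conversation hits the same partition."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionSequencer:
    """Single writer for one partition; hands out per-conversation sequence numbers."""

    def __init__(self):
        self._last_seq = defaultdict(int)

    def assign(self, conversation_id: str) -> int:
        self._last_seq[conversation_id] += 1
        return self._last_seq[conversation_id]


def order_for_display(messages: list) -> list:
    """Recipients sort by sequence number, not by arrival time."""
    return sorted(messages, key=lambda m: m["seq"])


sequencer = PartitionSequencer()
first = {"conversation": "group-42", "seq": sequencer.assign("group-42"), "body": "a"}
second = {"conversation": "group-42", "seq": sequencer.assign("group-42"), "body": "b"}
# Even if the network delivers them out of order, seq restores the order.
print([m["body"] for m in order_for_display([second, first])])
```

Because ordering is only needed within a conversation, partitions can be added or moved independently, which is what keeps the design horizontally scalable. The trade-off is that one very busy conversation is limited by the throughput of its single partition.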
Skills to demonstrate
You need solid distributed systems knowledge (message queues, pub/sub, consistency models), backend design (APIs, storage schemas), networking (WebSocket/push), reliability (replication, retries, monitoring), and security (TLS and optional end-to-end encryption). Be ready to sketch APIs, data models, and describe operational concerns (scaling, monitoring, and costs).
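One possible shape for the data model and a history-sync call is sketched below. It assumes a wide-column layout in which `conversation_id` acts as the partition key and `seq` as the ascending clustering key; the `Message` fields, `HistoryStore`, and `fetch_after` are hypothetical names for illustration, and the store is in-memory only.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    conversation_id: str  # partition key (assumed layout)
    seq: int              # clustering key, ascending within a conversation
    sender_id: str
    body: str
    sent_at_ms: int


class HistoryStore:
    """In-memory stand-in for the message-history table."""

    def __init__(self):
        self._rows = {}

    def append(self, msg: Message) -> None:
        rows = self._rows.setdefault(msg.conversation_id, [])
        if rows and msg.seq <= rows[-1].seq:
            raise ValueError("seq must increase within a conversation")
        rows.append(msg)

    def fetch_after(self, conversation_id: str, after_seq: int, limit: int = 50) -> list:
        """Sync API: return up to `limit` messages with seq greater than after_seq."""
        rows = self._rows.get(conversation_id, [])
        start = bisect_right([m.seq for m in rows], after_seq)
        return rows[start:start + limit]


store = HistoryStore()
for i in range(1, 6):
    store.append(Message("group-42", i, "user-1", f"msg {i}", 1_700_000_000_000 + i))

# A client that last saw seq 2 reconnects and asks for the next page.
page = store.fetch_after("group-42", after_seq=2, limit=2)
print([m.seq for m in page])
```

A cursor based on the last seen sequence number lets a reconnecting or newly added device resume where it left off without the server tracking per-device offsets.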
Common Follow-up Questions
- How would you guarantee strict message ordering in large group chats (100+ users) while preserving horizontal scalability?
- Describe how you would implement end-to-end encryption for messages and still support server-side features like search and read receipts.
- How do you design offline message delivery and syncing for mobile users who switch networks or devices?
- What monitoring, SLA, and operational playbooks would you put in place to meet 99.9% uptime and fast incident recovery?
- How would you optimize fan-out for supergroups to avoid write amplification and reduce latency?