Microsoft System Design: Distributed Key-Value Store & Cache
Question Description
You’re asked to design a distributed key-value system that acts both as a low-latency cache (sessions, profiles, product details) and as persistent storage for critical configuration data. The system will serve millions of global users across multiple data centers and must meet strict NFRs: high availability (99.9% uptime), horizontal scalability, low latency (<10ms typical), configurable consistency, durable persistence, and robust failure handling.
Core content
You should describe an architecture that includes: consistent hashing for automatic data partitioning and shard rebalancing; replication across nodes and across data centers for redundancy; a pluggable consistency model (strong vs eventual) using quorum reads/writes or consensus (Raft/Paxos) for strongly consistent items; cache layer features (TTL, LRU eviction, high cache-hit strategies); and persistence options (write-through, write-back, or async replication). Explain how you’ll handle keys up to 256 bytes and values up to 1MB, and how to avoid hot keys and size-related performance problems.
Flow / interview stages
Start with requirements and trade-offs, propose a high-level architecture, then dive into data partitioning, replication, consistency choices, failure detection and recovery, and operational concerns (monitoring, metrics, logging, and capacity planning). Be ready to sketch API semantics for get/put/delete/update and TTL behavior.
Skill signals
Demonstrate knowledge of distributed systems (consistent hashing, quorum, consensus), caching strategies (eviction policies, cache warming), replication and cross-DC design, failure modes and recovery, and performance optimizations (load balancing, batching, compression). Show you can reason about CAP trade-offs and give concrete mitigation strategies.
Common Follow-up Questions
- •How would you implement strong consistency across multiple data centers while keeping read latency low? Discuss quorum sizes, leader-based replication, and geo-replication trade-offs.
- •Design a strategy for cache eviction and TTL semantics that preserves durability guarantees for critical keys. How do you combine LRU eviction with write-back or write-through persistence?
- •How would you detect and recover from node failures and network partitions? Explain re-replication, hinted handoff, and the steps to rebalance shards with minimal downtime.
- •How do you prevent and mitigate hot keys and large-value throughput problems (values up to 1MB)? Discuss sharding, request coalescing, rate limiting, and value chunking.
- •If clients require multi-key atomic updates, how would you support transactions or distributed locks? Compare optimistic concurrency (CAS) vs distributed locking approaches.
Related Questions
Explore More Questions
Practice This Question with AI
Get real-time hints, detailed requirements, and insightful analysis of the question.