backend system design
Microsoft
Amazon
Google

Microsoft System Design: Distributed Key-Value Store & Cache

Question Description

You’re asked to design a distributed key-value system that acts both as a low-latency cache (sessions, profiles, product details) and as persistent storage for critical configuration data. The system will serve millions of global users across multiple data centers and must meet strict NFRs: high availability (99.9% uptime), horizontal scalability, low latency (<10ms typical), configurable consistency, durable persistence, and robust failure handling.

Core content

You should describe an architecture that includes: consistent hashing for automatic data partitioning and shard rebalancing; replication across nodes and across data centers for redundancy; a pluggable consistency model (strong vs eventual) using quorum reads/writes or consensus (Raft/Paxos) for strongly consistent items; cache layer features (TTL, LRU eviction, high cache-hit strategies); and persistence options (write-through, write-back, or async replication). Explain how you’ll handle keys up to 256 bytes and values up to 1MB, and how to avoid hot keys and size-related performance problems.

Flow / interview stages

Start with requirements and trade-offs, propose a high-level architecture, then dive into data partitioning, replication, consistency choices, failure detection and recovery, and operational concerns (monitoring, metrics, logging, and capacity planning). Be ready to sketch API semantics for get/put/delete/update and TTL behavior.

Skill signals

Demonstrate knowledge of distributed systems (consistent hashing, quorum, consensus), caching strategies (eviction policies, cache warming), replication and cross-DC design, failure modes and recovery, and performance optimizations (load balancing, batching, compression). Show you can reason about CAP trade-offs and give concrete mitigation strategies.

Common Follow-up Questions

  • How would you implement strong consistency across multiple data centers while keeping read latency low? Discuss quorum sizes, leader-based replication, and geo-replication trade-offs.
  • Design a strategy for cache eviction and TTL semantics that preserves durability guarantees for critical keys. How do you combine LRU eviction with write-back or write-through persistence?
  • How would you detect and recover from node failures and network partitions? Explain re-replication, hinted handoff, and the steps to rebalance shards with minimal downtime.
  • How do you prevent and mitigate hot keys and large-value throughput problems (values up to 1MB)? Discuss sharding, request coalescing, rate limiting, and value chunking.
  • If clients require multi-key atomic updates, how would you support transactions or distributed locks? Compare optimistic concurrency (CAS) vs distributed locking approaches.

Related Questions

1Snowflake System Design: Versioned Key-Value Store
2Apple System Design: Distributed Task Scheduler in Cloud
3Bloomberg System Design: Global News Aggregation & Trending
4eBay System Design: Concert Ticketing with Real-Time
5Netflix System Design: Real-Time Ad Impression Limiter
6Design a distributed caching layer for session management with TTL and failover
7How to design a globally distributed datastore with configurable consistency (Dynamo-style)
8Design a scalable consistent-hashing based sharding system and how to re-shard live
9Design a rate limiter and hotspot prevention mechanism for a high-throughput key-value store

Explore More Questions

Practice This Question with AI

Get real-time hints, detailed requirements, and insightful analysis of the question.

Distributed Key-Value Store Design - Microsoft Interview | Voker