backend system design
Microsoft
Amazon
Google

Microsoft System Design: Distributed Key-Value Store & Cache

Topics:
Key-Value Stores
Strong Consistency
Distributed Systems
Roles:
Software Engineer
Backend Engineer
Site Reliability Engineer
Experience:
Mid Level
Senior
Staff

Question Description

You’re asked to design a distributed key-value system that acts both as a low-latency cache (sessions, profiles, product details) and as persistent storage for critical configuration data. The system will serve millions of global users across multiple data centers and must meet strict NFRs: high availability (99.9% uptime), horizontal scalability, low latency (<10ms typical), configurable consistency, durable persistence, and robust failure handling.

Core content

You should describe an architecture that includes: consistent hashing for automatic data partitioning and shard rebalancing; replication across nodes and across data centers for redundancy; a pluggable consistency model (strong vs eventual) using quorum reads/writes or consensus (Raft/Paxos) for strongly consistent items; cache layer features (TTL, LRU eviction, high cache-hit strategies); and persistence options (write-through, write-back, or async replication). Explain how you’ll handle keys up to 256 bytes and values up to 1MB, and how to avoid hot keys and size-related performance problems.

Flow / interview stages

Start with requirements and trade-offs, propose a high-level architecture, then dive into data partitioning, replication, consistency choices, failure detection and recovery, and operational concerns (monitoring, metrics, logging, and capacity planning). Be ready to sketch API semantics for get/put/delete/update and TTL behavior.

Skill signals

Demonstrate knowledge of distributed systems (consistent hashing, quorum, consensus), caching strategies (eviction policies, cache warming), replication and cross-DC design, failure modes and recovery, and performance optimizations (load balancing, batching, compression). Show you can reason about CAP trade-offs and give concrete mitigation strategies.

Common Follow-up Questions

  • How would you implement strong consistency across multiple data centers while keeping read latency low? Discuss quorum sizes, leader-based replication, and geo-replication trade-offs.
  • Design a strategy for cache eviction and TTL semantics that preserves durability guarantees for critical keys. How do you combine LRU eviction with write-back or write-through persistence?
  • How would you detect and recover from node failures and network partitions? Explain re-replication, hinted handoff, and the steps to rebalance shards with minimal downtime.
  • How do you prevent and mitigate hot keys and large-value throughput problems (values up to 1MB)? Discuss sharding, request coalescing, rate limiting, and value chunking.
  • If clients require multi-key atomic updates, how would you support transactions or distributed locks? Compare optimistic concurrency (CAS) vs distributed locking approaches.

Related Questions

1Design a distributed caching layer for session management with TTL and failover
2How to design a globally distributed datastore with configurable consistency (Dynamo-style)
3Design a scalable consistent-hashing based sharding system and how to re-shard live
4Design a rate limiter and hotspot prevention mechanism for a high-throughput key-value store

Explore More Questions

Practice This Question with AI

Get real-time hints, detailed requirements, and insightful analysis of the question.

Distributed Key-Value Store Design - Microsoft Interview | Voker