WalmartLabs System Design: Multi-Region Notification Service
Question Description
You are asked to redesign a push notification service for a global social media platform so notifications are highly available and low-latency across regions. The current single-region design causes latency spikes and outages; you must propose a multi-region architecture that maintains 99.99% uptime, delivers notifications within ~200ms on average, and scales to billions of users and 100k notifications/sec.
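To make the 99.99% target concrete, it helps to turn it into a downtime budget. The sketch below is plain arithmetic; the function name `downtime_budget_minutes` and the choice of 365-day and 30-day windows are illustrative assumptions, not part of the question.

```python
def downtime_budget_minutes(availability: float, window_days: float) -> float:
    """Minutes of downtime allowed in a window for a given availability target."""
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - availability) * window_minutes


# Hypothetical windows: a calendar year and a 30-day month.
yearly = downtime_budget_minutes(0.9999, 365)   # about 52.56 minutes
monthly = downtime_budget_minutes(0.9999, 30)   # about 4.32 minutes
```

A budget of roughly four minutes per month leaves little room for manual intervention, which is one argument for automating regional failover.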
Start by clarifying requirements (delivery SLAs, acceptable ordering, duplicate tolerance, and traffic patterns). Then outline a high-level multi-region flow: ingest events, persist them to a durable queue, perform fan-out in the nearest region, and push to device gateways (APNs/FCM/web push). Explain cross-region concerns: user-to-region affinity (based on last-known location or geo-IP), asynchronous replication of critical metadata, and failover policies when a primary region is unhealthy.
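The affinity and failover policy described above can be sketched as a small routing function. This is a minimal illustration under stated assumptions: the `Region` type, the `pick_region` helper, the region names, and the latency table are all hypothetical, and a real system would read health and latency from its health checks rather than from in-memory dictionaries.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    healthy: bool


def pick_region(
    home: str,
    regions: dict[str, Region],
    latency_ms: dict[tuple[str, str], float],
) -> str:
    """Prefer the user's home region; otherwise fail over to the healthy
    region with the lowest measured latency from the home region."""
    home_region = regions.get(home)
    if home_region is not None and home_region.healthy:
        return home
    candidates = [r.name for r in regions.values() if r.healthy and r.name != home]
    if not candidates:
        raise RuntimeError("no healthy region available")
    # Regions with no latency measurement sort last.
    return min(candidates, key=lambda name: latency_ms.get((home, name), float("inf")))


# Hypothetical topology: us-east is down, so traffic moves to the closest healthy region.
regions = {
    "us-east": Region("us-east", healthy=False),
    "eu-west": Region("eu-west", healthy=True),
    "ap-south": Region("ap-south", healthy=True),
}
latency_ms = {("us-east", "eu-west"): 80.0, ("us-east", "ap-south"): 200.0}
chosen = pick_region("us-east", regions, latency_ms)
```

Choosing the failover target by latency from the home region keeps the user close to the replicated metadata they were already using, at the cost of concentrating load on one neighbor during an outage.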
You should demonstrate knowledge of distributed systems patterns: global load balancing (DNS/Anycast/GSLB), regional replicas with eventual consistency, geo-aware routing, durable message queues (Kafka/replicated log), idempotency and deduplication, rate limiting and backpressure, and integration with mobile push providers. Also cover monitoring and alerting: latency histograms, error budgets, regional health checks, and automated failover triggers.
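One common way to approach idempotency and deduplication is to derive a stable key per (event, user, device) and record it in a store with a time-to-live before sending. The sketch below uses an in-memory dictionary as a stand-in for a shared, replicated key store; `DedupStore`, `idempotency_key`, and `deliver_once` are hypothetical names, and the TTL value is an assumption.

```python
from __future__ import annotations

import hashlib
import time
from typing import Callable


class DedupStore:
    """In-memory stand-in for a shared TTL key store (assumption for this sketch)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry_by_key: dict[str, float] = {}

    def first_time(self, key: str) -> bool:
        """Return True and record the key if it has not been seen within the TTL."""
        now = self._clock()
        expiry = self._expiry_by_key.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expiry_by_key[key] = now + self._ttl
        return True

    def forget(self, key: str) -> None:
        self._expiry_by_key.pop(key, None)


def idempotency_key(event_id: str, user_id: str, device_id: str) -> str:
    """Stable key so any region computes the same value for the same delivery."""
    raw = f"{event_id}|{user_id}|{device_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def deliver_once(store: DedupStore, key: str, send: Callable[[], None]) -> bool:
    """Send only if the key is new; release the key if the send fails so a retry can succeed."""
    if not store.first_time(key):
        return False
    try:
        send()
    except Exception:
        store.forget(key)
        raise
    return True
```

With an eventually consistent store, two regions can still both see the key as new during a replication lag window, so this reduces duplicates rather than eliminating them. That residual risk is the usual reason to describe the design as at-least-once with deduplication.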
When answering, discuss trade-offs (consistency vs latency, cost vs durability), scaling strategies, and concrete failure scenarios (region outage, mobile-provider outage) with mitigation steps. This prepares you to explain design choices clearly during interviews.
Common Follow-up Questions
- How would you implement deduplication and idempotent delivery across regions to avoid duplicate push notifications?
- Describe exactly-once vs at-least-once delivery trade-offs for notifications and how you'd design for minimal duplicates and latency.
- If the primary region for a user fails, how would you detect it and perform automated regional failover while keeping latency below targets?
- How would you design throttling and backpressure when downstream push providers (APNs/FCM) are slow or rate-limited during a global spike?
- Explain cost and capacity planning: how many replicas, partitioning strategy, and autoscaling rules to handle 100k notifications/sec and 1B users?
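For the throttling and backpressure follow-up, one possible starting point is a token bucket in front of each push provider, with unsent work left in the queue rather than dropped. The sketch below is an assumption-laden illustration: `TokenBucket` and `drain` are hypothetical names, the rate and capacity are made-up numbers, and real provider limits would come from the provider's responses rather than a fixed constant.

```python
from __future__ import annotations

import time
from collections import deque
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate_per_sec` after that."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then take `cost` tokens if available."""
        now = self._clock()
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def drain(pending: deque, bucket: TokenBucket, send: Callable[[object], None]) -> int:
    """Send queued items while tokens remain; leftovers stay queued (backpressure)."""
    sent = 0
    while pending and bucket.try_acquire():
        send(pending.popleft())
        sent += 1
    return sent
```

Because `drain` leaves unsent notifications in `pending`, the queue depth itself becomes a signal: an upstream stage can watch it to slow ingestion, shed low-priority traffic, or trigger autoscaling.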