WalmartLabs System Design: Multi-Region Notification Service

Topics: Notification Systems, Multi-region Architecture, High Availability
Roles: Software Engineer, Backend Engineer, Platform Engineer
Experience: Mid Level, Senior, Staff

Question Description

You are asked to redesign a push notification service for a global social media platform so that notifications are highly available and low-latency across regions. The current single-region design causes latency spikes and outages. You must propose a multi-region architecture that maintains 99.99% uptime, delivers notifications with an average latency of about 200 ms, and scales to billions of users and 100k notifications/sec.
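It helps to turn the stated targets into concrete budgets before designing. The sketch below does that arithmetic. The function names are illustrative, and the year is assumed to be 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400 seconds


def downtime_budget_minutes(availability: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Allowed downtime, in minutes, for an availability target over a period."""
    return (1.0 - availability) * period_minutes


def daily_volume(rate_per_sec: int) -> int:
    """Notifications per day at a constant per-second rate."""
    return rate_per_sec * SECONDS_PER_DAY


if __name__ == "__main__":
    print(round(downtime_budget_minutes(0.9999), 2))             # minutes per year
    print(round(downtime_budget_minutes(0.9999, 30 * 24 * 60), 2))  # minutes per 30 days
    print(daily_volume(100_000))                                  # notifications per day
```

Under these assumptions, 99.99% uptime leaves roughly 52.6 minutes of downtime per year (about 4.3 minutes per 30 days), and a constant 100k notifications/sec is 8.64 billion notifications per day. Real traffic is bursty, so peak capacity would need headroom above this average.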

Start by clarifying requirements (delivery SLAs, acceptable ordering, duplicate tolerance, and traffic patterns). Then outline a high-level multi-region flow: ingest events, persist them to a durable queue, perform fan-out in the nearest region, and push to device gateways (APNs/FCM/web push). Explain cross-region concerns: user-to-region affinity (based on last-known location or geo-IP), asynchronous replication of critical metadata, and failover policies when a primary region is unhealthy.
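The affinity and failover policy described above can be sketched as a small routing function. This is a minimal illustration, not a prescribed design: the `Region` type, the region names, and the `rtt_ms` field (a measured round-trip time from the user's last-known location) are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    healthy: bool   # result of the latest regional health check (assumed input)
    rtt_ms: float   # hypothetical latency from the user's last-known location


def pick_region(home: str, regions: list[Region]) -> str:
    """Prefer the user's home region; otherwise fail over to the
    lowest-latency healthy region."""
    by_name = {r.name: r for r in regions}
    home_region = by_name.get(home)
    if home_region is not None and home_region.healthy:
        return home
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda r: r.rtt_ms).name


if __name__ == "__main__":
    regions = [
        Region("us-east", healthy=False, rtt_ms=20.0),
        Region("eu-west", healthy=True, rtt_ms=90.0),
        Region("ap-south", healthy=True, rtt_ms=210.0),
    ]
    print(pick_region("us-east", regions))  # home is unhealthy, so eu-west is chosen
```

Keeping the home region sticky while it is healthy avoids bouncing a user between regions, which would complicate deduplication. The trade-off is that a failover target may be farther away, so latency during an outage can exceed the normal target.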

You should demonstrate knowledge of distributed systems patterns: global load balancing (DNS/Anycast/GSLB), regional replicas with eventual consistency, geo-aware routing, durable message queues (Kafka/replicated log), idempotency and deduplication, rate limiting and backpressure, and integration with mobile push providers. Also cover monitoring and alerting: latency histograms, error budgets, regional health checks, and automated failover triggers.
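Idempotency and deduplication are often implemented by attaching an idempotency key to each notification and recording the key on first delivery. The sketch below shows the idea with an in-memory dictionary and a time-to-live. The class name, the key format, and the injectable clock are assumptions for illustration. A real deployment would more likely use a shared store with an atomic set-if-absent operation, and keys recorded in one region are only visible in another after replication.

```python
import time


class Deduplicator:
    """Remembers idempotency keys for a limited time (in-memory sketch)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen = {}  # idempotency key -> expiry timestamp

    def first_delivery(self, key: str) -> bool:
        """Return True if the key has not been seen within the TTL,
        and record it; return False for a duplicate."""
        now = self._clock()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False
        self._seen[key] = now + self._ttl
        return True


if __name__ == "__main__":
    dedup = Deduplicator(ttl_seconds=60.0)
    key = "user42:event9001"  # hypothetical key: user ID plus event ID
    print(dedup.first_delivery(key))  # first attempt is accepted
    print(dedup.first_delivery(key))  # retry within the TTL is rejected
```

The TTL bounds memory use but also bounds protection: a retry that arrives after the key expires would be delivered again, which is consistent with an at-least-once pipeline that aims for few duplicates rather than none.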

When answering, discuss trade-offs (consistency vs latency, cost vs durability), scaling strategies, and concrete failure scenarios (region outage, mobile-provider outage) with mitigation steps. This prepares you to explain design choices clearly during interviews.

Common Follow-up Questions

  • How would you implement deduplication and idempotent delivery across regions to avoid duplicate push notifications?
  • Describe exactly-once vs at-least-once delivery trade-offs for notifications and how you'd design for minimal duplicates and latency.
  • If the primary region for a user fails, how would you detect it and perform automated regional failover while keeping latency below targets?
  • How would you design throttling and backpressure when downstream push providers (APNs/FCM) are slow or rate-limited during a global spike?
  • Explain cost and capacity planning: how many replicas, what partitioning strategy, and which autoscaling rules would you use to handle 100k notifications/sec and 1B users?
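For the throttling question above, one common building block is a token bucket placed in front of each downstream provider. The sketch below is a minimal single-process version. The class name, the rate, and the capacity are illustrative assumptions, not published APNs or FCM limits.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=2.0, capacity=2.0)  # hypothetical limits
    results = [bucket.try_acquire() for _ in range(3)]
    print(results)  # the third call is rejected unless enough time has passed to refill
```

When `try_acquire` returns False, the sender can leave the message in the durable queue and retry later. That pushes backpressure upstream instead of dropping notifications or overwhelming a slow provider.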

Related Questions

1. Design a scalable push notification system for mobile and web clients
2. Design a global publish-subscribe messaging service for billions of users
3. How to design a geo-distributed message queue (Kafka/Raft) for low latency
4. Design rate limiting and spike protection for a notification pipeline
