NVIDIA System Design Interview: Distributed Rate Limiter

Topics: Distributed Systems, API Gateways, Rate Limiting
Roles: Software Engineer, Backend Engineer, Site Reliability Engineer
Experience: Mid Level, Senior, Staff

Question Description

You are asked to design a distributed rate limiter that integrates with an API gateway serving millions of users. The goal is to protect backend services from excessive traffic while keeping per-request latency low and allowing flexible, real-time configuration.

Core problem: for each incoming request, the gateway must evaluate identifiers (user ID, API key, IP address) against configurable rules and decide whether to allow, delay, or reject the request (HTTP 429 Too Many Requests). You must support at least two algorithms (e.g., token bucket and sliding window log), expose metrics and logs, and allow dynamic rule updates for different classes of users (free vs. premium).
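As a starting point, the token bucket algorithm is often the first one sketched in this interview. Below is a minimal single-node version in Python; the class and parameter names are illustrative, and a distributed deployment would keep the token state in a shared store rather than in process memory.

```python
import time

class TokenBucket:
    """Single-node token bucket sketch.

    capacity: maximum burst size (tokens the bucket can hold).
    refill_rate: tokens added per second (the sustained rate limit).
    """

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With capacity 2 and a refill rate of 1 token/second, two back-to-back requests are admitted as a burst and a third is rejected until tokens refill.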

How an interview typically flows:

  • Clarify scope: single region vs multi-region, per-user vs per-endpoint limits, soft vs hard limits.
  • Propose high-level architecture: components, data flow, and where the limiter plugs into the API gateway.
  • Drill into design choices: algorithm selection, distributed counters, state sharding, consistency/accuracy trade-offs, failure modes and fallbacks.
  • Walk through scaling and HA: horizontal scaling, replication, and latency budget.
  • Add observability and operational controls: metrics, dashboards, and runtime config rollout.
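To make the "where it plugs in" step concrete, the gateway integration is usually a thin middleware hook that consults the limiter before routing. A sketch, assuming a hypothetical `limiter` object exposing `allow(key) -> bool` (the interface and function names here are not from the original question):

```python
from http import HTTPStatus

def route_to_backend(client_id: str) -> str:
    # Stub standing in for the real upstream call.
    return f"ok:{client_id}"

def handle_request(limiter, client_id: str):
    """Gateway middleware hook: check the limiter before routing.

    `limiter` is any object with allow(key) -> bool.
    Returns (status, headers, body).
    """
    if not limiter.allow(client_id):
        # 429 plus Retry-After gives well-behaved clients a backoff hint.
        return (HTTPStatus.TOO_MANY_REQUESTS,
                {"Retry-After": "1"},
                "rate limit exceeded")
    return HTTPStatus.OK, {}, route_to_backend(client_id)
```

Keeping the limiter behind this one-method interface lets you swap algorithms (or fall back to a local limiter when the shared store is down) without touching routing code.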

Skill signals to demonstrate: distributed-systems reasoning (consensus and consistency trade-offs), data structures for counters, caching and persistence choices (e.g., Redis, consistent hashing), algorithm trade-offs (token bucket vs. sliding window log), and operational thinking around monitoring, throttling strategies, and dynamic config rollout.
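Of these signals, consistent hashing is the one most often asked to be sketched on the spot, since it decides which counter node owns a given key. A minimal ring with virtual nodes, in Python (node names and the 64-vnode default are illustrative assumptions):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring for sharding limiter state across counter nodes.

    Virtual nodes smooth the key distribution so adding or removing a
    node only remaps a small fraction of keys.
    """

    def __init__(self, nodes, vnodes: int = 64):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        # Stable 64-bit hash derived from MD5 (speed and uniformity,
        # not cryptographic strength, are what matter here).
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's hash.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]
```

The same key always maps to the same node, so all increments for `user:42` land on one shard and its counter stays accurate without cross-node coordination.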

Common Follow-up Questions

  • How would you implement global (cross-region) rate limits while keeping per-request latency under 10ms?
  • Compare token bucket and sliding window log in terms of memory use, accuracy, and burst handling—when would you pick each?
  • If your primary counter store (e.g., Redis cluster) becomes unavailable, what fallback strategies ensure continued enforcement without allowing limit bypass?
  • How would you support complex, hierarchical rules (global, per-tenant, per-endpoint) and evaluate them efficiently at high QPS?
  • Describe how you would roll out dynamic limit changes safely (canary, feature flags) and ensure consistency during transitions.
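For the token-bucket vs. sliding-window-log comparison above, it helps to have the second algorithm in hand too. A single-node sketch (names are illustrative): the log is exact within its window but costs O(limit) memory per key, versus O(1) for a token bucket.

```python
import time
from collections import deque

class SlidingWindowLog:
    """Single-node sliding-window-log limiter.

    Keeps a timestamp per admitted request; exact within the window,
    at the cost of O(limit) memory per tracked key.
    """

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have aged out of the window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) < self.limit:
            self.events.append(now)
            return True
        return False
```

This makes the trade-off concrete: the log never over- or under-counts inside the window, while the token bucket uses constant memory and tolerates bursts up to its capacity.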

Related Questions

1. Design an API Gateway that supports authentication, routing, and throttling at scale
2. Implement a distributed counter for high-throughput event tracking (Redis vs. CRDT approaches)
3. Design a quota management and billing system for tiered API access
4. Design load-shedding and graceful degradation mechanisms for overloaded services
5. Design a global rate-limiting service with eventual consistency and bounded error guarantees
