Stripe System Design: Scalable Real-Time Logs & Metrics
Question Description
You are asked to design a unified, scalable system for collecting, processing, and analyzing both structured logs and time-series metrics across a global cloud service that produces millions of events per second.
Core task: define ingestion APIs (HTTP/gRPC/TCP), a durable, partitioned transport layer (e.g., Kafka or a cloud equivalent), real-time stream processing for enrichment and aggregation (windowed percentiles, counts, anomaly detection), and a storage strategy that separates a hot tier (recent data, fast queries) from a cold tier (cost-efficient historical data). Consider schema flexibility for JSON logs, high-cardinality metrics, retention policies, and configurable aggregation intervals.
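As a concrete illustration of the windowed-aggregation piece, here is a minimal in-memory sketch: it buckets `(timestamp_ms, latency_ms)` events into fixed tumbling windows and computes a count and nearest-rank p99 per window. The function name and event shape are assumptions for illustration; a production system would use a streaming framework with incremental state rather than sorting whole windows.

```python
import math
from collections import defaultdict

def tumbling_window_aggregate(events, window_ms=60_000):
    """Group (timestamp_ms, latency_ms) events into fixed tumbling windows
    and compute count and p99 latency per window. In-memory sketch only:
    a real stream processor would maintain incremental per-window state."""
    windows = defaultdict(list)
    for ts, latency in events:
        # window start = timestamp rounded down to the window boundary
        windows[ts - ts % window_ms].append(latency)
    out = {}
    for start, latencies in windows.items():
        latencies.sort()
        # nearest-rank percentile: index = ceil(0.99 * n) - 1
        idx = max(0, math.ceil(0.99 * len(latencies)) - 1)
        out[start] = {"count": len(latencies), "p99": latencies[idx]}
    return out
```

The same shape generalizes to sliding or session windows; the key trade-off is that exact percentiles require keeping raw values per window, which is why large deployments substitute mergeable sketches (t-digest, DDSketch).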
Interview flow: you’ll typically walk through end-to-end data flow (clients → ingestion → broker → stream processors → hot/cold storage → query/alerting), discuss data partitioning, state management (checkpoints, exactly-once vs at-least-once), and failure modes (retries, buffering, backpressure). Expect to sketch components, APIs, scaling strategies, and trade-offs for latency, durability, and cost.
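Two of the mechanics above can be sketched briefly. First, key-based partitioning keeps all events for one key on one partition so per-key ordering survives the broker hop; second, an at-least-once consumer loop checkpoints its offset only after processing succeeds, so a crash replays (rather than loses) records. Function names here are illustrative, not from any specific client library.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same
    partition, preserving per-key ordering through the broker."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def consume_at_least_once(records, process, checkpoint):
    """At-least-once loop: process first, checkpoint after. A crash between
    the two steps replays records from the last checkpoint, so duplicates
    are possible and downstream aggregation must be idempotent."""
    for offset, record in records:
        process(record)
        checkpoint(offset)  # persisted only after a successful process
```

Flipping the order (checkpoint before process) yields at-most-once instead; exactly-once requires making the process-and-checkpoint pair atomic, e.g., via transactional writes.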
Skill signals: demonstrate knowledge of distributed systems, stream processing frameworks (Flink, Kafka Streams), time-series storage patterns (TSDBs, column stores, object store cold tier), cardinality reduction (sketches, downsampling), observability, and operational concerns (multi-region replication, SLA targets, capacity planning). You should justify design choices with performance, availability, and cost trade-offs.
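To make the "sketches" point concrete, here is a minimal count-min sketch: a fixed-memory structure that approximates per-key counts no matter how many distinct keys (e.g., `user_id` values) arrive. This is a generic textbook sketch, not tied to any particular TSDB; the sizing defaults are arbitrary.

```python
import hashlib

class CountMinSketch:
    """Fixed-memory approximate counter. Memory is O(width * depth)
    regardless of key cardinality; estimates can only overcount."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # one independent hash per row, derived by salting with the row id
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), digest_size=8,
                                salt=row.to_bytes(8, "big"))
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, key, count=1):
        for row, col in self._indexes(key):
            self.table[row][col] += count

    def estimate(self, key):
        # min over rows bounds the overcount from hash collisions
        return min(self.table[row][col] for row, col in self._indexes(key))
```

The same mergeability property (two sketches combine by element-wise addition) is what makes these structures fit distributed aggregation and downsampling pipelines.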
Common Follow-up Questions
- How would you handle very high-cardinality dimensions (e.g., user_id) for metrics while keeping storage and query costs manageable?
- Describe how you would implement exactly-once semantics for metric aggregations across failures and rebalances.
- How do you design the system to meet sub-second processing for critical metrics while keeping cold storage cheap?
- What backpressure and client-side buffering strategies would you use to avoid data loss during traffic spikes?
- How would you support schema evolution, PII redaction, and dynamic configuration of retention and aggregation without downtime?
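For the backpressure question, one common client-side answer is a bounded buffer with an explicit drop policy: when the broker is slow or unreachable, evict the oldest events rather than grow memory without bound, and count the drops so the loss is observable. This is a hypothetical sketch of that policy, not a real client library API.

```python
from collections import deque

class BoundedBuffer:
    """Client-side event buffer with a hard capacity. On overflow the
    oldest events are dropped, favoring fresh data and bounded memory
    over completeness; drops are counted so loss is observable."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)  # deque evicts oldest on overflow
        self.dropped = 0

    def offer(self, event):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1  # record the eviction before it happens
        self.buf.append(event)

    def drain(self, batch_size=500):
        """Pop up to batch_size events for the next send attempt."""
        batch = []
        while self.buf and len(batch) < batch_size:
            batch.append(self.buf.popleft())
        return batch
```

The alternative policies worth contrasting in an interview are drop-newest (keep history, lose fresh data) and blocking the producer (no loss, but backpressure propagates into the application).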