Stripe System Design: Scalable Real-Time Logs & Metrics
Question Description
You are asked to design a unified, scalable system for collecting, processing, and analyzing both structured logs and time-series metrics across a global cloud service that produces millions of events per second.
Core task: define ingestion APIs (HTTP/gRPC/TCP), a durable, partitioned transport layer (e.g., Kafka or a cloud equivalent), real-time stream processing for enrichment and aggregation (windowed percentiles, counts, anomaly detection), and a storage strategy that separates a hot tier (recent data, fast queries) from a cold tier (cost-efficient historical data). Consider schema flexibility for JSON logs, high-cardinality metrics, retention policies, and configurable aggregation intervals.
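As a concrete illustration of the windowed-aggregation piece, here is a minimal in-memory sketch: it buckets `(timestamp_ms, latency_ms)` events into fixed tumbling windows and computes a count and nearest-rank p99 per window. The function name and event shape are assumptions for illustration; a production system would use a streaming framework with incremental state rather than sorting whole windows.

```python
import math
from collections import defaultdict

def tumbling_window_aggregate(events, window_ms=60_000):
    """Group (timestamp_ms, latency_ms) events into fixed tumbling windows
    and compute count and p99 latency per window. In-memory sketch only:
    a real stream processor would maintain incremental per-window state."""
    windows = defaultdict(list)
    for ts, latency in events:
        # window start = timestamp rounded down to the window boundary
        windows[ts - ts % window_ms].append(latency)
    out = {}
    for start, latencies in windows.items():
        latencies.sort()
        # nearest-rank percentile: index = ceil(0.99 * n) - 1
        idx = max(0, math.ceil(0.99 * len(latencies)) - 1)
        out[start] = {"count": len(latencies), "p99": latencies[idx]}
    return out
```

The same shape generalizes to sliding or session windows; the key trade-off is that exact percentiles require keeping raw values per window, which is why large deployments substitute mergeable sketches (t-digest, DDSketch).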
Interview flow: you’ll typically walk through end-to-end data flow (clients → ingestion → broker → stream processors → hot/cold storage → query/alerting), discuss data partitioning, state management (checkpoints, exactly-once vs at-least-once), and failure modes (retries, buffering, backpressure). Expect to sketch components, APIs, scaling strategies, and trade-offs for latency, durability, and cost.
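Two of the mechanics above can be sketched briefly. First, key-based partitioning keeps all events for one key on one partition so per-key ordering survives the broker hop; second, an at-least-once consumer loop checkpoints its offset only after processing succeeds, so a crash replays (rather than loses) records. Function names here are illustrative, not from any specific client library.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning: the same key always maps to the same
    partition, preserving per-key ordering through the broker."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def consume_at_least_once(records, process, checkpoint):
    """At-least-once loop: process first, checkpoint after. A crash between
    the two steps replays records from the last checkpoint, so duplicates
    are possible and downstream aggregation must be idempotent."""
    for offset, record in records:
        process(record)
        checkpoint(offset)  # persisted only after a successful process
```

Flipping the order (checkpoint before process) yields at-most-once instead; exactly-once requires making the process-and-checkpoint pair atomic, e.g., via transactional writes.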
Skill signals: demonstrate knowledge of distributed systems, stream processing frameworks (Flink, Kafka Streams), time-series storage patterns (TSDBs, column stores, object store cold tier), cardinality reduction (sketches, downsampling), observability, and operational concerns (multi-region replication, SLA targets, capacity planning). You should justify design choices with performance, availability, and cost trade-offs.
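To make the "sketches" point concrete, here is a minimal count-min sketch: a fixed-memory structure that approximates per-key counts no matter how many distinct keys (e.g., `user_id` values) arrive. This is a generic textbook sketch, not tied to any particular TSDB; the sizing defaults are arbitrary.

```python
import hashlib

class CountMinSketch:
    """Fixed-memory approximate counter. Memory is O(width * depth)
    regardless of key cardinality; estimates can only overcount."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # one independent hash per row, derived by salting with the row id
        for row in range(self.depth):
            h = hashlib.blake2b(key.encode(), digest_size=8,
                                salt=row.to_bytes(8, "big"))
            yield row, int.from_bytes(h.digest(), "big") % self.width

    def add(self, key, count=1):
        for row, col in self._indexes(key):
            self.table[row][col] += count

    def estimate(self, key):
        # min over rows bounds the overcount from hash collisions
        return min(self.table[row][col] for row, col in self._indexes(key))
```

The same mergeability property (two sketches combine by element-wise addition) is what makes these structures fit distributed aggregation and downsampling pipelines.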
Common Follow-up Questions
- How would you handle very high-cardinality dimensions (e.g., user_id) for metrics while keeping storage and query costs manageable?
- Describe how you would implement exactly-once semantics for metric aggregations across failures and rebalances.
- How do you design the system to meet sub-second processing for critical metrics while keeping cold storage cheap?
- What backpressure and client-side buffering strategies would you use to avoid data loss during traffic spikes?
- How would you support schema evolution, PII redaction, and dynamic configuration of retention and aggregation without downtime?
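For the backpressure question, one common client-side answer is a bounded buffer with an explicit drop policy: when the broker is slow or unreachable, evict the oldest events rather than grow memory without bound, and count the drops so the loss is observable. This is a hypothetical sketch of that policy, not a real client library API.

```python
from collections import deque

class BoundedBuffer:
    """Client-side event buffer with a hard capacity. On overflow the
    oldest events are dropped, favoring fresh data and bounded memory
    over completeness; drops are counted so loss is observable."""
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)  # deque evicts oldest on overflow
        self.dropped = 0

    def offer(self, event):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1  # record the eviction before it happens
        self.buf.append(event)

    def drain(self, batch_size=500):
        """Pop up to batch_size events for the next send attempt."""
        batch = []
        while self.buf and len(batch) < batch_size:
            batch.append(self.buf.popleft())
        return batch
```

The alternative policies worth contrasting in an interview are drop-newest (keep history, lose fresh data) and blocking the producer (no loss, but backpressure propagates into the application).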