
Stripe System Design: Scalable Real-Time Logs & Metrics

Topics:
Stream Processing
Metrics Monitoring
Log-Based Storage
Roles:
Software Engineer
Backend Engineer
Site Reliability Engineer
Experience:
Mid Level
Senior
Staff

Question Description

You are asked to design a unified, scalable system for collecting, processing, and analyzing both structured logs and time-series metrics across a global cloud service that produces millions of events per second.

Core task: define ingestion APIs (HTTP/gRPC/TCP), a durable, partitioned transport layer (e.g., Kafka or a cloud equivalent), real-time stream processing for enrichment and aggregation (windowed percentiles, counts, anomaly detection), and a storage strategy that separates hot (recent, fast queries) and cold (cost-efficient historical) tiers. Consider schema flexibility for JSON logs, high-cardinality metrics, retention policies, and configurable aggregation intervals.
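The windowed-aggregation step above can be sketched concretely. Below is a minimal tumbling-window aggregator in Python; the class and method names are illustrative assumptions, not from any particular stream-processing framework, and a real system would bound memory and expire old windows:

```python
import bisect
from collections import defaultdict

class TumblingWindowAggregator:
    """Buckets samples into fixed-size (tumbling) windows and answers
    per-window counts and nearest-rank percentiles."""

    def __init__(self, window_seconds=10):
        self.window_seconds = window_seconds
        # window start timestamp -> sorted list of samples
        self.windows = defaultdict(list)

    def record(self, timestamp, value):
        # Align the event to the start of its window.
        window_start = timestamp - (timestamp % self.window_seconds)
        bisect.insort(self.windows[window_start], value)

    def count(self, window_start):
        return len(self.windows.get(window_start, []))

    def percentile(self, window_start, p):
        samples = self.windows.get(window_start)
        if not samples:
            return None
        # Nearest-rank percentile over the sorted samples.
        idx = min(len(samples) - 1, int(p / 100.0 * len(samples)))
        return samples[idx]
```

In an interview you would note that keeping raw sorted samples per window does not scale; production systems replace the sorted list with a quantile sketch (t-digest, DDSketch) so each window uses bounded memory.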

Interview flow: you’ll typically walk through end-to-end data flow (clients → ingestion → broker → stream processors → hot/cold storage → query/alerting), discuss data partitioning, state management (checkpoints, exactly-once vs at-least-once), and failure modes (retries, buffering, backpressure). Expect to sketch components, APIs, scaling strategies, and trade-offs for latency, durability, and cost.
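For the exactly-once vs at-least-once discussion, one standard pattern worth sketching is to accept at-least-once delivery from the broker and make the sink idempotent, deduplicating on a unique event id so redeliveries after retries or rebalances have no effect. A minimal sketch (the class name and in-memory dedup set are assumptions; a real sink would persist seen ids, or bound them with a TTL, alongside the aggregates):

```python
class IdempotentSink:
    """Turns at-least-once delivery into exactly-once *effects* by
    deduplicating on a unique event id before applying the update."""

    def __init__(self):
        self.seen = set()   # ids already applied (persisted in practice)
        self.totals = {}    # metric name -> running sum

    def apply(self, event_id, metric, value):
        if event_id in self.seen:
            return False  # duplicate redelivery after a retry/rebalance
        self.seen.add(event_id)
        self.totals[metric] = self.totals.get(metric, 0) + value
        return True
```

The key trade-off to call out: true exactly-once processing (e.g., checkpointed state plus transactional output) is more complex and adds latency, while idempotent sinks are simpler but require a stable id per event and storage for the dedup state.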

Skill signals: demonstrate knowledge of distributed systems, stream processing frameworks (Flink, Kafka Streams), time-series storage patterns (TSDBs, column stores, object store cold tier), cardinality reduction (sketches, downsampling), observability, and operational concerns (multi-region replication, SLA targets, capacity planning). You should justify design choices with performance, availability, and cost trade-offs.
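As an example of the cardinality-reduction techniques mentioned above, a count-min sketch estimates per-key counts in fixed memory no matter how many distinct keys (e.g., user_ids) arrive, at the cost of a small over-count error. A self-contained sketch in Python (the hashing scheme here is a simplification for illustration; real implementations use faster non-cryptographic hashes):

```python
import hashlib

class CountMinSketch:
    """Fixed-memory approximate counter. Estimates never undercount:
    estimate(key) >= true count, with over-count bounded by collisions."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, key):
        # One independent hash per row, derived by salting with the row id.
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._indexes(key):
            self.table[row][col] += count

    def estimate(self, key):
        # The minimum across rows is the least-collided, tightest estimate.
        return min(self.table[row][col] for row, col in self._indexes(key))
```

In this design, sketches handle "top-k noisy neighbors" style queries over unbounded dimensions, while downsampling and rollups cut retention cost for the long tail.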

Common Follow-up Questions

  • How would you handle very high-cardinality dimensions (e.g., user_id) for metrics while keeping storage and query costs manageable?
  • Describe how you would implement exactly-once semantics for metric aggregations across failures and rebalances.
  • How do you design the system to meet sub-second processing for critical metrics while keeping cold storage cheap?
  • What backpressure and client-side buffering strategies would you use to avoid data loss during traffic spikes?
  • How would you support schema evolution, PII redaction, and dynamic configuration of retention and aggregation without downtime?
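For the backpressure and client-side buffering follow-up, one concrete answer is a bounded buffer with a drop-oldest shedding policy: during a spike the client keeps the freshest events (which matter most for real-time alerting) and counts what it sheds so loss is observable. A minimal sketch under those assumptions:

```python
from collections import deque

class BoundedBuffer:
    """Client-side buffer for traffic spikes. Holds up to `capacity`
    events and evicts the oldest when full (drop-oldest), so fresh data
    survives while the downstream collector is slow or unavailable."""

    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)
        self.dropped = 0  # exported as a metric so loss is visible

    def offer(self, event):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1  # append below evicts the oldest event
        self.buf.append(event)

    def drain(self, n):
        """Pop up to n events for the next batched send."""
        return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]
```

The natural trade-offs to discuss: drop-oldest vs drop-newest vs blocking the producer, sizing the buffer against the longest tolerated collector outage, and spilling to local disk when in-memory capacity is exceeded.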

Related Questions

1. Design a global log ingestion pipeline for microservices with low-latency queries
2. Build a time-series metrics system that computes real-time percentiles and SLA alerts
3. Design a cost-efficient hot/cold storage architecture for observability data
4. Architect a multi-tenant monitoring system with isolation and resource quotas
