LinkedIn System Design: Scalable Monitoring (Metrics/Logs)
Question Description
You are asked to design a cloud-based, LinkedIn-scale monitoring platform that ingests and stores high-volume telemetry (metrics, logs, traces) from thousands of services. Your design should enable real-time visibility, fast ad-hoc and aggregated queries, alerting, and long-term analysis while meeting strict non-functional requirements like low latency, high availability, data durability, and cost efficiency.
Core scope: define the end-to-end architecture for ingestion, real-time processing, storage, and querying. Consider protocols and client SDKs (OpenTelemetry, Prometheus, gRPC/HTTP), high-throughput ingestion gateways (e.g., Kafka), stream processors for aggregation (Flink/Beam), specialized storage backends (TSDB for metrics, indexed log store, trace store), and a query/visualization layer for dashboards and alerts.
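To make the ingestion-gateway idea concrete, here is a minimal sketch of a metric data point and a hash-based shard router. All names (the dataclass, the partition count) are illustrative, not part of the question; the key idea is that every sample of one series lands on the same partition so a downstream stream-processor task sees that series in order.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPoint:
    # One time-series sample as it might arrive at the ingestion gateway.
    name: str            # e.g. "http_requests_total"
    labels: tuple        # sorted (key, value) pairs; a tuple so the point is hashable
    timestamp_ms: int
    value: float

def series_key(point: MetricPoint) -> str:
    """Canonical series identity: metric name plus its sorted label set."""
    return point.name + "{" + ",".join(f"{k}={v}" for k, v in point.labels) + "}"

def partition_for(point: MetricPoint, num_partitions: int) -> int:
    """Route all samples of one series to the same Kafka-style partition."""
    digest = hashlib.sha256(series_key(point).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p = MetricPoint("http_requests_total",
                (("service", "feed"), ("status", "200")),
                1_700_000_000_000, 1.0)
print(partition_for(p, 64))
```

Hashing the full series key (rather than just the metric name) spreads hot metrics across partitions while keeping per-series ordering.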
Flow & stages: focus on (1) resilient ingestion with buffering and backpressure; (2) lightweight preprocessing (validation, enrichment, sampling); (3) stream processing for rollups, percentiles, and anomaly detection; (4) hot/warm/cold storage tiers with compaction and retention policies; (5) query API, dashboards, and alerting pipeline integrated with notification systems.
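Stage (3) above, rollups and percentiles, can be sketched as a tumbling-window aggregator. This is a simplified in-memory version (a real stream processor like Flink would do this over keyed, watermarked state); the window size and output fields are assumptions for illustration.

```python
import statistics
from collections import defaultdict

def rollup(samples, window_ms=60_000):
    """Group (series, timestamp_ms, value) samples into tumbling windows
    and emit count / sum / p50 / p95 per (series, window-start)."""
    buckets = defaultdict(list)
    for series, ts, value in samples:
        window_start = ts // window_ms * window_ms
        buckets[(series, window_start)].append(value)

    out = {}
    for key, vals in buckets.items():
        vals.sort()
        # statistics.quantiles needs at least two points; fall back otherwise.
        q = statistics.quantiles(vals, n=100) if len(vals) > 1 else [vals[0]] * 99
        out[key] = {"count": len(vals), "sum": sum(vals),
                    "p50": q[49], "p95": q[94]}
    return out
```

At scale, exact percentiles are replaced by mergeable sketches (t-digest, HDRHistogram) so rollups from different shards can be combined without replaying raw samples.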
Skill signals: demonstrate choices for sharding, compression, indexing, time-series schemas, cardinality control, and aggregation strategies, and the trade-off between query latency and storage cost. Show how you ensure multi-tenant isolation, secure access, HA/failover, minimal data loss during network partitions, and extensibility for new telemetry types. Provide concrete scaling targets (writes/sec, query-latency SLA) and cover operational concerns such as SLOs, monitoring the monitoring system itself, and cost-efficient long-term retention.
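Cardinality control is worth a concrete sketch. One common tactic is to cap the number of distinct label sets admitted per metric and collapse the overflow into a catch-all series; the class and cap below are illustrative assumptions (real systems may instead drop, sample, or strip the offending label).

```python
class CardinalityLimiter:
    """Caps distinct label sets per metric name so a misbehaving client
    cannot explode the TSDB index. Overflowing series are re-labeled into
    a single catch-all bucket."""

    def __init__(self, max_series_per_metric=1000):
        self.max = max_series_per_metric
        self.seen = {}  # metric name -> set of admitted label tuples

    def admit(self, name, labels):
        known = self.seen.setdefault(name, set())
        if labels in known:
            return labels                    # already-tracked series: pass through
        if len(known) < self.max:
            known.add(labels)                # new series under the cap: admit
            return labels
        return (("overflow", "true"),)       # over the cap: collapse
```

Applied at the ingestion gateway, this bounds index growth per tenant while keeping the write path lossless in volume (samples are remapped, not dropped).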
Common Follow-up Questions
- How would you control high-cardinality tags (label cardinality) for metrics and still allow flexible querying without exploding storage costs?
- Design the retention and downsampling strategy: how do you balance raw-data retention (30 days) vs. aggregated long-term storage (1 year), and what compaction/rollup schedule do you choose?
- Explain how you would implement multi-tenant isolation, authentication, and RBAC so teams can only access their own telemetry while still supporting cross-team dashboards.
- How do you ensure the monitoring pipeline stays available under load? Describe backpressure, buffering, replication, and disaster recovery for the ingestion and storage layers.
- Design a query engine capable of returning percentile and group-by aggregations over terabytes of time-series data within seconds. What indexes, data layout, and caching techniques would you use?
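For the retention/downsampling follow-up, a tiered schedule can be expressed as a small lookup: the query planner picks the finest resolution still retained for data of a given age. The tier table below is an assumption matching the 30-day/1-year numbers in the question, with a 10-second raw scrape interval and a 90-day mid tier added for illustration.

```python
from datetime import timedelta
from typing import Optional

# (max age of data, resolution kept at that age) -- illustrative tiers.
TIERS = [
    (timedelta(days=30),  timedelta(seconds=10)),  # raw samples, ~10s scrape interval
    (timedelta(days=90),  timedelta(minutes=5)),   # 5m rollups
    (timedelta(days=365), timedelta(hours=1)),     # 1h rollups for long-term analysis
]

def resolution_for(age: timedelta) -> Optional[timedelta]:
    """Finest resolution still retained for data of the given age,
    or None once the data has aged out entirely."""
    for max_age, resolution in TIERS:
        if age <= max_age:
            return resolution
    return None
```

A compactor then only needs to rewrite blocks whose age has crossed a tier boundary, which is what makes the rollup schedule cheap to run continuously.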