LinkedIn System Design: Scalable Monitoring (Metrics/Logs)
Question Description
You are asked to design a cloud-based, LinkedIn-scale monitoring platform that ingests and stores high-volume telemetry (metrics, logs, traces) from thousands of services. Your design should enable real-time visibility, fast ad-hoc and aggregated queries, alerting, and long-term analysis while meeting strict non-functional requirements like low latency, high availability, data durability, and cost efficiency.
Core scope: define the end-to-end architecture for ingestion, real-time processing, storage, and querying. Consider protocols and client SDKs (OpenTelemetry, Prometheus, gRPC/HTTP), high-throughput ingestion gateways (e.g., Kafka), stream processors for aggregation (Flink/Beam), specialized storage backends (TSDB for metrics, indexed log store, trace store), and a query/visualization layer for dashboards and alerts.
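To make the ingestion-gateway idea concrete, here is a minimal sketch of a metric data point and a hash-based shard router. All names (the dataclass, the partition count) are illustrative, not part of the question; the key idea is that every sample of one series lands on the same partition so a downstream stream-processor task sees that series in order.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPoint:
    # One time-series sample as it might arrive at the ingestion gateway.
    name: str            # e.g. "http_requests_total"
    labels: tuple        # sorted (key, value) pairs; a tuple so the point is hashable
    timestamp_ms: int
    value: float

def series_key(point: MetricPoint) -> str:
    """Canonical series identity: metric name plus its sorted label set."""
    return point.name + "{" + ",".join(f"{k}={v}" for k, v in point.labels) + "}"

def partition_for(point: MetricPoint, num_partitions: int) -> int:
    """Route all samples of one series to the same Kafka-style partition."""
    digest = hashlib.sha256(series_key(point).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p = MetricPoint("http_requests_total",
                (("service", "feed"), ("status", "200")),
                1_700_000_000_000, 1.0)
print(partition_for(p, 64))
```

Hashing the full series key (rather than just the metric name) spreads hot metrics across partitions while keeping per-series ordering.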
Flow & stages: focus on (1) resilient ingestion with buffering and backpressure; (2) lightweight preprocessing (validation, enrichment, sampling); (3) stream processing for rollups, percentiles, and anomaly detection; (4) hot/warm/cold storage tiers with compaction and retention policies; (5) query API, dashboards, and alerting pipeline integrated with notification systems.
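Stage (3) above, rollups and percentiles, can be sketched as a tumbling-window aggregator. This is a simplified in-memory version (a real stream processor like Flink would do this over keyed, watermarked state); the window size and output fields are assumptions for illustration.

```python
import statistics
from collections import defaultdict

def rollup(samples, window_ms=60_000):
    """Group (series, timestamp_ms, value) samples into tumbling windows
    and emit count / sum / p50 / p95 per (series, window-start)."""
    buckets = defaultdict(list)
    for series, ts, value in samples:
        window_start = ts // window_ms * window_ms
        buckets[(series, window_start)].append(value)

    out = {}
    for key, vals in buckets.items():
        vals.sort()
        # statistics.quantiles needs at least two points; fall back otherwise.
        q = statistics.quantiles(vals, n=100) if len(vals) > 1 else [vals[0]] * 99
        out[key] = {"count": len(vals), "sum": sum(vals),
                    "p50": q[49], "p95": q[94]}
    return out
```

At scale, exact percentiles are replaced by mergeable sketches (t-digest, HDRHistogram) so rollups from different shards can be combined without replaying raw samples.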
Skill signals: demonstrate choices for sharding, compression, indexing, time-series schemas, cardinality control, and aggregation strategies, and the trade-off between query latency and storage cost. Show how you ensure multi-tenant isolation, secure access, HA/failover, minimal data loss during network partitions, and extensibility for new telemetry types. Provide concrete scaling targets (writes/sec, query-latency SLA) and cover operational concerns such as SLOs, monitoring the monitoring system itself, and cost-efficient long-term retention.
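Cardinality control is worth a concrete sketch. One common tactic is to cap the number of distinct label sets admitted per metric and collapse the overflow into a catch-all series; the class and cap below are illustrative assumptions (real systems may instead drop, sample, or strip the offending label).

```python
class CardinalityLimiter:
    """Caps distinct label sets per metric name so a misbehaving client
    cannot explode the TSDB index. Overflowing series are re-labeled into
    a single catch-all bucket."""

    def __init__(self, max_series_per_metric=1000):
        self.max = max_series_per_metric
        self.seen = {}  # metric name -> set of admitted label tuples

    def admit(self, name, labels):
        known = self.seen.setdefault(name, set())
        if labels in known:
            return labels                    # already-tracked series: pass through
        if len(known) < self.max:
            known.add(labels)                # new series under the cap: admit
            return labels
        return (("overflow", "true"),)       # over the cap: collapse
```

Applied at the ingestion gateway, this bounds index growth per tenant while keeping the write path lossless in volume (samples are remapped, not dropped).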
Common Follow-up Questions
- How would you control high-cardinality tags (label cardinality) for metrics and still allow flexible querying without exploding storage costs?
- Design the retention and downsampling strategy: how do you balance raw-data retention (30 days) vs. aggregated long-term storage (1 year), and what compaction/rollup schedule do you choose?
- Explain how you would implement multi-tenant isolation, authentication, and RBAC so teams can only access their own telemetry while still supporting cross-team dashboards.
- How do you ensure the monitoring pipeline stays available under load? Describe backpressure, buffering, replication, and disaster recovery for the ingestion and storage layers.
- Design a query engine capable of returning percentile and group-by aggregations over terabytes of time-series data within seconds. What indexes, data layout, and caching techniques would you use?
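For the retention/downsampling follow-up, a tiered schedule can be expressed as a small lookup: the query planner picks the finest resolution still retained for data of a given age. The tier table below is an assumption matching the 30-day/1-year numbers in the question, with a 10-second raw scrape interval and a 90-day mid tier added for illustration.

```python
from datetime import timedelta
from typing import Optional

# (max age of data, resolution kept at that age) -- illustrative tiers.
TIERS = [
    (timedelta(days=30),  timedelta(seconds=10)),  # raw samples, ~10s scrape interval
    (timedelta(days=90),  timedelta(minutes=5)),   # 5m rollups
    (timedelta(days=365), timedelta(hours=1)),     # 1h rollups for long-term analysis
]

def resolution_for(age: timedelta) -> Optional[timedelta]:
    """Finest resolution still retained for data of the given age,
    or None once the data has aged out entirely."""
    for max_age, resolution in TIERS:
        if age <= max_age:
            return resolution
    return None
```

A compactor then only needs to rewrite blocks whose age has crossed a tier boundary, which is what makes the rollup schedule cheap to run continuously.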