Databricks System Design: Multi-threaded Event Logger
Question Description
You are asked to design a high-performance, multi-threaded event logger that a thousand concurrent threads will share. The logger must capture structured events (timestamp, thread id, level, message, metadata) without blocking application threads and without losing data under high load.
A practical design you can present uses asynchronous, non-blocking ingestion: application threads serialize minimal event payloads and push them onto a concurrent in-memory structure (an MPSC queue or a lock-free ring buffer). A background worker (or small pool) handles batching, serialization (structured JSON or Avro), and writes to outputs: local files with rotation, the console, or remote sinks via a message queue such as Kafka. Bounding the queue, and optionally adding a small per-thread buffer, limits contention and enables batching, which improves throughput and tail latency.
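The ingestion design above can be sketched with the Python standard library alone: application threads enqueue small event dicts without blocking, and one background worker drains a bounded queue in batches and hands JSON lines to a sink. The class and parameter names (`AsyncEventLogger`, `batch_size`, `flush_interval`) are illustrative, not from any particular library.

```python
import json
import queue
import threading
import time

class AsyncEventLogger:
    def __init__(self, sink, max_queue=10_000, batch_size=256, flush_interval=0.05):
        self._q = queue.Queue(maxsize=max_queue)   # bounded: applies backpressure
        self._sink = sink                          # callable taking a list of JSON lines
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, level, message, **metadata):
        event = {
            "ts": time.time(),
            "thread": threading.get_ident(),
            "level": level,
            "message": message,
            "meta": metadata,
        }
        try:
            self._q.put_nowait(event)              # never block the caller
            return True
        except queue.Full:
            return False                           # caller decides: drop or retry

    def _drain(self):
        batch = []
        while not (self._stop.is_set() and self._q.empty()):
            try:
                batch.append(self._q.get(timeout=self._flush_interval))
            except queue.Empty:
                pass
            # flush when the batch is full or there is a lull in traffic
            if batch and (len(batch) >= self._batch_size or self._q.empty()):
                self._sink([json.dumps(e) for e in batch])
                batch.clear()

    def close(self):
        self._stop.set()
        self._worker.join()
```

A production version would add rotation, retries, and metrics, but this is enough to demonstrate the non-blocking hot path and size-or-time flush policy in an interview.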
A typical interview flow: clarify requirements, pick concurrency primitives, sketch the components (ingest queue, batcher, output writer, retry/backpressure), discuss durability and failure modes, and walk through the trade-offs (latency vs. durability). Be explicit about configuration: dynamic log levels, pluggable outputs, and runtime toggles.
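The dynamic-log-level toggle mentioned above can be sketched as a small shared gate that every thread consults on the hot path; `LevelGate` and its method names are hypothetical. The lock keeps level changes well-defined across runtimes, while the read in `enabled` stays cheap.

```python
import threading

# numeric severities, lowest to highest
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}

class LevelGate:
    """Runtime-adjustable minimum log level shared by all threads."""

    def __init__(self, level="INFO"):
        self._level = LEVELS[level]
        self._lock = threading.Lock()

    def set_level(self, level):
        # called rarely, e.g. from a config-reload endpoint
        with self._lock:
            self._level = LEVELS[level]

    def enabled(self, level):
        # cheap check performed on every log call
        return LEVELS[level] >= self._level
```

Application threads call `gate.enabled("DEBUG")` before building an event, so raising the level at runtime immediately suppresses cheap-to-skip events with zero downtime.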
Skill signals to show: understanding of lock-free vs mutex designs, batching and flush strategies, backpressure and retry policies, crash recovery (WAL or fsync frequency), throughput/latency trade-offs, and observability (metrics, error handling). Mention common optimizations (Disruptor-style ring buffer, memory pooling, and efficient serializers) and when you’d choose them.
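To make the Disruptor-style ring buffer concrete, here is a hedged, single-producer/single-consumer sketch: a preallocated fixed-size list with monotonically increasing head and tail counters, indexed modulo a power-of-two capacity so no allocation happens per event. The real LMAX Disruptor adds sequence barriers, cache-line padding, and multi-producer coordination; this only shows the indexing scheme.

```python
class RingBuffer:
    """SPSC ring buffer over preallocated slots (indexing sketch only)."""

    def __init__(self, capacity=8):
        assert capacity & (capacity - 1) == 0, "capacity must be a power of two"
        self._slots = [None] * capacity          # preallocated, reused slots
        self._mask = capacity - 1                # cheap modulo via bitwise AND
        self._head = 0                           # next write sequence
        self._tail = 0                           # next read sequence

    def offer(self, event):
        if self._head - self._tail == len(self._slots):
            return False                         # full: caller applies backpressure
        self._slots[self._head & self._mask] = event
        self._head += 1
        return True

    def poll(self):
        if self._tail == self._head:
            return None                          # empty
        event = self._slots[self._tail & self._mask]
        self._tail += 1
        return event
```

The design point worth calling out: because slots are preallocated and sequences only increase, the structure avoids per-event allocation and lets producer and consumer run without a shared lock in languages with the appropriate memory-ordering primitives.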
Common Follow-up Questions
- How would you guarantee no log events are lost on a process crash? Explain fsync, write-ahead logs, and replication trade-offs.
- How do you ensure ordering guarantees (per-thread vs. global)? Describe designs that preserve order while maintaining throughput.
- If a remote sink (e.g., Kafka or an HTTP collector) becomes slow, how would you apply backpressure or shed load without dropping critical logs?
- How would you support dynamic runtime configuration (log levels, new outputs) with zero downtime and consistent behavior across threads?
- How would you scale this logger across processes or machines (centralized collector, aggregated streams, retention and partitioning)?
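One possible answer to the slow-sink follow-up can be sketched as a priority-aware bounded queue: critical events block briefly (backpressure), lower levels are dropped immediately when the queue is full, and every drop is counted so the loss is observable. `ShedLoadQueue` and its policy are an illustrative assumption, not a standard API; a real system might also spill critical events to a local disk buffer rather than ever failing them.

```python
import queue
import threading

class ShedLoadQueue:
    """Bounded queue that sheds non-critical events under sink slowness."""

    CRITICAL = {"ERROR", "FATAL"}

    def __init__(self, maxsize=1000, critical_timeout=0.5):
        self._q = queue.Queue(maxsize=maxsize)
        self._timeout = critical_timeout
        self.dropped = 0                 # exposed as a metric in a real system
        self._lock = threading.Lock()

    def offer(self, level, event):
        try:
            if level in self.CRITICAL:
                # brief backpressure: wait for space, then give up visibly
                self._q.put(event, timeout=self._timeout)
            else:
                self._q.put_nowait(event)        # shed immediately if full
            return True
        except queue.Full:
            with self._lock:
                self.dropped += 1                # make the loss observable
            return False
```

The key interview point: shedding must be deliberate and measured, never a silent side effect of an unbounded buffer exhausting memory.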