Databricks System Design: Multi-threaded Event Logger

Topics: Structured Logging, Message Queues, Task Scheduling
Roles: Software Engineer, Backend Engineer, Site Reliability Engineer
Experience: Mid Level, Senior, Staff

Question Description

You are asked to design a high-performance, multi-threaded event logger that a thousand concurrent threads will share. The logger must capture structured events (timestamp, thread id, level, message, metadata) without blocking application threads and without losing data under high load.

A practical design you can present uses asynchronous, non-blocking ingestion: application threads serialize minimal event payloads and push them into a concurrent in-memory queue (MPSC or lock-free ring buffer). A background worker (or pool) performs batching, serialization (structured JSON/Avro), and writes to outputs (local files with rotation, console, or remote sinks via a message queue like Kafka). Add a small in-memory buffer per thread or a bounded queue to limit contention and enable batching for throughput and low tail latency.
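The ingestion path described above can be sketched in Java. This is a minimal, illustrative design (the class names `Event` and `AsyncLogger` are hypothetical): producer threads call a non-blocking `offer()` into a bounded queue, and a single background worker drains events in batches and writes them to a sink (a `List` stands in for a file or remote sink here).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Structured event: captured cheaply on the application thread.
final class Event {
    final long timestampNanos = System.nanoTime();
    final long threadId = Thread.currentThread().getId();
    final String level;
    final String message;

    Event(String level, String message) {
        this.level = level;
        this.message = message;
    }
}

final class AsyncLogger implements AutoCloseable {
    private final BlockingQueue<Event> queue;
    private final Thread worker;
    private final List<String> sink = new ArrayList<>(); // stand-in for a file/Kafka sink
    private volatile boolean running = true;

    AsyncLogger(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.worker = new Thread(this::drainLoop, "log-writer");
        this.worker.start();
    }

    // Called by application threads; offer() never blocks the caller.
    boolean log(String level, String message) {
        return queue.offer(new Event(level, message));
    }

    private void drainLoop() {
        List<Event> batch = new ArrayList<>();
        while (running || !queue.isEmpty()) {
            batch.clear();
            queue.drainTo(batch, 128); // amortize sink writes over a batch
            if (batch.isEmpty()) {
                try { Thread.sleep(1); } catch (InterruptedException e) { return; }
                continue;
            }
            synchronized (sink) {
                for (Event e : batch) {
                    sink.add(e.level + " [" + e.threadId + "] " + e.message);
                }
            }
        }
    }

    int written() {
        synchronized (sink) { return sink.size(); }
    }

    @Override public void close() {
        running = false; // worker drains remaining events, then exits
        try { worker.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

In a real system the sink write would be a buffered file append or a producer send, and the worker would also handle rotation and flush-on-interval; the key property shown here is that application threads only pay for an enqueue.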

A typical interview flow: clarify requirements, choose concurrency primitives, sketch the components (ingest queue, batcher, output writer, retry/backpressure), discuss durability and failure modes, and weigh the trade-offs (latency vs. durability). Be explicit about configuration: dynamic log levels, pluggable outputs, and runtime toggles.
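The "dynamic log levels" point can be made concrete with a small sketch (the `LogConfig` class is hypothetical): a single atomic read on the hot path makes a runtime level change immediately visible to every thread without any locking.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Runtime-adjustable minimum log level shared by all threads.
final class LogConfig {
    static final int DEBUG = 0, INFO = 1, WARN = 2, ERROR = 3;

    private final AtomicInteger minLevel = new AtomicInteger(INFO);

    // Hot path: one volatile read, no locks.
    boolean enabled(int level) { return level >= minLevel.get(); }

    // Runtime toggle, e.g. driven by an admin endpoint or config watcher.
    void setMinLevel(int level) { minLevel.set(level); }
}
```

Producers check `enabled()` before constructing an event, so disabled levels cost almost nothing; an operator can flip to `DEBUG` in production with zero downtime.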

Skill signals to show: understanding of lock-free vs mutex designs, batching and flush strategies, backpressure and retry policies, crash recovery (WAL or fsync frequency), throughput/latency trade-offs, and observability (metrics, error handling). Mention common optimizations (Disruptor-style ring buffer, memory pooling, and efficient serializers) and when you’d choose them.
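One way to demonstrate the backpressure-and-metrics signal is a policy sketch like the following (class and method names are illustrative, not a library API): critical events apply bounded backpressure to the producer by waiting briefly for a queue slot, while non-critical events are dropped immediately and counted so the drop rate is observable.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Load-shedding queue: bounded wait for critical events, drop-and-count
// for everything else when the queue is full.
final class BackpressureQueue {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    BackpressureQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    boolean submit(String event, boolean critical) {
        if (critical) {
            try {
                // Backpressure: block the producer, but only briefly.
                return queue.offer(event, 50, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        if (queue.offer(event)) return true;
        dropped.incrementAndGet(); // shed load; expose this as a metric
        return false;
    }

    long droppedCount() { return dropped.get(); }

    String poll() { return queue.poll(); }
}
```

The drop counter is the observability hook: an alert on a rising drop rate tells you the sink is falling behind before logs silently disappear.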

Common Follow-up Questions

  • How would you guarantee no log events are lost on a process crash? Explain fsync, write-ahead logs, and replication trade-offs.
  • How do you ensure ordering guarantees (per-thread vs global)? Describe designs for preserving order while maintaining throughput.
  • If a remote sink (e.g., Kafka or HTTP collector) becomes slow, how would you apply backpressure or shed load without dropping critical logs?
  • How would you support dynamic runtime configuration (log levels, new outputs) with zero downtime and consistent behavior across threads?
  • How would you scale this logger across processes or machines (centralized collector, aggregated streams, retention and partitioning)?
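For the ordering follow-up above, one common answer is to tag each event with a per-thread sequence number at enqueue time, so a consumer can verify or restore per-thread order even if a multi-producer queue interleaves events. A minimal sketch (the `Sequencer` class is hypothetical), using `ThreadLocal` counters to avoid contention on a single shared counter:

```java
import java.util.concurrent.atomic.AtomicLong;

// Per-thread monotonic sequence numbers, contention-free across threads.
final class Sequencer {
    private final ThreadLocal<AtomicLong> seq =
        ThreadLocal.withInitial(AtomicLong::new);

    // Each calling thread gets its own 0, 1, 2, ... sequence.
    long next() { return seq.get().getAndIncrement(); }
}
```

The (threadId, sequence) pair gives a total order per thread at negligible cost; a global total order would instead require a shared counter or a single-writer design, trading throughput for the stronger guarantee.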

Related Questions

1. Design a durable log aggregation pipeline for distributed services (Kafka, Flink, S3)
2. Implement a non-blocking logger in Java: choices between ConcurrentLinkedQueue, Disruptor, and ring buffers
3. How to design log rotation, retention, and compaction for high-volume logging systems
4. Design a backpressure strategy for high-throughput event ingestion systems
5. Explain trade-offs between synchronous fsync and asynchronous batching for reliable logging
