Apple System Design: Distributed Task Scheduler in Cloud
Question Description
You are asked to design a distributed task scheduler that runs cloud background jobs such as data processing, reporting, and batch operations. The core pieces you must cover are a durable task queue, a scheduler that assigns work, and a worker pool that executes tasks reliably and at scale.
Start by describing the end-to-end flow: how tasks are submitted via an API (with metadata such as priority, timeout, and dependencies), how they are enqueued into a distributed queue, how the scheduler selects tasks and leases them to workers, and how workers execute, acknowledge, retry, or dead-letter tasks. Explain how you'll support delayed tasks, task dependencies, and priority ordering.
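One way to make delayed tasks and priority ordering concrete is a two-heap structure: one heap ordered by scheduled time, one by priority. The sketch below is an in-memory stand-in for a durable queue, not a production design; the class name `DelayedPriorityQueue` and its methods are hypothetical, and it assumes a larger `priority` value means more urgent.

```python
import heapq
import itertools
from typing import List, Optional, Tuple


class DelayedPriorityQueue:
    """Illustrative in-memory queue; a real system would back this with durable storage."""

    def __init__(self) -> None:
        # Min-heap of (scheduled_at, seq, priority, task_id): tasks not yet due.
        self._delayed: List[Tuple[float, int, int, str]] = []
        # Min-heap of (-priority, seq, task_id): due tasks, highest priority first.
        self._ready: List[Tuple[int, int, str]] = []
        # Monotonic counter used as a tie-breaker so equal priorities stay FIFO.
        self._seq = itertools.count()

    def submit(self, task_id: str, priority: int, scheduled_at: float) -> None:
        heapq.heappush(self._delayed, (scheduled_at, next(self._seq), priority, task_id))

    def _promote(self, now: float) -> None:
        # Move every task whose scheduled time has passed into the ready heap.
        while self._delayed and self._delayed[0][0] <= now:
            _, seq, priority, task_id = heapq.heappop(self._delayed)
            heapq.heappush(self._ready, (-priority, seq, task_id))

    def pop_ready(self, now: float) -> Optional[str]:
        """Return the highest-priority task that is due at `now`, or None."""
        self._promote(now)
        if not self._ready:
            return None
        return heapq.heappop(self._ready)[2]


q = DelayedPriorityQueue()
q.submit("report", priority=1, scheduled_at=0.0)
q.submit("billing", priority=9, scheduled_at=0.0)
q.submit("cleanup", priority=5, scheduled_at=60.0)
first = q.pop_ready(now=1.0)  # "billing": due now and highest priority
```

Passing `now` explicitly keeps the sketch deterministic; a scheduler process would supply its own clock reading. Dependencies are not handled here and would gate `submit` or promotion.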
For the task queue and task structure, specify a durable storage choice (e.g., partitioned message broker or persistent stream like Kafka, SQS + durable DB, or Redis Streams) and patterns to avoid duplicates (idempotency keys, dedup store). Define a task schema including: id, type, payload, status, priority, retry_count, max_retries, timestamps (created, scheduled_at, started_at, completed_at), visibility_timeout/lease_id, timeout_ms, dependencies, resource_requirements, and metadata.
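The schema fields listed above could be written down roughly as follows. This is a sketch only: the field names come from the list, but the types, default values, and the `TaskStatus` values are assumptions, and `DedupStore` is a hypothetical in-memory stand-in for a real dedup store (for example a database table with a unique constraint on the idempotency key).

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class TaskStatus(str, Enum):
    # Assumed status values; the question does not enumerate them.
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    DEAD_LETTERED = "dead_lettered"


@dataclass
class Task:
    type: str
    payload: Dict[str, Any]
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: TaskStatus = TaskStatus.PENDING
    priority: int = 0
    retry_count: int = 0
    max_retries: int = 3                      # assumed default
    created: float = field(default_factory=time.time)
    scheduled_at: Optional[float] = None      # None means "run as soon as possible"
    started_at: Optional[float] = None
    completed_at: Optional[float] = None
    visibility_timeout: Optional[float] = None
    lease_id: Optional[str] = None
    timeout_ms: int = 30_000                  # assumed default
    dependencies: List[str] = field(default_factory=list)
    resource_requirements: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)


class DedupStore:
    """Maps an idempotency key to the first task id submitted under it."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}

    def submit(self, idempotency_key: str, task: Task) -> str:
        # setdefault stores task.id only if the key is new, and always
        # returns the id of the task that first claimed the key.
        return self._seen.setdefault(idempotency_key, task.id)
```

A caller that retries a submission with the same idempotency key gets back the original task id instead of creating a duplicate.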
Discuss fault tolerance and consistency: use leases/visibility timeouts so that only one worker holds a task at a time, implement retries with exponential backoff and a dead-letter queue for persistent failures, and prefer at-least-once delivery with idempotent handlers (or an exactly-once processing layer built on deduplication). Cover scaling: partition queues, autoscale worker groups, use health checks and leader election for scheduler components, and monitor queue length, processing latency, retries, and error rates.
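The lease, backoff, and dead-letter mechanics can be sketched together in a few dozen lines. Everything here is illustrative: `LeasedQueue`, `backoff_delay`, and their parameters are hypothetical names, state lives in memory rather than in a durable store, and the backoff uses the "full jitter" variant (a uniform draw up to the exponential bound), which is one common choice among several.

```python
import random
import uuid
from typing import Dict, List, Optional, Tuple


def backoff_delay(retry_count: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**retry_count)]."""
    return random.uniform(0.0, min(cap, base * (2 ** retry_count)))


class LeasedQueue:
    def __init__(self, visibility_timeout: float = 30.0, max_retries: int = 3) -> None:
        self.visibility_timeout = visibility_timeout
        self.max_retries = max_retries
        self._visible_at: Dict[str, float] = {}  # task_id -> time it becomes leasable
        self._retries: Dict[str, int] = {}       # task_id -> failed attempts so far
        self._leases: Dict[str, str] = {}        # task_id -> currently valid lease_id
        self.dead_letter: List[str] = []

    def enqueue(self, task_id: str, now: float) -> None:
        self._visible_at[task_id] = now
        self._retries[task_id] = 0

    def lease(self, now: float) -> Optional[Tuple[str, str]]:
        """Hand out the first visible task with a fresh lease, or None."""
        for task_id, visible_at in self._visible_at.items():
            if visible_at <= now:
                lease_id = str(uuid.uuid4())
                self._leases[task_id] = lease_id
                # Hide the task until the lease expires. If the worker crashes,
                # the task becomes visible again and can be leased by another worker.
                self._visible_at[task_id] = now + self.visibility_timeout
                return task_id, lease_id
        return None

    def ack(self, task_id: str, lease_id: str) -> bool:
        """Complete the task; rejected if the lease was superseded."""
        if self._leases.get(task_id) != lease_id:
            return False
        del self._visible_at[task_id], self._retries[task_id], self._leases[task_id]
        return True

    def nack(self, task_id: str, lease_id: str, now: float) -> bool:
        """Report failure: reschedule with backoff, or dead-letter after max_retries."""
        if self._leases.get(task_id) != lease_id:
            return False
        self._retries[task_id] += 1
        del self._leases[task_id]
        if self._retries[task_id] > self.max_retries:
            del self._visible_at[task_id], self._retries[task_id]
            self.dead_letter.append(task_id)
        else:
            self._visible_at[task_id] = now + backoff_delay(self._retries[task_id])
        return True
```

Because a slow worker can still finish after its lease has expired and been reissued, the stale-lease check in `ack` only protects the queue's own state; side effects in the handler still need to be idempotent. In this sketch a lease that simply expires does not count against `max_retries`, which a real design would likely want to revisit.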
Skills you should demonstrate: distributed systems design, queueing and partitioning strategies, failure modes and retry semantics, data durability, scaling and autoscaling patterns, and practical choices of technologies for low-latency, reliable background job processing.
Common Follow-up Questions
- How would you implement exactly-once delivery semantics for tasks? Discuss deduplication, idempotence, and transactional sinks.
- Describe how you would schedule dependent tasks or DAGs. How do you represent dependencies and avoid starvation or cycles?
- How would you design autoscaling for the worker pool to handle spikes in queue depth while maintaining sub-second task dispatch?
- What metrics, alerts, and observability tooling would you add to detect slow consumers, poisoned messages, and cascading failures?
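For the DAG follow-up, one possible representation is a mapping from each task id to the ids it depends on, validated at submission time with Kahn's algorithm so that cyclic graphs are rejected before anything is enqueued. The function name `topological_order` and the example task names are hypothetical; this sketch covers cycle detection and ordering only, not starvation.

```python
from collections import deque
from typing import Dict, List


def topological_order(dependencies: Dict[str, List[str]]) -> List[str]:
    """Return a valid execution order, or raise ValueError if there is a cycle.

    `dependencies` maps task_id -> ids of tasks that must complete first.
    """
    # Collect every task id, including ones that only appear as a dependency.
    nodes = set(dependencies)
    for deps in dependencies.values():
        nodes.update(deps)

    remaining = {node: 0 for node in nodes}           # unmet dependency count
    dependents: Dict[str, List[str]] = {node: [] for node in nodes}
    for task, deps in dependencies.items():
        for dep in set(deps):                         # ignore duplicate edges
            remaining[task] += 1
            dependents[dep].append(task)

    # Start with tasks that have no dependencies; sorting keeps output deterministic.
    ready = deque(sorted(node for node, count in remaining.items() if count == 0))
    order: List[str] = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(dependents[task]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any task never reaching zero unmet dependencies is part of a cycle.
    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order


order = topological_order({
    "load": ["extract", "transform"],
    "transform": ["extract"],
    "report": ["load"],
})
# order == ["extract", "transform", "load", "report"]
```

A live scheduler would apply the same counting incrementally: decrement a child's unmet-dependency count when a parent task completes, and enqueue the child once the count reaches zero.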