backend system design
Apple
Google
Amazon

Apple System Design: Distributed Task Scheduler in Cloud

Topics:
Task Scheduling
Job Queues
Distributed Systems
Roles:
Software Engineer
Backend Engineer
Site Reliability Engineer
Experience:
Mid Level
Senior
Staff

Question Description

You are asked to design a distributed task scheduler that runs cloud background jobs such as data processing, reporting, and batch operations. The core pieces you must cover are a durable task queue, a scheduler that assigns work, and a worker pool that executes tasks reliably and at scale.

Start by describing the end-to-end flow: how tasks are submitted via an API (with metadata such as priority, timeout, and dependencies), how they are enqueued into a distributed queue, how the scheduler selects and leases tasks to workers, and how workers execute, acknowledge, retry, or dead-letter tasks. Explain how you will support delayed tasks, task dependencies, and priority ordering.
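The lease, acknowledge, and re-delivery flow described above can be sketched in a few lines. This is a minimal single-process illustration, not a distributed implementation: the class name `InMemoryTaskQueue`, its method names, and the explicit `now` parameter (used instead of a wall clock so the behavior is deterministic) are all assumptions made for the example. A real system would keep this state in a durable store.

```python
import heapq
import itertools
import uuid


class InMemoryTaskQueue:
    """Toy queue showing delayed tasks, priority ordering, and leases."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._delayed = []             # heap of (scheduled_at, seq, task_id)
        self._ready = []               # heap of (-priority, seq, task_id)
        self._seq = itertools.count()  # tie-breaker: FIFO among equal keys
        self._tasks = {}               # task_id -> (payload, priority)
        self._leases = {}              # task_id -> (lease_id, expires_at)

    def submit(self, payload, priority=0, scheduled_at=0.0):
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = (payload, priority)
        heapq.heappush(self._delayed, (scheduled_at, next(self._seq), task_id))
        return task_id

    def _push_ready(self, task_id):
        priority = self._tasks[task_id][1]
        heapq.heappush(self._ready, (-priority, next(self._seq), task_id))

    def _promote(self, now):
        # Delayed tasks become eligible once their scheduled time has arrived.
        while self._delayed and self._delayed[0][0] <= now:
            _, _, task_id = heapq.heappop(self._delayed)
            self._push_ready(task_id)
        # A lease that expired without an ack means the worker is presumed
        # dead, so the task becomes visible again (at-least-once delivery).
        for task_id, (_, expires_at) in list(self._leases.items()):
            if expires_at <= now:
                del self._leases[task_id]
                self._push_ready(task_id)

    def lease(self, now):
        """Hand out the highest-priority eligible task, or None."""
        self._promote(now)
        if not self._ready:
            return None
        _, _, task_id = heapq.heappop(self._ready)
        lease_id = str(uuid.uuid4())
        self._leases[task_id] = (lease_id, now + self.visibility_timeout)
        return task_id, lease_id, self._tasks[task_id][0]

    def ack(self, task_id, lease_id):
        """Complete a task; rejects acks that carry a stale lease id."""
        current = self._leases.get(task_id)
        if current is None or current[0] != lease_id:
            return False
        del self._leases[task_id]
        del self._tasks[task_id]
        return True
```

Keeping delayed and ready tasks in separate heaps means a high-priority task never waits behind a lower-priority one merely because it was scheduled slightly later. Rejecting acks with a stale `lease_id` is what stops a slow worker from completing a task that has already been handed to someone else.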

For the task queue and task structure, specify a durable storage choice (e.g., a partitioned message broker or persistent log such as Kafka, SQS paired with a durable database, or Redis Streams) and the patterns you will use to avoid duplicate tasks (idempotency keys, a dedup store). Define a task schema that includes: id, type, payload, status, priority, retry_count, max_retries, timestamps (created, scheduled_at, started_at, completed_at), visibility_timeout and lease_id, timeout_ms, dependencies, resource_requirements, and metadata.
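One possible shape for the task schema, written as a Python dataclass (Python 3.9+). The field names follow the list above; the default values, the `TaskStatus` states, the separate `idempotency_key` field, and the in-memory `DedupStore` are assumptions for illustration. In production the dedup check would be a conditional write against a durable store rather than a dictionary.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional
import time
import uuid


class TaskStatus(str, Enum):
    PENDING = "pending"
    LEASED = "leased"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    DEAD_LETTERED = "dead_lettered"


@dataclass
class Task:
    type: str
    payload: dict[str, Any]
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: TaskStatus = TaskStatus.PENDING
    priority: int = 0
    retry_count: int = 0
    max_retries: int = 3                     # assumed default
    created_at: float = field(default_factory=time.time)
    scheduled_at: Optional[float] = None     # None means "run as soon as possible"
    started_at: Optional[float] = None
    completed_at: Optional[float] = None
    lease_id: Optional[str] = None
    visibility_timeout_s: float = 30.0       # assumed default
    timeout_ms: int = 60_000                 # assumed default
    dependencies: list[str] = field(default_factory=list)
    resource_requirements: dict[str, Any] = field(default_factory=dict)
    metadata: dict[str, Any] = field(default_factory=dict)
    idempotency_key: Optional[str] = None    # caller-supplied, optional


class DedupStore:
    """Maps an idempotency key to the first task id submitted with it."""

    def __init__(self) -> None:
        self._owner: dict[str, str] = {}

    def submit(self, task: Task) -> str:
        """Return the id of the task that owns this key (possibly an earlier one)."""
        if task.idempotency_key is None:
            return task.id
        return self._owner.setdefault(task.idempotency_key, task.id)
```

If `submit` returns an id other than `task.id`, the caller knows the request is a duplicate and can return the existing task instead of enqueuing a second copy.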

Discuss fault tolerance and consistency: use leases or visibility timeouts so that only one worker holds a task at a time, implement exponential backoff and a dead-letter queue for tasks that keep failing, and prefer at-least-once delivery with idempotent handlers (or add a deduplication layer to approximate exactly-once processing). Cover scaling: partition queues, autoscale worker groups, use health checks and leader election for scheduler components, and monitor queue length, processing latency, retries, and error rates.
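The retry policy mentioned above can be illustrated with a small sketch of exponential backoff with full jitter plus a dead-letter decision. The function names, the 1-second base, and the 300-second cap are assumptions chosen for the example, not values prescribed by the question. The `rng` parameter is injectable only so the behavior can be tested deterministically.

```python
import random


def backoff_delay(retry_count, base_s=1.0, cap_s=300.0, rng=random.random):
    """Full-jitter backoff: a random delay below min(cap_s, base_s * 2**retry_count)."""
    ceiling = min(cap_s, base_s * (2 ** retry_count))
    return rng() * ceiling


def on_failure(retry_count, max_retries, now, rng=random.random):
    """Decide what to do after a failed attempt.

    Returns ("retry", next_scheduled_at, new_retry_count) while retries remain,
    otherwise ("dead_letter", None, retry_count).
    """
    if retry_count >= max_retries:
        return ("dead_letter", None, retry_count)
    delay = backoff_delay(retry_count, rng=rng)
    return ("retry", now + delay, retry_count + 1)
```

Jitter spreads retries out so that many tasks failing at the same moment (for example, during a downstream outage) do not all retry at the same moment too. The cap keeps the delay bounded once `retry_count` grows large.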

Skills you should demonstrate: distributed systems design, queueing and partitioning strategies, failure modes and retry semantics, data durability, scaling and autoscaling patterns, and practical choices of technologies for low-latency, reliable background job processing.

Common Follow-up Questions

  • How would you implement exactly-once delivery semantics for tasks? Discuss deduplication, idempotence, and transactional sinks.
  • Describe how you would schedule dependent tasks or DAGs. How do you represent dependencies and avoid starvation or cycles?
  • How would you design autoscaling for the worker pool to handle spikes in queue depth while maintaining sub-second task dispatch?
  • What metrics, alerts, and observability tooling would you add to detect slow consumers, poisoned messages, and cascading failures?
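For the dependency follow-up, one way to represent a DAG is a mapping from each task id to the ids it depends on. The sketch below uses Python's standard-library `graphlib` (3.9+) to reject cycles at submission time and to group tasks into waves that can run in parallel. The function name `plan_waves` and the wave-based grouping are assumptions made for illustration; a real scheduler would release each task as soon as its own dependencies complete rather than waiting for a whole wave.

```python
from graphlib import CycleError, TopologicalSorter


def plan_waves(dependencies):
    """Group tasks so that each wave depends only on earlier waves.

    `dependencies` maps a task id to the collection of task ids it depends on.
    Raises ValueError if the graph contains a cycle.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        sorter.prepare()  # validates the graph; raises CycleError on a cycle
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc

    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)                 # unblocks tasks that depend on these
    return waves


# Example: "load" and "audit" both wait for "extract"; "report" waits for "load".
example = {
    "extract": set(),
    "load": {"extract"},
    "audit": {"extract"},
    "report": {"load"},
}
```

Calling `plan_waves(example)` yields three waves: `extract` first, then `audit` and `load` together, then `report`. Validating at submission time means a cyclic job is rejected up front instead of sitting in the queue forever with dependencies that can never be satisfied.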

Related Questions

1. Design a reliable message queue for background jobs at scale
2. How to implement delayed and scheduled tasks in a distributed system
3. Design a dead-letter queue and retry/backoff strategy for batch jobs


Distributed Task Scheduler Design - Apple Cloud | Voker