Anthropic ML System Design: Scalable Batch Inference
Question Description
Problem overview
You need to design a scalable batch inference system that accepts RESTful batch requests from multiple clients and forwards individual inputs to a fixed, single-item inference API (a pre-trained model behind an immutable interface). Your system must queue incoming batches, apply dynamic batching (by size or time window), dispatch the grouped requests to GPU-backed workers, and return per-item results while preserving order and status.
High-level flow
- Client POSTs a batch to a REST endpoint → request saved to durable queue (async).
- Scheduler pulls queued items, applies dynamic batching policy (max size, timeout) to form GPU-efficient batches.
- Batches are dispatched to workers that manage GPU execution and call the single-item inference API under a concurrency limit.
- Results are aggregated, per-item statuses recorded, and responses delivered back to clients or downstream systems.
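The size-or-timeout batching step above can be sketched with Python's standard `queue` module (names like `form_batch` and the size/window parameters are illustrative, not part of the question):

```python
import queue
import time

def form_batch(q, max_batch_size=32, max_wait_s=0.05):
    """Collect items until the batch is full or the time window
    expires, whichever comes first (size-or-timeout policy)."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window expired: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # queue drained before the window closed
    return batch

# Usage: enqueue five items, then form one batch of at most three.
q = queue.Queue()
for i in range(5):
    q.put({"request_id": "req-1", "index": i, "payload": f"item-{i}"})
batch = form_batch(q, max_batch_size=3, max_wait_s=0.05)
# The remaining two items wait for the next batching window.
```

A real scheduler would run this loop continuously per model/tenant queue; the tuning knobs are exactly the two parameters shown (max size, max wait).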
What interviewers expect you to show
You should explain trade-offs (latency vs. throughput), batching-window tuning, concurrency control, autoscaling rules (metrics-based GPU scaling), fault tolerance (retries with exponential backoff, exactly-once vs. at-least-once semantics), and cost controls that keep GPU utilization high (target ≥70%). Describe observability (GPU utilization, queue length, latency, error rates), backpressure mechanisms, and capacity estimation (roughly, GPU count = ceil(expected items/s × avg model time per item / batch_size)).
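The capacity formula above can be made concrete with a short sketch. The example numbers and the 1.3x headroom factor are assumptions; the formula also implicitly assumes a full batch completes in roughly the single-item latency:

```python
import math

def gpus_needed(items_per_sec, per_item_latency_s, batch_size, headroom=1.3):
    """Rough GPU count per the formula in the text:
    ceil(items/s * avg model time per item / batch_size).
    The headroom multiplier for traffic spikes is an added assumption."""
    raw = items_per_sec * per_item_latency_s / batch_size
    return math.ceil(raw * headroom)

# Assumed workload: ~167 items/s, 120 ms per item, batch size 16.
print(gpus_needed(167, 0.120, 16))  # -> 2 with 1.3x headroom
```

In an interview, state the assumptions out loud (batch latency ≈ single-item latency, steady arrival rate) and show how the answer changes if batch latency grows with batch size.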
Demonstrate practical considerations: ordering guarantees, multi-tenant isolation, model versioning and canarying, retry policies, and how you'd test and monitor performance under traffic spikes (load tests, chaos testing). Use diagrams and a brief ops playbook during the interview to make your design concrete.
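The retry policies mentioned above are commonly implemented as exponential backoff with full jitter; a minimal sketch, with illustrative parameter values (nothing here is prescribed by the question):

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay_s=0.5, max_delay_s=8.0):
    """Retry a call that may fail transiently, using exponential
    backoff with full jitter. Parameter defaults are illustrative."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # jitter avoids thundering herds

# Usage: a stand-in for the single-item inference call that fails twice.
attempts = {"n": 0}
def flaky_infer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return {"label": "ok"}

result = call_with_retries(flaky_infer, base_delay_s=0.01)
```

Pair this with idempotency keys so that at-least-once retries do not double-process items.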
Common Follow-up Questions
- How would you guarantee input-order preservation and idempotency when requests are retried or replayed?
- Describe a capacity-estimation formula and show an example: how many GPUs do you provision for 10,000 requests/min with average model latency X and target batch size Y?
- How do you handle multi-tenant resource isolation and fair-share GPU scheduling when different clients have different SLAs?
- If the single-item inference API introduces variable latency, how would you adapt batching windows and autoscaling to maintain <5s batch latency?
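For the first follow-up, one common approach is to attach an idempotency key to each item and aggregate results in submission order; a toy sketch, with an in-memory dict standing in for a durable key-value store and `infer` standing in for the single-item inference API:

```python
def infer(payload):
    # Stand-in for the single-item inference API.
    return payload.upper()

def aggregate_in_order(batch, result_store):
    """Return results in the original input order. Items whose
    idempotency key was already processed reuse the stored result,
    making at-least-once delivery effectively idempotent."""
    out = []
    for item in batch:  # the batch list preserves submission order
        key = item["idempotency_key"]
        if key not in result_store:  # replayed items skip re-inference
            result_store[key] = infer(item["payload"])
        out.append(result_store[key])
    return out

# Usage: item "a" is replayed, but inference runs only once for it.
store = {}
batch = [{"idempotency_key": "a", "payload": "x"},
         {"idempotency_key": "b", "payload": "y"},
         {"idempotency_key": "a", "payload": "x"}]
print(aggregate_in_order(batch, store))  # -> ['X', 'Y', 'X']
```

In production the result store would need TTLs and the key would typically combine client ID, request ID, and item index.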