Anthropic ML System Design: Scalable Batch Inference
Question Description
Problem overview
You need to design a scalable batch inference system that accepts RESTful batch requests from multiple clients and forwards individual inputs to a fixed, single-item inference API (a pre-trained model behind an immutable interface). Your system must queue incoming batches, apply dynamic batching (by size or time window), dispatch the grouped requests to GPU-backed workers, and return per-item results while preserving order and status.
High-level flow
- Client POSTs a batch to a REST endpoint → request saved to durable queue (async).
- Scheduler pulls queued items, applies dynamic batching policy (max size, timeout) to form GPU-efficient batches.
- Batches are dispatched to workers that manage GPU execution and call the single-item inference API under a concurrency limit.
- Results are aggregated, per-item statuses recorded, and responses delivered back to clients or downstream systems.
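The size-or-timeout batching step above can be sketched with Python's standard `queue` module (names like `form_batch` and the size/window parameters are illustrative, not part of the question):

```python
import queue
import time

def form_batch(q, max_batch_size=32, max_wait_s=0.05):
    """Collect items until the batch is full or the time window
    expires, whichever comes first (size-or-timeout policy)."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window expired: ship a partial batch
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # queue drained before the window closed
    return batch

# Usage: enqueue five items, then form one batch of at most three.
q = queue.Queue()
for i in range(5):
    q.put({"request_id": "req-1", "index": i, "payload": f"item-{i}"})
batch = form_batch(q, max_batch_size=3, max_wait_s=0.05)
# The remaining two items wait for the next batching window.
```

A real scheduler would run this loop continuously per model/tenant queue; the tuning knobs are exactly the two parameters shown (max size, max wait).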
What interviewers expect you to show
You should explain trade-offs (latency vs. throughput), batching-window tuning, concurrency control, autoscaling rules (metrics-based GPU scaling), fault tolerance (retries with exponential backoff, exactly-once vs. at-least-once semantics), and cost controls that keep GPU utilization high (target ≥70%). Describe observability (GPU utilization, queue length, latency, error rates), backpressure mechanisms, and capacity estimation (roughly, GPU count = ceil(expected items/s × avg model time per item / batch_size)).
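The capacity formula above can be made concrete with a short sketch. The example numbers and the 1.3x headroom factor are assumptions; the formula also implicitly assumes a full batch completes in roughly the single-item latency:

```python
import math

def gpus_needed(items_per_sec, per_item_latency_s, batch_size, headroom=1.3):
    """Rough GPU count per the formula in the text:
    ceil(items/s * avg model time per item / batch_size).
    The headroom multiplier for traffic spikes is an added assumption."""
    raw = items_per_sec * per_item_latency_s / batch_size
    return math.ceil(raw * headroom)

# Assumed workload: ~167 items/s, 120 ms per item, batch size 16.
print(gpus_needed(167, 0.120, 16))  # -> 2 with 1.3x headroom
```

In an interview, state the assumptions out loud (batch latency ≈ single-item latency, steady arrival rate) and show how the answer changes if batch latency grows with batch size.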
Demonstrate practical considerations: ordering guarantees, multi-tenant isolation, model versioning and canarying, retry policies, and how you'd test and monitor performance under traffic spikes (load tests, chaos testing). Use diagrams and a brief ops playbook during the interview to make your design concrete.
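The retry policies mentioned above are commonly implemented as exponential backoff with full jitter; a minimal sketch, with illustrative parameter values (nothing here is prescribed by the question):

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay_s=0.5, max_delay_s=8.0):
    """Retry a call that may fail transiently, using exponential
    backoff with full jitter. Parameter defaults are illustrative."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # jitter avoids thundering herds

# Usage: a stand-in for the single-item inference call that fails twice.
attempts = {"n": 0}
def flaky_infer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return {"label": "ok"}

result = call_with_retries(flaky_infer, base_delay_s=0.01)
```

Pair this with idempotency keys so that at-least-once retries do not double-process items.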
Common Follow-up Questions
- How would you guarantee input-order preservation and idempotency when requests are retried or replayed?
- Describe a capacity-estimation formula and show an example: how many GPUs do you provision for 10,000 requests/min with average model latency X and target batch size Y?
- How do you handle multi-tenant resource isolation and fair-share GPU scheduling when different clients have different SLAs?
- If the single-item inference API introduces variable latency, how would you adapt batching windows and autoscaling to maintain <5s batch latency?
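For the first follow-up, one common approach is to attach an idempotency key to each item and aggregate results in submission order; a toy sketch, with an in-memory dict standing in for a durable key-value store and `infer` standing in for the single-item inference API:

```python
def infer(payload):
    # Stand-in for the single-item inference API.
    return payload.upper()

def aggregate_in_order(batch, result_store):
    """Return results in the original input order. Items whose
    idempotency key was already processed reuse the stored result,
    making at-least-once delivery effectively idempotent."""
    out = []
    for item in batch:  # the batch list preserves submission order
        key = item["idempotency_key"]
        if key not in result_store:  # replayed items skip re-inference
            result_store[key] = infer(item["payload"])
        out.append(result_store[key])
    return out

# Usage: item "a" is replayed, but inference runs only once for it.
store = {}
batch = [{"idempotency_key": "a", "payload": "x"},
         {"idempotency_key": "b", "payload": "y"},
         {"idempotency_key": "a", "payload": "x"}]
print(aggregate_in_order(batch, store))  # -> ['X', 'Y', 'X']
```

In production the result store would need TTLs and the key would typically combine client ID, request ID, and item index.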