
Anthropic ML System Design: Scalable Batch Inference

Topics: Batch Inference, Model Deployment, Dynamic Batching
Roles: Software Engineer, ML Engineer, Site Reliability Engineer
Experience: Mid Level, Senior, Staff

Question Description

Problem overview

You need to design a scalable batch inference system that accepts RESTful batch requests from multiple clients and forwards individual inputs to a fixed, single-item inference API (a pre-trained model behind an immutable API). Your system must queue incoming batches, apply dynamic batching (by batch size or time window), dispatch the grouped requests to GPU-backed workers that call the inference API, and return per-item results while preserving input order and reporting per-item status.
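One way to meet the order-and-status requirement is to tag each input with its original index when the batch is fanned out, so the aggregator can reassemble results regardless of completion order. A minimal sketch (field names like batch_id and index are illustrative, not specified by the prompt):

```python
def fan_out(batch_id, items):
    """Expand a client batch into per-item queue entries.

    Tagging each entry with its original index lets the aggregator
    restore client order even if items finish out of order.
    """
    return [
        {"batch_id": batch_id, "index": i, "input": item, "status": "queued"}
        for i, item in enumerate(items)
    ]

def reassemble(entries):
    """Return per-item results in the original client order, with status."""
    return [
        {"index": e["index"], "status": e["status"], "output": e.get("output")}
        for e in sorted(entries, key=lambda e: e["index"])
    ]
```

Because every entry is self-describing, this layout also survives retries and redelivery from a durable queue: a replayed entry lands in the same slot.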

High-level flow

  1. Client POSTs a batch to a REST endpoint → request saved to durable queue (async).
  2. Scheduler pulls queued items, applies dynamic batching policy (max size, timeout) to form GPU-efficient batches.
  3. Batches are dispatched to workers that manage GPU execution and call the single-input inference API concurrently with concurrency limits.
  4. Results are aggregated, per-item statuses recorded, and responses delivered back to clients or downstream systems.
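Step 2 above (the dynamic batching policy) can be sketched as a scheduler loop that closes a batch when either the size cap or the time window is hit, whichever comes first. This is a minimal single-threaded sketch with assumed parameter names (max_size, max_wait_s), not a production scheduler:

```python
import queue
import time

def form_batch(q, max_size=8, max_wait_s=0.05):
    """Collect up to max_size items from q, waiting at most max_wait_s total.

    Returns early with a partial batch when the window expires, trading a
    little GPU efficiency for bounded added latency.
    """
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time window closed; ship whatever we have
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # queue drained within the window
    return batch
```

Tuning the two knobs is the core latency/throughput trade-off: a larger max_size or longer max_wait_s raises GPU utilization but adds queueing delay to every item in the batch.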

What interviewers expect you to show

You should explain trade-offs (latency vs. throughput), batching-window tuning, concurrency control, autoscaling rules (metrics-based GPU scaling), fault tolerance (retries with exponential backoff; exactly-once vs. at-least-once semantics), and cost controls that keep GPU utilization high (target ≥70%). Describe observability (GPU utilization, queue length, latency, error rates), backpressure mechanisms, and capacity estimation (rough GPU count = ceil(expected items/s × avg model time per item / batch_size)).
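The capacity formula above can be made concrete in a few lines. Each GPU processes batch_size items every model-time interval, so required GPUs is arrival rate divided by per-GPU throughput; the optional target_util parameter (my addition, reflecting the ≥70% utilization target) derates that throughput so you provision headroom:

```python
import math

def gpus_needed(items_per_s, model_time_s, batch_size, target_util=1.0):
    """Rough GPU count: ceil(items_per_s * model_time_s / batch_size),
    optionally derated by a target utilization in (0, 1]."""
    per_gpu_throughput = (batch_size / model_time_s) * target_util
    return math.ceil(items_per_s / per_gpu_throughput)
```

For example, 10,000 requests/min (~167 items/s) with a 0.2 s model time and batch size 16 needs ceil(167 × 0.2 / 16) = 3 GPUs at full utilization; derating to 70% utilization raises provisioning for the same load accordingly.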

Demonstrate practical considerations: ordering guarantees, multi-tenant isolation, model versioning and canarying, retry policies, and how you'd test and monitor performance under traffic spikes (load tests, chaos testing). Use diagrams and a brief ops playbook during the interview to make your design concrete.
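The retry-policy discussion usually comes down to a small amount of code: exponential backoff with jitter around each call to the single-item API, combined with idempotent entries so at-least-once redelivery is safe. A hedged sketch (TransientError and the parameter names are illustrative):

```python
import random
import time

class TransientError(Exception):
    """Stand-in for retryable failures (timeouts, 429s, 5xxs)."""

def call_with_retries(fn, item, max_attempts=5, base_delay_s=0.1):
    """Call fn(item), retrying transient failures with exponential
    backoff plus jitter. Safe to use with at-least-once delivery
    only if fn is idempotent (e.g., keyed by the item's index/id)."""
    for attempt in range(max_attempts):
        try:
            return fn(item)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the failure in the item's status
            # 2**attempt growth, randomized to avoid synchronized retry storms
            time.sleep(base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Non-transient errors should not be retried at all; they get recorded as that item's final status so one bad input cannot stall the rest of the batch.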

Common Follow-up Questions

  • How would you guarantee input-order preservation and idempotency when requests are retried or replayed?
  • Describe a capacity-estimation formula and show an example: how many GPUs do you provision for 10,000 requests/min with average model latency X and target batch size Y?
  • How do you handle multi-tenant resource isolation and fair-share GPU scheduling when different clients have different SLAs?
  • If the single-item inference API introduces variable latency, how would you adapt batching windows and autoscaling to maintain <5s batch latency?

Related Questions

1. Design an online (real-time) inference service that serves single requests with strict tail-latency guarantees
2. How to implement GPU autoscaling and job queueing for distributed ML inference workloads
3. Design a dynamic batching service for multiple models with versioning and canary deployment
4. Architect a fault-tolerant inference pipeline with durable queues and exactly-once processing semantics

Scalable Batch Inference System Design - Anthropic | Voker