
Anthropic System Design: Stateless Prompt Playground

Topics:
  • Message Queues
  • Chat Service
  • Stream Processing

Roles:
  • Software Engineer
  • Backend Engineer
  • Platform Engineer

Experience:
  • Mid Level
  • Senior

Question Description

You are asked to design a stateless Prompt Engineering Playground backend (similar to an LLM playground) where each request to the model is independent unless a user explicitly saves context. The system must accept very large prompts (10MB+), support multiple browser windows/tabs per user, and let users share or export full conversations via unique links or JSON.

Core content

  • Ingest: accept large prompt uploads via presigned object-storage URLs or chunked streaming to avoid overloading the app server. Use content-addressable storage (hash + dedupe) for large files.
  • Preprocessing: chunk and optionally compress or summarize very large prompts before sending to the LLM. Consider client-side or edge pre-processing to reduce bandwidth and cost.
  • LLM access: treat the LLM as an external API. Use a queuing layer (message queue or stream processor) to batch, rate-limit, and retry calls. Stream tokens back to the user (WebSocket/SSE) for low-latency UX.
  • Persistence & sharing: store minimal metadata in a primary DB (User, Conversation, PromptRef, ResponseRef) and keep large payloads in object storage. Generate shareable links with randomized IDs or signed short-lived tokens; allow JSON export.
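The content-addressable storage idea above can be sketched as follows. This is a minimal in-memory stand-in for an object store (a real system would back it with S3-style storage); the class and method names are illustrative, not from any particular library.

```python
import hashlib

class ContentAddressableStore:
    """Toy in-memory stand-in for object storage keyed by content hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The SHA-256 digest serves as the object key, so identical
        # prompts are stored exactly once -- deduplication for free.
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = ContentAddressableStore()
a = store.put(b"large prompt body")
b = store.put(b"large prompt body")  # duplicate upload from another tab
assert a == b  # same content -> same key, single stored copy
```

Because the key is derived from the content, the same hash can also serve as a cache key for LLM responses to repeated prompts.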

Flow / stages you should explain in an interview

  1. Client upload (presigned PUT / multipart / chunked).
  2. Server validates, stores metadata, and returns a prompt reference.
  3. Preprocessor (sync/async) chunks or summarizes large prompts.
  4. Queue/worker sends the request to the LLM API and streams the response back via WebSocket or SSE.
  5. Persist response references and make the conversation shareable.
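Stage 4 is where LLM timeouts surface, so the worker's retry policy is worth sketching. Below is one common pattern, exponential backoff with full jitter, shown against a stub call; the function names are hypothetical and the `sleep` parameter is injected only so the example runs instantly.

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry a flaky external call (e.g. an LLM API) with exponential
    backoff and full jitter; re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped backoff.
            sleep(random.uniform(0, base_delay * 2 ** attempt))

# Stub LLM call that times out twice, then succeeds:
calls = {"n": 0}
def flaky_llm_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("LLM timeout")
    return "ok"

result = call_with_retries(flaky_llm_call, sleep=lambda _: None)
assert result == "ok" and calls["n"] == 3
```

In a real deployment the retry budget should be bounded by the queue's visibility timeout so a retried message is not also redelivered to another worker.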

Skill signals to demonstrate

  • Distributed systems: queues, workers, backpressure, and horizontal scaling.
  • Streaming & performance: SSE/WebSockets, partial response delivery, chunking strategies.
  • Cost optimization: deduplication, summarization, batching, and caching of frequent prompts.
  • Data modeling & indexing: how you model User, Conversation, Prompt, Response, and your indexing strategy for fast retrieval.
  • Security & reliability: signed URLs, access control on shared links, retries, idempotency, and monitoring.
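The idempotency signal in the last bullet is easy to demonstrate concretely. A minimal sketch, assuming the client sends a unique idempotency key with each logical request: the server caches the result by key, so a network-level retry does not re-execute the (expensive) LLM call. All names here are illustrative.

```python
import uuid

class IdempotentHandler:
    """Caches results by client-supplied idempotency key so retried
    requests (e.g. after a dropped connection) are not re-executed."""

    def __init__(self):
        self._results = {}

    def handle(self, idempotency_key: str, work):
        if idempotency_key not in self._results:
            self._results[idempotency_key] = work()
        return self._results[idempotency_key]

counter = {"runs": 0}
def expensive_llm_call():
    counter["runs"] += 1
    return "response-ref"

handler = IdempotentHandler()
key = str(uuid.uuid4())  # generated client-side, reused on retry
first = handler.handle(key, expensive_llm_call)
second = handler.handle(key, expensive_llm_call)  # retry hits the cache
assert first == second and counter["runs"] == 1
```

In production the cache would live in a shared store (e.g. Redis) with a TTL, since any worker may receive the retry.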

When answering, walk through trade-offs (latency vs. cost, client vs. server preprocessing, strict statelessness vs. saved context) and be explicit about failure modes and mitigation (partial uploads, LLM timeouts, replay protection).

Common Follow-up Questions

  • How would you optimize for cost when multiple users submit similar large prompts repeatedly? (hint: caching, content-addressable storage, deduplication)
  • Design the sharing and access-control model: how do you implement expiring share links, permission scopes, and anonymous access safely?
  • Explain how you'd stream partial LLM responses to the client while ensuring resume/reconnect semantics across multiple browser tabs.
  • How do you guarantee true statelessness server-side while letting users 'save' context optionally? Describe idempotency, session tokens, and persisted context references.
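For the expiring-share-link follow-up, one stateless approach is an HMAC-signed token that embeds the conversation id and an expiry timestamp, so no server-side session is needed to validate it. A minimal sketch (the secret and field layout are assumptions for illustration; a real system would load the secret from configuration and likely add a permission scope field):

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical; load from config in practice

def make_share_token(conversation_id: str, ttl_s: int, now=None) -> str:
    """Return 'id:expiry:signature' where the signature covers id+expiry."""
    expires = int((now or time.time()) + ttl_s)
    msg = f"{conversation_id}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{conversation_id}:{expires}:{sig}"

def verify_share_token(token: str, now=None) -> bool:
    conversation_id, expires, sig = token.rsplit(":", 2)
    msg = f"{conversation_id}:{int(expires)}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    # Constant-time compare, and reject once the expiry has passed.
    return hmac.compare_digest(sig, expected) and (now or time.time()) < int(expires)

token = make_share_token("conv-42", ttl_s=3600, now=1000.0)
assert verify_share_token(token, now=2000.0)       # within the hour
assert not verify_share_token(token, now=10000.0)  # expired
```

Signed tokens cannot be revoked individually without extra state; if revocation matters, pair them with a server-side denylist or use randomized link IDs stored in the database instead.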

Related Questions

1. Design a stateful chat service with conversation memory and retrieval-augmented generation (RAG)
2. Design a cost-efficient LLM proxy that batches and caches requests for multiple clients
3. Design a large-file ingestion and chunking pipeline for ML inference serving
4. Design a collaborative prompt editor with live sharing and merge/conflict resolution
