DoorDash System Design: Reliable Scalable Payment Processor
Question Description
You are asked to design a backend payment processing service that accepts transactions from multiple e-commerce platforms and integrates with external gateways (Stripe, PayPal) and banks. The primary goal is to process each payment exactly once even when clients retry or external systems fail, while meeting strict SLAs for latency, availability, durability, and security.
Start by describing the high-level flow: client submits a payment request → API gateway returns immediate ACK with a unique transaction ID → request is written to a durable queue → stateless workers pull jobs, validate idempotency keys, and call external gateways → payment state is persisted in a strongly consistent store → asynchronous notifications/webhooks are sent to clients and order services. Include refund flows and reconciliation as part of the lifecycle.
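The flow above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: `PaymentService`, `PaymentRequest`, `submit`, `work_once`, and the injected `charge` callable are all hypothetical names, a `queue.Queue` stands in for the durable queue, and plain dictionaries stand in for the strongly consistent store.

```python
import queue
import uuid
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class PaymentRequest:
    idempotency_key: str  # supplied by the client, stable across retries
    merchant_id: str
    amount_cents: int


class PaymentService:
    """Toy model of: ACK with txn ID -> enqueue -> worker -> persist state."""

    def __init__(self, charge: Callable[[PaymentRequest], str]):
        self._charge = charge          # stand-in for the external gateway call
        self._jobs = queue.Queue()     # stand-in for a durable queue
        self._txn_by_key: Dict[str, str] = {}  # idempotency key -> txn ID
        self._state: Dict[str, str] = {}       # txn ID -> payment state

    def submit(self, req: PaymentRequest) -> str:
        """Return a transaction ID immediately; retries get the same ID."""
        existing = self._txn_by_key.get(req.idempotency_key)
        if existing is not None:
            return existing
        txn_id = str(uuid.uuid4())
        self._txn_by_key[req.idempotency_key] = txn_id
        self._state[txn_id] = "PENDING"
        self._jobs.put((txn_id, req))
        return txn_id

    def work_once(self) -> bool:
        """Process one queued job; return False if the queue is empty."""
        try:
            txn_id, req = self._jobs.get_nowait()
        except queue.Empty:
            return False
        # Skip jobs that were already processed (duplicate delivery).
        if self._state[txn_id] != "PENDING":
            return True
        try:
            self._charge(req)
            self._state[txn_id] = "SUCCEEDED"
        except Exception:
            self._state[txn_id] = "FAILED"
        return True

    def status(self, txn_id: str) -> str:
        return self._state[txn_id]
```

In a real system the key lookup and the state write would need to be atomic in the database, and the webhook notification would be emitted after the state change is committed.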
Key design considerations you must be prepared to discuss: idempotency and deduplication strategies (idempotency keys, write-once markers, compare-and-set), message queue guarantees (at-least-once vs exactly-once semantics, dedupe in consumers), retry/backoff policies, circuit breakers and gateway fallbacks, strong consistency for payment state vs eventual consistency trade-offs (sagas or compensating actions), partitioning and horizontal scaling to handle 10k TPS, durable storage for audit trails, and security requirements (PCI DSS compliance, tokenization, encryption, and fraud detection).
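One of the considerations above, circuit breakers with gateway fallback, can be illustrated with a short sketch. The class and function names (`CircuitBreaker`, `CircuitOpenError`, `charge_with_fallback`) and the threshold values are assumptions made for this example. Note the design choice: the fallback gateway is used only when the circuit is already open, meaning no call was sent to the primary. A failure returned by the primary itself is re-raised, because the charge outcome may be ambiguous and routing it elsewhere could double-charge.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the gateway."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call not attempted")
            # Reset window elapsed: fall through and allow one trial call.
        try:
            result = fn(*args)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def charge_with_fallback(breaker: CircuitBreaker, primary: Callable,
                         secondary: Callable, amount_cents: int):
    """Use the secondary gateway only if the primary was never called."""
    try:
        return breaker.call(primary, amount_cents)
    except CircuitOpenError:
        return secondary(amount_cents)
```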
During the interview you should show system-level thinking: data schemas for transactions, failure scenarios and recovery, monitoring and alerting plans, and concrete trade-offs (latency vs consistency, cost vs redundancy). Demonstrate familiarity with distributed systems patterns (idempotent operations, message queues, retries) and operational concerns for production-grade payment services.
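As one possible starting point for the transaction data schema, the sketch below uses SQLite from Python's standard library. The table name, column names, status values, and helper functions are all illustrative assumptions. It shows three ideas: a unique constraint on `(merchant_id, idempotency_key)` as a write-once marker, a compare-and-set status update, and a composite index that supports queries by merchant, status, and time range.

```python
import sqlite3

SCHEMA = """
CREATE TABLE transactions (
    txn_id          TEXT PRIMARY KEY,
    merchant_id     TEXT NOT NULL,
    idempotency_key TEXT NOT NULL,
    amount_cents    INTEGER NOT NULL CHECK (amount_cents > 0),
    currency        TEXT NOT NULL,
    status          TEXT NOT NULL
        CHECK (status IN ('PENDING', 'SUCCEEDED', 'FAILED', 'REFUNDED')),
    created_at      TEXT NOT NULL,  -- ISO 8601 UTC, sorts lexicographically
    UNIQUE (merchant_id, idempotency_key)
);
CREATE INDEX idx_txn_merchant_status_time
    ON transactions (merchant_id, status, created_at);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def insert_once(conn, txn_id, merchant_id, key, amount_cents, currency, created_at):
    """Insert a PENDING row; return False if the idempotency key was seen."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO transactions (txn_id, merchant_id, idempotency_key,"
                " amount_cents, currency, status, created_at)"
                " VALUES (?, ?, ?, ?, ?, 'PENDING', ?)",
                (txn_id, merchant_id, key, amount_cents, currency, created_at),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def mark_status(conn, txn_id, expected, new):
    """Compare-and-set: update only if the row is still in `expected`."""
    with conn:
        cur = conn.execute(
            "UPDATE transactions SET status = ? WHERE txn_id = ? AND status = ?",
            (new, txn_id, expected),
        )
    return cur.rowcount == 1


def find_by_merchant(conn, merchant_id, status, start, end):
    """Return txn IDs for a merchant and status within [start, end)."""
    return conn.execute(
        "SELECT txn_id FROM transactions"
        " WHERE merchant_id = ? AND status = ?"
        " AND created_at >= ? AND created_at < ?"
        " ORDER BY created_at",
        (merchant_id, status, start, end),
    ).fetchall()
```

At the throughput the question mentions, a single SQLite file would not be the store of choice; the point here is only the shape of the constraints and the index, which carry over to a partitioned relational store.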
Common Follow-up Questions
- How would you implement exactly-once semantics end-to-end when external gateways only provide at-least-once guarantees?
- Design the database schema and indexes for transaction records and explain how you’d query by merchant, time range, and status at high throughput.
- How would you handle a region-wide outage of a primary payment gateway? Describe fallbacks, routing, and consistency implications for in-flight transactions.
- Explain how you would design retries and exponential backoff without breaking idempotency, and how to surface transient vs permanent failures to clients.
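For the last follow-up, one way to sketch retries with exponential backoff is shown below. The exception classes `TransientError` and `PermanentError`, the function `charge_with_retry`, and the default delays are assumptions for illustration. The sketch reuses the same idempotency key on every attempt so the gateway can deduplicate, retries only transient failures, and applies "full jitter" (a random delay between zero and the capped exponential bound).

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Failure that may succeed on retry (timeout, 5xx, rate limit)."""


class PermanentError(Exception):
    """Failure that will not succeed on retry (declined card, bad request)."""


def charge_with_retry(
    charge: Callable[[str, int], str],
    idempotency_key: str,
    amount_cents: int,
    max_attempts: int = 5,
    base_s: float = 0.1,
    cap_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            # The same key is sent on every attempt so duplicates collapse.
            return charge(idempotency_key, amount_cents)
        except PermanentError:
            raise  # surface immediately; retrying cannot help
        except TransientError as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                bound = min(cap_s, base_s * 2 ** attempt)
                sleep(random.uniform(0, bound))  # full jitter
    # Retries exhausted: the caller sees the last transient failure.
    raise last_error
```

A client-facing API could map `PermanentError` to a non-retryable error response and exhausted transient retries to a "try again later" response, which is one way to surface the distinction the question asks about.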