Visa System Design: Scalable A/B Experiment Service
Question Description
Overview
You are asked to design a scalable experiment (A/B) service that deterministically assigns users to variants for features at runtime. The frontend calls an endpoint with a userId and featureName and expects a stable variant (e.g., "control", "treatment_A"). Your design must support multiple independent experiments, admin overrides, assignment logging for metrics collection, and straightforward experiment lifecycle management.
Flow & stages
- Request: frontend calls GET /assign?userId=123&featureName=new_ui
- Resolve experiment configuration (is feature enabled? active experiment, variant weights)
- Check overrides or manual enrollments for the user
- Compute deterministic bucketing (consistent hashing on userId+featureName)
- Return variant and asynchronously log the assignment for metrics
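The stages above can be sketched end to end. This is a minimal in-process sketch, not the real service: the `CONFIGS`, `OVERRIDES`, and `EVENT_LOG` structures are hypothetical in-memory stand-ins for the config store, overrides service, and event pipeline, and `assign` plays the role of the `/assign` handler.

```python
import hashlib
import json
import queue

# Hypothetical in-memory stand-ins for the real backing stores (assumptions).
CONFIGS = {
    "new_ui": {"enabled": True, "variants": {"control": 0.5, "treatment_A": 0.5}},
}
OVERRIDES = {("admin_42", "new_ui"): "treatment_A"}  # per-user manual enrollments
EVENT_LOG = queue.Queue()  # stand-in for the durable event stream

def bucket(user_id: str, feature_name: str) -> float:
    """Deterministic position in [0, 1) derived from userId+featureName."""
    digest = hashlib.sha256(f"{user_id}:{feature_name}".encode()).hexdigest()
    return int(digest[:15], 16) / 16**15

def assign(user_id: str, feature_name: str) -> str:
    # 1. Resolve experiment configuration.
    config = CONFIGS.get(feature_name)
    if config is None or not config["enabled"]:
        return "control"  # graceful default when no active experiment exists
    # 2. Check overrides / manual enrollments first.
    variant = OVERRIDES.get((user_id, feature_name))
    if variant is None:
        # 3. Deterministic bucketing against cumulative variant weights.
        point, cumulative = bucket(user_id, feature_name), 0.0
        variant = next(iter(config["variants"]))
        for name, weight in config["variants"].items():
            cumulative += weight
            if point < cumulative:
                variant = name
                break
    # 4. Log the assignment asynchronously (here: enqueue for a consumer).
    EVENT_LOG.put(json.dumps(
        {"user": user_id, "feature": feature_name, "variant": variant}))
    return variant
```

Because the bucket depends only on the hash of userId+featureName, repeated calls for the same user and feature always return the same variant, while different features bucket the same user independently.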
Key components to propose
- Config store for experiments (multi-tenant schema, versioned configs)
- Fast read path: cache (Redis) with eventual consistency and fallback to config DB
- Deterministic bucketing: consistent hash + normalized cumulative weights
- Overrides service: small, strongly consistent store for per-user overrides
- Logging / event pipeline: write assignment events to a durable stream (Kafka) for downstream analytics
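The bucketing component deserves its own sketch. The helper below is an assumed illustration of "hash + normalized cumulative weights": weights need not sum to 1, because they are normalized before the cumulative walk, and insertion order of the dict fixes the variant ranges.

```python
import hashlib

def choose_variant(user_id: str, feature_name: str,
                   weights: dict[str, float]) -> str:
    """Map the user to a point in [0, 1), then walk normalized cumulative weights."""
    digest = hashlib.sha256(f"{user_id}:{feature_name}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total  # normalize so the ranges cover [0, 1)
        if point < cumulative:
            return name
    return next(reversed(weights))  # guard against float rounding at the top edge
```

One design consequence: if an experiment's weights change from 50/50 to 40/60, only users whose hash point falls in the moved boundary region (0.4 to 0.5) are reassigned; everyone else keeps their variant, which limits churn during ramp-ups.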
Non-functional & trade-offs
Design for <100ms P95 reads with horizontal scaling: use in-memory caches near the API, shard configs for large scale, and replicate read-only config snapshots for HA. Consider eventual consistency for config changes vs. immediate rollback needs. Plan for graceful degradation (return default variant if config store is unreachable) and strong audit logs for experiment integrity.
Skill signals
You should demonstrate understanding of distributed systems (caching, sharding), data modeling for multi-tenancy, deterministic hashing algorithms, API design, and how to integrate logging/metrics for experiment analysis.
Common Follow-up Questions
- How would you handle reassigning users when experiment allocation changes mid-run while preserving statistical validity?
- Describe how you'd implement consistent hashing so assignments remain stable across service version upgrades and multiple datacenters.
- What storage strategy would you choose for overrides (latency vs. consistency) and how would you replicate/backup that data?
- How do you design the logging and metrics pipeline to capture exposure, conversions, and anomalous assignment patterns at scale?
- How would you support complex targeting (user attributes, segments) while keeping assignment deterministic and performant?