Bloomberg ML System Design: Real-time Fraud Detection
Question Description
You are asked to design a low-latency, production-ready machine learning system that detects fraudulent e-commerce transactions in real time.
The core task is to ingest a stream of transaction events (each carrying user profile, payment method, device, location, and purchase-history attributes), compute real-time features, run an online inference model, and output a fraud probability and a binary decision within a strict latency bound (<=100 ms). You must also provide data storage for auditing and retraining, a safe deployment strategy for model updates, and monitoring for model quality and data drift.
High-level flow you should discuss:
- Data ingestion and validation (streaming platforms, gateways)
- Real-time feature computation (feature store, windowing, sessionization)
- Low-latency model serving (lightweight ensemble or neural network optimized for inference)
- Decisioning and alerting (rules + score thresholds, manual review queue)
- Storage and offline pipelines (cold store for full transactions, warm store for recent context)
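As one illustration of the feature-computation and decisioning steps above, here is a minimal in-memory sketch of a sliding-window "velocity" feature (transactions per user in a trailing window) feeding a rules-plus-threshold decision. The names `SlidingWindowCounter` and `decide`, the 10-minute window, and every threshold value are assumptions made for the example, not part of the question. A production system would typically keep this state in a stream processor or a low-latency key-value store rather than in process memory.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts events per key within the trailing `window_s` seconds.

    Assumes timestamps arrive in non-decreasing order per key.
    """

    def __init__(self, window_s: float):
        self.window_s = window_s
        self._events = defaultdict(deque)  # key -> timestamps, oldest first

    def add(self, key: str, ts: float) -> int:
        """Record an event at time `ts` and return the count in the window."""
        q = self._events[key]
        q.append(ts)
        # Evict timestamps outside the half-open window (ts - window_s, ts].
        while q and q[0] <= ts - self.window_s:
            q.popleft()
        return len(q)


def decide(score: float, txn_count_10m: int,
           review_at: float = 0.5, block_at: float = 0.9,
           max_txn_10m: int = 5) -> str:
    """Combine a hard velocity rule with score thresholds (all hypothetical)."""
    if txn_count_10m > max_txn_10m or score >= block_at:
        return "block"
    if score >= review_at:
        return "review"  # route to the manual review queue
    return "approve"


counter = SlidingWindowCounter(window_s=600)
count = counter.add("user-1", ts=0.0)
print(decide(score=0.62, txn_count_10m=count))
```

Keeping the rule and the score thresholds in one small function makes it straightforward to log which condition fired, which helps with auditing.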
Skill signals interviewers expect:
- Knowledge of online inference and feature stores (latency, consistency, cold starts)
- Trade-offs between precision/recall and user experience (thresholding, cost of false positives)
- Scalability and reliability design (sharding, autoscaling, failover)
- Monitoring and observability (latency SLOs, data/model drift detection, alerting)
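The precision/recall trade-off above can be made concrete by attaching a cost to each error type and picking the threshold that minimizes total cost on labeled historical data. The sketch below assumes hypothetical per-transaction costs `fp_cost` (a legitimate customer blocked) and `fn_cost` (a fraud missed); the function names and the candidate grid are also illustrative assumptions.

```python
def expected_cost(labels, scores, threshold, fp_cost, fn_cost):
    """Total cost of flagging every transaction with score >= threshold."""
    cost = 0.0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y == 0:
            cost += fp_cost  # legitimate transaction blocked
        elif not flagged and y == 1:
            cost += fn_cost  # fraudulent transaction approved
    return cost


def best_threshold(labels, scores, fp_cost, fn_cost, candidates=None):
    """Return the candidate threshold with the lowest total cost."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return min(
        candidates,
        key=lambda t: expected_cost(labels, scores, t, fp_cost, fn_cost),
    )


# Toy data: 1 = fraud, 0 = legitimate.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(best_threshold(labels, scores, fp_cost=1.0, fn_cost=10.0))
```

When missed fraud is much more expensive than a false positive, the chosen threshold drops; when customer friction dominates, it rises. In an interview it is worth saying where the cost estimates would come from, since the threshold is only as good as those numbers.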
Be ready to justify design trade-offs, propose A/B or shadow testing strategies, and describe responses to evolving fraud patterns.
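A shadow test can be sketched as follows: the current model (the "champion") makes the live decision, while a candidate (the "challenger") scores the same transaction and is only logged. The function name `score_with_shadow`, the log format, and the toy models are assumptions made for illustration. In practice the challenger would usually run asynchronously, off the request path, so that it cannot add latency.

```python
def score_with_shadow(txn, champion, challenger, log):
    """Return the champion's score; record the challenger's score for comparison."""
    live = champion(txn)
    try:
        shadow = challenger(txn)
    except Exception:
        # A failing challenger must never affect the live decision.
        shadow = None
    log.append({"txn_id": txn["id"], "live": live, "shadow": shadow})
    return live


def champion(txn):
    # Hypothetical stand-in for the deployed model.
    return 0.2 if txn["amount"] < 500 else 0.7


def challenger(txn):
    # Hypothetical stand-in for the candidate model.
    return 0.1 if txn["amount"] < 1000 else 0.9


shadow_log = []
score = score_with_shadow({"id": "t1", "amount": 750}, champion, challenger, shadow_log)
print(score, shadow_log)
```

Comparing the logged pairs offline, once fraud labels arrive, shows whether the challenger would have improved precision or recall before it receives any live traffic.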
Common Follow-up Questions
- How would you detect and handle concept drift in fraud patterns without degrading user experience?
- Describe the architecture of a feature store that supports sub-100 ms lookups for recent and historical features.
- What are the trade-offs between using a lightweight GBDT versus a deep neural network for online inference in this low-latency setting?
- How would you design canary/rolling model updates and shadow testing to avoid downtime or sudden drops in precision?
- Explain how you would instrument monitoring and alerting to catch model degradation, data pipeline failures, and latency SLO breaches.
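For the drift questions, one widely used signal is the Population Stability Index (PSI), which compares the binned distribution of a feature or model score in a recent window against a baseline window. The sketch below is a minimal pure-Python version; the bin edges, the `eps` floor for empty bins, and the sample data are assumptions chosen for the example. PSI detects shifts in input or score distributions only. Detecting concept drift in the fraud label itself also requires delayed labels, such as chargebacks.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two non-empty samples.

    `edges` are ascending interior bin boundaries; len(edges) + 1 bins result.
    """
    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # number of edges at or below v
            counts[idx] += 1
        total = len(values)
        # Floor each fraction at eps so the logarithm is defined for empty bins.
        return [max(c / total, eps) for c in counts]

    p = fractions(expected)
    q = fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
recent = [0.8, 0.85, 0.9, 0.95, 0.6, 0.7, 0.9, 0.99]
print(psi(baseline, recent, edges=[0.25, 0.5, 0.75]))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but any alerting threshold should be validated against the system's own history.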