Google ML System Design: Fuzzy Video Deduplication
Question Description
You must design a fuzzy deduplication system that detects near-duplicate short videos in real time at Google scale. The system should ingest millions of uploads per day, reach a deduplication decision within seconds, and allow creators to appeal false positives.
Core requirements: use learned embeddings (video-frame + audio + text fusion) to detect fuzzy matches rather than exact hashes; support extremely high concurrency (thousands of queries per second); be fault tolerant and cost-aware; and include a human-in-the-loop appeal and calibration workflow.
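The multimodal fusion requirement can be sketched minimally: concatenate per-modality embeddings and project them into one compact, L2-normalized space. This is only an illustration with NumPy; the random projection stands in for a learned fusion head, and all dimensions are assumptions, not values from the question.

```python
import numpy as np

def fuse_embeddings(frame_emb, audio_emb, text_emb, out_dim=128, seed=0):
    """Concatenate per-modality embeddings and project to a shared
    compact space (random projection as a stand-in for a learned head)."""
    x = np.concatenate([frame_emb, audio_emb, text_emb])
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((out_dim, x.shape[0])) / np.sqrt(x.shape[0])
    z = W @ x
    return z / np.linalg.norm(z)  # unit norm so dot product = cosine sim

# Hypothetical per-modality dimensions: 512 frame, 128 audio, 64 text.
fused = fuse_embeddings(np.ones(512), np.ones(128), np.ones(64))
```

Normalizing the fused vector lets the downstream ANN index use inner product as cosine similarity.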
High-level flow:
1. Lightweight pre-filter (perceptual hashing, uploader metadata, and text/audio heuristics) to fast-path obviously unique uploads.
2. Frame sampling and feature extraction with a compact embedding model.
3. Two-stage retrieval: low-dim ANN for recall (HNSW/IVF+PQ via FAISS/ScaNN/Milvus), then high-dim rerank for precision.
4. Decision logic (block/soft-flag/warn) and immediate notification with appeal links.
5. Human review queue that feeds labeled cases back for offline retraining and threshold calibration.
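The two-stage retrieval step can be sketched with plain NumPy: a cheap low-dimensional pass for recall, then an exact rerank of the survivors at full dimension. The truncated 128-d view stands in for a real PQ/HNSW index (FAISS/ScaNN/Milvus), and the corpus size and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_FULL, D_LOW, K_COARSE, K_FINAL = 10_000, 512, 128, 100, 5

# Corpus of unit-norm embeddings, plus a low-dim view for cheap recall.
corpus = rng.standard_normal((N, D_FULL)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
corpus_low = corpus[:, :D_LOW]  # stand-in for a compressed ANN index

def search(query):
    q = query / np.linalg.norm(query)
    # Stage 1: approximate recall in the low-dim space (top K_COARSE).
    coarse = np.argpartition(-(corpus_low @ q[:D_LOW]), K_COARSE)[:K_COARSE]
    # Stage 2: exact rerank of the candidates in the full space.
    scores = corpus[coarse] @ q
    order = np.argsort(-scores)[:K_FINAL]
    return coarse[order], scores[order]

# A near-duplicate: item 42 with small perturbation should be retrieved.
ids, scores = search(corpus[42] + 0.01 * rng.standard_normal(D_FULL))
```

The cascade keeps the expensive full-dimension comparison off the hot path: only K_COARSE of N vectors are reranked per query.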
Skill signals you should show: designing scalable ANN indices and sharding, embedding model tradeoffs (128 vs 512 dims), latency vs accuracy modeling (memory, network, search cost, P95 latency), metrics (precision/recall/F1, A/B and offline evaluation), operational concerns (monitoring, rollback, cost, consistency), and designing a robust human-in-the-loop feedback loop to reduce false positives over time.
Common Follow-up Questions
- How would you quantify and model the tradeoff between embedding dimension, memory footprint, and P95 latency (show calculations for 128 vs 512 dims)?
- Design the ANN sharding and replication strategy to support thousands of concurrent nearest-neighbor queries with low tail latency—how do you handle hot shards?
- What techniques would you use to reduce false positives (creator-appealed false matches) while preserving recall—discuss cascade thresholds, multimodal signals, and reranking?
- How would you instrument and A/B test the deduplication pipeline in production to validate accuracy and user impact before full rollout?
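For the sharding follow-up, the basic scatter-gather pattern is: partition the index across shards, fan the query out to every shard, and merge the partial top-k results. A minimal in-process sketch, assuming four hypothetical shards (a real deployment would replicate shards and fan out over RPC, with hedged requests to tame tail latency):

```python
import heapq
import numpy as np

rng = np.random.default_rng(1)
N_SHARDS, DIM, PER_SHARD, K = 4, 128, 1000, 5

# Each shard holds a disjoint slice of the index.
shards = [rng.standard_normal((PER_SHARD, DIM)).astype(np.float32)
          for _ in range(N_SHARDS)]

def shard_topk(shard_id, query, k):
    """Local top-k on one shard; returns (score, shard_id, local_id)."""
    scores = shards[shard_id] @ query
    idx = np.argpartition(-scores, k)[:k]
    return [(float(scores[i]), shard_id, int(i)) for i in idx]

def fanout_search(query, k=K):
    # Query every shard (in parallel in production), merge partial top-k.
    partials = [h for s in range(N_SHARDS) for h in shard_topk(s, query, k)]
    return heapq.nlargest(k, partials)

hits = fanout_search(rng.standard_normal(DIM).astype(np.float32))
```

Because each shard returns its own top-k, the merge is correct regardless of how documents are partitioned; hot shards are then handled by replication and load-aware routing rather than by changing this merge logic.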