
Snapchat ML System Design: Real-Time Multimodal Moderation

Topics:
  • Multimodal Models
  • Content Moderation
  • Text Classification
Roles:
  • Machine Learning Engineer
  • Data Scientist
  • Applied ML Engineer
Experience:
  • Mid Level
  • Senior
  • Staff

Question Description

You're asked to design a real-time multimodal harmful-content detection system for a large social platform covering text, images, and video. The system should flag hate speech, harassment, graphic violence, explicit material, and spam in incoming posts; integrate with recommendations to down-rank or filter content; and surface suspicious items for human review, all within latency, scale, and accuracy constraints.

High-level flow: ingest posts via API/streaming; pre-process text (tokenization, profanity masks) and media (thumbnailing, keyframe extraction, audio-to-text); run fast heuristics and lightweight classifiers for immediate filtering; then run a multimodal fusion model (text + image + audio/video features) to produce a harmfulness score and category. Use thresholding and an ensemble of models to decide among actions: auto-block, down-rank, flag for review, or shadow-mode logging for offline evaluation.
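The decision layer at the end of that flow can be sketched as below. This is a minimal illustration: the threshold values, the placeholder lexicon, and all function names are assumptions for the sketch, not production values or a real model call.

```python
from dataclasses import dataclass, field

# Illustrative thresholds; in practice these are tuned per category
# against precision/recall targets.
BLOCK_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.80
DOWNRANK_THRESHOLD = 0.60

@dataclass
class Post:
    text: str
    image_features: list = field(default_factory=list)
    audio_transcript: str = ""

def fast_heuristics(post: Post) -> float:
    """Cheap first pass: lexicon match gives an immediate prior score."""
    banned = {"badword1", "badword2"}  # placeholder lexicon
    tokens = post.text.lower().split()
    return 0.9 if any(t in banned for t in tokens) else 0.0

def fusion_model_score(post: Post) -> float:
    """Stand-in for the served multimodal fusion model (text + image + audio)."""
    # A real system would call a model server here; we return a dummy score.
    return max(fast_heuristics(post), 0.1)

def decide_action(score: float) -> str:
    """Map a harmfulness score onto the actions named above."""
    if score >= BLOCK_THRESHOLD:
        return "auto-block"
    if score >= REVIEW_THRESHOLD:
        return "flag-for-review"
    if score >= DOWNRANK_THRESHOLD:
        return "down-rank"
    return "allow"  # still logged in shadow mode for offline evaluation
```

Keeping the thresholds as explicit named constants makes the precision-vs-recall trade-off a configuration decision rather than a code change.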

What you'll be evaluated on: system architecture for horizontal scaling (sharding, batching, model serving); low-latency inference strategies (quantization, pruning, model distillation, GPU/accelerator utilization); data engineering for continuous retraining (handling class imbalance, label quality); and product/policy trade-offs (precision vs. recall, explainability to moderators, privacy constraints). Be ready to discuss metrics (precision, recall, F1, latency percentiles), monitoring and auditing, feedback loops from human review, and how you would evolve the pipeline as new harmful content types emerge.
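As a concrete reference for the metrics discussion, here is a minimal sketch of precision/recall/F1 from confusion counts and a nearest-rank latency percentile. The function names and the nearest-rank convention are my own choices for the sketch.

```python
import math

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Classification metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def latency_percentile(samples_ms, pct):
    """Nearest-rank percentile over observed per-request latencies (ms)."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

In production these counts come from delayed human-review labels, so precision/recall dashboards typically lag the latency dashboards by the labeling turnaround time.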

Common Follow-up Questions

  • How would you design the model serving layer to meet <100 ms latency at 10k req/s? Discuss batching, caching, GPU vs. CPU, and autoscaling strategies.
  • What evaluation and monitoring pipeline would you implement to detect model drift, measure precision/recall in production, and trigger retraining?
  • How would you handle video moderation specifically (frame sampling, temporal models, audio transcription) while minimizing compute cost?
  • Discuss strategies to make detections explainable for moderators and to reduce false positives without sacrificing recall.
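For the first follow-up, the core of dynamic batching is a bounded wait: collect requests until either the batch is full or a small time budget expires, so batching never dominates the latency SLO. A sketch, assuming a thread-safe request queue; the constants are illustrative.

```python
import time
from queue import Queue, Empty

MAX_BATCH = 32    # illustrative batch-size cap
MAX_WAIT_MS = 5   # wait budget kept well under the end-to-end latency SLO

def collect_batch(request_queue: Queue) -> list:
    """Gather up to MAX_BATCH requests, or fewer if the wait budget expires."""
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + MAX_WAIT_MS / 1000
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break  # budget expired with the queue empty
    return batch
```

Under heavy load the batch fills before the deadline (high GPU utilization); under light load the deadline fires first (low added latency), which is the trade-off interviewers usually want articulated.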

Related Questions

1. Design a scalable image moderation pipeline for real-time detection of graphic content
2. How to build a text-only hate-speech detection system: metrics, datasets, and deployment
3. Architect a feedback-driven retraining pipeline for moderation models that handles class imbalance
4. Design a content ranking filter that integrates harmfulness scores into a recommendation engine


Real-Time Multimodal Moderation — Snapchat ML Design | Voker