
Snapchat ML System Design: Real-Time Multimodal Moderation

Topics:
  • Multimodal Models
  • Content Moderation
  • Text Classification
Roles:
  • Machine Learning Engineer
  • Data Scientist
  • Applied ML Engineer
Experience:
  • Mid Level
  • Senior
  • Staff

Question Description

You're asked to design a real-time multimodal harmful-content detection system for a large social platform covering text, images, and video. The system should flag hate speech, harassment, graphic violence, explicit material, and spam in incoming posts; integrate with recommendations to down-rank or filter content; and surface suspicious items for human review, all within latency, scale, and accuracy constraints.

High-level flow: ingest posts via API/streaming; pre-process text (tokenization, profanity masks) and media (thumbnailing, keyframe extraction, audio-to-text); run fast heuristics and lightweight classifiers for immediate filtering; then run a multimodal fusion model (text + image + audio/video features) to produce a harmfulness score and category. Use thresholding and an ensemble of models to decide among actions: auto-block, down-rank, flag for review, or shadow-mode logging for offline evaluation.
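The decision layer at the end of that flow can be sketched as below. This is a minimal illustration: the threshold values, the placeholder lexicon, and all function names are assumptions for the sketch, not production values or a real model call.

```python
from dataclasses import dataclass, field

# Illustrative thresholds; in practice these are tuned per category
# against precision/recall targets.
BLOCK_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.80
DOWNRANK_THRESHOLD = 0.60

@dataclass
class Post:
    text: str
    image_features: list = field(default_factory=list)
    audio_transcript: str = ""

def fast_heuristics(post: Post) -> float:
    """Cheap first pass: lexicon match gives an immediate prior score."""
    banned = {"badword1", "badword2"}  # placeholder lexicon
    tokens = post.text.lower().split()
    return 0.9 if any(t in banned for t in tokens) else 0.0

def fusion_model_score(post: Post) -> float:
    """Stand-in for the served multimodal fusion model (text + image + audio)."""
    # A real system would call a model server here; we return a dummy score.
    return max(fast_heuristics(post), 0.1)

def decide_action(score: float) -> str:
    """Map a harmfulness score onto the actions named above."""
    if score >= BLOCK_THRESHOLD:
        return "auto-block"
    if score >= REVIEW_THRESHOLD:
        return "flag-for-review"
    if score >= DOWNRANK_THRESHOLD:
        return "down-rank"
    return "allow"  # still logged in shadow mode for offline evaluation
```

Keeping the thresholds as explicit named constants makes the precision-vs-recall trade-off a configuration decision rather than a code change.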

What you'll be evaluated on: system architecture for horizontal scaling (sharding, batching, model serving); low-latency inference strategies (quantization, pruning, model distillation, GPU/accelerator utilization); data engineering for continuous retraining (handling class imbalance, label quality); and product/policy trade-offs (precision vs. recall, explainability to moderators, privacy constraints). Be ready to discuss metrics (precision, recall, F1, latency percentiles), monitoring and auditing, feedback loops from human review, and how you would evolve the pipeline as new harmful content types emerge.
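As a concrete reference for the metrics discussion, here is a minimal sketch of precision/recall/F1 from confusion counts and a nearest-rank latency percentile. The function names and the nearest-rank convention are my own choices for the sketch.

```python
import math

def precision_recall_f1(tp: int, fp: int, fn: int):
    """Classification metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def latency_percentile(samples_ms, pct):
    """Nearest-rank percentile over observed per-request latencies (ms)."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

In production these counts come from delayed human-review labels, so precision/recall dashboards typically lag the latency dashboards by the labeling turnaround time.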

Common Follow-up Questions

  • How would you design the model serving layer to meet <100 ms latency at 10k req/s? Discuss batching, caching, GPU vs. CPU, and autoscaling strategies.
  • What evaluation and monitoring pipeline would you implement to detect model drift, measure precision/recall in production, and trigger retraining?
  • How would you handle video moderation specifically (frame sampling, temporal models, audio transcription) while minimizing compute cost?
  • Discuss strategies to make detections explainable for moderators and to reduce false positives without sacrificing recall.
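For the first follow-up, the core of dynamic batching is a bounded wait: collect requests until either the batch is full or a small time budget expires, so batching never dominates the latency SLO. A sketch, assuming a thread-safe request queue; the constants are illustrative.

```python
import time
from queue import Queue, Empty

MAX_BATCH = 32    # illustrative batch-size cap
MAX_WAIT_MS = 5   # wait budget kept well under the end-to-end latency SLO

def collect_batch(request_queue: Queue) -> list:
    """Gather up to MAX_BATCH requests, or fewer if the wait budget expires."""
    batch = [request_queue.get()]  # block until at least one request arrives
    deadline = time.monotonic() + MAX_WAIT_MS / 1000
    while len(batch) < MAX_BATCH:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except Empty:
            break  # budget expired with the queue empty
    return batch
```

Under heavy load the batch fills before the deadline (high GPU utilization); under light load the deadline fires first (low added latency), which is the trade-off interviewers usually want articulated.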

Related Questions

1. Design a scalable image moderation pipeline for real-time detection of graphic content
2. How to build a text-only hate-speech detection system: metrics, datasets, and deployment
3. Architect a feedback-driven retraining pipeline for moderation models that handles class imbalance
4. Design a content ranking filter that integrates harmfulness scores into a recommendation engine


Real-Time Multimodal Moderation — Snapchat ML Design | Voker