Snowflake ML System Design: Cloud Anomaly Detection
Question Description
You are asked to design a cloud-based anomaly detection and response service that ingests OS snapshots from client machines, runs an AI model to classify Normal vs Abnormal, and triggers automated mitigations (shutdown, quarantine, email) while providing queryable history for clients.
High-level flow
- Snapshot ingestion: clients push encrypted OS snapshots to a secure API endpoint. You validate auth, perform lightweight pre-processing, and enqueue snapshots for processing.
- Online inference: a scalable model-serving tier (containerized or serverless) performs real-time scoring or streaming inference, producing anomaly scores and labels.
- Response orchestration: a response service evaluates policy rules and, on confirmed anomalies, issues actions (shutdown/quarantine APIs) and publishes alerts to notification/email services.
- Persistence & queries: store detection events, machine states, and audit logs in an indexable store so clients can query history and admins can audit responses.
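The four stages above can be sketched end to end in a few lines. This is a minimal in-memory illustration, not a production design: the `Snapshot` and `Detection` types, the `ANOMALY_THRESHOLD` value, the token check, and `toy_model` are all hypothetical stand-ins, and a real system would replace the `queue.Queue` with Kafka or SQS and the list with an indexable store.

```python
import queue
from dataclasses import dataclass, field

ANOMALY_THRESHOLD = 0.8  # assumed cut-off for illustration only


@dataclass
class Snapshot:
    machine_id: str
    features: dict  # pre-processed summary of the OS snapshot (hypothetical shape)


@dataclass
class Detection:
    machine_id: str
    score: float
    label: str
    actions: list = field(default_factory=list)


def ingest(snapshot, token, valid_tokens, q):
    """Stage 1: validate auth, then enqueue the snapshot for processing."""
    if token not in valid_tokens:
        raise PermissionError("invalid client token")
    q.put(snapshot)


def score(snapshot, model):
    """Stage 2: run the model and turn its score into a label."""
    s = model(snapshot.features)
    label = "Abnormal" if s >= ANOMALY_THRESHOLD else "Normal"
    return Detection(snapshot.machine_id, s, label)


def respond(detection):
    """Stage 3: apply a trivial policy that maps Abnormal to mitigations."""
    if detection.label == "Abnormal":
        detection.actions = ["quarantine", "email"]
    return detection


def process_all(q, model, store):
    """Drain the queue and persist every detection (stage 4)."""
    while not q.empty():
        store.append(respond(score(q.get(), model)))


def history(store, machine_id):
    """Query path: all detections recorded for one machine."""
    return [d for d in store if d.machine_id == machine_id]


def toy_model(features):
    # Stand-in for the real model: fraction of unrecognized processes.
    return features["unknown_procs"] / max(features["total_procs"], 1)


q, store = queue.Queue(), []
ingest(Snapshot("m1", {"unknown_procs": 1, "total_procs": 100}), "t", {"t"}, q)
ingest(Snapshot("m2", {"unknown_procs": 90, "total_procs": 100}), "t", {"t"}, q)
process_all(q, toy_model, store)
```

Keeping each stage as a separate function mirrors the service boundaries in the flow, so any one stage can be scaled or swapped without touching the others.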
Skill signals and trade-offs
You should demonstrate knowledge of streaming vs. batch inference, message queues (Kafka/SQS), autoscaling model deployment (Kubernetes, FaaS), low-latency design (cold start mitigation, model warm pools), HA patterns (multi-AZ, replication), and security (TLS, encryption-at-rest, RBAC). Be ready to discuss false-positive handling, rollbacks, cost controls (spot instances, autoscaling policies), and monitoring/observability for anomaly drift and model performance.
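One way to make the drift-monitoring point concrete is to compare the distribution of recent anomaly scores against a baseline. The sketch below uses the Population Stability Index (PSI), which is one common choice among several and is not prescribed by the question. It assumes scores lie in [0, 1]; the bin count, the `eps` floor, and the function name `psi` are illustrative choices.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of scores in [0, 1]."""

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so that a score of exactly 1.0 falls in the last bin.
            counts[min(int(v * bins), bins - 1)] += 1
        # Floor empty bins at eps to keep the logarithm defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Baseline week: 10% of scores are high. Recent window: 50% are high.
baseline = [0.1] * 90 + [0.9] * 10
recent = [0.1] * 50 + [0.9] * 50
drift = psi(baseline, recent)
```

A frequently quoted rule of thumb treats a PSI above roughly 0.2 as a significant shift, but the alerting threshold should be tuned on the service's own score history.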
Common Follow-up Questions
- How would you design continuous learning or online model updates that adapt to concept drift while minimizing downtime during model deployment?
- Describe trade-offs between synchronous (blocking) responses and asynchronous workflows for automated mitigation. How do you ensure low-latency but safe actions?
- How would you reduce false positives and implement alert deduplication and escalation to avoid alert fatigue?
- Explain how you would secure OS snapshot transport and storage, including authentication, encryption, and privacy-preserving options.
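For the deduplication and escalation follow-up, one possible answer is a per-machine cooldown window: the first alert in a window is sent, repeats are suppressed, and a repeat count reaching a threshold triggers escalation. The sketch below is an assumption-laden illustration; the class name `AlertDeduplicator`, the `cooldown_s` and `escalate_after` parameters, and the three return values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertState:
    first_seen: float  # timestamp that opened the current window
    count: int = 1     # alerts seen in the current window


class AlertDeduplicator:
    def __init__(self, cooldown_s=300.0, escalate_after=3):
        self.cooldown_s = cooldown_s
        self.escalate_after = escalate_after
        self._open = {}  # (machine_id, kind) -> AlertState

    def handle(self, machine_id, kind, now):
        """Return 'send', 'suppress', or 'escalate' for an incoming alert."""
        key = (machine_id, kind)
        state = self._open.get(key)
        # No open window, or the window has expired: start a new one.
        if state is None or now - state.first_seen >= self.cooldown_s:
            self._open[key] = AlertState(first_seen=now)
            return "send"
        state.count += 1
        # Escalate exactly once, when the repeat count hits the threshold.
        if state.count == self.escalate_after:
            return "escalate"
        return "suppress"


dedup = AlertDeduplicator(cooldown_s=300, escalate_after=3)
decisions = [dedup.handle("m1", "abnormal", t) for t in (0, 10, 20, 30, 400)]
```

Passing `now` in explicitly, rather than reading the clock inside `handle`, keeps the logic deterministic and easy to test.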