ML System Design: Cloud Anomaly Detection | Snowflake

Question Description

You are asked to design a cloud-based anomaly detection and response service. It ingests OS snapshots from client machines, runs an AI model to classify each snapshot as Normal or Abnormal, and triggers automated mitigations (shutdown, quarantine, email) while giving clients a queryable history of detections.

High-level flow

Snapshot ingestion: clients push encrypted OS snapshots to a secure API endpoint. You validate auth, perform lightweight pre-processing, and enqueue snapshots for processing.
Online inference: a scalable model-serving tier (containerized or serverless) performs real-time scoring or streaming inference, producing anomaly scores and labels.
Response orchestration: a response service evaluates policy rules and, on confirmed anomalies, issues actions (shutdown/quarantine APIs) and publishes alerts to notification/email services.
Persistence & queries: store detection events, machine states, and audit logs in an indexable store so clients can query history and admins can audit responses.
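The four stages above can be sketched end to end in a single process. This is a minimal illustration, not a production design: an in-memory `queue.Queue` stands in for Kafka/SQS, SQLite stands in for the indexable event store, and `score` is a placeholder for the real model. All names, thresholds, and the `VALID_TOKENS` set are assumptions made for the example.

```python
import queue
import sqlite3
import time
from dataclasses import dataclass

VALID_TOKENS = {"demo-token"}  # hypothetical stand-in for a real auth service


@dataclass
class Snapshot:
    machine_id: str
    features: list      # hypothetical pre-processed feature vector
    received_at: float


def ingest(q, token, machine_id, features):
    """Stage 1: validate auth, do a lightweight check, enqueue."""
    if token not in VALID_TOKENS:
        raise PermissionError("invalid token")
    if not features:
        raise ValueError("empty snapshot")
    q.put(Snapshot(machine_id, features, time.time()))


def score(snapshot):
    """Stage 2: placeholder model. Mean feature magnitude, capped at 1.0."""
    mean = sum(abs(x) for x in snapshot.features) / len(snapshot.features)
    return min(mean, 1.0)


def decide_action(score_value, quarantine_at=0.7, shutdown_at=0.9):
    """Stage 3: simple policy rule mapping a score to a mitigation."""
    if score_value >= shutdown_at:
        return "shutdown"
    if score_value >= quarantine_at:
        return "quarantine"
    return "none"


def make_db():
    """Stage 4 setup: event table indexed for per-machine history queries."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE events "
        "(machine_id TEXT, ts REAL, score REAL, label TEXT, action TEXT)"
    )
    db.execute("CREATE INDEX idx_machine_ts ON events (machine_id, ts)")
    return db


def process_all(q, db, abnormal_at=0.7):
    """Drain the queue: score, decide, and persist each snapshot."""
    processed = 0
    while True:
        try:
            snap = q.get_nowait()
        except queue.Empty:
            return processed
        s = score(snap)
        label = "Abnormal" if s >= abnormal_at else "Normal"
        db.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
            (snap.machine_id, snap.received_at, s, label, decide_action(s)),
        )
        processed += 1


def history(db, machine_id):
    """Client-facing query: detection history for one machine, oldest first."""
    return db.execute(
        "SELECT label, action FROM events WHERE machine_id = ? ORDER BY ts",
        (machine_id,),
    ).fetchall()
```

In a real deployment each stage would be a separate service so that ingestion, inference, and response can scale and fail independently; the queue between them is what decouples their throughput.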

Skill signals and trade-offs

You should demonstrate knowledge of streaming vs. batch inference, message queues (Kafka/SQS), autoscaling model deployment (Kubernetes, FaaS), low-latency design (cold start mitigation, model warm pools), HA patterns (multi-AZ, replication), and security (TLS, encryption-at-rest, RBAC). Be ready to discuss false-positive handling, rollbacks, cost controls (spot instances, autoscaling policies), and monitoring/observability for anomaly drift and model performance.
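Two of these topics, false-positive handling and drift monitoring, are easy to illustrate concretely. The sketch below shows one possible approach for each, under stated assumptions: a confirmation gate that only triggers a mitigation after several consecutive abnormal scores for the same machine, and a monitor that flags drift when the recent abnormal rate moves away from an expected baseline. The class names, thresholds, and window sizes are hypothetical; a real system would likely use richer statistics.

```python
from collections import defaultdict, deque


class ConfirmationGate:
    """Confirm an anomaly only after `required` consecutive abnormal scores.

    This reduces the chance that a single noisy score shuts a machine down.
    """

    def __init__(self, threshold=0.8, required=3):
        self.threshold = threshold
        self.required = required
        self._streaks = defaultdict(int)  # machine_id -> current abnormal streak

    def observe(self, machine_id, score):
        if score >= self.threshold:
            self._streaks[machine_id] += 1
        else:
            self._streaks[machine_id] = 0  # any normal score resets the streak
        return self._streaks[machine_id] >= self.required


class DriftMonitor:
    """Flag drift when the recent abnormal rate departs from a baseline."""

    def __init__(self, baseline_rate, window=100, tolerance=0.1):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self._recent = deque(maxlen=window)  # sliding window of 0/1 labels

    def record(self, is_abnormal):
        self._recent.append(1 if is_abnormal else 0)

    def drifted(self):
        # Wait for a full window so a handful of early events cannot trip it.
        if len(self._recent) < self._recent.maxlen:
            return False
        rate = sum(self._recent) / len(self._recent)
        return abs(rate - self.baseline_rate) > self.tolerance
```

The gate trades detection latency for fewer false mitigations: with `required=3`, a genuine anomaly is acted on two snapshots later than it would be otherwise, which is a trade-off worth stating explicitly in the interview.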
