Snowflake ML System Design: Cloud Anomaly Detection

Topics: Anomaly Detection, Online Inference, Model Deployment
Roles: Software Engineer, ML Engineer, Site Reliability Engineer
Experience: Mid Level, Senior, Staff

Question Description

You are asked to design a cloud-based anomaly detection and response service. The service ingests OS snapshots from client machines, runs an AI model to classify each snapshot as Normal or Abnormal, and triggers automated mitigations (shutdown, quarantine, email). It must also provide a queryable detection history for clients.

High-level flow

  • Snapshot ingestion: clients push encrypted OS snapshots to a secure API endpoint. You validate auth, perform lightweight pre-processing, and enqueue snapshots for processing.
  • Online inference: a scalable model-serving tier (containerized or serverless) performs real-time scoring or streaming inference, producing anomaly scores and labels.
  • Response orchestration: a response service evaluates policy rules and, on confirmed anomalies, issues actions (shutdown/quarantine APIs) and publishes alerts to notification/email services.
  • Persistence & queries: store detection events, machine states, and audit logs in an indexable store so clients can query history and admins can audit responses.
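The four stages above can be sketched end to end in a few lines of Python. This is only an illustrative, in-memory sketch: the names (Snapshot, Detection, ingest, score_snapshot, decide_actions, query_history), the token check, the 0.8 threshold, and the placeholder scoring rule are all assumptions made for the example, and the deque and list stand in for a real message queue and an indexable store.

```python
from collections import deque
from dataclasses import dataclass, field

# Every name and constant here is hypothetical and for illustration only.
VALID_TOKENS = {"secret-token"}  # stand-in for real authentication
THRESHOLD = 0.8                  # assumed anomaly-score cut-off


@dataclass
class Snapshot:
    machine_id: str
    token: str
    features: dict  # pre-processed numeric features, name -> value in [0, 1]


@dataclass
class Detection:
    machine_id: str
    score: float
    label: str
    actions: list = field(default_factory=list)


queue = deque()   # stand-in for Kafka/SQS
event_store = []  # stand-in for an indexable event store


def ingest(snapshot):
    """Snapshot ingestion: validate auth, then enqueue for processing."""
    if snapshot.token not in VALID_TOKENS:
        return False
    queue.append(snapshot)
    return True


def score_snapshot(snapshot):
    """Online inference placeholder: fraction of features above 0.9."""
    values = list(snapshot.features.values())
    if not values:
        return 0.0
    return sum(v > 0.9 for v in values) / len(values)


def decide_actions(score):
    """Response orchestration: a single policy rule keyed on the score."""
    return ["quarantine", "email"] if score >= THRESHOLD else []


def process_queue():
    """Drain the queue, score each snapshot, and persist the detection."""
    while queue:
        snap = queue.popleft()
        score = score_snapshot(snap)
        label = "Abnormal" if score >= THRESHOLD else "Normal"
        event_store.append(
            Detection(snap.machine_id, score, label, decide_actions(score))
        )


def query_history(machine_id):
    """Persistence & queries: return all detections for one machine."""
    return [d for d in event_store if d.machine_id == machine_id]
```

In a real design each function would sit behind its own service boundary, so ingestion can acknowledge the client quickly while scoring and response scale independently.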

Skill signals and trade-offs

You should demonstrate knowledge of streaming vs. batch inference, message queues (Kafka/SQS), autoscaling model deployment (Kubernetes, FaaS), low-latency design (cold start mitigation, model warm pools), HA patterns (multi-AZ, replication), and security (TLS, encryption-at-rest, RBAC). Be ready to discuss false-positive handling, rollbacks, cost controls (spot instances, autoscaling policies), and monitoring/observability for anomaly drift and model performance.
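One way to make the false-positive discussion concrete is a confirmation gate that allows a destructive action only after k of the last n snapshots from a machine score as abnormal. The sketch below is one possible approach, not a prescribed one; ConfirmationGate, its parameters, and the default values are hypothetical.

```python
from collections import defaultdict, deque


class ConfirmationGate:
    """Confirm an anomaly only if k of the last n scores crossed the threshold.

    Hypothetical helper; k, n, and threshold are assumed tuning knobs.
    """

    def __init__(self, k=3, n=5, threshold=0.8):
        if not 1 <= k <= n:
            raise ValueError("require 1 <= k <= n")
        self.k = k
        self.threshold = threshold
        # One bounded window of recent boolean verdicts per machine.
        self._recent = defaultdict(lambda: deque(maxlen=n))

    def observe(self, machine_id, score):
        """Record a score; return True when the anomaly counts as confirmed."""
        window = self._recent[machine_id]
        window.append(score >= self.threshold)
        return sum(window) >= self.k
```

Raising k lowers the false-positive rate at the cost of slower mitigation, which is the same latency-versus-safety trade-off the interviewer is likely to probe.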

Common Follow-up Questions

  • How would you design continuous learning or online model updates that adapt to concept drift while minimizing downtime during model deployment?
  • Describe trade-offs between synchronous (blocking) responses and asynchronous workflows for automated mitigation. How do you ensure low-latency but safe actions?
  • How would you reduce false positives and implement alert deduplication and escalation to avoid alert fatigue?
  • Explain how you would secure OS snapshot transport and storage, including authentication, encryption, and privacy-preserving options.
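For the deduplication and escalation question, a candidate might sketch something like the following: repeated alerts for the same machine and rule are suppressed inside a time window, and a sustained repeat count triggers a single escalation. The class name, the 300-second window, and the escalation count are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlertState:
    last_sent: float  # time the most recent alert was actually sent
    count: int        # alerts seen for this key since that send


class AlertDeduplicator:
    """Hypothetical sketch of windowed alert deduplication with escalation."""

    def __init__(self, window_s=300.0, escalate_after=3):
        self.window_s = window_s
        self.escalate_after = escalate_after
        self._state = {}  # (machine_id, rule) -> AlertState

    def handle(self, machine_id, rule, now):
        """Return 'send', 'suppress', or 'escalate' for an incoming alert."""
        key = (machine_id, rule)
        state = self._state.get(key)
        # First alert for this key, or the previous window has expired.
        if state is None or now - state.last_sent > self.window_s:
            self._state[key] = AlertState(last_sent=now, count=1)
            return "send"
        state.count += 1
        # Escalate exactly once, when the repeat count reaches the limit.
        if state.count == self.escalate_after:
            return "escalate"
        return "suppress"
```

Passing the timestamp in as an argument, rather than reading the clock inside the class, keeps the logic deterministic and easy to test.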

Related Questions

1. Design a streaming anomaly detection pipeline for telemetry data with real-time alerts
2. How would you architect online model serving for low-latency inference at scale (Kubernetes vs serverless)?
3. Design an alerting and incident response system that integrates automated mitigations and human review
4. How to build an auditable event store for security detections and long-term forensic queries
