Snapchat ML Coding: K-Means + MapReduce Implementation
Question Description
Overview
You will implement K-means clustering in two stages: first a deterministic single-machine routine that returns final centroids and per-point assignments, then a MapReduce adaptation (mapper.py and reducer.py) so the algorithm can run across distributed workers. The interview tests both algorithmic correctness (centroid updates, assignments, termination) and engineering for distributed iteration and I/O formats.
What you'll be asked to do
- Build a function that accepts n d-dimensional points and k and returns k centroids and an assignment array. Document assumptions (consistent dimensionality, numeric inputs) and initialization behavior (random seed or deterministic init such as k-means++). Explain how you compute centroids and detect convergence (max iterations or centroid movement tolerance).
- Provide mapper.py and reducer.py that read point lines from stdin and emit centroid updates to stdout. Describe the centroid-file format mappers read, the key/value text encoding you choose, and how reducers aggregate sums and counts to produce next-round centroids.
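As a concrete reference for the single-machine stage, a minimal sketch might look like the following (NumPy-based; the function name `kmeans`, its signature, and the seeded random initialization are illustrative choices, not mandated by the question — k-means++ would be a drop-in alternative init):

```python
import numpy as np

def kmeans(points, k, max_iters=100, tol=1e-4, seed=0):
    """Deterministic K-means: returns (centroids, assignments).

    points: (n, d) array-like of numeric values; k: number of clusters.
    Converges when the largest centroid movement falls below tol,
    or after max_iters iterations.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Seeded init: sample k distinct points as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to its nearest centroid (squared Euclidean).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = dists.argmin(axis=1)
        # Recompute centroids; reseed any empty cluster from a random point.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
            else:
                new_centroids[j] = points[rng.integers(len(points))]
        # Stop once the largest per-centroid movement is below tol.
        moved = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if moved < tol:
            break
    return centroids, assignments
```

The empty-cluster reseeding above is one of several reasonable policies (another is splitting the largest cluster); stating the choice explicitly is part of what the interview looks for.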
Flow / Iterations
A driver script distributes the current centroid file to all mappers each round, runs the MapReduce job, collects reducer outputs as the new centroids file, and repeats until centroids change less than tol or max_iters is reached. Persist only the centroids and optionally assignments or diagnostics between rounds.
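One way to realize the per-round map and reduce steps is sketched below. The text formats (comma-separated coordinates per line for points and centroids, `(cluster_id, partial_sum, count)` records out of the mapper) are illustrative assumptions, not a format the question prescribes; in an actual Hadoop-streaming mapper.py/reducer.py these functions would read `sys.stdin` and print tab-separated lines instead of exchanging Python objects:

```python
def load_centroids(path):
    """Read the shared centroid file: one centroid per line, comma-separated floats."""
    with open(path) as f:
        return [[float(x) for x in line.split(",")] for line in f if line.strip()]

def map_lines(lines, centroids):
    """mapper.py core: for each point line, emit (cluster_id, point_as_sum, count=1)."""
    for line in lines:
        if not line.strip():
            continue
        point = [float(x) for x in line.split(",")]
        # Nearest centroid by squared Euclidean distance.
        cid = min(range(len(centroids)),
                  key=lambda j: sum((p - c) ** 2
                                    for p, c in zip(point, centroids[j])))
        yield cid, point, 1

def reduce_records(records):
    """reducer.py core: aggregate sums and counts per cluster into new centroids."""
    sums, counts = {}, {}
    for cid, vec, cnt in records:
        if cid not in sums:
            sums[cid] = [0.0] * len(vec)
        sums[cid] = [s + v for s, v in zip(sums[cid], vec)]
        counts[cid] = counts.get(cid, 0) + cnt
    return {cid: [s / counts[cid] for s in sums[cid]] for cid in sums}
```

Because the emitted (sum, count) pairs are associative and commutative, the same aggregation logic can serve as a combiner on each mapper to cut shuffle traffic before the reducer runs.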
Skills signaled
You should demonstrate: unsupervised learning fundamentals (K-means objective, centroid math), practical implementation (numerical stability, empty-cluster handling), distributed processing patterns (mapper/reducer I/O, aggregation, driver orchestration), and engineering judgment on termination and initialization strategies.
Common Follow-up Questions
- How would you implement k-means++ initialization in your single-machine function, and why does it matter for convergence?
- Describe how you would handle empty clusters during iterations in both single-machine and MapReduce implementations.
- If you needed to scale to streaming data, how would you adapt your implementation to an online or mini-batch K-means algorithm?
- Explain the communication and computational complexity per iteration in your MapReduce design; how does it grow with cluster count k and dimensionality d?
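For the first follow-up, one hedged sketch of k-means++ seeding: each successive centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which tends to produce well-separated starting points and faster, more reliable convergence than uniform random init. The function name and pure-Python weighting below are illustrative:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids via k-means++ weighted sampling."""
    rng = random.Random(seed)
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                  for ctr in centroids)
              for pt in points]
        total = sum(d2)
        if total == 0:  # every point coincides with a chosen centroid
            centroids.append(list(rng.choice(points)))
            continue
        # Sample the next centroid with probability proportional to d2.
        r, acc = rng.random() * total, 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(list(pt))
                break
    return centroids
```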