Snapchat ML Coding: K-Means + MapReduce Implementation
Question Description
Overview
You will implement K-means clustering in two stages: first a deterministic single-machine routine that returns final centroids and per-point assignments, then a MapReduce adaptation (mapper.py and reducer.py) so the algorithm can run across distributed workers. The interview tests both algorithmic correctness (centroid updates, assignments, termination) and engineering for distributed iteration and I/O formats.
What you'll be asked to do
- Build a function that accepts n d-dimensional points and k and returns k centroids and an assignment array. Document assumptions (consistent dimensionality, numeric inputs) and initialization behavior (random seed or deterministic init such as k-means++). Explain how you compute centroids and detect convergence (max iterations or centroid movement tolerance).
- Provide mapper.py and reducer.py that read point lines from stdin and emit centroid updates to stdout. Describe the centroid-file format mappers read, the key/value text encoding you choose, and how reducers aggregate sums and counts to produce next-round centroids.
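As a concrete reference for the single-machine stage, a minimal sketch might look like the following (NumPy-based; the function name `kmeans`, its signature, and the seeded random initialization are illustrative choices, not mandated by the question — k-means++ would be a drop-in alternative init):

```python
import numpy as np

def kmeans(points, k, max_iters=100, tol=1e-4, seed=0):
    """Deterministic K-means: returns (centroids, assignments).

    points: (n, d) array-like of numeric values; k: number of clusters.
    Converges when the largest centroid movement falls below tol,
    or after max_iters iterations.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Seeded init: sample k distinct points as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to its nearest centroid (squared Euclidean).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignments = dists.argmin(axis=1)
        # Recompute centroids; reseed any empty cluster from a random point.
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
            else:
                new_centroids[j] = points[rng.integers(len(points))]
        # Stop once the largest per-centroid movement is below tol.
        moved = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if moved < tol:
            break
    return centroids, assignments
```

The empty-cluster reseeding above is one of several reasonable policies (another is splitting the largest cluster); stating the choice explicitly is part of what the interview looks for.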
Flow / Iterations
A driver script distributes the current centroid file to all mappers each round, runs the MapReduce job, collects reducer outputs as the new centroids file, and repeats until centroids change less than tol or max_iters is reached. Persist only the centroids and optionally assignments or diagnostics between rounds.
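One way to realize the per-round map and reduce steps is sketched below. The text formats (comma-separated coordinates per line for points and centroids, `(cluster_id, partial_sum, count)` records out of the mapper) are illustrative assumptions, not a format the question prescribes; in an actual Hadoop-streaming mapper.py/reducer.py these functions would read `sys.stdin` and print tab-separated lines instead of exchanging Python objects:

```python
def load_centroids(path):
    """Read the shared centroid file: one centroid per line, comma-separated floats."""
    with open(path) as f:
        return [[float(x) for x in line.split(",")] for line in f if line.strip()]

def map_lines(lines, centroids):
    """mapper.py core: for each point line, emit (cluster_id, point_as_sum, count=1)."""
    for line in lines:
        if not line.strip():
            continue
        point = [float(x) for x in line.split(",")]
        # Nearest centroid by squared Euclidean distance.
        cid = min(range(len(centroids)),
                  key=lambda j: sum((p - c) ** 2
                                    for p, c in zip(point, centroids[j])))
        yield cid, point, 1

def reduce_records(records):
    """reducer.py core: aggregate sums and counts per cluster into new centroids."""
    sums, counts = {}, {}
    for cid, vec, cnt in records:
        if cid not in sums:
            sums[cid] = [0.0] * len(vec)
        sums[cid] = [s + v for s, v in zip(sums[cid], vec)]
        counts[cid] = counts.get(cid, 0) + cnt
    return {cid: [s / counts[cid] for s in sums[cid]] for cid in sums}
```

Because the emitted (sum, count) pairs are associative and commutative, the same aggregation logic can serve as a combiner on each mapper to cut shuffle traffic before the reducer runs.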
Skills signaled
You should demonstrate: unsupervised learning fundamentals (K-means objective, centroid math), practical implementation (numerical stability, empty-cluster handling), distributed processing patterns (mapper/reducer I/O, aggregation, driver orchestration), and engineering judgment on termination and initialization strategies.
Common Follow-up Questions
- How would you implement k-means++ initialization in your single-machine function, and why does it matter for convergence?
- Describe how you would handle empty clusters during iterations in both single-machine and MapReduce implementations.
- If you needed to scale to streaming data, how would you adapt your implementation to an online or mini-batch K-means algorithm?
- Explain the communication and computational complexity per iteration in your MapReduce design; how does it grow with cluster count k and dimensionality d?
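For the first follow-up, one hedged sketch of k-means++ seeding: each successive centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which tends to produce well-separated starting points and faster, more reliable convergence than uniform random init. The function name and pure-Python weighting below are illustrative:

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids via k-means++ weighted sampling."""
    rng = random.Random(seed)
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                  for ctr in centroids)
              for pt in points]
        total = sum(d2)
        if total == 0:  # every point coincides with a chosen centroid
            centroids.append(list(rng.choice(points)))
            continue
        # Sample the next centroid with probability proportional to d2.
        r, acc = rng.random() * total, 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centroids.append(list(pt))
                break
    return centroids
```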