
Top-k Video Similarity Search - Google ML Coding Interview

Topics:
Graph Algorithms
Pathfinding / Shortest-Path Transforms
Distributed Systems (remote adjacency)
Roles:
Machine Learning Engineer
Data Scientist
Experience:
Mid Level
Senior

Question Description

You are asked to find the top-k most similar videos to a given source in a directed, weighted video-similarity graph where similarity composes multiplicatively along paths (path weight = product of edge weights). Each edge weight is a float (typically in [0,1]) and cycles may exist; the similarity from s to t is the maximum product over all paths from s to t.

In the coding stage you will implement top_k_similar(graph, source_id, k) that returns up to k (video_id, similarity) pairs ordered by decreasing similarity. Expect to discuss numerical stability (products of many small floats), how you handle zero-weight edges, and how to make the search efficient on very large graphs (millions of nodes) when you only need the top-k results.
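A minimal local sketch of this function, assuming the graph is an in-memory adjacency dict of `{node: [(neighbor, weight)]}` with weights in (0, 1]: transform each weight w to -log(w) so maximizing a product becomes minimizing a sum of non-negative costs, then run a Dijkstra-style best-first search that finalizes nodes in order of decreasing similarity and stops after k results.

```python
import heapq
import math

def top_k_similar(graph, source_id, k):
    """Return up to k (video_id, similarity) pairs, most similar first.

    graph: dict mapping node -> list of (neighbor, weight), weights in (0, 1].
    Zero-weight edges are skipped outright (they can never improve a path).
    Works on the -log transform: maximizing a product of weights in (0, 1]
    equals minimizing a sum of non-negative costs, so a Dijkstra-style
    search finalizes nodes in strictly non-increasing similarity order
    and remains correct in the presence of cycles.
    """
    heap = [(0.0, source_id)]  # (cost, node); cost = -sum(log w) on best known path
    finalized = set()
    results = []
    while heap and len(results) < k:
        cost, node = heapq.heappop(heap)
        if node in finalized:
            continue  # stale heap entry; a cheaper path already finalized this node
        finalized.add(node)
        if node != source_id:  # the source itself is not a result
            results.append((node, math.exp(-cost)))
        for neighbor, weight in graph.get(node, []):
            if weight <= 0 or neighbor in finalized:
                continue
            heapq.heappush(heap, (cost - math.log(weight), neighbor))
    return results
```

For example, on `{'s': [('a', 0.9), ('b', 0.5)], 'a': [('b', 0.8)]}`, `top_k_similar(graph, 's', 2)` returns a at 0.9 and b at 0.72, since the indirect path s→a→b (0.9 × 0.8) beats the direct edge (0.5).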

The interview typically flows in two parts: (1) local algorithm plus complexity and edge-case analysis — show that you can transform multiplicative scores (e.g., via a -log transform) into a shortest-path-style best-first search, and explain why the result remains correct in the presence of cycles; (2) system and remote-data awareness — describe how you would adapt the solution when adjacency lists are sharded and must be fetched over RPC. For the non-local stage, expect to specify the required interfaces (for example, fetch_neighbors(node_id) -> List[(neighbor, weight)]), a caching budget, timeout and retry behavior, and the trade-offs among batching RPCs, memory limits, and result latency.
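For the remote stage, one illustrative way to wrap the fetch_neighbors interface is a client that memoizes hot nodes and retries timed-out RPCs with exponential backoff. Here `rpc_fetch` stands in for the real RPC call, and the cache size, retry count, and backoff are placeholder budgets to be tuned per deployment, not prescribed values:

```python
import functools
import time

class RemoteAdjacency:
    """Illustrative client for sharded adjacency lists fetched over RPC.

    rpc_fetch(node_id) -> List[(neighbor, weight)] stands in for the real
    RPC; cache size, retry count, and backoff are placeholder budgets.
    """

    def __init__(self, rpc_fetch, cache_size=100_000, retries=2, backoff_s=0.05):
        self._rpc_fetch = rpc_fetch
        self._retries = retries
        self._backoff_s = backoff_s
        # Memoize hot nodes so repeated expansions don't re-issue RPCs.
        self.fetch_neighbors = functools.lru_cache(maxsize=cache_size)(
            self._fetch_with_retry
        )

    def _fetch_with_retry(self, node_id):
        for attempt in range(self._retries + 1):
            try:
                return tuple(self._rpc_fetch(node_id))
            except TimeoutError:
                if attempt == self._retries:
                    return ()  # degrade gracefully: treat the node as a leaf
                time.sleep(self._backoff_s * (2 ** attempt))  # exponential backoff
```

Returning an empty tuple on exhausted retries trades completeness for bounded latency, which is exactly the kind of trade-off the interviewer will want stated explicitly.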

Skills signaled: graph algorithms (max-product / log-transform + best-first search), numerical stability and precision, large-scale search optimization, and distributed/remote data access patterns (API contracts, caching, failure modes). Use precise complexity and operational constraints when you explain your design.

Common Follow-up Questions

  • How does your approach change if edge weights can exceed 1 (so cycles could amplify similarity)? Explain correctness and termination.
  • What numerical issues arise from multiplying many small floats and how would you address them (e.g., log-transform, underflow, precision)?
  • How would you update top-k results efficiently if edges are added/removed frequently (dynamic graph / incremental updates)?
  • Design a shard-aware coordination strategy to compute global top-k with minimal RPCs and bounded latency — what state must each shard return?
  • If you can only return approximate top-k under strict latency, how would you provide quality/latency trade-offs and error bounds?
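On the underflow follow-up: a product of a few hundred sub-unity weights silently collapses to 0.0 in float64 (the smallest positive double is near 1e-308), destroying all ordering information, while the equivalent log-sum stays comfortably representable. A minimal demonstration:

```python
import math

weights = [0.1] * 400  # 400 edges, each with weight 0.1

product = 1.0
for w in weights:
    product *= w       # 1e-400 underflows float64 to exactly 0.0

log_score = sum(math.log(w) for w in weights)  # 400 * ln(0.1) ~= -921.03

print(product)    # 0.0 -- all ordering information lost
print(log_score)  # finite, so scores remain comparable
```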

Related Questions

1. Find the maximum-probability path between two nodes (max-product path) in a weighted directed graph
2. Implement top-k nearest neighbors for high-dimensional video embeddings (ANN approaches vs exact search)
3. Design a distributed graph traversal that returns top-k reachable nodes with sharded adjacency lists
4. Compute personalized PageRank / truncated random-walk scores for top-k recommendations
5. How to adapt Dijkstra/A* variants for multiplicative edge weights and large-scale graphs
