
Netflix ML Coding: Compute TF-IDF for Corpus Implementation

Topics: TF-IDF, Vectorization, Text Processing
Roles: Machine Learning Engineer, Data Scientist, NLP Engineer
Experience: Entry Level, Mid Level, Senior

Question Description

You are asked to implement a Python function that computes TF-IDF scores for every token in every document of a small corpus. The input is a list of strings, one per document. Tokenize each document using str.split() only, derive the vocabulary from those tokens, compute term frequency (TF) as count(t, d) / total_tokens(d) and inverse document frequency (IDF) as log(N / df(t)), and return a list of dictionaries, one per document, mapping each token present in that document to its TF-IDF score.

Focus on these stages during the interview:

  • Tokenization and vocabulary extraction: call str.split() and collect per-document token counts and document frequencies (df).
  • TF calculation: compute normalized term frequency per document (counts divided by total tokens in that document).
  • IDF calculation: compute IDF using the natural log over the corpus size N and document frequencies.
  • Combine TF and IDF: multiply per-token TF by IDF and return one dict per document with floating-point scores.
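The four stages above can be sketched as a minimal baseline implementation (the function name `compute_tf_idf` is illustrative, not part of the prompt):

```python
import math
from collections import Counter

def compute_tf_idf(corpus):
    """Return one {token: tf-idf score} dict per document in corpus."""
    # Stage 1: tokenize with str.split() only, per the problem statement.
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each token.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Stage 3: IDF with the plain log(N / df(t)) definition from the prompt.
    # A token appearing in every document gets IDF = log(1) = 0.
    idf = {t: math.log(n_docs / df_t) for t, df_t in df.items()}

    # Stages 2 and 4: normalized TF per document, multiplied by IDF.
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)  # note: an empty document would divide by zero
        scores.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return scores
```

For example, on the corpus `["a b a", "a c"]` the token "a" appears in both documents, so its IDF (and hence its TF-IDF) is 0 in both.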

Skill signals the interviewer will look for

  • Correct implementation of TF and IDF formulas and careful handling of document counts (df) and total tokens.
  • Efficient counting and use of Python data structures (collections.Counter, dicts) and clear indexing/slicing logic.
  • Awareness of numerical edge cases (e.g., tokens that appear in all documents) and clear, testable code. You may be asked to optimize with vectorization (NumPy) or to produce sparse representations for larger corpora.

Common Follow-up Questions

  • How would you modify the IDF definition to use smoothing (e.g., log(1 + N / df(t))) and why might you do that?
  • Show a vectorized NumPy implementation that computes TF-IDF for the whole corpus efficiently; how would you handle sparse output?
  • How would you normalize TF-IDF vectors (L1 vs L2) and why is normalization important for comparing documents (e.g., cosine similarity)?
  • If a token appears in zero documents because of preprocessing differences, how should your implementation handle unseen tokens or empty documents?
  • Explain how you would extend this to n-grams (bigrams/trigrams) and discuss the trade-offs in vocabulary size and memory usage.
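For the first and third follow-ups, smoothing and normalization can each be sketched in a few lines (these helper names are illustrative):

```python
import math

def smoothed_idf(n_docs, df_t):
    """Smoothed IDF, log(1 + N / df(t)): stays strictly positive even
    for a token that appears in every document, instead of zeroing it out."""
    return math.log(1 + n_docs / df_t)

def l2_normalize(vec):
    """L2-normalize a {token: score} dict so that cosine similarity
    between two documents reduces to a plain dot product."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec
```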

Related Questions

1. Implement a Count Vectorizer: build document-term count matrices from a corpus using str.split() tokenization
2. Compute cosine similarity between documents using TF-IDF vectors and return the top-k most similar documents
3. Design an inverted index for term-to-document lookup and show how to use it for quick DF and search
4. Implement TF-IDF with sublinear TF scaling (1 + log(tf)) and compare results to raw TF
5. Write a sparse matrix-based TF-IDF transformer using scipy.sparse and explain memory/time benefits
