
Netflix ML Coding: Compute TF-IDF for Corpus Implementation

Topics: TF-IDF, Vectorization, Text Processing
Roles: Machine Learning Engineer, Data Scientist, NLP Engineer
Experience: Entry Level, Mid Level, Senior

Question Description

You are asked to implement a Python function that computes TF-IDF scores for every token in every document of a small corpus. The input is a list of strings, one per document. Tokenize each document using str.split() only, derive the vocabulary from those tokens, compute term frequency (TF) as count(t, d) / total_tokens(d) and inverse document frequency (IDF) as log(N / df(t)), and return a list of dictionaries, one per document, mapping each token present in that document to its TF-IDF score.

Focus on these stages during the interview:

  • Tokenization and vocabulary extraction: call str.split() and collect per-document token counts and document frequencies (df).
  • TF calculation: compute normalized term frequency per document (counts divided by total tokens in that document).
  • IDF calculation: compute IDF using the natural log over the corpus size N and document frequencies.
  • Combine TF and IDF: multiply per-token TF by IDF and return one dict per document with floating-point scores.
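The four stages above can be sketched as a minimal baseline implementation (the function name `compute_tf_idf` is illustrative, not part of the prompt):

```python
import math
from collections import Counter

def compute_tf_idf(corpus):
    """Return one {token: tf-idf score} dict per document in corpus."""
    # Stage 1: tokenize with str.split() only, per the problem statement.
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each token.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    # Stage 3: IDF with the plain log(N / df(t)) definition from the prompt.
    # A token appearing in every document gets IDF = log(1) = 0.
    idf = {t: math.log(n_docs / df_t) for t, df_t in df.items()}

    # Stages 2 and 4: normalized TF per document, multiplied by IDF.
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)  # note: an empty document would divide by zero
        scores.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return scores
```

For example, on the corpus `["a b a", "a c"]` the token "a" appears in both documents, so its IDF (and hence its TF-IDF) is 0 in both.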

Skill signals the interviewer will look for

  • Correct implementation of TF and IDF formulas and careful handling of document counts (df) and total tokens.
  • Efficient counting and use of Python data structures (collections.Counter, dicts) and clear indexing/slicing logic.
  • Awareness of numerical edge cases (e.g., tokens that appear in all documents) and clear, testable code. You may be asked to optimize with vectorization (NumPy) or to produce sparse representations for larger corpora.

Common Follow-up Questions

  • How would you modify the IDF definition to use smoothing (e.g., log(1 + N / df(t))) and why might you do that?
  • Show a vectorized NumPy implementation that computes TF-IDF for the whole corpus efficiently; how would you handle sparse output?
  • How would you normalize TF-IDF vectors (L1 vs L2) and why is normalization important for comparing documents (e.g., cosine similarity)?
  • If a token appears in zero documents because of preprocessing differences, how should your implementation handle unseen tokens or empty documents?
  • Explain how you would extend this to n-grams (bigrams/trigrams) and discuss the trade-offs in vocabulary size and memory usage.
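For the first and third follow-ups, smoothing and normalization can each be sketched in a few lines (these helper names are illustrative):

```python
import math

def smoothed_idf(n_docs, df_t):
    """Smoothed IDF, log(1 + N / df(t)): stays strictly positive even
    for a token that appears in every document, instead of zeroing it out."""
    return math.log(1 + n_docs / df_t)

def l2_normalize(vec):
    """L2-normalize a {token: score} dict so that cosine similarity
    between two documents reduces to a plain dot product."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec
```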

Related Questions

1. Implement a Count Vectorizer: build document-term count matrices from a corpus using str.split() tokenization
2. Compute cosine similarity between documents using TF-IDF vectors and return the top-k most similar documents
3. Design an inverted index for term-to-document lookup and show how to use it for quick DF and search
4. Implement TF-IDF with sublinear TF scaling (1 + log(tf)) and compare results to raw TF
5. Write a sparse matrix-based TF-IDF transformer using scipy.sparse and explain memory/time benefits
