Streaming Mean & Variance - LinkedIn ML Coding Interview

Question Description

This ML coding prompt asks you to design a concise, robust interface to compute the population mean and variance over extremely large numeric streams that cannot fit in memory.

You must provide a single-pass, streaming-compatible implementation that processes values incrementally and can return, at any point, the total count n, the population mean x̄, and the population variance σ² of the data aggregated so far. In addition, your solution should expose a mergeable summary representation so that partial results computed on disjoint shards (parallel workers or map tasks) can be combined to yield a correct global summary.
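One common way to meet the single-pass requirement is Welford's online algorithm, which keeps only the count, the running mean, and the running sum of squared deviations. The sketch below is a minimal illustration, not a prescribed answer: the class name `RunningStats`, the field name `m2`, and the choice to return NaN for an empty stream are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical O(1)-memory summary: count, mean, sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of (x - mean)^2 over all values seen

    def update(self, x: float) -> None:
        # Welford's update: O(1) time per observation, no stored history.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean times the deviation from the new mean.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance divides by n, not n - 1.
        # Returning NaN for an empty stream is an assumed convention.
        return self.m2 / self.n if self.n > 0 else float("nan")


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

For this input the count is 8, and the mean and population variance come out at approximately 5.0 and 4.0.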

The interviewer will typically walk through:

the streaming ingestion stage (how you update state per new observation),
the partial-summary serialization (what values you keep in memory and return), and
the merge step used to combine two summaries produced independently.
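The three stages above can be sketched together. This example assumes the summary is the triple (n, mean, M2) and uses the pairwise combination formula usually attributed to Chan, Golub, and LeVeque; the names `Summary`, `summarize`, and `merge` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, NamedTuple


class Summary(NamedTuple):
    """Hypothetical serializable partial summary for one shard."""
    n: int
    mean: float
    m2: float  # sum of squared deviations from the shard mean


def summarize(values: Iterable[float]) -> Summary:
    # Ingestion stage: one Welford update per observation.
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Summary(n, mean, m2)


def merge(a: Summary, b: Summary) -> Summary:
    # Merge stage: combine two independently computed summaries.
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    # The cross term accounts for the gap between the two shard means.
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


left = summarize([2, 4, 4, 4])
right = summarize([5, 5, 7, 9])
combined = merge(left, right)
print(combined.n, combined.mean, combined.m2 / combined.n)
```

The merge is associative and commutative up to floating-point rounding, so shards can be combined in any order or in a reduction tree. An empty summary acts as the identity element, which is why the two early returns are there.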

You should demonstrate knowledge of numerically stable online algorithms and floating-point considerations (rounding error, float32 vs float64, handling NaNs/Infs). Performance signals include O(1) memory per summary, O(1) time per observation, and efficient, associative merge operations that work in distributed or parallel environments. Be prepared to justify design choices, show how you test correctness (edge cases: empty streams, single element, extreme values), and explain trade-offs between simplicity, accuracy, and throughput.
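A short test sketch can make the stability and edge-case discussion concrete. The code below assumes one possible policy, skipping NaN and ±Inf inputs and returning NaN statistics for an empty stream; other policies (raising, or counting rejected values separately) are equally defensible. The helper names `welford` and `naive_variance` are hypothetical, and `statistics.pvariance` from the Python standard library serves as the reference.

```python
import math
import statistics


def welford(values):
    """Return (n, mean, population variance); non-finite inputs are skipped."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        if not math.isfinite(x):  # assumed policy: drop NaN and +/-Inf
            continue
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        return 0, math.nan, math.nan  # assumed convention for empty input
    return n, mean, m2 / n


def naive_variance(values):
    """Textbook E[x^2] - E[x]^2, shown only to illustrate cancellation."""
    n = len(values)
    mean = sum(values) / n
    return sum(x * x for x in values) / n - mean * mean


# Edge cases named in the prompt: empty stream, single element, non-finite values.
assert welford([])[0] == 0
assert welford([3.0]) == (1, 3.0, 0.0)
assert welford([1.0, math.nan, math.inf, 3.0])[0] == 2

# Large offset, small spread: the true population variance is 22.5.
shifted = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
n, mean, var = welford(shifted)
assert math.isclose(var, statistics.pvariance(shifted))
print("welford:", var, "naive:", naive_variance(shifted))
```

With float64, the squares in the naive formula are near 1e18, where adjacent representable values are 128 apart, so the subtraction cannot recover a variance of 22.5. The incremental update works with small deviations and avoids that cancellation.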
