ml coding
LinkedIn
Google
Amazon

LinkedIn ML: Large-Scale Streaming Mean & Variance

Topics:
Online Algorithms
Numerical Stability
Floating Point Precision
Roles:
Machine Learning Engineer
Data Scientist
Data Engineer
Experience:
Mid Level
Senior
Staff

Question Description

This ML coding prompt asks you to design a concise, robust interface to compute the population mean and variance over extremely large numeric streams that cannot fit in memory.

You must provide a single-pass, streaming-compatible implementation that processes values incrementally and returns, for any aggregated dataset, the total count n, the population mean x̄, and the population variance σ². In addition, your solution should expose a mergeable summary representation so partial results computed on disjoint shards (parallel workers or map tasks) can be combined to yield a correct global summary.

The interviewer will typically walk through:

  • the streaming ingestion stage (how you update state per new observation),
  • the partial-summary serialization (what values you keep in memory and return), and
  • the merge step used to combine two summaries produced independently.

You should demonstrate knowledge of numerically stable online algorithms and floating-point considerations (rounding error, float32 vs float64, handling NaNs/Infs). Performance signals include O(1) memory per summary, O(1) time per observation, and efficient, associative merge operations that work in distributed or parallel environments. Be prepared to justify design choices, show how you test correctness (edge cases: empty streams, single element, extreme values), and explain trade-offs between simplicity, accuracy, and throughput.

Common Follow-up Questions

  • How would you extend the interface to compute covariance or correlation between two streams in a streaming, mergeable way?
  • How do you modify the algorithm to support weighted observations (importance weights) and still allow correct merges?
  • Explain the numerical differences and trade-offs between using float32 and float64 for this streaming variance; how do you mitigate loss of precision?
  • What unit and property tests would you write to validate merge correctness and numerical stability (edge cases like NaN, Inf, empty streams)?
  • How would you adapt the approach to report running standard deviation, and what changes (if any) are needed to preserve numerical stability?

Related Questions

1Implement Welford's online algorithm for mean and variance and explain why it's numerically stable
2Design a distributed aggregator (MapReduce) for computing count, mean, and variance across shards
3Compute streaming covariance and extend summaries to handle multivariate statistics
4Reservoir sampling and streaming quantiles: computing medians and percentiles in one pass
5Stable online algorithms for higher moments (skewness, kurtosis) and how to merge them

Explore More Questions

Practice This Question with AI

Get real-time hints, detailed requirements, and insightful analysis of the question.

Streaming Mean & Variance - LinkedIn ML Coding Interview | Voker