
Implement k-Fold Cross-Validation From Scratch — Uber

Topics:
K-Fold Cross-Validation
Stratified Sampling
Time-Series CV
Roles:
Machine Learning Engineer
Data Scientist
ML Researcher
Experience:
Mid Level
Senior
Entry Level

Question Description

What you'll be asked to do

You will implement reusable cross-validation utilities from scratch: a standard k-fold splitter, a stratified k-fold variant for classification, and a time-series (forward-chaining) variant. Each function should (1) produce train/validation index splits, (2) call a provided train_and_predict callable to get per-fold predictions, and (3) return per-fold metrics and an aggregate metric.

Core requirements and flow

  • Input validation: confirm X and y lengths match and that k (or n_splits) is an integer with 2 ≤ k ≤ n_samples. For stratified folds, verify every class has at least k examples.
  • Split generation: produce a list of k tuples (train_idx, val_idx). When shuffle=True, use seed to make splits reproducible.
  • Execution: for each fold call train_and_predict(X_train, y_train, X_val) → y_pred, compute metric(y_val, y_pred), collect per_fold_metrics.
  • Aggregate metric: return the mean of the per-fold metrics, $\bar{m} = \frac{1}{k}\sum_{i=1}^{k} m_i$.
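The flow above can be sketched as a single function. This is a minimal illustration, not a reference solution: the name `k_fold_cross_validate`, the use of NumPy, and the convention that the first `n % k` folds get one extra sample are all choices made here, not requirements from the prompt.

```python
import numpy as np

def k_fold_cross_validate(X, y, k, train_and_predict, metric,
                          shuffle=False, seed=None):
    """Plain k-fold CV: returns (per_fold_metrics, mean_metric)."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    # Input validation, per the requirements above.
    if len(y) != n:
        raise ValueError("X and y must have the same length")
    if not isinstance(k, (int, np.integer)) or not (2 <= k <= n):
        raise ValueError("k must be an integer with 2 <= k <= n_samples")

    indices = np.arange(n)
    if shuffle:
        rng = np.random.default_rng(seed)  # seeded for reproducibility
        rng.shuffle(indices)

    # The first n % k folds receive one extra sample so every index is used.
    fold_sizes = np.full(k, n // k)
    fold_sizes[: n % k] += 1

    per_fold_metrics = []
    start = 0
    for size in fold_sizes:
        val_idx = indices[start : start + size]
        train_idx = np.concatenate([indices[:start], indices[start + size:]])
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[val_idx])
        per_fold_metrics.append(metric(y[val_idx], y_pred))
        start += size
    return per_fold_metrics, float(np.mean(per_fold_metrics))
```

Because `train_and_predict` and `metric` are passed in as callables, the same loop works unchanged for regression or classification.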

Variant specifics

  • Stratified: preserve class proportions in each validation fold; allow optional shuffling within class buckets with seed-controlled randomness.
  • Time-series: enforce temporal order so training indices are strictly earlier than validation indices; support expanding-window and fixed-length rolling-window modes.
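The two variants above can be sketched as split generators. These are illustrative sketches only: the round-robin assignment of class indices to folds, the equal-sized-fold layout of the time-series splitter, and all function names are assumptions made here rather than requirements from the prompt.

```python
import numpy as np

def stratified_k_fold_indices(y, k, shuffle=False, seed=None):
    """Yield (train_idx, val_idx) pairs preserving class proportions."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        cls_idx = np.flatnonzero(y == cls)
        if len(cls_idx) < k:
            raise ValueError(f"class {cls!r} has fewer than k={k} examples")
        if shuffle:
            rng.shuffle(cls_idx)  # shuffle within the class bucket only
        # Deal this class's indices round-robin across the k folds.
        for fold_id, idx in zip(np.arange(len(cls_idx)) % k, cls_idx):
            folds[fold_id].append(idx)
    all_idx = np.arange(len(y))
    for fold in folds:
        val_idx = np.sort(np.asarray(fold))
        yield np.setdiff1d(all_idx, val_idx), val_idx

def time_series_splits(n_samples, n_splits, window=None):
    """Forward-chaining splits: train indices strictly precede validation.
    window=None -> expanding window; window=w -> rolling window of length w."""
    fold_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        train_start = 0 if window is None else max(0, train_end - window)
        val_idx = np.arange(train_end, min(train_end + fold_size, n_samples))
        yield np.arange(train_start, train_end), val_idx
```

Note the time-series splitter guarantees `train_idx.max() < val_idx.min()` in every fold, which is exactly the leakage-avoidance property the prompt asks for.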

Skills you must demonstrate

You should show solid understanding of data shuffling and reproducibility (seed handling), index-based splitting, class-preserving sampling, and time-series leakage avoidance. Be prepared to discuss when each variant is appropriate and trade-offs between k choices (bias–variance, compute cost).

Common Follow-up Questions

  • How would you modify the stratified k-fold function if some classes have fewer than k examples (rare classes)? Discuss trade-offs of oversampling, undersampling, or fallback to non-stratified splits.
  • Design nested cross-validation using your k-fold implementation to perform hyperparameter selection and unbiased model evaluation. How do you combine inner and outer loop metrics?
  • For time-series CV, explain how and why you would introduce a purging/gap period between training and validation sets to avoid leakage. Provide an index-based strategy.
  • How can you parallelize training across folds while preserving reproducibility (identical splits and deterministic model behavior)? Discuss random seeds and shared resources.
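For the purging follow-up, one index-based strategy is to drop a fixed number of samples between the end of each training window and the start of its validation window. The sketch below is one possible answer under assumptions made here (the name `purged_time_series_splits`, an expanding window, and equal-sized folds), not a prescribed solution.

```python
import numpy as np

def purged_time_series_splits(n_samples, n_splits, gap=0):
    """Expanding-window splits with a purge gap: the `gap` samples
    immediately after the training window are excluded from both sets,
    so labels computed over overlapping horizons cannot leak."""
    fold_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * fold_size       # training covers [0, train_end)
        val_start = train_end + gap     # indices in [train_end, val_start) are purged
        val_end = min(val_start + fold_size, n_samples)
        if val_start >= n_samples:
            break
        yield np.arange(train_end), np.arange(val_start, val_end)
```

The gap matters whenever a label at time t is derived from data up to t + h (e.g. an h-step-ahead return): without purging, the last h training labels overlap the first validation inputs.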

Related Questions

1. Implement Group k-Fold (grouped cross-validation) for datasets with correlated groups and explain when to use it
2. Implement Leave-One-Out CV from scratch and discuss computational cost versus variance of the estimate
3. Build cross-validation that supports sample weights and custom metrics; explain how to aggregate weighted per-fold scores
4. Implement nested cross-validation for hyperparameter tuning and estimate unbiased generalization performance

k-Fold Cross-Validation Implementation - Uber | Voker