Implement k-Fold Cross-Validation From Scratch — Uber
Question Description
What you'll be asked to do
You will implement reusable cross-validation utilities from scratch: a standard k-fold splitter, a stratified k-fold variant for classification, and a time-series (forward-chaining) variant. Each function should (1) produce train/validation index splits, (2) call a provided train_and_predict callable to get per-fold predictions, and (3) return per-fold metrics and an aggregate metric.
Core requirements and flow
- Input validation: confirm X and y lengths match and that k (or n_splits) is an integer with 2 ≤ k ≤ n_samples. For stratified folds, verify every class has at least k examples.
- Split generation: produce a list of k tuples (train_idx, val_idx). When shuffle=True, use seed to make splits reproducible.
- Execution: for each fold call train_and_predict(X_train, y_train, X_val) → y_pred, compute metric(y_val, y_pred), collect per_fold_metrics.
- Aggregate metric: return the mean of the per-fold metrics as the overall score.
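The flow above can be sketched as follows. This is a minimal illustration, not a reference solution; the names `k_fold_split` and `cross_validate` are choices made here, and the `train_and_predict` and `metric` callables match the signatures the task describes.

```python
import numpy as np

def k_fold_split(n_samples, k, shuffle=False, seed=None):
    """Return a list of k (train_idx, val_idx) pairs for standard k-fold CV."""
    if not isinstance(k, int) or not (2 <= k <= n_samples):
        raise ValueError("k must be an integer with 2 <= k <= n_samples")
    indices = np.arange(n_samples)
    if shuffle:
        rng = np.random.default_rng(seed)  # seed makes shuffled splits reproducible
        rng.shuffle(indices)
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = np.full(k, n_samples // k)
    fold_sizes[: n_samples % k] += 1
    splits, start = [], 0
    for size in fold_sizes:
        val_idx = indices[start : start + size]
        train_idx = np.concatenate([indices[:start], indices[start + size :]])
        splits.append((train_idx, val_idx))
        start += size
    return splits

def cross_validate(X, y, k, train_and_predict, metric, shuffle=False, seed=None):
    """Run k-fold CV and return (per_fold_metrics, aggregate_mean)."""
    X, y = np.asarray(X), np.asarray(y)
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    per_fold_metrics = []
    for train_idx, val_idx in k_fold_split(len(X), k, shuffle, seed):
        # Fit on the training portion, predict on the held-out fold.
        y_pred = train_and_predict(X[train_idx], y[train_idx], X[val_idx])
        per_fold_metrics.append(metric(y[val_idx], y_pred))
    return per_fold_metrics, float(np.mean(per_fold_metrics))
```

A trivial usage example: a "predict the training mean" model with mean absolute error as the metric exercises the whole pipeline without any real learner.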
Variant specifics
- Stratified: preserve class proportions in each validation fold; allow optional shuffling within class buckets with seed-controlled randomness.
- Time-series: enforce temporal order so training indices are strictly earlier than validation indices; support expanding-window and fixed-length rolling-window modes.
Skills you must demonstrate
You should show solid understanding of data shuffling and reproducibility (seed handling), index-based splitting, class-preserving sampling, and time-series leakage avoidance. Be prepared to discuss when each variant is appropriate and trade-offs between k choices (bias–variance, compute cost).
Common Follow-up Questions
- How would you modify the stratified k-fold function if some classes have fewer than k examples (rare classes)? Discuss trade-offs of oversampling, undersampling, or falling back to non-stratified splits.
- Design nested cross-validation using your k-fold implementation to perform hyperparameter selection and unbiased model evaluation. How do you combine inner- and outer-loop metrics?
- For time-series CV, explain how and why you would introduce a purging/gap period between training and validation sets to avoid leakage. Provide an index-based strategy.
- How can you parallelize training across folds while preserving reproducibility (identical splits and deterministic model behavior)? Discuss random seeds and shared resources.
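For the purging question, one index-based strategy is to drop the last `gap` samples before each validation block from the training set, so features computed over trailing windows (lagged averages, rolling statistics) cannot overlap the validation period. A hedged sketch under that assumption:

```python
import numpy as np

def purged_time_series_split(n_samples, n_splits, gap=0):
    """Forward-chaining splits with a purge gap: training ends `gap` samples
    before validation begins, preventing trailing-window feature leakage."""
    fold_size = n_samples // (n_splits + 1)
    splits = []
    for i in range(1, n_splits + 1):
        val_start = i * fold_size
        train_end = max(0, val_start - gap)  # purge the gap region from training
        train_idx = np.arange(0, train_end)
        val_idx = np.arange(val_start, min(val_start + fold_size, n_samples))
        splits.append((train_idx, val_idx))
    return splits
```

The cost of purging is a smaller effective training set per fold; `gap` should be at least the longest lookback window used in feature construction.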