
Visa ML Coding Question: Linear Regression Implementation

Topics: Linear Regression, Gradient Descent, Mean Squared Error
Roles: Machine Learning Engineer, Data Scientist
Experience: Entry Level, Mid Level, Senior

Question Description

Train linear regression from scratch using gradient descent and return learned parameters and per-epoch loss history.

You are asked to implement a standard linear model where predictions follow:

y_hat = X @ w + b

and the objective is to minimize the mean squared error (MSE):

L(w, b) = (1/n) * sum_i (y_i - (x_i^T w + b))^2
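
Differentiating this objective gives the batch gradients (in the same notation, with y_hat = X @ w + b):

grad_w = -(2/n) * X^T @ (y - y_hat)
grad_b = -(2/n) * sum_i (y_i - y_hat_i)

and the vectorized update rule is w <- w - learning_rate * grad_w and b <- b - learning_rate * grad_b.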

Your implementation should accept a feature matrix X of shape (n_samples, n_features), a target vector y of shape (n_samples,), a learning_rate, and a number of epochs. Use batch gradient descent (vectorized operations recommended) to update the parameters. Return a 1-D params array of length n_features + 1 that includes the intercept (bias); state in your docstring whether the bias is the first or last element (recommendation: last). Also return a loss_history list containing the training MSE after each epoch.

Flow you can expect in an interview:

  • Clarify shapes, numeric types, and whether to include an intercept term.
  • Derive gradients for w and b and propose a vectorized update rule.
  • Implement and run a training loop, computing MSE each epoch and returning params + loss history.
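
The loop described above can be sketched as follows. This is a minimal reference implementation, not the official solution; it assumes NumPy and stores the bias as the last element of params, per the recommendation above:

```python
import numpy as np

def train_linear_regression(X, y, learning_rate=0.01, epochs=100):
    """Fit y = X @ w + b by batch gradient descent on the MSE.

    Returns (params, loss_history), where params has length
    n_features + 1 with the bias as the LAST element, and
    loss_history holds the training MSE after each epoch.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    loss_history = []
    for _ in range(epochs):
        y_hat = X @ w + b
        error = y - y_hat
        # Gradients of (1/n) * sum (y - y_hat)^2 w.r.t. w and b
        grad_w = -(2.0 / n_samples) * (X.T @ error)
        grad_b = -(2.0 / n_samples) * error.sum()
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # MSE after this epoch's update
        loss_history.append(float(np.mean((y - (X @ w + b)) ** 2)))
    params = np.append(w, b)  # bias stored last
    return params, loss_history
```

With standardized features and a reasonable learning rate, loss_history should decrease monotonically; a diverging loss usually means the learning rate is too large.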

Skill signals to demonstrate: linear algebra/vectorization, gradient derivation, numerical stability (feature scaling), convergence behavior (learning-rate choice), and clear function documentation. You may mention the analytical solution (normal equation) as a point of comparison, but implement gradient descent itself without ML libraries.

Common Follow-up Questions

  • How would you add L2 regularization (Ridge) to your gradient updates and show the modified gradient expressions?
  • Compare batch, stochastic, and mini-batch gradient descent for this problem. How does batch size affect convergence and runtime?
  • How does feature scaling (standardization) change convergence speed and why should you apply it here?
  • Derive and implement the analytical closed-form solution (normal equation). Compare its output and runtime to gradient descent on small and large feature sets.

Related Questions

1. Implement logistic regression from scratch using gradient descent and return loss history
2. Explain and implement gradient descent variants (SGD, momentum, Adam) for linear models
3. When to prefer the normal equation vs iterative optimization for linear regression?
4. How to add early stopping and learning-rate schedules to your training loop?


Linear Regression Implementation - Visa ML Coding Question | Voker