
Scaled Self-Attention Implementation — Meta

Topics:
Self-Attention
Matrix Multiplication
Softmax
Roles:
Machine Learning Engineer
ML Engineer
Data Scientist
Experience:
Entry Level
Mid Level
Senior

Question Description

Implementing scaled dot-product self-attention is a common ML coding interview task that tests your understanding of linear algebra, tensor shapes, masking, and numerical stability in Transformer-style models.

You are asked to compute attention outputs and attention weights from Queries (Q), Keys (K), and Values (V). The core operation is to form the batched dot-product QK^T, scale by sqrt(d_k), apply an optional mask that blocks specific key positions, run a numerically stable softmax over the key dimension, and multiply the resulting attention distribution by V. The function must accept batched inputs Q of shape (batch, seq_len_q, d_k), K of shape (batch, seq_len_k, d_k), and V of shape (batch, seq_len_k, d_v), and return attention_output of shape (batch, seq_len_q, d_v) and attention_weights of shape (batch, seq_len_q, seq_len_k).
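The computation described above can be sketched in NumPy as follows. This is a minimal reference, not the expected interview solution; the function name and the -1e9 mask fill value are illustrative choices.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Sketch of batched scaled dot-product attention.

    Q: (batch, seq_len_q, d_k)
    K: (batch, seq_len_k, d_k)
    V: (batch, seq_len_k, d_v)
    mask: optional boolean array broadcastable to (batch, seq_len_q, seq_len_k);
          True marks key positions to block.
    Returns (output, weights) with shapes
    (batch, seq_len_q, d_v) and (batch, seq_len_q, seq_len_k).
    """
    d_k = Q.shape[-1]
    # Batched QK^T, scaled by sqrt(d_k) to keep logit variance near 1.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    if mask is not None:
        # Blocked positions get a large negative logit so softmax -> ~0 there.
        scores = np.where(mask, scores.dtype.type(-1e9), scores)
    # Numerically stable softmax over the key axis (subtract per-row max).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```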

Interview flow / stages:

  • Clarify shapes and mask semantics (True/1 = masked out).
  • Describe scaling and why you divide by sqrt(d_k).
  • Explain numerical-stability steps (subtract max per row before softmax) and dtype considerations.
  • Implement batched matmul, mask application (set masked logits to -inf or a large negative), softmax along keys, then multiply by V.
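The stability step in the flow above (subtract the per-row max before softmax) matters because softmax is shift-invariant but exp overflows for large logits. A small demonstration, with logit values chosen only to force the overflow:

```python
import numpy as np

logits = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax overflows: exp(1000) is inf even in float64,
# and inf / inf yields nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()

# Subtracting the max leaves the result mathematically unchanged
# (softmax is shift-invariant) but keeps every exponent <= 0.
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()
```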

Skill signals the interviewer expects:

  • Correct tensor broadcasting and axis choices
  • Preservation of input dtypes where practical
  • Proper mask handling and numerical stability for softmax
  • Awareness of runtime and memory trade-offs for batched attention
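On the broadcasting and dtype signals: a padding mask does not need the full (batch, seq_len_q, seq_len_k) shape; storing it as (batch, 1, seq_len_k) and letting broadcasting expand it across query positions is both idiomatic and memory-friendly. A sketch with made-up shapes (the padded positions here are purely illustrative):

```python
import numpy as np

batch, seq_q, seq_k = 2, 4, 5
scores = np.zeros((batch, seq_q, seq_k), dtype=np.float32)

# A per-sequence padding mask only needs shape (batch, 1, seq_len_k);
# broadcasting repeats it across every query position.
pad_mask = np.zeros((batch, 1, seq_k), dtype=bool)
pad_mask[0, 0, 3:] = True  # hypothetical: last two keys of sequence 0 are padding

# Cast the fill value to the scores dtype so the result is not
# silently promoted to float64.
fill = scores.dtype.type(-1e9)
masked = np.where(pad_mask, fill, scores)
```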

Common Follow-up Questions

  • How would you extend this to implement multi-head attention and combine heads efficiently?
  • How do you implement a causal (autoregressive) mask so queries cannot attend to future keys?
  • What changes would you make to preserve numerical stability and dtype when inputs are float16?
  • How does the attention computational and memory complexity scale with sequence length, and how would you optimize for long sequences?
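For the causal-mask follow-up, one common construction (under the True = masked-out convention used above) is a strictly upper-triangular boolean matrix:

```python
import numpy as np

seq_len = 4
# causal_mask[i, j] is True exactly when j > i, so query i cannot
# attend to any future key j. Broadcast it as (1, seq_len, seq_len)
# over the batch dimension when applying it to the scores.
causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
```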

Related Questions

1. Implement multi-head attention given a scaled dot-product attention primitive
2. Write a masked attention function for decoder-only Transformers (causal masking)
3. Explain and implement an efficient attention variant (sparse or chunked attention) for long sequences
4. Derive gradients for the scaled dot-product attention (backprop through softmax and matmul)

