
NVIDIA ML Coding: Decaying Attention Implementation

Topics:
Attention Mechanisms
Linear Algebra / Matrix Multiplication
NumPy Array Indexing & Broadcasting
Roles:
Machine Learning Engineer
Experience:
Entry Level
Mid Level
Senior

Question Description

You are asked to implement a variant of dot‑product attention where a pairwise positional bias grows with the absolute difference in token indices. Concretely, you must compute

A = softmax(Q K^T + B) V

with B_{ij} = |i - j| applied across the query/key sequence axes. The implementation must accept both unbatched (Q: (L_q,d), K: (L_k,d), V: (L_k,d_v) -> output (L_q,d_v)) and batched inputs (Q: (B,L_q,d), K: (B,L_k,d), V: (B,L_k,d_v) -> output (B,L_q,d_v)). You should preserve batch semantics and ensure numeric dtype consistency across matmuls, bias addition, and softmax.

Focus areas and flow

  • Compute similarity S = Q @ K^T (vectorized via matmul/einsum). Build B once from arange indices (abs(i-j)) and broadcast to (L_q,L_k) or (B,L_q,L_k).
  • Add B to S, apply row-wise softmax over keys (subtract the row max for numeric stability), then multiply by V to get output of shape (L_q,d_v) or (B,L_q,d_v).
  • Validate shapes: allow L_q != L_k, require matching feature dims d and batch dims when present.
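The flow above can be sketched as follows. This is one possible NumPy implementation under the stated requirements (the function name and validation messages are my own choices, not prescribed by the question):

```python
import numpy as np

def decaying_attention(Q, K, V):
    """Dot-product attention with additive positional bias B[i, j] = |i - j|.

    Accepts unbatched inputs Q:(L_q,d), K:(L_k,d), V:(L_k,d_v) or batched
    inputs with a leading batch axis on all three arrays.
    """
    if Q.ndim not in (2, 3) or K.ndim != Q.ndim or V.ndim != Q.ndim:
        raise ValueError("Q, K, V must all be 2-D or all be 3-D")
    if Q.shape[-1] != K.shape[-1]:
        raise ValueError("feature dims of Q and K must match")
    if K.shape[-2] != V.shape[-2]:
        raise ValueError("K and V must share the key-length axis")
    if Q.ndim == 3 and not (Q.shape[0] == K.shape[0] == V.shape[0]):
        raise ValueError("batch dims of Q, K, V must match")

    L_q, L_k = Q.shape[-2], K.shape[-2]
    # Similarity S = Q K^T, shape (..., L_q, L_k).
    S = Q @ np.swapaxes(K, -1, -2)
    # Positional bias |i - j|, built once from arange indices; the 2-D
    # bias broadcasts over any leading batch axis. Cast to S's dtype.
    i = np.arange(L_q)[:, None]
    j = np.arange(L_k)[None, :]
    S = S + np.abs(i - j).astype(S.dtype)
    # Row-wise softmax over keys, shifted by the row max for stability.
    S = S - S.max(axis=-1, keepdims=True)
    W = np.exp(S)
    W /= W.sum(axis=-1, keepdims=True)
    return W @ V  # (..., L_q, d_v)
```

A quick consistency check: running the batched call and slicing out one batch element should agree with the unbatched call on that element's Q, K, V.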

Skill signals

You should demonstrate: efficient NumPy vectorization, correct broadcasting of the positional bias over the batch axis, numeric-stability best practices for softmax, careful shape and dtype validation, and awareness of complexity: O(B * L_q * L_k * d) time and O(B * L_q * L_k) memory for the score matrix. Include tests for edge cases (single token, unequal lengths, float32 vs float64).
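The softmax stability point is worth demonstrating concretely. The small example below (illustrative values only) shows why subtracting the row max matters: exponentiating large float32 logits directly overflows, while the shifted version is exact because softmax is invariant to a constant shift:

```python
import numpy as np

# Large float32 logits, as might arise from unscaled Q K^T scores.
x = np.array([800.0, 801.0, 802.0], dtype=np.float32)

# Naive softmax numerator overflows to inf in float32.
with np.errstate(over="ignore"):
    naive = np.exp(x)
assert np.isinf(naive).any()

# Shift-by-max softmax: mathematically identical, numerically finite.
shifted = np.exp(x - x.max())
stable = shifted / shifted.sum()
assert np.isfinite(stable).all()
assert np.isclose(stable.sum(), 1.0)
```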

Common Follow-up Questions

  • How would you modify the bias to use an exponential decay B_{ij} = -alpha * |i-j| (learnable alpha) and what numerical/stability concerns arise?
  • Add scaling by 1/sqrt(d) (scaled dot-product). How does that change outputs and empirical gradient magnitudes, and when is it necessary?
  • How would you combine this decaying bias with a causal mask (prevent attention to future positions) while preserving performance and dtype correctness?
  • Design an efficient implementation for very long sequences (L in tens of thousands). What approximations or algorithms reduce memory/time while keeping the decaying behavior?
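Three of these follow-ups (1/sqrt(d) scaling, exponential-decay bias, causal masking) compose naturally in one function. The sketch below is a hypothetical combination, not a prescribed answer; `alpha` stands in for what would be a learnable scalar in an autodiff framework:

```python
import numpy as np

def decayed_causal_attention(Q, K, V, alpha=0.1):
    """Scaled attention with bias B[i, j] = -alpha * |i - j| and a causal
    mask. alpha is a fixed placeholder for a learnable decay rate."""
    d = Q.shape[-1]
    L_q, L_k = Q.shape[-2], K.shape[-2]
    # Scaled similarity, keeping logit magnitudes roughly O(1).
    S = (Q @ np.swapaxes(K, -1, -2)) / np.sqrt(d)
    i = np.arange(L_q)[:, None]
    j = np.arange(L_k)[None, :]
    S = S + (-alpha * np.abs(i - j)).astype(S.dtype)
    # Additive causal mask: -inf logits drop out of the softmax without
    # changing S's dtype; the 2-D mask broadcasts over any batch axis.
    S = np.where(j <= i, S, -np.inf)
    S = S - S.max(axis=-1, keepdims=True)
    W = np.exp(S)
    W /= W.sum(axis=-1, keepdims=True)
    return W @ V
```

One sanity check on the masking: with aligned sequences, query 0 may only attend to key 0, so the first output row must equal the first row of V exactly.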

Related Questions

1. Implement scaled dot‑product attention (softmax(QK^T / sqrt(d)) V) with batched inputs
2. Implement attention with relative positional encodings (learnable or sinusoidal) and explain differences vs absolute-index bias
3. Implement causal (autoregressive) masked attention and explain broadcasting of masks over batch heads
4. Approaches to efficient attention for long sequences: sparse attention, kernelized attention, or locality-sensitive hashing
