From-Scratch PyTorch Transformer — Apple Interview
Question Description
Overview
You are asked to implement a runnable, from-scratch Transformer in PyTorch suitable for sequence-to-sequence tasks. The deliverable must include a fully implemented Multi-Head Attention module and a complete Transformer (encoder–decoder) assembly. The Multi-Head Attention should accept query, key, value tensors of shapes (B, S_q, D), (B, S_k, D), (B, S_k, D) respectively, support H heads with d_head = D / H, and return (output, attn_weights) where output has shape (B, S_q, D) and attn_weights has shape (B, H, S_q, S_k).
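The interface above can be sketched as a minimal from-scratch module. This is one possible shape of a solution, not the prescribed one; the class and projection names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal sketch matching the stated interface: (B, S_q, D) queries,
    (B, S_k, D) keys/values, H heads with d_head = D / H."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "D must be divisible by H"
        self.h = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B, S_q, D = q.shape
        S_k = k.shape[1]
        # Project, then split into heads: (B, S, D) -> (B, H, S, d_head)
        q = self.q_proj(q).view(B, S_q, self.h, self.d_head).transpose(1, 2)
        k = self.k_proj(k).view(B, S_k, self.h, self.d_head).transpose(1, 2)
        v = self.v_proj(v).view(B, S_k, self.h, self.d_head).transpose(1, 2)
        # Scaled dot-product: scores have shape (B, H, S_q, S_k)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        out = attn @ v                              # (B, H, S_q, d_head)
        # Merge heads back: (B, H, S_q, d_head) -> (B, S_q, D)
        out = out.transpose(1, 2).reshape(B, S_q, D)
        return self.out_proj(out), attn

mha = MultiHeadAttention(d_model=16, num_heads=4)
out, w = mha(torch.randn(2, 5, 16), torch.randn(2, 7, 16), torch.randn(2, 7, 16))
print(out.shape, w.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 4, 5, 7])
```

Note that the attention weights are returned per head, which satisfies the inspection requirement below.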
High-level flow
You will typically build in stages: implement and validate the Multi-Head Attention (including query/key/value projections, head splitting/reshaping, scaled dot-product, masking, and final projection). Next, implement encoder and decoder layers that compose attention, residual connections, LayerNorm, and position-wise feed-forward networks. Finally, assemble positional encoding, stacked encoder/decoder layers, and a forward pass that accepts src, tgt, src_mask, tgt_mask and returns decoder representations of shape (B, S_tgt, D).
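The encoder-layer stage of that flow can be sketched as follows. For brevity this sketch substitutes PyTorch's built-in nn.MultiheadAttention for the from-scratch module you would actually submit; the sublayer ordering (attention, residual add, LayerNorm, feed-forward) is the point being illustrated.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Illustrative encoder layer: self-attention + residual/LayerNorm + FFN.
    Post-norm ordering ("Attention Is All You Need" style): sublayer -> add -> norm."""

    def __init__(self, d_model: int = 16, num_heads: int = 4, d_ff: int = 64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x, key_padding_mask=None):
        a, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + a)          # residual around attention, then norm
        x = self.norm2(x + self.ffn(x))  # residual around FFN, then norm
        return x

layer = EncoderLayer()
x = torch.randn(2, 5, 16)
print(layer(x).shape)  # torch.Size([2, 5, 16])
```

A decoder layer adds a causally masked self-attention sublayer plus a cross-attention sublayer over the encoder output, with the same residual-and-norm pattern.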
What you must show / Skill signals
You must demonstrate correct tensor reshaping (split/merge heads), mask alignment and broadcasting, numerical stability in the attention softmax (scaling logits by 1/sqrt(d_head)), residual connections with normalization, and clean PyTorch module structure (init and forward). Knowledge of positional encoding (sinusoidal or learned), causal masking for decoder self-attention, and debugging shape/attention-weight outputs will be evaluated. Include unit-like shape checks and ensure attention weights are exposed per head for inspection.
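The causal mask and the accompanying shape checks can be written as a short self-contained sketch; the helper name is illustrative.

```python
import torch

def causal_mask(s: int) -> torch.Tensor:
    # Lower-triangular boolean mask: position i may attend to positions <= i.
    # Returned as (1, 1, s, s) so it broadcasts over (B, H, S_q, S_k) scores.
    return torch.tril(torch.ones(s, s, dtype=torch.bool)).view(1, 1, s, s)

m = causal_mask(4)
# Unit-like shape checks of the kind the question asks for:
assert m.shape == (1, 1, 4, 4)
assert not m[0, 0, 0, 1]  # first position cannot see the second
assert m[0, 0, 3, 0]      # last position sees the first
print(m[0, 0].int())
```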
Common Follow-up Questions
- How would you modify the model to use relative positional encodings instead of sinusoidal or absolute learned embeddings?
- Explain how you would implement efficient masking for variable-length inputs and how src_mask / tgt_mask should be shaped and broadcast to (B, H, S_q, S_k).
- How can you optimize memory and compute for large D and many heads (e.g., gradient checkpointing, fused projections, or mixed precision)?
- If attention weights are numerically unstable, what changes would you make to the scaled dot-product or masking strategy to improve stability?
- How would you extend the decoder to perform incremental (autoregressive) decoding for inference while reusing cached key/value tensors?
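For the variable-length masking follow-up, the usual trick is to build a (B, S_k) boolean padding mask from sequence lengths and add singleton dimensions so it broadcasts against the (B, H, S_q, S_k) score tensor. A minimal sketch, with an illustrative helper name:

```python
import torch

def expand_padding_mask(pad_mask: torch.Tensor) -> torch.Tensor:
    """(B, S_k) boolean mask (True = real token, False = padding) ->
    (B, 1, 1, S_k), which broadcasts over (B, H, S_q, S_k) attention scores."""
    return pad_mask[:, None, None, :]

lengths = torch.tensor([3, 5])                          # actual lengths per batch item
pad_mask = torch.arange(5)[None, :] < lengths[:, None]  # (2, 5): True where token is real
m = expand_padding_mask(pad_mask)
print(m.shape)  # torch.Size([2, 1, 1, 5])
```

Applying `masked_fill(~m, -inf)` to the scores before the softmax then zeroes out attention to padded positions in every head and every query row at once.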