Debug and Extend GPT-style Transformer — OpenAI ML Engineer
Question Description
You are given a compact PyTorch GPT-style causal transformer and a small training harness. Your tasks are to diagnose and fix bugs so the model and training loop produce bitwise-identical outputs to a provided reference, add a KV cache for incremental autoregressive decoding, and attach a token-level classifier head (e.g., odd/even).
Core content
- The exercise focuses on causal self-attention correctness, positional embedding combination, model I/O shapes, and training updates. There are exactly four intentional bugs: three in the model (attention masking, positional addition, and output projection shape) and one in the training loop (a missing optimizer step). You must preserve the external forward signature and return LM logits (plus optional classifier logits) so the existing harness code still works.
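The exercise's actual model code is not reproduced here, but the shape of the three model fixes is standard. A minimal sketch of correct single-head causal attention with an additive learned positional embedding (the function name and tensor layout are assumptions, not the harness's API):

```python
import torch
import torch.nn.functional as F

def causal_self_attention(q, k, v):
    """Single-head causal attention: position i may only attend to j <= i.

    q, k, v: (batch, seq_len, d) tensors.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (B, T, T)
    T = scores.size(-1)
    # Upper-triangular boolean mask blocks attention to future positions;
    # masked scores become -inf so softmax assigns them exactly zero weight.
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Positional embeddings are *added* to token embeddings, not concatenated:
#   x = tok_emb(idx) + pos_emb(torch.arange(T))
# and the output projection must map d_model -> vocab_size so LM logits
# come out as (B, T, vocab_size).
```

A quick way to confirm the mask is right: perturb the last token's key/value and check that outputs at earlier positions are unchanged.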
Flow / stages you'll work through
- Inspect shapes and numerics on small synthetic batches to locate the three model bugs and correct causal masking, positional embedding usage, and output projection dimensions.
- Fix the training-step bug so parameters actually update (maintain determinism with identical seeds and hyperparameters) and verify you can reproduce the reference outputs for a few deterministic steps.
- Implement a simple per-layer KV cache API, add an incremental forward that reuses cached keys/values, and assert cached decoding matches full-sequence decoding.
- Add a token-level classifier head (small linear layer producing two logits per token), integrate its loss with next-token cross-entropy, and report classifier accuracy on held-out data.
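The training-loop bug the prompt describes is a missing optimizer step: gradients are computed but parameters never move. A hedged sketch of a deterministic training step (the model, data iterable, and hyperparameters here are placeholders, not the harness's actual code):

```python
import torch

def train_steps(model, batches, lr=1e-3, seed=0):
    """Run deterministic training steps over (input, target) batches.

    The bug class described in the prompt is a loop that calls
    loss.backward() but never optimizer.step(), so the loss stays flat.
    """
    torch.manual_seed(seed)                  # identical seeds => identical runs
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    last_loss = None
    for x, y in batches:
        opt.zero_grad()
        logits = model(x)                    # (B, T, vocab)
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()
        opt.step()                           # the fix: actually update params
        last_loss = loss.item()
    return last_loss
</imports-placeholder>
```

With the step restored, two runs with the same seed and hyperparameters should reproduce the reference outputs exactly for the first few steps.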
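A per-layer KV cache can be as simple as appending each step's new keys and values; the sketch below assumes the single-head layout from the prompt, and the dict-based cache API is an assumption rather than the harness's interface:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache):
    """Incremental causal attention for one new token per step.

    q_new, k_new, v_new: (B, 1, d). cache: dict with "k"/"v" entries
    (None before the first step). Because only the newest query is
    computed, no mask is needed: it legitimately attends to every cached
    (i.e. past) position plus itself.
    """
    if cache["k"] is None:
        cache["k"], cache["v"] = k_new, v_new
    else:
        cache["k"] = torch.cat([cache["k"], k_new], dim=1)
        cache["v"] = torch.cat([cache["v"], v_new], dim=1)
    d = q_new.size(-1)
    scores = q_new @ cache["k"].transpose(-2, -1) / d ** 0.5  # (B, 1, T_so_far)
    return F.softmax(scores, dim=-1) @ cache["v"]
```

The correctness assertion the prompt asks for is then: feeding tokens one at a time through the cache and concatenating the outputs must match a single full-sequence causal forward to within floating-point tolerance.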
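The classifier head is a small linear layer over the final hidden states, producing two logits per token, with its cross-entropy added to the LM loss. A sketch under stated assumptions (the wrapper class, the `alpha` weighting, and the backbone interface are illustrative, not the exercise's names):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WithClassifier(nn.Module):
    """Wraps a backbone that returns (B, T, d_model) hidden states and adds
    a 2-way token classifier (e.g. odd/even) alongside the LM head."""

    def __init__(self, backbone, d_model, vocab_size, num_classes=2):
        super().__init__()
        self.backbone = backbone
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.cls_head = nn.Linear(d_model, num_classes)  # two logits per token

    def forward(self, x, use_classifier=True):
        h = self.backbone(x)                             # (B, T, d_model)
        lm_logits = self.lm_head(h)
        cls_logits = self.cls_head(h) if use_classifier else None
        return lm_logits, cls_logits

def combined_loss(lm_logits, cls_logits, lm_targets, cls_targets, alpha=0.5):
    """Next-token cross-entropy plus a weighted token-classification term."""
    lm = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1))
    cls = F.cross_entropy(cls_logits.view(-1, cls_logits.size(-1)), cls_targets.view(-1))
    return lm + alpha * cls
```

Gating the head behind `use_classifier` is what keeps the LM path bitwise reproducible when the classifier is disabled: the extra parameters exist but never touch the LM logits.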
Skill signals
You should demonstrate fluency with PyTorch primitives, deterministic RNG handling, correct attention masking with numerical stability (softmax over scores masked to -inf), careful tensor-shape management, and minimal, testable changes that keep the LM outputs reproducible when the classifier is disabled. Include end-to-end verification tests for cache correctness and training determinism.
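The determinism check called for above can be a plain assertion helper: run the same seeded computation twice and require bitwise-equal outputs. A minimal sketch (the callable interface is an assumption):

```python
import torch

def assert_deterministic(make_and_run, seed=0):
    """Run a seeded build-and-train callable twice; require bitwise equality.

    make_and_run should build its model and produce an output tensor,
    relying on the global RNG seeded here for all randomness.
    """
    torch.manual_seed(seed)
    out1 = make_and_run()
    torch.manual_seed(seed)
    out2 = make_and_run()
    # torch.equal is exact (bitwise), unlike allclose
    assert torch.equal(out1, out2), "run is not bitwise reproducible"
```

The same pattern extends to the cache test: compare full-sequence logits against incrementally decoded logits with `torch.allclose`, since those two code paths may differ by floating-point rounding even when both are correct.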
Common Follow-up Questions
- How would you extend the single-head SimpleSelfAttention into a multi-head attention block while preserving deterministic bitwise reproducibility?
- Explain numerical stability measures for attention softmax; how would you modify the implementation to avoid NaNs with long contexts?
- Describe how you would adapt the KV cache to support batched, variable-length prefixes and efficient decoding across multiple sequences in parallel.
- If classifier accuracy lags while LM loss improves, what debugging steps and loss-weighting strategies would you try to balance the objectives?
- How would you test that positional embeddings (absolute vs. rotary vs. learned relative) preserve causal behavior and do not leak future tokens?