Anthropic ML Coding: Prompt-based Binary Classifier
Question Description
You are asked to build a prompt-based binary classifier that uses a helper which returns per-token log probabilities for a batch of prompts.
For each input example you must construct two class-conditioned system prompts (one for the positive class and one for the negative class). You will pass these prompt strings to a helper:
```python
from typing import List

def helper(prompts: List[str]) -> List[List[float]]:
    """Returns per-token log probabilities for each prompt in the batch."""
    ...

def classify_batch(examples: List[str]) -> List[float]:
    """Given a batch of inputs, return P_pos for each example."""
    ...
```
From the helper you receive a sequence of per-token log-probs for each prompt. Aggregate these token-level values into a single score per prompt (for example, sum the log-probs to get the log-probability of the full token sequence), yielding s_pos and s_neg for each example. Convert the pair to a probability for the positive class by normalizing over the two scores: P_pos = exp(s_pos) / (exp(s_pos) + exp(s_neg)). All arithmetic must use the provided log probabilities (the model's logits are not available).
After obtaining P_pos for the dataset, compute accuracy (predict positive when P_pos >= 0.5) and the average cross-entropy loss per example: loss = -[y * log(P_pos) + (1 - y) * log(1 - P_pos)]. Provide function signatures that accept lists of probabilities and labels and return scalar metrics.
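A minimal manual implementation of the two metrics, matching the formulas above (the `eps` clamp, which guards against log(0) when P_pos saturates at 0 or 1, is an implementation choice, not part of the question):

```python
import math
from typing import List

def accuracy(p_pos: List[float], labels: List[int]) -> float:
    """Fraction of examples where the thresholded prediction (P_pos >= 0.5)
    matches the binary label y in {0, 1}."""
    correct = sum(int(p >= 0.5) == y for p, y in zip(p_pos, labels))
    return correct / len(labels)

def cross_entropy(p_pos: List[float], labels: List[int], eps: float = 1e-12) -> float:
    """Average per-example binary cross-entropy:
    loss = -[y * log(P_pos) + (1 - y) * log(1 - P_pos)]."""
    total = 0.0
    for p, y in zip(p_pos, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```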
Skills you should demonstrate: prompt engineering and class-conditioning, numerical stability with log-space operations, batching for the helper, and manual implementation of metrics (accuracy and cross-entropy).
Common Follow-up Questions
- How would you handle numerical stability when converting large negative log-scores to probabilities (e.g., implement log-sum-exp for two scores)?
- If prompts for the positive and negative class have different token lengths, how would you normalize s_pos and s_neg to avoid length bias?
- How could you calibrate P_pos post-hoc (e.g., temperature scaling or Platt scaling) using only a small validation set?
- Describe batching and memory strategies when helper(prompts) has variable-length token sequences and you must score thousands of examples.