OpenAI ML Coding: Noisy Human-Labeled Text Classifier
Question Description
You are given a dataset of text prompts labeled for a binary identity_attack tag, with multiple human annotators per example and pre-computed embeddings provided. Your goal is to quantify how annotation noise and annotator bias affect an embedding-based classifier, identify reliable annotators, filter training data, and improve robustness.
Start by loading the train/validation/test splits and answer the core questions: how many unique annotators labeled the training set, what the label distribution is (counts and relative frequencies), and what the embedding dimensionality is. While exploring the data, note common difficulties with human labeling (disagreement, systematic bias, inconsistent guidelines, and class imbalance) without proposing solutions yet.
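The exploration step can be sketched as follows. Since the real file layout is not specified, the column names (`example_id`, `annotator_id`, `label`) and the tiny synthetic stand-in data are assumptions; with the actual splits you would load them (e.g. via `pd.read_parquet`) instead of constructing them inline.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training split: one row per
# (example, annotator) pair with a binary identity_attack label.
# Column names and values are illustrative assumptions.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "example_id":   [0, 0, 1, 1, 2, 2, 3, 3],
    "annotator_id": ["a1", "a2", "a1", "a3", "a2", "a3", "a1", "a2"],
    "label":        [0, 0, 1, 1, 0, 1, 0, 0],
})
embeddings = rng.normal(size=(4, 384))  # one pre-computed vector per unique example

n_annotators = train["annotator_id"].nunique()        # unique annotator count
counts = train["label"].value_counts().sort_index()   # label counts
freqs = counts / counts.sum()                         # relative frequencies
embed_dim = embeddings.shape[1]                       # embedding dimensionality

print(n_annotators, embed_dim, freqs.to_dict())
```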
Next, train a baseline classifier that consumes the provided embeddings and report standard metrics (accuracy, precision, recall, F1) on validation and test splits. Observe any generalization gaps between validation and test.
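A minimal baseline along these lines, assuming scikit-learn logistic regression over the embedding vectors; the synthetic, partially separable splits stand in for the real data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Synthetic stand-in for the provided embeddings and labels; the class
# signal (a mean shift on positives) is an assumption for illustration.
rng = np.random.default_rng(0)

def make_split(n, dim=64):
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, dim)) + y[:, None] * 0.8  # separable-ish signal
    return X, y

X_tr, y_tr = make_split(800)
X_va, y_va = make_split(200)
X_te, y_te = make_split(200)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

def report(X, y):
    pred = clf.predict(X)
    p, r, f1, _ = precision_recall_fscore_support(y, pred, average="binary")
    return {"acc": accuracy_score(y, pred), "precision": p, "recall": r, "f1": f1}

val_metrics, test_metrics = report(X_va, y_va), report(X_te, y_te)
print(val_metrics, test_metrics)
```

Comparing `val_metrics` against `test_metrics` surfaces the generalization gap the prompt asks about.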
For annotator identification, use only dataset signals (annotator_id, their labels, and model behavior) to distinguish Expert, Decent, and Spammer pools. Filter training examples to keep only those from reliable annotators, retrain the same model on the filtered subset, and compare metrics to the baseline.
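One plausible dataset-only signal is each annotator's agreement with the per-example majority vote. The sketch below scores simulated annotators that way and splits them into pools; the annotator names, accuracies, and thresholds (0.88 / 0.75) are illustrative assumptions, not values given by the task:

```python
import numpy as np
import pandas as pd

# Simulate five annotators of known quality labeling the same examples.
rng = np.random.default_rng(0)
n_examples = 1000
true = rng.integers(0, 2, size=n_examples)
accuracies = {"expert_1": 0.95, "expert_2": 0.95,
              "decent_1": 0.70, "decent_2": 0.70,
              "spammer_1": 0.50}  # 0.5 accuracy = coin-flip labeling

frames = []
for ann, acc in accuracies.items():
    correct = rng.random(n_examples) < acc
    frames.append(pd.DataFrame({
        "example_id": np.arange(n_examples),
        "annotator_id": ann,
        "label": np.where(correct, true, 1 - true),
    }))
df = pd.concat(frames, ignore_index=True)

# Majority vote per example (five annotators, so no ties).
majority = df.groupby("example_id")["label"].agg(lambda s: int(s.sum() * 2 > len(s)))
df["agrees"] = df["label"].values == majority.loc[df["example_id"]].values
reliability = df.groupby("annotator_id")["agrees"].mean()

def pool(score):  # threshold choices are illustrative assumptions
    return "Expert" if score >= 0.88 else ("Decent" if score >= 0.75 else "Spammer")

pools = reliability.map(pool)
kept = df[df["annotator_id"].map(pools) != "Spammer"]  # filtered training rows
print(pools.to_dict())
```

Retraining the baseline on `kept` and re-running the same metric report completes the comparison the prompt asks for. Note that with very few annotators per example, agreement-with-majority blurs the Expert/Decent distinction, since each annotator's own vote influences the majority; leave-one-out agreement or per-annotator confusion matrices are sturdier.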
Finally, propose next-step improvements (de-noising decent annotators, confidence-weighted label aggregation, LLM-assisted review, active re-sampling, calibration) with one-line rationales and assumptions. You should demonstrate practical evaluation choices and a clear path toward production readiness.
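Of the proposed next steps, confidence-weighted label aggregation is simple to sketch: weight each annotator's vote by an estimated reliability score and produce a soft label. The reliability values below are placeholder assumptions:

```python
import numpy as np

# Per-annotator reliability estimates (assumed values for illustration).
reliability = {"a1": 0.95, "a2": 0.80, "a3": 0.55}
votes = [("a1", 1), ("a2", 1), ("a3", 0)]  # three labels for one example

weights = np.array([reliability[a] for a, _ in votes])
labels = np.array([y for _, y in votes])
soft_label = float(weights @ labels / weights.sum())  # weighted vote in [0, 1]

# The soft label can feed a BCE loss directly, or be thresholded with an
# uncertainty band (e.g. dropping examples near 0.5) before hard-label training.
print(round(soft_label, 3))
```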
Common Follow-up Questions
- How would you quantify annotator reliability statistically (per-annotator confusion matrices, calibration, or agreement scores) and choose a threshold to separate Expert/Decent/Spammer?
- If each example has multiple annotator labels, how would you aggregate them (majority vote, Dawid–Skene, or confidence-weighted aggregation) and incorporate label uncertainty into training?
- Describe techniques to denoise labels from the Decent pool (e.g., label smoothing, loss correction, reweighting) and how you'd validate that they improved generalization.
- How would you adjust training and evaluation to handle class imbalance (sampling, class weights, or different metrics) and ensure robust recall on the minority class?
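For the imbalance follow-up, one concrete lever is `class_weight="balanced"` in scikit-learn, which reweights the loss inversely to class frequency and typically trades precision for minority-class recall. The skewed synthetic set below (roughly 5% positives, weak signal) is an assumption for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Skewed synthetic data: ~5% positives with a weak mean-shift signal.
rng = np.random.default_rng(0)
n, dim = 4000, 32
y = (rng.random(n) < 0.05).astype(int)
X = rng.normal(size=(n, dim)) + y[:, None] * 0.3

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Recall on the minority (positive) class; fit-set recall is used
# here purely to illustrate the effect, not as a proper evaluation.
rec_plain = recall_score(y, plain.predict(X))
rec_weighted = recall_score(y, weighted.predict(X))
print(rec_plain, rec_weighted)
```

The unweighted model mostly predicts the majority class, so its positive-class recall collapses, while the balanced model shifts the decision boundary toward the minority class.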