
OpenAI ML Coding: Noisy Human-Labeled Text Classifier

Topics:
Classification Metrics
Feature Engineering
Imbalanced Data
Roles:
Machine Learning Engineer
ML Engineer
Data Scientist
Experience:
Mid Level
Senior
Entry Level

Question Description

You are given a dataset of text prompts labeled for a binary identity_attack tag, with multiple human annotators per example and pre-computed embeddings provided. Your goal is to quantify how annotation noise and annotator bias affect an embedding-based classifier, identify reliable annotators, filter training data, and improve robustness.

Start by loading the train/validation/test splits and answer core questions: how many unique annotators labeled the training set, what is the label distribution (counts and relative frequencies), and what is the embedding dimensionality. While exploring the data, note common difficulties with human labeling (disagreement, systematic bias, inconsistent guidelines, and class imbalance) without proposing solutions yet.
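A minimal exploration sketch for these core questions, assuming each split loads as a pandas DataFrame with `example_id`, `annotator_id`, `label`, and `embedding` columns (the actual file format and schema may differ; a tiny synthetic frame stands in for the real training split here):

```python
import numpy as np
import pandas as pd

# Stand-in for the real training split: one row per (example, annotator) pair,
# with a precomputed embedding attached to each row.
train = pd.DataFrame({
    "example_id": [0, 0, 1, 1, 2, 2],
    "annotator_id": ["a1", "a2", "a1", "a3", "a2", "a3"],
    "label": [1, 1, 0, 0, 0, 1],
    "embedding": [np.random.rand(8) for _ in range(6)],
})

n_annotators = train["annotator_id"].nunique()          # unique annotators
label_counts = train["label"].value_counts()            # absolute counts
label_freqs = train["label"].value_counts(normalize=True)  # relative frequencies
embedding_dim = len(train["embedding"].iloc[0])         # embedding dimensionality

print(f"unique annotators: {n_annotators}")
print(f"label counts:\n{label_counts}")
print(f"label frequencies:\n{label_freqs.round(3)}")
print(f"embedding dimensionality: {embedding_dim}")
```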

Next, train a baseline classifier that consumes the provided embeddings and report standard metrics (accuracy, precision, recall, F1) on validation and test splits. Observe any generalization gaps between validation and test.
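A baseline sketch using scikit-learn logistic regression over the embeddings; the synthetic split generator below stands in for the provided data, and `class_weight="balanced"` is one reasonable default given the class imbalance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

rng = np.random.default_rng(0)

def make_split(n, dim=16, pos_rate=0.2):
    """Synthetic stand-in: imbalanced labels with mild class separation."""
    y = (rng.random(n) < pos_rate).astype(int)
    X = rng.normal(size=(n, dim)) + y[:, None] * 0.8
    return X, y

X_tr, y_tr = make_split(500)
X_va, y_va = make_split(200)
X_te, y_te = make_split(200)

clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Report the same metrics on validation and test to surface any gap.
for name, X, y in [("val", X_va, y_va), ("test", X_te, y_te)]:
    pred = clf.predict(X)
    p, r, f1, _ = precision_recall_fscore_support(
        y, pred, average="binary", zero_division=0
    )
    print(f"{name}: acc={accuracy_score(y, pred):.3f} "
          f"p={p:.3f} r={r:.3f} f1={f1:.3f}")
```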

For annotator identification, use only dataset signals (annotator_id, their labels, and model behavior) to distinguish Expert, Decent, and Spammer pools. Filter training examples to keep only those from reliable annotators, retrain the same model on the filtered subset, and compare metrics to the baseline.
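One simple reliability signal that uses only the dataset itself is each annotator's agreement rate with the per-example majority vote. The pool thresholds below (0.9 / 0.6) are illustrative assumptions that would be tuned on held-out or expert-audited data:

```python
import pandas as pd

# Toy annotations: three annotators ("exp", "dec", "spam") per example.
ann = pd.DataFrame({
    "example_id": [0,0,0, 1,1,1, 2,2,2, 3,3,3, 4,4,4],
    "annotator_id": ["exp", "dec", "spam"] * 5,
    "label": [1,1,0, 0,0,1, 1,0,1, 0,0,1, 1,1,0],
})

# Majority vote per example as a (noisy) reference label.
majority = ann.groupby("example_id")["label"].agg(lambda s: int(s.mean() >= 0.5))
ann["majority"] = ann["example_id"].map(majority)

# Per-annotator agreement rate with the majority reference.
agreement = (ann["label"] == ann["majority"]).groupby(ann["annotator_id"]).mean()

def pool(rate):
    # Assumed cutoffs; in practice choose them from the agreement distribution.
    if rate >= 0.9:
        return "Expert"
    if rate >= 0.6:
        return "Decent"
    return "Spammer"

pools = agreement.map(pool)
print(pools)

# Keep only rows from reliable annotators before retraining.
reliable = pools[pools != "Spammer"].index
filtered = ann[ann["annotator_id"].isin(reliable)]
print(f"kept {len(filtered)} of {len(ann)} annotation rows")
```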

Finally, propose next-step improvements (de-noising decent annotators, confidence-weighted label aggregation, LLM-assisted review, active re-sampling, calibration) with one-line rationales and assumptions. You should demonstrate practical evaluation choices and a clear path toward production readiness.
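Of the improvements above, confidence-weighted label aggregation is easy to sketch: weight each annotator's vote by an estimated reliability score and produce a soft label in [0, 1] that can feed a soft-target loss. The scheme and the reliability values below are illustrative assumptions:

```python
import numpy as np

def weighted_soft_label(labels, reliabilities):
    """Reliability-weighted mean of binary votes -> soft label in [0, 1]."""
    labels = np.asarray(labels, dtype=float)
    w = np.asarray(reliabilities, dtype=float)
    return float(np.sum(w * labels) / np.sum(w))

# Two reliable annotators vote 1, one unreliable annotator votes 0:
# the soft label leans strongly toward 1.
soft = weighted_soft_label([1, 1, 0], [0.95, 0.9, 0.3])
print(soft)  # → ~0.86
```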

Common Follow-up Questions

  • How would you quantify annotator reliability statistically (per-annotator confusion matrices, calibration, or agreement scores) and choose a threshold to separate Expert/Decent/Spammer?
  • If each example has multiple annotator labels, how would you aggregate them (majority vote, Dawid–Skene, or confidence-weighted aggregation) and incorporate label uncertainty into training?
  • Describe techniques to denoise labels from the Decent pool (e.g., label smoothing, loss correction, reweighting) and how you'd validate they improved generalization.
  • How would you adjust training and evaluation to handle class imbalance (sampling, class weights, or different metrics) and ensure robust recall on the minority class?
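For the first follow-up, per-annotator confusion matrices are a straightforward starting point. The sketch below assumes reference labels are available (e.g., from majority vote or an expert-audited subset); class-conditional error rates then expose biased annotators that a single agreement score would hide:

```python
from collections import defaultdict
import numpy as np

def confusion_matrices(records):
    """records: iterable of (annotator_id, reference_label, annotator_label).

    Returns a 2x2 matrix per annotator: rows = reference label,
    columns = annotator's label.
    """
    mats = defaultdict(lambda: np.zeros((2, 2), dtype=int))
    for ann_id, ref, lab in records:
        mats[ann_id][ref, lab] += 1
    return dict(mats)

# Toy records: a1 labels perfectly, a2 misses one positive.
records = [
    ("a1", 1, 1), ("a1", 0, 0), ("a1", 1, 1),
    ("a2", 1, 0), ("a2", 0, 0), ("a2", 1, 1),
]
mats = confusion_matrices(records)
for ann_id, m in sorted(mats.items()):
    pos_recall = m[1, 1] / m[1].sum() if m[1].sum() else 0.0
    print(ann_id, m.tolist(), f"positive-class recall={pos_recall:.2f}")
```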

Related Questions

1. How to detect and mitigate label noise in binary text classification tasks?
2. Design evaluation metrics and a reporting strategy for imbalanced identity-attack detection
3. Use LLMs to assist labeling and filter low-quality human annotations
4. Active learning strategies to select examples for expert relabeling and annotator quality control
