Towards Principled Model Evaluation Under Imperfect "Ground Truth" Labels
Luke Guerdan
Carnegie Mellon University
The NLP Reading Group is excited to host Luke Guerdan, a PhD student at CMU, who will be speaking remotely on Zoom on Friday, November 29th about "Towards Principled Model Evaluation Under Imperfect 'Ground Truth' Labels".
Talk Description
In many evaluation contexts, "ground truth" labels are an imperfect proxy for the broader capabilities or limitations of interest—such as the "relevance" of retrieval-augmented generation (RAG) outputs or the "toxicity" of chatbot responses. How can we conduct statistically rigorous and informative performance evaluations under an imperfect gold standard?
In this talk, I begin by addressing this question in the context of predictive modeling for algorithmic decision support. I describe an approach that leverages structured human feedback, in the form of expert anchor assumptions, to better connect observable proxy labels to the unobservable constructs of interest. I validate this approach theoretically and demonstrate empirically that measurement error modeling is critical for learning reliable models. I conclude by illustrating that a similar approach is necessary when evaluating LLMs under violations of the "gold labels" assumption.
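As a toy illustration of why imperfect proxy labels matter (this is not the approach from the talk, just a minimal sketch under an assumed symmetric label-noise model): if the proxy label disagrees with the true construct at a known rate rho, accuracy measured against the proxy systematically misstates accuracy against the construct, and the bias can be inverted in closed form.

```python
# Toy sketch (assumed symmetric noise model, not the speaker's method):
# if a proxy label flips the true construct label with probability rho, then
#   observed_acc = true_acc * (1 - rho) + (1 - true_acc) * rho,
# so accuracy against the construct can be recovered as
#   true_acc = (observed_acc - rho) / (1 - 2 * rho).

def corrected_accuracy(observed_acc: float, rho: float) -> float:
    """Invert symmetric proxy-label noise at an assumed known rate rho in [0, 0.5)."""
    if not 0 <= rho < 0.5:
        raise ValueError("rho must be in [0, 0.5) for the correction to be identifiable")
    return (observed_acc - rho) / (1 - 2 * rho)

# Example: 82% agreement with proxy labels that disagree with the construct 10% of the time
print(corrected_accuracy(0.82, 0.10))  # -> 0.9: the naive estimate understates true accuracy
```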
Speaker Bio
Luke Guerdan is a Ph.D. student in the Human-Computer Interaction Institute at Carnegie Mellon University. His research focuses on developing tools to evaluate the capabilities and limitations of human-algorithmic systems under imperfect labels. Luke’s work has been recognized with an ACM FAccT Best Paper Award and an NSF Graduate Research Fellowship.
Logistics
Date: November 29th
Time: 11:30AM
Location: Zoom (See email)