Not All Mistakes Are Equal: Why Accuracy Isn't Enough to Evaluate AI Systems

Monday, May 11, 2:25–2:45 p.m.
Room 236
Presenter: Kabir Kang
Modality: Traditional Talk

Abstract

When we deploy AI systems for content moderation, medical screening, or safety decisions, we usually ask one question: how accurate is it? But accuracy treats every mistake the same. Misclassifying an obvious case counts no differently than getting a genuinely ambiguous one wrong. In the real world, these errors have very different consequences. Consider a toxicity classifier for an online platform. If every human reviewer agrees a comment is toxic and the model misses it, that's a serious failure. If reviewers are split down the middle, the "right" answer was never clear to begin with. Standard accuracy can't tell the difference between these two kinds of mistakes, but users, policymakers, and patients certainly can. In this talk, I'll share research from Georgia Tech's Data-Centric ML group where we built a framework to address this gap. We assign each example a misclassification cost based on how much the mistake matters, whether that's derived from human disagreement, clinical thresholds, or confidence ratings, and introduce a metric that makes these costs visible. Across experiments in toxicity detection, medical diagnosis, and image classification, we found that models are often much better than their accuracy suggests: most errors concentrate on genuinely ambiguous cases that even humans can't agree on. We also discovered something surprising: explicitly teaching models to care about these costs during training barely helps. They already learn to get the clear-cut cases right on their own. I'll discuss what these findings mean for how we think about evaluating and trusting AI systems in high-stakes settings.
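The abstract's core idea, weighting each mistake by how much it matters, can be sketched in a few lines. The cost function and metric below are illustrative assumptions on my part (derived from annotator agreement, one of the cost sources the abstract mentions), not the group's actual formulation.

```python
# Illustrative sketch of cost-sensitive evaluation, assuming costs come
# from annotator agreement: a unanimous example gets full cost (1.0),
# a 50/50 split gets zero, so errors on clear-cut cases dominate.

def agreement_cost(toxic_votes: int, total_votes: int) -> float:
    """Cost in [0, 1]: 1.0 when annotators are unanimous, 0.0 at an even split."""
    p = toxic_votes / total_votes
    return abs(2 * p - 1)

def cost_weighted_accuracy(y_true, y_pred, costs) -> float:
    """Like accuracy, but each error is penalized by its example's cost."""
    total = sum(costs)
    penalty = sum(c for t, p, c in zip(y_true, y_pred, costs) if t != p)
    return 1.0 - penalty / total if total else 1.0
```

For example, a model that errs only on the two ambiguous items in a four-item set has a plain accuracy of 0.50 but a much higher cost-weighted accuracy, which is the gap between "how often it is wrong" and "how much its mistakes matter" that the talk highlights.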

Bio

Kabir Kang

Kabir Kang is a software engineer at Monogram and an MSCS student through Georgia Tech's OMSCS program. He spent 2024–25 on campus at Georgia Tech conducting research in the ADDAPT ML Lab under Professor Stephen Mussmann, focusing on cost-sensitive classification and evaluation metrics. His research explores the gap between how ML models are evaluated in practice and how their mistakes actually matter in real-world deployment, with applications in content moderation and medical screening.

Program

Check out the Program page for the full program!

Questions About the Conference?

Check out our FAQ page for answers and contact information!