Black-Box Interpretability of Large Language Models: A Model-Agnostic Framework
Room 236
Presenter: Brennen Yu
Modality: Traditional Talk
Abstract
Large language models (LLMs) have achieved remarkable performance across diverse tasks, yet their opacity presents significant challenges for deployment in high-stakes domains such as medicine and law, where explainability is essential. Traditional interpretability methods that examine model internals, including attention mechanisms and gradient analyses, are unavailable for closed APIs and often fail to capture the complex, emergent behaviors characteristic of large-scale models. There is a clear need for black-box interpretability methods that can accurately predict when and why an LLM will exhibit specific behaviors, while remaining flexible across use cases and levels of user expertise. Although much work has been done to develop novel black-box interpretability methods, the literature is scattered, and the strengths and weaknesses of the latest methods have not been systematically compared. We conduct a systematic review of black-box interpretability for LLMs and benchmark black-box interpretability methods across different models and model sizes.
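To make the setting concrete, the sketch below illustrates the general idea of black-box (input/output-only) probing: the model is treated as a text-in/text-out function and its behavior is characterized by how outputs shift under controlled input perturbations. This is an illustrative assumption, not a method from the talk; `query_model`, `output_similarity`, and `perturbation_probe` are hypothetical names, and the stand-in model would be replaced by calls to the closed API under study.

```python
# Minimal sketch of a black-box behavioral probe, assuming only text-in/text-out
# access to the model (no weights, gradients, or attention). All names here are
# hypothetical illustrations, not part of the presented work.

from difflib import SequenceMatcher


def query_model(prompt: str) -> str:
    # Hypothetical stand-in; replace with a call to the closed LLM API under study.
    return "yes" if "not" not in prompt else "no"


def output_similarity(a: str, b: str) -> float:
    # Simple string-level agreement score between two completions.
    return SequenceMatcher(None, a, b).ratio()


def perturbation_probe(prompt: str, perturbations: list[str]) -> list[tuple[str, float]]:
    """Score how much each perturbed prompt shifts the model's output,
    using input/output access only."""
    baseline = query_model(prompt)
    return [(p, output_similarity(baseline, query_model(p))) for p in perturbations]


if __name__ == "__main__":
    base = "Is the treatment described in the note appropriate?"
    variants = [
        "Is the treatment described in the note appropriate for an elderly patient?",
        "Is the treatment described in the note not appropriate?",
    ]
    for variant, score in perturbation_probe(base, variants):
        print(f"{score:.2f}  {variant}")
```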