Black-Box Interpretability of Large Language Models: A Model-Agnostic Framework
Room 236
Presenter: Brennen Yu
Modality: Traditional Talk
Abstract
Black-Box Interpretability of Large Language Models. LLM Interpretability findings and techniques for any model and without access to model internals. Explainable AI (XAI). Presentation based on research the team has conducted via HAAG. Primary advisor is an Italian Postdoc currently conducting research at Northwestern University. Secondary advisor is an Iranian, 4th-year PhD Student currently at Michigan State University.
The talk covers four contributions: a systematic review of black-box interpretability of LLMs; a performance benchmark of black-box interpretability techniques across different models and model sizes; an open-source GitHub repository with intuitive interfaces for applying interpretability techniques to any LLM; and novel black-box interpretability techniques we have developed.
Large language models (LLMs) have achieved remarkable performance across diverse tasks, yet their opacity presents significant challenges for deployment in high-stakes domains such as medicine and law, where explainability is essential. Traditional interpretability methods that examine model internals—including attention mechanisms and gradient analyses—are unavailable for closed APIs and often inadequately capture the complex, emergent behaviors characteristic of large-scale models. Currently, we lack robust tools to predict when or why an LLM will exhibit specific behaviors.
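To make the black-box setting concrete, here is a minimal sketch of one classic model-agnostic technique: leave-one-out perturbation attribution, which needs only text-in/score-out access to a model. The `toy_score` function below is a hypothetical stand-in for a real LLM call (e.g., the probability an opaque API assigns to a target answer); it is not taken from the talk's own methods.

```python
# Leave-one-out attribution: the importance of each token is the drop in
# the model's score when that token is removed from the input. This needs
# no gradients or attention weights, only black-box query access.

def leave_one_out_importance(tokens, score):
    """Return per-token importances under a black-box scoring function."""
    base = score(tokens)
    importances = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + tokens[i + 1:]  # drop token i
        importances.append(base - score(perturbed))
    return importances

# Toy scoring function standing in for an opaque LLM API: it "predicts"
# a positive sentiment whenever the word "positive" is present.
def toy_score(tokens):
    return 1.0 if "positive" in tokens else 0.0

tokens = ["the", "review", "was", "positive", "overall"]
print(leave_one_out_importance(tokens, toy_score))
# → [0.0, 0.0, 0.0, 1.0, 0.0]  (only "positive" matters to this toy model)
```

In practice the scoring function would wrap a remote API call, and perturbations are batched to control query cost; the key point is that nothing here touches model internals.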