Synthetic Data for Language AI: Insights from Low-Resource Languages

Tuesday, May 12, 3:45–4:05 p.m.
Room 236
Presenter: Christian Resch
Modality: Traditional Talk

Abstract

LLMs such as GPT and Claude—and today’s machine translation and speech systems—perform impressively for a handful of high-resource languages like English. But many of the world’s languages remain data-scarce and therefore lack support from language AI services. This leaves billions of speakers with limited access to information and digital services. Scaling traditional data collection to thousands of languages is slow and costly, so we need complementary approaches that can move faster at lower cost. In this talk, I present our recent work on the first systematic assessment of large-scale synthetic voice corpora for African automatic speech recognition (ASR). We generate training data through a three-step pipeline—LLM-driven text creation, text-to-speech (TTS) synthesis, and ASR fine-tuning—and evaluate when and why synthetic data helps, where it fails, and how results vary across languages. I’ll also highlight practical challenges in automatic and human evaluation. Finally, I connect these findings to other forms of synthetic data—such as leveraging grammars and dictionaries for machine translation—and to the broader goal of building language AI that works reliably across languages for high-impact applications in areas like agriculture and health.
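At a high level, the three-step pipeline described above can be sketched as follows. This is a minimal illustrative skeleton only: the function names, data shapes, and placeholder implementations are assumptions for exposition, not the actual system presented in the talk.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One synthetic training example: text plus its synthesized audio."""
    text: str
    audio: bytes


def generate_texts(language: str, n: int) -> list[str]:
    # Step 1 (placeholder): LLM-driven text creation.
    # A real pipeline would prompt an LLM for natural sentences
    # in the target language; here we return dummy strings.
    return [f"[{language}] sample sentence {i}" for i in range(n)]


def synthesize_speech(text: str) -> bytes:
    # Step 2 (placeholder): TTS synthesis.
    # A real pipeline would call a TTS model and return audio;
    # here we stand in with the UTF-8 bytes of the text.
    return text.encode("utf-8")


def build_synthetic_corpus(language: str, n: int) -> list[Utterance]:
    # Steps 1 + 2 combined: generate texts, then synthesize each one.
    # Step 3 (ASR fine-tuning on this corpus) would consume the result.
    return [Utterance(t, synthesize_speech(t)) for t in generate_texts(language, n)]


corpus = build_synthetic_corpus("sw", 3)
```

The resulting `corpus` would then feed an ASR fine-tuning step (not shown), which is where the talk's evaluation questions—when synthetic data helps, where it fails, and how results vary across languages—come into play.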

Bio

Christian Resch

Christian Resch works at the intersection of language technology, AI, and international development, focusing on practical social-impact use cases. He has advised and managed AI/NLP and language technology projects across Africa and Asia, spanning dataset creation, solution building, and technical capacity building. Previously, he worked in data science and analytics in Europe’s financial sector, including at the European Central Bank and Deutsche Bundesbank. He is also currently an OMSCS student at Georgia Tech.
