Large Language Models (LLMs) have demonstrated remarkable capabilities in generating coherent text, yet a theoretical understanding of why they work remains elusive.


The question of understanding language generation is not new. For decades, computer scientists have been fascinated by the ability of humans and certain machines to generate natural language, dating back to early work by Shannon. An important line of work was initiated by Gold (1967), who introduced a formal model of language identification in the limit that has received extensive study in learning theory (e.g., Angluin (1980)) and in linguistics.


This tutorial covers a recently proposed formal framework for language generation, a modern take on classical work on language identification. Specifically, we explore the model of "language generation in the limit" introduced by Kleinberg and Mullainathan (2024), which offers a surprising positive result: even with minimal requirements, coherent language generation is possible after observing finitely many samples. This finding stands in stark contrast to the negative results on language identification established by Gold (1967) and Angluin (1980).
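
To give a feel for why generating can be easier than identifying, here is a toy sketch in Python. It is not the algorithm of Kleinberg and Mullainathan (2024). It assumes a hypothetical finite collection of languages over the positive integers, where the language indexed by d is the set of multiples of d, and the names `consistent` and `generate` are illustrative only. The generator never decides which language is the target; it only outputs an unseen element that belongs to every language still consistent with the samples.

```python
from math import lcm  # multi-argument lcm requires Python 3.9+

def consistent(candidates, seen):
    """Return the divisors d whose language {d, 2d, 3d, ...} contains every sample seen so far."""
    return [d for d in candidates if all(s % d == 0 for s in seen)]

def generate(candidates, seen):
    """Output an unseen element that lies in every consistent language.

    The target language is always consistent, so the output is always
    a valid, previously unseen element of the target.
    """
    m = lcm(*consistent(candidates, seen))
    k = 1
    while k * m in seen:
        k += 1
    return k * m

# Hypothetical collection: multiples of 2, of 3, of 4, and of 6.
candidates = [2, 3, 4, 6]
seen = set()
outputs = []
for sample in [12, 24, 36]:  # samples shown to the generator, one per step
    seen.add(sample)
    outputs.append(generate(candidates, seen))

print(outputs)  # [24, 36, 48]
```

After these three samples all four candidate languages remain consistent, so the target cannot be identified, yet every output is a new multiple of 12 and therefore belongs to whichever candidate is the true target. The toy relies on the collection being finite with an infinite common intersection; the general setting treated in the tutorial requires a more careful argument.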


Our tutorial aims to introduce this emerging theoretical framework to the broader computational learning theory community, stimulating new research that bridges formal theory and practical language models. No prior knowledge of language generation models is required; basic mathematical maturity is sufficient.

Slides and Materials

All tutorial materials, including slides, will be made available on this website following the event. Please check back!

Resources

Alongside the COLT 2025 tutorial, we are preparing an annotated reading list for participants exploring language generation in greater depth. Below is a list of recent (and rapidly growing) lines of work on language generation in the limit. The list is still under preparation; please check back frequently.



Selected Works and Surveys on Language Identification in the Limit
Works on Language Generation in the Limit
Works on Language Generation in the Limit with Breadth
Relevant Talks

Organizers