Tutorial on Language Generation in the Limit

Large Language Models (LLMs) have demonstrated remarkable capabilities in generating coherent text, yet a theoretical understanding of why they work remains elusive.

The question of understanding language generation is not new. Computer scientists have been fascinated for decades by the ability of humans and certain machines to generate natural language, dating back to early work by Shannon. An important line of work was initiated by Gold (1967), who introduced a formal model of language identification in the limit that has since received extensive study in learning theory (e.g., Angluin (1980)) and in linguistics.

This tutorial covers a recently proposed formal framework for language generation, a modern take on classical work on language identification. Specifically, we explore the model of "language generation in the limit" proposed by Kleinberg and Mullainathan (2024), which offers a surprising positive result: even with minimal requirements, coherent language generation is possible after observing finitely many samples. This finding stands in stark contrast to the negative results on language identification established by Gold (1967) and Angluin (1980).
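To make the setting concrete, here is a minimal toy sketch in Python. It is not the algorithm of Kleinberg and Mullainathan (2024), which handles countable collections. It assumes a small finite collection of candidate languages (sets of positive integers defined by divisibility), and the names `make_multiples`, `generate`, and `search_limit` are hypothetical, chosen only for illustration. At each step the generator discards candidates that are inconsistent with the samples seen so far and outputs an unseen element common to all remaining candidates. Because every element of the target language eventually appears, every candidate that does not contain the target is eventually discarded, so from some point on each output is an unseen element of the target.

```python
def make_multiples(k):
    """Return a membership test for the language {k, 2k, 3k, ...}."""
    return lambda n: n > 0 and n % k == 0


def generate(languages, seen, search_limit=10_000):
    """Output an unseen element shared by all languages consistent with `seen`.

    `languages` is a finite list of membership tests (an assumption of this toy).
    Returns None if no such element is found below `search_limit`.
    """
    # Keep only the candidates that contain every sample observed so far.
    consistent = [lang for lang in languages if all(lang(s) for s in seen)]
    if not consistent:
        return None
    # Search for the smallest unseen element in the intersection.
    for n in range(1, search_limit):
        if n not in seen and all(lang(n) for lang in consistent):
            return n
    return None


# Toy collection: multiples of 2, 3, 4, and 6. The (hidden) target is the
# multiples of 2, presented in an arbitrary order.
languages = [make_multiples(k) for k in (2, 3, 4, 6)]
stream = [12, 4, 2]

seen = set()
outputs = []
for sample in stream:
    seen.add(sample)
    outputs.append(generate(languages, seen))

print(outputs)  # [24, 8, 6]
```

In this run the generator never determines which language is the target until the last step, yet every output is a new even number. This is the distinction between generation and identification that the framework formalizes.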

Our tutorial aims to introduce this emerging theoretical framework to the broader computational learning theory community, stimulating new research that bridges formal theory and practical language models. No prior knowledge of language generation models is required; basic mathematical maturity is sufficient.

Resources

Alongside the COLT 2025 tutorial, we are preparing an annotated reading list for participants who want to explore language generation in greater depth. Below is a list of recent (and rapidly growing) lines of work on language generation in the limit. The list is still under preparation; please check back frequently.

Selected Works and Surveys on Language Identification in the Limit

Language Identification in the Limit — E. Mark Gold Information and Control 1967
Inductive Inference of Formal Languages from Positive Data — Dana Angluin Information and Control 1980
Finding Patterns Common to a Set of Strings — Dana Angluin STOC 1979
Inductive Inference: Theory and Methods — Dana Angluin and Carl H. Smith ACM Computing Surveys 1983
Learning Indexed Families of Recursive Languages from Positive Data: A Survey — Steffen Lange, Thomas Zeugmann, and Sandra Zilles TCS 2008
Language Identification in the Limit — Wikipedia

Works on Language Generation in the Limit

Language Generation in the Limit — Jon Kleinberg and Sendhil Mullainathan NeurIPS 2024
Generation through the Lens of Learning Theory — Jiaxun Li, Vinod Raman, and Ambuj Tewari COLT 2025
Generation from Noisy Examples — Ananth Raman and Vinod Raman ICML 2025
On Union-Closedness of Language Generation — Steve Hanneke, Amin Karbasi, Anay Mehrotra, and Grigoris Velegkas arXiv 2025

Works on Language Generation in the Limit with Breadth

On the Limits of Language Generation: Trade‑Offs Between Hallucination and Mode Collapse — Alkis Kalavasis, Anay Mehrotra, and Grigoris Velegkas STOC 2025
Exploring Facets of Language Generation in the Limit — Moses Charikar and Chirag Pabbaraju COLT 2025
Characterizations of Language Generation With Breadth — Alkis Kalavasis, Anay Mehrotra, and Grigoris Velegkas arXiv 2024
Representative Language Generation — Charlotte Peale, Vinod Raman, and Omer Reingold ICML 2025
Density Measures for Language Generation — Jon Kleinberg and Fan Wei FOCS 2025

Relevant Talks

Language Generation in the Limit — Jon Kleinberg at Simons Berkeley
Language Generation in the Limit — Jon Kleinberg at IAS
Generation Through the Lens of Learning Theory — Jiaxun Li [slides] and Vinod Raman [slides]
Exploring Facets of Language Generation in the Limit — Chirag Pabbaraju [slides]
On the Limits of Language Generation — Anay Mehrotra and Grigoris Velegkas, STOC online talk [slides] (for more recent results, but only in the online setting, see: [slides 1] [slides 2])
AI's Models of the World, and Ours — Jon Kleinberg; ICML Invited Talk (Registration Required)

Language Generation in the Limit

Tutorial @ COLT 2025