Physics of Language Models

A scientific approach to studying and improving LLMs

Even today, GPT-4 and Llama-3 still provide incorrect answers to some questions that are simple for humans. Is this a problem inherent to GPT-4, or is it due to insufficient training? Is its mathematical capability too weak? Does this only affect models as of June 2024, or will GPT-6 and Llama-5 also face this problem? What about other pre-trained models? 

(Spoiler alert: these are counterexamples that all of today's LLMs fail, making them, in effect, Turing tests.)

Ethology is great, but...

Pretrained LLMs (like GPT-4, Llama-3, Claude-3) are like the monkeys studied in animal behavior science, and we now live in an era where most people can interact with these monkeys and play games with them. This is fantastic! However, rigorous scientists must ask the underlying "why" and uncover the "universal laws" behind these phenomena, rather than merely studying individual monkeys.

Benchmarks are great, but...

People often celebrate when an LLM ranks high on a benchmark, but are such rankings trustworthy? Could the model have "seen" the benchmark data during training? Imagine if I posted a French translation of the GPQA benchmark on MIT's website tomorrow. From then on, every LLM trained on internet data could unknowingly "cheat," since the translation is not a verbatim copy of the original benchmark.

Data contamination is just one issue. If model A outperforms model B on GSM8K, is it because A has better English comprehension or stronger math skills? Another example: Llama-2-70B scores 63.6% on a world-knowledge benchmark, while Llama-2-7B scores 48.9%. Does a 10x increase in model size improve knowledge capacity by only 30% (63.6 / 48.9 ≈ 1.3)?

Moreover, the excessive pursuit of benchmark performance might lead us further away from Artificial General Intelligence (AGI). Imagine incorporating Wu's method and a specialized brute-force search into an LLM, allowing GPT-5 to solve all 30 IMO geometry problems since 2000. While this would achieve a perfect score, it would not indicate a general mastery of mathematics.

Our Philosophy

Apples fall and boxes move, but universal laws like gravity and inertia are what make technological advancement possible. While GPT-5 or Llama-3 may offer revolutionary experiences tomorrow, we need to establish universal laws of LLMs that can eventually guide us to AGI.

Physics of language models

We propose dividing the concept of "intelligence" into multiple dimensions (such as structures, knowledge, reasoning, etc.). For each dimension, we create synthetic data and build an idealized environment for LLM training to push the capability of LLMs in this dimension to the extreme. By performing a large number of controlled experiments, we aim to discover the universal laws of all LLMs, not just a particular version of GPT-4. 
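For a concrete flavor of what such synthetic data might look like, here is a minimal sketch of a biography-style generator; the attribute lists, templates, and function names are invented for this illustration and are not the project's actual data.

```python
import random

# Hypothetical sketch: generate a biography-style synthetic corpus in which
# every fact is known and controllable by construction. The names, cities,
# fields, and templates below are invented for this example.
FIRST = ["Alice", "Bob", "Carol", "David"]
LAST = ["Nguyen", "Okafor", "Schmidt", "Rossi"]
CITIES = ["Lyon", "Osaka", "Austin", "Porto"]
FIELDS = ["biology", "law", "physics", "music"]

TEMPLATES = [
    "{name} was born in {city} and studied {field}.",
    "Having grown up in {city}, {name} pursued {field}.",
]

def make_corpus(n_people: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    corpus = []
    for _ in range(n_people):
        fact = {
            "name": f"{rng.choice(FIRST)} {rng.choice(LAST)}",
            "city": rng.choice(CITIES),
            "field": rng.choice(FIELDS),
        }
        # Controlling how many distinct templates express each fact lets us
        # vary data diversity while holding knowledge content fixed.
        corpus.append(rng.choice(TEMPLATES).format(**fact))
    return corpus

print(make_corpus(3))
```

Because every fact in such a corpus is known by construction, evaluation can ask exactly which facts a trained model stores, and data contamination is impossible by design.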

Spherical chicken in a vacuum

This humorous phrase pokes fun at theoretical physicists for using oversimplified models. Yet without idealized environments such as a vacuum, one might wrongly conclude that iron balls inherently fall faster than feathers. Idealized environments also help reveal simple formulas, such as the ideal gas law, that have broad applications.

The same is true in studying LLMs. Commercial LLMs are trained on messy, secretly preprocessed internet data. Training LLMs in a controlled, idealized environment allows us to manage the data and tweak hyperparameters — such as data amount, type, difficulty, and format — to scientifically determine factors affecting LLM performance and suggest improvements.
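One way to picture this control is to write the knobs down as an explicit configuration; the field names and values below are illustrative assumptions for this sketch, not the project's actual setup.

```python
from dataclasses import dataclass, replace

# Hypothetical sketch of the "knobs" an idealized training environment
# exposes; field names and values are illustrative assumptions only.
@dataclass
class DataConfig:
    num_facts: int       # data amount: how many distinct knowledge pieces exist
    exposures: int       # how often each fact appears during pretraining
    num_templates: int   # data diversity: distinct rewrites of each fact
    difficulty: int      # e.g., reasoning depth of generated problems
    fact_format: str     # data format: "prose", "qa", or "mixed"

baseline = DataConfig(num_facts=100_000, exposures=100,
                      num_templates=1, difficulty=1, fact_format="prose")

# Each controlled experiment changes exactly one knob and retrains from
# scratch, so any change in model behavior can be attributed to that knob.
variant = replace(baseline, num_templates=5)
print(variant)
```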

Newton, Kepler and Brahe

The law of gravitation (1666-1687) could not have emerged without Kepler's laws of planetary motion (1609-1619), which in turn relied on Tycho Brahe's two decades of observational data (1577-1597). Studying the universal laws of LLMs likewise requires extensive observational data, which cannot be obtained by examining merely a few commercial LLMs. Our aim is to derive conclusions that hold regardless of model size or training parameters. How can this be achieved? By breaking down "intelligence" into manageable components, each using synthetic data of controlled size and difficulty. This allows us to repeatedly train many small models under sufficiently varied conditions, first to identify candidate laws and then to test them more broadly.
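Such a Brahe-style observation campaign can be sketched as a simple experiment grid. In the sketch below, train_and_eval is a placeholder that returns fabricated accuracies so the scaffolding runs end to end, and the "bits" constants inside it are made up for the demo; a real study would plug in actual pretraining and evaluation.

```python
import itertools
import random

# Placeholder for real pretraining + evaluation: returns fabricated
# accuracies so this sketch is executable. The constants are invented.
def train_and_eval(n_params: int, n_facts: int, seed: int) -> float:
    rng = random.Random(hash((n_params, n_facts, seed)))
    capacity = 2.0 * n_params    # pretend each parameter stores ~2 bits
    demand = 24.0 * n_facts      # pretend each fact costs ~24 bits to memorize
    return min(1.0, capacity / demand) * rng.uniform(0.95, 1.0)

model_sizes = [10**6, 10**7, 10**8]   # parameter counts
data_sizes = [10**4, 10**5, 10**6]    # number of synthetic facts
results = [
    (n, d, s, train_and_eval(n, d, s))
    for n, d, s in itertools.product(model_sizes, data_sizes, range(3))
]
# With a dense enough grid, one can fit a candidate law (e.g., accuracy as a
# function of n_params / n_facts) and then test it on configurations the fit
# never saw.
for row in results[:3]:
    print(row)
```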

Probing and transparent AI

Playing with monkeys is fascinating, but we aim to understand the inner workings of their brains. Commercial LLMs pretrained on internet data do allow for probing, but such probing typically uncovers only very basic token-level behaviors, such as induction heads and inhibition heads.

In our idealized environment, we can develop advanced probing techniques tailored to our synthetic data and conduct highly repeatable probing experiments. We wish to probe not just one model, but models of all sizes, trained on varied data and hyperparameters. This moves us closer to the goal of building more transparent, understandable AI systems.
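As an illustration of what such a probe can look like, here is a minimal linear-probing sketch. The hidden states below are random stand-ins generated for this example; in practice they would come from a forward pass of a trained model at a chosen layer and token position, and the "birth city" attribute is a hypothetical label from synthetic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Minimal linear-probing sketch: can a model's hidden states linearly decode
# a known attribute from synthetic data (e.g., a person's birth city)?
rng = np.random.default_rng(0)
n_examples, hidden_dim, n_cities = 1000, 256, 4

city_labels = rng.integers(0, n_cities, size=n_examples)
# Fake hidden states that weakly encode the label, for demonstration only;
# real probing would extract these from the trained model itself.
directions = rng.normal(size=(n_cities, hidden_dim))
hidden_states = rng.normal(size=(n_examples, hidden_dim)) + directions[city_labels]

# Fit a linear probe on a train split and measure accuracy on a held-out split.
probe = LogisticRegression(max_iter=1000).fit(hidden_states[:800], city_labels[:800])
print("probe accuracy:", probe.score(hidden_states[800:], city_labels[800:]))
```

A probe that succeeds consistently across layers, model sizes, and data settings is far stronger evidence of a universal mechanism than a probe run once on a single commercial model.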