Solomonoff induction, LLMs, and dead (stochastic) parrots
"Look, matey, I know a dead parrot when I see one, and I'm looking at one right now."
Emily Bender and colleagues have famously dismissed Large Language Models (LLMs) as “stochastic parrots.” There has since been an entertaining, and extensive, back-and-forth in the literature between the psittacine trivialists and the Hintonites. Not a week passes without at least a handful of papers claiming that LLMs are nothing more than a shallow parlour trick, counterbalanced by an equal number of contributions suggesting that LLMs are very serious general purpose cognition machines.
A recent paper from DeepMind on Learning Universal Predictors scores a point for the Hintonites. In what follows, I will try to articulate why I like this paper so much, beginning with a pedagogical detour into some beloved theory.
Algorithmic information theory
Theoretical computer scientists like to work with simplified representations, because it makes the bookkeeping in our proofs easier and the intuition in our theorems cleaner. But we’re generally careful to do so WLOG (without loss of generality). Typically, we work over strings of tokens drawn from a finite alphabet, with the acknowledgement that any other input modality can always be transformed into a string1. So in today’s post, we’re going to stick to the world of strings-of-tokens. WLOG, of course.
If I give you an arbitrary string and say, “How complex is this string?”, how would you answer that? Intuitively, this string:
010101010101010101010101010101010101010101010101010101010101
seems less complex than this string:
001001000011111101101010100010001000010110100011000010001101
The former is just “01” repeated 30 times. The latter looks random.
The field of algorithmic information theory deals with exactly this question, and the gold-standard answer is Kolmogorov/Chaitin complexity. To figure out how complex a string is, you just have to find the shortest computer program2 that will produce that string.
This probably matches your intuition. The program to produce the first string is going to be very short: “Print ‘01’ thirty times.” Thus, according to Kolmogorov and Chaitin, this is a very simple string. The program to produce the second string though… yikes… maybe the shortest program is just “print [verbatim copy of string]”, in which case the size of the program is dominated by the size of the string. That would suggest the string is very complex. And, indeed, the most complex strings in this world of algorithmic information are those that are truly random!
If you’re very observant, though, you may have noticed that the second string is actually digits 3-62 of pi, written in binary (the first 60 bits after the binary point). And the digits of pi can be generated with a pretty simple algorithm. So the algorithmic complexity of a string like this one is actually much lower than it looks at first glance, because it can be generated by a program that says “output binary digits 3-62 of pi” together with a fixed-size routine for computing the first “n” binary digits of pi.
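To make this concrete, here is a minimal sketch in Python (a language footnote 2 already blesses) that produces both 60-character strings from programs far shorter than the strings themselves. The Machin-formula routine for pi’s binary digits is just one convenient choice, and the helper names are mine, not anything standard:

```python
# Two short programs, one per string. The point: each 60-character string
# can be generated by a program much shorter than the string written out
# verbatim, so both strings have low Kolmogorov complexity.

def arctan_recip(x, bits):
    """Return arctan(1/x) scaled by 2**bits, via its Taylor series in integer arithmetic."""
    total, n, sign = 0, 1, 1
    while True:
        term = (1 << bits) // (n * x**n)
        if term == 0:
            return total
        total += sign * term
        n, sign = n + 2, -sign

def pi_fractional_bits(count):
    """Return the first `count` binary digits of pi after the binary point."""
    guard = count + 32                 # extra precision to absorb truncation error
    pi = 16 * arctan_recip(5, guard) - 4 * arctan_recip(239, guard)  # Machin's formula
    return bin(pi >> (guard - count))[2:][2:]   # drop "0b", then the integer part "11"

string_one = "01" * 30                 # "Print '01' thirty times."
string_two = pi_fractional_bits(60)    # "Output binary digits 3-62 of pi."

print(string_one)
print(string_two)
```

Neither program is provably the shortest, but both are far shorter than the 60 characters they produce, which is all the argument needs.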
But you had to recognize that it was pi in order to correctly find the Kolmogorov complexity, and this is the Achilles heel of algorithmic information theory: a generalized mechanism for always finding the shortest program computing a given string is an impossibility. The halting problem for Turing Machines strikes again. Kolmogorov/Chaitin complexity is, alas, uncomputable. But all hope is not lost.
In addition to being a beautiful intellectual framework for thinking abstractly about complexity, the theory also lets us compute practical upper bounds. When one finds a small program generating a string (perhaps through a bounded search mechanism), one can conclude: “This may not be the absolute smallest program that produces this string, but it’s still pretty darn small, so we should consider this string less complex.” Gödel’s ghost, however, prohibits us from accessing the converse: if we can’t find a short program that produces a string, we can never conclude with certainty that the string must be truly complex. Maybe we just didn’t try hard enough3.
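A cheap, practical stand-in for that bounded search is to hand the string to an off-the-shelf compressor: the length of the compressed output (plus the fixed size of the decompressor) is an upper bound on the string’s Kolmogorov complexity. A minimal sketch using Python’s standard zlib module, applied to our two strings from above:

```python
import zlib

def complexity_upper_bound(s: str) -> int:
    """Compressed length in bytes: a crude upper bound on Kolmogorov complexity,
    up to the fixed size of the decompressor."""
    return len(zlib.compress(s.encode(), level=9))

periodic = "01" * 30
pi_bits = "001001000011111101101010100010001000010110100011000010001101"

print(complexity_upper_bound(periodic))   # small: zlib finds the repetition immediately
print(complexity_upper_bound(pi_bits))    # larger: zlib finds no structure, so the bound is weak
```

The compressed sizes are not the true Kolmogorov complexities, and the second number badly overstates the pi string’s complexity, which is exactly the one-way nature of the bound.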
Solomonoff induction
We now have a key tool for understanding the world: Kolmogorov/Chaitin complexity.
Let’s talk about model building. As humans, we spend a huge amount of effort constructing mental models of our world. These models allow us to predict how actions we might choose to perform will impact our world. Some models are simple and easily learned: touch a hot stove once, and you will probably build a strong internal model which concludes that repeating that action will result in severe pain (again). But we model much more complex things, too, like each other’s minds.
A very interesting question to ask, then, is: “what’s the best possible way to build a model, given a finite set of observations?” Ray Solomonoff proposed an answer that is deeply rooted in algorithmic information theory: the best model is the smallest computer program for predicting the world that is consistent with the data you’ve observed so far4.
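Footnote 4 spells out the full version of this idea: a mixture over all programs consistent with the data, with simpler programs weighted exponentially more heavily. Real Solomonoff induction ranges over every computable program and is therefore uncomputable; the sketch below shrinks the hypothesis class down to a toy family of “repeat this bit pattern forever” programs (my own choice, purely for illustration) just to show how the complexity-weighted mixture works:

```python
from itertools import product

def repeat_hypotheses(max_len):
    """Yield (description_length_in_bits, generator) for every program of the
    form 'repeat this bit pattern forever' with a pattern up to max_len bits."""
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):
            pattern = "".join(bits)
            yield length, lambda n, p=pattern: (p * (n // len(p) + 1))[:n]

def predict_next_bit(observed, max_len=8):
    """Solomonoff-style prediction over the toy class: discard hypotheses that
    disagree with the observed prefix, weight the survivors by 2**-length, and
    return the (normalized) probability that the next bit is '1'."""
    total_weight = weight_on_one = 0.0
    for desc_len, gen in repeat_hypotheses(max_len):
        if gen(len(observed)) != observed:
            continue                       # inconsistent with the data: weight zero
        weight = 2.0 ** -desc_len          # shorter programs count exponentially more
        total_weight += weight
        if gen(len(observed) + 1)[-1] == "1":
            weight_on_one += weight
    return weight_on_one / total_weight if total_weight else 0.5

print(predict_next_bit("0101010101"))    # 0.0: every surviving hypothesis predicts '0' next
print(predict_next_bit("001001000011"))  # 0.5: nothing in the toy class fits, so it stays agnostic
```

The second call shows exactly where the toy version falls over: this tiny class contains no theory that explains the pi bits, whereas the real construction keeps every computable theory in play.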
Although obviously uncomputable, Solomonoff induction can be semi-approximated (as we did for Kolmogorov complexity) and offers a beautiful framework for thinking about intelligence. Intelligence is, of course, a poorly-defined (and heavily litigated) concept, but allow me the conceit in this note of hypothesizing that greater intelligence is concomitant with greater world-model-building (and predicting) ability.
If you accept that hypothesis, then Solomonoff induction is, in this sense, a strict upper bound on intelligence. It is a roadmap (alas, uncomputable) for how to build the best possible model of your world, given the observations that you’ve made of it thus far. Marcus Hutter has been eloquently making this point for two decades. (And such an observation is also consistent with some of my previous ramblings on this substack.)
So that gets us all the way to the big reveal:
Learning Universal Predictors
In a very beautiful paper from DeepMind last month, the authors investigate deep neural networks, including those with transformer architectures, as potential learners of Solomonoff induction. They write:
“In fact, one could argue that the impressive in-context generalization capabilities of LLMs is a sign of a rough approximation of Solomonoff induction.”
while establishing theoretical results showing that approximations to Solomonoff induction are accessible via meta-training, alongside experimental results that take nontrivial steps in exactly that direction. (And, notably for us in the depths of the transformers-are-🔥 era, showing that the transformer consistently outperforms the other architectures they test.)
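To give a flavour of what “accessible via meta-training” means operationally: as I understand it, the paper’s pipeline samples random programs on a universal Turing machine, runs them, and meta-trains sequence models to predict the resulting outputs. The sketch below is a heavily simplified stand-in for that idea; the Brainfuck-style mini-language, the binary-valued cells, the step budget, and every name in it are my own illustrative choices, not the paper’s actual setup:

```python
import random

OPS = "+-<>[]."   # a Brainfuck-style mini-language (no input op, binary cells)

def run(program, max_steps=1000, tape_len=64, max_output=32):
    """Run a program with a step budget (some programs never halt); return its output
    bits, or None if the program's brackets are unbalanced."""
    stack, match = [], {}
    for i, op in enumerate(program):          # precompute matching brackets
        if op == "[":
            stack.append(i)
        elif op == "]":
            if not stack:
                return None
            j = stack.pop()
            match[i], match[j] = j, i
    if stack:
        return None
    tape, out, pc, ptr, steps = [0] * tape_len, [], 0, 0, 0
    while pc < len(program) and steps < max_steps and len(out) < max_output:
        op = program[pc]
        if op == "+" or op == "-":
            tape[ptr] ^= 1                    # binary-valued cells: both ops flip the bit
        elif op == ">":
            ptr = (ptr + 1) % tape_len        # circular tape
        elif op == "<":
            ptr = (ptr - 1) % tape_len
        elif op == ".":
            out.append(tape[ptr])
        elif op == "[" and tape[ptr] == 0:
            pc = match[pc]
        elif op == "]" and tape[ptr] != 0:
            pc = match[pc]
        pc += 1
        steps += 1
    return out

def sample_training_sequence(program_len=24):
    """Sample random programs until one produces output; return that output."""
    while True:
        program = "".join(random.choice(OPS) for _ in range(program_len))
        out = run(program)
        if out:
            return out

random.seed(0)
for _ in range(5):
    print("".join(map(str, sample_training_sequence())))
# Meta-training means fitting a sequence model (e.g. a transformer) to predict
# the next token across millions of sequences generated this way.
```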
While it’s not a fully closed case yet, we now have some pretty compelling evidence that LLMs are more than Markovian “stochastic parrots”: they may be doing rather good approximations of Solomonoff induction in the building of their internal models. If you accept the (admittedly narrow and controversial) definition of intelligence above, it’s reasonable to suggest that these mechanisms are a good deal more intelligent than the “it’s just a fancy autocomplete” crowd believes.
And this all ties in rather beautifully with Rich Sutton’s Bitter Lesson. As our models get bigger, as we add more compute and more data, they compute better and better approximations of Solomonoff induction. They become, in the sense of model-building and world prediction, more intelligent.
Perfect Solomonoff induction is uncomputable, so we’ll never get there, which may mean there is unbounded potential for better and better approximations. New architectures, new ideas, new theories may help us get there faster (transformers, e.g., appear to outperform LSTMs), but the ultimate driver of better approximate Solomonoff inducers, of more intelligent machines, is exceptionally simple: more compute.
And that is probably why Sam Altman is trying to raise $7T.
1. Indeed, modern multi-modal transformer architectures are trending towards the tokenization of the other modalities, so that the input to the transformer is, in fact, exactly a string of tokens. My inner formal language theorist is gratified by this development.
2. You can pick any Turing-complete programming language that you like. Theoretical computer scientists often work with Turing machines, but Chaitin has written some lovely tutorials using Lisp, and no one will stop you from using Python. The only constraint is that you have to keep the choice of language constant if you want to make comparisons between strings.
3. The only way we could be sure is to generate and run every possible program up to a certain size to see if any of them generate the string. But some of those programs may end up in (nontrivial) infinite loops, and we’d have to wait infinitely long, because maybe they are just about to output our magic string.
4. Solomonoff actually suggested something stronger: that you should compute the probabilities of the next potential observations by weighting the output of every computable theory that is consistent with the observations thus far, with each theory’s weight determined by its complexity (shorter theories count exponentially more), and the whole mixture normalized into a probability.