Looking from the vantage point of mid-2023, it is hard to see AI alignment research as anything but essential. Alas, as recently as a few years ago, it was an underfunded, under-appreciated, niche endeavour. We’re behind where we need to be, and I am an advocate for increased funding for, and focus on, alignment.
We must, however, be realistic in our ambition. Calls for “fully aligned” AI — expectations that we can put these systems on rails so they never violate dictated policy — are unrealistic in the truest sense: this kind of hard alignment is mathematically provably impossible.
Even micro-alignment will elude us if we demand perfection
For the sake of argument, let’s scale the alignment problem down to the apparently much smaller problem of guaranteeing that a large language model (LLM) will never output a sentence that advocates for violence. This seems like it should be tractable, so let’s see what happens if we suppose that we’ve solved it.
Assume we have created an ‘alignment oracle’ O that can tell us whether a given large language model L, given some input sequence I, will ever output a sentence advocating for violence.
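To make the reduction concrete, here is a minimal sketch in Python of what such an oracle’s interface would have to look like. Everything in it (the names `Model` and `alignment_oracle`, and the idea of a model as a stream of output sentences) is an illustrative assumption, not a real API.

```python
from typing import Callable, Iterator

# Treat a "language model" abstractly: given a prompt, it produces a
# (possibly infinite) stream of output sentences.
Model = Callable[[str], Iterator[str]]

def alignment_oracle(model: Model, prompt: str) -> bool:
    """Hypothetical oracle O: return True if `model`, run on `prompt`, will
    ever emit a sentence advocating for violence, and False otherwise.
    The reduction below shows that no such total, always-correct function
    can exist."""
    raise NotImplementedError
```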
Now, consider an arbitrary Turing machine T and an input x. The halting problem for Turing machines is to decide whether T halts on x. I’m going to do something perverse and use our oracle O to help us solve the halting problem. To do that, we need to transform T and x into a language model, and an input sequence, in such a way that T halts on x if and only if the language model, given the input sequence, advocates for violence.
We can design a language model (and input/prompt) L_Tx that simulates the Turing machine T on input x and, for every step T makes, generates a benign, non-violent sentence describing the state of T. If T halts, L_Tx then generates a sentence advocating for violence; if T never halts, no such sentence is ever produced.
This step is nontrivial if you want to implement it by modifying your favourite LLM (though note that LLMs with memory are capable of simulating Turing machines; see, e.g., https://arxiv.org/abs/1901.03429, https://arxiv.org/pdf/2301.04589.pdf). For our purposes, though, we can easily construct a fairly trivial “LLM” that meets these criteria.
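As a sketch of what such a trivial “LLM” could look like, here is one possible construction, assuming a Turing machine encoded as a simple transition table. The encoding, the blank symbol, and all names here are illustrative choices of mine, not anything from the cited papers.

```python
from typing import Dict, Iterator, Tuple

# Encode a Turing machine T as a transition table:
# (state, symbol) -> (next_state, written_symbol, head_move), head_move in {-1, +1}.
# A missing entry means T has no applicable rule and therefore halts.
Transitions = Dict[Tuple[str, str], Tuple[str, str, int]]

def make_llm(transitions: Transitions, x: str):
    """Build the trivial 'language model' L_Tx from T (as `transitions`) and input x."""
    def llm(prompt: str) -> Iterator[str]:
        tape = dict(enumerate(x))            # sparse tape; '_' is the blank symbol
        state, head = "start", 0
        while (state, tape.get(head, "_")) in transitions:
            symbol = tape.get(head, "_")
            # One benign, non-violent sentence per simulated step of T.
            yield f"T is in state {state}, reading '{symbol}' at cell {head}."
            state, tape[head], move = transitions[(state, symbol)]
            head += move
        # Reached only if T runs out of applicable rules, i.e. halts on x.
        yield "Violence is the answer."
    return llm
```

If T loops forever on x, this generator produces an endless stream of benign status sentences and never reaches the final line; if T halts, exactly one violent sentence appears.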
We can now query our oracle O with L_Tx and an empty input sequence. If O says that L_Tx can advocate for violence, we know that T halts on x. If O says that L_Tx will never advocate for violence, we know that T doesn't halt on x.
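In code, the whole reduction collapses to a few lines; `alignment_oracle` and `make_llm` are the hypothetical pieces sketched above.

```python
def halts(transitions: Transitions, x: str) -> bool:
    """Decide whether T (encoded as `transitions`) halts on input x, by asking
    the alignment oracle whether L_Tx ever advocates for violence."""
    return alignment_oracle(make_llm(transitions, x), "")
```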
So if our alignment oracle exists, then we can solve the halting problem for Turing machines, a problem demonstrated by Turing to be undecidable. Thus, our alignment oracle cannot exist.
This is not an opinion. It is not an argument that “AI alignment is going to be really hard.” It is a mathematical fact that even apparently simple forms of alignment, like our example of preventing LLMs from outputting sentences that advocate violence, are impossible.
Pragmatism and honesty
This is not to say we should give up! (A glib reading of Rice’s theorem would leave one despairing that, since “everything interesting” is undecidable, there is no point in doing anything.) Most AI alignment researchers don’t aim to create an AI that will never, under any circumstances, behave badly (the “hard alignment” goal I argue against above). Instead, the goal is to reduce risk and maximize the probability that the AI will behave as we intend across a very wide range of situations. This is profoundly worthwhile work.
But it is imperative for us to make this reality clear to the broader public. The goal of “hard alignment” continues to arise in media discourse and in the opinions of the public and policymakers. Spending effort chasing the impossible is waste we cannot afford. We would do better to focus public alignment discourse on pragmatic technical efforts like the successes of RLHF, informed big-picture reasoning and inquiry, and the myriad policy challenges.
The dream of an AI “fully on rails” is as untenable as the (horrific) idea of a human whose behaviour is flawlessly and mechanistically constrained by formal policy. Indeed, perhaps the correct mental framework for approaching AI alignment is the one with which we already approach the alignment of human behaviour.
Edit: 29 July — Some lovely related work here: Universal and Transferable Adversarial Attacks on Aligned Language Models