Mechanistic interpretability (the ability to understand how neural networks do what they do) has leapt forward at a breathtaking pace. Anthropic’s “transformer circuits” program, which, amongst other things, maps millions of latent features and multi-step “thoughts” through large language models, deserves genuine applause. Every lens we craft to peer inside these systems helps us debug, align, and govern them more responsibly. The teams doing this work are performing a vital public service; they should receive more funding, more attention, and more academic credit than they currently do.
But on the heels of a breathless ode to interpretability from Anthropic’s CEO, there is a moral imperative to level with the public about what interpretability can and cannot do.
Every computer-science undergraduate is taught Rice’s theorem:
for any non-trivial semantic property of programs (written in a Turing-complete medium), no general decision procedure exists: no algorithm can determine, for every program, whether that program has the property.
What this means, in plain English, is that computer programs aren’t just hard to understand; they are, in general, impossible to understand completely. Yes, we can understand many useful things, but certainty about how any program will behave eludes us. This is neither obvious nor intuitive. Indeed, our intuition tells us that if we write down a formal, rigorous set of steps (a program), then its execution ought to be deterministic and unambiguous!
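To make the argument concrete, here is a minimal sketch of the diagonalization that underlies Rice’s theorem (and the halting problem before it). The names perfect_auditor and contrarian are hypothetical, chosen purely for illustration: if a total, always-correct behaviour-decider existed, a program could consult it about its own source code and then do the opposite of whatever was predicted.

```python
import inspect

def perfect_auditor(source_code: str) -> bool:
    """Hypothetical oracle: returns True iff the program in `source_code`
    will ever produce a harmful output. Rice's theorem says no such total,
    always-correct function can exist."""
    raise NotImplementedError  # cannot exist; the function below shows why

def contrarian() -> str:
    # Obtain this very function's own source code...
    my_source = inspect.getsource(contrarian)
    # ...ask the auditor what it predicts about this program...
    if perfect_auditor(my_source):
        return "benign output"   # predicted harmful, so behave safely
    return "harmful output"      # predicted safe, so misbehave

# Whatever perfect_auditor answers about contrarian, the answer is wrong:
# a correct verdict defeats itself, so no perfect auditor can exist.
```

The same self-reference trick works for any non-trivial question about behaviour, not just “will it ever emit something harmful,” which is why the limitation is so general.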
Alas, nature is not so kind. Yes, there are bounded computational domains where we can be certain of behaviour, but they are incredibly limited. If you want something with the computational capabilities of the device on which you are likely reading these words, something computationally universal, then you can’t avoid the spectre of Rice’s theorem.
Large neural networks capable of universal computation sit squarely inside that cage. For such a model, no amount of clever feature extraction, sparse auto-encoding, or circuit-tracing can ever guarantee a complete accounting of all its future behaviours. That is not a matter of pessimism or opinion; it is a mathematical theorem.
Why harp on a seemingly abstract result from 1953? Because we are now hearing confident declarations, from well-incentivized CEOs, that with sufficient effort we can build an “MRI for AI” and banish the alignment spectre. The implication is clear: trust industry, trust researchers, hold the heavy-handed rules, and let us sprint ahead. But presenting full interpretability as an engineering problem, solvable with enough GPUs and good intentions, is not mere wishful thinking; it is a category error. Worse, it risks lulling legislators and the public into a false sense of security precisely when sober second thought is needed.
Acknowledging undecidability does not undermine the value of interpretability research. Quite the opposite: knowing we can never reach omniscience forces us to carve off tractable sub-problems (detecting jailbreak vectors, flagging deception circuits, stress-testing biosafety capabilities) and to pair those partial insights with appropriate governance measures. Progress in mechanistic interpretability should feed into binding safety cases, not substitute for them.
Rice’s theorem does not condemn us to ignorance. Engineers routinely design safe aircraft without proving every property of turbulent airflow; physicians save lives with partial but useful causal models of the human body. Likewise, we will pull real safety dividends from interpretable subspaces within AI systems. But safety engineering works only when paired with humility about the unknowable. In aviation that humility is called “fail-safe” and “redundancy.” In medicine it is “informed consent.” In AI it must be an honest approach that assumes residual danger even after the smartest interpretability tools have had their turn.
So let’s celebrate the achievements of mechanistic interpretability thus far. Let’s increase funding for anyone brave enough to disentangle the thoughts of a trillion-parameter mind. And then let’s stare directly at the theorem on the chalkboard. A fully transparent, fully predictable artificial general intelligence is, provably, an impossibility. Pretending otherwise is not merely disingenuous; it is dangerous.
Policy-makers: take note. Progress in interpretability is a race worth running, but it will never cross the finish line labeled “complete understanding.” Govern accordingly.