A persistent narrative about modern artificial intelligence has emerged: that model intelligence is a consequence of ever more layers and parameters, a Platonic curve we ascend with each new model and GPU island. But the more time I spend with the numbers, the more pedestrian (and more interesting) the story becomes. Intelligence is not just code and curves. It is heat. It is siting. It is timing. In short: it is thermodynamics.
Energy counts.
Over the past few years, our understanding of AI progress has matured from a simple “scale it up” mantra to a more nuanced picture of compute-optimal training. Kaplan and colleagues mapped early scaling regularities, showing how loss falls as a power law with model size, data and compute. Then DeepMind’s “Chinchilla” results suggested many frontier models were simply under‑trained for their size and that, for a fixed budget, one should scale parameters and tokens together. The moral wasn’t “bigger is better,” but “trained to the right place is better” and, crucially, often cheaper to run thereafter.
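To make the arithmetic concrete, here is a back‑of‑envelope sketch using two common approximations: training compute C ≈ 6·N·D (N parameters, D tokens) and the Chinchilla heuristic of roughly 20 tokens per parameter at the compute‑optimal point. The budgets are illustrative, not a recipe.

```python
# Back-of-envelope compute-optimal sizing, using two common approximations:
#   training compute C ~= 6 * N * D   (N = parameters, D = training tokens)
#   Chinchilla heuristic: D ~= 20 * N at the compute-optimal point
# The FLOP budgets below are illustrative, not a real training plan.

def chinchilla_sizing(flop_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a FLOP budget."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (flop_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e22, 1e23, 1e24):
    n, d = chinchilla_sizing(budget)
    print(f"C = {budget:.0e} FLOPs -> ~{n / 1e9:.0f}B params, ~{d / 1e12:.2f}T tokens")
```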
At the same time, algorithmic efficiency has been quietly keeping pace with hardware. OpenAI’s work on measuring algorithmic efficiency showed order‑of‑magnitude improvements in the compute needed to hit the same benchmark scores across several domains; others have chronicled similar trends in vision. Brains beat brawn more often than our headlines suggest. This isn’t hand‑waving: it’s fewer floating‑point operations, and thus less energy, for the same task.
Compression deepens the point. Pruning and sparsity (from the “lottery ticket” line of work through large‑scale studies of magnitude pruning) show that you can often keep accuracy while discarding most parameters; mixture‑of‑experts (MoE) architectures make parameters “lazy,” activating only a few per token. The result, when it works, is not just a smaller bill; it’s less heat per unit of cognition. That sounds prosaic until you look at utility interconnect queues.
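As a toy illustration only (nothing here resembles a production pruning schedule or MoE router), both ideas fit in a few lines of PyTorch:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude weights; keep the rest unchanged."""
    k = int(weight.numel() * sparsity)               # number of weights to drop
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold                  # keep only the large weights
    return weight * mask

def topk_gate(gate_logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Per token, activate only the k highest-scoring experts (others contribute zero)."""
    topk_vals, topk_idx = gate_logits.topk(k, dim=-1)
    gates = torch.zeros_like(gate_logits)
    gates.scatter_(-1, topk_idx, torch.softmax(topk_vals, dim=-1))
    return gates

w = torch.randn(512, 512)
w_sparse = magnitude_prune(w, sparsity=0.9)          # 90% of weights become exact zeros
print(f"nonzero fraction: {(w_sparse != 0).float().mean():.2f}")

logits = torch.randn(4, 8)                           # 4 tokens, 8 experts
print(topk_gate(logits, k=2))                        # only 2 nonzero gate weights per token
```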
Because here is the constraint we no longer treat as background noise: the grid. The International Energy Agency estimates data centres consumed ~415 TWh in 2024 (about 1.5% of global electricity), with demand growing ~12% annually since 2017, and projects more than a doubling by 2030 with AI the primary driver. In the U.S., official forecasts now expect record electricity use through 2026, explicitly citing AI and data centres. Even as renewables scale, the physics (and permitting) of transmission and cooling don’t bend to press releases.
Meanwhile, facility‑level efficiency has plateaued. Industry‑average PUE (power usage effectiveness) has hovered around 1.56 for years; the low‑hanging fruit has been picked, and legacy estates dampen gains from cutting‑edge designs. If you can’t slice the power at the building envelope, you must slice it in the workload.
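The arithmetic is blunt but worth stating: assuming overhead scales with IT load, every kilowatt‑hour saved in the workload is multiplied by the PUE at the meter. A trivial sketch with illustrative numbers:

```python
# PUE = total facility energy / IT equipment energy.
# At the industry-average PUE of ~1.56, each kWh of IT load costs roughly
# 1.56 kWh at the meter, so workload savings compound by that factor
# (assuming cooling and distribution overhead scale with IT load).

def facility_energy_kwh(it_energy_kwh: float, pue: float = 1.56) -> float:
    return it_energy_kwh * pue

saved_it_kwh = 1_000.0                        # hypothetical workload-level saving
print(facility_energy_kwh(saved_it_kwh))      # ~1560 kWh avoided at the facility level
```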
The next meaningful frontier in AI is the thermodynamic efficiency of cognition: how many joules we spend to produce a given unit of capability, insight, or utility. Not merely time‑to‑train or cost‑per‑token, but energy‑per‑token and, better, energy‑per‑task. This observation is hardly novel; it is already one of the industry’s primary concerns. Precisely because of that, it is worth framing progress explicitly as a joint optimisation problem across model design, training recipe, and the physical stack that powers and cools the whole contraption.
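As a unit of account, energy‑per‑token is measurable today. Here is a minimal sketch using NVIDIA’s pynvml bindings: integrate GPU board power over an inference call and divide by tokens produced. The `generate` and `n_tokens` names are placeholders for your own code, and a serious accounting would add CPU, memory, networking, and facility overhead.

```python
import time
import threading
import pynvml  # ships with nvidia-ml-py; this sketch assumes a single NVIDIA GPU

def measure_joules(fn, poll_s: float = 0.05):
    """Run fn() while polling GPU board power; return (fn's result, estimated joules)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    samples, stop = [], threading.Event()

    def poll():
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
            time.sleep(poll_s)

    t = threading.Thread(target=poll, daemon=True)
    start = time.time()
    t.start()
    result = fn()                       # e.g. a model generation call
    stop.set()
    t.join()
    elapsed = time.time() - start
    pynvml.nvmlShutdown()
    avg_watts = sum(samples) / max(len(samples), 1)
    return result, avg_watts * elapsed  # joules ~= mean power x wall-clock time

# Hypothetical usage, with generate() and n_tokens standing in for your own inference code:
# out, joules = measure_joules(lambda: generate(prompt))
# print(f"{joules / n_tokens:.1f} J/token")
```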
What does this look like in practice?
First, we need everyone treating energy as a first‑class metric. We’ve had arguments for “Green AI” since at least 2019, urging that we report the cost of models, not just their accuracy. Let’s aggressively normalise joules alongside capability benchmarks. Tools like CodeCarbon and the ML CO₂ calculator are imperfect (recent validations show nontrivial error bars), but they’re better than darkness and can be improved. Put “kWh/sample” and “kgCO₂e/run” in model cards; make reviewers ask for them. Then the incentives, built by us, can begin to move.
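Instrumentation can be as unglamorous as wrapping the training loop. A minimal sketch with CodeCarbon, where `train_one_epoch` stands in for your own code and the reported numbers carry real error bars:

```python
from codecarbon import EmissionsTracker  # pip install codecarbon

tracker = EmissionsTracker(project_name="demo-run")
tracker.start()
try:
    train_one_epoch()              # hypothetical placeholder: substitute your training code
finally:
    emissions_kg = tracker.stop()  # estimated kgCO2e for the tracked interval

print(f"~{emissions_kg:.3f} kgCO2e for this run")
# Figures like these (alongside kWh) belong in the model card, next to the benchmarks.
```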
Second, prefer algorithmic thrift over brute force. Chinchilla taught us that compute‑optimal beats parameter‑maximal for a given budget, and often lowers inference energy thereafter. Add to that: low‑precision training and inference (FP8, INT8/4), 8‑bit optimisers, structured sparsity, distillation, and MoE where it actually reduces token‑wise compute. The outlook here is good: the economic incentives of the industry have made this an incredibly active, and productive, research frontier.
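As one small, concrete instance of thrift, here is a single low‑precision training step using PyTorch’s autocast in bfloat16. The model and data are placeholders, and FP8 needs newer hardware plus additional libraries; treat this as a sketch of the pattern, not a recipe.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# (8-bit optimisers, e.g. from bitsandbytes, would shrink optimiser state further.)
loss_fn = torch.nn.MSELoss()

batch = torch.randn(32, 1024, device="cuda")    # placeholder data
labels = torch.randn(32, 1024, device="cuda")

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(batch), labels)        # heavy matmuls run in bf16: less time, fewer joules
loss.backward()                                 # parameters and their gradients remain fp32
optimizer.step()
```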
Third (and very aspirationally), orchestrate jobs against the grid, not just the cluster. Google showed that carbon‑aware scheduling, shifting flexible workloads to hours and locations with cleaner generation, can materially cut emissions without hurting reliability; demand‑response pilots now throttle certain jobs when local grids are stressed. This is the banal magic of calendars and forecasts, not alchemy, and it works. In principle: if we can route packets intelligently, we can route training runs. In practice: the co-ordination problem here is wicked and may require policy incentives to succeed.
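Stripped of the real‑world constraints, the core of carbon‑aware scheduling is a one‑function idea: given an hourly carbon‑intensity forecast, start the deferrable job in the cleanest window. The forecast values below are invented for illustration; real systems add reliability, capacity, and locational constraints.

```python
# Toy carbon-aware scheduling: pick the contiguous window in which a deferrable
# job of known length emits least, given an hourly gCO2/kWh forecast.

def best_start_hour(forecast_g_per_kwh: list[float], job_hours: int) -> int:
    """Index of the start hour minimising summed carbon intensity over the job."""
    windows = [
        sum(forecast_g_per_kwh[h:h + job_hours])
        for h in range(len(forecast_g_per_kwh) - job_hours + 1)
    ]
    return min(range(len(windows)), key=windows.__getitem__)

forecast = [430, 410, 390, 350, 300, 260, 240, 250, 310, 380, 420, 450]  # illustrative
print(best_start_hour(forecast, job_hours=3))  # -> 5 (covers the 260/240/250 trough)
```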
Fourth, co‑design with the data centre’s thermodynamic envelope. Densities are rising; air cannot keep up; liquid and even microfluidic chip‑level cooling are moving from demos to deployment. Facility PUE may be flat today, but the stack is changing fast, and the research frontier lives at the junction of model schedule, thermal limits, and cooling technology. If your training plan ignores the plant, you are leaving performance (and megawatt‑hours) on the table.
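The physics of that envelope is not exotic. Essentially every watt a rack draws leaves as heat, and Q = ṁ·c_p·ΔT sets the coolant flow needed to carry it away. A rough sketch, with an illustrative 100 kW rack and a 10 K coolant temperature rise:

```python
# Rough thermodynamic envelope of a liquid-cooled rack: the rack's electrical
# power must be rejected as heat, and Q = m_dot * c_p * dT fixes the coolant flow.

WATER_CP = 4186.0          # J/(kg*K), specific heat of water

def coolant_flow_kg_per_s(rack_power_w: float, delta_t_k: float) -> float:
    """Mass flow of water needed to absorb rack_power_w with a delta_t_k temperature rise."""
    return rack_power_w / (WATER_CP * delta_t_k)

flow = coolant_flow_kg_per_s(rack_power_w=100_000, delta_t_k=10)
print(f"{flow:.2f} kg/s of water (~{flow:.1f} L/s)")   # ~2.4 kg/s for a 100 kW rack
```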
And, I wouldn’t be my honest self without adding: we should be honest about physical limits. Landauer’s principle reminds us there’s a floor to the energy cost of irreversible computation: approximately 0.018 eV per bit erased at room temperature. We’re nowhere near that bound in practice, but thinking in these terms disciplines our language; it forces us to be clear that intelligence at scale is a negotiation with physics.
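For a sense of scale, here is the bound worked out at 300 K next to a ballpark joules‑per‑operation figure for a current accelerator. The 700 W and 10^15 FLOP/s numbers are illustrative, and a floating‑point operation erases many bits, so read this as an order‑of‑magnitude comparison only.

```python
import math

K_B = 1.380649e-23                     # Boltzmann constant, J/K
T = 300.0                              # room temperature, K

landauer_j = K_B * T * math.log(2)     # ~2.9e-21 J per bit erased
per_flop_j = 700.0 / 1e15              # ~7e-13 J per operation (illustrative accelerator)

print(f"Landauer bound: {landauer_j:.2e} J  (~{landauer_j / 1.602e-19:.3f} eV)")
print(f"Gap to today's hardware: ~10^{math.log10(per_flop_j / landauer_j):.0f}x")
```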
There is genuine reason for optimism. Patterson and colleagues, for example, have suggested that best practices can cut training energy by up to 100x and emissions by up to 1000x through better hardware utilisation, judicious selection of regions, right‑sizing, and reporting.
That’s demand; what about supply? Will the grid catch up? Perhaps, and I sincerely hope it does; Big Tech is signing unprecedented clean‑power deals and experimenting with storage and firm, carbon‑free supply. Yet even optimists at the IEA concede that data‑centre demand will more than double by 2030; Microsoft and others are re‑architecting cooling under the tacit admission that thermal constraints are binding now. We should build for a world in which watts are valuable and variable. Because they are.
The economics of a world with a finite ability to generate and deliver power, and a seemingly unbounded appetite for machine cognition, all but ensure that the next breakthroughs won’t just be bigger. They’ll be colder. And that’s a good thing.