2024 was fast.
Researchers, policy makers, the general public — no one has been spared the vertigo of AI breakthroughs arriving like clockwork, each overshadowing the last. And yet, even against this relentlessly quickening pace, last week’s announcement of OpenAI’s o3 model still managed to surprise, forcing us once again to stop and reassess what “frontier AI” means.
The Day the Benchmarks Changed (again)
At this point, we have grown used to new models creeping ahead of human-level performance on various and sundry specialized tasks. But o3 tossed aside modest increments for leaps and bounds:
ARC-AGI: This test, specifically designed to capture a more “general intelligence” facet of reasoning, has historically been an uphill climb for AI. o3 clocked in at an 87.5% success rate, surpassing the roughly 85% average human performance. Crossing the “human reference line” on a test designed precisely to quantify “human-like intelligence” carries psychological weight, and o3 cleared it solidly.
Competitive Programming Elo 2727 (Codeforces): Anyone who has tried to push AI code assistants through complex tasks knows there is often a gap between neat, self-contained logic puzzles and the chaotic reality of real competition. An Elo rating of 2727 puts o3 among the top echelons (top 0.2%!) of competitive programmers—people who code under time pressure, tricky constraints, and puzzle-like problem statements. (For a sense of scale, see the short Elo sketch after this list.)
Mathematical Reasoning (AIME 2024): o3 turned in 96.7% accuracy on this year’s AIME (American Invitational Mathematics Examination), in effect missing only about half a problem on the 15-question exam. Yes, it’s a format that (some) AI can brute-force with methodical search. But 96.7% is not a small step over last year’s record; it’s a clinically thorough conquest, especially given the AIME’s well-earned reputation for clever, non-routine problems.
EpochAI Frontier Math Benchmark: Perhaps the most humbling result of them all. Just two months ago, two distinguished Fields Medalists insisted that these “Frontier Math” questions would remain out of reach for machines for at least a few years. Sir Timothy Gowers mused that “getting even one question right would be well beyond what we can do now,”[1] and Terry Tao predicted these challenges would “resist AIs for several years at least.” Yet o3 pulled off a 25.2% success rate. It sounds small… until you note that the previous state of the art was a mere 2%. At more than twelve times the prior score, that is not an incremental improvement; it’s an order-of-magnitude leap—and a sign that, like it or not, AI is stepping into creative, conceptual territory mathematicians once thought was ours alone.
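A quick way to calibrate that Codeforces number: under the classic Elo expected-score formula, a 2727-rated player is overwhelmingly favored against almost anyone. Here is a minimal sketch, assuming the textbook Elo formulation; Codeforces runs its own Elo-like rating variant, so treat the probabilities as illustrative rather than official.

```python
# Classic Elo expected-score formula, shown purely for intuition.
# Codeforces uses its own Elo-like rating variant, so these numbers
# are ballpark illustrations, not official win probabilities.

def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (roughly, win probability) of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# How a 2727-rated entrant fares against increasingly strong opponents:
for opponent in (1500, 2000, 2400, 2727):
    print(f"vs {opponent}: {elo_expected_score(2727, opponent):.3f}")
# vs 1500: 0.999
# vs 2000: 0.985
# vs 2400: 0.868
# vs 2727: 0.500
```

Even against a 2400-rated competitor, already elite by human standards, a classic-Elo 2727 player would be favored in roughly seven of eight matchups.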
The Pace of Change: A Study in Contradictions
What’s made the December 2024 unveiling of o3 feel so charged is the tension between two major storylines in AI this year:
The Lock-in of “Old SOTA”: Through the first half of 2024, many[2] believed that top-tier performance was plateauing on known tasks. We saw a rash of papers describing “model stalling” on competitive programming benchmarks, where little beyond single-percentage-point improvements seemed possible. Not worthless, but hardly world-shaking. The field felt ready to declare 2024 a consolidation year—time to refine, scale responsibly, and worry more about system safety.
A Surge in Hyperspecialized Systems: Meanwhile, narrower but unbelievably good specialized AIs — especially in computational biology, drug discovery, and advanced theorem-proving — were upending those niches. Yet these systems usually lacked the generalist utility of LLM-derived models. The sense was that “bespoke AI for each domain” might be the next wave.
Then o3 arrived and broke these neat predictions. Not only did it blast past well-established “walls” in math, code, and general reasoning tasks, it did so while remaining flexible across multiple domains. For those of us carefully tracking the year’s small-but-real “AI slowdown” narrative, o3 is proof that progress comes in waves: the perceived lull was just the trough before the next crest.
The Philosophical Jolt
It’s one thing for a new model to top a leaderboard. It’s another to hear a subtle shift in tone amongst top mathematicians: not resignation, but a recognition that we’ve passed a threshold in AI’s ability to forge new problem-solving paths. To challenge human creativity in a truly intellectually demanding domain.
Watching the reaction from world-class coders on the competitive programming circuits has been similarly telling. Many experts are expressing grudging admiration for o3’s code solutions. Typically, when an AI solution surpasses human ones, it does so through brute-force search or sheer exhaustive case expansion. But o3’s solutions show more fluid, cunning problem breakdown—occasionally even elegant “shortcuts” that read like someone’s personal code-golf spree. Not just “exceeding” top humans, but doing so with flair.
In other words, we’re seeing glimpses of real creativity that can no longer be brushed aside with “it’s just an autocomplete, but bigger.” That line, and its stochastic psittacine brethren, were always more about human self-soothing than rigorous analysis, but it’s now an even harder position to defend in the face of o3’s abilities.
The Future We Didn’t Quite Expect
A year ago, in December 2023, many predicted AI’s biggest transformations would come from applying large language models to business tasks: workflow automation, data insight generation, or large-scale content moderation. That story is still playing out. But the truly eye-popping pivot in 2024 turned out to be in reasoning: mathematics, advanced coding, intricate problem-solving.
It’s reminiscent of a moment a few decades ago, when the shift from “expert systems” to machine learning took people by surprise. Back then, the raw power of iterative improvement overshadowed carefully hand-crafted logic. Now, the raw power[3] of multi-scale reasoning overshadows the incremental approach of specialized domain solvers. And ironically, it’s happening in domains once considered among the hardest[4]: abstract math competitions, advanced puzzle-like code, and conceptual frontiers.
So Where Do We Go from Here?
We close out 2024 with a swirl of conflicting responses:
Euphoria in the Tech Sector: AI valuations soar yet again. The sector sees o3 as a “new milestone,” a chance to go bigger, bolder—maybe even aim for beyond-human performance in every major reasoning benchmark, from physics to creative design.
Scrutiny from Academia: There’s a quiet buzz that many academic domains will shift from “AI can’t do this” to “How do we harness AI that can do this better than us?” We can expect an avalanche of “AI-human collaboration” frameworks in the next wave of research papers.
A Policy Lag?: The legislative and policy realms, understandably fixated on near-term AI governance (misinformation, job displacement, data privacy), are just now waking up to the reality that models can do more than chat: they can invent new algorithms, push the frontiers of mathematics, and solve specialized tasks with fewer guardrails and simpler chain-of-thought prompting than anyone expected. The policy dimension of these breakthroughs could be the next big story in 2025.
Philosophical Discomfort: The Gowers and Taos of the world aren’t losing their humanity, far from it—but that intangible sense of “What is left to explore when the machine is unbounded?” will only grow. On the fringes, we may see more calls for clarifying the nature of creativity, the essence of proof, and the actual meaning of “understanding.”
One fact unites these responses: o3 shattered the comfortable illusions about AI’s near-term limitations. For better or worse, “impossible” is a word the AI community will be more careful about using as we move into 2025.
Note on AGI: You’ll notice I’ve avoided talking about ‘AGI’ — Artificial General Intelligence, or AI that we all roughly agree is “as smart as” a human being (whatever that means). It’s not that the concept, or the conversation around it, is unimportant; it matters. But right now AGI is too often used to unhelpfully deflect conversation away from what is actually happening. Is o3 AGI? Unlikely. Is it rational and productive to take the intellectual position “It’s not AGI, therefore I don’t have to worry about it!”? Absolutely not.[5] The rational approach is to engage with these models for what they are and what they can do today; one can even win a Nobel Prize with sub-AGI-level AI.
Final Thoughts
Year-in-review pieces often end in tidy, sweeping conclusions. But the story o3 tells defies neat summation. It simply insists we keep paying attention, and keep revising our frameworks for what “frontier” means.
We enter 2025 realizing that tasks like solving deep math problems, conjuring intricate creative code, or outpacing expert human reasoning aren’t a coda to AI’s progress; they’re stepping stones. If the first half of this decade taught us that AI can approach human-level language, the second half may demonstrate that AI can push beyond human limits in reasoning, creativity, and even scientific discovery.
So here we are, signing off 2024: The year that began with a sense of “consolidation and stability” closed with a model that politely grinned and demolished the old boundaries. If you’re feeling a bit breathless, you’re not alone. The rollercoaster is still gaining speed.
[2] Especially “permabear” Gary Marcus.
[3] One wonders how many times we will have to learn Rich Sutton’s Bitter Lesson before it sticks?
[4] Though human intuition is famously suboptimal at deciding which problems are hard.
[5] But it is seductively comforting to some.