Defining AGI: Why OpenAI’s o3 Isn’t Enough to Achieve Artificial General Intelligence

26 December 2024, by Judicael Poumay (Ph.D.)
Performance of o1 and o3 on the ARC-AGI benchmark

With the release of OpenAI’s o3, the discourse surrounding Artificial General Intelligence (AGI) has intensified, with some asking, “is o3 AGI?”. Indeed, o3 has achieved unprecedented performance on benchmarks such as ARC-AGI, scoring 87.5% and surpassing the human-level threshold of 85%. Such results have led many to suggest that AGI is already here, or at least very close.

Don’t get me wrong, this technology is awesome, and I have used it to help me write this article. But is o3 AGI? Are we close to AGI? No, I believe we are still far from that, and at the core of this discussion lies the absence of a proper definition of what AGI is. The aim of this article is therefore to define AGI through its desired characteristics.

OpenAI’s o3 Model: Why It Sparked the AGI Debate

OpenAI’s o3: Impressive Results on AGI Benchmarks

OpenAI’s o3 model has demonstrated remarkable capabilities, excelling in complex coding tasks and advanced mathematical problem-solving. Notably, it achieved a 22.8% improvement over its predecessor, o1, on the SWE-Bench Verified coding benchmark (Wired). Additionally, o3 nearly aced the 2024 American Invitational Mathematics Examination (AIME), missing only one question. It also scored 87.7% on a benchmark for expert-level science problems (The Verge). 

To achieve this level of performance, o3 uses and is trained on chain-of-thought reasoning: essentially, the model thinks out loud before it starts providing an answer. This leads to significant gains in performance but is also extremely costly in terms of computation and energy, making these kinds of models powerful but highly inefficient. It also suggests that this solution, while impressive, may not scale well for general problem solving.
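
As a minimal illustration of the idea (not OpenAI’s actual training setup, which is not public), the sketch below contrasts direct prompting with chain-of-thought prompting. The `ask_model` function is a hypothetical stand-in for any LLM API call; the prompting pattern, not the API, is the point.

```python
# Minimal sketch of chain-of-thought prompting. `ask_model` is a
# hypothetical stand-in for a real LLM completion call.

def ask_model(prompt: str) -> str:
    """Placeholder for a real LLM API request."""
    return "<model output>"

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)

# Direct prompting: the model commits to an answer immediately.
direct_answer = ask_model(question)

# Chain-of-thought prompting: the model is asked to reason out loud first,
# trading extra generated tokens (compute and energy) for accuracy.
cot_answer = ask_model(
    question + "\nLet's think step by step, then state the final answer."
)

print(direct_answer)
print(cot_answer)
```

Note how the chain-of-thought variant necessarily generates many more tokens per query, which is exactly where the cost and energy overhead comes from.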

Example of a chain of thought for a mathematical problem

Some observers interpret these performances as evidence of o3 reaching Artificial General Intelligence (AGI). Many employees at OpenAI have also openly announced that “o3 is AGI” (source, source). Yet, many in the AI research community emphasize caution. While impressive, o3 is still a product of massive data processing and specialized architectures. This is far from human-like reasoning, creativity, or consciousness.

While models perform brilliantly on known benchmarks, they fail in unfamiliar real-world situations. On top of this, we have a phenomenon known as “benchmark saturation”. It occurs when AI models achieve near-perfect scores on existing benchmarks, rendering these tests less useful for measuring true intelligence (HAI Stanford). As a result, we need more challenging and comprehensive benchmarks to accurately assess AI capabilities.

o3 is impressive, but AGI benchmarks are not

It could be argued that traditional benchmarks are not suitable for testing true Artificial General Intelligence (AGI). AGI might require us to develop a more complex set of problems to demonstrate its capabilities: problems that are incomplete, multi-step, without a single solution, and that require iterative trial and error. Real end-to-end software engineering would be a good example. An AI that can debug, read documentation, handle dependency errors, and create and apply unit tests on its own would be closer to AGI; the sketch below illustrates how open-ended such a loop is. OpenAI’s o3 clearly does not have these capabilities.
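
To make the contrast concrete, here is a hedged sketch of that kind of loop. The `propose_patch` and `apply_patch` functions are hypothetical placeholders for capabilities no current model reliably has; only the shape of the loop matters.

```python
# Illustrative sketch of the open-ended loop real software engineering
# demands. `propose_patch` and `apply_patch` are hypothetical placeholders;
# no current model drives this loop reliably end to end.
import subprocess

def run_tests() -> subprocess.CompletedProcess:
    """Run the project's test suite and capture its output."""
    return subprocess.run(["pytest", "-x"], capture_output=True, text=True)

def propose_patch(failure_log: str) -> str:
    """Hypothetical: read the failure, the docs, the code, and draft a fix."""
    return ""  # the genuinely hard, unsolved part

def apply_patch(patch: str) -> None:
    """Hypothetical: edit the codebase, handling dependency errors on the way."""
    pass

for attempt in range(10):           # iterative trial and error, not one shot
    result = run_tests()
    if result.returncode == 0:      # all tests pass: the task is actually done
        break
    apply_patch(propose_patch(result.stdout + result.stderr))
```

The point is that success is defined by the environment (the tests), not by producing one plausible-looking answer, and reaching it takes many imperfect attempts.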

The o3 model, with all of its impressive results, remains in the end an LLM: something built to generate text. These kinds of models, however impressive, are stochastic parrots; they generate the most likely answer given a query. Train one on older data and it will repeat scientific claims that have since been revised. It does not think, even if it uses chain of thought. It cannot be creative and invent something it has never heard of before. Anyone who has used these models for brainstorming new ideas is well aware of their limitations.
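
The “stochastic parrot” point can be made precise with a toy sketch of greedy decoding: generation is nothing more than repeatedly appending the most probable next token. The `next_token_distribution` function below is a made-up stand-in for a trained model’s learned probabilities.

```python
# Toy sketch of greedy decoding: text generation as repeatedly picking the
# most likely next token. `next_token_distribution` is a made-up stand-in
# for a trained model's learned probabilities.

def next_token_distribution(context: str) -> dict[str, float]:
    """Placeholder: a real LLM returns P(token | context) learned from data."""
    return {"pattern": 0.5, "thought": 0.3, "idea": 0.2}

def generate(prompt: str, max_tokens: int = 3) -> str:
    text = prompt
    for _ in range(max_tokens):
        dist = next_token_distribution(text)
        best = max(dist, key=dist.get)   # greedy: always the most likely token
        text += " " + best
    return text

# Whatever the prompt, the output is the statistically favored continuation,
# not the product of reflection.
print(generate("The model produced a"))
```

Real systems add sampling temperature and far richer context handling, but the core mechanism is this one: a learned distribution over continuations, nothing more.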

AGI and the Chinese Room Argument

Yes, we have passed the Turing test, but do not forget about the Chinese room argument. Proposed by philosopher John Searle, it raises fundamental questions about whether syntactic processing alone can lead to genuine understanding. This thought experiment suggests that even if a machine appears to “understand”, it may lack awareness of meaning, intentions, or context.

Representation of the Chinese room argument

While advanced AI models like LLMs excel at generating semantically appropriate outputs, they do so through statistical inference and pattern recognition. They do not possess a real understanding of the world; they simply react to a prompt and generate the most likely answer.

What Is Artificial General Intelligence (AGI)?

Artificial General Intelligence (AGI) is the holy grail of AI. It refers to a hypothetical AI system capable of performing intellectual tasks at a level comparable to human intelligence. An AGI would demonstrate a versatile understanding, learning, and application of knowledge across diverse domains. Essentially, anything a human can accomplish using a computer, an AGI should be able to replicate. For instance, whether it’s analyzing complex datasets, programming, or engaging in creative problem-solving, AGI would aim to perform all such tasks with human-level competence.

Cognition and Creativity: LLMs do not think like us

Representation of the two kinds of cognition: logical/orderly and exploratory/chaotic

AI’s ability to solve problems has shown significant progress, yet many challenges remain. Current systems like OpenAI’s o3 excel at predefined benchmarks, generating coherent responses, and mimicking creativity. While these achievements are striking, they are superficial compared to the goal of AGI. AI models create art, code, and text that appear original, but their “creativity” is merely a sophisticated recombination of known patterns. They cannot imagine or invent without referencing their training data. For instance, without exposure to Picasso’s works, AI could never emulate his style independently.

True Artificial General Intelligence (AGI) would demonstrate human-like creativity, generating novel concepts and adapting to unfamiliar domains. Imagine an AGI capable of solving problems with cross-disciplinary insights—leveraging biology, philosophy, or abstract reasoning. Unlike today’s models, AGI wouldn’t emulate; it would genuinely innovate. This leap requires surpassing current AI’s abilities to process data and reproduce patterns, reaching a stage where reasoning and creativity emerge.

Human creativity thrives on curiosity, exploration, and unpredictability—qualities absent in AI, which operates within logical and rule-based constraints. To replicate this, AGI must engage in experimentation, iterative testing, and imaginative thinking, akin to human problem-solving’s trial-and-error nature. This is far more complex than applying learned methods to structured problems, which is the limit of current systems.

Achieving such a paradigm shift requires redefining benchmarks. Current benchmarks are closed sets of well-defined tasks, unrepresentative of real-world problems. AGI needs benchmarks that simulate open-ended, poorly-defined, real-life problems. These scenarios might involve ambiguous problem definitions, incompatible tools, and iterative refinement. Success in such benchmarks would necessitate an AI capable of critical thinking, redefining the problem’s scope, and creatively resolving it. True AGI must mirror human adaptability, bridging the gap between structured tasks and the unpredictable demands of the real world.

Self-Agency and Awareness: We need true AI agents

Elements of an Autonomous AGI Agent

Current AI systems are highly effective tools, designed to follow prompts, execute predefined instructions, and occasionally exhibit surprising emergent behaviors. However, self-agency—the ability to independently set goals, explore novel strategies, or proactively decide to acquire new skills—remains far from reality. Models like OpenAI’s o3 can adeptly utilize tools when instructed. Nonetheless, they lack the initiative to autonomously learn new tools or optimize their internal learning processes without direct human intervention.

True Artificial General Intelligence (AGI) would require a fundamentally different kind of autonomy. It would need to self-assess its performance, identify gaps in its capabilities, and learn autonomously. This means self-directed learning: choosing what to learn, why to learn it, and setting independent priorities. An AGI would also need a form of self-awareness: of its actions, their outcomes, and how they align with overarching objectives. Today’s systems are exceptional assistants but lack the self-driven adaptability and introspection required for genuine intelligence.
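
As a thought experiment, the loop below sketches what that self-directed cycle might look like. Every function in it is hypothetical; no existing system can implement any of them without a human in the loop.

```python
# Thought-experiment sketch of a self-directed learning loop. Every function
# here is hypothetical: no existing system implements them autonomously.
from dataclasses import dataclass

@dataclass
class SkillGap:
    skill: str
    severity: float

def self_assess(action_history: list[str]) -> list[SkillGap]:
    """Hypothetical: inspect one's own failures and name the missing skills."""
    return []

def choose_priority(gaps: list[SkillGap]) -> SkillGap:
    """Hypothetical: decide what to learn and why, with no human prompting."""
    return max(gaps, key=lambda g: g.severity)

def acquire_skill(gap: SkillGap) -> None:
    """Hypothetical: gather material, practice, and verify improvement."""
    pass

action_history: list[str] = []
while True:                              # the autonomy of the loop is the point
    gaps = self_assess(action_history)
    if not gaps:                         # no known weaknesses: nothing to learn
        break
    acquire_skill(choose_priority(gaps))
```

Today, every one of these steps is performed by humans: we evaluate the model, we decide what it lacks, and we run the next training cycle.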

Social and Moral Dimensions: AI must be able to prompt us

Human and AGI communicating

If Artificial General Intelligence (AGI) becomes a reality, it will need to operate within the context of human society. Just as no human solves problems in isolation, AGI would depend on humans (and other AIs) as much as humans rely on it. Yet, today’s AI lacks a theory of mind: the ability to know that other entities, like us, have knowledge too. Models like o3 cannot autonomously question users or engage in clarification dialogues. Without this capability, they fail to discern whether their solutions align with user needs, much less adapt their responses accordingly. AGI would need to be a truly social entity, actively participating in collaborative problem-solving. True AGI should be able to prompt us as much as we prompt it.
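
A small sketch of the missing behavior: estimate whether a request is underspecified and, if so, prompt the human back instead of guessing. Both functions are invented for illustration; today’s models mostly answer regardless.

```python
# Sketch of the missing social behavior: detect an underspecified request
# and ask back instead of guessing. `ambiguity_score` is invented for
# illustration; it is not how any real system works.

def ambiguity_score(request: str) -> float:
    """Hypothetical: 0.0 means fully specified, 1.0 means hopelessly vague."""
    return 0.9 if len(request.split()) < 4 else 0.2

def respond(request: str) -> str:
    if ambiguity_score(request) > 0.5:
        # A theory of mind means knowing the user holds context we lack,
        # and prompting *them* for it.
        return "Before I answer: which bug, in which file, under what input?"
    return "<solution tailored to the clarified request>"

print(respond("fix the bug"))
```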

Additionally, AGI would require an advanced understanding of cultural, historical, and ethical nuances. This entails not merely avoiding harm but grasping why certain actions are harmful. It also needs to do this while taking into account societal, historical, and cultural contexts that vary widely across the globe. Morality and ethics are not universal; they are shaped by cultural and individual perspectives. True AGI must navigate these complexities thoughtfully, ensuring that its decisions respect the values of its users. While today’s systems can simulate ethical reasoning to some extent, they lack the depth to address ambiguous or culturally sensitive dilemmas. Achieving AGI means creating a system capable of not only solving problems but doing so as a collaborative, socially aware, and morally responsible entity.

And More

This is by no means a complete picture of Artificial General Intelligence (AGI). We’re good at benchmarks and flashy demos. But AGI isn’t about any one of these things. It’s about bringing together cognition, agency, creativity and social intelligence into something cohesive. Something that truly mirrors the depth and breadth of human intelligence. That’s not where we are today, but it’s where we need to go. And for that, we need more than better models and better benchmarks. We also need a clearer vision of what AGI is and, more importantly, what we want it to be.

OpenAI’s o3 is not AGI, but we are getting there

The journey toward Artificial General Intelligence (AGI) is a testament to humanity’s ambitious drive for innovation. While models such as OpenAI’s o3 exemplify incredible progress in AI, the leap to AGI entails much more than that. It requires the creation of systems capable of genuine adaptability and self-agency, deep reasoning, social awareness, and creativity.

Yet, the pursuit of AGI requires collective rethinking of what AGI truly means. Without a clear, shared definition, AGI risks dilution into a mere marketing buzzword, devoid of substance or direction. To build the AGI of tomorrow, we must first define the vision of what we want it to be. Only with this clarity can we chart a meaningful path forward.

Judicael Poumay (Ph.D.)