
For the first time, Google DeepMind releases the ultimate AGI syllabus, offering a $200,000 global prize to expose the true capabilities of all large models.

Xinzhiyuan (新智元) · 2026-03-18 15:42
DeepMind has proposed a cognitive assessment framework for AGI, defining ten key capability dimensions and offering a $200,000 prize to solicit evaluation schemes from around the world.

[Introduction] How exactly should AGI be evaluated? Google DeepMind has just released a significant paper that directly "borrows" a set of metrics from cognitive science, breaking general intelligence down into 10 key cognitive abilities, accompanied by a three-stage evaluation protocol. It has also joined hands with Kaggle, offering a $200,000 prize to researchers worldwide and asking: who can measure true AGI?

Where exactly has AGI reached at present?

Just now, Google DeepMind provided the ultimate metric for AGI!

The core claim of the paper titled "Measuring Progress Toward AGI: A Cognitive Framework" can be summed up in one sentence: Stop arguing about what AGI is. First, figure out how to measure it.

Paper link: https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/measuring-progress-toward-agi/measuring-progress-toward-agi-a-cognitive-framework.pdf

Specifically, the evaluation of AGI is refined into 10 key cognitive domains: perception, generation, attention, learning, memory, reasoning, metacognition, executive functions, problem-solving, and social cognition.

Meanwhile, Google DeepMind is also launching a $200,000 Kaggle hackathon for global developers.

The hackathon hands the question-setting power directly to global researchers: the framework is in place, and you are invited to help write the test papers.

From "AGI Grading" to "AGI Physical Examination"

This is not the first time DeepMind has attempted to draw a roadmap for AGI.

In 2023, the same team published the well-known "Levels of AGI" framework, breaking the path to AGI into 5 performance levels,

from "Emerging" to "Superhuman", while also defining 6 autonomy levels, ranging from AI as a pure tool to a fully autonomous agent.

That paper had a great influence. It provided the entire industry with a common language, just like the L1 to L5 in the field of autonomous driving, enabling people to communicate within the same coordinate system.

However, it left a huge gap: the steps were drawn, but how do you actually measure each level?

The new paper aims to fill this gap.

10 Key Cognitive Abilities: Drawing a Map for General Intelligence

Its core is a "Cognitive Taxonomy" that breaks down general intelligence into 10 key cognitive abilities.

Specifically, to evaluate the gap between AI and human cognitive abilities, the first step is to figure out: What are the key processes involved in human cognition?

Over the past many years, psychology, neuroscience, and cognitive science have accumulated a large number of relevant results through experiments, brain imaging, case studies, and model building.

Based on these studies, the team compiled a cognitive classification system to describe the core abilities required to achieve AGI.

Let's first look at 8 basic abilities.

1. Perception

Extracting and processing sensory information from the environment. It includes visual perception (from low-level edge detection to high-level scene understanding), auditory perception (from pitch discrimination to speech understanding), and text perception unique to AI.

LLMs process text directly through tokenization, which is essentially a unique perception modality that humans do not possess. This "superpower" bypasses vision and directly reaches language.

2. Generation

Producing outputs such as text, speech, and actions (robot control, computer operation).

Among them, the most intriguing is "thought generation": producing internal thinking to guide decision-making.

DeepMind links this to OpenAI o1-style reasoning and notes that, since thinking is inherently internal, it may be extremely difficult to evaluate.

3. Attention

When there is an information overload, it is necessary to concentrate cognitive resources on key things.

There is a delicate balance here: One should focus on the current goal without being distracted, and at the same time, be alert to unexpected changes in the environment. Being too focused may cause one to miss danger signals, while being too distracted will lead to getting nothing done.

4. Learning

Acquiring new knowledge and skills through experience.

It includes six major categories: concept formation, associative learning, reinforcement learning, observational learning, procedural learning, and language learning.

The key is that a real AGI should be able to continuously learn and retain new knowledge after deployment, rather than just "cramming" during the training phase or within the context window.

5. Memory

The ability to store and retrieve information.

It includes semantic memory (world knowledge), episodic memory (specific events), procedural memory (skills), prospective memory (remembering what to do at a future moment), and an easily overlooked ability: forgetting.

Yes, the ability to actively clear outdated or incorrect information is also an important part of intelligence.

6. Reasoning

Deriving valid conclusions through logical principles.

It covers five types: deductive, inductive, abductive, analogical, and mathematical reasoning.

It is worth noting that automatic pattern matching does not count as reasoning.

7. Metacognition

This may be the ability that creates the biggest gap among the 10 abilities.

It requires the system to:

  • Know what it knows and what it doesn't know (metacognitive knowledge);
  • Be able to monitor its own cognitive state in real time, such as whether its confidence in an answer is accurate (metacognitive monitoring);
  • And adjust strategies according to the monitoring results, such as actively switching methods when it finds itself making mistakes (metacognitive control).

Put bluntly: how can an AI that doesn't know it's talking nonsense be considered reliable?
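One common way to score metacognitive monitoring of this kind is to check calibration: does a system's stated confidence match its actual accuracy? The sketch below computes expected calibration error (ECE); the function, bin count, and toy data are illustrative assumptions, not from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """ECE: bin predictions by stated confidence, then sum the
    |average confidence - accuracy| gap of each bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece

# Well-calibrated: says 90% confident, is right 90% of the time
print(round(expected_calibration_error([0.9] * 10, [True] * 9 + [False]), 3))   # 0.0
# Overconfident: says 90% confident, is right only 50% of the time
print(round(expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5), 3))  # 0.4
```

A model with low ECE "knows what it knows"; a high ECE is exactly the talking-nonsense-with-confidence failure described above.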

8. Executive Functions

A set of high - order abilities that support goal - oriented behavior.

It includes goal setting, planning, inhibitory control (resisting habitual responses and choosing more appropriate actions), cognitive flexibility (switching between different thinking modes), conflict resolution, and working memory.

In addition to the above 8 "basic building blocks", the framework also defines 2 "composite abilities":

9. Problem Solving

Comprehensively using abilities such as perception, reasoning, planning, and learning to solve specific problems.

It is further divided into fluid reasoning, mathematical problem-solving, algorithmic problem-solving, common-sense problem-solving (including temporal, spatial, and causal reasoning, plus intuitive physics), and knowledge discovery.

10. Social Cognition

The ability to process and interpret social information and make appropriate responses in social scenarios.

It includes social perception, theory of mind (inferring others' beliefs and intentions), and social skills such as cooperation, negotiation, persuasion, and even deception.

It is worth noting that persuasion and deception may also constitute dangerous abilities in certain contexts.

In short, DeepMind's core hypothesis is that if a system has an obvious weakness in any of these 10 dimensions, it will be unable to complete most of the real-world tasks that humans can do.

In that case, it is not a truly "general" intelligence.
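This hypothesis amounts to a one-line gate: generality is limited by the weakest dimension, not the average. A minimal sketch (the function name, threshold, and scores are my own illustration, not the paper's):

```python
def is_general(profile: dict, threshold: float = 0.5) -> bool:
    """Under the weakest-link hypothesis, one weak dimension disqualifies
    a system: every ability must clear the threshold percentile
    (0.5 = the human median)."""
    return min(profile.values()) >= threshold

# A jagged profile: superhuman reasoning cannot compensate
# for below-median social cognition
jagged = {"reasoning": 0.99, "memory": 0.85, "social_cognition": 0.20}
print(is_general(jagged))  # False: the weakest dimension dominates
```

Note that the mean of `jagged` is well above 0.5; averaging would wrongly pass it, which is exactly why the paper rejects single total scores.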

Three Steps to Test the True Quality of AI

With the taxonomy, the next question is how to evaluate.

In response, Google proposed a three - stage evaluation protocol.

Step 1: Cognitive evaluation.

Let the AI complete tasks that cover all 10 cognitive abilities.

The task design has strict requirements:

  • It must target a specific cognitive ability (not mix several things into one task);
  • It must use a confidential question bank and be audited by an independent third party;
  • Difficulty must span a gradient (from questions that are easy for humans but hard for AI, up to questions that challenge human limits);
  • Formats must be diverse (multiple-choice, open-ended, multimodal, multi-step).
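The four requirements above can be read as a checklist for each task record. A hypothetical encoding (all names here are my own, chosen for illustration; the paper prescribes no schema):

```python
from dataclasses import dataclass

# The paper's 10 cognitive abilities, as identifiers
TEN_ABILITIES = {
    "perception", "generation", "attention", "learning", "memory",
    "reasoning", "metacognition", "executive_functions",
    "problem_solving", "social_cognition",
}

@dataclass
class EvalTask:
    target_ability: str        # must name exactly one of the 10 abilities
    question_bank: str         # confidential / held-out source
    third_party_audited: bool  # independent audit requirement
    difficulty: str            # e.g. "easy-for-humans" ... "superhuman"
    fmt: str                   # "multiple-choice", "open-ended", "multimodal", "multi-step"

    def is_valid(self) -> bool:
        """Check the two hard requirements: a single known ability, audited."""
        return self.target_ability in TEN_ABILITIES and self.third_party_audited

task = EvalTask("reasoning", "held-out-bank-v1", True, "easy-for-humans", "open-ended")
print(task.is_valid())  # True
```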

Step 2: Collect human baselines.

Let a large number of humans do the same questions under exactly the same conditions.

The same instructions, the same response format, and the same access to tools.

DeepMind suggests that the sample should be "demographically representative adults who have completed at least a high-school education".

Step 3: Build a cognitive profile.

Locate the AI's performance within the distribution of human performance: compute the proportion of human subjects the system outperforms, and draw a radar chart across the 10 dimensions.
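The per-dimension statistic is simple: the fraction of the human baseline the AI beats. A minimal sketch, with made-up scores on two of the ten abilities:

```python
def percentile_outperformed(ai_score, human_scores):
    """Fraction of human baseline subjects the AI outperforms on one ability."""
    return sum(h < ai_score for h in human_scores) / len(human_scores)

# Hypothetical raw scores from 10 human subjects on two abilities
human_reasoning = [52, 61, 70, 74, 80, 85, 88, 90, 93, 97]
human_social    = [55, 60, 66, 71, 75, 79, 83, 86, 91, 95]

# One axis value per ability; 10 such values form the radar chart
profile = {
    "reasoning":        percentile_outperformed(96, human_reasoning),
    "social_cognition": percentile_outperformed(63, human_social),
}
print(profile)  # {'reasoning': 0.9, 'social_cognition': 0.2}
```

Each value is one axis of the radar chart, so a jagged system shows up immediately as a lopsided polygon.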

Why must a radar chart be drawn?

Because a core feature of AI abilities is that they are "jagged", a phenomenon DeepMind has repeatedly verified in another study:

A model may outperform 99% of humans in logical reasoning but be worse than the human median in social cognition or common - sense reasoning.

A single total score cannot reveal this fatal imbalance. The radar chart is what exposes the disguise.

DeepMind presented three hypothetical scenarios:

A. A system that is below the human median in some dimensions will surely "fail" in certain real - world scenarios.

B. A system that exceeds the human median in all 10 items can at least match 50% of humans.

C. A system that reaches the 99th percentile in all aspects can almost match anyone.

Meanwhile, DeepMind did not shy away from the three major sources of uncertainty: (1) whether the quality of the tasks themselves is up to standard; (2) whether a test is really measuring the target ability (construct validity); (3) the inherent randomness of generative AI: asking the same question twice may yield completely different answers.
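That third source of uncertainty can at least be quantified by re-asking the same question and measuring agreement. A hedged sketch; `flaky_model` is a hypothetical stub standing in for a sampled model call, and the 20-run protocol is my own choice, not the paper's:

```python
import random
from collections import Counter

def consistency(ask, question, n=20):
    """Agreement rate: the share of n repeated runs that return the modal answer.
    1.0 means the system is deterministic on this question;
    values near 1/n mean its answers are close to random."""
    answers = [ask(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][1] / n

# Hypothetical stub: answers correctly about 3 times out of 4
def flaky_model(question):
    return random.choice(["Paris", "Paris", "Paris", "Lyon"])

random.seed(0)  # fix the stub's randomness for a reproducible demo
print(consistency(flaky_model, "What is the capital of France?"))
```

Any single-run score for such a system should be reported together with this spread, otherwise the benchmark number is partly noise.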

Why the Old Measuring Ruler Is No Longer Useful

What is the significance of Google DeepMind's research?

Why are the previous scales for measuring AGI no longer applicable?

The reason is that it has become impossible to judge what counts as AGI: GPT-4 can pass the bar exam, Gemini can read a 100,000-token paper, and Claude can write code faster than programmers.

But which one is AGI? The existing evaluation system not only fails to answer this question; its two underlying assumptions have also collapsed.

The first is the "small-town test-taker" dilemma: data contamination.

If an AI system has already "seen" the answers or problem-solving strategies of the test questions in its vast Internet training data, a high score proves nothing about general intelligence. At most, it is a repeater with an excellent memory.

The second is more tricky: Should we evaluate the "model" or the "system"?

In the past, we tested an isolated model, but today's AI is a complete system. It comes with system prompts, can call calculators, execute code, search the Internet, and even call other AI models.

For example, suppose you want to test an AI's store of historical knowledge, but the system can search the Internet at any time. What are you really testing: its "memory" or its "search skills"?