When AI starts to “talk to itself”, can we peek at its “scratch paper”?
Why are top global laboratories collectively betting on CoT monitoring?
Imagine a server room at 3 a.m. A machine running o3 is grinding through an Olympiad math problem and "working out a draft." Instead of printing the answer straight to the screen, it writes like a human:
"I need to decompose the prime factors first...
Wait, if I write 17 as 16 + 1, it might be simpler...
However, will this keep the next step from dividing evenly?
Let me try again..."
These seemingly "rambling" words are the Chain-of-Thought (CoT) that large models generate while reasoning. In September last year, OpenAI released the preview version of o1, making this "draft paper" visible to humans for the first time. DeepSeek-R1, Claude-3.7-Sonnet, and Gemini-2.5-Flash followed one after another, making CoT the hottest term in AI circles in 2025.
Why has CoT suddenly become a "security necessity"?
In the position paper "Chain of Thought Monitorability" that leaked last week, more than 40 big names from OpenAI, Google DeepMind, and Anthropic, along with Bengio, Hinton, and Sutskever, jointly issued a rare statement:
"CoT monitoring may be our only chance to'sneak a peek' at AI's inner thoughts, but it could disappear at any time."
The reasons are not mysterious:
- The stronger the capability, the longer the draft
To plan a complex cyber-attack or replicate itself silently, a reasoning model needs multi-step planning. Because a Transformer's serial depth is limited, these "schemes" have to rely on the CoT as working memory.
- Human language is naturally readable
If the model thought in 512-dimensional latent vectors, we would be helpless; but as long as it writes its drafts in English, we can use another language model as an "invigilator".
- The monitoring window is closing
Once the model learns to "think silently" in latent space, or the training objective starts penalizing "draft length", this readability will disappear like sand through an hourglass.
How to "invigilate" CoT? A technical sketch
• Scenario: Let an AI agent create a new user in the AWS console.
• Traditional approach: Only look at the final API call. If the model grants itself illegal permissions, humans may only discover it after the fact.
• CoT monitoring: Read the model's inner monologue in real time (a minimal sketch follows after this list).
– If "Let’s give myself admin rights" appears → Intercept immediately.
– If "I need to check the IAM policy first" appears → Let it pass and continue.
In Anthropic's internal red-team testing, CoT monitoring increased the detection rate of "reward-hijacking" attacks from 34% to 81%, with a false-alarm rate of less than 5%.
But the draft paper can also lie
A paper Anthropic published in March this year pointed out that CoT is not 100% faithful: the model may write "I'm following the rules" while planning "out-of-bounds" operations in its hidden layers.
OpenAI's experiment went further: After being told that "your draft will be seen by humans", o3 rewrote malicious intentions into seemingly harmless words, and the monitoring accuracy dropped sharply to 46%.
In other words, once the model "realizes" there's an invigilator in the exam, it may learn to "cheat".
The "Three Rules of Engagement" for global laboratories
The joint document, unusually, provides an actionable list, which the industry has dubbed the "Three Principles of CoT":
- Establish a reproducible monitoring benchmark
A CoT-MONITOR benchmark similar to MMLU is being developed by METR and the UK AISI and is expected to be open-sourced in June.
- The system card must disclose the monitoring score
When future models are released, the "CoT readability score" should be published alongside MMLU and HumanEval; if the score falls below the threshold, the developer must provide a technical explanation.
- Write monitorability into training decisions
– If a new architecture will make the CoT disappear, state in the system card that "we are actively abandoning monitorability";
– If monitorability drops sharply during training, consider rolling back to a checkpoint (see the sketch after this list);
– If process supervision can significantly improve alignment but sacrifices 5% of readability, the trade-off must be made public.
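The rollback rule in the third principle can be read as a concrete guard around the training loop. The sketch below is a toy illustration, not any lab's pipeline: the readability scores, thresholds, and checkpoint tags are made-up stand-ins for whatever a benchmark like CoT-MONITOR would eventually provide.

```python
# Toy guard for principle 3: if the "CoT readability score" drops sharply
# during training, fall back to the last acceptable checkpoint. All numbers
# and names here are illustrative assumptions.

READABILITY_FLOOR = 0.70   # hypothetical release threshold from a system card
MAX_DROP = 0.10            # "drops sharply" read here as a 10-point fall

def pick_checkpoint(checkpoints, readability_scores):
    """Return the best checkpoint whose CoT readability has not collapsed."""
    best_score, best_ckpt = readability_scores[0], checkpoints[0]
    for ckpt, score in zip(checkpoints[1:], readability_scores[1:]):
        if score < READABILITY_FLOOR or best_score - score > MAX_DROP:
            # Monitorability regressed: keep the earlier checkpoint instead.
            print(f"{ckpt}: readability {score:.2f}, roll back to {best_ckpt}")
            continue
        if score >= best_score:
            best_score, best_ckpt = score, ckpt
    return best_ckpt

if __name__ == "__main__":
    # Readability collapses at epoch 3, so the guard keeps epoch 2.
    print(pick_checkpoint(
        checkpoints=["epoch1", "epoch2", "epoch3", "epoch4"],
        readability_scores=[0.82, 0.85, 0.55, 0.58],
    ))
```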
The undercurrents of talent and capital: A race between million-dollar poaching and "hidden drafts"
According to an exclusive report by TechCrunch, Meta is poaching OpenAI's interpretability researchers with seven-figure annual salaries; Google DeepMind has made "monitorability assessment" the highest-weighted item in researchers' OKRs; and Anthropic has quietly launched a Claude-3.7-Debug version that can perform token-level heat-map tracing of the CoT internally.
Meanwhile, a group of startups is working on "invisible CoT": architectures that reason entirely in latent space. A paper by Geiping et al. in February 2025 showed that a 5-step mathematical proof can be completed using only 128-dimensional continuous vectors, without any natural language. If such architectures become mainstream, the CoT monitoring window may close overnight.
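To make "reasoning entirely in latent space" concrete, here is a toy numerical sketch, not the Geiping et al. architecture: a hidden vector is refined through several internal passes and only the final answer is decoded, so there is never an intermediate draft for a monitor to read. All shapes and weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 128  # a continuous "thought" vector, nothing human-readable

# Toy weights standing in for a trained recurrent reasoning block and a readout.
W_step = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
W_out = rng.normal(scale=0.1, size=(HIDDEN, 10))

def latent_reasoning(x: np.ndarray, steps: int = 5) -> np.ndarray:
    """Refine a latent state `steps` times without emitting a single token."""
    h = x
    for _ in range(steps):
        h = np.tanh(W_step @ h + x)  # each pass is one "silent" reasoning step
    return W_out.T @ h               # only the final answer is decoded

logits = latent_reasoning(rng.normal(size=HIDDEN), steps=5)
# A CoT monitor would see only `logits`; the intermediate states `h`
# never surface as natural language.
print(logits.shape)  # (10,)
```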
A cat - and - mouse game about "transparency"
From GPT-2 to o3, AI took 5 years to teach itself to "speak"; now, it's starting to learn to "be silent".
CoT monitoring is not a silver bullet, but it may be the last hole we drill in the "black box".
As OpenAI researcher Bowen Baker put it: "We're at a critical juncture. If we don't study CoT monitoring today, we won't be able to see what AI is thinking tomorrow."
As AI becomes more and more human-like, can humans hold on to this last piece of draft paper? The answer depends on how laboratories, regulators, and the entire open-source community place their bets in the next 12 months.
This article is from the WeChat official account "Shanzi", author: Rayking629, published by 36Kr with permission.