The tables have turned: AI starts grading humans. Claude's scoring criteria are exposed, and an excellent human scores 7.5 points.
The tables have turned! AI is now grading humans!
The company behind this feature is, once again, the ever-unconventional Anthropic.
Although the feature is still in limited (gray-release) testing, word of it has spread quickly through AI circles overseas.
How does AI grade humans?
Imagine this scenario: You open Claude's settings panel, click on a dedicated screen called "AI Fluency", and then click to generate a report.
A few seconds later, a "health check report" on your AI usage habits will appear right in front of you.
It scans every interaction you've had in Chat (daily conversations), Cowork (the collaboration space), and even the hardcore Claude Code, and then scores you against a strict set of standards (out of a maximum of 11 points).
Some quick-fingered users have already shared the score the AI gave them: 7.5 points.
What's even scarier is that the AI's evaluation hits the nail on the head.
This netizen shared the weakness analysis Claude gave him: "For example, the report pointed out that I use various Connectors extremely frequently, but when it comes to sports data, recipes, or even maps and geographical locations, I seem to know nothing."
Moreover, Claude not only pointed out his problems but also gave direct guidance, such as "actively draw out the AI's keen discernment by supplying context" and "before asking me to write the first draft, try saying to me: give me a concise summary of the key points without any preamble. This will make your first draft much cleaner."
It's terrifying. This is no cold piece of software; it's like a cyber tutor holding a pointer, exasperated by your lack of progress.
Some users excitedly posted in search of fellow witnesses: "I saw it too! I came to the forum specifically to confirm that I'm not crazy! I generated a report, but when I got back to my laptop, the server returned an error, and the feature had disappeared!"
Now, this fleeting leak has piqued people's curiosity to the extreme.
Everyone is curious: What exactly are these 11 grading standards?
Revealing the mystery of "AI fluency" through nearly ten thousand anonymous conversations
To figure out these 11 standards, we have to go back to the forward-looking, in-depth research Anthropic has published: the "AI Fluency Index Report".
In the past, we always thought that "being able to write complex prompts" meant understanding AI. Anthropic believes this view is too narrow. As models become smarter, memorizing prompt templates by rote is outdated.
True experts master a soft skill called "AI fluency". Just like mastering a foreign language proficiently, fluency means being able to collaborate with AI naturally, efficiently, and seamlessly.
To quantify this intangible concept, Anthropic worked with two academics, professors Rick Dakan and Joseph Feller, who proposed the well-known "4D AI Fluency Framework".
The research team used a privacy-preserving analysis tool (with no human intervention at any stage; Claude 4 handled behavior classification and Claude 3.5 Haiku handled language detection). Over a single week, they analyzed 9,830 real, anonymized, multi-turn conversations.
They were surprised to find that the gap between one AI user and another is even wider than the proverbial gap between humans and dogs.
Of the 24 behaviors the framework uses to measure human-AI collaboration, 13 happen off screen (for example, whether you hide from your boss that AI did the work, or whether you consider the ethical consequences of AI-generated content), while the remaining 11 can be observed directly in the chat box.
[Figure: Prevalence of each AI fluency behavior indicator across 9,830 Claude.ai conversations, sorted from most to least common within each competency and color-coded by competency.]
These 11 indicators are the underlying logic of the "scorecard" now built into Claude!
They mainly revolve around three major dimensions: description, delegation, and discernment.
11 "magic mirrors", where do you expose your true self?
Are you ready to be examined? Let's break down these 11 core behavior indicators one by one.
Dimension 1: Description - Do you really know what you want?
Many people's chat boxes look like this: "Help me write a weekly report", "Write the code for a Snake game".
In Claude's eyes, the fluency of such commands is almost zero. True experts spend time on "setting goals" and "constructing conversations".
1. Define the goal
Do you explain to AI the ultimate purpose of what you're doing?
Low-scoring players: "Help me polish this English text."
High-scoring players: "I'm going to send a cold email to a venture capital firm in Silicon Valley to seek financing. Please polish this English text to ensure the tone is confident but not overly arrogant."
2. Specify the format
Do you clearly define what the output should look like?
High-scoring players know to say: "Please output a Markdown table", or "Please present it as 3 sub-headings plus key points, with no more than 50 words per paragraph."
3. Provide examples
Few-shot prompting is still king.
Do you feed AI an example you approve of before asking it to do the work? "Please write in the tone of the following popular article..."
4. Supplement context
AI is not a mind reader.
Do you provide necessary background information? Such as your industry background, the characteristics of your target audience, or even the pitfalls you've encountered before.
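To make the four description behaviors concrete, here is a minimal Python sketch that assembles a prompt from a goal, a format, examples, and context, and reports which of the four are still missing. The names `PromptSpec`, `build_prompt`, and `missing_elements` are hypothetical helpers invented for illustration; they are not part of Claude or of Anthropic's report.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PromptSpec:
    """Hypothetical container for the four description behaviors."""
    task: str                                          # what you want done
    goal: str = ""                                     # 1. the ultimate purpose
    output_format: str = ""                            # 2. what the output should look like
    examples: List[str] = field(default_factory=list)  # 3. samples you approve of
    context: str = ""                                  # 4. background the AI cannot guess


def build_prompt(spec: PromptSpec) -> str:
    """Assemble a plain-text prompt, skipping any section left empty."""
    parts = [spec.task]
    if spec.goal:
        parts.append(f"Goal: {spec.goal}")
    if spec.output_format:
        parts.append(f"Format: {spec.output_format}")
    if spec.context:
        parts.append(f"Context: {spec.context}")
    for i, example in enumerate(spec.examples, start=1):
        parts.append(f"Example {i}:\n{example}")
    return "\n\n".join(parts)


def missing_elements(spec: PromptSpec) -> List[str]:
    """List which of the four description behaviors the prompt lacks."""
    checks = {
        "goal": bool(spec.goal),
        "format": bool(spec.output_format),
        "examples": bool(spec.examples),
        "context": bool(spec.context),
    }
    return [name for name, present in checks.items() if not present]


if __name__ == "__main__":
    spec = PromptSpec(
        task="Please polish this English text.",
        goal="A cold email to a Silicon Valley venture capital firm to seek financing.",
        output_format="3 sub-headings plus key points, no more than 50 words per paragraph.",
    )
    print(build_prompt(spec))
    print("Still missing:", missing_elements(spec))
```

A checklist like this is only a reminder, not a quality guarantee: a prompt can tick all four boxes and still be vague.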
Dimension 2: Delegation - Treat AI as a partner, not a vending machine
One striking finding in Anthropic's report is that the most common pattern of AI use is "augmentation".
This means people use AI as a thinking partner that sparks ideas, rather than simply dumping all the work on it. Such conversations show more than twice the fluency of short back-and-forth exchanges!
5. Iteration and refinement - The strongest predictor!
This is the most important indicator in the entire report! As many as 85.7% of high-quality conversations contain this behavior.
What is iteration? It means not settling for the AI's first answer!
Low-scoring players: When they see the AI's poor writing, they scold it and start a new conversation.
High-scoring players: "The direction of your first point is right, but the second point is too academic. Please keep the first point, replace the second with a more down-to-earth, real-life example, and then try again."
6. Task decomposition
Do you ask AI to write a 100,000-word novel in one go?
Users with high fluency know to break large goals down: "Let's first discuss the outline. Okay, now write the first half of the first chapter based on the outline..."
7. Discuss methods
Before starting, do you ask AI: "What do you think is the best process to solve this problem?"
Let AI output its thinking path first, and then you make corrections.
Dimension 3: Discernment - Don't be deceived by AI's sweet talk
As large models become smarter, their hallucinations are becoming more and more convincing. Discernment is your baseline for survival in this era.
8. Question the reasoning
When AI gives a counterintuitive conclusion or complex code, do you ask: "What's the logic behind this conclusion?" or "Please explain line by line why this code is written like this."
9. Fact-checking
Do you ask AI to provide references for the data it gives, or verify its accuracy through follow-up questions?
10. Identify missing context
When the solution given by AI seems perfect but is out of touch with reality, can you point out precisely what it missed? For example: "Your previous analysis ignored the fact that our company currently has a budget of only 10,000 yuan. Please re-evaluate."
11. Evaluate the results
Do you clearly evaluate AI's output? For example: "The metaphor you used this time is very accurate, but the emotional lift at the end is not strong enough. We need to adjust the ending."
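The eleven indicators above can be sketched as a simple checklist scored out of 11. How Claude actually computes its score has not been published, so everything here is an assumption made for illustration: the 0 / 0.5 / 1 scale per indicator (chosen only because a score like 7.5 implies partial credit), the function name `score_report`, and the sample breakdown, which is invented and is not the shared user's real report. The grouping of indicators follows this article's Dimensions 1, 2, and 3.

```python
from typing import Dict

# Indicator names follow the article's list; the grouping mirrors its
# Dimension 1 / 2 / 3 headings. The scoring scale is an assumption.
INDICATORS = {
    "description": [
        "define the goal", "specify the format",
        "provide examples", "supplement context",
    ],
    "delegation": [
        "iteration and refinement", "task decomposition", "discuss methods",
    ],
    "discernment": [
        "question the reasoning", "fact-checking",
        "identify missing context", "evaluate the results",
    ],
}

ALLOWED_SCORES = {0.0, 0.5, 1.0}  # assumed: absent / sometimes / consistently


def score_report(observed: Dict[str, float]) -> Dict[str, float]:
    """Return per-dimension subtotals plus a total out of 11.

    Indicators missing from `observed` count as 0.
    """
    known = {name for names in INDICATORS.values() for name in names}
    unknown = set(observed) - known
    if unknown:
        raise ValueError(f"Unknown indicators: {sorted(unknown)}")
    for name, value in observed.items():
        if value not in ALLOWED_SCORES:
            raise ValueError(f"Score for {name!r} must be 0, 0.5, or 1")

    result = {
        dimension: sum(observed.get(name, 0.0) for name in names)
        for dimension, names in INDICATORS.items()
    }
    result["total"] = sum(result.values())
    return result


if __name__ == "__main__":
    # Invented example that happens to add up to 7.5.
    sample = {
        "define the goal": 1.0, "specify the format": 1.0,
        "provide examples": 0.5, "supplement context": 1.0,
        "iteration and refinement": 1.0, "task decomposition": 0.5,
        "discuss methods": 0.5,
        "question the reasoning": 0.5, "fact-checking": 0.5,
        "identify missing context": 0.0, "evaluate the results": 1.0,
    }
    print(score_report(sample))
```

The per-dimension subtotals make it easy to see where a score was lost, which is roughly the role the article attributes to the weakness analysis in the report.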
The scariest insight: Thinking degradation under a beautiful package
In this report, which runs to tens of thousands of words, the finding most likely to make you shudder is the "Artifact Paradox".
Comparing conversations that involve artifacts (sample size: 1,209) with conversations that do not (sample size: 8,621), the prevalence of the behavior indicators shifts as follows: description and delegation behaviors increase, while all three discernment behaviors decrease.
We all know that Claude's killer feature is Artifacts (visual panels that can render web pages, code, flowcharts, and interactive interfaces on demand). In conversations containing such advanced outputs (12.3% of the sample), the way humans collaborate with AI changes drastically.
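As a quick sanity check on the figures quoted above, the short Python sketch below confirms that the two sample sizes add up to the 9,830 conversations in the study and that artifact conversations make up about 12.3% of them. The function name `artifact_share` is a hypothetical helper.

```python
TOTAL_CONVERSATIONS = 9_830
WITH_ARTIFACTS = 1_209
WITHOUT_ARTIFACTS = 8_621


def artifact_share(with_artifacts: int, without_artifacts: int) -> float:
    """Fraction of conversations that involve an artifact."""
    return with_artifacts / (with_artifacts + without_artifacts)


if __name__ == "__main__":
    # The two groups should partition the full sample.
    assert WITH_ARTIFACTS + WITHOUT_ARTIFACTS == TOTAL_CONVERSATIONS
    share = artifact_share(WITH_ARTIFACTS, WITHOUT_ARTIFACTS)
    print(f"Artifact conversations: {share:.1%}")  # prints "Artifact conversations: 12.3%"
```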
At first glance, humans seem to become more professional: the share who clearly define goals rises by 14.7 percentage points, the share who specify formats by 14.5 percentage points, and the share who provide examples by 13.4 percentage points.
Before the work starts, humans act like shrewd project managers, laying everything out clearly.
However, once the AI produces a seemingly perfect, smoothly running Artifact, human brains seem to switch off en masse!
The data reveals this mercilessly: in conversations with such polished results, humans' critical scrutiny drops sharply.
- The rate of identifying missing context falls by 5.2 percentage points
- The rate of fact-checking falls by 3.7 percentage points
- The rate of questioning AI's reasoning falls by 3.1 percentage points
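The six shifts quoted above can be tabulated to make the pattern explicit. This Python sketch groups them by the article's dimensions and counts how many rose and fell in each. It treats the figures as percentage-point changes, which is an assumption about how they should be read, and `summarize` is a hypothetical helper. No baseline rates are given here, so relative changes cannot be computed.

```python
from typing import Dict, List, Tuple

# (dimension, indicator, reported shift with artifacts vs. without)
SHIFTS: List[Tuple[str, str, float]] = [
    ("description", "define the goal", 14.7),
    ("description", "specify the format", 14.5),
    ("description", "provide examples", 13.4),
    ("discernment", "identify missing context", -5.2),
    ("discernment", "fact-checking", -3.7),
    ("discernment", "question the reasoning", -3.1),
]


def summarize(shifts: List[Tuple[str, str, float]]) -> Dict[str, Dict[str, float]]:
    """Per dimension: how many indicators rose, how many fell, and the mean shift."""
    grouped: Dict[str, List[float]] = {}
    for dimension, _indicator, delta in shifts:
        grouped.setdefault(dimension, []).append(delta)
    return {
        dimension: {
            "up": sum(1 for d in deltas if d > 0),
            "down": sum(1 for d in deltas if d < 0),
            "mean_pp": round(sum(deltas) / len(deltas), 1),
        }
        for dimension, deltas in grouped.items()
    }


if __name__ == "__main__":
    for dimension, stats in summarize(SHIFTS).items():
        print(dimension, stats)
```

Every quoted description indicator moves up and every quoted discernment indicator moves down, which is the asymmetry the "Artifact Paradox" refers to.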
Why is this? Anthropic's analysts put it bluntly: because it looks so real!
When AI gives you plain text, you subconsciously look for mistakes; but when AI directly renders a beautifully typeset PDF or an app interface with glowing buttons, you subconsciously think: "Wow, it can even create such a complex UI. The logic behind it must be correct."
If something looks finished, users will treat it as finished.
But this is exactly the most dangerous moment!
Anthropic's recent Economic Index report shows that the more complex the task, the more likely large models are to make mistakes. Yet it is precisely with complex code and sophisticated graphics, the outputs that most need fact-checking, that humans let their guard down.