Anthropic hands 95% of its internal business analysis tasks to Claude, and the secret is not a more powerful model
When you ask AI to retrieve data, it gives seemingly comprehensive answers, yet you can't trust them.
Anthropic has just published its answer to the problem that gives countless AI data analysts the biggest headache, and its official blog cites two figures of 95%:
95% of the company's internal business analysis queries are now automatically completed by Claude;
The overall accuracy is approximately 95%.
https://claude.com/blog/how-anthropic-enables-self-service-data-analytics-with-claude?utm_source=chatgpt.com
The post goes straight at the core pain point of AI data queries: the answers look correct, but you cannot readily trust them, because you do not know where the pitfalls are.
Anthropic's official team has even given this situation a name, the "false sense of precision":
If you directly connect Claude to the data warehouse and let it run, it might give you an answer with a nice format and a confident tone, but silently use the wrong table.
The post comes from Anthropic's data science and data engineering team. Having delegated repetitive, mechanical data retrieval to Claude, the team has freed up time for causal modeling, prediction, machine learning, and other work.
One of the most counterintuitive points in the post is that the hardest part of getting the model to retrieve numbers accurately is not writing SQL.
Structured Query Language (SQL) is the language used to request data from a database. In the past, being able to write it was a threshold for data analysis.
However, for today's large models, translating natural language into SQL is no longer the main bottleneck. The real difficulty lies in the step before writing SQL.
Three Common Types of Errors: The Data Itself is a "Muddle"
Anthropic believes that the difficulty in data analysis lies in the fact that the data itself is a "muddle".
For the same question, there are often several sets of data that look similar. It's hard to tell which one to use.
What the AI really needs to get right is picking the data you are looking for out of this pile. Once that step is done correctly, writing the SQL to retrieve it follows almost naturally.
Anthropic attributes the main reasons for the model's errors in data analysis to the following three types.
The real difficulty for analytical AI is to map the user's question to the correct and up-to-date data entity.
The first type is the mismatch between concepts and entities.
In a data model, there are hundreds of seemingly usable fields, and there could be millions behind them. When you ask "How many active users are there?", what actions count as "active"? Do fraudulent accounts count? Should the look-back window be 7 days or 30 days? The model can't pick the right option from these similar choices.
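To make the ambiguity concrete, here is a minimal sketch of how two plausible definitions of "active" give different answers on the same events. The `Event` and `ActiveUserDefinition` types, the action names, and the sample data are all hypothetical illustrations, not anything taken from Anthropic's post.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    # Hypothetical event record; field names are illustrative only.
    user_id: str
    action: str
    day: date
    is_fraud: bool = False


@dataclass(frozen=True)
class ActiveUserDefinition:
    # Every choice the question "how many active users?" leaves open.
    qualifying_actions: frozenset
    lookback_days: int
    exclude_fraud: bool


def count_active_users(events, definition, as_of):
    """Count distinct users with a qualifying event inside the look-back window."""
    window_start = as_of - timedelta(days=definition.lookback_days)
    active = set()
    for event in events:
        if not (window_start < event.day <= as_of):
            continue
        if event.action not in definition.qualifying_actions:
            continue
        if definition.exclude_fraud and event.is_fraud:
            continue
        active.add(event.user_id)
    return len(active)


AS_OF = date(2024, 1, 31)
EVENTS = [
    Event("u1", "login", date(2024, 1, 28)),
    Event("u2", "message", date(2024, 1, 11)),
    Event("u3", "message", date(2024, 1, 29), is_fraud=True),
]

STRICT = ActiveUserDefinition(frozenset({"login", "message"}), 7, True)
LOOSE = ActiveUserDefinition(frozenset({"login", "message"}), 30, False)

strict_count = count_active_users(EVENTS, STRICT, AS_OF)
loose_count = count_active_users(EVENTS, LOOSE, AS_OF)
```

Both definitions are defensible, yet they disagree on the same three events, which is why the choice has to be pinned down somewhere other than in the model's guess.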
The second type is outdated data.
Data sources, business definitions, and table structures change every day. The knowledge in the model's "mind" gradually "rusts", and it starts to give answers with "minor errors". This kind of error is the hardest to detect: everything looks correct, but it is in fact already wrong.
The third type is retrieval failure.
The information is actually in the model, and the annotations are complete. However, the search space is too large, and it simply can't find it.
The difference becomes clear when you compare it with writing code. Writing code is an open-ended problem, and documentation and unit tests naturally guard against hallucinations. Data analysis often has only one correct answer and one correct source, and there is no definite way to prove that an answer is correct.
So, Anthropic's conclusion is that the accuracy of analysis is a problem of context and verification, not whether the model can write code.
From 21% to 95%: What Did Anthropic Do in Between?
To solve these three types of errors, Anthropic has built a system called the agentic analytics stack, which consists of four layers, each targeting a specific type of problem.
The structure of Anthropic's agentic analytics stack: The data foundations, sources of truth, skills, and validation layers each have their own functions.
The first layer, the data foundations: This is the data warehouse itself, including data models, transformations, tests, tables, and the metadata that describes them. The core action is to converge each concept onto a single authoritative table, which targets "concept-entity ambiguity" and also builds the first engineering defense against outdated metric definitions.
Anthropic emphasizes that traditional data engineering techniques such as dimensional modeling are equally crucial in the AI era.
The second layer, sources of truth: These are several authoritative sources that the model refers to when retrieving data. In order of decreasing credibility, they are: semantic layer > lineage and transformation graph > query corpus > business context. Its role is to translate vague user questions into the single correct, maintained metric definition in the system.
The first two layers together specifically solve the pain point of "mismatched concepts".
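The credibility ordering can be pictured as a lookup that consults the most trusted source first and only falls back when it has nothing to say. This is a sketch under stated assumptions: the source keys, the dictionary shape, and the sample entries are hypothetical, and the post does not describe how the lookup is actually implemented.

```python
# Hypothetical keys mirroring the ordering in the text, most credible first.
SOURCE_PRIORITY = (
    "semantic_layer",
    "lineage_graph",
    "query_corpus",
    "business_context",
)


def resolve_term(term, sources):
    """Return (source_name, definition) from the most credible source that knows the term."""
    for name in SOURCE_PRIORITY:
        definition = sources.get(name, {}).get(term)
        if definition is not None:
            return name, definition
    return None


# Illustrative data only: two sources disagree about "active_users".
SOURCES = {
    "semantic_layer": {"active_users": "metrics.active_users_7d"},
    "query_corpus": {
        "active_users": "tmp.active_users_old",
        "churned_users": "analytics.churn_v2",
    },
}

active_hit = resolve_term("active_users", SOURCES)
churn_hit = resolve_term("churned_users", SOURCES)
missing_hit = resolve_term("revenue", SOURCES)
```

When two sources disagree, the higher-ranked one wins, so a stale table name sitting in an old query never overrides a maintained definition.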
The third layer, Skills: This layer codifies the query workflows of senior analysts into reusable modules, targeting "retrieval failure" and ensuring that the model can reliably find and use the correct answer.
The fourth layer, validation: It includes offline evaluation, ablation experiments, online verification, and maintenance processes to detect which type of the three errors is still slipping through, and it is also the main way to combat "outdated data".
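One way to picture the offline evaluation step is a tally that reports overall accuracy alongside a count of failures per error type. The sketch below assumes each evaluated question has already been labeled by hand with one of the three error types; the label names and the sample results are hypothetical.

```python
from collections import Counter

# Hypothetical labels for the three error types described earlier.
ERROR_TYPES = ("concept_mismatch", "stale_data", "retrieval_failure")


def summarize_eval(results):
    """results: iterable of (is_correct, error_type or None).

    Returns overall accuracy and the number of failures per error type.
    """
    results = list(results)
    total = len(results)
    correct = sum(1 for is_correct, _ in results if is_correct)
    failures = Counter(err for is_correct, err in results if not is_correct)
    accuracy = correct / total if total else 0.0
    return accuracy, {name: failures.get(name, 0) for name in ERROR_TYPES}


# Illustrative run: 8 correct answers, 2 failures of different kinds.
SAMPLE_RESULTS = [(True, None)] * 8 + [
    (False, "stale_data"),
    (False, "retrieval_failure"),
]

accuracy, failures_by_type = summarize_eval(SAMPLE_RESULTS)
```

Breaking failures down by type shows which layer needs attention, which a single accuracy number cannot do.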
During the process of building these layers, Anthropic also encountered two counterintuitive results.
One is the cost of taking shortcuts.
They tried letting the large model automatically generate metric definitions from the raw tables. The generated definitions encoded the very ambiguities they wanted to eliminate, and the approach scored negatively in evaluation. In the end, they had to go back to the old way: Claude drafts the document, and humans make the final decision on the definitions.
The other is even more unexpected. Feeding thousands of historical SQL statements directly to the model for retrieval only increased the accuracy by less than 1 percentage point.
Among these four layers, the most significant leap in accuracy disclosed by Anthropic comes from Skills.
The sources of truth are declarative knowledge, telling the model what each metric means; Skills are procedural knowledge, telling it where to search first, in what order to search, and what a qualified analysis looks like.
In form, a Skill is a folder containing SKILL.md, descriptions, scripts, and resources, which Claude reads as needed. This mechanism can be cross-checked against Anthropic's official documentation and GitHub repository.
How amazing is the effect?
According to the figures Anthropic disclosed from its internal evaluations, Claude's accuracy without Skills was no more than 21%; with Skills, it consistently exceeded 95%, and in some fields it came close to 99%.
The difference between 21% and 95% is not a stronger model, but this structure.
Behind the 95% Figure: This System "Will Decay"
However, the 95% accuracy didn't last long.
Anthropic found that this system would become outdated: They watched as the offline accuracy dropped from about 95% to about 65% within a month.
The reason behind this is that the data model changes every day, and no one takes care of the Skill documents that describe it. So, after a few weeks, it starts to give wrong answers.
So, the Anthropic team treats maintenance as a serious engineering task: They put the Skill documents and the data model into the same code repository. When a pull request (PR) changes the model, the corresponding document is updated in the same PR. Now, about 90% of data model changes are submitted along with a Skill update.
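Keeping both in one repository makes this kind of rule checkable by machine. The following is a minimal sketch of such a check, assuming a hypothetical layout in which data models live under `models/` and Skill documents under `skills/`; the post does not describe Anthropic's actual repository layout or tooling.

```python
def model_changes_missing_skill_update(
    changed_files, model_dir="models/", skill_dir="skills/"
):
    """Return the changed model files if the PR touches models but no Skill docs.

    The directory names are hypothetical defaults, not Anthropic's real layout.
    An empty list means the PR passes the check.
    """
    model_changes = [path for path in changed_files if path.startswith(model_dir)]
    touches_skills = any(path.startswith(skill_dir) for path in changed_files)
    if model_changes and not touches_skills:
        return model_changes
    return []


# Illustrative PRs, each given as a list of changed file paths.
PR_WITH_DOCS = ["models/users.sql", "skills/users/SKILL.md"]
PR_WITHOUT_DOCS = ["models/users.sql", "README.md"]
PR_UNRELATED = ["README.md"]

ok_result = model_changes_missing_skill_update(PR_WITH_DOCS)
flagged_result = model_changes_missing_skill_update(PR_WITHOUT_DOCS)
unrelated_result = model_changes_missing_skill_update(PR_UNRELATED)
```

A check like this could warn rather than block, since the reported figure of about 90% suggests that some model changes legitimately need no Skill update.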
They also conducted a negative experiment.
They gave the agent full-text search (grep) access to historical SQL files and confirmed from the run logs that it read each one. Accuracy moved by less than 1 percentage point. What's more, for about 80% of the questions it got wrong, the correct answer was actually in the corpus it had just read. It saw the answers but still didn't use them.
At that moment, Anthropic realized that the real bottleneck is the structure, not whether it can access the data. This judgment directly changed their roadmap for the next few months.
Finding the right structure can push the accuracy to a high level. But the last few percentage points have to be bought with real resources.
For example, adding an adversarial review to make the model repeatedly challenge its own assumptions can increase the evaluation accuracy by another 6%. The cost is a 32% increase in token consumption and a 72% increase in latency.
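The trade-off is easier to weigh as arithmetic. The sketch below applies the reported 32% and 72% increases to a hypothetical baseline of 10,000 tokens and 20 seconds per query; the baseline numbers are assumptions chosen purely for illustration, and the text does not say whether the 6% gain is relative or in percentage points.

```python
def with_adversarial_review(tokens, latency_s, token_pct=32, latency_pct=72):
    """Apply the reported percentage increases to a per-query baseline."""
    new_tokens = tokens * (100 + token_pct) / 100
    new_latency_s = latency_s * (100 + latency_pct) / 100
    return new_tokens, new_latency_s


# Hypothetical baseline, not a figure from the post.
BASE_TOKENS = 10_000
BASE_LATENCY_S = 20.0

new_tokens, new_latency_s = with_adversarial_review(BASE_TOKENS, BASE_LATENCY_S)

# Extra token spend (in percent) paid for each 1% of reported accuracy gain.
token_pct_per_accuracy_pct = 32 / 6
```

Under these assumptions a query grows from 10,000 to 13,200 tokens and from 20 to 34.4 seconds, or a little over 5% more tokens for each 1% of accuracy gained.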
95% is not built, but nurtured. Once you let go, it may collapse back within a few weeks.
Reference materials:
https://claude.com/blog/how-anthropic-enables-self-service-data-analytics-with-claude
This article is from the WeChat official account "New Intelligence Yuan", author: ASI Revelation. Republished by 36Kr with authorization.