Claude miscalculated the origin of the virus by 90 years – is it all the web pages' fault?
[Introduction] While top-tier AI is advancing by leaps and bounds in coding, it often stumbles in biology. The problem is not that the models aren't smart enough, but that scientific databases have so far been designed only for humans clicking a mouse.
Has the most powerful model stumbled in the most unexpected place: counting?
Recently, Anthropic published a research blog post titled "Paving the way for agents in biology", and one set of numbers in it sends a chill down the spine.
https://www.anthropic.com/research/agents-in-biology
Researchers asked several of today's most powerful scientific agents (Claude, GPT, Biomni, Edison Analysis) to do something that sounds effortless: accurately count how many virus sequences in the NCBI Virus database meet a given set of criteria.
None of them could answer correctly and consistently.
Even more incredibly, with the same question, the same model, and the same prompt, the answers from three runs could differ by more than a factor of twenty.
When Claude Sonnet 4 was asked to retrieve Ebola virus sequences, it returned 106 sequences the first time, 15 the second time, and 5 the third time. The correct answer is 266.
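To make the size of that spread concrete, here is a small arithmetic sketch using only the four numbers quoted above. The "count ratio" is simply the returned count divided by the reference count; it is an illustrative quantity chosen here, not the accuracy metric used in the Anthropic post, and it says nothing about whether the returned sequences were the right ones.

```python
# Counts quoted in the article for the same Ebola query.
reference_count = 266
run_counts = [106, 15, 5]


def count_ratio(returned: int, reference: int) -> float:
    """Returned count as a fraction of the reference count (not a true recall)."""
    return returned / reference


ratios = [count_ratio(c, reference_count) for c in run_counts]
# How many times larger the biggest answer is than the smallest one.
spread = max(run_counts) / min(run_counts)

for run, (count, ratio) in enumerate(zip(run_counts, ratios), start=1):
    print(f"run {run}: {count} sequences, {ratio:.1%} of the reference count")
print(f"largest answer is {spread:.1f}x the smallest")
```

Even the best of the three runs returned well under half of the reference count.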
Is AI really no good at biology?
There is a harsh truth behind this. In science, the real weakness of agents is not reasoning, but the lack of a stable, reproducible, machine-accessible way to retrieve data accurately.
Without a dedicated retrieval layer, the average accuracy of the tested systems ranges from 16.9% to 91.3%. Even though newer models have made progress, the remaining errors are still fatal, because the passing grade for this kind of task is effectively 100%.
A single missing record can make a diagnostic test appear to cover every circulating strain, or shift the inferred starting point of an epidemic by several weeks.
So, where exactly does the problem lie?
A City Built for Horse-drawn Carriages Can't Accommodate Cars
Anthropic offers a vivid analogy: using an agent to access a biological database is like driving through an old city built before cars were invented.
The streets may be elegant and well planned, but they are narrow alleys with sharp turns, designed for horse-drawn carriages. Scattered databases, strange file formats, and one-off retrieval scripts are all part of this old city. You can add some traffic signs, build a few parking lots, and widen one or two roads, but the underlying layout was never designed for cars.
The software world is the opposite. It is a new city built for cars: flat asphalt roads, clear lanes, and standardized traffic lights. Version control, well-documented APIs, and package managers form a complete system that lets you travel at high speed from the outset, and it is naturally paved for "cars" (that is, agents).
That's why code agents are advancing rapidly, while biological agents are going in circles.
The software field provides structured digital workflows and reliable interfaces. Given a GitHub issue, you can generate a patch, run the tests, and verify the result on the spot. The biological field provides fragile, heterogeneous, process-dependent infrastructure, with almost no simple, verifiable, and meaningful reward signals.
In the case of NCBI Virus, the trouble is even more obvious. It is essentially a web portal. You tick the conditions on the web page: the host is human, the sampling location is in Africa, the sequence length is greater than a certain value, and laboratory-passaged samples are excluded. Only then does the website backend translate these conditions into queries against multiple underlying databases (GenBank, RefSeq, the INSDC system) and filter the results for you.
The home page of the NCBI Virus portal: To retrieve virus sequences, you have to select options, enter keywords, and click the filter on the web page. The entire interaction is designed for humans, and it is difficult for machines to directly reuse it.
Much of its filtering logic lives at the web page level and is not exposed as a clean programmatic interface.
For human virologists, it is just a few clicks in the browser. For machines (agents), it is a disaster, because what agents can call directly are the underlying raw APIs (REST, Datasets, E-utilities), and these APIs do not expose the same filtering semantics as the web page.
Here is a specific example:
On the web page, "sampling location in Africa" is a tick box. Behind it, you may need to align the metadata fields of dozens of countries and process records with inconsistent field notations. For a condition like "contains surface glycoprotein", you can't judge it based on the sequence itself. You have to retrieve the gene/protein annotations of each record from GenBank for comparison.
The web page does these implicit steps for you, but the raw APIs don't.
So agents have to "guess" and piece this logic together again. If something is left out, the result is an undercount (for example, missing the sequences from one African country). If something is pieced together wrongly, the result is an overcount (because a filtering condition was misunderstood).
This is exactly the root cause of Sonnet 4 giving three different answers (106, 15, 5) to the same question: Each time it reconstructs the filtering logic differently.
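A toy sketch of how one such reconstruction can undercount. Everything here is hypothetical: the record IDs, the `geo` field, the partial country list, and the alias table are invented for illustration and are not NCBI's schema or the logic of any real tool. The point is only that two plausible ways of rebuilding an "Africa" checkbox give different counts on the same records.

```python
# Hypothetical metadata records with inconsistent location notations.
RECORDS = [
    {"id": "rec1", "geo": "Guinea"},
    {"id": "rec2", "geo": "Sierra Leone: Kenema"},
    {"id": "rec3", "geo": "DRC"},
    {"id": "rec4", "geo": "Democratic Republic of the Congo"},
    {"id": "rec5", "geo": "USA"},
]

# Deliberately partial country list and alias table (illustrative only).
AFRICAN_COUNTRIES = {
    "guinea",
    "sierra leone",
    "liberia",
    "democratic republic of the congo",
}
ALIASES = {"drc": "democratic republic of the congo"}


def naive_match(record: dict) -> bool:
    """Compare the raw field to the country list, ignoring notation quirks."""
    return record["geo"].lower() in AFRICAN_COUNTRIES


def normalized_match(record: dict) -> bool:
    """Strip 'Country: Region' suffixes and resolve aliases before comparing."""
    country = record["geo"].split(":")[0].strip().lower()
    country = ALIASES.get(country, country)
    return country in AFRICAN_COUNTRIES


naive_ids = [r["id"] for r in RECORDS if naive_match(r)]
normalized_ids = [r["id"] for r in RECORDS if normalized_match(r)]

print("naive:", naive_ids)
print("normalized:", normalized_ids)
```

The naive version silently drops the records written as "Sierra Leone: Kenema" and "DRC". An agent that rebuilds this logic from scratch on every run can land on either version, or on a third one.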
This is precisely what gget virus aims to solve: it re-implements the filtering behavior hidden in the web interface as a stable, reproducible, machine-callable programmatic system, so that agents don't have to guess every time.
Miscounting One Sequence Can Shift the Starting Point of an Epidemic by Weeks
If you think "miscounting a few sequences" doesn't matter, the following scenario will change your mind.
In May 2026, an Ebola epidemic of the Bundibugyo type broke out in the Democratic Republic of the Congo. On May 14th, the INRB in Kinshasa analyzed 13 blood samples, and 8 cases were confirmed the next day. By May 29th, the WHO reported that the number of confirmed and suspected cases had exceeded 1000, and more than 200 people had died.
Researchers were faced with three life-and-death questions: How different is this virus from previous ones? Can the existing diagnosis still detect it? Does the existing treatment still work?
To answer these questions, they need to compare the new genome with the historical Ebola genomes in the NCBI Virus database one by one. And the first step of this analysis is precisely to manually click on the web page, manually reproduce a long list of complex filtering conditions, and then hope that the retrieved dataset is complete and correct.
Researchers had Sonnet 4 retrieve data with the Ebola query described earlier and then built a phylogenetic tree to estimate the "Time to Most Recent Common Ancestor (TMRCA)", a key quantity for inferring when an epidemic originated.
The manually refined dataset gave a TMRCA of January 2014, which is consistent with previous reports.
However, two of the three datasets retrieved by Sonnet 4 were obviously incomplete. One of them pushed the inferred origin time back from 2014 to 1922, adding more than ninety years out of thin air. The remaining one seemed okay but missed the sequences from Guinea, quietly moving the origin time to April 2014, thus rewriting the timeline.
Phylogenetic tree of the Zaire Ebola virus: The top-left panel is the manually refined data, and Runs 1 to 3 are the retrieval results of Sonnet 4. The red dotted line marks the TMRCA, and gray represents missing or incorrect country information.
The antibody therapy analysis tells the same story. Researchers wanted to see whether the sites targeted by two Ebola antibody therapies, maftivimab and MBP134, had mutated historically, in order to determine whether the therapies could keep up with the virus's evolution. Sonnet 4 produced three completely different mutation pictures in three runs.
Mutation distribution of the Zaire Ebola virus glycoprotein. The darker the red, the higher the frequency. The spheres are the binding sites of the maftivimab and MBP134 antibodies. The leftmost is the manually refined data, and the results of Sonnet 4's three retrievals (Runs 1 to 3) are different.
The failure mode is clear: stopping halfway while expanding the result set leads to undercounting; using the wrong filtering conditions leads to overcounting. The deviation is largest for viruses with a large number of records, such as Influenza A and HIV-1. Once there are more than three or four parallel filtering conditions, performance collapses outright.
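The "stopping halfway" failure can be sketched with a stand-in for a paged API. `fetch_page`, the page size of 100, and the synthetic ID list are all assumptions made for illustration; this is not an NCBI call. The only number taken from the article is the 266-record reference count.

```python
from typing import List


def fetch_page(all_ids: List[str], offset: int, page_size: int) -> List[str]:
    """Stand-in for one request to a paged API (hypothetical, in-memory)."""
    return all_ids[offset:offset + page_size]


def first_page_only(all_ids: List[str], page_size: int) -> List[str]:
    """The failure mode: take one page and treat it as the full result set."""
    return fetch_page(all_ids, 0, page_size)


def fetch_all(all_ids: List[str], page_size: int) -> List[str]:
    """Keep requesting pages until the server returns an empty one."""
    results: List[str] = []
    offset = 0
    while True:
        page = fetch_page(all_ids, offset, page_size)
        if not page:
            break
        results.extend(page)
        offset += len(page)
    return results


# Synthetic IDs standing in for the 266 matching records.
matching_ids = [f"seq{i:03d}" for i in range(266)]

truncated = first_page_only(matching_ids, page_size=100)
complete = fetch_all(matching_ids, page_size=100)

print(len(truncated), "records if paging stops after the first page")
print(len(complete), "records if every page is retrieved")
```

Nothing in the truncated result signals that it is incomplete, which is why this kind of error passes unnoticed into downstream analysis.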
Making mistakes confidently is the most terrifying kind of error in scientific research.
Dig a Machine-Specific Tunnel for the Old City
So how can this be fixed?
Researchers from Anthropic and NCBI collaborated to create something called gget virus.
It is not just another fancy "AI plugin", but a deterministic retrieval layer. In essence, it translates the filtering behavior of the NCBI Virus web interface into a reproducible programmatic system.
Technically, it coordinates several underlying systems such as REST, Datasets, and E-utilities, and automatically determines which filters can be applied through the API and which need to be verified locally. It handles batch retrieval, ensuring that large result sets are fetched in full rather than truncated midway.
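The split between API-side filters and locally verified filters can be illustrated with a minimal sketch. This is not gget virus's implementation or interface: the `Filter` class, the `server_side` flag, the field names, and the records are hypothetical, and the "remote" step is simulated in memory. It only shows the general idea of pushing down what the API supports and checking the rest after download.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Filter:
    name: str
    server_side: bool  # assumption: whether the remote API can apply this filter
    predicate: Callable[[dict], bool]


def run_query(records: List[dict], filters: List[Filter]) -> Tuple[List[dict], List[dict]]:
    """Apply server-side filters first, then verify the remaining ones locally."""
    remote = [f for f in filters if f.server_side]
    local = [f for f in filters if not f.server_side]
    # Simulated remote step: what the API would return after its own filtering.
    candidates = [r for r in records if all(f.predicate(r) for f in remote)]
    # Local step: checks that need per-record annotations the API cannot filter on.
    verified = [r for r in candidates if all(f.predicate(r) for f in local)]
    return candidates, verified


# Hypothetical records and field names.
RECORDS = [
    {"id": "rec1", "host": "Homo sapiens", "length": 18900, "proteins": ["glycoprotein", "nucleoprotein"]},
    {"id": "rec2", "host": "Homo sapiens", "length": 18800, "proteins": ["nucleoprotein"]},
    {"id": "rec3", "host": "Homo sapiens", "length": 900, "proteins": ["glycoprotein"]},
    {"id": "rec4", "host": "Pan troglodytes", "length": 18950, "proteins": ["glycoprotein"]},
]

FILTERS = [
    Filter("human host", True, lambda r: r["host"] == "Homo sapiens"),
    Filter("min length", True, lambda r: r["length"] >= 15000),
    Filter("has glycoprotein annotation", False, lambda r: "glycoprotein" in r["proteins"]),
]

candidates, verified = run_query(RECORDS, FILTERS)
print("after server-side filters:", [r["id"] for r in candidates])
print("after local verification:", [r["id"] for r in verified])
```

Because the plan is fixed by the filter definitions rather than rebuilt by the model on each run, the same query yields the same result set every time.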
It downloads virus nucleotide sequences and linked metadata from the INSDC system (NCBI, ENA, DDBJ) and outputs formats such as FASTA, CSV, and JSONL that both humans and machines can read. It also produces detailed logs that show how each result was obtained. For high-frequency queries, it reduces the data transfer volume by more than 98%.
The effect is immediate.
After connecting to gget virus, the accuracy of all tested systems rose above 90.0%, and GPT-5.5 reached 99.7%. The random variation between runs almost disappeared, with stability rising to between 0.92 and 1.00.
The best part is that the gap between models was also significantly narrowed.
Retrieval accuracy of each agent on the VirBench benchmark: After connecting to gget virus (dark color), all exceeded 90%. The rightmost is the result of gget virus running alone.
What this means is that after adding a deterministic tool layer, it doesn't matter much which model you use.
This is really the thing worth noting.
The construction of a reliable dataset should not depend on whether you can afford the latest and most expensive model, nor on whether you happen to know which model is most suitable for which database. An inexpensive model with the right tools can still be stable.
There is also an interesting detail. Among 360 runs, there was one in which GPT-5.5 found and used gget virus on its own, without being prompted. That was the only time it answered the question correctly.
The value of the tool has been voted on by the model itself.
The Real Decisive Factor Shifts from the Model to the Foundation
Looking at the bigger picture, this is not just about viruses.
The same frictions occur in every environment "designed for humans, not for agents".
A few months ago, in a talk about software in the AI era, Karpathy complained that after "vibe coding" a small web application, actually launching it (login, payment, deployment) cost him a whole week of clicking around in the browser. His conclusion was: "Writing code is the easiest part."
Karpathy's presentation slide "Docs for people": The configuration documents of services such as Vercel and Clerk are all designed for humans with instructions like "click here, fill there", and LLMs cannot directly call them.
Biologists may resonate with Karpathy's complaints. They may have endured this kind of pain for many years.
gget virus is not an isolated case. A number of biomedical agents such as ToolUniverse, Robin, and Biomni are also building this kind of "context engine".
The challenge is where the determinism should live and how it should be built.
Of course, some people may ask: with models progressing so rapidly, what if one day agents become powerful enough to navigate chaotic portals, align IDs, page through results correctly, and self-heal from errors? Will "scaffolding" like gget virus become useless overnight?
It's possible. But Anthropic's answer is: Even if agents can do it, it doesn't mean they should reinvent the wheel every time.
A model that can navigate through this chaotic data retrieval process on its own may be too expensive, too slow, too difficult to audit, and too difficult to trust to support daily scientific research.
Moreover, even if the scaffolding eventually becomes obsolete, the lesson for biological databases still holds: from now on, agents should be regarded as large-scale users, and databases should be built from the start for large-scale programmatic access.
On the surface of this competition, it's about which model is smarter. At a deeper level, it's about which foundation is more suitable for machines to run on.
We want models to be imaginative when generating hypotheses and designing experiments. But the layer beneath them (gene identifiers, data schemas, retrieval logic, coordinate systems, and metadata conventions) must be "boringly" reliable.
The curve of models is still rising.
But in this round, the real decisive factor may not be the large models in the cloud, but the data infrastructure at the bottom, which no one wants to fix yet which determines success or failure.
Reference: https://www.anthropic.com/research/agents-in-biology