Anthropic's Latest Blog on the Bottleneck for Biological Agents: Models Aren't the Issue, Data Infrastructure Is

If you want AI Agents to conduct scientific research, you must first rebuild the databases from the ground up.

Coding Agents are making rapid progress in software engineering. Watching this, scientists can't help but ask: when will AI Agents help humans tackle the many challenges in drug design, virus monitoring, and biological modeling at the same speed?

However, the harsh reality is that AI is advancing much more slowly in biology than in programming...

Recently, Anthropic published a new science blog post, "Paving the way for agents in biology". The article argues that the bottleneck holding back biological AI Agents is not insufficient reasoning ability in the underlying large models, but the badly outdated state of today's biological data infrastructure.

Therefore, if we hope that AI Agents can truly participate in biological research, the biological data infrastructure must become more suitable for Agents to use.

This article was written by Laura Luebbert, a biologist and machine learning researcher.

Interestingly, Laura Luebbert revealed that this blog post was finished one week before Karpathy officially announced that he was joining Anthropic. Since some of the content involves Karpathy, she worried that Anthropic might think the article had too much of a "Karpathy flavor". Unexpectedly, the official announcement came on the same day she sent the first draft to Anthropic...

Now, let's take a detailed look at how this article analyzes the situation.

The existing biological data infrastructure is too difficult for Agents to navigate

The author uses a vivid analogy. Asking an AI Agent to operate the biological data infrastructure is a bit like driving through an old city built before the invention of cars: the city may be very beautiful, and its planning may have been well thought out, but it is full of narrow and winding streets that modern vehicles struggle to pass through. In the field of biological data, this corresponds to idiosyncratic file formats, scattered databases, and one-off retrieval scripts.

Of course, you can try to add traffic signs and parking lots to this city, and even widen a few roads occasionally. However, its basic layout remains difficult to navigate because it was originally designed for a different mode of transportation.

In contrast, software infrastructure is almost naturally suited to "cars", that is, to Agents: paved roads, clear lanes, standardized signals, and a system that supports fast passage from start to finish, such as version control, well-documented APIs, and package managers.

Therefore, the development speed of Coding Agents is significantly faster than that of biological Agents.

The software field usually has structured digital workflows and reliable interfaces. However, the infrastructure for data retrieval and verification in computational biology is often fragile, heterogeneous, and highly dependent on specific processes. Correspondingly, the tools used to operate these infrastructures have to be customized and are only applicable to specific fields or specific assumptions.

In addition, software produces results that are easy to test and can be quickly compiled and verified. For example, an Agent can resolve a GitHub issue by generating a patch, and as long as the patch passes the project's tests, its effectiveness is established. In biology, however, there are few reward signals that are simple, verifiable, and meaningful at the same time.

So, the bottleneck of biological Agents lies not only in reasoning ability but also in the lack of a widely available deterministic execution layer to support queries of biological data. Scientists can naturally express their intentions, such as "Find all human kinases with this domain and retrieve their structures." However, Agents often lack a reliable path to access the databases containing the required information.

In biological and scientific workflows, even small errors can have serious consequences. For example, extracting coordinates from the wrong genome version may invalidate every subsequent biological interpretation. Inadvertently mixing RefSeq and GenBank records, treating partial genomes as complete genomes, confusing the segment names of segmented viruses, or missing relevant records because of inconsistent metadata fields can cause the same kind of damage.

This is where the beauty and difficulty of scientific research lie: details are often extremely crucial.

Therefore, if we hope that Agents can truly assist in scientific discovery, we need to build biological data infrastructure that Agents can actually use.

Karpathy's "complaints" about web development are the same problem faced by biological Agents

The author believes that the mismatch between the needs of Agents and the tools built by humans is not unique to the biological field. Whenever Agents are placed in environments designed entirely around human usage habits, similar frictions will occur.

A few months ago, in a talk on software development in the AI era, Karpathy described using vibe coding to write a small web application. When he tried to get it running for real, however, setting up authentication, payments, and deployment cost him a week of clicking around in the browser.

Karpathy lamented: "Writing code is actually the easiest part! Most of the work is done in the browser by clicking." The troublesome part is instructions like "Open this URL and click this drop-down menu."

The conclusion is: We must rebuild these processes for Agents.

This is exactly the pain point that biological researchers have long faced: we are trying to make intelligent systems work in an environment designed for humans clicking in a browser, an environment full of heterogeneous information, implicit conventions, and manual processes.

Case study: The "click tax" in virology

Long before the emergence of AI Agents, computational biologists and geneticists had begun to develop traditional computational biology tools to try to alleviate this problem. Biopython, BioPerl, BioJulia, Entrez Direct, BioMart, gget, and many other workflow libraries are all aimed at liberating biological data from the browser interface so that researchers can directly perform calculations on this data.

However, the problem is that biological data is not stored in a unified database and does not have a unified interface. It is more like a chaotic road network: each road has its own identifiers, conventions, formats, filtering logic, and level of programmatic access. Some data can be easily accessed programmatically, while other data is much harder to reach.

Virology is one of the more difficult scenarios. In many research workflows, from vaccine design and diagnostic reagent development to building training data for protein models, the first step is to retrieve sequences from NCBI Virus. NCBI Virus is a collection of virus sequence records that brings together data from GenBank, RefSeq, and the international INSDC ecosystem, including Pathoplexus, and provides access through a searchable web interface.

Researchers who build virus outbreak monitoring tools are well aware of how much expert knowledge is hidden behind these retrieval processes. In virology laboratories, the instructions for assembling an NCBI Virus dataset often circulate as a long list of complex filtering conditions that users must reproduce by hand in the web interface.

And this is exactly the type of "browser-click workflow" that Karpathy complained about.

The article takes the Bundibugyo Ebola virus epidemic declared in the Democratic Republic of the Congo in mid-May 2026 as an example to illustrate this situation.

After front-line researchers sequenced the first batch of virus genomes from the outbreak, global public health officials needed to immediately answer three pressing questions:

How much has this new strain mutated compared to historical Ebola viruses?
Can the existing diagnostic kits still accurately detect it?
Can the existing antibody drugs and therapies still protect patients?

To answer these questions, the first step in the analysis is to go to the NCBI Virus database and compare the new genome with historical data.

However, in virology laboratories, the filtering conditions for constructing this comparison dataset are very complex and are often passed among scientists as a long list. Researchers must manually tick dozens of filters in a complicated web interface. This is tedious for humans, and it is a disaster for AI Agents that are supposed to improve efficiency through automation...

What will happen if an Agent tries to retrieve data on its own?

The author explains that, to understand the gap between Agents and databases, the research team built a benchmark called VirBench. It contains 120 realistic virus sequence query tasks covering 40 pathogens, each accompanied by a manually verified reference answer. The tasks come from real scenarios such as virus monitoring, diagnostic reagent design, and the construction of training data for protein models.

For example, one task requires the Agent to retrieve from NCBI the Zaire ebolavirus sequences corresponding to TaxID 3052462 that meet a series of conditions: the host is human, the sampling location is in Africa, the sampling date is between January 1, 2014, and June 20, 2014, the sequence length is at least 15,200 bases, the number of ambiguous characters N does not exceed 1,900, and laboratory-passaged samples are excluded.
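To make the conditions concrete, the task above can be written as a single deterministic predicate over sequence records. This is only a minimal sketch: the `SequenceRecord` layout, its field names, the `"Homo sapiens"` host label, the inclusive date bounds, and the `TOY…` accessions are all assumptions made for illustration, not the NCBI Virus schema or real records.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class SequenceRecord:
    # Hypothetical record layout, not the real NCBI Virus metadata schema.
    accession: str
    tax_id: int
    host: str
    continent: str
    collection_date: Optional[date]
    length: int
    ambiguous_n: int
    lab_passaged: bool


def matches_task(rec: SequenceRecord) -> bool:
    """Return True only if every condition of the example task holds."""
    if rec.collection_date is None:
        # A record without a date cannot satisfy a date-range condition.
        return False
    return (
        rec.tax_id == 3052462
        and rec.host.lower() == "homo sapiens"
        and rec.continent.lower() == "africa"
        # Assumption: both ends of the date range are inclusive.
        and date(2014, 1, 1) <= rec.collection_date <= date(2014, 6, 20)
        and rec.length >= 15_200
        and rec.ambiguous_n <= 1_900
        and not rec.lab_passaged
    )


def select(records: List[SequenceRecord]) -> List[str]:
    """Return matching accessions in sorted order, independent of input order."""
    return sorted(r.accession for r in records if matches_task(r))


# Toy records (made up) that differ in exactly one condition each.
TOY_RECORDS = [
    SequenceRecord("TOY0001", 3052462, "Homo sapiens", "Africa", date(2014, 3, 17), 18_900, 0, False),
    SequenceRecord("TOY0002", 3052462, "Homo sapiens", "Africa", date(2014, 6, 21), 18_900, 0, False),  # one day too late
    SequenceRecord("TOY0003", 3052462, "Homo sapiens", "Africa", None, 18_900, 0, False),  # missing date
    SequenceRecord("TOY0004", 3052462, "Homo sapiens", "Africa", date(2014, 5, 2), 18_900, 0, True),  # lab-passaged
]

print(select(TOY_RECORDS))  # ['TOY0001']
```

Written this way, every condition is explicit and the same input always yields the same output, which is the property the free-form Agent runs described next do not have.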

When Agents complete these queries independently, the results vary greatly.

The average accuracy of Claude Sonnet 4, Claude Opus 4.7, Biomni, Edison Analysis, GPT-5.2-pro, and GPT-5.5 ranges from 16.9% to 91.3%. In other words, frontier models perform better, but even they do not consistently reach the accuracy and reproducibility required to build a reliable dataset.

For this type of task, the bar must be close to 100%, because a single missing or extra record may affect whether a diagnostic reagent covers the current diversity of the circulating virus, or shift the estimated starting point of the epidemic. What's more troublesome is that when the same model runs the same query three times, it often gives very different results.

In the above Ebola virus query task, the standard answer is 266 sequences, but Claude Sonnet 4 returned 106, 15, and 5 sequences in three runs respectively. The prompts were exactly the same, but the results were highly unstable.
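The size of that gap is easy to quantify from the counts alone. The sketch below computes the best-case recall for each run, under the generous assumption that every returned sequence was correct. It also shows a set-based precision and recall helper for cases where the actual accessions are known. The article does not say how VirBench scores accuracy, so this is not the benchmark's metric, and the accession sets used with the helper are made up.

```python
from typing import List, Set, Tuple


def recall_upper_bound(returned: int, expected: int) -> float:
    """Best-case recall if every returned record were in the reference set."""
    return min(returned, expected) / expected


def precision_recall(retrieved: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall over accession identifiers."""
    hits = len(retrieved & reference)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


EXPECTED = 266                    # reference answer size reported in the article
RUN_COUNTS: List[int] = [106, 15, 5]  # sequences returned across three runs

bounds = [recall_upper_bound(n, EXPECTED) for n in RUN_COUNTS]
for n, bound in zip(RUN_COUNTS, bounds):
    print(f"{n:>3} returned -> recall at most {bound:.1%}")
# 106 returned -> recall at most 39.8%
#  15 returned -> recall at most 5.6%
#   5 returned -> recall at most 1.9%
```

Even in the best case, the most complete run recovers under 40% of the reference set, and the least complete run under 2%.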

This instability will directly affect downstream analysis. The research team used these sequences to build a phylogenetic tree to infer the relationships between different virus samples in the epidemic. One important indicator is the time to the most recent common ancestor, that is, TMRCA. The time inferred from the manually organized dataset is January 2014, which is consistent with existing research; however, the partial dataset retrieved by Sonnet 4 was obviously incomplete, and in one case, it even pushed the time of the common ancestor back to 1922.

Another example involves antibody therapy. Researchers retrieved the Ebola virus glycoprotein sequences to observe whether mutations had occurred in the targeted regions of antibody drugs such as maftivimab and MBP134.

The results showed that Sonnet 4 gave three different impressions in three runs: the first was close to the result of manual query, the second missed most of the mutation sites, and the third emphasized a different set of residues.

This shows that in scientific research, seemingly small differences in retrieval details can change biological conclusions. Agents often understand the task and are willing to attempt it, but they lack a machine-operable, verifiable, and repeatable path. The final answer may look reasonable and still be wrong.

This is especially dangerous because sequence retrieval is usually the first step in a longer biological workflow...

gget virus: Adding a layer of deterministic tools for virus data retrieval

To solve this problem, the research team collaborated with NCBI researchers to develop gget virus, with the goal of turning virus data retrieval into a stable tool that can be directly called by both Agents and humans.

At first, this seemed to be just a matter of connecting a few APIs, but the reality is much more complicated. NCBI Virus is a portal covering multiple underlying resources, and these resources are spread across internationally synchronized sequence databases maintained by multiple countries. A seemingly simple query often requires stitching together information from multiple sources.

To reproduce the behavior of the NCBI Virus web interface, gget virus needs to coordinate different APIs such as REST, Datasets, and E-utilities. It determines which filtering conditions can be handled through existing APIs and which must be checked locally, because some filtering logic provided by the web interface is not exposed in any single programmatic interface.

It also handles batch retrieval to ensure that large-scale datasets such as SARS-CoV-2 and influenza A are retrieved completely, without losing records to pagination or truncation. If a filtering condition depends on supplementary information in another database, such as whether a sequence in a GenBank record contains a specific viral protein, gget virus retrieves these records, uses them for filtering, and saves the relevant GenBank information in the final output.
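The idea of splitting a query into conditions the remote API can apply and conditions that must be checked locally can be sketched in a few lines. This is not how gget virus is implemented: the filter names, the `SERVER_SIDE` set, and the record layout are all hypothetical and chosen only to illustrate the routing step.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

Record = Dict[str, Any]

# Hypothetical: filter names the remote API is assumed to support directly.
SERVER_SIDE = {"tax_id", "host"}

# Hypothetical local checks for conditions the API is assumed not to expose.
LOCAL_CHECKS: Dict[str, Callable[[Record, Any], bool]] = {
    "min_length": lambda rec, value: rec["length"] >= value,
    "max_ambiguous_n": lambda rec, value: rec["ambiguous_n"] <= value,
}


def plan(filters: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split filters into (sent to the API, applied locally afterwards)."""
    remote = {k: v for k, v in filters.items() if k in SERVER_SIDE}
    local = {k: v for k, v in filters.items() if k not in SERVER_SIDE}
    return remote, local


def apply_local(records: Iterable[Record], local: Dict[str, Any]) -> List[Record]:
    """Apply local checks; fail loudly rather than silently skip a filter."""
    unknown = sorted(set(local) - set(LOCAL_CHECKS))
    if unknown:
        raise ValueError(f"no local check implemented for: {unknown}")
    return [
        rec for rec in records
        if all(LOCAL_CHECKS[name](rec, value) for name, value in local.items())
    ]


query = {"tax_id": 3052462, "host": "human", "min_length": 15_200, "max_ambiguous_n": 1_900}
remote_filters, local_filters = plan(query)

# Toy records standing in for what a remote call might return.
fetched = [
    {"accession": "TOY0001", "length": 18_900, "ambiguous_n": 12},
    {"accession": "TOY0002", "length": 9_000, "ambiguous_n": 0},
    {"accession": "TOY0003", "length": 18_700, "ambiguous_n": 2_500},
]
kept = apply_local(fetched, local_filters)
print([rec["accession"] for rec in kept])  # ['TOY0001']
```

Raising on an unknown filter, instead of ignoring it, is a deliberate choice in this sketch: a silently dropped condition would return a plausible but wrong dataset.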
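Complete batch retrieval usually comes down to following pagination to the end and then checking the number of records received against the total the server reports. The sketch below illustrates that pattern against an in-memory fake backend. The page shape `(items, next_token, total_count)` and the helper names are assumptions for illustration and do not correspond to any specific NCBI API or to gget virus internals.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical page shape: (items on this page, token for next page, total count).
Page = Tuple[List[str], Optional[str], int]


def fetch_all(fetch_page: Callable[[Optional[str]], Page]) -> List[str]:
    """Follow pagination to the end and verify that nothing was dropped."""
    items: List[str] = []
    token: Optional[str] = None
    while True:
        page_items, token, total = fetch_page(token)
        items.extend(page_items)
        if token is None:
            break
    if len(items) != total:
        # Truncation must surface as an error, not as a smaller dataset.
        raise RuntimeError(f"expected {total} records, received {len(items)}")
    return items


def make_fake_backend(data: List[str], page_size: int) -> Callable[[Optional[str]], Page]:
    """In-memory stand-in for a paginated API, used only for this sketch."""
    def fetch_page(token: Optional[str]) -> Page:
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(data) else None
        return data[start:end], next_token, len(data)
    return fetch_page


accessions = [f"TOY{i:04d}" for i in range(7)]
result = fetch_all(make_fake_backend(accessions, page_size=3))
print(len(result))  # 7
```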

Finally, gget virus outputs standardized results that can be read by both humans and machines, along with detailed logs explaining how the results were generated. In this way, the answers given by Agents are no longer just "seemingly reasonable" but can be checked, reproduced, and audited.
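One simple way to make a retrieval result checkable and reproducible is to store a small manifest next to it: the query that was run, the number of records, and a digest of the sorted accession list. The sketch below is an illustration of that idea only. The manifest fields are hypothetical and are not the log format gget virus produces.

```python
import hashlib
import json
from typing import Any, Dict, List


def build_manifest(query: Dict[str, Any], accessions: List[str]) -> Dict[str, Any]:
    """Summarize a retrieval so that two runs can be compared exactly."""
    ordered = sorted(set(accessions))  # order and duplicates must not matter
    digest = hashlib.sha256("\n".join(ordered).encode("utf-8")).hexdigest()
    return {
        "query": query,
        "record_count": len(ordered),
        "accessions_sha256": digest,
    }


query = {"tax_id": 3052462, "min_length": 15_200}
run_a = build_manifest(query, ["TOY0002", "TOY0001", "TOY0003"])
run_b = build_manifest(query, ["TOY0003", "TOY0001", "TOY0002"])
run_c = build_manifest(query, ["TOY0001", "TOY0002"])

print(run_a == run_b)  # True: same records, different order
print(run_a == run_c)  # False: a record is missing
print(json.dumps(run_a, indent=2, sort_keys=True))
```

Comparing two manifests then answers the reproducibility question directly, without having to diff the sequence files themselves.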

After gget virus was added, the accuracy of all Agents rose above 90%, with GPT-5.5 reaching the highest at 99.7%. Run-to-run fluctuations have largely disappeared, and the performance gap between models has narrowed significantly. In other words, a deterministic retrieval layer makes the choice of model less critical.

This is very important. Building a reliable dataset should not depend on the latest and most expensive models, nor on researchers knowing which model works best with which database. Cheaper models combined with appropriate tools can also reduce instability and give more people access to reliable results.

The real takeaway: Scientific Agents need a "boring but reliable" foundation

At the end of the article, the author emphasizes that models should be creative when generating hypotheses, designing experiments, and reasoning about mechanisms. However, the underlying parts that support this creativity, such as gene identifiers, schemas, retrieval logic, coordinate systems, metadata conventions, and data access paths, must be stable, deterministic, and reproducible.

gget virus is just an example. The larger future direction is to build a type of "context engine" for biological data: a reliable data infrastructure that can be accessed by Agents. Similar explorations have also appeared in ToolUniverse, Edison Scientific's Robin,