228 hours of non-stop work, 100 papers, 11.4 billion tokens burned: FARS has gone into overdrive.
During this Spring Festival, the most hardcore "reality show" in the AI world quietly reached a milestone.
The protagonist is neither an anime character nor a robot brandishing weapons, but a tireless AI scientist named FARS (Fully Automated Research System) that works around the clock.
This fully automated research system created by Analemma completed a continuous public operation lasting 228 hours, 28 minutes, and 33 seconds. During this period, it proposed hypotheses, conducted experiments, and wrote papers on its own, generating a total of 244 research hypotheses and "producing" 100 short papers.
Do the math: this assembly-line "research factory" turns out a paper roughly every 2 hours.
It took 228 hours to hit the goal of having AI write 100 papers on its own, and the month-long livestream is still running. Livestream address: https://analemma.ai/fars
This industrial-scale throughput, a clean break from the traditional research paradigm, quickly caught the attention of netizens.
The first batch of professionals to conduct in-depth "inspections" reached a fairly consistent verdict: the results exceed expectations and are genuinely good.
Judged as top-tier human conference papers, they are not dazzling. But as the interim output of a fully automated system, their level of polish has clearly surpassed many people's prior expectations.
"Considering that this is just an AI getting started on its own, and that it can stably produce work of this quality around the clock, what more could you ask for?"
Moreover, it genuinely ran without hallucinations throughout.
At least at this stage, FARS has completed a crucial leap: it has shown for the first time that an unattended research "assembly line" can not only run, but run stably while continuously producing short-paper-level work with a measure of academic competitiveness.
The scarcity value of "getting a paper published" has been shattered.
The Relentless "Industrial Rhythm": Compute Is Turning into Knowledge
FARS is not a single model but a multi-agent system comprising four functional modules:
- Ideation: Responsible for literature research and hypothesis generation
- Planning: Responsible for experimental design
- Experiment: Responsible for code writing and execution
- Writing: Responsible for paper writing
From the real-time operation interface, it is easy to see that FARS advances multiple research tasks in parallel as a project queue. Each project moves through four stages in sequence: Ideation → Planning → Experiment → Writing. The process is highly modular, with the unmistakable character of a "research assembly line".
The real-time operation interface of FARS: from hypothesis generation to paper writing, an automated research assembly line is laid out in observable form for the first time.
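The four-stage project queue shown in the interface can be sketched as a toy scheduler. Only the stage names come from the article; the class, function names, and the one-stage-per-tick scheduling below are purely illustrative assumptions, not FARS's actual architecture:

```python
# Hypothetical sketch of a parallel project queue where each project
# passes through the four FARS stages in order. The scheduler advances
# every live project by one stage per tick (an invented simplification).

STAGES = ["Ideation", "Planning", "Experiment", "Writing"]

class Project:
    def __init__(self, pid):
        self.pid = pid
        self.stage = 0  # index into STAGES

    def advance(self):
        """Run the current stage (stubbed out) and move to the next."""
        self.stage += 1

    @property
    def done(self):
        return self.stage >= len(STAGES)

def run_queue(n_projects):
    """Drive all projects to completion, one stage per project per tick."""
    queue = [Project(i) for i in range(n_projects)]
    finished = []
    while queue:
        for p in list(queue):  # copy: we remove from queue while iterating
            p.advance()
            if p.done:
                queue.remove(p)
                finished.append(p.pid)
    return finished
```

In this toy version every project takes exactly four ticks; the real system presumably overlaps stages of very different durations, which is what makes the observed ~2-hour cadence per paper possible despite multi-hour experiments.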
To let it focus on research, Analemma built it a compute cluster with 160 GPUs and gave it access to nearly any open-source or closed-source large model; the experimental conditions far exceed those of most university labs.
The production capacity of this "assembly line" has reached a level that is hard to ignore. During a continuous operation cycle of approximately 228 hours (≈9.5 days):
- The system generated 244 research hypotheses
- Completed 100 short papers
- Consumed a total of 11.4 billion tokens
- The total cost was approximately $104,000 (≈750,000 RMB)
The whole process was unattended.
After further normalization, the system's "industrial rhythm" becomes more concrete: on average, one paper is completed roughly every 2 hours and 17 minutes, at an average cost of about $1,040, consuming about 114 million tokens.
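The per-paper averages follow directly from the totals reported above; a quick sanity check in Python:

```python
# Back-of-the-envelope check of the per-paper averages, using only the
# totals reported in the article (228h 28m 33s, 100 papers, 11.4B tokens,
# ~$104,000 total cost).

TOTAL_SECONDS = 228 * 3600 + 28 * 60 + 33  # 228h 28m 33s of runtime
PAPERS        = 100
TOTAL_TOKENS  = 11_400_000_000             # 11.4 billion tokens
TOTAL_COST    = 104_000                    # US dollars

seconds_per_paper = TOTAL_SECONDS / PAPERS
hours   = int(seconds_per_paper // 3600)
minutes = int(seconds_per_paper % 3600 // 60)

tokens_per_paper = TOTAL_TOKENS / PAPERS   # 114 million
cost_per_paper   = TOTAL_COST / PAPERS     # $1,040
```

The division reproduces the article's figures: about 2 h 17 min, 114 million tokens, and roughly a thousand dollars per paper.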
Compared with the common 3-6 months per paper in human research, this throughput gap is nearly an order of magnitude, and the cost is dramatically lower.
However, shifting the focus from throughput to efficiency, the roughly 114 million tokens consumed per paper far exceeds the overhead of ordinary long-form generation (usually millions of tokens) and of typical complex agent tasks (usually millions to tens of millions of tokens).
This suggests FARS is still at the stage of "trading compute for intelligence": its performance comes mainly from computational density rather than from squeezing out algorithmic efficiency.
Overall, FARS has demonstrated empirically that an end-to-end automated research assembly line is feasible in throughput terms; on the other hand, its current token and cost structure still leaves engineering headroom before it can "run at large scale at a low enough cost".
Quality: It Writes Fast, but Does It Write Well?
Quantity does not automatically mean quality. So what level is FARS's writing actually at?
To find out, the research team used Agentic Reviewer (paperreview.ai), an AI review system developed at Stanford University, to score all 100 papers uniformly against ICLR review standards.
According to the developers' public evaluation, Agentic Reviewer has reached the consistency level of human reviewers.
The developers conducted a comparative evaluation on the ICLR 2025 review data using the Spearman correlation coefficient. Human vs. human: 0.41; AI vs. human: 0.42. The developers believe that agentic reviewing is approaching the human level.
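For context on the metric behind those 0.41/0.42 figures, Spearman's rank correlation measures how well two sets of scores agree on the *ordering* of papers. A minimal pure-Python sketch, with tie-aware ranking; the score lists in the usage note are invented toy data, not the ICLR 2025 data:

```python
# Spearman's rho: the Pearson correlation of the two rank vectors.
# Illustrative implementation only, with average ranks for ties.

def rank(xs):
    """Rank values 1..n, assigning tied values their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over the tied block
        avg = (i + j) / 2 + 1           # average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation between two equal-length score lists."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

For example, `spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])` gives 0.8: the two rankings mostly agree with small local swaps. A human-vs-human value of only 0.41 on real review data shows how noisy peer review itself is, which is why 0.42 for AI-vs-human is read as "approaching human level".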
From the overall scoring results, among the 100 papers produced by FARS, the average score was 5.05 (ranging from 3.0 to 6.3).
A small number of papers fell into the 3.0-4.5 low-score range, and only a handful exceeded 6.0.
FARS's scores cluster around 5 points, indicating that output quality is not fluctuating randomly but has settled into a fairly stable "quality band". The handful of samples above 6 points shows the system can occasionally produce standout work.
How does this result compare with human achievements?
As a reference, the average score of human submissions to ICLR 2026 was 4.21, and the average score of the finally accepted papers was 5.39.
In comparison, FARS's 5.05 average is clearly above the overall average of human submissions, but still falls short of the "average acceptance line".
Better than the bottom of the field, short of the top.
The papers FARS generates score above the average human submission, yet still below the average accepted paper.
It bears repeating that this automated run focused on short papers and was not optimized for the review standards of today's academic conferences. Review results from Stanford's Agentic Reviewer, or from any other AI reviewer built on existing review criteria, should therefore be treated as a reference, not a verdict.
According to the team, in addition to AI review, manual quality review is also being carried out simultaneously, and a comprehensive quality report will be formed after the evaluation is completed.
Even under this cautious premise, taking the two sets of data together, the overall signal is fairly clear: on an evaluation scale close to human review, FARS has become a stable mid-range output machine.
In-Depth Reading of the Papers: From "Fast Follow-up" to "Facing Failure"
If the previous data and scores can only provide a macro - scale assessment, then specific paper samples truly reveal the research quality of FARS.
Some netizens evaluated one of the papers, an LLM-as-a-Judge work, and found it well organized in its abstract and in how it frames the problem.
Given that it was produced automatically by AI, its polish "exceeded expectations": the framework diagrams, result figures, and analyses are essentially complete, "looking quite professional".
Others think the project numbered FA0008 "makes sense".
Next, we select