Shanghai Jiao Tong University team: Let Claude Code conduct "reliable" scientific research while you sleep. Two papers have been accepted by top AI conferences.
The self-developed research AI agent can already complete the entire pipeline from inspiration to paper: by the time we wake up, the agent may have finished the experiments and even written a decent draft.
It sounds wonderful, but how do we know if the Agent has quietly "lied" in the conclusion?
Two key problems remain, however. First, generation and review are often handled by models from the same family, so many systematic errors are hard to expose internally. Second, after an agent has worked continuously for several days with little supervision, it is difficult to judge whether its final conclusions are really supported by sufficient evidence.
In response to these problems, a team from Shanghai Jiao Tong University proposed "Auto-Research-in-sleep (ARIS)", an open-source research harness for automated scientific research.
Paper link: https://arxiv.org/abs/2605.03042
The focus of this work is not to make the agent write papers faster, but to make the papers it writes more testable.
Notably, in community case demonstrations, some researchers have already used ARIS end to end to complete papers that were subsequently accepted at conferences.
ARIS: Let Claude Code Do Scientific Research While You Sleep
According to the paper, the system topology of ARIS consists of three layers:
Layer 1: Execution layer, which provides concrete capabilities. It consists of reusable Markdown-defined skills and a persistent research wiki.
Layer 2: Orchestration layer, which strings these capabilities into complete processes. There are five end-to-end workflows (idea discovery, experiment bridging, automatic review cycle, paper writing, and rebuttal), covering four research stages from discovery to post-submission.
Layer 3: Assurance layer, the core innovation of ARIS. It audits evidence against claims and inspects the manuscript, and includes a three-stage evidence-claim audit cascade, a five-pass scientific-writing editing pipeline, a mathematical proof checker, visual PDF review, and citation auditing.
Figure | ARIS system topology. Six groups of components interact through labeled relationships (see the left sidebar): the outer Meta-Optimization loop gates the Assurance layer, which checks Artifacts; Artifacts are generated and consumed by Workflows, which orchestrate Skills; Skills call MCP and tool bridges to access external models and data. The executor and reviewer on the right use models from different series. The ARIS-Code CLI packages all components into a standalone binary.
Core Mechanism: Cross-Model Adversarial Collaboration
The research team believes that a single agent struggles to reliably complete long-horizon research tasks, so they adopt an "execution-review-correction" cycle that crosses model families.
The executor (the Claude family is recommended by default) produces code, experiments, or paper drafts; the reviewer (the GPT-5.4 family is recommended by default) scores against predefined criteria and returns structured action items; the executor revises accordingly and resubmits until the score meets the bar.
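The cycle described above can be sketched as a simple loop. This is a minimal illustration assuming a score threshold of 8.0/10 and plain callables standing in for the executor and reviewer; it is not the actual ARIS interface.

```python
# Minimal sketch of the cross-model "execution-review-correction" loop.
# The threshold, round budget, and callable signatures are assumptions
# for illustration, not ARIS internals.

def review_loop(generate, score, revise, threshold=8.0, max_rounds=10):
    """Alternate executor generation and reviewer scoring until the
    reviewer's score clears the threshold or the round budget runs out."""
    draft = generate()
    review = score(draft)
    for _ in range(max_rounds):
        if review["score"] >= threshold:
            break  # converged
        # The executor revises against the reviewer's structured action items.
        draft = revise(draft, review["action_items"])
        review = score(draft)
    return draft, review

# Toy stand-ins: each applied fix raises the reviewer's score by one point.
def toy_score(draft):
    return {"score": 5.0 + draft.count("[fix]"),
            "action_items": ["tighten the unsupported claim in Section 3"]}

draft, review = review_loop(
    generate=lambda: "draft v1",
    score=toy_score,
    revise=lambda d, items: d + " [fix]",
)
```

With these toy stand-ins the loop converges after three revisions; in ARIS, the point is that `score` and `revise` are backed by models from different families, so the reviewer's systematic biases differ from the executor's.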
Figure | Cross-model adversarial collaboration alternates between executor generation and external-model criticism, executable revision requests, and convergence checks. The reviewer's access scope can range from documents only to the entire code repository.
End-to-End Workflows
Based on this mechanism, ARIS organizes five end-to-end workflows:
Workflow 1: Idea Discovery, responsible for literature research, novelty checking, and experiment planning;
Workflow 2: Experiment Bridging, which carries the plan through code implementation, compute execution, and result retrieval;
Workflow 3: Automatic Review Cycle. In each round, the draft is submitted to cross-model reviewers for structured scoring, action items are extracted, GPU experiments are run as needed to gather new evidence, affected sections are revised, and convergence is checked;
Workflow 4: Paper Writing. The system completes seven key steps in sequence: paper planning and figure generation, then LaTeX writing and five editing passes; proof checking if necessary; then conclusion auditing, compilation, and an automatic improvement cycle of two rounds of visual review and automatic revision based on GPT-5.4 xhigh;
Workflow 5: Post-Submission. The system parses review comments, splits out key issues, plans response strategies, drafts responses, passes three safety checks, runs stress tests, and finally finalizes the paper. The safety checks guard against fabrication, over-commitment, and omitted responses.
Figure | ARIS workflow library. Top: end-to-end composition of the five workflows and their artifact contracts, grouped into four research stages: discovery, experiment, manuscript, and post-submission; dotted lines represent reviewer feedback, GPU-triggered evidence collection, and wiki memory. Bottom: compressed internal structure of several workflows not expanded separately in the main text, including W1 Idea Discovery (iterative refinement with reviewer gating), W1.5 Experiment Bridging (with code review and automatic debugging rollback), and W4 Response to Review Comments (with safety gates and stress tests).
Adding a "Self-Verification Safety Net" to AI Output
The most distinctive design in ARIS is its three-step audit chain. First, the system checks whether the experiment itself is reliable, looking for issues such as false labels, ghost results, unexecuted metrics, and over-extrapolation. Second, it matches each candidate conclusion against the existing evidence, one by one, to judge whether it is "supported", "partially supported", or "not valid". Third, it returns to the paper text to check directly whether the original results, the experimental settings, and the numbers and tables in the text are consistent.
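The second step, mapping conclusions to evidence, can be pictured as building a claim ledger. The sketch below is a hedged illustration: the ledger schema, the evidence keys, and the verdict rules are assumptions made for clarity, not the actual ARIS data structures.

```python
# Illustrative sketch of mapping candidate claims to verdicts against
# audited evidence. Schema and rules are hypothetical, not ARIS internals.

def audit_claims(claims, evidence):
    """Return a ledger assigning each claim one of three verdicts,
    based on how much of its required evidence survived the stage-1
    experiment audit."""
    ledger = {}
    for claim, required in claims.items():
        found = [e for e in required
                 if e in evidence and evidence[e]["passed_audit"]]
        if len(found) == len(required):
            ledger[claim] = "supported"
        elif found:
            ledger[claim] = "partially supported"
        else:
            ledger[claim] = "not valid"
    return ledger

evidence = {
    "table2_accuracy": {"passed_audit": True},
    "fig3_ablation":   {"passed_audit": False},  # failed the experiment audit
}
claims = {
    "Our method improves accuracy": ["table2_accuracy"],
    "Gains come from component X":  ["table2_accuracy", "fig3_ablation"],
    "Scales to 10x larger models":  ["missing_run"],
}
ledger = audit_claims(claims, evidence)
```

A claim backed by a failed or missing experiment is downgraded rather than silently kept, which matches the paper's description of pruning conclusions that lack sufficient evidence.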
Beyond this audit chain, the team has added further safeguards. After the first draft is complete, ARIS runs five passes of scientific editing, addressing redundant phrasing, active voice, local coherence, terminology consistency, and numerical consistency in sequence. For papers with a heavy theoretical component, a proof checker reviews the proof obligations. During the review stage, the system checks for problems such as misaligned figure captions, abnormal page layouts, and table readability. Finally, citations are verified: not only whether the cited works exist and their metadata is correct, but also whether they actually support the assertions in the main text.
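The five sequential editing passes amount to a pipeline in which each pass sees the previous pass's output. The sketch below is a minimal illustration with trivial placeholder editors; the pass names follow the article, but the pipeline shape is an assumption.

```python
# Hedged sketch of the five sequential editing passes as a pipeline of
# text-transforming functions. The editors here are placeholders.

PASS_ORDER = ["redundancy", "active_voice", "coherence",
              "terminology", "numbers"]

def edit_pipeline(text, editors, order=PASS_ORDER):
    """Apply each editing pass in sequence; later passes (e.g. numerical
    consistency) operate on prose already tightened by earlier ones."""
    for name in order:
        text = editors[name](text)
    return text

# Toy editors that tag the text so the pass order is visible.
editors = {name: (lambda n: lambda t: t + f" <{n}>")(name)
           for name in PASS_ORDER}
result = edit_pipeline("draft", editors)
```

Ordering matters here: checking numerical consistency before removing redundancy could re-introduce mismatches, which is presumably why the passes run in a fixed sequence rather than in parallel.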
Figure | Evidence-to-claim audit cascade. Stage 1 (experiment-audit): reviewers audit the evaluation scripts and result files for integrity failure modes. Stage 2 (result-to-claim): results are mapped to explicit claim verdicts (supported, partially supported, falsified); claims tied to any failed audit are downgraded. Stage 3 (paper-claim-audit): a fresh reviewer with no prior context compares each quantitative claim in the manuscript, one by one, against the claim ledger and the original result files.
From "Trial and Error" to "Spiral Learning"
The research knowledge base is another important part of ARIS. It is not an ordinary notebook but a project-level persistent record that continuously stores relevant papers, research ideas, experimental runs, and intermediate conclusions, along with the relationships between them. Without this memory, the same unworkable idea may be re-proposed in round after round; with it, failed directions are excluded promptly, and verified conclusions become the starting point for the next round of research.
Figure | Why the wiki matters. Without the wiki (left), each session starts from scratch; the same failed idea A may be retried indefinitely because the system cannot remember earlier results. With the wiki (right), the failure in session one is recorded; in session two the wiki is read during idea generation, A is skipped, and B is tried successfully; in session three research continues from B and explores C/D. Failed ideas become a "do-not-try list", and verified claims become the basis for the next round of idea generation, turning a one-shot research process into spiral learning.
What's the effect?
To date, the ARIS skill library has grown from the initial 21 core skills to more than 65, covering directions such as robotics, hardware design, communications, mathematical proof, grant applications, and presentation generation. ARIS has also been tested on three platforms: Claude Code, Codex CLI, and Cursor. On the review side, it can currently access multiple model back-ends, including GPT, Gemini, and DeepSeek.
The research team also provides a real overnight run log. Within about eight hours, ARIS completed four review-revision cycles, raising the internal review score from 5.0/10 to 7.5/10. Along the way, it triggered more than 20 GPU experiments and actively deleted some conclusions that lacked sufficient evidence. This shows that ARIS can at least turn review-driven revision into an executable process, rather than stopping at wording refinement.
However, the team is careful in presenting these results, emphasizing in the paper that they are observational evidence from which no causal judgment can be drawn. In other words, the run shows that conclusion pruning and review-driven revision can be operationalized in a real pipeline; it does not prove that cross-family review is necessarily better than same-family review, nor that the current dual-reviewer structure is optimal.
Deficiencies and Future Directions
The lack of controlled evaluation is the main limitation of the current work. All reported results are observational records, and the team acknowledges that model choice, task difficulty, and run intensity all affect outcomes, so the effects cannot be causally attributed to ARIS itself.
ARIS also cannot guarantee that any given output is correct, novel, or scientifically sound. The three-stage audit chain can prevent many common distortions, but it cannot be guaranteed to catch every error or fabrication. If the reviewers consistently favor a particular methodology, the system may end up optimizing for the reviewers' tastes rather than for genuine scientific quality. The paper also notes that the final choice of research direction, evidence verification, and submission decisions still rest with humans. On the security side, repository-level review may involve sending source code to external model APIs; a local review path is still being planned.
It is worth noting that the mechanisms in ARIS may serve more than paper writing. The independent reviewer, the evidence-to-conclusion audit, and the traceable ledger could, in principle, be placed between model outputs and downstream training signals, becoming an explicit supervision layer for self-improving systems.
Ultimately, what ARIS really advances is not the speed of automated research but its credibility. It may not offer the definitive answer, but it has at least brought a long-neglected problem to the fore. For automated science, that may matter more than being a little faster or a little more automated.
This article is from the WeChat official account "Academic Headlines" (ID: SciTouTiao), author: Academic Headlines. Republished by 36Kr with permission.