
Reviewing 20,000 papers in one day: AAAI 2026 pilots AI-generated reviews for the first time, at a compute cost of less than US$1 per paper.

机器之心 · 2026-04-19 10:03
In six key comparisons, AI beat human reviewers outright.

Can AI be trusted to review scientific papers?

Different people will give different answers, but it is undeniable that acceptance of AI-assisted reviewing is growing. Under the pressure of surging submission numbers, some top conferences have already begun experimenting with it. ICML 2026, for example, has relaxed its restrictions on AI in reviewing, although fully AI-written reviews are still not allowed. See the earlier report "Should authors decide whether AI reviews their paper? ICML 2026 releases its new reviewing guidelines".

A few days ago, AAAI 2026, another top conference buckling under submission volume, made its own attempt. Consider that the conference's Main Technical Track alone received nearly 30,000 submissions, an enormous reviewing workload. See the report "Submission explosion at AAAI-26: nearly 30,000 papers, 20,000 from China, and a review system on the verge of collapse".

Specifically, the AAAI organization, in cooperation with several universities and research institutes, ran a pilot study: an AI-generated review was produced for every submission in the AAAI-26 main track.

As for the results, many had expected them: overall, the AI outperformed humans. Or, as the AAAI organizers put it: "A comprehensive survey of AAAI-26 authors and program committee members shows that participants not only found the AI reviews useful, but actually preferred them on important dimensions such as technical accuracy and research recommendations."

Report title: AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

Report address: https://arxiv.org/abs/2604.13940

Below, we take a closer look at the report on the AAAI-26 AI review pilot.

The peer-review bottleneck in AI research

With the rapid development of AI technology, the traditional system of scientific peer review is under unprecedented pressure. At well-known journals like Nature and conferences like NeurIPS alike, submission numbers have grown exponentially in recent years.

Yet the review system the academic world relies on has barely changed: it still depends heavily on the unpaid labor of human experts, who must invest considerable time and effort.

Facing a shortage of reviewers and the overload of experienced scientists, it is becoming ever harder to maintain review quality, consistency of criteria, and timeliness of results.

To cope with the record number of submissions at AAAI 2026, the organizers even expanded the program committee to more than 28,000 members, three times the size of the previous conference's.

An unprecedented undertaking: 20,000+ in-depth reviews in one day

Against this urgent backdrop, the AAAI 2026 AI review pilot emerged. The detailed report describes how the team used state-of-the-art large language models (LLMs), in the real, high-pressure setting of a top conference, to generate in-depth AI reviews for the 22,977 papers that reached the full review stage.

In prior work, research teams could usually only test AI's reviewing capabilities in isolated simulation environments or on small sets of already-published papers.

The AAAI 2026 pilot is the first time an AI-generated review has been introduced and officially used inside the strict, real double-blind submission process of a large-scale conference.

All 22,977 main-track papers that made it through the first review phase of AAAI 2026 received a review clearly labeled as AI-generated, visible to both authors and reviewers.

The organizers drew careful boundaries when implementing the project. AI was introduced only to supply additional information to the process; no algorithm took decision authority away from any human expert. Moreover, the final AI-generated document contains no numeric scores and makes no hard recommendations such as "accept" or "reject". Instead, Senior Program Committee (SPC) members and area chairs (ACs) were encouraged to compare the issues the AI surfaced against the opinions of human experts when assessing paper quality and deciding whether a paper should advance to the second phase.

The efficiency and low cost of the AI platform are striking.

The report shows that AI reviewing at the scale of a top conference is entirely feasible and easy to deploy: the compute cost per full-length paper came to less than US$1.

Notably, OpenAI, a major supporter of the conference, provided the API resources for the project free of charge. Running its current flagship model GPT-5 in a multi-stage workflow equipped with code sandboxes and external search interfaces, the system read and reviewed all 20,000+ papers in under 24 hours.

The AAAI-26 AI review system and the timeline of review generation

Architecture analysis: abandoning end-to-end generation for a strict five-stage review-and-validation loop

Prior comparative studies have shown that when developers take the easy route and simply hand a long scientific document to an LLM, hoping it will spit out a detailed review, the result is usually superficial boilerplate or riddled with hallucinations.

Having absorbed these lessons, the development team built a sophisticated multi-stage LLM pipeline.

Given the limitations of even top language models on high-resolution images and heterogeneous multimodal documents, every PDF manuscript is standardized in a preprocessing step at the front of the system. All pages are re-rendered as images at 250 DPI. And because early tests showed that plain text extraction often catastrophically mangles mathematical formulas and multi-level tables, the team used the dedicated olmOCR tool to convert the original PDF into a Markdown file with accurate LaTeX math notation and structured table information.
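The report does not publish the preprocessing code; the following is a minimal Python sketch of the two-representation idea, with the page renderer and olmOCR call replaced by stubs (the function names and dummy outputs are illustrative assumptions, not the pilot's actual implementation).

```python
from dataclasses import dataclass

@dataclass
class PreparedPaper:
    """Two parallel views of one submission, as the report describes."""
    page_images: list  # pages re-rendered as images (the pilot used 250 DPI)
    markdown: str      # OCR output with LaTeX math and structured tables

def render_pages(pdf_bytes: bytes, dpi: int = 250) -> list:
    # Stub: a real pipeline would rasterize each PDF page at `dpi`
    # (e.g. with PyMuPDF or pdf2image); here we fake a two-page paper.
    return [f"<page-image {i} @ {dpi} dpi>" for i in range(2)]

def ocr_to_markdown(pdf_bytes: bytes) -> str:
    # Stub standing in for olmOCR, which recovers LaTeX math and table
    # structure that plain text extraction tends to mangle.
    return "# Title\n\n$E = mc^2$\n\n| method | score |\n|---|---|\n| ours | 0.9 |"

def prepare(pdf_bytes: bytes) -> PreparedPaper:
    # The review model later receives BOTH the visual and the textual view.
    return PreparedPaper(page_images=render_pages(pdf_bytes),
                         markdown=ocr_to_markdown(pdf_bytes))

paper = prepare(b"%PDF-1.7 ...")
print(len(paper.page_images), "$" in paper.markdown)  # → 2 True
```

Keeping both representations lets later stages fall back to the rendered pages whenever the OCR text looks suspect.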

Once the review system has both the visual rendering of the PDF and the Markdown text, it sets to work in parallel across five core review stations:

Narrative review (Story): rigorously checks whether the problem statement holds up, whether the claimed literature gap is real, whether the core contribution is defensible, and whether the paper's argument structure is sound.

Presentation review (Presentation): assesses the clarity of the writing, the coherence of the sections, and grammatical readability, and checks whether complex technical material is accessible to peers.

Experiments review (Evaluations): activates the integrated Python code interpreter to inspect the paper's baselines, test data, and statistical significance indicators, checking whether the experiments supporting the core claims have gaps and explicitly probing reproducibility.

Correctness review (Correctness): again draws on the code sandbox to verify the correctness of complex mathematical formulas, logical proofs, algorithm pseudocode, and the numbers in figures and tables.

Significance review (Significance): authorizes the LLM to query a customized literature search engine for cross-referencing. To avoid contamination, search access is restricted to officially published papers at relevant top venues, excluding all non-peer-reviewed preprints, in order to assess the paper's true novelty and surface comparison experiments the authors may have deliberately avoided.
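Since the five stations are described as independent passes over the same paper, their orchestration can be sketched as a parallel fan-out. This is a hypothetical Python sketch with each station's LLM-plus-tools call replaced by a stub; the station names follow the report, everything else is assumed.

```python
from concurrent.futures import ThreadPoolExecutor

# The five stations named in the report; each would carry its own prompt
# and tool set (code sandbox for evaluations/correctness, scoped
# literature search for significance).
STATIONS = ["story", "presentation", "evaluations", "correctness", "significance"]

def run_station(name: str, paper_md: str) -> dict:
    # Stub for one station's model call over the paper's Markdown view.
    return {"station": name, "findings": [f"{name}: example finding"]}

def review_all(paper_md: str) -> dict:
    # The stations do not depend on each other, so they run concurrently.
    with ThreadPoolExecutor(max_workers=len(STATIONS)) as pool:
        results = pool.map(lambda s: run_station(s, paper_md), STATIONS)
    return {r["station"]: r["findings"] for r in results}

findings = review_all("# paper markdown ...")
print(sorted(findings))
```

Fanning out per station also bounds latency: the slowest single station, not the sum of all five, dominates the per-paper wall-clock time.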

After these five passes, the system reassembles the scattered findings and generates a formal, detailed first draft of the review. Then comes the most important step: the "self-critique" module.

The LLM is asked to switch roles and scrutinize the draft it has just written, hunting for unsupported accusations, factual errors, and passages that contradict the original paper. It then rewrites the draft against the correction list produced by the self-critique and outputs the final AI review report. All low-level dialogue transcripts, intermediate checkpoints, and debugging logs are stored permanently for human audit.
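The draft-critique-rewrite loop can be summarized in a few lines. Again a stubbed Python sketch under assumed function names; each stub stands in for a separate LLM call in the pilot's pipeline.

```python
def draft_review(findings: dict) -> str:
    # First LLM call: assemble the five stations' findings into a draft.
    return "Draft: " + "; ".join(f for fs in findings.values() for f in fs)

def self_critique(draft: str) -> list:
    # Second call, critic role: list unsupported claims, factual errors,
    # and contradictions with the paper. Stubbed with a fixed finding.
    return ["claim X lacks evidence"] if draft.startswith("Draft") else []

def finalize(draft: str, issues: list) -> str:
    # Third call: rewrite the draft against the critique's correction list.
    return draft.replace("Draft", "Final", 1) if issues else draft

findings = {"story": ["unclear gap"], "correctness": ["Eq. 3 sign error"]}
draft = draft_review(findings)
final = finalize(draft, self_critique(draft))
print(final)  # → Final: unclear gap; Eq. 3 sign error
```

The key design point the report emphasizes is that the critique output is an explicit, persisted artifact, so humans can later audit what the model flagged about its own draft.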

Before a report is finally sent to the authors, one more quality filter, based on GPT-4o-mini, silently screens it. This filter checks whether the LLM has carelessly broken author anonymity anywhere in the text, whether the report contains offensive language, whether it shows systematic bias by gender or region, and whether the report's structure itself is damaged. Only reports that survive this screening are released.
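The gate's logic is a conjunction of independent checks. A toy Python sketch, where each rule-based stub stands in for what the report describes as a GPT-4o-mini classification call (the heuristics here are illustrative only):

```python
def reveals_identity(report: str) -> bool:
    # Stub for the de-anonymization check; toy string heuristic.
    return "the authors' github" in report.lower()

def is_offensive(report: str) -> bool:
    # Stub for the offensive-language / bias check.
    return any(w in report.lower() for w in ("stupid", "nonsense"))

def is_malformed(report: str) -> bool:
    # Stub for the structural-integrity check.
    return len(report.strip()) == 0

def passes_gate(report: str) -> bool:
    # A report is released only if it clears EVERY check.
    checks = (reveals_identity, is_offensive, is_malformed)
    return not any(check(report) for check in checks)

print(passes_gate("The experimental section lacks a strong baseline."))  # → True
print(passes_gate("This paper is nonsense."))                            # → False
```

Putting the gate last means a single failed check can quarantine a report without touching the upstream review pipeline.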

AI beats humans in six key comparisons

However impressive the system's specifications, the final verdict rests with the broader research community. To gauge the actual effectiveness of this costly pilot, the research team sent a follow-up survey to all conference stakeholders and received 5,834 responses.

The survey measured review quality against nine criteria, with respondents rating each on a 5-point Likert scale.

The resulting statistics reveal a reality that traditional scientists may find a little uncomfortable: in six of the nine comparison groups, the AI reviews scored higher on average than the reports written by human scientists.

Interestingly, the authors of the reviewed papers showed a stronger preference for the AI reviews than the critical reviewers did.

Survey responses: comparison of AI and human reviews (a) and questions about the AI reviews (b)

Specifically, the AI showed a clear advantage in the following dimensions (all differences were statistically significant by p-value):

Accurately identifying deep technical errors, where the AI proved extremely sensitive (average score +0.67 higher).

Raising important counter-evidence that the authors, trapped in their own framing, had not considered at all (+0.61).

Offering practical suggestions for restructuring arguments and improving the paper's figures (+0.54).

Making constructive technical suggestions for repairing experimental logic and strengthening research design (+0.49).

Producing reports whose detail and thoroughness surpassed human reviews, even by the standards of a top conference like AAAI (+0.48).

Of course, the machine is not yet invincible. On the remaining three criteria, respondents still judged human reviewers superior.

The data show that the AI often fixates on minutiae, inflating insignificant details into fatal flaws (a deficit of -0.36); over long texts, there is some chance the LLM introduces technical errors into the review report itself (-0.22); and it occasionally offers absurd, practically unimplementable suggestions (-0.11).

Finally, 53.9% of respondents said the AI was very useful in this massive review process, while only 20.2% felt the machine was more of a hindrance. Moreover, 61.5% of experts said they hoped AI would continue to participate in peer review throughout their future academic careers.

It is remarkable that, although the participants were already aware before the...