
Review 20,000 papers in one day. AAAI 2026 implements AI paper review for the first time, with a cost of less than $1 per paper.

机器之心 · 2026-04-19 10:03
In six key comparisons, humans were soundly defeated by AI.

Is it reliable to use AI for paper review?

Different people may give different answers. What is clear, though, is that acceptance of AI review is gradually growing, and some top-tier conferences, under the pressure of massive submission volumes, are actively pushing it forward. ICML 2026, for example, has relaxed its restrictions on AI review, though it still does not allow reviews to be carried out entirely by AI. See the report "Do authors have the final say on whether to use AI for review? The new review policy of ICML 2026 is out".

A few days ago AAAI 2026, another top-tier conference overwhelmed by a huge number of submissions, made its own attempt. Its Main Technical Track received nearly 30,000 submissions, an enormous review workload. See the report "The submission volume of AAAI-26 has exploded: nearly 30,000 papers, 20,000 from China, and the review system is on the verge of collapse".

Specifically, the AAAI organizers, in collaboration with multiple universities and research institutions, ran a pilot study: an AI-generated review was produced for every main-track submission to AAAI-26.

The result may be within many people's expectations: overall, the AI outperformed humans. Or, as the AAAI organizers put it, "A large-scale survey of AAAI-26 authors and program committee members shows that participants not only find AI reviews useful, but actually prefer them on key dimensions such as technical accuracy and research suggestions."

Report title: AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

Report address: https://arxiv.org/abs/2604.13940

Now let's take a closer look at this "AAAI-26 AI Review Pilot" research report.

The peer-review challenges currently facing the AI field

With the rapid evolution of AI technology, the traditional scientific peer-review system faces an unprecedented workload. Whether at top venues like Nature or NeurIPS, submission counts have grown at an astonishing rate in recent years.

The review mechanism the academic community relies on, however, has hardly changed: it depends heavily on human experts who contribute enormous amounts of effort and time for free.

With reviewer resources increasingly scarce and senior scholars stretched thin, maintaining high-quality reviews, consistent evaluation criteria, and timely results is becoming ever more difficult.

To handle the record-high number of submissions to AAAI 2026, the organizing committee was forced to recruit more than 28,000 program committee members, fully three times the size of the previous conference's committee!

An unprecedented large-scale deployment: 20,000 in-depth reviews in one day

Into this moment of urgent need stepped the AAAI 2026 AI Review Pilot. Its long-form report details how the team used cutting-edge LLMs to conduct thorough AI reviews of the 22,977 papers that entered the full-review stage, in the high-pressure environment of a real top-tier academic conference.

In earlier explorations, research teams could only test AI review capabilities in isolated simulation environments or on small sets of already published, mature papers.

The AAAI 2026 pilot marks the first time an AI-generated review system has been officially deployed inside the strict, real double-blind submission process of a large-scale conference.

For every paper that entered the first review stage of the AAAI 2026 main track, both the authors and the reviewers received a review clearly labeled as AI-generated.

The organizing committee implemented the plan cautiously and drew a red line: the AI exists only to provide additional input along multiple dimensions, and no human expert's reviewing role was replaced by the algorithm. Moreover, the AI-generated document contains no numerical scores and gives no hard recommendation such as "accept" or "reject". Instead, senior program committee members (SPCs) and area chairs (ACs) are encouraged to cross-reference the problems the AI surfaces with the opinions of human experts, comprehensively assess each paper's quality, and decide whether to advance it to the second stage.

What is truly striking is the efficiency and cost control the platform demonstrated.

The report provides a clear calculation: large-scale AI review at a top-tier conference is entirely feasible and easy to implement, with the average compute cost per long-form academic paper coming in under $1.

It is worth mentioning that OpenAI, as an important backer of the conference, sponsored the project with free API resources. Running a multi-process workflow that includes a code sandbox and an external search interface on the state-of-the-art GPT-5 model, the underlying system finished reading and reviewing all 20,000-plus papers in under 24 hours.
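The reported scale can be sanity-checked with some back-of-envelope arithmetic (a rough sketch; the paper count, time budget, and per-paper cost bound are taken from the report, the rest is simple division):

```python
# Back-of-envelope check of the reported scale: ~23,000 papers
# reviewed in under 24 hours at under $1 of compute per paper.
papers = 22_977          # submissions that reached full review (from the report)
hours = 24               # wall-clock budget (from the report)
cost_per_paper = 1.0     # upper bound in USD (from the report)

papers_per_hour = papers / hours
total_cost_bound = papers * cost_per_paper

print(f"~{papers_per_hour:.0f} papers/hour sustained")
print(f"total compute cost below ${total_cost_bound:,.0f}")
```

Sustaining roughly a thousand in-depth reviews per hour is far beyond any human committee's throughput, which is the core economic argument of the pilot.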

The AAAI-26 AI review system and its review-generation timeline

Architecture analysis: abandoning end-to-end generation for a strict five-step verification cycle

Earlier comparative studies had already sounded the alarm: if developers take the shortcut of simply feeding a long academic document to a large model and hoping for a detailed review, they usually get superficial boilerplate or page after page of hallucinations.

Learning from those lessons, the team carefully built an industrial-grade pipeline of multiple nested LLM stages.

Because top language models have throughput limits when processing high-resolution images or heterogeneous multimodal documents, the system's preprocessing nodes standardize every PDF manuscript first. All illustrations are resampled to 250 DPI to fit within memory limits. And since earlier stress tests showed that pure text extraction often leads the model to catastrophically misread dense mathematical formulas and multi-level tables, the team used olmOCR to convert each original PDF into a Markdown file with accurate embedded LaTeX math and structured tables.
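That preprocessing stage might look roughly like the sketch below. This is not the team's actual code: `render_page` and `ocr_to_markdown` are stand-in stubs for a PDF rasterizer and for olmOCR, whose real APIs may differ; only the 250 DPI target comes from the report.

```python
from dataclasses import dataclass

TARGET_DPI = 250  # figure resampling target mentioned in the report

@dataclass
class PageAsset:
    image_dpi: int
    markdown: str  # extracted text with LaTeX math and tables preserved

def render_page(pdf_path: str, page: int, dpi: int):
    """Stand-in for a PDF rasterizer (e.g. pdf2image); returns a dummy image record."""
    return {"path": pdf_path, "page": page, "dpi": dpi}

def ocr_to_markdown(image) -> str:
    """Stand-in for olmOCR: page image -> Markdown with inline LaTeX and tables."""
    return "## Section\n$E = mc^2$\n| a | b |\n|---|---|\n| 1 | 2 |"

def preprocess(pdf_path: str, n_pages: int) -> list[PageAsset]:
    """Normalize one manuscript: rasterize each page at the target DPI, then OCR it."""
    assets = []
    for page in range(n_pages):
        img = render_page(pdf_path, page, TARGET_DPI)
        assets.append(PageAsset(image_dpi=img["dpi"], markdown=ocr_to_markdown(img)))
    return assets

pages = preprocess("paper.pdf", n_pages=2)
```

The point of the dual output (page images plus structured Markdown) is that the downstream review modules can cross-check the rendered figures against the extracted math and tables.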

With both the visual rendering of the PDF and the Markdown text in hand, the AI review system runs five core review modules in parallel:

Storyline review (Story): evaluate whether the problem setting is valid, whether the claimed literature gap is real, whether the core contribution holds up, and whether the paper's chain of evidence withstands scrutiny.

Presentation and structure scan (Presentation): judge the clarity of the writing, the coherence of the sections, and the readability of the prose, and check whether the complex technical material is accessible to peers.

Experimental evaluation check (Evaluations): activate the embedded Python code interpreter to scrutinize the chosen baselines, test sets, and statistical-significance indicators, look for data loopholes in the experiments supporting the core claims, and question reproducibility in particular.

Correctness deduction (Correctness): also relying on the code sandbox, verify the correctness of complex mathematical formulas, logical proofs, algorithmic pseudocode, and the data mapped in charts.

Significance and field positioning (Significance): let the model query a customized web search engine for cross-library literature tracking. To prevent information pollution, search access is strictly limited to officially published literature from relevant top-tier venues, excluding all non-peer-reviewed preprints. On this basis the module assesses the article's real novelty and searches for comparison baselines the authors may have avoided.

Once these five checks are complete, the system assembles the scattered findings into a well-formatted, detailed initial review draft. Then comes the most crucial step: the "self-reflection and criticism" module.

The model is instructed to switch roles and scrutinize the draft it just wrote, hunting for baseless accusations, factual misjudgments, or passages that contradict the original paper. Based on the correction list this self-critique produces, the model then rewrites and outputs the final AI review report. All underlying dialogue logs, intermediate checkpoints, and debugging reports are permanently retained for human auditing.
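The overall flow, five parallel aspect reviews feeding a draft that is then self-critiqued and rewritten, can be sketched as a simple orchestration loop. This is a minimal illustration, not the pilot's implementation: `call_llm` is a stub standing in for the actual GPT-5 calls, and the prompt wording is invented.

```python
ASPECTS = ["story", "presentation", "evaluations", "correctness", "significance"]

def call_llm(prompt: str) -> str:
    """Stub for the underlying model call; returns a canned placeholder response."""
    return f"[model output for: {prompt[:40]}...]"

def review_paper(markdown_text: str) -> str:
    # 1) Run the five aspect reviews independently over the same manuscript.
    sections = {a: call_llm(f"Review the {a} of this paper:\n{markdown_text}")
                for a in ASPECTS}

    # 2) Merge the per-aspect findings into an initial draft review.
    draft = "\n\n".join(f"### {a.title()}\n{sections[a]}" for a in ASPECTS)

    # 3) Self-critique: ask the model to attack its own draft.
    critique = call_llm(f"List unsupported claims or contradictions in this review:\n{draft}")

    # 4) Rewrite using the critique; no scores, no accept/reject verdict in the output.
    return call_llm(f"Rewrite this review, fixing these issues:\n{critique}\n---\n{draft}")

final = review_paper("# A Paper\nSome content...")
```

The design choice worth noting is that the critique pass consumes the draft, not the paper, so it specifically targets reviewer errors (unsupported accusations, contradictions) rather than re-reviewing the manuscript.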

Before a report is finally sent to the authors, a quality filter based on GPT-4o-mini quietly screens it. The filter checks whether the text accidentally reveals the anonymous authors' identities through the model's negligence, contains insulting language, shows systematic bias by gender or region, or is structurally broken. Only after this refinement is the report released.
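A minimal sketch of that final safety gate is below. It is illustrative only: the real filter is a GPT-4o-mini classifier, which is replaced here by simple rule checks, and the specific patterns and blocklist words are assumptions.

```python
import re

def passes_quality_filter(review: str) -> tuple[bool, list[str]]:
    """Reject a review that leaks identity hints, contains abusive language,
    or lacks the expected section structure. Rule-based stand-in for the
    GPT-4o-mini screening model described in the report."""
    problems = []
    # De-anonymization: naive check for phrases pointing at the authors' identity.
    if re.search(r"\bthe authors? (?:are|is) (?:from|at)\b", review, re.IGNORECASE):
        problems.append("possible de-anonymization")
    # Tone: a (hypothetical) blocklist of insulting words.
    if any(w in review.lower() for w in ("idiotic", "worthless", "garbage")):
        problems.append("abusive language")
    # Structure: the report must contain its expected section headers.
    if "###" not in review:
        problems.append("broken structure")
    return (not problems, problems)

ok, issues = passes_quality_filter("### Correctness\nThe proof of Lemma 2 is incomplete.")
```

In production such a gate would run as the last stage of the pipeline, quarantining flagged reviews for human inspection instead of silently dropping them.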

In six key comparisons, humans were defeated by AI

However impressive the system's specifications, the real verdict lies with the research community. To gauge the actual effectiveness of this costly pilot, the research team sent tracking questionnaires to all stakeholders of the conference and ultimately recovered 5,834 responses.

The questionnaire measures review quality along nine criteria, with respondents giving their judgments on a 5-point Likert scale.

The final statistics reveal a reality that may make traditional scholars a little uncomfortable: of the nine comparison dimensions, the average score of the AI reviews exceeded that of human-written reports in six.

More interestingly, the authors under review showed an even stronger preference for the AI reviews than the famously picky reviewers did.

Survey responses: comparison of AI and human reviews (a) and issues with AI reviews (b)

Specifically, the AI showed clear advantages in the following dimensions (all differences are statistically significant by p-value):

Accurately identifying deep technical errors (an average score lead of +0.67, the largest of any dimension).

Presenting important counter-evidence the authors completely overlooked while writing (+0.61).

Contributing practical guidance on restructuring arguments and improving figures and tables (+0.54).

Offering constructive technical suggestions for repairing experimental logic and strengthening research design (+0.49).

Producing reports whose comprehensiveness and thoroughness, even by the standards of a top-tier conference like AAAI, humans struggle to match (+0.48).

Of course, machines are not yet invincible. On the remaining three criteria, respondents still judged human reviewers superior.

The data show that the AI often fixates on trivia, magnifying minor details into fatal problems (a lag of -0.36); over long discussions the model itself sometimes writes review text containing technical errors (-0.22); and it occasionally gives impractical, even absurd, suggestions (-0.11).

Overall, 53.9% of respondents felt the AI played a clearly beneficial role in this massive review process, while only 20.2% thought it was a hindrance. Moreover, 61.5% said they hope AI will keep participating in peer review over their long-term academic careers.

Notably, even though participants had expectations going in, 55.6% admitted that the technical depth the machine demonstrated far exceeded what they had believed AI capable of.

Clustering the open-ended feedback: strengths and pain points in direct collision

Beyond the raw scores, the research group also used powerful LLMs to run natural-language clustering analysis on the 320 free-text comments collected, extracting the five most concentrated points of praise and the five major criticisms of the full-scale introduction of AI into review.

The five most common pieces of positive feedback:

Targeted revision strategies (5.3%): the AI does not just criticize; it excels at turning sharp criticism into logical, actionable revision outlines.

Astonishing reading breadth and meticulousness (5.2%): machines have no fatigue period, and their exhaustive, detail-by-detail analysis puts humans to shame.

A technical-flaw detector (5.0%): it frequently and accurately identifies formula errors in dense derivations that several human peers had overlooked.

Cold, absolute objectivity (4.3%): the AI has no academic factional disputes and perfectly stable emotions; its involvement acts as a buffer, diluting the unfairness caused by individual reviewers' subjective biases or deliberate suppression of dissenting work.

Grammar and layout polish (4.2%): it reliably flags spelling issues, tense errors, and irregular figure layouts.

The five most widely criticized shortcomings:

A severe lack of big-picture vision and scientific intuition (9.1%): an insurmountable gap for current machines, which often seem clumsy when judging whether a study carries field-disrupting power or hidden scientific value.

Getting bogged down in details and nit-picking (8.5%): the AI often writes long-winded complaints about a few formatting issues, making reports lose focus and burying the truly important logical flaws.

Information overload (8.3%): a report running several pages and raising dozens of minor questions substantially increases the processing burden on both the authors under review and the review chairs.

Catastrophic misreadings (7.7%): when facing cutting-edge open problems or complex multi-level tensor equations, LLMs still sometimes completely misinterpret the original meaning.

Shallow domain knowledge (7.6%): unlike an expert who has worked in a narrow specialty for over a decade, the AI cannot spot at a glance the hidden connection between an article and an obscure technique from ten years ago.
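The theme percentages above come from LLM-based clustering of free-text comments. A toy sketch of that kind of analysis is below; this is not the group's method in detail: `classify_comment` stands in for an LLM labeling call (here a crude keyword rule), and the theme names and example comments are invented.

```python
from collections import Counter

def classify_comment(comment: str) -> str:
    """Stub for an LLM call that assigns one theme label to a free-text comment.
    A real pipeline would prompt a model; keyword rules keep this runnable."""
    lower = comment.lower()
    if "formula" in lower or "error" in lower:
        return "technical flaw detection"
    if "fair" in lower or "bias" in lower:
        return "objectivity"
    return "thoroughness"

comments = ["It caught a formula error humans missed.",
            "Very fair, no reviewer bias.",
            "It read every appendix line."]

shares = Counter(classify_comment(c) for c in comments)
# Fraction of comments per theme, analogous to the percentage figures reported.
percentages = {theme: 100 * n / len(comments) for theme, n in shares.items()}
```

Aggregating per-comment labels into theme percentages is what turns 320 anecdotes into the ranked praise-and-criticism lists shown above.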

An anonymous researcher wrote in the feedback box: "I shudder at this system's thoroughness. It found deep technical holes that human eyes easily miss and effortlessly supplied the most relevant references. Its coldness guarantees freedom from subjective prejudice. But it lacks an intuition, an instinct that only scholars who have spent countless days and nights in the laboratory possess. Faced with a creative idea that strays slightly from the orthodox paradigm yet holds astonishing potential, the AI will only rigidly suppress it."

This scholar finally suggested that in the future, tasks such as literature