Not enough difficult problems to solve on Codeforces? Saining Xie and collaborators have created an AI problem setter that can generate original competitive programming problems.
Rich Sutton once observed that AI can only create and maintain knowledge to the extent that it can verify that knowledge itself. Einstein and Infeld likewise wrote in their co-authored book The Evolution of Physics: "The formulation of a problem is often more essential than its solution, which may be merely a matter of mathematical or experimental skill. To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science." As large language models (LLMs) move toward general capabilities, with artificial general intelligence (AGI) as the ultimate goal, testing their ability to pose problems is becoming increasingly important. This is especially true when applying LLMs to advanced programming tasks, since the further development and economic integration of LLM coding capabilities will require a large amount of verification work.
First, creating problems for programming competitions requires a deeper understanding of algorithms than solving them.
For example, basic problems can often be reduced to recognizable templates and solved with simple techniques, and many standard programming problems allow partially correct or boilerplate solutions to pass, which can mask flawed reasoning. In contrast, competitive programming problems have strict standards: they aim to evaluate a deeper understanding of underlying algorithm design principles, data structures, and complexity trade-offs. Verifying the vast space of possible solutions and fully covering shortcuts and boundary cases is extremely challenging, yet necessary, for competition-level problems. Creating problems therefore encompasses all the challenges of solving them, and more.
Second, better problem-setting ability leads to more rigorous competitive programming benchmarks. Since the official test data of top-tier platforms like Codeforces and AtCoder is not publicly available, researchers currently rely on synthetic datasets such as CodeContests+, TACO, and HardTests.
However, analysis shows that existing test datasets can exhibit both high false positive rates (FPR) and high false negative rates (FNR). For example, a greedy algorithm with poor time complexity may pass a series of small-scale random tests yet fail on adversarial cases constructed to expose its flaws. This weakness creates a distorted evaluation environment that rewards models for finding shortcuts.
Third, successfully posing novel challenges may pave the way for model self-improvement and AGI, while also supporting the verification needed to deploy models in complex software stacks.
So, can we train AI to pose high-quality, even previously unimagined problems, just as we train it to solve them? Recently, the LiveCodeBench Pro team gave a resounding answer: AutoCode, a systematic framework that uses LLMs in a closed-loop, multi-role system to automate the entire lifecycle of competitive programming problem creation and evaluation.
- Paper title: AutoCode: LLMs as Problem Setters for Competitive Programming
- Paper address: https://arxiv.org/abs/2510.12803v1
- Project page: https://livecodebenchpro.com/projects/autocode/overview
Notably, the team spans ten institutions and lists five co-first authors. The author list also includes well-known researchers such as Saining Xie.
Overall, this research makes two major contributions:
- An enhanced Validator-Generator-Checker framework, which achieves state-of-the-art reliability in test case generation.
- An innovative process for generating high-quality new problems. This process starts from a "seed problem" to inspire the LLM in a promising direction.
Test Case Generation
The team's test case generation process is a structured framework aiming to achieve the highest level of rigor and coverage.
As shown in Figure 1, the framework begins with the Validator, the cornerstone of the entire system: it ensures that any given input strictly adheres to all constraints specified in the problem statement. The Validator is crucial for minimizing the false negative rate (FNR), because it prevents correct programs from failing on malformed or out-of-constraint data.
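To make this concrete, here is a minimal validator sketch in Python for a hypothetical problem whose input is an integer n (1 ≤ n ≤ 2·10^5) followed by n integers in [1, 10^9]; the constraints and input format are illustrative assumptions, not taken from the paper.

```python
import sys

def validate(stream=sys.stdin):
    """Reject any input that violates the (hypothetical) problem constraints."""
    tokens = stream.read().split()
    assert len(tokens) >= 1, "missing n"
    n = int(tokens[0])
    assert 1 <= n <= 2 * 10**5, f"n out of range: {n}"
    assert len(tokens) == n + 1, "wrong number of values"
    for tok in tokens[1:]:
        value = int(tok)
        assert 1 <= value <= 10**9, f"value out of range: {value}"

if __name__ == "__main__":
    try:
        validate()
    except (AssertionError, ValueError) as err:
        print(f"INVALID: {err}", file=sys.stderr)
        sys.exit(1)
    print("OK", file=sys.stderr)
```

A correct program is never blamed for misbehaving on data this filter would have rejected, which is exactly how the Validator keeps the FNR down.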
Next, the Generator uses diverse strategies to create a wide range of inputs, aiming to reduce the false positive rate (FPR), i.e., the rate at which incorrect or inefficient programs are wrongly judged correct. Any invalid cases produced by the Generator are filtered out by the Validator, leaving a set of high-quality inputs.
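A generator for the same hypothetical problem might mix purely random cases with adversarial patterns (maximum size, all-equal values, sorted input) so that inefficient or shortcut solutions are exposed; the specific strategies below are assumptions for illustration.

```python
import random
import sys

MAX_N, MAX_A = 2 * 10**5, 10**9  # must match the validator's constraints

def gen_random(n):
    return [random.randint(1, MAX_A) for _ in range(n)]

def gen_all_equal(n):
    v = random.randint(1, MAX_A)
    return [v] * n

def gen_sorted_max(n):
    # a typical worst case for naive quadratic or greedy shortcut solutions
    return sorted(random.randint(1, MAX_A) for _ in range(n))

STRATEGIES = {"random": gen_random, "equal": gen_all_equal, "sorted": gen_sorted_max}

def main():
    strategy = sys.argv[1] if len(sys.argv) > 1 else "random"
    random.seed(int(sys.argv[2]) if len(sys.argv) > 2 else 0)
    n = random.randint(1, MAX_N) if strategy == "random" else MAX_N
    values = STRATEGIES[strategy](n)
    print(n)
    print(*values)

if __name__ == "__main__":
    main()
```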
Finally, to evaluate the contestants' outputs, the Checker compares them with the outputs of the reference solution.
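For problems with a unique answer, this comparison can be a token-by-token match of the contestant's output against the reference output (problems with multiple valid answers need problem-specific logic instead). A minimal sketch, assuming the common input / contestant-output / reference-output argument order:

```python
import sys

def tokens(path):
    with open(path) as f:
        return f.read().split()

def main():
    # argv: test input, contestant output, reference (jury) output
    _input_path, out_path, ans_path = sys.argv[1], sys.argv[2], sys.argv[3]
    if tokens(out_path) == tokens(ans_path):
        print("OK", file=sys.stderr)
        sys.exit(0)   # accepted
    print("WRONG ANSWER", file=sys.stderr)
    sys.exit(1)       # rejected

if __name__ == "__main__":
    main()
```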
For interactive tasks, the Interactor conducts multiple rounds of dialogue with the contestants' programs to give a final verdict.
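As an illustration of the idea, here is a toy interactor for a hypothetical "guess the hidden number in at most 20 interactions" task: it keeps the hidden state, answers each query over stdin/stdout, and issues the verdict when the dialogue ends. The protocol is invented purely for this sketch.

```python
import random
import sys

def main():
    # hidden state; a real interactor would read it from the test input file
    secret = random.randint(1, 10**5)
    for _ in range(20):                       # interaction budget
        line = sys.stdin.readline()
        if not line:
            break
        parts = line.split()
        if parts[0] == "?":                   # comparison query: "? x"
            x = int(parts[1])
            print("<" if secret < x else ">" if secret > x else "=", flush=True)
        elif parts[0] == "!":                 # final answer: "! x"
            ok = int(parts[1]) == secret
            print("AC" if ok else "WA", file=sys.stderr)
            sys.exit(0 if ok else 1)
    print("WA: interaction budget exceeded", file=sys.stderr)
    sys.exit(1)

if __name__ == "__main__":
    main()
```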
Since one of the team's stated goals is to provide high-quality verifiers for RLVR (reinforcement learning with verifiable rewards), they pay special attention to reducing the false positive rate (FPR). The team also distinguishes between test cases (input-answer pairs) and test data, where the latter additionally includes the Checker and Interactor programs required for evaluation.
Benchmarking: Robustness of Test Cases
To rigorously evaluate the team's test case generation framework, they established two different benchmarks.
The main benchmark contains 7,538 problems, derived from the intersection of well-known existing datasets: CodeContests+, CodeContests, HardTests, and TACO.
Notably, this large-scale collection contains no interactive problems, and because of the inherent filtering in these datasets, generating test data for it is on average slightly easier than for a typical Codeforces contest.
To address this and test the new system under more challenging real-world conditions, the team created a second benchmark of 720 recent, rated Codeforces problems. This collection is completely unfiltered, including interactive problems, which are notoriously hard to handle, and problems that require complex, structured test data. The team notes that previous methods could not be evaluated on this newer benchmark because their data generation codebases are not publicly available.
The team's evaluation is based on three key metrics (a small sketch computing them from paired verdicts follows the list):
- Consistency measures the overall percentage of agreement between the team's test verdicts and the official verdicts. The team further breaks down the inconsistent cases into two key error rates.
- False Positive Rate (FPR) is defined as the proportion of officially incorrect solutions that are incorrectly accepted by the team's generated tests.
- False Negative Rate (FNR) is the proportion of officially correct solutions that are incorrectly rejected by the team's tests.
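Here is how the three metrics can be computed from paired (official verdict, generated-test verdict) labels; the data is purely illustrative.

```python
def evaluate(pairs):
    """pairs: list of (official_accept, our_accept) booleans, one per submission."""
    consistency = sum(o == m for o, m in pairs) / len(pairs)
    officially_wrong = [m for o, m in pairs if not o]   # incorrect solutions
    officially_right = [m for o, m in pairs if o]       # correct solutions
    fpr = sum(officially_wrong) / max(len(officially_wrong), 1)            # wrongly accepted
    fnr = sum(not m for m in officially_right) / max(len(officially_right), 1)  # wrongly rejected
    return consistency, fpr, fnr

# toy example: 6 submissions judged by both the official and the generated tests
pairs = [(True, True), (True, False), (False, False),
         (False, True), (True, True), (False, False)]
consistency, fpr, fnr = evaluate(pairs)
print(f"consistency={consistency:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
# -> consistency=0.67  FPR=0.33  FNR=0.33
```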
Comparison with Other Benchmarks
The team compared AutoCode against four leading existing datasets on the 7,538-problem benchmark.
As shown in Table 1, the team's framework achieves 91.1% consistency with the official verdicts, a significant leap given that no previous method exceeded 81.0%. Crucially, AutoCode cuts the false positive rate (FPR) to just 3.7% and the false negative rate (FNR) to 14.1%, roughly a 50% reduction in both metrics compared to the prior state of the art.
Figure 2 shows the distribution of incorrect verdicts, indicating that for most problems the verdicts are consistent with the ground-truth verdicts.
To further test the system's robustness, the team also compiled the more challenging benchmark of 720 recent, unfiltered Codeforces problems, including complex interactive tasks.
As shown in Table 2, AutoCode maintained its excellent performance, achieving a 98.7% consistency. This result validates the effectiveness of the team's method on modern, difficult problems, on which previous methods could not be evaluated.
The team also verified the effectiveness of the method through ablation experiments.
After establishing such a powerful test case generation ability, the researchers turned their attention to a more creative task: directly generating new high - quality problems.
Problem Generation
The team's newly proposed problem generation framework is built on the aforementioned robust test generation framework (as shown in Figure 1) but introduces a key dual-verification protocol to ensure correctness without human intervention.
Each generated problem is scored by top human competitive programmers on a 6-level scale. The team consulted 8 expert human problem setters, who all said that when creating new problems, they often start from a specific existing problem. By adding, deleting, or modifying conditions of such a "seed problem", they can create new, usually harder problems that require novel insights.
Inspired by these insights, the team's method first randomly selects a Codeforces problem (with a difficulty rating below 2200) as the seed problem. The LLM is then tasked with generating a new problem by adding, deleting, or modifying conditions of the seed problem, while also providing an efficient reference solution (std.cpp) and a brute-force solution (brute.cpp).
brute.cpp usually has higher time complexity but is almost never wrong, so the team uses it to stress-test the validity of the problem. Using the enhanced test case generation pipeline, the team constructs a comprehensive set of test data that thoroughly covers small-scale cases, then runs both brute.cpp and std.cpp on it. A problem is accepted as correct only if, on every test case, the Checker verifies that the two programs' outputs agree (the brute-force solution is allowed to fail to finish within the time limit on large cases).
The ingenuity of this design is that the "slow but almost never wrong" brute-force solution provides an absolutely reliable ground truth, without human intervention, against which the "fast but potentially flawed" efficient solution is checked, thus automating correctness verification.
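A minimal sketch of this cross-checking loop, assuming the two solutions have already been compiled to ./std and ./brute, the generated inputs live in tests/*.in, and token-level equality stands in for the problem-specific Checker; all names, paths, and limits are assumptions.

```python
import glob
import subprocess

TIME_LIMIT = 10.0  # generous limit; brute.cpp may legitimately time out on large cases

def run(binary, input_path, timeout=TIME_LIMIT):
    with open(input_path) as f:
        return subprocess.run([binary], stdin=f, capture_output=True,
                              text=True, timeout=timeout)

def verify(test_glob="tests/*.in"):
    for input_path in sorted(glob.glob(test_glob)):
        std_out = run("./std", input_path).stdout      # reference solution must finish
        try:
            brute_out = run("./brute", input_path).stdout
        except subprocess.TimeoutExpired:
            continue  # the brute-force solution is allowed to time out on large cases
        if std_out.split() != brute_out.split():
            return False, input_path                   # disagreement: reject the problem
    return True, None

if __name__ == "__main__":
    ok, witness = verify()
    print("problem accepted" if ok else f"problem rejected, mismatch on {witness}")
```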
This dual-verification protocol (where brute.cpp serves as the initial ground truth, and the verified reference solution then goes through a complete test generation cycle) successfully filters out 27% of error-prone problems, increasing the correctness rate of the reference solutions provided by the LLM from 86% to 94%.
After screening, more than 80% of the problems are rated as high enough quality to serve as model training data, and 23% involve novel or creative designs. The detailed scoring criteria and score distribution are shown in Figure 3.
Next, the team summarized several key findings about the LLM's performance in problem generation.
- Finding 1: LLMs can generate solvable problems that they themselves cannot solve.
- Finding 2: LLMs tend to create new problems by combining existing problem frameworks and emphasizing knowledge and implementation. That is, LLMs are better at "knowledge recombination" than original innovation.
- Finding 3: The generated problems are usually harder than their seed problems, and quality is highest when the seed problem is of moderate difficulty.
- Finding 4: There is almost no correlation between human experts' and LLMs' judgments on problem quality and novelty.
- Finding 5: The difficulty and difficulty increase of the generated problems are better indicators of problem quality than the LLM's self-assessment.
In summary, these findings paint a clear picture of current LLMs on creative tasks: they are powerful "knowledge recombiners" rather than true "original thinkers".
Conclusion
In this work, the LiveCodeBench Pro team proposed AutoCode, a closed-loop, multi-role framework that automates the full lifecycle of competitive programming problem setting and evaluation.