
A startup from one of the "Transformer Eight": AI takes on a competition for NP-hard problems, and the agent ranks among the top 2% of participants

新智元 | 2025-06-17 16:36
Sakana AI's agent reached 21st place on an AtCoder competition leaderboard while tackling NP-hard problems, placing it in the top 2% of participants.

[Introduction] Programming agents are truly remarkable! Llion Jones, one of the authors of the Transformer paper, founded a startup that collected NP-hard problems specifically to test AI agents. The result was astonishing: the agent ranked 21st in a competition with over a thousand participants, meaning it wrote better code than the vast majority of human contestants.

Logistics route selection, staff scheduling, factory dispatching, power grid balancing, travel routes...

These real-world optimization tasks, seemingly ordinary in daily life, are actually extremely difficult.

The difficulty lies in the fact that once the problem scale expands, traditional algorithms can hardly calculate the optimal solution.

Usually, one can only rely on heuristic or approximation algorithms to approach the answer.

This is exactly the typical characteristic of NP-hard (non-deterministic polynomial-time hard) problems.
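To make the combinatorial blow-up concrete, here is a quick back-of-the-envelope illustration (not from the paper): for a travelling-salesman-style routing task, the number of possible visiting orders grows factorially with the number of stops, so exhaustive search stops being an option very quickly.

```python
import math

# Illustrative only: count the distinct tours of an n-stop symmetric routing problem.
# Fixing the start point and the direction of travel leaves (n - 1)! / 2 possibilities.
for n in (10, 15, 20):
    print(f"{n} stops -> {math.factorial(n - 1) // 2:,} distinct tours")
# 20 stops alone already yields 60,822,550,204,416,000 tours.
```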

Can AI handle such complex problems? How do programming agents perform?

To explore this question, Sakana AI collaborated with AtCoder to build ALE-Bench (ALgorithm Engineering Benchmark).

Sakana AI co-founder Llion Jones is one of the eight key figures behind the Transformer.

Unlike traditional programming benchmarks, ALE-Bench focuses on difficult NP-hard problems that require long-horizon reasoning and creative thinking.

Because these problems are NP-hard, no clear optimal solution is known, so scores can keep being improved.

The researchers believe ALE-Bench has the potential to become an important benchmark for the next generation of reasoning and programming abilities.

To tackle this class of problems, the study also designed an end-to-end agent, ALE-Agent.

Based on Gemini 2.5 Pro, it adopts two core strategies:

(1) Supplying domain knowledge of common algorithms and techniques through prompts;

(2) Generating diverse candidate solutions at inference time to boost performance.

In a real competition setting, ALE-Agent has demonstrated strong capabilities.

Figure 1: Overview of ALE-Bench. (Left) ALE-Bench integrates problems from past AtCoder heuristic contests, such as complex optimization problems with no known optimal solution like path planning and task scheduling, and ranks submitted programs by score. (Right) ALE-Bench supports comprehensive evaluation, from basic large language models (LLMs) to scaffolded agents with structured guidance: after receiving a task, the agent submits code and can optionally call test-running and visualization tools to iteratively refine its solution, much like a human contestant.

Take Figure 2 below as an example. The task description is as follows:

Write a program whose input is a large number of pickup-delivery pairs on a two-dimensional grid. The task is to select a specified number of requests and plan a route that starts from the warehouse and finally returns to it.

The route must satisfy the following constraint: for each selected request, its pickup point must be visited before its corresponding delivery point.

The goal of the program is to make the total length of this path as short as possible.

The score is based on the total length of the path. The shorter the route, the higher the score.

(The CPU time limit for each test case is 2 seconds.)

Figure 2: An example problem (ahc006) from ALE-Bench
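To give a feel for what a baseline solution to this kind of task looks like, here is a minimal greedy sketch. The warehouse position, the request format, and the selection count `k` are assumptions made for illustration, not the official ahc006 I/O format, and this is far from a competitive submission.

```python
import math

# Assumed layout: warehouse at (400, 400), each request given as a
# ((px, py), (dx, dy)) pickup/delivery coordinate pair, and k requests to serve.
WAREHOUSE = (400, 400)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def greedy_route(requests, k):
    # Select the k requests whose pickup points lie closest to the warehouse.
    chosen = sorted(requests, key=lambda r: dist(WAREHOUSE, r[0]))[:k]

    # Nearest-neighbor construction: repeatedly move to the closest reachable
    # point, where a delivery only becomes reachable after its pickup.
    to_pick = {i: r[0] for i, r in enumerate(chosen)}
    to_drop = {}
    route, pos = [WAREHOUSE], WAREHOUSE
    while to_pick or to_drop:
        cands = [(dist(pos, p), i, True) for i, p in to_pick.items()]
        cands += [(dist(pos, p), i, False) for i, p in to_drop.items()]
        _, i, is_pickup = min(cands)
        if is_pickup:
            pos = to_pick.pop(i)
            to_drop[i] = chosen[i][1]      # its delivery point is now allowed
        else:
            pos = to_drop.pop(i)
        route.append(pos)

    route.append(WAREHOUSE)               # return to the warehouse
    return route

def total_length(route):
    # The score is based on this total length: shorter routes score higher.
    return sum(dist(a, b) for a, b in zip(route, route[1:]))
```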

In May, the programming competition platform AtCoder held a heuristic competition (AtCoder Heuristic Competition, AHC), which attracted top developers from around the world.

The agent competed with 1,000 human contestants in real - time.

In the end, ALE-Agent performed excellently, ranking 21st and placing in the top 2%.

On the leaderboard of the 47th AtCoder Heuristic Competition (AHC047), the 21st-ranked contestant, 'fishylene', is actually the ALE-Agent submitted by Sakana AI.

This achievement marks a breakthrough for AI in solving real - world optimization problems.

Paper link: https://arxiv.org/abs/2506.09050

Dataset: https://huggingface.co/datasets/SakanaAI/ALE-Bench

Code: https://github.com/SakanaAI/ALE-Bench

NP-hard Problems: A New Benchmark for Programming Agents

ALE-Bench is built on the AtCoder Heuristic Competition (AHC).

Why is AHC worth paying attention to?

Because AHC is one of the best-known programming competition series hosted by AtCoder:

  • One of the largest score-based algorithm competitions today: contests are held about 10-18 times a year, and as of May 1, 2025, 49 official contests had taken place.
  • Many participants: each contest attracts roughly 1,000 contestants on average, and more than 6,000 users have taken part over the past two years.
  • Problems close to reality: the problem types are diverse, covering fields such as path planning, task scheduling, and multi-agent control.
  • It supports features such as long-term contests and visualization tools.

At the beginning of each competition, the organizer will release a newly designed problem.

The problem shown in Figure 2 is a typical path-planning problem. Most of these tasks are computationally demanding, with a running time limit of usually 2 to 10 seconds per test case.

AHC offers two competition formats: short-term contests (lasting about 4 hours) and long-term contests (lasting 1-2 weeks).

There are significant differences in problem design and challenge difficulty between the two.

Problems in short-term contests can sometimes be solved with standard algorithms such as simulated annealing and beam search.

Long-term contests, by contrast, place more value on in-depth analysis and repeated experimentation, and their solutions are often gradually 'polished' into shape.
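For reference, here is a minimal sketch of the kind of simulated-annealing loop such contests often rely on; `score` and `random_neighbor` are problem-specific placeholders, and the temperature schedule is purely illustrative.

```python
import math
import random

# Generic simulated annealing over an arbitrary solution object.
# `score(sol)` returns a value to maximize; `random_neighbor(sol)` returns a small mutation.
def simulated_annealing(initial, score, random_neighbor,
                        t_start=2000.0, t_end=10.0, iters=200_000):
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    for step in range(iters):
        # Exponentially cool the temperature from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / iters)
        cand = random_neighbor(current)
        cand_score = score(cand)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / t).
        if delta >= 0 or random.random() < math.exp(delta / t):
            current, current_score = cand, cand_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score
```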

Figure 3 shows the process of contestants' scores gradually increasing during the competition.

Figure 3: Score progression in a long-term AHC contest

For the two-week AHC014 contest, Figure 3 shows that the scores at specific ranks keep improving at every point in time.

The line colors in Figure 3 mark different performance levels, for example perf = 2800 (6th place) and perf = 1200 (379th place).

In either format, however, a high score requires reasoning about the specific problem and repeatedly refining the solution.

As the competition progresses, contestants can continuously submit optimized solutions to gradually improve their scores.

Figure 4: Distribution of ratings and average performance. As of May 1, 2025, the cumulative ratings and average performance distribution of users who have participated in at least 5 competitions (the background colors represent different rating levels)

A New Programming Benchmark: No Best Answer

To build ALE-Bench, the research team released a dataset of 40 AHC problems on Hugging Face, all drawn from official contests held before the end of April 2025.

Dataset: https://huggingface.co/datasets/SakanaAI/ALE-Bench/tree/main

This is the full version; a lite version with 10 representative problems is also provided for quick evaluation and testing.

The data package for each problem contains four major parts:

  1. Problem: the complete problem statement in Markdown format, with all relevant diagrams attached;
  2. Scorer: a scoring program written in Rust that evaluates submitted code on given test cases;
  3. Visualizer: a web-based visualization tool and a Rust program that dynamically display the execution of the code; the image in Figure 2 is an example;
  4. Leaderboard: reference data for computing and displaying the score rankings of models or contestants.
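As a hedged sketch of getting at these files, the dataset can be mirrored locally with the Hugging Face Hub client; the per-problem directory layout printed here is an assumption based on the description above, not documented structure.

```python
from pathlib import Path
from huggingface_hub import snapshot_download

# Download a local copy of the ALE-Bench dataset repository.
local_dir = snapshot_download(repo_id="SakanaAI/ALE-Bench", repo_type="dataset")

# Assumed layout: one directory per problem, each holding the Markdown statement
# plus scorer/visualizer/leaderboard assets. Inspect what is actually there.
for problem_dir in sorted(Path(local_dir).iterdir()):
    if problem_dir.is_dir():
        print(problem_dir.name, [p.name for p in problem_dir.iterdir()])
```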

ALE-Agent: An Agent for Algorithm Engineering Design

In algorithm engineering, how much development potential do agents still have?

As a first exploration of the research space opened up by ALE-Bench, the study built a special-purpose agent for algorithm engineering.

This field has some unique characteristics.

For many problem types, mature high-level strategies exist, and choosing the right overall approach is crucial.

However, even with the right overall idea, implementation details, hyperparameter settings, and fine-grained optimization can still significantly affect the final result.

Based on this, in the ALE-Agent prototype, the research team proposed and implemented two techniques:

Method 1: A prompting strategy combined with domain knowledge.

Expert knowledge of common algorithm-engineering techniques, such as simulated annealing and beam search, is embedded directly into the prompts. The prompt content covers the design of the search space and evaluation function, neighborhood generation, and common acceleration tricks.
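A sketch of what such a domain-knowledge prompt could look like is below; the wording and the `build_prompt` helper are hypothetical, not ALE-Agent's actual prompt.

```python
# Illustrative prompt template embedding algorithm-engineering know-how.
DOMAIN_KNOWLEDGE_PROMPT = """\
You are solving an AtCoder Heuristic Competition problem.

Consider the following standard techniques before writing code:
- Simulated annealing: define a search space and an evaluation function,
  design cheap neighborhood moves, and tune the temperature schedule.
- Beam search: keep the top-k partial solutions at each step and expand them.
- Acceleration: incremental score updates, precomputed distance tables,
  and staying well within the per-test-case CPU time limit.

Problem statement:
{problem_statement}

Return a single, complete program.
"""

def build_prompt(problem_statement: str) -> str:
    # Fill the template with the Markdown problem statement from the dataset.
    return DOMAIN_KNOWLEDGE_PROMPT.format(problem_statement=problem_statement)
```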

Method 2: Diversity-oriented search over the solution space.

The researchers adopted a best-first search procedure, using large language models (LLMs) to generate and refine candidate solutions.

To avoid prematurely discarding potentially promising solution paths, an expansion strategy similar to beam search is added, allowing each node to generate multiple child nodes at once.

This breadth-wise expansion helps retain high-potential hypotheses and, in practice, mitigates API latency by generating candidate solutions in parallel, which is particularly advantageous when using large reasoning models.

See Appendix B for details.
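A minimal sketch of this diversity-oriented best-first search is shown below, assuming placeholder `llm_refine` and `evaluate` callables; it illustrates the beam-like multi-child expansion rather than ALE-Agent's exact algorithm.

```python
import heapq
import itertools

# Best-first search over candidate programs, expanding several children per node.
# `llm_refine(code, n)` should return n refined variants; `evaluate(code)` should
# return a score from local test runs. Both are placeholders for this sketch.
def best_first_search(initial_code, llm_refine, evaluate,
                      children_per_node=4, budget=40):
    counter = itertools.count()                 # tie-breaker for the heap
    init_score = evaluate(initial_code)
    frontier = [(-init_score, next(counter), initial_code)]   # max-heap via negation
    best_score, best_code = init_score, initial_code

    expansions = 0
    while frontier and expansions < budget:
        _, _, code = heapq.heappop(frontier)
        expansions += 1
        # Ask the LLM for several diverse refinements of the same parent;
        # in practice these requests can be issued in parallel to hide latency.
        for child in llm_refine(code, n=children_per_node):
            s = evaluate(child)
            if s > best_score:
                best_score, best_code = s, child
            heapq.heappush(frontier, (-s, next(counter), child))

    return best_code, best_score
```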

The research team entered ALE-Agent in two live contests (AHC046 and AHC047), where it competed with more than 1,000 human contestants under identical rules.

The results are as follows:

AHC046: Ranked 154th (top 16%).

AHC047: Ranked 21st (top 2%), with particularly excellent performance.

Evaluation Results on ALE-Bench

The research team evaluated a wider range of combinatorial optimization problems on ALE - Bench.

In addition to ALE-Agent, other state-of-the-art AI models were also tested. These models iteratively improved their solutions through self-refinement within a 4-hour budget (see the figure above).

The AI models using standard approaches performed roughly on par with the top 50% of human contestants, while ALE-Agent reached the top 6.8%, a significant improvement.

Refer to the paper for the complete experimental settings and results.